library(RCAS)
RCAS is an automated system that provides dynamic genome annotations for custom input files that contain transcriptomic regions. Such transcriptomic regions could be, for instance, peak regions detected by CLIP-Seq analysis that detect protein-RNA interactions, RNA modifications (alias the epitranscriptome), CAGE-tag locations, or any other collection of target regions at the level of the transcriptome.
RCAS is designed as a reporting tool for the functional analysis of RNA-binding sites detected by high-throughput experiments. It takes as input a BED format file containing the genomic coordinates of the RNA binding sites and a GTF file that contains the genomic annotation features usually provided by publicly available databases such as Ensembl and UCSC. RCAS performs overlap operations between the genomic coordinates of the RNA binding sites and the genomic annotation features and produces in-depth annotation summaries such as the distribution of binding sites with respect to gene features (exons, introns, 5’/3’ UTR regions, exon-intron boundaries, promoter regions, and whole transcripts). Moreover, by detecting the collection of targeted transcripts, RCAS can carry out functional annotation tables for enriched gene sets (annotated by the Molecular Signatures Database) and GO terms. As one of the most important questions that arise during protein-RNA interaction analysis; RCAS has a module for detecting sequence motifs enriched in the targeted regions of the transcriptome. The final report of RCAS consists of high-quality dynamic figures and tables, which are readily applicable for publications or other academic usage.
RCAS minimally requires as input a BED file and a GTF file. The BED file should contain coordinates/intervals of transcriptomic regions which are located via transcriptomics methods such as Clip-Seq. The GTF file should provide reference annotation. The recommended source of GTF files is the ENSEMBLE database.
For this vignette, in order to demonstrate RCAS functionality, we use sample BED and GTF data that are built-in the RCAS library, which can be imported using a common R function: data(). To import custom BED and GTF files, the user should execute two RCAS functions called importBed() and importGtf().
library(RCAS)
data(queryRegions) #sample queryRegions in BED format()
data(gff) #sample GFF file
To use importBed() and importGtf(), the user should provide file paths to the respective BED file and GTF file. To reduce memory usage and time consumption, we advise the user to set sampleN=10000
to avoid huge input of intervals.
Before running following the code chunks in your R interpreter, please remove #
comment sign.
#library(RCAS)
#queryRegions <- importBed(filePath = <path to BED file>, sampleN = 10000)
#gff <- importGtf(filePath = <path to GTF file>)
overlaps <- queryGff(queryRegions = queryRegions, gffData = gff)
#data.table is used to do quick summary operations
overlaps.dt <- data.table(as.data.frame(overlaps))
To find out the distribution of the query regions across gene types:
biotype_col <- grep('gene_biotype', colnames(overlaps.dt), value = T)
df <- overlaps.dt[,length(unique(overlappingQuery)), by = biotype_col]
colnames(df) <- c("feature", "count")
df$percent <- round(df$count / length(queryRegions) * 100, 1)
df <- df[order(count, decreasing = TRUE)]
p <- plot_ly(data = df,
type = "bar",
x = df$feature,
y = df$percent,
text = paste("count:", df$count), color=df$feature)
layout(p = p,
margin = list(l=100, r=100, b=150),
xaxis = list(showticklabels = TRUE, tickangle = 90),
yaxis = list(title = paste("percentage of query regions,",
"n =", length(queryRegions))))
GTF files contain some annotation features (e.g. exons, transcripts) that are usually explicitly defined, however, some gene features such as introns, exon-intron boundaries, promoter regions are only implicitly defined. Such implicit features can be extracted from a GTF file using makeTxDb family of functions from the GenomicFeatures library.
First we create a GRangesList object, where each list element contains all the available coordinates of gene features such as transcripts, exons, introns, 5’/3’ UTRs, exon-intron boundaries, and promoter regions.
txdbFeatures <- getTxdbFeaturesFromGRanges(gff)
To have a global overview of the distribution of query regions across gene features, we can use the summarizeQueryRegions function.
summary <- summarizeQueryRegions(queryRegions = queryRegions,
txdbFeatures = txdbFeatures)
df <- data.frame(summary)
df$percent <- round((df$count / length(queryRegions)), 3) * 100
p <- plot_ly( data = df,
x = rownames(df),
y = df$percent,
type = 'bar',
text = paste("count:", df$count),
color = rownames(df)
)
layout(p = p,
xaxis = list(title = 'features'),
yaxis = list(title = paste("percentage of query regions,",
"n =", length(queryRegions)
)
),
margin = list(b = 150, r = 50)
)
To find out which genes overlap with how many queries and categorise overlaps by gene features; we use getTargetedGenesTable function, which returns a data.frame object. Then we use datatable function from ‘DT’ library to print an interactive table of genes that overlap query Regions.
dt <- getTargetedGenesTable(queryRegions = queryRegions,
txdbFeatures = txdbFeatures)
dt <- dt[order(transcripts, decreasing = TRUE)]
DT::datatable(dt[1:100],
extensions = c('Buttons', 'FixedColumns'),
options = list(fixedColumns = TRUE,
scrollX = TRUE,
dom = 'Bfrtip',
buttons = c('copy', 'print', 'csv','excel', 'pdf')),
filter = 'bottom'
)
Coverage profiles can be obtained for a single type of gene feature or a list of gene features. Here we demonstrate how to get coverage profile of query regions across 3’UTRs. It might be a good idea to use sampleN parameter to randomly downsample the target regions to speed up the calculations.
cov <- calculateCoverageProfile(queryRegions = queryRegions,
targetRegions = txdbFeatures$threeUTRs,
sampleN = 10000)
plot_ly(data = cov, x = ~bins, y = ~coverage,
type = 'scatter', mode = 'lines')
Coverage profiles can be obtained for a single type of gene feature or a list of gene features. Here we demonstrate how to get coverage profile of query regions across all available gene features. It might be a good idea to use sampleN parameter to randomly downsample the target regions to speed up the calculations.
covList <- calculateCoverageProfileList(queryRegions = queryRegions,
targetRegionsList = txdbFeatures,
sampleN = 10000)
df <- do.call('cbind', covList)
df <- df[,!grepl(colnames(df), pattern = '*.bins')]
df$bins <- c(1:100)
colnames(df) <- gsub(pattern = ".coverage", replacement = "", x = colnames(df))
mdf <- reshape2::melt(df, id.vars = c('bins'))
colnames(mdf) <- c('bins', 'feature', 'coverage')
p = plot_ly(data = mdf, x = ~bins, y = ~coverage,
type = 'scatter', mode = 'lines', color = ~feature)
layout(p)
Another way of profiling the signal at different gene features is to look into the depth of coverage of the query regions at the boundaries such as the transcription start/end sites,
exon-intron boundaries, and so on. Below, we demonstrate how to get the signal distribution at the flanking regions of the transcription start/end sites. The signal can be either binned by a window of certain number of base-pairs (using getFeatureBoundaryCoverageBin function) or the signal can be computed for each base-pair around the boundary region including a certain length of flanking bases (using the getFeatureBoundaryCoverage function). Plotting the coverage profiles of TSS and TES regions side-by-side, we can quickly observe that for this particular dataset, the signal concentrates around the 3’ end of the transcript.
cvg <- getFeatureBoundaryCoverage(queryRegions = queryRegions,
featureCoords = txdbFeatures$transcripts,
flankSize = 1000,
sampleN = 10000)
yLimit <- (as.integer(max(c(cvg$fivePrime, cvg$threePrime))/10)+1)*10
p <- subplot(
plot_ly(data = cvg, x = ~bases, y = ~fivePrime, type = 'scatter', mode = 'lines'),
plot_ly(data = cvg, x = ~bases, y = ~threePrime, type = 'scatter', mode = 'lines'),
margin = 0.05
) %>% layout (xaxis = list(title = 'Distance (bp) to TSS'),
xaxis2 = list(title = 'Distance (bp) to TES'),
yaxis = list(title = 'coverage', range = c(0, yLimit)),
yaxis2 = list(title = 'coverage', range = c(0, yLimit)),
showlegend = FALSE)
layout(p)
With the RCAS package, a motif analysis is also possible. RCAS uses motifRG library to find enriched motifs among the query regions.
motifResults <- runMotifRG(queryRegions = queryRegions,
genomeVersion = 'hg19',
motifN = 2, nCores = 2)
par(mfrow = c(1,2), mar = c(2,2,2,2))
for (i in 1:length(motifResults$motifs)) {
motifPattern <- motifResults$motifs[[i]]@pattern
motifRG::plotMotif(match = motifResults$motifs[[i]]@match$pattern,
main = paste0('Motif-',i,': ',motifPattern),
entropy = TRUE)
}
A summary table from the motif analysis results can be obtained
summary <- getMotifSummaryTable(motifResults)
DT::datatable(summary,
extensions = c('Buttons', 'FixedColumns'),
options = list(fixedColumns = TRUE,
scrollX = TRUE,
dom = 'Bfrtip',
buttons = c('copy', 'print', 'csv','excel', 'pdf')),
filter = 'bottom'
)
RCAS can perform GO term enrichment analysis to find out enriched functions in genes that overlap the query regions. Below is demonstrated how to get biological processes terms (‘BP’) enriched in the genes that overlap query regions and the top 10 GO terms with most fold change increase relative to the background are provided.
#get all genes from the GTF data
backgroundGenes <- unique(gff$gene_id)
#get genes that overlap query regions
targetedGenes <- unique(overlaps$gene_id)
#run TopGO
goBP <- runTopGO(ontology = 'BP',
species = 'human',
backgroundGenes = backgroundGenes,
targetedGenes = targetedGenes)
goBP <- goBP[order(goBP$foldEnrichment, decreasing = TRUE),]
rownames(goBP) <- goBP$GO.ID
goBP <- subset(goBP, select = -c(Annotated,classicFisher, bh, GO.ID))
DT::datatable(goBP[1:10,],
extensions = c('Buttons', 'FixedColumns'),
options = list(fixedColumns = TRUE,
scrollX = TRUE,
dom = 'Bfrtip',
buttons = c('copy', 'print', 'csv', 'excel', 'pdf')),
filter = 'bottom'
)
RCAS can use gene sets from Molecular Signatures Database and calculate gene set enrichment analysis (GSEA) to find out which gene sets are enriched among the genes targeted by the query regions.
Below we demonstrate a GSEA case using randomly generated gene sets (in order not to breach MSIGDB licence agreement) that are provided as built-in data in RCAS. The actual MSIGDB gene set annotations must be downloaded by the user from the MSIGDB website. RCAS provides functions to parse the annotations and map them to species other to enable GSEA on other species such as mouse and fly.
#geneSets <- parseMsigdb(< path to msigdbFile>)
data(geneSets)
resultsGSEA <- runGSEA(geneSetList = geneSets,
backgroundGenes = backgroundGenes,
targetedGenes = targetedGenes)
datatable(resultsGSEA[1:10,],
extensions = 'FixedColumns',
options = list(
dom = 't',
scrollX = TRUE,
scrollCollapse = TRUE
))
RCAS also provides functions to map the MSIGDB annotations from human to fly and mouse.
#parse human annotations
refGeneSets <- parseMsigdb(filePath = <path to MSIGDB annotation file>)
#Map the gene sets to other species using orthologous relationships of genes between
#the reference genome (human) and the target genome (e.g. mouse)
orthGeneSets <- createOrthologousGeneSetList(referenceGeneSetList = refGeneSets,
refGenomeVersion = 'hg19',
targetGenomeVersion = 'mm9')
#the mapped gene sets can be used for GSEA analysis using the runGSEA command.
The users can use the runReport() function to generate full custom reports including all the analysis modules described above. There are four main parts of the analysis report.
By default, runReport() function aims to run all four modules, while the user can turn off these individual modules.
Below are example commands to generate reports using these functionalities.
runReport()
runReport( queryFilePath = 'input.BED',
gffFilePath = 'annotation.gtf',
msigdbFilePath = 'human_msigdb.gmt')
runReport( queryFilePath = 'input.BED',
gffFilePath = 'annotation.gtf',
msigdbFilePath = 'human_msigdb.gmt',
motifAnalysis = FALSE,
goAnalysis = FALSE )
If the msigdb module is needed, the msigdbFilePath must be set to the MSIGDB annotations for ‘human’. MSIGDB datasets for other species will be calculated in the background using the createOrthologousMsigdbDataset function
runReport( queryFilePath = 'input.mm9.BED',
gffFilePath = 'annotation.mm9.gtf',
msigdbFilePath = 'human_msigdb.gmt',
genomeVersion = 'mm9' )
runReport(quiet = TRUE)
RCAS is developed by Dr. Altuna Akalin (head of the Scientific Bioinformatics Platform), Dr. Bora Uyar (Bioinformatics Scientist), Dr. Dilmurat Yusuf (Bioinformatics Scientist) and Ricardo Wurmus (System Administrator) at the Berlin Institute of Medical Systems Biology (BIMSB) at the Max-Delbrueck-Center for Molecular Medicine (MDC) in Berlin.
RCAS is developed as a bioinformatics service as part of the RNA Bioinformatics Center, which is one of the eight centers of the German Network for Bioinformatics Infrastructure (de.NBI).