Warning: Due to space limitations, this vignette demonstrates the functions of RCAS using static images. To see what an interactive RCAS report looks like, run RCAS::runReport().

For the most up-to-date functionality, usage and installation instructions, and example outputs, see our GitHub repository.

Introduction

RCAS is an automated system that provides dynamic genome annotations for custom input files that contain transcriptomic regions. Such transcriptomic regions could be, for instance, peak regions detected by CLIP-Seq analysis (which detects protein-RNA interactions), RNA modifications (also known as the epitranscriptome), CAGE-tag locations, or any other collection of target regions at the level of the transcriptome.

RCAS is designed as a reporting tool for the functional analysis of RNA-binding sites detected by high-throughput experiments. It takes as input a BED format file containing the genomic coordinates of the RNA binding sites and a GTF file that contains the genomic annotation features usually provided by publicly available databases such as Ensembl and UCSC. RCAS performs overlap operations between the genomic coordinates of the RNA binding sites and the genomic annotation features and produces in-depth annotation summaries such as the distribution of binding sites with respect to transcript features (exons, introns, 5’/3’ UTR regions, exon-intron boundaries, promoter regions, and whole transcripts). Moreover, by detecting the collection of targeted transcripts, RCAS can produce functional annotation tables for enriched gene sets (annotated by the Molecular Signatures Database) and GO terms. Because finding sequence motifs is one of the most important questions in protein-RNA interaction analysis, RCAS also has a module for detecting sequence motifs enriched in the targeted regions of the transcriptome. The final report of RCAS consists of high-quality dynamic figures and tables, which can be used directly in publications or for other academic purposes.

Data input

RCAS minimally requires as input a BED file and a GTF file. The BED file should contain coordinates/intervals of transcriptomic regions that are located via transcriptomics methods such as CLIP-Seq. The GTF file should provide the reference annotation. The recommended source of GTF files is the Ensembl database.

To demonstrate RCAS functionality in this vignette, we use sample BED and GTF data that are built into the RCAS package and can be loaded with the standard R function data(). To import custom BED and GTF files, the user should call the two RCAS functions importBed() and importGtf().

Importing sample data

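library(RCAS)
library(data.table) #provides as.data.table() used below
library(ggplot2)    #provides aes(), geom_bar(), etc. used in the plots
data(queryRegions)  #sample queryRegions in BED format
data(gff)           #sample GFF file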

Importing custom data

To use importBed() and importGtf(), the user should provide the file paths of the BED file and the GTF file, respectively. To reduce memory usage and run time, we advise setting sampleN = 10000, which randomly downsamples the input to at most 10,000 intervals.

queryRegions <- importBed(filePath = "<path to BED file>", sampleN = 10000)
gff <- importGtf(filePath = "<path to GTF file>")
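To make the effect of downsampling concrete, the following is a minimal base-R sketch on a toy interval table. It is not how RCAS implements sampleN internally; the helper name downsampleIntervals and the toy data are assumptions made purely for illustration.

```r
# Hypothetical helper (not part of RCAS): randomly keep at most n intervals
downsampleIntervals <- function(intervals, n) {
  if (n >= nrow(intervals)) {
    return(intervals)  # input is already small enough, keep everything
  }
  keep <- sort(sample(nrow(intervals), n))  # row indices, sampled without replacement
  intervals[keep, , drop = FALSE]
}

set.seed(1)  # make the random subset reproducible
toyBed <- data.frame(chrom = rep("chr1", 50),
                     start = seq(1, 4901, by = 100),
                     end   = seq(100, 5000, by = 100))
subsetBed <- downsampleIntervals(toyBed, n = 10)
nrow(subsetBed)
```

Because the subset is drawn at random, summary statistics computed on it approximate those of the full input while the overlap operations run on far fewer intervals.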

Summarizing Overlaps of Query Regions with Genomic Annotation Features

Querying the annotation file

overlaps <- as.data.table(queryGff(queryRegions = queryRegions, gffData = gff))

Finding targeted gene types

To find out the distribution of the query regions across gene types:

biotype_col <- grep('gene_biotype', colnames(overlaps), value = TRUE)
df <- overlaps[,length(unique(overlappingQuery)), by = biotype_col]
colnames(df) <- c("feature", "count")
df$percent <- round(df$count / length(queryRegions) * 100, 1)
df <- df[order(count, decreasing = TRUE)]
ggplot2::ggplot(df, aes(x = reorder(feature, -percent), y = percent)) + 
  geom_bar(stat = 'identity', aes(fill = feature)) + 
  geom_label(aes(y = percent + 0.5), label = df$count) + 
  labs(x = 'gene type', y = paste0('percent overlap (n = ', length(queryRegions), ')')) + 
  theme_bw(base_size = 14) + 
  theme(axis.text.x = element_text(angle = 90))
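The counting step above can be hard to follow without data at hand. The base-R sketch below reproduces the same logic (distinct queries per gene type, then percent of all queries) on a made-up overlap table; the toy values and the names toyOverlaps, nQueries and toyDf are assumptions for illustration only.

```r
# Toy overlap table: one row per (query, gene) overlap; all values are made up
toyOverlaps <- data.frame(
  overlappingQuery = c("q1", "q1", "q2", "q3", "q3", "q4"),
  gene_biotype     = c("protein_coding", "protein_coding", "lincRNA",
                       "protein_coding", "lincRNA", "snoRNA")
)
nQueries <- 5  # total number of query regions, including one without any overlap

# Count distinct queries per gene type (a query hitting two genes of the
# same type is counted once for that type)
counts <- tapply(toyOverlaps$overlappingQuery, toyOverlaps$gene_biotype,
                 function(q) length(unique(q)))
toyDf <- data.frame(feature = names(counts), count = as.integer(counts))
toyDf$percent <- round(toyDf$count / nQueries * 100, 1)
toyDf[order(toyDf$count, decreasing = TRUE), ]
```

Note that a single query can overlap genes of several types (q3 here), so the percentages need not sum to 100.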

Extending the annotation feature space

GTF files contain some annotation features (e.g. exons, transcripts) that are usually explicitly defined. However, some transcript features, such as introns, exon-intron boundaries, and promoter regions, are only implicitly defined. Such implicit features can be extracted from a GTF file using the makeTxDb family of functions from the GenomicFeatures library.

First we create a list of GRanges objects, where each list element contains all the available coordinates of transcript features such as transcripts, exons, introns, 5’/3’ UTRs, exon-intron boundaries, and promoter regions.

txdbFeatures <- getTxdbFeaturesFromGRanges(gff)

Plotting overlap counts between query regions and transcript features

To get a global overview of the distribution of query regions across gene features, we can use the summarizeQueryRegions function. If a given query region does not overlap with any of the given coordinates of the transcript features, it is categorized under NoFeatures.

summary <- summarizeQueryRegions(queryRegions = queryRegions, 
                                 txdbFeatures = txdbFeatures)

df <- data.frame(summary)
df$percent <- round((df$count / length(queryRegions)), 3) * 100
df$feature <- rownames(df)
ggplot2::ggplot(df, aes(x = reorder(feature, -percent), y = percent)) + 
  geom_bar(stat = 'identity', aes(fill = feature)) + 
  geom_label(aes(y = percent + 3), label = df$count) + 
  labs(x = 'transcript feature', y = paste0('percent overlap (n = ', length(queryRegions), ')')) + 
  theme_bw(base_size = 14) + 
  theme(axis.text.x = element_text(angle = 90))

Obtaining a table of overlap counts between query regions and genes

To find out which genes overlap with how many queries and to categorise the overlaps by transcript features, we use the getTargetedGenesTable function, which returns a data.frame object.

dt <- getTargetedGenesTable(queryRegions = queryRegions, 
                           txdbFeatures = txdbFeatures)
dt <- dt[order(transcripts, decreasing = TRUE)]

knitr::kable(dt[1:10,])

tx_name	transcripts	exons	introns	cds	threeUTRs
ENST00000317713	33	28	5	24	4
ENST00000361689	33	28	5	24	4
ENST00000372915	33	28	5	24	4
ENST00000539005	33	28	5	24	4
ENST00000545844	33	28	5	24	4
ENST00000564288	33	28	5	24	4
ENST00000567887	33	28	5	24	4
ENST00000372925	28	23	5	19	4
ENST00000289893	27	22	5	18	4
ENST00000367142	14	14	0	3	12

Profiling the coverage of query regions across transcript features

Coverage profile of query regions at feature boundaries

It may be useful to look at the distribution of query regions at the boundaries of transcript features. For instance, it may be important to see the relative signal at transcript ends (transcription start sites versus transcription end sites). Or, it may be important to see how the signal is distributed at exon boundaries, which may give an idea about the regulation of the transcript. Here we demonstrate how to get such signal distributions at transcription start/end sites. The same approach can be applied to any other collection of transcript features (exons, introns, promoters, UTRs etc.).

cvgF <- getFeatureBoundaryCoverage(queryRegions = queryRegions, 
                                   featureCoords = txdbFeatures$transcripts, 
                                   flankSize = 1000, 
                                   boundaryType = 'fiveprime', 
                                   sampleN = 10000)
cvgT <- getFeatureBoundaryCoverage(queryRegions = queryRegions, 
                                   featureCoords = txdbFeatures$transcripts, 
                                   flankSize = 1000, 
                                   boundaryType = 'threeprime', 
                                   sampleN = 10000)

cvgF$boundary <- 'fiveprime'
cvgT$boundary <- 'threeprime'

df <- rbind(cvgF, cvgT)

ggplot2::ggplot(df, aes(x = bases, y = meanCoverage)) + 
  geom_ribbon(fill = 'lightgreen', 
              aes(ymin = meanCoverage - standardError * 1.96, 
                  ymax = meanCoverage + standardError * 1.96)) + 
  geom_line(color = 'black') + 
  facet_grid( ~ boundary) + theme_bw(base_size = 14)
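The ribbon in the plot above spans the mean coverage plus or minus 1.96 standard errors, i.e. an approximate 95% interval under a normal approximation. The sketch below shows how such a band could be computed from a toy coverage matrix in base R. It assumes the standard error is the standard deviation across features divided by the square root of their number, which may differ from what getFeatureBoundaryCoverage does internally; the toy data and variable names are illustrative.

```r
# Toy coverage matrix: rows = sampled features, columns = positions near a boundary
set.seed(42)
toyCoverage <- matrix(rbinom(200 * 5, size = 1, prob = 0.3), nrow = 200, ncol = 5)

meanCoverage  <- colMeans(toyCoverage)
# Assumed definition: standard error of the mean across features
standardError <- apply(toyCoverage, 2, sd) / sqrt(nrow(toyCoverage))

# 1.96 approximates qnorm(0.975), the two-sided 95% normal quantile
ribbon <- data.frame(bases = seq_len(ncol(toyCoverage)),
                     meanCoverage = meanCoverage,
                     ymin = meanCoverage - 1.96 * standardError,
                     ymax = meanCoverage + 1.96 * standardError)
ribbon
```

Narrow bands indicate positions where the mean coverage is estimated precisely; increasing sampleN tightens the band at the cost of run time.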

Coverage profile of query regions for all transcript features

Coverage profiles can be obtained for a single type of transcript feature or for a list of transcript features. Here we demonstrate how to get the coverage profile of query regions across all available transcript features. We recommend using the sampleN parameter to randomly downsample the target regions, which speeds up the calculations.

cvgList <- calculateCoverageProfileList(queryRegions = queryRegions, 
                                       targetRegionsList = txdbFeatures, 
                                       sampleN = 10000)

ggplot2::ggplot(cvgList, aes(x = bins, y = meanCoverage)) + 
  geom_ribbon(fill = 'lightgreen', 
              aes(ymin = meanCoverage - standardError * 1.96, 
                  ymax = meanCoverage + standardError * 1.96)) + 
  geom_line(color = 'black') + theme_bw(base_size = 14) +
  facet_wrap( ~ feature, ncol = 3)

Motif Analysis using motifRG

Calculating enriched motifs

RCAS can also perform motif analysis. It uses the motifRG package to find motifs enriched among the query regions.

motifResults <- runMotifRG(queryRegions = queryRegions, 
                           resizeN = 15, sampleN = 10000,
                           genomeVersion = 'hg19', 
                           motifN = 2, nCores = 2)

## GAAGGA 1.769472e-06 
## Skip pattern  ATTTTT 
##  Refine  GAAGGA 11.70585 : 11.78265 12.53327 11.24818 11.6384 10.93289 TRUE 478 162 464 157 
## New motif:  GAAGGA 
## match range  637 
## [1] "Rescore"
## [1] "Finished Rescore"
## TGGAGA 3.713046e-06 
## Skip pattern  TTTTTA 
##  Refine  TGGAGA 12.97348 : 11.80649 12.57438 13.59972 12.5562 13.74099 TRUE 551 173 530 165 
## New motif:  TGGAGA

par(mfrow = c(1,2), mar = c(2,2,2,2))
for (i in seq_along(motifResults$motifs)) {
  motifPattern <- motifResults$motifs[[i]]@pattern
  motifRG::plotMotif(match = motifResults$motifs[[i]]@match$pattern, 
                     main = paste0('Motif-',i,': ',motifPattern),
                     entropy = TRUE)
}

Motif analysis: getting motif summary statistics

A summary table of the motif analysis results can be obtained as follows:

summary <- getMotifSummaryTable(motifResults)
knitr::kable(summary)

patterns	scores	fgHits	bgHits	fgSeq	bgSeq	ratio	fgFrac	bgFrac
GAAGGA	11.7	478	162	464	157	3.0	0.0464	0.0157
TGGAGA	13.1	558	175	536	167	3.2	0.0536	0.0167
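As a reading aid for the table above, the sketch below recomputes the fgFrac, bgFrac and ratio columns from the fgSeq and bgSeq counts. The relationships used here are inferred from the numbers in the table, not taken from the RCAS or motifRG documentation, and the denominator of 10000 sequences per set is an assumption based on the sampleN value used in the motif search.

```r
# Sequence counts copied from the summary table above
motifTable <- data.frame(patterns = c("GAAGGA", "TGGAGA"),
                         fgSeq = c(464, 536),
                         bgSeq = c(157, 167))

nSeq <- 10000  # assumed number of sequences in each set (sampleN above)

# Fraction of foreground/background sequences that contain the motif
motifTable$fgFrac <- motifTable$fgSeq / nSeq
motifTable$bgFrac <- motifTable$bgSeq / nSeq
# Enrichment of the motif in the foreground relative to the background
motifTable$ratio  <- round(motifTable$fgFrac / motifTable$bgFrac, 1)
motifTable
```

Under these assumptions the recomputed values agree with the fgFrac, bgFrac and ratio columns reported in the table.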

GO term analysis

Biological processes enriched among targeted genes

RCAS can perform GO term enrichment analysis to find functions enriched among the genes that overlap the query regions. Below we demonstrate how to obtain the biological process (‘BP’) terms enriched in these genes and show the top 10 GO terms ranked by fold enrichment relative to the background.

#get all genes from the GTF data
backgroundGenes <- unique(gff$gene_id)
#get genes that overlap query regions
targetedGenes <- unique(overlaps$gene_id)

#run TopGO
goBP <- runTopGO(ontology = 'BP', 
                      species = 'human', 
                      backgroundGenes = backgroundGenes, 
                      targetedGenes = targetedGenes)

goBP <- goBP[order(goBP$foldEnrichment, decreasing = TRUE),]
rownames(goBP) <- goBP$GO.ID
goBP <- subset(goBP, select = -c(Annotated,classicFisher, bh, GO.ID))

knitr::kable(goBP[1:10,])

	Term	Significant	Expected	bonferroni	foldEnrichment
GO:0006403	RNA localization	15	4.47	0.0011750	3.36
GO:0006402	mRNA catabolic process	27	8.73	0.0000007	3.09
GO:0043488	regulation of mRNA stability	15	4.90	0.0069325	3.06
GO:0015931	nucleobase-containing compound transport	16	5.32	0.0048175	3.01
GO:0006401	RNA catabolic process	28	9.37	0.0000011	2.99
GO:0043487	regulation of RNA stability	15	5.11	0.0152750	2.94
GO:0061013	regulation of mRNA catabolic process	15	5.32	0.0305500	2.82
GO:1903311	regulation of mRNA metabolic process	22	8.09	0.0008342	2.72
GO:0006913	nucleocytoplasmic transport	17	6.60	0.0434750	2.58
GO:0051169	nuclear transport	17	6.60	0.0434750	2.58
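The foldEnrichment column in the table above can be read as the observed number of targeted genes annotated with a term (Significant) divided by the number expected by chance (Expected). This interpretation is inferred from the table values rather than from the runTopGO documentation; the short sketch below checks it on the first three rows.

```r
# First three rows of the table above (values copied verbatim)
goRows <- data.frame(
  GO.ID       = c("GO:0006403", "GO:0006402", "GO:0043488"),
  Significant = c(15, 27, 15),
  Expected    = c(4.47, 8.73, 4.90)
)

# Observed over expected, rounded to two decimals as in the table
goRows$foldEnrichment <- round(goRows$Significant / goRows$Expected, 2)
goRows
```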

Gene Set Enrichment Analysis

MSIGDB gene sets enriched among targeted genes

RCAS can use gene sets from the Molecular Signatures Database (MSIGDB) to perform gene set enrichment analysis (GSEA), which finds the gene sets that are enriched among the genes targeted by the query regions.

Below we demonstrate a GSEA case using randomly generated gene sets (in order not to breach MSIGDB licence agreement) that are provided as built-in data in RCAS. The actual MSIGDB gene set annotations must be downloaded by the user from the MSIGDB website. RCAS provides functions to parse the annotations (RCAS::parseMsigdb) and map them to other species via orthology (RCAS::createOrthologousGeneSetList) to enable GSEA on other species such as mouse and fly.

#geneSets <- parseMsigdb(< path to msigdbFile>)
data(geneSets)
resultsGSEA <- runGSEA(geneSetList = geneSets,
                       backgroundGenes = backgroundGenes, 
                       targetedGenes = targetedGenes)

knitr::kable(x = resultsGSEA[1:10,])

	treatment	treatmentSize	expectedInTreatment	fisherPVal	BH	bonferroni	foldEnrichment
randomGeneSet52	9	411	3.5	0.0201899	0.628554	1	2.57
randomGeneSet16	8	411	3.3	0.0370182	0.628554	1	2.42
randomGeneSet87	10	411	4.3	0.0251697	0.628554	1	2.33
randomGeneSet99	7	411	3.0	0.0550203	0.628554	1	2.33
randomGeneSet42	9	411	4.0	0.0368588	0.628554	1	2.25
randomGeneSet8	7	411	3.2	0.0664294	0.628554	1	2.19
randomGeneSet53	7	411	3.2	0.0664294	0.628554	1	2.19
randomGeneSet11	10	411	4.7	0.0360243	0.628554	1	2.13
randomGeneSet95	9	411	4.3	0.0521195	0.628554	1	2.09
randomGeneSet13	4	411	2.0	0.1851949	0.628554	1	2.00
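The fisherPVal, BH and bonferroni columns suggest a per-gene-set Fisher's exact test followed by multiple-testing correction. The following base-R sketch illustrates that general technique on made-up data. It is not the runGSEA implementation: the helper testSet, the toy gene names and the choice of a one-sided test are assumptions, and the column names are borrowed from the table above only for readability.

```r
# Toy universe: 1000 background genes, 100 of them targeted (all made up)
set.seed(7)
backgroundToy <- paste0("gene", 1:1000)
targetedToy   <- sample(backgroundToy, 100)

# One set deliberately enriched for targeted genes, one drawn at random
toySets <- list(
  enrichedSet = c(targetedToy[1:20], setdiff(backgroundToy, targetedToy)[1:30]),
  randomSet   = sample(backgroundToy, 50)
)

# Hypothetical helper: 2x2 table (in set x targeted) and one-sided Fisher test
testSet <- function(geneSet, targeted, background) {
  inSet    <- background %in% geneSet
  isTarget <- background %in% targeted
  tab <- table(factor(inSet, levels = c(TRUE, FALSE)),
               factor(isTarget, levels = c(TRUE, FALSE)))
  expected <- sum(inSet) * sum(isTarget) / length(background)
  data.frame(treatment = sum(inSet & isTarget),
             expectedInTreatment = expected,
             fisherPVal = fisher.test(tab, alternative = "greater")$p.value,
             foldEnrichment = round(sum(inSet & isTarget) / expected, 2))
}

toyResults <- do.call(rbind, lapply(toySets, testSet,
                                    targeted = targetedToy,
                                    background = backgroundToy))
# Correct for testing several gene sets at once
toyResults$BH         <- p.adjust(toyResults$fisherPVal, method = "BH")
toyResults$bonferroni <- p.adjust(toyResults$fisherPVal, method = "bonferroni")
toyResults
```

With randomly generated gene sets, as in the built-in example data, the corrected p-values are expected to stay close to 1, which is consistent with the BH and bonferroni columns in the table above.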

RCAS also provides functions to map the MSIGDB annotations from human to fly and mouse.

#parse human annotations
refGeneSets <- parseMsigdb(filePath = "<path to MSIGDB annotation file>")

#Map the gene sets to other species using orthologous relationships of genes between
#the reference genome (human) and the target genome (e.g. mouse)
orthGeneSets <- createOrthologousGeneSetList(referenceGeneSetList = refGeneSets, 
                                                refGenomeVersion = 'hg19', 
                                                targetGenomeVersion = 'mm9')
#the mapped gene sets can be used for GSEA analysis using the runGSEA command.

Generating a full report

Users can use the runReport() function to generate full custom reports that include all the analysis modules described above. The analysis report has four main parts:

Annotation summaries via overlap operations
GO term analysis
MSIGDB analysis
Motif analysis

By default, runReport() runs all four modules; the user can turn off individual modules.

Below are example commands to generate reports using these functionalities.

A test run for human

runReport()

A custom run for human

runReport( queryFilePath = 'input.BED',
            gffFilePath = 'annotation.gtf',
            msigdbFilePath = 'human_msigdb.gmt')

To turn off certain modules of the report

runReport( queryFilePath = 'input.BED',
            gffFilePath = 'annotation.gtf',
            msigdbFilePath = 'human_msigdb.gmt',
            motifAnalysis = FALSE,
            goAnalysis = FALSE )

To run the pipeline for species other than human

If the MSIGDB module is needed, msigdbFilePath must point to the MSIGDB annotations for human. Gene sets for other species are derived in the background via orthology using the createOrthologousGeneSetList function.

runReport( queryFilePath = 'input.mm9.BED',
            gffFilePath = 'annotation.mm9.gtf',
            msigdbFilePath = 'human_msigdb.gmt',
            genomeVersion = 'mm9' )

To turn off verbose output and progress bars

runReport(quiet = TRUE)

Printing raw data generated by the runReport function

One may want to print the raw data used to make the plots and tables in the HTML report produced by the runReport function. Such tables can be used for meta-analysis of multiple analysis results. To enable this, the printProcessedTables argument must be set to TRUE.

runReport(printProcessedTables = TRUE)

The RNA Centric Analysis System Report

Bora Uyar, Dilmurat Yusuf, Ricardo Wurmus, Altuna Akalin

2019-10-29