Exon-Intron Split Analysis has been described in (Gaidatzis et al. 2015). It consists of separately quantifying exonic and intronic alignments in RNA-seq data, in order to measure changes in mature RNA and pre-mRNA reads across different experimental conditions. We have shown that this allows quantification of transcriptional and post-transcriptional regulation of gene expression.
eisaR package contains convenience functions to facilitate the steps in an
exon-intron split analysis, which consists of:
1. preparing the annotation (exonic and gene body coordinate ranges)
2. quantifying RNA-seq alignments in exons and introns
3. calculating and comparing exonic and intronic changes across conditions
4. visualizing the results
For the steps 1. and 2. above, this vignette makes use of Bioconductor annotation and the QuasR package. It is also possible to obtain count tables for exons and introns using some other pipeline or approach, and directly start with step 3.
To install the
eisaR package, start R and enter:
# BiocManager is needed to install Bioconductor packages if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") # Install eisaR BiocManager::install("eisaR")
eisaR uses gene annotations from Bioconductor.
They are provided in the form of
EnsDb objects, e.g. via packages such as TxDb.Mmusculus.UCSC.mm10.knownGene or EnsDb.Hsapiens.v86.
You can see available annotations using the following code:
pkgs <- c(BiocManager::available("TxDb") BiocManager::available("EnsDb"))
If you would like to use an alternative source of gene annotations, you might
still be able to use
eisaR by first converting your annotations into a
EnsDb (for creating a
makeTxDb in the GenomicFeatures
package, for creating an
makeEnsembldbPackage in the ensembldb
For this example,
eisaR contains a small
TxDb to illustrate how regions are extracted.
We will load it from a file. Alternatively, the object would be loaded using
for example using
# load package library(eisaR) # get TxDb object txdbFile <- system.file("extdata", "hg19sub.sqlite", package = "eisaR") txdb <- AnnotationDbi::loadDb(txdbFile)
Exon and gene body regions are then extracted from the
# extract filtered exonic and gene body regions regS <- getRegionsFromTxDb(txdb = txdb, strandedData = TRUE) #> extracting exon coordinates #> total number of genes/exons: 12/32 #> removing overlapping/single-exon/ambiguous genes (8) #> creating filtered regions for 4 genes (33.3%) with 20 exons (62.5%) regU <- getRegionsFromTxDb(txdb = txdb, strandedData = FALSE) #> extracting exon coordinates #> total number of genes/exons: 12/32 #> removing overlapping/single-exon/ambiguous genes (9) #> creating filtered regions for 3 genes (25%) with 17 exons (53.1%) lengths(regS) #> exons genebodies #> 20 4 lengths(regU) #> exons genebodies #> 17 3 regS$exons #> GRanges object with 20 ranges and 0 metadata columns: #> seqnames ranges strand #> <Rle> <IRanges> <Rle> #> ENSG00000078808 chr1 17278-18194 - #> ENSG00000078808 chr1 18828-21741 - #> ENSG00000078808 chr1 23614-23747 - #> ENSG00000078808 chr1 24202-24358 - #> ENSG00000078808 chr1 27799-27854 - #> ... ... ... ... #> ENSG00000186891 chr1 5740-6070 - #> ENSG00000186891 chr1 6755-7081 - #> ENSG00000254999 chr3 2266-2513 + #> ENSG00000254999 chr3 12300-12402 + #> ENSG00000254999 chr3 12943-13884 + #> ------- #> seqinfo: 3 sequences from an unspecified genome
As you can see, the filtering procedure removes slightly more genes for unstranded data
strandedData = FALSE), as overlapping genes cannot be discriminated even if
they reside on opposite strands.
You can also export the obtained regions into files. This may be useful if
you plan to align and/or quantify reads outside of R. For example, you can use
rtracklayer to export the regions in
library(rtracklayer) export(regS$exons, "hg19sub_exons_stranded.gtf") export(regS$genebodies, "hg19sub_genebodies_stranded.gtf")
For this example we will use the QuasR package for indexing and alignment of short reads, and a small RNA-seq dataset that is contained in that package. As mentioned, it is also possible to align or also quantify your reads using an alternative aligner/counter, and skip over these steps.
Let’s first copy the sample data from the QuasR package to the
current working directory, all contained in a folder named
library(QuasR) #> Loading required package: Rbowtie file.copy(system.file(package = "QuasR", "extdata"), ".", recursive = TRUE) #>  TRUE
We next align the reads to a mini-genome (fasta file
sampleFile <- "extdata/samples_chip_single.txt" genomeFile <- "extdata/hg19sub.fa" proj <- qAlign(sampleFile = "extdata/samples_rna_single.txt", genome = "extdata/hg19sub.fa", aligner = "Rhisat2", splicedAlignment = TRUE) #> Creating .fai file for: /tmp/RtmpLr8RfC/Rbuild77477e088b03/eisaR/vignettes/extdata/hg19sub.fa #> alignment files missing - need to: #> create alignment index for the genome #> create 4 genomic alignment(s) #> Creating an Rhisat2 index for /tmp/RtmpLr8RfC/Rbuild77477e088b03/eisaR/vignettes/extdata/hg19sub.fa #> Finished creating index #> Testing the compute nodes...OK #> Loading QuasR on the compute nodes...OK #> Available cores: #> malbec2: 1 #> Performing genomic alignments for 4 samples. See progress in the log file: #> /tmp/RtmpLr8RfC/Rbuild77477e088b03/eisaR/vignettes/QuasR_log_7d9a56d010c5.txt #> Genomic alignments have been created successfully alignmentStats(proj) #> seqlength mapped unmapped #> Sample1:genome 95000 5961 43 #> Sample2:genome 95000 5914 86
Alignments in exons and gene bodies can now be counted using
qCount and the
regU that we have generated earlier (assuming that the data is unstranded).
Intronic counts can then be obtained from the difference between gene bodies and
cntEx <- qCount(proj, regU$exons, orientation = "any") #> counting alignments...done #> collapsing counts by sample...done #> collapsing counts by query name...done cntGb <- qCount(proj, regU$genebodies, orientation = "any") #> counting alignments...done #> collapsing counts by sample...done cntIn <- cntGb - cntEx head(cntEx) #> width Sample1 Sample2 #> ENSG00000078808 4837 705 1065 #> ENSG00000186827 1821 37 8 #> ENSG00000186891 1470 26 2 head(cntIn) #> width Sample1 Sample2 #> ENSG00000078808 10307 5 15 #> ENSG00000186827 1012 3 0 #> ENSG00000186891 1734 3 0
As mentioned, both alignments and counts can also be obtained using alternative approaches. It is required that the two resulting exon and intron count tables have identical structure (genes in rows, samples in columns, the same order of rows and columns in both tables).
The above example only contains very few genes. For the rest of the vignette,
we will use count tables from a real RNA-seq experiment that are provided in the
eisaR package. The counts correspond to the rawdata used in Figure 3a of (Gaidatzis et al. 2015)
and are also available online from the supplementary material:
cntEx <- readRDS(system.file("extdata", "Fig3abc_GSE33252_rawcounts_exonic.rds", package = "eisaR")) cntIn <- readRDS(system.file("extdata", "Fig3abc_GSE33252_rawcounts_intronic.rds", package = "eisaR"))
All the further steps in exon-intron split analysis can now be performed using
a single function
runEISA. If you prefer to perform the analysis step-by-step,
you can skip now to section 7.
# remove "width" column Rex <- cntEx[, colnames(cntEx) != "width"] Rin <- cntIn[, colnames(cntIn) != "width"] # create condition factor (contrast will be TN - ES) cond <- factor(c("ES", "ES", "TN", "TN")) # run EISA res <- runEISA(Rex, Rin, cond) #> filtering quantifyable genes...keeping 11034 from 20288 (54.4%) #> fitting statistical model...done #> calculating contrasts...done
There are five arguments in
pscnt) that control gene filtering, calculation of contrasts
and the statistical method used, summarized in the bullet list below.
The default values of these arguments correspond to the currently recommended way
of running EISA. You can also run EISA exactly as it was described in (Gaidatzis et al. 2015), by
method = "Gaidatzis2015". This will override the values of the five
other arguments and set them according to the published algorithm (as indicated
modelSamples: Account for individual samples in statistical model? Possible values are:
method="Gaidatzis2015"): use a model of the form
~ condition * region
TRUE: use a model of the form
~ sample + condition * region
geneSelection: How to select detected genes. Possible values are:
"filterByExpr"(default): First, counts are normalized using
edgeR::calcNormFactors, treating intonic and exonic counts as individual samples. Then, the
edgeR::filterByExprfunction is used with default parameters to select quantifyable genes.
"none": This will use all the genes provided in the count tables, assuming that an appropriate selection of quantifyable genes has already been done.
method="Gaidatzis2015"): First, intronic and exonic counts are linearly scaled to the mean library size (estimated as the sum of all intronic or exonic counts, respectively). Then, quantifyable genes are selected as the genes with counts
log2(x + 8) > 5in both exons and introns.
statFramework: The framework within
edgeRthat is used for the statistical analysis. Possible values are:
"QLF"(default): quasi-likelihood F-test using
edgeR::glmQLFTest. This framework is highly recommended as it gives stricter error rate control by accounting for the uncertainty in dispersion estimation.
method="Gaidatzis2015"): likelyhood ratio test using
effects: How the effects (log2 fold-changes) are calculated. Possible values are:
"predFC"(default): Fold-changes are calculated using the fitted model with
edgeR::predFCand the value provided to
pscnt. Please note that if a sample factor is included in the statistical model (
modelSamples=TRUE), effects cannot be obtained from that model. In that case, effects are obtained from a simpler model without sample effects.
method="Gaidatzis2015"): Fold-changes are calculated using the formula
log2((x + pscnt)/(y + pscnt)). If
pscntis not set to 8,
runEISAwill warn that this deviates from the method used in Gaidatzis et al., 2015.
pscnt: The pseudocount that is added to normalized counts before log transformation. For
pscntis used both in gene selection as well as in the calculation of log2 fold-changes. Otherwise,
pscntis only used in the calculation of log2 fold-changes in
edgeR::predFC(, prior.count = pscnt).
While different values for these arguments typically yield similar results,
the defaults are often less stringent compared to
selecting quantifyable genes, but more stringent when calling significant changes
(especially with low numbers of replicates).
Here is an illustration of how the results differ
res1 <- runEISA(Rex, Rin, cond, method = "Gaidatzis2015") #> setting parameters according to Gaidatzis et al., 2015 #> filtering quantifyable genes...keeping 8481 from 20288 (41.8%) #> fitting statistical model...done #> calculating contrasts...done res2 <- runEISA(Rex, Rin, cond) #> filtering quantifyable genes...keeping 11034 from 20288 (54.4%) #> fitting statistical model...done #> calculating contrasts...done # number of quantifyable genes nrow(res1$DGEList) #>  8481 nrow(res2$DGEList) #>  11034 # number of genes with significant post-transcriptional regulation sum(res1$tab.ExIn$FDR < 0.05) #>  490 sum(res2$tab.ExIn$FDR < 0.05) #>  131 # method="Gaidatzis2015" results contain most of default results summary(rownames(res2$contrasts)[res2$tab.ExIn$FDR < 0.05] %in% rownames(res1$contrasts)[res1$tab.ExIn$FDR < 0.05]) #> Mode FALSE TRUE #> logical 14 117 # comparison of deltas ids <- intersect(rownames(res1$DGEList), rownames(res2$DGEList)) cor(res1$contrasts[ids,"Dex"], res2$contrasts[ids,"Dex"]) #>  0.9958296 cor(res1$contrasts[ids,"Din"], res2$contrasts[ids,"Din"]) #>  0.9945688 cor(res1$contrasts[ids,"Dex.Din"], res2$contrasts[ids,"Dex.Din"]) #>  0.9722784 plot(res1$contrasts[ids,"Dex.Din"], res2$contrasts[ids,"Dex.Din"], pch = "*", xlab = expression(paste(Delta, "exon", - Delta, "intron for method='Gaidatzis2015'")), ylab = expression(paste(Delta, "exon", - Delta, "intron for default parameters")))
The calculation of the significance of interactions (here whether the fold-changes differ between exonic or intronic data) is well defined for experimental designs were all samples are independent from one another. Within EISA, this is not the case (each sample yields two data points, on for exons and one for introns). That results in a dependency between datapoints: If a sample is affected by a problem in the experiment, it might at the same time give rise to outliers in both exonic and intronic counts.
In statistics, such an experimental design is often referred to as a split-plot
design, and a recommended way to analyze interactions in such experiments would
be to use a mixed effect model with the plot (in our case, the sample) as a random
effect. The disadvantage here however would be that available packages for mixed
effect models are not designed for count data, and we therefore use an alternative
approach to explicitely model the sample dependency, by introducing a sample
factor into the design matrix (for
modelSamples=TRUE). That sample factor is
nested in the condition factor (no sample can belong to more than one condition),
and therefore not all coefficients can be estimated. We therefore drop one of
the columns of the design matrix (identified by
This has no impact on the effects (the log2 fold-changes of
modelSamples=FALSE are nearly identical). However, in the presence of sample effects,
modelSamples=TRUE increases the sensitivity of detecting genes with significant
interactions. Here is a comparison of the EISA results with and without accounting
for the sample in the model:
res3 <- runEISA(Rex, Rin, cond, modelSamples = FALSE) #> filtering quantifyable genes...keeping 11034 from 20288 (54.4%) #> fitting statistical model...done #> calculating contrasts...done res4 <- runEISA(Rex, Rin, cond, modelSamples = TRUE) #> filtering quantifyable genes...keeping 11759 from 20288 (58%) #> fitting statistical model...done #> calculating contrasts...fitting model without sample factor...done ids <- intersect(rownames(res3$contrasts), rownames(res4$contrasts)) # number of genes with significant post-transcriptional regulation sum(res3$tab.ExIn$FDR < 0.05) #>  131 sum(res4$tab.ExIn$FDR < 0.05) #>  900 # modelSamples=TRUE results are a super-set of # modelSamples=FALSE results summary(rownames(res3$contrasts)[res3$tab.ExIn$FDR < 0.05] %in% rownames(res4$contrasts)[res4$tab.ExIn$FDR < 0.05]) #> Mode TRUE #> logical 131 # comparison of contrasts diag(cor(res3$contrasts[ids, ], res4$contrasts[ids, ])) #> Dex Din Dex.Din #> 0.9999998 0.9999983 0.9958381 plot(res3$contrasts[ids, 3], res4$contrasts[ids, 3], pch = "*", xlab = "Interaction effects for modelSamples=FALSE", ylab = "Interaction effects for modelSamples=TRUE")
# comparison of interaction significance plot(-log10(res3$tab.ExIn[ids, "FDR"]), -log10(res4$tab.ExIn[ids, "FDR"]), pch = "*", xlab = "-log10(FDR) for modelSamples=FALSE", ylab = "-log10(FDR) for modelSamples=TRUE") abline(a=0, b=1, col="gray") legend("bottomright", "y = x", bty = "n", lty = 1, col = "gray")