IntEREst

Ali Oghabian

2017-04-24

The Intron Exon Retention Estimator (IntEREst) facilitates estimation and comparison of splicing efficiency of various transcripts across several samples. In particular, it can estimate the intron-retention levels or the exon junction levels in the transcripts. Our method estimates the Intron-retention by counting the number of rna-seq reads that have been mapped to the Intron-exon junctions of the genes, and it can estimate the exon junction levels by counting the reads that have been mapped to the exon-exon junctions. In addition, it is possible to limit the analysis to reads that are mapped to the intron-exon or exon-exon junctions only (See the junctionReadsOnly parameter in the interest() and interest.sequential() functions). However, by default this limitation is not taken into account, i.e. the reads that are fully mapped to the introns or exons are also considered. The method is similar to the Intron retention analysis used by Niemelä et al. (2014). The package accepts standard BAM files as input and produces tab separated text files together with SummarizedExperiment objects as results. To improve the performance and running time, the processing of each single BAM file can be distributed and run on several computing cores. The results can also be plotted and statistically analyzed to check the distribution of the intron retention levels, and compare the retention levels of U12 type introns to the U2 type across the studied samples. Note that although we mainly use this package to compare the splicing efficiency of the transcripts containing U12-type introns, some functions can be used with U2-type introns as well. The functions u12NbIndex(), u12Index(), u12Boxplot(), u12BoxplotNb(), u12DensityPlot()and u12DensityPlotIntron() are specifically used for the splicing efficiency analysis of the transcripts with U12-type introns. A diagram of the running pipeline is shown in figure 1.


**Figure 1:** Diagram of IntEREst running pipeline

Figure 1: Diagram of IntEREst running pipeline


Intron retention and exon junction analysis

The normalized intron retention levels and exon junction levels can be estimated using two functions, interest() and interest.sequential(). The interest() function improves the performance of the most time-consuming part of the analysis (i.e. estimating the number of fragments mapped to each intron or exon) by distributing the reads in the .bam file over several computing cores to be analyzed simultaneously. Note that regions in the genome with repetitive sequence elements may bias the mapping of the read sequences and the retention analysis. If you wish to exclude these regions from the analysis you can use the getRepeatTable() function, however We did not find repetitive DNA elements in particular biasing our results therefore we do not routinely use this function. As for instance, if you wish to exclude, for example you can run these functions with the repeatsTableToFilter= getRepeatTable(repFamilyFil= "Alu") parameter setting. Also, as mentioned previously, to only consider the reads that map to intron-exon or exon-exon junctions set junctionReadsOnly= TRUE.

The interest() and interest.sequential() functions write output text files, moreover they can return a summarizedExperiment object for every sample they analyze. We, however, usually prevent the individual runs to return any objects (by setting the returnObj parameter as FALSE); instead, after running the analysis for all samples we generate a single summarizedExperiment object that includes results of all analyzed samples. To build such object from the output text files the readInterestResults() function can be used.

Using the exported data mdsChr22Obj

As a demo we ran the IntEREst pipeline on 16 .bam files that each includes reads mapped to U12 genes located in chromosome 22. These bam files were results of mapping RNA-Seq data from bone-marrow samples published by Madan et al. (2015) to the Human genome (hg19). The studied samples were extracted from 16 individuals; out of which 8 were diagnosed with Myelodysplastic syndrome (MDS) and featured ZRSR2 mutation, 4 were diagnosed with MDS but lacked the mutation (referred to as ZRSR2 wild-type MDS samples) and 4 were healthy individuals.

The data is accessible through GEO with the accession number GSE63816 and the scripts that we ran to map the RNA-seq data, modify the bam files, extract the reads mapped to U12 genes in chr22 and build mdsChr22Obj object have been listed in the readme.txt file in scripts folder of the IntEREst package. You can get its full path using this script in R: system.file("scripts","readme.txt", package="IntEREst"). The mdsChr22Obj object is a summarizedExperiment object that includes retention levels of the introns of the genes located in the Chromosome 22 that feature at least one U12-type intron, across the 16 MDS samples. The mdsChr22Obj object is included in the IntEREst package.


# Load library quietly
suppressMessages(library("IntEREst"))
#View object
print(mdsChr22Obj)
## class: SummarizedExperiment 
## dim: 618 16 
## metadata(2): scaleFragment scaleLength
## assays(2): counts scaledRetention
## rownames: NULL
## rowData names(16): id int_ex_id ... int_type int_subtype
## colnames(16): SRR1691633_ZRSR2Mut SRR1691634_ZRSR2Mut ...
##   SRR1691647_Normal SRR1691648_Normal
## colData names(3): resultFiles type test_ctrl


It is possible to plot() the object to check the distribution of the intron retention levels. The following scripts plot the average retention of all introns across the 3 sample types: ZRSR2 mutated MDS, ZRSR2 wild type MDS and healthy. The lowerPlot=TRUE and upperPlot=TRUE parameter settings ensures that both, the upper and lower triangle of the grid are plotted.


# Retention of all introns
plot(mdsChr22Obj, logScaleBase=exp(1), pch=20, loessLwd=1.2, 
    summary="mean", col="black", sampleAnnoCol="type", 
    lowerPlot=TRUE, upperPlot=TRUE)
**Figure 2:** Plotting the distribution of the retention levels ($log_e$ scaled retention) of introns of genes located on chromosome 22. The values have been averaged across the sample types ZRSR2 mutated, ZRSR2 wild type, and healthy.

Figure 2: Plotting the distribution of the retention levels (\(log_e\) scaled retention) of introns of genes located on chromosome 22. The values have been averaged across the sample types ZRSR2 mutated, ZRSR2 wild type, and healthy.


The following script plots the average retention of the U12 introns across the 3 sample types: ZRSR2 mutated MDS, ZRSR2 MDS wild type and healthy. By default the upper triangle of the grid is plotted only (lowerPlot=FALSE).


#Retention of U12 introns
plot(mdsChr22Obj, logScaleBase=exp(1), pch=20, plotLoess=FALSE, 
    summary="mean", col="black", sampleAnnoCol="type", 
    subsetRows=u12Index(mdsChr22Obj))
**Figure 3:** Plotting the distribution of the retention levels ($log_e$ scaled retention) of introns of genes located on chromosome 22. The values have been averaged across the sample types ZRSR2 mutated, ZRSR2 wild type, and healthy.

Figure 3: Plotting the distribution of the retention levels (\(log_e\) scaled retention) of introns of genes located on chromosome 22. The values have been averaged across the sample types ZRSR2 mutated, ZRSR2 wild type, and healthy.


Comparing intron retention levels across various samples

IntEREst also provides tools to compare the retention levels of the U12-type introns to the U2-type across various samples. Initially, we extract the significantly higher and lower retained introns by using exactTestInterest() function which employs the exactTest() function from the edgeR package, i.e. an exact test for differences between two groups of negative-binomial counts. Note that exactTestInterest() makes comparison between a pair of sample types only (e.g. test vs ctrl).


# Check the sample annotation table
getAnnotation(mdsChr22Obj)
## DataFrame with 16 rows and 3 columns
##                                               resultFiles      type
##                                               <character>  <factor>
## SRR1691633_ZRSR2Mut ./SRR1691633_ZRSR2Mut/interestRes.tsv ZRSR2_Mut
## SRR1691634_ZRSR2Mut ./SRR1691634_ZRSR2Mut/interestRes.tsv ZRSR2_Mut
## SRR1691635_ZRSR2Mut ./SRR1691635_ZRSR2Mut/interestRes.tsv ZRSR2_Mut
## SRR1691636_ZRSR2Mut ./SRR1691636_ZRSR2Mut/interestRes.tsv ZRSR2_Mut
## SRR1691637_ZRSR2Mut ./SRR1691637_ZRSR2Mut/interestRes.tsv ZRSR2_Mut
## ...                                                   ...       ...
## SRR1691644_WT             ./SRR1691644_WT/interestRes.tsv  ZRSR2_WT
## SRR1691645_Normal     ./SRR1691645_Normal/interestRes.tsv   Healthy
## SRR1691646_Normal     ./SRR1691646_Normal/interestRes.tsv   Healthy
## SRR1691647_Normal     ./SRR1691647_Normal/interestRes.tsv   Healthy
## SRR1691648_Normal     ./SRR1691648_Normal/interestRes.tsv   Healthy
##                     test_ctrl
##                      <factor>
## SRR1691633_ZRSR2Mut      test
## SRR1691634_ZRSR2Mut      test
## SRR1691635_ZRSR2Mut      test
## SRR1691636_ZRSR2Mut      test
## SRR1691637_ZRSR2Mut      test
## ...                       ...
## SRR1691644_WT            ctrl
## SRR1691645_Normal        ctrl
## SRR1691646_Normal        ctrl
## SRR1691647_Normal        ctrl
## SRR1691648_Normal        ctrl
# Run exact test
test<- exactTestInterest(mdsChr22Obj, 
    sampleAnnoCol="test_ctrl", sampleAnnotation=c("ctrl","test"), 
    geneIdCol= "ens_gene_id", silent=TRUE, disp="common")
## Design matrix not provided. Switch to the classic mode.
# Number of stabilized introns (in Chr 22)
sInt<- length(which(test$table[,"PValue"]<0.05 
    & test$table[,"logFC"]>0 & 
    rowData(mdsChr22Obj)[,"int_ex"]=="intron"))
print(sInt)
## [1] 19
# Number of stabilized (significantly retained) U12 type introns
numStU12Int<- length(which(test$table[,"PValue"]<0.05 & 
    test$table[,"logFC"]>0 & 
    rowData(mdsChr22Obj)[,"int_type"]=="U12" & 
    !is.na(rowData(mdsChr22Obj)[,"int_type"])))
# Number of U12 introns
numU12Int<- 
    length(which(rowData(mdsChr22Obj)[,"int_type"]=="U12" & 
    !is.na(rowData(mdsChr22Obj)[,"int_type"]))) 
# Fraction(%) of stabilized (significantly retained) U12 introns
perStU12Int<- numStU12Int/numU12Int*100
print(perStU12Int)
## [1] 47.36842
# Number of stabilized U2 type introns
numStU2Int<- length(which(test$table[,"PValue"]<0.05 & 
    test$table[,"logFC"]>0 & 
    rowData(mdsChr22Obj)[,"int_type"]=="U2" & 
    !is.na(rowData(mdsChr22Obj)[,"int_type"])))
# Number of U2 introns
numU2Int<- 
    length(which(rowData(mdsChr22Obj)[,"int_type"]=="U2" & 
    !is.na(rowData(mdsChr22Obj)[,"int_type"])))
# Fraction(%) of stabilized U2 introns
perStU2Int<- numStU2Int/numU2Int*100
print(perStU2Int)
## [1] 3.23741


As shown in the previous analysis ~47% of U12-type introns (of genes on Chr22) are significantly more retained (i.e. stabilized) in the ZRSR2 mutated samples comparing to the other samples, whereas same comparison shows that only ~3% of the U2-type introns are significantly more retained. For more complex experiments such as comparing samples based on a user defined design matrix other differential expression analysis functions from edgeR package, e.g. Linear Model (GLM) functions, have also been implemented in IntEREst; glmInterest() performs GLM likelihood ratio test, qlfInterest() runs quasi likelihood F-test, and treatInterest() runs fold-change threshold test on the retention levels of the introns/exons. The following commands can be used to extract the data for introns/exons that their retention levels vary significantly across all sample types: ZRSR2 mutation, ZRSR2 wild type, and healthy.

# Extract type of samples
group <- getAnnotation(mdsChr22Obj)[,"type"]
group
##  [1] ZRSR2_Mut ZRSR2_Mut ZRSR2_Mut ZRSR2_Mut ZRSR2_Mut ZRSR2_Mut ZRSR2_Mut
##  [8] ZRSR2_Mut ZRSR2_WT  ZRSR2_WT  ZRSR2_WT  ZRSR2_WT  Healthy   Healthy  
## [15] Healthy   Healthy  
## Levels: Healthy ZRSR2_Mut ZRSR2_WT
# Test retention levels' differentiation across 3 types samples
qlfRes<- qlfInterest(x=mdsChr22Obj, 
    design=model.matrix(~group), silent=TRUE, 
    disp="tagwiseInitTrended", coef=2:3, contrast=NULL, 
    poisson.bound=TRUE)

# Extract index of the introns with significant retention changes
ind= which(qlfRes$table$PValue< 0.05)
# Extract introns with significant retention level changes
variedIntrons= rowData(mdsChr22Obj)[ind,]

#Show first 5 rows and columns of the result table
print(variedIntrons[1:5,1:5])
## DataFrame with 5 rows and 5 columns
##          id                int_ex_id         chr     begin       end
##   <integer>              <character> <character> <integer> <integer>
## 1      1912 ENSG00000069998_intron_1       chr22  17619248  17619439
## 2      1914 ENSG00000069998_intron_2       chr22  17619629  17621948
## 3      1922 ENSG00000069998_intron_6       chr22  17629451  17630431
## 4      1924 ENSG00000069998_intron_7       chr22  17630636  17640015
## 5      1941 ENSG00000070010_intron_7       chr22  19452798  19455395

Next, to better illustrate the differences in the retention levels of different types of introns across the studied samples, we first use the bopxplot() method to illustrate the retention levels of all U12-type and U2-type introns in various sample types, and then we use the u12BoxplotNb() function to compare the retention of the U12 introns to their up- and down-stream U2-type introns.


# boxplot U12 and U2-type introns 
par(mar=c(7,4,2,1))
u12Boxplot(mdsChr22Obj, sampleAnnoCol="type", 
    intExCol="int_ex",  intTypeCol="int_type", intronExon="intron", 
    col=rep(c("orange", "yellow"),3) ,  lasNames=3, 
    outline=FALSE, ylab="FPKM", cex.axis=0.8)
**Figure 4:** Boxplot of the retention levels of U12 introns vs U2 introns, summed over samples with similar annotations *i.e.* ZRSR2 mutation, ZRSR2 wild type, or healthy.

Figure 4: Boxplot of the retention levels of U12 introns vs U2 introns, summed over samples with similar annotations i.e. ZRSR2 mutation, ZRSR2 wild type, or healthy.


# boxplot U12-type intron and its up/downstream U2-type introns 
par(mar=c(2,4,1,1))
u12BoxplotNb(mdsChr22Obj, sampleAnnoCol="type", lasNames=1,
    intExCol="int_ex", intTypeCol="int_type", intronExon="intron", 
    boxplotNames=c(), outline=FALSE, plotLegend=TRUE, 
    geneIdCol="ens_gene_id", xLegend="topleft", 
    col=c("pink", "lightblue", "lightyellow"), ylim=c(0,1e+06), 
    ylab="FPKM", cex.axis=0.8)
**Figure 5:** Boxplot of retention levels of U12 introns vs their up- and down-stream U2 introns across all samples.

Figure 5: Boxplot of retention levels of U12 introns vs their up- and down-stream U2 introns across all samples.


The boxplot clearly shows the increase retention of U12-type introns comparing to all the U2 introns (figure 4) and in particular comparing to the U2-type introns located on the up- or down-stream of the U12-type introns (figure 5). It is also clear that the elevated level of intron retention with U12-type introns is exacerbated in the ZRSR2 mutated samples comparing to the other studied samples. In order to better illustrate the stabilization of the U12-type introns comparing to the U2-type, we plot the density of the log fold-change of the retention (ZRSR2 mutated v.s. other samples) of U12-type introns and compare it to the log fold-change values for randomly selected U2-type introns, and U2-type introns up- or down-stream the U12-type introns.


u12DensityPlotIntron(mdsChr22Obj, 
    type= c("U12", "U2Up", "U2Dn", "U2UpDn", "U2Rand"), 
    fcType= "edgeR", sampleAnnoCol="test_ctrl", 
    sampleAnnotation=c("ctrl","test"), intExCol="int_ex", 
    intTypeCol="int_type", strandCol= "strand", 
    geneIdCol= "ens_gene_id", naUnstrand=FALSE, col=c(2,3,4,5,6), 
    lty=c(1,2,3,4,5), lwd=1, plotLegend=TRUE, cexLegend=0.7, 
    xLegend="topright", yLegend=NULL, legend=c(), randomSeed=10,
    ylim=c(0,0.6), xlab=expression("log"[2]*" fold change FPKM"))
## Design matrix not provided. Switch to the classic mode.
Figure 6: Density plot of the log fold change of U12-type introns, random U2-type introns and U2 introns (up / down / up and down)stream of U12-type introns.

Figure 6: Density plot of the log fold change of U12-type introns, random U2-type introns and U2 introns (up / down / up and down)stream of U12-type introns.

# estimate log fold-change of introns 
# by comparing test samples vs ctrl 
# and don't show warnings !
lfcRes<- lfc(mdsChr22Obj, fcType= "edgeR", 
    sampleAnnoCol="test_ctrl",sampleAnnotation=c("ctrl","test"))
## Design matrix not provided. Switch to the classic mode.
# Build the order vector
ord<- rep(1,length(lfcRes))
ord[u12Index(mdsChr22Obj)]=2

# Median of log fold change of U2 introns (ZRSR2 mut. vs ctrl)
median(lfcRes[ord==1])
## [1] 2.562741e-15
# Median of log fold change of U2 introns (ZRSR2 mut. vs ctrl)
median(lfcRes[ord==2])
## [1] 0.6992868


As shown in figure 6 (and computed after the plot), when comparing the ZRSR2 mutated samples vs the other samples, for all U2-type introns the most frequent log fold-change (median) is ~0 whereas this value for the U12-type introns is noticeably higher (~0.7). It is also possible to run a statistical test to see if the log fold-changes of U12-type introns (ZRSR2 mutated samples vs other samples) are significantly higher than the log fold-changes of U2-type introns. For this purpose we use the jonckheere.test() function, i.e. Jonckheere-Terpstra ordered alternative hypothesis test, from the Clinfun package.


# Run Jockheere Terpstra's trend test
library(clinfun)
jtRes<- jonckheere.test(lfcRes, ord, alternative = "increasing", 
    nperm=1000)
jtRes
## 
##  Jonckheere-Terpstra test
## 
## data:  
## JT = 9061, p-value = 0.001
## alternative hypothesis: increasing


The result of the Jonckheere-Terpstra test with 1000 permutation runs shows that when comparing the samples that lack the ZRSR2 mutation to the the ZRSR2 mutated samples, the null hypothesis that the log fold-changes of the retentions of U12-type and U2-type introns are equally distributed was rejected with p-value 0.001, while the alternative being that the values in the U12-type introns are higher compared to the U2-type.

References

Madan, V., D. Kanojia, J. Li, R. Okamoto, A. Sato-Otsubo, A. Kohlmann, M. Sanada, et al. 2015. “Aberrant splicing of U12-type introns is the hallmark of ZRSR2 mutant myelodysplastic syndrome.” Nat Commun 6: 6042.

Niemela, E. H., A. Oghabian, R. H. Staals, D. Greco, G. J. Pruijn, and M. J. Frilander. 2014. “Global analysis of the nuclear processing of transcripts with unspliced U12-type introns by the exosome.” Nucleic Acids Res. 42 (11): 7358–69.