1 Introduction

We introduce PAIRADISE (PAIred Replicate analysis of Allelic DIfferential Splicing Events), a method for detecting allele-specific alternative splicing (ASAS) from RNA-seq data. PAIRADISE uses a statistical model that aggregates ASAS signals across multiple individuals in a population. It formulates ASAS detection as a statistical problem for identifying differential alternative splicing from RNA-seq data with paired replicates. The PAIRADISE statistical model is applicable to many forms of allele-specific isoform variation (e.g. RNA editing), and can be used as a generic statistical model for RNA-seq studies involving paired replicates. More details can be found: https://github.com/Xinglab/PAIRADISE

2 Installation

  1. Download the package.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("PAIRADISE")

The development version is also available to download from Github.

BiocManager::install("hubentu/PAIRADISE")
  1. Load the package into the R session.
library(PAIRADISE)

3 PAIRADISE

3.1 PDseDataSet

A PDseDataSet class is defined to store the splicing count data, and contains inclusion and skipping counts for each sample. A design dataframe is required to describe the paired sample information. An integer dataframe of inclusion and skipping lengths are also required. The PDseDataSet extends the SummarizedExperiment class. All functions from SummarizedExperiment are inherited.

Here is an example to construct a PDseDataSet with 2 pairs of samples.

library(abind)
icount <- matrix(1:4, 1)
scount <- matrix(5:8, 1)
acount <- abind(icount, scount, along = 3)
acount
#> , , 1
#> 
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    2    3    4
#> 
#> , , 2
#> 
#>      [,1] [,2] [,3] [,4]
#> [1,]    5    6    7    8
design <- data.frame(sample = rep(c("s1", "s2"), 2),
                     group = rep(c("T", "N"), each = 2))
lens <- data.frame(sLen=1L, iLen=2L)
PDseDataSet(acount, design, lens)
#> class: PDseDataSet 
#> dim: 1 4 
#> metadata(0):
#> assays(1): counts
#> rownames: NULL
#> rowData names(2): sLen iLen
#> colnames: NULL
#> colData names(2): sample group

A count matrix can be imported as a PDseDataSet directly.

data("sample_dataset")
sample_dataset
#>   ExonID              I1            S1           I2              S2 I_len
#> 1 Exon 1     624,661,209   564,450,167  549,468,103    1261,767,325   180
#> 2 Exon 2    963,1139,388 1104,1100,330 1196,938,439      317,374,93   180
#> 3 Exon 3 15,17000,20,100      2,12,1,1     1,6,7,10 274,NA,320,5650     3
#> 4 Exon 4           3,5,9       13,27,4        5,9,9         11,29,3     3
#>   S_len
#> 1    90
#> 2    90
#> 3     1
#> 4     1

PAIRADISE also includes two small sample datasets from Geuvadis and TCGA:

data("sample_dataset_CEU")

data("sample_dataset_LUSC")

The “sample_dataset_CEU” dataset was generated by analyzing the allele-specific alternative splicing events in the GEUVADIS CEU data. Allele-specific reads were mapped onto alternative splicing events using rPGA (version 2.0.0). Then the allele-specific bam files mapped onto the two haplotypes are merged together to detect alternative splicing events using rMATS (version 3.2.5). The second LUSC dataset was generated by analyzing the tumor versus adjacent control samples from TCGA LUSC RNA-seq data.

Each row of the data corresponds to a different alternative splicing event. The data should have 7 columns. The order of the 7 columns in the input data frame to PAIRADISE follows the convention of the output of the rMATS software, arranged as follows:

  1. Column 1 contains the ID of the alternative splicing events.
  2. Column 2 contains counts of isoform 1 corresponding to the first group.
  3. Column 3 contains counts of isoform 2 corresponding to the first group.
  4. Column 4 contains counts of isoform 1 corresponding to the second group.
  5. Column 5 contains counts of isoform 2 corresponding to the second group.
  6. Column 6 contains the effective length of isoform 1.
  7. Column 7 contains the effective length of isoform 2.

To import the data to a PDseDataSet object.

pdat <- PDseDataSetFromMat(sample_dataset)
#> Loading data...
pdat
#> class: PDseDataSet 
#> dim: 4 8 
#> metadata(0):
#> assays(1): counts
#> rownames(4): Exon 1 Exon 2 Exon 3 Exon 4
#> rowData names(2): iLen sLen
#> colnames(8): S1.T S2.T ... S3.N S4.N
#> colData names(2): sample group

3.2 pairadise method

The pairadise function implements the PAIRADISE statistical model for the PDseDataSet object. Multiple processors can be used via the BiocParallel package. The function returns a PDseDataSet object with statistical estimates in its rowData. Here is how to run the model with 2 threads.

pairadise_output <- pairadise(pdat, numCluster = 2)

3.3 Output

A function results can be used to calculate p values and filter the significant results. For example, results significant at an FDR of 0.01 can be obtained as follows.

res <- results(pairadise_output, p.adj = "BH", sig.level = 0.01)
res
#> DataFrame with 3 rows and 3 columns
#>               testStats              p.value               p.adj
#>               <numeric>            <numeric>           <numeric>
#> Exon 1 9.53535469089184  0.00201551218710416  0.0027186933098157
#> Exon 2 9.51407375719282  0.00203901998236178  0.0027186933098157
#> Exon 3 12.2851542954807 0.000456575650943147 0.00182630260377259

With details=TRUE, more detailed statistical estimates can be returned.

res <- results(pairadise_output, details = TRUE)
colnames(res)
#>  [1] "testStats" "p.value"   "mu.u"      "s1.u"      "s2.u"      "s.u"      
#>  [7] "delta"     "mu.c"      "s1.c"      "s2.c"      "s.c"       "totalIter"
#> [13] "latent"    "p.adj"
res$latent[3,]
#> $latent
#>                [,1]        [,2]         [,3]
#> psi1.u  0.857961821 0.898498692 0.9395772713
#> psi2.u  0.001439670 0.006074175 0.0006442593
#> alpha.u 2.264093241 2.280384717 2.2752524923
#> psi1.c  0.824549049 0.878164621 0.9348575262
#> psi2.c  0.001352312 0.007337847 0.0005977598
#> alpha.c 1.556213823 1.975830313 2.6509171042

4 SessionInfo

sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] abind_1.4-5      PAIRADISE_1.0.0  nloptr_1.2.1     BiocStyle_2.12.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.1                  XVector_0.24.0             
#>  [3] knitr_1.22                  magrittr_1.5               
#>  [5] zlibbioc_1.30.0             GenomicRanges_1.36.0       
#>  [7] BiocGenerics_0.30.0         IRanges_2.18.0             
#>  [9] BiocParallel_1.18.0         lattice_0.20-38            
#> [11] stringr_1.4.0               GenomeInfoDb_1.20.0        
#> [13] tools_3.6.0                 grid_3.6.0                 
#> [15] SummarizedExperiment_1.14.0 parallel_3.6.0             
#> [17] Biobase_2.44.0              xfun_0.6                   
#> [19] matrixStats_0.54.0          htmltools_0.3.6            
#> [21] yaml_2.2.0                  digest_0.6.18              
#> [23] bookdown_0.9                Matrix_1.2-17              
#> [25] GenomeInfoDbData_1.2.1      BiocManager_1.30.4         
#> [27] S4Vectors_0.22.0            bitops_1.0-6               
#> [29] RCurl_1.95-4.12             evaluate_0.13              
#> [31] rmarkdown_1.12              DelayedArray_0.10.0        
#> [33] stringi_1.4.3               compiler_3.6.0             
#> [35] stats4_3.6.0