# Standard RNA-seq processing

This tutorial assumes that the reader is familiar with the limma/voom workflow for RNA-seq. Start by processing the raw count data with limma/voom.
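A minimal sketch of this preprocessing step is shown here. It assumes the example dataset `varPartDEdata` bundled with variancePartition (which provides `countMatrix` and `metadata`). The filtering threshold and the 1000-gene subset are illustrative choices, not requirements.

```r
library(variancePartition)
library(edgeR)

# Assumption: example data shipped with variancePartition;
# loads `countMatrix` (genes x samples) and `metadata` (one row per sample)
data(varPartDEdata)

# Keep genes with CPM > 0.1 in at least 5 samples (illustrative threshold)
isexpr <- rowSums(cpm(countMatrix) > 0.1) >= 5

# Standard edgeR/limma preprocessing: build a DGEList, compute TMM factors
geneExpr <- DGEList(countMatrix[isexpr, ])
geneExpr <- calcNormFactors(geneExpr)

# Optional: analyze a subset of genes to keep run time short
geneExpr <- geneExpr[1:1000, ]
```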

# Limma Analysis

Limma has a built-in approach for analyzing repeated measures data using duplicateCorrelation(). The model can handle a single random effect, and forces the magnitude of the random effect to be the same across all genes.
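The duplicateCorrelation() workflow might look like the following sketch. It assumes the bundled `varPartDEdata` example data, with `Disease` as the tested variable and `Individual` as the blocking factor. The two rounds of voom() and duplicateCorrelation() and the 200-gene subset are illustrative.

```r
library(variancePartition)
library(edgeR)   # also attaches limma

# Setup, repeated so this block runs on its own (assumed example data)
data(varPartDEdata)
isexpr <- rowSums(cpm(countMatrix) > 0.1) >= 5
geneExpr <- calcNormFactors(DGEList(countMatrix[isexpr, ]))[1:200, ]

design <- model.matrix(~ Disease, metadata)

# Round 1: voom without blocking, then estimate the within-donor correlation
vobj_tmp <- voom(geneExpr, design, plot = FALSE)
dupcor <- duplicateCorrelation(vobj_tmp, design, block = metadata$Individual)

# Round 2: rerun voom with the consensus correlation to refine the
# precision weights, then re-estimate the correlation
vobj <- voom(geneExpr, design, plot = FALSE,
             block = metadata$Individual, correlation = dupcor$consensus)
dupcor <- duplicateCorrelation(vobj, design, block = metadata$Individual)

# Fit each gene, using the single genome-wide correlation value
fitDupCor <- lmFit(vobj, design, block = metadata$Individual,
                   correlation = dupcor$consensus)
fitDupCor <- eBayes(fitDupCor)

topTable(fitDupCor, coef = "Disease1", number = 3)
```

Note that `dupcor$consensus` is one number shared by all genes, which is the restriction described above.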

# Dream Analysis

The dream method replaces two core functions of limma with a linear mixed model.

1. voomWithDreamWeights() replaces voom() to estimate precision weights.
2. dream() replaces lmFit() to estimate regression coefficients.

Otherwise dream uses the same workflow as limma with topTable(), since any statistical differences are handled behind the scenes.
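A minimal sketch of a dream analysis follows. It assumes the bundled `varPartDEdata` example data and a model with `Disease` as a fixed effect and `Individual` as a random effect. The calls follow the package release listed in the session info; other releases may expect additional steps, so check the documentation of your installed version.

```r
library(variancePartition)
library(edgeR)

# Setup, repeated so this block runs on its own (assumed example data)
data(varPartDEdata)
isexpr <- rowSums(cpm(countMatrix) > 0.1) >= 5
geneExpr <- calcNormFactors(DGEList(countMatrix[isexpr, ]))[1:200, ]

# The variable to be tested is a fixed effect; donor is a random effect
form <- ~ Disease + (1 | Individual)

# Estimate precision weights with the linear mixed model
vobjDream <- voomWithDreamWeights(geneExpr, form, metadata)

# Fit the dream model for each gene
fitmm <- dream(vobjDream, form, metadata)

# Examine the design matrix, then extract results for the coefficient
head(fitmm$design, 3)
tab <- topTable(fitmm, coef = "Disease1", number = 3)
tab
```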

##           (Intercept) Disease1
## sample_01           1        0
## sample_02           1        0
## sample_03           1        0

##                                    logFC  AveExpr        t      P.Value    adj.P.Val     z.std
## ENST00000283033.5 gene=TXNDC11  1.556233 3.567624 68.62597 3.701359e-27 3.701359e-24 10.793329
## ENST00000257181.9 gene=PRPF38A  1.380549 4.398270 35.53037 6.311777e-21 3.155888e-18  9.384663
## ENST00000421974.2 gene=ATP6V0E2 1.386821 3.478030 22.35156 1.289338e-16 3.584404e-14  8.274557

Since dream uses an estimated degrees of freedom value for each hypothesis test, the degrees of freedom differ from gene to gene. The t-statistics are therefore not directly comparable across genes. To make the test statistics comparable, dream reports z.std, which is the p-value transformed into a signed z-score. This can be used for downstream analysis.
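The transformation can be illustrated in base R. This is a sketch of one natural reading of "p-value transformed into a signed z-score" (a two-sided p-value mapped to a standard normal quantile, signed by the t-statistic). The helper name `signed_z` is hypothetical, and the package's internal code may differ in detail.

```r
# Two-sided p-value plus the sign of the t-statistic -> signed z-score
signed_z <- function(p, t) {
  sign(t) * qnorm(p / 2, lower.tail = FALSE)
}

# First row of the table above: t = 68.62597, P.Value = 3.701359e-27
z <- signed_z(p = 3.701359e-27, t = 68.62597)
z
```

The result should agree closely with the z.std value reported for that gene, which suggests that the sketch matches the reported statistic.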

Note that if a random effect is not specified, dream() automatically uses lmFit(), but the user must run eBayes() afterward.

### Using contrasts to compare coefficients

You can also perform a hypothesis test of the difference between two or more coefficients by using a contrast matrix. The contrasts are evaluated at the time of the model fit and the results can be extracted with topTable(). This behaves like makeContrasts() and contrasts.fit() in limma.

Make sure to inspect your contrast matrix to confirm it is testing what you intend.
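A sketch of a single contrast is shown here, assuming the bundled `varPartDEdata` example data and the getContrast() and plotContrasts() helpers of the package release listed in the session info. Newer releases also provide makeContrastsDream(), so consult the documentation of your installed version.

```r
library(variancePartition)
library(edgeR)

# Setup, repeated so this block runs on its own (assumed example data)
data(varPartDEdata)
isexpr <- rowSums(cpm(countMatrix) > 0.1) >= 5
geneExpr <- calcNormFactors(DGEList(countMatrix[isexpr, ]))[1:100, ]

# No intercept, so each DiseaseSubtype level gets its own coefficient
form <- ~ 0 + DiseaseSubtype + Sex + (1 | Individual)
vobjDream <- voomWithDreamWeights(geneExpr, form, metadata)

# Contrast: DiseaseSubtype2 minus DiseaseSubtype1
L <- getContrast(vobjDream, form, metadata,
                 c("DiseaseSubtype2", "DiseaseSubtype1"))

# Inspect the contrast before fitting
L
plotContrasts(L)

# Fit the model and evaluate the contrast at the same time
fit <- dream(vobjDream, form, metadata, L)

# Names of the available coefficients and contrasts
colnames(fit)
topTable(fit, coef = "L1", number = 3)
```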

## [1] "L1"              "DiseaseSubtype0" "DiseaseSubtype1" "DiseaseSubtype2" "SexM"

##                                       logFC  AveExpr         t      P.Value  adj.P.Val     z.std
## ENST00000355624.3 gene=RAB11FIP2 -0.9493146 5.260280 -5.384721 2.857024e-05 0.02857024 -4.184573
## ENST00000593466.1 gene=DDA1      -1.7265710 3.901579 -3.516847 2.168831e-03 0.99955523 -3.066083
## ENST00000200676.3 gene=CETP       1.4777420 3.723438  3.732329 8.085176e-03 0.99955523  2.648492

Multiple contrasts can be evaluated at the same time, in order to save computation time:
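As a sketch, several contrast vectors can be bound column-wise into one matrix and passed to a single dream() call. This assumes the bundled `varPartDEdata` example data and the getContrast() helper of the package release listed in the session info. The names `L1` and `L2` are chosen for illustration.

```r
library(variancePartition)
library(edgeR)

# Setup, repeated so this block runs on its own (assumed example data)
data(varPartDEdata)
isexpr <- rowSums(cpm(countMatrix) > 0.1) >= 5
geneExpr <- calcNormFactors(DGEList(countMatrix[isexpr, ]))[1:100, ]

form <- ~ 0 + DiseaseSubtype + Sex + (1 | Individual)
vobjDream <- voomWithDreamWeights(geneExpr, form, metadata)

# Two contrasts: subtype 2 vs 1, and subtype 1 vs 0
L1 <- getContrast(vobjDream, form, metadata,
                  c("DiseaseSubtype2", "DiseaseSubtype1"))
L2 <- getContrast(vobjDream, form, metadata,
                  c("DiseaseSubtype1", "DiseaseSubtype0"))

# One column per contrast; the model is fit only once
L <- cbind(L1, L2)
fit <- dream(vobjDream, form, metadata, L)

topTable(fit, coef = "L1", number = 3)
```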

##                                       logFC  AveExpr         t      P.Value  adj.P.Val     z.std
## ENST00000355624.3 gene=RAB11FIP2 -0.9493146 5.260280 -5.384721 2.857024e-05 0.02857024 -4.184573
## ENST00000593466.1 gene=DDA1      -1.7265710 3.901579 -3.516847 2.168831e-03 0.99955523 -3.066083
## ENST00000200676.3 gene=CETP       1.4777420 3.723438  3.732329 8.085176e-03 0.99955523  2.648492

### Joint hypothesis test of multiple coefficients

Multiple coefficients can be tested jointly using an F-test. Just like in limma, the results can be extracted using topTable().
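A sketch of a joint test, assuming the bundled `varPartDEdata` example data: passing more than one coefficient name to topTable() requests the F-test. The behavior shown follows the package release listed in the session info.

```r
library(variancePartition)
library(edgeR)

# Setup, repeated so this block runs on its own (assumed example data)
data(varPartDEdata)
isexpr <- rowSums(cpm(countMatrix) > 0.1) >= 5
geneExpr <- calcNormFactors(DGEList(countMatrix[isexpr, ]))[1:100, ]

form <- ~ 0 + DiseaseSubtype + Sex + (1 | Individual)
vobjDream <- voomWithDreamWeights(geneExpr, form, metadata)
fit <- dream(vobjDream, form, metadata)

# Joint test of two coefficients via an F-test
tabF <- topTable(fit, coef = c("DiseaseSubtype2", "DiseaseSubtype1"),
                 number = 3)
tabF
```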

##                                 DiseaseSubtype2 DiseaseSubtype1  AveExpr         F      P.Value
## ENST00000283033.5 gene=TXNDC11         4.364305        4.316435 3.567624 22739.435 2.693356e-34
## ENST00000257181.9 gene=PRPF38A         5.083019        5.091434 4.398270  9828.227 1.177149e-30
## ENST00000423994.2 gene=CACNA2D2        3.859983        3.752166 3.766314  4114.966 7.011262e-27
##                                    adj.P.Val    F.std
## ENST00000283033.5 gene=TXNDC11  2.693356e-31 77.29711
## ENST00000257181.9 gene=PRPF38A  5.885745e-28 68.91446
## ENST00000423994.2 gene=CACNA2D2 2.337087e-24 60.22228

Since dream uses an estimated degrees of freedom value for each hypothesis test, the degrees of freedom differ from gene to gene. The F-statistics are therefore not directly comparable across genes. To make the test statistics comparable, dream reports F.std, which is the p-value transformed into an F-statistic with $$df_1=\text{number of coefficients tested}$$ and $$df_2=\infty$$. This can be used for downstream analysis.
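The transformation can be sketched in base R, using the fact that an F distribution with $$df_2=\infty$$ is a chi-square distribution with $$df_1$$ degrees of freedom divided by $$df_1$$. The helper name `f_std` is hypothetical, and the package's internal code may differ in detail.

```r
# p-value -> F-statistic with df1 numerator df and infinite denominator df
f_std <- function(p, df1) {
  qchisq(p, df = df1, lower.tail = FALSE) / df1
}

# First row of the F-test table above: P.Value = 2.693356e-34, two coefficients
Fs <- f_std(p = 2.693356e-34, df1 = 2)
Fs
```

The result should agree closely with the F.std value reported for that gene.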

## Small-sample method

For small datasets, the Kenward-Roger method can be more powerful, but it is substantially more computationally intensive.
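A sketch of switching the degrees-of-freedom method, assuming the bundled `varPartDEdata` example data and the `ddf` argument of dream(). Only 50 genes are used here because of the extra computation.

```r
library(variancePartition)
library(edgeR)

# Setup, repeated so this block runs on its own (assumed example data)
data(varPartDEdata)
isexpr <- rowSums(cpm(countMatrix) > 0.1) >= 5
geneExpr <- calcNormFactors(DGEList(countMatrix[isexpr, ]))[1:50, ]

form <- ~ Disease + (1 | Individual)
vobjDream <- voomWithDreamWeights(geneExpr, form, metadata)

# Use the Kenward-Roger approximation instead of the default
fitmmKR <- dream(vobjDream, form, metadata, ddf = "Kenward-Roger")

topTable(fitmmKR, coef = "Disease1", number = 3)
```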

## variancePartition plot

Dream and variancePartition share the same underlying linear mixed model framework. A variancePartition analysis can indicate important variables that should be included as fixed or random effects in the dream analysis.
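A sketch of a variancePartition analysis on the same kind of data, assuming the bundled `varPartDEdata` example data. Modeling both `Individual` and `Disease` as random effects is an illustrative choice for partitioning variance.

```r
library(variancePartition)
library(edgeR)

# Setup, repeated so this block runs on its own (assumed example data)
data(varPartDEdata)
isexpr <- rowSums(cpm(countMatrix) > 0.1) >= 5
geneExpr <- calcNormFactors(DGEList(countMatrix[isexpr, ]))[1:100, ]

# voom-transformed expression with precision weights
design <- model.matrix(~ Disease, metadata)
vobj <- voom(geneExpr, design, plot = FALSE)

# Fraction of expression variance attributable to each variable, per gene
form <- ~ (1 | Individual) + (1 | Disease)
vp <- fitExtractVarPartModel(vobj, form, metadata)

# Violin plot of the variance fractions, columns sorted by median
plotVarPart(sortCols(vp))
```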

## Compare p-values from dream and duplicateCorrelation

To understand the empirical difference between dream and duplicateCorrelation, we can plot the $$-\log_{10}$$ p-values from both methods.
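The comparison might be produced as in the following sketch. It assumes the bundled `varPartDEdata` example data and the plotCompareP() function of variancePartition. Both analyses are refit here on a 100-gene subset so that the block runs on its own.

```r
library(variancePartition)
library(edgeR)

# Setup, repeated so this block runs on its own (assumed example data)
data(varPartDEdata)
isexpr <- rowSums(cpm(countMatrix) > 0.1) >= 5
geneExpr <- calcNormFactors(DGEList(countMatrix[isexpr, ]))[1:100, ]

# duplicateCorrelation analysis (two rounds of voom)
design <- model.matrix(~ Disease, metadata)
vobj_tmp <- voom(geneExpr, design, plot = FALSE)
dupcor <- duplicateCorrelation(vobj_tmp, design, block = metadata$Individual)
vobj <- voom(geneExpr, design, plot = FALSE,
             block = metadata$Individual, correlation = dupcor$consensus)
dupcor <- duplicateCorrelation(vobj, design, block = metadata$Individual)
fitDupCor <- eBayes(lmFit(vobj, design, block = metadata$Individual,
                          correlation = dupcor$consensus))

# dream analysis
form <- ~ Disease + (1 | Individual)
vobjDream <- voomWithDreamWeights(geneExpr, form, metadata)
fitmm <- dream(vobjDream, form, metadata)

# Donor contribution per gene from variancePartition
vp <- fitExtractVarPartModel(vobj, ~ (1 | Individual) + (1 | Disease), metadata)

# p-values in the original gene order, so the vectors line up
p1 <- topTable(fitDupCor, coef = "Disease1", number = Inf, sort.by = "none")$P.Value
p2 <- topTable(fitmm, coef = "Disease1", number = Inf, sort.by = "none")$P.Value

plotCompareP(p1, p2, vp$Individual, dupcor$consensus)
```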

The duplicateCorrelation method estimates a single variance term genome-wide, even though the donor contribution of a particular gene can vary substantially from the genome-wide trend. Using a single genome-wide value for the within-donor variance can reduce power and increase the false positive rate in a specific, reproducible way. Let $$\tau^2_g$$ be the value of the donor component for gene $$g$$ and $$\bar{\tau}^2$$ be the genome-wide mean. For genes where $$\tau^2_g>\bar{\tau}^2$$, using $$\bar{\tau}^2$$ under-corrects for the donor component, which increases the false positive rate compared to using $$\tau^2_g$$. Conversely, for genes where $$\tau^2_g<\bar{\tau}^2$$, using $$\bar{\tau}^2$$ over-corrects for the donor component, which decreases power. Increasing the sample size does not overcome this issue. The dream method overcomes it by estimating a gene-specific $$\tau^2_g$$.

Here, the $$-\log_{10}$$ p-values from both methods are plotted and colored by the donor contribution estimated by variancePartition. The green value indicates $$\bar{\tau}^2$$, while red and blue indicate higher and lower values, respectively. When only one variance component is used and the contrast matrix is simple, the effect of using dream versus duplicateCorrelation is determined by the comparison of $$\tau^2_g$$ to $$\bar{\tau}^2$$:

The dream method can increase the $$-\log_{10}$$ p-value for genes with a lower donor component (i.e. $$\tau^2_g<\bar{\tau}^2$$) and decrease the $$-\log_{10}$$ p-value for genes with a higher donor component (i.e. $$\tau^2_g>\bar{\tau}^2$$).

Note that using more variance components or a more complicated contrast matrix can make the relationship more complicated.

# Parallel processing

variancePartition functions including dream(), fitExtractVarPartModel() and fitVarPartModel() can take advantage of multicore machines to speed up analysis. They use the BiocParallel package to manage the parallelization.

There are multiple ways to use parallel processing, depending on your needs:

• Specify parameters with the BPPARAM argument.
• Set parameters globally for the entire R session.

By default, BPPARAM and the global settings are set to the result of bpparam(). Note, however, that using SnowParam() can dramatically reduce the memory needed for parallel processing because it reduces memory redundancy between workers.
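Both options can be sketched with BiocParallel as follows. The worker count of 2 is arbitrary, and the commented dream() call assumes that `vobjDream`, `form` and `metadata` already exist from an earlier fit.

```r
library(BiocParallel)

# Option 1: build a parameter object and pass it to a single call.
# (2 workers chosen arbitrarily for illustration)
param <- SnowParam(workers = 2, type = "SOCK", progressbar = TRUE)
# fitmm <- dream(vobjDream, form, metadata, BPPARAM = param)

# Option 2: register it globally for the R session, so that later
# calls relying on bpparam() pick it up
register(param)
bpparam()
```

Passing BPPARAM explicitly keeps the setting local to one call, while register() changes the session default.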

# Session info

## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
##  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
##  [1] BiocParallel_1.20.1      edgeR_3.28.0             pander_0.6.3
##  [4] variancePartition_1.16.1 Biobase_2.46.0           BiocGenerics_0.32.0
##  [7] scales_1.1.0             foreach_1.4.7            limma_3.42.0
## [10] ggplot2_3.2.1            knitr_1.26
##
## loaded via a namespace (and not attached):
##  [1] splines_3.6.2       gtools_3.8.1        assertthat_0.2.1    statmod_1.4.32
##  [5] yaml_2.2.0          progress_1.2.2      numDeriv_2016.8-1.1 pillar_1.4.3
##  [9] backports_1.1.5     lattice_0.20-38     glue_1.3.1          digest_0.6.23
## [13] minqa_1.2.4         colorspace_1.4-1    htmltools_0.4.0     Matrix_1.2-18
## [17] plyr_1.8.5          pkgconfig_2.0.3     purrr_0.3.3         snow_0.4-3
## [21] gdata_2.18.0        lme4_1.1-21         tibble_2.1.3        farver_2.0.1
## [25] withr_2.1.2         lazyeval_0.2.2      pbkrtest_0.4-7      magrittr_1.5
## [29] crayon_1.3.4        evaluate_0.14       doParallel_1.0.15   nlme_3.1-143
## [33] MASS_7.3-51.5       gplots_3.0.1.1      tools_3.6.2         prettyunits_1.0.2
## [37] hms_0.5.2           lifecycle_0.1.0     stringr_1.4.0       munsell_0.5.0
## [41] locfit_1.5-9.1      colorRamps_2.3      compiler_3.6.2      caTools_1.17.1.3
## [45] rlang_0.4.2         grid_3.6.2          nloptr_1.2.1        iterators_1.0.12
## [49] bitops_1.0-6        labeling_0.3        rmarkdown_2.0       boot_1.3-24
## [53] gtable_0.3.0        codetools_0.2-16    lmerTest_3.1-1      reshape2_1.4.3
## [57] R6_2.4.1            dplyr_0.8.3         zeallot_0.1.0       KernSmooth_2.23-16
## [61] stringi_1.4.3       Rcpp_1.0.3          vctrs_0.2.1         tidyselect_0.2.5
## [65] xfun_0.11

# References

Law, C. W., Y. Chen, W. Shi, and G. K. Smyth. 2014. “Voom: precision weights unlock linear model analysis tools for RNA-seq read counts.” Genome Biology 15 (2):R29. https://doi.org/10.1186/gb-2014-15-2-r29.