ResidualMatrix
classResidualMatrix 1.0.0
A common step in genomics involves computing residuals to regress out uninteresting factors of variation.
However, doing so naively would discard aspects of the underlying matrix representation.
The most obvious example is the loss of sparsity when a dense matrix of residuals is computed,
increasing memory usage and compute time in downstream applications.
The ResidualMatrix package implements the ResidualMatrix
class (duh),
which provides an efficient alternative to explicit calculation of the residuals.
Users can install this package by following the usual Bioconductor installation process:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("ResidualMatrix")
ResidualMatrix
The constructor takes a matrix of input values and a design matrix, where residuals are conceptually computed by fitting the linear model to the columns of the input matrix. However, the actual calculation of the residuals is delayed until they are explictly required.
design <- model.matrix(~gl(5, 10000))
# Making up a large-ish sparse matrix.
library(Matrix)
set.seed(100)
y0 <- rsparsematrix(nrow(design), 30000, 0.01)
library(ResidualMatrix)
resids <- ResidualMatrix(y0, design)
resids
## <50000 x 30000> matrix of class ResidualMatrix and type "double":
## [,1] [,2] [,3] ... [,29999] [,30000]
## [1,] -0.00176859 -0.00061002 -0.00145170 . 0.0014689 -0.0001879
## [2,] -0.00176859 -0.00061002 -0.00145170 . 0.0014689 -0.0001879
## [3,] -0.00176859 -0.00061002 -0.00145170 . 0.0014689 -0.0001879
## [4,] -0.00176859 -0.00061002 -0.00145170 . 0.0014689 -0.0001879
## [5,] -0.00176859 -0.00061002 -0.00145170 . 0.0014689 -0.0001879
## ... . . . . . .
## [49996,] -0.000179354 0.000624710 0.001578000 . -5.773e-04 6.741e-05
## [49997,] -0.000179354 0.000624710 0.001578000 . -5.773e-04 6.741e-05
## [49998,] -0.000179354 0.000624710 0.001578000 . -5.773e-04 6.741e-05
## [49999,] -0.000179354 0.000624710 0.001578000 . -5.773e-04 6.741e-05
## [50000,] -0.000179354 0.000624710 0.001578000 . -5.773e-04 6.741e-05
It is simple to obtain the residuals for, say, a single column. We could also use the DelayedArray block processing machinery to do this for chunks of columns at a time, allowing downstream code to compute on the residuals within memory limits.
hist(resids[,1])
In fact, matrix multiplication steps involving a ResidualMatrix
do not need to compute the residuals at all.
This means that ResidualMatrix
objects can be efficiently used in approximate PCA algorithms based on multiplication,
as shown below for randomized SVD via BiocSingular’s runPCA()
function.
The only requirement is that the original matrix has a reasonably efficient matrix multiplication operator.
(For randomized PCA, we also set deferred=TRUE
to ensure that our object does not collapse to a DelayedMatrix
upon centering;
this speeds up the computation without changing the result, given the column means for the residuals should be zero anyway.)
set.seed(100)
system.time(pc.out <- BiocSingular::runPCA(resids, 10,
BSPARAM=BiocSingular::RandomParam(deferred=TRUE)))
## user system elapsed
## 7.812 0.144 7.962
str(pc.out)
## List of 3
## $ sdev : num [1:10] 0.16 0.159 0.159 0.159 0.159 ...
## $ rotation: num [1:30000, 1:10] 0.001048 -0.010312 -0.000749 0.000831 -0.003602 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:10] "PC1" "PC2" "PC3" "PC4" ...
## $ x : num [1:50000, 1:10] -0.0172 -0.1613 0.0576 -0.1764 -0.0826 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:10] "PC1" "PC2" "PC3" "PC4" ...
Similarly, the row and column sums/means can be computed efficiently, based on the matrix multiplication machinery and the original matrix’s row and column sum functions.
hist(rowSums(resids))
Other operations will cause the ResidualMatrix
to collapse into DelayedMatrix
for further processing.
We can also specify that we only want to regress out some factors in our design
.
For example, let’s say we have a dataset with an interesting two-group structure and an uninteresting continuous covariate BAD
:
design2 <- model.matrix(~gl(2, 10000))
design2 <- cbind(design2, BAD=runif(nrow(design2)))
colnames(design2)
## [1] "(Intercept)" "gl(2, 10000)2" "BAD"
We can instruct ResidualMatrix()
to retain the interesting structure (first two coefficients)
while regressing out the uninteresting covariate in the third coefficient:
# Making up another large-ish sparse matrix.
y0 <- rsparsematrix(nrow(design2), 30000, 0.01)
resid2 <- ResidualMatrix(y0, design2, keep=1:2)
resid2
## <20000 x 30000> matrix of class ResidualMatrix and type "double":
## [,1] [,2] [,3] ... [,29999]
## [1,] 0.0003610968 -0.0005362812 -0.0006870567 . -4.423137e-04
## [2,] 0.0006099956 -0.0009059321 -0.0011606349 . -7.471941e-04
## [3,] 0.0013597619 -0.0020194441 -0.0025872107 . -1.665596e-03
## [4,] 0.0013671407 -0.0020304026 -0.0026012502 . -1.674634e-03
## [5,] 0.0009095187 -0.0013507675 -0.0017305357 . -1.114085e-03
## ... . . . . .
## [19996,] 0.0009696272 -0.0014400373 -0.0018449037 . -0.0011877129
## [19997,] 0.0009091618 -0.0013502374 -0.0017298566 . -0.0011136479
## [19998,] 0.0009893631 -0.0014693480 -0.0018824551 . -0.0012118878
## [19999,] 0.0018900069 -0.0028069349 -0.0035961045 . -0.0023151018
## [20000,] 0.0014685486 -0.0021810081 -0.0027941984 . -0.0017988503
## [,30000]
## [1,] 6.401415e-05
## [2,] 1.081382e-04
## [3,] 2.410545e-04
## [4,] 2.423626e-04
## [5,] 1.612367e-04
## ... .
## [19996,] 0.0001718926
## [19997,] 0.0001611735
## [19998,] 0.0001753913
## [19999,] 0.0003350547
## [20000,] 0.0002603399
In this sense, the ResidualMatrix
is effectively a delayed version of removeBatchEffect()
,
the old workhorse function from limma.
In some cases, we may only be confident in the correctness of design
for a subset of our samples.
For example, we may have several batches of observations, each of which contains a subset of control observations.
All other observations in each batch have unknown structure but are affected by the same additive batch effect as the controls.
We would like to use the controls to remove the batch effect without making assumptions about the other observations.
To achieve this, we set the restrict=
argument in the ResidualMatrix
constructor.
This performs model fitting using only the specified (control) subset to estimate the batch effect.
It then uses those estimates to perform regression on all observations.
This option can also be combined with keep
if the controls themselves have some structure that should be retained.
batches <- gl(3, 1000)
controls <- c(1:100, 1:100+1000, 1:100+2000)
y <- matrix(rnorm(30000), nrow=3000)
resid3 <- ResidualMatrix(y, design=model.matrix(~batches), restrict=controls)
resid3
## <3000 x 10> matrix of class ResidualMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] 0.19825824 0.17996457 1.13936050 . 0.33228412 -1.64454139
## [2,] -0.28219580 -1.15656947 0.25430702 . -0.08018124 0.87146688
## [3,] -2.21640973 0.26290101 -0.87678574 . -0.64221058 -0.09514042
## [4,] -0.08887064 0.14095095 0.98438377 . -1.05511992 1.02235182
## [5,] -0.99462216 0.63980424 -0.35985208 . 0.33789575 -0.68093555
## ... . . . . . .
## [2996,] -0.3511486 -2.2495602 0.3509155 . 0.59967323 -1.07031924
## [2997,] -0.4235921 2.0300343 1.4830692 . -0.09382277 -3.42017897
## [2998,] 1.0993104 0.4044992 -2.6605940 . 1.14067257 -1.33360979
## [2999,] -0.1802699 -1.7624554 -0.1289637 . -1.71318104 -1.96317436
## [3000,] -0.3151365 0.1005264 0.3068870 . 1.49625115 0.39279434
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ResidualMatrix_1.0.0 Matrix_1.2-18 BiocStyle_2.18.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 BiocSingular_1.6.0 knitr_1.30
## [4] magrittr_1.5 BiocGenerics_0.36.0 IRanges_2.24.0
## [7] BiocParallel_1.24.0 lattice_0.20-41 rlang_0.4.8
## [10] stringr_1.4.0 tools_4.0.3 rsvd_1.0.3
## [13] parallel_4.0.3 grid_4.0.3 xfun_0.18
## [16] irlba_2.3.3 htmltools_0.5.0 matrixStats_0.57.0
## [19] yaml_2.2.1 digest_0.6.27 bookdown_0.21
## [22] BiocManager_1.30.10 S4Vectors_0.28.0 evaluate_0.14
## [25] rmarkdown_2.5 DelayedArray_0.16.0 stringi_1.5.3
## [28] compiler_4.0.3 magick_2.5.0 MatrixGenerics_1.2.0
## [31] beachmat_2.6.0 stats4_4.0.3