TileDBArray 1.0.0
TileDB implements a framework for local and remote storage of dense and sparse arrays.
We can use this as a DelayedArray backend to provide an array-level abstraction,
thus allowing the data to be used in many places where an ordinary array or matrix might be used.
The TileDBArray package implements the necessary wrappers around TileDB-R
to support read/write operations on TileDB arrays within the DelayedArray framework.
Creating a TileDBArray is as easy as:
X <- matrix(rnorm(1000), ncol=10)
library(TileDBArray)
writeTileDBArray(X)
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] -0.6403446 -0.8170174 -1.7581782 . 2.11630192 0.83743410
## [2,] 1.2489907 1.8189847 1.4017621 . -0.84270734 -0.82742227
## [3,] -0.5738975 -0.2733372 0.7656216 . -0.20322937 -1.81067504
## [4,] -0.4315936 -0.4328671 0.8108528 . 0.29691354 0.04692914
## [5,] 0.6640558 0.7936778 0.3830932 . -0.71506229 -1.08808125
## ... . . . . . .
## [96,] 0.1141757 -1.0155497 0.9454654 . 0.27062483 0.86725081
## [97,] 1.9215418 -1.0620873 -0.3174608 . 0.68395274 -0.79706712
## [98,] -0.7920276 0.3284353 0.9055748 . -0.70124046 -0.47922283
## [99,] 1.4986801 0.0546685 -1.2755624 . -0.03685066 -0.67132229
## [100,] -0.4315163 -1.6084666 1.7064254 . 0.66564698 2.01212661
Alternatively, we can use coercion methods:
as(X, "TileDBArray")
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] -0.6403446 -0.8170174 -1.7581782 . 2.11630192 0.83743410
## [2,] 1.2489907 1.8189847 1.4017621 . -0.84270734 -0.82742227
## [3,] -0.5738975 -0.2733372 0.7656216 . -0.20322937 -1.81067504
## [4,] -0.4315936 -0.4328671 0.8108528 . 0.29691354 0.04692914
## [5,] 0.6640558 0.7936778 0.3830932 . -0.71506229 -1.08808125
## ... . . . . . .
## [96,] 0.1141757 -1.0155497 0.9454654 . 0.27062483 0.86725081
## [97,] 1.9215418 -1.0620873 -0.3174608 . 0.68395274 -0.79706712
## [98,] -0.7920276 0.3284353 0.9055748 . -0.70124046 -0.47922283
## [99,] 1.4986801 0.0546685 -1.2755624 . -0.03685066 -0.67132229
## [100,] -0.4315163 -1.6084666 1.7064254 . 0.66564698 2.01212661
This process also works for sparse matrices:
Y <- Matrix::rsparsematrix(1000, 1000, density=0.01)
writeTileDBArray(Y)
## <1000 x 1000> sparse matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,999] [,1000]
## [1,] 0.00 0.00 0.00 . 0.44 0.00
## [2,] 0.00 0.00 0.00 . 0.00 0.00
## [3,] 0.00 -1.20 0.00 . 0.00 0.00
## [4,] -0.18 0.00 0.00 . 0.00 0.00
## [5,] 0.00 0.00 0.00 . 0.00 0.00
## ... . . . . . .
## [996,] 0 0 0 . 0 0
## [997,] 0 0 0 . 0 0
## [998,] 0 0 0 . 0 0
## [999,] 0 0 0 . 0 0
## [1000,] 0 0 0 . 0 0
Logical and integer matrices are supported:
writeTileDBArray(Y > 0)
## <1000 x 1000> sparse matrix of class TileDBMatrix and type "logical":
## [,1] [,2] [,3] ... [,999] [,1000]
## [1,] FALSE FALSE FALSE . TRUE FALSE
## [2,] FALSE FALSE FALSE . FALSE FALSE
## [3,] FALSE FALSE FALSE . FALSE FALSE
## [4,] FALSE FALSE FALSE . FALSE FALSE
## [5,] FALSE FALSE FALSE . FALSE FALSE
## ... . . . . . .
## [996,] FALSE FALSE FALSE . FALSE FALSE
## [997,] FALSE FALSE FALSE . FALSE FALSE
## [998,] FALSE FALSE FALSE . FALSE FALSE
## [999,] FALSE FALSE FALSE . FALSE FALSE
## [1000,] FALSE FALSE FALSE . FALSE FALSE
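The logical case is shown above; the integer case can be sketched in the same way. The example below is illustrative only: the variable names `Z` and `int_arr` are hypothetical, and it assumes the `type()` generic that comes with DelayedArray reports the storage type of the resulting object.

```r
library(TileDBArray)

# A small integer matrix (hypothetical example data).
Z <- matrix(1:20, ncol=4)
storage.mode(Z)

# Write it to a temporary TileDB backend, as with the double-precision case.
int_arr <- writeTileDBArray(Z)

# Inspect the dimensions and the type reported for the TileDB-backed object.
dim(int_arr)
type(int_arr)
```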
As are matrices with dimension names:
rownames(X) <- sprintf("GENE_%i", seq_len(nrow(X)))
colnames(X) <- sprintf("SAMP_%i", seq_len(ncol(X)))
writeTileDBArray(X)
## <100 x 10> matrix of class TileDBMatrix and type "double":
## SAMP_1 SAMP_2 SAMP_3 ... SAMP_9 SAMP_10
## GENE_1 -0.6403446 -0.8170174 -1.7581782 . 2.11630192 0.83743410
## GENE_2 1.2489907 1.8189847 1.4017621 . -0.84270734 -0.82742227
## GENE_3 -0.5738975 -0.2733372 0.7656216 . -0.20322937 -1.81067504
## GENE_4 -0.4315936 -0.4328671 0.8108528 . 0.29691354 0.04692914
## GENE_5 0.6640558 0.7936778 0.3830932 . -0.71506229 -1.08808125
## ... . . . . . .
## GENE_96 0.1141757 -1.0155497 0.9454654 . 0.27062483 0.86725081
## GENE_97 1.9215418 -1.0620873 -0.3174608 . 0.68395274 -0.79706712
## GENE_98 -0.7920276 0.3284353 0.9055748 . -0.70124046 -0.47922283
## GENE_99 1.4986801 0.0546685 -1.2755624 . -0.03685066 -0.67132229
## GENE_100 -0.4315163 -1.6084666 1.7064254 . 0.66564698 2.01212661
TileDBArrays are simply DelayedArray objects and can be manipulated as such.
The usual conventions for extracting data from matrix-like objects work as expected:
out <- as(X, "TileDBArray")
dim(out)
## [1] 100 10
head(rownames(out))
## [1] "GENE_1" "GENE_2" "GENE_3" "GENE_4" "GENE_5" "GENE_6"
head(out[,1])
## [1] -0.6403446 1.2489907 -0.5738975 -0.4315936 0.6640558 0.5599638
We can also perform manipulations like subsetting and arithmetic.
Note that these operations do not affect the data in the TileDB backend;
rather, they are delayed until the values are explicitly required,
hence the creation of the DelayedMatrix object.
out[1:5,1:5]
## <5 x 5> matrix of class DelayedMatrix and type "double":
## SAMP_1 SAMP_2 SAMP_3 SAMP_4 SAMP_5
## GENE_1 -0.6403446 -0.8170174 -1.7581782 -2.6658388 -0.8005822
## GENE_2 1.2489907 1.8189847 1.4017621 2.2269270 -0.2517466
## GENE_3 -0.5738975 -0.2733372 0.7656216 -0.1499467 -0.2645592
## GENE_4 -0.4315936 -0.4328671 0.8108528 0.8199029 -1.2371527
## GENE_5 0.6640558 0.7936778 0.3830932 0.2841337 0.2342860
out * 2
## <100 x 10> matrix of class DelayedMatrix and type "double":
## SAMP_1 SAMP_2 SAMP_3 ... SAMP_9 SAMP_10
## GENE_1 -1.2806891 -1.6340348 -3.5163564 . 4.23260384 1.67486821
## GENE_2 2.4979815 3.6379695 2.8035241 . -1.68541468 -1.65484455
## GENE_3 -1.1477950 -0.5466744 1.5312431 . -0.40645873 -3.62135008
## GENE_4 -0.8631872 -0.8657342 1.6217055 . 0.59382707 0.09385828
## GENE_5 1.3281116 1.5873557 0.7661864 . -1.43012458 -2.17616249
## ... . . . . . .
## GENE_96 0.2283515 -2.0310993 1.8909308 . 0.54124966 1.73450162
## GENE_97 3.8430835 -2.1241745 -0.6349215 . 1.36790548 -1.59413424
## GENE_98 -1.5840553 0.6568707 1.8111496 . -1.40248092 -0.95844567
## GENE_99 2.9973603 0.1093370 -2.5511248 . -0.07370131 -1.34264459
## GENE_100 -0.8630325 -3.2169332 3.4128507 . 1.33129396 4.02425322
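To make the delayed behaviour concrete, the following sketch chains a subset and a multiplication and then forces evaluation with `as.matrix()`. It is an illustration rather than part of the package's documented workflow; the names `delayed` and `realized` are hypothetical, and the block rebuilds its own `X` so that it can be run in isolation.

```r
library(TileDBArray)

# Example matrix with dimension names, stored in a temporary TileDB backend.
X <- matrix(rnorm(1000), ncol=10)
rownames(X) <- sprintf("GENE_%i", seq_len(nrow(X)))
colnames(X) <- sprintf("SAMP_%i", seq_len(ncol(X)))
out <- as(X, "TileDBArray")

# Subsetting and arithmetic are recorded as delayed operations;
# the data in the backend is not modified.
delayed <- out[1:5, 1:5] * 2
is(delayed, "DelayedMatrix")

# as.matrix() realizes the delayed operations into an ordinary in-memory matrix,
# which can be compared with the same computation on the original matrix.
realized <- as.matrix(delayed)
all.equal(realized, X[1:5, 1:5] * 2)
```

Realizing the full result in memory defeats the purpose of a disk-backed array for large data, so this is mainly useful for small subsets or for checking results.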
We can also do more complex matrix operations that are supported by DelayedArray:
colSums(out)
## SAMP_1 SAMP_2 SAMP_3 SAMP_4 SAMP_5 SAMP_6
## 14.9135437 9.9354164 0.1462107 3.8351340 -19.8961001 -6.1191525
## SAMP_7 SAMP_8 SAMP_9 SAMP_10
## 0.5745272 -9.8738447 -11.6510171 19.5198772
out %*% runif(ncol(out))
## <100 x 1> matrix of class DelayedMatrix and type "double":
## y
## GENE_1 -3.4222142
## GENE_2 2.6793949
## GENE_3 -1.9388585
## GENE_4 0.2217462
## GENE_5 -0.5663988
## ... .
## GENE_96 2.494283
## GENE_97 -3.006141
## GENE_98 1.634663
## GENE_99 -0.233447
## GENE_100 2.410841
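One way to gain confidence in these operations is to compare them with the same computation on the in-memory matrix. The sketch below does this for `colSums()` and the matrix product; the vector `v` and the comparison strategy are assumptions made for illustration, and dimension names are dropped from the product before comparing because the two results may label their columns differently.

```r
library(TileDBArray)

# In-memory matrix and its TileDB-backed counterpart.
X <- matrix(rnorm(1000), ncol=10)
out <- as(X, "TileDBArray")

# Column sums computed from the backend should match the in-memory result.
all.equal(colSums(out), colSums(X))

# Matrix product with a fixed vector, compared value by value.
v <- runif(ncol(out))
prod_backend <- as.vector(as.matrix(out %*% v))
prod_memory <- as.vector(X %*% v)
all.equal(prod_backend, prod_memory)
```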
We can adjust some parameters for creating the backend with appropriate arguments to writeTileDBArray().
For example, the code below allows us to control the path to the backend
as well as the name of the attribute containing the data.
X <- matrix(rnorm(1000), ncol=10)
path <- tempfile()
writeTileDBArray(X, path=path, attr="WHEE")
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] 0.25266653 -0.24011781 0.67338228 . 1.30461019 0.92720913
## [2,] 0.23705701 0.60729013 -1.07048118 . 0.07928566 0.18447197
## [3,] -0.32941182 1.02196994 0.60049382 . -0.72663290 -0.60038867
## [4,] 0.11911876 -0.05474217 1.39931695 . 1.60447595 -0.70716435
## [5,] 0.70435530 0.28522512 -0.76751586 . 0.49495257 -0.58300782
## ... . . . . . .
## [96,] 2.6150158 -0.1456431 0.1271837 . -0.01841790 1.46738465
## [97,] 0.3957304 -0.1704016 1.4729717 . -0.13021188 -0.52377537
## [98,] -0.1789723 -1.4509313 0.6560800 . 0.11618727 0.40736054
## [99,] -1.2832269 1.4452461 -0.8475233 . -0.44619807 -0.99520119
## [100,] 0.5340158 -0.7582750 -0.7496639 . 0.09235738 0.87388196
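Controlling the path is most useful when the backend needs to be reopened later. The sketch below assumes that the `TileDBArray()` constructor accepts a path to an existing TileDB array together with an `attr` argument naming the attribute to read; the variable name `reopened` is hypothetical.

```r
library(TileDBArray)

# Write a matrix to a known location under a custom attribute name.
X <- matrix(rnorm(1000), ncol=10)
path <- tempfile()
writeTileDBArray(X, path=path, attr="WHEE")

# Reopen the same backend from its path (assumed constructor usage).
reopened <- TileDBArray(path, attr="WHEE")
dim(reopened)

# The values read back should match the original matrix.
all.equal(as.matrix(reopened), X)
```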
As these arguments cannot be passed during coercion, we instead provide global variables that can be set or unset to affect the outcome.
path2 <- tempfile()
setTileDBPath(path2)
as(X, "TileDBArray") # uses path2 to store the backend.
## <100 x 10> matrix of class TileDBMatrix and type "double":
## [,1] [,2] [,3] ... [,9] [,10]
## [1,] 0.25266653 -0.24011781 0.67338228 . 1.30461019 0.92720913
## [2,] 0.23705701 0.60729013 -1.07048118 . 0.07928566 0.18447197
## [3,] -0.32941182 1.02196994 0.60049382 . -0.72663290 -0.60038867
## [4,] 0.11911876 -0.05474217 1.39931695 . 1.60447595 -0.70716435
## [5,] 0.70435530 0.28522512 -0.76751586 . 0.49495257 -0.58300782
## ... . . . . . .
## [96,] 2.6150158 -0.1456431 0.1271837 . -0.01841790 1.46738465
## [97,] 0.3957304 -0.1704016 1.4729717 . -0.13021188 -0.52377537
## [98,] -0.1789723 -1.4509313 0.6560800 . 0.11618727 0.40736054
## [99,] -1.2832269 1.4452461 -0.8475233 . -0.44619807 -0.99520119
## [100,] 0.5340158 -0.7582750 -0.7496639 . 0.09235738 0.87388196
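Because a global setting persists for the rest of the session, it is usually worth unsetting it once the coercion is done. The sketch below assumes that `getTileDBPath()` returns the currently set path and that calling `setTileDBPath(NULL)` unsets it; both should be checked against the package's help pages for the installed version.

```r
library(TileDBArray)

X <- matrix(rnorm(1000), ncol=10)

# Set the global path and confirm that it is in effect.
path2 <- tempfile()
setTileDBPath(path2)
current <- getTileDBPath()
identical(current, path2)

# The coercion writes the backend to the global path.
out <- as(X, "TileDBArray")
file.exists(path2)

# Unset the global so that later coercions no longer reuse path2
# (assumed behaviour of passing NULL).
setTileDBPath(NULL)
```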
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] TileDBArray_1.0.0 DelayedArray_0.16.0 IRanges_2.24.0
## [4] S4Vectors_0.28.0 MatrixGenerics_1.2.0 matrixStats_0.57.0
## [7] BiocGenerics_0.36.0 Matrix_1.2-18 BiocStyle_2.18.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 RcppCCTZ_0.2.9 knitr_1.30
## [4] magrittr_1.5 bit_4.0.4 nanotime_0.3.2
## [7] lattice_0.20-41 rlang_0.4.8 stringr_1.4.0
## [10] tools_4.0.3 grid_4.0.3 xfun_0.18
## [13] tiledb_0.8.2 htmltools_0.5.0 bit64_4.0.5
## [16] yaml_2.2.1 digest_0.6.27 bookdown_0.21
## [19] BiocManager_1.30.10 evaluate_0.14 rmarkdown_2.5
## [22] stringi_1.5.3 compiler_4.0.3 zoo_1.8-8