1 Overview

beachmat has a few useful utilities outside of the C++ API. This document describes how to use them.

2 Choosing HDF5 chunk dimensions

Given the dimensions of a matrix, users can choose HDF5 chunk dimensions that give fast performance for both row- and column-level access.

library(beachmat)
nrows <- 10000
ncols <- 200
getBestChunkDims(c(nrows, ncols))
## [1] 708  15

In the future, it should be possible to feed these chunk dimensions back into the C++ API. Currently, if chunk dimensions are not specified in the C++ code, the API retrieves them from R via the getHDF5DumpChunkDim() function from HDF5Array. The aim is to also provide a setHDF5DumpChunkDim() function so that any chunk dimensions specified in R will be respected.
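To see why a balanced chunk shape helps both access patterns, the following sketch compares the 708 x 15 chunks reported above with purely row- or column-based chunks. It uses only base R and a deliberately simplified cost model that is an assumption of this sketch, not a description of how beachmat or HDF5 behaves internally: reading any element loads its entire chunk, and nothing is cached. The helper elements_read() is hypothetical and not part of beachmat.

```r
# Simplified cost model (an assumption, not beachmat's internal logic):
# reading any element requires loading its entire chunk, with no caching.
elements_read <- function(dims, chunk) {
    nchunks <- ceiling(dims / chunk)   # number of chunks along each dimension
    per_chunk <- prod(chunk)           # elements loaded per chunk
    c(row = nchunks[2] * per_chunk,    # one row crosses every chunk column
      column = nchunks[1] * per_chunk) # one column crosses every chunk row
}

dims <- c(10000, 200)
layouts <- list(
    balanced = c(708, 15),     # value reported by getBestChunkDims() above
    by_column = c(10000, 1),   # each chunk is one full column
    by_row = c(1, 200)         # each chunk is one full row
)

# One row per layout, with the elements loaded to read a single row or column.
costs <- t(sapply(layouts, elements_read, dims = dims))
costs
```

Under this model, the balanced layout loads roughly 150000 elements for either a row or a column, whereas each pure layout is very cheap along one margin and loads the entire matrix (2000000 elements) along the other. Real timings will also depend on the HDF5 chunk cache, compression and the storage medium.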

3 Rechunking an HDF5 file

The most common access patterns for matrices (at least, for high-throughput biological data) are by row or by column. The rechunkByMargins() function takes an HDF5 file and converts it to use purely row- or column-based chunks.

library(HDF5Array)
A <- as(matrix(runif(5000), nrow=100, ncol=50), "HDF5Array")
byrow <- rechunkByMargins(A, byrow=TRUE)
byrow
## <100 x 50> HDF5Matrix object of type "double":
##               [,1]        [,2]        [,3] ...      [,49]      [,50]
##   [1,]  0.56255186  0.83748576  0.59590163   .  0.8206885  0.8822453
##   [2,]  0.15557033  0.56658438  0.62395089   .  0.4702967  0.1473072
##   [3,]  0.05805285  0.99513841  0.99089764   .  0.9474082  0.9386334
##   [4,]  0.12489074  0.99634430  0.79680600   .  0.3950245  0.8057325
##   [5,]  0.46060778  0.32031363  0.50575895   .  0.2542922  0.7555907
##    ...           .           .           .   .          .          .
##  [96,] 0.153485853 0.529738530 0.008270724   . 0.77627432 0.05214570
##  [97,] 0.097863280 0.403424662 0.862630078   . 0.19495715 0.53035455
##  [98,] 0.970400211 0.572643731 0.463074813   . 0.19958239 0.27946596
##  [99,] 0.329732096 0.666838235 0.423838963   . 0.92683161 0.05196235
## [100,] 0.871021765 0.455680494 0.913476148   . 0.43593475 0.15573580
bycol <- rechunkByMargins(A, byrow=FALSE)
bycol
## <100 x 50> HDF5Matrix object of type "double":
##               [,1]        [,2]        [,3] ...      [,49]      [,50]
##   [1,]  0.56255186  0.83748576  0.59590163   .  0.8206885  0.8822453
##   [2,]  0.15557033  0.56658438  0.62395089   .  0.4702967  0.1473072
##   [3,]  0.05805285  0.99513841  0.99089764   .  0.9474082  0.9386334
##   [4,]  0.12489074  0.99634430  0.79680600   .  0.3950245  0.8057325
##   [5,]  0.46060778  0.32031363  0.50575895   .  0.2542922  0.7555907
##    ...           .           .           .   .          .          .
##  [96,] 0.153485853 0.529738530 0.008270724   . 0.77627432 0.05214570
##  [97,] 0.097863280 0.403424662 0.862630078   . 0.19495715 0.53035455
##  [98,] 0.970400211 0.572643731 0.463074813   . 0.19958239 0.27946596
##  [99,] 0.329732096 0.666838235 0.423838963   . 0.92683161 0.05196235
## [100,] 0.871021765 0.455680494 0.913476148   . 0.43593475 0.15573580

Rechunking can provide a substantial speed-up to downstream functions, especially those requiring access to random columns or rows. Indeed, the time saved in those functions often offsets the time spent in constructing a new HDF5Matrix.
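The trade-off between the one-off cost of rechunking and the per-access savings can be illustrated with a rough break-even calculation. This is a toy model rather than a benchmark. It assumes a hypothetical starting layout of one chunk per column, counts cost as the number of elements transferred, and assumes that every access loads whole chunks with no caching. All variable names are illustrative.

```r
# Toy model, not a benchmark: cost is counted in elements transferred,
# and every access loads whole chunks with no caching.
dims <- c(100, 50)   # matrix size used in the example above
total <- prod(dims)

# Hypothetical starting layout: one chunk per column (100 x 1).
# Reading a single row then touches all 50 chunks, i.e. the whole matrix.
row_cost_before <- dims[2] * dims[1]

# After rechunking by row (1 x 50 chunks), a row is exactly one chunk.
row_cost_after <- dims[2]

# Rechunking reads every element once and writes it once.
rechunk_cost <- 2 * total

# Smallest number of row accesses for which rechunking pays off.
break_even <- ceiling(rechunk_cost / (row_cost_before - row_cost_after))
break_even
```

Under these assumptions, rechunking pays for itself from the third random row access onwards. The break-even point for real data will differ, but the calculation shows why downstream functions that make many random accesses tend to benefit.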

4 Session information

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] HDF5Array_1.8.0     rhdf5_2.24.0        DelayedArray_0.6.0 
##  [4] BiocParallel_1.14.1 IRanges_2.14.10     S4Vectors_0.18.2   
##  [7] BiocGenerics_0.26.0 matrixStats_0.53.1  beachmat_1.2.1     
## [10] knitr_1.20          BiocStyle_2.8.1    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17    magrittr_1.5    stringr_1.3.1   tools_3.5.0    
##  [5] xfun_0.1        htmltools_0.3.6 yaml_2.1.19     rprojroot_1.3-2
##  [9] digest_0.6.15   bookdown_0.7    Rhdf5lib_1.2.1  evaluate_0.10.1
## [13] rmarkdown_1.9   stringi_1.2.2   compiler_3.5.0  backports_1.1.2