1 Overview

beachmat has a few useful utilities outside of the C++ API. This document describes how to use them.

2 Choosing HDF5 chunk dimensions

Given the dimensions of a matrix, users can choose HDF5 chunk dimensions that give fast performance for both row- and column-level access.

library(beachmat)
nrows <- 10000
ncols <- 200
getBestChunkDims(c(nrows, ncols))
## [1] 708  15
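
One way to see why these dimensions are a reasonable compromise is to count how many chunks must be read to extract one full row or one full column. The sketch below uses only base R and assumes that HDF5 reads whole chunks; `chunks_touched()` is a hypothetical helper written for illustration, not part of beachmat, and it does not reproduce how `getBestChunkDims()` arrives at its answer.

```r
# Hypothetical helper: number of chunks overlapping a single row or column.
chunks_touched <- function(dims, chunk_dims) {
    c(per_row = ceiling(dims[2] / chunk_dims[2]),      # chunks spanning the columns
      per_column = ceiling(dims[1] / chunk_dims[1]))   # chunks spanning the rows
}

dims <- c(10000, 200)
balanced <- chunks_touched(dims, c(708, 15))     # dimensions reported above
row_only <- chunks_touched(dims, c(1, 200))      # one chunk per row
col_only <- chunks_touched(dims, c(10000, 1))    # one chunk per column
rbind(balanced, row_only, col_only)
```

With the 708 x 15 chunks, a row touches 14 chunks and a column touches 15, so neither access pattern is heavily penalised. The pure row layout needs 1 chunk per row but 10000 per column, and the pure column layout is the reverse.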

In the future, it should be possible to feed this back into the API. Currently, if chunk dimensions are not specified in the C++ code, the API will retrieve them from R via the getHDF5DumpChunkDim() function from HDF5Array. The aim is to also provide a setHDF5DumpChunkDim() function so that any chunk dimension specified in R will be respected.

3 Rechunking an HDF5 file

The most common access patterns for matrices (at least, for high-throughput biological data) are by row or by column. The rechunkByMargins() function takes an HDF5-backed matrix and creates a new one that uses purely row- or column-based chunks.

library(HDF5Array)
A <- as(matrix(runif(5000), nrow=100, ncol=50), "HDF5Array")
byrow <- rechunkByMargins(A, byrow=TRUE)
byrow
## <100 x 50> HDF5Matrix object of type "double":
##               [,1]        [,2]        [,3] ...      [,49]      [,50]
##   [1,] 0.671935338 0.127743164 0.007323584   .  0.9681527  0.2113349
##   [2,] 0.344944342 0.352190747 0.922947284   .  0.6760347  0.5485550
##   [3,] 0.835692084 0.866131707 0.478473800   .  0.8881267  0.8801893
##   [4,] 0.273643794 0.494952899 0.715246589   .  0.8669271  0.2811329
##   [5,] 0.730758679 0.310119807 0.138480946   .  0.7158266  0.7891160
##    ...           .           .           .   .          .          .
##  [96,]  0.53746423  0.71994111  0.73454019   . 0.02912742 0.23766238
##  [97,]  0.27497985  0.36566871  0.94744167   . 0.28335706 0.32554060
##  [98,]  0.69559310  0.46266183  0.17462890   . 0.08723101 0.70944649
##  [99,]  0.92896545  0.02100046  0.64761798   . 0.08542746 0.86557679
## [100,]  0.90952207  0.95978297  0.22024540   . 0.75801834 0.55175647
bycol <- rechunkByMargins(A, byrow=FALSE)
bycol
## <100 x 50> HDF5Matrix object of type "double":
##               [,1]        [,2]        [,3] ...      [,49]      [,50]
##   [1,] 0.671935338 0.127743164 0.007323584   .  0.9681527  0.2113349
##   [2,] 0.344944342 0.352190747 0.922947284   .  0.6760347  0.5485550
##   [3,] 0.835692084 0.866131707 0.478473800   .  0.8881267  0.8801893
##   [4,] 0.273643794 0.494952899 0.715246589   .  0.8669271  0.2811329
##   [5,] 0.730758679 0.310119807 0.138480946   .  0.7158266  0.7891160
##    ...           .           .           .   .          .          .
##  [96,]  0.53746423  0.71994111  0.73454019   . 0.02912742 0.23766238
##  [97,]  0.27497985  0.36566871  0.94744167   . 0.28335706 0.32554060
##  [98,]  0.69559310  0.46266183  0.17462890   . 0.08723101 0.70944649
##  [99,]  0.92896545  0.02100046  0.64761798   . 0.08542746 0.86557679
## [100,]  0.90952207  0.95978297  0.22024540   . 0.75801834 0.55175647

Rechunking can provide a substantial speed-up to downstream functions, especially those requiring access to random columns or rows. Indeed, the time saved in those functions often offsets the time spent in constructing a new HDF5Matrix.
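
As a rough illustration of where the speed-up comes from, the sketch below counts how many values must be read from disk to extract a single row of the 100 x 50 matrix above under each layout. It assumes that HDF5 always reads whole chunks, that the row layout stores one chunk per row and the column layout one chunk per column, and it ignores caching and compression. `values_read_per_row()` is a hypothetical helper, not a beachmat function.

```r
# Hypothetical helper: values read from disk to extract one row.
values_read_per_row <- function(dims, chunk_dims) {
    # Chunks overlapping a single row: one for each chunk along the columns.
    n_chunks <- ceiling(dims[2] / chunk_dims[2])
    # Every overlapping chunk is read in full.
    n_chunks * prod(chunk_dims)
}

dims <- c(100, 50)
row_layout <- values_read_per_row(dims, c(1, 50))     # assumed one chunk per row
col_layout <- values_read_per_row(dims, c(100, 1))    # assumed one chunk per column
c(row_layout = row_layout, col_layout = col_layout)
```

Under these assumptions, a row costs 50 values in the row layout but 5000 values (the entire matrix) in the column layout, which is why matching the chunk layout to the access pattern matters for random row or column access.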

4 Session information

sessionInfo()
## R version 3.5.1 Patched (2018-07-12 r74967)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] HDF5Array_1.10.0    rhdf5_2.26.0        DelayedArray_0.8.0 
##  [4] BiocParallel_1.16.0 IRanges_2.16.0      S4Vectors_0.20.0   
##  [7] BiocGenerics_0.28.0 matrixStats_0.54.0  beachmat_1.4.0     
## [10] knitr_1.20          BiocStyle_2.10.0   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.19       magrittr_1.5       stringr_1.3.1     
##  [4] tools_3.5.1        xfun_0.4           htmltools_0.3.6   
##  [7] yaml_2.2.0         rprojroot_1.3-2    digest_0.6.18     
## [10] bookdown_0.7       Rhdf5lib_1.4.0     BiocManager_1.30.3
## [13] evaluate_0.12      rmarkdown_1.10     stringi_1.2.4     
## [16] compiler_3.5.1     backports_1.1.2