A common C++ API for all types of R matrices

Package: beachmat
Author: Aaron Lun (alun@wehi.edu.au)
Compilation date: 2017-11-25

Introduction

The beachmat package provides a C++ API for handling a variety of R matrix types. The aim is to abstract away the specific type of matrix object when writing C++ extensions, thus simplifying the processing of data stored in those objects. Currently, the API supports double-precision, integer, logical and character matrices. Supported classes include base matrix objects, a number of classes from the Matrix package, and disk-backed matrices from the HDF5Array package.

Linking to the package

Prerequisites

The beachmat package currently has several dependencies:

Most of the following instructions are adapted from the Rhtslib vignette.

Link to the library

To link successfully to the beachmat library, a package must include both a src/Makevars.win and src/Makevars file.

Note: the contents of src/Makevars.win and src/Makevars are almost identical, but not quite. The former wraps the command in $(shell ...), whereas the latter uses backticks.

Create a src/Makevars.win file with the following lines:

BEACHMAT_LIBS=$(shell echo 'beachmat::pkgconfig("PKG_LIBS")'|\
    "${R_HOME}/bin/R" --vanilla --slave)
PKG_LIBS=$(BEACHMAT_LIBS)

… and a src/Makevars file with the following lines:

BEACHMAT_LIBS=`echo 'beachmat::pkgconfig("PKG_LIBS")'|\
    "${R_HOME}/bin/R" --vanilla --slave`
PKG_LIBS=$(BEACHMAT_LIBS)

The statement for each platform modifies the $PKG_LIBS variable. If your package needs to add to the $PKG_LIBS variable, do so by adding to the PKG_LIBS=$(BEACHMAT_LIBS) line. For example:

PKG_LIBS=$(BEACHMAT_LIBS) -L/path/to/foolib -lfoo

The Linux implementation embeds the location of the beachmat library in the package-specific shared object via the compiler flag -Wl,-rpath,path, where path is determined by system.file("lib", package="beachmat"). The path determined by system.file() is from .libPaths() and will resolve all symbolic links. This can cause problems, e.g., when the “head” node of a cluster mimics the cluster node via a symbolic link to the directory in which beachmat is installed. Use the environment variable BEACHMAT_RPATH to resolve this by setting it to the cluster-node accessible path. Similar arguments apply to Rhdf5lib with the environment variable RHDF5LIB_RPATH.

Find headers

In order for the C/C++ compiler to find the beachmat package headers during installation, add the following to the LinkingTo field of the DESCRIPTION file:

LinkingTo: Rcpp, Rhdf5lib, beachmat

In C or C++ code files, use standard techniques, e.g., #include "beachmat/numeric_matrix.h" (see below for more details). Header files are available for perusal at the following location (enter in an R session):

system.file(package="beachmat", "include")
## [1] "/tmp/Rtmppw2J5Q/Rinst71da27de2146/beachmat/include"

Finishing up

You need to tell the build system to use C++11, by modifying the SystemRequirements field of the DESCRIPTION file:

SystemRequirements: C++11

You also need to ensure that Rcpp is initialized when your package is loaded. This requires addition of Rcpp to the Imports field of the DESCRIPTION file:

Imports: Rcpp

… and a corresponding importFrom specification in the NAMESPACE file:

importFrom(Rcpp, sourceCpp)

(The exact function to be imported doesn't matter, as long as the namespace is loaded. Check out the Rcpp documentation for more details.)

HDF5Array, DelayedArray and beachmat itself should be added to the Suggests field, as the API will perform some calls to R functions in those packages to query certain parameters. If you intend to accept instances of Matrix classes, the Matrix package should also be listed in the Suggests field, if not already in Imports or Depends:

Suggests: beachmat, HDF5Array, DelayedArray, Matrix

Overview of the API (input)

Creating the matrix pointer

We demonstrate the use of the API for numeric matrices. First, we include the relevant header file:

#include "beachmat/numeric_matrix.h"

A double-precision matrix object dmat is handled in C++ by passing the SEXP struct from .Call to create_numeric_matrix:

std::unique_ptr<beachmat::numeric_matrix> dptr = beachmat::create_numeric_matrix(dmat);

This creates a unique pointer that points to an object of the numeric_matrix virtual class. The exact class depends on the type of matrix in dmat, though the behaviour of the user-level functions is not affected by this detail.
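
The general pattern of hiding the concrete representation behind a pointer to a virtual base class can be sketched in standalone C++. The classes below (toy_matrix, toy_dense, toy_diagonal) and the sum_all function are hypothetical stand-ins invented for this illustration. They are not beachmat classes and do not reflect how beachmat is implemented internally.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT beachmat classes.
class toy_matrix {
public:
    virtual ~toy_matrix() = default;
    virtual std::size_t get_nrow() const = 0;
    virtual std::size_t get_ncol() const = 0;
    virtual double get(std::size_t r, std::size_t c) const = 0;
};

// Dense column-major storage, as in a base R matrix.
class toy_dense : public toy_matrix {
public:
    toy_dense(std::size_t nr, std::size_t nc, std::vector<double> v)
        : nrow(nr), ncol(nc), values(std::move(v)) {
        assert(values.size() == nrow * ncol);
    }
    std::size_t get_nrow() const override { return nrow; }
    std::size_t get_ncol() const override { return ncol; }
    double get(std::size_t r, std::size_t c) const override {
        return values[c * nrow + r];
    }
private:
    std::size_t nrow, ncol;
    std::vector<double> values;
};

// Diagonal storage: only the diagonal entries are held in memory.
class toy_diagonal : public toy_matrix {
public:
    explicit toy_diagonal(std::vector<double> d) : diag(std::move(d)) {}
    std::size_t get_nrow() const override { return diag.size(); }
    std::size_t get_ncol() const override { return diag.size(); }
    double get(std::size_t r, std::size_t c) const override {
        return r == c ? diag[r] : 0.0;
    }
private:
    std::vector<double> diag;
};

// Calling code only sees the base class and works for either representation.
double sum_all(const toy_matrix& mat) {
    double total = 0;
    for (std::size_t c = 0; c < mat.get_ncol(); ++c) {
        for (std::size_t r = 0; r < mat.get_nrow(); ++r) {
            total += mat.get(r, c);
        }
    }
    return total;
}
```

A caller would hold either object as a std::unique_ptr<toy_matrix> and pass it to sum_all without knowing which derived class sits behind the pointer, which is the same role that the pointer returned by create_numeric_matrix plays in the text above.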

Methods for input matrices

The available methods for this object are:

In all cases, r and c should be non-negative integers (specifically size_t) in [0, nrow) and [0, ncol) respectively. Zero-based indexing is assumed for both r and c, as is standard for most C/C++ applications. Similar rules apply to first and last, which should be in [0, nrow] for get_col and in [0, ncol] for get_row. Furthermore, last >= first should be true.

If the object X is an Rcpp::NumericVector::iterator instance, matrix entries will be extracted as double-precision values. If it is an Rcpp::IntegerVector::iterator instance, matrix entries will be extracted as integers with implicit conversion. It is also possible to use an Rcpp::LogicalVector::iterator, though this will not behave as expected (see notes below).
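
The index rules and the effect of the iterator type can be illustrated with a small standalone sketch. The toy_numeric_matrix class below is a hypothetical stand-in written for this example and is not the beachmat implementation. It assumes that first and last describe a half-open range [first, last), which is consistent with last being allowed to equal nrow, and it uses std::vector iterators in place of Rcpp iterators so that it compiles without R.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for illustration; this is NOT a beachmat class.
class toy_numeric_matrix {
public:
    toy_numeric_matrix(std::size_t nr, std::size_t nc, std::vector<double> v)
        : nrow(nr), ncol(nc), values(std::move(v)) {
        assert(values.size() == nrow * ncol);
    }

    // Copy rows [first, last) of column c into the output iterator.
    // Each double is implicitly converted to the iterator's value type
    // on assignment.
    template <class Iter>
    void get_col(std::size_t c, Iter out, std::size_t first, std::size_t last) const {
        if (c >= ncol) {
            throw std::out_of_range("column index out of range");
        }
        if (last < first || last > nrow) {
            throw std::out_of_range("invalid row subset");
        }
        const double* start = values.data() + c * nrow;
        for (std::size_t r = first; r < last; ++r, ++out) {
            *out = start[r];
        }
    }

private:
    std::size_t nrow, ncol;
    std::vector<double> values; // column-major, as in R
};
```

Filling a std::vector<double> through this method keeps the stored values unchanged, while filling a std::vector<int> converts each value with the usual C++ rule of truncation toward zero, so 2.7 becomes 2 and -3.9 becomes -3. Whether beachmat applies exactly this rule for integer iterators is an assumption based on the phrase "implicit conversion" above.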

Special methods for specific matrix types

There are additional methods that provide some advantages for specific matrix representations:

Obviously, the get_nonzero_* functions are not available for character matrices.

Other matrix types

Logical, integer and character matrices can be handled by including the following header files:

#include "beachmat/logical_matrix.h"
#include "beachmat/integer_matrix.h"
#include "beachmat/character_matrix.h"

The dispatch function changes correspondingly, for logical matrix lmat, integer matrix imat and character matrix cmat:

std::unique_ptr<beachmat::logical_matrix> lptr=beachmat::create_logical_matrix(lmat);
std::unique_ptr<beachmat::integer_matrix> iptr=beachmat::create_integer_matrix(imat);
std::unique_ptr<beachmat::character_matrix> cptr=beachmat::create_character_matrix(cmat);

Equivalent methods are available for each matrix type with appropriate changes in type. For integer and logical matrices, get will return an integer, while X can be an iterator object of an Rcpp::IntegerVector, Rcpp::LogicalVector or Rcpp::NumericVector instance (type conversions are implicitly performed as necessary). For character matrices, X should be of type Rcpp::StringVector::iterator, and get will return an Rcpp::String.

The following matrix classes are supported:

Additional classes can be added on a need-to-use basis. As a general rule, if a matrix-like object can be stored in a SummarizedExperiment class (from the SummarizedExperiment package), the API should be able to handle it. Please contact the maintainers if you have a class that you would like to see supported.

Important developer information

Overview of the API (output)

Specifying the output type

Three types of output matrices are supported: simple matrix, *gCMatrix and HDF5Matrix objects. For example, a simple numeric output matrix with nrow rows and ncol columns is created by:

std::unique_ptr<beachmat::numeric_output> odmat=beachmat::create_numeric_output(nrow, ncol, beachmat::SIMPLE_PARAM);

A sparse matrix is similarly created by setting the last argument to beachmat::SPARSE_PARAM, while a HDF5Matrix is constructed by setting beachmat::HDF5_PARAM. These constants are instances of the output_param class that specify the type and parameters of the output matrix to be constructed.

Another option is to allow the function to dynamically choose the output type to match that of an existing matrix. This is useful for automatically choosing an output format that reflects the choice of input format. For example, if data are supplied to a function in a simple matrix, it would be reasonable to expect that the output is similarly small enough to be stored as a simple matrix. On the other hand, if the input is a HDF5Matrix, it may make more sense to return a HDF5Matrix object.

Dynamic choice of output type is performed by using the Rcpp::RObject object containing the input matrix to initialize the output_param object. Given a matrix object dmat, the output type can be matched to the input type with:

beachmat::output_param oparam(dmat, /* simplify = */ true, /* preserve_zero = */ false);
std::unique_ptr<beachmat::numeric_output> odmat=beachmat::create_numeric_output(nrow, ncol, oparam);

A similar process can be used for a pointer dptr to an existing *_matrix instance:

beachmat::output_param oparam(dptr->get_matrix_type(), /* simplify = */ true, /* preserve_zero = */ false);

The simplify argument indicates whether non-matrix input objects should be “simplified” to a matrix output object. If false, an HDF5Matrix output object will be returned instead. The preserve_zero argument indicates whether a *gCMatrix input should result in a *gCMatrix output when simplify=false (for logical or double-precision data only). Exact zeroes are detected and ignored when filling the *gCMatrix output.

Methods for output matrices

To put data into the output matrix pointed to by dptr, the following methods are available:

The allowable ranges of r, c, first and last are the same as previously described. The get_nrow, get_ncol, get_row, get_col, get, get_matrix_type and clone methods are also available and behave as described for numeric_matrix objects.

Other matrix types

Logical, integer and character output matrices are supported by changing the types in the creator function (and its variants):

std::unique_ptr<beachmat::integer_output> oimat=beachmat::create_integer_output(nrow, ncol);
std::unique_ptr<beachmat::logical_output> olmat=beachmat::create_logical_output(nrow, ncol);
std::unique_ptr<beachmat::character_output> ocmat=beachmat::create_character_output(nrow, ncol);

Equivalent methods are available for these matrix types. For integer and logical matrices, X should be of type Rcpp::IntegerVector::iterator and Rcpp::LogicalVector::iterator, respectively, and Y should be an integer. For character matrices, X should be of type Rcpp::StringVector::iterator and Y should be an Rcpp::String object.

Important developer information

Session information

sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocStyle_2.6.0 knitr_1.17     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.2  backports_1.1.1 magrittr_1.5    rprojroot_1.2  
##  [5] htmltools_0.3.6 tools_3.4.2     yaml_2.1.14     Rcpp_0.12.14   
##  [9] rmarkdown_1.8   stringi_1.1.6   digest_0.6.12   stringr_1.2.0  
## [13] evaluate_0.10.1