1 Overview of the input API
2 Querying matrix information
3 Extracting data
4 Generalizing to other matrices
- 4.1 Other data types
- 4.2 Alternative matrix representations
5 Cloning for parallelization

1 Overview of the input API

This document describes the use of the beachmat API for accessing data in R matrices. We will demonstrate the API on numeric matrices, though same semantics are used for matrices of other types (e.g., logical, integer, character). First, we include the relevant header file:

#include "beachmat/numeric_matrix.h"

A double-precision matrix object dmat is handled in C++ by passing the SEXP struct from .Call to create_numeric_matrix:

// returns a std::unique_ptr<beachmat::numeric_matrix> object
auto dptr = beachmat::create_numeric_matrix(dmat);

This creates a unique pointer that points to an object of the numeric_matrix base class. The exact derived class that is actually instantiated depends on the type of matrix in dmat, though the behaviour of the user-level functions are not affected by this detail.

Additional notes:

The auto keyword just avoids the need to write the full type of the returned pointer, which is std::unique_ptr<beachmat::numeric_matrix>. We use unique pointers to control ownership and smoothly handle destruction and memory deallocation at the end of the function.
The API will happily throw exceptions of the std::exception class, containing an informative error message. These should be caught and handled gracefully by the end-user code, otherwise a segmentation fault will probably occur. See the error-handling mechanism in Rcpp for how to deal with these exceptions.

2 Querying matrix information

The get_nrow() method returns the number of rows (referred to below as nrow) in the matrix as a size_t.

dptr->get_nrow();

The get_ncol() method returns the number of columns (referred to below as ncol) in the matrix as a size_t.

dptr->get_ncol();

The get_matrix_type() method returns a matrix_type value specifying the specific matrix representation that is pointed to by dptr. This is an enumeration type that can be tested against constants like beachmat::SIMPLE or beachmat::SPARSE.

dptr->get_matrix_type();

The yield() method returns a Rcpp::RObject containing the original R matrix that was used to create dptr.

dptr->yield()

3 Extracting data

3.1 From columns

The get_col() method takes an iterator in to a Rcpp vector and fills it with values at column c. There should be at least nrow accessible elements, i.e., *in and *(in+nrow-1) should be valid entries.

dptr->get_col(/* size_t */ c, /* Rcpp::Vector::iterator */ in);

Extraction of a range of the column can be specified with the first and last arguments. This will fill in with values at column c from row first to last-1. There should be at least last-first accessible elements, i.e., *in and *(in+last-first-1) should be valid entries.

dptr->get_col(/* size_t */ c, /* Rcpp::Vector::iterator */ in, 
              /* size_t */ first, /* size_t */ last);

No value is returned by either of these methods. Note that c should be a zero-indexed integer in [0, ncol). Similarly, both first and last should be in [0, nrow] and zero-indexed, with the additional requirement that last >= first.

3.2 From rows

The get_row() method takes an iterator in to a Rcpp vector and fills it with values at row r. There should be at least ncol accessible elements, i.e., *in and *(in+ncol-1) should be valid entries.

dptr->get_row(/* size_t */ r, /* Rcpp::Vector::iterator */ in);

Extraction of a range of the row can be specified with the first and last arguments. This will fill in with values at row r from column first to last-1. There should be at least last-first accessible elements, i.e., *in and *(in+last-first-1) should be valid entries.

dptr->get_row(/* size_t */ r, 
              /* Rcpp::Vector::iterator */ in, 
              /* size_t */ first, 
              /* size_t */ last);

No value is returned by either of these methods. Again, r should be a zero-indexed integer in [0, nrow). Both first and last should be in [0, ncol] and zero-indexed, with the additional requirement that last >= first.

3.3 From individual cells

The get() method returns a double-precision value at the matrix entry for row r and column c. Both r and c should be zero-indexed integers in [0, nrow) and [0, ncol) respectively.

dptr->get(/* size_t */ r, 
          /* size_t */ c)

3.4 Type conversions

If the object in is a Rcpp::NumericVector::iterator instance, matrix entries will be extracted as double-precision values. If it is a Rcpp::IntegerVector::iterator instance, matrix entries will be extracted as integers with implicit conversion.

It is also possible to use a Rcpp::LogicalVector::iterator, though this will not behave as expected. Double-to-integer conversion is performed such that values in (-1, 1) are converted to integer 0. This would be interpreted as a logical FALSE, which is incorrect for non-zero double-precision values.

3.5 Fast constant access

3.5.1 From dense matrices

The get_const_col() method returns an iterator to a Rcpp vector pointing to first row of column c. The arguments are the same as get_col() with work being equivalent to in. (Note that first and last do not have to be specified and will default to the entire column.) The type of work and the output iterator must correspond to the data type of the matrix - in this case, both should be Rcpp::NumericVector::iterator objects.

dptr->get_const_col(/* size_t */ c, 
                    /* Rcpp::NumericVector::iterator */ work, 
                    /* size_t */ first, 
                    /* size_t */ last);

For simple/dense matrices, this method is more efficient than get_col() as it returns the iterator without needing to copy data into work. For other matrices, this function simply calls get_col() to copy the data into the vector starting at work, and then returns an iterator equal to work. Thus, for general use, work must point to a writeable block of memory, even if it does not get used when dptr points to a simple/dense matrix.

Additional notes:

Calling processes should consider the vector starting at work to be read-only after a call to get_const_col(). In particular, the underlying data will be modified if get_const_col() is called again with the same work. This has some consequences if a process calls get_const_col() multiple times to obtain more than one iterators for further computation. In such cases, developers should ensure that each get_const_col() call uses a work pointing to a different vector.

3.5.2 From sparse matrices

The get_const_col_indexed() method returns a const_col_indexed_info object, which is a tuple consisting of:

A size_t, specifying the number of entries in column c from rows [first, last).
A Rcpp::IntegerVector::iterator, pointing to a vector containing the row index for each entry in column c from rows [first, last).
A Rcpp::NumericVector::iterator, pointing to a vector containing the value of each entry in column c from rows [first, last).

The data type of work and the output iterator in (3) must correspond to that of the matrix - in this case, both should be Rcpp::NumericVector::iterator objects. Again, first and last do not have to be specified and will default to the entire column.

dptr->get_const_col_indexed(/* size_t */ c, 
                            /* Rcpp::NumericVector::iterator */ work, 
                            /* size_t */ first, 
                            /* size_t */ last);

This method is quite efficient for dgCMatrix objects, as it will directly return iterators to the indices and values of the non-zero entries only. No copying is performed, and the zero values do not have to be explicitly generated as they would be in get_col. For all other matrices, get_const_col is called11 The comments above regarding the read-only nature of work also apply here. and iterator (2) is pointed at an internal array of consecutive integers (which should be treated as read-only).

4 Generalizing to other matrices

4.1 Other data types

To create logical, integer and character matrices, include the following header files:

#include "beachmat/logical_matrix.h"
#include "beachmat/integer_matrix.h"
#include "beachmat/character_matrix.h"

The dispatch function changes correspondingly for logical matrix lmat, integer matrix imat or character matrix cmat. Each function creates a unique pointer to a *_matrix of the appropriate type.

// creates a std::unique_ptr<beachmat::logical_matrix>
auto lptr=beachmat::create_logical_matrix(lmat);

// creates a std::unique_ptr<beachmat::integer_matrix>
auto iptr=beachmat::create_integer_matrix(imat);

// creates a std::unique_ptr<beachmat::character_matrix>
auto cptr=beachmat::create_character_matrix(cmat);

Equivalent methods are available for each matrix type with appropriate changes in type.

For integer and logical matrices, get will return an integer. in can be any type previously described for numeric_matrix objects22 Note the caveats previously discussed for Rcpp::LogicalVector iterators.. work should be an iterator to a Rcpp::IntegerVector for integer matrices or a Rcpp::LogicalVector for logical matrices. Similar changes apply to the iterator in the tuple from get_const_col_indexed().

For character matrices, all iterators should be of type Rcpp::StringVector::iterator, and get will return a Rcpp::String.

Additional notes:

If in is a Rcpp::LogicalVector::iterator for integer_matrix instances, integer values are not coerced to {0, 1} when they are assigned to *in. Thus, even though the interpretation is correct, the vector produced will not be equivalent to the result of an as.logical call. As a general rule, it is unwise to use Rcpp::LogicalVector::iterators for anything other than logical_matrix access.
When accessing character_matrix data, we do not return raw const char* pointers to the C-style string. Rather, the Rcpp::String class is used as it provides a convenient wrapper around the underlying CHARSXP. This ensures that the string is stored in R’s global cache and is suitably protected against garbage collection.

4.2 Alternative matrix representations

The following matrix classes are supported:

numeric: matrix, dgeMatrix, dgCMatrix, dspMatrix, RleMatrix, HDF5Matrix, DelayedMatrix
integer: matrix, RleMatrix, HDF5Matrix, DelayedMatrix
logical: matrix, lgeMatrix, lgCMatrix, lspMatrix, RleMatrix, HDF5Matrix, DelayedMatrix
character: matrix, RleMatrix, HDF5Matrix, DelayedMatrix

Additional classes may be added on request. As a general rule, if a matrix-like object can be stored in a SummarizedExperiment class (from the SummarizedExperiment package), the API should be able to handle it. Please contact the maintainers if you have a class that you would like to see supported.

Additional notes:

DelayedMatrix objects can be directly parsed by the API if the only delayed operations involve transposition or subsetting. Otherwise, they are realized in a chunk-by-chunk manner, using methods in the DelayedArray package. The granularity of the chunking is determined using the defaultGrid function.
For numeric matrices, beachmat does not support higher-level matrix operations such as addition, multiplication or various factorizations. Rather, the yield method can be used to obtain the original Rcpp::RObject for input to RcppArmadillo or RcppEigen. This functionality is generally limited to base matrices, though there is also limited support for sparse matrices in these libraries.

5 Cloning for parallelization

The clone() method returns a unique pointer to a numeric_matrix instance of the same type as that pointed to by dptr.

dptr->clone();

This is useful as direct use of the beachmat API is not thread-safe for simultaneous calls to the get methods from different threads. Some methods use cached class members for greater efficiency, and simultaneous calls will cause race conditions. It is the responsibility of the calling function to coordinate data access across threads. To this end, the clone method can be called to generate a unique pointer to a new *_matrix instance, which can be used concurrently in another thread. This is fairly cheap as the underlying matrix data are not copied.

An example of parallelized beachmat code using OpenMP might look like this:

#pragma omp parallel
{
    beachmat::numeric_matrix* rptr=NULL;
    std::unique_ptr<beachmat::numeric_matrix> uptr=nullptr;
    if (omp_get_thread_num()==0) {
        rptr=dptr.get();
    } else {
        uptr=dptr->clone();
        rptr=uptr.get();
    }

    size_t col;
    const size_t NC=rptr->get_ncol();
    Rcpp::NumericVector output(rptr->get_nrow());

    #pragma omp for schedule(static)
    for (col=0; col<NC; ++col) {
        // Do parallel operation here.
        rptr->get_col(col, output.begin());
    }
}

The start of the parallel region uses the existing dptr in the master thread and clones a new matrix in the other threads. The parallelized for loop then uses rptr to avoid race conditions in cached variables. Note that a static schedule may be faster than other schedule types, as several of the matrix implementations in beachmat are optimized for consecutive row/column access.

Using beachmat to read data from R matrices in C++

23 May 2018

Package

Contents

1 Overview of the input API

2 Querying matrix information

3 Extracting data

3.1 From columns

3.2 From rows

3.3 From individual cells

3.4 Type conversions

3.5 Fast constant access

3.5.1 From dense matrices

3.5.2 From sparse matrices

4 Generalizing to other matrices

4.1 Other data types

4.2 Alternative matrix representations

5 Cloning for parallelization