1 Introduction

The DelayedArray framework currently supports the HDF5 on-disk backend (via the HDF5Array package) but can be extended to support other on-disk backends, that is, other file formats. In theory, it should be possible to implement a DelayedArray backend for any file format that can store array data with fast random access. Let’s assume that the ADS format (Array Data Store) is such a format (this is a made-up format for the purpose of this vignette only). Implementing a DelayedArray backend for ADS files should typically be done in a dedicated package (say ADSArray) that depends on the DelayedArray package.

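The ADSArray package will need to implement the ADSArraySeed seed class (section 2) and, for convenience, the high-level ADSArray and ADSMatrix classes (section 3).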

The rest of this document covers the above topics in greater detail. Some familiarity with writing R packages is assumed. Don’t hesitate to look at the source of the HDF5Array package for a real example of a DelayedArray on-disk backend implementation.

2 Implementing the seed class

2.1 Class definition

A “seed object” should store at least the path or URL to the file. If the file format allows storing more than one array per file, then the seed object should also store any additional information needed to locate a particular array in the file.

The definition of the seed class will look something like this:

setClass("ADSArraySeed",
    contains="Array",
    slots=c(
        filepath="character"
        ## ... plus any additional slots needed
        ## ... to locate the array in the file
    )
)

The filepath slot should be a single string that contains the absolute path to the ADS file so the object doesn’t break when the user changes the working directory (e.g. with setwd()).
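To illustrate why the absolute path matters, here is a minimal sketch that uses an empty temporary file as a stand-in for an ADS file (the file name toy_data.ads is hypothetical). The path resolved with file_path_as_absolute() remains valid after the working directory changes:

```r
library(tools)  # provides file_path_as_absolute()

## Create an empty file in the session's temporary directory to stand in
## for an ADS file (hypothetical name, for illustration only).
old_wd <- setwd(tempdir())
file.create("toy_data.ads")

## Resolve the relative path once, as a seed constructor would.
abs_path <- file_path_as_absolute("toy_data.ads")

## Go back to the previous working directory: the relative path may no
## longer resolve, but the absolute path still points at the file.
setwd(old_wd)
abs_ok <- file.exists(abs_path)
```

Note that file_path_as_absolute() raises an error if the file does not exist, so it also acts as a basic sanity check.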

Note that storing an open connection to the file should be avoided because connections don’t work properly in the context of a fork (e.g. when processing the seed object in parallel) and tend to break when serializing the object.

2.2 Constructor

It is highly recommended to provide a “seed constructor” e.g.:

ADSArraySeed <- function(filepath, ...)  # '...' stands for the other args
{
    ## sanity checks
    ## ...
    filepath <- file_path_as_absolute(filepath)
    ## ...
    new("ADSArraySeed", filepath=filepath, ...)
}

Note that file_path_as_absolute() is defined in the tools package so it needs to be imported by adding the following to the NAMESPACE file of the ADSArray package:

importFrom(tools, file_path_as_absolute)

and adding tools to the Imports field of the DESCRIPTION file of the package.

2.3 The seed contract

Seed objects are expected to comply with the “seed contract” i.e. to support dim(), dimnames(), and extract_array(). This is normally done by implementing methods for these generics, but, as we will see below, a method is rarely needed for dim() or dimnames().

For example, the dim method for ADSArraySeed objects could look like this:

### An implementation that extracts the dimensions from the file
### each time the method is called.
setMethod("dim", "ADSArraySeed",
    function(x)
    {
        ## - open the connection to the file
        ## - on.exit(close the connection)
        ## - extract the dimensions and return them in an integer vector
    }
)

Note that the above dim method consults the ADS file each time it’s called. However, this can be avoided by adding dim and dimnames slots (of type integer and list, respectively) to the ADSArraySeed class and populating them at construction time, so that this information is retrieved from the file only once. With this approach, the dim and dimnames methods are actually not needed because, by default, the dim and dimnames primitive functions return the content of these slots if present.
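The slot-based approach can be tried out with a toy class. The sketch below uses a hypothetical ToySeed class (it is not part of any package and, unlike a real seed class, does not extend Array) and only exercises dim():

```r
library(methods)

## Hypothetical toy class with a 'dim' slot that is populated once,
## at construction time.
setClass("ToySeed",
    slots=c(filepath="character", dim="integer")
)

toy <- new("ToySeed", filepath="/path/to/file", dim=c(4L, 3L))

## No dim() method was defined for ToySeed objects: the dim() primitive
## picks up the content of the slot.
toy_dim <- dim(toy)
```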

If the ADS format does not allow storage of the dimnames, then there is no need to implement a dimnames method or to add a dimnames slot to the ADSArraySeed class.

extract_array is a generic function defined in the DelayedArray package:

library(DelayedArray)
?extract_array

It takes 2 arguments: x and index. x is the seed object to extract array values from. index must be an unnamed list of subscripts as positive integer vectors, one vector per seed dimension. Empty and missing subscripts (represented by integer(0) and NULL list elements, respectively) are allowed. The subscripts in index can contain duplicated indices. They cannot contain NAs or non-positive values.

The extract_array method must return an ordinary array of the appropriate type (i.e. integer, double, etc…). For example, if x is an ADSArraySeed object representing an M x N on-disk matrix of complex numbers, extract_array(x, list(NULL, 2L)) must return its 2nd column as an ordinary M x 1 matrix of type complex.

Note that the extract_array method needs to support empty and missing subscripts e.g. extract_array(x, list(NULL, integer(0))) must return an M x 0 matrix of type complex and extract_array(x, list(integer(0), integer(0))) a 0 x 0 matrix of type complex. This last edge case is important because the type and show methods for DelayedArray objects rely on it to work. More precisely, once the extract_array method supports an index with empty integer vectors, the following should work:

seed <- ADSArraySeed(...)
M <- DelayedArray(seed)
type(M)
show(M)

Finally note that subscripts are allowed to contain duplicated indices so things like extract_array(seed, list(c(1:3, 3:1), 2L)) need to be supported.
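To make the expected semantics of index concrete, here is a minimal sketch that applies such an index to an ordinary in-memory array. The helper extract_from_ordinary_array() is hypothetical (it is not part of DelayedArray); a real extract_array method would read the selected values from the ADS file instead, but should return arrays with the same dimensions and type:

```r
## Hypothetical helper: applies an extract_array()-style 'index' to an
## ordinary array 'a'.
extract_from_ordinary_array <- function(a, index)
{
    stopifnot(is.list(index), length(index) == length(dim(a)))
    ## A NULL list element is a missing subscript: select the whole dimension.
    subscripts <- lapply(seq_along(index),
        function(along) {
            i <- index[[along]]
            if (is.null(i)) seq_len(dim(a)[[along]]) else i
        }
    )
    ## drop=FALSE preserves the number of dimensions, even when a subscript
    ## selects zero or one element.
    do.call(`[`, c(list(a), subscripts, list(drop=FALSE)))
}

m <- matrix(complex(real=1:6, imaginary=6:1), nrow=3)  # 3 x 2, type complex

col2  <- extract_from_ordinary_array(m, list(NULL, 2L))             # 3 x 1
empty <- extract_from_ordinary_array(m, list(integer(0), integer(0)))  # 0 x 0
dups  <- extract_from_ordinary_array(m, list(c(1:3, 3:1), 2L))      # 6 x 1
```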

2.4 What to import?

Make sure the NAMESPACE file of the ADSArray package contains at least the following imports:

import(methods)
importFrom(tools, file_path_as_absolute)

import(BiocGenerics)
import(S4Vectors)
import(IRanges)
import(DelayedArray)

Unless you have a good reason for it, don’t try to selectively import things from the methods, BiocGenerics, S4Vectors, IRanges, and DelayedArray packages. This will only complicate maintenance of the ADSArray package in the long run and has no real benefits (contrary to popular belief).

Add methods, BiocGenerics, and DelayedArray to the Depends field of the DESCRIPTION file of the package, and tools, S4Vectors, and IRanges to its Imports field.

2.5 Testing

Make sure to export the ADSArraySeed class, its constructor, and the dim, dimnames, and extract_array methods.
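Assuming the names used so far in this document, the corresponding directives in the NAMESPACE file could look like the following sketch. List dim and dimnames in exportMethods only if the package actually defines methods for them (that is, if the slot-based approach described above is not used):

```r
exportClasses(ADSArraySeed)
export(ADSArraySeed)
exportMethods(dim, dimnames, extract_array)
```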

At this point, you should be able to wrap an ADSArraySeed object seed in a DelayedArray object with DelayedArray(seed), and this should return a fully functional DelayedArray object.

3 Implementing high-level classes ADSArray and ADSMatrix

These classes are not strictly needed but add a nice level of convenience.

3.1 ADSArray class definition

An ADSArray or ADSMatrix object is a DelayedArray derivative that doesn’t carry delayed operations yet. As soon as the user starts operating on it, it is degraded to a DelayedArray instance.

The ADSArray and ADSMatrix classes should extend the DelayedArray and DelayedMatrix classes, respectively, without adding any slot to them.

So just:

setClass("ADSArray",
    contains="DelayedArray",
    representation(seed="ADSArraySeed")
)

We’ll define the ADSMatrix class later.

3.2 ADSArray() constructor

Add a DelayedArray method for ADSArraySeed objects that does:

setMethod("DelayedArray", "ADSArraySeed",
    function(seed) new_DelayedArray(seed, Class="ADSArray")
)

Now you should be able to construct an ADSArray object with:

DelayedArray(ADSArraySeed(...))

The ADSArray constructor should just do that:

ADSArray <- function(filepath, ...)  # '...' stands for the other args
    DelayedArray(ADSArraySeed(filepath, ...))

However, it’s also nice to be able to pass an ADSArraySeed object to this constructor (with ADSArray(seed)). This can easily be supported with something like:

### Works directly on an ADSArraySeed object, in which case it must be
### called with a single argument.
ADSArray <- function(filepath, ...)  # '...' stands for the other args
{
    if (is(filepath, "ADSArraySeed")) {
        ## No other argument is allowed when a seed object is supplied.
        if (...length() != 0L)
            stop(wmsg("ADSArray() must be called with a single argument ",
                      "when passed an ADSArraySeed object"))
        seed <- filepath
    } else {
        seed <- ADSArraySeed(filepath, ...)
    }
    DelayedArray(seed)
}

3.3 ADSMatrix class definition

setClass("ADSMatrix", contains=c("ADSArray", "DelayedMatrix"))

3.4 Going from ADSArray to ADSMatrix

Define a matrixClass method for ADSArray objects as follows:

setMethod("matrixClass", "ADSArray", function(x) "ADSMatrix")

matrixClass is a generic function defined in the DelayedArray package. When passed an ADSArraySeed object, the low-level constructor new_DelayedArray (see section 3.2) will generally return an ADSArray instance, except when the ADSArraySeed object is 2-dimensional, in which case it needs to return an ADSMatrix instance. It will obtain the name of the class of the object to return ("ADSMatrix" in this case) by calling matrixClass.

Also, coercion from ADSArray to ADSMatrix needs to be supported with:

setAs("ADSArray", "ADSMatrix", function(from) new("ADSMatrix", from))

This coercion will make sure that the end-user gets the following error when trying to coerce an ADSArray object that is not 2-dimensional to ADSMatrix:

as(x, "ADSMatrix")
# Error in validObject(.Object) : invalid class "ADSMatrix" object:
#     'x' must have exactly 2 dimensions

Without the above coercion method, as(x, "ADSMatrix") would silently return an invalid ADSMatrix object.
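The mechanism can be reproduced without any ADS file by using toy classes. In the sketch below, ToyArray and ToyMatrix are hypothetical stand-ins for ADSArray and ADSMatrix, and the validity function only imitates the check on the number of dimensions:

```r
library(methods)

## Hypothetical stand-ins for ADSArray and ADSMatrix.
setClass("ToyArray", slots=c(dim="integer"))
setClass("ToyMatrix", contains="ToyArray",
    validity=function(object) {
        if (length(object@dim) != 2L)
            return("'x' must have exactly 2 dimensions")
        TRUE
    }
)

## new() validates the object it returns when it is passed an instance of
## the parent class.
setAs("ToyArray", "ToyMatrix", function(from) new("ToyMatrix", from))

ok  <- as(new("ToyArray", dim=c(5L, 2L)), "ToyMatrix")
bad <- try(as(new("ToyArray", dim=c(5L, 2L, 3L)), "ToyMatrix"), silent=TRUE)
```

Here ok is a valid ToyMatrix object, whereas the second coercion fails and bad holds the resulting error.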

3.5 Going from ADSMatrix to ADSArray

The user should not be able to degrade an ADSMatrix object to an ADSArray object, so as(x, "ADSArray", strict=TRUE) should fail or be a no-op when x is an ADSMatrix object. The easiest (and recommended) way to achieve this is to define the following coercion method:

setAs("ADSMatrix", "ADSArray", function(from) from)  # no-op
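The effect of such a no-op coercion method can be checked with toy classes. In this sketch, ToyArray and ToyMatrix are hypothetical stand-ins for ADSArray and ADSMatrix:

```r
library(methods)

## Hypothetical stand-ins for ADSArray and ADSMatrix.
setClass("ToyArray", slots=c(dim="integer"))
setClass("ToyMatrix", contains="ToyArray")

## No-op coercion: the object is returned unchanged.
setAs("ToyMatrix", "ToyArray", function(from) from)

m <- new("ToyMatrix", dim=c(5L, 2L))
res <- as(m, "ToyArray", strict=TRUE)
```

With the coercion method in place, res is still a ToyMatrix instance rather than a plain ToyArray.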

3.6 What to export?

Make sure to export the ADSArray and ADSMatrix classes, the ADSArray constructor, and the coerce methods.
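Assuming the names used in this section, the corresponding directives in the NAMESPACE file could look like the following sketch:

```r
exportClasses(ADSArray, ADSMatrix)
export(ADSArray)
exportMethods(coerce)
```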

4 Testing

Install the ADSArray package and load it in a fresh R session:

library(ADSArray)
... coming soon ...