1 Introduction

The SharedObject package is designed for sharing data across multiple R processes, where all processes can read the data located in the same memory location. This sharing mechanism has the potential to save the memory usage and reduce the overhead of data transmission in parallel computing. The use of the package arises from many data-science subjects such as high-throughput gene data analysis, in which case the data is very large and a parallel computing is desirable. Blindly exporting data to all R processes via functions such as clusterExport will duplicate the data for each process and it is obviously unnecessary if other processes just need to read the data. The sharedObject package can share the data without duplications and is able to reduce the time cost. A new set of R APIs called ALTREP is used to provide a seamless experience when sharing an object.

2 Quick example

We first demonstrate the package with an example. In this example, we create a cluster with 4 cores and share an n-by-n matrix A, we use the function share to create a shared object shared_A and call the function clusterExport to export it:

library(parallel)
## Initiate the cluster
cl <- makeCluster(1)
## create data
n <- 3
A <- matrix(runif(n^2), n, n)
## create a shared object
shared_A <- share(A)
## export the shared object
clusterExport(cl,"shared_A")

stopCluster(cl)

As the code shows above, the procedure of exporting a shared object to the other R processes is similar to the procedure of exporting a regular R object, except that we replace the matrix A with a shared object shared_A. Notably, there is no different between the matrix A and the shared object shared_A. The shared object shared_A is neither an S3 nor S4 object and it behaves exactly the same as the matrix A, so there is no need to change the existing code to work with the shared object. We can verify this through

## check the data
A
#>           [,1]       [,2]      [,3]
#> [1,] 0.3663821 0.67517193 0.0012652
#> [2,] 0.4349437 0.07030278 0.7385204
#> [3,] 0.6573789 0.31251063 0.3734009
shared_A
#>           [,1]       [,2]      [,3]
#> [1,] 0.3663821 0.67517193 0.0012652
#> [2,] 0.4349437 0.07030278 0.7385204
#> [3,] 0.6573789 0.31251063 0.3734009
## check the class
class(A)
#> [1] "matrix" "array"
class(shared_A)
#> [1] "matrix" "array"
## idential
identical(A, shared_A)
#> [1] TRUE

Users can treate the shared object shared_A as a matrix and do operations on it as usual. For reducing the unnecessary creation of a shared object, the subset of a shared object is a regular R object. Users can verify this by calling is.shared

## `shared_A` is a shared object
is.shared(shared_A)
#> [1] TRUE

## The subset of `shared_A` is not
is.shared(shared_A[1:2])
#> [1] FALSE

This behavior, however, can be altered via the argument sharedSubset. Therefore, if a shared object shared_A is made by share(A, sharedSubset = TRUE), then all the subsets of the object shared_A will be shared objects automatically.

3 Supported data types and structures

Currently, the package supports raw, logical, integer and double data types. character is not supported. Sharing the data structure atomic(aka vector), matrix, data.frame and list is available. The function share is an S4 generic, developers can define an S4 share function to support their own data structures.

Please note that sharing a list object will not sharing the list itself, but sharing each element of the list object instead. Therefore, adding or replace an element on a shared list in the main process will not implicitly change the shared list on the other processes. Since a data frame is fundamentally a list object, sharing a data frame will follow the same principle as sharing a list.

When a list consists of both sharable and non-sharable objects, mustWork argument can be used in the share function to share the sharable elements and keep the non-sharable elements same(Otherwise an error will be shown). Alternatively, the function tryShare is a shortcut for share(..., mustWork = FALSE)

## the element `a` is sharable and b is not
mydata <- list(a = 1:3, b = letters[1:3])

## Will get an error if we directly share the object
## share(mydata)

## Use the `mustWork` argument to suppress the error message
sharedList1 <- share(mydata, mustWork = FALSE)

## Use the function `tryShare`
## this is equivalent to `share(mydata, mustWork = FALSE)`
sharedList2 <- tryShare(mydata)

## Only the element `a` is a shared object
is.shared(sharedList1)
#> $a
#> [1] TRUE
#> 
#> $b
#> [1] FALSE

4 Check object information

In order to distinguish a shared object, the package provide is.shared function to identify a shared object

## Check if an object is a shared object
## This works for both vector and data.frame
is.shared(A)
#> [1] FALSE
is.shared(shared_A)
#> [1] TRUE

For an atomic object, is.shared returns a logical value indicating whether the object is a shared object. For a list object, it returns a list of logical values with each element representing the corresponding element in the list object.

There are several properties associated with the shared object, one can check them via

## get a summary report
getSharedObjectProperty(shared_A)
#> $dataId
#> [1] 1
#> 
#> $length
#> [1] 9
#> 
#> $totalSize
#> [1] 72
#> 
#> $dataType
#> [1] 14
#> 
#> $ownData
#> [1] TRUE
#> 
#> $copyOnWrite
#> [1] TRUE
#> 
#> $sharedSubset
#> [1] FALSE
#> 
#> $sharedCopy
#> [1] FALSE

## Internal function to check the properties
## get the individual properties
getCopyOnWrite(shared_A)
#> [1] TRUE
getSharedSubset(shared_A)
#> [1] FALSE
getSharedCopy(shared_A)
#> [1] FALSE

Please see the advanced topic to see the meaning of the properties and how to set them in a proper way.

5 Global options

There are some options that can control the default behavior of a shared object, you can view them via

getSharedObjectOptions()
#> $mustWork
#> [1] TRUE
#> 
#> $copyOnWrite
#> [1] TRUE
#> 
#> $sharedSubset
#> [1] FALSE
#> 
#> $sharedCopy
#> [1] FALSE

As beforementioned, the option sharedSubset controls whether the subset of a shared object is still a shared object. The option mustWork suppress the error message when the function share encounter a non-sharable object and force the function to return the same object. We will talk about the options copyOnWrite and sharedCopy in the advanced section, but for most users these two options should not be changed. The global setting can be modified via setSharedObjectOptions

## change the default setting
setSharedObjectOptions(mustWork = FALSE)
## Check if the change is made
getSharedObjectOptions("mustWork")
#> [1] FALSE

6 Advanced topic

6.1 Copy-On-Write

Because all cores are using the shared object shared_A located in the same memory location, a reckless change made on the matrix shared_A on one process will immediately be broadcasted to the other process. To prevent users from changing the values of a shared object without awareness, a shared object will duplicate itself if a change of its value is made. Therefore, the code like

shared_A2 <- shared_A
shared_A[1,1] <- 10

shared_A
#>            [,1]       [,2]      [,3]
#> [1,] 10.0000000 0.67517193 0.0012652
#> [2,]  0.4349437 0.07030278 0.7385204
#> [3,]  0.6573789 0.31251063 0.3734009
shared_A2
#>           [,1]       [,2]      [,3]
#> [1,] 0.3663821 0.67517193 0.0012652
#> [2,] 0.4349437 0.07030278 0.7385204
#> [3,] 0.6573789 0.31251063 0.3734009

## shared_A became a regular R object
is.shared(shared_A)
#> [1] FALSE

will result in a memory dulplication. The matrix shared_A2 is not affected. This default behavior can be overwritten by passing an argument copyOnWrite to the function share. For example

shared_A <- share(A, copyOnWrite=FALSE)
shared_A2 <- shared_A
shared_A[1,1] <- 10

shared_A
#>            [,1]       [,2]      [,3]
#> [1,] 10.0000000 0.67517193 0.0012652
#> [2,]  0.4349437 0.07030278 0.7385204
#> [3,]  0.6573789 0.31251063 0.3734009
shared_A2
#>            [,1]       [,2]      [,3]
#> [1,] 10.0000000 0.67517193 0.0012652
#> [2,]  0.4349437 0.07030278 0.7385204
#> [3,]  0.6573789 0.31251063 0.3734009

A change in the matrix shared_A cause a change in shared_A2. This feature could be potentially useful to return the result from each R process without additional memory allocation, so shared_A can be both the initial data and the final result. However, due to the limitation of R, it is possible to change the value of a shared object unexpectly. For example

shared_A <- share(A, copyOnWrite = FALSE)
shared_A
#>           [,1]       [,2]      [,3]
#> [1,] 0.3663821 0.67517193 0.0012652
#> [2,] 0.4349437 0.07030278 0.7385204
#> [3,] 0.6573789 0.31251063 0.3734009
-shared_A
#>            [,1]        [,2]       [,3]
#> [1,] -0.3663821 -0.67517193 -0.0012652
#> [2,] -0.4349437 -0.07030278 -0.7385204
#> [3,] -0.6573789 -0.31251063 -0.3734009
shared_A
#>            [,1]        [,2]       [,3]
#> [1,] -0.3663821 -0.67517193 -0.0012652
#> [2,] -0.4349437 -0.07030278 -0.7385204
#> [3,] -0.6573789 -0.31251063 -0.3734009

The above example shows an unexpected result when the copy-on-write feature is off. Simply calling an unary function can change the values of a shared object. Therefore, for the safty of the naive users, it is recommended to use the default setting. For the sophisticated R users, the copy-on-write feature of an object can be altered via setCopyOnwrite funtion. The old value will be invisibly returned by the function.

shared_A <- share(A, copyOnWrite = FALSE)
shared_A2 <- shared_A
## change the value of shared_A
shared_A[1,1] <- 10
## Both shared_A and shared_A2 are affected
shared_A
#>            [,1]       [,2]      [,3]
#> [1,] 10.0000000 0.67517193 0.0012652
#> [2,]  0.4349437 0.07030278 0.7385204
#> [3,]  0.6573789 0.31251063 0.3734009
shared_A2
#>            [,1]       [,2]      [,3]
#> [1,] 10.0000000 0.67517193 0.0012652
#> [2,]  0.4349437 0.07030278 0.7385204
#> [3,]  0.6573789 0.31251063 0.3734009

## Enable copy-on-write
setCopyOnWrite(shared_A, TRUE)
## shared_A is now independent with shared_A2
shared_A[1,2] <- 10
shared_A
#>            [,1]        [,2]      [,3]
#> [1,] 10.0000000 10.00000000 0.0012652
#> [2,]  0.4349437  0.07030278 0.7385204
#> [3,]  0.6573789  0.31251063 0.3734009
shared_A2
#>            [,1]       [,2]      [,3]
#> [1,] 10.0000000 0.67517193 0.0012652
#> [2,]  0.4349437 0.07030278 0.7385204
#> [3,]  0.6573789 0.31251063 0.3734009

These flexibilities provide us a way to do safe operations during the computation and return the results without memory duplications.

6.1.1 Warning

If a high-precision value is assigned to a low-precision shared object(E.g. assigning a numeric value to an integer shared object), an implicit type conversion will be triggered for correctly storing the change. The resulting object would be a regular R object, not a shared object. Therefore, the change will not be broadcasted even if the copy-on-write feature is off. Users should be caution with the data type that a shared object is using.

6.2 Shared subset and shared copy

The options sharedSubset controls whether to create a shared object when subsetting a shared object. sharedCopy determines if the duplication of a shared object is still a shared object(Recall that after changing the value of the variable shared_A, it is no longer a shared object, this is due to sharedCopy = FALSE). For performance consideration, the default settings are sharedSubset = FALSE and sharedCopy=FALSE, but they can be temporary overwritten via:

shared_A <- share(A, sharedSubset = TRUE, sharedCopy = TRUE)
getSharedObjectProperty(shared_A, property = c("sharedSubset", "sharedCopy"))
#> $sharedSubset
#> [1] TRUE
#> 
#> $sharedCopy
#> [1] TRUE

## Changing the value of `shared_A` will not 
## result in a regular R object
shared_A2 <- shared_A
shared_A[1,1] <- 10
is.shared(shared_A)
#> [1] TRUE

Please note that sharedCopy is only available when copyOnWrite = TRUE.

7 Developing package based upon SharedObject

The package offers three levels of APIs to help the package developers to build their own shared memory object.

7.1 user API

The simplest and recommended way to make your own shared object is to define an S4 function share in your own package, where you can rely on the existing share functions to quickly add the support of an S4 class which is not provided by SharedObject. We recommend to use this method to build your package for the simple reason that the developers do not have to control the lifecycle of the shared memory. The package will automatically destroy the shared memory after usage.

7.2 R memory management APIs

It is a common request to have a low level control to the shared memory. To achieve that, the package exports some R APIs for the developers who want to have a fine control of their shared objects. These functions are allocateSharedMemory, allocateNamedSharedMemory, mapSharedMemory, unmapSharedMemory, freeSharedMemory, hasSharedMemory and getSharedMemorySize. Note that developers are responsible for destroying the shared memory after usage. Please see the function documentation for more infomation

7.3 C++ memory management APIs

For the most sophisticated package developers, it might be more comfortable to use the C++ APIs rather than R APIs. All the R APIs mentioned in the previous section can be found at C++ level. Here I will borrow the instruction from Rhtslib package with slight changes to show how to use the C++ APIs in your package.

7.3.1 Step 1

For using the C++ APIS, you must add SharedObject to the LinkingTo field of the DESCRIPTION file, e.g.,

LinkingTo: SharedObject

7.3.2 Step 2

In C++ code files, include the header of the shared object #include "SharedObject/sharedMemory.h".

7.3.3 Step 3

To compile and link your package successfully against the SharedObject C++ library, you must include a src/Makevars file.

SHARED_OBJECT_LIBS = $(shell echo 'SharedObject:::pkgconfig("PKG_LIBS")'|\
                     "${R_HOME}/bin/R" --vanilla --slave)
SHARED_OBJECT_CPPFLAGS = $(shell echo 'SharedObject:::pkgconfig("PKG_CPPFLAGS")'|\
                         "${R_HOME}/bin/R" --vanilla --slave)

PKG_LIBS := $(PKG_LIBS) $(SHARED_OBJECT_LIBS)
PKG_CPPFLAGS := $(PKG_CPPFLAGS) $(SHARED_OBJECT_CPPFLAGS)

Note that $(shell ...) is GNU make syntax so you should add GNU make to the SystemRequirements field of the DESCRIPTION file of your package, e.g.,

SystemRequirements: GNU make

You can find short descriptions on how to use the C++ API in the header file.

8 Session Information

sessionInfo()
#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] SharedObject_1.2.2 BiocStyle_2.16.0  
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.4.6        bookdown_0.18       digest_0.6.25      
#>  [4] magrittr_1.5        evaluate_0.14       rlang_0.4.6        
#>  [7] stringi_1.4.6       rmarkdown_2.1       tools_4.0.0        
#> [10] stringr_1.4.0       xfun_0.13           yaml_2.2.1         
#> [13] compiler_4.0.0      BiocGenerics_0.34.0 BiocManager_1.30.10
#> [16] htmltools_0.4.0     knitr_1.28

Package Quick Start Guide

2020-05-06

Package