SharedObject 1.2.2
The SharedObject
package is designed for sharing data across multiple R processes, where all processes can read the data located in the same memory location. This sharing mechanism has the potential to save the memory usage and reduce the overhead of data transmission in parallel computing. The use of the package arises from many data-science subjects such as high-throughput gene data analysis, in which case the data is very large and a parallel computing is desirable. Blindly exporting data to all R processes via functions such as clusterExport
will duplicate the data for each process and it is obviously unnecessary if other processes just need to read the data. The sharedObject
package can share the data without duplications and is able to reduce the time cost. A new set of R APIs called ALTREP
is used to provide a seamless experience when sharing an object.
We first demonstrate the package with an example. In this example, we create a cluster with 4 cores and share an n-by-n matrix A
, we use the function share
to create a shared object shared_A
and call the function clusterExport
to export it:
library(parallel)
## Initiate the cluster
cl <- makeCluster(1)
## create data
n <- 3
A <- matrix(runif(n^2), n, n)
## create a shared object
shared_A <- share(A)
## export the shared object
clusterExport(cl,"shared_A")
stopCluster(cl)
As the code shows above, the procedure of exporting a shared object to the other R processes is similar to the procedure of exporting a regular R object, except that we replace the matrix A
with a shared object shared_A
. Notably, there is no different between the matrix A
and the shared object shared_A
. The shared object shared_A
is neither an S3 nor S4 object and it behaves exactly the same as the matrix A
, so there is no need to change the existing code to work with the shared object. We can verify this through
## check the data
A
#> [,1] [,2] [,3]
#> [1,] 0.3663821 0.67517193 0.0012652
#> [2,] 0.4349437 0.07030278 0.7385204
#> [3,] 0.6573789 0.31251063 0.3734009
shared_A
#> [,1] [,2] [,3]
#> [1,] 0.3663821 0.67517193 0.0012652
#> [2,] 0.4349437 0.07030278 0.7385204
#> [3,] 0.6573789 0.31251063 0.3734009
## check the class
class(A)
#> [1] "matrix" "array"
class(shared_A)
#> [1] "matrix" "array"
## idential
identical(A, shared_A)
#> [1] TRUE
Users can treate the shared object shared_A
as a matrix and do operations on it as usual. For reducing the unnecessary creation of a shared object, the subset of a shared object is a regular R object. Users can verify this by calling is.shared
## `shared_A` is a shared object
is.shared(shared_A)
#> [1] TRUE
## The subset of `shared_A` is not
is.shared(shared_A[1:2])
#> [1] FALSE
This behavior, however, can be altered via the argument sharedSubset
. Therefore, if a shared object shared_A
is made by share(A, sharedSubset = TRUE)
, then all the subsets of the object shared_A
will be shared objects automatically.
Currently, the package supports raw
, logical
, integer
and double
data types. character
is not supported. Sharing the data structure atomic
(aka vector
), matrix
, data.frame
and list
is available. The function share
is an S4 generic, developers can define an S4 share
function to support their own data structures.
Please note that sharing a list object will not sharing the list itself, but sharing each element of the list object instead. Therefore, adding or replace an element on a shared list in the main process will not implicitly change the shared list on the other processes. Since a data frame is fundamentally a list object, sharing a data frame will follow the same principle as sharing a list.
When a list consists of both sharable and non-sharable objects, mustWork
argument can be used in the share
function to share the sharable elements and keep the non-sharable elements same(Otherwise an error will be shown). Alternatively, the function tryShare
is a shortcut for share(..., mustWork = FALSE)
## the element `a` is sharable and b is not
mydata <- list(a = 1:3, b = letters[1:3])
## Will get an error if we directly share the object
## share(mydata)
## Use the `mustWork` argument to suppress the error message
sharedList1 <- share(mydata, mustWork = FALSE)
## Use the function `tryShare`
## this is equivalent to `share(mydata, mustWork = FALSE)`
sharedList2 <- tryShare(mydata)
## Only the element `a` is a shared object
is.shared(sharedList1)
#> $a
#> [1] TRUE
#>
#> $b
#> [1] FALSE
In order to distinguish a shared object, the package provide is.shared
function to identify a shared object
## Check if an object is a shared object
## This works for both vector and data.frame
is.shared(A)
#> [1] FALSE
is.shared(shared_A)
#> [1] TRUE
For an atomic object, is.shared
returns a logical value indicating whether the object is a shared object. For a list object, it returns a list of logical values with each element representing the corresponding element in the list object.
There are several properties associated with the shared object, one can check them via
## get a summary report
getSharedObjectProperty(shared_A)
#> $dataId
#> [1] 1
#>
#> $length
#> [1] 9
#>
#> $totalSize
#> [1] 72
#>
#> $dataType
#> [1] 14
#>
#> $ownData
#> [1] TRUE
#>
#> $copyOnWrite
#> [1] TRUE
#>
#> $sharedSubset
#> [1] FALSE
#>
#> $sharedCopy
#> [1] FALSE
## Internal function to check the properties
## get the individual properties
getCopyOnWrite(shared_A)
#> [1] TRUE
getSharedSubset(shared_A)
#> [1] FALSE
getSharedCopy(shared_A)
#> [1] FALSE
Please see the advanced topic to see the meaning of the properties and how to set them in a proper way.
There are some options that can control the default behavior of a shared object, you can view them via
getSharedObjectOptions()
#> $mustWork
#> [1] TRUE
#>
#> $copyOnWrite
#> [1] TRUE
#>
#> $sharedSubset
#> [1] FALSE
#>
#> $sharedCopy
#> [1] FALSE
As beforementioned, the option sharedSubset
controls whether the subset of a shared object is still a shared object. The option mustWork
suppress the error message when the function share
encounter a non-sharable object and force the function to return the same object. We will talk about the options copyOnWrite
and sharedCopy
in the advanced section, but for most users these two options should not be changed. The global setting can be modified via setSharedObjectOptions
## change the default setting
setSharedObjectOptions(mustWork = FALSE)
## Check if the change is made
getSharedObjectOptions("mustWork")
#> [1] FALSE
Because all cores are using the shared object shared_A
located in the same memory location, a reckless change made on the matrix shared_A
on one process will immediately be broadcasted to the other process. To prevent users from changing the values of a shared object without awareness, a shared object will duplicate itself if a change of its value is made. Therefore, the code like
shared_A2 <- shared_A
shared_A[1,1] <- 10
shared_A
#> [,1] [,2] [,3]
#> [1,] 10.0000000 0.67517193 0.0012652
#> [2,] 0.4349437 0.07030278 0.7385204
#> [3,] 0.6573789 0.31251063 0.3734009
shared_A2
#> [,1] [,2] [,3]
#> [1,] 0.3663821 0.67517193 0.0012652
#> [2,] 0.4349437 0.07030278 0.7385204
#> [3,] 0.6573789 0.31251063 0.3734009
## shared_A became a regular R object
is.shared(shared_A)
#> [1] FALSE
will result in a memory dulplication. The matrix shared_A2
is not affected. This default behavior can be overwritten by passing an argument copyOnWrite
to the function share
. For example
shared_A <- share(A, copyOnWrite=FALSE)
shared_A2 <- shared_A
shared_A[1,1] <- 10
shared_A
#> [,1] [,2] [,3]
#> [1,] 10.0000000 0.67517193 0.0012652
#> [2,] 0.4349437 0.07030278 0.7385204
#> [3,] 0.6573789 0.31251063 0.3734009
shared_A2
#> [,1] [,2] [,3]
#> [1,] 10.0000000 0.67517193 0.0012652
#> [2,] 0.4349437 0.07030278 0.7385204
#> [3,] 0.6573789 0.31251063 0.3734009
A change in the matrix shared_A
cause a change in shared_A2
. This feature could be potentially useful to return the result from each R process without additional memory allocation, so shared_A
can be both the initial data and the final result. However, due to the limitation of R, it is possible to change the value of a shared object unexpectly. For example
shared_A <- share(A, copyOnWrite = FALSE)
shared_A
#> [,1] [,2] [,3]
#> [1,] 0.3663821 0.67517193 0.0012652
#> [2,] 0.4349437 0.07030278 0.7385204
#> [3,] 0.6573789 0.31251063 0.3734009
-shared_A
#> [,1] [,2] [,3]
#> [1,] -0.3663821 -0.67517193 -0.0012652
#> [2,] -0.4349437 -0.07030278 -0.7385204
#> [3,] -0.6573789 -0.31251063 -0.3734009
shared_A
#> [,1] [,2] [,3]
#> [1,] -0.3663821 -0.67517193 -0.0012652
#> [2,] -0.4349437 -0.07030278 -0.7385204
#> [3,] -0.6573789 -0.31251063 -0.3734009
The above example shows an unexpected result when the copy-on-write feature is off. Simply calling an unary function can change the values of a shared object. Therefore, for the safty of the naive users, it is recommended to use the default setting. For the sophisticated R users, the copy-on-write feature of an object can be altered via setCopyOnwrite
funtion. The old value will be invisibly returned by the function.
shared_A <- share(A, copyOnWrite = FALSE)
shared_A2 <- shared_A
## change the value of shared_A
shared_A[1,1] <- 10
## Both shared_A and shared_A2 are affected
shared_A
#> [,1] [,2] [,3]
#> [1,] 10.0000000 0.67517193 0.0012652
#> [2,] 0.4349437 0.07030278 0.7385204
#> [3,] 0.6573789 0.31251063 0.3734009
shared_A2
#> [,1] [,2] [,3]
#> [1,] 10.0000000 0.67517193 0.0012652
#> [2,] 0.4349437 0.07030278 0.7385204
#> [3,] 0.6573789 0.31251063 0.3734009
## Enable copy-on-write
setCopyOnWrite(shared_A, TRUE)
## shared_A is now independent with shared_A2
shared_A[1,2] <- 10
shared_A
#> [,1] [,2] [,3]
#> [1,] 10.0000000 10.00000000 0.0012652
#> [2,] 0.4349437 0.07030278 0.7385204
#> [3,] 0.6573789 0.31251063 0.3734009
shared_A2
#> [,1] [,2] [,3]
#> [1,] 10.0000000 0.67517193 0.0012652
#> [2,] 0.4349437 0.07030278 0.7385204
#> [3,] 0.6573789 0.31251063 0.3734009
These flexibilities provide us a way to do safe operations during the computation and return the results without memory duplications.
If a high-precision value is assigned to a low-precision shared object(E.g. assigning a numeric value to an integer shared object), an implicit type conversion will be triggered for correctly storing the change. The resulting object would be a regular R object, not a shared object. Therefore, the change will not be broadcasted even if the copy-on-write feature is off. Users should be caution with the data type that a shared object is using.
sessionInfo()
#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SharedObject_1.2.2 BiocStyle_2.16.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.4.6 bookdown_0.18 digest_0.6.25
#> [4] magrittr_1.5 evaluate_0.14 rlang_0.4.6
#> [7] stringi_1.4.6 rmarkdown_2.1 tools_4.0.0
#> [10] stringr_1.4.0 xfun_0.13 yaml_2.2.1
#> [13] compiler_4.0.0 BiocGenerics_0.34.0 BiocManager_1.30.10
#> [16] htmltools_0.4.0 knitr_1.28