Contents

1 Introduction

The alabaster.base package (and its family) implements methods to save common Bioconductor objects to file artifacts and load them back into R. This aims to provide a functional equivalent to RDS-based serialization that is:

2 Quick start

To demonstrate, let’s mock up a DataFrame object from the S4Vectors package.

library(S4Vectors)
df <- DataFrame(X=1:10, Y=letters[1:10])
df
## DataFrame with 10 rows and 2 columns
##            X           Y
##    <integer> <character>
## 1          1           a
## 2          2           b
## 3          3           c
## 4          4           d
## 5          5           e
## 6          6           f
## 7          7           g
## 8          8           h
## 9          9           i
## 10        10           j

We’ll save this DataFrame to file inside a “staging directory”:

staging <- tempfile()

library(alabaster.base)
saveLocalObject(df, staging, "my_favorite_df")

And reading it back in:

readLocalObject(staging, "my_favorite_df")
## DataFrame with 10 rows and 2 columns
##            X           Y
##    <integer> <character>
## 1          1           a
## 2          2           b
## 3          3           c
## 4          4           d
## 5          5           e
## 6          6           f
## 7          7           g
## 8          8           h
## 9          9           i
## 10        10           j

3 Class-specific methods

Each class implements a staging and loading method for use inside the alabaster framework. The staging method (for the stageObject() generic) will save the object to one or more files, given a staging directory and a path inside it:

tmp <- tempfile()
dir.create(tmp)

# DataFrame method already defined for the stageObject generic:
meta <- stageObject(df, tmp, path="my_df")
str(meta)
## List of 5
##  $ $schema       : chr "csv_data_frame/v1.json"
##  $ path          : chr "my_df/simple.csv.gz"
##  $ is_child      : logi FALSE
##  $ data_frame    :List of 5
##   ..$ columns    :List of 2
##   .. ..$ :List of 2
##   .. .. ..$ name: chr "X"
##   .. .. ..$ type: chr "integer"
##   .. ..$ :List of 2
##   .. .. ..$ name: chr "Y"
##   .. .. ..$ type: chr "string"
##   ..$ row_names  : logi FALSE
##   ..$ column_data: NULL
##   ..$ other_data : NULL
##   ..$ dimensions : int [1:2] 10 2
##  $ csv_data_frame:List of 1
##   ..$ compression: chr "gzip"
# Writing the metadata to file.
invisible(.writeMetadata(meta, tmp))

list.files(tmp, recursive=TRUE)
## [1] "my_df/simple.csv.gz"      "my_df/simple.csv.gz.json"

Conversely, the loading function will - as the name suggests - load the object back into memory, given the staging directory and the file’s metadata. The correct loading function for each class is automatically called by the loadObject() function:

remeta <- acquireMetadata(tmp, "my_df/simple.csv.gz")
loadObject(remeta, tmp)
## DataFrame with 10 rows and 2 columns
##            X           Y
##    <integer> <character>
## 1          1           a
## 2          2           b
## 3          3           c
## 4          4           d
## 5          5           e
## 6          6           f
## 7          7           g
## 8          8           h
## 9          9           i
## 10        10           j

alabaster.base itself supports a small set of classes from the S4Vectors packages; support for additional classes can be found in other packages like alabaster.ranges and alabaster.se. Third-party developers can also add support for their own classes by defining new methods, see the Extensions vignette for details.

Incidentally, readLocalObject() and saveLocalObject() are just convenience wrappers acround stageObject() and loadObject(), respectively.

4 What’s a staging directory?

For some concept of a “project”, the staging directory contains all of the artifacts generated for that project. Multiple objects can be saved into a single staging directory, usually in separate subdirectories to avoid conflicts. For example:

# Creating a nested DF to be a little spicy:
df2 <- DataFrame(Z=factor(1:5), AA=I(DataFrame(B=runif(5), C=rnorm(5))))

meta2 <- stageObject(df2, tmp, path="my_df2")
list.files(tmp, recursive=TRUE)
## [1] "my_df/simple.csv.gz"               "my_df/simple.csv.gz.json"         
## [3] "my_df2/column1/levels.csv.gz"      "my_df2/column1/levels.csv.gz.json"
## [5] "my_df2/column2/simple.csv.gz"      "my_df2/column2/simple.csv.gz.json"
## [7] "my_df2/simple.csv.gz"

We use the staging directory as a root for the hard-coded paths in the metadata returned by stageObject() (i.e., meta and meta2). Thus, we can easily copy the entire directory to a new file system and everything will still be correctly referenced.

meta2$path
## [1] "my_df2/simple.csv.gz"
new.dir <- tempfile()
dir.create(new.dir)
invisible(file.copy(tmp, new.dir, recursive=TRUE))
loadObject(meta2, file.path(new.dir, basename(tmp)))
## DataFrame with 5 rows and 2 columns
##          Z                 AA
##   <factor>        <DataFrame>
## 1        1 0.533730:-1.204991
## 2        2 0.491705:-0.146865
## 3        3 0.863495: 0.354438
## 4        4 0.521754: 0.576081
## 5        5 0.015356:-0.455894

The simplest way to share objects is to just zip or tar the staging directory for ad hoc distribution. For more serious applications, alabaster.base can be used in conjunction with centralized file storage systems - see the Applications vignette for more details.

5 Saving and validating metadata

Each file artifact is associated with JSON-formatted metadata, denoted by a file with an additional .json suffix. This defines the interpretation of the contents of the file as well as pointing to other files that are required to represent the corresponding object in an R session. Users can use the .writeMetadata() utility to quickly write the metadata to an appropriate location inside the staging directory.

# Writing the metadata to file.
invisible(.writeMetadata(meta, dir=tmp))
invisible(.writeMetadata(meta2, dir=tmp))

# Reading a snippet.
meta.path <- file.path(tmp, paste0(meta$path, ".json"))
cat(head(readLines(meta.path), 20), sep="\n")
## {
##   "$schema": "csv_data_frame/v1.json",
##   "path": "my_df/simple.csv.gz",
##   "is_child": false,
##   "data_frame": {
##     "columns": [
##       {
##         "name": "X",
##         "type": "integer"
##       },
##       {
##         "name": "Y",
##         "type": "string"
##       }
##     ],
##     "row_names": false,
##     "dimensions": [10, 2]
##   },
##   "csv_data_frame": {
##     "compression": "gzip"

The JSON metadata for each artifact follows the constraints of the schema specified in the $schema property. This schema determines what fields can be stored and which are mandatory; the loading method uses such fields to reliably restore the object into memory. .writeMetadata() will automatically validate the metadata against the schema to ensure that the contents are appropriate.

meta.fail <- meta
meta.fail[["data_frame"]][["columns"]] <- NULL
try(.writeMetadata(meta.fail, dir=tmp))
## Error : 1 error validating json:
##  - /data_frame (#/properties/data_frame/required): must have required property 'columns'

The schema will also list the exact R function that is required to load an object into memory. This means that developers can create third-party extensions for new data structures by defining new schemas, without requiring changes to alabaster.base’s code. Indeed, loadObject() automatically uses the schema to determine how it should load the files back into memory:

re.read <- acquireMetadata(tmp, "my_df2/simple.csv.gz")
loadObject(re.read, tmp)
## DataFrame with 5 rows and 2 columns
##          Z                 AA
##   <factor>        <DataFrame>
## 1        1 0.533730:-1.204991
## 2        2 0.491705:-0.146865
## 3        3 0.863495: 0.354438
## 4        4 0.521754: 0.576081
## 5        5 0.015356:-0.455894

Session information

sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] S4Vectors_0.38.0     BiocGenerics_0.46.0  Matrix_1.5-4        
## [4] alabaster.base_1.0.0 BiocStyle_2.28.0    
## 
## loaded via a namespace (and not attached):
##  [1] cli_3.6.1               knitr_1.42              rlang_1.1.0            
##  [4] xfun_0.39               DelayedArray_0.26.0     jsonlite_1.8.4         
##  [7] V8_4.3.0                htmltools_0.5.5         MatrixGenerics_1.12.0  
## [10] sass_0.4.5              rmarkdown_2.21          grid_4.3.0             
## [13] evaluate_0.20           jquerylib_0.1.4         fastmap_1.1.1          
## [16] IRanges_2.34.0          yaml_2.3.7              alabaster.schemas_1.0.0
## [19] Rhdf5lib_1.22.0         bookdown_0.33           jsonvalidate_1.3.2     
## [22] BiocManager_1.30.20     compiler_4.3.0          Rcpp_1.0.10            
## [25] rhdf5filters_1.12.0     lattice_0.21-8          rhdf5_2.44.0           
## [28] digest_0.6.31           R6_2.5.1                curl_5.0.0             
## [31] HDF5Array_1.28.0        bslib_0.4.2             tools_4.3.0            
## [34] matrixStats_0.63.0      alabaster.matrix_1.0.0  cachem_1.0.7