Introduction

The package provides a high-level R interface to CoreArray Genomic Data Structure (GDS) data files, which are portable across platforms and include hierarchical structure to store multiple scalable array-oriented data sets with metadata information. It is suited for large-scale datasets, especially for data which are much larger than the available random-access memory. The package gdsfmt offers the efficient operations specifically designed for integers with less than 8 bits, since a single genetic/genomic variant, like single-nucleotide polymorphism (SNP), usually occupies fewer bits than a byte. Data compression and decompression are also supported with relatively efficient random access.

Installation of the package gdsfmt

To install the package gdsfmt, you need a current version (>=2.14.0) of R. After installing R you can run the following commands from the R command shell to install the package gdsfmt.

Install the package from Bioconductor repository:

source("http://bioconductor.org/biocLite.R")
biocLite("gdsfmt")

Install the development version from Github:

library("devtools")
install_github("zhengxwen/gdsfmt")

The install_github() approach requires that you build from source, i.e. make and compilers must be installed on your system – see the R FAQ for your operating system; you may also need to install dependencies manually.

High-level R functions

Creating a GDS file and variable hierarchy

An empty GDS file can be created by createfn.gds():

library(gdsfmt)
gfile <- createfn.gds("test.gds")

Now, a file handle associated with “test.gds” is saved in the R variable gfile.

The GDS file can contain a hierarchical structure to store multiple GDS variables (or GDS nodes) in the file, and various data types are allowed (see the document of add.gdsn()) including integer, floating-point number and character.

add.gdsn(gfile, "int", val=1:10000)
add.gdsn(gfile, "double", val=seq(1, 1000, 0.4))
add.gdsn(gfile, "character", val=c("int", "double", "logical", "factor"))
add.gdsn(gfile, "logical", val=rep(c(TRUE, FALSE, NA), 50), visible=FALSE)
add.gdsn(gfile, "factor", val=as.factor(c(NA, "AA", "CC")), visible=FALSE)
add.gdsn(gfile, "bit2", val=sample(0:3, 1000, replace=TRUE), storage="bit2")

# list and data.frame
add.gdsn(gfile, "list", val=list(X=1:10, Y=seq(1, 10, 0.25)))
add.gdsn(gfile, "data.frame", val=data.frame(X=1:19, Y=seq(1, 10, 0.5)))

folder <- addfolder.gdsn(gfile, "folder")
add.gdsn(folder, "int", val=1:1000)
add.gdsn(folder, "double", val=seq(1, 100, 0.4), visible=FALSE)

Users can display the file content by typing gfile or print(gfile):

gfile

## File: /tmp/RtmpyLuaNY/Rbuild1eb38615761/gdsfmt/vignettes/test.gds
## +    [  ]
## |--+ int   { Int32 10000 }
## |--+ double   { Float64 2498 }
## |--+ character   { VStr8 4 }
## |--+ bit2   { Bit2 1000 }
## |--+ list   [ list ] *
## |  |--+ X   { Int32 10 }
## |  |--+ Y   { Float64 37 }
## |--+ data.frame   [ data.frame ] *
## |  |--+ X   { Int32 19 }
## |  |--+ Y   { Float64 19 }
## |--+ folder   [  ]
## |  |--+ int   { Int32 1000 }

print(gfile, ...) has an argument all to control the display of file content. By default, all=FALSE; if all=TRUE, to show all contents in the file including hidden variables or folders. The GDS variables logical, factor and folder/double are hidden.

print(gfile, all=TRUE)

## File: /tmp/RtmpyLuaNY/Rbuild1eb38615761/gdsfmt/vignettes/test.gds
## +    [  ]
## |--+ int   { Int32 10000 }
## |--+ double   { Float64 2498 }
## |--+ character   { VStr8 4 }
## |--+ logical   { Int32,logical 150 } *
## |--+ factor   { Int32,factor 3 } *
## |--+ bit2   { Bit2 1000 }
## |--+ list   [ list ] *
## |  |--+ X   { Int32 10 }
## |  |--+ Y   { Float64 37 }
## |--+ data.frame   [ data.frame ] *
## |  |--+ X   { Int32 19 }
## |  |--+ Y   { Float64 19 }
## |--+ folder   [  ]
## |  |--+ int   { Int32 1000 }
## |  |--+ double   { Float64 248 } *

The asterisk indicates attributes attached to a GDS variable. The attributes can be used in the R environment to interpret the variable as logical, factor, data.frame or list.

index.gdsn() can locate the GDS variable by a path:

index.gdsn(gfile, "int")

## + int   { Int32 10000 }

index.gdsn(gfile, "list/Y")

## + list/Y   { Float64 37 }

index.gdsn(gfile, "folder/int")

## + folder/int   { Int32 1000 }

# close the GDS file
closefn.gds(gfile)

Writing Data

Array-oriented data sets can be written to the GDS file. There are three possible ways to write data to a GDS variable.

gfile <- createfn.gds("test.gds")

R function add.gdsn

Users could pass an R variable to the function add.gdsn() directly. read.gdsn() can read data from the GDS variable.

(n <- add.gdsn(gfile, "I1", val=matrix(1:15, nrow=3)))

## + I1   { Int32 3x5 }

read.gdsn(n)

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    4    7   10   13
## [2,]    2    5    8   11   14
## [3,]    3    6    9   12   15

R function write.gdsn

Users can specify the arguments start and count to write a subset of data. -1 in count means the size of that dimension, and the corresponding element in start should be 1. The values in start and cound should be in the dimension range.

write.gdsn(n, rep(0,5), start=c(2,1), count=c(1,-1))
read.gdsn(n)

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    4    7   10   13
## [2,]    0    0    0    0    0
## [3,]    3    6    9   12   15

R function append.gdsn

Users can append new data to an existing GDS variable.

append.gdsn(n, 16:24)
read.gdsn(n)

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    4    7   10   13   16   19   22
## [2,]    0    0    0    0    0   17   20   23
## [3,]    3    6    9   12   15   18   21   24

Create a large-scale data set

When the size of dataset is larger than the system memory, users can not add a GDS variable via add.gdsn() directly. If the dimension is pre-defined, users can specify the dimension size in add.gdsn() to allocate data space. Then call write.gdsn() to write a small subset of data space.

(n2 <- add.gdsn(gfile, "I2", storage="int", valdim=c(100, 2000)))

## + I2   { Int32 100x2000 }

for (i in 1:2000)
{
    write.gdsn(n2, seq.int(100*(i-1)+1, length.out=100),
        start=c(1,i), count=c(-1,1))
}

Or call append.gdsn() to append new data when the initial size is ZERO. If a compression algorithm is specified in add.gdsn() (e.g., compress=“ZIP”), users should call append.gdsn() instead of write.gdsn(), since data has to be compressed sequentially.

(n3 <- add.gdsn(gfile, "I3", storage="int", valdim=c(100, 0), compress="ZIP"))

## + I3   { Int32 100x0 ZIP }

for (i in 1:2000)
{
    append.gdsn(n3, seq.int(100*(i-1)+1, length.out=100))
}

n3

## + I3   { Int32 100x2000 ZIP(32.29%) }

# close the GDS file
closefn.gds(gfile)

Reading Data

gfile <- createfn.gds("test.gds")
add.gdsn(gfile, "I1", val=matrix(1:20, nrow=4))
add.gdsn(gfile, "I2", val=1:100)
closefn.gds(gfile)

read.gdsn() can load all data to an R variable in memory.

gfile <- openfn.gds("test.gds")
n <- index.gdsn(gfile, "I1")

read.gdsn(n)

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    9   13   17
## [2,]    2    6   10   14   18
## [3,]    3    7   11   15   19
## [4,]    4    8   12   16   20

Subset reading read.gdsn and readex.gdsn

A subset of data can be specified via the arguments start and count in the R function read.gdsn. Or specify a list of logical vectors in readex.gdsn().

# read a subset
read.gdsn(n, start=c(2, 2), count=c(2, 3))

##      [,1] [,2] [,3]
## [1,]    6   10   14
## [2,]    7   11   15

# read a subset
readex.gdsn(n, list(c(FALSE,TRUE,TRUE,FALSE), c(TRUE,FALSE,TRUE,FALSE,TRUE)))

##      [,1] [,2] [,3]
## [1,]    2   10   18
## [2,]    3   11   19

Apply a user-defined function marginally

A user-defined function can be applied marginally to a GDS variable via apply.gdsn(). margin=1 indicates applying the function row by row, and margin=2 for applying the function column by column.

apply.gdsn(n, margin=1, FUN=print, as.is="none")

## [1]  1  5  9 13 17
## [1]  2  6 10 14 18
## [1]  3  7 11 15 19
## [1]  4  8 12 16 20

apply.gdsn(n, margin=2, FUN=print, as.is="none")

## [1] 1 2 3 4
## [1] 5 6 7 8
## [1]  9 10 11 12
## [1] 13 14 15 16
## [1] 17 18 19 20

# close the GDS file
closefn.gds(gfile)

Examples

To create a simple GDS file:

gfile <- createfn.gds("test.gds")
n1 <- add.gdsn(gfile, "I1", val=1:100)
n2 <- add.gdsn(gfile, "I2", val=matrix(1:20, nrow=4))

gfile

## File: /tmp/RtmpyLuaNY/Rbuild1eb38615761/gdsfmt/vignettes/test.gds
## +    [  ]
## |--+ I1   { Int32 100 }
## |--+ I2   { Int32 4x5 }

Output to a text file

apply.gdsn() can be used to export a GDS variable to a text file. If the GDS variable is a vector:

fout <- file("text.txt", "wt")
apply.gdsn(n1, 1, FUN=cat, as.is="none", file=fout, fill=TRUE)
close(fout)

scan("text.txt")

##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
##  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
##  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
##  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100

The arguments file and fill are defined in the function cat().

If the GDS variable is a matrix:

fout <- file("text.txt", "wt")
apply.gdsn(n2, 1, FUN=cat, as.is="none", file=fout, fill=32768)
close(fout)

readLines("text.txt")

## [1] "1 5 9 13 17"  "2 6 10 14 18" "3 7 11 15 19" "4 8 12 16 20"

The number 32768 is the maximum number of columns on a line used in printing vectors.

Transpose a matrix

apply.gdsn() allows that the data returned from the user-defined function FUN is directly written to a target GDS node target.node, when as.is=“gdsnode” and target.node are both given. Little c in R is a generic function which combines its arguments, and it passes all data to the target GDS node in the following code:

n.t <- add.gdsn(gfile, "transpose", storage="int", valdim=c(5,0))

# apply the function over rows of matrix
apply.gdsn(n2, margin=1, FUN=c, as.is="gdsnode", target.node=n.t)

# matrix transpose
read.gdsn(n.t)

##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
## [4,]   13   14   15   16
## [5,]   17   18   19   20

# close the GDS file
closefn.gds(gfile)

Floating-point number vs. packed real number

In computing, floating point is a method of representing an approximation of a real number in a way that can support a trade-off between range and precision, which can be represented exactly is of the following form “significand \(\times\) 2^exponent”. A packed real number in GDS format is defined as “int \(\times\) scale \(+\) offset”, where int can be a 8-bit, 16-bit or 32-bit signed interger. In some cases, the strategy of packed real numbers can significantly improve the compression ratio for floating-point numbers.

set.seed(1000)
val <- sample(seq(0,1,0.001), 50000, replace=TRUE)
head(val)

## [1] 0.328 0.759 0.114 0.691 0.516 0.067

gfile <- createfn.gds("test.gds")

add.gdsn(gfile, "N1", val=val)
add.gdsn(gfile, "N2", val=val, compress="ZIP", closezip=TRUE)
add.gdsn(gfile, "N3", val=val, storage="float")
add.gdsn(gfile, "N4", val=val, storage="float", compress="ZIP", closezip=TRUE)
add.gdsn(gfile, "N5", val=val, storage="packedreal16", scale=0.001)
add.gdsn(gfile, "N6", val=val, storage="packedreal16", scale=0.001, compress="ZIP", closezip=TRUE)

gfile

## File: /tmp/RtmpyLuaNY/Rbuild1eb38615761/gdsfmt/vignettes/test.gds
## +    [  ]
## |--+ N1   { Float64 50000 }
## |--+ N2   { Float64 50000 ZIP(24.76%) }
## |--+ N3   { Float32 50000 }
## |--+ N4   { Float32 50000 ZIP(46.77%) }
## |--+ N5   { PackedReal16 50000 }
## |--+ N6   { PackedReal16 50000 ZIP(76.00%) }

Variable	Type	Compression Method	Size	Ratio	Machine epsilon¹
N1	64-bit floating-point number	—	400.0 KB	100.0%	0
N2	64-bit floating-point number	ZIP	99.0 KB	24.8%	0
N3	32-bit floating-point number	—	200.0 KB	50.0%	9.89e-09
N4	32-bit floating-point number	ZIP	93.5 KB	23.4%	9.89e-09
N5	16-bit packed real number	—	100.0 KB	25.0%	0
N6	16-bit packed real number	ZIP	76.0 KB	19.0%	0

¹: the relative error due to rounding in floating point arithmetic.

# close the GDS file
closefn.gds(gfile)

Limited random-access reading on compressed data

10,000,000 random 0,1 sequence of 32-bit integers
- in each 32 bits, one bit stores random 0,1 and others are ZERO
- lower bound of compression percentage is 1/32 = 3.125%
Testing:
- of 10,000 random positions, read a 32-bit integer
- compression ratio is maximized for each method
- compression method: None, ZIP, ZIP_RA, LZ4 and LZ4_RA
- ZIP_RA and LZ4_RA: data are composed of multiple independent compressed blocks

set.seed(100)
# 10,000,000 random 0,1 sequence of 32-bit integers
v <- sample.int(2, 10*1000*1000, replace=TRUE) - 1L
table(v)

## v
##       0       1 
## 4999138 5000862

# cteate a GDS file
f <- createfn.gds("test.gds")

# compression algorithms
compression <- c("", "ZIP.max", "ZIP_RA.max:16K", "LZ4.max", "LZ4_RA.max:16K")

# save
for (i in 1:length(compression))
    print(add.gdsn(f, paste0("I", i), val=v, compress=compression[i], closezip=TRUE))

# close the file
closefn.gds(f)

System configuration:
- MacBook Pro, Retina, 13-inch, Late 2013, 2.8 GHz Intel Core i7, 16 GB 1600 MHz DDR3

# open the GDS file
f <- openfn.gds("test.gds")

# 10,000 random positions
set.seed(1000)
idx <- sample.int(length(v), 10000)

# for-loop each compression method
for (i in 1:length(n))
{
    cat("Compression:", compression[i], "\n")
    print(system.time({
        vv <- rep.int(0, length(idx))
        for (k in 1:length(idx))
            vv[k] <- read.gdsn(n[[i]], start=idx[k], count=1)
    }))
}

# close the file
closefn.gds(f)

Compression Method	Raw	ZIP	ZIP_RA	LZ4	LZ4_RA
Compression Percent	100%	5.08%	5.42%	6.98%	7.33%
Reading Time (second)	0.35	203.99	3.54	90.03	1.47

Session Info

sessionInfo()

## R version 3.2.0 (2015-04-16)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] gdsfmt_1.4.0
## 
## loaded via a namespace (and not attached):
## [1] formatR_1.1     htmltools_0.2.6 tools_3.2.0     yaml_2.1.13    
## [5] rmarkdown_0.5.1 knitr_1.9       digest_0.6.8    stringr_0.6.2  
## [9] evaluate_0.6

R Interface to CoreArray Genomic Data Structure (GDS) Files

Xiuwen Zheng

Feb 26, 2015