GDSArray 1.0.0
GDSArray is a Bioconductor package that represents GDS files as
objects derived from the DelayedArray package and DelayedArray
class. It converts a GDS node in the file to a DelayedArray
-derived
data structure. The rich common methods and data operations defined on
GDSArray
makes it more R-user-friendly than working with the GDS
file directly. The array data from GDS files are always returned with
the first dimension being variants/snps
and the second dimension
being samples
. This feature is consistent with the assay data saved
in SummarizedExperiment
, and makes the GDSArray
package
interoperable with other established Bioconductor data
infrastructure.
source("https://bioconductor.org/biocLite.R")
biocLite("GDSArray")
library(GDSArray)
The Bioconductor package gdsfmt has provided a high-level R interface to CoreArray Genomic Data Structure (GDS) data files, which is designed for large-scale datasets, especially for data which are much larger than the available random-access memory.
The GDS format has been widely used in genetic/genomic research for
high-throughput genotyping or sequencing data. There are two major
classes that extends the gds.class
: SNPGDSFileClass
suited for
genotyping data (e.g., GWAS), and SeqVarGDSClass
that are designed
specifically for DNA-sequencing data. The file format attribute in
each data class is set as SNP_ARRAY
and SEQ_ARRAY
. There are rich
functions written based on these data classes for common data
operation and statistical analysis.
More details about GDS format can be found in the vignettes of the gdsfmt, SNPRelate, and SeqArray packages.
GDSArray
, GDSMatrix
, and GDSFile
GDSArray
represents GDS files as DelayedArray
instances. It has
methods like dim
, dimnames
defined, and it inherits array-like
operations and methods from DelayedArray
, e.g., the subsetting
method of [
.
The GDSArray()
constructor takes as arguments the file path and the
GDS node inside the GDS file. The GDSArray()
constructor always
returns the object with rows being features (genes / variants / snps)
and the columns being “samples”. This is consistent with the assay
data inside SummarizedExperiment
.
file <- SeqArray::seqExampleFileName("gds")
GDSArray(file, "genotype/data")
## <1348 x 90 x 2> GDSArray object of type "integer":
## , , 1
## [,1] [,2] [,3] [,4] ... [,87] [,88] [,89] [,90]
## [1,] 3 3 0 3 . 0 0 0 0
## [2,] 3 3 0 3 . 0 0 0 0
## ... . . . . . . . . .
## [1347,] 0 0 0 0 . 0 0 0 0
## [1348,] 3 3 0 3 . 3 3 3 3
##
## , , 2
## [,1] [,2] [,3] [,4] ... [,87] [,88] [,89] [,90]
## [1,] 3 3 0 3 . 0 0 0 0
## [2,] 3 3 0 3 . 0 0 0 0
## ... . . . . . . . . .
## [1347,] 0 0 0 0 . 0 0 0 0
## [1348,] 3 3 1 3 . 3 3 3 3
A GDSMatrix
is a 2-dimensional GDSArray
, and will be returned from
the GDSArray()
constructor automatically if the input GDS node is
2-dimensional.
GDSArray(file, "phase/data")
## <1348 x 90> GDSMatrix object of type "integer":
## NA06984 NA06985 NA06986 NA06989 ... NA12889 NA12890 NA12891 NA12892
## 1 0 0 0 0 . 0 0 0 0
## 2 0 0 0 0 . 0 0 0 0
## 3 0 0 0 0 . 0 0 0 0
## ... . . . . . . . . .
## 1346 0 0 0 0 . 0 0 0 0
## 1347 0 0 0 0 . 0 0 0 0
## 1348 0 0 0 0 . 0 0 0 0
GDSFile
The GDSFile
is a light-weight class to represent GDS files. It
has the $
completion method to complete any possible gds nodes. It
could be used as a convenient GDSArray
constructor if the slot of
current_path
in GDSFile
object represents a valid gds node.
Otherwise, it will return the GDSFile
object with an updated current_path
.
gf <- GDSFile(file)
gf$annotation$info
## class: GDSFile
## file: /home/biocbuild/bbs-3.7-bioc/R/library/SeqArray/extdata/CEU_Exon.gds
## current node: annotation/info
## subnodes:
## annotation/info/AA
## annotation/info/AC
## annotation/info/AN
## annotation/info/DP
## annotation/info/HM2
## annotation/info/HM3
## annotation/info/OR
## annotation/info/GP
## annotation/info/BN
gf$annotation$info$AC
## <1348> GDSArray object of type "integer":
## 1 2 3 4 . 1345 1346 1347 1348
## 4 1 6 128 . 2 11 1 1
Try typing in gf$ann
and pressing tab
key for the completion.
GDSArray
methodsseed
returns the GDSArraySeed
of the GDSArray
object.ga <- GDSArray(file, "genotype/data")
seed(ga)
## GDSArraySeed
## gds file: /home/biocbuild/bbs-3.7-bioc/R/library/SeqArray/extdata/CEU_Exon.gds
## array data: genotype/data
## dim: 1348 x 90 x 2
gdsfile
returns the file path of the corresponding GDS file.gdsfile(ga)
## [1] "/home/biocbuild/bbs-3.7-bioc/R/library/SeqArray/extdata/CEU_Exon.gds"
gdsnodes()
takes the GDS file path or GDSFile
object as input, and
returns all nodes that can be converted to GDSArray
instances. The
returned GDS node names can be used as input for the GDSArray(name=)
constructor.
gdsnodes(file)
## [1] "description" "sample.id"
## [3] "variant.id" "position"
## [5] "chromosome" "allele"
## [7] "genotype/data" "genotype/~data"
## [9] "genotype/extra.index" "genotype/extra"
## [11] "phase/data" "phase/~data"
## [13] "phase/extra.index" "phase/extra"
## [15] "annotation/id" "annotation/qual"
## [17] "annotation/filter" "annotation/info/AA"
## [19] "annotation/info/AC" "annotation/info/AN"
## [21] "annotation/info/DP" "annotation/info/HM2"
## [23] "annotation/info/HM3" "annotation/info/OR"
## [25] "annotation/info/GP" "annotation/info/BN"
## [27] "annotation/format/DP/data" "annotation/format/DP/~data"
## [29] "sample.annotation/family"
identical(gdsnodes(file), gdsnodes(gf))
## [1] FALSE
GDSArray(file, name=gdsnodes(file)[2])
## <90> GDSArray object of type "character":
## NA06984 NA06985 NA06986 . NA12891 NA12892
## "NA06984" "NA06985" "NA06986" . "NA12891" "NA12892"
dim()
, dimnames()
The dimnames(GDSArray)
returns an unnamed list, with the length of
each element to be the same as return from dim(GDSArray)
.
ga <- GDSArray(file, "phase/data")
dim(ga)
## [1] 1348 90
class(dimnames(ga))
## [1] "list"
lengths(dimnames(ga))
## variant.id sample.id
## 1348 90
[
subsettingGDSArray
instances can be subset, following the usual R
conventions, with numeric or logical vectors; logical vectors are
recycled to the appropriate length.
ga[1:3, 10:15]
## <3 x 6> DelayedMatrix object of type "integer":
## NA07346 NA07347 NA07357 NA10847 NA10851 NA11829
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
ga[c(TRUE, FALSE), ]
## <674 x 90> DelayedMatrix object of type "integer":
## NA06984 NA06985 NA06986 NA06989 ... NA12889 NA12890 NA12891 NA12892
## 1 0 0 0 0 . 0 0 0 0
## 3 0 0 0 0 . 0 0 0 0
## 5 0 0 0 0 . 0 0 0 0
## ... . . . . . . . . .
## 1343 0 0 0 0 . 0 0 0 0
## 1345 0 0 0 0 . 0 0 0 0
## 1347 0 0 0 0 . 0 0 0 0
dp <- GDSArray(file, "annotation/format/DP/data")
dp
## <1348 x 90> GDSMatrix object of type "integer":
## NA06984 NA06985 NA06986 NA06989 ... NA12889 NA12890 NA12891 NA12892
## 1 0 0 107 0 . 3 81 67 156
## 2 0 0 92 0 . 4 84 47 150
## 3 12 17 247 11 . 9 217 134 417
## ... . . . . . . . . .
## 1346 5 8 15 1 . 3 61 57 101
## 1347 4 7 26 1 . 4 92 71 144
## 1348 0 0 3 0 . 0 0 2 2
log(dp)
## <1348 x 90> DelayedMatrix object of type "double":
## NA06984 NA06985 NA06986 ... NA12891 NA12892
## 1 -Inf -Inf 4.672829 . 4.204693 5.049856
## 2 -Inf -Inf 4.521789 . 3.850148 5.010635
## 3 2.484907 2.833213 5.509388 . 4.897840 6.033086
## ... . . . . . .
## 1346 1.609438 2.079442 2.708050 . 4.0430513 4.6151205
## 1347 1.386294 1.945910 3.258097 . 4.2626799 4.9698133
## 1348 -Inf -Inf 1.098612 . 0.6931472 0.6931472
dp[rowMeans(dp) < 60, ]
## <413 x 90> DelayedMatrix object of type "integer":
## NA06984 NA06985 NA06986 NA06989 ... NA12889 NA12890 NA12891 NA12892
## 1 0 0 107 0 . 3 81 67 156
## 2 0 0 92 0 . 4 84 47 150
## 4 15 4 177 1 . 2 110 111 195
## ... . . . . . . . . .
## 1346 5 8 15 1 . 3 61 57 101
## 1347 4 7 26 1 . 4 92 71 144
## 1348 0 0 3 0 . 0 0 2 2
GDSArraySeed
The GDSArraySeed
class represents the ‘seed’ for the GDSArray
object. It is not exported from the GDSArray package. Seed objects
should contain the GDS file path, and are expected to satisfy the
“seed contract” i.e. to support dim() and dimnames().
seed <- GDSArray:::GDSArraySeed(file, "genotype/data")
seed
## GDSArraySeed
## gds file: /home/biocbuild/bbs-3.7-bioc/R/library/SeqArray/extdata/CEU_Exon.gds
## array data: genotype/data
## dim: 1348 x 90 x 2
The seed can be used to construct a GDSArray
instance.
GDSArray(seed)
## <1348 x 90 x 2> GDSArray object of type "integer":
## , , 1
## [,1] [,2] [,3] [,4] ... [,87] [,88] [,89] [,90]
## [1,] 3 3 0 3 . 0 0 0 0
## [2,] 3 3 0 3 . 0 0 0 0
## ... . . . . . . . . .
## [1347,] 0 0 0 0 . 0 0 0 0
## [1348,] 3 3 0 3 . 3 3 3 3
##
## , , 2
## [,1] [,2] [,3] [,4] ... [,87] [,88] [,89] [,90]
## [1,] 3 3 0 3 . 0 0 0 0
## [2,] 3 3 0 3 . 0 0 0 0
## ... . . . . . . . . .
## [1347,] 0 0 0 0 . 0 0 0 0
## [1348,] 3 3 1 3 . 3 3 3 3
The DelayedArray()
constructor with GDSArraySeed
object as
argument will return the same content as the GDSArray()
constructor
over the same GDSArraySeed
.
class(DelayedArray(seed))
## [1] "GDSArray"
## attr(,"package")
## [1] "GDSArray"
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.7-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.7-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] GDSArray_1.0.0 DelayedArray_0.6.0 BiocParallel_1.14.0
## [4] IRanges_2.14.0 S4Vectors_0.18.0 matrixStats_0.53.1
## [7] BiocGenerics_0.26.0 gdsfmt_1.16.0 BiocStyle_2.8.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.16 XVector_0.20.0 knitr_1.20
## [4] magrittr_1.5 zlibbioc_1.26.0 GenomicRanges_1.32.0
## [7] GenomeInfoDb_1.16.0 stringr_1.3.0 tools_3.5.0
## [10] xfun_0.1 SeqArray_1.20.0 htmltools_0.3.6
## [13] yaml_2.1.18 rprojroot_1.3-2 digest_0.6.15
## [16] SNPRelate_1.14.0 bookdown_0.7 GenomeInfoDbData_1.1.0
## [19] bitops_1.0-6 RCurl_1.95-4.10 evaluate_0.10.1
## [22] rmarkdown_1.9 stringi_1.1.7 compiler_3.5.0
## [25] Biostrings_2.48.0 backports_1.1.2