Overview

In the era of big data, thousands of gigabyte-size data sets are challenging scientists for data management, even on well-equipped hardware. Currently, next-generation sequencing techniques are being adopted to investigate common and rare variants, making the analyses of large-scale genotypic data challenging. For example, the 1000 Genomes Project Phase 1 has identified approximately 38 million single nucleotide polymorphisms (SNPs), 1.4 million short insertions and deletions, and more than 14,000 larger deletions from whole-genome sequencing technologies [1]. In the near future, new technologies, like third-generation whole-genome sequencing, will be enabling data to be generated at an unprecedented scale [2]. The Variant Call Format (VCF) was developed for the 1000 Genomes Project, which is a generic text format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations [3]. However, this format is less efficient for large-scale analyses since numeric data have to be parsed from a text VCF file before further analyses. The computational burden associated with sequence variants is especially evident with large sample and variant sizes, and it requires efficient numerical implementation and data management.

Here I introduce a high-performance C++ computing library CoreArray (http://corearray.sourceforge.net) for big-data management of genome-wide variants [4]. CoreArray was designed for developing portable and scalable storage technologies for bioinformatics data, allowing parallel computing at the multicore and cluster levels. It provides the genomic data structure (GDS) file format for array-oriented data: this is a universal data format to store multiple data variables in a single file. The CoreArray library modules are demonstrated in Figure 1. A hierarchical data structure is used to store multiple extensible data variables in the GDS format, and all datasets are stored in a single file with chunked storage layout. The C++ CoreArray library is implemented in the R/Bioconductor package gdsfmt with a tutorial http://corearray.sourceforge.net/tutorials/gdsfmt/.

Here, I focus on the application of CoreArray for statisticians working in the R environment, and developed R packages gdsfmt and SeqArray to address or reduce the computational burden associated with data management of sequence variants. The gdsfmt package provides a general R interface for the CoreArray library, and SeqArray is specifically designed for data management of sequence variants. The kernels of gdsfmt and SeqArray were written in C/C++ and highly optimized. Genotypic data and annotations are stored in a binary and array-oriented manner, offering efficient access of genetic variants using the R language. There are several key functions in SeqArray (Table 1), and most of data analyses could be done using these functions. The 1000 Genomes Project Phase 1 released 39 million genetic variants for 1092 individuals, and a 26G data file was created by SeqArray to store sequence variants with phasing information, where 2 bits were used as a primitive data type. The file size can be further reduced to 1.5G by compression algorithms without sacrificing access efficiency, since it has a large proportion of rare variants.

SeqArray will be of great interest to scientists involved in data analyses of large-scale genomic sequence data using R environment, particularly those with limited experience of low-level C programming and parallel computing.

Flowchart

Figure 1: CoreArray library modules.

Table 1: The key functions in the SeqArray package.

Function	Description
seqVCF2GDS	Imports VCF files.
seqSummary	Gets the summary (Num. of samples, Num. of variants, INFO/FORMAT variables, etc).
seqSetFilter	Sets a filter to sample or variant (i.e., define a subset of data).
seqGetData	Gets data from a GDS file (from a subset of data).
seqApply	Applies a user-defined function over array margins.
seqParallel	Applies functions in parallel.

Preparing Data

Data formats used in SeqArray

To support efficient memory management for genome-wide numerical data, the gdsfmt package provides the genomic data structure (GDS) file format for array-oriented bioinformatic data based on the CoreArray library, which is a container for storing genotypic and annotation data. The GDS format supports data blocking so that only the subset of data that is being processed needs to reside in memory.

# load the R package SeqArray
library(SeqArray)

Here is a typical GDS file shipped with the SeqArray package:

gds.fn <- seqExampleFileName("gds")
# or gds.fn <- "C:/YourFolder/Your_GDS_File.gds"
gds.fn

## [1] "/tmp/RtmpkGyHWJ/Rinst6c674a987c26/SeqArray/extdata/CEU_Exon.gds"

seqSummary(gds.fn)

## File: /tmp/RtmpkGyHWJ/Rinst6c674a987c26/SeqArray/extdata/CEU_Exon.gds
## Format Version: v1.0
## Reference: human_b36_both.fasta
## Ploidy: 2
## Number of samples: 90
## Number of variants: 1348
## Chromosomes:
##    1   10   11   12   13   14   15   16   17   18   19    2   20   21   22 
##  142   70   16   62   11   61   46   84  100   54  111   59   59   23   23 
##    3    4    5    6    7    8    9 <NA> 
##   81   48   61   99   58   51   29    0 
## Number of alleles per site:
##    2    3 
## 1346    2 
## Annotation, INFO variable(s):
##  AA, ., String, Ancestral Allele
##  AC, 1, Integer, Total number of alternate alleles in called genotypes
##  AN, 1, Integer, Total number of alleles in called genotypes
##  DP, 1, Integer, Total Depth
##  HM2, 0, Flag, HapMap2 membership
##  HM3, 0, Flag, HapMap3 membership
##  OR, 1, String, Previous rs number
##  GP, 1, String, GRCh37 position(s)
##  BN, 1, Integer, First dbSNP build #
## Annotation, FORMAT variable(s):
##  DP, ., Integer, Read Depth from MOSAIK BAM
## Annotation, sample variable(s):
##  family

seqExampleFileName() returns the file name of a GDS file used as an example in SeqArray, and it is a subset of data from the 1000 Genomes Project. seqSummary() summarizes the genotypes and annotations stored in the GDS file.

# open a GDS file
genofile <- seqOpen(gds.fn)

# display the contents of the GDS file in a hierarchical structure
genofile

## Object of class "SeqVarGDSClass"
## File: /tmp/RtmpkGyHWJ/Rinst6c674a987c26/SeqArray/extdata/CEU_Exon.gds (396.3 KB)
## +    [  ] *
## |--+ description   [  ] *
## |--+ sample.id   { VStr8 90 ZIP_RA(30.83%), 222 bytes }
## |--+ variant.id   { Int32 1348 ZIP_RA(35.72%), 1.9 KB }
## |--+ position   { Int32 1348 ZIP_RA(86.44%), 4.7 KB }
## |--+ chromosome   { VStr8 1348 ZIP_RA(2.66%), 91 bytes }
## |--+ allele   { VStr8 1348 ZIP_RA(17.19%), 928 bytes }
## |--+ genotype   [  ] *
## |  |--+ data   { Bit2 2x90x1348 ZIP_RA(28.39%), 17.2 KB }
## |  |--+ ~data   { Bit2 2x1348x90 ZIP_RA(36.04%), 21.9 KB }
## |  |--+ extra.index   { Int32 3x0 ZIP_RA, 17 bytes } *
## |  |--+ extra   { Int16 0 ZIP_RA, 17 bytes }
## |--+ phase   [  ]
## |  |--+ data   { Bit1 90x1348 ZIP_RA(0.36%), 55 bytes }
## |  |--+ ~data   { Bit1 1348x90 ZIP_RA(0.36%), 55 bytes }
## |  |--+ extra.index   { Int32 3x0 ZIP_RA, 17 bytes } *
## |  |--+ extra   { Bit1 0 ZIP_RA, 17 bytes }
## |--+ annotation   [  ]
## |  |--+ id   { VStr8 1348 ZIP_RA(41.02%), 6.0 KB }
## |  |--+ qual   { Float32 1348 ZIP_RA(0.91%), 49 bytes }
## |  |--+ filter   { Int32,factor 1348 ZIP_RA(0.89%), 48 bytes } *
## |  |--+ info   [  ]
## |  |  |--+ AA   { VStr8 1348 ZIP_RA(24.22%), 653 bytes } *
## |  |  |--+ AC   { Int32 1348 ZIP_RA(27.23%), 1.5 KB } *
## |  |  |--+ AN   { Int32 1348 ZIP_RA(20.62%), 1.1 KB } *
## |  |  |--+ DP   { Int32 1348 ZIP_RA(62.57%), 3.4 KB } *
## |  |  |--+ HM2   { Bit1 1348 ZIP_RA(117.16%), 198 bytes } *
## |  |  |--+ HM3   { Bit1 1348 ZIP_RA(117.16%), 198 bytes } *
## |  |  |--+ OR   { VStr8 1348 ZIP_RA(13.98%), 238 bytes } *
## |  |  |--+ GP   { VStr8 1348 ZIP_RA(34.36%), 5.4 KB } *
## |  |  |--+ BN   { Int32 1348 ZIP_RA(21.64%), 1.2 KB } *
## |  |--+ format   [  ]
## |  |  |--+ DP   [  ] *
## |  |  |  |--+ data   { Int32 90x1348 ZIP_RA(33.83%), 164.2 KB }
## |  |  |  |--+ ~data   { Int32 1348x90 ZIP_RA(32.23%), 156.4 KB }
## |--+ sample.annotation   [  ]
## |  |--+ family   { VStr8 90 ZIP_RA(34.70%), 135 bytes }

For those who would like to know how variable-length genotypic data and annotations are stored in an array-oriented manner, print(, all=TRUE) displays all contents including hidden structures:

# display all contents of the GDS file
print(genofile, all=TRUE, attribute=TRUE, attribute.trim=FALSE)

## File: /tmp/RtmpkGyHWJ/Rinst6c674a987c26/SeqArray/extdata/CEU_Exon.gds (396.3 KB)
## +    [  ] *< FileFormat: SEQ_ARRAY; FileVersion: v1.0
## |--+ description   [  ] *< vcf.fileformat: VCFv4.0
## |  |--+ reference   { VStr8 1 ZIP_RA(223.81%), 47 bytes } *< R.invisible: NULL
## |--+ sample.id   { VStr8 90 ZIP_RA(30.83%), 222 bytes }
## |--+ variant.id   { Int32 1348 ZIP_RA(35.72%), 1.9 KB }
## |--+ position   { Int32 1348 ZIP_RA(86.44%), 4.7 KB }
## |--+ chromosome   { VStr8 1348 ZIP_RA(2.66%), 91 bytes }
## |--+ allele   { VStr8 1348 ZIP_RA(17.19%), 928 bytes }
## |--+ genotype   [  ] *< VariableName: GT; Description: Genotype
## |  |--+ data   { Bit2 2x90x1348 ZIP_RA(28.39%), 17.2 KB }
## |  |--+ ~data   { Bit2 2x1348x90 ZIP_RA(36.04%), 21.9 KB }
## |  |--+ @data   { UInt8 1348 ZIP_RA(2.82%), 38 bytes } *< R.invisible: NULL
## |  |--+ extra.index   { Int32 3x0 ZIP_RA, 17 bytes } *< R.colnames: sample.index, variant.index, length
## |  |--+ extra   { Int16 0 ZIP_RA, 17 bytes }
## |--+ phase   [  ]
## |  |--+ data   { Bit1 90x1348 ZIP_RA(0.36%), 55 bytes }
## |  |--+ ~data   { Bit1 1348x90 ZIP_RA(0.36%), 55 bytes }
## |  |--+ extra.index   { Int32 3x0 ZIP_RA, 17 bytes } *< R.colnames: sample.index, variant.index, length
## |  |--+ extra   { Bit1 0 ZIP_RA, 17 bytes }
## |--+ annotation   [  ]
## |  |--+ id   { VStr8 1348 ZIP_RA(41.02%), 6.0 KB }
## |  |--+ qual   { Float32 1348 ZIP_RA(0.91%), 49 bytes }
## |  |--+ filter   { Int32,factor 1348 ZIP_RA(0.89%), 48 bytes } *< R.class: factor; R.levels: PASS, q10; Description: All filters passed, Quality below 10
## |  |--+ info   [  ]
## |  |  |--+ AA   { VStr8 1348 ZIP_RA(24.22%), 653 bytes } *< Number: .; Type: String; Description: Ancestral Allele
## |  |  |--+ @AA   { Int32 1348 ZIP_RA(0.89%), 48 bytes } *< R.invisible: NULL
## |  |  |--+ AC   { Int32 1348 ZIP_RA(27.23%), 1.5 KB } *< Number: 1; Type: Integer; Description: Total number of alternate alleles in called genotypes
## |  |  |--+ AN   { Int32 1348 ZIP_RA(20.62%), 1.1 KB } *< Number: 1; Type: Integer; Description: Total number of alleles in called genotypes
## |  |  |--+ DP   { Int32 1348 ZIP_RA(62.57%), 3.4 KB } *< Number: 1; Type: Integer; Description: Total Depth
## |  |  |--+ HM2   { Bit1 1348 ZIP_RA(117.16%), 198 bytes } *< Number: 0; Type: Flag; Description: HapMap2 membership
## |  |  |--+ HM3   { Bit1 1348 ZIP_RA(117.16%), 198 bytes } *< Number: 0; Type: Flag; Description: HapMap3 membership
## |  |  |--+ OR   { VStr8 1348 ZIP_RA(13.98%), 238 bytes } *< Number: 1; Type: String; Description: Previous rs number
## |  |  |--+ GP   { VStr8 1348 ZIP_RA(34.36%), 5.4 KB } *< Number: 1; Type: String; Description: GRCh37 position(s)
## |  |  |--+ BN   { Int32 1348 ZIP_RA(21.64%), 1.2 KB } *< Number: 1; Type: Integer; Description: First dbSNP build #
## |  |--+ format   [  ]
## |  |  |--+ DP   [  ] *< Number: .; Type: Integer; Description: Read Depth from MOSAIK BAM
## |  |  |  |--+ data   { Int32 90x1348 ZIP_RA(33.83%), 164.2 KB }
## |  |  |  |--+ ~data   { Int32 1348x90 ZIP_RA(32.23%), 156.4 KB }
## |  |  |  |--+ @data   { Int32 1348 ZIP_RA(0.89%), 48 bytes } *< R.invisible: NULL
## |--+ sample.annotation   [  ]
## |  |--+ family   { VStr8 90 ZIP_RA(34.70%), 135 bytes }

# close the GDS file
seqClose(genofile)

The output lists all variables stored in the GDS file. At the first level, it stores variables sample.id, variant.id, etc. The additional information are displayed in the square brackets indicating data type, size, compressed or not + compression ratio, where Bit2 indicates that each byte encodes up to four alleles since one byte consists of eight bits, and VStr8 indicates variable-length character. The annotations are stored in the directory annotation, which includes the fields of ID, QUAL, FILTER, INFO and FORMAT corresponding to the original VCF file(s). All of the functions in SeqArray require a minimum set of variables in the annotation data:

sample.id, a unique identifier for each sample.
variant.id, a unique identifier for each variant.
position, integer, the base position of each variant on the chromosome, and 0 or NA for unknown position.
chromosome, character, the chromosome code, e.g., 1-22 for autosomes, X, Y, XY (the pseudoautosomal region), M (the mitochondrial probes), and `` (a blank string) for probes with unknown chromosome.
allele, character, reference and alternative alleles using comma as a separator.
genotype, a folder:
1. data, a 3-dimensional array for genotypic data, the first dimension refers to the number of ploidy, the second is sample and the third is variant.
2. ~data, an optional variable, the transposed array according to data, the second dimension is variant and the third is sample.
3. @data, the index for the variable data, and the prefix @ is used to indicate the index. It should be equal-size as variant.id, which is used to specify the data size of each variant.
4. extra.index, an index (3-by-$*$) matrix for triploid call (look like 0/0/1 in the VCF file). E.g., for 0/0/1, the first two alleles are stored in data, and the last allele is saved in the variable extra. For each column of extra.index, the first value is the index of sample (starting from 1), the second value is the index of variant (starting from 1), and the last value is how many alleles remain (usually it is 1 since the first two alleles are stored in data) that indicates how many alleles stored in extra contiguously.
5. extra, one-dimensional array, the additional data for triploid call, each allele block corresponds to each column in extra.index.

The optional folders include phase (phasing information), annotation, and sample.annotation.

The folder phase includes:

data, a matrix or 3-dimensional array for phasing information. 0 for unphased status and 1 for phased status. If it is a matrix, the first dimension is sample and the second is variant, corresponding to diploid genotypic data. If it is a 3-dimensional array, the first dimension refers to the number of ploidy minus one. More details about / and | in a VCF file can be founded: VCF format.
~data, an optional variable, the transposed array according to data, the second dimension is variant and the third is sample.
extra.index, an index (3-by-$*$) matrix for triploid call (look like 0/0/1 in the VCF file). E.g., for 0/0/1, the first separator (/ here) is stored in data, and the last separator is saved in the variable extra. For each column of extra.index, the first value is the index of sample (starting from 1), the second value is the index of variant (starting from 1), and the last value is how many separators remain (usually it is 1 since the first separator is stored in data) that indicates how many separator stored in extra contiguously.
extra, one-dimensional array, the additional data of separator indicator for triploid call, each separator block corresponds to each column in extra.index.

The folder annotation includes:

id, ID semi-colon separated list of unique identifiers where available. If this is a dbSNP variant it is encouraged to use the rs number(s). No identifier should be present in more than one data record. If there is no identifier available, then a blank string is used.
qual, phred-scaled quality score for the assertion made in ALT.
filter, PASS if this position has passed all filters, i.e. a call is made at this position.
info, additional information for each variant, according to the INFO field in a VCF file,
1. VARIABLE_NAME, variable. If it is fixed-length, missing value indicates that there is no entry for that variant in the VCF file.
2. @VARIABLE_NAME (optional). If VARIABLE_NAME is variable-length, one-dimensional array. The prefix @ is used to indicate the index data. It should be equal-size as variant.id, which is used to specify the data size of each variant.
3. OTHER_VARIABLES, …
format, additional information for each variant and sample, according to the FORMAT field in a VCF file,
1. VARIABLE_NAME, a folder,
  1. data, a $n_{samp}$-by-$*$ matrix.
  2. ~data, an optional variable, the transposed array according to data.
  3. @data, one-dimensional array, the index data for the variable data, and the prefix @ is used to indicate the index data. It should be equal-size as variant.id, which is used to specify the data size of each variant.
2. OTHER_VARIABLES, …

The folder sample.annotation contains variables of vector or matrix according to sample.id.

Format conversion from VCF files

The SeqArray package provides a function seqVCF2GDS() to reformat a VCF file, and it allows merging multiple VCF files during format conversion. The genotypic and annotation data are stored in a compressed manner.

# the VCF file, using the example in the SeqArray package
vcf.fn <- seqExampleFileName("vcf")
# or vcf.fn <- "C:/YourFolder/Your_VCF_File.vcf"
vcf.fn

## [1] "/tmp/RtmpkGyHWJ/Rinst6c674a987c26/SeqArray/extdata/CEU_Exon.vcf.gz"

# parse the header
seqVCF.Header(vcf.fn)

## $fileformat
## [1] "VCFv4.0"
## 
## $info
##    ID Number    Type                                           Description
## 1  AA      .  String                                      Ancestral Allele
## 2  AC      1 Integer Total number of alternate alleles in called genotypes
## 3  AN      1 Integer           Total number of alleles in called genotypes
## 4  DP      1 Integer                                           Total Depth
## 5 HM2      0    Flag                                    HapMap2 membership
## 6 HM3      0    Flag                                    HapMap3 membership
## 7  OR      1  String                                    Previous rs number
## 8  GP      1  String                                    GRCh37 position(s)
## 9  BN      1 Integer                                   First dbSNP build #
##   Source Version
## 1     NA      NA
## 2     NA      NA
## 3     NA      NA
## 4     NA      NA
## 5     NA      NA
## 6     NA      NA
## 7     NA      NA
## 8     NA      NA
## 9     NA      NA
## 
## $filter
##     ID        Description
## 1 PASS All filters passed
## 2  q10   Quality below 10
## 
## $format
##   ID Number    Type                Description
## 1 GT      1  String                   Genotype
## 2 DP      . Integer Read Depth from MOSAIK BAM
## 
## $alt
## NULL
## 
## $contig
## NULL
## 
## $assembly
## NULL
## 
## $reference
## [1] "human_b36_both.fasta"
## 
## $header
## [1] id    value
## <0 rows> (or 0-length row.names)
## 
## $ploidy
## [1] 2
## 
## attr(,"class")
## [1] "SeqVCFHeaderClass"

The columns Number, Type and Description are defined by the 1000 Genomes Project: VCF format. Briefly, the Number entry is an Integer that describes the number of values that can be included with the INFO field. For example, if the INFO field contains a single number, then this value should be 1; if the INFO field describes a pair of numbers, then this value should be 2 and so on. If the field has one value per alternate allele then this value should be A; if the field has one value for each possible genotype then this value should be G. If the number of possible values varies, is unknown, or is unbounded, then this value should be .. The Flag type indicates that the INFO field does not contain an entry, and hence the Number should be 0 in this case. Possible Types for FORMAT fields are: Integer, Float, Character, and String (this field is otherwise defined precisely as the INFO field).

# convert, save in "tmp.gds"
seqVCF2GDS(vcf.fn, "tmp.gds")

## Sun Nov 22 19:42:33 2015
## The Variant Call Format (VCF) header:
##  file format: VCFv4.0
##  the number of sets of chromosomes (ploidy): 2
##  the number of samples: 90
##  GDS genotype storage: bit2
## Parsing "/tmp/RtmpkGyHWJ/Rinst6c674a987c26/SeqArray/extdata/CEU_Exon.vcf.gz" ...
## + genotype/data   { Bit2 2x90x1348 ZIP_RA, 17 bytes }
## Done.
## Sun Nov 22 19:42:33 2015
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
##  open the file "tmp.gds" (size: 233233)
##  # of fragments in total: 157
##  save it to "tmp.gds.tmp"
##  rename "tmp.gds.tmp" (size: 232129)
##  # of fragments in total: 65
## Sun Nov 22 19:42:33 2015

# maximize the compression level
seqVCF2GDS(vcf.fn, "tmp.gds", storage.option=seqStorage.Option("ZIP_RA.max"))

## Sun Nov 22 19:42:33 2015
## The Variant Call Format (VCF) header:
##  file format: VCFv4.0
##  the number of sets of chromosomes (ploidy): 2
##  the number of samples: 90
##  GDS genotype storage: bit2
## Parsing "/tmp/RtmpkGyHWJ/Rinst6c674a987c26/SeqArray/extdata/CEU_Exon.vcf.gz" ...
## + genotype/data   { Bit2 2x90x1348 ZIP_RA, 17 bytes }
## Done.
## Sun Nov 22 19:42:34 2015
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
##  open the file "tmp.gds" (size: 218146)
##  # of fragments in total: 157
##  save it to "tmp.gds.tmp"
##  rename "tmp.gds.tmp" (size: 217042)
##  # of fragments in total: 65
## Sun Nov 22 19:42:34 2015

# maximize the access speed with LZ4 fast compression algorithm
seqVCF2GDS(vcf.fn, "tmp.gds", storage.option=seqStorage.Option("LZ4_RA.max"))

## Sun Nov 22 19:42:34 2015
## The Variant Call Format (VCF) header:
##  file format: VCFv4.0
##  the number of sets of chromosomes (ploidy): 2
##  the number of samples: 90
##  GDS genotype storage: bit2
## Parsing "/tmp/RtmpkGyHWJ/Rinst6c674a987c26/SeqArray/extdata/CEU_Exon.vcf.gz" ...
## + genotype/data   { Bit2 2x90x1348 LZ4_RA, 21 bytes }
## Done.
## Sun Nov 22 19:42:34 2015
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
##  open the file "tmp.gds" (size: 267478)
##  # of fragments in total: 157
##  save it to "tmp.gds.tmp"
##  rename "tmp.gds.tmp" (size: 266374)
##  # of fragments in total: 65
## Sun Nov 22 19:42:34 2015

seqSummary("tmp.gds")

## File: /tmp/RtmpkGyHWJ/Rbuild6c672c17d65c/SeqArray/vignettes/tmp.gds
## Format Version: v1.0
## Reference: human_b36_both.fasta
## Ploidy: 2
## Number of samples: 90
## Number of variants: 1348
## Chromosomes:
##    1   10   11   12   13   14   15   16   17   18   19    2   20   21   22 
##  142   70   16   62   11   61   46   84  100   54  111   59   59   23   23 
##    3    4    5    6    7    8    9 <NA> 
##   81   48   61   99   58   51   29    0 
## Number of alleles per site:
##    2    3 
## 1346    2 
## Annotation, INFO variable(s):
##  AA, ., String, Ancestral Allele
##  AC, 1, Integer, Total number of alternate alleles in called genotypes
##  AN, 1, Integer, Total number of alleles in called genotypes
##  DP, 1, Integer, Total Depth
##  HM2, 0, Flag, HapMap2 membership
##  HM3, 0, Flag, HapMap3 membership
##  OR, 1, String, Previous rs number
##  GP, 1, String, GRCh37 position(s)
##  BN, 1, Integer, First dbSNP build #
## Annotation, FORMAT variable(s):
##  DP, ., Integer, Read Depth from MOSAIK BAM
## Annotation, sample variable(s):
##

Export to a VCF File

The SeqArray package provides a function seqGDS2VCF() to export data to a VCF file. The arguments info.var and fmt.var in seqGDS2VCF allow users to specify the variables listed in the INFO and FORMAT fields of VCF format, or remove the INFO and FORMAT information. seqSetFilter() can be used to define a subset of data for the export.

# the file of GDS
gds.fn <- seqExampleFileName("gds")
# or gds.fn <- "C:/YourFolder/Your_GDS_File.gds"

# open a GDS file
genofile <- seqOpen(gds.fn)

# convert
seqGDS2VCF(genofile, "tmp.vcf.gz")

## Sun Nov 22 19:42:34 2015
## Output: tmp.vcf.gz
## The INFO field: AA, AC, AN, DP, HM2, HM3, OR, GP, BN
## The FORMAT field: DP
## Done.
## Sun Nov 22 19:42:35 2015

# read
z <- readLines("tmp.vcf.gz", n=20)
for (i in z) cat(substr(i, 1, 76), " ...\n", sep="")

## ##fileformat=VCFv4.0 ...
## ##fileDate=20151122 ...
## ##source=SeqArray_Format_v1.0 ...
## ##INFO=<ID=AA,Number=.,Type=String,Description="Ancestral Allele"> ...
## ##INFO=<ID=AC,Number=1,Type=Integer,Description="Total number of alternate a ...
## ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in  ...
## ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ...
## ##INFO=<ID=HM2,Number=0,Type=Flag,Description="HapMap2 membership"> ...
## ##INFO=<ID=HM3,Number=0,Type=Flag,Description="HapMap3 membership"> ...
## ##INFO=<ID=OR,Number=1,Type=String,Description="Previous rs number"> ...
## ##INFO=<ID=GP,Number=1,Type=String,Description="GRCh37 position(s)"> ...
## ##INFO=<ID=BN,Number=1,Type=Integer,Description="First dbSNP build #"> ...
## ##FILTER=<ID=PASS,Description="All filters passed"> ...
## ##FILTER=<ID=q10,Description="Quality below 10"> ...
## ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ...
## ##FORMAT=<ID=DP,Number=.,Type=Integer,Description="Read Depth from MOSAIK BA ...
## #CHROM   POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA06984 NA06985 NA06986 NA0698 ...
## 1    1105366 rs111751804 T   C   .   PASS    AA=T;AC=4;AN=114;DP=3251;GP=1:1115503;BN=13 ...
## 1    1105411 rs114390380 G   A   .   PASS    AA=G;AC=1;AN=106;DP=2676;GP=1:1115548;BN=13 ...
## 1    1110294 rs1320571   G   A   .   PASS    AA=A;AC=6;AN=154;DP=7610;HM2;HM3;GP=1:1120431 ...

# output BN,GP,AA,HM2 in INFO (the variables are in this order), no FORMAT
seqGDS2VCF(genofile, "tmp2.vcf.gz", info.var=c("BN","GP","AA","HM2"),
    fmt.var=character())

## Sun Nov 22 19:42:35 2015
## Output: tmp2.vcf.gz
## The INFO field: BN, GP, AA, HM2
## The FORMAT field: 
## Done.
## Sun Nov 22 19:42:35 2015

# read
z <- readLines("tmp2.vcf.gz", n=15)
for (i in z) cat(substr(i, 1, 56), " ...\n", sep="")

## ##fileformat=VCFv4.0 ...
## ##fileDate=20151122 ...
## ##source=SeqArray_Format_v1.0 ...
## ##INFO=<ID=BN,Number=1,Type=Integer,Description="First d ...
## ##INFO=<ID=GP,Number=1,Type=String,Description="GRCh37 p ...
## ##INFO=<ID=AA,Number=.,Type=String,Description="Ancestra ...
## ##INFO=<ID=HM2,Number=0,Type=Flag,Description="HapMap2 m ...
## ##FILTER=<ID=PASS,Description="All filters passed"> ...
## ##FILTER=<ID=q10,Description="Quality below 10"> ...
## ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genoty ...
## #CHROM   POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA06984 NA ...
## 1    1105366 rs111751804 T   C   .   PASS    BN=132;GP=1:1115503;AA= ...
## 1    1105411 rs114390380 G   A   .   PASS    BN=132;GP=1:1115548;AA= ...
## 1    1110294 rs1320571   G   A   .   PASS    BN=88;GP=1:1120431;AA=A;H ...
## 1    3537996 rs2760321   T   C   .   PASS    BN=100;GP=1:3548136;AA=C; ...

# close the GDS file
seqClose(genofile)

Users can use diff, a command line tool in Unix-like systems, to compare files line by line, in order to confirm data consistency.

# assuming the original VCF file is old.vcf.gz,
# call "seqVCF2GDS" for the import and "seqGDS2VCF" for the export to create a new VCF file new.vcf.gz
$ diff <(zcat old.vcf.gz) <(zcat new.vcf.gz)
# OR
$ diff <(gunzip -c old.vcf.gz) <(gunzip -c new.vcf.gz)

1a2,3
> ##fileDate=20130309
> ##source=SeqArray_RPackage_v1.0

# LOOK GOOD! There are only two lines different, and both are in the header.

# delete temporary files
unlink(c("tmp.vcf.gz", "tmp1.vcf.gz", "tmp2.vcf.gz"))

Modification

The SeqArray package provides a function seqDelete() to remove data annotations in the INFO and FORMAT fields. It is suggested to use cleanup.gds() in the gdsfmt package after calling seqDelete() to reduce the file size. For example,

# the file of VCF
vcf.fn <- seqExampleFileName("vcf")
# or vcf.fn <- "C:/YourFolder/Your_VCF_File.vcf"

# convert
seqVCF2GDS(vcf.fn, "tmp.gds", verbose=FALSE)

# make sure that open with "readonly=FALSE"
genofile <- seqOpen("tmp.gds", readonly=FALSE)

# display the original structure
genofile

## Object of class "SeqVarGDSClass"
## File: /tmp/RtmpkGyHWJ/Rbuild6c672c17d65c/SeqArray/vignettes/tmp.gds (232.1 KB)
## +    [  ] *
## |--+ description   [  ] *
## |--+ sample.id   { VStr8 90 ZIP_RA(30.83%), 222 bytes }
## |--+ variant.id   { Int32 1348 ZIP_RA(35.68%), 1.9 KB }
## |--+ position   { Int32 1348 ZIP_RA(86.44%), 4.7 KB }
## |--+ chromosome   { VStr8 1348 ZIP_RA(2.66%), 91 bytes }
## |--+ allele   { VStr8 1348 ZIP_RA(17.25%), 931 bytes }
## |--+ genotype   [  ] *
## |  |--+ data   { Bit2 2x90x1348 ZIP_RA(29.62%), 18.0 KB }
## |  |--+ extra.index   { Int32 3x0 ZIP_RA, 17 bytes } *
## |  |--+ extra   { Int16 0 ZIP_RA, 17 bytes }
## |--+ phase   [  ]
## |  |--+ data   { Bit1 90x1348 ZIP_RA(0.36%), 55 bytes }
## |  |--+ extra.index   { Int32 3x0 ZIP_RA, 17 bytes } *
## |  |--+ extra   { Bit1 0 ZIP_RA, 17 bytes }
## |--+ annotation   [  ]
## |  |--+ id   { VStr8 1348 ZIP_RA(41.27%), 6.0 KB }
## |  |--+ qual   { Float32 1348 ZIP_RA(0.91%), 49 bytes }
## |  |--+ filter   { Int32,factor 1348 ZIP_RA(0.89%), 48 bytes } *
## |  |--+ info   [  ]
## |  |  |--+ AA   { VStr8 1348 ZIP_RA(24.26%), 654 bytes } *
## |  |  |--+ AC   { Int32 1348 ZIP_RA(29.19%), 1.6 KB } *
## |  |  |--+ AN   { Int32 1348 ZIP_RA(22.18%), 1.2 KB } *
## |  |  |--+ DP   { Int32 1348 ZIP_RA(62.57%), 3.4 KB } *
## |  |  |--+ HM2   { Bit1 1348 ZIP_RA(117.16%), 198 bytes } *
## |  |  |--+ HM3   { Bit1 1348 ZIP_RA(117.16%), 198 bytes } *
## |  |  |--+ OR   { VStr8 1348 ZIP_RA(14.50%), 247 bytes } *
## |  |  |--+ GP   { VStr8 1348 ZIP_RA(34.36%), 5.4 KB } *
## |  |  |--+ BN   { Int32 1348 ZIP_RA(22.53%), 1.2 KB } *
## |  |--+ format   [  ]
## |  |  |--+ DP   [  ] *
## |  |  |  |--+ data   { Int32 90x1348 ZIP_RA(36.73%), 178.2 KB }
## |--+ sample.annotation   [  ]

# delete "HM2", "HM3", "AA", "OR" and "DP"
seqDelete(genofile, info.varname=c("HM2", "HM3", "AA", "OR"),
    format.varname="DP")

## Delete INFO variable(s): HM2 HM3 AA OR
## Delete FORMAT variable(s): DP

# display
genofile

## Object of class "SeqVarGDSClass"
## File: /tmp/RtmpkGyHWJ/Rbuild6c672c17d65c/SeqArray/vignettes/tmp.gds (232.1 KB)
## +    [  ] *
## |--+ description   [  ] *
## |--+ sample.id   { VStr8 90 ZIP_RA(30.83%), 222 bytes }
## |--+ variant.id   { Int32 1348 ZIP_RA(35.68%), 1.9 KB }
## |--+ position   { Int32 1348 ZIP_RA(86.44%), 4.7 KB }
## |--+ chromosome   { VStr8 1348 ZIP_RA(2.66%), 91 bytes }
## |--+ allele   { VStr8 1348 ZIP_RA(17.25%), 931 bytes }
## |--+ genotype   [  ] *
## |  |--+ data   { Bit2 2x90x1348 ZIP_RA(29.62%), 18.0 KB }
## |  |--+ extra.index   { Int32 3x0 ZIP_RA, 17 bytes } *
## |  |--+ extra   { Int16 0 ZIP_RA, 17 bytes }
## |--+ phase   [  ]
## |  |--+ data   { Bit1 90x1348 ZIP_RA(0.36%), 55 bytes }
## |  |--+ extra.index   { Int32 3x0 ZIP_RA, 17 bytes } *
## |  |--+ extra   { Bit1 0 ZIP_RA, 17 bytes }
## |--+ annotation   [  ]
## |  |--+ id   { VStr8 1348 ZIP_RA(41.27%), 6.0 KB }
## |  |--+ qual   { Float32 1348 ZIP_RA(0.91%), 49 bytes }
## |  |--+ filter   { Int32,factor 1348 ZIP_RA(0.89%), 48 bytes } *
## |  |--+ info   [  ]
## |  |  |--+ AC   { Int32 1348 ZIP_RA(29.19%), 1.6 KB } *
## |  |  |--+ AN   { Int32 1348 ZIP_RA(22.18%), 1.2 KB } *
## |  |  |--+ DP   { Int32 1348 ZIP_RA(62.57%), 3.4 KB } *
## |  |  |--+ GP   { VStr8 1348 ZIP_RA(34.36%), 5.4 KB } *
## |  |  |--+ BN   { Int32 1348 ZIP_RA(22.53%), 1.2 KB } *
## |  |--+ format   [  ]
## |--+ sample.annotation   [  ]

# close the GDS file
seqClose(genofile)

# clean up the fragments to reduce the file size
cleanup.gds("tmp.gds")

## Clean up the fragments of GDS file:
##  open the file "tmp.gds" (size: 232129)
##  # of fragments in total: 65
##  save it to "tmp.gds.tmp"
##  rename "tmp.gds.tmp" (size: 50849)
##  # of fragments in total: 50

Data Processing

Functions for Data Analysis

Table 2: A list of functions for data analysis.

Function	Description
seqGetData	Gets data from a sequence GDS file (from a subset of data).
seqApply	Applies a user-defined function over array margins.
seqNumAllele	Numbers of alleles per site.
seqMissing	Missing genotype percentages.
seqAlleleFreq	Allele frequencies.
seqAlleleCount	Allele counts.
…

Get Data

# open a GDS file
gds.fn <- seqExampleFileName("gds")
genofile <- seqOpen(gds.fn)

It is suggested to use seqGetData() to take out data from the GDS file since this function can take care of variable-length data and multi-allelic genotypes, although users could also use read.gdsn() in the gdsfmt package to read data.

# take out sample id
head(samp.id <- seqGetData(genofile, "sample.id"))

## [1] "NA06984" "NA06985" "NA06986" "NA06989" "NA06994" "NA07000"

# take out variant id
head(variant.id <- seqGetData(genofile, "variant.id"))

## [1] 1 2 3 4 5 6

# get "chromosome"
table(seqGetData(genofile, "chromosome"))

## 
##   1  10  11  12  13  14  15  16  17  18  19   2  20  21  22   3   4   5 
## 142  70  16  62  11  61  46  84 100  54 111  59  59  23  23  81  48  61 
##   6   7   8   9 
##  99  58  51  29

# get "allele"
head(seqGetData(genofile, "allele"))

## [1] "T,C" "G,A" "G,A" "T,C" "G,C" "C,T"

# get "annotation/info/GP"
head(seqGetData(genofile, "annotation/info/GP"))

## [1] "1:1115503" "1:1115548" "1:1120431" "1:3548136" "1:3548832" "1:3551737"

# get "sample.annotation/family"
head(seqGetData(genofile, "sample.annotation/family"))

## [1] "1328"  ""      "13291" "1328"  "1340"  "1340"

Users can set a filter to samples and/or variants by seqSetFilter(). For example, a subset consisting of three samples and four variants:

# set sample and variant filters
seqSetFilter(genofile, sample.id=samp.id[c(2,4,6)])

## # of selected samples: 3

set.seed(100)
seqSetFilter(genofile, variant.id=sample(variant.id, 4))

## # of selected variants: 4

# get "allele"
seqGetData(genofile, "allele")

## [1] "T,A" "G,A" "G,C" "A,G"

Get genotypic data, it is a 3-dimensional array with respect to allele, sample and variant. 0 refers to the reference allele (or the first allele in the variable allele), 1 for the second allele, and so on, while NA is missing allele.

# get genotypic data
seqGetData(genofile, "genotype")

## , , 1
## 
##       sample
## allele [,1] [,2] [,3]
##   [1,]    0    0    0
##   [2,]    0    0    0
## 
## , , 2
## 
##       sample
## allele [,1] [,2] [,3]
##   [1,]    1    0    0
##   [2,]    0    0    0
## 
## , , 3
## 
##       sample
## allele [,1] [,2] [,3]
##   [1,]    0    0    0
##   [2,]    0    0    0
## 
## , , 4
## 
##       sample
## allele [,1] [,2] [,3]
##   [1,]    0    0    0
##   [2,]    0    0    0

Now let us take a look at a variable-length dataset annotation/info/AA, which corresponds to the INFO column in the original VCF file. There are four variants, each variant has data with size ONE ($length), and data are saved in $data contiguously. $length could be ZERO indicating no data for that variant.

# get "annotation/info/AA", a variable-length dataset
seqGetData(genofile, "annotation/info/AA")

## $length
## [1] 1 1 1 1
## 
## $data
## [1] "T" "G" "G" "A"

Another variable-length dataset is annotation/format/DP corresponding to the FORMAT column in the original VCF file. Again, $length refers to the size of each variant, and data are saved in $data contiguously with respect to the dimension variant. $length could be ZERO indicating no data for that variant.

# get "annotation/format/DP", a variable-length dataset
seqGetData(genofile, "annotation/format/DP")

## $length
## [1] 1 1 1 1
## 
## $data
##       variant
## sample [,1] [,2] [,3] [,4]
##   [1,]    1   12   18    9
##   [2,]   11   79   34  130
##   [3,]   52   31   56   81

Apply Functions Over Array Margins

SeqArray provides seqApply() to apply a user-defined function over array margins, which is coded in C++. It is suggested to use seqApply() instead of apply.gdsn() in the gdsfmt package, since this function can take care of variable-length data and multi-allelic genotypes. For example, reading the two variables genotype and annotation/id variant by variant:

# set sample and variant filters
set.seed(100)
seqSetFilter(genofile, sample.id=samp.id[c(2,4,6)],
    variant.id=sample(variant.id, 4))

## # of selected samples: 3
## # of selected variants: 4

# read multiple variables variant by variant
seqApply(genofile, c(geno="genotype", id="annotation/id"),
    FUN=print, margin="by.variant", as.is="none")

## $geno
##       sample
## allele [,1] [,2] [,3]
##   [1,]    0    0    0
##   [2,]    0    0    0
## 
## $id
## [1] "rs114199731"
## 
## $geno
##       sample
## allele [,1] [,2] [,3]
##   [1,]    1    0    0
##   [2,]    0    0    0
## 
## $id
## [1] "rs12347"
## 
## $geno
##       sample
## allele [,1] [,2] [,3]
##   [1,]    0    0    0
##   [2,]    0    0    0
## 
## $id
## [1] "rs41269293"
## 
## $geno
##       sample
## allele [,1] [,2] [,3]
##   [1,]    0    0    0
##   [2,]    0    0    0
## 
## $id
## [1] "rs35349730"

# read genotypes sample by sample
seqApply(genofile, "genotype",
    FUN=print, margin="by.sample", as.is="none")

##       variant
## allele [,1] [,2] [,3] [,4]
##   [1,]    0    1    0    0
##   [2,]    0    0    0    0
##       variant
## allele [,1] [,2] [,3] [,4]
##   [1,]    0    0    0    0
##   [2,]    0    0    0    0
##       variant
## allele [,1] [,2] [,3] [,4]
##   [1,]    0    0    0    0
##   [2,]    0    0    0    0

seqApply(genofile, c(sample.id="sample.id", genotype="genotype"),
    FUN=print, margin="by.sample", as.is="none")

## $sample.id
## [1] "NA06985"
## 
## $genotype
##       variant
## allele [,1] [,2] [,3] [,4]
##   [1,]    0    1    0    0
##   [2,]    0    0    0    0
## 
## $sample.id
## [1] "NA06989"
## 
## $genotype
##       variant
## allele [,1] [,2] [,3] [,4]
##   [1,]    0    0    0    0
##   [2,]    0    0    0    0
## 
## $sample.id
## [1] "NA07000"
## 
## $genotype
##       variant
## allele [,1] [,2] [,3] [,4]
##   [1,]    0    0    0    0
##   [2,]    0    0    0    0

# remove the sample and variant filters
seqResetFilter(genofile)

## # of selected samples: 90
## # of selected variants: 1348

# get the numbers of alleles per variant
z <- seqApply(genofile, "allele",
    FUN=function(x) length(unlist(strsplit(x,","))), as.is="integer")
table(z)

## z
##    2    3 
## 1346    2

Another example is to use the argument var.index in the function seqApply() to include external information in the analysis, where the variable index in the user-defined FUN is an index of the specified dimension starting from 1 (e.g., variant).

HM3 <- seqGetData(genofile, "annotation/info/HM3")

# Now HM3 is a global variable
# print out RS id if the frequency of reference allele is less than 0.5% and it is HM3
seqApply(genofile, c(geno="genotype", id="annotation/id"),
    FUN = function(index, x) {
        p <- mean(x$geno == 0, na.rm=TRUE)  # the frequency of reference allele
        if ((p < 0.005) & HM3[index]) print(x$id)
    }, as.is="none", var.index="relative", margin="by.variant")

## [1] "rs10908722"
## [1] "rs939777"
## [1] "rs698673"
## [1] "rs943542"
## [1] "rs2062163"
## [1] "rs7142228"
## [1] "rs322965"
## [1] "rs2507733"

Apply Functions in Parallel

Now, let us consider an example of calculating the frequency of reference allele, and this calculation can be done using seqApply() and seqParallel(). Let’s try the uniprocessor implementation first.

# calculate the frequency of reference allele,
afreq <- seqApply(genofile, "genotype", FUN=function(x) mean(x==0, na.rm=TRUE),
    as.is="double", margin="by.variant")
length(afreq)

## [1] 1348

summary(afreq)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.8114  0.9719  0.8488  0.9942  1.0000

A multi-process implementation:

# load the "parallel" package
library(parallel)

# choose an appropriate cluster size (or # of cores)
seqParallelSetup(2)

## Enable the computing cluster with 2 forked R processes.

# run in parallel
afreq <- seqParallel(, genofile, FUN = function(f) {
        seqApply(f, "genotype", as.is="double", FUN=function(x) mean(x==0, na.rm=TRUE))
    }, split = "by.variant")

length(afreq)

## [1] 1348

summary(afreq)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.8114  0.9719  0.8488  0.9942  1.0000

# Close the GDS file
seqClose(genofile)

Examples

In this section, a GDS file shipped with the package is used as an example:

# open a GDS file
gds.fn <- seqExampleFileName("gds")
genofile <- seqOpen(gds.fn)

The performance of seqApply

Let us try three approaches to export unphased genotypes: 1) the for loop in R; 2) vectorize the function in R; 3) the for loop in seqApply(). The function seqApply() has been highly optimized by blocking the computations to exploit the high-speed memory instead of disk. The results of running times (listed as follows) indicate that seqApply() works well and is comparable with vectorization in R (the benchmark in R_v3.0).

the for loop in R:

system.time({
    geno <- seqGetData(genofile, "genotype")
    gc <- matrix("", nrow=dim(geno)[2], ncol=dim(geno)[3])
    for (i in 1:dim(geno)[3])
    {
        for (j in 1:dim(geno)[2])
        gc[j,i] <- paste(geno[1,j,i], geno[2,j,i], sep="/")
    }
    gc[gc == "NA/NA"] <- NA
    gc
})

   user  system elapsed
  2.185   0.019   2.386       <- the function takes 2.4 seconds

dim(gc)
[1]   90 1348

table(c(gc))
  0/0   0/1   1/0   1/1  <NA> 
88350  7783  8258  8321  8608

Vectorize the function in R:

system.time({
    geno <- seqGetData(genofile, "genotype")
    gc <- matrix(paste(geno[1,,], geno[2,,], sep="/"),
        nrow=dim(geno)[2], ncol=dim(geno)[3])
    gc[gc == "NA/NA"] <- NA
    gc
})

   user  system elapsed
  0.134   0.002   0.147       <- the function takes 0.15 seconds

the for loop in seqApply():

system.time({
    gc <- seqApply(genofile, "genotype",
        function(x) { paste(x[1,], x[2,], sep="/") },
        margin="by.variant", as.is="list")
    gc2 <- matrix(unlist(gc), ncol=length(gc))
    gc2[gc2 == "NA/NA"] <- NA
    gc2
})

   user  system elapsed
  0.157   0.002   0.168       <- the function takes 0.17 seconds

Missing Rates

Calculate the missing rate per variant:

m.variant <- local({
    # get ploidy, the number of samples and variants
    z <- seqSummary(genofile, "genotype")
    # dm[1] -- # of selected samples, dm[2] -- # of selected variants
    dm <- z$seldim
    # ploidy
    ploidy <- z$dim[1]

    # apply the function marginally
    m <- seqApply(genofile, "genotype", function(x) { sum(is.na(x)) },
        margin="by.variant", as.is="integer")

    # output
    m / (ploidy * dm[1])
})

length(m.variant)

## [1] 1348

summary(m.variant)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.01111 0.07095 0.07778 0.96670

# Or call the function in SeqArray
summary(seqMissing(genofile, per.variant=TRUE))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.01111 0.07095 0.07778 0.96670

Calculate the missing rate per sample:

m.sample <- local({
    # get ploidy, the number of samples and variants
    z <- seqSummary(genofile, "genotype")
    # dm[1] -- # of selected samples, dm[2] -- # of selected variants
    dm <- z$seldim
    # ploidy
    ploidy <- z$dim[1]

    # need a global variable (only available in the bracket of "local")
    n <- integer(dm[1])

    # apply the function marginally
    # use "<<-" operator to find "n" in the parent environment
    seqApply(genofile, "genotype", function(x) { n <<- n + colSums(is.na(x)) },
        margin="by.variant", as.is="none")

    # output
    n / (ploidy * dm[2])
})

length(m.sample)

## [1] 90

summary(m.sample)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.001484 0.014090 0.054150 0.070950 0.117000 0.264100

# Or call the function in SeqArray
summary(seqMissing(genofile, per.variant=FALSE))

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.001484 0.014090 0.054150 0.070950 0.117000 0.264100

Multi-process Implementation

# choose an appropriate cluster size (or # of cores)
seqParallelSetup(2)

Calculate the missing rate per variant:

# run in parallel
m.variant <- local({
    # get ploidy, the number of samples and variants
    z <- seqSummary(genofile, "genotype")
    # dm[1] -- # of selected samples, dm[2] -- # of selected variants
    dm <- z$seldim
    # ploidy
    ploidy <- z$dim[1]

    # run in parallel
    m <- seqParallel(, genofile, FUN = function(f) {
        # apply the function marginally
        seqApply(f, "genotype", function(x) { sum(is.na(x)) },
            margin="by.variant", as.is="integer")
    }, split = "by.variant")

    # output
    m / (ploidy * dm[1])
})

summary(m.variant)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.01111 0.07095 0.07778 0.96670

Calculate the missing rate per sample:

m.sample <- local({
    # run in parallel
    m <- seqParallel(, genofile, FUN = function(f)
        {
            # dm[1] -- # of selected samples, dm[2] -- # of selected variants
            dm <- seqSummary(f, "genotype")$seldim

            # need a global variable (only available in the bracket of "local")
            n <- integer(dm[1])

            # apply the function marginally
            # use "<<-" operator to find "n" in the parent environment
            seqApply(f, "genotype", function(x) { n <<- n + colSums(is.na(x)) },
                margin="by.variant", as.is="none")

            # output
            n
        }, .combine = "+",     # sum all variables "n" of different processes
        split = "by.variant")

    # get ploidy, the number of samples and variants
    z <- seqSummary(genofile, "genotype")
    # dm[1] -- # of selected samples, dm[2] -- # of selected variants
    dm <- z$seldim
    # ploidy
    ploidy <- z$dim[1]

    # output
    m / (ploidy * dm[2])
})

summary(m.sample)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.001484 0.014090 0.054150 0.070950 0.117000 0.264100

Allele Frequency

Calculate the frequency of reference allele:

# apply the function variant by variant
afreq <- seqApply(genofile, "genotype", function(x) { mean(x==0, na.rm=TRUE) },
    as.is="double", margin="by.variant")

length(afreq)

## [1] 1348

summary(afreq)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.8114  0.9719  0.8488  0.9942  1.0000

# Or call the function in SeqArray
summary(seqAlleleFreq(genofile, 0L))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.8114  0.9719  0.8488  0.9942  1.0000

Multi-process Implementation

# choose an appropriate cluster size (or # of cores)
seqParallelSetup(2)

# run in parallel
afreq <- seqParallel(, genofile, FUN = function(f) {
        seqApply(f, "genotype", as.is="double", FUN=function(x) mean(x==0, na.rm=TRUE))
    }, split = "by.variant")

length(afreq)

## [1] 1348

summary(afreq)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.8114  0.9719  0.8488  0.9942  1.0000

Principal Component Analysis

In the principal component analysis, we employ the dosage of reference alleles to avoid confusion of multiple alleles. The genetic covariance matrix is defined as $M = [ m_{j,j'} ]$: \[ m_{j,j'} = \frac{1}{L} \sum_{l=1}^L \frac{(g_{j,l} - 2p_l)(g_{j',l} - 2p_l)}{p_l (1 - p_l)} \] where $g_{j,l}$ is a genotype of individual $j$ at locus $l$ ($\in \{0,1,2\}$, Num. of reference allele), $p_l$ is the frequency of reference allele and there are $L$ loci in total.

# genetic covariance matrix
genmat <- local({
    # get the number of samples and variants
    # dm[1] -- # of selected samples, dm[2] -- # of selected variants
    dm <- seqSummary(genofile, "genotype")$seldim

    # need a global variable (only available in the bracket of "local")
    s <- matrix(0.0, nrow=dm[1], ncol=dm[1])

    # apply the function variant by variant
    # use "<<-" operator to find "s" in the parent environment
    seqApply(genofile, "genotype", function(x) {
        g <- (x==0)                   # indicator of reference allele
        p <- mean(g, na.rm=TRUE)      # allele frequency
        g2 <- colSums(g) - 2*p        # genotypes adjusted by allele frequency
        m <- (g2 %o% g2) / (p*(1-p))  # genetic covariance matrix
        m[!is.finite(m)] <- 0         # correct missing values
        s <<- s + m                   # output to the global variable "s"
    }, margin="by.variant", as.is="none")

    # output, scaled by the trace of matrix "s" over the number of samples
    s / (sum(diag(s)) / dm[1])
})

# eigen-decomposition
eig <- eigen(genmat)

# eigenvalues
head(eig$value)

## [1] 1.984696 1.867190 1.821147 1.759780 1.729839 1.678466

# eigenvectors
plot(eig$vectors[,1], eig$vectors[,2], xlab="PC 1", ylab="PC 2")

Multi-process Implementation

# choose an appropriate cluster size (or # of cores)
seqParallelSetup(2)

genmat <- seqParallel(, genofile, FUN = function(f)
    {
        # get the number of samples and variants
        # dm[1] -- # of selected samples, dm[2] -- # of selected variants
        dm <- seqSummary(f, "genotype")$seldim

        # need a global variable (only available in the bracket of "local")
        s <- matrix(0.0, nrow=dm[1], ncol=dm[1])

        # apply the function variant by variant
        # use "<<-" operator to find "s" in the parent environment
        seqApply(f, "genotype", function(x) {
            g <- (x==0)                   # indicator of reference allele
            p <- mean(g, na.rm=TRUE)      # allele frequency
            g2 <- colSums(g) - 2*p        # genotypes adjusted by allele frequency
            m <- (g2 %o% g2) / (p*(1-p))  # genetic covariance matrix
            m[!is.finite(m)] <- 0         # correct missing values
            s <<- s + m                   # output to the global variable "s"
        }, margin="by.variant", as.is="none")

        # output
        s
    }, .combine = "+",     # sum all variables "s" of different processes
    split = "by.variant")

# finally, scaled by the trace of matrix "s" over the number of samples
dm <- seqSummary(genofile, "genotype")$seldim
genmat <- genmat / (sum(diag(genmat)) / dm[1])

# eigen-decomposition
eig <- eigen(genmat)
# eigenvalues
head(eig$value)

## [1] 1.984696 1.867190 1.821147 1.759780 1.729839 1.678466

Individual Inbreeding Coefficient

To calculate an individual inbreeding coefficient using SNP genotype data, I demonstrate how to use seqApply() to calculate Visscher’s estimator described in Yang et al. (2010) [5]. The estimator of individual inbreeding coefficient is defined as \[ \hat{\theta} = \frac{1}{L} \sum_{l=1}^L \frac{g_l^2 - g_l (1 + 2p_l) + 2p_l^2}{2 p_l (1 - p_l)} \] where $g_l$ is a SNP genotype at locus $l$ $\in \{0,1,2\}$ (Num. of reference allele), $p_l$ is the frequency of reference allele and there are $L$ loci in total.

coeff <- local({
    # get the number of samples and variants
    # dm[1] -- # of selected samples, dm[2] -- # of selected variants
    dm <- seqSummary(genofile, "genotype")$seldim

    # need global variables (only available in the bracket of "local")
    n <- integer(dm[1])
    s <- double(dm[1])

    # apply the function variant by variant
    # use "<<-" operator to find "n" and "s" in the parent environment
    seqApply(genofile, "genotype", function(x) {
        p <- mean(x==0, na.rm=TRUE)      # allele frequency
        g <- colSums(x==0)               # genotypes: # of reference allele
        d <- (g*g - g*(1 + 2*p) + 2*p*p) / (2*p*(1-p))
        n <<- n + is.finite(d)           # output to the global variable "n"
        d[!is.finite(d)] <- 0
        s <<- s + d                      # output to the global variable "s"
    }, margin="by.variant", as.is="none")

    # output
    s / n
})

length(coeff)

## [1] 90

summary(coeff)

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -1.004e-01 -3.662e-02 -1.981e-02 -2.002e-02  5.300e-07  4.181e-02

Multi-process Implementation

# choose an appropriate cluster size (or # of cores)
seqParallelSetup(2)

coeff <- seqParallel(, genofile, FUN = function(f)
    {
        # get the number of samples and variants
        # dm[1] -- # of selected samples, dm[2] -- # of selected variants
        dm <- seqSummary(f, "genotype")$seldim

        # need global variables (only available in the bracket)
        n <- integer(dm[1])
        s <- double(dm[1])

        # apply the function variant by variant
        # use "<<-" operator to find "n" and "s" in the parent environment
        seqApply(f, "genotype", function(x) {
            p <- mean(x==0, na.rm=TRUE)      # allele frequency
            g <- colSums(x==0)               # genotypes: # of reference allele
            d <- (g*g - g*(1 + 2*p) + 2*p*p) / (2*p*(1-p))
            n <<- n + is.finite(d)           # output to the global variable "n"
            d[!is.finite(d)] <- 0
            s <<- s + d                      # output to the global variable "s"
        }, margin="by.variant", as.is="none")

        # output
        list(s=s, n=n)
    }, # sum all variables "s" and "n" of different processes
    .combine = function(x1,x2) { list(s = x1$s + x2$s, n = x1$n + x2$n) },
    split = "by.variant")

# finally, average!
coeff <- coeff$s / coeff$n

summary(coeff)

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -1.004e-01 -3.662e-02 -1.981e-02 -2.002e-02  5.300e-07  4.181e-02

# stop the cluster environment
seqParallelSetup(FALSE)

## Stop the computing cluster.

Seamless R and C++ Integration

The Rcpp package provides R functions as well as a C++ library which facilitate the integration of R and C++.

library(Rcpp)

The user can dynamically define an inline C/C++ function in R.

cppFunction('double RefAlleleFreq(IntegerMatrix x)
{
    int nrow = x.nrow(), ncol = x.ncol();
    int cnt=0, zero_cnt=0, g;
    for (int i = 0; i < nrow; i++)
    {
        for (int j = 0; j < ncol; j++)
        {
            if ((g = x(i, j)) != NA_INTEGER)
            {
                cnt ++;
                if (g == 0) zero_cnt ++;
            }
        }
    }
    return double(zero_cnt) / cnt;
}')

Now, RefAlleleFreq is a function in R.

RefAlleleFreq

## function (x) 
## .Primitive(".Call")(<pointer: 0x7fdef903fcf0>, x)

afreq <- seqApply(genofile, "genotype", RefAlleleFreq,
    as.is="double", margin="by.variant")

summary(afreq)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.8114  0.9719  0.8488  0.9942  1.0000

# close the GDS file
seqClose(genofile)

Resources

CoreArray C++ project: http://corearray.sourceforge.net/
gdsfmt R package: http://github.com/zhengxwen/gdsfmt, http://www.bioconductor.org/packages/release/bioc/html/gdsfmt.html
SeqArray R package: http://github.com/zhengxwen/SeqArray, http://www.bioconductor.org/packages/release/bioc/html/SeqArray.html

Session Information

sessionInfo()

## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.3 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] Rcpp_0.12.2     SNPRelate_1.4.0 SeqArray_1.10.6 gdsfmt_1.6.2   
## 
## loaded via a namespace (and not attached):
##  [1] formatR_1.2.1              futile.logger_1.4.1       
##  [3] GenomeInfoDb_1.6.1         XVector_0.10.0            
##  [5] GenomicFeatures_1.22.5     bitops_1.0-6              
##  [7] futile.options_1.0.0       tools_3.2.2               
##  [9] zlibbioc_1.16.0            biomaRt_2.26.1            
## [11] digest_0.6.8               evaluate_0.8              
## [13] RSQLite_1.0.0              memoise_0.2.1             
## [15] BSgenome_1.38.0            DBI_0.3.1                 
## [17] yaml_2.1.13                rtracklayer_1.30.1        
## [19] stringr_1.0.0              knitr_1.11                
## [21] Biostrings_2.38.2          S4Vectors_0.8.3           
## [23] IRanges_2.4.4              stats4_3.2.2              
## [25] Biobase_2.30.0             AnnotationDbi_1.32.1      
## [27] XML_3.98-1.3               BiocParallel_1.4.0        
## [29] rmarkdown_0.8.1            lambda.r_1.1.7            
## [31] magrittr_1.5               Rsamtools_1.22.0          
## [33] htmltools_0.2.6            BiocGenerics_0.16.1       
## [35] GenomicRanges_1.22.1       GenomicAlignments_1.6.1   
## [37] SummarizedExperiment_1.0.1 stringi_1.0-1             
## [39] RCurl_1.95-4.7             VariantAnnotation_1.16.3  
## [41] crayon_1.3.1

References

1000 Genomes Project Consortium, Abecasis, G. R., Auton, A., Brooks, L. D., DePristo, M. A., Durbin, R. M., Handsaker, R. E., Kang, H. M., Marth, G. T., and McVean, G. A. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 7422 (Nov 2012), 56–65.
Schadt, E. E., Linderman, M. D., Sorenson, J., Lee, L., and Nolan, G. P. Computational solutions to large-scale data management and analysis. Nat Rev Genet 11, 9 (Sep 2010), 647–657.
Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., Handsaker, R. E., Lunter, G., Marth, G. T., Sherry, S. T., McVean, G., Durbin, R., and 1000 Genomes Project Analysis Group. The variant call format and vcftools. Bioinformatics 27, 15 (Aug 2011), 2156–2158.
Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS. A High-performance Computing Toolset for Relatedness and Principal Component Analysis of SNP Data. Bioinformatics. 2012 Oct 11.
Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, Goddard ME, Visscher PM. 2010. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 42(7):565-9. Epub 2010 Jun 20.

SeqArray – Big Data Management of Genome-Wide Sequence Variants

Xiuwen Zheng (Department of Biostatistics, University of Washington, Seattle)

Jun 23, 2015

Overview

Preparing Data

Data formats used in SeqArray

Format conversion from VCF files

Export to a VCF File

Modification

Data Processing

Functions for Data Analysis

Get Data

Apply Functions Over Array Margins

Apply Functions in Parallel

Examples

The performance of seqApply

Missing Rates

Multi-process Implementation

Allele Frequency

Multi-process Implementation

Principal Component Analysis

Multi-process Implementation

Individual Inbreeding Coefficient

Multi-process Implementation

Seamless R and C++ Integration

Resources

Session Information

References