1 User instruction

This vignette is about the conversion methods and statistical functions that are enabled on VariantExperiment objects, for the analysis of genotyping or DNA-seq data sets. If you want to learn more about the implementation of the VariantExperiment class, and basic methods, please refer to the other vignette.

2 Introduction

The package of VariantExperiment is implemented to represent VCF/GDS files using standard SummarizedExperiment metaphor. It is a container for high-through genetic/genomic data with GDS back-end, and is interoperable with the statistical functions/methods that are implemented in SeqArray and SeqVarTools that are designed for GDS data. The SummarizedExperiment metaphor also gets the benefit of common manipulations within Bioconductor ecosystem that are more user-friendly.

First, we load the package into R session.

library(VariantExperiment)

3 Coercion methods

To take advantage of the functions and methods that are defined on SummarizedExperiment, from which the VariantExperiment extends, we have defined coercion methods from VCF and GDS to VariantExperiment.

3.1 From VCF to VariantExperiment

The coercion function of makeVariantExperimentFromVCF could convert the VCF file directly into VariantExperiment object. To achieve the best storage efficiency, the assay data are saved in GDSArray format, and the annotation data are saved in DelayedDataFrame format (with no option of ordinary DataFrame), which could be retrieved by rowData() for feature related annotations and colData() for sample related annotations (Only when sample.info argument is specified).

vcf <- SeqArray::seqExampleFileName("vcf")
ve <- makeVariantExperimentFromVCF(vcf)
ve
## class: VariantExperiment 
## dim: 1348 90 
## metadata(0):
## assays(2): genotype annotation/format/DP
## rownames(1348): 1 2 ... 1347 1348
## rowData names(14): ID ALT ... info_GP info_BN
## colnames(90): NA06984 NA06985 ... NA12891 NA12892
## colData names(0):

Internally, the VCF file was converted into a on-disk GDS file, which could be retrieved by:

gdsfile(ve)
## [1] "/tmp/RtmpcPzm15/file12a01e5a79a647/se.gds"

assay data is in GDSArray format:

assay(ve, 1)
## <1348 x 90 x 2> array of class GDSArray and type "integer":
## ,,1
##      NA06984 NA06985 NA06986 NA06989 ... NA12889 NA12890 NA12891 NA12892
##    1      NA      NA       0      NA   .       0       0       0       0
##    2      NA      NA       0      NA   .       0       0       0       0
##  ...       .       .       .       .   .       .       .       .       .
## 1347       0       0       0       0   .       0       0       0       0
## 1348      NA      NA       0      NA   .      NA      NA      NA      NA
## 
## ,,2
##      NA06984 NA06985 NA06986 NA06989 ... NA12889 NA12890 NA12891 NA12892
##    1      NA      NA       0      NA   .       0       0       0       0
##    2      NA      NA       0      NA   .       0       0       0       0
##  ...       .       .       .       .   .       .       .       .       .
## 1347       0       0       0       0   .       0       0       0       0
## 1348      NA      NA       1      NA   .      NA      NA      NA      NA

feature-related annotation is in DelayedDataFrame format:

rowData(ve)
## DelayedDataFrame with 1348 rows and 14 columns
##               ID            ALT            REF       QUAL     FILTER    info_AA
##       <GDSArray> <DelayedArray> <DelayedArray> <GDSArray> <GDSArray> <GDSArray>
## 1    rs111751804              C              T        NaN       PASS          T
## 2    rs114390380              A              G        NaN       PASS          G
## 3      rs1320571              A              G        NaN       PASS          A
## ...          ...            ...            ...        ...        ...        ...
## 1346   rs8135982              T              C        NaN       PASS          C
## 1347 rs116581756              A              G        NaN       PASS          G
## 1348   rs5771206              G              A        NaN       PASS          G
##         info_AC    info_AN    info_DP   info_HM2   info_HM3    info_OR
##      <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray>
## 1             4        114       3251      FALSE      FALSE           
## 2             1        106       2676      FALSE      FALSE           
## 3             6        154       7610       TRUE       TRUE           
## ...         ...        ...        ...        ...        ...        ...
## 1346         11        142        823      FALSE      FALSE           
## 1347          1        152       1257      FALSE      FALSE           
## 1348          1          6         48      FALSE      FALSE           
##          info_GP    info_BN
##       <GDSArray> <GDSArray>
## 1      1:1115503        132
## 2      1:1115548        132
## 3      1:1120431         88
## ...          ...        ...
## 1346 22:45312345        116
## 1347 22:45312409        132
## 1348 22:50616806        114

User could also have the opportunity to save the sample related annotation info directly into the VariantExperiment object, by providing the file path to the sample.info argument, and then retrieve by colData().

sampleInfo <- system.file("extdata", "Example_sampleInfo.txt",
                          package="VariantExperiment")
ve <- makeVariantExperimentFromVCF(vcf, sample.info = sampleInfo)
## Warning in (function (node, name, val = NULL, storage = storage.mode(val), :
## Missing characters are converted to "".
colData(ve)
## DelayedDataFrame with 90 rows and 1 column
##             family
##         <GDSArray>
## NA06984       1328
## NA06985           
## NA06986      13291
## ...            ...
## NA12890       1463
## NA12891           
## NA12892

Arguments could be specified to take only certain info columns or format columns from the vcf file.

ve1 <- makeVariantExperimentFromVCF(vcf, info.import=c("OR", "GP"))
rowData(ve1)
## DelayedDataFrame with 1348 rows and 7 columns
##               ID            ALT            REF       QUAL     FILTER    info_OR
##       <GDSArray> <DelayedArray> <DelayedArray> <GDSArray> <GDSArray> <GDSArray>
## 1    rs111751804              C              T        NaN       PASS           
## 2    rs114390380              A              G        NaN       PASS           
## 3      rs1320571              A              G        NaN       PASS           
## ...          ...            ...            ...        ...        ...        ...
## 1346   rs8135982              T              C        NaN       PASS           
## 1347 rs116581756              A              G        NaN       PASS           
## 1348   rs5771206              G              A        NaN       PASS           
##          info_GP
##       <GDSArray>
## 1      1:1115503
## 2      1:1115548
## 3      1:1120431
## ...          ...
## 1346 22:45312345
## 1347 22:45312409
## 1348 22:50616806

In the above example, only 2 info entries (“OR” and “GP”) are read into the VariantExperiment object.

The start and count arguments could be used to specify the start position and number of variants to read into Variantexperiment object.

ve2 <- makeVariantExperimentFromVCF(vcf, start=101, count=1000)
ve2
## class: VariantExperiment 
## dim: 1000 90 
## metadata(0):
## assays(2): genotype annotation/format/DP
## rownames(1000): 101 102 ... 1099 1100
## rowData names(14): ID ALT ... info_GP info_BN
## colnames(90): NA06984 NA06985 ... NA12891 NA12892
## colData names(0):

For the above example, only 1000 variants are read into the VariantExperiment object, starting from the position of 101.

3.2 From GDS to VariantExperiment

The coercion function of makeVariantExperimentFromGDS coerces GDS files into VariantExperiment objects directly, with the assay data saved as GDSArray, and the rowData()/colData() in DelayedDataFrame by default (with the option of ordinary DataFrame object).

gds <- SeqArray::seqExampleFileName("gds")
ve <- makeVariantExperimentFromGDS(gds)
ve
## class: VariantExperiment 
## dim: 1348 90 
## metadata(0):
## assays(2): genotype annotation/format/DP
## rownames(1348): 1 2 ... 1347 1348
## rowData names(14): ID ALT ... info_GP info_BN
## colnames(90): NA06984 NA06985 ... NA12891 NA12892
## colData names(1): family
rowData(ve)
## DelayedDataFrame with 1348 rows and 14 columns
##               ID            ALT            REF       QUAL     FILTER    info_AA
##       <GDSArray> <DelayedArray> <DelayedArray> <GDSArray> <GDSArray> <GDSArray>
## 1    rs111751804              C              T        NaN       PASS          T
## 2    rs114390380              A              G        NaN       PASS          G
## 3      rs1320571              A              G        NaN       PASS          A
## ...          ...            ...            ...        ...        ...        ...
## 1346   rs8135982              T              C        NaN       PASS          C
## 1347 rs116581756              A              G        NaN       PASS          G
## 1348   rs5771206              G              A        NaN       PASS          G
##         info_AC    info_AN    info_DP   info_HM2   info_HM3    info_OR
##      <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray>
## 1             4        114       3251      FALSE      FALSE           
## 2             1        106       2676      FALSE      FALSE           
## 3             6        154       7610       TRUE       TRUE           
## ...         ...        ...        ...        ...        ...        ...
## 1346         11        142        823      FALSE      FALSE           
## 1347          1        152       1257      FALSE      FALSE           
## 1348          1          6         48      FALSE      FALSE           
##          info_GP    info_BN
##       <GDSArray> <GDSArray>
## 1      1:1115503        132
## 2      1:1115548        132
## 3      1:1120431         88
## ...          ...        ...
## 1346 22:45312345        116
## 1347 22:45312409        132
## 1348 22:50616806        114
colData(ve)
## DelayedDataFrame with 90 rows and 1 column
##             family
##         <GDSArray>
## NA06984       1328
## NA06985           
## NA06986      13291
## ...            ...
## NA12890       1463
## NA12891           
## NA12892

Arguments could be specified to take only certain annotation columns for features and samples. All available data entries for makeVariantExperimentFromGDS arguments could be retrieved by the showAvailable() function with the gds file name as input.

showAvailable(gds)
## CharacterList of length 4
## [["name"]] genotype annotation/format/DP
## [["rowDataColumns"]] ID ALT REF QUAL FILTER
## [["colDataColumns"]] family
## [["infoColumns"]] AA AC AN DP HM2 HM3 OR GP BN

Note that the infoColumns from gds file will be saved as columns inside the rowData(), with the prefix of “info_”. rowDataOnDisk/colDataOnDisk could be set as FALSE to save all annotation data in ordinary DataFrame format.

ve3 <- makeVariantExperimentFromGDS(gds,
                                    rowDataColumns = c("ID", "ALT", "REF"),
                                    infoColumns = c("AC", "AN", "DP"),
                                    rowDataOnDisk = TRUE,
                                    colDataOnDisk = FALSE)
rowData(ve3)  ## DelayedDataFrame object 
## DelayedDataFrame with 1348 rows and 6 columns
##               ID            ALT            REF    info_AC    info_AN    info_DP
##       <GDSArray> <DelayedArray> <DelayedArray> <GDSArray> <GDSArray> <GDSArray>
## 1    rs111751804              C              T          4        114       3251
## 2    rs114390380              A              G          1        106       2676
## 3      rs1320571              A              G          6        154       7610
## ...          ...            ...            ...        ...        ...        ...
## 1346   rs8135982              T              C         11        142        823
## 1347 rs116581756              A              G          1        152       1257
## 1348   rs5771206              G              A          1          6         48
colData(ve3)  ## DataFrame object
## DataFrame with 90 rows and 1 column
##              family
##         <character>
## NA06984        1328
## NA06985            
## NA06986       13291
## ...             ...
## NA12890        1463
## NA12891            
## NA12892

4 Lazy data operations

VariantExperiment supports basic subsetting operations using [, [[, $, and ranged-based subsetting operations using subsetByOverlap.

NOTE that after a set of lazy operations, you need to call saveVariantExperiment function to synchronize the on-disk file associated with the in-memory representation by providing a file path. Statistical functions could only work on synchronized VariantExperiment object, or error will return.

Refer to the “VariantExperiment-class” vignette for more details.

5 Statistical functions

Many statistical functions and methods are defined on VariantExperiment objects, most of which has their generic defined in Bioconductor package of SeqArray and SeqVarTools. These functions could be called directly on VariantExperiment object as input, with additional arguments to specify based on user’s need. More details please refer to the vignettes of SeqArray and SeqVarTools.

Here is a list of the statistical functions with brief description:

statistical functions Description
seqAlleleFreq Calculates the allele frequencies
seqAlleleCount Calculates the allele counts
seqMissing Calculates the missing rate for variant/sample
seqNumAllele Calculates the number of alleles (for ref/alt allele)
hwe Exact test for Hardy-Weinberg equilibrium on Single-Nucleotide Variants
inbreedCoeff Calculates the inbreeding coefficient by variant/sample
pca Calculates the eigenvalues and eignevectors with Principal Component Analysis
titv Calculate transition/transversion ratio overall or by sample
refDosage Calculate the dosage of reference allele (matrix with integers of 0/1/2)
altDosage Calculate the dosage of alternative allele (matrix with integers of 0/1/2)
countSingletons Count singleton variants for each sample
heterozygosity Calculate heterozygosity rate by sample or by variants
homozygosity Calculate homozygosity rate by sample or by variants
meanBySample Calculate the mean value of a variable by sample over all variants
isSNV Flag a single nucleotide variant
isVariant Locate which samples are variant for each site

Here are some examples in calculating the sample missing rate, hwe, titv ratio and the count of singletons for each sample.

## sample missing rate
mr.samp <- seqMissing(ve, per.variant = FALSE)
head(mr.samp)
## [1] 0.083086053 0.096439169 0.009643917 0.094213650 0.129821958 0.022997033
## hwe
hwe <- hwe(ve)
head(hwe)
##   variant.id nAA nAa naa     afreq p            f
## 1          1  53   4   0 0.9649123 1 -0.036363636
## 2          2  52   1   0 0.9905660 1 -0.009523810
## 3          3  71   6   0 0.9610390 1 -0.040540541
## 4          4   1  16  56 0.1232877 1 -0.013888889
## 5          5  76  13   0 0.9269663 1 -0.078787879
## 6          6  88   1   0 0.9943820 1 -0.005649718
## titv ratio by sample / overall
titv <- titv(ve, by.sample=TRUE)
head(titv)
## [1] 4.352941 3.791667 3.439394 3.568966 3.750000 3.646154
titv(ve, by.sample=FALSE)
## [1] 3.562712
## countSingletons
countSingletons(ve)
##  [1]  2  2  7  5  1  5  3  5  2  2  8  2  4  3  6  5  9  5  3  5  5  5  7  0  5
## [26]  8 13  9  1  6  2  8  2  6  0  5  4  7  7  3  1  9  0  0  5  3  4  8  6  9
## [51]  6  4  7  8  5  7  3  5  2  2  6  8  2  6  4  3  7  4  3  3  8  2  8  6  1
## [76]  5  1  9  8  5  0  1  2  2  0  1  0 10  2  4

As we have noted in the other vignette, after the subsetting by [, $ or Ranged-based operations, we need to save the new VariantExperiment object to synchronize the gds file (on-disk) associated with the subset of data (in-memory representation) before any statistical analysis. Otherwise, an error will be returned.

6 Future work

As a feature addition, we want to add the option of VCFArray in saving the assay data in the step of makeVariantExperimentFromVCF. We also seek to implement the SQLDataFrame in representation of the annotation data. We also plan to connect Bioconductor package VariantAnnotation to implement the variant filtering and annotation functions based on VariantExperiment format, and with that, to develop a pipeline for using VariantExperiment object as the basic data structure for DNA-sequencing data analysis.

7 Session Info

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] VariantExperiment_1.6.0     DelayedDataFrame_1.8.0     
##  [3] GDSArray_1.12.0             DelayedArray_0.18.0        
##  [5] Matrix_1.3-3                gdsfmt_1.28.0              
##  [7] SummarizedExperiment_1.22.0 Biobase_2.52.0             
##  [9] GenomicRanges_1.44.0        GenomeInfoDb_1.28.0        
## [11] IRanges_2.26.0              MatrixGenerics_1.4.0       
## [13] matrixStats_0.58.0          S4Vectors_0.30.0           
## [15] BiocGenerics_0.38.0         BiocStyle_2.20.0           
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.6             lattice_0.20-44        tidyr_1.1.3           
##  [4] formula.tools_1.7.1    Biostrings_2.60.0      assertthat_0.2.1      
##  [7] digest_0.6.27          utf8_1.2.1             R6_2.5.0              
## [10] backports_1.2.1        evaluate_0.14          pillar_1.6.1          
## [13] zlibbioc_1.38.0        rlang_0.4.11           data.table_1.14.0     
## [16] jquerylib_0.1.4        rmarkdown_2.8          splines_4.1.0         
## [19] stringr_1.4.0          GWASExactHW_1.01       RCurl_1.98-1.3        
## [22] broom_0.7.6            compiler_4.1.0         xfun_0.23             
## [25] pkgconfig_2.0.3        mgcv_1.8-35            htmltools_0.5.1.1     
## [28] tidyselect_1.1.1       tibble_3.1.2           GenomeInfoDbData_1.2.6
## [31] bookdown_0.22          fansi_0.4.2            crayon_1.4.1          
## [34] dplyr_1.0.6            bitops_1.0-7           grid_4.1.0            
## [37] DBI_1.1.1              nlme_3.1-152           jsonlite_1.7.2        
## [40] lifecycle_1.0.0        magrittr_2.0.1         SeqVarTools_1.30.0    
## [43] stringi_1.6.2          XVector_0.32.0         SNPRelate_1.26.0      
## [46] mice_3.13.0            bslib_0.2.5.1          ellipsis_0.3.2        
## [49] vctrs_0.3.8            generics_0.1.0         tools_4.1.0           
## [52] glue_1.4.2             purrr_0.3.4            yaml_2.2.1            
## [55] SeqArray_1.32.0        BiocManager_1.30.15    operator.tools_1.6.3  
## [58] logistf_1.24           knitr_1.33             sass_0.4.0