1 User instruction

This vignette is about the conversion methods and statistical functions that are enabled on VariantExperiment objects, for the analysis of genotyping or DNA-seq data sets. If you want to learn more about the implementation of the VariantExperiment class, and basic methods, please refer to the other vignette.

2 Introduction

The package of VariantExperiment is implemented to represent VCF/GDS files using standard SummarizedExperiment metaphor. It is a container for high-through genetic/genomic data with GDS back-end, and is interoperable with the statistical functions/methods that are implemented in SeqArray and SeqVarTools that are designed for GDS data. The SummarizedExperiment metaphor also gets the benefit of common manipulations within Bioconductor ecosystem that are more user-friendly.

First, we load the package into R session.

library(VariantExperiment)

3 Coercion methods

To take advantage of the functions and methods that are defined on SummarizedExperiment, from which the VariantExperiment extends, we have defined coercion methods from VCF and GDS to VariantExperiment.

3.1 From VCF to VariantExperiment

The coercion function of makeVariantExperimentFromVCF could convert the VCF file directly into VariantExperiment object. To achieve the best storage efficiency, the assay data are saved in GDSArray format, and the annotation data are saved in DelayedDataFrame format (with no option of ordinary DataFrame), which could be retrieved by rowData() for feature related annotations and colData() for sample related annotations (Only when sample.info argument is specified).

vcf <- SeqArray::seqExampleFileName("vcf")
ve <- makeVariantExperimentFromVCF(vcf)
ve

Internally, the VCF file was converted into a on-disk GDS file, which could be retrieved by:

gdsfile(ve)

assay data is in GDSArray format:

assay(ve, 1)

feature-related annotation is in DelayedDataFrame format:

rowData(ve)

User could also have the opportunity to save the sample related annotation info directly into the VariantExperiment object, by providing the file path to the sample.info argument, and then retrieve by colData().

sampleInfo <- system.file("extdata", "Example_sampleInfo.txt",
                          package="VariantExperiment")
ve <- makeVariantExperimentFromVCF(vcf, sample.info = sampleInfo)
colData(ve)

Arguments could be specified to take only certain info columns or format columns from the vcf file.

ve1 <- makeVariantExperimentFromVCF(vcf, info.import=c("OR", "GP"))
rowData(ve1)

In the above example, only 2 info entries (“OR” and “GP”) are read into the VariantExperiment object.

The start and count arguments could be used to specify the start position and number of variants to read into Variantexperiment object.

ve2 <- makeVariantExperimentFromVCF(vcf, start=101, count=1000)
ve2

For the above example, only 1000 variants are read into the VariantExperiment object, starting from the position of 101.

3.2 From GDS to VariantExperiment

The coercion function of makeVariantExperimentFromGDS coerces GDS files into VariantExperiment objects directly, with the assay data saved as GDSArray, and the rowData()/colData() in DelayedDataFrame by default (with the option of ordinary DataFrame object).

gds <- SeqArray::seqExampleFileName("gds")
ve <- makeVariantExperimentFromGDS(gds)
ve
rowData(ve)
colData(ve)

Arguments could be specified to take only certain annotation columns for features and samples. All available data entries for makeVariantExperimentFromGDS arguments could be retrieved by the showAvailable() function with the gds file name as input.

showAvailable(gds)

Note that the infoColumns from gds file will be saved as columns inside the rowData(), with the prefix of “info_”. rowDataOnDisk/colDataOnDisk could be set as FALSE to save all annotation data in ordinary DataFrame format.

ve3 <- makeVariantExperimentFromGDS(gds,
                                    rowDataColumns = c("ID", "ALT", "REF"),
                                    infoColumns = c("AC", "AN", "DP"),
                                    rowDataOnDisk = TRUE,
                                    colDataOnDisk = FALSE)
rowData(ve3)  ## DelayedDataFrame object 
colData(ve3)  ## DataFrame object

4 Lazy data operations

VariantExperiment supports basic subsetting operations using [, [[, $, and ranged-based subsetting operations using subsetByOverlap.

NOTE that after a set of lazy operations, you need to call saveVariantExperiment function to synchronize the on-disk file associated with the in-memory representation by providing a file path. Statistical functions could only work on synchronized VariantExperiment object, or error will return.

Refer to the “VariantExperiment-class” vignette for more details.

5 Statistical functions

Many statistical functions and methods are defined on VariantExperiment objects, most of which has their generic defined in Bioconductor package of SeqArray and SeqVarTools. These functions could be called directly on VariantExperiment object as input, with additional arguments to specify based on user’s need. More details please refer to the vignettes of SeqArray and SeqVarTools.

Here is a list of the statistical functions with brief description:

statistical functions Description
seqAlleleFreq Calculates the allele frequencies
seqAlleleCount Calculates the allele counts
seqMissing Calculates the missing rate for variant/sample
seqNumAllele Calculates the number of alleles (for ref/alt allele)
hwe Exact test for Hardy-Weinberg equilibrium on Single-Nucleotide Variants
inbreedCoeff Calculates the inbreeding coefficient by variant/sample
pca Calculates the eigenvalues and eignevectors with Principal Component Analysis
titv Calculate transition/transversion ratio overall or by sample
refDosage Calculate the dosage of reference allele (matrix with integers of 0/1/2)
altDosage Calculate the dosage of alternative allele (matrix with integers of 0/1/2)
countSingletons Count singleton variants for each sample
heterozygosity Calculate heterozygosity rate by sample or by variants
homozygosity Calculate homozygosity rate by sample or by variants
meanBySample Calculate the mean value of a variable by sample over all variants
isSNV Flag a single nucleotide variant
isVariant Locate which samples are variant for each site

Here are some examples in calculating the sample missing rate, hwe, titv ratio and the count of singletons for each sample.

## sample missing rate
mr.samp <- seqMissing(ve, per.variant = FALSE)
head(mr.samp)
## hwe
hwe <- hwe(ve)
head(hwe)
## titv ratio by sample / overall
titv <- titv(ve, by.sample=TRUE)
head(titv)
titv(ve, by.sample=FALSE)
## countSingletons
countSingletons(ve)

As we have noted in the other vignette, after the subsetting by [, $ or Ranged-based operations, we need to save the new VariantExperiment object to synchronize the gds file (on-disk) associated with the subset of data (in-memory representation) before any statistical analysis. Otherwise, an error will be returned.

6 Future work

As a feature addition, we want to add the option of VCFArray in saving the assay data in the step of makeVariantExperimentFromVCF. We also seek to implement the SQLDataFrame in representation of the annotation data. We also plan to connect Bioconductor package VariantAnnotation to implement the variant filtering and annotation functions based on VariantExperiment format, and with that, to develop a pipeline for using VariantExperiment object as the basic data structure for DNA-sequencing data analysis.

7 Session Info

sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] VariantExperiment_1.2.0     DelayedDataFrame_1.4.0     
##  [3] GDSArray_1.8.0              gdsfmt_1.24.0              
##  [5] SummarizedExperiment_1.18.0 DelayedArray_0.14.0        
##  [7] matrixStats_0.56.0          Biobase_2.48.0             
##  [9] GenomicRanges_1.40.0        GenomeInfoDb_1.24.0        
## [11] IRanges_2.22.0              S4Vectors_0.26.0           
## [13] BiocGenerics_0.34.0         BiocStyle_2.16.0           
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.0.0       xfun_0.13              purrr_0.3.4           
##  [4] splines_4.0.0          lattice_0.20-41        generics_0.0.2        
##  [7] vctrs_0.2.4            GWASExactHW_1.01       htmltools_0.4.0       
## [10] yaml_2.2.1             mgcv_1.8-31            rlang_0.4.5           
## [13] pillar_1.4.3           glue_1.4.0             GenomeInfoDbData_1.2.3
## [16] lifecycle_0.2.0        stringr_1.4.0          zlibbioc_1.34.0       
## [19] Biostrings_2.56.0      evaluate_0.14          knitr_1.28            
## [22] SeqArray_1.28.0        broom_0.5.6            Rcpp_1.0.4.6          
## [25] backports_1.1.6        BiocManager_1.30.10    XVector_0.28.0        
## [28] digest_0.6.25          stringi_1.4.6          bookdown_0.18         
## [31] dplyr_0.8.5            grid_4.0.0             tools_4.0.0           
## [34] bitops_1.0-6           magrittr_1.5           RCurl_1.98-1.2        
## [37] tibble_3.0.1           mice_3.8.0             tidyr_1.0.2           
## [40] crayon_1.3.4           SNPRelate_1.22.0       pkgconfig_2.0.3       
## [43] ellipsis_0.3.0         Matrix_1.2-18          assertthat_0.2.1      
## [46] rmarkdown_2.1          logistf_1.23           R6_2.4.1              
## [49] SeqVarTools_1.26.0     nlme_3.1-147           compiler_4.0.0