VariantExperiment 1.2.0
This vignette is about the conversion methods and statistical
functions that are enabled on VariantExperiment
objects, for the
analysis of genotyping or DNA-seq data sets. If you want to learn more
about the implementation of the VariantExperiment
class, and basic
methods, please refer to the other vignette.
The package of VariantExperiment
is implemented to represent VCF/GDS
files using standard SummarizedExperiment metaphor. It is a container
for high-through genetic/genomic data with GDS back-end, and is
interoperable with the statistical functions/methods that are
implemented in SeqArray
and SeqVarTools
that are designed for GDS
data. The SummarizedExperiment
metaphor also gets the benefit of
common manipulations within Bioconductor ecosystem that are more
user-friendly.
First, we load the package into R session.
library(VariantExperiment)
To take advantage of the functions and methods that are defined on
SummarizedExperiment
, from which the VariantExperiment
extends, we
have defined coercion methods from VCF
and GDS
to
VariantExperiment
.
VCF
to VariantExperiment
The coercion function of makeVariantExperimentFromVCF
could
convert the VCF
file directly into VariantExperiment
object. To
achieve the best storage efficiency, the assay data are saved in
GDSArray
format, and the annotation data are saved in
DelayedDataFrame
format (with no option of ordinary DataFrame
),
which could be retrieved by rowData()
for feature related
annotations and colData()
for sample related annotations (Only when
sample.info
argument is specified).
vcf <- SeqArray::seqExampleFileName("vcf")
ve <- makeVariantExperimentFromVCF(vcf)
ve
Internally, the VCF
file was converted into a on-disk GDS
file,
which could be retrieved by:
gdsfile(ve)
assay data is in GDSArray
format:
assay(ve, 1)
feature-related annotation is in DelayedDataFrame
format:
rowData(ve)
User could also have the opportunity to save the sample related
annotation info directly into the VariantExperiment
object, by
providing the file path to the sample.info
argument, and then
retrieve by colData()
.
sampleInfo <- system.file("extdata", "Example_sampleInfo.txt",
package="VariantExperiment")
ve <- makeVariantExperimentFromVCF(vcf, sample.info = sampleInfo)
colData(ve)
Arguments could be specified to take only certain info columns or format columns from the vcf file.
ve1 <- makeVariantExperimentFromVCF(vcf, info.import=c("OR", "GP"))
rowData(ve1)
In the above example, only 2 info entries (“OR” and “GP”) are read
into the VariantExperiment
object.
The start
and count
arguments could be used to specify the start
position and number of variants to read into Variantexperiment
object.
ve2 <- makeVariantExperimentFromVCF(vcf, start=101, count=1000)
ve2
For the above example, only 1000 variants are read into the
VariantExperiment
object, starting from the position of 101.
GDS
to VariantExperiment
The coercion function of makeVariantExperimentFromGDS
coerces
GDS
files into VariantExperiment
objects directly, with the assay
data saved as GDSArray
, and the rowData()/colData()
in
DelayedDataFrame
by default (with the option of ordinary DataFrame
object).
gds <- SeqArray::seqExampleFileName("gds")
ve <- makeVariantExperimentFromGDS(gds)
ve
rowData(ve)
colData(ve)
Arguments could be specified to take only certain annotation columns
for features and samples. All available data entries for
makeVariantExperimentFromGDS
arguments could be retrieved by the
showAvailable()
function with the gds file name as input.
showAvailable(gds)
Note that the infoColumns
from gds file will be saved as columns
inside the rowData()
, with the prefix of
“info_”. rowDataOnDisk/colDataOnDisk
could be set as FALSE
to
save all annotation data in ordinary DataFrame
format.
ve3 <- makeVariantExperimentFromGDS(gds,
rowDataColumns = c("ID", "ALT", "REF"),
infoColumns = c("AC", "AN", "DP"),
rowDataOnDisk = TRUE,
colDataOnDisk = FALSE)
rowData(ve3) ## DelayedDataFrame object
colData(ve3) ## DataFrame object
VariantExperiment
supports basic subsetting operations using [
,
[[
, $
, and ranged-based subsetting operations using
subsetByOverlap
.
NOTE that after a set of lazy operations, you need to call
saveVariantExperiment
function to synchronize the on-disk file
associated with the in-memory representation by providing a file
path. Statistical functions could only work on synchronized
VariantExperiment
object, or error will return.
Refer to the “VariantExperiment-class” vignette for more details.
Many statistical functions and methods are defined on
VariantExperiment
objects, most of which has their generic defined
in Bioconductor package of SeqArray
and SeqVarTools
. These
functions could be called directly on VariantExperiment
object as
input, with additional arguments to specify based on user’s need. More
details please refer to the vignettes of SeqArray and
SeqVarTools.
Here is a list of the statistical functions with brief description:
statistical functions | Description |
---|---|
seqAlleleFreq | Calculates the allele frequencies |
seqAlleleCount | Calculates the allele counts |
seqMissing | Calculates the missing rate for variant/sample |
seqNumAllele | Calculates the number of alleles (for ref/alt allele) |
hwe | Exact test for Hardy-Weinberg equilibrium on Single-Nucleotide Variants |
inbreedCoeff | Calculates the inbreeding coefficient by variant/sample |
pca | Calculates the eigenvalues and eignevectors with Principal Component Analysis |
titv | Calculate transition/transversion ratio overall or by sample |
refDosage | Calculate the dosage of reference allele (matrix with integers of 0/1/2) |
altDosage | Calculate the dosage of alternative allele (matrix with integers of 0/1/2) |
countSingletons | Count singleton variants for each sample |
heterozygosity | Calculate heterozygosity rate by sample or by variants |
homozygosity | Calculate homozygosity rate by sample or by variants |
meanBySample | Calculate the mean value of a variable by sample over all variants |
isSNV | Flag a single nucleotide variant |
isVariant | Locate which samples are variant for each site |
Here are some examples in calculating the sample missing rate, hwe, titv ratio and the count of singletons for each sample.
## sample missing rate
mr.samp <- seqMissing(ve, per.variant = FALSE)
head(mr.samp)
## hwe
hwe <- hwe(ve)
head(hwe)
## titv ratio by sample / overall
titv <- titv(ve, by.sample=TRUE)
head(titv)
titv(ve, by.sample=FALSE)
## countSingletons
countSingletons(ve)
As we have noted in the other vignette, after the subsetting by
[
, $
or Ranged-based operations
, we need to save the new
VariantExperiment
object to synchronize the gds file (on-disk)
associated with the subset of data (in-memory representation) before
any statistical analysis. Otherwise, an error will be returned.
As a feature addition, we want to add the option of VCFArray
in
saving the assay
data in the step of
makeVariantExperimentFromVCF
. We also seek to implement the
SQLDataFrame in representation of the annotation data. We also
plan to connect Bioconductor package VariantAnnotation to
implement the variant filtering and annotation functions based on
VariantExperiment
format, and with that, to develop a pipeline for
using VariantExperiment
object as the basic data structure for
DNA-sequencing data analysis.
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] VariantExperiment_1.2.0 DelayedDataFrame_1.4.0
## [3] GDSArray_1.8.0 gdsfmt_1.24.0
## [5] SummarizedExperiment_1.18.0 DelayedArray_0.14.0
## [7] matrixStats_0.56.0 Biobase_2.48.0
## [9] GenomicRanges_1.40.0 GenomeInfoDb_1.24.0
## [11] IRanges_2.22.0 S4Vectors_0.26.0
## [13] BiocGenerics_0.34.0 BiocStyle_2.16.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.0.0 xfun_0.13 purrr_0.3.4
## [4] splines_4.0.0 lattice_0.20-41 generics_0.0.2
## [7] vctrs_0.2.4 GWASExactHW_1.01 htmltools_0.4.0
## [10] yaml_2.2.1 mgcv_1.8-31 rlang_0.4.5
## [13] pillar_1.4.3 glue_1.4.0 GenomeInfoDbData_1.2.3
## [16] lifecycle_0.2.0 stringr_1.4.0 zlibbioc_1.34.0
## [19] Biostrings_2.56.0 evaluate_0.14 knitr_1.28
## [22] SeqArray_1.28.0 broom_0.5.6 Rcpp_1.0.4.6
## [25] backports_1.1.6 BiocManager_1.30.10 XVector_0.28.0
## [28] digest_0.6.25 stringi_1.4.6 bookdown_0.18
## [31] dplyr_0.8.5 grid_4.0.0 tools_4.0.0
## [34] bitops_1.0-6 magrittr_1.5 RCurl_1.98-1.2
## [37] tibble_3.0.1 mice_3.8.0 tidyr_1.0.2
## [40] crayon_1.3.4 SNPRelate_1.22.0 pkgconfig_2.0.3
## [43] ellipsis_0.3.0 Matrix_1.2-18 assertthat_0.2.1
## [46] rmarkdown_2.1 logistf_1.23 R6_2.4.1
## [49] SeqVarTools_1.26.0 nlme_3.1-147 compiler_4.0.0