Below we compare the results of annotating variants via VariantAnnotation specifically repeating a subset of the steps in vignette Introduction to VariantAnnotation. We compare using data from 1,000 Genomes Phase 1 Variants:
First we read in the data from the VCF file:
library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
vcf <- renameSeqlevels(vcf, c("22"="chr22"))
vcf
## class: CollapsedVCF
## dim: 10376 5
## rowRanges(vcf):
## GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
## DataFrame with 22 columns: LDAF, AVGPOST, RSQ, ERATE, THETA, CIEND, CIP...
## info(header(vcf)):
## Number Type Description
## LDAF 1 Float MLE Allele Frequency Accounting for LD
## AVGPOST 1 Float Average posterior probability from MaCH/Thunder
## RSQ 1 Float Genotype imputation quality from MaCH/Thunder
## ERATE 1 Float Per-marker Mutation rate from MaCH/Thunder
## THETA 1 Float Per-marker Transition rate from MaCH/Thunder
## CIEND 2 Integer Confidence interval around END for imprecise ...
## CIPOS 2 Integer Confidence interval around POS for imprecise ...
## END 1 Integer End position of the variant described in this...
## HOMLEN . Integer Length of base pair identical micro-homology ...
## HOMSEQ . String Sequence of base pair identical micro-homolog...
## SVLEN 1 Integer Difference in length between REF and ALT alleles
## SVTYPE 1 String Type of structural variant
## AC . Integer Alternate Allele Count
## AN 1 Integer Total Allele Count
## AA 1 String Ancestral Allele, ftp://ftp.1000genomes.ebi.a...
## AF 1 Float Global Allele Frequency based on AC/AN
## AMR_AF 1 Float Allele Frequency for samples from AMR based o...
## ASN_AF 1 Float Allele Frequency for samples from ASN based o...
## AFR_AF 1 Float Allele Frequency for samples from AFR based o...
## EUR_AF 1 Float Allele Frequency for samples from EUR based o...
## VT 1 String indicates what type of variant the line repre...
## SNPSOURCE . String indicates if a snp was called when analysing ...
## geno(vcf):
## SimpleList of length 3: GT, DS, GL
## geno(header(vcf)):
## Number Type Description
## GT 1 String Genotype
## DS 1 Float Genotype dosage from MaCH/Thunder
## GL . Float Genotype Likelihoods
The file chr22.vcf.gz
within package VariantAnnotation holds data for 5 of the 1,092 individuals in 1,000 Genomes, starting at position 50300078 and ending at position 50999964.
HG00096 HG00097 HG00099 HG00100 HG00101
Important data differences to note:
chr22.vcf.gz
. They are the only two variants within the genomic range with ALT == <DEL>
.library(GoogleGenomics)
# This vignette is authenticated on package load from the env variable GOOGLE_API_KEY.
# When running interactively, call the authenticate method.
# ?authenticate
# We're just getting the first few variants so that this runs quickly.
# If we wanted to get them all, we sould set end=50999964.
granges <- getVariants(variantSetId="10473108253681171589",
chromosome="22",
start=50300077,
end=50303000,
converter=variantsToGRanges)
## Fetching variants page.
## Continuing variant query with the nextPageToken: CIWK_hcQ_Nr-8ZGlk78i
## Fetching variants page.
## Continuing variant query with the nextPageToken: CIaM_hcQvM_un46SmcJ0
## Fetching variants page.
## Continuing variant query with the nextPageToken: CJGN_hcQiYna3cSwse_RAQ
## Fetching variants page.
## Continuing variant query with the nextPageToken: CMKQ_hcQr9OUpKu4n_xz
## Fetching variants page.
## Continuing variant query with the nextPageToken: CMKS_hcQn-nIneaph7WJAQ
## Fetching variants page.
## Continuing variant query with the nextPageToken: CKOU_hcQ2eCCwsmBysdd
## Fetching variants page.
## Continuing variant query with the nextPageToken: CLWV_hcQmo-n-tX08uPyAQ
## Fetching variants page.
## Continuing variant query with the nextPageToken: CMqX_hcQ__3H4KKV9tUR
## Fetching variants page.
## Continuing variant query with the nextPageToken: CKuZ_hcQ9pHXwa742bKgAQ
## Fetching variants page.
## Continuing variant query with the nextPageToken: CLqa_hcQ-6zjo_Dc86gy
## Fetching variants page.
## Continuing variant query with the nextPageToken: CLSb_hcQ3vrDo5XJwsB_
## Fetching variants page.
## Continuing variant query with the nextPageToken: CKOd_hcQ6IPgk-PVjskK
## Fetching variants page.
## Continuing variant query with the nextPageToken: CPGe_hcQ-8Saw6Ojg8yXAQ
## Fetching variants page.
## Continuing variant query with the nextPageToken: CPGf_hcQuKDdodiCvLUm
## Fetching variants page.
## Variants are now available.
Ensure that the data retrieved by each matches:
vcf <- vcf[1:length(granges)] # Truncate the VCF data so that it is the same
# set as what was retrieved from the API.
library(testthat)
expect_equal(start(granges), start(vcf))
expect_equal(end(granges), end(vcf))
expect_equal(as.character(granges$REF), as.character(ref(vcf)))
expect_equal(as.character(unlist(granges$ALT)), as.character(unlist(alt(vcf))))
expect_equal(granges$QUAL, qual(vcf))
expect_equal(granges$FILTER, filt(vcf))
Now locate the protein coding variants in each:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
rd <- rowRanges(vcf)
vcf_locations <- locateVariants(rd, txdb, CodingVariants())
## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence
## 75253. Note that ranges located on a sequence whose length is
## unknown (NA) or on a circular sequence are not considered
## out-of-bound (use seqlengths() and isCircular() to get the lengths
## and circularity flags of the underlying sequences). You can use
## trim() to trim these ranges. See ?`trim,GenomicRanges-method` for
## more information.
## 'select()' returned many:1 mapping between keys and columns
vcf_locations
## GRanges object with 7 ranges and 9 metadata columns:
## seqnames ranges strand | LOCATION LOCSTART LOCEND QUERYID
## <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer>
## [1] chr22 50301422 - | coding 939 939 24
## [2] chr22 50301476 - | coding 885 885 25
## [3] chr22 50301488 - | coding 873 873 26
## [4] chr22 50301494 - | coding 867 867 27
## [5] chr22 50301584 - | coding 777 777 28
## [6] chr22 50302962 - | coding 698 698 57
## [7] chr22 50302995 - | coding 665 665 58
## TXID CDSID GENEID PRECEDEID FOLLOWID
## <character> <IntegerList> <character> <CharacterList> <CharacterList>
## [1] 75253 218562 79087 <NA> <NA>
## [2] 75253 218562 79087 <NA> <NA>
## [3] 75253 218562 79087 <NA> <NA>
## [4] 75253 218562 79087 <NA> <NA>
## [5] 75253 218562 79087 <NA> <NA>
## [6] 75253 218563 79087 <NA> <NA>
## [7] 75253 218563 79087 <NA> <NA>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
granges_locations <- locateVariants(granges, txdb, CodingVariants())
## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence
## 75253. Note that ranges located on a sequence whose length is
## unknown (NA) or on a circular sequence are not considered
## out-of-bound (use seqlengths() and isCircular() to get the lengths
## and circularity flags of the underlying sequences). You can use
## trim() to trim these ranges. See ?`trim,GenomicRanges-method` for
## more information.
## 'select()' returned many:1 mapping between keys and columns
granges_locations
## GRanges object with 7 ranges and 9 metadata columns:
## seqnames ranges strand | LOCATION LOCSTART LOCEND QUERYID
## <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer>
## [1] chr22 50301422 - | coding 939 939 24
## [2] chr22 50301476 - | coding 885 885 25
## [3] chr22 50301488 - | coding 873 873 26
## [4] chr22 50301494 - | coding 867 867 27
## [5] chr22 50301584 - | coding 777 777 28
## [6] chr22 50302962 - | coding 698 698 57
## [7] chr22 50302995 - | coding 665 665 58
## TXID CDSID GENEID PRECEDEID FOLLOWID
## <character> <IntegerList> <character> <CharacterList> <CharacterList>
## [1] 75253 218562 79087 <NA> <NA>
## [2] 75253 218562 79087 <NA> <NA>
## [3] 75253 218562 79087 <NA> <NA>
## [4] 75253 218562 79087 <NA> <NA>
## [5] 75253 218562 79087 <NA> <NA>
## [6] 75253 218563 79087 <NA> <NA>
## [7] 75253 218563 79087 <NA> <NA>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
expect_equal(granges_locations, vcf_locations)
And predict the effect of the protein coding variants:
library(BSgenome.Hsapiens.UCSC.hg19)
vcf_coding <- predictCoding(vcf, txdb, seqSource=Hsapiens)
## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence
## 75253. Note that ranges located on a sequence whose length is
## unknown (NA) or on a circular sequence are not considered
## out-of-bound (use seqlengths() and isCircular() to get the lengths
## and circularity flags of the underlying sequences). You can use
## trim() to trim these ranges. See ?`trim,GenomicRanges-method` for
## more information.
vcf_coding
## GRanges object with 7 ranges and 17 metadata columns:
## seqnames ranges strand | paramRangeID REF
## <Rle> <IRanges> <Rle> | <factor> <DNAStringSet>
## rs114335781 chr22 50301422 - | <NA> G
## rs8135963 chr22 50301476 - | <NA> T
## 22:50301488_C/T chr22 50301488 - | <NA> C
## 22:50301494_G/A chr22 50301494 - | <NA> G
## 22:50301584_C/T chr22 50301584 - | <NA> C
## rs114264124 chr22 50302962 - | <NA> C
## rs149209714 chr22 50302995 - | <NA> C
## ALT QUAL FILTER varAllele
## <DNAStringSetList> <numeric> <character> <DNAStringSet>
## rs114335781 A 100 PASS T
## rs8135963 C 100 PASS G
## 22:50301488_C/T T 100 PASS A
## 22:50301494_G/A A 100 PASS T
## 22:50301584_C/T T 100 PASS A
## rs114264124 T 100 PASS A
## rs149209714 G 100 PASS C
## CDSLOC PROTEINLOC QUERYID TXID
## <IRanges> <IntegerList> <integer> <character>
## rs114335781 939 313 24 75253
## rs8135963 885 295 25 75253
## 22:50301488_C/T 873 291 26 75253
## 22:50301494_G/A 867 289 27 75253
## 22:50301584_C/T 777 259 28 75253
## rs114264124 698 233 57 75253
## rs149209714 665 222 58 75253
## CDSID GENEID CONSEQUENCE REFCODON
## <IntegerList> <character> <factor> <DNAStringSet>
## rs114335781 218562 79087 synonymous ATC
## rs8135963 218562 79087 synonymous GCA
## 22:50301488_C/T 218562 79087 synonymous CCG
## 22:50301494_G/A 218562 79087 synonymous CAC
## 22:50301584_C/T 218562 79087 synonymous CCG
## rs114264124 218563 79087 nonsynonymous CGG
## rs149209714 218563 79087 nonsynonymous GGA
## VARCODON REFAA VARAA
## <DNAStringSet> <AAStringSet> <AAStringSet>
## rs114335781 ATT I I
## rs8135963 GCG A A
## 22:50301488_C/T CCA P P
## 22:50301494_G/A CAT H H
## 22:50301584_C/T CCA P P
## rs114264124 CAG R Q
## rs149209714 GCA G A
## -------
## seqinfo: 1 sequence from hg19 genome; no seqlengths
granges_coding <- predictCoding(rep(granges, elementNROWS(granges$ALT)),
txdb,
seqSource=Hsapiens,
varAllele=unlist(granges$ALT, use.names=FALSE))
## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence
## 75253. Note that ranges located on a sequence whose length is
## unknown (NA) or on a circular sequence are not considered
## out-of-bound (use seqlengths() and isCircular() to get the lengths
## and circularity flags of the underlying sequences). You can use
## trim() to trim these ranges. See ?`trim,GenomicRanges-method` for
## more information.
granges_coding
## GRanges object with 7 ranges and 16 metadata columns:
## seqnames ranges strand | REF ALT
## <Rle> <IRanges> <Rle> | <DNAStringSet> <DNAStringSetList>
## rs114335781 chr22 50301422 - | G A
## rs8135963 chr22 50301476 - | T C
## rs200080075 chr22 50301488 - | C T
## rs147801200 chr22 50301494 - | G A
## rs138060012 chr22 50301584 - | C T
## rs114264124 chr22 50302962 - | C T
## rs149209714 chr22 50302995 - | C G
## QUAL FILTER varAllele CDSLOC PROTEINLOC
## <numeric> <character> <DNAStringSet> <IRanges> <IntegerList>
## rs114335781 100 PASS T 939 313
## rs8135963 100 PASS G 885 295
## rs200080075 100 PASS A 873 291
## rs147801200 100 PASS T 867 289
## rs138060012 100 PASS A 777 259
## rs114264124 100 PASS A 698 233
## rs149209714 100 PASS C 665 222
## QUERYID TXID CDSID GENEID CONSEQUENCE
## <integer> <character> <IntegerList> <character> <factor>
## rs114335781 24 75253 218562 79087 synonymous
## rs8135963 25 75253 218562 79087 synonymous
## rs200080075 26 75253 218562 79087 synonymous
## rs147801200 27 75253 218562 79087 synonymous
## rs138060012 28 75253 218562 79087 synonymous
## rs114264124 57 75253 218563 79087 nonsynonymous
## rs149209714 58 75253 218563 79087 nonsynonymous
## REFCODON VARCODON REFAA VARAA
## <DNAStringSet> <DNAStringSet> <AAStringSet> <AAStringSet>
## rs114335781 ATC ATT I I
## rs8135963 GCA GCG A A
## rs200080075 CCG CCA P P
## rs147801200 CAC CAT H H
## rs138060012 CCG CCA P P
## rs114264124 CGG CAG R Q
## rs149209714 GGA GCA G A
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
expect_equal(as.matrix(granges_coding$REFCODON), as.matrix(vcf_coding$REFCODON))
expect_equal(as.matrix(granges_coding$VARCODON), as.matrix(vcf_coding$VARCODON))
expect_equal(granges_coding$GENEID, vcf_coding$GENEID)
expect_equal(granges_coding$CONSEQUENCE, vcf_coding$CONSEQUENCE)
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.7-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.7-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] testthat_2.0.0
## [2] ggbio_1.28.0
## [3] ggplot2_2.2.1
## [4] org.Hs.eg.db_3.6.0
## [5] BSgenome.Hsapiens.UCSC.hg19_1.4.0
## [6] BSgenome_1.48.0
## [7] rtracklayer_1.40.0
## [8] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [9] GenomicFeatures_1.32.0
## [10] AnnotationDbi_1.42.0
## [11] GoogleGenomics_2.2.0
## [12] VariantAnnotation_1.26.0
## [13] GenomicAlignments_1.16.0
## [14] Rsamtools_1.32.0
## [15] Biostrings_2.48.0
## [16] XVector_0.20.0
## [17] SummarizedExperiment_1.10.0
## [18] DelayedArray_0.6.0
## [19] BiocParallel_1.14.0
## [20] matrixStats_0.53.1
## [21] Biobase_2.40.0
## [22] GenomicRanges_1.32.0
## [23] GenomeInfoDb_1.16.0
## [24] IRanges_2.14.0
## [25] S4Vectors_0.18.0
## [26] BiocGenerics_0.26.0
## [27] knitr_1.20
## [28] BiocStyle_2.8.0
##
## loaded via a namespace (and not attached):
## [1] ProtGenerics_1.12.0 bitops_1.0-6 bit64_0.9-7
## [4] RColorBrewer_1.1-2 progress_1.1.2 httr_1.3.1
## [7] rprojroot_1.3-2 tools_3.5.0 backports_1.1.2
## [10] R6_2.2.2 rpart_4.1-13 Hmisc_4.1-1
## [13] DBI_0.8 lazyeval_0.2.1 colorspace_1.3-2
## [16] nnet_7.3-12 gridExtra_2.3 prettyunits_1.0.2
## [19] GGally_1.3.2 bit_1.1-12 curl_3.2
## [22] compiler_3.5.0 graph_1.58.0 htmlTable_1.11.2
## [25] labeling_0.3 bookdown_0.7 scales_0.5.0
## [28] checkmate_1.8.5 RBGL_1.56.0 stringr_1.3.0
## [31] digest_0.6.15 foreign_0.8-70 rmarkdown_1.9
## [34] dichromat_2.0-0 base64enc_0.1-3 pkgconfig_2.0.1
## [37] htmltools_0.3.6 ensembldb_2.4.0 htmlwidgets_1.2
## [40] rlang_0.2.0 rstudioapi_0.7 RSQLite_2.1.0
## [43] BiocInstaller_1.30.0 jsonlite_1.5 acepack_1.4.1
## [46] RCurl_1.95-4.10 magrittr_1.5 GenomeInfoDbData_1.1.0
## [49] Formula_1.2-2 Matrix_1.2-14 Rcpp_0.12.16
## [52] munsell_0.4.3 stringi_1.1.7 yaml_2.1.18
## [55] zlibbioc_1.26.0 plyr_1.8.4 grid_3.5.0
## [58] blob_1.1.1 lattice_0.20-35 splines_3.5.0
## [61] pillar_1.2.2 rjson_0.2.15 reshape2_1.4.3
## [64] biomaRt_2.36.0 XML_3.98-1.11 evaluate_0.10.1
## [67] biovizBase_1.28.0 latticeExtra_0.6-28 data.table_1.10.4-3
## [70] gtable_0.2.0 reshape_0.8.7 assertthat_0.2.0
## [73] xfun_0.1 AnnotationFilter_1.4.0 survival_2.42-3
## [76] OrganismDbi_1.22.0 tibble_1.4.2 memoise_1.1.0
## [79] cluster_2.0.7-1