The AnVILVRS package provides an R interface to the AnVIL VRS Toolkit, a Python library for working with the Global Alliance for Genomics and Health (GA4GH) Variation Representation Specification (VRS) standard. The package allows users to translate variant identifiers from various formats (e.g., gnomAD, SPDI, HGVS, Beacon) into GA4GH VRS Allele IDs and vice versa. Additionally, it provides functionality to retrieve allele frequency data from the 1000 Genomes Project based on VRS Allele IDs.
AnVILVRS requires Python 3.11. Make sure that version is installed on your system before you continue.
Install the AnVILVRS package from Bioconductor with:
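A typical Bioconductor installation looks like the sketch below. It assumes the package is available in the Bioconductor version that matches your R installation; for a package still in development, you may need `BiocManager::install(version = "devel")` first.

```r
# Install BiocManager from CRAN if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

# Install AnVILVRS and its R dependencies from Bioconductor
BiocManager::install("AnVILVRS")
```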
Load the AnVILVRS package and the reticulate package for Python integration:
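Both packages are attached with standard `library()` calls:

```r
library(AnVILVRS)    # R interface to the AnVIL VRS Toolkit
library(reticulate)  # R-to-Python bridge used to run the toolkit
```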
After installing the package, you need to set up a Python virtual environment with the required dependencies. You can do this by running the following command in R:
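The package's own setup command is not reproduced here. As a rough stand-in, the sketch below creates an empty Python 3.11 virtual environment using reticulate alone. The environment name `vrs_env` is taken from the example later in this document; everything else is an assumption. In particular, this sketch does not install the toolkit's Python dependencies, which the package's setup routine is expected to handle.

```r
library(reticulate)

# Assumption: the environment is called "vrs_env", matching the name
# passed to use_virtualenv() later in this document.
env_name <- "vrs_env"

# Create the environment only if it does not exist yet. The `version`
# argument asks reticulate to locate a Python 3.11 interpreter.
if (!virtualenv_exists(env_name)) {
    virtualenv_create(envname = env_name, version = "3.11")
}
```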
Once the virtual environment is set up, you can use the AnVILVRS package to translate variant identifiers and retrieve allele frequency data. First, load the package and activate the virtual environment:
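A minimal version of this step, assuming the environment is named `vrs_env` as in the example at the end of this document, is:

```r
library(AnVILVRS)
library(reticulate)

# Bind this R session to the virtual environment; required = TRUE makes
# reticulate raise an error instead of silently using another Python.
use_virtualenv("vrs_env", required = TRUE)

# Optional check: print the Python interpreter that reticulate selected
py_config()
```

Calling `use_virtualenv()` before any Python code runs matters, because reticulate binds to a single interpreter per R session.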
You can translate variant identifiers from various formats into GA4GH
VRS Allele IDs using the get_vrs_id function. Supported
formats include “gnomad”, “spdi”, “hgvs”, and “beacon”.
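The sketch below uses the same gnomAD-style identifier as the example at the end of this document. A gnomAD identifier has the form chromosome-position-reference-alternate, which the first lines make explicit with base R. The call to `get_vrs_id` assumes the `vrs_env` virtual environment has already been set up.

```r
library(AnVILVRS)
reticulate::use_virtualenv("vrs_env", required = TRUE)

variant_id <- "chr1-20094-TAA-T"

# Split the gnomAD-style identifier into its four fields
fields <- setNames(
    strsplit(variant_id, "-", fixed = TRUE)[[1]],
    c("chrom", "pos", "ref", "alt")
)
fields

# Translate the identifier into a GA4GH VRS Allele ID
vrs_id <- get_vrs_id(variant_id, "gnomad")
vrs_id
```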
You can also retrieve the VRS Allele object using the
get_vrs_allele function:
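The call below is a sketch only. It assumes that `get_vrs_allele` accepts the same two arguments as `get_vrs_id` (the identifier and its format); check the function's help page for the actual signature.

```r
library(AnVILVRS)
reticulate::use_virtualenv("vrs_env", required = TRUE)

# Assumption: same (identifier, format) arguments as get_vrs_id()
allele <- get_vrs_allele("chr1-20094-TAA-T", "gnomad")

# Inspect the returned allele object
allele
```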
You can convert a VRS Allele object back to a variant identifier in a
specified format using the get_variant_from_allele
function:
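As an illustration, the round trip might look like the following. The argument order (allele object first, target format second) and the use of "spdi" as the target are assumptions based on the description above, not a confirmed signature.

```r
library(AnVILVRS)
reticulate::use_virtualenv("vrs_env", required = TRUE)

# Assumption: get_vrs_allele() takes (identifier, format)
allele <- get_vrs_allele("chr1-20094-TAA-T", "gnomad")

# Assumption: get_variant_from_allele() takes (allele, target format)
spdi_id <- get_variant_from_allele(allele, "spdi")
spdi_id
```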
The get_pop_descriptor function downloads the population
descriptor file from a known Google Storage URI to the BiocFileCache.
This file is used in the calculation of the Cohort Allele Frequency
(CAF). The get_caf function makes use of the population
descriptor .tsv file to dynamically provide a Cohort Allele
Frequency based on the sub-population of interest. In our example, we
will use the “USA” population code.
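The descriptor can be fetched and inspected as sketched below. The call to `get_pop_descriptor()` without arguments matches the example at the end of this document; treating its return value as a local path to the .tsv file is an assumption.

```r
library(AnVILVRS)

# Download the population descriptor file into the BiocFileCache
pop_desc <- get_pop_descriptor()

# Assumption: pop_desc is the local path to the tab-separated file.
# Preview the first rows to see which population codes are available.
head(read.delim(pop_desc))
```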
The get_seqrepo function downloads a SeqRepo archive
from a specified URI to a local directory. This is useful for setting up
the SeqRepo database required by the AnVIL VRS Toolkit.
Finally, to calculate the Cohort Allele Frequency (CAF) from 1000
Genomes Project data for a given VRS Allele ID, use the
get_caf function. It needs a compressed (.gz) VCF file from the
1000 Genomes Project together with its corresponding index file.
The VCF should include variants with allele frequency
annotations.
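# Activate the Python virtual environment created during setup
use_virtualenv("vrs_env", required = TRUE)

# Set up the toolkit and the inputs for the CAF calculation
toolkit_dir <- setup_vrs_toolkit()
vcf <- AnVILVRS:::.get_fixture_vcf()  # fixture VCF from an internal helper
vcf_index <- build_vrs_index()

# Translate a gnomAD-style identifier into a VRS Allele ID
variant_id <- "chr1-20094-TAA-T"
vrs_id <- get_vrs_id(variant_id, "gnomad")

# Population descriptor file (downloaded to the BiocFileCache)
pop_desc <- get_pop_descriptor()

# Cohort Allele Frequency for the "USA" population code
get_caf(
    vrs_id = vrs_id,
    vcf = vcf,
    vcf_index = vcf_index,
    phenotype = "USA",
    pop_desc_file = pop_desc,
    toolkit_dir = toolkit_dir
)

sessionInfo()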
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] reticulate_1.46.0 AnVILVRS_0.99.19 BiocStyle_2.41.0
#>
#> loaded via a namespace (and not attached):
#> [1] rappdirs_0.3.4 sass_0.4.10 generics_0.1.4
#> [4] tidyr_1.3.2 RSQLite_2.4.6 lattice_0.22-9
#> [7] digest_0.6.39 magrittr_2.0.5 evaluate_1.0.5
#> [10] grid_4.6.0 fastmap_1.2.0 blob_1.3.0
#> [13] jsonlite_2.0.0 Matrix_1.7-5 AnVILGCP_1.7.0
#> [16] DBI_1.3.0 BiocManager_1.30.27 httr_1.4.8
#> [19] purrr_1.2.2 codetools_0.2-20 httr2_1.2.2
#> [22] jquerylib_0.1.4 cli_3.6.6 rlang_1.2.0
#> [25] dbplyr_2.5.2 bit64_4.8.0 cachem_1.1.0
#> [28] yaml_2.3.12 BiocBaseUtils_1.15.1 tools_4.6.0
#> [31] memoise_2.0.1 dplyr_1.2.1 filelock_1.0.3
#> [34] GCPtools_1.3.0 curl_7.1.0 buildtools_1.0.0
#> [37] vctrs_0.7.3 R6_2.6.1 png_0.1-9
#> [40] lifecycle_1.0.5 BiocFileCache_3.3.0 bit_4.6.0
#> [43] pkgconfig_2.0.3 pillar_1.11.1 bslib_0.10.0
#> [46] glue_1.8.1 Rcpp_1.1.1-1.1 xfun_0.57
#> [49] tibble_3.3.1 tidyselect_1.2.1 sys_3.4.3
#> [52] knitr_1.51 AnVILBase_1.7.0 htmltools_0.5.9
#> [55] rmarkdown_2.31 maketools_1.3.2 compiler_4.6.0