1 Version Info

R version: R version 4.1.0 (2021-05-18)
Bioconductor version: 3.13
Package version: 1.16.0
Created: 5 Sept 2017
Revised: 7 January 2021

2 Background

The VariantAnnotation package has facilities for reading in all or subsets of Variant Call Format (VCF) files. These text files contain meta-information lines, a header line and data lines each containing information about a position in the genome. The format also may also contain genotype information on samples for each position. More on the file format can be found in the VCF specs.

The ‘locateVariants’ function in the VariantAnnotation package identifies where a variant is located with respect to the gene model (e.g., exon, intron, splice site, etc.). The ‘predictCoding’ function reports the amino acid change for non-synonymous coding variants. Consequences of the coding changes can be investigated with the SIFT and PolyPhen database packages. We’ll use these functions to learn about variants located on the TRPV gene on chromosome

3 Set Up

This workflow requires several different Bioconductor packages. Usage of each will be described in detail in the following sections.

library(VariantAnnotation)
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(BSgenome.Hsapiens.UCSC.hg19)
library(PolyPhen.Hsapiens.dbSNP131)

Use BiocManager::install() to get the packages you don’t have installed:

if (!"BiocManager" %in% rownames(installed.packages()))
     install.packages("BiocManager")
BiocManager::install("mypackage")

4 Exploring variants in the TRPV gene family

This workflow focuses on variants located in the Transient Receptor Potential Vanilloid (TRPV) gene family on chromosome 17. We use a VCF file included in the “variants” package and representing “Complete Genomics Diversity” panel data for chromosome 17 on 46 individuals for a single individual from the CEU population.

file <- system.file("vcf", "NA06985_17.vcf.gz", package = "variants")

4.1 Examine header data in a vcf file

To get an idea of what data are in the file we look at the header. scanVcfHeader() parses the file header into a VCFHeader object and the info() and geno() accessors extract field-specific data.

hdr <- scanVcfHeader(file)

info(hdr)

## DataFrame with 3 rows and 3 columns
##         Number        Type            Description
##    <character> <character>            <character>
## NS           1     Integer Number of Samples Wi..
## DP           1     Integer            Total Depth
## DB           0        Flag dbSNP membership, bu..

geno(hdr)

## DataFrame with 12 rows and 3 columns
##             Number        Type            Description
##        <character> <character>            <character>
## GT               1      String               Genotype
## GQ               1     Integer       Genotype Quality
## DP               1     Integer             Read Depth
## HDP              2     Integer   Haplotype Read Depth
## HQ               2     Integer      Haplotype Quality
## ...            ...         ...                    ...
## mRNA             .      String        Overlaping mRNA
## rmsk             .      String     Overlaping Repeats
## segDup           .      String Overlaping segmentat..
## rCov             1       Float      relative Coverage
## cPd              1      String   called Ploidy(level)

Variants in the VCF have been aligned to NCBI genome build GRCh37:

meta(hdr)

## DataFrameList of length 6
## names(6): fileDate fileformat phasing reference source contig

[ Back to top ]

4.2 Convert gene symbols to gene ids

Use the org.Hs.eg.db package to convert gene symbols to gene ids.

## get entrez ids from gene symbols
genesym <- c("TRPV1", "TRPV2", "TRPV3")
geneid <- select(
    org.Hs.eg.db, keys=genesym, keytype="SYMBOL", columns="ENTREZID"
)

## 'select()' returned 1:1 mapping between keys and columns

geneid

##   SYMBOL ENTREZID
## 1  TRPV1     7442
## 2  TRPV2    51393
## 3  TRPV3   162514

[ Back to top ]

4.3 Create gene ranges

We use the hg19 known gene track from UCSC to identify the TRPV gene ranges. These ranges will eventually be used to extract variants from a regions in the VCF file.

Load the annotation package.

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb

## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg19
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # UCSC Table: knownGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: GRCh37
## # transcript_nrow: 82960
## # exon_nrow: 289969
## # cds_nrow: 237533
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-10-07 18:11:28 +0000 (Wed, 07 Oct 2015)
## # GenomicFeatures version at creation time: 1.21.30
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1

Our VCF file was aligned to a genome from NCBI and the known gene track was from UCSC. These institutions have different naming conventions for the chromosomes. In order to use these two pieces of data in a matching or overlap operation the chromosome names (also called sesqlevels) need to match. We will modify the txdb to match the VCF file.

txdb <- keepSeqlevels(txdb, "chr17")

Create a list of transcripts by gene:

txbygene <- transcriptsBy(txdb, "gene")

Create the gene ranges for the TRPV genes

gnrng <- unlist(range(txbygene[geneid$ENTREZID]), use.names=FALSE)
names(gnrng) <- geneid$SYMBOL

[ Back to top ]

4.4 Extract variant subsets

A ScanVcfParam object is used to retrieve data subsets. This object can specify genomic coordinates (ranges) or individual VCF elements. Extractions of ranges (vs fields) requires a tabix index. See ?indexTabix for details.

seqinfo(gnrng)

## Seqinfo object with 1 sequence from hg19 genome:
##   seqnames seqlengths isCircular genome
##   chr17      81195210         NA   hg19

seqlevels(gnrng) <- sub("chr", "", seqlevels(gnrng))
genome(gnrng) <- "B37"
seqinfo(gnrng)

## Seqinfo object with 1 sequence from B37 genome:
##   seqnames seqlengths isCircular genome
##   17         81195210         NA    B37

## seqlevelsStyle(gnrng) <- "NCBI"
param <- ScanVcfParam(which = gnrng, info = "DP", geno = c("GT", "cPd"))
param

## class: ScanVcfParam 
## vcfWhich: 1 elements
## vcfFixed: character() [All] 
## vcfInfo: DP 
## vcfGeno: GT cPd 
## vcfSamples:

## Extract the TRPV ranges from the VCF file
vcf <- readVcf(file, param = param)
## Inspect the VCF object with the 'fixed', 'info' and 'geno' accessors
vcf

## class: CollapsedVCF 
## dim: 405 1 
## rowRanges(vcf):
##   GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
##   DataFrame with 1 column: DP
## info(header(vcf)):
##       Number Type    Description
##    DP 1      Integer Total Depth
## geno(vcf):
##   List of length 2: GT, cPd
## geno(header(vcf)):
##        Number Type   Description         
##    GT  1      String Genotype            
##    cPd 1      String called Ploidy(level)

head(fixed(vcf))

## DataFrame with 6 rows and 4 columns
##              REF                ALT      QUAL      FILTER
##   <DNAStringSet> <DNAStringSetList> <numeric> <character>
## 1              A                  G       120        PASS
## 2              A                            0        PASS
## 3          AAAAA                            0        PASS
## 4             AA                            0        PASS
## 5              C                  T        59        PASS
## 6              T                  C       157        PASS

geno(vcf)

## List of length 2
## names(2): GT cPd

seqinfo(vcf)

## Seqinfo object with 25 sequences from B37 genome; no seqlengths:
##   seqnames seqlengths isCircular genome
##   1              <NA>       <NA>    B37
##   2              <NA>       <NA>    B37
##   3              <NA>       <NA>    B37
##   4              <NA>       <NA>    B37
##   5              <NA>       <NA>    B37
##   ...             ...        ...    ...
##   21             <NA>       <NA>    B37
##   22             <NA>       <NA>    B37
##   X              <NA>       <NA>    B37
##   Y              <NA>       <NA>    B37
##   M              <NA>       <NA>    B37

genome(vcf) <- "hg19"
seqlevels(vcf) <- sub("([[:digit:]]+)", "chr\\1", seqlevels(vcf))
seqinfo(vcf)

## Seqinfo object with 25 sequences from hg19 genome; no seqlengths:
##   seqnames seqlengths isCircular genome
##   chr1           <NA>       <NA>   hg19
##   chr2           <NA>       <NA>   hg19
##   chr3           <NA>       <NA>   hg19
##   chr4           <NA>       <NA>   hg19
##   chr5           <NA>       <NA>   hg19
##   ...             ...        ...    ...
##   chr21          <NA>       <NA>   hg19
##   chr22          <NA>       <NA>   hg19
##   X              <NA>       <NA>   hg19
##   Y              <NA>       <NA>   hg19
##   M              <NA>       <NA>   hg19

## seqlevelsStyle(vcf) <- "UCSC"
vcf <- keepSeqlevels(vcf, "chr17")

[ Back to top ]

4.5 Variant location in the gene model

The locateVariants function identifies where a variant falls with respect to gene structure, e.g., exon, utr, splice site, etc. We use the gene model from the TxDb.Hsapiens.UCSC.hg19.knownGene package loaded eariler.

## Use the 'region' argument to define the region
## of interest. See ?locateVariants for details.
cds <- locateVariants(vcf, txdb, CodingVariants())
five <- locateVariants(vcf, txdb, FiveUTRVariants())
splice <- locateVariants(vcf, txdb, SpliceSiteVariants())
intron <- locateVariants(vcf, txdb, IntronVariants())

all <- locateVariants(vcf, txdb, AllVariants())

## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 140 out-of-bound ranges located on sequences
##   62411, 62399, 62400, 62401, 62403, 62404, 62405, and 62407. Note that
##   ranges located on a sequence whose length is unknown (NA) or on a
##   circular sequence are not considered out-of-bound (use seqlengths() and
##   isCircular() to get the lengths and circularity flags of the underlying
##   sequences). You can use trim() to trim these ranges. See
##   ?`trim,GenomicRanges-method` for more information.

## 'select()' returned many:1 mapping between keys and columns
## 'select()' returned many:1 mapping between keys and columns
## 'select()' returned many:1 mapping between keys and columns
## 'select()' returned many:1 mapping between keys and columns
## 'select()' returned many:1 mapping between keys and columns
## 'select()' returned many:1 mapping between keys and columns

Each row in cds represents a variant-transcript match so multiple rows per variant are possible. If we are interested in gene-centric questions the data can be summarized by gene regardless of transcript.

## Did any variants match more than one gene?
geneByQuery <- sapply(split(all$GENEID, all$QUERYID), unique)
table(lengths(geneByQuery) > 1)

## 
## FALSE  TRUE 
##   176   193

## Summarize the number of variants by gene:
queryByGene <- sapply(split(all$QUERYID, all$GENEID), unique)
table(lengths(queryByGene) > 1)

## 
## FALSE  TRUE 
##     1    12

sapply(queryByGene, length)

## 100141515     10801    125144    147138    162514     23210     23729     51393 
##        46       139         1       111        57        10         2         5 
##     56985     57690      6427      7442     84690 
##        22       162        10        29        16

## Summarize variant location by gene:
sapply(names(queryByGene), function(nm) {
    d <- all[all$GENEID %in% nm, c("QUERYID", "LOCATION")]
    table(d$LOCATION[duplicated(d) == FALSE])
})

##            100141515 10801 125144 147138 162514 23210 23729 51393 56985 57690
## spliceSite         0     0      0      0      2     0     0     0     0     0
## intron            46   139      0    111      0    10     0     0    22   162
## fiveUTR            0     0      0      0      2     0     0     1     0     0
## threeUTR           0     0      0      0     25     0     2     1     0     0
## coding             0     0      0      0      5     0     0     3     0     0
## intergenic         0     0      0      0      0     0     0     0     0     0
## promoter           0     0      1      0     23     0     0     0     0     0
##            6427 7442 84690
## spliceSite    0    1     0
## intron       10    0     0
## fiveUTR       0    3     5
## threeUTR      0    2     0
## coding        0    8     0
## intergenic    0    0     0
## promoter      0   15    11

[ Back to top ]

4.6 Amino acid coding changes in non-synonymous variants

Amino acid coding for non-synonymous variants can be computed with the function predictCoding. The BSgenome.Hsapiens.UCSC.hg19 package is used as the source of the reference alleles. Variant alleles are provided by the user.

aa <- predictCoding(vcf, txdb, Hsapiens)

## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 140 out-of-bound ranges located on sequences
##   62411, 62399, 62400, 62401, 62403, 62404, 62405, and 62407. Note that
##   ranges located on a sequence whose length is unknown (NA) or on a
##   circular sequence are not considered out-of-bound (use seqlengths() and
##   isCircular() to get the lengths and circularity flags of the underlying
##   sequences). You can use trim() to trim these ranges. See
##   ?`trim,GenomicRanges-method` for more information.

## Warning in .predictCodingGRangesList(query, cache[["cdsbytx"]], seqSource, :
## records with missing 'varAllele' were ignored

## Warning in .predictCodingGRangesList(query, cache[["cdsbytx"]], seqSource, :
## 'varAllele' values with 'N', '.', '+' or '-' were not translated

predictCoding returns results for coding variants only. As with locateVariants, the output has one row per variant-transcript match so multiple rows per variant are possible.

## Did any variants match more than one gene?
aageneByQuery <- split(aa$GENEID, aa$QUERYID)
table(lengths(sapply(aageneByQuery, unique)) > 1)

## 
## FALSE 
##    17

## Summarize the number of variants by gene:
aaaqueryByGene <- split(aa$QUERYID, aa$GENEID, drop=TRUE)
sapply(aaaqueryByGene, length)

## 162514  51393   7442 
##     37      3     56

## Summarize variant consequence by gene:
sapply(names(aaaqueryByGene), function(nm) {
    d <- aa[aa$GENEID %in% nm, c("QUERYID","CONSEQUENCE")]
    table(d$CONSEQUENCE[duplicated(d) == FALSE])
})

##                162514 51393 7442
## nonsynonymous       2     0    2
## not translated      1     0    5
## synonymous          3     3    1

The variants ‘not translated’ are explained by the warnings thrown when predictCoding was called. Variants that have a missing varAllele or have an ‘N’ in the varAllele are not translated. If the varAllele substitution had resulted in a frameshift the consequence would be ‘frameshift’. See ?predictCoding for details.

[ Back to top ]

5 Exploring Package Content

Packages have extensive help pages, and include vignettes highlighting common use cases. The help pages and vignettes are available from within R. After loading a package, use syntax like

help(package="VariantAnnotation")
?predictCoding

to obtain an overview of help on the VariantAnnotation package, and the predictCoding function. View the package vignette with

browseVignettes(package="VariantAnnotation")

To view vignettes providing a more comprehensive introduction to package functionality use

help.start()

[ Back to top ]

6 sessionInfo()

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] variants_1.16.0                        
##  [2] PolyPhen.Hsapiens.dbSNP131_1.0.2       
##  [3] RSQLite_2.2.7                          
##  [4] BSgenome.Hsapiens.UCSC.hg19_1.4.3      
##  [5] BSgenome_1.60.0                        
##  [6] rtracklayer_1.52.0                     
##  [7] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
##  [8] GenomicFeatures_1.44.0                 
##  [9] org.Hs.eg.db_3.13.0                    
## [10] AnnotationDbi_1.54.0                   
## [11] VariantAnnotation_1.38.0               
## [12] Rsamtools_2.8.0                        
## [13] Biostrings_2.60.0                      
## [14] XVector_0.32.0                         
## [15] SummarizedExperiment_1.22.0            
## [16] Biobase_2.52.0                         
## [17] GenomicRanges_1.44.0                   
## [18] GenomeInfoDb_1.28.0                    
## [19] IRanges_2.26.0                         
## [20] S4Vectors_0.30.0                       
## [21] MatrixGenerics_1.4.0                   
## [22] matrixStats_0.58.0                     
## [23] BiocGenerics_0.38.0                    
## [24] BiocStyle_2.20.0                       
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.2               sass_0.4.0               bit64_4.0.5             
##  [4] jsonlite_1.7.2           bslib_0.2.5.1            assertthat_0.2.1        
##  [7] BiocManager_1.30.15      BiocFileCache_2.0.0      blob_1.2.1              
## [10] GenomeInfoDbData_1.2.6   yaml_2.2.1               progress_1.2.2          
## [13] pillar_1.6.1             lattice_0.20-44          glue_1.4.2              
## [16] digest_0.6.27            htmltools_0.5.1.1        Matrix_1.3-3            
## [19] XML_3.99-0.6             pkgconfig_2.0.3          biomaRt_2.48.0          
## [22] bookdown_0.22            zlibbioc_1.38.0          purrr_0.3.4             
## [25] BiocParallel_1.26.0      tibble_3.1.2             KEGGREST_1.32.0         
## [28] generics_0.1.0           ellipsis_0.3.2           cachem_1.0.5            
## [31] magrittr_2.0.1           crayon_1.4.1             memoise_2.0.0           
## [34] evaluate_0.14            fansi_0.4.2              tools_4.1.0             
## [37] prettyunits_1.1.1        hms_1.1.0                BiocIO_1.2.0            
## [40] lifecycle_1.0.0          stringr_1.4.0            DelayedArray_0.18.0     
## [43] compiler_4.1.0           jquerylib_0.1.4          rlang_0.4.11            
## [46] grid_4.1.0               RCurl_1.98-1.3           rjson_0.2.20            
## [49] rappdirs_0.3.3           bitops_1.0-7             rmarkdown_2.8           
## [52] restfulr_0.0.13          curl_4.3.1               DBI_1.1.1               
## [55] R6_2.5.0                 GenomicAlignments_1.28.0 knitr_1.33              
## [58] dplyr_1.0.6              fastmap_1.1.0            bit_4.0.4               
## [61] utf8_1.2.1               filelock_1.0.2           stringi_1.6.2           
## [64] Rcpp_1.0.6               vctrs_0.3.8              png_0.1-7               
## [67] tidyselect_1.1.1         dbplyr_2.1.1             xfun_0.23

[ Back to top ]

Annotating Genomic Variants

5 Sept 2017

Contents

1 Version Info

2 Background

3 Set Up

4 Exploring variants in the TRPV gene family

4.1 Examine header data in a vcf file

4.2 Convert gene symbols to gene ids

4.3 Create gene ranges

4.4 Extract variant subsets

4.5 Variant location in the gene model

4.6 Amino acid coding changes in non-synonymous variants

5 Exploring Package Content

6 sessionInfo()