Annotating Variants

0.1 Working with Variants
0.2 Annotating Variants
0.3 Provenance

0.1 Working with Variants

Google Genomics implements the GA4GH variants API and this R package can retrieve data from that implementation. For more detail, see https://cloud.google.com/genomics/v1beta2/reference/variants

library(GoogleGenomics)
# This vignette is authenticated on package load from the env variable GOOGLE_API_KEY.
# When running interactively, call the authenticate method.
# ?authenticate

By default, this function retrieves variants for a small genomic region from the 1,000 Genomes phase 1 variants.

variants <- getVariants()

## Fetching variants page.

## Variants are now available.

length(variants)

## [1] 4

We can see that 4 individual variants were returned and that the JSON response was deserialized into an R list object:

class(variants)

## [1] "list"

mode(variants)

## [1] "list"

The top level field names are:

names(variants[[1]])

##  [1] "variantSetId"   "id"             "names"          "created"       
##  [5] "referenceName"  "start"          "end"            "referenceBases"
##  [9] "alternateBases" "quality"        "filter"         "info"          
## [13] "calls"

The variant contains nested calls:

length(variants[[1]]$calls)

## [1] 1092

With top level call field names:

names(variants[[1]]$calls[[1]])

## [1] "callSetId"          "callSetName"        "genotype"          
## [4] "phaseset"           "genotypeLikelihood" "info"

And examining a call for a particular variant:

variants[[1]]$referenceName

## [1] "22"

variants[[1]]$start

## [1] "16051452"

variants[[1]]$alternateBases

## [[1]]
## [1] "C"

variants[[1]]$calls[[1]]

## $callSetId
## [1] "10473108253681171589-0"
## 
## $callSetName
## [1] "HG00261"
## 
## $genotype
## $genotype[[1]]
## [1] 0
## 
## $genotype[[2]]
## [1] 0
## 
## 
## $phaseset
## [1] "*"
## 
## $genotypeLikelihood
## $genotypeLikelihood[[1]]
## [1] -0.1
## 
## $genotypeLikelihood[[2]]
## [1] -0.67
## 
## $genotypeLikelihood[[3]]
## [1] -3.3
## 
## 
## $info
## $info$DS
## $info$DS[[1]]
## [1] "0.000"

This is good, but this data becomes much more useful when it is converted to Bioconductor data types. For example, we can convert variants in this list representation to VRanges from VariantAnnotation:

variantsToVRanges(variants)

## VRanges object with 4 ranges and 2 metadata columns:
##               seqnames    ranges strand         ref              alt
##                  <Rle> <IRanges>  <Rle> <character> <characterOrRle>
##   rs143503259    chr22  16051453      *           A                C
##   rs192339082    chr22  16051477      *           C                A
##    rs79725552    chr22  16051480      *           T                C
##   rs141578542    chr22  16051497      *           A                G
##                   totalDepth       refDepth       altDepth   sampleNames
##               <integerOrRle> <integerOrRle> <integerOrRle> <factorOrRle>
##   rs143503259           <NA>           <NA>           <NA>          <NA>
##   rs192339082           <NA>           <NA>           <NA>          <NA>
##    rs79725552           <NA>           <NA>           <NA>          <NA>
##   rs141578542           <NA>           <NA>           <NA>          <NA>
##               softFilterMatrix |      QUAL      FILTER
##                       <matrix> | <numeric> <character>
##   rs143503259                  |       100        PASS
##   rs192339082                  |       100        PASS
##    rs79725552                  |       100        PASS
##   rs141578542                  |       100        PASS
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
##   hardFilters: NULL

0.2 Annotating Variants

Next let’s use package VariantAnnotation to annotate some specific 1,000 Genomes Phase 1 variants.

library(VariantAnnotation)

Note that the parameters start and end are expressed in 0-based coordinates per the GA4GH specification but the Bioconductor data type converters in GoogleGenomics, by default, transform the returned data to 1-based coordinates.

granges <- getVariants(variantSetId="10473108253681171589",
                       chromosome="22",
                       start=50300077,
                       end=50303000,
                       converter=variantsToGRanges)

## Fetching variants page.

## Continuing variant query with the nextPageToken: CIWK_hcQ_Nr-8ZGlk78i

## Fetching variants page.

## Continuing variant query with the nextPageToken: CIaM_hcQvM_un46SmcJ0

## Fetching variants page.

## Continuing variant query with the nextPageToken: CJGN_hcQiYna3cSwse_RAQ

## Fetching variants page.

## Continuing variant query with the nextPageToken: CMKQ_hcQr9OUpKu4n_xz

## Fetching variants page.

## Continuing variant query with the nextPageToken: CMKS_hcQn-nIneaph7WJAQ

## Fetching variants page.

## Continuing variant query with the nextPageToken: CKOU_hcQ2eCCwsmBysdd

## Fetching variants page.

## Continuing variant query with the nextPageToken: CLWV_hcQmo-n-tX08uPyAQ

## Fetching variants page.

## Continuing variant query with the nextPageToken: CMqX_hcQ__3H4KKV9tUR

## Fetching variants page.

## Continuing variant query with the nextPageToken: CKuZ_hcQ9pHXwa742bKgAQ

## Fetching variants page.

## Continuing variant query with the nextPageToken: CLqa_hcQ-6zjo_Dc86gy

## Fetching variants page.

## Continuing variant query with the nextPageToken: CLSb_hcQ3vrDo5XJwsB_

## Fetching variants page.

## Continuing variant query with the nextPageToken: CKOd_hcQ6IPgk-PVjskK

## Fetching variants page.

## Continuing variant query with the nextPageToken: CPGe_hcQ-8Saw6Ojg8yXAQ

## Fetching variants page.

## Continuing variant query with the nextPageToken: CPGf_hcQuKDdodiCvLUm

## Fetching variants page.

## Variants are now available.

Now locate the protein coding variants in each:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
granges_locations <- locateVariants(granges, txdb, CodingVariants())

## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence
##   75253. Note that ranges located on a sequence whose length is
##   unknown (NA) or on a circular sequence are not considered
##   out-of-bound (use seqlengths() and isCircular() to get the lengths
##   and circularity flags of the underlying sequences). You can use
##   trim() to trim these ranges. See ?`trim,GenomicRanges-method` for
##   more information.

## 'select()' returned many:1 mapping between keys and columns

granges_locations

## GRanges object with 7 ranges and 9 metadata columns:
##       seqnames    ranges strand | LOCATION  LOCSTART    LOCEND   QUERYID
##          <Rle> <IRanges>  <Rle> | <factor> <integer> <integer> <integer>
##   [1]    chr22  50301422      - |   coding       939       939        24
##   [2]    chr22  50301476      - |   coding       885       885        25
##   [3]    chr22  50301488      - |   coding       873       873        26
##   [4]    chr22  50301494      - |   coding       867       867        27
##   [5]    chr22  50301584      - |   coding       777       777        28
##   [6]    chr22  50302962      - |   coding       698       698        57
##   [7]    chr22  50302995      - |   coding       665       665        58
##              TXID         CDSID      GENEID       PRECEDEID        FOLLOWID
##       <character> <IntegerList> <character> <CharacterList> <CharacterList>
##   [1]       75253        218562       79087            <NA>            <NA>
##   [2]       75253        218562       79087            <NA>            <NA>
##   [3]       75253        218562       79087            <NA>            <NA>
##   [4]       75253        218562       79087            <NA>            <NA>
##   [5]       75253        218562       79087            <NA>            <NA>
##   [6]       75253        218563       79087            <NA>            <NA>
##   [7]       75253        218563       79087            <NA>            <NA>
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

And predict the effect of the protein coding variants:

library(BSgenome.Hsapiens.UCSC.hg19)

granges_coding <- predictCoding(rep(granges, elementNROWS(granges$ALT)),
                                txdb,
                                seqSource=Hsapiens,
                                varAllele=unlist(granges$ALT, use.names=FALSE))

## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence
##   75253. Note that ranges located on a sequence whose length is
##   unknown (NA) or on a circular sequence are not considered
##   out-of-bound (use seqlengths() and isCircular() to get the lengths
##   and circularity flags of the underlying sequences). You can use
##   trim() to trim these ranges. See ?`trim,GenomicRanges-method` for
##   more information.

granges_coding

## GRanges object with 7 ranges and 16 metadata columns:
##               seqnames    ranges strand |            REF                ALT
##                  <Rle> <IRanges>  <Rle> | <DNAStringSet> <DNAStringSetList>
##   rs114335781    chr22  50301422      - |              G                  A
##     rs8135963    chr22  50301476      - |              T                  C
##   rs200080075    chr22  50301488      - |              C                  T
##   rs147801200    chr22  50301494      - |              G                  A
##   rs138060012    chr22  50301584      - |              C                  T
##   rs114264124    chr22  50302962      - |              C                  T
##   rs149209714    chr22  50302995      - |              C                  G
##                    QUAL      FILTER      varAllele    CDSLOC    PROTEINLOC
##               <numeric> <character> <DNAStringSet> <IRanges> <IntegerList>
##   rs114335781       100        PASS              T       939           313
##     rs8135963       100        PASS              G       885           295
##   rs200080075       100        PASS              A       873           291
##   rs147801200       100        PASS              T       867           289
##   rs138060012       100        PASS              A       777           259
##   rs114264124       100        PASS              A       698           233
##   rs149209714       100        PASS              C       665           222
##                 QUERYID        TXID         CDSID      GENEID   CONSEQUENCE
##               <integer> <character> <IntegerList> <character>      <factor>
##   rs114335781        24       75253        218562       79087    synonymous
##     rs8135963        25       75253        218562       79087    synonymous
##   rs200080075        26       75253        218562       79087    synonymous
##   rs147801200        27       75253        218562       79087    synonymous
##   rs138060012        28       75253        218562       79087    synonymous
##   rs114264124        57       75253        218563       79087 nonsynonymous
##   rs149209714        58       75253        218563       79087 nonsynonymous
##                     REFCODON       VARCODON         REFAA         VARAA
##               <DNAStringSet> <DNAStringSet> <AAStringSet> <AAStringSet>
##   rs114335781            ATC            ATT             I             I
##     rs8135963            GCA            GCG             A             A
##   rs200080075            CCG            CCA             P             P
##   rs147801200            CAC            CAT             H             H
##   rs138060012            CCG            CCA             P             P
##   rs114264124            CGG            CAG             R             Q
##   rs149209714            GGA            GCA             G             A
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Finally, add gene information:

library(org.Hs.eg.db)

annots <- select(org.Hs.eg.db,
                 keys=granges_coding$GENEID,
                 keytype="ENTREZID",
                 columns=c("SYMBOL", "GENENAME","ENSEMBL"))

## 'select()' returned many:1 mapping between keys and columns

cbind(elementMetadata(granges_coding), annots)

## DataFrame with 7 rows and 20 columns
##              REF                ALT      QUAL      FILTER      varAllele
##   <DNAStringSet> <DNAStringSetList> <numeric> <character> <DNAStringSet>
## 1              G                  A       100        PASS              T
## 2              T                  C       100        PASS              G
## 3              C                  T       100        PASS              A
## 4              G                  A       100        PASS              T
## 5              C                  T       100        PASS              A
## 6              C                  T       100        PASS              A
## 7              C                  G       100        PASS              C
##      CDSLOC    PROTEINLOC   QUERYID        TXID         CDSID      GENEID
##   <IRanges> <IntegerList> <integer> <character> <IntegerList> <character>
## 1       939           313        24       75253        218562       79087
## 2       885           295        25       75253        218562       79087
## 3       873           291        26       75253        218562       79087
## 4       867           289        27       75253        218562       79087
## 5       777           259        28       75253        218562       79087
## 6       698           233        57       75253        218563       79087
## 7       665           222        58       75253        218563       79087
##     CONSEQUENCE       REFCODON       VARCODON         REFAA         VARAA
##        <factor> <DNAStringSet> <DNAStringSet> <AAStringSet> <AAStringSet>
## 1    synonymous            ATC            ATT             I             I
## 2    synonymous            GCA            GCG             A             A
## 3    synonymous            CCG            CCA             P             P
## 4    synonymous            CAC            CAT             H             H
## 5    synonymous            CCG            CCA             P             P
## 6 nonsynonymous            CGG            CAG             R             Q
## 7 nonsynonymous            GGA            GCA             G             A
##      ENTREZID      SYMBOL                             GENENAME
##   <character> <character>                          <character>
## 1       79087       ALG12 ALG12, alpha-1,6-mannosyltransferase
## 2       79087       ALG12 ALG12, alpha-1,6-mannosyltransferase
## 3       79087       ALG12 ALG12, alpha-1,6-mannosyltransferase
## 4       79087       ALG12 ALG12, alpha-1,6-mannosyltransferase
## 5       79087       ALG12 ALG12, alpha-1,6-mannosyltransferase
## 6       79087       ALG12 ALG12, alpha-1,6-mannosyltransferase
## 7       79087       ALG12 ALG12, alpha-1,6-mannosyltransferase
##           ENSEMBL
##       <character>
## 1 ENSG00000182858
## 2 ENSG00000182858
## 3 ENSG00000182858
## 4 ENSG00000182858
## 5 ENSG00000182858
## 6 ENSG00000182858
## 7 ENSG00000182858

0.3 Provenance

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.7-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.7-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] org.Hs.eg.db_3.6.0                     
##  [2] BSgenome.Hsapiens.UCSC.hg19_1.4.0      
##  [3] BSgenome_1.48.0                        
##  [4] rtracklayer_1.40.0                     
##  [5] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
##  [6] GenomicFeatures_1.32.0                 
##  [7] AnnotationDbi_1.42.0                   
##  [8] GoogleGenomics_2.2.0                   
##  [9] VariantAnnotation_1.26.0               
## [10] GenomicAlignments_1.16.0               
## [11] Rsamtools_1.32.0                       
## [12] Biostrings_2.48.0                      
## [13] XVector_0.20.0                         
## [14] SummarizedExperiment_1.10.0            
## [15] DelayedArray_0.6.0                     
## [16] BiocParallel_1.14.0                    
## [17] matrixStats_0.53.1                     
## [18] Biobase_2.40.0                         
## [19] GenomicRanges_1.32.0                   
## [20] GenomeInfoDb_1.16.0                    
## [21] IRanges_2.14.0                         
## [22] S4Vectors_0.18.0                       
## [23] BiocGenerics_0.26.0                    
## [24] knitr_1.20                             
## [25] BiocStyle_2.8.0                        
## 
## loaded via a namespace (and not attached):
##  [1] progress_1.1.2         xfun_0.1               lattice_0.20-35       
##  [4] htmltools_0.3.6        yaml_2.1.18            blob_1.1.1            
##  [7] XML_3.98-1.11          DBI_0.8                bit64_0.9-7           
## [10] GenomeInfoDbData_1.1.0 stringr_1.3.0          zlibbioc_1.26.0       
## [13] evaluate_0.10.1        memoise_1.1.0          biomaRt_2.36.0        
## [16] curl_3.2               Rcpp_0.12.16           backports_1.1.2       
## [19] jsonlite_1.5           bit_1.1-12             rjson_0.2.15          
## [22] digest_0.6.15          stringi_1.1.7          bookdown_0.7          
## [25] grid_3.5.0             rprojroot_1.3-2        tools_3.5.0           
## [28] bitops_1.0-6           magrittr_1.5           RCurl_1.95-4.10       
## [31] RSQLite_2.1.0          pkgconfig_2.0.1        Matrix_1.2-14         
## [34] prettyunits_1.0.2      assertthat_0.2.0       rmarkdown_1.9         
## [37] httr_1.3.1             R6_2.2.2               compiler_3.5.0

Annotating Variants

30 April 2018

Contents

0.1 Working with Variants

0.2 Annotating Variants

0.3 Provenance