Ensembl’s Variant Effect Predictor is described in McLaren et al. (2016).
Prior to Bioconductor 3.19, the ensemblVEP package provided access to Ensembl’s predictions through an interface between Perl and MySQL.
As of 3.19, VariantAnnotation supports the use of the VEP component of the Ensembl REST API at https://rest.ensembl.org.
The function vep_by_region will accept a VCF object as defined in VariantAnnotation.
library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
r22 = readVcf(fl)
r22
## class: CollapsedVCF
## dim: 10376 5
## rowRanges(vcf):
## GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
## DataFrame with 22 columns: LDAF, AVGPOST, RSQ, ERATE, THETA, CIEND, CIPOS,...
## info(header(vcf)):
## Number Type Description
## LDAF 1 Float MLE Allele Frequency Accounting for LD
## AVGPOST 1 Float Average posterior probability from MaCH/Thunder
## RSQ 1 Float Genotype imputation quality from MaCH/Thunder
## ERATE 1 Float Per-marker Mutation rate from MaCH/Thunder
## THETA 1 Float Per-marker Transition rate from MaCH/Thunder
## CIEND 2 Integer Confidence interval around END for imprecise var...
## CIPOS 2 Integer Confidence interval around POS for imprecise var...
## END 1 Integer End position of the variant described in this re...
## HOMLEN . Integer Length of base pair identical micro-homology at ...
## HOMSEQ . String Sequence of base pair identical micro-homology a...
## SVLEN 1 Integer Difference in length between REF and ALT alleles
## SVTYPE 1 String Type of structural variant
## AC . Integer Alternate Allele Count
## AN 1 Integer Total Allele Count
## AA 1 String Ancestral Allele, ftp://ftp.1000genomes.ebi.ac.u...
## AF 1 Float Global Allele Frequency based on AC/AN
## AMR_AF 1 Float Allele Frequency for samples from AMR based on A...
## ASN_AF 1 Float Allele Frequency for samples from ASN based on A...
## AFR_AF 1 Float Allele Frequency for samples from AFR based on A...
## EUR_AF 1 Float Allele Frequency for samples from EUR based on A...
## VT 1 String indicates what type of variant the line represents
## SNPSOURCE . String indicates if a snp was called when analysing the...
## geno(vcf):
## List of length 3: GT, DS, GL
## geno(header(vcf)):
## Number Type Description
## GT 1 String Genotype
## DS 1 Float Genotype dosage from MaCH/Thunder
## GL G Float Genotype Likelihoods
In this example we confine attention to single nucleotide variants.
The REST API accepts at most 200 locations per request and at most 55000 requests per hour. We’ll base our query on 100 positions in the chr22 VCF.
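library(httr)      # assumed source of content()
library(jsonlite)  # assumed source of toJSON()
dr = which(width(rowRanges(r22))!=1)  # records whose range spans more than one base
r22s = r22[-dr]                       # drop them, keeping the width-1 records
res = vep_by_region(r22[1:100], snv_only=FALSE, chk_max=FALSE)  # note: queries the first 100 records of r22
jans = toJSON(content(res))           # parsed response body serialized as JSON text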
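Because each request is limited to 200 locations, a longer VCF would have to be submitted in batches. The following is a minimal sketch of one way to split record indices into chunks of at most 200. The helper name chunk_indices is hypothetical and is not part of VariantAnnotation.

```r
# Hypothetical helper (not part of VariantAnnotation): split the indices
# 1..n into consecutive chunks holding at most `max_per_request` indices.
chunk_indices <- function(n, max_per_request = 200L) {
    idx <- seq_len(n)
    unname(split(idx, ceiling(idx / max_per_request)))
}

chunks <- chunk_indices(450L)
lengths(chunks)   # sizes of the three chunks: 200, 200, 50
```

Each chunk could then be used to subset the VCF before a call to vep_by_region, with a pause between calls if the hourly request limit is a concern.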
There are various ways to work with the result of this query to the API. We’ll use the rjsoncons JSON processing infrastructure to dig in and understand aspects of the API behavior.
First, we retrieve the names of the top-level fields returned for each variant:
## [1] "input" "allele_string"
## [3] "start" "most_severe_consequence"
## [5] "transcript_consequences" "assembly_name"
## [7] "seq_region_name" "end"
## [9] "strand" "regulatory_feature_consequences"
## [11] "id" "colocated_variants"
## [13] "motif_feature_consequences"
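The code that produced this listing is not shown here. As a rough sketch of how such a listing could be obtained, the example below parses a small made-up JSON array with jsonlite (an assumption; the text uses rjsoncons) and collects the union of field names across records. The toy values are invented for illustration only.

```r
library(jsonlite)

# Toy stand-in for the API response: two records with made-up values.
toy_json <- '[
  {"input": "22 100 v1 A G", "start": 100, "end": 100,
   "most_severe_consequence": "intron_variant"},
  {"input": "22 200 v2 C T", "start": 200, "end": 200,
   "most_severe_consequence": "intron_variant",
   "colocated_variants": [{"id": "v2"}]}
]'

# Parse without simplification so that each record stays a named list.
records <- fromJSON(toy_json, simplifyVector = FALSE)

# Union of the field names seen across all records.
top_level <- unique(unlist(lapply(records, names)))
top_level
```

Taking the union matters because, as in the toy data, not every record need carry every field.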
Annotation of the most severe consequence known will typically be of interest:
##
## 5_prime_UTR_variant intron_variant splice_region_variant
## 1 98 1
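A tabulation of this kind can be sketched in base R as follows. The records are a made-up stand-in for the parsed response, and the approach is an illustration, not necessarily the code used to produce the table above.

```r
# Toy records standing in for the parsed API response (values are made up).
records <- list(
    list(id = "v1", most_severe_consequence = "intron_variant"),
    list(id = "v2", most_severe_consequence = "5_prime_UTR_variant"),
    list(id = "v3", most_severe_consequence = "intron_variant")
)

# Pull one character value per record, then count occurrences of each term.
msc <- vapply(records, function(r) r$most_severe_consequence, character(1))
msc_counts <- table(msc)
msc_counts
```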
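The structure of the data returned varies from variant to variant. In the regulatory feature consequences shown below, for example, the same columns appear in a different order in different records.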
## [[1]]
## consequence_terms biotype impact regulatory_feature_id variant_allele
## 1 regulato.... promoter MODIFIER ENSR0000.... G
##
## [[2]]
## consequence_terms biotype impact regulatory_feature_id variant_allele
## 1 regulato.... promoter MODIFIER ENSR0000.... T
##
## [[3]]
## biotype consequence_terms impact variant_allele regulatory_feature_id
## 1 promoter regulato.... MODIFIER A ENSR0000....
##
## [[4]]
## consequence_terms impact biotype regulatory_feature_id variant_allele
## 1 regulato.... MODIFIER promoter ENSR0000.... T
##
## [[5]]
## regulatory_feature_id variant_allele consequence_terms biotype impact
## 1 ENSR0000.... T regulato.... promoter MODIFIER
##
## [[6]]
## variant_allele regulatory_feature_id consequence_terms biotype impact
## 1 A ENSR0000.... regulato.... promoter MODIFIER
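When records share the same columns but in different orders, one way to combine them is to impose a common column order before stacking. The sketch below uses made-up data frames with simple character columns; in the real response some columns (such as consequence_terms) are list-valued and may need extra care.

```r
# Toy data frames with the same columns in different orders (made-up values).
recs <- list(
    data.frame(impact = "MODIFIER", biotype = "promoter", variant_allele = "G"),
    data.frame(variant_allele = "T", impact = "MODIFIER", biotype = "promoter")
)

# Reorder every data frame to a sorted set of shared column names,
# then stack the rows.
common <- sort(Reduce(intersect, lapply(recs, names)))
stacked <- do.call(rbind, lapply(recs, function(d) d[, common, drop = FALSE]))
stacked
```

Using the intersection of names also drops any column that is absent from some records, which may or may not be desirable.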
Furthermore, the content of the motif feature consequences field is heterogeneous: a tabulation of its values mixes numeric scores, alleles, identifiers, and consequence terms.
##
## -0.003 -0.022 -0.087
## 1 1 1
## 1 2 6
## 3 1 1
## 9 A ENSM00205804739
## 1 1 1
## ENSM00494167763 ENSM00522532781 ENSPFM0238
## 1 1 1
## ENSPFM0506 ENSPFM0597 GCM1::MAX
## 1 1 1
## MODIFIER N T
## 3 2 2
## TEAD4::ELF1 TEAD4::ELK1 TEAD4::SPIB
## 1 1 1
## TFAP2C::MAX TF_binding_site_variant Y
## 1 3 1
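One plausible reading of this table, offered as an assumption because the generating code is not shown, is that flattening a nested record coerces every field to character and pools the values. The sketch below reproduces that effect on made-up records; the field names are assumptions chosen for illustration.

```r
# Toy motif-feature records with made-up values and assumed field names.
motif <- list(
    list(motif_score_change = -0.022, variant_allele = "T",
         impact = "MODIFIER", strand = 1),
    list(motif_score_change = -0.003, variant_allele = "A",
         impact = "MODIFIER", strand = 1)
)

# unlist() flattens all fields into one character vector, so scores,
# alleles, and impact labels end up in the same tabulation.
flat <- unlist(motif)
table(flat)

# Extracting one field at a time keeps the value types apart.
alleles <- vapply(motif, function(m) m$variant_allele, character(1))
scores <- vapply(motif, function(m) m$motif_score_change, numeric(1))
```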
We’ll consider the following approach to converting the API response to a GenomicRanges GRanges instance. Eventually this may become part of the package.
library(GenomicRanges)
.make_GRanges = function( vep_response ) {
  stopifnot(inherits(vep_response, "response")) # httr
  # round-trip through JSON so that the records simplify to a table
  nested = fromJSON(toJSON(content(vep_response)))
  # coordinates arrive as list columns, hence unlist()
  ini = GRanges(seqnames = unlist(nested$seq_region_name),
                IRanges(start=unlist(nested$start), end=unlist(nested$end)))
  # keep all remaining fields as metadata columns
  dr = match(c("seq_region_name", "start", "end"), names(nested))
  mcols(ini) = DataFrame(nested[,-dr])
  ini
}
tstg = .make_GRanges( res )
tstg[,1] # full print is unwieldy
## GRanges object with 100 ranges and 1 metadata column:
## seqnames ranges strand | input
## <Rle> <IRanges> <Rle> | <list>
## [1] 22 50300078 * | 22 50300078 rs741029..
## [2] 22 50300086 * | 22 50300086 rs147922..
## [3] 22 50300101 * | 22 50300101 rs114143..
## [4] 22 50300113 * | 22 50300113 rs141778..
## [5] 22 50300166 * | 22 50300166 rs182170..
## ... ... ... ... . ...
## [96] 22 50304748 * | 22 50304748 rs141641..
## [97] 22 50304805 * | 22 50304805 rs761151..
## [98] 22 50304935 * | 22 50304935 rs121677..
## [99] 22 50304943 * | 22 50304943 rs186556..
## [100] 22 50305084 * | 22 50305084 rs116244..
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
## [1] "input" "allele_string"
## [3] "most_severe_consequence" "transcript_consequences"
## [5] "assembly_name" "strand"
## [7] "regulatory_feature_consequences" "id"
## [9] "colocated_variants" "motif_feature_consequences"
Now information about variants can be retrieved with range operations. Deep annotation relies on the nested structure of the metadata columns.
## [[1]]
## gene_symbol_source biotype consequence_terms strand hgnc_id
## 1 HGNC protein_.... intron_v.... -1 HGNC:9104
## 2 HGNC protein_.... intron_v.... -1 HGNC:9104
## 3 HGNC protein_.... intron_v.... -1 HGNC:9104
## 4 HGNC protein_.... intron_v.... -1 HGNC:9104
## variant_allele transcript_id impact gene_id gene_symbol flags
## 1 G ENST0000.... MODIFIER ENSG0000.... PLXNB2
## 2 G ENST0000.... MODIFIER ENSG0000.... PLXNB2 cds_end_NF
## 3 G ENST0000.... MODIFIER ENSG0000.... PLXNB2 cds_end_NF
## 4 G ENST0000.... MODIFIER ENSG0000.... PLXNB2
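As an illustration of retrieval by range operations, the sketch below builds a small GRanges with made-up positions and one metadata column, then keeps the variants that overlap a region of interest. It is a toy example and does not use the API response.

```r
library(GenomicRanges)

# Toy ranges with made-up positions and one metadata column.
gr <- GRanges(seqnames = "22",
              ranges = IRanges(start = c(100, 250, 900), width = 1),
              most_severe_consequence = c("intron_variant",
                                          "5_prime_UTR_variant",
                                          "intron_variant"))

# Region of interest: keep only variants falling inside 22:200-1000.
roi <- GRanges("22", IRanges(200, 1000))
hits <- subsetByOverlaps(gr, roi)
hits$most_severe_consequence
```

The same call should apply to tstg, after which nested metadata columns of the selected variants can be inspected with `$` or mcols().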
An important feature of the earlier ensemblVEP package is support for feeding annotation back into the VCF used to generate the effect prediction query. This seems feasible here, but concrete use cases are of interest.
McLaren, William, Laurent Gil, Sarah E. Hunt, Harpreet Singh Riat, Graham R. S. Ritchie, Anja Thormann, Paul Flicek, and Fiona Cunningham. 2016. “The Ensembl Variant Effect Predictor.” Genome Biology 17 (1): 122. https://doi.org/10.1186/s13059-016-0974-4.