GenomicScores provides infrastructure to store and access genomewide position-specific scores within R and Bioconductor.
GenomicScores 1.0.2
GenomicScores is an R package distributed as part of the Bioconductor project. To install the package, start R and enter:
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicScores")
Once GenomicScores is installed, it can be loaded with the following command.
library(GenomicScores)
Often, however, GenomicScores will be automatically loaded when working with an annotation package that uses GenomicScores, such as phastCons100way.UCSC.hg19.
Genomewide scores assign each genomic position a numeric value denoting an estimated measure of constraint or impact on variation at that position. They are commonly used to filter single nucleotide variants or assess the degree of constraint or functionality of genomic features. Genomic scores are built on the basis of different sources of information such as sequence homology, functional domains, physical-chemical changes of amino acid residues, etc.
One particular example of genomic scores are phastCons scores. They provide a measure of conservation obtained from genomewide alignments using the program phast (Phylogenetic Analysis with Space/Time models) from Siepel et al. (2005). The GenomicScores package allows one to retrieve these scores through annotation packages (Section 4) or as AnnotationHub resources (Section 5).
Often, genomic scores such as phastCons are used within workflows running on top of R and Bioconductor. The purpose of the GenomicScores package is to enable an easy and interactive access to genomic scores within those workflows.
Storing and accessing genomic scores within R is challenging when their values cover large regions of the genome, resulting in gigabytes of double-precision numbers. This is the case, for instance, for phastCons (Siepel et al. 2005), CADD (Kircher et al. 2014) or M-CAP (Jagadeesh et al. 2016) scores.
We address this problem by using lossy compression, also called quantization, coupled with run-length encoding (Rle) vectors. Lossy compression attempts to trade off precision for compression without compromising the scientific integrity of the data (Zender 2016).
Sometimes, measurements and statistical estimates under certain models generate false precision. False precision is essentialy noise that wastes storage space and it is meaningless from the scientific point of view (Zender 2016). In those circumstances, lossy compression not only saves storage space, but also removes false precision.
The use of lossy compression leads to a subset of quantized values much smaller than the original set of genomic scores, resulting in long runs of identical values along the genome. These runs of identical values can be further compressed using the implementation of Rle vectors available in the S4Vectors Bioconductor package.
There are currently four different annotation packages that store genomic scores and can be accessed using the GenomicScores package (Table 1):
Annotation Package | Description |
---|---|
phastCons100way.UCSC.hg19 | phastCons scores derived from the alignment of the human genome (hg19) to other 99 vertebrate species. |
phastCons100way.UCSC.hg38 | phastCons scores derived from the alignment of the human genome (hg38) to other 99 vertebrate species. |
phastCons7way.UCSC.hg38 | phastCons scores derived from the alignment of the human genome (hg38) to other 6 mammal species. |
fitCons.UCSC.hg19 | fitCons scores: fitness consequences of functional annotation for the human genome (hg19). |
This is an example of how genomic scores can be retrieved using the phastCons100way.UCSC.hg19 package. Here, a GScores object is created when the package is loaded.
library(phastCons100way.UCSC.hg19)
library(GenomicRanges)
gsco <- phastCons100way.UCSC.hg19
We should use the function scores() to retrive genomic scores for specific positions.
scores(gsco, GRanges(seqnames="chr7", IRanges(start=117232380, width=1)))
## GRanges object with 1 range and 1 metadata column:
## seqnames ranges strand | scores
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr7 [117232380, 117232380] * | 0.8
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Another way to retrieve genomic scores is by using the AnnotationHub, which is a web resource that provides a central location where genomic files (e.g., VCF, bed, wig) and other resources from standard (e.g., UCSC, Ensembl) and distributed sites, can be found. A Bioconductor AnnotationHub web resource creates and manages a local cache of files retrieved by the user, helping with quick and reproducible access.
The first step to retrieve genomic scores is to check the ones available to download.
availableGScores()
## [1] "fitCons.UCSC.hg19" "phastCons100way.UCSC.hg19"
## [3] "phastCons100way.UCSC.hg38" "phastCons7way.UCSC.hg38"
The selected resource can be downloaded with the function getGScores(). After the resource is downloaded the first time, the cached copy will enable quicker later retrieval.
gsco <- getGScores("phastCons100way.UCSC.hg19")
Finally, the phastCons score of a particular genomic position is retrieved as it has been seen before.
scores(gsco, GRanges(seqnames="chr7", IRanges(start=117232380, width=1)))
## GRanges object with 1 range and 1 metadata column:
## seqnames ranges strand | scores
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr7 [117232380, 117232380] * | 0.8
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
A typical use case of the GenomicScores package is in the context of annotating variants with genomic scores, such as phastCons conservation scores. For this purpose, we load the VariantAnnotaiton and TxDb.Hsapiens.UCSC.hg19.knownGene packages. The former will allow us to read a VCF file and annotate it, and the latter contains the gene annotations from UCSC that will be used in this process.
library(VariantAnnotation)
## Loading required package: SummarizedExperiment
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: DelayedArray
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
##
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
##
## colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following object is masked from 'package:base':
##
## apply
## Loading required package: Rsamtools
## Loading required package: Biostrings
## Loading required package: XVector
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:DelayedArray':
##
## type
## The following object is masked from 'package:base':
##
## strsplit
##
## Attaching package: 'VariantAnnotation'
## The following object is masked from 'package:base':
##
## tabulate
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
## Loading required package: GenomicFeatures
## Loading required package: AnnotationDbi
Let’s load one of the sample VCF files that form part of the VariantAnnotation package.
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
seqlevelsStyle(vcf)
## [1] "NCBI" "Ensembl"
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
seqlevelsStyle(txdb)
## [1] "UCSC"
Because the chromosome nomenclature from the VCF file (NCBI) is different from the one with the gene annotations (UCSC) we use the seqlevelsStyle()
function to force our variants having the chromosome nomenclature of the gene annotations. Then we fetch the variants as a GRanges
object.
seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)
rd <- rowRanges(vcf)
rd[1:3]
## GRanges object with 3 ranges and 5 metadata columns:
## seqnames ranges strand | paramRangeID
## <Rle> <IRanges> <Rle> | <factor>
## rs7410291 chr22 [50300078, 50300078] * | <NA>
## rs147922003 chr22 [50300086, 50300086] * | <NA>
## rs114143073 chr22 [50300101, 50300101] * | <NA>
## REF ALT QUAL FILTER
## <DNAStringSet> <DNAStringSetList> <numeric> <character>
## rs7410291 A G 100 PASS
## rs147922003 C T 100 PASS
## rs114143073 G A 100 PASS
## -------
## seqinfo: 1 sequence from hg19 genome; no seqlengths
We annotate the location of variants using the function locateVariants()
from the VariantAnnotation package.
loc <- locateVariants(rd, txdb, AllVariants())
## 'select()' returned many:1 mapping between keys and columns
## 'select()' returned many:1 mapping between keys and columns
## 'select()' returned many:1 mapping between keys and columns
## 'select()' returned many:1 mapping between keys and columns
## 'select()' returned many:1 mapping between keys and columns
## 'select()' returned many:1 mapping between keys and columns
loc[1:3]
## GRanges object with 3 ranges and 9 metadata columns:
## seqnames ranges strand | LOCATION LOCSTART LOCEND
## <Rle> <IRanges> <Rle> | <factor> <integer> <integer>
## chr22 [50300078, 50300078] - | intron 10763 10763
## chr22 [50300086, 50300086] - | intron 10755 10755
## chr22 [50300101, 50300101] - | intron 10740 10740
## QUERYID TXID CDSID GENEID PRECEDEID
## <integer> <character> <IntegerList> <character> <CharacterList>
## 1 75253 79087
## 2 75253 79087
## 3 75253 79087
## FOLLOWID
## <CharacterList>
##
##
##
## -------
## seqinfo: 1 sequence from hg19 genome; no seqlengths
table(loc$LOCATION)
##
## spliceSite intron fiveUTR threeUTR coding intergenic promoter
## 11 22572 309 1368 2822 2867 2864
Finally annotate phastCons conservation scores on the variants and store those annotations as an additional metadata column of the GRanges
object.
loc$PHASTCONS <- scores(gsco, loc, scores.only=TRUE)
loc[1:3]
## GRanges object with 3 ranges and 10 metadata columns:
## seqnames ranges strand | LOCATION LOCSTART LOCEND
## <Rle> <IRanges> <Rle> | <factor> <integer> <integer>
## chr22 [50300078, 50300078] - | intron 10763 10763
## chr22 [50300086, 50300086] - | intron 10755 10755
## chr22 [50300101, 50300101] - | intron 10740 10740
## QUERYID TXID CDSID GENEID PRECEDEID
## <integer> <character> <IntegerList> <character> <CharacterList>
## 1 75253 79087
## 2 75253 79087
## 3 75253 79087
## FOLLOWID PHASTCONS
## <CharacterList> <numeric>
## 0.0
## 0.1
## 0.0
## -------
## seqinfo: 1 sequence from hg19 genome; no seqlengths
Using the following code we can examine the distribution of phastCons conservation scores of variants across the different annotated regions, shown in Figure 1.
x <- split(loc$PHASTCONS, loc$LOCATION)
mask <- elementNROWS(x) > 0
boxplot(x[mask], ylab="phastCons score")
points(1:length(x[mask])+0.25, sapply(x[mask], mean, na.rm=TRUE), pch=23, bg="black")
To have a sense of the extent of the trade-off between precision and compression we compare here raw phastCons scores with the ones stored in the corresponding annotation package using lossy compression. In this case, phastCons scores were quantized by rounding their precision to one decimal digit.
One thousand raw phastCons scores were sampled uniformly at random from CDS and 3’UTR regions and are included in this package to illustrate this comparison. Interestingly, among the phastCons scores sampled from 1000 CDS positions, there are only 198 different values despite the apparently very high precision of some of them.
origpcscoCDS <- readRDS(system.file("extdata", "origphastCons100wayhg19CDS.rds", package="GenomicScores"))
origpcscoCDS
## GRanges object with 1000 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr1 [ 976248, 976248] * | 0.763999998569489
## [2] chr1 [1886625, 1886625] * | 0.00300000002607703
## [3] chr1 [3751636, 3751636] * | 1
## [4] chr1 [5950945, 5950945] * | 0.236000001430511
## [5] chr1 [6638784, 6638784] * | 0.99099999666214
## ... ... ... ... . ...
## [996] chrX [154113587, 154113587] * | 1
## [997] chrX [154261706, 154261706] * | 1
## [998] chrX [154305499, 154305499] * | 1
## [999] chrY [ 15481226, 15481226] * | 1
## [1000] chrY [ 21871402, 21871402] * | 0.981000006198883
## -------
## seqinfo: 25 sequences from an unspecified genome
length(unique(origpcscoCDS$score))
## [1] 198
We look more closely the number of decimal digits of precision used in these raw scores.
numDecimals <- function(x) {
spl <- strsplit(as.character(x+1), "\\.")
spl <- sapply(spl, "[", 2)
spl[is.na(spl)] <- ""
nchar(spl)
}
nd1 <- numDecimals(origpcscoCDS$score)
table(nd1)
## nd1
## 0 1 2 12 13 14
## 637 1 2 5 63 292
Similarly, in 3’UTR regions, only 209 unique phastCons scores are observed.
origpcsco3UTRs <- readRDS(system.file("extdata", "origphastCons100wayhg193UTR.rds", package="GenomicScores"))
origpcsco3UTRs
## GRanges object with 1000 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr1 [1189930, 1189930] * | 0
## [2] chr1 [1227810, 1227810] * | 0
## [3] chr1 [1595390, 1595390] * | 0.0149999996647239
## [4] chr1 [1595685, 1595685] * | 0
## [5] chr1 [2336044, 2336044] * | 0
## ... ... ... ... . ...
## [996] chrX [125684491, 125684491] * | 0
## [997] chrX [129190576, 129190576] * | 1
## [998] chrX [135299066, 135299066] * | 0.196999996900558
## [999] chrX [148560948, 148560948] * | 0.00300000002607703
## [1000] chrX [153289202, 153289202] * | 0
## -------
## seqinfo: 25 sequences from an unspecified genome
length(table(origpcsco3UTRs$score))
## [1] 209
nd2 <- numDecimals(origpcsco3UTRs$score)
table(nd2)
## nd2
## 0 12 13 14
## 502 3 112 383
Retrieve the corresponding phastCons scores stored in the annotation package.
pkgpcscoCDS <- scores(gsco, origpcscoCDS, scores.only=TRUE)
pkgpcsco3UTRs <- scores(gsco, origpcsco3UTRs, scores.only=TRUE)
In Figure 2 we show a visual comparison between raw and lossy-compressed phastCons scores. The two panels on top compare the whole range of scores observed in CDS (left) and 3’UTR (right) regions. However, the effect of lossy compression can be better observed in the cumulative distributions shown in the panels at the bottom, again for CDS (left) and 3’UTR (right) regions.
In these bottom panels, phastcons scores in CDS and 3’UTR regions display very different cumulative distributions. In CDS regions, most of the genomic scores (>60%) are found between the values of 0.9 and 1.0, while around 25% of the scores are found below 0.1. Indeed, these are the range of values where lossy compression loses more precison. The cumulative distribution of 3’UTR shows the same critical points, with the difference that most of scores are found below 0.1 (>70%).
The bottom plots in Figure 2 also reveal that when the cumulative distribution is of interest, such as in the context of filtering genetic variants above or below certain threshold of conservation, the quantization of phastCons scores to one decimal digit provides sufficient precision for a wide range of conservation values.
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.5-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.5-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [2] GenomicFeatures_1.28.4
## [3] AnnotationDbi_1.38.2
## [4] VariantAnnotation_1.22.3
## [5] Rsamtools_1.28.0
## [6] Biostrings_2.44.2
## [7] XVector_0.16.0
## [8] SummarizedExperiment_1.6.3
## [9] DelayedArray_0.2.7
## [10] matrixStats_0.52.2
## [11] Biobase_2.36.2
## [12] phastCons100way.UCSC.hg19_3.5.0
## [13] GenomicScores_1.0.2
## [14] GenomicRanges_1.28.4
## [15] GenomeInfoDb_1.12.2
## [16] IRanges_2.10.3
## [17] S4Vectors_0.14.3
## [18] BiocGenerics_0.22.0
## [19] BiocStyle_2.4.1
##
## loaded via a namespace (and not attached):
## [1] lattice_0.20-35 htmltools_0.3.6
## [3] rtracklayer_1.36.4 yaml_2.1.14
## [5] interactiveDisplayBase_1.14.0 blob_1.1.0
## [7] XML_3.98-1.9 rlang_0.1.2
## [9] DBI_0.7 BiocParallel_1.10.1
## [11] bit64_0.9-7 GenomeInfoDbData_0.99.0
## [13] stringr_1.2.0 zlibbioc_1.22.0
## [15] memoise_1.1.0 evaluate_0.10.1
## [17] knitr_1.17 biomaRt_2.32.1
## [19] httpuv_1.3.5 BiocInstaller_1.26.1
## [21] highr_0.6 Rcpp_0.12.12
## [23] xtable_1.8-2 backports_1.1.0
## [25] BSgenome_1.44.1 mime_0.5
## [27] bit_1.1-12 AnnotationHub_2.8.2
## [29] digest_0.6.12 stringi_1.1.5
## [31] bookdown_0.5 shiny_1.0.5
## [33] rprojroot_1.2 grid_3.4.1
## [35] tools_3.4.1 bitops_1.0-6
## [37] magrittr_1.5 RCurl_1.95-4.8
## [39] tibble_1.3.4 RSQLite_2.0
## [41] pkgconfig_2.0.1 Matrix_1.2-11
## [43] rmarkdown_1.6 httr_1.3.1
## [45] R6_2.2.2 GenomicAlignments_1.12.2
## [47] compiler_3.4.1
Jagadeesh, Karthik A, Aaron M Wenger, Mark J Berger, Harendra Guturu, Peter D Stenson, David N Cooper, Jonathan A Bernstein, and Gill Bejerano. 2016. “M-Cap Eliminates a Majority of Variants of Uncertain Significance in Clinical Exomes at High Sensitivity.” Nat. Genet. 48 (12). Nature Publishing Group: 1581–6.
Kircher, Martin, Daniela M Witten, Preti Jain, Brian J O’roak, Gregory M Cooper, and Jay Shendure. 2014. “A General Framework for Estimating the Relative Pathogenicity of Human Genetic Variants.” Nat. Genet. 46 (3). Nature Publishing Group: 310–15.
Siepel, Adam, Gill Bejerano, Jakob S Pedersen, Angie S Hinrichs, Minmei Hou, Kate Rosenbloom, Hiram Clawson, et al. 2005. “Evolutionarily Conserved Elements in Vertebrate, Insect, Worm, and Yeast Genomes.” Genome Res. 15 (8). Cold Spring Harbor Laboratory: 1034–50.
Zender, Charles S. 2016. “Bit Grooming: Statistically Accurate Precision-Preserving Quantization with Compression, Evaluated in the NetCDF Operators (Nco, V4. 4.8+).” Geosci. Model Dev. 9 (9). Copernicus GmbH: 3199–3211.