1 Getting started

GenomicScores is an R package distributed as part of the Bioconductor project. To install the package, start R and enter:

install.packages("BiocManager")
BiocManager::install("GenomicScores")

Once GenomicScores is installed, it can be loaded with the following command.

library(GenomicScores)

Often, however, GenomicScores will be automatically loaded when working with an annotation package that uses GenomicScores, such as phastCons100way.UCSC.hg19.

2 Genomewide position-specific scores

Genomewide scores assign each genomic position a numeric value denoting an estimated measure of constraint or impact on variation at that position. They are commonly used to filter single nucleotide variants or assess the degree of constraint or functionality of genomic features. Genomic scores are built on the basis of different sources of information such as sequence homology, functional domains, physical-chemical changes of amino acid residues, etc.

One particular example of genomic scores are phastCons scores. They provide a measure of conservation obtained from genomewide alignments using the program phast (Phylogenetic Analysis with Space/Time models) from Siepel et al. (2005). The GenomicScores package allows one to retrieve these scores through annotation packages (Section 4) or as AnnotationHub resources (Section 5).

Often, genomic scores such as phastCons are used within workflows running on top of R and Bioconductor. The purpose of the GenomicScores package is to enable an easy and interactive access to genomic scores within those workflows.

3 Lossy storage of genomic scores with compressed vectors

Storing and accessing genomic scores within R is challenging when their values cover large regions of the genome, resulting in gigabytes of double-precision numbers. This is the case, for instance, for phastCons (Siepel et al. 2005), CADD (Kircher et al. 2014) or M-CAP (Jagadeesh et al. 2016) scores.

We address this problem by using lossy compression, also called quantization, coupled with run-length encoding (Rle) vectors. Lossy compression attempts to trade off precision for compression without compromising the scientific integrity of the data (Zender 2016).

Sometimes, measurements and statistical estimates under certain models generate false precision. False precision is essentialy noise that wastes storage space and it is meaningless from the scientific point of view (Zender 2016). In those circumstances, lossy compression not only saves storage space, but also removes false precision.

The use of lossy compression leads to a subset of quantized values much smaller than the original set of genomic scores, resulting in long runs of identical values along the genome. These runs of identical values can be further compressed using the implementation of Rle vectors available in the S4Vectors Bioconductor package.

To enable a seamless access to genomic scores stored with quantized values in compressed vectors the GenomicScores defines the GScores class of objects. This class manages the location, loading and dequantization of genomic scores stored separately on each chromosome.

4 Retrieval of genomic scores through annotation packages

There are currently four different annotation packages that store genomic scores and can be accessed using the GenomicScores package; see Table 1.


Table 1: Bioconductor annotation packages storing genomic scores
Annotation Package Description
phastCons100way.UCSC.hg19 phastCons scores derived from the alignment of the human genome (hg19) to other 99 vertebrate species.
phastCons100way.UCSC.hg38 phastCons scores derived from the alignment of the human genome (hg38) to other 99 vertebrate species.
phastCons7way.UCSC.hg38 phastCons scores derived from the alignment of the human genome (hg38) to other 6 mammal species.
fitCons.UCSC.hg19 fitCons scores: fitness consequences of functional annotation for the human genome (hg19).

This is an example of how genomic scores can be retrieved using the phastCons100way.UCSC.hg19 package. Here, a GScores object is created when the package is loaded.

library(phastCons100way.UCSC.hg19)
phast <- phastCons100way.UCSC.hg19
class(phast)
[1] "GScores"
attr(,"package")
[1] "GenomicScores"

The help page of the GScores class describes the different methods to access the information and metadata stored in a GScores object. To retrieve genomic scores for specific positions we should use the function gscores(), as follows.

gscores(phast, GRanges(seqnames="chr22",
                       IRanges(start=50967020:50967025, width=1)))
GRanges object with 6 ranges and 1 metadata column:
      seqnames    ranges strand |   default
         <Rle> <IRanges>  <Rle> | <numeric>
  [1]    chr22  50967020      * |       1.0
  [2]    chr22  50967021      * |       1.0
  [3]    chr22  50967022      * |       0.8
  [4]    chr22  50967023      * |       1.0
  [5]    chr22  50967024      * |       1.0
  [6]    chr22  50967025      * |       0.0
  -------
  seqinfo: 93 sequences (1 circular) from Genome Reference Consortium GRCh37 genome

The GenomicScores package only loads the scores data from one sequence to retrieve metadata and from the sequences that are being queried. Note that now the GScores object has loaded the scores from chr22.

phast
GScores object 
# organism: Homo sapiens (UCSC, hg19)
# provider: UCSC
# provider version: 09Feb2014
# download date: Apr 10, 2018
# loaded sequences: chr19_gl000208_random, chr22
# number of sites: 2858 millions
# maximum abs. error: 0.05
# use 'citation()' to cite these data in publications

The bibliographic reference to cite the genomic scores stored in a GScores object can be accessed using the citation() method either on the package name or on the GScores object. The latter is implemented in the GenomicScores package and provides a bibentry object.

citation(phast)
Adam Siepel, Gill Berejano, Jakob S. Pedersen, Angie S. Hinrichs,
Minmei Hou, Kate Rosenbloom, Hiram Clawson, John Spieth, LaDeana W.
Hillier, Stephen Richards, George M. Weinstock, Richard K. Wilson,
Richard A. Gibbs, W. James Kent, Webb Miller, David Haussler (2005).
"Evolutionarily conserved elements in vertebrate, insect, worm, and
yeast genomes." _Genome Research_, *15*, 1034-1050.
doi:10.1101/gr.3715005 <https://doi.org/10.1101/gr.3715005>.

Other methods tracing provenance and other metadata are provider(), providerVersion(), organism() and seqlevelsStyle(); please consult the help page of the GScores class for a comprehensive list of available methods.

provider(phast)
[1] "UCSC"
providerVersion(phast)
[1] "09Feb2014"
organism(phast)
[1] "Homo sapiens"
seqlevelsStyle(phast)
[1] "UCSC"

5 Retrieval of genomic scores through AnnotationHub resources

Another way to retrieve genomic scores is by using the AnnotationHub, which is a web resource that provides a central location where genomic files (e.g., VCF, bed, wig) and other resources from standard (e.g., UCSC, Ensembl) and distributed sites, can be found. A Bioconductor AnnotationHub web resource creates and manages a local cache of files retrieved by the user, helping with quick and reproducible access.

The first step to retrieve genomic scores is to check the ones available to download.

availableGScores()
 [1] "cadd.v1.3.hg19"                    "fitCons.UCSC.hg19"                
 [3] "linsight.UCSC.hg19"                "mcap.v1.0.hg19"                   
 [5] "phastCons7way.UCSC.hg38"           "phastCons27way.UCSC.dm6"          
 [7] "phastCons30way.UCSC.hg38"          "phastCons46wayPlacental.UCSC.hg19"
 [9] "phastCons46wayPrimates.UCSC.hg19"  "phastCons60way.UCSC.mm10"         
[11] "phastCons100way.UCSC.hg19"         "phastCons100way.UCSC.hg38"        
[13] "phyloP60way.UCSC.mm10"             "phyloP100way.UCSC.hg19"           
[15] "phyloP100way.UCSC.hg38"            "phastCons35way.UCSC.mm39"         
[17] "phyloP35way.UCSC.mm39"            

The selected resource can be downloaded with the function getGScores(). After the resource is downloaded the first time, the cached copy will enable a quicker retrieval later.

phast <- getGScores("phastCons100way.UCSC.hg19")

Finally, the phastCons score of a particular genomic position is retrieved exactly in the same we did with the annotation package.

gscores(phast, GRanges(seqnames="chr22",
                       IRanges(start=50967020:50967025, width=1)))
GRanges object with 6 ranges and 1 metadata column:
      seqnames    ranges strand |   default
         <Rle> <IRanges>  <Rle> | <numeric>
  [1]    chr22  50967020      * |       1.0
  [2]    chr22  50967021      * |       1.0
  [3]    chr22  50967022      * |       0.8
  [4]    chr22  50967023      * |       1.0
  [5]    chr22  50967024      * |       1.0
  [6]    chr22  50967025      * |       0.0
  -------
  seqinfo: 93 sequences (1 circular) from Genome Reference Consortium GRCh37 genome

5.1 Building an annotation package from a GScores object

Retrieving genomic scores through AnnotationHub resources requires an internet connection and we may want to work with such resources offline. For that purpose, we can create ourselves an annotation package, such as phastCons100way.UCSC.hg19, from a GScores object corresponding to a downloaded AnnotationHub resource. To do that we use the function makeGScoresPackage() as follows:

makeGScoresPackage(phast, maintainer="Me <me@example.com>",
                   author="Me", version="1.0.0")
Creating package in ./phastCons100way.UCSC.hg19

An argument, destDir, which by default points to the current working directory, can be used to change where in the filesystem the package is created. Afterwards, we should still build and install the package via, e.g., R CMD build and R CMD INSTALL, to be able to use it offline.

6 Retrieval of multiple scores per genomic position

Among the score sets available as AnnotationHub web resources shown in the previous section, two of them, CADD (Kircher et al. 2014) and M-CAP (Jagadeesh et al. 2016), provide multiple scores per genomic position that capture the tolerance to mutations of single nucleotides. Using M-CAP scores, we will illustrate how this type of scores are retrieved by default.

mcap <- getGScores("mcap.v1.0.hg19")
mcap
GScores object 
# organism: Homo sapiens (UCSC, hg19)
# provider: Stanford
# provider version: v1.0
# download date: Mar 15, 2017
# loaded sequences: default
# maximum abs. error: 0.005
# use 'citation()' to cite these data in publications
citation(mcap)
Karthik A. Jagadeesh, Aaron M. Wenger, Mark J. Berger, Harendra Guturu,
Peter D. Stenson, David N. Cooper, Jonathan A. Bernstein, Gill Berejano
(2016). "M-CAP eliminates a majority of variants of uncertain
significance in clinical exomes at high sensitivity." _Nature
Genetics_, *48*, 1581-1586. doi:10.1038/ng.3703
<https://doi.org/10.1038/ng.3703>.
gr <- GRanges(seqnames="chr22", IRanges(50967020:50967025, width=1))
gscores(mcap, gr)
GRanges object with 6 ranges and 1 metadata column:
      seqnames    ranges strand |        default
         <Rle> <IRanges>  <Rle> |       <matrix>
  [1]    chr22  50967020      * | 0.54:0.55:0.34
  [2]    chr22  50967021      * | 0.47:0.64:0.64
  [3]    chr22  50967022      * |   NA:  NA:  NA
  [4]    chr22  50967023      * | 0.57:0.39:0.46
  [5]    chr22  50967024      * |   NA:0.50:0.49
  [6]    chr22  50967025      * |   NA:0.59:0.59
  -------
  seqinfo: 93 sequences (1 circular) from Genome Reference Consortium GRCh37 genome

The previous call to the gscores() method returns three scores per position corresponding to the M-CAP scores that estimate the tolerance of mutating the reference nucleotide at that position to each of the possible alternative alleles, in alphabetical order. One may directly retrieve the scores for combinations of reference and alternative alelles using the ref and alt arguments of the gscores() method. See the following example with the the previous genomic positions and some randomly selected alternative alleles.

library(BSgenome.Hsapiens.UCSC.hg19)

refAlleles <- as.character(getSeq(Hsapiens, gr))
altAlleles <- DNA_BASES[(match(refAlleles, DNA_BASES)) %% 4 + 1]
cbind(REF=refAlleles, ALT=altAlleles)
     REF ALT
[1,] "C" "G"
[2,] "G" "T"
[3,] "T" "A"
[4,] "C" "G"
[5,] "C" "G"
[6,] "G" "T"
gscores(mcap, gr, ref=refAlleles, alt=altAlleles)
GRanges object with 6 ranges and 1 metadata column:
      seqnames    ranges strand |   default
         <Rle> <IRanges>  <Rle> | <numeric>
  [1]    chr22  50967020      * |      0.55
  [2]    chr22  50967021      * |      0.64
  [3]    chr22  50967022      * |        NA
  [4]    chr22  50967023      * |      0.39
  [5]    chr22  50967024      * |      0.50
  [6]    chr22  50967025      * |      0.59
  -------
  seqinfo: 93 sequences (1 circular) from Genome Reference Consortium GRCh37 genome

7 Summarization of genomic scores

The input genomic ranges to the gscores() method may have widths larger than one nucleotide. In those cases, and when there is only one score per position, the gscores() method calculates, by default, the arithmetic mean of the scores across each range.

gr1 <- GRanges(seqnames="chr22", IRanges(start=50967020:50967025, width=1))
gr1sco <- gscores(phast, gr1)
gr1sco
GRanges object with 6 ranges and 1 metadata column:
      seqnames    ranges strand |   default
         <Rle> <IRanges>  <Rle> | <numeric>
  [1]    chr22  50967020      * |       1.0
  [2]    chr22  50967021      * |       1.0
  [3]    chr22  50967022      * |       0.8
  [4]    chr22  50967023      * |       1.0
  [5]    chr22  50967024      * |       1.0
  [6]    chr22  50967025      * |       0.0
  -------
  seqinfo: 93 sequences (1 circular) from Genome Reference Consortium GRCh37 genome
mean(gr1sco$default)
[1] 0.8
gr2 <- GRanges("chr22:50967020-50967025")
gscores(phast, gr2)
GRanges object with 1 range and 1 metadata column:
      seqnames            ranges strand |   default
         <Rle>         <IRanges>  <Rle> | <numeric>
  [1]    chr22 50967020-50967025      * |       0.8
  -------
  seqinfo: 93 sequences (1 circular) from Genome Reference Consortium GRCh37 genome

However, we may change the way in which scores from multiple-nucleotide ranges are summarized with the argument summaryFun, as follows.

gscores(phast, gr2, summaryFun=max)
GRanges object with 1 range and 1 metadata column:
      seqnames            ranges strand |   default
         <Rle>         <IRanges>  <Rle> | <numeric>
  [1]    chr22 50967020-50967025      * |         1
  -------
  seqinfo: 93 sequences (1 circular) from Genome Reference Consortium GRCh37 genome
gscores(phast, gr2, summaryFun=min)
GRanges object with 1 range and 1 metadata column:
      seqnames            ranges strand |   default
         <Rle>         <IRanges>  <Rle> | <numeric>
  [1]    chr22 50967020-50967025      * |         0
  -------
  seqinfo: 93 sequences (1 circular) from Genome Reference Consortium GRCh37 genome
gscores(phast, gr2, summaryFun=median)
GRanges object with 1 range and 1 metadata column:
      seqnames            ranges strand |   default
         <Rle>         <IRanges>  <Rle> | <numeric>
  [1]    chr22 50967020-50967025      * |         1
  -------
  seqinfo: 93 sequences (1 circular) from Genome Reference Consortium GRCh37 genome

8 Retrieval of quantized genomic scores

The specific quantization and dequantization functions are stored as part of the metadata of a GScores object and they can be examined with the methods qfun() and dqfun(), respectively. The latter is called by the gscores() method to retrieve genomic scores.

phastqfun <- qfun(phast)
phastqfun
function (x) 
{
    q <- as.integer(sprintf("%.0f", 10 * x))
    q <- q + 1L
    q
}
attr(,"description")
[1] "multiply by 10, round to nearest integer, add up one"
phastdqfun <- dqfun(phast)
phastdqfun
function (q) 
{
    x <- as.numeric(q)
    x[x == 0] <- NA
    x <- (x - 1)/10
    x
}
<bytecode: 0x56494c58f788>
attr(,"description")
[1] "subtract one integer unit, divide by 10"

For single-nucleotide ranges, we can retrieve the quantized genomic scores using the argument quantized=TRUE.

gr1sco <- gscores(phast, gr1, quantized=TRUE)
gr1sco
GRanges object with 6 ranges and 1 metadata column:
      seqnames    ranges strand | default
         <Rle> <IRanges>  <Rle> |   <Rle>
  [1]    chr22  50967020      * |      0b
  [2]    chr22  50967021      * |      0b
  [3]    chr22  50967022      * |      09
  [4]    chr22  50967023      * |      0b
  [5]    chr22  50967024      * |      0b
  [6]    chr22  50967025      * |      01
  -------
  seqinfo: 93 sequences (1 circular) from Genome Reference Consortium GRCh37 genome

Using the dequantization function we can obtain later the genomic scores.

phastdqfun(gr1sco$default)
[1] 1.0 1.0 0.8 1.0 1.0 0.0

9 Populations of scores

A single GScores object may store multiple populations of scores of the same kind. One such case is when scores are stored with different precision levels. This is the case for the package phastCons100way.UCSC.hg19, in which phastCons scores are compressed rounding values to 1 and 2 decimal places. To figure out what are the available score populations in a GScores object we should use the method populations():

populations(phast)
[1] "default" "DP2"    

Whenever one of these populations is called default, this is the one used by default. In other cases we can find out which is the default population as follows:

defaultPopulation(phast)
[1] "default"

To use one of the available score populations we should use the argument pop in the corresponding methods, as follows:

gscores(phast, gr1, pop="DP2")
GRanges object with 6 ranges and 1 metadata column:
      seqnames    ranges strand |       DP2
         <Rle> <IRanges>  <Rle> | <numeric>
  [1]    chr22  50967020      * |      0.98
  [2]    chr22  50967021      * |      0.98
  [3]    chr22  50967022      * |      0.83
  [4]    chr22  50967023      * |      0.99
  [5]    chr22  50967024      * |      0.96
  [6]    chr22  50967025      * |      0.00
  -------
  seqinfo: 93 sequences (1 circular) from Genome Reference Consortium GRCh37 genome
qfun(phast, pop="DP2")
function (x) 
{
    q <- as.integer(sprintf("%.0f", 100 * x))
    q <- q + 1L
    q
}
attr(,"description")
[1] "multiply by 100, round to nearest integer, add up one"
dqfun(phast, pop="DP2")

We can also change the default scores population to use, as follows:

defaultPopulation(phast) <- "DP2"
phast
GScores object 
# organism: Homo sapiens (UCSC, hg19)
# provider: UCSC
# provider version: 09Feb2014
# download date: Apr 10, 2018
# loaded sequences: chr19_gl000208_random, chr22
# loaded populations: default, DP2
# default scores population: DP2
# number of sites: 2858 millions
# maximum abs. error (def. pop.): 0.005
# use 'citation()' to cite these data in publications
gscores(phast, gr1)
GRanges object with 6 ranges and 1 metadata column:
      seqnames    ranges strand |       DP2
         <Rle> <IRanges>  <Rle> | <numeric>
  [1]    chr22  50967020      * |      0.98
  [2]    chr22  50967021      * |      0.98
  [3]    chr22  50967022      * |      0.83
  [4]    chr22  50967023      * |      0.99
  [5]    chr22  50967024      * |      0.96
  [6]    chr22  50967025      * |      0.00
  -------
  seqinfo: 93 sequences (1 circular) from Genome Reference Consortium GRCh37 genome
qfun(phast)
function (x) 
{
    q <- as.integer(sprintf("%.0f", 100 * x))
    q <- q + 1L
    q
}
attr(,"description")
[1] "multiply by 100, round to nearest integer, add up one"
dqfun(phast)

10 Retrieval of minor allele frequency data

One particular type of genomic scores that are accessible through the GScores class is minor allele frequency (MAF) data. There are currently 15 annotation packages that store MAF values using the GenomicScores package, named using the prefix MafDb; see Table 2 below.


Table 2: Bioconductor annotation packages storing MAF data.
Annotation Package Description
MafDb.1Kgenomes.phase1.hs37d5 MAF data from the 1000 Genomes Project Phase 1 for the human genome version GRCh37.
MafDb.1Kgenomes.phase1.GRCh38 MAF data from the 1000 Genomes Project Phase 1 for the human genome version GRCh38.
MafDb.1Kgenomes.phase3.hs37d5 MAF data from the 1000 Genomes Project Phase 3 for the human genome version GRCh37.
MafDb.1Kgenomes.phase3.GRCh38 MAF data from the 1000 Genomes Project Phase 3 for the human genome version GRCh38.
MafDb.ExAC.r1.0.hs37d5 MAF data from ExAC 60706 exomes for the human genome version GRCh37.
MafDb.ExAC.r1.0.GRCh38 MAF data from ExAC 60706 exomes for the human genome version GRCh38.
MafDb.ExAC.r1.0.nonTCGA.hs37d5 MAF data from ExAC 53105 nonTCGA exomes for the human genome version GRCh37.
MafDb.ExAC.r1.0.nonTCGA.GRCh38 MAF data from ExAC 53105 nonTCGA exomes for the human genome version GRCh38.
MafDb.gnomAD.r2.1.hs37d5 MAF data from gnomAD 15496 genomes for the human genome version GRCh37.
MafDb.gnomAD.r2.1.GRCh38 MAF data from gnomAD 15496 genomes for the human genome version GRCh38.
MafDb.gnomADex.r2.1.hs37d5 MAF data from gnomADex 123136 exomes for the human genome version GRCh37.
MafDb.gnomADex.r2.1.GRCh38 MAF data from gnomADex 123136 exomes for the human genome version GRCh38.
MafH5.gnomAD.v3.1.2.GRCh38 MAF data from gnomAD 76156 genomes for the human genome version GRCh38.
MafDb.TOPMed.freeze5.hg19 MAF data from NHLBI TOPMed 62784 genomes for the human genome version GRCh37.
MafDb.TOPMed.freeze5.hg38 MAF data from NHLBI TOPMed 62784 genomes for the human genome version GRCh38.

In this context, the scores populations correspond to populations of individuals from which the MAF data were derived, while all MAF data were compressed using a precision of one significant figure for MAF < 0.1 and two significant figures for MAF >= 0.1.

library(MafDb.1Kgenomes.phase1.hs37d5)

mafdb <- MafDb.1Kgenomes.phase1.hs37d5
mafdb
GScores object 
# organism: Homo sapiens (1000genomes, hs37d5)
# provider: IGSR
# provider version: Phase1
# download date: Oct 23, 2019
# loaded sequences (SNRs): 22
# loaded sequences (nonSNRs): none
# default scores population: AF
# number of sites: 40 millions
# maximum abs. error (def. pop.): 0.000141
# use 'citation()' to cite these data in publications

populations(mafdb)
[1] "AF"     "AFR_AF" "AMR_AF" "ASN_AF" "EUR_AF"

Consider the following example in which we are interested in MAF values for variants associated with eye color. We can start by fetching the corresponding variant identifers from the GWAS catalog using the gwascat package, as follows.

library(gwascat)

gwascat <- makeCurrentGwascat()
eyersids <- unique(subsetByTraits(gwascat, tr="Eye color")$SNPS)
eyersids
 [1] "rs1667394"  "rs288139"   "rs4596632"  "rs16891982" "rs1393350" 
 [6] "rs12896399" "rs1847134"  "rs12913832" "rs12203592" "rs3002288" 
[11] "rs12520016" "rs7173419"  "rs1408799"  "rs10809826" "rs1800404" 
[16] "rs4778249"  "rs1426654"  "rs4778219"  "rs1800407" 

We query the genomic position of the variant identifier through the SNPlocs.Hsapiens.dbSNP144.GRCh37 package, which stores the build 144 of the dbSNP database.

library(SNPlocs.Hsapiens.dbSNP144.GRCh37)
rng <- snpsById(SNPlocs.Hsapiens.dbSNP144.GRCh37, ids=eyersids)
rng
UnstitchedGPos object with 19 positions and 2 metadata columns:
       seqnames       pos strand |   RefSNP_id alleles_as_ambig
          <Rle> <integer>  <Rle> | <character>      <character>
   [1]       15  28530182      * |   rs1667394                Y
   [2]        5 107400505      * |    rs288139                R
   [3]        8 132353735      * |   rs4596632                R
   [4]        5  33951693      * |  rs16891982                S
   [5]       11  89011046      * |   rs1393350                R
   ...      ...       ...    ... .         ...              ...
  [15]       15  28235773      * |   rs1800404                Y
  [16]       15  28380518      * |   rs4778249                W
  [17]       15  48426484      * |   rs1426654                R
  [18]       15  28213850      * |   rs4778219                Y
  [19]       15  28230318      * |   rs1800407                Y
  -------
  seqinfo: 25 sequences (1 circular) from GRCh37.p13 genome

Finally, fetch the MAF values on those positions.

eyecolormafs <- gscores(mafdb, rng, pop=c("AF", "EUR_AF", "AFR_AF"))
eyecolormafs
UnstitchedGPos object with 19 positions and 5 metadata columns:
       seqnames       pos strand |   RefSNP_id alleles_as_ambig        AF
          <Rle> <integer>  <Rle> | <character>      <character> <numeric>
   [1]       15  28530182      * |   rs1667394                Y      0.46
   [2]        5 107400505      * |    rs288139                R      0.23
   [3]        8 132353735      * |   rs4596632                R      0.45
   [4]        5  33951693      * |  rs16891982                S      0.44
   [5]       11  89011046      * |   rs1393350                R      0.11
   ...      ...       ...    ... .         ...              ...       ...
  [15]       15  28235773      * |   rs1800404                Y      0.49
  [16]       15  28380518      * |   rs4778249                W      0.17
  [17]       15  48426484      * |   rs1426654                R      0.48
  [18]       15  28213850      * |   rs4778219                Y      0.35
  [19]       15  28230318      * |   rs1800407                Y      0.03
          EUR_AF    AFR_AF
       <numeric> <numeric>
   [1]      0.20      0.09
   [2]      0.23      0.13
   [3]      0.32      0.46
   [4]      0.03      0.05
   [5]      0.24      0.01
   ...       ...       ...
  [15]     0.210      0.12
  [16]     0.020      0.34
  [17]     0.004      0.08
  [18]     0.130      0.19
  [19]     0.080        NA
  -------
  seqinfo: 86 sequences (1 circular) from hs37d5 genome

Sometimes, data producers may annotate identifiers to genomic scores, such as dbSNP rs identifiers to MAF values. In such a case, we can directly attempt to fetch genomic scores using those identifiers as follows.

gscores(mafdb, eyersids, pop=c("AF", "EUR_AF", "AFR_AF"))
GRanges object with 19 ranges and 3 metadata columns:
             seqnames    ranges strand |        AF    EUR_AF    AFR_AF
                <Rle> <IRanges>  <Rle> | <numeric> <numeric> <numeric>
   rs1667394       15  28530182      * |      0.46      0.20      0.09
    rs288139        5 107400505      * |      0.23      0.23      0.13
   rs4596632        8 132353735      * |      0.45      0.32      0.46
  rs16891982        5  33951693      * |      0.44      0.03      0.05
   rs1393350       11  89011046      * |      0.11      0.24      0.01
         ...      ...       ...    ... .       ...       ...       ...
   rs1800404       15  28235773      * |      0.49     0.210      0.12
   rs4778249       15  28380518      * |      0.17     0.020      0.34
   rs1426654       15  48426484      * |      0.48     0.004      0.08
   rs4778219       15  28213850      * |      0.35     0.130      0.19
   rs1800407       15  28230318      * |      0.03     0.080        NA
  -------
  seqinfo: 23 sequences from hs37d5 genome; no seqlengths

The following code produces the barplot shown in Figure 1 illustrating graphically the differences in MAF values from these variants between the three queried populations.

par(mar=c(3, 5, 1, 1))
mafpops <- c("AF", "EUR_AF", "AFR_AF")
bp <- barplot(t(as.matrix(mcols(eyecolormafs)[, mafpops])), beside=TRUE,
              col=c("darkblue", "darkgreen", "darkorange"),
              ylim=c(0, 0.55), las=1, ylab="Minor allele frequency")
axis(1, bp[2, ], 1:length(eyecolormafs))
mtext("Variant", side=1, line=2)
legend("topright", mafpops, fill=c("darkblue", "darkgreen", "darkorange"))
Eye color MAFs. Minor allele frequencies (MAFs) of variants associated with eye color for global, european and african populations of the Phase 1 data from the 1000 Genomes Project.

Figure 1: Eye color MAFs
Minor allele frequencies (MAFs) of variants associated with eye color for global, european and african populations of the Phase 1 data from the 1000 Genomes Project.

11 Annotating variants with genomic scores

A typical use case of the GenomicScores package is in the context of annotating variants with genomic scores, such as phastCons conservation scores. For this purpose, we load the VariantAnnotaiton and TxDb.Hsapiens.UCSC.hg19.knownGene packages. The former will allow us to read a VCF file and annotate it, and the latter contains the gene annotations from UCSC that will be used in this process.

library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

Let’s load one of the sample VCF files that form part of the VariantAnnotation package.

fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl)##, "hg19")
seqlevelsStyle(vcf)
[1] "NCBI"    "Ensembl"
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
seqlevelsStyle(txdb)
[1] "UCSC"

Because the chromosome nomenclature from the VCF file (NCBI) is different from the one with the gene annotations (UCSC) we use the seqlevelsStyle() function to force our variants having the chromosome nomenclature of the gene annotations.

seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)

We annotate the location of variants using the function locateVariants() from the VariantAnnotation package.

loc <- locateVariants(vcf, txdb, AllVariants())
Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 2405 out-of-bound ranges located on sequences
  75253, 74357, 74359, 74360, 74361, 74362, 74363, 74358, 74364, 74365,
  75254, 75259, 74368, 74369, 74366, 74367, 74370, 74372, 74373, 74374,
  74375, 74378, 74377, 74380, 74381, 75262, 75263, 75265, 75266, 75268,
  75269, 75271, 75273, 75276, 75281, 75282, 75283, 74389, 74383, 74384,
  74385, 74386, 74387, 75287, 75288, 75286, 75289, 74390, 74391, 74392,
  74393, 74394, 75291, 74395, 74396, 74397, 74398, 75302, 75304, 75305,
  and 75306. Note that ranges located on a sequence whose length is
  unknown (NA) or on a circular sequence are not considered out-of-bound
  (use seqlengths() and isCircular() to get the lengths and circularity
  flags of the underlying sequences). You can use trim() to trim these
  ranges. See ?`trim,GenomicRanges-method` for more information.
loc[1:3]
GRanges object with 3 ranges and 9 metadata columns:
                  seqnames    ranges strand | LOCATION  LOCSTART    LOCEND
                     <Rle> <IRanges>  <Rle> | <factor> <integer> <integer>
      rs114335781    chr22  50301422      - |   coding       939       939
        rs8135963    chr22  50301476      - |   coding       885       885
  22:50301488_C/T    chr22  50301488      - |   coding       873       873
                    QUERYID        TXID         CDSID      GENEID
                  <integer> <character> <IntegerList> <character>
      rs114335781        24       75253        218562       79087
        rs8135963        25       75253        218562       79087
  22:50301488_C/T        26       75253        218562       79087
                        PRECEDEID        FOLLOWID
                  <CharacterList> <CharacterList>
      rs114335781                                
        rs8135963                                
  22:50301488_C/T                                
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
table(loc$LOCATION)

spliceSite     intron    fiveUTR   threeUTR     coding intergenic   promoter 
        11      23339        309       1368       2822       2867       2864 

Annotate phastCons conservation scores on the variants and store those annotations as an additional metadata column of the GRanges object. For this specific purpose we should use the method score() that returns the genomic scores as a numeric vector instead as a metadata column in the input ranges object.

loc$PHASTCONS <- score(phast, loc)
loc[1:3]
GRanges object with 3 ranges and 10 metadata columns:
                  seqnames    ranges strand | LOCATION  LOCSTART    LOCEND
                     <Rle> <IRanges>  <Rle> | <factor> <integer> <integer>
      rs114335781    chr22  50301422      - |   coding       939       939
        rs8135963    chr22  50301476      - |   coding       885       885
  22:50301488_C/T    chr22  50301488      - |   coding       873       873
                    QUERYID        TXID         CDSID      GENEID
                  <integer> <character> <IntegerList> <character>
      rs114335781        24       75253        218562       79087
        rs8135963        25       75253        218562       79087
  22:50301488_C/T        26       75253        218562       79087
                        PRECEDEID        FOLLOWID PHASTCONS
                  <CharacterList> <CharacterList> <numeric>
      rs114335781                                         1
        rs8135963                                         0
  22:50301488_C/T                                         1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Using the following code we can examine the distribution of phastCons conservation scores of variants across the different annotated regions, shown in Figure 2.

x <- split(loc$PHASTCONS, loc$LOCATION)
mask <- elementNROWS(x) > 0
boxplot(x[mask], ylab="phastCons score", las=1, cex.axis=1.2, cex.lab=1.5, col="gray")
points(1:length(x[mask])+0.25, sapply(x[mask], mean, na.rm=TRUE), pch=23, bg="black")
Distribution of phastCons conservation scores in variants across different annotated regions. Diamonds indicate mean values.

Figure 2: Distribution of phastCons conservation scores in variants across different annotated regions
Diamonds indicate mean values.

Next, we can annotate M-CAP and CADD scores as follows. Note that we need to take care to only query positions of single nucleotide variants, and using the QUERYID column of the annotations to fetch back reference and alternative alleles from the original VCF file container.

maskSNVs <- isSNV(vcf)[loc$QUERYID]
loc$MCAP <- rep(NA_real_, length(loc))
loc$MCAP[maskSNVs] <- score(mcap, loc[maskSNVs],
                            ref=ref(vcf)[loc$QUERYID[maskSNVs]],
                            alt=alt(vcf)[loc$QUERYID[maskSNVs]])
maskSNVs <- isSNV(vcf)[loc$QUERYID]
loc$CADD <- rep(NA_real_, length(loc))
loc$CADD[maskSNVs] <- score(cadd, loc[maskSNVs],
                            ref=ref(vcf)[loc$QUERYID[maskSNVs]],
                            alt=alt(vcf)[loc$QUERYID[maskSNVs]])

Using the code below we can produce the plot of Figure 3 comparing CADD and M-CAP scores and labeling the location of the variants from which they are derived.

library(RColorBrewer)
par(mar=c(4, 5, 1, 1))
hmcol <- colorRampPalette(brewer.pal(nlevels(loc$LOCATION), "Set1"))(nlevels(loc$LOCATION))
plot(jitter(loc$MCAP, factor=2), jitter(loc$CADD, factor=2), pch=19,
     col=hmcol, xlab="M-CAP scores", ylab="CADD scores",
     las=1, cex.axis=1.2, cex.lab=1.5, panel.first=grid())
legend("bottomright", levels(loc$LOCATION), pch=19, col=hmcol, inset=0.01)
Comparison of M-CAP and CADD scores. Values on the x- and y-axis are jittered to facilitate visualization.

Figure 3: Comparison of M-CAP and CADD scores
Values on the x- and y-axis are jittered to facilitate visualization.

Finally, we will show how to annotate MAF values on these variants. However, in this particular case, we should take care of the different sequence styles (UCSC vs NCBI) and genome version nomenclatures (hg19 vs. hs37d5) between the annotated variants and the GScores object.

seqlevelsStyle(loc) <- seqlevelsStyle(mafdb)[1]
seqinfo(loc, new2old=match(seqlevels(mafdb), seqlevels(loc))) <- seqinfo(mafdb)
maskSNVs <- isSNV(vcf)[loc$QUERYID]
loc$MAF[maskSNVs] <- gscores(mafdb, loc[maskSNVs])$AF
loc$MAF[!maskSNVs] <- gscores(mafdb, loc[!maskSNVs], type="nonsnrs")$AF
loc[1:3]
GRanges object with 3 ranges and 13 metadata columns:
                  seqnames    ranges strand | LOCATION  LOCSTART    LOCEND
                     <Rle> <IRanges>  <Rle> | <factor> <integer> <integer>
      rs114335781       22  50301422      - |   coding       939       939
        rs8135963       22  50301476      - |   coding       885       885
  22:50301488_C/T       22  50301488      - |   coding       873       873
                    QUERYID        TXID         CDSID      GENEID
                  <integer> <character> <IntegerList> <character>
      rs114335781        24       75253        218562       79087
        rs8135963        25       75253        218562       79087
  22:50301488_C/T        26       75253        218562       79087
                        PRECEDEID        FOLLOWID PHASTCONS      MCAP      CADD
                  <CharacterList> <CharacterList> <numeric> <numeric> <numeric>
      rs114335781                                         1        NA        10
        rs8135963                                         0        NA         0
  22:50301488_C/T                                         1        NA        10
                        MAF
                  <numeric>
      rs114335781    0.0100
        rs8135963    0.3500
  22:50301488_C/T    0.0005
  -------
  seqinfo: 86 sequences (1 circular) from hs37d5 genome

We also show that by providing reference and alternate alleles through the arguments ref and alt, we can obtain their individual allele frequencies. The current lossy compression of these values yields their correct asignment in the case of biallelic variants and an aproximation in the case of multiallelic ones.

afs <- DataFrame(matrix(NA, nrow=length(loc), ncol=2,
                        dimnames=list(NULL, c("AF_REF", "AF_ALT"))))
afs[maskSNVs, ] <- score(mafdb, loc[maskSNVs],
                         ref=ref(vcf)[loc$QUERYID[maskSNVs]],
                         alt=alt(vcf)[loc$QUERYID[maskSNVs]])
afs[!maskSNVs, ] <- score(mafdb, loc[!maskSNVs],
                          ref=ref(vcf)[loc$QUERYID[!maskSNVs]],
                          alt=alt(vcf)[loc$QUERYID[!maskSNVs]],
                          type="nonsnrs")
mcols(loc) <- cbind(mcols(loc), afs)
loc[1:3]
GRanges object with 3 ranges and 15 metadata columns:
                  seqnames    ranges strand | LOCATION  LOCSTART    LOCEND
                     <Rle> <IRanges>  <Rle> | <factor> <integer> <integer>
      rs114335781       22  50301422      - |   coding       939       939
        rs8135963       22  50301476      - |   coding       885       885
  22:50301488_C/T       22  50301488      - |   coding       873       873
                    QUERYID        TXID         CDSID      GENEID
                  <integer> <character> <IntegerList> <character>
      rs114335781        24       75253        218562       79087
        rs8135963        25       75253        218562       79087
  22:50301488_C/T        26       75253        218562       79087
                        PRECEDEID        FOLLOWID PHASTCONS      MCAP      CADD
                  <CharacterList> <CharacterList> <numeric> <numeric> <numeric>
      rs114335781                                         1        NA        10
        rs8135963                                         0        NA         0
  22:50301488_C/T                                         1        NA        10
                        MAF    AF_REF    AF_ALT
                  <numeric> <numeric> <numeric>
      rs114335781    0.0100    0.9900    0.0100
        rs8135963    0.3500    0.6500    0.3500
  22:50301488_C/T    0.0005    0.9995    0.0005
  -------
  seqinfo: 86 sequences (1 circular) from hs37d5 genome

12 Comparison between lossy-compressed and original phastCons scores

To have a sense of the extent of the trade-off between precision and compression in a specific case, we compare here original phastCons scores with the ones obtained by rounding their precision to one significant figure, and stored in the annotation package phastCons100way.UCSC.hg19. Because phastCons scores measure conservation, we sampled uniformly at random one thousand phastCons scores from differently conserved regions, concretely CDS and 3’UTR. These sampled scores are included in this package to illustrate this comparison. Interestingly, among the phastCons scores sampled from 1000 CDS positions, there are only 198 different values despite the apparently very high precision of some of them.

origpcscoCDS <- readRDS(system.file("extdata", "origphastCons100wayhg19CDS.rds", package="GenomicScores"))
origpcscoCDS
GRanges object with 1000 ranges and 1 metadata column:
         seqnames    ranges strand |     score
            <Rle> <IRanges>  <Rle> | <numeric>
     [1]     chr1    976248      * |     0.764
     [2]     chr1   1886625      * |     0.003
     [3]     chr1   3751636      * |     1.000
     [4]     chr1   5950945      * |     0.236
     [5]     chr1   6638784      * |     0.991
     ...      ...       ...    ... .       ...
   [996]     chrX 154113587      * |     1.000
   [997]     chrX 154261706      * |     1.000
   [998]     chrX 154305499      * |     1.000
   [999]     chrY  15481226      * |     1.000
  [1000]     chrY  21871402      * |     0.981
  -------
  seqinfo: 25 sequences from an unspecified genome

length(unique(origpcscoCDS$score))
[1] 198

We look more closely the number of significant figures of precision used in these original phastCons scores.

numDecimals <- function(x) {
  spl <- strsplit(as.character(x+1), "\\.")
  spl <- sapply(spl, "[", 2)
  spl[is.na(spl)] <- ""
  nchar(spl)
}

nd1 <- numDecimals(origpcscoCDS$score)
table(nd1)
nd1
  0   1   2  12  13  14 
637   1   2   5  63 292 

Similarly, in 3’UTR regions, only 209 unique phastCons scores are observed.

origpcsco3UTRs <- readRDS(system.file("extdata", "origphastCons100wayhg193UTR.rds", package="GenomicScores"))
origpcsco3UTRs
GRanges object with 1000 ranges and 1 metadata column:
         seqnames    ranges strand |     score
            <Rle> <IRanges>  <Rle> | <numeric>
     [1]     chr1   1189930      * |     0.000
     [2]     chr1   1227810      * |     0.000
     [3]     chr1   1595390      * |     0.015
     [4]     chr1   1595685      * |     0.000
     [5]     chr1   2336044      * |     0.000
     ...      ...       ...    ... .       ...
   [996]     chrX 125684491      * |     0.000
   [997]     chrX 129190576      * |     1.000
   [998]     chrX 135299066      * |     0.197
   [999]     chrX 148560948      * |     0.003
  [1000]     chrX 153289202      * |     0.000
  -------
  seqinfo: 25 sequences from an unspecified genome

length(table(origpcsco3UTRs$score))
[1] 209

nd2 <- numDecimals(origpcsco3UTRs$score)
table(nd2)
nd2
  0  12  13  14 
502   3 112 383 

Reset the default scores population to the one that achieves more compression by rounding to 1 decimal place.

defaultPopulation(phast) <- "default"
phast
GScores object 
# organism: Homo sapiens (UCSC, hg19)
# provider: UCSC
# provider version: 09Feb2014
# download date: Apr 10, 2018
# loaded sequences: chr19_gl000208_random, chr22
# loaded populations: default, DP2
# number of sites: 2858 millions
# maximum abs. error: 0.05
# use 'citation()' to cite these data in publications

Retrieve the corresponding phastCons scores stored in the annotation package.

pkgpcscoCDS <- score(phast, origpcscoCDS)
pkgpcsco3UTRs <- score(phast, origpcsco3UTRs)

In Figure 4 we show a visual comparison between raw and rounded phastCons scores. The two panels on top compare the whole range of scores observed in CDS (left) and 3’UTR (right) regions. However, the rounding effect can be better observed in the cumulative distributions shown in the panels at the bottom, again for CDS (left) and 3’UTR (right) regions.

In these bottom panels, phastcons scores in CDS and 3’UTR regions display very different cumulative distributions. In CDS regions, most of the genomic scores (>60%) are found between the values of 0.9 and 1.0, while around 25% of the scores are found below 0.1. Indeed, these are the range of values where lossy compression loses more precison. The cumulative distribution of 3’UTR shows the same critical points, with the difference that most of scores are found below 0.1 (>70%).

Original and lossy-compressed phastCons scores. Top panels (a, b): comparison of the distribution of values. Bottom panels (c, d): comparison of the cumulative distribution

Figure 4: Original and lossy-compressed phastCons scores
Top panels (a, b): comparison of the distribution of values. Bottom panels (c, d): comparison of the cumulative distribution

The bottom plots in Figure 4 also reveal that when the cumulative distribution is of interest, such as in the context of filtering genetic variants above or below certain threshold of conservation, the quantization of phastCons scores to 1 decimal place provides sufficient precision for a wide range of conservation values.

13 Session information

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] RColorBrewer_1.1-3                      
 [2] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 
 [3] GenomicFeatures_1.52.2                  
 [4] AnnotationDbi_1.62.2                    
 [5] VariantAnnotation_1.46.0                
 [6] Rsamtools_2.16.0                        
 [7] SummarizedExperiment_1.30.2             
 [8] Biobase_2.60.0                          
 [9] MatrixGenerics_1.12.3                   
[10] matrixStats_1.0.0                       
[11] SNPlocs.Hsapiens.dbSNP144.GRCh37_0.99.20
[12] MafDb.1Kgenomes.phase1.hs37d5_3.10.0    
[13] BSgenome.Hsapiens.UCSC.hg19_1.4.3       
[14] BSgenome_1.68.0                         
[15] rtracklayer_1.60.1                      
[16] Biostrings_2.68.1                       
[17] XVector_0.40.0                          
[18] phastCons100way.UCSC.hg19_3.7.2         
[19] GenomicScores_2.12.1                    
[20] GenomicRanges_1.52.0                    
[21] GenomeInfoDb_1.36.4                     
[22] IRanges_2.34.1                          
[23] S4Vectors_0.38.2                        
[24] BiocGenerics_0.46.0                     
[25] knitr_1.44                              
[26] BiocStyle_2.28.1                        

loaded via a namespace (and not attached):
 [1] DBI_1.1.3                     bitops_1.0-7                 
 [3] biomaRt_2.56.1                rlang_1.1.1                  
 [5] magrittr_2.0.3                compiler_4.3.1               
 [7] RSQLite_2.3.1                 png_0.1-8                    
 [9] vctrs_0.6.3                   stringr_1.5.0                
[11] pkgconfig_2.0.3               crayon_1.5.2                 
[13] fastmap_1.1.1                 dbplyr_2.3.4                 
[15] magick_2.8.0                  ellipsis_0.3.2               
[17] utf8_1.2.3                    promises_1.2.1               
[19] rmarkdown_2.25                bit_4.0.5                    
[21] xfun_0.40                     zlibbioc_1.46.0              
[23] cachem_1.0.8                  jsonlite_1.8.7               
[25] progress_1.2.2                blob_1.2.4                   
[27] later_1.3.1                   rhdf5filters_1.12.1          
[29] DelayedArray_0.26.7           Rhdf5lib_1.22.1              
[31] BiocParallel_1.34.2           interactiveDisplayBase_1.38.0
[33] parallel_4.3.1                prettyunits_1.2.0            
[35] R6_2.5.1                      bslib_0.5.1                  
[37] stringi_1.7.12                jquerylib_0.1.4              
[39] Rcpp_1.0.11                   bookdown_0.35                
[41] httpuv_1.6.11                 Matrix_1.6-1.1               
[43] tidyselect_1.2.0              abind_1.4-5                  
[45] yaml_2.3.7                    codetools_0.2-19             
[47] curl_5.1.0                    lattice_0.21-9               
[49] tibble_3.2.1                  shiny_1.7.5                  
[51] KEGGREST_1.40.1               evaluate_0.22                
[53] BiocFileCache_2.8.0           xml2_1.3.5                   
[55] pillar_1.9.0                  BiocManager_1.30.22          
[57] filelock_1.0.2                generics_0.1.3               
[59] RCurl_1.98-1.12               BiocVersion_3.17.1           
[61] hms_1.1.3                     xtable_1.8-4                 
[63] glue_1.6.2                    tools_4.3.1                  
[65] AnnotationHub_3.8.0           BiocIO_1.10.0                
[67] GenomicAlignments_1.36.0      XML_3.99-0.14                
[69] rhdf5_2.44.0                  grid_4.3.1                   
[71] GenomeInfoDbData_1.2.10       HDF5Array_1.28.1             
[73] restfulr_0.0.15               cli_3.6.1                    
[75] rappdirs_0.3.3                fansi_1.0.4                  
[77] S4Arrays_1.0.6                dplyr_1.1.3                  
[79] sass_0.4.7                    digest_0.6.33                
[81] rjson_0.2.21                  memoise_2.0.1                
[83] htmltools_0.5.6               lifecycle_1.0.3              
[85] httr_1.4.7                    mime_0.12                    
[87] bit64_4.0.5                  

References

Jagadeesh, Karthik A, Aaron M Wenger, Mark J Berger, Harendra Guturu, Peter D Stenson, David N Cooper, Jonathan A Bernstein, and Gill Bejerano. 2016. “M-Cap Eliminates a Majority of Variants of Uncertain Significance in Clinical Exomes at High Sensitivity.” Nat. Genet. 48 (12): 1581–6.

Kircher, Martin, Daniela M Witten, Preti Jain, Brian J O’roak, Gregory M Cooper, and Jay Shendure. 2014. “A General Framework for Estimating the Relative Pathogenicity of Human Genetic Variants.” Nat. Genet. 46 (3): 310–15.

Siepel, Adam, Gill Bejerano, Jakob S Pedersen, Angie S Hinrichs, Minmei Hou, Kate Rosenbloom, Hiram Clawson, et al. 2005. “Evolutionarily Conserved Elements in Vertebrate, Insect, Worm, and Yeast Genomes.” Genome Res. 15 (8): 1034–50.

Zender, Charles S. 2016. “Bit Grooming: Statistically Accurate Precision-Preserving Quantization with Compression, Evaluated in the netCDF Operators (Nco, V4. 4.8+).” Geosci. Model Dev. 9 (9): 3199–3211.