Introduction

It is well-recognized that cis-eQTL searches with dense genotyping yields billions of test results. While many are consistent with no association, it is hard to draw an objective threshold, and targeted analysis may reveal signals of interest that do not deserve penalization for genome-wide search.

We recently performed a comprehensive cis-eQTL search with the GEUVADIS FPKM expression measures. The most prevalent transcript types in this dataset are

## 
##       protein_coding           pseudogene            antisense 
##                15280                 4853                 1450 
##              lincRNA processed_transcript            IG_V_gene 
##                 1153                  476                  114

cis-associated variation in abundance of these entities was assessed using 20 million 1000 genomes genotypes with radius 1 million bases around each transcribed region. There are 185 million SNP-transcript pairs in this analysis. This package (gQTLBase) aims to simplify interactive interrogation of this resource.

Brief view of how the tests were done

The following function takes as argument ‘chunk’ a list with elements chr (character token for indexing chromosomes in genotype data in VCF) and genes (vector of gene identifiers). It implicitly uses a TabixFile reference to acquire genotypes on the samples managed in the geuvPack package.

gettests = function( chunk, useS3=FALSE ) {
  library(VariantAnnotation)
  snpsp = gtpath( chunk$chr, useS3=useS3)
  tf = TabixFile( snpsp )
  library(geuvPack)
  if (!exists("geuFPKM")) data(geuFPKM)
  clipped = clipPCs(regressOut(geuFPKM, ~popcode), 1:10)
  set.seed(54321)
  ans = cisAssoc( clipped[ chunk$genes, ], tf, cisradius=1000000, lbmaf=0.01 )
  metadata(ans)$prepString = "clipPCs(regressOut(geuFPKM, ~popcode), 1:10)"
  ans
  }

cisAssoc returns a GRanges instance with fields relevant to computing FDR for cis association.

A BatchJobs registry is created as follows:

flatReg = makeRegistry("flatReg",  file.dir="flatStore",
        seed=123, packages=c("GenomicRanges",
            "GGtools", "VariantAnnotation", "Rsamtools",
            "geuvPack", "GenomeInfoDb"))

For any list ‘flatlist’ of pairs (chr, genes), the following code asks the scheduler to run gettests on every element, when it can. Using the Channing cumulus cloud, the job ran on 40 hosts at a cost of 170 USD.

batchMap(flatReg, gettests, flatlist)
submitJobs(flatReg)

This creates a ‘sharded’ archive of 7GB of results managed by a Registry object.

Illustration on a subset

We have extracted 3 shards from the job for illustration with the gQTLBase package.

library(gQTLBase)
library(geuvStore)
mm = partialRegistry()
## NOTE: ids of 92 jobs available for this registry can be obtained with partialIds()
mm
## Job registry:  flatReg 
##   Number of jobs:  2283 
##   Files dir: /home/biocbuild/bbs-3.1-bioc/R/library/geuvStore/geuvStore 
##   Work dir: /tmp/RtmpkuCh9w/Rbuild443452f41095/gQTLBase/vignettes 
##   Multiple result files: FALSE 
##   Seed: 123 
##   Required packages: BatchJobs, GenomeInfoDb, GenomicRanges, geuvPack, GGtools, Rsamtools, VariantAnnotation

mm here denotes a BatchJobs Registry. We want to wrap it lightly, adding maps from probes and addresses of SNPs to job identifiers where these probes and SNPs have been analyzed.

mm = ciseStore(mm, addProbeMap = TRUE, addRangeMap = TRUE)
## building probe:job map...
## Warning: executing %dopar% sequentially: no parallel backend registered
## done.
## building range:job map...
## done.
mm
## ciseStore instance with 92 completed jobs.
## excerpt from job  1 :
## GRanges object with 1 range and 11 metadata columns:
##       seqnames                 ranges strand |       paramRangeID
##          <Rle>              <IRanges>  <Rle> |           <factor>
##   [1]        1 [225418903, 225418903]      * | ENSG00000183814.10
##                  REF             ALT     chisq permScore_1 permScore_2
##       <DNAStringSet> <CharacterList> <numeric>   <numeric>   <numeric>
##   [1]              G               A 0.2972831   0.1150969    8.312895
##       permScore_3             snp        MAF            probeid   mindist
##         <numeric>     <character>  <numeric>        <character> <numeric>
##   [1]   0.1021324 snp_1_225418903 0.03246753 ENSG00000183814.10    999947
##   -------
##   seqinfo: 1 sequence from hg19 genome; no seqlengths

There are various approaches available to get results out of the store. At present we don’t want a full API for result-level operations, so work from BatchJobs directly:

loadResult(mm@reg, 1)[1:3]
## GRanges object with 3 ranges and 11 metadata columns:
##       seqnames                 ranges strand |       paramRangeID
##          <Rle>              <IRanges>  <Rle> |           <factor>
##   [1]        1 [225418903, 225418903]      * | ENSG00000183814.10
##   [2]        1 [225419456, 225419456]      * | ENSG00000183814.10
##   [3]        1 [225419667, 225419667]      * | ENSG00000183814.10
##                  REF             ALT       chisq permScore_1 permScore_2
##       <DNAStringSet> <CharacterList>   <numeric>   <numeric>   <numeric>
##   [1]              G               A 0.297283124   0.1150969   8.3128951
##   [2]              T               C 0.001771913   2.4850662   0.2125758
##   [3]              G               C 0.055747393   3.4358410   0.6570552
##       permScore_3             snp        MAF            probeid   mindist
##         <numeric>     <character>  <numeric>        <character> <numeric>
##   [1]   0.1021324 snp_1_225418903 0.03246753 ENSG00000183814.10    999947
##   [2]   1.1020702 snp_1_225419456 0.28787879 ENSG00000183814.10    999394
##   [3]   0.7084834 snp_1_225419667 0.17532468 ENSG00000183814.10    999183
##   -------
##   seqinfo: 1 sequence from hg19 genome; no seqlengths

On a multicore machine or cluster, we can visit job results in parallel. The storeApply function uses BatchJobs reduceResultsList to transform job results by a user-supplied function. The reduction events occur in parallel through BiocParallel bplapply over a set of job id chunks whose character can be controlled through the n.chunks parameter.

We’ll illustrate by taking the length of each result.

library(BiocParallel)
library(parallel)
mp = MulticoreParam(workers=max(c(1, detectCores()-4)))
register(mp)
lens = storeApply(mm, length)
summary(unlist(lens))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   61950   74810   80110   82170   86330  168100

It is possible to limit the scope of application by setting the ids parameter in storeApply.

Interrogating by probe

For a known GEUVADIS Ensembl identifier (or vector thereof) we can acquire all cis association test results as follows.

pvec = mm@probemap[1:4,1]  # don't want API for map, just getting examples
litex = extractByProbes( mm, pvec )
length(litex)
## [1] 31787
litex[1:3]
## GRanges object with 3 ranges and 12 metadata columns:
##       seqnames                 ranges strand |       paramRangeID
##          <Rle>              <IRanges>  <Rle> |           <factor>
##   [1]        1 [225418903, 225418903]      * | ENSG00000183814.10
##   [2]        1 [225419456, 225419456]      * | ENSG00000183814.10
##   [3]        1 [225419667, 225419667]      * | ENSG00000183814.10
##                  REF             ALT       chisq permScore_1 permScore_2
##       <DNAStringSet> <CharacterList>   <numeric>   <numeric>   <numeric>
##   [1]              G               A 0.297283124   0.1150969   8.3128951
##   [2]              T               C 0.001771913   2.4850662   0.2125758
##   [3]              G               C 0.055747393   3.4358410   0.6570552
##       permScore_3             snp        MAF            probeid   mindist
##         <numeric>     <character>  <numeric>        <character> <numeric>
##   [1]   0.1021324 snp_1_225418903 0.03246753 ENSG00000183814.10    999947
##   [2]   1.1020702 snp_1_225419456 0.28787879 ENSG00000183814.10    999394
##   [3]   0.7084834 snp_1_225419667 0.17532468 ENSG00000183814.10    999183
##           jobid
##       <numeric>
##   [1]         1
##   [2]         1
##   [3]         1
##   -------
##   seqinfo: 1 sequence from hg19 genome; no seqlengths

We also have extractByRanges.

Towards estimation of distributions relevant to FDR computation

Small-footprint collection of association statistics

In the gQTLstats package, we will use the plug-in FDR algorithm of Hastie, Tibshirani and Friedman Elements of Statistical Learning ch. 18.7, algorithm 18.3. We will not handle hundreds of millions of scores directly in a holistic way, except for the estimation of quantiles of the observed association scores. This particular step is carried out using ff and ffbase packages. We illustrate with our subset of GEUVADIS scores.

allassoc = storeToFf(mm, "chisq")
length(allassoc)
## [1] 7559723
object.size(allassoc)
## [1] 3112
allassoc[1:4]
## [1] 0.297283124 0.001771913 0.055747393 0.028064071

Other statistical functions

Refer to the gQTLstats package for additional functions that generate quantile estimates, histograms, and FDR estimates based on ciseStore contents and various filtrations thereof.