It is well recognized that cis-eQTL searches with dense genotyping yield billions of test results. While many are consistent with no association, it is hard to draw an objective threshold, and targeted analysis may reveal signals of interest that do not deserve penalization for genome-wide search.
We recently performed a comprehensive cis-eQTL search with the GEUVADIS FPKM expression measures. The most prevalent transcript types in this dataset are
##
##       protein_coding           pseudogene            antisense
##                15280                 4853                 1450
##              lincRNA processed_transcript            IG_V_gene
##                 1153                  476                  114
Cis-associated variation in abundance of these entities was assessed using 20 million 1000 Genomes genotypes, with a radius of 1 million bases around each transcribed region. There are 185 million SNP-transcript pairs in this analysis. This package (gQTLBase) aims to simplify interactive interrogation of this resource.
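To make the notion of a SNP-transcript pair concrete, the following base R sketch counts the pairs in which a SNP lies within a fixed radius of a transcribed region. The coordinates and the helper name countCisPairs are invented for illustration; the real analysis works from VCF genotypes and annotated transcript ranges.

```r
# Toy cis-pair counter: a SNP is paired with a transcript when it lies within
# 'radius' bases of the transcribed region [start, end] on the same chromosome.
countCisPairs = function(snps, genes, radius = 1000000) {
  n = 0
  for (i in seq_len(nrow(genes))) {
    lo = genes$start[i] - radius
    hi = genes$end[i] + radius
    n = n + sum(snps$chr == genes$chr[i] & snps$pos >= lo & snps$pos <= hi)
  }
  n
}

# hypothetical coordinates, not taken from GEUVADIS
snps = data.frame(chr = c("1", "1", "1", "2"),
                  pos = c(500000, 1600000, 4000000, 500000),
                  stringsAsFactors = FALSE)
genes = data.frame(chr = c("1", "1"),
                   start = c(1400000, 2500000),
                   end = c(1450000, 2600000),
                   stringsAsFactors = FALSE)
npairs = countCisPairs(snps, genes)
```

With a radius of 1 million bases, the SNP at 1600000 is within reach of both toy transcripts, so it contributes two pairs.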
The following function takes as argument ‘chunk’ a list with elements chr (character token for indexing chromosomes in genotype data in VCF) and genes (vector of gene identifiers). It implicitly uses a TabixFile reference to acquire genotypes on the samples managed in the geuvPack package.
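gettests = function( chunk, useS3=FALSE ) {
  # chunk: list with elements chr (chromosome token) and genes (gene identifiers)
  library(VariantAnnotation)
  snpsp = gtpath( chunk$chr, useS3=useS3 )
  tf = TabixFile( snpsp )
  library(geuvPack)
  if (!exists("geuFPKM")) data(geuFPKM)
  # regress out population code, then clip principal components 1:10
  clipped = clipPCs(regressOut(geuFPKM, ~popcode), 1:10)
  set.seed(54321)  # fix the seed for reproducibility
  ans = cisAssoc( clipped[ chunk$genes, ], tf, cisradius=1000000, lbmaf=0.01 )
  # record how the expression data were prepared
  metadata(ans)$prepString = "clipPCs(regressOut(geuFPKM, ~popcode), 1:10)"
  ans
}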
cisAssoc returns a GRanges instance with fields relevant to computing FDR for cis association.
A BatchJobs registry is created as follows:
flatReg = makeRegistry("flatReg", file.dir="flatStore",
    seed=123, packages=c("GenomicRanges",
    "GGtools", "VariantAnnotation", "Rsamtools",
    "geuvPack", "GenomeInfoDb"))
For any list ‘flatlist’ of pairs (chr, genes), the following code asks the scheduler to run gettests on every element, when it can. Using the Channing cumulus cloud, the job ran on 40 hosts at a cost of 170 USD.
This creates a ‘sharded’ archive of 7GB of results managed by a Registry object.
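A list of (chr, genes) pairs of the kind described above can be assembled by grouping gene identifiers by chromosome and cutting each group into blocks. This is a minimal base R sketch, not the code used for the actual job: the helper makeChunks, the identifiers, the chromosome tokens and the block size are all assumptions made for illustration.

```r
# Build a list of (chr, genes) pairs: genes are grouped by chromosome and each
# group is cut into blocks of at most 'blocksize' identifiers.
makeChunks = function(geneids, chrs, blocksize = 2) {
  bychr = split(geneids, chrs)
  out = list()
  for (chr in names(bychr)) {
    ids = bychr[[chr]]
    blocks = split(ids, ceiling(seq_along(ids) / blocksize))
    for (b in blocks) out[[length(out) + 1]] = list(chr = chr, genes = b)
  }
  out
}

# hypothetical identifiers and chromosome assignments
geneids = c("g1", "g2", "g3", "g4", "g5")
chrs = c("chr1", "chr1", "chr1", "chr2", "chr2")
flatlist = makeChunks(geneids, chrs)
```

Each element of the resulting list has the shape that gettests expects for its ‘chunk’ argument. Smaller blocks give more, shorter jobs for the scheduler to distribute.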
We have extracted 3 shards from the job for illustration with the gQTLBase package.
## ciseStore instance with 160 completed jobs.
## excerpt from job 1 :
## GRanges object with 1 range and 14 metadata columns:
##       seqnames    ranges strand |      paramRangeID            REF
##          <Rle> <IRanges>  <Rle> |          <factor> <DNAStringSet>
##   [1]        1    526736      * | ENSG00000215915.5              C
##                   ALT     chisq permScore_1 permScore_2 permScore_3 permScore_4
##       <CharacterList> <numeric>   <numeric>   <numeric>   <numeric>   <numeric>
##   [1]               G   2.46383     3.14567    0.409225    0.157174   0.0298147
##       permScore_5 permScore_6         snp       MAF           probeid   mindist
##         <numeric>   <numeric> <character> <numeric>       <character> <numeric>
##   [1]    0.164809   0.0123114  rs28863004 0.0910112 ENSG00000215915.5    858333
##   -------
##   seqinfo: 86 sequences from hg19 genome
mm here is an instance of the ciseStore class. This is a BatchJobs Registry wrapped with additional information concerning the map from identifiers or ranges to jobs in the registry.
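The idea of a map from identifiers to jobs can be sketched in base R as follows. This is a toy model under stated assumptions, not the ciseStore implementation: the list jobs, the two-column probemap, the helper lookupByProbes and all values are invented for illustration.

```r
# Toy 'store': one data frame of test results per job (all values invented)
jobs = list(
  "1" = data.frame(probeid = c("geneA", "geneA", "geneB"),
                   snp = c("rs1", "rs2", "rs3"),
                   chisq = c(2.5, 0.9, 4.1), stringsAsFactors = FALSE),
  "2" = data.frame(probeid = c("geneC", "geneC"),
                   snp = c("rs4", "rs5"),
                   chisq = c(0.1, 7.3), stringsAsFactors = FALSE)
)

# Map from probe identifier to the job holding its results
probemap = data.frame(probeid = c("geneA", "geneB", "geneC"),
                      jobid = c(1, 1, 2), stringsAsFactors = FALSE)

# Look up the jobs needed for the requested probes, visit only those jobs,
# and keep the rows for the requested probes.
lookupByProbes = function(jobs, probemap, probes) {
  needed = unique(probemap$jobid[probemap$probeid %in% probes])
  hits = lapply(needed, function(j) {
    res = jobs[[as.character(j)]]
    res = res[res$probeid %in% probes, , drop = FALSE]
    res$jobid = rep(j, nrow(res))
    res
  })
  do.call(rbind, hits)
}

ex = lookupByProbes(jobs, probemap, c("geneA", "geneC"))
```

The point of the map is that a query touches only the jobs that can contain matching results, rather than scanning the whole archive.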
There are various approaches available to get results out of the store. At present we don’t want a full API for result-level operations, so we work with BatchJobs directly:
## GRanges object with 3 ranges and 14 metadata columns:
##       seqnames    ranges strand |      paramRangeID            REF
##          <Rle> <IRanges>  <Rle> |          <factor> <DNAStringSet>
##   [1]        1    526736      * | ENSG00000215915.5              C
##   [2]        1    526840      * | ENSG00000215915.5              T
##   [3]        1    529782      * | ENSG00000215915.5              C
##                   ALT     chisq permScore_1 permScore_2 permScore_3 permScore_4
##       <CharacterList> <numeric>   <numeric>   <numeric>   <numeric>   <numeric>
##   [1]               G  2.463829     3.14567    0.409225 1.57174e-01   0.0298147
##   [2]               C  0.976104     1.78520    0.364931 4.12659e-07   0.1818076
##   [3]               T  0.120734     3.39460    0.930155 4.76263e-01   0.4957439
##       permScore_5 permScore_6         snp       MAF           probeid   mindist
##         <numeric>   <numeric> <character> <numeric>       <character> <numeric>
##   [1]   0.1648088   0.0123114  rs28863004 0.0910112 ENSG00000215915.5    858333
##   [2]   0.1089619   1.2933784  rs60396226 0.1033708 ENSG00000215915.5    858229
##   [3]   0.0165574   1.3853998 rs144425991 0.0719101 ENSG00000215915.5    855287
##   -------
##   seqinfo: 86 sequences from hg19 genome
On a multicore machine or cluster, we can visit job results in parallel. The storeApply function uses BatchJobs reduceResultsList to transform job results by a user-supplied function. The reduction events occur in parallel through BiocParallel bplapply over a set of job id chunks whose character can be controlled through the n.chunks parameter. We’ll illustrate by taking the length of each result.
library(BiocParallel)
library(parallel)
mp = MulticoreParam(workers=max(c(1, detectCores()-4)))
register(mp)
## Warning: executing %dopar% sequentially: no parallel backend registered
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   19852   32987   37346   38645   42701  131671
It is possible to limit the scope of application by setting the ids parameter in storeApply.
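The chunked traversal of job results, including restriction to a subset of job ids, can be pictured with a rough base R sketch. It does not reproduce storeApply: chunkedApply, loadOne and fakeResults are hypothetical stand-ins, and plain lapply is used at the step where a parallel backend would iterate over chunks.

```r
# Apply 'f' to each job result, visiting job ids in 'n.chunks' groups.
# 'loadOne' stands in for whatever retrieves a single job result.
chunkedApply = function(ids, loadOne, f, n.chunks = 2) {
  # assign consecutive ids to groups of roughly equal size
  grp = ceiling(seq_along(ids) * n.chunks / length(ids))
  chunks = split(ids, grp)
  # the outer lapply is the step a parallel backend would take over
  perChunk = lapply(chunks, function(ch) lapply(ch, function(i) f(loadOne(i))))
  # flatten one level so there is one transformed result per job id
  unlist(perChunk, recursive = FALSE, use.names = FALSE)
}

# hypothetical results: job i holds a vector of 10 * i scores
fakeResults = lapply(1:5, function(i) seq_len(10 * i))
loadOne = function(i) fakeResults[[i]]

lens = chunkedApply(1:5, loadOne, length, n.chunks = 2)
some = chunkedApply(c(2, 4), loadOne, length, n.chunks = 1)
```

Passing a shorter vector of ids, as in the second call, limits the work to those jobs. The number of chunks only changes how the ids are grouped, not the values returned.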
For a known GEUVADIS Ensembl identifier (or vector thereof) we can acquire all cis association test results as follows.
pvec = mm@probemap[1:4,1] # don't want API for map, just getting examples
litex = extractByProbes( mm, pvec )
length(litex)
## [1] 33107
## GRanges object with 3 ranges and 15 metadata columns:
##       seqnames    ranges strand |      paramRangeID            REF
##          <Rle> <IRanges>  <Rle> |          <factor> <DNAStringSet>
##   [1]        1    526736      * | ENSG00000215915.5              C
##   [2]        1    526840      * | ENSG00000215915.5              T
##   [3]        1    529782      * | ENSG00000215915.5              C
##                   ALT     chisq permScore_1 permScore_2 permScore_3 permScore_4
##       <CharacterList> <numeric>   <numeric>   <numeric>   <numeric>   <numeric>
##   [1]               G  2.463829     3.14567    0.409225 1.57174e-01   0.0298147
##   [2]               C  0.976104     1.78520    0.364931 4.12659e-07   0.1818076
##   [3]               T  0.120734     3.39460    0.930155 4.76263e-01   0.4957439
##       permScore_5 permScore_6         snp       MAF           probeid   mindist
##         <numeric>   <numeric> <character> <numeric>       <character> <numeric>
##   [1]   0.1648088   0.0123114  rs28863004 0.0910112 ENSG00000215915.5    858333
##   [2]   0.1089619   1.2933784  rs60396226 0.1033708 ENSG00000215915.5    858229
##   [3]   0.0165574   1.3853998 rs144425991 0.0719101 ENSG00000215915.5    855287
##           jobid
##       <integer>
##   [1]         1
##   [2]         1
##   [3]         1
##   -------
##   seqinfo: 86 sequences from hg19 genome
We also have extractByRanges.
In the gQTLstats package, we use the plug-in FDR algorithm of Hastie, Tibshirani and Friedman, Elements of Statistical Learning, section 18.7, Algorithm 18.3. We do not handle hundreds of millions of scores directly in a holistic way, except for the estimation of quantiles of the observed association scores. This particular step is carried out using the ff and ffbase packages. We illustrate with our subset of GEUVADIS scores.
## [1] 6183186
## 3312 bytes
## [1] 2.4638290 0.9761038 0.1207337 1.5093189
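The plug-in FDR idea mentioned above can be written down in a few lines of base R. This is a minimal sketch in the spirit of Algorithm 18.3, not the gQTLstats implementation: the function name pluginFDR, the argument layout (one column of permuted scores per permutation) and all score values are assumptions made for illustration.

```r
# Plug-in FDR at threshold 'thresh' (after ESL Algorithm 18.3):
#   R = number of observed scores above the threshold
#   V = average, over K permutations, of the number of permuted scores above it
#   FDR estimate = V / R
pluginFDR = function(obs, perm, thresh) {
  R = sum(obs > thresh)
  V = mean(colSums(perm > thresh))
  if (R == 0) return(NA_real_)  # no rejections, estimate undefined
  V / R
}

# invented scores: 8 tests, K = 2 permutations (one column per permutation)
obs = c(12.1, 9.4, 7.7, 3.2, 2.5, 1.1, 0.6, 0.2)
perm = cbind(c(4.0, 2.2, 1.9, 1.3, 0.9, 0.5, 0.3, 0.1),
             c(3.1, 2.8, 1.0, 0.8, 0.7, 0.4, 0.2, 0.1))
fdr3 = pluginFDR(obs, perm, thresh = 3)
fdr7 = pluginFDR(obs, perm, thresh = 7)
```

At threshold 3, four observed scores exceed the cutoff and each permutation contributes one exceedance, so the estimate is 1/4. Evaluating the function over a grid of thresholds, such as quantiles of the observed scores, traces out how the estimated FDR varies with the cutoff.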
Refer to the gQTLstats package for additional functions that generate quantile estimates, histograms, and FDR estimates based on ciseStore contents and various filtrations thereof.