The software in this package aims to support refinements and functional interpretation of members of a collection of association statistics on a family of feature \(\times\) genome hypotheses. provide a basis for refinement or functional interpretation.
We take for granted the use of the gQTL* infrastructure for testing and management of test results. We use for examples elements of the geuvPack and geuvStore packages.
We work with a ciseStore
instance based on a small subset of transcriptome-wide cis-eQTL tests for GEUVADIS FPKM data. The overall testing procedure was conducted for all SNP:probe pairs for which SNP minor allele frequency (MAF) is at least 1% and for which the minimum distance between SNP and either boundary of the gene coding region for the probe is at most 1 million bp.
library(geuvStore)
library(gQTLBase)
library(gQTLstats)
library(parallel)
nco = detectCores()
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
registerDoSEQ()
if (.Platform$OS.type != "windows") {
registerDoParallel(cores=max(c(1, floor(nco/2))))
}
prst = ciseStore(partialRegistry(), FALSE, FALSE)
## NOTE: ids of 92 jobs available for this registry can be obtained with partialIds()
Quantile estimation is very memory-efficient, based on a temporary ff representation of the vector of all association test results.
qassoc = storeToQuantiles(prst, field="chisq",
probs=c(seq(0,.999,.001), 1-(c(1e-4,1e-5,1e-6))))
tail(qassoc)
## 99.7% 99.8% 99.9% 99.99% 99.999% 99.9999%
## 18.45125 25.92172 48.26925 203.57129 223.90211 243.88727
Because we compute fixed breaks, contributions to the overall histogram can be assembled in parallel, with small footprint. This is a tremendous reduction of data.
hh = storeToHist( prst, breaks= c(0,qassoc,1e9) )
## Warning in file.remove(filename(x)): cannot remove file
## '/tmp/Rtmp8yhUOv/ff_14aca52930fe5.ff', reason 'No such file or directory'
## Warning in file.remove(filename(x)): cannot remove file
## '/tmp/Rtmp8yhUOv/ff_14aca3757823f.ff', reason 'No such file or directory'
tail(hh$counts)
## [1] 4768 2086 215 0 0 0
FDR computation is post-hoc relative to filtering that need not be specified prior to testing. For illustration, we survey the results in geuvStore to obtain FDRs for each SNP:probe pair in two forms. First, we obtain FDR without any filtering. Second, we compute an a FDR for those SNP:probe pairs separated by at most 500kb, and for which the MAF for the SNP is at least 5 per cent.
rawFDR = storeToFDR(prst,
xprobs=c(seq(.05,.95,.05),.975,.990,.995,.9975,.999,
.9995, .9999, .99999) )
## counting tests...
## counting #NA...
## obtaining assoc quantiles...
## computing perm_assoc histogram....
## Warning in file.remove(filename(x)): cannot remove file
## '/tmp/Rtmp8yhUOv/ff_14aca4742aabc.ff', reason 'No such file or directory'
## Warning in file.remove(filename(x)): cannot remove file
## '/tmp/Rtmp8yhUOv/ff_14aca4af4ae4b.ff', reason 'No such file or directory'
dmfilt = function(x) # define the filtering function
x[ which(x$MAF >= 0.05 & x$mindist <= 500000) ]
filtFDR = storeToFDR(prst,
xprobs=c(seq(.05,.95,.05),.975,.990,.995,.9975,.999,
.9995, .9999, .99999), filter = dmfilt )
## counting tests...
## counting #NA...
## obtaining assoc quantiles...
## computing perm_assoc histogram....
## Warning in file.remove(filename(x)): cannot remove file
## '/tmp/Rtmp8yhUOv/ff_14aca146cc471.ff', reason 'No such file or directory'
## Warning in file.remove(filename(x)): cannot remove file
## '/tmp/Rtmp8yhUOv/ff_14aca1c730d8e.ff', reason 'No such file or directory'
rawFDR
## FDRsupp instance with 27 rows.
## assoc fdr ncalls avg.false
## 5% 0.003618066 0.9509009 7181737 6829120
## 10% 0.014572904 0.9496298 6803751 6461045
## 15% 0.033069795 0.9480637 6425765 6092034
## ...
## assoc fdr ncalls avg.false
## 99.95% 126.5972 0 3779.86150 0
## 99.99% 203.5713 0 755.97230 0
## 99.999% 223.9021 0 75.59723 0
## No interpolating function is available; use 'setFDRfunc'.
filtFDR
## FDRsupp instance with 27 rows.
## assoc fdr ncalls avg.false
## 5% 0.004063308 0.9462589 2086766 1974621
## 10% 0.016409957 0.9419880 1976936 1862250
## 15% 0.037334757 0.9369467 1867107 1749379
## ...
## assoc fdr ncalls avg.false
## 99.95% 203.5713 0 1098.29800 0
## 99.99% 205.2224 0 219.65960 0
## 99.999% 238.8027 0 21.96596 0
## No interpolating function is available; use 'setFDRfunc'.
The filtering leads to a lower FDR for a given strength of association. This is an inspiration for sensitivity analysis. Even with 5 million observations there is an effect of histogram bin selection in summarizing the permutation distribution of association. This can be seen fairly clearly in the wiggliness of the trace over the unfiltered association score:FDR plot.
rawtab = getTab(rawFDR)
filttab = getTab(filtFDR)
plot(rawtab[-(1:10),"assoc"],
-log10(rawtab[-(1:10),"fdr"]+1e-6), log="x", axes=FALSE,
xlab="Observed association", ylab="-log10 plugin FDR")
axis(1, at=c(seq(0,10,1),100,200))
axis(2)
points(filttab[-(1:10),1], -log10(filttab[-(1:10),2]+1e-6), pch=2)
legend(1, 5, pch=c(1,2), legend=c("all loci", "MAF >= 0.05 & dist <= 500k"))
We’ll address this below by fitting smooth functions for the score:FDR relationship.
The storeToFDRByProbe
FDR function examines the maximal association score by gene, for observed and permuted measures.
Good performance of this procedure is obtained by using group_by
and summarize
utilities of . Concurrent computing capabilities of BatchJobs are employed when available. You will need suitable contents for .BatchJobs.R
and any scheduler-dependent template files to take advantage of this.
fdbp = storeToFDRByProbe( prst, xprobs=c(seq(.025,.975,.025),.99))
## counting tests...
## counting #NA...
## obtaining assoc quantiles...
## computing perm_assoc histogram....
tail(getTab(fdbp),5)
## assoc fdr ncalls avg.false
## 90% 41.51883 0.10144928 92.0 9.333333
## 92.5% 51.23749 0.07729469 69.0 5.333333
## 95% 63.40014 0.03623188 46.0 1.666667
## 97.5% 102.90260 0.00000000 23.0 0.000000
## 99% 170.24911 0.00000000 9.2 0.000000
fdAtM05bp = storeToFDRByProbe( prst, filter=function(x) x[which(x$MAF > .05)],
xprobs=c(seq(.025,.975,.025),.99))
## counting tests...
## counting #NA...
## obtaining assoc quantiles...
## computing perm_assoc histogram....
tail(getTab(fdAtM05bp),5)
## assoc fdr ncalls avg.false
## 90% 35.90629 0.003623188 92.0 0.3333333
## 92.5% 44.00699 0.000000000 69.0 0.0000000
## 95% 55.98042 0.000000000 46.0 0.0000000
## 97.5% 96.75989 0.000000000 23.0 0.0000000
## 99% 150.75099 0.000000000 9.2 0.0000000
We’ll focus here on all-pairs analysis, with and without filtering.
Especially in this small example there will be some wiggling or even non-monotonicity in the trace of empirical FDR against association. We want to be able to compute the approximate FDR quickly and with minimal assumptions and pathology. To accomplish this, we will bind an interpolating model to the FDR estimates that we have. Interpolation will be accomplished with scatterplot smoothing in the gam framework. We’ll use different smoothing spans for the raw and unfiltered FDR estimates.
library(gam)
## Loading required package: splines
## Loaded gam 1.09.1
rawFDR = setFDRfunc(rawFDR, span=.75)
filtFDR = setFDRfunc(filtFDR, span=.25)
par(mfrow=c(2,2))
txsPlot(rawFDR)
txsPlot(filtFDR)
directPlot(rawFDR)
directPlot(filtFDR)
More work is needed on assessing tolerability of relative error in FDR interpolation.
Recall that dmfilt
is a function that obtains the SNP-probe pairs for which SNP has MAF at least five percent and SNP-probe distance at most 500kbp.
We use the FDRsupp
instances with ciseStore
to list the SNP-probe pairs with FDR lying beneath a given upper bound.
Unfiltered pairs:
rawEnum = enumerateByFDR(prst, rawFDR, threshold=.05)
rawEnum[order(rawEnum$chisq,decreasing=TRUE)[1:3]]
## GRanges object with 3 ranges and 12 metadata columns:
## seqnames ranges strand | paramRangeID
## <Rle> <IRanges> <Rle> | <factor>
## 201 1 [54683925, 54683925] * | ENSG00000231581.1
## 201 1 [54685855, 54685855] * | ENSG00000231581.1
## 201 1 [54694948, 54694948] * | ENSG00000231581.1
## REF ALT chisq permScore_1 permScore_2
## <DNAStringSet> <CharacterList> <numeric> <numeric> <numeric>
## 201 G A 252.8924 0.08203802 0.3729235
## 201 G A 252.8924 0.08203802 0.3729235
## 201 C T 252.4241 0.05751442 0.7552756
## permScore_3 snp MAF probeid mindist
## <numeric> <character> <numeric> <character> <numeric>
## 201 0.3960782 snp_1_54683925 0.1266234 ENSG00000231581.1 6256
## 201 0.3960782 snp_1_54685855 0.1266234 ENSG00000231581.1 4326
## 201 0.4921478 snp_1_54694948 0.1222944 ENSG00000231581.1 3792
## estFDR
## <numeric>
## 201 0
## 201 0
## 201 0
## -------
## seqinfo: 21 sequences from hg19 genome; no seqlengths
length(rawEnum)
## [1] 44524
A small quantity of metadata is bound into the resulting GRanges
instance.
names(metadata(rawEnum))
## [1] "enumCall" "enumSess" "fdrCall"
Pairs meeting MAF and distance conditions are obtained with a filter
setting to the enumerating function.
filtEnum = enumerateByFDR(prst, filtFDR, threshold=.05,
filter=dmfilt)
filtEnum[order(filtEnum$chisq,decreasing=TRUE)[1:3]]
## GRanges object with 3 ranges and 12 metadata columns:
## seqnames ranges strand | paramRangeID
## <Rle> <IRanges> <Rle> | <factor>
## 201 1 [54683925, 54683925] * | ENSG00000231581.1
## 201 1 [54685855, 54685855] * | ENSG00000231581.1
## 201 1 [54694948, 54694948] * | ENSG00000231581.1
## REF ALT chisq permScore_1 permScore_2
## <DNAStringSet> <CharacterList> <numeric> <numeric> <numeric>
## 201 G A 252.8924 0.08203802 0.3729235
## 201 G A 252.8924 0.08203802 0.3729235
## 201 C T 252.4241 0.05751442 0.7552756
## permScore_3 snp MAF probeid mindist
## <numeric> <character> <numeric> <character> <numeric>
## 201 0.3960782 snp_1_54683925 0.1266234 ENSG00000231581.1 6256
## 201 0.3960782 snp_1_54685855 0.1266234 ENSG00000231581.1 4326
## 201 0.4921478 snp_1_54694948 0.1222944 ENSG00000231581.1 3792
## estFDR
## <numeric>
## 201 0
## 201 0
## 201 0
## -------
## seqinfo: 21 sequences from hg19 genome; no seqlengths
length(filtEnum)
## [1] 89820
The yield of an enumeration procedure depends on filtering based on SNP-gene distance and SNP MAF. This can be illustrated as follows, with minimal computational effort owing to the retention of genome-scale permutations and the use of the plug-in FDR algorithm.
data(sensByProbe) # see example(senstab) for construction approach
tab = senstab( sensByProbe )
plot(tab)
If we wish to maximize the yield of eQTL enumeration at FDR at most 0.05, we can apply a filter to the store.
flens = storeApply( prst, function(x) {
length(x[ which(x$MAF >= .08 & x$mindist <= 25000), ] )
})
sum(unlist(flens))
## [1] 174044
This is a count of gene-snp pairs satisfying structural and genetic criteria.
In the case of geuFPKM
there is some relevant metadata in the rowRanges
element. We will bind that into the collection of significant findings.
library(geuvPack)
data(geuFPKM)
basic = mcols(rowRanges(geuFPKM))[, c("gene_id", "gene_status", "gene_type",
"gene_name")]
rownames(basic) = basic$gene_id
extr = basic[ filtEnum$probeid, ]
mcols(filtEnum) = cbind(mcols(filtEnum), extr)
stopifnot(all.equal(filtEnum$probeid, filtEnum$gene_id))
filtEnum[1:3]
## GRanges object with 3 ranges and 16 metadata columns:
## seqnames ranges strand | paramRangeID
## <Rle> <IRanges> <Rle> | <factor>
## 1 1 [226019653, 226019653] * | ENSG00000183814.10
## 1 1 [226219554, 226219554] * | ENSG00000183814.10
## 1 1 [226229189, 226229189] * | ENSG00000183814.10
## REF ALT chisq permScore_1 permScore_2
## <DNAStringSet> <CharacterList> <numeric> <numeric> <numeric>
## 1 G A 7.052343 0.2427214 0.0001436935
## 1 C G 6.116819 1.0205941 0.0242830567
## 1 T C 6.062519 2.9692153 0.0195030475
## permScore_3 snp MAF probeid mindist
## <numeric> <character> <numeric> <character> <numeric>
## 1 3.972022e-01 snp_1_226019653 0.1601732 ENSG00000183814.10 399197
## 1 3.010252e-01 snp_1_226219554 0.4058442 ENSG00000183814.10 199296
## 1 5.067021e-07 snp_1_226229189 0.4188312 ENSG00000183814.10 189661
## estFDR gene_id gene_status gene_type gene_name
## <numeric> <character> <character> <character> <character>
## 1 0.01201117 ENSG00000183814.10 KNOWN protein_coding LIN9
## 1 0.03212841 ENSG00000183814.10 KNOWN protein_coding LIN9
## 1 0.03394815 ENSG00000183814.10 KNOWN protein_coding LIN9
## -------
## seqinfo: 21 sequences from hg19 genome; no seqlengths
We have a utility to create an annotated Manhattan plot for a search cis to a gene. The basic ingredients are
ciseStore
instance for basic location and association informationFDRsupp
instance that includes the function that maps from association scores to FDR, and the filter employed during FDR estimationhmm878
GRanges instance in gQTLstats/data.It is important to recognize that, given an FDRsupp
instance we can compute the FDR for any association score, but validity of the FDR attribution requires that we refrain from computing it for any locus excluded by filtering. the manhWngr
executes the FDRsupp
-resident filter by default.
data(hmm878)
library(geuvStore)
prst = ciseStore(partialRegistry(), TRUE, FALSE)
## NOTE: ids of 92 jobs available for this registry can be obtained with partialIds()
## building probe:job map...
## done.
myg = "ENSG00000183814.10" # LIN9
data(filtFDR)
library(ggplot2)
manhWngr( store = prst, probeid = myg, sym="LIN9",
fdrsupp=filtFDR, namedGR=hmm878 )
For a dynamic visualization procedure, see the vjcitn/gQTLbrowse github archive.
We can use VariantAnnotation to establish basic structural characteristics for all filtered variants.
suppressPackageStartupMessages({
library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
})
txdb = TxDb.Hsapiens.UCSC.hg19.knownGene
seqlevelsStyle(filtEnum) = "UCSC"
allv = locateVariants(filtEnum, txdb, AllVariants()) # multiple recs per eQTL
table(allv$LOCATION)
hits = findOverlaps( filtEnum, allv )
filtEex = filtEnum[ queryHits(hits) ]
mcols(filtEex) = cbind(mcols(filtEex), mcols(allv[subjectHits(hits)])[,1:7])
filtEex[1:3]
The resulting table is SNP:transcript specific.
The following tasks need to be addressed in the modeling of phenorelevance
We will make a temporary reconstruction of geuvStore contents with the enhanced information.