apeglm provides Bayesian shrinkage estimators for effect sizes for a variety of GLM models; apeglm stands for “approximate posterior estimation for GLM”.
apeglm 1.0.3
Note: the typical RNA-seq workflow for users would be to call apeglm estimation from within the lfcShrink
function in DESeq2, with the following code (not run here). See the DESeq2 vignette for more details. The lfcShrink
wrapper function takes care of many details below, and unifies the interface for multiple shrinkage estimators.
res <- lfcShrink(dds, coef=2, type="apeglm")
Here we show example code which mimics what will happen inside the lfcShrink
function when using the apeglm method.
Load a prepared SummarizedExperiment
:
library(airway)
data(airway)
head(assay(airway))
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003 679 448 873 408 1138
## ENSG00000000005 0 0 0 0 0
## ENSG00000000419 467 515 621 365 587
## ENSG00000000457 260 211 263 164 245
## ENSG00000000460 60 55 40 35 78
## ENSG00000000938 0 0 2 0 1
## SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003 1047 770 572
## ENSG00000000005 0 0 0
## ENSG00000000419 799 417 508
## ENSG00000000457 331 233 229
## ENSG00000000460 63 76 60
## ENSG00000000938 0 0 0
For demonstration, we will just use the first 5000 genes of the airway
dataset.
keep <- head(which(rowSums(assay(airway)) > 0), 5000)
airway <- airway[keep,]
Run a DESeq2 differential expression analysis (size factors, and dispersion estimates could similarly be estimated using edgeR):
library(DESeq2)
dds <- DESeqDataSet(airway, ~cell + dex)
dds$dex <- relevel(dds$dex, "untrt")
dds <- DESeq(dds)
res <- results(dds)
Defining data and parameter objects necessary for apeglm
. We must multiply the coefficients from DESeq2 by a factor, because apeglm provides natural log coefficients. Again, this would be handled inside of lfcShrink
in DESeq2 for a typical RNA-seq analysis.
x <- model.matrix(design(dds), colData(dds))
param <- dispersions(dds)
mle <- log(2) * cbind(res$log2FoldChange, res$lfcSE)
offset <- matrix(log(sizeFactors(dds)),
ncol=ncol(dds),
nrow=nrow(dds),byrow=TRUE)
Here apeglm
on 5000 genes takes less than a minute on a laptop. It scales with number of genes, the number of samples and the number of variables in the design formula, where here we have 5 coefficients (one for the four cell cultures and one for the difference due to dexamethasone treatment).
We provide apeglm
with the SummarizedExperiment although the function can also run on a matrix of counts or other observed data. We specify a coef
as well as a threshold
which we discuss below. Note that we multiple the threshold
by log(2)
to convert from log2 scale to natural log scale.
library(apeglm)
system.time({
fit <- apeglm(Y=airway, x=x,
log.lik=logLikNB,
param=param,
coef=ncol(x),
threshold=log(2) * 1,
mle=mle,
offset=offset)
})
## user system elapsed
## 31.400 0.008 31.432
str(fit$prior.control)
## List of 7
## $ no.shrink : int [1:4] 1 2 3 4
## $ prior.mean : num 0
## $ prior.scale : num 0.299
## $ prior.df : num 1
## $ prior.no.shrink.mean : num 0
## $ prior.no.shrink.scale: num 15
## $ prior.var : num 0.0894
Among other output, we have the estimated coefficients attached to the ranges of the SummarizedExperiment used as input:
class(fit$ranges)
## [1] "GRangesList"
## attr(,"package")
## [1] "GenomicRanges"
mcols(fit$ranges, use.names=TRUE)
## DataFrame with 5000 rows and 5 columns
## X.Intercept. cellN061011 cellN080611 cellN61311
## <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 6.6456597 0.06079181 0.22967225 -0.15204025
## ENSG00000000419 6.2345215 -0.08500074 -0.01514705 -0.04766311
## ENSG00000000457 5.4529513 0.06016694 -0.05563161 0.04846138
## ENSG00000000460 3.7704767 0.54519358 0.26134633 0.34941358
## ENSG00000000938 -0.3098332 -4.18932720 -0.54205760 -4.21138018
## ... ... ... ... ...
## ENSG00000122756 -3.638665 4.04815018 -1.98772144 2.96765739
## ENSG00000122778 5.239055 0.08007145 0.78572318 0.24009755
## ENSG00000122779 5.668604 -0.05697458 0.05013094 -0.04413214
## ENSG00000122783 6.015349 0.03689149 -0.15782833 -0.29740109
## ENSG00000122786 9.782511 -0.18235005 0.09422847 -0.12028323
## dextrt
## <numeric>
## ENSG00000000003 -0.264705484
## ENSG00000000419 0.115138038
## ENSG00000000457 0.009068942
## ENSG00000000460 -0.042500548
## ENSG00000000938 -0.008253560
## ... ...
## ENSG00000122756 -0.0007354379
## ENSG00000122778 -0.2854416424
## ENSG00000122779 -0.1099347437
## ENSG00000122783 0.0690095583
## ENSG00000122786 0.5235240365
We can compare the coefficients from apeglm with those from DESeq2. apeglm provides coefficients on the natural log scale, so we must convert to log2 scale by multiplying by log2(exp(1))
. Note that DESeq2’s lfcShrink
function converts apeglm coefficients to the log2 scale internally.
res.shr <- lfcShrink(dds, coef=5)
DESeq2.lfc <- res.shr$log2FoldChange
apeglm.lfc <- log2(exp(1)) * fit$map[,5]
Here we plot apeglm estimators against DESeq2:
plot(DESeq2.lfc, apeglm.lfc)
abline(0,1)
Here we plot MLE, DESeq2 and apeglm estimators against the mean of normalized counts:
par(mfrow=c(1,3))
lims <- c(-8,8)
hline <- function() abline(h=c(-4:4 * 2),col=rgb(0,0,0,.2))
xlab <- "mean of normalized counts"
plot(res$baseMean, res$log2FoldChange, log="x",
ylim=lims, main="MLE", xlab=xlab)
hline()
plot(res$baseMean, DESeq2.lfc, log="x",
ylim=lims, main="DESeq2", xlab=xlab)
hline()
plot(res$baseMean, apeglm.lfc, log="x",
ylim=lims, main="apeglm", xlab=xlab)
hline()
Note that p-values and FSR define different events, and are not on the same scale. An FSR of 0.5 means that the estimated sign is as bad as random guess.
par(mfrow=c(1,2),mar=c(5,5,1,1))
plot(res$pvalue, fit$fsr, col="blue",
xlab="DESeq2 pvalue", ylab="apeglm local FSR",
xlim=c(0,1), ylim=c(0,.5))
abline(0,1)
plot(-log10(res$pvalue), -log10(fit$fsr),
xlab="-log10 DESeq2 pvalue", ylab="-log10 apeglm local FSR",
col="blue")
abline(0,1)
We recommend using a lower threshold on s-values than typically used for adjusted p-values, for example one might be interested in sets with 0.01 or 0.005 aggregate FSR.
plot(res$padj, fit$svalue, col="blue",
xlab="DESeq2 padj", ylab="apeglm svalue",
xlim=c(0,.2), ylim=c(0,.02))
More scrutiny can be applied by using an LFC threshold greater than zero, and asking for the probability of a false sign or that the effect size is not further from zero in distance than the threshold amount.
As we already specfified a threshold of 1 on the log2 scale, we can see how this changes the local FSRs:
cols <- ifelse(svalue(fit$thresh) < .01, "red", "black")
plot(res$baseMean, log2(exp(1)) * fit$map[,5], log="x", col=cols,
xlab=xlab, ylab="LFC")
abline(h=c(-1,0,1), col=rgb(0,0,1,.5), lwd=2)
We also show an short example using an alternative likelihood to the negative binomial. Suppose we have allele-specific counts for n=20 vs 20 samples across 5000 genes. We can define a binomial model and test for the allelic balance across groups of samples.
Here we will simulate allele counts from our existing dataset for demonstration. We spike in 10 genes with allelic imbalance.
library(emdbook)
n <- 20
f <- factor(rep(1:2,each=n))
mu <- ifelse(res$baseMean > 10, res$baseMean, 10)
set.seed(1)
cts <- matrix(rnbinom(nrow(dds)*2*n,
mu=mu,
size=1/dispersions(dds)),
ncol=2*n)
theta <- runif(nrow(cts),1,100)
prob <- rnorm(nrow(cts),.5,.01)
ase.cts <- matrix(rbetabinom(prod(dim(cts)), prob=prob,
size=cts, theta=rep(theta,ncol(cts))),
nrow=nrow(cts))
idx <- 1:10
idx2 <- which(f == 2)
theta[idx] <- 100
ase.cts[idx,idx2] <- matrix(rbetabinom(length(idx)*length(idx2), prob=.75,
size=cts[idx,idx2], theta=100),
nrow=length(idx))
We define a beta-binomial likelihood function which uses the total counts as a parameter:
betabinom.log.lik <- function(y, x, beta, param, offset) {
xbeta <- x %*% beta
p.hat <- (1+exp(-xbeta))^-1
dbetabinom(y, prob=p.hat, size=param[-1], theta=param[1], log=TRUE)
}
We first need to estimate MLE coefficients and standard errors.
theta.hat.0 <- 100 # rough estimate of dispersion
param <- cbind(theta.hat.0, cts)
x <- model.matrix(~f)
system.time({
fit.mle <- apeglm(Y=ase.cts, x=x,
log.lik=betabinom.log.lik,
param=param,
no.shrink=TRUE,
log.link=FALSE)
})
## user system elapsed
## 20.232 0.000 20.246
coef <- 2
plot(res$baseMean, fit.mle$map[,coef], log="x", xlab=xlab, ylab="LFC")
points(res$baseMean[idx], fit.mle$map[idx,coef], col="dodgerblue", cex=3)
theta.hat <- bbEstDisp(success=ase.cts, size=cts,
x=x, beta=fit.mle$map,
minDisp=1, maxDisp=500)
Note: here we use the se
slot of the returned fit object, although in future versions of apeglm, this slot is named sd
(posterior standard deviation is a more accurate description of this quantity).
mle <- cbind(fit.mle$map[,coef], fit.mle$se[,coef])
param <- cbind(theta.hat, cts)
system.time({
fit2 <- apeglm(Y=ase.cts, x=x,
log.lik=betabinom.log.lik,
param=param,
coef=coef,
mle=mle,
log.link=FALSE)
})
## user system elapsed
## 20.576 0.000 20.594
par(mfrow=c(1,2))
ylim <- c(-3,3)
plot(res$baseMean, fit.mle$map[,coef], log="x", xlab=xlab, ylab="LFC", ylim=ylim)
points(res$baseMean[idx], fit.mle$map[idx,coef], col="dodgerblue", cex=3)
cols <- ifelse(fit2$svalue < .001, "red", "black")
plot(res$baseMean, fit2$map[,coef], log="x", xlab=xlab, ylab="LFC",
col=cols, ylim=ylim)
points(res$baseMean[idx], fit2$map[idx,coef], col="dodgerblue", cex=3)
log(.75/(1-.75))
## [1] 1.098612
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] emdbook_1.3.9 apeglm_1.0.3
## [3] DESeq2_1.18.1 airway_0.112.0
## [5] SummarizedExperiment_1.8.1 DelayedArray_0.4.1
## [7] matrixStats_0.52.2 Biobase_2.38.0
## [9] GenomicRanges_1.30.1 GenomeInfoDb_1.14.0
## [11] IRanges_2.12.0 S4Vectors_0.16.0
## [13] BiocGenerics_0.24.0 BiocStyle_2.6.1
##
## loaded via a namespace (and not attached):
## [1] tidyr_0.7.2 bit64_0.9-7 splines_3.4.3
## [4] Formula_1.2-2 assertthat_0.2.0 latticeExtra_0.6-28
## [7] blob_1.1.0 GenomeInfoDbData_1.0.0 yaml_2.1.16
## [10] numDeriv_2016.8-1 RSQLite_2.0 backports_1.1.2
## [13] lattice_0.20-35 glue_1.2.0 bbmle_1.0.20
## [16] digest_0.6.13 RColorBrewer_1.1-2 XVector_0.18.0
## [19] checkmate_1.8.5 colorspace_1.3-2 htmltools_0.3.6
## [22] Matrix_1.2-12 plyr_1.8.4 XML_3.98-1.9
## [25] pkgconfig_2.0.1 genefilter_1.60.0 bookdown_0.5
## [28] zlibbioc_1.24.0 purrr_0.2.4 xtable_1.8-2
## [31] scales_0.5.0 BiocParallel_1.12.0 htmlTable_1.11.0
## [34] tibble_1.3.4 annotate_1.56.1 ggplot2_2.2.1
## [37] nnet_7.3-12 lazyeval_0.2.1 survival_2.41-3
## [40] magrittr_1.5 memoise_1.1.0 evaluate_0.10.1
## [43] MASS_7.3-47 foreign_0.8-69 tools_3.4.3
## [46] data.table_1.10.4-3 stringr_1.2.0 locfit_1.5-9.1
## [49] munsell_0.4.3 cluster_2.0.6 AnnotationDbi_1.40.0
## [52] bindrcpp_0.2 compiler_3.4.3 rlang_0.1.4
## [55] grid_3.4.3 RCurl_1.95-4.8 rstudioapi_0.7
## [58] htmlwidgets_0.9 bitops_1.0-6 base64enc_0.1-3
## [61] rmarkdown_1.8 gtable_0.2.0 codetools_0.2-15
## [64] DBI_0.7 R6_2.2.2 gridExtra_2.3
## [67] knitr_1.17 dplyr_0.7.4 bit_1.1-12
## [70] bindr_0.1 Hmisc_4.1-0 rprojroot_1.3-1
## [73] stringi_1.1.6 Rcpp_0.12.14 geneplotter_1.56.0
## [76] rpart_4.1-11 acepack_1.4.1 coda_0.19-1