The biotmle R package can be used to isolate biomarkers based on the
associations of genomic objects with an exposure variable of interest.
In this vignette, we illustrate how to use biotmle to isolate and visualize
genes associated with an exposure, using a data set containing microarray
expression measures from an Illumina platform. In the analysis described below,
Targeted Minimum Loss-Based Estimation (TMLE) is used to transform the
microarray expression values based on the influence curve representation of the
Average Treatment Effect (ATE). Following this transformation, the moderated
t-statistic of Smyth is used to test for a binary groupwise difference (based
on the exposure variable), using the tools provided by the R package limma.
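To make the idea of an influence-curve transformation concrete, the following is a minimal base-R sketch on simulated data for a single outcome. It is an illustration under stated assumptions, not the internals of biotmle: the variable names, the simulated data, and the simple glm fits are all hypothetical, and the targeting (fluctuation) step of TMLE is deliberately skipped, so the estimate shown is a plug-in estimate rather than a TMLE.

```r
# Illustrative only: per-observation influence-curve values for the ATE,
# computed from simple parametric fits on simulated data.
set.seed(1)
n <- 200
W <- rbinom(n, 1, 0.5)                            # binary baseline covariate
A <- rbinom(n, 1, plogis(-0.3 + 0.6 * W))         # binary exposure
Y <- 1 + 0.5 * A + 0.8 * W + rnorm(n)             # one "expression" measure
dat <- data.frame(W = W, A = A, Y = Y)

# exposure mechanism g(W) = P(A = 1 | W) and outcome regression Q(A, W)
g_fit <- glm(A ~ W, family = binomial(), data = dat)
Q_fit <- glm(Y ~ A + W, family = gaussian(), data = dat)

g1 <- predict(g_fit, type = "response")           # estimated P(A = 1 | W)
QA <- predict(Q_fit)                              # Q at the observed exposure
Q1 <- predict(Q_fit, newdata = transform(dat, A = 1))
Q0 <- predict(Q_fit, newdata = transform(dat, A = 0))

# plug-in ATE estimate (no targeting step applied here)
psi <- mean(Q1 - Q0)

# efficient influence curve of the ATE, evaluated at each observation
ic <- (A / g1 - (1 - A) / (1 - g1)) * (Y - QA) + (Q1 - Q0) - psi
summary(ic)
```

Each observation contributes one influence-curve value; in the workflow described above, values of this kind (one per sample, per biomarker) take the place of the raw expression measures in the downstream moderated test.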
First, we load the biotmle package and the illuminaData data set (provided by
the biotmleData package):
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(biotmle)
## biotmle v:1.4.0: Moderated and Targeted Statistical Learning for Biomarker Discovery
library(biotmleData)
data(illuminaData)
suppressMessages(library(SummarizedExperiment))
"%ni%" = Negate("%in%")
In order to perform Targeted Minimum Loss-Based Estimation, we need three
separate data structures: (1) W, baseline covariates that could potentially
confound the association of biomarkers with the exposure of interest; (2) A,
the exposure of interest; and (3) Y, the biomarkers of interest. All values in
W and A ought to be discretized, in order to avoid practical violations of
the assumption of positivity. With the illuminaData data set, we discretize
the age variable in the phenotype-level data (this can be accessed via the
colData of the SummarizedExperiment object). To invoke the biomarker
assessment function (biomarkertmle), we also need to specify a variable of
interest (or the position of said variable in the design matrix). We do both
in just a few lines below:
# discretize "age" in the phenotype-level data
colData(illuminaData) <- colData(illuminaData) %>%
  data.frame %>%
  dplyr::mutate(age = as.numeric(age > median(age))) %>%
  DataFrame
# specify column index of treatment/exposure variable of interest
varInt_index <- which(names(colData(illuminaData)) %in% "benzene")
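The same two steps can be tried on a small toy table without loading any Bioconductor objects. The sketch below is illustrative only: the pheno data frame and its values are made up, and the cross-tabulation at the end is one simple, informal way to see whether every covariate stratum contains both exposed and unexposed samples (the practical concern behind the positivity assumption).

```r
# Toy phenotype table (hypothetical values, for illustration only)
pheno <- data.frame(
  age     = c(23, 35, 41, 52, 60, 47),
  benzene = c(0, 1, 1, 0, 1, 0)
)

# discretize age at its median, as done for the real data above
pheno$age <- as.numeric(pheno$age > median(pheno$age))

# locate the column index of the exposure variable
var_idx <- which(names(pheno) %in% "benzene")

# informal positivity check: every age stratum should contain
# both exposed and unexposed samples (no empty cells)
cells <- table(age = pheno$age, benzene = pheno$benzene)
cells
all(cells > 0)
```

An empty cell in such a table would indicate a stratum of W in which one exposure level is never observed, which is the situation that discretizing into coarse levels is meant to avoid.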
The TMLE-based biomarker discovery process can be invoked using the
biomarkertmle function. The procedure is quite resource-intensive because it
evaluates the association of each individual potential biomarker (of which there
are over 20,000 in the included data set) with an exposure of interest, while
accounting for potential confounding based on all other covariates included in
the design matrix. We demonstrate the necessary syntax for calling
biomarkertmle below:
biomarkerTMLEout <- biomarkertmle(se = illuminaData,
                                  varInt = varInt_index,
                                  family = "gaussian",
                                  g_lib = c("SL.glmnet", "SL.randomForest",
                                            "SL.polymars", "SL.mean"),
                                  Q_lib = c("SL.glmnet", "SL.randomForest",
                                            "SL.nnet", "SL.mean")
                                 )
The output of biomarkertmle is an object of class bioTMLE, containing four
objects: (1) call, the call to biomarkertmle; (2) topTable, an empty slot meant
to hold the output of limma::topTable, after a later call to modtest_ic;
(3) limmaOut, an empty slot meant to hold the output of limma::lmFit, after a
later call to modtest_ic; and (4) tmleOut, a data frame containing the point
estimates of the associations of each biomarker with the exposure of interest
based on the influence curve representation of the Average Treatment Effect
parameter.
The output of biomarkertmle can be directly fed to modtest_ic, a wrapper
around limma::lmFit and limma::topTable that outputs a bioTMLE object with the
slots described above completely filled in. The modtest_ic function requires as
input a bioTMLE object containing a data frame in the tmleOut field as well as
a design matrix indicating the groupwise difference to be tested. The design
matrix should contain an intercept term and a term for the exposure of interest
(with discretized exposure levels). Based on the relevant statistical theory,
it is not appropriate to include any further terms in the design matrix (n.b.,
this differs from standard calls to limma::lmFit).
limmaTMLEout <- modtest_ic(biotmle = biomarkerTMLEout)
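For readers unfamiliar with the limma functions being wrapped, the following is a generic sketch of a moderated test with a two-column design (intercept plus a binary exposure term) on a simulated matrix. It is not the implementation of modtest_ic: the matrix ic_mat, the exposure vector, and the added signal in the first five rows are all made up for illustration, and only standard limma calls (lmFit, eBayes, topTable) are used.

```r
library(limma)

set.seed(42)
n_genes   <- 100
n_samples <- 20
exposure  <- rep(c(0, 1), each = n_samples / 2)   # hypothetical binary exposure

# simulated stand-in for a biomarker-by-sample matrix of transformed values
ic_mat <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
# add a groupwise shift to the first five rows so that something is detectable
ic_mat[1:5, exposure == 1] <- ic_mat[1:5, exposure == 1] + 2

# design with an intercept and a single exposure term, and nothing else
design <- model.matrix(~ exposure)

# row-wise linear models, followed by empirical Bayes variance moderation
fit <- eBayes(lmFit(ic_mat, design))

# moderated t-statistics for the exposure coefficient, with BH adjustment
tt <- topTable(fit, coef = 2, number = Inf, adjust.method = "BH",
               sort.by = "none")
head(tt[order(tt$P.Value), ])
```

The coef = 2 argument selects the exposure column of the design matrix, so the reported statistics refer to the groupwise difference rather than the intercept.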
After invoking modtest_ic, the resultant bioTMLE object will contain all
information relevant to the analytic procedure for identifying biomarkers: that
is, it will contain the original call to biomarkertmle, the result of running
limma::topTable, the result of running limma::lmFit, and the result of running
biomarkertmle. The statistical results of this procedure can be extracted from
the topTable object in the bioTMLE object produced by modtest_ic.
This package provides several plotting methods that can be used to visualize the results of the TMLE-based biomarker discovery process.
The plot method for a bioTMLE object will produce a histogram of the
adjusted p-values of each biomarker (based on the Benjamini-Hochberg procedure
for controlling the False Discovery Rate) as generated by limma::topTable:
plot(x = limmaTMLEout, type = "pvals_adj")
Setting the argument type = "pvals_raw" will instead produce a histogram of
the raw p-values (these are less informative and should, in general, not be
used for inferential purposes, as the computation producing these p-values
ignores the multiple testing nature of the biomarker discovery problem):
plot(x = limmaTMLEout, type = "pvals_raw")
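The difference between raw and adjusted p-values can be seen on a handful of numbers using base R's p.adjust. The p-values below are invented for illustration and are unrelated to the illuminaData results.

```r
# hypothetical raw p-values for five biomarkers
raw_p <- c(0.001, 0.01, 0.02, 0.04, 0.5)

# Benjamini-Hochberg adjustment, the same method named above
adj_p <- p.adjust(raw_p, method = "BH")

# side-by-side comparison: adjusted values are never smaller than raw ones
data.frame(raw = raw_p, adjusted = adj_p)
```

With thousands of biomarkers tested at once, the gap between the two columns is typically much larger than in this five-value example, which is why the adjusted values are the ones to use for inference.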
Heatmaps are useful graphics for visualizing the relationship between measures
on genomic objects and covariates of interest. The heatmap_ic function
provides this graphic for bioTMLE objects, allowing for the relationship
between the exposure variable and some number of “top” biomarkers (as
determined by the call to modtest_ic) to be visualized. In general, the
heatmap for bioTMLE objects expresses how the contributions of each biomarker
to the Average Treatment Effect (ATE) vary across differences in the exposure
variable (that is, there is a causal interpretation to the findings). The plot
produced is a ggplot2 object and can be modified in place if stored properly.
For our analysis:
varInt_index <- which(names(colData(illuminaData)) %in% "benzene")
designVar <- as.data.frame(colData(illuminaData))[, varInt_index]
designVar <- as.numeric(designVar == max(designVar))
heatmap_ic(x = limmaTMLEout, design = designVar, FDRcutoff = 0.05, top = 25)
Volcano plots are standard graphical tools for examining how changes in
expression relate to the raw p-value. The utility of such plots lies in their
providing a convenient way to identify and systematically ignore those genomic
objects that have extremely low p-values due to extremely low variance between
observations. The volcano_ic function provides much of the same
interpretation, except that the fold change values displayed on the x-axis refer
to changes in the contributions of each biomarker to the Average Treatment
Effect (in standard practice, for microarray technology, these would be fold
changes in gene expression). The plot produced is a ggplot2 object and can
be modified in place if stored properly. For our analysis:
volcano_ic(biotmle = limmaTMLEout)
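As a rough illustration of what "stored and modified" means for a ggplot2 object, the sketch below builds a generic volcano-style plot from simulated values and then adds layers to the stored object. The data frame res and its column names are hypothetical stand-ins, and the plot is built directly with ggplot2 rather than by volcano_ic, whose exact return value is not verified here.

```r
library(ggplot2)

set.seed(3)
# simulated effect sizes and raw p-values (not output from biotmle)
res <- data.frame(logFC = rnorm(500), P.Value = runif(500))

# store the plot object instead of printing it immediately
p <- ggplot(res, aes(x = logFC, y = -log10(P.Value))) +
  geom_point(alpha = 0.5)

# modify the stored object by adding labels and a theme
p <- p +
  labs(title = "Volcano-style plot (simulated data)",
       x = "effect size", y = "-log10(raw p-value)") +
  theme_bw()

print(p)
```

Points high on the y-axis but close to zero on the x-axis correspond to the case described above: very small p-values that arise from low variance rather than from a large effect.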
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.7-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.7-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] bindrcpp_0.2.2 SummarizedExperiment_1.10.0
## [3] DelayedArray_0.6.0 BiocParallel_1.14.0
## [5] matrixStats_0.53.1 Biobase_2.40.0
## [7] GenomicRanges_1.32.0 GenomeInfoDb_1.16.0
## [9] IRanges_2.14.0 S4Vectors_0.18.0
## [11] BiocGenerics_0.26.0 biotmleData_1.3.0
## [13] biotmle_1.4.0 dplyr_0.7.4
## [15] BiocStyle_2.8.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.16 lattice_0.20-35 listenv_0.7.0
## [4] assertthat_0.2.0 rprojroot_1.3-2 digest_0.6.15
## [7] foreach_1.4.4 R6_2.2.2 plyr_1.8.4
## [10] nnls_1.4 backports_1.1.2 evaluate_0.10.1
## [13] ggplot2_2.2.1 pillar_1.2.2 zlibbioc_1.26.0
## [16] rlang_0.2.0 lazyeval_0.2.1 Matrix_1.2-14
## [19] rmarkdown_1.9 labeling_0.3 stringr_1.3.0
## [22] RCurl_1.95-4.10 munsell_0.4.3 compiler_3.5.0
## [25] xfun_0.1 pkgconfig_2.0.1 superheat_0.1.0
## [28] globals_0.11.0 htmltools_0.3.6 tibble_1.4.2
## [31] GenomeInfoDbData_1.1.0 bookdown_0.7 codetools_0.2-15
## [34] doFuture_0.6.0 future_1.8.0 tmle_1.3.0-1
## [37] MASS_7.3-50 bitops_1.0-6 grid_3.5.0
## [40] gtable_0.2.0 magrittr_1.5 scales_0.5.0
## [43] stringi_1.1.7 XVector_0.20.0 limma_3.36.0
## [46] ggdendro_0.1-20 ggsci_2.8 iterators_1.0.9
## [49] tools_3.5.0 glue_1.2.0 yaml_2.1.18
## [52] colorspace_1.3-2 SuperLearner_2.0-23 knitr_1.20
## [55] bindr_0.1.1