# Identifying Biomarkers from an Exposure Variable

## Introduction

The biotmle R package can be used to isolate biomarkers in two ways: based on the associations of genomic objects with (1) an exposure variable of interest and (2) an outcome variable of interest. In this vignette, we illustrate how to use biotmle to isolate and visualize genes associated with an exposure, using a data set containing microarray expression measures from an Illumina platform. In the analysis described below, Targeted Minimum Loss-Based Estimation (TMLE) is used to transform the microarray expression values based on the influence curve representation of the Average Treatment Effect (ATE). Following this transformation, the moderated t-statistic of Smyth is used to test for a binary groupwise difference (based on the exposure variable), using the tools provided by the R package limma.

## Biomarker Identification

First, we load the biotmle package and the (included) illuminaData data set:

library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
library(biotmle)
## biotmle: Moderated Statistics and Targeted Learning for Biomarker Discovery
## Version: 1.0.4
library(biotmleData)
library(SummarizedExperiment)
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:dplyr':
##
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
##     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
##     as.data.frame, cbind, colMeans, colSums, colnames, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect,
##     is.unsorted, lapply, lengths, mapply, match, mget, order,
##     paste, pmax, pmax.int, pmin, pmin.int, rank, rbind, rowMeans,
##     rowSums, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which, which.max, which.min
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
##
##     first, rename
## The following object is masked from 'package:base':
##
##     expand.grid
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
##
##     collapse, desc, slice
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
##
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: DelayedArray
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
##
##     anyMissing, rowMedians
## The following object is masked from 'package:dplyr':
##
##     count
##
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
##
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following object is masked from 'package:base':
##
##     apply
"%ni%" = Negate("%in%")

In order to perform Targeted Minimum Loss-Based Estimation, we need three separate data structures: (1) W, baseline covariates that could potentially confound the association of biomarkers with the exposure of interest; (2) A, the exposure of interest; and (3) Y, the biomarkers of interest. All values in W and A ought to be discretized, in order to avoid practical violations of the assumption of positivity. With the illuminaData data set below, we discretize the age variable in the phenotype-level data below (this can be accessed via the colData of the SummarizedExperiment object). To invoke the biomarker assessment function (biomarkertmle), we also need to specify a variable of interest (or the position of said variable in the design matrix). We do both in just a few lines below:

# discretize "age" in the phenotype-level data
data.frame %>%
dplyr::mutate(age = as.numeric(age > median(age))) %>%
DataFrame

# specify column index of treatment/exposure variable of interest
varInt_index <- which(names(colData(illuminaData)) %in% "benzene")

The TMLE-based biomarker discovery process can be invoked using the biomarkertmle function. The procedure is quite resource-intensive because it evaluates the association of each individual potential biomarker (of which there are over 20,000 in the included data set) with an exposure of interest, while accounting for potential confounding based on all other covariates included in the design matrix. We demonstrate the necessary syntax for calling biomarkertmle below:

biomarkerTMLEout <- biomarkertmle(se = illuminaData,
varInt = varInt_index,
type = "exposure",
family = "gaussian",
g_lib = c("SL.glmnet", "SL.randomForest",
"SL.polymars", "SL.mean"),
Q_lib = c("SL.glmnet", "SL.randomForest",
"SL.nnet", "SL.mean")
)

The output of biomarkertmle is an object of class bioTMLE, containing four objects: (1) call, the call to biomarkertmle; (2) topTable, an empty slot meant to hold the output of limma::topTable, after a later call to modtest_ic; (3) limmaOut, an empty slot meant to hold the output of limma::lmFit, after a later call to modtest_ic; and (4) tmleOut, a data frame containing the point estimates of the associations of each biomarker with the exposure of interest based on the influence curve representation of the Average Treatment Effect parameter.

The output of biomarkertmle can be directly fed to modtest_ic, a wrapper around limma::lmFit and limma::topTable that outputs a biotmle object with the slots described above completely filled in. The modtest_ic function requires as input a biotmle object containing a data frame in the tmleOut field as well as a design matrix indicating the groupwise difference to be tested. The design matrix should contain an intercept term and a term for the exposure of interest (with discretized exposure levels). Based on the relevant statistical theory, it is not appropriate to include any further terms in the design matrix (n.b., this differs from standard calls to limma::lmFit).

varInt_index <- which(names(colData(illuminaData)) %in% "benzene")
design <- as.numeric(designVar == max(designVar))

limmaTMLEout <- modtest_ic(biotmle = biomarkerTMLEout, design = design)

After invoking modtest_ic, the resultant bioTMLE object will contain all information relevant to the analytic procedure for identifying biomarkers: that is, it will contain the origin call to biomarkertmle, the result of running limma::topTable, the result of running limma::lmFit, and the result of running biomarkertmle. The statistical results of this procedure can be extracted from the topTable object in the bioTMLE object produced by modtest_ic.

## Visualization of Results

This package provides several plotting methods that can be used to visualize the results of the TMLE-based biomarker discovery process.

The plot method for a bioTMLE object will produce a histogram of the adjusted p-values of each biomarker (based on the Benjamini-Hochberg procedure for controlling the False Discovery Rate) as generated by limma::topTable:

plot(x = limmaTMLEout, type = "pvals_adj")

Setting the argument type = "pvals_raw" will instead produce a histogram of the raw p-values (these are less informative and should, in general, not be used for inferential purposes, as the computation producing these p-values ignores the multiple testing nature of the biomarker discovery problem):

plot(x = limmaTMLEout, type = "pvals_raw")

Heatmaps are useful graphics for visualizing the relationship between measures on genomic objects and covariates of interest. The heatmap_ic function provides this graphic for bioTMLE objects, allowing for the relationship between the exposure variable and some number of “top” biomarkers (as determined by the call to modtest_ic) to be visualized. In general, the heatmap for bioTMLE objects expresses how the contributions of each biomarker to the Average Treatment Effect (ATE) vary across differences in the exposure variable (that is, there is a causal interpretation to the findings). The plot produced is a ggplot2 object and can be modified in place if stored properly. For our analysis:

heatmap_ic(x = limmaTMLEout, design = design, FDRcutoff = 0.05, top = 25)

The volcano plot is standard graphical tools for examining how changes in expression relate to the raw p-value. The utility of such plots lies in their providing a convenient way to identify and systematically ignore those genomic objects that have extremely low p-values due to extremely low variance between observations. The volcano_ic function provides much of the same interpretation, except that the fold change values displayed in the x-axis refer to changes in the contributions of each biomarker to the Average Treatment Effect (in standard practice, for microarray technology, these would be fold changes in gene expression). The plot produced is a ggplot2 object and can be modified in place if stored properly. For our analysis:

volcano_ic(biotmle = limmaTMLEout)

## Session Information

## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.5-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.5-bioc/R/lib/libRlapack.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets
## [8] methods   base
##
## other attached packages:
##  [1] bindrcpp_0.2               SummarizedExperiment_1.6.3
##  [3] DelayedArray_0.2.7         matrixStats_0.52.2
##  [5] Biobase_2.36.2             GenomicRanges_1.28.5
##  [7] GenomeInfoDb_1.12.2        IRanges_2.10.3
##  [9] S4Vectors_0.14.4           BiocGenerics_0.22.0
## [11] biotmleData_1.1.0          biotmle_1.0.4
## [13] dplyr_0.7.2
##
## loaded via a namespace (and not attached):
##  [1] superheat_0.1.0         lattice_0.20-35
##  [3] colorspace_1.3-2        htmltools_0.3.6
##  [5] yaml_2.1.14             rlang_0.1.2
##  [7] glue_1.1.1              GenomeInfoDbData_0.99.0
##  [9] foreach_1.4.3           bindr_0.1
## [11] plyr_1.8.4              stringr_1.2.0
## [13] zlibbioc_1.22.0         munsell_0.4.3
## [15] gtable_0.2.0            tmle_1.2.0-5
## [17] codetools_0.2-15        evaluate_0.10.1
## [19] labeling_0.3            knitr_1.17
## [21] doParallel_1.0.10       Rcpp_0.12.12
## [23] backports_1.1.0         scales_0.5.0
## [25] limma_3.32.5            XVector_0.16.0
## [27] ggplot2_2.2.1           digest_0.6.12
## [29] stringi_1.1.5           grid_3.4.1
## [31] rprojroot_1.2           tools_3.4.1
## [33] bitops_1.0-6            magrittr_1.5
## [35] RCurl_1.95-4.8          lazyeval_0.2.0
## [37] tibble_1.3.4            wesanderson_0.3.2
## [39] ggdendro_0.1-20         pkgconfig_2.0.1
## [41] MASS_7.3-47             Matrix_1.2-11
## [43] SuperLearner_2.0-22     nnls_1.4
## [45] assertthat_0.2.0        rmarkdown_1.6
## [47] iterators_1.0.8         R6_2.2.2
## [49] compiler_3.4.1