Targeted Data-Adaptive Estimation and Inference for Differential Methylation Analysis

Nima Hejazi

2020-04-27

Abstract

We present a general algorithm for the nonparametric estimation of effects of DNA methylation at CpG sites scattered across the genome, complete with honest statistical inference for such estimates. This approach leverages variable importance measures, a class of parameters that arise in the study of causal inference. The parameters we present are defined in such a manner that they provide targeted estimates of the relative importance of CpG sites in the case of binary exposures/treatments assigned at the level of subjects. Such parameters come equipped with rich scientific interpretations, providing an avenue to move beyond linear models, applying modern developments in machine learning to estimating quantities of scientific interest in computational biology.

0.1 Introduction
0.2 Methodology
0.3 Parameters of Interest
0.4 Preliminaries: Setting up the Data
0.5 Differential Methylation Based on a Binary Treatment or Exposure
- 0.5.1 The Average Treatment Effect as Variable Importance Measure
- 0.5.2 The Risk Ratio as Variable Importance Measure
0.6 Data Analysis with methyvim
- 0.6.1 Visualization of Results
0.7 Session Information
References

0.1 Introduction

DNA methylation is a fundamental epigenetic process known to play an important role in the regulation of gene expression. DNA CpG methylation involves the addition of a methyl group (\(\text{CH}_3\)) to the fifth carbon of the cytosine ring structure to form 5-methylcytosine. Numerous biological and medical studies have implicated DNA CpG methylation as playing a role in disease and development (Robertson 2005). Perhaps unsurprisingly then, biotechnologies have been developed to study the molecular mechanisms of this epigenetic process. Modern assays, like the Illumina Infinium Methylation assay, allow for quantitative interrogation of DNA methylation of CpG sites scattered across the genome at single-nucleotide resolution; moreover, much effort has been invested, by the bioinformatics community, in the development of tools for properly removing technological effects that may contaminate biological signatures measured by such assays. Despite these advances in both biological and bioninformatical techniques, most statistical methods available for the analysis of data produced by such assays rely on over-simplified (often generalized linear) models.

Here, we present an alternative to such statistical analysis approaches, in the form of nonparametric estimation procedures inspired by machine learning and causal inference. Specifically, we provide a technique for obtaining estimates of nonparametric variable importance measures (VIM), parameters with rich scientific interpretations under the standard (untestable) assumptions used in statistical causal inference, defining a limited set of VIMs specifically with respect to the type of data commonly produced by DNA methylation assays. For VIMs defined in such a manner, targeted minimum loss-based estimates may be readily computed based on the data made available by DNA methylation assays. Our contribution, methyvim is an R package that provides facilities for performing differential methylation analyses within exactly this scope.

As the substantive contribution of our work is an estimation procedure, we focus on data produced by 450k and 850k (EPIC) arrays made by Illumina and assume that data has been subjected to proper quality control and normalizaton procedures, as outlined by others in the computational biology community (Fortin et al. 2014, Dedeurwaerder et al. (2013)). For a general discussion of the framework of targeted minimum loss-based estimation and the role this approach plays in statistical causal inference, the interested reader is invited to consult van der Laan and Rose (2011) and van der Laan and Rose (2018). For a more general introduction to statistical causal inference, Pearl (2000) serves well.

0.2 Methodology

The core functionality of this package is made available via the eponymous methyvim function, which implements a statistical algorithm designed to compute targeted estimates of VIMs, defined in such a way that the VIMs represent parameters of scientific interest in computational biology experiments; moreover, these VIMs are defined such that they may be estimated in a manner that is very nearly assumption-free, that is, within a fully nonparametric statistical model. The statistical algorithm consists in several major steps:

Pre-screening of genomic sites is used to isolate a subset of sites for which there is cursory evidence of differential methylation. For the sake of computational feasibility, targeted minimum loss-based estimates of VIMs are computed only for this subset of sites. Currently, the available screening approach adapts core routines from the limma R package. Future releases will support functionality from other packages (e.g., randomForest, tmle.npvi). Following the style of the function for performing screening via limma, users may write their own screening functions and are invited to contribute such functions to the core software package by opening pull requests at the GitHub repository.
Nonparametric estimates of VIMs, for the specified target parameter, are computed at each of the CpG sites passing the screening step. The VIMs are defined in such a way that the estimated effects is of an exposure/treatment on the methylation status of a target CpG site, controlling for the observed methylation status of the neighbors of that site. Currently, routines are adapted from the tmle R package. Future releases will support doubly-robust estimates of these VIMs (via the drtmle package) and add parameters for continuous treatments/exposures (via the tmle.npvi package).
Since pre-screening is performed prior to estimating VIMs, we make use of a multiple testing correction uniquely suited to such settings. Due to the multiple testing nature of the estimation problem, a variant of the Benjamini & Hochberg procedure for controlling the False Discovery Rate (FDR) is applied (Benjamini and Hochberg 1995). Specifically, we apply the “modified marginal Benjamini & Hochberg step-up False Discovery Rate controlling procedure for multi-stage analyses” (FDR-MSA), which is guaranteed to control the FDR as if all sites were tested (i.e., without screening) (Tuglus and van der Laan 2009).

0.3 Parameters of Interest

For discrete-valued treatments or exposures:

The average treatment effect (ATE): The effect of a binary exposure or treatment on the observed methylation at a target CpG site is estimated, controlling for the observed methylation at all other CpG sites in the same neighborhood as the target site, based on an additive form. In particular, the parameter estimate represents the additive difference in methylation that would have been observed at the target site had all observations received the treatment versus the counterfactual under which none received the treatment.
The relative risk (RR): The effect of a binary exposure or treatment on the observed methylation at a target CpG site is estimated, controlling for the observed methylation at all other CpG sites in the same neighborhood as the target site, based on a geometric form. In particular, the parameter estimate represents the multiplicative difference in methylation that would have been observed at the target site had all observations received the treatment versus the counterfactual under which none received the treatment.

Support for continuous-valued treatments or exposures is planned but not yet available, though work is underway to incorporate into our methodology the following

A nonparametric variable importance measure (NPVI) (Chambaz, Neuvial, and van der Laan 2012): The effect of continuous-valued exposure or treatment (the observed methylation at a target CpG site) on an outcome of interest is estimated, controlling for the observed methylation at all other CpG sites in the same neighborhood as the target (treatment) site, based on a parameter that compares values of the treatment against a reference value taken to be the null. In particular, the implementation provided is designed to assess the effect of differential methylation at the target CpG site on a (typically) phenotype-level outcome of interest (e.g., survival), in effect providing an nonparametric evaluation of the impact of methylation at the target site on said outcome.

As previously noted, in all cases, an estimator of the target parameter is constructed via targeted minimum loss-based estimation.

Having now discussed the foundational principles of the estimation procedure employed and the statistical algorithm implemented, it is best to proceed by examining methyvim by example.

0.4 Preliminaries: Setting up the Data

First, we’ll load the methyvim package and the example data contained in the methyvimData package that accompanies it:

library(methyvim)

## methyvim v1.11.0: Targeted, Robust, and Model-free Differential Methylation Analysis

library(methyvimData)

Now, let’s load the data set and seed the RNG:

set.seed(479253)
data(grsExample)
grsExample

## class: GenomicRatioSet 
## dim: 400 210 
## metadata(0):
## assays(2): Beta M
## rownames(400): cg23578515 cg06747907 ... cg01715842 cg09895959
## rowData names(0):
## colnames(210): V2 V3 ... V397 V398
## colData names(2): exp outcome
## Annotation
##   array: IlluminaHumanMethylationEPIC
## Preprocessing
##   Method: NA
##   minfi version: NA
##   Manifest version: NA

var_int <- as.numeric(colData(grsExample)[, 1])
table(var_int)

## var_int
##   0   1 
## 105 105

The example data object is of class GenomicRatioSet, provided by the minfi package. The summary provided by the print method gives a wealth of information on the experiment that generated the data – since we are working with a simulated data set, we need not concern ourselves with much of this information.

We can create an object of class methytmle from any GenomicRatioSet object simply invoking the S4 class constructor:

mtmle <- .methytmle(grsExample)

Since the methytmle class builds upon the GenomicRatioSet class, it contains all of the slots of GenomicRatioSet. The new class introduced in the methyvim package includes several new slots:

call - the form of the original call to the methyvim function.
screen_ind - indices identifying CpG sites that pass the screening process.
clusters - non-unique IDs corresponding to the manner in wich sites are treated as neighbors. These are assigned by genomic distance (bp) and respect chromosome boundaries (produced via a call to bumphunter::clusterMaker).
var_int - the treatment/exposure status for each subject. Currently, these must be binary, due to the definition of the supported targeted parameters.
param - the name of the target parameter from which the estimated VIMs are defined.
vim - a table of statistical results obtained from estimating VIMs for each of the CpG sites that pass the screening procedure.
ic - the measured array values for each of the CpG sites passing the screening, transformed into influence curve space based on the chosen target parameter.

The interested analyst might consider consulting the documentation of the minfi package for an in-depth description of all of the other slots that appear in this class (Aryee et al. 2014). Having examined the core structure of the package, it is time now to discuss the analytic capabilities implemented.

0.5 Differential Methylation Based on a Binary Treatment or Exposure

0.5.1 The Average Treatment Effect as Variable Importance Measure

The average treatment effect (ATE) is a canonical parameter that arises in statistical causal inference, often denoted \(\psi_0 = \psi_0(1) - \psi_0(0)\), representing the difference in an outcome between the counterfactuals under which all subjects received the treatment/exposure and under which none received such treatment/exposure. Under additional (untestable) assumptions, this parameter has a richer interpretation as a mean counterfactual outcome, wherein the counterfactuals used in this definition define causal effects. When causal assumptions remain unfulfilled or untested, this parameter may still be estimated in the form of a nonparametric VIM.

Estimating such the VIM corresponding to such a parameter requires two separate regression steps: one for the treatment mechanism (propensity score) and one for the outcome regression. Technical details on the nature of these regressions are discussed in Hernan and Robins (2019), and details for estimating these regressions in the framework of targeted minimum loss-based estimation are discussed in van der Laan and Rose (2011).

0.5.1.1 Super Learning for nonparametric parameter estimation

Nonparametric and data-adaptive regressions (i.e., machine learning) may be used in the two regression steps outlined above. This is implemented using the super learner algorithm, which produces optimal combinations of such regression functions (a variant of stacked regressions) using cross-validation (van der Laan, Polley, and Hubbard 2007, Breiman (1996), Wolpert (1992)).

methyvim makes performing such estimation for CpG sites using a given VIM essentially trivial:

suppressMessages(
  methyvim_ate_sl <- methyvim(data_grs = grsExample, sites_comp = 25,
                              var_int = var_int, vim = "ate", type = "Mval",
                              filter = "limma", filter_cutoff = 0.10,
                              parallel = FALSE, tmle_type = "sl"
                             )
)

## Warning in set_parallel(parallel = parallel, future_param = future_param, :
## Sequential evaluation over many probes may take a long time.

As is clear from examining the object methyvim_ate_sl, the output resembles exactly that returned when examining objects of class GenomicRatioSet from the minfi R package. In particular, the returned methytmle object is merely a modified form (in particular, a subclass) of the input GenomicRatioSet object – thus, it contains all of the original slots, with all of the experimental data intact. Several extra pieces of information are contained within the output object as well_.

We can take a look at the results produced from the estimation procedure (stored in the vim slot of the methytmle object) simply by invoking the custom S4 accessor function vim (note that the show method of the resultant object also displays this same information, amongst other things):

vim(methyvim_ate_sl)

##                lwr_ci    est_ate     upr_ci   var_ate         pval n_neighbors
## cg22913481  -2.241046 -0.5738109  1.0934239 0.7235714 4.999479e-01           0
## cg15131207  -1.414328 -0.4633953  0.4875377 0.2353898 3.395171e-01           0
## cg10613282  -2.230081 -0.8691549  0.4917713 0.4821220 2.106598e-01           0
## cg15857610  -2.044781 -0.2695842  1.5056124 0.8203152 7.659713e-01           5
## cg24775884  -3.749987  0.5128954  4.7757779 4.7303641 8.135720e-01           5
## cg22954484  -2.570027  0.6063534  3.7827334 2.6263510 7.082903e-01           5
## cg23076894  -5.761423 -0.4517812  4.8578604 7.3386853 8.675507e-01           5
## cg07697276  -3.079394 -0.7915787  1.4962367 1.3624790 4.976732e-01           5
## cg01674119  -2.517292 -1.0973701  0.3225520 0.5248279 1.298325e-01           5
## cg02749733  -1.883485 -0.8608829  0.1617195 0.2722084 9.893564e-02           0
## cg12742764  -8.686076 -4.4381324 -0.1901884 4.6972689 4.058422e-02           8
## cg15396367 -10.948696 -5.4022936  0.1441086 8.0077512 5.625247e-02           8
## cg18233010  -8.446659 -4.3621161 -0.2775733 4.3428492 3.633168e-02           8
## cg01532849  -6.020016 -2.6830751  0.6538663 2.8985781 1.150391e-01           8
## cg24268444  -6.977900 -3.5516457 -0.1253913 3.0558151 4.218095e-02           8
## cg10560245  -4.537997 -2.0915700  0.3548566 1.5579455 9.379696e-02           8
## cg20185936  -6.528386 -3.3249208 -0.1214552 2.6713328 4.192027e-02           8
## cg06714180  -1.905104 -1.2702925 -0.6354814 0.1049003 8.779157e-05           8
## cg15817960  -0.998158 -0.1084321  0.7812938 0.2060632 8.112081e-01           8
## cg15611151  -1.674590 -0.6383086  0.3979726 0.2795394 2.273231e-01           7
## cg00567703  -1.643306 -0.7310898  0.1811269 0.2166127 1.162225e-01           7
## cg01609275  -1.675129 -0.6567175  0.3616940 0.2699818 2.062673e-01           7
## cg00058576  -3.259925  0.3561932  3.9723111 3.4038704 8.469096e-01           7
## cg26992600  -3.346484 -0.2549158  2.8366523 2.4879721 8.716113e-01           7
## cg12172984  -2.077834  1.5873883  5.2526103 3.4969421 3.959561e-01           7
##            n_neighbors_control max_cor_neighbors
## cg22913481                   0                NA
## cg15131207                   0                NA
## cg10613282                   0                NA
## cg15857610                   4        0.77207047
## cg24775884                   2        0.88975664
## cg22954484                   2        0.86439358
## cg23076894                   1        0.88975664
## cg07697276                   2        0.84238389
## cg01674119                   5        0.61874408
## cg02749733                   0                NA
## cg12742764                   2        0.90569752
## cg15396367                   2        0.90569752
## cg18233010                   2        0.88601655
## cg01532849                   2        0.89423768
## cg24268444                   2        0.89690199
## cg10560245                   2        0.86644061
## cg20185936                   2        0.87376499
## cg06714180                   8       -0.09301902
## cg15817960                   8        0.40419875
## cg15611151                   7        0.13851867
## cg00567703                   7        0.13851867
## cg01609275                   7        0.20795189
## cg00058576                   4        0.88568207
## cg26992600                   4        0.85605741
## cg12172984                   4        0.87356095

From the table displayed, we note that we have access to point estimates of the ATE (“est_ate”) as well as lower and upper confidence interval bounds for each estimate (“lwr_ci” and “upr_ci”, respectively). Additional statistical information we have access to include the variance (“var_ate”) of the estimate as well as the p-value (“pval”) associated with each point estimate (based on Wald-style testing procedures). Beyond these, key bioinformatical quantities, with respect to the algorithm outlined above, are also returned; these include the total number of neighbors of the target site (“n_neighbors”), the number of neighboring sites controlled for when estimating the effect of exposure on DNA methylation (“n_neighbors_control”), and, the maximum correlation between the target site and any given site in its full set of neighbors (“max_cor_neighbors”).

0.5.1.2 Generalized linear models for parameter estimation

In cases where nonparametric regressions may not be preferred (e.g., where time constraints are of concern), generalized linear models (GLMs) may be used to fit the two regression steps required for estimating VIMs for the ATE.

methyvim makes performing such estimation for CpG sites using a given VIM essentially trivial:

suppressMessages(
  methyvim_ate_glm <- methyvim(data_grs = grsExample, sites_comp = 25,
                               var_int = var_int, vim = "ate", type = "Mval",
                               filter = "limma", filter_cutoff = 0.10,
                               parallel = FALSE, tmle_type = "glm"
                              )
)

## Warning in set_parallel(parallel = parallel, future_param = future_param, :
## Sequential evaluation over many probes may take a long time.

Just as before, we can take a look at the results produced from the estimation procedure (stored in the vim slot of the methytmle object) simply by invoking the custom S4 accessor function vim (note that the show method of the resultant object also displays this same information, amongst other things):

vim(methyvim_ate_glm)

##                 lwr_ci     est_ate      upr_ci   var_ate         pval
## cg22913481  -2.2737329 -0.61095578  1.05182129 0.7197073 4.714236e-01
## cg15131207  -1.4052652 -0.46187188  0.48152148 0.2316720 3.372626e-01
## cg10613282  -2.2542647 -0.88585344  0.48255779 0.4874399 2.045039e-01
## cg15857610  -2.4325450 -0.48107089  1.47040317 0.9913190 6.289731e-01
## cg24775884  -3.3747760  1.07346950  5.52171503 5.1506894 6.362164e-01
## cg22954484  -2.0920242  1.08070264  4.25342943 2.6203132 5.043759e-01
## cg23076894  -5.7167079 -0.50400813  4.70869167 7.0731568 8.496935e-01
## cg07697276  -2.8678982 -0.52591073  1.81607675 1.4277659 6.598411e-01
## cg01674119  -2.4613602 -0.90544317  0.65047384 0.6301743 2.540392e-01
## cg02749733  -1.8923030 -0.87365602  0.14499097 0.2701066 9.275906e-02
## cg12742764  -8.8892360 -4.56431555 -0.23939514 4.8690485 3.859389e-02
## cg15396367 -10.0395853 -4.28710706  1.46537120 8.6138604 1.440937e-01
## cg18233010  -8.1404001 -3.88759842  0.36520322 4.7080179 7.318299e-02
## cg01532849  -6.2097167 -2.80064889  0.60841887 3.0252351 1.073548e-01
## cg24268444  -6.8848141 -3.45993270 -0.03505126 3.0533665 4.769687e-02
## cg10560245  -4.3925622 -1.98734363  0.41787495 1.5059029 1.053449e-01
## cg20185936  -6.2100752 -3.05038976  0.10929566 2.5988161 5.846404e-02
## cg06714180  -1.9161920 -1.27931242 -0.64243279 0.1055851 8.247573e-05
## cg15817960  -0.9821399 -0.08687332  0.80839327 0.2086376 8.491594e-01
## cg15611151  -1.6895107 -0.64225180  0.40500713 0.2854934 2.293605e-01
## cg00567703  -1.5560011 -0.65512297  0.24575516 0.2112613 1.540643e-01
## cg01609275  -1.5946467 -0.61958346  0.35547977 0.2474876 2.129697e-01
## cg00058576  -4.0047990 -0.17636854  3.65206191 3.8153060 9.280540e-01
## cg26992600  -4.6156645 -1.30503285  2.00559882 2.8530513 4.397466e-01
## cg12172984  -3.8585666  0.26942032  4.39740728 4.4357237 8.982100e-01
##            n_neighbors n_neighbors_control max_cor_neighbors
## cg22913481           0                   0                NA
## cg15131207           0                   0                NA
## cg10613282           0                   0                NA
## cg15857610           5                   4        0.77207047
## cg24775884           5                   2        0.88975664
## cg22954484           5                   2        0.86439358
## cg23076894           5                   1        0.88975664
## cg07697276           5                   2        0.84238389
## cg01674119           5                   5        0.61874408
## cg02749733           0                   0                NA
## cg12742764           8                   2        0.90569752
## cg15396367           8                   2        0.90569752
## cg18233010           8                   2        0.88601655
## cg01532849           8                   2        0.89423768
## cg24268444           8                   2        0.89690199
## cg10560245           8                   2        0.86644061
## cg20185936           8                   2        0.87376499
## cg06714180           8                   8       -0.09301902
## cg15817960           8                   8        0.40419875
## cg15611151           7                   7        0.13851867
## cg00567703           7                   7        0.13851867
## cg01609275           7                   7        0.20795189
## cg00058576           7                   4        0.88568207
## cg26992600           7                   4        0.85605741
## cg12172984           7                   4        0.87356095

Remark: Here, the estimates are obtained via GLMs, making each of the regression steps less robust than if nonparametric regressions were used. It is expected that these estimates differ from those obtained previously.

0.5.2 The Risk Ratio as Variable Importance Measure

The risk ratio (RR) is another popular parameter that arises in statistical causal inference, denoted \(\psi_0 = \frac{\psi_0(1)}{\psi_0(0)}\), representing the multiplicative contrast of an outcome between the counterfactuals under which all subjects received the treatment/exposure and under which none received such treatment/exposure. Under additional (untestable) assumptions, this parameter has a richer interpretation as a mean counterfactual outcome, wherein the counterfactuals used in this definition define causal effects. When causal assumptions remain unfulfilled or untested, this parameter may still be estimated in the form of a nonparametric VIM.

Just as before (in the case of the ATE), there are two regression steps required for estimating VIMs based on this parameter. We do so in a manner analogous to that described previously.

0.5.2.1 Super Learning for nonparametric parameter estimation

Nonparametric and data-adaptive regressions (i.e., machine learning) may be used in the two regression steps required for estimating a VIM based on the RR. This is implemented using the Super Learner algorithm.

methyvim makes performing such estimation for CpG sites using a given VIM essentially trivial:

methyvim_rr_sl <- methyvim(data_grs = grsExample, sites_comp = 25,
                            var_int = var_int, vim = "rr", type = "Mval",
                            filter = "limma", filter_cutoff = 0.10,
                            parallel = FALSE, tmle_type = "sl"
                           )

## Warning in set_parallel(parallel = parallel, future_param = future_param, :
## Sequential evaluation over many probes may take a long time.

vim(methyvim_rr_sl)

##                lwr_ci    est_logrr      upr_ci  var_logrr         pval
## cg22913481 -0.4092109 -0.157928211  0.09335450 0.01643664 0.2180100541
## cg15131207 -0.3866242 -0.132565211  0.12149379 0.01680185 0.3064465908
## cg10613282 -0.4609511 -0.209992283  0.04096657 0.01639430 0.1009949347
## cg15857610 -0.3912380 -0.122627974  0.14598203 0.01878158 0.3708967928
## cg24775884 -0.3001760 -0.012590429  0.27499517 0.02152891 0.9316187838
## cg22954484 -0.2096686  0.082632719  0.37493408 0.02224076 0.5795199112
## cg23076894 -0.3972518 -0.118400427  0.16045096 0.02024107 0.4052865289
## cg07697276 -0.4183878 -0.147105093  0.12417763 0.01915720 0.2878614482
## cg01674119 -0.5631423 -0.260549225  0.04204388 0.02383449 0.0914751160
## cg02749733 -0.3595319 -0.107507455  0.14451703 0.01653382 0.4031054760
## cg12742764 -0.7252065 -0.443645032 -0.16208358 0.02063641 0.0020130988
## cg15396367 -0.4854500 -0.195317164  0.09481569 0.02191198 0.1870119863
## cg18233010 -0.5963658 -0.310380117 -0.02439445 0.02129004 0.0334045380
## cg01532849 -0.4587578 -0.185399867  0.08795806 0.01945142 0.1837381017
## cg24268444 -0.6211676 -0.329263725 -0.03735987 0.02218031 0.0270459272
## cg10560245 -0.4215148 -0.156131854  0.10925108 0.01833301 0.2488610324
## cg20185936 -0.6060344 -0.325562935 -0.04509150 0.02047694 0.0228993370
## cg06714180 -0.7287084 -0.472424460 -0.21614047 0.01709743 0.0003026843
## cg15817960 -0.2907364 -0.005245429  0.28024557 0.02121645 0.9712729639
## cg15611151 -0.3362025 -0.106709841  0.12278280 0.01370962 0.3621037170
## cg00567703 -0.4298602 -0.162501535  0.10485717 0.01860701 0.2335379131
## cg01609275 -0.5015721 -0.240907053  0.01975800 0.01768697 0.0700735236
## cg00058576 -0.4121000 -0.130558355  0.15098334 0.02063352 0.3634007899
## cg26992600 -0.6375980 -0.328990324 -0.02038266 0.02479141 0.0366670157
## cg12172984 -0.2061964  0.055151896  0.31650019 0.01777981 0.6791556913
##            n_neighbors n_neighbors_control max_cor_neighbors
## cg22913481           0                   0                NA
## cg15131207           0                   0                NA
## cg10613282           0                   0                NA
## cg15857610           5                   4        0.77207047
## cg24775884           5                   2        0.88975664
## cg22954484           5                   2        0.86439358
## cg23076894           5                   1        0.88975664
## cg07697276           5                   2        0.84238389
## cg01674119           5                   5        0.61874408
## cg02749733           0                   0                NA
## cg12742764           8                   2        0.90569752
## cg15396367           8                   2        0.90569752
## cg18233010           8                   2        0.88601655
## cg01532849           8                   2        0.89423768
## cg24268444           8                   2        0.89690199
## cg10560245           8                   2        0.86644061
## cg20185936           8                   2        0.87376499
## cg06714180           8                   8       -0.09301902
## cg15817960           8                   8        0.40419875
## cg15611151           7                   7        0.13851867
## cg00567703           7                   7        0.13851867
## cg01609275           7                   7        0.20795189
## cg00058576           7                   4        0.88568207
## cg26992600           7                   4        0.85605741
## cg12172984           7                   4        0.87356095

0.5.2.2 Generalized linear models for parameter estimation

methyvim makes performing such estimation for CpG sites using a given VIM essentially trivial:

methyvim_rr_glm <- methyvim(data_grs = grsExample, sites_comp = 25,
                            var_int = var_int, vim = "rr", type = "Mval",
                            filter = "limma", filter_cutoff = 0.10,
                            parallel = FALSE, tmle_type = "glm"
                           )

## Warning in set_parallel(parallel = parallel, future_param = future_param, :
## Sequential evaluation over many probes may take a long time.

vim(methyvim_rr_glm)

##                lwr_ci    est_logrr      upr_ci  var_logrr         pval
## cg22913481 -0.4069465 -0.155520490  0.09590554 0.01645540 0.2253726279
## cg15131207 -0.3873508 -0.133384529  0.12058178 0.01678959 0.3032903444
## cg10613282 -0.4643026 -0.211461376  0.04137980 0.01664116 0.1011658212
## cg15857610 -0.4228660 -0.155410347  0.11204529 0.01862050 0.2547463475
## cg24775884 -0.3033120 -0.016800891  0.26971026 0.02136835 0.9084977837
## cg22954484 -0.2082032  0.086205045  0.38061331 0.02256253 0.5660332076
## cg23076894 -0.4000291 -0.121096780  0.15783558 0.02025283 0.3948126687
## cg07697276 -0.4239004 -0.152774774  0.11835081 0.01913502 0.2694077189
## cg01674119 -0.5602798 -0.261862697  0.03655444 0.02318117 0.0854486358
## cg02749733 -0.3663749 -0.114789703  0.13679545 0.01647623 0.3711715787
## cg12742764 -0.7247314 -0.442113551 -0.15949570 0.02079156 0.0021685034
## cg15396367 -0.5015719 -0.211841735  0.07788841 0.02185120 0.1518316990
## cg18233010 -0.6001598 -0.314071570 -0.02798334 0.02130531 0.0314198235
## cg01532849 -0.4735781 -0.194981413  0.08361524 0.02020411 0.1701428235
## cg24268444 -0.6319682 -0.339784180 -0.04760017 0.02222290 0.0226488361
## cg10560245 -0.4216809 -0.155147888  0.11138516 0.01849226 0.2539079346
## cg20185936 -0.6129249 -0.329697504 -0.04647011 0.02088134 0.0225139071
## cg06714180 -0.7220646 -0.464710184 -0.20735573 0.01724055 0.0004013234
## cg15817960 -0.2824709  0.002817598  0.28810607 0.02118636 0.9845558569
## cg15611151 -0.3524153 -0.119936485  0.11254236 0.01406872 0.3119353187
## cg00567703 -0.4340690 -0.167252518  0.09956395 0.01853161 0.2192158746
## cg01609275 -0.5099162 -0.246428433  0.01705934 0.01807211 0.0667875358
## cg00058576 -0.3969395 -0.111432402  0.17407468 0.02121884 0.4442828298
## cg26992600 -0.6417775 -0.332846224 -0.02391490 0.02484344 0.0347098769
## cg12172984 -0.2504724  0.014082907  0.27863817 0.01821884 0.9169031295
##            n_neighbors n_neighbors_control max_cor_neighbors
## cg22913481           0                   0                NA
## cg15131207           0                   0                NA
## cg10613282           0                   0                NA
## cg15857610           5                   4        0.77207047
## cg24775884           5                   2        0.88975664
## cg22954484           5                   2        0.86439358
## cg23076894           5                   1        0.88975664
## cg07697276           5                   2        0.84238389
## cg01674119           5                   5        0.61874408
## cg02749733           0                   0                NA
## cg12742764           8                   2        0.90569752
## cg15396367           8                   2        0.90569752
## cg18233010           8                   2        0.88601655
## cg01532849           8                   2        0.89423768
## cg24268444           8                   2        0.89690199
## cg10560245           8                   2        0.86644061
## cg20185936           8                   2        0.87376499
## cg06714180           8                   8       -0.09301902
## cg15817960           8                   8        0.40419875
## cg15611151           7                   7        0.13851867
## cg00567703           7                   7        0.13851867
## cg01609275           7                   7        0.20795189
## cg00058576           7                   4        0.88568207
## cg26992600           7                   4        0.85605741
## cg12172984           7                   4        0.87356095

0.6 Data Analysis with `methyvim`

In order to explore practical applications of the methyvim package, as well as the full set of utilities it provides, our toy example (of just \(10\) CpG sites) is unfortunately insufficient. To proceed, we will use a publicly available example data set produced by the Illumina 450K array, from the minfiData R package. Now, let’s load the package and data set, and take a look

suppressMessages(library(minfiData))
data(MsetEx)
mset <- mapToGenome(MsetEx)
grs <- ratioConvert(mset)
grs

## class: GenomicRatioSet 
## dim: 485512 6 
## metadata(0):
## assays(2): Beta CN
## rownames(485512): cg13869341 cg14008030 ... cg08265308 cg14273923
## rowData names(0):
## colnames(6): 5723646052_R02C02 5723646052_R04C01 ... 5723646053_R05C02
##   5723646053_R06C02
## colData names(13): Sample_Name Sample_Well ... Basename filenames
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19
## Preprocessing
##   Method: Raw (no normalization or bg correction)
##   minfi version: 1.21.2
##   Manifest version: 0.4.0

After loading the data, which comes in the form of a raw MethylSet object, we perform some further processing by mapping to the genome (with mapToGenome) and converting the values from the methylated and unmethylated channels to Beta-values (via ratioConvert). These two steps together produce an object of class GenomicRatioSet, like what we had worked with previously.

For this example analysis, we’ll treat the condition of the patients as the exposure/treatment variable of interest. The methyvim function requires that this variable either be numeric or easily coercible to numeric. To facilitate this, we’ll simply convert the covariate (currently a character):

var_int <- (as.numeric(as.factor(colData(grs)$status)) - 1)
table(var_int)

## var_int
## 0 1 
## 3 3

n.b., the re-coding process results in “normal” patients being assigned a value of 1 and cancer patients a 0.

Now, we are ready to analyze the effects of cancer status on DNA methylation using this data set. To do this with a targeted minimum loss-based estimate of the Average Treatment Effect, we may proceed as follows:

suppressMessages(
  methyvim_cancer_ate <- methyvim(data_grs = grs, var_int = var_int,
                                  vim = "ate", type = "Beta", filter = "limma",
                                  filter_cutoff = 0.30, obs_per_covar = 2,
                                  parallel = FALSE, sites_comp = 50,
                                  tmle_args = list(cv_folds = 2),
                                  tmle_type = "sl"
                                 )
)

Note that we set the obs_per_covar argument to a relatively low value (2, where the recommended default is 20) for the purposes of this example as the sample size is only 10. We do this only to exemplify the estimation procedure and it is important to point out that such low values for obs_per_covar will compromise the quality of inference obtained because this setting directly affects the definition of the target parameter.

Further, note that here we apply the glm flavor of the tmle_type argument, which produces faster results by fitting models for the propensity score and outcome regressions using a limited number of parametric models. By contrast, the sl (for “Super Learning”) flavor fits these two regressions using highly nonparametric and data-adaptive procedures (i.e., via machine learning).

Let’s examine the table of results by examining the methytmle object that was produced by our call to the methyvim wrapper function:

methyvim_cancer_ate

Finally, we may compute FDR-corrected p-values, by applying a modified procedure for controlling the False Discovery Rate for multi-stage analyses (FDR-MSA) (Tuglus and van der Laan 2009). We do this by simply applying the fdr_msa function:

fdr_p <- fdr_msa(pvals = vim(methyvim_cancer_ate)$pval,
                 total_obs = nrow(methyvim_cancer_ate))

Having explored the results of our analysis numerically, we now proceed to use the visualization tools provided with the methyvim R package to further enhance our understanding of the results.

0.6.1 Visualization of Results

While making allowance for users to explore the full set of results produced by the estimation procedure (by way of exposing these directly to the user), the methyvim package also provides three (3) visualization utilities that produce plots commonly used in examining the results of differential methylation analyses.

A simple call to plot produces side-by-side histograms of the raw p-values computed as part of the estimation process and the corrected p-values obtained from using the FDR-MSA procedure.

plot(methyvim_cancer_ate)

Remark: The plots displayed above may also be generated separately by explicitly setting the argument “type” to plot.methytmle. For a plot of the raw p-values, specify type = "raw_pvals", and for a plot of the FDR-corrected p-values, specify type = "fdr_pvals".

While histograms of the p-values may be generally useful in inspecting the results of the estimation procedure, a more common plot used in examining the results of differential methylation procedures is the volcano plot, which plots the parameter estimate along the x-axis and \(-\text{log}_{10}(\text{p-value})\) along the y-axis. We implement such a plot in the methyvolc function:

methyvolc(methyvim_cancer_ate)

The purpose of such a plot is to ensure that very low (possibly statistically significant) p-values do not arise from cases of low variance. This appears to be the case in the plot above (notice that most parameter estimates are near zero, even in cases where the raw p-values are quite low).

Yet another popular plot for visualizing effects in such settings is the heatmap, which plots estimates of the raw methylation effects (as measured by the assay) across subjects using a heat gradient. We implement this in the methyheat function:

methyheat(methyvim_cancer_ate)

Invoking methyheat in this manner produces a plot of the top sites (\(25\), by default) based on the raw p-value, using the raw methylation measures in the plot. This uses the exceptional superheat R package (Barter and Yu 2017).

0.7 Session Information

## R version 4.0.0 RC (2020-04-19 r78255)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] splines   stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] minfiData_0.33.0                                  
##  [2] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.0
##  [3] IlluminaHumanMethylation450kmanifest_0.4.0        
##  [4] gam_1.16.1                                        
##  [5] doFuture_0.9.0                                    
##  [6] future_1.17.0                                     
##  [7] globals_0.12.5                                    
##  [8] methyvimData_1.9.0                                
##  [9] methyvim_1.11.0                                   
## [10] minfi_1.35.0                                      
## [11] bumphunter_1.31.0                                 
## [12] locfit_1.5-9.4                                    
## [13] iterators_1.0.12                                  
## [14] foreach_1.5.0                                     
## [15] Biostrings_2.57.0                                 
## [16] XVector_0.29.0                                    
## [17] SummarizedExperiment_1.19.0                       
## [18] DelayedArray_0.15.0                               
## [19] matrixStats_0.56.0                                
## [20] Biobase_2.49.0                                    
## [21] GenomicRanges_1.41.0                              
## [22] GenomeInfoDb_1.25.0                               
## [23] IRanges_2.23.0                                    
## [24] S4Vectors_0.27.0                                  
## [25] BiocGenerics_0.35.0                               
## [26] tmle_1.4.0.1                                      
## [27] SuperLearner_2.0-26                               
## [28] nnls_1.4                                          
## [29] glmnet_3.0-2                                      
## [30] Matrix_1.2-18                                     
## [31] BiocStyle_2.17.0                                  
## 
## loaded via a namespace (and not attached):
##   [1] BiocFileCache_1.13.0      plyr_1.8.6               
##   [3] BiocParallel_1.23.0       listenv_0.8.0            
##   [5] ggplot2_3.3.0             digest_0.6.25            
##   [7] htmltools_0.4.0           earth_5.1.2              
##   [9] magrittr_1.5              memoise_1.1.0            
##  [11] cluster_2.1.0             limma_3.45.0             
##  [13] readr_1.3.1               annotate_1.67.0          
##  [15] askpass_1.1               siggenes_1.63.0          
##  [17] prettyunits_1.1.1         colorspace_1.4-1         
##  [19] blob_1.2.1                rappdirs_0.3.1           
##  [21] xfun_0.13                 dplyr_0.8.5              
##  [23] crayon_1.3.4              RCurl_1.98-1.2           
##  [25] lme4_1.1-23               genefilter_1.71.0        
##  [27] GEOquery_2.57.0           survival_3.1-12          
##  [29] glue_1.4.0                gtable_0.3.0             
##  [31] zlibbioc_1.35.0           superheat_0.1.0          
##  [33] Rhdf5lib_1.11.0           shape_1.4.4              
##  [35] HDF5Array_1.17.0          abind_1.4-5              
##  [37] scales_1.1.0              DBI_1.1.0                
##  [39] rngtools_1.5              Rcpp_1.0.4.6             
##  [41] plotrix_3.7-8             xtable_1.8-4             
##  [43] progress_1.2.2            bit_1.1-15.2             
##  [45] mclust_5.4.6              preprocessCore_1.51.0    
##  [47] Formula_1.2-3             httr_1.4.1               
##  [49] RColorBrewer_1.1-2        ellipsis_0.3.0           
##  [51] pkgconfig_2.0.3           reshape_0.8.8            
##  [53] XML_3.99-0.3              dbplyr_1.4.3             
##  [55] tidyselect_1.0.0          rlang_0.4.5              
##  [57] AnnotationDbi_1.51.0      TeachingDemos_2.12       
##  [59] munsell_0.5.0             tools_4.0.0              
##  [61] RSQLite_2.2.0             evaluate_0.14            
##  [63] stringr_1.4.0             arm_1.11-1               
##  [65] yaml_2.2.1                knitr_1.28               
##  [67] bit64_0.9-7               beanplot_1.2             
##  [69] scrime_1.3.5              purrr_0.3.4              
##  [71] nlme_3.1-147              doRNG_1.8.2              
##  [73] nor1mix_1.3-0             xml2_1.3.2               
##  [75] biomaRt_2.45.0            compiler_4.0.0           
##  [77] curl_4.3                  statmod_1.4.34           
##  [79] tibble_3.0.1              stringi_1.4.6            
##  [81] plotmo_3.5.7              GenomicFeatures_1.41.0   
##  [83] lattice_0.20-41           nloptr_1.2.2.1           
##  [85] ggsci_2.9                 multtest_2.45.0          
##  [87] vctrs_0.2.4               pillar_1.4.3             
##  [89] lifecycle_0.2.0           BiocManager_1.30.10      
##  [91] data.table_1.12.8         bitops_1.0-6             
##  [93] rtracklayer_1.49.0        R6_2.4.1                 
##  [95] bookdown_0.18             gridExtra_2.3            
##  [97] codetools_0.2-16          boot_1.3-25              
##  [99] MASS_7.3-51.6             gtools_3.8.2             
## [101] assertthat_0.2.1          rhdf5_2.33.0             
## [103] openssl_1.4.1             GenomicAlignments_1.25.0 
## [105] Rsamtools_2.5.0           GenomeInfoDbData_1.2.3   
## [107] hms_0.5.3                 quadprog_1.5-8           
## [109] grid_4.0.0                minqa_1.2.4              
## [111] coda_0.19-3               tidyr_1.0.2              
## [113] base64_2.0                rmarkdown_2.1            
## [115] DelayedMatrixStats_1.11.0 illuminaio_0.31.0

References

Aryee, Martin J, Andrew E Jaffe, Hector Corrada-Bravo, Christine Ladd-Acosta, Andrew P Feinberg, Kasper D Hansen, and Rafael A Irizarry. 2014. “Minfi: A Flexible and Comprehensive Bioconductor Package for the Analysis of Infinium DNA Methylation Microarrays.” Bioinformatics 30 (10). Oxford University Press (OUP):1363–9. https://doi.org/10.1093/bioinformatics/btu049.

Barter, Rebecca L, and Bin Yu. 2017. “Superheat: An R Package for Creating Beautiful and Extendable Heatmaps for Visualizing Complex Data.”

Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society. Series B (Methodological). JSTOR, 289–300.

Breiman, Leo. 1996. “Stacked Regressions.” Machine Learning 24 (1). Springer:49–64.

Chambaz, Antoine, Pierre Neuvial, and Mark J van der Laan. 2012. “Estimation of a Non-Parametric Variable Importance Measure of a Continuous Exposure.” Electronic Journal of Statistics 6. NIH Public Access:1059.

Dedeurwaerder, Sarah, Matthieu Defrance, Martin Bizet, Emilie Calonne, Gianluca Bontempi, and François Fuks. 2013. “A Comprehensive Overview of Infinium Humanmethylation450 Data Processing.” Briefings in Bioinformatics. Oxford Univ Press, bbt054.

Fortin, Jean-Philippe, Aurelie Labbe, Mathieu Lemire, Brent W Zanke, Thomas J Hudson, Elana J Fertig, Celia MT Greenwood, and Kasper D Hansen. 2014. “Functional Normalization of 450k Methylation Array Data Improves Replication in Large Cancer Studies.” bioRxiv. Cold Spring Harbor Labs Journals.

Hernan, Miguel A, and James M Robins. 2019. Causal Inference. Chapman & Hall / CRC Texts in Statistical Science. Taylor & Francis.

Pearl, Judea. 2000. Causality: Models, Reasoning, and Inference. Cambridge University Press.

Robertson, Keith D. 2005. “DNA Methylation and Human Disease.” Nature Reviews. Genetics 6 (8). Nature Publishing Group:597.

Tuglus, Catherine, and Mark J van der Laan. 2009. “Modified FDR Controlling Procedure for Multi-Stage Analyses.” Statistical Applications in Genetics and Molecular Biology 8 (1). Walter de Gruyter:1–15. https://doi.org/10.2202/1544-6115.1397.

van der Laan, Mark J, Eric C Polley, and Alan E Hubbard. 2007. “Super Learner.” Statistical Applications in Genetics and Molecular Biology 6 (1).

van der Laan, Mark J, and Sherri Rose. 2011. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Science & Business Media.

———. 2018. Targeted Learning in Data Science: Causal Inference for Complex Longitudinal Studies. Springer Science & Business Media.

Wolpert, David H. 1992. “Stacked Generalization.” Neural Networks 5 (2). Springer:241–59.

Targeted Data-Adaptive Estimation and Inference for Differential Methylation Analysis

2020-04-27

Abstract

Contents

0.1 Introduction

0.2 Methodology

0.3 Parameters of Interest

0.4 Preliminaries: Setting up the Data

0.5 Differential Methylation Based on a Binary Treatment or Exposure

0.5.1 The Average Treatment Effect as Variable Importance Measure

0.5.1.1 Super Learning for nonparametric parameter estimation

0.5.1.2 Generalized linear models for parameter estimation

0.5.2 The Risk Ratio as Variable Importance Measure

0.5.2.1 Super Learning for nonparametric parameter estimation

0.5.2.2 Generalized linear models for parameter estimation

0.6 Data Analysis with methyvim

0.6.1 Visualization of Results

0.7 Session Information

References

0.6 Data Analysis with `methyvim`