MEAL 1.12.0
Illumina Infinium HumanMethylation 450K BeadChip assay has become a standard tool to analyse methylation in human samples. Developed in 2011, it has already been used in projects such as The Cancer Genome Atlas (TCGA). Their 450.000 probes provide a good overall image of the methylation state of the genome, being one of the reasons of its success.
Given its complex design1 More information can be found at this minfi tutorial, many Bioconductor packages have been developed to assess normalization and pre-processing issues (e.g. minfi (Aryee et al. 2014) or lumi (Du, Kibbe, and Lin 2008)). In addition, these packages can detect differentially methylated probes (DMPs) and differentially methylated regions (DMRs). However, the interfaces are not very intuitive and several scripting steps are usually required.
MEAL aims to facilitate the analysis of Illumina Methylation 450K chips. We have included two methods to analyze DMPs (Differentially Methylated Probes), that test differences in means (limma) or differences in variance (DiffVar). We have included three DMRs (Differentially Methylated Regions) detection algorithms (bumphunter, blockFinder and DMRcate) and a new method to test differences in methylation in a target region (RDA). Finally, we have prepared plots for all these analyses as well as a wrapper to run all the analyses in the same dataset.
MEAL is meant to analyze methylation data already preprocessed. All our functions accept a GenomicRatioSet
as input, which is a class from minfi package designed to manage preprocessed methylation data. Users willing to preprocess their own data are encouraged to take a look to minfi’s vignette
In this vignette, we will use methylation data from minfiData package.
library(MEAL)
library(MultiDataSet)
library(minfiData)
library(minfi)
library(ggplot2)
data("MsetEx")
MsetEx
is a MethylationRatioSet
that contains measurements for 485512 CpGs and 6 samples, as well as some phenotypic variables such as age or sex. The first step will be to convert it to a GenomicRatioSet
. Then, we will add some extra features annotation. Finally, we will remove probes not measuring methylation, with SNPs or with NAs:
meth <- mapToGenome(ratioConvert(MsetEx))
rowData(meth) <- getAnnotation(meth)[, -c(1:3)]
## Remove probes measuring SNPs
meth <- dropMethylationLoci(meth)
## Remove probes with SNPs
meth <- dropLociWithSnps(meth)
## Remove probes with NAs
meth <- meth[!apply(getBeta(meth), 1, function(x) any(is.na(x))), ]
The function runPipeline
run all methods included in MEAL to the same dataset. We only need to pass to this function a GenomicRatioSet
and the name of our variable of interest. In our case, we will analyze the effect of cancer on methylation:
res <- runPipeline(set = meth, variable_names = "status")
runPipeline
includes several parameters to customize the analyses. The most important parameters are covariable_names
, betas
and sva
. covariable_names
is used to include covariates in our models. betas
allows the user choosing between running the analyis with beta (TRUE) or M-values (FALSE). If sva
is TRUE, Surrogate Variable Analysis is run and surrogate variables are included in the models. Finally, some parameters modify the behaviour of the methods included in the wrapper and they will be covered later on. More information about the parameters can be found in the documentation (by typing ?runPipeline).
We will run a new analysis including age as covariate:
resAdj <- runPipeline(set = meth, variable_names = "status",
covariable_names = "age")
resAdj
## Object of class 'ResultSet'
## . created with: runPipeline
## . sva: no
## . #results: 5 ( error: 0 )
## . featureData: 464876 probes x 35 variables
runPipeline
generates a ResultSet
object. ResultSet
is a class designed to encapsulate different results from the same dataset. It contains the results of the different methods, the feature data and other data required to get tables or plots. We can examine the analyses included in a ResultSet
with the function names
:
names(resAdj)
## [1] "DiffMean" "DiffVar" "bumphunter" "blockFinder" "dmrcate"
Both objects contains five analyses. DiffMean is an analysis of difference of means performed with limma while the others are named with the method name (DiffVar, bumphunter, blockFinder and dmrcate).
We can use the function getAssociation
to get a data.frame with the results, independent of the original method. This function has two main arguments: object
and rid
. object
is the ResultSet
with our data and rid
is the name or the index of the analysis we want to extract.
head(getAssociation(resAdj, "DiffMean"))
## logFC CI.L CI.R AveExpr t P.Value
## cg09383816 -0.5938196 -0.6310585 -0.5565807 0.4885486 -42.94192 7.147556e-07
## cg27651090 0.5433331 0.5073130 0.5793532 0.5411453 40.62051 9.092184e-07
## cg21938148 -0.6659934 -0.7114994 -0.6204875 0.4712352 -39.41178 1.036252e-06
## cg25104555 -0.5254995 -0.5614327 -0.4895664 0.3906114 -39.38227 1.039618e-06
## cg25937714 -0.5906359 -0.6311714 -0.5501004 0.4350042 -39.23815 1.056248e-06
## cg15732851 -0.5760397 -0.6165580 -0.5355214 0.3869865 -38.28468 1.174927e-06
## adj.P.Val B SE
## cg09383816 0.01808345 5.578902 0.03360340
## cg27651090 0.01808345 5.494733 0.01992266
## cg21938148 0.01808345 5.446147 0.03810597
## cg25104555 0.01808345 5.444916 0.02470939
## cg25937714 0.01808345 5.438874 0.01417408
## cg15732851 0.01808345 5.397542 0.02840324
head(getAssociation(resAdj, "DiffVar"))
## logFC CI.L CI.R AveExpr t P.Value
## cg02939019 -2.166225 -2.927983 -1.404467 1.203729 -6.824129 0.0003375482
## cg11847929 -2.327910 -3.149902 -1.505918 1.465679 -6.796091 0.0003457268
## cg22976979 -1.910704 -2.585431 -1.235977 1.173032 -6.795573 0.0003458801
## cg25385529 -2.223012 -3.010252 -1.435771 1.347720 -6.776338 0.0003516238
## cg22676401 -1.914225 -2.596416 -1.232034 1.233126 -6.733607 0.0003647752
## cg27402591 2.710765 1.744380 3.677151 1.706821 6.731349 0.0003654855
## adj.P.Val B SE
## cg02939019 0.1789425 0.2046014 0.16205491
## cg11847929 0.1789425 0.1896636 0.09848762
## cg22976979 0.1789425 0.1893866 0.15402291
## cg25385529 0.1789425 0.1790827 0.11296652
## cg22676401 0.1789425 0.1560305 0.11488471
## cg27402591 0.1789425 0.1548062 0.10400866
head(getAssociation(resAdj, "bumphunter"))
## chr start end value area cluster indexStart
## 86177 chr6 133561649 133562776 -0.4137316 15.30807 161402 173009
## 72799 chr10 118030848 118034357 -0.4244696 12.73409 27119 255180
## 85444 chr6 29520698 29521803 -0.3009453 11.73687 155567 151996
## 75309 chr13 78492568 78493590 -0.4345396 11.73257 53247 318486
## 81792 chr20 61050560 61051915 -0.4558366 11.39591 116792 439865
## 80368 chr2 63279693 63285365 -0.3692168 10.33807 102613 54191
## indexEnd L clusterL
## 86177 173045 37 41
## 72799 255209 30 30
## 85444 152034 39 40
## 75309 318512 27 49
## 81792 439889 25 26
## 80368 54218 28 41
head(getAssociation(resAdj, "blockFinder"))
## chr start end value area cluster indexStart
## 423 chr2 217468708 219051445 0.1725776 28.47531 107 34602
## 2194 chr15 25145254 26095690 0.1317540 21.34416 699 158363
## 1237 chr7 40098119 42760642 0.1744860 21.11281 378 87342
## 119 chr1 152053201 153177304 0.1724642 19.14353 35 13139
## 505 chr3 54169983 56448270 0.1642120 18.06332 125 41752
## 1860 chr11 131278641 132728088 0.1503282 16.83676 588 134392
## indexEnd L clusterL
## 423 34801 165 500
## 2194 158548 162 623
## 1237 87490 121 1575
## 119 13278 111 1545
## 505 41884 110 1630
## 1860 134513 112 1659
head(getAssociation(resAdj, "dmrcate"))
## coord no.cpgs minfdr Stouffer maxbetafc
## 3470 chr6:33130696-33148812 135 1.093719e-73 1.844138e-46 0.4429870
## 3630 chr6:133561368-133564578 51 0.000000e+00 2.905878e-31 -0.6600767
## 1574 chr16:51183363-51190201 47 2.603037e-189 2.562262e-30 -0.5757717
## 475 chr10:118030292-118034357 31 0.000000e+00 4.965712e-30 -0.6980408
## 3386 chr6:29520527-29521803 40 5.746100e-205 3.535914e-28 -0.5406367
## 1579 chr16:54964677-54973966 42 5.649098e-112 5.549090e-26 -0.6023573
## meanbetafc
## 3470 0.2055207
## 3630 -0.3596354
## 1574 -0.3019712
## 475 -0.4162689
## 3386 -0.2956325
## 1579 -0.2411854
DiffMean and DiffVar are internally stored as a MArrayLM, the class from limma results. This class allows testing different constrasts or evaluating different variables simultaneously. The function getProbeResults
helps the user performing these operations. It also has the arguments object
and rid
from getAssociation
. coef
is a numeric with the index of the coefficient from which we want the results. If we did not pass a custom model to runPipeline
, the first coefficient (coef = 1) is the intercept and the second coefficient (coef = 2) is the first variable that we included in variable_names
. We can evaluate different coefficients simultaneously by passing a vector to coef
. contrast
is a matrix with the contrasts that we want to evaluate. This option is useful when our variable of interest is a factor with several levels and we want to do all the different comparisons. Finally, the argument fNames
is used to select the variables from features annotation that will be added to the tables.
To exemplify the use of this function, we will evaluate our whole adjusted model, including age coefficient. We will also add some annotation of the CpGs:
head(getProbeResults(resAdj, rid = 1, coef = 2:3,
fNames = c("chromosome", "start")))
## statusnormal age AveExpr F P.Value
## cg09383816 -0.5938196 -0.0026657333 0.4885486 930.0807 1.937243e-06
## cg27651090 0.5433331 -0.0009235097 0.5411453 826.0491 2.504222e-06
## cg25104555 -0.5254995 -0.0031493548 0.3906114 787.5881 2.776367e-06
## cg21938148 -0.6659934 -0.0028869823 0.4712352 782.9876 2.811782e-06
## cg25937714 -0.5906359 0.0009181122 0.4350042 770.6247 2.910287e-06
## cg15732851 -0.5760397 -0.0050629750 0.3869865 757.4667 3.020764e-06
## adj.P.Val SE.statusnormal SE.age chromosome start
## cg09383816 0.04782436 0.0336034035 0.0016117775 chr8 67344556
## cg27651090 0.04782436 0.0199226569 0.0009555845 chr13 109270071
## cg25104555 0.04782436 0.0381059727 0.0018277420 chr10 16562998
## cg21938148 0.04782436 0.0247093863 0.0011851786 chr13 110958977
## cg25937714 0.04782436 0.0141740814 0.0006798557 chr3 44036492
## cg15732851 0.04782436 0.0284032366 0.0013623530 chr1 203598761
When more than one coefficient is evaluated, a estimate for each coefficient is returned and the t-statistic is substituted by a F-statistic. More information about linear models, including a detailed section of how to create a constrast matrix can be found in limma users’ guide.
Finally, we can obtain the results of CpGs mapped to some genes with the function getGeneVals
. This function accepts the same arguments than getProbeResults
but includes the arguments gene
and genecol
to pass the names of the genes to be selected and the column name of feature data containing gene names.
We will retrieve the difference in variance results for all CpGs mapped to ARMS2. We can see in the rowData of meth
that gene names are in the column ‘UCSC_RefGene_Name’:
getGeneVals(resAdj, "ARMS2", genecol = "UCSC_RefGene_Name", fNames = c("chromosome", "start"))
## logFC CI.L CI.R AveExpr t P.Value
## cg24296920 0.3258748 0.159072418 0.4926771 0.5876988 5.261053 0.004972695
## cg24884230 0.1925476 0.066951396 0.3181439 0.6064401 4.128436 0.012262156
## cg00676728 0.1106853 0.012546532 0.2088240 0.8206485 3.037200 0.034639784
## cg03623097 0.1255012 -0.008866243 0.2598686 0.5342317 2.515231 0.060880693
## cg18222240 0.1874701 -0.015810610 0.3907509 0.5432686 2.483476 0.063093712
## cg13265583 0.1911058 -0.059969571 0.4421812 0.6897652 2.049717 0.104248164
## adj.P.Val B SE chromosome start
## cg24296920 0.06757389 -1.904354 0.03225175 chr10 124214120
## cg24884230 0.10906357 -2.942678 0.06287739 chr10 124216658
## cg00676728 0.19474309 -4.128563 0.06953081 chr10 124213760
## cg03623097 0.26596849 -4.759333 0.01297918 chr10 124213466
## cg18222240 0.27137500 -4.798762 0.01276563 chr10 124213527
## cg13265583 0.35287264 -5.344456 0.04044509 chr10 124214151
## UCSC_RefGene_Name
## cg24296920 ARMS2
## cg24884230 ARMS2
## cg00676728 ARMS2
## cg03623097 ARMS2
## cg18222240 ARMS2
## cg13265583 ARMS2
We can easily get Manhattan plots, Volcano plots and QQ-plots for the probes results (DiffMean and DiffVar) using plot
method. Our extension of plot
method to ResultSet
includes the arguments rid
or coef
that were already present in getProbeResult
. In addition, the argument type
allows choosing between a Manhattan plot (“manhattan”), a Volcano plot (“volcano”) or a qq-plot (“qq”).
We can customize different aspects of a Manhattan plot. We can highlight the CpGs of a target region by passing a GenomicRanges
to the argument highlight
. Similarly, we can get a Manhattan plot with only the CpGs of our target region passing a GenomicRanges
to the argument subset
. It should be noticed that the GenomicRange
should have the chromosome as a number (1-24).
We will show these capabilities by highlighting and subsetting a region of ten Mb in chromosome X:
targetRange <- GRanges("23:13000000-23000000")
plot(resAdj, rid = "DiffMean", type = "manhattan", highlight = targetRange)