Center for Research in Environmental Epidemiology (CREAL), Barcelona, Spain
Bioinformatics Research Group in Epidemiolgy (BRGE)
(http://www.creal.cat/brge.htm)
In this vignette, a workflow to analyse methylation and expression data will be presented. It will be also shown how to consider SNP data, since it has been demonstrated that the existence of meQTL and eQTL could confound the results. The example data will be taken from MEALData package (latest version available at BRGE web), a data package obtained from GEO GSE53261. The dataset contains data from untransformed human fibroblasts. The script with the processing steps from the GEO to the final datasets can be found in the folder data_raw of MEALData package.
Let’s start by loading the required packages.
library(MEAL)
library(MultiDataSet)
library(MEALData)
library(GenomicRanges)
There are four objects in MEALData: betavals, pheno, eset and snps. Let’s take a look to each of these objects.
betavals and pheno belongs to the methylation experiment. betavals is a matrix of beta values corresponding to an Infinium HumanMethylation 450K BeadChip and pheno a data.frame with the phenotypes.
betavals[1:5, 1:5]
## GM02704 GM02706 GM01650 GM01653 GM02640
## cg00000029 0.2864929 0.2069152 0.5525335 0.5176491 0.2791274
## cg00000108 0.9293251 0.9429289 0.8961195 0.9325227 0.9286574
## cg00000109 0.7168331 0.7419590 0.7320755 0.5347986 0.8196660
## cg00000165 0.2935756 0.2475510 0.4311649 0.3937953 0.3349943
## cg00000236 0.9026300 0.9034649 0.8755394 0.8710790 0.8867925
dim(betavals)
## [1] 451448 61
summary(pheno)
## gender source inv
## female:31 Coriell cell line:26 NI/NI:44
## male :30 other :35 NI/I :17
betavals contains data from 451448 probes and 61 samples. pheno contains the same number of samples and three variables: gender, source and inv. gender and source come from the original data while inv correspond to the inversion genotypes for the inversion located in chromosome 17q21.31 (Li et al. 2014) computed using invClust (Cáceres and González 2015). There is approximately the same number of cells coming from male and female donnors. Source is composed of Coriell cell lines and cells with unknown source. A quick overview shows that there are no homozygous for the inverted phenotype due to its reduced frequency.
eset is an ExpressionSet
with the expression data corresponding to an Illumina HumanRef-8 v3.0 expression beadchip.
eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 21916 features, 61 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GM00016 GM00022 ... WG3466 (61 total)
## varLabels: gender inv
## varMetadata: labelDescription
## featureData
## featureNames: ILMN_1343291 ILMN_1651209 ... ILMN_2415949 (21916
## total)
## fvarLabels: chr start end genes
## fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: GPL6883
summary(pData(eset))
## gender inv
## female:31 NI/NI:44
## male :30 NI/I :17
There are 21916 expression probes and 61 samples. The phenotype data contains the variables gender and inv that are equivalent to that of pheno.
Finally, let’s take a look at snps data. snps is a list containing two objects, the genotypes and the mapping.
str(snps)
## List of 2
## $ genotypes: num [1:29909, 1:98] 0 2 1 2 0.0952 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:29909] "rs1000068-127_B_R_1501589836" "rs1000299-127_B_R_1501589728" "rs100043-127_T_F_1501589788" "rs1000465-127_T_R_1501589734" ...
## .. ..$ : chr [1:98] "WG2275" "WG2276" "WG2274" "WG2066" ...
## $ map :'data.frame': 29909 obs. of 5 variables:
## ..$ Chromosome: chr [1:29909] "17" "17" "17" "17" ...
## ..$ snp.name : Factor w/ 1199188 levels "200003","200006",..: 43665 43954 44102 44140 44142 44418 44539 44635 44675 45233 ...
## ..$ position : num [1:29909] 45653965 70660148 64903636 67315267 67315325 ...
## ..$ SNP : Factor w/ 12 levels "[A/C]","[A/G]",..: 7 7 2 2 7 7 7 7 7 2 ...
## ..$ chromosome: chr [1:29909] "chr17" "chr17" "chr17" "chr17" ...
Genotypes are enclosed in a matrix and can be retrieved by:
snps$genotypes[1:5, 1:5]
## WG2275 WG2276 WG2274 WG2066 WG2065
## rs1000068-127_B_R_1501589836 0.0000000 2 0 0 0
## rs1000299-127_B_R_1501589728 2.0000000 2 2 2 0
## rs100043-127_T_F_1501589788 1.0000000 2 0 2 0
## rs1000465-127_T_R_1501589734 2.0000000 0 0 0 2
## rs1000466-127_B_F_1501590038 0.0952381 1 1 0 1
SNP probes mapping can be retrieved by typing snps$map
. Here, the original object has been modified to fit the requirements of MEAL objects: SNPs annotation data.frame must contain a column called position
and a column called chromosome
with the name of the chromosome.
head(snps$map)
## Chromosome snp.name position SNP
## rs1000068-127_B_R_1501589836 17 rs1000068 45653965 [T/C]
## rs1000299-127_B_R_1501589728 17 rs1000299 70660148 [T/C]
## rs100043-127_T_F_1501589788 17 rs100043 64903636 [A/G]
## rs1000465-127_T_R_1501589734 17 rs1000465 67315267 [A/G]
## rs1000466-127_B_F_1501590038 17 rs1000466 67315325 [T/C]
## rs1000724-126_B_R_1502063163 17 rs1000724 14579993 [T/C]
## chromosome
## rs1000068-127_B_R_1501589836 chr17
## rs1000299-127_B_R_1501589728 chr17
## rs100043-127_T_F_1501589788 chr17
## rs1000465-127_T_R_1501589734 chr17
## rs1000466-127_B_F_1501590038 chr17
## rs1000724-126_B_R_1502063163 chr17
Let’s illustrate the use of this package by analyzing the effect of the inversion genotypes in methylation and expression. First, a genome wide analysis will be performed and then the region funcionalities will be introduced. It should be noticed that methylation and expression analyses will include hypothesis testing and visualitzation.
This demonstration will show the main functions needed to perform a methylation analysis. A more exhaustive description of these and other auxiliary functions can be found at MEAL vignette.
Let’s start the analysis. The first step is to create the MethylationSet
with the betas, the phenotype and the snps.
methSet <- prepareMethylationSet(matrix = betavals, phenotypes = pheno)
table(fData(methSet)$chr)
##
## chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr2
## 43479 22631 26764 22703 11338 13942 14145 20605 25956 5515 23896 32322
## chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrX chrY
## 9674 3976 8044 23469 19036 22727 33557 27898 19596 9211 10558 341
As it can be seen, more than 10000 probes are placed in sexual chromosomes. Given that they vary a lot depending on sex, some authors recommend to not include them in the analyses to avoid introducing confounding results. This step can be easily done with the following code:
annot <- fData(methSet)
autosomiccpgs <- rownames(annot)[!annot$chr %in% c("chrX", "chrY")]
methSet <- methSet[autosomiccpgs, ]
methSet
## MethylationSet (storageMode: lockedEnvironment)
## assayData: 440484 features, 61 samples
## element names: meth
## protocolData: none
## phenoData
## rowNames: GM02704 GM02706 ... WG1983 (61 total)
## varLabels: gender source inv
## varMetadata: labelDescription
## featureData
## featureNames: cg13869341 cg14008030 ... cg02366642 (440484
## total)
## fvarLabels: chromosome position ... DHS (17 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: IlluminaHumanMethylation450kanno.ilmn12.hg19
Once the object is filtered, we are now ready to run the analysis with DAPipeline
. It can work with an ExpressionSet
or a MethylationSet
. In both cases, for each feature, it computes the effect of the variables indicated in the parameter variable_names. DAPipeline
also allows the inclusion of adjustment variables (such as cell count) with the parameter covariable_names. In addition, if a MethylationSet
is used, algorithms to detect differentially methylated regions will be run (Bumphunter and DMRcate).
In our case, we will set variable_names to “inv”. If we would like to analyse more than one variable, a vector containing all the variables names could be used. Finally, probe_method is set to “ls” (least squares) to speed up the demonstration. If we would like to use robust linear regression, the option is “robust”.
methRes <- DAPipeline(set = methSet, variable_names = "inv", probe_method = "ls")
## Your contrast returned 3 individually significant probes; a small but real effect. Consider manually setting the value of pcutoff to return more DMRs, but be warned that doing this increases the risk of Type I errors.
## Fitting chr1...
## Fitting chr10...
## Fitting chr11...
## Fitting chr12...
## Fitting chr13...
## Fitting chr14...
## Fitting chr15...
## Fitting chr16...
## Fitting chr17...
## Fitting chr18...
## Fitting chr19...
## Fitting chr2...
## Fitting chr20...
## Fitting chr21...
## Fitting chr22...
## Fitting chr3...
## Fitting chr4...
## Fitting chr5...
## Fitting chr6...
## Fitting chr7...
## Fitting chr8...
## Fitting chr9...
## Demarcating regions...
## Done!
methRes
## class: AnalysisResults
## original class: MethylationSet
## features(50): cg22968622 cg16652462 ... cg19702194 cg18046575
## samples(61): GM02704 GM02706 ... WG1984 WG1983
## variables: inv
## model variables names: inv
## covariables: none
## Number of bumps: 20505
## Number of blocks: NULL
## Number of regions: 1
## Number of differentially methylated probes: 3
The analysis generates an AnalysisResults
object containing the results. The printing of the object shows that 3 probes have been detected as differentially methylated. There are two plots that gives us a quick overview of the results of the analysis. A Manhattan plot provides an overview of the distribution of the probes (figure 1). We can observe that the differentially methylated probes are clustered in chromosome 17. On the other hand, with the QQplot we can assess the validity of the model (figure 2). Almost all of the points are on the theoretical line and only the differentially methylated probes are separated, thus no further adjustement seems to be required.
plotEWAS(methRes)
plotQQ(methRes)
With this analysis, we see that the inversion genotypes can affect the methylation values of the probes inside the inversion region.
\newpageThe expression analysis can be performed in the same way than methylation. The main difference is that an ExpressionSet
will be used intead of a MethylationSet
. In order to assure consistent results, ExpressionSet
s passed to DAPipeline
needs to follow some constraints: its fData should contain columns chromosome, start and end. Let’s check our object:
eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 21916 features, 61 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GM00016 GM00022 ... WG3466 (61 total)
## varLabels: gender inv
## varMetadata: labelDescription
## featureData
## featureNames: ILMN_1343291 ILMN_1651209 ... ILMN_2415949 (21916
## total)
## fvarLabels: chr start end genes
## fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: GPL6883
fvarLabels(eset)
## [1] "chr" "start" "end" "genes"
We have this data in our object but the column containing the chromosomes is called chr instead of chromosome. In the following lines, we will fix it and we will run the analysis.
colnames(fData(eset))[colnames(fData(eset)) == "chr"] <- "chromosome"
expRes <- DAPipeline(eset, variable_names = "inv", probe_method = "ls")
expRes
## class: AnalysisResults
## original class: ExpressionSet
## features(50): ILMN_1815849 ILMN_2415329 ... ILMN_2257607
## ILMN_2162860
## samples(61): GM00016 GM00022 ... WG3465 WG3466
## variables: inv
## model variables names: inv
## covariables: none
## Number of bumps: NULL
## Number of blocks: NULL
## Number of regions: NULL
## Number of differentially expressed probes: 0
plotEWAS(expRes)
plotQQ(expRes)
There are no significant probes after multiple testing (figure 6) and the QQplot shows a great deflation (figure 7). This situation usually happens when there are hidden variables (such as batch) that we are not taking into account so we could try surrogate variables analysis (SVA). SVA tries to infer these hidden variables, which we can include in our model. DAPipeline
performs this process automatically by setting sva = TRUE
.
expResSVA <- DAPipeline(eset, variable_names = "inv", probe_method = "ls", sva = TRUE)
## Number of significant surrogate variables is: 12
## Iteration (out of 5 ):1 2 3 4 5
expResSVA
## class: AnalysisResults
## original class: ExpressionSet
## features(50): ILMN_1815849 ILMN_2087941 ... ILMN_2171789
## ILMN_2177732
## samples(61): GM00016 GM00022 ... WG3465 WG3466
## variables: inv
## model variables names: inv
## covariables: none
## Number of bumps: NULL
## Number of blocks: NULL
## Number of regions: NULL
## Number of differentially expressed probes: 0
plotEWAS(expResSVA)
plotQQ(expResSVA)
We can see that there are no probes with a significant expression change after bonferroni correction (figure 8). The QQ-plot still shows a bit of deflation but the points are closer to the theoretical values (figure 9). Consequently, SVA has improved the analysis a bit but not enough to detect differentially expressed probes.
plotVolcano(expResSVA)
head(probeResults(expResSVA)[[1]])
## [1] 7.439174 7.410460 7.657713 7.561391 7.666501 7.682265
In figure 10, a Volcano plot of the results is shown. Given that none of the points has a log fold change bigger than 1, none of the points is labeled.
It should be noticed that although an ExpressionSet
can be the input for DAPipeline
and an AnalysisResults
object is generated, not all functions will be available. The two functions that might not work properly are plotRegionR2
and plotRegion
.
Region analysis is also available for expression data. Now, the region analysis will be run directly on the ExpressionSet
so the SNPs effect will not be considered:
range <- GRanges(seqnames = Rle("chr17"),
ranges = IRanges(43000000, end = 45000000))
exprsRegion <- DARegionAnalysis(set = eset, range = range, variable_names = "inv", probe_method = "ls")
exprsRegion
## class: AnalysisRegionResults
## original class: ExpressionSet
## features(51): ILMN_1742677 ILMN_1673805 ... ILMN_1683950
## ILMN_1739450
## samples(61): GM00016 GM00022 ... WG3465 WG3466
## variables: inv
## model variables names: inv
## covariables: none
## Number of bumps: NULL
## Number of blocks: NULL
## Number of regions: NULL
## Number of differentially expressed probes: 0
## Relevant snps(0):
## Snps Variance: NA
## Range:
## Chr: chr17 start: 43000000 end: 45000000
## R2: 0.024
## RDA P-value: 0.235
## Region P-value: 0.0529
## Global R2: 0.013
As in the methylation analysis, the returned object is an AnalysisRegionResults
.
plotRDA(exprsRegion)
The RDA plot shows that NI/NI and NI/I samples are grouped but they are very close (figure 11). However, the R2 and the p-value discard the possible correlation between expression and the inversion.
rdahits <- topRDAhits(exprsRegion)
rdahits
## [1] feat RDA cor P.Value adj.P.Val
## <0 rows> (or 0-length row.names)
The topRDAhits
function cannot detect any probe correlated with the inversion phenotype, reinforcing the idea that expression and inversion haplotypes are not correlated.
In this last step, the correlation between expression and methylation will be studied. There are two functions in MEAL
that can compute the relationship between these two omics datasets. correlationMethExprs
looks for expression probes that are correlated to differentially methylated probes. On the other hand, multiCorrMethExprs
tests if, in a chromosomic region, the methylation pattern can explain the expression pattern.
Both functions requires a MultiDataSet
, so let’s create a new MultiDataSet
with expression and methylation data.
multi2 <- new("MultiDataSet")
multi2 <- add_genexp(multi2, eset)
## Warning in add_eset(object, gexpSet, dataset.type = "expression", GRanges
## = range, : No id column found in pData. The id will be equal to the
## sampleNames
multi2 <- add_methy(multi2, methSet)
## Warning in add_eset(object, methySet, dataset.type = "methylation",
## GRanges = range, : No id column found in pData. The id will be equal to the
## sampleNames
It should be noticed that add_genexp
share the same constraints for the ExpressionSet
, so only objects with fData containing chromosomes, start and end columns will be accepted. Another interesting feature is that MultiDataSet
adding functions ensure that sample names are the same for all sets. To do so, when a new set is added, samples of the new set and samples of the multiset are filtered to only contain the common samples.
There is a general procedure to integrate methylation and expression. The method starts with a list of differentially methylated CpGs (DMPs) obtained from a single methylation analysis (e.g. comparing cases vs. controls, exposed vs. non-exposed …). Then, DMPs are paired to nearby expression probes and the relationship between them is tested.
In MEAL, the function correlationMethExprs
applies this method as it was described in (Steenaard et al. 2015). DMPs and expression probes are paired by proximity. Expression probes are paired to a DMP if they are completely inside a range of \(\pm\) 250Kb from the DMP. This distance of 250Kb is set by default but it can be changed with the parameter flank (whose units are bases). The correlation between methylation and expression is done by a linear regression.
To account for technical (e.g. batch) or biological (e.g. sex, age) artifacts, a model including those variables (z) is fitted: \[ x_{ij} = \sum_{k = 1}^K{\beta_{ik} z_{kj}} + r_{ij}, i = 1, ..., P \] where \(x_{ij}\) is the methylation or expression level of probe i (P is the total number of probes) for individual j, \(\beta_{ik}\) is the effect of variable k at probe i, \(z_{kj}\) is the value of variable k (K is the total number of covariates) for indivudal j and \(r_{ij}\) is the residual value of probe i and individual j. The residuals of methylation and expression are used to assess the correlation.
These analyses take the most significant CpGs so we will retrieve the top 50 CpGs and we will look for correlated expression probes. Using add_methy
in a MultiDataSet
containing methylation, the methylation data is substituted, and the same applies for add_genexp
and expression:
topcpgs <- feats(methRes)
methExprs <- correlationMethExprs(multi2, sel_cpgs = topcpgs)
## Computing residuals
## Computing correlation Methylation-Expression
head(methExprs)
## cpg exprs Beta se P.Value adj.P.Val
## 1 cg02610723 ILMN_1806189 0.3894946 0.10564140 0.0005207823 0.1661296
## 2 cg19199266 ILMN_1718079 -0.2846189 0.09132653 0.0029078990 0.3092066
## 3 cg14569423 ILMN_2352881 0.5712482 0.18131356 0.0026353253 0.3092066
## 4 cg19199266 ILMN_1751400 0.1834295 0.06418645 0.0060130203 0.3250103
## 5 cg04021672 ILMN_1778347 1.3700407 0.48973511 0.0070839197 0.3250103
## 6 cg04021672 ILMN_2349444 1.3879223 0.47585499 0.0051122607 0.3250103
The results shows the CpG name, the expression probe, the change of the relationship and se, and the p-value and adjusted p-value (using B&H). Results are ordered by the adjusted p-value, so we can than in our data there are no correlated CpGs-expression probes.
This case example shows the main functionalities of MEAL package. It has been shown how to evaluate the effect of a phenotype on the methylation and on the expression at genome wide and at region level. Finally, the integration between methylation and expression is tested. First, we have looked if there are expression probes correlated to the differentially methylated probes. Then, we have checked if methylation and expression are correlated between them and with the inversion haplotypes in the region of the inversions. The results suggests that the inversion can modify the methylation pattern but there is no effect onk the expression profile.
Cáceres, Alejandro, and Juan R González. 2015. “Following the footprints of polymorphic inversions on SNP data: from detection to association tests.” Nucleic Acids Research, February, 1–11. doi:10.1093/nar/gkv073.
Li, Yun, Jason A Chen, Renee L Sears, Fuying Gao, Eric D Klein, Anna Karydas, Michael D Geschwind, et al. 2014. “An epigenetic signature in peripheral blood associated with the haplotype on 17q21.31, a risk factor for neurodegenerative tauopathy.” PLoS Genetics 10 (3): e1004211. doi:10.1371/journal.pgen.1004211.
Steenaard, Rebecca V, Symen Ligthart, Lisette Stolk, Marjolein J Peters, Joyce B van Meurs, Andre G Uitterlinden, Albert Hofman, Oscar H Franco, and Abbas Dehghan. 2015. “Tobacco smoking is associated with methylation of genes related to coronary artery disease.” Clinical Epigenetics 7 (1): 54. doi:10.1186/s13148-015-0088-y.