1 Introduction

In this vignette, a workflow to analyse methylation and expression data will be presented. It will be also shown how to consider SNP data, since it has been demonstrated that the existence of meQTL and eQTL could confound the results. The example data will be taken from MEALData package (latest version available at BRGE web), a data package obtained from GEO GSE53261. The dataset contains data from untransformed human fibroblasts. The script with the processing steps from the GEO to the final datasets can be found in the folder data_raw of MEALData package.

Let’s start by loading the required packages.

library(MEAL)
library(MultiDataSet)
library(MEALData)
library(GenomicRanges)

There are four objects in MEALData: betavals, pheno, eset and snps. Let’s take a look to each of these objects.

betavals and pheno belongs to the methylation experiment. betavals is a matrix of beta values corresponding to an Infinium HumanMethylation 450K BeadChip and pheno a data.frame with the phenotypes.

betavals[1:5, 1:5]

##              GM02704   GM02706   GM01650   GM01653   GM02640
## cg00000029 0.2864929 0.2069152 0.5525335 0.5176491 0.2791274
## cg00000108 0.9293251 0.9429289 0.8961195 0.9325227 0.9286574
## cg00000109 0.7168331 0.7419590 0.7320755 0.5347986 0.8196660
## cg00000165 0.2935756 0.2475510 0.4311649 0.3937953 0.3349943
## cg00000236 0.9026300 0.9034649 0.8755394 0.8710790 0.8867925

dim(betavals)

## [1] 451448     61

summary(pheno)

##     gender                 source      inv    
##  female:31   Coriell cell line:26   NI/NI:44  
##  male  :30   other            :35   NI/I :17

betavals contains data from 451448 probes and 61 samples. pheno contains the same number of samples and three variables: gender, source and inv. gender and source come from the original data while inv correspond to the inversion genotypes for the inversion located in chromosome 17q21.31 (Li et al. 2014) computed using invClust (Cáceres and González 2015). There is approximately the same number of cells coming from male and female donnors. Source is composed of Coriell cell lines and cells with unknown source. A quick overview shows that there are no homozygous for the inverted phenotype due to its reduced frequency.

eset is an ExpressionSet with the expression data corresponding to an Illumina HumanRef-8 v3.0 expression beadchip.

eset

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 21916 features, 61 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GM00016 GM00022 ... WG3466 (61 total)
##   varLabels: gender inv
##   varMetadata: labelDescription
## featureData
##   featureNames: ILMN_1343291 ILMN_1651209 ... ILMN_2415949 (21916
##     total)
##   fvarLabels: chr start end genes
##   fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: GPL6883

summary(pData(eset))

##     gender      inv    
##  female:31   NI/NI:44  
##  male  :30   NI/I :17

There are 21916 expression probes and 61 samples. The phenotype data contains the variables gender and inv that are equivalent to that of pheno.

Finally, let’s take a look at snps data. snps is a list containing two objects, the genotypes and the mapping.

str(snps)

## List of 2
##  $ genotypes: num [1:29909, 1:98] 0 2 1 2 0.0952 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:29909] "rs1000068-127_B_R_1501589836" "rs1000299-127_B_R_1501589728" "rs100043-127_T_F_1501589788" "rs1000465-127_T_R_1501589734" ...
##   .. ..$ : chr [1:98] "WG2275" "WG2276" "WG2274" "WG2066" ...
##  $ map      :'data.frame':   29909 obs. of  5 variables:
##   ..$ Chromosome: chr [1:29909] "17" "17" "17" "17" ...
##   ..$ snp.name  : Factor w/ 1199188 levels "200003","200006",..: 43665 43954 44102 44140 44142 44418 44539 44635 44675 45233 ...
##   ..$ position  : num [1:29909] 45653965 70660148 64903636 67315267 67315325 ...
##   ..$ SNP       : Factor w/ 12 levels "[A/C]","[A/G]",..: 7 7 2 2 7 7 7 7 7 2 ...
##   ..$ chromosome: chr [1:29909] "chr17" "chr17" "chr17" "chr17" ...

Genotypes are enclosed in a matrix and can be retrieved by:

snps$genotypes[1:5, 1:5]

##                                 WG2275 WG2276 WG2274 WG2066 WG2065
## rs1000068-127_B_R_1501589836 0.0000000      2      0      0      0
## rs1000299-127_B_R_1501589728 2.0000000      2      2      2      0
## rs100043-127_T_F_1501589788  1.0000000      2      0      2      0
## rs1000465-127_T_R_1501589734 2.0000000      0      0      0      2
## rs1000466-127_B_F_1501590038 0.0952381      1      1      0      1

SNP probes mapping can be retrieved by typing snps$map. Here, the original object has been modified to fit the requirements of MEAL objects: SNPs annotation data.frame must contain a column called position and a column called chromosome with the name of the chromosome.

head(snps$map)

##                              Chromosome  snp.name position   SNP
## rs1000068-127_B_R_1501589836         17 rs1000068 45653965 [T/C]
## rs1000299-127_B_R_1501589728         17 rs1000299 70660148 [T/C]
## rs100043-127_T_F_1501589788          17  rs100043 64903636 [A/G]
## rs1000465-127_T_R_1501589734         17 rs1000465 67315267 [A/G]
## rs1000466-127_B_F_1501590038         17 rs1000466 67315325 [T/C]
## rs1000724-126_B_R_1502063163         17 rs1000724 14579993 [T/C]
##                              chromosome
## rs1000068-127_B_R_1501589836      chr17
## rs1000299-127_B_R_1501589728      chr17
## rs100043-127_T_F_1501589788       chr17
## rs1000465-127_T_R_1501589734      chr17
## rs1000466-127_B_F_1501590038      chr17
## rs1000724-126_B_R_1502063163      chr17

Let’s illustrate the use of this package by analyzing the effect of the inversion genotypes in methylation and expression. First, a genome wide analysis will be performed and then the region funcionalities will be introduced. It should be noticed that methylation and expression analyses will include hypothesis testing and visualitzation.

2 Methylation Analysis

This demonstration will show the main functions needed to perform a methylation analysis. A more exhaustive description of these and other auxiliary functions can be found at MEAL vignette.

Let’s start the analysis. The first step is to create the MethylationSet with the betas, the phenotype and the snps.

methSet <- prepareMethylationSet(matrix = betavals, phenotypes = pheno)
table(fData(methSet)$chr)

## 
##  chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19  chr2 
## 43479 22631 26764 22703 11338 13942 14145 20605 25956  5515 23896 32322 
## chr20 chr21 chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chrX  chrY 
##  9674  3976  8044 23469 19036 22727 33557 27898 19596  9211 10558   341

As it can be seen, more than 10000 probes are placed in sexual chromosomes. Given that they vary a lot depending on sex, some authors recommend to not include them in the analyses to avoid introducing confounding results. This step can be easily done with the following code:

annot <- fData(methSet)
autosomiccpgs <- rownames(annot)[!annot$chr %in% c("chrX", "chrY")]
methSet <- methSet[autosomiccpgs, ]
methSet

## MethylationSet (storageMode: lockedEnvironment)
## assayData: 440484 features, 61 samples 
##   element names: meth 
## protocolData: none
## phenoData
##   rowNames: GM02704 GM02706 ... WG1983 (61 total)
##   varLabels: gender source inv
##   varMetadata: labelDescription
## featureData
##   featureNames: cg13869341 cg14008030 ... cg02366642 (440484
##     total)
##   fvarLabels: chromosome position ... DHS (17 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: IlluminaHumanMethylation450kanno.ilmn12.hg19

Once the object is filtered, we are now ready to run the analysis with DAPipeline. It can work with an ExpressionSet or a MethylationSet. In both cases, for each feature, it computes the effect of the variables indicated in the parameter variable_names. DAPipeline also allows the inclusion of adjustment variables (such as cell count) with the parameter covariable_names. In addition, if a MethylationSet is used, algorithms to detect differentially methylated regions will be run (Bumphunter and DMRcate).

In our case, we will set variable_names to “inv”. If we would like to analyse more than one variable, a vector containing all the variables names could be used. Finally, probe_method is set to “ls” (least squares) to speed up the demonstration. If we would like to use robust linear regression, the option is “robust”.

methRes <- DAPipeline(set = methSet, variable_names = "inv", probe_method = "ls")

## Your contrast returned 3 individually significant probes; a small but real effect. Consider manually setting the value of pcutoff to return more DMRs, but be warned that doing this increases the risk of Type I errors.

## Fitting chr1...

## Fitting chr10...

## Fitting chr11...

## Fitting chr12...

## Fitting chr13...

## Fitting chr14...

## Fitting chr15...

## Fitting chr16...

## Fitting chr17...

## Fitting chr18...

## Fitting chr19...

## Fitting chr2...

## Fitting chr20...

## Fitting chr21...

## Fitting chr22...

## Fitting chr3...

## Fitting chr4...

## Fitting chr5...

## Fitting chr6...

## Fitting chr7...

## Fitting chr8...

## Fitting chr9...

## Demarcating regions...

## Done!

methRes

## class: AnalysisResults 
## original class: MethylationSet 
## features(50): cg22968622 cg16652462 ... cg19702194 cg18046575
## samples(61): GM02704 GM02706 ... WG1984 WG1983
## variables:  inv 
## model variables names:  inv 
## covariables: none 
## Number of bumps: 20505 
## Number of blocks: NULL 
## Number of regions: 1 
## Number of differentially methylated probes: 3

The analysis generates an AnalysisResults object containing the results. The printing of the object shows that 3 probes have been detected as differentially methylated. There are two plots that gives us a quick overview of the results of the analysis. A Manhattan plot provides an overview of the distribution of the probes (figure 1). We can observe that the differentially methylated probes are clustered in chromosome 17. On the other hand, with the QQplot we can assess the validity of the model (figure 2). Almost all of the points are on the theoretical line and only the differentially methylated probes are separated, thus no further adjustement seems to be required.

plotEWAS(methRes)

Figure 1: Manhattan plot of methylation results. Some differentially methylated probes are detected in the 17q21.31 region.

plotQQ(methRes)

Figure 2: QQplot of methylation analysis. All the p-values are on the theoretical line but the three probes differentially methylated.

With this analysis, we see that the inversion genotypes can affect the methylation values of the probes inside the inversion region.

\newpage

3 Expression Analysis

The expression analysis can be performed in the same way than methylation. The main difference is that an ExpressionSet will be used intead of a MethylationSet. In order to assure consistent results, ExpressionSets passed to DAPipeline needs to follow some constraints: its fData should contain columns chromosome, start and end. Let’s check our object:

eset

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 21916 features, 61 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GM00016 GM00022 ... WG3466 (61 total)
##   varLabels: gender inv
##   varMetadata: labelDescription
## featureData
##   featureNames: ILMN_1343291 ILMN_1651209 ... ILMN_2415949 (21916
##     total)
##   fvarLabels: chr start end genes
##   fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: GPL6883

fvarLabels(eset)

## [1] "chr"   "start" "end"   "genes"

We have this data in our object but the column containing the chromosomes is called chr instead of chromosome. In the following lines, we will fix it and we will run the analysis.

colnames(fData(eset))[colnames(fData(eset)) == "chr"] <- "chromosome"
expRes <- DAPipeline(eset, variable_names = "inv", probe_method = "ls")
expRes

## class: AnalysisResults 
## original class: ExpressionSet 
## features(50): ILMN_1815849 ILMN_2415329 ... ILMN_2257607
##   ILMN_2162860
## samples(61): GM00016 GM00022 ... WG3465 WG3466
## variables:  inv 
## model variables names:  inv 
## covariables: none 
## Number of bumps: NULL 
## Number of blocks: NULL 
## Number of regions: NULL 
## Number of differentially expressed probes: 0

plotEWAS(expRes)

Figure 3: Manhattan plot of expression results. No probe is differentially expressed after multiple testing correction.

plotQQ(expRes)

Figure 4: QQplot of expression results. Most of the p-values are below the theoretical line so there is a great deflation.

There are no significant probes after multiple testing (figure 6) and the QQplot shows a great deflation (figure 7). This situation usually happens when there are hidden variables (such as batch) that we are not taking into account so we could try surrogate variables analysis (SVA). SVA tries to infer these hidden variables, which we can include in our model. DAPipeline performs this process automatically by setting sva = TRUE.

expResSVA <- DAPipeline(eset, variable_names = "inv", probe_method = "ls", sva = TRUE)

## Number of significant surrogate variables is:  12 
## Iteration (out of 5 ):1  2  3  4  5

expResSVA

## class: AnalysisResults 
## original class: ExpressionSet 
## features(50): ILMN_1815849 ILMN_2087941 ... ILMN_2171789
##   ILMN_2177732
## samples(61): GM00016 GM00022 ... WG3465 WG3466
## variables:  inv 
## model variables names:  inv 
## covariables: none 
## Number of bumps: NULL 
## Number of blocks: NULL 
## Number of regions: NULL 
## Number of differentially expressed probes: 0

plotEWAS(expResSVA)

Figure 5: Manhattan plot of expression results using SVA. Again, no probe is differentially expressed after multiple testing correction.

plotQQ(expResSVA)

Figure 6: QQ plot of expression results using SVA.

We can see that there are no probes with a significant expression change after bonferroni correction (figure 8). The QQ-plot still shows a bit of deflation but the points are closer to the theoretical values (figure 9). Consequently, SVA has improved the analysis a bit but not enough to detect differentially expressed probes.

plotVolcano(expResSVA)

Figure 7: Volcano of expression results using SVA. All the probes are inside the non significant region.

head(probeResults(expResSVA)[[1]])

## [1] 7.439174 7.410460 7.657713 7.561391 7.666501 7.682265

In figure 10, a Volcano plot of the results is shown. Given that none of the points has a log fold change bigger than 1, none of the points is labeled.

It should be noticed that although an ExpressionSet can be the input for DAPipeline and an AnalysisResults object is generated, not all functions will be available. The two functions that might not work properly are plotRegionR2 and plotRegion.

3.1 Region Analysis

Region analysis is also available for expression data. Now, the region analysis will be run directly on the ExpressionSet so the SNPs effect will not be considered:

range <- GRanges(seqnames = Rle("chr17"),
                 ranges = IRanges(43000000, end = 45000000))
exprsRegion <- DARegionAnalysis(set = eset, range = range, variable_names = "inv", probe_method = "ls")
exprsRegion

## class: AnalysisRegionResults 
## original class: ExpressionSet 
## features(51): ILMN_1742677 ILMN_1673805 ... ILMN_1683950
##   ILMN_1739450
## samples(61): GM00016 GM00022 ... WG3465 WG3466
## variables:  inv 
## model variables names:  inv 
## covariables: none 
## Number of bumps: NULL 
## Number of blocks: NULL 
## Number of regions: NULL 
## Number of differentially expressed probes: 0 
## Relevant snps(0):
## Snps Variance:  NA 
## Range:
##  Chr:  chr17     start:  43000000    end:  45000000 
## R2:  0.024 
## RDA P-value:  0.235 
## Region P-value:  0.0529 
## Global R2:  0.013

As in the methylation analysis, the returned object is an AnalysisRegionResults.

plotRDA(exprsRegion)

Figure 8: RDA plot for the range analysis for expression data. The different genotypes are very closed but separated.

The RDA plot shows that NI/NI and NI/I samples are grouped but they are very close (figure 11). However, the R² and the p-value discard the possible correlation between expression and the inversion.

rdahits <- topRDAhits(exprsRegion)
rdahits

## [1] feat      RDA       cor       P.Value   adj.P.Val
## <0 rows> (or 0-length row.names)

The topRDAhits function cannot detect any probe correlated with the inversion phenotype, reinforcing the idea that expression and inversion haplotypes are not correlated.

\newpage

4 Methylation and expression integration

In this last step, the correlation between expression and methylation will be studied. There are two functions in MEAL that can compute the relationship between these two omics datasets. correlationMethExprs looks for expression probes that are correlated to differentially methylated probes. On the other hand, multiCorrMethExprs tests if, in a chromosomic region, the methylation pattern can explain the expression pattern.

Both functions requires a MultiDataSet, so let’s create a new MultiDataSet with expression and methylation data.

multi2 <- new("MultiDataSet")
multi2 <- add_genexp(multi2, eset)

## Warning in add_eset(object, gexpSet, dataset.type = "expression", GRanges
## = range, : No id column found in pData. The id will be equal to the
## sampleNames

multi2 <- add_methy(multi2, methSet)

## Warning in add_eset(object, methySet, dataset.type = "methylation",
## GRanges = range, : No id column found in pData. The id will be equal to the
## sampleNames

It should be noticed that add_genexp share the same constraints for the ExpressionSet, so only objects with fData containing chromosomes, start and end columns will be accepted. Another interesting feature is that MultiDataSet adding functions ensure that sample names are the same for all sets. To do so, when a new set is added, samples of the new set and samples of the multiset are filtered to only contain the common samples.

4.1 Methylation expression correlation

There is a general procedure to integrate methylation and expression. The method starts with a list of differentially methylated CpGs (DMPs) obtained from a single methylation analysis (e.g. comparing cases vs. controls, exposed vs. non-exposed …). Then, DMPs are paired to nearby expression probes and the relationship between them is tested.

In MEAL, the function correlationMethExprs applies this method as it was described in (Steenaard et al. 2015). DMPs and expression probes are paired by proximity. Expression probes are paired to a DMP if they are completely inside a range of $\pm$ 250Kb from the DMP. This distance of 250Kb is set by default but it can be changed with the parameter flank (whose units are bases). The correlation between methylation and expression is done by a linear regression.

To account for technical (e.g. batch) or biological (e.g. sex, age) artifacts, a model including those variables (z) is fitted: \[ x_{ij} = \sum_{k = 1}^K{\beta_{ik} z_{kj}} + r_{ij}, i = 1, ..., P \] where $x_{ij}$ is the methylation or expression level of probe i (P is the total number of probes) for individual j, $\beta_{ik}$ is the effect of variable k at probe i, $z_{kj}$ is the value of variable k (K is the total number of covariates) for indivudal j and $r_{ij}$ is the residual value of probe i and individual j. The residuals of methylation and expression are used to assess the correlation.

These analyses take the most significant CpGs so we will retrieve the top 50 CpGs and we will look for correlated expression probes. Using add_methy in a MultiDataSet containing methylation, the methylation data is substituted, and the same applies for add_genexp and expression:

topcpgs <- feats(methRes)
methExprs <- correlationMethExprs(multi2, sel_cpgs = topcpgs)

## Computing residuals

## Computing correlation Methylation-Expression

head(methExprs)

##          cpg        exprs       Beta         se      P.Value adj.P.Val
## 1 cg02610723 ILMN_1806189  0.3894946 0.10564140 0.0005207823 0.1661296
## 2 cg19199266 ILMN_1718079 -0.2846189 0.09132653 0.0029078990 0.3092066
## 3 cg14569423 ILMN_2352881  0.5712482 0.18131356 0.0026353253 0.3092066
## 4 cg19199266 ILMN_1751400  0.1834295 0.06418645 0.0060130203 0.3250103
## 5 cg04021672 ILMN_1778347  1.3700407 0.48973511 0.0070839197 0.3250103
## 6 cg04021672 ILMN_2349444  1.3879223 0.47585499 0.0051122607 0.3250103

The results shows the CpG name, the expression probe, the change of the relationship and se, and the p-value and adjusted p-value (using B&H). Results are ordered by the adjusted p-value, so we can than in our data there are no correlated CpGs-expression probes.

MEAL case example

Carlos Ruiz, Carles Hernández, Juan R. González

Center for Research in Environmental Epidemiology (CREAL), Barcelona, Spain
Bioinformatics Research Group in Epidemiolgy (BRGE)
(http://www.creal.cat/brge.htm)

2016-11-16

Package version: MEAL 1.4.2

Contents

1 Introduction

2 Methylation Analysis

3 Expression Analysis

3.1 Region Analysis

4 Methylation and expression integration

4.1 Methylation expression correlation

5 Conclusions

References

MEAL case example

Carlos Ruiz, Carles Hernández, Juan R. González

Center for Research in Environmental Epidemiology (CREAL), Barcelona, Spain Bioinformatics Research Group in Epidemiolgy (BRGE) (http://www.creal.cat/brge.htm)

2016-11-16

Package version: MEAL 1.4.2

Contents

1 Introduction

2 Methylation Analysis

3 Expression Analysis

3.1 Region Analysis

4 Methylation and expression integration

4.1 Methylation expression correlation

5 Conclusions

References

Center for Research in Environmental Epidemiology (CREAL), Barcelona, Spain
Bioinformatics Research Group in Epidemiolgy (BRGE)
(http://www.creal.cat/brge.htm)