Methylation in the human genome is known to be associated with development and disease. The Illumina Infinium methylation arrays are by far the most common way to interrogate methylation across the human genome. This paper provides a Bioconductor workflow using multiple packages for the analysis of methylation array data. Specifically, we demonstrate the steps involved in a typical differential methylation analysis pipeline including: quality control, filtering, normalization, data exploration and statistical testing for probe-wise differential methylation. We further outline other analyses such as differential methylation of regions, differential variability analysis, estimating cell type composition and gene ontology testing. Finally, we provide some examples of how to visualise methylation array data.
R version: R version 4.3.0 RC (2023-04-13 r84269)
Bioconductor version: 3.17
Package: 1.24.0
DNA methylation, the addition of a methyl group to a CG dinucleotide of the DNA, is the most extensively studied epigenetic mark due to its role in both development and disease (Bird 2002; Laird 2003). Although DNA methylation can be measured in several ways, the epigenetics community has enthusiastically embraced the Illumina HumanMethylation450 (450k) array (Bibikova et al. 2011) as a cost-effective way to assay methylation across the human genome. More recently, Illumina has increased the genomic coverage of the platform to >850,000 sites with the release of their MethylationEPIC (850k) array. As methylation arrays are likely to remain popular for measuring methylation for the foreseeable future, it is necessary to provide robust workflows for methylation array analysis.
Measurement of DNA methylation by Infinium technology (Infinium I) was first employed by Illumina on the HumanMethylation27 (27k) array (Bibikova et al. 2009), which measured methylation at approximately 27,000 CpGs, primarily in gene promoters. Like bisulfite sequencing, the Infinium assay detects methylation status at single base resolution. However, due to its relatively limited coverage the array platform was not truly considered “genome-wide” until the arrival of the 450k array. The 450k array increased the genomic coverage of the platform to over 450,000 gene-centric sites by combining the original Infinium I assay with the novel Infinium II probes. Both assay types employ 50bp probes that query a [C/T] polymorphism created by bisulfite conversion of unmethylated cytosines in the genome, however, the Infinium I and II assays differ in the number of beads required to detect methylation at a single locus. Infinium I uses two bead types per CpG, one for each of the methylated and unmethylated states (Figure 1a). In contrast, the Infinium II design uses one bead type and the methylated state is determined at the single base extension step after hybridization (Figure 1b). The 850k array also uses a combination of the Infinium I and II assays but achieves additional coverage by increasing the size of each array; a 450k slide contains 12 arrays whilst the 850k has only 8.
Regardless of the Illumina array version, for each CpG, there are two measurements: a methylated intensity (denoted by \(M\)) and an unmethylated intensity (denoted by \(U\)). These intensity values can be used to determine the proportion of methylation at each CpG locus. Methylation levels are commonly reported as either beta values (\(\beta = M/(M + U)\)) or M-values (\(Mvalue = log2(M/U)\)). For practical purposes, a small offset, \(\alpha\), can be added to the denominator of the \(\beta\) value equation to avoid dividing by small values, which is the default behaviour of the getBeta
function in minfi. The default value for \(\alpha\) is 100. It may also be desirable to add a small offset to the numerator and denominator when calculating M-values to avoid dividing by zero in rare cases, however the default getM
function in minfi does not do this. Beta values and M-values are related through a logit transformation. Beta values are generally preferable for describing the level of methylation at a locus or for graphical presentation because percentage methylation is easily interpretable. However, due to their distributional properties, M-values are more appropriate for statistical testing (Du et al. 2010).
In this workflow, we will provide examples of the steps involved in analysing methylation array data using R (R Core Team 2014) and Bioconductor (Huber et al. 2015), including: quality control, filtering, normalization, data exploration and probe-wise differential methylation analysis. We will also cover other approaches such as differential methylation analysis of regions, differential variability analysis, gene ontology analysis and estimating cell type composition. Finally, we will provide some examples of useful ways to visualise methylation array data.
The data required for this workflow has been bundled with the R package that contains this workflow document. Alternatively, it can be obtained from figshare. If you choose to download it seperately, once the data has been downloaded, it needs to be extracted from the archive. This will create a folder called data
, which contains all the files necessary to execute the workflow.
Once the data has been downloaded and extracted, there should be a folder called data
that contains all the files necessary to execute the workflow.
# set up a path to the data directory
dataDirectory <- system.file("extdata", package = "methylationArrayAnalysis")
# list the files
list.files(dataDirectory, recursive = TRUE)
## [1] "48639-non-specific-probes-Illumina450k.csv"
## [2] "5975827018/5975827018_R06C02_Grn.idat"
## [3] "5975827018/5975827018_R06C02_Red.idat"
## [4] "6264509100/6264509100_R01C01_Grn.idat"
## [5] "6264509100/6264509100_R01C01_Red.idat"
## [6] "6264509100/6264509100_R01C02_Grn.idat"
## [7] "6264509100/6264509100_R01C02_Red.idat"
## [8] "6264509100/6264509100_R02C01_Grn.idat"
## [9] "6264509100/6264509100_R02C01_Red.idat"
## [10] "6264509100/6264509100_R02C02_Grn.idat"
## [11] "6264509100/6264509100_R02C02_Red.idat"
## [12] "6264509100/6264509100_R03C01_Grn.idat"
## [13] "6264509100/6264509100_R03C01_Red.idat"
## [14] "6264509100/6264509100_R03C02_Grn.idat"
## [15] "6264509100/6264509100_R03C02_Red.idat"
## [16] "6264509100/6264509100_R04C01_Grn.idat"
## [17] "6264509100/6264509100_R04C01_Red.idat"
## [18] "6264509100/6264509100_R04C02_Grn.idat"
## [19] "6264509100/6264509100_R04C02_Red.idat"
## [20] "6264509100/6264509100_R05C01_Grn.idat"
## [21] "6264509100/6264509100_R05C01_Red.idat"
## [22] "6264509100/6264509100_R05C02_Grn.idat"
## [23] "6264509100/6264509100_R05C02_Red.idat"
## [24] "6264509100/6264509100_R06C01_Grn.idat"
## [25] "6264509100/6264509100_R06C01_Red.idat"
## [26] "6264509100/6264509100_R06C02_Grn.idat"
## [27] "6264509100/6264509100_R06C02_Red.idat"
## [28] "SampleSheet.csv"
## [29] "ageData.RData"
## [30] "human_c2_v5.rdata"
## [31] "model-based-cpg-islands-hg19-chr17.txt"
## [32] "wgEncodeRegDnaseClusteredV3chr17.bed"
To demonstrate the various aspects of analysing methylation data, we will be using a small, publicly available 450k methylation dataset (GSE49667)(Zhang et al. 2013). The dataset contains 10 samples in total: there are 4 different sorted T-cell types (naive, rTreg, act_naive, act_rTreg, collected from 3 different individuals (M28, M29, M30). For details describing sample collection and preparation, see Zhang et al. (2013). An additional birth
sample (individual VICS-72098-18-B) is included from another study (GSE51180)(Cruickshank et al. 2013) to illustrate approaches for identifying and excluding poor quality samples.
There are several R Bioconductor packages available that have been developed for analysing methylation array data, including minfi (Aryee et al. 2014), missMethyl (Phipson, Maksimovic, and Oshlack 2016), wateRmelon (Pidsley et al. 2013), methylumi (Davis et al. 2015), ChAMP (Morris et al. 2014) and charm (Aryee et al. 2011). Some of the packages, such as minfi and methylumi include a framework for reading in the raw data from IDAT files and various specialised objects for storing and manipulating the data throughout the course of an analysis. Other packages provide specialised analysis methods for normalisation and statistical testing that rely on either minfi or methylumi objects. It is possible to convert between minfi and methylumi data types, however, this is not always trivial. Thus, it is advisable to consider the methods that you are interested in using and the data types that are most appropriate before you begin your analysis. Another popular method for analysing methylation array data is limma (Ritchie et al. 2015), which was originally developed for gene expression microarray analysis. As limma operates on a matrix of values, it is easily applied to any data that can be converted to a matrix
in R. For a complete list of Bioconductor packages for analysing DNA methylation data, one can search for “DNAMethylation” in BiocViews (https://www.bioconductor.org/packages/release/BiocViews.html#___DNAMethylation) on the Bioconductor website.
We will begin with an example of a probe-wise differential methylation analysis using minfi and limma. By probe-wise analysis we mean each individual CpG probe will be tested for differential methylation for the comparisons of interest and p-values and moderated t-statistics (Smyth 2004) will be generated for each CpG probe.
It is useful to begin an analysis in R by loading all the packages that are likely to be required.
# load packages required for analysis
library(knitr)
library(limma)
library(minfi)
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
library(IlluminaHumanMethylation450kmanifest)
library(RColorBrewer)
library(missMethyl)
library(minfiData)
library(Gviz)
library(DMRcate)
library(stringr)
The minfi, IlluminaHumanMethylation450kanno.ilmn12.hg19, IlluminaHumanMethylation450kmanifest, missMethyl, minfiData and DMRcate are methylation specific packages, while RColorBrewer and Gviz are visualisation packages. We use limma for testing differential methylation, and matrixStats and stringr have functions used in the workflow. The IlluminaHumanMethylation450kmanifest package provides the Illumina manifest as an R object which can easily be loaded into the environment. The manifest contains all of the annotation information for each of the CpG probes on the 450k array. This is useful for determining where any differentially methylated probes are located in a genomic context.
# get the 450k annotation data
ann450k <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
head(ann450k)
## DataFrame with 6 rows and 33 columns
## chr pos strand Name AddressA
## <character> <integer> <character> <character> <character>
## cg00050873 chrY 9363356 - cg00050873 32735311
## cg00212031 chrY 21239348 - cg00212031 29674443
## cg00213748 chrY 8148233 - cg00213748 30703409
## cg00214611 chrY 15815688 - cg00214611 69792329
## cg00455876 chrY 9385539 - cg00455876 27653438
## cg01707559 chrY 6778695 + cg01707559 45652402
## AddressB ProbeSeqA ProbeSeqB
## <character> <character> <character>
## cg00050873 31717405 ACAAAAAAACAACACACAAC.. ACGAAAAAACAACGCACAAC..
## cg00212031 38703326 CCCAATTAACCACAAAAACT.. CCCAATTAACCGCAAAAACT..
## cg00213748 36767301 TTTTAACACCTAACACCATT.. TTTTAACGCCTAACACCGTT..
## cg00214611 46723459 CTAACTTCCAAACCACACTT.. CTAACTTCCGAACCGCGCTT..
## cg00455876 69732350 AACTCTAAACTACCCAACAC.. AACTCTAAACTACCCGACAC..
## cg01707559 64689504 ACAAATTAAAAACACTAAAA.. GCGAATTAAAAACACTAAAA..
## Type NextBase Color Probe_rs Probe_maf
## <character> <character> <character> <character> <numeric>
## cg00050873 I A Red NA NA
## cg00212031 I T Red NA NA
## cg00213748 I A Red NA NA
## cg00214611 I A Red NA NA
## cg00455876 I A Red NA NA
## cg01707559 I A Red NA NA
## CpG_rs CpG_maf SBE_rs SBE_maf Islands_Name
## <character> <numeric> <character> <numeric> <character>
## cg00050873 NA NA NA NA chrY:9363680-9363943
## cg00212031 NA NA NA NA chrY:21238448-21240005
## cg00213748 NA NA NA NA chrY:8147877-8148210
## cg00214611 NA NA NA NA chrY:15815488-15815779
## cg00455876 NA NA NA NA chrY:9385471-9385777
## cg01707559 NA NA NA NA chrY:6778574-6780028
## Relation_to_Island Forward_Sequence SourceSeq
## <character> <character> <character>
## cg00050873 N_Shore TATCTCTGTCTGGCGAGGAG.. CGGGGTCCACCCACTCCAAA..
## cg00212031 Island CCATTGGCCCGCCCCAGTTG.. CGCACGTCTTCCCGACCGCA..
## cg00213748 S_Shore TCTGTGGGACCATTTTAACG.. CGCCCCCTCCTGCAGAACCT..
## cg00214611 Island GCGCCGGCAGGACTAGCTTC.. CGCCCGCGCCACACTGCAGC..
## cg00455876 Island CGCGTGTGCCTGGACTCTGA.. GACTCTGAGCTACCCGGCAC..
## cg01707559 Island AGCGGCCGCTCCCAGTGGTG.. CGCCCTCTGTCGCTGCAGCC..
## Random_Loci Methyl27_Loci UCSC_RefGene_Name UCSC_RefGene_Accession
## <character> <character> <character> <character>
## cg00050873 TSPY4;FAM197Y2 NM_001164471;NR_001553
## cg00212031 TTTY14 NR_001543
## cg00213748
## cg00214611 TMSB4Y;TMSB4Y NM_004202;NM_004202
## cg00455876
## cg01707559 TBL1Y;TBL1Y;TBL1Y NM_134259;NM_033284;..
## UCSC_RefGene_Group Phantom DMR Enhancer
## <character> <character> <character> <character>
## cg00050873 Body;TSS1500
## cg00212031 TSS200
## cg00213748
## cg00214611 1stExon;5'UTR
## cg00455876
## cg01707559 TSS200;TSS200;TSS200
## HMM_Island Regulatory_Feature_Name Regulatory_Feature_Group
## <character> <character> <character>
## cg00050873 Y:9973136-9976273
## cg00212031 Y:19697854-19699393
## cg00213748 Y:8207555-8208234
## cg00214611 Y:14324883-14325218 Y:15815422-15815706 Promoter_Associated_..
## cg00455876 Y:9993394-9995882
## cg01707559 Y:6838022-6839951
## DHS
## <character>
## cg00050873
## cg00212031
## cg00213748
## cg00214611
## cg00455876
## cg01707559
As for their many other BeadArray platforms, Illumina methylation data is usually obtained in the form of Intensity Data (IDAT) Files. This is a proprietary format that is output by the scanner and stores summary intensities for each probe on the array. However, there are Bioconductor packages available that facilitate the import of data from IDAT files into R (Smith et al. 2013). Typically, each IDAT file is approximately 8MB in size. The simplest way to import the raw methylation data into R is using the minfi function read.metharray.sheet
, along with the path to the IDAT files and a sample sheet. The sample sheet is a CSV (comma-separated) file containing one line per sample, with a number of columns describing each sample. The format expected by the read.metharray.sheet
function is based on the sample sheet file that usually accompanies Illumina methylation array data. It is also very similar to the targets file described by the limma package. Importing the sample sheet into R creates a data.frame
with one row for each sample and several columns. The read.metharray.sheet
function uses the specified path and other information from the sample sheet to create a column called Basename
which specifies the location of each individual IDAT file in the experiment.
# read in the sample sheet for the experiment
targets <- read.metharray.sheet(dataDirectory, pattern="SampleSheet.csv")
targets
Now that we have imported the information about the samples and where the data is located, we can read the raw intensity signals into R from the IDAT files using the read.metharray.exp
function. This creates an RGChannelSet
object that contains all the raw intensity data, from both the red and green colour channels, for each of the samples. At this stage, it can be useful to rename the samples with more descriptive names.
# read in the raw data from the IDAT files
rgSet <- read.metharray.exp(targets=targets)
rgSet
## class: RGChannelSet
## dim: 622399 11
## metadata(0):
## assays(2): Green Red
## rownames(622399): 10600313 10600322 ... 74810490 74810492
## rowData names(0):
## colnames(11): 6264509100_R01C01 6264509100_R02C01 ... 6264509100_R04C02
## 5975827018_R06C02
## colData names(10): Sample_Name Sample_Well ... Basename filenames
## Annotation
## array: IlluminaHumanMethylation450k
## annotation: ilmn12.hg19
# give the samples descriptive names
targets$ID <- paste(targets$Sample_Group,targets$Sample_Name,sep=".")
sampleNames(rgSet) <- targets$ID
rgSet
## class: RGChannelSet
## dim: 622399 11
## metadata(0):
## assays(2): Green Red
## rownames(622399): 10600313 10600322 ... 74810490 74810492
## rowData names(0):
## colnames(11): naive.1 rTreg.2 ... act_rTreg.10 birth.11
## colData names(10): Sample_Name Sample_Well ... Basename filenames
## Annotation
## array: IlluminaHumanMethylation450k
## annotation: ilmn12.hg19
Once the data has been imported into R, we can evaluate its quality. Firstly, we need to calculate detection p-values. We can generate a detection p-value for every CpG in every sample, which is indicative of the quality of the signal. The method used by minfi to calculate detection p-values compares the total signal \((M + U)\) for each probe to the background signal level, which is estimated from the negative control probes. Very small p-values are indicative of a reliable signal whilst large p-values, for example >0.01, generally indicate a poor quality signal.
Plotting the mean detection p-value for each sample allows us to gauge the general quality of the samples in terms of the overall signal reliability (Figure 2). Samples that have many failed probes will have relatively large mean detection p-values.
# calculate the detection p-values
detP <- detectionP(rgSet)
head(detP)
## naive.1 rTreg.2 act_naive.3 naive.4 act_naive.5
## cg00050873 0.000000e+00 0.000000e+00 0.000000e+00 0.00000e+00 0.000000e+00
## cg00212031 0.000000e+00 0.000000e+00 0.000000e+00 0.00000e+00 0.000000e+00
## cg00213748 2.139652e-88 4.213813e-31 1.181802e-12 1.29802e-47 8.255482e-15
## cg00214611 0.000000e+00 0.000000e+00 0.000000e+00 0.00000e+00 0.000000e+00
## cg00455876 1.400696e-234 9.349236e-111 4.272105e-90 0.00000e+00 3.347145e-268
## cg01707559 0.000000e+00 0.000000e+00 0.000000e+00 0.00000e+00 0.000000e+00
## act_rTreg.6 naive.7 rTreg.8 act_naive.9 act_rTreg.10
## cg00050873 0.000000e+00 0.00000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## cg00212031 0.000000e+00 0.00000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## cg00213748 2.592206e-23 1.16160e-28 1.469801e-05 1.543654e-21 1.365951e-08
## cg00214611 0.000000e+00 0.00000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## cg00455876 4.690740e-308 1.08647e-219 5.362780e-178 0.000000e+00 7.950724e-295
## cg01707559 0.000000e+00 0.00000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## birth.11
## cg00050873 0.000000e+00
## cg00212031 2.638199e-237
## cg00213748 6.735224e-01
## cg00214611 7.344451e-01
## cg00455876 4.403634e-174
## cg01707559 0.000000e+00
# examine mean detection p-values across all samples to identify any failed samples
pal <- brewer.pal(8,"Dark2")
par(mfrow=c(1,2))
barplot(colMeans(detP), col=pal[factor(targets$Sample_Group)], las=2,
cex.names=0.8, ylab="Mean detection p-values")
abline(h=0.05,col="red")
legend("topleft", legend=levels(factor(targets$Sample_Group)), fill=pal,
bg="white")
barplot(colMeans(detP), col=pal[factor(targets$Sample_Group)], las=2,
cex.names=0.8, ylim=c(0,0.002), ylab="Mean detection p-values")
abline(h=0.05,col="red")
legend("topleft", legend=levels(factor(targets$Sample_Group)), fill=pal,
bg="white")
The minfi qcReport
function generates many other useful quality control plots. The minfi vignette describes the various plots and how they should be interpreted in detail. Generally, samples that look poor based on mean detection p-value will also look poor using other metrics and it is usually advisable to exclude them from further analysis.
qcReport(rgSet, sampNames=targets$ID, sampGroups=targets$Sample_Group,
pdf="qcReport.pdf")
Poor quality samples can be easily excluded from the analysis using a detection p-value cutoff, for example >0.05. For this particular dataset, the birth
sample shows a very high mean detection p-value, and hence it is excluded from subsequent analysis (Figure 2).
# remove poor quality samples
keep <- colMeans(detP) < 0.05
rgSet <- rgSet[,keep]
rgSet
## class: RGChannelSet
## dim: 622399 10
## metadata(0):
## assays(2): Green Red
## rownames(622399): 10600313 10600322 ... 74810490 74810492
## rowData names(0):
## colnames(10): naive.1 rTreg.2 ... act_naive.9 act_rTreg.10
## colData names(10): Sample_Name Sample_Well ... Basename filenames
## Annotation
## array: IlluminaHumanMethylation450k
## annotation: ilmn12.hg19
# remove poor quality samples from targets data
targets <- targets[keep,]
targets[,1:5]
## Sample_Name Sample_Well Sample_Source Sample_Group Sample_Label
## 1 1 A1 M28 naive naive
## 2 2 B1 M28 rTreg rTreg
## 3 3 C1 M28 act_naive act_naive
## 4 4 D1 M29 naive naive
## 5 5 E1 M29 act_naive act_naive
## 6 6 F1 M29 act_rTreg act_rTreg
## 7 7 G1 M30 naive naive
## 8 8 H1 M30 rTreg rTreg
## 9 9 A2 M30 act_naive act_naive
## 10 10 B2 M30 act_rTreg act_rTreg
# remove poor quality samples from detection p-value table
detP <- detP[,keep]
dim(detP)
## [1] 485512 10
To minimise the unwanted variation within and between samples, various data normalisations can be applied. Many different types of normalisation have been developed for methylation arrays and it is beyond the scope of this workflow to compare and contrast all of them (Fortin et al. 2014; Wu et al. 2014; Sun et al. 2011; Wang et al. 2012; Maksimovic, Gordon, and Oshlack 2012; Mancuso et al. 2011; Touleimat and Tost 2012; Teschendorff et al. 2013; Pidsley et al. 2013; Triche et al. 2013). Several methods have been built into minfi and can be directly applied within its framework (Fortin et al. 2014; Triche et al. 2013; Maksimovic, Gordon, and Oshlack 2012; Touleimat and Tost 2012), whilst others are methylumi-specific or require custom data types (Wu et al. 2014; Sun et al. 2011; Wang et al. 2012; Mancuso et al. 2011; Teschendorff et al. 2013; Pidsley et al. 2013). Although there is no single normalisation method that is universally considered best, a recent study by Fortin et al. (2014) has suggested that a good rule of thumb within the minfi framework is that the preprocessFunnorm
(Fortin et al. 2014) function is most appropriate for datasets with global methylation differences such as cancer/normal or vastly different tissue types, whilst the preprocessQuantile
function (Touleimat and Tost 2012) is more suited for datasets where you do not expect global differences between your samples, for example a single tissue. Further discussion on appropriate choice of normalisation can be found in (Hicks and Irizarry 2015), and the accompanying quantro package includes data-driven tests for the assumptions of quantile normalisation.
As we are comparing different blood cell types, which are globally relatively similar, we will apply the preprocessQuantile
method to our data (Figure 3). This function implements a stratified quantile normalisation procedure which is applied to the methylated and unmethylated signal intensities separately, and takes into account the different probe types. Note that after normalisation, the data is housed in a GenomicRatioSet
object. This is a much more compact representation of the data as the colour channel information has been discarded and the \(M\) and \(U\) intensity information has been converted to M-values and beta values, together with associated genomic coordinates. Note, running the preprocessQuantile
function on this dataset produces the warning: ‘An inconsistency was encountered while determining sex’; this can be ignored as it is due to all the samples being from male donors.
# normalize the data; this results in a GenomicRatioSet object
mSetSq <- preprocessQuantile(rgSet)
## [preprocessQuantile] Mapping to genome.
## Warning in .getSex(CN = CN, xIndex = xIndex, yIndex = yIndex, cutoff = cutoff):
## An inconsistency was encountered while determining sex. One possibility is that
## only one sex is present. We recommend further checks, for example with the
## plotSex function.
## [preprocessQuantile] Fixing outliers.
## [preprocessQuantile] Quantile normalizing.
# create a MethylSet object from the raw data for plotting
mSetRaw <- preprocessRaw(rgSet)
# visualise what the data looks like before and after normalisation
par(mfrow=c(1,2))
densityPlot(rgSet, sampGroups=targets$Sample_Group,main="Raw", legend=FALSE)
legend("top", legend = levels(factor(targets$Sample_Group)),
text.col=brewer.pal(8,"Dark2"))
densityPlot(getBeta(mSetSq), sampGroups=targets$Sample_Group,
main="Normalized", legend=FALSE)
legend("top", legend = levels(factor(targets$Sample_Group)),
text.col=brewer.pal(8,"Dark2"))