The Chip Analysis Methylation Pipeline

Yuan Tian, Morris TJ, Butcher L, Feber A, Teschendorff A, Chakravarthy A, Beck S

2017-08-08

1 New Updates

If you are using ChAMP version above 2.8.3, note that you MUST install latest ChAMPdata package(version 2.8.1) to make new ChAMP works.

1: Added parameter “force” in champ.load(). “force” is a parameter in minfi’s read.meth.exp() function, when your dataset comes from various batches, force parameter would allow read.meth.exp() function to extract common probes across all batches and do analysis.

2: champ.DMP() works on numeric variable now: Now champ.DMP() would works on numeric variable like age, you can set “pheno” parameter in champ.DMP() function to make it detect numeric variable related CpGs.

3: champ.DMP() do pairewise comparison between each two categorical phenotypes: Previously champ.DMP() can only works on two phenotypes, if you have multiple phenotypes in your covariate, you need to set parameters “compare.group” to do comparison between any two of them. Now champ.DMP() would works on each pair of them return a list containing all results.

4: goseq was replaced by gometh: Previously we use goseq to do CpG number corrected GSEA, now we used another implementation of goseq algotithm. gometh is published in missMethyl package and also use numbers of CpGs to do correction when doing GSEA.

5: missMethyl and combinat pacakges are needed now: Since we used some new functions like gometh and combn in our package, we need these two packages added in our import list. You may use command: source("https://bioconductor.org/biocLite.R");biocLite("missMethyl","combinat") to install them.

6: DMP.GUI() is modified: Now DMP.GUI() function would automatically detect your pheno parameter’s type, if it’s “numeric”, scatter plot would plotted on CpG plot, while for “character” or “factor” type, boxplot would be plotted.

7: SVD plot added legend: We now added legend of color in SVD plot.

8: “minfi” loading method fixed: Previously “minfi” function was not working because of beadcount() function from wateRmelon failed to get beads information from rgSet object, which might be caused by upgrading conflict between minfi and wateRmelon. Now we fixed this issue and old “minfi” loading method works now. You may set parameter method in champ.load() function to use your prefered way to load data.

9: vignette of ChAMP and github Demo pages updated: We spend lots of work to upgrade ChAMP vignette here. Also added more latest Demo result on our github Demo page.

10: Added some figures from GSE40279 in vignette. Since your 450K demo data does not contain any numeric variable, we use GSE40279 here to show performance of champ.SVD() and DMP.GUI() in this vignette.

2 Introduction

The ChAMP package is designed for the analysis of Illumina Methylation beadarray data (EPIC and 450k) and provides a pipeline that integrates currently available 450k and EPIC analysis methods. This includes a variety of different data import methods (e.g. from .idat files or a beta-valued matrix) and Quality Control plots. Type-2 probe correction methods include SWAN1, Peak Based Correction (PBC)2 and BMIQ3 (the default choice). The popular Functional Normalization4 function offered by the minfi56 package is also available. The singular value decomposition (SVD) method7 allows an in-depth look at batch effects, and for correction of multiple batch effects the ComBat method8 has been implemented. Adjusting for cell-type heterogeneity can be done via RefbaseEWAS9. ChAMP also includes a function for infering copy-number alterations from either 450k or EPIC data10. For the identification of Differentially Methylated Regions (DMR), ChAMP offers the new Probe Lasso method11, in addition to previous DMR detection functions Bumphunter12 and DMRcate13. For users who need to find Differentially Methylated Blocks14, the new version of ChAMP includes a function to detect these. Gene Set Enrichment Analysis (GSEA) is also possible and the new version of ChAMP incorporates methods that correct for the bias caused by unequal representation of probes among genes15. Also, the new version of ChAMP incorporates the FEM package16, which can infer gene modules in user-specified gene-networks that exhibit differential methylation between phenotypes.

While a number of other pipelines and packages for 450k or EPIC array analysis are available (including IMA17, minfi5, methylumi18, RnBeads19 and wateRmelon20 ), ChAMP provides a more comprehensive and complete analysis pipeline from reading original data files to final tertiary analysis results, such as GSEA, which streamlines methylation array analysis for researchers. The new version ChAMP also provides a series of Shiny and Plotly-based WebBrower Interactive analysis functions (GUI functions), to help scientists view ChAMP results. This requires an interactive WebBrower-based framework, for calling Graph System locally or remotely. For example, X11 for most Linux Systems.

If you have any bugs to report, or any suggestions for ChAMP, please email champ450k@gmail.com.

3 Installation

It is essential that you have R 3.3 or above already installed on your computer or server. ChAMP is a pipeline that utilises many other Bioconductor packages that are currently available from CRAN and Bioconductor. For all of the steps of the pipeline to work, make sure that you have upgraded Bioconductor to newest version (3.5).

After you have R and Bioconductor installed properly, you can start to install ChAMP. The easiest way to install it is by typing following code into your R session:

source("https://bioconductor.org/biocLite.R")
biocLite("ChAMP")

If you have any problems with Bioconductor, another option is to install ALL dependent packages, then install ChAMP. There are a lot if packages relied on by ChAMP. You can use the following command in R to install most of them:

source("http://bioconductor.org/biocLite.R")
biocLite(c("minfi","ChAMPdata","Illumina450ProbeVariants.db","sva","IlluminaHumanMethylation450kmanifest","limma","RPMM","DNAcopy","preprocessCore","impute","marray","wateRmelon","goseq","plyr","GenomicRanges","RefFreeEWAS","qvalue","isva","doParallel","bumphunter","quadprog","shiny","shinythemes","plotly","RColorBrewer","DMRcate","dendextend","IlluminaHumanMethylationEPICmanifest","FEM","matrixStats","missMethyl","combinat"))

When you are installing ChAMP, you may encounter some errors saying that some packages are not installed. These errors are caused by recursively depending on R packages, for example, ChAMP uses minfi, minfi uses lumi, and lumi uses other packages, so if lumi was not installed on your computer, ChAMP would fail. To solve these errors, you simply need to check those error messages, find out which packages required are missing, then install it with command biocLite("YourErrorPackage") directly. Then retry installing ChAMP, it may take 3-4 times, but eventually it should work.

After installation, you should be able to load the ChAMP package in your R session:

library("ChAMP")

4 Test Data

The package contains two test datasets, one is HumanMethylation450 data (.idat) and the other is simulated EPIC data, which can be used to test functions available in ChAMP. This can be loaded by pointing the directory to the testDataSet like below:

testDir=system.file("extdata",package="ChAMPdata")
myLoad <- champ.load(testDir,arraytype="450K")

The 450k lung tumor data set contains only 8 samples, 4 lung tumor samples (T) and 4 control samples (C). We use this data set to show the functionality of ChAMP.

For the EPIC Simulation Data Set, user may use following code to load it:

data(EPICSimData)

This Simulation Data Set contains 16 samples, all of them are actually originally from one sample but modified into DMP and DMR, also alone with some errors variance. In this 16 sample dataset, we simulated 8 samples are control and 8 sample as case, samples are marked in pd of EPICSimData object. In this data, we randomly choose 5000 regions from clusterMaker() function of bumphunter package. In each region, we randomly selected some continually CpGs as DMR, then increased or decreased beta value of this DMR. So there should have less than 5000 DMRs (4700+) in this data set, because some simulated DMRs contains only 1-2 CpGs, they will not be regarded as DMR in champ.DMR() function. Users may use this data to test the functionality of ChAMP on EPIC data.

5 ChAMP Pipeline

5.1 Pipeline Introduction

ChAMP Pipeline Figure above outlines the steps in the ChAMP pipeline. Each step can be run individually as a separate function. This allows integration between ChAMP’s result and other analysis pipelines. In above pipeline plot, all functions of ChAMP are included, which are categorized using three colors:

Solid gray line in above plot represent stream of pipeline, while dash gray line represent functions may not necessarily be used, which depends on your data and projects.

In the above plot, there are some functions and connections (lines) marked with green gleam, which indicate the Main Analysis Pipeline. ChAMP combines many functions, but not all of them may be used for a successful analysis. The green gleam represents a main analysis pipeline most likely to be used for various data sets. We also marked steps for these main pipeline steps, which should be helpful for new users. The black dot in the middle of the plot represents fully prepared methylation datasets, which is the milestone between data preparation and data analysis. Thus, we suggest you save the fully prepared data since your other analyses may be started from it directly.

5.2 Full Pipeline

The full pipeline can be run at once with one command:

champ.process(directory = testDir)

When running the full pipeline through the champ.process() function a number of parameters can be adjusted. These parameters are related to all functions involved in the ChAMP package, but not all parameters are covered, otherwise champ.process() function would be too large; please check the manual of ChAMP for parameter settings.

5.3 Separated Steps

We note that champ.process() may not always run successfully, which is likely due to particularities of specific data sets . For example, the covariate of interest in the pd file may not be denoted “Sample_Group” but another category, for example, “Disease”, but in the ChAMP function, “Sample_Group” is set as default as the variable of interest. If you encounter any problems when using champ.process(), we recommend you apply it in a step-by-step fashion:

myLoad <- cham.load(testDir)
# Or you may separate about code as champ.import(testDir) + champ.filter()
CpG.GUI()
champ.QC() # Alternatively: QC.GUI()
myNorm <- champ.norm()
champ.SVD()
# If Batch detected, run champ.runCombat() here.
myDMP <- champ.DMP()
DMP.GUI()
myDMR <- champ.DMR()
DMR.GUI()
myBlock <- champ.Block()
Block.GUI()
myGSEA <- champ.GSEA()
myEpiMod <- champ.EpiMod()
myCNA <- champ.CNA()

# If DataSet is Blood samples, run champ.refbase() here.
myRefbase <- champ.refbase()

We note that the above implementation of the ChAMP functions uses default parameter specifications, and that therefore parameters can be adjusted and integrated with users’ other unique analysis tools and methods. For details of parameter options, please check the relevant sections of this vignette, or the ChAMP manual.

5.4 EPIC pipeline

ChAMP provides a friendly user interface for EPIC array analysis. Most functions are the same for both 450K array and EPIC array, the only difference lies in a parameter ‘arraytype’, used for some functions which require annotation files: you simply need to set this parameter to “EPIC”. In the ChAMP package, we created a simulation EPIC dataset (beta value only). The pipeline to analyse EPIC array is below:

# myLoad <- champ.load(directory = testDir,arraytype="EPIC")
# We simulated EPIC data from beta value instead of .idat file,
# but user may use above code to read .idat files directly.
# Here we we started with myLoad.

data(EPICSimData)
CpG.GUI(arraytype="EPIC")
champ.QC() # Alternatively QC.GUI(arraytype="EPIC")
myNorm <- champ.norm(arraytype="EPIC")
champ.SVD()
# If Batch detected, run champ.runCombat() here.This data is not suitable.
myDMP <- champ.DMP(arraytype="EPIC")
DMP.GUI()
myDMR <- champ.DMR()
DMR.GUI()
myDMR <- champ.DMR(arraytype="EPIC")
DMR.GUI(arraytype="EPIC")
myBlock <- champ.Block(arraytype="EPIC")
Block.GUI(arraytype="EPIC") # For this simulation data, not Differential Methylation Block is detected.
myGSEA <- champ.GSEA(arraytype="EPIC")
myEpiMod <- champ.EpiMod(arraytype="EPIC")

# champ.CNA(arraytype="EPIC")
# champ.CNA() function call for intensity data, which is not included in our Simulation data.

5.5 Computational Requirements

The ability to run the pipeline on a large number of samples depends somewhat on the memory available. The ChAMP pipeline runs 200 samples successfully on a computer with 8GB of memory. Beyond this it may be necessary to find a server/cluster to run the analysis on.

The champ.load() function uses the most memory. If you plan to run the analysis more than once it is recommended to run myLoad <- champ.load() and save this list for future analyses.

In champ.DMR(), champ.norm() and champ.Block() functions, some methods may use parallel method to accelerate the process. If your server or computer has more cores, you may specify more cores at the same time to make the function run faster, though this may cost more memory. You can use the following code in R to detect number of cores.

library("doParallel")
detectCores()

GUI functions requires more things. A basic framework for a GUI environment is required. If you run ChAMP on your own computer, it should be fine. But if you run ChAMP on a remote server, a local display system is essential to use these WebBrower-based GUI functions. X11 is a very common used tool for this. For Windows users, Xmanager or MobaxTerm are excellent choices.

A significant improvement for ChAMP is that users can also conduct full analysis even if they are not starting from raw .idat files. As long as you have a methylation beta matrix and the corresponding phenotypes (pd) file, you can conduct nearly all of the ChAMP analysis. This makes analysis easier for users who have beta-values only but not original idat files, for example if users obtain data from TCGA or GEO.

6 Description of ChAMP Pipelines

As previously mentioned, users have the option to run each of the ChAMP functions individually to allow integration with other pipelines or to save the results at each step. Below we describe each function in further detail.

6.1 Loading Data

Loading data is always the first step. ChAMP provides a loading function to get data from .idat files (with pd file (Sample_Sheet.csv) in it). The old champ.load() function utilises minfi to load data from .idat files, but now we provide a new method “ChAMP” to read data. The two loading methods effectively return the same objects, except that the old version would also return rgSet and mset mset objects, which might be used as input to SWAN and FunctionalNormliazation method in champ.norm(). By default this loads data from the current working directory, in this directory you should have all .idat files and a sample sheet. The sample sheet currently needs to be a .csv file as came with your results following hybridization. The champ.load() function used to load the data from the idat files automatically filters out the SNPs that might influence the analysis result. The SNP list was generated from Zhou’s Nucleic Acids Research paper in 201621, which is designed specifically for 450K and EPIC, also with population differences considered. As such, for 450k bead array data, before filtering for low quality probes, the dataset will include 485,512 probes. And for EPIC bead array data, before filtering probes the dataset, will include 867,531 probes.

Currently the default method for loading is “ChAMP”. You may set “method” in champ.load() to use classical minfi way. Note that “ChAMP” method in champ.load() is merely a combination of champ.import() and champ.filter(). If you are not satisfied with result returned by champ.load(), say you want to get beadcount, detection P matrix, or Meth UnMeth matrix, you may use champ.import() to generate them.

User may invoke following code to load data set:

myLoad <- champ.load(testDir)

Alternatively, the user can load the data from the dataset object testDataSet in ChAMPdata package, using:

data(testDataSet)

It is important to check the pd file (Sample_Sheet.csv). Most pd files are similar in format, but their most valuable information are not always stored in same column, for example, for most pd files, Sample_Group means groups of phenotype wants to be researched, but in some pd files, similar information may be stored in columns named as “Diagnose” or “CancerType”, so the user MUST understand their pd file well to achieve a successful analysis.

myLoad$pd
##    Sample_Name Sample_Plate Sample_Group Pool_ID Project Sample_Well
## C1          C1           NA            C      NA      NA         E09
## C2          C2           NA            C      NA      NA         G09
## C3          C3           NA            C      NA      NA         E02
## C4          C4           NA            C      NA      NA         F02
## T1          T1           NA            T      NA      NA         B09
## T2          T2           NA            T      NA      NA         C09
## T3          T3           NA            T      NA      NA         E08
## T4          T4           NA            T      NA      NA         C09
##     Array      Slide                   Basename                  filenames
## C1 R03C02 7990895118 Test450K/7990895118_R03C02 Test450K/7990895118_R03C02
## C2 R05C02 7990895118 Test450K/7990895118_R05C02 Test450K/7990895118_R05C02
## C3 R01C01 9247377086 Test450K/9247377086_R01C01 Test450K/9247377086_R01C01
## C4 R02C01 9247377086 Test450K/9247377086_R02C01 Test450K/9247377086_R02C01
## T1 R06C01 7766130112 Test450K/7766130112_R06C01 Test450K/7766130112_R06C01
## T2 R01C02 7766130112 Test450K/7766130112_R01C02 Test450K/7766130112_R01C02
## T3 R01C01 7990895118 Test450K/7990895118_R01C01 Test450K/7990895118_R01C01
## T4 R01C02 7990895118 Test450K/7990895118_R01C02 Test450K/7990895118_R01C02

As we can see here, the Sample_Group contains two phenotypes: C and T, which is what we want to compare. In this data set, C means “Control”, while T means “Tumor”.

6.2 Filtering Data

ChAMP provides a comprehensive filtering function called champ.filter() which can take any data matrix as input (beta, M, Meth, UnMeth, intensity) and do filtering on it. champ.filter() does not call for any specific object or class result, nor does it need an IDAT file, as long as you have a single beta matrix, the filtering can be done. In champ.filter(), since some probes and samples might be removed for reason of low quality, NA might be still exist in filtered data. So, we provided a parameter autoimpute in the function to conduct imputation on remain probes, if NAs are still existing. Actually, imputation work is done by champ.impute() function, reader may check that function for more information.

For some filtering methods, additional data must be provided, for example if you want to perform filtering using detection P-values, you need to provide a detection P-value matrix. Also, if you want to perform filtering based on beadcounts, you need to provide the beadcount information. If you read IDAT file with champ.load() function’s default method “ChAMP”, beadcount information should be included in returned result. Also champ.filter() is designed to do filtering on multiple types of data frames from same data set, say you have beta matrix, M matrix, intensity matrix from same .idat file, you may use champ.filter() to do filtering on them at the same time, which would keep their compliance in future analysis.

The champ.filter() function is the after process of champ.import() which is a novel loading function coded by ChAMP team. You may use them like below:

myImport <- champ.import(testDir)
myLoad <- champ.filter()

Then the result of champ.filter() (myLoad in above code), should be the same result as old minfi method.

Actually During the loading process, champ.load() function would also do some filtering. Below are filtering steps which would be performed during both champ.load() and champ.filter().

All above filterings are controlled by parameters in the champ.load() and champ.filter() function, and users may adjust them as they wish. Note that, though most ChAMP functions are supporting isolated beta matrix analysis, which means not relying on .idat files from the beginning, champ.load() can not perform filtering on beta matrix alone. For users have no .IDAT data but beta matrix and Sample_Sheet.csv, you may want perform filtering using the champ.filter() function and then use following functions to do analysis.

In addition to the filtering capability, we also provide a function CpG.GUI() for the user to check their distribution of CpGs on chromosome, CpG island, TSS reagions. e.g. This function can be used for any CpGs list, for example, during your analysis, whenever you get a significant CpG list from DMR, you can use the following function to check the distribution of your CpG vector:

CpG.GUI(CpG=rownames(myLoad$beta),arraytype="450K")

This is a useful function to demonstrate the distribution of your CpG list. If you get a DMP list, you may use this function to check your DMP’s distribution on chromosome.

6.3 Further quality control and exploratory analysis

Quality Control is an important process to make sure dataset is suitable for downstream analysis. champ.QC() function and QC.GUI() function would draw some plots for user to easily check their data’s quality. Usually, there are three plots that would be generated:

champ.QC()

Or you may use QC.GUI() function as below, which however may be memory intensive, so make sure you have a good server or computer when you run this GUI function in ChAMP.

QC.GUI(beta=myLoad$beta,arraytype="450K")