The Chip Analysis Methylation Pipeline

Yuan Tian, Morris TJ, Butcher L, Feber A, Teschendorff A, Chakravarthy A, Beck S

2017-04-10

1 New Updates

1: Added Demo on github: In respond to our reviewer’s question and to make users have better understanding on our package, we processed ChAMP fully on some data sets and saved all messages shown during processing. We upload these information to github so that user could use them to check. Currently we added to Demo Report in repositories, Demo450 of this vignette and DemoEPIC in our paper’s supplementary.

2: Fixed Error in DMRcate pacakge in cpg.annotate() function: DMRcate pacakge get updated for “Error in if (nsig == 0) { : missing value where TRUE/FALSE needed.”. Thanks professor Timothy Peters upgraded DMRcate to make it more robust.

3: Replace 0 into smallest positive value : In champ.load(), instead of replacing all 0 and negative value into 0.000001, we relplace them as the smallest positive value now, which might leads to slightly difference from previous result from same code and data set.

4: Fixed warnings() in all GUI() functions : Previously, there were some warnings in GUI() functions, which are merely because we did not assign some parameters in painting (we used default parameter, which is correct but would trigger warnings), it will NOT cause any error in plot but some warnings. Based on reviewer’s suggestion, in this version pacakge, we fixed most warnings.

5: More flexible for champ.runCombat() function : In previous champ.runCombat() function, we setted restriction that user can NOT do correction on factors like Sample_Group or Sample_Name. Based on reviewer’s suggestion, we removed this restriction in this version, which means users can fix all covariates by using champ.runCombat() function. Also, we added “variablename” parameter so that user may assign other variables other then “Sample_Group” to be corrected for.

6: Removed “DMP” parameter in champ.DMR() function : We modified ProbeLasso method in champ.DMR() function, there is no need to input “myDMP” parameter anymore, ProbeLasso function would calculate it inside the function. Previously, ProbeLasso needs result from champ.DMP() to calculate DMRs, now we have updated it, so that ProbeLasso would calculate DMPs inside champ.DMR() function instead of calling for myDMP parameter from users.

7: Changed jpg in champ.CNA() into pdf: Previously champ.CNA() generate some jpg figures, now they have been replaced into pdf figures.

8: Accelerated champ.DMR() : We did some modification to make champ.DMR() faster then before, especially for DMRcate function.

2 Introduction

The ChAMP package is a pipeline that integrates currently available 450k and EPIC analysis methods and also offers its own novel functionality. It utilises the data import from .idat file and normal beta matrix, Quality Control Plots; SWAN1 and Functional Normalization2 functions offered by the minfi34 package.

In addition, the ChAMP package includes the Peak Based Correction (PBC) method5 and the BMIQ Normalization6 is set as the default method.

A number of other pipelines and packages are available for 450k or EPIC array analysis including IMA7, minfi3, methylumi8, RnBeads9 and watermelon10. However, ChAMP provides a comprehensive and compelete analysis pipeline from reading origin data file to final GSEA result, which would make researchers’ work much easier. Also, the new version ChAMP provides a series of Shiny-based WebBrower Interactive analysis function (GUI functions), which should benifit even more scientists’ work.

There are also some other advantages for ChAMP. First of all ChAMP is both available for 450k or EPIC array. Also this package contains multiple novel functions. The singular value decomposition (SVD) method11 allows an in-depth look at batch effects and for correction of multiple batch effect the ComBat method12 has been implemented. For the identification of Differentially methylated regions (DMR), ChAMP offers the new Probe Lasso method13. Also another two effective DMR detection function Bumphunter14 and DMRcate15 are also integrated. Users have more freedom to select their prefered methods.

For the purpose of dealing with cell heterogeneity problem, we included two functions: RefFreeEWAS16 and RefbaseEWAS17 into ChAMP. Finally, ChAMP has an additional function to analyse 450k or EPIC for copy number alterations18.

Except new functions, all functions and scripts exist already in old version ChAMP, which all have been enhanced and checked carefully. Also, some new functions are included as well:

For users who need to find Differentially Methylated Blocks19. New version ChAMP includes a function to detect methylation Blocks. Another important function is champ.GSEA(), which would take singnificant genes related with DMPs and DMRs and do GSEA analysis on them, to see what pathways are enriched by significant CpGs or DMRs. To correct biased might be caused by ininequality between genes and CpG probes, we used goseq pacakge to correct the bias20. Also, new version ChAMP incoporates FEM package21, which would calculate modules in pathways that show differential methylation changes between two phenotypes.

Apart from that, new version ChAMP provides a Shiny and Plotly based GUI framework, to help scientists to view ChAMP result. This is an interactive WebBrower-based framework, calling for Graph System locally or remotely. For example X11 for most Linux System.

If you have any bug to report, or any suggestions for ChAMP, please email champ450k@gmail.com, or tianyuan1991hit@gmail.com. I will respond you as fast as I can.

3 Installation

It is essential that you have R 3.3 or above version already installed on your computer or server. ChAMP is a pipeline that utilises many other Bioconductor packages that are currently available from CRAN and Bioconductor. For all of the steps of the pipeline to work, make sure that you have ungraded Bioconductor to newest version (3.3).

After you have R and Bioconductor installed properly, you can start to install ChAMP. The most easy way to install it is typing following code into R session:

source("https://bioconductor.org/biocLite.R")
biocLite("ChAMP")

If you have any problem with traffic with Bioconductor, another option is install ALL dependence packages, then install ChAMP. There are a lot if packaged relied by ChAMP. You can use following command in R to install most of them:

source("http://bioconductor.org/biocLite.R")
biocLite(c("minfi","ChAMPdata","Illumina450ProbeVariants.db","sva","IlluminaHumanMethylation450kmanifest","limma","RPMM","DNAcopy","preprocessCore","impute","marray","wateRmelon","goseq","plyr","GenomicRanges","RefFreeEWAS","qvalue","isva","doParallel","bumphunter","quadprog","shiny","shinythemes","plotly","RColorBrewer","DMRcate","dendextend","IlluminaHumanMethylationEPICmanifest","FEM","matrixStats"))

When you are installing ChAMP, You may encounter some error saying some package are not installed. This errors are caused by recursively depending on R pacakges, for example, ChAMP used minfi, minfi used lumi, and lumi used other packages, so if lumi was not installed on your computer, ChAMP would fail. To solve these errors, you simply need to check those error messages, find out which pacakges required are missing, then install it with command ’biocLite(“ChAMP”)` directly. Then retry installing ChAMP, it may takes 3-4 times, but eventually it should work.

Note: ChAMP used plotly to generate interactive plots, at the time ChAMP was firstly developed, plotly was on 3.6.1 version, which caused some error report in user’s GUI function if you are still using old version ChAMP. Now we have made ChAMP fully support version 4.5.6, user may download then use following code in Bash to install it, so that the plotly in your system will be upgraded:

R CMD INSTALL plotly-4.5.6.tar.gz

After you did all above work, you may use code below to install ChAMP:

R CMD INSTALL ChAMP_2.6.4.tar.gz

After installation, you may easily load ChAMP package:

library("ChAMP")

4 Test Data

The package contains two test dataset, one is HumanMethylation450 data (.idat) and the other is Simulated EPIC data, which can be used to test functions available in ChAMP. This can be loaded by pointing the directory to the testDataSet like below:

testDir=system.file("extdata",package="ChAMPdata")
myLoad <- champ.load(testDir,arraytype="450K")

This dataset contains only 8 Samples, we will use it to show ChAMP functions’ usage blow.

For the EPIC Simulation Data Set, user may use following code to load it:

data(EPICSimData)

This Simulation Data Set contains 16 samples, all of them are actually originally from one sample but midified into DMP and DMR, also alone with some errors variance.

5 ChAMP Pipeline

5.1 Pipeline Introduction

ChAMP Pipeline Figure above outlines the steps in the ChAMP pipeline. Each step can be run individually as a separate function. This allows integration between ChAMP’s result and other analysis pipelines. In above pipeline plot, all functions of ChAMP are included, which are merely divided into three colors:

In above plot, there are some functions and connections(lines) are marked green gleam, which indicate a Main Analysis Pipeline. ChAMP combined many functions, but not all of them may be used for a successful analysis. The green gleam represents a main analysis pipeline most likely to be used for various data sets. The black dot in the middle of the plot represents fully prepared methylation datasets, which is the milestone between data preparation and data analysis. Thus we suggest you save the fully prepared data since your other analysis would be started from it directly.

5.2 Full Pipeline

The full pipeline can be run at once with one command:

champ.process(directory = testDir)

When running the full pipeline through the champ.process() function a number of parameters can be adjusted. These parameters are related to all functions involved in ChAMP package, but not all parameters are covered, otherwise champ.process() function would be be too large, please check the Manual of ChAMP for parameter settings.

5.3 Separated Steps

Note that champ.process() may not always successfully run, because any small problem on data set in it may result to Error (For example, for some data set I get, the most important covariates in pd file is not “Sample_Group” but “Diagnose”, but in all ChAMP function, “Sample_Group” was defaultly setted as user’s most focused variable). So if you get any problem in champ.process() you may try separated steps below:

myLoad <- cham.load(testDir)
CpG.GUI()
champ.QC() # Alternatively: QC.GUI()
myNorm <- champ.norm()
champ.SVD()
# If Batch detected, run champ.runCombat() here.
myDMP <- champ.DMP()
DMP.GUI()
myDMR <- champ.DMR()
DMR.GUI()
myBlock <- champ.Block()
Block.GUI()
myGSEA <- champ.GSEA()
myEpiMod <- champ.EpiMod()
myCNA <- champ.CNA()
myRefFree <- champ.reffree()
# If DataSet is Blood samples, run champ.refbase() here.

Above are main functions run separately, which contains more parameters to be adjusted and may be integrated with users’ other unique analysis tools and method.

5.4 EPIC pipeline

ChAMP provide a friendly user interface for EPIC array analysis, most functions are the same between 450K array and EPIC array, the only difference lies on a parameter arraytype, for some functions call for annotation files, you simply need to set parameter arraytype as “EPIC”.

In ChAMP package, we created a simulation EPIC data (beta value only, because we can not simulate Red Gree Color intensity information). The pipeline to analysis EPIC array is below:

# myLoad <- champ.load(directory = testDir,arraytype="EPIC")
# We simulated EPIC data from beta value instead of .idat file,
# but user may use above code to read .idat files directly.
# Here we we started with myLoad.

data(EPICSimData)
CpG.GUI(arraytype="EPIC")
champ.QC() # Alternatively QC.GUI(arraytype="EPIC")
myNorm <- champ.norm(arraytype="EPIC")
champ.SVD()
# If Batch detected, run champ.runCombat() here.This data is not suitable.
myDMP <- champ.DMP(arraytype="EPIC")
DMP.GUI()
myDMR <- champ.DMR()
DMR.GUI()
myDMR <- champ.DMR(arraytype="EPIC")
DMR.GUI(arraytype="EPIC")
myBlock <- champ.Block(arraytype="EPIC")
Block.GUI(arraytype="EPIC") # For this simulation data, not Differential Methylation Block is detected.
myGSEA <- champ.GSEA(arraytype="EPIC")
myEpiMod <- champ.EpiMod(arraytype="EPIC")
myRefFree <- champ.reffree()

# champ.CNA(arraytype="EPIC")
# champ.CNA() function call for intensity data, which is not included in our Simulation data.

Not that ChAMP incoporates many other functions, not all of them are successfully supporting EPIC array yet. For example, at the time I am writing the vignette, Functional Normalization from minfi package is not supporting EPIC array.

5.5 Computational Requirements

The ability to run the pipeline on a large number of samples depends somewhat on the memory available. The ChAMP pipeline runs 200 samples successfully on a computer with 8GB of memory. Beyond this it may be necessary to find a server/cluster to run the analysis on.

The champ.load() function uses the most memory. If you plan to run the analysis more than once it is recommended to to run myLoad <- champ.load() and save this list for future analyses.

In champ.DMR(), champ.norm() and champ.Block() functions, some methods may use parallel method to accelerate the speed. If your server or computer has more cores, you may specify more cores at the same time to make function faster, but which may cost more memory. You can use following code in R to detect number of your CpG Cores.

library("doParallel")
detectCores()

GUI functions requires more things. A basic framework for a GUI environment is required. If you run ChAMP on your own computer, it should be fine. But if you run ChAMP on a remote server, a local display system is essential to use these WebBrower-based GUI functions. X11 is a very common used tool for this. For Windows users, Xmanager or MobaxTerm are excellent choices.

A significant improvement for new version ChAMP is that: User can also conduct full analysis even if they are not started from .idat file. As long as you have a methylation beta matrix and corresponding Phenotypes pd file, you can conduct nearly all ChAMP analysis. It should be much more easier for user get data from TCGA or GEO, but without original .idat file.

6 Description of ChAMP Pipelines

As previously mentioned, users has the option to run each of the ChAMP functions individually to allow integration with other pipelines or to save the results of each step. Here each function is described in detail.

6.1 Loading Data

Loading Data is the first step for all analysis. ChAMP provides a loading function to get data from .idat file (with pd file (Sample_Sheet.csv) in it). The champ.load() function utilises minfi to load data from .idat files. By default this loads data from the current working directory, in this directory you should have all .idat files and a sample sheet. The sample sheet currently needs to be a .csv file as came with your results following hybridization. You can choose whether you want M-values or beta-values. For small datasets M-values are recommended22. The minfi function used to load the data from the idat files would automatically filters out the SNPs that might influent the analyais result. The SNP list are generated from Zhou’s Nucleic Acids Research paper in 201623, which is designed specifically for 450K and EPIC, also with population differences considered. As such, for 450k bead array data, before filtering for low quality probes, the dataset will include 485,512 probes. And for EPIC bead array data, before filtering probes the dataset, will include 867,531 probes.

User may use following code to load data set:

myLoad <- champ.load(testDir)
## We are not running this code here because it cost about 1 minute.

To save user’s time to open this vignette, we already saved loaded dataset into testDataSet in ChAMPdata package, you can use following code to load it:

data(testDataSet)

Now we can check the pd file (Sample_Sheet.csv), because it’s very important. Most pd files are similar but not all of them are exactly the same. So user MUST understand their pd file perfectly to achieve a successfully analysis.

myLoad$pd
##    Sample_Name Sample_Plate Sample_Group Pool_ID Project Sample_Well
## C1          C1           NA            C      NA      NA         E09
## C2          C2           NA            C      NA      NA         G09
## C3          C3           NA            C      NA      NA         E02
## C4          C4           NA            C      NA      NA         F02
## T1          T1           NA            T      NA      NA         B09
## T2          T2           NA            T      NA      NA         C09
## T3          T3           NA            T      NA      NA         E08
## T4          T4           NA            T      NA      NA         C09
##     Array      Slide                   Basename                  filenames
## C1 R03C02 7990895118 Test450K/7990895118_R03C02 Test450K/7990895118_R03C02
## C2 R05C02 7990895118 Test450K/7990895118_R05C02 Test450K/7990895118_R05C02
## C3 R01C01 9247377086 Test450K/9247377086_R01C01 Test450K/9247377086_R01C01
## C4 R02C01 9247377086 Test450K/9247377086_R02C01 Test450K/9247377086_R02C01
## T1 R06C01 7766130112 Test450K/7766130112_R06C01 Test450K/7766130112_R06C01
## T2 R01C02 7766130112 Test450K/7766130112_R01C02 Test450K/7766130112_R01C02
## T3 R01C01 7990895118 Test450K/7990895118_R01C01 Test450K/7990895118_R01C01
## T4 R01C02 7990895118 Test450K/7990895118_R01C02 Test450K/7990895118_R01C02

As we can see here, the Sample_Group contains two phenotypes: C and T, which is what we want to compare.

During the loading process, champ.load() function would also do some filtering.

All above filterings are controled by parameters in champ.load() function, user may adjust them as you wish. Note that, though most ChAMP functions are supporting isolated beta matrix analysis, which means not relying on .idat file from beginning, but champ.load() can not do filtering on beta matrix alone. For users have no .IDAT data but beta matrix and Sample_Sheet.csv, you may want do filtering with champ.filter() function and then use following functions to do analysis.

Beside this, we provide a small function CpG.GUI() for user to check their distribution of CpGs. This function can be used for any CpGs list, for example, during your analysis, whenever you get a significant CpG list from DMR, you can user following function to check the distribution of your CpG vector:

CpG.GUI(CpG=rownames(myLoad$beta),arraytype="450K")

This is a very easy function to demonstrate the distribution of your CpG list.

6.2 Quality Check

Quality Checking is an important process to make sure dataset is suitable to analysis. champ.QC() function and QC.GUI() function would draw some plots for user to easily check their data’s quality. Merely there are three plot would be generated:

champ.QC()

Or you may use QC.GUI() function as below, which may cost more memory, thus make sure you have a good server or computer when you run this all GUI function in ChAMP.

QC.GUI(CpG=rownames(myLoad$beta),arraytype="450K")

Differently from champ.QC(), QC.GUI() will provide five interactive plot. mdsPlot, type-I&II densityPlot, Sample beta distribution plot, dendrogram plot and top 1000 variable CpG’s heatmap. Users may easily interactive with these plots to see if your sample are qualited for further analysis, and check clustering result and top CpGs.

6.3 Normalization

Normalization is an essential step for data preparation. The type-I and type-II probes in one data set share different beta distributions, thus they may cause problem in DMP or DMR detection. Here you can use champ.norm() function to eliminate effect of two Type probes.

In champ.norm(), we provides four methods to do normalization. BMIQ6, SWAN1, PBC5 and FunctionalNormliazation2. There are some different between each method, user may read related paper to select one.

Until the time I am writing this vignette. FunctionalNormliazation can only support 450K, SWAN call for mset data and rgSet, FunctionalNormliazation call for rgSet, which means, FunctionalNormliazation and SWAN merely only works for .idat file users.

Note that BMIQ function has been updated to version 1.6, which would provide better normalization result for EPIC array. Some scientists may encounter error when using BMIQ on certain samples, merely because some samples’ quality are really bad that they are not even beta distribution any more. Also BMIQ function now can be run in parallel, if your computer has more than one core, you may set parameter “cores” in function to accelerate the function. If you set parameter PDFplot=TRUE, the plot for BMIQ function would be saved in resultsDir.

The code to do Normalization is below:

myNorm <- champ.norm(beta=myLoad$beta,arraytype="450K",cores=5)

Below is the one result of BMIQ function: After Normalization, you may use QC.GUI() function to check the result again, remember to set beta parameter in QC.GUI() function as myNorm.

6.4 SVD Plot

The singular value decomposition method (SVD) implemented by Teschendorff11 for methylation data is used to identify the most significant components of variation. These components of variation would ideally be biological factors of interest, but it may be technical variation (batch effects) as well. To get the most from this analysis it is useful to include as much information as possible.

If samples have been loaded from .idat files then the 18 internal controls on the bead chip (including bisulphite conversion efficiency) will be included if you set parameter RGEffect=TRUE in champ.SVD() function. Also compared with old version of package, current version of champ.SVD() would detect all valid factors to analysis, which means plot contain two following conditions:

champ.SVD() function would detect all valid covariates in your pd file. Thus you have some unique covariates want to be analyed, you may combine them into your pd file, champ.SVD() would analysis together. Note that champ.SVD() will only take pd parameter as pd file. So if you have your own covariates and want them to be analysed with current pd file, please cbind() your covariates along with myLoad$pd together, then input it as pd parameter into champ.SVD().

Note that, for different type of covariates (categorical and numeric) covariates, champ.SVD() would used different ways (Krustal.Test and Linear Regression) to calculate the significance between covariates and components. Thus please make sure your pd is an data frame and numeric covariates are transfered into “numeric” type, while categorical covariates are transfered into “factor” or “character” type. If your Age covariate was assigned as “character”, it’s really hard to our function to detect it’s actual should be analysed as numeric. Thus we suggest user have very clear understanding on their data set and pd file.

During champ.SVD() running, all detected covariates will be printed on screen, so that user may check if covariates you want to analysis are correct.

The result is a heatmap (saved as SVDsummary.pdf) of the top 6 principle components correlated to the covariates information provided. The darker colours represent a lower p-value, indicating a larger component of variation. If it becomes clear from this SVD analysis that the largest components of variation are technical factors (batch effects) then it is worth considering the experimental design and implementing other normalization methods that may help remove technical variation. ComBat is included in this pipeline to remove variation related to chip number but it or other methods may be implemented independently to remove other sources of technical variation revealed in the SVD analysis.

You may use following code to generate SVD plot:

champ.SVD(beta=myNorm,pd=myLoad$pd)

6.5 Batch Effect Correction

This function implements the ComBat normalization method12 that was developed for microarrays. The sva R package is used to implement this function. ComBat is specifically defined in the ChAMP package to correct for batch effects. More advanced users can implement ComBat using sva documentation to adjust for other batch effects. You need to set parameter batchname=c("Slide") in the function to correct them.

Note that champ.runCombat() would extract batch from your batchname, so please make sure you understand you dataset perfectly. Another important thing is, previously champ.runCombat() would automatically assume “Sample_Group” is the covariates need to be analysed, but now, we added a parameter called variablename so that users may assign their own covariates to analysis, thus for some data set whose disease phenotype is NOT Sample_Group, please remember to assign variablename parameter.

ComBat algorithsm uses an empirical Bayes method to correct for technical variation. If ComBat were applied directly to beta values zeros could be introduced as beta values have a range between 0 and 1. For this reason ChAMP logit transforms beta values before ComBat adjustment and then computes the reverse logit transformation following the ComBat adjustment. If the user has chosen to use M-values, please assign logitTrans=FALSE.

During the function champ.runCombat(), the program would automatically detect all valid factors might be corrected and list on screen, so that user may check if there are the batches you want to correct. Then if the batchname assigned by user are correct, the function would started to do batch correction.

You can use following code to do batch correction:

myCombat <- champ.runCombat(beta=myNorm,pd=myLoad$pd,batchname=c("Slide"))

The ComBat function could be a very time consuming step in the pipeline if you have a large number of samples. After the correction, you may use champ.SVD() function to check the corrected result.

6.6 Differential Methylation Probes

Differential Methylation Probes (DMP) is the most commonly analysed result for methylation research. Here users may use champ.DMP() function to calculate differential methylation probes. And use DMP.GUI() to check the result.

champ.DMP() function implements the limma package to calculate the p-value for differential methylation using a linear model. At present, the function can only compute differential methylation between two groups. By default, the two groups are determined by the Sample Group column in the sample sheet. If more than two groups are present the function will calculate the differential methylation between the first two unique groups in the file. As the function is running it will print which two groups the contrast is being computed for and will then print out how many DMPs were found for the given p-value.

The result of champ.DMP() included the p value, t value logFC between two groups, also includes the annotation for each probe, the average beta (for the sample group) and the delta beta value for the two groups used in the comparison. It is planned that a future version of ChAMP will include the ability to compare multiple groups. However, more advanced users can run limma with other settings and use the output (including all probes and their p-values) for the DMR hunter.

Users may use following code to get DMPs:

myDMP <- champ.DMP(beta = myNorm,pheno=myLoad$pd$Sample_Group)
##       Contrasts
## Levels pT-pC
##     pC    -1
##     pT     1

We can check the result of champ.DMP() as below:

head(myDMP)
##                 logFC   AveExpr         t      P.Value   adj.P.Val
## cg06822689  0.7222816 0.3969473  27.12991 2.327953e-08 0.009912028
## cg04472725  0.6698029 0.3733044  22.99951 7.324487e-08 0.013488219
## cg04645886  0.5622509 0.6497356  21.16730 1.301067e-07 0.013488219
## cg20837557 -0.4272843 0.4431514 -20.52186 1.611528e-07 0.013488219
## cg00841760  0.2227482 0.5512956  20.15955 1.822466e-07 0.013488219
## cg05572370  0.1662926 0.5826421  20.03715 1.900717e-07 0.013488219
##                   B      C_AVG     T_AVG  deltaBeta CHR   MAPINFO Strand
## cg06822689 7.476378 0.03580651 0.7580881  0.7222816   3 141516291      R
## cg04472725 7.030821 0.03840299 0.7082059  0.6698029   3 141516232      F
## cg04645886 6.770387 0.36861015 0.9308610  0.5622509   4   1214608      F
## cg20837557 6.666781 0.65679358 0.2295093 -0.4272843   6 146864278      F
## cg00841760 6.605583 0.43992150 0.6626697  0.2227482   7 158362966      F
## cg05572370 6.584391 0.49949577 0.6657884  0.1662926  11   2323247      F
##            Type    gene feature     cgi        feat.cgi
## cg06822689   II    GRK7    Body  island     Body-island
## cg04472725   II    GRK7    Body  island     Body-island
## cg04645886   II   CTBP1    Body opensea    Body-opensea
## cg20837557   II   RAB32 TSS1500   shore   TSS1500-shore
## cg00841760   II  PTPRN2    Body  island     Body-island
## cg05572370   II TSPAN32 1stExon opensea 1stExon-opensea
##               UCSC_CpG_Islands_Name  DHS Enhancer                 Phantom
## cg06822689 chr3:141516055-141516639   NA     TRUE                        
## cg04472725 chr3:141516055-141516639   NA     TRUE                        
## cg04645886                            NA       NA                        
## cg20837557 chr6:146864481-146865426   NA       NA                        
## cg00841760 chr7:158361909-158362970   NA       NA                        
## cg05572370                          TRUE       NA low-CpG:2279812-2279967
##            Probe_SNPs Probe_SNPs_10
## cg06822689                         
## cg04472725                         
## cg04645886                         
## cg20837557                         
## cg00841760                         
## cg05572370

In new version of ChAMP, we includs a nice GUI interface for user to check the result of myDMP generated from above code. You just need to provide the “unmodified” result of champ.DMP (myDMP), orginal beta matrix you used to generat DMP (beta) and covariates you want to analysis. You can used following code to open DMP.GUI() function.

Note that your myDMP reuslt may only be calcuated from two phenotypes, but your covariates may contain 3 or even more phenotypes.

DMP.GUI(DMP=myDMP,beta=myNorm,pheno=myLoad$pd$Sample_Group)

The left panel indicates the cut off parameters for significant CpGs. Currently only two parameters are supported here: p value and abs(logFC). Users may set these two values then press “Submit” to generate significant CpGs table on the right. On the right is the table of significant CpGs, all information are included, the table can be ranked by each column.

The second tab provides the heatmap of ALL significant CpGs you generate by left panel cutoffs.

The third panel provides the barplot of proportion of each CpG feature and cgi summary information for hyper and hypo CpGs. You may press the marker in the legend to close or open certain bars.

The fourth tab is controled by the Gene parameter on the left panel. You may select an gene name (the example default is NFIX here) and press “submit”. Then the plot for this gene will be plot above. Note that the x axis here is not the actual MAPINFO of each CpG. The distance between each CpG are the same. There are two kinds of lines plotted for each phenotype, the solid lines represent Mean Value plot, and the dash line represent the Loess line, user may close or open they by clicking the marker on the legend. Below is the feature and cgi information of each CpG. At the bottom of this tab, is the gene information extracted from wikigene, which is also controled by the gene parameter on the left.

The fifth tab includes two plots. At the top is the enrichment plot for top 70 genes involved in significant CpGs you select by left cutoff parameters. The number in each bar indicates how many hyper or hypo differential methylated CpGs are included in this gene. Users may use this plot to find intereted genes then use tab 4 to plot these genes. At the bottom of this tab, is the boxplot of certain CpGs, which is controled by parameter on the left panel.

Note that all plots here can be Zoomed in or out. And can be downloaded directly. You may even change the shape of the brower to change the resize of plot then download it with different resolution.

6.7 Differential Methylation Regions

Differentially Methylated Eegions (DMRs) are extended and discreet segments of the genome that show a quantitative alteration in DNA methylation levels between two groups. champ.DMR() is the function ChAMP provides to computes and returns a data frame of probes binned into discreet differentially methylated regions (DMRs), with accompanying p-value.

There are three algotithms integrated to do this job, Bumphunter, ProbeLasso and DMRcate. For Bumphunter algotithms, it would firstly cluster all probes into small clusters (or regions), then apply random permulation method to estimate candidate DMRs. This method is very user friendly and is not relying on any output of previous functions. The permulation steps in Bumphunter algotithms may be a little bit slow, but user may assign more core to accelerate it by parallel more threads. The result of bumphunter algotithm is a dataframe contain all detected DMRs, with their length, clusters, number of CpGs.e.g annotated.

For ProbeLasso method, the final data frame is distilled from a limma output of probes and their association statistics. Previously, the result from champ.DMP() must be inputted into champ.DMR() function to do run ProbeLasso, but now we have upgraded the function and ProbeLasso needs no result from champ.DMP() anymore. For the principle and mechanism of ProbeLasso function, user may find it in original paper13.

The new method here incoporated in new version ChAMP is DMRcate. This is a data-driven approach that is agnostic to all annotations except for spatial ones, specifically chromosomal coordinates. Critical to DMRcate are robust estimates of differential methylation (DM) at individual CpG sites derived from limma, arguably the most widely used tool for microarray analysis. DMRcate pass the square of the moderated t statistic calculated on each 450K probe to DMR-finding function. Then DMRcate would apply a Gaussian kernel to smooth this metric within a given window, and also derive an expected value of the smoothed estimate (in other words, one with no experimental effect) from the varying density of CpGs sites incurred by reduced representation and irregular spacing. For more detail about DMRcate function, user may read the original paper [$Peters2015] and DMRcate R pacakge.

Users may use following code to generate DMR:

myDMR <- champ.DMR(beta=myNorm,pheno=myLoad$pd$Sample_Group,method="Bumphunter")

The result of Bumphunter, DMRcate and ProbeLasso functions are all dataframe, but main contain different columns. Here we can have a check on DMRcate result.

head(myDMR$DMRcateDMR)
##       seqnames     start       end width strand no.cpgs        minfdr
## DMR_1     chr2 177021702 177030228  8527      *      48  3.243739e-69
## DMR_2     chr2 177012117 177017797  5681      *      33  1.137990e-59
## DMR_3    chr12 115134148 115136308  2161      *      51 3.571093e-108
## DMR_4    chr11  31820441  31828715  8275      *      40  3.087712e-43
## DMR_5    chr17  46655289  46658170  2882      *      27  1.565729e-85
## DMR_6     chr6  32115964  32118811  2848      *      50 5.055382e-123
##           Stouffer maxbetafc meanbetafc
## DMR_1 4.011352e-24 0.6962753  0.4383726
## DMR_2 6.555658e-20 0.6138975  0.4676655
## DMR_3 5.711951e-19 0.5372083  0.3601741
## DMR_4 1.206741e-18 0.5673115  0.4112588
## DMR_5 2.087453e-17 0.6859260  0.4812521
## DMR_6 7.216567e-16 0.5772168  0.2971849
##                                                                                              overlapping.promoters
## DMR_1                                                                       HOXD3-001, HOXD3-004, RP11-387A1.5-001
## DMR_2                                                                             HOXD4-001, MIR10B-201, HOXD3-005
## DMR_3                                                                                                         <NA>
## DMR_4 PAX6-016, PAX6-006, PAX6-014, PAX6-015, PAX6-007, PAX6-028, PAX6-025, PAX6-027, PAX6-029, PAX6-024, PAX6-026
## DMR_5                                                      HOXB4-001, MIR10A-201, HOXB3-009, HOXB3-005, MIR10A-001
## DMR_6                                                        PRRT1-002, PRRT1-201, PRRT1-007, PRRT1-006, PRRT1-003

Just like DMP result, we provide DMR.GUI() as interface for DMR result. All result from three different method of champ.DMR() are accepted in the GUI function. The DMR.GUI() function would automatically detected what kind of input the DMR result is. Thus it’s not recommended to modify the structure of myDMR result.

DMR.GUI(DMR=myDMR)
# It might be a little bit slow to open DMR.GUI() because function need to extract annotation for CpGs from DMR. Might take 30 seconds.

Just like the DMP.GUI above, there is a panel for parameters on the left, user may set the parameters to select significant DMRs. We currently only provide two parameters: p value and min number probes. The above title will indicate the method you have used to get DMR result (here is DMRcate). Note that for different method the p value may indicate different columns. Bumphunter is pvaluearea, while ProbeLasso is dmrP, and there is no cutoff for DMRcate result, which has been done during calculation. After your selection and press “Submit”, you will get table of significant DMRs on the right. For different method, the table content might be different. Thus user should research what method you want to use to find DMRs.

The second tab provides a table for all CpGs involvded in these DMRs, which can be regarded as a mapping list between CpGs and DMRs. Also if you can not get gene information in first tab, you can get them here. In this table, for each CpGs, their corresponding genes and DMRs are marked.

The third plot is the plot of each DMR, just like the gene plot in DMP.GUI() function. Here the x axis are not real MAPINFO of each CpG, the distance between each CpGs are equal. Also I have feature and CpGs marked under the plot. At the top of this tab, are the table of all CpGs in this DMR. You can use the DMR index parameter on the left panel to select DMR you want to check.

The fourth tab provide the enrichment information of genes and heatmap. Note that there might be many genes and CpGs in this plot. So if you have too many DMRs selected, the plot might be too big to be open. But you don’t need to worry can not see it clearly, because you may zoom in as much as you can to check details.

6.8 Differential Methylation Blocks

In recent years, there are more researchs about Differential Methylated Blocks19. Here we provide a function to calculate methylation blocks. In our Block-finder function, champ.Block() would firstly calculate small clusters (regions) on whole genome based on their locations on genome. Then for each cluster (region), its mean value and mean position would be calcuated, thus you can assume that each region would be collapsed into a dot. Then Bumphunter algorithsm will be used to find “DMRs” over these regions (dots after collapse). Note that only open sea regions will be used to calculate Blocks. In our previous paper25 we demonstrated that Differential Methylated Blocks may show universal feature across various cancers.

User may use following code to generate methylation blocks:

myBlock <- champ.Block(beta=myNorm,pheno=myLoad$pd$Sample_Group,arraytype="450K")

A lot of things would be returned by champ.Block() function, but the most important is the first list: Block. You may check it like below:

head(myBlock$Block)
##         chr     start       end      value     area cluster indexStart
## Block_1  12 125405786 133245207 -0.1398238 61.66232     649      27996
## Block_2   6  28839951  30308328 -0.1725335 40.20032     356      81359
## Block_3  15  24778439  27360551 -0.1752729 38.73531     732      35251
## Block_4  14 100851839 102144080 -0.2166707 35.75066     724      34607
## Block_5  11 130933979 133954227 -0.1833729 34.29073     616      22569
## Block_6   7  40026390  42107777 -0.2616146 24.85339     400      88773
##         indexEnd   L clusterL      p.value  fwer  p.valueArea fwerArea
## Block_1    28436 441     2176 1.698689e-05 0.028 1.698689e-05    0.028
## Block_2    81591 233     2396 1.698689e-05 0.028 1.577354e-04    0.136
## Block_3    35471 221      261 1.698689e-05 0.028 2.196162e-04    0.136
## Block_4    34771 165      708 1.698689e-05 0.028 3.858451e-04    0.188
## Block_5    22755 187     1601 1.698689e-05 0.028 4.622861e-04    0.188
## Block_6    88867  95     1497 1.698689e-05 0.028 1.571287e-03    0.420

And we provide a GUI function for users to check the result of myBlock above.

Block.GUI(Block=myBlock,beta=myNorm,pheno=myLoad$pd$Sample_Group,runDMP=TRUE,compare.group=NULL,arraytype="450K")

Note that there are two special parameters here: runDMP means if Block.GUI() function should calculate DMP for each CpGs, if not, the GUI() function would only return the annotation of all involvded CpGs. And just like champ.DMP() function, compare.group need to be assigned if you want to calculate DMP while running Block.GUI().

Just like DMP.GUI() and DMR.GUI(), the p value cutoff and min cluster (similar to min probes in DMR.GUI()) controls the number of significant Blocks you selected. Then the information of significant blocks will be listed on the right.

The second tab provides information of all CpGs involved in all Blocks. For methylation blocks, there are actually three unit integrated together: Block, Cluster (regions) and CpG. Blocks are formed by some clusters, and clusters are formed by some CpGs. Thus the second tab actually provides a mapping solution from CpGs to Blocks. User may use this tab to find their interested genes and CpGs. Then use DMP.GUI() to combind research between Blocks and DMPs.

The third tab provides the plot for each Blocks. Be caution that each Block may contain many many clusters. Thus it would be slow to calculate and plot. And please make sure you have a good computer or server to run this function. Note that plot is slightly different from DMR.GUI() and DMP.GUI() that the x axis is actually the REAL MAPINFO of each cluster, because I also need to map CpG islands position above the lines (Those dark red dots).

6.9 Gene Set Enrichment Analysis

Gene Set Enrichment Analysis is a very important step for most bioinformatics work. After previous steps, you may already get some significant DMPs or DMRs, thus you may want to know if genes involvded in these significant DMPs or DMRs are actually corresponding to some pathways. To achieve this analysis, you can use champ.GSEA() to do GSEA analysis.

champ.GSEA() would automatically extract genes in myDMP and myDMR (No matter what method you have used to generate DMRs). Also if you have your own significant CpG list or gene list calculated from THE SAME methylation arrays (450K or EPIC) by other method not included in ChAMP. You may also format them into list and input the function to do GSEA (Please assign each element in your own CpG list or gene list a name, otherwise error might be triggered). champ.GSEA() would automatically extract gene information, transfer CpG information into gene information then conduct GSEA on each list. During the mapping process between CpGs to genes, if there are multiple CpGs mapped to one gene, that gene would only be counted once in case of over counting.

There are two ways to do GSEA. In previous version, ChAMP used pathway information downloaded from MSigDB. Then Fisher Exact Test will be used to calculate the enrichment status of each pathway. After gene enrichment analysis, champ.GSEA() function would automatically return pathways with p value smaller then adjPval cutoff.

For most situation, above fisher exact test method should works fine. However, as the Geeleher’s Paper addressed. Since different genes has different numbers of CpGs contained inside, the two situation that one genes with 50 CpGs inside but only one of them show significant methylation, and one gene with 2 CpGs inside but two are significant methylated should not be eaqualy treated. The solution is use number of CpGs contained by genes to correct significant genes. Here we used goseq package20 to solve this problem. This package was originally designed to correct bias caused by gene’s length during GSEA procedure. Here we used number of CpGs contained by each gene replace length as biased data, to correct this issue.

Note: If user want to correct the biased caused between genes’ number and CpGs’ number, they may set method parameter in champ.GSEA() function as “goseq” to use goseq method to do GSEA, or user may set it as “fisher” to do normal Gene Set Enrichment Analysis.

User may use following command to do GSEA:

myGSEA <- champ.GSEA(beta=myNorm,DMP=myDMP,DMR=myDMR,arraytype="450K",adjPval=0.05)
# myDMP and myDMR could (not must) be used directly.

After the calculation, for each list (DMP,DMR or other list), one GSEA result would be returned. User may check the result like below:

head(myGSEA$DMP)
##         category over_represented_pvalue under_represented_pvalue
## 4620  GO:0009952            3.028466e-10                        1
## 13116 GO:0048704            1.029305e-09                        1
## 13034 GO:0048562            1.982377e-09                        1
## 1339  GO:0003002            6.331492e-09                        1
## 13039 GO:0048568            7.967879e-09                        1
## 13118 GO:0048706            7.990748e-09                        1
##       numDEInCat numInCat                                     term
## 4620          65      191 anterior/posterior pattern specification
## 13116         40       92  embryonic skeletal system morphogenesis
## 13034         80      276            embryonic organ morphogenesis
## 1339          85      314                          regionalization
## 13039        100      396              embryonic organ development
## 13118         47      121    embryonic skeletal system development
##       ontology over_represented_adjPvalue
## 4620        BP               6.269531e-06
## 13116       BP               1.065434e-05
## 13034       BP               1.367973e-05
## 1339        BP               2.757074e-05
## 13039       BP               2.757074e-05
## 13118       BP               2.757074e-05
# Above is the GSEA result for differential methylation probes.
head(myGSEA$DMR)
##         category over_represented_pvalue under_represented_pvalue
## 3725  GO:0007389            3.301398e-22                        1
## 11270 GO:0043565            8.287681e-22                        1
## 1339  GO:0003002            2.310823e-21                        1
## 4620  GO:0009952            2.799958e-21                        1
## 1583  GO:0003700            3.838911e-19                        1
## 395   GO:0001071            3.911054e-19                        1
##       numDEInCat numInCat
## 3725          49      410
## 11270         72      957
## 1339          43      314
## 4620          36      191
## 1583          70     1093
## 395           70     1094
##                                                               term
## 3725                                 pattern specification process
## 11270                                sequence-specific DNA binding
## 1339                                               regionalization
## 4620                      anterior/posterior pattern specification
## 1583  transcription factor activity, sequence-specific DNA binding
## 395             nucleic acid binding transcription factor activity
##       ontology over_represented_adjPvalue
## 3725        BP               6.834554e-18
## 11270       MF               8.578578e-18
## 1339        BP               1.449118e-17
## 4620        BP               1.449118e-17
## 1583        MF               1.349444e-15
## 395         MF               1.349444e-15
# Above is the GSEA result for differential methylation regions.
# Too many information may be printed, so we are not going to show the result here.

6.10 Differential Methylated Interaction Hotspots

Another function included in ChAMP is champ.EpiMod(). This function uses FEM package to calcuate the Interaction Hotspots in modules for certain data set. By “interactome hotspot” we mean a connected subnetwork of the protein interaction network (PIN) with an exceptionally large average edge-weight density in relation to the rest of the network. The weight edges are constructed from the statistics of association of DNA methylation. Thus, the champ.EpiMod() function can be viewed as a functional supervised algorithm, which uses a network of relations between genes (in our case a PPI network), to identify subnetworks where a significant number of genes are associated with a phenotype of interest (POI).

Thus compared with GSEA and other pathway detection method, champ.EpiMod() (or FEM package) used more strict way to estimate small modules located in big pathways. Each edge and genes’ status would be calculated based on the data set you inputted. Orginally, FEM package was developed to do interaction analysis between methylation data and RNA expression data, which could get modules show both Differential Methylation and Differential Expression status. However here we only provide usage to calcuated module show differential methylation. More advanced user may refer to FEM package for more information.

User may use following code to do champ.EpiMod()

myEpiMod <- champ.EpiMod(beta=myNorm,pheno=myLoad$pd$Sample_Group)

The all significant modules would be returned. And PDF version plots would returned in resultsDir in function parameter. Like below:

Above we demonstrated four modules detect by champ.EpiMod() based on our small 450K test data. But in practic, each modules would be saved in to a pdf file individually in resultsDir directory.

6.11 Copy Number Variation

For Copy Number Variation analysis, we provide champ.CNA() to achieve this work. This function uses the HumanMethylation450 or HumanMethylationEPIC data to identify copy number alterations18. The function utilises the intensity values for each probe to count copy number and determine if copy number alterations are present. Copy number is determined using the CopyNumber package.

Basically there are two methods you may choose to do CNA analysis: you may compare the copy number various between case samples and control samples; or you can compare copy number of each sample to the average copy number status of all samples. For the first method, you may specify which groups of samples in your pheno parameter, or ChAMP already included some blood control samples itself, you may use them as control then calculate the copy number variance, all you need to do is assign parameter control=TRUE and controlGroup="champCtls". For the other method, you may assign control=FALSE to apply.

There are two kinds of plots would be generated, analysis for each sample (sampleCNA=TRUE parameter) and analysis for each group (groupFreqPlots=TRUE parameter). Like champ.QC() function before, there are two parameter could be assigned for plots. The Rplot parameter would control if champ.CNA() would plot with R session, and PDFplot parameter would control if champ.CNA() would save PDF file to resultsDir. By default, only PDFplot is TRUE.

Note that different from old version of ChAMP. No batch correction would be run automatically on intensity data, if you want to correct batch effect on intensity data, please use champ.runCombat(), the usage for intensity data is exactly the same for beta or M matrix, the only difference is you need to replace the beta value with myLoad$intensity and set logitTrans=FALSE.

Feber18 compared the results obtained using this method to copy number data from Illumina CytoSNP array and Affymetrix SNP 6.0 arrays and found that using this method for 450k data was effective in identifying regions of gain and loss.

Users may use following code to calculate Copy Number Variance:

myCNA <- champ.CNA(intensity=myLoad$intensity,pheno=myLoad$pd$Sample_Group)

The main usage of champ.CNA() function is to generate plot for sample and groups in pheno. Like below:

6.12 Cell Type Heterogeneity

While many methylation sample are collected from whole blood, their methylation status actually are combination of various cell types. These cell-type specified methylation may cover ture epigenome signals as well as true relationships between CpGs sites and phenotypes, or themsleves are the reason of some diseases. Thus many methods have be invented to deal with cell heterogeneity problem, like RefbaseEWAS and RefFreeEWAS incoporated.

RefbaseEWAS method is a method similar to regression calibration, for inferring changes in the distribution of white blood cells between different subpopulations (e.g. cases and controls) using DNA methylation signatures, in combination with a previously obtained external validation set consisting of signatures from purified leukocyte samples.

We integrates another cell type heterogeneity correction method RefFreeEWAS in ChAMP. The new RefFreeEWAS method is similar to non-negative matrix factorization, with additional constraints and additional utilities. This function is specifically suitable for tissue datasets such as placenta, saliva, adipose or tumor tissue. This method closely related to surrogate variable analysis, relies on a simple projection based on singular value decomposition (SVD), as does SVA. In SVA, the residuals of a linear model are decomposed into a factor-analytic structure and the factors are used subsequently in a regression model, with iteration resulting in a final set of surrogate variables.

Here we prepared two cell-type purified methylation reference, one for 27K and the other for 450K, user may use any of them to do cell heterogeneity to correct heterogeneity based on cell propotions of each sample. However, though RefbaseEWAS method would return fairly accurate cell propotion, this method can ONLY be applied for whole blood samples, which contain cells in the reference. After champ.refbase() function, cell type heterogeneity corrected beta matrix, and cell-type specific propotions in each sample will be returned.

On the other hand, RefFreeEWAS method requires two parameters in champ.reffree() function, one is the covariates matrix for inputted dataset, which could be a matrix or a vector. containing various phenotypes of each sample. The other important parameter is number of latent variable K, which represent numbers of latent cell type mixed in the dataset. If users don’t know the correct number of latent variable (cell types), they could ignore this parameter, and champ.reffree() will apply Random Matrix Theory from isva package to estimate number of latent variables.

User may use following code to do cell heterogeneity analysis:

myRefFree <- champ.reffree(beta=myNorm,pheno=myLoad$pd$Sample_Group)

Then we can see the q value calculated for each CpGs, user may relying on the q value below to select significant CpGs after cell heterogeneity correction:

head(myRefFree$qvBeta)
##                           pheno
## cg00001349 0.2359390 0.12512447
## cg00001583 0.2736186 0.24987531
## cg00002028 0.2169540 0.04471770
## cg00002719 0.2169540 0.04183915
## cg00002837 0.4068856 0.04658125
## cg00003202 0.2352289 0.04183915

For RefbaseEWAS method, user could get cell propotion of each sample by champ.refbase(). But do remember champ.refbase() can only works on Blood Sample Data Set.

myRefBase <- champ.refbase(beta=myNorm,arraytype="450K")
# Our test data set is not blood.

7 Summary

Above are all pipelines and functions included in new ChAMP. It’s a totally new version that different from old one, we hope to provide you most convenient and fast methylation analysis pipeline for 450K and EPIC array. It’s a very comprehensive package with thousands of lines of code, we already made our best to make sure it’s correct and fast.

In the future we may make it better, along with more functions and usages. If you have question,bug report, suggestion for ChAMP, please email to champ450k@gmail.com or my personal email tianyuan1991hit@gmail.com.

8 Citing ChAMP

ChAMP is implemented by incoporated many other very talented scientists’ work. Some functions are achieved by adopting other packages, while some functions are coded by ChAMP’s author but with ideas and theories provided by other works. So when you are using functions provided by ChAMP, please citate other scientists’s work correctly, so that people who invented these theory and method would get corresponding credits and honor.

If you want to citate ChAMP package itself, please add following papers in your article. This article is about ChAMP, an integrated analysis pipeline for DNA methylation analysis:

Morris, T. J., Butcher, L. M., Feber, A., Teschendorff, A. E., Chakravarthy, A. R., Wojdacz, T. K., and Beck, S. (2014). Champ: 450k chip analysis methylation pipeline pg - 428-30. Bioinformatics, 30(3), 428-30.

The loading part of ChAMP (champ.load) used minfi package, so does FunctionalNormliazation, SWAN normalization methods. If you want used above two normalization methods, you may want to citate below minfi paper:

Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD and Irizarry RA (2014). Minfi: A flexible and comprehensive Bioconductor package for the analysis of Infinium DNA Methylation microarrays. Bioinformatics, 30(10), pp. 1363-1369. doi: 10.1093/bioinformatics/btu049.

Jean-Philippe Fortin, Timothy Triche, Kasper Hansen. Preprocessing, normalization and integration of the Illumina HumanMethylationEPIC array. bioRxiv 065490; doi: https://doi.org/10.1101/065490

Note that during filtering step, we used below paper’s SNP list, which included SNP located 5 bp close to CpG sites, with global MAF >1%, and other characters. For more information about SNP being filtered, you may check below paper:

Zhou W, Laird PW and Shen H: Comprehensive characterization, annotation and innovative use of Infinium DNA Methylation BeadChip probes. Nucleic Acids Research 2016

The default normalization method in ChAMP is BMIQ, which is invented by Andrew Teschendorff. This is a very popular and well-designed normalization function, you may citate below paper:

Teschendorff AE et al . A beta-mixture quantile normalization method for correcting probe design bias in illumina infinium 450 k dna methylation data. Bioinformatics. 2013;29(2):189-196.

Another included normalization method is PBC, the related paper is below:

Dedeurwaerder S et a. Evaluation of the infinium methylation 450K technology. Epigenomics. 2011;3(6):771-784.

champ.SVD() function is a very good option to detect effect of factors on dataset. The paper related with this function is below:

Teschendorff AE et a. An epigenetic signature in peripheral blood predicts active ovarian cancer. PLoS One. 2009;4(12):e8274.

ChAMP used famous batch correction method Combat from SVA package for batch correction. You may citate Combat-related paper below:

Johnson WE et a. Adjusting batch effects in microarray expression data using empirical bayes methods. Biostatistics. 2007;8(1):118-127.

Above are all functions related to Data Preparation. Below are functions and methods related to statistic result estimation.

In ChAMP we used limma to find DMP. According to user guild of limma, researches related to differential analysis should citate some specific paper. Below are two papers user may want to use:

Ritchie, ME, Phipson, B, Wu, D, Hu, Y, Law, CW, Shi, W, and Smyth, GK (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43(7), e47

Phipson, B, Lee, S, Majewski, IJ, Alexander, WS, and Smyth, GK (2016). Robust hyperparameter estimation protects against hypervariable genes and improves power to detect differential expression. Annals of Applied Statistics 10(2), 946-963.

When it comes to DMR detection, here there are three method incoporated: If you used Bumphunter method, you may use this one:

Jaffe AE et a. Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies. Int J Epidemiol. 2012;41(1):200-209.

If you used ProbeLasso function, you may use this one:

Butcher LM, Beck S. Probe lasso: A novel method to rope in differentially methylated regions with 450K dna methylation data. Methods. 2015;72:21-28.

Or if you used DMRcate function invented by Tim Peters, you may use paper below:

Peters TJ, Buckley MJ, Statham AL, et al. De novo identification of differentially methylated regions in the human genome. Epigenetics & Chromatin. 2015;8(1):1-16.

In this version of ChAMP, we provided a function to find Differential Methylation Blocks (DMB), this function is totally implemented by author of ChAMP, but the theory is based on below papers:

Hansen, KD, Timp, W, Bravo, HC, Sabunciyan, S, Langmead, B, McDonald, OG, Wen, B, Wu, H, Liu, Y, Diep, D, Briem, E, Zhang, K, Irizarry, RA, Feinberg, AP (2011). Increased methylation variation in epigenetic domains across cancer types. Nat. Genet., 43, 8:768-75.

For GSEA method, if you used “goseq” method, please cite below paper:

Paul Geeleher, Lori Hartnett, Laurance J. Egan, Aaron Golden, Raja Affendi Raja Ali, and Cathal Seoighe Gene-set analysis is severely biased when applied to genome-wide methylation data Bioinformatics 2013 : btt311v2-btt311.

Young MD, Wakefield MJ, Smyth GK and Oshlack A (2010). “Gene ontology analysis for RNA-seq: accounting for selection bias.” Genome Biology, 11, pp. R14.

We incoporated FEM package to find Differential Methylated Module among huge network. Below is the citation paper for FEM package:

Jiao Y, Widschwendter M, Teschendorff AE. A systems-level integrative framework for genome-wide dna methylation and gene expression data identifies differential gene expression modules under epigenetic control. Bioinformatics. 2014;30(16):2360-2366.

The CNA detection function was provided by Andy Feber of UCL, you may citate related paper below:

Feber A, Guilhamon P, Lechner M, et al. Using high-density dna methylation arrays to profile copy number alterations. Genome Biol. 2014;15(2):R30.

Just as above context says, we included two method for cell proportion correction, both are from Professor Houseman and his RefFreeEWAS package. If you used Refbase method in ChAMP, please citate following paper:

Houseman EA, Accomando WP, Koestler DC, et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics. 2012;13(1):1-16.

Or if you used RefFree method, you may want to citate below paper:

Houseman EA, Molitor J, Marsit CJ. Reference-free cell mixture adjustments in analysis of dna methylation data. Bioinformatics. 2014;30(10):1431-1439.

Above are all citation paper involvded by ChAMP, please remember to add them into your paper so that other scientists’s hard word would get proper evaluation and credits.

Finally, all GUI function, and function not mentioned above, like filtering, imputation, preparation between functions, API are all coded by author of ChAMP. We are now prepared a paper for that now, hopefully you can citate a paper for that too soon.

References

1. Maksimovic J et a. SWAN: Subset-quantile within array normalization for illumina infinium humanmethylation450 beadchips. Genome Biol. 2012;13(6):R44.

2. Fortin J-P, Labbe A, Lemire M, et al. Functional normalization of 450k methylation array data improves replication in large cancer studies. Genome Biology. 2014;15(11):1-17.

3. Aryee MJ, Jaffe AE, Corrada-Bravo H, et al. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics. 2014;30(10):1363-1369.

4. Fortin J-P, Triche T, Hansen K. Preprocessing, normalization and integration of the illumina humanmethylationepic array. bioRxiv. 2016. doi:10.1101/065490.

5. Dedeurwaerder S et a. Evaluation of the infinium methylation 450K technology. Epigenomics. 2011;3(6):771-784.

6. Teschendorff AE et al . A beta-mixture quantile normalization method for correcting probe design bias in illumina infinium 450 k dna methylation data. Bioinformatics. 2013;29(2):189-196.

7. Wang D et a. IMA: An r package for high-throughput analysis of illumina’s 450K infinium methylation data. Bioinformatics. 2012;28(5):729-730.

8. Davis S, Du P, Bilke S, Triche T, Jr., Bootwalla M. Methylumi: Handle illumina methylation data. 2013.

9. Assenov Y, Muller F, Lutsik P. RnBeads - Comprehensive DNA Methylation Analysis. 2012.

10. Wong C, Pidsley R, Schalkwyk L. The wateRmelon Package. 2012.

11. Teschendorff AE et a. An epigenetic signature in peripheral blood predicts active ovarian cancer. PLoS One. 2009;4(12):e8274.

12. Johnson WE et a. Adjusting batch effects in microarray expression data using empirical bayes methods. Biostatistics. 2007;8(1):118-127.

13. Butcher LM, Beck S. Probe lasso: A novel method to rope in differentially methylated regions with 450K dna methylation data. Methods. 2015;72:21-28.

14. Jaffe AE et a. Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies. Int J Epidemiol. 2012;41(1):200-209.

15. Peters TJ, Buckley MJ, Statham AL, et al. De novo identification of differentially methylated regions in the human genome. Epigenetics & Chromatin. 2015;8(1):1-16.

16. Houseman EA, Molitor J, Marsit CJ. Reference-free cell mixture adjustments in analysis of dna methylation data. Bioinformatics. 2014;30(10):1431-1439.

17. Houseman EA, Accomando WP, Koestler DC, et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics. 2012;13(1):1-16.

18. Feber A, Guilhamon P, Lechner M, et al. Using high-density dna methylation arrays to profile copy number alterations. Genome Biol. 2014;15(2):R30.

19. Hansen KD, Timp W, Bravo HC, et al. Increased methylation variation in epigenetic domains across cancer types. Nat Genet. 2011;43(8):768-775.

20. Young MD, Wakefield MJ, Smyth GK, Oshlack A. Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biol. 2010;11(2):R14.

21. Jiao Y, Widschwendter M, Teschendorff AE. A systems-level integrative framework for genome-wide dna methylation and gene expression data identifies differential gene expression modules under epigenetic control. Bioinformatics. 2014;30(16):2360-2366.

22. Zhuang J et a. A comparison of feature selection and classification methods in dna methylation studies using the illumina infinium platform. BMC Bioinformatics. 2012;13:59.

23. Zhou W, Laird PW, Shen H. Comprehensive characterization, annotation and innovative use of infinium dna methylation beadchip probes. Nucleic Acids Research. 2016. doi:10.1093/nar/gkw967.

24. Nordlund J, Bäcklin CL, Wahlberg P, et al. Genome-wide signatures of differential dna methylation in pediatric acute lymphoblastic leukemia. Genome Biology. 2013;14(9):r105. doi:10.1186/gb-2013-14-9-r105.

25. Yuan YA de J Tian AND Jiao. An integrative multi-scale analysis of the dynamic dna methylation landscape in aging. PLoS Genet. 2015;11(2):1-21.