6 Description of ChAMP Pipelines

As previously mentioned, users has the option to run each of the ChAMP functions individually to allow integration with other pipelines or to save the results of each step. Here each function is described in detail.

6.1 Loading Data

Loading Data is the first step for all analysis. ChAMP provides a loading function to get data from .idat file (with pd file (Sample_Sheet.csv) in it). The champ.load() function utilises minfi to load data from .idat files. By default this loads data from the current working directory, in this directory you should have all .idat files and a sample sheet. The sample sheet currently needs to be a .csv file as came with your results following hybridization. You can choose whether you want M-values or beta-values. For small datasets M-values are recommended²². The minfi function used to load the data from the idat files would automatically filters out the SNPs that might influent the analyais result. The SNP list are generated from Zhou’s Nucleic Acids Research paper in 2016²³, which is designed specifically for 450K and EPIC, also with population differences considered. As such, for 450k bead array data, before filtering for low quality probes, the dataset will include 485,512 probes. And for EPIC bead array data, before filtering probes the dataset, will include 867,531 probes.

User may use following code to load data set:

myLoad <- champ.load(testDir)
## We are not running this code here because it cost about 1 minute.

To save user’s time to open this vignette, we already saved loaded dataset into testDataSet in ChAMPdata package, you can use following code to load it:

data(testDataSet)

Now we can check the pd file (Sample_Sheet.csv), because it’s very important. Most pd files are similar but not all of them are exactly the same. So user MUST understand their pd file perfectly to achieve a successfully analysis.

myLoad$pd

##    Sample_Name Sample_Plate Sample_Group Pool_ID Project Sample_Well
## C1          C1           NA            C      NA      NA         E09
## C2          C2           NA            C      NA      NA         G09
## C3          C3           NA            C      NA      NA         E02
## C4          C4           NA            C      NA      NA         F02
## T1          T1           NA            T      NA      NA         B09
## T2          T2           NA            T      NA      NA         C09
## T3          T3           NA            T      NA      NA         E08
## T4          T4           NA            T      NA      NA         C09
##     Array      Slide                   Basename                  filenames
## C1 R03C02 7990895118 Test450K/7990895118_R03C02 Test450K/7990895118_R03C02
## C2 R05C02 7990895118 Test450K/7990895118_R05C02 Test450K/7990895118_R05C02
## C3 R01C01 9247377086 Test450K/9247377086_R01C01 Test450K/9247377086_R01C01
## C4 R02C01 9247377086 Test450K/9247377086_R02C01 Test450K/9247377086_R02C01
## T1 R06C01 7766130112 Test450K/7766130112_R06C01 Test450K/7766130112_R06C01
## T2 R01C02 7766130112 Test450K/7766130112_R01C02 Test450K/7766130112_R01C02
## T3 R01C01 7990895118 Test450K/7990895118_R01C01 Test450K/7990895118_R01C01
## T4 R01C02 7990895118 Test450K/7990895118_R01C02 Test450K/7990895118_R01C02

As we can see here, the Sample_Group contains two phenotypes: C and T, which is what we want to compare.

During the loading process, champ.load() function would also do some filtering.

First filter is for probes with detection p-value (< 0.01). This utilises minfi method for calculating the detection p-value which differs from the method used in Genome Studio. A failedSamples result will be printed to the screen, showing the fraction of failed probes per sample. If any of these values is high (>0.05) you may want to consider removing that sample from the analysis and rerun it. Previously we found that in many cases, only one or two samples in a big data set are not qualited for analysis, and they may have 70% or even 80% of failed probes. If we only do filtering on probes, about 80% of probes will be masked, so we developed new filtering system on both sample and probe. If certain sample’s failed probes’ ratio if above certain threshold (default = 0.1), then this sample will be discarded, then filtering to probes will be carried on rest samples. The threshold of sample and probe can be controled by parameter detSamplecut and detPcut separately.
Second, ChAMP will filter out probes with <3 beads in at least 5% of samples per probe. This default can changed with the filterBeads parameter or the frequency can be adjusted with the beadCutoff parameter.
Third, ChAMP will filter out all None-CpG probes contained in your dataset.
Fourth, ChAMP will filter all SNP-related probes. The SNP list comes from Zhou’s Nucleic Acids Research paper in 2016.
Fifth, ChAMP will filter all multi-hit probes. The multi-hit probe list comes from Nordlund’s Genome Biology Paper in 2013²⁴.
And the Sixth, ChAMP will filter out all probes located in chromosome X and Y.

All above filterings are controled by parameters in champ.load() function, user may adjust them as you wish. Note that, though most ChAMP functions are supporting isolated beta matrix analysis, which means not relying on .idat file from beginning, but champ.load() can not do filtering on beta matrix alone. For users have no .IDAT data but beta matrix and Sample_Sheet.csv, you may want do filtering with champ.filter() function and then use following functions to do analysis.

Beside this, we provide a small function CpG.GUI() for user to check their distribution of CpGs. This function can be used for any CpGs list, for example, during your analysis, whenever you get a significant CpG list from DMR, you can user following function to check the distribution of your CpG vector:

CpG.GUI(CpG=rownames(myLoad$beta),arraytype="450K")

This is a very easy function to demonstrate the distribution of your CpG list.

6.2 Quality Check

Quality Checking is an important process to make sure dataset is suitable to analysis. champ.QC() function and QC.GUI() function would draw some plots for user to easily check their data’s quality. Merely there are three plot would be generated:

champ.QC()

mdsPlot (Multidimensional Scaling Plot): This plot allows a visualization of the similarity of samples based on the top 1000 most variable probes amongst all samples. Samples are colored by Sample Groups in this plot.
densityPlot: The beta distribution lines for each sample, users may use this figure to find samples not very good (maybe with too many probes deleted during loading process.)
dendrogram: The clustering plot for all samples, you may select different method to generat this plot. There is a parameter Feature.sel="None" in champ.QC() function. While “None” means the distance between samples would be calculated directly by all probe ( your server may clush if you directly do this on large data sets), “SVD” means champ.QC() function would use SVD method to do deconvolute on beta matrix, then select significant components based only EstDimRMT() method from “isva” package. Then the distance between samples would be calculated based on these components.

Or you may use QC.GUI() function as below, which may cost more memory, thus make sure you have a good server or computer when you run this all GUI function in ChAMP.

QC.GUI(CpG=rownames(myLoad$beta),arraytype="450K")

Differently from champ.QC(), QC.GUI() will provide five interactive plot. mdsPlot, type-I&II densityPlot, Sample beta distribution plot, dendrogram plot and top 1000 variable CpG’s heatmap. Users may easily interactive with these plots to see if your sample are qualited for further analysis, and check clustering result and top CpGs.

Pobe-Type-I&II plot: the beta distributions of Type-I and Type-II probes in your dataset, which could help you to check your data set’s normalization status.
Top variable CpGs’ heatmap: This heatmap would select most variable CpGs, and do heatmap on them.

6.3 Normalization

Normalization is an essential step for data preparation. The type-I and type-II probes in one data set share different beta distributions, thus they may cause problem in DMP or DMR detection. Here you can use champ.norm() function to eliminate effect of two Type probes.

In champ.norm(), we provides four methods to do normalization. BMIQ⁶, SWAN¹, PBC⁵ and FunctionalNormliazation². There are some different between each method, user may read related paper to select one.

Until the time I am writing this vignette. FunctionalNormliazation can only support 450K, SWAN call for mset data and rgSet, FunctionalNormliazation call for rgSet, which means, FunctionalNormliazation and SWAN merely only works for .idat file users.

Note that BMIQ function has been updated to version 1.6, which would provide better normalization result for EPIC array. Some scientists may encounter error when using BMIQ on certain samples, merely because some samples’ quality are really bad that they are not even beta distribution any more. Also BMIQ function now can be run in parallel, if your computer has more than one core, you may set parameter “cores” in function to accelerate the function. If you set parameter PDFplot=TRUE, the plot for BMIQ function would be saved in resultsDir.

The code to do Normalization is below:

myNorm <- champ.norm(beta=myLoad$beta,arraytype="450K",cores=5)

Below is the one result of BMIQ function: After Normalization, you may use QC.GUI() function to check the result again, remember to set beta parameter in QC.GUI() function as myNorm.

6.4 SVD Plot

The singular value decomposition method (SVD) implemented by Teschendorff¹¹ for methylation data is used to identify the most significant components of variation. These components of variation would ideally be biological factors of interest, but it may be technical variation (batch effects) as well. To get the most from this analysis it is useful to include as much information as possible.

If samples have been loaded from .idat files then the 18 internal controls on the bead chip (including bisulphite conversion efficiency) will be included if you set parameter RGEffect=TRUE in champ.SVD() function. Also compared with old version of package, current version of champ.SVD() would detect all valid factors to analysis, which means plot contain two following conditions:

Covariates are not related with name, Sample_Name, file_Name .e.g.
Covariates are contain at least two different values.

champ.SVD() function would detect all valid covariates in your pd file. Thus you have some unique covariates want to be analyed, you may combine them into your pd file, champ.SVD() would analysis together. Note that champ.SVD() will only take pd parameter as pd file. So if you have your own covariates and want them to be analysed with current pd file, please cbind() your covariates along with myLoad$pd together, then input it as pd parameter into champ.SVD().

Note that, for different type of covariates (categorical and numeric) covariates, champ.SVD() would used different ways (Krustal.Test and Linear Regression) to calculate the significance between covariates and components. Thus please make sure your pd is an data frame and numeric covariates are transfered into “numeric” type, while categorical covariates are transfered into “factor” or “character” type. If your Age covariate was assigned as “character”, it’s really hard to our function to detect it’s actual should be analysed as numeric. Thus we suggest user have very clear understanding on their data set and pd file.

During champ.SVD() running, all detected covariates will be printed on screen, so that user may check if covariates you want to analysis are correct.

The result is a heatmap (saved as SVDsummary.pdf) of the top 6 principle components correlated to the covariates information provided. The darker colours represent a lower p-value, indicating a larger component of variation. If it becomes clear from this SVD analysis that the largest components of variation are technical factors (batch effects) then it is worth considering the experimental design and implementing other normalization methods that may help remove technical variation. ComBat is included in this pipeline to remove variation related to chip number but it or other methods may be implemented independently to remove other sources of technical variation revealed in the SVD analysis.

You may use following code to generate SVD plot:

champ.SVD(beta=myNorm,pd=myLoad$pd)

6.5 Batch Effect Correction

This function implements the ComBat normalization method¹² that was developed for microarrays. The sva R package is used to implement this function. ComBat is specifically defined in the ChAMP package to correct for batch effects. More advanced users can implement ComBat using sva documentation to adjust for other batch effects. You need to set parameter batchname=c("Slide") in the function to correct them.

Note that champ.runCombat() would extract batch from your batchname, so please make sure you understand you dataset perfectly. Another important thing is, previously champ.runCombat() would automatically assume “Sample_Group” is the covariates need to be analysed, but now, we added a parameter called variablename so that users may assign their own covariates to analysis, thus for some data set whose disease phenotype is NOT Sample_Group, please remember to assign variablename parameter.

ComBat algorithsm uses an empirical Bayes method to correct for technical variation. If ComBat were applied directly to beta values zeros could be introduced as beta values have a range between 0 and 1. For this reason ChAMP logit transforms beta values before ComBat adjustment and then computes the reverse logit transformation following the ComBat adjustment. If the user has chosen to use M-values, please assign logitTrans=FALSE.

During the function champ.runCombat(), the program would automatically detect all valid factors might be corrected and list on screen, so that user may check if there are the batches you want to correct. Then if the batchname assigned by user are correct, the function would started to do batch correction.

You can use following code to do batch correction:

myCombat <- champ.runCombat(beta=myNorm,pd=myLoad$pd,batchname=c("Slide"))

The ComBat function could be a very time consuming step in the pipeline if you have a large number of samples. After the correction, you may use champ.SVD() function to check the corrected result.

6.6 Differential Methylation Probes

Differential Methylation Probes (DMP) is the most commonly analysed result for methylation research. Here users may use champ.DMP() function to calculate differential methylation probes. And use DMP.GUI() to check the result.

champ.DMP() function implements the limma package to calculate the p-value for differential methylation using a linear model. At present, the function can only compute differential methylation between two groups. By default, the two groups are determined by the Sample Group column in the sample sheet. If more than two groups are present the function will calculate the differential methylation between the first two unique groups in the file. As the function is running it will print which two groups the contrast is being computed for and will then print out how many DMPs were found for the given p-value.

The result of champ.DMP() included the p value, t value logFC between two groups, also includes the annotation for each probe, the average beta (for the sample group) and the delta beta value for the two groups used in the comparison. It is planned that a future version of ChAMP will include the ability to compare multiple groups. However, more advanced users can run limma with other settings and use the output (including all probes and their p-values) for the DMR hunter.

Users may use following code to get DMPs:

myDMP <- champ.DMP(beta = myNorm,pheno=myLoad$pd$Sample_Group)

##       Contrasts
## Levels pT-pC
##     pC    -1
##     pT     1

We can check the result of champ.DMP() as below:

head(myDMP)

##                 logFC   AveExpr         t      P.Value   adj.P.Val
## cg06822689  0.7222816 0.3969473  27.12991 2.327953e-08 0.009912028
## cg04472725  0.6698029 0.3733044  22.99951 7.324487e-08 0.013488219
## cg04645886  0.5622509 0.6497356  21.16730 1.301067e-07 0.013488219
## cg20837557 -0.4272843 0.4431514 -20.52186 1.611528e-07 0.013488219
## cg00841760  0.2227482 0.5512956  20.15955 1.822466e-07 0.013488219
## cg05572370  0.1662926 0.5826421  20.03715 1.900717e-07 0.013488219
##                   B      C_AVG     T_AVG  deltaBeta CHR   MAPINFO Strand
## cg06822689 7.476378 0.03580651 0.7580881  0.7222816   3 141516291      R
## cg04472725 7.030821 0.03840299 0.7082059  0.6698029   3 141516232      F
## cg04645886 6.770387 0.36861015 0.9308610  0.5622509   4   1214608      F
## cg20837557 6.666781 0.65679358 0.2295093 -0.4272843   6 146864278      F
## cg00841760 6.605583 0.43992150 0.6626697  0.2227482   7 158362966      F
## cg05572370 6.584391 0.49949577 0.6657884  0.1662926  11   2323247      F
##            Type    gene feature     cgi        feat.cgi
## cg06822689   II    GRK7    Body  island     Body-island
## cg04472725   II    GRK7    Body  island     Body-island
## cg04645886   II   CTBP1    Body opensea    Body-opensea
## cg20837557   II   RAB32 TSS1500   shore   TSS1500-shore
## cg00841760   II  PTPRN2    Body  island     Body-island
## cg05572370   II TSPAN32 1stExon opensea 1stExon-opensea
##               UCSC_CpG_Islands_Name  DHS Enhancer                 Phantom
## cg06822689 chr3:141516055-141516639   NA     TRUE                        
## cg04472725 chr3:141516055-141516639   NA     TRUE                        
## cg04645886                            NA       NA                        
## cg20837557 chr6:146864481-146865426   NA       NA                        
## cg00841760 chr7:158361909-158362970   NA       NA                        
## cg05572370                          TRUE       NA low-CpG:2279812-2279967
##            Probe_SNPs Probe_SNPs_10
## cg06822689                         
## cg04472725                         
## cg04645886                         
## cg20837557                         
## cg00841760                         
## cg05572370

In new version of ChAMP, we includs a nice GUI interface for user to check the result of myDMP generated from above code. You just need to provide the “unmodified” result of champ.DMP (myDMP), orginal beta matrix you used to generat DMP (beta) and covariates you want to analysis. You can used following code to open DMP.GUI() function.

Note that your myDMP reuslt may only be calcuated from two phenotypes, but your covariates may contain 3 or even more phenotypes.

DMP.GUI(DMP=myDMP,beta=myNorm,pheno=myLoad$pd$Sample_Group)

The left panel indicates the cut off parameters for significant CpGs. Currently only two parameters are supported here: p value and abs(logFC). Users may set these two values then press “Submit” to generate significant CpGs table on the right. On the right is the table of significant CpGs, all information are included, the table can be ranked by each column.

The second tab provides the heatmap of ALL significant CpGs you generate by left panel cutoffs.

The third panel provides the barplot of proportion of each CpG feature and cgi summary information for hyper and hypo CpGs. You may press the marker in the legend to close or open certain bars.

The fourth tab is controled by the Gene parameter on the left panel. You may select an gene name (the example default is NFIX here) and press “submit”. Then the plot for this gene will be plot above. Note that the x axis here is not the actual MAPINFO of each CpG. The distance between each CpG are the same. There are two kinds of lines plotted for each phenotype, the solid lines represent Mean Value plot, and the dash line represent the Loess line, user may close or open they by clicking the marker on the legend. Below is the feature and cgi information of each CpG. At the bottom of this tab, is the gene information extracted from wikigene, which is also controled by the gene parameter on the left.

The fifth tab includes two plots. At the top is the enrichment plot for top 70 genes involved in significant CpGs you select by left cutoff parameters. The number in each bar indicates how many hyper or hypo differential methylated CpGs are included in this gene. Users may use this plot to find intereted genes then use tab 4 to plot these genes. At the bottom of this tab, is the boxplot of certain CpGs, which is controled by parameter on the left panel.

Note that all plots here can be Zoomed in or out. And can be downloaded directly. You may even change the shape of the brower to change the resize of plot then download it with different resolution.

6.7 Differential Methylation Regions

Differentially Methylated Eegions (DMRs) are extended and discreet segments of the genome that show a quantitative alteration in DNA methylation levels between two groups. champ.DMR() is the function ChAMP provides to computes and returns a data frame of probes binned into discreet differentially methylated regions (DMRs), with accompanying p-value.

There are three algotithms integrated to do this job, Bumphunter, ProbeLasso and DMRcate. For Bumphunter algotithms, it would firstly cluster all probes into small clusters (or regions), then apply random permulation method to estimate candidate DMRs. This method is very user friendly and is not relying on any output of previous functions. The permulation steps in Bumphunter algotithms may be a little bit slow, but user may assign more core to accelerate it by parallel more threads. The result of bumphunter algotithm is a dataframe contain all detected DMRs, with their length, clusters, number of CpGs.e.g annotated.

For ProbeLasso method, the final data frame is distilled from a limma output of probes and their association statistics. Previously, the result from champ.DMP() must be inputted into champ.DMR() function to do run ProbeLasso, but now we have upgraded the function and ProbeLasso needs no result from champ.DMP() anymore. For the principle and mechanism of ProbeLasso function, user may find it in original paper¹³.

The new method here incoporated in new version ChAMP is DMRcate. This is a data-driven approach that is agnostic to all annotations except for spatial ones, specifically chromosomal coordinates. Critical to DMRcate are robust estimates of differential methylation (DM) at individual CpG sites derived from limma, arguably the most widely used tool for microarray analysis. DMRcate pass the square of the moderated t statistic calculated on each 450K probe to DMR-finding function. Then DMRcate would apply a Gaussian kernel to smooth this metric within a given window, and also derive an expected value of the smoothed estimate (in other words, one with no experimental effect) from the varying density of CpGs sites incurred by reduced representation and irregular spacing. For more detail about DMRcate function, user may read the original paper [$Peters2015] and DMRcate R pacakge.

Users may use following code to generate DMR:

myDMR <- champ.DMR(beta=myNorm,pheno=myLoad$pd$Sample_Group,method="Bumphunter")

The result of Bumphunter, DMRcate and ProbeLasso functions are all dataframe, but main contain different columns. Here we can have a check on DMRcate result.

head(myDMR$DMRcateDMR)

##       seqnames     start       end width strand no.cpgs        minfdr
## DMR_1     chr2 177021702 177030228  8527      *      48  3.243739e-69
## DMR_2     chr2 177012117 177017797  5681      *      33  1.137990e-59
## DMR_3    chr12 115134148 115136308  2161      *      51 3.571093e-108
## DMR_4    chr11  31820441  31828715  8275      *      40  3.087712e-43
## DMR_5    chr17  46655289  46658170  2882      *      27  1.565729e-85
## DMR_6     chr6  32115964  32118811  2848      *      50 5.055382e-123
##           Stouffer maxbetafc meanbetafc
## DMR_1 4.011352e-24 0.6962753  0.4383726
## DMR_2 6.555658e-20 0.6138975  0.4676655
## DMR_3 5.711951e-19 0.5372083  0.3601741
## DMR_4 1.206741e-18 0.5673115  0.4112588
## DMR_5 2.087453e-17 0.6859260  0.4812521
## DMR_6 7.216567e-16 0.5772168  0.2971849
##                                                                                              overlapping.promoters
## DMR_1                                                                       HOXD3-001, HOXD3-004, RP11-387A1.5-001
## DMR_2                                                                             HOXD4-001, MIR10B-201, HOXD3-005
## DMR_3                                                                                                         <NA>
## DMR_4 PAX6-016, PAX6-006, PAX6-014, PAX6-015, PAX6-007, PAX6-028, PAX6-025, PAX6-027, PAX6-029, PAX6-024, PAX6-026
## DMR_5                                                      HOXB4-001, MIR10A-201, HOXB3-009, HOXB3-005, MIR10A-001
## DMR_6                                                        PRRT1-002, PRRT1-201, PRRT1-007, PRRT1-006, PRRT1-003

Just like DMP result, we provide DMR.GUI() as interface for DMR result. All result from three different method of champ.DMR() are accepted in the GUI function. The DMR.GUI() function would automatically detected what kind of input the DMR result is. Thus it’s not recommended to modify the structure of myDMR result.

DMR.GUI(DMR=myDMR)
# It might be a little bit slow to open DMR.GUI() because function need to extract annotation for CpGs from DMR. Might take 30 seconds.

Just like the DMP.GUI above, there is a panel for parameters on the left, user may set the parameters to select significant DMRs. We currently only provide two parameters: p value and min number probes. The above title will indicate the method you have used to get DMR result (here is DMRcate). Note that for different method the p value may indicate different columns. Bumphunter is pvaluearea, while ProbeLasso is dmrP, and there is no cutoff for DMRcate result, which has been done during calculation. After your selection and press “Submit”, you will get table of significant DMRs on the right. For different method, the table content might be different. Thus user should research what method you want to use to find DMRs.

The second tab provides a table for all CpGs involvded in these DMRs, which can be regarded as a mapping list between CpGs and DMRs. Also if you can not get gene information in first tab, you can get them here. In this table, for each CpGs, their corresponding genes and DMRs are marked.

The third plot is the plot of each DMR, just like the gene plot in DMP.GUI() function. Here the x axis are not real MAPINFO of each CpG, the distance between each CpGs are equal. Also I have feature and CpGs marked under the plot. At the top of this tab, are the table of all CpGs in this DMR. You can use the DMR index parameter on the left panel to select DMR you want to check.

The fourth tab provide the enrichment information of genes and heatmap. Note that there might be many genes and CpGs in this plot. So if you have too many DMRs selected, the plot might be too big to be open. But you don’t need to worry can not see it clearly, because you may zoom in as much as you can to check details.

6.8 Differential Methylation Blocks

In recent years, there are more researchs about Differential Methylated Blocks¹⁹. Here we provide a function to calculate methylation blocks. In our Block-finder function, champ.Block() would firstly calculate small clusters (regions) on whole genome based on their locations on genome. Then for each cluster (region), its mean value and mean position would be calcuated, thus you can assume that each region would be collapsed into a dot. Then Bumphunter algorithsm will be used to find “DMRs” over these regions (dots after collapse). Note that only open sea regions will be used to calculate Blocks. In our previous paper²⁵ we demonstrated that Differential Methylated Blocks may show universal feature across various cancers.

User may use following code to generate methylation blocks:

myBlock <- champ.Block(beta=myNorm,pheno=myLoad$pd$Sample_Group,arraytype="450K")

A lot of things would be returned by champ.Block() function, but the most important is the first list: Block. You may check it like below:

head(myBlock$Block)

##         chr     start       end      value     area cluster indexStart
## Block_1  12 125405786 133245207 -0.1398238 61.66232     649      27996
## Block_2   6  28839951  30308328 -0.1725335 40.20032     356      81359
## Block_3  15  24778439  27360551 -0.1752729 38.73531     732      35251
## Block_4  14 100851839 102144080 -0.2166707 35.75066     724      34607
## Block_5  11 130933979 133954227 -0.1833729 34.29073     616      22569
## Block_6   7  40026390  42107777 -0.2616146 24.85339     400      88773
##         indexEnd   L clusterL      p.value  fwer  p.valueArea fwerArea
## Block_1    28436 441     2176 1.698689e-05 0.028 1.698689e-05    0.028
## Block_2    81591 233     2396 1.698689e-05 0.028 1.577354e-04    0.136
## Block_3    35471 221      261 1.698689e-05 0.028 2.196162e-04    0.136
## Block_4    34771 165      708 1.698689e-05 0.028 3.858451e-04    0.188
## Block_5    22755 187     1601 1.698689e-05 0.028 4.622861e-04    0.188
## Block_6    88867  95     1497 1.698689e-05 0.028 1.571287e-03    0.420

And we provide a GUI function for users to check the result of myBlock above.

Block.GUI(Block=myBlock,beta=myNorm,pheno=myLoad$pd$Sample_Group,runDMP=TRUE,compare.group=NULL,arraytype="450K")

Note that there are two special parameters here: runDMP means if Block.GUI() function should calculate DMP for each CpGs, if not, the GUI() function would only return the annotation of all involvded CpGs. And just like champ.DMP() function, compare.group need to be assigned if you want to calculate DMP while running Block.GUI().

Just like DMP.GUI() and DMR.GUI(), the p value cutoff and min cluster (similar to min probes in DMR.GUI()) controls the number of significant Blocks you selected. Then the information of significant blocks will be listed on the right.

The second tab provides information of all CpGs involved in all Blocks. For methylation blocks, there are actually three unit integrated together: Block, Cluster (regions) and CpG. Blocks are formed by some clusters, and clusters are formed by some CpGs. Thus the second tab actually provides a mapping solution from CpGs to Blocks. User may use this tab to find their interested genes and CpGs. Then use DMP.GUI() to combind research between Blocks and DMPs.

The third tab provides the plot for each Blocks. Be caution that each Block may contain many many clusters. Thus it would be slow to calculate and plot. And please make sure you have a good computer or server to run this function. Note that plot is slightly different from DMR.GUI() and DMP.GUI() that the x axis is actually the REAL MAPINFO of each cluster, because I also need to map CpG islands position above the lines (Those dark red dots).

6.9 Gene Set Enrichment Analysis

Gene Set Enrichment Analysis is a very important step for most bioinformatics work. After previous steps, you may already get some significant DMPs or DMRs, thus you may want to know if genes involvded in these significant DMPs or DMRs are actually corresponding to some pathways. To achieve this analysis, you can use champ.GSEA() to do GSEA analysis.

champ.GSEA() would automatically extract genes in myDMP and myDMR (No matter what method you have used to generate DMRs). Also if you have your own significant CpG list or gene list calculated from THE SAME methylation arrays (450K or EPIC) by other method not included in ChAMP. You may also format them into list and input the function to do GSEA (Please assign each element in your own CpG list or gene list a name, otherwise error might be triggered). champ.GSEA() would automatically extract gene information, transfer CpG information into gene information then conduct GSEA on each list. During the mapping process between CpGs to genes, if there are multiple CpGs mapped to one gene, that gene would only be counted once in case of over counting.

There are two ways to do GSEA. In previous version, ChAMP used pathway information downloaded from MSigDB. Then Fisher Exact Test will be used to calculate the enrichment status of each pathway. After gene enrichment analysis, champ.GSEA() function would automatically return pathways with p value smaller then adjPval cutoff.

For most situation, above fisher exact test method should works fine. However, as the Geeleher’s Paper addressed. Since different genes has different numbers of CpGs contained inside, the two situation that one genes with 50 CpGs inside but only one of them show significant methylation, and one gene with 2 CpGs inside but two are significant methylated should not be eaqualy treated. The solution is use number of CpGs contained by genes to correct significant genes. Here we used goseq package²⁰ to solve this problem. This package was originally designed to correct bias caused by gene’s length during GSEA procedure. Here we used number of CpGs contained by each gene replace length as biased data, to correct this issue.

Note: If user want to correct the biased caused between genes’ number and CpGs’ number, they may set method parameter in champ.GSEA() function as “goseq” to use goseq method to do GSEA, or user may set it as “fisher” to do normal Gene Set Enrichment Analysis.

User may use following command to do GSEA:

myGSEA <- champ.GSEA(beta=myNorm,DMP=myDMP,DMR=myDMR,arraytype="450K",adjPval=0.05)
# myDMP and myDMR could (not must) be used directly.

After the calculation, for each list (DMP,DMR or other list), one GSEA result would be returned. User may check the result like below:

head(myGSEA$DMP)

##         category over_represented_pvalue under_represented_pvalue
## 4620  GO:0009952            3.028466e-10                        1
## 13116 GO:0048704            1.029305e-09                        1
## 13034 GO:0048562            1.982377e-09                        1
## 1339  GO:0003002            6.331492e-09                        1
## 13039 GO:0048568            7.967879e-09                        1
## 13118 GO:0048706            7.990748e-09                        1
##       numDEInCat numInCat                                     term
## 4620          65      191 anterior/posterior pattern specification
## 13116         40       92  embryonic skeletal system morphogenesis
## 13034         80      276            embryonic organ morphogenesis
## 1339          85      314                          regionalization
## 13039        100      396              embryonic organ development
## 13118         47      121    embryonic skeletal system development
##       ontology over_represented_adjPvalue
## 4620        BP               6.269531e-06
## 13116       BP               1.065434e-05
## 13034       BP               1.367973e-05
## 1339        BP               2.757074e-05
## 13039       BP               2.757074e-05
## 13118       BP               2.757074e-05

# Above is the GSEA result for differential methylation probes.
head(myGSEA$DMR)

##         category over_represented_pvalue under_represented_pvalue
## 3725  GO:0007389            3.301398e-22                        1
## 11270 GO:0043565            8.287681e-22                        1
## 1339  GO:0003002            2.310823e-21                        1
## 4620  GO:0009952            2.799958e-21                        1
## 1583  GO:0003700            3.838911e-19                        1
## 395   GO:0001071            3.911054e-19                        1
##       numDEInCat numInCat
## 3725          49      410
## 11270         72      957
## 1339          43      314
## 4620          36      191
## 1583          70     1093
## 395           70     1094
##                                                               term
## 3725                                 pattern specification process
## 11270                                sequence-specific DNA binding
## 1339                                               regionalization
## 4620                      anterior/posterior pattern specification
## 1583  transcription factor activity, sequence-specific DNA binding
## 395             nucleic acid binding transcription factor activity
##       ontology over_represented_adjPvalue
## 3725        BP               6.834554e-18
## 11270       MF               8.578578e-18
## 1339        BP               1.449118e-17
## 4620        BP               1.449118e-17
## 1583        MF               1.349444e-15
## 395         MF               1.349444e-15

# Above is the GSEA result for differential methylation regions.
# Too many information may be printed, so we are not going to show the result here.

6.10 Differential Methylated Interaction Hotspots

Another function included in ChAMP is champ.EpiMod(). This function uses FEM package to calcuate the Interaction Hotspots in modules for certain data set. By “interactome hotspot” we mean a connected subnetwork of the protein interaction network (PIN) with an exceptionally large average edge-weight density in relation to the rest of the network. The weight edges are constructed from the statistics of association of DNA methylation. Thus, the champ.EpiMod() function can be viewed as a functional supervised algorithm, which uses a network of relations between genes (in our case a PPI network), to identify subnetworks where a significant number of genes are associated with a phenotype of interest (POI).

Thus compared with GSEA and other pathway detection method, champ.EpiMod() (or FEM package) used more strict way to estimate small modules located in big pathways. Each edge and genes’ status would be calculated based on the data set you inputted. Orginally, FEM package was developed to do interaction analysis between methylation data and RNA expression data, which could get modules show both Differential Methylation and Differential Expression status. However here we only provide usage to calcuated module show differential methylation. More advanced user may refer to FEM package for more information.

User may use following code to do champ.EpiMod()

myEpiMod <- champ.EpiMod(beta=myNorm,pheno=myLoad$pd$Sample_Group)

The all significant modules would be returned. And PDF version plots would returned in resultsDir in function parameter. Like below:

Above we demonstrated four modules detect by champ.EpiMod() based on our small 450K test data. But in practic, each modules would be saved in to a pdf file individually in resultsDir directory.

6.11 Copy Number Variation

For Copy Number Variation analysis, we provide champ.CNA() to achieve this work. This function uses the HumanMethylation450 or HumanMethylationEPIC data to identify copy number alterations¹⁸. The function utilises the intensity values for each probe to count copy number and determine if copy number alterations are present. Copy number is determined using the CopyNumber package.

Basically there are two methods you may choose to do CNA analysis: you may compare the copy number various between case samples and control samples; or you can compare copy number of each sample to the average copy number status of all samples. For the first method, you may specify which groups of samples in your pheno parameter, or ChAMP already included some blood control samples itself, you may use them as control then calculate the copy number variance, all you need to do is assign parameter control=TRUE and controlGroup="champCtls". For the other method, you may assign control=FALSE to apply.

There are two kinds of plots would be generated, analysis for each sample (sampleCNA=TRUE parameter) and analysis for each group (groupFreqPlots=TRUE parameter). Like champ.QC() function before, there are two parameter could be assigned for plots. The Rplot parameter would control if champ.CNA() would plot with R session, and PDFplot parameter would control if champ.CNA() would save PDF file to resultsDir. By default, only PDFplot is TRUE.

Note that different from old version of ChAMP. No batch correction would be run automatically on intensity data, if you want to correct batch effect on intensity data, please use champ.runCombat(), the usage for intensity data is exactly the same for beta or M matrix, the only difference is you need to replace the beta value with myLoad$intensity and set logitTrans=FALSE.

Feber¹⁸ compared the results obtained using this method to copy number data from Illumina CytoSNP array and Affymetrix SNP 6.0 arrays and found that using this method for 450k data was effective in identifying regions of gain and loss.

Users may use following code to calculate Copy Number Variance:

myCNA <- champ.CNA(intensity=myLoad$intensity,pheno=myLoad$pd$Sample_Group)

The main usage of champ.CNA() function is to generate plot for sample and groups in pheno. Like below:

6.12 Cell Type Heterogeneity

While many methylation sample are collected from whole blood, their methylation status actually are combination of various cell types. These cell-type specified methylation may cover ture epigenome signals as well as true relationships between CpGs sites and phenotypes, or themsleves are the reason of some diseases. Thus many methods have be invented to deal with cell heterogeneity problem, like RefbaseEWAS and RefFreeEWAS incoporated.

RefbaseEWAS method is a method similar to regression calibration, for inferring changes in the distribution of white blood cells between different subpopulations (e.g. cases and controls) using DNA methylation signatures, in combination with a previously obtained external validation set consisting of signatures from purified leukocyte samples.

We integrates another cell type heterogeneity correction method RefFreeEWAS in ChAMP. The new RefFreeEWAS method is similar to non-negative matrix factorization, with additional constraints and additional utilities. This function is specifically suitable for tissue datasets such as placenta, saliva, adipose or tumor tissue. This method closely related to surrogate variable analysis, relies on a simple projection based on singular value decomposition (SVD), as does SVA. In SVA, the residuals of a linear model are decomposed into a factor-analytic structure and the factors are used subsequently in a regression model, with iteration resulting in a final set of surrogate variables.

Here we prepared two cell-type purified methylation reference, one for 27K and the other for 450K, user may use any of them to do cell heterogeneity to correct heterogeneity based on cell propotions of each sample. However, though RefbaseEWAS method would return fairly accurate cell propotion, this method can ONLY be applied for whole blood samples, which contain cells in the reference. After champ.refbase() function, cell type heterogeneity corrected beta matrix, and cell-type specific propotions in each sample will be returned.

On the other hand, RefFreeEWAS method requires two parameters in champ.reffree() function, one is the covariates matrix for inputted dataset, which could be a matrix or a vector. containing various phenotypes of each sample. The other important parameter is number of latent variable K, which represent numbers of latent cell type mixed in the dataset. If users don’t know the correct number of latent variable (cell types), they could ignore this parameter, and champ.reffree() will apply Random Matrix Theory from isva package to estimate number of latent variables.

User may use following code to do cell heterogeneity analysis:

myRefFree <- champ.reffree(beta=myNorm,pheno=myLoad$pd$Sample_Group)

Then we can see the q value calculated for each CpGs, user may relying on the q value below to select significant CpGs after cell heterogeneity correction:

head(myRefFree$qvBeta)

##                           pheno
## cg00001349 0.2359390 0.12512447
## cg00001583 0.2736186 0.24987531
## cg00002028 0.2169540 0.04471770
## cg00002719 0.2169540 0.04183915
## cg00002837 0.4068856 0.04658125
## cg00003202 0.2352289 0.04183915

For RefbaseEWAS method, user could get cell propotion of each sample by champ.refbase(). But do remember champ.refbase() can only works on Blood Sample Data Set.

myRefBase <- champ.refbase(beta=myNorm,arraytype="450K")
# Our test data set is not blood.

8 Citing ChAMP

ChAMP is implemented by incoporated many other very talented scientists’ work. Some functions are achieved by adopting other packages, while some functions are coded by ChAMP’s author but with ideas and theories provided by other works. So when you are using functions provided by ChAMP, please citate other scientists’s work correctly, so that people who invented these theory and method would get corresponding credits and honor.

If you want to citate ChAMP package itself, please add following papers in your article. This article is about ChAMP, an integrated analysis pipeline for DNA methylation analysis:

Morris, T. J., Butcher, L. M., Feber, A., Teschendorff, A. E., Chakravarthy, A. R., Wojdacz, T. K., and Beck, S. (2014). Champ: 450k chip analysis methylation pipeline pg - 428-30. Bioinformatics, 30(3), 428-30.

The loading part of ChAMP (champ.load) used minfi package, so does FunctionalNormliazation, SWAN normalization methods. If you want used above two normalization methods, you may want to citate below minfi paper:

Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD and Irizarry RA (2014). Minfi: A flexible and comprehensive Bioconductor package for the analysis of Infinium DNA Methylation microarrays. Bioinformatics, 30(10), pp. 1363-1369. doi: 10.1093/bioinformatics/btu049.

Jean-Philippe Fortin, Timothy Triche, Kasper Hansen. Preprocessing, normalization and integration of the Illumina HumanMethylationEPIC array. bioRxiv 065490; doi: https://doi.org/10.1101/065490

Note that during filtering step, we used below paper’s SNP list, which included SNP located 5 bp close to CpG sites, with global MAF >1%, and other characters. For more information about SNP being filtered, you may check below paper:

Zhou W, Laird PW and Shen H: Comprehensive characterization, annotation and innovative use of Infinium DNA Methylation BeadChip probes. Nucleic Acids Research 2016

The default normalization method in ChAMP is BMIQ, which is invented by Andrew Teschendorff. This is a very popular and well-designed normalization function, you may citate below paper:

Teschendorff AE et al . A beta-mixture quantile normalization method for correcting probe design bias in illumina infinium 450 k dna methylation data. Bioinformatics. 2013;29(2):189-196.

Another included normalization method is PBC, the related paper is below:

Dedeurwaerder S et a. Evaluation of the infinium methylation 450K technology. Epigenomics. 2011;3(6):771-784.

champ.SVD() function is a very good option to detect effect of factors on dataset. The paper related with this function is below:

Teschendorff AE et a. An epigenetic signature in peripheral blood predicts active ovarian cancer. PLoS One. 2009;4(12):e8274.

ChAMP used famous batch correction method Combat from SVA package for batch correction. You may citate Combat-related paper below:

Johnson WE et a. Adjusting batch effects in microarray expression data using empirical bayes methods. Biostatistics. 2007;8(1):118-127.

Above are all functions related to Data Preparation. Below are functions and methods related to statistic result estimation.

In ChAMP we used limma to find DMP. According to user guild of limma, researches related to differential analysis should citate some specific paper. Below are two papers user may want to use:

Ritchie, ME, Phipson, B, Wu, D, Hu, Y, Law, CW, Shi, W, and Smyth, GK (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43(7), e47

Phipson, B, Lee, S, Majewski, IJ, Alexander, WS, and Smyth, GK (2016). Robust hyperparameter estimation protects against hypervariable genes and improves power to detect differential expression. Annals of Applied Statistics 10(2), 946-963.

When it comes to DMR detection, here there are three method incoporated: If you used Bumphunter method, you may use this one:

Jaffe AE et a. Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies. Int J Epidemiol. 2012;41(1):200-209.

If you used ProbeLasso function, you may use this one:

Butcher LM, Beck S. Probe lasso: A novel method to rope in differentially methylated regions with 450K dna methylation data. Methods. 2015;72:21-28.

Or if you used DMRcate function invented by Tim Peters, you may use paper below:

Peters TJ, Buckley MJ, Statham AL, et al. De novo identification of differentially methylated regions in the human genome. Epigenetics & Chromatin. 2015;8(1):1-16.

In this version of ChAMP, we provided a function to find Differential Methylation Blocks (DMB), this function is totally implemented by author of ChAMP, but the theory is based on below papers:

Hansen, KD, Timp, W, Bravo, HC, Sabunciyan, S, Langmead, B, McDonald, OG, Wen, B, Wu, H, Liu, Y, Diep, D, Briem, E, Zhang, K, Irizarry, RA, Feinberg, AP (2011). Increased methylation variation in epigenetic domains across cancer types. Nat. Genet., 43, 8:768-75.

For GSEA method, if you used “goseq” method, please cite below paper:

Paul Geeleher, Lori Hartnett, Laurance J. Egan, Aaron Golden, Raja Affendi Raja Ali, and Cathal Seoighe Gene-set analysis is severely biased when applied to genome-wide methylation data Bioinformatics 2013 : btt311v2-btt311.

Young MD, Wakefield MJ, Smyth GK and Oshlack A (2010). “Gene ontology analysis for RNA-seq: accounting for selection bias.” Genome Biology, 11, pp. R14.

We incoporated FEM package to find Differential Methylated Module among huge network. Below is the citation paper for FEM package:

Jiao Y, Widschwendter M, Teschendorff AE. A systems-level integrative framework for genome-wide dna methylation and gene expression data identifies differential gene expression modules under epigenetic control. Bioinformatics. 2014;30(16):2360-2366.

The CNA detection function was provided by Andy Feber of UCL, you may citate related paper below:

Feber A, Guilhamon P, Lechner M, et al. Using high-density dna methylation arrays to profile copy number alterations. Genome Biol. 2014;15(2):R30.

Just as above context says, we included two method for cell proportion correction, both are from Professor Houseman and his RefFreeEWAS package. If you used Refbase method in ChAMP, please citate following paper:

Houseman EA, Accomando WP, Koestler DC, et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics. 2012;13(1):1-16.

Or if you used RefFree method, you may want to citate below paper:

Houseman EA, Molitor J, Marsit CJ. Reference-free cell mixture adjustments in analysis of dna methylation data. Bioinformatics. 2014;30(10):1431-1439.

Above are all citation paper involvded by ChAMP, please remember to add them into your paper so that other scientists’s hard word would get proper evaluation and credits.

Finally, all GUI function, and function not mentioned above, like filtering, imputation, preparation between functions, API are all coded by author of ChAMP. We are now prepared a paper for that now, hopefully you can citate a paper for that too soon.

The Chip Analysis Methylation Pipeline

Yuan Tian, Morris TJ, Butcher L, Feber A, Teschendorff A, Chakravarthy A, Beck S

2017-04-10

1 New Updates

2 Introduction

3 Installation

4 Test Data

5 ChAMP Pipeline

5.1 Pipeline Introduction

5.2 Full Pipeline

5.3 Separated Steps

5.4 EPIC pipeline

5.5 Computational Requirements