1 Introduction

In this manual, we will show how to use the methylKit package. methylKit is an R package for analysis and annotation of DNA methylation information obtained by high-throughput bisulfite sequencing. The package is designed to deal with sequencing data from RRBS and its variants. But, it can potentially handle whole-genome bisulfite sequencing data if proper input format is provided.

1.1 DNA methylation

DNA methylation in vertebrates typically occurs at CpG dinucleotides, however non-CpG Cs are also methylated in certain tissues such as embryonic stem cells. DNA methylation can act as an epigenetic control mechanism for gene regulation. Methylation can hinder binding of transcription factors and/or methylated bases can be bound by methyl-binding-domain proteins which can recruit chromatin remodeling factors. In both cases, the transcription of the regulated gene will be effected. In addition, aberrant DNA methylation patterns have been associated with many human malignancies and can be used in a predictive manner. In malignant tissues, DNA is either hypo-methylated or hyper-methylated compared to the normal tissue. The location of hyper- and hypo-methylated sites gives a distinct signature to many diseases. Traditionally, hypo-methylation is associated with gene transcription (if it is on a regulatory region such as promoters) and hyper-methylation is associated with gene repression.

1.2 High-throughput bisulfite sequencing

Bisulfite sequencing is a technique that can determine DNA methylation patterns. The major difference from regular sequencing experiments is that, in bisulfite sequencing DNA is treated with bisulfite which converts cytosine residues to uracil, but leaves 5-methylcytosine residues unaffected. By sequencing and aligning those converted DNA fragments it is possible to call methylation status of a base. Usually, the methylation status of a base determined by a high-throughput bisulfite sequencing will not be a binary score, but it will be a percentage. The percentage simply determines how many of the bases that are aligning to a given cytosine location in the genome have actual C bases in the reads. Since bisulfite treatment leaves methylated Cs intact, that percentage will give us percent methylation score on that base. The reasons why we will not get a binary response are:

  • the probable sequencing errors in high-throughput sequencing experiments
  • incomplete bisulfite conversion
  • (and a more likely scenario) is heterogeneity of samples and heterogeneity of paired chromosomes from the same sample

2 Basics

2.1 Reading the methylation call files

We start by reading in the methylation call data from bisulfite sequencing with methRead function. Reading in the data this way will return a methylRawList object which stores methylation information per sample for each covered base. The methylation call files are basically text files that contain percent methylation score per base. Such input files may be obtained from AMP pipeline developed for aligning RRBS reads or from processBismarkAln function. . However, “cytosineReport” and “coverage” files from Bismark aligner can be read in to methylKit as well.

A typical methylation call file looks like this:

##         chrBase   chr    base strand coverage freqC  freqT
## 1 chr21.9764539 chr21 9764539      R       12 25.00  75.00
## 2 chr21.9764513 chr21 9764513      R       12  0.00 100.00
## 3 chr21.9820622 chr21 9820622      F       13  0.00 100.00
## 4 chr21.9837545 chr21 9837545      F       11  0.00 100.00
## 5 chr21.9849022 chr21 9849022      F      124 72.58  27.42

Most of the time bisulfite sequencing experiments have test and control samples. The test samples can be from a disease tissue while the control samples can be from a healthy tissue. You can read a set of methylation call files that have test/control conditions giving treatment vector option. For sake of subsequent analysis, file.list, sample.id and treatment option should have the same order. In the following example, first two files have the sample ids “test1” and “test2” and as determined by treatment vector they belong to the same group. The third and fourth files have sample ids “ctrl1” and “ctrl2” and they belong to the same group as indicated by the treatment vector.

library(methylKit)
file.list=list( system.file("extdata", 
                            "test1.myCpG.txt", package = "methylKit"),
                system.file("extdata",
                            "test2.myCpG.txt", package = "methylKit"),
                system.file("extdata", 
                            "control1.myCpG.txt", package = "methylKit"),
                system.file("extdata", 
                            "control2.myCpG.txt", package = "methylKit") )


# read the files to a methylRawList object: myobj
myobj=methRead(file.list,
           sample.id=list("test1","test2","ctrl1","ctrl2"),
           assembly="hg18",
           treatment=c(1,1,0,0),
           context="CpG"
           )

In addition to the options we mentioned above, any tab separated text file with a generic format can be read in using methylKit, such as methylation ratio files from BSMAP. See here for an example.

2.2 Reading the methylation call files and store them as flat file database

Sometimes, when dealing with multiple samples and increased sample sizes coming from genome wide bisulfite sequencing experiments, the memory of your computer might not be sufficient enough.

Therefore methylKit offers a new group of classes, that are basically pendants to the original methylKit classes with one important difference: The methylation information, which normally is internally stored as data.frame, is stored in an external bgzipped file and is indexed by tabix (H. Li 2011), to enable fast retrieval of records or regions. This group contains methylRawListDB, methylRawDB, methylBaseDB and methylDiffDB, let us call them methylDB objects.

We can now create a methylRawListDB object, which stores the same content as myobj from above. But the single methylRaw objects retrieve their data from the tabix-file linked under dbpath.

library(methylKit)
file.list=list( system.file("extdata", "test1.myCpG.txt", package = "methylKit"),
                system.file("extdata", "test2.myCpG.txt", package = "methylKit"),
                system.file("extdata", "control1.myCpG.txt", package = "methylKit"),
                system.file("extdata", "control2.myCpG.txt", package = "methylKit") )


# read the files to a methylRawListDB object: myobjDB 
# and save in databases in folder methylDB
myobjDB=methRead(file.list,
           sample.id=list("test1","test2","ctrl1","ctrl2"),
           assembly="hg18",
           treatment=c(1,1,0,0),
           context="CpG",
           dbtype = "tabix",
           dbdir = "methylDB"
           )

print(myobjDB[[1]]@dbpath)
## [1] "/tmp/Rtmpg54ajB/Rbuild18125f4ca52d/methylKit/vignettes/methylDB/test1.txt.bgz"

Most if not all functions in this package will work with methylDB objects the same way as it does with normal methylKit objects. Functions that return methylKit objects, will return a methylDB object if provided, but there are a few exceptions such as the select, the [ and the selectByOverlap functions.

2.3 Reading the methylation calls from sorted Bismark alignments

Alternatively, methylation percentage calls can be calculated from sorted SAM or BAM file(s) from Bismark aligner and read-in to the memory. Bismark is a popular aligner for bisulfite sequencing reads, available here (Krueger and Andrews 2011). processBismarkAln function is designed to read-in Bismark SAM/BAM files as methylRaw or methylRawList objects which store per base methylation calls. SAM files must be sorted by chromosome and read position columns, using ‘sort’ command in unix-like machines will accomplish such a sort easily. BAM files should be sorted and indexed. This could be achieved with samtools (http://www.htslib.org/doc/samtools.html).

The following command reads a sorted SAM file and creates a methylRaw object for CpG methylation. The user has the option to save the methylation call files to a folder given by save.folder option. The saved files can be read-in using the methRead function when needed.

my.methRaw=processBismarkAln( location = 
                                system.file("extdata",
                                                "test.fastq_bismark.sorted.min.sam", 
                                                  package = "methylKit"),
                         sample.id="test1", assembly="hg18", 
                         read.context="CpG", save.folder=getwd())

It is also possible to read multiple SAM files at the same time, check processBismarkAln documentation.

2.4 Descriptive statistics on samples

Since we read the methylation data now, we can check the basic stats about the methylation data such as coverage and percent methylation. We now have a methylRawList object which contains methylation information per sample. The following command prints out percent methylation statistics for second sample: “test2”

getMethylationStats(myobj[[2]],plot=FALSE,both.strands=FALSE)
## methylation statistics per base
## summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   20.00   82.79   63.17   94.74  100.00 
## percentiles:
##        0%       10%       20%       30%       40%       50%       60% 
##   0.00000   0.00000   0.00000  48.38710  70.00000  82.78556  90.00000 
##       70%       80%       90%       95%       99%     99.5%     99.9% 
##  93.33333  96.42857 100.00000 100.00000 100.00000 100.00000 100.00000 
##      100% 
## 100.00000

The following command plots the histogram for percent methylation distribution.The figure below is the histogram and numbers on bars denote what percentage of locations are contained in that bin. Typically, percent methylation histogram should have two peaks on both ends. In any given cell, any given base are either methylated or not. Therefore, looking at many cells should yield a similar pattern where we see lots of locations with high methylation and lots of locations with low methylation.

getMethylationStats(myobj[[2]],plot=TRUE,both.strands=FALSE)

We can also plot the read coverage per base information in a similar way, again numbers on bars denote what percentage of locations are contained in that bin. Experiments that are highly suffering from PCR duplication bias will have a secondary peak towards the right hand side of the histogram.

getCoverageStats(myobj[[2]],plot=TRUE,both.strands=FALSE)

2.5 Filtering samples based on read coverage

It might be useful to filter samples based on coverage. Particularly, if our samples are suffering from PCR bias it would be useful to discard bases with very high read coverage. Furthermore, we would also like to discard bases that have low read coverage, a high enough read coverage will increase the power of the statistical tests. The code below filters a methylRawList and discards bases that have coverage below 10X and also discards the bases that have more than 99.9th percentile of coverage in each sample.

filtered.myobj=filterByCoverage(myobj,lo.count=10,lo.perc=NULL,
                                      hi.count=NULL,hi.perc=99.9)

3 Comparative analysis

3.1 Merging samples

In order to do further analysis, we will need to get the bases covered in all samples. The following function will merge all samples to one object for base-pair locations that are covered in all samples. Setting destrand=TRUE (the default is FALSE) will merge reads on both strands of a CpG dinucleotide. This provides better coverage, but only advised when looking at CpG methylation (for CpH methylation this will cause wrong results in subsequent analyses). In addition, setting destrand=TRUE will only work when operating on base-pair resolution, otherwise setting this option TRUE will have no effect. The unite() function will return a methylBase object which will be our main object for all comparative analysis. The methylBase object contains methylation information for regions/bases that are covered in all samples.

meth=unite(myobj, destrand=FALSE)

Let us take a look at the data content of methylBase object:

head(meth)
##     chr   start     end strand coverage1 numCs1 numTs1 coverage2 numCs2
## 1 chr21 9853296 9853296      +        17     10      7       333    268
## 2 chr21 9853326 9853326      +        17     12      5       329    249
## 3 chr21 9860126 9860126      +        39     38      1        83     78
## 4 chr21 9906604 9906604      +        68     42     26       111     97
## 5 chr21 9906616 9906616      +        68     52     16       111    104
## 6 chr21 9906619 9906619      +        68     59      9       111    109
##   numTs2 coverage3 numCs3 numTs3 coverage4 numCs4 numTs4
## 1     65        18     16      2       395    341     54
## 2     79        16     14      2       379    284     95
## 3      5        83     83      0        41     40      1
## 4     14        23     18      5        37     33      4
## 5      7        23     14      9        37     27     10
## 6      2        22     18      4        37     29      8

By default, unite function produces bases/regions covered in all samples. That requirement can be relaxed using “min.per.group” option in unite function.

# creates a methylBase object, 
# where only CpGs covered with at least 1 sample per group will be returned

# there were two groups defined by the treatment vector, 
# given during the creation of myobj: treatment=c(1,1,0,0)
meth.min=unite(myobj,min.per.group=1L)

3.2 Sample Correlation

We can check the correlation between samples using getCorrelation. This function will either plot scatter plot and correlation coefficients or just print a correlation matrix

getCorrelation(meth,plot=TRUE)
##           test1     test2     ctrl1     ctrl2
## test1 1.0000000 0.9252530 0.8767865 0.8737509
## test2 0.9252530 1.0000000 0.8791864 0.8801669
## ctrl1 0.8767865 0.8791864 1.0000000 0.9465369
## ctrl2 0.8737509 0.8801669 0.9465369 1.0000000

3.3 Clustering samples

We can cluster the samples based on the similarity of their methylation profiles. The following function will cluster the samples and draw a dendrogram.

clusterSamples(meth, dist="correlation", method="ward", plot=TRUE)
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"

## 
## Call:
## hclust(d = d, method = HCLUST.METHODS[hclust.method])
## 
## Cluster method   : ward.D 
## Distance         : pearson 
## Number of objects: 4

Setting the plot=FALSE will return a dendrogram object which can be manipulated by users or fed in to other user functions that can work with dendrograms.

hc = clusterSamples(meth, dist="correlation", method="ward", plot=FALSE)

We can also do a PCA analysis on our samples. The following function will plot a scree plot for importance of components.

PCASamples(meth, screeplot=TRUE)

We can also plot PC1 and PC2 axis and a scatter plot of our samples on those axis which will reveal how they cluster.

PCASamples(meth)

3.4 Batch effects

We have implemented some rudimentary functionality for batch effect control. You can check which one of the principal components are statistically associated with the potential batch effects such as batch processing dates, age of subjects, sex of subjects using assocComp. The function gets principal components from the percent methylation matrix derived from the input methylBase object, and checks for association. The tests for association are either via Kruskal-Wallis test or Wilcoxon test for categorical attributes and correlation test for numerical attributes for samples such as age. If you are convinced that some principal components are accounting for batch effects, you can remove those principal components from your data using removeComp.

# make some batch data frame
# this is a bogus data frame
# we don't have batch information
# for the example data
sampleAnnotation=data.frame(batch_id=c("a","a","b","b"),
                            age=c(19,34,23,40))

as=assocComp(mBase=meth,sampleAnnotation)
as
## $pcs
##              PC1        PC2         PC3         PC4
## test1 -0.4978699 -0.5220504  0.68923849 -0.06737363
## test2 -0.4990924 -0.4805506 -0.71827964  0.06365693
## ctrl1 -0.5016543  0.4938800  0.08068700  0.70563101
## ctrl2 -0.5013734  0.5026102 -0.05014261 -0.70249091
## 
## $vars
## [1] 92.271885  4.525328  1.870950  1.331837
## 
## $association
##                PC1       PC2       PC3       PC4
## batch_id 0.3333333 0.3333333 1.0000000 1.0000000
## age      0.5864358 0.6794346 0.3140251 0.3467957
# construct a new object by removing the first pricipal component
# from percent methylation value matrix
newObj=removeComp(meth,comp=1)

In addition to the methods described above, if you have used other ways to correct for batch effects and obtained a corrected percent methylation matrix, you can use reconstruct function to reconstruct a corrected methylBase object. Users have to supply a corrected percent methylation matrix and methylBase object (where the uncorrected percent methylation matrix obtained from) to the reconstruct function. Corrected percent methylation matrix should have the same row and column order as the original percent methylation matrix. All of these functions described in this section work on a methylBase object that does not have missing values (that means all bases in methylBase object should have coverage in all samples).

mat=percMethylation(meth)

# do some changes in the matrix
# this is just a toy example
# ideally you want to correct the matrix
# for batch effects
mat[mat==100]=80
 
# reconstruct the methylBase from the corrected matrix
newobj=reconstruct(mat,meth)

3.5 Tiling windows analysis

For some situations, it might be desirable to summarize methylation information over tiling windows rather than doing base-pair resolution analysis. methylKit provides functionality to do such analysis. The function below tiles the genome with windows 1000bp length and 1000bp step-size and summarizes the methylation information on those tiles. In this case, it returns a methylRawList object which can be fed into unite and calculateDiffMeth functions consecutively to get differentially methylated regions. The tilling function adds up C and T counts from each covered cytosine and returns a total C and T count for each tile.

tiles=tileMethylCounts(myobj,win.size=1000,step.size=1000)
head(tiles[[1]],3)
##     chr   start     end strand coverage numCs numTs
## 1 chr21 9764001 9765000      *       24     3    21
## 2 chr21 9820001 9821000      *       13     0    13
## 3 chr21 9837001 9838000      *       11     0    11

3.6 Finding differentially methylated bases or regions

The calculateDiffMeth() function is the main function to calculate differential methylation. Depending on the sample size per each set it will either use Fisher’s exact or logistic regression to calculate P-values. P-values will be adjusted to Q-values using SLIM method (Wang, Tuominen, and Tsai 2011). If you have replicates, the function will automatically use logistic regression. You can force the calculateDiffMeth() function to use Fisher’s exact test if you pool the replicates when there is only test and control sample groups. This can be achieved with pool() function, see FAQ for more info.

In its simplest form ,where there are no covariates, the logistic regression will try to model the the log odds ratio which is based on methylation proportion of a CpG, \(\pi_i\), using the treatment vector which denotes the sample group membership for the CpGs in the model. Below, the “Treatment” variable is used to predict the log-odds ratio of methylation proportions.

\[ \text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right) =\beta_0 + \beta_1 Treatment_i \]

The logistic regression model is fitted per CpG or per region and we test if treatment vector has any effect on the outcome variable or not. In other words, we are testing if \(log(\pi_i/(1-\pi_i)) = \beta_0 + \beta_1 Treatment_i\) is a “better” model than \(log(\pi_i/(1-\pi_i)) = \beta_0\).

The following code snippet tests for differential methylation. Since the example data has replicates, the logistic regression based modeling and test will be used.

myDiff=calculateDiffMeth(meth)

After q-value calculation, we can select the differentially methylated regions/bases based on q-value and percent methylation difference cutoffs. Following bit selects the bases that have q-value<0.01 and percent methylation difference larger than 25%. If you specify type="hyper" or type="hypo" options, you will get hyper-methylated or hypo-methylated regions/bases.

# get hyper methylated bases
myDiff25p.hyper=getMethylDiff(myDiff,difference=25,qvalue=0.01,type="hyper")
#
# get hypo methylated bases
myDiff25p.hypo=getMethylDiff(myDiff,difference=25,qvalue=0.01,type="hypo")
#
#
# get all differentially methylated bases
myDiff25p=getMethylDiff(myDiff,difference=25,qvalue=0.01)

We can also visualize the distribution of hypo/hyper-methylated bases/regions per chromosome using the following function. In this case, the example set includes only one chromosome. The list shows percentages of hypo/hyper methylated bases over all the covered bases in a given chromosome.

diffMethPerChr(myDiff,plot=FALSE,qvalue.cutoff=0.01, meth.cutoff=25)
## $diffMeth.per.chr
##     chr number.of.hypermethylated percentage.of.hypermethylated
## 1 chr21                        75                      7.788162
##   number.of.hypomethylated percentage.of.hypomethylated
## 1                       59                     6.126687
## 
## $diffMeth.all
##   number.of.hypermethylated percentage.of.hypermethylated
## 1                        75                      7.788162
##   number.of.hypomethylated percentage.of.hypomethylated
## 1                       59                     6.126687

3.7 Correcting for overdispersion

Overdispersion occurs when there is more variability in the data than assumed by the distribution. In the logistic regression model, the response variable \(meth_i\) (number of methylated CpGs) is expected to have a binomial distribution: \[meth_i \sim Bin(n_i, \pi_i)\] Therefore, the methylated CpGs will have the variance \(n_i \pi_i(1-\pi_i)\) and mean \(\mu_i=n_i \pi_i\). \(n_i\) is the coverage for the CpG or a region and \(\pi_i\) is the underlying methylation proportion.

Overdispersion occurs when the variance of \(meth_i\) is greater than \(n_i \hat{\pi_i}(1-\hat{\pi_i})\), where \(\hat{\pi_i}\) is the estimated methylation proportion from the model. This can be corrected by calculating a scaling parameter \(\phi\) and adjusting the variance as \(\phi n_i \hat{\pi_i}(1-\hat{\pi_i})\). calculateDiffMeth can calculate that scaling parameter and use it in statistical tests to correct for overdispersion. The parameter is calculated as proposed by (McCullagh and Nelder 1989) as follows: \(\hat{\phi}=X^2/(N-P)\), where \(X\) is Pearson goodness-of-fit statistic, \(N\) is the number of samples, and \(P\) is the number of parameters. This scaling parameter also effects the statistical tests and if there is overdispersion correction the tests will be more stringent in general.

By default,this overdispersion correction is not applied. This can be achieved by setting overdispersion="MN". The Chisq-test is used by default only when no overdispersion correction is applied. If overdispersion correction is applied, the function automatically switches to the F-test. The Chisq-test can be manually chosen in this case as well, but the F-test only works with overdispersion correction switched on. In both cases, the procedure tests if the full model (the model where treatment is included as an explanatory variable) explains the data better than the null model (the model with no treatment, just intercept). If there is no effect based on samples being from different groups adding a treatment vector for sample groupings will be no better than not adding the treatment vector. Below, we simulate methylation data and use overdispersion correction for the logistic regression model.

sim.methylBase1<-dataSim(replicates=6,sites=1000,
                         treatment=c(rep(1,3),rep(0,3)),
                        sample.ids=c(paste0("test",1:3),paste0("ctrl",1:3))
                        )

my.diffMeth<-calculateDiffMeth(sim.methylBase1[1:,],
                                overdispersion="MN",test="Chisq",mc.cores=1)

3.8 Accounting for covariates

Covariates can be included in the analysis. The function will then try to separate the influence of the covariates from the treatment effect via the logistic regression model. In this case, we will test if full model (model with treatment and covariates) is better than the model with the covariates only. If there is no effect due to the treatment (sample groups), the full model will not explain the data better than the model with covariates only. In calculateDiffMeth, this is achieved by supplying the covariates argument in the format of a data.frame. Below, we simulate methylation data and add make a data.frame for the age. The data frame can include more columns, and those columns can also be factor variables. The row order of the data.frame should match the order of samples in the methylBase object.

covariates=data.frame(age=c(30,80,34,30,80,40))
sim.methylBase<-dataSim(replicates=6,sites=1000,
                        treatment=c(rep(1,3),rep(0,3)),
                        covariates=covariates,
                        sample.ids=c(paste0("test",1:3),paste0("ctrl",1:3))
                        )
my.diffMeth3<-calculateDiffMeth(sim.methylBase,
                                covariates=covariates,
                                overdispersion="MN",test="Chisq",mc.cores=1)

3.9 Finding differentially methylated bases using multiple-cores

The differential methylation calculation speed can be increased substantially by utilizing multiple-cores in a machine if available. Both Fisher’s Exact test and logistic regression based test are able to use multiple-core option.

The following piece of code will run differential methylation calculation using 2 cores.

myDiff=calculateDiffMeth(meth,num.cores=2)

4 Annotating differentially methylated bases or regions

We can annotate our differentially methylated regions/bases based on gene annotation using genomation package. In this example, we read the gene annotation from a BED file and annotate our differentially methylated regions with that information using genomation functions. Note that these functions operate on GRanges objects ,so we first coerce methylKit objects to GRanges. This annotation operation will tell us what percentage of our differentially methylated regions are on promoters/introns/exons/intergenic region. In this case we read annotation from a BED file, similar gene annotation information can be fetched using GenomicFeatures package or other packages available from Bioconductor.org.

library(genomation)
## Loading required package: grid
## 
## Attaching package: 'genomation'
## The following objects are masked from 'package:methylKit':
## 
##     getFeatsWithTargetsStats, getFlanks, getMembers,
##     getTargetAnnotationStats, plotTargetAnnotation
# read the gene BED file
gene.obj=readTranscriptFeatures(system.file("extdata", "refseq.hg18.bed.txt", 
                                           package = "methylKit"))
## Reading the table...
## Calculating intron coordinates...
## Calculating exon coordinates...
## Calculating TSS coordinates...
## Calculating promoter coordinates...
## Outputting the final GRangesList...
#
# annotate differentially methylated CpGs with 
# promoter/exon/intron using annotation data
#
annotateWithGeneParts(as(myDiff25p,"GRanges"),gene.obj)
## Summary of target set annotation with genic parts
## Rows in target set: 133
## -----------------------
## percentage of target features overlapping with annotation:
##   promoter       exon     intron intergenic 
##      27.82      15.04      34.59      57.14
## 
## percentage of target features overlapping with annotation:
## (with promoter > exon > intron precedence):
##   promoter       exon     intron intergenic 
##      27.82       0.00      15.04      57.14
## 
## percentage of annotation boundaries with feature overlap:
## promoter     exon   intron 
##     0.29     0.03     0.17
## 
## summary of distances to the nearest TSS:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       5     828   45158   52034   94644  313528
## 

Similarly, we can read the CpG island annotation and annotate our differentially methylated bases/regions with them.

# read the shores and flanking regions and name the flanks as shores 
# and CpG islands as CpGi
cpg.obj=readFeatureFlank(system.file("extdata", "cpgi.hg18.bed.txt", 
                                        package = "methylKit"),
                           feature.flank.name=c("CpGi","shores"))
#
# convert methylDiff object to GRanges and annotate
diffCpGann=annotateWithFeatureFlank(as(myDiff25p,"GRanges"),
                                    cpg.obj$CpGi,cpg.obj$shores,
                         feature.name="CpGi",flank.name="shores")

4.1 Regional analysis

We can also summarize methylation information over a set of defined regions such as promoters or CpG islands. The function below summarizes the methylation information over a given set of promoter regions and outputs a methylRaw or methylRawList object depending on the input. We are using the output of genomation functions used above to provide the locations of promoters. For regional summary functions, we need to provide regions of interest as GRanges object.

promoters=regionCounts(myobj,gene.obj$promoters)

head(promoters[[1]])
##     chr    start      end strand coverage numCs numTs
## 1 chr21 10011791 10013791      -     7953  6662  1290
## 2 chr21 10119796 10121796      -     1725  1171   554
## 3 chr21 10119808 10121808      -     1725  1171   554
## 4 chr21 13903368 13905368      +       10    10     0
## 5 chr21 14273636 14275636      -      282   220    62
## 6 chr21 14509336 14511336      +     1058    55  1003

4.2 Convenience functions for annotation objects

After getting the annotation of differentially methylated regions, we can get the distance to TSS and nearest gene name using the getAssociationWithTSS function from genomation package.

diffAnn=annotateWithGeneParts(as(myDiff25p,"GRanges"),gene.obj)

# target.row is the row number in myDiff25p
head(getAssociationWithTSS(diffAnn))
##      target.row dist.to.feature feature.name feature.strand
## 60            1          106111    NM_199260              -
## 60.1          2          106098    NM_199260              -
## 60.2          3          106092    NM_199260              -
## 60.3          4          105919    NM_199260              -
## 60.4          5           85265    NM_199260              -
## 60.5          6           68287    NM_199260              -

It is also desirable to get percentage/number of differentially methylated regions that overlap with intron/exon/promoters

getTargetAnnotationStats(diffAnn,percentage=TRUE,precedence=TRUE)
##   promoter       exon     intron intergenic 
##      27.82       0.00      15.04      57.14

We can also plot the percentage of differentially methylated bases overlapping with exon/intron/promoters

plotTargetAnnotation(diffAnn,precedence=TRUE,
    main="differential methylation annotation")

We can also plot the CpG island annotation the same way. The plot below shows what percentage of differentially methylated bases are on CpG islands, CpG island shores and other regions.

plotTargetAnnotation(diffCpGann,col=c("green","gray","white"),
       main="differential methylation annotation")

It might be also useful to get percentage of intron/exon/promoters that overlap with differentially methylated bases.

getFeatsWithTargetsStats(diffAnn,percentage=TRUE)
## promoter     exon   intron 
##     0.29     0.03     0.17

5 methylKit convenience functions

5.1 Coercing methylKit objects to GRanges

Most methylKit objects (methylRaw,methylBase and methylDiff), including methylDB objects (methylRawDB,methylBaseDB and methylDiffDB) can be coerced to GRanges objects from GenomicRanges package. Coercing methylKit objects to GRanges will give users additional flexibility when customizing their analyses.

class(meth)
## [1] "methylBase"
## attr(,"package")
## [1] "methylKit"
as(meth,"GRanges")
## GRanges object with 963 ranges and 12 metadata columns:
##         seqnames               ranges strand | coverage1    numCs1
##            <Rle>            <IRanges>  <Rle> | <integer> <numeric>
##     [1]    chr21   [9853296, 9853296]      + |        17        10
##     [2]    chr21   [9853326, 9853326]      + |        17        12
##     [3]    chr21   [9860126, 9860126]      + |        39        38
##     [4]    chr21   [9906604, 9906604]      + |        68        42
##     [5]    chr21   [9906616, 9906616]      + |        68        52
##     ...      ...                  ...    ... .       ...       ...
##   [959]    chr21 [19855690, 19855690]      + |        27        26
##   [960]    chr21 [19855706, 19855706]      + |        27        27
##   [961]    chr21 [19855711, 19855711]      + |        18        18
##   [962]    chr21 [19943653, 19943653]      + |        12        12
##   [963]    chr21 [19943695, 19943695]      + |        12        11
##            numTs1 coverage2    numCs2    numTs2 coverage3    numCs3
##         <numeric> <integer> <numeric> <numeric> <integer> <numeric>
##     [1]         7       333       268        65        18        16
##     [2]         5       329       249        79        16        14
##     [3]         1        83        78         5        83        83
##     [4]        26       111        97        14        23        18
##     [5]        16       111       104         7        23        14
##     ...       ...       ...       ...       ...       ...       ...
##   [959]         1        19        17         2        34        34
##   [960]         0        19        19         0        34        34
##   [961]         0        18        15         3        34        34
##   [962]         0        32        30         2        26        25
##   [963]         1        32        32         0        26        26
##            numTs3 coverage4    numCs4    numTs4
##         <numeric> <integer> <numeric> <numeric>
##     [1]         2       395       341        54
##     [2]         2       379       284        95
##     [3]         0        41        40         1
##     [4]         5        37        33         4
##     [5]         9        37        27        10
##     ...       ...       ...       ...       ...
##   [959]         0        12        12         0
##   [960]         0        12        11         1
##   [961]         0        12        12         0
##   [962]         1        24        22         2
##   [963]         0        27        24         3
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
class(myDiff)
## [1] "methylDiff"
## attr(,"package")
## [1] "methylKit"
as(myDiff,"GRanges")
## GRanges object with 963 ranges and 3 metadata columns:
##         seqnames               ranges strand |              pvalue
##            <Rle>            <IRanges>  <Rle> |           <numeric>
##     [1]    chr21   [9853296, 9853296]      + | 0.00990814169236099
##     [2]    chr21   [9853326, 9853326]      + |   0.947354554113234
##     [3]    chr21   [9860126, 9860126]      + |  0.0416044892768734
##     [4]    chr21   [9906604, 9906604]      + |   0.210651680438027
##     [5]    chr21   [9906616, 9906616]      + |  0.0015764977469575
##     ...      ...                  ...    ... .                 ...
##   [959]    chr21 [19855690, 19855690]      + |  0.0390193280519204
##   [960]    chr21 [19855706, 19855706]      + |   0.237178917400155
##   [961]    chr21 [19855711, 19855711]      + |   0.024129065685761
##   [962]    chr21 [19943653, 19943653]      + |   0.752836685511465
##   [963]    chr21 [19943695, 19943695]      + |   0.390423440187475
##                      qvalue         meth.diff
##                   <numeric>         <numeric>
##     [1]  0.0215658126063664 -7.01210653753027
##     [2]   0.592173028310101 0.209135938359928
##     [3]  0.0697808391745445 -4.11158117398203
##     [4]   0.259453089661393 -7.34636871508379
##     [5] 0.00432220069940914  18.8175046554935
##     ...                 ...               ...
##   [959]  0.0660274910679262 -6.52173913043478
##   [960]   0.282958728725395  2.17391304347826
##   [961]  0.0446545923565475 -8.33333333333334
##   [962]   0.592173028310101  1.45454545454546
##   [963]   0.396045532748492  3.38765008576329
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

5.2 Converting methylKit objects to methylDB objects and vice versa

methylDB objects (methylRawDB, methylBaseDB and methylDiffDB) can be coerced to normal methylKit objects. This might speed up the analysis if sufficient computing resources are available. This can be done via “as()” function.

class(myobjDB[[1]])
## [1] "methylRawDB"
## attr(,"package")
## [1] "methylKit"
as(myobjDB[[1]],"methylRaw")

You can also convert methylDB objects to their in-memory equivalents. Since that requires an additional parameter (the directory where the files will be located), we have a different function, named makeMethylDB to achieve this goal. Below, we convert a methylBase object to methylBaseDB and saving it at “exMethylDB” directory.

data(methylKit)
 
objDB=makeMethylDB(methylBase.obj,"exMethylDB")

5.3 Selection: subsetting methylKit Objects

We can also select rows from methylRaw, methylBase and methylDiff objects and methylDB pendants with select function. An appropriate methylKit object will be returned as a result of select function. Or you can use the '[' notation to subset the methylKit objects.

select(meth,1:5) # get first 10 rows of a methylBase object
##     chr   start     end strand coverage1 numCs1 numTs1 coverage2 numCs2
## 1 chr21 9853296 9853296      +        17     10      7       333    268
## 2 chr21 9853326 9853326      +        17     12      5       329    249
## 3 chr21 9860126 9860126      +        39     38      1        83     78
## 4 chr21 9906604 9906604      +        68     42     26       111     97
## 5 chr21 9906616 9906616      +        68     52     16       111    104
##   numTs2 coverage3 numCs3 numTs3 coverage4 numCs4 numTs4
## 1     65        18     16      2       395    341     54
## 2     79        16     14      2       379    284     95
## 3      5        83     83      0        41     40      1
## 4     14        23     18      5        37     33      4
## 5      7        23     14      9        37     27     10
myDiff[21:25,] # get 5 rows of a methylDiff object
##      chr   start     end strand       pvalue       qvalue  meth.diff
## 21 chr21 9913543 9913543      + 1.254379e-02 2.632641e-02 -13.343109
## 22 chr21 9913575 9913575      + 2.755448e-01 3.161628e-01  -5.442623
## 23 chr21 9927527 9927527      + 1.120126e-07 9.257475e-07 -46.109840
## 24 chr21 9944505 9944505      + 7.907909e-20 7.515975e-18 -51.017943
## 25 chr21 9944663 9944663      - 1.790779e-05 7.678302e-05 -28.099911

Important: Using select or '[' on methylDB objects will return its normal methylKit pendant, to avoid overhead of database operations.

5.3.1 selectByOverlap

We can select rows from any methylKit object, that lie inside the ranges of a GRanges object from GenomicRanges package with selectByOverlap function. An appropriate methylKit object will be returned as a result of selectByOverlap function.

library(GenomicRanges)
my.win=GRanges(seqnames="chr21",
               ranges=IRanges(start=seq(from=9764513,by=10000,length.out=20),width=5000) )
 
# selects the records that lie inside the regions
selectByOverlap(myobj[[1]],my.win)

Important: Using selectByOverlap on methylDB objects will return its normal methylKit pendant, to avoid overhead of database operations.

5.4 reorganize(): reorganizing samples and treatment vector within methylKit objects

The methylBase and methylRawList, as well as methylDB pendants can be reorganized by reorganize function. The function can subset the objects based on provided sample ids, it also creates a new treatment vector determining which samples belong to which group. Order of sample ids should match the treatment vector order.

# creates a new methylRawList object
myobj2=reorganize(myobj,sample.ids=c("test1","ctrl2"),treatment=c(1,0) )
# creates a new methylBase object
meth2 =reorganize(meth,sample.ids=c("test1","ctrl2"),treatment=c(1,0) )

5.5 percMethylation(): Getting percent methylation matrix from methylBase objects

Percent methylation values can be extracted from methylBase object by using percMethylation function.

# creates a matrix containing percent methylation values
perc.meth=percMethylation(meth)

5.6 methSeg(): segmentation of methylation or differential methylation profiles

Methylation or differential methylation profiles can be segmented to sections that contain similar CpGs with respect to their methylation profiles. This kind of segmentation could help us find interesting regions. For example, segmentation analysis will usually reveal high or low methylated regions, where low methylated regions could be interesting for gene regulation. The algorithm first finds segments that have CpGs with similar methylation levels, then those segments are classified to segment groups based on their mean methylation levels. This enables us to group segments with similar methylation levels to the same class.

See more at http://zvfak.blogspot.de/2015/06/segmentation-of-methylation-profiles.html

 download.file("https://dl.dropboxusercontent.com/u/1373164/H1.chr21.chr22.rds",
               destfile="H1.chr21.chr22.rds",method="curl")

 mbw=readRDS("H1.chr21.chr22.rds")

 # it finds the optimal number of componets as 6
 res=methSeg(mbw,diagnostic.plot=TRUE,maxInt=100,minSeg=10)

 # however the BIC stabilizes after 4, we can also try 4 componets
 res=methSeg(mbw,diagnostic.plot=TRUE,maxInt=100,minSeg=10,G=1:4)

 # get segments to BED file
 methSeg2bed(res,filename="H1.chr21.chr22.trial.seg.bed")

6 Frequently Asked Questions

Detailed answers to some of the frequently asked questions and various HOW-TO’s can be found at http://zvfak.blogspot.com/search/label/methylKit. In addition, http://code.google.com/p/methylkit/ has online documentation and links to tutorials and other related material. You can also check methylKit Q&A forum for answers https://groups.google.com/forum/#!forum/methylkit_discussion.

Apart from those here are some of the frequently asked questions.

6.0.1 How can I select certain regions/bases from methylRaw or methylBase objects ?

See ?select or help("[", package = "methylKit")

6.0.2 How can I find if my regions of interest overlap with exon/intron/promoter/CpG island etc.?

Currently, we will be able to tell you if your regions/bases overlap with the genomic features or not. See ?getMembers.

6.0.3 How can I find the nearest TSS associated with my CpGs ?

See ?genomation::getAssociationWithTSS

6.0.4 How do you define promoters and CpG island shores ?

Promoters are defined by options at genomation::readTranscriptFeatures function. The default option is to take -1000,+1000bp around the TSS and you can change that. Same goes for CpG islands when reading them in via genomation::readFeatureFlank function. Default is to take 2000bp flanking regions on each side of the CpG island as shores. But you can change that as well.

6.0.5 What does Bismark SAM output look like, where can I get more info ?

Check the Bismark (Krueger and Andrews 2011) website and there are also example files that ship with the package. Look at their formats and try to run different variations of processBismarkAln() command on the example files.

6.0.6 How can I reorder or remove samples at/from methylRawList or methylBase objects ?

See ?reorganize

6.0.7 Should I normalize my data ?

methylKit comes with a simple normalizeCoverage() function to normalize read coverage distributions between samples. Ideally, you should first filter bases with extreme coverage to account for PCR bias using filterByCoverage() function, then run normalizeCoverage() function to normalize coverage between samples. These two functions will help reduce the bias in the statistical tests that might occur due to systematic over-sampling of reads in certain samples.

6.0.8 How can I force methylKit to use Fisher’s exact test ?

methylKit decides which test to use based on number of samples per group. In order to use Fisher’s exact there must be one sample in each of the test and control groups. So if you have multiple samples for group, the package will employ Logistic Regression based test. However, you can use pool() function to pool samples in each group so that you have one representative sample per group. pool() function will sum up number of Cs and Ts in each group. We recommend using filterByCoverage() and normalizeCoverage() functions prior to using pool(). See ?pool

6.0.9 Can use data from other aligners than Bismark in methylKit ?

Yes, you can. methylKit can read any generic methylation percentage/ratio file as long as that text file contains columns for chromosome, start, end, strand, coverage and number of methylated cytosines. However, methylKit can only process SAM files from Bismark. For other aligners, you need to get a text file containing the minimal information described above. Some aligners will come with scripts or built-in tools to provide such files. See http://zvfak.blogspot.com/2012/10/how-to-read-bsmap-methylation-ratio.html for how to read methylation ratio files from BSMAP (Xi and Li 2009) aligner.

6.0.10 Can I transform an methylKit object into an methylDB object ?

Yes, you can. Many functions of the analysis workflow provide an save.db argument, which allows you to save the output as methylDB object. For example see ?unite and also check the ... argument section for further details.

7 Acknowledgements

This package is initially developed at Weill Cornell Medical College by Altuna Akalin with important code contributions from Matthias Kormaksson(mk375@cornell.edu) and Sheng Li (shl2018@med.cornell.edu). We wish to thank especially Maria E. Figueroa, Francine Garret-Bakelman, Christopher Mason and Ari Melnick for their contribution of ideas, data and support. Their support and discussions lead to development of methylKit.

7.1 Full list of contributors

  • Altuna Akalin (main design and development)
  • Matthias Kormaksson (initial differential methylation tests)
  • Sheng Li (adding PCA and clustering)
  • Adrian Bierling (methylation simulation, covariate and over-dispersion correction)
  • Alexander Gosdschan (C++ based BAM parsing and tabix based classes )
  • Arsene Webo (methylation segmentation)

8 R session info

sessionInfo() 
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.5-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.5-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      parallel  stats4    stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
## [1] genomation_1.8.0     methylKit_1.2.4      GenomicRanges_1.28.4
## [4] GenomeInfoDb_1.12.2  IRanges_2.10.2       S4Vectors_0.14.3    
## [7] BiocGenerics_0.22.0 
## 
## loaded via a namespace (and not attached):
##  [1] mclust_5.3                 Rcpp_0.12.12              
##  [3] lattice_0.20-35            Rsamtools_1.28.0          
##  [5] Biostrings_2.44.2          gtools_3.5.0              
##  [7] rprojroot_1.2              digest_0.6.12             
##  [9] gridBase_0.4-7             R6_2.2.2                  
## [11] plyr_1.8.4                 emdbook_1.3.9             
## [13] backports_1.1.0            evaluate_0.10.1           
## [15] coda_0.19-1                ggplot2_2.2.1             
## [17] zlibbioc_1.22.0            rlang_0.1.2               
## [19] lazyeval_0.2.0             data.table_1.10.4         
## [21] R.utils_2.5.0              R.oo_1.21.0               
## [23] Matrix_1.2-11              bbmle_1.0.19              
## [25] qvalue_2.8.0               rmarkdown_1.6             
## [27] splines_3.4.1              BiocParallel_1.10.1       
## [29] readr_1.1.1                stringr_1.2.0             
## [31] fastseg_1.22.0             RCurl_1.95-4.8            
## [33] munsell_0.4.3              DelayedArray_0.2.7        
## [35] compiler_3.4.1             numDeriv_2016.8-1         
## [37] rtracklayer_1.36.4         htmltools_0.3.6           
## [39] SummarizedExperiment_1.6.3 tibble_1.3.4              
## [41] GenomeInfoDbData_0.99.0    matrixStats_0.52.2        
## [43] XML_3.98-1.9               GenomicAlignments_1.12.2  
## [45] MASS_7.3-47                bitops_1.0-6              
## [47] R.methodsS3_1.7.1          gtable_0.2.0              
## [49] magrittr_1.5               scales_0.4.1              
## [51] KernSmooth_2.23-15         stringi_1.1.5             
## [53] impute_1.50.1              XVector_0.16.0            
## [55] reshape2_1.4.2             limma_3.32.5              
## [57] tools_3.4.1                BSgenome_1.44.0           
## [59] Biobase_2.36.2             seqPattern_1.8.0          
## [61] hms_0.3                    plotrix_3.6-6             
## [63] yaml_2.1.14                colorspace_1.3-2          
## [65] knitr_1.17

References

Krueger, Felix, and Simon R Andrews. 2011. “Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications.” Bioinformatics (Oxford, England) 27 (11): 1571–2. doi:10.1093/bioinformatics/btr167.

Li, Heng. 2011. “Tabix: Fast Retrieval of Sequence Features from Generic Tab-Delimited Files.” Bioinformatics 27 (5): 718–19. doi:10.1093/bioinformatics/btq671.

McCullagh, P., and John A. Nelder. 1989. Generalized Linear Models, Second Edition (Chapman & Hall/Crc Monographs on Statistics & Applied Probability). Chapman; Hall/CRC. http://www.amazon.com/Generalized-Chapman-Monographs-Statistics-Probability/dp/0412317605%3FSubscriptionId%3D0JYN1NVW651KCA56C102%26tag%3Dtechkie-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D165953%26creativeASIN%3D0412317605.

Wang, Hong-Qiang, Lindsey K Tuominen, and Chung-Jui Tsai. 2011. “SLIM: a sliding linear model for estimating the proportion of true null hypotheses in datasets with dependence structures.” Bioinformatics (Oxford, England) 27 (2): 225–31. doi:10.1093/bioinformatics/btq650.

Xi, Yuanxin, and Wei Li. 2009. “BSMAP: whole genome bisulfite sequence MAPping program.” BMC Bioinformatics 10 (1): 232. doi:10.1186/1471-2105-10-232.


  1. Author of the vignette. See Acknowledgements for a list of contributors.