1 Introduction

This manual shows how to use the methylKit package. methylKit is an R package for the analysis and annotation of DNA methylation information obtained by high-throughput bisulfite sequencing. The package is designed to deal with sequencing data from RRBS and its variants, but it can potentially handle whole-genome bisulfite sequencing data if the proper input format is provided.

1.1 DNA methylation

DNA methylation in vertebrates typically occurs at CpG dinucleotides; however, non-CpG Cs are also methylated in certain tissues such as embryonic stem cells. DNA methylation can act as an epigenetic control mechanism for gene regulation. Methylation can hinder the binding of transcription factors, and methylated bases can be bound by methyl-binding-domain proteins, which can recruit chromatin remodeling factors. In both cases, the transcription of the regulated gene will be affected. In addition, aberrant DNA methylation patterns have been associated with many human malignancies and can be used in a predictive manner. In malignant tissues, DNA is either hypo-methylated or hyper-methylated compared to the normal tissue. The location of hyper- and hypo-methylated sites gives a distinct signature to many diseases. Traditionally, hypo-methylation is associated with gene transcription (if it is on a regulatory region such as a promoter) and hyper-methylation is associated with gene repression.

1.2 High-throughput bisulfite sequencing

Bisulfite sequencing is a technique that can determine DNA methylation patterns. The major difference from regular sequencing experiments is that, in bisulfite sequencing, DNA is treated with bisulfite, which converts cytosine residues to uracil but leaves 5-methylcytosine residues unaffected. By sequencing and aligning those converted DNA fragments, it is possible to call the methylation status of a base. Usually, the methylation status of a base determined by high-throughput bisulfite sequencing will not be a binary score but a percentage. The percentage is the share of reads aligning to a given cytosine location in the genome that carry an actual C base at that position. Since bisulfite treatment leaves methylated Cs intact, that percentage gives the percent methylation score for that base. The reasons why we will not get a binary response are:

  • the probable sequencing errors in high-throughput sequencing experiments
  • incomplete bisulfite conversion
  • (the more likely scenario) heterogeneity of samples and heterogeneity of paired chromosomes from the same sample
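
As a minimal illustration of why the score is a percentage, the sketch below tallies read-level calls at a single cytosine. The `calls` vector and its values are invented for illustration and are not taken from any real sample.

```r
# Hypothetical base calls from 12 reads covering one cytosine after bisulfite
# treatment: "C" = the read still shows C (methylated),
#            "T" = the C was converted (unmethylated).
calls <- c("C", "T", "T", "C", "T", "T", "T", "C", "T", "T", "T", "T")

coverage <- length(calls)       # number of reads covering the position
num_c    <- sum(calls == "C")   # reads supporting methylation

# Percent methylation is the share of covering reads that kept their C
percent_methylation <- 100 * num_c / coverage
print(percent_methylation)
```

A mixture of methylated and unmethylated cells (or alleles) in the sample yields an intermediate value of this kind rather than exactly 0 or 100.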

2 Basics

2.1 Reading the methylation call files

We start by reading in the methylation call data from bisulfite sequencing with the methRead function. Reading in the data this way returns a methylRawList object, which stores methylation information per sample for each covered base. The methylation call files are basically text files that contain a percent methylation score per base. Such input files may be obtained from the AMP pipeline developed for aligning RRBS reads or from the processBismarkAln function. In addition, “cytosineReport” and “coverage” files from the Bismark aligner can be read into methylKit as well.

A typical methylation call file looks like this:

##         chrBase   chr    base strand coverage freqC  freqT
## 1 chr21.9764539 chr21 9764539      R       12 25.00  75.00
## 2 chr21.9764513 chr21 9764513      R       12  0.00 100.00
## 3 chr21.9820622 chr21 9820622      F       13  0.00 100.00
## 4 chr21.9837545 chr21 9837545      F       11  0.00 100.00
## 5 chr21.9849022 chr21 9849022      F      124 72.58  27.42
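
The columns can be explored with base R alone. The following sketch is not part of methylKit: it parses the five rows shown above from an inline string and recovers the approximate number of C and T reads from the coverage and percentage columns. The column names `numCs` and `numTs` are assigned by hand here for illustration.

```r
# The five example rows from above, written as whitespace-separated text
txt <- "chrBase chr base strand coverage freqC freqT
chr21.9764539 chr21 9764539 R 12 25.00 75.00
chr21.9764513 chr21 9764513 R 12 0.00 100.00
chr21.9820622 chr21 9820622 F 13 0.00 100.00
chr21.9837545 chr21 9837545 F 11 0.00 100.00
chr21.9849022 chr21 9849022 F 124 72.58 27.42"

# Explicit column classes keep the strand column as character
calls <- read.table(text = txt, header = TRUE,
                    colClasses = c("character", "character", "integer",
                                   "character", "integer", "numeric", "numeric"))

# freqC and freqT are percentages of the coverage, so read counts follow by
# scaling and rounding (the percentages in the file are themselves rounded)
calls$numCs <- round(calls$coverage * calls$freqC / 100)
calls$numTs <- round(calls$coverage * calls$freqT / 100)

print(calls[, c("chr", "base", "strand", "coverage", "numCs", "numTs")])
```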

Most of the time, bisulfite sequencing experiments have test and control samples. The test samples can be from a disease tissue while the control samples can be from a healthy tissue. You can read a set of methylation call files that have test/control conditions by supplying the treatment vector option. For the sake of subsequent analysis, the file.list, sample.id and treatment options should have the same order. In the following example, the first two files have the sample ids “test1” and “test2” and, as determined by the treatment vector, they belong to the same group. The third and fourth files have sample ids “ctrl1” and “ctrl2” and they belong to the same group as indicated by the treatment vector.

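library(methylKit)
file.list=list( system.file("extdata", "test1.myCpG.txt", package = "methylKit"),
                system.file("extdata", "test2.myCpG.txt", package = "methylKit"),
                system.file("extdata", "control1.myCpG.txt", package = "methylKit"),
                system.file("extdata", "control2.myCpG.txt", package = "methylKit") )

# read the files to a methylRawList object: myobj
# sample.id and treatment follow the order of file.list
myobj=methRead(file.list,
           sample.id=list("test1","test2","ctrl1","ctrl2"),
           assembly="hg18",
           treatment=c(1,1,0,0),
           context="CpG")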

In addition to the options mentioned above, any tab-separated text file with a generic format, such as methylation ratio files from BSMAP, can be read in using methylKit. See the methRead documentation for an example.
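
A sketch of reading such a generic file follows. It assumes that the `pipeline` argument of methRead accepts a named list of column positions and that an example file named generic1.CpG.txt ships in the package's extdata folder. Both assumptions come from the methRead help page rather than from the text above, and the column numbers must be adjusted to the file at hand.

```r
library(methylKit)

# Assumed example file from the package's extdata folder
generic.file <- system.file("extdata", "generic1.CpG.txt", package = "methylKit")

# The column layout is described by a named list instead of a pipeline name.
# fraction = FALSE: methylation is given as a percentage (0-100), not as a
# fraction between 0 and 1.
generic.obj <- methRead(generic.file,
                        pipeline = list(fraction = FALSE, chr.col = 1,
                                        start.col = 2, end.col = 2,
                                        coverage.col = 4, strand.col = 3,
                                        freqC.col = 5),
                        sample.id = "test1", assembly = "hg18")

head(generic.obj)
```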

2.2 Reading the methylation call files and storing them as a flat file database

Sometimes, when dealing with many samples or with the larger data sets that come from genome-wide bisulfite sequencing experiments, the memory of your computer might not be sufficient.

Therefore methylKit offers a group of classes that are counterparts to the original methylKit classes, with one important difference: the methylation information, which is normally stored internally as a data.frame, is stored in an external bgzipped file and indexed by tabix (H. Li 2011) to enable fast retrieval of records or regions. This group contains methylRawListDB, methylRawDB, methylBaseDB and methylDiffDB; we refer to them collectively as methylDB objects.

We can now create a methylRawListDB object, which stores the same content as myobj from above, except that the individual methylRaw objects retrieve their data from the tabix file linked under dbpath.

file.list=list( system.file("extdata", "test1.myCpG.txt", package = "methylKit"),
                system.file("extdata", "test2.myCpG.txt", package = "methylKit"),
                system.file("extdata", "control1.myCpG.txt", package = "methylKit"),
                system.file("extdata", "control2.myCpG.txt", package = "methylKit") )

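# read the files to a methylRawListDB object: myobjDB 
# and save in databases in folder methylDB
myobjDB=methRead(file.list,
           sample.id=list("test1","test2","ctrl1","ctrl2"),
           assembly="hg18", treatment=c(1,1,0,0), context="CpG",
           dbtype = "tabix",
           dbdir = "methylDB")

# path of the tabix file that backs the first sample
print(myobjDB[[1]]@dbpath)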

## [1] "/tmp/Rtmpg54ajB/Rbuild18125f4ca52d/methylKit/vignettes/methylDB/test1.txt.bgz"

Most if not all functions in this package work with methylDB objects the same way as they do with normal methylKit objects. Functions that return methylKit objects will return a methylDB object if given one as input, but there are a few exceptions such as the select, the [ and the selectByOverlap functions.
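
The exception for select can be checked directly. The sketch below assumes a writable temporary directory for the tabix files and reads a single sample. The object names `rawDB` and `sub` are hypothetical.

```r
library(methylKit)

# Read one sample into a tabix-backed object stored under tempdir()
rawDB <- methRead(system.file("extdata", "test1.myCpG.txt", package = "methylKit"),
                  sample.id = "test1", assembly = "hg18",
                  dbtype = "tabix", dbdir = tempdir())
print(class(rawDB))   # class of the database-backed object

# Subsetting rows with select is one of the listed exceptions:
# the result is expected to be held in memory rather than in a tabix file
sub <- select(rawDB, 1:5)
print(class(sub))
```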

2.3 Reading the methylation calls from sorted Bismark alignments

Alternatively, methylation percentage calls can be calculated from sorted SAM or BAM file(s) produced by the Bismark aligner and read into memory. Bismark is a popular aligner for bisulfite sequencing reads (Krueger and Andrews 2011). The processBismarkAln function is designed to read Bismark SAM/BAM files in as methylRaw or methylRawList objects, which store per-base methylation calls. SAM files must be sorted by the chromosome and read position columns; the ‘sort’ command on Unix-like machines accomplishes such a sort easily. BAM files should be sorted and indexed, which can be achieved with samtools (http://www.htslib.org/doc/samtools.html).

The following command reads a sorted SAM file and creates a methylRaw object for CpG methylation. The user has the option to save the methylation call files to a folder given by save.folder option. The saved files can be read-in using the methRead function when needed.

my.methRaw=processBismarkAln( location = 
                                system.file("extdata",
                                            "test.fastq_bismark.sorted.min.sam", 
                                            package = "methylKit"),
                         sample.id="test1", assembly="hg18", 
                         read.context="CpG", save.folder=getwd())

It is also possible to read multiple SAM files at the same time; see the processBismarkAln documentation.
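
A minimal sketch of reading several alignments in one call is given below. It assumes that processBismarkAln accepts lists for `location` and `sample.id` together with a `treatment` vector, and that the example SAM file named here ships with the package. The same file is listed twice purely as a stand-in for two different samples.

```r
library(methylKit)

# Assumed example alignment from the package's extdata folder
sam.file <- system.file("extdata", "test.fastq_bismark.sorted.min.sam",
                        package = "methylKit")

# One list entry per sample; sample.id and treatment follow the same order.
# save.folder = NULL: do not write methylation call files to disk.
my.methRawList <- processBismarkAln(location = list(sam.file, sam.file),
                                    sample.id = list("test1", "ctrl1"),
                                    assembly = "hg18",
                                    read.context = "CpG",
                                    save.folder = NULL,
                                    treatment = c(1, 0))

print(length(my.methRawList))
```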

2.4 Descriptive statistics on samples

Now that we have read the methylation data, we can check basic statistics such as coverage and percent methylation. We have a methylRawList object which contains methylation information per sample. The following command prints out percent methylation statistics for the second sample, “test2”:
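
A minimal sketch of such a call, assuming methylKit's getMethylationStats function (the function is not named in the text above). So that the snippet runs on its own, it rebuilds `myobj` from the example files of Section 2.1.

```r
library(methylKit)

# Rebuild the methylRawList from the four example files
ids <- c("test1", "test2", "control1", "control2")
file.list <- lapply(ids, function(id)
  system.file("extdata", paste0(id, ".myCpG.txt"), package = "methylKit"))

myobj <- methRead(file.list,
                  sample.id = list("test1", "test2", "ctrl1", "ctrl2"),
                  assembly = "hg18", treatment = c(1, 1, 0, 0), context = "CpG")

# plot = FALSE prints summary statistics instead of drawing a histogram;
# myobj[[2]] is the second sample, "test2"
getMethylationStats(myobj[[2]], plot = FALSE, both.strands = FALSE)
```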

## methylation statistics per base
## summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   20.00   82.79   63.17   94.74  100.00 
## percentiles:
##        0%       10%       20%       30%       40%       50%       60% 
##   0.00000   0.00000   0.00000  48.38710  70.00000  82.78556  90.00000 
##       70%       80%       90%       95%       99%     99.5%     99.9% 
##  93.33333  96.42857 100.00000 100.00000 100.00000 100.00000 100.00000 
##      100% 
## 100.00000

The following command plots the histogram for the percent methylation distribution. The figure below is the histogram, and the numbers on the bars denote what percentage of locations are contained in that bin. Typically, the percent methylation histogram should have two peaks, one at each end. In any given cell, any given base is either methylated or not. Therefore, looking at many cells should yield a similar pattern where we see lots of locations with high methylation and lots of locations with low methylation.
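
A minimal sketch of the plotting call, assuming the same getMethylationStats function with `plot = TRUE`. For brevity it reads only the “test2” example file, and the object name `test2` is chosen here for illustration.

```r
library(methylKit)

# Read the second example sample on its own
test2 <- methRead(system.file("extdata", "test2.myCpG.txt", package = "methylKit"),
                  sample.id = "test2", assembly = "hg18", context = "CpG")

# plot = TRUE draws the percent methylation histogram on the active
# graphics device instead of printing summary statistics
getMethylationStats(test2, plot = TRUE, both.strands = FALSE)
```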