Melissa 1.16.0
## try http:// if https:// URLs are not supported
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("Melissa")
## Or download from Github repository
# install.packages("devtools")
devtools::install_github("andreaskapou/Melissa", build_vignettes = TRUE)
Measurements of DNA methylation at the single cell level are promising to revolutionise our understanding of epigenetic control of gene expression. Yet, intrinsic limitations of the technology result in very sparse coverage of CpG sites (around 5% to 20% coverage), effectively limiting the analysis repertoire to a semi-quantitative level. Melissa
(MEthyLation Inference for Single cell Analysis) [1], is a Bayesian hierarchical method to quantify spatially-varying methylation profiles across genomic regions from single-cell bisulfite sequencing data (scBS-seq). Melissa clusters individual cells based on local methylation patterns, enabling the discovery of epigenetic differences and similarities among individual cells. The clustering also acts as an effective regularisation method for imputation of methylation on unassayed CpG sites, enabling transfer of information between individual cells.
Melissa
depends heavily on the BPRMeth package [2, 3] for reading and processing bisulfite sequencing data. It assumes that the data are first processed using Bismark [4], hence from fastq and BAM files we will obtain a coverage file by running the bismark_methylation_extractor
command as shown below,
# Requires Bismark
bismark_methylation_extractor --comprehensive --merge_non_CpG \
--no_header --gzip --bedGraph input_file.bam
The format of the coverage file is the following
<chr> <start> <end> <met_prcg> <met_reads> <unmet_reads>
where each row corresponds to an observed CpG (i.e. we have at least one read mapped to this location). Note that CpGs with no coverage are not included in this file. This format however contains redundant information, hence we bring the scBS-seq files in the format that Melissa (and BPRMeth) require, which is
<chr> <start> <met_level>
where met_level
corresponds to the binary methylation state, either 0 or 1. We can do this by calling the binarise_files
helper function, which only requires the input directory of the files and optionally the path to the output directory. Each file of the indir
corresponds to a different cell and it can also be compressed, e.g. in .gz
file format, to save storage space.
library(Melissa)
# Binarise scBS-seq data
binarise_files(indir = "path_to_met_files_dir")
Note that the new binarised files will not be compressed, due to system commands not being portable inside the R package. To save storage space, the user could compress the files for instance in .gz file format by running in command line :
gzip filenames
Now we are ready to process the binarised input files and create methylation regions using functions from the BPRMeth package. Briefly, the steps required to create this object are as follows.
read_anno
file. Note that the annotation file can contain any genomic context: from promoters and gene bodies to enhancers, Nanog regulatory regions and CTCF regions; hence Melissa
can be used for a plethora of analyses that want to take spatial genomic correlations into account.read_met
function. We will do this independently per cell.create_region_object
will create the methylation regions object which is the main object for storing methylation data.The create_melissa_data_obj
is a wrapper function for doing all the above steps at once. Note that this step is important so read carefully the purpose of all the parameters to obtain the right object for downstream analysis.
melissa_data <- create_melissa_data_obj(met_dir = "path_to_bin_met_dir",
anno_file = "anno_file", cov = 3)
The melissa_data$met
contains the methylation data whose structure is a list of length \(N\) (number of cells), and each element of this list is another list of length \(M\) (number of genomic regions). Each entry in the inner list is an \(I\times 2\) matrix, where \(I\) are the number of CpGs, where the 1st column contains the (relative) CpG locations and the 2nd column contains the methylation state: methylated or unmethylated.
It is often useful to save this object to file with the saveRDS
function. The object can then be restored using the readRDS
function. This allows us to conduct downstream analysis without having to repeat the processing steps described above.
saveRDS(file = "melissa_data_obj.rds", melissa_data)
Next we will filter genomic regions according to different criteria. Note that these steps and their combinations are all optional and depend on the downstream analysis you want to perform.
Genomic regions with really sparse coverage of CpGs are not informative to infer methylation profiles. Hence, we only consider genomic regions with at least min_cpgcov
CpG coverage in each region. Note that this step will not actually remove any genomic regions, it will only set to NA
those regions that have coverage below the threshold.
melissa_data <- filter_by_cpg_coverage(melissa_data, min_cpgcov = 10)
Genomic regions that have not heterogeneity across different cells are often of no interest, for example if we were to use them for identifying cell subpopulations. This way we will both keep only the informative genomic regions and reduce the number of genomic regions for downstream analysis for efficiency.
melissa_data <- filter_by_variability(melissa_data, min_var = 0.2)
Genomic regions that have coverage only on a handful of cells are not powerful for sharing information across cells. For example, a specific promoter that has observations in 5 out of the 100 cells, will not contain enough to perform sharing of information, either for imputation or clustering. Hence, regions that are are not covered in at least min_cell_cov_prcg
of the cells are filtered out.
melissa_data <- filter_by_coverage_across_cells(melissa_data,
min_cell_cov_prcg = 0.5)
saveRDS(file = "melissa_data_obj_filtered.rds", melissa_data)
The Smallwood et al. (2014) [5] dataset can be downloaded with accession number GSE56879. For this dataset we used the already processed coverage files from Bismark. The filtering of cells that do not pass quality control (QC) was done according to the original studies. See supplementary information of [5] for IDs of cells that passed filtering.
The Angermueller et al. (2016) [6] dataset can be downloaded with accession number GSE74535. For this dataset we used the already processed coverage files from Bismark. The filtering of cells that do not pass quality control (QC) was done according to the original studies. See supplementary information of [5] for IDs of cells that passed filtering.
The analysis on the subsampled ENCODE WGBS data was performed on the bulk GM12878 (GEO GSE86765) and H1-hESC (GEO GSE80911) cell lines. The BAM files for these studies can be obtained directly from the ENCODE project portal.
#=================
# 1. Download BAM data
DATA_DIR="../encode/wgbs/"
# Download GM12878 cell line
wget -P ${DATA_DIR}GM12878/ https://www.encodeproject.org/files/ENCFF681ASN/@@download/ENCFF681ASN.bam
# Download H1-hESC cell line
wget -P ${DATA_DIR}H1hESC/ https://www.encodeproject.org/files/ENCFF546TLK/@@download/ENCFF546TLK.bam
Then we subsample the WGBS data from the BAM file, that is, we are going to remove individual reads instead of individual CpGs to take into account the nature of missing values of scBS-seq data. To do so we will run the samtools view
command which subsamples random lines from a BAM file. This way, we can generate artificially 40 pseudo-single cells by keeping only 0.5% of the bulk reads for each single cell.
data_dir="encode/wgbs/GM12878/SRR4235788.bam"
out_dir="encode/wgbs/GM12878/subsampled/GM12878"
for (( i=1; i <= 40; ++i ))
do
my_command="samtools view -s ${i}.005 -b $data_dir > ${out_dir}_${i}.bam"
eval $my_command
done
Finally, we run the bismark_methylation_extractor
command to obtain the methylation state of each covered CpG fomr the resulting BAM files. The following command will result in files of coverage
output and bedGraph
output.
data_dir="encode/wgbs/GM12878/subsampled/"
proc_dir="encode/wgbs/GM12878/processed/"
for (( i=1; i <= 40; ++i ))
do
my_command="bismark_methylation_extractor --ignore 2 --comprehensive --merge_non_CpG --no_header --multicore 4 -o $proc_dir --gzip --bedGraph ${data_dir}GM12878_${i}.bam"
eval $my_command
done
The analysis on the subsampled ENCODE RRBS data was performed again on the bulk GM12878 H1-hESC cells lines. We can download the the raw fastq
files from.
http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeHaibMethylRrbs
and search for GM12878
or H1-hESC
and download the fastq files only for the 2nd replicate.
Next we run Bismark. First run the bismark_genome_preparation
command to create a genome indexing for hg19
.
bismark_genome_preparation hg19/
After that we run the bismark
command which will create alignment files in bam
format.
#=================
# 3. Run bismark
bismark --genome hg19/ encode/wgEncodeHaibMethylRrbsGm12878HaibRawDataRep2.fastq.gz
bismark --genome hg19/ encode/wgEncodeHaibMethylRrbsH1hescHaibRawDataRep2.fastq.gz
After this step, we follow the same process as we did for the bulk ENCODE WGBS data above.
This vignette was compiled using:
sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.42 BiocStyle_2.28.0
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.31 R6_2.5.1 bookdown_0.33
## [4] fastmap_1.1.1 xfun_0.39 cachem_1.0.7
## [7] htmltools_0.5.5 rmarkdown_2.21 cli_3.6.1
## [10] sass_0.4.5 jquerylib_0.1.4 compiler_4.3.0
## [13] tools_4.3.0 evaluate_0.20 bslib_0.4.2
## [16] yaml_2.3.7 BiocManager_1.30.20 jsonlite_1.8.4
## [19] rlang_1.1.0
[1] Kapourani, C. A., & Sanguinetti, G. (2019). Melissa: Bayesian clustering and imputation of single cell methylomes. Genome Biology, 20(1), 61, DOI: https://doi.org/10.1186/s13059-019-1665-8
[2] Kapourani, C. A., & Sanguinetti, G. (2016). Higher order methylation features for clustering and prediction in epigenomic studies. Bioinformatics, 32(17), i405-i412, DOI: https://doi.org/10.1093/bioinformatics/btw432
[3] Kapourani, C. A. & Sanguinetti, G. (2018). BPRMeth: a flexible Bioconductor package for modelling methylation profiles. Bioinformatics, DOI: https://doi.org/10.1093/bioinformatics/bty129
[4] Krueger, F., & Andrews, S. R. (2011). Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics, 27(11), 1571-1572.
[5] Smallwood, S. A., Lee, H. J., Angermueller C., Krueger F., Saadeh H., Peat J., Andrews S. R., Stegle S., Reik W., and Kelsey G. (2014). Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nature methods, 11(8):817.
[6] Angermueller, C., Clark, S.J., Lee, H.J., Macaulay, I.C., Teng, M.J., Hu, T.X., Krueger, F., Smallwood, S.A., Ponting, C.P., Voet, T. and Kelsey, G. (2016). Parallel single-cell sequencing links transcriptional and epigenetic heterogeneity. Nature methods, 13(3), p.229.
This package was developed at the University of Edinburgh in the School of Informatics, with support from Guido Sanguinetti.
This study was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh.