1 Installation

## try http:// if https:// URLs are not supported
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("Melissa")

## Or download from Github repository
# install.packages("devtools")
devtools::install_github("andreaskapou/Melissa", build_vignettes = TRUE)

2 Background

Measurements of DNA methylation at the single cell level are promising to revolutionise our understanding of epigenetic control of gene expression. Yet, intrinsic limitations of the technology result in very sparse coverage of CpG sites (around 5% to 20% coverage), effectively limiting the analysis repertoire to a semi-quantitative level. Melissa (MEthyLation Inference for Single cell Analysis) [1], is a Bayesian hierarchical method to quantify spatially-varying methylation profiles across genomic regions from single-cell bisulfite sequencing data (scBS-seq).

Melissa addresses the data sparsity issue by leveraging local correlations between neighbouring CpGs and similarity between individual cells (see Fig.1). The starting point is the definition of a set of genomic regions (e.g. genes or enhancers). Within each region, Melissa postulates a latent profile of methylation, a function mapping each CpG within the region to a number in \([0,1]\) which defines the probability of that CpG being methylated. To ensure spatial smoothness of the profile, Melissa uses a generalised linear model of basis function regression along the lines of [2,3] (with modified likelihood to account for single cell data). Local correlations are however often insufficient for regions with extremely sparse coverage, and these are quite common in scBS-seq data. Therefore, we share information across different cells by coupling the local GLM regressions through a shared prior distribution. In order to respect the (generally unknown) population structure that may be present within the cells assayed, we choose a (finite) Dirichlet mixture model prior.

`Melissa` model overview. Melissa combines a likelihood computed from single cell methylation profiles fitted to each genomic region using a supervised regression approach (bottom left) and an unsupervised Bayesian clustering prior (top left). The posterior distribution provides a methylome-based clustering (top right) and imputation (bottom right) of single cells.

Figure 1: Melissa model overview. Melissa combines a likelihood computed from single cell methylation profiles fitted to each genomic region using a supervised regression approach (bottom left) and an unsupervised Bayesian clustering prior (top left). The posterior distribution provides a methylome-based clustering (top right) and imputation (bottom right) of single cells.

3 Melissa analysis pipeline

3.1 Reading scBS-seq data

For reading, processing and filtering raw scBS-seq data consult the vignette “Process and filter scBS-seq data”, which provides a step by step tutorial on obtaining a melissa_data_obj object, which will be the input to the variational inference machinery for Melissa.

3.2 Loading synthetic data

To create a minimal working example, Melissa comes with already generated synthetic data of \(N = 200\) cells and \(M = 100\) genomic regions, consisting of \(K = 4\) cell subpopulations.

suppressPackageStartupMessages(library(Melissa)) # Load package
dt_obj <- melissa_encode_dt   # Load synthetic data

The structure of the dt_obj is a list of three elements, which only the met element is of direct interest, the remaining ones are mostly for keeping metadata information and information about the data generating / processing procedure.

# Elements of `dt_obj` object
names(dt_obj)
## [1] "met"         "anno_region" "opts"

From the \(2^{nd}\) cell we can access the \(50^{th}\) genomic region using the following code, where the 1st column represents the relative CpG location within the region (scaled to [-1, 1] interval) and the 2nd column represents the methylation state: 1 = methylated and 0 = unmethylated.

head(dt_obj$met[[2]][[50]])
##        [,1] [,2]
## [1,] -0.998    0
## [2,] -0.985    1
## [3,] -0.982    0
## [4,] -0.757    1
## [5,] -0.697    1
## [6,] -0.668    1
# Number of cells
cat("Number of cells: ", length(dt_obj$met))
## Number of cells:  200
# Number of genomic regions in each cell
cat("Number of genomic regions: ", length(dt_obj$met[[1]]) )
## Number of genomic regions:  100

3.3 Creating basis object

For each genomic region we infer a methylation profile using a GLM of basis function regression along the lines of [1,2]. Again we use the functionality of BPRMeth to create a basis object, in our case it will be a Radial Basis Function (RBF), however, one can create polynomial and Fourier basis functions which can be created with the create_polynomial_object and create_fourier_object functions, respectively.

library(BPRMeth)
# Create RBF basis object with 4 RBFs
basis_obj <- create_rbf_object(M = 4)

The rbf object contains information such as the centre locations \(\mu_{j}\) and the value of the spatial scale parameter \(\gamma\)

# Show the slots of the 'rbf' object
basis_obj
## $M
## [1] 4
## 
## $mus
## [1] -0.6 -0.2  0.2  0.6
## 
## $gamma
## [1] 4
## 
## $eq_spaced_mus
## [1] TRUE
## 
## $whole_region
## [1] TRUE
## 
## attr(,"class")
## [1] "rbf"

4 Clustering and imputing scBS-seq data

Now we are ready to perfrom inference using the Melissa model and jointly cluster cells based on their DNA methylation landscape and impute missing CpG methylation states.

4.1 Partitioning to training and test set (Optional)

To be able to evaluate the imputation performance of Melissa, we need ground truth labels of methylation states. To do so, we can partition the original dataset to training and test set, where a subset of CpGs will be used for training and the remaining sites for testing. For instance in a genomic region with 50 CpGs, 35 will be used for training and the remaining will be used for testing. We should highlight that the partitioning is at the genomic region level and not at the cell level, so we do not have to impute the methylome of a whole cell (which is not practically useful from the first place).

Note that this process is optional and is required only for being able to evaluate how well Melissa performs imputation. The user can skip the partitioning step and directly run Melissa on the whole dataset. For real scBS-seq data the user can use the inferred profiles to impute CpGs with no coverage at all.

set.seed(15)
# Partition to training and test set
dt_obj <- partition_dataset(dt_obj = dt_obj, data_train_prcg = 0.2,
                            region_train_prcg = 1, cpg_train_prcg = 0.4, 
                            is_synth = TRUE)

This code snippet, will update the met element of the dt_obj to retain only the CpGs used as training set and the test set will be now stored in the met_test element in the same object. Note that during inference, Melissa will only have access to the training data and will ignore totally the test set. For more details on the usage of the additional parameters type ?partition_dataset.

4.2 Running Melissa model

Whether or not a subset of the data was used as test set, the command for running Melissa remains the same. We need to provide it with the (training) data X, the number of clusters K that we expect to find in the cell population, the basis function object and with initial values for the hyperparameters. In this example, we will mostly keep the default values for the (hyper)-parameters.

set.seed(15)
# Run Melissa with K = 4 clusters
melissa_obj <- melissa(X = dt_obj$met, K = 4, basis = basis_obj,
                       vb_max_iter = 20, vb_init_nstart = 1, is_parallel = FALSE)
## 
  |                                                                         
  |                                                                   |   0%
  |                                                                         
  |====                                                               |   6%
  |                                                                         
  |=======                                                            |  11%
  |                                                                         
  |===========                                                        |  17%
  |                                                                         
  |===============                                                    |  22%
  |                                                                         
  |===================                                                |  28%
  |                                                                         
  |======================                                             |  33%
  |                                                                         
  |==========================                                         |  39%
  |                                                                         
  |==============================                                     |  44%
  |                                                                         
  |==================================                                 |  50%
  |                                                                         
  |=====================================                              |  56%
  |                                                                         
  |=========================================                          |  61%
  |                                                                         
  |=============================================                      |  67%
  |                                                                         
  |================================================                   |  72%
  |                                                                         
  |====================================================               |  78%
  |                                                                         
  |========================================================           |  83%
  |                                                                         
  |============================================================       |  89%
  |                                                                         
  |===============================================================    |  94%
  |                                                                         
  |===================================================================| 100%

4.2.1 Output summary

We can check the mixing proportions for each cluster, to obtain an estimate of the proportion of cells that are assigned to each cell subpopulation.

melissa_obj$pi_k
##  cluster1  cluster2  cluster3  cluster4 
## 0.1559409 0.4232673 0.2450502 0.1757436

The posterior probabilities of each cell belonging to a certain cluster (responsibilities) are stored in the r_nk element. Here each column represents a different cluster and each row a different cell.

head(melissa_obj$r_nk)
##           cluster1      cluster2      cluster3      cluster4
## [1,] 1.005152e-228  1.000000e+00 2.292296e-169 3.328830e-175
## [2,] 1.923678e-225  1.000000e+00 2.961327e-176 7.823470e-167
## [3,] 4.816203e-255  1.000000e+00 6.470452e-156 2.263771e-174
## [4,] 8.372995e-240  1.000000e+00 3.419202e-135 1.298840e-190
## [5,] 2.813925e-239  1.000000e+00 1.001113e-156 8.391105e-194
## [6,] 4.382205e-183 5.945496e-166 4.085146e-199  1.000000e+00

The posterior mean methylation profile of each genomic region and cell subtype is stored in the W element. We can access the posterior weights of the 10th genomic region for the 2nd cell subpopulation. Here each element of the vector corresponds to a basis function, except the first element which corresponds to the bias term.

melissa_obj$W[10, , 3]
## [1] -1.786867  0.742057 -1.953177  1.153712  2.835504

4.2.2 Plottting methylation profiles

We can plot methylation profiles of each cell subtype for specific genomic regions using the plot_melissa_profiles function.

# Plot profiles from all cell subtypes for genomic region 22
plot_melissa_profiles(melissa_obj = melissa_obj, region = 22, 
                      title = "Methylation profiles for region 22")

# Plot profiles from all cell subtypes for genomic region 77
plot_melissa_profiles(melissa_obj = melissa_obj, region = 77, 
                      title = "Methylation profiles for region 77")

4.2.3 Evaluating clustering performance

Since these are synthetic generated data, we have the ground truth labels for the assignment of each cell to the corresponding cluster, which is stored in dt_obj$opts$C_true. Let’s now evaluate the clustering performance both in terms of Adjusted Rand Index (ARI) and clustering assignment error metrics.

# Run clustering performance
melissa_obj <- eval_cluster_performance(melissa_obj, dt_obj$opts$C_true)
# ARI metric
cat("ARI: ", melissa_obj$clustering$ari)
## ARI:  1
# Clustering assignment error metric
cat("Clustering assignment error: ", melissa_obj$clustering$error)
## Clustering assignment error:  0

As we can observe Melissa clustered perfectly the synthetic data, which are also clearly separable from the example methylation profiles shown above.

4.2.4 Evaluating imputation performance

To evaluate the imputation performance of Melissa, we use the held out test set and compare the true methylation state of each CpG with the predicted methylation value, which is the evaluation of the latent methylation profile at the corresponding location.

imputation_obj <- impute_met_state(obj = melissa_obj, test = dt_obj$met_test)

Now we use different metrics to evaluate the prediction performance, such as AUC, ROC curve and precision-recall curves and store them as elements in melissa_obj object.

melissa_obj <- eval_imputation_performance(obj = melissa_obj, 
                                           imputation_obj = imputation_obj)
# AUC 
cat("AUC: ", melissa_obj$imputation$auc)
## AUC:  0.8566162
# F-measure
cat("F-measure: ", melissa_obj$imputation$f_measure)
## F-measure:  0.7415819

5 Session Info

This vignette was compiled using:

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
## [1] Melissa_1.2.0        BPRMeth_1.12.0       GenomicRanges_1.38.0
## [4] GenomeInfoDb_1.22.0  IRanges_2.20.0       S4Vectors_0.24.0    
## [7] BiocGenerics_0.32.0  knitr_1.25           BiocStyle_2.14.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2             mvtnorm_1.0-11         lattice_0.20-38       
##  [4] gtools_3.8.1           png_0.1-7              clues_0.6.1           
##  [7] assertthat_0.2.1       digest_0.6.22          foreach_1.4.7         
## [10] truncnorm_1.0-8        R6_2.4.0               MatrixModels_0.4-1    
## [13] evaluate_0.14          coda_0.19-3            ggplot2_3.2.1         
## [16] highr_0.8              pillar_1.4.2           gplots_3.0.1.1        
## [19] zlibbioc_1.32.0        rlang_0.4.1            lazyeval_0.2.2        
## [22] gdata_2.18.0           data.table_1.12.6      SparseM_1.77          
## [25] Matrix_1.2-17          rmarkdown_1.16         labeling_0.3          
## [28] stringr_1.4.0          RCurl_1.95-4.12        munsell_0.5.0         
## [31] compiler_3.6.1         xfun_0.10              pkgconfig_2.0.3       
## [34] mcmc_0.9-6             htmltools_0.4.0        tidyselect_0.2.5      
## [37] tibble_2.1.3           GenomeInfoDbData_1.2.2 bookdown_0.14         
## [40] matrixcalc_1.0-3       codetools_0.2-16       crayon_1.3.4          
## [43] dplyr_0.8.3            MASS_7.3-51.4          bitops_1.0-6          
## [46] grid_3.6.1             gtable_0.3.0           magrittr_1.5          
## [49] scales_1.0.0           KernSmooth_2.23-16     stringi_1.4.3         
## [52] XVector_0.26.0         ROCR_1.0-7             cowplot_1.0.0         
## [55] RColorBrewer_1.1-2     iterators_1.0.12       tools_3.6.1           
## [58] glue_1.3.1             purrr_0.3.3            yaml_2.2.0            
## [61] colorspace_1.4-1       BiocManager_1.30.9     caTools_1.17.1.2      
## [64] quantreg_5.51          MCMCpack_1.4-4

6 Bibliography

[1] Kapourani, C. A., & Sanguinetti, G. (2018). Melissa: Bayesian clustering and imputation of single cell methylomes. bioRxiv, 312025, DOI: https://doi.org/10.1101/312025

[2] Kapourani, C. A., & Sanguinetti, G. (2016). Higher order methylation features for clustering and prediction in epigenomic studies. Bioinformatics, 32(17), i405-i412, DOI: https://doi.org/10.1093/bioinformatics/btw432

[3] Kapourani, C. A. & Sanguinetti, G. (2018). BPRMeth: a flexible Bioconductor package for modelling methylation profiles. Bioinformatics, DOI: https://doi.org/10.1093/bioinformatics/bty129

7 Acknowledgements

This package was developed at the University of Edinburgh in the School of Informatics, with support from Guido Sanguinetti.

This study was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh.