1 Introduction

The scMerge algorithm performs batch effect removal and normalisation for single-cell RNA-seq data. It comprises three key components:

  1. The identification of stably expressed genes (SEGs) as “negative controls” for estimating unwanted factors;
  2. The construction of pseudo-replicates to estimate the effects of unwanted factors; and
  3. The adjustment of the datasets with unwanted variation using a fastRUVIII model.

The purpose of this vignette is to illustrate some uses of scMerge and explain its key components.

2 Loading Packages and Data

We will load the scMerge package. We designed our package to be consistent with Bioconductor's popular single-cell analysis framework, namely the SingleCellExperiment and scater packages.

suppressPackageStartupMessages({
  library(SingleCellExperiment)
  library(scMerge)
  library(scater)
  })

We provide an illustrative mouse embryonic stem cell (mESC) dataset in our package, as well as a pre-computed list of stably expressed genes (SEGs) to be used as negative control genes.

The full curated, unnormalised mESC data can be found here. The scMerge package comes with a sub-sampled, two-batch version of this data (the batches are named “batch2” and “batch3” to be consistent with the full data).

## Subsetted mouse ESC data
data("example_sce", package = "scMerge")

In this mESC data, we pooled cells of three different cell types from two different batches. A PCA plot shows that, despite strong separation by cell type, there is also strong separation due to batch effects. The cell type and batch labels are stored in the colData of example_sce.

scater::plotPCA(
  example_sce, 
  colour_by = "cellTypes", 
  shape_by = "batch")
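
The labels behind this plot can also be inspected directly. The following is a minimal sketch that assumes only that example_sce carries the cellTypes and batch columns used in the plotPCA call above; the object name batch_by_type is introduced here purely for illustration.

```r
library(SingleCellExperiment)
data("example_sce", package = "scMerge")

# Column-level annotations: one row per cell
head(colData(example_sce))

# Cross-tabulate cells by cell type and batch
# (batch_by_type is an illustrative name, not part of scMerge)
batch_by_type <- table(example_sce$cellTypes, example_sce$batch)
batch_by_type
```

A table like this is a convenient way to confirm how many cell types are present in each batch before choosing clustering parameters.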

3 Illustrating pseudo-replicate construction

The first major component of scMerge is to obtain negative controls for our normalisation. In this vignette, we will use a set of SEGs pre-computed from single-cell mouse data, made available through the segList_ensemblGeneID data in our package. For more information about the selection of negative controls and SEGs, please see Section select SEGs.

## single-cell stably expressed gene list
data("segList_ensemblGeneID", package = "scMerge")
head(segList_ensemblGeneID$mouse$mouse_scSEG)
#> [1] "ENSMUSG00000058835" "ENSMUSG00000026842" "ENSMUSG00000027671"
#> [4] "ENSMUSG00000020152" "ENSMUSG00000054693" "ENSMUSG00000049470"
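
Before using these genes as negative controls, it can be worth checking how many of them are actually present in the data to be normalised. The sketch below assumes that the row names of example_sce are Ensembl gene IDs (which is implied by passing this list directly as negative controls); the names seg and seg_in_data are introduced here for illustration only.

```r
library(SingleCellExperiment)
data("example_sce", package = "scMerge")
data("segList_ensemblGeneID", package = "scMerge")

# Pre-computed mouse single-cell SEGs
seg <- segList_ensemblGeneID$mouse$mouse_scSEG

# SEGs that also appear among the genes (rows) of the example data
# (seg and seg_in_data are illustrative names, not part of scMerge)
seg_in_data <- intersect(seg, rownames(example_sce))

length(seg)          # size of the SEG list
length(seg_in_data)  # how many of those are measured in example_sce
```

If very few SEGs overlap with the data, the gene identifiers are probably in a different format and should be checked.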

The second major component of scMerge is to compute pseudo-replicates for cells so we can perform normalisation. We offer three major ways of computing this pseudo-replicate information:

  1. Unsupervised clustering, using k-means clustering;
  2. Supervised clustering, using known cell type information; and
  3. Semi-supervised clustering, using partially known cell type information.

4 Unsupervised scMerge

In unsupervised scMerge, we perform k-means clustering to obtain pseudo-replicates. This requires the user to supply a kmeansK vector, with each element giving the number of clusters in the corresponding batch. For example, we know that “batch2” and “batch3” both contain three cell types. Hence, kmeansK = c(3, 3) in this case.

scMerge_unsupervised <- scMerge(
  sce_combine = example_sce, 
  ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
  kmeansK = c(3, 3),
  assay_name = "scMerge_unsupervised")
#> Step 1: Computation will run in serial
#> Step 2: Performing RUV normalisation. This will take minutes to hours.
#> scMerge complete!

We now construct the PCA plot again on our normalised data. We can observe a much better separation by cell type and less separation by batch.

scater::plotPCA(
  scMerge_unsupervised, 
  colour_by = "cellTypes", 
  shape_by = "batch",
  run_args = list(exprs_values = "scMerge_unsupervised"))
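
To make the role of kmeansK more concrete, the following is a minimal base R sketch of clustering each batch separately with its own number of clusters. It is not scMerge's internal implementation: the data are simulated, all object names (toy, batch, cluster_by_batch, and so on) are hypothetical, and how clusters are subsequently matched across batches is not shown.

```r
set.seed(1)

# Simulated stand-in data (hypothetical): 2 batches, each with 3
# well-separated groups of 20 cells described by 2 features.
n_per_group <- 20
centres <- rbind(c(0, 0), c(6, 0), c(0, 6))

make_batch <- function(shift) {
  do.call(rbind, lapply(seq_len(nrow(centres)), function(i) {
    noise <- matrix(rnorm(n_per_group * 2, sd = 0.5), ncol = 2)
    # Every row of this matrix is the group centre plus the batch shift
    centre <- matrix(centres[i, ] + shift, nrow = n_per_group,
                     ncol = 2, byrow = TRUE)
    noise + centre
  }))
}

# Batch "b" carries a constant shift, mimicking a batch effect
toy <- rbind(make_batch(shift = 0), make_batch(shift = 1.5))
batch <- rep(c("a", "b"), each = nrow(centres) * n_per_group)

# One entry per batch, as with the kmeansK argument
kmeansK <- c(a = 3, b = 3)

# Cluster each batch on its own, using that batch's number of clusters
cluster_by_batch <- lapply(names(kmeansK), function(b) {
  kmeans(toy[batch == b, ], centers = kmeansK[[b]], nstart = 10)$cluster
})
names(cluster_by_batch) <- names(kmeansK)

# Number of cells per cluster, per batch
lapply(cluster_by_batch, table)
```

The point of the sketch is only that kmeansK needs one entry per batch, and that each entry should reflect the number of cell types expected in that batch.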

5 Selecting all cells

By default, scMerge only uses 50% of the cells to perform k-means clustering. While this is sufficient for a satisfactory normalisation in most cases, users can set replicate_prop = 1 so that all cells are used in the k-means clustering.

scMerge_unsupervised_all <- scMerge(
  sce_combine = example_sce, 
  ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
  kmeansK = c(3, 3),
  assay_name = "scMerge_unsupervised_all",
  replicate_prop = 1)
#> Step 1: Computation will run in serial
#> Step 2: Performing RUV normalisation. This will take minutes to hours.
#> scMerge complete!

scater::plotPCA(
  scMerge_unsupervised_all, 
  colour_by = "cellTypes", 
  shape_by = "batch",
  run_args = list(exprs_values = "scMerge_unsupervised_all"))