Contents

1 Introduction

The CellMixS package is a toolbox to explore and compare group effects in single-cell RNA-seq data. It has two major applications:

For this purpose it introduces two new metrics:

It also provides implementations and wrappers for a set of metrics with a similar purpose: entropy, the inverse Simpson index (Korsunsky et al. 2018), and Seurat’s mixing metric and local structure metric (Stuart et al. 2018). Besides this, several exploratory plotting functions enable evaluation of key integration and mixing features.

2 Installation

CellMixS can be installed from Bioconductor as follows.

if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("CellMixS")

After installation the package can be loaded into R.

library(CellMixS)

3 Getting started

3.1 Load example data

CellMixS uses the SingleCellExperiment class from the SingleCellExperiment Bioconductor package as the format for input data.

The package contains example data named sim50, a list of simulated single-cell RNA-seq data with varying batch effect strength and unbalanced batch sizes.

Batch effects were introduced by sampling 0%, 20% or 50% of gene expression values from a distribution with modified mean value (e.g. 0% - 50% of genes were affected by a batch effect).

All datasets consist of 3 batches, one with 250 cells and the others with half of its size (125 cells). The simulation is modified after (Büttner et al. 2019) and described in sim50.

# Load required packages
suppressPackageStartupMessages({
    library(SingleCellExperiment)
    library(cowplot)
    library(limma)
    library(magrittr)
    library(dplyr)
    library(purrr)
    library(ggplot2)
    library(scater)
})
# Load sim_list example data
sim_list <- readRDS(system.file(file.path("extdata", "sim50.rds"), 
                                package = "CellMixS"))
names(sim_list)
#> [1] "batch0"  "batch20" "batch50"

sce50 <- sim_list[["batch50"]]
class(sce50)
#> [1] "SingleCellExperiment"
#> attr(,"package")
#> [1] "SingleCellExperiment"

table(sce50[["batch"]])
#> 
#>   1   2   3 
#> 250 125 125

3.2 Visualize batch effect

Often batch effects can already be detected by visual inspection and simple visualization (e.g. in a normal tSNE or UMAP plot) depending on the strength. CellMixS contains various plotting functions to visualize group label and mixing scores aside. Results are ggplot objects and can be further customized using ggplot2. Other packages, such as scater, provide similar plotting functions and could be used instead.

# Visualize batch distribution in sce50
visGroup(sce50, group = "batch")

# Visualize batch distribution in other elements of sim_list 
batch_names <- c("batch0", "batch20")
  
vis_batch <- lapply(batch_names, function(name){
    sce <- sim_list[[name]]
    visGroup(sce, "batch") + ggtitle(paste0("sim_", name))
})

plot_grid(plotlist = vis_batch, ncol = 2)

4 Quantify batch effects

4.1 Cellspecific Mixing scores

Not all batch effects or group differences are obvious using visualization. Also, in single-cell experiments celltypes and cells can be affected in different ways by experimental conditions causing batch effects (e.g. some cells are more robust to storing conditions than others).

Furthermore the range of methods for data integration and batch effect removal gives rise to the question of which method performs best on which data, and thereby a need to quantify batch effects.

The cellspecific mixing score cms tests for each cell the hypothesis that batch-specific distance distributions towards it’s k-nearest neighbouring (knn) cells are derived from the same unspecified underlying distribution using the Anderson-Darling test (Scholz and Stephens 1987). Results from the cms function are two scores cms and cms_smooth, where the latter is the weighted mean of the cms within each cell’s neighbourhood. They can be interpreted as the data’s probability within an equally mixed neighbourhood. A high cms score refers to good mixing, while a low score indicates batch-specific bias. The test considers differences in the number of cells from each batch. This facilitates the cms score to differentiate between unbalanced batches (e.g. one celltype being more abundant in a specific batch) and a biased distribution of cells. The cms function takes a SingleCellExperiment object (described in SingleCellExperiment) as input and appends results to the colData slot.

5 Parameter

# Call cell-specific mixing score for sce50
# Note that cell_min is set to 4 due to the low number of cells and small k.
# Usually default parameters are recommended!! 
sce50 <- cms(sce50, k = 30, group = "batch", res_name = "unaligned", 
             n_dim = 3, cell_min = 4)
head(colData(sce50))
#> DataFrame with 6 rows and 3 columns
#>           batch cms_smooth.unaligned cms.unaligned
#>        <factor>            <numeric>     <numeric>
#> cell_1        1                    0             0
#> cell_2        1   0.0196989597146557             0
#> cell_3        1                    0             0
#> cell_4        1  0.00823377968898889             0
#> cell_5        1   0.0289643804228694             0
#> cell_6        1   0.0654406965671296             0

# Call cell-specific mixing score for all datasets
sim_list <- lapply(batch_names, function(name){
    sce <- sim_list[[name]]
    sce <- cms(sce, k = 30, group = "batch", res_name = "unaligned", 
               n_dim = 3, cell_min = 4)
}) %>% set_names(batch_names)

# Append cms50
sim_list[["batch50"]] <- sce50

5.1 Defining neighbourhoods

A key question when evaluating dataset structures is how to define neighbourhoods, or in this case, which cells to use to calculate the mixing. cms provides 3 options to define cells included in each Anderson-Darling test:

  • Number of knn: This is the default setting and can be set by the parameter k. The optimal choice depends on the application, as with a small k focus is on local mixing, while with a large k mixing with regard to more global structures is evaluated. In general k should not exceed the size of the smallest cell population as including cells from different cell populations can conflict with the underlying assumptions of distance distributions.
  • Density based neighbourhoods: In cases of unequal cell population sizes the optimal number of neighbours might vary. Using the size of the smallest population can lead to suboptimal neighbourhoods for larger populations in these cases, as the power of the AD-test increases with cell numbers. For these cases or others where a local adaptation of the neighbourhood is desired the k_min parameter is provided. It denotes the minimum number of cells that define a neighbourhood. Starting with the knn as defined by k the cms function will cut neighbourhoods by their first local minimum in the overall distance distribution of the knn cells. Only cells within a distance smaller than the first local minimum are included in the AD-test, but at least k_min cells.
  • Fixed minimum per batch: Another option to define a dynamic neighbourhood is provided by the batch_min parameter. It defines the minimum number of cells for each batch that shall be included into each neighbourhood. In this case the neighbourhoods will include an increasing number of neighbours until at least batch_min cells from each batch are included.

5.2 Further cms parameter settings

For smoothing, either k or (if specified) k_min cells are included to get a weighted mean of cms-scores. Weights are defined by the reciprocal distance towards the respective cell, with 1 as weight of the respective cell itself.

Another important parameter is the subspace to use to calculate cell distances. This can be set using the dim_red parameter. By default the PCA subspace will be used and calculated if not present. Some data integration methods provide embeddings of a common subspace instead of “corrected counts”. cms scores can be calculated within these by defining them with the dim_red argument (see 6.1). In general all reduced dimension representations can be specified, but only PCA will be computed automatically, while other methods need to be precomputed.

5.3 Visualize the cell mixing score

An overall summary of cms scores can be visualized as a histogram. As cms scores are p.values from hypothesis testing, without any batch effect the p.value histogram should be flat. An increased number of very small p-values indicates the presence of a batch-specific bias within data.

# p-value histogram of cms50
visHist(sce50)


# p-value histogram sim30
# Combine cms results in one matrix
batch_names <- names(sim_list)
cms_mat <- batch_names %>% 
  map(function(name) sim_list[[name]]$cms.unaligned) %>% 
  bind_cols() %>% set_colnames(batch_names)

visHist(cms_mat, n_col = 3)

Results of cms can be visualized in a cell-specific way and alongside any metadata.

# cms only cms20
sce20 <- sim_list[["batch20"]]
metric_plot <- visMetric(sce20, metric_var = "cms_smooth.unaligned")

# group only cms20
group_plot <- visGroup(sce20, group = "batch")

plot_grid(metric_plot, group_plot, ncol = 2)

# Add random celltype assignments as new metadata
sce20[["celltype"]] <- rep(c("CD4+", "CD8+", "CD3"), length.out = ncol(sce20)) %>% 
    as.factor 

visOverview(sce20, "batch", other_var = "celltype")

Systematic differences (e.g. celltype differences) can be further explored using visCluster. Here we do not expect any systematic difference as celltypes were randomly assigned.

visCluster(sce20, metric_var = "cms.unaligned", cluster_var = "celltype")
#> Picking joint bandwidth of 0.0996

6 Evaluate data integration

6.1 Mixing after data integration

To remove batch effects when integrating different single-cell RNAseq datasets, a range of methods can be used. The cms function can be used to evaluate the effect of these methods, using a cell-specific mixing score. Some of them (e.g. fastMNN from the batchelor package) provide a “common subspace” with integrated embeddings. Other methods like limma give “batch-corrected data” as results. Both work as input for cms.

# MNN - embeddings are stored in the reducedDims slot of sce
reducedDimNames(sce20)
#> [1] "TSNE" "PCA"  "MNN"
sce20 <- cms(sce20, k = 30, group = "batch", 
             dim_red = "MNN", res_name = "MNN", n_dim = 3, cell_min = 4)

# Run limma
sce20 <- scater::logNormCounts(sce20)
limma_corrected <- removeBatchEffect(logcounts(sce20), batch = sce20$batch)
# Add corrected counts to sce
assay(sce20, "lim_corrected") <- limma_corrected 

# Run cms
sce20 <- cms(sce20, k = 30, group = "batch", 
             assay_name = "lim_corrected", res_name = "limma", n_dim = 3, 
             cell_min = 4)

names(colData(sce20))
#> [1] "batch"                "cms_smooth.unaligned" "cms.unaligned"       
#> [4] "celltype"             "cms_smooth.MNN"       "cms.MNN"             
#> [7] "cms_smooth.limma"     "cms.limma"

6.2 Compare data integration methods

To compare different methods, summary plots from visIntegration (see 6.4) and p-value histograms from visHist can be used. Local patterns within single methods can be explored as described above.

# As pvalue histograms
visHist(sce20, metric = "cms.",  n_col = 3)

Here both methods limma and fastMNN from the scran package flattened the p.value distribution. So cells are better mixed after batch effect removal.

6.3 Remaining batch-specific structure - ldfDiff

Besides successful batch “mixing”, data integration should also preserve the data’s internal structure and variability without adding new sources of variability or removing underlying structures. Especially for methods that result in “corrected counts” it is important to understand how much of the dataset’s internal structures are preserved.

ldfDiff calculates the differences between each cell’s local density factor before and after data integration (Latecki, Lazarevic, and Pokrajac 2007). The local density factor is a relative measure of the cell density around a cell compared to the densities within its neighbourhood. Local density factors are calculated on the same set of k cells from the cell’s knn before integration. In an optimal case relative densities (according to the same set of cells) should not change by integration and the ldfDiff score should be close to 0. In general the overall distribution of ldfDiff should be centered around 0 without long tails.

# Prepare input 
# List with single SingleCellExperiment objects 
# (Important: List names need to correspond to batch levels! See ?ldfDiff)
sce_pre_list <- list("1" = sce20[,sce20$batch == "1"], 
                     "2" = sce20[,sce20$batch == "2"], 
                     "3" = sce20[,sce20$batch == "3"])

sce20 <- ldfDiff(sce_pre_list, sce_combined = sce20, 
                 group = "batch", k = 70, dim_red = "PCA", 
                 dim_combined = "MNN", assay_pre = "counts", 
                 n_dim = 3, res_name = "MNN")

sce20 <- ldfDiff(sce_pre_list, sce_combined = sce20, 
                 group = "batch", k = 70, dim_red = "PCA", 
                 dim_combined = "PCA", assay_pre = "counts", 
                 assay_combined = "lim_corrected",  
                 n_dim = 3, res_name = "limma")

names(colData(sce20))
#>  [1] "batch"                "cms_smooth.unaligned" "cms.unaligned"       
#>  [4] "celltype"             "cms_smooth.MNN"       "cms.MNN"             
#>  [7] "cms_smooth.limma"     "cms.limma"            "diff_ldf.MNN"        
#> [10] "diff_ldf.limma"

6.4 Visualize ldfDiff

Results from ldfDiff can be visualized in a similar way as results from cms.

# ldfDiff score summarized
visIntegration(sce20, metric = "diff_ldf", metric_name = "ldfDiff") 
#> Picking joint bandwidth of 0.0867

ldfDiff shows a clear difference between the two methods. While limma is able to preserve the batch internal structure within batches, fastMNN clearly changes it. Even if batches are well mixed (see 6.2), fastMNN does not work for batch effect removal on these simulated data. Again this is in line with expectations due to the small number of genes in the example data. One of MNN’s assumptions is that batch effects should be much smaller than biological variation, which does not hold true in this small example dataset.

7 Testing different metrics

Often it is useful to check different aspects of data mixing and integration by the use of different metrics, as many of them emphasize different features of mixing. To provide an easy interface for thorough investigation of batch effects and data integration a wrapper function of a variety of metrics is included into CellMixS. evalIntegration calls one or all of cms, ldfDiff, entropy or equivalents to mixingMetric, localStruct from the Seurat package or isi, a simplfied version of the local inverse Simpson index as suggested by (Korsunsky et al. 2018). entropy calculates the Shannon entropy within each cell’s knn describing the randomness of the batch variable. isi calculates the inverse Simpson index within each cell’s knn. The Simpson index describes the probability that two entities are taken at random from the dataset and its inverse represents the effective number of batches in the neighbourhood. A simplified version of the distance based weightening as proposed by (Korsunsky et al. 2018) is provided by the weight option. As before the resulting scores are included into the colData slot of the input SingleCellExperiment object and can be visualized with visMetric and other plotting functions.

sce50 <- evalIntegration(metrics = c("isi", "entropy"), sce50, 
                         group = "batch", k = 30, n_dim = 2, cell_min = 4,
                         res_name = c("weighted_isi", "entropy"))

visOverview(sce50, "batch", 
            metric = c("cms_smooth.unaligned", "weighted_isi", "entropy"), 
            prefix = FALSE)

8 Session info

sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] scater_1.14.6               ggplot2_3.3.0              
#>  [3] purrr_0.3.3                 dplyr_0.8.5                
#>  [5] magrittr_1.5                limma_3.42.2               
#>  [7] cowplot_1.0.0               SingleCellExperiment_1.8.0 
#>  [9] SummarizedExperiment_1.16.1 DelayedArray_0.12.3        
#> [11] BiocParallel_1.20.1         matrixStats_0.56.0         
#> [13] Biobase_2.46.0              GenomicRanges_1.38.0       
#> [15] GenomeInfoDb_1.22.1         IRanges_2.20.2             
#> [17] S4Vectors_0.24.4            BiocGenerics_0.32.0        
#> [19] CellMixS_1.2.6              kSamples_1.2-9             
#> [21] SuppDists_1.1-9.5           BiocStyle_2.14.4           
#> 
#> loaded via a namespace (and not attached):
#>  [1] viridis_0.5.1            BiocSingular_1.2.2       tidyr_1.0.2             
#>  [4] viridisLite_0.3.0        DelayedMatrixStats_1.8.0 assertthat_0.2.1        
#>  [7] BiocManager_1.30.10      GenomeInfoDbData_1.2.2   vipor_0.4.5             
#> [10] yaml_2.2.1               pillar_1.4.3             lattice_0.20-41         
#> [13] glue_1.4.0               digest_0.6.25            XVector_0.26.0          
#> [16] listarrays_0.3.1         colorspace_1.4-1         htmltools_0.4.0         
#> [19] Matrix_1.2-18            plyr_1.8.6               pkgconfig_2.0.3         
#> [22] magick_2.3               bookdown_0.18            zlibbioc_1.32.0         
#> [25] scales_1.1.0             tibble_3.0.0             farver_2.0.3            
#> [28] ellipsis_0.3.0           withr_2.1.2              cli_2.0.2               
#> [31] crayon_1.3.4             evaluate_0.14            fansi_0.4.1             
#> [34] beeswarm_0.2.3           tools_3.6.3              lifecycle_0.2.0         
#> [37] stringr_1.4.0            munsell_0.5.0            irlba_2.3.3             
#> [40] compiler_3.6.3           rsvd_1.0.3               rlang_0.4.5             
#> [43] grid_3.6.3               RCurl_1.98-1.1           ggridges_0.5.2          
#> [46] BiocNeighbors_1.4.2      labeling_0.3             bitops_1.0-6            
#> [49] rmarkdown_2.1            gtable_0.3.0             R6_2.4.1                
#> [52] gridExtra_2.3            knitr_1.28               stringi_1.4.6           
#> [55] ggbeeswarm_0.6.0         Rcpp_1.0.4.6             vctrs_0.2.4             
#> [58] tidyselect_1.0.0         xfun_0.12

References

Büttner, Maren, Zhichao Miao, F. Alexander Wolf, Sarah A. Teichmann, and Fabian J. Theis. 2019. “A test metric for assessing single-cell RNA-seq batch correction.” Nat. Methods 16 (1). Nature Publishing Group:43–49. https://doi.org/10.1038/s41592-018-0254-1.

Korsunsky, Ilya, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-Ru Loh, and Soumya Raychaudhuri. 2018. “Fast, sensitive, and flexible integration of single cell data with Harmony.” bioRxiv, November. Cold Spring Harbor Laboratory, 461954. https://doi.org/10.1101/461954.

Latecki, Longin Jan, Aleksandar Lazarevic, and Dragoljub Pokrajac. 2007. “Outlier Detection with Kernel Density Functions.” In Mach. Learn. Data Min. Pattern Recognit., 61–75. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-73499-4_6.

Scholz, F. W., and M. A. Stephens. 1987. “K-Sample Anderson-Darling Tests.” J. Am. Stat. Assoc. 82 (399):918. https://doi.org/10.2307/2288805.

Stuart, Tim, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M Mauck, Marlon Stoeckius, Peter Smibert, and Rahul Satija. 2018. “Comprehensive integration of single cell data.” bioRxiv, November. Cold Spring Harbor Laboratory, 460147. https://doi.org/10.1101/460147.