1 Introduction

Batch effects refer to differences between data sets generated at different times or in different laboratories. These often occur due to uncontrolled variability in experimental factors, e.g., reagent quality, operator skill, atmospheric ozone levels. The presence of batch effects can interfere with downstream analyses if they are not explicitly modelled. For example, differential expression analyses typically use a blocking factor to absorb any batch-to-batch differences.
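As a minimal sketch of what a blocking factor looks like in practice, the batch can be included as an additive term in the design matrix of a linear model. The sample layout and the variable names (batch, condition, design) below are hypothetical and chosen only for illustration:

```r
# Hypothetical layout: six samples from two batches and two conditions.
batch <- factor(rep(c("A", "B"), each=3))
condition <- factor(c("ctrl", "ctrl", "treat", "ctrl", "treat", "treat"))

# The batch term absorbs batch-to-batch differences, so the
# condition coefficient is estimated within batches.
design <- model.matrix(~batch + condition)
colnames(design)
```

A design matrix of this form could then be passed to a model-based differential expression framework.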

For single-cell RNA sequencing (scRNA-seq) data analyses, explicit modelling of the batch effect is less relevant. Many common downstream procedures for exploratory data analysis are not model-based, including clustering and visualization. It is more generally useful to have methods that can remove batch effects to create a corrected expression matrix for further analysis. This follows the same strategy as, e.g., the removeBatchEffect() function in the limma package (Ritchie et al. 2015).
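To illustrate the idea of a corrected expression matrix, the sketch below applies removeBatchEffect() to simulated data. The toy matrix, the size of the shift and the variable names are assumptions made for this example and are not part of the data set used later:

```r
library(limma)

set.seed(100)
# Hypothetical toy data: 50 genes by 6 samples, three samples per batch.
toy <- matrix(rnorm(300), nrow=50)
toy.batch <- factor(rep(c("A", "B"), each=3))

# Adding a constant shift to all samples in the second batch.
toy[,toy.batch=="B"] <- toy[,toy.batch=="B"] + 2

# Fitting a linear model per gene and removing the batch term.
toy.corrected <- removeBatchEffect(toy, batch=toy.batch)

# After correction, each gene has the same mean in both batches.
all.equal(rowMeans(toy.corrected[,toy.batch=="A"]),
    rowMeans(toy.corrected[,toy.batch=="B"]))
```

The output has the same dimensions as the input, so it can be supplied directly to procedures such as clustering that are not model-based.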

Batch correction methods designed for bulk genomics data usually require knowledge of the other factors of variation. These factors are usually not known in scRNA-seq experiments, where the aim is to explore unknown heterogeneity in cell populations. The batchelor package implements batch correction methods that do not rely on a priori knowledge about the population structure. To demonstrate, we will use a small scRNA-seq data set (Tasic et al. 2016) from the scRNAseq package:


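library(scRNAseq)
data(allen) # loads the 'allen' object used below.
library(SingleCellExperiment)
sce1 <- as(allen, "SingleCellExperiment")
counts(sce1) <- assay(sce1)

library(scater) # provides normalize() and plotPCA().
sce1 <- sce1[1:2000,] # reducing the size for demo purposes.
sce1 <- normalize(sce1) # quick and dirty normalization.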

We artificially create a batch effect in a separate SingleCellExperiment object:

sce2 <- sce1
# One random shift per gene, recycled across all cells in the second batch.
logcounts(sce2) <- logcounts(sce2) + rnorm(nrow(sce2), sd=2)

combined <- cbind(sce1, sce2)
combined$batch <- factor(rep(1:2, c(ncol(sce1), ncol(sce2))))

plotPCA(combined, colour_by="batch") # checking there is a batch effect.