1 Introduction

Batch effects refer to differences between data sets generated at different times or in different laboratories. These often occur due to uncontrolled variability in experimental factors, e.g., reagent quality, operator skill, atmospheric ozone levels. The presence of batch effects can interfere with downstream analyses if they are not explicitly modelled. For example, differential expression analyses typically use a blocking factor to absorb any batch-to-batch differences.

For single-cell RNA sequencing (scRNA-seq) data analyses, explicit modelling of the batch effect is less relevant. Manny common downstream procedures for exploratory data analysis are not model-based, including clustering and visualization. It is more generally useful to have methods that can remove batch effects to create an corrected expression matrix for further analysis. This follows the same strategy as, e.g., the removeBatchEffect() function in the limma package (Ritchie et al. 2015).

Batch correction methods designed for bulk genomics data usually require knowledge of the other factors of variation. This is usually not known in scRNA-seq experiments where the aim is to explore unknown heterogeneity in cell populations. The batchelor package implements batch correction methods that do not rely on a priori knowledge about the population structure.

2 Setting up demonstration data

To demonstrate, we will use two brain data sets (Tasic et al. 2016; Zeisel et al. 2015) from the scRNAseq package. A more thorough explanation of each of these steps is available in the book.

library(scRNAseq)
sce1 <- ZeiselBrainData()
sce1
## class: SingleCellExperiment 
## dim: 20006 3005 
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
##   1772058148_F03
## colData names(10): tissue group # ... level1class level2class
## reducedDimNames(0):
## altExpNames(2): ERCC repeat
sce2 <- TasicBrainData()
sce2
## class: SingleCellExperiment 
## dim: 24058 1809 
## metadata(0):
## assays(1): counts
## rownames(24058): 0610005C13Rik 0610007C21Rik ... mt_X57780 tdTomato
## rowData names(0):
## colnames(1809): Calb2_tdTpositive_cell_1 Calb2_tdTpositive_cell_2 ...
##   Rbp4_CTX_250ng_2 Trib2_CTX_250ng_1
## colData names(13): sample_title mouse_line ... secondary_type
##   aibs_vignette_id
## reducedDimNames(0):
## altExpNames(1): ERCC

We apply some quick-and-dirty quality control to both datasets, using the outlier detection strategy from the scuttle package.

library(scuttle)
sce1 <- addPerCellQC(sce1, subsets=list(Mito=grep("mt-", rownames(sce1))))
qc1 <- quickPerCellQC(colData(sce1), sub.fields="subsets_Mito_percent")
sce1 <- sce1[,!qc1$discard]

sce2 <- addPerCellQC(sce2, subsets=list(Mito=grep("mt_", rownames(sce2))))
qc2 <- quickPerCellQC(colData(sce2), sub.fields="subsets_Mito_percent")
sce2 <- sce2[,!qc2$discard]

Some preprocessing is required to render these two datasets comparable. We subset to the common subset of genes:

universe <- intersect(rownames(sce1), rownames(sce2))
sce1 <- sce1[universe,]
sce2 <- sce2[universe,]

We compute log-normalized expression values using library size-derived size factors for simplicity. (More complex size factor calculation methods are available in the scran package.)

out <- multiBatchNorm(sce1, sce2)
sce1 <- out[[1]]
sce2 <- out[[2]]

Finally, we identify the top 5000 genes with the largest biological components of their variance. We will use these for all high-dimensional procedures such as PCA and nearest-neighbor searching.

library(scran)
dec1 <- modelGeneVar(sce1)
dec2 <- modelGeneVar(sce2)
combined.dec <- combineVar(dec1, dec2)
chosen.hvgs <- getTopHVGs(combined.dec, n=5000)

As a diagnostic, we check that there actually is a batch effect across these datasets by checking that they cluster separately. Here, we combine the two SingleCellExperiment objects without any correction using the NoCorrectParam() flag, and we informally verify that cells from different batches are separated using a \(t\)-SNE plot.

combined <- correctExperiments(A=sce1, B=sce2, PARAM=NoCorrectParam())

library(scater)
set.seed(100)
combined <- runPCA(combined, subset_row=chosen.hvgs)
combined <- runTSNE(combined, dimred="PCA")
plotTSNE(combined, colour_by="batch")