Large single-cell RNA sequencing (scRNA-seq) projects usually need to generate data across multiple batches due to logistical constraints. However, the processing of different batches is often subject to uncontrollable differences, e.g., changes in operator, differences in reagent quality. This results in systematic differences in the observed expression in cells from different batches, which we refer to as “batch effects”. Batch effects are problematic as they can be major drivers of heterogeneity in the data, masking the relevant biological differences and complicating interpretation of the results.
Computational correction of these effects is critical for eliminating batch-to-batch variation, allowing data across multiple batches to be combined for valid downstream analysis.
However, existing methods such as removeBatchEffect()
(Ritchie et al. 2015) assume that the composition of cell populations are either known or the same across batches.
This workflow describes the application of an alternative strategy for batch correction based on the detection of mutual nearest neighbours (MNNs) (Haghverdi et al. 2018).
The MNN approach does not rely on pre-defined or equal population compositions across batches, only requiring that a subset of the population be shared between batches.
We demonstrate its use on two human pancreas scRNA-seq datasets generated in separate studies.
This dataset was generated by Grun et al. (2016) using the CEL-seq protocol with unique molecular identifiers (UMIs) and ERCC spike-ins. Count tables were obtained from the NCBI Gene Expression Omnibus using the accession number above.
library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask = FALSE)
grun.fname <- bfcrpath(bfc, file.path("ftp://ftp.ncbi.nlm.nih.gov/geo/series",
"GSE81nnn/GSE81076/suppl/GSE81076%5FD2%5F3%5F7%5F10%5F17%2Etxt%2Egz"))
We first read the table into memory.
gse81076.df <- read.table(grun.fname, sep='\t',
header=TRUE, stringsAsFactors=FALSE, row.names=1)
dim(gse81076.df)
## [1] 20148 1728
head(rownames(gse81076.df))
## [1] "A1BG-AS1__chr19" "A1BG__chr19" "A1CF__chr10" "A2M-AS1__chr12"
## [5] "A2ML1__chr12" "A2MP1__chr12"
head(colnames(gse81076.df))
## [1] "D2ex_1" "D2ex_2" "D2ex_3" "D2ex_4" "D2ex_5" "D2ex_6"
Unfortunately, the data and metadata are all mixed together in this file. As a result, we need to manually extract the metadata from the column names.
donor.names <- sub("^(D[0-9]+).*", "\\1", colnames(gse81076.df))
table(donor.names)
## donor.names
## D101 D102 D10631 D17 D1713 D172444 D2 D3 D71
## 96 96 96 288 96 96 96 480 96
## D72 D73 D74
## 96 96 96
plate.id <- sub("^D[0-9]+(.*)_.*", "\\1", colnames(gse81076.df))
table(plate.id)
## plate.id
## All1 All2 TGFB en1 en2 en3 en4 ex
## 864 96 96 96 96 96 96 96 192
Another irritating feature of this dataset is that gene symbols were supplied, rather than stable identifiers such as Ensembl.
Stable identifiers are desirable as they facilitate reliable cross-referencing of gene identities between data sets1 The two data sets used in this workflow come from the same provider, who at least uses gene symbols consistently. So, technically, we could have skipped the conversion here. Nonetheless, we have chosen to perform it to demonstrate how one would deal with more heterogeneous data sources (e.g., mixtures of Ensembl identifiers and gene symbols, or multiple synonymous gene symbols)..
We convert all row names to Ensembl identifiers, removing NA
or duplicated entries (with the exception of spike-in transcripts).
gene.symb <- gsub("__chr.*$", "", rownames(gse81076.df))
is.spike <- grepl("^ERCC-", gene.symb)
table(is.spike)
## is.spike
## FALSE TRUE
## 20064 84
library(org.Hs.eg.db)
gene.ids <- mapIds(org.Hs.eg.db, keys=gene.symb, keytype="SYMBOL", column="ENSEMBL")
gene.ids[is.spike] <- gene.symb[is.spike]
keep <- !is.na(gene.ids) & !duplicated(gene.ids)
gse81076.df <- gse81076.df[keep,]
rownames(gse81076.df) <- gene.ids[keep]
summary(keep)
## Mode FALSE TRUE
## logical 2372 17776
We create a SingleCellExperiment
object to store the counts and metadata together.
This reduces the risk of book-keeping errors in later steps of the analysis.
Note that we re-identify the spike-in rows, as the previous indices would have changed after the subsetting.
library(SingleCellExperiment)
sce.gse81076 <- SingleCellExperiment(list(counts=as.matrix(gse81076.df)),
colData=DataFrame(Donor=donor.names, Plate=plate.id),
rowData=DataFrame(Symbol=gene.symb[keep]))
isSpike(sce.gse81076, "ERCC") <- grepl("^ERCC-", rownames(gse81076.df))
sce.gse81076
## class: SingleCellExperiment
## dim: 17776 1728
## metadata(0):
## assays(1): counts
## rownames(17776): ENSG00000268895 ENSG00000121410 ... ENSG00000074755
## ENSG00000036549
## rowData names(1): Symbol
## colnames(1728): D2ex_1 D2ex_2 ... D17TGFB_95 D17TGFB_96
## colData names(2): Donor Plate
## reducedDimNames(0):
## spikeNames(1): ERCC
We compute quality control (QC) metrics for each cell (McCarthy et al. 2017) and identify cells with low library sizes, low numbers of expressed genes, or high ERCC content.
library(scater)
sce.gse81076 <- calculateQCMetrics(sce.gse81076, compact=TRUE)
QC <- sce.gse81076$scater_qc
low.lib <- isOutlier(QC$all$log10_total_counts, type="lower", nmad=3)
low.genes <- isOutlier(QC$all$log10_total_features_by_counts, type="lower", nmad=3)
high.spike <- isOutlier(QC$feature_control_ERCC$pct_counts, type="higher", nmad=3)
data.frame(LowLib=sum(low.lib), LowNgenes=sum(low.genes),
HighSpike=sum(high.spike, na.rm=TRUE))
## LowLib LowNgenes HighSpike
## 1 55 130 388
Cells with extreme values for these QC metrics are presumed to be of low quality and are removed. A more thorough analysis would examine the distributions of these QC metrics beforehand, but we will skip that step for brevity here.
discard <- low.lib | low.genes | high.spike
sce.gse81076 <- sce.gse81076[,!discard]
summary(discard)
## Mode FALSE TRUE
## logical 1292 436
We compute size factors for the endogenous genes using the deconvolution method (Lun, Bach, and Marioni 2016).
This is done with pre-clustering by quickCluster()
to avoid pooling together very different cells.
library(scran)
library(BiocSingular)
set.seed(1000) # for irlba.
clusters <- quickCluster(sce.gse81076, use.ranks=FALSE, BSPARAM=IrlbaParam())
table(clusters)
## clusters
## 1 2 3 4 5 6 7
## 106 196 273 299 153 163 102
sce.gse81076 <- computeSumFactors(sce.gse81076, min.mean=0.1, clusters=clusters)
summary(sizeFactors(sce.gse81076))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.003541 0.438547 0.791872 1.000000 1.290387 8.275084
We also compute size factors for the spike-in transcripts (Lun et al. 2017).
Recall that we set general.use=FALSE
to ensure that the spike-in size factors are only applied to the spike-in transcripts.
sce.gse81076 <- computeSpikeFactors(sce.gse81076, general.use=FALSE)
summary(sizeFactors(sce.gse81076, "ERCC"))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01042 0.57782 0.88699 1.00000 1.27765 7.43998
We then compute normalized log-expression values for use in downstream analyses.
sce.gse81076 <- normalize(sce.gse81076)
We identify highly variable genes (HVGs) using trendVar()
and decomposeVar()
, using the variances of spike-in transcripts to model technical noise.
We set block=
to ensure that uninteresting differences between plates or donors do not inflate the variance.
The small discrepancy in the fitted trend in Figure 1 is caused by the fact that the trend is fitted robustly to the block-wise variances of the spike-ins,
while the variances shown are averaged across blocks and not robust to outliers.
block <- paste0(sce.gse81076$Plate, "_", sce.gse81076$Donor)
fit <- trendVar(sce.gse81076, block=block, parametric=TRUE)
dec <- decomposeVar(sce.gse81076, fit)
plot(dec$mean, dec$total, xlab="Mean log-expression",
ylab="Variance of log-expression", pch=16)
is.spike <- isSpike(sce.gse81076)
points(dec$mean[is.spike], dec$total[is.spike], col="red", pch=16)
curve(fit$trend(x), col="dodgerblue", add=TRUE)
We order genes by decreasing biological component, revealing some usual suspects such as insulin and glucagon.
We will be using this information later when performing feature selection prior to running mnnCorrect()
.
dec.gse81076 <- dec
dec.gse81076$Symbol <- rowData(sce.gse81076)$Symbol
dec.gse81076 <- dec.gse81076[order(dec.gse81076$bio, decreasing=TRUE),]
head(dec.gse81076)
## DataFrame with 6 rows and 7 columns
## mean total bio
## <numeric> <numeric> <numeric>
## ENSG00000254647 2.84438696097646 6.37771955862257 5.93492163812922
## ENSG00000129965 1.89026042119725 6.029667887451 5.58238903990959
## ENSG00000115263 4.01989463994372 5.74235119909272 5.50960551193726
## ENSG00000118271 3.67399583064614 5.77781565535565 5.47705987083213
## ENSG00000115386 4.24266345753652 5.39869761580333 5.16308284971761
## ENSG00000164266 3.03518659963559 5.4170378405534 5.00991839667255
## tech p.value FDR Symbol
## <numeric> <numeric> <numeric> <character>
## ENSG00000254647 0.442797920493351 0 0 INS
## ENSG00000129965 0.447278847541415 0 0 INS-IGF2
## ENSG00000115263 0.232745687155467 0 0 GCG
## ENSG00000118271 0.300755784523523 0 0 TTR
## ENSG00000115386 0.235614766085716 0 0 REG1A
## ENSG00000164266 0.407119443880852 0 0 SPINK1
This dataset was generated by Muraro et al. (2016) using the CEL-seq2 protocol with unique molecular identifiers (UMIs) and ERCC spike-ins. Count tables were obtained from the NCBI Gene Expression Omnibus using the accession number above.
muraro.fname <- bfcrpath(bfc, file.path("ftp://ftp.ncbi.nlm.nih.gov/geo/series",
"GSE85nnn/GSE85241/suppl",
"GSE85241%5Fcellsystems%5Fdataset%5F4donors%5Fupdated%2Ecsv%2Egz"))
We first read the table into memory.
gse85241.df <- read.table(muraro.fname, sep='\t',
header=TRUE, row.names=1, stringsAsFactors=FALSE)
dim(gse85241.df)
## [1] 19140 3072
head(rownames(gse85241.df))
## [1] "A1BG-AS1__chr19" "A1BG__chr19" "A1CF__chr10" "A2M-AS1__chr12"
## [5] "A2ML1__chr12" "A2M__chr12"
head(colnames(gse85241.df))
## [1] "D28.1_1" "D28.1_2" "D28.1_3" "D28.1_4" "D28.1_5" "D28.1_6"
We extract the metadata from the column names.
donor.names <- sub("^(D[0-9]+).*", "\\1", colnames(gse85241.df))
table(donor.names)
## donor.names
## D28 D29 D30 D31
## 768 768 768 768
plate.id <- sub("^D[0-9]+\\.([0-9]+)_.*", "\\1", colnames(gse85241.df))
table(plate.id)
## plate.id
## 1 2 3 4 5 6 7 8
## 384 384 384 384 384 384 384 384
Yet again, gene symbols were supplied instead of Ensembl or Entrez identifiers.
We convert all row names to Ensembl identifiers, removing NA
or duplicated entries (with the exception of spike-in transcripts).
gene.symb <- gsub("__chr.*$", "", rownames(gse85241.df))
is.spike <- grepl("^ERCC-", gene.symb)
table(is.spike)
## is.spike
## FALSE TRUE
## 19059 81
library(org.Hs.eg.db)
gene.ids <- mapIds(org.Hs.eg.db, keys=gene.symb, keytype="SYMBOL", column="ENSEMBL")
gene.ids[is.spike] <- gene.symb[is.spike]
keep <- !is.na(gene.ids) & !duplicated(gene.ids)
gse85241.df <- gse85241.df[keep,]
rownames(gse85241.df) <- gene.ids[keep]
summary(keep)
## Mode FALSE TRUE
## logical 2223 16917
We create a SingleCellExperiment
object to store the counts and metadata together.
sce.gse85241 <- SingleCellExperiment(list(counts=as.matrix(gse85241.df)),
colData=DataFrame(Donor=donor.names, Plate=plate.id),
rowData=DataFrame(Symbol=gene.symb[keep]))
isSpike(sce.gse85241, "ERCC") <- grepl("^ERCC-", rownames(gse85241.df))
sce.gse85241
## class: SingleCellExperiment
## dim: 16917 3072
## metadata(0):
## assays(1): counts
## rownames(16917): ENSG00000268895 ENSG00000121410 ... ENSG00000074755
## ENSG00000036549
## rowData names(1): Symbol
## colnames(3072): D28.1_1 D28.1_2 ... D30.8_95 D30.8_96
## colData names(2): Donor Plate
## reducedDimNames(0):
## spikeNames(1): ERCC
We compute QC metrics for each cell and identify cells with low library sizes, low numbers of expressed genes, or high ERCC content.
sce.gse85241 <- calculateQCMetrics(sce.gse85241, compact=TRUE)
QC <- sce.gse85241$scater_qc
low.lib <- isOutlier(QC$all$log10_total_counts, type="lower", nmad=3)
low.genes <- isOutlier(QC$all$log10_total_features_by_counts, type="lower", nmad=3)
high.spike <- isOutlier(QC$feature_control_ERCC$pct_counts, type="higher", nmad=3)
data.frame(LowLib=sum(low.lib), LowNgenes=sum(low.genes),
HighSpike=sum(high.spike, na.rm=TRUE))
## LowLib LowNgenes HighSpike
## 1 577 669 695
Low-quality cells are defined as those with extreme values for these QC metrics and are removed.
discard <- low.lib | low.genes | high.spike
sce.gse85241 <- sce.gse85241[,!discard]
summary(discard)
## Mode FALSE TRUE
## logical 2346 726
We compute size factors for the endogenous genes and spike-in transcripts, and use them to compute log-normalized expression values.
set.seed(1000)
clusters <- quickCluster(sce.gse85241, use.ranks=FALSE, BSPARAM=IrlbaParam())
table(clusters)
## clusters
## 1 2 3 4 5 6 7 8 9
## 257 393 224 316 226 426 129 203 172
sce.gse85241 <- computeSumFactors(sce.gse85241, min.mean=0.1, clusters=clusters)
summary(sizeFactors(sce.gse85241))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0845 0.5328 0.8151 1.0000 1.2090 15.3318
sce.gse85241 <- computeSpikeFactors(sce.gse85241, general.use=FALSE)
summary(sizeFactors(sce.gse85241, "ERCC"))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.09295 0.61309 0.88902 1.00000 1.27519 4.04643
sce.gse85241 <- normalize(sce.gse85241)
We fit a trend to the spike-in variances as previously described, allowing us to model the technical noise for each gene (Figure 2).
Again, we set block=
to ensure that uninteresting differences between plates or donors do not inflate the variance.
block <- paste0(sce.gse85241$Plate, "_", sce.gse85241$Donor)
fit <- trendVar(sce.gse85241, block=block, parametric=TRUE)
dec <- decomposeVar(sce.gse85241, fit)
plot(dec$mean, dec$total, xlab="Mean log-expression",
ylab="Variance of log-expression", pch=16)
is.spike <- isSpike(sce.gse85241)
points(dec$mean[is.spike], dec$total[is.spike], col="red", pch=16)
curve(fit$trend(x), col="dodgerblue", add=TRUE)
We order genes by decreasing biological component, as described above.
dec.gse85241 <- dec
dec.gse85241$Symbol <- rowData(sce.gse85241)$Symbol
dec.gse85241 <- dec.gse85241[order(dec.gse85241$bio, decreasing=TRUE),]
head(dec.gse85241)
## DataFrame with 6 rows and 7 columns
## mean total bio
## <numeric> <numeric> <numeric>
## ENSG00000115263 7.67110494913732 6.64869412141223 6.61993009624089
## ENSG00000089199 4.63141777116535 6.49101561206233 6.37005001267999
## ENSG00000169903 3.02363719480826 6.60374469250074 6.25171241896403
## ENSG00000254647 2.00812251287305 6.46803316161483 5.91300126053973
## ENSG00000118271 7.34417012391726 5.82255630100473 5.7911444440125
## ENSG00000171951 4.2002996520299 5.63389093078673 5.47030577732233
## tech p.value FDR Symbol
## <numeric> <numeric> <numeric> <character>
## ENSG00000115263 0.0287640251713411 0 0 GCG
## ENSG00000089199 0.120965599382341 0 0 CHGB
## ENSG00000169903 0.352032273536714 0 0 TM4SF4
## ENSG00000254647 0.555031901075099 0 0 INS
## ENSG00000118271 0.0314118569922332 0 0 TTR
## ENSG00000171951 0.16358515346441 0 0 SCG2
To obtain a single set of features for batch correction, we compute the average biological component across all batches.
We then take all genes with positive biological components to ensure that all interesting biology is retained, equivalent to the behaviour of denoisePCA()
.
However, the quality of the correction can often be sensitive to technical noise, which means that some discretion may be required during feature selection.
Users may prefer to take the top 1000-5000 genes with the largest average components, or to use combineVar()
to obtain combined \(p\)-values for gene selection.
universe <- intersect(rownames(dec.gse85241), rownames(dec.gse81076))
mean.bio <- (dec.gse85241[universe,"bio"] + dec.gse81076[universe,"bio"])/2
chosen <- universe[mean.bio > 0]
length(chosen)
## [1] 14667
We also rescale each batch to adjust for differences in sequencing depth between batches.
The multiBatchNorm()
function recomputes log-normalized expression values after adjusting the size factors for systematic differences in coverage between SingleCellExperiment
objects.
(Keep in mind that the previously computed size factors only remove biases between cells within a single batch.)
This improves the quality of the correction by removing one aspect of the technical differences between batches.
rescaled <- batchelor::multiBatchNorm(
sce.gse85241[universe,],
sce.gse81076[universe,]
)
rescaled.gse85241 <- rescaled[[1]]
rescaled.gse81076 <- rescaled[[2]]
Comments from Aaron:
multiBatchNorm()
.
This ensures that the variance components are estimated from the same values to be used in the batch correction.
In practice, this makes little difference, and it tends to be easier to process each batch separately and consolidate all results in one step as shown above.batchelor::
prefix avoids ambiguity during the migration of multiBatchNorm()
from scran to batchelor.
This will not be necessary in the next release.Consider a cell \(a\) in batch \(A\), and identify the cells in batch \(B\) that are nearest neighbours to \(a\) in the expression space defined by the selected features. Repeat this for a cell \(b\) in batch \(B\), identifying its nearest neighbours in \(A\). Mutual nearest neighbours are pairs of cells from different batches that belong in each other’s set of nearest neighbours. The reasoning is that MNN pairs represent cells from the same biological state prior to the application of a batch effect - see Haghverdi et al. (2018) for full theoretical details. Thus, the difference between cells in MNN pairs can be used as an estimate of the batch effect, the subtraction of which can yield batch-corrected values.
We apply the fastMNN()
function to the three batches to remove the batch effect, using the genes in chosen
.
To reduce computational work and technical noise, all cells in all cells are projected into the low-dimensional space defined by the top d
principal components.
Identification of MNNs and calculation of correction vectors are then performed in this low-dimensional space.
The function returns a SingleCellExperiment
object containing low-dimensional corrected values for downstream analyses like clustering or visualization.
set.seed(100)
unc.gse81076 <- logcounts(rescaled.gse81076)[chosen,]
unc.gse85241 <- logcounts(rescaled.gse85241)[chosen,]
mnn.out <- batchelor::fastMNN(
GSE81076=unc.gse81076, GSE85241=unc.gse85241,
k=20, d=50, BSPARAM=IrlbaParam(deferred=TRUE)
)
mnn.out
## class: SingleCellExperiment
## dim: 14667 3638
## metadata(2): merge.order merge.info
## assays(1): reconstructed
## rownames(14667): ENSG00000115263 ENSG00000089199 ... ENSG00000125450
## ENSG00000105171
## rowData names(1): rotation
## colnames(3638): D2ex_1 D2ex_2 ... D30.8_93 D30.8_94
## colData names(1): batch
## reducedDimNames(1): corrected
## spikeNames(0):
Each column of mnn.out
corresponds to a cell in one of the batches, while each row corresponds to an input gene in chosen
.
The corrected
matrix in the reducedDims
slot contains the low-dimensional corrected coordinates for all cells.
dim(reducedDim(mnn.out, "corrected"))
## [1] 3638 50
The batch
field in the column metadata contains an object specifying the batch of origin of each cell.
# Using an Rle for pretty-printing of batch IDs
# (as all cells from the same batch are consecutive).
Rle(mnn.out$batch)
## character-Rle of length 3638 with 2 runs
## Lengths: 1292 2346
## Values : "GSE81076" "GSE85241"
Advanced users may also be interested in the list of DataFrame
s in the pairs
metadata field.
Each DataFrame
describes the MNN pairs identified upon merging of each successive batch.
This may be useful for checking the identified MNN pairs against known cell type identity, e.g., to determine if the cell types are being paired correctly.
metadata(mnn.out)$merge.info$pairs[[1]]
## DataFrame with 6626 rows and 2 columns
## first second
## <integer> <integer>
## 1 1 1794
## 2 1 2087
## 3 15 1612
## 4 15 2512
## 5 15 2575
## ... ... ...
## 6622 1290 1538
## 6623 1290 2342
## 6624 1290 2625
## 6625 1290 2001
## 6626 1290 2794
As previously mentioned, we have only used two batches here to simplify the workflow.
However, the MNN approach is not limited to two batches, and inclusion of more batches is as simple as adding more SingleCellExperiment
objects to the fastMNN()
call.
Comments from Aaron:
k=
parameter specifies the number of nearest neighbours to consider when defining MNN pairs.
This should be interpreted as the minimum frequency of each cell type or state in each batch.
k
can also reduce accuracy by allowing incorrect MNN pairs to form between cells of different types.
Thus, we suggest starting with the default k
and increasing it if one is confident that the same cell types are not adequately merged across batches.
This is better than starting with a large k
as incorrect merging is much harder to diagnose than insufficient merging.BSPARAM=IrlbaParam(deferred=TRUE)
, fastMNN()
uses methods from the irlba package to perform the principal components analysis quickly.
While the run-to-run differences should be minimal, it does mean that set.seed()
is required to obtain fully reproducible results.
The deferred=
argument instructs fastMNN()
to sacrifice some numerical precision for greater speed.We examine the batch correction with some t-SNE plots. Figure~3 demonstrates how the cells separate by batch of origin in the uncorrected data. After correction, more intermingling between batches is observed, consistent with the removal of batch effects. Note that the E-MTAB-5601 dataset still displays some separation, which is probably due to the fact that the other batches are UMI datasets.
# Adding uncorrected values.
sce <- mnn.out
assay(sce, "original") <- cbind(unc.gse81076, unc.gse85241)
# Using irlba to set up the t-SNE, for speed.
set.seed(100)
osce <- runPCA(sce, exprs_values="original", ntop=Inf, BSPARAM=IrlbaParam())
osce <- runTSNE(osce, use_dimred="PCA")
ot <- plotTSNE(osce, colour_by="batch") + ggtitle("Original")
# Corrected.
set.seed(100)
sce <- runTSNE(sce, use_dimred="corrected")
ct <- plotTSNE(sce, colour_by="batch") + ggtitle("Corrected")
multiplot(ot, ct, cols=2)
We colour by the expression of marker genes for known pancreas cell types to determine whether the correction is biologically sensible. Cells in the same visual cluster express the same marker genes (Figure 4), indicating that the correction maintains separation of cell types.
# Replacing the row names for easier reference.
rowData(sce)$ENSEMBL <- rownames(sce)
rowData(sce)$SYMBOL <- mapIds(org.Hs.eg.db, keytype="ENSEMBL",
keys=rownames(sce), column="SYMBOL")
rownames(sce) <- uniquifyFeatureNames(rownames(sce), rowData(sce)$SYMBOL)
ct.gcg <- plotTSNE(sce, by_exprs_values="reconstructed", colour_by="GCG")
ct.ins <- plotTSNE(sce, by_exprs_values="reconstructed", colour_by="INS")
ct.sst <- plotTSNE(sce, by_exprs_values="reconstructed", colour_by="SST")
ct.ppy <- plotTSNE(sce, by_exprs_values="reconstructed", colour_by="PPY")
multiplot(ct.gcg + ggtitle("Alpha cells"),
ct.ins + ggtitle("Beta cells"),
ct.sst + ggtitle("Delta cells"),
ct.ppy + ggtitle("PP cells"),
cols=2)
One useful diagnostic is the proportion of variance within each batch that is lost during MNN correction.
Specifically, this refers to the within-batch variance that is removed during orthogonalization with respect to the average correction vector at each merge step.
This is returned via the lost.var
field in the metadata of mnn.out
, which contains a matrix of the variance lost in each batch (column) at each merge step (row).
metadata(mnn.out)$merge.info$lost.var
## [,1] [,2]
## [1,] 0.008257959 0.01133855
Large proportions of lost variance suggest that correction is removing genuine biological heterogeneity. This would occur due to violations of the assumption of orthogonality between the batch effect and the biological subspace (Haghverdi et al. 2018). In this case, the proportion of lost variance is small, indicating that non-orthogonality is not a major concern.
The order of the supplied batches will affect the result as the first batch is used to define the reference space to which all other batches are corrected. Specifically, the first batch is defined as the reference batch; the second batch is corrected to and merged with the current reference batch, yielding a new reference batch; and so on for all batches in the supplied order, with an increasingly large reference batch at each step. The use of a merged reference ensures that information from batches other than the first are used to identify MNN pairs in later batches.
The order of batches to merge can be manually specified in the auto.order=
argument to fastMNN()
.
In the example shown below, batch 2 is treated as the reference as it is the first specified batch in auto.order=
.
The first batch is then corrected to the second batch to obtain a new reference batch - and so on, if more than two batches were present.
mnn.out2 <- batchelor::fastMNN(
GSE81076=unc.gse81076, GSE85241=unc.gse85241,
k=20, d=50, auto.order=c(2,1), BSPARAM=IrlbaParam(deferred=TRUE)
)
metadata(mnn.out2)$merge.order # batch 2 (GSE85241) is first in the order.
## [1] "GSE85241" "GSE81076"
metadata(mnn.out2)$merge.info$pairs[[1]] # 'first' now refers to GSE85241.
## DataFrame with 6626 rows and 2 columns
## first second
## <integer> <integer>
## 1 1294 748
## 2 1294 635
## 3 1294 963
## 4 1294 1206
## 5 1294 921
## ... ... ...
## 6622 3638 1191
## 6623 3638 597
## 6624 3638 1274
## 6625 3638 1197
## 6626 3638 1188
Using auto.order=
will change the merge order without requiring a change to the supplied order of batches in fastMNN()
.
Similarly, the order of batches (and cells) in the output will not be altered.
This makes it easy to explore different merge orders without altering the surrounding code.
Rle(mnn.out2$batch) # same as mnn.out$batch
## character-Rle of length 3638 with 2 runs
## Lengths: 1292 2346
## Values : "GSE81076" "GSE85241"
If very different batches (in terms of cell composition) are present, we suggest setting the largest, most heterogeneous batch as the first. This ensures that sufficient MNN pairs will be identified between the first and other batches for stable correction. Conversely, if two small batches without shared populations are supplied first, the wrong MNN pairs will be detected and the result of the merge will be incorrect.
In more complex experiments, we may know beforehand that certain sets of batches are more similar to each other. We might then want to merge those similar batches before attempting the more difficult merges involving batches with different cell type composition and/or gene expression. Examples include:
This strategy limits encourages detection of correct MNN pairs as similar batches should have more shared populations, By comparison, performing the more difficult merges first is more likely to introduce errors whereby distinct subpopulations are incorrectly placed together. This unnecessarily propagates the error to later steps as the initial merge is used as a reference for subsequent merges.
Hierarchical merging can be achieved through multiple calls to fastMNN()
with progressively merged batches.
To illustrate, assume that we want to remove Donor
effects within each batch prior to merging across batches.
We split up the cells in GSE85241 according to the donor of origin, creating one SingleCellExperiment
object for each donor.
all.donors <- unique(rescaled.gse85241$Donor)
table(rescaled.gse85241$Donor)
##
## D28 D29 D30 D31
## 340 604 689 713
by.donor.85241 <- vector("list", length(all.donors))
names(by.donor.85241) <- sort(all.donors)
for (x in all.donors) {
by.donor.85241[[x]] <- rescaled.gse85241[,rescaled.gse85241$Donor==x]
}
We repeat this process for GSE81076. For demonstration purposes, we will aggregate some of the donors together to ensure that there are enough cells in each level for MNN detection.
adj.donors <- c(D101="A", D102="A", D10631="A",
D17="B", D1713="B", D172444="B",
D2="C", D3="C",
D71="D", D72="D", D73="D", D74="D")[rescaled.gse81076$Donor]
table(adj.donors)
## adj.donors
## A B C D
## 162 464 320 346
all.donors <- unique(adj.donors)
by.donor.81076 <- vector("list", length(all.donors))
names(by.donor.81076) <- sort(all.donors)
for (x in all.donors) {
by.donor.81076[[x]] <- rescaled.gse81076[,adj.donors==x]
}
We use the multiBatchPCA()
function to perform a PCA across all batches to be merged.
This ensures that all cells are placed onto the same coordinate space, which would obviously not be possible if a PCA was performed for each batch separately.
Specifically, multiBatchPCA()
performs a modified PCA to ensure that each supplied matrix contributes equally to the definition of the PC space.
This avoids problems with imbalances in the number of cells across batches, meanining that smaller batches (possibly with unique cell types) are not ignored.
all.batches <- c(by.donor.85241, by.donor.81076)
# Cosine normalizing for consistency with fastMNN() defaults.
all.logcounts <- lapply(all.batches, logcounts)
scaled <- lapply(all.logcounts, batchelor::cosineNorm)
set.seed(1000) # for irlba.
pc.all <- do.call(batchelor::multiBatchPCA, c(scaled,
list(d=50, BSPARAM=IrlbaParam(deferred=TRUE))
))
names(pc.all)
## [1] "D28" "D29" "D30" "D31" "A" "B" "C" "D"
Comments from Aaron:
multiBatchPCA()
to the batch-level inputs for convenience.
It is also possible to supply donor-level matrices to equalize contributions across donors.multiBatchNorm()
earlier to generate the rescaled.gse*
objects.
As a result, cosine normalization is not technically necessary, as all batches should be on the same scale already (see ?cosineNorm
for a discussion of this).
Nonetheless, we run it here for consistency with our previous fastMNN()
call where cosine normalization is turned on by default.We pass the matrices corresponding to the donors in GSE85241 to fastMNN()
, setting pc.input=TRUE
to indicate that dimensionality reduction has already been performed.
This uses the first donor to define the reference space2 Which can be changed with auto.order=
, if so desired. and merges cells from all other donors to the first.
In this manner, we remove donor effects within the GSE85241 batch.
pcs.85241 <- pc.all[seq_along(by.donor.85241)]
mnn.out.85241 <- do.call(batchelor::fastMNN, c(pcs.85241, list(pc.input=TRUE)))
Rle(mnn.out.85241$batch)
## character-Rle of length 2346 with 4 runs
## Lengths: 340 604 689 713
## Values : "D28" "D29" "D30" "D31"
Note that when pc.input=TRUE
, a DataFrame
is returned instead of a SingleCellExperiment
.
This reflects the fact that per-gene identities are lost when PCs are supplied instead of per-gene expression vectors3 Though it is entirely possible to reconstruct these using rotation vectors, see below..
We repeat this for the donors in the GSE81076 batch.
pcs.81076 <- tail(pc.all, length(by.donor.81076))
mnn.out.81076 <- do.call(batchelor::fastMNN, c(pcs.81076, list(pc.input=TRUE)))
Rle(mnn.out.81076$batch)
## character-Rle of length 1292 with 4 runs
## Lengths: 162 464 320 346
## Values : "A" "B" "C" "D"
The next step is to merge the two batches together.
To do this, we simply repeat the fastMNN()
call with the donor-corrected values for each batch.
Again, we need to set pc.input=TRUE
to prevent the function from unnecessarily (and incorrectly) repeating the cosine normalization and PCA steps on the corrected values.
We use a larger k
in the final fastMNN()
call to improve the robustness of MNN detection to outlier cells on the edges of each cluster.
mnn.out3 <- batchelor::fastMNN(
GSE81076=mnn.out.81076,
GSE85241=mnn.out.85241,
pc.input=TRUE, k=100 # see comments below.
)
Rle(mnn.out3$batch) # by dataset
## character-Rle of length 3638 with 2 runs
## Lengths: 1292 2346
## Values : "GSE81076" "GSE85241"
Rle(c(mnn.out.81076$batch, mnn.out.85241$batch)) # by donor
## character-Rle of length 3638 with 8 runs
## Lengths: 162 464 320 346 340 604 689 713
## Values : "A" "B" "C" "D" "D28" "D29" "D30" "D31"
This yields a final corrected expression matrix where both within-batch donor effects and batch effects have been corrected. We examine the quality of each of the merge steps with t-SNE plots (Figure 5). Within each batch, the donors are generally well-mixed, and the final merge is consistent with Figure 3.
set.seed(1000)
par(mfrow=c(1,3))
library(Rtsne)
tout.85241 <- Rtsne(mnn.out.85241$corrected, pca=FALSE)
plot(tout.85241$Y[,1], tout.85241$Y[,2], main="GSE85241 donors",
col=as.factor(mnn.out.85241$batch), xlab="tSNE1", ylab="tSNE2")
tout.81076 <- Rtsne(mnn.out.81076$corrected, pca=FALSE)
plot(tout.81076$Y[,1], tout.81076$Y[,2], main="GSE81076 donors",
col=as.factor(mnn.out.81076$batch), xlab="tSNE1", ylab="tSNE2")
tout.all <- Rtsne(mnn.out3$corrected, pca=FALSE)
plot(tout.all$Y[,1], tout.all$Y[,2], main="Final",
col=as.factor(mnn.out3$batch), xlab="tSNE1", ylab="tSNE2")
Comments from Aaron:
multiBatchPCA()
to the batch-level inputs for convenience.
It is also possible to supply donor-level matrices to equalize contributions across donors.k
was not needed in Figure 3 as well.
This was probably because the outliers were masked by inter-donor heterogeneity when the t-SNE plots were generated without removing the donor effects.In this specific example, cells from the same donor will occupy contiguous rows in the mnn.out3$corrected
matrix.
However, this may not have been the case for the original ordering of cells in each SingleCellExperiment
.
This requires some extra account-keeping to match up the final corrected matrix to the original ordering, e.g., when cross-referencing to metadata.
original.plate <- unlist(lapply(rescaled, "[[", i="Plate"))
original.names <- unlist(lapply(rescaled, colnames))
# Needs unique names: trigger error otherwise.
stopifnot(anyDuplicated(original.names)==0L)
m <- match(rownames(mnn.out3$corrected), original.names)
new.plate <- original.plate[m]
In situations where the nature of each batch is unknown, users can set auto.order=TRUE
to allow fastMNN()
to empirically choose which batches to merge at each step.
The first merge is performed between the pair of batches with the most MNN pairs.
Progressive merges are performed with the remaining batch that has the most MNN pairs with the current reference batch.
The aim is to maximize the number of MNN pairs at each step to provide a stable correction.
We demonstrate below using the within-donor subbatches in the GSE81076 data set.
mnn.out.auto <- do.call(batchelor::fastMNN, c(pcs.81076,
list(pc.input=TRUE, auto.order=TRUE)))
names(by.donor.81076) # supplied order
## [1] "A" "B" "C" "D"
metadata(mnn.out.auto)$merge.order # automatically defined order
## [1] "D" "B" "C" "A"
As with manual specification, the merge order does not affect the output order of cells. This makes it easy to try different merge orderings without having to deal with a re-ordering of output cells.
Rle(mnn.out.auto$batch)
## character-Rle of length 1292 with 4 runs
## Lengths: 162 464 320 346
## Values : "A" "B" "C" "D"
The obvious cost of this approach is that of computation time. Nearest-neighbour searches need to be performed between all pairs of batches, and then between each remaining batch and the reference at each merge step. As such, we prefer manual definition of a merge order that makes better use of prior knowledge about the experiment design.
The low-dimensional corrected values can be used in any procedure that involves computing and comparing cell-cell (Euclidean) distances. Recall that the aim of batch correction is to bring together related cells from different batches while preserving biological differences between cells within each batch. The exact values of the corrected coordinates may not be interpretable, but this is not a problem if we are interested in the relative magnitudes of the distances. For example, the code below directly uses the MNN-corrected values for clustering.
snn.gr <- buildSNNGraph(sce, use.dimred="corrected")
clusters <- igraph::cluster_walktrap(snn.gr)
table(clusters$membership, sce$batch)
##
## GSE81076 GSE85241
## 1 339 253
## 2 55 0
## 3 64 198
## 4 129 4
## 5 165 434
## 6 182 232
## 7 25 108
## 8 122 760
## 9 22 127
## 10 101 94
## 11 28 49
## 12 45 44
## 13 0 18
## 14 8 4
## 15 7 21
Figure 6 shows strong correspondence between the cluster labels and separation in t-SNE space.
sce$Cluster <- factor(clusters$membership)
plotTSNE(sce, colour_by="Cluster")
The corrected values can be used in any procedure that operates on cell-cell distances. This includes nearest-neighbor searches, non-linear visualization like t-SNE and trajectory inference.
For gene-based procedures like differential expression (DE) analyses or gene network construction, it is desirable to use the original log-expression values or counts. The corrected values are only used to obtain cell-level results such as clusters or trajectories. Batch effects are handled explicitly using blocking terms or via a meta-analysis across batches. We do not use the corrected values directly in gene-based analyses, for various reasons:
Indeed, in the specific case of fastMNN()
, the batch-corrected values no longer correspond to per-gene expression values anyway.
This means that they cannot be directly used in gene-based analyses4 Though one can circumvent this, see below..
Users should generally aim to avoid using batch-corrected values for per-gene analyses when within-batch alternatives are available.
To illustrate, we perform a DE analysis on the uncorrected expression data using the clusters identified previously.
To model the batch effect, we set the batch of origin as the block=
argument in findMarkers()
.
This will perform all comparisons between clusters within each batch, and then combine the \(p\)-values to consolidate results across batches.
m.out <- findMarkers(sce, clusters$membership, block=sce$batch,
direction="up", assay.type="original")
demo <- m.out[["10"]] # probably alpha cells.
demo <- demo[demo$Top <= 5,]
as.data.frame(demo[,1:3]) # only first three columns for brevity.
## Top p.value FDR
## TTR 1 0.000000e+00 0.000000e+00
## GCG 1 4.759946e-242 3.490707e-238
## SCG5 1 4.415136e-195 2.158560e-191
## TM4SF4 1 9.371393e-137 1.963575e-133
## PAX6 1 5.841497e-79 3.569885e-76
## SLC22A17 1 1.394702e-57 5.383181e-55
## PCSK2 2 2.931642e-173 1.074960e-169
## MAFB 2 2.014811e-103 1.970082e-100
## GC 2 1.965312e-84 1.372630e-81
## PAM 2 4.967682e-82 3.311863e-79
## FXYD6 2 5.212388e-51 1.529002e-48
## CPE 3 2.053947e-163 6.025049e-160
## SCG2 3 5.301541e-124 8.639744e-121
## NEUROD1 3 1.478393e-74 8.339842e-72
## IRX2 3 1.510946e-56 5.540262e-54
## MAB21L3 3 6.578484e-37 1.121937e-34
## ALDH1A1 4 1.119374e-140 2.736311e-137
## CHGA 4 1.257629e-132 2.305706e-129
## CLU 4 5.371635e-118 7.162343e-115
## CRYBA2 4 3.629264e-74 1.971497e-71
## SCG3 4 9.014001e-65 4.131511e-62
## FAP 4 1.911375e-54 6.371394e-52
## SYT7 4 2.559545e-49 7.219394e-47
## KCNQ1OT1 4 2.671328e-35 4.124249e-33
## SLC30A8 5 2.645414e-88 1.940014e-85
## PTPRN2 5 1.885137e-70 9.534241e-68
## PPP1R1A 5 1.020339e-63 4.534944e-61
## UGDH-AS1 5 4.313047e-23 2.888560e-21
Other approaches for handling batch effects during marker gene detection are discussed elsewhere.
It is similarly possible to perform these analyses with standard Bioconductor packages for DE analysis such as edgeR or limma.
Note that the use of block=
is roughly similar to the use of a batch-cluster interaction model and testing whether the average log-fold change across batches is equal to zero.
Comments from Aaron:
mnnCorrect()
function will note that the function returned corrected expression values.
It is tempting to use these corrected values directly for DE analyses, but this would likely be inappropriate.
In addition to the reasons discussed above, the default parameters of mnnCorrect()
do not return corrected values on the log-scale, but rather a cosine-normalized log-scale.
This makes it difficult to interpret the effect size of DE analyses based on the corrected values.When applied on gene expression values, fastMNN()
will also return a matrix of corrected per-gene expression values in the reconstructed
assay.
This is obtained by taking the cross-product of the corrected low-dimensional values with the rotation vectors from the initial PCA,
which effectively reverses the initial projection into a low-dimensional space during multiBatchPCA()
.
The example below extracts the corrected expression values for insulin.
assay(sce, "reconstructed")
## <14667 x 3638> LowRankMatrix object of type "double":
## D2ex_1 D2ex_2 D2ex_3 ... D30.8_93
## GCG -0.05524770 -0.05151017 -0.05767458 . -0.0149947200
## CHGB -0.02991609 -0.02866532 -0.02961641 . -0.0262184713
## TM4SF4 -0.02477495 -0.03042626 -0.02821652 . -0.0069536055
## INS -0.02249294 -0.02011597 -0.01585674 . 0.0007176131
## TTR -0.05441265 -0.05924151 -0.05007137 . -0.0306570407
## ... . . . . .
## INTS5 2.371094e-04 -2.293720e-04 2.468718e-04 . 0.0010819816
## PLGRKT 1.901531e-04 2.368056e-04 2.222304e-04 . 0.0020596355
## RUVBL2 -1.851085e-04 6.657381e-05 7.270702e-04 . 0.0006563725
## NUP85 -5.918494e-05 -8.362143e-06 3.003140e-04 . 0.0019161638
## POP4 -2.176989e-03 -2.583791e-03 -2.046742e-03 . 0.0018909824
## D30.8_94
## GCG -0.0315699659
## CHGB -0.0230390676
## TM4SF4 -0.0128199079
## INS -0.0054854880
## TTR -0.0272707530
## ... .
## INTS5 0.0011030223
## PLGRKT 0.0006906096
## RUVBL2 0.0019940485
## NUP85 0.0016072321
## POP4 0.0014569066
summary(assay(sce)["INS",])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.038066 -0.005596 0.003042 0.014278 0.013968 0.100745
If fastMNN()
was run on low-dimensional inputs, only the low-dimensional output will be reported.
Nonetheless, users can obtain per-gene corrected values by manually computing the cross-product using the PCA rotation vectors.
For example, the code below obtains corrected expression values for GCG from our hierarchical merge.
rotations <- metadata(pc.all)$rotation
cor.exp <- tcrossprod(mnn.out3$corrected,
rotations["ENSG00000115263",,drop=FALSE])
summary(cor.exp)
## ENSG00000115263
## Min. :-0.075929
## 1st Qu.:-0.038805
## Median :-0.028759
## Mean :-0.018606
## 3rd Qu.: 0.009197
## Max. : 0.074218
Explicit calculation of all per-gene corrected values is probably ill-advised as this would involve the construction of a dense matrix.
This may be prohibitively memory-consuming for large data sets that are otherwise representable as sparse matrices.
Rather, corrected values can be computed for specific genes as they are needed, e.g., using the LowRankMatrix
class.
Per-gene corrected values can be readily used for visualization, e.g., in Figure 4. This can be more aesthetically pleasing than uncorrected expression values that may contain large shifts on the colour scale between cells in different batches. However, use of the corrected values in any quantitative procedure should be treated with extreme caution. If they must be used, they should be backed up by similar results from an analysis on the uncorrected values.
We save the SingleCellExperiment
object for use elsewhere.
This avoids the need to repeat all of the processing steps described above.
saveRDS(file="pancreas_data.rds", sce)
All software packages used in this workflow are publicly available from the Comprehensive R Archive Network (https://cran.r-project.org) or the Bioconductor project (http://bioconductor.org). The specific version numbers of the packages used are shown below, along with the version of the R installation.
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] Rtsne_0.15 BiocSingular_1.0.0
## [3] scran_1.12.0 scater_1.12.0
## [5] ggplot2_3.1.1 SingleCellExperiment_1.6.0
## [7] SummarizedExperiment_1.14.0 DelayedArray_0.10.0
## [9] BiocParallel_1.18.0 matrixStats_0.54.0
## [11] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0
## [13] org.Hs.eg.db_3.8.2 AnnotationDbi_1.46.0
## [15] IRanges_2.18.0 S4Vectors_0.22.0
## [17] Biobase_2.44.0 BiocGenerics_0.30.0
## [19] BiocFileCache_1.8.0 dbplyr_1.4.0
## [21] knitr_1.22 BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 bit64_0.9-7
## [3] httr_1.4.0 dynamicTreeCut_1.63-1
## [5] tools_3.6.0 R6_2.4.0
## [7] irlba_2.3.3 vipor_0.4.5
## [9] DBI_1.0.0 lazyeval_0.2.2
## [11] colorspace_1.4-1 withr_2.1.2
## [13] processx_3.3.0 tidyselect_0.2.5
## [15] gridExtra_2.3 bit_1.1-14
## [17] curl_3.3 compiler_3.6.0
## [19] BiocNeighbors_1.2.0 labeling_0.3
## [21] bookdown_0.9 scales_1.0.0
## [23] callr_3.2.0 rappdirs_0.3.1
## [25] stringr_1.4.0 digest_0.6.18
## [27] rmarkdown_1.12 XVector_0.24.0
## [29] pkgconfig_2.0.2 htmltools_0.3.6
## [31] limma_3.40.0 highr_0.8
## [33] rlang_0.3.4 RSQLite_2.1.1
## [35] DelayedMatrixStats_1.6.0 dplyr_0.8.0.1
## [37] RCurl_1.95-4.12 magrittr_1.5
## [39] simpleSingleCell_1.8.0 GenomeInfoDbData_1.2.1
## [41] Matrix_1.2-17 Rcpp_1.0.1
## [43] ggbeeswarm_0.6.0 munsell_0.5.0
## [45] viridis_0.5.1 stringi_1.4.3
## [47] yaml_2.2.0 edgeR_3.26.0
## [49] zlibbioc_1.30.0 plyr_1.8.4
## [51] grid_3.6.0 blob_1.1.1
## [53] dqrng_0.2.0 crayon_1.3.4
## [55] lattice_0.20-38 cowplot_0.9.4
## [57] locfit_1.5-9.1 ps_1.3.0
## [59] pillar_1.3.1 igraph_1.2.4.1
## [61] codetools_0.2-16 glue_1.3.1
## [63] evaluate_0.13 BiocManager_1.30.4
## [65] batchelor_1.0.0 gtable_0.3.0
## [67] purrr_0.3.2 assertthat_0.2.1
## [69] xfun_0.6 rsvd_1.0.0
## [71] viridisLite_0.3.0 tibble_2.1.1
## [73] beeswarm_0.2.3 memoise_1.1.0
## [75] statmod_1.4.30
Grun, D., M. J. Muraro, J. C. Boisset, K. Wiebrands, A. Lyubimova, G. Dharmadhikari, M. van den Born, et al. 2016. “De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data.” Cell Stem Cell 19 (2):266–77.
Haghverdi, L., A. T. L. Lun, M. D. Morgan, and J. C. Marioni. 2018. “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.” Nat. Biotechnol. 36 (5):421–27.
Lawlor, N., J. George, M. Bolisetty, R. Kursawe, L. Sun, V. Sivakamasundari, I. Kycia, P. Robson, and M. L. Stitzel. 2017. “Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes.” Genome Res. 27 (2):208–22.
Lun, A. T. L., F. J. Calero-Nieto, L. Haim-Vilmovsky, B. Gottgens, and J. C. Marioni. 2017. “Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data.” Genome Res. 27 (11):1795–1806.
Lun, A. T., K. Bach, and J. C. Marioni. 2016. “Pooling across cells to normalize single-cell RNA sequencing data with many zero counts.” Genome Biol. 17 (April):75.
McCarthy, D. J., K. R. Campbell, A. T. Lun, and Q. F. Wills. 2017. “Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R.” Bioinformatics 33 (8):1179–86.
Muraro, M. J., G. Dharmadhikari, D. Grun, N. Groen, T. Dielen, E. Jansen, L. van Gurp, et al. 2016. “A Single-Cell Transcriptome Atlas of the Human Pancreas.” Cell Syst 3 (4):385–94.
Ritchie, M. E., B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, and G. K. Smyth. 2015. “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Res. 43 (7):e47.
Segerstolpe, A., A. Palasantza, P. Eliasson, E. M. Andersson, A. C. Andreasson, X. Sun, S. Picelli, et al. 2016. “Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes.” Cell Metab. 24 (4):593–607.
2.3 Comments on additional batches
In Haghverdi et al. (2018), we originally performed batch correction across four separate pancreas scRNA-seq datasets. For simplicity, we will only consider the two batches generated using CEL-seq(2) and ignore those generated using Smart-seq2 (Segerstolpe et al. 2016; Lawlor et al. 2017). As one might expect, batch correction is easiest when dealing with data generated from the same technology, as fewer systematic differences are present that can interfere with the biological structure. Nonetheless, it is often possible to obtain good results when applying MNN correction to batches of data generated with different technologies.
It is also worth pointing out that both of the CEL-seq(2) batches above contain cells from multiple donors. Each donor could be treated as a separate batch in their own right, reflecting (presumably uninteresting) biological differences between donors due to genotype, age, sex or other factors that are not easily controlled when dealing with humans. For simplicity, we will ignore the donor effects within each study and only consider the removal of the batch effect between the two studies. However, we note that it is possible to apply the MNN correction between donors in each batch and then between the batches - see
?fastMNN
for details.