Large single-cell RNA sequencing (scRNA-seq) projects usually need to generate data across multiple batches due to logistical constraints. However, the processing of different batches is often subject to uncontrollable differences, e.g., changes in operator, differences in reagent quality. This results in systematic differences in the observed expression in cells from different batches, which we refer to as “batch effects”. Batch effects are problematic as they can be major drivers of heterogeneity in the data, masking the relevant biological differences and complicating interpretation of the results.
Computational correction of these effects is critical for eliminating batch-to-batch variation, allowing data across multiple batches to be combined for valid downstream analysis.
However, existing methods such as removeBatchEffect()
(Ritchie et al. 2015) assume that the composition of cell populations are either known or the same across batches.
This workflow describes the application of an alternative strategy for batch correction based on the detection of mutual nearest neighbours (MNNs) (Haghverdi et al. 2018).
The mnnCorreect()
approach does not rely on pre-defined or equal population compositions across batches, only requiring that a subset of the population be shared between batches.
We demonstrate its use on three human pancreas scRNA-seq datasets from different groups and using different protocols.
This dataset was generated by Grun et al. (2016) using the CEL-seq protocol with unique molecular identifiers (UMIs) and ERCC spike-ins. Count tables were obtained from the NCBI Gene Expression Omnibus using the accession number above. We first read the table into memory.
gse81076.df <- read.table("GSE81076_D2_3_7_10_17.txt.gz", sep='\t',
header=TRUE, stringsAsFactors=FALSE, row.names=1)
dim(gse81076.df)
## [1] 20148 1728
Unfortunately, the data and metadata are all mixed together in this file. As a result, we need to manually extract the metadata from the column names.
donor.names <- sub("^(D[0-9]+).*", "\\1", colnames(gse81076.df))
table(donor.names)
## donor.names
## D101 D102 D10631 D17 D1713 D172444 D2 D3 D71
## 96 96 96 288 96 96 96 480 96
## D72 D73 D74
## 96 96 96
plate.id <- sub("^D[0-9]+(.*)_.*", "\\1", colnames(gse81076.df))
table(plate.id)
## plate.id
## All1 All2 TGFB en1 en2 en3 en4 ex
## 864 96 96 96 96 96 96 96 192
Another irritating feature of this dataset is that gene symbols were supplied, rather than stable identifiers such as Ensembl.
We convert all row names to Ensembl identifiers, removing NA
or duplicated entries (with the exception of spike-in transcripts).
gene.symb <- gsub("__chr.*$", "", rownames(gse81076.df))
is.spike <- grepl("^ERCC-", gene.symb)
table(is.spike)
## is.spike
## FALSE TRUE
## 20064 84
library(org.Hs.eg.db)
gene.ids <- mapIds(org.Hs.eg.db, keys=gene.symb, keytype="SYMBOL", column="ENSEMBL")
gene.ids[is.spike] <- gene.symb[is.spike]
keep <- !is.na(gene.ids) & !duplicated(gene.ids)
gse81076.df <- gse81076.df[keep,]
rownames(gse81076.df) <- gene.ids[keep]
summary(keep)
## Mode FALSE TRUE
## logical 2252 17896
We create a SingleCellExperiment
object to store the counts and metadata together.
This reduces the risk of book-keeping errors in later steps of the analysis.
Note that we re-identify the spike-in rows, as the previous indices would have changed after the subsetting.
library(SingleCellExperiment)
sce.gse81076 <- SingleCellExperiment(list(counts=as.matrix(gse81076.df)),
colData=DataFrame(Donor=donor.names, Plate=plate.id),
rowData=DataFrame(Symbol=gene.symb[keep]))
isSpike(sce.gse81076, "ERCC") <- grepl("^ERCC-", rownames(gse81076.df))
sce.gse81076
## class: SingleCellExperiment
## dim: 17896 1728
## metadata(0):
## assays(1): counts
## rownames(17896): ENSG00000268895 ENSG00000121410 ... ENSG00000074755
## ENSG00000036549
## rowData names(1): Symbol
## colnames(1728): D2ex_1 D2ex_2 ... D17TGFB_95 D17TGFB_96
## colData names(2): Donor Plate
## reducedDimNames(0):
## spikeNames(1): ERCC
We compute quality control (QC) metrics for each cell (McCarthy et al. 2017) and identify cells with low library sizes, low numbers of expressed genes, or high ERCC content.
library(scater)
sce.gse81076 <- calculateQCMetrics(sce.gse81076, compact=TRUE)
QC <- sce.gse81076$scater_qc
low.lib <- isOutlier(QC$all$log10_total_counts, type="lower", nmad=3)
low.genes <- isOutlier(QC$all$log10_total_features_by_counts, type="lower", nmad=3)
high.spike <- isOutlier(QC$feature_control_ERCC$pct_counts, type="higher", nmad=3)
data.frame(LowLib=sum(low.lib), LowNgenes=sum(low.genes),
HighSpike=sum(high.spike, na.rm=TRUE))
## LowLib LowNgenes HighSpike
## 1 54 130 389
Cells with extreme values for these QC metrics are presumed to be of low quality and are removed. A more thorough analysis would examine the distributions of these QC metrics beforehand, but we will skip that step for brevity here.
discard <- low.lib | low.genes | high.spike
sce.gse81076 <- sce.gse81076[,!discard]
summary(discard)
## Mode FALSE TRUE
## logical 1291 437
We compute size factors for the endogenous genes using the deconvolution method (Lun, Bach, and Marioni 2016).
This is done with pre-clustering through quickCluster()
to avoid pooling together very different cells.
library(scran)
clusters <- quickCluster(sce.gse81076, min.mean=0.1)
table(clusters)
## clusters
## 1 2 3 4
## 414 357 267 253
sce.gse81076 <- computeSumFactors(sce.gse81076, min.mean=0.1, clusters=clusters)
summary(sizeFactors(sce.gse81076))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.003044 0.447400 0.791070 1.000000 1.316058 8.761424
We also compute size factors for the spike-in transcripts (Lun et al. 2017).
Recall that we set general.use=FALSE
to ensure that the spike-in size factors are only applied to the spike-in transcripts.
sce.gse81076 <- computeSpikeFactors(sce.gse81076, general.use=FALSE)
summary(sizeFactors(sce.gse81076, "ERCC"))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01042 0.57800 0.88697 1.00000 1.27728 7.44245
We then compute normalized log-expression values for use in downstream analyses.
sce.gse81076 <- normalize(sce.gse81076)
We identify highly variable genes (HVGs) using trendVar()
and decomposeVar()
, using the variances of spike-in transcripts to model technical noise.
We set block=
to ensure that uninteresting differences between plates or donors do not inflate the variance.
The small discrepancy in the fitted trend in Figure 1 is caused by the fact that the trend is fitted robustly to the block-wise variances of the spike-ins,
while the variances shown are averaged across blocks and not robust to outliers.
block <- paste0(sce.gse81076$Plate, "_", sce.gse81076$Donor)
fit <- trendVar(sce.gse81076, block=block, parametric=TRUE)
dec <- decomposeVar(sce.gse81076, fit)
plot(dec$mean, dec$total, xlab="Mean log-expression",
ylab="Variance of log-expression", pch=16)
OBis.spike <- isSpike(sce.gse81076)
points(dec$mean[is.spike], dec$total[is.spike], col="red", pch=16)
curve(fit$trend(x), col="dodgerblue", add=TRUE)
We order genes by decreasing biological component, revealing some usual suspects such as insulin and glucagon.
We will be using this information later when performing feature selection prior to running mnnCorrect()
.
dec.gse81076 <- dec
dec.gse81076$Symbol <- rowData(sce.gse81076)$Symbol
dec.gse81076 <- dec.gse81076[order(dec.gse81076$bio, decreasing=TRUE),]
head(dec.gse81076)
## DataFrame with 6 rows and 7 columns
## mean total bio
## <numeric> <numeric> <numeric>
## ENSG00000254647 2.82477500862721 6.24939686243935 5.80546348935238
## ENSG00000115263 4.00392228927663 5.56617240435477 5.33299468869058
## ENSG00000118271 3.65232925066363 5.56170090164767 5.25875426012777
## ENSG00000115386 4.24621561998282 5.3682782819134 5.13334914023796
## ENSG00000164266 3.03739368175512 5.39637250473505 4.98985386325022
## ENSG00000172023 2.4600444995601 5.25637953645412 4.86267149420317
## tech p.value FDR Symbol
## <numeric> <numeric> <numeric> <character>
## ENSG00000254647 0.443933373086972 0 0 INS
## ENSG00000115263 0.233177715664187 0 0 GCG
## ENSG00000118271 0.302946641519902 0 0 TTR
## ENSG00000115386 0.234929141675433 0 0 REG1A
## ENSG00000164266 0.406518641484827 0 0 SPINK1
## ENSG00000172023 0.393708042250948 0 0 REG1B
This dataset was generated by Muraro et al. (2016) using the CEL-seq2 protocol with unique molecular identifiers (UMIs) and ERCC spike-ins. Count tables were obtained from the NCBI Gene Expression Omnibus using the accession number above. We first read the table into memory.
gse85241.df <- read.table("GSE85241_cellsystems_dataset_4donors_updated.csv",
sep='\t', h=TRUE, row.names=1, stringsAsFactors=FALSE)
dim(gse85241.df)
## [1] 19140 3072
We extract the metadata from the column names.
donor.names <- sub("^(D[0-9]+).*", "\\1", colnames(gse85241.df))
table(donor.names)
## donor.names
## D28 D29 D30 D31
## 768 768 768 768
plate.id <- sub("^D[0-9]+\\.([0-9]+)_.*", "\\1", colnames(gse85241.df))
table(plate.id)
## plate.id
## 1 2 3 4 5 6 7 8
## 384 384 384 384 384 384 384 384
Yet again, gene symbols were supplied instead of Ensembl or Entrez identifiers.
We convert all row names to Ensembl identifiers, removing NA
or duplicated entries (with the exception of spike-in transcripts).
gene.symb <- gsub("__chr.*$", "", rownames(gse85241.df))
is.spike <- grepl("^ERCC-", gene.symb)
table(is.spike)
## is.spike
## FALSE TRUE
## 19059 81
library(org.Hs.eg.db)
gene.ids <- mapIds(org.Hs.eg.db, keys=gene.symb, keytype="SYMBOL", column="ENSEMBL")
gene.ids[is.spike] <- gene.symb[is.spike]
keep <- !is.na(gene.ids) & !duplicated(gene.ids)
gse85241.df <- gse85241.df[keep,]
rownames(gse85241.df) <- gene.ids[keep]
summary(keep)
## Mode FALSE TRUE
## logical 2097 17043
We create a SingleCellExperiment
object to store the counts and metadata together.
sce.gse85241 <- SingleCellExperiment(list(counts=as.matrix(gse85241.df)),
colData=DataFrame(Donor=donor.names, Plate=plate.id),
rowData=DataFrame(Symbol=gene.symb[keep]))
isSpike(sce.gse85241, "ERCC") <- grepl("^ERCC-", rownames(gse85241.df))
sce.gse85241
## class: SingleCellExperiment
## dim: 17043 3072
## metadata(0):
## assays(1): counts
## rownames(17043): ENSG00000268895 ENSG00000121410 ... ENSG00000074755
## ENSG00000036549
## rowData names(1): Symbol
## colnames(3072): D28.1_1 D28.1_2 ... D30.8_95 D30.8_96
## colData names(2): Donor Plate
## reducedDimNames(0):
## spikeNames(1): ERCC
We compute QC metrics for each cell and identify cells with low library sizes, low numbers of expressed genes, or high ERCC content.
sce.gse85241 <- calculateQCMetrics(sce.gse85241, compact=TRUE)
QC <- sce.gse85241$scater_qc
low.lib <- isOutlier(QC$all$log10_total_counts, type="lower", nmad=3)
low.genes <- isOutlier(QC$all$log10_total_features_by_counts, type="lower", nmad=3)
high.spike <- isOutlier(QC$feature_control_ERCC$pct_counts, type="higher", nmad=3)
data.frame(LowLib=sum(low.lib), LowNgenes=sum(low.genes),
HighSpike=sum(high.spike, na.rm=TRUE))
## LowLib LowNgenes HighSpike
## 1 577 669 696
Low-quality cells are defined as those with extreme values for these QC metrics and are removed.
discard <- low.lib | low.genes | high.spike
sce.gse85241 <- sce.gse85241[,!discard]
summary(discard)
## Mode FALSE TRUE
## logical 2346 726
We compute size factors for the endogenous genes and spike-in transcripts, and use them to compute log-normalized expression values.
clusters <- quickCluster(sce.gse85241, min.mean=0.1, method="igraph")
table(clusters)
## clusters
## 1 2 3 4
## 371 828 536 611
sce.gse85241 <- computeSumFactors(sce.gse85241, min.mean=0.1, clusters=clusters)
summary(sizeFactors(sce.gse85241))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.07892 0.54127 0.82761 1.00000 1.22184 13.98424
sce.gse85241 <- computeSpikeFactors(sce.gse85241, general.use=FALSE)
summary(sizeFactors(sce.gse85241, "ERCC"))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.09295 0.61309 0.88902 1.00000 1.27519 4.04643
sce.gse85241 <- normalize(sce.gse85241)
We fit a trend to the spike-in variances as previously described, allowing us to model the technical noise for each gene (Figure 2).
Again, we set block=
to ensure that uninteresting differences between plates or donors do not inflate the variance.
block <- paste0(sce.gse85241$Plate, "_", sce.gse85241$Donor)
fit <- trendVar(sce.gse85241, block=block, parametric=TRUE)
dec <- decomposeVar(sce.gse85241, fit)
plot(dec$mean, dec$total, xlab="Mean log-expression",
ylab="Variance of log-expression", pch=16)
is.spike <- isSpike(sce.gse85241)
points(dec$mean[is.spike], dec$total[is.spike], col="red", pch=16)
curve(fit$trend(x), col="dodgerblue", add=TRUE)
We order genes by decreasing biological component, as described above.
dec.gse85241 <- dec
dec.gse85241$Symbol <- rowData(sce.gse85241)$Symbol
dec.gse85241 <- dec.gse85241[order(dec.gse85241$bio, decreasing=TRUE),]
head(dec.gse85241)
## DataFrame with 6 rows and 7 columns
## mean total bio
## <numeric> <numeric> <numeric>
## ENSG00000115263 7.66267378242402 6.59681396806777 6.56800161897542
## ENSG00000089199 4.62064563864715 6.42527492432335 6.30366585829772
## ENSG00000169903 3.01379686443808 6.5465999616288 6.19244340917136
## ENSG00000254647 2.00214393757549 6.44633299225628 5.89008000997448
## ENSG00000118271 7.3354941164374 5.75428324335117 5.72283471882918
## ENSG00000171951 4.1895920246965 5.57157012090427 5.40710655269578
## tech p.value FDR Symbol
## <numeric> <numeric> <numeric> <character>
## ENSG00000115263 0.0288123490923479 0 0 GCG
## ENSG00000089199 0.12160906602564 0 0 CHGB
## ENSG00000169903 0.35415655245744 0 0 TM4SF4
## ENSG00000254647 0.556252982281801 0 0 INS
## ENSG00000118271 0.0314485245219835 0 0 TTR
## ENSG00000171951 0.164463568208491 0 0 SCG2
This dataset was generated by Segerstolpe et al. (2016) using the Smart-seq2 protocol with ERCC spike-ins. Count tables were obtained from ArrayExpress using the accession number above. We first read the table into memory, though this requires some effort as the file is even more unconventionally formatted than the two examples above.
unzip("E-MTAB-5061.processed.1.zip")
# Figuring out the number of libraries (-1 for the '#sample').
header <- read.table("pancreas_refseq_rpkms_counts_3514sc.txt",
nrow=1, sep="\t", comment.char="", stringsAsFactors=FALSE)
ncells <- ncol(header) - 1L
# Loading only the gene names and the counts.
col.types <- vector("list", ncells*2 + 2)
col.types[1:2] <- "character"
col.types[2+ncells + seq_len(ncells)] <- "integer"
e5601.df <- read.table("pancreas_refseq_rpkms_counts_3514sc.txt",
sep="\t", colClasses=col.types)
# Disentangling the gene names and the counts.
gene.data <- e5601.df[,1:2]
e5601.df <- e5601.df[,-(1:2)]
colnames(e5601.df) <- as.character(header[1,-1])
dim(e5601.df)
## [1] 26271 3514
The gene metadata does contains unique GenBank identifiers, but these are transcript-level and concatenated together for each gene. Instead of trying to pull them apart, we perform the symbol-to-Ensembl conversion that was done for the previous datasets.
is.spike <- grepl("^ERCC-", gene.data[,2])
table(is.spike)
## is.spike
## FALSE TRUE
## 26179 92
library(org.Hs.eg.db)
gene.ids <- mapIds(org.Hs.eg.db, keys=gene.data[,1], keytype="SYMBOL", column="ENSEMBL")
gene.ids[is.spike] <- gene.data[is.spike,2]
keep <- !is.na(gene.ids) & !duplicated(gene.ids)
e5601.df <- e5601.df[keep,]
rownames(e5601.df) <- gene.ids[keep]
summary(keep)
## Mode FALSE TRUE
## logical 3736 22535
At least the metadata is stored in a separate file, which makes it fairly straightforward to parse.
We match
the rows to the column names to ensure that the metadata and count table are in the same order.
metadata <- read.table("E-MTAB-5061.sdrf.txt", header=TRUE,
sep="\t", check.names=FALSE, stringsAsFactors=FALSE)
m <- match(colnames(e5601.df), metadata[["Assay Name"]])
stopifnot(all(!is.na(m)))
metadata <- metadata[m,]
donor.id <- metadata[["Characteristics[individual]"]]
table(donor.id)
## donor.id
## AZ HP1502401 HP1504101T2D HP1504901 HP1506401
## 96 352 383 383 383
## HP1507101 HP1508501T2D HP1509101 HP1525301T2D HP1526901T2D
## 383 383 383 384 384
We create a SingleCellExperiment
object to store the counts and metadata together.
sce.e5601 <- SingleCellExperiment(list(counts=as.matrix(e5601.df)),
colData=DataFrame(Donor=donor.id),
rowData=DataFrame(Symbol=gene.data[keep,1]))
isSpike(sce.e5601, "ERCC") <- grepl("^ERCC-", rownames(e5601.df))
sce.e5601
## class: SingleCellExperiment
## dim: 22535 3514
## metadata(0):
## assays(1): counts
## rownames(22535): ENSG00000118473 ENSG00000142920 ... ERCC-00061
## ERCC-00048
## rowData names(1): Symbol
## colnames(3514): HP1502401_N13 HP1502401_D14 ... HP1526901T2D_O11
## HP1526901T2D_A8
## colData names(1): Donor
## reducedDimNames(0):
## spikeNames(1): ERCC
We compute QC metrics for each cell and identify cells with low library sizes, low numbers of expressed genes, or high ERCC content. We also identify cells with near-zero spike-in counts, which are not compatible with spike-in normalization and modelling of technical noise.
sce.e5601 <- calculateQCMetrics(sce.e5601, compact=TRUE)
QC <- sce.e5601$scater_qc
low.lib <- isOutlier(QC$all$log10_total_counts, type="lower", nmad=3)
low.genes <- isOutlier(QC$all$log10_total_features_by_counts, type="lower", nmad=3)
high.spike <- isOutlier(QC$feature_control_ERCC$pct_counts, type="higher", nmad=3)
low.spike <- isOutlier(QC$feature_control_ERCC$log10_total_counts, type="lower", nmad=2)
data.frame(LowLib=sum(low.lib), LowNgenes=sum(low.genes),
HighSpike=sum(high.spike, na.rm=TRUE), LowSpike=sum(low.spike))
## LowLib LowNgenes HighSpike LowSpike
## 1 162 572 904 359
Low-quality cells are defined as those with extreme values for these QC metrics and are removed.
discard <- low.lib | low.genes | high.spike | low.spike
sce.e5601 <- sce.e5601[,!discard]
summary(discard)
## Mode FALSE TRUE
## logical 2285 1229
We compute size factors for the endogenous genes and spike-in transcripts, and use them to compute log-normalized expression values.
Recall that the Smart-seq2 protocol generates read count data, so we use a more stringent filter of min.mean=1
to remove low-abundance genes.
clusters <- quickCluster(sce.e5601, min.mean=1, method="igraph")
table(clusters)
## clusters
## 1 2 3 4 5 6
## 315 308 460 482 282 438
sce.e5601 <- computeSumFactors(sce.e5601, min.mean=1, clusters=clusters)
summary(sizeFactors(sce.e5601))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.004916 0.377375 0.699866 1.000000 1.315723 12.740545
sce.e5601 <- computeSpikeFactors(sce.e5601, general.use=FALSE)
summary(sizeFactors(sce.e5601, "ERCC"))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01433 0.19416 0.42867 1.00000 1.16641 11.23978
sce.e5601 <- normalize(sce.e5601)
We identify highly variable genes (HVGs) using trendVar()
and decomposeVar()
.
Here, we need to fit the mean-variance trend separately to each donor, as the donor-to-donor variation in the mean-variance trend is more pronounced than that in the UMI datasets.
This yields one fit for each donor in Figure 3.
donors <- sort(unique(sce.e5601$Donor))
is.spike <- isSpike(sce.e5601)
par(mfrow=c(ceiling(length(donors)/2), 2),
mar=c(4.1, 4.1, 2.1, 0.1))
collected <- list()
for (x in unique(sce.e5601$Donor)) {
current <- sce.e5601[,sce.e5601$Donor==x]
if (ncol(current)<2L) { next }
current <- normalize(current)
fit <- trendVar(current, parametric=TRUE)
dec <- decomposeVar(current, fit)
plot(dec$mean, dec$total, xlab="Mean log-expression",
ylab="Variance of log-expression", pch=16, main=x)
points(fit$mean, fit$var, col="red", pch=16)
curve(fit$trend(x), col="dodgerblue", add=TRUE)
collected[[x]] <- dec
}