Large single-cell RNA sequencing (scRNA-seq) projects usually need to generate data across multiple batches due to logistical constraints. However, the processing of different batches is often subject to uncontrollable differences, e.g., changes in operator, differences in reagent quality. This results in systematic differences in the observed expression in cells from different batches, which we refer to as “batch effects”. Batch effects are problematic as they can be major drivers of heterogeneity in the data, masking the relevant biological differences and complicating interpretation of the results.
Computational correction of these effects is critical for eliminating batch-to-batch variation, allowing data across multiple batches to be combined for valid downstream analysis.
However, existing methods such as removeBatchEffect()
(Ritchie et al. 2015) assume that the composition of cell populations are either known or the same across batches.
This workflow describes the application of an alternative strategy for batch correction based on the detection of mutual nearest neighbours (MNNs) (Haghverdi et al. 2018).
The MNN approach does not rely on pre-defined or equal population compositions across batches, only requiring that a subset of the population be shared between batches.
We demonstrate its use on two human pancreas scRNA-seq datasets generated in separate studies.
This dataset was generated by Grun et al. (2016) using the CEL-seq protocol with unique molecular identifiers (UMIs) and ERCC spike-ins. Count tables were obtained from the NCBI Gene Expression Omnibus using the accession number above.
library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask = FALSE)
grun.fname <- bfcrpath(bfc, file.path("ftp://ftp.ncbi.nlm.nih.gov/geo/series",
"GSE81nnn/GSE81076/suppl/GSE81076%5FD2%5F3%5F7%5F10%5F17%2Etxt%2Egz"))
We first read the table into memory.
gse81076.df <- read.table(grun.fname, sep='\t',
header=TRUE, stringsAsFactors=FALSE, row.names=1)
dim(gse81076.df)
## [1] 20148 1728
Unfortunately, the data and metadata are all mixed together in this file. As a result, we need to manually extract the metadata from the column names.
donor.names <- sub("^(D[0-9]+).*", "\\1", colnames(gse81076.df))
table(donor.names)
## donor.names
## D101 D102 D10631 D17 D1713 D172444 D2 D3 D71
## 96 96 96 288 96 96 96 480 96
## D72 D73 D74
## 96 96 96
plate.id <- sub("^D[0-9]+(.*)_.*", "\\1", colnames(gse81076.df))
table(plate.id)
## plate.id
## All1 All2 TGFB en1 en2 en3 en4 ex
## 864 96 96 96 96 96 96 96 192
Another irritating feature of this dataset is that gene symbols were supplied, rather than stable identifiers such as Ensembl.
We convert all row names to Ensembl identifiers, removing NA
or duplicated entries (with the exception of spike-in transcripts).
gene.symb <- gsub("__chr.*$", "", rownames(gse81076.df))
is.spike <- grepl("^ERCC-", gene.symb)
table(is.spike)
## is.spike
## FALSE TRUE
## 20064 84
library(org.Hs.eg.db)
gene.ids <- mapIds(org.Hs.eg.db, keys=gene.symb, keytype="SYMBOL", column="ENSEMBL")
gene.ids[is.spike] <- gene.symb[is.spike]
keep <- !is.na(gene.ids) & !duplicated(gene.ids)
gse81076.df <- gse81076.df[keep,]
rownames(gse81076.df) <- gene.ids[keep]
summary(keep)
## Mode FALSE TRUE
## logical 2071 18077
We create a SingleCellExperiment
object to store the counts and metadata together.
This reduces the risk of book-keeping errors in later steps of the analysis.
Note that we re-identify the spike-in rows, as the previous indices would have changed after the subsetting.
library(SingleCellExperiment)
sce.gse81076 <- SingleCellExperiment(list(counts=as.matrix(gse81076.df)),
colData=DataFrame(Donor=donor.names, Plate=plate.id),
rowData=DataFrame(Symbol=gene.symb[keep]))
isSpike(sce.gse81076, "ERCC") <- grepl("^ERCC-", rownames(gse81076.df))
sce.gse81076
## class: SingleCellExperiment
## dim: 18077 1728
## metadata(0):
## assays(1): counts
## rownames(18077): ENSG00000268895 ENSG00000121410 ... ENSG00000074755
## ENSG00000036549
## rowData names(1): Symbol
## colnames(1728): D2ex_1 D2ex_2 ... D17TGFB_95 D17TGFB_96
## colData names(2): Donor Plate
## reducedDimNames(0):
## spikeNames(1): ERCC
We compute quality control (QC) metrics for each cell (McCarthy et al. 2017) and identify cells with low library sizes, low numbers of expressed genes, or high ERCC content.
library(scater)
sce.gse81076 <- calculateQCMetrics(sce.gse81076, compact=TRUE)
QC <- sce.gse81076$scater_qc
low.lib <- isOutlier(QC$all$log10_total_counts, type="lower", nmad=3)
low.genes <- isOutlier(QC$all$log10_total_features_by_counts, type="lower", nmad=3)
high.spike <- isOutlier(QC$feature_control_ERCC$pct_counts, type="higher", nmad=3)
data.frame(LowLib=sum(low.lib), LowNgenes=sum(low.genes),
HighSpike=sum(high.spike, na.rm=TRUE))
## LowLib LowNgenes HighSpike
## 1 55 130 388
Cells with extreme values for these QC metrics are presumed to be of low quality and are removed. A more thorough analysis would examine the distributions of these QC metrics beforehand, but we will skip that step for brevity here.
discard <- low.lib | low.genes | high.spike
sce.gse81076 <- sce.gse81076[,!discard]
summary(discard)
## Mode FALSE TRUE
## logical 1292 436
We compute size factors for the endogenous genes using the deconvolution method (Lun, Bach, and Marioni 2016).
This is done with pre-clustering through quickCluster()
to avoid pooling together very different cells.
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.gse81076, method="igraph", min.mean=0.1)
table(clusters)
## clusters
## 1 2 3
## 513 331 448
sce.gse81076 <- computeSumFactors(sce.gse81076, min.mean=0.1, clusters=clusters)
summary(sizeFactors(sce.gse81076))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.002648 0.442511 0.797307 1.000000 1.296612 9.507820
We also compute size factors for the spike-in transcripts (Lun et al. 2017).
Recall that we set general.use=FALSE
to ensure that the spike-in size factors are only applied to the spike-in transcripts.
sce.gse81076 <- computeSpikeFactors(sce.gse81076, general.use=FALSE)
summary(sizeFactors(sce.gse81076, "ERCC"))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01042 0.57782 0.88699 1.00000 1.27765 7.43998
We then compute normalized log-expression values for use in downstream analyses.
sce.gse81076 <- normalize(sce.gse81076)
We identify highly variable genes (HVGs) using trendVar()
and decomposeVar()
, using the variances of spike-in transcripts to model technical noise.
We set block=
to ensure that uninteresting differences between plates or donors do not inflate the variance.
The small discrepancy in the fitted trend in Figure 1 is caused by the fact that the trend is fitted robustly to the block-wise variances of the spike-ins,
while the variances shown are averaged across blocks and not robust to outliers.
block <- paste0(sce.gse81076$Plate, "_", sce.gse81076$Donor)
fit <- trendVar(sce.gse81076, block=block, parametric=TRUE)
dec <- decomposeVar(sce.gse81076, fit)
plot(dec$mean, dec$total, xlab="Mean log-expression",
ylab="Variance of log-expression", pch=16)
is.spike <- isSpike(sce.gse81076)
points(dec$mean[is.spike], dec$total[is.spike], col="red", pch=16)
curve(fit$trend(x), col="dodgerblue", add=TRUE)
We order genes by decreasing biological component, revealing some usual suspects such as insulin and glucagon.
We will be using this information later when performing feature selection prior to running mnnCorrect()
.
dec.gse81076 <- dec
dec.gse81076$Symbol <- rowData(sce.gse81076)$Symbol
dec.gse81076 <- dec.gse81076[order(dec.gse81076$bio, decreasing=TRUE),]
head(dec.gse81076)
## DataFrame with 6 rows and 7 columns
## mean total bio
## <numeric> <numeric> <numeric>
## ENSG00000254647 2.82926326785746 6.25972220963598 5.81569316873789
## ENSG00000129965 1.87615109970638 5.92173614351272 5.47357562769792
## ENSG00000115263 4.00610832326079 5.56338526078797 5.33005501610815
## ENSG00000118271 3.65393642101131 5.54285843054711 5.24069541798718
## ENSG00000115386 4.24556465621839 5.43121995552112 5.19578897832117
## ENSG00000164266 3.03848794345203 5.4545366051026 5.04804616712653
## tech p.value FDR Symbol
## <numeric> <numeric> <numeric> <character>
## ENSG00000254647 0.444029040898086 0 0 INS
## ENSG00000129965 0.448160515814804 0 0 INS-IGF2
## ENSG00000115263 0.233330244679815 0 0 GCG
## ENSG00000118271 0.302163012559924 0 0 TTR
## ENSG00000115386 0.235430977199958 0 0 REG1A
## ENSG00000164266 0.406490437976075 0 0 SPINK1
This dataset was generated by Muraro et al. (2016) using the CEL-seq2 protocol with unique molecular identifiers (UMIs) and ERCC spike-ins. Count tables were obtained from the NCBI Gene Expression Omnibus using the accession number above.
muraro.fname <- bfcrpath(bfc, file.path("ftp://ftp.ncbi.nlm.nih.gov/geo/series",
"GSE85nnn/GSE85241/suppl",
"GSE85241%5Fcellsystems%5Fdataset%5F4donors%5Fupdated%2Ecsv%2Egz"))
We first read the table into memory.
gse85241.df <- read.table(muraro.fname, sep='\t',
header=TRUE, row.names=1, stringsAsFactors=FALSE)
dim(gse85241.df)
## [1] 19140 3072
We extract the metadata from the column names.
donor.names <- sub("^(D[0-9]+).*", "\\1", colnames(gse85241.df))
table(donor.names)
## donor.names
## D28 D29 D30 D31
## 768 768 768 768
plate.id <- sub("^D[0-9]+\\.([0-9]+)_.*", "\\1", colnames(gse85241.df))
table(plate.id)
## plate.id
## 1 2 3 4 5 6 7 8
## 384 384 384 384 384 384 384 384
Yet again, gene symbols were supplied instead of Ensembl or Entrez identifiers.
We convert all row names to Ensembl identifiers, removing NA
or duplicated entries (with the exception of spike-in transcripts).
gene.symb <- gsub("__chr.*$", "", rownames(gse85241.df))
is.spike <- grepl("^ERCC-", gene.symb)
table(is.spike)
## is.spike
## FALSE TRUE
## 19059 81
library(org.Hs.eg.db)
gene.ids <- mapIds(org.Hs.eg.db, keys=gene.symb, keytype="SYMBOL", column="ENSEMBL")
gene.ids[is.spike] <- gene.symb[is.spike]
keep <- !is.na(gene.ids) & !duplicated(gene.ids)
gse85241.df <- gse85241.df[keep,]
rownames(gse85241.df) <- gene.ids[keep]
summary(keep)
## Mode FALSE TRUE
## logical 1949 17191
We create a SingleCellExperiment
object to store the counts and metadata together.
sce.gse85241 <- SingleCellExperiment(list(counts=as.matrix(gse85241.df)),
colData=DataFrame(Donor=donor.names, Plate=plate.id),
rowData=DataFrame(Symbol=gene.symb[keep]))
isSpike(sce.gse85241, "ERCC") <- grepl("^ERCC-", rownames(gse85241.df))
sce.gse85241
## class: SingleCellExperiment
## dim: 17191 3072
## metadata(0):
## assays(1): counts
## rownames(17191): ENSG00000268895 ENSG00000121410 ... ENSG00000074755
## ENSG00000036549
## rowData names(1): Symbol
## colnames(3072): D28.1_1 D28.1_2 ... D30.8_95 D30.8_96
## colData names(2): Donor Plate
## reducedDimNames(0):
## spikeNames(1): ERCC
We compute QC metrics for each cell and identify cells with low library sizes, low numbers of expressed genes, or high ERCC content.
sce.gse85241 <- calculateQCMetrics(sce.gse85241, compact=TRUE)
QC <- sce.gse85241$scater_qc
low.lib <- isOutlier(QC$all$log10_total_counts, type="lower", nmad=3)
low.genes <- isOutlier(QC$all$log10_total_features_by_counts, type="lower", nmad=3)
high.spike <- isOutlier(QC$feature_control_ERCC$pct_counts, type="higher", nmad=3)
data.frame(LowLib=sum(low.lib), LowNgenes=sum(low.genes),
HighSpike=sum(high.spike, na.rm=TRUE))
## LowLib LowNgenes HighSpike
## 1 577 669 696
Low-quality cells are defined as those with extreme values for these QC metrics and are removed.
discard <- low.lib | low.genes | high.spike
sce.gse85241 <- sce.gse85241[,!discard]
summary(discard)
## Mode FALSE TRUE
## logical 2346 726
We compute size factors for the endogenous genes and spike-in transcripts, and use them to compute log-normalized expression values.
set.seed(1000)
clusters <- quickCluster(sce.gse85241, min.mean=0.1, method="igraph")
table(clusters)
## clusters
## 1 2 3 4 5 6
## 237 248 285 483 613 480
sce.gse85241 <- computeSumFactors(sce.gse85241, min.mean=0.1, clusters=clusters)
summary(sizeFactors(sce.gse85241))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.07856 0.53931 0.81986 1.00000 1.22044 14.33280
sce.gse85241 <- computeSpikeFactors(sce.gse85241, general.use=FALSE)
summary(sizeFactors(sce.gse85241, "ERCC"))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.09295 0.61309 0.88902 1.00000 1.27519 4.04643
sce.gse85241 <- normalize(sce.gse85241)
We fit a trend to the spike-in variances as previously described, allowing us to model the technical noise for each gene (Figure 2).
Again, we set block=
to ensure that uninteresting differences between plates or donors do not inflate the variance.
block <- paste0(sce.gse85241$Plate, "_", sce.gse85241$Donor)
fit <- trendVar(sce.gse85241, block=block, parametric=TRUE)
dec <- decomposeVar(sce.gse85241, fit)
plot(dec$mean, dec$total, xlab="Mean log-expression",
ylab="Variance of log-expression", pch=16)
is.spike <- isSpike(sce.gse85241)
points(dec$mean[is.spike], dec$total[is.spike], col="red", pch=16)
curve(fit$trend(x), col="dodgerblue", add=TRUE)
We order genes by decreasing biological component, as described above.
dec.gse85241 <- dec
dec.gse85241$Symbol <- rowData(sce.gse85241)$Symbol
dec.gse85241 <- dec.gse85241[order(dec.gse85241$bio, decreasing=TRUE),]
head(dec.gse85241)
## DataFrame with 6 rows and 7 columns
## mean total bio
## <numeric> <numeric> <numeric>
## ENSG00000115263 7.66453729345785 6.66863456231166 6.63983282674494
## ENSG00000089199 4.62375793902937 6.46558866721711 6.34422879524315
## ENSG00000169903 3.01813483888172 6.59298310984116 6.23983239174791
## ENSG00000254647 2.00308741950834 6.43736230365307 5.88133815055707
## ENSG00000118271 7.33751241351268 5.81792529163505 5.7864800378735
## ENSG00000171951 4.1933909879622 5.60615948812391 5.44209845614432
## tech p.value FDR Symbol
## <numeric> <numeric> <numeric> <character>
## ENSG00000115263 0.0288017355667183 0 0 GCG
## ENSG00000089199 0.121359871973959 0 0 CHGB
## ENSG00000169903 0.353150718093246 0 0 TM4SF4
## ENSG00000254647 0.556024153095996 0 0 INS
## ENSG00000118271 0.0314452537615524 0 0 TTR
## ENSG00000171951 0.164061031979585 0 0 SCG2
To obtain a single set of features for batch correction, we compute the average biological component across all batches.
We then take all genes with positive biological components to ensure that all interesting biology is retained, equivalent to the behaviour of denoisePCA()
.
However, the quality of the correction can often be sensitive to technical noise, which means that some discretion may be required during feature selection.
Users may prefer to take the top 1000-5000 genes with the largest average components, or to use combineVar()
to obtain combined p-values for gene selection.
universe <- intersect(rownames(dec.gse85241), rownames(dec.gse81076))
mean.bio <- (dec.gse85241[universe,"bio"] + dec.gse81076[universe,"bio"])/2
chosen <- universe[mean.bio > 0]
length(chosen)
## [1] 14758
We also rescale each batch to adjust for differences in sequencing depth between batches.
The multiBatchNorm()
function recomputes log-normalized expression values after adjusting the size factors for systematic differences in coverage between SingleCellExperiment
objects.
(Keep in mind that the previously computed size factors only remove biases between cells within a single batch.)
This improves the quality of the correction by removing one aspect of the technical differences between batches.
rescaled <- multiBatchNorm(sce.gse85241[universe,], sce.gse81076[universe,])
rescaled.gse85241 <- rescaled[[1]]
rescaled.gse81076 <- rescaled[[2]]
Comments from Aaron:
multiBatchNorm()
.
This ensures that the variance components are estimated from the same values to be used in the batch correction.
In practice, this makes little difference, and it tends to be easier to process each batch separately and consolidate all results in one step as shown above.Consider a cell \(a\) in batch \(A\), and identify the cells in batch \(B\) that are nearest neighbours to \(a\) in the expression space defined by the selected features. Repeat this for a cell \(b\) in batch \(B\), identifying its nearest neighbours in \(A\). Mutual nearest neighbours are pairs of cells from different batches that belong in each other’s set of nearest neighbours. The reasoning is that MNN pairs represent cells from the same biological state prior to the application of a batch effect - see Haghverdi et al. (2018) for full theoretical details. Thus, the difference between cells in MNN pairs can be used as an estimate of the batch effect, the subtraction of which can yield batch-corrected values.
We apply the fastMNN()
function to the three batches to remove the batch effect, using the genes in chosen
.
To reduce computational work and technical noise, all cells in all cells are projected into the low-dimensional space defined by the top d
principal components.
Identification of MNNs and calculation of correction vectors are then performed in this low-dimensional space.
The function returns a matrix of corrected values for downstream analyses like clustering or visualization.
set.seed(100)
original <- list(
GSE81076=logcounts(rescaled.gse81076)[chosen,],
GSE85241=logcounts(rescaled.gse85241)[chosen,]
)
# Slightly convoluted call to avoid re-writing code later.
# Equivalent to fastMNN(GSE81076, GSE85241, k=20, d=50, approximate=TRUE)
mnn.out <- do.call(fastMNN, c(original, list(k=20, d=50, approximate=TRUE)))
dim(mnn.out$corrected)
## [1] 3638 50
Each row of the corrected
matrix corresponds to a cell in one of the batches.
The batch
field contains a run-length encoding object specifying the batch of origin of each row.
mnn.out$batch
## character-Rle of length 3638 with 2 runs
## Lengths: 1292 2346
## Values : "GSE81076" "GSE85241"
Advanced users may also be interested in the list of DataFrame
s in the pairs
field.
Each DataFrame
describes the MNN pairs identified upon merging of each successive batch.
This may be useful for checking the identified MNN pairs against known cell type identity, e.g., to determine if the cell types are being paired correctly.
mnn.out$pairs
## [[1]]
## DataFrame with 6635 rows and 2 columns
## first second
## <integer> <integer>
## 1 1 1794
## 2 1 2087
## 3 15 1612
## 4 15 2512
## 5 15 2575
## ... ... ...
## 6631 1290 2339
## 6632 1290 2625
## 6633 1290 1538
## 6634 1290 2794
## 6635 1290 2001
As previously mentioned, we have only used two batches here to simplify the workflow.
However, the MNN approach is not limited to two batches, and inclusion of more batches is as simple as adding more SingleCellExperiment
objects to the fastMNN()
call.
Comments from Aaron:
k=
parameter specifies the number of nearest neighbours to consider when defining MNN pairs.
This should be interpreted as the minimum frequency of each cell type or state in each batch.
Larger values will improve the precision of the correction by increasing the number of MNN pairs,
at the cost of reducing accuracy by allowing MNN pairs to form between cells of different type.auto.order=TRUE
to allow fastMNN()
to empirically choose which batches to merge at each step.
Batches are chosen to maximize the number of MNN pairs in order to provide a stable correction.approximate=TRUE
, fastMNN()
uses methods from the irlba package to perform the principal components analysis quickly.
While the run-to-run differences should be minimal, it does mean that set.seed()
is required to obtain fully reproducible results.We create a new SingleCellExperiment
object containing log-expression values for all cells, along with information regarding the batch of origin.
The MNN-corrected values are stored as dimensionality reduction results, befitting the principal components analysis performed within fastMNN()
.
omat <- do.call(cbind, original)
sce <- SingleCellExperiment(list(logcounts=omat))
reducedDim(sce, "MNN") <- mnn.out$corrected
sce$Batch <- as.character(mnn.out$batch)
sce
## class: SingleCellExperiment
## dim: 14758 3638
## metadata(0):
## assays(1): logcounts
## rownames(14758): ENSG00000115263 ENSG00000089199 ... ENSG00000148688
## ENSG00000176731
## rowData names(0):
## colnames(3638): D2ex_1 D2ex_2 ... D30.8_93 D30.8_94
## colData names(1): Batch
## reducedDimNames(1): MNN
## spikeNames(0):
We examine the batch correction with some t-SNE plots. Figure~3 demonstrates how the cells separate by batch of origin in the uncorrected data. After correction, more intermingling between batches is observed, consistent with the removal of batch effects. Note that the E-MTAB-5601 dataset still displays some separation, which is probably due to the fact that the other batches are UMI datasets.
set.seed(100)
# Using irlba to set up the t-SNE, for speed.
osce <- runPCA(sce, ntop=Inf, method="irlba")
osce <- runTSNE(osce, use_dimred="PCA")
ot <- plotTSNE(osce, colour_by="Batch") + ggtitle("Original")
set.seed(100)
csce <- runTSNE(sce, use_dimred="MNN")
ct <- plotTSNE(csce, colour_by="Batch") + ggtitle("Corrected")
multiplot(ot, ct, cols=2)
We colour by the expression of marker genes for known pancreas cell types to determine whether the correction is biologically sensible. Cells in the same visual cluster express the same marker genes (Figure 4), indicating that the correction maintains separation of cell types.
ct.gcg <- plotTSNE(csce, colour_by="ENSG00000115263") +
ggtitle("Alpha cells (GCG)")
ct.ins <- plotTSNE(csce, colour_by="ENSG00000254647") +
ggtitle("Beta cells (INS)")
ct.sst <- plotTSNE(csce, colour_by="ENSG00000157005") +
ggtitle("Delta cells (SST)")
ct.ppy <- plotTSNE(csce, colour_by="ENSG00000108849") +
ggtitle("PP cells (PPY)")
multiplot(ct.gcg, ct.ins, ct.sst, ct.ppy, cols=2)
For downstream analyses, the MNN-corrected values can be treated in the same manner as any other dimensionality reduction result. For example, it is straightforward to use the MNN-corrected values directly used for clustering analyses, as shown below.
snn.gr <- buildSNNGraph(sce, use.dimred="MNN")
clusters <- igraph::cluster_walktrap(snn.gr)
table(clusters$membership, sce$Batch)
##
## GSE81076 GSE85241
## 1 70 0
## 2 322 280
## 3 346 264
## 4 216 847
## 5 64 198
## 6 149 395
## 7 25 108
## 8 63 84
## 9 22 127
## 10 0 18
## 11 8 4
## 12 7 21
Figure 5 shows strong correspondence between the cluster labels and separation in t-SNE space.
csce$Cluster <- factor(clusters$membership)
plotTSNE(csce, colour_by="Cluster")
Differential expression analyses should be performed on the original log-expression values or counts.
We do not use the corrected values here (which no longer correspond to genes anyway) except to obtain the clusters or trajectories to be characterized.
To model the batch effect, we set the batch of origin as the block=
argument in findMarkers()
.
This will perform all comparisons between clusters within each batch, and then combine the p-values with Stouffer’s Z method to consolidate results across batches.
Intra-batch comparisons are robust to differences between batches but assume that each pair of clusters is present in at least one batch.
m.out <- findMarkers(sce, clusters$membership, block=sce$Batch,
direction="up")
demo <- m.out[["4"]] # looking at cluster 4 (probably alpha cells).
demo <- demo[demo$Top <= 5,]
library(org.Hs.eg.db)
data.frame(row.names=rownames(demo),
Symbol=mapIds(org.Hs.eg.db, keytype="ENSEMBL",
keys=rownames(demo), column="SYMBOL"),
Top=demo$Top, FDR=demo$FDR)
## Symbol Top FDR
## ENSG00000169903 TM4SF4 1 0.000000e+00
## ENSG00000089199 CHGB 1 0.000000e+00
## ENSG00000171951 SCG2 1 0.000000e+00
## ENSG00000135447 PPP1R1A 1 0.000000e+00
## ENSG00000170561 IRX2 1 0.000000e+00
## ENSG00000078098 FAP 1 0.000000e+00
## ENSG00000018236 CNTN1 1 0.000000e+00
## ENSG00000004848 ARX 1 0.000000e+00
## ENSG00000109472 CPE 2 0.000000e+00
## ENSG00000007372 PAX6 2 0.000000e+00
## ENSG00000079689 SCGN 2 0.000000e+00
## ENSG00000145321 GC 2 0.000000e+00
## ENSG00000163499 CRYBA2 2 0.000000e+00
## ENSG00000155093 PTPRN2 2 0.000000e+00
## ENSG00000145730 PAM 2 0.000000e+00
## ENSG00000011347 SYT7 2 0.000000e+00
## ENSG00000138131 LOXL4 2 6.094010e-290
## ENSG00000054356 PTPRN 3 0.000000e+00
## ENSG00000166922 SCG5 3 0.000000e+00
## ENSG00000174938 SEZ6L2 3 0.000000e+00
## ENSG00000076554 TPD52 3 0.000000e+00
## ENSG00000138193 PLCE1 3 2.298815e-191
## ENSG00000204103 MAFB 4 0.000000e+00
## ENSG00000164756 SLC30A8 4 0.000000e+00
## ENSG00000139209 SLC38A4 4 0.000000e+00
## ENSG00000169213 RAB3B 4 0.000000e+00
## ENSG00000172348 RCAN2 4 1.059137e-279
## ENSG00000125851 PCSK2 5 0.000000e+00
## ENSG00000115263 GCG 5 0.000000e+00
## ENSG00000204580 DDR1 5 0.000000e+00
## ENSG00000164687 FABP5 5 1.102690e-262
## ENSG00000136698 CFC1 5 4.356826e-262
Another approach is to define a design matrix containing the batch of origin as the sole factor.
findMarkers()
will then fit a linear model to the log-expression values, similar to the use of limma for bulk RNA sequencing data (Ritchie et al. 2015).
This handles situations where multiple batches contain unique clusters, whereby comparisons can be implicitly performed between shared cell types.
However, the use of a linear model makes some strong assumptions about the homogeneity of the batch effect across cell types and the equality of variances.
# Setting up the design matrix (we remove intercept for full rank
# in the final design matrix with the cluster-specific terms).
design <- model.matrix(~sce$Batch)
design <- design[,-1,drop=FALSE]
m.alt <- findMarkers(sce, clusters$membership, design=design,
direction="up")
demo <- m.alt[["4"]]
demo <- demo[demo$Top <= 5,]
data.frame(row.names=rownames(demo),
Symbol=mapIds(org.Hs.eg.db, keytype="ENSEMBL",
keys=rownames(demo), column="SYMBOL"),
Top=demo$Top, FDR=demo$FDR)
## Symbol Top FDR
## ENSG00000166922 SCG5 1 0
## ENSG00000125851 PCSK2 1 0
## ENSG00000169903 TM4SF4 1 0
## ENSG00000170561 IRX2 1 0
## ENSG00000109472 CPE 2 0
## ENSG00000118271 TTR 2 0
## ENSG00000115263 GCG 2 0
## ENSG00000145321 GC 2 0
## ENSG00000204103 MAFB 3 0
## ENSG00000089199 CHGB 4 0
## ENSG00000078098 FAP 4 0
## ENSG00000171951 SCG2 5 0
## ENSG00000135447 PPP1R1A 5 0
It is similarly possible to perform these analyses with standard Bioconductor packages for DE analysis such as edgeR or limma.
Note that the use of block=
is roughly similar to the use of a batch-cluster interaction model and testing whether the average log-fold change across batches is equal to zero.
Comments from Aaron:
mnnCorrect()
function will note that the function returned corrected expression values.
Thus, it is tempting to use these corrected values directly for DE analyses.
This is inappropriate for various reasons:
mnnCorrect()
do not return corrected values on the log-scale, but rather a cosine-normalized log-scale.
This makes it difficult to interpret the effect size of DE analyses based on the corrected values.All software packages used in this workflow are publicly available from the Comprehensive R Archive Network (https://cran.r-project.org) or the Bioconductor project (http://bioconductor.org). The specific version numbers of the packages used are shown below, along with the version of the R installation.
sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] org.Hs.eg.db_3.7.0
## [2] EnsDb.Hsapiens.v86_2.99.0
## [3] ensembldb_2.6.3
## [4] AnnotationFilter_1.6.0
## [5] DropletUtils_1.2.2
## [6] pheatmap_1.0.12
## [7] cluster_2.0.7-1
## [8] dynamicTreeCut_1.63-1
## [9] limma_3.38.3
## [10] scran_1.10.2
## [11] scater_1.10.1
## [12] ggplot2_3.1.0
## [13] TxDb.Mmusculus.UCSC.mm10.ensGene_3.4.0
## [14] GenomicFeatures_1.34.1
## [15] org.Mm.eg.db_3.7.0
## [16] AnnotationDbi_1.44.0
## [17] SingleCellExperiment_1.4.1
## [18] SummarizedExperiment_1.12.0
## [19] DelayedArray_0.8.0
## [20] BiocParallel_1.16.5
## [21] matrixStats_0.54.0
## [22] Biobase_2.42.0
## [23] GenomicRanges_1.34.0
## [24] GenomeInfoDb_1.18.1
## [25] IRanges_2.16.0
## [26] S4Vectors_0.20.1
## [27] BiocGenerics_0.28.0
## [28] bindrcpp_0.2.2
## [29] BiocFileCache_1.6.0
## [30] dbplyr_1.2.2
## [31] knitr_1.21
## [32] BiocStyle_2.10.0
##
## loaded via a namespace (and not attached):
## [1] ProtGenerics_1.14.0 bitops_1.0-6
## [3] bit64_0.9-7 RColorBrewer_1.1-2
## [5] progress_1.2.0 httr_1.4.0
## [7] tools_3.5.2 irlba_2.3.2
## [9] R6_2.3.0 KernSmooth_2.23-15
## [11] HDF5Array_1.10.1 vipor_0.4.5
## [13] DBI_1.0.0 lazyeval_0.2.1
## [15] colorspace_1.3-2 withr_2.1.2
## [17] tidyselect_0.2.5 gridExtra_2.3
## [19] prettyunits_1.0.2 bit_1.1-14
## [21] curl_3.2 compiler_3.5.2
## [23] BiocNeighbors_1.0.0 rtracklayer_1.42.1
## [25] labeling_0.3 bookdown_0.9
## [27] scales_1.0.0 rappdirs_0.3.1
## [29] stringr_1.3.1 digest_0.6.18
## [31] Rsamtools_1.34.0 rmarkdown_1.11
## [33] XVector_0.22.0 pkgconfig_2.0.2
## [35] htmltools_0.3.6 highr_0.7
## [37] rlang_0.3.1 RSQLite_2.1.1
## [39] DelayedMatrixStats_1.4.0 bindr_0.1.1
## [41] dplyr_0.7.8 RCurl_1.95-4.11
## [43] magrittr_1.5 GenomeInfoDbData_1.2.0
## [45] Matrix_1.2-15 Rcpp_1.0.0
## [47] ggbeeswarm_0.6.0 munsell_0.5.0
## [49] Rhdf5lib_1.4.2 viridis_0.5.1
## [51] edgeR_3.24.3 stringi_1.2.4
## [53] yaml_2.2.0 zlibbioc_1.28.0
## [55] Rtsne_0.15 rhdf5_2.26.2
## [57] plyr_1.8.4 grid_3.5.2
## [59] blob_1.1.1 crayon_1.3.4
## [61] lattice_0.20-38 Biostrings_2.50.2
## [63] cowplot_0.9.4 hms_0.4.2
## [65] locfit_1.5-9.1 pillar_1.3.1
## [67] igraph_1.2.2 reshape2_1.4.3
## [69] biomaRt_2.38.0 XML_3.98-1.16
## [71] glue_1.3.0 evaluate_0.12
## [73] BiocManager_1.30.4 gtable_0.2.0
## [75] purrr_0.2.5 assertthat_0.2.0
## [77] xfun_0.4 viridisLite_0.3.0
## [79] tibble_2.0.0 GenomicAlignments_1.18.1
## [81] beeswarm_0.2.3 memoise_1.1.0
## [83] statmod_1.4.30
Grun, D., M. J. Muraro, J. C. Boisset, K. Wiebrands, A. Lyubimova, G. Dharmadhikari, M. van den Born, et al. 2016. “De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data.” Cell Stem Cell 19 (2):266–77.
Haghverdi, L., A. T. L. Lun, M. D. Morgan, and J. C. Marioni. 2018. “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.” Nat. Biotechnol. 36 (5):421–27.
Lawlor, N., J. George, M. Bolisetty, R. Kursawe, L. Sun, V. Sivakamasundari, I. Kycia, P. Robson, and M. L. Stitzel. 2017. “Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes.” Genome Res. 27 (2):208–22.
Lun, A. T. L., F. J. Calero-Nieto, L. Haim-Vilmovsky, B. Gottgens, and J. C. Marioni. 2017. “Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data.” Genome Res. 27 (11):1795–1806.
Lun, A. T., K. Bach, and J. C. Marioni. 2016. “Pooling across cells to normalize single-cell RNA sequencing data with many zero counts.” Genome Biol. 17 (April):75.
McCarthy, D. J., K. R. Campbell, A. T. Lun, and Q. F. Wills. 2017. “Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R.” Bioinformatics 33 (8):1179–86.
Muraro, M. J., G. Dharmadhikari, D. Grun, N. Groen, T. Dielen, E. Jansen, L. van Gurp, et al. 2016. “A Single-Cell Transcriptome Atlas of the Human Pancreas.” Cell Syst 3 (4):385–94.
Ritchie, M. E., B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, and G. K. Smyth. 2015. “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Res. 43 (7):e47.
Segerstolpe, A., A. Palasantza, P. Eliasson, E. M. Andersson, A. C. Andreasson, X. Sun, S. Picelli, et al. 2016. “Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes.” Cell Metab. 24 (4):593–607.
2.3 Comments on additional batches
In Haghverdi et al. (2018), we originally performed batch correction across four separate pancreas scRNA-seq datasets. For simplicity, we will only consider the two batches generated using CEL-seq(2) and ignore those generated using Smart-seq2 (Segerstolpe et al. 2016; Lawlor et al. 2017). As one might expect, batch correction is easiest when dealing with data generated from the same technology, as fewer systematic differences are present that can interfere with the biological structure. Nonetheless, it is often possible to obtain good results when applying MNN correction to batches of data generated with different technologies.
It is also worth pointing out that both of the CEL-seq(2) batches above contain cells from multiple donors. Each donor could be treated as a separate batch in their own right, reflecting (presumably uninteresting) biological differences between donors due to genotype, age, sex or other factors that are not easily controlled when dealing with humans. For simplicity, we will ignore the donor effects within each study and only consider the removal of the batch effect between the two studies. However, we note that it is possible to apply the MNN correction between donors in each batch and then between the batches - see
?fastMNN
for details.