Low-quality cells can often yield misleading results in downstream analyses, by:
As such, we need to remove these cells at the start of the analysis.
Recall that we were defining low-quality cells as those with outlier values for various quality control (QC) metrics,
using the isOutlier()
and calculateQCMetrics()
functions from the scater package (McCarthy et al. 2017).
Here, we will examine some of the reasoning behind the outlier-based QC in more detail.
An outlier-based definition for low-quality cells assumes that most cells are of high quality. This is usually reasonable and can be experimentally supported in some situations by visually checking that the cells are intact, e.g., on the microwell plate. Another assumption is that the QC metrics are independent on the biological state of each cell. This ensures that any outlier values for these metrics are driven by technical factors rather than biological processes. Thus, removing cells based on the metrics will not misrepresent the biology in downstream analyses.
The second assumption is most likely to be violated in highly heterogeneous cell populations. For example, some cell types may naturally have less RNA or express fewer genes than other cell types. Such cell types are more likely to be considered outliers and removed, even if they are of high quality. The use of the MAD mitigates this problem by accounting for biological variability in the QC metrics. A heterogeneous population should have higher variability in the metrics among high-quality cells, increasing the MAD and reducing the chance of incorrectly removing particular cell types (at the cost of reducing power to remove low-quality cells). Nonetheless, filtering based on outliers may not be appropriate in extreme cases where one cell type is very different from the others.
Systematic differences in the QC metrics can be handled to some extent using the batch=
argument in the isOutlier()
function.
For example, setting batch
to the plate of origin will identify outliers within each level of batch
, using plate-specific median and MAD estimates.
This is obviously useful for accommodating known differences in experimental processing, e.g., sequencing at different depth or different amounts of added spike-in RNA.
We can also include biological factors in batch
, if those factors could result in systematically fewer expressed genes or lower RNA content.
However, this is not applicable in experiments where the factors are not known in advance.
We can diagnose loss of distinct cell types during QC by looking for differences in gene expression between the discarded and retained cells. To demonstrate, we compute the average count across the discarded and retained pools in the 416B data set.
library(SingleCellExperiment)
sce.full.416b <- readRDS("416B_preQC.rds")
library(scater)
lost <- calcAverage(counts(sce.full.416b)[,!sce.full.416b$PassQC])
kept <- calcAverage(counts(sce.full.416b)[,sce.full.416b$PassQC])
If the discarded pool is enriched for a certain cell type, we should observe increased expression of the corresponding marker genes. No systematic upregulation of genes is apparent in the discarded pool in Figure 1, indicating that the QC step did not inadvertently filter out a cell type in the 416B dataset.
# Avoid loss of points where either average is zero.
capped.lost <- pmax(lost, min(lost[lost>0]))
capped.kept <- pmax(kept, min(kept[kept>0]))
plot(capped.lost, capped.kept, xlab="Average count (discarded)",
ylab="Average count (retained)", log="xy", pch=16)
is.spike <- isSpike(sce.full.416b)
points(capped.lost[is.spike], capped.kept[is.spike], col="red", pch=16)
is.mito <- rowData(sce.full.416b)$is_feature_control_Mt
points(capped.lost[is.mito], capped.kept[is.mito], col="dodgerblue", pch=16)
We examine this more closely by computing log-fold changes between the average counts of the two pools.
The predFC
function stabilizes the log-fold change estimates by adding a prior count to the average of each pool.
We only examine the log-fold changes rather than formally testing for differential expression, as we are not interested in penalizing intra-pool heterogeneity.
library(edgeR)
coefs <- predFC(cbind(lost, kept), design=cbind(1, c(1, 0)))[,2]
info <- data.frame(logFC=coefs, Lost=lost, Kept=kept,
row.names=rownames(sce.full.416b))
head(info[order(info$logFC, decreasing=TRUE),], 20)
## logFC Lost Kept
## ENSMUSG00000104647 6.844237 7.515034 0.000000000
## Nmur1 6.500909 5.909533 0.000000000
## Retn 6.250501 10.333172 0.196931018
## Fut9 6.096356 4.696897 0.010132032
## ENSMUSG00000102352 6.077614 9.393793 0.206637368
## ENSMUSG00000102379 6.029758 4.244690 0.000000000
## 1700101I11Rik 5.828821 4.483094 0.039199404
## Gm4952 5.698380 6.580862 0.172999108
## ENSMUSG00000106680 5.670156 3.389236 0.005234611
## ENSMUSG00000107955 5.554616 5.268508 0.132532601
## Gramd1c 5.446975 4.435342 0.103783669
## Jph3 5.361082 4.696897 0.139188080
## ENSMUSG00000092418 5.324462 3.395752 0.056931488
## 1700029I15Rik 5.316226 8.199510 0.394588776
## Pih1h3b 5.307439 2.546814 0.000000000
## ENSMUSG00000097176 5.275459 2.541927 0.003772842
## Olfr456 5.093383 2.186536 0.000000000
## ENSMUSG00000103731 5.017303 3.315016 0.107148909
## Klhdc8b 4.933215 19.861036 1.635081878
## ENSMUSG00000082449 4.881422 1.878759 0.000000000
Again, no obvious cell type markers are present in the top set of genes upregulated in the discarded pool. The magnitude of the log-fold changes is less important, attributable to imprecision with few cells in the discarded pool. Large log-fold changes can also be driven by enrichment or depletion of mitochondrial, ribosomal protein or nuclear genes upon cell damage.
For comparison, we consider the PBMC data set in which we previously identified a platelet population
(see the previous workflow).
Recall that we relied on the use of the emptyDrops()
method from the DropletUtils package to retain the platelets.
In contrast, if we had used a naive threshold on the total unique molecular identifier (UMI) count, we would have removed this population during the cell calling step.
sce.pbmc <- readRDS("pbmc_data.rds")
wrong.keep <- sce.pbmc$total_counts >= 1000
lost <- calcAverage(counts(sce.pbmc)[,!wrong.keep])
kept <- calcAverage(counts(sce.pbmc)[,wrong.keep])
The presence of a distinct population in the discarded pool manifests in Figure 2 as a shift to the bottom-right for a number of genes. This includes PF4, PPBP and SDPR that are strongly upregulated in the platelets.
# Avoid loss of points where either average is zero.
capped.lost <- pmax(lost, min(lost[lost>0]))
capped.kept <- pmax(kept, min(kept[kept>0]))
plot(capped.lost, capped.kept, xlab="Average count (discarded)",
ylab="Average count (retained)", log="xy", pch=16)
platelet <- c("PF4", "PPBP", "SDPR")
points(capped.lost[platelet], capped.kept[platelet], col="orange", pch=16)