Droplet-based scRNA-seq protocols capture cells in droplets for massively multiplexed library prepation [Klein et al. (2015); macosko2015highly]. This greatly increases the throughput of scRNA-seq studies, allowing tens of thousands of individual cells to be profiled in a routine experiment. Here, we describe a brief analysis of the peripheral blood mononuclear cell (PBMC) dataset from 10X Genomics (Zheng et al. 2017). This again involves some differences from the previous workflows to reflect some unique aspects of droplet-based data.
We load in the raw count matrix using the read10xCounts()
function from the DropletUtils package.
This will create a SingleCellExperiment
object where each column corresponds to a cell barcode.
untar("pbmc4k_raw_gene_bc_matrices.tar.gz", exdir="pbmc4k")
library(DropletUtils)
fname <- "pbmc4k/raw_gene_bc_matrices/GRCh38"
sce <- read10xCounts(fname, col.names=TRUE)
sce
## class: SingleCellExperiment
## dim: 33694 737280
## metadata(0):
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
## ENSG00000268674
## rowData names(2): ID Symbol
## colnames(737280): AAACCTGAGAAACCAT-1 AAACCTGAGAAACCGC-1 ...
## TTTGTCATCTTTAGTC-1 TTTGTCATCTTTCCTC-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):
Here, each count represents the number of unique molecular identifiers (UMIs) assigned to a gene for a cell barcode.
Note that the counts are loaded as a sparse matrix object - specifically, a dgCMatrix
instance from the Matrix package.
This avoids allocating memory to hold zero counts, which is highly memory-efficient for low-coverage scRNA-seq data.
class(counts(sce))
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
We relabel the rows with the gene symbols for easier reading.
This is done using the uniquifyFeatureNames()
function, which ensures uniqueness in the case of duplicated or missing symbols.
library(scater)
rownames(sce) <- uniquifyFeatureNames(rowData(sce)$ID, rowData(sce)$Symbol)
head(rownames(sce))
## [1] "RP11-34P13.3" "FAM138A" "OR4F5" "RP11-34P13.7"
## [5] "RP11-34P13.8" "RP11-34P13.14"
We also identify the chromosomal location for each gene. The mitochondrial location is particularly useful for later quality control.
library(EnsDb.Hsapiens.v86)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce)$ID,
column="SEQNAME", keytype="GENEID")
rowData(sce)$CHR <- location
summary(location=="MT")
## Mode FALSE TRUE NA's
## logical 33537 13 144
An interesting aspect of droplet-based data is that we have no prior knowledge about which droplets (i.e., cell barcodes) actually contain cells, and which are empty. Thus, we need to call cells from empty droplets based on the observed expression profiles. This is not entirely straightforward as empty droplets can contain ambient (i.e., extracellular) RNA that can be captured and sequenced. An examination of the distribution of total counts suggests a fairly sharp transition between barcodes with large and small total counts (Figure 1), probably corresponding to cell-containing and empty droplets respectively.
bcrank <- barcodeRanks(counts(sce))
# Only showing unique points for plotting speed.
uniq <- !duplicated(bcrank$rank)
plot(bcrank$rank[uniq], bcrank$total[uniq], log="xy",
xlab="Rank", ylab="Total UMI count", cex.lab=1.2)
abline(h=bcrank$inflection, col="darkgreen", lty=2)
abline(h=bcrank$knee, col="dodgerblue", lty=2)
legend("bottomleft", legend=c("Inflection", "Knee"),
col=c("darkgreen", "dodgerblue"), lty=2, cex=1.2)
We use the emptyDrops()
function to test whether the expression profile for each cell barcode is significantly different from the ambient pool (Lun et al. 2018).
Any significant deviation indicates that the barcode corresponds to a cell-containing droplet.
We call cells at a false discovery rate (FDR) of 1%, meaning that no more than 1% of our called barcodes should be empty droplets on average.
set.seed(100)
e.out <- emptyDrops(counts(sce))
sum(e.out$FDR <= 0.01, na.rm=TRUE)
## [1] 4453
emptyDrops()
computes Monte Carlo p-values, so it is important to set the random seed to obtain reproducible results.
The number of Monte Carlo iterations also determines the lower bound for the _p_values.
If any non-significant barcodes are TRUE
for Limited
, we may need to increase the number of iterations to ensure that they can be detected.
table(Sig=e.out$FDR <= 0.01, Limited=e.out$Limited)
## Limited
## Sig FALSE TRUE
## FALSE 836 0
## TRUE 1751 2702
We then subset our SingleCellExperiment
object to retain only the detected cells.
# using which() to automatically remove NAs.
sce <- sce[,which(e.out$FDR <= 0.01)]
Comments from Aaron:
emptyDrops()
assumes that cell barcodes with total UMI counts below a certain threshold (default of 100) correspond to empty droplets,
and uses them to estimate the ambient expression profile.
By definition, these barcodes cannot be cell-containing droplets and are excluded from the hypothesis testing, hence the NA
s in the output.defaultDrops()
instead.
This tends to be quite conservative as it often discards genuine cells with low RNA content (and thus low total counts).
It also requires an estimate of the expected number of cells in the experiment.The previous step only distinguishes cells from empty droplets, but makes no statement about the quality of the cells.
It is entirely possible for droplets to contain damaged or dying cells, which need to be removed prior to downstream analysis.
We compute some QC metrics using calculateQCMetrics()
(McCarthy et al. 2017) and examine their distributions in Figure 2.
sce <- calculateQCMetrics(sce, feature_controls=list(Mito=which(location=="MT")))
par(mfrow=c(1,3))
hist(sce$log10_total_counts, breaks=20, col="grey80",
xlab="Log-total UMI count")
hist(sce$log10_total_features_by_counts, breaks=20, col="grey80",
xlab="Log-total number of expressed features")
hist(sce$pct_counts_Mito, breaks=20, col="grey80",
xlab="Proportion of reads in mitochondrial genes")
Ideally, we would remove cells with low library sizes or total number of expressed features as described previously. However, this would likely remove cell types with low RNA content, especially in a heterogeneous PBMC population with many different cell types. Thus, we use a more relaxed strategy and only remove cells with large mitochondrial proportions, using it as a proxy for cell damage. (Keep in mind that droplet-based datasets usually do not have spike-in RNA.)
high.mito <- isOutlier(sce$pct_counts_Mito, nmads=3, type="higher")
sce <- sce[,!high.mito]
summary(high.mito)
## Mode FALSE TRUE
## logical 4115 338
Comments from Aaron:
The average expression of each gene is much lower here compared to the previous datasets (Figure 3). This is due to the reduced coverage per cell when thousands of cells are multiplexed together for sequencing.
ave <- calcAverage(sce)
rowData(sce)$AveCount <- ave
hist(log10(ave), col="grey80")
The set of most highly expressed genes is dominated by ribosomal protein and mitochondrial genes (Figure 4), as expected.
plotHighestExprs(sce)
We apply the deconvolution method to compute size factors for all cells (Lun, Bach, and Marioni 2016). We perform some pre-clustering to break up obvious clusters and avoid pooling cells that are very different.
library(scran)
clusters <- quickCluster(sce, method="igraph", min.mean=0.1,
irlba.args=list(maxit=1000)) # for convergence.
table(clusters)
## clusters
## 1 2 3 4 5
## 933 232 1230 1149 571
sce <- computeSumFactors(sce, min.mean=0.1, cluster=clusters)
summary(sizeFactors(sce))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.006901 0.715236 0.888703 1.000000 1.110422 12.230975
The size factors are well correlated against the library sizes (Figure 5), indicating that capture efficiency and sequencing depth are the major biases.
plot(sce$total_counts, sizeFactors(sce), log="xy")