1 Overview

Droplet-based scRNA-seq protocols capture cells in droplets for massively multiplexed library prepation (Klein et al. 2015; Macosko et al. 2015). This greatly increases the throughput of scRNA-seq studies, allowing tens of thousands of individual cells to be profiled in a routine experiment. However, it (again) involves some differences from the previous workflows to reflect some unique aspects of droplet-based data.

Here, we describe a brief analysis of the peripheral blood mononuclear cell (PBMC) dataset from 10X Genomics (Zheng et al. 2017). The data are publicly available from the 10X Genomics website, from which we download the raw gene/barcode count matrices, i.e., before cell calling from the CellRanger pipeline.

bfc <- BiocFileCache("raw_data", ask = FALSE)
raw.path <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com/samples",
untar(raw.path, exdir=file.path(tempdir(), "pbmc4k"))

2 Setting up the data

2.1 Reading in a sparse matrix

We load in the raw count matrix using the read10xCounts() function from the DropletUtils package. This will create a SingleCellExperiment object where each column corresponds to a cell barcode.

fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38")
sce <- read10xCounts(fname, col.names=TRUE)
## class: SingleCellExperiment 
## dim: 33694 737280 
## metadata(0):
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
##   ENSG00000268674
## rowData names(2): ID Symbol
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):

Here, each count represents the number of unique molecular identifiers (UMIs) assigned to a gene for a cell barcode. Note that the counts are loaded as a sparse matrix object - specifically, a dgCMatrix instance from the Matrix package. This avoids allocating memory to hold zero counts, which is highly memory-efficient for low-coverage scRNA-seq data.

## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"

2.2 Annotating the rows

We relabel the rows with the gene symbols for easier reading. This is done using the uniquifyFeatureNames() function, which ensures uniqueness in the case of duplicated or missing symbols.

rownames(sce) <- uniquifyFeatureNames(rowData(sce)$ID, rowData(sce)$Symbol)
## [1] "RP11-34P13.3"  "FAM138A"       "OR4F5"         "RP11-34P13.7" 
## [5] "RP11-34P13.8"  "RP11-34P13.14"

We also identify the chromosomal location for each gene. The mitochondrial location is particularly useful for later quality control.

location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce)$ID, 
    column="SEQNAME", keytype="GENEID")
rowData(sce)$CHR <- location
##    Mode   FALSE    TRUE    NA's 
## logical   33537      13     144

3 Calling cells from empty droplets

3.1 Testing for deviations from ambient expression

An interesting aspect of droplet-based data is that we have no prior knowledge about which droplets (i.e., cell barcodes) actually contain cells, and which are empty. Thus, we need to call cells from empty droplets based on the observed expression profiles. This is not entirely straightforward as empty droplets can contain ambient (i.e., extracellular) RNA that can be captured and sequenced. The distribution of total counts exhibits a sharp transition between barcodes with large and small total counts (Figure 1), probably corresponding to cell-containing and empty droplets respectively.

bcrank <- barcodeRanks(counts(sce))

# Only showing unique points for plotting speed.
uniq <- !duplicated(bcrank$rank)
plot(bcrank$rank[uniq], bcrank$total[uniq], log="xy",
    xlab="Rank", ylab="Total UMI count", cex.lab=1.2)

abline(h=metadata(bcrank)$inflection, col="darkgreen", lty=2)
abline(h=metadata(bcrank)$knee, col="dodgerblue", lty=2)

legend("bottomleft", legend=c("Inflection", "Knee"), 
    col=c("darkgreen", "dodgerblue"), lty=2, cex=1.2)
Total UMI count for each barcode in the PBMC dataset, plotted against its rank (in decreasing order of total counts). The inferred locations of the inflection and knee points are also shown.

Figure 1: Total UMI count for each barcode in the PBMC dataset, plotted against its rank (in decreasing order of total counts). The inferred locations of the inflection and knee points are also shown.

We use the emptyDrops() function to test whether the expression profile for each cell barcode is significantly different from the ambient RNA pool (Lun et al. 2018). Any significant deviation indicates that the barcode corresponds to a cell-containing droplet. We call cells at a false discovery rate (FDR) of 0.1%, meaning that no more than 0.1% of our called barcodes should be empty droplets on average.

e.out <- emptyDrops(counts(sce))
sum(e.out$FDR <= 0.001, na.rm=TRUE)
## [1] 4237

We then subset our SingleCellExperiment object to retain only the detected cells.

# using which() to automatically remove NAs.
sce <- sce[,which(e.out$FDR <= 0.001)]

Comments from Aaron:

  • emptyDrops() computes Monte Carlo \(p\)-values based on a Dirichlet-multinomial model of sampling molecules into droplets. These \(p\)-values are stochastic so it is important to set the random seed to obtain reproducible results.
  • The stability of the Monte Carlo \(p\)-values depends on the number of iterations used to compute them. This is controlled using the niters= argument in emptyDrops(), set to a default of 10000 for speed. Larger values improve stability at the cost of time only, so users should set niters= to the largest value they are willing to tolerate.
  • The function assumes that cell barcodes with total UMI counts below a certain threshold (lower=100 by default) correspond to empty droplets, and uses them to estimate the ambient expression profile. By definition, these barcodes cannot be cell-containing droplets and are excluded from the hypothesis testing, hence the NAs in the output.
  • Users wanting to use the cell calling algorithm from the CellRanger pipeline can call defaultDrops() instead. This tends to be quite conservative as it often discards genuine cells with low RNA content (and thus low total counts). It also requires an estimate of the expected number of cells in the experiment.

3.2 Examining cell-calling diagnostics

The number of Monte Carlo iterations (specified by the niters argument in emptyDrops()) determines the lower bound for the _p_values (Phipson and Smyth 2010). The Limited field in the output indicates whether or not the computed p-value for a particular barcode is bounded by the number of iterations. If any non-significant barcodes are TRUE for Limited, we may need to increase the number of iterations. A larger number of iterations will often result in a lower p-value for these barcodes, which may allow them to be detected after correcting for multiple testing.

table(Sig=e.out$FDR <= 0.01, Limited=e.out$Limited)
##        Limited
## Sig     FALSE TRUE
##   FALSE   929    0
##   TRUE   1722 2638

As mentioned above, emptyDrops() assumes that barcodes with low total UMI counts are empty droplets. Thus, the null hypothesis should be true for all of these barcodes. We can check whether the hypothesis test holds its size by examining the distribution of \(p\)-values for low-total barcodes. Ideally, the distribution should be close to uniform.

full.data <- read10xCounts(fname, col.names=TRUE)
limit <- 100   
all.out <- emptyDrops(counts(full.data), lower=limit, test.ambient=TRUE)
hist(all.out$PValue[all.out$Total <= limit & all.out$Total > 0],
    xlab="P-value", main="", col="grey80") 
Distribution of $p$-values for the assumed empty droplets.

Figure 2: Distribution of \(p\)-values for the assumed empty droplets.

Large peaks near zero indicate that barcodes with total counts below lower are not all ambient in origin. This can be resolved by decreasing lower further to exclude barcodes corresponding to droplets with very small cells.

4 Quality control on the cells

The previous step only distinguishes cells from empty droplets, but makes no statement about the quality of the cells. It is entirely possible for droplets to contain damaged or dying cells, which need to be removed prior to downstream analysis. We compute some QC metrics using calculateQCMetrics() (McCarthy et al. 2017) and examine their distributions in Figure 3.

sce <- calculateQCMetrics(sce, feature_controls=list(Mito=which(location=="MT")))
hist(sce$log10_total_counts, breaks=20, col="grey80",
    xlab="Log-total UMI count")
hist(sce$log10_total_features_by_counts, breaks=20, col="grey80",
    xlab="Log-total number of expressed features")
hist(sce$pct_counts_Mito, breaks=20, col="grey80",
    xlab="Proportion of reads in mitochondrial genes")
Histograms of QC metric distributions in the PBMC dataset.

Figure 3: Histograms of QC metric distributions in the PBMC dataset.

Ideally, we would remove cells with low library sizes or total number of expressed features as described previously. However, this would likely remove cell types with low RNA content, especially in a heterogeneous PBMC population with many different cell types. Thus, we use a more relaxed strategy and only remove cells with large mitochondrial proportions, using it as a proxy for cell damage. (Keep in mind that droplet-based datasets usually do not have spike-in RNA.)

high.mito <- isOutlier(sce$pct_counts_Mito, nmads=3, type="higher")
sce <- sce[,!high.mito]
##    Mode   FALSE    TRUE 
## logical    3925     312

Comments from Aaron:

  • The above justification for using a more relaxed filter is largely retrospective. In practice, we may not know a priori the degree of population heterogeneity and whether it manifests in the QC metrics. We recommend performing an initial analysis with some QC, and then relaxing the filter (or making it more stringent) based on further diagnostics. See here for an example of a potential diagnostic approach.
  • Removal of cells with high mitochondrial content assumes that the damage to the cell membrane is modest enough that the mitochondria are retained. It is possible for the cells to be so damaged that all cytoplasmic content is lost and only the stripped nucleus remains for sequencing. This manifests as mitochondrial proportions of zero, usually accompanied by low library sizes/numbers of expressed genes and possibly low ribosomal protein gene expression. If a cluster of such cells is observed in the data, they can be removed by using isOutlier() on the mitochondrial proportions with type="lower" and log=TRUE (to improve resolution around zero).

5 Examining gene expression

The average expression of each gene is much lower here compared to the previous datasets (Figure 4). This is due to the reduced coverage per cell when thousands of cells are multiplexed together for sequencing.

ave <- calcAverage(sce)
rowData(sce)$AveCount <- ave
hist(log10(ave), col="grey80")