Droplet-based scRNA-seq protocols capture cells in droplets for massively multiplexed library prepation (Klein et al. 2015; Macosko et al. 2015). This greatly increases the throughput of scRNA-seq studies, allowing tens of thousands of individual cells to be profiled in a routine experiment. However, it (again) involves some differences from the previous workflows to reflect some unique aspects of droplet-based data.
Here, we describe a brief analysis of the peripheral blood mononuclear cell (PBMC) dataset from 10X Genomics (Zheng et al. 2017). The data are publicly available from the 10X Genomics website, from which we download the raw gene/barcode count matrices, i.e., before cell calling from the CellRanger pipeline.
library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask = FALSE)
raw.path <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com/samples",
"cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz"))
untar(raw.path, exdir="pbmc4k")
We load in the raw count matrix using the read10xCounts()
function from the DropletUtils package.
This will create a SingleCellExperiment
object where each column corresponds to a cell barcode.
library(DropletUtils)
fname <- "pbmc4k/raw_gene_bc_matrices/GRCh38"
sce <- read10xCounts(fname, col.names=TRUE)
sce
## class: SingleCellExperiment
## dim: 33694 737280
## metadata(0):
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
## ENSG00000268674
## rowData names(2): ID Symbol
## colnames(737280): AAACCTGAGAAACCAT-1 AAACCTGAGAAACCGC-1 ...
## TTTGTCATCTTTAGTC-1 TTTGTCATCTTTCCTC-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):
Here, each count represents the number of unique molecular identifiers (UMIs) assigned to a gene for a cell barcode.
Note that the counts are loaded as a sparse matrix object - specifically, a dgCMatrix
instance from the Matrix package.
This avoids allocating memory to hold zero counts, which is highly memory-efficient for low-coverage scRNA-seq data.
class(counts(sce))
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
We relabel the rows with the gene symbols for easier reading.
This is done using the uniquifyFeatureNames()
function, which ensures uniqueness in the case of duplicated or missing symbols.
library(scater)
rownames(sce) <- uniquifyFeatureNames(rowData(sce)$ID, rowData(sce)$Symbol)
head(rownames(sce))
## [1] "RP11-34P13.3" "FAM138A" "OR4F5" "RP11-34P13.7"
## [5] "RP11-34P13.8" "RP11-34P13.14"
We also identify the chromosomal location for each gene. The mitochondrial location is particularly useful for later quality control.
library(EnsDb.Hsapiens.v86)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce)$ID,
column="SEQNAME", keytype="GENEID")
rowData(sce)$CHR <- location
summary(location=="MT")
## Mode FALSE TRUE NA's
## logical 33537 13 144
An interesting aspect of droplet-based data is that we have no prior knowledge about which droplets (i.e., cell barcodes) actually contain cells, and which are empty. Thus, we need to call cells from empty droplets based on the observed expression profiles. This is not entirely straightforward as empty droplets can contain ambient (i.e., extracellular) RNA that can be captured and sequenced. The distribution of total counts exhibits a sharp transition between barcodes with large and small total counts (Figure 1), probably corresponding to cell-containing and empty droplets respectively.
bcrank <- barcodeRanks(counts(sce))
# Only showing unique points for plotting speed.
uniq <- !duplicated(bcrank$rank)
plot(bcrank$rank[uniq], bcrank$total[uniq], log="xy",
xlab="Rank", ylab="Total UMI count", cex.lab=1.2)
abline(h=bcrank$inflection, col="darkgreen", lty=2)
abline(h=bcrank$knee, col="dodgerblue", lty=2)
legend("bottomleft", legend=c("Inflection", "Knee"),
col=c("darkgreen", "dodgerblue"), lty=2, cex=1.2)
We use the emptyDrops()
function to test whether the expression profile for each cell barcode is significantly different from the ambient RNA pool (Lun et al. 2018).
Any significant deviation indicates that the barcode corresponds to a cell-containing droplet.
We call cells at a false discovery rate (FDR) of 1%, meaning that no more than 1% of our called barcodes should be empty droplets on average.
set.seed(100)
e.out <- emptyDrops(counts(sce))
sum(e.out$FDR <= 0.01, na.rm=TRUE)
## [1] 4358
We then subset our SingleCellExperiment
object to retain only the detected cells.
# using which() to automatically remove NAs.
sce <- sce[,which(e.out$FDR <= 0.01)]
Comments from Aaron:
emptyDrops()
computes Monte Carlo p-values based on a Dirichlet-multinomial model of sampling molecules into droplets.
These p-values are stochastic so it is important to set the random seed to obtain reproducible results.lower=100
by default) correspond to empty droplets,
and uses them to estimate the ambient expression profile.
By definition, these barcodes cannot be cell-containing droplets and are excluded from the hypothesis testing, hence the NA
s in the output.defaultDrops()
instead.
This tends to be quite conservative as it often discards genuine cells with low RNA content (and thus low total counts).
It also requires an estimate of the expected number of cells in the experiment.The number of Monte Carlo iterations (specified by the niters
argument in emptyDrops()
) determines the lower bound for the _p_values (Phipson and Smyth 2010).
The Limited
field in the output indicates whether or not the computed p-value for a particular barcode is bounded by the number of iterations.
If any non-significant barcodes are TRUE
for Limited
, we may need to increase the number of iterations.
A larger number of iterations will often result in a lower p-value for these barcodes, which may allow them to be detected after correcting for multiple testing.
table(Sig=e.out$FDR <= 0.01, Limited=e.out$Limited)
## Limited
## Sig FALSE TRUE
## FALSE 931 0
## TRUE 1745 2613
As mentioned above, emptyDrops()
assumes that barcodes with low total UMI counts are empty droplets.
Thus, the null hypothesis should be true for all of these barcodes.
We can check whether the hypothesis test holds its size by examining the distribution of p-values for low-total barcodes.
Ideally, the distribution should be close to uniform.
full.data <- read10xCounts(fname, col.names=TRUE)
set.seed(100)
limit <- 100
all.out <- emptyDrops(counts(full.data), lower=limit, test.ambient=TRUE)
hist(all.out$PValue[all.out$Total <= limit & all.out$Total > 0],
xlab="P-value", main="", col="grey80")
Large peaks near zero indicate that barcodes with total counts below lower
are not all ambient in origin.
This can be resolved by decreasing lower
further to exclude barcodes corresponding to droplets with very small cells.
The previous step only distinguishes cells from empty droplets, but makes no statement about the quality of the cells.
It is entirely possible for droplets to contain damaged or dying cells, which need to be removed prior to downstream analysis.
We compute some QC metrics using calculateQCMetrics()
(McCarthy et al. 2017) and examine their distributions in Figure 3.
sce <- calculateQCMetrics(sce, feature_controls=list(Mito=which(location=="MT")))
par(mfrow=c(1,3))
hist(sce$log10_total_counts, breaks=20, col="grey80",
xlab="Log-total UMI count")
hist(sce$log10_total_features_by_counts, breaks=20, col="grey80",
xlab="Log-total number of expressed features")
hist(sce$pct_counts_Mito, breaks=20, col="grey80",
xlab="Proportion of reads in mitochondrial genes")
Ideally, we would remove cells with low library sizes or total number of expressed features as described previously. However, this would likely remove cell types with low RNA content, especially in a heterogeneous PBMC population with many different cell types. Thus, we use a more relaxed strategy and only remove cells with large mitochondrial proportions, using it as a proxy for cell damage. (Keep in mind that droplet-based datasets usually do not have spike-in RNA.)
high.mito <- isOutlier(sce$pct_counts_Mito, nmads=3, type="higher")
sce <- sce[,!high.mito]
summary(high.mito)
## Mode FALSE TRUE
## logical 4037 321
Comments from Aaron:
The average expression of each gene is much lower here compared to the previous datasets (Figure 4). This is due to the reduced coverage per cell when thousands of cells are multiplexed together for sequencing.
ave <- calcAverage(sce)
rowData(sce)$AveCount <- ave
hist(log10(ave), col="grey80")
The set of most highly expressed genes is dominated by ribosomal protein and mitochondrial genes (Figure 5), as expected.
plotHighestExprs(sce)
We perform some pre-clustering to break up obvious clusters, as described previously.
Recall that we need to set the seed when using method="igraph"
.
library(scran)
set.seed(1000)
clusters <- quickCluster(sce, method="igraph", min.mean=0.1,
irlba.args=list(maxit=1000)) # for convergence.
table(clusters)
## clusters
## 1 2 3 4 5
## 937 269 1173 1109 549
We apply the deconvolution method to compute size factors for all cells (Lun, Bach, and Marioni 2016).
The specification of cluster=
ensures that we do not pool cells that are very different.
sce <- computeSumFactors(sce, min.mean=0.1, cluster=clusters)
summary(sizeFactors(sce))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.005117 0.722562 0.884547 1.000000 1.105609 12.711794
The size factors are well correlated against the library sizes (Figure 6), indicating that capture efficiency and sequencing depth are the major biases.
plot(sce$total_counts, sizeFactors(sce), log="xy")
Finally, we compute normalized log-expression values.
There is no need to call computeSpikeFactors()
here, as there are no spike-in transcripts available.
sce <- normalize(sce)
Comments from Aaron:
block=
in quickCluster()
to cluster cells within each batch or run.
This reduces computational work considerably without compromising performance, provided that the clusters within each batch are sufficiently large
(see comments here) for a discussion of the considerations involved in pre-clustering for normalization).block=cut(seq_len(ncol(sce)), 10)
to split the cells into ten “batches” of roughly equal size.
Recall that we are not interpreting the clusters themselves, so it is not a problem to have multiple redundant cluster labels.
Again, this assumes that each cluster is large enough to support deconvolution.quickCluster()
and computeSumFactors()
can process blocks or clusters in parallel.
This is achieved using the BiocParallel framework, which accommodates a range of parallelization strategies.
In this manner, size factors for large datasets can be computed in a scalable manner.The lack of spike-in transcripts complicates the modelling of the technical noise.
One option is to assume that most genes do not exhibit strong biological variation, and to fit a trend to the variances of endogenous genes
(see here for details).
However, this assumption is generally unreasonable for a heterogeneous population.
Instead, we assume that the technical noise is Poisson and create a fitted trend on that basis using the makeTechTrend()
function.
new.trend <- makeTechTrend(x=sce)
We estimate the variances for all genes and compare the trend fits in Figure 7. The Poisson-based trend serves as a lower bound for the variances of the endogenous genes. This results in non-zero biological components for most genes, which is consistent with other UMI-based data sets (see the corresponding analysis of the Zeisel et al. (2015) data set).
fit <- trendVar(sce, use.spikes=FALSE, loess.args=list(span=0.05))
plot(fit$mean, fit$var, pch=16)
curve(fit$trend(x), col="dodgerblue", add=TRUE)
curve(new.trend(x), col="red", add=TRUE)
We decompose the variance for each gene using the Poisson-based trend, and examine the genes with the highest biological components.
fit0 <- fit
fit$trend <- new.trend
dec <- decomposeVar(fit=fit)
top.dec <- dec[order(dec$bio, decreasing=TRUE),]
head(top.dec)
## DataFrame with 6 rows and 6 columns
## mean total bio tech
## <numeric> <numeric> <numeric> <numeric>
## LYZ 2.00433282867795 5.27238976970295 4.63611169338538 0.636278076317563
## S100A9 1.96814061490774 4.73501347757919 4.09335226800879 0.641661209570392
## S100A8 1.7450872893628 4.62384461299396 3.95524147335823 0.668603139635734
## HLA-DRA 2.10913349062558 3.74365888475255 3.12419870700643 0.619460177746116
## CD74 2.88808019297542 3.34102575092047 2.86962155465658 0.471404196263887
## CST3 1.51005991810233 3.04448800763578 2.36293668357704 0.681551324058745
## p.value FDR
## <numeric> <numeric>
## LYZ 0 0
## S100A9 0 0
## S100A8 0 0
## HLA-DRA 0 0
## CD74 0 0
## CST3 0 0
We can plot the genes with the largest biological components, to verify that they are indeed highly variable (Figure 8).
plotExpression(sce, features=rownames(top.dec)[1:10])
Comments from Aaron:
makeTechTrend()
tends to yield large biological components for highly-expressed genes for which Poisson noise is low (in the log-expression space).
This often includes so-called “house-keeping” genes coding for essential cellular components such as ribosomal proteins.
These genes are often considered uninteresting for characterizing cellular heterogeneity,
though this is debatable as they are often differentially expressed in a variety of conditions (Glare et al. 2002; Nazari, Parham, and Maleki 2015; Guimaraes and Zavolan 2016).
Indeed, the fact that they have large biological components indicates that there is strong variation in their expression across cells, which warrants some further investigation.
Nonetheless, if they are deemed to be uninteresting, their impact can be reduced by fitting the mean-variance trend to the endogenous genes.We use the denoisePCA()
function with the assumed Poisson technical trend to choose the number of dimensions to retain after PCA.
Recall that this involves a random initialization when approximate=TRUE
, which motivates the call to set.seed()
to obtain reproducible results.
set.seed(1000)
sce <- denoisePCA(sce, technical=new.trend, approximate=TRUE)
ncol(reducedDim(sce, "PCA"))
## [1] 11
plot(attr(reducedDim(sce), "percentVar"), xlab="PC",
ylab="Proportion of variance explained")
abline(v=ncol(reducedDim(sce, "PCA")), lty=2, col="red")
Examination of the first few PCs already reveals some strong substructure in the data (Figure 10).
plotPCA(sce, ncomponents=3, colour_by="log10_total_features_by_counts")
This is recapitulated with a t-SNE plot (Figure 11).
Again, note that we set use_dimred=
to perform t-SNE on the denoised expression matrix.
set.seed(100)
sce <- runTSNE(sce, use_dimred="PCA", perplexity=30)
plotTSNE(sce, colour_by="log10_total_features_by_counts")
We build a shared nearest neighbour graph (Xu and Su 2015) and use the Walktrap algorithm to identify clusters.
snn.gr <- buildSNNGraph(sce, use.dimred="PCA")
clusters <- igraph::cluster_walktrap(snn.gr)
sce$Cluster <- factor(clusters$membership)
table(sce$Cluster)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 850 591 544 200 534 46 120 46 798 40 136 29 17 86
We look at the ratio of the observed and expected edge weights to confirm that the clusters are modular. (We don’t look at the modularity score itself, as that varies by orders of magnitudes across clusters and is difficult to interpret.) Figure 12 indicates that most of the clusters are well seperated, with few strong off-diagonal entries.
cluster.mod <- clusterModularity(snn.gr, sce$Cluster, get.values=TRUE)
log.ratio <- log2(cluster.mod$observed/cluster.mod$expected + 1)
library(pheatmap)
pheatmap(log.ratio, cluster_rows=FALSE, cluster_cols=FALSE,
color=colorRampPalette(c("white", "blue"))(100))
We examine the cluster identities on a t-SNE plot (Figure 13) to confirm that different clusters are indeed separated.
plotTSNE(sce, colour_by="Cluster")
We detect marker genes for each cluster using findMarkers()
.
Again, we only look at upregulated genes in each cluster, as these are more useful for positive identification of cell types in a heterogeneous population.
markers <- findMarkers(sce, clusters=sce$Cluster, direction="up")
We examine the markers for cluster 8 in more detail. The upregulation of genes such as PF4 and PPBP suggests that this cluster contains platelets or their precursors.
marker.set <- markers[["8"]]
head(marker.set[,1:8], 10) # only first 8 columns, for brevity
## DataFrame with 10 rows and 8 columns
## Top p.value FDR
## <integer> <numeric> <numeric>
## TMSB4X 1 5.35760583326261e-35 1.8051917094595e-30
## PF4 1 5.12050931325382e-30 8.62652204003868e-26
## TAGLN2 2 9.84031158911545e-26 1.10519819561219e-21
## SDPR 2 1.09264714960276e-22 9.2039132646789e-19
## GPX1 3 8.8776256972845e-22 5.9824544048861e-18
## PPBP 3 4.66923110355769e-21 2.62208454672122e-17
## ACTB 4 2.1293693738595e-20 8.96837146035275e-17
## NRGN 5 9.50068126106725e-21 4.5730850630057e-17
## GNG11 7 3.99705163013454e-19 1.49640730695281e-15
## HIST1H2AC 8 9.7347880763609e-19 3.28003949444905e-15
## logFC.1 logFC.2 logFC.3
## <numeric> <numeric> <numeric>
## TMSB4X 2.28378175710327 2.72251442896142 2.76133648694816
## PF4 6.3180410124472 6.36708295382423 6.37291217458349
## TAGLN2 4.46385139989111 4.26174885894995 4.46325505239159
## SDPR 5.28229555255907 5.33079013944103 5.33711366958133
## GPX1 2.49875298171963 4.80932926423923 4.93623033629471
## PPBP 6.09985681188489 6.19243453124335 6.1970659544121
## ACTB 2.33529538395888 3.21038176116685 3.0073999346119
## NRGN 4.36583495397255 4.59794551675681 4.60087668501509
## GNG11 5.11648049930094 5.16755151142283 5.17120403573627
## HIST1H2AC 5.10888323361265 5.14563307683107 5.14606795276647
## logFC.4 logFC.5
## <numeric> <numeric>
## TMSB4X 2.99168279631072 3.36484534862961
## PF4 6.36659548548937 6.37432350301952
## TAGLN2 4.41428250025306 4.1619645723748
## SDPR 5.33711366958133 5.32700873954551
## GPX1 5.00157378898467 4.51330365854975
## PPBP 6.1973642133943 6.19154459691209
## ACTB 2.81031789967064 3.59027478829347
## NRGN 4.60388537827064 4.60053114511147
## GNG11 5.17562678251855 5.14000969651157
## HIST1H2AC 5.17054604083487 5.07232660241665
This is confirmed in Figure 14, where the transcriptional profile of cluster 8 is clearly distinct from the others.
chosen <- rownames(marker.set)[marker.set$Top <= 10]
plotHeatmap(sce, features=chosen, exprs_values="logcounts",
zlim=5, center=TRUE, symmetric=TRUE, cluster_cols=FALSE,
colour_columns_by="Cluster", columns=order(sce$Cluster),
show_colnames=FALSE)
Having completed the basic analysis, we save the SingleCellExperiment
object with its associated data to file.
This avoids having to repeat all of the pre-processing steps described above prior to further analyses.
saveRDS(sce, file="pbmc_data.rds")
All software packages used in this workflow are publicly available from the Comprehensive R Archive Network (https://cran.r-project.org) or the Bioconductor project (http://bioconductor.org). The specific version numbers of the packages used are shown below, along with the version of the R installation.
sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] EnsDb.Hsapiens.v86_2.99.0
## [2] ensembldb_2.6.3
## [3] AnnotationFilter_1.6.0
## [4] DropletUtils_1.2.2
## [5] pheatmap_1.0.12
## [6] cluster_2.0.7-1
## [7] dynamicTreeCut_1.63-1
## [8] limma_3.38.3
## [9] scran_1.10.2
## [10] scater_1.10.1
## [11] ggplot2_3.1.0
## [12] TxDb.Mmusculus.UCSC.mm10.ensGene_3.4.0
## [13] GenomicFeatures_1.34.1
## [14] org.Mm.eg.db_3.7.0
## [15] AnnotationDbi_1.44.0
## [16] SingleCellExperiment_1.4.1
## [17] SummarizedExperiment_1.12.0
## [18] DelayedArray_0.8.0
## [19] BiocParallel_1.16.5
## [20] matrixStats_0.54.0
## [21] Biobase_2.42.0
## [22] GenomicRanges_1.34.0
## [23] GenomeInfoDb_1.18.1
## [24] IRanges_2.16.0
## [25] S4Vectors_0.20.1
## [26] BiocGenerics_0.28.0
## [27] bindrcpp_0.2.2
## [28] BiocFileCache_1.6.0
## [29] dbplyr_1.2.2
## [30] knitr_1.21
## [31] BiocStyle_2.10.0
##
## loaded via a namespace (and not attached):
## [1] ProtGenerics_1.14.0 bitops_1.0-6
## [3] bit64_0.9-7 RColorBrewer_1.1-2
## [5] progress_1.2.0 httr_1.4.0
## [7] tools_3.5.2 irlba_2.3.2
## [9] R6_2.3.0 KernSmooth_2.23-15
## [11] HDF5Array_1.10.1 vipor_0.4.5
## [13] DBI_1.0.0 lazyeval_0.2.1
## [15] colorspace_1.3-2 withr_2.1.2
## [17] tidyselect_0.2.5 gridExtra_2.3
## [19] prettyunits_1.0.2 bit_1.1-14
## [21] curl_3.2 compiler_3.5.2
## [23] BiocNeighbors_1.0.0 rtracklayer_1.42.1
## [25] labeling_0.3 bookdown_0.9
## [27] scales_1.0.0 rappdirs_0.3.1
## [29] stringr_1.3.1 digest_0.6.18
## [31] Rsamtools_1.34.0 rmarkdown_1.11
## [33] XVector_0.22.0 pkgconfig_2.0.2
## [35] htmltools_0.3.6 highr_0.7
## [37] rlang_0.3.1 RSQLite_2.1.1
## [39] DelayedMatrixStats_1.4.0 bindr_0.1.1
## [41] dplyr_0.7.8 RCurl_1.95-4.11
## [43] magrittr_1.5 GenomeInfoDbData_1.2.0
## [45] Matrix_1.2-15 Rcpp_1.0.0
## [47] ggbeeswarm_0.6.0 munsell_0.5.0
## [49] Rhdf5lib_1.4.2 viridis_0.5.1
## [51] edgeR_3.24.3 stringi_1.2.4
## [53] yaml_2.2.0 zlibbioc_1.28.0
## [55] Rtsne_0.15 rhdf5_2.26.2
## [57] plyr_1.8.4 grid_3.5.2
## [59] blob_1.1.1 crayon_1.3.4
## [61] lattice_0.20-38 Biostrings_2.50.2
## [63] cowplot_0.9.4 hms_0.4.2
## [65] locfit_1.5-9.1 pillar_1.3.1
## [67] igraph_1.2.2 reshape2_1.4.3
## [69] biomaRt_2.38.0 XML_3.98-1.16
## [71] glue_1.3.0 evaluate_0.12
## [73] BiocManager_1.30.4 gtable_0.2.0
## [75] purrr_0.2.5 assertthat_0.2.0
## [77] xfun_0.4 viridisLite_0.3.0
## [79] tibble_2.0.0 GenomicAlignments_1.18.1
## [81] beeswarm_0.2.3 memoise_1.1.0
## [83] statmod_1.4.30
Glare, E. M., M. Divjak, M. J. Bailey, and E. H. Walters. 2002. “beta-Actin and GAPDH housekeeping gene expression in asthmatic airways is variable and not suitable for normalising mRNA levels.” Thorax 57 (9):765–70.
Guimaraes, J. C., and M. Zavolan. 2016. “Patterns of ribosomal protein expression specify normal and malignant human cells.” Genome Biol. 17 (1):236.
Klein, A. M., L. Mazutis, I. Akartuna, N. Tallapragada, A. Veres, V. Li, L. Peshkin, D. A. Weitz, and M. W. Kirschner. 2015. “Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells.” Cell 161 (5):1187–1201.
Lun, A. T., K. Bach, and J. C. Marioni. 2016. “Pooling across cells to normalize single-cell RNA sequencing data with many zero counts.” Genome Biol. 17 (April):75.
Lun, A., S. Riesenfeld, T. Andrews, T. P. Dao, T. Gomes, participants in the 1st Human Cell Atlas Jamboree, and J. Marioni. 2018. “Distinguishing Cells from Empty Droplets in Droplet-Based Single-Cell Rna Sequencing Data.” bioRxiv. Cold Spring Harbor Laboratory. https://doi.org/10.1101/234872.
Macosko, E. Z., A. Basu, R. Satija, J. Nemesh, K. Shekhar, M. Goldman, I. Tirosh, et al. 2015. “Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets.” Cell 161 (5):1202–14.
McCarthy, D. J., K. R. Campbell, A. T. Lun, and Q. F. Wills. 2017. “Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R.” Bioinformatics 33 (8):1179–86.
Nazari, F., A. Parham, and A. F. Maleki. 2015. “GAPDH, -actin and -microglobulin, as three common reference genes, are not reliable for gene expression studies in equine adipose- and marrow-derived mesenchymal stem cells.” J Anim Sci Technol 57:18.
Phipson, B., and G. K. Smyth. 2010. “Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn.” Stat. Appl. Genet. Mol. Biol. 9:Article 39.
Xu, C., and Z. Su. 2015. “Identification of cell types from single-cell transcriptomes using a novel clustering method.” Bioinformatics 31 (12):1974–80.
Zeisel, A., A. B. Munoz-Manchado, S. Codeluppi, P. Lonnerberg, G. La Manno, A. Jureus, S. Marques, et al. 2015. “Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq.” Science 347 (6226):1138–42.
Zheng, G. X., J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B. Ziraldo, et al. 2017. “Massively parallel digital transcriptional profiling of single cells.” Nat Commun 8 (January):14049.