1 Overview
2 Setting up the data
- 2.1 Reading in a sparse matrix
- 2.2 Annotating the rows
3 Calling cells from empty droplets
4 Quality control on the cells
5 Examining gene expression
6 Normalizing for cell-specific biases
7 Modelling the mean-variance trend
8 Dimensionality reduction
9 Clustering with graph-based methods
10 Marker gene detection
11 Concluding remarks
References

1 Overview

Droplet-based scRNA-seq protocols capture cells in droplets for massively multiplexed library prepation [Klein et al. (2015); macosko2015highly]. This greatly increases the throughput of scRNA-seq studies, allowing tens of thousands of individual cells to be profiled in a routine experiment. Here, we describe a brief analysis of the peripheral blood mononuclear cell (PBMC) dataset from 10X Genomics (Zheng et al. 2017). This again involves some differences from the previous workflows to reflect some unique aspects of droplet-based data.

2 Setting up the data

2.1 Reading in a sparse matrix

We load in the raw count matrix using the read10xCounts() function from the DropletUtils package. This will create a SingleCellExperiment object where each column corresponds to a cell barcode.

untar("pbmc4k_raw_gene_bc_matrices.tar.gz", exdir="pbmc4k")

library(DropletUtils)
fname <- "pbmc4k/raw_gene_bc_matrices/GRCh38"
sce <- read10xCounts(fname, col.names=TRUE)
sce

## class: SingleCellExperiment 
## dim: 33694 737280 
## metadata(0):
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
##   ENSG00000268674
## rowData names(2): ID Symbol
## colnames(737280): AAACCTGAGAAACCAT-1 AAACCTGAGAAACCGC-1 ...
##   TTTGTCATCTTTAGTC-1 TTTGTCATCTTTCCTC-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):

Here, each count represents the number of unique molecular identifiers (UMIs) assigned to a gene for a cell barcode. Note that the counts are loaded as a sparse matrix object - specifically, a dgCMatrix instance from the Matrix package. This avoids allocating memory to hold zero counts, which is highly memory-efficient for low-coverage scRNA-seq data.

class(counts(sce))

## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"

2.2 Annotating the rows

We relabel the rows with the gene symbols for easier reading. This is done using the uniquifyFeatureNames() function, which ensures uniqueness in the case of duplicated or missing symbols.

library(scater)
rownames(sce) <- uniquifyFeatureNames(rowData(sce)$ID, rowData(sce)$Symbol)
head(rownames(sce))

## [1] "RP11-34P13.3"  "FAM138A"       "OR4F5"         "RP11-34P13.7" 
## [5] "RP11-34P13.8"  "RP11-34P13.14"

We also identify the chromosomal location for each gene. The mitochondrial location is particularly useful for later quality control.

library(EnsDb.Hsapiens.v86)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce)$ID, 
    column="SEQNAME", keytype="GENEID")
rowData(sce)$CHR <- location
summary(location=="MT")

##    Mode   FALSE    TRUE    NA's 
## logical   33537      13     144

3 Calling cells from empty droplets

An interesting aspect of droplet-based data is that we have no prior knowledge about which droplets (i.e., cell barcodes) actually contain cells, and which are empty. Thus, we need to call cells from empty droplets based on the observed expression profiles. This is not entirely straightforward as empty droplets can contain ambient (i.e., extracellular) RNA that can be captured and sequenced. An examination of the distribution of total counts suggests a fairly sharp transition between barcodes with large and small total counts (Figure 1), probably corresponding to cell-containing and empty droplets respectively.

bcrank <- barcodeRanks(counts(sce))

# Only showing unique points for plotting speed.
uniq <- !duplicated(bcrank$rank)
plot(bcrank$rank[uniq], bcrank$total[uniq], log="xy",
    xlab="Rank", ylab="Total UMI count", cex.lab=1.2)

abline(h=bcrank$inflection, col="darkgreen", lty=2)
abline(h=bcrank$knee, col="dodgerblue", lty=2)

legend("bottomleft", legend=c("Inflection", "Knee"), 
    col=c("darkgreen", "dodgerblue"), lty=2, cex=1.2)

Total UMI count for each barcode in the PBMC dataset, plotted against its rank (in decreasing order of total counts). The inferred locations of the inflection and knee points are also shown.

Figure 1: Total UMI count for each barcode in the PBMC dataset, plotted against its rank (in decreasing order of total counts)
The inferred locations of the inflection and knee points are also shown.

We use the emptyDrops() function to test whether the expression profile for each cell barcode is significantly different from the ambient pool (Lun et al. 2018). Any significant deviation indicates that the barcode corresponds to a cell-containing droplet. We call cells at a false discovery rate (FDR) of 1%, meaning that no more than 1% of our called barcodes should be empty droplets on average.

set.seed(100)
e.out <- emptyDrops(counts(sce))
sum(e.out$FDR <= 0.01, na.rm=TRUE)

## [1] 4453

emptyDrops() computes Monte Carlo p-values, so it is important to set the random seed to obtain reproducible results. The number of Monte Carlo iterations also determines the lower bound for the _p_values. If any non-significant barcodes are TRUE for Limited, we may need to increase the number of iterations to ensure that they can be detected.

table(Sig=e.out$FDR <= 0.01, Limited=e.out$Limited)

##        Limited
## Sig     FALSE TRUE
##   FALSE   836    0
##   TRUE   1751 2702

We then subset our SingleCellExperiment object to retain only the detected cells.

# using which() to automatically remove NAs.
sce <- sce[,which(e.out$FDR <= 0.01)]

Comments from Aaron:

emptyDrops() assumes that cell barcodes with total UMI counts below a certain threshold (default of 100) correspond to empty droplets, and uses them to estimate the ambient expression profile. By definition, these barcodes cannot be cell-containing droplets and are excluded from the hypothesis testing, hence the NAs in the output.
Users wanting to use the cell calling algorithm from the CellRanger pipeline can call defaultDrops() instead. This tends to be quite conservative as it often discards genuine cells with low RNA content (and thus low total counts). It also requires an estimate of the expected number of cells in the experiment.

4 Quality control on the cells

The previous step only distinguishes cells from empty droplets, but makes no statement about the quality of the cells. It is entirely possible for droplets to contain damaged or dying cells, which need to be removed prior to downstream analysis. We compute some QC metrics using calculateQCMetrics() (McCarthy et al. 2017) and examine their distributions in Figure 2.

sce <- calculateQCMetrics(sce, feature_controls=list(Mito=which(location=="MT")))
par(mfrow=c(1,3))
hist(sce$log10_total_counts, breaks=20, col="grey80",
    xlab="Log-total UMI count")
hist(sce$log10_total_features_by_counts, breaks=20, col="grey80",
    xlab="Log-total number of expressed features")
hist(sce$pct_counts_Mito, breaks=20, col="grey80",
    xlab="Proportion of reads in mitochondrial genes")

Figure 2: Histograms of QC metric distributions in the PBMC dataset

Ideally, we would remove cells with low library sizes or total number of expressed features as described previously. However, this would likely remove cell types with low RNA content, especially in a heterogeneous PBMC population with many different cell types. Thus, we use a more relaxed strategy and only remove cells with large mitochondrial proportions, using it as a proxy for cell damage. (Keep in mind that droplet-based datasets usually do not have spike-in RNA.)

high.mito <- isOutlier(sce$pct_counts_Mito, nmads=3, type="higher")
sce <- sce[,!high.mito]
summary(high.mito)

##    Mode   FALSE    TRUE 
## logical    4115     338

Comments from Aaron:

The above justification for using a more relaxed filter is largely retrospective. In practice, we may not know a priori the degree of population heterogeneity and whether it manifests in the QC metrics. We recommend performing the analysis first with a stringent QC filter, and then relaxing it based on further diagnostics (see here for an example). This is motivated by the fact that low-quality cells can often yield misleading results, clustering separately from other cells; containing genes that appear to be strongly “upregulated” due to the presence of very small size factors; containing genes that are strongly downregulated due to the loss of RNA upon cell damage; and distorting the characterization of population heterogeneity during variance estimation or PCA.

5 Examining gene expression

The average expression of each gene is much lower here compared to the previous datasets (Figure 3). This is due to the reduced coverage per cell when thousands of cells are multiplexed together for sequencing.

ave <- calcAverage(sce)
rowData(sce)$AveCount <- ave
hist(log10(ave), col="grey80")

Histogram of the log~10~-average counts for each gene in the PBMC dataset.

Figure 3: Histogram of the log₁₀-average counts for each gene in the PBMC dataset

The set of most highly expressed genes is dominated by ribosomal protein and mitochondrial genes (Figure 4), as expected.

plotHighestExprs(sce)

Percentage of total counts assigned to the top 50 most highly-abundant features in the PBMC dataset. For each feature, each bar represents the percentage assigned to that feature for a single cell, while the circle represents the average across all cells. Bars are coloured by the total number of expressed features in each cell.

Figure 4: Percentage of total counts assigned to the top 50 most highly-abundant features in the PBMC dataset
For each feature, each bar represents the percentage assigned to that feature for a single cell, while the circle represents the average across all cells. Bars are coloured by the total number of expressed features in each cell.

6 Normalizing for cell-specific biases

We apply the deconvolution method to compute size factors for all cells (Lun, Bach, and Marioni 2016). We perform some pre-clustering to break up obvious clusters and avoid pooling cells that are very different.

library(scran)
clusters <- quickCluster(sce, method="igraph", min.mean=0.1,
    irlba.args=list(maxit=1000)) # for convergence.
table(clusters)

## clusters
##    1    2    3    4    5 
##  933  232 1230 1149  571

sce <- computeSumFactors(sce, min.mean=0.1, cluster=clusters)
summary(sizeFactors(sce))

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  0.006901  0.715236  0.888703  1.000000  1.110422 12.230975

The size factors are well correlated against the library sizes (Figure 5), indicating that capture efficiency and sequencing depth are the major biases.

plot(sce$total_counts, sizeFactors(sce), log="xy")

Figure 5: Size factors for all cells in the PBMC dataset, plotted against the library size

Finally, we compute normalized log-expresion values. There is no need to call computeSpikeFactors() here, as there are no spike-in transcripts available.

sce <- normalize(sce)

7 Modelling the mean-variance trend

The lack of spike-in transcripts complicates the modelling of the technical noise. One option is to assume that most genes do not exhibit strong biological variation, and to fit a trend to the variances of endogenous genes. However, this assumption is generally unreasonable for a heterogeneous population. Instead, we assume that the technical noise is Poisson and create a fitted trend on that basis using the makeTechTrend() function.

new.trend <- makeTechTrend(x=sce)

We estimate the variances for all genes and compare the trend fits in Figure 6. The Poisson-based trend serves as a lower bound for the variances of the endogenous genes, consistent with non-zero biological components.

fit <- trendVar(sce, use.spikes=FALSE, loess.args=list(span=0.05))
plot(fit$mean, fit$var, pch=16)
curve(fit$trend(x), col="dodgerblue", add=TRUE)
curve(new.trend(x), col="red", add=TRUE)

Variance of normalized log-expression values for each gene in the PBMC dataset, plotted against the mean log-expression. The blue line represents the mean-dependent trend fitted to the variances, while the red line represents the Poisson noise.

Figure 6: Variance of normalized log-expression values for each gene in the PBMC dataset, plotted against the mean log-expression
The blue line represents the mean-dependent trend fitted to the variances, while the red line represents the Poisson noise.

We decompose the variance for each gene using the Poisson-based trend, and examine the genes with the highest biological components.

fit0 <- fit
fit$trend <- new.trend
dec <- decomposeVar(fit=fit)
top.dec <- dec[order(dec$bio, decreasing=TRUE),] 
head(top.dec)

## DataFrame with 6 rows and 6 columns
##                     mean            total              bio              tech
##                <numeric>        <numeric>        <numeric>         <numeric>
## LYZ     2.02105835396706 5.31679699819892 4.66975606209996 0.647040936098961
## S100A9  1.98316312497791 4.77901442789332 4.12661274873872 0.652401679154603
## S100A8  1.76532862059769 4.68957247313204 4.01209233180623 0.677480141325811
## HLA-DRA 2.12297042829692 3.75033809854188 3.11884168722863 0.631496411313253
## CD74    2.89711555377952 3.35686556292097 2.86411449020498 0.492751072715998
## CST3    1.51801372204386 3.04199907371977 2.35202120359331 0.689977870126461
##           p.value       FDR
##         <numeric> <numeric>
## LYZ             0         0
## S100A9          0         0
## S100A8          0         0
## HLA-DRA         0         0
## CD74            0         0
## CST3            0         0

We can plot the genes with the largest biological components, to verify that they are indeed highly variable (Figure 7).

plotExpression(sce, features=rownames(top.dec)[1:10])

Distributions of normalized log-expression values for the top 10 genes with the largest biological components in the PBMC dataset. Each point represents the log-expression value in a single cell.

Figure 7: Distributions of normalized log-expression values for the top 10 genes with the largest biological components in the PBMC dataset
Each point represents the log-expression value in a single cell.

8 Dimensionality reduction

We use the denoisePCA() function with the assumed Poisson technical trend, to choose the number of dimensions to retain after PCA.

sce <- denoisePCA(sce, technical=new.trend, approx=TRUE)
ncol(reducedDim(sce, "PCA"))

## [1] 13

plot(attr(reducedDim(sce), "percentVar"), xlab="PC",
    ylab="Proportion of variance explained")
abline(v=ncol(reducedDim(sce, "PCA")), lty=2, col="red")

Figure 8: Variance explained by each principal component in the PBMC dataset
The red line represents the chosen number of PCs.

Examination of the first few PCs already reveals some strong substructure in the data (Figure 9).

plotPCA(sce, ncomponents=3, colour_by="log10_total_features_by_counts")

Pairwise PCA plots of the first three PCs in the PBMC dataset, constructed from normalized log-expression values of genes with positive biological components. Each point represents a cell, coloured by the log-number of expressed features.

Figure 9: Pairwise PCA plots of the first three PCs in the PBMC dataset, constructed from normalized log-expression values of genes with positive biological components
Each point represents a cell, coloured by the log-number of expressed features.

This is recapitulated with a t-SNE plot (Figure 10). Again, note that we set use_dimred= to perform t-SNE on the denoised expression matrix.

sce <- runTSNE(sce, use_dimred="PCA", perplexity=30, rand_seed=100)
plotTSNE(sce, colour_by="log10_total_features_by_counts")

_t_-SNE plots constructed from the denoised PCs of the PBMC dataset. Each point represents a cell and is coloured according to the log-number of expressed features.

Figure 10: t-SNE plots constructed from the denoised PCs of the PBMC dataset
Each point represents a cell and is coloured according to the log-number of expressed features.

9 Clustering with graph-based methods

We build a shared nearest neighbour graph (Xu and Su 2015) and use the Walktrap algorithm to identify clusters.

snn.gr <- buildSNNGraph(sce, use.dimred="PCA")
clusters <- igraph::cluster_walktrap(snn.gr)
sce$Cluster <- factor(clusters$membership)
table(sce$Cluster)

## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13 
##  46 886 618  55 533 202 558 121 796  29 144  91  36

We look at the ratio of the observed and expected edge weights to confirm that the clusters are modular. (We don’t look at the modularity score itself, as that varies by orders of magnitudes across clusters and is difficult to interpret.) Figure 11 indicates that most of the clusters are well seperated, with few strong off-diagonal entries.

cluster.mod <- clusterModularity(snn.gr, sce$Cluster, get.values=TRUE)
log.ratio <- log2(cluster.mod$observed/cluster.mod$expected + 1)

library(pheatmap)
pheatmap(log.ratio, cluster_rows=FALSE, cluster_cols=FALSE, 
    color=colorRampPalette(c("white", "blue"))(100))

Heatmap of the log~10~-ratio of the total weight between nodes in the same cluster or in different clusters, relative to the total weight expected under a null model of random links.

Figure 11: Heatmap of the log₁₀-ratio of the total weight between nodes in the same cluster or in different clusters, relative to the total weight expected under a null model of random links

We examine the cluster identities on a t-SNE plot (Figure 12) to confirm that different clusters are indeed separated.

plotTSNE(sce, colour_by="Cluster")

_t_-SNE plots constructed from the denoised PCs of the PBMC dataset. Each point represents a cell and is coloured according to its cluster identity.

Figure 12: t-SNE plots constructed from the denoised PCs of the PBMC dataset
Each point represents a cell and is coloured according to its cluster identity.

10 Marker gene detection

We detect marker genes for each cluster using findMarkers(). Again, we only look at upregulated genes in each cluster, as these are more useful for positive identification of cell types in a heterogeneous population.

markers <- findMarkers(sce, clusters=sce$Cluster, direction="up")

We examine the markers for cluster 1 in more detail. The upregulation of genes such as PF4 and PPBP suggests that cluster 1 contains platelets or their precursors.

marker.set <- markers[["1"]]
head(marker.set[,1:8], 10) # only first 8 columns, for brevity

## DataFrame with 10 rows and 8 columns
##              Top                  FDR          logFC.2          logFC.3
##        <integer>            <numeric>        <numeric>        <numeric>
## TMSB4X         1 1.20825539104512e-33 2.83376891001389 3.28957055233535
## PF4            1 1.04028515209341e-26  6.8445359368567 6.89093288763622
## TAGLN2         2 1.06267736617134e-22  4.9998040168041 4.78347642357398
## B2M            2 1.51687425240043e-22 1.90344775411314 1.48626294733737
## SDPR           3 7.21969575189484e-20 5.75570471918738 5.79991414440864
## GPX1           4 1.27494762133093e-18 3.03208976370449 5.29719381741187
## NRGN           5 1.91731837359494e-18 4.83847541750688 5.06236425811631
## ACTB           6 7.56677173680049e-18 2.89642013144024 3.75752960793793
## PPBP           6 1.79194616325018e-17 6.55809359131182 6.64889276227971
## GNG11          7 1.77119203110873e-16 5.58385148265683 5.63179837312263
##                 logFC.4          logFC.5          logFC.6          logFC.7
##               <numeric>        <numeric>        <numeric>        <numeric>
## TMSB4X 3.10796701627874 3.29759385628515 3.51950865615281 3.89392066536141
## PF4    6.86973397565973 6.89622281556257 6.89045046363676  6.8954950235345
## TAGLN2 4.98583940060965 4.95529557191157   4.941601217138 4.67632038438988
## B2M    1.79622898528281 1.23522214773642 1.40299353807891  2.1520245042641
## SDPR   5.76758550968281 5.80757836156552 5.80757836156552 5.79807334221895
## GPX1   3.52654529546572 5.45344486557165  5.5156162324872 5.02429679143685
## NRGN   4.97451803249145 5.06856204885949  5.0687270984853 5.06560075358076
## ACTB   3.29965028477221 3.54556905986091  3.3396823794263 4.11877292011244
## PPBP   6.59890989660713 6.65701139834105 6.65590557795021 6.65023604394866
## GNG11  5.58680268818968 5.63450236666286 5.63959330665481 5.60555848129265

This is confirmed in Figure 13, where the transcriptional profile of cluster 1 is clearly distinct from the others.

chosen <- rownames(marker.set)[marker.set$Top <= 10]
plotHeatmap(sce, features=chosen, exprs_values="logcounts", 
    zlim=5, center=TRUE, symmetric=TRUE, cluster_cols=FALSE,
    colour_columns_by="Cluster", columns=order(sce$Cluster))

Heatmap of mean-centred and normalized log-expression values for the top set of markers for cluster 1 in the PBMC dataset. Column colours represent the cluster to which each cell is assigned, as indicated by the legend.

Figure 13: Heatmap of mean-centred and normalized log-expression values for the top set of markers for cluster 1 in the PBMC dataset
Column colours represent the cluster to which each cell is assigned, as indicated by the legend.

11 Concluding remarks

Having completed the basic analysis, we save the SingleCellExperiment object with its associated data to file. This avoids having to repeat all of the pre-processing steps described above prior to further analyses.

saveRDS(sce, file="pbmc_data.rds")

All software packages used in this workflow are publicly available from the Comprehensive R Archive Network (https://cran.r-project.org) or the Bioconductor project (http://bioconductor.org). The specific version numbers of the packages used are shown below, along with the version of the R installation.

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.7-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.7-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] EnsDb.Hsapiens.v86_2.99.0             
##  [2] ensembldb_2.4.1                       
##  [3] AnnotationFilter_1.4.0                
##  [4] DropletUtils_1.0.1                    
##  [5] pheatmap_1.0.10                       
##  [6] cluster_2.0.7-1                       
##  [7] dynamicTreeCut_1.63-1                 
##  [8] limma_3.36.1                          
##  [9] scran_1.8.2                           
## [10] scater_1.8.0                          
## [11] ggplot2_2.2.1                         
## [12] TxDb.Mmusculus.UCSC.mm10.ensGene_3.4.0
## [13] GenomicFeatures_1.32.0                
## [14] org.Mm.eg.db_3.6.0                    
## [15] AnnotationDbi_1.42.1                  
## [16] SingleCellExperiment_1.2.0            
## [17] SummarizedExperiment_1.10.1           
## [18] DelayedArray_0.6.0                    
## [19] BiocParallel_1.14.1                   
## [20] matrixStats_0.53.1                    
## [21] Biobase_2.40.0                        
## [22] GenomicRanges_1.32.3                  
## [23] GenomeInfoDb_1.16.0                   
## [24] IRanges_2.14.10                       
## [25] S4Vectors_0.18.2                      
## [26] BiocGenerics_0.26.0                   
## [27] knitr_1.20                            
## [28] BiocStyle_2.8.1                       
## 
## loaded via a namespace (and not attached):
##  [1] Rtsne_0.13               ggbeeswarm_0.6.0        
##  [3] colorspace_1.3-2         rjson_0.2.19            
##  [5] rprojroot_1.3-2          XVector_0.20.0          
##  [7] DT_0.4                   bit64_0.9-7             
##  [9] tximport_1.8.0           Rsamtools_1.32.0        
## [11] shinydashboard_0.7.0     shiny_1.1.0             
## [13] compiler_3.5.0           httr_1.3.1              
## [15] backports_1.1.2          assertthat_0.2.0        
## [17] Matrix_1.2-14            lazyeval_0.2.1          
## [19] later_0.7.2              htmltools_0.3.6         
## [21] prettyunits_1.0.2        tools_3.5.0             
## [23] bindrcpp_0.2.2           igraph_1.2.1            
## [25] gtable_0.2.0             glue_1.2.0              
## [27] GenomeInfoDbData_1.1.0   reshape2_1.4.3          
## [29] dplyr_0.7.5              Rcpp_0.12.17            
## [31] Biostrings_2.48.0        rtracklayer_1.40.2      
## [33] DelayedMatrixStats_1.2.0 xfun_0.1                
## [35] stringr_1.3.1            mime_0.5                
## [37] irlba_2.3.2              statmod_1.4.30          
## [39] XML_3.98-1.11            edgeR_3.22.2            
## [41] zlibbioc_1.26.0          scales_0.5.0            
## [43] BiocInstaller_1.30.0     ProtGenerics_1.12.0     
## [45] promises_1.0.1           rhdf5_2.24.0            
## [47] RColorBrewer_1.1-2       curl_3.2                
## [49] yaml_2.1.19              memoise_1.1.0           
## [51] gridExtra_2.3            biomaRt_2.36.1          
## [53] stringi_1.2.2            RSQLite_2.1.1           
## [55] highr_0.6                rlang_0.2.0             
## [57] pkgconfig_2.0.1          bitops_1.0-6            
## [59] evaluate_0.10.1          lattice_0.20-35         
## [61] purrr_0.2.4              Rhdf5lib_1.2.1          
## [63] bindr_0.1.1              GenomicAlignments_1.16.0
## [65] htmlwidgets_1.2          labeling_0.3            
## [67] cowplot_0.9.2            bit_1.1-13              
## [69] tidyselect_0.2.4         plyr_1.8.4              
## [71] magrittr_1.5             bookdown_0.7            
## [73] R6_2.2.2                 DBI_1.0.0               
## [75] pillar_1.2.2             RCurl_1.95-4.10         
## [77] tibble_1.4.2             KernSmooth_2.23-15      
## [79] rmarkdown_1.9            viridis_0.5.1           
## [81] progress_1.1.2           locfit_1.5-9.1          
## [83] grid_3.5.0               data.table_1.11.2       
## [85] blob_1.1.1               FNN_1.1                 
## [87] digest_0.6.15            xtable_1.8-2            
## [89] httpuv_1.4.3             munsell_0.4.3           
## [91] beeswarm_0.2.3           viridisLite_0.3.0       
## [93] vipor_0.4.5

References

Klein, A. M., L. Mazutis, I. Akartuna, N. Tallapragada, A. Veres, V. Li, L. Peshkin, D. A. Weitz, and M. W. Kirschner. 2015. “Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells.” Cell 161 (5):1187–1201.

Lun, A. T., K. Bach, and J. C. Marioni. 2016. “Pooling across cells to normalize single-cell RNA sequencing data with many zero counts.” Genome Biol. 17 (April):75.

Lun, A., S. Riesenfeld, T. Andrews, T. P. Dao, T. Gomes, participants in the 1^st Human Cell Atlas Jamboree, and J. Marioni. 2018. “Distinguishing Cells from Empty Droplets in Droplet-Based Single-Cell Rna Sequencing Data.” bioRxiv. Cold Spring Harbor Laboratory. https://doi.org/10.1101/234872.

McCarthy, D. J., K. R. Campbell, A. T. Lun, and Q. F. Wills. 2017. “Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R.” Bioinformatics 33 (8):1179–86.

Xu, C., and Z. Su. 2015. “Identification of cell types from single-cell transcriptomes using a novel clustering method.” Bioinformatics 31 (12):1974–80.

Zheng, G. X., J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B. Ziraldo, et al. 2017. “Massively parallel digital transcriptional profiling of single cells.” Nat Commun 8 (January):14049.

Analyzing single-cell RNA sequencing data from droplet-based protocols

2018-05-25

Contents