scDblFinder 1.0.0
scDblFinder identifies doublets in single-cell RNAseq directly by creating artificial doublets and looking at their prevalence in the neighborhood of each cell. The rough logic is very similar to DoubletFinder, but it is much simpler and more efficient. In a nutshell:
dbr.sd), and use the error rate of the real/artificial prediction in conjunction with the deviation in global doublet rate to set the threshold.
scDblFinder was developed under R 3.6. Install with:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("scDblFinder")
Or, until the new Bioconductor release, install the development version from GitHub:
BiocManager::install("plger/scDblFinder")
Given an object sce of class SingleCellExperiment:
library(scDblFinder)
sce <- scDblFinder(sce)
This will add the following columns to the colData of sce:
sce$scDblFinder.neighbors: the number of neighbors considered
sce$scDblFinder.ratio: the proportion of artificial doublets in the neighborhood (the higher it is, the more likely the cell is a doublet)
sce$scDblFinder.score: a doublet score integrating the ratio into a probability of the cell being a doublet
sce$scDblFinder.class: the classification (doublet or singlet)
If you have multiple samples (understood as different cell captures), then it is preferable to look for doublets separately for each sample. You can do this by simply providing a vector of the sample ids to the samples parameter of scDblFinder or, if these are stored in a column of colData, the name of the column. In this case, you might also consider multithreading it using the BPPARAM parameter. For example:
library(BiocParallel)
sce <- scDblFinder(sce, samples="sample_id", BPPARAM=MulticoreParam(3))
table(sce$scDblFinder.class)
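To make the scDblFinder.ratio column described above more concrete, here is a minimal base-R sketch of the idea on toy data. It is not the package's implementation: the two-dimensional "embedding", the way artificial doublets are built (averaging two cells' coordinates rather than summing counts), and all variable names are assumptions made for illustration only.

```r
set.seed(42)

# Toy 2-D embedding with two well-separated groups of 50 "real" cells each
# (hypothetical values, not real expression data).
real <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2)
)

# Artificial doublets: here simply the midpoint of one cell from each group.
i <- sample(1:50, 20)
j <- sample(51:100, 20)
artificial <- (real[i, ] + real[j, ]) / 2

combined <- rbind(real, artificial)
is_artificial <- rep(c(FALSE, TRUE), c(nrow(real), nrow(artificial)))

# For each cell, the proportion of artificial doublets among its k nearest
# neighbors (Euclidean distance, the cell itself excluded).
k <- 5
d <- as.matrix(dist(combined))
diag(d) <- Inf
ratio <- apply(d, 1, function(row) mean(is_artificial[order(row)[seq_len(k)]]))

real_ratio <- ratio[!is_artificial]
summary(real_ratio)
```

In this toy setting no real doublets were added, so the real cells sit far from the artificial ones and their ratios should be low; a real cell lying among the artificial doublets would instead get a ratio close to 1.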
The important sets of parameters in scDblFinder refer respectively to the expected proportion of doublets, to the clustering, and to the number of artificial doublets used.
The expected proportion of doublets has no impact on the score (the ratio above), but a very strong impact on where the threshold will be placed. It is specified through the dbr parameter and the dbr.sd parameter (the latter specifies the standard deviation of dbr, i.e. the uncertainty in the expected doublet rate). For 10x data, the more cells you capture, the higher the chance of creating a doublet: the Chromium documentation indicates a doublet rate of roughly 1% per 1000 cells captured (so with 5000 cells, the rate is 0.01*5 = 5%, i.e. 0.05*5000 = 250 doublets). The default expected doublet rate is set to this value, with a default standard deviation of 0.015. Note however that other protocols may create considerably more doublets, in which case dbr should be updated accordingly.
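The rule of thumb above can be written out as a short calculation. The helper expected_dbr below is hypothetical (it is not a scDblFinder function) and simply encodes the assumption of roughly 1% doublets per 1000 captured cells.

```r
# Hypothetical helper (not part of scDblFinder): expected doublet rate under
# the rule of thumb of about 1% doublets per 1000 captured cells.
expected_dbr <- function(n_cells, rate_per_1000 = 0.01) {
  rate_per_1000 * n_cells / 1000
}

n_cells <- 5000
dbr <- expected_dbr(n_cells)   # expected doublet rate: 0.05
n_doublets <- dbr * n_cells    # expected number of doublets: 250
```

For a protocol with a different doublet rate, a value computed this way (with an adjusted rate_per_1000) could be supplied through the dbr parameter instead of relying on the default.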
Since doublets are created across clusters, it is important that distinct subpopulations are not lumped together into the same cluster. For this reason, we rely on an over-clustering approach which is similar to scran’s quickCluster, but splits clusters above a certain size. This is implemented by scDblFinder’s overcluster function. By default, the maximum cluster size will be 1/20 of the number of cells. While this is reasonable for most datasets (i.e. all those used for the benchmark), that many clusters might be unnecessary when the cell population has a simple structure, and more clusters might be needed in a very complex population (e.g. whole brain).
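One consequence of the default size limit is that a dataset ends up with at least 20 clusters. The base-R sketch below illustrates this arithmetic on hypothetical cluster labels; it only counts how many pieces each cluster would need to respect the limit and does not reproduce how overcluster actually splits clusters.

```r
# Hypothetical initial clustering of 2000 cells into three clusters.
clusters <- rep(c("A", "B", "C"), times = c(1400, 500, 100))

# Default size limit described above: 1/20 of the number of cells.
max_size <- length(clusters) / 20   # 100

# Minimum number of pieces each cluster must be split into so that no
# piece exceeds the limit.
sizes <- table(clusters)
n_pieces <- ceiling(sizes / max_size)

n_pieces        # A: 14, B: 5, C: 1
sum(n_pieces)   # 20 clusters in total
```

This is why the default can yield more clusters than needed for a population with a simple structure, as in this three-cluster toy example.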
scDblFinder itself determines a reasonable number of artificial doublets to create on the basis of the size of the population and the number of clusters, but increasing this number can only increase the accuracy. If you increase the number above default settings, you might also consider increasing parameter k, i.e. the number of neighbors considered.
To benchmark scDblFinder against alternatives, we used datasets in which cells from multiple individuals were mixed and their identity deconvoluted using SNPs (via demuxlet), which also enables the identification of doublets from different individuals.
The method is compared to:
doubletCells function
Figure 1: Accuracy of the doublet detection in the mixology10x3cl dataset (a mixture of 3 cancer cell lines)
All methods perform very well.
Figure 2: Accuracy of the doublet detection in the mixology10x5cl dataset (a mixture of 5 cancer cell lines)
Figure 3: Accuracy of the doublet detection in the demuxlet control (Batch 2) dataset (GSM2560248)
Figure 4: Running time for each method/dataset
1 DoubletFinder failed on the mixology10x3cl dataset
Note that the clustering step accounts for by far most of the running time of scDblFinder.
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] scDblFinder_1.0.0 cowplot_1.0.0 ggplot2_3.2.1 BiocStyle_2.14.0
##
## loaded via a namespace (and not attached):
## [1] Biobase_2.46.0 viridis_0.5.1
## [3] edgeR_3.28.0 BiocSingular_1.2.0
## [5] viridisLite_0.3.0 DelayedMatrixStats_1.8.0
## [7] assertthat_0.2.1 statmod_1.4.32
## [9] BiocManager_1.30.9 highr_0.8
## [11] stats4_3.6.1 dqrng_0.2.1
## [13] GenomeInfoDbData_1.2.2 vipor_0.4.5
## [15] yaml_2.2.0 pillar_1.4.2
## [17] lattice_0.20-38 glue_1.3.1
## [19] limma_3.42.0 digest_0.6.22
## [21] GenomicRanges_1.38.0 XVector_0.26.0
## [23] randomForest_4.6-14 colorspace_1.4-1
## [25] htmltools_0.4.0 Matrix_1.2-17
## [27] pkgconfig_2.0.3 bookdown_0.14
## [29] zlibbioc_1.32.0 purrr_0.3.3
## [31] scales_1.0.0 BiocParallel_1.20.0
## [33] tibble_2.1.3 IRanges_2.20.0
## [35] withr_2.1.2 SummarizedExperiment_1.16.0
## [37] BiocGenerics_0.32.0 lazyeval_0.2.2
## [39] magrittr_1.5 crayon_1.3.4
## [41] evaluate_0.14 beeswarm_0.2.3
## [43] tools_3.6.1 scater_1.14.0
## [45] data.table_1.12.6 matrixStats_0.55.0
## [47] stringr_1.4.0 S4Vectors_0.24.0
## [49] munsell_0.5.0 locfit_1.5-9.1
## [51] DelayedArray_0.12.0 irlba_2.3.3
## [53] compiler_3.6.1 GenomeInfoDb_1.22.0
## [55] rsvd_1.0.2 rlang_0.4.1
## [57] grid_3.6.1 RCurl_1.95-4.12
## [59] BiocNeighbors_1.4.0 SingleCellExperiment_1.8.0
## [61] igraph_1.2.4.1 bitops_1.0-6
## [63] labeling_0.3 rmarkdown_1.16
## [65] gtable_0.3.0 R6_2.4.0
## [67] gridExtra_2.3 knitr_1.25
## [69] dplyr_0.8.3 stringi_1.4.3
## [71] ggbeeswarm_0.6.0 parallel_3.6.1
## [73] Rcpp_1.0.2 scran_1.14.0
## [75] tidyselect_0.2.5 xfun_0.10