1 scDblFinder

The scDblFinder method combines the strengths of various doublet detection approaches, training an iterative classifier on the neighborhood of real cells and artificial doublets.

scDblFinder() has two main modes of operation: cluster-based or not. Both perform quite well (see Germain et al., 2021). In general, we recommend the cluster-based approach in datasets with a very clear cluster structure, and the random approach in more complex datasets.

1.1 Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("scDblFinder")

# or, to get that latest developments:
BiocManager::install("plger/scDblFinder")

1.2 Usage

The input of scDblFinder is an object sce of class SingleCellExperiment (empty drops having already been removed) containing at least the counts (assay ‘counts’). Alternatively, a simple count matrix can also be provided.

Given an SCE object, scDblFinder (using the random approach) can be launched as follows :

set.seed(123)
library(scDblFinder)
# we create a dummy dataset; since it's small we set a higher doublet rate
sce <- mockDoubletSCE(dbl.rate=0.1, ngenes=300 )
# we run scDblFinder (providing the unusually high doublet rate)
sce <- scDblFinder(sce, dbr=0.1)

## Creating ~5000 artificial doublets...

## Dimensional reduction

## Evaluating kNN...

## Training model...

## iter=0, 26 cells excluded from training.

## iter=1, 26 cells excluded from training.

## iter=2, 26 cells excluded from training.

## Threshold found:0.752

## 26 (5%) doublets called

For 10x data, it is usually safe to leave the dbr empty, and it will be automatically estimated.

scDblFinder will add a number of columns to the colData of sce prefixed with ‘scDblFinder’, the most important of which are:

sce$scDblFinder.score : the final doublet score
sce$scDblFinder.class : the classification (doublet or singlet)

We can compare the calls with the truth in this toy example:

table(truth=sce$type, call=sce$scDblFinder.class)

##          call
## truth     singlet doublet
##   singlet     498       2
##   doublet       0      24

To use the cluster-based approach, one simply needs to additionally provide the clusters argument:

sce <- scDblFinder(sce, clusters="cluster")

## 2 clusters

## Creating ~5000 artificial doublets...

## Dimensional reduction

## Evaluating kNN...

## Training model...

## iter=0, 24 cells excluded from training.

## iter=1, 24 cells excluded from training.

## iter=2, 24 cells excluded from training.

## Threshold found:0.998

## 24 (4.6%) doublets called

table(truth=sce$type, call=sce$scDblFinder.class)

##          call
## truth     singlet doublet
##   singlet     500       0
##   doublet       0      24

The clusters argument can be either a vector of cluster labels for each column of sce, a colData column of sce containing such labels, or TRUE. If clusters=TRUE, the fast clustering approach (see ?fastcluster) will be employed. If normalized expression (assay ‘logcounts’) and/or PCA (reducedDim ‘PCA’) are already present in the object, these will be used for the clustering step.

1.2.1 Multiple samples

If you have multiple samples (understood as different cell captures), then it is preferable to look for doublets separately for each sample (for multiplexed samples with cell hashes, this means for each batch). You can do this by simply providing a vector of the sample ids to the samples parameter of scDblFinder or, if these are stored in a column of colData, the name of the column. In this case, you might also consider multithreading it using the BPPARAM parameter (assuming you’ve got enough RAM!). For example:

library(BiocParallel)
sce <- scDblFinder(sce, samples="sample_id", BPPARAM=MulticoreParam(3))
table(sce$scDblFinder.class)

Note that if you are running multiple samples using the cluster-based approach (see below), clustering will be performed sample-wise. While this is typically not an issue for doublet identification, it means that the cluster labels (and putative origins of doublets) won’t match between samples. If you are interested in these, it is preferable to first cluster (for example using sce$cluster <- fastcluster(sce)) and then provide the clusters to scDblFinder, which will ensure concordant labels across samples.

Of note, if you have very large differences in number of cells between samples the scores will not be directly comparable. We are working on improving this, but in the meantime it would be preferable to stratify similar samples and threshold the sets separately.

1.3 Description of the method

Wrapped in the scDblFinder function are the following steps:

1.3.1 Splitting captures

Doublets can only arise within a given sample or capture, and for this reason are better sought independently for each sample, which also speeds up the analysis. If the samples argument is given, scDblFinder will use it to split the cells into samples/captures, and process each of them in parallel if the BPPARAM argument is given. The classifier will be trained globally, but thresholds will be optimized on a per-sample basis. If your samples are multiplexed, i.e. the different samples are mixed in different batches, then the batches should be what you provide to this argument.

1.3.2 Reducing and clustering the data

The analysis can be considerably sped up, at little if any cost in accuracy, by reducing the dataset to only the top expressed genes (controlled by the nfeatures argument).

Then, depending on the clusters argument, an eventual PCA and clustering (using the internal fastcluster function) will be performed. The rationale for the cluster-based approach is that homotypic doublets are nearly impossible to distinguish on the basis of their transcriptome, and therefore that creating that kind of doublets is a waste of computational resources that can moreover mislead the classifier into flagging singlets. An alternative approach, however, is to generate doublets randomly (setting clusters to FALSE or NULL), and use the iterative approach (see below) to exclude also unidentifiable artificial doublets from the training.

1.3.3 Generating artificial doublets

Depending on the clusters and propRandom arguments, artificial doublets will be generated by combining random cells and/or pairs of non-identical clusters (this can be performed manually using the getArtificialDoublets function). A proportion of the doublets will simply use the sum of counts of the composing cells, while the rest will undergo a library size adjustment and poisson resampling.

1.3.4 Examining the k-nearest neighbors (kNN) of each cell

A new PCA is performed on the combination of real and artificial cells, from which a kNN network is generated. Using this kNN, a number of parameters are gathered for each cell, such as the proportion of doublets (i.e. artificial doublets or known doublets provided through the knownDoublets argument, if given) among the KNN, ratio of the distances to the nearest doublet and nearest non-doublet, etc. Several of this features are reported in the output with the ‘scDblFinder.’ prefix, e.g.:

distanceToNearest : distance to the nearest cell (real or artificial)
ratio : the proportion of the KNN that are doublets. (If more than one value of k is given, the various ratios will be used during classification and will be reported)
weighted : the proportion of the KNN that are doublets, weighted by their distance (useful for isolated cells)

1.3.5 Training a classifier

Unless the score argument is set to ‘weighted’ or ‘ratio’ (in which case the aforementioned ratio is directly used as a doublet score), scDblFinder then uses gradient boosted trees trained on the kNN-derived properties along with a few additional features (e.g. library size, number of non-zero features, and an estimate of the difficultly of detecting artificial doublets in the cell’s neighborhood, a variant of the cxds score from the scds, etc.) to distinguish doublets (either artificial or given) from other cells, and assigns a score on this basis.

One problem of using a classifier for this task is that some of the real cells (the actual doublets) are mislabeled as singlet, so to speak. scDblFinder therefore iteratively retrains the classifier, each time excluding from the training the (real) cells called as doublets in the previous step (as well as unidentifiable artificial doublets). The number of steps being controlled by the iter parameter (in our experience, 2 or 3 is optimal).

This score is available in the output, in the scDblFinder.score colData column, and can be interpreted as a probability. If the data is multi-sample, a single model is trained for all samples.

1.3.6 Thresholding

Rather than thresholding on some arbitrary cutoff of the score, scDblFinder uses the expected number of doublets in combination to the misclassification rate to establish a threshold. Unless it is manually given through the dbr argument, the expected doublet rate is first estimated using the empirical rule of thumb applicable to 10X data, namely roughly 1% per 1000 cells captured (so with 5000 cells, (0.01*5)*5000 = 250 doublets, and the more cells you capture the higher the chance of creating a doublet). If samples were specified, and if the dbr is automatically calculated, thresholding is performed separately across samples.

Thresholding then tries to simultaneously minimize: 1) the classification error (in terms of the proportion of known doublets below the threshold) and 2) the deviation from the expected number of doublets among real cells (as a ratio of the total number of expected doublets within the range determined by dbr.sd, and adjusted for homotypic doublets). This means that, if you have no idea about the doublet rate, setting dbr.sd=1 will make the threshold depend entirely on the misclassification rate.

1.3.7 Doublet origins and enrichments

If artificial doublets are generated between clusters, it is sometimes possible to call the most likely origin (in terms of the combination of clusters) of a given putative real doublet. We observed that at least one of the two composing cell is typically recognized, but that both are seldom correctly recognized, owing to the sometimes small relative contribution of one of the two original cells. This information is provided through the scDblFinder.mostLikelyOrigin column of the output (and the scDblFinder.originAmbiguous column indicates whether this origin is ambiguous or rather clear). This, in turn, allows us to identify enrichment over expectation for specific kinds of doublets. Some statistics on each combination of clusters are saved in metadata(sce)$scDblFinder.stats, and the plotDoubletMap function can be used to visualize enrichments. In addition, two frameworks are offered for testing the significance of enrichments:

The clusterStickiness function tests whether each cluster forms more doublet than would be expected given its abundance, by default using a single quasi-binomial model fitted across all doublet types.
The doubletPairwiseEnrichment function separately tests whether each specific doublet type (i.e. combination of clusters) is more abundant than expected, by default using a poisson model.

1.4 Some important parameters

scDblFinder has a fair number of parameters governing the preprocessing, generation of doublets, classification, etc. (see ?scDblFinder). Here we describe just a few of the most important ones.

1.4.1 Expected proportion of doublets

The expected proportion of doublets has no impact on the density of artificial doublets in the neighborhood, but impacts the classifier’s score and, especially, where the cutoff will be placed. It is specified through the dbr parameter and the dbr.sd parameter (the latter specifies a +/- range around dbr within which the deviation from dbr will be considered null). For 10x data, the more cells you capture the higher the chance of creating a doublet, and Chromium documentation indicates a doublet rate of roughly 1% per 1000 cells captures (so with 5000 cells, (0.01*5)*5000 = 250 doublets), and the default expected doublet rate will be set to this value (with a default standard deviation of 0.015). Note however that different protocols may create considerably more doublets, and that this should be updated accordingly. If you have unsure about the doublet rate, you might consider increasing dbr.sd, so that it is estimated mostl/purely from the misclassification error.

1.5 Frequently-asked questions

1.5.1 I’m getting way too many doublets called - what’s going on?

Then you most likely have a wrong doublet rate. If you did not provide it (dbr argument), the doublet rate will be calculated automatically using expected doublet rates from 10x, meaning that the more cells captured, the higher the doublet rates. If you have reasons to think that this is not applicable to your data, set the dbr manually.

The most common cause for an unexpectedly large proportion of doublets is if you have a multi-sample dataset and did not split by samples. scDblFinder will think that the data is a single capture with loads of cells, and hence with a very high doublet rate. Splitting by sample should solve the issue.

Also note that, although 10X-like data tends to have roughly 1% per 1000 cells captured, the determining factor for doublet formation is the number of cells inserted into the machine. If for some reason your recovery rate is lower than expected, you might have a higher doublet rate than you’d expect from the captured and called cells (in other words, it would be preferable to say that the doublet rate is roughly 0.6% per 1000 cells put into the machine, where 0.6 is the recovery rate). In such circumstances, scDblFinder typically sets the thresholds correctly nevertheless. This is because the thresholding tries to minimize both the deviation from the expected number of doublets and the misclassification (i.e. of artificial doublets), meaning that the effective (i.e. final) doublet rate will differ from the given one. scDblFinder also considers false positives to be less problematic than false negatives. You can reduce to some degree the deviation from the input doublet rate by setting dbr.sd=0.

1.5.2 Should I use the cluster-based doublet generation or not?

Both approaches perform very similarly overall in benchmarks (see the scDblFinder paper). If your data is very clearly segregated into clusters, or if you are interested in the origin of the doublets, the cluster-based approach is preferable. This will also enable a more accurate accounting of homotypic doublets, and therefore a slightly better thresholding. Otherwise, and especially if your data does not segregate very clearly into clusters, the random approach (e.g. clusters=FALSE, the default) is preferable.

1.5.3 The clusters don’t make any sense!

If you ran scDblFinder on a multi-sample dataset and did not provide the cluster labels, then the labels are sample-specific (meaning that label ‘1’ in one sample might have nothing to do with label ‘1’ in another), and plotting them on a tSNE will look like they do not make sense. For this reason, when running multiple samples we recommend to first cluster all samples together (for example using sce$cluster <- fastcluster(sce)) and then provide the clusters to scDblFinder.

1.5.4 ‘Size factors should be positive’ error

You will get this error if you have some cells that have zero reads (or a very low read count, leading to zero after feature selection). After filtering out these cells the error should go away.

1.5.5 How can I make this reproducible?

Because it relies on the partly random generation of artificial doublets, running scDblFinder multiple times on the same data will yield slightly different results. You can ensure reproducibility using set.seed(), however this will not be sufficient when processing multiple samples (i.e. using the samples argument – even without multithreading!). In such case, the seed needs to be passed to the BPPARAMs:

bp <- MulticoreParam(3, RNGseed=1234)
sce <- scDblFinder(sce, clusters="cluster", samples="sample", BPPARAM=bp)

Similarly, when processing the samples serially, use SerialParam(RNGseed = seed).

(Note that in BiocParallel versions <1.28, one had in addition to explicitly start the cluster before the run using bpstart(bp), and then bpstop(bp) after scDblFinder.)

1.5.6 Can I use this in combination with Seurat or other tools?

If the input SCE already contains a logcounts assay or a reducedDim slot named ‘PCA’, scDblFinder will use them for the clustering step. In addition, a clustering can be manually given using the clusters argument of scDblFinder(). In this way, seurat clustering could for instance be used to create the artificial doublets (see ?Seurat::as.SingleCellExperiment.Seurat for conversion to SCE). For example, assuming as Seurat object se, the following could be done:

sce <- scDblFinder(GetAssayData(se, slot="counts"), clusters=Idents(se))
# port the resulting scores back to the Seurat object:
se$scDblFinder.score <- sce$scDblFinder.score

After artificial doublets generation, the counts of real and artificial cells must then be reprocessed (i.e. normalization and PCA) together, which is performed internally using scater. If you wish this step to be performed differently, you may provide your own function for doing so (see the processing argument in ?scDblFinder). We note, however, that the impact of variations of this step on doublet detection is rather mild. In fact, not performing any normalization at all for instance decreases doublet identification accuracy, but by rather little.

For example, the following code would enable the internal use of sctransform:

# assuming `x` is the count matrix:
nfeatures <- 1000
sce <- SingleCellExperiment(list(counts=x))
# sctransform on real cells:
vst1 <- sctransform::vst(counts(sce), n_cells=min(ncol(sce),5000), verbosity=0)
sce <- sce[row.names(vst1$y),]
logcounts(sce) <- vst1$y
hvg <- row.names(sce)[head(order(vst1$gene_attr$residual_variance, decreasing=TRUE), nfeatures)]

# define a processing function that scDblFinder will use on the real+artificial doublets;
# the input should be a count matrix and the number of dimensions, and the output a PCA matrix
  
myfun <- function(e, dims){
  # we use the thetas calculated from the first vst on real cells
  e <- e[intersect(row.names(e), row.names(vst1$model_pars_fit[which(!is.na(vst1$model_pars_fit[,"theta"])),])),]
  vst2 <- sctransform::vst(e, n_cells=min(ncol(e),5000), method="nb_theta_given", 
                           theta_given=vst1$model_pars_fit[row.names(e),"theta"],
                           min_cells=1L, verbosity=0)
  scater::calculatePCA(vst2$y, ncomponents=dims)
}
  
sce <- scDblFinder(sce, processing=myfun, nfeatures=hvg)

Note however that this did not generally lead to improved performance – but rather decreased on most benchmark datasets, in fact (see comparison in this issue).

1.5.7 Can this be used with scATACseq data?

Yes, see the scATAC vignette specifically on this topic.

1.5.8 Should I run QC cell filtering before or after doublet detection?

The input to scDblFinder should not include empty droplets, and it might be necessary to remove cells with a very low coverage (e.g. <200 reads) to avoid errors. Further quality filtering should be performed downstream of doublet detection, for two reasons: 1. the default expected doublet rate is calculated on the basis of the cells given, and if you excluded a lot of cells as low quality, scDblFinder might think that the doublet rate should be lower than it is. 2. kicking out all low quality cells first might hamper our ability to detect doublets that are formed by the combination of a good quality cell with a low-quality one. This being said, these are mostly theoretical grounds, and unless your QC filtering is very stringent (and it shouldn’t be!), it’s unlikely to make a big difference.

1.5.9 Can I combine this method with others?

Of course it is always possible to run multiple methods and combine the results. In our benchmark, the combination of scDblFinder with DoubletFinder, for instance, did yield an improvement in most (though not all) datasets (see the results here), although of a small magnitude. The simplest way is to do an average of the scores (assuming that the scores are on a similar scale, and that a higher score has the same interpretation across methods), which for instance gave similar results to using a Fisher p-value combination on 1-score (interpreted as a probability).

Session information

sessionInfo()

## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] bluster_1.10.0              scDblFinder_1.14.0         
##  [3] scater_1.28.0               ggplot2_3.4.2              
##  [5] scran_1.28.0                scuttle_1.10.0             
##  [7] ensembldb_2.24.0            AnnotationFilter_1.24.0    
##  [9] GenomicFeatures_1.52.0      AnnotationDbi_1.62.0       
## [11] scRNAseq_2.13.0             SingleCellExperiment_1.22.0
## [13] SummarizedExperiment_1.30.0 Biobase_2.60.0             
## [15] GenomicRanges_1.52.0        GenomeInfoDb_1.36.0        
## [17] IRanges_2.34.0              S4Vectors_0.38.0           
## [19] BiocGenerics_0.46.0         MatrixGenerics_1.12.0      
## [21] matrixStats_0.63.0          BiocStyle_2.28.0           
## 
## loaded via a namespace (and not attached):
##   [1] jsonlite_1.8.4                magrittr_2.0.3               
##   [3] magick_2.7.4                  ggbeeswarm_0.7.1             
##   [5] farver_2.1.1                  rmarkdown_2.21               
##   [7] BiocIO_1.10.0                 zlibbioc_1.46.0              
##   [9] vctrs_0.6.2                   memoise_2.0.1                
##  [11] Rsamtools_2.16.0              DelayedMatrixStats_1.22.0    
##  [13] RCurl_1.98-1.12               htmltools_0.5.5              
##  [15] progress_1.2.2                AnnotationHub_3.8.0          
##  [17] curl_5.0.0                    BiocNeighbors_1.18.0         
##  [19] xgboost_1.7.5.1               sass_0.4.5                   
##  [21] bslib_0.4.2                   cachem_1.0.7                 
##  [23] GenomicAlignments_1.36.0      igraph_1.4.2                 
##  [25] mime_0.12                     lifecycle_1.0.3              
##  [27] pkgconfig_2.0.3               rsvd_1.0.5                   
##  [29] Matrix_1.5-4                  R6_2.5.1                     
##  [31] fastmap_1.1.1                 GenomeInfoDbData_1.2.10      
##  [33] shiny_1.7.4                   digest_0.6.31                
##  [35] colorspace_2.1-0              dqrng_0.3.0                  
##  [37] irlba_2.3.5.1                 ExperimentHub_2.8.0          
##  [39] RSQLite_2.3.1                 beachmat_2.16.0              
##  [41] labeling_0.4.2                filelock_1.0.2               
##  [43] fansi_1.0.4                   httr_1.4.5                   
##  [45] compiler_4.3.0                bit64_4.0.5                  
##  [47] withr_2.5.0                   BiocParallel_1.34.0          
##  [49] viridis_0.6.2                 DBI_1.1.3                    
##  [51] highr_0.10                    biomaRt_2.56.0               
##  [53] MASS_7.3-59                   rappdirs_0.3.3               
##  [55] DelayedArray_0.26.0           rjson_0.2.21                 
##  [57] tools_4.3.0                   vipor_0.4.5                  
##  [59] beeswarm_0.4.0                interactiveDisplayBase_1.38.0
##  [61] httpuv_1.6.9                  glue_1.6.2                   
##  [63] restfulr_0.0.15               promises_1.2.0.1             
##  [65] grid_4.3.0                    Rtsne_0.16                   
##  [67] cluster_2.1.4                 generics_0.1.3               
##  [69] gtable_0.3.3                  data.table_1.14.8            
##  [71] hms_1.1.3                     BiocSingular_1.16.0          
##  [73] ScaledMatrix_1.8.0            metapod_1.8.0                
##  [75] xml2_1.3.3                    utf8_1.2.3                   
##  [77] XVector_0.40.0                ggrepel_0.9.3                
##  [79] BiocVersion_3.17.1            pillar_1.9.0                 
##  [81] stringr_1.5.0                 limma_3.56.0                 
##  [83] later_1.3.0                   dplyr_1.1.2                  
##  [85] BiocFileCache_2.8.0           lattice_0.21-8               
##  [87] rtracklayer_1.60.0            bit_4.0.5                    
##  [89] tidyselect_1.2.0              locfit_1.5-9.7               
##  [91] Biostrings_2.68.0             knitr_1.42                   
##  [93] gridExtra_2.3                 bookdown_0.33                
##  [95] ProtGenerics_1.32.0           edgeR_3.42.0                 
##  [97] xfun_0.39                     statmod_1.5.0                
##  [99] stringi_1.7.12                lazyeval_0.2.2               
## [101] yaml_2.3.7                    evaluate_0.20                
## [103] codetools_0.2-19              tibble_3.2.1                 
## [105] BiocManager_1.30.20           cli_3.6.1                    
## [107] xtable_1.8-4                  munsell_0.5.0                
## [109] jquerylib_0.1.4               Rcpp_1.0.10                  
## [111] dbplyr_2.3.2                  png_0.1-8                    
## [113] XML_3.99-0.14                 parallel_4.3.0               
## [115] ellipsis_0.3.2                blob_1.2.4                   
## [117] prettyunits_1.1.1             sparseMatrixStats_1.12.0     
## [119] bitops_1.0-7                  viridisLite_0.4.1            
## [121] scales_1.2.1                  purrr_1.0.1                  
## [123] crayon_1.5.2                  rlang_1.1.0                  
## [125] cowplot_1.1.1                 KEGGREST_1.40.0

scDblFinder

25 April 2023

Abstract

Package

Contents