1 Introduction

UMAP is commonly used in scRNA-seq data analysis as a visualization tool projecting high dimensional data onto 2 dimensions to visualize cell clustering. However, UMAP is prone to showing spurious clustering and distorting distances (Chari, Banerjee, and Pachter 2021). Moreover, UMAP shows clustering that seems to correspond to graph-based clusters from Louvain and Leiden because the k nearest neighbor graph is used in both clustering and UMAP. We have developed concordex as a quantitative alternative to UMAP cluster visualization without the misleading problems of UMAP. This package is the R implementation of the original Python command line tool.

In a nutshell, concordex finds the proportion of cells among the k nearest neighbors of each cell with the same cluster or label as the cell itself. This is computed across all labels and the average of all labels is returned as a metric that indicates the quality of clustering. To see if this is significant, the labels are permuted to estimate a null distribution and the actual observed value is compared to the simulated values. If the clustering separates cells well, then the observed value should be much higher than the simulated values, i.e. the neighborhood of each cell is more dominated by cells of the same label as the cell of interest than by chance.

library(concordexR)
library(TENxPBMCData)
library(BiocNeighbors)
library(bluster)
library(scater)
library(patchwork)
library(ggplot2)
theme_set(theme_bw())

2 Preprocessing

In this vignette, we demonstrate the usage of concordex on a human peripheral blood mononuclear cells (PBMC) scRNA-seq dataset from 10X Genomics. The data is loaded as a SingleCellExperiment object.

sce <- TENxPBMCData("pbmc3k")
#> see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation
#> loading from cache

Here we plot the standard QC metrics: total number of UMIs detected per cell (nCounts), number of genes detected (nGenes), and percentage of UMIs from mitochondrially encoded genes (pct_mito).

sce$nCounts <- colSums(counts(sce))
sce$nGenes <- colSums(counts(sce) > 0)
mito_inds <- grepl("^MT-", rowData(sce)$Symbol_TENx)
sce$pct_mito <- colSums(counts(sce)[mito_inds,])/sce$nCounts * 100
plotColData(sce, "nCounts") +
  plotColData(sce, "nGenes") +
  plotColData(sce, "pct_mito")

p1 <- plotColData(sce, x = "nCounts", y = "nGenes") +
  geom_density2d()
p2 <- plotColData(sce, x = "nCounts", y = "pct_mito") +
  geom_density2d()
p1 + p2

Remove the outliers and cells with high percentage of mitochondrial counts as the high percentage is not expected biologically from the cell type:

sce <- sce[, sce$nCounts < 10000 & sce$pct_mito < 8]
sce <- sce[rowSums(counts(sce)) > 0,]

Then normalize the data:

sce <- logNormCounts(sce)

3 Graph based clustering in PCA space

For simplicity, the top 500 highly variable genes are used to perform PCA:

sce <- runPCA(sce, ncomponents = 30, ntop = 500, scale = TRUE)

See the number of PCs to use later from the elbow plot:

plot(attr(reducedDim(sce, "PCA"), "percentVar"), ylab = "Percentage of variance explained")

Percentage of variance explained drops sharply from PC1 to PC5, and definitely levels off after PC10, so we use the top 10 PCs for clustering here. The graph based Leiden clustering uses a k nearest neighbor graph. For demonstration here, we use k = 10.

set.seed(29)
sce$cluster <- clusterRows(reducedDim(sce, "PCA")[,seq_len(10)],
                           NNGraphParam(k = 10, cluster.fun = "leiden",
                                        cluster.args = list(
                                          objective_function = "modularity"
                                        )))

See what the clusters look like in PCA space:

plotPCA(sce, color_by = "cluster", ncomponents = 4)
#> Warning in data.frame(gg1$all, df_to_plot[, -reddim_cols]): row names were
#> found from a short variable and have been discarded

Some of the clusters seem well-separated along the first 4 PCs.

Since UMAP is commonly used to visualize the clusters, we plot UMAP here although we don’t recommend UMAP because it’s prone to showing spurious clusters and distorting distances. UMAP also uses a k nearest neighbor graph, and we use the same k = 10 here:

sce <- runUMAP(sce, dimred = "PCA", n_dimred = 10, n_neighbors = 10)
plotUMAP(sce, color_by = "cluster")

For the most part, the clusters are clearly separated on UMAP.

4 Enter concordex

Since UMAP is prone to showing spurious clusters, we’ll see what the concordex metric says about the clustering and whether it agrees with UMAP visualization. Here we explicitly obtain the k nearest neighbor graph, as clustering and UMAP above did not store the graph itself.

g <- findKNN(reducedDim(sce, "PCA")[,seq_len(10)], k = 10)

The result here is a list of two n (number of cell) by k matrices. The first is the indices of each cell’s neighbors, as in an adjacency list that can be matrix here due to the fixed number of neighbors, and the second is the distances between each cell and its neighbors. For concordex, only the first matrix is relevant. An adjacency matrix, either sparse of dense, as stored in the Seurat object, can also be used. Here the cluster labels are permuted 100 times.

res <- calculateConcordex(g$index, labels = sce$cluster, k = 10, n.iter = 100)

The results can be visualized in plots, which is implemented in this R package but not in the Python package. The actual concordex value can be compared to the simulated values where the latter is visualized in a density plot:

plotConcordexSim(res)

Here the actual value is much higher than the simulated values, indicating that the cluster labels do reflect actual clusters well. The value itself is the proportion of cells with each label in the neighborhood of other cells with the same label, averaged over all labels.

To aid interpretation, the ratio of the observed value to the average simulated value is also returned:

res$concordex_ratio
#> [1] 8.367525

A number greater than 1 means that cells are more likely to have neighbors with the same label than expected by chance when the labels are completely randomly assigned, and the value of 7.7 here means that the clusters are really good.

While the value is an average over all clusters, the matrix with the values for each cluster is also returned by default, and can be visualized in a heatmap, as a clustering diagnostic:

heatConcordex(res, angle_col = 0, cluster_rows = FALSE, cluster_cols = FALSE)

This heatmap also indicates good clustering, as almost all neighbors of cells from each of the cluster labels have the same label, on the diagonal. Off diagonal entries should be interpreted as such: (i,j) means the proportion of cells with label i in the neighborhood of cells with label j. Off diagonal entries means ambiguity between labels as similar cells get different labels. As overplotting easily happens on the UMAP plot, this heatmap shows clustering quality more unambiguously than UMAP.

5 Session info

sessionInfo()
#> R version 4.4.0 beta (2024-04-15 r86425)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] patchwork_1.2.0             scater_1.32.0              
#>  [3] ggplot2_3.5.1               scuttle_1.14.0             
#>  [5] bluster_1.14.0              BiocNeighbors_1.22.0       
#>  [7] TENxPBMCData_1.21.0         HDF5Array_1.32.0           
#>  [9] rhdf5_2.48.0                DelayedArray_0.30.0        
#> [11] SparseArray_1.4.0           S4Arrays_1.4.0             
#> [13] abind_1.4-5                 Matrix_1.7-0               
#> [15] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
#> [17] Biobase_2.64.0              GenomicRanges_1.56.0       
#> [19] GenomeInfoDb_1.40.0         IRanges_2.38.0             
#> [21] S4Vectors_0.42.0            BiocGenerics_0.50.0        
#> [23] MatrixGenerics_1.16.0       matrixStats_1.3.0          
#> [25] concordexR_1.4.0            BiocStyle_2.32.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] DBI_1.2.2                 gridExtra_2.3            
#>  [3] rlang_1.1.3               magrittr_2.0.3           
#>  [5] compiler_4.4.0            RSQLite_2.3.6            
#>  [7] DelayedMatrixStats_1.26.0 png_0.1-8                
#>  [9] vctrs_0.6.5               pkgconfig_2.0.3          
#> [11] crayon_1.5.2              fastmap_1.1.1            
#> [13] magick_2.8.3              dbplyr_2.5.0             
#> [15] XVector_0.44.0            labeling_0.4.3           
#> [17] utf8_1.2.4                rmarkdown_2.26           
#> [19] ggbeeswarm_0.7.2          UCSC.utils_1.0.0         
#> [21] tinytex_0.50              purrr_1.0.2              
#> [23] bit_4.0.5                 xfun_0.43                
#> [25] zlibbioc_1.50.0           cachem_1.0.8             
#> [27] beachmat_2.20.0           jsonlite_1.8.8           
#> [29] blob_1.2.4                highr_0.10               
#> [31] rhdf5filters_1.16.0       Rhdf5lib_1.26.0          
#> [33] BiocParallel_1.38.0       irlba_2.3.5.1            
#> [35] parallel_4.4.0            cluster_2.1.6            
#> [37] R6_2.5.1                  bslib_0.7.0              
#> [39] RColorBrewer_1.1-3        jquerylib_0.1.4          
#> [41] Rcpp_1.0.12               bookdown_0.39            
#> [43] knitr_1.46                FNN_1.1.4                
#> [45] igraph_2.0.3              tidyselect_1.2.1         
#> [47] viridis_0.6.5             yaml_2.3.8               
#> [49] codetools_0.2-20          curl_5.2.1               
#> [51] lattice_0.22-6            tibble_3.2.1             
#> [53] withr_3.0.0               KEGGREST_1.44.0          
#> [55] evaluate_0.23             isoband_0.2.7            
#> [57] BiocFileCache_2.12.0      ExperimentHub_2.12.0     
#> [59] Biostrings_2.72.0         pillar_1.9.0             
#> [61] BiocManager_1.30.22       filelock_1.0.3           
#> [63] generics_0.1.3            BiocVersion_3.19.1       
#> [65] sparseMatrixStats_1.16.0  munsell_0.5.1            
#> [67] scales_1.3.0              glue_1.7.0               
#> [69] pheatmap_1.0.12           tools_4.4.0              
#> [71] AnnotationHub_3.12.0      ScaledMatrix_1.12.0      
#> [73] cowplot_1.1.3             grid_4.4.0               
#> [75] AnnotationDbi_1.66.0      colorspace_2.1-0         
#> [77] GenomeInfoDbData_1.2.12   beeswarm_0.4.0           
#> [79] BiocSingular_1.20.0       vipor_0.4.7              
#> [81] rsvd_1.0.5                cli_3.6.2                
#> [83] rappdirs_0.3.3            fansi_1.0.6              
#> [85] viridisLite_0.4.2         dplyr_1.1.4              
#> [87] uwot_0.2.2                gtable_0.3.5             
#> [89] sass_0.4.9                digest_0.6.35            
#> [91] ggrepel_0.9.5             farver_2.1.1             
#> [93] memoise_2.0.1             htmltools_0.5.8.1        
#> [95] lifecycle_1.0.4           httr_1.4.7               
#> [97] mime_0.12                 MASS_7.3-60.2            
#> [99] bit64_4.0.5

References

Chari, Tara, Joeyta Banerjee, and Lior Pachter. 2021. “The Specious Art of Single-Cell Genomics.” bioRxiv.