if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("Dune")
We use a subset of the Allen Smart-Seq nuclei dataset. Run ?Dune::nuclei
for more details on pre-processing.
suppressPackageStartupMessages({
  library(RColorBrewer)
  library(dplyr)
  library(ggplot2)
  library(tidyr)
  library(knitr)
  library(purrr)
  library(Dune)
})
data("nuclei", package = "Dune")
theme_set(theme_classic())
We have a dataset of \(1744\) cells, with the results from three clustering algorithms: Seurat3, Monocle3 and SC3. The Allen Institute also produced hand-picked cluster and subclass labels. Finally, we included the coordinates from a t-SNE representation, for visualization.
ggplot(nuclei, aes(x = x, y = y, col = subclass_label)) +
  geom_point()
We can also see how the three clustering algorithms initially partitioned the dataset:
walk(c("SC3", "Seurat", "Monocle"), function(clus_algo) {
  # Copy the labels of the current algorithm into a fixed column name for plotting
  df <- nuclei
  df$clus_algo <- nuclei[, clus_algo]
  p <- ggplot(df, aes(x = x, y = y, col = as.character(clus_algo))) +
    geom_point(size = 1.5) +
    # guides(color = FALSE) +
    labs(title = clus_algo, col = "clusters") +
    theme(legend.position = "bottom")
  print(p)
})
The adjusted Rand index (ARI) between each pair of methods can be computed and plotted.
plotARIs(nuclei %>% select(SC3, Seurat, Monocle))
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the Dune package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
As we can see, the ARI between the three methods is initially quite low.
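For reference, the ARI for a single pair of labelings can be computed by hand from their contingency table. The sketch below uses base R only. The function name `adjusted_rand_index` and the toy vectors are hypothetical and not part of Dune, and the degenerate case where both labelings put every cell in one cluster is not handled.

```r
# Adjusted Rand index between two label vectors (illustrative helper, not a Dune function)
adjusted_rand_index <- function(a, b) {
  stopifnot(length(a) == length(b))
  tab <- table(a, b)                            # contingency table of the two labelings
  sum_cells <- sum(choose(tab, 2))              # pairs of cells grouped together in both
  sum_rows  <- sum(choose(rowSums(tab), 2))     # pairs grouped together in a
  sum_cols  <- sum(choose(colSums(tab), 2))     # pairs grouped together in b
  total     <- choose(length(a), 2)             # all pairs of cells
  expected  <- sum_rows * sum_cols / total      # agreement expected by chance
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# Toy labelings: the same partition under different cluster names
labels_a <- c(1, 1, 2, 2, 3, 3)
labels_b <- c("x", "x", "y", "y", "z", "z")
adjusted_rand_index(labels_a, labels_b)  # 1: the ARI ignores cluster names
```

An ARI of 1 means the two partitions are identical up to relabeling, while values near 0 correspond to chance-level agreement.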
We can now try to merge clusters with the Dune function. At each step, the algorithm prints the clustering whose labels are being merged, followed by the pair of clusters that get merged.
merger <- Dune(clusMat = nuclei %>% select(SC3, Seurat, Monocle), verbose = TRUE)
## [1] "SC3" "21" "20"
## [1] "Monocle" "20" "4"
## [1] "SC3" "11" "12"
## [1] "SC3" "30" "28"
## [1] "SC3" "11" "24"
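To make a single merge step concrete, here is a toy sketch of why merging two clusters in one labeling can raise its agreement with another. It is not Dune's implementation. The vectors `sc3_like` and `seurat_like` are made up for illustration, and the example assumes the aricode package is installed (it appears in the session info at the end of this document).

```r
# Two toy labelings of six cells (hypothetical data, not taken from nuclei)
sc3_like    <- c(1, 1, 2, 2, 3, 3)   # three clusters
seurat_like <- c(1, 1, 1, 1, 2, 2)   # two clusters

# Agreement before any merging
ari_before <- aricode::ARI(sc3_like, seurat_like)

# Merge clusters 1 and 2 of the first labeling by giving them the same label
merged <- sc3_like
merged[merged == 2] <- 1

# Agreement after the merge: the two partitions are now identical
ari_after <- aricode::ARI(merged, seurat_like)

ari_after > ari_before  # TRUE
```

Here the first labeling over-splits a group that the second keeps whole, so merging the two sub-clusters makes the partitions coincide.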
The output from Dune is a list with five components:
names(merger)
## [1] "initialMat" "currentMat" "merges" "ImpMetric" "metric"
initialMat is the initial matrix of cluster labels. currentMat is the final matrix of cluster labels. merges is a matrix that recapitulates what has been printed above, ImpMetric lists the improvement in the metric over the merges, and metric records which metric was used to guide the merging.
We can now see how much the ARI has improved:
plotARIs(clusMat = merger$currentMat)
The methods now look much more similar, as can be expected.
We can also see how the number of clusters got reduced.
plotPrePost(merger)
For SC3, for example, we can visualize how the clusters got merged:
ConfusionPlot(merger$initialMat[, "SC3"], merger$currentMat[, "SC3"]) +
  labs(x = "Before merging", y = "After merging")
Finally, the ARItrend function plots how the mean ARI improves as pairs of clusters get merged.
ARItrend(merger)
sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Dune_1.16.0 purrr_1.0.2 knitr_1.46 tidyr_1.3.1
## [5] ggplot2_3.5.1 dplyr_1.1.4 RColorBrewer_1.1-3
##
## loaded via a namespace (and not attached):
## [1] SummarizedExperiment_1.34.0 gtable_0.3.5
## [3] xfun_0.43 bslib_0.7.0
## [5] Biobase_2.64.0 lattice_0.22-6
## [7] vctrs_0.6.5 tools_4.4.0
## [9] generics_0.1.3 stats4_4.4.0
## [11] parallel_4.4.0 tibble_3.2.1
## [13] fansi_1.0.6 highr_0.10
## [15] pkgconfig_2.0.3 Matrix_1.7-0
## [17] S4Vectors_0.42.0 lifecycle_1.0.4
## [19] GenomeInfoDbData_1.2.12 farver_2.1.1
## [21] compiler_4.4.0 progress_1.2.3
## [23] munsell_0.5.1 aricode_1.0.3
## [25] codetools_0.2-20 GenomeInfoDb_1.40.0
## [27] htmltools_0.5.8.1 sass_0.4.9
## [29] yaml_2.3.8 pillar_1.9.0
## [31] crayon_1.5.2 jquerylib_0.1.4
## [33] BiocParallel_1.38.0 cachem_1.0.8
## [35] DelayedArray_0.30.0 magick_2.8.3
## [37] abind_1.4-5 tidyselect_1.2.1
## [39] digest_0.6.35 stringi_1.8.3
## [41] labeling_0.4.3 fastmap_1.1.1
## [43] grid_4.4.0 colorspace_2.1-0
## [45] cli_3.6.2 SparseArray_1.4.0
## [47] magrittr_2.0.3 S4Arrays_1.4.0
## [49] utf8_1.2.4 withr_3.0.0
## [51] prettyunits_1.2.0 scales_1.3.0
## [53] UCSC.utils_1.0.0 rmarkdown_2.26
## [55] XVector_0.44.0 httr_1.4.7
## [57] matrixStats_1.3.0 hms_1.1.3
## [59] evaluate_0.23 GenomicRanges_1.56.0
## [61] IRanges_2.38.0 viridisLite_0.4.2
## [63] gganimate_1.0.9 rlang_1.1.3
## [65] Rcpp_1.0.12 glue_1.7.0
## [67] tweenr_2.0.3 BiocGenerics_0.50.0
## [69] jsonlite_1.8.8 R6_2.5.1
## [71] MatrixGenerics_1.16.0 zlibbioc_1.50.0