# 1 Introduction

CEllular Latent Dirichlet Allocation (celda), is among multiple approaches to cluster cells from single-cell RNA-seq data into discrete sub-populations. In most cases, annotating a concise set of important features for distinguishing these sub-populations is not a trivial task.

In this vignette we will demonstrate the implementation of a multiclass decision tree approach to simultaneously sort and annotate cell cluster label estimations by generating a sequence of univariate rules for each cluster.

The procedure has two main deviations from simple multiclass decision tree procedures. First, at each split cells from the same cluster are never separated during tree building. Instead cells from the same population are moved to one-side of a particular split based on majority voting. Second, each cluster split can be determined by one of two heuristics, as follows…

1. Find univariate rules that accurately distinguish individual cell clusters, referred to as a one-off split.
2. If no sufficient one-off split exists, the tree finds thes best more ballanced split that segregates the clusters into two sets of comparatively similar sub-populations.

# 2 Installation

celda can be installed from Bioconductor:

if (!requireNamespace("BiocManager", quietly=TRUE)){
install.packages("BiocManager")}
BiocManager::install("celda")

The package can be loaded using the library command.

library(celda)

Complete list of help files are accessible using the help command with a package option.

help(package = celda)

To see the latest updates and releases or to post a bug, see our GitHub page at https://github.com/campbio/celda. To ask questions about running celda, post a thread on Bioconductor support site at https://support.bioconductor.org/.

# 3 Read in data and run celda_CG()

First, we will read in a single-cell data set and use celda_CG() to cluster the genes and samples. The M3DExampleData is a subset of single-cell RNA-Seq expression data of r mouse 2-cell embryos and blastocysts, originally generated by Deng et al. (2014). In this example we will use 500 genes to generate 10 sample and 10 gene clusters using celda_CG(). Note, that these parameters are not based on knowledge of these data, but rather just for running a speedy example.

library(M3DExampleData)

# 5 Generate celda decision tree

findMarkers allows for different parameter options. The optimal combination of these parameters is generally analysis specific and may require some trial and error. Parameters include:

• oneoffMetric A selection for one of two possible metrics for evaluating good one-off splits for segregating a single cluster from every other cluster based on up-regulation of that cluster alone. The performance of either is a value between 0 and 1. The two options include:
• modified F1 - Finds the univariate split that maximizes the harmonic mean of the sensitivity and precision of the up-regulated cluster, as well as the minimum cluster-specific sensitivity across all down-regulated clusters.
• pairwise AUC - Finds the univariate split for the minimum pairwise AUC between the up-regulated cluster and every down-regulated cluster, individually.
• threshold The threshold for determining a good one-off split. Generally modified F1 is more conservative, so the using the same threshold for either one-off metric will result in equal or fewer one-off splits for modified F1 compared to pairwise AUC.
• reuseFeatures A logical value. Whether a feature can be used as a rule for the same cluster more than once. Note, that even when reuseFeatures = FALSE a feature is still allowed to be used more than once in the decision tree. This happens when the same feature is used for different sets of cluster rules after those sets of clusters have diverged.
• reuseFeatures A logical value. Often times one cluster will never include a positive marker in its set of rules, i.e. it is only sorted by have lower values for each feature in the tree. This option forces a positive marker at its terminal split.
• consecutiveOneoff A logical value. Whether one-off splits can be used at two consecutive levels in the tree. Note, that even when consecutiveOneoff = FALSE, more then one good one-off splits will be assessed at the same level and used if determined to be a good split.
DecTree <- findMarkers(features,
class,
oneoffMetric = "modified F1",
threshold = 0.95,
reuseFeatures = FALSE,
altSplit = TRUE,
consecutiveOneoff = FALSE)

## 5.1findMarkers output

The findMarkers output is a named list of four elements

• rules A named list with one data.frame for every label. Each data.frame has five columns and gives the set of rules for distinguishing each label.
• feature - Feature identifier.
• direction - Relationship to feature value, -1 if less than, 1 if greater than.
• value - The feature value which defines the decision boundary
• stat - The performance value returned by the splitting metric for this split.
• statUsed - Which performance metric was used. “IG” if information gain and “OO” if one-off.
• level - The level of the tree at which is rule was defined. 1 is the level of the first split of the tree.
• dendro A dendrogram object of the decision tree output
• performance A named list denoting the training performance of the

• accuracy - (number correct/number of samples) for the whole set of samples.
• balAcc - Mean sensitivity across all labels.
• meanPrecision - Mean precision across all labels.
• correct - Number of correct predictions of each label.
• sizes Number of actual counts of each label.
• sensitivity - Sensitivity of the prediction of each label.
• precision - Precision of the prediction of each label.

# 6 Plot decision tree

plotDendro generates a dendrogram as a ggplot object. The addSensPrec argument determines whether to print cluster-specific sensitivities and precisions under the leaf labels for each cluster. leafSize, boxSize, boxColor are all stylistic choices to make the plot labels legible.

plotDendro(DecTree,
classLabel = NULL,
leafSize = 24,
boxSize = 7,
boxColor = "black")

In the plot, the feature(s) used for splits determined by the one-off metric are printed above the cluster labels for which they are markers. Alternatively, the features used for the balanced splits are centered above the two sets of clusters defined by that split. The up-regulated set of clusters for a balanced split are one the right side of that split. The size, sensitivitie and precision of each class are printed below the leaf labels, respectively.

# 7 Plot decision tree highlight rules for specific cluster

In plotDendro the classLabel argument may used to only print the sequence of relevant rules for a given split.

plotDendro(DecTree,
classLabel = "1",
leafSize = 15,
boxSize = 7,
boxColor = "black")

# 8 Get label estimates from features matrix

findMarkers performs label estimation of each sample in the training set automatically. You can use the set of rules generated by findMarkers to predict the labels on an independent feature matrix using getDecisions.

head(getDecisions(DecTree\$rules, features))
## GSM1112603_early2cell_0r.1_expression GSM1112604_early2cell_0r.2_expression
##                                   "7"                                   "7"
##  GSM1112605_early2cell_1.1_expression  GSM1112606_early2cell_1.2_expression
##                                   "7"                                   "7"
##  GSM1112607_early2cell_2.1_expression  GSM1112608_early2cell_2.2_expression
##                                   "7"                                   "7"

# 9 Create decision tree with meta clusters

If you have a priori understanding of sub-groups of your cluster labels, you can ensure that these sub-groups are not separated up-stream in the tree by using the optional cellTypes argument. This is just a named list of labels in your class vector. For example, if we knew that clusters, 4, 5, and 1 were of the same subtype of clusters, we could do the following…

# Run with a hierarchichal split
cellTypes <- list(metaLabel = c("4", "5", "1"))
DecTreeMeta <- findMarkers(features,
class,
cellTypes,
oneoffMetric = "modified F1",
threshold = 1,
reuseFeatures = F,
consecutiveOneoff = FALSE)
plotDendro(DecTreeMeta)

# 10 Session Information

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] M3DExampleData_1.11.0 celda_1.2.0           BiocStyle_2.14.0
##
## loaded via a namespace (and not attached):
##  [1] Biobase_2.46.0              ggdendro_0.1-20
##  [3] viridis_0.5.1               httr_1.4.1
##  [5] viridisLite_0.3.0           foreach_1.4.7
##  [7] RcppParallel_4.4.4          assertthat_0.2.1
##  [9] BiocManager_1.30.9          stats4_3.6.1
## [11] GenomeInfoDbData_1.2.2      yaml_2.2.0
## [13] ggrepel_0.8.1               pillar_1.4.2
## [15] lattice_0.20-38             glue_1.3.1
## [17] pROC_1.15.3                 RcppEigen_0.3.3.5.0
## [19] digest_0.6.22               GenomicRanges_1.38.0
## [21] RColorBrewer_1.1-2          XVector_0.26.0
## [23] colorspace_1.4-1            enrichR_2.1
## [25] htmltools_0.4.0             Matrix_1.2-17
## [27] plyr_1.8.4                  pkgconfig_2.0.3
## [29] bookdown_0.14               zlibbioc_1.32.0
## [31] purrr_0.3.3                 scales_1.0.0
## [33] Rtsne_0.15                  BiocParallel_1.20.0
## [35] tibble_2.1.3                combinat_0.0-8
## [37] IRanges_2.20.0              ggplot2_3.2.1
## [39] withr_2.1.2                 SummarizedExperiment_1.16.0
## [41] BiocGenerics_0.32.0         lazyeval_0.2.2
## [43] magrittr_1.5                crayon_1.3.4
## [45] evaluate_0.14               doParallel_1.0.15
## [47] MASS_7.3-51.4               MAST_1.12.0
## [49] tools_3.6.1                 data.table_1.12.6
## [51] matrixStats_0.55.0          stringr_1.4.0
## [53] S4Vectors_0.24.0            munsell_0.5.0
## [55] DelayedArray_0.12.0         compiler_3.6.1
## [57] GenomeInfoDb_1.22.0         rlang_0.4.1
## [59] grid_3.6.1                  RCurl_1.95-4.12
## [61] iterators_1.0.12            rjson_0.2.20
## [63] SingleCellExperiment_1.8.0  labeling_0.3
## [65] bitops_1.0-6                rmarkdown_1.16
## [67] gtable_0.3.0                codetools_0.2-16
## [69] abind_1.4-5                 reshape2_1.4.3
## [71] R6_2.4.0                    gridExtra_2.3
## [73] knitr_1.25                  dplyr_0.8.3
## [75] uwot_0.1.4                  dendextend_1.12.0
## [77] stringi_1.4.3               parallel_3.6.1
## [79] Rcpp_1.0.2                  MCMCprecision_0.3.9
## [81] tidyselect_0.2.5            xfun_0.10