Contents

1 Introduction

CEllular Latent Dirichlet Allocation (celda), is among multiple approaches to cluster cells from single-cell RNA-seq data into discrete sub-populations. In most cases, annotating a concise set of important features for distinguishing these sub-populations is not a trivial task.

In this vignette we will demonstrate the implementation of a multiclass decision tree approach to simultaneously sort and annotate cell cluster label estimations by generating a sequence of univariate rules for each cluster.

The procedure has two main deviations from simple multiclass decision tree procedures. First, at each split cells from the same cluster are never separated during tree building. Instead cells from the same population are moved to one-side of a particular split based on majority voting. Second, each cluster split can be determined by one of two heuristics, as follows…

  1. Find univariate rules that accurately distinguish individual cell clusters, referred to as a one-off split.
  2. If no sufficient one-off split exists, the tree finds thes best more ballanced split that segregates the clusters into two sets of comparatively similar sub-populations.

2 Installation

celda can be installed from Bioconductor:

if (!requireNamespace("BiocManager", quietly=TRUE)){
    install.packages("BiocManager")}
BiocManager::install("celda")

The package can be loaded using the library command.

library(celda)

Complete list of help files are accessible using the help command with a package option.

help(package = celda)

To see the latest updates and releases or to post a bug, see our GitHub page at https://github.com/campbio/celda. To ask questions about running celda, post a thread on Bioconductor support site at https://support.bioconductor.org/.


3 Read in data and run celda_CG()

First, we will read in a single-cell data set and use celda_CG() to cluster the genes and samples. The M3DExampleData is a subset of single-cell RNA-Seq expression data of r mouse 2-cell embryos and blastocysts, originally generated by Deng et al. (2014). In this example we will use 500 genes to generate 10 sample and 10 gene clusters using celda_CG(). Note, that these parameters are not based on knowledge of these data, but rather just for running a speedy example.

library(M3DExampleData)
counts <- M3DExampleData::Mmus_example_list$data
# Subset 500 genes for fast clustering
counts <- as.matrix(counts[1501:2000, ])
# Cluster genes ans samples each into 10 modules
cm <- celda_CG(counts = counts, L = 10, K = 10, verbose = FALSE)

4 Format celda output for decision tree generation

The decision trees are generated and evaluated by findMarkers. findMarkers requires two arguments: features and class. features is a numeric matrix with a row for every variable (or feature) to be used in sorting and a column for every sample. In this example we will use a factorized matrix of the gene clusters as our feature matrix (Check ?factorizedMatrix for more details). class is a vector with a label assignment for every sample. Note, that neither features nor class may have any missing values.

# Get features matrix and cluster assignments
factorized <- factorizeMatrix(counts, cm)
features <- factorized$proportions$cell
class <- clusters(cm)$z

5 Generate celda decision tree

findMarkers allows for different parameter options. The optimal combination of these parameters is generally analysis specific and may require some trial and error. Parameters include:

DecTree <- findMarkers(features,
    class,
    oneoffMetric = "modified F1",
    threshold = 0.95,
    reuseFeatures = FALSE,
    altSplit = TRUE,
    consecutiveOneoff = FALSE)

5.1 findMarkers output

The findMarkers output is a named list of four elements

  • rules A named list with one data.frame for every label. Each data.frame has five columns and gives the set of rules for distinguishing each label.
    • feature - Feature identifier.
    • direction - Relationship to feature value, -1 if less than, 1 if greater than.
    • value - The feature value which defines the decision boundary
    • stat - The performance value returned by the splitting metric for this split.
    • statUsed - Which performance metric was used. “IG” if information gain and “OO” if one-off.
    • level - The level of the tree at which is rule was defined. 1 is the level of the first split of the tree.
  • dendro A dendrogram object of the decision tree output
  • performance A named list denoting the training performance of the

    • accuracy - (number correct/number of samples) for the whole set of samples.
    • balAcc - Mean sensitivity across all labels.
    • meanPrecision - Mean precision across all labels.
    • correct - Number of correct predictions of each label.
    • sizes Number of actual counts of each label.
    • sensitivity - Sensitivity of the prediction of each label.
    • precision - Precision of the prediction of each label.

6 Plot decision tree

plotDendro generates a dendrogram as a ggplot object. The addSensPrec argument determines whether to print cluster-specific sensitivities and precisions under the leaf labels for each cluster. leafSize, boxSize, boxColor are all stylistic choices to make the plot labels legible.

plotDendro(DecTree,
    classLabel = NULL,
    addSensPrec = TRUE,
    leafSize = 24,
    boxSize = 7,
    boxColor = "black")

In the plot, the feature(s) used for splits determined by the one-off metric are printed above the cluster labels for which they are markers. Alternatively, the features used for the balanced splits are centered above the two sets of clusters defined by that split. The up-regulated set of clusters for a balanced split are one the right side of that split. The size, sensitivitie and precision of each class are printed below the leaf labels, respectively.

7 Plot decision tree highlight rules for specific cluster

In plotDendro the classLabel argument may used to only print the sequence of relevant rules for a given split.

plotDendro(DecTree,
    classLabel = "1",
    addSensPrec = TRUE,
    leafSize = 15,
    boxSize = 7,
    boxColor = "black")

8 Get label estimates from features matrix

findMarkers performs label estimation of each sample in the training set automatically. You can use the set of rules generated by findMarkers to predict the labels on an independent feature matrix using getDecisions.

head(getDecisions(DecTree$rules, features))
## GSM1112603_early2cell_0r.1_expression GSM1112604_early2cell_0r.2_expression 
##                                   "7"                                   "7" 
##  GSM1112605_early2cell_1.1_expression  GSM1112606_early2cell_1.2_expression 
##                                   "7"                                   "7" 
##  GSM1112607_early2cell_2.1_expression  GSM1112608_early2cell_2.2_expression 
##                                   "7"                                   "7"

9 Create decision tree with meta clusters

If you have a priori understanding of sub-groups of your cluster labels, you can ensure that these sub-groups are not separated up-stream in the tree by using the optional cellTypes argument. This is just a named list of labels in your class vector. For example, if we knew that clusters, 4, 5, and 1 were of the same subtype of clusters, we could do the following…

# Run with a hierarchichal split
cellTypes <- list(metaLabel = c("4", "5", "1"))
DecTreeMeta <- findMarkers(features,
    class,
    cellTypes,
    oneoffMetric = "modified F1",
    threshold = 1,
    reuseFeatures = F,
    consecutiveOneoff = FALSE)
plotDendro(DecTreeMeta)

10 Session Information

sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] M3DExampleData_1.12.0 celda_1.2.4           BiocStyle_2.14.4     
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.46.0              ggdendro_0.1-20            
##  [3] viridis_0.5.1               httr_1.4.1                 
##  [5] viridisLite_0.3.0           foreach_1.4.7              
##  [7] RcppParallel_4.4.4          assertthat_0.2.1           
##  [9] BiocManager_1.30.10         stats4_3.6.2               
## [11] GenomeInfoDbData_1.2.2      ggrepel_0.8.1              
## [13] yaml_2.2.0                  pillar_1.4.3               
## [15] lattice_0.20-38             glue_1.3.1                 
## [17] pROC_1.16.1                 RcppEigen_0.3.3.7.0        
## [19] digest_0.6.23               GenomicRanges_1.38.0       
## [21] RColorBrewer_1.1-2          XVector_0.26.0             
## [23] colorspace_1.4-1            enrichR_2.1                
## [25] htmltools_0.4.0             Matrix_1.2-18              
## [27] plyr_1.8.5                  pkgconfig_2.0.3            
## [29] magick_2.2                  bookdown_0.17              
## [31] zlibbioc_1.32.0             purrr_0.3.3                
## [33] scales_1.1.0                Rtsne_0.15                 
## [35] BiocParallel_1.20.1         tibble_2.1.3               
## [37] combinat_0.0-8              farver_2.0.3               
## [39] IRanges_2.20.2              ggplot2_3.2.1              
## [41] withr_2.1.2                 SummarizedExperiment_1.16.1
## [43] BiocGenerics_0.32.0         lazyeval_0.2.2             
## [45] magrittr_1.5                crayon_1.3.4               
## [47] evaluate_0.14               doParallel_1.0.15          
## [49] MASS_7.3-51.5               MAST_1.12.0                
## [51] tools_3.6.2                 data.table_1.12.8          
## [53] lifecycle_0.1.0             matrixStats_0.55.0         
## [55] stringr_1.4.0               S4Vectors_0.24.3           
## [57] munsell_0.5.0               DelayedArray_0.12.2        
## [59] compiler_3.6.2              GenomeInfoDb_1.22.0        
## [61] rlang_0.4.2                 grid_3.6.2                 
## [63] RCurl_1.98-1.1              iterators_1.0.12           
## [65] rjson_0.2.20                SingleCellExperiment_1.8.0 
## [67] labeling_0.3                bitops_1.0-6               
## [69] rmarkdown_2.1               gtable_0.3.0               
## [71] codetools_0.2-16            abind_1.4-5                
## [73] reshape2_1.4.3              R6_2.4.1                   
## [75] gridExtra_2.3               knitr_1.27.2               
## [77] dplyr_0.8.3                 uwot_0.1.5                 
## [79] dendextend_1.13.2           stringi_1.4.5              
## [81] parallel_3.6.2              Rcpp_1.0.3                 
## [83] MCMCprecision_0.4.0         tidyselect_0.2.5           
## [85] xfun_0.12