1 Introduction

CHETAH (CHaracterization of cEll Types Aided by Hierarchical classification) is a package for cell type identification of single-cell RNA-sequencing (scRNA-seq) data.
A pre-print of the article describing CHETAH is available at bioRxiv.

Summary: Cell types are assigned by correlating the input data to a reference in a hierarchical manner. This creates the possibility of assignment to intermediate types if the data does not allow to fully classify to one of the types in the reference. CHETAH is built to work with scRNA-seq references, but will also work (with limited capabilities) with RNA-seq or micro-array reference datasets. So, to run CHETAH, you will only need:

1.1 At a glance

To run chetah on an input count matrix input_counts with t-SNE1 coordinates in input_tsne, and a reference count matrix ref_counts with celltypes vector ref_ct, run:

## Make SingleCellExperiments
reference <- SingleCellExperiment(assays = list(counts = ref_counts),
                                     colData = DataFrame(celltypes = ref_ct))

input <- SingleCellExperiment(assays = list(counts = input_counts),
                              reducedDims = SimpleList(TSNE = input_tsne))

input <- CHETAHclassifier(input = input, ref_cells = reference)

## Plot the classification

## Extract celltypes:
celltypes <- input$celltype_CHETAH

A tumor micro-environment reference dataset containing all major cell types for tumor data can be downloaded: here. This reference can be used for all (tumor) input datasets.

2 Some background

CHETAH constructs a classification tree by hierarchically clustering the reference data. The classification is guided by this tree. In each node of the tree, input cells are either assigned to the right, or the left branch. A confidence score is calculated for each of these assignments. When the confidence score for an assignment is lower than the threshold (default = 0.1), the classification for that cell stops in that node.

This results in two types of classifications:

3 Installation

CHETAH will be a part of Bioconductor starting at release 2.9 (30th of April), and will be available by:

## Install BiocManager is necessary
if (!require("BiocManager")) {

# Load the package

The development version can be downloaded from the development version of Bioconductor (in R v3.6).

## Install BiocManager is necessary
if (!require("BiocManager")) {
BiocManager::install('CHETAH', version = "devel")

# Load the package

4 Preparing your data

4.1 Required data

If you have your data stored as SingleCellExperiments, continue to the next step. Otherwise, you need the following data before you begin:

  • input scRNA-seq count data of the cells to be classified
    • a data.frame or matrix, with cells in the columns and genes in the rows
  • (!) normalized scRNA-seq count data of reference cells
    • in similar format as the input
  • the cell types of the reference cells
    • a named character vector (names corresponding to the colnames of the reference counts)
  • (optional) a 2D reduced dimensional representation of your input data for visualization: e.g. t-SNE1, PCA.
    • a two-column matrix/data.frame, with the cells in the rows and the two dimensions in the columns

As an example on how to prepare your data, we will use melanoma input data from Tirosh et al. and head-neck tumor reference data from Puram et al. as an example.

For information on how to create your own reference see Creating a Reference

## load CHETAH's datasets

## To prepare the data from the package's internal data, run:
celltypes_hn <- headneck_ref$celltypes
counts_hn <- assay(headneck_ref)
counts_melanoma <- assay(input_mel)
tsne_melanoma <- reducedDim(input_mel)

## The input data: a Matrix
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
counts_melanoma[1:5, 1:5]
#> 5 x 5 sparse Matrix of class "dgCMatrix"
#>              mel_cell1 mel_cell2 mel_cell3 mel_cell4 mel_cell5
#> ELMO2           .         .         .         4.5633    .     
#> PNMA1           .         4.3553    .         .         .     
#> MMP2            .         .         .         .         .     
#> TMEM216         .         .         .         .         5.5624
#> TRAF3IP2-AS1    2.1299    4.0542    2.4209    1.6531    1.3144

## The reduced dimensions of the input cells: 2 column matrix
tsne_melanoma[1:5, ]
#>               tSNE_1    tSNE_2
#> mel_cell1  4.5034553 13.596680
#> mel_cell2 -4.0025667 -7.075722
#> mel_cell3  0.4734054  9.277648
#> mel_cell4  3.2201815 11.445236
#> mel_cell5 -0.3354758  5.092415
all.equal(rownames(tsne_melanoma), colnames(counts_melanoma))
#> [1] TRUE

## The reference data: a Matrix
#> [1] "matrix" "array"
counts_hn[1:5, 1:5]
#>              hn_cell1 hn_cell2 hn_cell3 hn_cell4 hn_cell5
#> ELMO2         0.00000        0  0.00000  1.55430   4.2926
#> PNMA1         0.00000        0  0.00000  4.55360   0.0000
#> MMP2          0.00000        0  7.02880  4.50910   6.3006
#> TMEM216       0.00000        0  0.00000  0.00000   0.0000
#> TRAF3IP2-AS1  0.14796        0  0.65352  0.28924   3.6365

## The cell types of the reference: a named character vector
#>  Named chr [1:180] "Fibroblast" "Fibroblast" "Fibroblast" "Fibroblast" ...
#>  - attr(*, "names")= chr [1:180] "hn_cell1" "hn_cell2" "hn_cell3" "hn_cell4" ...
## The names of the cell types correspond with the colnames of the reference counts:
all.equal(names(celltypes_hn), colnames(counts_melanoma)) 
#> [1] "Lengths (180, 150) differ (string compare on first 150)"
#> [2] "150 string mismatches"

4.2 SingleCellExperiments

CHETAH expects data to be in the format of a SingleCellExperiment, which is an easy way to store different kinds of data together. Comprehensive information on this data type can be found here.

A SingleCellExperiment holds three things:

  • counts: assays
    • as a list of Matrices
  • meta-data: colData
    • as DataFrames
  • reduced dimensions (e.g. t-SNE, PCA): ReducedDims
    • as a SimpleList of 2-column data.frames or matrices

CHETAH needs

  • a reference SingleCellExperiment with:
    • an assay
    • a colData column with the corresponding cell types (default “celltypes”)
  • an input SingleCellExperiment with:
    • an assay
    • a reducedDim (e.g. t-SNE)

For the example data, we would make the two objects by running:

## For the reference we define a "counts" assay and "celltypes" metadata
headneck_ref <- SingleCellExperiment(assays = list(counts = counts_hn),
                                     colData = DataFrame(celltypes = celltypes_hn))

## For the input we define a "counts" assay and "TSNE" reduced dimensions
input_mel <- SingleCellExperiment(assays = list(counts = counts_melanoma),
                                  reducedDims = SimpleList(TSNE = tsne_melanoma))
Note: CHETAH functions default to the first assay/reducedDim in an object and “celltypes” for the reference’s colData. See ?CHETAHclassifier and ?PlotCHETAH on how to change this behaviour.

5 Running CHETAH

Now that the data is prepared, running chetah is easy:

input_mel <- CHETAHclassifier(input = input_mel,
                              ref_cells = headneck_ref)
#> Preparing data....
#> Running analysis...

5.1 The output

CHETAH returns the input object, but added:

  • input$celltype_CHETAH
    • a named character vector that can directly be used in any other workflow/method.
  • “hidden” int_colData and int_metadata, not meant for direct interaction, but
    • which can all be viewed and interacted with using: PlotCHETAH and CHETAHshiny

5.2 Standard plots

CHETAH’s classification can be visualized using: PlotCHETAH. This function plots both the classification tree and the t-SNE (or other provided reduced dimension) map.
Either the final types or the intermediate types are colored in these plots. The non-colored types are represented in a grayscale.

To plot the final types:

PlotCHETAH(input = input_mel)

Please note that each type “NodeX” corresponds to the node with number X and that the type “Unassigned” corresponds to the node 0

Conversely, to color the intermediate types:

PlotCHETAH(input = input_mel, interm = TRUE)

If you would like to use the classification, and thus the colors, in another package (e.g. Seurat2), you can extract the colors using:

colors <- PlotCHETAH(input = input_mel, return_col = TRUE)

5.3 CHETAHshiny

The classification of CHETAH and other outputs like profile and confidence scores can be visualized in a shiny application that allows for easy and interactive analysis of the classification.

Here you can view:

  • the confidence of all assignments
  • the classification in an interactive window
  • the genes used by CHETAH, an it’s expression in the input data
  • a lot more

The following command will open the shiny application as in an R window. The page can also be opened in your default web-browser by clicking “Open in Browser” at the very top.

CHETAHshiny(input = input_mel)

6 Changing classification

6.1 Confidence score

CHETAH calculates a confidence score for each assignment of an input cell to one of the branches of a node.
The confidence score:

  • has a value between 0 and 2
  • will normally lie between 0 and 1
  • 0 represents no confidence for an assignment, 1 high confidence.

The default confidence threshold of CHETAH is 0.1.
This means that whenever a cell is assigned to a branch and the confidence of that assignment is lower than 0.1, the classification will stop in that node.

The confidence threshold can be adjusted in order to classify more or fewer cells to a final type:

  • Using a confidence threshold of 0 will classify all input cells to a final type. Be aware that this classification can be noisy and can contain incorrect classifications.
  • Using a threshold of 0.2, 0.3, 0.4, etc, will classify a decreasing number of cells to a final type, with the remaining cells having a increasingly high confidence throughout all nodes in the tree.

For example, to only classify cells with very high confidence:

input_mel <- Classify(input = input_mel, 0.8)
PlotCHETAH(input = input_mel, tree = FALSE)

Conversely, to classify all cells:

input_mel <- Classify(input_mel, 0)
PlotCHETAH(input = input_mel, tree = FALSE)