Flow cytometry and the more recently introduced CyTOF (cytometry by time-of-flight mass spectrometry or mass cytometry) are high-throughput technologies that measure protein abundance on the surface or within cells. In flow cytometry, antibodies are labeled with fluorescent dyes and fluorescence intensity is measured using lasers and photodetectors. CyTOF utilizes antibodies tagged with metal isotopes from the lanthanide series, which have favorable chemistry and do not occur in biological systems; abundances per cell are recorded with a time-of-flight mass spectrometer. In either case, fluorescence intensities (flow cytometry) or ion counts (mass cytometry) are assumed to be proportional to the expression level of the antibody-targeted antigens of interest.
Due to the differences in acquisition, further distinct characteristics should be noted. Conventional fluorophore-based flow cytometry is non-destructive and can be used to sort cells for further analysis. However, because of the spectral overlap between fluorophores, compensation of the data needs to be performed [1], which also limits the number of parameters that can be measured simultaneously. Thus, standard flow cytometry experiments measure 6-12 parameters with modern systems measuring up to 20 channels [2], while new developments (e.g., BD FACSymphony) promise to increase this capacity towards 50. Moreover, flow cytometry offers the highest throughput with tens of thousands of cells measured per second at relatively low operating costs per sample.
By using rare metal isotopes in CyTOF, cell autofluorescence can be avoided and spectral overlap is drastically reduced. However, the sensitivity of mass spectrometry results in the measurement of metal impurities and oxide formations, which need to be carefully considered in antibody panel design (e.g., through antibody concentrations and coupling of antibodies to neighboring metals). Leipold et al. recently commented that minimal spillover does not equal no spillover [3]. Nonetheless, CyTOF offers a high dimension of parameters measured per cell, with current panels using ~40 parameters and the promise of up to 100. Throughput of CyTOF is lower, at the rate of hundreds of cells per second, and cells are destroyed during ionization.
The ability of flow cytometry and mass cytometry to analyze individual cells at high-throughput scales has resulted in a wide range of biological and medical applications. For example, immunophenotyping assays are used to detect and quantify cell populations of interest, to uncover new cell populations and compare abundance of cell populations between different conditions, for example between patient groups [4]. Thus, it can be used as a biomarker discovery tool.
Various methodological approaches aim for biomarker discovery [5]. A common strategy, which we will refer to throughout this workflow as the “classic” approach, is to first identify cell populations of interest by manual gating or automated clustering [6, 7]. Second, using statistical tests, one can determine which of the cell subpopulations or protein markers are associated with a phenotype (e.g., clinical outcome) of interest. Typically, cell subpopulation abundance expressed as cluster cell counts or median marker expression would be used in the statistical model to relate to the sample-level phenotype.
Importantly, there are many alternatives to what we propose below, and several methods have emerged. For instance, Citrus [8] tackles the differential discovery problem by strong over-clustering of the cells, and by building a hierarchy of clusters from very specific to general ones. Using model selection and regularization techniques, clusters and markers that associate with the outcome are identified. A further machine learning approach, CellCnn [9], learns the representation of clusters that are associated with the considered phenotype by means of convolutional neural networks, which makes it particularly applicable to detecting discriminating rare cell populations. Another approach, cydar [10] performs differential abundance analysis on “hypersphere” counts, where hyperspheres are defined using all markers, and calculates differential tests using the the generalized linear modeling capabilities of edgeR [11].
However, there are tradeoffs to consider. Citrus performs feature selection but does not provide significance levels, such as p-values, for the strength of associations. Due to its computational requirements, Citrus cannot be run on entire mass cytometry datasets and one typically must analyze a subset of the data. The “filters” from CellCnn may identify one or more cell subsets that distinguish experimental groups, while these groups may not necessarily correspond to any of the canonical cell types, since they are learned with a data-driven approach. Since the hyperspheres from cydar are defined using all markers, interpretation of differential expression of specific markers (e.g., functional markers) within cell populations is difficult.
A noticeable distinction between the machine-learning approaches and our classical regression approach is the configuration of the model. Citrus and CellCnn model the patient response as a function of the measured HDCyto values, whereas the classical approach models the HDCyto data itself as the response, thus putting the distributional assumptions on the experimental HDCyto data. This carries the distinct advantage that covariates (e.g., age, gender, batch) can be included, together with finding associations of the phenotype to the predictors of interest (e.g., cell type abundance). Specifically, neither Citrus nor CellCnn are able to directly account for covariates, such as paired experiments or presence of batches. Another recent approach, mixed-effects association testing for single cells (MASC) uses the same “reverse” association approach that we illustrate below [12]. Recently, we have formalized and compared various regression approaches, resulting in the diffcyt package [13].
Within the classical approach, hybrid methods are certainly possible, where discovery of interesting cell populations is done with one algorithm, and quantifications or signal aggregations are modeled in standard regression frameworks. For instance, CellCnn provides p-values from a t-test or Mann-Whitney U-test conducted on the frequencies of previously detected cell populations. Some caution is warranted here, in terms of using data twice – so-called double dipping or circular analysis – and making claims about the statistical evidence of a change in abundance where initial analyses of the same data were used to discover subpopulations. This topic has been discussed with respect to clustering other types of single cell data and then inferring the markers of such populations [14]; however, it is less clear how much clustering affects cross-sample inferences.
Step by step, this workflow presents differential discovery analyses assembled from a suite of tools and methods that, in our view, lead to a higher level of flexibility and robust, statistically-supported and interpretable results. Cell population identification is conducted by means of unsupervised clustering using the FlowSOM and ConsensusClusterPlus packages, which together were among the best performing clustering approaches for high-dimensional cytometry data [15]. Notably, FlowSOM scales easily to millions of cells and thus no subsetting of the data is required.
To be able to analyze arbitrary experimental designs (e.g., batch effects, paired experiments, etc.), we show how to conduct differential analysis of cell population abundances using generalized linear mixed models (GLMM) and of marker intensities using linear models (LM) and linear mixed models (LMM). For both differential abundance and expression analysis, we use methods implemented in the diffcyt package [13]. Internally, model fitting is performed with packages lme4 and stats, and hypothesis testing with the multcomp package.
For visualization, we use new plotting functions from the CATALYST package that employ ggplot2 as their graphical engine. Notably, CATALYST delivers a suite of useful visual representations of HDCyto data characteristics, such as an MDS (multidimensional scaling) plot of aggregated signal for exploring sample similarities. The obtained cell populations are visualized using dimension reduction techniques (e.g., UMAP via the umap package) and heatmaps (via the ComplexHeatmap package [16]) to represent characteristics of the annotated cell populations and identified biomarkers. (Note that an alternative R implementation of the UMAP algorithm with additional functionality is also available in the uwot package.)
The workflow is intentionally not fully automatic. First, we strongly advocate for exploratory data analysis to get an understanding of data characteristics before formal statistical modeling. Second, the workflow involves an optional step where the user can manually merge and annotate clusters (see Cluster merging and annotation section) but in a way that is easily reproducible. The CyTOF data used here (see Data description section) is already preprocessed; i.e., the normalization and de-barcoding, as well as removal of doublets, debris and dead cells, were already performed; further details are available in the Data preprocessing section.
Notably, this workflow is equally applicable to flow or mass cytometry datasets, for which the preprocessing steps have already been performed. In addition, the workflow is modular and can be adapted as new algorithms or new knowledge about how to best use existing tools comes to light. Alternative clustering algorithms such as the popular PhenoGraph algorithm [17] (e.g., via the Rphenograph package), dimensionality reduction techniques, such as diffusion maps [18] via the destiny package [19], t-SNE via the Rtsne and SIMLR [20] via the SIMLR package could be inserted into the workflow.
Note: To cite this workflow, please refer to this F1000 article https://f1000research.com/articles/6-748/v3 .
To generate reproducible results, we set random seeds in several steps of the workflow. However, the default methods for random number generation in R were updated in R version 3.6.0 (released in April 2019; see R News for details). Therefore, for consistency with earlier versions of the workflow, we use the function RNGversion()
to use the random number generation methods from the previous version of R. Note that this step is not required when running a standard analysis on a new dataset; it is included here for reproducibility and backward compatibility only.
RNGversion("3.5.3")
We use a subset of CyTOF data originating from Bodenmiller et al. [21] that was also used in the Citrus paper [8]. In the original study, peripheral blood mononuclear cells (PBMCs) in unstimulated and after 11 different stimulation conditions were measured for 8 healthy donors. For each sample, expression of 10 cell surface markers and 14 signaling markers was recorded. We perform our analysis on samples from the reference and one stimulated condition where cells were crosslinked for 30 minutes with B cell receptor/Fc receptor known as BCR/FcR-XL, resulting in 16 samples in total (8 patients, unstimulated and stimulated for each).
The original data is available from the Cytobank report. The subset used here can be downloaded from the Citrus Cytobank repository (files with _BCR-XL.fcs
or _Reference.fcs
endings) or from the HDCytoData [22] package via Bodenmiller_BCR_XL_flowSet()
(see Data import section).
In both the Bodenmiller et al. and Citrus manuscripts, the 10 lineage markers were used to identify cell subpopulations. These were then investigated for differences between reference and stimulated cell subpopulations separately for each of the 14 functional markers. The same strategy is used in this workflow; 10 lineage markers are used for cell clustering and 14 functional markers are tested for differential expression between the reference and BCR/FcR-XL stimulation. Even though differential analysis of cell abundance was not in the scope of the Bodenmiller et al. experiment, we present it here to highlight the generality of the discovery.
Conventional flow cytometers and mass cytometers produce .fcs files that can be manually analyzed using programs such as FlowJo [TriStar] or Cytobank [23], or using R/Bioconductor packages, such as flowWorkspace [24] and openCyto [25]. During this initial analysis step, dead cells are removed, compensation is checked and with simple two dimensional scatter plots (e.g., marker intensity versus time), marker expression patterns are checked. Often, modern experiments are barcoded in order to remove analytical biases due to individual sample variation or acquisition time. Preprocessing steps including normalization using bead standards [26], de-barcoding [27] and compensation can be completed with the CATALYST package [28], which also provides a Shiny app for interactive analysis. Of course, preprocessing steps can occur using custom scripts within R or outside of R (e.g., Normalizer [26]).
We recommend as standard practice to keep an independent record of all samples collected, with additional information about the experimental condition, including sample or patient identifiers, processing batch and so on. That is, we recommend having a trail of metadata for each experiment. In our example, the metadata file, PBMC8_metadata.xlsx
, can be downloaded from the Robinson Lab server with the download.file()
function. For the workflow, the user should place it in the current working directory (getwd()
). Here, we load it into R with the read_excel()
function from the readxl package and save it into a variable called md
, but other file types and interfaces to read them in are also possible.
The data frame md
contains the following columns:
file_name
with names of the .fcs files corresponding to the reference (suffix “Reference”) and BCR/FcR-XL stimulation (suffix “BCR-XL”) samples,
sample_id
with shorter unique names for each sample containing information about conditions and patient IDs. These will be used to label samples throughout the entire workflow.
condition
describes whether samples originate from the reference (Ref
) or stimulated (BCRXL
) condition,
patient_id
defines the IDs of patients.
library(readxl)
url <- "http://imlspenticton.uzh.ch/robinson_lab/cytofWorkflow"
md <- "PBMC8_metadata.xlsx"
download.file(file.path(url, md), destfile = md, mode = "wb")
md <- read_excel(md)
head(data.frame(md))
## file_name sample_id condition patient_id
## 1 PBMC8_30min_patient1_BCR-XL.fcs BCRXL1 BCRXL Patient1
## 2 PBMC8_30min_patient1_Reference.fcs Ref1 Ref Patient1
## 3 PBMC8_30min_patient2_BCR-XL.fcs BCRXL2 BCRXL Patient2
## 4 PBMC8_30min_patient2_Reference.fcs Ref2 Ref Patient2
## 5 PBMC8_30min_patient3_BCR-XL.fcs BCRXL3 BCRXL Patient3
## 6 PBMC8_30min_patient3_Reference.fcs Ref3 Ref Patient3
In our example, the data from the .fcs files listed in the metadata can be loaded from the HDCytoData package [22].
library(HDCytoData)
fs <- Bodenmiller_BCR_XL_flowSet()
Alternatively, the files can be downloaded manually from the Citrus Cytobank repository and loaded into R as a flowSet
using read.flowSet()
from the flowCore package [29]. Importantly, read.flowSet()
and the underlying read.FCS()
functions, by default, may transform the marker intensities and remove cells with extreme positive values. This behavior can be controlled with arguments transformation
and truncate_max_range
, respectively.
In our example, information about the panel is also available in a file called PBMC8_panel.xlsx
, and can be downloaded from the Robinson Lab server and loaded into a variable called panel
. It contains columns for Isotope
and Metal
that define the atomic mass number and the symbol of the chemical element conjugated to the antibody, respectively, and Antigen
, which specifies the protein marker that was targeted; two additional columns specify whether a channel belongs to the lineage or functional type of marker.
The isotope, metal and antigen information that the instrument receives is also stored in the flowFrame
(container for one sample) or flowSet
(container for multiple samples) objects. One can type fs[[1]]
to see the first flowFrame
, which contains a table with columns name
and desc
. Their content can be retrieved with accessors pData(parameters(fs[[1]]))
. The variable name
corresponds to the column names in the flowSet
object, and can be viewed in R via colnames(fs)
.
It should be checked that elements from panel
can be matched to their corresponding entries in the flowSet
object. Specifically, the entries in panel$Antigen
must have an equivalent in the desc
columns of the flowFrame
objects.
In the following analysis, we will often use marker IDs as column names in the tables containing expression values. As a cautionary note, during object conversion from one type to another (e.g., in the creation of data.frame from a matrix), some characters (e.g., dashes) in the dimension names are replaced with dots, which may cause problems in matching. To avoid this problem, we will replace problematic characters (dashes with underscores; colons with dots) when organizing all data (measurement data, panel, and experimental metadata) into a SingleCellExperiment
(SCE) object (see below).
panel <- "PBMC8_panel_v3.xlsx"
download.file(file.path(url, panel), destfile = panel, mode = "wb")
panel <- read_excel(panel)
head(data.frame(panel))
## fcs_colname antigen marker_class
## 1 CD3(110:114)Dd CD3 type
## 2 CD45(In115)Dd CD45 type
## 3 pNFkB(Nd142)Dd pNFkB state
## 4 pp38(Nd144)Dd pp38 state
## 5 CD4(Nd145)Dd CD4 type
## 6 CD20(Sm147)Dd CD20 type
# spot check that all panel columns are in the flowSet object
all(panel$fcs_colname %in% colnames(fs))
## [1] TRUE
Usually, the raw marker intensities read by a cytometer have strongly skewed distributions with varying ranges of expression, thus making it difficult to distinguish between the negative and positive cell populations. It is common practice to transform CyTOF marker intensities using, for example, arcsinh (inverse hyperbolic sine) with cofactor 5 [8, 30] to make the distributions more symmetric and to map them to a comparable range of expression, which is important for clustering. A cofactor of 150 has .been promoted for flow cytometry, but users are free to implement alternative transformations, some of which are available from the transform()
function of the flowCore package. By default, the prepData()
SCE constructor (see next section) arcsinh transforms marker expressions with a cofactor of 5.
As the ranges of marker intensities can vary substantially, for visualization, we apply another transformation that scales the expression of all markers to values between 0 and 1 using low (e.g., 1%) and high (e.g., 99%) percentiles as the boundary. This additional transformation of the arcsinh-transformed data can sometimes give better visual representation of relative differences in marker expression between annotated cell populations. However, all computations (differential testing, hierarchical clustering etc.) are still performed on arcsinh-transformed not scaled expressions. Whether scaled expression values should be plotted is specified with argument scale = TRUE
or FALSE
in the respective visualizations (e.g., plotExprHeatmap()
and plotClusterHeatmap()
).
We will store all data used and returned throughout differential analysis in an object of the SingleCellExperiment (SCE) class. For this, CATALYST provides the wrapper prepData()
to construct a SCE object from the following inputs:
x
: a flowSet
containing the raw measurement data, or a character string that specifies the path to a set of .fcs files.panel
: a data.frame
containing, for each marker, i) its column name in the input raw data, ii) its targeted protein markers, and, optionally, iii) its class (type, state, or none).md
: a data.frame
with columns describing the experimental design.Argument features
specifies which columns (channels) to retain from the input data. By default, all measurement parameters will be kept (features = NULL
). Here, we only keep the channels listed in panel
.
It is important to carefully check whether variables are of the desired type (factor, numeric, character), since input methods may convert columns into different data types. This is taken care of by the prepData()
SCE constructor. For the statistical modeling, we want to make the condition variable a factor with the reference (Ref
) being the reference level. The order of factor levels can be defined with the levels
parameter of the factor
function or via relevel()
.
As a final note, prepData()
requires the filenames listed in the md$file_name
column to match those in the flowSet
.
# specify levels for conditions & sample IDs to assure desired ordering
md$condition <- factor(md$condition, levels = c("Ref", "BCRXL"))
md$sample_id <- factor(md$sample_id,
levels = md$sample_id[order(md$condition)])
# construct SingleCellExperiment
sce <- prepData(fs, panel, md, features = panel$fcs_colname)
We propose some quick checks to verify whether the data we analyze globally represents what we expect; for example, whether samples that are replicates of one condition are more similar and are distinct from samples from another condition. Another important check is to verify that marker expression distributions do not have any abnormalities such as having different ranges or distinct distributions for a subset of the samples. This could highlight problems with the sample collection or data acquisition, or batch effects that were unexpected. Depending on the situation, one can then consider removing problematic markers or samples from further analysis; in the case of batch effects, a covariate column could be added to the metadata table and used below in the statistical analyses.
The step below generates a plot with per-sample marker expression distributions, colored by condition (Figure 1). Here, we can already see distinguishing markers, such as pNFkB and CD20, between stimulated and unstimulated conditions.
p <- plotExprs(sce, color_by = "condition")
p$facet$params$ncol <- 6
p
Figure 1: Per-sample smoothed densities of marker expression (arcsinh-transformed) of 10 lineage markers and 14 functional markers in the PBMC dataset
Two conditions: unstimulated (Ref) and stimulated with BCR/FcR-XL (BCRXL) for each of the 8 healthy donors are presented and colored by experimental condition.
Another spot check is the number of cells per sample (Figure 2). This plot can be used as a guide together with other readouts to identify samples where not enough cells were assayed. The number of cells measured in each sample is also stored in the experiment_info
slot of the SingleCellExperiment
’s metadata
, and can be accessed directly via n_cells()
.
n_cells(sce)
##
## Ref1 Ref2 Ref3 Ref4 Ref5 Ref6 Ref7 Ref8 BCRXL1 BCRXL2
## 2739 16725 9434 6906 11962 11038 15974 13670 2838 16675
## BCRXL3 BCRXL4 BCRXL5 BCRXL6 BCRXL7 BCRXL8
## 12252 8990 8543 8622 14770 11653
plotCounts(sce, group_by = "sample_id", color_by = "condition")
Figure 2: Barplot showing the number of cells measured for each sample in the PBMC dataset
Bars are colored by experimental condition: unstimulated (Ref) and stimulated with BCR/FcR-XL (BCRXL). Numbers in the names on the x-axis indicate patient IDs. Numbers on top of the bars indicate the cell counts.
In transcriptomics applications, one of the most utilized exploratory plots is the multi-dimensional scaling (MDS) plot or a principal component analysis (PCA) plot. Such plots show similarities between samples measured in an unsupervised way and give a sense of how much differential expression can be detected before conducting any formal tests. In transcriptomics, distances between samples are calculated based on the expression of the top varying genes. We propose a similar plot for HDCyto data using median marker expression over all cells to calculate dissimilarities between samples (other aggregations are also possible, and one could reduce the number of top varying markers to include in the calculation). Ideally, samples should cluster well within the same condition, although this depends on the magnitude of the difference between experimental conditions. With this diagnostic, one can identify outlier samples and eliminate them if the circumstances warrant it. An MDS plot on the median marker expressions can be generated with pbMDS()
, which internally calls the same-named limma function.
In our MDS plot on median marker expression values (Figure 3), we can also see that the first dimension (MDS1) separates the unstimulated and stimulated samples reasonably well. The second dimension (MDS2) represents, to some degree, differences between patients. Most of the samples that originate from the same patient are placed at a similar point along the y-axis, for example, samples from patients 7, 5, and 8 are at the bottom of the plot, while samples from patient 4 are located at the top of the plot. This also indicates that the marker expression of individual patients is driving similarity and perhaps should be formally accounted for in the downstream statistical modeling.
pbMDS(sce, color_by = "condition", label_by = "sample_id")
Figure 3: MDS plot for the unstimulated (Ref) and stimulated with BCR/FcR-XL (BCRXL) samples obtained for each of the 8 healthy donors in the PBMC dataset
Calculations are based on the median (arcsinh-transformed) marker expression of 10 lineage markers and 14 functional markers across all cells measured for each sample. Distances between samples in the plot approximate the typical change in medians. Numbers in the label names indicate patient IDs.
In contrast to genomic applications, the number of variables measured for each sample is much lower in HDCyto data. In the former, thousands of genes are surveyed, whereas in the latter, ~20-50 antigens are typically targeted. Similar to the MDS plot above, a heatmap of the same data also gives insight into the structure of the data. The heatmap shows median marker intensities with clustered columns (markers) and rows (samples). We have used hierarchical clustering with average linkage and Euclidean distance, but also Ward’s linkage could be used [8], and in CyTOF applications, a cosine distance shows good performance [31]. In this plot, we can see which markers drive the observed clustering of samples (Figure 4).
As with the MDS plot, the dendrogram separates the reference and stimulated samples very well. Also, similar groupings of patients within experimental conditions are observed.
plotExprHeatmap(sce, scale = "last",
hm_pal = rev(hcl.colors(10, "YlGnBu")))
Figure 4: Heatmap of the median (arcsinh-transformed) marker expression of 10 lineage markers and 14 functional markers across all cells measured for each sample in the PBMC dataset
Color-coded with yellow for lower expression and blue for higher expression. Dendrograms present clustering of samples (rows) and markers (columns) which is based on hierarchical clustering with Euclidean distance metric and average linkage. Row annotations on the left of the heatmap represent the two conditions: unstimulated (Ref) and stimulated with BCR/FcR-XL (BCRXL), and patient IDs for each of the 8 healthy donors.
In this step, we identify the ability of markers to explain the variance observed in each sample. In particular, we calculate the PCA-based non-redundancy score (NRS) [17]. Markers with higher score explain a larger portion of variability present in a given sample.
The average NRS can be used to select a subset of markers that are non-redundant in each sample but at the same time capture the overall diversity between samples. Such a subset of markers can then be used for cell population identification analysis (i.e., clustering). We note that there is no precise rule on how to choose the right cutoff for marker inclusion, but one option is to select a suitable number of the top-scoring markers. The number can be chosen by analyzing the plot with the NR scores (Figure 5), where the markers are sorted by the decreasing average NRS. Based on prior biological knowledge, one can refine the marker selection and remove markers that are not likely to distinguish cell populations of interest, even if they have high scores, and add in markers with low scores but known to be important in discerning cell subgroups [17]. Thus, the NRS analysis serves more as a guide to marker selection and is not meant as a hard rule.
In the dataset considered here [8, 21], we want to use all the 10 lineage markers, so there is no explicit need to restrict the set of cell surface markers, and the NRS serve as another quality control step. There may be other situations where this feature selection step would be of interest, for example, in panel design [17].
plotNRS(sce, features = "type", color_by = "condition")
Figure 5: Non-redundancy scores for each of the 10 lineage markers and all samples in the PBMC dataset
The full points represent the per-sample NR scores (colored by experimental conditions), while empty black circles indicate the mean NR scores from all the samples. Markers on the x-axis are sorted according to the decreasing average NRS.
Cell population identification typically has been carried out by manual gating, a method based on visual inspection of a series of two-dimensional scatterplots. At each step, a subset of cells, either positive or negative for the two visualized markers, is selected and further stratified in the subsequent iterations until populations of interest across a range of marker combinations are captured. However, manual gating has drawbacks, such as subjectivity, bias toward well-known cell types, and inefficiency when analyzing large datasets, which also contribute to a lack of reproducibility [5].
Considerable effort has been made to improve and automate cell population identification, such as unsupervised clustering [32]. However, not all methods scale well in terms of performance and speed from the lower dimensionality flow cytometry data to the higher dimensionality mass cytometry data [15], since clustering in higher dimensions can suffer the “curse of dimensionality”.
Beside the mathematical and algorithmic challenges of clustering, cell population identification may be difficult due to the chemical and biological aspects of the cytometry experiment itself. Therefore, caution should be taken when designing panels aimed at detecting rare cell populations by assigning higher sensitivity metals to rare markers. The right choice of a marker panel used for clustering can also be important. For example, it should include all markers that are relevant for cell type identification.
In this workflow, we conduct cell clustering with FlowSOM [33] and ConsensusClusterPlus [34], which appeared amongst the fastest and best performing clustering approaches in a recent study of HDCyto datasets [15]. This ensemble showed strong performance in detecting both high and low frequency cell populations and is one of the fastest methods to run, which enables its interactive usage. We use a slight modification of the original workflow presented in the FlowSOM vignette, which we find more flexible. In particular, we directly call the ConsensusClusterPlus()
function that is embedded in metaClustering_consensus()
. Thus, we are able to access all the functionality of the ConsensusClusterPlus package to explore the number of clusters.
The FlowSOM workflow consists of three steps: i) building a self-organizing map (SOM), where cells are assigned according to their similarities to 100 (by default) grid points (or, so-called codebook vectors or codes) of the SOM; ii) building a minimal spanning tree, which is mainly used for graphical representation of the clusters, is skipped in this pipeline; and iii) metaclustering of the SOM codes, is performed with the ConsensusClusterPlus package. These are wrapped in the CATALYST function cluster()
. Additionally, we add an optional round of manual expert-based merging of the metaclusters and allow this to be done in a reproducible fashion.
It is important to point out that we cluster all cells from all samples together. This strategy is beneficial, since we directly obtain cluster assignment for each cell, we label cell populations only once and the mapping of cell types between samples is automatically consistent. For a list of alternative approaches and their advantages and disadvantages, please refer to the Discussion section, where we consider: clustering per sample, clustering of data from different measurement batches and down-sampling in case of widely varying numbers of cells per sample.
CATALYST provides the wrapper function cluster()
to perform both FlowSOM clustering and ConsensusClusterPlus metaclustering. The clustering IDs obtained after the first high-dimensional clustering step are added to the input SCE’s colData
in the cluster_id
column. The cluster codes for the lower dimensional metaclusterings to 2 through maxK
clusters are stored as list element cluster_codes
in the metadata
. In this way, all levels of clustering are computed once and kept accessible for further investigation, visualization, and differential analysis.
The subset of markers to use for clustering is specified with argument features
. For future reference, the specified markers will be assigned class "type"
, and the remainder of markers will be assigned to be "state"
markers. The sets of type and state markers can be accessed at any point with the type_markers()
and state_markers()
accessor functions, respectively.
In our example, we have specified marker classes in the input panel
, and cluster()
will default to using "type"
markers for clustering. For clarity, we specify this explicitly via features = type_markers(sce)
. We call ConsensusClusterPlus()
with maximum number of clusters maxK = 20
.
FlowSOM output can be sensitive to random starts [15]. To make results reproducible, we first set a seed for random number generation prior to calling cluster()
, and, secondly, specifcy a seed
argument inside cluster()
. Unfortunately, this is necessary as the ConsensusClusterPlus()
function internally calls set.seed()
and will, when not provided with a seed, overwrite our seed using the current system time (set.seed(as.numeric(Sys.time()))
).
In general, it is advisable to rerun analyses with multiple random seeds, for two reasons. First, one can see how robust the detected clusters are, and second, when the goal is to find smaller cell populations, it may happen that, in some runs, random starting points do not represent rare cell populations, as the chance of selecting starting cells from them is low and they are merged into a larger cluster.
set.seed(1234)
sce <- cluster(sce, features = "type",
xdim = 10, ydim = 10, maxK = 20, seed = 1234)
Automatic approaches for selecting the number of clusters in HDCyto data do not always succeed [15]. In general, we therefore recommend some level of over-clustering, and if desired, manual merging of clusters. Such a hierarchical approach is especially suited when the goal is to detect smaller cell populations.
The SPADE clustering analysis performed by Bodenmiller et al. [21] identified 6 main cell types (T-cells, monocytes, dendritic cells, B-cells, NK cells and surface- cells) that were further stratified into 14 more specific subpopulations (CD4+ T-cells, CD8+ T-cells, CD14+ HLA-DR high monocytes, CD14+ HLA-DR med monocytes, CD14+ HLA-DR low monocytes, CD14- HLA-DR high monocytes, CD14- HLA-DR med monocytes, CD14- HLA-DR low monocytes, dendritic cells, IgM+ B-cells, IgM- B-cells, NK cells, surface- CD14+ cells and surface- CD14- cells). In our analysis, we are interested in identifying the 6 main PBMC populations, including: CD4+ T-cells, CD8+ T-cells, monocytes, dendritic cells, NK cells and B-cells. Following the concept of over-clustering, we perform the metaclustering of the (by default) 100 SOM codes into more than expected number of groups. For example, stratification into 20 groups should give enough resolution to detect these main clusters. We can explore the clustering in a wide variety of visualizations: UMAP plots, heatmaps and the “delta area” from ConsensusClusterPlus.
When the interest is in studying more specific subpopulations at higher detail, one can follow a strategy of reclustering as described in the Obtaining higher resolution section, where we propose to repeat the workflow (clustering and differential analyses) after gating out a selected subpopulation (e.g., one of the large populations).
We can then investigate characteristics of identified clusters with heatmaps that illustrate median marker expression in each cluster (Figure 6). As the range of marker expression can vary substantially from marker to marker, we use the 0-1 transformed data for some visualizations (argument scale = TRUE
in the respective plotting functions). However, to stay consistent with FlowSOM and ConsensusClusterPlus, we use the (arcsinh-transformed) unscaled data to generate the dendrogram of the hierarchical structure of metaclusters.
Instead of using only medians, which do not give a full representation of cluster specifics, one can plot the entire marker expression distribution in each cluster (Figure 7). Such a plot gives more detailed profile of each cluster, but represents a larger amount of information to interpret. Heatmaps give an overall overview of clusters, are quicker and easier to interpret, and together with the dendrogram can be a good basis for further cluster merging (see Cluster merging and annotation section).
plotExprHeatmap(sce, features = "type",
by = "cluster_id", k = "meta20",
bars = TRUE, perc = TRUE)
Figure 6: Heatmap of the median marker intensities of the 10 lineage markers across the 20 cell populations obtained with FlowSOM after the metaclustering step with ConsensusClusterPlus (PBMC data)
The color in the heatmap represents the median of the arcsinh, 0-1 transformed marker expression calculated over cells from all the samples, varying from blue for lower expression to red for higher expression. The dendrogram on the left represents the hierarchical similarity between the 20 metaclusters (metric: Euclidean distance; linkage: average). Each cluster has a unique color assigned (bar on the left) which is identical in other visualizations of these 20 clusters (e.g., the UMAP shown in Figure 10) facilitating the figure interpretation. Barplot along the rows (clusters) and values in brackets on the right indicate the relative sizes of clusters.
plotClusterExprs(sce, k = "meta20", features = "type")
Figure 7: Distributions of marker intensities (arcsinh-transformed) of the 10 lineage markers in the 20 cell populations obtained with FlowSOM after the metaclustering step with ConsensusClusterPlus (PBMC data)
Red densities represent marker expression for cells in a given cluster. Blue densities are calculated over all the cells and serve as a reference.
In addition to investigating expression of the lineage markers, we can also have a look at expression of the functional markers. We propose a heatmap that depicts median expression of functional markers in each sample (Figure 8) such that the potential differential expression can be investigated already at this data exploration step before the formal testing is done. In order to plot all the heatmaps in one panel, we use the ComplexHeatmap package [16].
plotMultiHeatmap(sce,
hm1 = "type", hm2 = "pS6", k = "meta20",
row_anno = FALSE, bars = TRUE, perc = TRUE)
Figure 8: Heatmap of the median marker intensities of the 10 lineage markers and one signaling marker (pS6) across the 20 cell populations obtained with FlowSOM after the metaclustering step with ConsensusClusterPlus (PBMC data)
The left panel presents a heatmap analogous to the one in Figure 6. Heatmap on the right represents the median of the arcsinh, 0-1 transformed marker expression for a signaling marker pS6 calculated over cells in each sample (columns) individually.
One of the most popular plots for representing single cell data are t-SNE plots, where each cell is represented in a lower, usually two-dimensional, space computed using t-stochastic neighbor embedding (t-SNE) [35, 36]. More generally, dimensionality reduction techniques represent the similarity of points in 2 or 3 dimensions, such that similar objects in high-dimensional space are also similar in lower dimensional space. Mathematically, there are a myriad of ways to define this similarity. For example, principal component analysis (PCA) uses linear combinations of the original features to find orthogonal dimensions that show the highest levels of variability; the top 2 or 3 principal components can then be visualized.
Nevertheless, there are a few notes of caution when using t-SNE or any other dimensionality reduction technique. Since they are based on preserving similarities between cells, those that are similar in the original space will be close in the 2D/3D representation, but the opposite does not always hold. In our experience, t-SNE with default parameters for HDCyto data is often suitable (for more guidance on the specifics of t-SNE, see How to Use t-SNE Effectively [37]).
Another nonlinear dimensionality reduction technique, uniform manifold approximation and projection (UMAP), has recently been directly compared to t-SNE, and shown to outperform t-SNE in runtime, reproducibility, and its ability to organize cells into meaningful clusters [38, 39]. Throughout this workflow, we use UMAP as our dimensionality reduction method of choice, but other techniques, such as PCA, diffusion maps [19], SIMLR [20], isomaps or t-SNE could be applied. Alternative algorithms, such as largeVis [40] (available via the largeVis package) or hierarchical stochastic neighbor embedding (HSNE) [41], can also be used for dimensionality reduction of very large datasets without downsampling. Alternatively, the dimensionality reduction can be performed on the codes of the SOM, at a resolution (size of the SOM) specified by the user (Figure 13).
A variety of dimension reduction methods are available in the scater package and can be run via runX()
, where X = "PCA", "TSNE", "UMAP", "MDS"
or "DiffusionMap"
(see ?"scater-red-dim-args"
for details). To make results reproducible, the random seed should be set via set.seed
prior to dimension reduction. The assay containing expression values (default logcounts
) should be specified via argument exprs_values
, and the subset of markers to use for computing reduced dimensions is specified via subset_row
. Here, we will use the set of type-markers accessible via type_markers(sce)
.
Most dimensionality reduction techniques require significant computational time to process the data. To keep running times reasonable for larger CyTOF datasets, one may use a subset of cells. Here, we first split the vector of cell indices by sample ID and sample at most 500 cells per sample for dimension reduction.
# run t-SNE/UMAP on at most 500/1000 cells per sample
set.seed(1234)
sce <- runDR(sce, "TSNE", cells = 500, features = "type")
sce <- runDR(sce, "UMAP", cells = 1e3, features = "type")
The UMAP map below is colored according to the expression level of the CD4 marker, highlighting the position of CD4+ T-cells (Figure 9). In this way, one can use a set of markers to highlight where cell types of interest are located on the map. If one is loosely interpreting density of points in the map, it is recommended to select a fixed number of cells per sample.
plotDR(sce, "UMAP", color_by = "CD4")
Figure 9: UMAP based on the arcsinh-transformed expression of the 10 lineage markers in the cells from the PBMC dataset
UMAP was run with no PCA step. From each of the 16 samples, 1000 cells were randomly selected. Cells are colored according to the expression level of the CD4 marker.
Alternatively, we can color the cells by any resolution of clustering available in the codes. Here, we compare the t-SNE and UMAP projections of cells colored by the 20 metaclusters. Ideally, cells of the same color should be close to each other (Figure 10). When the plots are further stratified by sample (Figure 11), we can verify whether similar cell populations are present in all replicates, which can help in identifying outlying samples. Optionally, stratification can be done by condition (Figure 12). With such a spot-check plot, we can inspect whether differences in cell abundance are strong between conditions, and we can visualize and identify distinguishing clusters before applying formal statistical testing. A similar approach of data exploration was proposed in studies of treatment-specific differences of polyfunctional antigen-specific T-cells [42].
p1 <- plotDR(sce, "TSNE", color_by = "meta20") +
theme(legend.position = "none")
p2 <- plotDR(sce, "UMAP", color_by = "meta20")
lgd <- get_legend(p2)
p2 <- p2 + theme(legend.position = "none")
plot_grid(p1, p2, lgd, nrow = 1, rel_widths = c(5, 5, 2))
Figure 10: t-SNE and UMAP based on the arcsinh-transformed expression of the 10 lineage markers in the cells from the PBMC dataset
From each of the 16 samples, 500 (t-SNE) and 1000 (UMAP) cells were randomly selected. Cells are colored according to the 20 cell populations obtained with FlowSOM after the metaclustering step with ConsensusClusterPlus.
# facet by sample
plotDR(sce, "UMAP", color_by = "meta20", facet_by = "sample_id")
Figure 11: UMAP as in Figure 10, but stratified by sample
# facet by condition
plotDR(sce, "UMAP", color_by = "meta20", facet_by = "condition")
Figure 12: UMAP as in Figure 10, but stratified by condition
The SOM codes represent characteristics of the 100 (by default) clusters generated in the first step of the FlowSOM pipeline. Their visualization can also be helpful in understanding the cell population structure and determining the number of clusters. Ultimately, the metaclustering step uses the codes and not the original cells. We treat the codes as new representative cells and apply the t-SNE dimension reduction to visualize them in 2D (Figure 13). The size of the points corresponds to the number of cells that were assigned to a given code. The points are colored according to the results of metaclustering. Since we have only 100 data points, the t-SNE analysis is fast.
As there are multiple ways to mathematically define similarity in high-dimensional space, it is always good practice visualizing projections from other methods to see how consistent the observed patterns are. For instance, we also represent the FlowSOM codes via the first two principal components (Figure 13).
plotCodes(sce, k = "meta20")
Figure 13: The 100 SOM codes in the PBMC dataset colored according to the metaclustering with ConsensusClusterPlus into 20 cell populations presented after the dimension reduction with (A) t-SNE and (B) PCA
The SOM codes represent characteristics of the 100 (by default) clusters generated in the first step of the FlowSOM pipeline. The size of the points corresponds to the number of cells that were assigned to a given code.
Using heatmaps, we can also visualize median marker expression in the 100 SOM codes as in Figure 14. Of note, the clustering presented with the dendrogram does not completely agree with the clustering depicted by the 20 colors because the first one is based on the hierarchical clustering with average linkage and Euclidean distance, while the second one results from the consensus clustering.
plotMultiHeatmap(sce,
hm1 = "type", hm2 = "pS6", k = "som100", m = "meta20",
row_anno = FALSE, col_anno = FALSE, bars = TRUE, perc = TRUE)