1 Introduction

This document walks through the kind of analysis that DepecheR was written to perform. Each step can be adjusted in many ways, so the user is encouraged to also read the help file for each function. Feedback on any bugs is most welcome, preferably on the GitHub site github.com/theorell/DepecheR. Now let us get started.

2 Installation

This is how to install the package, if that has not already been done:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DepecheR")

3 Example data description

The data used in this example is a semi-simulated dataset consisting of 1000 cytotoxic lymphocytes from each of 20 individuals. The individuals have been divided into two groups, and the sizes of some cell populations have then been altered in both groups. This means that the groups can be separated based on the sizes of certain cell types in the data. This exercise will show how to identify these cell types and which markers define them.

Importantly, DepecheR does not provide any pre-processing tools, such as tools for compensation or spectral unmixing of flow cytometry files. The clustering function does have an internal algorithm to detect data with extreme tails, but this does not remove the need to transform flow or mass cytometry data. This can be done using either commercially available software or R packages, such as the Bioconductor packages flowSpecs, flowCore or flowVS.
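
As a rough illustration of the kind of transformation meant here, the following sketch applies an arcsinh transformation in base R. The helper name arcsinhTransform, the simulated raw values and the cofactor of 5 are assumptions made for this example only. Suitable cofactors depend on the instrument and the channel, and the packages mentioned above offer more careful approaches.

```r
# Hypothetical helper (not part of DepecheR): arcsinh-transform the
# selected marker columns of a data frame with one shared cofactor.
arcsinhTransform <- function(rawData, markerCols, cofactor = 5) {
    transformed <- rawData
    transformed[markerCols] <- lapply(rawData[markerCols],
                                      function(x) asinh(x / cofactor))
    transformed
}

# Simulated raw intensities with a long right tail, for illustration only.
set.seed(1)
rawExample <- data.frame(ids = rep(1:2, each = 50),
                         CD4 = rlnorm(100, meanlog = 3, sdlog = 2),
                         CD8 = rlnorm(100, meanlog = 2, sdlog = 2))
transExample <- arcsinhTransform(rawExample, markerCols = c("CD4", "CD8"))

# The extreme tail is compressed, while the ids column is left untouched.
summary(rawExample$CD4)
summary(transExample$CD4)
```

Only the marker columns are transformed, so that identifier columns such as ids keep their original values.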

library(DepecheR)
data('testData')
str(testData)
## 'data.frame':    20000 obs. of  16 variables:
##  $ ids   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ SYK   : num  11.2 21.3 23.7 22.1 24.8 ...
##  $ FcER1g: num  15.4 19.8 18.7 17.9 28.2 ...
##  $ CD16  : num  27.4 23.7 17.9 19.2 19.8 ...
##  $ CD57  : num  6.45 5.55 9.65 17.41 86.38 ...
##  $ EAT.2 : num  21.2 18.2 23.6 39.6 25.9 ...
##  $ CD4   : num  80.7 82.6 88.2 89.6 12.7 ...
##  $ TCRgd : num  17.5 21.6 20.2 29 14.3 ...
##  $ CD8   : num  21.2 17.6 16.4 -11.4 78.3 ...
##  $ iCD3  : num  88.9 86.7 82.7 90.6 87.6 ...
##  $ NKG2C : num  21.2 56.3 36.5 62.9 21.5 ...
##  $ CD2   : num  43.2 64.6 73.6 14.7 75.6 ...
##  $ CD45RO: num  34.9 34.1 27.3 59.5 22.5 ...
##  $ CD3   : num  83.3 90.7 91.5 103.4 76.2 ...
##  $ CD56  : num  21.1 39.7 28.3 15.5 69.5 ...
##  $ label : int  0 0 0 0 0 0 0 0 0 0 ...

As can be noted here, the expected input format is either a dataframe or a matrix with cells as rows and markers/variables as columns. This is in accordance with the .fcs file convention. In this case, however, the different samples (coming from donors) should be added to the same dataframe, and a donor column (ids in this dataset) should specify which cells belong to which donor. If you have .fcs files, you can do this conversion easily using the flowSet2LongDf function in the Bioconductor package flowSpecs.
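
If the data already exists as one table per donor in R, the long format described above can be sketched in base R as follows. The donor names, marker values and the object names donorList and longDf are made up for this illustration. For real .fcs files, the flowSpecs function mentioned above is probably the more convenient route.

```r
# Simulated per-donor tables standing in for data read from separate files.
set.seed(2)
donorList <- lapply(1:3, function(i) {
    data.frame(CD4 = rnorm(5, mean = 50, sd = 10),
               CD8 = rnorm(5, mean = 20, sd = 10))
})
names(donorList) <- c("donorA", "donorB", "donorC")

# Add an ids column to each table, then stack all donors into one data frame.
longDf <- do.call(rbind, Map(function(df, id) cbind(ids = id, df),
                             donorList, names(donorList)))
rownames(longDf) <- NULL

str(longDf)
table(longDf$ids)
```

Each row is still one cell, and the ids column records which donor the cell came from.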

4 depeche clustering

With the depeche clustering function, all necessary scaling and parameter selection are performed under the hood. Once the data is in the right format, all we have to do is run the function on the variables that we want to cluster on.

testDataDepeche <- depeche(testData[, 2:15])
## [1] "Files will be saved to ~/Desktop"
## [1] "As the dataset has less than 100 columns, peak centering is applied."
## [1] "Set 1 with 7 iterations completed in 14 seconds."
## [1] "Set 2 with 7 iterations completed in 6 seconds."
## [1] "Set 3 with 7 iterations completed in 6 seconds."
## [1] "The optimization was iterated 21 times."
str(testDataDepeche)
## List of 4
##  $ clusterVector     : int [1:20000] 2 2 2 2 6 3 5 2 1 1 ...
##  $ clusterCenters    : num [1:8, 1:14] 0 0 0 40.2 0 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:8] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:14] "SYK" "FcER1g" "CD16" "CD57" ...
##  $ essenceElementList:List of 8
##   ..$ 1: chr [1:3] "CD4" "NKG2C" "CD45RO"
##   ..$ 2: chr [1:5] "CD4" "iCD3" "NKG2C" "CD2" ...
##   ..$ 3: chr [1:3] "CD57" "CD8" "CD2"
##   ..$ 4: chr [1:10] "SYK" "FcER1g" "CD16" "CD57" ...
##   ..$ 5: chr [1:3] "CD57" "CD8" "CD45RO"
##   ..$ 6: chr [1:4] "CD57" "CD8" "CD2" "CD56"
##   ..$ 7: chr [1:6] "SYK" "FcER1g" "iCD3" "CD2" ...
##   ..$ 8: chr [1:4] "TCRgd" "CD2" "CD45RO" "CD56"
##  $ penaltyOptList    :List of 2
##   ..$ :'data.frame': 1 obs. of  2 variables:
##   .. ..$ bestPenalty: num 16
##   .. ..$ k          : num 30
##   ..$ :'data.frame': 11 obs. of  2 variables:
##   .. ..$ ARI   : num [1:11] 0.59 0.581 0.683 0.689 0.857 ...
##   .. ..$ nClust: num [1:11] 29.4 28 27.1 24.3 20.4 ...

As can be seen above, the output from the function is a relatively complex list. If the names of the list elements are not sufficiently self-explanatory, see ?depeche for information about each element.
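
One simple way to get a feel for the cluster vector is to tabulate how the cells of each donor are distributed over the clusters. The sketch below uses a small simulated stand-in for testData$ids and testDataDepeche$clusterVector, so that it runs without the clustering step. The object names clusterCounts and clusterFractions are illustrative.

```r
# Simulated stand-ins: 4 donors with 250 cells each, assigned to 3 clusters.
set.seed(3)
ids <- rep(1:4, each = 250)
clusterVector <- sample(1:3, size = length(ids), replace = TRUE)

# Cells per donor and cluster, and the same table as row-wise fractions.
clusterCounts <- table(donor = ids, cluster = clusterVector)
clusterFractions <- prop.table(clusterCounts, margin = 1)
round(clusterFractions, 2)
```

With the real objects, the same table should be obtainable by passing testData$ids and testDataDepeche$clusterVector to table().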

4.1 depeche function output graphs

Two graphs are part of the output from the depeche function.

4.1.1 Adjusted Rand Index as a function of penalty values

This graph shows how internally reproducible the results were for each of the tested penalties. An Adjusted Rand Index of 1 means that if any random subset of observations is clustered twice, each observation will be assigned to the same cluster both times. Conversely, an Adjusted Rand Index of 0 indicates that the two clusterings agree no more than expected by chance. The adjustment in the “Adjusted” Rand Index corrects for the differing probabilities of obtaining a high or low overlap by chance in the special cases of very few and very many clusters.
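
To make the index more concrete, here is a minimal base R sketch of the Adjusted Rand Index in its usual Hubert and Arabie formulation. It is written for illustration and is not taken from the DepecheR source code. The function name and the example vectors are assumptions.

```r
# Adjusted Rand Index for two partitions of the same observations.
# Undefined (0/0) if both partitions place all observations in one cluster.
adjustedRandIndex <- function(x, y) {
    stopifnot(length(x) == length(y))
    contingency <- table(x, y)
    sumCells <- sum(choose(contingency, 2))
    sumRows <- sum(choose(rowSums(contingency), 2))
    sumCols <- sum(choose(colSums(contingency), 2))
    # Agreement expected by chance, given the cluster sizes.
    expected <- sumRows * sumCols / choose(length(x), 2)
    maxIndex <- (sumRows + sumCols) / 2
    (sumCells - expected) / (maxIndex - expected)
}

# Same partition with permuted cluster labels: perfect agreement.
a <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
b <- c(3, 3, 3, 1, 1, 1, 2, 2, 2)
adjustedRandIndex(a, b)  # 1

# Two unrelated random partitions: a value close to 0.
set.seed(4)
r1 <- sample(1:3, 1000, replace = TRUE)
r2 <- sample(1:3, 1000, replace = TRUE)
adjustedRandIndex(r1, r2)
```

The first example shows that the index does not depend on the cluster labels themselves, only on which observations are grouped together.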

Adjusted Rand Index as a function of penalty values. Each value is based on at least 7 clustering pairs (the DepecheR default on an 8-core machine).

4.1.2 Cluster centers

This graph shows, in a heatmap format, where the cluster center is located for each of the markers that define the cluster in question. A light color indicates high expression, whereas a dark color indicates low or absent expression. A grey color, on the other hand, indicates that the marker in question did not contribute to defining the cluster. In some cases, the results might seem strange, as a cluster might have an expression very close to the center of the full dataset for a marker that nevertheless defines the cluster. This is due to an internal effect of the algorithm, which is necessary for stability reasons: a given penalty has a larger effect on a cluster with few observations than on a cluster with many observations.
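
The dependence on cluster size can be illustrated with a generic L1-penalized cluster center. The sketch below is not DepecheR's implementation. It assumes, for illustration only, that the center mu for one marker in one cluster minimizes sum((x - mu)^2) + penalty * abs(mu), which leads to a soft threshold of penalty / (2 * n) for a cluster with n observations. The function name and the numbers are made up.

```r
# Soft-thresholded center for one marker in one cluster, under the
# assumed objective sum((x - mu)^2) + penalty * abs(mu).
penalizedCenter <- function(x, penalty) {
    threshold <- penalty / (2 * length(x))
    sign(mean(x)) * max(abs(mean(x)) - threshold, 0)
}

# Two clusters with the same mean (0.5) but very different sizes.
smallCluster <- rep(c(0.4, 0.6), times = 10)    # 20 observations
largeCluster <- rep(c(0.4, 0.6), times = 1000)  # 2000 observations

# Small cluster: threshold 1, so the center is shrunk all the way to 0.
penalizedCenter(smallCluster, penalty = 40)
# Large cluster: threshold 0.01, so the center stays close to 0.5.
penalizedCenter(largeCluster, penalty = 40)
```

Under this assumption, the marker would no longer contribute to defining the small cluster, while it would still define the large one, even though both clusters have the same raw mean.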

Cluster center heatmap. Marker names on the x-axis, cluster numbers on the y-axis.

5 tSNE/umap generation

To be able to visualize the results, we need to generate a two-dimensional representation of the data used for the depeche clustering. Any suitable method, such as tSNE or UMAP, can be used for this purpose. Today I would use uwot::umap, mainly because its R implementation is considerably faster than tSNE, but we will keep tSNE here, as it represents the data well.

library(Rtsne)
testDataSNE <- Rtsne(testData[,2:15], pca=FALSE)
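
For readers who prefer UMAP, a minimal sketch with uwot::umap could look as follows. It assumes that the uwot package is installed and uses a simulated matrix in place of the 14 marker columns, so the object names markerData and markerUmap are illustrative. Default settings are used throughout.

```r
# Simulated stand-in for the 14 marker columns, for illustration only.
set.seed(6)
markerData <- matrix(rnorm(500 * 14), nrow = 500, ncol = 14)

if (requireNamespace("uwot", quietly = TRUE)) {
    # Returns a matrix with one row per cell and two columns.
    markerUmap <- uwot::umap(markerData)
    dim(markerUmap)
}
```

Such a two-column matrix should be usable in the same places as testDataSNE$Y in the plotting steps.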

6 Visualization of depeche clusters on 2D representation

Now, we want to evaluate how the different clusters are distributed on the 2D representation. To do this, we need to generate a color vector from the cluster vector in testDataDepeche. This color vector is then overlaid on the tSNE, and to make things easier to interpret, a separate legend is included as well. The legend is placed in a separate plot to make the plots easier to use for publication purposes, since for file size reasons it has been necessary to use PNG and not PDF for the plot files.

NB! Due to size restrictions, the resolution of the files in this vignette is considerably lower than that of the files normally generated by DepecheR.

dColorPlot(colorData = testDataDepeche$clusterVector, xYData = testDataSNE$Y, 
           colorScale = "dark_rainbow", plotName = "Cluster")
## png 
##   2