bluster 1.2.1

- 1 Introduction
- 2 Computing the silhouette width
- 3 Computing the neighborhood purity
- 4 Computing per-cluster RMSD
- 5 Computing graph modularity
- 6 Comparing two clusterings
- 7 Bootstrapping cluster stability
- 8 Clustering parameter sweeps
- 9 Linking clusters
- 10 Comparing multiple clusterings
- Session information

The *bluster* package provides a few diagnostics for quantitatively examining the cluster output.
These diagnostics
We will demonstrate on another dataset from the *scRNAseq* package,
clustered with graph-based methods via the `clusterRows()`

generic as described in the previous vignette.

```
library(scRNAseq)
sce <- GrunPancreasData()
# Quality control to remove bad cells.
library(scuttle)
qcstats <- perCellQCMetrics(sce)
qcfilter <- quickPerCellQC(qcstats, sub.fields="altexps_ERCC_percent")
sce <- sce[,!qcfilter$discard]
# Normalization by library size.
sce <- logNormCounts(sce)
# Feature selection.
library(scran)
dec <- modelGeneVar(sce)
hvgs <- getTopHVGs(dec, n=1000)
# Dimensionality reduction.
set.seed(1000)
library(scater)
sce <- runPCA(sce, ncomponents=20, subset_row=hvgs)
# Clustering.
library(bluster)
mat <- reducedDim(sce)
clust.info <- clusterRows(mat, NNGraphParam(), full=TRUE)
clusters <- clust.info$clusters
table(clusters)
```

```
## clusters
## 1 2 3 4 5 6 7 8 9 10 11 12
## 285 171 161 59 174 49 70 137 69 65 28 23
```

The silhouette width is a standard metric to quantify the separation between clusters generated by any procedure. A cell with a large positive width is closer to other cells from the same cluster compared to cells from different clusters. On the other hand, low or negative widths indicate that cells from different clusters are not well separated.

The exact silhouette calculation is rather computationally intensive so *bluster* implements an approximation instead.
This is provided in the `approxSilhouette()`

function, which returns the width for each cell and its closest (non-self) cluster.
Clusters consisting of cells with lower widths may warrant some more care during interpretation.

```
sil <- approxSilhouette(mat, clusters)
sil
```

```
## DataFrame with 1291 rows and 3 columns
## cluster other width
## <factor> <factor> <numeric>
## D2ex_1 1 7 0.375510
## D2ex_2 1 7 0.370596
## D2ex_3 1 7 0.409271
## D2ex_4 5 3 0.257069
## D2ex_5 5 3 0.249884
## ... ... ... ...
## D17TGFB_91 6 2 0.126608
## D17TGFB_92 1 5 0.403107
## D17TGFB_93 1 7 0.384364
## D17TGFB_94 6 2 0.243933
## D17TGFB_95 6 2 0.204860
```

`boxplot(split(sil$width, clusters))`

The function also returns the identity of the closest “other” cluster for each cell. This can be helpful to identify which clusters are easily confused to each other, based on how many of one cluster’s cells are closer to the other cluster.

```
best.choice <- ifelse(sil$width > 0, clusters, sil$other)
table(Assigned=clusters, Closest=best.choice)
```

```
## Closest
## Assigned 1 2 3 4 5 6 7 8 9 10 11 12
## 1 275 0 7 0 3 0 0 0 0 0 0 0
## 2 0 171 0 0 0 0 0 0 0 0 0 0
## 3 0 0 150 0 11 0 0 0 0 0 0 0
## 4 0 0 0 59 0 0 0 0 0 0 0 0
## 5 0 0 21 0 153 0 0 0 0 0 0 0
## 6 0 26 0 0 0 23 0 0 0 0 0 0
## 7 14 0 8 10 2 0 18 0 7 2 8 1
## 8 0 0 0 1 0 0 0 136 0 0 0 0
## 9 0 0 0 0 0 0 0 10 58 0 0 1
## 10 0 0 0 0 0 0 0 0 0 65 0 0
## 11 0 0 0 0 0 0 0 0 0 0 28 0
## 12 0 0 0 0 0 0 0 0 0 0 0 23
```

Another diagnostic uses the percentage of neighbors for each cell that belong to the same cluster. Well-separated clusters should exhibit high percentages (i.e., “purities”) as cells from different clusters do not mix. Low purities are symptomatic of overclustering where cluster boundaries become more ambiguous.

The `neighborPurity()`

function computes the purity of the neighborhood for each cell.
Clusters with systematically low purities may warrant some more care during interpretation.
By default, we perform some weighting so that large clusters do not have large purities simply because there are few cells assigned to other clusters in the dataset.

```
pure <- neighborPurity(mat, clusters)
pure
```

```
## DataFrame with 1291 rows and 2 columns
## purity maximum
## <numeric> <factor>
## D2ex_1 1.000000 1
## D2ex_2 1.000000 1
## D2ex_3 1.000000 1
## D2ex_4 0.951603 5
## D2ex_5 1.000000 5
## ... ... ...
## D17TGFB_91 0.959277 6
## D17TGFB_92 1.000000 1
## D17TGFB_93 1.000000 1
## D17TGFB_94 0.986538 6
## D17TGFB_95 0.898817 6
```

`boxplot(split(pure$purity, clusters))`

The function also returns the identity of the other cluster with the highest percentage. This can again be useful to identify the relationships between clusters based on which pairs have the greatest intermingling in their neighborhoods.

`table(Assigned=clusters, Max=pure$maximum)`

```
## Max
## Assigned 1 2 3 4 5 6 7 8 9 10 11 12
## 1 279 0 0 0 0 0 6 0 0 0 0 0
## 2 0 170 0 0 0 0 0 0 0 0 0 1
## 3 0 0 158 0 3 0 0 0 0 0 0 0
## 4 0 0 0 59 0 0 0 0 0 0 0 0
## 5 1 0 6 0 165 0 2 0 0 0 0 0
## 6 0 18 0 0 0 31 0 0 0 0 0 0
## 7 0 0 0 1 0 0 67 0 1 0 1 0
## 8 0 0 0 0 0 0 0 136 1 0 0 0
## 9 0 0 0 0 0 0 0 5 64 0 0 0
## 10 0 0 0 0 0 0 0 0 0 65 0 0
## 11 0 0 0 0 0 0 0 0 0 0 28 0
## 12 0 0 0 0 0 0 0 0 0 0 0 23
```

The root-mean-squared deviation (RMSD) for each cluster represents the dispersion of cells within each cluster. A large RMSD value indicates that a cluster has high internal heterogeneity, making it a good candidate for further subclustering.

```
rmsd <- clusterRMSD(mat, clusters)
barplot(rmsd)
```

Alternatively, we can compute the within-cluster sum of squares (WCSS), a metric commonly seen in \(k\)-means clustering. One could pick a “sensible” choice for \(k\) by computing the WCSS for a range of values and picking the value the WCSS begins to plateau - see Section 8 for more details.

`clusterRMSD(mat, clusters, sum=TRUE)`

```
## 1 2 3 4 5 6 7 8
## 60368.807 9537.386 26566.192 19009.470 30701.291 8244.528 31829.154 14682.105
## 9 10 11 12
## 10652.129 6094.776 2644.849 1956.404
```

For graph-based methods, we can compute the cluster modularity within clusters and between pairs of clusters.
Specifically, we examine the ratio of observed to expected edge weights for each pair of clusters (closely related to the modularity score used in many `cluster_*`

functions from *igraph*).
We would usually expect to see high observed weights between cells in the same cluster with minimal weights between clusters, indicating that the clusters are well-separated.
Large off-diagonal entries indicate that the corresponding pair of clusters are closely related.

```
g <- clust.info$objects$graph
ratio <- pairwiseModularity(g, clusters, as.ratio=TRUE)
library(pheatmap)
pheatmap(log10(ratio+1), cluster_cols=FALSE, cluster_rows=FALSE,
col=rev(heat.colors(100)))
```

This may be better visualized with a force-directed layout:

```
cluster.gr <- igraph::graph_from_adjacency_matrix(log2(ratio+1),
mode="upper", weighted=TRUE, diag=FALSE)
# Increasing the weight to increase the visibility of the lines.
set.seed(1100101)
plot(cluster.gr, edge.width=igraph::E(cluster.gr)$weight*5,
layout=igraph::layout_with_lgl)
```

We can also tune the resolution of the clustering *post hoc* with the `mergeCommunities()`

function.
This will iteratively merge the most closely related pair of clusters together until the desired number of clusters is reached.
For example, if we wanted to whittle down the number of clusters to 10, we could do:

```
merged <- mergeCommunities(g, clusters, number=10)
table(merged)
```

```
## merged
## 1 2 3 5 6 7 9 10 11 12
## 285 171 161 174 49 129 206 65 28 23
```

To compare two clusterings, the `pairwiseRand()`

function computes the adjusted Rand index (ARI).
High ARIs indicate that the two clusterings are similar with respect to how they partition the observations,
and an ARI of 1 means that the clusterings are identical.

```
hclusters <- clusterRows(mat, HclustParam(cut.dynamic=TRUE))
pairwiseRand(clusters, hclusters, mode="index")
```

`## [1] 0.4512108`

Of course, a single number is not particularly useful,
so `clusterRand()`

also provides the capability to break down the ARI into its contributions from each cluster or cluster pair.
Specifically, for each cluster or cluster pair in a “reference” clustering (here, `clusters`

),
we see whether it is preserved in the “alternative” clustering (here, `hclusters`

).
Large values on the diagonal indicate that the reference cluster is recapitulated;
large values off the diagonal indicate that the separation between the corresponding pair of clusters is also maintained.
Conversely, low diagonal values indicate that the corresponding cluster is fragmented in the alternative,
and low off-diagonal values can be used as a diagnostic for loss of separation.

```
ratio <- pairwiseRand(clusters, hclusters, mode="ratio")
library(pheatmap)
pheatmap(ratio, cluster_cols=FALSE, cluster_rows=FALSE,
col=viridis::viridis(100), breaks=seq(-1, 1, length=101))
```

Explicit mappings between two clusterings can be performed using `linkClusters()`

(see Section 9).
Alternatively, we can quantify the degree of “nesting” of one clustering within another with `nestedClusters()`

;
this can be useful for verifying that the higher-resolution clustering is indeed nested within its coarser counterpart.

We can use bootstrapping to evaluate the effect of sampling noise on the stability of a clustering procedure.
The `bootstrapStability()`

function will return the ARI of the original clusters against those generated from bootstrap replicates,
averaged across multiple bootstrap iterations.
High values indicate that the clustering is robust to sample noise.

```
set.seed(1001010)
ari <-bootstrapStability(mat, clusters=clusters,
mode="index", BLUSPARAM=NNGraphParam())
ari
```

`## [1] 0.7147913`

Advanced users may also set `mode="ratio"`

to obtain a more detailed breakdown of the effect of noise on each cluster (pair).

```
set.seed(1001010)
ratio <-bootstrapStability(mat, clusters=clusters,
mode="ratio", BLUSPARAM=NNGraphParam())
library(pheatmap)
pheatmap(ratio, cluster_cols=FALSE, cluster_rows=FALSE,
col=viridis::viridis(100), breaks=seq(-1, 1, length=101))
```

The `clusterSweep()`

function provides a convenient way to test multiple combinations of parameter settings.
Given a `BlusterParam`

object and a set of values for each parameter, the function will repeat the clustering ith each combination of parameters.
The example below uses graph-based clustering with a variety of `k`

as well as different community detection algorithms.

```
combinations <- clusterSweep(mat, BLUSPARAM=SNNGraphParam(),
k=c(5L, 10L, 15L, 20L), cluster.fun=c("walktrap", "louvain", "infomap"))
```

This yields a list containing all clusterings and the corresponding parameter combinations used to generate them. The function will attempt to generate some sensible name for each combination, though this may require some manual curation for large numbers of parameters.

`colnames(combinations$clusters)`

```
## [1] "k.5_cluster.fun.walktrap" "k.10_cluster.fun.walktrap"
## [3] "k.15_cluster.fun.walktrap" "k.20_cluster.fun.walktrap"
## [5] "k.5_cluster.fun.louvain" "k.10_cluster.fun.louvain"
## [7] "k.15_cluster.fun.louvain" "k.20_cluster.fun.louvain"
## [9] "k.5_cluster.fun.infomap" "k.10_cluster.fun.infomap"
## [11] "k.15_cluster.fun.infomap" "k.20_cluster.fun.infomap"
```

`combinations$parameters`

```
## DataFrame with 12 rows and 2 columns
## k cluster.fun
## <integer> <character>
## k.5_cluster.fun.walktrap 5 walktrap
## k.10_cluster.fun.walktrap 10 walktrap
## k.15_cluster.fun.walktrap 15 walktrap
## k.20_cluster.fun.walktrap 20 walktrap
## k.5_cluster.fun.louvain 5 louvain
## ... ... ...
## k.20_cluster.fun.louvain 20 louvain
## k.5_cluster.fun.infomap 5 infomap
## k.10_cluster.fun.infomap 10 infomap
## k.15_cluster.fun.infomap 15 infomap
## k.20_cluster.fun.infomap 20 infomap
```

We can combine this with some of the metrics defined above to quantify cluster separation as a function of the clustering parameters. This allows us to quickly determine which parameters have a noticeable impact on the results. In principle, we could choose the clustering with the greatest separation for further analysis; however, this tends to be disappointing as it often favors overly broad clusters.

```
set.seed(10)
nclusters <- 3:25
kcombos <- clusterSweep(mat, BLUSPARAM=KmeansParam(centers=5), centers=nclusters)
sil <- vapply(as.list(kcombos$clusters), function(x) mean(approxSilhouette(mat, x)$width), 0)
plot(nclusters, sil, xlab="Number of clusters", ylab="Average silhouette width")
```

```
pur <- vapply(as.list(kcombos$clusters), function(x) mean(neighborPurity(mat, x)$purity), 0)
plot(nclusters, pur, xlab="Number of clusters", ylab="Average purity")
```

```
wcss <- vapply(as.list(kcombos$clusters), function(x) sum(clusterRMSD(mat, x, sum=TRUE)), 0)
plot(nclusters, wcss, xlab="Number of clusters", ylab="Within-cluster sum of squares")
```

If we have many clusterings, we can identify corresponding clusters with the `linkClusters()`

function.
This constructs a graph where edges are formed between pairs of clusters from different clusterings, based on the number of cells assigned to both clusters.
Re-using some of the clusterings from our previous sweep, we might do:

```
linked <- linkClusters(
list(
walktrap=combinations$clusters$k.10_cluster.fun.walktrap,
louvain=combinations$clusters$k.10_cluster.fun.louvain,
infomap=combinations$clusters$k.10_cluster.fun.infomap
)
)
linked
```

```
## IGRAPH 6900cc8 UNW- 44 84 --
## + attr: name (v/c), weight (e/n)
## + edges from 6900cc8 (vertex names):
## [1] walktrap.3 --louvain.1 walktrap.10--louvain.2 walktrap.1 --louvain.3
## [4] walktrap.7 --louvain.3 walktrap.12--louvain.4 walktrap.2 --louvain.5
## [7] walktrap.6 --louvain.5 walktrap.8 --louvain.6 walktrap.9 --louvain.6
## [10] walktrap.3 --louvain.7 walktrap.5 --louvain.7 walktrap.4 --louvain.8
## [13] walktrap.5 --louvain.8 walktrap.7 --louvain.8 walktrap.6 --louvain.9
## [16] walktrap.8 --louvain.9 walktrap.3 --louvain.10 walktrap.5 --louvain.10
## [19] walktrap.7 --louvain.10 walktrap.11--louvain.10 walktrap.8 --infomap.1
## [22] louvain.9 --infomap.1 walktrap.1 --infomap.2 walktrap.7 --infomap.2
## + ... omitted several edges
```

This can be used to visualize the relationships between clusters, or to identify metaclusters across clusterings with community detection algorithms:

```
meta <- igraph::cluster_walktrap(linked)
plot(linked, mark.groups=meta)
```