1 Introduction

The bluster package provides a flexible and extensible framework for clustering in Bioconductor packages/workflows. At its core is the clusterRows() generic that controls dispatch to different clustering algorithms. We will demonstrate on some single-cell RNA sequencing data from the scRNAseq package; our aim is to cluster cells into cell populations based on their PC coordinates.

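# Packages providing the functions used in this section.
library(scRNAseq)
library(scran)
library(scater)
library(bluster)

sce <- ZeiselBrainData()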

# Trusting the authors' quality control, and going straight to normalization.
sce <- logNormCounts(sce)

# Feature selection based on highly variable genes.
dec <- modelGeneVar(sce)
hvgs <- getTopHVGs(dec, n=1000)

# Dimensionality reduction for work (PCA) and pleasure (UMAP).
sce <- runPCA(sce, ncomponents=20, subset_row=hvgs)
sce <- runUMAP(sce, dimred="PCA")

mat <- reducedDim(sce, "PCA")
dim(mat)
## [1] 3005   20

2 Hierarchical clustering

Our first algorithm is good old hierarchical clustering, as implemented using hclust() from the stats package. By default, the resulting dendrogram is cut at half of its height to obtain the clusters.

hclust.out <- clusterRows(mat, HclustParam())
plotUMAP(sce, colour_by=I(hclust.out))
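
For intuition, the default behaviour described above can be approximated with base R alone. This is a minimal sketch on simulated data, not bluster's actual implementation: the toy matrix and the toy.clusters name are made up for illustration, and we assume Euclidean distances with the default linkage of hclust().

```r
# Simulated stand-in for the PC matrix: two separated groups of 50 points each.
# (Hypothetical data, used only so that the example is self-contained.)
set.seed(100)
toy <- rbind(
    matrix(rnorm(100, mean=0), ncol=2),
    matrix(rnorm(100, mean=5), ncol=2)
)

# Build the dendrogram from Euclidean distances;
# hclust() uses complete linkage unless told otherwise.
tree <- hclust(dist(toy))

# Cut the tree at half of its maximum merge height.
toy.clusters <- factor(cutree(tree, h=max(tree$height)/2))
table(toy.clusters)
```

Cutting below the final merge height always yields at least two clusters, so the half-height rule is a convenient default rather than a data-driven choice.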

Advanced users can achieve greater control of the procedure by passing more parameters to the HclustParam() constructor. Here, we use Ward's criterion for the agglomeration with a dynamic tree cut from the dynamicTreeCut package.

hp2 <- HclustParam(method="ward.D2", cut.dynamic=TRUE)
hp2
## class: HclustParam
## metric: euclidean
## method: ward.D2
## cutreeDynamic
## cut.params(0):
hclust.out <- clusterRows(mat, hp2)
plotUMAP(sce, colour_by=I(hclust.out))

3 \(k\)-means clustering

Our next algorithm is \(k\)-means clustering, as implemented using the kmeans() function. This requires us to pass in the number of clusters, either as a number:

kmeans.out <- clusterRows(mat, KmeansParam(10))
plotUMAP(sce, colour_by=I(kmeans.out))
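
As a rough illustration of what a fixed number of centers means, the same idea can be sketched directly with kmeans() from the stats package. The simulated toy matrix and the choice of three centers are assumptions made for this example only, and the snippet does not reproduce any additional handling that clusterRows() may perform.

```r
# Simulated stand-in for the PC matrix: three groups of 50 points each.
# (Hypothetical data, used only so that the example is self-contained.)
set.seed(42)
toy <- rbind(
    matrix(rnorm(100, mean=0), ncol=2),
    matrix(rnorm(100, mean=5), ncol=2),
    matrix(rnorm(100, mean=10), ncol=2)
)

# k-means starts from randomly chosen centers, hence the seed above.
km <- kmeans(toy, centers=3)

# Store the assignments as a factor, one entry per row of the input.
toy.kmeans <- factor(km$cluster)
table(toy.kmeans)
```

Because the initial centers are chosen at random, the cluster labels (and occasionally the partition itself) can change between runs unless the seed is fixed.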