# 1 Installation

To install the package, please use the following.

if (!requireNamespace("BiocManager"))
install.packages("BiocManager")
BiocManager::install("mbkmeans")

# 2 Introduction

This vignette provides an introductory example on how to work with the mbkmeans package, which contains an implementation of the mini-batch k-means algorithm proposed in (Sculley 2010) for large single-cell sequencing data.

The main function to be used by the users is mbkmeans. This is implemented as an S4 generic and methods are implemented for matrix, Matrix, HDF5Matrix, DelayedMatrix, SummarizedExperiment, and SingleCellExperiment.

Most of this work was inspired by the MiniBatchKmeans() function implemented in the ClusterR R package and we re-use many of the C++ functions implemented there.

Our main contribution here is to provide an interface to the DelayedArray and HDF5Array packages, allowing the user to run the mini-batch k-means algorithm on data that do not fit entirely in memory.

The motivation for this work is the clustering of large single-cell RNA-sequencing (scRNA-seq) datasets, and hence the main focus is on Bioconductor’s SingleCellExperimentand SummarizedExperiment data container. For this reason, mbkmeans assumes a data representation typical of genomic data, in which genes (variables) are in the rows and cells (observations) are in the column. This is contrary to most other statistical applications, and notably to the stats::kmeans() and ClusterR::MiniBatchKmeans() functions that assume observations in rows.

We provide a lower-level mini_batch() function that expects observations in rows and is expected to be a direct a replacement of ClusterR::MiniBatchKmeans() for on-disk data representations such as HDF5 files. The rest of this document shows the typical use case through the mbkmeans() interface, users interested in the mini_batch() function should refer to its manual page.

## 2.1 Example dataset

To illustrate a typical use case, we use the pbmc4k dataset of the TENxPBMCData package. This dataset contains a set of about 4,000 cells from peripheral blood from a healthy donor and is expected to contain many types or clusters of cell.

Note that in this vignette, we do not aim at identifying biologically meaningful clusters here (that would entail a more sophisticated normalization of data and dimensionality reduction), but insted we only aim to show how to run mini-batch k-means on a large HDF5-backed matrix.

We normalize the data simply by scaling for the total number of counts using scater and select the 1,000 most variable genes and a random set of 100 cells to speed-up computations.

tenx_pbmc4k <- TENxPBMCData(dataset = "pbmc4k")

set.seed(1034)
idx <- sample(seq_len(NCOL(tenx_pbmc4k)), 100)
sce <- tenx_pbmc4k[, idx]

#normalization
sce <- normalize(sce)

vars <- rowVars(logcounts(sce))
names(vars) <- rownames(sce)
vars <- sort(vars, decreasing = TRUE)

sce1000 <- sce[names(vars)[1:1000],]

sce1000
## class: SingleCellExperiment
## dim: 1000 100
## assays(2): counts logcounts
## rownames(1000): ENSG00000090382 ENSG00000204287 ... ENSG00000117748
##   ENSG00000135968
## rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
## colnames: NULL
## colData names(11): Sample Barcode ... Individual Date_published
## reducedDimNames(0):
## spikeNames(0):

# 3mbkmeans

The main function, mbkmeans(), returns a list object including centroids, WCSS_per_cluster (where WCSS stands for within-cluster-sum-of-squares), best_initialization, iters_per_initiazation and Clusters.

It takes any matrix-like object as input, such as SummarizedExperiment, SingleCellExperiment, matrix, DelayedMatrix and HDF5Matrix.

In this example, the input is a SingleCellExperiment object.

res <- mbkmeans(sce1000, clusters = 5,
reduceMethod = NA,
whichAssay = "logcounts")

The number of clusters (such as k in the k-means algorithm) is set through the clusters argument. In this case, we set clusters = 5 for no particular reason. For SingleCellExperiment objects, the function provides the reduceMethod and whichAssay arguments. The reduceMethod argument should specify the dimensionality reduction slot to use for the clustering, and the default is “PCA”. Note that this does not perform PCA but only looks at a slot called “PCA” already stored in the object. Alternatively, one can specify whichAssay as the assay to use as input to mini-batch k-means. This is used only when reduceMethod option is NA. See ?mbkmeans for more details.

## 3.1 Choice of arguments

There are additional arguements in mbkmeans() function that make the function more flexible and suitable for more situations.

### 3.1.1 Batch size

The size of the mini batches is set through the batch_size argument. The default value uses the blocksize() function. The blocksize() function considers both the number of columns in the dataset and the amount of RAM on the current matchine to calculate as large of a batch size as reasonable for the RAM available to the session. The calculation uses get_ram() function in benchmarkme package. See the benchmarkme vignette for more details.

batchsize <- blocksize(sce1000)
batchsize
## [1] 100

In this case, as the whole data fits in memory, the default batch size would be a single batch of size 100.

### 3.1.2 Initialization

The performance of mini-batch k-means greatly depends on the process of initialization. We implemented two different initialization methods:

1. Random initialization, as in regular k-means;
2. kmeans++, as proposed in (Arthur and Vassilvitskii 2007). The default is “kmeans++”.

The percentage of data to use for the initialization centroids is set through the init_fraction argument, which should be larger than 0 and less than 1, with default value of 0.25.

res_random <- mbkmeans(sce1000, clusters = 5,
reduceMethod = NA,
whichAssay = "logcounts",
initializer = "random")
table(res$Clusters, res_random$Clusters)
##
##      1  2  3  4  5
##   1  0  0  0  0  1
##   2  7 12  0  2  0
##   3  9  0 41  0  0
##   4  0  0  0  0 27
##   5  0  0  1  0  0

# 4 Comparison with k-means

Note that if we set init_fraction=1, initializer = "random", and batch_size=ncol(x), we recover the classic k-means algorithm.

res_full <- mbkmeans(sce1000, clusters = 5,
reduceMethod = NA,
whichAssay = "logcounts",
initializer = "random",
batch_size = ncol(sce1000))
res_classic <- kmeans(t(logcounts(sce1000)), centers = 5)
table(res_full$Clusters, res_classic$cluster)
##
##      1  2  3  4  5
##   1  0  0 14  0 42
##   2  0  0  0  1  0
##   3  0  0  0  1  0
##   4  0  7  0  0  9
##   5 10  0  0 16  0

# References

Arthur, David, and Sergei Vassilvitskii. 2007. “K-Means++: The Advantages of Careful Seeding.” In Proceedings of the Eighteenth Annual Acm-Siam Symposium on Discrete Algorithms, 1027–35. Society for Industrial; Applied Mathematics.

Sculley, David. 2010. “Web-Scale K-Means Clustering.” In Proceedings of the 19th International Conference on World Wide Web, 1177–8. ACM.