To install the package, please use the following.

```
if (!requireNamespace("BiocManager"))
install.packages("BiocManager")
BiocManager::install("mbkmeans")
```

This vignette provides an introductory example on how
to work with the `mbkmeans`

package, which contains an
implementation of the mini-batch k-means algorithm
proposed in (Sculley 2010) for large single-cell
sequencing data.

The main function to be used by the users is `mbkmeans`

.
This is implemented as an S4 generic and methods are
implemented for `matrix`

, `Matrix`

, `HDF5Matrix`

,
`DelayedMatrix`

, `SummarizedExperiment`

, and
`SingleCellExperiment`

.

Most of this work was inspired by the
`MiniBatchKmeans()`

function implemented in the
`ClusterR`

R package and we re-use many of the
C++ functions implemented there.

Our main contribution here is to provide an
interface to the `DelayedArray`

and `HDF5Array`

packages, allowing the user to run the mini-batch
*k*-means algorithm on data that do not fit
entirely in memory.

The motivation for this work is the clustering of
large single-cell RNA-sequencing (scRNA-seq) datasets,
and hence the main focus is on Bioconductor’s
`SingleCellExperiment`

and `SummarizedExperiment`

data container. For this reason, `mbkmeans`

assumes a
data representation typical of genomic data, in which
genes (variables) are in the rows and cells
(observations) are in the column. This is contrary to
most other statistical applications, and notably to
the `stats::kmeans()`

and `ClusterR::MiniBatchKmeans()`

functions that assume observations in rows.

We provide a lower-level `mini_batch()`

function that
expects observations in rows and is expected to be a direct
a replacement of `ClusterR::MiniBatchKmeans()`

for on-disk
data representations such as `HDF5`

files. The rest of
this document shows the typical use case through the
`mbkmeans()`

interface, users interested in the
`mini_batch()`

function should refer to its manual page.

To illustrate a typical use case, we use the
`pbmc4k`

dataset of the
`TENxPBMCData`

package.
This dataset contains a set of about 4,000 cells from
peripheral blood from a healthy donor and is expected
to contain many types or clusters of cell.

Note that in this vignette, we do not aim at identifying
biologically meaningful clusters here (that would
entail a more sophisticated normalization of data and
dimensionality reduction), but insted we only aim to show
how to run mini-batch *k*-means on a large HDF5-backed
matrix.

We normalize the data simply by scaling for the total
number of counts using `scater`

and select the 1,000
most variable genes and a random set of 100 cells to
speed-up computations.

```
tenx_pbmc4k <- TENxPBMCData(dataset = "pbmc4k")
set.seed(1034)
idx <- sample(seq_len(NCOL(tenx_pbmc4k)), 100)
sce <- tenx_pbmc4k[, idx]
#normalization
sce <- normalize(sce)
vars <- rowVars(logcounts(sce))
names(vars) <- rownames(sce)
vars <- sort(vars, decreasing = TRUE)
sce1000 <- sce[names(vars)[1:1000],]
sce1000
```

```
## class: SingleCellExperiment
## dim: 1000 100
## metadata(1): log.exprs.offset
## assays(2): counts logcounts
## rownames(1000): ENSG00000090382 ENSG00000204287 ... ENSG00000117748
## ENSG00000135968
## rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
## colnames: NULL
## colData names(11): Sample Barcode ... Individual Date_published
## reducedDimNames(0):
## spikeNames(0):
```

`mbkmeans`

The main function, `mbkmeans()`

, returns a
list object including `centroids`

,
`WCSS_per_cluster`

(where WCSS stands for
within-cluster-sum-of-squares),
`best_initialization`

, `iters_per_initiazation`

and `Clusters`

.

It takes any matrix-like object as input, such
as `SummarizedExperiment`

, `SingleCellExperiment`

,
`matrix`

, `DelayedMatrix`

and `HDF5Matrix`

.

In this example, the input is a `SingleCellExperiment`

object.

```
res <- mbkmeans(sce1000, clusters = 5,
reduceMethod = NA,
whichAssay = "logcounts")
```

The number of clusters (such as *k* in the
*k*-means algorithm) is set through the `clusters`

argument. In this case, we set `clusters = 5`

for
no particular reason. For `SingleCellExperiment`

objects, the function provides the `reduceMethod`

and `whichAssay`

arguments. The `reduceMethod`

argument
should specify the dimensionality reduction slot
to use for the clustering, and the default is
“PCA”. Note that this *does not perform* PCA but
only looks at a slot called “PCA” already stored
in the object. Alternatively, one can specify
`whichAssay`

as the assay to use as input to
mini-batch *k*-means. This is used only when
`reduceMethod`

option is `NA`

. See `?mbkmeans`

for more details.

There are additional arguements in `mbkmeans()`

function that make the function more flexible
and suitable for more situations.

The size of the mini batches is set through the
`batch_size`

argument. The default value uses the
`blocksize()`

function. The `blocksize()`

function
considers both the number of columns in the dataset
and the amount of RAM on the current matchine to
calculate as large of a batch size as reasonable for
the RAM available to the session. The calculation uses
`get_ram()`

function in `benchmarkme`

package. See
the `benchmarkme`

vignette for more details.

```
batchsize <- blocksize(sce1000)
batchsize
```

`## [1] 100`

In this case, as the whole data fits in memory, the default batch size would be a single batch of size 100.

The performance of mini-batch *k*-means greatly
depends on the process of initialization. We
implemented two different initialization methods:

- Random initialization, as in regular
*k*-means; `kmeans++`

, as proposed in (Arthur and Vassilvitskii 2007). The default is “kmeans++”.

The percentage of data to use for the initialization
centroids is set through the `init_fraction`

argument,
which should be larger than 0 and less than 1, with
default value of 0.25.

```
res_random <- mbkmeans(sce1000, clusters = 5,
reduceMethod = NA,
whichAssay = "logcounts",
initializer = "random")
table(res$Clusters, res_random$Clusters)
```

```
##
## 1 2 3 4 5
## 1 0 0 0 0 1
## 2 7 12 0 2 0
## 3 9 0 41 0 0
## 4 0 0 0 0 27
## 5 0 0 1 0 0
```

Note that if we set `init_fraction=1`

,
`initializer = "random"`

, and `batch_size=ncol(x)`

,
we recover the classic *k*-means algorithm.

```
res_full <- mbkmeans(sce1000, clusters = 5,
reduceMethod = NA,
whichAssay = "logcounts",
initializer = "random",
batch_size = ncol(sce1000))
res_classic <- kmeans(t(logcounts(sce1000)), centers = 5)
table(res_full$Clusters, res_classic$cluster)
```

```
##
## 1 2 3 4 5
## 1 0 0 14 0 42
## 2 0 0 0 1 0
## 3 0 0 0 1 0
## 4 0 7 0 0 9
## 5 10 0 0 16 0
```

Arthur, David, and Sergei Vassilvitskii. 2007. “K-Means++: The Advantages of Careful Seeding.” In *Proceedings of the Eighteenth Annual Acm-Siam Symposium on Discrete Algorithms*, 1027–35. Society for Industrial; Applied Mathematics.

Sculley, David. 2010. “Web-Scale K-Means Clustering.” In *Proceedings of the 19th International Conference on World Wide Web*, 1177–8. ACM.