We will now go into more detail about important options for the main parts of the clustering workflow.
clusterMany
In the quick start section we picked some simple and familiar clustering options that would run quickly and needed little explanation. However, our workflow generally assumes more complex options and more parameter variations are tried. Before getting into the specific options of clusterMany, let us first describe some of these more complicated setups, since many of the arguments of clusterMany depend on understanding them.
Base clustering algorithms and the ClusterFunction class
This package is meant to be able to use and compare different clustering routines. However, the required input, arguments, etc., of different clustering algorithms vary greatly. We created the ClusterFunction class so that the information needed to fit an algorithm into our workflow is well defined, while the other details of the algorithm can be ignored by the workflow. In general, users will not need to know the details of this class if they want to use the built-in functions provided by the package, which can be accessed by character values. To see the set of character values that correspond to built-in functions,
listBuiltInFunctions()
## [1] "pam" "clara" "kmeans" "hierarchical01"
## [5] "hierarchicalK" "tight" "spectral" "mbkmeans"
If you are interested in implementing your own ClusterFunction object, after reading this section look at our [example](#customAlgorithm) below.
There are some important features of any clustering algorithm that are encoded in the ClusterFunction object, and it is important to understand them because they affect which algorithms can be used at different parts of the workflow.
algorithmType
We group together algorithms that cluster based on common strategies that affect how we can use them in our workflow. Currently there are two “types” of algorithms we consider, which we call type “K” and “01”. We can determine the type of a builtin function by the following:
algorithmType(c("kmeans","hierarchicalK","hierarchical01"))
## kmeans hierarchicalK hierarchical01
## "K" "K" "01"
The “K” algorithms are so called because their main parameter requirement is that the user specify the number of clusters (\(K\)) to be created; they require an input of k to the clustering function. Built-in “K” algorithms are:
listBuiltInTypeK()
## [1] "pam" "clara" "kmeans" "hierarchicalK"
## [5] "spectral" "mbkmeans"
The “01” algorithms are so named because they assume the input is a matrix \(D\) of dissimilarities between samples, with the dissimilarities on a scale of 0-1. These clustering functions use this fact to make the primary user-specified parameter not the number of final clusters, but a measure \(\alpha\) of how dissimilar samples in the same cluster can be (on a scale of 0-1). Given \(\alpha\), the algorithm then determines the clusters (so \(\alpha\) implicitly determines \(K\)). These methods rely on the assumption that, because the 0-1 scale has special significance, the user can more easily determine the level of dissimilarity allowed within a true cluster than predetermine the number of clusters \(K\). The current 01 methods are:
listBuiltInType01()
## [1] "hierarchical01" "tight"
requiredArgs
The different algorithm types correspond to requiring different input types (k versus alpha). This is usually sorted out by clusterMany, which will only dispatch the appropriate one. Clustering functions can also have additional required arguments. See below for more discussion about how these arguments can be passed along to clusterMany or RSEC.
To see all of the required arguments of a function,
requiredArgs(c("hierarchical01","hierarchicalK"))
## $hierarchical01
## [1] "alpha"
##
## $hierarchicalK
## [1] "k"
Internal clustering procedures
clusterMany iteratively calls a function clusterSingle over the collection of parameters. clusterSingle is the clustering workhorse and may be used directly by users who want more fine-grained control; see the documentation of clusterSingle.
Within each call of clusterSingle, there are three possible steps, depending on the value of the arguments subsample and sequential:
- Subsampling (subsampleClustering) – if subsample=TRUE
- Main Clustering (mainClustering)
- Sequential (seqCluster) – if sequential=TRUE
If both sequential and subsample are FALSE, then step 1 and step 3 are skipped and clusterSingle just calls mainClustering (step 2), resulting in a basic clustering routine applied to the input data. If subsample=TRUE, then step 1 (subsampleClustering) is called, which subsamples the input data and clusters each subsample to calculate a co-occurrence matrix. That co-occurrence matrix is used as the input for mainClustering (step 2). If sequential=TRUE, this process (step 1 then step 2 if subsample=TRUE, or just step 2 if subsample=FALSE) is repeated to iteratively select the best clusters (see ?seqCluster for a detailed description). Each of these steps has a function that goes with it, noted above, but these functions should not generally be called directly by the user. However, their documentation can be useful.
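As a sketch of how these flags combine (assuming a ClusterExperiment object ce like the one used throughout this vignette; argument names follow ?clusterSingle):

```r
## Step 2 only: one basic clustering of the input data
cBasic <- clusterSingle(ce, subsample=FALSE, sequential=FALSE,
    mainClusterArgs=list(clusterFunction="pam", clusterArgs=list(k=5)))

## Steps 1 + 2: cluster subsamples to build a co-occurrence matrix,
## then cluster the resulting dissimilarity D in the main step
cSub <- clusterSingle(ce, subsample=TRUE, sequential=FALSE,
    subsampleArgs=list(clusterFunction="pam", clusterArgs=list(k=5)),
    mainClusterArgs=list(clusterFunction="hierarchical01",
        clusterArgs=list(alpha=0.1)))
```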
In particular, arguments to these three functions that are not set by clusterMany can be passed via named lists to the arguments: subsampleArgs, mainClusterArgs, and seqArgs. Some of the arguments to these functions can be varied in clusterMany, but more esoteric ones should be sent as part of the named list of parameters given to clusterMany; those named lists will be fixed for all parameter combinations tried in clusterMany.
Main Clustering (Step 2): mainClustering
The main clustering (step 2) described above is done by the function mainClustering. In addition to running the basic clustering algorithms on the input data, we also implement many other common cluster-processing steps that are relevant to the result of the clustering. We have already seen such an example with dimensionality reduction, where the input to the clustering is determined by different reductions of the input data. Many of the arguments to mainClustering are arguments to clusterMany as well, so that mainClusterArgs is usually not needed. The main exception would be to send more esoteric arguments to the underlying clustering function called in the main clustering step. The syntax for this would be to give a nested list to the argument mainClusterArgs:
clusterMany(x, clusterFunction="hierarchicalK", ...,
    mainClusterArgs=list(clusterArgs=list(method="single")))
Here we change the argument method in the clustering function hclust called by the hierarchicalK function to single.
Subsampling (Step 1): subsampleClustering
A more significant process that can be coupled with any clustering algorithm is to repeatedly subsample the data and cluster each subsample. This creates an \(n \times n\) matrix \(S\) of co-clustering percentages – how many times two samples clustered together over the subsamples (there are slight variations in how this can be calculated; see the help pages of subsampleClustering). This does not itself give a clustering, but the resulting \(S\) matrix can then form the basis for clustering the samples. Specifically, the matrix \(D=1-S\) is then given as input to the main clustering step described above. The subsampling option is computationally expensive and, when coupled with comparing many parameters, does result in a lengthy evaluation of clusterMany. However, we recommend it, if the number of samples is not too large, as one of the most useful methods for getting stable clustering results.
Note that the juxtaposition of step 1 then step 2 (subsampling and then feeding the results to the main clustering function) means there are actually two different possible clustering algorithms (and sets of corresponding parameters) – one for clustering the subsampled data, and one for clustering the resulting \(D\) based on the percentage of co-clustering of samples. This brings up a restriction on the clustering function in the main clustering step – it needs to be able to handle input that is a dissimilarity (its inputType must be diss or either).
Furthermore, the user might want to set clustering function and corresponding parameters separately for step 1 and step 2. The way that clusterMany handles this is that the main arguments of clusterMany focus on varying the parameters related to step 2 (the main clustering step, i.e. the clustering of \(D\) after subsampling). For this reason, the argument clusterFunction in clusterMany varies the clustering function used by the main clustering (step 2), not the subsampling step. The clustering function of the subsampling (step 1) can be specified by the user via subsampleArgs, but in this case it is set for all calls of clusterMany and does not vary. Alternatively, if the user doesn’t specify the clusterFunction in subsampleArgs then the default is to use clusterFunction of the main clustering step along with any required arguments given by the user for that function (there are some cases where using the clusterFunction of the main step is not possible for the subsampling step, in which case the default is to use “pam”).
More generally, since few of the arguments to subsampleClustering are allowed to be varied by the direct arguments to clusterMany, it is also more common to want to change these arguments via the argument subsampleArgs. Examples might be resamp.num (the number of subsamples to draw) or samp.p (the proportion of samples to draw in each subsample) – see ?subsampleClustering for a full documentation of the possible arguments. In addition, there are arguments to be passed to the underlying clustering function; like for mainClustering, these arguments would be a nested list to the argument subsampleArgs.
An example of a syntax that sets the arguments for subsampleClustering would be:
clusterMany(x, ..., subsampleArgs=list(resamp.num=100, samp.p=0.5,
    clusterFunction="hierarchicalK", clusterArgs=list(method="single")))
Sequential Detection of Clusters (Step 3): seqCluster
Another complicated addition that can be added to the main clustering step is the implementation of sequential clustering. This refers to clustering the data, removing the “best” cluster, re-clustering the remaining samples, and continuing this iteration until all samples are clustered (or the algorithm in some other way calls a stop). Such sequential clustering can be convenient when there is, for example, a very dominant cluster that is far away from the other mass of data. Removing samples in such clusters and re-clustering can sometimes be more productive and result in a clustering more robust to the choice of samples. A particular implementation of such a sequential method, based upon (Tseng and Wong 2005), is implemented in the clusterExperiment package when the option sequential=TRUE is chosen (see ?seqCluster for documentation of how the iteration is done). Sequential clustering can also be quite computationally expensive, particularly when paired with subsampling to determine \(D\) at each step of the iteration.
Because of the iterative nature of the sequential step, there are many possible parameters (see ?seqCluster). Like subsample clustering, clusterMany does not allow variation of very many of these parameters, but they can be set via passing arguments in a named list to seqArgs. An example of a syntax that sets the arguments for seqCluster would be:
clusterMany(x,..., seqArgs=list( remain.n=10))
This code changes the remain.n option of the sequential step, which governs when the sequential step stops because there are not enough samples remaining.
Arguments of clusterMany
Now that we’ve explained the underlying architecture of the clustering provided in the package, and how to set the arguments that can’t be varied, we discuss the parameters that can be varied in clusterMany. (There are a few additional arguments available for clusterMany that govern how clusterMany works, but right now we focus on only the ones that can be given multiple options).
Recall that arguments in clusterMany that take on multiple values mean that the combinations of all the multiple valued arguments will be given as input for a clustering routine. These arguments are:
sequential This parameter consists of logical values, TRUE and/or FALSE, indicating whether the sequential strategy should be implemented or not.
subsample This parameter consists of logical values, TRUE and/or FALSE, indicating whether the subsampling strategy for determining \(D\) should be implemented or not.
clusterFunction The clustering functions to be tried in the main clustering step. Recall that if subsample=TRUE is part of the combination, then clusterFunction is the method that will be used on the matrix \(D\) created from subsampling the data. Otherwise, clusterFunction is the clustering method that will be used directly on the data.
ks The argument ks is interpreted differently for different choices of the other parameters and can differ between parameter combinations! If sequential=TRUE is part of the parameter combination, ks defines the argument k0 of sequential clustering (see ?seqCluster), which is approximately the initial starting point for the number of clusters in the sequential process. Otherwise, ks sets k of both the main clustering step and (by default) that of the subsampled data, and is only relevant if clusterFunction is of type “K”. When findBestK=TRUE is part of the combination, ks instead defines the range of values to search for the best k (see the details in the documentation of clusterMany for more).
reduceMethod These are character strings indicating what choices of dimensionality reduction should be tried. They can indicate any combination of filtering statistics or dimensionality reductions. The character strings can either refer to built-in methods, meaning clusterMany will do the necessary calculations and save the results as an initial step, OR the vector can refer to filtering statistics/dimensionality reductions that have already been calculated and saved in the object (see [above](#dimReduce) for more information). The vector cannot be a combination of these two.
If either a dimensionality reduction or a filtering statistic is chosen, the following parameters can also be varied to indicate the number of such features to be used (with a vector of values meaning all will be tried):
distFunction These are character values giving functions that provide a distance matrix between the samples, when applied to the data. These functions should be accessible in the global environment (clusterMany applies get to the global environment to access these functions). To make them compatible with the standard R function dist, these functions should assume the samples are in the rows, i.e. they should work when applied to t(assay(ce)). We give an example in the next subsection below.
minSizes These are integer values determining the minimum size required for a cluster (passed to the mainClustering part of clustering).
alphas These are the \(\alpha\) parameters for “01” clustering techniques; these values are only relevant if one of the clusterFunction values is a “01” clustering algorithm. The values given to alphas should be between 0 and 1, with smaller values requiring greater similarity between samples in the same cluster.
betas These are the \(\beta\) parameters for sequential clustering; these values are only relevant if sequential=TRUE and determine the level of stability required between changes in the parameters to determine that a cluster is stable.
findBestK This option is for “K” clustering techniques, and indicates that \(K\) should be chosen automatically as the \(K\) that gives the largest silhouette distance between clusters.
removeSil A logical value as to whether samples with small silhouette distance to their assigned cluster are “removed”, in the sense that they are not given their original cluster assignment but instead assigned -1. This option is for “K” clustering techniques as a method of removing poorly clustered samples.
silCutoff If removeSil is TRUE, then silCutoff determines the cutoff on silhouette distance for unassigning the sample.
clusterMany tries to have a generally simple interface, and for this reason makes choices about what is meant by certain combinations of parameters. For example, in combinations where findBestK=TRUE, ks=2:10 is taken to mean that the clustering should find the best \(k\) out of the range 2-10. However, in other parameter combinations where findBestK=FALSE, the same ks might indicate the specific number of clusters, \(K\), that should be found. To see the parameter choices that will be run, the user can set run=FALSE, and the output will be a matrix of the parameter values implied by the choices of the user. For parameter combinations that are not what is desired, the user should consider making direct calls to clusterSingle, where all of these option combinations (and many more) can be explicitly specified.
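For instance, a sketch of previewing the combinations before committing to a long run (assuming ce as elsewhere in this vignette; see ?clusterMany for the exact structure of the returned object):

```r
## With run=FALSE no clustering is performed; the parameter
## combinations that would be run are returned instead
paramCheck <- clusterMany(ce, ks=2:5,
    clusterFunction=c("pam","hierarchicalK"),
    findBestK=c(TRUE,FALSE), run=FALSE)
## inspect the matrix of parameter combinations
paramCheck$paramMatrix
```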
Other parameters for the clustering are kept fixed. As described above, there are many more possible parameters in play than are considered in clusterMany. These parameters can be set via the arguments mainClusterArgs, subsampleArgs and seqArgs. These arguments correspond to the different processes described above (the main clustering step, the creation of \(D\) to be clustered via subsampling, and the sequential clustering process, respectively). Each takes a list of arguments that are sent directly to clusterSingle. However, these arguments may be overridden by clusterMany’s interpretation of how different combinations interact; again, for complete control, direct calls to clusterSingle are necessary.
Example changing the distance function
Providing different distance functions is slightly more involved than the other parameters, so we give an example here.
First we define distances that we would like to compare. We are going to define two distances that take values between 0-1 based on different choices of correlation.
corDist<-function(x){(1-cor(t(x),method="pearson"))/2}
spearDist<-function(x){(1-cor(t(x),method="spearman"))/2}
These distances are defined so as to give distance of 0 between samples with correlation 1, and distance of 1 for correlation -1.
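A quick sanity check of these properties on simulated data (corDist is re-defined here so the snippet is self-contained):

```r
corDist <- function(x){ (1 - cor(t(x), method="pearson"))/2 }

set.seed(1)
m <- matrix(rnorm(40), nrow=8)    # 8 samples (rows) x 5 features
d <- corDist(m)
all(d >= 0 & d <= 1)              # TRUE: distances lie on the 0-1 scale
all(abs(diag(d)) < 1e-12)         # TRUE: each sample is at distance 0 from itself
```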
We will also compare using different algorithms for clustering. Currently, clusterMany requires that the distances work with all of the clusterFunction choices given. Since some of the clusterFunction algorithms require a distance matrix with values between 0 and 1, this means we can only compare all of the algorithms when the distance is a 0-1 distance. (Future versions may try to create a workaround so that clusterMany just skips algorithms that don’t match the distance.) Since the distances we defined are between 0-1, however, we can use any algorithm that takes dissimilarities as input.
Note on 0-1 clustering when subsample=FALSE We would note that the default values of \(\alpha\) in clusterMany and RSEC for the 0-1 clustering were set with a \(D\) that is the result of subsampling or another consensus summary in mind. In general, subsampling creates a \(D\) matrix with high similarity for many samples that share a cluster (the proportion of times well-clustered samples are seen together can easily be in the 0.8-0.95 range, or even exactly 1). For this reason the default \(\alpha\) is 0.1, which requires distances between samples in the same cluster to be around 0.1 or less (i.e. a similarity in the range of 0.9 or more).
To illustrate this point, we show an example of the \(D\) matrix from subsampling. To do this we make use of clusterSingle, the workhorse mentioned above that runs a single clustering command directly; it gives the output \(D\) from the subsampling in the “coClustering” slot when we set saveSubsamplingMatrix=TRUE (and therefore we save the result as a separate object, so that it doesn’t write over the existing “coClustering” slot in ce). Note that the result is \(1-p_{ij}\), where \(p_{ij}\) is the proportion of times samples \(i\) and \(j\) clustered together.
ceSub<-clusterSingle(ce, reduceMethod="mad", nDims=1000, subsample=TRUE,
    subsampleArgs=list(clusterFunction="pam", clusterArgs=list(k=8)),
    clusterLabel="subsamplingCluster",
    mainClusterArgs=list(clusterFunction="hierarchical01",
        clusterArgs=list(alpha=0.1), minSize=5),
    saveSubsamplingMatrix=TRUE)
plotCoClustering(ceSub)

We see that even here, the default of \(\alpha=0.1\) was perhaps too conservative, since only two clusters came out (at least with size greater than 5).
However, distances based on correlation calculated directly on the data, such as those we created above, are also often used for clustering expression data directly (i.e. without the subsampling step). But they are unlikely to have dissimilarities as low as those seen in subsampling, even for well-clustered samples. Here’s a visualization of the correlation distance matrix we defined above (using Spearman’s correlation) on the top 1000 most variable features:
dSp<-spearDist(t(transformData(ce,reduceMethod="mad",nFilterDims=1000)))
plotHeatmap(dSp,isSymmetric=TRUE)

We can see that the choice of \(\alpha\) must be much higher (and we are likely to be more sensitive to it).
Notice that to calculate the distance in the above plot, we made use of the transformData function applied to our ce object to get the results of dimensionality reduction. The transformData function gave us back a data matrix that has been transformed, and also reduced in dimensions, as would be done in our clustering routines. transformData has similar parameters to those seen in clusterMany, makeDendrogram or clusterSingle, and is useful when you want to manually apply something to transformed and/or dimensionality-reduced data; you can be sure you are getting back the same matrix of data that the clustering algorithms are using.
Comparing distance functions with clusterMany Now that we have defined the distances we want to compare in our global environment, we can give these to the argument distFunction in clusterMany. They should be given as character strings giving the names of the functions. For computational ease in this vignette, we will just choose the dimensionality reduction to be the top 1000 features based on MAD and set \(K=8\) or \(\alpha=0.45\).
We have not yet calculated the “mad” filtering statistic for this object. clusterMany does not let you mix and match between uncalculated and stored filters (or dimensionality reductions), so our first step is to store the mad results. We will save these results as a separate object so as not to disrupt the earlier workflow.
ceDist<-makeFilterStats(ce,filterStats="mad")
ceDist
## class: ClusterExperiment
## dim: 7069 65
## reducedDimNames: PCA
## filterStats: var var_makeConsensus.final mad
## -----------
## Primary cluster type: mergeClusters
## Primary cluster label: mergeClusters
## Table of clusters (of primary clustering):
## -1 m01 m02 m03 m04
## 8 27 14 8 8
## Total number of clusterings: 41
## Dendrogram run on 'makeConsensus,final' (cluster index: 2)
## -----------
## Workflow progress:
## clusterMany run? Yes
## makeConsensus run? Yes
## makeDendrogram run? Yes
## mergeClusters run? Yes
ceDist<-clusterMany(ceDist, k=7:9, alpha=c(0.35,0.4,0.45),
clusterFunction=c("tight","hierarchical01","pam","hierarchicalK"),
findBestK=FALSE,removeSil=c(FALSE), distFunction=c("corDist","spearDist"),
makeMissingDiss=TRUE,
reduceMethod=c("mad"),nFilterDims=1000,run=TRUE)
## 48 parameter combinations, 0 use sequential method, 0 use subsampling method
## Calculating the 2 Requested Distance Matrices needed to run clustering comparisons (if 'distFunction=NA', then needed because of choices of clustering algorithms and 'makeMissingDiss=TRUE').
## done.
##
## Running Clustering on Parameter Combinations...
## done.
clusterLabels(ceDist)<-gsub("clusterFunction","alg",clusterLabels(ceDist))
clusterLabels(ceDist)<-gsub("Dist","",clusterLabels(ceDist))
clusterLabels(ceDist)<-gsub("distFunction","dist",clusterLabels(ceDist))
clusterLabels(ceDist)<-gsub("hierarchical","hier",clusterLabels(ceDist))
par(mar=c(1.1,15.1,1.1,1.1))
plotClusters(ceDist,axisLine=-2,colData=c("Biological_Condition"))

Notice that using the “tight” methods did not give relevant results (no samples were clustered).
Example using a user-defined clustering algorithm
Here, we show how to use a user-defined clustering algorithm in clusterSingle. Our clustering algorithm will be a simple nearest-neighbor clustering.
To do so, we need to create a ClusterFunction object that defines our algorithm. ClusterFunction objects recognize clustering algorithms of two different types, based on the required input from the user: 01 or K (see the [ClusterFunction](#ClusterFunction) section above for more). Type K refers to a clustering algorithm where the user must specify the number of clusters as an input parameter, and this is the type of algorithm we will implement (though, as we’ll see, our clustering algorithm doesn’t actually have the user specify the number of clusters…).
First, we need to define a wrapper function that performs the clustering. Here, we define a simple shared nearest-neighbor clustering using functions from the scran and the igraph packages.
library(scran)
library(igraph)
SNN_wrap <- function(inputMatrix, k, steps = 4, ...) {
    snn <- buildSNNGraph(inputMatrix, k = k, d = NA, transposed = FALSE) ## scran package
    res <- cluster_walktrap(snn, steps = steps) ## igraph package
    return(res$membership)
}
Here the argument k defines the number of nearest-neighbors to use in constructing the nearest neighbor graph.
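A quick check of the wrapper on simulated data (a sketch; requires the scran and igraph packages, and the cluster assignments will depend on the random data):

```r
set.seed(3)
toy <- matrix(rnorm(50 * 40), nrow=50)  # 50 features (rows) x 40 samples (columns)
labels <- SNN_wrap(toy, k=5)
length(labels)  # one cluster assignment per sample, i.e. 40
table(labels)   # cluster sizes
```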
To create a type K algorithm, the wrapper must have two required arguments:
- an argument for the input data. This can be x if the input is a matrix with the \(N\) samples in the columns and the features in the rows, or diss if the input is expected to be an \(N \times N\) dissimilarity matrix. (Both x and diss can be given as parameters if the algorithm handles either one.)
- a parameter k specifying the number of clusters (or any other integer-valued parameter that the clustering relies on)
Our k value for SNN_wrap will not in fact specify the number of clusters, but that is not actually required anywhere. Setting the function up as type K mainly distinguishes it from the 01 type (which expects as input a dissimilarity matrix with entries between 0 and 1). Setting it up as type K also allows us to use the findBestK option, where a range of k values is tried and the one with the best result (in silhouette width) is reported.
Our wrapper function should return an integer vector corresponding to the cluster assignments of each sample (see ?ClusterFunction for information about other types of output available).
clusterExperiment provides the function internalFunctionCheck that validates user-defined cluster functions. Among other things, it checks that the input and output are compatible with the clusterExperiment workflow (see ?internalFunctionCheck for details). The call to internalFunctionCheck contains, in addition to the function definition, arguments specifying information about the type of input, the type of algorithm, and type of output expected by the function. This information is passed to clusterMany and clusterSingle so that they know what to pass and what to expect from the user-defined method.
internalFunctionCheck(SNN_wrap, inputType = "X", algorithmType = "K",
outputType="vector")
## Warning in .local(x, ...): 'buildSNNGraph' is deprecated.
## Use 'bluster::makeSNNGraph' instead.
## See help("Deprecated")
## [1] TRUE
If it passes all checks (returns TRUE), we can then create an object of the S4 class ClusterFunction, to be used within the package, using this same set of arguments. If it fails, it will return a character string giving the error. Among the checks is running the function on a small randomly generated set of data, so errors may reflect not only the format of the function, but also whether the code itself runs.
Since we passed the checks, we are ready to define our ClusterFunction object.
SNN <- ClusterFunction(SNN_wrap, inputType = "X", algorithmType = "K",
outputType="vector")
## Warning in .local(x, ...): 'buildSNNGraph' is deprecated.
## Use 'bluster::makeSNNGraph' instead.
## See help("Deprecated")
Now that we have our object SNN, we can treat our custom method as a base clustering routine to be used in clusterMany, similarly to how we used kmeans and pam earlier. However, unlike before, you should pass the actual object SNN, and not a quoted version (i.e. not "SNN").
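For example, a minimal sketch of trying the custom routine over several values of k directly in clusterMany (assuming ce and the SNN object defined above; other arguments are left at their defaults):

```r
## clusterFunction takes the ClusterFunction object itself, not "SNN"
ceSNN <- clusterMany(ce, ks=c(5,10,15), clusterFunction=SNN)
```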
In this example, we use clusterSingle to implement a subsample clustering based on SNN. clusterSingle is useful if you just want to create a single clustering.
We give SNN to subsampleArgs so that it will be used for the subsampling. (Note that to create a consensus clustering from the different subsamplings we use a different function, "hierarchical01", which is passed to mainClusterArgs.)
ceCustom <- clusterSingle(ce, reduceMethod="PCA",
nDims=50, subsample = TRUE,
sequential = FALSE,
mainClusterArgs = list(clusterFunction = "hierarchical01",
clusterArgs = list(alpha = 0.3),
minSize = 1),
subsampleArgs = list(resamp.num=100,
samp.p = 0.7,
clusterFunction = SNN,
clusterArgs = list(k = 10),
ncores = 1,
classifyMethod='InSample')
)
## Warning in .local(x, ...): 'buildSNNGraph' is deprecated.
## Use 'bluster::makeSNNGraph' instead.
## See help("Deprecated")
## Warning in .local(x, ...): 'buildSNNGraph' is deprecated.
## Use 'bluster::makeSNNGraph' instead.
## See help("Deprecated")
## Warning in .local(x, ...): 'buildSNNGraph' is deprecated.
## Use 'bluster::makeSNNGraph' instead.
## See help("Deprecated")
## Warning in .local(x, ...): 'buildSNNGraph' is deprecated.
## Use 'bluster::makeSNNGraph' instead.
## See help("Deprecated")
## Warning in .local(x, ...): 'buildSNNGraph' is deprecated.
## Use 'bluster::makeSNNGraph' instead.
## See help("Deprecated")
ceCustom
## class: ClusterExperiment
## dim: 7069 65
## reducedDimNames: PCA
## filterStats: var var_makeConsensus.final
## -----------
## Primary cluster type: clusterSingle
## Primary cluster label: clusterSingle
## Table of clusters (of primary clustering):
## 1 2 3 4
## 33 30 1 1
## Total number of clusterings: 42
## Dendrogram run on 'makeConsensus,final' (cluster index: 2)
## -----------
## Workflow progress:
## clusterMany run? Yes
## makeConsensus run? Yes
## makeDendrogram run? Yes
## mergeClusters run? Yes
Similarly, we can use clusterMany to compute clusters using many different methods, including built-in and custom functions. To mix and match built-in functions, you need to get the actual ClusterFunction objects that match their names, using the getBuiltInFunction function.
clFuns<-getBuiltInFunction(c("pam","kmeans"))
Then we will add our function to the list of functions. Note that it is important that we give a name to every element of the list, including our new function!
clFuns<-c(clFuns, "SNN"=SNN)
Now we can give this list of functions to clusterMany
ceCustom <- clusterMany(ce, dimReduce="PCA",nPCADims=50,
clusterFunction=clFuns,
ks=4:15, findBestK=FALSE)
## 36 parameter combinations, 0 use sequential method, 0 use subsampling method
## Running Clustering on Parameter Combinations...
## Warning in .local(x, ...): 'buildSNNGraph' is deprecated.
## Use 'bluster::makeSNNGraph' instead.
## See help("Deprecated")
## done.
Note that if you call getBuiltInFunction for only one cluster function, it returns the actual ClusterFunction object, not a list of length 1. To combine it with other functions, you need to wrap it in a list.
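As a minimal sketch (not run), wrapping a single built-in function in a named list before combining it with the custom SNN function defined earlier might look like:

```r
## Sketch (not run): getBuiltInFunction() with a single name returns the
## ClusterFunction object itself, so wrap it in a named list first
pamFun <- getBuiltInFunction("pam")
## SNN is the custom ClusterFunction object created earlier in this vignette
clFuns <- c(list("pam" = pamFun), "SNN" = SNN)
```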
Dealing with large numbers of clusterings
A good first check before running clusterMany is to determine how many clusterings you are asking for. clusterMany has some limited internal checks to avoid unnecessary duplicates (e.g., removeSil only applies to some choices of clusterFunction, and clusterMany detects this), but otherwise it runs all combinations. This can take a while for more complicated clustering techniques, so it is a good idea to check what you are getting into. You can do this by running clusterMany with run=FALSE.
In the following we consider expanding our original clustering choices to consider individual choices of \(K\) (rather than just findBestK=TRUE).
checkParam<-clusterMany(se, clusterFunction="pam", ks=2:10,
removeSil=c(TRUE,FALSE), isCount=TRUE,
reduceMethod=c("PCA","var"), makeMissingDiss=TRUE,
nFilterDims=c(100,500,1000),nReducedDims=c(5,15,50),run=FALSE)
## 108 parameter combinations, 0 use sequential method, 0 use subsampling method
## Calculating the 4 Requested Distance Matrices needed to run clustering comparisions (if 'distFunction=NA', then needed because of choices of clustering algorithms and 'makeMissingDiss=TRUE').
## Returning Parameter Combinations without running them (to run them choose run=TRUE)
dim(checkParam$paramMatrix) #number of rows is the number of clusterings
## [1] 108 17
Each row of the matrix checkParam$paramMatrix is a requested clustering (the columns indicate the value of a possible parameter). Our selections indicate 108 different clusterings (!).
We can set the ncores argument to have these clusterings done in parallel. If ncores>1, the parallelization is done via mclapply and should not be used in the Rgui interface (see the help page for mclapply).
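A hedged sketch of such a parallel call (not run here; it simply adds ncores to the parameter check shown above):

```r
## Sketch (not run): distribute the 108 parameter combinations over 2 cores.
## Parallelization uses mclapply, so avoid ncores > 1 in the Rgui interface.
ce <- clusterMany(se, clusterFunction = "pam", ks = 2:10,
                  removeSil = c(TRUE, FALSE), isCount = TRUE,
                  reduceMethod = c("PCA", "var"), makeMissingDiss = TRUE,
                  nFilterDims = c(100, 500, 1000), nReducedDims = c(5, 15, 50),
                  ncores = 2)
```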
Create a unified cluster from many clusters with makeConsensus
After creating many clusterings, makeConsensus finds a single clustering based on which samples were in the same clusters throughout the many clusterings found by clusterMany. While subsampling the data provides robustness to outlying samples, combining across many clustering parameters provides robustness to the choice of parameters, particularly when the parameters give roughly similar numbers of clusters.
As mentioned in the Quick Start section, the default option for makeConsensus is to only define a cluster when all of the samples are in the same clusters across all clusterings. However, this is generally too conservative and just results in most samples not being assigned to a cluster.
Instead, makeConsensus has a parameter proportion that governs the proportion of clusterings in which the samples must be together. Internally, makeConsensus makes a coClustering matrix \(D\). Like the \(D\) created by subsampling in clusterMany, the coClustering matrix takes on values in 0-1, giving the proportion of times the samples are clustered together. This \(D\) matrix can be visualized with plotCoClustering (which is just a call to plotHeatmap). Recall the one we last made in the Quick Start, with our last call to makeConsensus (proportion=0.7 and minSize=3).
plotCoClustering(ce)

makeConsensus performs the clustering by running a “01” clustering algorithm on the \(D\) matrix of percentage co-clustering (the default being “hierarchical01”). The alpha argument to the 01 clustering is 1-proportion. Also passed to the clustering algorithm is the parameter minSize which sets the minimum size of a cluster.
Treatment of Unclustered assignments -1 values are treated separately in the calculation. In particular, they are not considered in the calculation of the percentage of co-clustering – the percentage is taken only with respect to those clusterings where both samples were assigned. However, a post-processing step is applied to the clusters found from running the clustering on the \(D\) matrix. For each sample, the percentage of times that it was marked -1 in the clusterings is calculated. If this percentage is greater than the argument propUnassigned, then the sample is forced to be -1 (unassigned) in the clustering returned by makeConsensus.
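A call combining these arguments might look like the following sketch (not run; the propUnassigned value and clusterLabel are illustrative, not from the analysis above):

```r
## Sketch (not run): require samples to co-cluster in at least 70% of
## clusterings, drop clusters smaller than 5 samples, and force samples
## unassigned in more than 50% of clusterings back to -1
ce <- makeConsensus(ce, proportion = 0.7, minSize = 5,
                    propUnassigned = 0.5,
                    clusterLabel = "consensusExample")
```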
Good scenarios for running makeConsensus Varying certain parameters results in clusterings better suited to makeConsensus than others. In particular, if there are huge discrepancies among the set of clusterings given to makeConsensus, the result will be a shattering of the samples into many small clusters. Similarly, if the number of clusters \(K\) varies widely, the end result will likely resemble that of the largest \(K\), and how much value that provides over simply picking the clustering with the largest \(K\) is debatable. However, for “01” clustering algorithms or clusterings using the sequential algorithm, varying the underlying parameters \(\alpha\) or \(k_0\) often results in roughly similar clusterings across the parameters, so that creating a consensus across them is highly informative.
Consensus from subsets of clusterings
A call to clusterMany or to RSEC can generate many clusterings that result from changing the underlying parameters of a given method (e.g., the number of centers in k-means), the dimensionality reduction, the distance function, or the base algorithm used for the clustering (e.g., PAM vs. k-means).
To highlight interesting structure in the data, it may be useful to understand whether a set of samples tends to cluster together across parameter choices of the same method or across very different methods.
clusterExperiment makes it easy to extract a subset of clusterings and to compute a consensus clustering for any given subset, helping to address this type of question, as we have already seen in the section on makeConsensus.
As an example, assume that we want to explore the role of PCA in the clustering results. We can separately calculate a consensus of those clusterings that used 15 or 50 principal components of the data.
First, we use the getClusterManyParams function to extract the information on the clusterings performed.
params <- getClusterManyParams(ce)
head(params)
## clusteringIndex k
## k=5,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA 5 5
## k=6,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA 6 6
## k=7,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA 7 7
## k=8,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA 8 8
## k=9,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA 9 9
## k=10,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA 10 10
## reduceMethod nReducedDims
## k=5,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA PCA 5
## k=6,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA PCA 5
## k=7,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA PCA 5
## k=8,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA PCA 5
## k=9,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA PCA 5
## k=10,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA PCA 5
## nFilterDims
## k=5,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA NA
## k=6,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA NA
## k=7,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA NA
## k=8,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA NA
## k=9,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA NA
## k=10,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA NA
We can select only the clusterings in which we are interested and pass them to makeConsensus using the whichClusters argument.
clusterIndices15 <- params$clusteringIndex[which(params$nReducedDims == 15)]
clusterIndices50 <- params$clusteringIndex[which(params$nReducedDims == 50)]
#note, the indices will change as we add clusterings!
clusterNames15 <- clusterLabels(ce)[clusterIndices15]
clusterNames50 <- clusterLabels(ce)[clusterIndices50]
shortNames15<-gsub("reduceMethod=PCA,nReducedDims=15,nFilterDims=NA,","",clusterNames15)
shortNames50<-gsub("reduceMethod=PCA,nReducedDims=50,nFilterDims=NA,","",clusterNames50)
ce <- makeConsensus(ce, whichClusters = clusterNames15, proportion = 0.7,
clusterLabel = "consensusPCA15")
ce <- makeConsensus(ce, whichClusters = clusterNames50, proportion = 0.7,
clusterLabel = "consensusPCA50")
Analogously, we’ve seen that many visualization functions have a whichClusters argument that can be used to visually inspect the similarities and differences between subsets of clusterings.
Here we show the use of these features with plotClustersWorkflow for the two different consensus clusterings we made.
par(mar=plotCMar,mfrow=c(1,2))
plotClustersWorkflow(ce, whichClusters="consensusPCA15", clusterLabel="Consensus", whichClusterMany=match(clusterNames15, clusterLabels(ce)), clusterManyLabels=shortNames15, axisLine=-1, nBlankLines=1, main="15 PCs")
plotClustersWorkflow(ce, whichClusters="consensusPCA50", clusterLabel="Consensus", clusterManyLabels=shortNames50, whichClusterMany=match(clusterNames50, clusterLabels(ce)), nBlankLines=1, main="50 PCs")

We can also choose a subset of the clusterings and vary the parameters of makeConsensus:
wh<-getClusterManyParams(ce)$clusteringIndex[getClusterManyParams(ce)$reduceMethod=="var"]
ce<-makeConsensus(ce,whichCluster=wh,proportion=0.7,minSize=3,
clusterLabel="makeConsensus,nVAR")
plotCoClustering(ce)

We can compare this to all of our other versions of makeConsensus. While they do not all have clusterTypes equal to “makeConsensus” (only the most recent call has clusterType exactly equal to “makeConsensus”), they all have “makeConsensus” as part of their clusterType, even though they have different clusterLabels (and now we see that it was useful to give them different labels!)
wh<-grep("makeConsensus",clusterTypes(ce))
par(mar=plotCMar)
plotClusters(ce,whichClusters=rev(wh),axisLine=-1)

Creating a Hierarchy of Clusters with makeDendrogram
As mentioned above, we find merging clusters based on the extent of differential expression between their features to be a useful method for combining many small clusters.
We provide a method for doing this that consists of two steps: making a hierarchy of the clusters, and then estimating the amount of differential expression at each branch of the hierarchy.
makeDendrogram creates a hierarchical clustering of the clusters as determined by the primaryCluster of the ClusterExperiment object. In addition to being used for merging clusters, the dendrograms created by makeDendrogram are also useful for ordering the clusters in plotHeatmap as has been shown above.
makeDendrogram performs hierarchical clustering of the cluster medoids (after transformation of the data) and provides a dendrogram that orders the samples according to this clustering of the clusters. The hierarchical ordering of the dendrograms is saved internally in the ClusterExperiment object.
Like clustering, the dendrogram can depend on what features are included from the data. The same options for clustering are available for the hierarchical clustering of the clusters, namely choices of dimensionality reduction via reduceMethod and the number of dimensions via nDims.
ce<-makeDendrogram(ce,reduceMethod="var",nDims=500)
plotDendrogram(ce)
Notice that the plot of the dendrogram shows the hierarchy of the clusters (and color codes them according to the colors stored in colorLegend slot).
Recall that the most recent clustering made is from our call to makeConsensus, where we experimented with using only some of the clusterings from clusterMany, so that is our current primaryCluster:
show(ce)
## class: ClusterExperiment
## dim: 7069 65
## reducedDimNames: PCA
## filterStats: var var_makeConsensus.final var_makeConsensus.nVAR
## -----------
## Primary cluster type: makeConsensus
## Primary cluster label: makeConsensus,nVAR
## Table of clusters (of primary clustering):
## -1 c01 c02 c03 c04 c05 c06 c07
## 10 15 13 9 7 5 3 3
## Total number of clusterings: 44
## Dendrogram run on 'makeConsensus,nVAR' (cluster index: 1)
## -----------
## Workflow progress:
## clusterMany run? Yes
## makeConsensus run? Yes
## makeDendrogram run? Yes
## mergeClusters run? No
This is the clustering from combining only the clusterings from clusterMany that use the top most variable genes. Because it is the primaryCluster, it was the clustering that was used by default to make the dendrogram.
We might prefer to get back to the dendrogram based on our makeConsensus run in the quick start (the “makeConsensus,final” clustering). We lost that dendrogram when we called makeDendrogram again. However, we can rerun makeDendrogram and choose a different clustering from which to make the dendrogram.
ce<-makeDendrogram(ce,reduceMethod="var",nDims=500,
whichCluster="makeConsensus,final")
We will visualize the dendrogram with plotDendrogram. The default setting plots the dendrogram with color blocks equal to the size of the clusters (i.e., the number of samples in each cluster).
plotDendrogram(ce,leafType="sample",plotType="colorblock")

We can actually use plotDendrogram to compare clusterings too, like plotClusters, using the whichClusters argument to identify which clusters to show. For example, let’s compare our different makeConsensus results.
par(mar=plotDMar)
whCM<-grep("makeConsensus",clusterTypes(ce))
plotDendrogram(ce,whichClusters=whCM,leafType="sample",plotType="colorblock")

Unlike plotClusters, however, there is no aligning of samples to make samples with the same cluster group together.
Making a past run the current one
Note that because we’ve run additional makeConsensus steps on this data, the clustering we originally designated as “final” is not our primary cluster. Instead our most recent call to makeConsensus is the primary cluster:
primaryClusterLabel(ce)
## [1] "makeConsensus,nVAR"
show(ce)
## class: ClusterExperiment
## dim: 7069 65
## reducedDimNames: PCA
## filterStats: var var_makeConsensus.final var_makeConsensus.nVAR
## -----------
## Primary cluster type: makeConsensus
## Primary cluster label: makeConsensus,nVAR
## Table of clusters (of primary clustering):
## -1 c01 c02 c03 c04 c05 c06 c07
## 10 15 13 9 7 5 3 3
## Total number of clusterings: 44
## Dendrogram run on 'makeConsensus,final' (cluster index: 5)
## -----------
## Workflow progress:
## clusterMany run? Yes
## makeConsensus run? Yes
## makeDendrogram run? Yes
## mergeClusters run? No
We know the results are still saved. If we search for the label we gave it, we can find it (given by clusterLabel). And if we look at its value in clusterTypes, it no longer has clusterType “makeConsensus”, but instead has a “.x” value appended to it.
whFinal<-which(clusterLabels(ce)=="makeConsensus,final")
head(clusterMatrix(ce,whichCluster=whFinal))
## makeConsensus,final
## SRR1275356 -1
## SRR1275251 -1
## SRR1275287 6
## SRR1275364 5
## SRR1275269 -1
## SRR1275263 -1
clusterTypes(ce)[whFinal]
## [1] "makeConsensus.3"
But rather than continually looking up the cluster by its label, we can choose to reset this past call to makeConsensus to be the current ‘makeConsensus’ output (which will also set this clustering to be the primaryCluster).
ce<-setToCurrent(ce,whichCluster="makeConsensus,final")
show(ce)
## class: ClusterExperiment
## dim: 7069 65
## reducedDimNames: PCA
## filterStats: var var_makeConsensus.final var_makeConsensus.nVAR
## -----------
## Primary cluster type: makeConsensus
## Primary cluster label: makeConsensus,final
## Table of clusters (of primary clustering):
## -1 c01 c02 c03 c04 c05 c06
## 8 15 14 9 8 8 3
## Total number of clusterings: 44
## Dendrogram run on 'makeConsensus,final' (cluster index: 5)
## -----------
## Workflow progress:
## clusterMany run? Yes
## makeConsensus run? Yes
## makeDendrogram run? Yes
## mergeClusters run? No
We don’t need to rerun makeDendrogram, since in our call to makeDendrogram we explicitly set the argument whichCluster to make the dendrogram from this clustering.
More about how the dendrogram is saved
The resulting dendrograms (one for just the cluster hierarchy and one that expands the cluster hierarchy to include the samples) are saved in the object. Each is saved as a phylo4d object from the package phylobase (which uses the basic format of the S3 class phylo in the ape package, but is an S4 class with some useful helpers).
They can be accessed with the functions clusterDendrogram and sampleDendrogram.
clusterDendrogram(ce)
## label node ancestor edge.length node.type NodeId ClusterIdDendro
## 1 T1 1 9 6751.708 tip NodeId6 ClusterId1
## 2 T2 2 8 4758.616 tip NodeId7 ClusterId2
## 3 T3 3 10 5240.817 tip NodeId8 ClusterId3
## 4 T4 4 11 3381.071 tip NodeId9 ClusterId4
## 5 T5 5 11 3381.071 tip NodeId10 ClusterId5
## 6 T6 6 10 5240.817 tip NodeId11 ClusterId6
## 7 NodeId1 7 0 NA root NodeId1 <NA>
## 8 NodeId2 8 7 4814.822 internal NodeId2 <NA>
## 9 NodeId3 9 7 2821.730 internal NodeId3 <NA>
## 10 NodeId4 10 9 1510.891 internal NodeId4 <NA>
## 11 NodeId5 11 8 1377.545 internal NodeId5 <NA>
## ClusterIdMerge Position
## 1 NA cluster hierarchy tip
## 2 NA cluster hierarchy tip
## 3 NA cluster hierarchy tip
## 4 NA cluster hierarchy tip
## 5 NA cluster hierarchy tip
## 6 NA cluster hierarchy tip
## 7 NA cluster hierarchy node
## 8 NA cluster hierarchy node
## 9 NA cluster hierarchy node
## 10 NA cluster hierarchy node
## 11 NA cluster hierarchy node
head(sampleDendrogram(ce))
## label node ancestor edge.length node.type NodeId Position SampleIndex
## 1 T01 1 72 0 tip <NA> assigned tip 8
## 2 T02 2 78 0 tip <NA> assigned tip 9
## 3 T03 3 79 0 tip <NA> assigned tip 10
## 4 T04 4 80 0 tip <NA> assigned tip 18
## 5 T05 5 81 0 tip <NA> assigned tip 24
## 6 T06 6 82 0 tip <NA> assigned tip 26
## 7 T07 7 83 0 tip <NA> assigned tip 27
## 8 T08 8 84 0 tip <NA> assigned tip 28
## 9 T09 9 85 0 tip <NA> assigned tip 42
## 10 T10 10 86 0 tip <NA> assigned tip 43
## 11 T11 11 87 0 tip <NA> assigned tip 45
## 12 T12 12 88 0 tip <NA> assigned tip 46
## 13 T13 13 89 0 tip <NA> assigned tip 53
## 14 T14 14 90 0 tip <NA> assigned tip 58
## 15 T15 15 90 0 tip <NA> assigned tip 59
## 16 T16 16 73 0 tip <NA> assigned tip 11
## 17 T17 17 91 0 tip <NA> assigned tip 12
## 18 T18 18 92 0 tip <NA> assigned tip 14
## 19 T19 19 93 0 tip <NA> assigned tip 21
## 20 T20 20 94 0 tip <NA> assigned tip 22
Just like the clusters, the nodes have permanent non-changing names (stored in the NodeId column). The dendrograms also store information on how to match the dendrogram to the clusters (and if applicable the merged clusters). To see more about the information saved in these dendrograms, see ?clusterDendrogram.
Generally, these dendrograms will not need to be directly manipulated by the user. But if desired, the user can explore these objects using the functions in phylobase.
library(phylobase)
nodeLabels(clusterDendrogram(ce))
## 7 8 9 10 11
## "NodeId1" "NodeId2" "NodeId3" "NodeId4" "NodeId5"
descendants(clusterDendrogram(ce),node="NodeId3")
## T1 T3 T6
## 1 3 6
The main reason to ever work with these dendrograms directly is to link them back to the [merging results](#mergeClusters) or [feature extraction results](#Dendrocontrasts). In particular, one feature of the cluster dendrogram that can be set by the user is the labels of the internal nodes of the cluster hierarchy. For this reason there is a function nodeLabels that can be called directly on the ClusterExperiment object to see and update these values. Unlike our previous code, where we extracted the dendrogram and then used the functions in phylobase to look at it, these functions update the actual dendrograms inside the object.
We’ll demonstrate this by giving the nodes new names that are the letters A-Z. The main trick in creating new node labels is that the vector of new names must itself be named, with names matching the internal node ids (the NodeId column). Since these node ids are the default node labels, we can use nodeLabels to grab them.
newNodeLabels<-LETTERS[1:nNodes(ce)]
names(newNodeLabels)<-nodeLabels(ce)
nodeLabels(ce)<-newNodeLabels
Merging clusters with mergeClusters
We then can use this hierarchy of clusters to merge clusters that show little difference in expression. We do this by testing, for each node of the dendrogram and each feature, whether the mean of the set of clusters in the right split of the node is equal to the mean in the left split. This is done via getBestFeatures (see the section on getBestFeatures), with the type argument set to “Dendro”.
Starting at the bottom of the tree, those clusters for which the estimated percentage of features with differential expression is below a certain value (determined by the argument cutoff) are merged into a larger cluster. This testing of differences and merging continues until the estimated percentage of non-null DE features is above cutoff. This means that lower values of cutoff result in less merging of clusters. Multiple methods for estimating the percentage of non-null features are implemented. The option mergeMethod="adjP", which we showed earlier, is the simplest: the proportion of genes found significant at a False Discovery Rate threshold of 0.05 (using the Benjamini-Hochberg procedure). However, other more sophisticated methods are also implemented (see ?mergeClusters).
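As a rough illustration of the adjP estimate, here is a sketch in base R using simulated p-values (not anything computed by the package): the proportion at a node is simply the fraction of features whose Benjamini-Hochberg adjusted p-value falls below the FDR threshold.

```r
# Sketch of the "adjP" estimate at a single node. The p-values are
# simulated for illustration; in the package they come from per-gene
# tests of the contrast at that node.
adjPProportion <- function(p, fdr = 0.05) {
  # proportion of features significant after Benjamini-Hochberg correction
  mean(p.adjust(p, method = "BH") < fdr)
}

set.seed(1)
pvals <- c(runif(900), rbeta(100, 0.5, 10))  # mostly null, with some signal
adjPProportion(pvals)
# In mergeClusters, the two sides of the node are merged when this
# proportion is below the `cutoff` argument.
```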
Notice that mergeClusters will always run based on the clustering that made the currently existing dendrogram. So it is always good to check that it is what we expect.
ce
## class: ClusterExperiment
## dim: 7069 65
## reducedDimNames: PCA
## filterStats: var var_makeConsensus.final var_makeConsensus.nVAR
## -----------
## Primary cluster type: makeConsensus
## Primary cluster label: makeConsensus,final
## Table of clusters (of primary clustering):
## -1 c01 c02 c03 c04 c05 c06
## 8 15 14 9 8 8 3
## Total number of clusterings: 44
## Dendrogram run on 'makeConsensus,final' (cluster index: 5)
## -----------
## Workflow progress:
## clusterMany run? Yes
## makeConsensus run? Yes
## makeDendrogram run? Yes
## mergeClusters run? No
We see in the summary “Dendrogram run on ‘makeConsensus,final’”, showing us that this is the clustering that will be used (and also showing us the value of giving our own labels to the results of makeConsensus if we are going to try different strategies).
We will run mergeClusters with the option mergeMethod="adjP". We will also set plotInfo="adjP" meaning that we would like the mergeClusters command to also produce a plot showing the dendrogram and the estimates from the adjP method for each node. We also set calculateAll=FALSE for illustration purposes, meaning the function will only calculate the estimates for the methods we request, but as we explain below, that is not necessarily the best option if you are going to be trying out different cutoffs.
ce<-mergeClusters(ce,mergeMethod="adjP",plotInfo=c("adjP"),calculateAll=FALSE)

The info about the merge is saved in the ce object.
mergeMethod(ce)
## [1] "adjP"
mergeCutoff(ce)
## [1] 0.05
nodeMergeInfo(ce)
## NodeId Contrast isMerged mergeClusterId Storey PC adjP locfdr JC
## 1 NodeId1 NodeId1 FALSE NA NA NA 0.09209223 NA NA
## 2 NodeId2 NodeId2 FALSE NA NA NA 0.07780450 NA NA
## 3 NodeId3 NodeId3 TRUE 1 NA NA 0.04823879 NA NA
## 4 NodeId4 NodeId4 TRUE NA NA NA 0.01541944 NA NA
## 5 NodeId5 NodeId5 FALSE NA NA NA 0.05616070 NA NA
Notice that nodeMergeInfo gives, for each node, the proportion of features estimated to be differentially expressed (as displayed in the plot that we requested), as well as whether that node was merged in the mergeClusters call (the isMerged column). Because we set calculateAll=FALSE, only the method needed for our command was calculated (adjP); the others have NA values. The column mergeClusterId tells us which nodes in the tree are now equivalent to a cluster. This is different from the isMerged column: a node can be merged, but if its parent node was also merged, then that node will not be equivalent to a cluster in the “mergeClusters” clustering. (See Dendrogram Contrasts above for more information about the nodes of the dendrograms.)
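The relationship between isMerged and mergeClusterId can be sketched in a few lines of base R. The parent lookup below is made up for illustration (it is not read from the ce object): a merged node corresponds to a cluster in the final merged clustering only if its parent node was not itself merged.

```r
# Hypothetical parent lookup for the internal nodes (illustration only)
parent <- c(NodeId1 = NA, NodeId2 = "NodeId1", NodeId3 = "NodeId1",
            NodeId4 = "NodeId3", NodeId5 = "NodeId2")
# isMerged values as in the nodeMergeInfo output above
isMerged <- c(NodeId1 = FALSE, NodeId2 = FALSE, NodeId3 = TRUE,
              NodeId4 = TRUE, NodeId5 = FALSE)
# A merged node is a cluster of the merged clustering only if its
# parent was not also merged (the root has no parent, hence %in%)
parentMerged <- isMerged[parent[names(isMerged)]] %in% TRUE
isClusterNode <- isMerged & !parentMerged
isClusterNode
```

Under this made-up topology, NodeId4 is merged but sits below the merged NodeId3, so only NodeId3 becomes a cluster, matching the pattern in the table above.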
mergeClusters can also be run without merging the cluster, and simply drawing a plot showing the dendrogram along with the estimates of the percentage of non-null features to aid in deciding a cutoff and method. By setting plotInfo="all", all of the estimates of the different methods are displayed simultaneously, while before we only showed the values for the specific mergeMethod we requested.
ce<-mergeClusters(ce,mergeMethod="none",plotInfo="all")

Notice that now if we call nodeMergeInfo, all of the methods now have estimates (except for some methods that didn’t run successfully for this data).
nodeMergeInfo(ce)
## NodeId Contrast isMerged mergeClusterId Storey PC adjP
## 1 NodeId1 NodeId1 NA NA 0.4313198 0.3608610 0.09209223
## 2 NodeId2 NodeId2 NA NA 0.3509690 0.3031920 0.07780450
## 3 NodeId3 NodeId3 NA NA 0.3532324 0.2878294 0.04823879
## 4 NodeId4 NodeId4 NA NA 0.2590182 0.1910256 0.01541944
## 5 NodeId5 NodeId5 NA NA 0.2980620 0.2472606 0.05616070
## locfdr JC
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
This means that in any future calls to mergeClusters there is no need to recalculate per-gene significance, which will speed up the calls if you just want to change the cutoff (all of the methods use the same input of per-gene p-values, so recalculating them each time is computationally inefficient). In practice, the default is calculateAll=TRUE, meaning all methods are calculated unless the user specifically requests otherwise.
Now we can pick a cutoff and rerun mergeClusters. We’ll give it a label to keep it separate from the previous merge clusters run we had made. Note, we can turn off plotting completely by setting plot=FALSE.
ce<-mergeClusters(ce,cutoff=0.05,mergeMethod="adjP",clusterLabel="mergeClusters,v2",plot=FALSE)
ce
## class: ClusterExperiment
## dim: 7069 65
## reducedDimNames: PCA
## filterStats: var var_makeConsensus.final var_makeConsensus.nVAR
## -----------
## Primary cluster type: mergeClusters
## Primary cluster label: mergeClusters,v2
## Table of clusters (of primary clustering):
## -1 m01 m02 m03 m04
## 8 27 14 8 8
## Total number of clusterings: 46
## Dendrogram run on 'makeConsensus,final' (cluster index: 7)
## -----------
## Workflow progress:
## clusterMany run? Yes
## makeConsensus run? Yes
## makeDendrogram run? Yes
## mergeClusters run? Yes
Notice that the nodeMergeInfo has changed, since different nodes were merged, but the estimates per node stay the same.
nodeMergeInfo(ce)
## NodeId Contrast isMerged mergeClusterId Storey PC adjP
## 1 NodeId1 NodeId1 FALSE NA 0.4313198 0.3608610 0.09209223
## 2 NodeId2 NodeId2 FALSE NA 0.3509690 0.3031920 0.07780450
## 3 NodeId3 NodeId3 TRUE 1 0.3532324 0.2878294 0.04823879
## 4 NodeId4 NodeId4 TRUE NA 0.2590182 0.1910256 0.01541944
## 5 NodeId5 NodeId5 FALSE NA 0.2980620 0.2472606 0.05616070
## locfdr JC
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
If we want to rerun mergeClusters with a different method, we can do that instead.
ce<-mergeClusters(ce,cutoff=0.15,mergeMethod="Storey",
clusterLabel="mergeClusters,v3",plot=FALSE)
ce
## class: ClusterExperiment
## dim: 7069 65
## reducedDimNames: PCA
## filterStats: var var_makeConsensus.final var_makeConsensus.nVAR
## -----------
## Primary cluster type: mergeClusters
## Primary cluster label: mergeClusters,v3
## Table of clusters (of primary clustering):
## -1 m01 m02 m03 m04 m05 m06
## 8 15 14 9 8 8 3
## Total number of clusterings: 47
## Dendrogram run on 'makeConsensus,final' (cluster index: 8)
## -----------
## Workflow progress:
## clusterMany run? Yes
## makeConsensus run? Yes
## makeDendrogram run? Yes
## mergeClusters run? Yes
We can use plotDendrogram to compare the results. Notice that plotDendrogram can recreate the above plots that were created in the calls to mergeClusters via the argument mergeInfo (of course, this only works after mergeClusters has actually been called so that the information is saved in the ce object).
par(mar=c(1.1,1.1,6.1,2.1))
plotDendrogram(ce,whichClusters=c("mergeClusters,v3","mergeClusters,v2"),mergeInfo="mergeMethod")

Requiring a certain log-fold change
With a large number of cells, it can be overly easy to get significant results, even if the size of the differences is small. Another reasonable constraint is to require the difference between the contrasts to be at least a certain fold change for the gene to be counted as different. We allow this option for the merge method adjP. Namely, the proportion significant calculated at each node requires both an adjusted p-value less than 0.05 and an estimated \(\log_2\) fold change with absolute value greater than an amount specified by the argument logFCcutoff.
ce<-mergeClusters(ce,cutoff=0.05,mergeMethod="adjP", logFCcutoff=2,
clusterLabel="mergeClusters,FC1",plot=FALSE)
ce
## class: ClusterExperiment
## dim: 7069 65
## reducedDimNames: PCA
## filterStats: var var_makeConsensus.final var_makeConsensus.nVAR
## -----------
## Primary cluster type: mergeClusters
## Primary cluster label: mergeClusters,FC1
## Table of clusters (of primary clustering):
## -1 m01 m02 m03 m04
## 8 27 14 8 8
## Total number of clusterings: 48
## Dendrogram run on 'makeConsensus,final' (cluster index: 9)
## -----------
## Workflow progress:
## clusterMany run? Yes
## makeConsensus run? Yes
## makeDendrogram run? Yes
## mergeClusters run? Yes
In this case, we can see that it did not make a difference in the merging.
par(mar=c(1.1,1.1,6.1,2.1))
plotDendrogram(ce,whichClusters=c("mergeClusters,FC1","mergeClusters,v3","mergeClusters,v2"),mergeInfo="mergeMethod")

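The combined criterion can be sketched in base R (simulated values, not the package's internal code): a feature counts toward the proportion only if it passes both the FDR threshold and the fold-change threshold.

```r
# Sketch of the adjP criterion with a log fold-change requirement.
# p-values and log2 fold changes are simulated for illustration.
propDEWithFC <- function(p, log2fc, fdr = 0.05, logFCcutoff = 2) {
  sig <- p.adjust(p, method = "BH") < fdr   # significance requirement
  big <- abs(log2fc) > logFCcutoff          # effect-size requirement
  mean(sig & big)
}

set.seed(2)
pvals  <- runif(200)^2          # illustrative p-values
log2fc <- rnorm(200, sd = 1.5)  # illustrative log2 fold changes
propDEWithFC(pvals, log2fc)
```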
Keeping track of and rerunning elements of the workflow
The commands we have shown above show a workflow which continually saves the results over the previous object, so that additional information just gets added to the existing object.
What happens if some parts of the clustering workflow are re-run? For example, in the above we reran parts of the workflow when we talked about them in more detail, or to experiment with parameter settings.
The workflow commands check for existing clusters of the workflow (based on the clusterTypes of the clusterings). If there exist clusterings from previous runs and such clusterings came from calls that are “downstream” of the requested clustering, then the method will change their clusterTypes value by adding a “.i”, where \(i\) is a numerical index keeping track of replicate calls.
For example, if we rerun makeConsensus, say with a different choice for the required proportion of similarity, then makeConsensus searches the existing clusterings in the input object. We’ve already seen that any existing makeConsensus results will have their clusterTypes changed from makeConsensus to makeConsensus.x, where \(x\) is chosen to be greater than any existing makeConsensus.x suffix (after all, you might do this many times!). Their labels will also be updated if they just have the default label, but if the user has given different labels to the clusterings those will be preserved.
Moreover, this rerunning of makeConsensus will also affect everything in the analysis that was downstream of it and depended on that call. So since mergeClusters is downstream of makeConsensus in the workflow, currently existing mergeClusters results will also get bumped to mergeClusters.x along with makeConsensus. However, clusterMany is upstream of makeConsensus (i.e. you expect there to be existing clusterMany results before you run makeConsensus), so nothing will happen to clusterMany.
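The renaming rule can be sketched as follows; this is an illustration of the described behavior, not the package's internal code, and bumpClusterTypes is a hypothetical helper.

```r
# Sketch: bump the current (un-suffixed) downstream clusterTypes by
# appending ".i", with i one larger than any suffix already present.
bumpClusterTypes <- function(types,
                             downstream = c("makeConsensus", "mergeClusters")) {
  suffixes <- suppressWarnings(as.numeric(sub("^.*\\.", "", types)))
  nextIter <- max(c(0, suffixes[!is.na(suffixes)])) + 1
  toBump <- types %in% downstream   # only currently "active" downstream types
  types[toBump] <- paste0(types[toBump], ".", nextIter)
  types
}

bumpClusterTypes(c("clusterMany", "makeConsensus",
                   "mergeClusters", "mergeClusters.1"))
```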
This is handled internally, and may never be apparent to the user unless they choose whichClusters="all" in a plotting command. Indeed this is one reason to always pick whichClusters="workflow", so that these saved previous versions are not displayed.
However, if the user wants to “go back” to previous versions and make them the current iteration, we have seen that the setToCurrent command will do this (see example in the section on makeDendrogram). setToCurrent follows the same process as described above, only with an existing cluster set to the current part of the pipeline.
Note that nothing governs or protects the clusterTypes values to be of a certain kind. This means that if the user decides to assign one of these protected names as the clusterTypes of a clustering, that is allowed. However, it could clearly create some havoc if done poorly.
Erasing old clusters You can also choose to have all old versions erased by setting the option eraseOld=TRUE in the call to clusterMany, makeConsensus, mergeClusters, and/or setToCurrent. eraseOld=TRUE in any of these functions will delete ALL past workflow results except for those that are both in the current workflow and “upstream” of the requested command. You can also manually remove clusters with removeClusters.
Finding workflow iterations Sometimes which numbered iteration a particular call is in will not be obvious if there are many calls to the workflow. You may have a mergeClusters.2 cluster but no mergeClusters.1 because of an upstream workflow call in the middle that bumped the iteration value up to 2 without ever making a mergeClusters.1. If you really want to, you can see more about the existing iterations and where they are in the clusterMatrix. “0” refers to the current iteration; otherwise the smaller the iteration number, the earlier it was run.
workflowClusterTable(ce)
## Iteration
## Type 0 1 2 3 4 5 6 7 8
## final 0 0 0 0 0 0 0 0 0
## mergeClusters 1 0 0 1 0 0 1 1 1
## makeConsensus 1 1 1 0 1 1 1 0 0
## clusterMany 36 0 0 0 0 0 0 0 0
Explicit details about every workflow cluster and their index in clusterMatrix is given by workflowClusterDetails:
head(workflowClusterDetails(ce),8)
## index type iteration label
## 1 1 mergeClusters 0 mergeClusters,FC1
## 2 2 mergeClusters 8 mergeClusters,v3
## 3 3 mergeClusters 7 mergeClusters,v2
## 4 4 mergeClusters 6 mergeClusters.6
## 5 5 makeConsensus 6 makeConsensus,nVAR
## 6 6 makeConsensus 5 consensusPCA50
## 7 7 makeConsensus 4 consensusPCA15
## 8 8 mergeClusters 3 mergeClusters.3
A note on the whichCluster argument Many functions take the whichCluster argument for identifying a clustering or clusterings on which to perform an action. These arguments all act similarly across functions, and allow the user to give character arguments. As described above, these can be shortcuts like “workflow”, or they can match either clusterTypes or clusterLabels of the object. It is important to note that matching is first done to clusterTypes, and then if not successful to clusterLabels. Since neither clusterTypes nor clusterLabels is guaranteed to be unique, the user should be careful in how they make the call. And, of course, whichCluster arguments can also take explicit numeric integers that identify the column(s) of the clusterMatrix that should be used.
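The matching order can be sketched with a small helper (an illustration of the described behavior, not the package's actual implementation):

```r
# Sketch: a character whichCluster value is matched first against
# clusterTypes; only if that finds nothing do we fall back to
# clusterLabels.
matchWhichCluster <- function(value, clusterTypes, clusterLabels) {
  hits <- which(clusterTypes == value)
  if (length(hits) == 0) hits <- which(clusterLabels == value)
  hits  # integer index (or indices) into the columns of clusterMatrix
}

types  <- c("mergeClusters", "makeConsensus", "clusterMany")
labels <- c("Final Clustering", "makeConsensus,final", "clusterMany,1")
matchWhichCluster("makeConsensus", types, labels)     # matched via clusterTypes
matchWhichCluster("Final Clustering", types, labels)  # falls back to clusterLabels
```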
Designate a Final Clustering
A final protected clusterTypes is “final”. This is not created by any method, but can be set to be the clusterType of a clustering by the user (via the clusterTypes command). Any clustering marked final will be considered one of the workflow values for commands like whichClusters="workflow". However, they will NOT be renamed with “.x” or removed if eraseOld=TRUE. This is a way for a user to ‘save’ a clustering as important/final so it will not be changed internally by any method, yet still have it show up with the “workflow” clustering results. There is no limit to the number of such clusters that are so marked, but the utility of doing so will drop if too many such clusters are chosen.
For best functionality, particularly if a user has determined a single final clustering after completing clustering, a user will probably want to set the primaryClusterIndex to be that of the final cluster and rerun makeDendrogram. This will help in plotting and visualizing. The setToFinal command does this.
Here we will demonstrate marking a cluster as final. We go back to the previous mergeClusters result that we found with cutoff=0.05 and mark it as our final clustering. First we need to find which cluster it is. We see from our earlier call to the workflow functions that it has clusterType equal to “mergeClusters.7” and label equal to “mergeClusters,v2”. In our call to setToFinal we will decide to change its label as well.
ce<-setToFinal(ce,whichCluster="mergeClusters,v2",
clusterLabel="Final Clustering")
par(mar=plotCMar)
plotClusters(ce,whichClusters="workflow")

Note that because it is labeled as “final” it shows up automatically in “workflow” clusters in our plotClusters plot. It has also been set as our primaryCluster and has the new clusterLabel we gave it in the call to setToFinal.
This didn’t get rid of the most recent mergeClusters result, which still shows up as “the” mergeClusters result; that might be undesired. We could remove it with removeClusters. Alternatively, we could manually change its clusterTypes to mergeClusters.x so that it doesn’t show up as current.
A cleaner way to do this would have been to first set the desired clustering (“mergeClusters.7”) to the most current iteration with setToCurrent, which would have bumped the existing mergeClusters result so that it is no longer current.