# 1 Introduction

The goal of this package is to encourage the user to try many different clustering algorithms in one package structure. We give tools for running many different clusterings and choices of parameters. We also provide visualization to compare many different clusterings and algorithm tools to find common shared clustering patterns. We implement common post-processing steps unrelated to the specific clustering algorithm (e.g. subsampling the data for stability, finding cluster-specific markers via differential expression, etc).

The other main goal of this package is to implement strategies that we have developed for finding a single robust clustering based on the many clusterings that the user might create by perturbing various parameters of a clustering algorithm. There are several steps to these strategies that we call our standard clustering workflow. Our RSEC algorithm (Resampling-based Sequential Ensemble Clustering) is our preferred realization of this workflow that depends on subsampling on and other ensembl methods to provide robust clusterings, particularly for single-cell sequencing experiments and other large mRNA-Seq experiments.

We also provide a class clusterExperiment that inherits from SummarizedExperiment to store the many clusterings and related information.

All of our methods also have a barebones version that allows input of matrices and greater control. This comes at the expense of the user having to manage and keep track of the clusters, input data, transformation of the data, etc. We do not discuss these barebone versions in this tutorial. Instead, we focus on using SummarizedExperiment object as the input and working with the resulting ClusterExperiment object. See the help pages of each method for more on how to allow for matrix input.

Although this package was developed with (single-cell) RNA-seq data in mind, its use is not limited to RNA-seq or even to gene expression data. Any dataset characterized by high dimensionality could benefit from the methods implemented here.

## 1.1 The clustering workflow

The package encodes many common practices that are shared across clustering algorithms, like subsampling the data, computing silhouette width, sequential clustering procedures, and so forth. It also provides novel strategies that we developed as part of the RSEC algorithm.

As mentioned above, RSEC is a specific algorithm for creating a clustering that follows these basic steps:

• Implement many different clusterings using different choices of parameters using the function clusterMany. This results in a large collection of clusterings, where each clustering is based on different parameters.
• Find a unifying clustering across these many clusterings using the combineMany function.
• Determine whether some clusters should be merged together into larger clusters. This involves two steps:
• Find a hierarchical clustering of the clusters found by combineMany using makeDendrogram
• Merge together clusters of this hierarchy based on the percentage of differential expression, using mergeClusters.

The basic premise of RSEC is to find small, robust clusters of samples, and then merge them into larger clusters as relevant. We find that many algorithmic methods for choosing the appropriate number of clusters for methods err on the side of too few clusters. However, we find in practice that we tend to prefer to err on finding many clusters and then merging them based on examining the data.

RSEC makes many specific choices in this basic workflow, and many steps of this workflow are useful separately from RSEC. For this reason, the clusterExperiment package generalizes this workflow so that the user can use these tools with their own choices. We call this generalization the clustering workflow, as oppose to the specific choices set in RSEC. We will introduce this workflow in its general setting, and only ing specific sections devoted to the RSEC function will discussed the RSEC algorithm. For this reason, the examples shown here use simple clustering choices in the workflow to save on computational overhead; RSEC uses extensive subsampling techniques and takes much longer to run than is practical for a vignette.

Users can also run or add their own clusters to this workflow at different stages. Additional functionality for clustering is available in the clusterSingle function, and the user should see the documentation in the help page of that function. However, there is special functionality to ease in quickly visualizing and managing the results of this workflow.

### 1.1.1 The RSEC Routine

The RSEC algorithm (Resampling-based Sequential Ensemble Clustering) was developed specifically for finding robust clusters in single cell sequencing data. It follows our above suggested workflow, but makes specific choices along the way that we find to be advantageous in clustering large mRNA expression datasets, like those of single cell sequencing. It is implemented in the function RSEC. This vignette serves to show how the pieces of the workflow operate, and use choices that are shown are not those recommended by RSEC. This is both to show the flexibility of the functions and because RSEC is computationally more time-intensive, since it is based on both resampling, and sequential discovery of clusters.

## 1.3 Visualization Tools

We provide a visualization to compare many clusterings of the same data in the function plotClusters. We also provide some functions that are wrappers for common visualization tasks in clustering gene expression data. We provide a heatmap function, plotHeatmap that is an interface to the aheatmap function in the NMF package.

# 2 Quickstart

We will go quickly through the standard steps of clustering using the clusterExperiment package, before turning to more details. The standard workflow we envision is the following:

• clusterMany – run desired clusterings
• combineMany – get a unified clustering
• makeDendrogram – get a hierarchical relationship between the clusters
• mergeClusters – merge together clusters with little DE between their genes.
• getBestFeatures – Find Features that are differential between the final clustering.

## 2.1 Data example

We will make use of a single cell RNA sequencing experiment made available in the scRNAseq package.

set.seed(14456) ## for reproducibility, just in case
library(scRNAseq)
data("fluidigm")

We will use the fluidigm dataset (see help("fluidigm")). This dataset is stored as a SummarizedExperiment object. We can access the data with assay and metadata on the samples with colData.

assay(fluidigm)[1:5,1:10]
##          SRR1275356 SRR1274090 SRR1275251 SRR1275287 SRR1275364 SRR1275269
## A1BG              0          0          0          0          0          0
## A1BG-AS1          0          0          0          0          0          0
## A1CF              0          0          0          0          0          0
## A2M               0          0          0         31          0         46
## A2M-AS1           0          0          0          0          0          0
##          SRR1275263 SRR1275242 SRR1275338 SRR1274117
## A1BG              0          0          0          0
## A1BG-AS1          0          0          0          0
## A1CF              0          0          0          0
## A2M               0          0          0         29
## A2M-AS1           0          0          0        133
colData(fluidigm)[,1:5]
## DataFrame with 130 rows and 5 columns
##               NREADS  NALIGNED    RALIGN TOTAL_DUP    PRIMER
##            <numeric> <numeric> <numeric> <numeric> <numeric>
## SRR1275356  10554900   7555880   71.5862   58.4931 0.0217638
## SRR1274090    196162    182494   93.0323   14.5122 0.0366826
## SRR1275251   8524470   5858130   68.7213   65.0428 0.0351827
## SRR1275287   7229920   5891540   81.4884   49.7609 0.0208685
## SRR1275364   5403640   4482910   82.9609   66.5788 0.0298284
## ...              ...       ...       ...       ...       ...
## SRR1275259   5949930   4181040   70.2705   52.5975 0.0205253
## SRR1275253  10319900   7458710   72.2747   54.9637 0.0205342
## SRR1275285   5300270   4276650   80.6873   41.6394 0.0227383
## SRR1275366   7701320   6373600   82.7600   68.9431 0.0266275
## SRR1275261  13425000   9554960   71.1727   62.0001 0.0200522
NCOL(fluidigm) #number of samples
## [1] 130

## 2.2 Step 0: Filtering and normalization

While there are 130 samples, there are only 65 cells, because each cell is sequenced twice at different sequencing depth. We will limit the analysis to the samples corresponding to high sequencing depth.

se <- fluidigm[,colData(fluidigm)[,"Coverage_Type"]=="High"]

We also filter out lowly expressed genes: we retain only those genes with at least 10 reads in at least 10 cells.

wh_zero <- which(rowSums(assay(se))==0)
pass_filter <- apply(assay(se), 1, function(x) length(x[x >= 10]) >= 10)
se <- se[pass_filter,]
dim(se)
## [1] 7069   65

This removed 19186 genes out of 26255. We now have 7069 genes (or features) remaining. Notice that it is important to remove genes with zero counts in all samples (we had 9673 genes which were zero in all samples here). Otherwise, PCA dimensionality reductions and other implementations may have a problem.

Normalization is an important step in any RNA-seq data analysis and many different normalization methods have been proposed in the literature. Comparing normalization methods or finding the best performing normalization in this dataset is outside of the scope of this vignette. Instead, we will use a simple quantile normalization that will at least make our clustering reflect the biology rather than the difference in sequencing depth among the different samples.

fq <- round(limma::normalizeQuantiles(assay(se)))
assays(se) <- list(normalized_counts=fq)

## 2.3 Step 1: Clustering with clusterMany

clusterMany lets the user quickly pick between many clustering options and run all of the clusterings in one single command. In the quick start we pick a simple set of clusterings based on varying the dimensionality reduction options. The way to designate which options to vary is to give multiple values to an argument. Due to a bug in R, we need to set getClass.msg=FALSE or otherwise a slew of annoying warnings will spit out at every call; this should be fixed in the next patch to R.

Here is our call to clusterMany:

options(getClass.msg=FALSE) #get rid of annoying messages about cache until fixed internally in R
library(clusterExperiment)
ce<-clusterMany(se, clusterFunction="pam",ks=5:10,
isCount=TRUE,dimReduce=c("PCA","var"),nVarDims=c(100,500,1000),
nPCADims=c(5,15,50),run=TRUE)

In this call to clusterMany we made the follow choices about what to vary:

• set dimReduce=c("PCA", "var") meaning run the clustering algorithm after both a dimensionality reduction to the top principal componetns, and also after filtering to the top most variable genes
• For PCA reduction, choose 5,15, and 50 principal components for the reduced data set (set nPCADims=c(5,15,50))
• For most variable reduction, choose 100, 500, and 1000 most variable genes (set nVarDims=c(100,500,1000))
• For the number of clusters, vary from $$k=5,\ldots,10$$ (set ks=5:10)

By giving only a single value to the relative argument, we keep the other possible options fixed, for example:

• we used ‘pam’ for all clustering (clusterFunction)
• we left the default of minSize which requires clusters to be of size at least 5; smaller clusters will be ignored and samples assigned to them given the unassigned value of -1.

We also set isCount=TRUE to indicate that our input data are counts. This means that operations for clustering and visualizations will internally transform the data as $$log_2(x+1)$$ (We could have alternatively explicitly set a transformation by giving a function to the transFun argument, for example if we wanted $$\sqrt(x)$$ or $$log(x+\epsilon)$$ or just identity).

We can visualize the resulting clusterings using the plotClusters command. It is also useful to change the amount of space to allow for the labels of the clusterings, so we will reset the mar option in par (we also change axisLine argument for this reason).

defaultMar<-par("mar")
plotCMar<-c(1.1,8.1,4.1,1.1)
par(mar=plotCMar)
plotClusters(ce,main="Clusters from clusterMany", whichClusters="workflow", sampleData=c("Biological_Condition","Cluster2"), axisLine=-1)

This plot shows the samples in the columns, and different clusterings on the rows. Each sample is color coded based on its clustering for that row, where the colors have been chosen to try to match up clusters across different clusterings that show large overlap. Moreover, the samples have been ordered so that each subsequent clustering (starting at the top and going down) will try to order the samples to keep the clusters together, without rearranging the clustering blocks of the previous clustering/row.

Notice that we also added the sampleData argument in our call, indicating that we also want to visualize some information about the samples saved in the colData slot (inherited from our original fluidigm object). We chose the columns “Biological_Condition” and “Cluster2” from colData, which correspond to the original biological condition of the experiment, and the clusters reported in the original paper, respectively. These are shown at the bottom of the plot.

Notice that some samples are white. This indicates that they have the value -1, meaning they were not clustered. This is from our choices to require at least 5 samples to make a cluster.

We can see that some clusters are fairly stable across different choices of dimensions while others can vary dramatically.

We have set whichClusters="workflow" to only plot clusters created from the workflow. Right now that’s all there are anyway, but as commands get rerun with different options, other clusterings can build up in the object (see discussion in this section about how multiple calls to workflow are stored). So setting whichClusters="workflow" means that we are only going to see our most recent calls to the workflow.

The labels shown are those given by clusterMany but can be a bit much for plotting. We can assign new labels if we prefer, for example to be more succinct, by changing the clusterLabels of the object (note we are permanently changing the labels here within the ClusterExperiment object). We choose to remove “Features” as being too wordy:

cl<-clusterLabels(ce)
cl<-gsub("Features","",cl)
clusterLabels(ce)<-cl

We will also show the clusters in a different order, which corresponds to varying the number of dimensions, rather than k.

cl<-clusterLabels(ce)
ndims<-sapply(sapply(strsplit(cl,","),function(x){strsplit(x[1],"=")}),function(x){x[2]})
ord<-order(as.numeric(ndims))
par(mar=plotCMar)
plotClusters(ce,main="Clusters from clusterMany", whichClusters=ord, sampleData=c("Biological_Condition","Cluster2"), axisLine=-1)

We see that the order in which the clusters are given to plotClusters changes the plot greatly. There are many different options for how to run plotClusters discussed in in the detailed section on plotClusters, but for now, this plot is good enough for a quick visualization.

### 2.3.1 The output

The output of clusterMany is a ClusterExperiment object. This is a class built for this package and explained in the section on ClusterExperiment Objects. In the object ce the clusters are stored, names and colors for each cluster within a clustering are assigned, and other information about the clusterings is recorded. Furthermore, all of the information in the original SummarizedExperiment is retained.

There are many accessor functions that help you get at the information in a ClusterExperiment object and some of the most relevant are described in the section on ClusterExperiment Objects. (ClusterExperiment objects are S4 objects, and are not lists).

For right now we will only mention the most basic such function that retrieves the actual cluster assignments. This is the clusterMatrix function, that returns a matrix where the columns are the different clusterings and the rows are samples. Within a clustering, the clusters are encoded by consecutive integers.

head(clusterMatrix(ce)[,1:3])
##      nVAR=100,k=5 nVAR=500,k=5 nVAR=1000,k=5
## [1,]            1            1             1
## [2,]            2            1             1
## [3,]            3            2             2
## [4,]            1            1             1
## [5,]            4            3             2
## [6,]            1            4             2

Because we have changed clusterLabels above, these new cluster labels are shown here. Notice that some of the samples are assigned the value of -1. -1 has the significance of encoding samples that are not assigned to any cluster. Why certain samples are not clustered depends on the underlying choices of the clustering routine. In this case, the default in clusterMany set the minimum size of a cluster to be 5, which resulted in -1 assignments.

Another special value is -2 discussed in the section on ClusterExperiment objects

## 2.4 Step 2: Find a consensus with combineMany

To find a consensus clustering across the many different clusterings created by clusterMany the function combineMany can be used next.

ce<-combineMany(ce)
## Note: no clusters specified to combine, using results from clusterMany

Notice we get a warning that we did not specify any clusters to combine, so it is using the default – those from the clusterMany call.

If we look at the clusterMatrix of the returned ce object, we see that the new cluster from combineMany has been added to the existing clusterings. This is the basic strategy of the functions in this package. Any clustering that is created is added to existing clusterings, so the user does not need to keep track of past clusterings and can easily compare what has changed.

We can again run plotClusters, which will now also show the result of combineMany:

head(clusterMatrix(ce)[,1:3])
##      combineMany nVAR=100,k=5 nVAR=500,k=5
## [1,]          -1            1            1
## [2,]          -1            2            1
## [3,]          -1            3            2
## [4,]          -1            1            1
## [5,]          -1            4            3
## [6,]          -1            1            4
par(mar=plotCMar)
plotClusters(ce,whichClusters="workflow")

The default result of combineMany is not usually a great choice, and certainly isn’t helpful here. The clustering from the default combineMany leaves most samples unassigned (white in the above plot). This is because the default way of combining is very conservative – it requires samples to be in the same cluster in every clustering to be assigned a cluster. This is quite stringent. We can vary this by setting the proportion argument to indicate the minimum proportion of times they should be together with other samples in the cluster they are assigned to. Explicit details on how combineMany makes these clusters are discussed in the section on combineMany.

So let’s label the one we found “combineMany, default” and then create a new one. (Making an informative label will make it easier to keep track of this particular clustering later, particularly if we make multiple calls to the workflow).

wh<-which(clusterLabels(ce)=="combineMany")
if(length(wh)!=1) stop() else clusterLabels(ce)[wh]<-"combineMany,default"

Now we’ll rerun combineMany with proportion=0.7. This time, we will give it an informative label upfront in our call to combineMany.

ce<-combineMany(ce,proportion=0.7,clusterLabel="combineMany,0.7")
## Note: no clusters specified to combine, using results from clusterMany
par(mar=plotCMar)
plotClusters(ce,whichClusters="workflow")

We see that more clusters are detected. Those that are still not assigned a cluster from combineMany clearly vary across the clusterings as to whether the samples are clustered together or not. Varying the proportion argument will adjust whether some of the unclustered samples get added to a cluster. There is also a minSize parameter for combineMany, with the default of minSize=5. We could reduce that requirement as well and more of the unclustered samples would be grouped into a cluster. Here, we reduce it to minSize=3 (we’ll call this “combineMany,final”):

ce<-combineMany(ce,proportion=0.7,minSize=3,clusterLabel="combineMany,final")
## Note: no clusters specified to combine, using results from clusterMany
par(mar=plotCMar)
plotClusters(ce,whichClusters="workflow",main="Min. Size=3")

We can also visualize the proportion of times these clusters were together across these clusterings (this information was made and stored in the ClusterExperiment object when we called combineMany as long as proportion value is <1):

plotCoClustering(ce)

This visualization can help in determining whether to change the value of proportion (though see combineMany for how -1 assignments affect combineMany).

## 2.5 Step 3: Merge clusters together with makeDendrogram and mergeClusters

It is not uncommon in practice to create forty or more clusterings with clusterMany, in which case the results of combineMany can often still result in too many small clusters. We might wonder if they are necessary or could be logically combined together. We could change the value of proportion in our call to combineMany. But we have found that it is often after looking at the clusters and how different they look on individual genes that we best make this determination, rather than the proportion of times they are together in different clustering routines.

For this reason, we often find the need for an additional clustering step that merges clusters together that are not different, based on running tests of differential expression between the clusters found in combineMany. We often display and use both sets of clusters side-by-side (that from combineMany and that from mergeClusters).

mergeClusters needs a hierarchical clustering of the clusters; it then goes progressively up that hierarchy, deciding whether two adjacent clusters can be merged. The function makeDendrogram makes such a hierarchy between clusters (by applying hclust to the medoids of the clusters). Because the results of mergeClusters are so dependent on that hierarchy, we require the user to call makeDendrogram rather than calling it internally. This is because different options in makeDendrogram can affect how the clusters are hierarchically ordered, and we want to encourage the user make these choices.

As an example, here we use the 500 most variable genes to make the cluster hierarchy.

ce<-makeDendrogram(ce,dimReduce="var",ndims=500)
plotDendrogram(ce)

We can see that clusters 1 and 3 are most closely related, at least in the top 500 most variable genes.

If we look at the summary of ce, it now has ‘makeDendrogram’ marked as ‘Yes’.

ce
## class: ClusterExperiment
## dim: 7069 65
## Primary cluster type: combineMany
## Primary cluster label: combineMany,final
## Table of clusters (of primary clustering):
## -1 c1 c2 c3 c4 c5 c6
## 12  9  7 15 14  4  4
## Total number of clusterings: 39
## Dendrogram run on 'combineMany,final' (cluster index: 1)
## -----------
## Workflow progress:
## clusterMany run? Yes
## combineMany run? Yes
## makeDendrogram run? Yes
## mergeClusters run? No

Now we are ready to actually merge clusters together. We now run mergeClusters that will go up this hierarchy and compare the level of differential expression (DE) in each pair. In other words, if we focus on the left side of the tree, DE tests are run, between 1 and 3, and between 6 and 8. If there is not enough DE between each of these (based on a cutoff that can be set by the user), then clusters 1 and 3 and/or 6 and 8 will be merged. And so on up the tree.

It is useful to first run mergeClusters without actually creating any merged clusters so as to preview what the final clustering will be (and perhaps to help in setting the cutoff).

mergeClusters(ce,mergeMethod="adjP",plot="mergeMethod")
## Note: Merging will be done on ' combineMany,final ', with clustering index 1

Then we can decide on a cutoff and visualize the resulting clustering.

ce<-mergeClusters(ce,mergeMethod="adjP",cutoff=0.01)
## Note: Merging will be done on ' combineMany,final ', with clustering index 1
## Warning in locfdr::locfdr(tstats, plot = 0): f(z) misfit = 2.9. Rerun with
## increased df
## Warning in locfdr::locfdr(tstats, plot = 0): f(z) misfit = 1.6. Rerun with
## increased df

par(mar=plotCMar)
plotClusters(ce,whichClusters="workflow", sampleData=c("Biological_Condition","Cluster2"))
plotCoClustering(ce,whichClusters=c("mergeClusters","combineMany"),
sampleData=c("Biological_Condition","Cluster2"),annLegend=FALSE)

Notice that mergeClusters combines clusters based on the actual values of the features, while the coClustering plot shows how often the samples clustered together. It is not uncommon that mergeClusters will merge clusters that don’t look “close” on the coClustering plot. This can be due to just the choices of the hierarchical clustering, but can also be because the two merged clusters are not often confused for each other across the clustering algorithms, yet don’t have strong differences on individual genes. This can be the case especially when the clustering is done on reduced PCA space, where an accumulation of small differences might consistently separate the samples (so the two clusters are not “confused” as to the samples), but because the differences are not strong on individual genes, mergeClusters combines them. These are ultimately different criteria.

Finally, we can do a heatmap visualizing this final step of clustering.

plotHeatmap(ce,clusterSamplesData="dendrogramValue",breaks=.99,
sampleData=c("Biological_Condition", "Cluster1", "Cluster2"))

By choosing “dendrogramValue” for the clustering of the samples, we will be showing the clusters according to the hierarchical ordering of the clusters found by makeDendrogram. The argument breaks=0.99 means that the last color of the heatmap spectrum will be forced to be the top 1% of the data (rather than evenly spaced through the entire range of values). This can be helpful in making sure that rare extreme values in the upper range do not absorb too much space in the color spectrum. There are many more options for plotHeatmap, some of which are discussed in the section on plotHeatmap.

## 2.6 Step 4: Finding Features related to the clusters

The last step is to then find features that are different between these clusters, as a start to understanding biological differences in the samples. The function getBestFeatures performs tests for differential expression (i.e. different mean levels) between the clusters for each feature. It relies on limma to run the differential expression analysis, with voom correction if the data are indicated by the user to be counts.

There are several types of tests that can be performed to identify features that are different between the clusters. Here we perform all pairwise tests between the clusters.

pairsAll<-getBestFeatures(ce,contrastType="Pairs",p.value=0.05,
number=nrow(ce),isCount=TRUE)
## Note: If isCount=TRUE the data will be transformed with voom() rather than
## with the transformation function in the slot transformation.
## This makes sense only for counts.
head(pairsAll)
##   IndexInOriginal Contrast Feature     logFC    AveExpr         t
## 1             552    X1-X2  BCL11B  9.967559  2.2054809  8.128140
## 2            6565    X1-X2     VIM -9.010410  5.8344528 -7.711733
## 3            3309    X1-X2    MIAT  8.400380  6.4698388  6.590747
## 4            4501    X1-X2 PRTFDC1 -8.249512 -0.1792585 -6.434117
## 5            1763    X1-X2   EPHA3  8.465910  2.6400851  6.020382
## 6            2210    X1-X2    GLI3 -7.736866  3.3201396 -5.961636
## 1 6.996166e-12 4.239077e-09 15.416450
## 2 4.314551e-11 2.060781e-08 13.179277
## 3 5.459041e-09 1.254304e-06  9.041400
## 4 1.061382e-08 2.224182e-06  8.804356
## 5 6.024265e-08 9.812334e-06  7.361316
## 6 7.687870e-08 1.197039e-05  6.924331

We can visualize only these significantly different pair-wise features with plotHeatmap by using the column “IndexInOriginal” in the result of getBestFeatures to quickly identify the genes to be used in the heatmap. Notice that the same genes can be replicated across different contrasts, so we will not always have unique genes:

length(pairsAll$Feature)==length(unique(pairsAll$Feature))
## [1] FALSE

In this case they are not unique. Hence, we will make sure we take only unique gene values so that they are not plotted multiple times in our heatmap. (This is a good practice even if in a particular case the genes are unique).

plotHeatmap(ce, clusterSamplesData="dendrogramValue",
clusterFeaturesData=unique(pairsAll[,"IndexInOriginal"]),
main="Heatmap of features w/ significant pairwise differences",
breaks=.99)

Notice that the samples clustered into the -1 cluster (i.e. not assigned) are clustered as an outgroup. They can also be mixed into the dendrogram (see makeDendrogram)

# 3 ClusterExperiment Objects

The ce object that we created by calling clusterMany is a ClusterExperiment object. The ClusterExperiment class is used by this package to keep the data and the clusterings together. It inherits from SummarizedExperiment, which means the data and colData and other information orginally in the fluidigm object are retained and can be accessed with the same functions as before. The ClusterExperiment object additionally stores clusterings and information about the clusterings along side the data. This helps keep everything together, and like the SummarizedExperiment object, allows for simple things like subsetting to a reduced set of subjects and being confident that the corresponding clusterings, colData, and so forth are similarly subset.

Typing the name at the control prompt results in a quick summary of the object.

ce
## class: ClusterExperiment
## dim: 7069 65
## Primary cluster type: mergeClusters
## Primary cluster label: mergeClusters
## Table of clusters (of primary clustering):
## -1 m1 m2 m3 m4 m5 m6
## 12  9  7 15 14  4  4
## Total number of clusterings: 40
## Dendrogram run on 'combineMany,final' (cluster index: 2)
## -----------
## Workflow progress:
## clusterMany run? Yes
## combineMany run? Yes
## makeDendrogram run? Yes
## mergeClusters run? Yes

This summary tells us the total number of clusterings (40), and gives some indication as to what parts of the standard workflow have been completed and stored in this object. It also gives information regarding the primaryCluster of the object. The primaryCluster is just one of the clusterings that has been chosen to be the “primary” clustering, meaning that by default various functions will turn to this clustering as the desired clustering to use. The “primaryCluster” can be reset by the user (see primaryClusterIndex). clusterMany arbitrarily sets the ‘primaryCluster’ to the first one, and each later step of the workflow sets the primary index to the most recent, but the user can set a specific clustering to be the primaryCluster with primaryClusterIndex.

There are also additional commands to access the clusterings and their related information (type help("ClusterExperiment-methods") for more).

The cluster assignments are stored in the clusterMatrix slot of ce, with samples on the rows and different clusterings on the columns. We can look at the cluster matrix and the primary cluster with the commands clusterMatrix and primaryCluster

head(clusterMatrix(ce))[,1:5]
##      mergeClusters combineMany,final combineMany,0.7 combineMany,default
## [1,]            -1                -1              -1                  -1
## [2,]            -1                -1              -1                  -1
## [3,]            -1                -1              -1                  -1
## [4,]             1                 1               1                  -1
## [5,]            -1                -1              -1                  -1
## [6,]            -1                -1              -1                  -1
##      nVAR=100,k=5
## [1,]            1
## [2,]            2
## [3,]            3
## [4,]            1
## [5,]            4
## [6,]            1
primaryCluster(ce)
##  [1] -1 -1 -1  1 -1 -1  2  3  3  3  4  4  1  4 -1  5  2  3  2 -1  4  4  4
## [24]  3 -1  3  3  3  6  4  4  1  5  2  2  6 -1  4 -1  6  2  3  3  1  3  3
## [47]  1 -1  2  4  4  4  3  4 -1  1  1  3  3  1  6  5  4  1  5

Remember that we made multiple calls to combineMany: only the last such call will be shown when we use whichClusters="workflow" in our plotting (see this section for a discussion of how these repeated calls are handled.)

Negative Valued Cluster Assignments The different clusters are stored as consecutive integers, with ‘-1’ and ‘-2’ having special meaning. ‘-1’ refers to samples that were not clustered by the clustering algorithm. In our example, we removed clusters that didn’t meet specific size criterion, so they were assigned ‘-1’. ‘-2’ is for samples that were not included in the original input to the clustering. This is useful if, for example, you cluster on a subset of the samples, and then want to store this clustering with the clusterings done on all the data. You can create a vector of clusterings that give ‘-2’ to the samples not originally used and then add these clusterings to the ce object manually with addClusters.

clusterLabels gives the column names of the clusterMatrix; clusterMany has given column names based on the parameter choices, and later steps in the workflow also give a name (or allow the user to set them). Clusterings might also have no specific label if the user created them. As we’ve seen, the user can also change these labels.

head(clusterLabels(ce),10)
##  [1] "mergeClusters"       "combineMany,final"   "combineMany,0.7"
##  [4] "combineMany,default" "nVAR=100,k=5"        "nVAR=500,k=5"
##  [7] "nVAR=1000,k=5"       "nPCA=5,k=5"          "nPCA=15,k=5"
## [10] "nPCA=50,k=5"

clusterTypes on the other hand indicates what call made the clustering. Unlike the labels, it is wise to not change the values of clusterTypes lightly.

head(clusterTypes(ce),10)
##  [1] "mergeClusters" "combineMany"   "combineMany.2" "combineMany.1"
##  [5] "clusterMany"   "clusterMany"   "clusterMany"   "clusterMany"
##  [9] "clusterMany"   "clusterMany"

The information that was in the original fluidigm object has also been preserved, like colData that contains information on each sample.

colData(ce)[,1:5]
## DataFrame with 65 rows and 5 columns
##               NREADS  NALIGNED    RALIGN TOTAL_DUP    PRIMER
##            <numeric> <numeric> <numeric> <numeric> <numeric>
## SRR1275356  10554900   7555880   71.5862   58.4931 0.0217638
## SRR1275251   8524470   5858130   68.7213   65.0428 0.0351827
## SRR1275287   7229920   5891540   81.4884   49.7609 0.0208685
## SRR1275364   5403640   4482910   82.9609   66.5788 0.0298284
## SRR1275269  10729700   7806230   72.7536   50.4285 0.0204349
## ...              ...       ...       ...       ...       ...
## SRR1275259   5949930   4181040   70.2705   52.5975 0.0205253
## SRR1275253  10319900   7458710   72.2747   54.9637 0.0205342
## SRR1275285   5300270   4276650   80.6873   41.6394 0.0227383
## SRR1275366   7701320   6373600   82.7600   68.9431 0.0266275
## SRR1275261  13425000   9554960   71.1727   62.0001 0.0200522

Another important slot in the ClusterExperiment object is the clusterLegend slot. This consists of a list, one element per column or clustering of clusterMatrix.

length(clusterLegend(ce))
## [1] 40
clusterLegend(ce)[1:2]
## $mergeClusters ## clusterIds color name ## -1 "-1" "white" "-1" ## 1 "1" "#1F78B4" "m1" ## 2 "2" "#33A02C" "m2" ## 3 "3" "#E31A1C" "m3" ## 4 "4" "#FF7F00" "m4" ## 5 "5" "#6A3D9A" "m5" ## 6 "6" "#B15928" "m6" ## ##$combineMany,final
##    clusterIds color     name
## -1 "-1"       "white"   "-1"
## 1  "1"        "#1F78B4" "c1"
## 2  "2"        "#33A02C" "c2"
## 3  "3"        "#E31A1C" "c3"
## 4  "4"        "#FF7F00" "c4"
## 5  "5"        "#6A3D9A" "c5"
## 6  "6"        "#B15928" "c6"

We can see that each element of clusterLegend consists of a matrix, with number of rows equal to the number of clusters in the clustering. The columns store information about that cluster. clusterIds is the internal id used in clusterMatrix to identify the cluster, name is a name for the cluster, and color is a color for that cluster. color is used in plotting and visualizing the clusters, and name is an arbitrary character string for a cluster. They are automatically given default values when the ClusterExperiment object is created, but we will see under the description of visualization methods how the user might want to manipulate these for better plotting results.

# 4 Visualizing the data

## 4.1 Plotting the clusters

We demonstrated during the quick start that we can visualize multiple clusterings using the plotClusters command.

par(mar=plotCMar)
plotClusters(ce,main="Clusters from clusterMany", whichClusters="workflow",
axisLine=-1)

We have seen that we can get very different plots depending on how we order the clusterings, and what clusterings are included. The argument whichClusters allows the user to choose different clusterings or provide an explicit ordering of the clusterings. whichClusters can take either a single character value, or a vector of either characters or indices. If whichClusters matches either “all” or “workflow”, then the clusterings chosen are either all, or only those from the most recent calls to the workflow functions. Choosing “workflow” removes from the visualization both user-defined clusterings and also previous calls to the workflow that have since been rerun. Setting whichClusters="workflow" can be a useful if you have called a method like combineMany several times, as we did, only with different parameters. All of those runs are saved (unless eraseOld=TRUE), but you may not want to plot them.

If whichClusters is a character that is not one of these designated values, the entries should match a clusterType value (like clusterMany) or a clusterLabel value (with exact matching). Alternatively, the user can specify numeric indices corresponding to the columns of clusterMatrix that provide the order of the clusters.

par(mar=plotCMar)
plotClusters(ce,whichClusters="clusterMany",
main="Only Clusters from clusterMany",axisLine=-1)

We can also add to our plot (categorical) information on each of our subjects from the colData of our SummarizedExperiment object (which is also retained in our ClusterExperiment object). This can be helpful to see if the clusters correspond to other features of the samples, such as sample batches. Here we add the values from the columns “Biological_Condition” and “Cluster2” that were present in the fluidigm object and given with the published data.

par(mar=plotCMar)
plotClusters(ce,whichClusters="workflow", sampleData=c("Biological_Condition","Cluster2"),
main="Workflow clusters plus other data",axisLine=-1)

### 4.1.1 Saving the alignment of plotClusters

plotClusters invisibly returns a ClusterExperiment object. In our earlier calls to plotCluster, this would be the same as the input object and so there is no reason to save it. However, the alignment and color assignments created by plotClusters can be requested to be saved via the resetNames, resetColors and resetOrderSamples arguments. If any of these are set to TRUE, then the object returned will be different than those of the input. Specifically, if resetColors=TRUE the colorLegend of the returned object will be changed so that the colors assigned to each cluster will be as were shown in the plot. Similarly, if resetNames=TRUE the names of the clusters will be changed to be integer values, but now the integers will be aligned to try to be the same across clusters (and therefore not consecutive integers, which is why these are saved as names for the clusters and not clusterIds). If resetOrderSamples=TRUE, then the order of the samples shown in the plot will be similarly saved in the slot orderSamples.

Here we make a call to plotClusters, but now ask to reset everything to match this ordering. We’ll save this as a different object so that this is not a permanent change for the rest of the vignette.

par(mar=plotCMar)
ce_temp<-plotClusters(ce,whichClusters="workflow", sampleData=c("Biological_Condition","Cluster2"),
main="Clusters from clusterMany, different order",axisLine=-1,resetNames=TRUE,resetColors=TRUE,resetOrderSamples=TRUE)

clusterLegend(ce_temp)[c("mergeClusters","combineMany,final")]
## $mergeClusters ## clusterIds color name ## -1 "-1" "white" "-1" ## 1 "1" "#1F78B4" "1" ## 2 "2" "#33A02C" "2" ## 3 "3" "#E31A1C" "3" ## 4 "4" "#FF7F00" "4" ## 5 "5" "#6A3D9A" "5" ## 6 "6" "#B15928" "6" ## ##$combineMany,final
##    clusterIds color     name
## -1 "-1"       "white"   "-1"
## 1  "1"        "#1F78B4" "1"
## 2  "2"        "#33A02C" "2"
## 3  "3"        "#E31A1C" "3"
## 4  "4"        "#FF7F00" "4"
## 5  "5"        "#6A3D9A" "5"
## 6  "6"        "#B15928" "6"

Now, the clusterLegend slot of the object no longer has the default color/name assignments, but it has names and colors that match across the clusters. Notice, that this means the prefix “m” or “c” that was previously given to distinguish the combineMany result from the mergeClusters result is now gone (the user could manually add them by changing the clusterLegend values).

We can also force plotClusters to use the existing color definitions, rather than create its own. This makes particular sense if you want to have continuity between plots – i.e. be sure that a particular cluster always has a certain color – but would like to do different variations of plotClusters to get a sense of how similar the clusters are.

For example, we set the colors above based on the cluster order from plotClusters where the clusters were ordered according to the workflow. But now we want to plot only the clusters from clusterMany, yet keep the same colors as before so we can compare them. We do this by setting the argument existingColors="all", meaning use all of the existing colors (currently this is the only option available for how to use the existing colors).

par(mar=plotCMar)
par(mfrow=c(1,2))
plotClusters(ce_temp, sampleData=c("Biological_Condition","Cluster2"),
whichClusters="workflow", existingColors="all",
main="Workflow Clusters, fix the color of clusters",axisLine=-1)

plotClusters(ce_temp, sampleData=c("Biological_Condition","Cluster2"),
existingColors="all", whichClusters="clusterMany",
main="clusterMany Clusters, fix the color of clusters",
axisLine=-1)