Contents

0.0.1 Abstract

Genome-wide data is used to stratify patients into classes using class discovery algorithms. However, we have observed systematic bias present in current state-of-the-art methods. This arises from not considering reference distributions while selecting the number of classes (K). As a solution, we developed a consensus clustering-based algorithm with a hypothesis testing framework called Monte Carlo consensus clustering (M3C). M3C uses a multi-core enabled Monte Carlo simulation to generate null distributions along the range of K which are used to calculate p values to select its value. P values beyond the limits of the simulation are estimated using a beta distribution. M3C can quantify structural relationships between clusters and uses spectral clustering to deal with non-gaussian and imbalanced structures.

0.0.2 Prerequisites

M3C recommended spec:

A relatively new and fast multi-core computer or cluster.

M3C requires:

A matrix or data frame of normalised expression data (e.g. microarray, RNA-seq, but also epigenetic or protein data) where columns equal samples and rows equal features. For RNA-seq data, VST or rlog transformed count data, log(CPM), log(TPM), and log(RPKM), are all acceptable forms of normalisation.

The data should be filtered to remove features with no or very low signal, and filtered using variance to reduce dimensionality (unsupervised), or p value from a statistical test (supervised).

M3C also accepts optionally:

Annotation data frame, where every row is a patient/sample and columns refer to meta-data, e.g. age, sex, etc. M3C will automatically rearrange your annotation to match the clustering output and add the consensus cluster grouping to it. This is done to speed up subsequent analyses. Note, this only works if the IDs (column names in data) match a column called “ID” in the annotation data frame.

0.0.3 Example workflow

The M3C package contains the GBM cancer microarray dataset for testing. There is an accepted cluster solution of 4. First we load the package which also loads the GBM data.

library(M3C)
library(NMF) # loading for aheatmap plotting function
library(gplots) # loading this for nice colour scale
library(ggsci) # more cool colours

# now we have loaded the mydata and desx objects (with the package automatically)
# mydata is the expression data for GBM
# desx is the annotation for this data

0.0.4 Exploratory data analysis

This is an important checking step prior to running M3C. Not just outliers are important to check for, also important are the underlying assumptions of PAM or K means. Spectral clustering has more flexible assumptions (it is considerably slower, however), all three methods require removal of extreme outliers. It is sensible to visually inspect a PCA (of samples) plot prior to clustering to verify;

  1. Clusters are approximately gaussian (normally distributed) - not severely elongated or non-linear
  2. That one cluster is not clearly a lot smaller than another cluster
  3. That extreme outliers have been removed

You can read more about the assumptions of K means here: (http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html)

In the PCA plot below we can see there are no obvious non linear structures, evidence of cluster imbalance, or extreme outliers, therefore we can proceed with running M3C on this dataset with PAM as the inner algorithm. If points i and ii were clearly not the case, spectral clustering should be used instead, but this normally is not a problem.

PCA1 <- pca(mydata)
Figure 1: CDF plot for real data

Figure 1: CDF plot for real data

0.0.5 Running M3C

In our example, we run the algorithm using the default settings (100x monte carlo iterations and 100x inner replications). We have found the results generally stable using these parameters, although they may be increased or the algorithm may simply be run a few extra times to test stability. Plots from the tool and an .csv file with the numerical outputs may be printed into the working directory (by adding printres = TRUE). We will set the seed in this example, incase you wish to repeat our results exactly (seed = 123). We will add the annotation file for streamlined downstream analyses (des = desx).

It is recommended to save the workspace after M3C if you are working with a large dataset because the runtimes can be quite long. M3C uses by default PAM with Euclidean distance in the consensus clustering loop because we have found this runs fast with good results - most of the time. The reference method that generates the null datasets is best left to the default setting which is ‘reverse-PCA’, this maintains the correlation structure of the data using the principle component loadings. We will set the removeplots option to TRUE in this example to remove plots from the vignette, normally this is FALSE by default.

res <- M3C(mydata, cores=1, seed = 123, des = desx, removeplots = TRUE)

The scores and p values are contained within the res$scores object. So we can see the RCSI reaches a maxima at K = 4 (RSCI=0.33), the monte carlo p value supports this decision (p=0.033). This means the null hypothesis that K = 1 can be rejected for this dataset because we have achieved significance (alpha=0.05) versus a dataset with no clusters (but with the same gene-gene correlation structure). For p values that extend beyond the lower limits imposed by the Monte Carlo simulation, M3C estimates parameters from the simulation to generate a beta distribution. The BETA_P in this case study is 0.033. You may of course want to take a look at other significant or near significant clustering solutions and how they relate to the variables of investigation.

res$scores
##    K  PAC_REAL   PAC_REF        RCSI MONTECARLO_P     BETA_P   P_SCORE
## 1  2 0.6734694 0.5773796 -0.15394262   0.66336634 0.69721054 0.1566361
## 2  3 0.4473469 0.5410857  0.19024326   0.18811881 0.19923640 0.7006313
## 3  4 0.3404082 0.4711755  0.32508528   0.03960396 0.03313314 1.4797374
## 4  5 0.3200000 0.4018041  0.22764361   0.08910891 0.07848793 1.1051971
## 5  6 0.3102041 0.3474204  0.11330519   0.23762376 0.22889997 0.6403543
## 6  7 0.2840816 0.3031673  0.06502332   0.31683168 0.33674564 0.4726980
## 7  8 0.2653061 0.2700408  0.01768878   0.45544554 0.46514316 0.3324134
## 8  9 0.2465306 0.2413469 -0.02125070   0.56435644 0.56967334 0.2443741
## 9 10 0.2130612 0.2190531  0.02773443   0.44554455 0.44629390 0.3503790

Now we will take a look at some of the plots M3C generates.

This is a CDF plot of the consensus matrices for our test data. We are looking for the value of K with the flattest curve and this can be quantified using the PAC metric (Șenbabaoğlu et al., 2014). In the CDF and following PAC plot we can see the overfitting effect of consensus clustering where as K increases so does the apparent stability of the results for any given dataset, this we correct for by using a reference.

Figure 2: CDF plot for real data

Figure 2: CDF plot for real data

This figure below is the PAC score, we can see an elbow at K = 4 which is suggestive this is the best K. However, we are not able to quantify how confident we are in this value without comparison versus a null dataset. Indeed, this may just be the result of random noise.

Figure 3: PAC score for real data

Figure 3: PAC score for real data

We then derive the relative cluster stability index (RCSI) which takes into account the reference PAC scores, but not the shape of the distribution. This metric is better than the PAC score for deciding class number, where the maximum value corresponds to the best value of K. In this example the RCSI has an optima at K=4. It generally requires fewer iterations than deriving the empirical p values.

Figure 4: RCSI

Figure 4: RCSI

Finally, we calculate a p value from the distribution, here we display the p values from the beta distribution. If none of the p values reach significance over a reasonable range of K (e.g. 10), then we accept the null hypothesis. In the GBM dataset, we can see K = 4 reaches signfiicance with an alpha of 0.05 (red dotted line), therefore we can reject the null hypothesis for the GBM dataset.

Figure 5: Empirical p values

Figure 5: Empirical p values

Now we are pretty convinced there are 4 clusters within this dataset which are not likely simply to have occurred by chance alone.

A further analysis that M3C conducts (if the dend flag is set to TRUE) is to make a dendrogram for the optimal K using the consensus cluster medoids, then sigclust is run to calculate p values for each branchpoint. This allows quantification of the structural relationships between consensus clusters. In this case CC3 and CC4 have a closer relationship, whilst the other clusters are quite far apart.

Figure 6: Dendrogram

Figure 6: Dendrogram

We can turn to examine the output objects that M3C generates. These allow heatmap generation for publications.

0.0.6 Understanding M3C outputs

The first 3 lines below extract the ordered (according in clustering results) expression data and the annotation data from the results object after running M3C. If we wanted to extract the data for 5 clusters, we would simply replace in 4 in the below lines to 5 and so on. We scale the data here row wise according to z-score prior to some light data compression for visualisation purposes in the heatmap. Remember to set Colv = NA for heatmap plotting because we have already ordered the data column or samplewise, M3C does that for you.

# get the data out of the results list (by using $ - dollar sign)
data <- res$realdataresults[[4]]$ordered_data # this is the data
annon <- res$realdataresults[[4]]$ordered_annotation # this is the annotation
ccmatrix <- res$realdataresults[[4]]$consensus_matrix # this is the consensus matrix

# normalise and scale the data
data <- t(scale(t(data))) # z-score normalise each row (feature)
data <- apply(data, 2, function(x) ifelse(x > 4, 4, x)) # compress data within range
data <- apply(data, 2, function(x) ifelse(x < -4, -4, x)) # compress data within range

# get some cool colour palettes from the ggsci package and RColourBrewer
ann_colors <- ggsci::pal_startrek("uniform")(4) # star trek palette
ann_colors2 <- ggsci::pal_futurama()(4) # futurama palette
pal <- rev(colorRampPalette(RColorBrewer::brewer.pal(10, "RdBu"))(256))
NMF::aheatmap(data, annCol = annon, scale = 'row', Colv = NA, distfun = 'pearson', 
         color = gplots::bluered(256), annColors = list(class=ann_colors, consensuscluster=ann_colors2))
Figure 7: aheatmap of GBM consensus clusters with tumour classification

Figure 7: aheatmap of GBM consensus clusters with tumour classification

Another plot we may want to do for publications is print the consensus matrix for our optimal clustering solution (in this case, K = 4). This should be quite crisp reflecting the significant stability of the results. We can see in this heatmap below of the consensus matrix the clusters do indeed look quite clear supporting our view that there is 4 clusters.

# time to plot the consensus matrix for the optimal cluster decision
ccmatrix <- res$realdataresults[[4]]$consensus_matrix # pull out the consensus matrix from the k = 4 object
pal <- rev(colorRampPalette(RColorBrewer::brewer.pal(9, "Reds"))(256)) # get some nice colours
NMF::aheatmap(ccmatrix, annCol = annon, Colv = NA, Rowv = NA, 
              color = rev(pal), scale = 'none') # plot the heatmap
Figure 6: aheatmap of GBM consensus matrix

Figure 6: aheatmap of GBM consensus matrix

0.0.7 Final check of consensus cluster structure

A last analysis we recommend to do is to examine how the clusters are arranged in principal component space. To do this the PCA function can be run directly on the M3C output object.

  PCA2 <- pca(res, K=4)

We can see in the PCA that clusters 3 and 4 are close together. Doing a visual check like this allows the user to ensure there are no artifacts that have occured due to the assumptions of K means or PAM. An example would be a very large cluster next to a smaller one, these algorithms will often pull in samples from the larger cluster to balance the small one.

Figure 7: PCA of a 4 cluster simulated dataset

Figure 7: PCA of a 4 cluster simulated dataset

0.0.8 Generating simulated data

We have included a function for generating simulated data to test various clustering algorithms. This cluster simulator is simple to use. Using the code below, clustersim generates a dataset with 225 samples, 900 features, a radius cut-off for the initial square of 8, a cluster number of 4, a seperation of clusters of 0.75, and a degree of noise added to each co-ordinate of 0.025. After running, a PCA will print of the data so we can visualise the 4 clusters in principle component space.

  res <- clustersim(225, 900, 8, 4, 0.75, 0.025, print = FALSE, seed=123)
Figure 7: PCA of a 4 cluster simulated dataset

Figure 7: PCA of a 4 cluster simulated dataset

0.0.9 Conclusions

In this tutorial, we have seen that M3C provides a rigourous approach for selecting the number of clusters in the data opposed to relying on relative scores. We have found in our analyses this approach results in increased performance relative to other methods and the removal of systematic bias inherant in the results. Building on prior work (Tibshirani et al., 2001; Șenbabaoğlu et al., 2014), M3C represents a conceptual step forward for class discovery algorithms.

0.0.10 References

Monti, Stefano, et al. “Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data.” Machine learning 52.1 (2003): 91-118.

Șenbabaoğlu, Yasin, George Michailidis, and Jun Z. Li. “Critical limitations of consensus clustering in class discovery.” Scientific reports 4 (2014): 6207.

Tibshirani, Robert, Guenther Walther, and Trevor Hastie. “Estimating the number of clusters in a data set via the gap statistic.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63.2 (2001): 411-423.