1 Introduction

MCbiclust is a R package for running massively correlating biclustering analysis. MCbiclust aims to find large scale biclusters with selected features being highly correlated with each other over a subset of samples. MCbiclust was particularly designed for the application of studying gene expression data, finding and understanding biclusters that are related to large scale co-regulation of genes.

Report issues on https://github.com/rbentham/MCbiclust/issues

2 Getting started

Once installed MCbiclust can be loaded with the following command:

library(MCbiclust)

MCbiclust also makes sure that the packages BiocParallel, cluster, stats, GGally, ggplot2 and scales are all installed. It is also advised that the packages ggplot2 and gplots are separately installed and loaded.

library(ggplot2)
library(gplots)
library(dplyr)
library(gProfileR)
library(MASS)
library(devtools)

3 Example of a single run

3.1 Loading Cancer Cell Line Encyclopedia (CCLE) and MitoCarta data sets

For this example analysis we will be seeking to find biclusters related to mitochondrial function in the cancer cell line encyclopedia (Barretina et al. (2012)). For this two datasets are needed, both of which are available on the MCbiclust package. The first in CCLE_small that contains a subset of the gene expression values found in the entire CCLE data set (the full dataset is avaliable at https://portals.broadinstitute.org/ccle/home), the second, Mitochondrial_genes, is a list of mitochondrial genes that can be found from MitoCarta1.0 (Pagliarini et al. (2008)).

data(CCLE_small)
data(Mitochondrial_genes)

It is a simple procedure to create a new matrix CCLE.mito only containing the mitochondrial genes. While there are \(1023\) known mitochondrial genes, not all of these are measured in CCLE_data.

mito.loc <- which(row.names(CCLE_small) %in% Mitochondrial_genes)
CCLE.mito <- CCLE_small[mito.loc,]

3.2 Finding a bicluster seed

The first step in using MCbiclust is to find a subset of samples that have the most highly correlating genes in the chosen gene expression matrix. This is done by, calculating the associated correlation matrix and then calculating the absolute mean of the correlations, as a correlation score.

Mathematically for a gene expression dataset measuring multiple gene probes across multiple samples, let \[\begin{equation} X = \textrm{Set of all probes, } Y = \textrm{Set of all samples} \end{equation}\] Then define two subsets of \(X\) and \(Y\), \(I\) and \(J\) repectively \[\begin{equation} I \subset X \textrm{ and } J \subset Y \end{equation}\] Subsets \(I\) and \(J\) form a bicluster on sets \(X\) and \(Y\), and the strength of this bicluster measured is based on measuring the correlations between pairs of probes in set \(I\) across all samples in set \(J\). The correlation between a probe \(i \in I\) to a probe \(k \in I\) across the samples in \(J\) is denoted as \(C_{i,k}^J\). Then the strength of the bicluster is measured as having a score \(\alpha\) based on these correlations, defined as: \[\begin{equation} \alpha_I^J= \frac{1}{|I|^2}\sum_{i \in I} \sum_{k \in I} abs(C_{i,k}^J) \end{equation}\] where the function \(abs()\) refers to the absolute value. In words the score \(\alpha\) is the average of the absolute values of the gene-gene correlation matrix for gene-probe set \(I\) across the samples in sample set \(J\).

A high \(\alpha_I^J\) value indicates that the probes in set \(I\) are being strongly co-regulated across the samples in set \(J\). As \(\alpha_I^J\) is calculating using the absolute values of \(C_{i,k}^J\), these probes could be in either in correlation or anti-correlation with each other.

MCbiclust main aim is therefore to find sets of samples and genes that have a high \(\alpha_I^J\) value. This is achieved by first finding a small sample “seed” containing relatively few samples but a very high \(\alpha_I^J\) value,

This is achieved with function FindSeed, initially a random subset of samples is chosen and then at each iteration one sample is removed and replaced and if this results in a higher \(\alpha_I^J\) value than this new subset is chosen. In this function the argument gem stands for gene expression matrix, seed.size indicates the size of the subset of samples that is sought. iterations indicates how many iterations of the algorithm to carry out before stopping. In general the higher the iterations the more optimal the solution in terms of maximising the strength of the correlation.

For reproducibility set.seed has been used to set R’s pseudo-random number generator. It should also be noted that the for gem the data matrix can not contain all the genes, since FindSeed involves the calculation of correlation matrices which are not computationally efficient to compute if they involve greater than ~1000 genes.

set.seed(102)
CCLE.seed <- FindSeed(gem = CCLE.mito,
                      seed.size = 10,
                      iterations = 10000,
                      messages = 1000)

FindSeed has one more additional options, initial.seed allows the user to specify the initial subsample to be tested, by default the initial sample subset is randomly chosen.

There is a function CorScoreCalc that can calculate the correlation score \(\alpha_I^J\) directly, in general however you should not need to use it, unless you wish to manually check the chosen seed is an improvement on one that is randomly generated.

set.seed(103)
random.seed <- sample(seq(length = dim(CCLE.mito)[2]), 10)
CorScoreCalc(CCLE.mito,random.seed)
## [1] 0.2901273
CorScoreCalc(CCLE.mito,CCLE.seed)
## [1] 0.5781261

The results of FindSeed can also be visualised by examining the associated correlation matrix, and viewing the result as a heatmap. Again it is easy to see the difference between the random subsample and the one outputted from FindSeed.

CCLE.random.cor <- cor(t(CCLE.mito[,random.seed]))
heatmap.2(CCLE.random.cor,trace = "none")