1. Introduction

CancerSubtypes is a package for cancer subtype analysis that includes various functions from dataset processing to result validation. In CancerSubtypes package, we provide a unified framework for analysing cancer subtpes from raw data to result visualisation. The main functions include genomic data pre-processing, cancer subtypes identification, results validation, visualization and comparison. CancerSubtypes provides the common data imputation and normalization methods for the genomic data pre-processing. Meanwhile, there are four feature selection methods to screen the key features in genomic dataset. The common cancer subtypes identification methods are integrated in this package such as Consensus clustering (CC) [From R package ConsensusClusterPlus], Consensus Nonnegative matrix factorization (CNMF) [From R package NMF], Integrative clustering (iCluster)[From R package iCluster], Similarity Network Fusion (SNF) [From R package SNFtool] and Combined SNF and CC (SNF.CC). We implement these cancer subtypes identification methods in a unified input and output data format. The process of analysing cancer subtypes can be easily conducted in a standard workflow. CancerSubtypes provides the most useful feature selection methods and subtypes validation method and helps users to focus on their cancer genomic data and the results from different methods can be compared and evaluation in visualization way easily.


2 Data processing

For the basic data processing, CancerSubtypes provides the methods for data distribution check, imputation and normalization and feature selection. There are four feature selection methods (Variance-Var, Median Absolute Deviation-MAD, COX model, Principal Component Analysis-PCA) in CancerSubtypes package. All the data processing methods possess the same input and output data format.

2.1 Basic processing

2.1.1 Analysis the raw data by check the data distribution.

### Prepare a TCGA gene expression dataset for analysis. 
library("CancerSubtypes")
library("RTCGA.mRNA")
data(BRCA.mRNA)
mRNA=t(as.matrix(BRCA.mRNA[,-1]))
colnames(mRNA)=BRCA.mRNA[,1]

###To observe the mean, variance and Median Absolute Deviation distribution of the dataset, it helps users to get the distribution characteristics of the data, e.g. To evaluate whether the dataset fits a normal distribution or not.
data.checkDistribution(mRNA)

2.1.2 Data imputation for features with missing values (NAs)

The raw genomic dataset always contains missing observations, especially in microarray gene expression data. It is not wise to remove all the features with missing observations in very few samples because the useful information will be discarded. The common method is to impute the proper value for the missing observations. CancerSubtypes integrates three common imputation methods for genomic datasets.

index=which(is.na(mRNA))
res1=data.imputation(mRNA,fun="median")
res2=data.imputation(mRNA,fun="mean")
res3=data.imputation(mRNA,fun="microarray")

2.1.3 Data normalization.

result1=data.normalization(mRNA,type="feature_Median",log2=FALSE)
result2=data.normalization(mRNA,type="feature_zscore",log2=FALSE)

2.2 Feature selection

2.2.1 Feature selection based on the most variance.

###The top 1000 most variance features will be selected.
data1=FSbyVar(mRNA, cut.type="topk",value=1000)
###The features with (variance>0.5) are selected.
data2=FSbyVar(mRNA, cut.type="cutoff",value=0.5)

2.2.2 Feature selection based on the most variant Median Absolute Deviation (MAD).

data1=FSbyMAD(mRNA, cut.type="topk",value=1000)
data2=FSbyMAD(mRNA, cut.type="cutoff",value=0.5)

2.2.3 Feature dimension reduction and extraction based on Principal Component Analysis.

mRNA1=data.imputation(mRNA,fun="microarray")
data1=FSbyPCA(mRNA1, PC_percent=0.9,scale = TRUE)

2.2.4 Feature selection based on Cox regression model.

data(GeneExp)
data(time)
data(status)
data1=FSbyCox(GeneExp,time,status,cutoff=0.05)

3. Clustering methods for Cancer subtype identification.

3.1 Consensus Clustering for cancer subtype identification

Consensus clustering (CC, 2003) as an unsupervised subtypes discovery method, was a frequently used and valuable approach in many genomic studies and have lots of successful applications.

### The input dataset is single gene expression matrix.
data(GeneExp)
result=ExecuteCC(clusterNum=3,d=GeneExp,maxK=10,clusterAlg="hc",distance="pearson",title="GBM")

### The input dataset is multi-genomics data as a list
data(GeneExp)
data(miRNAExp)
GBM=list(GeneExp=GeneExp,miRNAExp=miRNAExp)
result=ExecuteCC(clusterNum=3,d=GBM,maxK=10,clusterAlg="hc",distance="pearson",title="GBM")

3.2 Consensus Non-negative matrix factorization for cancer subtype identification

Non-negative matrix factorization (CNMF, 2004), as an effective dimension reduction method, was used in distinguishing molecular patterns for high-dimensional genomic data and provided a powerful method for class discovery. We apply the NMF package to execute the non-negative matrix factorization for the cancer genomic dataset. So this method allows users to input the number of core-CPUs for parallel processing.

### The input dataset is single gene expression matrix.
data(GeneExp)
result=ExecuteCNMF(GeneExp,clusterNum=3,nrun=30)

### The input dataset is multi-genomics data as a list
data(GeneExp)
data(miRNAExp)
GBM=list(GeneExp=GeneExp,miRNAExp=miRNAExp)
result=ExecuteCNMF(GBM,clusterNum=3,nrun=30)

3.3 Integrative clustering for cancer subtype identification

Integrative clustering (iCluster, 2009) used a joint latent variable model for integrative clustering for multiple types of omics data.

data(GeneExp)
data(miRNAExp)
data1=FSbyVar(GeneExp, cut.type="topk",value=1000)
data2=FSbyVar(miRNAExp, cut.type="topk",value=300)
GBM=list(GeneExp=data1,miRNAExp=data2)
result=ExecuteiCluster(datasets=GBM, k=3, lambda=list(0.44,0.33,0.28))

3.4 Similarity network fusion for cancer subtype identification

Similarity network fusion (SNF, 2014) is a computational method on fusion similarity network for aggregating multi-omics data.

data(GeneExp)
data(miRNAExp)
GBM=list(GeneExp=GeneExp,miRNAExp=miRNAExp)
result=ExecuteSNF(GBM, clusterNum=3, K=20, alpha=0.5, t=20)

3.5 Ensemble method of SNF and CC for cancer subtype identification

We propose to combine the SNF and CC together to generate a new cancer subtypes identification method.

data(GeneExp)
data(miRNAExp)
data(time)
data(status)
data1=FSbyCox(GeneExp,time,status,cutoff=0.05)
data2=FSbyCox(miRNAExp,time,status,cutoff=0.05)
GBM=list(GeneExp=data1,miRNAExp=data2)
result=ExecuteSNF.CC(GBM, clusterNum=3, K=20, alpha=0.5, t=20,maxK = 10, pItem = 0.8,reps=500, 
                     title = "GBM", plot = "png", finalLinkage ="average")

4 Results validation, interpretation and visualization for the identified cancer subtypes.

The identified cancer subtypes by the computational methods should be in accordance with biological meanings and reveal the distinct molecular patterns.

4.1 Silhouette width

Silhouette width is used to measure how similar a sample is matched to its identified subtype compared to other subtypes, a high value indicates that the sample is well matched. Each horizontal line represents a sample in the Silhouette plot. The length of the line is the silhouette width the sample has.

data(GeneExp)
data(miRNAExp)
GBM=list(GeneExp=GeneExp,miRNAExp=miRNAExp)
result=ExecuteSNF(GBM, clusterNum=3, K=20, alpha=0.5, t=20,plot = FALSE)

###Similarity smaple matrix
sil=silhouette_SimilarityMatrix(result$group, result$distanceMatrix)
plot(sil)

Note: If the input matrix is a dissimilarity matrix between samples, please use the silhouette() in cluster package to compute the silhouette width, otherwise a wrong result will be generated.

sil1=silhouette(result$group, result$distanceMatrix)
plot(sil1)  ##wrong result

All the samples have the negative silhouette width.

4.2 Survival analysis

Survival analysis is used to judge the different survival patterns between subtypes.

data(GeneExp)
data(miRNAExp)
data(time)
data(status)
data1=FSbyCox(GeneExp,time,status,cutoff=0.05)
data2=FSbyCox(miRNAExp,time,status,cutoff=0.05)
GBM=list(GeneExp=data1,miRNAExp=data2)

#### 1.ExecuteSNF
result1=ExecuteSNF(GBM, clusterNum=3, K=20, alpha=0.5, t=20,plot = FALSE)
group1=result1$group
distanceMatrix1=result1$distanceMatrix
p_value=survAnalysis(mainTitle="GBM1",time,status,group1,
                     distanceMatrix1,similarity=TRUE)
##                                                      
## *****************************************************
## GBM1 Cluster= 3   Call:
## survdiff(formula = Surv(time, status) ~ group)
## 
##          N Observed Expected (O-E)^2/E (O-E)^2/V
## group=1 48       46     43.0     0.213     0.395
## group=2 39       38     26.6     4.848     7.069
## group=3 13       12     26.4     7.847    12.104
## 
##  Chisq= 14.5  on 2 degrees of freedom, p= 0.000704

This is a combination figure with three parts: Survival curves, heatmap of the sample similarity matrix and Silhouette width plots for the identified cancer subtypes. The samples in all of the plots have be reorganized by the identified caner subtypes. This kind of figure provides the visible results that can be easily evaluated.

#### 2.ExecuteSNF.CC
result2=ExecuteSNF.CC(GBM, clusterNum=3, K=20, alpha=0.5, t=20,
                      maxK = 5, pItem = 0.8,reps=500, 
                      title = "GBM2", plot = "png", 
                      finalLinkage ="average")
group2=result2$group
distanceMatrix2=result2$distanceMatrix
p_value=survAnalysis(mainTitle="GBM2",time,status,group2,
                     distanceMatrix2,similarity=TRUE)
##                                                      
## *****************************************************
## GBM2 Cluster= 3   Call:
## survdiff(formula = Surv(time, status) ~ group)
## 
##          N Observed Expected (O-E)^2/E (O-E)^2/V
## group=1 39       38     26.6     4.848     7.069
## group=2 47       45     42.0     0.212     0.386
## group=3 14       13     27.4     7.529    11.661
## 
##  Chisq= 14  on 2 degrees of freedom, p= 0.000897

4.3 Statistical significance of clustering

Statistical significance of clustering is a pure statistical approach to test the significance difference data distribution between subtypes. Different expression is to test the expression difference between each subtypes and a reference group (always a set of normal samples).

data(GeneExp)
data(miRNAExp)
data(time)
data(status)
GBM=list(GeneExp=GeneExp,miRNAExp=miRNAExp)
result=ExecuteSNF(GBM, clusterNum=3, K=20, alpha=0.5, t=20,plot = FALSE)
group=result$group
sigclust=sigclustTest(miRNAExp,group, nsim=1000, nrep=1, icovest=1)

sigclust
##           Subtype 1 Subtype 2 Subtype 3
## Subtype 1     1.000     0.814     0.006
## Subtype 2     0.814     1.000     0.074
## Subtype 3     0.006     0.074     1.000

The SigClust summary plots show the distributions of cluster index (CI). The blue points, representing the simulated CIs, are plotted with random vertical jitter for better visualization. The solid and dotted lines correspond to the estimated nonparametric density and Gaussian density fit to the simulated CIs. The p-value show the significant level between the two subtypes. Please refer to [5] to get the more information about sigclust.

4.4 Differential expression analysis

Differential expression analysis is to test the expression difference between each subtypes and a reference group (always a set of normal samples). Here we apply limma package to conduct the different expression analysis between each subtypes and normal samples.

library("RTCGA.mRNA")
rm(list = ls())
data(BRCA.mRNA)
mRNA=t(as.matrix(BRCA.mRNA[,-1]))
colnames(mRNA)=BRCA.mRNA[,1]
mRNA1=data.imputation(mRNA,fun="microarray")
mRNA1=FSbyMAD(mRNA1, cut.type="topk",value=5000)
###Split the normal and tumor samples
index=which(as.numeric(substr(colnames(mRNA1),14,15))>9)
mRNA_normal=mRNA1[,index]
mRNA_tumor=mRNA1[,-index]

### Remove the duplicate samples
index1=which(as.numeric(substr(colnames(mRNA_tumor),14,15))>1)
mRNA_tumor=mRNA_tumor[,-index1]

##### Identify cancer subtypes
result=ExecuteCC(clusterNum=5,d=mRNA_tumor,maxK=5,clusterAlg="hc",distance="pearson",title="BRCA")
group=result$group
res=DiffExp.limma(Tumor_Data=mRNA_tumor,Normal_Data=mRNA_normal,group=group,topk=NULL,RNAseq=FALSE)
## Differently expression genes in subtype 1
head(res[[1]])
##           ID     logFC    AveExpr         t       P.Value     adj.P.Val
## 2439 COL10A1  4.712350  5.0043839  38.57960 3.234769e-151 1.617385e-147
## 3725   MMP11  3.528733  2.6563911  34.61589 3.206291e-134 8.015728e-131
## 3797     CA4 -4.357770 -1.0831256 -27.32458 7.230942e-101  1.205157e-97
## 3235    CAV1 -3.006418 -1.5612746 -26.90504  7.051163e-99  8.813954e-96
## 4586   CXCL3 -2.499337 -2.9294699 -26.24577  9.675861e-96  9.675861e-93
## 4133    RYR3 -2.575299 -0.4590268 -25.94906  2.524404e-94  2.103670e-91
##             B
## 2439 334.8796
## 3725 295.9708
## 3797 219.5287
## 3235 214.9674
## 4586 207.7722
## 4133 204.5236
## Differently expression genes in subtype 2
head(res[[2]])
##          ID     logFC    AveExpr          t      P.Value    adj.P.Val
## 637   ITM2A -6.173926  0.8088468 -12.356546 1.091109e-18 5.455546e-15
## 4942  CKAP2  4.229372 -2.1539566   9.959475 1.093416e-14 2.267182e-11
## 4596 RCBTB2 -4.492770 -0.2484435  -9.904508 1.360309e-14 2.267182e-11
## 3297  GPR19  4.272492 -2.2195806   9.791601 2.132269e-14 2.665337e-11
## 4829  CASP4 -3.767096  0.7556694  -9.613507 4.342293e-14 4.342293e-11
## 2065   MCM4  5.079992 -3.2200565   9.518389 6.355510e-14 5.296258e-11
##             B
## 637  31.11131
## 4942 22.64778
## 4596 22.44505
## 3297 22.02758
## 4829 21.36631
## 2065 21.01180

5 Conclusions

The CancerSubtypes R package provides a suite of cancer subtypes analysis tools and embeds the analysis in a standardized workflow. It provides a powerful way to analyze cancer subtype on genome-wide scale.


6. References

[1] Monti, Stefano, et al. “Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data.” Machine learning 52.1-2 (2003): 91-118.

[2] Brunet, Jean-Philippe, et al. “Metagenes and molecular pattern discovery using matrix factorization.” Proceedings of the national academy of sciences 101.12 (2004): 4164-4169.

[3] Shen, Ronglai, Adam B. Olshen, and Marc Ladanyi. “Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis.” Bioinformatics 25.22 (2009): 2906-2912.

[4] Wang, Bo, et al. “Similarity network fusion for aggregating data types on a genomic scale.” Nature methods 11.3 (2014): 333-337.

[5] Liu, Yufeng, et al. “Statistical significance of clustering for high-dimension, low?Csample size data.” Journal of the American Statistical Association (2012).

[6] Rousseeuw, Peter J. “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis.” Journal of computational and applied mathematics 20 (1987): 53-65.