*CancerSubtypes* is a package for cancer subtype analysis that includes various functions from dataset processing to result validation. In *CancerSubtypes* package, we provide a unified framework for analysing cancer subtpes from raw data to result visualisation. The main functions include genomic data pre-processing, cancer subtypes identification, results validation, visualization and comparison. *CancerSubtypes* provides the common data imputation and normalization methods for the genomic data pre-processing. Meanwhile, there are four feature selection methods to screen the key features in genomic dataset. The common cancer subtypes identification methods are integrated in this package such as **Consensus clustering (CC)** [From R package *ConsensusClusterPlus*], **Consensus Nonnegative matrix factorization (CNMF)** [From R package *NMF*], **Integrative clustering (iCluster)**[From R package *iCluster*], **Similarity Network Fusion (SNF)** [From R package *SNFtool*] and **Combined SNF and CC (SNF.CC)**. We implement these cancer subtypes identification methods in a unified input and output data format. The process of analysing cancer subtypes can be easily conducted in a standard workflow. *CancerSubtypes* provides the most useful feature selection methods and subtypes validation method and helps users to focus on their cancer genomic data and the results from different methods can be compared and evaluation in visualization way easily.

For the basic data processing, *CancerSubtypes* provides the methods for data distribution check, imputation and normalization and feature selection. There are four feature selection methods (Variance-Var, Median Absolute Deviation-MAD, COX model, Principal Component Analysis-PCA) in *CancerSubtypes* package. All the data processing methods possess the same input and output data format.

```
### Prepare a TCGA gene expression dataset for analysis.
library(CancerSubtypes)
library("RTCGA.mRNA")
rm(list = ls())
data(BRCA.mRNA)
mRNA=t(as.matrix(BRCA.mRNA[,-1]))
colnames(mRNA)=BRCA.mRNA[,1]
###To observe the mean, variance and Median Absolute Deviation distribution of the dataset, it helps users to get the distribution characteristics of the data, e.g. To evaluate whether the dataset fits a normal distribution or not.
data.checkDistribution(mRNA)
```

The raw genomic dataset always contains missing observations, especially in microarray gene expression data. It is not wise to remove all the features with missing observations in very few samples because the useful information will be discarded. The common method is to impute the proper value for the missing observations. *CancerSubtypes* integrates three common imputation methods for genomic datasets.

```
index=which(is.na(mRNA))
res1=data.imputation(mRNA,fun="median")
res2=data.imputation(mRNA,fun="mean")
res3=data.imputation(mRNA,fun="microarray")
```

```
result1=data.normalization(mRNA,type="feature_Median",log2=FALSE)
result2=data.normalization(mRNA,type="feature_zsocre",log2=FALSE)
```

```
###The top 1000 most variance features will be selected.
data1=FSbyVar(mRNA, cut.type="topk",value=1000)
###The features with (variance>0.5) are selected.
data2=FSbyVar(mRNA, cut.type="cutoff",value=0.5)
```

```
data1=FSbyMAD(mRNA, cut.type="topk",value=1000)
data2=FSbyMAD(mRNA, cut.type="cutoff",value=0.5)
```

```
mRNA1=data.imputation(mRNA,fun="microarray")
data1=FSbyPCA(mRNA1, PC_percent=0.9,scale = TRUE)
```

```
data(GeneExp)
data(time)
data(status)
data1=FSbyCox(GeneExp,time,status,cutoff=0.05)
```

Consensus clustering (CC, 2003) as an unsupervised subtypes discovery method, was a frequently used and valuable approach in many genomic studies and have lots of successful applications.

```
### The input dataset is single gene expression matrix.
data(GeneExp)
result=ExecuteCC(clusterNum=3,d=GeneExp,maxK=10,clusterAlg="hc",distance="pearson",title="GBM")
### The input dataset is multi-genomics data as a list
data(GeneExp)
data(miRNAExp)
GBM=list(GeneExp=GeneExp,miRNAExp=miRNAExp)
result=ExecuteCC(clusterNum=3,d=GBM,maxK=10,clusterAlg="hc",distance="pearson",title="GBM")
```

Non-negative matrix factorization (CNMF, 2004), as an effective dimension reduction method, was used in distinguishing molecular patterns for high-dimensional genomic data and provided a powerful method for class discovery. We apply the *NMF* package to execute the non-negative matrix factorization for the cancer genomic dataset. So this method allows users to input the number of core-CPUs for parallel processing.

```
### The input dataset is single gene expression matrix.
data(GeneExp)
result=ExecuteCNMF(GeneExp,clusterNum=3,nrun=30)
### The input dataset is multi-genomics data as a list
data(GeneExp)
data(miRNAExp)
GBM=list(GeneExp=GeneExp,miRNAExp=miRNAExp)
result=ExecuteCNMF(GBM,clusterNum=3,nrun=30)
```

Integrative clustering (iCluster, 2009) used a joint latent variable model for integrative clustering for multiple types of omics data.

```
data(GeneExp)
data(miRNAExp)
data1=FSbyVar(GeneExp, cut.type="topk",value=1000)
data2=FSbyVar(miRNAExp, cut.type="topk",value=300)
GBM=list(GeneExp=data1,miRNAExp=data2)
result=ExecuteiCluster(datasets=GBM, k=3, lambda=list(0.44,0.33,0.28))
```

Similarity network fusion (SNF, 2014) is a computational method on fusion similarity network for aggregating multi-omics data.

```
data(GeneExp)
data(miRNAExp)
GBM=list(GeneExp=GeneExp,miRNAExp=miRNAExp)
result=ExecuteSNF(GBM, clusterNum=3, K=20, alpha=0.5, t=20)
```

We propose to combine the SNF and CC together to generate a new cancer subtypes identification method.

```
data(GeneExp)
data(miRNAExp)
data(time)
data(status)
data1=FSbyCox(GeneExp,time,status,cutoff=0.05)
data2=FSbyCox(miRNAExp,time,status,cutoff=0.05)
GBM=list(GeneExp=data1,miRNAExp=data2)
result=ExecuteSNF.CC(GBM, clusterNum=3, K=20, alpha=0.5, t=20,maxK = 10, pItem = 0.8,reps=500,
title = "GBM", plot = "png", finalLinkage ="average")
```

The identified cancer subtypes by the computational methods should be in accordance with biological meanings and reveal the distinct molecular patterns.

Silhouette width is used to measure how similar a sample is matched to its identified subtype compared to other subtypes, a high value indicates that the sample is well matched. Each horizontal line represents a sample in the Silhouette plot. The length of the line is the silhouette width the sample has.

```
data(GeneExp)
data(miRNAExp)
GBM=list(GeneExp=GeneExp,miRNAExp=miRNAExp)
result=ExecuteSNF(GBM, clusterNum=3, K=20, alpha=0.5, t=20,plot = FALSE)
###Similarity smaple matrix
sil=silhouette_SimilarityMatrix(result$group, result$distanceMatrix)
plot(sil)
```

Note: If the input matrix is a dissimilarity matrix between samples, please use the *silhouette()* in cluster package to compute the silhouette width, otherwise a wrong result will be generated.

```
sil1=silhouette(result$group, result$distanceMatrix)
plot(sil1) ##wrong result
```

All the samples have the negative silhouette width.

Survival analysis is used to judge the different survival patterns between subtypes.

```
data(GeneExp)
data(miRNAExp)
data(time)
data(status)
data1=FSbyCox(GeneExp,time,status,cutoff=0.05)
data2=FSbyCox(miRNAExp,time,status,cutoff=0.05)
GBM=list(GeneExp=data1,miRNAExp=data2)
#### 1.ExecuteSNF
result1=ExecuteSNF(GBM, clusterNum=3, K=20, alpha=0.5, t=20,plot = FALSE)
group1=result1$group
distanceMatrix1=result1$distanceMatrix
p_value=survAnalysis(mainTitle="GBM1",time,status,group1,
distanceMatrix1,similarity=TRUE)
```

```
##
## *****************************************************
## GBM1 Cluster= 3 Call:
## survdiff(formula = Surv(time, status) ~ group)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## group=1 48 46 43.0 0.213 0.395
## group=2 39 38 26.6 4.848 7.069
## group=3 13 12 26.4 7.847 12.104
##
## Chisq= 14.5 on 2 degrees of freedom, p= 0.000704
```

This is a combination figure with three parts: Survival curves, heatmap of the sample similarity matrix and Silhouette width plots for the identified cancer subtypes. The samples in all of the plots have be reorganized by the identified caner subtypes. This kind of figure provides the visible results that can be easily evaluated.

```
#### 2.ExecuteSNF.CC
result2=ExecuteSNF.CC(GBM, clusterNum=3, K=20, alpha=0.5, t=20,
maxK = 5, pItem = 0.8,reps=500,
title = "GBM2", plot = "png",
finalLinkage ="average")
```

```
group2=result2$group
distanceMatrix2=result2$distanceMatrix
p_value=survAnalysis(mainTitle="GBM2",time,status,group2,
distanceMatrix2,similarity=TRUE)
```

```
##
## *****************************************************
## GBM2 Cluster= 3 Call:
## survdiff(formula = Surv(time, status) ~ group)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## group=1 39 38 26.6 4.848 7.069
## group=2 47 45 42.0 0.212 0.386
## group=3 14 13 27.4 7.529 11.661
##
## Chisq= 14 on 2 degrees of freedom, p= 0.000897
```