The ClusterSignificance package provides tools to assess whether clusters, e.g. from a principal component analysis (PCA), show a separation different from that of random or permuted data. This is accomplished in a 3-step process: projection, classification, and permutation. To compare cluster separations, each separation must be given a score. First, all data points in each cluster are projected onto a line (projection), after which the separation of two groups at a time is scored (classification). Finally, to obtain a p-value, the separation score for the real data is compared to the separation scores for permuted data (permutation).
The package includes 2 methods for accomplishing the 3 steps above: Mean line projection (Mlp) and Principal curve projection (Pcp). Here we first discuss the aspects of the 3 steps that are independent of the projection method used. This is followed by examples of the Mlp and Pcp methods and the features unique to each. Finally, we provide an example where ClusterSignificance is used to examine the separation between 2 clusters downstream of a PCA.
library(ClusterSignificance)
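As a quick orientation, the condensed workflow below mirrors the calls demonstrated in detail in the following sections, using the Mlp demo data that ships with the package; it is a sketch of typical usage rather than a complete analysis.
## Condensed overview of the 3 steps; each is explained in detail below.
data(mlpMatrix)                    ## demo data included in the package
groups <- rownames(mlpMatrix)      ## group membership for each data point
prj <- mlp(mlpMatrix, groups)      ## 1. projection onto a line
cl <- classify(prj)                ## 2. classification (separation scores)
pe <- permute(                     ## 3. permutation
    mat = mlpMatrix,
    iter = 100,
    groups = groups,
    projmethod = "mlp"
)
pvalue(pe)                         ## permutation-based p-values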
The input to the projection methods is a matrix with data points (samples) in the rows and dimensions (e.g. principal components) in the columns. For the method to know which rows belong to the same group, the groups argument must be specified. The groups argument is a character vector, and ClusterSignificance will produce an error if it contains NAs. The projection step uses either the Mlp or Pcp method to project all data points onto a line, after which they are transferred into a single dimension.
Now that all points lie in one dimension, we can easily define a perpendicular line that best separates the two groups. Using the user-defined groups variable, the sensitivity and specificity are calculated at each possible separation along the projected points, and a separation score is calculated from these values for each possible cut point.
All separation scores are recorded, and the maximum score is subsequently compared downstream to the maximum scores from the permuted data.
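The exact scoring formula is implemented inside classify, but the idea of scanning every possible cut point can be sketched as follows. This is an illustration only: illustrativeScores, and the choice to combine sensitivity and specificity by simple addition, are assumptions made for the example and not the package's internal code.
## Illustrative sketch: scan every possible cut point along one-dimensional
## projected values `x` (group labels in `g`) and compute the sensitivity and
## specificity on either side of each cut. The simple sum used to combine them
## is for illustration only; classify() handles the actual scoring internally.
illustrativeScores <- function(x, g, positive = "grp1", negative = "grp2") {
    ord <- order(x)
    x <- x[ord]
    g <- g[ord]
    ## Candidate cuts: midpoints between adjacent projected points.
    cuts <- (head(x, -1) + tail(x, -1)) / 2
    sapply(cuts, function(cut) {
        sens <- sum(g[x <= cut] == positive) / sum(g == positive)
        spec <- sum(g[x >  cut] == negative) / sum(g == negative)
        sens + spec
    })
}
## The maximum over all cut points would then serve as the separation score.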
Permutation is performed by randomly sampling the input matrix and reassigning the data points to the groups. The projection and classification steps are then run on the sampled matrix, the maximum separation score is recorded, and the next permutation is performed. The p-value can subsequently be calculated with the following formula:

$$p = \frac{\#\{\text{permuted max scores} > \text{real max score}\}}{\text{number of iterations}}$$
If none of the permuted scores is greater than the real score, the p-value is instead calculated as:

$$p < \frac{1}{\text{number of iterations}}$$
Ideally, the data should be permuted until at least one permuted score is greater than the real score, allowing the calculation of an exact p-value, although with very distinct separations this is not always realistic. Although we use a low number of iterations in many of the examples in this vignette, we recommend 10,000 iterations as a good starting point for examining the distribution of permuted scores compared to the real score.
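The p-value logic above can be sketched in a few lines of R. Note that permutationPvalue, permScores, and realScore are illustrative names; ClusterSignificance handles this bookkeeping internally via permute and pvalue.
## Illustrative sketch of the permutation p-value calculation.
## `permScores` holds one max separation score per successful permutation and
## `realScore` is the max score of the real data (placeholder names).
permutationPvalue <- function(permScores, realScore) {
    n <- length(permScores)
    exceeding <- sum(permScores > realScore)
    if (exceeding == 0) {
        1 / n          ## no permuted score exceeds the real score: p < 1/n
    } else {
        exceeding / n
    }
}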
The ClusterSignificance package includes 2 small example data matrices, one for demonstrating the Mlp method and one for demonstrating the Pcp method. The Mlp demo matrix has 2 dimensions, whereas the Pcp demo matrix has 3. In addition, the groups variable has been added as the rownames of each matrix, with the Mlp demo matrix having 2 groups and the Pcp demo matrix having 3. Finally, colnames indicating the dimension number were also added to each matrix. Neither colnames nor rownames are needed for the input matrix to ClusterSignificance functions; we have added them here only for clarity.
data(mlpMatrix)
| | dim1 | dim2 |
|---|---|---|
| grp1 | 0.42 | 0.50 |
| grp1 | 0.86 | 0.85 |
| grp1 | 0.57 | 0.42 |
groups | Freq |
---|---|
grp1 | 20 |
grp2 | 20 |
data(pcpMatrix)
| | dim1 | dim2 | dim3 |
|---|---|---|---|
| grp1 | 20.8 | 28.0 | 52.01 |
| grp1 | 68.5 | 48.7 | 45.44 |
| grp1 | 90.1 | 31.6 | 13.87 |
groups | Freq |
---|---|
grp1 | 30 |
grp2 | 30 |
grp3 | 30 |
The Mlp method is suitable for comparing clusters in a maximum of 2 dimensions, evaluating the separation between a maximum of 2 groups at a time. Briefly, Mlp works by drawing a regression line through the means of the two groups and projecting all points onto that line. To transfer the points into one dimension, they are then rotated onto the x-axis. A detailed description and graphical representation of the individual steps can be seen in the example below. It should be noted that, although the Mlp method will work with as few as 2 data points per group, it may not be advisable to perform an analysis with so few data points.
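To make the geometry concrete, the short sketch below (not the package's internal code) draws the line through the two group means of the Mlp demo matrix, which is loaded again in the next section, and computes each point's position along that line; lineDir and oneDim are names chosen for this illustration.
## Illustrative sketch of the Mlp geometry: project each 2-dimensional point
## onto the line through the two group means and keep its position along it.
data(mlpMatrix)
groups <- rownames(mlpMatrix)
m1 <- colMeans(mlpMatrix[groups == "grp1", ])
m2 <- colMeans(mlpMatrix[groups == "grp2", ])
lineDir <- (m2 - m1) / sqrt(sum((m2 - m1)^2))   ## unit vector along the line
## Signed distance of each centered point along the line = 1-D coordinate.
oneDim <- as.numeric(sweep(mlpMatrix, 2, m1) %*% lineDir)
head(oneDim)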
We use the example data included with the package to demonstrate the Mlp method.
## Load demo input.
data(mlpMatrix)
## Create the group variable.
groups <- rownames(mlpMatrix)
## Run Mlp and plot.
prj <- mlp(mlpMatrix, groups)
plot(prj)
The plot above shows the 6 steps Mlp takes to project the points into one dimension.
Note that step 4 shows the orthogonal projection onto the regression line, yet it may not appear perpendicular to the line. This is due to the automatic scaling of the y-axis, which is necessary because large differences in scale between the dimensions (e.g. PC1 and PC2) would otherwise cause data points to disappear outside of the plot margins.
## Classify and plot.
cl <- classify(prj)
plot(cl)
The classification step works in the same manner for both the Mlp and Pcp methods and was described above in the Classification section. The perpendicular lines in the classification plot represent the separation scores for all possible separations. The highest-scoring separation of the two groups is indicated by the tallest perpendicular line.
## Set the seed and number of iterations.
set.seed(3)
iterations <- 100
## Permute and plot.
pe <- permute(
mat=mlpMatrix,
iter=iterations,
groups=groups,
projmethod="mlp"
)
plot(pe)
pvalue(pe)
## grp1 vs grp2
## 0.01
The permutation plot shows a histogram of the maximum separation scores from all permutations, with the vertical red line representing the maximum score from the real data. From the plot we can see that, when comparing the separation of grp1 vs. grp2 using 100 iterations, the permuted data never scores higher than the real data. The separation therefore has a significant p-value of < 0.01. The p-values can be obtained using pvalue(pe).
The Pcp method is suitable in situations where more than 2 dimensions need to be considered simultaneously and/or more than 2 group separations should be analyzed. It is limited by the fact that it will not work with < 5 data points per group and requires > 5 unique values per dimension. The Pcp method uses the principal.curve function from the princurve package to fit a principal curve to the data points, utilizing all the dimensions provided by the user. A principal curve can be described as a "smooth curve that passes through the middle of the data in an orthogonal sense". All data points are then projected onto the principal curve and the distances between the points along the curve are calculated. This completes the transfer of the points into one dimension, and the group separations are then ready for scoring.
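The projection itself is handled by the pcp function, but the underlying principal curve step can be illustrated directly with the princurve package (principal.curve matches the princurve version listed in the session info below; newer releases name it principal_curve):
## Illustrative sketch: fit a principal curve to the Pcp demo matrix and use
## the arc-length position (lambda) of each projected point as its
## one-dimensional coordinate.
library(princurve)
data(pcpMatrix)
fit <- principal.curve(pcpMatrix)
head(fit$s)        ## points projected onto the curve (same dimensions as input)
head(fit$lambda)   ## arc length along the curve, i.e. the 1-D positions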
For this example, the demonstration data were simulated so that one cluster is visibly separated from the other two, whereas two of the clusters largely overlap. We can therefore predict that ClusterSignificance will find 2 of the separations to be significant and the third to be non-significant.
## Load demo input.
data(pcpMatrix)
## Create the group variable.
groups <- rownames(pcpMatrix)
## Run Pcp and plot.
prj <- pcp(pcpMatrix, groups)
plot(prj)
The plot above shows the steps Pcp takes to project the points into one dimension: a principal curve is fit to the data, after which all points are projected onto the curve and transferred into one dimension based on their positions along the curve.
Because the Pcp method can perform several group comparisons at once, all possible pairwise separations are automatically analyzed and plotted when > 2 groups are used.
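For example, the pairwise comparisons analyzed for the Pcp demo matrix can be listed with combn, the same approach used in the custom permutation example later in this vignette; for the three groups this yields grp1 vs grp2, grp1 vs grp3, and grp2 vs grp3.
## List the pairwise group comparisons analyzed when > 2 groups are present.
combn(unique(rownames(pcpMatrix)), 2)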
cl <- classify(prj)
plot(cl)
## Set the seed and number of iterations.
set.seed(3)
iterations <- 100
## Permute and plot.
pe <- permute(
mat=pcpMatrix,
iter=iterations,
groups=groups,
projmethod="pcp"
)
plot(pe)
pvalue(pe)
## grp1 vs grp2 grp1 vs grp3 grp2 vs grp3
## 0.51 0.01 0.01
The results from the permutation indicate that ClusterSignificance performed as we would expect. The separation of grp1 vs grp2 is not significant, whereas the separations of both grp1 vs grp3 and grp2 vs grp3 are significant.
One likely application of ClusterSignificance is the analysis of cluster separation after PCA. Here we use the leukemia dataset from the plsgenomics package to demonstrate how this may be done. The dataset includes "gene expression data (3051 genes and 38 leukemia samples) from the leukemia microarray study of Golub et al. (1999)". The tumor samples are from both acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) patients. Due to the inherent differences in the transcriptomes of these two cancer types, we hypothesize that they will form separate clusters after PCA and that ClusterSignificance can be used to examine whether this separation is significant.
## The leukemia dataset is provided as a list with 3 elements.
## An explanation of these can be found using the command:
## help(leukemia)
library(plsgenomics, quietly = TRUE)
data(leukemia)
## Extract the expression matrix.
expression <- leukemia$X
## Run PCA and extract the principal components.
pca <- prcomp(expression, scale = TRUE)
mat <- pca$x
## Extract the grouping variable (coded as 1 (ALL) and 2 (AML)).
groups <- ifelse(leukemia$Y == 1, "ALL", "AML")
groups | Freq |
---|---|
ALL | 27 |
AML | 11 |
ob <- pcp(mat, groups, normalize=TRUE)
plot(ob)
cl <- classify(ob)
plot(cl)
set.seed(3)
iterations <- 100
pe <- permute(mat=mat, iter=iterations, groups=groups, projmethod="pcp")
pvalue(pe)
## ALL vs AML
## 0.01
As we can see from the resulting p-value, ClusterSignificance detects a significant separation between the ALL and AML clusters, even with the low iteration number used in this example. Running 10,000 iterations results in p = 1e-04.
When running the permutation step of the Pcp method, it is possible that not all iterations complete successfully, as indicated by the message printed when the permutation finishes. Because the p-value is calculated using the number of iterations that successfully completed, rather than the number of iterations requested by the user, this will not be an issue for most users. However, if a specific number of iterations is required, it is possible to concatenate multiple permutation runs.
mat <- mlpMatrix
groups <- rownames(mlpMatrix)
set.seed(3)
iterations <- 10
pe1 <- permute(mat=mat, iter=iterations, groups=groups, projmethod="pcp")
pe2 <- permute(mat=mat, iter=iterations, groups=groups, projmethod="pcp")
pe3 <- permute(mat=mat, iter=iterations, groups=groups, projmethod="pcp")
pvalue(pe1)
## grp1 vs grp2
## 0.1
pe <- c(pe1, pe2, pe3)
pvalue(pe)
## grp1 vs grp2
## 0.03333333
In some cases, e.g. PCA, it may be desirable to define the permutation matrices yourself rather than allowing ClusterSignificance to generate them automatically. For example, when determining the separation of clusters post-PCA, a more valid approach may be to perform PCA on a sampled version of the original values and provide those matrices to ClusterSignificance, rather than allowing ClusterSignificance to randomly sample the principal components themselves. This can be accomplished by providing the userPermats argument to the permute method.
## Define a function that permutes the values within the two groups being
## compared and re-runs PCA on the permuted values, returning the PC scores.
.pca <- function(y, groups, uniq.groups, mat) {
    ## Rows belonging to the two groups of the current comparison.
    idx <- groups %in% uniq.groups[, y]
    ## Permute each column independently within those rows, then run PCA.
    pca <- prcomp(
        sapply(1:ncol(mat), function(i)
            sample(mat[idx, i], replace = FALSE)
        ),
        scale = TRUE
    )
    return(pca$x)
}

## Build one permuted PCA matrix per iteration for every pairwise comparison.
permMatrix <- function(iterations, groups, mat) {
    uniq.groups <- combn(unique(groups), 2)
    permats <- lapply(1:ncol(uniq.groups), function(y)
        lapply(1:iterations, function(x)
            .pca(y, groups, uniq.groups, mat)
        )
    )
    return(permats)
}

set.seed(3)
mat <- pcpMatrix
groups <- rownames(pcpMatrix)
iterations <- 100
permats <- permMatrix(iterations, groups, mat)
The permutation matrix is a nested list with one element per pairwise comparison. Each of these elements is itself a list whose length equals the number of iterations, and each element of that inner list is a single permuted matrix.
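This structure can be checked quickly; the values noted in the comments below assume the pcpMatrix example above (3 pairwise comparisons, 100 iterations, and 60 data points per comparison).
## Verify the nested structure of the user-defined permutation matrices.
length(permats)          ## one element per pairwise comparison (3 here)
length(permats[[1]])     ## one element per iteration (100 here)
dim(permats[[1]][[1]])   ## a single permuted PCA matrix (60 rows, 3 columns here)
The permutation matrices can then be provided to ClusterSignificance in the following manner: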
pe <- permute(
mat = mat,
iter=100,
groups=groups,
projmethod="pcp",
userPermats=permats
)
The ClusterSignificance package can estimate a p-value for the separation of clusters projected onto a line. Typically, the package might be used downstream of a PCA analysis, but clusters from other sources can also be tested.
sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] plsgenomics_1.3-1 boot_1.3-18
## [3] MASS_7.3-45 ClusterSignificance_1.2.3
## [5] ggplot2_2.2.1 BiocStyle_2.2.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.9 knitr_1.15.1 magrittr_1.5
## [4] scatterplot3d_0.3-38 munsell_0.4.3 colorspace_1.3-2
## [7] quadprog_1.5-5 highr_0.6 stringr_1.2.0
## [10] plyr_1.8.4 tools_3.3.3 grid_3.3.3
## [13] gtable_0.2.0 htmltools_0.3.5 yaml_2.1.14
## [16] lazyeval_0.2.0 rprojroot_1.2 digest_0.6.12
## [19] assertthat_0.1 tibble_1.2 RColorBrewer_1.1-2
## [22] princurve_1.1-12 evaluate_0.10 rmarkdown_1.3
## [25] stringi_1.1.2 pracma_1.9.9 scales_0.4.1
## [28] backports_1.0.5