Hierarchical clustering

0.1 Three clustering approaches

The bigPint package incorporates a function plotClusters() for users to conduct hierarchical clustering analyses on their data. This is accomplished through the hclust() and cutree() functions in the stats package. By default, the resulting clusters will be plotted as parallel coordinate lines superimposed onto side-by-side boxplots showing the full dataset. This will allow users to quickly view the patterns in the resulting clusters. There are three main approaches to using the plotClusters() function:

Approach 1: The clusters are determined by clustering only on a subset of data (such as significant genes). Only the significant genes are overlaid as parallel coordinate lines.
Approach 2: The clusters are determined by clustering the full dataset. Then, only a subset of data (such as significant genes) are overlaid as parallel coordinate lines.
Approach 3: The clusters are determined by clustering the full dataset. All genes are overlaid as parallel coordinate lines.

We will now show examples for each of these three main approaches and discuss a few additional options in the plotClusters() function.

0.2 Approach 1

Below, we read in the soybean iron metabolism dataset and perform a logarithm on it (Lauter and Graham 2016). We determine that this dataset contains 5,604 genes. We also determine that this dataset has 113 genes that have FDR < 1e-7 (n = 113). We will consider these our 113 significant genes of interest.

library(grid)
library(matrixStats)
library(ggplot2)
library(bigPint)
data("soybean_ir_sub")
data("soybean_ir_sub_metrics")
soybean_ir_sub[,-1] <- log(soybean_ir_sub[,-1]+1)
nrow(soybean_ir_sub)

## [1] 5604

length(which(soybean_ir_sub_metrics[["N_P"]]$FDR < 1e-7))

## [1] 113

We then perform a hierarchical clustering of size four. By setting the clusterAllData parameter to a value of FALSE, our clustering algorithm only considers the 113 significant genes of interest. As a side note, we use the default agglomeration method “ward.D”. We then overlay these 113 genes as parallel coordinate lines across the full dataset represented as side-by-side boxplots.

colList = c("#00A600FF", rainbow(5)[c(1,4,5)])
ret <- plotClusters(data=soybean_ir_sub, dataMetrics = soybean_ir_sub_metrics, nC=4,
  colList = colList, clusterAllData = FALSE, threshVal = 1e-7, saveFile = FALSE)
names(ret)

## [1] "N_P_4"

grid.draw(ret[["N_P_4"]])

0.3 Approach 2

We can perform a similar analysis as we did before, only now we will set the clusterAllData parameter to a value of TRUE. This means our four clusters will be created considering all genes in the data (n = 5,604). The clusters will then represent the four main patterns in the full dataset. After that, we will only overlay the 113 significant genes as parallel coordinate lines across the full dataset represented as side-by-side boxplots. Note that Cluster 1 did not contain any of the 113 significant genes.

ret <- plotClusters(data=soybean_ir_sub, dataMetrics = soybean_ir_sub_metrics, nC=4,
  colList = colList, clusterAllData = TRUE, threshVal = 1e-7, saveFile = FALSE)
grid.draw(ret[["N_P_4"]])

0.4 Approach 3

In this last approach, we will perform hierarchical clustering on all 5,604 genes in the data and also overlay all 5,604 genes as parallel coordinate lines. We do not create any subset of genes (such as significant genes) because we leave the dataMetrics object with its default value of NULL.

ret <- plotClusters(data=soybean_ir_sub, nC=4, colList = colList,
  clusterAllData = TRUE, saveFile = FALSE)
grid.draw(ret[["N_P_4"]])

As a side note, we can use the output from Approach 3 to better understand how Approach 2 worked. Basically, in Approach 2, all 5,604 genes were used in the clustering algorithm to create four clusters. These resulted in the cluster sizes shown in Approach 3: That is, Cluster 1 had 2,334 genes; Cluster 2 had 1,997 genes; Cluster 3 had 1,992 genes; and Cluster 4 had 881 genes. However, at that point, Approach 2 only plotted the 113 significant genes. This means that none of 2,334 genes in Cluster 1 were significant; 44 of the 1,997 genes in Cluster 2 were significant; 33 of the 1,992 genes in Cluster 3 were significant; and 36 of the 881 genes in Cluster 4 were significant.

0.5 Standardization

When visualizing hierarchical clustering of genes, it is often recommended to consider the standardized values of read counts (Chandrasekhar, Thangavel, and Elayaraja 2012). Below is an example of standardizing gene read counts and plotting their clusters as parallel coordinate plots superimposed onto side-by-side boxplots. The standardization transforms each gene to have a mean of zero and a standard deviation of one across its samples. This example uses the soybean cotyledon dataset, which has three treatment groups (S1, S2, and S3) (Brown and Hudson 2015). Only two treatment pairs (S1 and S2; S1 and S3) return significant genes. As a result, our ret object will have these two comparisons.

data(soybean_cn_sub)
data(soybean_cn_sub_metrics)
soybean_cn_sub_st <- as.data.frame(t(apply(as.matrix(soybean_cn_sub[,-1]), 1, 
  scale)))
soybean_cn_sub_st$ID <- as.character(soybean_cn_sub$ID)
soybean_cn_sub_st <- soybean_cn_sub_st[,c(length(soybean_cn_sub_st), 
  1:length(soybean_cn_sub_st)-1)]
colnames(soybean_cn_sub_st) <- colnames(soybean_cn_sub)
nID <- which(is.nan(soybean_cn_sub_st[,2]))
soybean_cn_sub_st[nID,2:length(soybean_cn_sub_st)] <- 0
ret <- plotClusters(data=soybean_cn_sub_st, dataMetrics = soybean_cn_sub_metrics, 
  nC=4, colList = c("#00A600FF", "#CC00FFFF", "red", "darkorange"), lineSize = 0.5, 
  lineAlpha = 1, clusterAllData = FALSE, aggMethod = "average",
  yAxisLabel = "Standardized read count", saveFile = FALSE)

## [1] "S2_S3: There were no significant genes"

names(ret)

## [1] "S1_S2_4" "S1_S3_4" "S2_S3_4"

grid.draw(ret[["S1_S2_4"]])
grid.draw(ret[["S1_S3_4"]])

0.6 Verbose option

In some cases, users may wish to use the plotClusters() function to create clusters and then save these clusters to use for other purposes later, such as overlaying them onto scatterplot matrices, litre plots, and volcano plots. The gene IDs of the parallel coordinate lines in each cluster can be saved as .rds files for this purpose by setting the verbose option to a value of TRUE. Below is an example of this:

plotClusters(data=soybean_cn_sub_st, dataMetrics = soybean_cn_sub_metrics, nC=4,
  colList = c("#00A600FF", "#CC00FFFF", "red", "darkorange"), lineSize = 0.5, 
  lineAlpha = 1, clusterAllData = FALSE, aggMethod = "average",
  yAxisLabel = "Standardized read count", verbose = TRUE)

Running the command above will save numerous files into the outDir location. The default location is a temporary directory. You can find the exact pathway of this temporary directory by simply typing tempdir() into your R console. This temporary directory and the files within it remain the same until you close your R session; at that point, a new temporary directory will be created. After running the above command, you should find the following files in your temporary directory:

S1_S2_4_1.rds (10 gene IDs in Cluster 1 in S1 versus S2 comparison)
S1_S2_4_1.jpg (Individual plot of Cluster 1 in S1 versus S2 comparison)
S1_S2_4_2.rds (3 gene IDs in Cluster 1 in S1 versus S2 comparison)
S1_S2_4_2.jpg (Individual plot of Cluster 2 in S1 versus S2 comparison)
S1_S2_4_3.rds (2 gene IDs in Cluster 1 in S1 versus S2 comparison)
S1_S2_4_3.jpg (Individual plot of Cluster 3 in S1 versus S2 comparison)
S1_S2_4_4.rds (1 gene IDs in Cluster 1 in S1 versus S2 comparison)
S1_S2_4_4.jpg (Individual plot of Cluster 4 in S1 versus S2 comparison)
S1_S2_4.jpg (Plot of all four clusters in S1 versus S2 comparison)
S1_S3_4_1.rds (19 gene IDs in Cluster 1 in S1 versus S3 comparison)
S1_S3_4_1.jpg (Individual plot of Cluster 1 in S1 versus S3 comparison)
S1_S3_4_2.rds (7 gene IDs in Cluster 2 in S1 versus S3 comparison)
S1_S3_4_2.jpg (Individual plot of Cluster 2 in S1 versus S3 comparison)
S1_S3_4_3.rds (2 gene IDs in Cluster 3 in S1 versus S3 comparison)
S1_S3_4_3.jpg (Individual plot of Cluster 3 in S1 versus S3 comparison)
S1_S3_4_4.rds (2 gene IDs in Cluster 4 in S1 versus S3 comparison)
S1_S3_4_4.jpg (Individual plot of Cluster 4 in S1 versus S3 comparison)
S1_S3_4.jpg (Plot of all four clusters in S1 versus S3 comparison)

At this point, we can use any of the .rds files to examine the genes represented in the clustered parallel coordinate plots. For instance, we can read in the 19 Cluster 1 genes in the S1 versus S3 comparison as follows:

S1S3Cluster1 <- readRDS(paste0(tempdir(), "/S1_S3_4_1.rds"))

And check its contents contain the 19 gene IDs.

S1S3Cluster1

##  [1] "Glyma18g00690.1" "Glyma08g44110.1" "Glyma01g26570.1" "Glyma07g09700.1"
##  [5] "Glyma02g40610.1" "Glyma17g17970.1" "Glyma19g26250.1" "Glyma10g34630.1"
##  [9] "Glyma14g14220.1" "Glyma19g26710.1" "Glyma03g29150.1" "Glyma08g19245.1"
## [13] "Glyma07g01730.2" "Glyma18g25845.1" "Glyma08g22380.1" "Glyma20g30460.1"
## [17] "Glyma12g10960.1" "Glyma16g08810.1" "Glyma18g42630.2"

Then, we can overlay these 19 genes from Cluster 1 in the S1 versus S3 comparison as points onto the scatterplot matrix. This step can give us another perspective of these 19 genes. We may wish to use the same color (green) that we did in the parallel coordinate plots above and add an informative plot title.

ret <- plotSM(data = soybean_cn_sub_st, geneList = S1S3Cluster1,
  pointColor = "#00A600FF", saveFile = FALSE)
ret[["S1_S3"]] + ggtitle("Cluster 1 Genes (n=19)")

0.7 Pre-determined clusters

Some users may perform clustering using other software and may wish to simply overlay those clusters as parallel coordinate plots using the bigPint package. This can be achieved in the plotClusters() function using the geneLists variable. As an example, let us read in three clusters that each contain a character array of gene IDs. These three clusters may have been created in another software package and we simply now wish to plot them out using the bigPint graphics. We see that the first cluster contains 7 gene IDs and the last two clusters each contain 2 gene IDs.

S1S3Cluster2 <- readRDS(paste0(tempdir(), "/S1_S3_4_2.rds"))
S1S3Cluster3 <- readRDS(paste0(tempdir(), "/S1_S3_4_3.rds"))
S1S3Cluster4 <- readRDS(paste0(tempdir(), "/S1_S3_4_4.rds"))

S1S3Cluster2

## [1] "Glyma06g12670.1" "Glyma12g32460.1" "Glyma17g09850.1" "Glyma18g52920.1"
## [5] "Glyma01g24710.1" "Glyma04g39880.1" "Glyma05g27450.2"

S1S3Cluster3

## [1] "Glyma04g37510.1" "Glyma03g19880.1"

S1S3Cluster4

## [1] "Glyma08g11570.1" "Glyma08g19290.1"

We can now plot these three clusters directly as parallel coordinate lines superimposed on side-by-side boxplots as follows:

ret <- plotClusters(data=soybean_cn_sub_st, geneLists = list(S1S3Cluster2,
  S1S3Cluster3, S1S3Cluster4), lineAlpha = 1, lineSize = 0.5)
grid.draw(ret[["S1_S3_3"]])

References

Brown, Anne V., and Karen A. Hudson. 2015. “Developmental Profiling of Gene Expression in Soybean Trifoliate Leaves and Cotyledons.” BMC Plant Biology 15 (1). BioMed Central:169.

Chandrasekhar, T, K Thangavel, and E Elayaraja. 2012. “Effective Clustering Algorithms for Gene Expression Data.” International Journal of Computer Applications 1201:4914.

Lauter, AN Moran, and MA Graham. 2016. “NCBI Sra Bioproject Accession: PRJNA318409.”