The simplifyEnrichment package clusters functional terms into groups by clustering their similarity matrix with a newly proposed method, “binary cut”. The method recursively applies partitioning around medoids (PAM) with two groups to the similarity matrix. In each iteration, a score is computed to decide whether the group of gene sets that corresponds to the current sub-matrix should be split further. For more details of the method, please refer to the simplifyEnrichment paper.
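The recursion can be illustrated with a toy sketch in base R plus the cluster package. The score used here (mean within-group similarity minus mean between-group similarity), the function names, and the cutoff and min_size parameters are all assumptions made for illustration; the package's actual score and stopping rules are described in the paper.

```r
library(cluster)  # pam(); a recommended package bundled with standard R installs

# Toy split score (an assumption for illustration, not the package's score):
# mean similarity within the two groups minus mean similarity between them.
split_score = function(mat, grp) {
    same = outer(grp, grp, "==")        # TRUE where both terms share a group
    off_diag = row(mat) != col(mat)     # ignore self-similarities
    mean(mat[same & off_diag]) - mean(mat[!same])
}

# Recursively bisect a similarity matrix and return integer cluster labels.
# cutoff and min_size are hypothetical tuning parameters.
toy_binary_cut = function(mat, cutoff = 0.3, min_size = 4) {
    n = nrow(mat)
    if (n < min_size || n < 3) return(rep(1L, n))
    grp = pam(as.dist(1 - mat), k = 2)$clustering   # PAM with two groups
    if (split_score(mat, grp) < cutoff) return(rep(1L, n))
    labels = integer(n)
    offset = 0L
    for (g in 1:2) {
        idx = which(grp == g)
        sub = toy_binary_cut(mat[idx, idx, drop = FALSE], cutoff, min_size)
        labels[idx] = sub + offset
        offset = offset + max(sub)
    }
    labels
}

# Two blocks of five terms: high similarity within, low similarity between.
mat = matrix(0.1, 10, 10)
mat[1:5, 1:5] = 0.9
mat[6:10, 6:10] = 0.9
diag(mat) = 1
labels = toy_binary_cut(mat)
labels
```

On this matrix the first bisection separates the two blocks, and each block is then left intact because a further split does not improve the toy score.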
The major use case for simplifyEnrichment is simplifying GO enrichment results by clustering the semantic similarity matrix of the significant GO terms. To demonstrate the usage, we first generate a list of random GO IDs from the Biological Process (BP) ontology category:
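A minimal sketch of this step, assuming the package's random_GO() helper draws from BP by default. The seed value is arbitrary and only makes the draw reproducible.

```r
library(simplifyEnrichment)

set.seed(888)            # arbitrary seed, only for reproducibility
go_id = random_GO(500)   # 500 random GO IDs (BP assumed to be the default ontology)
head(go_id)
```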
simplifyEnrichment starts with the GO similarity matrix. Users can supply their own similarity matrices or use the GO_similarity() function to calculate the semantic similarity matrix. The GO_similarity() function is simply a wrapper on GOSemSim::termSim(). The function accepts a vector of GO IDs. Note that the GO terms should all belong to the same ontology (i.e., BP, CC or MF).
By default, GO_similarity() uses the Rel method in GOSemSim::termSim(). Other methods to calculate GO similarities can be set with the measure argument, e.g.:
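A short sketch, assuming the GOSemSim-based version of GO_similarity() described here and using "Wang" as one of the measures GOSemSim supports. The six GO IDs are the BP terms that appear in the example output of this document.

```r
library(simplifyEnrichment)

# Six BP terms taken from the example output in this document
go_id = c("GO:0003283", "GO:0022018", "GO:0030032",
          "GO:0061508", "GO:1901222", "GO:0060164")

mat_rel  = GO_similarity(go_id)                    # default measure ("Rel")
mat_wang = GO_similarity(go_id, measure = "Wang")  # alternative measure
dim(mat_wang)
```

Depending on the annotation database, terms without the required information may be dropped, so the returned matrix can have fewer rows than the input vector has IDs.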
With the similarity matrix mat, users can directly apply the simplifyGO() function to perform the clustering and visualize the results.
## Cluster 500 terms by 'binary_cut'... 43 clusters, used 1.914086 secs.
On the right side of the heatmap are word cloud annotations that summarize the functions of every GO cluster with keywords. Note that there is no word cloud for the cluster that is merged from small clusters (size < 5).
The returned variable df is a data frame with GO IDs, GO terms and the cluster labels:
##           id                                           term cluster
## 1 GO:0003283                      atrial septum development       1
## 2 GO:0022018 lateral ganglionic eminence cell proliferation       1
## 3 GO:0030032                         lamellipodium assembly       2
## 4 GO:0061508                            CDP phosphorylation       3
## 5 GO:1901222          regulation of NIK/NF-kappaB signaling       4
## 6 GO:0060164 regulation of timing of neuron differentiation       1
The size of GO clusters can be retrieved by:
##
## 5 7 8 10 12 13 17 18 21 22 24 25 26 27 28 29 30 31 32 33
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 34 35 36 37 38 39 40 41 42 43 20 23 9 16 19 15 14 11 2 6
## 1 1 1 1 1 1 1 1 1 1 2 2 3 5 5 6 10 12 37 45
## 4 1 3
## 97 114 132
Or split the data frame by the cluster labels:
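Both summaries need only base R. The sketch below rebuilds the six rows shown earlier as a small data frame; using table() and split() is an assumption about how such summaries can be produced, not a reproduction of the original calls.

```r
# The six example rows from the data frame shown earlier
df = data.frame(
    id = c("GO:0003283", "GO:0022018", "GO:0030032",
           "GO:0061508", "GO:1901222", "GO:0060164"),
    term = c("atrial septum development",
             "lateral ganglionic eminence cell proliferation",
             "lamellipodium assembly",
             "CDP phosphorylation",
             "regulation of NIK/NF-kappaB signaling",
             "regulation of timing of neuron differentiation"),
    cluster = c(1, 1, 2, 3, 4, 1)
)

sizes = sort(table(df$cluster))      # cluster sizes, smallest first
by_cluster = split(df, df$cluster)   # one data frame per cluster label

sizes
by_cluster[["1"]]                    # rows assigned to cluster 1
```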
The plot argument can be set to FALSE in simplifyGO(), so that no plot is generated and only the data frame is returned.
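Put together, a plot-free run could look like the following sketch. The seed is arbitrary, and the input is regenerated here only so that the block is self-contained.

```r
library(simplifyEnrichment)

set.seed(888)                        # arbitrary seed
go_id = random_GO(500)
mat = GO_similarity(go_id)

df = simplifyGO(mat, plot = FALSE)   # cluster only, skip the heatmap
head(df)
```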
If the aim is only to cluster GO terms, the binary_cut() or cluster_terms() functions can be applied directly:
## [1] 1 1 2 3 4 1 2 3 1 3 5 6 1 3 1 3 6 7 3 3 4 8 4 4 1
## [26] 1 3 6 3 1 4 3 3 2 3 4 3 4 4 2 2 4 6 6 9 2 6 2 2 3
## [51] 3 4 2 2 4 6 3 3 4 3 4 3 3 3 10 3 1 11 3 1 6 4 6 3 12
## [76] 1 13 4 14 2 4 11 6 1 3 4 1 4 4 4 15 3 6 3 3 3 4 3 14 6
## [101] 3 4 16 6 1 2 2 2 11 4 3 3 3 17 1 4 1 3 6 3 1 3 1 3 4
## [126] 3 16 4 6 4 3 9 3 3 3 3 1 2 3 4 3 1 3 3 3 18 1 2 3 1
## [151] 3 19 1 3 4 1 1 1 1 4 3 20 15 2 3 1 1 1 1 1 21 4 1 4 6
## [176] 4 4 3 1 4 1 4 11 11 11 1 4 11 1 3 6 2 3 1 22 1 3 6 1 14
## [201] 1 3 4 4 4 2 4 6 3 3 1 3 1 6 3 4 4 11 4 1 1 23 1 24 6
## [226] 4 3 2 1 1 3 1 1 6 1 1 1 4 6 3 4 3 4 16 3 1 1 4 25 4
## [251] 4 4 1 1 26 4 4 4 6 4 1 3 3 3 2 19 4 3 4 27 4 4 2 6 3
## [276] 3 11 3 6 2 16 3 3 2 1 1 6 6 6 3 3 3 1 3 4 3 1 3 1 14
## [301] 15 28 2 20 1 3 1 1 1 4 1 3 3 4 1 19 29 4 4 4 1 4 6 4 30
## [326] 3 3 6 1 1 3 1 4 2 1 3 3 3 3 19 6 14 3 1 1 11 1 1 4 14
## [351] 6 11 3 3 4 3 3 2 1 14 1 1 4 1 2 31 1 1 4 1 3 4 3 1 3
## [376] 32 1 1 3 1 3 6 3 3 3 1 19 6 3 11 3 1 3 3 33 2 1 4 4 1
## [401] 6 1 4 2 15 6 4 2 3 4 4 4 14 3 3 4 4 1 1 3 6 2 3 2 1
## [426] 4 34 1 3 1 1 23 3 6 4 9 1 1 6 4 1 35 36 37 38 1 1 1 4 2
## [451] 1 14 15 4 3 14 3 3 6 2 1 15 16 3 4 3 3 4 3 3 4 1 39 6 4
## [476] 3 3 4 3 4 6 2 2 3 40 4 6 4 41 3 1 3 1 3 1 6 42 43 4 1
Alternatively, cluster_terms() can be applied to the same matrix. binary_cut() and cluster_terms() generate essentially the same clusterings, but the cluster labels might differ.
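Because only the labels may differ, two label vectors are best compared as partitions rather than element by element. The helper below is a hypothetical base R sketch (same_partition() is not part of the package); it could be applied to the outputs of binary_cut(mat) and cluster_terms(mat).

```r
# TRUE if a and b describe the same grouping, regardless of label names
same_partition = function(a, b) {
    stopifnot(length(a) == length(b))
    tab = table(a, b)
    # Same partition: every label in a pairs with exactly one label in b,
    # and every label in b pairs with exactly one label in a.
    all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
}

a  = c(1, 1, 2, 3, 3, 2)
b  = c(2, 2, 3, 1, 1, 3)   # same grouping as a, labels permuted
c2 = c(1, 1, 2, 2, 3, 3)   # a genuinely different grouping

same_partition(a, b)    # TRUE
same_partition(a, c2)   # FALSE
```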
Semantic measures can be used for the similarity of GO terms. However, there are still many ontologies (e.g. MSigDB gene sets) that are only represented as lists of genes, where the similarity between gene sets is mainly measured by gene overlap. simplifyEnrichment provides term_similarity() and related functions (term_similarity_from_enrichResult(), term_similarity_from_KEGG(), term_similarity_from_Reactome(), term_similarity_from_MSigDB() and term_similarity_from_gmt()), which calculate the similarity of terms by gene overlap using the Jaccard coefficient, Dice coefficient, overlap coefficient or kappa coefficient.
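For orientation, the four coefficients can be written out for a single pair of gene sets. This is a base R sketch using the textbook definitions; overlap_similarity() is a hypothetical helper, and the package's own implementation (in particular how the gene universe enters the kappa coefficient) may differ.

```r
# Overlap-based similarities between two gene sets a and b.
# universe is the set of all genes considered (only needed for kappa).
overlap_similarity = function(a, b, universe) {
    a = unique(a)
    b = unique(b)
    n_ab = length(intersect(a, b))
    n_union = length(union(a, b))
    n_all = length(unique(universe))
    # kappa: observed versus chance agreement on gene membership
    p_obs = (n_ab + (n_all - n_union)) / n_all
    p_exp = (length(a) * length(b) +
             (n_all - length(a)) * (n_all - length(b))) / n_all^2
    c(jaccard = n_ab / n_union,
      dice    = 2 * n_ab / (length(a) + length(b)),
      overlap = n_ab / min(length(a), length(b)),
      kappa   = (p_obs - p_exp) / (1 - p_exp))
}

universe = paste0("g", 1:10)
set_a = paste0("g", 1:4)    # g1..g4
set_b = paste0("g", 3:6)    # g3..g6, shares g3 and g4 with set_a
sim = overlap_similarity(set_a, set_b, universe)
sim
```

With two shared genes out of six in the union, the Jaccard coefficient is 1/3, the Dice and overlap coefficients are both 0.5, and kappa is 1/6.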
The similarity can be calculated by providing:
an enrichResult object, which normally comes from the ‘clusterProfiler’, ‘DOSE’, ‘meshes’ or ‘ReactomePA’ package.

Once you have the similarity matrix, you can pass it to the simplifyEnrichment() function. Note, however, that as benchmarked in the manuscript, clustering on gene overlap similarity performs much worse than clustering on semantic similarity.
In the simplifyEnrichment package, there are also functions that compare clustering results from different methods. Here we still use the previously generated variable mat, which is the similarity matrix of the 500 random GO terms. Simply running the compare_clustering_methods() function applies all supported methods (listed by all_clustering_methods()) except mclust, because mclust usually takes a very long time to run. The function generates a figure with three panels:
In the barplots, the three metrics are defined as follows:
## Cluster 500 terms by 'binary_cut'... 43 clusters, used 1.292667 secs.
## Cluster 500 terms by 'kmeans'... 15 clusters, used 5.466666 secs.
## Cluster 500 terms by 'dynamicTreeCut'... 58 clusters, used 0.2248063 secs.
## Cluster 500 terms by 'apcluster'... 39 clusters, used 1.05522 secs.
## Cluster 500 terms by 'hdbscan'... 13 clusters, used 0.2847941 secs.
## Cluster 500 terms by 'fast_greedy'... 29 clusters, used 0.1642258 secs.
## Cluster 500 terms by 'leading_eigen'... 30 clusters, used 0.4157999 secs.
## Cluster 500 terms by 'louvain'... 29 clusters, used 0.1897464 secs.
## Cluster 500 terms by 'walktrap'... 26 clusters, used 0.4785211 secs.
## Cluster 500 terms by 'MCL'... 28 clusters, used 4.439619 secs.
If the plot_type argument is set to heatmap, heatmaps of the similarity matrix under the different clustering methods are drawn. The last panel is a table with the number of clusters.
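Both variants of the comparison can be sketched as follows. The input is regenerated with an arbitrary seed so that the block is self-contained, and the optional packages behind the individual clustering methods are assumed to be installed.

```r
library(simplifyEnrichment)

set.seed(888)                        # arbitrary seed
go_id = random_GO(500)
mat = GO_similarity(go_id)

all_clustering_methods()                                  # names of the supported methods
compare_clustering_methods(mat)                           # default figure with three panels
compare_clustering_methods(mat, plot_type = "heatmap")    # one heatmap per method
```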
## Cluster 500 terms by 'binary_cut'... 43 clusters, used 1.313843 secs.
## Cluster 500 terms by 'kmeans'... 16 clusters, used 5.596574 secs.
## Cluster 500 terms by 'dynamicTreeCut'... 58 clusters, used 0.229274 secs.
## Cluster 500 terms by 'apcluster'... 39 clusters, used 1.707315 secs.
## Cluster 500 terms by 'hdbscan'... 13 clusters, used 0.2676535 secs.
## Cluster 500 terms by 'fast_greedy'... 29 clusters, used 0.1445436 secs.
## Cluster 500 terms by 'leading_eigen'... 30 clusters, used 0.4241514 secs.
## Cluster 500 terms by 'louvain'... 29 clusters, used 0.1744215 secs.
## Cluster 500 terms by 'walktrap'... 26 clusters, used 0.4935129 secs.
## Cluster 500 terms by 'MCL'... 28 clusters, used 3.612622 secs.