The simplifyEnrichment package clusters functional terms into groups by clustering their similarity matrix with a newly proposed method, “binary cut”. Binary cut recursively applies partitioning around medoids (PAM) with two groups to the similarity matrix; in each iteration, a score is calculated to decide whether the group of gene sets that corresponds to the current sub-matrix should be split further. For more details of the method, please refer to the simplifyEnrichment paper.
The major use case of simplifyEnrichment is simplifying GO enrichment results by clustering the semantic similarity matrix of the significant GO terms. To demonstrate the usage, we first generate a list of random GO IDs from the Biological Process (BP) ontology category:
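A minimal sketch of this step, assuming simplifyEnrichment and the org.Hs.eg.db annotation package are installed. The seed value is arbitrary and only makes the random draw reproducible; the count of 500 matches the number of random terms mentioned later in this document.

```r
library(simplifyEnrichment)

set.seed(888)                        # arbitrary seed (an assumption, not a required value)
go_id = random_GO(500, ont = "BP")   # 500 random GO IDs from the BP ontology
head(go_id)
```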
simplifyEnrichment starts with the GO similarity matrix. Users can use their own similarity matrices or use the GO_similarity() function to calculate the semantic similarity matrix. The GO_similarity() function is simply a wrapper around GOSemSim::termSim(). The function accepts a vector of GO IDs. Note that the GO terms should all belong to the same ontology (i.e., BP, CC or MF).
By default, GO_similarity() uses the Rel method in GOSemSim::termSim(). Other methods to calculate GO similarities can be set with the measure argument, e.g.:
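As a hedged illustration, the call below switches the measure to "Wang", one of the methods documented by GOSemSim. The six GO IDs are BP terms taken from the enrichment output shown later in this document; the variable names go_id_demo and mat_wang are made up for this sketch, and org.Hs.eg.db is assumed to be installed.

```r
library(simplifyEnrichment)

# Six BP terms copied from the enrichment table shown later in this document
go_id_demo = c("GO:0006974", "GO:0006281", "GO:1903047",
               "GO:0000278", "GO:0051276", "GO:0006260")

# Use the Wang measure instead of the default Rel
mat_wang = GO_similarity(go_id_demo, ont = "BP", measure = "Wang")
dim(mat_wang)
```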
With the similarity matrix mat, users can directly apply the simplifyGO() function to perform the clustering and visualize the results.
On the right side of the heatmap are word cloud annotations, which summarize the functions of every GO cluster with keywords. Additionally, keyword enrichment is performed against the background GO vocabulary, and the font size of each keyword corresponds to its significance.
Note that there is no word cloud for the cluster that is merged from small clusters (size < 5).
The returned variable df is a data frame with GO IDs and the cluster labels:
## id cluster
## 1 GO:0086066 1
## 2 GO:0090461 2
## 3 GO:0032912 3
## 4 GO:0090220 4
## 5 GO:0032495 5
## 6 GO:0070585 4
The size of GO clusters can be retrieved by:
##
## 15 16 17 18 19 8 9 14 10 12 13 6 11 2 7 4 5 1 3
## 1 1 1 1 1 2 2 3 5 5 5 8 8 13 36 69 84 120 135
Or split the data frame by the cluster labels:
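Both operations need only base R. The sketch below uses a hand-made data frame, df_toy, that copies the first six rows of df shown above, so that it runs without the package; with the real df the same two calls apply.

```r
# Toy data frame mirroring the first six rows of `df` shown above
df_toy = data.frame(
    id = c("GO:0086066", "GO:0090461", "GO:0032912",
           "GO:0090220", "GO:0032495", "GO:0070585"),
    cluster = c(1, 2, 3, 4, 5, 4)
)

# Cluster sizes, smallest first
sizes = sort(table(df_toy$cluster))

# One data frame per cluster label
by_cluster = split(df_toy, df_toy$cluster)
by_cluster[["4"]]
```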
The plot argument can be set to FALSE in simplifyGO(), so that no plot is generated and only the data frame is returned.
If the aim is only to cluster GO terms, the binary_cut() or cluster_terms() functions can be applied directly:
## [1] 1 2 3 4 5 4 4 4 1 3 3 5 1 5 1 4 1 4 5 4 5 1 4 3 6
## [26] 3 7 4 3 1 1 3 8 7 3 3 5 1 4 4 5 2 4 9 5 1 1 7 3 7
## [51] 5 3 10 7 1 7 3 10 3 3 7 1 5 5 3 3 1 3 1 3 3 4 11 3 1
## [76] 3 4 7 3 3 1 5 4 6 3 3 1 1 4 5 7 7 3 4 5 3 1 1 4 6
## [101] 4 3 4 4 1 4 3 5 3 7 3 1 3 5 3 1 1 1 9 4 4 12 5 1 1
## [126] 1 1 5 4 1 4 1 3 5 5 5 1 5 3 5 5 5 1 3 5 5 13 1 2 4
## [151] 12 3 7 1 3 5 8 1 4 5 1 5 1 1 5 3 3 4 1 1 3 3 3 1 4
## [176] 4 1 3 4 3 3 5 5 4 4 3 1 3 5 3 5 3 3 3 1 1 1 5 4 3
## [201] 2 5 4 4 2 3 1 1 3 3 3 2 3 3 3 4 5 3 4 3 4 6 6 4 7
## [226] 4 5 5 1 1 3 1 14 12 7 7 5 5 3 3 7 3 1 4 1 1 5 14 3 5
## [251] 5 1 15 1 3 11 4 1 5 13 3 1 7 3 1 5 1 3 5 5 3 6 1 5 3
## [276] 11 1 2 7 4 5 13 1 1 10 3 5 3 3 3 7 4 5 3 3 4 1 1 4 7
## [301] 3 2 4 5 3 3 5 3 1 3 2 5 2 3 3 1 1 5 5 4 11 1 1 5 3
## [326] 3 1 4 5 1 1 1 1 7 3 2 3 2 5 3 5 1 13 3 3 4 7 1 4 6
## [351] 4 7 1 3 1 4 7 3 5 3 1 2 11 4 3 3 1 1 1 5 1 1 1 1 3
## [376] 5 4 7 3 10 1 3 1 7 3 1 3 11 3 16 4 6 3 1 7 7 14 1 5 4
## [401] 5 3 1 3 3 5 3 3 7 4 5 1 7 5 7 3 1 12 3 7 10 1 1 4 3
## [426] 1 3 13 1 3 3 1 1 5 3 3 1 1 1 1 17 5 1 3 7 3 1 5 1 5
## [451] 7 3 4 5 4 4 1 1 3 5 7 11 4 4 5 3 4 4 1 1 18 5 12 5 3
## [476] 3 3 3 5 5 3 3 3 4 1 11 5 1 3 3 1 5 3 1 3 19 7 5 3 1
or
binary_cut() and cluster_terms() basically generate the same clusterings, but the labels of clusters might differ.
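The two calls can be compared side by side. The following is a sketch under stated assumptions: it uses a smaller random set than the text (100 terms, to run faster), an arbitrary seed, and made-up variable names cl_bc and cl_ct; simplifyEnrichment and org.Hs.eg.db are assumed to be installed.

```r
library(simplifyEnrichment)

set.seed(123)                            # arbitrary seed
go_id = random_GO(100, ont = "BP")       # smaller than the 500 terms used in the text
mat = GO_similarity(go_id, ont = "BP")

cl_bc = binary_cut(mat)
cl_ct = cluster_terms(mat, method = "binary_cut")

# Cross-tabulate the two labelings: identical partitions give a table with
# one non-zero cell per row and per column, even if the labels differ
table(binary_cut = cl_bc, cluster_terms = cl_ct)
```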
Semantic measures can be used for the similarity of GO terms. However, there are still many ontologies (e.g. MSigDB gene sets) that are only represented as lists of genes, where the similarity between gene sets is mainly measured by gene overlap. simplifyEnrichment provides the term_similarity() function and related functions (term_similarity_from_enrichResult(), term_similarity_from_KEGG(), term_similarity_from_Reactome(), term_similarity_from_MSigDB() and term_similarity_from_gmt()), which calculate the similarity of terms by gene overlap, using the Jaccard coefficient, Dice coefficient, overlap coefficient or kappa coefficient.
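To make the four coefficients concrete, here is a base R sketch using their textbook definitions on two toy gene sets. The gene names and the universe are hypothetical, and the package's own implementation may differ in details, for example in how the gene universe for kappa is chosen.

```r
# Two toy gene sets and a toy gene universe (all hypothetical)
set_a = c("g1", "g2", "g3", "g4")
set_b = c("g3", "g4", "g5", "g6")
universe = paste0("g", 1:10)

n_both = length(intersect(set_a, set_b))
n_union = length(union(set_a, set_b))
n_a = length(set_a)
n_b = length(set_b)
n_all = length(universe)

jaccard = n_both / n_union
dice = 2 * n_both / (n_a + n_b)
overlap = n_both / min(n_a, n_b)

# Cohen's kappa on set membership across the universe:
# observed agreement versus agreement expected by chance
p_obs = (n_both + (n_all - n_union)) / n_all
p_exp = (n_a * n_b + (n_all - n_a) * (n_all - n_b)) / n_all^2
kappa_coef = (p_obs - p_exp) / (1 - p_exp)

c(jaccard = jaccard, dice = dice, overlap = overlap, kappa = kappa_coef)
```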
The similarity can be calculated by providing:
An enrichResult object, which is normally from the ‘clusterProfiler’, ‘DOSE’, ‘meshes’ or ‘ReactomePA’ package.
Once you have the similarity matrix, you can send it to the simplifyEnrichment() function. Note, however, that as we benchmarked in the manuscript, clustering on gene overlap similarity performs much worse than clustering on semantic similarity.
The simplifyEnrichment package also provides functions that compare clustering results from different methods. Here we still use the previously generated variable mat, which is the similarity matrix from the 500 random GO terms. Simply running the compare_clustering_methods() function performs all supported methods (in all_clustering_methods()) excluding mclust, because mclust usually takes a very long time to run. The function generates a figure with three panels:
In the barplots, the three metrics are defined as follows:
If the plot_type argument is set to heatmap, heatmaps of the similarity matrix under the different clustering methods are drawn. The last panel is a table with the number of clusters.
Please note that the clustering methods might involve randomness, which means that different runs of compare_clustering_methods() may generate slightly different clusterings. Thus, if users want to compare the plots from compare_clustering_methods(mat) and compare_clustering_methods(mat, plot_type = "heatmap"), they should set the same random seed before executing the function.
set.seed(123)
compare_clustering_methods(mat)
set.seed(123)
compare_clustering_methods(mat, plot_type = "heatmap")
compare_clustering_methods() is simply a wrapper around the cmp_make_clusters() and cmp_make_plot() functions, where the former performs clustering with different methods and the latter visualizes the results. To compare different plots, users can also use the following code without specifying the random seed.
clt = cmp_make_clusters(mat) # just a list of cluster labels
cmp_make_plot(mat, clt)
cmp_make_plot(mat, clt, plot_type = "heatmap")
New clustering methods can be added with register_clustering_methods(), removed with remove_clustering_methods(), and reset to the default methods with reset_clustering_methods(). All supported methods can be retrieved with all_clustering_methods(). compare_clustering_methods() runs all the clustering methods in all_clustering_methods().
New clustering methods should be user-defined functions sent to register_clustering_methods() as named arguments, e.g.:
register_clustering_methods(
method1 = function(mat, ...) ...,
method2 = function(mat, ...) ...,
...
)
The functions should accept at least one argument, which is the input matrix (mat in the above example). The second, optional argument should always be ... so that parameters for the clustering function can be passed through the control argument of cluster_terms() or simplifyGO(). If users forget to add ..., it is added internally.
Please note that the user-defined function should automatically identify the optimal number of clusters. The function should return a vector of cluster labels, which is internally converted to numeric labels.
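A minimal sketch of a function that meets these requirements, written in base R. The name hclust_cut, the cut height h and the toy matrix are all hypothetical; cutting a dendrogram at a fixed height is just one simple way to let the data decide the number of clusters, not a method taken from the package.

```r
# Hypothetical user-defined method: hierarchical clustering on 1 - similarity,
# cut at a fixed height so the number of clusters is not set by hand
hclust_cut = function(mat, h = 0.5, ...) {
    hc = hclust(as.dist(1 - mat))
    unname(cutree(hc, h = h))
}

# Toy similarity matrix with two obvious groups (values are made up)
term_names = paste0("term", 1:4)
mat_toy = matrix(0.1, nrow = 4, ncol = 4,
                 dimnames = list(term_names, term_names))
mat_toy[1:2, 1:2] = 0.9
mat_toy[3:4, 3:4] = 0.8
diag(mat_toy) = 1

labels = hclust_cut(mat_toy)
labels   # terms 1-2 share one label, terms 3-4 share another
```

Following the pattern shown above, such a function could then be registered with register_clustering_methods(hclust_cut = hclust_cut).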
The following examples are the ones we used for the benchmarking in the manuscript:
It is very common that users have multiple lists of GO enrichment results (e.g. from multiple groups of genes) and want to compare the significant terms between the lists, e.g. to see which biological functions are more specific to a certain list. The function simplifyGOFromMultipleLists() in the package helps with this type of analysis.
The input data for simplifyGOFromMultipleLists() (the argument lt) can be in one of three formats:
For a list of data frames, the column of GO IDs can be specified with the go_id_column argument and the column of adjusted p-values with the padj_column argument. If the two columns are not specified, they are identified automatically. The GO ID column is found by checking whether a column contains only GO IDs. The adjusted p-value column is found by checking the column names of the data frame for one that looks like a column of adjusted p-values. These two columns are used to construct a numeric vector with GO IDs as names.
If the GO enrichment results come directly from an upstream analysis, e.g. the clusterProfiler package or other similar packages, the results are most probably represented as a list of data frames; thus, we first demonstrate the usage on a list of data frames.
The function functional_enrichment() in the cola package applies functional enrichment to different groups of signature genes from consensus clustering. The function internally uses clusterProfiler and returns a list of data frames:
# perform functional enrichment on the signature genes from cola analysis
library(cola)
data(golub_cola)
res = golub_cola["ATC:skmeans"]
library(hu6800.db)
x = hu6800ENTREZID
mapped_probes = mappedkeys(x)
id_mapping = unlist(as.list(x[mapped_probes]))
lt = functional_enrichment(res, k = 3, id_mapping = id_mapping)
## - 2058/4116 significant genes are taken from 3-group comparisons
## - on k-means group 1/4, 531 genes
## - 478/531 (90%) genes left after id mapping
## - gene set enrichment, GO:BP
## - on k-means group 2/4, 811 genes
## - 640/811 (78.9%) genes left after id mapping
## - gene set enrichment, GO:BP
## - on k-means group 3/4, 315 genes
## - 276/315 (87.6%) genes left after id mapping
## - gene set enrichment, GO:BP
## - on k-means group 4/4, 401 genes
## - 374/401 (93.3%) genes left after id mapping
## - gene set enrichment, GO:BP
## [1] "BP_km1" "BP_km2" "BP_km3" "BP_km4"
## ID Description GeneRatio BgRatio RichFactor
## GO:0006974 GO:0006974 DNA damage response 79/471 893/18888 0.08846585
## GO:0006281 GO:0006281 DNA repair 63/471 609/18888 0.10344828
## GO:1903047 GO:1903047 mitotic cell cycle process 68/471 761/18888 0.08935611
## GO:0000278 GO:0000278 mitotic cell cycle 73/471 911/18888 0.08013172
## GO:0051276 GO:0051276 chromosome organization 58/471 635/18888 0.09133858
## GO:0006260 GO:0006260 DNA replication 37/471 278/18888 0.13309353
## FoldEnrichment zScore
## GO:0006974 3.547649 12.47303
## GO:0006281 4.148474 12.63034
## GO:1903047 3.583351 11.63309
## GO:0000278 3.213435 10.95090
## GO:0051276 3.662852 10.91564
## GO:0006260 5.337305 11.65069
By default, simplifyGOFromMultipleLists() automatically identifies the columns that contain GO IDs and adjusted p-values, so here we directly send lt to simplifyGOFromMultipleLists(). We additionally set padj_cutoff to 0.001, because under the default cutoff of 0.01 there are too many GO IDs; the stricter cutoff saves running time.
Next we demonstrate two other data types for simplifyGOFromMultipleLists(). Both usages are straightforward. The first is a list of numeric vectors:
lt2 = lapply(lt, function(x) structure(x$p.adjust, names = x$ID))
simplifyGOFromMultipleLists(lt2, padj_cutoff = 0.001)
And the second is a list of character vectors of GO IDs:
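A minimal sketch of how such a list could be derived, using a hand-made list of data frames, lt_toy, in place of lt so that it runs on its own. The GO IDs are copied from the output above; the p.adjust values are invented for illustration, and the names lt_toy and lt3 are assumptions.

```r
# Hypothetical stand-in for `lt`: two small enrichment tables.
# The p.adjust values are made up for illustration only.
lt_toy = list(
    BP_km1 = data.frame(ID = c("GO:0006974", "GO:0006281", "GO:0006260"),
                        p.adjust = c(1e-5, 2e-4, 0.03),
                        stringsAsFactors = FALSE),
    BP_km2 = data.frame(ID = c("GO:0006281", "GO:0000278"),
                        p.adjust = c(5e-4, 0.2),
                        stringsAsFactors = FALSE)
)

padj_cutoff = 0.001

# Keep only the IDs of the terms that pass the cutoff in each list
lt3 = lapply(lt_toy, function(x) x$ID[x$p.adjust < padj_cutoff])
lt3
```

With the real lt, a list built this way would then be passed on as simplifyGOFromMultipleLists(lt3).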
The process of this analysis is as follows. Let’s assume there are \(n\) GO lists. We first construct a global matrix where columns correspond to the \(n\) GO lists and rows correspond to the “union” of all GO IDs in the \(n\) lists. The value for the ith GO ID in the jth list is taken from the corresponding numeric vector in lt. If the jth vector in lt does not contain the ith GO ID, the value defined by the default argument is used (e.g. in most cases the numeric values are adjusted p-values, so default is set to 1). Let’s call this matrix \(M_0\).
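The construction of \(M_0\) can be sketched in base R. This is an illustration of the description above, not the package's internal code; the list lt_toy and its adjusted p-values are invented.

```r
# Toy input: two named numeric vectors of (made-up) adjusted p-values
lt_toy = list(
    list1 = c("GO:0006974" = 1e-4, "GO:0006281" = 0.02),
    list2 = c("GO:0006281" = 5e-4, "GO:0006260" = 0.5)
)
default = 1   # value used when a GO ID is absent from a list

# Rows: union of all GO IDs; columns: the n lists
all_ids = unique(unlist(lapply(lt_toy, names)))
m0 = vapply(lt_toy, function(x) {
    v = unname(x[all_ids])   # NA where the ID is missing from this list
    v[is.na(v)] = default
    v
}, numeric(length(all_ids)))
rownames(m0) = all_ids
m0
```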
The next step is to filter \(M_0\) so that we only keep the subset of GO IDs of interest. We define a suitable function via the filter argument to remove GO IDs that are not important for the analysis. The filter function is applied to every row of \(M_0\) and needs to return a logical value that decides whether to keep or remove the current GO ID. For example, if the values in lt are adjusted p-values, the filter function can be set to function(x) any(x < padj_cutoff) so that a GO ID is kept as long as it is significant in at least one list. After the filtering, let’s call the filtered matrix \(M_1\).
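The row-wise filtering can be sketched in base R as follows. The toy matrix stands in for \(M_0\) and its values are invented; this illustrates the rule described above rather than the package's internal code.

```r
# Toy M_0: three GO IDs (rows) by two lists (columns), made-up adjusted p-values
m0 = matrix(c(1e-4, 0.02, 1,
              1,    5e-4, 0.5),
            ncol = 2,
            dimnames = list(c("GO:0006974", "GO:0006281", "GO:0006260"),
                            c("list1", "list2")))

padj_cutoff = 0.001
filter = function(x) any(x < padj_cutoff)

# Apply the filter to every row and keep the rows where it returns TRUE
keep = apply(m0, 1, filter)
m1 = m0[keep, , drop = FALSE]
rownames(m1)   # GO:0006260 is dropped: it is not significant in any list
```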
GO IDs in \(M_1\) (row names of \(M_1\)) are used for clustering. A heatmap of \(M_1\) is attached to the left of the GO similarity heatmap so that the group-specific (or list-specific) patterns can be easily observed and related to GO functions.
The heatmap_param argument controls several parameters of the heatmap of \(M_1\):
transform: A self-defined function to transform the data for heatmap visualization. The most typical case is to transform adjusted p-values by -log10(x).
breaks: Break values for color interpolation.
col: The corresponding values for breaks.
labels: The corresponding labels for the legend.
name: Legend title.
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS Ventura 13.6.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## Random number generation:
## RNG: L'Ecuyer-CMRG
## Normal: Inversion
## Sample: Rejection
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats4 grid stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] hu6800.db_3.13.0 org.Hs.eg.db_3.19.1
## [3] AnnotationDbi_1.66.0 IRanges_2.38.1
## [5] S4Vectors_0.42.1 Biobase_2.64.0
## [7] cola_2.10.0 simplifyEnrichment_1.14.1
## [9] BiocGenerics_0.50.0 knitr_1.48
##
## loaded via a namespace (and not attached):
## [1] splines_4.4.1 ggplotify_0.1.2 tibble_3.2.1
## [4] R.oo_1.26.0 polyclip_1.10-7 XML_3.99-0.17
## [7] lifecycle_1.0.4 httr2_1.0.4 doParallel_1.0.17
## [10] NLP_0.3-0 lattice_0.22-6 prabclus_2.3-4
## [13] MASS_7.3-61 magrittr_2.0.3 sass_0.4.9
## [16] rmarkdown_2.28 jquerylib_0.1.4 yaml_2.3.10
## [19] doRNG_1.8.6 flexmix_2.3-19 cowplot_1.1.3
## [22] DBI_1.2.3 RColorBrewer_1.1-3 eulerr_7.0.2
## [25] zlibbioc_1.50.0 expm_1.0-0 purrr_1.0.2
## [28] R.utils_2.12.3 ggraph_2.2.1 yulab.utils_0.1.7
## [31] nnet_7.3-19 tweenr_2.0.3 rappdirs_0.3.3
## [34] circlize_0.4.16 GenomeInfoDbData_1.2.12 enrichplot_1.24.4
## [37] tm_0.7-14 ggrepel_0.9.6 irlba_2.3.5.1
## [40] tidytree_0.4.6 genefilter_1.86.0 annotate_1.82.0
## [43] brew_1.0-10 commonmark_1.9.1 codetools_0.2-20
## [46] DOSE_3.30.5 xml2_1.3.6 ggforce_0.4.2
## [49] tidyselect_1.2.1 shape_1.4.6.1 aplot_0.2.3
## [52] UCSC.utils_1.0.0 farver_2.1.2 viridis_0.6.5
## [55] matrixStats_1.4.1 dynamicTreeCut_1.63-1 jsonlite_1.8.9
## [58] GetoptLong_1.0.5 tidygraph_1.3.1 survival_3.7-0
## [61] iterators_1.0.14 foreach_1.5.2 dbscan_1.2-0
## [64] tools_4.4.1 treeio_1.28.0 Rcpp_1.0.13
## [67] glue_1.7.0 gridExtra_2.3 xfun_0.47
## [70] qvalue_2.36.0 MatrixGenerics_1.16.0 GenomeInfoDb_1.40.1
## [73] dplyr_1.1.4 withr_3.0.1 fastmap_1.2.0
## [76] fansi_1.0.6 digest_0.6.37 R6_2.5.1
## [79] gridGraphics_0.5-1 microbenchmark_1.5.0 colorspace_2.1-1
## [82] GO.db_3.19.1 Cairo_1.6-2 markdown_1.13
## [85] RSQLite_2.3.7 diptest_0.77-1 R.methodsS3_1.8.2
## [88] utf8_1.2.4 tidyr_1.3.1 generics_0.1.3
## [91] data.table_1.16.0 robustbase_0.99-4 class_7.3-22
## [94] graphlayouts_1.2.0 httr_1.4.7 scatterpie_0.2.4
## [97] MCL_1.0 pkgconfig_2.0.3 gtable_0.3.5
## [100] modeltools_0.2-23 blob_1.2.4 ComplexHeatmap_2.20.0
## [103] impute_1.78.0 XVector_0.44.0 shadowtext_0.1.4
## [106] clusterProfiler_4.12.6 htmltools_0.5.8.1 fgsea_1.30.0
## [109] clue_0.3-65 scales_1.3.0 png_0.1-8
## [112] ggfun_0.1.6 reshape2_1.4.4 rjson_0.2.23
## [115] nlme_3.1-166 cachem_1.1.0 GlobalOptions_0.1.2
## [118] stringr_1.5.1 parallel_4.4.1 pillar_1.9.0
## [121] proxyC_0.4.1 vctrs_0.6.5 slam_0.1-53
## [124] xtable_1.8-4 cluster_2.1.6 evaluate_1.0.0
## [127] magick_2.8.5 cli_3.6.3 compiler_4.4.1
## [130] rlang_1.1.4 crayon_1.5.3 rngtools_1.5.2
## [133] labeling_0.4.3 mclust_6.1.1 skmeans_0.2-17
## [136] plyr_1.8.9 fs_1.6.4 stringi_1.8.4
## [139] viridisLite_0.4.2 BiocParallel_1.38.0 munsell_0.5.1
## [142] Biostrings_2.72.1 lazyeval_0.2.2 GOSemSim_2.30.2
## [145] Matrix_1.7-0 patchwork_1.3.0 bit64_4.5.2
## [148] ggplot2_3.5.1 KEGGREST_1.44.1 fpc_2.2-13
## [151] highr_0.11 apcluster_1.4.13 kernlab_0.9-33
## [154] gridtext_0.1.5 igraph_2.0.3 memoise_2.0.1
## [157] bslib_0.8.0 ggtree_3.12.0 fastmatch_1.1-4
## [160] DEoptimR_1.1-3 bit_4.5.0 gson_0.1.0
## [163] ape_5.8