In many applications, semantic similarity analysis is integrated with gene set enrichment analysis, especially with GO as the source of gene sets. simona provides functions that import ontologies already integrated with gene annotations. It also provides functions for over-representation analysis (ORA) and for integrating ORA results with semantic similarity analysis.
To add gene annotations for GO, set the name of the “org.db” package for the specific organism, for example “org.Hs.eg.db” for human and “org.Mm.eg.db” for mouse. The full list of supported “org.db” packages can be found at https://bioconductor.org/packages/release/BiocViews.html#___AnnotationData (search “org.”).
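A minimal sketch of this import step, assuming the organism is passed through the `org_db` argument of `create_ontology_DAG_from_GO_db()` and that the human annotation package org.Hs.eg.db is installed. The argument name is an assumption, as it is not spelled out in the text above.

```r
library(simona)

# Build the GO DAG and attach gene annotations from the human org.db package.
# Assumption: the organism is set via the `org_db` argument.
dag = create_ontology_DAG_from_GO_db(org_db = "org.Hs.eg.db")

# Print a summary of the ontology and its annotations
dag
```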
## An ontology_DAG object:
## Source: GO BP / GO.db package 3.18.0
## 27597 terms / 55036 relations
## Root: GO:0008150
## Terms: GO:0000001, GO:0000002, GO:0000003, GO:0000011, ...
## Max depth: 18
## Avg number of parents: 1.99
## Avg number of children: 1.88
## Aspect ratio: 358:1 (based on the longest distance from root)
## 771.78:1 (based on the shortest distance from root)
## Relations: is_a, part_of
## Annotations: 18870 items
## 291, 1890, 4205, 4358, ...
##
## With the following columns in the metadata data frame:
## id, name, definition
As the printed object dag shows, the genes stored in dag are Entrez IDs. So when doing ORA, the input gene list should also use Entrez IDs.
We generate a list of random genes for testing:
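A sketch of how such a test list could be drawn with `random_items()`, the helper used later in this document. The seed value is hypothetical and only makes the draw reproducible; it is not necessarily the seed behind the IDs printed here.

```r
library(simona)
dag = create_ontology_DAG_from_GO_db(org_db = "org.Hs.eg.db")

set.seed(123)                   # hypothetical seed, for reproducibility only
genes = random_items(dag, 500)  # 500 random annotated genes (Entrez IDs)
head(genes)
```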
## [1] "8556" "51301" "2921" "406923" "26122" "693139"
To perform ORA, use the function dag_enrich_on_genes().
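A sketch of the ORA call on a random gene list, following the same pattern as the UniProt Keywords example later in this document. Sorting by `p_adjust` is an assumption about how the printed table was ordered.

```r
library(simona)
dag = create_ontology_DAG_from_GO_db(org_db = "org.Hs.eg.db")
genes = random_items(dag, 500)

# Over-representation analysis of `genes` against the terms in `dag`
tb = dag_enrich_on_genes(dag, genes)

# Rank terms by adjusted p-value, smallest first
tb = tb[order(tb$p_adjust), ]
head(tb)
```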
##                  term                       name n_hits n_gs n_genes n_all
## GO:0007283 GO:0007283            spermatogenesis     31  623     500 18870
## GO:0007286 GO:0007286      spermatid development     14  203     500 18870
## GO:0030317 GO:0030317 flagellated sperm motility     10  117     500 18870
## GO:0048232 GO:0048232     male gamete generation     31  640     500 18870
## GO:0097722 GO:0097722             sperm motility     10  117     500 18870
## GO:0000003 GO:0000003               reproduction     57 1465     500 18870
##            log2_fold_enrichment  z_score      p_value p_adjust depth
## GO:0007283            0.9091305 3.676248 0.0006230113 0.242407     6
## GO:0007286            1.3800415 3.787771 0.0010711972 0.242407     8
## GO:0030317            1.6895859 3.983984 0.0011068813 0.242407     6
## GO:0048232            0.8702907 3.515995 0.0009654651 0.242407     5
## GO:0097722            1.6895859 3.983984 0.0011068813 0.242407     3
## GO:0000003            0.5542276 3.079549 0.0022200374 0.259954     1
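The columns of this table can be related to one another with a short base R calculation. The sketch below assumes that n_hits is the number of input genes annotated to a term, n_gs the number of all genes annotated to the term, n_genes the size of the input list, and n_all the number of annotated background genes. These meanings are inferred from the values, not stated in the output. Under that reading, the first row (GO:0007283) should give a log2 fold enrichment and z-score that agree with the printed 0.9091305 and 3.676248. The hypergeometric p-value shown is the standard ORA test and may differ from what simona computes internally.

```r
# Values copied from the first row of the table (GO:0007283, spermatogenesis)
n_hits  = 31     # input genes annotated to the term
n_gs    = 623    # all genes annotated to the term
n_genes = 500    # size of the input gene list
n_all   = 18870  # all annotated genes (the background)

# Observed versus expected fraction of hits, on the log2 scale
log2_fold_enrichment = log2((n_hits / n_genes) / (n_gs / n_all))

# Mean and variance of the hypergeometric distribution
p = n_gs / n_all
expected = n_genes * p
variance = n_genes * p * (1 - p) * (n_all - n_genes) / (n_all - 1)
z_score = (n_hits - expected) / sqrt(variance)

# One-sided hypergeometric test: P(X >= n_hits)
p_value = phyper(n_hits - 1, n_gs, n_all - n_gs, n_genes, lower.tail = FALSE)

c(log2_fold_enrichment = log2_fold_enrichment, z_score = z_score, p_value = p_value)
```

With about 16.5 hits expected by chance and 31 observed, the term is enriched roughly 1.9-fold, which is what the log2 value of about 0.91 expresses.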
We can take the significant GO terms and look at their semantic similarities.
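# Take the top-ranked terms from the ORA table `tb` (200 is an illustrative choice)
top_go_ids = tb$term[1:200]
mat = term_sim(dag, top_go_ids)
library(ComplexHeatmap)
Heatmap(mat, name = "similarity",
    show_row_names = FALSE, show_column_names = FALSE,
    show_row_dend = FALSE, show_column_dend = FALSE)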
The significant GO terms can also be highlighted on the global circular plot:
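A sketch of the circular plot, assuming `dag_circular_viz()` accepts the terms to highlight as its second argument. The cut-off of 200 top-ranked terms is illustrative.

```r
library(simona)
dag = create_ontology_DAG_from_GO_db(org_db = "org.Hs.eg.db")
genes = random_items(dag, 500)
tb = dag_enrich_on_genes(dag, genes)
tb = tb[order(tb$p_adjust), ]
top_go_ids = tb$term[1:200]   # illustrative cut-off

# Draw the whole DAG in a circular layout and highlight the selected terms
dag_circular_viz(dag, top_go_ids)
```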
One use of the semantic similarity matrix is to cluster GO terms into groups, which makes the results easier to read. Here the semantic similarity matrix can be passed directly to the simplifyEnrichment() function from the simplifyEnrichment package. Since the terms are from GO, word clouds are attached to the heatmap to show the general biological functions in each cluster.
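A sketch of the clustering step, assuming the similarity matrix is built from the top-ranked ORA terms (the cut-off of 200 is illustrative) and that `simplifyEnrichment()` is called with its default settings.

```r
library(simona)
library(simplifyEnrichment)

dag = create_ontology_DAG_from_GO_db(org_db = "org.Hs.eg.db")
genes = random_items(dag, 500)
tb = dag_enrich_on_genes(dag, genes)
tb = tb[order(tb$p_adjust), ]
mat = term_sim(dag, tb$term[1:200])   # illustrative cut-off

# Cluster the terms on their similarity matrix and draw the annotated heatmap
cl = simplifyEnrichment(mat)
head(cl)   # one row per term with its cluster label
```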
In the previous example, when setting the organism, we used the name of the org.db package. The value can also be an OrgDb object directly. This expands the use of the function, since many OrgDb objects for less-studied organisms are available on AnnotationHub.
The following code demonstrates the use of the dolphin organism (Delphinus truncatus). AH112417 is the ID of this dataset. Please refer to the AnnotationHub documentation for the usage of the package.
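A sketch of this workflow, assuming the AnnotationHub package is installed, the record AH112417 is still available, and the OrgDb object is passed through the `org_db` argument (the argument name is an assumption). The first call to `AnnotationHub()` needs network access and caches data locally.

```r
library(simona)
library(AnnotationHub)

# Connect to AnnotationHub
ah = AnnotationHub()

# Retrieve the OrgDb object by its AnnotationHub ID
org_db = ah[["AH112417"]]

# Pass the OrgDb object itself instead of a package name
dag = create_ontology_DAG_from_GO_db(org_db = org_db)
dag
```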
Besides GO, there are also other ontologies that have gene annotations integrated.
UniProt Keywords (https://www.uniprot.org/keywords) is a controlled vocabulary developed by UniProt to describe the biological functions of proteins. It is organised hierarchically and thus takes the form of an ontology. The function ontology_kw() imports the UniProt Keywords ontology with gene annotations from a specific organism. The function internally uses the UniProtKeywords package. All supported organisms can be found in the documentation of UniProtKeywords::load_keyword_genesets().
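A sketch of the import, assuming `ontology_kw()` takes the organism name as its first argument and that "human" is an accepted value.

```r
library(simona)

# Import UniProt Keywords with human gene annotations.
# Assumption: the organism is given as the first argument.
dag = ontology_kw("human")
dag
```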
## An ontology_DAG object:
## Source: UniProt Keywords
## 1202 terms / 1348 relations
## Root: ~~all~~
## Terms: KW-0001, KW-0002, KW-0003, KW-0004, ...
## Max depth: 6
## Avg number of parents: 1.12
## Avg number of children: 1.07
## Aspect ratio: 112:1 (based on the longest distance from root)
## 120:1 (based on the shortest distance from root)
## Annotations: 18050 items
## 2230, 316, 55847, 493856, ...
##
## With the following columns in the metadata data frame:
## id, accession, name, description, category
As dag shows, the gene ID type is Entrez ID. As with GO, we randomly generate a list of genes and perform ORA.
genes = random_items(dag, 500)
tb = dag_enrich_on_genes(dag, genes)
tb = tb[order(tb$p_adjust), ]
top_go_ids = tb$term[1:50]
Obtain the semantic similarity matrix and make plots:
mat = term_sim(dag, top_go_ids)
Heatmap(mat, name = "similarity",
    show_row_names = FALSE, show_column_names = FALSE,
    show_row_dend = FALSE, show_column_dend = FALSE)
We can also use simplifyEnrichment() to cluster the terms in mat, but there is no word cloud around the heatmap.
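A sketch of the clustering call on the UniProt Keywords similarity matrix, assuming default settings for `simplifyEnrichment()` and "human" as the organism. The per-cluster count at the end is an added convenience, not part of the original workflow.

```r
library(simona)
library(simplifyEnrichment)

dag = ontology_kw("human")
genes = random_items(dag, 500)
tb = dag_enrich_on_genes(dag, genes)
tb = tb[order(tb$p_adjust), ]
mat = term_sim(dag, tb$term[1:50])

# Cluster the keywords by semantic similarity
cl = simplifyEnrichment(mat)
head(cl)

# Number of terms assigned to each cluster
table(cl$cluster)
```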
##        id cluster
## 1 KW-0007       1
## 2 KW-0010       2
## 3 KW-0024       3
## 4 KW-0043       4
## 5 KW-0145       5
## 6 KW-0156       2
The following ontologies, as well as their gene annotations, are from the Rat Genome Database (RGD). Although RGD is a database for rat, it also provides gene annotations for other organisms. The specific files used in each function can be found at https://download.rgd.mcw.edu/ontology/.
Note that the following functions may support different sets of organisms. Please refer to the documentation of each function for the list.
Pathway Ontology
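A sketch of the import, assuming the function for this ontology is `ontology_pw()` and that it takes the organism name as its first argument. Both the function name and the value "human" are assumptions to check against the package documentation.

```r
library(simona)

# Import the Pathway Ontology with human gene annotations from RGD.
# Assumption: `ontology_pw()` is the import function for this ontology.
dag = ontology_pw("human")
dag
```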
## An ontology_DAG object:
## Source: pw, 7.82
## 2593 terms / 3182 relations
## Root: ~~all~~
## Terms: PW:0000001, PW:0000002, PW:0000003, PW:0000004, ...
## Max depth: 10
## Avg number of parents: 1.23
## Avg number of children: 1.26
## Aspect ratio: 79.44:1 (based on the longest distance from root)
## 94.75:1 (based on the shortest distance from root)
## Relations: is_a
## Annotations: 5956 items
## CACNA1C, MAP3K3, RASGRP3, MAP3K6, ...
##
## With the following columns in the metadata data frame:
## id, short_id, name, namespace, definition
Note that in the pathway ontology, genes are stored as gene symbols.
To perform enrichment analysis on the pathway ontology:
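A sketch of the enrichment step, assuming `ontology_pw()` is the import function and "human" is a supported organism. Because the annotations are gene symbols, the input list must be gene symbols too; drawing it with `random_items()` guarantees that.

```r
library(simona)

dag = ontology_pw("human")       # assumed import function for this ontology
genes = random_items(dag, 500)   # random gene symbols annotated in `dag`

tb = dag_enrich_on_genes(dag, genes)
tb = tb[order(tb$p_adjust), ]
head(tb)
```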
Chemical Entities of Biological Interest
To perform enrichment analysis on ChEBI:
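A sketch under the assumption that the import function is named `ontology_chebi()` and accepts "human" as organism. Check the package documentation for the exact name and supported organisms.

```r
library(simona)

dag = ontology_chebi("human")    # assumed import function for this ontology
genes = random_items(dag, 500)   # random annotated genes

tb = dag_enrich_on_genes(dag, genes)
tb = tb[order(tb$p_adjust), ]
head(tb)
```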
Disease Ontology
To perform enrichment analysis on the disease ontology:
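A sketch under the assumption that the RGD disease ontology is imported with `ontology_rdo()` and that "human" is a supported organism. Check the package documentation for the exact name and supported organisms.

```r
library(simona)

dag = ontology_rdo("human")      # assumed import function for this ontology
genes = random_items(dag, 500)   # random annotated genes

tb = dag_enrich_on_genes(dag, genes)
tb = tb[order(tb$p_adjust), ]
head(tb)
```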
Vertebrate Trait Ontology
To perform enrichment analysis on the vertebrate trait ontology:
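A sketch under the assumption that the import function is named `ontology_vt()` and accepts "human" as organism. Check the package documentation for the exact name and supported organisms.

```r
library(simona)

dag = ontology_vt("human")       # assumed import function for this ontology
genes = random_items(dag, 500)   # random annotated genes

tb = dag_enrich_on_genes(dag, genes)
tb = tb[order(tb$p_adjust), ]
head(tb)
```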
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.6.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] simplifyEnrichment_1.12.0 ComplexHeatmap_2.18.0
## [3] org.Hs.eg.db_3.18.0 AnnotationDbi_1.64.1
## [5] IRanges_2.36.0 S4Vectors_0.40.2
## [7] Biobase_2.62.0 BiocGenerics_0.48.1
## [9] igraph_1.5.1 simona_1.0.10
## [11] knitr_1.45
##
## loaded via a namespace (and not attached):
## [1] blob_1.2.4 Biostrings_2.70.2 bitops_1.0-7
## [4] fastmap_1.1.1 RCurl_1.98-1.13 UniProtKeywords_0.99.7
## [7] promises_1.2.1 digest_0.6.33 mime_0.12
## [10] lifecycle_1.0.3 cluster_2.1.4 Cairo_1.6-1
## [13] ellipsis_0.3.2 NLP_0.2-1 KEGGREST_1.42.0
## [16] RSQLite_2.3.2 magrittr_2.0.3 compiler_4.3.2
## [19] rlang_1.1.1 sass_0.4.7 tools_4.3.2
## [22] yaml_2.3.7 htmlwidgets_1.6.2 bit_4.0.5
## [25] scatterplot3d_0.3-44 curl_5.1.0 xml2_1.3.5
## [28] RColorBrewer_1.1-3 GOSemSim_2.28.1 tm_0.7-11
## [31] xtable_1.8-4 colorspace_2.1-0 GO.db_3.18.0
## [34] iterators_1.0.14 cli_3.6.1 rmarkdown_2.25
## [37] DiagrammeR_1.0.10 crayon_1.5.2 ragg_1.2.6
## [40] RcppParallel_5.1.7 rstudioapi_0.15.0 httr_1.4.7
## [43] rjson_0.2.21 visNetwork_2.1.2 DBI_1.1.3
## [46] cachem_1.0.8 zlibbioc_1.48.0 parallel_4.3.2
## [49] XVector_0.42.0 proxyC_0.3.4 yulab.utils_0.1.0
## [52] matrixStats_1.0.0 vctrs_0.6.4 Matrix_1.6-1.1
## [55] slam_0.1-50 jsonlite_1.8.7 GetoptLong_1.0.5
## [58] bit64_4.0.5 clue_0.3-65 magick_2.8.1
## [61] systemfonts_1.0.5 foreach_1.5.2 jquerylib_0.1.4
## [64] glue_1.6.2 codetools_0.2-19 Polychrome_1.5.1
## [67] shape_1.4.6 later_1.3.1 GenomeInfoDb_1.38.5
## [70] htmltools_0.5.7 GenomeInfoDbData_1.2.11 circlize_0.4.15
## [73] R6_2.5.1 textshaping_0.3.7 doParallel_1.0.17
## [76] lattice_0.22-5 evaluate_0.23 shiny_1.7.5.1
## [79] highr_0.10 png_0.1-8 memoise_2.0.1
## [82] httpuv_1.6.12 bslib_0.5.1 Rcpp_1.0.11
## [85] xfun_0.41 fs_1.6.3 pkgconfig_2.0.3
## [88] GlobalOptions_0.1.2