There are several formats for ontology data. The most compact and
readable one is the .obo format, which was initially developed by the
GO consortium. Many ontologies in .obo format are available from the
OBO Foundry or BioPortal. A description of the .obo format is available
at https://owlcollab.github.io/oboformat/doc/GO.format.obo-1_4.html.
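
To give a feel for the format, the following sketch writes a tiny hand-made .obo file and reads its "tag: value" lines with base R. The term IDs (EX:0000001, EX:0000002) and names are hypothetical, and the parsing is only an illustration of the file layout; it is not how simona reads the file.

```r
# A tiny hand-written .obo file; IDs and names are made up for illustration.
obo_text = c(
    "format-version: 1.4",
    "ontology: example",
    "",
    "[Term]",
    "id: EX:0000001",
    "name: root term",
    "",
    "[Term]",
    "id: EX:0000002",
    "name: child term",
    "is_a: EX:0000001 ! root term"
)
obo_file = tempfile(fileext = ".obo")
writeLines(obo_text, obo_file)

# Each stanza starts with a header such as [Term]; the lines after it are
# "tag: value" pairs. Stanza 0 is the file header.
lines = readLines(obo_file)
stanza = cumsum(grepl("^\\[", lines))
is_tag = grepl("^[^:]+: ", lines)

tags = data.frame(
    stanza = stanza[is_tag],
    tag    = sub(": .*$", "", lines[is_tag]),
    value  = sub("^[^:]+: ", "", lines[is_tag])
)
# Drop the trailing "! comment" part of a value
tags$value = sub(" ! .*$", "", tags$value)
tags
```

The is_a line in the second stanza is what links the child term to its parent.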
In the simona package, the function import_obo() imports an .obo file
into an ontology_DAG object. The input is a path on the local computer
or a URL. In the following example, we use the Plant Ontology.
The link to po.obo can be found on the Plant Ontology web page.
You can download the file or provide its URL directly.
library(simona)
dag1 = import_obo("https://raw.githubusercontent.com/Planteome/plant-ontology/master/po.obo")
dag1
## An ontology_DAG object:
## Source: po, releases/2023-07-13
## 1656 terms / 1776 relations
## Root: ~~all~~
## Terms: PO:0000001, PO:0000002, PO:0000003, PO:0000004, ...
## Max depth: 11
## Avg number of parents: 1.07
## Avg number of children: 1.06
## Aspect ratio: 39:1 (based on the longest distance from root)
## 38.2:1 (based on the shortest distance from root)
## Relations: is_a
##
## With the following columns in the metadata data frame:
## id, short_id, name, namespace, definition
There are also several meta columns attached to the object, such as the name and the long definition of terms in the ontology.
head(mcols(dag1))
## id short_id name namespace
## PO:0000001 PO:0000001 PO:0000001 plant embryo proper plant_anatomy
## PO:0000002 PO:0000002 PO:0000002 anther wall plant_anatomy
## PO:0000003 PO:0000003 PO:0000003 whole plant plant_anatomy
## PO:0000004 PO:0000004 PO:0000004 in vitro plant structure plant_anatomy
## PO:0000005 PO:0000005 PO:0000005 cultured plant cell plant_anatomy
## PO:0000006 PO:0000006 PO:0000006 plant protoplast plant_anatomy
## definition
## PO:0000001 An embryonic plant structure (PO:0025099) that is the body of a developing plant embryo (PO:0009009) attached to the maternal tissue in an plant ovule (PO:0020003) by a suspensor (PO:0020108).
## PO:0000002 A microsporangium wall (PO:0025307) that is part of an anther (PO:0009066).
## PO:0000003 A plant structure (PO:0005679) which is a whole organism.
## PO:0000004 A plant structure (PO:0009011) that is grown or maintained in vitro.
## PO:0000005 A plant cell (PO:0009002) that is grown or maintained in vitro.
## PO:0000006 A cultured plant cell from which the entire plant cell wall has been removed.
Note that the rows of mcols(dag1) correspond to the terms in dag_all_terms(dag1).
The is_a relation between classes is always saved in the DAG object
(it is specified in the is_a tag in the .obo file). Additional relation
types can also be selected (they are specified in the relationship tag).
By default, only the relation type part_of is used. You can check the
other values associated with the relationship tag and the [Typedef]
section in the .obo file to select proper additional relation types.
Just make sure that the selected relation types are transitive and are
not inverse relations (e.g. you cannot select has_part, which is the
inverse of part_of).
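
One way to see which relation types an .obo file offers is to scan its relationship tags and [Typedef] stanzas. The sketch below does this with base R on a small hand-made example; the term IDs and the helper get_tag() are hypothetical, and for a real file you would replace obo_lines with the result of readLines() on the file.

```r
# Hand-made example content; term IDs are made up for illustration.
obo_lines = c(
    "[Term]",
    "id: EX:0000001",
    "",
    "[Term]",
    "id: EX:0000002",
    "is_a: EX:0000001",
    "relationship: part_of EX:0000001",
    "",
    "[Term]",
    "id: EX:0000003",
    "relationship: has_part EX:0000002",
    "relationship: part_of EX:0000001",
    "",
    "[Typedef]",
    "id: part_of",
    "name: part of",
    "is_transitive: true",
    "",
    "[Typedef]",
    "id: has_part",
    "name: has part",
    "is_transitive: true",
    "inverse_of: part_of"
)

# How often each relation type is used in the "relationship" tags
rel = grep("^relationship: ", obo_lines, value = TRUE)
rel_type = sub("^relationship: ([^ ]+) .*$", "\\1", rel)
table(rel_type)

# Split the [Typedef] stanzas and pull out the tags of interest
stanza_id = cumsum(grepl("^\\[", obo_lines))
is_typedef = stanza_id %in% stanza_id[obo_lines == "[Typedef]"]
typedef_stanzas = split(obo_lines[is_typedef], stanza_id[is_typedef])

# Hypothetical helper: value of the first occurrence of a tag, or NA
get_tag = function(x, tag) {
    v = grep(paste0("^", tag, ": "), x, value = TRUE)
    if (length(v) == 0) NA_character_ else sub("^[^:]+: ", "", v[1])
}

typedefs = data.frame(
    id            = sapply(typedef_stanzas, get_tag, tag = "id"),
    is_transitive = sapply(typedef_stanzas, get_tag, tag = "is_transitive"),
    inverse_of    = sapply(typedef_stanzas, get_tag, tag = "inverse_of"),
    row.names = NULL
)
typedefs
```

In this toy file, has_part is transitive but is declared as the inverse of part_of, so only part_of would be a suitable additional relation type.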
Relation types can also have a DAG structure. In import_obo(), if a
parent relation type is selected, all its offspring types are
automatically selected. For example, in GO, besides is_a and part_of,
there are also regulates, positively_regulates and
negatively_regulates, where the latter two are child relations of
regulates. So if regulates is selected as an additional relation type,
the other two are automatically selected as well.
The DAG of relation types is automatically recognized from the ontology file and saved.
import_obo("file_for_go.obo", relation_type = c("part_of", "regulates"))
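
The automatic selection of child relation types can be pictured as a traversal of the relation DAG. The following base R sketch only illustrates that idea and is not simona's implementation; the rel_parent table and the helper expand_relation_types() are hypothetical, and for simplicity each relation type has at most one parent here.

```r
# Child -> parent links between relation types, as in the GO example above
rel_parent = c(
    positively_regulates = "regulates",
    negatively_regulates = "regulates"
)

# Hypothetical helper: repeatedly add the children of the selected relation
# types until no new type is found
expand_relation_types = function(selected, rel_parent) {
    repeat {
        children = names(rel_parent)[rel_parent %in% selected]
        new_types = setdiff(children, selected)
        if (length(new_types) == 0) break
        selected = c(selected, new_types)
    }
    selected
}

expanded = expand_relation_types(c("part_of", "regulates"), rel_parent)
expanded
```

Selecting part_of and regulates therefore ends up covering positively_regulates and negatively_regulates as well, while part_of, which has no children in this table, stays as it is.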
Finally, all spaces in relation_type are converted to underscores, so
specifying "part of" is the same as specifying "part_of".
For ontologies in other formats, simona uses the external tool ROBOT to
convert them to the .obo format and then internally uses import_obo()
to import them. ROBOT is a mature tool for converting between different
ontology formats. The file robot.jar is needed and can be downloaded
from https://github.com/ontodev/robot/releases. Since ROBOT is written
in Java, Java must also be available on your machine.
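
Before calling import_ontology(), it can help to confirm that both prerequisites are in place. The sketch below checks for Java and robot.jar with base R and builds, without running it, the kind of command line that ROBOT's convert command accepts. The jar location and the file names po.owl and po.obo are assumptions for illustration, and the exact command that simona issues internally may differ.

```r
# Assumed location of the downloaded file; adjust to your own path.
robot_jar = path.expand("~/Downloads/robot.jar")

# Sys.which() returns "" when the executable is not on the PATH
has_java = nzchar(Sys.which("java"))
has_robot = file.exists(robot_jar)

if (!has_java) message("Java was not found on the PATH.")
if (!has_robot) message("robot.jar was not found at ", robot_jar)

# A ROBOT 'convert' call from .owl to .obo, shown only as a string
# (it is not executed here)
cmd = paste("java -jar", shQuote(robot_jar),
    "convert --input po.owl --format obo --output po.obo")
cmd
```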
The file po.owl can also be found on the Plant Ontology web page.
dag2 = import_ontology("https://raw.githubusercontent.com/Planteome/plant-ontology/master/po.owl",
robot_jar = "~/Downloads/robot.jar")
## An ontology_DAG object:
## Source: po, releases/2021-08-13
## 1654 terms / 2510 relations
## Root: _all_
## Terms: PO:0000001, PO:0000002, PO:0000003, PO:0000004, ...
## Max depth: 13
## Aspect ratio: 24.85:1 (based on the longest distance to root)
## 39.6:1 (based on the shortest distance to root)
## Relations: is_a, part_of
##
## With the following columns in the metadata data frame:
## id, short_id, name, namespace, definition
More conveniently, the path of robot.jar
can be set as a
global option:
simona_opt$robot_jar = "~/Downloads/robot.jar"
import_ontology("https://raw.githubusercontent.com/Planteome/plant-ontology/master/po.owl")
ROBOT supports the following ontology formats and they are automatically identified according to the file contents.
json: OBO Graphs JSON
obo: OBO Format
ofn: OWL Functional
omn: Manchester
owl: RDF/XML
owx: OWL/XML
ttl: Turtle
For some huge ontologies, ROBOT requires a large amount of memory to
convert them to the .obo format. If the ontology is in the .owl format
(in the RDF/XML serialization format), the function import_owl() can be
used instead. import_owl() directly parses the .owl file and returns an
ontology_DAG object. import_owl() is written from scratch, so it is
recommended only when import_ontology() does not work.
## An ontology_DAG object:
## Source: Plant Ontology, http://purl.obolibrary.org/obo/po/releases/2023-07-13/po.owl
## 1656 terms / 1776 relations
## Root: ~~all~~
## Terms: PO:0000001, PO:0000002, PO:0000003, PO:0000004, ...
## Max depth: 11
## Avg number of parents: 1.07
## Avg number of children: 1.06
## Aspect ratio: 39:1 (based on the longest distance from root)
## 38.2:1 (based on the shortest distance from root)
## Relations: is_a
##
## With the following columns in the metadata data frame:
## id, short_id, name, namespace, definition
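
The recommendation to try import_ontology() first and use import_owl() only when that fails can be sketched as a small fallback wrapper. The helper import_with_fallback() is hypothetical and not part of simona; it takes the importers as a named list, so the same pattern works with any functions that accept a file path.

```r
# Hypothetical helper: call each importer in turn and return the first
# result that does not raise an error
import_with_fallback = function(file, importers) {
    for (nm in names(importers)) {
        res = tryCatch(importers[[nm]](file), error = function(e) {
            message(nm, "() failed: ", conditionMessage(e))
            NULL
        })
        if (!is.null(res)) return(res)
    }
    stop("None of the importers could read ", file)
}

# Usage with simona (not run here; it needs the package, and
# import_ontology() additionally needs Java and robot.jar):
# dag = import_with_fallback("po.owl",
#     list(import_ontology = simona::import_ontology,
#          import_owl      = simona::import_owl))
```

Passing the importers in order keeps the preferred route explicit: the ROBOT-based import is attempted first, and the built-in parser is used only after it has raised an error.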
Similarly, some ontologies may only provide large files in the .ttl
format (the Turtle format). simona also provides the function
import_ttl(), which can read .ttl files that use owl:Class as objects.
The internal parsing script is written in Perl, so make sure Perl is
installed on your machine.
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.6.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] org.Hs.eg.db_3.18.0 AnnotationDbi_1.64.1 IRanges_2.36.0
## [4] S4Vectors_0.40.2 Biobase_2.62.0 BiocGenerics_0.48.1
## [7] igraph_1.5.1 simona_1.0.10 knitr_1.45
##
## loaded via a namespace (and not attached):
## [1] KEGGREST_1.42.0 circlize_0.4.15 shape_1.4.6
## [4] rjson_0.2.21 xfun_0.41 bslib_0.5.1
## [7] GlobalOptions_0.1.2 bitops_1.0-7 vctrs_0.6.4
## [10] tools_4.3.2 curl_5.1.0 parallel_4.3.2
## [13] Polychrome_1.5.1 RSQLite_2.3.2 highr_0.10
## [16] cluster_2.1.4 blob_1.2.4 pkgconfig_2.0.3
## [19] RColorBrewer_1.1-3 scatterplot3d_0.3-44 GenomeInfoDbData_1.2.11
## [22] lifecycle_1.0.3 compiler_4.3.2 Biostrings_2.70.2
## [25] codetools_0.2-19 ComplexHeatmap_2.18.0 clue_0.3-65
## [28] GenomeInfoDb_1.38.5 httpuv_1.6.12 htmltools_0.5.7
## [31] sass_0.4.7 RCurl_1.98-1.13 yaml_2.3.7
## [34] later_1.3.1 crayon_1.5.2 jquerylib_0.1.4
## [37] GO.db_3.18.0 ellipsis_0.3.2 cachem_1.0.8
## [40] iterators_1.0.14 foreach_1.5.2 mime_0.12
## [43] digest_0.6.33 fastmap_1.1.1 grid_4.3.2
## [46] colorspace_2.1-0 cli_3.6.1 magrittr_2.0.3
## [49] promises_1.2.1 bit64_4.0.5 rmarkdown_2.25
## [52] XVector_0.42.0 httr_1.4.7 matrixStats_1.0.0
## [55] bit_4.0.5 png_0.1-8 GetoptLong_1.0.5
## [58] memoise_2.0.1 shiny_1.7.5.1 evaluate_0.23
## [61] doParallel_1.0.17 rlang_1.1.1 Rcpp_1.0.11
## [64] xtable_1.8-4 DBI_1.1.3 xml2_1.3.5
## [67] jsonlite_1.8.7 R6_2.5.1 zlibbioc_1.48.0