Get started with OmaDB

Klara Kaleb

2019-01-04

This vignette shows how to get started with the OmaDB package. OmaDB is a wrapper for the REST API of the Orthologous MAtrix project (OMA), a database for the inference of orthologs among complete genomes.

For more details on the OMA project, see https://omabrowser.org/oma/home/.

Some useful functions

The package contains a range of functions that are used to query the database in an R friendly way. This vignette describes some of them, whereas the rest are described in more detail in other vignettes:

Exploring Hierarchical orthologous groups with roma

Exploring Taxonomic trees with roma

Sequence Analysis with roma

### getXref

This function searches the OMA database for entries containing the given pattern and returns the results in a data frame, which makes it a good starting point. The code below loads a saved sample response.

library(OmaDB)

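# load() restores the saved object `xref` into the workspace. Assigning its
# return value would overwrite `xref` with the character vector "xref".
load('../data/xref.rda')

head(xref)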
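
The sample data in this vignette is stored in .rda files, so it may help to recall how base R's load() behaves: it restores each object under the name it was saved with and returns only those names, not the data. The following is a minimal sketch using a temporary file and a toy data frame; `xref_demo` and its contents are hypothetical and are not part of the package data.

```r
# Toy stand-in for a saved API response (hypothetical contents).
xref_demo <- data.frame(id = c("A", "B"), stringsAsFactors = FALSE)

# Save it to a temporary .rda file, then remove it from the workspace.
rda_path <- tempfile(fileext = ".rda")
save(xref_demo, file = rda_path)
rm(xref_demo)

# load() restores the object under its saved name and returns that name.
loaded_names <- load(rda_path)

print(loaded_names)     # names of the restored objects
print(head(xref_demo))  # the restored data frame itself
```

The return value of load() is therefore only useful for checking which objects were restored. The data itself is reached through the restored name.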

### getGenomeAlignment

This function retrieves the pairwise orthologs between two whole genomes. The result is a data frame containing information on each member of a pair and on their relationship.

load('../data/pairs.rda')

head(pairs)
##   entry_1.entry_nr                           entry_1.entry_url
## 1          6618226 https://omabrowser.org/api/protein/6618226/
## 2          6618227 https://omabrowser.org/api/protein/6618227/
## 3          6618228 https://omabrowser.org/api/protein/6618228/
## 4          6618229 https://omabrowser.org/api/protein/6618229/
## 5          6618230 https://omabrowser.org/api/protein/6618230/
## 6          6618231 https://omabrowser.org/api/protein/6618231/
##   entry_1.omaid entry_1.canonicalid             entry_1.sequence_md5
## 1    ASHGO00001              Q75FB7 a9b1a6dc9afb2b02afe8fdf8029b5f22
## 2    ASHGO00002              Q75FB6 5d186037c4dd0a89b34d70d596fac86d
## 3    ASHGO00003              Q75FB5 3a83b276f0c9034f7cf66277e7d6c983
## 4    ASHGO00004              Q75FB4 a8611c3f24ac6599e6a36a2710f6d24a
## 5    ASHGO00005              Q75FB3 07277f0ca66fcb49d175667272f1547f
## 6    ASHGO00006              Q75FB2 2b666569856af5f068e532ade89c1140
##   entry_1.oma_group entry_1.oma_hog_id entry_1.chromosome
## 1                 0     HOG:0393392.4c                  I
## 2            203915        HOG:0200818                  I
## 3            214367        HOG:0200433                  I
## 4            768456  HOG:0387657.2d.3a                  I
## 5            530479     HOG:0397172.3b                  I
## 6            563083     HOG:0201049.3a                  I
##   entry_1.locus.start entry_1.locus.end entry_1.locus.strand
## 1                8108              9067                    1
## 2                9537             12593                    1
## 3               12906             13244                    1
## 4               13713             14846                    1
## 5               16155             19850                    1
## 6               20056             23721                   -1
##   entry_1.is_main_isoform entry_2.entry_nr
## 1                    TRUE          6637770
## 2                    TRUE          6637359
## 3                    TRUE          6637360
## 4                    TRUE          6637767
## 5                    TRUE          6636211
## 6                    TRUE          6636209
##                             entry_2.entry_url entry_2.omaid
## 1 https://omabrowser.org/api/protein/6637770/    YEAST04806
## 2 https://omabrowser.org/api/protein/6637359/    YEAST04395
## 3 https://omabrowser.org/api/protein/6637360/    YEAST04396
## 4 https://omabrowser.org/api/protein/6637767/    YEAST04803
## 5 https://omabrowser.org/api/protein/6636211/    YEAST03247
## 6 https://omabrowser.org/api/protein/6636209/    YEAST03245
##   entry_2.canonicalid             entry_2.sequence_md5 entry_2.oma_group
## 1          RCE1_YEAST 605098a0697ad8fc7af2101e758033cb            494558
## 2          ZDS2_YEAST 8cc75f16fbfd321833abc48fd1173154            203915
## 3          YMK8_YEAST 783fea3b573632292d89a5a5218b6e90            214367
## 4          SCS7_YEAST 9307c7a6e80ed39d8b6529329fee1819            768456
## 5          SMC3_YEAST 4e8f1295434b44ae8f749cad976b966e            530479
## 6          NET1_YEAST f2ba71aea520ea66f015ba357eb6e8c6            563083
##   entry_2.oma_hog_id entry_2.chromosome entry_2.locus.start
## 1     HOG:0393392.4c               XIII              814364
## 2     HOG:0200818.1a               XIII               51640
## 3        HOG:0200433               XIII               54793
## 4  HOG:0387657.2d.3a               XIII              809623
## 5     HOG:0397172.3b                  X              299157
## 6     HOG:0201049.2a                  X              295245
##   entry_2.locus.end entry_2.locus.strand entry_2.is_main_isoform rel_type
## 1            815311                   -1                    TRUE      1:1
## 2             54468                    1                    TRUE      1:1
## 3             55110                    1                    TRUE      1:1
## 4            810777                   -1                    TRUE      1:1
## 5            302849                   -1                    TRUE      1:1
## 6            298814                    1                    TRUE      1:1
##   distance   score
## 1 122.0000  636.04
## 2  95.0000 1424.24
## 3  50.0000  557.51
## 4  37.7442 2682.46
## 5  58.0000 5548.40
## 6  90.0000 1844.35
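
Because the result is an ordinary data frame, it can be filtered and summarised with base R. The sketch below rebuilds a few columns of the six rows printed above as a toy data frame (`pairs_demo` is a hypothetical name, and only a subset of the columns is copied), keeps the one-to-one pairs, orders them by the distance column, and counts pairs per chromosome of the second genome.

```r
# A few columns of the six rows shown above, rebuilt by hand for illustration.
pairs_demo <- data.frame(
  entry_1.omaid = c("ASHGO00001", "ASHGO00002", "ASHGO00003",
                    "ASHGO00004", "ASHGO00005", "ASHGO00006"),
  entry_2.omaid = c("YEAST04806", "YEAST04395", "YEAST04396",
                    "YEAST04803", "YEAST03247", "YEAST03245"),
  entry_2.chromosome = c("XIII", "XIII", "XIII", "XIII", "X", "X"),
  rel_type = rep("1:1", 6),
  distance = c(122, 95, 50, 37.7442, 58, 90),
  score = c(636.04, 1424.24, 557.51, 2682.46, 5548.40, 1844.35),
  stringsAsFactors = FALSE
)

# Keep one-to-one orthologs and order them by increasing distance.
one_to_one <- pairs_demo[pairs_demo$rel_type == "1:1", ]
closest_first <- one_to_one[order(one_to_one$distance), ]

# Count pairs per chromosome of the second genome.
pairs_per_chr <- table(closest_first$entry_2.chromosome)

print(closest_first[, c("entry_1.omaid", "entry_2.omaid", "distance")])
print(pairs_per_chr)
```

The same subsetting works on the full `pairs` data frame, since the column names used here are taken from the output above.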

### getData

This master function retrieves the information for a single database entry: a group, a protein, or a genome. The data type is specified by setting the “type” argument, and the specific entry by its ID. Below are the possible IDs for the different object types.

### getObjectAttributes

The result of the getData function is an S3 object with attributes corresponding to the information requested. This function lists all of the object's attributes and their corresponding data types.

### getAttribute

The specific attributes of the created object can be accessed via $ or via the getAttribute() function. Below is an example of an object containing information about an OMA group.

load('../data/group.rda')

object_attributes = getObjectAttributes(group)
## [1] "group_nr : integer"
## [1] "fingerprint : character"
## [1] "related_groups : URL"
## [1] "members : data.frame"
group$fingerprint
## [1] "FPNDKFP"
getAttribute(group, 'fingerprint')
## [1] "FPNDKFP"
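
To make the access pattern concrete without querying the API, the sketch below builds a toy list-based S3 object and reads it with base R only. It is a stand-in under stated assumptions and not the package's implementation: the class name `demo_group`, the member IDs, and the object layout are hypothetical, and only the fingerprint value is taken from the output above.

```r
# Hypothetical stand-in for an object returned by getData(); the real object
# is built by the package from the API response.
group_demo <- structure(
  list(
    fingerprint = "FPNDKFP",
    members = data.frame(omaid = c("DEMO00001", "DEMO00002"),
                         stringsAsFactors = FALSE)
  ),
  class = "demo_group"  # hypothetical class name
)

# List each element with the class of its value, similar in spirit to the
# attribute listing shown above.
for (name in names(group_demo)) {
  print(paste(name, ":", class(group_demo[[name]])[1]))
}

# `$` and `[[` reach the same element; `[[` also accepts a name held in a variable.
wanted <- "fingerprint"
via_dollar <- group_demo$fingerprint
via_brackets <- group_demo[[wanted]]
print(identical(via_dollar, via_brackets))
```

Access by a name held in a variable is convenient when looping over several attributes, which is the kind of situation an accessor function such as getAttribute() is suited to.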

### resolveURL

In most cases a large amount of information is available for a given entry, which increases the data retrieval time. For this reason, the information for such entries is split across a number of endpoints, which are included in the object as URLs where appropriate. This function lets the user retrieve the information behind those URLs.

One use of resolveURL is to obtain the list of orthologs for a given protein.

load('../data/protein.rda')

getAttribute(protein,'orthologs')
## [1] "https://omabrowser.org/api/protein/6633022/orthologs/"
load('../data/orthologs.rda')

orthologs 
##    entry_nr                                   entry_url      omaid
## 1   6342668 https://omabrowser.org/api/protein/6342668/ COCLU03588
## 2   6399341 https://omabrowser.org/api/protein/6399341/ COLGR10929
## 3   6407219 https://omabrowser.org/api/protein/6407219/ COLSU06788
## 4   6468764 https://omabrowser.org/api/protein/6468764/ FUSO411719
## 5   6475980 https://omabrowser.org/api/protein/6475980/ GIBZA02345
## 6   6501606 https://omabrowser.org/api/protein/6501606/ NECHA14651
## 7   6530466 https://omabrowser.org/api/protein/6530466/ THIHE03640
## 8   6554631 https://omabrowser.org/api/protein/6554631/ CANAL00775
## 9   6560817 https://omabrowser.org/api/protein/6560817/ CANAW00573
## 10  6566321 https://omabrowser.org/api/protein/6566321/ LODEL00382
## 11  6575886 https://omabrowser.org/api/protein/6575886/ DEBHA04175
## 12  6579264 https://omabrowser.org/api/protein/6579264/ PICGU01290
## 13  6594625 https://omabrowser.org/api/protein/6594625/ SPAPN04932
## 14  6595834 https://omabrowser.org/api/protein/6595834/ CANTE00168
## 15  6608129 https://omabrowser.org/api/protein/6608129/ KOMPG00513
## 16  6619568 https://omabrowser.org/api/protein/6619568/ ASHGO01343
## 17  6623679 https://omabrowser.org/api/protein/6623679/ KLULA00697
## 18  6630047 https://omabrowser.org/api/protein/6630047/ CANGA01835
## 19  6642012 https://omabrowser.org/api/protein/6642012/ ZYGRO02696
## 20  8837885 https://omabrowser.org/api/protein/8837885/ DROBM15099
##    canonicalid                     sequence_md5 oma_group     oma_hog_id
## 1              c5db7c8e6b0eee5c6bbc18fe2dbf9ba7    737617 HOG:0379998.2a
## 2              2cf82b1ffadc581d82d920af9b0130a2    737617 HOG:0379998.2a
## 3   A0A066XI05 d6abbfd6d6ed96217331e86733ae4aeb    737617 HOG:0379998.2a
## 4   A0A0D2XKA6 b92bedd8c4fe61cc3667d4f108ffb528    737617 HOG:0379998.2a
## 5              54814145e0f45beb33ac2e2cb99335a3    737618 HOG:0379998.2a
## 6              8c85e6f457288d146cb99fbc61404cb9    737617 HOG:0379998.2a
## 7              416de3ec8baf6df5b95a10b2145e7242    737617 HOG:0379998.2a
## 8   CCR4_CANAL b80e7f71d278fae2335d393c82912fee    737617 HOG:0379998.2b
## 9       C4YDK4 d1158bec0e2fface48020c27d9e6ca2b    737618 HOG:0379998.2b
## 10      A5DSP6 58726579b898e8bc18c954e8cc7039f5    737636 HOG:0379998.2b
## 11  CCR4_DEBHA 06ae2387578bfe3873ad4c34a339bb26    737636 HOG:0379998.2b
## 12      A5DDD9 d796187a45010e0868d774e568bf8d75    737617 HOG:0379998.2b
## 13      G3ATH1 fe2c181e7927a749ce89d63f0cfa99a9    737636 HOG:0379998.2b
## 14             4ff2469626a34edf6dfc0cc18d6ae06b    737618 HOG:0379998.2b
## 15      C4R821 eb4910adfeab0ff42a374bee5d9c2176    737618 HOG:0379998.2b
## 16  CCR4_ASHGO dc714e8db742c3fa6e16c074c3ea9b75    737617 HOG:0379998.2b
## 17  CCR4_KLULA bd38c48cb3b76bed0c72f5cd2a7ec92e    737636 HOG:0379998.2b
## 18  CCR4_CANGA 3329cf643b6767b0b022622779d5daf1    737636 HOG:0379998.2b
## 19  A0A1Q2ZZG0 f2629a0761ab52e0a1beb6b4a5d9c868    737636 HOG:0379998.2b
## 20             11d6a871f132d98e296ba5a628cac8b4         0               
##                                   chromosome locus.start locus.end
## 1                                 scaffold_6       27852     30189
## 2                                   GG697333      764822    767178
## 3                            Scaffolds0609.1         399      2755
## 4                                          4     4317479   4319279
## 5                            Supercontig_3.1     6701031   6703176
## 6                           sca_14_chr10_3_0      669785    671940
## 7                               chromosome_2     3756431   3758803
## 8                     supercontig_supercont4     1809654   1812017
## 9                                          1     1390917   1393274
## 10 supercont1.1 of Lodderomyces elongisporus     1013851   1016379
## 11                                         F      373608    376103
## 12                      supercontig_CH408156      187607    189769
## 13                                scaffold_6      636379    638832
## 14                            scaffold_00004      297506    299533
## 15                                         4      962805    965323
## 16                                       III      882342    884552
## 17                                         F     1468612   1470984
## 18                                         H      598643    601264
## 19                                         D      453024    455597
## 20                          scf7180000299460       79621     84056
##    locus.strand is_main_isoform rel_type distance   score
## 1            -1            TRUE      1:1 105.0000 1362.00
## 2             1            TRUE      1:1 108.0000 1320.99
## 3             1            TRUE      1:1 109.0000 1312.07
## 4            -1            TRUE      1:1 107.0000 1326.20
## 5            -1            TRUE      1:1 109.0000 1327.66
## 6            -1            TRUE      1:1 105.0000 1336.03
## 7             1            TRUE      1:1 113.0000 1251.50
## 8             1            TRUE      1:1  72.0000 2268.69
## 9            -1            TRUE      1:1  72.0000 2266.12
## 10           -1            TRUE      1:1  74.0000 2267.07
## 11           -1            TRUE      1:1  73.0000 2246.76
## 12           -1            TRUE      1:1  76.0000 2124.66
## 13           -1            TRUE      1:1  73.0000 2240.65
## 14            1            TRUE      1:1  79.0000 2116.79
## 15            1            TRUE      1:1  68.0000 2364.83
## 16            1            TRUE      1:1  36.1210 4302.76
## 17           -1            TRUE      1:1  36.1210 4423.53
## 18           -1            TRUE      1:1  35.3357 5201.71
## 19           -1            TRUE      1:1  42.1286 4655.84
## 20           -1            TRUE      1:1   0.3903 7730.37
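
A table like the one above can be summarised with base R. The sketch below copies four columns of six of the rows into a toy data frame (`orthologs_demo` is a hypothetical name), derives the species code from each OMA ID, counts orthologs per HOG ID, and picks the row with the smallest distance. It assumes that the first five characters of an OMA ID are the species code, which matches the IDs shown in this vignette.

```r
# Six rows (1, 2, 8, 16, 18 and 20) of the table above, four columns only.
orthologs_demo <- data.frame(
  omaid = c("COCLU03588", "COLGR10929", "CANAL00775",
            "ASHGO01343", "CANGA01835", "DROBM15099"),
  oma_hog_id = c("HOG:0379998.2a", "HOG:0379998.2a", "HOG:0379998.2b",
                 "HOG:0379998.2b", "HOG:0379998.2b", ""),
  distance = c(105, 108, 72, 36.1210, 35.3357, 0.3903),
  score = c(1362.00, 1320.99, 2268.69, 4302.76, 5201.71, 7730.37),
  stringsAsFactors = FALSE
)

# Assumption: the first five characters of an OMA ID are the species code.
orthologs_demo$species <- substr(orthologs_demo$omaid, 1, 5)

# Number of orthologs per HOG ID (row 20 of the table has an empty HOG ID).
per_hog <- table(orthologs_demo$oma_hog_id)

# Row with the smallest value in the distance column.
closest <- orthologs_demo[which.min(orthologs_demo$distance), ]

print(per_hog)
print(closest[, c("omaid", "species", "distance")])
```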

The orthologs for a protein are returned as a data frame. The same structure appears elsewhere in the package (e.g. the data on the members of a particular OMA group or HOG), so the package provides the getInfo() function to simplify its processing. For example, the user can obtain a set of genomic ranges for the proteins in a data frame as follows:

gRanges = getInfo(orthologs,type='genomic_ranges')

str(gRanges)
## Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
##   ..@ seqnames       :Formal class 'Rle' [package "S4Vectors"] with 4 slots
##   .. .. ..@ values         : Factor w/ 20 levels "ZYGRO02696","ASHGO01343",..: 7 8 9 12 13 17 20 3 4 16 ...
##   .. .. ..@ lengths        : int [1:20] 1 1 1 1 1 1 1 1 1 1 ...
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   ..@ ranges         :Formal class 'IRanges' [package "IRanges"] with 6 slots
##   .. .. ..@ start          : int [1:20] 27852 764822 399 4317479 6701031 669785 3756431 1809654 1390917 1013851 ...
##   .. .. ..@ width          : int [1:20] 2338 2357 2357 1801 2146 2156 2373 2364 2358 2529 ...
##   .. .. ..@ NAMES          : NULL
##   .. .. ..@ elementType    : chr "ANY"
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   ..@ strand         :Formal class 'Rle' [package "S4Vectors"] with 4 slots
##   .. .. ..@ values         : Factor w/ 3 levels "+","-","*": 2 1 2 1 2 1 2
##   .. .. ..@ lengths        : int [1:7] 1 2 3 2 5 3 4
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   ..@ seqinfo        :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
##   .. .. ..@ seqnames   : chr [1:20] "ZYGRO02696" "ASHGO01343" "CANAL00775" "CANAW00573" ...
##   .. .. ..@ seqlengths : int [1:20] NA NA NA NA NA NA NA NA NA NA ...
##   .. .. ..@ is_circular: logi [1:20] NA NA NA NA NA NA ...
##   .. .. ..@ genome     : chr [1:20] NA NA NA NA ...
##   ..@ elementMetadata:Formal class 'DataFrame' [package "S4Vectors"] with 6 slots
##   .. .. ..@ rownames       : NULL
##   .. .. ..@ nrows          : int 20
##   .. .. ..@ listData       : Named list()
##   .. .. ..@ elementType    : chr "ANY"
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   ..@ elementType    : chr "ANY"
##   ..@ metadata       : list()
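
The str() output above can be related back to the locus columns of the orthologs table. The base R sketch below is an illustration of that relationship and not the implementation of getInfo(): it recomputes the range widths from the inclusive start and end coordinates, maps the numeric strand to "+" and "-", and run-length encodes the strand column. The values are copied from the table, and the names `loci`, `all_strands` and `strand_runs` are hypothetical.

```r
# The first three rows of the orthologs table, locus columns only.
loci <- data.frame(
  omaid = c("COCLU03588", "COLGR10929", "COLSU06788"),
  locus.start = c(27852, 764822, 399),
  locus.end = c(30189, 767178, 2755),
  locus.strand = c(-1, 1, 1),
  stringsAsFactors = FALSE
)

# Coordinates are inclusive, so width = end - start + 1.
loci$width <- loci$locus.end - loci$locus.start + 1

# Map the numeric strand to the "+" / "-" symbols.
loci$strand <- ifelse(loci$locus.strand == 1, "+", "-")

# Strand values of all 20 rows of the table, in order.
all_strands <- c(-1, 1, 1, -1, -1, -1, 1, 1, -1, -1,
                 -1, -1, -1, 1, 1, 1, -1, -1, -1, -1)

# Run-length encoding of the strand symbols.
strand_runs <- rle(ifelse(all_strands == 1, "+", "-"))

print(loci[, c("omaid", "locus.start", "width", "strand")])
print(strand_runs$lengths)
```

The recomputed widths match the first entries of the `width` slot, and the run lengths match the `lengths` slot of the strand shown in the str() output.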

The user can also obtain the list of sequences for a given data frame of proteins, as well as the list of their corresponding ontologies (which can be passed to the topGO package for further analysis). This can be done using the functions getSequences() and getOntologies(), respectively.

For further information on the OMA REST API, please visit the OMA REST API documentation.