KEGGREST

KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.

KEGGREST allows access to the KEGG REST API. Since KEGG disabled the KEGG SOAP server on December 31, 2012 (which means the KEGGSOAP package will no longer work), KEGGREST serves as a replacement.

The interface to KEGGREST is simpler and in some ways more powerful than KEGGSOAP; however, not all the functionality that was available through the SOAP API has been exposed in the REST API. If and when more functionality is exposed on the server side, this package will be updated to take advantage of it.

Overview

The KEGG REST API is built on a few simple operations: info, list, find, get, conv, and link. The corresponding R functions in KEGGREST are keggInfo(), keggList(), keggFind(), keggGet(), keggConv(), and keggLink().
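To make the correspondence between operations and requests concrete, here is a minimal sketch that assembles request URLs by hand. The helper kegg_url() is hypothetical and not part of KEGGREST (the package builds its requests internally), and the base address https://rest.kegg.jp is an assumption about where the server currently lives.

```r
# Hypothetical helper (not part of KEGGREST): build a KEGG REST URL.
# Assumes the server is reachable at https://rest.kegg.jp.
kegg_url <- function(operation, ...) {
  operations <- c("info", "list", "find", "get", "conv", "link")
  operation <- match.arg(operation, operations)
  # Several identifiers inside one argument are joined with "+"
  args <- vapply(list(...), paste, character(1), collapse = "+")
  paste(c("https://rest.kegg.jp", operation, args), collapse = "/")
}

kegg_url("list", "organism")
kegg_url("get", c("hsa:10458", "ece:Z5100"), "aaseq")
```

Pasting one of these URLs into a browser is a quick way to see the raw text that KEGGREST parses on your behalf.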

Explore KEGG resources with keggList()

KEGG exposes a number of databases. To get an idea of what is available, run listDatabases():

library(KEGGREST)
listDatabases()
##  [1] "pathway"  "brite"    "module"   "disease"  "drug"     "environ" 
##  [7] "ko"       "genome"   "compound" "glycan"   "reaction" "rclass"  
## [13] "enzyme"   "organism"

You can use these databases in further queries. Note that in many cases you can also use a KEGG organism code (three or four letters, such as “hsa”) or a “T number” (genome identifier) in the same place you would use one of these database names.

You can obtain the list of organisms available in KEGG with the keggList() function:

org <- keggList("organism")
head(org)
##      T.number organism species                                              
## [1,] "T01001" "hsa"    "Homo sapiens (human)"                               
## [2,] "T01005" "ptr"    "Pan troglodytes (chimpanzee)"                       
## [3,] "T02283" "pps"    "Pan paniscus (bonobo)"                              
## [4,] "T02442" "ggo"    "Gorilla gorilla gorilla (western lowland gorilla)"  
## [5,] "T01416" "pon"    "Pongo abelii (Sumatran orangutan)"                  
## [6,] "T03265" "nle"    "Nomascus leucogenys (northern white-cheeked gibbon)"
##      phylogeny                               
## [1,] "Eukaryotes;Animals;Vertebrates;Mammals"
## [2,] "Eukaryotes;Animals;Vertebrates;Mammals"
## [3,] "Eukaryotes;Animals;Vertebrates;Mammals"
## [4,] "Eukaryotes;Animals;Vertebrates;Mammals"
## [5,] "Eukaryotes;Animals;Vertebrates;Mammals"
## [6,] "Eukaryotes;Animals;Vertebrates;Mammals"

From KEGGREST's point of view, you've just asked KEGG to show you the name of every entry in the “organism” database.

Therefore, the complete list of entities that can be queried with KEGGREST can be obtained as follows:

queryables <- c(listDatabases(), org[,1], org[,2])
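As a small illustration of how such a vector might be used, the sketch below checks candidate names against it. To stay runnable without network access it uses stand-in data: org_sample reproduces the first three rows of the organism table shown above, and databases holds a few of the names printed by listDatabases(). Both object names are made up for this example.

```r
# Stand-in for the first rows of keggList("organism"), copied from the output above
org_sample <- matrix(
  c("T01001", "hsa", "Homo sapiens (human)",
    "T01005", "ptr", "Pan troglodytes (chimpanzee)",
    "T02283", "pps", "Pan paniscus (bonobo)"),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("T.number", "organism", "species"))
)
# Stand-in for a few of the names returned by listDatabases()
databases <- c("pathway", "compound", "enzyme")

sample_queryables <- c(databases, org_sample[, "T.number"], org_sample[, "organism"])

# Which candidate names could be used where a database name is expected?
candidates <- c("hsa", "T01005", "compound", "xyz")
is_queryable <- candidates %in% sample_queryables
is_queryable

# Look up an organism code from part of the species name
chimp_code <- org_sample[grepl("chimpanzee", org_sample[, "species"]), "organism"]
chimp_code
```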

You could also ask for every entry in the “hsa” (Homo sapiens) database as follows:

keggList("hsa")

Get specific entries with keggGet()

Once you have a list of specific KEGG identifiers, use keggGet() to get more information about them. Here we look up a human gene and an E. coli O157 gene:

query <- keggGet(c("hsa:10458", "ece:Z5100"))

As expected, this returns two items:

length(query)
## [1] 2

Behind the scenes, KEGGREST downloaded and parsed a KEGG flat file, which you can now explore:

names(query[[1]])
##  [1] "ENTRY"      "NAME"       "DEFINITION" "ORTHOLOGY"  "ORGANISM"  
##  [6] "PATHWAY"    "BRITE"      "POSITION"   "MOTIF"      "DBLINKS"   
## [11] "STRUCTURE"  "AASEQ"      "NTSEQ"
query[[1]]$ENTRY
##     CDS 
## "10458"
query[[1]]$DBLINKS
## [1] "NCBI-ProteinID: NP_059345" "NCBI-GeneID: 10458"       
## [3] "OMIM: 605475"              "HGNC: 947"                
## [5] "HPRD: 05686"               "Ensembl: ENSG00000175866" 
## [7] "Vega: OTTHUMG00000177698"  "Pharos: Q9UQB8(Tbio)"     
## [9] "UniProt: Q9UQB8"
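The DBLINKS field comes back as plain "database: identifier" strings. If you would rather look up cross-references by database name, one possible approach is to split each string at the colon. The sketch below does this with base R only, using a few values copied from the output above; the variable names are illustrative.

```r
# DBLINKS values copied from the output above
dblinks <- c("NCBI-ProteinID: NP_059345", "NCBI-GeneID: 10458",
             "OMIM: 605475", "HGNC: 947", "UniProt: Q9UQB8")

# Split each "database: identifier" string at the first ": "
db  <- sub(": .*$", "", dblinks)      # text before the separator
ids <- sub("^[^:]*: ", "", dblinks)   # text after the separator
xrefs <- setNames(ids, db)

xrefs[["UniProt"]]
xrefs[["NCBI-GeneID"]]
```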

keggGet() can also return amino acid sequences as AAStringSet objects (from the Biostrings package):

keggGet(c("hsa:10458", "ece:Z5100"), "aaseq") ## retrieves amino acid sequences
##   A AAStringSet instance of length 2
##     width seq                                               names               
## [1]   552 MSLSRSEEMHRLTENVYKTIMEQ...DLSAQGPEGREHGDGSARTLAGR hsa:10458 K05627 ...
## [2]   248 MLNGISNAASTLGRQLVGIASRV...SGLPPLAQALKDHLAAYEQSKKG ece:Z5100 K12786 ...

…or DNAStringSet objects if the option argument is "ntseq":

keggGet(c("hsa:10458", "ece:Z5100"), "ntseq") ## retrieves nucleotide sequences
##   A DNAStringSet instance of length 2
##     width seq                                               names               
## [1]  1659 ATGTCTCTGTCTCGCTCAGAGGA...CCCGCACCCTGGCTGGAAGATGA hsa:10458 K05627 ...
## [2]   747 ATGCTTAATGGAATTAGTAACGC...ATGAGCAATCGAAGAAAGGGTAA ece:Z5100 K12786 ...

keggGet() can also return images:

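library(png)
img <- keggGet("hsa05130", "image")      # pathway map returned as a raster array
img_file <- tempfile(fileext = ".png")   # avoid masking base functions such as t()
writePNG(img, img_file)
if (interactive()) browseURL(img_file)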

NOTE: keggGet() can only return 10 result sets at once (this limitation is on the server side). If you supply more than 10 inputs to keggGet(), KEGGREST will warn that only the first 10 results will be returned.
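One way to work within this limit is to split a long identifier vector into groups of at most 10 and request each group separately. The following is a minimal sketch of that idea; chunk_ids() is a hypothetical helper, not a KEGGREST function, and the identifiers are placeholders generated only to show the grouping.

```r
# Hypothetical helper: split identifiers into groups of at most `size`
chunk_ids <- function(ids, size = 10) {
  split(ids, ceiling(seq_along(ids) / size))
}

ids <- paste0("hsa:", 1:23)   # placeholder identifiers, for illustration only
chunks <- chunk_ids(ids)
lengths(chunks)               # two groups of 10 and one group of 3

# With KEGGREST loaded and network access, each group could then be fetched
# and the per-group lists flattened into one list of entries:
# results <- unlist(lapply(chunks, keggGet), recursive = FALSE)
```

Issuing many requests in quick succession may be unwelcome on a shared server, so consider pausing between groups (for example with Sys.sleep()) when the list is long.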

Search by keywords with keggFind()

You can search for two separate keywords (“shiga” and “toxin” in this case):

head(keggFind("genes", c("shiga", "toxin")))
## ece:Z1464
## "K11006 shiga toxin subunit A | (GenBank) stx2A; shiga-like toxin II A subunit encoded by bacteriophage BP-933W"
## ece:Z1465
## "K11007 shiga toxin subunit B | (GenBank) stx2B; shiga-like toxin II B subunit encoded by bacteriophage BP-933W"
## ece:Z3343
## "K11007 shiga toxin subunit B | (GenBank) stx1B; shiga-like toxin 1 subunit B encoded within prophage CP-933V"
## ece:Z3344
## "K11006 shiga toxin subunit A | (GenBank) stx1A; shiga-like toxin 1 subunit A encoded within prophage CP-933V"
## ecs:ECs1205
## "K11006 shiga toxin subunit A | (RefSeq) Shiga toxin 2 subunit A"
## ecs:ECs1206
## "K11007 shiga toxin subunit B | (RefSeq) Shiga toxin 2 subunit B"

Or search for the two words together:

head(keggFind("genes", "shiga toxin"))
## ece:Z1464
## "K11006 shiga toxin subunit A | (GenBank) stx2A; shiga-like toxin II A subunit encoded by bacteriophage BP-933W"
## ece:Z1465
## "K11007 shiga toxin subunit B | (GenBank) stx2B; shiga-like toxin II B subunit encoded by bacteriophage BP-933W"
## ece:Z3343
## "K11007 shiga toxin subunit B | (GenBank) stx1B; shiga-like toxin 1 subunit B encoded within prophage CP-933V"
## ece:Z3344
## "K11006 shiga toxin subunit A | (GenBank) stx1A; shiga-like toxin 1 subunit A encoded within prophage CP-933V"
## ecs:ECs1205
## "K11006 shiga toxin subunit A | (RefSeq) Shiga toxin 2 subunit A"
## ecs:ECs1206
## "K11007 shiga toxin subunit B | (RefSeq) Shiga toxin 2 subunit B"

Search for a chemical formula:

head(keggFind("compound", "C7H10O5", "formula"))
## cpd:C00493 cpd:C04236 cpd:C16588 cpd:C17696 cpd:C18307 cpd:C18312 
##  "C7H10O5"  "C7H10O5"  "C7H10O5"  "C7H10O5"  "C7H10O5"  "C7H10O5"

Search for a chemical formula containing “O5” and “C7”:

head(keggFind("compound", "O5C7", "formula"))
## cpd:C00493 cpd:C00624 cpd:C01215 cpd:C01424 cpd:C02123 cpd:C02236 
##  "C7H10O5" "C7H11NO5"  "C7H9NO5"   "C7H6O5"  "C7H12O5"  "C7H6O5S"

You can search for compounds with a particular exact mass:

keggFind("compound", 174.05, "exact_mass")
##   cpd:C00493   cpd:C04236   cpd:C16588   cpd:C17696   cpd:C18307   cpd:C18312 
## "174.052823" "174.052823" "174.052823" "174.052823" "174.052823" "174.052823" 
##   cpd:C21281 
## "174.052823"

Because we've supplied a number with two decimal digits of precision, KEGG will find all compounds with exact mass between 174.045 and 174.055.
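The arithmetic behind that range can be sketched as follows. mass_window() is a hypothetical helper written for this illustration; it simply computes half a unit in the last supplied decimal place on either side of the query value, which matches the two-decimal case described above. Whether the server applies exactly this rule at other precisions is an assumption.

```r
# Hypothetical helper: range implied by a value given to `digits` decimal places
mass_window <- function(x, digits) {
  half_step <- 0.5 * 10^(-digits)
  c(lower = x - half_step, upper = x + half_step)
}

w <- mass_window(174.05, digits = 2)
w

# The exact mass reported in the output above falls inside this window
hit <- 174.052823
in_window <- hit >= w[["lower"]] && hit <= w[["upper"]]
in_window
```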

Integer ranges can be used to find compounds by molecular weight:

head(keggFind("compound", 300:310, "mol_weight"))
##   cpd:C00051   cpd:C00200   cpd:C00219   cpd:C00239   cpd:C00270   cpd:C00357 
##  "307.32348"  "306.33696"  "304.46688" "307.197122"  "309.26986" "301.187702"
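Note that the weights come back as character strings, so they need converting before any numeric comparison. A minimal sketch, using four values copied from the output above rather than a live query:

```r
# Values copied from the output above (keggFind() returns a named character vector)
weights <- c("cpd:C00051" = "307.32348", "cpd:C00200" = "306.33696",
             "cpd:C00219" = "304.46688", "cpd:C00239" = "307.197122")

w_num <- as.numeric(weights)
names(w_num) <- names(weights)   # as.numeric() drops the names, so restore them

# Compounds lighter than 307, lightest first
light <- sort(w_num[w_num < 307])
light
```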

Convert identifiers with keggConv()

keggConv() converts between KEGG identifiers and outside identifiers.

You can either specify fully qualified identifiers:

keggConv("ncbi-proteinid", c("hsa:10458", "ece:Z5100"))
##                  hsa:10458                  ece:Z5100 
## "ncbi-proteinid:NP_059345"  "ncbi-proteinid:AAG58814"

…or get the mapping for an entire species:

head(keggConv("eco", "ncbi-geneid"))
## ncbi-geneid:944742 ncbi-geneid:945803 ncbi-geneid:947498 ncbi-geneid:945198 
##        "eco:b0001"        "eco:b0002"        "eco:b0003"        "eco:b0004" 
## ncbi-geneid:944747 ncbi-geneid:944749 
##        "eco:b0005"        "eco:b0006"

Reversing the arguments does the opposite mapping:

head(keggConv("ncbi-geneid", "eco"))
##            eco:b0001            eco:b0002            eco:b0003 
## "ncbi-geneid:944742" "ncbi-geneid:945803" "ncbi-geneid:947498" 
##            eco:b0004            eco:b0005            eco:b0006 
## "ncbi-geneid:945198" "ncbi-geneid:944747" "ncbi-geneid:944749"

Link across databases with keggLink()

Most of the KEGGSOAP functions whose names started with “get”, for example get.pathways.by.genes(), can be replaced with the keggLink() function. Here we retrieve the gene-to-pathway links for human:

head(keggLink("pathway", "hsa"))
##       hsa:10327         hsa:124         hsa:125         hsa:126         hsa:127 
## "path:hsa00010" "path:hsa00010" "path:hsa00010" "path:hsa00010" "path:hsa00010" 
##         hsa:128 
## "path:hsa00010"

…but you can also specify one or more genes (from multiple species):

keggLink("pathway", c("hsa:10458", "ece:Z5100"))
##       hsa:10458       hsa:10458       ece:Z5100 
## "path:hsa04520" "path:hsa04810" "path:ece05130"
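Because a gene can belong to several pathways, its identifier may appear more than once among the names of the result. One possible way to collect the pathways per gene is split(), shown here on the three links from the output above rather than on a live query:

```r
# Links copied from the output above
links <- c("hsa:10458" = "path:hsa04520",
           "hsa:10458" = "path:hsa04810",
           "ece:Z5100" = "path:ece05130")

# Group pathway identifiers by the gene they are linked to
by_gene <- split(unname(links), names(links))
by_gene[["hsa:10458"]]
lengths(by_gene)
```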