rGREAT has integrated Gene Ontology gene sets for all supported organisms and MSigDB gene sets for human. It is very easy to apply GREAT analysis on other geneset databases such as KEGG. THe basical use has been explained in the section “Manually set gene sets” in the vignette “Analyze with local GREAT”. In this vignette, we will just show you several examples.
KEGG pathway database is widely used for gene set enrichment analysis. To apply GREAT anlaysis with KEGG pathways, we need to construct the corresponding gene sets.
KEGG provides its data via a REST API (https://rest.kegg.jp/).
There are several commands that can be used to retrieve specific types
of data. Here we use the package KEGGREST
to obtain data directly from KEGG. To get a list of genes involved in
pathways, we use the “link” command which is implemented in the function
keggLink()
.
library(KEGGREST)
pathway2gene = keggLink("pathway", "hsa")
head(pathway2gene)
## hsa:10327 hsa:124 hsa:125 hsa:126 hsa:127
## "path:hsa00010" "path:hsa00010" "path:hsa00010" "path:hsa00010" "path:hsa00010"
## hsa:128
## "path:hsa00010"
The returned object pathway2gene
is a named vector,
where the names are gene Entrez IDs (with an additional
"hsa:"
prefix) and values are pathway IDs (with an
additional "path:"
prefix). You can try to execute
keggLink("hsa","pathway")
to compare the results. The named
vectors are not common for downstream gene set analysis. Let’s convert
it to a list of genes, also remove the prefix.
kegg_pathways = split(gsub("hsa:", "", names(pathway2gene)),
gsub("path:", "", pathway2gene))
Now simply send kegg_pathways
to great()
to
perform GREAT analysis.
library(rGREAT)
gr = randomRegions(genome = "hg19")
great(gr, kegg_pathways, "hg19")
## 908 regions are associated to 1377 genes' extended TSSs.
## TSS source: TxDb.Hsapiens.UCSC.hg19.knownGene
## Genome: hg19
## OrgDb: org.Hs.eg.db
## Gene sets: self-provided
## Background: whole genome excluding gaps
## Mode: Basal plus extension
## Proximal: 5000 bp upstream, 1000 bp downstream,
## plus Distal: up to 1000000 bp
Here we only include the KEGG pathway IDs. The full name of pathways can be obtained by the following command:
pn = keggList("pathway/hsa")
head(pn)
## hsa01100
## "Metabolic pathways - Homo sapiens (human)"
## hsa01200
## "Carbon metabolism - Homo sapiens (human)"
## hsa01210
## "2-Oxocarboxylic acid metabolism - Homo sapiens (human)"
## hsa01212
## "Fatty acid metabolism - Homo sapiens (human)"
## hsa01230
## "Biosynthesis of amino acids - Homo sapiens (human)"
## hsa01232
## "Nucleotide metabolism - Homo sapiens (human)"
Please go the the documentation of KEGGREST or the main website of KEGG API (https://rest.kegg.jp/) to find out commands for retrieving various datasets there.
To make it easier to use, there is already a helper function
getKEGGPathways()
function which directly downloads data
from kEGG and format it as a list or a data frame.
lt = getKEGGPathways("hsa")
lt[1:2]
## $`hsa00010: Glycolysis / Gluconeogenesis`
## [1] "10327" "124" "125" "126" "127" "128" "130" "130589"
## [9] "131" "160287" "1737" "1738" "2023" "2026" "2027" "217"
## [17] "218" "219" "2203" "221" "222" "223" "224" "226"
## [25] "229" "230" "2538" "2597" "26330" "2645" "2821" "3098"
## [33] "3099" "3101" "387712" "3939" "3945" "3948" "441531" "501"
## [41] "5105" "5106" "5160" "5161" "5162" "5211" "5213" "5214"
## [49] "5223" "5224" "5230" "5232" "5236" "5313" "5315" "55276"
## [57] "55902" "57818" "669" "7167" "80201" "83440" "84532" "8789"
## [65] "92483" "92579" "9562"
##
## $`hsa00020: Citrate cycle (TCA cycle)`
## [1] "1431" "1737" "1738" "1743" "2271" "3417" "3418" "3419" "3420"
## [10] "3421" "4190" "4191" "47" "48" "4967" "50" "5091" "5105"
## [19] "5106" "5160" "5161" "5162" "55753" "6389" "6390" "6391" "6392"
## [28] "8801" "8802" "8803"
Please note, KEGG pathways are only free for academic users, https://www.pathway.jp/en/licensing.html.
Reactome is another popular pathway database for gene set enrichment analysis. In Bioconductor, there is a package reactome.db that contains genes associated to every pathway in Reactome.
To get the Reactome pathway gene sets, simply use the object
reactomePATHID2EXTID
.
library(reactome.db)
gs = as.list(reactomePATHID2EXTID)
great(gr, gs, "hg19")
## 908 regions are associated to 1377 genes' extended TSSs.
## TSS source: TxDb.Hsapiens.UCSC.hg19.knownGene
## Genome: hg19
## OrgDb: org.Hs.eg.db
## Gene sets: self-provided
## Background: whole genome excluding gaps
## Mode: Basal plus extension
## Proximal: 5000 bp upstream, 1000 bp downstream,
## plus Distal: up to 1000000 bp
The full pathway names are in the object
reactomePATHID2NAME
. You can convert the object to a normal
list by as.list(reactomePATHID2NAME)
.
UniProt database provides a list of controlled vocabulary represented as keywords for genes or proteins (https://www.uniprot.org/keywords/). This is useful for summarizing gene functions in a compact way. The UniProtKeywords package contains gene sets associated to keywords.
UniProtKeywords support several organisms. Here we
use human as an example. In the function
load_keyword_genesets()
, you need to specify the taxon ID
for the organism.
library(UniProtKeywords)
gs = load_keyword_genesets(9606)
great(gr, gs, "hg19")
Please refer to the
section “Manually set gene sets” in the vignette
“Analyze with local GREAT”. There introduces a
function read_gmt()
which can import a .gmt
file into a list of genes.
The tool Enrichr
provides a great resource of gene sets in .gmt
format. See
https://maayanlab.cloud/Enrichr/#libraries.
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.6.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] reactome.db_1.84.0
## [2] KEGGREST_1.42.0
## [3] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [4] GenomicFeatures_1.54.0
## [5] AnnotationDbi_1.64.0
## [6] Biobase_2.62.0
## [7] rGREAT_2.4.0
## [8] GenomicRanges_1.54.0
## [9] GenomeInfoDb_1.38.0
## [10] IRanges_2.36.0
## [11] S4Vectors_0.40.1
## [12] BiocGenerics_0.48.0
## [13] knitr_1.43
##
## loaded via a namespace (and not attached):
## [1] DBI_1.1.3
## [2] bitops_1.0-7
## [3] biomaRt_2.58.0
## [4] rlang_1.1.1
## [5] magrittr_2.0.3
## [6] GetoptLong_1.0.5
## [7] matrixStats_1.0.0
## [8] compiler_4.3.1
## [9] RSQLite_2.3.1
## [10] png_0.1-8
## [11] vctrs_0.6.3
## [12] stringr_1.5.0
## [13] pkgconfig_2.0.3
## [14] shape_1.4.6
## [15] crayon_1.5.2
## [16] fastmap_1.1.1
## [17] ellipsis_0.3.2
## [18] dbplyr_2.3.3
## [19] XVector_0.42.0
## [20] utf8_1.2.3
## [21] Rsamtools_2.18.0
## [22] promises_1.2.0.1
## [23] rmarkdown_2.23
## [24] bit_4.0.5
## [25] xfun_0.39
## [26] zlibbioc_1.48.0
## [27] cachem_1.0.8
## [28] jsonlite_1.8.7
## [29] progress_1.2.2
## [30] blob_1.2.4
## [31] highr_0.10
## [32] later_1.3.1
## [33] DelayedArray_0.28.0
## [34] BiocParallel_1.36.0
## [35] parallel_4.3.1
## [36] prettyunits_1.1.1
## [37] R6_2.5.1
## [38] bslib_0.5.0
## [39] stringi_1.7.12
## [40] RColorBrewer_1.1-3
## [41] rtracklayer_1.62.0
## [42] jquerylib_0.1.4
## [43] Rcpp_1.0.11
## [44] SummarizedExperiment_1.32.0
## [45] iterators_1.0.14
## [46] httpuv_1.6.11
## [47] Matrix_1.6-0
## [48] tidyselect_1.2.0
## [49] abind_1.4-5
## [50] yaml_2.3.7
## [51] doParallel_1.0.17
## [52] codetools_0.2-19
## [53] curl_5.0.1
## [54] lattice_0.21-8
## [55] tibble_3.2.1
## [56] shiny_1.7.4.1
## [57] evaluate_0.21
## [58] BiocFileCache_2.10.1
## [59] xml2_1.3.5
## [60] circlize_0.4.15
## [61] Biostrings_2.70.1
## [62] pillar_1.9.0
## [63] filelock_1.0.2
## [64] MatrixGenerics_1.14.0
## [65] DT_0.28
## [66] foreach_1.5.2
## [67] generics_0.1.3
## [68] RCurl_1.98-1.12
## [69] hms_1.1.3
## [70] xtable_1.8-4
## [71] glue_1.6.2
## [72] tools_4.3.1
## [73] BiocIO_1.12.0
## [74] TxDb.Hsapiens.UCSC.hg38.knownGene_3.17.0
## [75] GenomicAlignments_1.38.0
## [76] XML_3.99-0.14
## [77] grid_4.3.1
## [78] colorspace_2.1-0
## [79] GenomeInfoDbData_1.2.10
## [80] restfulr_0.0.15
## [81] cli_3.6.1
## [82] rappdirs_0.3.3
## [83] fansi_1.0.4
## [84] S4Arrays_1.2.0
## [85] dplyr_1.1.2
## [86] sass_0.4.6
## [87] digest_0.6.33
## [88] SparseArray_1.2.0
## [89] org.Hs.eg.db_3.17.0
## [90] rjson_0.2.21
## [91] htmlwidgets_1.6.2
## [92] memoise_2.0.1
## [93] htmltools_0.5.5
## [94] lifecycle_1.0.3
## [95] httr_1.4.6
## [96] GlobalOptions_0.1.2
## [97] GO.db_3.17.0
## [98] mime_0.12
## [99] bit64_4.0.5