rGREAT has integrated Gene Ontology gene sets for all supported organisms and MSigDB gene sets for human. It is very easy to apply GREAT analysis on other geneset databases such as KEGG. THe basical use has been explained in the section “Manually set gene sets” in the vignette “Analyze with local GREAT”. In this vignette, we will just show you several examples.
KEGG pathway database is widely used for gene set enrichment analysis. To apply GREAT anlaysis with KEGG pathways, we need to construct the corresponding gene sets.
KEGG provides its data via a REST API (https://rest.kegg.jp/). There are several commands that can be used to retrieve specific types of data. Here we use the package KEGGREST to obtain data directly from KEGG. To get a list of genes involved in pathways, we use the “link” command which is implemented in the function keggLink()
.
library(KEGGREST)
pathway2gene = keggLink("pathway", "hsa")
head(pathway2gene)
## hsa:10327 hsa:124 hsa:125 hsa:126 hsa:127
## "path:hsa00010" "path:hsa00010" "path:hsa00010" "path:hsa00010" "path:hsa00010"
## hsa:128
## "path:hsa00010"
The returned object pathway2gene
is a named vector, where the names are gene Entrez IDs (with an additional "hsa:"
prefix) and values are pathway IDs (with an additional "path:"
prefix). You can try to execute keggLink("hsa","pathway")
to compare the results. The named vectors are not common for downstream gene set analysis. Let’s convert it to a list of genes, also remove the prefix.
kegg_pathways = split(gsub("hsa:", "", names(pathway2gene)),
gsub("path:", "", pathway2gene))
Now simply send kegg_pathways
to great()
to perform GREAT analysis.
library(rGREAT)
gr = randomRegions(genome = "hg19")
great(gr, kegg_pathways, "hg19")
## 908 regions are associated to 1377 genes' extended TSSs.
## TSS source: TxDb.Hsapiens.UCSC.hg19.knownGene
## Genome: hg19
## OrgDb: org.Hs.eg.db
## Gene sets: self-provided
## Background: whole genome excluding gaps
## Mode: Basal plus extension
## Proximal: 5000 bp upstream, 1000 bp downstream,
## plus Distal: up to 1000000 bp
Here we only include the KEGG pathway IDs. The full name of pathways can be obtained by the following command:
pn = keggList("pathway/hsa")
head(pn)
## hsa01100
## "Metabolic pathways - Homo sapiens (human)"
## hsa01200
## "Carbon metabolism - Homo sapiens (human)"
## hsa01210
## "2-Oxocarboxylic acid metabolism - Homo sapiens (human)"
## hsa01212
## "Fatty acid metabolism - Homo sapiens (human)"
## hsa01230
## "Biosynthesis of amino acids - Homo sapiens (human)"
## hsa01232
## "Nucleotide metabolism - Homo sapiens (human)"
Please go the the documentation of KEGGREST or the main website of KEGG API (https://rest.kegg.jp/) to find out commands for retrieving various datasets there.
To make it easier to use, there is already a helper function getKEGGPathways()
function which directly downloads data from kEGG and format it as a list or a data frame.
lt = getKEGGPathways("hsa")
lt[1:2]
## $`hsa00010: Glycolysis / Gluconeogenesis`
## [1] "10327" "124" "125" "126" "127" "128" "130" "130589"
## [9] "131" "160287" "1737" "1738" "2023" "2026" "2027" "217"
## [17] "218" "219" "2203" "221" "222" "223" "224" "226"
## [25] "229" "230" "2538" "2597" "26330" "2645" "2821" "3098"
## [33] "3099" "3101" "387712" "3939" "3945" "3948" "441531" "501"
## [41] "5105" "5106" "5160" "5161" "5162" "5211" "5213" "5214"
## [49] "5223" "5224" "5230" "5232" "5236" "5313" "5315" "55276"
## [57] "55902" "57818" "669" "7167" "80201" "83440" "84532" "8789"
## [65] "92483" "92579" "9562"
##
## $`hsa00020: Citrate cycle (TCA cycle)`
## [1] "1431" "1737" "1738" "1743" "2271" "3417" "3418" "3419" "3420"
## [10] "3421" "4190" "4191" "47" "48" "4967" "50" "5091" "5105"
## [19] "5106" "5160" "5161" "5162" "55753" "6389" "6390" "6391" "6392"
## [28] "8801" "8802" "8803"
Please note, KEGG pathways are only free for academic users, https://www.pathway.jp/en/licensing.html.
Reactome is another popular pathway database for gene set enrichment analysis. In Bioconductor, there is a package reactome.db that contains genes associated to every pathway in Reactome.
To get the Reactome pathway gene sets, simply use the object reactomePATHID2EXTID
.
library(reactome.db)
gs = as.list(reactomePATHID2EXTID)
great(gr, gs, "hg19")
## 908 regions are associated to 1377 genes' extended TSSs.
## TSS source: TxDb.Hsapiens.UCSC.hg19.knownGene
## Genome: hg19
## OrgDb: org.Hs.eg.db
## Gene sets: self-provided
## Background: whole genome excluding gaps
## Mode: Basal plus extension
## Proximal: 5000 bp upstream, 1000 bp downstream,
## plus Distal: up to 1000000 bp
The full pathway names are in the object reactomePATHID2NAME
. You can convert the object to a normal list by as.list(reactomePATHID2NAME)
.
UniProt database provides a list of controlled vocabulary represented as keywords for genes or proteins (https://www.uniprot.org/keywords/). This is useful for summarizing gene functions in a compact way. The UniProtKeywords package contains gene sets associated to keywords.
UniProtKeywords support several organisms. Here we use human as an example. In the function load_keyword_genesets()
, you need to specify the taxon ID for the organism.
library(UniProtKeywords)
gs = load_keyword_genesets(9606)
great(gr, gs, "hg19")
Please refer to the section “Manually set gene sets” in the vignette “Analyze with local GREAT”. There introduces a function read_gmt()
which can import a .gmt
file into a list of genes.
The tool Enrichr provides a great resource of gene sets in .gmt
format. See https://maayanlab.cloud/Enrichr/#libraries.
sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] reactome.db_1.84.0
## [2] KEGGREST_1.40.0
## [3] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [4] GenomicFeatures_1.52.0
## [5] AnnotationDbi_1.62.0
## [6] Biobase_2.60.0
## [7] rGREAT_2.2.0
## [8] GenomicRanges_1.52.0
## [9] GenomeInfoDb_1.36.0
## [10] IRanges_2.34.0
## [11] S4Vectors_0.38.0
## [12] BiocGenerics_0.46.0
## [13] knitr_1.42
##
## loaded via a namespace (and not attached):
## [1] DBI_1.1.3
## [2] bitops_1.0-7
## [3] biomaRt_2.56.0
## [4] rlang_1.1.0
## [5] magrittr_2.0.3
## [6] GetoptLong_1.0.5
## [7] matrixStats_0.63.0
## [8] compiler_4.3.0
## [9] RSQLite_2.3.1
## [10] png_0.1-8
## [11] vctrs_0.6.2
## [12] stringr_1.5.0
## [13] pkgconfig_2.0.3
## [14] shape_1.4.6
## [15] crayon_1.5.2
## [16] fastmap_1.1.1
## [17] dbplyr_2.3.2
## [18] XVector_0.40.0
## [19] ellipsis_0.3.2
## [20] utf8_1.2.3
## [21] Rsamtools_2.16.0
## [22] promises_1.2.0.1
## [23] rmarkdown_2.21
## [24] bit_4.0.5
## [25] xfun_0.39
## [26] zlibbioc_1.46.0
## [27] cachem_1.0.7
## [28] jsonlite_1.8.4
## [29] progress_1.2.2
## [30] blob_1.2.4
## [31] highr_0.10
## [32] later_1.3.0
## [33] DelayedArray_0.26.0
## [34] BiocParallel_1.34.0
## [35] parallel_4.3.0
## [36] prettyunits_1.1.1
## [37] R6_2.5.1
## [38] bslib_0.4.2
## [39] stringi_1.7.12
## [40] RColorBrewer_1.1-3
## [41] rtracklayer_1.60.0
## [42] jquerylib_0.1.4
## [43] Rcpp_1.0.10
## [44] SummarizedExperiment_1.30.0
## [45] iterators_1.0.14
## [46] httpuv_1.6.9
## [47] Matrix_1.5-4
## [48] tidyselect_1.2.0
## [49] yaml_2.3.7
## [50] doParallel_1.0.17
## [51] codetools_0.2-19
## [52] curl_5.0.0
## [53] lattice_0.21-8
## [54] tibble_3.2.1
## [55] shiny_1.7.4
## [56] evaluate_0.20
## [57] BiocFileCache_2.8.0
## [58] xml2_1.3.3
## [59] circlize_0.4.15
## [60] Biostrings_2.68.0
## [61] pillar_1.9.0
## [62] filelock_1.0.2
## [63] MatrixGenerics_1.12.0
## [64] DT_0.27
## [65] foreach_1.5.2
## [66] generics_0.1.3
## [67] RCurl_1.98-1.12
## [68] hms_1.1.3
## [69] xtable_1.8-4
## [70] glue_1.6.2
## [71] tools_4.3.0
## [72] BiocIO_1.10.0
## [73] TxDb.Hsapiens.UCSC.hg38.knownGene_3.17.0
## [74] GenomicAlignments_1.36.0
## [75] XML_3.99-0.14
## [76] grid_4.3.0
## [77] colorspace_2.1-0
## [78] GenomeInfoDbData_1.2.10
## [79] restfulr_0.0.15
## [80] cli_3.6.1
## [81] rappdirs_0.3.3
## [82] fansi_1.0.4
## [83] dplyr_1.1.2
## [84] sass_0.4.5
## [85] digest_0.6.31
## [86] org.Hs.eg.db_3.17.0
## [87] rjson_0.2.21
## [88] htmlwidgets_1.6.2
## [89] memoise_2.0.1
## [90] htmltools_0.5.5
## [91] lifecycle_1.0.3
## [92] httr_1.4.5
## [93] GlobalOptions_0.1.2
## [94] GO.db_3.17.0
## [95] mime_0.12
## [96] bit64_4.0.5