Analyze with online GREAT

Note: On Aug 19 2019 GREAT released version 4 which supports hg38 genome and removes some ontologies such pathways. submitGreatJob() still takes hg19 as default. hg38 can be specified by argument species = "hg38". To use the older versions such as 3.0.0, specify as submitGreatJob(..., version = "3").

GREAT (Genomic Regions Enrichment of Annotations Tool) is a popular web-based tool to associate biological functions to genomic regions. The rGREAT package makes GREAT anlaysis automatic by first constructing a HTTP POST request according to user’s input and retrieving results from GREAT web server afterwards.

Submit the job

Load the package:

library(rGREAT)

The input data is either a GRanges object or a BED-format data frame, no matter it is sorted or not. In following example, we use a GRanges object which is randomly generated.

set.seed(123)
gr = randomRegions(nr = 1000, genome = "hg19")
head(gr)

## GRanges object with 6 ranges and 0 metadata columns:
##       seqnames            ranges strand
##          <Rle>         <IRanges>  <Rle>
##   [1]     chr1   9204434-9208784      *
##   [2]     chr1   9853594-9859363      *
##   [3]     chr1 10862809-10871681      *
##   [4]     chr1 12716970-12723206      *
##   [5]     chr1 13814692-13823250      *
##   [6]     chr1 19243285-19247097      *
##   -------
##   seqinfo: 26 sequences from an unspecified genome; no seqlengths

Submit genomic regions by submitGreatJob().

The returned variable job is a GreatJob class instance which can be used to retrieve results from GREAT server and store results which are already downloaded.

job = submitGreatJob(gr)

You can get the summary of your job by directly printing job.

job

## Submit time: 2023-04-01 03:44:07 
##   Note the results may only be avaiable on GREAT server for 24 hours.
## Version: 4.0.4 
## Species: hg19 
## Inputs: 1000 regions
## Mode: Basal plus extension 
##   Proximal: 5 kb upstream, 1 kb downstream,
##   plus Distal: up to 1000 kb
## Include curated regulatory domains
## 
## Enrichment tables for following ontologies have been downloaded:
##   GO Biological Process
##   GO Cellular Component
##   GO Molecular Function

More parameters can be set for the job:

job = submitGreatJob(gr, species = "mm9") # of course, gr should be from mm9
job = submitGreatJob(gr, adv_upstream = 10, adv_downstream = 2, adv_span = 2000)
job = submitGreatJob(gr, rule = "twoClosest", adv_twoDistance = 2000)
job = submitGreatJob(gr, rule = "oneClosest", adv_oneDistance = 2000)

Also you can choose different versions of GREAT for the analysis.

job = submitGreatJob(gr, version = "3.0")
job = submitGreatJob(gr, version = "2.0")

Note: from rGREAT package 1.99.0, background by bg argument is not supported any more (currently you can still use it, but you will see a warning message), because GREAT requires a special format for gr and bg if both are set, and it uses a different method for the enrichment analysis and returns enrichment tables in a different format. But still, you can use local GREAT to integrate background regions. Seel the rGREAT paper for more details.

Available parameters are (following content is copied from GREAT website):

species: “hg38”, “hg19”, “mm10”, “mm9” are supported in GREAT version 4.x.x, “hg19”, “mm10”, “mm9”, “danRer7” are supported in GREAT version 3.x.x and “hg19”, “hg18”, “mm9”, “danRer7” are supported in GREAT version 2.x.x.
includeCuratedRegDoms: Whether to include curated regulatory domains.
rule: How to associate genomic regions to genes.
- basalPlusExt: mode ‘Basal plus extension’. Gene regulatory domain definition: Each gene is assigned a basal regulatory domain of a minimum distance upstream and downstream of the TSS (regardless of other nearby genes). The gene regulatory domain is extended in both directions to the nearest gene’s basal domain but no more than the maximum extension in one direction.
  - adv_upstream: proximal extension to upstream (unit: kb)
  - adv_downstream: proximal extension to downstream (unit: kb)
  - adv_span: maximum extension (unit: kb)
- twoClosest: mode ‘Two nearest genes’. Gene regulatory domain definition: Each gene is assigned a regulatory domain that extends in both directions to the nearest gene’s TSS but no more than the maximum extension in one direction.
  - adv_twoDistance: maximum extension (unit: kb)
- oneClosest: mode ‘Single nearest gene’. Gene regulatory domain definition: Each gene is assigned a regulatory domain that extends in both directions to the midpoint between the gene’s TSS and the nearest gene’s TSS but no more than the maximum extension in one direction.
  - adv_oneDistance: maximum extension (unit: kb)

GREAT uses the UCSC bed-format where genomic coordinates are 0-based. Many R packages generate genomic regions as 1-based. Thus by default, the start positions of regions are subtracted by 1. If your regions are already 0-based, you can specify gr_is_zero_based = TRUE in submitGreatJob(). Anyway in most cases, this will only slightly affect the enrichment results.

Get enrichment tables

With job, we can now retrieve results from GREAT. The first and the primary results are the tables which contain enrichment statistics for the analysis. By default it will retrieve results from three GO Ontologies. All tables contains statistics for all terms no matter they are significant or not. Users can then make filtering with self-defined cutoff.

There is a column for adjusted p-values by “BH” method. Other p-value adjustment methods can be applied by p.adjust().

The returned value of getEnrichmentTables() is a list of data frames in which each one corresponds to the table for a single ontology. The structure of data frames are same as the tables on GREAT website.

tbl = getEnrichmentTables(job)

## The default enrichment table does not contain informatin of associated
## genes for each input region. You can set `download_by = 'tsv'` to
## download the complete table, but note only the top 500 regions can be
## retreived. See the following link:
## 
## https://great-help.atlassian.net/wiki/spaces/GREAT/pages/655401/Export#Export-GlobalExport
## 
## Except the additional gene-region association column if taking 'tsv' as
## the source of result, all other columns are the same if you choose
## 'json' (the default) as the source. Or you can try the local GREAT
## analysis with the function `great()`.

names(tbl)

## [1] "GO Molecular Function" "GO Biological Process" "GO Cellular Component"

tbl[[1]][1:2, ]

##           ID                                                           name
## 1 GO:0070696 transmembrane receptor protein serine/threonine kinase binding
## 2 GO:0033612                       receptor serine/threonine kinase binding
##   Binom_Genome_Fraction Binom_Expected Binom_Observed_Region_Hits
## 1           0.003455733       3.455733                         11
## 2           0.003596413       3.596413                         11
##   Binom_Fold_Enrichment Binom_Region_Set_Coverage Binom_Raw_PValue
## 1              3.183116                     0.011     0.0008981541
## 2              3.058603                     0.011     0.0012301660
##   Binom_Adjp_BH Hyper_Total_Genes Hyper_Expected Hyper_Observed_Gene_Hits
## 1             1                13      0.9938002                        5
## 2             1                15      1.1466930                        5
##   Hyper_Fold_Enrichment Hyper_Gene_Set_Coverage Hyper_Term_Gene_Coverage
## 1              5.031192             0.003526093                0.3846154
## 2              4.360367             0.003526093                0.3333333
##   Hyper_Raw_PValue Hyper_Adjp_BH
## 1      0.001982437     0.4919942
## 2      0.004066220     0.6862153

Information stored in job will be updated after retrieving enrichment tables.

job

## Submit time: 2023-04-01 03:44:07 
##   Note the results may only be avaiable on GREAT server for 24 hours.
## Version: 4.0.4 
## Species: hg19 
## Inputs: 1000 regions
## Mode: Basal plus extension 
##   Proximal: 5 kb upstream, 1 kb downstream,
##   plus Distal: up to 1000 kb
## Include curated regulatory domains
## 
## Enrichment tables for following ontologies have been downloaded:
##   GO Biological Process
##   GO Cellular Component
##   GO Molecular Function

You can get results by either specifying the ontologies or by the pre-defined categories (categories already contains pre-defined sets of ontologies):

tbl = getEnrichmentTables(job, ontology = c("GO Molecular Function", "Human Phenotype"))
tbl = getEnrichmentTables(job, category = c("GO"))

As you have seen in the previous messages and results, The enrichment tables contain no associated genes. However, you can set download_by = 'tsv' in getEnrichmentTables() to download the complete tables, but due to the restriction from GREAT web server, only the top 500 regions can be retreived (check the last two columns of tbl2[["GO Molecular Function"]] in the following example).

tbl2 = getEnrichmentTables(job, download_by = "tsv")

All available ontology names for a given species can be get by availableOntologies() and all available ontology categories can be get by availableCategories(). Here you do not need to provide species information because job already contains it.

availableOntologies(job)

## [1] "GO Molecular Function"     "GO Biological Process"    
## [3] "GO Cellular Component"     "Mouse Phenotype"          
## [5] "Mouse Phenotype Single KO" "Human Phenotype"          
## [7] "Ensembl Genes"

availableCategories(job)

## [1] "GO"        "Phenotype" "Genes"

availableOntologies(job, category = "GO")

## [1] "GO Molecular Function" "GO Biological Process" "GO Cellular Component"

Make volcano plot

In differential gene expression analysis, volcano plot is used to visualize relations between log2 fold change and (adjusted) p-values. Similarly, we can also use volcano plot to visualize relations between fold enrichment and (adjusted) p-values for the enrichment analysis. The plot is made by the function plotVolcano():

plotVolcano(job, ontology = "GO Biological Process")

As the enrichment analysis basically only looks for over-representations, it is actually half volcano.

Get region-gene associations

Association between genomic regions and genes can be plotted by plotRegionGeneAssociations(). The function will make the three plots which are same as on GREAT website.

plotRegionGeneAssociations(job)

getRegionGeneAssociations() returns a GRanges object which contains the gene-region associations. Note the column dist_to_TSS is based on the middle points of the input regions to TSS.

getRegionGeneAssociations(job)

## GRanges object with 885 ranges and 2 metadata columns:
##         seqnames            ranges strand | annotated_genes    dist_to_TSS
##            <Rle>         <IRanges>  <Rle> | <CharacterList>  <IntegerList>
##     [1]     chr1   9204434-9208784      * |     H6PD,GPR157  -88225,-17380
##     [2]     chr1   9853594-9859363      * |   CLSTN1,PIK3CD   28106,144675
##     [3]     chr1 10862809-10871681      * |    TARDBP,CASZ1 -205454,-10540
##     [4]     chr1 12716970-12723206      * | AADACL3,AADACL4   -56058,15522
##     [5]     chr1 13814692-13823250      * | LRRC38,PRAMEF20    21572,82064
##     ...      ...               ...    ... .             ...            ...
##   [881]     chrY 16384799-16389852      * |    NLGN4Y,VCY1B -248301,219228
##   [882]     chrY 16842593-16846893      * |          NLGN4Y         209117
##   [883]     chrY 16998750-17003225      * |          NLGN4Y         365361
##   [884]     chrY 24401612-24406252      * |   RBMY1J,RBMY1F -145685,-74837
##   [885]     chrY 26206997-26215445      * |     BPY2B,CDY1B -552930,-17105
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

Please note the two meta columns are in formats of CharacterList and IntegerList because a region may associate to multiple genes.

You can also choose only plotting one of the three figures.

plotRegionGeneAssociations(job, which_plot = 1)

By specifying ontology and term ID, you can get the associations in a certain term. Here the term ID is from the first column of the data frame from getEnrichmentTables().

plotRegionGeneAssociations(job, ontology = "GO Molecular Function",
    term_id = "GO:0004984")

getRegionGeneAssociations(job, ontology = "GO Molecular Function",
    term_id = "GO:0004984")

## GRanges object with 12 ranges and 2 metadata columns:
##        seqnames              ranges strand | annotated_genes   dist_to_TSS
##           <Rle>           <IRanges>  <Rle> | <CharacterList> <IntegerList>
##    [1]     chr1 247871555-247874068      * |    OR13G1,OR6F1   -36446,3294
##    [2]     chr5 180577320-180586838      * |           OR2V2           136
##    [3]    chr11     4669904-4675668      * |   OR51E1,OR51E2    8136,46286
##    [4]    chr11   50163533-50171101      * |          OR4C12       -163246
##    [5]    chr11   51961271-51969107      * |          OR4C46        449907
##    ...      ...                 ...    ... .             ...           ...
##    [8]    chr12   48643533-48645814      * |         OR10AD1        -47503
##    [9]    chr14   18596966-18604599      * |         OR11H12       -776740
##   [10]    chr14   20760309-20769811      * |          OR11H4         54162
##   [11]    chr14   22807516-22814493      * |           OR4E2        677707
##   [12]    chr22   16218127-16222924      * |          OR11H1        229280
##   -------
##   seqinfo: 6 sequences from an unspecified genome; no seqlengths

The Shiny application

shinyReport() creates a Shiny application to view the complete results:

shinyReport(job)

Session info

sessionInfo()

## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
##  [2] GenomicFeatures_1.52.0                 
##  [3] AnnotationDbi_1.62.0                   
##  [4] Biobase_2.60.0                         
##  [5] rGREAT_2.2.0                           
##  [6] GenomicRanges_1.52.0                   
##  [7] GenomeInfoDb_1.36.0                    
##  [8] IRanges_2.34.0                         
##  [9] S4Vectors_0.38.0                       
## [10] BiocGenerics_0.46.0                    
## [11] knitr_1.42                             
## 
## loaded via a namespace (and not attached):
##  [1] DBI_1.1.3                               
##  [2] bitops_1.0-7                            
##  [3] biomaRt_2.56.0                          
##  [4] rlang_1.1.0                             
##  [5] magrittr_2.0.3                          
##  [6] GetoptLong_1.0.5                        
##  [7] matrixStats_0.63.0                      
##  [8] compiler_4.3.0                          
##  [9] RSQLite_2.3.1                           
## [10] png_0.1-8                               
## [11] vctrs_0.6.2                             
## [12] stringr_1.5.0                           
## [13] pkgconfig_2.0.3                         
## [14] shape_1.4.6                             
## [15] crayon_1.5.2                            
## [16] fastmap_1.1.1                           
## [17] dbplyr_2.3.2                            
## [18] XVector_0.40.0                          
## [19] ellipsis_0.3.2                          
## [20] utf8_1.2.3                              
## [21] Rsamtools_2.16.0                        
## [22] promises_1.2.0.1                        
## [23] rmarkdown_2.21                          
## [24] bit_4.0.5                               
## [25] xfun_0.39                               
## [26] zlibbioc_1.46.0                         
## [27] cachem_1.0.7                            
## [28] jsonlite_1.8.4                          
## [29] progress_1.2.2                          
## [30] blob_1.2.4                              
## [31] highr_0.10                              
## [32] later_1.3.0                             
## [33] DelayedArray_0.26.0                     
## [34] BiocParallel_1.34.0                     
## [35] parallel_4.3.0                          
## [36] prettyunits_1.1.1                       
## [37] R6_2.5.1                                
## [38] bslib_0.4.2                             
## [39] stringi_1.7.12                          
## [40] RColorBrewer_1.1-3                      
## [41] rtracklayer_1.60.0                      
## [42] jquerylib_0.1.4                         
## [43] Rcpp_1.0.10                             
## [44] SummarizedExperiment_1.30.0             
## [45] iterators_1.0.14                        
## [46] httpuv_1.6.9                            
## [47] Matrix_1.5-4                            
## [48] tidyselect_1.2.0                        
## [49] yaml_2.3.7                              
## [50] doParallel_1.0.17                       
## [51] codetools_0.2-19                        
## [52] curl_5.0.0                              
## [53] lattice_0.21-8                          
## [54] tibble_3.2.1                            
## [55] shiny_1.7.4                             
## [56] KEGGREST_1.40.0                         
## [57] evaluate_0.20                           
## [58] BiocFileCache_2.8.0                     
## [59] xml2_1.3.3                              
## [60] circlize_0.4.15                         
## [61] Biostrings_2.68.0                       
## [62] pillar_1.9.0                            
## [63] filelock_1.0.2                          
## [64] MatrixGenerics_1.12.0                   
## [65] DT_0.27                                 
## [66] foreach_1.5.2                           
## [67] generics_0.1.3                          
## [68] RCurl_1.98-1.12                         
## [69] hms_1.1.3                               
## [70] xtable_1.8-4                            
## [71] glue_1.6.2                              
## [72] tools_4.3.0                             
## [73] BiocIO_1.10.0                           
## [74] TxDb.Hsapiens.UCSC.hg38.knownGene_3.17.0
## [75] GenomicAlignments_1.36.0                
## [76] XML_3.99-0.14                           
## [77] grid_4.3.0                              
## [78] colorspace_2.1-0                        
## [79] GenomeInfoDbData_1.2.10                 
## [80] restfulr_0.0.15                         
## [81] cli_3.6.1                               
## [82] rappdirs_0.3.3                          
## [83] fansi_1.0.4                             
## [84] dplyr_1.1.2                             
## [85] sass_0.4.5                              
## [86] digest_0.6.31                           
## [87] org.Hs.eg.db_3.17.0                     
## [88] rjson_0.2.21                            
## [89] htmlwidgets_1.6.2                       
## [90] memoise_2.0.1                           
## [91] htmltools_0.5.5                         
## [92] lifecycle_1.0.3                         
## [93] httr_1.4.5                              
## [94] GlobalOptions_0.1.2                     
## [95] GO.db_3.17.0                            
## [96] mime_0.12                               
## [97] bit64_4.0.5