Analyze with GREAT

Author: Zuguang Gu ( z.gu@dkfz.de )

Date: 2015-04-16

GREAT (Genomic Regions Enrichment of Annotations Tool, http://great.stanford.edu) is a popular web-based tool to associate biological functions to genomic regions. The rGREAT package makes GREAT anlaysis automatic by first constructing a HTTP POST request according to user's input and retrieving results from GREAT web server afterwards.

Load the package:

library(rGREAT)

The input data is either a GRanges object or a BED-format data frame, no matter it is sorted or not. In following example, we use a data frame which is randomly generated.

set.seed(123)
bed = circlize::generateRandomBed(nr = 1000, nc = 0)
head(bed)

##    chr    start      end
## 1 chr1   155726  2608935
## 2 chr1  6134977 10483365
## 3 chr1 11354986 11423447
## 4 chr1 15134641 19115321
## 5 chr1 22692774 23328609
## 6 chr1 23639094 25639077

Submit genomic regions by submitGreatJob(). Before submitting, genomic regions will be sorted and overlapping regions will be merged.

The returned variable job is a GreatJob class instance which can be used to retrieve results from GREAT server and stored results which are already downloaded.

job = submitGreatJob(bed)

You can get the summary of your job by directly calling job variable.

job

## Submit time: 2015-04-16 22:32:32 
## Species: hg19 
## Background: wholeGenome 
## Model: Basal plus extension 
##   Proximal: 5 kb upstream, 1 kb downstream,
##   plus Distal: up to 1000 kb
## Include curated regulatory domains
## 
## Enrichment tables for following ontologies have been downloaded:
##   None

More parameters can be set for the job:

job = submitGreatJob(bed, species = "mm9")
job = submitGreatJob(bed, bg, species = "mm9", bgChoise = "data")
job = submitGreatJob(bed, adv_upstream = 10, adv_downstream = 2, adv_span = 2000)
job = submitGreatJob(bed, rule = "twoClosest", adv_twoDistance = 2000)
job = submitGreatJob(bed, rule = "oneClosest", adv_oneDistance = 2000)

Available parameters are (following content is copied from GREAT website):

species: hg19 and hg18 for human, mm9 for mouse and danRer7 for zebrafish
bgChoise: Background regions. wholeGenome and data. If this value is set to data, bg argument should be specified
includeCuratedRegDoms: Whether to include curated regulatory domains.
rule: How to associate genomic regions to genes.
- basalPlusExt: mode 'Basal plus extension'. Gene regulatory domain definition: Each gene is assigned a basal regulatory domain of a minimum distance upstream and downstream of the TSS (regardless of other nearby genes). The gene regulatory domain is extended in both directions to the nearest gene's basal domain but no more than the maximum extension in one direction.
  - adv_upstream: proximal extension to upstream (unit: kb)
  - adv_downstream: proximal extension to downstream (unit: kb)
  - adv_span: maximum extension (unit: kb)
- twoClosest: mode 'Two nearest genes'. Gene regulatory domain definition: Each gene is assigned a regulatory domain that extends in both directions to the nearest gene's TSS but no more than the maximum extension in one direction.
  - adv_twoDistance: maximum extension (unit: kb)
- oneClosest: mode 'Single nearest gene'. Gene regulatory domain definition: Each gene is assigned a regulatory domain that extends in both directions to the midpoint between the gene's TSS and the nearest gene's TSS but no more than the maximum extension in one direction.
  - adv_oneDistance: maximum extension (unit: kb)

With job, we can now retrieve results from GREAT. The first and the primary results are the tables which contain enrichment statistics for the analysis. By default it will retrieve results from three GO Ontologies and all pathway ontologies. All tables contains statistics for all terms no matter they are significant or not. Users can then make filtering yb self-defined cutoff.

One thing that users should note is that there is no column for adjusted p-values. But it is can be easily done by using p.adjust().

The returned value of getEnrichmentTables() is a list of data frames in which each one corresponds to tables for single ontology. The structure of data frames are same as the tables on GREAT website.

tb = getEnrichmentTables(job)
names(tb)

## [1] "GO Molecular Function" "GO Biological Process" "GO Cellular Component"

head(tb[[1]])

##           ID
## 1 GO:0070002
## 2 GO:0004731
## 3 GO:0016841
## 4 GO:0016901
## 5 GO:0008111
## 6 GO:0003796
##                                                                                                    name
## 1                                                                      glutamic-type peptidase activity
## 2                                                              purine-nucleoside phosphorylase activity
## 3                                                                                ammonia-lyase activity
## 4 oxidoreductase activity, acting on the CH-OH group of donors, quinone or similar compound as acceptor
## 5                                                                alpha-methylacyl-CoA racemase activity
## 6                                                                                     lysozyme activity
##   Binom_Genome_Fraction Binom_Expected Binom_Observed_Region_Hits Binom_Fold_Enrichment
## 1          8.106316e-06    0.008146847                          1            122.746900
## 2          1.435477e-05    0.014426540                          1             69.316690
## 3          1.864956e-04    0.187428100                          2             10.670760
## 4          9.768315e-04    0.981715700                          4              4.074500
## 5          1.855838e-05    0.018651170                          1             53.615940
## 6          9.926527e-04    0.997616000                          4              4.009559
##   Binom_Region_Set_Coverage Binom_Raw_PValue Hyper_Total_Genes Hyper_Expected
## 1              0.0009950249      0.008113784                 1     0.08785544
## 2              0.0009950249      0.014323080                 1     0.08785544
## 3              0.0019900500      0.015504970                 5     0.43927720
## 4              0.0039801000      0.017829270                 5     0.43927720
## 5              0.0009950249      0.018478480                 1     0.08785544
## 6              0.0039801000      0.018781680                11     0.96640980
##   Hyper_Observed_Gene_Hits Hyper_Fold_Enrichment Hyper_Gene_Set_Coverage Hyper_Term_Gene_Coverage
## 1                        1             11.382330            0.0006309148                1.0000000
## 2                        1             11.382330            0.0006309148                1.0000000
## 3                        2              4.552934            0.0012618300                0.4000000
## 4                        3              6.829401            0.0018927440                0.6000000
## 5                        1             11.382330            0.0006309148                1.0000000
## 6                        4              4.139031            0.0025236590                0.3636364
##   Hyper_Raw_PValue
## 1      0.087855440
## 2      0.087855440
## 3      0.064472140
## 4      0.005910147
## 5      0.087855440
## 6      0.011837350

Information stored in job will be updated after retrieving enrichment tables.

job

## Submit time: 2015-04-16 22:32:32 
## Species: hg19 
## Background: wholeGenome 
## Model: Basal plus extension 
##   Proximal: 5 kb upstream, 1 kb downstream,
##   plus Distal: up to 1000 kb
## Include curated regulatory domains
## 
## Enrichment tables for following ontologies have been downloaded:
##   GO Biological Process
##   GO Cellular Component
##   GO Molecular Function

You can get results by either specifying the ontologies or by the pre-defined categories (categories already contains pre-defined sets of ontologies):

tb = getEnrichmentTables(job, ontology = c("GO Molecular Function", "BioCyc Pathway"))
tb = getEnrichmentTables(job, category = c("GO"))

All available ontology names for given species can be get by availableOntologies() and all available ontology categories can be get by availableCategories(). Here you do not need to provide species information because job already contains it.

availableOntologies(job)

##  [1] "GO Molecular Function"            "GO Biological Process"           
##  [3] "GO Cellular Component"            "Mouse Phenotype"                 
##  [5] "Human Phenotype"                  "Disease Ontology"                
##  [7] "MSigDB Oncogenic Signatures"      "MSigDB Immunologic Signatures"   
##  [9] "MSigDB Cancer Neighborhood"       "Placenta Disorders"              
## [11] "PANTHER Pathway"                  "BioCyc Pathway"                  
## [13] "MSigDB Pathway"                   "MGI Expression: Detected"        
## [15] "MSigDB Perturbation"              "MSigDB Predicted Promoter Motifs"
## [17] "MSigDB miRNA Motifs"              "InterPro"                        
## [19] "TreeFam"                          "HGNC Gene Families"

availableCategories(job)

## [1] "GO"                               "Phenotype Data and Human Disease"
## [3] "Pathway Data"                     "Gene Expression"                 
## [5] "Regulatory Motifs"                "Gene Families"

availableOntologies(job, category = "Pathway Data")

## [1] "PANTHER Pathway" "BioCyc Pathway"  "MSigDB Pathway"

Association between genomic regions and genes can be get by plotRegionGeneAssociationGraphs(). The function will make the three plots which are same as on GREAT website and returns a GRanges object which contains the gene-region associations.

par(mfrow = c(1, 3))
res = plotRegionGeneAssociationGraphs(job)

plot of chunk unnamed-chunk-11

res

## GRanges object with 1708 ranges and 2 metadata columns:
##          seqnames               ranges strand   |        gene   distTSS
##             <Rle>            <IRanges>  <Rle>   | <character> <numeric>
##      [1]     chr1 [  155726,  2608935]      *   |      ATAD3C     -2738
##      [2]     chr1 [ 6134977, 10483365]      *   |      ERRFI1   -222803
##      [3]     chr1 [ 6134977, 10483365]      *   |     SLC45A1    -68715
##      [4]     chr1 [11354986, 11423447]      *   |      PTCHD2   -150078
##      [5]     chr1 [11354986, 11423447]      *   |      UBIAD1     55954
##      ...      ...                  ...    ... ...         ...       ...
##   [1704]     chrY [36594469, 36986560]      *   |        <NA>      <NA>
##   [1705]     chrY [38466369, 40094584]      *   |        <NA>      <NA>
##   [1706]     chrY [44554177, 44654068]      *   |        <NA>      <NA>
##   [1707]     chrY [46572517, 50931029]      *   |        <NA>      <NA>
##   [1708]     chrY [53045400, 56467562]      *   |        <NA>      <NA>
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

For those regions that are not associated with any genes under current settings, the corresponding gene and distTSS columns will be NA.

You can also choose only plotting one of the three figures.

plotRegionGeneAssociationGraphs(job, type = 1)

By specifying ontology and term ID, you can get the association in a certain term. Here the term ID is from the first column of the data frame which is returned by getEnrichmentTables().

par(mfrow = c(1, 3))
res = plotRegionGeneAssociationGraphs(job, ontology = "GO Molecular Function",
    termID = "GO:0004984")

plot of chunk unnamed-chunk-13

res

## GRanges object with 9 ranges and 2 metadata columns:
##       seqnames               ranges strand |        gene   distTSS
##          <Rle>            <IRanges>  <Rle> | <character> <numeric>
##   [1]     chr3 [97539225, 98605059]      * |       OR5K4      -556
##   [2]    chr11 [ 5451667,  5687557]      * |      OR52H1     -2833
##   [3]    chr11 [52004320, 52314790]      * |      OR4C46    644273
##   [4]    chr11 [56236606, 59435519]      * |       OR6Q1     37661
##   [5]    chr11 [56236606, 59435519]      * |       OR9I1     50853
##   [6]    chr15 [22650187, 22840775]      * |       OR4N4    363099
##   [7]    chr16 [ 2676812,  4141377]      * |       OR2C1      3206
##   [8]    chr17 [ 2595702,  3018690]      * |       OR1D5    159705
##   [9]    chr22 [15166009, 17201109]      * |      OR11H1    266246
##   -------
##   seqinfo: 6 sequences from an unspecified genome; no seqlengths