1 Introduction

InterMine constitutes a powerful open source data warehouse system which integrates diverse biological data sets and provides a growing number of analysis tools, including enrichment analysis widgets (R. N. Smith et al. 2012; A. Kalderimis et al. 2014).

Specifically, the gene set enrichment tool looks for annotations to a set of genes that occur more than would be expected by chance, given a background population of annotations. The hypergeometric distribution is used to calculate a P-value for each annotation and a choice of correction methods for multiple testing (Bonferonni, Holm-Bonferonni and Benjamini Hochberg) are available (R. N. Smith et al. 2012; A. Kalderimis et al. 2014).

InterMine provides Gene Ontology enrichment statistics as well as enrichment statistics for other annotation types including protein domains, pathways, human diseases, mammalian phenotypes and publications. The default background population is all genes in the genome with the specified annotation type. However, the background population can be changed by specifying another list. More information can be found in the online documentation.

2 Perform enrichment analysis with InterMineR

Performing enrichment analysis with InterMineR is preceded by two steps:

Retrieve the list of bioentities of interest (Genes, Proteins, SNPs, etc.) for which the analysis will be performed.
Get the enrichment widget name which indicates the annotation types that you want to investigate for enrichment (e.g. Gene Ontology Terms, protein domains, KEGG and Reactome Pathways, human diseases, mammalian phenotypes and publications).

2.1 Retrieve list containing the genes of interest from InterMineR

In this example, enrichment analysis is performed using a list of genes which are associated with all forms of Diabetes according to OMIM.

The PL_DiabetesGenes dataset is included in the InterMineR package and can also be found online in HumanMine.

library(InterMineR)

Retrieve PL_DiabetesGenes data and get gene identifiers as HumanrMine Gene.symbol and Gene.primaryIdentifier values (ENTREZ IDs).

# get HumanMine instance
im.human = initInterMine(listMines()["HumanMine"])

# retrieve data model for HumanMine
hsa_model = getModel(im.human)

# all targets of HumanMine enrichment widgets possess InterMine ids
subset(hsa_model, type %in% c("Gene", "Protein", "SNP") & child_name == "id")

##         type child_name child_type
## 602     Gene         id           
## 1290 Protein         id           
## 1535     SNP         id

# load human genes which are associated with Diabetes (retrieved from HumanMine)
data("PL_DiabetesGenes")

head(PL_DiabetesGenes, 3)

##   Gene.symbol                                 Gene.name
## 1       ABCC8 ATP binding cassette subfamily C member 8
## 2         ACE           angiotensin I converting enzyme
## 3        AKT2             AKT serine/threonine kinase 2
##   Gene.primaryIdentifier Gene.secondaryIdentifier Gene.length
## 1                   6833          ENSG00000006071       83961
## 2                   1636          ENSG00000159640       21320
## 3                    208          ENSG00000105221       55215
##   Gene.organism.name
## 1       Homo sapiens
## 2       Homo sapiens
## 3       Homo sapiens

# get Gene.symbol
hsa_gene_names = as.character(PL_DiabetesGenes$Gene.symbol)

head(hsa_gene_names, 3)

## [1] "ABCC8" "ACE"   "AKT2"

# get Gene.primaryIdentifiers (ENTREZ Gene identifiers)
hsa_gene_entrez = as.character(PL_DiabetesGenes$Gene.primaryIdentifier)

head(hsa_gene_entrez, 3)

## [1] "6833" "1636" "208"

2.2 Get enrichment widgets

After obtaining the Gene.id values of interest, we must determine the type of annotation for the enrichment analysis.

To retrieve all available widgets for a Mine we can use the getWidgets function.

# get all HumanMine widgets
human.widgets = as.data.frame(getWidgets(im.human))

human.widgets

##    startClass                           title widgetType
## 1     Protein       Protein Domain Enrichment enrichment
## 2        Gene          Publication Enrichment enrichment
## 3        Gene         Chromosome Distribution      chart
## 4        Gene        Gene Ontology Enrichment enrichment
## 5        Gene              Pathway Enrichment enrichment
## 6         SNP Publication Enrichment for SNPs enrichment
## 7        <NA>                    Interactions      table
## 8         SNP  GWAS study Enrichment for SNPs enrichment
## 9         SNP         Chromosome Distribution      chart
## 10       Gene       Protein Domain Enrichment enrichment
##                                                                                                                                                                                                                       description
## 1                                                                                                                                                                                Protein Domains enriched for items in this list.
## 2                                                                                                                                                                                   Publications enriched for genes in this list.
## 3  Actual: number of items in this list found on each chromosome.  Expected: given the total number of items on the chromosome and the number of items in this list, the number of items expected to be found on each chromosome.
## 4                                                                                                                                                                                       GO terms enriched for items in this list.
## 5                                                                                                                                                          Pathways enriched for genes in this list - data from KEGG and Reactome
## 6                                                                                                                                                                                    Publications enriched for SNPs in this list.
## 7                                                                Genes (from the list or not) that interact with genes in this list.  Counts may include the same interaction more than once if observed in multiple experiments.
## 8                                                                                                                                                                                    GWAS studies enriched for SNPs in this list.
## 9  Actual: number of items in this list found on each chromosome.  Expected: given the total number of items on the chromosome and the number of items in this list, the number of items expected to be found on each chromosome.
## 10                                                                                                                                                                               Protein Domains enriched for items in this list.
##                                                 enrichIdentifier
## 1           proteinDomainRegions.proteinDomain.primaryIdentifier
## 2                                          publications.pubMedId
## 3                                                           <NA>
## 4                   goAnnotation.ontologyTerm.parents.identifier
## 5                                            pathways.identifier
## 6                         GWASResults.study.publication.pubMedId
## 7                                                           <NA>
## 8                         GWASResults.study.publication.pubMedId
## 9                                                           <NA>
## 10 proteins.proteinDomainRegions.proteinDomain.primaryIdentifier
##                                name startClassDisplay
## 1   prot_dom_enrichment_for_protein primaryIdentifier
## 2            publication_enrichment primaryIdentifier
## 3  chromosome_distribution_for_gene              <NA>
## 4            go_enrichment_for_gene primaryIdentifier
## 5                pathway_enrichment primaryIdentifier
## 6        snp_publication_enrichment primaryIdentifier
## 7                      interactions              <NA>
## 8         snp_gwas_study_enrichment primaryIdentifier
## 9   chromosome_distribution_for_snp              <NA>
## 10     prot_dom_enrichment_for_gene primaryIdentifier
##                                              enrich targets
## 1           proteinDomainRegions.proteinDomain.name Protein
## 2                                publications.title    Gene
## 3                                              <NA>    Gene
## 4            goAnnotation.ontologyTerm.parents.name    Gene
## 5                                     pathways.name    Gene
## 6               GWASResults.study.publication.title     SNP
## 7                                              <NA>    Gene
## 8                            GWASResults.study.name     SNP
## 9                                              <NA>     SNP
## 10 proteins.proteinDomainRegions.proteinDomain.name    Gene
##                                                     filters   chartType
## 1                                                      <NA>        <NA>
## 2                                                      <NA>        <NA>
## 3                                      organism.name=[list] ColumnChart
## 4  biological_process,cellular_component,molecular_function        <NA>
## 5              All,KEGG pathways data set,Reactome data set        <NA>
## 6                                                      <NA>        <NA>
## 7                                                      <NA>        <NA>
## 8                                                      <NA>        <NA>
## 9                                      organism.name=[list] ColumnChart
## 10                                                     <NA>        <NA>
##                labels
## 1                <NA>
## 2                <NA>
## 3  Count & Chromosome
## 4                <NA>
## 5                <NA>
## 6                <NA>
## 7                <NA>
## 8                <NA>
## 9  Count & Chromosome
## 10               <NA>

HumanMine provides enrichment for genes, proteins and SNPs, but here we are interested only in the gene-related, enrichment widgets.

# get enrichment, gene-related widgets for HumanMine
subset(human.widgets, widgetType == 'enrichment' & targets == "Gene")

##    startClass                     title widgetType
## 2        Gene    Publication Enrichment enrichment
## 4        Gene  Gene Ontology Enrichment enrichment
## 5        Gene        Pathway Enrichment enrichment
## 10       Gene Protein Domain Enrichment enrichment
##                                                               description
## 2                           Publications enriched for genes in this list.
## 4                               GO terms enriched for items in this list.
## 5  Pathways enriched for genes in this list - data from KEGG and Reactome
## 10                       Protein Domains enriched for items in this list.
##                                                 enrichIdentifier
## 2                                          publications.pubMedId
## 4                   goAnnotation.ontologyTerm.parents.identifier
## 5                                            pathways.identifier
## 10 proteins.proteinDomainRegions.proteinDomain.primaryIdentifier
##                            name startClassDisplay
## 2        publication_enrichment primaryIdentifier
## 4        go_enrichment_for_gene primaryIdentifier
## 5            pathway_enrichment primaryIdentifier
## 10 prot_dom_enrichment_for_gene primaryIdentifier
##                                              enrich targets
## 2                                publications.title    Gene
## 4            goAnnotation.ontologyTerm.parents.name    Gene
## 5                                     pathways.name    Gene
## 10 proteins.proteinDomainRegions.proteinDomain.name    Gene
##                                                     filters chartType labels
## 2                                                      <NA>      <NA>   <NA>
## 4  biological_process,cellular_component,molecular_function      <NA>   <NA>
## 5              All,KEGG pathways data set,Reactome data set      <NA>   <NA>
## 10                                                     <NA>      <NA>   <NA>

The column names provides the character strings that can be passed as arguments to the doEnrichment function and thus define the type of enrichment analysis.

2.3 Perform enrichment analysis with InterMineR

We will perform Gene Ontology Enrichment analysis for the genes in our list using the Gene.id values and the ‘go_enrichment_for_gene’ widget.

# Assign directly the doEnrichment.string from getGeneIds function to the ids argument of doEnrichment to perform the analysis

# Perform enrichment analysis
GO_enrichResult = doEnrichment(
  im = im.human,
  ids = hsa_gene_entrez,
  widget = "go_enrichment_for_gene"
  # organism = "Homo sapiens" # optional if results from more than one organism are retrieved
)

As mentioned above ‘PL_DiabetesGenes’ constitutes a genelist which already exists in HumanMine. Therefore, the enrichment analysis could also be performed by passing the name of the list to the genelist argument, with the same results:

# Perform enrichment analysis with genelist name instead of ids
GO_enrichResult = doEnrichment(
  im = im.human,
  genelist = "PL_DiabetesGenes",
  # ids = hsa_gene_entrez,
  widget = "go_enrichment_for_gene"
)

doEnrichment function returns a list which contains:

a data.frame with the results of the enrichment analysis which was performed in the defined InterMine platform. p-values are given after applying the correction algorithm. Count and populationAnnotationCount columns contain the number of genes that belong to each GO term in the list and in the background population respectively.

# data.frame containing the results of the enrichment analysis
head(GO_enrichResult$data)

##   identifier                     description                 pValue count
## 1 GO:0030072       peptide hormone secretion 2.1574571527914352E-19    21
## 2 GO:0046879               hormone secretion  3.151042144650987E-19    22
## 3 GO:0009914               hormone transport   5.03709770698072E-19    22
## 4 GO:0030073               insulin secretion  6.982356406080463E-17    18
## 5 GO:0023061                  signal release  1.305533140658726E-16    22
## 6 GO:0046883 regulation of hormone secretion  1.362659568146168E-16    19
##   populationAnnotationCount
## 1                       240
## 2                       297
## 3                       309
## 4                       199
## 5                       417
## 6                       252

dim(GO_enrichResult$data)

## [1] 695   5

the size of the reference background population (populationCount)

GO_enrichResult$populationCount

## [1] 16368

the number of input features that were not included in the enrichment analysis (notAnalyzed)

GO_enrichResult$notAnalysed

## [1] 6

the name and url of the Mine (im)

GO_enrichResult$im

## $mine
##                            HumanMine 
## "http://www.humanmine.org/humanmine" 
## 
## $token
## [1] ""

the rest of the parameters used to perform the analysis (parameters)

GO_enrichResult$parameters

##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               ids 
## "1111223,1154072,1023066,1289145,1081177,1185614,1233411,1045730,1200810,1000165,1123255,1110592,1240909,1066420,1271723,1058563,1198741,1054723,1103487,1248547,1243559,1185273,1266975,1040092,1096845,1204150,1276769,1158084,1284092,1189851,1045052,1080121,1233076,1036968,1128899,1037607,1033459,1066434,1201550,1102330,1014543,1286771,1052600,1135417,1258182,1263184,1090012,1016985,1143998,1192242,1046839,1191074,1201371,1242073,1118800,1154478,1009851,1129293,1055800,1142038,1085579,1150267,1238219,1023611,1002387,1155942,1221379,1059634" 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            widget 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "go_enrichment_for_gene" 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              maxp 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "0.05" 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        correction 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "Benjamini Hochberg"

2.4 Apply filters on enrichment analysis

Some of the enrichment widgets contain filters which, when applied, limit the annotation types of the enrichment analysis.

In our example, if we want to retrieve only the enriched GO terms in our list of genes that are related to molecular function, we will use the appropriate filter:

# get available filter values for Gene Ontology Enrichment widget
as.character(subset(human.widgets, name == "go_enrichment_for_gene")$filters)

## [1] "biological_process,cellular_component,molecular_function"

# Perform enrichment analysis for GO molecular function terms
GO_MF_enrichResult = doEnrichment(
  im = im.human,
  ids = hsa_gene_entrez,
  widget = "go_enrichment_for_gene",
  filter = "molecular_function")

head(GO_MF_enrichResult$data)

##   identifier
## 1 GO:0016773
## 2 GO:0003705
## 3 GO:0016772
## 4 GO:0001012
## 5 GO:0000980
## 6 GO:0000977
##                                                                                  description
## 1                                     phosphotransferase activity, alcohol group as acceptor
## 2 transcription factor activity, RNA polymerase II distal enhancer sequence-specific binding
## 3                            transferase activity, transferring phosphorus-containing groups
## 4                                            RNA polymerase II regulatory region DNA binding
## 5                            RNA polymerase II distal enhancer sequence-specific DNA binding
## 6                          RNA polymerase II regulatory region sequence-specific DNA binding
##                  pValue count populationAnnotationCount
## 1 0.0016639782980631742    18                      1364
## 2  0.001788854488949391     6                        92
## 3  0.002130585335160086    20                      1624
## 4 0.0023546706224918355    13                       777
## 5  0.002465654722518649     6                       106
## 6  0.002575586733299186    13                       775

dim(GO_MF_enrichResult$data)

## [1] 20  5

2.5 Apply multiple test correction

To reduce the probability of false positive errors, three different multiple correction algorithms can be applied to the results of the enrichment analysis:

Benjamini Hochberg (default)
Holm-Bonferroni
Bonferroni

The application of one of these algorithms changes the p-values and determines the number of the results which will be returned from the analysis:

# Perform enrichment analysis for Protein Domains in list of genes
PD_FDR_enrichResult = doEnrichment(
  im = im.human,
  ids = hsa_gene_entrez,
  widget = "prot_dom_enrichment_for_gene",
  correction = "Benjamini Hochberg"
) 

head(PD_FDR_enrichResult$data)

## NULL

# Perform enrichment analysis for Protein Domains in list of genes
# but without a correction algoritm this time
PD_None_enrichResult = doEnrichment(
  im = im.human,
  ids = hsa_gene_entrez,
  widget = "prot_dom_enrichment_for_gene",
  correction = "None"
)

head(PD_None_enrichResult$data)

##   identifier                                           description
## 1  IPR006897 Hepatocyte nuclear factor 1, beta isoform, C-terminal
## 2  IPR023219      Hepatocyte nuclear factor 1, dimerisation domain
## 3  IPR006899               Hepatocyte nuclear factor 1, N-terminal
## 4  IPR006020                                         PTB/PI domain
## 5  IPR000734                         Triacylglycerol lipase family
## 6  IPR013818                                   Lipase/vitellogenin
##                  pValue count populationAnnotationCount
## 1 1.1151399403557156E-5     2                         2
## 2 1.1151399403557156E-5     2                         2
## 3  3.338150019168922E-5     2                         3
## 4  3.285826277100501E-4     3                        40
## 5  4.931622661513312E-4     2                        10
## 6  4.931622661513312E-4     2                        10

3 Visualization of InterMineR Gene Ontology Enrichment analysis results

3.1 Convert InterMineR enrichment analysis results to a GeneAnswer-class object

In order to visualize the InterMineR enrichment analysis results, we will use the function convertToGeneAnswers. This function was created to facilitate the visualization of doEnrichment function results by converting them into a GeneAnswers-class object.

This way we can utilize the functions of the package GeneAnswers to visualize the results of the enrichment analysis and the relations between the features (e.g. genes) and/or the annoatation categories (e.g. GO terms) (Feng et al. 2010; Feng et al. 2012; Huang et al. 2014).

# load GeneAnswers package
library(GeneAnswers)

# convert InterMineR Gene Ontology Enrichment analysis results to GeneAnswers object
geneanswer_object = convertToGeneAnswers(
  
  # assign with doEnrichment result:
  enrichmentResult = GO_enrichResult,
  
  # assign with list of genes:
  geneInput = data.frame(GeneID = as.character(hsa_gene_entrez), 
                         stringsAsFactors = FALSE),
  
  # assign with the type of gene identifier
  # in our example we use Gene.primaryIdentifier values (ENTREZ IDs):
  geneInputType = "Gene.primaryIdentifier",
  
  # assign with Bioconductor annotation package:
  annLib = 'org.Hs.eg.db',
  
  # assign with annotation category type
  # in our example we use Gene Ontology (GO) terms:
  categoryType = "GO"
  
  #optional define manually if 'enrichIdentifier' is missing from getWidgets:
  #enrichCategoryChildName = "Gene.goAnnotation.ontologyTerm.parents.identifier"
)

class(geneanswer_object)

## [1] "GeneAnswers"
## attr(,"package")
## [1] "GeneAnswers"

summary(geneanswer_object)

## This GeneAnswers instance was build from GO based on hyperG test.
## Statistical information of 695 categories with p value less than 0.05 are reported. Other categories are considered as nonsignificant.
## There are 695 categories related to the given 68 genes
## 
## Summary of GeneAnswers instance information:
## 
## Slot: geneInput
##   GeneID
## 1   6833
## 2   1636
## 3    208
## 4  26060
## 5    359
## 6    551
## ......
## 
## Slot: testType
## [1] "hyperG"
## 
## Slot: pvalueT
## [1] 0.05
## 
## Slot: genesInCategory
## $`GO:0030072`
##  [1] "6833"   "640"    "11132"  "9451"   "2645"   "3077"   "6928"   "3172"  
##  [9] "3557"   "3569"   "3630"   "3667"   "8660"   "3710"   "3767"   "4544"  
## [17] "4760"   "3651"   "6514"   "169026" "6934"  
## 
## $`GO:0046879`
##  [1] "6833"   "640"    "11132"  "9451"   "2645"   "3077"   "6928"   "3172"  
##  [9] "3557"   "3569"   "3630"   "3667"   "8660"   "3710"   "3767"   "4544"  
## [17] "4760"   "3651"   "56729"  "6514"   "169026" "6934"  
## 
## $`GO:0009914`
##  [1] "6833"   "640"    "11132"  "9451"   "2645"   "3077"   "6928"   "3172"  
##  [9] "3557"   "3569"   "3630"   "3667"   "8660"   "3710"   "3767"   "4544"  
## [17] "4760"   "3651"   "56729"  "6514"   "169026" "6934"  
## 
## $`GO:0030073`
##  [1] "6833"   "640"    "11132"  "9451"   "2645"   "6928"   "3172"   "3557"  
##  [9] "3667"   "8660"   "3710"   "3767"   "4544"   "4760"   "3651"   "6514"  
## [17] "169026" "6934"  
## 
## $`GO:0023061`
##  [1] "6833"   "640"    "11132"  "9451"   "2645"   "3077"   "6928"   "3172"  
##  [9] "3557"   "3569"   "3630"   "3667"   "8660"   "3710"   "3767"   "4544"  
## [17] "4760"   "3651"   "56729"  "6514"   "169026" "6934"  
## 
## $`GO:0046883`
##  [1] "6833"   "640"    "11132"  "2645"   "3077"   "3172"   "3569"   "3630"  
##  [9] "3667"   "8660"   "3710"   "3767"   "4544"   "4760"   "3651"   "56729" 
## [17] "6514"   "169026" "6934"  
## 
## ......
## 
## Slot: geneExprProfile
## NULL
## 
## Slot: annLib
## [1] "org.Hs.eg.db"
## 
## Slot: categoryType
## [1] "GO"
## 
## Slot: enrichmentInfo
##            genes in Category percent in the observed List
## GO:0030072                21                    0.3088235
## GO:0046879                22                    0.3235294
## GO:0009914                22                    0.3235294
## GO:0030073                18                    0.2647059
## GO:0023061                22                    0.3235294
## GO:0046883                19                    0.2794118
##            percent in the genome fold of overrepresents odds ratio
## GO:0030072            0.01466276               21.06176   30.02553
## GO:0046879            0.01814516               17.83007   25.87923
## GO:0009914            0.01887830               17.13764   24.85564
## GO:0030073            0.01215787               21.77239   29.25045
## GO:0023061            0.02547654               12.69911   18.29434
## GO:0046883            0.01539589               18.14846   24.79786
##                 p value  fdr p value
## GO:0030072 2.157457e-19 2.157457e-19
## GO:0046879 3.151042e-19 3.151042e-19
## GO:0009914 5.037098e-19 5.037098e-19
## GO:0030073 6.982356e-17 6.982356e-17
## GO:0023061 1.305533e-16 1.305533e-16
## GO:0046883 1.362660e-16 1.362660e-16
## ......

#slotNames(geneanswer_object)

GeneAnswers is package designed for the enrichment analysis and visualization of gene lists. It is worth mentioning that the InterMineR filters for the widgets of Gene Ontology and Pathway Enrichment:

as.character(
subset(human.widgets,
       title %in% c("Gene Ontology Enrichment",
                    "Pathway Enrichment"))$filters
)

## [1] "biological_process,cellular_component,molecular_function"
## [2] "All,KEGG pathways data set,Reactome data set"

match the following available values:

InterMineR widget name	InterMineR filter	`convertToGeneAnswers` categoryType
go_enrichment_for_gene	biological_process	GO.BP
go_enrichment_for_gene	cellular_component	GO.CC
go_enrichment_for_gene	molecular_function	GO.MF
pathway_enrichment	KEGG pathways data set	KEGG
pathway_enrichment	Reactome data set	REACTOME.PATH

for the categoryType argument of the geneAnswerBuilder function, and can be assigned accordingly in the categoryType argument of the convertToGeneAnswers, therby facilitating the conversion of InterMineR Gene Ontology and Pathway Enrichment analysis results to GeneAnswers-class objects.

# convert to GeneAnswers results for GO terms associated with molecular function
geneanswer_MF_object = convertToGeneAnswers(
  enrichmentResult = GO_MF_enrichResult,
  geneInput = data.frame(GeneID = as.character(hsa_gene_entrez), 
                         stringsAsFactors = FALSE),
  geneInputType = "Gene.primaryIdentifier",
  annLib = 'org.Hs.eg.db',
  categoryType = "GO.MF"
  #enrichCategoryChildName = "Gene.goAnnotation.ontologyTerm.parents.identifier"
)

class(geneanswer_MF_object)

## [1] "GeneAnswers"
## attr(,"package")
## [1] "GeneAnswers"

summary(geneanswer_MF_object)

## This GeneAnswers instance was build from GO.MF based on hyperG test.
## Statistical information of 20 categories with p value less than 0.05 are reported. Other categories are considered as nonsignificant.
## There are 20 categories related to the given 68 genes
## 
## Summary of GeneAnswers instance information:
## 
## Slot: geneInput
##   GeneID
## 1   6833
## 2   1636
## 3    208
## 4  26060
## 5    359
## 6    551
## ......
## 
## Slot: testType
## [1] "hyperG"
## 
## Slot: pvalueT
## [1] 0.05
## 
## Slot: genesInCategory
## $`GO:0016773`
##  [1] "1636"  "208"   "551"   "640"   "5611"  "9451"  "2056"  "2645"  "3082" 
## [10] "3569"  "3630"  "3643"  "3667"  "8660"  "9479"  "5770"  "26191" "7422" 
## 
## $`GO:0003705`
## [1] "50943" "3159"  "6928"  "3172"  "5078"  "3651" 
## 
## $`GO:0016772`
##  [1] "1636"  "208"   "551"   "640"   "5611"  "9451"  "2056"  "2645"  "3082" 
## [10] "3569"  "3630"  "3643"  "3667"  "8660"  "9479"  "4938"  "5468"  "5770" 
## [19] "26191" "7422" 
## 
## $`GO:0001012`
##  [1] "50943" "3159"  "6927"  "6928"  "3172"  "8462"  "4760"  "5078"  "3651" 
## [10] "5325"  "5468"  "6925"  "6934" 
## 
## $`GO:0000980`
## [1] "50943" "3159"  "6928"  "3172"  "5078"  "3651" 
## 
## $`GO:0000977`
##  [1] "50943" "3159"  "6927"  "6928"  "3172"  "8462"  "4760"  "5078"  "3651" 
## [10] "5325"  "5468"  "6925"  "6934" 
## 
## ......
## 
## Slot: geneExprProfile
## NULL
## 
## Slot: annLib
## [1] "org.Hs.eg.db"
## 
## Slot: categoryType
## [1] "GO.MF"
## 
## Slot: enrichmentInfo
##            genes in Category percent in the observed List
## GO:0016773                18                   0.26470588
## GO:0003705                 6                   0.08823529
## GO:0016772                20                   0.29411765
## GO:0001012                13                   0.19117647
## GO:0000980                 6                   0.08823529
## GO:0000977                13                   0.19117647
##            percent in the genome fold of overrepresents odds ratio
## GO:0016773           0.083752917               3.160557   3.938358
## GO:0003705           0.005649024              15.619565  17.034362
## GO:0016772           0.099717549               2.949507   3.761802
## GO:0001012           0.047709689               4.007079   4.717843
## GO:0000980           0.006508658              13.556604  14.771759
## GO:0000977           0.047586884               4.017419   4.730628
##                p value fdr p value
## GO:0016773 0.001663978 0.001663978
## GO:0003705 0.001788854 0.001788854
## GO:0016772 0.002130585 0.002130585
## GO:0001012 0.002354671 0.002354671
## GO:0000980 0.002465655 0.002465655
## GO:0000977 0.002575587 0.002575587
## ......

#slotNames(geneanswer_MF_object)

Briefly, in the following sections we present how to use several functions of the GeneAnswers package to visualize the results of the Gene Ontology Enrichment analysis on the PL_DiabetesGenes gene list.

For more information about the usage of GeneAnswers package, the user should look at the respective documentation.

3.2 PieChart and BarPlot graphs

# GeneAnswers pieChart
geneAnswersChartPlots(geneanswer_object, 
                      chartType='pieChart',
                      sortBy = 'correctedPvalue',
                      top = 5)

Figure 1: Pie chart of the top five GO terms with the smallest FDR-corrected p-values.

# GeneAnswers barplot
geneAnswersChartPlots(geneanswer_object, 
                      chartType='barPlot',
                      sortBy = 'correctedPvalue',
                      top = 5)

Figure 2: Bar plot of the top five GO terms with the smallest FDR-corrected p-values.

3.3 Concept-gene network

# generate concept-gene network
geneAnswersConceptNet(geneanswer_object, 
                      colorValueColumn=NULL,
                      centroidSize='correctedPvalue', 
                      output='interactive',
                      catTerm = FALSE,
                      catID = FALSE,
                      showCats = 1:5)

Figure 3: Screen shot of concept-gene network for the first five GO terms. The size of nodes is proportional to the number of genes in each GO term.

The color of gene nodes can be changed based on the numeric values associated with the genes.

# generate concept-gene network
# for visualization purposes add a column of RANDOM numbers in geneInput
set.seed(123)
geneanswer_object@geneInput$random_values = rnorm(nrow(geneanswer_object@geneInput))

geneAnswersConceptNet(geneanswer_object, 
                      colorValueColumn=2,
                      centroidSize='correctedPvalue', 
                      output='interactive',
                      catTerm = FALSE,
                      catID = FALSE,
                      showCats = 1:5)

Figure 4: Screen shot of concept-gene network for the first five GO terms. Random numeric values have been assigned as column to geneInput in order to visualize the ability to add color to gene nodes. The size of nodes is proportional to the number of genes in each GO term.

3.4 Gene Ontology structure network

# visualize the relations between the first two GO terms
geneAnswersConceptRelation(geneanswer_object,
                           directed=TRUE, 
                           netMode='connection',
                           catTerm=TRUE, 
                           catID=TRUE,
                           showCats = 1:2)

Figure 5: Screen shot of Gene Ontology structure network for the first two GO terms. The size of nodes is proportional to number of genes in these GO terms. The color of nodes stand for how relative the given genes are to the GO terms. More red, more relative. The given GO terms are yellow framed dots with dark purple edges as connections.

3.5 Gene interaction network

The functions of GeneAnswers package make use of several Bioconductor Annotation Packages, which are primarily based on Entrez Gene identifiers.

In our example, GeneAnswers uses the annotation package ‘org.Hs.eg.db’, which was specified previously in convertToGeneAnswers function, to retrieve gene interactions for the specified genes.

# visualize interactions between the first five genes
buildNet(getGeneInput(geneanswer_object)[1:5,1],
         idType='GeneInteraction', 
         layers=2,
         filterGraphIDs=getGeneInput(geneanswer_object)[,1],
         filterLayer=2, 
         netMode='connection')

Figure 6: Screen shot of gene interaction network for the first five genes of PL_DiabetesGenes. The large black dots with yellow frame stand for the five given genes. They also connect to other genes by dark-blue-purple edges. Small black dots represent the other genes from getGeneInput. Small white dots are genes that are not in the genes from getGeneInput, but interact with these genes.

3.6 Gene Ontology terms - Genes cross tabulation

# generate GO terms - Genes cross tabulation
geneAnswersHeatmap(geneanswer_object, 
                   catTerm=TRUE, 
                   geneSymbol=TRUE,
                   showCats = 5:10,
                   cex.axis = 0.75)

## initial  value 26.199675 
## final  value 26.199675 
## converged
## initial  value 1.874804 
## final  value 1.874804 
## converged

4 Visualization of InterMineR Publication Enrichment analysis results

4.1 Perform Publication Enrichment analysis

Although GeneAnswers package is designed for performing enrichment analysis on genes and reporting enriched pathways, it is possible to utilize some of its functions to visualize the results of the various enrichment widgets used by InterMineR.

Specifically in the current example we will show how to perform visualizations on publication enrichment analysis results.

# publication enrichment analysis
publication_enrichResult = doEnrichment(
  im = im.human,
  ids = hsa_gene_entrez,
  widget = "publication_enrichment")

Since we are performing conversion of enrichment results that are not related to pathways there are three things to notice here:

We are deliberately assign Gene.symbol instead of Gene.primaryIdentifier values as geneInput. This will result in a more readable concept-gene network and can be changed in order to generate the gene interaction network.
annLib argument is assigned with ‘org.Hs.eg.db’ as before
categoryType is assigned with NULL since we no longer use GO terms but PubMed IDs (Gene.publications.pubMedId) as annotation categories.

# convert to GeneAnswer-class
# use Gene.symbol values instead of Gene.primaryIdentifier (ENTREZ)
geneanswer_pubs_object = convertToGeneAnswers(
  enrichmentResult = publication_enrichResult,
  geneInput = data.frame(GeneID = as.character(hsa_gene_names),
                         stringsAsFactors = FALSE),
  geneInputType = "Gene.symbol",
  annLib = 'org.Hs.eg.db',
  categoryType = NULL,
  enrichCategoryChildName = "Gene.publications.pubMedId"
)

class(geneanswer_pubs_object)

## [1] "GeneAnswers"
## attr(,"package")
## [1] "GeneAnswers"

summary(geneanswer_pubs_object)

## This GeneAnswers instance was build from  based on hyperG test.
## Statistical information of 1236 categories with p value less than 0.05 are reported. Other categories are considered as nonsignificant.
## There are 1236 categories related to the given 68 genes
## 
## Summary of GeneAnswers instance information:
## 
## Slot: geneInput
##   GeneID
## 1  ABCC8
## 2    ACE
## 3   AKT2
## 4  APPL1
## 5   AQP2
## 6    AVP
## ......
## 
## Slot: testType
## [1] "hyperG"
## 
## Slot: pvalueT
## [1] 0.05
## 
## Slot: genesInCategory
## $`19578796`
##  [1] "ABCC8"  "ACE"    "CAPN10" "CCR5"   "ENPP1"  "GCGR"   "GCK"    "HNF1A" 
##  [9] "HNF4A"  "IL6"    "INS"    "INSR"   "IRS1"   "KCNJ11" "LIPC"   "PAX4"  
## [17] "PDX1"   "PON1"   "PPARG"  "PTPN1"  "RETN"   "VEGFA" 
## 
## $`19913121`
##  [1] "ABCC8"   "ACE"     "AKT2"    "AVPR2"   "CCR5"    "CDKAL1"  "CEL"    
##  [8] "CTLA4"   "EIF2AK3" "ENPP1"   "EPO"     "GCK"     "HFE"     "HGF"    
## [15] "HMGA1"   "HNF1A"   "HNF1B"   "HNF4A"   "IGF2BP2" "IL1RN"   "IL2RA"  
## [22] "IL6"     "INS"     "INSR"    "IRS1"    "IRS2"    "ITPR3"   "KCNJ11" 
## [29] "LIPC"    "NEUROD1" "PDX1"    "PON1"    "PPARG"   "PTPN22"  "RETN"   
## [36] "SLC2A2"  "SOD2"    "TCF7L2"  "VEGFA"   "WFS1"   
## 
## $`20628086`
##  [1] "ABCC8"   "ACE"     "AKT2"    "AVPR2"   "CCR5"    "CDKAL1"  "CEL"    
##  [8] "CTLA4"   "EIF2AK3" "ENPP1"   "EPO"     "GCK"     "HFE"     "HGF"    
## [15] "HMGA1"   "HNF1A"   "HNF1B"   "HNF4A"   "IGF2BP2" "IL1RN"   "IL2RA"  
## [22] "IL6"     "INS"     "INSR"    "IRS1"    "IRS2"    "ITPR3"   "KCNJ11" 
## [29] "LIPC"    "NEUROD1" "PDX1"    "PON1"    "PPARG"   "PTPN22"  "RETN"   
## [36] "SLC2A2"  "SOD2"    "TCF7L2"  "VEGFA"   "WFS1"   
## 
## $`20682687`
##  [1] "ABCC8"   "ACE"     "CAPN10"  "GCK"     "HNF1A"   "HNF1B"   "HNF4A"  
##  [8] "IL6"     "IRS1"    "KCNJ11"  "LIPC"    "NEUROD1" "PDX1"    "PPARG"  
## [15] "PTPN1"  
## 
## $`20384434`
##  [1] "ABCC8"   "CDKAL1"  "HNF1B"   "HNF4A"   "IGF2BP2" "IL6"     "IRS1"   
##  [8] "KCNJ11"  "LIPC"    "PPARG"   "SLC30A8" "TCF7L2"  "WFS1"   
## 
## $`24509480`
##  [1] "CDKAL1"  "GLIS3"   "HNF1A"   "HNF1B"   "HNF4A"   "IGF2BP2" "IRS1"   
##  [8] "KCNJ11"  "MTNR1B"  "PPARG"   "SLC30A8" "TCF7L2"  "VEGFA"   "WFS1"   
## 
## ......
## 
## Slot: geneExprProfile
## NULL
## 
## Slot: annLib
## [1] "org.Hs.eg.db"
## 
## Slot: categoryType
## NULL
## 
## Slot: enrichmentInfo
##          genes in Category percent in the observed List
## 19578796                22                    0.3235294
## 19913121                40                    0.5882353
## 20628086                40                    0.5882353
## 20682687                15                    0.2205882
## 20384434                13                    0.1911765
## 24509480                14                    0.2058824
##          percent in the genome fold of overrepresents odds ratio
## 19578796          0.0042521051              76.086881  111.99800
## 19913121          0.0620751392               9.476182   21.58501
## 20628086          0.0616555235               9.540675   21.74164
## 20682687          0.0014266931             154.615052  198.09101
## 20384434          0.0007553081             253.110566  312.70034
## 24509480          0.0020141550             102.217729  128.45936
##               p value  fdr p value
## 19578796 1.773299e-30 1.773299e-30
## 19913121 5.422389e-26 5.422389e-26
## 20628086 6.259033e-26 6.259033e-26
## 20682687 9.620720e-25 9.620720e-25
## 20384434 2.224729e-24 2.224729e-24
## 24509480 3.927844e-20 3.927844e-20
## ......

#slotNames(geneanswer_pubs_object)

4.2 PieChart and BarPlot graphs

# GeneAnswers pieChart
geneAnswersChartPlots(geneanswer_pubs_object, 
                      chartType='pieChart',
                      sortBy = 'correctedPvalue',
                      top = 5)

Figure 7: Pie chart of the top five PubMed IDs with the smallest FDR-corrected p-values.

# GeneAnswers barplot
geneAnswersChartPlots(geneanswer_pubs_object, 
                      chartType='barPlot',
                      sortBy = 'correctedPvalue',
                      top = 5)

Figure 8: Bar plot of the top five PubMed IDs with the smallest FDR-corrected p-values.

4.3 Concept-gene network

It is important to note that the arguments catTerm and catID are assigned with FALSE. This is because the function is not expecting PubMed IDs as annotation categories and it will fail to convert them to short descriptions (as it would do with GO terms or KEGG and REACTOME pathways).

# generate concept-gene network
geneAnswersConceptNet(geneanswer_pubs_object, 
                      colorValueColumn=NULL,
                      centroidSize='correctedPvalue', 
                      output='interactive',
                      catTerm = FALSE,
                      catID = FALSE,
                      showCats = 1:5)

Figure 9: Screen shot of concept-gene network for the first five PubMed IDs. The size of nodes is proportional to the number of genes in each PubMed ID.

4.4 Gene interaction network

To generate a gene interaction network it is necessary to replace Gene.symbol with Gene.primaryIdentifier values because buildNet function retrieves gene interactions from the ‘org.Hs.eg.db’ package, which uses ENTREZ identifiers for mapping.

# copy the existing GeneAnswer-class object
geneanswer_pubs_object2 = geneanswer_pubs_object

# replace Gene.symbol with Gene.primaryIdentifier values
geneanswer_pubs_object2@geneInput = data.frame(
  GeneID = as.character(hsa_gene_entrez), 
  stringsAsFactors = FALSE)

# generate gene interaction network
buildNet(getGeneInput(geneanswer_pubs_object2)[1:5,1],
         idType='GeneInteraction', 
         layers=2,
         filterGraphIDs=getGeneInput(geneanswer_pubs_object)[,1],
         filterLayer=2, 
         netMode='connection')

Figure 10: Screen shot of gene interaction network for the first five genes of PL_DiabetesGenes. The large black dots with yellow frame stand for the five given genes. They also connect to other genes by dark-blue-purple edges. Small black dots represent the other genes from getGeneInput. Small white dots are genes that are not in the genes from getGeneInput, but interact with these genes.

5 System info

sessionInfo()

## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=C                  LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] org.Hs.eg.db_3.4.2   GO.db_3.4.2          GeneAnswers_2.20.0  
##  [4] RColorBrewer_1.1-2   Heatplus_2.24.0      MASS_7.3-47         
##  [7] annotate_1.56.0      XML_3.98-1.9         AnnotationDbi_1.40.0
## [10] IRanges_2.12.0       S4Vectors_0.16.0     Biobase_2.38.0      
## [13] BiocGenerics_0.24.0  RCurl_1.95-4.8       bitops_1.0-6        
## [16] igraph_1.1.2         RSQLite_2.0          InterMineR_1.0.0    
## [19] BiocStyle_2.6.0     
## 
## loaded via a namespace (and not attached):
##  [1] SummarizedExperiment_1.8.0 lattice_0.20-35           
##  [3] tcltk_3.4.2                htmltools_0.3.6           
##  [5] yaml_2.1.14                RBGL_1.54.0               
##  [7] chron_2.3-51               blob_1.1.0                
##  [9] rlang_0.1.2                DBI_0.7                   
## [11] bit64_0.9-7                gsubfn_0.6-6              
## [13] matrixStats_0.52.2         GenomeInfoDbData_0.99.1   
## [15] stringr_1.2.0              zlibbioc_1.24.0           
## [17] Biostrings_2.46.0          evaluate_0.10.1           
## [19] memoise_1.1.0              knitr_1.17                
## [21] GenomeInfoDb_1.14.0        curl_3.0                  
## [23] proto_1.0.0                Rcpp_0.12.13              
## [25] xtable_1.8-2               backports_1.1.1           
## [27] DelayedArray_0.4.0         graph_1.56.0              
## [29] jsonlite_1.5               XVector_0.18.0            
## [31] bit_1.1-12                 digest_0.6.12             
## [33] stringi_1.1.5              bookdown_0.5              
## [35] RJSONIO_1.3-0              GenomicRanges_1.30.0      
## [37] grid_3.4.2                 rprojroot_1.2             
## [39] tools_3.4.2                magrittr_1.5              
## [41] sqldf_0.4-11               tibble_1.3.4              
## [43] pkgconfig_2.0.1            Matrix_1.2-11             
## [45] downloader_0.4             xml2_1.1.1                
## [47] rmarkdown_1.6              httr_1.3.1                
## [49] R6_2.2.2                   compiler_3.4.2

References

Feng, Gang, Pan Du, Nancy L Krett, Michael Tessel, Steven Rosen, Warren A Kibbe, and Simon M Lin. 2010. “A collection of bioconductor methods to visualize gene-list annotations.” BMC Research Notes 3 (1): 10. doi:10.1186/1756-0500-3-10.

Feng, Gang, Pamela Shaw, Steven T. Rosen, Simon M. Lin, and Warren A. Kibbe. 2012. “Using the Bioconductor GeneAnswers Package to Interpret Gene Lists.” In Methods in Molecular Biology (Clifton, N.J.), 802:101–12. doi:10.1007/978-1-61779-400-1_7.

Huang, Lei, Gang Feng, Pan Du, Tian Xia, Xishu Wang, Jing Wen, Warren Kibbe, and Simon Lin. 2014. “GeneAnswers: Integrated Interpretation of Genes.” Bioconductor. https://www.bioconductor.org/packages/release/bioc/html/GeneAnswers.html.

Kalderimis, Alex, Rachel Lyne, Daniela Butano, Sergio Contrino, Mike Lyne, Joshua Heimbach, Fengyuan Hu, et al. 2014. “InterMine: extensive web services for modern biology.” Nucleic Acids Research 42 (Web Server issue). Oxford University Press: W468–72. doi:10.1093/nar/gku301.

Smith, R. N., J. Aleksic, D. Butano, A. Carr, S. Contrino, F. Hu, M. Lyne, et al. 2012. “InterMine: a flexible data warehouse system for the integration and analysis of heterogeneous biological data.” Bioinformatics 28 (23): 3163–5. doi:10.1093/bioinformatics/bts577.

Enrichment Analysis and Visualization

30 October 2017

Contents