1. Introduction to FGNet

FGNet allows to perform a Functional Enrichment Analysis (FEA) on a list of genes or expression set, and transform the results into networks. The resulting functional networks provide an overview of the biological functions of the genes/terms, and allows to easily see links between genes, overlap between clusters, finding key genes, etc.

FGNet takes as input a query list of genes selected by the user, and builds and displays networks of genes based in the existence of common functional terms that are enriched in certain subsets of genes of the list. By doing this, the tool allows to disclose groups/clusters of genes that have similar annotations and so they may have similar biological function in the cell. The discovery of molecular machines or functional modules within the cell (i.e. genes or proteins that work together to perform a biological process in the cells) is essential in modern molecular medicine and systems biology, because many times we do not know which are the gene partners playing in the same roles in a pathological state. FGNet is a tool that helps to create functional connections between different genes/proteins based on annotations. By grouping similar, redundant and homogeneous annotation content from the same or different biological resources into gene-term groups, the biological interpretation of large gene lists moves from a gene centric approach (where each gene is independent) to a functional-module centric approach (where the genes are interconnected). In this way, FGNet can provide a better representation of complex biological processes and reveal associations between genes.

Biological functional analysis

After obtaining a list of genes or proteins from an experiment or omic studies (microarrays, RNAseq, mass spectrometry, etc), the next step is usually to perform a functional analysis of the genes to search for the biological functions or processes in which they are involved. In order to facilitate the analysis of large lists of genes, multiple functional enrichment tools have been developed. These tools search for the genes in biological databases (i.e. GO, Kegg, Interpro), and test whether any biological annotations are over-represented in the query gene list compared to what would be expected in the whole population. However, the raw output from a functional enrichment analysis often provides dozens or hundreds of terms, and it still requires a lot of time and attention to go through the whole list of genes and annotations. A way to simplify this task is grouping genes and terms which often appear together and create associated networks: the Functional Networks.

FGNet builds the functional networks, based on data from a previous functional enrichment analysis (FEA). The package provides the functions to perform the FEA through four specific tools:

  • DAVID with Functional Annotation Clustering (DAVID-FAC), which measures relationships among annotation terms based on their co-association with subsets of genes within the query gene list (Huang et al). This type of clustering mostly results in groups of highly related terms, such as synonymous annotations from different annotation spaces (i.e. term “glycolysis” in KEGG and GO-BP), which also share most of their genes. This tool provides great coverage but does not avoid redundant terms and very general terms (like signal transduction or regulation of transcription that correspond to specific terms in Gene Ontology, GO).

  • GeneTerm Linker, a post-enrichment tool, which focuses on clearing and sorting the results from a previous modular enrichment analysis. This is achieved by filtering general terms with low information content (i.e. cellular process or protein binding) and redundant annotations (i.e. metabolic process and primary metabolic process). The remaining gene-term sets are grouped into metagroups based on their shared genes and terms (using a reciprocal linkage approach) (Fontanillo et al).

  • TopGO (Alexa et al), an enrichment analysis tool based on Gene Ontology (GO) that tests GO terms while accounting for the topology of the GO graph to eliminate local similarities and dependencies between GO terms. TopGO does not provide clusters, and therefore the functional network is built using only the gene-term sets. TopGO can be applied off-line.

  • GAGE (Luo et al), a gene set enrichment analysis (GSEA) tool. It searches for functional enrichment in gene sets (i.e. KEGG pathways, Reactome, GO) and allows including a signal value -like expression changes- to rank the genes and then to identify the enrichment in functional terms that are altered (i.e. changed in genes UP and DOWN) or altered consistently in one direction (UP or DOWN). GAGE also clusters the resulting enriched gene-term sets and can be applied off-line.

To build the network based on other other tools, the raw output should be saved into a text file which contains the enriched terms and their genes. (For more details see function format_results()).

Functional network

The functional network is the representation of the results from a functional enrichment analysis.

In the default network, all the nodes of the network are of the same type, i.e. genes OR terms, which are linked to each other if they are in the same gene-term set. In the plot, the genes/terms in the same groups (metagroups or clusters) are surrounded by a common background color.

In the bipartite network, the nodes are of two types, allowing to link the genes or terms, with the clusters they belong to. This network, can be built as an intersection network , a simplified functional network where all the genes/terms that belong to only one metagroup are clustered into a single node. This simplified network contains only the nodes in several groups.

In addition to the networks, FGNet also provides a few functions for further analysis. These functions allow to get a distance matrix , which represents the similarity between the groups based on the genes they share with each other (binary distance), and the distribution of degree and betweenness within the network and subnetworks, in order to find the most important genes (hubs).

All these functionalities can be accessed directly through the appropriate functions or the graphical user interface (GUI). In addition, FGNet also allows to generate an HTML report with an overview of these plots and analyses for a specific gene list.

Examples of functional network for different analyses:

nw1 nw2 nw3

2. Installation

To install FGNet from Bioconductor, type in your R console:

To reduce system requirements, only the minimum packages are required to execute FGNet. However, there are several functionalities that require further packages. i.e. the Graphical User Interface (GUI) requires “RGtk2”, the FEA analyses might require “RDAVIDWebService”, “gage”, “topGO” or some annotation packages… etc.

To make sure all FGNet functionalities are available, install the following packages:

3. Creating a network

To generate a functional network with FGNet:

  1. Perform a Functional Enrichment Analysis (FEA) on a list of genes, proteins or expression set.

    FEA tool Online? Input Annotations
    DAVID Yes Gene list Many
    Gene-Term Linker Yes Gene list GO, KEGG, Interpro
    TopGO No Gene list GO
    GAGE (GSEA) No Expression set Any gene set
  2. Create an HTML report with multiple views of the networks and analyses.
  3. Personalize or analyze an specific network.

These steps are integrated into the Graphical User Interface (GUI), which provides access to the main functionalities of FGNet.

Graphical User Interface (GUI)

The Graphical User Interface (GUI) provides access to most FGNet functionalities in Windows and Linux (The current version of the GUI is not available for Mac OS X Snow Leopard).

To launch the GUI, type in the R console:

FGNet GUI

FGNet GUI

In case you already have a gene list or gene expression from a previous analysis, it is possible to load it directly into the GUI genes field by passing it as argument:

In R code…

The first step in the workflow is always is to perform a Functional Enrichment Analysis (FEA) on a list of genes or expression set.

Once the FEA is ready, you can proceed to generate the HTML report or the individual network/analyses:

FGNet workflow

FGNet workflow

For help or more details on any functions or their arguments, just set a ? before its name.

Functional Enrichment Analysis (FEA)

Since the arguments required to perform the FEA depends on the tool, there are several FEA functions:

FEA tool Function Output group type
DAVID fea_david() Clusters
TopGO fea_topGO() No grouping
Gene-Term Linker fea_gtLinker() & fea_gtLinker_getResults() Metagroups
GAGE fea_gage() Clusters
Other format_feaResults()

All the FEA functions and FGNet_report() save the results in the current working directory.

Here is an example analyzing a gene list with the different tools:

DAVID

NOTE: Since August 2015, DAVID requires https. This causes errors in some systems. A (hopefully) temporary solution requries to install some certificates locally. See RDAVIDWebService help: https://support.bioconductor.org/p/70090/#72226

Using DAVID requires internet connection. In addition, we recommend to register at http://david.abcc.ncifcrf.gov/webservice/register.htm to perform the queries through its Web Service.

By default, geneIdType="ENSEMBL_GENE_ID". To replace the gene IDs by readable names in the plots and HTML report, use the argument geneLabels. To see the gene IDs supported by DAVID’s Web Service, use: getIdTypes(DAVIDWebService$new(email=...)).

TopGO

Since TopGO uses local databases, it does not require internet connection.

The results from topGO are provided as individual gene-term sets not grouped into clusters. FGNet treats each gene-term set as a single cluster.

Gene-Term Linker

Since the analysis with Gene-Term Linker usually takes several minutes to be ready, the workflow is divided in two steps: (1) sending the analysis request, and (2) retrieving the results:

once the analysis is ready…

GAGE

As a GSEA approach, instead of performing the functional enrichment over a gene list, gage requires a raw expression set and the samples to compare:

Other tools

To import the results from a functional enrichment analysis performed with other tools, see:

Web analysis

FGNet can also be applied to an analysis performed at DAVID and GeneTerm Linker web site:

To import these results into FGNet, use DAVID’s download file or GeneTerm linker’s job ID, and the functions format_david() or fea_gtLinker_getResults():

ssDavid ssGtlinker
David Gene-term Linker

HTML report

The HTML report function allows to create a comprehensive report including different views of the Functional Network, the cluster/metagroup legend, and some further statistics directly directly from a gene list.

Here is the code to use FGNet_report() with each of the previous examples:

By default, the clusters included in these reports are filtered out to get cleaner results. The default values depend on the tool, and can be modified through FGNet_report arguments:

Individual networks

After the FEA is ready, it is also possible to generate specific networks rather than the full report. Here is a simple example on how to use fea2incidMat() to generate the incidence matrices that represent the networks and plot them. There are more detailed examples on how to edit and explore the networks in sections Editing and creating new networks and Filtering and selecting clusters.

##       1 2 3 4 5
## ADA2  0 1 0 0 0
## APC1  0 0 1 0 0
## APC11 0 0 1 0 0
## APC2  0 0 1 0 1
## APC4  0 0 1 0 0
##                                                      1 2 3 4 5
## Chromatin assembly (BP) (GO:0031497)                 0 0 1 0 0
## Chromatin modification (BP) (GO:0016568)             0 1 0 0 0
## Cytoplasmic mRNA processing body (CC) (GO:0000932)   0 0 0 1 0
## Enzyme regulator activity (MF) (GO:0030234)          0 0 0 0 1
## Histone acetylation (BP) (GO:0016573)                0 1 0 0 0
## Histone acetyltransferase activity (MF) (GO:0004402) 0 1 0 0 0

These incidence matrices can be plotted and analyzed in different ways:

## $`Metagroup 1`
##      Terms                                                                     
## [1,] "MRNA cleavage (BP)"                                                      
## [2,] "MRNA cleavage and polyadenylation specificity factor complex (CC)"       
## [3,] "MRNA cleavage factor complex (CC)"                                       
## [4,] "MRNA polyadenylation (BP)"                                               
## [5,] "MRNA surveillance pathway"                                               
## [6,] "Termination of RNA polymerase II transcription, poly(A)-coupled (BP)"    
## [7,] "Termination of RNA polymerase II transcription, poly(A)-independent (BP)"
Dynamic plot

Dynamic plot

4. Editing and creating new networks

In this section we will use the functional analysis of an Alzheimer dataset (GSE4757):

The variable feaAlzheimer contains the raw results from the functional analysis. The slot metagroups could also be clusters or missing depending on the FEA tool:

## [1] "queryArgs"    "metagroups"   "geneTermSets" "fileName"

To see the terms in each cluster/metagroup use getTerms():

## $`Metagroup 3`
##       Terms                                                 
##  [1,] "Alzheimer's disease"                                 
##  [2,] "Calcium ion transport (BP)"                          
##  [3,] "Calcium signaling pathway"                           
##  [4,] "Calmodulin binding (MF)"                             
##  [5,] "GnRH signaling pathway"                              
##  [6,] "Induction of apoptosis by extracellular signals (BP)"
##  [7,] "Long-term potentiation"                              
##  [8,] "Melanogenesis"                                       
##  [9,] "Neurotrophin signaling pathway"                      
## [10,] "Salivary secretion"                                  
## [11,] "Tuberculosis"                                        
## [12,] "Vascular smooth muscle contraction"                  
## [13,] "Wnt signaling pathway"                               
## 
## $`Metagroup 4`
##      Terms                       
## [1,] "Glutamatergic synapse"     
## [2,] "Postsynaptic density (CC)" 
## [3,] "Postsynaptic membrane (CC)"
## [4,] "Synapse (CC)"

Incidence matrices

The FEA results should be transformed into incidence matrices to create the network. These matrices are the internal representation of the network: they contain which genes are in each metagroup or cluster and in each gene-term set. Therefore, it is in this step where the main shape of the network is determined.

The function to create the incidence matrices is fea2incidMat(). It allows to filter out clusters, decide whether the networks should be gene-based or term-based, establish the groups to link the genes/terms, etc…

We will start the example creating a simple gene-based network:

##        1 2 3 4 5 6 7 8 9
## ACTN1  1 0 0 0 1 0 0 0 0
## ADD3   1 0 1 0 0 0 0 0 0
## ANO3   1 0 0 0 0 0 0 0 0
## APOE   1 0 0 0 0 1 0 0 1
## ATP2B1 0 0 1 0 0 0 0 0 0
## C7     1 0 0 0 0 0 0 0 0
##        1.14 1.15 1.16 1.17 1.18
## ACTN1     0    1    0    1    1
## ADD3      0    0    0    0    0
## ANO3      0    0    0    0    0
## APOE      0    0    1    0    0
## ATP2B1    0    0    0    0    0

To filter or select with metagroups to show, use the arguments filterAttribute, filterOperator and filterThreshold. filterAttribute should be a column from the feaAlzheimer$clusters or feaAlzheimer$metagroups data frames. The recommended filters for each tool can be seen in the object FEA_tools, which contains the default filters when generating the HTML report:

To see which metagroups/clusters have been filtered out and will not be shown in the networks:

For more on selecting and filtering groups see section ‘filtering’. To build the networks based on terms, use the argument key="Terms".

Functional network

The function functionalNetwork() generates and plots the networks. In case there is available expression data, it can be used for representation in this step:

The default plot will plot all the genes/terms in the network, and will return the networks as igraph objects and matrices in an invisible list. The argument keepColors determine whether the colors should be consistent, taking into account the filtered groups, or restarted:

By setting the parameter plotOutput="dynamic" instead of an static plot, it will create an interactive one. By setting plotOutput="none", it is possible to produce only the network without plotting.

Since the returned networks are iGraph objects, they can be used or analyzed as such:

## [1] "iGraph" "adjMat"
## [1] "commonClusters" "commonGtSets"
## IGRAPH 9fb2235 UN-- 49 334 -- 
## + attr: name (v/c)
## + edges from 9fb2235 (vertex names):
##  [1] ACTN1--CNTN4   ACTN1--DDN     ACTN1--FZD3    ACTN1--GAP43  
##  [5] ACTN1--HCN1    ACTN1--KIF13B  ACTN1--LZTS1   ACTN1--MAPT   
##  [9] ACTN1--RHOC    ACTN1--SCN2A   APOE --CACNA1C APOE --CACNA1D
## [13] APOE --CALM3   APOE --CHRM1   APOE --DDN     APOE --FZD3   
## [17] APOE --HCN1    APOE --HHIP    APOE --KIF13B  APOE --KIFC2  
## [21] APOE --MAPK8   APOE --MAPT    APOE --NLGN4X  APOE --NRSN2  
## [25] APOE --PLCB1   APOE --RASGRP1 APOE --ROBO2   APOE --SLC3A2 
## [29] APOE --SORT1   APOE --SORT1   APOE --STRC    APOE --THY1   
## + ... omitted several edges

In dynamic plots (tkplot) it is not possible to draw the metagroup background. However, you can save the layout of a dynamic network, and plot it as static using the argument vLayout:

Bipartite and intersection network

The default bipartite version of the functional network plots the intersection network: a simplified functional network, containing only the nodes in several metagroups and the metagroups they belong to. In this network, metagroup nodes (the coloured nodes) can be seen as a cluster of all the genes/proteins that belong only to that metagroup:

To plot a full bipartite network including all the nodes, just set keepAllNodes=TRUE:

Terms networks

In the same way we have built networks to explore the relationship between genes, the same approach can be used to explore the relationship between the biological terms in the enrichment analysis. i.e. to see which biological terms are usually associated, or locate which terms are in several groups. To do so, build the incidence matrices based on terms instead of genes using the argument key="Terms".

By default, the functional network is built establishing links between nodes (genes or terms) in the same gene-term sets. Depending on the tool, this network might have few or no edges:

To plot a network with links between all the terms in the same cluster or metagroups, use fea2incidMat() with the $cluster or $metagroup slots from the FEA, in order to consider the whole cluster/metagroup as a gene-term set:

Since GeneTerm Linker filters out generic and redundant terms from the final metagroups, by default these terms are not plotted. To include them in the graph, set the argument removeFiltered=FALSE (only available for GeneTerm Linker).

For more information on the filtered terms see (Fontanillo et al) or http://gtlinker.cnb.csic.es/gtset/help .

Genes - Terms networks

To build a genes-terms network, we can use the bipartite plot with the appropriate formatting of the input matrices.

For many FEA tools it will be enough with applying the fea2incidMat() directly to the $geneTermSets matrix selecting the gene-term sets we want to plot. i.e. gene-term sets in a specific cluster, filter generic terms (terms annotated to more than X genes), etc…

Note that this approach might not be appropriate for GeneTerm Linker, since it groups several terms into each gene-term set.

Then, create a terms-genes incidence matrix with fea2incidMat(), and plot the network…

##                                                        CDC16 DOC1 GLC7
## Anatomical structure morphogenesis                         1    1    1
## Cell differentiation                                       1    1    1
## Sporulation resulting in formation of a cellular spore     1    1    1
## Developmental process                                      1    1    1
## Sporulation                                                1    1    1
##                                                        RPN11 SPT3
## Anatomical structure morphogenesis                         1    1
## Cell differentiation                                       0    1
## Sporulation resulting in formation of a cellular spore     0    1
## Developmental process                                      1    1
## Sporulation                                                0    1

Network with genes colored based on their expression and terms on alphabetical order:

Network with genes colored by alphabetical order (from red to pink), terms white:

5. Filtering and selecting clusters

As an example of analysis of a network with very overlapping clusters, we will use the yeast gene list analyzed with DAVID:

Filtering based on a cluster property

The clusters to plot can be selected/filtered based on any property that is available in the clusters matrix:

## [1] "Cluster"                "nGenes"                
## [3] "ClusterEnrichmentScore" "Genes"                 
## [5] "Terms"                  "keyWordsTerm"

i.e. Selecting the clusters with highest Enrichment Score or least genes (setting eColor=NA, plots the networks without edges):

##        10%        25%        50%        75%        90% 
## 0.08585003 0.33812100 5.90148600 7.65222050 7.85874500
##  10%  25%  50%  75%  90% 
##  5.0 13.5 44.0 55.0 58.0

To use any property that is not available in the $clusters data frame, just add it as column to the dataframe.

Selecting clusters with specific keywords

##                            Cluster 1                            Cluster 2 
## "Cellular protein catabolic process"                  "Metabolic process" 
##                            Cluster 3                            Cluster 4 
##                      "Transcription" "Cellular protein catabolic process" 
##                            Cluster 5                            Cluster 6 
##                          "Organelle"                    "MRNA processing" 
##                            Cluster 7                            Cluster 8 
##                      "Transcription" "Regulation of biosynthetic process" 
##                            Cluster 9                           Cluster 10 
##   "Anatomical structure development"                 "Hydrolase activity" 
##                           Cluster 11 
##                        "ATP binding"

Filtering based on a gene-term set property

In some ocasions, it might also be useful to filter out gene-term sets within a cluster. i.e. The terms in the top of the GO ontologies are annotated to many genes and make most clusters overalp.

To filter out terms, (1) filter or select the terms in the the feaReults$geneTermSets data frame, (2) save it as text file, and (3) import it with readGeneTermSets()

In this case, we will use DAVID’s example, and keep the terms thar are annotated to less than 100 genes in yeast:

## [1] 64

To explore the distribution of genes-terms in a specific organism:

6. Other auxiliary functions

analyzeNetwork()

analyzeNetwork() can be used to explore the structure of the network. It also returns statistics about the nodes betweenness within each cluster, etc…

The example with GeneTerm Linker (Alzheimer):

## [1] "degree"            "betweenness"       "transitivity"     
## [4] "betweennessMatrix" "hubsList"          "intraHubsCount"
## commonClustersNw   commonGtSetsNw 
##        0.6947699        0.5943638

$degree and $betweenness are the values used for the plots. They contain the values for each of the nodes in the global network (commonClusters) and within each cluster/metagroup (subsets of commonGtSets network). The degree is given as percentage, normalized based on the total number of nodes of the network. i.e. a value of 90 in a network of 10 nodes, would mean the actual degree of the node is 9: it is connected to 9 nodes (90% of 10)).

The betweenness of each node in each cluster as matrix:

##         Global 2 5 6 7 8 9
## NLGN4X     270 0 0 5 0 0 0
## HCN1       181 3 1 1 0 0 0
## CHRM1      109 0 0 1 3 0 0
## CNTNAP1    108 8 0 0 0 0 0
## SCN2A       53 1 9 0 0 0 0
## CACNA1C     44 0 0 0 0 0 1

Inter-modular hubs: Nodes with betweenness within the top 75% in the global network

## [1] "NLGN4X"  "HCN1"    "CHRM1"   "CNTNAP1" "SCN2A"

Intra-modular hubs: Nodes with betweenness within the top 75% in each cluster sub-network

## [1] "MAPT"  "CALM3" "APOE"

DAVID’s example:

Note the structure of the network varies not only depending on the dataset, but also on the tool. Since tools like DAVID link all the nodes/terms within each cluster, their internal normalized degree is always 100%.

plotGoAncestors()

plotGoAncestors() and plotKegg() also allow to explore the significant gene term sets:

7. Acknowledgments

This work was supported by Instituto de Salud Carlos III [Research Projects PS09/00843 and PI12/00624] and by a grant from the Junta de Castilla y Leon and the European Social Fund to S.A and C.D.

8. References