FoldGO: a tool for fold-change-specific functional enrichment analysis

Daniil Wiebe

2020-10-30

A typical scenario of transcriptome data analysis is identification of differentially expressed genes (DEGs), those with significant changes in the number of transcripts (fold-change), and functional enrichment analysis of DEGs lists using Gene Ontology. Classical gene set enrichment analysis ignores the difference in the fold-changes that lead to the loss of valuable information. The FoldGO method has been created to identify the GO terms significantly overrepresented for the genes responded to the factor within a narrow fold-change-interval. FoldGO processes the DEGs list in three steps:

See an example of the FoldGO performance in the analysis of the transcriptome data on expression changes of Arabidopsis thaliana genes in response to plant hormone auxin treatment (Omelyanchuk et al., 2017)1.

Workflow

FoldGO pipeline consists of three steps:

As input data the algorithm uses the tables for up- and down-regulated genes that contain Gene IDs and their fold-change values:

GeneID fold-change
AT3G65420 3.6
AT1G78450 1.5
AT2G66890 2.1

First, one has to separate initial set of genes in to quantiles and generate unions of all neighbouring quantiles. For example, we will use built-in data derived from experiment on auxin treatment of Arabidopsis thaliana roots. GeneGroups function will take only first two columns, so be sure that your data have gene identifiers in the first column and fold-change values (FC) in the second one. Note that data for up- and down-regulation must be processed separately. In the following example demonstrating GeneGroups function usage only the data for up-regulation is used.

head(degenes)
GeneID FC pval qval
AT1G01120 2.069 2.70e-10 2.60e-08
AT1G02520 1.641 7.79e-05 2.06e-03
AT1G02730 1.290 3.37e-05 9.95e-04
AT1G02850 1.978 2.50e-06 1.01e-04
AT1G02900 1.138 1.22e-03 1.95e-02
AT1G03110 2.686 8.44e-05 2.22e-03
up_groups <- GeneGroups(degenes, quannumber=6)

At the next step we will conduct functional enrichment analysis of generated groups. For functional enrichment analysis FoldGO uses TopGO package.

Functional annotation

Custom annotaion

For custom annotation one has to provide GO id -> Gene id list or object of MgsaSets (from mgsa package) or GAFReader class. FoldGO provides GAFReader - simple and convinient parser for annotation presented in GAF format. It takes only two arguments:

  • file - full path to GAF file.
  • geneid_col - index of column with Genes IDs

NOTE: The GAF file in example is truncated. The original full file can be downloaded from GO consortium website: http://www.geneontology.org/page/download-go-annotations.

gaf_path <- system.file("extdata", "gene_association.tair.lzma", package = "FoldGO")
gaf <- GAFReader(file = gaf_path,  geneid_col = 10)

One can retrieve direct annotations as list object and version of GAF file using following methods:

getVersion(gaf)
#> [1] "2.0"
getAnnotation(gaf)

To annotate our gene groups we will use FuncAnnotGroupsTopGO function which uses topGO package for singular enrichment analysis. The minimal set of arguments needed for this function to work is:

  • groups - object of GeneGroups class
  • namespace - character string specifing GO namespace. This argument accepts the following values: “BP”, “MF” or “CC”, where
    • BP - biological process
    • MF - molecular function
    • CC - cellular component
  • customAnnot - It can be an object generated by or function from mgsa package or list which has GO term ids as keys and character vectors contain gene ids as values.
  • annot - functions used to compile a list of GO terms such that each element in the list is a character vector containing all the gene identifiers that are mapped to the respective GO term. Here it must be assigned with topGO::annFUN.GO2genes
  • bggenes - vector contains background set of genes
up_annotobj <- FuncAnnotGroupsTopGO(genegroups = up_groups, namespace = "MF", customAnnot = gaf, annot = topGO::annFUN.GO2genes, bggenes = bggenes)

Using annotaions from Bioconductor packages

Another possibility is to use bioconductor packages containing annotations for specific organism. For example “org.Hs.eg.db” (Human) and “org.At.tair.db” (Arabidopsis), package name must be assigned to mapping argument. In this case one has to assign topGO::annFUN.org to annot argument and specify ID (from topGO package manual: character string specifing the gene identifier to use). Currently only the following identifiers can be used: c("entrez", "genbank", "alias", "ensembl", "symbol", "genename", "unigene").

Enrichment analysis with Arabidopsis annotations:

up_annotobj <- FuncAnnotGroupsTopGO(up_groups,"MF", mapping = "org.At.tair.db", annot = topGO::annFUN.org, ID = "entrez", bggenes = bggenes)

Enrichment analysis with annotations for Human:

up_groups <- GeneGroups(degenes_hum, quannumber=6)
FuncAnnotGroupsTopGO(up_groups,"MF", mapping = "org.Hs.eg.db", annot = topGO::annFUN.org, ID = "ensembl", bggenes = bggenes_hum)

Testing for fold-specificity

The fold-specificity recognition procedure consists of GO terms preselection from DEGs annotation and fold-change-specific enrichment analysis. At each step the FDR threshold must be established. By default FDR threshold for GO terms preselection (fdrstep1) is set to 1 (no preselection) and FDR threshold for fold-change-specific enrichment analysis (fdrstep2) is set to 0.05. As a default method for mutltiple testing correction FoldGO uses Benjamini-Hochberg correction procedure2.

up_fsobj <- FoldSpecTest(up_annotobj, fdrstep1 = 0.05, fdrstep2 = 0.01)
down_fsobj <- FoldSpecTest(down_annotobj, fdrstep1 = 0.05, fdrstep2 = 0.01)

It is possible to choose another correction procedure from R base which can be listed via p.adjust.methods. Here Benjamini-Yekutieli correction procedure3 is selected.

FoldSpecTest(up_annotobj, padjmethod = "BY")

One can inspect the results of enrichment analysis as dataframes. Access dataframe with fold-specific terms

fs_table <- getFStable(up_fsobj)
head(fs_table)
ids namespace name wholeint_pval wholeint_padj min_pval padj interval
GO:0003723 MF RNA binding 1.80e-07 2.71e-05 7.28e-05 9.26e-03 1-4
GO:0003735 MF structural constituent of ribosome 1.00e-30 1.13e-27 2.98e-27 2.32e-24 1-2
GO:0003743 MF translation initiation factor activity 6.50e-05 4.80e-03 8.31e-05 9.26e-03 1-2
GO:0005198 MF structural molecule activity 1.00e-30 1.13e-27 1.94e-26 7.57e-24 1-2
GO:0008135 MF translation factor activity, RNA binding 1.80e-07 2.71e-05 2.56e-08 6.67e-06 1-2
GO:0097159 MF organic cyclic compound binding 2.90e-07 3.88e-05 3.89e-06 6.07e-04 1-5

where:

And with not fold-specific:

nfs_table <- getNFStable(up_fsobj)
head(nfs_table)
ids namespace name wholeint_pval wholeint_padj min_pval padj interval
GO:0000166 MF nucleotide binding 3.70e-07 4.17e-05 3.30e-04 2.58e-02 1-5
GO:0003697 MF single-stranded DNA binding 5.20e-04 3.01e-02 9.43e-02 1.00e+00 1-4
GO:0003746 MF translation elongation factor activity 2.40e-04 1.46e-02 2.11e-04 2.06e-02 1
GO:0003883 MF CTP synthase activity 3.90e-04 2.31e-02 2.26e-01 1.00e+00 4-5
GO:0004386 MF helicase activity 6.50e-08 1.13e-05 2.32e-02 6.04e-01 3
GO:0005488 MF binding 6.00e-06 5.41e-04 6.48e-04 4.50e-02 1-5

Plot results

Via plot function one can plot “Fold-change specific GO Profile” on which the GO terms significantly associated with a certain fold-change intervals are presented in yellow and blue boxes for up- and down-regulated genes, correspondingly. Here the result for six equal in size fold-change intervals is presented. The diagram presents only fold-change-specific terms. If the gene was associated fold-specifically with down-regulation but not fold-specifically with up-regulation (or vise versa) than not fold-change-specific interval (1-6 here) will be also shown.

plot(up_fsobj, down_fsobj)


  1. Omelyanchuk, N. A. et al. Auxin regulates functional gene groups in a fold-change-specific manner in Arabidopsis thaliana roots // Nat Sci Rep – 2017. – N 7(1) – p 2489↩︎

  2. Benjamini, Y., Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society - 1995 - Series B. 57 (1): 289–300↩︎

  3. Benjamini, Y., Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics - 2001 - 29 (4): 1165–1188. doi:10.1214/aos/1013699998↩︎