1 vissE

This package implements the vissE algorithm to summarise results of gene-set analyses. Usually, the results of a gene-set enrichment analysis (e.g. using limma::fry, singscore or GSEA) consist of a long list of gene-sets. Biologists then have to search through these lists to determine emerging themes that explain the altered biological processes. This task can be labour intensive, so we need solutions to summarise large sets of results from such analyses.

This package provides an approach to summarise results from gene-set enrichment analyses. It exploits the relatedness between gene-sets and the inherent hierarchical structure that may exist in pathway databases and gene ontologies to cluster results. For each cluster of gene-sets that vissE identifies, it performs text-mining to automate characterisation of the biological functions and processes represented by the cluster.

vissE can also perform a novel type of gene-set enrichment analysis based on the network of similarity between gene-sets. Given a list of genes (e.g. from a DE analysis), vissE can characterise the list by first identifying all gene-sets that are similar to it, then clustering the resulting gene-sets, and finally performing text-mining to reveal emerging themes.

In addition to these analyses, it provides visualisations to assist the users in understanding the results of their experiment. This document will demonstrate these functions across the two use-cases. The vissE package can be downloaded as follows:
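Bioconductor packages are normally installed through BiocManager. The snippet below is a minimal sketch that assumes vissE is being installed from Bioconductor, not from a development repository:

```r
# Install BiocManager from CRAN if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install vissE (assumption: the Bioconductor release is the one wanted)
BiocManager::install("vissE")
```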

2 Summarising the results of a gene-set enrichment analysis

Often, the result of a gene-set enrichment analysis (be it an over-representation analysis or a functional class scoring approach) is a list of gene-sets accompanied by their statistics and p-values or false discovery rates (FDR). Biologists typically scan through these results and extract themes relevant to the experiment of interest. vissE automates the extraction of these themes.

The example below can be used with the results of any enrichment analysis. The data below is simulated to demonstrate the workflow.

A vissE analysis involves 3 steps, plus an optional fourth:

  1. Compute gene-set overlaps and the gene-set overlap network
  2. Identify clusters of gene-sets based on their overlap
  3. Characterise clusters using text mining
  4. (Optional) Visualise gene-level statistics
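To make step 1 concrete, the sketch below computes pairwise overlaps between a few toy gene-sets and keeps the pairs above a threshold as network edges. It is a plain base R illustration, not vissE's own implementation: the gene-set names, their members, the choice of the Jaccard index and the 0.25 threshold are all assumptions made for the example.

```r
# Toy gene-sets (hypothetical names and members, for illustration only)
gene_sets <- list(
  HALLMARK_A = c("TP53", "MDM2", "CDKN1A", "BAX"),
  HALLMARK_B = c("TP53", "MDM2", "CDKN1A", "ATM"),
  GO_C       = c("IL6", "TNF", "IL1B", "CXCL8"),
  GO_D       = c("IL6", "TNF", "IL1B", "CCL2")
)

# Jaccard index: size of the intersection divided by size of the union
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Enumerate all unordered pairs of gene-sets and score their overlap
pairs <- t(combn(names(gene_sets), 2))
overlaps <- data.frame(
  gs1 = pairs[, 1],
  gs2 = pairs[, 2],
  jaccard = mapply(
    function(x, y) jaccard(gene_sets[[x]], gene_sets[[y]]),
    pairs[, 1], pairs[, 2],
    USE.NAMES = FALSE
  ),
  stringsAsFactors = FALSE
)

# Pairs at or above the (assumed) threshold become edges of the overlap network
edges <- overlaps[overlaps$jaccard >= 0.25, ]
edges
```

Each remaining row of `edges` is one weighted edge, so the table can be handed to a graph library to build the overlap network used in the later steps.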

2.2 Identify clusters of gene-sets

Related gene-sets likely represent related processes. The next step is to identify clusters of gene-sets so that they can be assessed for biological themes. The clustering approach can be selected by the user, though we recommend graph clustering approaches because they use the information provided in the overlap graph. In particular, we recommend the igraph::cluster_walktrap() algorithm as it works well with dense graphs. Many other algorithms are implemented in the igraph package and these can be used instead of the walktrap algorithm.
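As an illustration of graph clustering on an overlap network, the sketch below runs igraph::cluster_walktrap() on a small hand-made graph. The gene-set names and edge weights are hypothetical, and the graph is built directly with igraph instead of coming from a vissE overlap computation.

```r
library(igraph)

# Hypothetical overlap edges between toy gene-sets (weights are made up):
# two tightly connected triangles joined by one weak edge (GS3-GS4)
edges <- data.frame(
  from   = c("GS1", "GS1", "GS2", "GS4", "GS4", "GS5", "GS3"),
  to     = c("GS2", "GS3", "GS3", "GS5", "GS6", "GS6", "GS4"),
  weight = c(0.8, 0.7, 0.9, 0.8, 0.7, 0.9, 0.1)
)

# The third column is stored as the edge attribute 'weight'
g <- graph_from_data_frame(edges, directed = FALSE)

# Walktrap clustering, passing the overlap scores as edge weights
cl <- cluster_walktrap(g, weights = E(g)$weight)

# Cluster id for every gene-set, then the gene-sets grouped by cluster
grp <- membership(cl)
split(names(grp), grp)
```

Any other igraph community detection function that returns a communities object, such as cluster_louvain(), can be swapped in at the clustering line without changing the rest of the sketch.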

Instead of exploring the full network of gene-sets, the subgraph of nodes that form part of the groups can be plotted. This allows for a more focused investigation into the relatedness of clusters identified using vissE.
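The idea of restricting the plot to clustered nodes can be sketched with igraph::induced_subgraph(). This is a generic igraph example under assumed inputs: the graph, the gene-set names and the two groups are hypothetical, and vissE's own plotting functions are not used here.

```r
library(igraph)

# Hypothetical overlap network; GS7 is attached to it but is in no cluster
edges <- data.frame(
  from = c("GS1", "GS1", "GS2", "GS4", "GS4", "GS5", "GS3", "GS6"),
  to   = c("GS2", "GS3", "GS3", "GS5", "GS6", "GS6", "GS4", "GS7")
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Hypothetical clusters of interest, e.g. chosen after graph clustering
groups <- list(
  cluster1 = c("GS1", "GS2", "GS3"),
  cluster2 = c("GS4", "GS5", "GS6")
)

# Keep only the vertices that belong to one of the selected clusters
keep <- unlist(groups, use.names = FALSE)
sub_g <- induced_subgraph(g, vids = keep)

# Colour vertices by cluster membership and plot the focused subgraph
V(sub_g)$color <- ifelse(V(sub_g)$name %in% groups$cluster1,
                         "tomato", "steelblue")
plot(sub_g, vertex.label = V(sub_g)$name)
```

Edges between the two clusters are kept in the subgraph, so relatedness between clusters remains visible while unclustered gene-sets are dropped.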

2.3 Characterise gene-set clusters

Gene-set clusters identified can be assessed for their biological similarities using text-mining approaches. Here, we perform a frequency analysis (adjusted using the inverse document frequency) on the gene-set names or their short descriptions to assess recurring biological themes in clusters. These results are then presented as word clouds.
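The frequency analysis can be illustrated with a small base R sketch that weights word counts in each cluster by the inverse document frequency. The gene-set names, the tokenisation rule (drop the database prefix, split on underscores) and the exact weighting formula are assumptions made for this example and may differ from what vissE does internally.

```r
# Hypothetical gene-set names grouped into two clusters
clusters <- list(
  c1 = c("GO_INTERFERON_GAMMA_PRODUCTION",
         "HALLMARK_INTERFERON_ALPHA_RESPONSE",
         "REACTOME_INTERFERON_SIGNALING"),
  c2 = c("GO_CELL_CYCLE_CHECKPOINT",
         "HALLMARK_G2M_CHECKPOINT",
         "REACTOME_CELL_CYCLE")
)

# Tokenise names: drop the database prefix, split on "_", lower-case
tokenise <- function(x) {
  x <- sub("^[^_]+_", "", x)
  tolower(unlist(strsplit(x, "_")))
}

# Treat every gene-set name as one document
all_names <- unlist(clusters, use.names = FALSE)
docs <- lapply(all_names, tokenise)

# Document frequency of each word and the resulting IDF weight
vocab <- unique(unlist(docs))
doc_freq <- sapply(vocab, function(w) sum(sapply(docs, function(d) w %in% d)))
idf <- log(length(docs) / doc_freq)

# Per-cluster word counts multiplied by IDF, highest score first
word_scores <- lapply(clusters, function(nms) {
  tf <- table(tokenise(nms))
  scores <- as.numeric(tf) * idf[names(tf)]
  sort(scores, decreasing = TRUE)
})

word_scores
```

Each element of `word_scores` is a named numeric vector, which is the shape of input that word cloud plotting functions generally expect (words plus a size for each word).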