1 Introduction

The multiGSEA package was designed to run a robust GSEA-based pathway enrichment for multiple omics layers. The enrichment is calculated for each omics layer separately and aggregated p-values are calculated afterwards to derive a composite multi-omics pathway enrichment.

Pathway definitions can be downloaded from up to eight different pathway databases by means of the graphite Bioconductor package (Sales, Calura, and Romualdi 2018). Feature mapping for transcripts and proteins is supported towards Entrez Gene IDs, Uniprot, Gene Symbol, RefSeq, and Ensembl IDs. The mapping is accomplished through the AnnotationDbi package (Pagès et al. 2019) and currently supported for 11 different model organisms including human, mouse, and rat. ID conversion of metabolite features to Comptox Dashboard IDs (DTXCID, DTXSID), CAS-numbers, Pubchem IDs (CID), HMDB, KEGG, ChEBI, Drugbank IDs, or common metabolite names is accomplished through the AnnotationHub package metabliteIDmapping. This package provides a comprehensive ID mapping for more than 1.1 million entries.

This vignette covers a simple example workflow illustrating how the multiGSEA package works. The omics data sets that will be used throughout the example were originally provided by Quiros et al. (Quirós et al. 2017). In their publication the authors analyzed the mitochondrial response to four different toxicants, including Actinonin, Diclofenac, FCCB, and Mito-Block (MB), within the transcriptome, proteome, and metabolome layer. The transcriptome data can be downloaded from NCBI GEO, the proteome data from the ProteomeXchange Consortium, and the non-targeted metabolome raw data can be found in the online supplement.

1.1 Installation

There are two different ways to install the multiGSEA package.

First, the multiGSEA package is part of Bioconductor. Hence, the best way to install the package is via BiocManager. Start R (>=4.0.0) and run the following commands:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# The following initializes usage of Bioc devel
BiocManager::install(version='devel')

BiocManager::install("multiGSEA")

Alternatively, the multiGSEA package is hosted on our github page https://github.com/yigbt/multiGSEA and can be installed via the devtools package:

install.packages("devtools")
devtools::install_github("https://github.com/yigbt/multiGSEA")

Once installed, just load the multiGSEA package with:

library(multiGSEA)

2 Example workflow

A common workflow which is exemplified in this vignette is typically separated in the following steps:

Load required libraries, including the multiGSEA package, and omics data sets.
Create data structure for enrichment analysis.
Download and customize the pathway definitions.
Run the pathway enrichment for each omics layer.
Calculate the aggregated pathway enrichment.

2.1 Load libraries and omics data

At first, we need to load the necessary packages to run the multi-omics enrichment analysis. In our example data we have to deal with omics data that has been measured in human cell lines. We therefore need the org.Hs.eg.db package (Carlson 2019a) for transcript and protein mapping. In case the omics data was measured in mouse or rat, we would need the packages org.Mm.eg.db (Carlson 2019b) and org.Rn.eg.db (Carlson 2019c), respectively.

library( "org.Hs.eg.db")

In principle, multiGSEA is able to deal with 11 different model organisms. A summary of supported organisms, their naming format within multiGSEA and their respective AnnotationDbi package is shown in Table 1.

Table 1: Supported organisms, their abbreviations being used in `multiGSEA`,
and mapping database that are needed for feature conversion. Supported abbreviations can be seen with `getOrganisms()`
Organisms	Abbreviations	Mapping
Human	hsapiens	org.Hs.eg.db
Mouse	mmusculus	org.Mm.eg.db
Rat	rnorvegicus	org.Rn.eg.db
Dog	cfamiliaris	org.Cf.eg.db
Cow	btaurus	org.Bt.eg.db
Pig	sscrofa	org.Ss.eg.db
Chicken	ggallus	org.Gg.eg.db
Zebrafish	drerio	org.Xl.eg.db
Frog	xlaevis	org.Dr.eg.db
Fruit Fly	dmelanogaster	org.Dm.eg.db
C.elegans	celegans	org.Ce.eg.db

To run the analysis of this vignette, load the installed version of multiGSEA.

library( multiGSEA)
library( magrittr)

Load the omics data for each layer where an enrichment should be calculated. The example data is provided within the package and already preprocessed such that we have log2 transformed fold changes and their associated p-values.

# load transcriptomic features
data( transcriptome)

# load proteomic features
data( proteome)

# load metabolomic features
data( metabolome)

2.1.1 Cautionary note

This example involves preprocessed omics data from public repositories, which means that the data might look different when you download and pre-process it with your own workflow. Therefore, we put our processed data as an example data in the R package. We here sketch out the pipeline described in the multiGSEA paper. We will not focus on the pre-processing steps and how to derive the necessary input format for the multi-omics pathway enrichment in terms of differentially expression analysis, since this is highly dependent on your experiment and analysis workflow.

However, the required input format is quite simple and exactly the same for each input omics layer: A data frame with 3 mandatory columns, including feature IDs, the log2-transformed fold change (logFC), and the associated p-value.

The header of the data frame can be seen in Table 2:

Table 2: Structure of the necessary omics data
For each layer (here: transcriptome), feature IDs, log2FCs, and p-values are needed.
Symbol	logFC	pValue	adj.pValue
A2M	-0.1615651	0.0000060	0.0002525
A2M-AS1	0.2352903	0.2622606	0.4594253
A4GALT	-0.0384392	0.3093539	0.5077487
AAAS	0.0170947	0.5407324	0.7126172
AACS	-0.0260510	0.4034970	0.5954772
AADAT	0.0819910	0.2355455	0.4285059

2.2 Create data structure

multiGSEA works with nested lists where each sublist represents an omics layer. The function rankFeatures calculates the pre-ranked features, that are needed for the subsequent calculation of the enrichment score. rankFeatures calculates the a local statistic ls based on the direction of the fold change and the magnitude of its significance:

\[\begin{equation} ls = sign( log_2(FC)) * log_{10}( p-value) \end{equation}\]

Please note, that any other rank metric will work as well and the choice on which one to chose is up to the user. However, as it was shown by Zyla et al. (Joanna Zyla and Polanska 2017), the choice of the applied metric might have a big impact on the outcome of your analysis.

# create data structure
omics_data <- initOmicsDataStructure( layer = c("transcriptome", 
                                                "proteome",
                                                "metabolome"))

## add transcriptome layer
omics_data$transcriptome <- rankFeatures( transcriptome$logFC, 
                                          transcriptome$pValue)
names( omics_data$transcriptome) <- transcriptome$Symbol

## add proteome layer
omics_data$proteome <- rankFeatures(proteome$logFC, proteome$pValue)
names( omics_data$proteome) <- proteome$Symbol

## add metabolome layer
## HMDB features have to be updated to the new HMDB format
omics_data$metabolome <- rankFeatures(metabolome$logFC, metabolome$pValue)
names( omics_data$metabolome) <- metabolome$HMDB
names( omics_data$metabolome) <- gsub( "HMDB", "HMDB00", 
                                       names( omics_data$metabolome))

The first elements of each omics layer are shown below:

omics_short <- lapply( names( omics_data), function( name){ 
                        head( omics_data[[name]])
                      })
names( omics_short) <- names( omics_data)
omics_short

## $transcriptome
##     STC2     ASNS     PCK2  FAM129A    NUPR1     ASS1 
## 13.43052 12.69346 12.10931 11.97085 11.81069 11.61673 
## 
## $proteome
##     IFRD1   FAM129A     FDFT1      ASNS       CTH      PCK2 
## 10.818222 10.108260 -9.603185  9.327082  8.914447  8.908938 
## 
## $metabolome
## HMDB0000042 HMDB0003344 HMDB0000820 HMDB0000863 HMDB0006853 HMDB0013785 
##   -1.404669   -1.404669   -3.886828   -3.886828   -3.130641   -3.130641

2.3 Download and customize pathway definitions

Now we have to select the databases we want to query and the omics layer we are interested in before pathway definitions are downloaded and features are mapped to the desired format.

databases <- c( "kegg", "reactome")
layers <- names( omics_data)

pathways <- getMultiOmicsFeatures( dbs = databases, layer = layers,
                                   returnTranscriptome = "SYMBOL",
                                   returnProteome = "SYMBOL",
                                   returnMetabolome = "HMDB",
                                   useLocal = FALSE)

The first two pathway definitions of each omics layer are shown below:

pathways_short <- lapply( names( pathways), function( name){
                          head( pathways[[name]], 2)
                        })
names( pathways_short) <- names( pathways)
pathways_short

## $transcriptome
## $transcriptome$`(KEGG) Glycolysis / Gluconeogenesis`
##  [1] "AKR1A1"  "ADH1A"   "ADH1B"   "ADH1C"   "ADH4"    "ADH5"    "ADH6"   
##  [8] "GALM"    "ADH7"    "LDHAL6A" "DLAT"    "DLD"     "ENO1"    "ENO2"   
## [15] "ENO3"    "ALDH2"   "ALDH3A1" "ALDH1B1" "FBP1"    "ALDH3B1" "ALDH3B2"
## [22] "ALDH9A1" "ALDH3A2" "ALDOA"   "ALDOB"   "ALDOC"   "G6PC1"   "GAPDH"  
## [29] "GAPDHS"  "GCK"     "GPI"     "HK1"     "HK2"     "HK3"     "ENO4"   
## [36] "LDHA"    "LDHB"    "LDHC"    "PGAM4"   "ALDH7A1" "PCK1"    "PCK2"   
## [43] "PDHA1"   "PDHA2"   "PDHB"    "PFKL"    "PFKM"    "PFKP"    "PGAM1"  
## [50] "PGAM2"   "PGK1"    "PGK2"    "PGM1"    "PKLR"    "PKM"     "PGM2"   
## [57] "ACSS2"   "G6PC2"   "BPGM"    "TPI1"    "HKDC1"   "ADPGK"   "ACSS1"  
## [64] "FBP2"    "LDHAL6B" "G6PC3"   "MINPP1" 
## 
## $transcriptome$`(KEGG) Citrate cycle (TCA cycle)`
##  [1] "CS"     "DLAT"   "DLD"    "DLST"   "FH"     "IDH1"   "IDH2"   "IDH3A" 
##  [9] "IDH3B"  "IDH3G"  "MDH1"   "MDH2"   "ACLY"   "ACO1"   "OGDH"   "ACO2"  
## [17] "PC"     "PDHA1"  "PDHA2"  "PDHB"   "OGDHL"  "SDHA"   "SDHB"   "SDHC"  
## [25] "SDHD"   "SUCLG2" "SUCLG1" "SUCLA2" "PCK1"   "PCK2"  
## 
## 
## $proteome
## $proteome$`(KEGG) Glycolysis / Gluconeogenesis`
##  [1] "AKR1A1"  "ADH1A"   "ADH1B"   "ADH1C"   "ADH4"    "ADH5"    "ADH6"   
##  [8] "GALM"    "ADH7"    "LDHAL6A" "DLAT"    "DLD"     "ENO1"    "ENO2"   
## [15] "ENO3"    "ALDH2"   "ALDH3A1" "ALDH1B1" "FBP1"    "ALDH3B1" "ALDH3B2"
## [22] "ALDH9A1" "ALDH3A2" "ALDOA"   "ALDOB"   "ALDOC"   "G6PC1"   "GAPDH"  
## [29] "GAPDHS"  "GCK"     "GPI"     "HK1"     "HK2"     "HK3"     "ENO4"   
## [36] "LDHA"    "LDHB"    "LDHC"    "PGAM4"   "ALDH7A1" "PCK1"    "PCK2"   
## [43] "PDHA1"   "PDHA2"   "PDHB"    "PFKL"    "PFKM"    "PFKP"    "PGAM1"  
## [50] "PGAM2"   "PGK1"    "PGK2"    "PGM1"    "PKLR"    "PKM"     "PGM2"   
## [57] "ACSS2"   "G6PC2"   "BPGM"    "TPI1"    "HKDC1"   "ADPGK"   "ACSS1"  
## [64] "FBP2"    "LDHAL6B" "G6PC3"   "MINPP1" 
## 
## $proteome$`(KEGG) Citrate cycle (TCA cycle)`
##  [1] "CS"     "DLAT"   "DLD"    "DLST"   "FH"     "IDH1"   "IDH2"   "IDH3A" 
##  [9] "IDH3B"  "IDH3G"  "MDH1"   "MDH2"   "ACLY"   "ACO1"   "OGDH"   "ACO2"  
## [17] "PC"     "PDHA1"  "PDHA2"  "PDHB"   "OGDHL"  "SDHA"   "SDHB"   "SDHC"  
## [25] "SDHD"   "SUCLG2" "SUCLG1" "SUCLA2" "PCK1"   "PCK2"  
## 
## 
## $metabolome
## $metabolome$`(KEGG) Glycolysis / Gluconeogenesis`
##  [1] "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA"
## [16] "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA"
## 
## $metabolome$`(KEGG) Citrate cycle (TCA cycle)`
##  [1] "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA"
## [16] "NA" "NA" "NA" "NA" "NA"

2.4 Run the pathway enrichment

At this stage, we have the ranked features for each omics layer and the extracted and mapped features from external pathway databases. In the following step we can use the multiGSEA function to calculate the enrichment for all omics layer separately.

# use the multiGSEA function to calculate the enrichment scores
# for all omics layer at once.
enrichment_scores <- multiGSEA( pathways, omics_data)

The enrichment score data structure is a list containing sublists named transcriptome, proteome, and metabolome. Each sublist represents the complete pathway enrichment for this omics layer.

2.5 Calculate the aggregated p-values

Making use of the Stouffers Z-method to combine multiple p-values that have been derived from independent tests that are based on the same null hypothesis. The function extractPvalues creates a simple data frame where each row represents a pathway and columns represent omics related p-values and adjusted p-values. This data structure can then be used to calculate the aggregated p-value. The subsequent calculation of adjusted p-values can be achieved by the function p.adjust.

multiGSEA provided three different methods to aggregate p-values. These methods differ in their way how they weight either small or large p-values. By default, combinePvalues will apply the Z-method or Stouffer’s method (Stouffer et al. 1949) which has no bias towards small or large p-values. The widely used Fisher’s combined probability test (Fisher 1932) can also be applied but is known for its bias towards small p-values. Edgington’s method goes the opposite direction by favoring large p-values (Edgington 1972). Those methods can be applied by setting the parameter method to “fisher” or “edgington”.

df <- extractPvalues( enrichmentScores = enrichment_scores,
                      pathwayNames = names( pathways[[1]]))

df$combined_pval <- combinePvalues( df)
df$combined_padj <- p.adjust( df$combined_pval, method = "BH")

df <- cbind( data.frame( pathway = names( pathways[[1]])), df)

Finally, print the pathways sorted based on their combined adjusted p-values. For displaying reasons, only the adjusted p-values are shown in Table 3.

Table 3: Table summarizing the top 15 pathways where we can calculate an
enrichment for all three layers . Pathways from KEGG, Reactome, and BioCarta are listed based on their aggregated adjusted p-value. Corrected p-values are displayed for each omics layer separately and aggregated by means of the Stouffer’s Z-method.
pathway	transcriptome_padj	proteome_padj	metabolome_padj	combined_pval

3 Session Information

Here is the output of sessionInfo() on the system on which this document was compiled:

## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] magrittr_2.0.3       multiGSEA_1.6.0      org.Hs.eg.db_3.15.0 
## [4] AnnotationDbi_1.58.0 IRanges_2.30.0       S4Vectors_0.34.0    
## [7] Biobase_2.56.0       BiocGenerics_0.42.0  BiocStyle_2.24.0    
## 
## loaded via a namespace (and not attached):
##   [1] TH.data_1.1-1                 fgsea_1.22.0                 
##   [3] colorspace_2.0-3              ellipsis_0.3.2               
##   [5] XVector_0.36.0                bit64_4.0.5                  
##   [7] interactiveDisplayBase_1.34.0 fansi_1.0.3                  
##   [9] mvtnorm_1.1-3                 mathjaxr_1.6-0               
##  [11] codetools_0.2-18              splines_4.2.0                
##  [13] mnormt_2.0.2                  cachem_1.0.6                 
##  [15] knitr_1.38                    TFisher_0.2.0                
##  [17] jsonlite_1.8.0                dbplyr_2.1.1                 
##  [19] png_0.1-7                     graph_1.74.0                 
##  [21] shiny_1.7.1                   graphite_1.42.0              
##  [23] BiocManager_1.30.17           compiler_4.2.0               
##  [25] httr_1.4.2                    assertthat_0.2.1             
##  [27] Matrix_1.4-1                  fastmap_1.1.0                
##  [29] cli_3.3.0                     later_1.3.0                  
##  [31] htmltools_0.5.2               tools_4.2.0                  
##  [33] gtable_0.3.0                  glue_1.6.2                   
##  [35] GenomeInfoDbData_1.2.8        dplyr_1.0.8                  
##  [37] rappdirs_0.3.3                fastmatch_1.1-3              
##  [39] Rcpp_1.0.8.3                  jquerylib_0.1.4              
##  [41] vctrs_0.4.1                   Biostrings_2.64.0            
##  [43] multtest_2.52.0               xfun_0.30                    
##  [45] stringr_1.4.0                 rbibutils_2.2.8              
##  [47] mime_0.12                     lifecycle_1.0.1              
##  [49] qqconf_1.2.3                  AnnotationHub_3.4.0          
##  [51] zlibbioc_1.42.0               MASS_7.3-57                  
##  [53] zoo_1.8-10                    scales_1.2.0                 
##  [55] promises_1.2.0.1              parallel_4.2.0               
##  [57] sandwich_3.0-1                yaml_2.3.5                   
##  [59] curl_4.3.2                    memoise_2.0.1                
##  [61] gridExtra_2.3                 ggplot2_3.3.5                
##  [63] sass_0.4.1                    stringi_1.7.6                
##  [65] RSQLite_2.2.12                BiocVersion_3.15.2           
##  [67] highr_0.9                     mutoss_0.1-12                
##  [69] plotrix_3.8-2                 metaboliteIDmapping_1.0.0    
##  [71] filelock_1.0.2                BiocParallel_1.30.0          
##  [73] GenomeInfoDb_1.32.0           Rdpack_2.3                   
##  [75] rlang_1.0.2                   pkgconfig_2.0.3              
##  [77] bitops_1.0-7                  evaluate_0.15                
##  [79] lattice_0.20-45               purrr_0.3.4                  
##  [81] bit_4.0.4                     tidyselect_1.1.2             
##  [83] bookdown_0.26                 R6_2.5.1                     
##  [85] generics_0.1.2                multcomp_1.4-19              
##  [87] DBI_1.1.2                     withr_2.5.0                  
##  [89] pillar_1.7.0                  sn_2.0.2                     
##  [91] survival_3.3-1                KEGGREST_1.36.0              
##  [93] RCurl_1.98-1.6                tibble_3.1.6                 
##  [95] crayon_1.5.1                  utf8_1.2.2                   
##  [97] tmvnsim_1.0-2                 BiocFileCache_2.4.0          
##  [99] rmarkdown_2.14                grid_4.2.0                   
## [101] data.table_1.14.2             blob_1.2.3                   
## [103] metap_1.8                     digest_0.6.29                
## [105] xtable_1.8-4                  httpuv_1.6.5                 
## [107] numDeriv_2016.8-1.1           munsell_0.5.0                
## [109] bslib_0.3.1

References

Carlson, Marc. 2019a. Org.Hs.eg.db: Genome Wide Annotation for Human.

———. 2019b. Org.Mm.eg.db: Genome Wide Annotation for Mouse.

———. 2019c. Org.Rn.eg.db: Genome Wide Annotation for Rat.

Edgington, Eugene S. 1972. “An Additive Method for Combining Probability Values from Independent Experiments.” The Journal of Psychology 80 (2): 351–63.

Fisher, S R A. 1932. Statistical Methods for Research Workers - Revised and Enlarged. Edinburgh, London.

Joanna Zyla, January Weiner, Michal Marczyk, and Joanna Polanska. 2017. “Ranking Metrics in Gene Set Enrichment Analysis: Do They Matter?” BMC Bioinformatics 18 (256). https://doi.org/https://doi.org/10.1186/s12859-017-1674-0.

Pagès, Hervé, Marc Carlson, Seth Falcon, and Nianhua Li. 2019. AnnotationDbi: Manipulation of Sqlite-Based Annotations in Bioconductor.

Quirós, P M, M A Prado, N Zamboni, D D’Amico, R W Williams, D Finley, S P Gygi, and J Auwerx. 2017. “Multi-Omics Analysis Identifies ATF4 as a Key Regulator of the Mitochondrial Stress Response in Mammals.” J Cell Biol 216 (7): 2027–45. https://doi.org/10.1083/jcb.201702058.

Sales, Gabriele, Enrica Calura, and Chiara Romualdi. 2018. “metaGraphite - a New Layer of Pathway Annotation to Get Metabolite Networks.” Bioinformatics. https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty719/5090451.

Stouffer, Samuel A, Edward A Suchman, Leland C DeVinney, Shirley A Star, and Robin M Williams Jr. 1949. “The American Soldier: Adjustment During Army Life.(studies in Social Psychology in World War Ii), Vol. 1.”

multiGSEA: an example workflow

March 11, 2020

Package