MetaPhOR

0.1 Introduction

MetaPhOR was developed to enable users to assess metabolic dysregulation using transcriptomic-level data (RNA-sequencing and Microarray data) and produce publication-quality figures. A list of differentially expressed genes (DEGs), which includes fold change and p value, from DESeq2 (Love, Huber, and Anders (2014)) or limma (Ritchie et al. (2015)), can be used as input, with sample size for MetaPhOR, and will produce a data frame of scores for each KEGG pathway. These scores represent the magnitude and direction of transcriptional change within the pathway, along with estimated p-values (Rosario et al. (2018)). MetaPhOR then uses these scores to visualize metabolic profiles within and between samples through a variety of mechanisms, including: bubble plots, heatmaps, and pathway models.

0.2 Installation

This command line can be used to install and load MetaPhOR.

if (!require(“BiocManager”, quietly = TRUE))
install.packages(“BiocManager”) BiocManager::install(“MetaPhOR”)

library(MetaPhOR)

0.3 Data Preparation

Minimal data preparation is required to run MetaPhOR. DEGs may be loaded into R in the form of .csv or .tsv for use with this package. The DEG file must contain columns for log fold change, adjusted p-value, and HUGO gene names. By default, MetaPhOR assumes DESeq2 header stylings: “log2FoldChange” and “padj”. In any function that assumes these headers, however, the user can define column names for these values. Below is a sample DEG file resulting from limma:

exdegs <- read.csv(system.file("extdata/exampledegs.csv",
                                package = "MetaPhOR"),
                                header = TRUE)

X	logFC	AveExpr	t	P.Value	adj.P.Val	B
TATDN1	0.0996779	6.112478	6.074387	0e+00	0.0000523	10.736686
ZNF706	0.1055866	6.446976	5.845307	0e+00	0.0000958	9.237467
DCAF13	0.0882855	6.393943	5.513711	1e-07	0.0002918	7.531541
TRMT12	0.1093866	6.071099	5.434470	1e-07	0.0003550	7.377948
IGF2BP1	0.9256589	4.454204	5.631783	0e+00	0.0002065	7.273836
MAP6D1	0.1619178	5.719428	5.342911	1e-07	0.0004612	7.090578

0.4 Pathway Analysis

“pathwayAnalysis” first assigns scores and their absolute values using log fold change and p value to each gene (Rosario et al. (2018)). These transcript-level scores, along with sample size, are then utilized to calculate both scores (directional change) and absolute value scores (magnitude of change) (Rosario et al. (2018)) for each KEGG Pathway (Kanehisa and Goto (2000)). We then utilize a bootstrapping method, to randomly calculate 100,000 scores per pathway, based on the number of genes in that pathway and model the distribution. This distribution can then be used to evaluate where the actual score for that pathway sits in relation to the distribution, and can assign a p-value to the achieved score.

For example, if the polyamine biosynthetic pathway contains 13 genes, we can get a score for the sum of the 13 genes that exist within that pathway. Using bootstrapping, we can then randomly sample, with replacement, 13 genes to create scores, 100,000 times. We use these random samples (100,000) to generate a distribution, and we can calculate a p-value dependent on where the score that consists of the 13 genes that actually exist within the pathway falls within the distribution.

Taken together, the scores and p-values resulting from “pathwayAnalysis” provide a measure for both the biological and statistical significance of metabolic dysregulation.

Note: A seed MUST be set before utilizing this function to ensure reproducible results by bootstrapping. It is NECESSARY that the seed remain the same throughout an analysis.

Note: Bootrapping is performed, by default, with 100,000 iterations of resampling for optimal power. The number of iterations can be decreased for improved speed. We use 50,000 iterations for improved speed of examples.*

pathwayAnalysis() requires:

The file path to the DEG list of interest
The name of the column containing HUGO gene names
The sample size of the DEG analysis
The number of iterations of resampling to be performed during bootstrapping
Correct headers for fold change and p value columns (as indicated above)

A partial output of the pathway analysis function can be seen as follows:

set.seed(1234)

brca <- pathwayAnalysis(DEGpath = system.file("extdata/BRCA_DEGS.csv",
                        package = "MetaPhOR"),
                        genename = "X",
                        sampsize = 1095,
                        iters = 50000,
                        headers = c("logFC", "adj.P.Val"))

	Scores	ABSScores	ScorePvals	ABSScorePvals
Cardiolipin.Metabolism	0.0293332	0.0299031	0.22996	0.60408
Cardiolipin.Biosynthesis	0.0031671	0.0031671	0.43074	0.91878
Cholesterol.Biosynthesis	0.2400430	0.2614076	0.06334	0.35952
Citric.Acid.Cycle	0.0314352	0.1469312	0.48340	0.92848
Cyclooxygenase.Arachidonic.Acid.Metabolism	0.0004634	0.0080444	0.52228	0.92286
Prostaglandin.Biosynthesis	-0.0392858	0.1046875	0.79502	0.35816

0.5 bubblePlot

The metabolic profile determined by pathway analysis can be easily visualized using “bubblePlot.” Scores are plotted on the x-axis, while absolute value scores are plotted on the y-axis. Each point represents a KEGG pathway, where point size represents p-value (the smaller the p value, the larger the point) and point color is dictated by scores. Negative scores, which indicate transcriptional downregulation, are blue, and positive scores, which indicate transcriptional upregulation, are red. The top ten points, either by smallest p value or greatest dysregulation by score, are labeled with their pathway names. The plot demonstrates which pathways are the most statistically and biologically relevant.

bubblePlot() requires:

The output of pathwayAnalysis(), as a data frame
An indication which values to use, in order to label points: either “Pval” or “LogFC”
Optional: Numeric value for point label text size (default = .25)

Bubble Plot Labeled by P Value

pval <- bubblePlot(scorelist = brca,
                    labeltext = "Pval",
                    labelsize = .85)
plot(pval)

Bubble Plot Labeled by LogFC

logfc <- bubblePlot(scorelist = brca,
                    labeltext = "LogFC",
                    labelsize = .85)
plot(logfc)

0.6 metaHeatmap

“metaHeatmap” provides a useful visualization for comparing metabolic profiles between groups, including only significantly dysregulated pathways, and highlighting those which are most highly changed. This function should be used when you have multiple groups/DEGs being compared, e.g. if you have 4 conditions all being compared to each other. This will not be useful if you have a single DEG list. This function can be used only when multiple DEG comparisons are scored by “pathwayAnalysis.” The absolute pathway scores are scaled across outputs and plotted via pheatmap, selecting only those which have absolute score p values below the level of significance.

Note: A heatmap cannot be produced if there are no pathways significantly dysregulated below the p value cut off.

metaHeatmap() requires:

A list of outputs from pathway analysis, as data frames
A character vector of names for labeling each output
Optional: The p value cut off to be used (default = 0.05)

##read in two additional sets of scores,
##run in the same manner as brca for comparison

ovca <- read.csv(system.file("extdata/OVCA_Scores.csv", package = "MetaPhOR"),
                header = TRUE,
                row.names = 1)
prad <- read.csv(system.file("extdata/PRAD_Scores.csv", package = "MetaPhOR"),
                header = TRUE,
                row.names = 1)

all.scores <- list(brca, ovca, prad)
names <- c("BRCA", "OVCA", "PRAD")

metaHeatmap(scorelist = all.scores,
            samplenames = names,
            pvalcut = 0.05)

0.7 cytoPath

“cytoPath” models metabolic pathways, sourced from WikiPathway (Martens et al. (2021)), using the transcriptional change of individual genes in the pathway. The resulting figure can be used to identify candidate genes for further study by providing a detailed look at transcriptional dysregulation within and between “pathwayAnalysis” outputs. Pathways of interest can be readily identified from “bubblePlot” or “metaHeatmap.” This function utilizes Cytoscape (Gustavsen et al. (2019)); the package RCy3 and a local instance of Cytoscape are required to render this plot. For details regarding legends, see the Cytoscape Legend Creator Tutorial (https://cytoscape.org/cytoscape-tutorials/protocols/legend-creator/#/ introduction).

cytoPath() requires:

The name of the pathway to be plotted
The file path to the DEG list of interest
The file path to which the figure will be saved
The name of the column containing HUGE gene names
Correct headers for fold change and p value columns (as indicated above)

cytoPath(pathway = "Tryptophan Metabolism",
            DEGpath = "BRCA_DEGS.csv",
            figpath = paste(getwd(), "BRCA_Tryptophan_Pathway", sep = "/"),
            genename = "X",
            headers = c("logFC", "adj.P.Val"))

0.8 pathwayList

The WikiPathways available within MetaPhOR have been restricted to metabolic pathways. The pathwayList function will provide a complete list of WikiPathways for mapping with “cytoPath.”

pathwayList()

wpid2name$name
Acetylcholine synthesis
Activation of vitamin K-dependent proteins
Adipogenesis
Aerobic glycolysis
Aflatoxin B1 metabolism
AGE/RAGE pathway

0.9 Conclusion

MetaPhOR contains five functions for the analysis of metabolic dysregulation using transcriptomic-level data and the production of publication-quality figures. This version of MetaPhOR, 0.99.0, is available at https://github.com/emilyise/MetaPhOR.

0.10 SessionInfo

sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] MetaPhOR_1.0.0   kableExtra_1.3.4 BiocStyle_2.26.0
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.2.2             tidyselect_1.2.0       RSQLite_2.2.18        
##   [4] AnnotationDbi_1.60.0   grid_4.2.1             BiocParallel_1.32.0   
##   [7] scatterpie_0.1.8       munsell_0.5.0          base64url_1.4         
##  [10] codetools_0.2-18       pbdZMQ_0.3-8           future_1.28.0         
##  [13] withr_2.5.0            colorspace_2.0-3       GOSemSim_2.24.0       
##  [16] Biobase_2.58.0         highr_0.9              knitr_1.40            
##  [19] uuid_1.1-0             rstudioapi_0.14        stats4_4.2.1          
##  [22] DOSE_3.24.0            listenv_0.8.0          labeling_0.4.2        
##  [25] repr_1.1.4             GenomeInfoDbData_1.2.9 polyclip_1.10-4       
##  [28] bit64_4.0.5            farver_2.1.1           pheatmap_1.0.12       
##  [31] downloader_0.4         parallelly_1.32.1      vctrs_0.5.0           
##  [34] treeio_1.22.0          generics_0.1.3         gson_0.0.9            
##  [37] ipred_0.9-13           xfun_0.34              R6_2.5.1              
##  [40] GenomeInfoDb_1.34.0    graphlayouts_0.8.3     RJSONIO_1.3-1.6       
##  [43] bitops_1.0-7           cachem_1.0.6           fgsea_1.24.0          
##  [46] gridGraphics_0.5-1     assertthat_0.2.1       scales_1.2.1          
##  [49] ggraph_2.1.0           nnet_7.3-18            enrichplot_1.18.0     
##  [52] gtable_0.3.1           globals_0.16.1         tidygraph_1.2.2       
##  [55] rlang_1.0.6            systemfonts_1.0.4      splines_4.2.1         
##  [58] lazyeval_0.2.2         RecordLinkage_0.4-12.3 BiocManager_1.30.19   
##  [61] yaml_2.3.6             reshape2_1.4.4         backports_1.4.1       
##  [64] qvalue_2.30.0          clusterProfiler_4.6.0  tools_4.2.1           
##  [67] lava_1.7.0             bookdown_0.29          ggplotify_0.1.0       
##  [70] ggplot2_3.3.6          gplots_3.1.3           jquerylib_0.1.4       
##  [73] ff_4.0.7               RColorBrewer_1.1-3     proxy_0.4-27          
##  [76] BiocGenerics_0.44.0    Rcpp_1.0.9             plyr_1.8.7            
##  [79] base64enc_0.1-3        zlibbioc_1.44.0        purrr_0.3.5           
##  [82] RCurl_1.98-1.9         rpart_4.1.19           viridis_0.6.2         
##  [85] cowplot_1.1.1          S4Vectors_0.36.0       ggrepel_0.9.1         
##  [88] fs_1.5.2               magrittr_2.0.3         data.table_1.14.4     
##  [91] magick_2.7.3           patchwork_1.1.2        evaluate_0.17         
##  [94] xtable_1.8-4           HDO.db_0.99.1          XML_3.99-0.12         
##  [97] IRanges_2.32.0         gridExtra_2.3          compiler_4.2.1        
## [100] tibble_3.1.8           RCy3_2.18.0            KernSmooth_2.23-20    
## [103] crayon_1.5.2           shadowtext_0.1.2       htmltools_0.5.3       
## [106] ggfun_0.0.7            tidyr_1.2.1            aplot_0.1.8           
## [109] DBI_1.1.3              tweenr_2.0.2           MASS_7.3-58.1         
## [112] Matrix_1.5-1           cli_3.4.1              uchardet_1.1.0        
## [115] parallel_4.2.1         evd_2.3-6.1            igraph_1.3.5          
## [118] pkgconfig_2.0.3        IRdisplay_1.1          xml2_1.3.3            
## [121] ggtree_3.6.0           svglite_2.1.0          bslib_0.4.0           
## [124] webshot_0.5.4          XVector_0.38.0         prodlim_2019.11.13    
## [127] rvest_1.0.3            ada_2.0-5              yulab.utils_0.0.5     
## [130] stringr_1.4.1          digest_0.6.30          graph_1.76.0          
## [133] Biostrings_2.66.0      rmarkdown_2.17         fastmatch_1.1-3       
## [136] tidytree_0.4.1         gtools_3.9.3           lifecycle_1.0.3       
## [139] nlme_3.1-160           jsonlite_1.8.3         viridisLite_0.4.1     
## [142] fansi_1.0.3            pillar_1.8.1           lattice_0.20-45       
## [145] KEGGREST_1.38.0        fastmap_1.1.0          httr_1.4.4            
## [148] survival_3.4-0         GO.db_3.16.0           glue_1.6.2            
## [151] png_0.1-7              bit_4.0.4              ggforce_0.4.1         
## [154] class_7.3-20           stringi_1.7.8          sass_0.4.2            
## [157] blob_1.2.3             caTools_1.18.2         memoise_2.0.1         
## [160] IRkernel_1.3           dplyr_1.0.10           e1071_1.7-12          
## [163] future.apply_1.9.1     ape_5.6-2

References

Gustavsen, Julia A, Shraddha Pai, Ruth Isserlin, Barry Demchak, and Alexander R Pico. 2019. “RCy3: Network Biology Using Cytoscape from Within R.” f1000Research 8 (1774). https://doi.org/10.12688/f1000research.20887.3.

Kanehisa, Minoru, and Susumu Goto. 2000. “KEGG: Kyoto Encyclopedia of Genes and Genomes.” Nucleic Acids Research 28 (1): 27–30. https://doi.org/10.1093/nar/28.1.27.

Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for Rna-Seq Data with Deseq2.” Genome Biology 15 (12): 550. https://doi.org/10.1186/s13059-014-0550-8.

Martens, Marvin, Ammar Ammar, Anders Riutta, Andra Waagmeester, Denise N Slenter, Kristina Hanspers, Ryan A Miller, et al. 2021. “WikiPathways: Connecting Communities.” Nucleic Acids Research 49 (D1): D613–621. https://doi.org/10.1093/nar/gkaa1024.

Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.

Rosario, S R, M D Long, H C Affronti, A M Rowsam, K H Eng, and D J Smirglia. 2018. “Pan-Cancer Analysis of Transcriptional Metabolic Dysregulation Using the Cancer Genome Atlas.” Nature Communications 9 (5330). https://doi.org/10.1038/s41467-018-07232-8.