0.1 Introduction

MetaPhOR was developed to enable users to assess metabolic dysregulation using transcriptomic-level data (RNA-sequencing and Microarray data) and produce publication-quality figures. A list of differentially expressed genes (DEGs), which includes fold change and p value, from DESeq2 (Love, Huber, and Anders (2014)) or limma (Ritchie et al. (2015)), can be used as input, with sample size for MetaPhOR, and will produce a data frame of scores for each KEGG pathway. These scores represent the magnitude and direction of transcriptional change within the pathway, along with estimated p-values (Rosario et al. (2018)). MetaPhOR then uses these scores to visualize metabolic profiles within and between samples through a variety of mechanisms, including: bubble plots, heatmaps, and pathway models.

0.2 Installation

This command line can be used to install and load MetaPhOR.

if (!require(“BiocManager”, quietly = TRUE))
install.packages(“BiocManager”) BiocManager::install(“MetaPhOR”)

library(MetaPhOR)

0.3 Data Preparation

Minimal data preparation is required to run MetaPhOR. DEGs may be loaded into R in the form of .csv or .tsv for use with this package. The DEG file must contain columns for log fold change, adjusted p-value, and HUGO gene names. By default, MetaPhOR assumes DESeq2 header stylings: “log2FoldChange” and “padj”. In any function that assumes these headers, however, the user can define column names for these values. Below is a sample DEG file resulting from limma:

exdegs <- read.csv(system.file("extdata/exampledegs.csv",
                                package = "MetaPhOR"),
                                header = TRUE)
X logFC AveExpr t P.Value adj.P.Val B
TATDN1 0.0996779 6.112478 6.074387 0e+00 0.0000523 10.736686
ZNF706 0.1055866 6.446976 5.845307 0e+00 0.0000958 9.237467
DCAF13 0.0882855 6.393943 5.513711 1e-07 0.0002918 7.531541
TRMT12 0.1093866 6.071099 5.434470 1e-07 0.0003550 7.377948
IGF2BP1 0.9256589 4.454204 5.631783 0e+00 0.0002065 7.273836
MAP6D1 0.1619178 5.719428 5.342911 1e-07 0.0004612 7.090578

0.4 Pathway Analysis

“pathwayAnalysis” first assigns scores and their absolute values using log fold change and p value to each gene (Rosario et al. (2018)). These transcript-level scores, along with sample size, are then utilized to calculate both scores (directional change) and absolute value scores (magnitude of change) (Rosario et al. (2018)) for each KEGG Pathway (Kanehisa and Goto (2000)). We then utilize a bootstrapping method, to randomly calculate 100,000 scores per pathway, based on the number of genes in that pathway and model the distribution. This distribution can then be used to evaluate where the actual score for that pathway sits in relation to the distribution, and can assign a p-value to the achieved score.

For example, if the polyamine biosynthetic pathway contains 13 genes, we can get a score for the sum of the 13 genes that exist within that pathway. Using bootstrapping, we can then randomly sample, with replacement, 13 genes to create scores, 100,000 times. We use these random samples (100,000) to generate a distribution, and we can calculate a p-value dependent on where the score that consists of the 13 genes that actually exist within the pathway falls within the distribution.

Taken together, the scores and p-values resulting from “pathwayAnalysis” provide a measure for both the biological and statistical significance of metabolic dysregulation.

Note: A seed MUST be set before utilizing this function to ensure reproducible results by bootstrapping. It is NECESSARY that the seed remain the same throughout an analysis.

Note: Bootrapping is performed, by default, with 100,000 iterations of resampling for optimal power. The number of iterations can be decreased for improved speed. We use 50,000 iterations for improved speed of examples.*

pathwayAnalysis() requires:

A partial output of the pathway analysis function can be seen as follows:

set.seed(1234)

brca <- pathwayAnalysis(DEGpath = system.file("extdata/BRCA_DEGS.csv",
                        package = "MetaPhOR"),
                        genename = "X",
                        sampsize = 1095,
                        iters = 50000,
                        headers = c("logFC", "adj.P.Val"))
Scores ABSScores ScorePvals ABSScorePvals
Cardiolipin.Metabolism 0.0293332 0.0299031 0.22996 0.60408
Cardiolipin.Biosynthesis 0.0031671 0.0031671 0.43074 0.91878
Cholesterol.Biosynthesis 0.2400430 0.2614076 0.06334 0.35952
Citric.Acid.Cycle 0.0314352 0.1469312 0.48340 0.92848
Cyclooxygenase.Arachidonic.Acid.Metabolism 0.0004634 0.0080444 0.52228 0.92286
Prostaglandin.Biosynthesis -0.0392858 0.1046875 0.79502 0.35816

0.5 bubblePlot

The metabolic profile determined by pathway analysis can be easily visualized using “bubblePlot.” Scores are plotted on the x-axis, while absolute value scores are plotted on the y-axis. Each point represents a KEGG pathway, where point size represents p-value (the smaller the p value, the larger the point) and point color is dictated by scores. Negative scores, which indicate transcriptional downregulation, are blue, and positive scores, which indicate transcriptional upregulation, are red. The top ten points, either by smallest p value or greatest dysregulation by score, are labeled with their pathway names. The plot demonstrates which pathways are the most statistically and biologically relevant.

bubblePlot() requires:

Bubble Plot Labeled by P Value

pval <- bubblePlot(scorelist = brca,
                    labeltext = "Pval",
                    labelsize = .85)
plot(pval)