The PoTRA analysis is based on topological ranks of genes in biological pathways. PoTRA can be used to detect pathways involved in disease (1). We use PageRank to measure the relative topological ranks of genes in each biological pathway, then select hub genes for each pathway, and use Fishers Exact test to determine if the number of hub genes in each pathway is altered from normal to cancer (1). Alternatively, if the distribution of topological ranks of gene in a pathway is altered between normal and cancer, this pathway might also be involved in cancer (1). Hence, we use the Kolmogorov–Smirnov test to detect pathways that have an altered distribution of topological ranks of genes between two phenotypes (1). PoTRA can be used with the KEGG, Reactome, SMPDB, Panther, PathBank and PharmGKB databases from the devel graphite library.
Most of the approaches for topology-based network and pathway analysis are based on different correlation-based metrics to identify differential networks between two different phenotypes. Generally, there are three main ways to compare networks for differential network analysis. The first approach handles weighted networks and uses some functions of the edge-specific weight differences as edge weights to construct differential networks (Hudson, Reverter & Dalrymple, 2009; Tesson, Breitling & Jansen, 2010; Liu et al., 2010; Rhinn et al., 2013). The second approach tries to find co-expressed gene sets and identify which correlation patterns are different between sets across conditions (Watson, 2006; Rahmatallah, Emmert-Streib & Glazko, 2014). This approach formulates summary measures that represent co-expression in a biological network and compares the metric between sets. The third approach compares the topology of biological networks across different phenotypes by using measures such as degree of nodes or modularity (Reverter et al., 2006; Zhang et al., 2009). However, the PoTRA method uses a topology-based metric to identify differential networks between two phenotypes. In addition to using different metrics, some of the other tools are based on correlation pattern of genes and identify groups of genes whose correlation patterns behave differentially across different datasets (Watson, 2006; Hudson, Reverter & Dalrymple, 2009; Tesson, Breitling & Jansen, 2010; Liu et al., 2010; Rhinn et al., 2013). Compared to these tools, PoTRA is directly based on topological ranks of genes and aims to identify pathways where the topological ranks of genes are different across datasets, which is more biologically intuitive. In this method, not only do we use correlation networks but we also use combined networks by taking intersected networks of correlation networks and KEGG curated pathways. Hence, when KEGG curated pathway information is employed, the topological rank-based PoTRA method can apply to the combined networks, while the previous correlation-based methods cannot, which is a limitation of the previously discussed correlation-only based methods. Regarding the previous tools based on topology (Reverter et al., 2006; Zhang et al., 2009), Zhang et al. focuses on identifying genes involved in topological changes, while PoTRA focuses on identifying pathways involved in topological changes. Also, Reverter et al. focuses on identifying genes with differential connectivity between two phenotypes, which is also different from PoTRA’s application scenario.
Although the above methods for differential network analysis can deal with some important biological questions, they are still limited. In general, they are based on a basic hypothesis that some connections between genes across the groups could be thought of as passenger events and other connections are unique to either one of groups and thus could be driver events that contribute to disease progression and development. Hence, they focus on the contribution of individual differential connections to disease. This results in several limitations. First, each differential connection is regarded by these methods to have an equal contribution to disease. However, it is well understood that loss of a connection between two hub genes from normal to disease is more deleterious than loss of a connection between two non-hub genes. Second, how differential connections (driver connections mentioned above) between pairs of genes are associated with diseases is still not very biologically intuitive, because how the dependency between genes contributes to diseases is usually little understood.
To address these problems, we propose a new PageRank-based method called Pathways of Topological Rank Analysis (PoTRA) to detect pathways associated with cancer. PageRank is an algorithm initially used by Google Search to rank websites in their search engine results (Page et al., 1999). It is a way of measuring the importance of nodes in a network. More generally, PageRank has been applied to other networks, e.g., social networks (Pedroche et al., 2013; Wang et al., 2013). To date, there have been several studies using PageRank for gene expression and network analysis (Morrison et al., 2005; Winter et al., 2012; Kimmel & Visweswaran, 2013; Hou & Ma, 2014; Bourdakou, Athanasiadis & Spyrou, 2016; Zeng et al., 2016; Ramsahai et al., 2017; Morshed Osmani & Rahman, 2007). These studies focus on ranking genes and discovering key driver genes in disease, and do not try to detect dysregulated pathways involved in disease. Other studies (Winter et al., 2012; Zeng et al., 2016) use PageRank to select topological important genes and simply see which pathways that these topological important genes are involved in. These PageRank-related approaches are very different from our approach.
Our approach embodied by PoTRA is motivated by the observation that the loss of connectivity is a common topological trait of cancer networks (Anglani et al., 2014), as well as the prior knowledge that a normal biological network includes a small number of well-connected hub nodes and a large number of nodes that are non-hubs (Albert, 2005; Khanin & Wit, 2006; Zhu, Gerstein & Snyder, 2007). However, from normal to cancer, the process of the network losing connectivity might be the process of disrupting the structure of the network, namely, the number of hub genes might be altered in cancer compared to that in normal or the distribution of topological ranks of genes might be altered. Thus, we hypothesize that if the number of hub genes is different in a pathway between normal and cancer, this pathway might be involved in cancer. Based on this hypothesis, we propose to detect pathways involved in cancer by testing if the number of hub genes for each pathway is different between normal and cancer samples.
The TCGA BRCA open access HTSEQ normalized gene expression data was used with PoTRA to determine if any pathways were significantly dysregulated between normal and case samples. In a 2018 PeerJ preprint the authors conducted a dispersion analysis to determine the thresholds for the number of normal and cases (2). The thresholds were: Normals - 50 samples, Cases - 50 samples. These sample sizes were associated with the smallest standard deviation in pathway ranks. This dispersion analysis is part of a larger collaborative project between the Mayo Clinic and Arizona State University that yielded a pan-cancer study of the TCGA data repository.
Installation from Bioconductor (recommended)
The most reliable way to install the package is to use the following Bioconductor method:
if (!requireNamespace(“BiocManager”, quietly = TRUE))
install.packages(“BiocManager”)
BiocManager::install(“PoTRA”, version = “3.9”)
The following examples utilize the KEGG, Reactome, PathBank and PharmGKB databases. Kanehisa Laboratories developed KEGG and has been updating it since 1995, it is an important resource for signaling and metabolic pathways (3). The PathBank pathway database reports pathways for every protein and a map for every metabolite and is supported by Dr. Wishart (Univ. Alberta and The Metabolomics Innovation Centre). Reactome is one of the most comprehensive databases for signaling and metabolic molecules and describes how they relate to pathways and processes (4). PharmGKB is a well known comprehensive resource and describes how genetic variants impact drug response. The PharmGKB database primarily serves as a important resource for metabolic data analyses.
library(PoTRA)
library(repmis)
options(warn=-1)
library(BiocGenerics)
library(graph)
library(graphite)
library(igraph)
Download the example dataset
source_data(“https://github.com/GenomicsPrograms/example_data/raw/master/PoTRA-vignette.RData”)
Choose your database before running the PoTRA commandline:
humanKEGG <- pathways(“hsapiens”, “kegg”)
Pathway.database = humanKEGG
Run the PoTRA program with KEGG:
results.KEGG <- PoTRA.corN(mydata=mydata, genelist=genelist, Num.sample.normal=113, Num.sample.case=113, Pathway.database=Pathway.database[1:15], PR.quantile=PR.quantile)
names(results.KEGG)
## [1] "Fishertest.p.value" "KStest.p.value"
## [3] "LengthOfPathway" "TheNumberOfHubGenes.normal"
## [5] "TheNumberOfHubGenes.case" "TheNumberOfEdges.normal"
## [7] "TheNumberOfEdges.case" "PathwayName"
head(results.KEGG$Fishertest.p.value)
## [1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.1038961
head(results.KEGG$KStest.p.value)
## [1] 0.9601715 0.6601010 0.8671189 0.6898871 0.8982138 0.7572491
head(results.KEGG$LengthOfPathway)
## [1] 48 23 22 25 24 18
head(results.KEGG$TheNumberOfHubGenes.normal)
## [1] 45 22 22 25 23 14
head(results.KEGG$TheNumberOfHubGenes.case)
## [1] 46 22 22 25 24 18
head(results.KEGG$TheNumberOfEdges.normal)
## [1] 408 189 91 60 147 38
head(results.KEGG$TheNumberOfEdges.case)
## [1] 135 89 33 40 41 16
head(results.KEGG$PathwayName)
## [1] "Glycolysis / Gluconeogenesis"
## [2] "Citrate cycle (TCA cycle)"
## [3] "Pentose phosphate pathway"
## [4] "Pentose and glucuronate interconversions"
## [5] "Fructose and mannose metabolism"
## [6] "Galactose metabolism"
library(PoTRA)
library(repmis)
options(warn=-1)
library(BiocGenerics)
library(graph)
library(graphite)
library(igraph)
Download the example dataset
source_data(“https://github.com/GenomicsPrograms/example_data/raw/master/PoTRA-vignette.RData”)
Choose your database before running the PoTRA commandline:
humanReactome <- pathways(“hsapiens”, “reactome”)
Pathway.database = humanReactome
Run the PoTRA program with Reactome:
results.KEGG <- PoTRA.corN(mydata=mydata, genelist=genelist, Num.sample.normal=113, Num.sample.case=113, Pathway.database=Pathway.database[1:15], PR.quantile=PR.quantile)
names(results.Reactome)
## [1] "Fishertest.p.value" "KStest.p.value"
## [3] "LengthOfPathway" "TheNumberOfHubGenes.normal"
## [5] "TheNumberOfHubGenes.case" "TheNumberOfEdges.normal"
## [7] "TheNumberOfEdges.case" "PathwayName"
head(results.Reactome$Fishertest.p.value)
## [1] 1.0000000 1.0000000 1.0000000 0.1150685 NA 1.0000000
head(results.Reactome$KStest.p.value)
## [1] 9.627040e-01 2.281308e-01 9.646735e-06 2.389407e-01 NA
## [6] 1.600695e-01
head(results.Reactome$LengthOfPathway)
## [1] 7 118 314 38 1 32
head(results.Reactome$TheNumberOfHubGenes.normal)
## [1] 6 118 314 38 NA 32
head(results.Reactome$TheNumberOfHubGenes.case)
## [1] 6 118 314 34 NA 32
head(results.Reactome$TheNumberOfEdges.normal)
## [1] 14 3862 18399 396 NA 216
head(results.Reactome$TheNumberOfEdges.case)
## [1] 6 1630 5321 125 NA 51
head(results.Reactome$PathwayName)
## [1] "Interleukin-6 signaling" "Apoptosis"
## [3] "Hemostasis" "Intrinsic Pathway for Apoptosis"
## [5] "PKB-mediated events" "PI3K Cascade"
library(PoTRA)
library(repmis)
options(warn=-1)
library(BiocGenerics)
library(graph)
library(graphite)
library(igraph)
Download the example dataset
source_data(“https://github.com/GenomicsPrograms/example_data/raw/master/PoTRA-vignette.RData”)
Choose your database before running the PoTRA commandline:
humanPathBank <- pathways(“hsapiens”, “pathbank”)
Pathway.database = humanPathBank
Run the PoTRA program with PathBank:
results.PathBank <- PoTRA.corN(mydata=mydata, genelist=genelist, Num.sample.normal=113, Num.sample.case=113, Pathway.database=Pathway.database[1:15], PR.quantile=PR.quantile)
names(results.PathBank)
## [1] "Fishertest.p.value" "KStest.p.value"
## [3] "LengthOfPathway" "TheNumberOfHubGenes.normal"
## [5] "TheNumberOfHubGenes.case" "TheNumberOfEdges.normal"
## [7] "TheNumberOfEdges.case" "PathwayName"
head(results.PathBank$Fishertest.p.value)
## [1] 1.0000000 1.0000000 1.0000000 1.0000000 0.5238095 1.0000000
head(results.PathBank$KStest.p.value)
## [1] 0.8441558 0.8441558 0.8441558 0.5196135 0.3571429 0.3760419
head(results.PathBank$LengthOfPathway)
## [1] 6 6 6 12 5 11
head(results.PathBank$TheNumberOfHubGenes.normal)
## [1] 5 5 5 12 4 10
head(results.PathBank$TheNumberOfHubGenes.case)
## [1] 6 6 6 12 2 11
head(results.PathBank$TheNumberOfEdges.normal)
## [1] 4 4 4 32 5 18
head(results.PathBank$TheNumberOfEdges.case)
## [1] 2 2 2 10 3 3
head(results.PathBank$PathwayName)
## [1] "Citrullinemia Type I"
## [2] "Carbamoyl Phosphate Synthetase Deficiency"
## [3] "Argininosuccinic Aciduria"
## [4] "Glycine and Serine Metabolism"
## [5] "Pterine Biosynthesis"
## [6] "Tyrosine Metabolism"
library(PoTRA)
library(repmis)
options(warn=-1)
library(BiocGenerics)
library(graph)
library(graphite)
library(igraph)
Download the example dataset
source_data(“https://github.com/GenomicsPrograms/example_data/raw/master/PoTRA-vignette.RData”)
Choose your database before running the PoTRA commandline:
humanPharmGKB <- pathways(“hsapiens”, “pharmgkb”)
Pathway.database = humanPharmGKB
Run the PoTRA program with PharmGKB:
results.PharmGKB <- PoTRA.corN(mydata=mydata, genelist=genelist, Num.sample.normal=113, Num.sample.case=113, Pathway.database=Pathway.database[1:15], PR.quantile=PR.quantile)
names(results.PharmGKB)
## [1] "Fishertest.p.value" "KStest.p.value"
## [3] "LengthOfPathway" "TheNumberOfHubGenes.normal"
## [5] "TheNumberOfHubGenes.case" "TheNumberOfEdges.normal"
## [7] "TheNumberOfEdges.case" "PathwayName"
head(results.PharmGKB$Fishertest.p.value)
## [1] 0.4814815 1.0000000 1.0000000 1.0000000 1.0000000 NA
head(results.PharmGKB$KStest.p.value)
## [1] 0.1359540 1.0000000 0.4740260 0.9440559 0.8095238 NA
head(results.PharmGKB$LengthOfPathway)
## [1] 14 8 6 7 5 2
head(results.PharmGKB$TheNumberOfHubGenes.normal)
## [1] 12 6 5 6 4 NA
head(results.PharmGKB$TheNumberOfHubGenes.case)
## [1] 14 8 6 7 5 NA
head(results.PharmGKB$TheNumberOfEdges.normal)
## [1] 19 3 5 10 5 NA
head(results.PharmGKB$TheNumberOfEdges.case)
## [1] 7 4 4 5 1 NA
head(results.PharmGKB$PathwayName)
## [1] "Statin Pathway - Generalized, Pharmacokinetics"
## [2] "Rosuvastatin Pathway, Pharmacokinetics"
## [3] "Warfarin Pathway, Pharmacodynamics"
## [4] "Tamoxifen Pathway, Pharmacokinetics"
## [5] "Platinium Pathway"
## [6] "Vinka Alkaloid Pathway, Pharmacokinetics"
Using the PharmGKB and the TCGA’s BRCA open-access HTSEQ normalized gene expression data, sample files were joined on their genelists and then randomly sampled to form ten cohorts of normal and case samples. The pre-processed datasets were then analyzed by PoTRA and the results (Fisher Exact Test P-value and Pathway List) were separately sorted (FE p-values least to greatest), additionally, each set of sorted PoTRA results were given a rank column (1:nrow(DF)) and then a single dataframe was created to hold a single column of PharmGKB pathways, each results sorted FE P-values and all of the rank columns.
The FE P-values columns were statistically averaged using the sumlog(x) function from the metap package. Next, the Ranks were averaged using rowMeans from the colr library.
Synthetic Data (Fisher’s Exact Test P-values)
FPvalues1 <- c(0.01,0.05,1,0.90,0.01,0.05,0.03)
FPvalues2 <- c(0.01,1,1,1,0.94,0.34,0.25)
FPvalues3 <- c(0.01,0.01,0.04,0.07,0.01,0.03,0.40)
FPvalues4 <- c(0.55,0.21,0.01,0.02,0.01,0.01,0.01)
Note: In each results DF from PoTRA, sort the Fisher’s Exact Test P-Values (decreasing = FALSE), so that the pathways for each individual results set,
remains associated with their corresponding Fisher’s Exact Test P-Values.
PharmGKB Pathways associated with each of the FPvalues
Pathways <- c(“Statin Pathway - Generalized, Pharmacokinetics”, “Atorvastatin/Lovastatin/Simvastatin Pathway”, “Pharmacokinetics”, “Pravastatin Pathway”, “Pharmacokinetics”, “Fluvastatin Pathway”, “Pharmacokinetics”)
In the PoTRA results data frame, after sorting by Fisher’s Exact Test P-Values, create a Rank column.
Ranks <- seq(1,nrow(DF),1)
Repeat this for each set of PoTRA results. The following output is from the dataFrame of combined, sorted and ranked PoTRA results.
DF
## Pathways FPvalues1 FPvalues2 FPvalues3
## 1 Statin Pathway - Generalized, Pharmacokinetics 0.01 0.01 0.01
## 2 Atorvastatin/Lovastatin/Simvastatin Pathway 0.01 0.25 0.01
## 3 Pharmacokinetics 0.03 0.34 0.01
## 4 Pravastatin Pathway 0.05 0.94 0.03
## 5 Pharmacokinetics 0.05 1.00 0.04
## 6 Fluvastatin Pathway 0.90 1.00 0.07
## 7 Pharmacokinetics 1.00 1.00 0.40
## FPvalues4 Rank1 Rank2 Rank3 Rank4
## 1 0.01 1 1 1 1
## 2 0.01 1 2 1 1
## 3 0.01 2 3 1 1
## 4 0.01 3 4 2 1
## 5 0.02 3 5 3 2
## 6 0.21 4 5 4 3
## 7 0.55 5 5 5 4
library(metap)
library(colr)
FE = cgrep(DF, “^FPvalues”)
data_FE <- as.data.frame(t(FE))
FE_SumLog <- t(sapply(data_FE, function(z)
sumlog(z)$p
))
FE_SumLogs <- data.frame(t(FE_SumLog))
names(FE_SumLogs) <- c(“SumLog_Pval”)
FE_SumLogs
## SumLog_Pval
## V1 1.230837e-05
## V2 1.793148e-04
## V3 5.585111e-04
## V4 4.325641e-03
## V5 9.419188e-03
## V6 3.726281e-01
## V7 9.325719e-01
Ranks = cgrep(DF, “^Rank”)
data_Ranks <- Ranks
DF$Average_Rank <- rowMeans(data_Ranks, na.rm = TRUE, dims = 1)
DF
## Pathways SumLog_Pval Average_Rank
## 1 Statin Pathway - Generalized, Pharmacokinetics 1.230837e-05 1.00
## 2 Atorvastatin/Lovastatin/Simvastatin Pathway 1.793148e-04 1.25
## 3 Pharmacokinetics 5.585111e-04 1.75
## 4 Pravastatin Pathway 4.325641e-03 2.50
## 5 Pharmacokinetics 9.419188e-03 3.25
## 6 Fluvastatin Pathway 3.726281e-01 4.00
## 7 Pharmacokinetics 9.325719e-01 4.75
sessionInfo()
## R version 4.2.0 RC (2022-04-21 r82226)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
##
## Random number generation:
## RNG: Mersenne-Twister
## Normal: Inversion
## Sample: Rounding
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] colr_0.1.900 metap_1.8 repmis_0.5
## [4] PoTRA_1.13.0 graphite_1.43.0 graph_1.75.0
## [7] igraph_1.3.1 org.Hs.eg.db_3.15.0 AnnotationDbi_1.59.0
## [10] IRanges_2.31.0 S4Vectors_0.35.0 Biobase_2.57.0
## [13] BiocGenerics_0.43.0 BiocStyle_2.25.0
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.2 sass_0.4.1 splines_4.2.0
## [4] bit64_4.0.5 jsonlite_1.8.0 qqconf_1.2.3
## [7] tmvnsim_1.0-2 R.utils_2.11.0 bslib_0.3.1
## [10] sn_2.0.2 Rdpack_2.3 BiocManager_1.30.17
## [13] blob_1.2.3 TFisher_0.2.0 GenomeInfoDbData_1.2.8
## [16] yaml_2.3.5 numDeriv_2016.8-1.1 RSQLite_2.2.12
## [19] lattice_0.20-45 digest_0.6.29 XVector_0.37.0
## [22] rbibutils_2.2.8 sandwich_3.0-1 htmltools_0.5.2
## [25] Matrix_1.4-1 R.oo_1.24.0 plyr_1.8.7
## [28] pkgconfig_2.0.3 bookdown_0.26 zlibbioc_1.43.0
## [31] mvtnorm_1.1-3 KEGGREST_1.37.0 TH.data_1.1-1
## [34] cachem_1.0.6 cli_3.3.0 mnormt_2.0.2
## [37] survival_3.3-1 magrittr_2.0.3 crayon_1.5.1
## [40] memoise_2.0.1 evaluate_0.15 R.methodsS3_1.8.1
## [43] MASS_7.3-57 R.cache_0.15.0 tools_4.2.0
## [46] data.table_1.14.2 multcomp_1.4-19 mutoss_0.1-12
## [49] stringr_1.4.0 plotrix_3.8-2 Biostrings_2.65.0
## [52] compiler_4.2.0 jquerylib_0.1.4 GenomeInfoDb_1.33.0
## [55] rlang_1.0.2 grid_4.2.0 RCurl_1.98-1.6
## [58] rappdirs_0.3.3 bitops_1.0-7 rmarkdown_2.14
## [61] multtest_2.53.0 codetools_0.2-18 DBI_1.1.2
## [64] curl_4.3.2 R6_2.5.1 zoo_1.8-10
## [67] knitr_1.38 fastmap_1.1.0 bit_4.0.4
## [70] mathjaxr_1.6-0 stringi_1.7.6 Rcpp_1.0.8.3
## [73] vctrs_0.4.1 png_0.1-7 xfun_0.30