Pathway fingerprint: a tool for biomarker discovery based on gene expression data and pathway knowledge

Introduction

Traditional analyses of gene expression data in disease studies usually compare disease samples with normal controls and report the most differentially expressed genes. On its own, this approach makes it hard to discover a disease's biomarkers and mechanisms. To compare complex diseases quantitatively, we implemented PFP, a pathway-based characterization of a disease state, as a package for the open scientific computing platform R. The package introduces the pathway fingerprint (PFP) method, which evaluates the importance of a gene set in different pathways so that researchers can focus on the most relevant pathways and genes. By overlaying fingerprints, it can also be used to compare and interpret different diseases visually.

We collected three types of gene expression data, performed enrichment analysis on KEGG pathways, and compared the results with those of other methods. The results indicated that Pathway Fingerprint performed better than the other enrichment tools: it picked out the most relevant pathways and remained stable when the data changed. In short, Pathway Fingerprint is a novel, general and systematic method that uses pathway topology to help researchers focus on the critical pathways and genes.

The three main features of PFP:

Installation

PFP requires the following packages: graph, igraph, KEGGgraph, clusterProfiler, ggplot2, plyr, tidyr, magrittr, stats, methods and utils. Some of these packages are available only from Bioconductor. You can also install the latest development version from GitHub, which requires the devtools package; if it is not already on your system, install it with install.packages("devtools"). Note that devtools sometimes needs extra non-R software on your system: Rtools on Windows or Xcode on OS X. More information is available in the devtools documentation. You can install PFP via Bioconductor.
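A minimal sketch of the Bioconductor route, assuming a Bioconductor release that carries PFP (the session information in this vignette corresponds to Bioconductor 3.15) and a working network connection:

```r
# Install BiocManager from CRAN if it is missing, then install PFP.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("PFP")
```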

You can also install PFP from GitHub with devtools::install_github().

The analysis also requires the org.Hs.eg.db annotation package, which is installed from Bioconductor.
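Since org.Hs.eg.db is a Bioconductor annotation package, one way to install it, assuming BiocManager suits your setup, is:

```r
# org.Hs.eg.db holds the human gene annotation used during the analysis.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("org.Hs.eg.db")
```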

After installation, PFP can be loaded into the current workspace with library(PFP).

Analysis Pipeline: Get pathway networks from KEGG

In our method, we use KEGG (http://www.kegg.jp/) pathway networks as the reference for generating a Pathway Fingerprint. KEGG provides KGML files of its pathways, which enable automatic drawing of KEGG pathways and support computational analysis and modeling of gene/protein networks and chemical networks. We downloaded the latest (2020.11.8) KGML files of all human pathways in KEGG and translated them into networks. This gave a total of 338 pathway networks for further analysis.
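As an illustration of turning a KGML file into a network, the sketch below parses the sample KGML file that ships with KEGGgraph (hsa04010, the MAPK signaling pathway) into a graph object. It assumes the KEGGgraph and graph packages are installed, and it is not necessarily the exact translation step that PFP performs internally.

```r
# Skip quietly when the required packages are not installed.
if (requireNamespace("KEGGgraph", quietly = TRUE) &&
    requireNamespace("graph", quietly = TRUE)) {

    # Sample KGML file bundled with KEGGgraph (no download needed).
    kgml_file <- system.file("extdata/hsa04010.xml", package = "KEGGgraph")

    # Parse the KGML file into a graphNEL object whose nodes are genes.
    mapk_graph <- KEGGgraph::parseKGML2Graph(kgml_file,
                                             expandGenes = TRUE,
                                             genesOnly = TRUE)

    cat("nodes:", graph::numNodes(mapk_graph), "\n")
    cat("edges:", graph::numEdges(mapk_graph), "\n")
}
```

Repeating this for every human pathway file would give one network per pathway, which is the shape of reference collection described above.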

Identification of DEGs (differentially expressed genes)

Different methods are used to identify DEGs in the three types of data. We processed the microarray data with limma. We also selected cancer samples of the same cancer type and compared them with the control group using edgeR. In both limma and edgeR, we kept only the genes whose log2 fold change (logFC) was greater than 1 and whose false discovery rate (FDR) was less than 0.05.
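The thresholding step can be sketched in base R. The table below is made-up toy data whose column names mimic the output of limma::topTable(); `select_degs()` is a hypothetical helper, and applying the fold-change cutoff to the absolute value is an assumption, since the text does not say whether down-regulated genes are kept.

```r
# Toy results table; the values are invented for illustration only.
toy_results <- data.frame(
    gene      = c("g1", "g2", "g3", "g4", "g5"),
    logFC     = c(2.3, -1.8, 0.4, 1.2, -0.9),
    adj.P.Val = c(0.001, 0.020, 0.003, 0.200, 0.010),
    stringsAsFactors = FALSE
)

# Keep genes that pass both thresholds.
# Assumption: the logFC cutoff is applied to the absolute value.
select_degs <- function(results, lfc_cutoff = 1, fdr_cutoff = 0.05) {
    keep <- abs(results$logFC) > lfc_cutoff &
        results$adj.P.Val < fdr_cutoff
    results[keep, , drop = FALSE]
}

degs <- select_degs(toy_results)
print(degs$gene)
```

With edgeR output, the same filter would read the FDR column instead of adj.P.Val.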

We defined a new S4 class, PFP, to store the scores. It provides six major methods:

  1. genes_score(): Get the gene scores for a specified group or pathway.
  2. sub_PFP(): Select a portion of the PFP by group, slice, pathway name, or ID.
  3. show(): Display the network group names, group sizes, and the PFP score of each pathway.
  4. plot_PFP(): Plot the pathway fingerprint.
  5. refnet_names(): Extract the group names of the reference networks.
  6. rank_PFP(): Rank the pathways, first by p-value and then by PFP score.

For detailed instructions on these six methods, refer to the package function help.
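The idea of an S4 class that stores per-pathway scores and ranks them by p-value and then by score can be sketched with the methods package alone. `ToyPFP` and `rank_toy()` are hypothetical names invented for this illustration; they are not the package's class definition or the implementation of rank_PFP().

```r
library(methods)

# Hypothetical stand-in for the PFP class: one row per pathway.
setClass("ToyPFP", slots = c(stats = "data.frame"))

setGeneric("rank_toy", function(object) standardGeneric("rank_toy"))

# Order by ascending p-value, breaking ties by descending score.
setMethod("rank_toy", "ToyPFP", function(object) {
    s <- object@stats
    object@stats <- s[order(s$p_value, -s$score), , drop = FALSE]
    object
})

# Made-up scores and p-values for three pathways.
toy <- new("ToyPFP", stats = data.frame(
    pathway = c("hsa_A", "hsa_B", "hsa_C"),
    score   = c(0.40, 0.90, 0.70),
    p_value = c(0.01, 0.01, 0.20),
    stringsAsFactors = FALSE
))

ranked <- rank_toy(toy)
print(ranked@stats$pathway)
```

Here hsa_A and hsa_B share the same p-value, so the higher score places hsa_B first.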

We also defined a new S4 class, PFPRefnet, to store the KEGG reference pathway network information. It provides six methods:

  1. network(): The KEGG reference pathway networks.
  2. net_info(): Pathway information.
  3. group(): Group information.
  4. refnet_names(): The names used to access the reference networks.
  5. subnet(): Select a portion of the PFPRefnet by group, slice, pathway name, or ID.
  6. show(): Show the number of pathways in each group of the reference network.
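To make the grouping and subsetting behaviour concrete, here is a toy container with a show() method that reports the number of pathways per group and a helper that selects one group. `ToyRefnet` and `toy_subnet()` are hypothetical names and the pathways are made up; the real PFPRefnet class stores full networks and may be organised differently.

```r
library(methods)

# Hypothetical stand-in for PFPRefnet: pathway names and their group labels.
setClass("ToyRefnet", slots = c(pathway = "character", group = "character"))

# show(): report how many pathways each group holds.
setMethod("show", "ToyRefnet", function(object) {
    counts <- table(object@group)
    for (g in names(counts)) cat(g, ":", counts[[g]], "pathways\n")
})

# Select the pathways of one group, analogous in spirit to subnet().
toy_subnet <- function(object, group_name) {
    keep <- object@group == group_name
    new("ToyRefnet",
        pathway = object@pathway[keep],
        group   = object@group[keep])
}

refnet <- new("ToyRefnet",
    pathway = c("p1", "p2", "p3", "p4"),
    group   = c("Metabolism", "Metabolism",
                "Human Diseases", "Cellular Processes"))

refnet
metab <- toy_subnet(refnet, "Metabolism")
```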

PFP score calculation

  1. Input the list of differentially expressed genes.
  2. Extract the network information of the KEGG reference pathways.
  3. Calculate the pathway fingerprint score and convert it to PFP format.

The PFP can then be calculated from these inputs.
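The three steps might look like the sketch below. It rests on assumptions not confirmed by this text: that the package ships example data named `gene_list_hsa` and `PFPRefnet_hsa`, and that the scoring function is `calc_PFP_score()` with `genes` and `PFPRefnet` arguments. Check the package help for the actual names and arguments before relying on it.

```r
# Runs only when PFP is installed.
if (requireNamespace("PFP", quietly = TRUE)) {
    library(PFP)

    # Steps 1 and 2: assumed example data shipped with the package.
    data("gene_list_hsa")   # example list of differentially expressed genes
    data("PFPRefnet_hsa")   # KEGG reference networks as a PFPRefnet object

    # Step 3: assumed scoring function; see its help page for all arguments.
    pfp_result <- calc_PFP_score(genes = gene_list_hsa,
                                 PFPRefnet = PFPRefnet_hsa)

    # Display group names, group sizes, and scores via the show() method.
    show(pfp_result)
}
```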

We then study the target pathway, that is, the pathway with the highest score after ranking.

Network fingerprint visualization

PFP provides the plot_PFP() function to visualize the network fingerprint of a single query network. The fingerprint can be plotted in its original pathway order or, after ranking, with the scores ordered from high to low.

Session Information

The versions of R and of the packages loaded when generating this vignette were:

#> R version 4.2.0 RC (2022-04-19 r82224)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] org.Hs.eg.db_3.15.0  AnnotationDbi_1.58.0 IRanges_2.30.0      
#> [4] S4Vectors_0.34.0     Biobase_2.56.0       BiocGenerics_0.42.0 
#> [7] PFP_1.4.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] fgsea_1.22.0           colorspace_2.0-3       ggtree_3.4.0          
#>   [4] ellipsis_0.3.2         qvalue_2.28.0          XVector_0.36.0        
#>   [7] aplot_0.1.3            farver_2.1.0           graphlayouts_0.8.0    
#>  [10] ggrepel_0.9.1          bit64_4.0.5            scatterpie_0.1.7      
#>  [13] fansi_1.0.3            splines_4.2.0          cachem_1.0.6          
#>  [16] GOSemSim_2.22.0        knitr_1.38             polyclip_1.10-0       
#>  [19] jsonlite_1.8.0         GO.db_3.15.0           png_0.1-7             
#>  [22] graph_1.74.0           ggforce_0.3.3          compiler_4.2.0        
#>  [25] httr_1.4.2             lazyeval_0.2.2         assertthat_0.2.1      
#>  [28] Matrix_1.4-1           fastmap_1.1.0          cli_3.3.0             
#>  [31] tweenr_1.0.2           formatR_1.12           htmltools_0.5.2       
#>  [34] tools_4.2.0            igraph_1.3.1           gtable_0.3.0          
#>  [37] glue_1.6.2             GenomeInfoDbData_1.2.8 reshape2_1.4.4        
#>  [40] DO.db_2.9              dplyr_1.0.8            fastmatch_1.1-3       
#>  [43] Rcpp_1.0.8.3           enrichplot_1.16.0      jquerylib_0.1.4       
#>  [46] vctrs_0.4.1            Biostrings_2.64.0      ape_5.6-2             
#>  [49] nlme_3.1-157           ggraph_2.0.5           xfun_0.30             
#>  [52] stringr_1.4.0          lifecycle_1.0.1        clusterProfiler_4.4.0 
#>  [55] XML_3.99-0.9           DOSE_3.22.0            zlibbioc_1.42.0       
#>  [58] MASS_7.3-57            scales_1.2.0           tidygraph_1.2.1       
#>  [61] parallel_4.2.0         KEGGgraph_1.56.0       RColorBrewer_1.1-3    
#>  [64] yaml_2.3.5             memoise_2.0.1          gridExtra_2.3         
#>  [67] ggplot2_3.3.5          downloader_0.4         ggfun_0.0.6           
#>  [70] yulab.utils_0.0.4      sass_0.4.1             stringi_1.7.6         
#>  [73] RSQLite_2.2.12         highr_0.9              tidytree_0.3.9        
#>  [76] BiocParallel_1.30.0    GenomeInfoDb_1.32.0    rlang_1.0.2           
#>  [79] pkgconfig_2.0.3        bitops_1.0-7           evaluate_0.15         
#>  [82] lattice_0.20-45        purrr_0.3.4            labeling_0.4.2        
#>  [85] treeio_1.20.0          patchwork_1.1.1        shadowtext_0.1.2      
#>  [88] bit_4.0.4              tidyselect_1.1.2       plyr_1.8.7            
#>  [91] magrittr_2.0.3         R6_2.5.1               generics_0.1.2        
#>  [94] DBI_1.1.2              pillar_1.7.0           KEGGREST_1.36.0       
#>  [97] RCurl_1.98-1.6         tibble_3.1.6           crayon_1.5.1          
#> [100] utf8_1.2.2             rmarkdown_2.14         viridis_0.6.2         
#> [103] grid_4.2.0             data.table_1.14.2      blob_1.2.3            
#> [106] Rgraphviz_2.40.0       digest_0.6.29          tidyr_1.2.0           
#> [109] gridGraphics_0.5-1     munsell_0.5.0          viridisLite_0.4.0     
#> [112] ggplotify_0.1.0        bslib_0.3.1