1 Introduction

Pathway Integrated Regression-based Kernel Association Test (PaIRKAT) is a model framework for assessing statistical relationships between networks and some outcome of interest while adjusting for potential confounders and covariates.

The PaIRKAT method is motivated by the analysis of networks of metabolites from a metabolomics assay and the relationship of those networks with a phenotype or clinical outcome of interest, though the method can be generalized to other domains.

PaIRKAT queries the KEGG database to determine interactions between metabolites from which network connectivity is constructed. This model framework improves testing power on high dimensional data by including graph topography in the kernel machine regression setting. Studies on high dimensional data can struggle to include the complex relationships between variables. The semi-parametric kernel machine regression model is a powerful tool for capturing these types of relationships. They provide a framework for testing for relationships between outcomes of interest and high dimensional data such as metabolomic, genomic, or proteomic pathways. PaIRKAT uses known biological connections between high dimensional variables by representing them as edges of ‘graphs’ or ‘networks.’ It is common for nodes (e.g. metabolites) to be disconnected from all others within the graph, which leads to meaningful decreases in testing power whether or not the graph information is included. We include a graph regularization or ‘smoothing’ approach for managing this issue.

3 Load pairkat library and example data

PaIRKAT comes with an example data set to use as a reference when formatting data inputs to its functions. This synthetic dataset called “smokers” contains a set of human subjects with phenotype variables related to lung health among smokers. Subjects have associated metabolomics assay data that are linked to KEGG pathway database IDs.

4 Create a summarized experiment object

Data need to be added to a SummarizedExperiment class object before they can be analyzed with PaIRKAT. There are three components of the SummarizedExperiment that need to be added:

  1. Phenotype Data: Contains outcomes of interest and any meaningful covariates to be adjusted for. Rows should be subjects and columns should be variables.

  2. Pathway Data: Maps measured metabolites names to KEGG IDs in order to query the KEGG database. Rows are metabolite names and one column should contain KEGG IDs.

  3. Metabolite Assay: Contains measurements of metabolites for all subjects. Rows should be subjects and columns should be metabolite names. One column should have subject IDs matching subject IDs in clinical data.

We will walk through each of these data and discuss their formats.

4.1 Phenotype Data

Phenotype data should be formatted such that subject ID’s are row names and variables of interest are column names. Categorical variables should be converted to dummy variables (i.e. one-hot-encoded).

In our example, we will be working with a data set containing measures related to lung health among smokers. Variables in this data set include:

log_FEV1_FVC_ratio: [log(FEV1/FVC)] A measure of lung capacity. The FEV1/FVC ratio is the ratio of the forced expiratory volume in the first one second to the forced vital capacity of the lungs. This measure has been log transformed.

high_log_FEV1_FVC_ratio_binary: A binarized transformation of log_FEV1_FVC_ratio which indicates whether log_FEV1_FVC_ratio is “high” with a 1, otherwise 0.

race_binary: A binary indicator of race.

age: Age in years

bmi: Body Mass Index

smoking_status: Binary indicator of whether a subject is a current smoker (1) or not (0)

pack_years: A way to measure the amount a person has smoked over a long period of time. It is calculated by multiplying the number of packs of cigarettes smoked per day by the number of years the person has smoked.

The example data are already packaged as a SummarizedExperiment. We will extract each of the components to look at their structures. Phenotype data are saved in the colData partition of the SummarizedExperiment can can be extracted as follows:

## DataFrame with 6 rows and 7 columns
##    log_FEV1_FVC_ratio high_log_FEV1_FVC_ratio_binary race_binary       age
##             <numeric>                      <numeric>   <numeric> <numeric>
## S1           0.838751                              1           1      76.0
## S2          -0.163348                              0           1      68.6
## S3          -0.283701                              0           1      51.5
## S4          -0.997419                              0           1      69.3
## S5           1.114170                              1           1      60.7
## S6          -0.366500                              0           1      75.2
##          bmi smoking_status pack_years
##    <numeric>      <numeric>  <numeric>
## S1     31.81              0       68.4
## S2     37.10              1       72.4
## S3     29.24              1       44.3
## S4     22.47              0       41.4
## S5     29.51              1       21.6
## S6     25.25              0       86.6

4.2 Pathway Data

Pathway data primarily functions to link metabolite names with KEGG ID’s. Row names should be metabolite names and one column should contain KEGG IDs. Other columns may be present if desired.

Pathway data are saved in the rowData partition of the SummarizedExperiment can can be extracted as follows:

## DataFrame with 6 rows and 2 columns
##                                  kegg_id compound_id
##                              <character>   <numeric>
## 1-arachidonoyl-GPC* (20:4)*       C05208       33228
## 1-arachidonylglycerol (20:4)      C13857       34397
## 1-linoleoyl-GPC (18:2)            C04100       34419
## 1-methyl-4-imidazoleacetate       C05828       32350
## 1-methyladenosine                 C02494       15650
## 1-methylhistidine                 C01152       30460

4.3 Metabolite Assay

Metabolite assay data should have metabolite names as row names (matching pathway data names) and subject IDs as column names.

Metabolite Assay data are saved in the assays partition of the SummarizedExperiment. Multiple assays can be stored in the SummarizedExperiment object. PaIRKAT relies on the assay to be named “metabolomics”. The assay data can can be extracted as follows:

##                                    S1       S2       S3
## 1-arachidonoyl-GPC* (20:4)*  17.36150 18.32505 16.70162
## 1-arachidonylglycerol (20:4) 11.52322 12.03081 13.40373
## 1-linoleoyl-GPC (18:2)       16.68326 19.04595 15.93618
## 1-methyl-4-imidazoleacetate  13.89287 15.59214 14.69882
## 1-methyladenosine            15.83022 14.67049 13.31189
## 1-methylhistidine            13.04124 11.91380 11.84175

4.4 Create the Summarized Experiment Object

Once the three components are formatted correctly, it is straightforward to add these dataframes to the SummarizedExperiment object with the SummarizedExperiment function.

In this example, we re-save our SummarizedExperiment data to an object called smokers

5 GatherNetworks Function

GatherNetworks queries the KEGG API to discover molecular interactions and build a network graph between measured metabolites.

GatherNetworks takes 4 arguments

SE - a SummarizedExperiment object

keggID - a string of the column name in pathway data containing KEGG IDs

species - a three letter species IDs can be found by running keggList("organism")

minPathwaySize - this argument filters KEGG pathways that contain fewer metabolites than the number specified

5.1 Get species ID

Our data were gathered from humans, so we will use the three letter species code “hsa”.

##      organism species                                              
## [1,] "hsa"    "Homo sapiens (human)"                               
## [2,] "ptr"    "Pan troglodytes (chimpanzee)"                       
## [3,] "pps"    "Pan paniscus (bonobo)"                              
## [4,] "ggo"    "Gorilla gorilla gorilla (western lowland gorilla)"  
## [5,] "pon"    "Pongo abelii (Sumatran orangutan)"                  
## [6,] "nle"    "Nomascus leucogenys (northern white-cheeked gibbon)"

6 PaIRKAT function

PaIRKAT can be used to get a kernel score for all pathways found in GatherNetworks. PaIRKAT takes 3 arguments:

formula.H0 - The null model in the “formula” format used in lm and glm functions.

networks - networks object obtained with GatherNetworks

tau - A parameter to control the amount of smoothing, analogous to a bandwidth parameter in kernel smoothing. We found 1 often gave reasonable results, as over-smoothing can lead to inflated Type I errors.

6.1 View Results

The formula call and results of the kernel test can be viewed by calling items from the PaIRKAT function output. In our example, the first result can be interpreted as follows:

Histidine metabolism has a significant relationship with log_FEV1_FVC_ratio when controlling for age (p = 0.0038). Similarly, Arginine biosynthesis has a significant relationship with log_FEV1_FVC_ratio when controlling for age (p = 0.0070). etc.

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:igraph':
## 
##     as_data_frame, groups, union
## The following object is masked from 'package:Biobase':
## 
##     combine
## The following objects are masked from 'package:GenomicRanges':
## 
##     intersect, setdiff, union
## The following object is masked from 'package:GenomeInfoDb':
## 
##     intersect
## The following objects are masked from 'package:IRanges':
## 
##     collapse, desc, intersect, setdiff, slice, union
## The following object is masked from 'package:matrixStats':
## 
##     count
## The following objects are masked from 'package:S4Vectors':
## 
##     first, intersect, rename, setdiff, setequal, union
## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## log_FEV1_FVC_ratio ~ age
##                                       pathway    Q.adj     p.value
## 1                        Histidine metabolism 1.289588 0.003775968
## 2                       Arginine biosynthesis 1.282351 0.007034843
## 3 Alanine, aspartate and glutamate metabolism 1.135580 0.026493023
## 4                              Bile secretion 1.048787 0.071780296
## 5                    Linoleic acid metabolism 1.085318 0.126369867
## 6                   Pentose phosphate pathway 0.953244 0.153105272

6.2 Visualize Networks

To look further into significant metabolic pathways, it is possible to plot pathway networks using the plotNetworks function and passing the networks object along with one of the pathway names as a string. To visualize all networks, pass “all” to the pathways argument.

7 Session Information

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] dplyr_1.0.7                 pairkat_1.0.0              
##  [3] data.table_1.14.2           igraph_1.2.7               
##  [5] SummarizedExperiment_1.24.0 Biobase_2.54.0             
##  [7] GenomicRanges_1.46.0        GenomeInfoDb_1.30.0        
##  [9] IRanges_2.28.0              MatrixGenerics_1.6.0       
## [11] matrixStats_0.61.0          S4Vectors_0.32.0           
## [13] BiocGenerics_0.40.0         KEGGREST_1.34.0            
## [15] BiocStyle_2.22.0           
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.1       xfun_0.27              bslib_0.3.1           
##  [4] purrr_0.3.4            lattice_0.20-45        vctrs_0.3.8           
##  [7] generics_0.1.1         htmltools_0.5.2        yaml_2.2.1            
## [10] utf8_1.2.2             rlang_0.4.12           jquerylib_0.1.4       
## [13] pillar_1.6.4           DBI_1.1.1              glue_1.4.2            
## [16] GenomeInfoDbData_1.2.7 lifecycle_1.0.1        stringr_1.4.0         
## [19] zlibbioc_1.40.0        Biostrings_2.62.0      evaluate_0.14         
## [22] knitr_1.36             fastmap_1.1.0          curl_4.3.2            
## [25] fansi_0.5.0            highr_0.9              Rcpp_1.0.7            
## [28] BiocManager_1.30.16    DelayedArray_0.20.0    magick_2.7.3          
## [31] jsonlite_1.7.2         XVector_0.34.0         png_0.1-7             
## [34] digest_0.6.28          stringi_1.7.5          bookdown_0.24         
## [37] CompQuadForm_1.4.3     grid_4.1.1             tools_4.1.1           
## [40] bitops_1.0-7           magrittr_2.0.1         sass_0.4.0            
## [43] tibble_3.1.5           RCurl_1.98-1.5         crayon_1.4.1          
## [46] pkgconfig_2.0.3        ellipsis_0.3.2         Matrix_1.3-4          
## [49] assertthat_0.2.1       rmarkdown_2.11         httr_1.4.2            
## [52] R6_2.5.1               compiler_4.1.1