Contents

1 Introduction

This package de-novo implements or adjusts the existing implementations of several different methods for topology-based pathway analysis of gene expression data from microarray and RNA-Seq technologies.

Traditionally, in pathway analysis differentially expressed genes are mapped to reference pathways derived from databases and relative enrichment is assessed. Methods of topology-based pathway analysis are the last generation of pathway analysis methods that take into account the topological structure of a pathway, which helps to increase specificity and sensitivity of the results.

This package implements eight topology-based pathway analysis methods that focus on identification of the pathways that are differentially affected between two condition. Each method is implemented as a single wrapper function which allows the user to call a method in a single command.

2 Setup

We start by loading the package.

library(ToPASeq)

3 Preparing the data

The input data are either normalized (count) data or gene expression data as well as pathway topological structure.

For the sake of simplicity, our package offers in each wrapper function a pre-processing step for RNA-seq normalization as implemented in edgeR and DESeq packages. If necessary, the functions also performs differential gene expression analysis through calling limma and DESeq2 packages. Additionally, the user can prepare the data by his preffered method and skip the built-in normalization and/or differential expression analysis.

Pathways and their topological structures are an important input for the analysis. They are represented as graphs \(G=(V,E)\), where \(V\) denotes a set of vertices or nodes represented by genes and \(E \subseteq V \times V\) is a set of edges between nodes (oriented or not, depending on the method) representing the interaction between genes. These structures can be downloaded from public databases such as KEGG or Biocarta or are available through other packages such as graphite.

ToPASeq is build upon graphite R-package were patways for multiple species and from multiple database can be obtained. In these pathways, protein complexes are expanded into cliques since it is assumed that all units from one complex interact with each other. A clique, from graph theory, is a subset of vertices such that every two vertices in the subset are connected by an edge. On the other hand, gene families are expanded into separate nodes with same incoming and/or outgoing edges, because they are believed to be interchangable. Additionaly, proteins and metabolites are differentiated in the pathway. Interactions between proteins are propagated through metabolites and viceversa.

4 Analysis of RNA-Seq data

For RNA-seq data, we consider transcriptome profiles of four primary human airway smooth muscle cell lines in two conditions: control and treatment with dexamethasone Himes et al., 2014.

We load the airway dataset

library(airway)
data(airway)

For further analysis, we only keep genes that are annotated to an ENSEMBL gene ID.

airSE <- airway[grep("^ENSG", rownames(airway)),]
dim(airSE)
## [1] 63677     8
assay(airSE)[1:4,1:4]
##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513
## ENSG00000000003        679        448        873        408
## ENSG00000000005          0          0          0          0
## ENSG00000000419        467        515        621        365
## ENSG00000000457        260        211        263        164

We also remove genes with zero counts.

airSE <- airway[rowSums(assay(airSE))>0,]
dim(airSE)
## [1] 33865     8
assay(airSE)[1:4,1:4]
##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513
## ENSG00000000003        679        448        873        408
## ENSG00000000419        467        515        621        365
## ENSG00000000457        260        211        263        164
## ENSG00000000460         60         55         40         35

We define the grouping variable

group <- ifelse(airway$dex == "trt", 1, 0)
table(group)
## group
## 0 1 
## 4 4

Here, we retrieve human KEGG pathways.

library(graphite)
pwys <- pathways(species="hsapiens", database="kegg")[1:5]
pwys
## KEGG pathways for hsapiens
## 5 entries, retrieved on 22-04-2020

As the airway dataset uses ENSEMBL gene IDs, but the nodes of the pathways are based on NCBI Entrez Gene IDs,

nodes(pwys[[1]])
##  [1] "ENTREZID:10327"  "ENTREZID:124"    "ENTREZID:125"    "ENTREZID:126"   
##  [5] "ENTREZID:127"    "ENTREZID:128"    "ENTREZID:130"    "ENTREZID:130589"
##  [9] "ENTREZID:131"    "ENTREZID:160287" "ENTREZID:1737"   "ENTREZID:1738"  
## [13] "ENTREZID:2023"   "ENTREZID:2026"   "ENTREZID:2027"   "ENTREZID:217"   
## [17] "ENTREZID:218"    "ENTREZID:219"    "ENTREZID:220"    "ENTREZID:2203"  
## [21] "ENTREZID:221"    "ENTREZID:222"    "ENTREZID:223"    "ENTREZID:224"   
## [25] "ENTREZID:226"    "ENTREZID:229"    "ENTREZID:230"    "ENTREZID:2538"  
## [29] "ENTREZID:2597"   "ENTREZID:26330"  "ENTREZID:2645"   "ENTREZID:2821"  
## [33] "ENTREZID:3098"   "ENTREZID:3099"   "ENTREZID:3101"   "ENTREZID:387712"
## [37] "ENTREZID:3939"   "ENTREZID:3945"   "ENTREZID:3948"   "ENTREZID:441531"
## [41] "ENTREZID:501"    "ENTREZID:5105"   "ENTREZID:5106"   "ENTREZID:5160"  
## [45] "ENTREZID:5161"   "ENTREZID:5162"   "ENTREZID:5211"   "ENTREZID:5213"  
## [49] "ENTREZID:5214"   "ENTREZID:5223"   "ENTREZID:5224"   "ENTREZID:5230"  
## [53] "ENTREZID:5232"   "ENTREZID:5236"   "ENTREZID:5313"   "ENTREZID:5315"  
## [57] "ENTREZID:55276"  "ENTREZID:55902"  "ENTREZID:57818"  "ENTREZID:669"   
## [61] "ENTREZID:7167"   "ENTREZID:80201"  "ENTREZID:83440"  "ENTREZID:84532" 
## [65] "ENTREZID:8789"   "ENTREZID:92483"  "ENTREZID:92579"  "ENTREZID:9562"

we first map the gene IDs in the pathways from ENTREZ IDs to ENSEMBL.

pwys<-graphite::convertIdentifiers(pwys,"ENSEMBL")
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:1 mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
nodes(pwys[[1]])
##  [1] "ENSEMBL:ENSG00000117448" "ENSEMBL:ENSG00000187758"
##  [3] "ENSEMBL:ENSG00000196616" "ENSEMBL:ENSG00000248144"
##  [5] "ENSEMBL:ENSG00000198099" "ENSEMBL:ENSG00000197894"
##  [7] "ENSEMBL:ENSG00000172955" "ENSEMBL:ENSG00000143891"
##  [9] "ENSEMBL:ENSG00000196344" "ENSEMBL:ENSG00000166800"
## [11] "ENSEMBL:ENSG00000150768" "ENSEMBL:ENSG00000091140"
## [13] "ENSEMBL:ENSG00000074800" "ENSEMBL:ENSG00000111674"
## [15] "ENSEMBL:ENSG00000108515" "ENSEMBL:ENSG00000111275"
## [17] "ENSEMBL:ENSG00000108602" "ENSEMBL:ENSG00000137124"
## [19] "ENSEMBL:ENSG00000184254" "ENSEMBL:ENSG00000165140"
## [21] "ENSEMBL:ENSG00000006534" "ENSEMBL:ENSG00000132746"
## [23] "ENSEMBL:ENSG00000143149" "ENSEMBL:ENSG00000072210"
## [25] "ENSEMBL:ENSG00000149925" "ENSEMBL:ENSG00000136872"
## [27] "ENSEMBL:ENSG00000109107" "ENSEMBL:ENSG00000131482"
## [29] "ENSEMBL:ENSG00000111640" "ENSEMBL:ENSG00000105679"
## [31] "ENSEMBL:ENSG00000106633" "ENSEMBL:ENSG00000105220"
## [33] "ENSEMBL:ENSG00000282019" "ENSEMBL:ENSG00000156515"
## [35] "ENSEMBL:ENSG00000159399" "ENSEMBL:ENSG00000160883"
## [37] "ENSEMBL:ENSG00000188316" "ENSEMBL:ENSG00000134333"
## [39] "ENSEMBL:ENSG00000288299" "ENSEMBL:ENSG00000111716"
## [41] "ENSEMBL:ENSG00000166796" "ENSEMBL:ENSG00000226784"
## [43] "ENSEMBL:ENSG00000164904" "ENSEMBL:ENSG00000124253"
## [45] "ENSEMBL:ENSG00000100889" "ENSEMBL:ENSG00000285241"
## [47] "ENSEMBL:ENSG00000131828" "ENSEMBL:ENSG00000163114"
## [49] "ENSEMBL:ENSG00000168291" "ENSEMBL:ENSG00000141959"
## [51] "ENSEMBL:ENSG00000152556" "ENSEMBL:ENSG00000067057"
## [53] "ENSEMBL:ENSG00000171314" "ENSEMBL:ENSG00000164708"
## [55] "ENSEMBL:ENSG00000102144" "ENSEMBL:ENSG00000170950"
## [57] "ENSEMBL:ENSG00000079739" "ENSEMBL:ENSG00000143627"
## [59] "ENSEMBL:ENSG00000262785" "ENSEMBL:ENSG00000067225"
## [61] "ENSEMBL:ENSG00000169299" "ENSEMBL:ENSG00000131069"
## [63] "ENSEMBL:ENSG00000152254" "ENSEMBL:ENSG00000278373"
## [65] "ENSEMBL:ENSG00000172331" "ENSEMBL:ENSG00000111669"
## [67] "ENSEMBL:ENSG00000156510" "ENSEMBL:ENSG00000159322"
## [69] "ENSEMBL:ENSG00000154930" "ENSEMBL:ENSG00000130957"
## [71] "ENSEMBL:ENSG00000171989" "ENSEMBL:ENSG00000141349"
## [73] "ENSEMBL:ENSG00000107789"

The methods described below can be applied on microarry data by setting argument type to "MA". All methods include also differential expression analysis which is perfomed by limma package. For RNA-Seq data, voom transformation is applied first and the user can change the method with arguments norm.method and test.method.

4.1 TopologyGSA

TopologyGSA represents a multivariable method in which the expression of genes is modelled with Gausian Graphical Models with covariance matrix reflecting the pathway topology. It uses the the Iterative Proportional Scaling algorithm to estimate the covariance matrices. The vectors of means are compared by Hotelling’s T2 test. The method has requirements on the minimal number of samples for each pathway. This method was first implemented in the TopologyGSA package. In here, we have optimalized its performance by using different function for obtaining cliques from each pathway.

The method can be used with a single command

top<-TopologyGSA(assay(airSE), group, pwys, type="RNASeq", nperm=10, norm.method="edgeR")
## Selected methods use directly expression data. Normalization may not be optimal.
## 168 node labels mapped to the expression data
## Average coverage 52.32891 %
## 0 (out of 5) pathways without a mapped node
## Analysing pathway:
## Glycolysis / Gluconeogenesis 
## Citrate cycle (TCA cycle) 
## Pentose phosphate pathway 
## Pentose and glucuronate interconversions 
## Fructose and mannose metabolism
## Warning: Zero sample variances detected, have been offset away from zero
res(top)
## $results
## named list()
## 
## $errors
##                                                                                     Glycolysis / Gluconeogenesis 
## "Error in .chooseVariance(shrink, n1, n2, maxCliqueSize): Both groups should have more than 19 rows (samples)\n" 
##                                                                                        Citrate cycle (TCA cycle) 
## "Error in .chooseVariance(shrink, n1, n2, maxCliqueSize): Both groups should have more than 11 rows (samples)\n" 
##                                                                                        Pentose phosphate pathway 
## "Error in .chooseVariance(shrink, n1, n2, maxCliqueSize): Both groups should have more than 20 rows (samples)\n" 
##                                                                         Pentose and glucuronate interconversions 
##  "Error in .chooseVariance(shrink, n1, n2, maxCliqueSize): Both groups should have more than 5 rows (samples)\n" 
##                                                                                  Fructose and mannose metabolism 
## "Error in .chooseVariance(shrink, n1, n2, maxCliqueSize): Both groups should have more than 18 rows (samples)\n"

Apart from the expected arguments: a gene expression data matrix, a vector of class labels and a list of pathways, the user needs to specify the type argument which decides on the type of the data ("RNASeq" for RNA-Seq data). The others arguments are optional. The perms argument sets the number of permutations to be used in the statistical tests. By default both mean and variance tests are run, this can be changed to only variance test by setting method="var". Arguments convertTo and convertBy control the conversion of the node labels in the pathways. The default setting is convertTo="none" which performs no conversion. Please note, that the node labels should be the same as the rownames of gene expression data matrix. The threshold for variance test is specified with alpha argument. The implementation allows also testing of all the cliques present in the graph by setting testCliques=TRUE. Please note that these tests may take quite a long time. The implementation returns also a gene-level statistics of the differential expression of genes performed via ‘limma’ package.

4.2 DEGraph

Another multivariable method implemented in the package is DEGraph. This method assumes the same direction in the differential expresion of genes belonging to a pathway. It performs the regular Hotelling’s T2 test in the graph-Fourier space restricted to its first \(k\) components which is more powerful than test in the full graph-Fourier space or in the original space.

We apply the method with

deg<-DEGraph(assay(airSE), group, pwys, type="RNASeq", norm.method="edgeR")
## Selected methods use directly expression data. Normalization may not be optimal.
## 168 node labels mapped to the expression data
## Average coverage 52.32891 %
## 0 (out of 5) pathways without a mapped node
## Warning: Zero sample variances detected, have been offset away from zero
res(deg)
## $results
## $results[[1]]
##                                           Overall.p Overall.q.value Comp1.p
## Glycolysis / Gluconeogenesis                     NA              NA      NA
## Citrate cycle (TCA cycle)                0.19839160       0.2360371      NA
## Pentose phosphate pathway                0.23603709       0.2360371      NA
## Pentose and glucuronate interconversions 0.08248029       0.2360371      NA
## Fructose and mannose metabolism                  NA              NA      NA
##                                          Comp1.pFourier Comp1.k Comp2.p
## Glycolysis / Gluconeogenesis                         NA      13      NA
## Citrate cycle (TCA cycle)                    0.19839160       6      NA
## Pentose phosphate pathway                    0.23603709       6      NA
## Pentose and glucuronate interconversions     0.08248029       2      NA
## Fructose and mannose metabolism                      NA       7      NA
##                                          Comp2.pFourier Comp2.k
## Glycolysis / Gluconeogenesis                         NA      NA
## Citrate cycle (TCA cycle)                            NA      NA
## Pentose phosphate pathway                            NA      NA
## Pentose and glucuronate interconversions    0.000592163       2
## Fructose and mannose metabolism                      NA      NA
## 
## $results$graphs
##                                          Comp1.graph Comp2.graph
## Glycolysis / Gluconeogenesis             ?           NA         
## Citrate cycle (TCA cycle)                ?           NA         
## Pentose phosphate pathway                ?           NA         
## Pentose and glucuronate interconversions ?           ?          
## Fructose and mannose metabolism          ?           NA         
## 
## 
## $errors
## named list()

Apart from the expected arguments: a gene expression data matrix, a vector of class labels and a list of pathways, the user needs to specify the type argument which decides on the type of the data ("RNASeq" for RNA-Seq data). The others arguments are optional. Arguments convertTo and convertBy control the conversion of the node labels in the pathways. The default setting is convertTo="none" which performs no conversion. Please note, that the node labels should be the same as the rownames of gene expression data matrix. Since, the DEGraph method runs a statistical test for each connected component of a pathway, a method for assigning a global p-value for whole pathway is needed. The user can select from three approaches: the minimum, the mean and the p-value of the biggest component. This is specified via overall argument. The implementation returns also a gene-level statistics of the differential expression of genes performed via `limma package.

4.3 clipper

The last multivariable method available within this package is called clipper. This method is similar to the topologyGSA as it uses the same two-step approach. However, the Iterative Proportional Scaling algorithm was subsituted with a shrinkage procedure of James-Stein-type which additionally allows proper estimates also in the situation when number of samples is smaller than the number of genes in a pathway. The tests on a pathway-level are follwed with a search for the most affected path in the graph.

The method can be applied with

cli<-clipper(assay(airSE), group, pwys,type="RNASeq", method="mean", nperm=10, norm.method="edgeR")
## Selected methods use directly expression data. Normalization may not be optimal.
## 168 node labels mapped to the expression data
## Average coverage 52.32891 %
## 0 (out of 5) pathways without a mapped node
## Analysing pathway:
## Glycolysis / Gluconeogenesis
## Warning: 4 instances of variables with zero scale detected!

## Warning: 4 instances of variables with zero scale detected!
## Warning: 2 instances of variables with zero scale detected!

## Warning: 2 instances of variables with zero scale detected!
## Warning: 3 instances of variables with zero scale detected!

## Warning: 3 instances of variables with zero scale detected!
## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced
## Warning: 5 instances of variables with zero scale detected!

## Warning: 5 instances of variables with zero scale detected!
## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced
## Warning: 2 instances of variables with zero scale detected!

## Warning: 2 instances of variables with zero scale detected!
## Warning: 3 instances of variables with zero scale detected!

## Warning: 3 instances of variables with zero scale detected!
## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced
## Warning: 3 instances of variables with zero scale detected!

## Warning: 3 instances of variables with zero scale detected!
## Warning: 2 instances of variables with zero scale detected!

## Warning: 2 instances of variables with zero scale detected!
## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced
## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!
## Warning: 2 instances of variables with zero scale detected!

## Warning: 2 instances of variables with zero scale detected!
## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced
## Citrate cycle (TCA cycle)
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!
## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced
## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!
## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced
## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!
## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced
## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!
## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced
## Pentose phosphate pathway
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!
## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced
## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!
## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced
## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!
## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced
## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!
## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced
## Warning: 2 instances of variables with zero scale detected!

## Warning: 2 instances of variables with zero scale detected!
## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced
## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!
## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced
## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!
## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced
## Pentose and glucuronate interconversions
## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced
## Fructose and mannose metabolism
## Warning: 2 instances of variables with zero scale detected!
## Warning: 2 instances of variables with zero scale detected!

## Warning: 2 instances of variables with zero scale detected!

## Warning: 2 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!
## Warning: 2 instances of variables with zero scale detected!

## Warning: 2 instances of variables with zero scale detected!

## Warning: 2 instances of variables with zero scale detected!

## Warning: 2 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!
## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced
## Warning: 2 instances of variables with zero scale detected!

## Warning: 2 instances of variables with zero scale detected!
## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced
## Warning: 2 instances of variables with zero scale detected!

## Warning: 2 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!
## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced
## Warning: 1 instances of variables with zero scale detected!

## Warning: 1 instances of variables with zero scale detected!
## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced

## Warning in pf(t.obs, ngene, n1 + n2 - ngene - 1): NaNs produced
## Warning: Zero sample variances detected, have been offset away from zero
res(cli)$results[[1]]
##                                          pval     t.obs   type q.value
## Glycolysis / Gluconeogenesis                0 -137.7941 shrink   0.000
## Citrate cycle (TCA cycle)                   0 -14.16513 shrink   0.000
## Pentose phosphate pathway                   0 -19.08594 shrink   0.000
## Pentose and glucuronate interconversions  0.1 -25.12681 shrink   0.125
## Fructose and mannose metabolism           0.2 -21.30838 shrink   0.200

Apart from the expected arguments: a gene expression data matrix, a vector of class labels and a list of pathways, the user needs to specify the type argument which decides on the type of the data ("RNASeq" for RNA-Seq data). The others arguments are optional. Arguments convertTo and convertBy control the conversion of the node labels in the pathways. The default setting is convertTo="none" which performs no conversion. Please note, that the node labels should be the same as the rownames of gene expression data matrix. Also, both mean and variance tests are run, this can be changed to only variance test by setting method="var". The nperm controls the number of permutations in the statistical tests. Similarly as in topologyGSA, the implementation allows testing of all the cliques present in the graph by setting testCliques=TRUE. Please note that these tests may take quite a long time. The implementation returns also a gene-level statistics of the differential expression of genes performed via limma package.

The function returns two types of the results on pathway-level. The first (printed above), is a table of p-values and q-values related to the differential expression and concentration of the pathways.

4.4 SPIA

The most well-known topology-based pathway analysis method is SPIA. In there, two evidences of differential expression of a pathway are combined. The first evidence is a regular so called overrepresentation analysis in which the statistical significance of the number of differentially expressed genes belonging to a pathway is assessed. The second evidence reflects the pathway topology and it is called the pertubation factor. The authors assume that a differentially expressed gene at the begining of a pathway topology (e.g. a receptor in a signaling pathway) has a stronger effect on the functionality of a pathway than a differentially expressed gene at the end of a pathway (e.g. a transcription factor in a signaling pathway). The pertubation factors of all genes are calculated from a system of linear equations and then combined within a pathway. The two evidences in a form of p-values are finally combined into a global p-value, which is used to rank the pathways.

spi<-SPIA(assay(airSE), group,pwys , type="RNASeq", logFC.th=-1, test.method="edgeR")
## 168 node labels mapped to the expression data
## Average coverage 52.32891 %
## 0 (out of 5) pathways without a mapped node
res(spi)
## $results
##                                          pSize NDE  pNDE tA pPERT    pG  pGFdr
## Glycolysis / Gluconeogenesis                63  14 0.008  0    NA 0.008 0.0400
## Citrate cycle (TCA cycle)                   29   2 0.847  0    NA 0.847 0.8840
## Pentose phosphate pathway                   26   3 0.561  0    NA 0.561 0.8840
## Pentose and glucuronate interconversions    15   5 0.019  0    NA 0.019 0.0475
## Fructose and mannose metabolism             32   2 0.884  0    NA 0.884 0.8840
##                                          pGFWER    Status
## Glycolysis / Gluconeogenesis              0.040 Inhibited
## Citrate cycle (TCA cycle)                 1.000 Inhibited
## Pentose phosphate pathway                 1.000 Inhibited
## Pentose and glucuronate interconversions  0.095 Inhibited
## Fructose and mannose metabolism           1.000 Inhibited
## 
## $errors
## named list()

Apart from the expected arguments: a gene expression data matrix, a vector of class labels and a list of pathways, the user needs to specify the type argument which decides on the type of the data ("RNASeq" for RNA-Seq data). Alternatively, the user can supply the results of the differential expression analysis of genes instead of gene expression data matrix in two forms:

  1. a data.frame with columns: ID gene identifiers (they must match with the node labels), logFC** log fold-changes, t t-statistics, pval p-values, padj adjusted p-values. Then the user sets type to "DEtable"
  2. a list with two slots: named vector of log fold-changes of differentially expressed genes and a vector of names of all genes analysed. Then the user sets type to "DElist"

The others arguments are optional. Arguments convertTo and convertBy control the conversion of the node labels in the pathways. The default setting is convertTo="none" which performs no conversion. Please note, that the node labels should be the same as the rownames of gene expression data matrix. The default thresholds for the differential expression analysis of genes are set with arguments logFC.th and p.val.th. The user can omit one of these criteria by setting the agrument negative value, as is shown also in the example. The implementation returns also a gene-level statistics of the differential expression of genes. These statistics are later used in the visualization of a selected pathway.

4.5 TAPPA

TAPPA was among the first topology-based pathway analysis methods. It was inspired in chemointformatics and their models for predicting the structure of molecules. In TAPPA, the gene expression values are standardized and sigma-transformed within a samples. Then, a pathway is seen a molecule, individual genes as atoms and the energy of a molecule is a score defined for one sample. This score is called Pathway Connectivity Index. The difference of expression is assessed via a common univariable two sample test - Mann-Whitney in our implemetation.

tap<-TAPPA(assay(airSE), group, pwys, type="RNASeq", norm.method = "edgeR")
## Selected methods use directly expression data. Normalization may not be optimal.
## 168 node labels mapped to the expression data
## Average coverage 52.32891 %
## 0 (out of 5) pathways without a mapped node
## Warning: Zero sample variances detected, have been offset away from zero
res(tap)
## $results
##                                          X0.N   X0.Min. X0.1st.Qu. X0.Median
## Glycolysis / Gluconeogenesis                4 1.3252668  1.3446770 1.3680572
## Citrate cycle (TCA cycle)                   4 1.3106798  1.3151226 1.3250092
## Pentose phosphate pathway                   4 1.1067643  1.1171255 1.1222628
## Pentose and glucuronate interconversions    4 0.3439945  0.3469346 0.3491481
## Fructose and mannose metabolism             4 0.9195776  0.9232636 0.9245101
##                                            X0.Mean X0.3rd.Qu.   X0.Max. X1.N
## Glycolysis / Gluconeogenesis             1.3628562  1.3862364 1.3900435    4
## Citrate cycle (TCA cycle)                1.3243264  1.3342130 1.3366075    4
## Pentose phosphate pathway                1.1267043  1.1318416 1.1555273    4
## Pentose and glucuronate interconversions 0.3498780  0.3520915 0.3572214    4
## Fructose and mannose metabolism          0.9526196  0.9538662 1.0418808    4
##                                            X1.Min. X1.1st.Qu. X1.Median
## Glycolysis / Gluconeogenesis             1.3848398  1.3957418 1.4083271
## Citrate cycle (TCA cycle)                1.3236519  1.3244475 1.3342513
## Pentose phosphate pathway                1.0743874  1.0911928 1.1011456
## Pentose and glucuronate interconversions 0.3414858  0.3426804 0.3504955
## Fructose and mannose metabolism          0.8886422  0.9090359 0.9206953
##                                            X1.Mean X1.3rd.Qu.   X1.Max.
## Glycolysis / Gluconeogenesis             1.4053625  1.4179478 1.4199560
## Citrate cycle (TCA cycle)                1.3352925  1.3450963 1.3490153
## Pentose phosphate pathway                1.1057681  1.1157210 1.1463939
## Pentose and glucuronate interconversions 0.3513146  0.3591296 0.3627815
## Fructose and mannose metabolism          0.9615608  0.9732202 1.1162106
##                                             p.value   q.value
## Glycolysis / Gluconeogenesis             0.06152821 0.3076410
## Citrate cycle (TCA cycle)                0.27217487 0.4999636
## Pentose phosphate pathway                0.29997816 0.4999636
## Pentose and glucuronate interconversions 0.82126186 0.8877110
## Fructose and mannose metabolism          0.88771099 0.8877110
## 
## $errors
## named list()

Apart from the expected arguments: a gene expression data matrix, a vector of class labels and a list of pathways, the user needs to specify the type argument which decides on the type of the data ("RNASeq" for RNA-Seq data). The others arguments are optional. Arguments convertTo and convertBy control the conversion of the node labels in the pathways. The default setting is convertTo="none" which performs no conversion. Please note, that the node labels should be the same as the rownames of gene expression data matrix. The user can also specified whether the normalization step (standardization and sigma-transformation) should be perfomed (normalize=TRUE). If verbose=TRUE, function prints out the titles of pathways as their are analysed. The implementation returns also a gene-level statistics of the differential expression of genes.

4.6 PRS

PRS is another method that works with gene-level statistics and a list of differentially expresed genes. The pathway topology is incorporated as the number of downstream differentially expressed genes. The gene-level log fold-changes are weigted by this number and sumed up into a pathway-level score. A statistical significance is assessed by a permutations of genes.

Prs<-PRS_wrapper( assay(airSE), group, pwys, type="RNASeq",  logFC.th=-1, nperm=10, test.method="edgeR")
## 168 node labels mapped to the expression data
## Average coverage 52.32891 %
## 0 (out of 5) pathways without a mapped node
res(Prs)
## $results
##                                                nPRS p.value q.value
## Glycolysis / Gluconeogenesis             26.9223042     0.0   0.000
## Citrate cycle (TCA cycle)                -0.9287292     0.9   0.900
## Pentose phosphate pathway                -0.3822421     0.7   0.875
## Pentose and glucuronate interconversions  3.8829883     0.0   0.000
## Fructose and mannose metabolism          -0.7322924     0.7   0.875
## 
## $errors
## named list()

Arguments of this functions are almost the same as in SPIA. Apart from the expected arguments: a gene expression data matrix, a vector of class labels and a list of pathways, the user needs to specify the type argument which decides on the type of the data ("RNASeq" for RNA-Seq data). Alternatively, the user can supply the results of the differential expression analysis of genes instead of gene expression data matrix in two forms:

  1. a data.frame with columns: ID gene identifiers (they must match with the node labels), logFC log fold-changes, t t-statistics, pval p-values, padj adjusted p-values. Then the user sets type to "DEtable"
  2. a list with two slots: named vector of log fold-changes of differentially expressed genes and a vector of names of all genes analysed. Then the user sets type to "DElist"

The others arguments are optional. Arguments convertTo and convertBy control the conversion of the node labels in the pathways. The default setting is convertTo="none" which performs no conversion. Please note, that the node labels should be the same as the rownames of gene expression data matrix. The default thresholds for the differential expression analysis of genes are set with arguments logFC.th and p.val.th. The user can omit one of these criteria by setting the agrument negative value, as is shown also in the example. The implementation returns also a gene-level statistics of the differential expression of genes. There is one extra argument nperm which controls the number of permutations.

The last method available in this package is called PathWay Enrichment Analysis (PWEA). This is actually a weigthed form of common Gene Set Enrichment Analysis (GSEA). The weights are called Topological Influence Factor (TIF) and are defined as a geometic mean of ratios of Pearson’s correlation coefficient and the distance of two genes in a pathway. The weights of genes outside a pathway are assigned randomly from normal distribution with parameters estimated from the weights of genes in all pathways. A statistical significance of a pathway is assessed via Kolmogorov-Simirnov-like test statistic comparing two cumulative distribution functions with class label permutations.

pwe<-PWEA(assay(airSE), group, pwys,  type="RNASeq", nperm=10, test.method="edgeR")
## 168 node labels mapped to the expression data
## Average coverage 52.32891 %
## 0 (out of 5) pathways without a mapped node
res(pwe)
## $results
## named list()
## 
## $errors
##                                                         Glycolysis / Gluconeogenesis 
## "Error in if (abs(max.ES) > abs(min.ES)) {: missing value where TRUE/FALSE needed\n" 
##                                                            Citrate cycle (TCA cycle) 
## "Error in if (abs(max.ES) > abs(min.ES)) {: missing value where TRUE/FALSE needed\n" 
##                                                            Pentose phosphate pathway 
## "Error in if (abs(max.ES) > abs(min.ES)) {: missing value where TRUE/FALSE needed\n" 
##                                             Pentose and glucuronate interconversions 
## "Error in if (abs(max.ES) > abs(min.ES)) {: missing value where TRUE/FALSE needed\n" 
##                                                      Fructose and mannose metabolism 
## "Error in if (abs(max.ES) > abs(min.ES)) {: missing value where TRUE/FALSE needed\n"

Apart from the expected arguments: a gene expression data matrix, a vector of class labels and a list of pathways, the user needs to specify the type argument which decides on the type of the data ("RNASeq" for RNA-Seq data). Alternatively, instead of gene expression data matrix the user can supply a list of observed and random gene-level statistics and set type to "DEtable". The observed gene-level statistics are expected as data frame with columns: ID gene identifiers (they must match with the node labels), logFC log fold-changes, t t-statistics, pval p-values, padj** adjusted p-values.. A data.frame of similar data.frames is expected for random statistics (it is an output from sapply function when the applied function returns a data frame). Columns should refer to the results from individual analyses after class label permutation. The others arguments are optional.Arguments convertTo and convertBy control the conversion of the node labels in the pathways. The default setting is convertTo="none" which performs no conversion. Please note, that the node labels should be the same as the rownames of gene expression data matrix. The alpha parameter sets a threshold for gene weights. The purpose of this filtering is to reduce the possiblity that a weight of a gene that is tighly correlated with a few genes are lowered by the weak correlation with other genes in a pathway. The implementation returns also a gene-level statistics of the differential expression of genes. The nperm argument controls the number of permutations.