Pathway Analysis of Metabolic Activities

In this tutorial we are going to analyze the dataset of metabolic activities published in Terunuma et al (2017). In this paper the Authors characterized the transcriptomic and metabolomic profile of several types of human breast tumors to uncovered metabolite signatures using an untargeted discovery approach. They found that the oncometabolite 2-hydroxyglutarate (2HG) accumulates at high levels in a subset of tumors and discovered an association between increased 2HG levels and MYC pathway activation in breast cancer. As an example of metabolites topological analyses here we are going to use only the dataset of metabolite abundances as reported in the supplementary files selecting only tumour samples and comparing ER+ and ER- breast cancers. We will use pathway information from the KEGG and Reactome databases to identify a set of metabolites whose activity changes significantly between the two sample classes. This result will hopefully hint at some specific biological activities that are pathologically altered in tumoral samples.

Data Preparation

We start by loading the experimental data into R. The measurements are stored in a simple text file, in tabular form. Thanks to the utility functions provided by R, we can combine the download and the actual reading of the table in a single operation. We also specify the details of the file format: the first row of input represents the header of the table; the tabulation character \t separates values; we don’t want R to change strings into factors. TODO: change the URL

addr <- "http://romualdi.bio.unipd.it/wp-uploads/2017/10/metabolite_matrix_final.txt"
expr <- read.table(url(addr), header = TRUE, sep = "\t", stringsAsFactors = FALSE)

The loaded table contains 536 rows and 71 columns, but not all of them actually correspond to measurements. We ask R to list the names of the columns:

colnames(expr)

##  [1] "METABOLON_ID" "KEGG_ID"      "CAS"          "HMDB_ID"     
##  [5] "NEG.TUMOR"    "NEG.TUMOR.1"  "NEG.TUMOR.2"  "POS.TUMOR"   
##  [9] "POS.TUMOR.1"  "NEG.TUMOR.3"  "POS.TUMOR.2"  "POS.TUMOR.3" 
## [13] "NEG.TUMOR.4"  "NEG.TUMOR.5"  "NEG.TUMOR.6"  "POS.TUMOR.4" 
## [17] "NEG.TUMOR.7"  "NEG.TUMOR.8"  "POS.TUMOR.5"  "NEG.TUMOR.9" 
## [21] "NEG.TUMOR.10" "NEG.TUMOR.11" "NEG.TUMOR.12" "NEG.TUMOR.13"
## [25] "POS.TUMOR.6"  "POS.TUMOR.7"  "POS.TUMOR.8"  "POS.TUMOR.9" 
## [29] "POS.TUMOR.10" "POS.TUMOR.11" "POS.TUMOR.12" "POS.TUMOR.13"
## [33] "NEG.TUMOR.14" "NEG.TUMOR.15" "POS.TUMOR.14" "POS.TUMOR.15"
## [37] "NEG.TUMOR.16" "NEG.TUMOR.17" "NEG.TUMOR.18" "NEG.TUMOR.19"
## [41] "POS.TUMOR.16" "POS.TUMOR.17" "POS.TUMOR.18" "NEG.TUMOR.20"
## [45] "NEG.TUMOR.21" "NEG.TUMOR.22" "NEG.TUMOR.23" "NEG.TUMOR.24"
## [49] "POS.TUMOR.19" "POS.TUMOR.20" "POS.TUMOR.21" "POS.TUMOR.22"
## [53] "POS.TUMOR.23" "NEG.TUMOR.25" "POS.TUMOR.24" "POS.TUMOR.25"
## [57] "NEG.TUMOR.26" "NEG.TUMOR.27" "NEG.TUMOR.28" "NEG.TUMOR.29"
## [61] "POS.TUMOR.26" "NEG.TUMOR.30" "POS.TUMOR.27" "POS.TUMOR.28"
## [65] "NEG.TUMOR.31" "POS.TUMOR.29" "POS.TUMOR.30" "POS.TUMOR.31"
## [69] "NEG.TUMOR.32" "NEG.TUMOR.33" "POS.TUMOR.32"

As you can see the first four columns actually contain metabolite identifiers. We store their indices in a variable for future reference.

idcols <- 1:4

Missing Values

We display the first few rows of the table to get a feeling for its content. For the moment, we concentrate on the metabolite measurements.

head(expr[,5:10])

##   NEG.TUMOR NEG.TUMOR.1 NEG.TUMOR.2 POS.TUMOR POS.TUMOR.1 NEG.TUMOR.3
## 1 143976.09   241472.30          NA  28847.14          NA    91703.21
## 2  32786.00   433319.00    21002.72  47510.59          NA   119432.82
## 3        NA   163501.61          NA        NA          NA   133341.57
## 4  20556.03          NA          NA        NA    94989.22          NA
## 5  72971.21   124561.37    50246.18 106322.34    89398.17    43596.23
## 6        NA    55323.71          NA        NA          NA   361387.05

It looks like there is a significant number of missing values. Let’s plot a distribution of the fraction of NAs for each row.

nas_by_row <- rowSums(is.na(data.matrix(expr[,-idcols]))) / (ncol(expr) - length(idcols))
hist(nas_by_row, xlab = "Fraction of NA values", ylab = "Number of rows", main = NULL)

We cannot make good use of rows with too many missing values. We decide to drop them from our dataset. Specifically, we are going to remove anything with more than 50% of NAs.

dexpr <- expr[nas_by_row < 0.5,]

This operation leaves us with 315 of the original 536 rows.

We are not yet satisfied, though. The downstream analysis won’t be able to cope with any NA. What would happen if we were to apply a more stringent procedure, removing any NA?

sum(rowSums(is.na(dexpr[,-idcols])) == 0)

## [1] 88

The above command is a little bit dense, so let’s go through it in small steps. First we select all columns, except the identifiers. Then we check each value, testing if it’s an NA. We tally how many NAs we have in each row and we consider only those in which the count is zero (meaning, no NA at all). The outer sum tells us how many rows would survive our filter.

The final verdict is quite grim. Only 88 out of 315 measurements would be used. We could do much better using a strategy that has become quite common in cases like this: imputation.

Instead of implementing it ourselves, we are going to use the excellent BioConductor package impute.

library(impute)
iexpr <- cbind(dexpr[,idcols],
               impute.knn(data.matrix(dexpr[,-idcols]))$data,
               stringsAsFactors = FALSE)
head(iexpr[, 1:6])

##    METABOLON_ID KEGG_ID         CAS   HMDB_ID NEG.TUMOR NEG.TUMOR.1
## 1         35186                               143976.09    241472.3
## 2         34214  C03819                        32786.00    433319.0
## 5         27665  C02918  1005-24-9; HMDB00699  72971.21    124561.4
## 7         21184  C01885   111-03-5;           102595.37    940195.4
## 8         33960  C03916 19420-56-5; HMDB02815  66618.92    351724.6
## 10        21127  C01885   542-44-9;            17852.30    367552.9

We can explicitly check there are not NAs left:

sum(is.na(iexpr[,-idcols]))

## [1] 0

Missing Identifiers

We now concentrate on the metabolite identifiers. We start again from the first few rows.

head(iexpr[,idcols])

##    METABOLON_ID KEGG_ID         CAS   HMDB_ID
## 1         35186                              
## 2         34214  C03819                      
## 5         27665  C02918  1005-24-9; HMDB00699
## 7         21184  C01885   111-03-5;          
## 8         33960  C03916 19420-56-5; HMDB02815
## 10        21127  C01885   542-44-9;

The CAS identifiers have a trailing ; we don’t need and there are a number of empty strings, which really represent missing values. We fix both these issues.

iexpr$CAS <- sub("\\s*;.*$", "", iexpr$CAS)
iexpr[,idcols][iexpr[,idcols] == ""] <- NA
head(iexpr[,idcols])

##    METABOLON_ID KEGG_ID        CAS   HMDB_ID
## 1         35186    <NA>       <NA>      <NA>
## 2         34214  C03819       <NA>      <NA>
## 5         27665  C02918  1005-24-9 HMDB00699
## 7         21184  C01885   111-03-5      <NA>
## 8         33960  C03916 19420-56-5 HMDB02815
## 10        21127  C01885   542-44-9      <NA>

We get a measure of how many identifiers are present or missing:

summary(is.na(iexpr[,idcols]))

##  METABOLON_ID     KEGG_ID           CAS           HMDB_ID       
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:315       FALSE:176       FALSE:199       FALSE:175      
##                  TRUE :139       TRUE :116       TRUE :140

We have a Metabolon ID for each metabolite in the matrix, while in the case of CAS identifiers we find only 199 usable rows.

Unfortunately, at this point graphite does not support the Metabolon IDs. We do rely on CAS even if that means loosing a significant fraction of the rows.

valid_cas <- !is.na(iexpr$CAS)
cas_col <- which(names(iexpr) == "CAS")
cexpr <- iexpr[valid_cas, c(cas_col, seq.int(5, ncol(iexpr)))]
head(cexpr[,1:6])

##           CAS NEG.TUMOR NEG.TUMOR.1 NEG.TUMOR.2 POS.TUMOR POS.TUMOR.1
## 5   1005-24-9  72971.21    124561.4   50246.180 106322.34    89398.17
## 7    111-03-5 102595.37    940195.4    8155.901  64641.62   450857.68
## 8  19420-56-5  66618.92    351724.6   96895.167  43963.93    89116.99
## 10   542-44-9  17852.30    367552.9  168209.378  31647.50   361469.21
## 11 17364-16-8  73652.66   2933573.1  118921.644  13491.35   201283.33
## 12   123-94-4  25390.48    255992.6  164295.490  53549.10   161703.45

There is no reason at this point to keep the identifiers as a column inside the dataset. We move such information to the row names and transform the data.frame into a numeric matrix.

mexpr <- data.matrix(cexpr[,-1])
rownames(mexpr) <- paste("CAS", cexpr$CAS, sep = ":")
head(mexpr[,1:6])

##                NEG.TUMOR NEG.TUMOR.1 NEG.TUMOR.2 POS.TUMOR POS.TUMOR.1
## CAS:1005-24-9   72971.21    124561.4   50246.180 106322.34    89398.17
## CAS:111-03-5   102595.37    940195.4    8155.901  64641.62   450857.68
## CAS:19420-56-5  66618.92    351724.6   96895.167  43963.93    89116.99
## CAS:542-44-9    17852.30    367552.9  168209.378  31647.50   361469.21
## CAS:17364-16-8  73652.66   2933573.1  118921.644  13491.35   201283.33
## CAS:123-94-4    25390.48    255992.6  164295.490  53549.10   161703.45
##                NEG.TUMOR.3
## CAS:1005-24-9     43596.23
## CAS:111-03-5     115254.33
## CAS:19420-56-5  1438440.74
## CAS:542-44-9     809467.15
## CAS:17364-16-8  8597685.11
## CAS:123-94-4     476471.58

Pathway Analysis

To make sense of the changes in metabolic activity recorded in the data we have just loaded, we are going to use pathway information from KEGG (via the graphite package) and a statistical analysis capable of exploiting the topology of such pathways (here we’ll rely on clipper).

We import the required packages:

library(graphite)
library(clipper)

clipper will need three pieces of information:

a set of pathways
the activity matrix (mexpr in our case)
a vector representing the two sample groups we are going to compare

KEGG Pathways

Getting pathways is quite easy thanks to graphite.

kpaths <- pathways("hsapiens", "kegg")

The above command retrieves all KEGG pathways for Homo sapiens (294 in total). We take a peek at the first entry:

kpaths[[1]]

## "Glycolysis / Gluconeogenesis" pathway
## Native ID       = hsa:00010
## Database        = KEGG
## Species         = hsapiens
## Number of nodes = 94
## Number of edges = 1191
## Retrieved on    = 12-10-2017

This summary doesn’t tell us which identifiers are used for the nodes in the pathway. We can get that from the list of edges.

head(edges(kpaths[[1]], "metabolites"))

##   src_type    src dest_type   dest direction              type
## 1 KEGGCOMP C00022  KEGGCOMP C00074  directed Process(indirect)
## 2 KEGGCOMP C00022  KEGGCOMP C00186  directed Process(indirect)
## 3 KEGGCOMP C00022  KEGGCOMP C05125  directed Process(indirect)
## 4 KEGGCOMP C00022  KEGGCOMP C15972  directed Process(indirect)
## 5 KEGGCOMP C00022  KEGGCOMP C16255  directed Process(indirect)
## 6 KEGGCOMP C00024  KEGGCOMP C15973  directed Process(indirect)

As we might expect, KEGG pathways are using KEGG compounds IDs. Since we’re relying on CAS in our data, we should convert them. convertIdentifiers to the rescue!

rpaths <- convertIdentifiers(kpaths, "CAS")

The edges of the first pathway have been converted into:

head(edges(rpaths[[1]], "metabolites"))

##   src_type     src dest_type    dest direction              type
## 1      CAS 57-60-3       CAS 73-89-2  directed Process(indirect)
## 2      CAS 57-60-3       CAS 79-33-4  directed Process(indirect)
## 3      CAS 64-19-7       CAS 72-89-9  directed Process(indirect)
## 4      CAS 71-50-1       CAS 72-89-9  directed Process(indirect)
## 5      CAS 64-19-7       CAS 75-07-0  directed Process(indirect)
## 6      CAS 71-50-1       CAS 75-07-0  directed Process(indirect)

To make the rest of the analysis more robust, we are going to filter pathways requiring they have at least 10 edges whose metabolites appear in our data.

filter_pathways <- function(pathways, expr, min_edge_num) {
  node_names <- sub("^CAS:", "", rownames(expr))

  pred <- function(p) {
    es <- edges(p, "metabolites")
    mask <- es$src %in% node_names & es$dest %in% node_names
    sum(mask) > min_edge_num
  }

  Filter(pred, pathways)
}

fpaths <- filter_pathways(rpaths, mexpr, 10)

Now about sample classes. We are going to split our dataset into two: POS.TUMOR samples in the first group and NEG.TUMOUR samples in the other. clipper wants from us a vector with as many entries as there are samples. An entry should be 1 if the corresponding samples belongs to the first class, or 2 in the other case.

pos_indices <- grep("^POS", colnames(mexpr))
neg_indices <- grep("^NEG", colnames(mexpr))

classes <- numeric(length = ncol(mexpr))
classes[neg_indices] <- 1
classes[pos_indices] <- 2
classes

##  [1] 1 1 1 2 2 1 2 2 1 1 1 2 1 1 2 1 1 1 1 1 2 2 2 2 2 2 2 2 1 1 2 2 1 1 1
## [36] 1 2 2 2 1 1 1 1 1 2 2 2 2 2 1 2 2 1 1 1 1 2 1 2 2 1 2 2 2 1 1 2

We use the helper function runClipper to start our analysis. Note that we explicitly require metabolites from the pathways.

out <- runClipper(fpaths, mexpr, classes, "mean", "metabolites", maxNodes = 150, seed = 42)

We check that there are no errors and the we extract the list of pathways which appears to be significantly altered between the two conditions according to clipper.

stopifnot(length(out$errors) == 0)
names(out$results)

## [1] "Alanine, aspartate and glutamate metabolism"

plot_altered_path <- function(result, pathways, node_scale = 2) {
  title <- names(result)
  pathway <- pathways[[title]]

  graph <- pathwayGraph(pathway, which = "metabolites")

  labels <- sub("^CAS:", "", nodes(graph))
  names(labels) <- nodes(graph)

  altered <- unlist(strsplit(result[[1]][1, "pathGenes"], "[,;]"))
  selected <- nodes(graph) %in% altered
  node_colors <- ifelse(selected, "red", "black")
  names(node_colors) <- nodes(graph)

  base <- 0.75
  heights <- ifelse(selected, base * node_scale, base)
  names(heights) <- nodes(graph)
  widths <- ifelse(selected, base * node_scale, base)
  names(widths) <- nodes(graph)

  base <- 14
  fontsizes <- ifelse(selected, base * node_scale, base)
  names(fontsizes) <- nodes(graph)

  between_altered <- function(edge_names, altered) {
    sapply(edge_names, function(edge_name) {
      nodes <- unlist(strsplit(edge_name, "~", fixed = TRUE))
      all(nodes %in% altered)
    })
  }

  edge_colors <- ifelse(between_altered(edgeNames(graph), altered), "red", "black")

  plot(graph,
       attrs = list(edge = list(arrowsize = 0.5)),
       nodeAttrs = list(label = labels, color = node_colors, width = widths,
                        height = heights, fontsize = fontsizes),
       edgeAttrs = list(color = edge_colors),
       recipEdges = "combined", main = title)
}

plot_altered_path(out$results[1], fpaths)

As expected, within the significant path (highlighted in red) we found N-Acetyl-L-aspartic acid (NAA) (CAS:997-55-7) and L-Glutamic acid (CAS:56-85-9) identified by the Authors as two differential metabolites between ER+ and ER- patients. Previous studies demonstrate that a deleterious aspartoacylase (ASPA) gene mutations inhibit the hydrolysis of NAA suggesting a possible cause for increased tumor-associated NAA. The Authors found NAA significantly elevated in ER- tumors and a reduced expression of aspartoacylase (ASPA) gene in ER- tumours.

Reactome Pathways

In this section we repeat the previous analysis, but we use Reactome as a source of pathway information.

fpaths <- filter_pathways(convertIdentifiers(pathways("hsapiens", "reactome"), "CAS"),
                          mexpr, 10)
out <- runClipper(fpaths, mexpr, classes, "mean", "metabolites", maxNodes = 150, seed = 42)

stopifnot(length(out$errors) == 0)
plot_altered_path(out$results[1], fpaths, 12)

## Warning in arrows(head_from[1], head_from[2], head_to[1], head_to[2], col =
## edgeColor, : zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(head_from[1], head_from[2], head_to[1], head_to[2], col =
## edgeColor, : zero-length arrow is of indeterminate angle and so skipped

Here we found a general pathway (nucleotide metabolism) within which the best significant path is composed by inosine, guanine, guanosine, adenosine and xanthine. It is worth to note that Terunuma et al (2017) found guanine one of the highest significant metabolite between ER+ and ER- tumours.