Introduction to SANTA

# generate the simulated network
require(SANTA, quietly=TRUE)
require(igraph, quietly=TRUE)
set.seed(1) # for reproducibility
g <- barabasi.game(n=500, power=1, m=1, directed=FALSE) 
```

A cluster of hits is simulated by first randomly choosing a seed vertex. All other vertices within the network are then ranked according to their distance to the from the seed vertex. The distance can be measured using any of the previously described distance metrics (shortest paths, diffusion kernel and MFPT). In this example, we will use the shortest paths metric, as it is the quickest to run. However, running the analysis with the other methods will produce the same results.

# measure the distance between pairs of vertices in g
dist.method <- "shortest.paths"
D <- DistGraph(g, dist.method=dist.method, verbose=FALSE) 
```

The `Knet` function computes the distances between each pair of vertices in $g$. However, to reduce the time required to run the function, these distances can be computed outside the function. The `Knet` function cannot be run on the raw continuous distances. Therefore, it is necessary to place the pairs of vertices into discreet bins. This is done using the `BinGraph} function.

# place the distances into discreet bins
B <- BinGraph(D, verbose=FALSE)
```

Next, we will generate clusters of hits and measure the clustering strength using the `Knet` function. The $s$ vertices positioned closest to the seed vertex form the sample set. From this set, 5 vertices are randomly selected to form the hit set. The size of the hit set needs to be small compared to the number of vertices in the network. Increasing the value of $s$ decreases the strength of the clustering. We will test 5 different values of $s$ in order to simulate different clustering strengths. A value of $s$ equal to the total number of vertices in the network produces a hit set containing vertices randomly selected from the entire network. For each value of $s$, we will conduct 10 trials. In the associated paper, 1000 trials were run. Here we will also use 100 permutations, rather than 1000, in order to reduce the time required.

cluster.size <- 5
s.to.use <- c(10, 20, 50, 100, 500)
n.trials <- 10
pvalues <- array(0, dim=c(n.trials, length(s.to.use)), 
    dimnames=list(NULL, as.character(s.to.use)))

# run the trials for each value of s
for (s in s.to.use) {       
    for (i in 1:n.trials) {
        # generate the hit set
        seed.vertex <- sample(vcount(g), 1) # identify seed
        sample.set <- order(D[seed.vertex, ])[1:s]
        hit.set <- sample(sample.set, cluster.size)
        
        # measure the stength of association
        g <- set.vertex.attribute(g, name="hits", 
            value=as.numeric(1:vcount(g) %in% hit.set))
        pvalues[i, as.character(s)] <- Knet(g, nperm=100, 
            dist.method=dist.method, vertex.attr="hits", 
            B=B, verbose=FALSE)$pval
    }
}
```

Figure 1 contains the results of the simulation study. The `Knet` function correctly identifies sets of hits created using smaller values of $s$ as clustering more significantly.

# create the network
n.nodes <- 12
edges <- c(1,2, 1,3, 1,4, 2,3, 2,4, 3,4, 1,5, 5,6, 
           2,7, 7,8, 4,9, 9,10, 3,11, 11,12)
weights1 <- weights2 <- rep(0, n.nodes)
weights1[c(1,2)] <- 1
weights2[c(5,6)] <- 1 

g <- graph.empty(n.nodes, directed=FALSE)
g <- add.edges(g, edges)
g <- set.vertex.attribute(g, "weights1", value=weights1)
g <- set.vertex.attribute(g, "weights2", value=weights2)
```

# set 1
Knet(g, nperm=100, vertex.attr="weights1", verbose=FALSE)$pval
Compactness(g, nperm=100, vertex.attr="weights1", verbose=FALSE)$pval

# set 2
Knet(g, nperm=100, vertex.attr="weights2", verbose=FALSE)$pval
Compactness(g, nperm=100, vertex.attr="weights2", verbose=FALSE)$pval
```

The `Compactness` function does not incorporate vertex degree when measuring the significance of hit clustering. For this reason, it outputs a similar p-value for each hit set. `Knet` incorporates the global distribution of vertices and therefore identifies set 2 as clustering more significantly than set 1. It is therefore preferable to use `Knet` instead of `Compactness` when the degree of vertices varies across a network.





##### iii) Using `Knet` to demonstrate that correlation in GI profile produce functionally more informative networks

In this section we will demonstrate how the `Knet` function can be used to identify the most effective way of building an interaction network.

There is evidence that protein function is more closely related to global similarity between genetic interaction (GI) profile than to individual interactions. In order to test this, we will use the `Knet` function to quantify the strength of association between GO terms and two S. cerevisiae networks: a network of raw interactions and a network created from correlations in interaction profile. Both of these will be created using the data produced by Costanzo et al. If the GO terms associate more strongly with a network, then this is indication that the network is more functionally informative.

First, we will load two pre-processed networks, in the form of `igraph` objects from the `SANTA` package. `SANTA` requires all networks to be in the form of `igraph` objects. The data used to create these networks is available from http://drygin.ccbr.utoronto.ca/~costanzo2009/ and the method used is described in the associated SANTA paper.

`Knet` is also able to incorporate different edge distances when quantifying the clustering of high-weight vertices. Edge distances can represent different biological properties, such as the strength of a physical interaction between 2 gene products, or in the case of our 2 networks, the strength of the genetic interaction. The smaller the edge distance, the stronger or more significant the interaction. When creating the network, we converted the correlation coefficients into edge distances by taking the $-log10$ of the scores. These distances have been stored under the edge attribute `distance`.

# load igraph objects
data(g.costanzo.raw)
data(g.costanzo.cor)
networks <- list(raw=g.costanzo.raw, cor=g.costanzo.cor)
network.names <- names(networks)
network.genes <- V(networks$raw)$name 
# genes identical across networks
```
 
Associations between genes and GO terms will be obtained from the Gene Ontology (GO) project using the `org.Sc.sgd.db` package. In order to reduce the time required to run this analysis, we will use only 5 GO terms. However, it is easy to run this analysis with many more GO terms, as done in the associated paper.

# obtain the GO term associations from org.Sc.sgd.db package
library(org.Sc.sgd.db, quietly=TRUE)
xx <- as.list(org.Sc.sgdGO2ALLORFS)
go.terms <- c("GO:0000082", "GO:0003682", "GO:0007265", 
              "GO:0040008", "GO:0090329")
# apply the GO terms to the networks
for (name in network.names) {
    for (go.term in go.terms) {
        networks[[name]] <- set.vertex.attribute(
            networks[[name]], name=go.term, 
            value=as.numeric(network.genes %in% xx[[go.term]]))
    }
} 
```

We will now apply the `Knet` function to each GO terms on each network in order to quantify the strength of association. If a gene is associated with the GO term it will given a weight of 1. If it is not associated, it will be given a weight of 0. We will now store these scores in the 2 graphs under a vertex attribute called `rdds`. Due to the time required to compute the distance matrices, the number of permutations has been reduced to 10. Running at least 1000 permutations is recommended.

results <- list()
for (name in network.names) {
    results[[name]] <- Knet(networks[[name]], nperm=10, 
        vertex.attr=go.terms, edge.attr="distance", verbose=FALSE)
    results[[name]] <- sapply(results[[name]], 
        function(res) res$pval)
}
```

# load igraph objects
data(g.bandyopadhyay.treated)
data(g.bandyopadhyay.untreated)
networks <- list(
    treated=g.bandyopadhyay.treated, 
    untreated=g.bandyopadhyay.untreated
)
network.names <- names(networks)
```

We will the Gene Ontology (GO) project to identify those genes involved in responding to DNA damage. By changing the code below, it is possible to measure the strength of association of gene sets involved in different functions. In the associated paper, we measured the strength of association of 194 GO terms with each network.

# obtain GO term associations
library(org.Sc.sgd.db, quietly=TRUE)
xx <- as.list(org.Sc.sgdGO2ALLORFS)
# change to use alternative GO terms
associated.genes <- xx[["GO:0006974"]] 
associations <- sapply(networks, function(g) 
    as.numeric(V(g)$name %in% associated.genes), 
    simplify=FALSE)
networks <- sapply(network.names, function(name) 
    set.vertex.attribute(networks[[name]], "rdds", 
    value=associations[[name]]), simplify=FALSE)
```

We will now apply the `Knet` function to the GO term on each network in order to measure the strength of association. The greater the number of permutations run, the greater the accuracy of the p-value.

We will also run `Knet` using the shortest paths method to compute the distance between vertex pairs. However, as shown in the associated paper, the `diffusion` and `mfpt` distance measures produce the same results. This can be demonstrated by replacing the `dist.method` argument below with one of the alternative distance measures. By running this analysis with additional GO terms, the robustness of the `Knet` across different distance measures can be seen. Due to the time required to compute the distance matrices, the number of permutations has been reduced to 10. Running with least 1000 permutations is recommended.

results <- sapply(networks, function(g) Knet(g, nperm=10, 
    dist.method="shortest.paths", vertex.attr="rdds", 
    edge.attr="distance"), simplify=FALSE)
p.values <- sapply(results, function(res) res$pval)
p.values
```

Since the GO term tested was `response to DNA damage stimulus`, it is not surprising that the gene set associated more significantly with the treated network (as shown by the lower p-value). The yeast exposed to the DNA-damaging agent have activated or upregulated pathways involved in responding to the agent, thereby increasing the strength of the genetic interaction between DNA-damage response-related genes.

# laod igraph object
data(g.srivas.high)
data(g.srivas.untreated)
networks <- list(
    high=g.srivas.high, 
    untreated=g.srivas.untreated
)
network.names <- names(networks)
```

We will again test only a single GO term (GO:0000725, recombinational repair). In the associated paper, a greater number of GO terms are tested.

# obtain GO term associations
library(org.Sc.sgd.db, quietly=TRUE)
xx <- as.list(org.Sc.sgdGO2ALLORFS)
associated.genes <- xx[["GO:0000725"]] 
associations <- sapply(networks, function(g) 
    as.numeric(V(g)$name %in% associated.genes), 
    simplify=FALSE)
networks <- sapply(network.names, function(name)
    set.vertex.attribute(networks[[name]], "dsbr", 
    value=associations[[name]]), simplify=FALSE)
```

The `Knet` function will again be used to compare the strength of association of the GO term with the 2 networks. Due to the time required to compute the distance matrices, the number of permutations has been reduced to 10. Running with least 1000 permutations is recommended.

p.values <- sapply(networks, function(g) 
       Knet(g, nperm=10, dist.method="shortest.paths", 
       vertex.attr="dsbr", edge.attr="distance")$pval)
p.values
```

As expected, the GO term associates more strongly with the UV treated network. This is likely due to the functional rewiring that occurs within the yeast when exposed to DNA-damaging UV radiation.

# import and convert RNAi data
data(rnai.cheung)
rnai.cheung <- -log10(rnai.cheung)
rnai.cheung[!is.finite(rnai.cheung)] <- max(rnai.cheung[is.finite(rnai.cheung)])
rnai.cheung[rnai.cheung < 0] <- 0
```

Next, we will load the interaction data for the two network and create an `igraph` object for each. No edge weights are given for the IntAct network. Each of the interactions in the HumanNet network is associated with a log likelihood score (LLS), describing the probability the functional interaction occurring given the data used. In order to reduce the density of the network, interactions with an $LLS < 2$ have been removed. Distances for the remaining edges have been produced by dividing the weights by the maximum weight and taking the $-log_{10}$ of this score. We have removed all genes not part of the largest cluster from each network.

# import and create IntAct network
data(edgelist.intact)
g.intact <- graph.edgelist(as.matrix(edgelist.intact), 
    directed=FALSE)

# import data and create HumanNet network
data(edgelist.humannet)
g.humannet <- graph.edgelist(as.matrix(edgelist.humannet)[,1:2], 
    directed=FALSE)
g.humannet <- set.edge.attribute(g.humannet, "distance", 
    value=edgelist.humannet$distance)
networks <- list(intact=g.intact, humannet=g.humannet)
```

Next, we will apply the vertex weights to the networks. Not all networks genes have vertex weights available.

network.names <- names(networks)
network.genes <- sapply(networks, get.vertex.attribute, 
    name="name", simplify=FALSE)
rnai.cheung.genes <- rownames(rnai.cheung)
cancers <- colnames(rnai.cheung)

for (cancer in cancers) {
    for (name in network.names) {
        vertex.weights <-rep(NA, vcount(networks[[name]]))
        names(vertex.weights) <- network.genes[[name]]
        common.genes <- rnai.cheung.genes[rnai.cheung.genes
            %in% network.genes[[name]]] 
        vertex.weights[common.genes] <- rnai.cheung[common.genes, cancer]
        networks[[name]] <- set.vertex.attribute(networks[[name]], 
            cancer, value=vertex.weights)
    }
}
```

We will now apply the `Knet`-function to these scores. Due to the time required to compute the distance matrices, the number of permutations has been reduced to 10. Running with least 1000 permutations is recommended.

knet.res <- sapply(networks, Knet, nperm=10, 
   dist.method="shortest.paths", vertex.attr=cancers, 
   edge.attr="distance", simplify=FALSE)
p.values <- sapply(knet.res, function(i) sapply(i, 
   function(j) j$pval))
p.values
```

The RNAi data from the colon and ovarian cancer cell lines associate strongly with both networks. However, they associate associate more strongly with the HumanNet network, indicating that the HumanNet network better explains the cellular mechanisms involved in maintaining these cell lines.

sessionInfo()