Contents

1 Context

Fletcher et al. (2013) reconstructed regulons for 809 transcription factors (TFs) using microarray transcriptomic data from the METABRIC breast cancer cohort (Curtis et al. 2012). Our goal here is to assess the evolutionary root of the regulons reconstructed by Fletcher et al. (2013) using the geneplast package. This script reproduces the main observations described in Trefflich et al. (2019), which proposed a framework to explore the evolutionary roots of regulons.

2 Package installation and data sets

Please make sure to install all required packages. Installing and then loading the geneplast.data.string.v91 and Fletcher2013b data packages will make available all data required for this case study.

#-- Call packages
library(geneplast)
library(geneplast.data.string.v91)
library(Fletcher2013b)
library(ggplot2)
library(ggpubr)

3 Inferring evolutionary roots

This analysis will determine the evolutionary root of a gene based on the distribution of its orthologs in a given species tree. We will need two data objects, cogdata and phyloTree, both loaded with the gpdata_string_v91 call. The cogdata is a data.frame object listing orthologous groups (OGs) predicted for 121 eukaryotic species, while the phyloTree is a phylogenetic tree object of class phylo. The groot.preprocess function will check the input data and build an object of class OGR, which will be used in the subsequent steps of the analysis pipeline.

#-- Load orthology data from the 'geneplast.data.string.v91' package
data(gpdata_string_v91)

#-- Create an object of class 'OGR' for a reference 'spid'
ogr <- groot.preprocess(cogdata=cogdata, phyloTree=phyloTree, spid="9606")

The groot function addresses the problem of finding the evolutionary root of a feature in an phylogenetic tree. The method infers the probability that such feature was present in the Last Common Ancestor (LCA) of a given lineage. The groot function assesses the presence and absence of the orthologs in the extant species of the phylogenetic tree in order to build a probability distribution, which is used to identify vertical heritage patterns. The spid=9606 parameter sets Homo sapiens as the reference species, which defines the ancestral lineage assessed in the query (i.e. each ortholog of the reference species will be rooted in an ancestor of the reference species).

#-- Run the 'groot' function and infer the evolutionary roots
ogr <- groot(ogr, nPermutations=1000, verbose=TRUE)

4 Mapping root-to-gene annotation

In this section we will map the inferred evolutionary roots (available in the ogr object) to genes annotated in the regulons reconstructed by Fletcher et al. (2013) (available in the rtni1st object). For a summary of the regulons in the rtni1st object we recommend using the tni.regulon.summary function, which shows that there are 809 regulatory elements (TFs) and 14131 targets.

#-- Load regulons
data("rtni1st")
tni.regulon.summary(rtni1st)
## This regulatory network comprised of 809 regulons. 
## regulatoryElements            Targets              Edges 
##                809              14131              47012 
## Min.  1st Qu.  Median    Mean  3rd Qu.   Max. 
## 0.0     10.0     37.0    58.1    80.0  523.0 
## regulatoryElements            Targets              Edges 
##                809              14131             617672 
## Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0      43      449     764   1245    4148 
## ---

We will transform the rtni1st into a graph object using the tni.graph function. The resulting graph will be assessed by the ogr2igraph function, which will map the root-to-gene annotation; the results will be available in the roots_df data frame for subsequent analysis.

#-- Put regulons into an 'igraph' object 
#-- Note: small regulons (n<15 targets) are romeved in this step.
graph <- tni.graph(rtni1st, gtype = "rmap")

#-- Map the 'ogr' object to the 'igraph' object
graph <- ogr2igraph(ogr, cogdata, graph, idkey = "ENTREZ")

#-- Make a data frame with the gene roots
roots_df <- data.frame(SYMBOL = V(graph)$SYMBOL,
                       ENTREZ = V(graph)$ENTREZ, 
                       Root = V(graph)$Root, 
                       TF_Targets = c("Targets","TFs")[V(graph)$tfs+1])

Please note that some level of missing annotation is expected, as not all gene ids listed in the cogdata might be available in the graph object. Also, small regulons (n < 15 targets) are removed by the tni.graph function. As a final pre-processing step, we will remove genes rooted at the base of the phylogenetic tree, for which the predictions can not discriminate from earlier ancestor roots. Here, 307 TFs and 6308 targets were retained.

#-- Remove NAs from missing annotation
roots_df <- roots_df[complete.cases(roots_df),]

#-- Remove genes rooted at the base of the phylogenetic tree
roots_df <- roots_df[roots_df$Root<max(roots_df$Root),]
rownames(roots_df) <- 1:nrow(roots_df)

#-- Check TF and target counts
table(roots_df$TF_Targets)
## Targets     TFs
##    6308     307

5 Comparing regulators and targets

A transcriptional regulatory network (TRN) is formed by regulators (TFs) and target genes. The roots_df data frame lists the evolutionary roots inferred for each TRN element, including whether the TRN element is annotated as TF or as a target.

head(roots_df)
##     SYMBOL ENTREZ Root TF_Targets
## 1   CEBPG   1054   19        TFs
## 2   NR4A2   4929   17        TFs
## 3     EN1   2019   17        TFs
## 4    TP53   7157   20        TFs
## 5 GATAD2A  54815   19        TFs
## 6     DR1   1810   23        TFs
tail(roots_df)
##         SYMBOL ENTREZ Root TF_Targets
## 6610      F11   2160   19    Targets
## 6611   KCNK18 338567   24    Targets
## 6612  TMEM220 388335   14    Targets
## 6613 C1orf170  84808    7    Targets
## 6614 C16orf96 342346    6    Targets
## 6615    PANX3 116337   13    Targets

For example, CEBPG gene is placed at root 19 while PANX3 gene is placed at root 13, indicating that the evolutionary root inferred for CEBPG is more ancestral than the evolutionary root inferred for PANX3. Please note that the evolutionary roots are enumerated from the most recent to the most ancestral node of the phylogenetic tree. Also, as the aim of the analysis is to find the root of the orthologs of the reference species, the root enumeration is related to the ancestral lineage of the reference species (for details of the phylogenetic tree, see Figure 4 of the Geneplast’s vignette).

Here we will compare the distribution of the evolutionary roots inferred for TFs and target genes using the Wilcoxon-Mann-Whitney test, and then generate violin plots (please refer to Trefflich et al. (2019) for additional details).

wilcox.test(Root ~ TF_Targets, data=roots_df)
## Wilcoxon rank sum test with continuity correction
## data:  Root by TF_Targets
## W = 812534, p-value = 1.6e-06
## alternative hypothesis: true location shift is not equal to 0
#-- Set roots to display in y-axis
roots <- c(4,8,11,13,19,21,25)

#-- Set a summary function to display dispersion within the violins
data_summary <- function(x) {
  y <- mean(x); ymin <- y-sd(x); ymax <- y+sd(x)
  return(c(y=y,ymin=ymin,ymax=ymax))
}

#-- Build a ggplot object
p <- ggplot(roots_df, aes(x=TF_Targets, y=Root)) + 
  geom_violin(aes(fill=TF_Targets), adjust=2, show.legend=F) +
  scale_y_continuous(breaks=roots, labels=paste("root",roots)) +
  scale_fill_manual(values=c("#c7eae5","#dfc27d")) +
  labs(x="TRN elements", y="Root distribution") +
  scale_x_discrete(limits=c("TFs","Targets")) +
  theme_classic() +
  theme(text=element_text(size=20)) + 
  stat_summary(fun.data = data_summary)

#-- Generate violin plots
p + stat_compare_means(method="wilcox.test",
                       comparisons =list(c("Targets","TFs")),
                       label = "p.signif")

title Figure 1. Distribution of the inferred evolutionary roots of TFs and target genes.

Next we compute the root distance between a TF and its targets, and then generate a pie chart and a boxplot that reproduce the evolutionary scenarios discussed in Trefflich et al. (2019).

#-- Get roots for TFs
idx <- roots_df$TF_Targets=="TFs"
tfroots <- roots_df$Root[idx]
names(tfroots) <- roots_df$SYMBOL[idx]

#-- Get roots for TF-target genes
regulons <- tni.get(rtni1st, what = "regulons", idkey = "ENTREZ")[names(tfroots)]
regroots <- lapply(regulons, function(reg){
  roots_df$Root[roots_df$ENTREZ%in%reg]
})
tfroots <- tfroots[names(regroots)]

#-- Compute root distance between a TF and its targets
rootdist <- lapply(names(regroots), function(reg){
  regroots[[reg]]-tfroots[reg]
})
names(rootdist) <- names(regroots)

#-- Compute median root distance
med_rootdist <- unlist(lapply(rootdist, median))

#-- Plot a pie chart grouping regulons based on 
#-- the median root distance
cols = c("#1c92d5","grey","#98d1f2")
n <- as.numeric(table(sign(med_rootdist)))
pie(n, labels = paste(n,"regulons"), col = cols, border="white", cex=1.5)
legend("bottomleft",
       legend = c("TF-target genes rooted before the TF",
                  "TF-target genes rooted with the TF", 
                  "TF-target genes rooted after the TF"), 
       fill = rev(cols), bty = "n")

title Figure 2. Regulons grouped based on the median distance between a TF’s root and its targets’ roots.

#-- Plot a boxplot showing individual regulons

#-- Sort regulons based on the calculated root distances
med_rootdist <- sort(med_rootdist, decreasing = T)
rootdist <- rootdist[names(med_rootdist)]
regroots <- regroots[names(med_rootdist)]
tfroots <- tfroots[names(med_rootdist)]

#-- Generate the boxplot
plot.new()
par(usr=c(c(0,length(rootdist)),range(rootdist)))
idx <- sign(med_rootdist)+2
cols = c("#1c92d5","grey","#98d1f2")
boxplot(rootdist, horizontal= F, outline=FALSE, las=2, axes=FALSE, add=T,
        pars = list(boxwex = 0.6, boxcol=cols[idx], whiskcol=cols[idx]),
        pch="|", lty=1, lwd=0.75,
        col = cols[idx])
abline(h=0, lmitre=5, col="#E69F00", lwd=3, lt=2)
par(mgp=c(2,0.1,0))
axis(side=1, cex.axis=1.2, padj=0.5, hadj=0.5, las=1, lwd=1.5, tcl= -0.2)
par(mgp=c(2.5,1.2,0.5))
axis(side=2, cex.axis=1.2, padj=0.5, hadj=0.5, las=1, lwd=1.5, tcl= -0.2)
legend("topright",legend = c("TF-target genes rooted before the TF",
                             "TF-target genes rooted with the TF", 
                             "TF-target genes rooted after the TF"), fill = rev(cols), bty = "n")
title(xlab = "Regulons sorted by the median distance to TF root", ylab = "Distance to TF root")

title Figure 3. Regulons sorted by the median distance to TF root.

6 Session information

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggpubr_0.2                       magrittr_1.5                    
##  [3] ggplot2_3.1.1                    Fletcher2013b_1.20.0            
##  [5] igraph_1.2.4.1                   RedeR_1.32.0                    
##  [7] RTN_2.8.1                        Fletcher2013a_1.20.0            
##  [9] limma_3.40.2                     geneplast.data.string.v91_0.99.6
## [11] geneplast_1.10.3                 BiocStyle_2.12.0                
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.44.0              mixtools_1.1.0             
##  [3] splines_3.6.0               gtools_3.8.1               
##  [5] assertthat_0.2.1            minet_3.42.0               
##  [7] BiocManager_1.30.4          stats4_3.6.0               
##  [9] GenomeInfoDbData_1.2.1      yaml_2.2.0                 
## [11] viper_1.18.0                pillar_1.4.0               
## [13] lattice_0.20-38             glue_1.3.1                 
## [15] digest_0.6.19               GenomicRanges_1.36.0       
## [17] RColorBrewer_1.1-2          XVector_0.24.0             
## [19] colorspace_1.4-1            plyr_1.8.4                 
## [21] htmltools_0.3.6             Matrix_1.2-17              
## [23] pkgconfig_2.0.2             bookdown_0.10              
## [25] zlibbioc_1.30.0             purrr_0.3.2                
## [27] scales_1.0.0                snow_0.4-3                 
## [29] gdata_2.18.0                VennDiagram_1.6.20         
## [31] BiocParallel_1.18.0         tibble_2.1.1               
## [33] IRanges_2.18.0              withr_2.1.2                
## [35] SummarizedExperiment_1.14.0 BiocGenerics_0.30.0        
## [37] lazyeval_0.2.2              crayon_1.3.4               
## [39] survival_2.44-1.1           evaluate_0.13              
## [41] nlme_3.1-140                MASS_7.3-51.4              
## [43] segmented_0.5-4.0           gplots_3.0.1.1             
## [45] class_7.3-15                tools_3.6.0                
## [47] data.table_1.12.2           formatR_1.6                
## [49] matrixStats_0.54.0          stringr_1.4.0              
## [51] S4Vectors_0.22.0            munsell_0.5.0              
## [53] DelayedArray_0.10.0         lambda.r_1.2.3             
## [55] compiler_3.6.0              e1071_1.7-1                
## [57] GenomeInfoDb_1.20.0         rlang_0.3.4                
## [59] caTools_1.17.1.2            futile.logger_1.4.3        
## [61] grid_3.6.0                  RCurl_1.95-4.12            
## [63] bitops_1.0-6                rmarkdown_1.13             
## [65] gtable_0.3.0                R6_2.4.0                   
## [67] dplyr_0.8.1                 knitr_1.23                 
## [69] futile.options_1.0.1        KernSmooth_2.23-15         
## [71] ape_5.3                     stringi_1.4.3              
## [73] parallel_3.6.0              Rcpp_1.0.1                 
## [75] tidyselect_0.2.5            xfun_0.7

References

Curtis, C, S P Shah, S Chin, G Turashvili, O M Rueda, M J Dunning, D Speed, et al. 2012. “The Genomic and Transcriptomic Architecture of 2,000 Breast Tumours Reveals Novel Subgroups.” Nature 486 (7403):346–52. https://doi.org/10.1038/nature10983.

Fletcher, Michael, Mauro Castro, Suet-Feung Chin, Oscar Rueda, Xin Wang, Carlos Caldas, Bruce Ponder, Florian Markowetz, and Kerstin Meyer. 2013. “Master Regulators of FGFR2 Signalling and Breast Cancer Risk.” Nature Communications 4:2464. https://doi.org/10.1038/ncomms3464.

Trefflich, Sheyla, Rodrigo JS Dalmolin, José M Ortega, and Mauro AA Castro. 2019. “Which Came First, the Transcriptional Regulator or Its Target Genes? An Evolutionary Perspective into the Construction of Eukaryotic Regulons.” Biochimica et Biophysica Acta [under review].