TreeAndLeaf 1.2.0
hclust()
treeAndLeaf()
and adding legendsTreeAndLeaf is an R-based package for better visualization of dendrograms and phylogenetic trees. The package changes the way a dendrogram is viewed. Through the use of the igraph format and the package RedeR, the nodes are rearranged and the hierarchical relations are kept intact, resulting in an image that is easier to read and can be enhanced with additional layers of information.
The classical dendrogram is a limited format in two ways. Firstly, it only displays one type of information, which is the hierarchical relation between the data. Secondly, it is limited by its size, the larger the database, the less readable it becomes. The TreeAndLeaf enhances space distribution because it uses all directions, allowing for an improved visualization and a better image for publications. The package RedeR, used for plotting in this package, uses a force-based relaxation algorithm that helps nodes in avoiding overlaps. By implementing RedeR and the igraph format, the package allows for customization of the dendrogram inserting multiple layers of information to be represented by edge widths and colors, nodes colors, nodes sizes, line color, etc. The package also includes a fast formatting option for quick and exploratory analysis usage. Therefore, the package is designed to make plotting dendrograms more useful, less confusing and more productive. The workflow while using this package is depicted from Figure 1.
Figure 1. A brief representation of what TreeAndLeaf functions are capable of. (A,B) The dendrogram in A was used to construct the graph representation shown in B. (C) Workflow summary. The main input data consists of a distance matrix, which is used to generate a dendrogram. The TreeAndLeaf package transforms the dendrogram into a graph representation.
This document intends to guide you through the basics and give you ideas of how to use the functions to their full potential. Although TreeAndLeaf was created for systems biology application, it is not at all limited to this use.
This section provides a quick and basic example using the R built-in dataframe
USArrests
.
First, the packages necessary to the analysis are loaded.
library(TreeAndLeaf)
library(RedeR)
#> ***This is RedeR 1.38.0! For a quick start, please type 'vignette('RedeR')'.
#> Supporting information is available at Genome Biology 13:R29, 2012,
#> (doi:10.1186/gb-2012-13-4-r29).
library(RColorBrewer)
As stated above, USArrests
is a dataframe readily available in R.
To know more about the info shown in this dataframe, use ?USArrests
.
To use TreeAndLeaf functions to their full potential, it is recommended that
your dataframe has rownames set before making the dendrogram, like this one has.
dim(USArrests)
#> [1] 50 4
head(USArrests)
#> Murder Assault UrbanPop Rape
#> Alabama 13.2 236 58 21.2
#> Alaska 10.0 263 48 44.5
#> Arizona 8.1 294 80 31.0
#> Arkansas 8.8 190 50 19.5
#> California 9.0 276 91 40.6
#> Colorado 7.9 204 78 38.7
hclust()
In order to build a dendrogram, you need to have a distance matrix of the
observations. For example, the default “euclidean distance” method of
dist()
can be used to generate a distance matrix, and then use the “average” method of
hclust()
to create a dendrogram.
hc <- hclust(dist(USArrests), "ave")
plot(hc)
This is a rather simple but important step. Since TreeAndLeaf
and RedeR work with igraph objects, a function is provided to
convert an hclust dendrogram into an igraph. For that, simply
follow use hclust2igraph()
.
gg <- hclust2igraph(hc)
There is a quick formatting option in TreeAndLeaf package by using
the function formatTree()
, which is a theme function used to standardize
node sizes and colors. This is an important step because the tree will
have leaf nodes (the ones representing your observations) and non-leaf nodes
(the ones representing bifurcations of the dendrogram), and this function
makes the last ones invisible to achieve the desired appearance and proper
relaxation. A description of available themes can be consulted at ?formatTree
.
gg <- formatTree(gg = gg, theme = 5)
Now, the tree-and-leaf diagram is ready to be shown in RedeR with treeAndLeaf()
,
or you can have layers of information added to it, as shown below.
RedeR offers a set of functions to manipulate igraph attributes according to the parameters the application reads.
First, att.mapv()
is used to insert the dataframe inside the igraph object
and make it available for setting node attributes. In this step, it is crucial
that the refcol
points to a column with the same content as hc$labels
.
In this case, refcol = 0
indicates the rownames of the dataframe.
gg <- att.mapv(g = gg, dat = USArrests, refcol = 0)
Now that the info is available, att.setv()
changes the igraph attributes.
The package RColorBrewer can be used to generate a palette for reference.
Try ?addGraph
to see the options of igraph attributes RedeR can read.
pal <- brewer.pal(9, "Reds")
gg <- att.setv(g = gg, from = "Murder", to = "nodeColor",
cols = pal, nquant = 5)
gg <- att.setv(g = gg, from = "UrbanPop", to = "nodeSize",
xlim = c(50, 150, 1), nquant = 5)
With the igraph ready to be visualized, you need to invoke RedeR interface. This might take some seconds.
rdp <- RedPort()
calld(rdp)
resetd(rdp)
treeAndLeaf()
and adding legendsThis is TreeAndLeaf’s main function. It will read your igraph object, generate the tree layout, plot it in RedeR interface and use functions to enhance appeal and distribution.
treeAndLeaf(obj = rdp, gg = gg)
Adding legends is optional. When you call for att.setv()
and inform column
names for nodeColor
and nodeSize
, it will automatically generate a RedeR
readable legend, which can be plotted using the code below.
addLegend.color(obj = rdp, gg, title = "Murder Rate",
position = "right")
addLegend.size(obj = rdp, gg, title = "Urban Population Size",
position = "bottomright")
At this stage the image produced needs small adjustments to solve the residual edge crossings. It is possible to just click and drag a node to adjust it while the relaxation algorithm is still running.
All the different parameters can be changed and personalized throughout the steps to achieve the desired image.
The TreeAndLeaf package is particularly useful when dealing with large
dendrograms. This section uses the quakes
built-in dataframe as an
example. To know more about this data, check ?quakes
. Since each step was
detailed in the first example, this one will focus on describing only features
we were not able to see with USArrests
.
library(TreeAndLeaf)
library(RedeR)
library(RColorBrewer)
dim(quakes)
#> [1] 1000 5
head(quakes)
#> lat long depth mag stations
#> 1 -20.42 181.62 562 4.8 41
#> 2 -20.62 181.03 650 4.2 15
#> 3 -26.00 184.10 42 5.4 43
#> 4 -17.97 181.66 626 4.1 19
#> 5 -20.42 181.96 649 4.0 11
#> 6 -19.68 184.31 195 4.0 12
Clearly, when it comes to big dendrograms, it gets harder to show clusterization and any other information by conventional plotting. This is where TreeAndLeaf really makes a difference.
hc <- hclust(dist(quakes))
plot(hc)
As described before, the package function hclust2igraph()
is used for
converting and function formatTree()
is used for initial attribute
setting. From RedeR, att.mapv()
is used for inserting the dataframe
inside the igraph object and att.setv()
to change graph characteristics.
Package RColorBrewer is used in the variable pal
, to generate a color palette.
# Converting hclust to igraph format
gg <- hclust2igraph(hc)
# Formatting the tree
gg <- formatTree(gg, theme = 1, cleanalias = TRUE)
# Mapping the data into the igraph object
gg <- att.mapv(gg, quakes, refcol = 0)
# Set attributes
pal <- brewer.pal(9, "Greens")
gg <- att.setv(gg, from = "mag", to = "nodeColor",
cols = pal, nquant = 10)
gg <- att.setv(gg, from = "depth", to = "nodeSize",
xlim = c(240, 880, 1), nquant = 5)
As stated above, RedeR uses a relaxation force-based algorithm to achieve a
stable distribution of nodes. One of the parameters used to calculate
attraction and repulsion forces is nodeSize
. On the first example, the node
sizes ranged from 50 to 150 and on this one, it ranged from 240 to 880. The
treeAndLeaf()
function uses less zoom to plot if the dendrogram has a great
number of nodes, so it is necessary to use bigger sizes for bigger trees.
Therefore, the nodeSize
is a vital attribute for the tree-and-leaf structure
formation. If sizes are too small, the nodes will barely move during the
relaxation process. If sizes are too big, overlaps will be difficult to solve
and unwanted behaviors can arise. If the sizes are too different (i.e. 10 and
1000), you probably won’t be able to see the smaller ones. That being said, if
the tree is not clear, try changing parameters such as nodeSize
to achieve the
desired image.
Repeat the step described in section Quick Start.
rdp <- RedPort()
calld(rdp)
resetd(rdp)
# Plotting the tree
treeAndLeaf(rdp, gg)
# Adding legend
addLegend.color(obj = rdp, gg, title = "Richter Magnitude")
addLegend.size(obj = rdp, gg, title = "Depth (km)")
After manually solving some overlaps, you should be able to achieve the result shown below. On launching RedeR, the window Dynamic layout settings comes up, and here the parameter repulse radius is fixed to achieve the graph as shown below.
The TreeAndLeaf package is also able to work with phylogenetic trees. To show how it works, we will apply these steps to plot a tree from geneplast package. It is a tree with 121 tips listing the eukaryotes in STRING-db, release 9.1.
library(TreeAndLeaf)
library(RedeR)
library(RColorBrewer)
library(ape) # Analyses of Phylogenetics and Evolution
library(igraph)
#>
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:ape':
#>
#> edges, mst, ring
#> The following objects are masked from 'package:stats':
#>
#> decompose, spectrum
#> The following object is masked from 'package:base':
#>
#> union
As mentioned, the tree can be loaded from geneplast package by running the code below.
library(geneplast)
data("gpdata.gs")
plot(phyloTree)
Aside from exhibiting the phylogenetic tree as a tree-and-leaf diagram, extra
layers of data to each species can also be added. TreeAndLeaf package offers
a dataframe containing statistical data of eukaryotes complete genomes, downloaded
from NCBI Genomes database. For more information, type ?spdata
.
data("spdata")
The spdata
object only shows data for eukaryotes with complete genomes
available, an inner join has to be made to select only the species
available in both datasets used. Therefore, it is necessary to check
which tips of the phylo
object has a match with a row in spdata
. Then,
the tree is plotted again only with the selected tips.
# Accessory indexing
idx <- match(as.numeric(spdata$tax_id), as.numeric(phyloTree$tip.label))
idx <- idx[!is.na(idx)]
tokeep <- phyloTree$tip.label[idx]
phyloTree$tip.label <- as.character(phyloTree$tip.label)
# Remaking the tree
pruned.tree <- drop.tip(phyloTree, phyloTree$tip.label[-match(tokeep,
phyloTree$tip.label)])
For converting a phylogenetic tree to an igraph object, the package provides
another function: phylo2igraph()
.
# Converting phylo to igraph
tal.phylo <- phylo2igraph(pruned.tree)
# Formatting the tree
tal.phylo <- formatTree(tal.phylo, theme = 4)
# Mapping data to the igraph object
tal.phylo <- att.mapv(g = tal.phylo, dat = spdata, refcol = 1)
# Setting attributes
pal <- brewer.pal(9, "Purples")
tal.phylo <- att.setv(g = tal.phylo, from = "genome_size_Mb",
to = "nodeSize", xlim = c(120, 250, 1), nquant = 5)
tal.phylo <- att.setv (g = tal.phylo, from = "proteins",
to = "nodeColor", nquant = 5,
cols = pal, na.col = "black")
If treeAndLeaf()
is called now the NCBI TaxIDs will be shown above each node,
which is not desired. So the igraph object needs to be modified to show
species names, but not all of them, to prevent unreadability. For that, general
igraph manipulation functions can be used.
# Changing the alias to show the names and making them invisible
idx <- match(V(tal.phylo)$nodeAlias, spdata$tax_id)
V(tal.phylo)$nodeAlias <- spdata$sp_name[idx]
V(tal.phylo)$nodeAlias[is.na(V(tal.phylo)$nodeAlias)] <- ""
V(tal.phylo)$nodeFontSize <- 1
# Randomly selecting some names to be shown
set.seed(9)
V(tal.phylo)$nodeFontSize[sample(1:length(V(tal.phylo)$nodeFontSize), 50)] <- 100
V(tal.phylo)$nodeFontSize[V(tal.phylo)$name == "9606"] <- 100 #Homo sapiens
# Calling RedeR
rdp <- RedPort()
calld(rdp)
resetd(rdp)
# Plotting
treeAndLeaf(obj = rdp, gg = tal.phylo)
# Adding Legend
addLegend.size(rdp, tal.phylo, title = "Genome Size (Mb)")
addLegend.color(rdp, tal.phylo, title = "Protein Count")
Although TreeAndLeaf was written to work with binary trees, the package also works for some non binary diagrams such as the STRING-db species tree, release 11.0.
Since all features were detailed on previous sections, this
is just a demonstration and there will be no code explanation other than
comments. This example uses the same dataframe spdata
downloaded from NCBI Genomes,
applied on the previous example.
# Packages required
library(TreeAndLeaf)
library(RedeR)
library(RColorBrewer)
library(ape)
library(igraph)
library(geneplast)
# Loading data
data("spdata") # NCBI Genomes scraped info
data("phylo_species") # STRING-db tree metadata
data("phylo_tree") # STRING-db phylo object
# Remaking the tree with species inside spdata
idx <- match(as.numeric(spdata$tax_id), as.numeric(phylo_species$X...taxon_id))
idx <- idx[!is.na(idx)]
tokeep <- phylo_species$X...taxon_id[idx]
pruned.tree <- drop.tip(phylo_tree,phylo_tree$tip.label[-match(tokeep, phylo_tree$tip.label)])
# Converting phylo to igraph
tal.phy <- phylo2igraph(pruned.tree)
# Formatting the tree
tal.phy <- formatTree(gg = tal.phy, theme = 3)
# Mapping data into the igraph object
tal.phy <- att.mapv(g = tal.phy, dat = spdata, refcol = 1)
# Setting attributes
pal <- brewer.pal(9, "Blues")
tal.phy <- att.setv(g = tal.phy, from = "genome_size_Mb",
to = "nodeSize", nquant = 5, xlim = c(200, 600, 1))
tal.phy <- att.setv(g = tal.phy, from = "proteins", to = "nodeColor",
nquant = 5, cols = pal, na.col = "black")
# Randomly selecting names to be shown
set.seed(9)
V(tal.phy)$nodeFontSize <- 1
V(tal.phy)$nodeFontSize[sample(1:length(V(tal.phy)$nodeFontSize), 80)] <- 300
V(tal.phy)$nodeFontSize[V(tal.phy)$name == 9606] <- 300
idx <- match(V(tal.phy)$nodeAlias, spdata$tax_id)
V(tal.phy)$nodeAlias <- spdata$sp_name[idx]
V(tal.phy)$nodeAlias[is.na(V(tal.phy)$nodeAlias)] <- ""
# Calling RedeR and plotting
rdp <- RedPort()
calld(rdp)
resetd(rdp)
# Plotting the tree
treeAndLeaf(rdp, tal.phy)
# Adding legends
addLegend.color(rdp, tal.phy, title = "Protein count")
addLegend.size(rdp, tal.phy, title = "Genome size (Mb)")
The package is freely available from the Bioconductor at https://bioconductor.org/packages/TreeAndLeaf.
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] geneplast_1.16.0 igraph_1.2.6 ape_5.4-1 RColorBrewer_1.1-2
#> [5] RedeR_1.38.0 TreeAndLeaf_1.2.0 BiocStyle_2.18.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.5 knitr_1.30 magrittr_1.5
#> [4] lattice_0.20-41 rlang_0.4.8 stringr_1.4.0
#> [7] tools_4.0.3 parallel_4.0.3 grid_4.0.3
#> [10] data.table_1.13.2 nlme_3.1-150 xfun_0.18
#> [13] snow_0.4-3 htmltools_0.5.0 yaml_2.2.1
#> [16] digest_0.6.27 bookdown_0.21 BiocManager_1.30.10
#> [19] evaluate_0.14 rmarkdown_2.5 stringi_1.5.3
#> [22] compiler_4.0.3 magick_2.5.0 pkgconfig_2.0.3