You can’t even begin to understand biology, you can’t understand life, unless you understand what it’s all there for, how it arose - and that means evolution. — Richard Dawkins

Citation

If you use ggtree in published research, please cite:

G Yu, D Smith, H Zhu, Y Guan, TTY Lam,
ggtree: an R package for visualization and annotation of phylogenetic tree with different types of meta-data.
submitted.

Introduction

This project arose from our needs to annotate nucleotide substitutions in the phylogenetic tree, and we found that there is no tree visualization software can do this easily. Existing tree viewers are designed for displaying phylogenetic tree, but not annotating it. Although some tree viewers can displaying bootstrap values in the tree, it is hard/impossible to display other information in the tree. Our first solution for displaying nucleotide substituitions in the tree is to add this information in the node/tip names and use traditional tree viewer to show it. We displayed the information in the tree successfully, but We believe this indirect approach is inefficient.

In the old day, phylogenetic tree is often small. At that time, as we almost didn’t have a need to annotate a tree; displaying the evolution relationships is mostly enough. Nowadays, we can obtain a lot of data from different experiments, and we want to associate our data, for instance antigenic change, with the evolution relationship. Visualizing these associations in the phylogenetic tree can help us to identify evolution patterns. We believe we need a next generation tree viewer that should be programmable and extensible. It can view a phylogenetic tree easily as we did with classical software and support adding annotation data in a layer above the tree. This is the objective of developing the ggtree. Common tasks of annotating a phylogenetic tree should be easy and complicated tasks can be possible to achieve by adding multiple layers of annotation.

The ggtree is designed by extending the ggplot21 package. It is based on the grammar of graphics and takes all the good parts of ggplot2. There are other R packages that implement tree viewer using ggplot2, including OutbreakTools, phyloseq2 and ggphylo; they mostly create complex tree view functions for their specific needs. Internally, these packages interpret a phylogenetic as a collection of lines, which makes it hard to annotate diverse user input that are related to node (taxa). The ggtree is different to them by interpreting a tree as a collection of taxa and allowing general flexibilities of annotating phylogenetic tree with diverse types of user inputs.

Tree visualization

viewing tree with ggtree

ggtree extend ggplot to support viewing phylogenetic tree. It implements geom_tree layer for displaying phylogenetic trees, as shown below:

nwk <- system.file("extdata", "sample.nwk", package="ggtree")
x <- readLines(nwk)
cat(substring(x, 1, 56), "\n", substring(x, 57), "\n")
## (((((((A:4,B:4):6,C:5):8,D:6):3,E:21):10,((F:4,G:12):14, 
##  H:8):13):13,((I:5,J:2):30,(K:11,L:11):2):17):4,M:56);
library("ggplot2")
library("ggtree")

tree <- read.tree(nwk)
ggplot(tree, aes(x, y)) + geom_tree() + theme_tree() + xlab("") + ylab("")

This example tree was obtained from Chapter 34 of Inferring Phylogenies3. The function, ggtree, was implemented as a short cut to visualize a tree, and it works exactly the same as shown above.

ggtree takes all the advantages of ggplot2. For example, we can change the color, size and type of the lines as we do with ggplot2.

ggtree(tree, color="steelblue", size=0.5, linetype="dotted")

By default, the tree is viewing in ladderize form, user can set the parameter ladderize = FALSE to disable it.

ggtree(tree, ladderize=FALSE)

The branch.length is used to scale the edge, user can set the parameter branch.length = "none" to only viewing the tree topology (cladogram) or other numerical variable to scale the tree (e.g. dN/dS).

ggtree(tree, branch.length="none")

support multiple phylogenetic classes

ggtree defined several S4 classes to store phylogenetic object and its associated annotation, including:

In addition, it also supports phylo (defined by ape4), and phylo4 (defined by phylobase)

User can use ggtree(object) command to view the phylogenetic tree directly, and annotation data stored in these objects can be added as demonstrated in Tree annotation with output from evolution software session.

layout

Currently, ggtree supports several layout, including:

for Phylogram (by default) and Cladogram if user explicitly setting branch.length='none'.

And unrooted layout.

Unrooted layout was implemented by the equal-angle algorithm that described in Inferring Phylogenies3.

library("gridExtra")
grid.arrange(ggtree(tree) + ggtitle("(Phylogram) rectangular layout"),
             ggtree(tree, branch.length='none') + ggtitle("(Cladogram) rectangular layout"),
         ggtree(tree, layout="slanted") + ggtitle("(Phylogram) slanted layout"),
             ggtree(tree, layout="slanted", branch.length='none') + ggtitle("(Cladogram) slanted layout"),
         ggtree(tree, layout="circular") + ggtitle("(Phylogram) circular layout"),
             ggtree(tree, layout="circular", branch.length="none") + ggtitle("(Cladogram) circular layout"),
         ggtree(tree, layout="unrooted") + ggtitle("unrooted layout"),
         ncol=2)

two dimensional tree

ggtree implemented 2 dimensional tree. It accepts parameter yscale to scale the y-axis based on the selected tree attribute. The attribute should be numerical variable. If it is character/category variable, user should provides a name vector of mapping the variable to numeric by passing it to parameter yscale_mapping.

tree2d <- read.beast(system.file("extdata", "twoD.tree", package="ggtree"))
ggtree(tree2d, mrsd = "2014-05-01",
       yscale="NGS", yscale_mapping=c(N2=2, N3=3, N4=4, N5=5, N6=6, N7=7)) +
           theme_classic() + 
               theme(panel.grid.major=element_line(color="grey20", linetype="dotted", size=.3),
                     panel.grid.major.y=element_blank()) +
                         scale_y_continuous(labels=paste0("N", 2:7))

In this example, the figure demonstrates the quantity of y increase along the trunk. User can highlight the trunk with different line size or color using the functions we described below.

display evolution distance

To show evolution distance, user can use add_legend function.

ggtree(tree) %>% add_legend()

We can also use theme_tree2() or ggtree(showDistance=TRUE)

ggtree(tree) + theme_tree2()

display nodes/tips

Show all the internal nodes and tips in the tree can be done by adding a layer of points using geom_nodepoint, geom_tippoint or geom_point.

ggtree(tree)+geom_point(aes(shape=isTip, color=isTip), size=3)

p <- ggtree(tree) + geom_nodepoint(color="#b5e521", alpha=1/4, size=10)
p + geom_tippoint(color="#FDAC4F", shape=8, size=3)

display labels

Users can use geom_text to display the node/tip labels:

p + geom_text(aes(label=label), size=3, color="purple", hjust=-0.3)

For circular and unrooted layout, ggtree supports rotating node labels according to the angles of the branches.

ggtree(tree, layout="circular") + geom_text(aes(label=label, angle=angle), size=3, color="purple", vjust=-0.3)

By default, the positions are based on the node positions, we can change them to based on the middle of the branch/edge.

p + geom_text(aes(x=branch, label=label), size=3, color="purple", vjust=-0.3)

Based on the middle of branches is very useful when annotating transition from parent node to child node.

theme

theme_tree() defined a totally blank canvas, while theme_tree2() add phylogenetic distance legend. These two themes all accept a parameter of bgcolor that defined the background color.

grid.arrange(
    ggtree(rtree(30), color="red") + theme_tree("steelblue"),
    ggtree(rtree(20), color="white") + theme_tree("black"),
    ncol=2)

update tree viewing with a new tree

In the display nodes/tips section, we have a p object that stored the tree viewing of 13 tips and internal nodes highlighted with specific colored big dots. If you want to applied this pattern (we can imaging a more complex one) to a new tree, you don’t need to build the tree step by step. ggtree provides an operator, %<%, for applying the visualization pattern to a new tree.

For example, the pattern in the p object will be applied to a new tree with 50 tips as shown below:

p %<% rtree(50)

Another example can be found in CODEML session.

Tree annotation

zoom on a portion of tree

ggtree provides gzoom function that similar to zoom function provided in ape. This function plots simultaneously a whole phylogenetic tree and a portion of it. It aims at exploring very large trees.

library("ape")
data(chiroptera)
library("ggtree")
gzoom(chiroptera, grep("Plecotus", chiroptera$tip.label))

color tree

In ggtree, coloring phylogenetic tree is easy, by using aes(color=VAR) to map the color of tree based on a specific variable (numeric and category are both supported).

ggtree(tree, aes(color=branch.length)) +
    scale_color_continuous(low="green", high="red") +
        theme(legend.position="bottom")

User can use any feature, including clade posterior and dN/dS etc., to scale the color of the tree.

annotate clade

ggtree implements annotation_clade and annotation_clade2 functions to annotate a selected clade with a bar indicating that clade with a corresponding label.

The annotation_clade function accepts a selected internal node number and annotates that selected clade, while annotation_clade2 functions accepts two tip labels (upper one and lower one) to annotate the clade.

User can use geom_text to display all the node numbers, and select interesting clade to annotate.

ggtree(tree) + geom_text(aes(label=node))

p <- ggtree(tree) + geom_tiplab()
annotation_clade(p, node=17, "selected clade", offset.text=2)

annotation_clade2(p, "B", "E", "Clade X", offset.text=2) %>%
    annotation_clade2("G", "H", "Clade Y", bar.size=4, font.size=8, offset=5, offset.text=4, color="steelblue")

The parameter bar.size is used to control the width of the bar and the font.size parameter is to control the font size of the clade lable. The parameter offset is used to control the distance from the annotation to the tree, while offset.text to control the distance from clade label to bar.

highlight clades

ggtree implements hilight function, that accepts tree view and internal node number and add a layer of rectangle to highlight the selected clade.

ggtree(tree) %>% hilight(node=21, fill="steelblue", alpha=.6) %>%
    hilight(node=17, fill="darkgreen", alpha=.6)

ggtree(tree, layout="fan") %>% hilight(node=21, fill="steelblue", alpha=.6) %>%
     hilight(node=23, fill="darkgreen", alpha=.6)

Another way to highlight selected clades is setting the clades with different colors and/or line types as demonstrated in group clades section.

collapse clade

With collapse function, user can collapse a selected clade.

cp <- ggtree(tree) %>% collapse(node=21)
cp + geom_point(subset=.(node == 21), size=5, shape=23, fill="steelblue")

expand collapsed clade

The collapsed clade can be expanded via expand function.

cp %>% expand(node=21)

flip clades

The positions of two selected branches can be flip over using flip function.

set.seed(2015-06-30)
p1 <- ggtree(rtree(30)) + geom_text(aes(label=node))
p2 <- flip(p1, node1=45, node2=33)
p3 <- flip(p2, 32, 58)
grid.arrange(p1, p2, p3, ncol=3)

rotate clade

A selected clade can be rotated by 180 degree using rotate function.

set.seed(2015-07-01)
p1 <- ggtree(rtree(30)) + geom_text(aes(label=node))
p1 <- hilight(p1, 33)
p2 <- rotate(p1, 33)
grid.arrange(p1, p2, ncol=2)

group OTUs

ggtree provides groupOTU function to group tips and all their related ancestors.

tree <- groupOTU(tree, focus=c("A", "B", "C", "D", "E"))
ggtree(tree, aes(color=group)) + geom_tiplab()

groupOTU can also input a list of tip groups.

cls <- list(c1=c("A", "B", "C", "D", "E"),
            c2=c("F", "G", "H"),
            c3=c("L", "K", "I", "J"),
            c4="M")

tree <- groupOTU(tree, cls)
library("colorspace")
ggtree(tree, aes(color=group, linetype=group)) + geom_text(aes(label=label),  hjust=-.25) +
     scale_color_manual(values=c("black", rainbow_hcl(4))) + theme(legend.position="right")

groupOTU also work with tree view (ggplot2 object).

p <- ggtree(tree)
groupOTU(p, LETTERS[1:5]) + aes(color=group) + geom_tiplab() + scale_color_manual(values=c("black", "firebrick"))

iris example

In this example, we first build a tree based on the iris data, then grouping the tree based on different spacies.

data(iris)
rn <- paste0(iris[,5], "_", 1:150)
rownames(iris) <- rn
d_iris <- dist(iris[,-5], method="man")

tree_iris <- bionj(d_iris)
tree_iris <- groupOTU(tree_iris, list(setosa    = rn[1:50],
                versicolor    = rn[51:100],
                virginica_145 = rn[101:150]))
cols <- rainbow_hcl(4)
ggtree(tree_iris, aes(color=group)) +
    geom_text(aes(label=label), hjust=-.1) +
        scale_color_manual(values=cols, breaks=1:3,
                           labels=c("Setosa", "Versicolor", "Virginica")) +
                               theme(legend.position="right")

This example demonstrates how the separation of the bionj is very good with the setosa species, but misses in labeling several versicolor and virginica species.

group clades

As demonstrated above, groupOTU is used for clustering related OTUs. Related OTUs are not necessarily within a clade, they can be distantly related as demonstrated in iris example. groupOTU works fine for monophyletic (clade), polyphyletic and paraphyletic. If user wants to hilight a specific clade, we provides a more friendly function groupClade that accept an internal node or a vector of internal nodes to cluster clade/clades that works similar to groupOTU. User can also use hilight function demonstrated in hilight clades section for highlighting selected clades.

tree <- groupClade(tree, node=21)
ggtree(tree, aes(color=group, linetype=group))

tree <- groupClade(tree, node=c(21, 17))
ggtree(tree, aes(color=group, linetype=group))

It also works with tree view, just like groupOTU.

With groupOTU and groupClade, it’s easy to highlight selected taxa and easy to selecte taxa to display related feature.

ggtree(tree, aes(color=group, linetype=group)) +
    geom_text(aes(label=label), subset=.(group==2), hjust = -.5) +
        geom_text(aes(label=label), subset=.(group==1), hjust = -.5, color="blue")

scale clade

In collapse clade, we have illustrated how to collapse selected clades. Another approach is to zoom out clade to a small scale.

grid.arrange(ggtree(tree) %>% hilight(21, "steelblue"),
             ggtree(tree) %>% scaleClade(21, scale=0.3) %>% hilight(21, "steelblue"),
             ncol=2)

Of course, scaleClade can accept scale larger than 1 and zoom in the selected portion.

grid.arrange(ggtree(tree) %>% hilight(17, fill="steelblue") %>%
                 hilight(21, fill="darkgreen"),
             ggtree(tree) %>% scaleClade(17, scale=2) %>% scaleClade(21, scale=0.3) %>%
                 hilight(17, "steelblue") %>% hilight(21, fill="darkgreen"),
             ncol=2)

Tree annotation with phylopic

PhyloPic is a database that stores reusable silhouette images of organisms. ggtree supports downloading images from PhyloPic and annotating phylogenetic tree with the downloaded images.

pp <- ggtree(tree) %>% phylopic("79ad5f09-cf21-4c89-8e7d-0c82a00ce728", color="steelblue", alpha = .3)
print(pp)

pp %>% phylopic("67382184-5135-4faa-8e98-eadff02c3e8a", color="#86B875", alpha=.8, node=4) %>%
     phylopic("d3563b54-780f-4711-a49a-7ea051e9dacc", color="darkcyan", alpha=.8, node=17, width=.2)

Tree annotation with output from evolution software

In ggtree, we implemented several functions to parse the output from PAML5, HYPHY6, EPA7, PPLACER8 and BEAST9 and defined several classes to store phylogenetic object and associated annotation.

Classes include:

jplace class is also designed to store user specific annotation data, and serves as a standard format for tree annotation within the ggtree package. Please refer to the jplace file format session.

For each classes, we defined read.className to parse input file and output a corresponding object, get.fields method to get the annotation features available in the object, access methods to get these features, and plot methods for quickly viewing these annotation features.

annotating tree with BEAST output

file <- system.file("extdata/BEAST", "beast_mcc.tree", package="ggtree")
beast <- read.beast(file)
beast
## 'beast' S4 object that stored information of
##   '/tmp/RtmpDhRuy5/Rinst4d99721ed4e4/ggtree/extdata/BEAST/beast_mcc.tree'.
## 
## ...@ tree: 
## Phylogenetic tree with 15 tips and 14 internal nodes.
## 
## Tip labels:
##  A_1995, B_1996, C_1995, D_1987, E_1996, F_1997, ...
## 
## Rooted; includes branch lengths.
## 
## with the following features available:
##   'height',  'height_0.95_HPD',  'height_median',    'height_range', 'length',
##   'length_0.95_HPD', 'length_median',    'length_range', 'posterior',    'rate',
##   'rate_0.95_HPD',   'rate_median',  'rate_range'.

Since % is not a valid character in names, all the feature names that contain x% will convert to 0.x. For example, length_95%_HPD will be changed to length_0.95_HPD.

plot(beast, annotation="length_0.95_HPD", branch.length="none") + theme_tree()

User can round the digits by setting the parameter ndigits. The default value is 2.

plot(beast, annotation="height", ndigits=3, annotation.color="red")

annotating tree with PAML output

BASEML

rst file

rst file from baseml is similar to codeml output. The only difference is the space in the sequences. For baseml, each ten bases are separated by one space, while for codeml, each three bases (triplet) are separated by one space. We defined a read.paml_rst to parse rst file. It supports baseml and codeml output. The information will be stored in paml_rst S4 object.

rstfile <- system.file("extdata/PAML_Baseml", "rst", package="ggtree")
rst <- read.paml_rst(rstfile)
rst
## 'paml_rst' S4 object that stored information of
##   '/tmp/RtmpDhRuy5/Rinst4d99721ed4e4/ggtree/extdata/PAML_Baseml/rst'.
## 
## ...@ tree:
## Phylogenetic tree with 15 tips and 13 internal nodes.
## 
## Tip labels:
##  A, B, C, D, E, F, ...
## Node labels:
##  16, 17, 18, 19, 20, 21, ...
## 
## Unrooted; includes branch lengths.
## 
## with the following features available:
##   'marginal_subs',   'joint_subs',   'marginal_AA_subs', 'joint_AA_subs'.

The function read.paml_rst can accept only one parameter, rstfile, and the output can be used to view the phylogeny and the substitution annotation.

p <- plot(rst, annotation="marginal_AA_subs", annotation.color="steelblue")
print(p)

CODEML

rst file

rst file from CODEML is similar to BASEML, and also parsed by read.paml_rst function. The plot method works also in the same way.

If you remember the %<% operator introduced in update tree viewing with a new tree session, you can use it to update a tree view with a new object.

In last session, we use rstfile of BASEML to build a tree view with amino acid substitution annotated. The following example use another rstfile from CODEML to update the tree view.

rstfile <- system.file("extdata/PAML_Codeml", "rst", package="ggtree")
rst <- read.paml_rst(rstfile)
p %<% rst

You can found that these two figures have different evolution distances, and substitutions inferred from BASEML and CODEML are slightly different.

mlc file

mlcfile contains dN/dS estimation.

mlcfile <- system.file("extdata/PAML_Codeml", "mlc", package="ggtree")
mlc <- read.codeml_mlc(mlcfile)
mlc
## 'codeml_mlc' S4 object that stored information of
##   '/tmp/RtmpDhRuy5/Rinst4d99721ed4e4/ggtree/extdata/PAML_Codeml/mlc'. 
## 
## ...@ tree:
## Phylogenetic tree with 15 tips and 13 internal nodes.
## 
## Tip labels:
##  A, B, C, D, E, F, ...
## Node labels:
##  16, 17, 18, 19, 20, 21, ...
## 
## Unrooted; includes branch lengths.
## 
## with the following features available:
##   't',   'N',    'S',    'dN_vs_dS', 'dN',   'dS',   'N_x_dN',   'S_x_dS'.

Please aware that / and * are not valid characters in names, they were changed to _vs_ and _x_ respectively.

So dN_vs_dS is dN/dS, N_x_dN is N*dN, and S_x_dS is S*dS.

plot(mlc, branch.length="branch.length", annotation="dN_vs_dS", annotation.color="blue", ndigits=3)

The paramter branch.length can be one of available annotations:

get.fields(mlc)
## [1] "t"        "N"        "S"        "dN_vs_dS" "dN"       "dS"      
## [7] "N_x_dN"   "S_x_dS"

For example, if we set branch.length to dN_vs_dS, it will plot the \(\omega\) (dN/dS) tree:

plot(mlc, branch.length="dN_vs_dS", annotation="dN_vs_dS", ndigits=3)

We can also plot the dN or dS tree and others. The parameter annotation can also be one of the available annotations.

CODEML output: rst and mlc files

We annotate the tree with information presented in rstfile and mlcfile separately as demonstrated in previous sessions.

We can also use both of them and it’s highly recommended. All the features in both files are available for annotation.

ml <- read.codeml(rstfile, mlcfile)
ml
## 'codeml' S4 object that stored information of
##   '/tmp/RtmpDhRuy5/Rinst4d99721ed4e4/ggtree/extdata/PAML_Codeml/rst' and 
##  '/tmp/RtmpDhRuy5/Rinst4d99721ed4e4/ggtree/extdata/PAML_Codeml/mlc'. 
## 
## ...@ tree:
## Phylogenetic tree with 15 tips and 13 internal nodes.
## 
## Tip labels:
##  A, B, C, D, E, F, ...
## Node labels:
##  16, 17, 18, 19, 20, 21, ...
## 
## Unrooted; includes branch lengths.
## 
## with the following features available:
##   'marginal_subs',   'joint_subs',   'marginal_AA_subs', 'joint_AA_subs',
##   't',   'N',    'S',    'dN_vs_dS',
##   'dN',  'dS',   'N_x_dN',   'S_x_dS',
## .

So we can annotate dN/dS with the tree in rstfile and amino acid substitutions with the tree in mlcfile.

plot(ml, branch.length="rst.branch.length", annotation="dN_vs_dS")

plot(ml, branch.length="mlc.branch.length", annotation="marginal_AA_subs")

plot(ml, branch.length="dN", annotation="joint_AA_subs", annotation.color="darkgreen")

annotating tree with HYPHY output

nwk <- system.file("extdata/HYPHY", "labelledtree.tree", package="ggtree")
ancseq <- system.file("extdata/HYPHY", "ancseq.nex", package="ggtree")
tipfas <- system.file("extdata", "pa.fas", package="ggtree")
hy <- read.hyphy(nwk, ancseq, tipfas)
hy
## 'hyphy' S4 object that stored information of 
##   '/tmp/RtmpDhRuy5/Rinst4d99721ed4e4/ggtree/extdata/HYPHY/labelledtree.tree', 
##  '/tmp/RtmpDhRuy5/Rinst4d99721ed4e4/ggtree/extdata/HYPHY/ancseq.nex' and 
##  '/tmp/RtmpDhRuy5/Rinst4d99721ed4e4/ggtree/extdata/pa.fas'. 
## 
## ...@ tree:
## Phylogenetic tree with 15 tips and 13 internal nodes.
## 
## Tip labels:
##  K, N, D, L, J, G, ...
## Node labels:
##  Node1, Node2, Node3, Node4, Node5, Node12, ...
## 
## Unrooted; includes branch lengths.
## 
## with the following features available:
##   'subs',    'AA_subs'.
plot(hy, annotation="AA_subs")

annotating tree with RAxML bootstraping analysis

raxml_file <- system.file("extdata/RAxML", "RAxML_bipartitionsBranchLabels.H3", package="ggtree")
raxml <- read.raxml(raxml_file)
plot(raxml)

annotating tree with r8s output

r8s output contains 3 trees, namely TREE, RATO and PHYLO for time tree, rate tree and absolute substitution tree respectively.

r8s <- read.r8s(system.file("extdata/r8s", "H3_r8s_output.log", package="ggtree"))
ggtree(r8s, branch.length="TREE", mrsd="2014-01-01") + theme_tree2()

branch.length should be one of TREE, RATO or PHYLO for selecting the corresponding tree.

User can also view 3 trees simultaneously.

ggtree(get.tree(r8s)) + facet_wrap(~.id, scales="free_x") + theme_tree2()

annotating tree with EPA and PPLACER output

EPA7 and PPLACER8 have common output file format, jplace.

jpf <- system.file("extdata/sample.jplace",  package="ggtree")
jp <- read.jplace(jpf)
print(jp)
## 'jplace' S4 object that stored information of
##   '/tmp/RtmpDhRuy5/Rinst4d99721ed4e4/ggtree/extdata/sample.jplace'. 
## 
## ...@ tree: 
## Phylogenetic tree with 13 tips and 12 internal nodes.
## 
## Tip labels:
##  A, B, C, D, E, F, ...
## 
## Rooted; includes branch lengths.
## 
## with the following features availables:
##   'edge_num',    'likelihood',   'like_weight_ratio',    'distal_length',    'pendant_length'.

In ggtree, we provide get.placements method to access the placement.

## get only best hit
get.placements(jp, by="best")
##   name edge_num likelihood like_weight_ratio distal_length pendant_length
## 1   AA       24  -61371.30          0.333344         3e-06       0.003887
## 2   BB        1  -61312.21          0.333335         1e-06       0.000003
## 3   BB        1  -61312.21          0.333335         1e-06       0.000003
## 4   BB        2  -61312.21          0.333322         3e-06       0.000003
## 5   CC        8  -61312.23          0.200011         1e-06       0.000003
## 6   CC        8  -61312.23          0.200011         1e-06       0.000003
## 7   CC        9  -61312.23          0.200000         3e-06       0.000003
## get all placement
get.placements(jp, by="all")
##   name edge_num likelihood like_weight_ratio distal_length pendant_length
## 1   AA       24  -61371.30          0.333344      0.000003       0.003887
## 2   BB        1  -61312.21          0.333335      0.000001       0.000003
## 3   BB        2  -61312.21          0.333322      0.000003       0.000003
## 4   BB      550  -61312.21          0.333322      0.000961       0.000003
## 5   CC        8  -61312.23          0.200011      0.000001       0.000003
## 6   CC        9  -61312.23          0.200000      0.000003       0.000003
## 7   CC       10  -61312.23          0.199992      0.000003       0.000003

This is only a tiny sample file. In reality, EPA and PPLACER may place thousands of short reads on a reference tree.

We may, for example, count the number of placement and annotate this information in the tree. We do not provide a plot method for jplace object, since we use this file format as a standard annotation format in ggtree package and have no assumption of information it may stored. Please refer to jplace file format session.

mergine tree objects

In ggtree, objects can be merged and evidences inferred from different phylogenetic analyses can be combined or compared and visualized.

User can use the command like:

tree <- merge_tree(tree_object_1, tree_object_2)
## it's chainable, and serveral tree objects can be merged into one tree object
tree <- merge_tree(tree_object_1, tree_object_2) %>% merge_tree(tree_object_3) %>% merge_tree(tree_object_4)

Here we use a tree analyzed by BEAST and CodeML and merge them into one.

beast_file <- system.file("examples/MCC_FluA_H3.tree", package="ggtree")
beast_tree <- read.beast(beast_file)

rst_file <- system.file("examples/rst", package="ggtree")
mlc_file <- system.file("examples/mlc", package="ggtree")
codeml_tree <- read.codeml(rst_file, mlc_file)

merged_tree <- merge_tree(beast_tree, codeml_tree)

merged_tree
## 'beast' S4 object that stored information of
##   '/tmp/RtmpDhRuy5/Rinst4d99721ed4e4/ggtree/examples/MCC_FluA_H3.tree'.
## 
## ...@ tree: 
## Phylogenetic tree with 76 tips and 75 internal nodes.
## 
## Tip labels:
##  A/Hokkaido/30-1-a/2013, A/New_York/334/2004, A/New_York/463/2005, A/New_York/452/1999, A/New_York/238/2005, A/New_York/523/1998, ...
## 
## Rooted; includes branch lengths.
## 
## with the following features available:
##   'height',  'height_0.95_HPD',  'height_median',    'height_range', 'length',
##   'length_0.95_HPD', 'length_median',    'length_range', 'posterior',    'rate',
##   'rate_0.95_HPD',   'rate_median',  'rate_range',   't',    'N',
##   'S',   'dN_vs_dS', 'dN',   'dS',   'N_x_dN',
##   'S_x_dS',  'marginal_subs',    'joint_subs',   'marginal_AA_subs', 'joint_AA_subs',
## .

After merging, all evidences inferred from different tools can be used to annotate the tree simultaneously. In this example, we used ‘dN’ inferred by CodeML to color the tree and annotate the tree with ‘posterior’ inferred by BEAST.

ggtree(merged_tree, aes(color=dN), mrsd="2013-01-01", ndigits = 3) +
    geom_text(aes(label=posterior), vjust=.1, hjust=-.1, size=5) + theme_tree2() +
        scale_color_continuous(low="green", high="red") + theme(legend.position="right")

annotating tree using ggplot2 layers

We implemented several plot methods for easily viewing annotation data. Users are not restricted to plot methods provided. They can use geom_text to add annotation layer. All annotation data are visible to ggplot2.

In the following example, we use the codeml object to visualize the \(\omega\) (dN/dS) tree, and annotate the tree with dN and dS.

ggtree(ml, branch.length="dN_vs_dS") + 
    geom_text(aes(x=branch, label=dN), 
              size=3, vjust=-0.5, color="red") +
                  geom_text(aes(x=branch, label=dS),
                            size=3, vjust=1.2, color="darkgreen")

Tree annotation with user specific annotation

the %<+% operator

We provides several functions to parse and store information from common software output, and corresponding plot methods for visualizing annotation in the tree.

Here, we would like to demonstrate how to inject user specific annotation data in a tree.

Suppose we have the following data that associated with the tree and would like to attach the data in the tree.

nwk <- system.file("extdata", "sample.nwk", package="ggtree")
tree <- read.tree(nwk)
p <- ggtree(tree)

dd <- data.frame(taxa  = LETTERS[1:13], 
             place = c(rep("GZ", 5), rep("HK", 3), rep("CZ", 4), NA),
                 value = round(abs(rnorm(13, mean=70, sd=10)), digits=1))
## you don't need to order the data
## data was reshuffled just for demonstration
dd <- dd[sample(1:13, 13), ]
row.names(dd) <- NULL
print(dd)
taxa place value
D GZ 76.8
J CZ 86.6
F HK 53.8
L CZ 81.4
G HK 66.5
H HK 67.8
B GZ 96.9
C GZ 60.5
K CZ 49.2
I CZ 55.4
M NA 67.6
A GZ 63.8
E GZ 64.2

We can imaging that the place column is the place we isolated the species and value column stored numerical values for example bootstrap values.

We have shown using the operator, %<%, to update a tree view with a new tree. Here, we will introduce another operator, %<+%, that attaches annotation data to a tree view. The only requirement of the input data is that its first column should be matched with the node/tip labels of the tree.

After attaching the annotation data to the tree by %<+%, all the columns in the data are visible to ggplot2. As an example, here we attach the above annotation data to the tree view, p, and add a layer that showing the tip labels and colored them by the isolation site stored in place column.

p <- p %<+% dd + geom_text(aes(color=place, label=label), hjust=-0.5) + 
       geom_tippoint(aes(size=value, shape=place, color=place), alpha=0.25)
p+theme(legend.position="right")

Once the data was attached, it is always attached. So we can add another layer to display the isolation sites easily.

p <- p + geom_text(aes(color=place, label=place), hjust=1, vjust=-0.4, size=3)
print(p)

And another layer showing numerical values:

p <- p + geom_text(aes(color=place, label=value), hjust=1, vjust=1.4, size=3)
print(p)

jplace file format

The jplace file format was defined by Masten10 for phylogenetic placements. We employed this file format to store phylogenetic tree and user specific annotation data. Suppose we have a tree, and the associated data as shown below:

tree <- system.file("extdata", "pa.nwk", package="ggtree")
data <- read.csv(system.file("extdata", "pa_subs.csv", package="ggtree"), stringsAsFactor=FALSE)
print(tree)
## [1] "/tmp/RtmpDhRuy5/Rinst4d99721ed4e4/ggtree/extdata/pa.nwk"
head(data)
##   label                          subs    gc
## 1     A                   R262K/K603R 0.444
## 2     B                         R262K 0.442
## 3     C                   N272S/I354T 0.439
## 4     D K104R/F105L/E382D/F480S/I596T 0.449
## 5     E             T208S/V379I/K609R 0.443
## 6     F                   S277P/L549I 0.444

The data contains amino acid substitutions from parent node to child node and GC contents of each node. We can annotate the tree as demonstrated in user specific annotation session.

ggtree provides a function, write.jplace, to combine a tree and an associated data and store them to a single jplace file.

outfile <- tempfile()
write.jplace(tree, data, outfile)

Then read.jplace function was designed to read the jplace file and store the information to a jplace object.

jp <- read.jplace(outfile)
print(jp)
## 'jplace' S4 object that stored information of
##   '/tmp/Rtmp2yKAEM/file528b7c1c83fe'. 
## 
## ...@ tree: 
## Phylogenetic tree with 15 tips and 13 internal nodes.
## 
## Tip labels:
##  K, N, D, L, J, G, ...
## 
## Unrooted; includes branch lengths.
## 
## with the following features availables:
##   'label',   'subs', 'gc'.

Now we know the jp object stored the tree and the associated amino acid substitution and GC content information, we can view the tree and display the associated annotation data on it directly by ggtree.

ggtree(jp, showDistance=TRUE) + 
       geom_text(aes(x=branch, label=subs), color="purple", vjust=-1, size=3) + 
       geom_text(aes(label=gc), color="steelblue", hjust=-.6, size=3) +
       geom_text(aes(label=label), hjust=-.5)

visualize tree with associated matrix

At first we implemented gplot function to visualize tree with heatmap but it has an issue that it can’t always guarantee the heatmap aligning to the tree properly, since the line up is between two figures and it’s currently not supported internally by ggplot2. I have implemented another function gheatmap that can do the line up properly by creating a new layer above the tree.

In the following example, we visualized a tree of H3 influenza viruses with their associated genotype.

genotype_file <- system.file("examples/Genotype.txt", package="ggtree")
genotype <- read.table(genotype_file, sep="\t", stringsAsFactor=F)
p <- ggtree(beast_tree, mrsd="2013-01-01") %>% add_legend(x=2008, y=5)
p <- p + geom_tiplab(size=3)
gheatmap(p, genotype, offset = 2, width=0.5)

The width parameter is to control the width of the heatmap. It supports another parameter offset for controling the distance between the tree and the heatmap, for instance left space for tip labels.

For time scaled tree, as in this example, it’s more often to use x axis by using theme_tree2. But with this solution, the heatmap is just another layer and will change the x axis. To overcome this issue, we implemented scale_x_ggtree to set the x axis more reasonable. User can also use gplot and tweak the positions of two plot to align properly.

p <- ggtree(beast_tree,  mrsd="2013-01-01") + geom_tiplab(size=3, align=TRUE) + theme_tree2()
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`size`)
pp <- (p + scale_y_continuous(expand=c(0, 0.3))) %>%
    gheatmap(genotype, offset=4, width=0.5, colnames=FALSE) %>%
        scale_x_ggtree()
pp + theme(legend.position="right")

visualize tree with multiple sequence alignment

With msaplot function, user can visualizes multiple sequence alignment with phylogenetic tree, as demonstrated below:

fasta <- system.file("examples/FluA_H3_AA.fas", package="ggtree")
msaplot(ggtree(beast_tree), fasta) 

A specific slice of the alignment can also be displayed by specific window parameter.

External documents

Bugs/Feature requests

If you have any, let me know. Thx!

Session info

Here is the output of sessionInfo() on the system on which this document was compiled:

## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.3 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ggtree_1.0.21       gridExtra_2.0.0     ggplot2_1.0.1      
##  [4] ape_3.3             Biostrings_2.36.4   XVector_0.8.0      
##  [7] IRanges_2.2.9       S4Vectors_0.6.6     BiocGenerics_0.14.0
## [10] colorspace_1.2-6    BiocStyle_1.6.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.1     formatR_1.2.1   plyr_1.8.3      highr_0.5.1    
##  [5] tools_3.2.2     zlibbioc_1.14.0 digest_0.6.8    jsonlite_0.9.17
##  [9] evaluate_0.8    gtable_0.1.2    nlme_3.1-122    lattice_0.20-33
## [13] png_0.1-7       yaml_2.1.13     proto_0.3-10    stringr_1.0.0  
## [17] knitr_1.11      fftwtools_0.9-7 locfit_1.5-9.1  grid_3.2.2     
## [21] jpeg_0.1-8      rmarkdown_0.8   reshape2_1.4.1  magrittr_1.5   
## [25] scales_0.3.0    htmltools_0.2.6 MASS_7.3-44     abind_1.4-3    
## [29] EBImage_4.10.1  tiff_0.1-5      labeling_0.3    stringi_0.5-5  
## [33] munsell_0.4.2

References

1.Wickham, H. Ggplot2: Elegant graphics for data analysis. (Springer, 2009).

2.McMurdie, P. J. & Holmes, S. Phyloseq: An r package for reproducible interactive analysis and graphics of microbiome census data. PLoS ONE 8, e61217 (2013).

3.Felsenstein, J. Inferring phylogenies. (Sinauer Associates, 2003).

4.Paradis, E., Claude, J. & Strimmer, K. APE: Analyses of phylogenetics and evolution in r language. Bioinformatics 20, 289–290 (2004).

5.Yang, Z. PAML 4: Phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution 24, 1586–1591 (2007).

6.Pond, S. L. K., Frost, S. D. W. & Muse, S. V. HyPhy: Hypothesis testing using phylogenies. Bioinformatics 21, 676–679 (2005).

7.Berger, S. A., Krompass, D. & Stamatakis, A. Performance, accuracy, and web server for evolutionary placement of short sequence reads under maximum likelihood. Systematic Biology 60, 291–302 (2011).

8.Matsen, F. A., Kodner, R. B. & Armbrust, E. V. Pplacer: Linear time maximum-likelihood and bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics 11, 538 (2010).

9.Bouckaert, R. et al. BEAST 2: A software platform for bayesian evolutionary analysis. PLoS Comput Biol 10, e1003537 (2014).

10.Matsen, F. A., Hoffman, N. G., Gallagher, A. & Stamatakis, A. A format for phylogenetic placements. PLoS ONE 7, e31009 (2012).