The ggtree
package should not be viewed solely as a standalone software. While it is useful for viewing, annotating and manipulating phylogenetic trees, it is also an infrastructure that enables evolutionary evidences that inferred by commonly used software packages in the field to be used in R
. For instance, dN/dS values or ancestral sequences inferred by CODEML1, clade support values (posterior) inferred by BEAST2 and short read placement by EPA3 and pplacer4. These evolutionary evidences are not only used in annotating phylogenetic tree in ggtree
but can also be further analyzed in R
.
Supported File Formats
Most of the tree viewer software (including R
packages) focus on Newick
and Nexus
file formats, while there are file formats from different evolution analysis software that contain supporting evidences within the file that are ready for annotating a phylogenetic tree. The ggtree
package define several parser functions and S4
classes to store statistical evidences inferred by commonly used software packages. It supports several file formats, including:
- Newick (via
ape
) - Nexus (via
ape
) - New Hampshire eXtended format (NHX)
- Jplace
- Phylip
and software output from:
Parser functions
The ggtree
package implement several parser functions, including:
read.beast
for parsing output of BEASEread.codeml
for parsing output of CODEML (rst
andmlc
files)read.codeml_mlc
for parsingmlc
file (output ofCODEML
)read.hyphy
for parsing output of HYPHYread.jplace
for parsingjplace
file including output from EPA and pplacerread.nhx
for parsingNHX
file including output from PHYLODOG and RevBayesread.paml_rst
for parsingrst
file (output ofBASEML
andCODEML
)read.r8s
for parsing output of r8sread.raxml
for parsing output of RAxML
S4 classes
Correspondingly, ggtree
defines several S4
classes to store evolutionary evidences inferred by these software packages, including:
apeBootstrap
for bootstrap analysis ofape::boot.phylo()
10, output ofapeBoot()
defined inggtree
beast
for storing output ofread.beast()
codeml
for storing output ofread.codeml()
codeml_mlc
for storing output ofread.codeml_mlc()
hyphy
for storing output ofread.hyphy()
jplace
for storing output ofread.jplace()
nhx
for storing output ofread.nhx()
paml_rst
forrst
file obtained by PAML, includingBASEML
andCODEML
.phangorn
for storing ancestral sequences inferred byR
packagephangorn
11, output ofphyPML()
defined inggtree
r8s
for storing output ofread.r8s()
raxml
for storing output ofread.raxml()
The jplace
class is also designed to store user specified annotation data.
Here is an overview of these S4
classes:
In addition, ggtree
also supports phylo
, multiPhylo
(defined by ape
10), phylo4
, phylo4d
(defined by phylobase
) obkData
(defined in OutbreakTools
) and phyloseq
(defined in phyloseq
).
In ggtree
, tree objects can be merged and evidences inferred from different phylogenetic analyses can be combined or compared and visualized.
Viewing a phylogenetic tree in ggtree
is easy by using the command ggtree(tree_object)
and annotating a phylogenetic tree is simple by adding graphic layers using the grammar of graphics.
For each class, we defined get.fields
method to get the annotation features that available in the object that can be used to annotate a phylogenetic tree directly in ggtree
. A get.tree
method can be used to convert tree object to phylo
(or multiPhylo
for r8s
) object that are widely supported by other R
packages.
The groupOTU
method is used for clustering related OTUs (from tips to their most recent common ancestor). Related OTUs are not necessarily within a clade, they can be distantly related. groupOTU
works fine for monophyletic (clade), polyphyletic and paraphyletic, while groupClade
only works for clade (monophyletic). These methods are useful for clustering related OTUs or clades.
The fortify
method is used to convert tree object to a data.frame
which is familiar by R
users and easy to manipulate. The output data.frame
contains tree information and all evolutionary evidences (if available, e.g. dN/dS in codeml
object).
Detail descriptions of slots
defined in each class are documented in class man pages. Users can use class?className
(e.g. class?beast
) to access man page of a class.
Getting Tree Data into R
Parsing BEAST output
file <- system.file("extdata/BEAST", "beast_mcc.tree", package="treeio")
beast <- read.beast(file)
beast
## 'beast' S4 object that stored information of
## '/home/biocbuild/bbs-3.5-bioc/R/library/treeio/extdata/BEAST/beast_mcc.tree'.
##
## ...@ tree:
## Phylogenetic tree with 15 tips and 14 internal nodes.
##
## Tip labels:
## A_1995, B_1996, C_1995, D_1987, E_1996, F_1997, ...
##
## Rooted; includes branch lengths.
##
## with the following features available:
## 'height', 'height_0.95_HPD', 'height_median', 'height_range', 'length',
## 'length_0.95_HPD', 'length_median', 'length_range', 'posterior', 'rate',
## 'rate_0.95_HPD', 'rate_median', 'rate_range'.
Since %
is not a valid character in names
, all the feature names that contain x%
will convert to 0.x
. For example, length_95%_HPD
will be changed to length_0.95_HPD
.
The get.fields
method return all available features that can be used for annotation.
get.fields(beast)
## [1] "height" "height_0.95_HPD" "height_median"
## [4] "height_range" "length" "length_0.95_HPD"
## [7] "length_median" "length_range" "posterior"
## [10] "rate" "rate_0.95_HPD" "rate_median"
## [13] "rate_range"
Users can use ggtree(beast)
to visualize the tree and add layer to annotate it.
ggtree(beast, ndigits=2, branch.length = 'none') + geom_text(aes(x=branch, label=length_0.95_HPD), vjust=-.5, color='firebrick')
ggtree
provides geom_range
layer to display uncertainty of branch length.
ggtree(beast) + geom_range(range='length_0.95_HPD', color='red', alpha=.6, size=2)
In FigTree
, only heigh_0.95_HPD
is meaningful since the branch is scaled by height
. In ggtree
we can display HPD of rate
, height
or other variable if available since ggtree
can rescale a tree using rescale_tree
function or by specifing branch.length
in ggtree
function.
ggtree(beast, branch.length = 'rate') + geom_range(range='rate_0.95_HPD', color='red', alpha=.6, size=2)
With ggtree
, evolutionary evidences inferred by commonly used software packages (BEAST
in this example) can be easily transformed to a tidy data.frame
by fortify
method.
beast_data <- fortify(beast)
head(beast_data)
## node parent branch.length x y label isTip branch angle
## 1 1 26 1.3203366 19.92609 13 A_1995 TRUE 19.265920 312
## 2 2 25 3.2619545 20.92609 12 B_1996 TRUE 19.295111 288
## 3 3 28 3.3119924 19.92609 7 C_1995 TRUE 18.270092 168
## 4 4 17 8.8216633 11.92609 1 D_1987 TRUE 7.515257 24
## 5 5 29 3.2710481 20.92609 9 E_1996 TRUE 19.290565 216
## 6 6 27 0.8311578 21.92609 14 F_1997 TRUE 21.510510 336
## height height_0.95_HPD height_median height_range length
## 1 18 [18,18] 18 [18,18] 1.3441569
## 2 17 [17,17] 17 [17,17] 3.3017477
## 3 18 [18,18] 18 [18,18] 3.3286573
## 4 26 [26,26] 26 [26,26] 9.1413106
## 5 17 [17,17] 17 [17,17] 3.2862156
## 6 16 [16,16] 16 [16,16] 0.8488314
## length_0.95_HPD length_median
## 1 [0.68517891886869,2.10354483714556] 1.3212658
## 2 [2.53358832907781,4.1401029285715] 3.2820598
## 3 [2.27307022364686,4.31554438343149] 3.3119924
## 4 [5.30884002039044,13.4387641030349] 8.9748680
## 5 [2.22106540956613,4.33254269801286] 3.2713107
## 6 [0.400580889894957,1.31579833143232] 0.8311578
## length_range posterior rate
## 1 [0.265324005330331,3.07384777043179] NA 0.002910248
## 2 [2.0792675983921,5.06630544445367] NA 0.002931375
## 3 [1.71975701694686,6.50816017066041] NA 0.002897274
## 4 [3.19947741596854,18.8278732882232] NA 0.002924273
## 5 [1.64732407690073,5.67293898900339] NA 0.002930741
## 6 [0.24483144082356,1.97014285250898] NA 0.002908179
## rate_0.95_HPD rate_median
## 1 [0.00223113475668131,0.00356451343607832] 0.002894327
## 2 [0.0022724788824011,0.00354123504386722] 0.002911165
## 3 [0.00227453143856834,0.00351910617601055] 0.002891624
## 4 [0.00215956295110437,0.00357908783275133] 0.002906094
## 5 [0.00228866191654885,0.00356083209119594] 0.002916350
## 6 [0.00222013461432246,0.0035242778400976] 0.002894097
## rate_range
## 1 [0.00139151638500062,0.00527869901952313]
## 2 [0.00175983928116987,0.00501640955442468]
## 3 [0.00137369890223726,0.00491233632681979]
## 4 [0.00129525039729045,0.00607832777082667]
## 5 [0.00141612851412156,0.00557985134567404]
## 6 [0.00118946556091268,0.00607832777082667]
Parsing PAML output
rst
file
The read.paml_rst
function can parse rst
file from BASEML
and CODEML
. The only difference is the space in the sequences. For BASEML
, each ten bases are separated by one space, while for CODEML
, each three bases (triplet) are separated by one space.
brstfile <- system.file("extdata/PAML_Baseml", "rst", package="treeio")
brst <- read.paml_rst(brstfile)
brst
## 'paml_rst' S4 object that stored information of
## '/home/biocbuild/bbs-3.5-bioc/R/library/treeio/extdata/PAML_Baseml/rst'.
##
## ...@ tree:
## Phylogenetic tree with 15 tips and 13 internal nodes.
##
## Tip labels:
## A, B, C, D, E, F, ...
## Node labels:
## 16, 17, 18, 19, 20, 21, ...
##
## Unrooted; includes branch lengths.
##
## with the following features available:
## 'marginal_subs', 'joint_subs', 'marginal_AA_subs', 'joint_AA_subs'.
p <- ggtree(brst) +
geom_text(aes(x=branch, label=marginal_AA_subs), vjust=-.5, color='steelblue') +
theme_tree2()
print(p)
Similarly, we can parse the rst
file from CODEML
and visualize the tree with amino acid substitution.
crstfile <- system.file("extdata/PAML_Codeml", "rst", package="treeio")
crst <- read.paml_rst(crstfile)
crst
## 'paml_rst' S4 object that stored information of
## '/home/biocbuild/bbs-3.5-bioc/R/library/treeio/extdata/PAML_Codeml/rst'.
##
## ...@ tree:
## Phylogenetic tree with 15 tips and 13 internal nodes.
##
## Tip labels:
## A, B, C, D, E, F, ...
## Node labels:
## 16, 17, 18, 19, 20, 21, ...
##
## Unrooted; includes branch lengths.
##
## with the following features available:
## 'marginal_subs', 'joint_subs', 'marginal_AA_subs', 'joint_AA_subs'.
ggtree
defines the update
operator, %<%
, that can update a tree view (applying the same pattern of visualization and annotation) with another tree object.
p %<% crst
We can find that these two figures have different evolution distances and subsitutions inferred by BASEML
and CODEML
are slightly different.
CODEML
mlc file
The mlc
file output by CODEML
contains dN/dS
estimation.
mlcfile <- system.file("extdata/PAML_Codeml", "mlc", package="treeio")
mlc <- read.codeml_mlc(mlcfile)
mlc
## 'codeml_mlc' S4 object that stored information of
## '/home/biocbuild/bbs-3.5-bioc/R/library/treeio/extdata/PAML_Codeml/mlc'.
##
## ...@ tree:
## Phylogenetic tree with 15 tips and 13 internal nodes.
##
## Tip labels:
## A, B, C, D, E, F, ...
## Node labels:
## 16, 17, 18, 19, 20, 21, ...
##
## Unrooted; includes branch lengths.
##
## with the following features available:
## 't', 'N', 'S', 'dN_vs_dS', 'dN', 'dS', 'N_x_dN', 'S_x_dS'.
Please aware that /
and *
are not valid characters in names
, they were changed to _vs_
and _x_
respectively.
So dN_vs_dS
is dN/dS
, N_x_dN
is N*dN
, and S_x_dS
is S*dS
.
ggtree(mlc) + geom_text(aes(x=branch, label=dN_vs_dS), color='blue', vjust=-.2)
The paramter branch.length
can be one of available annotations:
get.fields(mlc)
## [1] "t" "N" "S" "dN_vs_dS" "dN" "dS"
## [7] "N_x_dN" "S_x_dS"
For example, if we set branch.length
to dN_vs_dS
, the tree view will be re-scaled based on \(\omega\) (dN/dS
).
ggtree(mlc, branch.length = "dN_vs_dS", aes(color=dN_vs_dS)) +
scale_color_continuous(name='dN/dS', limits=c(0, 1.5),
oob=scales::squish, low="darkgreen", high="red")+
theme_tree2(legend.position=c(.9, .5))
We can also plot the dN
or dS
tree and others. More examples (including evidences inferred by BEAST) can be found in the Tree Annotation vignette.
CODEML
output: rst and mlc files
We annotate the tree with information presented in rst
and mlc
file separately as demonstrated in previous sessions.
We can also use both of them which is highly recommended.
ml <- read.codeml(crstfile, mlcfile)
ml
## 'codeml' S4 object that stored information of
## '/home/biocbuild/bbs-3.5-bioc/R/library/treeio/extdata/PAML_Codeml/rst' and
## '/home/biocbuild/bbs-3.5-bioc/R/library/treeio/extdata/PAML_Codeml/mlc'.
##
## ...@ tree:
## Phylogenetic tree with 15 tips and 13 internal nodes.
##
## Tip labels:
## A, B, C, D, E, F, ...
## Node labels:
## 16, 17, 18, 19, 20, 21, ...
##
## Unrooted; includes branch lengths.
##
## with the following features available:
## 'marginal_subs', 'joint_subs', 'marginal_AA_subs', 'joint_AA_subs', 't',
## 'N', 'S', 'dN_vs_dS', 'dN', 'dS', 'N_x_dN', 'S_x_dS'.
head(fortify(ml))
## label node parent branch.length x y isTip branch angle t
## 14 A 1 27 0.010340 0.417866 13 TRUE 0.4126960 312 0.010
## 15 B 2 26 0.031643 0.431097 12 TRUE 0.4152755 288 0.032
## 16 C 3 22 0.027730 0.418729 7 TRUE 0.4048640 168 0.028
## 17 D 4 17 0.082021 0.313649 3 TRUE 0.2726385 72 0.082
## 18 E 5 23 0.031104 0.430431 8 TRUE 0.4148790 192 0.031
## 19 F 6 28 0.006649 0.442478 14 TRUE 0.4391535 336 0.007
## N S dN_vs_dS dN dS N_x_dN S_x_dS
## 14 1514.9 633.1 0.0646 0.0007 0.0101 1 6.4
## 15 1514.9 633.1 0.0001 0.0000 0.0358 0 22.7
## 16 1514.9 633.1 0.0461 0.0013 0.0282 2 17.9
## 17 1514.9 633.1 0.0385 0.0033 0.0849 5 53.8
## 18 1514.9 633.1 0.0641 0.0020 0.0305 3 19.3
## 19 1514.9 633.1 0.2980 0.0013 0.0044 2 2.8
## marginal_subs
## 14 C214T / G294T / T966C \n C1641T / A1808G / C1941T / C2070T
## 15 C222T / G555A / C732A / C828T / G939T / T999C / A1047G / G1083A / C1263T / G1329A \n A1341G / A1344T / T1405C / T1440C / C1752T / A1827G / C1911T / C1941T / G2013A / T2022C / C2043T
## 16 A204G / C222T / G504A / A815G / C891T / G1023T / T1061C / G1101A \n G1176A / C1299T / C1545T / C1584T / T1626C / A1668G / G1878A / G1884T / C1944T / A1947G
## 17 A111G / A168G / T211C / A213G / T214C / G243A / A297G / A311G / T315A / G369A / G423A / A477C / G504A / G555A / G585A / C621T / C636T / A768G / G774A / T780C / C825T / A867G / A870G / G885A / C888T / G897A \n C987T / C1006T / G1017A / A1047G / C1062T / A1146T / G1197A / A1230G / A1281G / T1365C / C1392T / T1439C / A1443G / A1470G / A1482G / T1512C / T1530C / T1539C / T1560C / A1608G / G1671A / G1689A / G1695A / A1761G / T1776C / T1787C / A1914G / G2052A
## 18 C174T / G243A / G321A / A622T / T831C / A894G / G948T / C960T / A984G \n C1041T / G1135A / T1215C / T1249C / A1401G / T1527A / T1594C / A1826G / A1848G / T1950C / G1986A
## 19 T829C / C1353T / A1443G / T1548C / C1645A
## joint_subs
## 14 C214T / G294T / T966C \n C1641T / A1808G / C1941T / C2070T
## 15 C222T / G555A / C732A / C828T / G939T / T999C / A1047G / G1083A / C1263T / G1329A \n A1341G / A1344T / T1405C / T1440C / C1752T / A1827G / C1911T / C1941T / G2013A / T2022C / C2043T
## 16 A204G / C222T / G504A / A815G / C891T / G1023T / T1061C / G1101A \n G1176A / C1299T / C1545T / C1584T / T1626C / A1668G / G1878A / G1884T / C1944T / A1947G
## 17 A111G / A168G / T211C / A213G / T214C / G243A / A311G / T315A / G369A / G423A / A477C / G504A / G555A / G585A / C621T / C636T / A768G / T771C / G774A / T780C / C825T / A867G / A870G / G885A / C888T \n G897A / C987T / C1006T / G1017A / A1047G / C1062T / A1146T / G1197A / A1230G / A1281G / T1365C / C1392T / T1439C / A1443G / T1512C / T1530C / T1539C / T1560C / A1608G / G1671A / G1689A / G1695A / A1761G / T1776C / T1787C / A1914G / G2052A
## 18 C174T / G243A / G321A / A622T / T831C / A894G / G948T / C960T / A984G \n C1041T / G1135A / T1215C / T1249C / A1401G / T1527A / T1594C / A1826G / A1848G / T1950C / G1986A
## 19 T829C / C1353T / A1443G / T1548C / C1645A
## marginal_AA_subs
## 14 K603R
## 15
## 16 N272S / I354T
## 17 K104R / F105L / E382D / F480S / I596T
## 18 T208S / V379I / K609R
## 19 S277P / L549I
## joint_AA_subs
## 14 K603R
## 15
## 16 N272S / I354T
## 17 K104R / F105L / E382D / F480S / I596T
## 18 T208S / V379I / K609R
## 19 S277P / L549I
All the features in both files are available for annotation. For example, we can annotate dN/dS
with the tree in rstfile
and amino acid substitutions with the tree in mlcfile
.
Parsing HYPHY output
nwk <- system.file("extdata/HYPHY", "labelledtree.tree", package="treeio")
ancseq <- system.file("extdata/HYPHY", "ancseq.nex", package="treeio")
tipfas <- system.file("extdata", "pa.fas", package="treeio")
hy <- read.hyphy(nwk, ancseq, tipfas)
hy
## 'hyphy' S4 object that stored information of
## '/home/biocbuild/bbs-3.5-bioc/R/library/treeio/extdata/HYPHY/labelledtree.tree',
## '/home/biocbuild/bbs-3.5-bioc/R/library/treeio/extdata/HYPHY/ancseq.nex' and
## '/home/biocbuild/bbs-3.5-bioc/R/library/treeio/extdata/pa.fas'.
##
## ...@ tree:
## Phylogenetic tree with 15 tips and 13 internal nodes.
##
## Tip labels:
## K, N, D, L, J, G, ...
## Node labels:
## Node1, Node2, Node3, Node4, Node5, Node12, ...
##
## Unrooted; includes branch lengths.
##
## with the following features available:
## 'subs', 'AA_subs'.
ggtree(hy) + geom_text(aes(x=branch, label=AA_subs), vjust=-.5)
Parsing r8s output
r8s output contains 3 trees, namely TREE
, RATO
and PHYLO
for time tree, rate tree and absolute substitution tree respectively.
r8s <- read.r8s(system.file("extdata/r8s", "H3_r8s_output.log", package="treeio"))
ggtree(r8s, branch.length="TREE", mrsd="2014-01-01") + theme_tree2()
branch.length
should be one of TREE
, RATO
or PHYLO
for selecting the corresponding tree.
User can also view 3 trees simultaneously.
ggtree(get.tree(r8s), aes(color=.id)) + facet_wrap(~.id, scales="free_x")
Parsing output of RAxML bootstraping analysis
raxml_file <- system.file("extdata/RAxML", "RAxML_bipartitionsBranchLabels.H3", package="treeio")
raxml <- read.raxml(raxml_file)
ggtree(raxml) + geom_label(aes(label=bootstrap, fill=bootstrap)) + geom_tiplab() +
scale_fill_continuous(low='darkgreen', high='red') + theme_tree2(legend.position='right')