You can’t even begin to understand biology, you can’t understand life, unless you understand what it’s all there for, how it arose - and that means evolution. — Richard Dawkins

Citation

If you use ggtree in published research, please cite:

G Yu, D Smith, H Zhu, Y Guan, TTY Lam,
ggtree: an R package for visualization and annotation of phylogenetic tree with different types of meta-data.
revised.

Introduction

This project arose from our need to annotate nucleotide substitutions in a phylogenetic tree; we found no tree visualization software that could do this easily. Existing tree viewers are designed for displaying phylogenetic trees, not for annotating them. Although some tree viewers can display bootstrap values in the tree, it is hard or impossible to display other information. Our first solution for displaying nucleotide substitutions was to add this information to the node/tip names and use a traditional tree viewer to show it. We displayed the information in the tree successfully, but we believe this indirect approach is inefficient.

Previously, phylogenetic trees were much smaller and annotation was less necessary; nowadays much more data is becoming available. We want to associate our experimental data, for instance antigenic change, with evolutionary relationships. Visualizing these associations in a phylogenetic tree can help us identify evolutionary patterns. We believe we need a next-generation tree viewer that is programmable and extensible: it should display a phylogenetic tree as easily as classical software does, and support adding annotation data as layers on top of the tree. This is the objective of developing ggtree. Common annotation tasks should be easy, and complicated tasks should be achievable by adding multiple layers of annotation.

ggtree is designed by extending the ggplot2 [1] package. It is based on the grammar of graphics and takes all the good parts of ggplot2. Other R packages implement tree viewers using ggplot2, including OutbreakTools, phyloseq [2] and ggphylo; they mostly create complex tree view functions for their specific needs. Internally, these packages interpret a phylogenetic tree as a collection of lines, which makes it hard to annotate the tree with diverse user input related to nodes (taxa). ggtree differs by interpreting a tree as a collection of taxa, which allows flexible annotation of a phylogenetic tree with diverse types of user input.

Getting data into R

Most tree viewer software (including R packages) focuses on the Newick and Nexus file formats, while the output files of many evolutionary analysis programs contain supporting evidence that is ready to be used for annotating a phylogenetic tree. In addition to Newick and Nexus, ggtree supports the NHX, jplace and Phylip file formats. ggtree also supports output from BEAST [3], EPA [4], HYPHY [5], PAML [6], PHYLDOG [7], pplacer [8], r8s [9], RAxML [10] and RevBayes [11].

Parsing the output of these molecular evolution programs serves more than visualization in ggtree: it also brings the data to R users for further analysis (e.g. summarization, visualization, comparison, testing).

For more details, please refer to the Tree Data Import vignette.
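As a minimal sketch of getting a tree into R, the example below parses a small Newick string with read.tree() from the ape package (ape appears in the session info below). The Newick string and its tip names are made up for illustration, and the format-specific parsers for the other software listed above are deliberately not shown here; the Tree Data Import vignette is the place to look for those.

```r
library(ape)

# A tiny Newick string standing in for a tree file (hypothetical tip names)
newick <- "((A:0.1,B:0.2):0.3,(C:0.4,D:0.5):0.6);"

# read.tree() accepts either a file path or, as here, the text itself
tree <- read.tree(text = newick)

print(tree)       # short summary of the parsed "phylo" object
tree$tip.label    # tip names in the order they appear in the string
Ntip(tree)        # number of tips
```

The resulting object is a regular "phylo" object, so it can be passed on to any function that accepts one.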

Tree Visualization and Annotation

Tree visualization in ggtree is easy: a single command, ggtree(tree_object), draws the tree. It supports several layouts, including rectangular, slanted and circular for phylograms and cladograms, an unrooted layout, and time-scaled and two-dimensional phylogenies. The Tree Visualization vignette describes these features in detail.
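The layouts described above can be sketched as follows. This assumes ggtree and ape are installed; the random tree from rtree() is only a stand-in for a real tree object, and the variable names are illustrative.

```r
library(ape)
library(ggtree)

set.seed(1)
tree <- rtree(20)   # random 20-tip tree used as a placeholder

# Phylograms: branch lengths are drawn to scale
p_rect  <- ggtree(tree)                        # default rectangular layout
p_slant <- ggtree(tree, layout = "slanted")
p_circ  <- ggtree(tree, layout = "circular")

# Cladogram: drop branch length information and show topology only
p_clado <- ggtree(tree, branch.length = "none")

print(p_circ)
```

Each call returns a ggplot object, so the result can be stored, printed, or extended with further layers.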

We implement several functions to manipulate a phylogenetic tree.

Details and examples can be found in the Tree Manipulation vignette.

Most phylogenetic trees are scaled by evolutionary distance (substitutions per site). In ggtree, a phylogenetic tree can be re-scaled by any numerical variable inferred by evolutionary analysis (e.g. species divergence time, dN/dS). Both numerical and categorical variables can be used to color a phylogenetic tree.

The ggtree package provides several layers to annotate a phylogenetic tree, including geom_tiplab for adding tip labels, geom_treescale for adding a legend of tree scale, geom_hilight for highlighting selected clades and geom_cladelabel for labelling selected clades.
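The four layers named above can be combined as in the sketch below. The tree is random, the clade is picked arbitrarily as the most recent common ancestor of two tips via ape's getMRCA(), and the fill color and label text are placeholders, so treat this as an illustration of how layers stack and not as a recommended figure.

```r
library(ape)
library(ggtree)

set.seed(1)
tree <- rtree(15)   # random placeholder tree with tips "t1" ... "t15"

# Pick a clade to annotate: the MRCA of two arbitrarily chosen tips
clade <- getMRCA(tree, c("t1", "t2"))

p <- ggtree(tree) +
  geom_tiplab() +                                             # tip labels
  geom_treescale() +                                          # scale legend
  geom_hilight(node = clade, fill = "steelblue", alpha = 0.3) +  # shade the clade
  geom_cladelabel(node = clade, label = "clade of interest")  # label the clade

print(p)
```

Because every annotation is a separate layer, any of them can be removed or swapped without touching the rest of the plot.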

It supports annotating phylogenetic trees with analyses obtained from R packages and other commonly used evolutionary software. Users' own annotation data (e.g. experimental data) can also be integrated to annotate phylogenetic trees. ggtree provides the write.jplace function to combine a Newick tree file and the user's own data into a single jplace file, which can then be parsed so that the data can be used to annotate the tree directly in ggtree.
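One way to attach user data to a tree, sketched below, is ggtree's %<+% operator, which joins a data frame to the tree view by taxon label. The operator is not named in the text above, so its use here is an assumption about which route the reader wants; the data frame, its column names and the group values are all invented for illustration.

```r
library(ape)
library(ggplot2)
library(ggtree)

set.seed(1)
tree <- rtree(10)   # random placeholder tree

# Hypothetical experimental data: the first column must hold the tip labels
meta <- data.frame(
  taxa  = tree$tip.label,
  group = rep(c("a", "b"), times = 5),
  stringsAsFactors = FALSE
)

# Join the data to the tree view, then map a column to an aesthetic
p <- ggtree(tree) %<+% meta +
  geom_tippoint(aes(color = group))

print(p)
```

After the join, the columns of the data frame are available to any layer added afterwards.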

ggtree integrates the phylopic database, so silhouette images of organisms can be downloaded and used to annotate a phylogenetic tree directly. ggtree also supports using local images to annotate a phylogenetic tree.

Visualizing an annotated phylogenetic tree together with a numerical matrix (e.g. a genotype table), a multiple sequence alignment or subplots is also supported in ggtree. Examples of annotating phylogenetic trees can be found in the Tree Annotation and Advance Tree Annotation vignettes.
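For the matrix case, a rough sketch using ggtree's gheatmap() function is shown below. The genotype table is randomly generated, its column names are made up, and the offset and width values are arbitrary starting points that usually need tuning for a real tree.

```r
library(ape)
library(ggtree)

set.seed(1)
tree <- rtree(10)   # random placeholder tree

# Hypothetical genotype table: one row per tip, row names must match tip labels
geno <- as.data.frame(
  matrix(sample(c("A", "G"), 10 * 3, replace = TRUE),
         nrow = 10,
         dimnames = list(tree$tip.label, c("site1", "site2", "site3")))
)

p <- ggtree(tree) + geom_tiplab()

# Draw the table as a heatmap aligned to the tips of the tree
h <- gheatmap(p, geno, offset = 0.5, width = 0.4)

print(h)
```

The heatmap rows are matched to tips by row name, so the order of rows in the table does not matter.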

Vignette Entry

More documents can be found at http://guangchuangyu.github.io/tags/ggtree.

Bugs/Feature requests

If you have any, let me know. Thx!

Session info

Here is the output of sessionInfo() on the system on which this document was compiled:

## R version 3.2.4 (2016-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] phangorn_2.0.2      Biostrings_2.38.4   XVector_0.10.0     
##  [4] IRanges_2.4.8       S4Vectors_0.8.11    BiocGenerics_0.16.1
##  [7] colorspace_1.2-6    ggtree_1.2.17       ggplot2_2.1.0      
## [10] ape_3.4            
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.3     formatR_1.3     plyr_1.8.3      tools_3.2.4    
##  [5] zlibbioc_1.16.0 digest_0.6.9    jsonlite_0.9.19 evaluate_0.8.3 
##  [9] nlme_3.1-125    gtable_0.2.0    lattice_0.20-33 png_0.1-7      
## [13] Matrix_1.2-4    igraph_1.0.1    DBI_0.3.1       yaml_2.1.13    
## [17] stringr_1.0.0   dplyr_0.4.3     knitr_1.12.3    fftwtools_0.9-7
## [21] locfit_1.5-9.1  grid_3.2.4      R6_2.1.2        jpeg_0.1-8     
## [25] rmarkdown_0.9.5 reshape2_1.4.1  tidyr_0.4.1     magrittr_1.5   
## [29] nnls_1.4        scales_0.4.0    htmltools_0.3   assertthat_0.1 
## [33] abind_1.4-3     EBImage_4.12.2  tiff_0.1-5      quadprog_1.5-5 
## [37] labeling_0.3    stringi_1.0-1   lazyeval_0.1.10 munsell_0.4.3

References

1. Wickham, H. ggplot2: Elegant graphics for data analysis. (Springer, 2009).

2. McMurdie, P. J. & Holmes, S. phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data. PLoS ONE 8, e61217 (2013).

3. Bouckaert, R. et al. BEAST 2: A software platform for Bayesian evolutionary analysis. PLoS Comput Biol 10, e1003537 (2014).

4. Berger, S. A., Krompass, D. & Stamatakis, A. Performance, accuracy, and web server for evolutionary placement of short sequence reads under maximum likelihood. Systematic Biology 60, 291–302 (2011).

5. Pond, S. L. K., Frost, S. D. W. & Muse, S. V. HyPhy: Hypothesis testing using phylogenies. Bioinformatics 21, 676–679 (2005).

6. Yang, Z. PAML 4: Phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution 24, 1586–1591 (2007).

7. Boussau, B. et al. Genome-scale coestimation of species and gene trees. Genome Res. 23, 323–330 (2013).

8. Matsen, F. A., Kodner, R. B. & Armbrust, E. V. pplacer: Linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics 11, 538 (2010).

9. Marazzi, B. et al. Locating evolutionary precursors on a phylogenetic tree. Evolution 66, 3918–3930 (2012).

10. Stamatakis, A. RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics btu033 (2014). doi:10.1093/bioinformatics/btu033

11. Höhna, S. et al. Probabilistic graphical model representation in phylogenetics. Syst Biol 63, 753–771 (2014).