Preface

I am so excited to have this book published. The book is meant as a guide to data integration, manipulation and visualization of phylogenetic trees using a suite of R packages: tidytree, treeio, ggtree and ggtreeExtra. We therefore assume that you have a working knowledge of R and ggplot2 before you start reading.

The development of the ggtree package started during my PhD study at the University of Hong Kong. I joined the State Key Laboratory of Emerging Infectious Diseases (SKLEID) under the supervision of Yi Guan and Tommy Lam. I was asked to help modify a newick tree string so that additional information, such as amino acid substitutions, was incorporated into the internal node labels of the phylogeny for visualization. I wrote an R script to do it and soon realized that most phylogenetic tree visualization software could display only one type of data through node labels. In other words, two data variables could not be displayed at the same time for comparative analysis. To produce tree graphs displaying different types of branch- or node-associated information, such as bootstrap values and substitutions, people mostly relied on image post-processing software. This situation motivated me to develop ggtree. First of all, I believed that a good user interface must fully support the ggplot2 syntax, which allows us to draw graphs by superimposing layers. In this way, simple graphs are simple, and complex graphs are just a combination of simple layers, which are easy to generate.

After several years of development, ggtree has evolved into a package suite, including tidytree for manipulating trees with data using the tidy interface; treeio for importing and exporting trees with richly annotated data; ggtree for tree visualization and annotation; and ggtreeExtra for presenting data with a phylogeny side-by-side for a rectangular layout or in outer rings for a circular layout. The ggtree package is a general tool that supports different types of trees and tree-like structures and can be applied in different disciplines to help researchers present and interpret data in an evolutionary or hierarchical context.

0.1 Structure of the book

  • Part 1 (Tree data input, output and manipulation) describes the treeio package for tree data input and output, and the tidytree package for tree data manipulation.
  • Part 2 (Tree data visualization and annotation) introduces tree visualization and annotation using the grammar of graphics syntax implemented in the ggtree package. It emphasizes presenting tree-associated data on the tree.
  • Part 3 (ggtree extensions) introduces ggtreeExtra for presenting data on circular layout trees, as well as other extensions such as MicrobiotaProcess and tanggle.
  • Part 4 (Miscellaneous topics) describes utilities provided by the ggtree package suite and presents a set of reproducible examples.

0.2 Software information and conventions

The versions of R and the core packages used when compiling this book are as follows:

R.version.string
## [1] "R version 4.1.1 (2021-08-10)"
library(treedataverse)
##  Attaching packages  treedataverse 0.0.1 

##  ape         5.5         treeio      1.18.0
##  dplyr       1.0.7       ggtree      3.2.0 
##  ggplot2     3.3.5       ggtreeExtra 1.4.0 
##  tidytree    0.3.6

The treedataverse is a meta package that makes it easy to install and load the core packages described in this book for processing and visualizing trees with data. The installation guide for treedataverse can be found in the FAQ.

The datasets used in this book have three sources:

  1. Simulation data
  2. Datasets in the R packages
  3. Data downloaded from the Internet

To make the data downloaded from the Internet more accessible, we packed them into an R package, TDbook, with detailed documentation of the original source, including the URL, authors, and citation where this information is available. TDbook is available on CRAN and can be installed using install.packages("TDbook").
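As a minimal sketch of the installation step above, the following installs TDbook only when it is missing and then lists the datasets it bundles. The CRAN mirror URL and the variable name tdbook_data are assumptions made for illustration; any CRAN mirror should work, and the first run needs network access.

```r
# Install TDbook from CRAN only if it is not already available.
# The mirror URL is an assumption; substitute your preferred CRAN mirror.
if (!requireNamespace("TDbook", quietly = TRUE)) {
  install.packages("TDbook", repos = "https://cloud.r-project.org")
}

# List the datasets bundled with the package together with their titles.
# data(package = ...) returns an object whose $results element is a
# character matrix with the columns Package, LibPath, Item and Title.
tdbook_data <- data(package = "TDbook")$results[, c("Item", "Title"), drop = FALSE]
head(tdbook_data)
```

Each listed item can then be inspected through its help page, which is where the original source of the dataset is documented.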

Package names in this book are formatted as bold text (e.g., ggtree), and function names are followed by parentheses (e.g., treeio::read.beast()). The double-colon operator (::) accesses an exported object from a package without attaching the package.
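The double-colon convention can be illustrated with a short sketch. It uses the stats package, which ships with R, purely as an assumption so that the example runs without installing anything; treeio::read.beast() is resolved in the same way once treeio is installed.

```r
# Call a function through its namespace without attaching the package.
x <- c(3, 1, 2)
stats::median(x)

# After library(), the same function is also reachable by its bare name.
library(stats)
median(x)

# Both forms refer to the same function object, provided that no other
# attached package or user-defined object masks the name `median`.
same_function <- identical(stats::median, median)
same_function
```

Writing the package name explicitly makes it clear where a function comes from and avoids ambiguity when two attached packages export the same name.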

Acknowledgements

Many people have contributed to this book with spelling and grammar corrections. I’d particularly like to thank Shuangbin Xu and Lin Li for their detailed technical reviews of the book.

Many others have contributed to the development of the ggtree package suite. I would like to thank Hadley Wickham, for creating the ggplot2 package that ggtree relies on; Tommy Tsan-Yuk Lam and Yi Guan for being great advisors and supporting the development of ggtree during my PhD; Richard Ree for inviting me to a catalysis meeting on phylogenetic tree visualization; William Pearson for inviting me to publish a protocol paper on ggtree in Current Protocols in Bioinformatics; Shuangbin Xu, Yonghe Xia, Justin Silverman, Bradley Jones, Watal M. Iwasaki, Ruizhu Huang, Casey Dunn, Tyler Bradley, Konstantinos Geles, Zebulun Arendsee and many others who have contributed source code or given me feedback; and last, but not least, the members of the ggtree mailing list, for providing many challenging problems that have helped improve the ggtree package suite.