Phylogenetic Data Integration: Methods and Applications

We have developed methods and software tools for the manipulation, integration, and visualization of phylogenetic trees and data. Key contributions include: (1) parsing and integrating phylogenetic data; (2) visualizing phylogenetic trees using the grammar of graphics; (3) mapping heterogeneous data onto evolutionary trees; and (4) ensuring reproducibility with programmable data structures. These efforts aim to support researchers in analyzing data within an evolutionary context.

Two monographs have been published to introduce this series of work: “Data integration, manipulation and visualization of phylogenetic trees” (in English) by CRC Press and 《R实战:系统发育树的数据集成操作与可视化》 (in Chinese) by Publishing House of Electronics Industry (电子工业出版社).


1. Parsing and Integrating Phylogenetic Data

Phylogenetic software often writes its output in non-standard formats, which causes compatibility problems and hinders integration and comparative analysis. To address this challenge, we developed treeio, a tool that parses both standard and a variety of non-standard data formats. It facilitates the integration of external data and exports phylogenetic trees together with their associated data into a single file. This makes format conversion straightforward and prepares the combined data for downstream analysis. Our work was published in Molecular Biology and Evolution in 2020.
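The parse, integrate, and export workflow described above can be illustrated with a minimal sketch. This is a language-neutral illustration written in Python, not the treeio API: the function names (`parse_newick`, `tips`, `attach`), the nested-dict tree representation, and the JSON export are all assumptions made for the example, and the parser handles only labels and branch lengths.

```python
import json


def parse_newick(text):
    """Parse a simple Newick string (labels and branch lengths only)."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        node = {"label": "", "length": None, "children": []}
        if text[pos] == "(":
            pos += 1  # skip "("
            while True:
                node["children"].append(parse_node())
                if text[pos] == ",":
                    pos += 1  # another sibling follows
                    continue
                pos += 1  # skip ")"
                break
        # Optional node label
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        node["label"] = text[start:pos]
        # Optional branch length after ":"
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            node["length"] = float(text[start:pos])
        return node

    return parse_node()


def tips(node):
    """Return the tip (leaf) nodes in left-to-right order."""
    if not node["children"]:
        return [node]
    return [t for child in node["children"] for t in tips(child)]


def attach(tree, table):
    """Attach external per-tip records to tips with a matching label."""
    for tip in tips(tree):
        tip["data"] = table.get(tip["label"], {})
    return tree


tree = parse_newick("((A:1,B:2):0.5,C:3);")
traits = {"A": {"host": "human"}, "C": {"host": "swine"}}  # made-up example data
combined = attach(tree, traits)

# Tree and associated data written out as one self-contained document
exported = json.dumps(combined)
```

Tips without a matching record (here `B`) receive an empty record rather than being dropped, so the exported structure always mirrors the full tree.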

2. Visualizing Phylogenetic Trees with Grammar of Graphics

Numerous software tools exist for visualizing phylogenetic trees, but they primarily focus on displaying the tree’s topological structure. ggtree introduces the grammar of graphics into the visualization of phylogenetic trees and related data. This approach enables visualization through a simple grammar, reducing the complexity of data visualization and accommodating various requirements. This work was published in Methods in Ecology and Evolution in 2017. An invited protocol paper demonstrating the use of this package was also published in Current Protocols in Bioinformatics in 2020.

3. Associating and Visualizing Data on Phylogeny

We proposed two methods for integrating and visualizing phylogenetic data. The first method maps data directly onto the tree’s topology. The second method restructures external data according to the tree’s topology, so that users can visualize the data however they choose and the resulting graphic stays aligned with the phylogenetic tree. These methods enable the integration of heterogeneous data within the context of phylogenetics. This work was published in Molecular Biology and Evolution in 2018. The ggtreeExtra package, which enhances these capabilities, was published in Molecular Biology and Evolution in 2021.
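The difference between the two methods can be sketched in a few lines. The example below is a conceptual illustration in Python, not ggtree code: the nested-tuple tree, the `tip_order` helper, and the sample values are hypothetical. Method 1 looks up a value for every tip, while method 2 reorders an external table so that its rows follow the order in which the tips are drawn.

```python
def tip_order(tree):
    """Return tip labels in the order a rectangular layout would draw them."""
    if isinstance(tree, str):
        return [tree]
    return [label for child in tree for label in tip_order(child)]


tree = (("A", "B"), ("C", ("D", "E")))

# Vertical position of each tip in the drawn tree
y = {label: i for i, label in enumerate(tip_order(tree))}

# Method 1: map values directly onto the tips of the tree.
abundance = {"D": 7, "A": 3, "E": 1, "B": 9}  # made-up example data
mapped = {label: abundance.get(label) for label in y}

# Method 2: restructure an external table to follow the tree, so that any
# plot of the table lines up row by row with the tips.
rows = [
    {"id": "D", "value": 7},
    {"id": "A", "value": 3},
    {"id": "E", "value": 1},
    {"id": "B", "value": 9},
]
aligned = sorted((r for r in rows if r["id"] in y), key=lambda r: y[r["id"]])
```

In the second method the table keeps its own shape, which is why it can be drawn with any chart type before being placed next to the tree.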

4. Enhancing Reproducibility

Phylogenetic tree visualizations are typically delivered as static images, which cannot be reused. We devised the ggtree object, which encapsulates the phylogenetic tree, data, and visualization directives. This object can be rendered into an image, and the phylogenetic tree and related data can be extracted from it. Furthermore, the visualization directives are transferable for visualizing other tree objects. This work, published in iMeta in 2022, supports data reusability and research replicability.
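The idea of one object holding the tree, the data, and the visualization directives can be sketched as follows. This is a hypothetical Python illustration, not the ggtree object itself: the `TreePlot` class, its `reuse_on` method, the `("tip_text", column)` layer encoding, and the plain-text rendering are all assumptions made for the example.

```python
from dataclasses import dataclass, replace


def tip_order(tree):
    """Return tip labels of a nested-tuple tree in drawing order."""
    if isinstance(tree, str):
        return [tree]
    return [label for child in tree for label in tip_order(child)]


@dataclass
class TreePlot:
    tree: tuple        # the phylogeny
    data: dict         # per-tip records
    layers: tuple = () # visualization directives, in the order they were added

    def __add__(self, layer):
        # Adding a layer returns a new plot; the original stays unchanged.
        return replace(self, layers=self.layers + (layer,))

    def reuse_on(self, tree, data):
        # Transfer the same directives to a different tree and data set.
        return replace(self, tree=tree, data=data)

    def render(self):
        # Plain-text stand-in for drawing an image.
        lines = []
        for label in tip_order(self.tree):
            parts = [label]
            for kind, column in self.layers:
                if kind == "tip_text":
                    parts.append(str(self.data.get(label, {}).get(column, "NA")))
            lines.append(" | ".join(parts))
        return "\n".join(lines)


plot = TreePlot(
    tree=(("A", "B"), "C"),
    data={"A": {"host": "human"}, "B": {"host": "swine"}},
) + ("tip_text", "host")

first = plot.render()

# The tree and data remain extractable from the plot object,
# and its directives can be applied to another tree.
other = plot.reuse_on(tree=("X", ("Y", "Z")), data={"Y": {"host": "avian"}})
second = other.render()
```

Because rendering is only one of the things the object can do, the figure stays a source of data rather than a dead end.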

5. Extending to General Hierarchical Structures

We have broadened the scope of tools pertaining to tree data integration and visualization to encompass other tree-like structures, such as hierarchical clustering. We developed the ggtreeDendro package to accommodate general hierarchical structures and are working on the ecluster package to support various omics data structures. These advancements enable the interpretation and integration of related data based on their hierarchical relationships.
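To see why a hierarchical clustering result can be handled with tree tools, it helps to write one out as a tree. The sketch below is an illustration in Python, not ggtreeDendro code. It assumes a merge list in the convention used by SciPy linkage matrices (indices below the number of labels refer to the original items; index n + k refers to the cluster created by the k-th merge) and converts it to a Newick string; the function name and sample values are hypothetical.

```python
def merges_to_newick(labels, merges):
    """Convert a list of (i, j, height) merges into a Newick string."""
    # Each entry holds (Newick text, height of that node); items start at height 0.
    nodes = [(label, 0.0) for label in labels]
    for i, j, height in merges:
        (left, left_height), (right, right_height) = nodes[i], nodes[j]
        # A branch length is the gap between the merge height and the child's height.
        text = (
            f"({left}:{height - left_height:g},"
            f"{right}:{height - right_height:g})"
        )
        nodes.append((text, height))
    return nodes[-1][0] + ";"


labels = ["s1", "s2", "s3"]
merges = [(0, 1, 1.0), (2, 3, 2.5)]  # made-up clustering result
newick = merges_to_newick(labels, merges)
```

Once the clustering is expressed as a tree, the same parsing, data-mapping, and visualization steps used for phylogenies apply to it.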


Feedback from the academic community

It is embarrassing to me that I just learned about “ggtree” this month, but I am looking forward to incorporating it into several projects in my lab. I am looking forward to using it to visualizing annotation results after doing BLAST or FASTA similarity searches, and for high lighting errors in protein sequences among closely related organisms. I have been searching for a tool like “ggtree” for several years, not appreciating that you had already built it. So my interest in recruiting your paper is somewhat selfish. I would like more help in understanding how to add graphical annotations to the leaves of trees.

– From an email by Professor William Pearson (ISCB Fellow), dated September 14, 2019, inviting Guangchuang Yu to contribute to Current Protocols in Bioinformatics.

Publications