Integrating Tree Data: Methods and Applications

We have proposed and developed a series of methods and software tools for the manipulation, integration, and visualization of phylogenetic trees and data. Key innovations include: (1) introducing the grammar of graphics to the field of phylogenetics for the first time; (2) enhancing the data integration capabilities of phylogenetics and its application across various disciplines; (3) proposing two universal methods for phylogenetic data integration and visualization; and (4) designing data structures that can store phylogenetic trees, associated data, and visualization directives to ensure analytical reproducibility. These methods and tools offer a concise and unified syntax, helping researchers discover hidden patterns and propose new hypotheses by integrating heterogeneous data within the context of evolution or hierarchy.

Two monographs have been published to introduce this series of work: “Data integration, manipulation and visualization of phylogenetic trees” (in English) by CRC Press and 《R实战:系统发育树的数据集成操作与可视化》 (in Chinese) by Publishing House of Electronics Industry (电子工业出版社).


1. First implementation of the grammar of graphics for visualizing phylogenetic tree data

Numerous software tools exist for visualizing phylogenetic trees, but they primarily focus on displaying the tree’s topological structure and often lack comprehensive support for annotating the tree with additional knowledge or data. For the first time, ggtree introduces the grammar of graphics into the visualization of phylogenetic trees and related data. This innovative approach enables a high level of abstraction for visualization through a simple grammar, significantly reducing the complexity of data visualization and accommodating complex requirements. This work was published in Methods in Ecology and Evolution in 2017 (ESI highly cited) and was chosen by the journal as one of the ten representative works for its 10th anniversary celebration. An invited protocol paper demonstrating the use of this package was also published in Current Protocols in Bioinformatics in 2020.

2. Base classes and functions for phylogenetic tree input and output

The outputs of phylogenetic software are often in non-standard formats, leading to compatibility issues and hindering integration and comparative analysis in downstream applications. To address this challenge, we developed treeio, a tool capable of parsing both standard and a variety of non-standard data formats. It facilitates the integration of external data and supports exporting phylogenetic trees and associated data into a single file. This dual capability enables data format conversion, thereby indirectly expanding the compatibility of software with a wider range of data. By parsing and integrating diverse data types, treeio empowers downstream integrated and comparative analyses, thus broadening the application scope of phylogenetic analysis. Our work, published in Molecular Biology and Evolution in 2020 (ESI highly cited), underscores the significance of this advancement.
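The idea of exporting a tree and its associated data into a single file can be illustrated with a minimal sketch in Python. This is not how treeio is implemented and it does not use the package's API; the JSON layout, the function names `export_tree` and `import_tree`, and the sample annotations are all assumptions made for illustration.

```python
import json


def export_tree(newick, node_data):
    """Bundle a Newick string and per-label annotations into one JSON document."""
    return json.dumps({"tree": newick, "data": node_data}, sort_keys=True)


def import_tree(text):
    """Recover the Newick string and the annotations from the bundled document."""
    doc = json.loads(text)
    return doc["tree"], doc["data"]


# Hypothetical tree and annotations, keyed by tip label.
newick = "((A:1,B:1):1,C:2);"
annotations = {"A": {"host": "human"}, "C": {"host": "swine"}}

bundle = export_tree(newick, annotations)
tree_back, data_back = import_tree(bundle)
```

Because the tree and its annotations travel together, a downstream tool only needs to understand one format to recover both.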

3. Proposing two general methods for the integration and visualization of phylogenetic data

Two comprehensive methods have been proposed and implemented to address all facets of integrating and visualizing phylogenetic data. The first method maps data directly onto the tree’s topology, so that the data can be displayed directly or transformed into visualization features. The second method restructures external data based on the tree’s topology, allowing users to visualize it according to their specifications and then align the visualization with the phylogenetic tree. These two methods enable the integration of diverse heterogeneous data from various disciplines within the context of phylogenetics, thereby aiding in the discovery of new patterns or the formulation of novel hypotheses. This work, published in Molecular Biology and Evolution in 2018 (ESI highly cited), served as the foundation for the development of the ggtreeExtra package. This package enhances integration and visualization capabilities for data richly annotated with additional features, and was published in Molecular Biology and Evolution in 2021.
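The difference between the two methods can be sketched in a few lines of Python. This is a conceptual illustration only, not the ggtree or ggtreeExtra implementation: the toy tree, the `traits` and `abundance` tables, and the function names are hypothetical. The first function attaches external values to tips by matching labels; the second reorders an external table to follow the tree's tip order so that a separate panel would line up with the tips.

```python
# Toy tree as nested tuples; leaves are tip labels (hypothetical data).
tree = (("A", "B"), ("C", ("D", "E")))


def tip_order(node):
    """Return tip labels in the order a depth-first traversal visits them."""
    if isinstance(node, str):
        return [node]
    tips = []
    for child in node:
        tips.extend(tip_order(child))
    return tips


def attach_to_tips(node, data):
    """Method 1 (sketch): map external data onto the tree by tip label.

    Tips without a matching record get None.
    """
    return {tip: data.get(tip) for tip in tip_order(node)}


def align_to_tree(node, rows):
    """Method 2 (sketch): reorder external rows to follow the tree's tip order."""
    by_label = {row["label"]: row for row in rows}
    return [by_label[tip] for tip in tip_order(node) if tip in by_label]


traits = {"A": "host1", "C": "host2", "E": "host1"}
abundance = [
    {"label": "E", "value": 5},
    {"label": "A", "value": 2},
    {"label": "C", "value": 9},
]

mapped = attach_to_tips(tree, traits)      # values keyed by tip, in tree order
aligned = align_to_tree(tree, abundance)   # rows sorted to match the tips
```

In the first case the data becomes part of the tree and can drive visual features such as color; in the second the data stays in its own table and is only rearranged, which leaves the choice of how to draw it open.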

4. Enhancing data reuse and analytical reproducibility

Visualizations of phylogenetic trees are typically published as static images, so neither the trees nor the associated data can be reused, which impedes the integration of phylogenetic knowledge and comparative analysis. Research indicates that approximately 60% of published phylogenetic data is lost permanently. To tackle this issue, we devised the ggtree object, which encapsulates the phylogenetic tree, data, and visualization directives. This object can be rendered into an image, and the phylogenetic tree and related data can be extracted from it. Furthermore, akin to the “format painter” tool, the visualization directives are transferable for visualizing other tree objects. Our efforts, published in iMeta in 2022, promote data reusability and research replicability, while facilitating the integration and comparative analysis of phylogenetic data within the field.
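A minimal Python sketch can convey the idea of an object that carries the tree, the data, and the visualization directives together. It is not the actual ggtree object: the `TreePlot` class, its methods, and the directive dictionaries are hypothetical and stand in for the concept only.

```python
from dataclasses import dataclass, field


@dataclass
class TreePlot:
    """Hypothetical container: tree, associated data, and drawing directives."""

    tree: str                                   # Newick string
    data: dict                                  # annotations keyed by tip label
    layers: list = field(default_factory=list)  # ordered drawing directives

    def add(self, layer):
        """Return a new plot with one more directive appended."""
        return TreePlot(self.tree, self.data, self.layers + [layer])

    def apply_style_to(self, tree, data):
        """Reuse this plot's directives on another tree ("format painter")."""
        return TreePlot(tree, data, list(self.layers))


first = (
    TreePlot("((A,B),C);", {"A": {"host": "human"}})
    .add({"geom": "tip_label"})
    .add({"geom": "tip_point", "color_by": "host"})
)

# The tree and data remain retrievable from the object itself.
recovered_tree, recovered_data = first.tree, first.data

# The same directives applied to a different tree and data set.
second = first.apply_style_to("((X,Y),(Z,W));", {"X": {"host": "swine"}})
```

Keeping the directives as data rather than as an already rendered image is what makes both extraction and style transfer possible in this sketch.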

5. Expanding support to other tree-like structures

We are broadening the scope of our tree data integration and visualization tools to encompass other tree-like structures, such as hierarchical clustering and classification/regression trees. We have developed the ggtreeDendro package to accommodate general hierarchical structures and are currently working on the ecluster package to support various omics data structures available in the Bioconductor project. These advancements enable the interpretation and integration of related data at the feature or sample level based on their hierarchical relationships.


Feedback from the academic community

It is embarrassing to me that I just learned about “ggtree” this month, but I am looking forward to incorporating it into several projects in my lab. I am looking forward to using it to visualizing annotation results after doing BLAST or FASTA similarity searches, and for high lighting errors in protein sequences among closely related organisms. I have been searching for a tool like “ggtree” for several years, not appreciating that you had already built it. So my interest in recruiting your paper is somewhat selfish. I would like more help in understanding how to add graphical annotations to the leaves of trees.

– The email from Professor William Pearson (ISCB Fellow), dated September 14, 2019, invited Guangchuang Yu to contribute to Current Protocols in Bioinformatics.

Publications