Tree Data Integration

A series of methods and software tools have been proposed and developed for the manipulation, integration, and visualization of trees and data. This includes (1) introducing the grammar of graphics into the field of phylogenetics for the first time, (2) enhancing the data integration ability of phylogenetics and its application in different disciplines, (3) proposing two universal methods for phylogenetic data integration and visualization, and (4) designing data structures that can store phylogenetic trees, related data, and visualization directives to ensure analytical reproducibility. Taken together, these methods and software tools provide a concise and unified syntax, which helps researchers discover hidden patterns and propose new hypotheses by integrating heterogeneous data in the context of evolution or hierarchy.

Two monographs have been published to introduce this series of work, including “Data integration, manipulation and visualization of phylogenetic trees” (EN) published by CRC Press and 《R实战:系统发育树的数据集成操作与可视化》 (CN) published by Publishing House of Electronics Industry (电子工业出版社).


1. First-time implementation of the grammar of graphics in visualizing phylogenetic tree data

There are numerous software implementations for phylogenetic tree visualization, yet most of them visualize only the tree’s topological structure and offer little or no support for annotating the tree with external knowledge or data. For the first time, ggtree introduces the grammar of graphics into the visualization of phylogenetic trees and related data. It provides a high level of abstraction for visualization through a simple grammar, which greatly reduces the difficulty of data visualization and makes complex visualization requirements achievable. This work was published in Methods in Ecology and Evolution 2017 (ESI highly cited), and was selected by the journal as one of the ten representative works for its 10th anniversary celebration. An invited protocol paper demonstrating the use of this package was published in Current Protocols in Bioinformatics 2020.

2. Base classes and functions for phylogenetic tree input and output

The outputs of phylogenetic software are mostly in non-standard formats that are not compatible with each other. This restricts integration and comparative analysis in downstream applications. To address this issue, we developed treeio, which parses standard formats as well as a dozen non-standard data formats, and enables integration of external data. It also supports exporting the phylogenetic tree and related data into a single file. Supporting both input and output makes data format conversion possible, so that other software can indirectly work with a wider range of data. Parsing and integrating various types of data enables downstream integrated and comparative analyses, and expands the scope of application of phylogenetic analysis. This work was published in Molecular Biology and Evolution 2020 (ESI highly cited).
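The idea of parsing a tree, attaching external data by tip label, and exporting both as one file can be illustrated with a small, language-agnostic sketch. The Python below is not treeio's API or implementation: the helper names `tip_labels` and `attach` are hypothetical, the Newick handling assumes unquoted labels, and JSON merely stands in for a single-file format.

```python
import json
import re


def tip_labels(newick: str) -> list[str]:
    """Return tip labels of a Newick string in left-to-right order.

    Assumption: labels are unquoted and contain no Newick punctuation.
    A tip label directly follows "(" or ","; internal node labels
    (e.g. support values) follow ")" and are therefore skipped.
    """
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


def attach(newick: str, records: dict[str, dict]) -> dict:
    """Bundle a tree with per-tip data, keyed by tip label.

    Records for labels that are not in the tree are dropped;
    tips without a record get an empty entry.
    """
    tips = tip_labels(newick)
    return {"newick": newick, "data": {tip: records.get(tip, {}) for tip in tips}}


# Illustrative input: a four-tip tree and a partially matching trait table.
newick = "((A:0.1,B:0.2)90:0.05,(C:0.3,D:0.1)75:0.05);"
traits = {"A": {"host": "human"}, "C": {"host": "swine"}, "Z": {"host": "bat"}}

bundle = attach(newick, traits)
serialized = json.dumps(bundle)    # tree and data travel together in one document
restored = json.loads(serialized)  # reading it back recovers both parts
```

Because the tree and its data are written and read as one unit, converting between formats reduces to pairing one parser with one writer.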

3. Propose two general methods for the integration and visualization of phylogenetic data

Two general methods have been proposed and implemented, covering all aspects of the integration and visualization of phylogenetic data. The first method maps data onto the topology of the tree and supports either displaying the data directly or mapping it to visual features. The second method reorganizes external data according to the tree topology, visualizes it in the manner specified by the user, and finally aligns the visualization result with the phylogenetic tree. These two general methods allow heterogeneous data from different disciplines to be interpreted in the context of phylogenetics, which can contribute to the discovery of new patterns or the proposal of new hypotheses. This work was published in Molecular Biology and Evolution 2018 (ESI highly cited). Based on this, the ggtreeExtra package was developed to enhance the integration and visualization capabilities for richly annotated data and was published in Molecular Biology and Evolution 2021.
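The difference between the two methods can be sketched in a few lines. This is a conceptual illustration in Python under stated assumptions, not the packages' actual code: the function names `map_to_tree` and `align_to_tree` are hypothetical, and the tip order is assumed to have been produced already by a tree layout.

```python
# Tip order as drawn, top to bottom (assumed to come from the tree layout).
tip_order = ["A", "B", "C", "D"]

# External data arrives in arbitrary order and may not cover every tip.
host = {"A": "human", "B": "human", "C": "swine", "D": "swine"}
abundance = {"D": 12, "A": 40, "B": 7}
palette = {"human": "steelblue", "swine": "firebrick"}


def map_to_tree(order, variable, scale):
    """Method 1: map a per-tip variable to a visual feature (here, a colour)."""
    return {tip: scale[variable[tip]] for tip in order if tip in variable}


def align_to_tree(order, values, missing=None):
    """Method 2: reorder external data so that row i matches the tip at row i.

    Each result is (row position, tip label, value); tips without data
    keep their row and receive the `missing` placeholder.
    """
    return [(y, tip, values.get(tip, missing)) for y, tip in enumerate(order)]


tip_colours = map_to_tree(tip_order, host, palette)
panel_rows = align_to_tree(tip_order, abundance)
```

In the first case the data becomes a property of the tree drawing itself; in the second it stays a separate panel whose rows share the tree's vertical coordinates.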

4. Enhance data reuse and analytical reproducibility

Phylogenetic trees are usually published as images, so the underlying trees and data cannot be reused, which hampers the integration of phylogenetic knowledge and comparative analysis. Studies have shown that about 60% of published phylogenetic data is permanently lost. To solve this problem, we designed the ggtree object, which encompasses the phylogenetic tree, data, and visualization directives. It can be rendered into an image, while the phylogenetic tree and related data can be extracted from it. In addition, similar to a “format painter”, the visualization directives can be reused to visualize other tree objects. This work was published in iMeta 2022, supporting data reusability and research reproducibility while promoting the integration and comparative analysis of phylogenetic data in the field.

5. Extend support for other tree-like structures

The series of tools for tree data integration and visualization has been extended to other tree-like structures (e.g., hierarchical clustering and classification/regression trees). The ggtreeDendro package was implemented to support general hierarchical structures, and the ecluster package (work in progress) is being developed to support various omics data structures provided in the Bioconductor project. This allows related data at the feature or sample level to be interpreted and integrated according to their hierarchical relationships.


Comments from the academic community

It is embarrassing to me that I just learned about “ggtree” this month, but I am looking forward to incorporating it into several projects in my lab. I am looking forward to using it to visualizing annotation results after doing BLAST or FASTA similarity searches, and for high lighting errors in protein sequences among closely related organisms. I have been searching for a tool like “ggtree” for several years, not appreciating that you had already built it. So my interest in recruiting your paper is somewhat selfish. I would like more help in understanding how to add graphical annotations to the leaves of trees.

– The email (Sep 14, 2019) from Professor William Pearson (ISCB Fellow) inviting Guangchuang Yu to contribute to Current Protocols in Bioinformatics