7 Plotting tree with data

Integrating user data to annotate a phylogenetic tree can be done at different levels. The treeio package (Wang et al., 2020) implements full_join() methods to combine tree data to phylogenetic tree object. The tidytree package supports linking tree data to phylogeny using tidyverse verbs (see also Chapter 2). The ggtree package (Yu et al., 2018) supports mapping external data to phylogeny for visualization and annotation on the fly. Although the feature of linking external data is overlapping among these packages, they have different application scopes. For example, in addition to the treedata object, ggtree also supports several other tree objects (see Chapter 9), including phylo4d, phyloseq, and obkData that were designed to contain domain-specific data. The design of these objects did not consider supporting linking external data to the object (it can not be done at the tree object level). We can visualize trees from these objects using ggtree and link external data at the visualization level (Yu et al., 2018).

The ggtree package provides two general methods for mapping and visualizing associated external data on phylogenies. Method 1 allows external data to be mapped on the tree structure and used as visual characteristics in the tree and data visualization. Method 2 plots the data with the tree side-by-side using different geometric functions after reordering the data based on the tree structure. These two methods integrate data with phylogeny for further exploration and comparison in the evolutionary biology context. The ggtreeExtra provides a better implementation of the Method 2 proposed in ggtree (see also Chapter 10) and works with both rectangular and circular layouts (Xu, Dai, et al., 2021).

7.1 Mapping Data to The tree Structure

In ggtree, we implemented an operator, %<+%, to attach annotation data to a ggtree graphic object. Any data that contains a column of “node” or the first column of taxa labels can be integrated using the %<+% operator. Multiple datasets can be attached progressively. When the data are attached, all the information stored in the data serves as numerical/categorical node attributes and can be directly used to visualize the tree by scaling the attributes as different colors or line sizes, labeling the tree using the original values of the attributes or parsing them as math expression, emoji or silhouette image. The following example uses the %<+% operator to integrate taxon (df_tip_data) and internal node (df_inode_data) information and map the data to different colors or shapes of symbolic points and labels (Figure 7.1). The tip data contains imageURL that links to online figures of the species, which can be parsed and used as tip labels in ggtree (see Chapter 8).

library(ggimage)
library(ggtree)
library(TDbook)

# load `tree_boots`, `df_tip_data`, and `df_inode_data` from 'TDbook'
p <- ggtree(tree_boots) %<+% df_tip_data + xlim(-.1, 4)
p2 <- p + geom_tiplab(offset = .6, hjust = .5) +
    geom_tippoint(aes(shape = trophic_habit, color = trophic_habit, 
                size = mass_in_kg)) + 
    theme(legend.position = "right") + 
    scale_size_continuous(range = c(3, 10))

p2 %<+% df_inode_data + 
    geom_label(aes(label = vernacularName.y, fill = posterior)) + 
    scale_fill_gradientn(colors = RColorBrewer::brewer.pal(3, "YlGnBu"))
Example of attaching multiple datasets. External datasets including tip data (e.g., trophic habit and body weight) and node data (e.g., clade posterior and vernacular name) were attached to the ggtree graphic via the %<+% operator and the data was used to annotate the tree.

FIGURE 7.1: Example of attaching multiple datasets. External datasets including tip data (e.g., trophic habit and body weight) and node data (e.g., clade posterior and vernacular name) were attached to the ggtree graphic via the %<+% operator and the data was used to annotate the tree.

Although the data integrated by the %<+% operator in ggtree is for tree visualization, the data attached to the ggtree graphic object can be converted to treedata object that contains the tree and the attached data (see session 7.5).

7.2 Aligning Graph to the Tree Based on the Tree Structure

For associating phylogenetic tree with different types of plot produced by user’s data, ggtree provides geom_facet() layer and facet_plot() function which accept an input data.frame and a geom layer to draw the input data. The data will be displayed in an additional panel of the plot. The geom_facet() (or facet_plot) is a general solution for linking the graphic layer to a tree. The function internally re-orders the input data based on the tree structure and visualizes the data at the specific panel by the geometric layer. Users are free to visualize several panels to plot different types of data as demonstrated in Figure 9.4 and to use different geometric layers to plot the same dataset (Figure 13.1) or different datasets on the same panel.

The geom_facet() is designed to work with most of the geom layers defined in ggplot2 and other ggplot2-based packages. A list of the geometric layers that work seamlessly with geom_facet() and facet_plot() can be found in Table C.1. As the ggplot2 community keeps expanding and more geom layers will be implemented in either ggplot2 or other extensions, geom_facet() and facet_plot() will gain more power to present data in the future. Note that different geom layers can be combined to present data on the same panel and the combinations of different geom layers create the possibility to present more complex data with phylogeny (see also Figures 13.1 and 13.4). Users can progressively add multiple panels to present and compare different datasets in the evolutionary context (Figure 7.2). Detailed descriptions can be found in the supplemental file of (Yu et al., 2018).

library(ggtree)
library(TDbook)

## load `tree_nwk`, `df_info`, `df_alleles`, and `df_bar_data` from 'TDbook'
tree <- tree_nwk
snps <- df_alleles
snps_strainCols <- snps[1,] 
snps<-snps[-1,] # drop strain names
colnames(snps) <- snps_strainCols

gapChar <- "?"
snp <- t(snps)
lsnp <- apply(snp, 1, function(x) {
        x != snp[1,] & x != gapChar & snp[1,] != gapChar
    })
lsnp <- as.data.frame(lsnp)
lsnp$pos <- as.numeric(rownames(lsnp))
lsnp <- tidyr::gather(lsnp, name, value, -pos)
snp_data <- lsnp[lsnp$value, c("name", "pos")]

## visualize the tree 
p <- ggtree(tree) 

## attach the sampling information data set 
## and add symbols colored by location
p <- p %<+% df_info + geom_tippoint(aes(color=location))

## visualize SNP and Trait data using dot and bar charts,
## and align them based on tree structure
p + geom_facet(panel = "SNP", data = snp_data, geom = geom_point, 
               mapping=aes(x = pos, color = location), shape = '|') +
    geom_facet(panel = "Trait", data = df_bar_data, geom = geom_col, 
                aes(x = dummy_bar_value, color = location, 
                fill = location), orientation = 'y', width = .6) +
    theme_tree2(legend.position=c(.05, .85))
Example of plotting SNP and trait data. The ‘location’ information was attached to the tree and used to color tip symbols (Tree panel), and other datasets. SNP and Trait data were visualized as dot chart (SNP panel) and bar chart (Trait panel).

FIGURE 7.2: Example of plotting SNP and trait data. The ‘location’ information was attached to the tree and used to color tip symbols (Tree panel), and other datasets. SNP and Trait data were visualized as dot chart (SNP panel) and bar chart (Trait panel).

Companion functions to adjust panel widths and rename panel names are described in session 12.1. Removing the panel name is also possible and an example was presented in Figure 13.4. We can also use aplot or patchwork to create composite plots as described in session 7.5.

The geom_facet() (or facet_plot()) internally used ggplot2::facet_grid() and only works with Cartesian coordinate system. To align the graph to the tree for the polar system (e.g., for circular or fan layouts), we developed another Bioconductor package, ggtreeExtra. The ggtreeExtra package provides the geom_fruit() layer that works similar to geom_facet() (details described in Chapter 10). The geom_fruit() is a better implementation of the Method 2 proposed in (Yu et al., 2018).

7.3 Visualize a Tree with an Associated Matrix

The gheatmap() function is designed to visualize the phylogenetic tree with a heatmap of an associated matrix (either numerical or categorical). The geom_facet() layer is a general solution for plotting data with the tree, including heatmap. The gheatmap() function is specifically designed for plotting heatmap with a tree and provides a shortcut for handling column labels and color palettes. Another difference is that geom_facet() only supports rectangular and slanted tree layouts, while gheatmap() supports rectangular, slanted, and circular (Figure 7.4) layouts.

In the following example, we visualized a tree of H3 influenza viruses with their associated genotypes (Figure 7.3A).

beast_file <- system.file("examples/MCC_FluA_H3.tree", package="ggtree")
beast_tree <- read.beast(beast_file)

genotype_file <- system.file("examples/Genotype.txt", package="ggtree")
genotype <- read.table(genotype_file, sep="\t", stringsAsFactor=F)
colnames(genotype) <- sub("\\.$", "", colnames(genotype))
p <- ggtree(beast_tree, mrsd="2013-01-01") + 
    geom_treescale(x=2008, y=1, offset=2) + 
    geom_tiplab(size=2)
gheatmap(p, genotype, offset=5, width=0.5, font.size=3, 
        colnames_angle=-45, hjust=0) +
    scale_fill_manual(breaks=c("HuH3N2", "pdm", "trig"), 
        values=c("steelblue", "firebrick", "darkgreen"), name="genotype")

The width parameter is to control the width of the heatmap. It supports another parameter offset for controlling the distance between the tree and the heatmap, such as allocating space for tip labels.

For a timescaled tree, as in this example, it’s more common to use x-axis by using theme_tree2. But with this solution, the heatmap is just another layer and will change the x-axis. To overcome this issue, we implemented scale_x_ggtree() to set the x-axis more reasonably (Figure 7.3B).

p <- ggtree(beast_tree, mrsd="2013-01-01") + 
    geom_tiplab(size=2, align=TRUE, linesize=.5) + 
    theme_tree2()
gheatmap(p, genotype, offset=8, width=0.6, 
        colnames=FALSE, legend_title="genotype") +
    scale_x_ggtree() + 
    scale_y_continuous(expand=c(0, 0.3))
Example of plotting matrix with gheatmap(). A H3 influenza tree with a genotype table visualized as a heatmap (A). Tips were aligned and with a tailored x-axis for divergence times (tree) and genomic segments (heatmap) (B).

FIGURE 7.3: Example of plotting matrix with gheatmap(). A H3 influenza tree with a genotype table visualized as a heatmap (A). Tips were aligned and with a tailored x-axis for divergence times (tree) and genomic segments (heatmap) (B).

7.3.1 Visualize a tree with multiple associated matrices

Of course, we can use multiple gheatmap() function calls to align several associated matrices with the tree. However, ggplot2 doesn’t allow us to use multiple fill scales11.

To solve this issue, we can use the ggnewscale package to create new fill scales. Here is an example of using ggnewscale with gheatmap().

nwk <- system.file("extdata", "sample.nwk", package="treeio")

tree <- read.tree(nwk)
circ <- ggtree(tree, layout = "circular")

df <- data.frame(first=c("a", "b", "a", "c", "d", "d", "a", 
                        "b", "e", "e", "f", "c", "f"),
                 second= c("z", "z", "z", "z", "y", "y", 
                        "y", "y", "x", "x", "x", "a", "a"))
rownames(df) <- tree$tip.label

df2 <- as.data.frame(matrix(rnorm(39), ncol=3))
rownames(df2) <- tree$tip.label
colnames(df2) <- LETTERS[1:3]


p1 <- gheatmap(circ, df, offset=.8, width=.2,
               colnames_angle=95, colnames_offset_y = .25) +
    scale_fill_viridis_d(option="D", name="discrete\nvalue")


library(ggnewscale)
p2 <- p1 + new_scale_fill()
gheatmap(p2, df2, offset=15, width=.3,
         colnames_angle=90, colnames_offset_y = .25) +
    scale_fill_viridis_c(option="A", name="continuous\nvalue")
Example of plotting matrix with gheatmap(). A H3 influenza tree with a genotype table visualized as a heatmap (A). Tips were aligned and with a tailored x-axis for divergence times (tree) and genomic segments (heatmap) (B).

FIGURE 7.4: Example of plotting matrix with gheatmap(). A H3 influenza tree with a genotype table visualized as a heatmap (A). Tips were aligned and with a tailored x-axis for divergence times (tree) and genomic segments (heatmap) (B).

7.4 Visualize a Tree with Multiple Sequence Alignments

The msaplot() accepts a tree (output of ggtree()) and a fasta file, then it can visualize the tree with sequence alignment. We can specify the width (relative to the tree) of the alignment and adjust the relative position by offset, which is similar to the gheatmap() function (Figure 7.5A).

library(TDbook)

# load `tree_seq_nwk` and `AA_sequence` from 'TDbook'
p <- ggtree(tree_seq_nwk) + geom_tiplab(size=3)
msaplot(p, AA_sequence, offset=3, width=2)

A specific slice of the alignment can also be displayed by specifying the window parameter (Figure 7.5B)..

p <- ggtree(tree_seq_nwk, layout='circular') + 
    geom_tiplab(offset=4, align=TRUE) + xlim(NA, 12)
msaplot(p, AA_sequence, window=c(120, 200))
Example of plotting multiple sequence alignments with a tree. Whole MSA sequences were visualized with a tree in rectangular layout (A). Circular layout with a slice of alignment window (B).

FIGURE 7.5: Example of plotting multiple sequence alignments with a tree. Whole MSA sequences were visualized with a tree in rectangular layout (A). Circular layout with a slice of alignment window (B).

To better support visualizing multiple sequence alignments with a tree and other associated data, we developed the ggmsa package with the ability to label the sequences and color the sequences with different color schemes (Yu, 2020). The ggmsa() output is compatible with geom_facet() and ggtreeExtra::geom_fruit() and can be used to visualize a tree, multiple sequence alignments, and different types of associated data to explore their underlying linkages/associations.

7.5 Composite Plots

In addition to aligning graphs to a tree using geom_facet() or ggtreeExtra::geom_fruit() and special cases using the gheatmap() and msaplot() functions, users can use cowplot, patchwork, gtable12 or other packages to create composite plots. However, extra efforts need to be done to make sure all the plots are aligned properly. The ggtree::get_taxa_name() function is quite useful for users to re-order their data based on the tree structure. To remove this obstacle, we created an R package aplot that can re-order the internal data of a ggplot object and create composite plots that align properly with a tree.

In the following example, we have a tree with two associated datasets.

library(ggplot2)
library(ggtree)

set.seed(2019-10-31)
tr <- rtree(10)

d1 <- data.frame(
    # only some labels match
    label = c(tr$tip.label[sample(5, 5)], "A"),
    value = sample(1:6, 6))

d2 <- data.frame(
    label = rep(tr$tip.label, 5),
    category = rep(LETTERS[1:5], each=10),
    value = rnorm(50, 0, 3)) 

g <- ggtree(tr) + geom_tiplab(align=TRUE) + hexpand(.01)

p1 <- ggplot(d1, aes(label, value)) + geom_col(aes(fill=label)) + 
    geom_text(aes(label=label, y= value+.1)) +
    coord_flip() + theme_tree2() + theme(legend.position='none')
 
p2 <- ggplot(d2, aes(x=category, y=label)) + 
    geom_tile(aes(fill=value)) + scale_fill_viridis_c() + 
    theme_minimal() + xlab(NULL) + ylab(NULL)

If we align them using cowplot, the composite plots are not aligned properly as we anticipated (Figure 7.6A).

cowplot::plot_grid(g, p2, p1, ncol=3) 

Using aplot, it will do all the dirty work for us and all the subplots are aligned properly as demonstrated in Figure 7.6B.

Example of aligning tree with data side-by-side to create composite plot. cowplot` just places the subplots together (A), while aplot does extra work to make sure that tree-associated subplots are properly ordered according to the tree structure (B). Note: The ‘A’ category in the bar plot that is not matched with the tree was removed.

FIGURE 7.6: Example of aligning tree with data side-by-side to create composite plot. cowplot` just places the subplots together (A), while aplot does extra work to make sure that tree-associated subplots are properly ordered according to the tree structure (B). Note: The ‘A’ category in the bar plot that is not matched with the tree was removed.

7.6 Summary

Although there are many software packages that support visualizing phylogenetic trees, plotting a tree with data is often missing or with only limited support. Some of the packages define S4 classes to store phylogenetic tree with domain-specific data, such as OutbreakTools (Jombart et al., 2014) defined obkData for storing tree with epidemiology data and phyloseq (McMurdie & Holmes, 2013) defines phyloseq for storing tree with microbiome data. These packages are capable of presenting some of the data stored in the object on the tree. However, not all the associated data are supported. For example, species abundance stored in the phyloseq object is not supported to be visualized using the phyloseq package. These packages did not provide any utilities to integrate external data for tree visualization. None of these packages support visualizing external data and aligning the plot to a tree based on the tree structure.

The ggtree package provides two general solutions for integrating data. Method 1, the %<+% operator, can integrate external and internal node data and map the data as a visual characteristic to visualize the tree and other datasets used in geom_facet() or ggtreeExtra::geom_fruit(). Method 2, the geom_facet layer or ggtreeExtra::geom_fruit(), has no restriction of input data as long as there is a geom function available to plot the data (e.g., species abundance displayed by geom_density_ridges as demonstrated in Figure 9.4). Users are free to combine different panels and combine different geom layers in the same panel (Figure 13.1).

The ggtree package has many unique features that cannot be found in other implementations (Yu et al., 2018):

  1. Integrating node/edge data to the tree can be mapped to visual characteristics of the tree or other datasets (Figure 7.1).
  2. Capable of parsing expressions (math symbols or text formatting), emoji, and image files (Chapter 8).
  3. No pre-definition of input data types or how the data should be plotted in geom_facet() (Table C.1).
  4. Combining different geom functions to visualize associated data is supported (Figure 13.1).
  5. Visualizing different datasets on the same panel is supported.
  6. Data integrated by %<+% can be used in geom_facet() layer.
  7. Able to add further annotations to specific layers.
  8. Modular design by separating tree visualization, data integration (Method 1), and graph alignment (Method 2).

Modular design is a unique feature for ggtree to stand out from other packages. The tree can be visualized with data stored in the tree object or external data linked by the %<+% operator, and fully annotated with multiple layers of annotations (Figures 7.1 and 13.1), before passing it to geom_facet() layer. The geom_facet() layer can be called progressively to add multiple panels or multiple layers on the same panels (Figure 13.1). This creates the possibility of plotting a full annotated tree with complex data panels that contain multiple graphic layers.

The ggtree package fits the R ecosystem and extends the abilities to integrate and present data with trees to existing phylogenetic packages. As demonstrated in Figure 9.4, we can plot species abundance distributions with the phyloseq object. This cannot be easily done without ggtree. With ggtree, we are able to attach additional data to tree objects using the %<+% operator and align graphs to a tree using the geom_facet() layer. Integrating ggtree into existing workflows will extend the abilities and broaden the applications to present phylogeny-associated data, especially for comparative studies.