Guangchuang Yu, Tommy Tsan-Yuk Lam, Huachen Zhu, Yi Guan, Two Methods for Mapping and Visualizing Associated Data on Phylogeny Using Ggtree, Molecular Biology and Evolution, Volume 35, Issue 12, December 2018, Pages 3041–3043, https://doi.org/10.1093/molbev/msy194
Abstract
Ggtree is a comprehensive R package for visualizing and annotating phylogenetic trees with associated data. It can also map and visualize associated external data on phylogenies with two general methods. Method 1 allows external data to be mapped on the tree structure and used as visual characteristics in tree and data visualization. Method 2 plots the data side by side with the tree using different geometric functions after reordering the data based on the tree structure. These two methods integrate data with phylogeny for further exploration and comparison in the evolutionary biology context. Ggtree is available from http://www.bioconductor.org/packages/ggtree.
Introduction
Phylogenetic trees are increasingly used in various biological studies to visualize associated data in an evolutionary context. For instance, influenza virus has a wide host range, diverse and dynamic genotypes, and characteristic transmission behaviors that are mostly associated with the evolution of the virus. Such information can be interpreted in a phylogenetic context to help identify evolutionary patterns. Data integration extends and broadens the applications of phylogenetic trees, especially for comparative studies. Many software packages and web tools have been developed for visualizing trees in various ways, such as TreeDyn, iTOL (Letunic and Bork 2007), and EvolView (He et al. 2016). These tools provide many useful and complementary features. However, the ability to annotate a tree with external data sets is often missing. To overcome this limitation, we developed the ggtree R package. Ggtree (Yu et al. 2017) parses tree data from various file formats, especially model-based statistical inferences from commonly used programs. These parsed data can then be used directly for tree visualization and annotation. The data parsing utilities were moved to the treeio package so that the updated version of ggtree (v ≥ 1.12.0) can focus on visualization, with many new functions including general solutions for mapping and visualizing external data sets on phylogeny. Here, we outline the two approaches most relevant for exploring, comparing, visualizing, and interpreting phylogeny with external associated data sets. The R script to generate figure 1 and several reproducible examples are presented in Supplementary Material online.
Mapping Data to the Tree Structure
One demand of data mapping is to link data, such as phenotypic data (supplementary fig. S9, Supplementary Material online), experimental data, and clinical data, to the tree structure, either to display the data on the tree directly or to use the data as visual characteristics of the tree branches/nodes. Ggtree provides an operator, %<+%, for attaching external data to the ggtree graphic object. Any data frame that contains a column named “node”, or whose first column holds taxa labels, can be integrated using the %<+% operator. The new version of the %<+% operator also allows associated data to be integrated with internal nodes, and the incorporated data can be exported with the tree to a BEAST-compatible NEXUS file (Example 1.1 in Supplementary Material online). Multiple data sets can be attached progressively. Once the data are attached, all the information stored in them serves as numerical/categorical node attributes and can be used directly to visualize the tree, either by scaling the attributes as different colors or line sizes, or by labeling the tree with the original attribute values or with those values parsed as math expressions or silhouette images (supplementary fig. S1, Supplementary Material online).
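As a minimal sketch of the attachment step described above, the following R code joins a small data frame to a tree with %<+% and maps two of its columns to visual properties. The random tree, the data frame trait, and its columns host and titer are hypothetical and serve only as an illustration; the sketch assumes the ape, ggtree, and ggplot2 packages are installed.

```r
library(ape)      # rtree() for a random example tree
library(ggtree)   # ggtree(), %<+%, geom_tippoint(), geom_tiplab()
library(ggplot2)  # aes()

set.seed(42)
tree <- rtree(8)  # hypothetical 8-tip random tree

# Hypothetical external data: the first column holds the taxa labels.
trait <- data.frame(
  label = tree$tip.label,
  host  = rep(c("avian", "swine"), length.out = 8),
  titer = seq(1, 8),
  stringsAsFactors = FALSE
)

# Attach the data frame to the ggtree graphic object.
p <- ggtree(tree) %<+% trait

# The attached columns can now be used as aesthetics in later layers.
p2 <- p +
  geom_tippoint(aes(color = host, size = titer)) +
  geom_tiplab(offset = 0.1)
print(p2)
```

Because the attached columns live in the plot data, further layers can reuse them without passing the data frame again.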
Aligning Graph to the Tree Based on Tree Structure
Evolutionary data are heterogeneous, and many data types cannot be displayed on the tree directly, such as genetic information at a pan-genome scale, multiple sequence alignments, and species abundance distributions. To interpret such data in an evolutionary context, a plot of the data must be reordered according to the tree structure before it can be aligned side by side with the tree. This is challenging because tree structures are not human friendly and reordering data by them requires programming expertise. To address this, ggtree provides the facet_plot function to align graphs to the tree. The facet_plot function internally reorders the input data based on the tree structure and visualizes the data in the specified panel using the given geom function (supplementary table S1, Supplementary Material online). The Abundance panel in figure 1 demonstrates the visualization of species abundance distributions using a density ridgeline plot. The graph is reordered internally based on the tree structure and aligned to the corresponding nodes of the tree presented in the Tree panel.
Fig. 1. Illustration of mapping and visualizing associated data on phylogeny. Species abundance distributions were aligned to the tree and visualized as density ridgelines (Abundance panel). The Phylum information was used to color symbolic points on the tree (Tree panel) and in the Abundance panel.
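The alignment performed by facet_plot can be sketched as follows. This is an illustration under stated assumptions, not the script behind figure 1: the random tree, the data frame abund, and the panel name are hypothetical, and a simple point geom stands in for the density ridgelines. The rows of abund are deliberately shuffled to show that the caller does not need to sort them into tree order.

```r
library(ape)      # rtree()
library(ggtree)   # ggtree(), facet_plot(), geom_tiplab()
library(ggplot2)  # aes(), geom_point()

set.seed(1)
tree <- rtree(10)  # hypothetical 10-tip random tree

# Hypothetical data whose row order differs from the tip order of the tree.
# The first column holds the taxa labels.
abund <- data.frame(
  id    = sample(tree$tip.label),
  value = abs(rnorm(10))
)

p <- ggtree(tree) + geom_tiplab()

# facet_plot() matches rows to tips by the first column and supplies the
# vertical position of each tip, so only the x aesthetic is given here.
p2 <- facet_plot(p, panel = "Abundance", data = abund,
                 geom = geom_point, mapping = aes(x = value))
print(p2)
```

Any geom that accepts the mapped aesthetics could be substituted for geom_point, which is what allows the same call to produce different kinds of data panels.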
Summary
Existing packages for plotting trees with data provide only limited visualization methods and apply only to predefined data types. The two methods introduced here have many unique features: integration of node/edge data with the tree, where the data can be mapped to visual characteristics of the tree or of other data sets (supplementary fig. S1, Supplementary Material online); no restriction on data types or on how the data should be plotted in facet_plot (supplementary table S1, Supplementary Material online); and a modular design that separates tree visualization, data integration, and graph alignment. This modular design distinguishes ggtree from other packages. The tree can be fully annotated with multiple data sets attached by the %<+% operator, and facet_plot can progressively align multiple panels to the tree (supplementary fig. S6, Supplementary Material online) or add multiple geometric layers to visualize one or more data sets on a single panel (supplementary figs. S4 and S9, Supplementary Material online). Only with this design is it possible to plot a fully annotated tree with complex data panels. In addition, ggtree works with tree objects defined in other R packages (supplementary figs. S3, S8, and S9, Supplementary Material online), and the methods introduced here broaden the applications of existing R packages by allowing external data integration. For example, ggtree extends the phyloseq package to plot species abundance distributions with the tree (fig. 1). A comparison with other R packages and a full list of the unique features of ggtree can be found in Supplementary Material online.
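The progressive alignment of several panels mentioned above might look like the sketch below, in which the output of one facet_plot call is passed to the next. The tree, the two data frames d1 and d2, their columns, and the panel names are all hypothetical; the second panel draws horizontal bars with geom_segment and relies on the y position that facet_plot adds to the data for each tip.

```r
library(ape)      # rtree()
library(ggtree)   # ggtree(), facet_plot()
library(ggplot2)  # aes(), geom_point(), geom_segment()

set.seed(2)
tree <- rtree(12)  # hypothetical 12-tip random tree

# Two hypothetical data sets keyed by tip label in the first column.
d1 <- data.frame(id = tree$tip.label, score = rnorm(12))
d2 <- data.frame(id = tree$tip.label, count = abs(rnorm(12)) * 10)

p <- ggtree(tree)

# First panel: one point per tip.
p1 <- facet_plot(p, panel = "Score", data = d1,
                 geom = geom_point, mapping = aes(x = score))

# Second panel, added to the result of the first call: one bar per tip.
p2 <- facet_plot(p1, panel = "Count", data = d2,
                 geom = geom_segment,
                 mapping = aes(x = 0, xend = count, y = y, yend = y))
print(p2)
```

Reusing an existing panel name in a later call would add a layer to that panel instead of creating a new one, which corresponds to the single-panel, multi-layer case described in the text.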
Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.
Acknowledgments
This work was supported by research grants from the National Key Plan for Scientific Research and Development of China (2016YFD0500302; 2017YFE0190800), the Shenzhen Peacock Plan (KQTD201203), and the National Institutes of Health (HHSN272201400006C). We thank Prof. Jian-Rong Yang (Sun Yat-Sen University), Dr Daijiang Li (University of Florida), Dr Guanyang Zhang (University of Florida), and Dr Chenhao Li (Genome Institute of Singapore) for constructive comments on the article.

