Nicolas Bortolussi, Eric Durand, Michael Blum, Olivier François, apTreeshape: statistical analysis of phylogenetic tree shape, Bioinformatics, Volume 22, Issue 3, 1 February 2006, Pages 363–364, https://doi.org/10.1093/bioinformatics/bti798
Abstract
Summary: apTreeshape is an R package dedicated to the simulation and analysis of phylogenetic tree topologies using statistical imbalance measures. It is a companion library of the R package ‘ape’, and provides additional functions for reading, plotting and manipulating phylogenetic trees and for connecting to public phylogenetic tree databases. One strength of the package is that it includes appropriate corrections of classical shape statistics as well as new tests based on the statistical theory of likelihood ratios.
Availability: Author Webpage
Contact: Olivier.Francois@imag.fr
1 INTRODUCTION
The understanding of macroevolutionary processes, such as speciation or extinction, is a major issue in evolutionary biology. It is widely acknowledged that such processes leave their fingerprint on the phylogenetic trees that we reconstruct from extant taxa.
The recent explosion of phylogenetic data has generated a wealth of modern analytical methods that rely on stochastic models of tree structure. These methods fall into two classes: temporal and topological. Temporal methods focus on the estimation of diversification rates (Nee, 2001). Topological methods are based on statistical measures of tree imbalance (Mooers and Heard, 1997; Aldous, 2001). Most of them assume null models of tree structure, among which Yule's process (1924) is the most popular.
In this article, we describe the computer package apTreeshape, which is dedicated to the simulation and analysis of phylogenetic tree topologies using statistical indices. It is programmed in the R language (R Development Core Team, 2005), and complements the library ‘ape’ of Paradis et al. (2004), which essentially covers temporal methods. It also provides additional functions for reading, plotting and manipulating phylogenetic trees, and offers immediate web access to public phylogenetic tree databases, such as TreeBASE and Pandit (Whelan et al., 2003).
Beyond the software facilities for data analysis and graphical display offered by the R language, apTreeshape includes important corrections to classical shape statistics. One strength of the package is that it presents new tests based on the statistical theory of likelihoods, which therefore provide optimal power for testing null models of macroevolution.
2 CONTENTS
The functions contained in apTreeshape can be classified into four categories: basic topological manipulation, web-access, simulation and statistical testing.
The basic objects handled by the package are cladograms, i.e. binary trees for which branch lengths have been ignored. They can be read from files in the Newick/Nexus format or converted from objects of the ‘ape’ package. These objects are stored in a class called ‘treeshape’. Objects of class ‘treeshape’ have a dendrogram-like data structure, and they are plotted using methods for dendrograms. Basic topological manipulations are allowed, such as pruning or cutting from a specified internal node. Pruning returns the ancestral part of a tree, while cutting extracts a subtree rooted at a specific node. Subtrees corresponding to a subset of taxa can be extracted from a whole tree as well.
The package apTreeshape has been designed to perform large-scale studies of tree shape from phylogeny databases. For instance, it contains specific functions for accessing TreeBASE and Pandit through R. As an example, the following instructions download the trees with ID numbers 705, 706 and 709 from Pandit, and convert them into objects of class ‘treeshape’. Basic summaries can be obtained very easily.
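To make the idea of a dendrogram-like structure and of cutting at an internal node concrete, here is a minimal base-R sketch. It assumes an hclust-style merge matrix (one row per internal node, negative entries for tips, positive entries for earlier rows); this layout and the helper names clade_sizes and tips_below are illustrative assumptions, not the internal representation or API of apTreeshape.

```r
# Hypothetical hclust-style encoding of the cladogram ((A,B),(C,D)).
# Row i is internal node i; negative = tip index, positive = earlier row.
merge <- rbind(c(-1, -2),   # node 1 joins tips 1 and 2
               c(-3, -4),   # node 2 joins tips 3 and 4
               c( 1,  2))   # node 3 (root) joins nodes 1 and 2

# Number of tips below each internal node.
clade_sizes <- function(merge) {
  sizes <- numeric(nrow(merge))
  size_of <- function(x) if (x < 0) 1 else sizes[x]
  for (i in seq_len(nrow(merge))) {
    sizes[i] <- size_of(merge[i, 1]) + size_of(merge[i, 2])
  }
  sizes
}

# Tip indices of the subtree rooted at a given internal node,
# i.e. the taxa retained when "cutting" at that node.
tips_below <- function(merge, node) {
  unlist(lapply(merge[node, ], function(x) {
    if (x < 0) -x else tips_below(merge, x)
  }))
}

clade_sizes(merge)    # 2 2 4
tips_below(merge, 1)  # 1 2
```

Cutting at node 1 in this toy tree keeps tips 1 and 2, while the clade sizes are the quantities from which imbalance measures are later computed.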
trees <- dbtrees("pandit", c(705, 706, 709))
summary(trees[[2]]); plot(trees[[2]])
Although apTreeshape deals with fully resolved trees, any phylogeny can be downloaded and converted into a binary tree by resolving polytomies with a random simulation method.
Simulation methods and Monte Carlo estimates of P-values are central to apTreeshape. The function rtreeshape enables sampling trees from the most usual stochastic models of trees: the equal rate Markov (ERM) and proportional to distinguishable arrangements models (PDA). In the ERM each branch has an equal probability of splitting, whereas the PDA model has the property that all trees are equally likely (Mooers and Heard, 1997). Note that the topology of the ERM model is shared by other models such as the Hey, Moran or coalescent models for which branch lengths can be simulated using the R base package without difficulties. In addition, we implemented the biased-speciation model used by Kirkpatrick and Slatkin (1993), and a universal random generator for branching Markov processes. Solving polytomies makes use of one of the ERM, PDA or biased-speciation models locally.
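The Monte Carlo idea can be illustrated with a short base-R sketch. It uses a standard property of the ERM (Yule) model: the root of a tree with n tips splits them into clades of sizes k and n - k, with k uniform on 1, ..., n - 1, and each clade splits again in the same way. The function names colless_erm and mc_pvalue and the numeric inputs are hypothetical; this is a sketch of the principle, not the code used inside apTreeshape.

```r
# Draw the Colless index of one random ERM (Yule) tree with n tips.
# Only clade sizes are tracked, which is enough for this index.
colless_erm <- function(n) {
  if (n < 2) return(0)
  k <- sample.int(n - 1, 1)   # size of the left clade, uniform on 1..(n - 1)
  abs(n - 2 * k) + colless_erm(k) + colless_erm(n - k)
}

# Monte Carlo estimate of P(index >= observed) under the ERM model.
mc_pvalue <- function(observed, n_tips, n_sim = 1000) {
  sims <- replicate(n_sim, colless_erm(n_tips))
  mean(sims >= observed)
}

set.seed(1)
# Hypothetical input: a 20-tip tree with an observed Colless index of 40.
mc_pvalue(observed = 40, n_tips = 20)
```

A small estimated probability would indicate a tree less balanced than expected under the ERM model. The precision of the estimate is governed by n_sim.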
The core of apTreeshape consists of statistical testing procedures for the ERM and PDA null hypotheses. We implemented classical shape measures such as Sackin's and Colless' imbalance measures. We introduced standardized measures with means and variances computed under the ERM and PDA models. The use of standardized measures can reduce size effects when comparing trees of different sizes. The standardizations were computed using recent results on tree structures from theoretical computer science. In addition, we implemented a graphical test described in Aldous (2001), which attempts to fit Beta-splitting processes, a family that contains both the ERM and PDA as special cases.
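For reference, the two classical indices can be written as follows. These are the standard textbook definitions, stated in generic notation that is not taken from the article: n is the number of tips, d_i is the number of edges between tip i and the root, and L_v and R_v are the numbers of tips in the left and right subtrees of internal node v.

```latex
I_S(T) = \sum_{i=1}^{n} d_i ,
\qquad
I_C(T) = \sum_{v \in \mathrm{int}(T)} \lvert L_v - R_v \rvert
```

As a worked example with four tips, the fully unbalanced tree (((A,B),C),D) gives I_C = 2 + 1 + 0 = 3 and I_S = 3 + 3 + 2 + 1 = 9, whereas the balanced tree ((A,B),(C,D)) gives I_C = 0 and I_S = 8. Larger values of either index correspond to less balanced trees.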
3 EXAMPLES
In this section, we illustrate the use of apTreeshape with two examples: the HIV-1 phylogeny and a large-scale study of tree imbalance obtained by screening the Pandit database.
Tests based on Colless' indices are more conservative than tests based on likelihood ratios. This is illustrated by the HIV-1 phylogeny (data from ‘ape’, tree with 193 tips) published in Rambaut et al. (2001). The authors attempted to date the most recent common ancestor of the HIV-1 viruses assuming a coalescent tree, whose topological structure is identical to that of the ERM model. Using a test based on standardized Colless' indices, the ERM hypothesis was not rejected against the alternative of a less balanced tree (Colless index = 992, P-value = 0.1). However, the departure from the ERM model (and hence from the coalescent) is strongly supported by the likelihood ratio test (standardized s = 3.48, P-value = 0.25 × 10⁻⁴). These results were obtained with the following instructions:
colless.test(tree <- hivtree.treeshape, alternative = "greater")
likelihood.test(tree, model = "yule", alternative = "greater")
The next script connects to Pandit via the Internet, and downloads resolved trees with ID numbers in the range 100–300. Then the histogram of shape statistics s is plotted using the PDA normalization.
trees <- dbtrees(db = "pandit", 100:300, quiet = TRUE)
s.statistic <- sapply(trees, FUN = shape.statistic, norm = "pda")
hist(s.statistic, prob = TRUE)
The results are displayed in Figure 1. We obtain a clear departure from the PDA model. Nevertheless, the empirical distribution of the indices is bell-shaped [shifted to the left of the standard N(0,1)], with a standard deviation (SD = 1.34) close to the value predicted by the PDA model (SD = 1).
Figure 1. Histogram of shape statistics s obtained after PDA standardization (196 trees collected from Pandit). The histogram displays a departure from the PDA model [shift to the left from the standard N(0,1)].
4 CONCLUSION
The R programming language has proved to be a powerful tool for bioinformatics. We contributed to R in order to improve the analysis of phylogenetic data. The package apTreeshape integrates recent developments in the statistical theory of imbalance measures, which warrant the optimality of some testing procedures. This package competes with another program called SymmeTREE (Chan and Moore, 2005), which covers the same range of applications (temporal and topological analyses of trees). In this comparison, apTreeshape benefits from the extensive power of R for performing all types of data analysis, and from its facilities for connecting to public databases. This should make the resource attractive to R users.
Author notes
Associate Editor: Keith A Crandall