Understanding macroevolutionary processes, such as speciation and extinction, is a major goal of evolutionary biology. It is widely acknowledged that such processes leave their fingerprint on the phylogenetic trees that we reconstruct from extant taxa.
The recent explosion of phylogenetic data has prompted the development of many analytical methods that rely on stochastic models of tree structure. These methods fall into two classes: temporal and topological. Temporal methods focus on the estimation of diversification rates (Nee, 2001). Topological methods are based on statistical measures of tree imbalance (Mooers and Heard, 1997; Aldous, 2001). Most of them assume a null model of tree structure, the most popular being the Yule process (Yule, 1924).
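As a rough illustration of a topological method, the sketch below grows a tree shape under the Yule process, in which every tip is equally likely to split next, and computes the Colless index, i.e. the sum over internal nodes of the absolute difference between the numbers of tips in the two daughter clades. It is written in Python purely for illustration; the function names `simulate_erm` and `colless` and the nested-list tree representation are assumptions of this sketch, not functions of the package.

```python
import random


def simulate_erm(n_tips, rng):
    """Grow a tree shape by repeatedly splitting a uniformly chosen tip.

    A tip is an empty list; an internal node is a list [left, right].
    (Hypothetical representation chosen for this sketch.)
    """
    root = []
    tips = [root]
    while len(tips) < n_tips:
        i = rng.randrange(len(tips))   # every tip is equally likely to split
        node = tips[i]
        left, right = [], []
        node.extend([left, right])     # the chosen tip becomes an internal node
        tips[i] = left                 # its two daughters replace it as tips
        tips.append(right)
    return root


def colless(tree):
    """Return (number of tips, Colless index) for a nested-list tree."""
    if not tree:                       # empty list: a single tip
        return 1, 0
    n_left, c_left = colless(tree[0])
    n_right, c_right = colless(tree[1])
    # Each internal node contributes |n_left - n_right| to the index.
    return n_left + n_right, c_left + c_right + abs(n_left - n_right)


if __name__ == "__main__":
    rng = random.Random(1)
    tree = simulate_erm(20, rng)
    n, index = colless(tree)
    print(f"tips = {n}, Colless index = {index}")
```

Repeating the simulation many times for a given number of tips gives an empirical null distribution of the index, against which an observed value can be compared.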
In this article, we describe the computer package
Beyond the software facilities for data analysis and graphical display offered by the R language,
The functions contained in
The basic objects handled by the package are cladograms, i.e. binary trees for which branch lengths have been ignored. They can be read from files in the Newick/Nexus format or converted from objects of the ‘ape’ package. These objects are stored in a class called ‘treeshape’. Objects of class ‘treeshape’ have a dendrogram-like data structure, and they are plotted using methods for dendrograms. Basic topological manipulations are allowed, such as pruning or cutting from a specified internal node. Pruning returns the ancestral part of a tree, while cutting extracts a subtree rooted at a specific node. Subtrees corresponding to a subset of taxa can be extracted from a whole tree as well.
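The manipulations described above can be sketched in a few lines. The following Python example is a minimal illustration, not the package's implementation: it reads a binary tree from a Newick string while discarding branch lengths, then cuts and prunes at an internal node. The function names (`parse_newick`, `cut`, `prune`), the nested-tuple representation and the addressing of nodes by a path of 0/1 choices from the root are all assumptions made for this sketch; Nexus files are not handled.

```python
def parse_newick(text):
    """Parse a binary Newick string into nested tuples, dropping branch lengths."""
    text = text.strip().rstrip(";")
    pos = 0

    def read_label():
        # Read an optional label, then skip an optional ":length" suffix.
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        label = text[start:pos]
        if pos < len(text) and text[pos] == ":":
            pos += 1
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
        return label

    def parse_node():
        nonlocal pos
        if text[pos] != "(":
            return read_label()            # a tip: keep only its name
        pos += 1                           # skip "("
        children = [parse_node()]
        while text[pos] == ",":
            pos += 1
            children.append(parse_node())
        pos += 1                           # skip ")"
        read_label()                       # ignore internal label and length
        if len(children) != 2:
            raise ValueError("only binary trees are supported")
        return tuple(children)

    return parse_node()


def cut(tree, path):
    """Return the subtree rooted at the node reached by following `path`."""
    for side in path:                      # 0 = first child, 1 = second child
        tree = tree[side]
    return tree


def prune(tree, path, tip="pruned"):
    """Return the ancestral part: the clade at `path` is collapsed to one tip."""
    if not path:
        return tip
    children = list(tree)
    children[path[0]] = prune(tree[path[0]], path[1:], tip)
    return tuple(children)


if __name__ == "__main__":
    shape = parse_newick("((A:0.1,B:0.2):0.3,(C:1,D:2):0.5);")
    print(shape)
    print(cut(shape, [0]))
    print(prune(shape, [0]))
```

Collapsing the pruned clade into a single placeholder tip keeps the ancestral part a valid binary tree, which is one possible design choice among several.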
Simulation methods and Monte Carlo estimates of P-values are central to
The core of
As an improvement over the existing literature on tree balance, we used the theory of likelihood ratios in order to provide a test statistic with maximal power for rejecting the ERM against the PDA model. The shape statistic can be computed as
In this section, we illustrate the use of
Tests based on Colless' indices are more conservative than tests based on likelihood ratios. This is illustrated by the HIV-1 phylogeny (data from ‘ape’; tree with 193 tips) published in Rambaut et al. (2001). The authors attempted to date the most recent common ancestor of the HIV-1 viruses assuming a coalescent tree whose topological structure is identical to the ERM model. Using a test based on standardized Colless’ indices, the ERM hypothesis was not rejected against the alternative that the tree was less balanced (Colless index = 992, P-value = 0.1). However, the departure from the ERM model (and hence from the coalescent) is strongly supported by the likelihood ratio test (standardized s = 3.48, P-value = 0.25 × 10⁻⁴). These results were obtained thanks to the following instructions:
The next script connects to Pandit via the Internet, and downloads resolved trees with ID numbers in the range 100–300. Then the histogram of shape statistics s is plotted using the PDA normalization.
The results are displayed in Figure 1. We obtain a clear departure from the PDA model. Nevertheless, the empirical distribution of the indices is bell-shaped [shifted to the left of the standard N(0,1)], with a standard deviation (SD = 1.34) close to the value predicted by the PDA model (SD = 1).
The R programming language has proved to be a powerful tool for bioinformatics. We contributed to R in order to improve the analysis of phylogenetic data. The package