## Abstract

Summary: apTreeshape is an R package dedicated to the simulation and analysis of phylogenetic tree topologies using statistical imbalance measures. It is a companion library of the R package ‘ape’ and provides additional functions for reading, plotting and manipulating phylogenetic trees and for connecting to public phylogenetic tree databases. One strength of the package is that it includes appropriate corrections of classical shape statistics as well as new tests based on the statistical theory of likelihood ratios.

Availability:

Contact: Olivier.Francois@imag.fr

## 1 INTRODUCTION

The understanding of macroevolutionary processes, such as speciation or extinction, is a major issue in evolutionary biology. It is widely acknowledged that such processes leave their fingerprint on the phylogenetic trees that we reconstruct from extant taxa.

The recent explosion of phylogenetic data has generated a wealth of modern analytical methods that rely on stochastic models of tree structure. These methods fall into two classes: temporal and topological. Temporal methods focus on the estimation of diversification rates (Nee, 2001). Topological methods are based on statistical measures of tree imbalance (Mooers and Heard, 1997; Aldous, 2001). Most of them assume null models of tree structure, among which Yule's process (1924) is the most popular.

apTreeshape is a package dedicated to the simulation and analysis of phylogenetic tree topologies using statistical indices. It is programmed in the R language (R Development Core Team, 2005), and complements the library ‘ape’ of Paradis et al. (2004), which essentially covers temporal methods. It also provides additional functions for reading, plotting and manipulating phylogenetic trees, and offers immediate web access to public phylogenetic tree databases such as TreeBASE and Pandit (Whelan et al., 2003).

Beyond the software facilities for data analysis and graphical display offered by the R language, apTreeshape includes important corrections to classical shape statistics. One strength of the package is that it presents new tests based on the statistical theory of likelihoods, and therefore provides optimal power for testing null models of macroevolution.

## 2 CONTENTS

The functions contained in apTreeshape can be classified into four categories: basic topological manipulation, web access, simulation and statistical testing.

The basic objects handled by the package are cladograms, i.e. binary trees for which branch lengths have been ignored. They can be read from files in the Newick/Nexus format or converted from objects of the ‘ape’ package. These objects are stored in a class called ‘treeshape’. Objects of class ‘treeshape’ have a dendrogram-like data structure, and they are plotted using methods for dendrograms. Basic topological manipulations are allowed, such as pruning or cutting from a specified internal node. Pruning returns the ancestral part of a tree, while cutting extracts a subtree rooted at a specific node. Subtrees corresponding to a subset of taxa can be extracted from a whole tree as well.

The package apTreeshape has been designed to perform large-scale studies of tree shape from phylogeny databases. For instance, it contains specific functions for accessing TreeBASE and Pandit through R. As an example, the following instructions download the trees with ID numbers 705, 706 and 709 from Pandit, and convert them into objects of class ‘treeshape’. Basic summaries can then be obtained very easily.

trees<-dbtrees("pandit", c(705,706,709))

summary(trees[[2]]); plot(trees[[2]])

Although apTreeshape deals with fully resolved trees, any phylogeny can be downloaded and converted into a binary tree by resolving polytomies with a random simulation method.

Simulation methods and Monte Carlo estimates of P-values are central to apTreeshape. The function rtreeshape enables sampling trees from the most usual stochastic models of trees: the equal rate Markov (ERM) model and the proportional to distinguishable arrangements (PDA) model. In the ERM model each branch has an equal probability of splitting, whereas the PDA model has the property that all trees are equally likely (Mooers and Heard, 1997). Note that the topology of the ERM model is shared by other models, such as the Hey, Moran or coalescent models, for which branch lengths can be simulated without difficulty using the R base package. In addition, we implemented the biased-speciation model used by Kirkpatrick and Slatkin (1993), and a universal random generator for branching Markov processes. Polytomies are resolved locally using one of the ERM, PDA or biased-speciation models.
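As an illustration of ERM sampling, the following base R sketch draws the topology of an ERM tree as a list of splits. It is not the implementation of rtreeshape; the function name `sim_erm_splits` and the two-column split representation are assumptions made for this example. It relies on a standard property of the ERM (Yule) model: in a clade with m tips, the size of the left subclade is uniform on 1, ..., m-1, and the two subclades then evolve independently.

```r
# Hypothetical helper (not part of apTreeshape): simulate the splits of an
# ERM (Yule) tree with n tips. Each row gives the sizes of the two daughter
# clades of one internal node; a tree with n tips has n - 1 internal nodes.
sim_erm_splits <- function(n) {
  stopifnot(n >= 2)
  splits <- matrix(0, nrow = n - 1, ncol = 2,
                   dimnames = list(NULL, c("left", "right")))
  stack <- n   # sizes of clades that still have to be split
  k <- 0
  while (length(stack) > 0) {
    m <- stack[length(stack)]          # pop the last clade
    stack <- stack[-length(stack)]
    if (m < 2) next                    # a single tip is not split further
    left <- sample.int(m - 1, 1)       # uniform on 1, ..., m - 1
    k <- k + 1
    splits[k, ] <- c(left, m - left)
    stack <- c(stack, left, m - left)
  }
  splits
}

set.seed(1)
splits <- sim_erm_splits(10)
print(splits)
```

The split sizes are enough to compute the imbalance measures discussed in this note, which is why the sketch does not build a full tree object.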

The core of apTreeshape consists of statistical testing procedures for the ERM and PDA null hypotheses. We implemented classical shape measures such as Sackin's and Colless' imbalance measures. We introduced standardized measures with means and variances computed under the ERM and PDA models. The use of standardized measures can reduce size effects when comparing trees of different sizes. The standardizations were computed using recent results on tree structures from theoretical computer science. In addition, we implemented a graphical test described in Aldous (2001), which attempts to fit Beta-splitting processes, a family that contains both the ERM and PDA models as special cases.
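For readers unfamiliar with these measures, here is a minimal base R sketch of the unstandardized Colless and Sackin indices. It assumes a tree is summarized by a two-column matrix giving, for each internal node, the sizes of its two daughter clades; this representation and the function names are assumptions of the example, not the package's data structure. The Colless index sums the absolute differences between daughter clade sizes, and the Sackin index (the sum of tip depths) equals the sum of clade sizes over internal nodes.

```r
# Hypothetical helpers (not the apTreeshape functions).
# 'splits' has one row per internal node: sizes of the two daughter clades.
colless_index <- function(splits) sum(abs(splits[, 1] - splits[, 2]))
sackin_index  <- function(splits) sum(splits[, 1] + splits[, 2])

# Two trees with 4 tips: the fully unbalanced (caterpillar) tree
# and the fully balanced tree.
caterpillar <- rbind(c(1, 3), c(1, 2), c(1, 1))
balanced    <- rbind(c(2, 2), c(1, 1), c(1, 1))

colless_index(caterpillar)  # 3
colless_index(balanced)     # 0
sackin_index(caterpillar)   # 9
sackin_index(balanced)      # 8
```

A standardized measure subtracts the mean of the index under the null model and divides by its standard deviation, so that trees of different sizes can be compared on a common scale.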

As an improvement over the existing literature on tree balance, we used the theory of likelihood ratios in order to provide a test statistic with maximal power for rejecting the ERM against the PDA model. The shape statistic can be computed as

$s=\sum_{i=1}^{n-1}\log(N_i-1), \qquad (1)$

where $n$ is the number of taxa, and $N_i$ is the size of the clade that descends from the $i$-th ancestor in the tree. Mathematical formulae for likelihoods were found in Semple and Steel (2003), and asymptotic properties of $s$ have been established earlier by Fill (1996).
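Equation (1) can be evaluated directly from the clade sizes at the internal nodes. The base R sketch below is only an illustration; the function name `shape_stat` and the vector of clade sizes are assumptions of the example, and the package computes the statistic from ‘treeshape’ objects instead.

```r
# Hypothetical helper: shape statistic of Equation (1).
# 'clade_sizes' holds N_i for each of the n - 1 internal nodes.
shape_stat <- function(clade_sizes) {
  stopifnot(all(clade_sizes >= 2))
  sum(log(clade_sizes - 1))
}

# Trees with n = 4 tips (3 internal nodes each)
s_caterpillar <- shape_stat(c(4, 3, 2))  # log(3) + log(2) + log(1) = log(6)
s_balanced    <- shape_stat(c(4, 2, 2))  # log(3) + log(1) + log(1) = log(3)
c(s_caterpillar, s_balanced)
```

The unbalanced tree yields the larger value, in line with the use of large values of the statistic as evidence of imbalance.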

## 3 EXAMPLES

In this section, we illustrate the use of apTreeshape with two examples: the HIV-1 phylogeny and a large-scale study of tree imbalance obtained from the screening of the Pandit database.

Tests based on Colless' indices are more conservative than tests based on likelihood ratios. This is illustrated by the HIV-1 phylogeny (data from ‘ape’, tree with 193 tips) published in Rambaut et al. (2001). The authors attempted to date the most recent common ancestor of the HIV-1 viruses assuming a coalescent tree, whose topological structure is identical to that of the ERM model. Using a test based on standardized Colless' indices, the ERM hypothesis was not rejected in favour of a less balanced tree (Colless index = 992, P-value = 0.1). However, the departure from the ERM model (and hence from the coalescent) is strongly supported by the likelihood ratio test (standardized s = 3.48, P-value = $0.25 \times 10^{-4}$). These results were obtained with the following instructions:

colless.test(tree<-hivtree.treeshape, alternative="greater")

likelihood.test(tree, model="yule", alternative="greater")
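The logic of a one-sided Monte Carlo test against the ERM model can be sketched in base R as follows. This is an illustration under stated assumptions, not the code behind colless.test: the function names are hypothetical, and the observed value is that of a toy tree, namely the fully unbalanced tree with 20 tips, whose Colless index is 0 + 1 + ... + 18 = 171.

```r
# Hypothetical helper: Colless index of one simulated ERM (Yule) tree with
# n tips, using the uniform root-split property of the ERM model.
sim_erm_colless <- function(n) {
  if (n < 2) return(0)
  left <- sample.int(n - 1, 1)        # left clade size, uniform on 1..n-1
  abs(n - 2 * left) +                 # |left - right| at this node
    sim_erm_colless(left) +
    sim_erm_colless(n - left)
}

# Monte Carlo P-value for the alternative "less balanced than ERM":
# proportion of simulated indices at least as large as the observed one,
# with the usual +1 correction.
mc_pvalue_greater <- function(observed, n, nsim = 999) {
  sims <- replicate(nsim, sim_erm_colless(n))
  (1 + sum(sims >= observed)) / (nsim + 1)
}

set.seed(42)
p <- mc_pvalue_greater(observed = 171, n = 20)
p
```

Counting the observed value among the simulated ones keeps the estimated P-value strictly positive, which avoids reporting an exact zero from a finite number of simulations.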

The next script connects to Pandit via the Internet, and downloads resolved trees with ID numbers in the range 100–300. Then the histogram of shape statistics s is plotted using the PDA normalization.

trees<-dbtrees(db="pandit", 100:300, quiet=TRUE)

s.statistic<-sapply(trees, FUN=shape.statistic, norm="pda")

hist(s.statistic, prob=TRUE)

The results are displayed in Figure 1. We obtain a clear departure from the PDA model. Nevertheless, the empirical distribution of the indices is bell-shaped [shifted to the left of the standard N(0,1)], with a standard deviation (SD = 1.34) close to the value predicted by the PDA model (SD = 1).

Fig. 1

Histogram of shape statistics s obtained after PDA standardization (196 trees collected from Pandit). The histogram displays a departure from the PDA model [shift to the left from the standard N(0,1)].


## 4 CONCLUSION

The R programming language has proved to be a powerful tool for bioinformatics. We contributed to R in order to improve the analysis of phylogenetic data. The package apTreeshape integrates recent developments in the statistical theory of imbalance measures, which guarantee the optimality of some testing procedures. This package competes with another program called SymmeTREE (Chan and Moore, 2005), which covers the same range of applications (temporal and topological analyses of trees). In this comparison, apTreeshape benefits from the extensive power of R for performing all types of data analyses (and from its facilities for connecting to public databases). This should make this resource attractive to R users.

## REFERENCES

Aldous D.J. Stochastic models and descriptive statistics for phylogenetic trees, from Yule to Today. Stat. Sci., 2001, vol. 16 (pg. 23-34)

Chan K.M.A., Moore B.R. SymmeTREE: whole-tree analysis of differential diversification rates. Bioinformatics, 2005, vol. 21 (pg. 1709-1710)

Fill J.A. On the distribution of binary search trees under the random permutation model. Rand. Struct. Algor., 1996, vol. 8 (pg. 1-25)

Kirkpatrick M., Slatkin M. Searching for evolutionary patterns in the shape of a phylogenetic tree. Evolution, 1993, vol. 47 (pg. 1171-1181)

Mooers A.O., Heard S.B. Inferring evolutionary process from phylogenetic tree shape. Q. Rev. Biol., 1997, vol. 72 (pg. 31-54)

Nee S. Inferring speciation rates from phylogenies. Evolution, 2001, vol. 55 (pg. 661-668)

Paradis E., et al. APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 2004, vol. 20 (pg. 289-290)

R Development Core Team. A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, 2005, Vienna, Austria

Rambaut A., et al. Human immunodeficiency virus phylogeny and the origin of HIV-1. Nature, 2001, vol. 410 (pg. 1047-1048)

Semple C., Steel M. Phylogenetics, 2003, Oxford, Oxford University Press

Whelan S., et al. Pandit: a database of protein and associated nucleotide domains with inferred trees. Bioinformatics, 2003, vol. 19 (pg. 1556-1563)

## Author notes

Associate Editor: Keith A Crandall