Abstract
Summary: Using an appropriate model of amino acid replacement is very important for the study of protein evolution and phylogenetic inference. We have built a tool for the selection of the best-fit model of evolution, among a set of candidate models, for a given protein sequence alignment.
Availability: ProtTest is available under the GNU license from http://darwin.uvigo.es
Contact: fabascal@uvigo.es
1 INTRODUCTION
1.1 Models of protein evolution
Models of protein evolution, or amino acid replacement, describe the probabilities of change from one amino acid to another, and are therefore indispensable tools for characterizing the process of protein evolution (Thorne, 2000; Thorne and Goldman, 2003). Indeed, these models provide the foundation for the reconstruction of protein phylogenies under distance, maximum likelihood and Bayesian methods. Dayhoff et al. (1978) introduced the most influential model of amino acid replacement, a 20-state time-reversible homogeneous Markov model. Because of the large number of parameters in a 20-state replacement matrix, estimates of these parameters are usually obtained from large datasets prior to the analysis of the dataset of interest. In this way, several empirical matrices with fixed relative rates of amino acid replacement have been proposed, such as the Dayhoff matrix (Dayhoff et al., 1978), the JTT matrix (Jones et al., 1992), the mtREV matrix (Adachi and Hasegawa, 1996) and the WAG matrix (Whelan and Goldman, 2001). While these models generally assume that the process of amino acid replacement is very similar across all positions, conservation of protein function and structure imposes constraints on which positions can change. This heterogeneity can be accommodated by considering a fraction of amino acid sites to be invariable (‘+I’) (Reeves, 1992), or by assigning each site a probability of belonging to each of several rate categories (‘+G’) (Yang, 1993). Additionally, the amino acid frequencies observed in the alignment can be used in place of those of the empirical matrix (‘+F’) (Cao et al., 1994).
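To give a sense of the "large number of parameters" mentioned above, the following minimal sketch counts the parameters of a generic time-reversible Markov model. It assumes the standard parameterization in which each replacement rate is the product of a symmetric exchangeability and the frequency of the target state; the function name and the returned keys are illustrative and do not come from ProtTest.

```python
def reversible_parameter_counts(n_states: int = 20) -> dict:
    """Count parameters of a time-reversible model with rates q_ij = s_ij * pi_j.

    Assumption: s_ij is symmetric (s_ij == s_ji) and the overall rate is
    fixed by a scaling convention, which removes one free exchangeability.
    """
    # One exchangeability per unordered pair of distinct states.
    exchangeabilities = n_states * (n_states - 1) // 2
    # Only relative rates matter, so one exchangeability is not free.
    free_exchangeabilities = exchangeabilities - 1
    # Equilibrium frequencies sum to 1, so the last one is determined.
    free_frequencies = n_states - 1
    return {
        "exchangeabilities": exchangeabilities,
        "free_exchangeabilities": free_exchangeabilities,
        "free_frequencies": free_frequencies,
        "free_total": free_exchangeabilities + free_frequencies,
    }


protein_counts = reversible_parameter_counts(20)
nucleotide_counts = reversible_parameter_counts(4)
print("20 states:", protein_counts)
print(" 4 states:", nucleotide_counts)
```

Under these assumptions, a 20-state model has 190 exchangeabilities against 6 for a 4-state nucleotide model, which is why the relative rates are usually taken from a pre-estimated empirical matrix rather than re-estimated from each alignment.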
1.2 Model selection and inference
Model selection may be seen as a way of identifying the model that, among a set of candidates, is closest to reality. Looking for a balance between accuracy and simplicity, Akaike (1973) found a simple relationship between the likelihood (L) and the number of parameters (K):

\[ \mathrm{AIC} = -2 \ln L + 2K \]
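As a small illustration of this trade-off, the sketch below evaluates the Akaike information criterion in its standard form, AIC = -2 ln L + 2K, for two candidate models. The log-likelihoods and parameter counts are invented for the example and are not results from any real alignment.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion; smaller values indicate a better trade-off."""
    return -2.0 * log_likelihood + 2.0 * n_params


# Hypothetical fits: (maximized log-likelihood, number of free parameters).
simple_model = (-5230.0, 37)
richer_model = (-5150.0, 38)

aic_simple = aic(*simple_model)
aic_richer = aic(*richer_model)

# The extra parameter is justified only if it lowers the AIC.
preferred = "richer" if aic_richer < aic_simple else "simple"
print(f"AIC simple = {aic_simple:.1f}, AIC richer = {aic_richer:.1f}, preferred: {preferred}")
```

In this made-up case the richer model gains 80 log-likelihood units for one additional parameter, so its AIC is lower and it would be preferred.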
2 THE PROGRAM: PROTTEST
Although widely used software exists for the selection of best-fit nucleotide models (Posada and Crandall, 1998), no equivalent program has been available until now for protein models. ProtTest is a Java program that finds the best-fit model of amino acid replacement for a given protein alignment. For the ML optimizations it relies on the Phyml program (Guindon and Gascuel, 2003), modified to support +F and four extra substitution matrices, and it uses the PAL library (Drummond and Strimmer, 2001) for handling protein alignments and trees. ProtTest is available for Mac OS X, Linux and Windows, and it can be run in three ways: through a GUI, at the command line and through the web. Its basic workflow is summarized in Figure 1.
Given a protein alignment and a tree topology, the program calculates the likelihood under each candidate model and estimates the model parameters. The current version 1.2 implements 64 empirical models: the eight matrices WAG, mtREV, Dayhoff, JTT, VT, Blosum62, CpREV and RtREV, each with or without +F, +G, +I and their combinations. Other models exist, particularly mechanistic models, that are not implemented in ProtTest. For each model, the tree topology can be fixed [provided by the user or calculated by BIONJ (Gascuel, 1997)] or optimized under ML. After this, the user can choose a model selection strategy (AIC, AICc or BIC) and obtain a ranking of model fits, model-averaged parameter estimates or measures of parameter importance. For the AICc and the BIC, sample size is set by default to the number of positions in the alignment. Other options for defining sample size attempt to take into account both the number of sequences and their redundancy. Other valuable features include the ability to restrict the set of candidate models (only in the GUI version) and the possibility of outputting the tree corresponding to the best model.
Fig. 1. The basic workflow of the ProtTest program. This figure can be viewed in colour on Bioinformatics online.
Special thanks to Stephane Guindon (Phyml) and Matthew Goode (PAL) for their help. This work was financially supported by a research grant in bioinformatics from the Fundación BBVA (Spain).

