Although the use of adequate models of evolution should improve the accuracy of phylogenetic inference (Leitner, Kumar, and Albert 1997; Sullivan and Swofford 1997; Cunningham, Zhu, and Hillis 1998), computer simulation studies have shown that under certain circumstances, wrong models can recover the true tree with higher probability than the true model employed to generate the data (Saitou and Nei 1987; Schöniger and von Haeseler 1993; Tajima and Takezaki 1994; Tateno, Takezaki, and Nei 1994; Yang 1997a; Takahashi and Nei 2000). These results have previously been attributed to the complexity of the topology estimation problem (Yang 1997a), but Bruno and Halpern (1999) have suggested that they could be better understood in terms of bias toward the true tree when the assumptions of the method are violated; in other words, getting the right answer for the wrong reason.
There is no evidence of whether this “phylogenetic bias” actually occurs with real data. While there are empirical examples of “phylogenetic robustness” (complex models produce no better results than simpler ones) (Russo, Takezaki, and Nei 1996), we are not aware of cases in which the use of simpler models leads to a better phylogenetic estimate. We report here on an empirical example in which only the use of simpler, wrong models of evolution leads to the estimation of trees that are in agreement with biochemical and immunological evidence and with previous phylogenetic studies.
DNA and amino acid sequences for the gag, pol, and env genes for 61–76 retroviral taxa were downloaded from GenBank. ClustalX (Thompson et al. 1997) was used for the alignment of the DNA and protein sequences. The best-fit model of DNA substitution for each data set was selected using Modeltest, version 1.05 (Posada and Crandall 1998). Neighbor-joining (NJ) trees (Saitou and Nei 1987) were constructed using uncorrected (NJp) and maximum-likelihood (NJml) distances for DNA and using mean character (=uncorrected) (NJmc) and maximum-likelihood (NJmlprot) distances for proteins. A heuristic search was carried out to find the most parsimonious (MP) tree(s). Unweighted (MPdna and MPprot) and weighted (MPstep, MPpars, and MPrice) MP trees were estimated. Three step matrices were used for the weighting. The DNAstep matrix was constructed in MacClade, version 3.07 (Maddison and Maddison 1994), on the NJ tree estimated using maximum-likelihood (ML) distances under the best-fitting model of nucleotide substitution (MPstep). For protein sequences, two amino acid substitution matrices were used: protpars (Felsenstein 1991) (MPpars) and one nonmetric and asymmetrical step matrix estimated using the accepted mutation-parsimony method (AMP program; Rice, personal communication) (MPrice). Heuristic searches were performed in PAUP* (Swofford 1998). The program GAML (Lewis 1998) was used to estimate ML trees for the DNA data sets. The method implemented in this program uses a genetic algorithm to speed up the search for the ML tree.
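To make the distinction between uncorrected and model-corrected distances concrete, the following is a minimal sketch in Python. It is not the code used in the study, which relied on PAUP* and PAML. The toy sequences, the function names, and the choice to skip gapped sites pairwise are assumptions made for illustration. The Jukes-Cantor (JC69) correction stands in for a model-based distance only because it has a closed form; it is far simpler than the GTR+Γ maximum-likelihood distances described in the text.

```python
import math

# Characters treated as missing data (an assumption for this sketch).
GAP_CHARS = set("-?N")


def p_distance(seq_a, seq_b):
    """Uncorrected distance: proportion of differing sites, skipping
    any site where either sequence has a gap or missing character."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue  # pairwise deletion of gapped sites
        compared += 1
        differing += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def jc69_distance(p):
    """Jukes-Cantor correction of a p-distance for multiple hits.
    Undefined when p >= 0.75 (sequences saturated)."""
    if p >= 0.75:
        raise ValueError("p-distance too large for JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical toy alignment, not data from the study.
toy = {
    "taxonA": "ACGTACGTAC",
    "taxonB": "ACGTACGTTC",
    "taxonC": "AC-TACGAAC",
}

names = sorted(toy)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        p = p_distance(toy[first], toy[second])
        print(first, second, round(p, 4), round(jc69_distance(p), 4))
```

The corrected distance is always at least as large as the uncorrected one, and the gap between them widens as divergence grows, which is where the choice of model matters most.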
Sequences belonging to the same genus were aligned without much trouble. It was the alignment among the different genera that was problematic due to the high divergence among the sequences, especially for the DNA data sets. Numerous gaps were needed in each case to build the alignments, highlighting the importance of indels in the evolution of retroviruses. For the three genes, env, gag, and pol, the best-fitting model of evolution was the complex general time reversible model GTR+Γ (Rodríguez et al. 1990), with rate heterogeneity among sites. DNA and protein trees for the pol gene estimated under simpler models of evolution (MPdna, MPprot, NJp, and NJmc) presented almost all of the genera as monophyletic groups, while trees estimated under more complex models (NJml, NJmlprot, MPpars, MPrice, and GAML [not shown]) showed fewer monophyletic groups (figs. 1 and 2). The MPstep tree, which can be considered to use a model of medium complexity, recovered several monophyletic groups, although its topology was different from that of the MPdna, MPprot, NJp, and NJmc trees (Templeton test, P < 0.05). For the env and gag genes, those trees estimated under simpler models again showed more monophyletic groups coincident with the current genera than those trees estimated under more complex models (trees not shown). NJp trees were not significantly different (Templeton test, P > 0.05) from MP trees. Topologies inferred from the three genes were different. In all cases, the nodal support was higher for the simpler models than for the complex models. This is not unexpected, as variances increase under complex models, and simpler models might simply be overestimating the bootstrap values (Yang, Goldman, and Friday 1994).
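The comparison above rests on counting how many genera each tree recovers as monophyletic groups. A minimal sketch of such a count is given below, assuming rooted trees written as nested Python tuples. The taxon names, genus assignments, and both toy topologies are hypothetical and do not reproduce the trees in figures 1 and 2.

```python
def leaf_sets(tree):
    """Return (leaves, clades) for a rooted tree given as nested tuples.
    `leaves` is the frozenset of tip names under this node; `clades`
    lists the leaf set of every node in the subtree, tips included."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
        return leaves, [leaves]
    leaves = frozenset()
    clades = []
    for child in tree:
        child_leaves, child_clades = leaf_sets(child)
        leaves |= child_leaves
        clades.extend(child_clades)
    clades.append(leaves)
    return leaves, clades


def is_monophyletic(tree, group):
    """True if the taxa in `group` are exactly the tips of some node."""
    _, clades = leaf_sets(tree)
    return frozenset(group) in clades


def count_monophyletic(tree, genera):
    """Number of genera recovered as monophyletic groups in `tree`."""
    return sum(is_monophyletic(tree, members) for members in genera.values())


# Hypothetical genus assignments and topologies, for illustration only.
genera = {"genus1": ["a1", "a2"], "genus2": ["b1", "b2"]}
tree_simple = ((("a1", "a2"), ("b1", "b2")), "out")
tree_complex = ((("a1", "b1"), "a2"), ("b2", "out"))

print(count_monophyletic(tree_simple, genera))
print(count_monophyletic(tree_complex, genera))
```

Because monophyly depends on where the tree is rooted, a count like this is only meaningful once a rooting has been fixed; the sketch takes the rooting as given.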
In recent years, the nucleotide sequences of a large number of retroviruses have been determined, and phylogenetic trees including several members of the family Retroviridae have been published (Doolittle et al. 1989, 1990; Xiong and Eickbush 1990; McClure 1993; Griffiths et al. 1997; Boeke and Stoye 1998; Vogt 1998; Coffin 1999). Current classification of retroviruses is based on this phylogenetic evidence and on morphology, biochemical properties, and range of hosts (Chiu et al. 1984; Coffin 1999). Today, we have a good idea of the “true” phylogenetic relationships among the main groups of retroviruses. When examining the results of the present study, only those trees estimated according to simple, likely wrong, models of evolution agree with current evidence. In most of the reconstructed trees, different genera appear as monophyletic groups. These groups normally have high bootstrap values, indicating that, given the data sets at hand, we can be confident in the nodes defining these clusters. When more complex, more realistic, models of evolution are employed, fewer genera are recovered as monophyletic, the level of support is lower, and the topologies are very different from the assumed “known” trees.
Phylogenetic bias, by which “incorrect” models can give “correct” answers, has been identified in simulation studies. Why this bias occurs is a question that remains unsolved. Here, we report an empirical example of this bias. The causes of it are probably complex and dependent on several factors. One factor that likely contributes to the bias is a problematic alignment, in which sequences belonging to the same group (genus) are easily aligned, whereas the opposite is true for sequences belonging to different groups. Complex models might be confounded when trying to extract information from the poor intergroup sequence alignment, while simpler models basically use the observed patterns. This would warrant a word of caution for the estimation of phylogenies from highly divergent data sets. It would be interesting to test whether this bias appears when analyzing other highly divergent sequences. It should be noted that the three genes studied might not be giving independent evidence for the bias. Most likely, they share common causes responsible for it, probably related to the high level of divergence.
The use of particular models of nucleotide substitution may change the results of an analysis (Leitner, Kumar, and Albert 1997; Sullivan and Swofford 1997; Cunningham, Zhu, and Hillis 1998; Kelsey, Crandall, and Voevodin 1999). In general, phylogenetic methods may be less accurate or may be inconsistent when the model of evolution assumed is wrong (Felsenstein 1978; Huelsenbeck and Hillis 1993; Penny et al. 1994; Bruno and Halpern 1999). Statistical procedures exist to select the best-fit model for the data at hand (Frati et al. 1997; Huelsenbeck and Crandall 1997; Posada and Crandall 1998), and we strongly encourage their use in phylogenetic reconstruction. However, there are unusual cases in which wrong models can lead to a better answer, and here we report such an example. Identifying these few exceptions and their causes will help us to understand the role of models in phylogenetic inference. Indeed, the conclusions of this paper are based on two main assumptions: (1) that there is a good knowledge of how the “true tree” of the family Retroviridae should look, and (2) that the processes governing the evolution of retroviruses are complex and are not well described by simple models of evolution. We think that these assumptions are easily justified.
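As an illustration of how a statistical model-selection procedure of this kind can work, the sketch below ranks candidate substitution models by the Akaike information criterion (AIC) and computes a likelihood ratio statistic for a nested pair. It is not necessarily the procedure applied in this study. The log-likelihood values are invented for the example. The parameter counts are the usual numbers of free substitution parameters (JC69: 0, HKY85: 4, GTR+Γ: 9), with branch lengths ignored because they are shared by all candidates on a fixed tree.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower values indicate better fit
    after penalizing the number of free parameters."""
    return 2 * n_params - 2 * log_likelihood


def lrt_statistic(lnl_null, lnl_alt):
    """Likelihood ratio statistic for a simpler (null) model nested
    within a more complex (alternative) model."""
    return 2 * (lnl_alt - lnl_null)


def rank_by_aic(candidates):
    """candidates maps model name -> (log-likelihood, free parameters).
    Returns (name, AIC, AIC difference from best), best model first."""
    scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
    best = min(scores.values())
    rows = [(name, score, score - best) for name, score in scores.items()]
    return sorted(rows, key=lambda row: row[1])


# Hypothetical log-likelihoods, invented for illustration only.
candidates = {
    "JC69": (-5200.0, 0),
    "HKY85": (-5100.0, 4),
    "GTR+G": (-5000.0, 9),
}

ranking = rank_by_aic(candidates)
for name, score, delta in ranking:
    print(name, score, delta)

# Statistic for HKY85 (null) against GTR+G (alternative).
print(lrt_statistic(candidates["HKY85"][0], candidates["GTR+G"][0]))
```

A best-fit model chosen this way is the one that describes the data best among the candidates; as the example in this paper shows, that does not guarantee it yields the tree closest to the accepted relationships.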
Ross Crozier, Reviewing Editor
Keywords: models of evolution, complex trees, phylogenetic bias, alignment, retrovirus phylogeny
Address for correspondence and reprints: David Posada, Department of Zoology, Brigham Young University, 574 Widtsoe Building, Provo, Utah 84602-5255. email@example.com.
We thank Dr. Ziheng Yang for implementing the estimation of amino acid likelihood pairwise distances in PAML and for guidance in their calculation. Dave Swofford kindly provided beta versions of PAUP*. Two anonymous reviewers made helpful suggestions. This work was supported by the Alfred P. Sloan Foundation and the NIH R01-HD34350-01A1 HIV grant. NEXUS alignments are available on request from the authors.