The average size of molecular systematic data sets has grown steadily over the past 20 years. Combined phylogenetic matrices that include multiple genetic loci currently are the norm, and in many cases, rapid compilation of extremely large DNA data sets is feasible. Thus, a frequently asked question is “How many genes should a systematist sequence in order to generate a robust phylogenetic hypothesis?” This query generally has been addressed by computer simulation, where the amount of virtual DNA sequence data that can be generated is unlimited (e.g., Huelsenbeck and Hillis, 1993). Genomic data, however, provide systematists with a multitude of empirical molecular data for phylogenetic analysis, and several authors have taken advantage of this resource to examine the effects of increasing the number of genes to quantities that seemed impossible in the recent past (e.g., Cummings et al., 1995; Bapteste et al., 2002; Goremykin, 2004).
In one noteworthy study, Rokas et al. (2003) compiled a large systematic matrix of 127,026 nucleotide positions from 106 genes for 7 species of Saccharomyces yeast and an outgroup (Candida albicans). Maximum likelihood (ML) and parsimony analyses of this large data set produced congruent, well-supported results with bootstrap scores of 100% for all clades (Fig. 1a). In spite of this overwhelming support, Rokas et al. (2003) noted that there was widespread topological conflict among gene trees. Separate analyses of individual genes produced various topologies that contradicted all nodes in the tree based on concatenation of 106 genes (Fig. 1a). Pairwise comparisons of gene trees showed extensive incongruence, and one conflicting clade, S. kudriavzevii + S. bayanus, was supported by a very large percentage of the gene trees (Fig. 1a). Replicated support for this anomalous clade was apparent in analyses of nucleotides, transversions, codons, and amino acids for a variety of systematic methods (Rokas et al., 2003; also see Holland et al., 2004, 2006; Phillips et al., 2004; Taylor and Piel, 2004; Collins et al., 2005; Gatesy et al., 2005; Ren et al., 2005; Hedtke et al., 2006).
By examining correlations between bootstrap scores and possible confounding factors, however, Rokas et al. (2003) concluded that “… none of the factors known or predicted to cause phylogenetic error could systematically account for the observed incongruence, suggesting that there may be no good predictor of the phylogenetic informativeness of genes” (p. 802). Therefore, many randomly selected genes were necessary to overwhelm conflicting signals. In this case study, very large concatenated data sets of ∼ 20 genes were required to provide 95% bootstrap support for all nodes in the combined data tree, “substantially more genes than commonly used but a small fraction of any genome” (p. 799). Rokas et al. (2003) concluded that “These results have important implications for resolving branches of the tree of life” (p. 799) and “… important implications for many current practices in molecular phylogenetics” (p. 802), points that were reasserted in a commentary by Gee (2003).
Specifically, if 20 or more genes generally are required to yield robust support, then most previous phylogenetic analyses are inadequate in terms of character sampling. This assertion is based on the assumption that the 8 taxa analyzed by Rokas et al. (2003) represent a typical systematic problem. Rokas et al. (2003) considered this issue, noting that “It is possible that the 8 yeast taxa we have analyzed represent a very difficult phylogenetic case, atypical of the situations found in other groups. However, the widespread occurrence of incongruence at all taxonomic levels argues strongly against such a view. Rather, we believe that this group is a representative model for key issues that researchers in phylogenetics are confronting” (p. 802). Large matrices that combine information from 20 or more gene fragments are rare (e.g., Murphy et al., 2001; Bapteste et al., 2002; Gatesy et al., 2002; Goremykin, 2004); therefore, if the test case of Rokas et al. (2003) is representative, most published molecular systematic studies are, at best, preliminary efforts.
Rokas et al. (2003) primarily used the nonparametric bootstrap (Felsenstein, 1985) to assess support and to search for correlates of incongruence in the yeast matrix. Recent reanalyses have utilized a variety of techniques to further characterize conflicting signals in the yeast data set. These approaches included Bayesian analysis (Taylor and Piel, 2004; Jeffroy et al., 2006), transversion coding (Phillips et al., 2004; Jeffroy et al., 2006), removal of rapidly evolving third codon positions (Collins et al., 2005; Jeffroy et al., 2006), partitioned Bremer support scores (Collins et al., 2005; Gatesy et al., 2005), consensus networks (Holland et al., 2004, 2006), isolation of genes with shifting base compositional biases (Collins et al., 2005), supertree bootstrapping (Burleigh et al., 2006), increased taxon sampling (Rokas and Carroll, 2005; Hedtke et al., 2006), and better fitting models of molecular evolution (Ren et al., 2005). Alternatively, several authors have suggested that reducing the number of taxa included in analysis can yield insights regarding the stability of phylogenetic hypotheses (Lanyon, 1985; Philippe and Douzery, 1994; Siddall, 1995; Brochu, 1997; Poe, 1998; Siddall and Whiting, 1999; Holland et al., 2003).
Here, we use selected removal of taxa to explore patterns of incongruence in the yeast data set. In particular, we analyze different subsets of species to determine whether disagreements among gene trees are tempered or accentuated by altering taxonomic representation. In combination with documentation of branch lengths for individual gene trees, our subsampling results show that the set of species analyzed by Rokas et al. (2003) is not representative of most published systematic studies. We suggest that the yeast matrix does not provide a coherent, general recommendation for how many genes to sample in future molecular systematic studies. However, patterns of conflict for different subsets of species offer a very simple explanation for replication of the discrepant S. kudriavzevii + S. bayanus clade in many gene trees (Fig. 1a).
Exceptionally Long Branches
Examination of the optimal topology for the concatenation of 106 genes showed a striking difference between branches that connected 5 closely related Saccharomyces species (S. cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, S. bayanus) and branches that led to S. castellii, S. kluyveri, and the outgroup C. albicans (Fig. 1b; Hedtke et al., 2006; Jeffroy et al., 2006). For the ML model utilized by Rokas et al. (2003), the branches that connected to S. castellii, S. kluyveri, and C. albicans ranged from 0.31 to 1.58 expected substitutions per site, whereas the branches that joined the remaining, closely related Saccharomyces species were from 0.03 to 0.08 substitutions per site. Only 15% of the inferred nucleotide substitutions occurred on branches that linked these 5 species (Fig. 1b).
Consistent with an estimated Precambrian (∼ 723 Mya) divergence of Candida albicans from Saccharomyces cerevisiae (Hedges et al., 2004), the outgroup branch in the yeast tree was exceptionally long. Each site in the concatenated data set is expected to change 1.58 times on this branch according to the ML estimate (Fig. 1b). For the topology supported by the concatenation of 106 genes, 43% of the yeast genes had one branch that was > 2.00 expected substitutions per site, and 79% of the yeast genes had at least one branch that was > 1.00 substitution per site (also see Hedtke et al., 2006). For comparison, in an often cited discussion of long branches in a 28S rDNA tree of holometabolous insects, Huelsenbeck (1998) remarked that two branches in his analysis were “among the longest ever observed (approximately 1.0 substitution per site)” (p. 530). However, branches in many of the yeast gene trees dwarfed those in the insect rDNA tree and were up to 95 times longer (Fig. 2). From another perspective, the longest branches in the yeast data set exceeded those in a tree based on mitochondrial genomes from 5 animal phyla (Naylor and Brown, 1998) and also were much longer than branches in simulations designed to assess misplacement of long branches (e.g., Anderson and Swofford, 2004). Although it has been suggested that the set of species in the yeast data set represents a typical phylogenetic problem (Rokas et al., 2003), the extraordinarily long branch lengths in most yeast gene trees demonstrate that this is not the case (e.g., Fig. 2).
Complete Congruence for Five Closely Related Saccharomyces Species
For the yeast data set, gene trees that included all 8 species showed many conflicts among the 5 most closely related Saccharomyces species (Fig. 1a). In ML analyses, 11% of gene trees conflicted with the S. cerevisiae + S. paradoxus clade, 28% conflicted with the S. cerevisiae + S. paradoxus + S. mikatae clade, and 45% conflicted with the S. cerevisiae + S. paradoxus + S. mikatae + S. kudriavzevii clade. Given the moderate lengths of branches that linked S. cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, and S. bayanus (Fig. 1b), we were surprised by the widespread discrepancies among genes at this level.
To further explore differences among gene trees, we reanalyzed the 5 closely related species of Saccharomyces in isolation from their distant relatives, S. castellii, S. kluyveri, and C. albicans. Given the diversity of gene trees for all 8 taxa, we expected to find many conflicting topologies but were shocked by complete congruence among the 106 gene trees in ML analyses (Fig. 3). There are 15 possible bifurcating trees (unrooted) for a data set of 5 taxa; assuming an equal probability for each topology a priori, the chance of recovering the same tree 106 straight times is astronomically low (P = 3.24 × 10^−124). Ironically, a systematic data set that has been presented as a prime example of pervasive, inexplicable conflict among genes (Rokas et al., 2003) can be transformed, with the removal of 3 species, into a remarkably congruent data set that shows 100% agreement among 106 genes. For the set of 5 closely related Saccharomyces species, 20 genes were not necessary to resolve relationships; basically any gene will do (Fig. 3c). Subsets of only 600 randomly resampled nucleotides from the yeast data set were sufficient to support the 2 nodes in the 5-taxon tree in >95% of replicates, and only 200 nucleotides recovered each clade >75% of the time (Fig. 3a, Fig. 3b). These are very small samples of characters relative to most modern systematic studies.
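The probability reported for 106 identical gene trees can be checked with a short calculation. The sketch below takes the 15 equally likely unrooted topologies from the text and assumes that the first gene tree fixes the reference topology, so that the remaining 105 trees must each match it by chance. That reading is an assumption, adopted here because it reproduces a value of about 3.24 × 10^-124.

```python
from fractions import Fraction

N_TOPOLOGIES = 15   # unrooted bifurcating trees for 5 taxa (from the text)
N_GENES = 106       # gene trees analyzed

# Assumption: the first gene tree defines the topology, and each of the
# remaining 105 gene trees matches it with probability 1/15.
p_exact = Fraction(1, N_TOPOLOGIES) ** (N_GENES - 1)

# The exact fraction is still large enough to convert to a float for display.
p = float(p_exact)
print(f"P = {p:.2e}")
```

Requiring all 106 trees to match a topology specified in advance would instead use an exponent of 106 and give a value 15 times smaller.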
Incongruence among Genes with the Addition of Divergent Taxa
In ML analyses of S. cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, and S. bayanus, there were no topological discrepancies among gene trees (Fig. 3c), but when more distantly related taxa (S. castellii, S. kluyveri, and C. albicans) were included, gene trees showed extensive conflicts regarding relationships among the 5 closely related Saccharomyces species. Previously, we noted such conflicts for the full complement of 8 species (Fig. 1a). Analyses of the 7 Saccharomyces species, excluding the outgroup C. albicans, also revealed widespread incongruence among genes (Rokas et al., 2003), as did analyses of 6 species (Fig. 4).
For the 6 species set composed of 5 closely related Saccharomyces species and C. albicans, the disparity in length between the outgroup branch and ingroup branches was greatest. Approximately 87% of the expected character change was restricted to the outgroup branch (3.04 substitutions per site), and the concatenation of 106 genes supported a grouping of S. kudriavzevii + S. bayanus with an ML bootstrap score of 76% (Fig. 4). This relationship was incompatible with the S. cerevisiae + S. paradoxus + S. mikatae + S. kudriavzevii clade that had 100% bootstrap support in the analysis of 106 genes for 8 species (Fig. 1a). Thus, a systematic data set of 8 species, in which 20 genes were considered sufficient for robust phylogenetic support (Rokas et al., 2003), can be transformed, with the deletion of 2 species, into a data set of 106 genes that yields a contradictory tree; > 100 concatenated genes did not provide ≥95% bootstrap support at all nodes for this set of 6 species (Fig. 4). Phylogenetic analyses of taxon subsamples (Fig. 3, Fig. 4) clearly show that the number of genes required to yield strong bootstrap scores is highly dependent on the particulars of a given systematic problem and suggest that long branches (Fig. 2) explain much of the incongruence among genes in the yeast data set (Taylor and Piel, 2004; Hedtke et al., 2006; Jeffroy et al., 2006).
Highly Replicated Incongruence with the Addition of Distant Taxa
In ML analyses of the 5 closely related Saccharomyces species, all gene trees had the same 7 branches (Fig. 3a). Assignment of the root to 5 of these 7 branches will yield the incongruent S. kudriavzevii + S. bayanus clade (Fig. 5a). When the distantly related C. albicans was added to a matrix that included S. cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, and S. bayanus, all ML gene trees were consistent with the unrooted topology for these 5 Saccharomyces species (Fig. 3a), but rooting position was scattered among the 7 ingroup branches (Fig. 5b). The most common root position was on the “correct” branch (S. bayanus; 31 times). For the other 75 genes, the root was distributed across the remainder of the topology, and the majority of gene trees (57 of 106) supported the S. kudriavzevii + S. bayanus clade (Figs. 4 and 5b). Considering only the 88 gene trees that supported one of these two clades (31 + 57), and assuming an equal a priori probability of recovering the S. kudriavzevii + S. bayanus clade or the S. cerevisiae + S. paradoxus + S. mikatae + S. kudriavzevii clade, it would be highly unlikely for one of these groups to be supported in ≥57 of 88 gene trees, as was the case here (binomial probability of 0.007). Analogous but less extreme patterns were observed in ML gene trees for other combinations of 6 taxa, in which the 5 closely related Saccharomyces species were rooted with either S. castellii or S. kluyveri. S. kudriavzevii + S. bayanus was always the most common, conflicting clade (Fig. 4). Likewise, Rokas et al. (2003) documented the same pattern of replicated support for the conflicting S. kudriavzevii + S. bayanus in gene trees for the 7 Saccharomyces species, excluding the outgroup C. albicans.
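The binomial probability can be recomputed directly from the counts given above: 57 gene trees with the S. kudriavzevii + S. bayanus clade and 31 with the correct rooting, for 88 trees in total. The sketch below assumes a fair-coin null (p = 0.5) and sums the upper tail exactly. Doubling the one-tailed value to allow either clade to be the favored one is an assumption about how the reported figure was obtained; under that assumption the result is close to the 0.007 quoted in the text.

```python
from math import comb

def binomial_upper_tail(k: int, n: int, p: float = 0.5) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_wrong = 57                      # gene trees with S. kudriavzevii + S. bayanus
n_right = 31                      # gene trees rooted on the S. bayanus branch
n_informative = n_wrong + n_right # 88 trees supporting one clade or the other

one_tailed = binomial_upper_tail(n_wrong, n_informative)
# Assumption: "one of these groups" implies a two-tailed test.
two_tailed = min(1.0, 2 * one_tailed)

print(f"one-tailed: {one_tailed:.4f}, two-tailed: {two_tailed:.4f}")
```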
Of the 10,395 possible bifurcating topologies for all 8 species, the S. kudriavzevii + S. bayanus bipartition is found in only 9% of all trees. However, ML analyses of 8 species showed that S. kudriavzevii + S. bayanus was recovered 32 times in 106 gene trees (Fig. 1a); previous parsimony, ML, and Bayesian results showed this same pattern of replicated incongruence whether nucleotides, transversions, codons, or amino acids were analyzed (Rokas et al., 2003; Phillips et al., 2004; Taylor and Piel, 2004; Collins et al., 2005; Gatesy et al., 2005; Ren et al., 2005; Burleigh et al., 2006; Holland et al., 2004, 2006). Repeated recovery of the incongruent S. kudriavzevii + S. bayanus clade in ∼30% of our ML gene trees strongly suggested an underlying bias. Once again, all ML gene trees for 8 species were compatible with relationships in the unrooted tree for the 5 closely related Saccharomyces species (Fig. 3a), but different placements of the 3 long branch taxa (Fig. 6a, Fig. 6b) resulted in many gene trees that supported the S. kudriavzevii + S. bayanus clade (Fig. 1a and Fig. 5c). As in the analyses of 6 taxa (Fig. 4), replicated support for the conflicting S. kudriavzevii + S. bayanus grouping was due to erratic rooting of the uniformly supported, pectinate topology for S. cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, and S. bayanus (Fig. 3a) by the very distantly related S. castellii, S. kluyveri, and C. albicans (Fig. 4 to Fig. 6; for discussion of distant outgroups see Wheeler, 1990; Huelsenbeck et al., 2002; Holland et al., 2003; Anderson and Swofford, 2004; Bergsten, 2005; Susko et al., 2005; Goloboff and Pol, 2005; Hedtke et al., 2006).
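The tree counts behind the 9% figure follow from the standard formula for the number of unrooted bifurcating trees, (2n − 5)!! for n taxa. As a minimal sketch, the fraction of 8-taxon trees that contain a particular two-species clade can be obtained by collapsing that pair into a single terminal and counting 7-taxon trees. The function name is illustrative only.

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted bifurcating trees, (2n - 5)!!, for n >= 3 taxa."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):   # 3 * 5 * ... * (2n - 5)
        count *= k
    return count

all_trees = num_unrooted_trees(8)
# Treat S. kudriavzevii + S. bayanus as one terminal: 7 effective taxa.
trees_with_pair = num_unrooted_trees(7)

expected_fraction = trees_with_pair / all_trees
observed_fraction = 32 / 106   # ML gene trees with the clade (from the text)

print(all_trees, trees_with_pair)
print(f"expected {expected_fraction:.1%}, observed {observed_fraction:.1%}")
```

Under a uniform prior over topologies the clade is expected in about 1 tree in 11, so its recovery in roughly 30% of gene trees is more than three times the chance expectation.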
How Many Genes Are Enough?
Rokas et al. (2003) suggested that their analyses of 106 genes from 8 species had important implications for resolving the tree of life. In particular, they argued that 20 or more genes might be required to garner robust support for phylogenetic relationships. This assertion is based on two critical assumptions: (1) There are no good predictors for the utility of different genes, and (2) the 8 species in their data set represent a typical phylogenetic problem. Recent reanalyses of the yeast data set have contested both of these assumptions. Phillips et al. (2004) noted that differences in G-C content might explain non-historical signals in the yeast matrix. Subsequently, Collins et al. (2005) found that shifts in base composition were most prominent at third codon positions (also see Jeffroy et al., 2006). When Collins et al. (2005) resampled genes with stationary base compositions, only 10 genes were required to record high bootstrap percentages for relationships supported by the concatenated data set. By contrast, 23 genes characterized by large shifts in base composition were necessary to yield the same level of support (Collins et al., 2005). In a study focused on Bayesian support measures, Taylor and Piel (2004) found that, “Overall the external/internal branch length ratios were greater for trees that were incongruent with the reference tree [our Fig. 1a, Fig. 1b] …” (p. 1536), a result that was statistically significant. In sum, the contention that there are no good predictors of phylogenetic utility for particular genes does not seem to hold for this phylogenomic data set.
Following Taylor and Piel (2004), Ren et al. (2005), Hedtke et al. (2006), Jeffroy et al. (2006), and Holland et al. (2006) noted the presence of exceptionally long branches and argued that a high level of divergence and associated branch length inequalities (e.g., Fig. 2) were determinants of conflict among genes in the yeast data set. Here, we extended these arguments and concluded that the 8 species in the yeast data set do not represent a “typical” phylogenetic problem. The tree based on the concatenated matrix of 106 genes showed great disparities in branch lengths (Fig. 1b), but individual gene trees had some truly extraordinary branches that were up to 95.31 expected substitutions per site (Fig. 2). This saturation of nucleotide substitution does not represent a typical phylogenetic problem; many systematists acknowledge that this degree of divergence is a very difficult problem (Felsenstein, 1978; Hendy and Penny, 1989; Wheeler, 1990; Huelsenbeck, 1998; Pol and Siddall, 2001; Holland et al., 2003; Anderson and Swofford, 2004; Bergsten, 2005; Susko et al., 2005). S. castellii, S. kluyveri, and C. albicans are very genetically distant from each other and from the 5 most closely related Saccharomyces species (Fig. 1, Fig. 2, Fig. 4, and Fig. 6). Therefore, it was not surprising that there were wholesale conflicts among gene trees in parsimony, Bayesian, and ML analyses (Taylor and Piel, 2004; Hedtke et al., 2006). In fact, the 3 divergent taxa, which also were characterized by the largest shifts in base composition (Collins et al., 2005), accounted for all conflicts among genes in ML analyses.
Because of extensive incongruence, Rokas et al. (2003) found that 20 randomly sampled genes from the yeast matrix were required for a robustly supported tree of 8 species, but this result has no generality. Even within this 106 gene matrix, it is clear that some systematic problems are much more difficult to solve relative to others. For the 5 closely related Saccharomyces species, one gene might be sufficient (Fig. 3c). ML analyses of individual genes produced the same tree 106 straight times, and sets of 600 randomly sampled nucleotides consistently supported this topology (Fig. 3b). By contrast, in ML analyses of these 5 species plus C. albicans, 106 concatenated genes (127,026 nucleotides) apparently were insufficient; the optimal ML tree (Fig. 4) contradicted the best tree for all 8 yeast species, a topology that was thought to show an unprecedented level of support (Fig. 1a, Fig. 1b; Gee, 2003; Rokas et al., 2003).
Clearly, the quantity of genes that is required to robustly resolve relationships will be dependent on the specifics of the phylogenetic problem at hand (Cummings and Meyer, 2005; Hedtke et al., 2006), as well as a particular researcher's definition of “adequate support” (e.g., Satta et al., 2000; Zander, 2001; Siddall, 2002; Grant and Kluge, 2003; Soltis et al., 2004; Taylor and Piel, 2004; Jeffroy et al., 2006). For easy phylogenetic problems where divergence among taxa is not great and internodes are moderately long, a single gene might provide high bootstrap support (e.g., Fig. 3). However, even in this situation, sequencing 2 or more genes may be justified, given that tightly linked nucleotides do not necessarily provide independent evidence for phylogenetic relationships (Doyle, 1992). For cases where exceptionally long branches are apparent, even 106 genes might not be enough (e.g., Fig. 4). When faced with extreme branch lengths like these, increased taxonomic sampling (e.g., Zwickl and Hillis, 2002), evidence from the fossil record (e.g., Brochu, 1997), or a set of more slowly evolving genes (e.g., Springer et al., 2001) with stationary base frequencies (e.g., Collins et al., 2005) may be required. Educated guesses can be made, but in the end, the amount of character data needed to arrive at a stable, well-supported phylogenetic hypothesis can only be quantified by adding new data to existing data and then reassessing the results.
Iterated Conflict in Phylogenomic Matrices
Our reanalyses provided a very simple explanation for replicated support of the conflicting S. kudriavzevii + S. bayanus clade in many yeast gene trees (Fig. 1a). Misrooting of a stable topology for 5 close relatives (Fig. 3a) by 3 genetically distant taxa (Fig. 4) can account for this iterated pattern. Because the topology for S. cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, and S. bayanus was pectinate (Fig. 3a), erratic placement of the root (Fig. 5) repeatedly yielded the discrepant S. kudriavzevii + S. bayanus clade. In the most extreme case that we examined (a data set that included 6 of 8 species in the yeast matrix), the “wrong clade” was preferred over the “right clade” in the majority of ML gene trees (57 of 106 = 54%) and in the concatenated analysis of 106 genes (Fig. 4).
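The link between erratic rooting and the replicated S. kudriavzevii + S. bayanus clade can be verified by enumeration. The sketch below encodes the unrooted pectinate topology for the 5 closely related species, places the root on each of its 7 branches in turn, and checks whether S. kudriavzevii + S. bayanus forms a clade in the resulting rooted tree. The internal node labels A, B, and C are hypothetical placeholders, not names from the study.

```python
# Unrooted pectinate tree: ((cerevisiae, paradoxus), mikatae, (kudriavzevii, bayanus)).
# "A", "B", "C" are hypothetical labels for the three internal nodes.
EDGES = [
    ("S_cerevisiae", "A"), ("S_paradoxus", "A"), ("A", "B"),
    ("S_mikatae", "B"), ("B", "C"),
    ("S_kudriavzevii", "C"), ("S_bayanus", "C"),
]
INTERNAL = {"A", "B", "C"}
TARGET = frozenset({"S_kudriavzevii", "S_bayanus"})

ADJACENCY = {}
for a, b in EDGES:
    ADJACENCY.setdefault(a, []).append(b)
    ADJACENCY.setdefault(b, []).append(a)

def leaves_beyond(node, parent):
    """Leaves reachable from `node` without passing back through `parent`."""
    if node not in INTERNAL:
        return frozenset({node})
    found = frozenset()
    for neighbor in ADJACENCY[node]:
        if neighbor != parent:
            found |= leaves_beyond(neighbor, node)
    return found

def clades_when_rooted_on(root_edge):
    """All clades (as leaf sets) of the tree rooted in the middle of `root_edge`."""
    x, y = root_edge
    clades = set()
    stack = [(x, y), (y, x)]          # walk away from the root on both sides
    while stack:
        node, parent = stack.pop()
        clades.add(leaves_beyond(node, parent))
        if node in INTERNAL:
            stack.extend((n, node) for n in ADJACENCY[node] if n != parent)
    return clades

roots_giving_target = [e for e in EDGES if TARGET in clades_when_rooted_on(e)]
print(len(roots_giving_target), "of", len(EDGES), "root positions yield the clade")
```

Only the two terminal branches leading to S. kudriavzevii and S. bayanus break up the pair, so a root placed at random on this topology recovers the discrepant clade far more often than the correct one.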
This result represents a cautionary tale for phylogenomic studies, in which >100 genes from relatively few taxa may be sampled, and where congruence among individual gene trees has been used to assess support (e.g., Rokas et al., 2003; Holland et al., 2004, 2006; Burleigh et al., 2006). Previously, many authors have argued that large concatenations of genes can provide strong, but spurious, bootstrap support because of model misspecification, inadequate taxon sampling, or both (e.g., Philippe and Douzery, 1994; Naylor and Brown, 1998; Holland et al., 2004, 2006; Phillips et al., 2004; Soltis et al., 2004; Stefanovic et al., 2004; Hedtke et al., 2006; Jeffroy et al., 2006). Here, we documented an exceptional pattern of replicated conflict in which a consensus derived from separate analyses of >100 genes failed to give the right result; nearly twice as many gene trees favored the wrong grouping of S. kudriavzevii + S. bayanus over the right S. cerevisiae + S. paradoxus + S. mikatae + S. kudriavzevii clade (Fig. 4). In comparison to concatenation of genes, it might be expected that partitioned phylogenetic analyses of individual genes should be less prone to highly supported but spurious results. Unfortunately, this is not always the case (Fig. 4), and a simple compilation of many genes for very few taxa (Rokas et al., 2003; Rokas and Carroll, 2005) cannot be trusted as a general solution for “ending incongruence” (Gee, 2003).
We thank R. Baker, T. Collins, A. de Queiroz, J. Garb, C. Hayashi, M. R. McGowen, R. Page, and two anonymous reviewers for comments on different versions of the manuscript. J. Gatesy was supported by NSF (USA) DEB-0212572, DEB-0213171, and EAR-0228629; R. DeSalle was supported by the Lewis B. and Dorothy Cullman Program in Molecular Systematics at the American Museum of Natural History and by NSF (USA) DBI-0421604; N. Wahlberg was supported by the Swedish Research Council 621-2004-2853. G. Naylor provided alignments of animal mitochondrial genomes. A. Rokas provided published multiple sequence alignments and supporting materials that made the present study possible.