Abstract

Combined analysis of multiple phylogenetic data sets can reveal emergent character support that is not evident in separate analyses of individual data sets. Previous parsimony analyses have shown that this hidden support often accounts for a large percentage of the overall phylogenetic signal in cladistic studies. Here, reanalysis of a large comparative genomic data set for yeast (genus Saccharomyces) demonstrates that hidden support can be an important factor in maximum likelihood analyses of multiple data sets as well. Emergent signal in a concatenation of 106 genes was responsible for up to 64% of the likelihood support at a particular node, where likelihood support is the difference in log likelihood scores between optimal topologies that include and exclude a supported clade. A grouping of four yeast species (S. cerevisiae, S. paradoxus, S. mikatae, and S. kudriavzevii) was robustly supported by combined analysis of all 106 genes, but separate analyses of individual genes suggested numerous conflicts. Forty-eight genes strictly contradicted S. cerevisiae + S. paradoxus + S. mikatae + S. kudriavzevii in separate analyses, but combined likelihood analyses that included up to 45 of the “wrong” data sets supported this group. Extensive hidden support also emerged in a combined likelihood analysis of 41 genes that each recovered the exact same topology in separate analyses of the individual genes. These results show that isolated analyses of individual data sets can mask congruence and distort interpretations of clade stability, even in strictly model-based phylogenetic methods. Consensus and supertree procedures that ignore hidden phylogenetic signals are, at best, incomplete.

In the electoral college system of the United States, each citizen is given a single vote in the presidential election. However, voters are partitioned by geographic region, and all-or-nothing electoral college points generally are awarded to majority winners in each state. Subsignals in the electorate (minority voters in each state) are ignored, and counterintuitive results can occur when voter preferences are not evenly distributed among states. For example, in the 2000 presidential election, a candidate who received the most popular votes at the national level was defeated. Indeed, with this system, a candidate could garner the lion's share of the popular vote and earn no electoral votes. Analogous outcomes are possible in a partitioned approach to systematics, in which sets of characters are analyzed separately and congruence among data sets is summarized by consensus (e.g., Mickevich, 1978 ; Miyamoto and Fitch, 1995 ) or supertree methods (e.g., Sanderson et al., 1998 ; Bininda-Emonds et al., 2002 ); phylogenetic signals at a lower hierarchical level (characters) might be cloaked at a higher level (data sets).

These partitioning effects are side-stepped in a combined evidence approach, where characters are treated as the basic units of analysis, and all relevant data are parlayed into unified summaries of common support ( Miyamoto, 1985 ; Kluge, 1989 , 1997 ; Nixon and Carpenter, 1996 ). When diverse data sets are concatenated and analyzed simultaneously, secondary phylogenetic signals that were not apparent in the separate analyses of individual data sets often materialize. These hidden supports and conflicts are due to contrasting patterns of homoplasy in different data sets ( Barrett et al., 1991 ; Chippindale and Wiens, 1994 ; Olmstead and Sweere, 1994 ).

Numerous factors could produce character incongruence that is not dispersed uniformly among and within character partitions. Biological examples include convergence through directional selection, introgression, developmental correlations, incomplete lineage sorting, gene conversion, differential rates of evolution among lineages, selective sweeps of linked nucleotides, undetected gene duplications, and mutational biases ( Felsenstein, 1978 ; Goodman et al., 1979 ; Radding, 1982 ; Pamilo and Nei, 1988 ; Cronin, 1993 ; Naylor and Brown, 1998 ; McCracken et al., 1999 ; Whiting et al., 2003 ). Within a probabilistic framework, the differential distribution of homoplasy among data sets also could be interpreted simply as the result of “sampling error.” That is, given a limited number of characters in a particular data set, conflicts might not be evenly dispersed “by chance” ( Bull et al., 1993 ). However, when several data sets, each with unique biases, are combined, common phylogenetic signals can accrue and incongruities that are not replicated in the majority of partitions might be nullified ( Barrett et al., 1991 ).

Hidden support has been defined as increased character support for a clade in combined analysis of multiple data sets relative to the sum of support for that clade in the separate analyses of the different partitions ( Gatesy et al., 1999 ). In the most obvious cases of hidden support, combined analysis yields relationships that are not supported by separate analyses of any of the individual data partitions (e.g., Gatesy and Arctander, 2000 ). Conversely, hidden conflict can be defined as decreased character support for a clade in combined analysis relative to the sum of support for that clade in the separate analyses of different partitions. In the most obvious cases of hidden conflict, a clade that is supported by all separate analyses of individual data sets is not supported by combined analysis of all partitions (e.g., Chippindale and Wiens, 1994 ).

Gatesy et al. (1999) outlined simple indices for measuring hidden support in terms of branch support (also called Bremer support or decay index; Bremer, 1988 , 1994; Donoghue et al., 1992). In this general parsimony framework, hidden branch support is defined simply as the difference between branch support for a clade in a combined analysis of all data sets and the sum of branch support scores, positive or negative, in separate analyses of the individual data sets. The hidden support for a clade that is due to a particular data set is the partitioned branch support for that data set in combined analysis (see Baker and DeSalle, 1997 ), minus the branch support score in the separate analysis of the data set. For these indices, positive scores indicate hidden character support, and negative scores suggest hidden character conflict (see Gatesy et al., 1999 ).

Lee and Hugall (2003) defined analogous measures of character support within a strictly model-based approach. They described likelihood support for a particular clade as the difference in log likelihood scores between optimal topologies that included and excluded that clade, and showed that partitioned likelihood support could be used to summarize the contributions of different data sets to likelihood support at a node in a combined analysis of several data sets ( Lee and Hugall, 2003 ). By logical extension, hidden likelihood support for a clade supported by a combined analysis of multiple data sets would be the likelihood support for that clade in combined analysis, minus the sum of likelihood support scores, positive or negative, in separate analyses of the individual data sets. The hidden likelihood support for a clade that is due to a particular data set is simply the partitioned likelihood support for the data set in combined analysis, minus the likelihood support score in the separate analysis of that data set. Again, positive scores would indicate hidden character support, and negative scores would suggest hidden character conflict. Lee and Hugall (2003) noted that hidden support might be evident in their simultaneous maximum likelihood (ML) analysis of four genes from cetartiodactyl mammals, but could not confirm this because branch lengths for different genes were not optimized identically in both separate and combined analyses.
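The definitions in the preceding paragraph can be written compactly. The symbols below are notational assumptions introduced here for illustration and are not taken from Lee and Hugall (2003): T_C and T_{\bar{C}} denote the optimal combined-analysis topologies that include and exclude clade C, \ln L_i is the log likelihood of data set i, and LS_{C,i} is the likelihood support for C in the separate analysis of data set i.

```latex
\begin{align}
  LS_C      &= \ln L(T_C) - \ln L(T_{\bar{C}}) \\
  PLS_{C,i} &= \ln L_i(T_C) - \ln L_i(T_{\bar{C}}),
              \qquad \sum_i PLS_{C,i} = LS_C \\
  HLS_C     &= LS_C - \sum_i LS_{C,i} \\
  HLS_{C,i} &= PLS_{C,i} - LS_{C,i},
              \qquad \sum_i HLS_{C,i} = HLS_C
\end{align}
```

The two summation identities assume that the combined log likelihood is the sum of the per-data-set log likelihoods on a given topology. Positive values of HLS indicate hidden support, and negative values indicate hidden conflict.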

Empirical studies have shown that hidden support often accounts for a large percentage of the total branch support in combined parsimony analyses of multiple data sets (Chippindale and Wiens, 1994; Olmstead and Sweere, 1994; Gatesy et al., 1996, 1999, 2003; Sullivan, 1996; Brower and Egan, 1997; Miller et al., 1997; Gatesy and Arctander, 2000; Cognato and Vogler, 2001; Wheeler et al., 2001; Baker and Gatesy, 2002; Gatesy, 2002; O'Grady et al., 2002; Damgaard and Cognato, 2003; Lambkin, 2004). It might be expected, however, that hidden support is less influential in ML analyses where different stochastic models are used to offset heterogeneities among character partitions (Huelsenbeck and Bull, 1996; Stanger-Hall and Cunningham, 1998; Jamieson et al., 2002). Here, we execute combined analyses of genomic data for Saccharomyces yeast (Rokas et al., 2003) to show that emergent support can be substantial when commonly utilized ML routines are implemented (i.e., ModelTest [Posada and Crandall, 1998] and PAUP* [Swofford, 1998]). Subdivisions of the 106-gene data set for yeast reveal striking patterns of hidden support and conflict that are due to interactions among partitions in a strictly model-based approach. These emergent signals profoundly alter interpretations of support and conflict among data sets.

Materials and Methods

Data

Genome sequences from a variety of fungi have been published (e.g., Kellis et al., 2003 ). Rokas et al. (2003) recently utilized these data to compile a large combined systematic matrix of 106 genes from seven species of yeast ( Saccharomyces cerevisiae , S. paradoxus , S. mikatae , S. kudriavzevii , S. bayanus , S. castellii , and S. kluyveri ) and an outgroup, Candida albicans . Here, we accepted hypotheses of gene orthology based on synteny from Rokas et al. (2003) and employed the multiple sequence alignments of these authors in all of our analyses. The concatenated matrix consisted of 127,026 nucleotide sites.

Stochastic Models

Models of nucleotide substitution were chosen as in Rokas et al. (2003). For each of the 106 genes in the concatenated alignment, optimal models, from a selection of 56, were determined using PAUP* 4.0b10 (Swofford, 1998) and likelihood ratio tests (Goldman, 1993) implemented in ModelTest 3.06 (Posada and Crandall, 1998). In most combined analyses of multiple genes, a different best-fitting model was utilized for each gene to account, at least in part, for gene-specific evolutionary properties, and unique branch (edge) lengths were allowed for each gene (e.g., Cao et al., 1994). Two alternative ML frameworks also were explored: (1) a single best-fitting model was applied to the entire combined matrix, with a single set of branch lengths for all 106 genes as in Rokas et al. (2003); (2) a single best-fitting model for the combined matrix was employed, but different branch lengths were permitted for each gene. Preferences for competing ML frameworks were based on the Akaike information criterion (AIC; Akaike, 1974; Hasegawa et al., 1990).

Phylogenetic Analyses

All analyses of individual genes were done in PAUP* 4.0b10 ( Swofford, 1998 ). Maximum likelihood searches utilized gene-specific model parameters and were exhaustive or branch and bound ( Hendy and Penny, 1982 ). Bootstrap analyses ( Felsenstein, 1985 ) included 250 replicates with heuristic searches (taxon addition “as is” and tree bisection reconnection branch swapping). In the individual analyses, the “constraints” command in PAUP* was used to calculate likelihood support, positive or negative, for particular clades (see Lee and Hugall, 2003 ).

In order to calculate the ML topology for the combined matrix using gene-specific models, PAUP* and a spreadsheet (Microsoft Excel) were used to sum log likelihood scores of the 106 genes. For each gene and gene-specific model, natural log likelihood scores of all possible 10,395 bifurcating topologies were determined using PAUP*; branch lengths were free to vary for each individual gene. The log likelihood scores for each gene were pasted into the spreadsheet, and for each possible topology, the log likelihoods for all 106 genes were summed. This procedure yielded a combined likelihood value for each of the 10,395 topologies that accounted for gene-specific substitution parameters and branch lengths (e.g., Cao et al., 1994 ). The trees were then sorted in the spreadsheet according to the combined likelihood scores. The topology with the highest log likelihood represented the optimal topology that allowed different substitution parameters and branch lengths for each of the 106 genes.
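The spreadsheet procedure described above amounts to a column-wise sum followed by a sort. The following is a minimal Python sketch of that idea, not the authors' workflow: the function names and the toy scores are hypothetical, and it assumes the per-gene log likelihoods have already been exported (e.g., from PAUP*) in the same topology order for every gene. The first function reproduces the count of 10,395 unrooted bifurcating topologies for eight taxa.

```python
from math import prod


def n_unrooted_bifurcating_trees(n_taxa):
    """Count unrooted, fully bifurcating topologies for n_taxa >= 3: (2n - 5)!!."""
    return prod(range(2 * n_taxa - 5, 0, -2))


def combined_log_likelihoods(per_gene_lnl):
    """Sum log likelihoods across genes, topology by topology."""
    return [sum(scores) for scores in zip(*per_gene_lnl.values())]


def rank_topologies(combined):
    """Topology indices sorted from highest to lowest combined log likelihood."""
    return sorted(range(len(combined)), key=lambda t: combined[t], reverse=True)


# Hypothetical scores for two genes on three topologies (not from the study).
toy_scores = {
    "gene_a": [-100.0, -101.5, -103.0],
    "gene_b": [-200.0, -198.0, -199.5],
}
combined = combined_log_likelihoods(toy_scores)
ranking = rank_topologies(combined)
```

In this toy case each gene alone prefers a different topology, and the summed scores decide between them; the first entry of `ranking` is the optimal topology under gene-specific parameters and branch lengths.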

Likelihood Support in Combined Analysis

PAUP* and the spreadsheet were used to calculate likelihood support at each of the five nodes favored by the 106-gene data set. Visual inspection of suboptimal trees from the combined analysis revealed the best likelihood topologies that lacked particular supported clades. The ML topology and all topologies that lacked particular supported clades (five topologies in total) were used to calculate the likelihood support for each clade favored by the combined data set. At each supported node, partitioned likelihood support scores for the 106 genes also were determined. For each gene and each supported clade, the log likelihood for the optimal combined data topology that lacked the clade of interest was subtracted from the log likelihood for the optimal combined data tree (see Lee and Hugall, 2003; Jamieson et al., 2002).
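As a rough illustration of the calculation just described, the sketch below finds the best combined topologies with and without a clade and then reads off each gene's score difference on those two trees. The function names, the `has_clade` flags, and the scores are hypothetical; in the study the corresponding trees were identified by inspecting the sorted spreadsheet.

```python
def best_index(scores, keep):
    """Index of the highest score among topologies where keep[t] is True."""
    candidates = [t for t in range(len(scores)) if keep[t]]
    return max(candidates, key=lambda t: scores[t])


def partitioned_likelihood_support(per_gene_lnl, has_clade):
    """Return (LS, {gene: PLS}) for one clade in the combined analysis."""
    combined = [sum(col) for col in zip(*per_gene_lnl.values())]
    t_with = best_index(combined, has_clade)
    t_without = best_index(combined, [not h for h in has_clade])
    pls = {g: s[t_with] - s[t_without] for g, s in per_gene_lnl.items()}
    return combined[t_with] - combined[t_without], pls


# Hypothetical log likelihoods for three genes on four topologies.
toy_scores = {
    "gene_a": [-100.0, -101.0, -104.0, -106.0],
    "gene_b": [-200.0, -205.0, -201.0, -206.0],
    "gene_c": [-51.0, -53.0, -52.5, -50.0],
}
has_clade = [True, False, False, False]  # only topology 0 contains the clade

ls, pls = partitioned_likelihood_support(toy_scores, has_clade)
```

Here `gene_c` prefers a tree without the clade when analyzed alone (topology 3), yet its partitioned score is positive because the best combined tree lacking the clade is topology 2.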

Hidden likelihood support at each supported node was calculated by subtracting the sum of likelihood support scores for the 106 genes in separate analyses (see above) from the likelihood support for the combination of all 106 genes. The contribution of individual genes to the total hidden support score at a node also was determined by subtracting, for each gene, the likelihood support score for that gene in separate analysis from the partitioned likelihood support score for the gene in the combined analysis of all 106 genes.
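The subtraction described above is simple enough to sketch directly. The function name and all input values below are hypothetical; the inputs stand in for partitioned likelihood support scores from a combined analysis and likelihood support scores from separate analyses of the same genes.

```python
def hidden_likelihood_support(pls_combined, ls_separate):
    """Return (total HLS, {gene: HLS}) at one node.

    pls_combined: partitioned likelihood support per gene in combined analysis.
    ls_separate:  likelihood support per gene in its separate analysis.
    """
    per_gene = {g: pls_combined[g] - ls_separate[g] for g in pls_combined}
    total = sum(pls_combined.values()) - sum(ls_separate.values())
    return total, per_gene


# Hypothetical scores for three genes at a single node.
pls = {"gene_a": 4.0, "gene_b": 1.0, "gene_c": 1.5}
ls_sep = {"gene_a": 1.0, "gene_b": 1.0, "gene_c": -1.0}

total_hls, per_gene_hls = hidden_likelihood_support(pls, ls_sep)
percent_hidden = 100 * total_hls / sum(pls.values())
```

With these made-up numbers, the combined likelihood support is 6.5, of which 5.5 is hidden; `gene_c` conflicts with the clade in separate analysis but contributes hidden support in combination.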

To illustrate the different effects of hidden support and conflict in ML calculations, the total combined data set was partitioned into two smaller subsets of genes. The first subset included 45 genes that strictly conflicted with a clade supported by the combined ML analysis of all 106 genes (the grouping of S. cerevisiae , S. paradoxus , S. mikatae , and S. kudriavzevii ). The second subset included the 41 genes that, in separate analyses, strictly supported the topology favored by combined analysis of all 106 genes. Likelihood support, partitioned likelihood support, and hidden likelihood support were determined according to the procedures described above.

Parsimony Support Scores

Branch support, partitioned branch support, and hidden branch support for the 106-gene matrix were calculated for comparison to the analogous ML measures. Parsimony analyses in PAUP* were branch and bound with all character transformations weighted equally. The “constraints” command in PAUP* was used to determine length differences between minimum length topologies and trees that lacked/contained the clades of interest ( Baker and DeSalle, 1997 ; Gatesy et al., 1999 ).
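The parsimony measures follow the same logic as the likelihood measures, except that tree length is minimized, so the sign of each difference is reversed. The sketch below is illustrative only: the function names and tree lengths are hypothetical, and it ignores ties among equally short trees, which a full implementation would need to handle.

```python
def branch_support(lengths, has_clade):
    """Shortest tree lacking the clade minus the shortest tree containing it."""
    shortest_with = min(n for n, h in zip(lengths, has_clade) if h)
    shortest_without = min(n for n, h in zip(lengths, has_clade) if not h)
    return shortest_without - shortest_with


def partitioned_branch_support(per_set_lengths, has_clade):
    """Per-data-set length differences on the best combined trees."""
    combined = [sum(col) for col in zip(*per_set_lengths.values())]
    topologies = range(len(combined))
    t_with = min((t for t in topologies if has_clade[t]), key=lambda t: combined[t])
    t_without = min((t for t in topologies if not has_clade[t]), key=lambda t: combined[t])
    return {name: n[t_without] - n[t_with] for name, n in per_set_lengths.items()}


# Hypothetical tree lengths (steps) for two data sets on four topologies.
toy_lengths = {"set_x": [10, 11, 14, 16], "set_y": [20, 25, 21, 26]}
has_clade = [True, False, False, False]

pbs = partitioned_branch_support(toy_lengths, has_clade)
bs_separate = {name: branch_support(n, has_clade) for name, n in toy_lengths.items()}
hidden_bs = {name: pbs[name] - bs_separate[name] for name in pbs}
```

In this toy case the combined branch support is 5 steps, the separate scores sum to 2, and the remaining 3 steps are hidden branch support, all attributable to `set_x`.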

Results and Discussion

Stochastic Models

We utilized procedures for model choice that have been implemented commonly in the recent systematics literature (e.g., Posada and Crandall, 1998 ; Pupko et al., 2002 ). Best-fitting models for the 106 genes were identical to those reported by Rokas et al. (2003) . Base composition, transition/transversion ratios, variability in rates among sites, and the percentage of invariant positions showed wide ranges of values in the various gene-specific models (Appendix 1; available at the Society of Systematic Biologists Website, http://systematicbiology.org). Furthermore, estimated branch lengths differed radically among genes. Because of these discrepancies, the combined ML analysis that allowed unique substitution parameters and branch lengths for each of the 106 genes (> 2,000 parameters; maximum log likelihood = −676,750.042) was favored over simpler models (one substitution model/one set of branch lengths = −683,277.766; one substitution model/unique branch lengths for each gene = −678,649.197) according to AIC. The preferred multiple-model analysis identified a single optimal topology, and likelihood support scores for all nodes were high, ranging from +1309.065 to +253.800 ( Fig. 1 ). The same tree was supported robustly by the simpler ML models utilized here, by parsimony, and by the previous phylogenetic analyses of the yeast data set ( Rokas et al., 2003 ; Phillips et al., 2004 ).
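The AIC comparison can be made concrete using the three log likelihoods reported above. Because exact parameter counts are not given in the text (only "> 2,000" for the richest model), the sketch below does not assume them; instead it computes how many extra parameters the richest model could carry before a simpler model would have the lower AIC. The function and constant names are hypothetical.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is preferred)."""
    return 2 * n_params - 2 * log_likelihood


def break_even_extra_params(lnl_rich, lnl_simple):
    """Extra parameters at which the richer model's AIC equals the simpler one's.

    Setting 2*k_rich - 2*lnl_rich equal to 2*k_simple - 2*lnl_simple gives
    k_rich - k_simple = lnl_rich - lnl_simple.
    """
    return lnl_rich - lnl_simple


# Log likelihoods reported in the text.
LNL_FULL = -676750.042                 # gene-specific models and branch lengths
LNL_ONE_MODEL_UNIQUE_BL = -678649.197  # one model, unique branch lengths per gene
LNL_ONE_MODEL_ONE_BL = -683277.766     # one model, one set of branch lengths

margin_vs_unique_bl = break_even_extra_params(LNL_FULL, LNL_ONE_MODEL_UNIQUE_BL)
margin_vs_one_bl = break_even_extra_params(LNL_FULL, LNL_ONE_MODEL_ONE_BL)
```

Under these figures, the gene-specific framework remains preferred by AIC so long as it adds fewer than about 1,899 parameters relative to the one-model/unique-branch-length framework, and fewer than about 6,528 relative to the simplest framework.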

Figure 1

The optimal topology for seven species of Saccharomyces and the outgroup, Candida albicans . For each of the five supported nodes, partitioned likelihood support in combined analysis (PLS), likelihood support (LS) in separate analysis, and partitioned hidden likelihood support (HLS) were calculated for each of the 106 genes. Positive support scores are colored green, negative scores are pink, and scores of zero are yellow (HLS scores greater than 5 are marked by asterisks, and circled asterisks indicate scores less than −5). LS scores for the five supported nodes in the combined ML analysis of 106 genes were as follows: node 1 = 1024.609, node 2 = 586.350, node 3 = 253.800, node 4 = 1309.065, node 5 = 318.870. The percentage of the total LS in combined analysis that was due to HLS is indicated at the bottom of each node (%). For the parsimony analysis of the yeast data, the percentage of hidden branch support to total branch support also was substantial (node 1 = 34%, node 2 = 34%, node 3 = 56%, node 4 = 9%, node 5 = 18%).

Hidden Signals in the Combined ML Analysis

In the multiple-model combined analysis, hidden conflicts (negative hidden likelihood support scores) were observed at all supported nodes (Fig. 1). For example, 32 genes expressed emergent incongruities at node 1 (S. cerevisiae + S. paradoxus), and seven genes that favored node 1 in individual analyses had negative partitioned likelihood support scores in combined analysis. Node 4 (S. cerevisiae + S. paradoxus + S. mikatae + S. kudriavzevii + S. bayanus) also showed slightly more incongruence among genes in combined analysis relative to separate analyses, but overall, such hidden conflicts were swamped by the overwhelming hidden support that sprang from the combined analysis (Fig. 1).

Specifically, hidden likelihood support was positive at all five nodes in the optimal topology and accounted for a large percentage of the total character support in combined analysis. From 12% to 64% of the likelihood support at particular nodes was shrouded in the separate analyses of individual genes but emerged in the concatenated analysis ( Fig. 1 ); ∼ 80% of the genes provided more hidden support than hidden conflict. For the combined parsimony analysis of the yeast data, the percentage of hidden branch support to total branch support also was substantial (from 9% to 56% at particular nodes; Fig. 1 ).

In separate ML analyses, many individual genes disagreed with the tree favored by combined analysis, but hidden support offset much of this conflict, and the partitioned likelihood support scores illustrated this effect. For example, at node 5 ( S. cerevisiae + S. paradoxus + S. mikatae + S. kudriavzevii + S. bayanus + S. castellii ), 41 genes had negative likelihood support scores in separate analyses. In the combined tree, however, 59% of the genes contributed hidden support, and 12 of the 41 genes that conflicted in separate analyses had positive partitioned likelihood support scores ( Fig. 1 ). Node 2 ( S. cerevisiae + S. paradoxus + S. mikatae ) and node 3 ( S. cerevisiae + S. paradoxus + S. mikatae + S. kudriavzevii ) also showed less conflict among genes in combined analysis relative to separate analyses ( Fig. 1 ).

Forty-Five Wrongs Make a Right?

Many genes that gave “wrong” answers in separate ML analyses gave the “right” answer when combined with other “wrong” genes. For example, node 3 (S. cerevisiae + S. paradoxus + S. mikatae + S. kudriavzevii) was supported primarily by secondary phylogenetic signals. Sixty-four percent of the likelihood support was emergent (likelihood support = +253.800; hidden likelihood support = +163.623), and 77 genes contributed hidden support (Fig. 1). Exploratory analyses of the 48 genes that contradicted node 3 in separate analyses revealed striking patterns of cryptic support; 45 of the “wrong” genes (Fig. 2) contained a substantial reservoir of hidden signal (Fig. 3). The conflicting clade, S. kudriavzevii + S. bayanus, was replicated 28 times in the separate analyses of these 45 genes. The probability of independently recovering a group this many times “by chance” might seem remote (Penny and Hendy, 1986; Miyamoto and Fitch, 1995; Chen et al., 2003), and 13 of the bootstrap scores for S. kudriavzevii + S. bayanus were 70% or greater (Fig. 2; Table 1), a level that often has been equated with accuracy (see Hillis and Bull, 1993). Naively, examination of the 45 contrary genes could have led to the conclusion that these genes strongly disputed the robust support for node 3 in the remaining 61 genes, but when the 45 “wrong” genes were concatenated and analyzed in a multiple-model ML analysis, the “right” node 3 was resolved. Hidden support emanated from 41 of the 45 genes, and 16 of the conflicting genes showed positive partitioned likelihood support for node 3 (Fig. 3).
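The 64% figure for node 3 in the 106-gene analysis follows directly from the two scores quoted above. As a worked check, using symbols introduced here only for illustration (LS for likelihood support in combined analysis, HLS for hidden likelihood support, and LS_i for the likelihood support of gene i in separate analysis):

```latex
\begin{align}
  \frac{HLS}{LS} &= \frac{163.623}{253.800} \approx 0.645
      \quad (\text{reported as } 64\%) \\
  \sum_{i=1}^{106} LS_i &= LS - HLS = 253.800 - 163.623 = 90.177
\end{align}
```

The second line is implied by the definition of hidden likelihood support: the separate analyses of all 106 genes, summed, account for only about 90 of the roughly 254 log likelihood units that support node 3 in combination.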

Figure 2

Forty-five of the genes that did not support node 3 (S. cerevisiae + S. paradoxus + S. mikatae + S. kudriavzevii) in separate ML analyses. The optimal topology for each gene is shown. The S. kudriavzevii + S. bayanus clade (thick branches) is incompatible with node 3 and was supported by 28 of the 45 genes (* = bootstrap scores ≥70%). A “taxonomic congruence” tree and a “combined evidence” tree also are shown (lower right). The taxonomic congruence tree shows the five clades that were most commonly replicated in the 45 gene trees. The concatenation of all 45 genes supported node 3. Taxa are: Scer = Saccharomyces cerevisiae, Spar = S. paradoxus, Smik = S. mikatae, Skud = S. kudriavzevii, Sbay = S. bayanus, Scas = S. castellii, Sklu = S. kluyveri, and Calb = Candida albicans.

TABLE 1.

Bootstrap percentages for 45 genes that conflicted with node 3 in separate ML analyses ( Fig. 2 and Fig. 3 ). Bootstrap scores for node 3 and for Skud + Sbay are shown.

Gene          Bootstrap node 3   Bootstrap Skud + Sbay   Skud + Sbay > node 3
5_YBR039W     17                 72                      yes
7_YBR070C     -                  63                      yes
8_YBR110W     -                  69                      yes
12_YBR198C    26                 17                      -
14_YCR017C    18                 74                      yes
17_YDL116W    30                 -                       -
20_YDL166C    -                  45                      yes
21_YDL195W    19                 78                      yes
22_YDL215C    53                 47                      -
24_YDR021W    17                 24                      yes
25_YDR054C    -                  80                      yes
26_YDR072C    39                 12                      -
32_YDR484W    -                  28                      yes
33_YDR531W    -                  21                      yes
34_YEL037C    17                 72                      yes
35_YER005W    15                 66                      yes
42_YGL225W    44                 -                       -
44_YGR005C    23                 23                      -
46_YGR194C    -                  50                      yes
47_YGR285C    35                 56                      yes
50_YHR137W    20                 62                      yes
51_YIL109C    39                 61                      yes
53_YJL100W    -                  22                      yes
54_YJR117W    27                 -                       -
56_YKR089C    19                 -                       -
57_YLL029W    22                 75                      yes
58_YLR253W    -                  -                       -
60_YML021C    -                  70                      yes
65_YNL104C    15                 77                      yes
66_YNL155W    -                  77                      yes
67_YNL201C    11                 60                      yes
70_YNR038W    10                 70                      yes
74_YOR025W    15                 54                      yes
75_YOR158W    13                 61                      yes
79_YPL104W    14                 24                      yes
81_YPL169C    22                 70                      yes
83_YPL210C    -                  52                      yes
85_YPR140W    32                 -                       -
88_YIL090W    13                 26                      yes
92_YJR068W    -                  72                      yes
93_YJR072C    29                 68                      yes
94_YKL034W    15                 52                      yes
95_YKL120W    33                 10                      -
97_YKR099W    -                  84                      yes
102_YNL082W   24                 64                      yes
Total         780                2134                    34 of 45 genes
Average       17                 47                      -

By contrast, character support for S. kudriavzevii + S. bayanus, a clade independently corroborated by 28 genes ( Fig. 2 ), was overturned by the interaction of homoplastic signals in the unified analysis of 45 genes ( Fig. 3 ). Remarkably, nine data sets that strictly supported S. kudriavzevii + S. bayanus in separate ML analyses can be combined to yield the total evidence tree that contradicts S. kudriavzevii + S. bayanus ( Fig. 4 ). This is a striking empirical example of hidden conflict.

Figure 3

Support scores for 45 genes on the optimal topology favored by combined ML analysis of these genes. None of the genes supported node 3 (S. cerevisiae + S. paradoxus + S. mikatae + S. kudriavzevii) in separate ML analyses (Fig. 2), but in combined analysis of the 45 genes, node 3 was favored. For each gene, partitioned likelihood support in combined analysis (PLS), likelihood support (LS) in separate analysis, and partitioned hidden likelihood support (HLS) are shown at node 3. Positive scores are colored green, negative scores are pink, and scores of zero are yellow. The total of the PLS scores for the 45 genes is the LS for node 3 in the combined analysis, and the sum of the partitioned HLS scores for the 45 genes is the total HLS at node 3. The percentage of the total LS in combined analysis that was due to HLS is indicated (% Total).

Figure 4

Striking hidden conflict in combined ML analysis of 9 genes for Saccharomyces . Separate analyses of genes 5 (open reading frame YBR039W), 22 (YDL215C), 34 (YEL037C), 46 (YGR194C), 57 (YLL029W), 66 (YNL155W), 67 (YNL201C), 74 (YOR025W), and 93 (YJR072C) each favored a grouping of S. kudriavzevii with S. bayanus (thick branches). When the 9 genes are merged and analyzed in a combined likelihood framework, however, overwhelming hidden conflicts erased character support for the S. kudriavzevii + S. bayanus clade, and the topology robustly supported by the remaining 97 genes was favored. Species are abbreviated as in Figure 2 .

Forty-One Rights Are Even More Right in Combination?

In the above example, 45 genes that strictly contradicted a clade in separate analyses supported that clade in combined analysis, but topological conflicts are not a prerequisite for the manifestation of hidden support. When analyzed individually, 41 of the yeast genes yielded the exact same topology as the ML analysis of all 106 genes (Fig. 1). In combination, these 41 “right” genes expressed surprising synergistic relationships; much of the likelihood support at particular nodes was emergent (node 1 = 36%, node 2 = 27%, node 3 = 16%, node 4 = 19%, and node 5 = 4%), and 40 of the 41 genes yielded hidden support (given complete topological congruence among data sets, hidden conflict is impossible).

Homoplastic signals in the 41 topologically congruent genes were incongruent with each other. For example, both open reading frame YAL053W (gene 1) and YDL126C (gene 18) strongly supported node 1 ( S. cerevisiae + S. paradoxus ) in separate analyses (likelihood support = +4.884 and +13.254; bootstrap support = 91% and 95%), but for these genes, the best topologies that lacked node 1 were inconsistent with each other. Furthermore, these “collapse points” differed from the collapse point for node 1 in the combined analysis of 41 genes. Discrepant patterns of homoplasy in different genes translated to high hidden likelihood support scores for gene 1 (+14.398) and gene 18 (+26.864). By pooling all of the homoplasy, particular gene-specific character conflicts were dispersed ( Barrett et al., 1991 ), and the end result was bolstered support for the combined data topology ( Fig. 1 ). Even when there was complete topological congruence among individual gene trees and heterogeneities in nucleotide substitution patterns were tempered by > 2000 unique model parameters, substantial common character support emerged in combined analysis.

Conclusion

Dramatic hidden support has been documented in several combined parsimony analyses (Chippindale and Wiens, 1994; Olmstead and Sweere, 1994; Gatesy et al., 1996, 1999, 2003; Sullivan, 1996; Brower and Egan, 1997; Miller et al., 1997; Gatesy and Arctander, 2000; Cognato and Vogler, 2001; Wheeler et al., 2001; Baker and Gatesy, 2002; Gatesy, 2002; O'Grady et al., 2002; Damgaard and Cognato, 2003; Lambkin, 2004) but has not been quantified previously using ML methods. Emergent signals in phylogenetic analysis are caused by contrasting patterns of incongruence among data sets, and it is possible that such conflicts could be accommodated by partition-specific model parameters (e.g., Huelsenbeck and Bull, 1996; Stanger-Hall and Cunningham, 1998; Jamieson et al., 2002). Consequently, a mixed-model approach might be expected to disperse homoplasy and minimize the relevance of hidden support and conflict in comprehensive ML studies. Combined analyses of the 106-gene matrix for yeast, however, showed that emergent phylogenetic signals can be extensive in a strictly model-based approach. The amount of hidden support was comparable in ML and parsimony analyses (Fig. 1).

Numerous authors have suggested that separate analyses of individual character sets are critical because, unlike combined phylogenetic analysis, these partitioned searches summarize the distribution of support and conflict among data sets (e.g., de Queiroz et al., 1995; Miyamoto and Fitch, 1995; Slowinski and Lawson, 2002; Chen et al., 2003; DeBry, 2003). This stance denies the importance of hidden support in assessing agreements and disagreements among data sets. Corroboration among separate analyses of individual data sets (Fig. 2 and Fig. 4) could be a seductive phylogenetic mirage that evaporates in combined analysis (Fig. 3 and Fig. 4), conflicts in separate analyses might be reduced in combination (Fig. 1), and even complete topological congruence among partitions does not guarantee that all common character support has been identified (see “Forty-One Rights Are Even More Right in Combination?”).

Because systematic data sets are finite and evolutionary processes are not necessarily stochastic (Siddall and Kluge, 1997), we contend that, in any empirical case, it is unlikely that incongruence will be evenly distributed within and among data partitions. By isolating character sets, a bonanza of hidden support might go unnoticed and interpretations of conflict and support among data sets could be distorted, whether a parsimony or ML framework is implemented. Therefore, heuristic exploration of data sets should include both separate and combined phylogenetic analyses.

Acknowledgments

We thank T. Blackledge, A. Brower, T. Buckley, A. Cognato, R. DeSalle, A. de Queiroz, J. Garb, S. Gatesy, G. Giribet, C. Hayashi, K. Ober, R. Page, A. Vogler, N. Wahlberg, and an anonymous reviewer for helpful comments. A. Rokas provided published multiple sequence alignments and supporting materials. C. Simon (via D. Swofford) suggested the procedure whereby log likelihood scores for multiple genes/models were summed and sorted in Excel. J. Gatesy was supported by NSF grants DEB-0212572, DEB-0213171, and EAR-0228629. R. Baker was supported by NIH National Research Service Award 1F32GM67463-01.

References

Akaike H. A new look at the statistical model identification, IEEE Trans. Autom. Contr., 1974, vol. 19 (pg. 716-723)
Baker R., DeSalle R. Multiple sources of character information and the phylogeny of Hawaiian drosophilids, Syst. Biol., 1997, vol. 46 (pg. 654-673)
Baker R., Gatesy J. Is morphology still relevant? In: DeSalle R., Wheeler W., Giribet G. (eds.), Molecular systematics and evolution: Theory and practice, 2002, Basel, Birkhauser Verlag (pg. 163-174)
Bapteste E., Brinkmann H., Lee J., Moore D., Sensen C., Gordon P., Duruflé L., Gaasterland T., Lopez P., Müller M., Philippe H. The analysis of 100 genes supports the grouping of three highly divergent amoebae: Dictyostelium, Entamoeba, and Mastigamoeba, Proc. Natl. Acad. Sci. USA, 2002, vol. 99 (pg. 1414-1419)
Barrett M., Donoghue M., Sober E. Against consensus, Syst. Zool., 1991, vol. 40 (pg. 486-493)
Bininda-Emonds O., Gittleman J., Steel M. The (super)tree of life: Procedures, problems, and prospects, Annu. Rev. Ecol. Syst., 2002, vol. 33 (pg. 265-289)
Bremer K. The limits of amino acid sequence data in angiosperm phylogenetic reconstruction, Evolution, 1988, vol. 42 (pg. 795-803)
Bremer K. Branch support and tree stability, Cladistics, 1994, vol. 10 (pg. 295-304)
Brower A., Egan M. Cladistic analysis of Heliconius butterflies and relatives (Nymphalidae: Heliconiiti): A revised phylogenetic position for Eueides based on sequences from mtDNA and a nuclear gene, Proc. R. Soc. Lond. B, 1997, vol. 264 (pg. 969-977)
Bull J., Huelsenbeck J., Cunningham C., Swofford D., Waddell P. Partitioning and combining data in phylogenetic systematics, Syst. Biol., 1993, vol. 42 (pg. 384-397)
Cao Y., Adachi J., Janke A., Pääbo S., Hasegawa M. Phylogenetic relationships among eutherian orders estimated from inferred sequences of mitochondrial proteins: Instability of a tree based on a single gene, J. Mol. Evol., 1994, vol. 39 (pg. 519-527)
Chen W.-J., Bonillo C., Lecointre G. Repeatability of clades as a criterion of reliability: A case study for molecular phylogeny of Acanthomorpha (Teleostei) with larger number of taxa, Mol. Phylogenet. Evol., 2003, vol. 26 (pg. 262-288)
Chippindale P., Wiens J. Weighting, partitioning, and combining characters in phylogenetic analysis, Syst. Biol., 1994, vol. 43 (pg. 278-287)
Cognato A., Vogler A. Exploring data interaction and nucleotide alignment in a multiple gene analysis of Ips (Coleoptera: Scolytinae), Syst. Biol., 2001, vol. 50 (pg. 758-780)
Cronin M. A. Mitochondrial DNA in wildlife taxonomy and conservation biology: Cautionary notes, Wildl. Soc. Bull., 1993, vol. 21 (pg. 339-348)
Damgaard J., Cognato A. Sources of character conflict in a clade of water striders (Heteroptera: Gerridae), Cladistics, 2003, vol. 19 (pg. 512-526)
DeBry R. Identifying conflicting signal in a multigene analysis reveals a highly resolved tree: The phylogeny of Rodentia (Mammalia), Syst. Biol., 2003, vol. 52 (pg. 604-617)
de Queiroz A., Donoghue M., Kim J. Separate versus combined analysis of phylogenetic evidence, Annu. Rev. Ecol. Syst., 1995, vol. 26 (pg. 657-681)
Donoghue M., Olmstead R., Smith J., Palmer J. Phylogenetic relationships of Dipsacales based on rbcL sequences, Ann. Mo. Bot. Gard., 1992, vol. 79 (pg. 333-345)
Felsenstein J. Cases in which parsimony or compatibility methods will be positively misleading, Syst. Zool., 1978, vol. 27 (pg. 401-410)
Felsenstein J. Confidence limits on phylogenies: An approach using the bootstrap, Evolution, 1985, vol. 39 (pg. 783-791)
Gatesy J. Relative quality of different systematic data sets for cetartiodactyl mammals: Assessments within a combined analysis framework. In: DeSalle R., Wheeler W., Giribet G. (eds.), Molecular systematics and evolution: Theory and practice, 2002, Basel, Birkhauser Verlag (pg. 45-68)
Gatesy J., Amato G., Norell M., DeSalle R., Hayashi C. Combined support for wholesale taxic atavism in gavialine crocodylians, Syst. Biol., 2003, vol. 52 (pg. 403-422)
Gatesy J., Arctander P. Hidden morphological support for the phylogenetic placement of Pseudoryx nghetinhensis with bovine bovids: A combined analysis of gross anatomical evidence and DNA sequences from five genes, Syst. Biol., 2000, vol. 49 (pg. 515-538)
Gatesy J., Hayashi C., Cronin M., Arctander P. Evidence from milk casein genes that cetaceans are close relatives of hippopotamid artiodactyls, Mol. Biol. Evol., 1996, vol. 13 (pg. 954-963)
Gatesy J., O'Grady P., Baker R. Corroboration among data sets in simultaneous analysis: Hidden support for phylogenetic relationships among higher level artiodactyl taxa, Cladistics, 1999, vol. 15 (pg. 271-313)
Goldman N. Statistical tests of models of DNA substitution, J. Mol. Evol., 1993, vol. 36 (pg. 182-198)
Goodman M., Czelusniak J., Moore G., Romero A., Matsuda G. Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences, Syst. Zool., 1979, vol. 28 (pg. 132-163)
Hasegawa M., Kishino H., Hayasaka K., Horai S. Mitochondrial DNA evolution in Primates: Transition rate has been extremely low in the lemur, J. Mol. Evol., 1990, vol. 31 (pg. 113-121)
Hendy M., Penny D. Branch and bound algorithms to determine minimal evolutionary trees, Math. Biosci., 1982, vol. 59 (pg. 277-290)
Hillis D., Bull J. An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis, Syst. Biol., 1993, vol. 42 (pg. 182-192)
Huelsenbeck J., Bull J. A likelihood ratio test to detect conflicting phylogenetic signal, Syst. Biol., 1996, vol. 45 (pg. 92-98)
Jamieson B., Tillier S., Tillier A., Justine J.-L., Ling E., James S., McDonald K., Hugall A. Phylogeny of the Megascolecidae and Crassiclitellata (Annelida, Oligochaeta): Combined versus partitioned analysis using nuclear (28S) and mitochondrial (12S, 16S) rDNA, Zoosystema, 2002, vol. 24 (pg. 707-734)
Kellis M., Patterson N., Endrizzi M., Birren B., Lander E. Sequencing and comparison of yeast species to identify genes and regulatory elements, Nature, 2003, vol. 423 (pg. 241-254)
Kluge A. G. A concern for the evidence and a phylogenetic hypothesis of relationships among Epicrates (Boidae, Serpentes), Syst. Zool., 1989, vol. 38 (pg. 7-25)
Kluge A. G. Testability and the refutation and corroboration of cladistics hypotheses, Cladistics, 1997, vol. 13 (pg. 81-96)
Lambkin C. Partitioned Bremer support localizes significant conflict in bee flies (Diptera: Bombyliidae: Anthracinae), Invert. Syst., 2004, vol. 18 (pg. 351-360)
Lee M., Hugall A. Partitioned likelihood support and the evaluation of data set conflict, Syst. Biol., 2003, vol. 52 (pg. 15-22)
McCracken K., Harshman J., McClellan D., Afton A. Data set incongruence and correlated character evolution: An example of functional convergence in the hind-limbs of stifftail diving ducks, Syst. Biol., 1999, vol. 48 (pg. 683-714)
Mickevich M. Taxonomic congruence, Syst. Zool., 1978, vol. 27 (pg. 143-158)
Miller J., Brower A., DeSalle R. Phylogeny of neotropical moth tribe Josiini (Notodontidae: Dioptinae): Comparing and combining evidence from DNA sequences and morphology, Biol. J. Linn. Soc., 1997, vol. 60 (pg. 297-316)
Miyamoto M. Consensus cladograms and general classifications, Cladistics, 1985, vol. 1 (pg. 186-189)
Miyamoto M., Fitch W. Testing species phylogenies and phylogenetic methods with congruence, Syst. Biol., 1995, vol. 44 (pg. 64-76)
Naylor G., Brown W. Amphioxus mt DNA, chordate phylogeny, and the limits of inference based on comparisons of sequences, Syst. Biol., 1998, vol. 47 (pg. 61-76)
Nixon K., Carpenter J. On simultaneous analysis, Cladistics, 1996, vol. 12 (pg. 221-241)
O'Grady P., Remsen J., Gatesy J. Partitioning of multiple data sets in phylogenetic analysis. In: DeSalle R., Giribet G., Wheeler W. (eds.), Techniques in molecular systematics and evolution, 2002, Basel, Birkhauser Verlag (pg. 102-119)
Olmstead R., Sweere J. Combining data in phylogenetic systematics: An empirical approach using three molecular data sets in the Solanaceae, Syst. Biol., 1994, vol. 43 (pg. 467-481)
Pamilo P., Nei M. Relationships between gene trees and species trees, Mol. Biol. Evol., 1988, vol. 5 (pg. 568-583)
Penny D., Hendy M. Estimating the reliability of evolutionary trees, Mol. Biol. Evol., 1986, vol. 3 (pg. 403-417)
Phillips M., Delsuc F., Penny D. Genome-scale phylogeny and the detection of systematic biases, Mol. Biol. Evol., 2004, vol. 21 (pg. 1455-1458)
Posada D., Crandall K. ModelTest: Testing the model of DNA substitution, Bioinformatics, 1998, vol. 14 (pg. 817-818)
Pupko T., Huchon D., Cao Y., Okada N., Hasegawa M. Combining multiple data sets in a likelihood analysis: Which models are the best? Mol. Biol. Evol., 2002, vol. 19 (pg. 2294-2307)
Radding C. Strand transfer in homologous genetic recombination, Annu. Rev. Genet., 1982, vol. 16 (pg. 405-437)
Rokas A., Williams B., King N., Carroll S. Genome-scale approaches to resolving incongruence in molecular phylogenies, Nature, 2003, vol. 425 (pg. 798-804)
Sanderson M., Purvis A., Henze C. Phylogenetic supertrees: Assembling the trees of life, Trends Ecol. Evol., 1998, vol. 13 (pg. 105-109)
Siddall M. E., Kluge A. G. Probabilism and phylogenetic inference, Cladistics, 1997, vol. 13 (pg. 313-336)
Slowinski J., Lawson R. Snake phylogeny: Evidence from nuclear and mitochondrial genes, Mol. Phylogenet. Evol., 2002, vol. 24 (pg. 194-202)
Stranger-Hall K., Cunningham C. Support for a monophyletic Lemuriformes: Overcoming incongruence between data partitions, Mol. Biol. Evol., 1998, vol. 15 (pg. 1572-1577)
Sullivan J. Combining data with different distributions of among-site rate variation, Syst. Biol., 1996, vol. 45 (pg. 375-380)
Swofford D. PAUP*. Phylogenetic analysis using parsimony (*and other methods). Version 4, 1998, Sunderland, Massachusetts, Sinauer Associates
Wheeler W., Whiting M., Wheeler Q., Carpenter J. The phylogeny of the extant hexapod orders, Cladistics, 2001, vol. 17 (pg. 113-169)
Whiting M., Bradler S., Maxwell T. Loss and recovery of wings in stick insects, Nature, 2003, vol. 421 (pg. 264-267)