Abstract

The resolution of four controversial topics in phylogenetic experimental design hinges upon the informativeness of characters about the historical relationships among taxa. These controversies regard the power of different classes of phylogenetic character, the relative utility of increased taxonomic versus character sampling, the differentiation between lack of phylogenetic signal and a historical rapid radiation, and the design of taxonomically broad phylogenetic studies optimized by taxonomically sparse genome-scale data. Quantification of the informativeness of characters for resolution of phylogenetic hypotheses during specified historical epochs is key to the resolution of these controversies. Here, such a measure of phylogenetic informativeness is formulated. The optimal rate of evolution of a character to resolve a dated four-taxon polytomy is derived. By scaling the asymptotic informativeness of a character evolving at a nonoptimal rate by the derived asymptotic optimum, and by normalizing so that net phylogenetic informativeness is equivalent for all rates when integrated across all of history, an informativeness profile across history is derived. Calculation of the informativeness per base pair allows estimation of the cost-effectiveness of character sampling. Calculation of the informativeness per million years allows comparison across historical radiations of the utility of a gene for the inference of rapid adaptive radiation. The theory is applied to profile the phylogenetic informativeness of the genes BRCA1, RAG1, GHR, and c-myc from a muroid rodent sequence data set. Bounded integrations of the phylogenetic profile of these genes over four epochs comprising the diversifications of the muroid rodents, the mammals, the lobe-limbed vertebrates, and the early metazoans demonstrate the differential power of these genes to resolve the branching order among ancestral lineages. 
This measure of phylogenetic informativeness yields a new kind of information for evaluation of phylogenetic experiments. It conveys the utility of the addition of characters to a phylogenetic study and it provides a basis for deciding whether appropriate phylogenetic power has been applied to a polytomy that is proposed to be a rapid radiation. Moreover, it provides a quantitative measure of the capacity of a gene to resolve soft polytomies.

Phylogenetic analyses seek to reveal the evolutionary relationships of taxa by comparing their characters. Four long-standing phylogenetic controversies hinge upon the informativeness of characters. These debates are: Which types of characters are most informative (Collins et al., 2005; Dequeiroz and Wimberger, 1993; Graybeal, 1994; Naylor and Brown, 1997; Rokas and Holland, 2000; Wiens and Servedio, 1997; Yang, 1998; Zwickl and Hillis, 2002)? Would increased taxonomic or character sampling be more informative (Graybeal, 1998; Hillis, 1998; Kim, 1996, 1998; Poe, 1998; Pollock et al., 2002; Rannala et al., 1998; Rokas and Carroll, 2005; Rosenberg and Kumar, 2001, 2003; Sullivan et al., 1999)? Can we accurately identify historical polytomies that are attributable to rapid radiations (Berbee et al., 2000; Berbee and Taylor, 2001; Nee, 2001; Poe and Chubb, 2004; Ree, 2005; Rokas et al., 2003a, 2005)? What is the optimal procedure for using genome-scale sequence data to empower taxonomically broad phylogenetic studies (Goldman, 1998; Rokas et al., 2003b; Shpak and Churchill, 2000)? One gap in knowledge that perpetuates these debates is the lack of a theory predicting the phylogenetic power of characters for explicit historical epochs. Here it is demonstrated that the informativeness of a character can be quantified over a historical time scale. This formulation may play a role in resolving these controversies.

The phylogenetic informativeness of characters has long been debated. Partisans have variously argued for the utility of morphological, DNA sequence, amino acid sequence, and recently for rare genomic characters (Rokas and Holland, 2000). One school of thought deems useful only those characters that change in state in ways that are unique, irreversible, and indisputable. However, such irreversible states are rare and seldom indisputable. Moreover, disregarding characters that occupy recurrent states dispenses with potentially useful information, including most molecular sequence data. A variety of measures have been proposed to empirically characterize the phylogenetic informativeness of classes of data (Collins et al., 2005; Dequeiroz and Wimberger, 1993; Graybeal, 1994; Naylor and Brown, 1997, 1998; Wiens and Servedio, 1997). For instance, the amount of signal present may be assessed by the skewness of the tree length distribution (Huelsenbeck, 1991a), the consistency index (CI; Farris, 1989), and various other measures. However, genes are known to differ in their informativeness over historical time (Graybeal, 1994). Whole-tree measures fail to reflect the heterogeneity of information across different parts of the tree. Support for individual branches by a given sequence may be characterized with bootstrap values (Felsenstein, 1985), Bremer support (Baker and DeSalle, 1997), or Bayesian posterior probabilities (Huelsenbeck and Ronquist, 2001). These measures are vital to analysis of the validity of an inferred phylogeny. However, their value is critically dependent on the unknown actual branch length(s), and therefore they are ambiguous indicators of phylogenetic informativeness. Graybeal (1994) pioneered the use of empirical saturation plots to evaluate the utility of genes for vertebrate phylogeny. 
Empirical saturation plots convey a qualitative sense of temporal utility and feature variable rates of evolution of characters by plotting cumulative taxon-taxon divergence against time. However, theoretical attempts to clarify phylogenetic power have not resulted in explicit quantitative procedures for judicious experimental design (Mossel and Steel, 2004; Shpak and Churchill, 2000) or have been forsaken (Goldman, 1998) due to the intricacy of practical implementation. Molecular phylogenetic studies have instead generally relied on imprecise heuristics for choosing gene sequences to survey among relevant taxa for a given phylogenetic hypothesis. Conventional wisdom recognizes that it is important to select a gene that evolves at an appropriate pace to resolve the unknown ancestral branching order linking particular taxa of interest. Particular genes have become renowned for their perceived utility in resolving ancient (e.g., rDNA, elongation factors) and recent (e.g., cytochrome b, albumin) divergences. Ideally, this general perception could be captured and enhanced by quantitative measures of the utility of specified genes.

Application of such a measure could play a role in the resolution of the longstanding debate regarding the relative utilities of increasing taxonomic versus character sampling in phylogenetic experimental design (Berbee et al., 2000; Graybeal, 1998; Hillis, 1998; Kim, 1996, 1998; Poe, 1998, 2003; Pollock et al., 2002; Rannala et al., 1998; Rokas and Carroll, 2005; Rosenberg and Kumar, 2001, 2003; Zwickl and Hillis, 2002). In this debate it has been demonstrated that the informativeness of increasing taxonomic sampling is critically dependent on the chronology of ancestral linkages of the historical lineages of the taxa added to the data set (Fiala and Sokal, 1985; Huelsenbeck, 1991b; Kim, 1996, 1998; Poe, 2003). However, quantitative procedures for selection of characters that exhibit appropriate rates of evolution to resolve soft polytomies have not been explored. Clearly, combining acquisition of character data for new taxa that branch close to the time of a specified polytomy with acquisition of new characters that are most informative about that time period will yield the greatest phylogenetic resolution. To design such an ideal experiment, it is necessary to identify characters that contribute optimally to inferential power.

Debate within phylogenetic communities has frequently erupted with regard to putative examples of rapid radiations. Debates occur not only because of their biological importance for understanding the causes of evolutionary diversification but also because rapid radiations can be difficult to infer using current phylogenetic methods (Berbee et al., 2000; Poe and Chubb, 2004; Rokas et al., 2005; Slowinski, 2001; Walsh et al., 1999; Weisrock et al., 2005). Rapid radiations are characterized by short internodes that feature few or no synapomorphies. The inferential difficulty arises because measures of phylogenetic support, such as bootstrap values (Felsenstein, 1985), tree-length distribution skewness (Hillis and Huelsenbeck, 1992; Huelsenbeck, 1991a), Bremer support (Baker and DeSalle, 1997), or posterior probabilities (Huelsenbeck and Ronquist, 2001) only convey the degree to which data support a particular clade or tree. They do not convey the power of the characters examined to have revealed any true internodes (regardless of actual branch length) that define clades during a specific epoch. Thus, current methodologies would be enhanced by a measure of the degree to which the selected characters are sufficiently informative to justify a conclusion of rapid radiation.

Finally, the recent sequencing of multiple whole genomes within major branches of the tree of life has occasioned speculation regarding the best way to employ such genome-wide data sets to inform molecular phylogenetic studies that encompass much broader taxonomic sampling (Dacks and Doolittle, 2001; Delsuc et al., 2005). How can molecular phylogeneticists working with large sets of taxa exploit the breadth of information in genome sequence to improve their chances of conclusively addressing particular phylogenetic hypotheses? With profiles of the phylogenetic informativeness of particular genes during particular epochs, orthologous sequences from genome projects could be used to provide data on the rate of evolution of the sites in many genes. Profiles of phylogenetic informativeness could then be calculated from these data to identify the most informative genes. Here, such a method of profiling informativeness is presented.

Theory

The Optimal Rate of Change of a Phylogenetic Character

Consider a star phylogeny in which four taxa had a common ancestor at time T (Fig. 1a and Fig. 1b). When parsimony is used to select an optimal tree, only a character that changes along an internode between two sister clades (Fig. 1c and Fig. 1d, segments t1 + t2) will be informative about the actual branching order underlying the polytomy. Additionally, an informative character that changes during the ancestral internode must thereafter remain unchanged during the subsequent evolution of the four taxa. The longer the tips, and the shorter the internal branch, the less likely it is that such an informative character will be discovered. Both rapid and sluggish rates of change can make characters unfavorable for phylogenetic reconstruction. Characters that evolve too slowly will have negligible probability of change on the short internal branch; characters that evolve too quickly will nearly always change on one or more of the long tips.

Figure 1

Relevant time parameters for the resolution of polytomies. (a) Unrooted polytomy at time T. (b) Rooted polytomy at time T. (c) Four-taxon tree with an internode comprising components t1 and t2. (d) Rooted tree corresponding to the tree depicted in panel c.

Informativeness is maximized at an intermediate rate that optimizes the joint probability of change on the short internal branch and lack of change on the long tips. Assuming that evolutionary changes of a character state are randomly distributed at rate λ across the lineages descending from the common ancestor, the probability P of a random variable X equaling k changes of state on any internode of time length b may be calculated via the Poisson distribution:  

$$P(X = k) = \frac{e^{-\lambda b} (\lambda b)^{k}}{k!}$$

The probability that at least one change occurs on the short internal branch is  

$$P(X \geq 1) = 1 - e^{-\lambda (t_1 + t_2)} \tag{1}$$

The probability that the character would subsequently remain unchanged in the four tips is  

$$\left(e^{-\lambda T}\right)^{4} = e^{-4\lambda T} \tag{2}$$

For simplicity, let t0 = t1 + t2. Then, the probability that a character as described would be informative, π (T, t0; λ), is the product of probabilities expressed in Equation 1 and Equation 2,  

$$\pi(T, t_0; \lambda) = \left(1 - e^{-\lambda t_0}\right) e^{-4\lambda T} \tag{3}$$

The optimal rate, λ̂, maximizes this function, and is revealed by solving

$$\frac{\partial \pi(T, t_0; \lambda)}{\partial \lambda} = 0 \tag{4}$$

Further algebra yields  

$$\hat{\lambda} = \frac{1}{t_0} \ln\left(1 + \frac{t_0}{4T}\right) \tag{5}$$

However, for any polytomy we wish to resolve, t0 is unknown. Nevertheless, it is frequently known that t0 is very small compared to T. Therefore, assuming t0 ≪ T, we may take the limit of Equation 5 as t0 approaches zero,

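$$\hat{\lambda}_T = \lim_{t_0 \to 0} \frac{1}{t_0} \ln\left(1 + \frac{t_0}{4T}\right) = \frac{1}{4T} \tag{6}$$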

Thus, the character that evolves at the optimal rate of character change for resolution of a four-taxon polytomy dated at time T in the past is the one that evolves at a rate of one change along the sum of the lengths of the four branch tips subsequent to the polytomy at time T.
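As a numerical check on this derivation, the following minimal Python sketch evaluates the probability of informativeness, (1 − e^(−λ t0)) e^(−4λT), together with the optimal rate (1/t0) ln(1 + t0/(4T)) and its limit 1/(4T). The function names and the example values of T and t0 are illustrative assumptions and are not taken from the original analysis.

```python
import math


def prob_informative(rate, T, t0):
    """Equation 3: at least one change on the internode (length t0)
    and no change on any of the four tips (each of length T)."""
    return (1.0 - math.exp(-rate * t0)) * math.exp(-4.0 * rate * T)


def optimal_rate(T, t0):
    """Equation 5: the rate that maximizes Equation 3 for given T and t0."""
    return math.log1p(t0 / (4.0 * T)) / t0


def asymptotic_optimal_rate(T):
    """Equation 6: the limit of Equation 5 as t0 approaches zero."""
    return 1.0 / (4.0 * T)


if __name__ == "__main__":
    T, t0 = 20.0, 0.5  # hypothetical polytomy age and internode length
    best = optimal_rate(T, t0)
    print("optimal rate (Eq. 5):   ", best)
    print("asymptotic rate (Eq. 6):", asymptotic_optimal_rate(T))
    # Rates on either side of the optimum are less likely to be informative.
    for factor in (0.5, 1.0, 2.0):
        print(factor, prob_informative(factor * best, T, t0))
```

With a nonzero internode the optimum from Equation 5 lies slightly below 1/(4T), and it converges on 1/(4T) as t0 shrinks.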

The Phylogenetic Informativeness Profile of a Character

The rate of change for a character that maximizes informativeness is fundamental to phylogenetic theory. Examining five archetypal four-taxon trees with nonzero internodes, Yang (1998) used computer simulations to reveal the utility of an “intermediate” evolutionary rate that is in rough agreement with Equation 6. This intermediate, optimal rate would be expected to be higher for Yang's trees with nonzero internodes than is predicted by Equation 6, and increasingly so as the ratio of internode length to tip length increases. Consistent with this expectation, the optimal rates lay either at the rate predicted by Equation 6 or slightly higher than that asymptotic prediction.

Yet in phylogenetic practice, one will never discover a set of characters that all evolve at the same rate, let alone a set of characters that all evolve at the optimal rate (Felsenstein, 2001; Yang, 1996). Thus, it is necessary to establish the relative informativeness of characters that evolve at rates that are not optimal. Clearly, characters that evolve at rates close to the optimal rate will be more useful in resolving a polytomy than those that evolve at a dramatically different rate. Here the functional form of the relationship between the optimal and all suboptimal rates is established.

The probability of informativeness of a character evolving at rate λ is given by Equation 3. However, the value of the key parameter t0 is unknown, and as t0 asymptotically approaches zero, the probability of informativeness (Equation 3) aptly approaches zero as well. To profile the phylogenetic informativeness of a character evolving at rate λ, we must derive an index of the informativeness that, in contrast, approaches a nonzero limit as the length of the internode t0 approaches zero. Such an index may be derived from Equation 3 by taking the ratio of the informativeness of a character evolving at the rate λ to the informativeness of a character evolving at the ideal rate λ̂_T,

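$$\frac{\pi(T, t_0; \lambda)}{\pi(T, t_0; \hat{\lambda}_T)} = \frac{\left(1 - e^{-\lambda t_0}\right) e^{-4\lambda T}}{\left(1 - e^{-\hat{\lambda}_T t_0}\right) e^{-4\hat{\lambda}_T T}} \tag{7}$$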

With t0 ≪ T, as assumed above, we may take

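$$\rho_0(T; \lambda) = \lim_{t_0 \to 0} \frac{\pi(T, t_0; \lambda)}{\pi(T, t_0; \hat{\lambda}_T)} = 4\lambda T e^{1 - 4\lambda T} \tag{8}$$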

The function ρ0(T;λ) ranges from zero to a maximum of one for all real values of λ and T greater than zero. If and only if λ = λ̂_T, ρ0 = 1. Thus, as expected, Equation 8 is maximized at the optimal rate of character change.

However, integration of the right-hand side of Equation 8 over T from zero to infinity yields e/(4λ), a result that attributes a greater net informativeness to a character that evolves at a slower rate (smaller λ). In contrast, characters should supply equivalent net information at every rate when integrated over all of time. Thus, a normalized profile is generated by obtaining ρ(T;λ) such that ∫_0^∞ ρ(T;λ) dT = 1. Such a function is readily computed as

$$\rho(T; \lambda) = 16 \lambda^{2} T e^{-4\lambda T} \tag{9}$$
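The normalization can be checked numerically. The sketch below assumes the normalized profile takes the form ρ(T;λ) = 16λ²T e^(−4λT), which is the form implied by dividing 4λT e^(1−4λT) by its integral e/(4λ). The helper names, the trapezoid integrator, and the example rates are illustrative assumptions.

```python
import math


def profile(T, rate):
    """Normalized informativeness of one character: 16 * rate^2 * T * exp(-4 * rate * T)."""
    return 16.0 * rate ** 2 * T * math.exp(-4.0 * rate * T)


def trapezoid(f, a, b, steps=100_000):
    """Plain trapezoid rule, used here only to verify the normalization."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


if __name__ == "__main__":
    for rate in (0.01, 0.1):  # hypothetical rates, in changes per unit time
        upper = 50.0 / rate   # far enough out that the remaining tail is negligible
        area = trapezoid(lambda T: profile(T, rate), 0.0, upper)
        peak_time = 1.0 / (4.0 * rate)
        print(rate, area, peak_time, profile(peak_time, rate))
```

Under these assumptions the area is one for every rate, and the profile of a character with rate λ peaks at T = 1/(4λ), where its height is 4λ/e.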

A Profile of the Phylogenetic Informativeness of a Set of Characters

Here, Equation 9 is generalized to profile the informativeness of n characters to resolve polytomies at sequential depths of a phylogenetic tree. Denoting the rate of change of each character as λ1, …, λn, the phylogenetic informativeness profile can then be written as

$$\rho(T; \lambda_1, \ldots, \lambda_n) = \sum_{i=1}^{n} 16 \lambda_i^{2} T e^{-4\lambda_i T} \tag{10}$$

The informativeness of a particular data set at a continuum of depths of a phylogenetic tree (Fig. 2a) may be conveyed by a plot of Equation 10. Figure 2b shows such a phylogenetic informativeness profile for a set of characters each evolving near the optimal rate to resolve the obscure branchings within the more recent of the two depicted polytomies. Note that Equation 10 is uninformative as to whether there is sufficient data to resolve a particular node, as that depends critically upon the unknown length of the internode t0. Rather, Equation 10 provides the degree to which a set of characters will be informative in comparison to another character set for which it may also be evaluated. For instance, the data set resulting in the phylogenetic informativeness profile plotted in Figure 2b would have been a fairly uninformative choice for resolving the ancient polytomy depicted in Figure 2a, because the characters evolve at a rate too likely to result in change along tips, obscuring signal that might have arisen within the time comprising the ancient polytomy.

Figure 2

Depiction of phylogenetic informativeness. (a) Example of a phylogeny with a recent and an ancient polytomy. (b) Informativeness profile of a set of five characters, each evolving near the optimal rate for resolution of the recent polytomy in panel a. (c) Informativeness profile of a set of five characters, one of which is evolving at five times the rate of the characters underlying the profile in panel b, and four of which are evolving at a rate four times as slow as the characters underlying the profile in panel b. Integrated over an infinite history, the areas under the curves in panels b and c are equal.

A different character set underlies the informativeness profile depicted in Figure 2c, composed of the same number of characters as in Figure 2b, and evolving at about the same average rate. However, in this new set, one fifth of the characters are evolving at a rate fivefold faster than, and four fifths are evolving fourfold slower than, the characters that underlie the profile in Figure 2b. Such a bimodal distribution of rates could correspond to synonymous and replacement sites in the DNA sequence of a functional gene. In this scenario, the more slowly evolving replacement sites yield some power for the resolution of the deep polytomy. The more rapidly evolving synonymous sites, however, evolve too fast for accurate resolution of the relatively recent polytomy. Thus, the set of characters underlying the profile in Figure 2c would be a poor choice for the resolution of obscure branching events within the more recent polytomy of Figure 2a. Although the average rate of evolution of the two character sets is approximately equal, their phylogenetic informativeness profiles are radically different.
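The contrast between the two character sets can be reproduced in a few lines. This is a sketch under stated assumptions: the per-character profile is taken to be 16λ²T e^(−4λT), summed over characters, and the ages of the recent and ancient polytomies (10 and 100 time units) are hypothetical values chosen only for illustration.

```python
import math


def profile(T, rate):
    """Normalized informativeness of one character at time T."""
    return 16.0 * rate ** 2 * T * math.exp(-4.0 * rate * T)


def set_profile(T, rates):
    """Informativeness of a character set: the sum of its single-character profiles."""
    return sum(profile(T, r) for r in rates)


T_RECENT, T_ANCIENT = 10.0, 100.0          # hypothetical polytomy ages
BASE = 1.0 / (4.0 * T_RECENT)              # optimal rate for the recent polytomy
SET_B = [BASE] * 5                         # five characters at the optimal rate
SET_C = [5.0 * BASE] + [BASE / 4.0] * 4    # one fivefold faster, four fourfold slower

if __name__ == "__main__":
    print("mean rate ratio (C / B):", sum(SET_C) / sum(SET_B))
    for label, T in (("recent", T_RECENT), ("ancient", T_ANCIENT)):
        print(label, set_profile(T, SET_B), set_profile(T, SET_C))
```

In this sketch the mean rates differ by only a factor of 1.2, yet the uniform set is several times more informative at the recent polytomy, while the bimodal set is far more informative at the ancient one.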

This differential phylogenetic informativeness of character sets among historical epochs can be evaluated quantitatively by integrating Equation 10 over the time period of interest. Specifying that period by its commencement, h1, and its terminus, h2, calculations of  

$$\int_{h_1}^{h_2} \sum_{i=1}^{n} 16 \lambda_i^{2} T e^{-4\lambda_i T} \, dT \tag{11}$$
yield measures of the relative utility of character sets for resolving ancestral branching order within that epoch. Assigning h1 and h2 so as to encompass all branching points of a phylogeny provides a summary of the relative informativeness of the character sets to resolve the whole phylogeny. Assigning h1 and h2 so that they encompass one polytomy or a subset of sequential weakly supported branches provides a more focused appraisal. To establish character sets that will be most informative for compound hypotheses relating to more than one epoch, integrals over multiple epochs of interest may be calculated and either jointly considered or summed to create a single index of informativeness.
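The bounded integral has a closed form, because the antiderivative of 16λ²T e^(−4λT) with respect to T is −(4λT + 1) e^(−4λT). The sketch below uses that antiderivative; the function name, the example rates, and the epoch bounds are hypothetical.

```python
import math


def epoch_informativeness(rates, h1, h2):
    """Integral of the summed profile from h1 to h2, with h1 <= h2.

    Uses the antiderivative -(4 * rate * T + 1) * exp(-4 * rate * T)
    of 16 * rate^2 * T * exp(-4 * rate * T).
    """
    total = 0.0
    for rate in rates:
        lower = (4.0 * rate * h1 + 1.0) * math.exp(-4.0 * rate * h1)
        upper = (4.0 * rate * h2 + 1.0) * math.exp(-4.0 * rate * h2)
        total += lower - upper
    return total


if __name__ == "__main__":
    rates = [0.002, 0.01, 0.05]  # hypothetical per-site rates
    print("epoch 7 to 26:   ", epoch_informativeness(rates, 7.0, 26.0))
    print("epoch 65 to 107: ", epoch_informativeness(rates, 65.0, 107.0))
    # Summing integrals over two epochs gives a single compound index.
    print("combined index:  ",
          epoch_informativeness(rates, 7.0, 26.0)
          + epoch_informativeness(rates, 65.0, 107.0))
```

Because each character contributes a total of one when integrated over all of history, the value returned for any epoch can be read as the share of the character set's total informativeness that falls within that epoch, summed over characters.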

Example: Profiling the Phylogenetic Informativeness of Genes

To briefly illustrate the theory developed here, I apply it to molecular data to generate profiles of the phylogenetic informativeness of four genes characterized by a DNA sequence data set. Alignments of the DNA sequences of genes c-myc, BRCA1, GHR, and RAG1 were extracted from the data set of Steppan et al. (2004) on the phylogeny of muroid rodents. Taxon sampling for this data set was sufficiently large (Pollock and Bruno, 2000; Sullivan et al., 1999) for rates of evolution of the sites to be estimated using the maximum likelihood program DNARates (by Gary Olsen) on the fossil-calibrated global clock–enforced phylogenetic trees of Steppan et al. (2004). Despite being carefully selected by experts for the purpose of resolving the muroid rodent phylogeny, the nucleotide sequences of these four genes demonstrate differential power for the resolution of ancestral branching order. Sequence of BRCA1 is predicted to be the most likely to be informative, followed by RAG1, GHR, and lastly, c-myc (Table 1).

Table 1.

The power of genes to resolve rapid radiations.

                      Muroid rodents^a    Mammals^b           Vertebrates^c       Metazoa^d
Gene    Length (bp)   per bp    net       per bp    net       per bp    net       per bp    net
RAG1    3023          0.055     166.1     0.102     308.3     0.029     87.7      0.028     84.6
BRCA1   1697          0.170     288.4     0.174     295.3     0.008     13.6      0.005     8.4
GHR     916           0.071     65.2      0.116     106.3     0.026     23.8      0.023     21.1
c-myc   564           0.045     25.4      0.090     50.7      0.032     18.0      0.031     17.5
a. Informativeness profiles for muroid rodents were integrated from 7 to 26 Mya.
b. 65–107 Mya.
c. 375–405 Mya.
d. 550–600 Mya.

This differential power is illustrated by phylogenetic informativeness profiles over the history encompassed by the inferred phylogeny of Steppan et al. (2004). The informativeness profiles of the four genes are graphed above the phylogeny in Figure 3. The gene BRCA1 has the greatest informativeness during the epoch of interest, followed by RAG1, GHR, and c-myc. Compared to RAG1, BRCA1 has nearly twice the informativeness over the region of interest and yet is composed of about half the number of nucleotides. This remarkable difference in power is possible because the sites of BRCA1 evolve at nearly uniform rates compared to most genes (Fig. 4; Adkins et al., 2001; Delsuc et al., 2002), including RAG1. The rates of substitution of the majority of nucleotides in GHR and c-myc appear to be slower as well as more diverse.

Figure 3

Phylogenetic informativeness profiles from the muroid rodent phylogeny. (a) Phylogenetic informativeness profiles through 50 million years ago for the genes BRCA1 (blue, 1697 bp), RAG1 (red, 3023 bp), GHR (green, 916 bp), and c-myc (yellow, 564 bp). The sum of the instantaneous asymptotic informativeness of all sites in each gene is graphed. (b) Phylogeny of selected muroid rodents on the same time scale as in panel a. DNA sequence, sequence alignment, and phylogeny are from Steppan et al. (2004). Taxa are denoted by genus name. (c) Relative phylogenetic informativeness of the four genes over the same time period.

Figure 4

Phylogenetic informativeness profile for nucleotides in the first (green), second (blue), and third (red) codon positions of the genes (a) c-myc, (b) GHR, and (c) BRCA1. The average informativeness per site in each position is graphed. The profile for RAG1 (not shown) appears intermediate between the profiles for c-myc and GHR. (d) The phylogeny of muroid rodents from Figure 3, as a benchmark for the time scale of panels a to c.

Despite its highly conserved sequence, the informativeness profile of the relatively slowly evolving gene c-myc still indicates that it has some power to resolve recent divergences. The solution to this paradox is clear: variation of rates among sites. This variation is most readily observed in aggregate by comparing codon positions within the gene. Sites residing at the third position in each codon (that can usually withstand substitutions without changing the amino acid sequence of the protein) yield phylogenetic informativeness for recent divergences, while first and second sites yield only phylogenetic informativeness for very ancient divergences (Fig. 4).

The net informativeness of the four genes during three key ancient radiations is tallied in Table 1. Calculation of the informativeness on a per base pair basis (Table 1, Table 2) allows estimation of the cost-effectiveness of character sampling across genes. Calculation of the informativeness on a per million years basis (Table 2) allows comparison of the utility of a gene for the inference of phylogenetic relationships across historical radiations. Regardless of the unit used, the rank order of genes by informativeness varies over history. For example, BRCA1 is uniformly the most powerful gene of these four for resolving the Muroid rodents (cf. Adkins et al., 2001) and rivals RAG1 in utility for the mammalian radiation (cf. Scally et al., 2002). Yet, it is uniformly the least powerful gene for resolving the early metazoa.
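The two normalizations are simple ratios, sketched below with the net values and gene lengths reported in Table 1 for RAG1 and BRCA1 and the integration bounds given in its footnotes. Dividing net informativeness by the epoch duration is an assumption about how the per-million-year values were obtained, although it appears consistent with the values in Table 2; the function and variable names are illustrative.

```python
# Epoch bounds in Mya, from the footnotes of Table 1.
EPOCHS_MYA = {
    "muroid rodents": (7.0, 26.0),
    "mammals": (65.0, 107.0),
}

# Gene lengths and net informativeness, copied from Table 1.
GENES = {
    "RAG1": {"length_bp": 3023, "net": {"muroid rodents": 166.1, "mammals": 308.3}},
    "BRCA1": {"length_bp": 1697, "net": {"muroid rodents": 288.4, "mammals": 295.3}},
}


def per_bp(net, length_bp):
    """Informativeness per base pair: a measure of cost-effectiveness."""
    return net / length_bp


def per_myr(net, epoch):
    """Informativeness per million years (assumed: net divided by epoch duration)."""
    start, end = epoch
    return net / (end - start)


if __name__ == "__main__":
    for gene, info in GENES.items():
        for epoch_name, bounds in EPOCHS_MYA.items():
            net = info["net"][epoch_name]
            print(gene, epoch_name,
                  round(per_bp(net, info["length_bp"]), 3),
                  round(per_myr(net, bounds), 1))
```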

Table 2.

The power to resolve phylogeny per million years.

                      Muroid rodents      Mammals             Vertebrates         Metazoa
Gene    Length (bp)   per bp    net       per bp    net       per bp    net       per bp    net
RAG1    3023          0.0029    8.7       0.0024    7.3       0.00097   2.92      0.00056   1.69
BRCA1   1697          0.0089    15.2      0.0041    7.0       0.00027   0.45      0.00010   0.17
GHR     916           0.0037    3.4       0.0028    2.5       0.00087   0.79      0.00046   0.42
c-myc   564           0.0024    1.3       0.0021    1.2       0.00107   0.60      0.00062   0.35

From this analysis it can be predicted that removal of BRCA1 would have the most adverse effect upon the bootstrap support of the inferred phylogeny of the Muroid rodents. Another predicted consequence of removing BRCA1 would be to disproportionally reduce the bootstrap support of more recent nodes in the phylogeny, where its phylogenetic informativeness is particularly large compared to other genes. Removal of RAG1, with more numerous but generally slower evolving sites would, in contrast, diminish bootstrap support deeper in the phylogeny. These predicted consequences are in fact demonstrated in the analysis of Steppan et al. (2004).

Discussion

Here, I used an asymptotic instance of the four-taxon case to derive the evolutionary rate at which a character would be optimally phylogenetically informative for a given historical time. This result was then extended to formulate a chronological measure of the phylogenetic informativeness corresponding to the rate of evolution of a character. Lastly, informativeness profiles of individual characters were summed to create a profile of the informativeness of a set of characters. An example case, profiling the informativeness of four genes for resolving the muroid rodent phylogeny, demonstrated the application of the method and validated its predictions. The theory provides quantitative characterization of the potential inferential power of character sets for resolving soft polytomies.

The four-taxon case used to derive the analytical results presents a tractable and versatile framework for theoretical study of phylogeny. One concern, however, may be the applicability of theory from the four-taxon case to phylogenies with larger numbers of taxa. In this regard, two intuitive extremes may be noted. With uniformly dense branching and sampling over all epochs, it is possible that faster rates of evolution may contribute to a greater degree to phylogenetic inference. Dense and deep sampling may subdivide tips such that it becomes less probable that rapid evolution would completely obscure ancient signal arising at a deep short internode (Poe, 2003). In contrast, when many taxa are sampled that all have extremely short internodes within a brief epoch of interest, as is the case in a rapid radiation, it seems unlikely that faster or slower rates than that predicted here for the four-taxon case would be optimal. Further work will be required to establish the interaction between taxon sampling and the optimal rate for inference. However, the four-taxon case has a sterling record of theoretical utility (Felsenstein, 1978; Gaut and Lewis, 1995; Huelsenbeck and Hillis, 1993) for revealing optimal phylogenetic methodology for larger data sets, due not only to its analytical and computational tractability but also because results based on analysis of the four-taxon case may be readily extrapolated to trees of more taxa (Cummings et al., 2003).

When profiling phylogenetic informativeness to select character sets to assay for phylogenetic analyses, several other points should be kept in mind. First, the informativeness profile conveys the historical epochs during which a character or set of characters are most likely to provide parsimony-informative phylogenetic signal but does not account for the misleading effects of noise caused by convergence to the same character state in divergent lineages (Collins et al., 2005; Felsenstein, 1978). Such convergence will occur more in faster evolving sites than in slower evolving sites (Grundy and Naylor, 1999). Thus, all else being equal, designers of phylogenetic experiments may prefer to select character sets with phylogenetic informativeness profiles that peak very slightly prior to, rather than subsequent to, the epoch of interest. This choice should minimize selection of characters that may have too frequently evolved to convergent states.

However, the effect of convergence should be negligible when character sets evolve at close to the optimal rate. At the optimal rate, multiple changes of character state will be rare—fewer than 3% of characters will have more than one change in a branch of length T. Consequent convergent characters will be randomly dispersed among taxa and should not be significantly misleading (Wenzel and Siddall, 1999). To the extent that lineages vary in rate of character change, those lineages whose characters change state more rapidly will tend to evolve a greater proportion of convergent states and thus may be positively misleading to phylogenetic analysis (Felsenstein, 1978), producing the phenomenon frequently termed “long branch attraction.” The degree (but not the nature) of this misleading effect depends upon the number of states that a character may adopt. The greater the number of states that are accessible to the character, the lower the potentially misleading noise arising from rapidly evolving characters will be (Mossel and Steel, 2004; Steel and Penny, 2000). Specification of the effect of the evolutionary state space available to the character upon estimates of the phylogenetic utility for particular epochs remains to be performed.
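The figure of fewer than 3% follows from the Poisson distribution: at the optimal rate λ = 1/(4T), the expected number of changes on a branch of length T is 1/4. A short sketch of that arithmetic follows (the function name is illustrative).

```python
import math


def prob_multiple_changes(mean_changes):
    """Poisson probability of two or more changes: 1 - P(0) - P(1)."""
    return 1.0 - math.exp(-mean_changes) * (1.0 + mean_changes)


if __name__ == "__main__":
    # At the optimal rate 1/(4T), a branch of length T expects 1/4 of a change.
    print(prob_multiple_changes(0.25))
```

The result is approximately 0.0265, in agreement with the bound quoted above.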

A difficulty for phylogenetic analysis that is closely related to convergence and long-branch attraction has been the accommodation of genome-wide shifts in substitution rate. Evidence has demonstrated that some clades have experienced elevated or reduced rates of substitution compared with their sister clades. For example, rodents are known to have an elevated rate of substitution compared with most other mammals (Li et al., 1996; Weinreich, 2001). Thus, to reveal the informativeness of genes previously examined within rodents for the phylogenetic analysis of vertebrates or mammals, the time axis of the phylogenetic informativeness profile derived from rodent sequence evolution may require scaling by the ratio of the substitution rate within the rodent clade to the substitution rate among the nonrodent lineages. Provided that the shape of the site rate distribution remains constant across clades, this procedure may produce an appropriate phylogenetic informativeness profile of those character sets for the new experimental clade.
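
A minimal sketch of this rescaling follows. It assumes that the per-site profile has the form 16λ²T·exp(-4λT) from the earlier derivation and that a clade-wide rate shift divides every site rate by the same ratio; the rates, the ratio, and all function names are hypothetical and chosen only for illustration.

```python
import math


def informativeness(t, site_rates):
    """Summed per-site profile at time t, assuming the per-site form
    16 * r**2 * t * exp(-4 * r * t)."""
    return sum(16.0 * r * r * t * math.exp(-4.0 * r * t) for r in site_rates)


def rescale_rates(site_rates, rate_ratio):
    """Divide every site rate by rate_ratio (source-clade rate divided by
    target-clade rate), leaving the shape of the distribution unchanged."""
    return [r / rate_ratio for r in site_rates]


def peak_time(site_rates, times):
    """Grid time at which the summed profile is largest."""
    return max(times, key=lambda t: informativeness(t, site_rates))


# Hypothetical values, for illustration only.
rodent_rates = [0.002, 0.005, 0.01, 0.02]    # substitutions per site per Myr
rate_ratio = 2.0                              # rodents assumed twice as fast
times = [float(t) for t in range(1, 501)]     # Myr before present

nonrodent_rates = rescale_rates(rodent_rates, rate_ratio)
print(peak_time(rodent_rates, times), peak_time(nonrodent_rates, times))
```

Under these assumptions, dividing all rates by the ratio stretches the time axis by that ratio and lowers the height of the profile by the same factor, so the peak moves deeper in time while the area under the profile is unchanged.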

The shape of the site rate distribution within genes presumably remains stable when there is retention of functional constraints on the protein products (Naylor and Brown, 1997). However, a shift in the site rate distribution has been inferred for some data sets (Lockhart et al., 1998; Miyamoto and Fitch, 1995; Penny et al., 2001), most clearly in the case of the evolution of gene families after functional divergence (Wang and Gu, 2001). Nevertheless, violations of the assumption of a static site rate distribution (Susko et al., 2002) and of the assumption of stationarity of nucleotide frequency (Fedrigo et al., 2005) may be readily tested for particular data sets. Specification of phylogenetic power profiles predicted from highly parameterized evolutionary models that may ameliorate such deviations (e.g., Galtier, 2001; Gu, 2001; Whelan et al., 2001) remains to be performed.

In addition to enabling judicious choice of character sets for phylogenetic analysis, profiling phylogenetic informativeness allows evaluation of the relative utility of disparate types of characters in phylogenetic studies. Here, first, second, and third positions of codons were partitioned for four genes, and third positions were shown to provide greater inferential power for recent epochs and lesser inferential power for ancient epochs. The striking third-position effect depicted in Figure 4 is an outcome of the frequently rapid rate of evolution of unconstrained third-position sites within codons; the effect in three of the four genes examined is dramatic, despite site-to-site variation of synonymous substitution rates (Pond and Muse, 2005). Other categorizations of characters may be conceived, such as coding versus noncoding sites, or nucleotide versus amino acid characters. Detailed comparisons of the informativeness of diverse characters are currently underway and will guide selection of the most powerful data at hand for the purpose of testing phylogenetic hypotheses.
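
The partitioning step can be sketched as follows. The example assumes an in-frame coding alignment that begins at the first position of a codon and contains no frame-disrupting gaps; the function name and the toy sequences are invented for illustration and are not taken from the data set analyzed here.

```python
def partition_by_codon_position(alignment):
    """Split aligned, in-frame coding sequences into three partitions,
    one per codon position. `alignment` maps taxon name to sequence."""
    partitions = {1: {}, 2: {}, 3: {}}
    for taxon, seq in alignment.items():
        for pos in (1, 2, 3):
            # Every third site, starting at the given codon position.
            partitions[pos][taxon] = seq[pos - 1::3]
    return partitions


# Toy alignment, invented for illustration only.
toy = {"taxonA": "ATGGCCAAA", "taxonB": "ATGGCTAAG"}
parts = partition_by_codon_position(toy)
print(parts[3])
```

Site rates estimated separately for each partition could then be profiled independently, which is one way to obtain the kind of per-position comparison described above.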

Although the rate of speciation in a rapid radiation may be readily characterized for well-resolved trees (e.g., Nee, 2001, 2005), poor resolution in a phylogenetic tree presents an inferential dilemma with regard to potential rapid radiation. Consideration of the phylogenetic informativeness profile of a data set conveys new insight into whether unresolved branches (e.g., those with poor bootstrap support) are due to short branches (rapid radiation) or to poor signal in the genes used. Low support values for a branch can arise from either situation, but only sites evolving at inappropriate rates would result in a low phylogenetic informativeness profile during the epoch of interest. Low informativeness paired with poor resolution therefore calls for more data, whereas high informativeness paired with poor resolution indicates rapid radiation. As examples of rapid radiation are examined in light of the phylogenetic informativeness applied to them, it may become possible to establish a quantitative relationship among the informativeness of the data applied, the resolution achieved, and the rapidity of radiation that may be reliably inferred. Because all approaches that use characters from extant taxa to infer evolutionary history are predicated upon an assumption that the evolutionary process has left a recoverable signature of historical parameters in those characters, such a relationship among informativeness, resolution, and rapidity of radiation would ultimately provide quantitative means for evaluating the long-term feasibility of resolving the most recalcitrant ancestral relationships (Table 2).

Most importantly, profiling phylogenetic informativeness, performed after completion of a previous or preliminary investigation, informs the choice of genes for future studies. By furnishing quantitative estimates of the informativeness for specific time periods, profiling phylogenetic power enables optimal experimental design. Historically, the utility of a gene for a study has largely been decided by qualitative heuristics based upon the average rate of evolution of the gene or upon experiential impressions of the gene's utility in studies of taxa more or less divergent from the taxa of interest. With the sequencing of genomes dispersed across the tree of life, simultaneous estimation of the phylogenetic informativeness profile of many orthologous genes is possible. These estimates may be compared to optimize gene choice. It is hoped that quantitative profiling of the phylogenetic informativeness of candidate genes using genome sequence data, preliminary data, or data from previous genic studies will supplant the contentious opinions of experts with an accurate and precise methodology for choosing character sets during the experimental design phase of a phylogenetic study.
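
One way such a comparison might be carried out is sketched below. It assumes the per-site profile 16λ²T·exp(-4λT) from the earlier derivation, integrates it in closed form over an epoch, and ranks candidate genes by integrated informativeness per site. The gene names, site rates, and epoch bounds are hypothetical.

```python
import math


def integrated_informativeness(site_rates, t_start, t_end):
    """Integral of the summed profile from t_start to t_end, assuming the
    per-site form 16 * r**2 * t * exp(-4 * r * t), whose antiderivative
    is -(4 * r * t + 1) * exp(-4 * r * t)."""
    def antiderivative(r, t):
        return -(4.0 * r * t + 1.0) * math.exp(-4.0 * r * t)

    return sum(antiderivative(r, t_end) - antiderivative(r, t_start)
               for r in site_rates)


def rank_genes(gene_rates, t_start, t_end):
    """Gene names ordered from most to least informative per site
    within the epoch bounded by t_start and t_end."""
    def per_site(name):
        rates = gene_rates[name]
        return integrated_informativeness(rates, t_start, t_end) / len(rates)

    return sorted(gene_rates, key=per_site, reverse=True)


# Hypothetical per-site rates (substitutions per site per Myr).
candidates = {
    "geneX": [0.02, 0.03, 0.05],
    "geneY": [0.001, 0.002, 0.004],
}
print(rank_genes(candidates, 0.0, 25.0))      # a recent epoch
print(rank_genes(candidates, 100.0, 300.0))   # an ancient epoch
```

With these invented rates the faster gene ranks first for the recent epoch and the slower gene ranks first for the ancient one. Dividing by the number of sites is a design choice that corresponds to comparing informativeness per base pair; omitting the division would compare genes by their total informativeness instead.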

Conclusions about the relative utility of adding characters or taxa to a current phylogenetic study have subtly hinged upon the appropriateness of the rate of evolution of the added characters for resolution of the phylogeny in question. Clearly, the addition of characters evolving at optimal rates will have much greater impact upon accurate phylogenetic analysis than will the addition of characters with an inappropriate rate of evolution. Development of practical analytical predictions of the asymptotic impact of adding taxa (cf. Goldman, 1998; Huelsenbeck, 1991b; Kim, 1996, 1998; Poe, 2003) would complement computational investigations of the relative utility of these two methods of expanding acquired data (Graybeal, 1994; Pollock and Bruno, 2000; Rokas and Carroll, 2005). Synthesized with complementary elaboration of the quantitative theory presented herein, such a development could culminate in a rigorous and comprehensive theory for phylogenetic experimental design.

Acknowledgements

Thanks to Robert Friedman for bioinformatic assistance converting sequence files. Thanks also to John Taylor, Paul Lewis, Elizabeth Jockusch, Robert Friedman, Peter Gogarten, Alison Galvani, an anonymous reviewer, and associate editor Gavin Naylor for helpful comments on drafts of the manuscript.

REFERENCES

Adkins R. M., Gelke E. L., Rowe D., Honeycutt R. L. Molecular phylogeny and divergence time estimates for major rodent groups: Evidence from multiple genes. Mol. Biol. Evol., 2001, vol. 18 (pg. 777-791)

Baker R. H., DeSalle R. Multiple sources of character information and the phylogeny of Hawaiian Drosophilids. Syst. Biol., 1997, vol. 46 (pg. 654-673)

Berbee M. L., Carmean D. A., Winka K. Ribosomal DNA and resolution of branching order among the ascomycota: How many nucleotides are enough? Mol. Phylogenet. Evol., 2000, vol. 17 (pg. 337-344)

Berbee M. L., Taylor J. W. Fungal molecular evolution: Gene trees and geologic time. In: McLaughlin D. J., McLaughlin E. G., Lemke P. A., editors. The Mycota. VII. Part B. Systematics and evolution, 2001, Berlin Heidelberg, Springer-Verlag (pg. 229-245)

Collins T. M., Fedrigo O., Naylor G. J. P. Choosing the best genes for the job: The case for stationary genes in genome-scale phylogenetics. Syst. Biol., 2005, vol. 54 (pg. 493-500)

Cummings M. P., Handley S. A., Myers D. S., Reed D. L., Rokas A., Winka K. Comparing bootstrap and posterior probability values in the four-taxon case. Syst. Biol., 2003, vol. 52 (pg. 477-487)

Dacks J. B., Doolittle W. F. Reconstructing/deconstructing the earliest eukaryotes: How comparative genomics can help. Cell, 2001, vol. 107 (pg. 419-425)

Delsuc F., Brinkmann H., Philippe H. Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet., 2005, vol. 6 (pg. 361-375)

Delsuc F., Scally M., Madsen O., Stanhope M. J., de Jong W. W., Catzeflis F. M., Springer M. S., Douzery E. J. P. Molecular phylogeny of living xenarthrans and the impact of character and taxon sampling on the placental tree rooting. Mol. Biol. Evol., 2002, vol. 19 (pg. 1656-1671)

de Queiroz A., Wimberger P. H. The usefulness of behavior for phylogeny estimation: Levels of homoplasy in behavioral and morphological characters. Evolution, 1993, vol. 47 (pg. 46-60)

Farris J. S. The retention index and the rescaled consistency index. Cladistics, 1989, vol. 5 (pg. 417-419)

Fedrigo O., Adams D. C., Naylor G. J. DRUIDS: Detection of regions with unexpected internal deviation from stationarity. J. Exp. Zool. B Mol. Dev. Evol., 2005, vol. 304 (pg. 119-128)

Felsenstein J. Cases in which parsimony and compatibility methods will be positively misleading. Syst. Zool., 1978, vol. 27 (pg. 401-410)

Felsenstein J. Confidence limits on phylogenies: An approach using the bootstrap. Evolution, 1985, vol. 39 (pg. 783-791)

Felsenstein J. Taking variation of evolutionary rates between sites into account in inferring phylogenies. J. Mol. Evol., 2001, vol. 53 (pg. 447-455)

Fiala K. L., Sokal R. R. Factors determining the accuracy of cladogram estimation: Evaluation using computer simulation. Evolution, 1985, vol. 39 (pg. 609-622)

Galtier N. Maximum-likelihood phylogenetic analysis under a covarion-like model. Mol. Biol. Evol., 2001, vol. 18 (pg. 866-873)

Gaut B. S., Lewis P. O. Success of maximum likelihood phylogeny inference in the four-taxon case. Mol. Biol. Evol., 1995, vol. 12 (pg. 152-162)

Goldman N. Phylogenetic information and experimental design in molecular systematics. Proc. Biol. Sci., 1998, vol. 265 (pg. 1779-1786)

Graybeal A. Evaluating the phylogenetic utility of genes: A search for genes informative about deep divergences among vertebrates. Syst. Biol., 1994, vol. 43 (pg. 174-193)

Graybeal A. Is it better to add taxa or characters to a difficult phylogenetic problem? Syst. Biol., 1998, vol. 47 (pg. 9-17)

Grundy W. N., Naylor G. J. Phylogenetic inference from conserved sites alignments. J. Exp. Zool., 1999, vol. 285 (pg. 128-139)

Gu X. Maximum-likelihood approach for gene family evolution under functional divergence. Mol. Biol. Evol., 2001, vol. 18 (pg. 453-464)

Hillis D. M. Taxonomic sampling, phylogenetic accuracy, and investigator bias. Syst. Biol., 1998, vol. 47 (pg. 3-8)

Hillis D. M., Huelsenbeck J. P. Signal, noise, and reliability in molecular phylogenetic analyses. J. Hered., 1992, vol. 83 (pg. 189-195)

Huelsenbeck J. P. Tree-length distribution skewness: An indicator of phylogenetic information. Syst. Zool., 1991a, vol. 40 (pg. 257-270)

Huelsenbeck J. P. When are fossils better than extant taxa in phylogenetic analysis? Syst. Zool., 1991b, vol. 40 (pg. 458-469)

Huelsenbeck J. P., Hillis D. M. Success of phylogenetic methods in the four-taxon case. Syst. Biol., 1993, vol. 42 (pg. 247-264)

Huelsenbeck J. P., Ronquist F. MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics, 2001, vol. 17 (pg. 754-755)

Kim J. General inconsistency conditions for maximum parsimony: Effects of branch lengths and increasing numbers of taxa. Syst. Biol., 1996, vol. 45 (pg. 363-374)

Kim J. Large-scale phylogenies and measuring the performance of phylogenetic estimators. Syst. Biol., 1998, vol. 47 (pg. 43-60)

Li W. H., Ellsworth D. L., Krushkal J., Chang B. H., Hewett-Emmett D. Rates of nucleotide substitution in primates and rodents and the generation-time effect hypothesis. Mol. Phylogenet. Evol., 1996, vol. 5 (pg. 182-187)

Lockhart P. J., Steel M. A., Barbrook A. C., Huson D. H., Charleston M. A., Howe C. J. A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages. Mol. Biol. Evol., 1998, vol. 15 (pg. 1183-1188)

Miyamoto M. M., Fitch W. M. Testing the covarion hypothesis of molecular evolution. Mol. Biol. Evol., 1995, vol. 12 (pg. 503-513)

Mossel E., Steel M. A phase transition for a random cluster model on phylogenetic trees. Math. Biosci., 2004, vol. 187 (pg. 189-203)

Naylor G. J., Brown W. M. Structural biology and phylogenetic estimation. Nature, 1997, vol. 388 (pg. 527-528)

Naylor G. J. P., Brown W. M. Amphioxus mitochondrial DNA, chordate phylogeny, and the limits of inference based on comparisons of sequences. Syst. Biol., 1998, vol. 47 (pg. 61-76)

Nee S. Inferring speciation rates from phylogenies. Evolution, 2001, vol. 55 (pg. 661-668)

Penny D., McComish B. J., Charleston M. A., Hendy M. D. Mathematical elegance with biochemical realism: The covarion model of molecular evolution. J. Mol. Evol., 2001, vol. 53 (pg. 711-723)

Poe S. Sensitivity of phylogeny estimation to taxonomic sampling. Syst. Biol., 1998, vol. 47 (pg. 18-31)

Poe S. Evaluation of the strategy of long-branch subdivision to improve the accuracy of phylogenetic methods. Syst. Biol., 2003, vol. 52 (pg. 423-428)

Poe S., Chubb A. L. Birds in a bush: Five genes indicate explosive evolution of avian orders. Evolution, 2004, vol. 58 (pg. 404-415)

Pollock D. D., Bruno W. J. Assessing an unknown evolutionary process: Effect of increasing site-specific knowledge through taxon addition. Mol. Biol. Evol., 2000, vol. 17 (pg. 1854-1858)

Pollock D. D., Zwickl D. J., McGuire J. A., Hillis D. M. Increased taxon sampling is advantageous for phylogenetic inference. Syst. Biol., 2002, vol. 51 (pg. 664-671)

Pond S. K., Muse S. V. Site-to-site variation of synonymous substitution rates. Mol. Biol. Evol., 2005, vol. 22 (pg. 2375-2385)

Rannala B., Huelsenbeck J. P., Yang Z., Nielsen R. Taxon sampling and the accuracy of large phylogenies. Syst. Biol., 1998, vol. 47 (pg. 702-710)

Ree R. H. Detecting the historical signature of key innovations using stochastic models of character evolution and cladogenesis. Evolution, 2005, vol. 59 (pg. 257-265)

Rokas A., Carroll S. B. More genes or more taxa? The relative contribution of gene number and taxon number to phylogenetic accuracy. Mol. Biol. Evol., 2005, vol. 22 (pg. 1337-1344)

Rokas A., Holland P. W. Rare genomic changes as a tool for phylogenetics. Trends Ecol. Evol., 2000, vol. 15 (pg. 454-459)

Rokas A., King N., Finnerty J., Carroll S. B. Conflicting phylogenetic signals at the base of the metazoan tree. Evol. Dev., 2003, vol. 5 (pg. 346-359)

Rokas A., Kruger D., Carroll S. B. Animal evolution and the molecular signature of radiations compressed in time. Science, 2005, vol. 310 (pg. 1933-1938)

Rokas A., Williams B. L., King N., Carroll S. B. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature, 2003, vol. 425 (pg. 798-804)

Rosenberg M. S., Kumar S. Incomplete taxon sampling is not a problem for phylogenetic inference. Proc. Natl. Acad. Sci. USA, 2001, vol. 98 (pg. 10751-10756)

Rosenberg M. S., Kumar S. Taxon sampling, bioinformatics, and phylogenomics. Syst. Biol., 2003, vol. 52 (pg. 119-124)

Scally M., Madsen O., Douady C. J., de Jong W. W., Stanhope M. J., Springer M. S. Molecular evidence for the major clades of placental mammals. J. Mammal. Evol., 2002, vol. 8 (pg. 239-277)

Shpak M., Churchill G. A. The information content of a character under a Markov model of evolution. Mol. Phylogenet. Evol., 2000, vol. 17 (pg. 231-243)

Slowinski J. B. Molecular polytomies. Mol. Phylogenet. Evol., 2001, vol. 19 (pg. 114-120)

Steel M., Penny D. Parsimony, likelihood, and the role of models in molecular phylogenetics. Mol. Biol. Evol., 2000, vol. 17 (pg. 839-850)

Steppan S., Adkins R., Anderson J. Phylogeny and divergence-date estimates of rapid radiations in muroid rodents based on multiple nuclear genes. Syst. Biol., 2004, vol. 53 (pg. 533-553)

Sullivan J., Swofford D. L., Naylor G. J. P. The effect of taxon sampling on estimating rate heterogeneity parameters of maximum-likelihood models. Mol. Biol. Evol., 1999, vol. 16 (pg. 1347-1356)

Susko E., Inagaki Y., Field C., Holder M. E., Roger A. J. Testing for differences in rates-across-sites distributions in phylogenetic subtrees. Mol. Biol. Evol., 2002, vol. 19 (pg. 1514-1523)

Walsh H. E., Kidd M. G., Moum T., Friesen V. L. Polytomies and the power of phylogenetic inference. Evolution, 1999, vol. 53 (pg. 932-937)

Wang Y., Gu X. Functional divergence in the caspase gene family and altered functional constraints: Statistical analysis and prediction. Genetics, 2001, vol. 158 (pg. 1311-1320)

Weinreich D. M. The rates of molecular evolution in rodent and primate mitochondrial DNA. J. Mol. Evol., 2001, vol. 52 (pg. 40-50)

Weisrock D. W., Harmon L. J., Larson A. Resolving deep phylogenetic relationships in salamanders: Analyses of mitochondrial and nuclear genomic data. Syst. Biol., 2005, vol. 54 (pg. 758-777)

Wenzel J. W., Siddall M. E. Noise. Cladistics, 1999, vol. 15 (pg. 51-64)

Whelan S., Lio P., Goldman N. Molecular phylogenetics: State-of-the-art methods for looking into the past. Trends Genet., 2001, vol. 17 (pg. 262-272)

Wiens J. J., Servedio M. R. Accuracy of phylogenetic analysis including and excluding polymorphic characters. Syst. Biol., 1997, vol. 46 (pg. 332-345)

Yang Z. On the best evolutionary rate for phylogenetic analysis. Syst. Biol., 1998, vol. 47 (pg. 125-133)

Yang Z. H. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol., 1996, vol. 11 (pg. 367-372)

Zwickl D. J., Hillis D. M. Increased taxon sampling greatly reduces phylogenetic error. Syst. Biol., 2002, vol. 51 (pg. 588-598)