A direct method is described for calculating expected data from an evolutionary model for two-state characters. The method uses four vectors, p, q, r and s. The vectors p and q are the probabilities of a character change on the 2n − 3 edges of a tree T, where n is the number of taxa. The vectors r and s are properties of the data, are independent of any tree, and have 2^(n−1) entries. For a given tree T with specified probabilities (p or q), we determine r and then s, the expected probabilities of each of the 2^(n−1) possible partitions of the taxa. For any tree T the relationship can be inverted, which allows the probabilities of change on the tree, p and q, to be estimated directly from observed data (r or s).
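
To make the counts concrete, the following is a minimal Python sketch for n = 4 taxa on the tree ((1,2),(3,4)), which has 2n − 3 = 5 edges and 2^(n−1) = 8 partitions of the taxa into two subsets. It is not the direct method of the paper: it assumes a symmetric two-state model and obtains the expected partition probabilities by brute-force summation over the states of the two internal nodes. The function names and the edge probabilities used are illustrative assumptions.

```python
from itertools import product


def edge_prob(p, a, b):
    """Probability of state b at one end of an edge given state a at the
    other, when the edge has change probability p (symmetric model)."""
    return p if a != b else 1.0 - p


def bipartition_probs(p_leaf, p_internal):
    """Expected probabilities of the 2**(n-1) = 8 partitions for the
    4-taxon tree ((1,2),(3,4)).

    p_leaf     -- change probabilities on the edges leading to taxa 1..4
    p_internal -- change probability on the single internal edge
    Keys are the sets of taxa whose state differs from that of taxon 4.
    """
    probs = {}
    for leaves in product((0, 1), repeat=4):
        total = 0.0
        # u joins taxa 1 and 2, v joins taxa 3 and 4; u is 0 or 1 with
        # probability 1/2 each (an assumption of this sketch).
        for u, v in product((0, 1), repeat=2):
            term = 0.5 * edge_prob(p_internal, u, v)
            term *= edge_prob(p_leaf[0], u, leaves[0])
            term *= edge_prob(p_leaf[1], u, leaves[1])
            term *= edge_prob(p_leaf[2], v, leaves[2])
            term *= edge_prob(p_leaf[3], v, leaves[3])
            total += term
        # A pattern and its complement define the same partition.
        key = frozenset(i + 1 for i in range(3) if leaves[i] != leaves[3])
        probs[key] = probs.get(key, 0.0) + total
    return probs


# Illustrative values only, not taken from the paper.
s = bipartition_probs([0.1, 0.1, 0.1, 0.1], 0.05)
for taxa, prob in sorted(s.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(taxa), round(prob, 4))
```

The empty set corresponds to characters that are constant across all taxa, and the eight probabilities sum to one. Enumeration of this kind grows exponentially with n, which is one reason a direct relationship between the edge vectors and the data vectors is attractive.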

These relationships have been used to analyse the behaviour of tree-building algorithms when there are sufficient data, that is, when the tree no longer changes as more data are collected (convergence to a single tree). With equal rates of evolution (i.e., with a molecular clock), we show that for n = 4 taxa parsimony will always converge to the correct tree, but we give examples with n = 5 where parsimony will converge to an incorrect tree, even for equal rates of evolution. A further example with n = 6 shows convergence to an incorrect tree with equal but arbitrarily small rates of change. We interpret a basic difficulty with parsimony as ‘long edges attract.’ If there are additional taxa that intersect long edges on the tree, then this effect can be reduced. Some distance methods may also converge to an incorrect tree.
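
The ‘long edges attract’ effect can be illustrated with a small Python sketch. It is not one of the equal-rate examples of the paper (those need n = 5 or n = 6); it is a four-taxon illustration with deliberately unequal edge probabilities, under an assumed symmetric two-state model. For four taxa, parsimony with sufficient data selects the tree whose informative partition has the highest expected probability, so it is enough to compare the three informative partitions. All names and numerical values are illustrative assumptions.

```python
from itertools import product


def split_probs(p_leaf, p_internal):
    """Expected probabilities of the three parsimony-informative
    partitions when the true tree is ((1,2),(3,4))."""
    def step(p, a, b):
        # Probability of state b given state a across an edge with
        # change probability p.
        return p if a != b else 1.0 - p

    names = {(0, 0, 1, 1): "12|34", (0, 1, 0, 1): "13|24", (0, 1, 1, 0): "14|23"}
    out = {name: 0.0 for name in names.values()}
    for x in product((0, 1), repeat=4):
        # Relabel so taxon 1 has state 0; a pattern and its complement
        # define the same partition.
        canon = tuple(xi ^ x[0] for xi in x)
        if canon not in names:
            continue  # constant or single-taxon pattern, uninformative
        for u, v in product((0, 1), repeat=2):
            out[names[canon]] += (
                0.5 * step(p_internal, u, v)
                * step(p_leaf[0], u, x[0]) * step(p_leaf[1], u, x[1])
                * step(p_leaf[2], v, x[2]) * step(p_leaf[3], v, x[3])
            )
    return out


def parsimony_limit(probs):
    """Partition favoured by parsimony as the amount of data grows."""
    return max(probs, key=probs.get)


# Equal edge probabilities on the pendant edges (illustrative values).
equal = split_probs([0.1, 0.1, 0.1, 0.1], 0.05)
# Long edges to taxa 1 and 3, which are not neighbours on the true tree.
unequal = split_probs([0.4, 0.05, 0.4, 0.05], 0.05)

print(parsimony_limit(equal), parsimony_limit(unequal))
```

With the equal pendant edges the true partition 12|34 has the highest expected probability, whereas with long edges to taxa 1 and 3 the partition 13|24 overtakes it, so the two long edges are grouped together.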
