Obtaining Maximal Concatenated Phylogenetic Data Sets from Large Sequence Databases

To improve the accuracy of tree reconstruction, phylogeneticists are extracting increasingly large multigene data sets from sequence databases. Determining whether a database contains at least k genes sampled from at least m species is an NP-complete problem. However, the skewed distribution of sequences in these databases permits all such data sets to be obtained in reasonable computing times even for large numbers of sequences. We developed an exact algorithm for obtaining the largest multigene data sets from a collection of sequences. The algorithm was then tested on a set of 100,000 protein sequences of green plants and used to identify the largest multigene ortholog data sets having at least 3 genes and 6 species. The distribution of sizes of these data sets forms a hollow curve, and the largest are surprisingly small, ranging from 62 genes by 6 species, to 3 genes by 65 species, with more symmetrical data sets of around 15 taxa by 15 genes. These upper bounds to sequence concatenation have important implications for building the tree of life from large sequence databases.


Introduction
Improved accuracy of phylogenetic inference through the concatenation of multiple sequences from the same taxon is expected on theoretical grounds (Erdos et al. 1999;Bininda-Emonds et al. 2001) and has been found in many recent studies (Qiu et al. 1999;Soltis, Soltis, and Chase 1999;Graham and Olmstead 2000;Brown et al. 2001;Madsen et al. 2001;Murphy et al. 2001;Bapteste et al. 2002).For example, Bapteste et al. (2002) combined 123 genes for 30 species of eukaryotes and found support increased monotonically with the number of genes included.Concatenation has been undertaken widely in taxa in which complete genomes are available, such as in prokaryotes (Brown et al. 2001); taxa having small organellar genomes, such as animal mitochondria (Miya and Nishida 2000); and taxa for which long-term coordination among investigators has driven parallel sequencing efforts, as for green plants (Chase et al. 1993).Adding genes for a given set of taxa generally improves both robustness and running times of computationally intensive phylogenetic analyses (Soltis et al. 1998;Savolainen et al. 2000;Bapteste et al. 2002), although inferences about any particular node may sometimes be skewed by long-branch attraction (Felsenstein 1978).Adding taxa for a given set of genes can also sometimes improve phylogenetic tree inference, especially in cases of long-branch attraction (Hillis 1996).Increased taxon sampling can improve reconstructions of ancestral character states, sharpen estimates of rates and modes of sequence evolution, and generally provide a more comprehensive summary of a clade's history.Not surprisingly, the dimensions of phylogenetic sequence data sets have grown rapidly in recent years.
However, phylogenetic methods require input data in the form of rectangular matrices of taxa by aligned sequences.Ideally, such data sets are complete-meaning that every species has been sequenced for every gene in the data matrix.The sample of genes among taxa found in sequence databases or available from other sources rarely allows large complete data sets to be constructed, however, and therefore all large concatenated phylogenetic data sets published recently have missing entries (Qiu et al. 1999;Soltis, Soltis, and Chase 1999;Murphy et al. 2001).Missing data in incomplete data sets should eventually degrade phylogenetic inference by increasing both the number of optimal solutions found and the uncertainty in the placement of some taxa relative to others (Wiens 1998), although how much missing data is tolerable remains an open question (Kearney 2002).This motivates the present study.How can we optimally construct complete phylogenetic data sets from large sequence databases?The problem can be posed more formally as follows: Given a large collection of sequences that have been partitioned into sets of homologous genes, is it possible to construct a complete data matrix in which m taxa have sequences for the same k genes?Moreover, is it possible to find all complete data sets of this size or larger?Finding complete data sets in a sequence database is a nontrivial problem.In fact, determining whether a complete data set of a given size exists is an NP-complete problem, meaning that efficient (polynomial time) algorithmic solutions are unlikely to be discovered (Garey and Johnson 1979).However, we describe an exact, exponential time algorithm which effectively solves many large problems, and we illustrate its use by constructing maximal concatenated data sets from a large set of orthologous protein sequences available from green plants.Alexe et al. (2002) have described a different algorithm which, though polynomial in the output size, also has worst-case exponential running time, because the output size can grow exponentially with input size.

Definitions
A cluster is a set of sequence homologs.Because our sequences consist entirely of protein coding genes, a cluster here represents a ''gene'' or ''protein.''Concatenation of sequences is appropriate only for clusters consisting of orthologous genes, so we restrict attention to those (see below and table 1).The cluster set, C, is the set of all clusters.A cluster set, C, can be represented as a bipartite graph (West 2001), G(C), consisting of two disjoint sets of nodes, one for the clusters and one for the taxa (fig.1).Edges connect a cluster node to a taxon node if and only if that cluster has a sequence for that taxon.A complete phylogenetic data set is one containing no missing genes for a given set of taxa.Finding a complete phylogenetic data set for a cluster set C is equivalent to finding a biclique in the bipartite graph, G(C).A biclique is a subgraph of G(C) denoted K k,m and consisting of k clusters and m taxa such that every cluster node is connected (adjacent) to every taxon node and vice versa (fig.1).A biclique is maximal if no other biclique exists that contains it as a proper subgraph.There may be many maximal bicliques.Bicliques arise in many problems in graph theory, and the complexity of problems related to bicliques has been studied in some detail (Hochbaum 1998;Peeters 2000;Dawande et al. 2001).

Formal Statement of Problem
We study the following optimization problem: Given: Bipartite graph G(C) for cluster set C, natural numbers k and m.Find: All maximal bicliques, K k9 ,m9 , for G(C), in which k9 k, m9 m.
The decision problem asking whether a biclique, K k,m , exists in G(C) is NP-complete.This follows because the decision problem version of our problem for the instance k ¼ m is called GT24 (balanced complete bipartite subgraph problem) in Garey and Johnson (1979; see also Peeters 2000) and is known to be NP-complete.Therefore the decision variant of our problem (without the restriction k ¼ m) must therefore be NP-complete.This implies that the optimization problem stated above is also unlikely to have an efficient solution for arbitrary inputs (Cormen et al. 2001).

Algorithm
We have developed an exact algorithm that solves this problem quickly (in minutes to a few hours on a Linux workstation) for our data, by exhaustively building progressively larger bicliques.Briefly, suppose we seek all bicliques at least as large as K k,m .The exact algorithm examines every pair of clusters and calculates the intersection of their taxon sets.Let the size of that intersection (the number of taxa in both clusters) be m9.This step finds all bicliques, K 2,m9 .Discard any bicliques in which m9 , m, and-assuming any are left-try adding every remaining cluster to each surviving candidate biclique.This step finds all bicliques, K 3,m9 .If this iterative procedure terminates before finding bicliques, K k,m , then none exists satisfying the constraints.On the other hand if the iteration gets to the point of examining k clusters, then every biclique found from then on will be of the required size or larger, and the algorithm will continue to enumerate these until it has found all of them.Maximal bicliques are obtained from these by discarding any that are contained in another biclique.This algorithm was implemented in Cþþ, using the LEDA library (Algorithmic Solutions Software GmbH) for data structures.Source code and a Linux executable are available at (http://ginger.ucdavis.edu/sandlab/SOFTWARE).See the Appendix for a detailed description of the algorithm.

Sequence Data, Extraction of Cluster Set, and Identification of Ortholog Clusters
To illustrate the effectiveness of this algorithm, we applied it to a set of 100,813 proteins sampled from 11,587 taxa of green plants extracted from the five GenBank GBPLN flatfiles in release 127.0 of GenBank (December 2001), representing all green plant protein coding sequences in the database that have been translated.The cluster set was constructed using all-against-all BLAST searches (Altschul et al. 1990) combined with single linkage clustering as implemented in the National Center for Biotechnology Information (NCBI) blastclust tool (Dondoshansky 2002), at a stringency of 60% at the amino acid level.Other clustering strategies have been described which are considerably more sophisticated than ours (Krause, Stoye, and Vingron 2000;Kriventseva et al. 2001;Tatusov et al. 2001), but our algorithm can be applied to any cluster set, regardless of how it is constructed.All clusters are available in our online mySQL database (host: ginger.ucdavis.edu;user name ''guest'').
Sequence concatenation is most appropriate for orthologous sequences.Automated identification of orthologs in sequence database is a challenging task both for computational reasons (Remm et al. 2001;Lee et al. 2002) and because sparse sampling of gene families in databases imposes sharp limits on the strength of inferences.We used a phylogenetic procedure, rather than relying on more standard reciprocal Blast searching strategies (e.g., Lee et al. 2002).By default, the sequences in any cluster in which only a single sequence was present per taxon were treated as orthologous.For clusters containing multiple sequences in at least one taxon, unconstrained and constrained gene trees were constructed using maximum parsimony with a ''protein parsimony'' step matrix (Swofford et al. 1996; gaps treated as missing data) in the program PAUP* (Swofford 2002).The constrained tree forced all sequences from the same species to form a clade.If the constrained tree was significantly less parsimonious than the unconstrained tree (using a signed-rank test; Swofford et al. 1996), the cluster was considered to contain paralogs and was excluded from concatenation analysis.Note that a cluster might well contain only the orthologs of one paralog in a gene family.Also, some orthologous genes with ancestral polymorphisms will be excluded by the test, and if sampling of a gene family is extremely poor, such that only one sequence per taxon is found in the database, then we are left with no choice but to mistakenly infer orthology.This will be especially likely for clusters with only a few sequences and taxa.

Phylogenetic Analysis of Concatenated Data Sets
Phylogenies were constructed using standard methods for two representative maximal bicliques selected from the set of maximal bicliques.Clusters containing multiple accessions from the same species were pruned such that only one sequence was included per species.This pruning is justified because all sequences from the same species form a clade for those clusters passing the phylogenetic test for orthology.Presumably these represent multiple accessions of the same gene or multiple alleles from the same locus.Amino acid sequences were aligned with default options in ClustalW (Thompson et al. 1994).Protein parsimony was used to reconstruct trees (see above).Bootstrap analysis (100 replicates) was used to assess phylogenetic support for clades.The single-celled streptophyte Mesostigma was used as the outgroup to land plants for the K 39,10 biclique; three chlorophytes were used as outgroups to the streptophytes (including land plants) in the K 15,15 biclique.Aligned data sets are available at http:// ginger.ucdavis.edu/sandlab/WWW_DATA.

Results
Clustering the database generated a set of 40,154 protein clusters, of which 1,092 were potentially phylogenetically informative about species relationships because they contained four or more distinct taxa (table 1).The plastid genes rbcL, matK, and ndhF formed the largest clusters, and the model organisms Arabidopsis, rice, and maize were represented in the most clusters.Of the informative clusters, 656 were determined to consist entirely of orthologs, and thus represent candidates for concatenation.Although this usable subset contained only 26% of the sequences in the original data set, its taxonomic coverage was still 88% (table 1).
Although the running time of the exact algorithm precluded identification of all maximal bicliques, careful setting of different combinations of lower bounds, k, and m, allowed us to quickly identify the largest ones (fig.2).Running times increase rapidly as the input constraints, k and m, are decreased, which is expected given the computational complexity of the problem.To find the largest maximal bicliques (those forming the boundary in fig.2), it suffices to choose values of k and m that are just small enough to find some bicliques but not so small that run times become a problem.Each such successful run identifies one or more bicliques guaranteed to have the property that no other bicliques exist above and to the right of it (i.e., with larger values of either k or m or both).Ten runs of the algorithm were sufficient to identify all largest maximal bicliques with more than 3 clusters and 6 taxa.
The relative efficiency of the exact algorithm even for this large sequence set stems from the uneven distribution of sequences of genes among taxa (many sequences are available for a few model species and a few heavily sampled genes are available for many species).No bicliques exist larger than the set of maximal bicliques shown, which form a boundary in the space of concatenated data sets (fig.2).Bicliques of every possible dimension exist below and to the left of the boundary, but many of these are not maximal.Some bicliques are equal to each other in their dimensions but have different taxon and cluster compositions.Many of the bicliques overlap, and for any given biclique there are generally bicliques which have either slightly more taxa and slightly fewer clusters, or slightly more clusters and fewer taxa.The largest bicliques form a highly left-skewed ''hollow'' curve with many data sets of few genes and many taxa, or few taxa and many genes.At one end of this range was a data set with 62 genes by 6 species; at the other end, a data set with 3 genes by 65 species.Bicliques with approximately equal numbers of both taxa and genes have the unexpectedly small size of about 12-15 of each.In general, no concatenated data set contained more than about 400 sequences.Thus, no complete data set contained more than about 2% of the 22,000 sequences found in phylogenetically informative ortholog clusters for green plants.
The phylogenetic tree (fig.3) based on the K 39,10 biclique (39 genes; 8,523 amino acids; 10 species) provides strong support for major clades within land plants, including basal relationships within eudicot angiosperms that are not well resolved in recent analyses (Qiu et al. 1999;Soltis et al. 1999;Savolainen et al. 2000).For example, spinach (Chenopodiaceae) and tobacco (Solanaceae) share a more recent common ancestor than either does with the other angiosperms sampled.The K 15,15 biclique (15 genes; 3,919 amino acids; 15 species) samples many fewer genes but delves more deeply into green plant phylogeny, including several green algal relatives of land plants, another angiosperm, and another seed plant, Pinus.Bootstrap support was slightly lower within eudicots because of fewer characters, which is apparently the tradeoff for increased taxonomic coverage.These examples are characteristic of the high degree of overlap between many maximal bicliques.

Discussion
The algorithms described here permit construction of maximal complete concatenated sequence data sets from the sequence databases.Construction of complete data matrices via concatenation for phylogenetic analysis, equivalent to the identification of bicliques, is a computationally hard problem in theory.However, for problem instances such as the one examined above, comprising a set of over 20,000 proteins for a taxonomically diverse collection of green plants (some 10% of all species in GenBank), exact algorithms can solve the problem fairly quickly.This should permit more intensive exploitation of sequence databases for phylogenetic purposes.

Assembling the Tree of Life: The Role of Sequence Concatenation
These results have implications for recent efforts aimed at assembling large parts of the tree of life (http://research.amnh.org/biodiversity/features/feat.html; or more correctly, that part of the history of life that is tree-like, see Wolf et al. 2002).Exploiting the size and diversity of sequence databases for building comprehensive species phylogenies poses many computational challenges.Two competing strategies have been discussed: (1) concatenation of sequences for increasingly large sets of taxa, as discussed here; and (2) combination of trees (rather than data) constructed from separate but overlapping data sets, using ''supertree'' methods (Sanderson, Purvis, and Henze 1998;Daubin, Gouy, and Perriere 2001;Liu et al. 2001).Given that some 11,000 species of green plants have protein sequences in GenBank, and that these fall into 40,000 protein clusters, the discovery that the corresponding maximal bicliques have sizes on the order of only 15 by 15 is striking.Moreover, the fact that these complete data sets each contain no more than about 2% of the available sequences implies some limitations on the utility of concatenated data sets.Even though the algorithm described here will permit more intensive exploitation of the sequence databases, concatenation alone will probably not provide a comprehensive solution to building the tree of life any time soon, even with the rapid accumulation of new complete genome sequences.There is little reason to expect that the rich diversity of species from within the broad clades represented in maximal bicliques will soon have many sequences in GenBank.Inclusion of the innumerable ''minor'' species in the tree of life will therefore require other strategies, such as supertree methods.Nonetheless, bicliques will ultimately provide well-supported phylogenetic backbones that can form the basis for more comprehensive supertree-style studies.
An important lingering question concerns the treatment of overlapping bicliques.Although some bicliques found by running this algorithm are mutually exclusive, many are not.The two ''optimal'' bicliques shown in figure 2, for example, contain many of the same se- quences, and these overlap as well with some of the other maximal bicliques that satisfy the input constraints.If it were desirable to construct many phylogenetic trees from these data sets, it would be necessary to develop strategies either to avoid redundantly using the same sequence data or to account for that redundant use in later applications.One way to formalize this problem is to (edge-)partition the bipartite graph with the fewest bicliques possible.Deciding whether it is possible to partition a bipartite graph into k bicliques is NP-complete (Amilhastre 1999), but minimizing k would guarantee at least that the average number of sequences (edges) per data set (biclique) was maximized.More specific optimality criteria relating to the entire collection of bicliques might be necessary to build robust large phylogenies, however, such as requiring bicliques to be of a minimum size.

Caveats and Limitations
Concatenation of sequences from different genes may not always be a good idea.If different clusters contain different phylogenetic signals, either because of real differences in their evolutionary history, or because of different statistical biases, concatenation may obscure the underlying species tree (Bull et al. 1993).An extensive literature has considered the problem of combinability of data, and statistical tests are widely available (e.g., in PAUP* and other software).However, until very recently these tests have generally dealt with two or three data sets at a time.Scaling up tests to tens or ultimately perhaps even hundreds of genes will present important new challenges.Judging by recent phylogenetic analyses using concatenated genes, the tendency will be to combine data by default, in the hopes that weight of evidence will resolve any conflicts.As genes are sampled from multiple linkage groups and multiple genomes, however, the chances for conflicts between gene trees will rise (Baker and Desalle 1997;Krzywinski ,Wilkerson, and Besansky 2001).

Extensions
The problems described in this paper can be easily modified to include weights on either the taxa or the clusters.A natural weight on clusters is the length of the sequences, which should be crudely correlated with robustness (all else being equal).For phylogenetic reconstruction it might be worthwhile to have one gene of 2,500 nucleotides rather than three genes that together only comprise 1,500 nucleotides.At present we treat all clusters as equal.Equal weighting may have its uses if each cluster is considered a potentially independent source of evidence on the species tree.However, in the case of the green plant data, the bicliques were made almost exclusively of genes in a single linkage group, the chloroplast genome.In that case, weighting by number of sites might be instructive.Better still might be weighting by an a posteriori analysis of the phylogenetic signal in the cluster, such as by an average bootstrap score or posterior probability from Bayesian analysis.However, implementing this sort of analysis would be difficult, because the taxa from the cluster that are eventually represented in the biclique might be in a part of the tree that is poorly supported, even if the remainder of the tree is strongly supported.
Another straightforward extension is to constrain bicliques to contain specific taxa, clusters, or both.Phylogeneticists may wish to obtain large bicliques that include specific taxa of interest; molecular evolutionists may wish to include genes sampled from specific classes of molecules.The algorithm described in this paper can be easily modified to start from a given set of constraints.
It may be important to extend these methods to ''quasi-bicliques''-bicliques that are allowed to have some fixed number or fraction of empty elements (an element being an entire sequence missing for some taxon).It should be possible to construct quasi-bicliques larger than the bicliques found here.Recent large multigene studies (Qiu et al. 1999;Soltis, Soltis, and Chase 1999;Murphy et al. 2001) all contain ''holes'' in their data matrices, and are thus quasi-bicliques.The problem is to assess the trade-off between better sampling in a quasibiclique and additional noise owing to the missing data (Kearney 2002).Quasi-bicliques might also be used to identify which new sequences should be obtained in the laboratory to permit true blicliques to be constructed.In this way, algorithms can guide phylogenetic experimental design.
FIG. 2.-Largest maximal bicliques for the green plant protein data set having at least 3 genes (clusters) and 6 species.No bicliques exist above and to the right of those indicated.Every possible data set below and to the left does exist, but many are trivial subsets of the indicated maximal bicliques.The two bicliques subjected to phylogenetic analysis are indicated by arrows.

FIG. 3
FIG.3.-A,phylogenetic tree for the K 39,10 concatenated data set.B, phylogenetic tree for the K 15,15 concatenated data set.Bootstrap support is indicated at each node.Taxon names are prefixed with their NCBI taxonomy identification numbers.

Table 1
Distribution of Sequences and Taxa Among Phylogenetically Informative Green Plant Clusters Bipartite graph of a hypothetical cluster set.Edges connect species (taxon) nodes with cluster nodes if a sequence exists for that taxon in that cluster.A hypothetical biclique is indicated by bold lines.It corresponds to a phylogenetic data matrix that is complete (shown in box).Cluster 4 has not been sequenced for species A or B and therefore cannot be part of this complete matrix.Downloaded from https://academic.oup.com/mbe/article-abstract/20/7/1036/1046695 by guest on 28 December 2018