Bilaterian Phylogeny: A Broad Sampling of 13 Nuclear Genes Provides a New Lophotrochozoa Phylogeny and Supports a Paraphyletic Basal Acoelomorpha

During the past decade, great progress has been made in clarifying the relationships among bilaterian animals. Studies based on a limited number of markers established new hypotheses such as the existence of three superclades (Deuterostomia, Ecdysozoa, and Lophotrochozoa) but left major questions unresolved. The data sets used to the present either bear few characters for many taxa (i.e., the ribosomal genes) or present many characters but lack many phyla (such as recent phylogenomic approaches) failing to provide deﬁnitive answers for all the regions of the bilaterian tree. We performed phylogenetic analyses using a molecular matrix with a high number of characters and bilaterian phyla. This data set is built from 13 genes (8,880 bp) belonging to 90 taxa from 27 bilaterian phyla. Probabilistic analyses robustly support the three superclades, the monophyly of Chordata, a spiralian clade including Brachiozoa, the basal position of a paraphyletic Acoelomorpha, and point to an ecdysozoan afﬁliation for Chaetognatha. This new phylogeny not only agrees with most classical molecular results but also provides new insights into the relationships between lophotrochozoans and challenges the results obtained using high-throughput strategies, highlighting the problems associated with the current trend to increase gene number rather than taxa.


Introduction
Small ribosomal subunit RNA gene (18S rDNA or small subunit [SSU]) sequences were the first and most extensively used source of information to establish the new, widely accepted bilaterian phylogeny, which features three large superclades: the Lophotrochozoa, the Ecdysozoa, and the Deuterostomia (Halanych 2004). However, the relationships within these superclades and the phylogenetic position of some enigmatic phyla still remain elusive to SSU analyses, in part due to long-branch attraction (LBA) artifacts (Felsenstein 1978) and to their recognized limited resolution (Abouheif et al. 1998). To overcome this problem, other markers such as the large ribosomal subunit RNA gene (28S or large subunit [LSU]; Mallatt and Giribet 2006;Passamaneck and Halanych 2006) or proteincoding genes (Ruiz-Trillo et al. 2002;Anderson et al. 2004;Peterson et al. 2004) were introduced. Although these approaches were instrumental in resolving some internal nodes of the tree, they are associated with similar problems to those encountered with SSU genes, namely, stochastic errors and artifacts due to LBA (Philippe and Telford 2006). Alternative sources of information such as sequence signatures in the Hox genes (Balavoine et al. 2002), mitogenomics (Boore et al. 2005), or micro-RNAs (Sempere et al. 2006) were subsequently proposed. Unfortunately, the binary nature of these qualitative characters (present/ absent, shared/not shared) has only allowed definition of one clade versus another and has not helped to resolve all the internal relationships yet.
Phylogenomics, based mainly on expressed sequence tag (EST) data, is nowadays the leading approach through which to address this problem. Using up to 183 genes, together with the development of new models of protein evolution, recent phylogenomic studies have lent support to the three superclades (Philippe et al. 2005;Bourlat et al. 2006;Delsuc et al. 2006;Lartillot et al. 2007;Brinkmann and Philippe 2008;Dunn et al. 2008) but have been unable to produce a clear and robust internal phylogeny of these clades. Phylogenomics claims to overcome stochastic errors by incorporating a high number of characters; however, it is also susceptible to ''gappy'' alignments (Hartmann and Vision 2008), poor taxon sampling, systematic errors, and paralogy problems (Philippe and Telford 2006). Indeed, reduced sampling might explain conflicting results either supporting or rejecting the Ecdysozoa over the Coelomata (Dopazo et al. 2004), as well as uncertainties related to the true position of the Acoela  or the tunicates (Delsuc et al. 2006). Even the incorporation of one taxon per phylum does not guarantee a systematic error-free phylogeny (Philippe and Telford 2006), although the incorporation of new EST projects into future analyses will hopefully prevail over those errors as shown by recent studies (Dunn et al. 2008).
As of today, the molecular matrices used are asymmetric: They include either many phyla and few markers or not many phyla and numerous markers, each case bearing its own flaws. This produces dark areas in some regions of the bilaterian tree, for example, the internal relationships within the Lophotrochozoa, the monophyletic status of Chordata, and the position of groups like the Acoelomorpha and the Chaetognatha. To provide a more robust basis on which to analyze the phylogenetic relationships of these problematic regions, we developed and analyzed a more balanced data set. We evaluated 26 genes for their potential phylogenetic information, and 11 were selected. Sequences already present in GenBank were downloaded and 89 new sequences produced. The final matrix contains 90 representatives from 27 phyla and is 8,880 nt long for the 11 proteincoding genes in addition to the two ribosomal RNA genes with a value of 40% missing data. Maximum likelihood (ML) and Bayesian inference (BI) methods were used to obtain a phylogeny of the bilaterians.
Gene sequences were downloaded from GenBank and each gene was aligned independently based on the amino acid sequence using ClustalX 1.81, and the resulting alignments were checked with Bioedit. For organisms which full genome is available (i.e., Drosophila melanogaster, Homo sapines, etc.), only those markers with clear orthology were used. Regions of ambiguous alignment were removed using Gblocks (Castresana 2000) with the default options except Allowed gap positions (set to ''With half''). For each gene, taxa lacking representatives were amplified and sequenced, and a Blast search was performed with the new sequences to confirm their identity. They were added to their respective alignments and their orthology was assessed with singlegene phylogenies. Genes that produced poorly resolved phylogenies (comb trees) contained poor taxon sampling or produced trees that were highly inconsistent with previous phylogenetic studies (e.g., placing molluscs inside chordates) were discarded. The final selected genes were ATPase alpha, GAPDH, H3, IFs, myosin, tropomyosin, aldolase, ATP synthase beta, MAT, PFK, and catalase. Independent alignments for SSU and LSU sequences from a previous study (Paps et al. 2009) were also used. In order to have the same number of operational taxonomic unit (OTUs) for all the genes, the missing representatives for each gene were classified as missing data (filled with Ns).

Data Set
The independent alignments were concatenated into a data set containing 90 OTUs representing 27 phyla and 8,880 positions for 13 genes and a 40% missing data. A summary of the sequences included is provided in supplementary table 3, Supplementary Material online, and a more detailed description for each OTU (species, classification, number of genes available, and accession numbers) is shown in supplementary table 4, Supplementary Material online. To reduce the ''gappyness'' in the matrix (see Hartmann and Vision 2008), 20 of the 90 OTUs were produced by merging sequences of different species, an approach already used in other studies (Giribet et al. 2001;Bourlat et al. 2008). The merged sequences were from species as related as possible and always belonging to the same class. Only the Echiura representative is constituted by organisms from different classes. These amalgamated OTUs were named consequently (e.g., Oligochaeta, Opiliones, and Teleostei).

Phylogenetic Analyses
TreePuzzle was also employed to carry out the likelihood mapping analyses, with the options estimation accurate (slow), Tamura-Nei 93 model, 4 Gamma categories and 1 invariable and 10,000 quartets. Modeltest (Posada and Crandall 1998) was used to determine the evolutionary model that showed the best fit for each gene, following the Akaike information criterion. The specified model (general timereversible [GTR] þ C þ I) was used in all the algorithms.
BI trees were inferred with a parallelized version of MrBayes software (Ronquist and Huelsenbeck 2003), using a partitioned data set (one partition for each gene, unlinking for each partition the estimation of Statefreq, revmat, Pinvar, and shape) and running 1,000,000 generations in 2 independent analyses with a sample frequency of 100. To obtain the consensus tree and BI supports, 500,000 generations were removed to discard trees sampled before likelihood values had reached a plateau. PhyloBayes analyses were performed with the CAT mixture model, which accounts for across-site heterogeneities in the amino acid replacement process (Lartillot and Philippe 2004). Two independent runs were performed with a total length of 17,000 cycles and the first 2,500 points were discarded 2398 Paps et al. as burn-in, and the posterior consensus was computed on the remaining trees.
ML trees were inferred with RaxML (Stamatakis 2006), ran using the model GTR þ C þ I (4 gamma categories þ 1 invariable) and using a partitioned data set (one partition for each gene). A random topology was used as starting tree and 1,000 bootstrap replicates were obtained with the ''RapidBootstrap'' algorithm. To test the results using an alternative heuristic search, PhyML 3.0.1 (Guindon and Gascuel 2003) was run using the model GTR þ C þ I (4 gamma categories þ 1 invariable), 1,000 replicates obtained and subtree pruning regrafting (SPR) heuristic search was used.
Competing topologies were evaluated. The alternative trees were constructed using Treeview (Page 1996) using the original ML inference tree as a template. The alternative topologies tested were based on previous studies or were considered interesting to be tested (table 1). PAUP (Swofford 2000) was used to obtain the site likelihood for the ML tree and CONSEL (Shimodaira and Hasegawa 2001) was run to perform the approximately unbiased (AU) test. The analyses were run on four different computers: 1) two PCs running Windows XP and SUSE Linux 10.0, 2) a supercomputer located at CESCA (Centre de Supercomputació de Catalunya, http://www.cesca.es), and 3) the Marenostrum supercomputer located at the Barcelona Supercomputing Center (http://www.BPPc.es).

Methodological Problems and Data Set Information
The experimental work endured two bottlenecks causing the missing data in our matrix. The first were the unsuccessful RNA extractions, mainly from some marine tiny animals (Porifera, Placozoa, Myxozoa, Gnathostomulida, Cycliophora, or Gastrotricha), where a seemingly sufficient amount of tissue was available but the extraction did not yield enough quantity or quality for the following procedures. This explains the lack of some key phyla in our matrix, though these groups were collected several times. The second bottleneck was the PCR amplification, where some primer pairs worked successfully for some samples but not for others, although these very same samples amplified effectively for other pairs. In the end, in this study, 89 new sequences were produced for the 11 selected genes. 2) result in a bilaterian phylogeny that sheds light on some current uncertainties. The topology from PhyML using SPR heuristics (data not shown) completely agrees with the RaxML tree with few minor exceptions (see clades discussion). The trees obtained from Phylobayes (supplementary fig. 1, Supplementary Material online) lack convergence, but mostly agree with our original MrBayes and ML topologies; unfortunately, the two Phylobayes trees show some anomalies. The first tree places the Bryozoa (including the rotifer Philodina) as sister group to Priapulida within Ecdysozoa. The second run is not able to recover the monophyly of Deuterostomia and places Onychophora within a highly unresolved Lophotrochozoa. Those two anomalies are likely due to trapping in local optima or to the GTR þ C þ I model fitting better than CAT-GTR.
The other analyses agree in the general topology, reproducing the three main bilaterian superclades with some minor disagreements in a few internal nodes. The general lack of high support is also seen in other multigene studies including many taxa and specially rogue species (Dunn et al. 2008), probably reflecting these type of data  being at their limit of resolution with present day models and methods of analyses. Nonetheless, the general agreement in the topology that begins to arise in the analyses done with different markers (as an example Dunn et al. and this work) can be seen as an indicator of the topologies getting closer to the real tree, or said in another way, the new markers effectively add information although do not reach high support for complex data matrices.

Acoelomorpha as Paraphyletic Basal Bilaterians
Both inference methods robustly show a paraphyletic Acoelomorpha as sister group to the other bilaterians. Their relationships to Platyhelminths are rejected by the comparison of topologies with AU test (table 1, hypothesis 3) (table 1). This result confirms earlier studies showing them as a paraphyletic assemblage at the base of the bilaterians (Ruiz-Trillo et al. 1999, 2004Sempere et al. 2007;Wallberg et al. 2007;Paps et al. 2009). This is contrary to two recent phylogenomic reports, the first placing a single acoel species as a sister group to the deuterostomates , an alternative also rejected here by comparison of topologies (table 1, hypotheses 2), and the second showing acoels within a clade that is sister group to the spiralian lophotrochozoans (although with no support, Dunn et al. 2008). The large number of Acoelomorpha included here (five acoels and two nemertodermatids) probably provided greater stability to this clade. Moreover, we recover a partial internal phylogeny of the acoels that is in concordance to a recent systematic proposal based on molecular and sperm structure data (Hooge and Tyler 2006).

Deuterostomia and Xenoturbella
The Deuterostomia appear as a monophyletic clade but with low support in both inference methods ( fig. 2). Deuterostomates split into a robust Chordata and a weak clade including Xenoturbella and a strong Ambulacraria (Hemichordata þ Echinodermata). PhyML using SPR heuristics shows Cephalochordata þ (Urochordata þ Vertebrata), in agreement with recent molecular studies (Bourlat et al. 2006;Philippe et al. 2007;Dunn et al. 2008). The low support for the deuterostomates holds in many recent molecular studies (Bourlat et al. 2006;Delsuc et al. 2006Delsuc et al. , 2008Mallatt and Giribet 2006;Philippe et al. 2007) and can be explained here by the unstable nature of Xenoturbella. When Xenoturbella is removed from the data set, the deuterostomes support increases to 83% Bootstrap Support (BS, results not shown). The best BI tree (data not shown) positions Xenoturbella splitting after Acoela and Nemertodermatida, as sister group to the rest of bilaterians. Moreover, this latter position is not rejected by the AU test (table 1, hypothesis 4), though further data are needed to corroborate their position. It is noteworthy that, although we have no representatives of the Class Ophiuroidea, the other echinoderm classes robustly conform to previous studies (Littlewood et al. 1997).

Ecdysozoa and Chaetognatha
The Ecdysozoa phylogeny shows two main clades: 1) Scalidophora (Priapulida þ Kinorhyncha) plus Nemato-morpha and 2) a clade including Nematoda, Onychophora, Chaetognatha, and Arthropoda. The controversial phylogenetic position of Chaetognatha is one of the biggest conundrums in animal phylogeny. Recent data on Hox cluster genes suggest a new position close to the base of the Bilateria (Papillon et al. 2003), whereas phylogeny based on ribosomal genes (Mallatt and Winchell 2002), mitochondrial DNA (Helfenbein et al. 2004;Papillon et al. 2004) and multigenic approaches (Marletaz et al. 2006;Matus et al. 2006;Philippe et al. 2007) place them as sister group to the protostomates or within that group. Placement of chaetognaths inside the ecdysozoans has recently been suggested , a position also shown in our analyses and supported by the comparison of topologies that clearly rejects chaetognaths as sister group to all the protostomates, to Lophotrochozoa or to Ecdysozoa (table 1, hypotheses 7 to 9).
In our trees, chaetognaths group in a clade together with nematodes, onychophorans, and arthropods. This group, however, is likely a consequence of an LBA artifact. To test it, we ran two ML analyses (data not shown). The first, excluding chaetognaths, placed onychophorans with arthropods with 94% support, whereas the second, excluding onychophorans, positions chaetognaths with nematodes (7%). These results point to an internal LBA effect between nematodes, onychophorans, and the chaetognaths, which can also explain the lack of resolution for the relationships among the main ecdysozoan clades. Despite these problems, arthropods show a reliable internal phylogeny grouping Myriapoda with Chelicerata and Hexapoda with Crustacea, both groups also recovered in other studies (Hwang et al. 2001;Peterson and Eernisse 2001;Dunn et al. 2008;Paps et al. 2009).
A well-supported clade made by Gnathostomulida and Gastrotricha appear as the first splitting lophotrochozoan lineage. A close relationship among gnathostomulans and gastrotrichans has been suggested on basis of morphology (Rieger 1976;Sterrer et al. 1985;Zrzavy et al. 1998) and molecules (Zrzavy et al. 1998;Giribet et al. 2000;Todaro et al. 2006). These two phyla were also recovered as basal lophotrochozoan clades in recent SSU þ LSU analyses (Paps et al. 2009) though a recent ESTs study was not able to sort out their relationships (Dunn et al. 2008). Our trees suggest for them a new position as the most basal lophotrochozoans, in contradiction with other proposals such as Gnathifera (Rieger and Tyler 1995;Ahlrichs 1997)  Our results relate Bryozoa with Rotifera and Acanthocephala. A close relationship between the last two clades has been suggested by morphology (see review in Garey and Schmidt-Rhaesa 1998) and SSU (Winnepenninckx et al. 1995;named Syndermata in Zrzavy et al. 1998). As regards the Bryozoa, early SSU studies showed that bryozoans do not belong to lophophorates (Cohen 2000), and recent ESTs analyses place them close to spiralians Dunn et al. 2008). Our AU test rejects the monophyly of Bryozoa þ Entoprocta if the entoproctan is forced inside the Syndermata þ Bryozoa but does not reject it if the bryozoans are removed from syndermatans and placed with entoproctans as sister group to spiralians (''Spiralia'' þ Brachiozoa þ Platyhelminthes in fig. 2). Therefore, the position of entoprocts as sister group to spiralians is highly supported by our analyses, whereas the position of bryozoans rests unresolved.
Platyhelminthes and Nemertea have traditionally been clustered together, either because they were supposed to share an acoelomate condition (Hyman 1951;and Parenchymia of Nielsen 1995), or because they share some larval features (Nielsen 2001). However, SSU molecular studies and a reassessment of the morphological features of nemertines convincingly showed them to be coelomate animals more related molluscs and annelids than to Platyhelminths (Turbeville et al. 1992). In turn, molecular studies placed the Platyhelminthes either as basal lophotrochozoans (Ruiz-Trillo et al. 1999) or within the Platyzoa (Giribet et al. 2000). Our results show instead Platyhelminthes and Nemertea to branch paraphyletically with a clade of spiralian animals. Moreover, the internal phylogeny for the sampled flatworm classes is robust and highly congruent with the modern systematics of the group (see a review in Baguñà and Riutort 2004).
Monophyly of the Spiralia (animals bearing a spiralquartet cleavage, indicated in fig. 2) has been recovered in many molecular studies. The Spiralia appear paraphyletic due to the inclusion of Phoronida þ Brachiopoda within this group. A clade of phoronids and brachiopods, often named Brachiozoa or Phoronozoa, is increasingly recovered in recent molecular phylogenies (Zrzavy et al. 1998;Cohen 2000;Peterson and Eernisse 2001). The relationship between Brachiozoa and Spiralia has been hinted by recent EST studies (albeit relating them with nemertines rather than molluscs, Dunn et al. 2008) and recent paleontological studies have also pointed out a likely affiliation to molluscs (Morris and Peel 1995). The other spiralian group relates Echiura and Pogonophora with Annelida in agreement with recent studies (Hessling and Westheide 2002;Bleidorn et al. 2003;Struck et al. 2007).

Gene Contribution
It is often suggested that to obtain better resolution, more data are needed, implying more taxa, but most of the time meaning longer sequences (more markers). As is pointed out in the Introduction, the latter is not enough if it falls short regarding the taxon sampling. We wanted to explore the gene contribution to our tree, in order to elucidate if each marker helps to improve or not the phylogeny 2402 Paps et al. resolution. In order to do that, two approaches were used: The first uses the site likelihood values for the ML tree ( fig. 3 and table 2), whereas the second removes one gene at a time to infer a new tree (table 3). Figure 3 plots the site likelihood score of each position into the ML tree; at this point, it is important to remember that the ln L is a negative value and that the closer it is to zero the higher is the likelihood. Curiously there is a ''band'' of sites around anln L value of À2 for all genes that correspond to the conserved sites in the alignment. The rest of the sites have variable Àln L values, with some genes having nearly all the positions with values close to zero (IF, Tropo), and others presenting a greater dispersion (the two ribosomal genes). Table 2 shows the relative length of the gene in the alignment and its contribution to the Àln L of the tree (both in percentage). If all sites of all genes were contributing exactly the same, the two values would be expected to be correlated. We can observe how some genes have higher contribution to the Àln L value than assumed for their length (hence holding a higher proportion of low probability sites) and the other way around. Tropo and IF, as expected from the observations on the graph, have a greater proportion of high probability sites than expected by their length, whereas genes like ALD, myosin, or 18S have a lower proportion. In conclusion, in our data set, more characters do not necessarily mean more information, and this shows that long genes can have many sites with very low probabilities, whereas short genes can hold many highprobability sites.
The site likelihood approach used, though informative, has two flaws: First, the gene contribution is measured on the phylogenetic hypothesis inferred, despite the latter being true or not. Second, the site likelihood calculation enforces all the positions to contribute somewhat to the obtained tree; therefore, any gene adding more noise than information cannot be detected. To work around these limitations, we used another method based on removing one gene at a time from the matrix and evaluating its impact on the BS for different nodes (table 3). We have selected the clades that are widely accepted and at the same time have an intermediate BS value in the original tree, so the values can go either up or down in our experiment. Theoretically, the BS increase in a clade when a gene is removed will be interpreted as a higher proportion of sites in the alignment giving support to the node, whereas a BS decrease means an increase in the proportion of sites not supporting that node. Hence, the observation of a general fall in BS values when removing a marker means that this gene held sites supporting the node, whereas a general BS upraise could be interpreted as the gene not contributing or even having contradictory information.
The removal of the ribosomal genes results in a BS decrease for all the nodes analyzed, and the same holds for ALD. The rest of the genes show a variable proportion of nodes going up or down, with the exception of H3 for which all nodes but one increase. This last result is expected taking into account the profile shown by this molecule in figure 3, where it shows a group of highly conserved sites and presents a gap (between 20 and 40 Àln L), whereas the rest of the sites have low probabilities likely not contributing to the tree. The rest of the genes all seem to be contributing information in some nodes, whereas their removal increases the BS value for others. These groupings are still a simplification, as they do not take into account at which levels the nodes are affected. For instance, when comparing all the tree nodes (data not shown), some genes affect nodes close to the outgroup, whereas others affect values only at the tips.
We can conclude that the addition of genes is highly positive, most of them adding some information to the tree, but the first approach shows that the quantity of information is not rigorously correlated to the length of sequence and the second method shows that not all the genes improve all the nodes of the phylogeny. Therefore, the indiscriminate addition of huge quantities of information does not grant a phylogeny of higher resolution. The idea of removing markers from a matrix is not new and has been already used in previous studies (Philippe et al. 2005). The consequences of these results into the new phylogenomic field warn about  the arbitrary addition of markers to animal molecular matrices. Although when using fast molecular methods adding much information is an added bonus, here we would like to encourage a further filtering of the data before final analyses are carried out. Thus, the simple removal of one gene (H3) results in a BS increase of many nodes. However, that raises a main new question: How to devise objective parameters to evaluate which genes are mostly informative and which ones mostly noisy?

Concluding Remarks
To obtain a robust phylogeny of the bilaterians, a necessary prerequisite seems to get a fair balance between number of taxa and number of characters sampled. In addition, a deep taxon sampling, the use of probabilistic methods and adequate models has helped us to recover a novel bilaterian phylogeny. Our results show, as in most previous studies, the monophyly of Deuterostomia, Protostomia, Lophotrochozoa, and Ecdysozoa. However, it sheds new light on some dark, conflicting areas of the bilaterian tree that appear better resolved than those derived from current ESTs analyses. We mainly refer to the basal relationships of the Lophotrochozoa and to the position of a paraphyletic Acoelomorpha as earliest extant branching bilaterians. Moreover, we also suggest to include the Chaetognatha within ecdysozoans, though further information is needed to settle their position, and new internal relationships for many phyla that match studies focused to solve their internal relationships. On the other side of the coin, our results did not find high support for some regions of the tree such as the status of Xenoturbella, the internal relationships of Ecdysozoa, and some intermediate branches of the Lophotrochozoa. Finally, our analysis on gene contribution points out that more data do not necessarily mean a better resolved phylogeny. Therefore, the markers produced by highthroughput methods must be carefully evaluated before being unsystematically added to the final matrix. We predict that a similarly balanced approach, incorporating better filtered EST collections, and better taxon sampling, will substantially improve our understanding of bilaterian evolution.