Phylogenomic analyses resolve historically contentious relationships within the Palaeognathae in the presence of an empirical anomaly zone

Palaeognathae represent one of the two basal lineages in modern birds, and comprise the volant (flighted) tinamous and the flightless ratites. Resolving palaeognath phylogenetic relationships has historically proved difficult, and short internal branches separating major palaeognath lineages in previous molecular phylogenies suggest that extensive incomplete lineage sorting (ILS) might have accompanied a rapid ancient divergence. Here, we investigate palaeognath relationships using genome-wide data sets of three types of noncoding nuclear markers, together totalling 20,850 loci and over 41 million base pairs of aligned sequence data. We recover a fully resolved topology placing rheas as the sister to kiwi and emu + cassowary that is congruent across marker types for two species tree methods (MP-EST and ASTRAL-II). This topology is corroborated by patterns of insertions for 4,274 CR1 retroelements identified from multi-species whole genome screening, and is robustly supported by phylogenomic subsampling analyses, with MP-EST demonstrating particularly consistent performance across subsampling replicates as compared to ASTRAL. In contrast, analyses of concatenated data supermatrices recover rheas as the sister to all other non-ostrich palaeognaths, an alternative that lacks retroelement support and shows inconsistent behavior under subsampling approaches. While statistically supporting the species tree topology, conflicting patterns of retroelement insertions also occur and imply high amounts of ILS across short successive internal branches, consistent with observed patterns of gene tree heterogeneity. Coalescent simulations indicate that the majority of observed topological incongruence among gene trees is consistent with coalescent variation rather than arising from gene tree estimation error alone, and estimated branch lengths for short successive internodes in the inferred species tree fall within the theoretical range encompassing the anomaly zone. Distributions of empirical gene trees confirm that the most common gene tree topology for each marker type differs from the species tree, signifying the existence of an empirical anomaly zone in palaeognaths.


53
The scaling-up of multigene phylogenetic data sets that accompanied rapid advances in 54 DNA sequencing technologies over the past two decades was at first heralded as a possible end 55 to the incongruence resulting from stochastic error associated with single-gene topologies (Rokas 56 et al. 2003, Gee 2003. However, it soon became clear that conflicting, but highly supported, 57 topologies could result from different data sets when sequence from multiple genes was analyzed 58 as a concatenated supermatrix, leading Jeffroy et al. (2006) to comment that phylogenomics 59 recently coming to signify the application of phylogenetic principles to genome-scale data 60 (Delsuc et al. 2005) could instead signal the beginning of incongruence . On the one hand, 61 these observations highlighted the need for more sophisticated models to account for 62 nonphylogenetic signal such as convergent base composition or unequal rates that can become 63 amplified in large concatenated data sets (Jeffroy et al. 2006). But at the same time, there was a   Warnow 2015) to further strengthen support for this topology relative to that obtained from 160 concatenated sequence data. Additionally, we employ phylogenomic subsampling to investigate 161 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted February 9, 2018. . https://doi.org/10.1101/262949 doi: bioRxiv preprint 8 consistency in the underlying signal for conflicting relationships recovered from species tree 162 versus concatenation approaches, and use likelihood evaluation and coalescent simulation to 163 assess the underlying gene tree support for the recovered species tree topology and the existence 164 of an empirical anomaly zone in palaeognaths. Throughout, we consider the variation in signal 165 among classes of noncoding nuclear markers that are becoming increasingly adopted for  . Table S1). We chose to analyze noncoding sequences primarily because coding regions 176 across large taxonomic scales in birds are known to experience more severe among-lineage 177 variation in base composition than noncoding regions, which can complicate phylogenetic

196
Candidate introns were identified using BEDTools to output coordinates for annotated 197 intron features in the galGal4 genome release that did not overlap with any annotated exon 198 feature. Chicken coordinates for these introns were lifted over to target palaeognaths in the 199 whole-genome alignment as described for CNEEs above, filtered to remove duplicated regions, pairwise sequence identity of 70% and fewer than 0.5 undetermined sites (e.g. gaps and Ns) per 207 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. in the whole-genome alignment was allowed, resulting in a data set of 3,158 UCEs.

218
Blastn searches with NCBI s default somewhat similar parameters (evalue 1e-10, 219 perc_identity 10, penalty -3, reward 2, gapopen 5, gapextend 2, word size 11) were used with 220 query sequence from each of the three kiwi species included in the whole-genome alignment to 221 identify orthologous regions in the North Island brown kiwi (Apteryx mantelli, Le Duc et al.

222
2015), which was not included in the WGA. North Island brown kiwi sequence was added for 223 loci that had consistent top hits across blastn searches, with a single high-scoring segment pair 224 (HSP) covering at least 50% of the input query sequence and minimum 80% sequence identity.

227
Sequence was also added from a reference-based genome assembly of the extinct little 228 bush moa (Anomalopteryx didiformis, Cloutier et al. 2018) that was generated by mapping moa 229 reads to the emu genome included in the whole-genome alignment. Emu coordinates from the 230 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.   beginning with different random number seeds and with ten independent tree searches within 248 each run. The species tree topology was inferred using best maximum likelihood gene trees as 249 input, and node support was estimated from MP-EST runs using gene tree bootstrap replicates as 250 input.

251
ASTRAL-II v. 4.10.9 (Mirarab and Warnow 2015, hereafter ASTRAL ) was also run 252 using the best maximum likelihood gene trees to infer the species tree topology, and the 500 253 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

266
For all three methods, bootstrap supports were placed on the inferred species tree using 267 RAxML, and trees were outgroup rooted with chicken using ETE v. 3 (Huerta-Cepas et al.   UCEs), loci were randomly sampled with replacement to create subsets of 50, 100, 200, 300, 274 400, 500, 1000, 1500, 2000, 2500, and 3000 loci. This process was repeated ten times to create a 275 total of 110 data sets per marker type (e.g. 10 replicates of 50 loci each, 10 replicates of 100 loci, 276 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted February 9, 2018. . https://doi.org/10.1101/262949 doi: bioRxiv preprint 13 etc. for each of CNEEs, introns, and UCEs). Topologies were inferred for bootstrap replicates of 277 each data set using MP-EST, ASTRAL, and ExaML as described above. However, for reasons 278 of computational tractability, 200 rather than 500 bootstrap replicates were used for each method 279 (including ExaML), and MP-EST was run once (rather than three times) for each data set, 280 although still with ten tree searches within each run.

281
Support for alternative hypotheses regarding the sister group to the rheas, and the sister to 282 emu + cassowary, was estimated by first counting the number of bootstraps that recovered each 283 alternative topology from the 200 bootstraps run for each replicate, and then calculating the 284 mean value for each hypothesis across the ten replicates within each data set size category.

286
Analyses of gene tree heterogeneity were conducted using both the best maximum

303
Relative support for each gene tree topology was assessed by computing AIC from the 304 log-likelihood score (lnL) of the inferred gene tree and lnL when the input alignment for each

309
For each locus, we tested the estimated gene tree topology against an a priori candidate set of 310 probable trees that enforced monophyly of the five higher-level palaeognath lineages (kiwi, emu 311 + cassowary, rheas, moa + tinamous, ostrich), but allowed all possible rearrangements among 312 those lineages (for a total of 105 trees in the candidate set, plus the gene tree topology itself if it 313 did not occur within this set). We also tested gene trees against a second set of candidate 314 topologies using the same criteria as above, but additionally allowing all possible rearrangements 315 within a monophyletic tinamou clade (for 1,575 candidate topologies). For each gene, in 316 addition to the reported P-value for the fit of the species tree topology, we also calculated the 317 rank of the species tree topology relative to all tested candidates from P-values sorted in 318 ascending order.

319
The proportion of observed gene tree heterogeneity consistent with coalescent variation 320 was estimated through simulation. For each marker type, we estimated ultrametric species tree 321 branch lengths in mutational units (µT, where µ is the mutation rate per generation and T is the 322 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted February 9, 2018.

391
certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted February 9, 2018. . https://doi.org/10.1101/262949 doi: bioRxiv preprint Coalescent species tree methods, but not concatenation, recover congruent palaeognath 392 relationships 393 Using genome-wide data sets of 12,676 CNEEs, 5,016 introns, and 3,158 UCEs, we 394 recover fully congruent topologies across all marker types and for the combined total evidence 395 tree using MP-EST and ASTRAL coalescent species tree approaches (Fig. 1, Suppl. Fig. S1a- sister group relationship to kiwi as is seen for MP-EST and ASTRAL, but with casuariiforms 407 instead placed as the sister to moa + tinamous with 60% support for CNEEs and 96% support for 408 UCEs (Suppl. Fig. S1).

409
Robustness of the underlying signal for these inconsistently recovered relationships was analyses rapidly accumulate support for a sister-group relationship between rheas and 414 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.  Fig. 2a-c, Fig. 3a,j), and with support for 416 alternative hypotheses sharply dropping off for replicates with greater than 200 loci. Support 417 accumulates more slowly for ASTRAL, but the hypothesis of rheas as sister to emu/cassowary + 418 kiwi clearly dominates and support for alternatives declines in replicates with more than 1000 419 loci for all markers (Fig. 2d-f, Fig. 3). In contrast, subsampling replicates are less consistent for 420 relationships inferred under concatenation with ExaML ( Fig. 2g-i, Fig. 3). In particular, CNEEs 421 oscillate between recovering rheas as the sister to moa + tinamous, or the sister to all other non-422 ostrich palaeognaths, although always with low bootstrap support (Fig. 2g, Fig. 3g). The other 423 two marker types more clearly support the topology recovered from full data sets with ExaML 424 that place rheas as sister to the remaining non-ostrich palaeognaths, although bootstrap support 425 for UCE replicates is generally weak (Fig. 2h,i; Fig. 3).

426
Subsampling provides even more robust support for emu + cassowary as the sister to 427 kiwi, with both MP-EST and ASTRAL quickly accumulating support for this clade and with 428 rapidly declining support for all other hypotheses (Suppl. Fig. S2a-f, Suppl. Fig. S3). ExaML 429 intron replicates also steadily accumulate support for this relationship with an increasing number 430 of loci (Suppl. Fig. S2h, Suppl. Fig. S3h,q). The alternative hypothesis of emu + cassowary as 431 sister to moa + tinamous, which is favored by CNEEs and UCEs analyzed within a concatenation 432 framework, is not well supported by subsampling, where conflicting topologies characterized by 433 low support are recovered across ExaML replicates (Suppl. Fig. S2g,i; Suppl. Fig. S3).

435
Patterns of CR1 retroelement insertions corroborate both the inferred species tree 436 topology from MP-EST and ASTRAL and the existence of substantial conflicting signal 437 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted February 9, 2018. . https://doi.org/10.1101/262949 doi: bioRxiv preprint consistent with incomplete lineage sorting across short internal branches. In total, 4,301 438 informative CR1 insertions were identified from multispecies genome-wide screens, the vast 439 majority of which (4,274 of 4,301, or 99.4%) are entirely consistent with relationships inferred 440 from sequence-based analyses ( Fig. 4a; analysis here was restricted to species in the whole-441 genome alignment, and little bush moa and North Island brown kiwi are therefore not included).

442
Not surprisingly, we identify many more insertion events occurring along shallower branches 443 with longer estimated lengths and fewer insertions along shorter branches that form the backbone 444 of the inferred species tree (refer to Fig. 1 for estimated branch lengths).

445
Of the 27 (0.6%) CR1s that are inconsistent with the species tree topology, two conflict 446 with the inferred relationships within kiwi, and nine contradict relationships among tinamous 447 (Fig. 4b,c). However, in each case, conflicting CR1s are far outweighed by markers that support

478
Distributions of estimated gene tree topologies illustrate that the most common topology 479 for each marker type is not the species tree topology inferred by MP-EST and ASTRAL, thereby 480 suggesting the existence of an empirical anomaly zone (Fig. 5a-c). While the ranking of specific 481 gene tree topologies differs across marker types, common to these anomalous gene trees that 482 occur at higher frequency than the species tree topology is the fact that both the shallowest clades 483 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted February 9, 2018. . https://doi.org/10.1101/262949 doi: bioRxiv preprint as well as the deepest split between the ostrich and all other palaeognaths are maintained 484 throughout (Fig. 5d). Rearrangements of AGTs relative to the MP-EST and ASTRAL species 485 tree topology instead involve the two short internal branches forming the common ancestor of 486 emu and cassowary with kiwi, and with this clade to rheas (Fig. 5d).

487
To more fully investigate the observed gene tree heterogeneity, we considered all with rheas means that the emu/cassowary + kiwi + rhea clade is actually recovered less often 499 than alternatives.

500
We next considered whether topological differences between estimated gene trees and the 501 species tree are well supported, or are instead likely to primarily reflect gene tree estimation 502 error. Mean bootstrap support for estimated gene trees is relatively high, especially for introns 503 and UCEs (83.9% and 82.8% respectively, Fig. 7a-c). However, average support falls by about 504 10% for each marker type when gene tree bootstrap replicates are constrained to the species tree 505 topology, with P < 0.0001 for paired t-tests of each data set. These results suggest that 506 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted February 9, 2018. . https://doi.org/10.1101/262949 doi: bioRxiv preprint differences from the species tree are broadly supported by variation in sequence alignments for 507 individual loci. To test this further, we compared the difference in Akaike information criterion 508 (AIC) for estimated gene trees to the AIC obtained when the sequence alignment for each gene 509 was constrained to the species tree topology. Approximately 80% of CNEEs have AIC (gene 510 tree-species tree) less than -2, indicating substantial support in favor of the gene tree topology 511 relative to that of the species tree (Burnham and Anderson 2002), while the proportion was even 512 greater for introns and UCEs (approximately 90% with AIC < -2, Fig. 7d). Despite this result, 513 approximately unbiased (AU) tests typically failed to reject a hypothesis that the data fit the 514 species tree topology, with only about 20% of introns and UCEs, and 30% of CNEEs rejecting 515 the species tree topology at P < 0.05 (Fig. 7d). However, the species tree topology is also not 516 commonly amongst the top 5% of candidate alternative topologies when these alternatives are 517 ranked according to increasing AU test P-value within each locus (Fig. 7d).

518
In keeping with the results for all loci, gene tree topologies are also generally supported 519 for loci falling within AGT groups. Support for individual gene trees is somewhat weak for 520 CNEEs, with low median bootstrap support and few substitutions occurring along branches that 521 conflict with the species tree topology (Suppl. Fig. S4a,d), which is consistent with the shorter 522 average alignment length and lower variability of these loci (Suppl. Fig. S5). However, support 523 is much stronger for introns and UCEs, with most loci having bootstrap support above 50% for 524 conflicting clades and AIC < -2 indicating much stronger likelihood support for the recovered 525 gene tree topology relative to that obtained if sequence alignments are constrained to match the 526 inferred species tree (Suppl. Fig. S4).

527
Simulations were used to further assess what proportion of total gene tree heterogeneity 528 is likely attributable to coalescent processes rather than to gene tree estimation error (Fig. 8).
529 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted February 9, 2018. . https://doi.org/10.1101/262949 doi: bioRxiv preprint Using either Robinson-Foulds (RF) distances or the matching cluster distance, which is 530 influenced less by the displacement of a single taxon than is the RF metric (Bogdanowicz et al.

532
ASTRAL for each marker type, we find that coalescent processes alone can account for more 533 than 70% of the observed gene tree heterogeneity in most comparisons, and >90% for introns 534 when gene trees are simulated from MP-EST coalescent branch lengths (Fig. 8).

581
In parallel, there has been a renewed focus on palaeognath phylogeny throughout the past

590
We recover a topology placing rheas as sister to emu/cassowary + kiwi that is congruent 591 across all analyses for MP-EST and ASTRAL species tree methods, but that differs from the 592 placement of rheas as sister to the remaining non-ostrich palaeognaths when concatenated data 593 are analyzed. These conflicting topologies are recovered with maximal support in at least some 594 data sets for both coalescent and concatenation analyses. However, subsampling approaches that

596
Edwards 2016b) provide more robust support for the coalescent species tree, and this topology is 597 further corroborated by patterns of CR1 retroelement insertions from multiway genome-wide 598 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted February 9, 2018. . https://doi.org/10.1101/262949 doi: bioRxiv preprint screening. However, conflicting CR1 insertions suggest extensive ILS across short internal 599 branches separating major palaeognath lineages, and coalescent lengths for these pairs of 600 branches fall within the theoretical range expected to produce anomalous gene trees. We indeed 601 find that the most common gene tree for each marker type does not match the inferred species 602 tree topology, consistent with an empirical anomaly zone in palaeognaths.

603
Although we contrast results from concatenation and coalescent species tree methods, we    constrained to the species tree topology and clades that conflict with the species tree tend to have 703 at least 50% median bootstrap support, suggesting that gene tree heterogeneity is real rather than 704 reflecting gene tree estimation error alone. However, we concur that short internodes pose 705 substantial challenges to accurate gene tree inference, and investigations of the empirical 706 anomaly zone should greatly benefit from algorithms that make 'single-step' coestimation of 707 gene trees and species trees scalable to phylogenomic data sets.

708
In conclusion, we find strong evidence that past difficulty in resolving some palaeognath 709 relationships is likely attributable to extensive incomplete lineage sorting within this group, and 710 that species tree methods accommodating gene tree heterogeneity produce robustly supported

711
topologies despite what appears to be an empirical anomaly zone. We echo the sentiments of 712 other authors that high bootstrap support alone is an inadequate measure of confidence in 713 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted February 9, 2018. . https://doi.org/10.1101/262949 doi: bioRxiv preprint inferred species trees given increasingly large phylogenomic data sets (Edwards 2016b certainly not a new idea in phylogenetics, but an increasing emphasis on its importance in the era 720 of species trees will continue to advance the field beyond reports of highly supported, but often 721 discordant, 'total evidence' topologies toward a more nuanced 'sum of evidence' approach that 722 considers not just which topologies are produced but also what they can tell us about the 723 underlying evolutionary processes and our attempts to model them.                    Table S2).  The copyright holder for this preprint (which was not this version posted February 9, 2018. . https://doi.org/10.1101/262949 doi: bioRxiv preprint