A Spatially Explicit Model of Stabilizing Selection for Improving Phylogenetic Inference

Abstract

Ultraconserved elements (UCEs) are stretches of hundreds of nucleotides with highly conserved cores flanked by variable regions. Although the selective forces responsible for the preservation of UCEs are unknown, they are nonetheless believed to contain phylogenetically meaningful information from deep to shallow divergence events. Phylogenetic applications of UCEs assume the same degree of rate heterogeneity applies across the entire locus, including variable flanking regions. We present a Wright–Fisher model of selection on nucleotides (SelON) which includes the effects of mutation, drift, and spatially varying, stabilizing selection for an optimal nucleotide sequence. The SelON model assumes the strength of stabilizing selection follows a position-dependent Gaussian function whose exact shape can vary between UCEs. We evaluate SelON by comparing its performance to a simpler and spatially invariant GTR+Γ model using an empirical data set of 400 vertebrate UCEs used to determine the phylogenetic position of turtles. We observe a substantial improvement in the model fit of SelON over the GTR+Γ model, and support for turtles as sister to lepidosaurs. Overall, the UCE-specific parameters SelON estimates provide a compact way of quantifying the strength and variation in selection within and across UCEs. SelON can also be extended to include more realistic mapping functions between sequence and stabilizing selection, as well as to allow for greater levels of rate heterogeneity. By more explicitly modeling the nature of selection on UCEs, SelON and similar approaches can be used to better understand the biological mechanisms responsible for their preservation across highly divergent taxa and long evolutionary time scales.
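
To make the idea of a position-dependent Gaussian selection profile concrete, the following is a minimal sketch of one way such a profile could be written. The function name `gaussian_sensitivity` and the parameter names `magnitude`, `center`, and `width` are assumptions introduced for illustration only; the exact parameterization used by SelON is not given in this section.

```python
import math

def gaussian_sensitivity(position, magnitude, center, width):
    """Hypothetical Gaussian profile for the strength of selection at a site.

    The value peaks at `center` (where it equals `magnitude`) and decays
    symmetrically toward the flanks at a pace set by `width`.
    """
    return magnitude * math.exp(-((position - center) ** 2) / (2.0 * width ** 2))

# Toy locus of 9 sites with its conserved core at the middle site (index 4).
# These numbers are illustrative and are not estimates from any UCE.
profile = [
    gaussian_sensitivity(i, magnitude=2.0, center=4, width=1.5)
    for i in range(9)
]
```

Under this assumed form, a narrow `width` would describe a locus whose conservation is confined to a small core, while a wide one would describe selection that extends far into the flanks.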


This PDF file includes:
Figures S1 to S8

Fig. S1. Simulation results demonstrating that one unit of branch length under our SelON model represents one expected substitution per site. The generating model was based on the parameter estimates and site lengths from 22 randomly selected UCEs from our full analysis of the turtle dataset, which cumulatively produced data sets with 10,000 sites. We assumed a single starting lineage, drawing ancestral states from the equilibrium base frequencies for each site. We then incremented the time by 0.001 expected substitution units and recorded the cumulative number of substitutions across sites. We repeated this process 20 times. As expected, the observed average number of substitutions per site matched the expected number exactly.
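
The logic of this check can be sketched with a much simpler stand-in model. The example below is not the SelON simulation: it assumes a Jukes-Cantor process scaled so that each site leaves its current state at total rate 1, and it draws exact exponential waiting times instead of stepping time in increments of 0.001. The function name `mean_substitutions_per_site` and the branch length of 0.5 are hypothetical choices made for illustration.

```python
import random

def mean_substitutions_per_site(n_sites, branch_length, rng):
    """Simulate a rate-1 Jukes-Cantor process and count substitutions.

    Each site leaves its current state at total rate 1, so a branch of
    length t should accumulate about t substitutions per site on average.
    """
    bases = "ACGT"
    total = 0
    for _ in range(n_sites):
        state = rng.choice(bases)       # uniform equilibrium frequencies under JC
        elapsed = rng.expovariate(1.0)  # waiting time to the first substitution
        while elapsed <= branch_length:
            # Move to one of the three other bases with equal probability.
            state = rng.choice([b for b in bases if b != state])
            total += 1
            elapsed += rng.expovariate(1.0)
    return total / n_sites

rng = random.Random(1)  # fixed seed so the run is repeatable
observed = mean_substitutions_per_site(n_sites=10_000, branch_length=0.5, rng=rng)
```

With 10,000 sites the observed average should fall close to the branch length used, which is the property the figure tests for SelON.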

Fig. S2. Summary of the simulation where the generating model was based on the parameter estimates from a randomly selected set of 22 UCEs. (A-C) The parameter estimates associated with the magnitude, width, and centering of the sensitivity-to-selection distributions, and (D) the global mutation rates following the UNREST model of nucleotide substitution. The topology used for the simulation is shown in Figure 3A in the main text. The dashed line reflects the 1:1 line.
Fig. S3. Summary of the simulation where the generating model was again based on the parameter estimates from a randomly selected set of 22 UCEs, but with the topology shown in Figure 3B of the main text. The dashed line reflects the 1:1 line.
Fig. S4. The impact of using an outgroup taxon for two scenarios testing whether the distribution of long branches (0.10 expected substitutions/site) and short branches (0.025 expected substitutions/site) within a tree can impact branch length estimates. The columns show the branch lengths estimated under SelON without an outgroup (panels A and C) and under SelON with an outgroup that was subsequently removed (panels B and D). The tree with the thick branches represents the generating model, and each transparent line represents the estimates from an individual simulation. We suspect that the difficulties SelON had in properly estimating the lengths of the two descendant branches from the root are partly an identifiability issue.
Fig. S5. Cladograms and phylograms summarizing the inferred phylogeny for determining the placement of turtles relative to archosaurs (birds+crocodiles, or the "turtle-archosaur" alliance) and lepidosaurs (lizards+tuataras, or the "Ankylopoda" hypothesis) under (A,C) GTR+Γ and (B,D) SelON. When comparing topologies, there was overwhelming support under GTR+Γ for turtles being sister to archosaurs (turtle-archosaur alliance). SelON not only provided an extraordinary improvement in overall fit compared to GTR+Γ, but also indicated stronger support for the Ankylopoda hypothesis (i.e., turtles sister to lepidosaurs).
Fig. S7. The site-wise patterns (red lines) of topological support (support defined as $\Delta \ln L = \ln L_{\mathrm{TAA}} - \ln L_{\mathrm{AH}}$) under (A) GTR+Γ and (B) SelON model fits to the full 400 UCE empirical dataset. With GTR+Γ, support for the turtle-archosaur relationship was driven by both the lowest-rate and highest-rate sites, based on a ranking of the weighted-average rate at a site. Support also steadily increased with distance from the presumptive conserved center (determined by the location inferred under SelON). With SelON, the Ankylopoda hypothesis was generally supported regardless of weighted-average rate and distance from the center.
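
The support measure in this caption is a per-site difference in log-likelihood between two topologies, which can be sketched as below. The sketch reads TAA as the turtle-archosaur alliance and AH as the Ankylopoda hypothesis. The function names, the toy log-likelihoods, and the per-site rates are all hypothetical, and accumulating support as a running total over ranked sites is an assumption about one possible summary, not a description of how the figure was drawn.

```python
def sitewise_support(lnl_taa, lnl_ah):
    """Per-site support: positive values favor TAA, negative values favor AH."""
    if len(lnl_taa) != len(lnl_ah):
        raise ValueError("both topologies must be scored on the same sites")
    return [a - b for a, b in zip(lnl_taa, lnl_ah)]

def cumulative_support(delta, ranking):
    """Running total of support with sites ordered by a ranking variable."""
    order = sorted(range(len(delta)), key=lambda i: ranking[i])
    running, total = [], 0.0
    for i in order:
        total += delta[i]
        running.append(total)
    return running

# Made-up values for four sites (not taken from the empirical dataset).
lnl_taa = [-2.0, -3.5, -1.0, -4.0]
lnl_ah = [-2.5, -3.0, -1.0, -3.0]
rate = [0.5, 2.0, 0.25, 1.0]  # hypothetical per-site rates used only for ranking

delta = sitewise_support(lnl_taa, lnl_ah)
curve = cumulative_support(delta, rate)
```

Ranking by a different variable, such as distance from the conserved center, only requires passing a different `ranking` list.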

Fig. S8. The same site-wise patterns (red lines) of topological support (support defined as $\Delta \ln L = \ln L_{\mathrm{TAA}} - \ln L_{\mathrm{AH}}$) under the GTR+Γ model fit as in Fig. S7, but with branch lengths and model parameters estimated individually across the full 400 UCE empirical dataset.