Leaping through Tree Space: Continuous Phylogenetic Inference for Rooted and Unrooted Trees

Abstract Phylogenetics is now fundamental in life sciences, providing insights into the earliest branches of life and the origins and spread of epidemics. However, finding suitable phylogenies from the vast space of possible trees remains challenging. To address this problem, for the first time, we perform both tree exploration and inference in a continuous space where the computation of gradients is possible. This continuous relaxation allows for major leaps across tree space in both rooted and unrooted trees, and is less susceptible to convergence to local minima. Our approach outperforms the current best methods for inference on unrooted trees and, in simulation, accurately infers the tree and root in ultrametric cases. The approach is effective in cases of empirical data with negligible amounts of data, which we demonstrate on the phylogeny of jawed vertebrates. Indeed, only a few genes with an ultrametric signal were generally sufficient for resolving the major lineages of vertebrates. Optimization is possible via automatic differentiation and our method presents an effective way forward for exploring the most difficult, data-deficient phylogenetic questions.


Introduction
Phylogenetic inference, the task of reconstructing the evolutionary relationships across taxonomic units given observational data, has a wide range of theoretical and practical applications in biology, such as evolution (Cavalli-Sforza and Edwards 1967;Felsenstein 2004;O'Meara 2012), conservation (Rolland et al. 2011) and epidemiology (Grenfell et al. 2004;Faria et al. 2021), but also in comparative linguistics (Mace and Holden 2005) and cultural anthropology (Collard, Shennan, and Tehrani 2006;Morrison 2014).In particular, the COVID-19 pandemic has catalysed the development of efficient phylogenetic tools and methods to better understand the virus' origin, spread, and evolution (Lemoine et al. 2021;O'Toole et al. 2021;Attwood et al. 2022;T. Sanderson 2022;Turakhia et al. 2022;Voznica et al. 2022;De Maio et al. 2023).For biological problems, tree inference is primarily informed by molecular sequence data (i.e., nucleotide or amino acid sequences), for which an extensive body of literature exists (M.J. Sanderson and Shaffer 2002;Yang 2006;Yang and Rannala 2012).Other types of biological data such as morphology (Lee and Palci 2015), fossils (Morlon, Parsons, and Plotkin 2011), and auditory communication in animals (Arato and W. T. Fitch 2021) can also be used as input.
Two key parameters considered when inferring a phylogenetic tree include the topology, the branching pattern that specifies the evolutionary relationships between operational taxonomic units, and branch lengths, the amount of evolutionary divergence that occurred between the branching events.A substantial amount of research has been conducted on how to parameterise branch lengths (Bromham and Penny 2003;Dos Reis, Donoghue, and Yang 2016), especially through the use of various molecular clocks (Zuckerkandl and Pauling 1962).Similarly, although to a lesser degree, progress has been made on methods for efficient exploration of the space of tree topologies (Stamatakis 2014), which is fundamentally challenging due to its combinatorial complexity.Indeed, for n taxa, there are (2n−3)!! possible rooted tree arrangements, where n!! denotes the semifactorial of n -even a small dataset of ten taxa can be enumerated by 34 million unique rooted trees.Moreover, finding the global optimal tree is NP-hard for all major optimality criteria (e.g., maximum parsimony (Les R Foulds and Graham 1982), minimum evolution (Day, Johnson, and Sankoff 1986), maximum likelihood (Roch 2006)).
However, such methods still have limitations.First, hill climbing using heuristic approaches necessitates multiple evaluations of the objective function to pick the best move.While these heuristics are still polynomial, exhaustive exploration of single SPR operations is quadratic in complexity, and paired operations (two sequential SPR changes) are quartic.
Second, all the aforementioned tree arrangements are prone to being trapped in local optima and even if a global optimum is found, terraces of trees with identical quality exist (M.J. Sanderson, McMahon, and Steel 2011).The challenge of exploring tree space is exacerbated when concatenating multiple genes in supermatrices (Rokas et al. 2003;Queiroz and Gatesy 2007;Chernomor, Haeseler, and Minh 2016) or when using genomic-scale datasets which require extensive computational resources.
To better explore the space of possible trees, we expand the space over which we need to search.Our novel vector representation of a phylogenetic tree, Phylo2Vec (Penn et al. 2023), has a natural continuous extension, allowing us to improve the ability to search parts of this space.Appealing to a common analogy that casts the optimal tree search problem as finding a needle in a haystack, our approach observes a much bigger haystack, but the hay is in very large bundles, many of whom have a needle, and for these bundles we have access to a (weak) magnet.Providing details to this analogy, the size of the usual phylogenetic haystack with n taxa is (2n − 3)!! (Felsenstein 1978b), while we search a much larger haystack of size (n!) 2 .There are n! bundles in this larger haystack, each of which contains n! trees, but for any tree, 2 n−1 bundles will contain that tree.Although the proportion of bundles containing a needle shrinks exponentially, we propose a novel approach (Queue Shuffle) that chooses bundles that should be closer to one with a needle.For any given bundle, we also introduce a continuous objective function that can be efficiently traversed using gradient descent approaches (the weak magnet) developed for large-scale machine learning problems (Kingma and Ba 2015;Loshchilov and Hutter 2019; Table 1.Balanced minimum evolution loss scores for 11 phylogenetic benchmark datasets.Lower is better.Scores from BioNJ and FastME were obtained following the implementations in ape (Paradis, Claude, and Strimmer 2004) using the same distance matrix as GradME.The distance matrix was estimated from a GTR+Γ model via maximum likelihood (Yang 2006).Our GradME approach always starts from a uniform tree distribution (every tree is equiprobable) with a random taxon ordering (optimised by Queue Shuffle).The best performing approaches for each dataset are denoted in bold.GradME either equalled or performed better than FastME.The topological accuracy, measured as one minus the Robinsons-Foulds distance is shown between GradME and FastME and GradME and a maximum likelihood gold standard from IQ-TREE also using a GTR+Γ model.Starting from a random tree, represented by an n×n stochastic matrix, we compute the continuous gradient, apply softmax activation and increment the original matrix.In a single step, our gradient finds the correct tree at a distance of 6 subtree-prune and regraft moves from the random starting tree.(b) Simulating ultrametric trees of 20 taxa and 100,000 sites under an LG model of protein evolution.We add random uniform noise to all branch lengths to simulate departures from ultrametricity.Compared to the true tree via Robinson-Foulds distance, light blue bars are midpoint rooting the best FastME tree and dark blue bars are the inferred root from our approach.(c) Phylogenies for jawed vertebrates, where the number of genes (hence sites) are reduced to be more clocklike.Normalised Robinson-Foulds distance are shown between the best ASTRAL (Chao Zhang et al. 2018) tree, the best unrooted FastME tree which has been midpoint rooted (light blue) and our inferred rooting algorithm (dark blue).Performance for FastME reduces when the number of sites is small.
A comparison to benchmark phylogenetic data sets.Table 1 presents a comparison of GradME with neighbour joining (BioNJ) and FastME (subtree-prune and regraft version) over 11 popular phylogenetic benchmark datasets (Whidden and Matsen 2015).Both neighbour joining and FastME are only able to infer a minimum length unrooted tree, and therefore we compare estimates only on unrooted trees.We always initialise our algorithm with a uniform, equiprobable tree, where the starting taxon labelling is random and optimised using Queue Shuffle.We estimate tree using distances from a GTR+Γ model estimated via maximum likelihood (see Appendix I for details).As expected, FastME consistently outperforms BioNJ, with lower BME loss on all alignments.On the other hand, GradME always achieves a better or equal loss compared to FastME.We observe similar results when using different substitution models (e.g., F81).In the two examples where GradME does better than FastME, the topological accuracy, measured by one minus the Robinson-Foulds distance (Robinson and Leslie R Foulds 1981) is close to 0.9, suggesting FastME has converged to a similar tree.
We note that FastME's performance is generally worse when using the nearest neighbour interchange heuristic (instead of the SPR-based heuristic).When compared to a maximum likelihood gold standard (IQ-TREE (Minh et al. 2020)), the best distance method does not recover the same tree as that from maximum likelihood, but in some cases, is very close (e.g., DS3 and DS7).Finally, we note that while GradME outperforms FastME, it is orders of magnitude slower and in most of the data sets FastME finds the same optimal tree as GradME.
Rooting ultrametric trees.Despite being applicable to the unrooted problem, our approach, at its core, works with rooted trees.As previously discussed, if we assume the existence of a distant outgroup, then the balanced minimum evolution objective can be used to optimise a rooted phylogenetic tree.In Appendix A, we show that, given an ultrametric unrooted tree, the optimal rooting maximises a heuristic for the root-to-tip distance in the tree.Equivalently, the optimal rooting ensures that the root is estimated to be the maximal possible distance back in time.This is not an immediately biologically plausible objective for the root.Indeed, the cornerstone of balanced minimum evolution is finding the tree of minimum length, and it hence seems counter-intuitive to require the root that is the maximum distance backwards in time (though this does in fact minimise the tree length).However, our assumption of a distant ancestor means that the root of our tree must be the point that is furthest backwards in time.In particular, this means that the evolutionary direction needs to be away from the root.By setting our root such that the root-to-tip distance is maximised, we ensure that the root satisfies this constraint.However, this property does not hold for trees that are not ultrametric -in these cases, the root will be drawn towards branches with higher mutation rates.
While this property only holds for ultrametric trees, our approach still works well for near clock-like trees.As an experiment, we draw small (20 taxa) random ultrametric phylogenies with a total length of one, and simulate 100,000-residue protein sequences (Paradis, Claude, and Strimmer 2004) down these trees under an LG (Lee and Palci 2015) model of protein evolution, assuming random uniform amino acid base frequencies.In the ultrametric cases, all taxa are equidistant to the root, which corresponds to a strict molecular clock.We add uniform noise to all branch lengths to simulate departure from a strict clock.Fig. 1b shows the Robinson-Foulds (Robinson and Leslie R Foulds 1981) distance from the true tree to the midpoint-rooted best unrooted FastME tree (when SPR moves were used by FastME), and the distance to our inferred rooted tree.We see that when the tree is ultrametric, or close to ultrametric, our approach recovers the correct rooted tree.As expected, an increase in noise leads to a decrease in topological accuracy, although our approach still performs substantially better than midpoint rooting.We note that uniform noise is unlikely to be biologically realistic.
Instead, deviations from a strict clock are more likely to be heterogeneous in certain clades or internal branches.However, for small departures, we believe our algorithm to reliably infer the correct tree and root simultaneously.
We implement our rooting algorithm on the popular mammal data from (Song et al. 2012).We infer a rooted tree via Queue Shuffle and also midpoint root the best FastME tree.Both trees, unrooted, have the same balanced minimum evolution loss, but our rooted loss is less than the FastME midpoint rooted loss.Our rooted tree correctly identifies Gallus gallus (red junglefowl) as the outgroup, while midpoint rooting pairs Gallus gallus with Ornithorhynchus Anatinus (platypus) (see Fig. S2 for the rooted phylogenies).Figure 2. Phylogenetic inferences of the jawed vertebrates' phylogeny using the two most ultrametric loci from a data set of 99 taxa and 4593 genes (Irisarri et al. 2017).(a) Inference using our approach leads to high accuracy in identifying the root and all major jawed vertebrate taxa.Note that, we do not estimate branch lengths, but only topology via balanced minimum evolution (b) inference using FastME and midpoint rooting leads to widespread error, primarily and critically near the root of the process.
Rooting the phylogeny of all jawed vertebrates.To perform a more detailed evaluation of our framework, we tested GradME's robustness for topological inference by finding the root of the large jawed vertebrates dataset from (Irisarri et al. 2017) with 99 taxa and 4593 genes.Given the reliance of our method on ultrametric data for inference of the root, we first made a fast measure of the ultrametricity of each gene-tree.To do this, we inferred the phylogeny of each gene using GradME, followed by midpoint rooting.The coefficient of variation in root-to-tip lengths was taken as a measure of ultrametricity.We then concatenated ranked genes into supermatrices including decreasing numbers of genes, and examined the performance of GradME with midpoint rooting against our method.All inferences were performed using the LG amino-acid substitution model to maintain simplicity.We placed special focus on our ability to use small portions of data for recovering the main groupings of vertebrates; these key groupings include the root separating cartilaginous (Chondrichthyes) versus boned vertebrates, ray-finned fishes (Actinopterygii), and the major groups of tetrapods and amniotes (amphibians, mammals, archosaurs, turtles, and lizards and relatives).
A small numbers of genes with an ultrametric signal were generally sufficient for resolving many of the major lineages of vertebrates using both midpoint rooting and our approach (Fig. 1c).For larger numbers of genes, midpoint rooting and our approach are broadly similar.However, at the smallest numbers of genes (0.05%, 2 genes), midpoint rooting was unable to recover many of the early relationships among vertebrates, such as the root, monophyly of cartilaginous fishes, ray-finned fishes, Tetrapoda, or the mammals.Even small amounts of data (1460 amino-acids of 1,964,439; 0.07% of the original data) were sufficient for GradME to resolve the root as well as every major grouping of jawed vertebrates accurately (Fig. 2).The only exception was the controversial position of the Coelacanth, which was found to be sister of Dipnoi (lungfish) rather than the more widely accepted position as sister of Dipnoi plus Tetrapoda.While this remarkable performance under the simple LG model is in part attributed to the informative nature of highly ultrametric genes, our tree topology demonstrates the superiority of our approach in accuracy and efficiency over other fast methods in phylogenetics.

Discussion
We have introduced a new approach for exploring the vastness of tree space.Counterintuitively, our approach explores a much bigger space than the space of possible trees, but this larger space allows for new ways to find the best tree.The key to our method's success lies in transforming the phylogenetic tree search problem from a discrete to a continuous one, allowing us to achieve superior performance.To our knowledge, this is the first time a continuous, differentiable objective function for the inference of tree topology has been proposed, and it opens new possibilities for phylogenetic inference.Bayesian phylogenetics can be regarded as the most robust framework for inferring phylogenies, but has been to-date limited by the poor ability of random walk Metropolis-Hastings algorithms to explore tree space (Betancourt 2018).More efficient Hamiltonian Monte Carlo samplers have been proposed (Dinh et al. 2017) to tackle this problem, and our framework presents a new avenue to jointly explore topology and branch lengths with efficient samplers.A remaining limitation of our approach is the need to shuffle labels to fully explore the space of all possible trees, and while the approach we use, Queue Shuffle, is mathematically and practically powerful, this step is still discrete.The possibility of permutation distributions such as the Gumbel-Sinkhorn distribution could allow for a fully differentiable algorithm.
Finally, the complexity of our approach is O(n 5 ), which easily allows for large phylogenies up to a thousand, but not tens of thousands.However, computation on GPUs or TPUs in parallel can facilitate computational tractability.
A major benefit of our approach is that it naturally enables the estimation of the root node, which has been a long matter of interest in the biological sciences (Huelsenbeck, Bollback, and Levine 2002;Tria, Landan, and Dagan 2017;Naser-Khdour, Quang Minh, and Lanfear 2022).For genes where a strict clock is a reasonable assumption, our method of traversing tree space in large steps reliably estimates both the correct tree topology and the root.Our approach will likely be useful in settings where genetic sequences are contemporaneous and time for measurable evolution is short, such as early epidemics or nosocomial settings.However, as we showed analytically, our approach will have reduced performance when considering rate heterogeneity and departures from a strict clock.
Tests on the relationships among jawed vertebrates demonstrate that even minimal amounts of data can be sufficient for our method to reach high accuracy in topology and root estimates.These results are consistent with previous work on large amounts of genome-scale data showing that clocklike loci to be the most suitable for phylogenetic inference (Vankan, Ho, and D. A. Duchêne 2022).Furthermore, our approach is effective with negligible amounts of data -where other methods are ineffective -making it a powerful addition to the existing toolkit for addressing recalcitrant questions of the tree of life.
Our approach is based on the minimum evolution principle, which has repeatedly shown to produce fast and accurate inference.Nonetheless, an interesting area for further study is to extend the continuous path length formulation to approximations of traditional phylogenetic likelihoods (Felsenstein 1983).This would be particularly beneficial for implementation in Bayesian inference, since tree topology inference is a major obstacle to large hierarchical models (Suchard et al. 2003;S. Duchêne et al. 2016).Our method is therefore a step towards more efficient sampling of the complex posterior distributions over tree topology.

Methods
In the following, we describe GradME, a distance-based method for continuous phylogenetic inference of rooted and unrooted trees using gradient descent.The framework can be divided into three components: i) a continuous tree representation based on Phylo2Vec (Penn et al. 2023), a bijective integer representation of phylogenetic trees; ii) gradient-based optimisation using a continuous version of the balanced minimum evolution criterion (Pauplin 2000;Desper and Gascuel 2002), iii) Queue Shuffle, a method to shuffle the integer-to-taxon mapping underlying Phylo2Vec for full tree space exploration.The overall approach works for both rooted and unrooted trees.
Balanced minimum evolution.Popular objective functions to infer the optimal tree from phylogenetic data include maximum parsimony (W.M. Fitch and Margoliash 1967), maximum likelihood (Felsenstein 1983) and minimum evolution (Saitou and Nei 1987).Maximum likelihood and minimum evolution are provably statistically consistent (Desper and Gascuel 2004;Felsenstein 2004), whereas maximum parsimony can be inconsistent under certain conditions (Felsenstein 1978a).For small to moderate sized phylogenies, methods based on maximum likelihood (and Bayesian extensions) are generally considered state-of-the-art (Stamatakis 2014;Drummond and Rambaut 2007;Minh et al. 2020;Wilgenbusch and D. Swofford 2003).However, approaches based on minimum evolution (ME) have also shown to yield adequate performance (Kuhner and Felsenstein 1994;Gascuel and Steel 2006;Lefort, Desper, and Gascuel 2015;Kumar and Gadagkar 2000;Roch 2006).The first introductions of the minimum evolution (ME) paradigm (Kidd and Sgaramella-Zonta 1971; Rzhetsky and Masatoshi Nei 1992) sought to express evolutionary relationships through dissimilarity.They proved that, given unbiased estimates of the true evolutionary distances, the true phylogeny has an expected length shorter than any other possible phylogeny -thereby establishing the principled ME criterion.Currently, the best performing ME approach is that of balanced minimum evolution (BME) (Pauplin 2000;Desper and Gascuel 2002), with FastME (Lefort, Desper, and Gascuel 2015) being a popular software implementation.Its objective function can be written as: where D ij denotes a distance (e.g., based on molecular sequence data) between two taxa i and j and e ij the number of branches in the path between taxa i and j (the path length (Semple and Steel 2004)).This objective can be computed in a numerically stable fashion using the log-sum-exp trick (see Appendix K for an example snippet).A widely used approach to estimate the optimal tree greedily (Desper and Gascuel 2004;Gascuel and Steel 2006) is the neighbour joining (NJ) method (Saitou and Nei 1987).When neighbour joining is based on an additive distance measure, it reconstructs a unique tree, but still performs well with near-additive trees (Atteson 1999) and under small perturbations in the data (Mihaescu, Levy, and Pachter 2009).However, despite these highly favourable properties, further heuristic optimisation on a NJ tree using SPR moves have proven to be even more accurate (Lefort, Desper, and Gascuel 2015).Once a tree topology is found, quadratic algorithms exist for estimating the branch lengths (Mihaescu and Pachter 2008) as well as efficient approaches for molecular clock dating (To et al. 2016).
Balanced minimum evolution for rooted trees.Inference using BME is always restricted to unrooted trees (Semple and Steel 2004;Catanzaro, Frohn, et al. 2022) with rooting chosen after inference through heuristics (e.g., midpoint rooting) or via a molecular clock (e.g., for serially sampled data).However, it is often of interest to find the optimal rooted tree for a set of taxa, as this provides extra biological context (e.g., to represent evolutionary paths).
In an unrooted tree, the BME objective function (Eq. 1) provides an efficient way of calculating the total length of a tree where the branch lengths are the least squares estimators for approximating each D ij with the distance from nodes i to j in the tree.However, this result does not hold in a rooted tree, as the addition of a root changes many of the path lengths.
To remedy this, we consider adding a "root taxon" to the tree by joining it to the root node as taxon n.If the tree is roughly ultrametric, then we expect where D * is the (assumed constant) root-to-taxa distance.Of course, we do not know the sequence of the root but, as we will show, the value of D * is unimportant -it is instead simply important that it is independent of i. Adding this root taxon as a leaf node transforms the tree from being rooted to being unrooted, where standard BME can be used.From this assumption, we prove two lemmas to ensure the framework's validity, showing that the optimal unrooted tree is obtained when the variation in the root-to-taxa distance is sufficiently small (Lemma 1), and subsequently that, in all cases, the optimal rooting for an unrooted tree solves a biologically plausible optimisation problem (Lemma 2).
First, Lemma 1 shows that, if then, using e u ij and e r ij to denote path lengths in the unrooted tree (u) containing taxon n and the rooted tree (r) formed by removing taxon n, where δ denotes a small number.Hence, the difference between the rooted and unrooted objective functions is approximately equal to the constant, D * .Thus, for sufficiently small δ, by the discreteness of tree space, we can see that, if it is unique, the optimal unrooted tree under the rooted objective (using e r ij ) will be the same as that under the unrooted objective (using e u ij ), when the root taxon is used as an additional leaf.
Subsequently, Lemma 2 shows that, for the correct unrooted tree, the BME-optimal rooting maximizes a simple heuristic (defined in Definition 2) for the root-to-tip distance.Equivalently, the optimal rooting ensures that the root is estimated to be the maximal possible distance back in time.
This is not an immediately biologically plausible objective for the root.Indeed, the cornerstone of BME is that we want the tree of minimum length, and it hence seems counter-intuitive to require the root that is the maximum distance backwards in time (though, by Lemma 2, this does create the minimum length tree).However, the root of our tree must be the point that is furthest backwards in time.In particular, this means that the evolutionary direction needs to be away from the root.
By setting our root such that the root-to-tip distance is maximised, we ensure that the root satisfies this constraint.
This method shares a similar motivation with midpoint rooting, which also seeks to maximise a heuristic for root-to-tip distance.However, the heuristic used in midpoint rooting uses only the two taxa which are furthest apart, while our rooting method uses distances from all taxa.Thus, we expect our method to be more robust, particularly as large intertaxa distances are difficult to estimate, meaning that the additional information used in our heuristic should help to reduce errors.This is evidenced in Fig. S2, where midpoint rooting leads to incorrect root placement.
Nonetheless, this property will not hold if the tree is not ultrametric.If taxa evolve at different rates at different times throughout the tree, then the root will be drawn towards taxa with high evolutionary rates.Thus, caution must be used when applying our rooted algorithm to such trees, although the unrooted algorithm will still give a correct unrooted tree.
In this case, it may be best to find the optimal unrooted tree topology and then solve the rooting problem for this tree, rather than finding the optimal rooted tree, as the former will reduce the skewing effect of the heterogeneity in evolutionary rates.Because of this, the algorithm introduced in this paper has the flexibility to find the optimal unrooted or rooted tree.An ordered bijection to tree space.Previously, we introduced Phylo2Vec (Penn et al. 2023), a novel bijection between the space of phylogenetic trees and a space of integer vectors.In contrast to other bijections such as permutation matchings (Diaconis and Holmes 1998), changes in Phylo2Vec correspond to smooth changes in the tree space, e.g., single changes in a Phylo2Vec vector correspond to a limited set of SPR changes.On the other hand, Prüfer codes (Chen and Wang 2000) form a bijection to the space of all m-ary trees, meaning that there is no guarantee to sample binary trees from random Prüfer sequences.
Here, we focus on the notion of ordered trees from (Penn et al. 2023), where it is possible to construct a tree from its vector in linear time.An ordered tree can be thought of as a birth process, such that when a birth occurs, the original node continues to live and retains its label, while the new node receives an incremented label.Accordingly, we introduce an equivalent but more intuitive tree construction process for these ordered trees (see Fig. 3 for an example).We begin with two leaf nodes and two edges labelled 0 and 1.We then append nodes by joining them, in order, to edges connecting leaf nodes to the tree.This tree construction process can be summarised by a single vector v, with v 0 = v 1 = 0 and, for m ≥ 2, v m being the label of the edge to which node m is appended.Each index in v m is subject to the simple constraint: which is equivalent to the definition of ordered trees in (Penn et al. 2023).Intuitively, a tree is ordered if, starting with a branch connecting the root to taxon 0, the taxa can be added in order of their label by appending a new branch to a terminal branch of the existing tree (i.e. a branch connecting a leaf node to the rest of the tree).Thus, the ordering of the taxa is in some sense "natural" for each possible ordered tree.It is proved in Lemma 3 that for ordered v, the algorithm presented above and the more general Phylo2Vec algorithm in (Penn et al. 2023) produce equivalent trees.
Note that, for a fixed integer-taxon labelling, the number of ordered trees is a subset of the number of possible trees.We discuss an efficient method to remedy this problem and explore all tree space called the Queue Shuffle.
A continuous representation of a tree.We introduce continuous, probabilistic, representation of trees using a square matrix W which gives the distribution of a random ordered vector v with independent entries such that W ij = P(v i = j).
Given Eq. 5, W is a lower-triangular, stochastic matrix (row sums to 1).Thus, W can probabilistically represent any ordered phylogenetic tree (a space of (n − 1)! trees).A simple approach to determining the most likely single tree from W is to take the column-wise argmax, yielding a single tree v.
Gradient-based optimisation using the BME criterion.Using this continuous representation, we can find the optimal ordered tree.Defining f (v) to be the BME objective function for the tree generated by an ordered vector v, we then create The calculation of F (W ) follows our new method of constructing ordered trees from the vector v.For a fixed (randomly chosen) tree, we define e k ij to be the path length between nodes i and j when nodes 0, 1, ..., k − 1 have been added to the tree (for i, j < k).Note that to find the rooted objective function, we initialise with e 2 10 = e 2 01 = 2, while to find the unrooted objective, we initialise with e 2 10 = e 2 01 = 1 (while the Phylo2Vec representation is an inherently rooted representation, this unrooted objective finds the tree length if the root were removed from the random rooted tree with distribution given by W ).This is because, in a tree where the only leaf nodes are 0 and 1, these nodes are a path length of 2 apart if there is also a root (as the root is on this path) while otherwise, they are a path length of 1 apart.
If node k is appended to the edge joining either node i or j to the tree, then e k+1 ij = e k ij + 1; otherwise, e k+1 ij = e k ij .Similarly, if node k is appended to the edge joining node i to the tree, then e k+1 ik = 2 and otherwise, if node k is appended to the edge joining node x to the tree, then e k+1 ik = e k ix + 1.Thus, using V k to denote the random value of v k , we can write for some functions G k ij which are derived explicitly in Lemma 4. Importantly, each term in the sum is independent, and hence, in Lemma 4, a closed iterative system for the quantities ) can be calculated for i < j as with the remaining values for i > j following by symmetry.
The objective function is a polynomial function of the entries of W and is linear in each fixed entry (that is, the diagonal entries of ∇ 2 F are zero).Thus, by Lemma 5, there is always a minimum at a "discrete tree" (that is, at a matrix W where for each row, one value is 1 and all the others are 0).Moreover, this simple form makes it easy to differentiate F analytically, numerically or automatically.Using state-of-the-art automatic differentiation (Bradbury et al. 2018), gradient descent can be used to efficiently minimise F and find the optimal ordered tree.
There may also be minima at non-discrete trees if multiple trees share the same, optimal, objective value.In our rooted optimisation, this is highly unlikely (as two topologically different trees having equal objectives places a dimension 1 condition on the distance matrix D, meaning the set of distance matrices for which this happens has measure 0 in the set of possible distance matrices), but when we use our unrooted algorithm, this will occur.This is because, as discussed in (Penn et al. 2023), there are n − 1 Phylo2Vec vectors which, when the root is removed, give the same unrooted tree.Thus, if multiple rooted vectors giving the same unrooted tree U are in the same space of ordered trees, then, if U is the optimal tree under this ordering, the algorithm may converge to a non-discrete W .In this case, taking the argmax safely recovers an optimal rooted tree as, by Lemma 5, all possible trees according to W will have the optimal objective value.
The tree-space induced by our continuous objective function will have local minima whenever changing any single entry of the vector v causes the objective function to increase.As the Hamming distance between vectors u and v is comparable to SPR distance (Penn et al. 2023), we therefore expect that the discrete subset of our continuous tree-space will be similar in structure to the space induced by SPR moves.However, by starting from a uniformly distributed tree, where all possible ordered trees contribute to the objective, we expect that our algorithm is better able to pick up the "signal" from the true optimum, and avoid moving towards suboptimal local minima.
A Python-like algorithm to compute E(2 −e ) is shown in Algorithm 1.To find the E m ij terms, there are O(m) steps of O(m) (finding the E m m−1,j terms) and O(m 2 ) steps of O(1) (finding the other E m ij terms).There are O(n) values of m that need to be considered, and hence this system can be solved in O(n 3 ) time.
Orderings.The continuous objective function is only defined for ordered trees, which, for a given labelling of the nodes, is a subset of the whole tree space.Thus, we define the concept of an ordering of the nodes, whereby changing the ordering allows the full space of trees to be explored.
Definition 1: Ordering: Suppose that the nodes correspond to taxa with names N 0 , N 1 , ...N n−1 .We then define an 8: E = E * + E * T 9: Return E ordering of the nodes to be a permutation, σ of the set {0, ..., n − 1} such that the node with name N i is processed as node σ(i) by the Phylo2Vec algorithm.It is necessary that the associated Phylo2Vec vector v(σ) is ordered.
Note that we use the phrase "node x" to mean "the node with name N x " and will always be explicit if we refer to a node by its label.
Given a tree, it is possible to generate a possible ordering of the nodes, σ, as well as the associated vector v(σ).We will do this by labelling the tree as follows, again distinguishing between the leaf node names N i and their labels (which we will call l(i)).
Consider labelling the root node as 0.Then, label the children of this node as 0 and 1.One can continue this process inductively, choosing a node, labelled x, with unlabelled children and then labelling its children as x and y, where y is the smallest unused label.This process terminates when every node has been labelled.As discussed above, suppose that the label of the leaf node with name N i is l(i).Then, Lemmas 6, 7 and 8 show that l is a possible ordering of the tree, and provide a method for calculated the associated ordered Phylo2Vec vector v(l).
Given a tree, one can use the previous labelling algorithm to show that there are at least 2 n−1 possible orderings for the tree (as the children of each node can be labelled in either order).The exact number depends on the choice of which internal node is processed at each step (and so, a "balanced" tree where the leaves have a low generation has more possible orderings than a "ladder" tree where each leaf has a distinct generation).However, in general, it will be substantially larger than 2 n−1 (we conjecture that for flat trees, it may be factorial in size) and hence there are numerous possibilities which will allow for a global minimum of the objective to be found.
Queue Shuffle: changing orderings to explore all the tree space.The number of ordered trees (different from ranked trees (Collienne and Gavryushkin 2021)), (n − 1)!, is substantially smaller than the possible number of trees, (2n − 3)!! (albeit with a comparable growth pattern in n), and hence, while the optimal ordered tree will be closer to the true tree than a very large proportion of trees, it is very unlikely to be exactly equal to the true tree.
To fully explore tree space, one must shuffle the labels of the leaf nodes in the optimal ordered tree.Simply choosing a uni-formly random permutation will lead to extremely inefficient optimisation, as each tree is only possible in approximately 1/2 n of the possible orderings.Instead, we use the topology of the optimal tree to inform our choice of permutation through a novel approach we call the Queue Shuffle.This ensures that the previous optimal tree can be written as an ordered tree in the new ordering, while also ensuring a smooth and efficient path through the space of orderings.
The Queue Shuffle is motivated by the labelling procedure discussed in the previous section, but ensures that the set of internal nodes with a given generation (that is, a given distance from the root) are processed consecutively.That is, we begin by processing all nodes with generation 0 (i.e., the root), then all internal nodes with generation 1, then all internal nodes with generation 2, and continue in this fashion until all internal nodes have been processed.
Algorithmically, this can be achieved by a "queue" of internal nodes to be processed.When an internal node is processed, any of its children that are also internal nodes are added to the back of this queue.Thus, the queue is always in ascending order of generation, and it is simple to show that this ensures that nodes are processed in non-decreasing order of generation.
A crucial feature of this queue is that the child given the same label as its parent is placed ahead of the other child in the queue.This ensures that one can, in some way, control the order of processing by choosing the labelling of the children of each node.Moreover, it is vital for the theoretical result presented in the following section.
To add randomness into the labelling procedure, every time an internal node is processed, we randomly choose which child is given the label of their parent, and which child is given the next available label.This provides 2 n−1 possible orderings for each tree.This stochasticity is helpful in ensuring that the algorithm does not get stuck -as discussed in the subsequent section, it ensures that a large class of similar trees will be considered after a few ordering proposals.
An algorithmic description of the Queue Shuffle is provided in Algorithm 2.
GradME.The Queue Shuffle completes our optimisation algorithm.We iteratively find the best ordered tree according to the current ordering and then use Queue Shuffle to change ordering, changing the space of explorable trees.The algorithm terminates when the optimal tree has not been improved upon for a fixed number of iterations (note that, by construction, the previous optimal tree will always be in the new space of ordered trees).In the examples presented in this paper, only tens iterations are needed from some random starting order, and less if a sensible starting ordering (such as from a neighbour joining tree) is used.
We refer the resulting system, combining the continuous tree representation, Queue Shuffle reordering, and the gradientbased optimisation framework using BME, as GradME.
Algorithm 2 The Queue Shuffle ▷ "queue" of nodes to process 2: L = {ν 0 : 0, ν 1 : 1} ▷ node:label mapping 3: ▷ node to process 7: ] ▷ ν will be processed 8: append(P, ν) ▷ ν will be processed 9: if isLeaf(ν) then A given tree is in the space of ordered trees for at least 2 n−1 orderings.This means that we do not need to find a single optimal ordering, but have exponentially many which will return the true optimal tree.
Very loosely considered, being able to explore n! tree space reliably and efficiently with continuous optimisation, Queue Shuffle reduces the inferential task to one that is exponential.
However, while the number of optimal orderings grows exponentially, their proportion tends quickly to zero as n grows.
It is therefore, perhaps, surprising that we are able to find an optimal ordering so quickly from merely tens of shuffles.The proportion of optimal orderings (approximately the ratio of ordered trees to total trees, (n−1)!(2n−3)!! , ranges from 8 × 10 −4 in our smallest dataset (14 taxa) to 6 × 10 −29 in the largest (99 taxa; see Table 2).This efficiency comes from the topology-dependence of the Queue Shuffle algorithm, which allows us to plot a relatively "smooth" path through the space of possible orderings.That is, the majority of trees in the new ordered space will have similar properties to the previous optimal tree and so, unless the previous tree was a local minimum of the objective, it is likely that one of these "close" trees will have a lower objective value.
Lemma 9 shows that the expected distance from the root grows harmonically as the label increases.For large trees, the node with label n − 1 has an expected distance from the root of approximately twice the expected distance from the root of the node label 0. This property is noticeable even for small trees -if n = 10, then the ratio of the expected distance to the root of node 9 and node 0 is approximately 1.65.Thus, nodes which are close to the root in the current optimum, will also be closer on average to the root in the new space of ordered trees.In essence, this means that "fewer slots are wasted" in the new ordered space -that is, there are fewer trees in the new space of ordered trees which are topologically far from the previous optimum (a tree that, after the first few iterations, is likely to be far closer to the true optimum than a randomly-chosen tree) and hence, more trees which are reasonable candidates for having lower objective values.Lemma S1 proves another example of the smoothness in transitions induced by the Queue Shuffle, based on nearest neighbour interchange (NNI) moves.An NNI move considers the four subtrees attached to two non-root nodes that share an edge and swaps two of these subtrees.Lemma S1 shows that, starting from a tree T , any tree which is one NNI move away from T will be in the new space of ordered trees with probability at least 1 4 .
This ensures that this new space contains many sensible proposal trees.Perhaps the most surprising aspect of this theorem is that this probability is bounded below, independently of the topology.Thus, with high probability, the optimal tree will only remain the same for more than a few iterations if large sets of similar trees yield lower objective values than the current optimum.
That being said, the Queue Shuffle does not guarantee that the global minimum will be found, even if the gradient-based algorithm for optimising F (W ) always converges to the optimal tree.If a tree is "far" from the nearest tree with a better objective value, then it may take a very large number of shuffles (or, indeed, it may be impossible) to find a better tree.
However, while the only theoretical guarantee is that Lemma S1 shows it will quickly find better trees that can be formed by NNI, we expect that stronger conditions hold on its ability to "escape" from local minima.
Computational complexity.The computational complexity for all distance-based algorithms requires an upfront computational cost of O(n 2 ) to compute the distance matrix.We will disregard this cost from subsequent comparisons.The standard neighbour-joining algorithm (Naruya Saitou and Masatoshi Nei 1987) has an overall computational complexity of O(n 3 ).FastME has a computational complexity of O(kn 2 Diam (T )) (where Diam(T ) is the maximum path length in a tree, which is generally much smaller than n) for k iterations where k < n when n is large.When fully discrete, our algorithm also has same complexity but with added mechanisms for escaping optima via Queue Shuffle.Therefore, a discrete setting is as computationally efficient as FastME (see Appendix I for details).
Computing the expectation in Algorithm 1 has complexity O(n 3 ).A single gradient evaluation (that is, calculating ∂F ∂Wij for some i and j) is also O(n 3 ) and therefore computing the full Jacobian is O(n 5 ).Our Queue Shuffle algorithm runs in O(n).Therefore, our optimisation for k steps and l shuffles yields a complexity of O(kln 5 ).The size of l is dependent on the choice of gradient optimiser, and the size of k varies if a sensible ordering is initialised.Thus, the computational complexity of GradME is substantially higher than that of FastME and closer to that of FITCH (Felsenstein 1997).This is due to the far greater mathematical complexity of the continuous objective function, F (W ).As it is an expectation over all possible ordered trees, the explicit formula for F (W ) is a polynomial in W with (n − 1)! different terms.Intuitively, the continuous space always considers a path between any two trees, something that becomes impossible with discrete settings.Thus, being able to compute it in polynomial time is a vast improvement on a naive approach, although it is still considerably less than the innovative FastME greedy approach.More savings should be possible, we hope to make further efficiency gains in future work.

Dataset
# Sites # Taxa Type Taxonomic rank Evaluation.We evaluate GradME on a diverse corpus of 14 empirical molecular sequence datasets (Table 2).The first 11 are commonly used to assess phylogenetic inference performance (Whidden and Matsen 2015), whereas the last three were used to assess inference on rooted trees.For each dataset, we start from a random tree and optimise the W matrix to a tolerance of 1e-10 using gradient descent with Adafactor (Shazeer and Stern 2018) optimisation.The distance matrix D is computed using the GTR+Γ substitution model for DNA and an LG model (Le and Gascuel 2008) for amino acids.Substitution model parameters for the GTR+Γ are also estimated using gradient descent with Adafactor using a pairwise maximum likelihood approach (Yang 2006).Jukes-Cantor (Jukes, Cantor, et al. 1969), F81 (Felsenstein 1981) and TN93 (Tamura and Masatoshi Nei 1993) models were also tested for DNA, while stochastic gradient descent (SGD), RMSprop (Tieleman and Hinton 2012), and AdamW (Kingma and Ba 2015;Loshchilov and Hutter 2019) were also considered for optimisation (see Fig. S3).To fairly assess the performance of GradME, we compare our framework to Implementation.Implementation of the BME criterion and the optimisation framework was written in Python using Jax (Bradbury et al. 2018) and Optax (Babuschkin et al. 2020).Optimisation was performed on a Xeon 2.30GHz (CPU; Intel Corporation) or on a single GeForce GTX 1080 (GPU; Nvidia Corporation).Evaluation of the BioNJ (Gascuel 1997) and FastME (Lefort, Desper, and Gascuel 2015) methods was performed via the R package ape (Paradis, Claude, and Strimmer 2004) using rpy2 (Gautier, Krassowski, et al. 2021).Tree manipulation and visualisation scripts were written using ete3 (Huerta-Cepas, Serra, and Bork 2016) and NetworkX (Hagberg, Schult, and Swart 2008).An implementation is available at: https://github.com/Neclow/GradME.

Appendix A. BME for rooted trees
Comparing the rooted and unrooted objectives.
Lemma 1: Consider adding an extra taxon, n, to the set of taxa such that, for some D * and δ.
This creates an unrooted tree T u , and we can create a rooted tree T r by removing node n.If e u ij denotes inter-taxa distance in T u and e r ij denotes inter-taxa distance in T r , then as the path between i and j will not contain any leaf nodes other than i and j, and therefore will not contain node n.Then, using the fact that D nn = 0, We can make progress with this sum by noting the Kraft Equality, found throughout the literature (Catanzaro, Labbé, et al. 2012) This means that Similarly, one can show and hence the result follows.
Understanding the BME rooting.Definition 2: Distance to root heuristic: Under the assumption that the tree is ultrametric, we estimate the distance between the root and the leaf taxa using the following algorithm: 1) Find two leaf nodes that share a parent that is not the root.If no such pair exists, then move to step 3.
2) Let d 1 and d 2 be the distance between each leaf node and its parent.Suppose that d 3 is the distance between the parent and its parent (i.e. the grandparent of the leaves).Remove the leaves from the tree, update the distance between the parent and its parent to be d 3 + d1+d2 2 and return to step 1 3) If d 1 and d 2 are the distances between the two remaining children and the root, then the distance between the root and the taxa is d1+d2 2 .
Lemma 2: Consider the true unrooted tree T and suppose that the distance matrix D gives the true (time) distances between each pair of taxa in T .
Then, the optimal rooting -that is, the edge in the tree such that placing the root on this edge minimizes the BME objective -minimizes the distance to root heuristic defined in Definition 2.
Proof: Choose an edge, b on which to place the root.We define an indicator function the path between nodes A and B and 0 otherwise.Adding a root to edge b changes the objective function by halving the weight assigned to each D AB such that X b (A, B) = 1 as the path length between these nodes will increase by 1.Thus, using f r to denote the rooted objective and f u to denote the unrooted objective, where the e AB are the path lengths in the original, unrooted tree and both A and B are allowed to vary in the sum.Hence, for a fixed unrooted tree topology, the optimal rooting will solve By Lemma 11, while, by Lemma 12, the root-to-tip heuristic, D, satisfies and hence, the optimal rooting solves B Ordered Trees as required.

B. Ordered Trees
Definition 3: Left-to-right construction algorithm An ordered tree can be constructed as follows: 1) Begin with nodes 0 and 1, each joined to a root.Label the edge joining node 0 to the root as edge 0, and the edge joining node 1 to the root as edge 1 2) Process the nodes in order 2, 3, 4. . .When processing node k, join it to edge v k (creating a new internal node -this is well-defined as v k < k).Label the edge joining node k to the tree as edge k.
Lemma 3: The left-to-right algorithm in Definition 3 and the standard Phylo2Vec algorithm defined in (Penn et al. 2023) produce the same tree, provided v is ordered.
Proof: To show that the left-to-right algorithm gives an equivalent tree, define T to be the tree resulting from the standard Phylo2Vec algorithm and T ′ to be the tree resulting from this new left-to-right algorithm.We proceed by induction on the number of nodes, n, noting that the case n = 2 is trivial.
Suppose now that the algorithms are equivalent for n = m.Choose some ordered v of length m + 1 (so that this corresponds to n = m + 1) and consider processing the first node using the Phylo2Vec algorithm, so that (using the fact that v is ordered so that no nodes are skipped), node m merges with node v m .
From this step, the Phylo2Vec algorithm proceeds as if there were n − 1 nodes and the vector was ṽ = (v 0 , v 1 , ..., v m−1 ).
Define T to be the tree given by ṽ.Hence, T can be created from T by removing nodes v m and m (and the edges connecting them to the tree) and relabelling their parent as v m .Equivalently (by reversing this process), T can be created from T by adding a node to the edge joining leaf node v m to the tree, and connecting node m to this edge.
Moreover, from the inductive hypothesis, T can be constructed by using the left-to-right algorithm on ṽ.As this algorithm processes v m last, this means that after m nodes have been added to the tree by the left-to-right algorithm applied to v, the current tree is given by T .
The final step of the left-to-right algorithm applied to v is to add a node to the edge joining leaf node v m to the tree, and to connect node m to this edge.As previously discussed, this creates the tree T and hence, T = T ′ as required.

C. A continuous objective function
Construction.
Lemma 4: For a randomly chosen tree with distribution W , define, for i, j, < k, e k ij to be the path length between taxa i C A continuous objective function and j when k nodes have been added to the tree (using the left-to-right construction algorithm).Define E k ik := E(2 −e k ij ).
Then, for i < j with the remaining values following by symmetry.
Proof: Adding node k to the tree increases the path length between nodes i and j by 1 if and only if As this condition is independent of other values of V , using e k ij to be the path length after the k nodes {0, ..., k − 1} have been added, one can write This is a product of independent random variables and so, defining ) and noting that I{V k ∈ {i, j}} is a Bernoulli random variable with probability W ki + W kj , this equation becomes To close this iterative system, note that when node j is added to the tree, it will be a path length 2 from the leaf node V j connecting the edge it is joined to.Moreover, the distance between node j and any other nodes x ̸ = V j in the tree will be equal to one plus the distance between node x and node V j .That is, Thus, conditioning on the value of V j , Finally, by symmetry, Noting that E k ij is undefined (and unnecessary) for k < max(i, j) + 1, as nodes i and j have not both been added to the tree, ( 24) and ( 26) hence form a closed system.This can be solved inductively, finding all E m ij terms for m = 2, 3, ..., n.

Discrete minima.
Lemma 5: Define f (v) to be the objective function for the discrete tree given by the Phylo2Vec vector v.Then, if V * is the set of vectors v which minimize f, any optimal W satisfies Moreover if |V * | = 1 then there is a unique v minimizing f , then there is a unique W minimizing F , which is the matrix such that Proof: Define V to be the set of ordered tree vectors.Then, we see that F (W ) is a weighted average of the values of f (u) for u ∈ V .Thus, for any and as required.If V = {v}, this minimum requires and therefore, using (28) W is uniquely defined by as required.

D. Orderings
This section considers the labelling algorithm introduced in the main text.
Lemma 6: Two nodes have the same label only if they share an ancestor with that label.
Proof: Suppose that this is false, and that nodes x and y have the same label, L, but do not share an ancestor with that label.Define a(x) and a(y) to be the nodes of lowest generation (that is, the nodes closest to the root) with label L such that they are ancestors of x and y respectively.Note that, by assumption, a(x) ̸ = a(y) and also, neither can be the root (as the root is the ancestor of all nodes).Moreover, they cannot share a parent, as the children of a parent are labelled differently.Thus, without loss of generality, one can assume that a(x) was labelled first.By definition, the parent of a(x) does not have label L and hence, L must have been the smallest unused label when node a(x) was labelled.A similar argument for a(y) shows that L must have been the smallest unused label when node p(y) was labelled.However, when node a(y) was labelled, L had been used to label a(x), giving the required contradiction.
Proof: Firstly, note that as there are n − 1 internal nodes, and a single new label is introduced every time the children of an internal node are labelled, the set of labels used (across all nodes) must be {0, 1, ..., n − 1}.
Suppose that leaf nodes x and y have the same label L. By Lemma 6, they must share an ancestor a with that Lemma 8: For each i ∈ {0, 1, ..., n − 1}, define x(i) to be the node of the highest generation with label i. Define (for i > 0), y(i) to be the label of the parent of x(i).Then, with ordering l, the tree is given by Proof: Firstly, note that v is ordered, as the label of a child is greater than or equal to that of its parent.Hence, as the parent of x(i) does not have label i, it must have a label strictly less than i and so v i < i as required.
Consider constructing the tree according the "standard" right-to-left Phylo2Vec algorithm.One can then proceed by induction on the number of nodes.The case n = 2 is trivial, and so suppose it holds for n = m and consider a tree with n = m + 1.
The first node to be processed has label m.No other node can have label m cannot also have label m (as, otherwise, one of its children would have label greater than m, contradicting Lemma 7) and hence, the first node to be processed is x(m).The parent of x(m) must have label y(m).This must also be the label of its other child and hence, v m = y(m) ensures that the node with label m merges with its sibling from the original tree (which is correct).
From this point, the tree now has m nodes, and our right-to-left construction algorithm considers the parent of the node labelled m to now be a leaf node with label y(m).The values of y(i) for this tree are unchanged (in particular, y(y(m)) depends on the label of an ancestor of the parent of node m) and hence, by induction, the remaining values of v correctly generate the rest of the tree.Thus, the correct tree is generated by v as required.Lemma 9: Define g k m to be the expected distance from the root of the node labelled m in a tree with k nodes.Define the harmonic sum function Then, Proof: From our left-to-right construction algorithm, as adding node k − 1 according to v k−1 = m means that the path length between node k − 1 and the root will be one more than the path length of between m and the root.If v k−1 = m, it will also increase the path length between node m and the root by 1 and so The initial conditions of this system are that as in a two-node rooted tree, the leaves are distance 1 from the root.
We claim by induction on k that the solution to this system is (36).
Note that this holds for k = 2 as H(1) = H(0) = 1.Moreover, under the inductive hypothesis that it holds for a tree with and hence as required.
Definition 4: Define τ (σ) to be the space of possible trees given an ordering σ.
Definition 5: For a tree T , define Q(T ) to be the random ordering generated by Queue Shuffle.
Lemma 10: Consider a tree T and suppose that another tree, T ′ is one NNI move away from T ′ .Then To facilitate the proof, we first note that for any permutation σ, by Lemma 13, that Now, we consider labelling the tree as shown in the left panel of Fig. S1.Suppose that the edge to which the four subtrees are joined has nodes i and j, with node i being a higher generation than node j.Suppose that the subtrees rooted at the child of j are labelled as S j and S c .Suppose that the other child of i (that is, the child not equal to j) is the root of a subtree S i .Finally, suppose that the fourth subtree is labelled S r (this is the subtree containing the root).A single NNI move in this context involves swapping a pair of subtrees from the set {S i , S j , S r , S c }.

F. Eutherian Mammal phylogeny (Song et al. 2012)
Queue Shu e Rooted Inference FaseME midpoint Rooting Moreover, l S j (Q(T )) = l i (Q(T )) if and only if node j was placed ahead of the root of S i in the queue Proof Using the r defined in the proof of Lemma 13, define the random indices A and B such that r A corresponds to processing node i and r B corresponds to processing node j.Note that r A and r B are independent (though B will in general depend on r A ). Now, suppose without loss of generality that j is the leftmost child of i and that the root of S j is the leftmost child of j.As children have labels greater than or equal to their parents, and hence which is the first result.The second result follows immediately from the fact that r A = 1.

I. Estimation of GTR+Γ distances
To estimate distances under a GTR+Γ substitution model we use the approach outlined in (Yang 2006).Assuming a general time reversible rate symmetric matrix Q, the transition probability matrix over time is found via the matrix exponential P (t) = e Qt .This matrix exponential can be readily computed via eigendecomposition.
Including variable rates among sites for a Gamma distribution g, P (t) = e Qu g(u)du, which can again be estimated via eigendecomposition.Given parameters for rates, S, frequencies, π, the time between two sequences t ij , and genetic sequence data G, the log-likelihood for the transitions between a pair of taxa i and j is κ ij ab log(P ab (t ij ; S, π)) (84) where κ ij ab is the number of a → b transitions from taxon i to taxon j.We approximate the optimal parameters by maximizing the total log-likelihood (that is, the sum over i and j of L ij ) using gradient descent in Jax.

J. Fast discrete hill-climbing with Phylo2Vec
The computational complexity of GradME is substantially higher than that of FastME due to the continuous nature of the algorithm.Thus, particularly for large numbers of taxa, using a similarly fast algorithm, at least to get close to the optimal tree, may be preferable.
Because of this, we have developed an alternative, discrete algorithm which has the same computational complexity as FastME.At each step, our algorithm outputs a matrix ∆ such that ∆ ij is the change in the objective function if the value of v i were changed to be equal to j. ∆ allows us to perform a hill-climbing optimisation, as the change corresponding to the minimum value of ∆ is made.
Analogously to FastME, ∆ can be calculated in O(n 2 diam(T )) time, where diam(T ) is the maximum inter-taxa path length of the tree (this is generally substantially smaller than n).This algorithm works exclusively for unrooted trees, though a similar algorithm could be developed using our rooted objective.
Motivated by the method of (Desper and Gascuel 2002), our algorithm begins by calculating directed edge weight vectors w ± e for the edges in T .Each edge e naturally partitions T into two disjoint subtrees, S e 1 and S e 2 , with each one being rooted at a node connected to e.We suppose that S e 1 is the tree not containing some fixed node X.Then, we assign w + e to be the balanced distance between the root of S e 1 and each of the individual leaf nodes in S e 2 , and w − e to be the balanced distance between the root of S e 2 and each of the individual leaf nodes in S e 1 .These edge weights can be calculated efficiently by using an iterative scheme, using the fact that the weight on a given edge from an internal node x to another node y is the mean of the weights of the two edges from nodes z and a (both distinct from y) to x (recalling that these weights are directional).This iterative scheme is closed by the fact that the weights on edges from leaf nodes to the tree are given by the appropriate columns of the distance matrix D.
This weighted tree can be used to calculate the distance between any pair of subtrees.For each fixed subtree, S, one can iteratively calculate the distance d SR between it and (disjoint) subtrees R, using the fact that the precalculated w terms give the distance between each subtree and each leaf, and that for any pair of subtrees R 1 and R 2 which share a parent node and can therefore be combined in a subtree R 3 , From (Desper and Gascuel 2002), there exists a simple formula for the difference in objective function caused by pruning a subtree and regrafting it to an adjacent edge.If an internal node x is joined to nodes which are the roots of disjoint subtrees A, B and C, then the difference between attaching a subtree K to the edge joining x and C and attaching the subtree K to the edge joining x and B is where here, δ denotes inter-subtree distance.These δ terms are not equivalent to the d terms, as removing K from the original tree creates a different set of possible subtree pairs.However, they can be calculated in O(n 2 diam(T )) time from the distances d following the methods of (Desper and Gascuel 2002).
Explicitly, our algorithm moves outwards from the parent of the subtree which we are removing.We take B to be the node we are currently processing, x to be the previously processed node on this path, C to the node processed two iterations

Figure 1 .
Figure1.Results on empirical data (a) Starting from a random tree, represented by an n×n stochastic matrix, we compute the continuous gradient, apply softmax activation and increment the original matrix.In a single step, our gradient finds the correct tree at a distance of 6 subtree-prune and regraft moves from the random starting tree.(b) Simulating ultrametric trees of 20 taxa and 100,000 sites under an LG model of protein evolution.We add random uniform noise to all branch lengths to simulate departures from ultrametricity.Compared to the true tree via Robinson-Foulds distance, light blue bars are midpoint rooting the best FastME tree and dark blue bars are the inferred root from our approach.(c) Phylogenies for jawed vertebrates, where the number of genes (hence sites) are reduced to be more clocklike.Normalised Robinson-Foulds distance are shown between the best ASTRAL (ChaoZhang et al. 2018) tree, the best unrooted FastME tree which has been midpoint rooted (light blue) and our inferred rooting algorithm (dark blue).Performance for FastME reduces when the number of sites is small.
b) ▷ add b to the queue 17: Return L ▷ Ordering determined by values of L for leaf nodes Why does Queue Shuffle work?.
two well-established distance-based methods: BioNJ(Gascuel 1997), based on the neighbour joining algorithm (NaruyaSaitou and Masatoshi Nei 1987), and FastME(Lefort, Desper, and Gascuel 2015), based on balanced minimum evolution.Penn et al. | Leaping through tree space manuscript | 19 label.Define b to be the shared ancestor with the highest generation (that is, furthest from the root) such that b has label L.Then, either nodes x and y are the children of b (in which case, they have distinct labels) or they are descendants of distinct children, c and d, of b.In the second case, one can impose without loss of generality that c does not have label L and, as label L has already been used to label b, we know that all descendants of c do not have label L. Hence, in both cases, one of x and y does not have label L as required.

Figure S2 .
Figure S2.Comparison of the best unrooted FastME tree that has been midpoint rooted to an optimised rooted tree via Queue Shuffle on an Eutherian mammal dataset Song et al. 2012.Queue Shuffle correctly places Gallus gallus as the outgroup of mammals.Branch lengths are ignored and trees are displayed as ultrametric.

Queue Shuffle: generating principled ordering proposals Asymmetry of ordered tree spaces.
Penn et al. | Leaping through tree space manuscript | 32 E.
Penn et al. | Leaping through tree space manuscript | 42 J Fast discrete hill-climbing with Phylo2Vec