Phylo2Vec: a vector representation for binary trees

Binary phylogenetic trees inferred from biological data are central to understanding the shared evolutionary history of organisms. Inferring the placement of latent nodes in a tree by any optimality criterion (e.g., maximum likelihood) is an NP-hard problem, propelling the development of myriad heuristic approaches. Yet, these heuristics often lack a systematic means of uniformly sampling random trees or effectively exploring a tree space that grows factorially, which are crucial to optimisation problems such as machine learning. Accordingly, we present Phylo2Vec, a new parsimonious representation of a phylogenetic tree. Phylo2Vec maps any binary tree with n leaves to an integer vector of length n − 1. We prove that Phylo2Vec is both well-defined and bijective to the space of phylogenetic trees. The advantages of Phylo2Vec are twofold: i) easy uniform sampling of binary trees and ii) systematic ability to traverse tree space in very large or small jumps. As a proof of concept, we use Phylo2Vec for maximum likelihood inference on five real-world datasets and show that a simple hill climbing-based optimisation efficiently traverses the vastness of tree space from a random to an optimal tree.

introduced as a support for model selection and estimation of evolutionary or epidemiological parameters. Other vector representations of tree topology, such as pair matchings (Diaconis and Holmes, 1998) and F matrices (Kim et al., 2020), focus on the polynomial-time computation of the distance between any two trees (to measure similarity). However, methods for systematically sampling random trees or changing tree topology with respect to an objective function by leveraging such vector representations have been understudied. In particular, creating sampling schemes (as done in Bayesian frameworks such as BEAST (Drummond and Rambaut, 2007; Bouckaert et al., 2014) and MrBayes (Huelsenbeck and Ronquist, 2001)) around standard tree arrangements is non-trivial, and, although inferring phylogenetic trees is a common task in evolutionary biology, tree search using any optimality criteria (including maximum likelihood) is NP-hard (Roch, 2006). Another critical challenge is the size of the tree space: for a tree with n leaves, there are (2n − 3) · (2n − 5) · … · 5 · 3 · 1 possible rooted binary trees (Cavalli-Sforza and Edwards, 1967). Lastly, optimisation-based approaches often face a jagged "loss" landscape containing many trees with the same criterion score (Sanderson et al., 2011). When considering inference, the choice of representation can be particularly relevant for application to real phylogenetic problems. For example, an application of the approach we introduce here can be used for continuous relaxation and gradient descent under the minimum evolution criterion (Penn et al., 2023). For large phylogenies, the use of an efficient representation such as the compact bijective ladderized vector (Voznica et al., 2022) has proven effective for deep learning-based, likelihood-free inference (Thompson et al., 2024) or diversification inference (Lambert et al., 2023).
To overcome these limitations, we introduce Phylo2Vec, a new representation for any binary tree. In this framework, the topology of a binary tree can be completely described by a single integer vector v of dimension n − 1, where n is the number of leaves in the tree. The vector's construction is intrinsically related to the branching pattern of the tree, and is defined by a simple constraint: v_j ∈ {0, 1, …, 2(j − 1)} for all j ∈ {1, …, n − 1}. The approach we present here is most similar to that previously introduced by Rohlf (1983), but we focus on the integer representation and its mathematical properties, rather than counting or labelling trees.
Additionally, this formulation naturally offers a new measure of distance between trees (e.g., by comparing two vectors using the Hamming distance) and yields a new mechanism to explore tree space which diverges from traditional heuristics such as subtree-prune-and-regraft (SPR). To further demonstrate its utility, as a proof of concept, we apply Phylo2Vec to several phylogenetic inference problems, where the task is to find an optimal tree given a set of genetic sequences using maximum likelihood estimation. While state-of-the-art frameworks for phylogenetic inference typically rely on search heuristics based on deterministic tree arrangements, Phylo2Vec provides the first steps to a more systematic criterion for optimisation.

MATERIALS AND METHODS
The goal of this project was to develop a bijection (i.e., a one-to-one correspondence) between the set of binary rooted trees with n leaves and a constrained set of integer vectors of length n − 1.
We first describe an intuitive but incomplete (as not bijective) integer representation of trees as birth processes. Second, we define and characterise Phylo2Vec as a bijective generalisation of this first representation and formalise its properties. Third, we showcase the utility of Phylo2Vec by applying the representation for MLE-based phylogenetic inference on empirical datasets.
Our construction draws from an existing method of assigning integer counts to trees (Rohlf, 1983), although we focus on vector representations. It is distinct from Rohlf (1983) in labelling the tree edges, motivated by a simple and intuitive representation of birth processes.
By applying this encoding to rooted binary trees, we are able to move around tree space similarly to subtree-prune-and-regraft methods. Furthermore, we provide a rigorous proof of its bijectivity alongside a range of algorithms (all implemented in a Python package), which allow researchers to build on the phylogenetic optimisation algorithm we present here. Thus, we provide a significantly different method from those proposed previously (Rohlf, 1983), by focusing our efforts toward practical transitions in tree space.

An incomplete integer representation of trees as birth processes
Let T denote a rooted phylogenetic tree with n leaf nodes representing (biological) taxa, and D symbolise a key-value mapping (or dictionary) which associates a nonnegative integer (the keys) to each leaf node (the values).
Using this mapping, for a subset of all trees, we can summarise their topology using an integer vector v of size n − 1 such that:

$$v_j \in \{0, 1, \ldots, j - 1\} \quad \text{for all } j \in \{1, \ldots, n - 1\} \qquad (1)$$

The construction of this vector is inspired by birth processes: assuming a two-leaf tree with leaves labelled 0 and 1, we process v from left to right. For each j ∈ {1, …, n − 1}, v_j (hereinafter noted as v[j]) denotes the addition of leaf j such that, at iteration j, leaf j forms a cherry with leaf v[j]. In other words, the branch leading to leaf v[j] "gives birth" to leaf j.
Figure 1 illustrates algorithms to convert a tree to a vector and vice versa.
Although a simple representation of tree topology, it is easy to see from Equation 1 that this construction is incomplete. Indeed, there are j possible values for any v[j], and thus, for n leaves, there are 1 · 2 · … · (n − 1) = (n − 1)! possible vectors, which is less than the number of binary rooted trees, (2n − 3)!! (where !! denotes the semifactorial) (Cavalli-Sforza and Edwards, 1967; Felsenstein, 1978; Diaconis and Holmes, 1998). This discrepancy stems from the assumptions of this construction, whereby a new leaf j has to form a cherry with a previously processed leaf 0, 1, …, j − 1. For instance, leaf 2 has to form a cherry with either leaf 0 or 1, but cannot be an outgroup of the (0, 1) subtree. We thus denote trees that follow this incomplete construction of tree space as "ordered" trees, as they require a precise ordering of the leaf nodes.
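To make the gap between the two counts concrete, the following is a minimal Python sketch rather than the authors' implementation. The function names are hypothetical, vectors are stored 0-indexed (so v[0] describes leaf 1), and the decoder walks v from right to left so that each new leaf is paired with whatever has grown from its partner leaf.

```python
import math
from itertools import product


def ordered_vector_to_newick(v):
    """Decode an 'ordered' vector: leaf i + 1 forms a cherry with leaf v[i]."""
    n = len(v) + 1
    for i, value in enumerate(v):
        if not 0 <= value <= i:
            raise ValueError(f"v[{i}] must lie in 0..{i}")
    # sub[x] holds the Newick text of the subtree that has grown from leaf x.
    sub = [str(leaf) for leaf in range(n)]
    # Going right to left, births that happened later are merged first.
    for i in range(n - 2, -1, -1):
        sub[v[i]] = f"({sub[v[i]]},{sub[i + 1]})"
    return sub[0] + ";"


def num_ordered_vectors(n):
    """Number of ordered vectors for n leaves: (n - 1)!."""
    return math.factorial(n - 1)


def num_rooted_trees(n):
    """Number of rooted binary trees with n labelled leaves: (2n - 3)!!."""
    return math.prod(range(1, 2 * n - 2, 2))


print(ordered_vector_to_newick([0, 0, 1]))  # ((0,2),(1,3));
print(num_ordered_vectors(4), num_rooted_trees(4))  # 6 15

# Enumerate every ordered vector for n = 4 leaves and count distinct trees.
all_ordered = product(*(range(i + 1) for i in range(3)))
print(len({ordered_vector_to_newick(list(v)) for v in all_ordered}))  # 6
```

For n = 4, only 6 of the 15 rooted topologies are reachable, which is the incompleteness described above.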

Phylo2Vec
In this section, we define and formalise the properties of Phylo2Vec, an integer vector representation that extends the formulation presented above to be valid for any rooted binary tree.
To ensure bijectivity to this space, we need the vector v to satisfy the following constraints:

$$v_j \in \{0, 1, \ldots, 2(j - 1)\} \quad \text{for all } j \in \{1, \ldots, n - 1\} \qquad (2)$$

We say v ∈ V if Equation 2 is satisfied. For this representation, there are 2j − 1 possible values at any position j. Therefore, the number of possible vectors matches the number of possible binary rooted trees:

$$\prod_{j=1}^{n-1} (2j - 1) = (2n - 3)!!$$

From this observation, we can prove the bijectivity of the mapping simply by showing injectivity, that is, that no two distinct vectors v and w lead to the same tree. A proof is presented in the Appendix (Phylo2Vec details). Briefly, our proof relies on the fact that certain properties of pairs of nodes are preserved throughout the construction process: namely, that the most recent common ancestor (MRCA) of a pair of nodes is unchanged (once both nodes have been added to the tree) and that if one node is the ancestor of another at some stage of the construction process, then this remains true in the final tree. Then, if T and T′ are the trees resulting from different vectors v and v′, respectively, we choose the smallest i such that v_i ≠ v′_i. By considering the sets of leaf nodes descended from the edge to which i is added, we can show that the addition of node i causes a pair of nodes to either have a different MRCA or a different ancestral relationship. Therefore, since these properties are preserved throughout the construction process, we must have T ≠ T′. This shows the injectivity of our mapping, with bijectivity following from the fact that the number of trees is the same as the number of possible vectors v.
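Because each entry is constrained independently, a uniform draw over valid vectors only requires drawing each entry uniformly from its own range; by bijectivity, this is also a uniform draw over rooted binary trees. The sketch below is illustrative only: the function names are hypothetical and entries are 0-indexed, so position i (paper index j = i + 1) has 2i + 1 allowed values.

```python
import math
import random


def is_valid(v):
    """Check the constraint with 0-indexed entries: 0 <= v[i] <= 2 * i."""
    return all(0 <= value <= 2 * i for i, value in enumerate(v))


def sample_vector(n_leaves, rng=random):
    """Draw each entry independently and uniformly from its 2i + 1 values."""
    return [rng.randrange(2 * i + 1) for i in range(n_leaves - 1)]


def count_vectors(n_leaves):
    """Number of valid vectors: the product of the choices per position."""
    return math.prod(2 * i + 1 for i in range(n_leaves - 1))


rng = random.Random(42)  # seeded generator so that the draw is reproducible
v = sample_vector(10, rng)
print(is_valid(v))  # True
print(count_vectors(4))  # 15, i.e. (2 * 4 - 3)!! = 5 * 3 * 1
```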
Recovering a tree from a Phylo2Vec vector Building a binary tree from v follows closely the algorithm in Figure 1, but incorporates two additional requirements. First, we start from a two-leaf tree, whereby the leaves are labelled 0 and 1. The branches that lead to leaves 0, 1 are also labelled 0, 1, respectively. Second, we draw an additional node (called the "extra root") which is initially connected to the root by a branch labelled 2 (see second row in Figure 2).
The addition of a temporary root in the construction of v from a tree and vice versa ensures that there are 2j − 1 branches from which leaf j can descend. From these requirements, we can build a unique rooted tree T by processing v from left to right, where v[j] indicates the branch that will split and yield leaf j. Figure 2 shows a detailed example of this scheme, and other example representations for trees with n = 4 leaves are shown in Figure 3. In Figure 4, we also describe an inverse algorithm (whose existence is proved in the Appendix) that converts a tree represented in Newick format to a Phylo2Vec vector.
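The branch-splitting construction can be sketched in a few lines of Python. This is a hedged illustration, not the package's algorithm: `vector_to_newick` and `EXTRA_ROOT` are hypothetical names, entries are 0-indexed (v[i] places leaf i + 1), and the internal branches are numbered by an assumed simpler rule (non-root internal nodes in order of creation, then the branch to the extra root last) instead of the relabelling described in Figure 2. Any fixed labelling of the 2j − 1 branches yields a bijection, but for trees with more than four leaves the vectors produced by this sketch may differ from those of the paper's scheme.

```python
from itertools import product

EXTRA_ROOT = -1  # placeholder for the temporary "extra root" above the root


def vector_to_newick(v):
    """Build a rooted tree by letting branch v[i] split and yield leaf i + 1."""
    n = len(v) + 1
    if n < 2 or any(not 0 <= x <= 2 * i for i, x in enumerate(v)):
        raise ValueError("need 0 <= v[i] <= 2 * i for every i")
    root = n  # internal nodes receive the ids n, n + 1, ...
    children = {root: [0, 1]}
    parent = {0: root, 1: root, root: EXTRA_ROOT}
    internal = [root]  # internal nodes in order of creation
    for i in range(1, n - 1):
        leaf = i + 1
        # A branch is identified by the node it leads to. Assumed labelling:
        # leaf branches, then non-root internal branches, then the root branch.
        branches = list(range(leaf)) + [u for u in internal if u != root] + [root]
        child = branches[v[i]]
        above = parent[child]
        new = n + i  # node created by splitting the chosen branch
        children[new] = [child, leaf]
        parent[new], parent[child], parent[leaf] = above, new, new
        if above == EXTRA_ROOT:
            root = new
        else:
            children[above][children[above].index(child)] = new
        internal.append(new)

    def render(node):
        if node not in children:
            return str(node)
        # Sorting the child strings gives one canonical string per topology.
        return "(" + ",".join(sorted(render(c) for c in children[node])) + ")"

    return render(root) + ";"


print(vector_to_newick([0, 0, 3]))  # (((0,2),3),1);
print(vector_to_newick([0, 0, 4]))  # (((0,2),1),3);

# Every valid vector for n = 4 leaves maps to a different topology.
vectors = product(range(1), range(3), range(5))
print(len({vector_to_newick(list(v)) for v in vectors}))  # 15
```

The second example reproduces the topology of the Newick string (((0,2)4,1)5,3)6; used in Figure 4, without the internal node labels.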
Fig. 2. Recovering a tree from a Phylo2Vec vector: example for v = [0, 0, 3]. Panels: the initial tree with 2 leaves, 0 and 1 (dark green), 1 internal node (lime) and 1 extra root (grey); at each step, the branch given by v splits and yields the next leaf (e.g., step 2: v[2] = 0, split branch 0, yield leaf 2) and the branches are renamed; finally, the extra root is removed and the ancestors are named. We process v from left to right. The branch renaming step depends on the branch type. For leaf branches, branches that end on leaves 0, ..., L − 1 are labelled 0, ..., L − 1, respectively. For internal branches, the next branch (L) to label is i) the deepest and ii) the one with the "highest" children (if there are ties for criterion i). We repeat the same process for internal branches L + 1, ..., 2(L − 1) − 1, and label the last branch leading to the extra root 2(L − 1). See Algorithm S5 and Figure S2a for more details about implementation and complexity.
Complexity The algorithm underlying Figure 2 and detailed in Algorithm S3 generally runs in linear time (see Fig. S2a). A basic version using NumPy (Harris et al., 2020) runs in a few milliseconds for n = 1000 taxa on a modern CPU. The inverse algorithm (converting a Newick string to a Phylo2Vec vector v), detailed in Algorithm S4, is of log-linear complexity when internal nodes are already labelled (according to the scheme described in Fig. 2 and Algorithm S5) and

Fig. 4. Labelling a tree as a Phylo2Vec vector v: example for v = [0, 0, 4]. Panels: input, Newick = (((0,2)4,1)5,3)6;. Step 1: the next leaf is 1. Step 2: the next leaf is 2; branch 0 splits and yields leaf 2. Step 3: the next leaf is 3; branch 4 splits and yields leaf 3. We process leaves in ascending order. For each leaf j, we determine the branch that split and yielded leaf j, which corresponds to v[j]. At each step, we re-label the branches with the same process as in Figure 2.

Distances between trees
The formulation of Phylo2Vec as a one-to-one correspondence between binary trees and integer vectors constrained by Equation 2 naturally allows for a new measure of distance between trees. For any two Phylo2Vec trees v and w, a Hamming distance can be defined as

$$d_H(v, w) = \sum_{i=1}^{n-1} \mathbb{1}[v_i \neq w_i]$$

To compare this distance with other tree distance metrics, we consider a simple discrete random walk in the space of possible Phylo2Vec vectors V. At each step, we create a new vector w from the previous vector v as follows. First, we choose a random subset of the indices I ⊆ {2, …, n − 2}. Then, we set

$$w_i = \begin{cases} \min(\max(v_i + J(i), 0), 2(i - 1)) & \text{if } i \in I \\ v_i & \text{otherwise} \end{cases}$$

where the J(i) are iid random variables uniform on the set {−1, 1}. Note that the values of J(i) at different steps of the walk are also independent, and that the minimum and maximum in the definition of w_i ensure that it satisfies the constraint 0 ⩽ w_i ⩽ 2(i − 1).
As 1, n − 1 ∉ I, we fix w_1 = 0 (by our constraints) and w_{n−1} = 2(n − 2) (to ensure that we move in the unrooted tree space for SPR distance calculations). Figure 5 compares Phylo2Vec moves along this walk with three popular tree distances: the subtree-prune-and-regraft (SPR) distance, the Robinson-Foulds (RF) distance (Robinson and Foulds, 1981) and the Kuhner-Felsenstein (KF) distance (Kuhner and Felsenstein, 1994). We note that the exact rooted SPR distance is NP-hard to compute (Bordewich and Semple, 2005) and therefore cannot be directly compared to our rooted Phylo2Vec formulation. For all distances, we see a nonlinear correspondence, especially for RF and KF distance. Small changes in v can lead to very large topological jumps, but equally, small jumps are also possible. Modifying several indices in v results in significant jumps across tree space, leading to new trees that are very dissimilar. As a result, SPR, RF, or KF distances saturate as we increase the number of changes in v (St. John, 2017). However, we note that small changes in v_i can also readily correspond to very minor topological changes.
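The Hamming distance and one step of the random walk described above can be sketched as follows. This is an illustrative Python sketch with hypothetical function names and 0-indexed entries (paper index i corresponds to position i − 1, so the upper bound 2(i − 1) becomes 2i). How the random subset of indices is drawn is not specified in the text; the sketch assumes that each eligible index is included independently with probability p.

```python
import random


def hamming(v, w):
    """Number of positions at which two vectors of equal length differ."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(v, w))


def random_walk_step(v, rng, p=0.5):
    """Move each selected interior entry by +1 or -1, clipped to its range."""
    m = len(v)  # m = n - 1 entries
    w = list(v)
    for i in range(1, m - 1):  # the first and last entries are never selected
        if rng.random() < p:  # assumed subset rule, see the lead-in
            w[i] = min(max(v[i] + rng.choice((-1, 1)), 0), 2 * i)
    w[0] = 0  # forced by the constraint
    w[-1] = 2 * (m - 1)  # keeps the last leaf attached at the root branch
    return w


rng = random.Random(0)
v = [0, 1, 3, 2, 8]
w = random_walk_step(v, rng)
print(w, hamming(v, w))
```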
In the exploration of tree space, the number of possible moves for both SPR and Phylo2Vec is of order O(n²) (see Phylo2Vec details in the Appendix). Consequently, Phylo2Vec is expected to explore tree space in a manner similar to SPR, with proposals being less local than nearest neighbour interchange but also less global than those by tree bisection reconnection.
However, the number of single SPR changes is approximately four times greater than the number of single changes in v (that is, changes of a single index of v), as SPR considers internal nodes while v is defined across the leaves, and so Phylo2Vec changes are likely a subset of possible SPR changes.
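The quadratic growth of the number of single-index changes follows directly from the constraint on v, as the short check below illustrates. It is a sketch with a hypothetical function name and 0-indexed positions; it only counts neighbours of v and makes no claim about the exact size of the SPR neighbourhood.

```python
def num_single_changes(n_leaves):
    """Count the vectors that differ from a given v at exactly one position."""
    # Position i (0-indexed) has 2i + 1 allowed values, hence 2i alternatives.
    return sum(2 * i for i in range(n_leaves - 1))


for n in (4, 10, 100):
    # The sum has the closed form (n - 1)(n - 2), which is O(n^2).
    print(n, num_single_changes(n), (n - 1) * (n - 2))
```

The count does not depend on the current v, because every position always has the same number of alternative values.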
Whereas Figure 5 shows distances between unrooted trees, our framework is built on rooted phylogenies at its core. Knowing that all rootings produce the same likelihood due to the pulley principle and reversibility of nucleotide substitution models (Felsenstein, 2004), we can, for any rooted phylogeny, switch to one that is rooted at a different outgroup and has exactly the same likelihood. Thus, an equivalence class V exists where, given a likelihood or parsimony score ℓ, any given Phylo2Vec vector v ∈ V has the same ℓ(v), an SPR or RF distance of 0, but a Phylo2Vec distance of µ > 0. In practice, µ is often very large between v ∈ V (comparable to half the maximum SPR distance, see Fig. S1), which makes switching a vector v ∈ V to an equivalent v′ ∈ V an additional mechanism for tree space exploration.

Fig. 6. Example of a reordering scheme of v using level-order traversal. Starting from the root, for each level, we relabel the immediately descending leaf nodes with the smallest integers available (from 0 to n − 1; shown in orange). The letters (a-g) indicate the taxa, showing that reordering the leaves does not affect tree topology but simply changes the integer-taxon mapping.
Shuffling Indices This distance between two trees is not symmetric with respect to the labelling of the trees, as discussed further in the Appendix (Phylo2Vec details). Depending on the choice of labelling, certain portions of the tree may be easier to optimise than others when performing phylogenetic inference. This is an undesirable quality and can be remedied by a simple reordering of indices within our algorithm. An example of a reordering algorithm is presented in Figure 6.
Consider a tree T where the leaves are labelled by a fixed set of indices {0, 1, …, n − 1}.
Suppose that σ is a permutation of {0, 1, …, n − 1}, and consider a shuffled tree σ(T) to be a tree with the same topological structure as T, but where, for each i ∈ {0, 1, …, n − 1}, the leaf with original label i now has label σ(i).
Calculating the likelihood requires a tree T and a set of genetic data D = {D_0, D_1, …, D_{n−1}}, where D_j corresponds to the genotype of leaf j. We can then write the likelihood as L = L(T, D). Moreover, defining the shuffled genetic data as σ(D), where σ(D)_j = D_{σ⁻¹(j)}, we have

$$L(T, D) = L(\sigma(T), \sigma(D))$$

This occurs because, when computing the likelihood, any calculation for L(T, D) that involves the node with original label i (and hence genetic data D_i) will now involve the node with label σ(i) and hence genetic data σ(D)_{σ(i)} = D_{σ⁻¹(σ(i))} = D_i. Thus, since the topological structure of T is the same as that of σ(T), the likelihood will remain unchanged. Should the permutation only be applied to either the tree labels or the genetic data set, the resulting likelihood will likely be different from L(T, D).
A more rigorous proof can be found in the Appendix (Phylo2Vec details).
One can also recover the vector v corresponding to the shuffled tree σ(T). This is possible because of the bijective relationship between the space of v's and the space of trees. We provide an algorithm that inverts our map from v to M in the Appendix (Phylo2Vec details). Thus, one can equivalently define a shuffled vector σ(v) (such that σ(v) generates σ(T)) and consider the likelihood relationship as L(v, D) = L(σ(v), σ(D)). This allows for discrete optimisation steps to be taken with respect to the new shuffled v, increasing the flexibility of the algorithm while removing the asymmetric effects of the initial labelling.
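The invariance of a tree score under a joint relabelling of the leaves and the data can be checked numerically on a toy example. The sketch below uses the Fitch parsimony score for a single site as a stand-in for the likelihood (an assumption made to keep the example short and free of dependencies); trees are nested tuples of integer leaf labels, and all function names are hypothetical.

```python
from itertools import permutations


def relabel_tree(tree, sigma):
    """Return sigma(T): the same topology with leaf i renamed sigma[i]."""
    if isinstance(tree, int):
        return sigma[tree]
    return tuple(relabel_tree(child, sigma) for child in tree)


def shuffle_data(data, sigma):
    """Return sigma(D): the state of leaf i moves to the new label sigma[i]."""
    return {sigma[i]: state for i, state in data.items()}


def fitch_score(tree, data):
    """Minimum number of state changes for one site (Fitch parsimony)."""
    def visit(node):
        if isinstance(node, int):
            return {data[node]}, 0
        (left_set, left_cost), (right_set, right_cost) = visit(node[0]), visit(node[1])
        common = left_set & right_set
        if common:
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1
    return visit(tree)[1]


tree = (((0, 2), 3), 1)
data = {0: "A", 1: "C", 2: "A", 3: "C"}
base = fitch_score(tree, data)
for perm in permutations(range(4)):
    sigma = dict(enumerate(perm))
    # Shuffling tree labels and data together leaves the score unchanged.
    assert fitch_score(relabel_tree(tree, sigma), shuffle_data(data, sigma)) == base
print(base)  # 1
```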
Branch lengths In addition to tree topology, determining the branch lengths of a tree is an important facet in phylogenetic inference. When making small changes to the tree topology, a number of portions of the tree will remain identical and, therefore, it is likely that the optimal values of subtree branch lengths will not change. It is therefore helpful to represent branch lengths in a way that is robust to these changes, to avoid carrying out the full optimisation process every time the topology is changed.
Within the Phylo2Vec framework, there are several ways in which branch lengths can be integrated. First, given that each v_j refers to the branch splitting and leading to leaf j, a simple solution would consist of adding a 2-column matrix specifying the position at which branch v_j splits and the length of the new branch yielding leaf j. Alternatively, it is possible to assign each leaf node a "position", calculate internal node positions as some weighted average of the positions of the nearby leaf nodes, and then calculate branch lengths based on the distance between a pair of nodes. This would have the advantage of branch lengths being independent of the choice of root, thus making it easy to switch between the unrooted equivalence classes discussed previously.
For the examples in this paper, we used RAxML-NG (Kozlov et al., 2019) to optimise the branch lengths at each step of the algorithm without using information from previous branch lengths. This reduces the speed of the optimisation and is an area for improvement in future work.

Evaluation
Problem and data To demonstrate the utility of Phylo2Vec, we apply our new representation for phylogenetic inference of five popular empirical molecular sequence datasets under the maximum likelihood (ML) criterion. This dataset corpus spans different biological entities, taxa, and genetic sequence sizes. It has been proved that ML inference for phylogenetic trees is NP-hard (Roch, 2006) and therefore our key goal is to define a sensible heuristic that can explore the vastness of tree space.
Moreover, the likelihood surface exhibits high curvature (Sanderson et al., 2011) and being trapped in a local optimum is a persistent problem across all heuristic phylogenetic approaches.
Tree topology optimisation using hill-climbing A simple way to explore the space of possible trees is to use hill climbing, where we simply compute the difference in likelihood after a single element is changed. We define the neighbour matrix

$$\Delta\ell_{ij} = \ell(v_1, \ldots, v_{i-1}, j, v_{i+1}, \ldots, v_{n-1}) - \ell(v_1, \ldots, v_{n-1})$$

that is, the tree considered in the first likelihood has identical entries except for the i-th entry, which is changed to j. For (i, j) such that v_i = j is infeasible, we set Δℓ_ij = 0. We have found that considering each row of the neighbour matrix yields good results, i.e., if max(Δℓ_i) > 0, then we find j = argmax_j(Δℓ_ij) and change the value of v_i to j. This algorithm is guaranteed to converge to a point where max(Δℓ) ⩽ 0, as no change in v results in a gradient that is greater than zero. Moreover, as there are only finitely many possible v, and the objective strictly improves after each accepted change (ℓ strictly increases, so the negative log-likelihood strictly decreases), the algorithm must converge in finite time. More complicated optimisation algorithms can be readily created, which is an especially useful aspect of our representation. An example is performing hill-climbing over paired changes in v. Exploratory analysis suggests that paired changes are far more robust to being trapped in local minima, but at the cost of higher complexity. For challenging phylogenies, a simpler parsimony or minimum evolution score can be used to perform hill-climbing over pairs as an exploratory search.
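The row-wise search over the neighbour matrix can be sketched generically in Python. This is not the authors' Algorithm 1: it omits the label reordering step, `hill_climb` and `toy_score` are hypothetical names, `score` is any function to be maximised (for example a log-likelihood returned by an external tool), and the toy objective below merely counts agreements with an arbitrary target vector so that the example runs without sequence data.

```python
def hill_climb(v, score, max_epochs=100):
    """Greedy search: for each index, move to the best single-entry change."""
    v = list(v)
    best = score(v)
    evaluations = 1
    for _ in range(max_epochs):
        improved = False
        for i in range(1, len(v)):  # v[0] is always 0 and never changes
            # Row i of the neighbour matrix: every feasible alternative j.
            trials = [
                (score(v[:i] + [j] + v[i + 1:]), j)
                for j in range(2 * i + 1)
                if j != v[i]
            ]
            evaluations += len(trials)
            top_score, top_j = max(trials)
            if top_score > best:  # accept only strict improvements
                v[i], best, improved = top_j, top_score, True
        if not improved:  # no entry of the neighbour matrix is positive
            break
    return v, best, evaluations


target = [0, 2, 1, 5, 3]


def toy_score(v):
    """Stand-in objective: minus the Hamming distance to the target."""
    return -sum(a != b for a, b in zip(v, target))


print(hill_climb([0, 0, 0, 0, 0], toy_score))  # ([0, 2, 1, 5, 3], 0, 41)
```

On this separable toy objective a single pass reaches the optimum and a second pass confirms that no single change improves it; real likelihood surfaces offer no such guarantee.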
However, as highlighted above, a fundamental asymmetry exists in Phylo2Vec which can make optimisation inefficient. A simple way to mitigate this asymmetry is to reorder the integer-taxon mapping to obtain an ordered vector (and thus, an ordered tree), as described previously in An incomplete integer representation of trees as birth processes and Figure 6. The advantage of carrying out our hill climbing scheme on these ordered trees is that it removes the secondary effects of changing an element of v which can occur by the divergence in internal node labels. This prevents our algorithm from getting stuck in local minima, as it means that more parts of the tree can be easily edited.
The resulting algorithm is detailed in Algorithm 1. Our investigations have shown that all the possible trees that are one step from some ordered v are also one SPR move from the original tree (though the converse is not true: not all SPR moves will be one step from v). This is proved in the Appendix (Phylo2Vec details). Thus, this application of our Phylo2Vec formulation falls within the SPR framework, and provides a mathematically convenient and principled way to explore tree space using well-tested SPR methodology.
We note that we could additionally explore rooted equivalence classes to further prevent being stuck in local minima. In particular, there is more freedom in the movements of nodes further down the tree, and re-rooting at the deepest node would allow all nodes to be easily moved to a variety of locations. However, for the experiments presented hereafter, we found this extra degree of freedom to be unnecessary.
Algorithm 1 Hill-climbing optimisation of a tree with n leaves
Input: v ∈ T_n ▷ Initialise with a random v
repeat
  ▷ Reorder the labels (see Figure 6)
until max(G_{i=1,…,n−1}) = 0 ▷ Continue iterating until local minimum

Additional properties of the Phylo2Vec vector An additional advantage of having an integer vector representation for binary trees such as Phylo2Vec is efficiency with respect to sampling, data storage, and assessing tree equality (with respect to topology). We highlight these properties in Figure 7 by performing several benchmarks against functions of the widely used R library ape (Paradis and Schliep, 2019). Figure 7a shows how Phylo2Vec sampling of trees is several times faster than the function rtree, while also being simple in construction and implementation. Figure 7b verifies that the Phylo2Vec sampling distribution is indeed uniform. While we do not explore other sampling schemes further, ordered trees present one avenue to perform constrained tree sampling. Figure 7c shows the storage costs in kB of Phylo2Vec as compared to a Newick string with only topological information. From these simple simulations, we estimate that a Phylo2Vec vector can be stored as an integer array or a string at up to a six-fold lower storage cost. Finally, Figure 7d shows the time required to find a unique set of topologies from a set of trees. Phylo2Vec is several orders of magnitude faster than unique.multiPhylo in ape, and can be massively parallelised. This speed difference can be particularly useful in Bayesian settings.
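Because the mapping is bijective, two trees on the same taxa share a topology exactly when their vectors are equal, so duplicate removal reduces to comparing integer sequences. The sketch below illustrates the idea in plain Python; `unique_topologies` is a hypothetical name, it is not the benchmarked implementation, and it assumes that all vectors use the same integer-taxon mapping.

```python
def unique_topologies(vectors):
    """Keep the first occurrence of each vector, preserving input order."""
    seen = set()
    unique = []
    for v in vectors:
        key = tuple(v)  # tuples are hashable, so set membership is O(1) on average
        if key not in seen:
            seen.add(key)
            unique.append(list(v))
    return unique


trees = [[0, 0, 3], [0, 2, 1], [0, 0, 3], [0, 0, 4], [0, 2, 1]]
print(unique_topologies(trees))  # [[0, 0, 3], [0, 2, 1], [0, 0, 4]]
```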

Implementation
All Phylo2Vec algorithms and related optimisation methods presented in the main text were implemented in Python 3.10 using NumPy (Harris et al., 2020) and numba (Lam et al., 2015).

RESULTS
We test Phylo2Vec by performing inference on five popular empirical datasets described in Table 1. This dataset corpus spans different biological entities, taxa, and genetic sequence sizes.
For each dataset, we use the optimisation procedure described in Evaluation, using RAxML-NG for branch length and substitution matrix optimisation. We report performance using the negative log version of the tree likelihood defined by Felsenstein (1983).
Fig. 7. For each size and sampler, we sample 10000 trees and convert them first to their Phylo2Vec representation, and second to an integer using a method similar to that of Rohlf (1983). We then compare the probability distributions of the integers generated by Phylo2Vec and ape sampling against the reference uniform distribution for each tree size using the Kullback-Leibler (KL) divergence. The lower the KL-divergence value, the more the reference distribution and the distribution of interest share similar information. (c) Object sizes for different tree sizes of Phylo2Vec vectors (stored as a 16- or 32-bit numpy integer array, or a string) compared against their Newick-format equivalents (without branch length information). (d) Average time for duplicate removal from a set of trees using Phylo2Vec (vectors) and the unique.multiPhylo function from ape. Execution time was measured over 30 executions using Python's timeit and R's microbenchmark, respectively.

Fig. 9. Negative log-likelihood path drawn from all possible trees of the Yeast dataset. A and B respectively show the path to the minimum from a random tree and the worst possible tree. The black line shows the sorted phylogenetic likelihoods for all trees. The arrows show the proposal moves for two searches, one from a random tree (A) and one from the worst possible tree (B). Axes: ranked tree number and negative log-likelihood.

Figure 8 shows the optimisation results for four of the datasets described in Table 1. We observe that from 10 random starting trees we always achieve the same minimal loss without being trapped in local optima. This is comparable to state-of-the-art software that also searches through topological space (Stamatakis, 2014; Minh et al., 2020). For each dataset, only two epochs of changes (i.e., two passes through every index of v) were generally needed to achieve a minimal negative log-likelihood. In addition, for M501 for example, only a total of around 10,000 likelihood optimisations for each run were needed to reach a minimum, a vanishingly small fraction of the total number of trees possible with 29 taxa (∼8 × 10^36). The number of optimisations can be reduced depending on the stopping criterion, but at the risk of being trapped in local minima. We also note that in the Zika virus example, two runs converged at a loss slightly (0.07%) greater than the minimum of the other eight runs. The resultant trees from these minima show that we get trapped in these suboptimal minima due to rooting issues, preventing single changes in v from finding a better optimum. This highlights once again that our algorithm is attempting to solve a more difficult problem than is strictly necessary by searching the space of rooted trees rather than unrooted trees. Due to the pulley principle (Felsenstein, 2004), all rootings of an unrooted tree have the same negative log-likelihood and therefore no paths between rooted trees exist to aid our optimisation algorithm. In practice, especially for large phylogenies, it is common to begin optimisation from a sensible starting point (Paradis et al., 2004) (e.g., a maximum parsimony or neighbour joining tree). In our experiments, we have chosen to start from a completely random tree to highlight the utility of simple algorithms based on Phylo2Vec to traverse tree space. Subsequently, we apply the same optimisation procedure for the yeast dataset (8 taxa) initially presented by Rokas et al. (2003) and studied by Money and Whelan (2012). Given the smaller number of taxa, we were able to exhaustively calculate the likelihood for every possible rooted tree. As shown in Figure 9, we notice a broad region of numerous trees with comparable likelihoods, in addition to a considerably smaller group of trees exhibiting increasing likelihood.
Regardless of whether we start from a random tree or the worst possible tree, our algorithm quickly converges to the accurate tree reported by Rokas et al. (2003). Across several runs, Algorithm 1 required 96 total likelihood evaluations, a very small fraction of the total number of trees.
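To put these evaluation counts in perspective, the size of tree space can be computed from the semifactorial formula (2n − 3)!! quoted earlier. The snippet below is a small illustrative calculation with a hypothetical function name; the figures 96, 8 taxa and 29 taxa are taken from the text.

```python
import math


def num_rooted_trees(n_leaves):
    """Number of rooted binary trees with n labelled leaves: (2n - 3)!!."""
    return math.prod(range(1, 2 * n_leaves - 2, 2))


yeast_trees = num_rooted_trees(8)
print(yeast_trees)  # 135135
print(f"{96 / yeast_trees:.3%}")  # 0.071%
print(f"{float(num_rooted_trees(29)):.1e}")  # 8.7e+36
```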

DISCUSSION
Phylo2Vec is a parsimonious representation for phylogenetic trees whose validity extends to any binary tree. This representation facilitates the calculation of distances between trees and allows the formation of a simple algorithm for phylogenetic optimisation. Following from trends in phylogenetics, Phylo2Vec could be integrated within state-of-the-art computing libraries (e.g., libpll (Flouri et al., 2015) or Beagle (Ayres et al., 2012)) to facilitate its use. We have not yet considered Bayesian inference, but this is likely a useful application of Phylo2Vec, where random walks can be trivially implemented (see Figure 5). Furthermore, Phylo2Vec can be useful in assessing topological convergence: for example, for a large phylogeny of 500 taxa and a million trees, extracting the unique set of topologies takes < 10 seconds on a single core in Python, and can be even faster with parallel computation. Although Phylo2Vec does allow for unrooted trees, it is primarily an algorithm for rooted trees. In the examples in this paper, we only consider reversible Markov models where rooting is irrelevant due to the pulley principle (Felsenstein, 2004). Irreversible Markov models are both mathematically and biologically more principled (Sumner et al., 2012) but require rooted trees. Therefore, a useful application of Phylo2Vec could be in the inference of phylogenies with irreversible Markov models.
The use of empirical datasets served as a proof of concept that maximum likelihood estimation can be performed using Phylo2Vec vectors. We show that, using a simple hill-climbing scheme, we can recover the same topology optimum found by state-of-the-art MLE frameworks such as RAxML-NG (Kozlov et al., 2019). It is important to note, however, that this approach is nowhere near as optimised as RAxML-NG. As it only performs topology changes at a single vector index at a time, its inherent greediness makes inference of large datasets difficult.
That being said, the simplicity of the Phylo2Vec formulation means that other, more efficient and complex optimisation schemes can also be developed. For instance, Phylo2Vec can also benefit from fast SPR changes (Guindon et al., 2010) and other heuristic optimisations that are currently in RAxML(-NG). In addition, by construction, we have ensured that Phylo2Vec can be made differentiable by transforming v into a matrix W with entries in [0, 1] such that W_ij = P(v_i = j). Via this transform, inference in a continuous tree space using gradient descent-based optimisation frameworks is theoretically possible, but its particulars remain to be developed. Similarly, we expect Phylo2Vec-based representations to be applied in Monte Carlo tree search (MCTS) frameworks, which may explore tree space more efficiently, or used as an embedding to regularly infer phylogenetic trees using well-established machine learning paradigms such as self-supervised learning from large existing tree libraries (e.g., TreeBase (Piel et al., 2009)).
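One way to picture the matrix W with W_ij = P(v_i = j) is as a row-wise probability table over the values each entry may take. The sketch below is an assumption-laden illustration, not the construction used by the authors: it parameterises each row with a softmax over unconstrained logits, masks values outside the assumed range 0 ≤ j ≤ 2i, and uses hypothetical function names.

```python
import math

def relaxation_matrix(logits):
    """Row-wise masked softmax: W[i][j] = P(v_i = j), zero where j > 2*i."""
    W = []
    for i, row in enumerate(logits):
        allowed = 2 * i + 1                      # assumed support {0, ..., 2*i}
        m = max(row[:allowed])                   # subtract the max for numerical stability
        weights = [math.exp(x - m) for x in row[:allowed]]
        total = sum(weights)
        W.append([w / total for w in weights] + [0.0] * (len(row) - allowed))
    return W

def most_likely_vector(W):
    """Discrete vector obtained by taking the most probable value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in W]

k = 4                                            # number of entries in v
logits = [[0.0] * (2 * (k - 1) + 1) for _ in range(k)]
logits[3][5] = 3.0                               # favour v_3 = 5
W = relaxation_matrix(logits)
v = most_likely_vector(W)
```

Each row of W sums to one, so a discrete vector can be recovered by taking the row-wise maximum, while the logits remain free parameters that a gradient-based optimiser could update.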

FIGURE CAPTIONS
1. An incomplete integer representation of tree topology as birth processes. (a) Labelling a tree as an ordered vector: example for v = [0, 0, 0]. We process leaves in ascending order. For each leaf j, we retrieve its sibling (or adjacent tip) in the Newick string, ignoring leaves > j. The adjacent tip corresponds to v[j]. (b) Recovering a tree from an ordered vector: example for v = [0, 0, 1]. We process v from left to right. Ancestors are named in last-in-first-out (LIFO) fashion: the ancestor of the last added leaf L − 1 (here, leaf 3) is named L (here, 4), the ancestor of the second-to-last added leaf L − 2 (here, leaf 2) is named L + 1 (here, 5), etc. In both cases, the lengths of the edges are arbitrary.
2. Recovering a tree from a Phylo2Vec vector: example for v = [0, 0, 3]. We process v from left to right. The branch renaming step depends on the branch type. For leaf branches, branches that end on leaves 0, ..., L − 1 are labelled 0, ..., L − 1, respectively. For internal branches, the next branch (L) to label is i) the deepest and ii) with the "highest" children (if there are ties for case i). We repeat the same process for internal branches L + 1, ..., 2(L − 1) − 1, and label the last branch leading to the extra root 2(L − 1). See Algorithm S5 and Figure S2a for more details about implementation and complexity.
3. Example of trees with n = 4 leaves represented in both Newick and Phylo2Vec vector formats. Leaf and internal nodes are coloured in dark green and lime, respectively.
4. Labelling a tree as a Phylo2Vec vector v: example for v = [0, 0, 4]. We process leaves in ascending order. For each leaf j, we determine the branch that split and yielded leaf j, which corresponds to v[j]. At each step, we re-label the branches with the same process as in Figure 2.
5. Comparison of Phylo2Vec moves with three popular tree distances: subtree-prune-and-regraft (SPR; left), Robinson-Foulds (RF; middle), and Kuhner-Felsenstein (KF; right). To generate the distances, a random walk of 5000 steps was performed from a random initial v with 200 taxa. At each step, each v_i can increment, decrement or remain unchanged.
6. Example of a reordering scheme of v using level-order traversal. Starting from the root, for each level, we relabel the immediately descending leaf nodes with the smallest integers available (from 0 to n − 1; shown in orange). The letters (a-g) indicate the taxa, showing that reordering the leaves does not affect tree topology but simply changes the integer-taxon mapping.
7. Phylo2Vec-based likelihood optimisation results for four datasets described in Table 1. The horizontal and vertical lines indicate local minima and epochs (i.e., one pass through every index of v), respectively.
8. Negative log-likelihood path drawn from all possible trees of the Yeast dataset. A and B respectively show the path to the minimum from a random tree and the worst possible tree. The black line shows the sorted phylogenetic likelihoods for all trees. The arrows show the proposal moves for two searches, one from a random tree (A) and one from the worst possible tree (B).

the final tree, by the previous argument) once node i has been added (as M(a, i) will be the new internal node and M(i, b) cannot be either i or this new internal node). A similar argument shows that M(a, i) ≺ M(b, i) in T′ and hence T ≠ T′.
If only one of X \ X′ and X′ \ X is non-empty, suppose without loss of generality that X \ X′ is non-empty and choose a ∈ X \ X′. We can choose a distinct node b ∈ X′ ∩ X (as otherwise, if X′ ∩ X and X′ \ X are both empty, we must have one of X and X′ being empty, which is a contradiction as every edge has at least one leaf node descended from it). Then, M(i, a) is the newly-added internal node in the construction of T, but not in the construction of T′. However, in both cases, M(i, b) is the newly-added node. Hence, in T, M(i, a) = M(i, b), but in T′, they are distinct. Thus, T′ ≠ T.
Hence, topological non-equivalence holds in both cases, and the map from v to the set of trees is therefore injective and therefore bijective.
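The step from injectivity to bijectivity is a counting argument: an injective map between two finite sets of equal size is necessarily a bijection. The sketch below checks the count for small trees by brute force. It assumes that a tree with n leaves corresponds to a vector of length n − 1 with 0 ≤ v[i] ≤ 2i, as in the examples in the figure captions, and the function names are hypothetical.

```python
from itertools import product

def double_factorial_count(n_leaves):
    """(2n - 3)!! written as the product of (2i + 1) for i = 0, ..., n - 2."""
    count = 1
    for i in range(n_leaves - 1):
        count *= 2 * i + 1
    return count

def all_vectors(n_leaves):
    """Every integer vector of length n - 1 with 0 <= v[i] <= 2*i (assumed constraint)."""
    return list(product(*(range(2 * i + 1) for i in range(n_leaves - 1))))

vectors = all_vectors(4)
n_vectors = len(vectors)
n_trees = double_factorial_count(4)
```

For four leaves both counts equal 15, so the number of admissible vectors matches the number of rooted binary trees.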
Label-asymmetry of v-induced distance
As discussed in the main text, v induces a natural distance function between trees, namely, that the distance between v and w is equal to the number of indices i at which v_i ≠ w_i. However, this distance function is dependent on the labels assigned to each leaf (and is hence label-asymmetric). A simple example of this can be found in the case of four leaves.
Consider the tree given by v 1 = (0, 1, 2).This is a ladder tree (that is, each internal node and the root is parent to at least one leaf node) with nodes in order 0, 1, {2, 3} (where the {2, 3} is used to denote the fact that nodes 2 and 3 are from the same generation and so could be read in either order).This is distance 1 away from v 2 = (0, 1, 4), which is again a ladder tree with nodes now ordered as 3, 0, {1, 2}.Thus, if distance were symmetric, then any ladder tree with ordered nodes a, b, {c, d} would be distance 1 away from the ladder trees with ordered nodes d, a, {b, c} and c, a, {b, d}.However, the ladder tree with ordered nodes 2, 0, {1, 3} is given by v 3 = (0, 2, 1), which is distance 2 away from v 1 .Thus, the distance function is not label-symmetric.
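The distances quoted in this example can be checked numerically. The sketch below assumes that the distance is the number of entries at which two vectors differ (a Hamming distance), which reproduces the values 1 and 2 stated above; the function name `vector_distance` is hypothetical.

```python
def vector_distance(v, w):
    """Number of positions at which two equal-length vectors differ."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(v, w))

v1, v2, v3 = (0, 1, 2), (0, 1, 4), (0, 2, 1)
d12 = vector_distance(v1, v2)   # v1 and v2 differ only in the last entry
d13 = vector_distance(v1, v3)   # v1 and v3 differ in the last two entries
```

The function itself is symmetric in its two arguments; the asymmetry discussed above concerns relabelling the leaves, which changes the vectors and hence the distance between the same pair of unlabelled tree shapes.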
Unrooted tree equivalence classes
Noting from the main text that there are (2k − 3)!! = ∏_{i=0}^{k−2} (2i + 1) unrooted trees with k + 1 leaves, we can partition the space of trees

Fig. 7. Phylo2Vec allows for fast and unbiased sampling, low memory or storage, and fast comparison of trees. (a) Average sampling time using phylo2vec.utils.sample and rtree from ape. Execution time was measured over 100 executions using Python's timeit and R's microbenchmark, respectively. (b) Sampling bias comparison. For each size and sampler, we sample 10000 trees and convert them first to their Phylo2Vec representation, and second to an integer using a method similar to that of Rohlf (1983). We then compare the probability distributions of the integers generated by Phylo2Vec and ape sampling against the reference uniform distribution for each tree size using the Kullback-Leibler (KL) divergence. The lower the KL-divergence value, the more the reference distribution and the distribution of interest share similar information. (c) Object sizes for different tree sizes of Phylo2Vec vectors (stored as a 16- or 32-bit numpy integer array, or a string) compared against their Newick-format equivalents (without branch length information). (d) Average time for duplicate removal from a set of trees using Phylo2Vec (vectors) and the unique.multiPhylo function from ape. Execution time was measured over 30 executions using Python's timeit and R's microbenchmark, respectively.
Fig. 8. Phylo2Vec-based likelihood optimisation results for four datasets described in Table 1. The horizontal and vertical lines indicate local minima and epochs (i.e., one pass through every index of v), respectively.

Algorithm S4 Labelling a rooted tree as a Phylo2Vec vector
Input rooted tree r with n leaves ▷ We assume a Newick string with integer nodes and labelled internal nodes. Ex: (((2,3)6,1)7,(0,4)5)8;
M ← reduce(r) ▷ "Reduce" the Newick string to an (n − 1) × 3 matrix.
▷ Starting from the smallest internal node, replace the internal nodes by their smallest child (and discard the third column). This is equivalent to the pairs in Algorithm S3. Ex:
▷ Use the same logic as Algorithm S3 to retrieve each v_j from the position of the containing leaf j
return v

Algorithm S5 Labelling edges in the Phylo2Vec framework
Input rooted tree T with n leaves ▷ Leaves labelled 1, ..., n − 1
e ← [ ] ▷ Edges
for all leaves j = 1, ..., n − 1 do
    e(j, a(j)) ← j ▷ The edge from which leaf j descends is labelled j
end for
j ← n ▷ The label of the next edge
for all heights h = 1, ..., n − 1 do ▷ height = number of edges on the path from a given node to the furthest leaf node
    a_j ← ancestor of height h with the greatest child
    e(a_j, a(a_j)) ← j ▷ The edge connecting a_j and its ancestor is labelled j
    j ← j + 1 ▷ Increment j
end for
return e

Table 1. Evaluation datasets, sorted by number of taxa.