An evolution strategy approach for the balanced minimum evolution problem

Abstract Motivation The Balanced Minimum Evolution (BME) is a powerful distance based phylogenetic estimation model introduced by Desper and Gascuel and nowadays implemented in popular tools for phylogenetic analyses. It was proven to be computationally less demanding than more sophisticated estimation methods, e.g. maximum likelihood or Bayesian inference while preserving the statistical consistency and the ability to run with almost any kind of data for which a dissimilarity measure is available. BME can be stated in terms of a nonlinear non-convex combinatorial optimization problem, usually referred to as the Balanced Minimum Evolution Problem (BMEP). Currently, the state-of-the-art among approximate methods for the BMEP is represented by FastME (version 2.0), a software which implements several deterministic phylogenetic construction heuristics combined with a local search on specific neighbourhoods derived by classical topological tree rearrangements. These combinations, however, may not guarantee convergence to close-to-optimal solutions to the problem due to the lack of solution space exploration, a phenomenon which is exacerbated when tackling molecular datasets characterized by a large number of taxa. Results To overcome such convergence issues, in this article, we propose a novel metaheuristic, named PhyloES, which exploits the combination of an exploration phase based on Evolution Strategies, a special type of evolutionary algorithm, with a refinement phase based on two local search algorithms. Extensive computational experiments show that PhyloES consistently outperforms FastME, especially when tackling larger datasets, providing solutions characterized by a shorter tree length but also significantly different from the topological perspective. Availability and implementation The software and the data are available at https://github.com/andygaspar/PHYLOES.


PhyloES encoding detail
PhyloES requires the definition of an appropriate encoding to represent the individuals.The problem of encoding trees has been widely studied in the literature (Caminiti et al., 2007;Catanzaro and Pesenti, 2019) and several tree codes have been proposed (Prüfer, 1918;Neville, 1953).However, the encoding method proposed in Rohlf (1983) is specific for unrooted binary trees and it offers two main advantages: it provides a straightforward method for the random generation of phylogenies, and it allows to manipulate trees operating directly on the encodings, by ensuring that when a component of the code of a phylogeny is altered, the result is still a phylogeny.Rohlf (1983)'s encoding defines a map φ between the edge set E of a tree and the set of coding vectors H whose elements are of the form (h 1 , . . ., h n−3 ), where n is the number of leaves and h k ∈ {1, 2, . . ., 2k − 3}.Such a function φ is proven to be well defined and bijective as long as a well defined labelling of the edges is given (Rohlf, 1983).

Internal edge labels
For our encoding, we first assume to work with an ordered set X of n taxa, t 1 , t 2 , . . ., t n .We define the label of an edge as its index in the edge list E of a phylogenetic tree, which is ordered first according to the lowest index of the two nodes defining each edge, and in case of ambiguity according to the greatest.To finally guarantee the uniqueness of the edge labels we have to ensure that a well defined labelling of the internal nodes is given.We achieve so by considering the step-wise tree construction method that, starting from an initial star tree composed of the first three taxa and a single internal node, adds one by one all the remaining taxa.All internal nodes are labelled according to their insertion order, starting from the index k = n + 1 up to the index k = 2n − 2 (where n is the number of taxa).

Decoding
φ −1 , the decoding map described in algorithm 1, allows us to retrieve the edge set of a tree T of n taxa given its code h.Since in our case h represents the instructions for the construction of T we simply need to build E step-by-step according to h, starting from the initial star composed by the first three taxa.At each step k we remove the selected edge h k , we add the two new edges and we fix the order of E. Algorithm 1 shows the details of the decoding procedure, which we will now analyse in depth.From now onwards we will indicate a taxa with the letter t and internal nodes with the letter i.
At the initial step E = {(t 1 , i n+1 ), (t 2 , i n+1 ), (t 3 , i n+1 )}; h, which elements are in {1, 2, 3}, indicates the index in E of the edge selected for the insertion of t 4 and the internal node i n+2 .Referring to the example of Figure 1, in which h = (2, 5, 1), the edge selected is E[2] = (t 2 , i 7 ).The function InsertTaxonEdge (algorithm 2) adds the new edge in the list after the only edge connecting the previous taxa t 3 , (t 3 , i 7 ), obtaining E = {(t 1 , i 7 ), (t 2 , i 7 ), (t 3 , i 7 ), (t 4 , i 8 )}.This insertion preserves the correct ordering of E since the edges having taxa as first component, (t * , * ), always come before the others, which are of the form (i α , i β ).This is always true as we established that taxa are indexed from 1 to n and internal nodes form n + 1 to 2n − 2.
Then, the selected edge (t 3 , i 7 ), has to be removed and its nodes will be connected with the new internal node i 8 forming two new edges.Therefore, (t 3 , i 7 ) can be simply replaced by (t 3 , i 8 ), (line 5 in algorithm 1).In this case E is still ordered so no fixing is needed.However, in general it InsertTaxonEdge((t k+3 , i n+1+k )) 5: FixPosition((v α , i n+1+k )) 7: Insert((i β , i n+1+k )) 8: end for 9: return E Figure 1: Example of the tree construction and encoding for a phylogeny with six taxa.Starting from a star-tree with three taxa connect to a single inner node, the remaining taxa are added to the tree iteratively and both the edge set E and the encoding H are updated at each addition.might be necessary to adjust the position of the new edge, task which is handled by the function FixPosition (algorithm 3).In fact, as shown in figure 2, if the selected edge (i α , i β ) is connecting two internal nodes i α and i β , and if i α is linked with some other internal node i γ with γ > β and the node * is either a leaf or an internal node with index greater than γ, we have that the order of E = {. . ., (i α , i n+1+k ), (i α , i γ , ), . . .} after the replacement, must be fixed swapping (i α , i n+1+k ) and (i α , i γ ), since i + 1 + k > γ.In the case in which * is also an internal node with index greater than γ, an additional swap between (i α , i n+1+k ) and (i α , * ) has to be performed.
The last line of the loop of φ −1 is the insertion in E of the edge (i β , i n+1+k ), handled by the function Insert (algorithm 4).Here there are three cases: if k = 1, the edge (i n+1 , i n+2 ) is simply added to the end of the list.If k > 1 and no edge in E has i β as first component, (i β , i n+1+k ) must be inserted in E after (i γ , * ), where i γ is the internal node with the greatest index such that γ < β and * is the internal node connected to i γ with the greatest index.This is the case of (i 8 , i 9 ) in the second step of the example of figure 1; before its insertion, E = {(t 1 , i 7 ), (t 2 , i 8 ), (t 3 , i 7 ), (t 4 , i 8 ), (t 5 , i 9 ), (i 7 , i 9 )}; since no edge E has i 8 as first component and i 7 is the internal node with the greatest index lower than 8, (i 8 , i 9 ) is inserted after (i 7 , i 9 ).The last case is the one in which there are in E some edges with i β as first component.If so, (i β , i n+1+k ) has to be inserted after the last of them, a task performed by function 3.

Decoding complexity
The decoding algorithm, φ −1 is composed of a for loop of n − 3 (O(n)) iterations.The function InsertTaxonEdge (algorithm 2) requires constant time since the position of each node in the list can be traced out and reached in O(1).FixPosition (algorithm 3) needs to iterate through E starting from (v α , i n+k+1 ) to find the last occurrence of an edge with first component equal to v α ; since any node is linked to at most three other nodes, the function is O(1).Concerning function Insert (algorithm 4), apart the one in line 4, all operations require O(1).Since in line 4 the index γ is unknown, it must be searched in an ordered list, an operation that requires O(log(n)) operations.Due to the above reasoning the whole complexity of φ −1 is O(n log(n)).

Coding
The coding algorithm, φ, maps the edge set of a phylogenetic tree into its coding vector h.We remark that in order to obtain the correct code of a tree two conditions must be ensured: 1.The internal node labels must be consistent with the one chosen for our encoding.
2. E must be ordered primarily according to the edges' first component index and, when equal, according to the second component index.
Concerning 1), we notice that our internal node labels are determined by the phylogeny step-wise tree construction, that is the one followed in the decoding algorithm and for which the label of the internal node inserted at the construction step k is n + k + 1.Consequently, the edge list E of a phylogeny should always contain the edge (t n , i 2n−2 ), which is the one added to the list in the last step.If we have instead that t n is linked to i α with i α = i 2n−2 we can fix the inconsistency by swapping the labels of the two internal nodes.If we then remove the edge (t n , i 2n−2 ) from E we should have that t n−1 is linked i 2n−3 : if this is not the case, we can again fix the inconsistency by performing the index swap between i 2n−3 and the internal node connected to t n−1 .To complete the adjustment of the internal node labels, as described in algorithm 5 this procedure has to be performed for all the remaining steps.We remark that at each step, when we remove the edge (t k , i n+k+1 ), the tree must be reconnected.This is done connecting the two nodes v α and i β , with α < β, previously linked to i n+k+1 (figure 3).

Algorithm 5 FixInternalNodesLabels
Require: swap indexes between i γ and i n+k−2 6: end for 10: return E Once ensured condition 1), condition 2) can be guaranteed by sorting E. The idea behind the coding procedure (algorithm 6) is to invert the flow of the decoding algorithm in a similar fashion to algorithm 5, and to retrieve h deconstructing the tree by removing one-by-one all taxa from t n to t 4 .The main difference with algorithm 5 is that each time that an edge is removed the order of E must be restored.This might happen when reconnecting nodes v α and i β , in the case in which v α is an internal node connected with some other internal node i γ such that γ > β.In that case, the function 7 ensures the correct ordering by applying the appropriate swaps.Since finally E is correctly ordered, the component h k is given by the index of (v α , i β ) in E which is the edge between the nodes v α and i β previously connected to i n+k+1 .

Coding complexity
The whole coding procedure is composed of three parts: the function FixInternalNodesLabels (algorithm 5), the sorting of E and the function phi (algorithm 6).FixInternalNodesLabels consists of a for loop composed of only O(1) operations, so it requires O(n) operations.Sorting E instead requires O(n log(n)).In the for loop of φ all operations but the one of line 6 require O(1); the index retrieval of line 6 requires instead O(log(n)) operations since it is a search in an ordered list.The whole cost of φ is therefore O(n log(n)).In conclusion, all three steps of the coding require in total O(n log(n)).

PhyloES encoding advantages
As mentioned here in section 1, the encoding we developed turns out particularly handy to generate and manipulate trees.The decoding algorithm (6) provides in fact a constructive procedure to generate a random tree of n taxa, by sampling a vector of n − 3 components such that: In addition, in the PhyloES offspring phase, a new individual is generated defining its coding vector h, which components are obtained by sampling from the components of the whole population.Since any vector of the form described in equation 1 is a valid tree we are ensured that such a new individual is a valid tree.Figure 4 shows an example of the generation of individuals in the case of a population composed of three trees of eight taxa.

Individual replacement
Here we provide the pseudo-code of the individual replacement algorithm used in PhyloES.
To better understand the flow of algorithm 8, let us suppose to have the following population of trees P = a, b, c, d, e, f, f , where elements are ordered by tree length.The worst individual f has two occurrences, so its first occurrence in P is replaced e, which is the second worst individual, obtaining

Figure 2 :
Figure 2: In case of γ < β, the order of E must be adjusted.

Figure 4 :
Figure 4: Example of generation of a new individual by sampling each component from the population.