The Free Lunch is not over yet—systematic exploration of numerical thresholds in maximum likelihood phylogenetic inference

Abstract
Summary: Maximum likelihood (ML) is a widely used phylogenetic inference method. ML implementations heavily rely on numerical optimization routines that use internal numerical thresholds to determine convergence. We systematically analyze the impact of these threshold settings on the log-likelihood and runtimes for ML tree inferences with RAxML-NG, IQ-TREE, and FastTree on empirical datasets. We provide empirical evidence that we can substantially accelerate tree inferences with RAxML-NG and IQ-TREE by changing the default values of two such numerical thresholds. At the same time, altering these settings does not significantly impact the quality of the inferred trees. We further show that increasing both thresholds accelerates the RAxML-NG bootstrap without influencing the resulting support values. For RAxML-NG, increasing the likelihood thresholds ϵ_LnL and ϵ_brlen to 10 and 10^3, respectively, results in an average tree inference speedup of 1.9 ± 0.6 on Data collection 1, 1.8 ± 1.1 on Data collection 2, and 1.9 ± 0.8 on Data collection 2 for the RAxML-NG bootstrap compared to the runtime under the current default setting. Increasing the likelihood threshold ϵ_LnL to 10 in IQ-TREE results in an average tree inference speedup of 1.3 ± 0.4 on Data collection 1 and 1.3 ± 0.9 on Data collection 2.
Availability and implementation: All MSAs we used for our analyses, as well as all results, are available for download at https://cme.h-its.org/exelixis/material/freeLunch_data.tar.gz. Our data generation scripts are available at https://github.com/tschuelia/ml-numerical-analysis.


ML Inference Tools
RAxML-NG
RAxML-NG [10] was introduced in 2019 as a successor of the widely used ML inference tool RAxML [20]. RAxML-NG optimizes the initial starting tree using a greedy hill-climbing algorithm. Here, greedy means that only steps that improve the LnL score are accepted. The hill-climbing procedure comprises multiple rounds of optimizing the substitution model parameters, the branch lengths, and the tree topology. The branch lengths and substitution model parameters are optimized using the Newton-Raphson, L-BFGS-B [4], and Brent [1] methods. RAxML-NG allows the user to constrain the branch length values via the minBranchLen and maxBranchLen thresholds. The user can also define the convergence limit of the L-BFGS-B method via the threshold bfgs factor.
RAxML-NG implements Subtree Pruning and Regrafting (SPR) moves as its topology optimization strategy. One iteration of pruning and regrafting all possible subtrees in the current tree is called an SPR round. Instead of performing all possible reattachments, RAxML-NG only reattaches a pruned subtree into neighboring branches up to a maximum distance, measured in nodes from the pruning position: the SPR radius. RAxML-NG automatically determines this SPR radius by performing SPR rounds and increasing the SPR radius until the LnL score does not further improve (autodetect SPR rounds). On the initial starting tree topology, RAxML-NG first performs so-called fast SPR rounds. During these fast SPR rounds, the branch lengths around subtree regraft nodes are not optimized prior to scoring the tree. During the subsequent slow SPR rounds, the lengths of the three branches adjacent to the regrafting node are optimized prior to computing the LnL score. After each fast or slow SPR round, RAxML-NG optimizes all branch lengths of the 20 most promising tree topologies found during this SPR round. Once the LnL score no longer improves by more than ϵ_LnL, RAxML-NG terminates the SPR rounds. Before exiting, RAxML-NG optimizes the substitution model parameters until the LnL score converges, that is, until the LnL score no longer improves by at least ϵ_model between model parameter optimization iterations.
When optimizing all branch lengths, RAxML-NG repeatedly iterates over all branches in the tree to improve their values with respect to the LnL score. The user can limit the maximum number of passes over the entire tree via the parameter num iters. Throughout the tree inference process, RAxML-NG uses likelihood epsilons to detect LnL score convergence during these numerical optimization procedures. After each iteration, RAxML-NG checks if the likelihood improved by at least a certain threshold. Standard RAxML-NG uses two different likelihood epsilons: ϵ_LnL and ϵ_brlen. RAxML-NG uses the ϵ_LnL threshold during the autodetect, fast, and slow SPR rounds, as well as when it optimizes all branch lengths in the tree. For our more thorough likelihood epsilon study (Study 4), we modified RAxML-NG such that it uses a dedicated threshold for each of these four steps. The ϵ_brlen threshold is used during the slow SPR rounds, when RAxML-NG optimizes the three branch lengths adjacent to the pruning node. In Study 4, we distinguish between the following likelihood epsilons:
• ϵ_auto: RAxML-NG uses this threshold during the autodetect SPR rounds. The SPR radius is increased and RAxML-NG continues running autodetect SPR rounds if the LnL improved by at least ϵ_auto.
• ϵ_fast: RAxML-NG uses this threshold during the fast SPR rounds. RAxML-NG continues performing fast SPR rounds as long as the LnL score improves by at least ϵ_fast.
• ϵ_slow: RAxML-NG uses this threshold during the slow SPR rounds. RAxML-NG continues performing slow SPR rounds as long as the LnL score improves by at least ϵ_slow.
• ϵ_brlen_full: RAxML-NG uses this threshold when it optimizes all branch lengths.
• ϵ_brlen: RAxML-NG uses this threshold during the slow SPR moves when it optimizes the branch lengths of the three adjacent branches, to decide when to stop iterating over these three branches for adjusting their values.
The standard RAxML-NG implementation summarizes the thresholds ϵ_auto, ϵ_fast, ϵ_slow, and ϵ_brlen_full as the single threshold ϵ_LnL. Since the threshold ϵ_brlen is only used at one step in the tree inference process, this threshold remains unchanged.
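The role of a likelihood epsilon in such an optimization loop can be illustrated with a minimal Python sketch. It is not RAxML-NG code: the function name `optimize_until_converged`, the `step` callback, and the toy LnL trajectory are assumptions made purely for illustration. The sketch only shows the general pattern of accepting improving steps and stopping once the improvement falls below the threshold.

```python
def optimize_until_converged(step, lnl_start, epsilon, max_rounds=1000):
    """Greedy loop: repeat `step` until the LnL gain drops below `epsilon`.

    `step` is a hypothetical callback that maps the current LnL score to the
    score after one more optimization round (for example, one SPR round).
    Returns the final LnL score and the number of rounds performed.
    """
    lnl = lnl_start
    rounds = 0
    while rounds < max_rounds:
        new_lnl = step(lnl)
        rounds += 1
        improvement = new_lnl - lnl
        if improvement > 0:
            lnl = new_lnl  # greedy: only accept steps that improve the LnL
        if improvement < epsilon:
            break  # converged with respect to the chosen epsilon
    return lnl, rounds


def toy_step(lnl):
    # Toy trajectory (assumption): every round closes half of the remaining
    # gap to an optimum at -1000 log-likelihood units.
    return lnl + 0.5 * (-1000.0 - lnl)


strict_lnl, strict_rounds = optimize_until_converged(toy_step, -2000.0, epsilon=0.1)
loose_lnl, loose_rounds = optimize_until_converged(toy_step, -2000.0, epsilon=10.0)
```

In this toy setting, the looser threshold stops after 7 rounds instead of 14 and ends about 7.75 log-likelihood units below the stricter run. Whether such a trade-off is acceptable on real data is exactly what the studies in this supplement examine.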
IQ-TREE
IQ-TREE [13] is a widely used software package for phylogenetics that was first released in 2014. Similar to RAxML-NG, IQ-TREE in principle also implements a greedy hill-climbing ML tree inference. IQ-TREE's topology optimization strategy consists of repeated Nearest Neighbor Interchange (NNI) moves. Since NNI moves explore the tree space less exhaustively than SPR moves [12], IQ-TREE also implements random moves to conduct backward steps with respect to the LnL score. This allows IQ-TREE to explore the tree space beyond the NNI neighborhood and navigate out of local maxima. These random NNI moves are applied to a locally optimal tree T_best with LnL score LnL_best. A random NNI move perturbs the tree by applying a randomly selected NNI move. After this random NNI move, IQ-TREE performs repeated standard NNI moves until the tree is NNI optimal, meaning that any additional NNI move would decrease the LnL score again. If this new tree T* has a higher LnL score than LnL_best, IQ-TREE replaces T_best with T* and repeats the search cycle. If the LnL score of T* is lower than LnL_best, IQ-TREE discards T* and repeats the search cycle using the current T_best. Since the perturbing NNI move is selected at random, the repeated cycle will, with high probability, select a different random NNI move. During the repeated standard NNI moves, IQ-TREE additionally optimizes the branch lengths until the LnL score increases by less than ϵ_LnL. This entire cycle of random NNI moves and standard NNI moves is repeated until IQ-TREE does not find a better tree than T_best during 100 cycles. Before exiting, IQ-TREE implements a final optimization of substitution model parameters and checks for convergence of this last optimization step using the ϵ_LnL threshold. Similar to RAxML-NG, IQ-TREE also optimizes branch lengths and model parameters using the Newton-Raphson, L-BFGS-B, and Brent methods. One can constrain the branch length value range via the minBranchLen and maxBranchLen thresholds.
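The search cycle described above (perturb the best tree at random, hill-climb to a local optimum, keep the result only if it is better, stop after 100 unsuccessful cycles) can be sketched on a toy problem. The following Python sketch does not operate on trees and is not IQ-TREE code: the one-dimensional `LANDSCAPE`, the functions `score`, `neighbors`, and `random_move`, and all other names are hypothetical stand-ins for the LnL score, standard NNI moves, and the random NNI perturbation.

```python
import random


def hill_climb(state, score, neighbors):
    """Apply improving local moves until no neighbor scores better."""
    while True:
        best_neighbor = max(neighbors(state), key=score)
        if score(best_neighbor) <= score(state):
            return state  # local optimum reached
        state = best_neighbor


def perturb_and_climb(start, score, neighbors, random_move,
                      max_unsuccessful=100, seed=0):
    """Repeat random perturbation plus hill climbing until no better state
    is found during `max_unsuccessful` consecutive cycles."""
    rng = random.Random(seed)
    best = hill_climb(start, score, neighbors)
    unsuccessful = 0
    while unsuccessful < max_unsuccessful:
        candidate = hill_climb(random_move(best, rng), score, neighbors)
        if score(candidate) > score(best):
            best = candidate  # accept the better local optimum
            unsuccessful = 0
        else:
            unsuccessful += 1  # discard the candidate, keep the current best
    return best


# Toy landscape (assumption): local optimum at index 3, global optimum at 6.
LANDSCAPE = [0.0, 1.0, 2.0, 3.0, 2.0, 2.5, 4.0, 1.0]


def score(i):
    return LANDSCAPE[i]


def neighbors(i):
    # Local moves: one step to the left or right.
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]


def random_move(i, rng):
    # Perturbation: a random jump of two steps.
    return rng.choice([j for j in (i - 2, i + 2) if 0 <= j < len(LANDSCAPE)])


local_optimum = hill_climb(0, score, neighbors)
final = perturb_and_climb(0, score, neighbors, random_move)
```

Plain hill climbing from index 0 gets stuck at the local optimum (index 3), whereas the perturbation cycle can escape it and reach index 6. This mirrors, in a highly simplified form, why random moves help to navigate out of local maxima.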
FastTree
FastTree [14] aims to reduce the time and space complexity of ML phylogenetic inference. The authors achieve this through various heuristics and shortcuts compared to other, more standard, ML inference tools. In contrast to RAxML-NG and IQ-TREE, FastTree initially conducts an inference step based on the minimum evolution criterion [28] before maximizing the tree's likelihood. The minimum evolution approach attempts to obtain a tree that explains the data with as few mutations as possible, therefore minimizing the branch lengths, with the branch lengths being at least minBranchLen. The minimum evolution criterion search steps include NNI and SPR moves. Instead of iterating until convergence, FastTree runs a predefined number of minimum evolution search rounds. For the subsequent ML optimization, FastTree uses NNI steps to improve the tree topology with respect to its LnL score. FastTree executes at most 2

Software and Command Lines
In our analyses we use RAxML-NG, IQ-TREE, and FastTree to infer phylogenetic trees. We analyze the influence of the numerical thresholds for each tool separately and do not compare results across different ML inference tools. With RAxML-NG, we re-evaluate the inferred trees using its own tree evaluation mode. Analogously, we re-evaluate trees inferred with IQ-TREE using the IQ-TREE tree evaluation mode. Since FastTree does not provide a tree evaluation mode, we do not re-evaluate trees inferred using FastTree. For all three ML inference tools, we use the significance tests implemented in IQ-TREE to determine the plausible tree sets. We infer bootstrap replicates and draw support values using RAxML-NG. We perform the bipartition frequency correlation analyses using RAxML. Table S1 states the software versions we use, and Table S2 shows the command lines used for the respective task and tree inference tool. IQ-TREE implements the following significance tests: the Kishino-Hasegawa test [8] and the Shimodaira-Hasegawa test [17], both in their weighted and unweighted variants, the Approximately Unbiased test [18], as well as the Expected Likelihood Weight test [24]. We use the default IQ-TREE settings for the number of resampling of estimated log-likelihoods (RELL) replicates (10 000) and the significance level (α = 0.05). Since the significance tests can be biased by the number of trees in the candidate set [24], we remove topologically identical trees before applying the tests.
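One way to remove topologically identical trees is to compare the sets of non-trivial bipartitions (splits) that the trees induce. The following Python sketch illustrates this idea under simplifying assumptions: trees are given as nested tuples of taxon names rather than Newick files, all trees share the same taxon set, and the function names `splits` and `unique_topologies` are hypothetical. It is not the code used in our pipeline.

```python
def splits(tree):
    """Return the set of non-trivial bipartitions of an unrooted tree.

    `tree` is a nested tuple of taxon names, e.g. (("A", "B"), ("C", "D")).
    Each bipartition is stored as the side that does NOT contain a fixed
    reference taxon, so the result is independent of rooting and child order.
    """
    def leaf_set(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaf_set(child) for child in node))

    taxa = leaf_set(tree)
    reference = min(taxa)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(walk(child) for child in node))
        side = taxa - clade if reference in clade else clade
        if 1 < len(side) < len(taxa) - 1:  # skip trivial splits
            found.add(side)
        return clade

    walk(tree)
    return frozenset(found)


def unique_topologies(trees):
    """Keep the first tree of every distinct topology, preserving order."""
    seen = set()
    unique = []
    for tree in trees:
        key = splits(tree)
        if key not in seen:
            seen.add(key)
            unique.append(tree)
    return unique


t1 = (("A", "B"), ("C", "D"))
t2 = (("C", "D"), ("B", "A"))   # same topology, children reordered
t3 = (("A", "C"), ("B", "D"))   # different topology
t4 = ("A", ("B", ("C", "D")))   # same unrooted topology as t1, rooted elsewhere
candidates = unique_topologies([t1, t2, t3, t4])
```

Here `candidates` contains only `t1` and `t3`, since `t2` and `t4` induce the same split as `t1`. For real analyses, an established phylogenetics library that computes topological distances would be the more robust choice.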

Pipeline Setup
We implement the analysis pipeline as described in the main paper, using the Snakemake workflow management system [9] and Python 3. The pipeline code is available at https://github.com/tschuelia/ml-numerical-analysis. We execute this analysis pipeline on two institutional clusters (Cascade and Haswell) at the Heidelberg Institute for Theoretical Studies (HITS) and three stand-alone servers of our research group:
• Cascade: 150 compute nodes with Intel Cascade Lake CPUs (Intel Xeon Gold 6230). Each node has 20 cores with 2.1 GHz and 96 GB RAM.
For Studies 1, 2, and 4, we use a single core and two threads for all MSAs. Since the bootstrapping procedure requires an extensive amount of runtime, we use the RAxML-NG multiprocessing option and set the number of nodes and threads according to Table S3.

Datasets
As mentioned in the main paper, in Study 1 we first analyze 22 empirical unpartitioned DNA MSAs (Data collection 1). To verify the results, we analyze additional MSAs, including amino-acid (AA) and partitioned MSAs (Study 2, Data collection 2); see Table S4.

Results Study 1: Influence of Numerical Thresholds on LnL scores and Runtimes
In this section, we present the results of Study 1 on Data collection 1. This section is separated into two subsections:
1. Tree inference phase: In the first subsection, we focus on the influence of varying numerical thresholds on tree inference. Since we presented the results for ϵ_LnL and ϵ_brlen in RAxML-NG and IQ-TREE in the main paper, we omit a detailed discussion of these results here. Instead, we show that the default settings for the remaining numerical thresholds, as well as the default ϵ_LnL setting in FastTree, are appropriate.
2. Tree evaluation: In the second subsection, we focus on the influence of varying numerical thresholds on tree evaluation. We show that the LnL score and runtime remain largely unaffected by the numerical threshold settings, as long as they are within a reasonable value range. Both RAxML-NG's and IQ-TREE's default settings fall within this range.
In analogy to the main paper, we compare LnL scores in percent rather than absolute log likelihood units, since we compare LnL scores within a broad absolute likelihood value range.The LnL scores for the 22 empirical MSAs of Data collection 1 range between approximately −6400 (D354) and −12 300 000 (D4869).All presented figures summarize the results over all MSAs of Data collection 1.
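As a small illustration of why relative differences are easier to compare across MSAs, the following Python sketch computes the LnL degradation in percent relative to the best-known tree. The function name `lnl_degradation_percent` and the exact normalization (dividing by the absolute LnL of the best-known tree) are assumptions for illustration; the example values are made up apart from the approximate LnL magnitudes quoted above.

```python
def lnl_degradation_percent(lnl, best_lnl):
    """Percent by which `lnl` is worse than `best_lnl`.

    LnL scores are negative, so the difference is normalized by the
    absolute value of the best-known score. Higher values are worse.
    """
    return 100.0 * (best_lnl - lnl) / abs(best_lnl)


# A loss of 6.4 log-likelihood units on a small MSA (LnL around -6400) ...
small = lnl_degradation_percent(-6406.4, -6400.0)
# ... is the same relative degradation as 12 300 units on a very large MSA.
large = lnl_degradation_percent(-12_312_300.0, -12_300_000.0)
```

Both calls return approximately 0.1 %, although the absolute differences are more than three orders of magnitude apart.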

Tree Inference
In this section, we present our results for varying the numerical threshold settings for the tree inference process. We analyze the influence on the LnL scores and runtimes of all presented numerical thresholds.
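For reference, a speedup relative to the average runtime under the default setting, summarized as mean ± standard deviation, could be computed as in the following Python sketch. The dictionary layout, the MSA names, and the runtimes are made up for illustration, and the use of the sample standard deviation is an assumption; the sketch is not taken from our pipeline.

```python
from statistics import mean, stdev


def speedups(default_runtimes, new_runtimes):
    """Speedup of every run relative to the mean default runtime of its MSA.

    Both arguments map an MSA name to a list of runtimes in seconds.
    A value above 1 means the new setting is faster than the default.
    """
    result = []
    for msa, runtimes in new_runtimes.items():
        baseline = mean(default_runtimes[msa])
        result.extend(baseline / runtime for runtime in runtimes)
    return result


# Made-up runtimes in seconds.
default_runs = {"msa_a": [100.0, 110.0, 90.0], "msa_b": [50.0, 50.0]}
new_runs = {"msa_a": [50.0], "msa_b": [25.0, 50.0]}

values = speedups(default_runs, new_runs)
summary = (mean(values), stdev(values))  # reported as mean ± standard deviation
```

With these toy numbers the individual speedups are 2.0, 2.0, and 1.0, giving a summary of roughly 1.7 ± 0.6.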

Likelihood epsilon ϵ LnL
RAxML-NG and IQ-TREE
We discuss the results of RAxML-NG and IQ-TREE for this threshold in greater detail in the main paper, based on the results of the broader set of MSAs (Data collection 2). For the sake of completeness, we show the results of our analysis for this threshold in both ML tools for Data collection 1: Figures S1g, S1h, S2g and S2h.
FastTree
With FastTree, we observe worse LnL scores for ϵ_LnL thresholds > 1. For most MSAs, the trees inferred under threshold settings of 10^2 and 10^3 are significantly worse than the best-known tree according to the statistical tests. This could be due to the numerous heuristic and numerical shortcuts in FastTree's implementation (see Section 1.1) and the lack of a tree evaluation option. The runtimes for ϵ_LnL settings ≤ 1 exhibit no substantial speedup. FastTree's default ϵ_LnL setting of 10^-1 therefore represents an adequate choice.

Likelihood epsilon ϵ brlen
In analogy to the ϵ_LnL threshold, we discuss the influence of the ϵ_brlen threshold in detail in the main paper based on our analyses on Data collection 2. For the sake of completeness, we also show the results of our analysis for this threshold for the MSAs of Data collection 1: Figures S1i and S1j.

MinBranchLen
We observe that minBranchLen settings ≥ 10^-3 yield worse LnL scores for all three ML inference tools. Due to the lack of an evaluation phase in FastTree, the degradation of LnL scores is an order of magnitude more pronounced than for RAxML-NG and IQ-TREE. Depending on the tool, the runtimes follow distinct trends. We observe that the default settings for all three inference tools are well-chosen.

IQ-TREE
In general, minBranchLen settings ≥ 10^-3 result in worse LnL scores (Figure S2a). The LnL scores under the highest analyzed setting minBranchLen = 10^-2 are on average 0.1 % worse. In analogy to RAxML-NG, for some MSAs, we obtain slightly worse LnL scores for the minBranchLen setting 10^-10 (≤ 0.11 %). We observe a high impact of the minBranchLen setting on the runtimes of IQ-TREE tree inferences (Figure S2b). The runtimes of IQ-TREE increase with smaller minBranchLen settings. For minBranchLen = 10^-10, tree inferences run on average twice as long as for minBranchLen = 10^-3. Interestingly, the runtime also increases if minBranchLen is set to 10^-2. These tree inferences run on average 52 ± 6 % slower than tree inferences with minBranchLen = 10^-3. Taking into account these observations, the IQ-TREE default setting of 10^-6 for the minBranchLen threshold appears to represent a 'good' trade-off between runtime and LnL scores.
FastTree
Similar to the other ML inference tools, the LnL scores for FastTree worsen with higher minBranchLen settings. With FastTree, the degradation is an order of magnitude worse than for IQ-TREE and RAxML-NG. The trees for minBranchLen = 10^-2 are on average 1.7 % less likely than the respective best-known tree (Figure S3a). This is most likely due to the lack of an evaluation mode that improves the LnL scores under smaller minBranchLen settings. We observe the highest decline of LnL scores in our analysis for FastTree with minBranchLen set to 10^-2 on the D4869 MSA. The LnL score under this minBranchLen setting decreases by 525 %. For minBranchLen settings ≤ 10^-5 the LnL scores are approximately equal (variances ≪ 0.1 %). The runtime fluctuates depending on the MSA and the minBranchLen setting. In general, there is a trend for smaller settings to induce longer runtimes (Figure S3b). Given these observations, the default minBranchLen setting of 5 · 10^-9 appears to represent a reasonable choice.

Remaining Thresholds
For the maxBranchLen threshold, we observe no influence on the LnL scores of either RAxML-NG or IQ-TREE, but we do observe an influence on runtimes. For both RAxML-NG and IQ-TREE, the maxBranchLen setting that yields the fastest average runtime depends on the MSA (≤ 16 % differences; Figures S1c, S1d, S2c and S2d).
For the threshold ϵ_model, we observe a minor variance in LnL scores for IQ-TREE and RAxML-NG (≤ 0.1 %; Figures S1e and S2e). For RAxML-NG, the fastest ϵ_model setting depends on the MSA, with no clear trend (Figure S1f). For IQ-TREE, we observe faster tree inferences under higher ϵ_model thresholds (Figure S2f). We make a similar observation regarding runtimes for RAxML-NG's threshold num iters. The LnL score is not affected by the num iters setting (Figures S1m and S1n). With the threshold bfgs factor of RAxML-NG, we notice no influence on the LnL scores, but a trend towards faster runs for higher settings (Figures S1k and S1l).
For all these thresholds, despite their minor variations in runtimes and LnL scores, we conclude that the respective default settings in both RAxML-NG and IQ-TREE are appropriate.

Tree Evaluation
In this section, we present our analysis results for varying the numerical threshold settings during tree evaluation. We analyze the influence on the LnL scores and runtimes for all considered numerical thresholds, except for the likelihood epsilon ϵ_brlen. RAxML-NG uses this specific threshold during the SPR rounds (see Section 1.1). Since the tree topology remains unaltered during tree evaluation, this threshold is not used.

MinBranchLen
RAxML-NG
For RAxML-NG we observe a correlation between the LnL scores and the minBranchLen setting. We observe worse LnL scores with higher minBranchLen settings (see Figure S4a). The LnL score degradation ranges between 0.1 % and 2.0 % with a mean of 1.55 % and two outliers (5.1 % worse for D354 and 12.5 % worse for D4869). For settings ≤ 10^-5 we observe, except for D4869, equally good LnL scores. The minBranchLen threshold during the evaluation phase should therefore be set to ≤ 10^-5. As Figure S4b shows, the runtimes for minBranchLen settings ≤ 10^-5 are approximately identical.

Remaining Thresholds
The thresholds ϵ_model, num iters, and bfgs factor have no impact on the LnL score (Figures S4e, S4i and S4k). However, the runtimes for ϵ_model increase with smaller ϵ_model settings (on average 10.5 % increase for RAxML-NG, and 20 % for IQ-TREE; Figures S4f and S5f). The bfgs factor threshold shows a similar effect: RAxML-NG runtimes increase on average by 49 % with lower settings (Figure S4l). The runtimes for the num iters threshold increase for more iterations (on average 3.7 %; Figure S4j). We observe no impact on either LnL scores or runtime for the maxBranchLen threshold in RAxML-NG (Figures S4c and S4d). For IQ-TREE we notice runtime variations ≤ 20 %. However, depending on the MSA, a different maxBranchLen setting yields faster execution times (Figure S5d). The LnL score remains unaffected (Figure S5c). Based on our findings, we suggest the following threshold settings for the evaluation phase: [...], 32, 64}; bfgs factor ∈ {10^5, 10^7, 10^9}.
For both tools, the respective default settings fall within our suggested value ranges. We observe that the runtime of the tree evaluation is negligible compared to the runtime of the tree inference. In our data, the average runtime of the RAxML-NG tree evaluation across all MSAs of Data collection 1 and all numerical thresholds is 5 ± 5 % of the runtime of the respective tree inference of a single tree. For IQ-TREE, the average runtime of the tree evaluation is 4 ± 3 % of the runtime of the respective tree inference. Therefore, we recommend setting the numerical thresholds to their default setting during the tree evaluation.

Results Study 2: Main Paper Speedup with Outliers
In the main paper, we removed outliers for all figures depicting a speedup for Study 2. For the sake of completeness, we provide all these figures, including all outliers, in Figure S6.

Results Study 4: New likelihood epsilon thresholds in RAxML-NG
As RAxML-NG uses the same threshold ϵ_LnL for four distinct operations during its tree inference procedure, we separate this threshold into four distinct fine-grained likelihood epsilons (see Section 1.1). The goal is to assess whether we can vary these thresholds independently and further improve upon runtime. We analyze these fine-grained threshold settings on Data collection 2. Our analyses show a similar behavior for all four thresholds. For the thresholds ϵ_auto, ϵ_fast, and ϵ_brlen_full we only observe a slight decrease in LnL scores under higher threshold settings (Figures S7a, S7c and S7g). With ϵ_slow this trend is more pronounced (Figure S7e). Our analyses show an analogous behavior for inference times. For ϵ_slow the runtimes decrease with higher settings (Figure S7f); for the other thresholds the runtimes are on average approximately equal under all threshold settings (Figures S7b, S7d and S7h). We conclude that a distinction does not further improve either runtime or LnL scores and is therefore unnecessary.

Problems with Significance Tests
As stated in the main paper, we observe problems when assessing multiple trees using statistical significance tests. Since the significance tests can be biased by the number of trees in the candidate set [24], we remove identical tree topologies from the set of inferred trees prior to applying the tests. In the following, we present an example for the rejection of trees with identical LnL scores according to the c-ELW test, as well as an example where the choice of the best tree influences the resulting significance results.

c-ELW scores for identical LnL values
For trees with identical LnL values the c-ELW scores are also identical. Therefore, for trees that have a c-ELW score that is close to exceeding the predefined threshold, only some trees with the exact same c-ELW score are accepted while the remaining ones are rejected. This leads to trees being rejected despite having the same LnL score as accepted trees. Table S5 shows an example of this behavior for 6 distinct tree topologies for dataset D354. Trees 5 and 6 have identical LnL scores and identical c-ELW scores, yet only tree 5 is accepted as a significant tree and tree 6 is rejected. Changing the order of trees 5 and 6 in the input file results in tree 6 being accepted, while tree 5 is rejected. Note that this issue is not due to an implementation error in IQ-TREE, but due to an unspecified behavior as per the design of the test. In addition to removing duplicate tree topologies, one would need to filter duplicate LnL values. This, however, raises the question of up to how many digits LnL values are considered identical. Answering it requires additional analyses, including the analysis of rounding errors and error propagation in ML inference tools, that exceed the scope of our work.
Table S5: Example of c-ELW scores for identical LnL scores. The plus sign denotes that the tree is accepted, a minus sign denotes rejection. Despite trees 5 and 6 having identical LnL scores, their tree topologies are distinct. According to the c-ELW test both trees have identical posterior weights, yet only one is accepted as plausible.
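The order dependence can be reproduced with a minimal Python sketch of a cumulative-weight confidence set: trees are ranked by decreasing weight and added until the cumulative weight reaches the confidence level. This follows the general construction of the expected likelihood weight confidence set as we understand it; it is not IQ-TREE's implementation, the function name `elw_confidence_set` is hypothetical, and the weights are made up rather than taken from Table S5.

```python
def elw_confidence_set(trees, level=0.95):
    """Return the names of the trees in the confidence set.

    `trees` is a list of (name, weight) pairs whose weights sum to 1.
    Trees are added in order of decreasing weight until the cumulative
    weight reaches `level`. Python's sort is stable, so trees with equal
    weights keep their input order.
    """
    ranked = sorted(trees, key=lambda item: item[1], reverse=True)
    accepted = []
    cumulative = 0.0
    for name, weight in ranked:
        if cumulative >= level:
            break
        accepted.append(name)
        cumulative += weight
    return accepted


# Made-up weights: tree5 and tree6 are tied.
original_order = [("tree1", 0.40), ("tree2", 0.30), ("tree3", 0.15),
                  ("tree4", 0.07), ("tree5", 0.04), ("tree6", 0.04)]
# Same trees and weights, but tree6 is listed before tree5.
swapped_order = original_order[:4] + [("tree6", 0.04), ("tree5", 0.04)]

accepted_original = elw_confidence_set(original_order)
accepted_swapped = elw_confidence_set(swapped_order)
```

With the original order, `tree5` is accepted and `tree6` is rejected; after swapping the two entries, the outcome flips, although neither the weights nor the confidence level changed. Any tie-breaking rule that depends on input order shows this behavior.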

Choice of the best tree
To save runtime, IQ-TREE does not re-estimate the substitution model parameters of each tree in the candidate tree set. Instead, IQ-TREE estimates these parameters using a fixed user-provided tree. The choice of this tree among multiple trees with identical LnL scores influences the results of the significance tests. For example, on dataset D10 using 5 distinct tree topologies we perform the IQ-TREE significance tests 5 times, each time using a different tree as the user-provided tree. We observe unexpected differences between iterations for the weighted SH-Test (wSH) and the weighted KH-Test (wKH). Table S6 shows the results of this experiment. Iter i refers to the i-th iteration with tree i used as best tree. The table shows that during each iteration all trees have identical LnL scores, yet the significance results differ. Collecting only trees passing both tests in the plausible tree set results in different plausible tree sets depending on the tree used as best tree.
Table S6: Influence of the choice of the best tree used to estimate the substitution model parameters on the p-values in IQ-TREE. The plus sign denotes that the tree is accepted, a minus sign denotes rejection.

Figure S1: Influence of the numerical thresholds on the LnL scores and runtimes of the RAxML-NG tree inference. For the figures in the left column, the y-axis shows the degradation in percent relative to the LnL score of the best-known tree. Higher percentages indicate worse LnL scores. For the figures in the right column, the y-axis shows the speedup relative to the average runtime under the default setting. Each figure summarizes the data over all MSAs of Data collection 1. The dashed vertical line indicates the mean, and the solid vertical line the median value. The highlighted box indicates the default setting for the respective numerical threshold in RAxML-NG.

Figure S2: Influence of the numerical thresholds on the LnL scores and runtimes of the IQ-TREE tree inference. For the figures in the left column, the y-axis shows the degradation in percent relative to the LnL score of the best-known tree. Higher percentages indicate worse LnL scores. For the figures in the right column, the y-axis shows the speedup relative to the average runtime under the default setting. Each figure summarizes the data over all MSAs of Data collection 1. The dashed vertical line indicates the mean, and the solid vertical line the median value. The highlighted box indicates the default setting for the respective numerical threshold in IQ-TREE.
Panel title: Influence of the ϵ_LnL setting on the runtime of IQ-TREE.

Figure S3: Influence of the numerical thresholds on the LnL scores and runtimes of the FastTree tree inference. For the figures in the left column, the y-axis shows the degradation in percent relative to the LnL score of the best-known tree. Higher percentages indicate worse LnL scores. For the figures in the right column, the y-axis shows the speedup relative to the average runtime under the default setting. Each figure summarizes the data over all MSAs of Data collection 1. The dashed vertical line indicates the mean, and the solid vertical line the median value. The highlighted box indicates the default setting for the respective numerical threshold in FastTree.

Figure S4: Influence of the numerical thresholds on the LnL scores and runtimes of the RAxML-NG tree evaluation. For the figures in the left column, the y-axis shows the degradation in percent relative to the LnL score of the best-known tree. Higher percentages indicate worse LnL scores. For the figures in the right column, the y-axis shows the speedup relative to the average runtime under the default setting. Each figure summarizes the data over all MSAs of Data collection 1. The dashed vertical line indicates the mean, and the solid vertical line the median value. The highlighted box indicates the default setting for the respective numerical threshold in RAxML-NG.
Panel titles: Influence of the minBranchLen, maxBranchLen, ϵ_model, ϵ_LnL, num iters, and bfgs factor settings on the LnL scores and on the runtime of RAxML-NG.

Figure S5: Influence of the numerical thresholds on the LnL scores and runtimes of the IQ-TREE tree evaluation. For the figures in the left column, the y-axis shows the degradation in percent relative to the LnL score of the best-known tree. Higher percentages indicate worse LnL scores. For the figures in the right column, the y-axis shows the speedup relative to the average runtime under the default setting. Each figure summarizes the data over all MSAs of Data collection 1. The dashed vertical line indicates the mean, and the solid vertical line the median value. The highlighted box indicates the default setting for the respective numerical threshold in IQ-TREE.

Figure S6: These figures show the same speedup data as presented in the main paper, but include all outliers. The y-axis shows the speedup relative to the default setting of the respective ML inference tool and threshold. The dashed vertical line indicates the mean, and the solid vertical line the median value.
Panel titles: Influence of the ϵ_LnL setting on the runtime of the RAxML-NG tree inference (corresponds to Figure 2 in the main paper). Influence of the ϵ_brlen setting on the runtime of the RAxML-NG tree inference (corresponds to Figure 4 in the main paper). Influence of simultaneously changing both likelihood epsilon settings on the runtime of the RAxML-NG tree inference (corresponds to Figure 6 in the main paper). Influence of the ϵ_LnL setting on the runtime of the IQ-TREE tree inference (corresponds to Figure 8 in the main paper).

Figure S7: Influence of the separated likelihood epsilons on the LnL scores and runtimes of the RAxML-NG tree inference. For the figures in the left column, the y-axis shows the degradation in percent relative to the LnL score of the best-known tree. Higher percentages indicate worse LnL scores. For the figures in the right column, the y-axis shows the speedup relative to the average runtime under the default setting. Each figure summarizes the data over all MSAs of Data collection 2. The dashed vertical line indicates the mean, and the solid vertical line the median value. The highlighted box indicates the default setting for the respective numerical threshold in RAxML-NG.

• N NNI rounds, where N is the number of taxa in the MSA. During these NNI rounds, FastTree compares the current LnL score of the tree to the previous LnL score and terminates the NNI rounds if it increases by less than ϵ_LnL.

Table S1: The software versions of RAxML-NG, IQ-TREE, FastTree, and RAxML.

Table S3: Multiprocessing scheme for the RAxML-NG bootstrapping procedure.

Table S4 provides an overview of the MSAs we use. A 1 indicates that we used this MSA for Study 1, a 2 indicates that we used this MSA for Studies 2-4. Because of the dataset size of the MSA marked by a 2*, we do not analyze all threshold settings during Studies 2 and 4, but only the suggested new settings to verify our results on a large MSA. For the unpartitioned DNA MSAs, we use the general time reversible (GTR) model [25] of nucleotide substitution with four discrete Γ rate categories to model among-site rate heterogeneity. For AA MSAs, we use the LG substitution model [11]. For partitioned MSAs, we use the partition files as provided in the respective publication. All MSAs and partition files are available for download at https://cme.h-its.org/exelixis/material/freeLunch_data.tar.gz.

Table S4: Overview of the MSAs we use for our analyses. All MSAs are empirical MSAs.