KwARG: parsimonious reconstruction of ancestral recombination graphs with recurrent mutation

Abstract Motivation The reconstruction of possible histories given a sample of genetic data in the presence of recombination and recurrent mutation is a challenging problem, but can provide key insights into the evolution of a population. We present KwARG, which implements a parsimony-based greedy heuristic algorithm for finding plausible genealogical histories (ancestral recombination graphs) that are minimal or near-minimal in the number of posited recombination and mutation events. Results Given an input dataset of aligned sequences, KwARG outputs a list of possible candidate solutions, each comprising a list of mutation and recombination events that could have generated the dataset; the relative proportion of recombinations and recurrent mutations in a solution can be controlled via specifying a set of ‘cost’ parameters. We demonstrate that the algorithm performs well when compared against existing methods. Availability and implementation The software is available at https://github.com/a-ignatieva/kwarg. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
For many species, the evolution of genetic variation within a population is driven by the processes of mutation and recombination in addition to genetic drift. A typical mutation affects the genome at a single position, and may or may not spread through subsequent generations by inheritance. Recombination, on the other hand, occurs when a new haplotype is created as a mixture of genetic material from two different sources, which can drive evolution at a much faster rate. The detection of recombination is an important problem which can provide crucial scientific insights, for instance in understanding the potential for rapid changes in pathogenic properties within viral populations (Simon-Loriere and Holmes, 2011).
Consider a population evolving through the replication, mutation, and recombination of genetic material within individuals, emerging from a common origin and living through multiple generations until the present day. In general, the history of shared ancestry, mutation, and recombination events are not observed, and must be inferred from a sample of genetic data obtained from the present-day population. Crossover recombination can occur anywhere along a sequence, and the breakpoint position is also unobserved. This article focuses on methods for reconstructing possible histories of such a sample, in the form of ancestral recombination graphs (ARGs) -networks of evolution connecting the sampled individuals to shared ancestors in the past through coalescence, mutation, and crossover recombination events; an example is illustrated in Figure 1. This is a very important but challenging problem, as many possible histories might have generated a given sample. Moreover, recombination can be undetectable unless mutations appear on specific branches of the genealogy (Hein et al., 2004, Section 5.11), and recombination events can produce patterns in the data that are indistinguishable from the effects of recurrent mutation (McVean et al., 2002); that is, two or more mutation events in a genealogical history that affect the same locus.
Parsimony is an approach focused on finding possible histories which minimise the number of recombinations and recurrent mutations. This does not necessarily describe the most biologically plausible version of events, but produces a useful lower bound on the complexity of the evolutionary The construction of a history for the dataset given in Figure 1 is illustrated in Figure 2. The first step corresponds to the construction of a neighbourhood, two of the states N 1 1 , N 2 1 ∈ N1 are pictured. Then, the 'Clean' algorithm is applied to each state in the neighbourhood (illustrated as a series of steps following blue arrows). From the resulting reduced neighbourhood {N 1 1 , N 2 1 , . . .}, the state N 2 1 is selected; the other illustrated path is abandoned. This process is repeated until all incompatibilities are resolved and the empty state is reached. Following the path of selected moves in this figure left-to-right corresponds to the events encountered when traversing the leftmost ARG in Figure 1 from the bottom up. If instead the state N 1 2 were selected at the second step of the algorithm, the resulting path would correspond to the ARG in the centre of Figure 1.

Score
When considering which next step to take, more informed choices can be made by considering not just the cost of the step, but also the complexity of the configuration it leads to. This is the principle behind the A* algorithm (Hart et al., 1968), using a heuristic estimate of remaining distance to guide the choice of the next node to expand. KwARG applies the same principle in a greedy fashion, following a path of locally optimal choices in an attempt to find a minimal history.
The score implemented in KwARG is Here, C N i t , Dt denotes the cost of the corresponding event, defined in Section 2.2.3; maxAM(Nt) denotes the maximum amount of ancestral material seen in any of the states in Nt, and AM(N i t ) gives the amount of ancestral material in state N i t . Incorporating a measure of the amount of ancestral material in a state helps to break ties by assigning a smaller score to simpler configurations.
The method of computing the lower bound L depends on the complexity of the dataset, with a trade-off between accuracy and computational cost. For relatively small datasets, it is feasible to compute Rmin exactly using Beagle. HB refers to the haplotype bound, employing the improvements afforded by first calculating local bounds for incompatible intervals, and applying a composition method to obtain a global bound (Myers and Griffiths, 2003). HK refers to the Hudson-Kaplan bound (Hudson and Kaplan, 1985); this is quick but less accurate, so is reserved for larger, more complex configurations. Note that these bounds are computed under the infinite sites assumption.
The particular form and components of the score were chosen through simulation testing; we found that the given formula provides a good level of informativeness regarding the quality of a possible state.

Event cost
Each type of event is assigned a cost, which gives a relative measure of preference for each event type in the reconstructed history: • C R : the cost of a single recombination event, defaults to 1. • C RR : the cost of performing two successive recombinations, defaults to 2. It is sufficient to consider at most two consecutive recombination events before a coalescence (Lyngsø et al., 2005); this type of event also captures the effects of gene conversion.
• C RM : the cost of a recurrent mutation. If N i t is formed from Dt by a recurrent mutation in a column representing k agreeing sites, this corresponds to proposing k recurrent mutation events, so the cost is C(N i t , Dt) = k · C RM . • C SE : this event is a recurrent mutation which affects only one sequence in the original dataset, i.e. it occurs on the terminal branches of the ARG. Thus, the event can be either a regular recurrent mutation, or an artefact due to sequencing errors. The cost can be set to equal C RM , or lower if the presence of sequencing errors is considered likely.
KwARG allows the specification of a range of event costs as tuning parameters, as well as the number Q of independent runs of the algorithm to perform for each cost configuration. The proportions of recombinations to recurrent mutations in the solutions produced by KwARG can be controlled by varying the ratio of costs for the corresponding event types.

Selection probability
The method of selecting the next state from a neighbourhood of candidates will impact on the efficiency and performance of the algorithm. At one extreme, selecting at random amongst the states will mean that the solution space is explored more fully, but will be prohibitively inefficient in terms of the number of runs needed to find a near-optimal solution. On the other hand, always greedily selecting the move with the minimal score will quickly identify a small set of solutions for each cost configuration, at the expense of placing our faith in the ability of the score to assess the quality of the candidate states accurately.
We propose a selection method that is intermediate between these two extremes, randomising the selection but focusing on moves with nearminimal scores. A pseudo-score for state N i tsinfer, RENT+, and ARGweaver. Finally, we compared how well KwARG performs against the parsimony-based heuristic methods SHRUB  and SHRUB-GC (Song et al., 2006); these results are presented in Supplementary Section S4. We also investigated the dependence of the run time of KwARG on the number and length of sequences, through simulation studies.

Finite sites
3.1.1 Comparison to PAUP* Disallowing recombination, the quality of computed upper bounds on Pmin was tested by comparison with PAUP* (Swofford, 2001, version 4.0a168), which was used to compute the exact minimum parsimony score via branch-and-bound on 994 datasets simulated as described in Supplementary Section S3.1.
KwARG failed to find Pmin in 11 (1.1%) cases out of 994. The results are illustrated in the top panel of Figure 3. Where KwARG failed to find an optimal solution, in all 11 cases it was off by just one recurrent mutation. Figure 3 also demonstrates that a substantial proportion of recurrent mutations do not create incompatibilities in the data, and the number of actual events often far exceeds Pmin.

Comparison to Beagle
Under the infinite sites assumption (disallowing recurrent mutation), the accuracy of KwARG's upper bound on Rmin was tested by comparison with Beagle (Lyngsø et al., 2005), on 1 037 datasets simulated as described in Supplementary Section S3.2.
Using the default annealing parameter T = 30, KwARG found Rmin in all cases. In 97% of the runs, this took under 5 seconds of CPU time (on a 2.7GHz Intel Core i7 processor); all but one run took less than 40 seconds. In 93% of the runs, 1 iteration was sufficient to find an optimal solution; in 99% of the runs, 5 iterations were sufficient. Beagle found the exact solution in 5 seconds or less in 86% of cases; for datasets with a small Rmin Beagle runs relatively quickly (median run time for Rmin = 5 was 1 second, compared to KwARG's 0.3 seconds). For more complex datasets, KwARG finds an optimal solution much faster; for Rmin = 9, the median run time of Beagle was 56 seconds, compared to KwARG's 3 seconds.
Setting T = 10 and T = ∞ resulted in 5 and 22 failures to find an optimal solution, respectively, when KwARG was run for Q = 1 000 iterations per dataset (or terminated after 10 minutes have elapsed), demonstrating that setting the annealing parameters too low or too high results in deterioration of performance.
The bottom panel of Figure 3 illustrates the results, and shows the relationship between the true simulated number of recombinations and Rmin. This demonstrates that in many cases, substantially more recombinations have occurred than can be confidently detected from the data.

Comparison to tsinfer, RENT+, and ARGweaver.
We tested the performance of KwARG in recovering the topology of simulated local trees for a range of recombination and mutation rates (under the infinite sites assumption). For each combination of rates, we simulated 100 datasets using msprime; details of the simulation parameters and settings used in running each program are given in Supplementary Section S5. From the output of each method, we calculated the Kendall-Colijn metric (Kendall and Colijn, 2016) between the inferred and true tree topologies at each variant site position, calculating the mean across all variant sites and averaging over the 100 datasets. We note that ARGs contain more information than local trees, but there is no obvious way of comparing ARG topologies (and tsinfer only infers local trees, rather than full ARGs).
The results are shown in the top panel of Figure 4 and Supplementary Figure S4. All methods show very comparable performance across the range of considered scenarios, with KwARG slightly outperforming the other methods, based on the chosen metric, when the recombination rate is relatively low and the mutation rate relatively high. We have performed the same analysis using the Robinson-Foulds metric (Robinson and Foulds, 1981), and found this to give very similar results.

Run time analysis
A comparison of the run times of KwARG against tsinfer, RENT+, and ARGweaver is presented in the bottom panel of Figure 4 and Supplementary Figure S5. KwARG demonstrates good efficiency when the recombination and mutation rates are relatively low, and shows roughly linear growth in run time as the mutation rate increases.
The dependence of the run time of KwARG on the number and length of sequences was further investigated through simulations; the results are presented in Supplementary Section S6. Keeping the sequence length fixed 10 20 30 10 −7 10 −6 10 −5 Mutation rate across 100 simulated datasets for each value of mutation rate µ (per generation per site) with recombination rate ρ = 4 · 10 −7 (per generation per site); error bars show mean ± standard error. Lower K-C distance indicates better accuracy. Bottom panel: points show mean run time averaged over 100 datasets for each combination of rate parameters; error bars show mean ± standard error. ARGweaver results not shown past µ = 3.2 · 10 −6 due to prohibitively long run time.
showed that KwARG runs very quickly when the number of sequences is very low, and shows roughly exponential growth in run time when the number of sequences is 6 or more. Keeping the number of sequences fixed shows that, after an initial exponential increase (due to small datasets taking very little time per iteration), the run time scales roughly linearly in sequence length.

Application to Kreitman data
The performance of KwARG is illustrated on the classic dataset of Kreitman (1983 , Table 1); this is not close to the performance limit of KwARG, but has been widely used for benchmarking algorithms used for ARG reconstruction. The dataset consists of 11 sequences and 2 721 sites, of which 43 are polymorphic, of the alcohol dehydrogenase locus of Drosophila melanogaster. The data is shown in Figure 5, with columns containing singleton mutations removed for ease of viewing. Applying the 'Clean' algorithm, as described in Section 2.2.1, reduces this to matrix of 9 rows and 16 columns. KwARG was run with the default parameters, Q = 500 times for each of 13 default cost configurations given in Supplementary Section S2. An example of the output is shown in Table 1.
KwARG correctly identified the Rmin of 7 and the Pmin of 10 (confirmed by running Beagle and PAUP*, respectively). The 6 500 iterations of KwARG took just under 9 minutes to run. Of these, 1,829 (28%) resulted in optimal solutions; some are shown in Table 1. KwARG identified multiple combinations of recombinations and recurrent mutations that could have generated this dataset. By default, slightly cheaper costs are assigned to recurrent mutations if they happen on terminal branches, so the results show a bias towards solutions with more SE events for each given number of recombinations.

Seed
T  The ten recurrent mutations appearing in the solution in row 8 of Table  1 are highlighted on the dataset in Figure 5. It is striking that 7 of these 10 recurrent mutations affect the same sequence Fl-2S. In fact, these 7 recurrent mutations could be replaced by 3 recombination events affecting sequence Fl-2S, with breakpoints just after sites 3, 16, and 35; leaving the other identified recurrent mutations unchanged yields the solution in row 5 of Table 1. These findings suggest that the sequence may have been affected by cross-contamination or other errors during the sequencing process, or it could indeed be a recombinant mosaic of four other sequences in the sample. This recovers the results obtained by Stephens and Nei (1985), who posited the recombinant origins of sequence Fl-2S following manual examination of a reconstructed maximum parsimony tree, which also highlighted the five consecutive mutations identified by KwARG. The ARG corresponding to the solution in row 5 of Table 1, visualised using Graphviz (Ellson et al., 2004), is shown in Figure 6.
Examination of the identified solutions also shows that site 36 of sequence Ja-S "necessitates" two of the seven recombinations inferred in the minimal solution in the absence of recurrent mutation, while sites 3 and 9 in sequences Wa-S and Fl-1S, respectively, each create incompatibilities that could be resolved by one recombination.

Discussion
Methods for the reconstruction of parsimonious ARGs generally rely on the infinite sites assumption. When examining the output ARGs, it is often difficult to tell by how much the inferred recombination events actually affect the recombining sequences. As is the case with the Kreitman dataset, sometimes further examination reveals that two crossover recombination events have the same effect as one recurrent mutation, raising questions about which version of events is more likely. KwARG removes the need for such manual examination, and provides an automated way of highlighting such cases, which is particularly useful for larger datasets.
While KwARG performs well in inferring ARGs under the infinite sites assumption, it can be particularly useful in analysing genetic data information about the history of a sample. This will obviously depend strongly on the true recombination rate. Based on our comparisons with RENT+, tsinfer, and ARGweaver, KwARG achieves very good accuracy of inference of local tree topologies at least comparable to these other methods, particularly when the recombination rate is low to moderate and the mutation rate moderate to high. We emphasise that KwARG demonstrates relatively good accuracy even when the recombination rate is high and even though its express goal is to seek the most parsimonious, rather than necessarily the most likely, history. Moreover, for datasets with relatively few incompatibilities, the run time of KwARG is competitive with that of the other methods. It is also interesting to note that although all four programs incorporate very different approaches and heuristic algorithms, they demonstrate very similar performance in inferring local tree topologies over the range of considered scenarios.
The scalability of KwARG remains a challenge for large and more complex datasets. Performance gains could be readily achieved by running multiple iterations of KwARG in parallel, or incorporating more efficient ways of storing the intermediate states. Further improvements could also be obtained by amending the calculation of lower bounds within the cost function in order to account for the presence of recurrent mutation, which should make the scores more accurate, and hence the neighbourhood exploration more efficient. Other avenues for further work include explicitly incorporating gene conversion as a possible type of recombination event with a separate cost parameter, with a view to developing the underlying model of evolution to even more closely reflect biological reality.