Exact global alignment using A* with chaining seed heuristic and match pruning

Abstract

Motivation: Sequence alignment has been at the core of computational biology for half a century. Still, it is an open problem to design a practical algorithm for exact alignment of a pair of related sequences in linear-like time.

Results: We solve exact global pairwise alignment with respect to edit distance by using the A* shortest path algorithm. In order to efficiently align long sequences with high divergence, we extend the recently proposed seed heuristic with match chaining, gap costs, and inexact matches. We additionally integrate the novel match pruning technique and diagonal transition to improve the A* search. We prove the correctness of our algorithm, implement it in the A*PA aligner, and justify our extensions intuitively and empirically. On random sequences of divergence d = 4% and length n, the empirical runtime of A*PA scales near-linearly with length (best fit n^1.06, n ≤ 10^7 bp). A similar scaling remains up to d = 12% (best fit n^1.24, n ≤ 10^7 bp). For n = 10^7 bp and d = 4%, A*PA reaches >500× speedup compared to the leading exact aligners Edlib and BiWFA. The performance of A*PA is highly influenced by long gaps. On long (n > 500 kb) ONT reads of a human sample, it efficiently aligns sequences with d < 10%, leading to a 3× median speedup compared to Edlib and BiWFA. When the sequences come from different human samples, A*PA performs 1.7× faster than Edlib and BiWFA.

Availability and implementation: github.com/RagnarGrootKoerkamp/astar-pairwise-aligner


Introduction
The problem of aligning one biological sequence to another is known as global pairwise alignment (Navarro 2001). Among others, it is applied to genome assembly, read mapping, variant detection, and multiple sequence alignment (Prjibelski et al. 2019). Despite the centrality and age of pairwise alignment (Needleman and Wunsch 1970), 'a major open problem is to implement an algorithm with linear-like empirical scaling on inputs where the edit distance is linear in n' (Medvedev 2023a).
Alignment accuracy affects subsequent analyses, so a common goal is to find a shortest sequence of edit operations (single-letter insertions, deletions, and substitutions) that transforms one sequence into the other. The length of such a sequence is known as Levenshtein distance (Levenshtein 1966) and edit distance. It has recently been proven that edit distance cannot be computed in strongly subquadratic time, unless SETH is false (Backurs and Indyk 2015). When the number of sequencing errors is proportional to the length, existing exact aligners scale quadratically both in the theoretical worst case and in practice. Given the increasing amounts of biological data and increasing read lengths, this is a computational bottleneck (Kucherov 2019).
We solve the global alignment problem provably correctly and empirically fast by using A* on the alignment graph and building on many existing techniques. Our implementation A*PA (A* Pairwise Aligner) scales near-linearly with length for sequences up to 10^7 bp long with divergence up to 12%. Additionally, it shows a speedup over other highly optimized aligners when aligning long ONT reads.

Overview of method
To align two sequences A and B globally with minimal cost, we use the A* shortest path algorithm from the start to the end of the alignment graph, as first suggested by Hadlock (1988a). A core part of the A* algorithm is the heuristic function h(u) that provides a lower bound on the remaining distance from the current vertex u. A good heuristic efficiently computes an accurate estimate h, so suboptimal paths get penalized more and A* prioritizes vertices on a shortest path, thus reaching the target quicker. In this article, we extend the seed heuristic by Ivanov et al. (2022) in several ways to increase its accuracy for long and erroneous sequences.

Seed heuristic (SH)
To define the seed heuristic (SH) h_s, we split A into short, non-overlapping substrings (seeds) of fixed length k (Fig. 2a). Since the whole sequence A has to be aligned, each of the seeds also has to be aligned somewhere in B. If a seed does not match anywhere in B without mistakes, then at least one edit has to be made to align it. Thus, the SH h_s is the number of remaining seeds (contained in A_{≥i}) that do not match anywhere in B. The SH is a lower bound on the distance between the remaining suffixes A_{≥i} and B_{≥j}. In order to compute h_s efficiently, we precompute all matches in B for all seeds from A.
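As an illustration, the SH with exact matches can be computed from a k-mer set of B; the following is a minimal Python sketch (hypothetical helper, not the A*PA implementation):

```python
def seed_heuristic(A: str, B: str, k: int) -> list[int]:
    """Lower bound h[i] <= ed(A[i:], B[j:]): the number of seeds fully
    contained in the suffix A[i:] with no exact k-mer match anywhere in B."""
    kmers_of_B = {B[j:j + k] for j in range(len(B) - k + 1)}
    # Non-overlapping seeds of length k, left to right.
    seeds = [A[s:s + k] for s in range(0, len(A) - k + 1, k)]
    h = [0] * (len(A) + 1)
    for i in range(len(A) - 1, -1, -1):
        h[i] = h[i + 1]
        if i % k == 0 and i // k < len(seeds):
            # Seed starting at i contributes 1 if it matches nowhere in B.
            h[i] += seeds[i // k] not in kmers_of_B
    return h
```

Since every unmatched seed forces at least one edit, h[i] never exceeds the true remaining edit distance.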
Where Ivanov et al. (2022) use crumbs to mark upcoming matches in the graph, we do not need them due to the simpler structure of sequence-to-sequence alignment.
Chaining seed heuristic (CSH)

One drawback of the SH is that it may use matches that do not lie together on a path from u to the end, e.g. the matches for s_1 and s_3 in Fig. 2a. In the chaining seed heuristic (CSH) h_cs (Section 3.1), we enforce that the matches occur in the same order in B as their corresponding seeds occur in A, i.e. the matches form a chain going down and to the right (Fig. 2b). Now, the number of upcoming errors is at least the minimal number of remaining seeds that cannot be aligned on a single chain to the target. When there are many spurious matches (i.e. matches outside the optimal alignment), chaining improves the accuracy of the heuristic, thus reducing the number of states expanded by A*. To compute the CSH efficiently, we subtract the maximal number of matches in a chain starting in the current state from the number of remaining seeds.

Gap-chaining seed heuristic (GCSH)
The CSH penalizes the chaining of two matches by the seed cost: the number of seeds skipped between them. This chaining may skip a different number of letters in A and B, in which case the absolute difference between these lengths (the gap cost) is a lower bound on the length of a path between the two matches. The gap-chaining seed heuristic (GCSH) h_gcs (Fig. 2c) takes the maximum of the gap cost and the seed cost, which significantly improves the accuracy of the heuristic for sequences with long indels.

Inexact matches
To further improve the accuracy of the heuristic for divergent sequences, we use inexact matches (Wu and Manber 1992, Marco-Sola et al. 2012). For each seed in A, our algorithm now finds all its inexact matches in B with cost at most 1. The lack of a match of a seed then implies that at least r = 2 edits are needed to align it. This doubles the potential of our heuristic to penalize errors.

Match pruning
In order to further improve the accuracy of our heuristic, we apply the multiple-path pruning observation (Poole and Mackworth 2017): once a shortest path to a vertex u has been found, no other path to u can be shorter. Since we search for a single shortest path, we want to incrementally update our heuristic (similar to Real-Time Adaptive A* (Koenig and Likhachev 2006)) to penalize further paths to u. We prove that once A* expands a state u that is at the start or end of a match, it has indeed found a shortest path to u. Then, we can ignore (prune) such a match, thus penalizing other paths to u (Fig. 2d and Section 3.2). Pruning increases the heuristic in states preceding the match, thereby penalizing states preceding the 'tip' of the A* search. This reduces the number of expanded states, and leads to near-linear scaling with sequence length (Fig. 1e).

Diagonal transition
The diagonal-transition algorithm visits only the so-called farthest-reaching states (Ukkonen 1985, Myers 1986) along each diagonal and lies at the core of the wavefront alignment (WFA) algorithm (Marco-Sola et al. 2021) (Fig. 1c). We introduce the diagonal-transition optimization to the A* algorithm: it skips states known not to be farthest reaching. This is independent of the A* heuristic and makes the exploration more 'hollow', especially speeding up the quadratic behaviour of A* in complex regions.
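The farthest-reaching idea can be sketched as follows; this minimal Python version of diagonal transition (not WFA or the A*PA code) stores, per cost s, only the farthest row i reached on each diagonal i − j:

```python
def edit_distance_dt(a: str, b: str) -> int:
    """Diagonal-transition edit distance: a BFS over cost s that keeps only
    the farthest-reaching state on each diagonal (sketch of Ukkonen/Myers)."""
    n, m = len(a), len(b)
    fr = {0: 0}  # fr[d] = farthest i on diagonal d = i - j with current cost
    s = 0
    while True:
        # Extend each diagonal along free (matching) diagonal edges.
        for d, i in fr.items():
            j = i - d
            while i < n and j < m and a[i] == b[j]:
                i, j = i + 1, j + 1
            fr[d] = i
        if fr.get(n - m, -1) >= n:  # reached state (n, m)
            return s
        # Spend one edit: substitution, deletion, insertion.
        nxt = {}
        for d, i in fr.items():
            for d2, i2 in ((d, i + 1), (d + 1, i + 1), (d - 1, i)):
                if i2 <= n and i2 - d2 <= m and nxt.get(d2, -1) < i2:
                    nxt[d2] = i2
        fr = nxt
        s += 1
```

For cost s, only O(s) diagonals are active, giving the expected O(n + s²) behaviour on random inputs.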
We present an algorithm to efficiently initialize and evaluate these heuristics and optimizations (Supplementary Section A and Section 3.3), prove the correctness of our methods (Supplementary Section B), and evaluate and compare their performance to other optimal aligners (Section 4 and Supplementary Section C).

Related work
We first outline the algorithms behind the fastest exact global aligners: dynamic programming (DP)-based band doubling (used by Edlib) and diagonal transition (DT) (used by BiWFA). Then, we outline methods that A*PA integrates.

Dynamic programming
This classic approach to aligning two sequences computes a table where each cell contains the edit distance between a prefix of the first sequence and a prefix of the second, by reusing the solutions for shorter prefixes. This quadratic DP was introduced for speech signals (Vintsyuk 1968) and genetic sequences (Needleman and Wunsch 1970, Sankoff 1972, Sellers 1974, Wagner and Fischer 1974). The quadratic O(nm) runtime for sequences of lengths n and m allowed aligning long sequences at the time, but speeding it up has been a central goal of later works. Implementations of this algorithm include SeqAn (Reinert et al. 2017) and Parasail (Daily 2016).

Band-doubling and bit-parallelization
When the aligned sequences are similar, the whole DP table does not need to be computed. One such output-sensitive algorithm is the band-doubling algorithm of Ukkonen (1985) (Fig. 1a), which considers only states around the main diagonal of the table, in a band with exponentially increasing width, leading to O(ns) runtime, where s is the edit distance between the sequences. This algorithm, combined with the bit-parallel optimization by Myers (1999), is implemented in Edlib (Šošić and Šikić 2017) with O(ns/w) runtime, where w is the machine word size (nowadays 64).
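Band doubling can be sketched as follows; a plain Python version with unit edit costs (without Myers' bit-parallelism), where a banded result ≤ t certifies optimality because a path of cost s never leaves the band |i − j| ≤ s:

```python
def edit_distance_band_doubling(a: str, b: str) -> int:
    """Ukkonen's band doubling: run the DP restricted to |i - j| <= t and
    double t until the banded result is at most t (then it is exact)."""
    n, m = len(a), len(b)
    t = max(1, abs(n - m))
    INF = float("inf")
    while True:
        prev = [j if j <= t else INF for j in range(m + 1)]  # row i = 0
        for i in range(1, n + 1):
            cur = [INF] * (m + 1)
            if i <= t:
                cur[0] = i
            for j in range(max(1, i - t), min(m, i + t) + 1):
                cur[j] = min(prev[j - 1] + (a[i - 1] != b[j - 1]),  # (mis)match
                             prev[j] + 1,                            # deletion
                             cur[j - 1] + 1)                         # insertion
            prev = cur
        if prev[m] <= t:
            return prev[m]
        t *= 2  # band too narrow: the result is only an upper bound
```

Each doubling at most doubles the work, so the total cost stays O(ns).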

Diagonal transition
DT (Ukkonen 1985, Myers 1986) is a technique that exploits the observation that the edit distance does not decrease along diagonals of the DP matrix. This allows an equivalent representation of the DP table based on farthest-reaching states for a given edit distance along each diagonal.
DT has an O(ns) worst-case runtime but only takes expected O(n + s²) time (Fig. 1c) for random input sequences (Myers 1986), which is still quadratic for a fixed divergence d = s/n.
It has been extended to linear and affine costs in WFA (Marco-Sola et al. 2021) in a way similar to Gotoh (1982). Its memory usage has been improved to linear in BiWFA (Marco-Sola et al. 2023) by combining it with the divide-and-conquer approach of Hirschberg (1975), similar to Myers (1986) for unit edit costs. Wu et al. (1990) and Papamichail and Papamichail (2009) apply DT to align sequences of different lengths.

Contours
The longest common subsequence (LCS) problem is a special case of edit distance, in which gaps are allowed but substitutions are forbidden. Contours partition the state space into regions with the same remaining answer of the LCS subtask (Fig. 3). The contours can be computed in log-linear time in the number of matching elements between the two sequences, which is practical for large alphabets (Hirschberg 1977, Hunt and Szymanski 1977).

Shortest paths and A*
An alignment that minimizes edit distance corresponds to a shortest path in the alignment graph (Vintsyuk 1968, Ukkonen 1985). Assuming non-negative edit costs, a shortest path can be found using Dijkstra's algorithm (Ukkonen 1985) (Fig. 1b) or A* (Hart et al. 1968). A* is an informed search algorithm, which uses a task-specific heuristic function to direct its search, and has previously been applied to the alignment graph by Hadlock (1988a, b) and Spouge (1989, 1991). A* with an accurate heuristic may find a shortest path significantly faster than an uninformed search, such as Dijkstra's algorithm.

A* heuristics
One widely used heuristic function is the gap cost that counts the minimal number of indels needed to align the suffixes of two sequences (Ukkonen 1985, Spouge 1989, Wu et al. 1990, Myers and Miller 1995, Papamichail and Papamichail 2009, Šošić and Šikić 2017). Hadlock (1988a) introduces a heuristic based on character frequencies.

Seed-and-extend
Seed-and-extend is a commonly used paradigm for approximately solving semi-global alignment by first matching similar regions between sequences (seeding) to find matches (also called anchors), followed by extending these matches (Kucherov 2019). Aligning long reads requires the additional step of chaining the seed matches (seed-chain-extend). Seeds have also been used to solve the LCSk generalization of LCS (Benson et al. 2013, Pavetić et al. 2017). Except for the SH (Ivanov et al. 2022), most seeding approaches seek seeds with accurate long matches.

Seed heuristic
A* with the SH was recently introduced for exact semi-global sequence-to-graph alignment (Ivanov et al. 2022). In a precomputation step, the query sequence is split into non-overlapping seeds, each of which is matched exactly to the reference. When A* explores a new state, the SH is computed as the number of remaining seeds that cannot be matched in the upcoming reference. A* with the SH enables provably exact alignment but runs reasonably fast only when the long sequences are very similar (≤0.3% divergence).

Contributions
We present an algorithm for exact global alignment that uses A* on the alignment graph (Hart et al. 1968, Hadlock 1988a), starting with the SH of Ivanov et al. (2022).
We increase the accuracy of this heuristic in several novel ways: seeds must match in order in the CSH, and gaps between seeds are penalized in the GCSH. The novel match pruning technique penalizes states 'lagging behind' the tip of the search and turns the otherwise quadratic algorithm into an empirically near-linear algorithm in many cases. Inexact matches (Wu and Manber 1992, Marco-Sola et al. 2012) increase the divergence of sequences that can be efficiently aligned. We additionally apply the diagonal-transition algorithm (Ukkonen 1985, Myers 1986), so that only the small fraction of farthest-reaching states needs to be computed. We prove the correctness of our methods, and apply contours (Hirschberg 1977, Hunt and Szymanski 1977) to efficiently initialize and evaluate the heuristic. We implement our method in the novel aligner A*PA.

Figure 2. Demonstration of SH, CSH, GCSH, and match pruning. Sequence A on top is split into five seeds (horizontal black segments). Each seed is exactly matched in B (diagonal black segments). The heuristic is evaluated at state u (blue circles), based on the four remaining seeds. The heuristic value is based on a maximal chain of matches (green columns for seeds with matches; red columns otherwise). Dashed lines denote chaining of matches. (a) The SH h_s(u) = 1 is the number of remaining seeds that do not have matches (only s_2). (b) The CSH h_cs(u) = 2 is the number of remaining seeds without a match (s_2 and s_3) on a path going only down and to the right containing a maximal number of matches. (c) The GCSH h_gcs(u) = 4 is the minimal cost of a chain, where the cost of joining two matches is the maximum of the number of unmatched seeds and the gap cost between them. Red dashed lines denote gap costs. (d) Once the start or end of a match is expanded (green circles), the match is pruned (red cross), and future computations of the heuristic ignore it. s_1 is removed from the maximal chain of matches starting at u, so ĥ_cs(u) increases by 1.
On uniform-random synthetic data with 4% divergence, the runtime of A*PA scales linearly with length up to 10^7 bp and is up to 500× faster than Edlib and BiWFA. On >500 kb long Oxford Nanopore Technologies (ONT) reads of the human genome, A*PA is 3× faster in median than Edlib and BiWFA when only read errors are present, and 1.7× faster in median when additionally genetic variation is present.

Preliminaries
This section provides definitions and notation that are used throughout the article.A summary of notation is included in Supplementary Section D.

Sequences
The input sequences A = a_0 a_1 … a_i … a_{n−1} and B = b_0 b_1 … b_j … b_{m−1} are over an alphabet Σ with four letters. We refer to substrings a_i … a_{i′−1} as A_{i…i′}, to prefixes a_0 … a_{i−1} as A_{<i}, and to suffixes a_i … a_{n−1} as A_{≥i}. The edit distance ed(A, B) is the minimum number of insertions, deletions, and substitutions of single letters needed to convert A into B. The divergence is the observed number of errors per letter, d := ed(A, B)/n, whereas the error rate e is the number of errors per letter applied to a sequence.

Alignment graph
Let state ⟨i, j⟩ denote the subtask of aligning the prefix A_{<i} to the prefix B_{<j}. The alignment graph (also called edit graph) G(V, E) is a weighted directed graph with vertices V = {⟨i, j⟩ | 0 ≤ i ≤ n, 0 ≤ j ≤ m} corresponding to all states, and edges connecting subtasks: edge ⟨i, j⟩ → ⟨i+1, j+1⟩ has cost 0 if a_i = b_j (match) and 1 otherwise (substitution), and edges ⟨i, j⟩ → ⟨i+1, j⟩ (deletion) and ⟨i, j⟩ → ⟨i, j+1⟩ (insertion) have cost 1. We denote the starting state ⟨0, 0⟩ by v_s, the target state ⟨n, m⟩ by v_t, and the distance between states u and v by d(u, v). For brevity, we write f⟨i, j⟩ instead of f(⟨i, j⟩).
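The correspondence between edit distance and shortest paths can be seen directly by running an uninformed search on this graph; a minimal Python sketch using Dijkstra's algorithm (illustration only, not how A*PA traverses the graph):

```python
import heapq

def align_dijkstra(A: str, B: str) -> int:
    """Edit distance as the shortest path from <0,0> to <n,m> in the
    alignment graph defined above."""
    n, m = len(A), len(B)
    dist = {(0, 0): 0}
    pq = [(0, 0, 0)]  # (g, i, j)
    while pq:
        g, i, j = heapq.heappop(pq)
        if (i, j) == (n, m):
            return g
        if g > dist[(i, j)]:
            continue  # stale queue entry
        # Diagonal (cost 0 on match, 1 on substitution), deletion, insertion.
        for di, dj, w in ((1, 1, 0 if i < n and j < m and A[i] == B[j] else 1),
                          (1, 0, 1), (0, 1, 1)):
            v = (i + di, j + dj)
            if v[0] <= n and v[1] <= m and g + w < dist.get(v, float("inf")):
                dist[v] = g + w
                heapq.heappush(pq, (g + w, *v))
```

This explores all states within distance ed(A, B) of the start, which is exactly the quadratic behaviour that a good heuristic avoids.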

Paths and alignments
A path p from ⟨i, j⟩ to ⟨i′, j′⟩ in the alignment graph G corresponds to a (pairwise) alignment of the substrings A_{i…i′} and B_{j…j′} with cost c_path(p). A shortest path p* from v_s to v_t corresponds to an optimal alignment, thus c_path(p*) = d(v_s, v_t) = ed(A, B). We write g*(u) := d(v_s, u) for the distance from the start to u, and h*(u) := d(u, v_t) for the distance from u to the target.

Seeds and matches
We split the sequence A into a set of consecutive non-overlapping substrings (seeds) S = {s_0, s_1, s_2, …, s_{⌊n/k⌋−1}}, such that each seed s_l = A_{lk…lk+k} has length k. After aligning the first i letters of A, our heuristics will only depend on the remaining seeds S_{≥i} := {s_l ∈ S | lk ≥ i} contained in the suffix A_{≥i}. We denote the set of seeds between u = ⟨i, j⟩ and v = ⟨i′, j′⟩ by S_{u…v} = S_{i…i′} = {s_l ∈ S | i ≤ lk, lk + k ≤ i′}, and an alignment of s to a subsequence of B by p_s. The alignments of seed s with sufficiently low cost (Section 3.1) form the set M_s of matches.

Dijkstra and A*
Dijkstra's algorithm (Dijkstra 1959) finds a shortest path from v_s to v_t by expanding (generating all successors of) vertices in order of increasing distance g*(u) from the start. Each vertex to be expanded is chosen from a set of open vertices. The A* algorithm (Hart et al. 1968, 1972, Pearl 1984) instead directs the search towards a target by expanding vertices in order of increasing f(u) := g(u) + h(u), where h(u) is a heuristic function that estimates the distance h*(u) to the end, and g(u) is the shortest length of a path from v_s to u found so far.
A heuristic is admissible if it is a lower bound on the remaining distance, h(u) ≤ h*(u), which guarantees that A* has found a shortest path as soon as it expands v_t. Heuristic h_1 dominates (is more accurate than) another heuristic h_2 when h_1(u) ≥ h_2(u) for all vertices u. A dominant heuristic will usually, but not always (Holte 2010), expand fewer vertices. Note that Dijkstra's algorithm is equivalent to A* using a heuristic that is always 0, and that both algorithms require non-negative edge costs. Our variant of the A* algorithm is provided in Supplementary Section A.1.
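A minimal Python sketch of A* on the alignment graph, parameterized by an arbitrary heuristic (the default here is the classic gap cost from Section 2.6, not our seed heuristics):

```python
import heapq

def align_astar(A: str, B: str, h=None) -> int:
    """A* edit distance on the alignment graph. `h(i, j)` must be an
    admissible estimate of the remaining distance from <i, j> to <n, m>."""
    n, m = len(A), len(B)
    if h is None:
        h = lambda i, j: abs((n - i) - (m - j))  # gap cost, admissible
    dist = {(0, 0): 0}
    pq = [(h(0, 0), 0, 0, 0)]  # (f = g + h, g, i, j)
    while pq:
        f, g, i, j = heapq.heappop(pq)
        if (i, j) == (n, m):
            return g
        if g > dist[(i, j)]:
            continue  # stale entry
        steps = [(i + 1, j + 1, 0 if A[i] == B[j] else 1)] if i < n and j < m else []
        if i < n:
            steps.append((i + 1, j, 1))
        if j < m:
            steps.append((i, j + 1, 1))
        for i2, j2, w in steps:
            if g + w < dist.get((i2, j2), float("inf")):
                dist[(i2, j2)] = g + w
                heapq.heappush(pq, (g + w + h(i2, j2), g + w, i2, j2))
```

Passing `h=lambda i, j: 0` recovers Dijkstra's algorithm, illustrating the equivalence noted above.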

Chains
A state u = ⟨i, j⟩ ∈ V precedes a state v = ⟨i′, j′⟩ ∈ V, denoted u ⪯ v, when i ≤ i′ and j ≤ j′. Similarly, a match m precedes a match m′, denoted m ⪯ m′, when the end of m precedes the start of m′. This makes the set of matches a partially ordered set. A state u precedes a match m, denoted u ⪯ m, when it precedes the start of the match. A chain of matches is a (possibly empty) sequence of matches m_1 ⪯ … ⪯ m_l.

Contours
To efficiently calculate maximal chains of matches, we use contours. Given a set of matches M, S(u) is the number of matches in the longest chain u ⪯ m_1 ⪯ … starting at u. The function S⟨i, j⟩ is non-increasing in both i and j. Contours are the boundaries between regions of states with S(u) ≥ ℓ and S(u) < ℓ (Fig. 3). Note that contour ℓ is completely determined by the set of matches m ∈ M for which S(start(m)) = ℓ (Hirschberg 1977). Hunt and Szymanski (1977) give an algorithm to efficiently compute S when M is the set of single-letter matches between A and B, and Deorowicz and Grabowski (2014) give an algorithm when M is the set of exact k-mer matches.
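For intuition, S can be computed by a simple quadratic dynamic program over the matches; the contour-based algorithms cited above replace this with a log-linear computation. A Python sketch, with matches given as hypothetical ((i, j), (i′, j′)) start/end tuples:

```python
def chain_scores(matches):
    """S(m): the number of matches on a longest chain starting at match m.
    m precedes m2 when end(m) <= start(m2) componentwise. O(|M|^2) sketch."""
    # Process matches from the target backwards (decreasing start coordinate),
    # so every possible successor is already scored.
    order = sorted(range(len(matches)), key=lambda x: matches[x][0], reverse=True)
    S = [0] * len(matches)
    for x in order:
        _, (ei, ej) = matches[x]
        best = 0
        for y in range(len(matches)):
            (si, sj), _ = matches[y]
            if ei <= si and ej <= sj and S[y] > best:
                best = S[y]
        S[x] = 1 + best
    return S
```

S(u) for a state u is then the maximum S(m) over matches m preceded by u.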

Methods
We formally define the general CSH (Section 3.1), which encompasses inexact matches, chaining, and gap costs (Fig. 2). Next, we introduce the match pruning (Section 3.2) improvement and integrate our A* algorithm with the diagonal-transition optimization (Supplementary Section A.2). We present a practical algorithm (Section 3.3), implementation (Supplementary Section A.3), and proofs of correctness (Supplementary Section B).

General CSH
We introduce three heuristics for A* that estimate the edit distance between a pair of suffixes. Each heuristic is an instance of a general CSH. After splitting the first sequence into seeds S, and finding all matches M in the second sequence, any shortest path to the target can be partitioned into a chain of matches and connections between the matches. Thus, the cost of a path is the sum of match costs c_m and chaining costs c. Our simplest SH ignores the position in B where seeds match and counts the number of seeds that were not matched (c = c_seed). To efficiently handle more errors, we allow seeds to be matched inexactly, require the matches in a path to be ordered (CSH), and include the gap cost in the chaining cost c = max(c_gap, c_seed) to penalize indels between matches (GCSH).

Inexact matches
We generalize the notion of exact matches to inexact matches.
We fix a threshold cost r (0 < r ≤ k) called the seed potential and define the set of matches M_s as all alignments m of seed s with match cost c_m(m) < r. The inequality is strict so that M_s = ∅ implies that aligning the seed will incur cost at least r. Let M = ∪_s M_s denote the set of all matches. With r = 1, we allow only exact matches, while with r = 2, we allow both exact and inexact matches with one edit. We do not consider higher r in this article. For notational convenience, we define m_x ∉ M to be a match from v_t to v_t of cost 0.
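A brute-force Python sketch of this match-finding step (the A*PA implementation uses hashing rather than trying all substrings): for each seed we test alignments against substrings of B whose length differs from k by less than r, keeping those with cost below r.

```python
def find_matches(A: str, B: str, k: int, r: int):
    """All matches ((i, j), (i', j'), cost) of each seed of A in B with
    cost < r. r = 1: exact matches only; r = 2: up to one edit."""
    def ed(x, y):  # tiny DP, only used on (k +- 1)-length strings
        prev = list(range(len(y) + 1))
        for i, cx in enumerate(x, 1):
            cur = [i]
            for j, cy in enumerate(y, 1):
                cur.append(min(prev[j - 1] + (cx != cy), prev[j] + 1, cur[-1] + 1))
            prev = cur
        return prev[-1]

    matches = []
    for s in range(0, len(A) - k + 1, k):  # seed covers A[s : s + k]
        seed = A[s:s + k]
        for j in range(len(B)):
            for L in range(max(0, k - r + 1), k + r):  # lengths k-1..k+1 for r=2
                if j + L <= len(B) and ed(seed, B[j:j + L]) < r:
                    matches.append(((s, j), (s + k, j + L), ed(seed, B[j:j + L])))
    return matches
```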

Potential of a heuristic
We call the maximal value the heuristic can take in a state its potential P. The potential of our heuristics in state ⟨i, j⟩ is the sum of seed potentials r over all seeds after i: P⟨i, j⟩ := r · |S_{≥i}|.

Chaining matches
Each heuristic limits how matches can be chained based on a partial order on states. We write u ⪯_p v for the partial order implied by a function p: p(u) ⪯ p(v). A ⪯_p-chain is a sequence of matches m_1 ⪯_p … ⪯_p m_l that precede each other: end(m_i) ⪯_p start(m_{i+1}) for 1 ≤ i < l. To chain matches according only to their i-coordinate, the SH is defined using ⪯_i-chains, while the CSH and GCSH are defined using ⪯, which compares both i and j.

Chaining cost
The chaining cost c is a lower bound on the path cost between two consecutive matches: from the end state u of a match, to the start v of the next match.
For the SH and CSH, the seed cost is r for each seed that is not matched: c_seed(u, v) := r · |S_{u…v}|. When u ⪯_i v and v is not in the interior of a seed, then c_seed(u, v) = P(u) − P(v).
For the GCSH, we also include the gap cost c_gap(⟨i, j⟩, ⟨i′, j′⟩) := |(i′ − i) − (j′ − j)|, which is the minimal number of indels needed to correct for the difference in length between the substrings A_{i…i′} and B_{j…j′} between two consecutive matches (Section 2). Combining the seed cost and the gap cost, we obtain the gap-seed cost c_gs = max(c_seed, c_gap), which is capable of penalizing long indels and which we use for the GCSH. Note that c = c_seed + c_gap would not give an admissible heuristic, since indels could be counted twice, in both c_seed and c_gap.
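These costs are simple to state in code; a Python sketch with states as (i, j) tuples, evaluated at seed boundaries (k and r as above):

```python
def c_seed(u, v, k, r):
    """Seed cost: r per seed fully contained between u = (i, j) and v = (i2, j2)."""
    skipped = max(0, (v[0] // k) - ((u[0] + k - 1) // k))  # seeds with i <= lk, lk+k <= i2
    return r * skipped

def c_gap(u, v):
    """Gap cost: indels needed to correct the length difference of the substrings."""
    return abs((v[0] - u[0]) - (v[1] - u[1]))

def c_gs(u, v, k, r):
    """Gap-seed cost used by the GCSH: the max, not the sum (the sum would
    count indels twice and break admissibility)."""
    return max(c_seed(u, v, k, r), c_gap(u, v))
```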

General CSH
We define the general CSH used to instantiate SH, CSH, and GCSH.
Definition 1 (General CSH). Given a set of matches M, a partial order ⪯_p, and a chaining cost c, the general CSH h^M_{p,c}(u) is the minimal sum of match costs and chaining costs over all ⪯_p-chains (indexing extends to m_0 := u and m_{l+1} := m_x):

h^M_{p,c}(u) := min_{u ⪯_p m_1 ⪯_p … ⪯_p m_l ⪯_p m_x} ( Σ_{1 ≤ i ≤ l} c_m(m_i) + Σ_{0 ≤ i ≤ l} c(end(m_i), start(m_{i+1})) ).    (1)

We instantiate our heuristics according to Table 1. Our admissibility proofs (Supplementary Section B.1) are based on c_m and c being lower bounds on disjoint parts of the remaining path. The more complex h_gcs dominates the other heuristics and usually expands fewer states.
Theorem 1. The SH h_s, the CSH h_cs, and the GCSH h_gcs are admissible. Furthermore, h^M_s(u) ≤ h^M_cs(u) ≤ h^M_gcs(u) for all states u.
We are now ready to instantiate A* with our admissible heuristics but we will first improve them and show how to compute them efficiently.

Match pruning
In order to reduce the number of states expanded by the A* algorithm, we apply the multiple-path pruning observation: once a shortest path to a state has been found, no other path to this state could possibly improve the global shortest path (Poole and Mackworth 2017). As soon as A* expands the start or end of a match, we prune it, so that the heuristic in preceding states no longer benefits from the match, and they get deprioritized by A*. We define pruned variants of all our heuristics that ignore pruned matches:

Definition 2 (Pruning heuristic). Let E be the set of expanded states during the A* search, and let M∖E be the set of matches that were not pruned, i.e. those matches not starting or ending in an expanded state. We say that ĥ := h^{M∖E} is a pruning heuristic version of h.
The hat over the heuristic function (ĥ) denotes the implicit dependency on the progress of the A* search, where at each step a different h^{M∖E} is used. Our modified A* algorithm (Supplementary Section A.1) handles pruning heuristics by ensuring that the f-value of a state is up to date before expanding it, and otherwise reordering it in the priority queue. Even though match pruning violates the admissibility of our heuristics for some vertices, we prove that A* is still guaranteed to find a shortest path (Supplementary Section B.2). To this end, we show that our pruning heuristics are weakly admissible (Supplementary Definition S7) in the sense that they are admissible on at least one path from v_s to v_t.
Theorem 2 A* with a weakly admissible heuristic finds a shortest path.
Pruning will allow us to scale near-linearly with sequence length, without sacrificing optimality of the resulting alignment.

Computing the heuristic
We present an algorithm to efficiently compute our heuristics (pseudocode in Supplementary Section A.4, worst-case asymptotic analysis in Supplementary Section A.5).At a high level, we rephrase the minimization of costs (over paths) to a maximization of scores (over chains of matches).We initialize the heuristic by precomputing all seeds, matches, potentials, and a contours data structure used to compute the maximum number of matches on a chain.During the A* search, the heuristic is evaluated in all explored states, and the contours are updated whenever a match gets pruned.

Scores
The score of a match m is score(m) := r − c_m(m), and is always positive. The score of a ⪯_p-chain m_1 ⪯_p … ⪯_p m_l is the sum of the scores of the matches in the chain. We define the chain score of a match m as

S_p(m) := max_{m ⪯_p m_2 ⪯_p … ⪯_p m_l ⪯_p m_x} ( score(m) + score(m_2) + … + score(m_l) ).    (2)

Since ⪯_p is a partial order, S_p can be computed with base case S_p(m_x) = 0 and the recursion S_p(m) = score(m) + max { S_p(m′) : end(m) ⪯_p m′ }. We also define the chain score of a state u as the maximum chain score over succeeding matches m: S_p(u) = max_{u ⪯_p m ⪯_p v_t} S_p(m), so that Equation (2) can be rewritten as S_p(m) = score(m) + S_p(end(m)).
The following theorem allows us to rephrase the heuristic in terms of potentials and scores for heuristics that use c = c_seed and respect the order of the seeds, which is the case for h_s and h_cs (proof in Supplementary Section B.3):

Theorem 4. h^M_{p,c_seed}(u) = P(u) − S_p(u) for any partial order ⪯_p that is a refinement of ⪯_i (i.e. u ⪯_p v must imply u ⪯_i v).

Layers and contours
We compute h_s and h_cs efficiently using contours. Let layer L_ℓ be the set of states u with score S_p(u) ≥ ℓ, so that L_ℓ ⊆ L_{ℓ−1}. The ℓth contour is the boundary of L_ℓ (Fig. 3). Layer L_ℓ (ℓ > 0) contains exactly those states that precede a match m with score ℓ ≤ S_p(m) < ℓ + r (Lemma 5 in Supplementary Section B.3).

Computing S p ðuÞ
This last observation inspires our algorithm for computing chain scores. For each layer ℓ, we store the set L[ℓ] of matches having score exactly ℓ: L[ℓ] = {m ∈ M | S_p(m) = ℓ}. The score S_p(u) is then the highest ℓ such that layer L[ℓ] contains a match m reachable from u (u ⪯_p m). From Lemma 5, we know that S_p(u) ≥ ℓ if and only if one of the layers L[ℓ′] for ℓ′ ∈ [ℓ, ℓ + r) contains a match preceded by u. We use this to compute S_p(u) with a binary search over the layers ℓ. We initialize L[0] = {m_x} (m_x is a fictive match at the target v_t), sort all matches in M by ⪯_p, and process them in decreasing order (from the target to the start). After computing S_p(end(m)), we add m to layer S_p(m) = score(m) + S_p(end(m)). Matches that do not precede the target (start(m) ⋠_p m_x) are ignored.
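The layered query can be sketched as follows, assuming a hypothetical `layers` list where `layers[l]` holds the (start, end) pairs of matches with chain score exactly l; the binary search is valid because, by Lemma 5, the tested predicate is equivalent to S_p(u) ≥ l, which is monotone in l:

```python
def chain_score_query(layers, u, r):
    """S_p(u) via binary search over layers: S_p(u) >= l iff some layer in
    [l, l + r) contains a match whose start is preceded by u (Lemma 5)."""
    def reachable(l):
        return any(u[0] <= s[0] and u[1] <= s[1]
                   for l2 in range(l, min(l + r, len(layers)))
                   for (s, _e) in layers[l2])
    lo, hi = 0, len(layers) - 1  # S_p(u) >= 0 always (the fictive match m_x)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if reachable(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

With matches on the optimal chain at (0, 0)→(2, 2) and (2, 2)→(4, 4) and target (4, 4), a state just past the first match scores one less than the start.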

Pruning matches from L
When pruning matches starting or ending in a state u in layer ℓ_u = S_p(u), we remove all matches that start at u from layers L[ℓ_u − r + 1] to L[ℓ_u], and all matches starting in some state v and ending in u from layers L[ℓ_v − r + 1] to L[ℓ_v], where ℓ_v = S_p(v). Pruning a match may change S_p in layers above ℓ_u, so we update them after each prune. We iterate over increasing ℓ starting at ℓ_u + 1 and recompute ℓ′ := S_p(m) ≤ ℓ for all matches m in L[ℓ]. If ℓ′ ≠ ℓ, we move m from L[ℓ] to L[ℓ′]. We stop iterating when either r consecutive layers were left unchanged, or when all matches in r − 1 + ℓ − ℓ′ consecutive layers have shifted down by the same amount ℓ − ℓ′. In the former case, no further scores can change, and in the latter case, S_p decreases by ℓ − ℓ′ for all matches with score ≥ ℓ. We remove the emptied layers L[ℓ′ + 1] to L[ℓ] so that all higher layers shift down by ℓ − ℓ′.

Seed heuristic
Due to the simple structure of the SH, we simplify its computation by storing only the start of each layer and the number of matches in each layer, as opposed to the full set of matches.

Gap-chaining seed heuristic
Theorem 4 does not apply to the GCSH, since it uses the chaining cost c = max(c_gap(u, v), c_seed(u, v)), which differs from c_seed(u, v). It turns out that in this new setting it is never optimal to chain two matches if the gap cost between them is higher than the seed cost. Intuitively, it is better to miss a match than to incur additional gap cost to include it. We capture this constraint by introducing a transformation T such that u ⪯_T v holds if and only if c_seed(u, v) ≥ c_gap(u, v), as shown in Supplementary Section B.4. Using an additional consistency constraint on the set of matches, we can compute h^M_gcs via S_T as before.
Definition 3 (Consistent matches). A set of matches M is consistent when for each m ∈ M (from ⟨i, j⟩ to ⟨i′, j′⟩) with score(m) > 1, for each adjacent pair of existing states (⟨i, j±1⟩, ⟨i′, j′⟩) and (⟨i, j⟩, ⟨i′, j′±1⟩), there is an adjacent match with the corresponding start and end, and score at least score(m) − 1.
This condition means that for r = 2, each exact match must be adjacent to four (or fewer, at the edges of the graph) inexact matches starting or ending in the same state. Since we find all matches m with c_m(m) < r, our initial set of matches is consistent. To preserve consistency, we do not prune matches if that would break the consistency of M.
Definition 4 (Gap transformation). The partial order ⪯_T on states is induced by comparing both coordinates after the gap transformation T: ⟨i, j⟩ ↦ (i − j − P⟨i, j⟩, j − i − P⟨i, j⟩).

Theorem 5. Given a consistent set of matches M, the GCSH can be computed using scores in the transformed domain: h^M_gcs(u) = P(u) − S_T(u).

Using this transformation of the match coordinates, we reduce c_gs to c_seed and efficiently compute the GCSH for any explored state.
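The key property of T (u ⪯_T v iff c_seed(u, v) ≥ c_gap(u, v)) can be checked numerically at seed boundaries, where c_seed(u, v) = P(u) − P(v); a Python sketch with hypothetical parameters n, k, r:

```python
def gap_transform(u, P):
    """The gap transformation T<i,j> = (i - j - P<i,j>, j - i - P<i,j>)."""
    i, j = u
    return (i - j - P(u), j - i - P(u))

def verify_transform(n=12, k=3, r=2, m=5):
    """Check, on a small grid of seed-boundary states (hypothetical n, k, r),
    that u <=_T v holds exactly when c_seed(u, v) >= c_gap(u, v)."""
    P = lambda u: r * ((n - u[0]) // k)  # potential at a seed boundary
    for i in range(0, n + 1, k):
        for i2 in range(i, n + 1, k):
            for j in range(m + 1):
                for j2 in range(j, m + 1):
                    u, v = (i, j), (i2, j2)
                    tu, tv = gap_transform(u, P), gap_transform(v, P)
                    leq_T = tu[0] <= tv[0] and tu[1] <= tv[1]
                    c_seed = P(u) - P(v)  # r per skipped seed
                    c_gap = abs((i2 - i) - (j2 - j))
                    assert leq_T == (c_seed >= c_gap)
    return True
```

Componentwise, T(u) ≤ T(v) unfolds to P(u) − P(v) ≥ ±((i′ − i) − (j′ − j)), i.e. c_seed ≥ c_gap, which is what the check confirms.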

Synthetic data
Our synthetic datasets are parameterized by sequence length n, induced error rate e, and total number of basepairs N, resulting in N/n sequence pairs. The first sequence in each pair is drawn uniformly at random from Σ^n. The second is generated by sequentially applying ⌊e·n⌋ edit operations (insertions, deletions, and substitutions with equal one-third probability) to the first sequence. Introduced errors can cancel each other, making the divergence d between the sequences less than e. Induced error rates of 1%, 5%, 10%, and 15% correspond to divergences of 0.9%, 4.3%, 8.2%, and 11.7%, which we refer to as 1%, 4%, 8%, and 12%.
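A minimal sketch of this generation procedure (the function name and seeding are ours, not part of the evaluated pipeline):

```python
import random

def generate_pair(n, e, sigma="ACGT", seed=None):
    """Uniform-random sequence of length n over sigma, plus a copy
    mutated by floor(e * n) random edit operations."""
    rng = random.Random(seed)
    a = [rng.choice(sigma) for _ in range(n)]
    b = list(a)
    for _ in range(int(e * n)):
        op = rng.choice(("ins", "del", "sub"))  # equal 1/3 probability
        if op == "ins":
            b.insert(rng.randrange(len(b) + 1), rng.choice(sigma))
        elif op == "del" and b:
            b.pop(rng.randrange(len(b)))
        elif op == "sub" and b:
            b[rng.randrange(len(b))] = rng.choice(sigma)
    return "".join(a), "".join(b)
```

Since a substitution may re-draw the same character and later edits may undo earlier ones, the edit distance of the resulting pair is at most ⌊e·n⌋, which is why the measured divergence d stays below e.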

Human data
We use two datasets of ultra-long ONT reads of the human genome: one without and one with genetic variation. All reads are 500–1100 kb long, with mean divergence around 7%. The average length of the longest gap in the alignment is 0.1 kb for ONT reads, and 2 kb for ONT reads with genetic variation (detailed statistics in Supplementary Section C.5). The reference genome is CHM13 (v1.1) (Nurk et al. 2022). The reads used for each dataset are: • ONT: 50 reads sampled from those used to assemble CHM13.

A*PA parameters
Inexact matches (r = 2) and short seeds (low k) increase the accuracy of GCSH for divergent sequences, thus reducing the number of expanded states. On the other hand, shorter seeds have more matches, slowing down precomputation and contour updates. A parameter grid search on synthetic data (Supplementary Section C.1) shows that the runtime is generally insensitive to k as long as k is high enough to avoid too many spurious matches (k ≫ log₄ n) and the potential is sufficiently larger than the edit distance (k ≪ r/d). For d = 4%, exact matches lead to faster runtimes, while d = 12% requires r = 2 and k < 2/d = 16.7. We fix k = 15 throughout the evaluations since this is a good choice for both synthetic and human data.

Execution
We use PABENCH on Arch Linux on an Intel Core i7-10750H processor with 64 GB of memory and 6 cores, without hyper-threading, frequency boost, or CPU power-saving features. We fix the CPU frequency to 2.6 GHz, limit the memory usage to 32 GiB, and run one single-threaded job at a time with niceness −20.

Measurements
PABENCH first reads the dataset from disk and then measures the wall-clock time and increase in memory usage of each aligner. Plots and tables refer to the average alignment time per aligned pair, and for A*PA they include the time to build the heuristic. Best-fit polynomials are calculated via a linear fit in the log-log domain using the least squares method.

Runtime scaling with length
For high divergence (d = 12%), we need inexact matches, and the runtime of SH sharply degrades for long sequences (n > 10^6 bp) due to spurious matches. This is countered by chaining the matches in CSH and GCSH, which expand linearly many states (Supplementary Section C.3). GCSH with DT is not exactly linear due to high memory usage and state reordering (Supplementary Section C.7 shows the time spent on parts of the algorithm).
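The best-fit exponents reported throughout (e.g. n^1.06) amount to ordinary least squares on log-transformed data; a small sketch of our own, not the PABENCH code:

```python
import numpy as np

def fit_exponent(ns, times):
    """Fit time ~ c * n^e by a degree-1 least-squares fit of
    log(time) against log(n); returns (c, e)."""
    e, log_c = np.polyfit(np.log(ns), np.log(times), 1)
    return np.exp(log_c), e
```

A slope e near 1 in the log-log domain then corresponds to the near-linear scaling claimed for A*PA, while e near 2 corresponds to quadratic behaviour.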

Runtime scaling with divergence
Figure 3c shows that A*PA has near-constant runtime in d as long as the edit distance is sufficiently less than the heuristic potential (i.e. d ≪ r/k). In this regime, A*PA is faster than both EDLIB (linear in d) and BIWFA (quadratic in d). For 1% ≤ d ≤ 6%, exact matches have less overhead than inexact matches, while BIWFA is fastest for d ≤ 1%. A*PA becomes linear in d for d ≥ r/k (Supplementary Section C.4).

Speedup on human data
We compare runtime (Fig. 4 and Supplementary Section C.7) and memory usage (Supplementary Section C.6) on human data. We configure A*PA to prune matches only when expanding their start (not their end), leaving some matches on

Effect of pruning, inexact matches, chaining, and DT
We visualize our techniques on a complex alignment in Supplementary Section C.10.

SH with pruning enables near-linear runtime
Figure 3a shows that the addition of match pruning changes the quadratic runtime of SH without pruning to near-linear, giving multiple orders of magnitude speedup.

Inexact matches cope with higher divergence
Inexact matches double the heuristic potential, thereby almost doubling the divergence up to which A*PA is fast (Fig. 4c). This comes at the cost of a slower precomputation to find all matches.

Chaining copes with spurious matches
While CSH improves on SH for some very slow alignments (Fig. 4), more often the overhead of computing contours makes it slower than SH.

Gap-chaining copes with indels
GCSH is significantly and consistently faster than SH and CSH on human data, especially for slow alignments (Fig. 5). GCSH has less overhead over SH than CSH does, due to filtering out matches m for which m ⪯_T v_t does not hold.

DT speeds up quadratic search
DT significantly reduces the number of expanded states when the A* search is quadratic (Fig. 4a and Supplementary Section C.4). This results in a significant speedup for genetic variation with long indels (Fig. 5). CSH, GCSH, and DT have only a small impact on the uniform synthetic data, where usually either the SH is sufficiently accurate for the entire alignment and the runtime is near-linear (d ≪ r/k), or even GCSH is not strong enough and the runtime is quadratic (d ≫ r/k). On human data, however, which contains longer indels and small regions of quadratic search, the additional accuracy of GCSH and the reduced number of states explored by DT provide a significant speedup (Supplementary Section C.10).

Seeds are necessary; matches are optional
The SH exploits the lack of matches to penalize alignments. In our heuristics, the more seeds without matches, the higher the penalty for alignments and the easier it is to dismiss suboptimal ones. In the extreme, not having any matches can be sufficient for finding an optimal alignment in linear time (Supplementary Section C.9).

Modes: near-linear and quadratic
The A* algorithm with a SH has two modes of operation that we call near-linear and quadratic. In the near-linear mode, A*PA expands few vertices because the heuristic successfully penalizes all edits between the sequences. When the divergence is larger than what the heuristic can handle, every edit that is not penalized by the heuristic increases the explored band, leading to a quadratic exploration similar to Dijkstra (Supplementary Sections C.8 and C.10). Regions with high divergence (d ≥ 10%), such as regions with a high error rate or long indels, exceed the heuristic's potential to direct the search and make the exploration quadratic. Low-complexity regions (e.g. with repeats) result in a quadratic number of matches, which also take quadratic time. 2) Computational overhead of A*. Computing states sequentially (as in EDLIB, BIWFA) is orders of magnitude faster than computing them in random order (as in Dijkstra, A*). A*PA outperforms EDLIB and BIWFA (Fig. 4a) when the sequences are long enough for the linear scaling to have an effect (n > 30 kb) and there are enough errors (d > 1%) to trigger the quadratic behaviour of BIWFA.

Future work
1) Performance. We are working on a DP version of A*PA that applies computational volumes (Spouge 1989, 1991), block-based computations (Liu and Steinegger 2023), and a SIMD version of EDLIB's bit-parallelization (Myers 1999). This has already shown a 10× additional speedup on the human datasets and is less sensitive to the input parameters. Independently, the number of matches could be reduced by using variable seed lengths and by skipping seeds having many matches.
2) Generalizations. Our CSH could be generalized to non-unit and affine costs, and to semi-global alignment. Cost models that better correspond to the data can speed up the alignment.
3) Relaxations. At the expense of optimality guarantees, inadmissible heuristics could speed up A*. Another possible relaxation would be to validate the optimality of a given alignment instead of computing it from scratch.
4) Analysis. The near-linear scaling with length of A* is not asymptotic and requires a more thorough theoretical analysis (Medvedev 2023b).

Figure 3. Contours and layers of different heuristics after aligning (n = 48, m = 42, r = 1, k = 3, edit distance 10). Exact matches are shown as black diagonal segments. The background colour indicates S_p(u), the maximum number of matches on a ⪯_p-chain from u to the end, with S_p(u) = 0 in white. The thin black boundaries of these regions are contours. The states of layer L_ℓ precede contour ℓ. Expanded states are green, open states blue, and pruned matches red. Pruning matches changes the contours and layers. GCSH ignores matches m for which m ⪯_T v_t does not hold.

1) Quadratic scaling. Complex data can trigger a quadratic (Dijkstra-like) search, which nullifies the benefits of A*.

Figure 5. Runtime on long human reads. Each dot is an alignment without (left) or with (right) genetic variation. Runtime is capped at 100 s. Boxplots show the three quartiles, and red dots show where the edit distance is larger than the heuristic potential. The median runtime of A*PA (GCSH + DT, k = 15, r = 2) is 3× (left) and 1.7× (right) faster than EDLIB and BIWFA.

Table 1. Definitions of our heuristic functions. Notes: SH orders the matches by i and uses only the seed cost. CSH orders the matches by both i and j. GCSH additionally exploits the gap cost.
• ONT with genetic variation: 48 reads from another human (Bowden et al. 2019), as used in the BIWFA paper (Marco-Sola et al. 2023).

Algorithms and aligners
We compare SH, CSH, and GCSH (all with pruning) as implemented in A*PA to the state-of-the-art exact aligners BIWFA and EDLIB. We also compare to Dijkstra's algorithm and to A* with previously introduced heuristics (the gap cost and character frequencies of Hadlock (1988a), and the SH without pruning of Ivanov et al. (2022)). We exclude SEQAN and PARASAIL since they are outperformed by WFA and EDLIB (Šošić and Šikić 2017, Marco-Sola et al. 2021). We run all aligners with unit edit costs and with traceback enabled.
Figure 4. Runtime comparison on synthetic data. (a, b) Log-log plots comparing variants of our heuristic, including the simplest (SH) and most accurate (GCSH with DT), to EDLIB, BIWFA, and other algorithms (averaged over 10^6–10^7 total bp, seed length k = 15). The slopes of the bottom (top) of the background cones correspond to linear (quadratic) growth. SH without pruning is dotted, and variants with DT are solid. For d = 12%, red dots show where the heuristic potential is less than the edit distance. Missing data points are due to exceeding the 32 GB memory limit. (c) Runtime scaling with divergence (n = 10^5, 10^6 total bp, and k = 15).

the optimal path unpruned and speeding up contour updates. The runtime of A*PA (GCSH with DT) on ONT reads is less than that of EDLIB and BIWFA in all quartiles, with the median being >3× faster. However, the runtime of A*PA grows rapidly when d ≥ 10%, so we set a time limit of 100 s per read, causing six alignments to time out. In real-world applications, the user would either only get results for a subset of alignments, or could use a different tool to align divergent sequences. With genetic variation, A*PA is 1.7× faster than EDLIB and BIWFA in median. Low-divergence alignments are faster than EDLIB, while high-divergence alignments are slower (three sequences with d ≥ 10% time out) because of expanding quadratically many states in complex regions (Supplementary Section C.8). Since slow alignments dominate the total runtime, EDLIB has a lower mean runtime.