Bit-parallel sequence-to-graph alignment

Abstract

Motivation: Graphs are commonly used to represent sets of sequences. Either edges or nodes can be labeled by sequences, so that each path in the graph spells a concatenated sequence. Examples include graphs to represent genome assemblies, such as string graphs and de Bruijn graphs, and graphs to represent a pan-genome and hence the genetic variation present in a population. Being able to align sequencing reads to such graphs is a key step for many analyses and its applications include genome assembly, read error correction and variant calling with respect to a variation graph.

Results: We generalize two linear sequence-to-sequence algorithms to graphs: the Shift-And algorithm for exact matching and Myers' bitvector algorithm for semi-global alignment. These linear algorithms are both based on processing w sequence characters with a constant number of operations, where w is the word size of the machine (commonly 64), and achieve a speedup of up to w over naive algorithms. For a graph with |V| nodes and |E| edges and a sequence of length m, our bitvector-based graph alignment algorithm reaches a worst-case runtime of O(|V| + ⌈m/w⌉|E| log w) for acyclic graphs and O(|V| + m|E| log w) for arbitrary cyclic graphs. We apply it to five different types of graphs and observe a speedup between 3-fold and 20-fold compared with a previous (asymptotically optimal) alignment algorithm.

Availability and implementation: https://github.com/maickrau/GraphAligner

Supplementary information: Supplementary data are available at Bioinformatics online.
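As background for the sequence-to-sequence case generalized here, the Shift-And algorithm admits a compact sketch for exact matching of a pattern of length at most w; the function and variable names below are ours, not from the GraphAligner implementation:

```c
#include <stdint.h>
#include <string.h>

/* Classic Shift-And exact matching: returns the 0-based end position of the
 * first occurrence of pattern (length <= 64) in text, or -1 if absent.
 * Bit j of the state word records whether pattern[0..j] matches a suffix of
 * the text read so far; one shift and one AND advance all positions at once. */
static long shift_and(const char *text, const char *pattern) {
    size_t m = strlen(pattern);
    uint64_t mask[256] = {0}; /* mask[c]: bit j set iff pattern[j] == c */
    for (size_t j = 0; j < m; j++)
        mask[(unsigned char)pattern[j]] |= 1ULL << j;
    uint64_t state = 0, accept = 1ULL << (m - 1);
    for (size_t i = 0; text[i]; i++) {
        state = ((state << 1) | 1ULL) & mask[(unsigned char)text[i]];
        if (state & accept)
            return (long)i;
    }
    return -1;
}
```

Each text character costs a constant number of word operations regardless of the pattern length, which is the source of the up-to-w speedup mentioned above.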


A. Existence and uniqueness of a solution for Recurrence (1)

[Figure 1, caption excerpt] ... The gray arrows are the backtrace and the solid black arrows show the optimal alignment. Right: the DP graph for the same sequence graph and sequence. The dashed grey edges have a cost of 1, and the solid black edges have a cost of 0. The source node v^A_source is at the top. The distance from the source node to any node is equal to the score in the DP matrix.
Recurrence (1) cannot be directly used to calculate the scores for cyclic regions. Instead, we consider this recurrence a constraint on the cell scores, and then find an assignment of scores to the cells which satisfies the recurrence. There exists exactly one assignment of scores that satisfies the recurrence, which we prove in the following. To this end, we employ a classic relationship between edit distances and shortest path problems: we define an alignment graph with one node for each cell in our DP table, such that the distance from a virtual source node equals the value in the corresponding cell (Myers, 1991), as illustrated in Figure 1 (right). Formally, the alignment graph (not to be confused with a sequence graph) is defined as follows.
Definition 1 (Alignment graph). Given a string s ∈ Σ^m and a sequence graph G = (V, E, σ) with V = {v_1, ..., v_n}, we define the corresponding alignment graph G^A through a node set V^A := V × {1, ..., |s|} and a weight function

    w((v_k, j), (v_i, j')) =
        1          if j' = j and (v_k, v_i) ∈ E          (horizontal),
        1          if j' = j + 1 and v_i = v_k           (vertical),
        Δ_{i,j'}   if j' = j + 1 and (v_k, v_i) ∈ E      (diagonal),
        ∞          otherwise,

where Δ_{i,j} = 0 if σ(v_i) = s_j and Δ_{i,j} = 1 otherwise. An edge exists between two nodes from V^A if the corresponding weight is finite. Additionally, we add a special node v^A_source and Δ_{i,1}-weight edges from v^A_source to every node (v_i, 1).
Theorem 1 (Existence). For any sequence graph G = (V, E, σ) and sequence s ∈ Σ^m, there exists an assignment of cell scores C_{i,j} that satisfies the recurrence given in Definition 3.
Proof. We consider the corresponding alignment graph. We observe that none of its edges can go "up", that is, for any edge (v_k, j) → (v_i, j'), we always have j ≤ j' by Definition 1. Hence, all cycles are among nodes in the same "row" (i.e. for the same value of j). Such "horizontal" edges with j' = j have a weight of 1 by definition of the weights. Together, this implies that all cycles have a positive, non-zero cost. That, in turn, implies that the minimum distance from the source node v^A_source to any other node is well-defined and unique. We set C_{i,j} to the minimum distance from v^A_source to (v_i, j). This assignment of C_{i,j} values satisfies the recurrence in Definition 3, which follows immediately by induction since the weights in Definition 1 mirror Recurrence (1).
Theorem 2 (Uniqueness). For any sequence graph G = (V, E, σ) and sequence s ∈ Σ^m, there is at most one assignment of cell scores C_{i,j} that satisfies Definition 3.
Proof. The existence of a solution was established in Theorem 1 and we denote the cell scores given by the shortest paths in the corresponding alignment graph by C_{i,j}. Suppose, for the sake of contradiction, that there exists a different assignment C'_{i,j} for i ∈ {1, ..., |V|} and j ∈ {1, ..., |s|}. Let i' and j' be indices such that C'_{i',j'} ≠ C_{i',j'}. Now consider a shortest path from v^A_source to (i', j'), corresponding to a sequence of nodes v^A_source, (i_1, j_1), ..., (i_n, j_n) with (i_n, j_n) = (i', j'). Let k be the smallest index such that C'_{i_k,j_k} ≠ C_{i_k,j_k}. Note that k > 1 because the initialization of the first row of the DP table (Definition 3) is identical to the choice of weights of edges from the source node v^A_source to the first row of vertices (Definition 1), and hence C'_{i_1,j_1} = C_{i_1,j_1}. If C'_{i_k,j_k} > C_{i_k,j_k}, then C'_{i_k,j_k} violates Recurrence (1), because Recurrence (1) minimizes over all possible predecessor cells, which correspond exactly to the incident nodes of (i_k, j_k) in the alignment graph, and the term in Recurrence (1) containing C'_{i_{k-1},j_{k-1}} = C_{i_{k-1},j_{k-1}} hence gives rise to a smaller value, a contradiction. If C'_{i_k,j_k} < C_{i_k,j_k}, then (i_k, j_k) lies at distance less than C_{i_k,j_k} from v^A_source, because a shorter path can be obtained by following the "backtrace" through the DP table, i.e. by following the sequence of cells given by the minima picked in Recurrence (1), contradicting that C_{i_k,j_k} is the shortest distance.
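To illustrate Theorems 1 and 2, the following sketch solves the recurrence on a small cyclic sequence graph by fixpoint iteration: since all cycles in the alignment graph have positive cost, repeated relaxation converges to the unique solution. The toy graph, its labels, and all names are illustrative assumptions, not part of the GraphAligner implementation:

```c
#include <limits.h>

enum { N = 3, M = 3 };

/* Toy cyclic sequence graph: three nodes labeled A, C, G with edges
 * A->C, C->G, G->A (a directed cycle). edge[k][i] == 1 iff v_k -> v_i. */
static const char label[N] = {'A', 'C', 'G'};
static const int edge[N][N] = {
    {0, 1, 0},
    {0, 0, 1},
    {1, 0, 0},
};
static const char s[M + 1] = "ACG"; /* the sequence to align */

/* Solve the recurrence on a cyclic graph by fixpoint iteration: initialize
 * the first row with the match/mismatch cost and repeatedly relax every cell
 * with the vertical, diagonal and horizontal terms until nothing changes.
 * Positive cycle costs guarantee convergence to the unique solution. */
static void solve(int C[N][M]) {
    for (int i = 0; i < N; i++) {
        C[i][0] = (label[i] != s[0]); /* first-row initialization */
        for (int j = 1; j < M; j++)
            C[i][j] = INT_MAX / 2;
    }
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int j = 1; j < M; j++)
            for (int i = 0; i < N; i++) {
                int best = C[i][j - 1] + 1; /* vertical: insertion */
                for (int k = 0; k < N; k++)
                    if (edge[k][i]) {
                        int diag = C[k][j - 1] + (label[i] != s[j]);
                        int horiz = C[k][j] + 1; /* deletion along an edge */
                        if (diag < best) best = diag;
                        if (horiz < best) best = horiz;
                    }
                if (best < C[i][j]) { C[i][j] = best; changed = 1; }
            }
    }
}
```

On this example the path A→C→G spells "ACG" exactly, so the corresponding final cell receives score 0, while the other cells in the last row settle at score 1.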

B. Proof of Theorem 1 (Vertical property)
We distinguish three cases, based on which of the three terms in Recurrence (1) takes the minimum (note that more than one term can have the minimum value).

Case 1 (vertical, C_{i,j} = C_{i,j-1} + 1): Then C_{i,j-1} = C_{i,j} - 1 ≤ C_{i,j} + 1 and the claim holds immediately.

Case 2 (diagonal, C_{i,j} = C_{k,j-1} + Δ_{i,j} for some in-neighbor v_k of v_i): Applying the horizontal term of Recurrence (1) in row j-1 gives C_{i,j-1} ≤ C_{k,j-1} + 1 = C_{i,j} - Δ_{i,j} + 1 ≤ C_{i,j} + 1.

Case 3 (horizontal, C_{i,j} = C_{k,j} + 1 for some in-neighbor v_k of v_i): Our proof for this case is by induction on the values of the cells in a row. That is, to prove the claim C_{i,j-1} ≤ C_{i,j} + 1 for cell C_{i,j}, we assume that it holds for all cells in the same row with smaller value, i.e. for all cells C_{i',j} with C_{i',j} < C_{i,j}. Note that the claim holds for all cells with minimum value in each row, because they are covered by Case 1 or Case 2, but cannot fall under Case 3 (which requires a predecessor cell C_{k,j} < C_{i,j} in the same row). By the induction hypothesis, we have C_{k,j-1} ≤ C_{k,j} + 1 and, by the assumption of Case 3, we get C_{i,j-1} ≤ C_{k,j-1} + 1 ≤ C_{k,j} + 2 = C_{i,j} + 1.

C. Details of bitvector merging algorithm
Figure 2 contains a running example of merging two bitvectors with an O(log w) algorithm, which we use to explain the algorithm. For completeness, we provide full pseudocode as Algorithm 1 below and refer to its line numbers in the following.
The input to the bitvector merging algorithm is two bitvectors A and B, consisting of the values VP_A, VN_A, S^before_A, S^end_A and VP_B, VN_B, S^before_B, S^end_B (step A). We assume that S^before_A ≤ S^before_B. These bitvectors implicitly represent the values S^A_i and S^B_i (step B). The output is the bitvector representation (VP_O, VN_O, S^before_O, S^end_O) of a column S^O whose values are the minimum of the two columns represented by the input bitvectors, that is, ∀i ∈ {0, 1, ..., w − 1}: S^O_i = min(S^A_i, S^B_i). First, we need to find the difference masks M_{A>B} and M_{B>A}, which mark the cells where the score of A is higher than that of B and vice versa (step D). To do this, we first verify that the score differences S^A_i − S^B_i are in the range (−2w, 2w) (lines 24-29). We need the popcount operation for this; most processors provide it as a specialized O(1) instruction. We define a variable D, split into chunks, where each chunk represents the score difference S^A_i − S^B_i at a certain index i. Each chunk consists of log2(w) + 2 bits and stores a value in two's complement. In this way, each chunk can accommodate a score difference between −2w and 2w. The chunks essentially simulate a SIMD-like architecture, with the chunks corresponding to SIMD registers, and parallel addition, subtraction and comparison-to-zero instructions implemented with normal arithmetic and bitwise operations. This enables calculating the score differences S^A_i − S^B_i in parallel. The chunks of D are first initialized with the score difference at the bit just before the start of each chunk (step F, lines 30-34). Then D is updated in parallel to refer to the next bit (step G, lines 36-37). The bitvectors VP_A, VP_B, VN_A and VN_B represent the score differences between consecutive cells and are used to update D: since D represents the current score difference S^A_i − S^B_i, we can update it to the next index by adding A's score difference (from VP_A and VN_A) and subtracting B's score difference (from VP_B and VN_B).
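The chunked SIMD-in-a-register arithmetic described above can be sketched as follows, using 8-bit chunks for simplicity (the algorithm uses chunks of log2(w) + 2 bits); the masks and function names here are ours, not from the implementation:

```c
#include <stdint.h>

/* SWAR sketch of the chunked arithmetic: each 8-bit chunk of a 64-bit word
 * holds a small signed value in two's complement, standing in for the
 * score-difference chunks of D. */

#define SWAR_SIGN 0x8080808080808080ULL /* sign bit of every chunk */

/* Parallel per-chunk addition: carries are kept from crossing chunk borders
 * by adding the low bits separately and restoring the sign bits with XOR. */
static uint64_t swar_add(uint64_t a, uint64_t b) {
    return ((a & ~SWAR_SIGN) + (b & ~SWAR_SIGN)) ^ ((a ^ b) & SWAR_SIGN);
}

/* Parallel per-chunk subtraction, with borrows confined analogously. */
static uint64_t swar_sub(uint64_t a, uint64_t b) {
    return ((a | SWAR_SIGN) - (b & ~SWAR_SIGN)) ^ ((a ^ b ^ SWAR_SIGN) & SWAR_SIGN);
}

/* Parallel comparison to zero: mask holding the sign bit of every
 * negative chunk (the basis for computing M_{B>A}). */
static uint64_t swar_negative(uint64_t d) {
    return d & SWAR_SIGN;
}
```

For example, with chunk 0 holding 3 and chunk 1 holding −2 in one word, adding a word whose chunks both hold 1 yields chunks 4 and −1, and the sign mask then flags chunk 1 as negative.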
This is implemented by shifting the appropriate bits of the VP and VN bitvectors to the start of each chunk and selecting the least significant bit, essentially initializing the chunk-registers to either 0 or 1, which are then added to and subtracted from D (step G, lines 36-37). Once D has been updated to the next bit, M_{A>B} and M_{B>A} can be updated based on the values in the chunks of D (step G, lines 38-41). Since the chunks of D represent the score difference, M_{A>B} must be set at the indices where the value of D is greater than 0, and M_{B>A} where the value is less than 0. As the values are stored in two's complement, the case D < 0 is checked by selecting the sign bit and shifting it to the current position (lines 38 and 40). To check D > 0, we select the indices where D is neither negative nor zero (lines 39 and 41). To update the output, we first calculate a picking mask M_p which determines whether a bit needs to be taken from S^A (line 46). The picking mask is set to 1 wherever M_{A>B} = 1, to 0 wherever M_{B>A} = 1, and at all other indices it copies the neighboring bit from the least significant direction. However, in practice it is faster to precompute the difference masks for each possible 8-bit combination of S^before_B − S^before_A, VP_A, VP_B, VN_A and VN_B. With this lookup table, finding the difference masks takes w/8 memory lookups; merging is then O(w) but faster in practice than the O(log w) algorithm.
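For reference, the effect of the merge can be expressed by a naive (non-bit-parallel) sketch that decodes the implicitly represented score columns and takes their element-wise minimum; the exact indexing convention and all names here are illustrative assumptions:

```c
#include <stdint.h>

enum { W = 64 }; /* machine word size */

/* Decode an implicitly represented score column: starting from the score
 * before the column, each VP bit adds 1 and each VN bit subtracts 1.
 * (The indexing convention is an assumption made for this sketch.) */
static void decode_column(uint64_t vp, uint64_t vn, int sBefore, int out[W]) {
    int s = sBefore;
    for (int i = 0; i < W; i++) {
        s += (int)((vp >> i) & 1) - (int)((vn >> i) & 1);
        out[i] = s;
    }
}

/* Naive reference merge: the element-wise minimum of the two decoded
 * columns, i.e. the column that the bit-parallel merge represents. */
static void merge_min(const int a[W], const int b[W], int out[W]) {
    for (int i = 0; i < W; i++)
        out[i] = a[i] < b[i] ? a[i] : b[i];
}
```

The bit-parallel algorithm computes the same result while keeping the columns in their (VP, VN, S^before, S^end) representation throughout.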
Algorithm 1 The bitvector merging algorithm
 1: chunkSize ← (log2 w) + 2, rounded up to the nearest power of two
 2: M_lsb ← a constant mask which has 1 at each chunk's least significant bit and 0 elsewhere
 3: M_sign ← a constant mask which has 1 at each chunk's sign bit and 0 elsewhere
 4: function ChunkPrefixSums
 ⋮
 6:     x ← popcount for each chunkSize-bit block
 7:     x ← (x chunkSize) + extra
 8:     x ← x * M_lsb    ▷ x now has the prefix sum of the set bits at each chunk boundary
 ⋮
    return S^before_out = min(S^before_A, S^before_B), S^end_out = min(S^end_A, S^end_B), VP_out, VN_out
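The chunk prefix-sum step (lines 6-8) relies on the classic trick that multiplying the per-chunk popcounts by a mask with 1 at each chunk's least significant bit accumulates prefix sums across the chunks. A sketch with 8-bit chunks (chunk size chosen for illustration, not taken from the implementation):

```c
#include <stdint.h>

#define CHUNK_LSB 0x0101010101010101ULL /* 1 at each chunk's least significant bit */

/* Per-chunk popcount via the standard SWAR reduction: afterwards, byte i
 * holds the popcount of byte i of the input. */
static uint64_t chunk_popcounts(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    return (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
}

/* Multiplying the per-chunk counts by CHUNK_LSB sums them upwards: byte i of
 * the product holds the total popcount of bytes 0..i, i.e. the prefix sum of
 * set bits at each chunk boundary. No overflow occurs since the total number
 * of set bits is at most 64, which fits in one 8-bit chunk. */
static uint64_t chunk_prefix_sums(uint64_t x) {
    return chunk_popcounts(x) * CHUNK_LSB;
}
```

For instance, for the input 0xF0F0 the per-chunk counts are 4 and 4 in the two lowest chunks, so the prefix sums read 4, 8, 8, ..., with the topmost chunk holding the total popcount.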