RecGraph: recombination-aware alignment of sequences to variation graphs

Abstract Motivation Bacterial genomes present more variability than human genomes, which requires important adjustments in computational tools that are developed for human data. In particular, bacteria exhibit a mosaic structure due to homologous recombinations, but this fact is not sufficiently captured by standard read mappers that align against linear reference genomes. The recent introduction of pangenomics provides some insights in that context, as a pangenome graph can represent the variability within a species. However, the concept of sequence-to-graph alignment that captures the presence of recombinations has not been previously investigated. Results In this paper, we present the extension of the notion of sequence-to-graph alignment to a variation graph that incorporates a recombination, so that the latter are explicitly represented and evaluated in an alignment. Moreover, we present a dynamic programming approach for the special case where there is at most a recombination—we implement this case as RecGraph. From a modelling point of view, a recombination corresponds to identifying a new path of the variation graph, where the new arc is composed of two halves, each extracted from an original path, possibly joined by a new arc. Our experiments show that RecGraph accurately aligns simulated recombinant bacterial sequences that have at most a recombination, providing evidence for the presence of recombination events. Availability and implementation Our implementation is open source and available at https://github.com/AlgoLab/RecGraph.

If the optimal alignment has no recombination, then we have already described the recurrence equation for the optimal solution.Alternatively, we run a complete forward pass and a complete backward pass, then we combine those results to place a recombination.
To compute the DP matrix M [v, i, p], we also need to compute Es [v, i, p], Eg[v, i, p], where: (a) M [v, i, p] is the optimal score of the global alignment between the initial portion of the path p ∈ P that ends in the vertex v and the i-long prefix s[: i] of the sequence s (this definition is the same as in the main text); (b) Es[v, i, p] is the optimal score of the global alignment between the initial portion of the path p ∈ P that ends in the vertex v and the i-long prefix s[: i] of the sequence s, such that the alignment ends with the extension of a gap in the query sequence; (c) Eg[v, i, p] is the optimal score of the global alignment between the initial portion of the path p ∈ P that ends in the vertex v and the i-long prefix s[: i] of the sequence s, such that the alignment ends with the extension of a gap in the graph.
The recurrence equations describing those matrices are the following where, for simplicity, we denote with go the gap opening penalty, with ge the gap extension penalty, with m the score of a match, and with m the score of a mismatch: where λ(v) is the character labeling the vertex v and u is the vertex preceding v in p.Moreover, (1) A similar approach is used to compute R[v, i, p], using the matrices Fs [v, i, p], Fg[v, i, p], where: (a) R [v, i, p] is the optimal score of the global alignment between the final portion of the path p ∈ P that begins with the vertex v and the suffix s[: i] of the sequence s (this definition is the same as in the main text); (b) Fs [v, i, p] is the optimal score of the global alignment between the final portion of the path p ∈ P that begins with the vertex v and the suffix s[i :] of the sequence s, such that the alignment begins with the extension of a gap in the query sequence; (c) Fs [v, i, p] is the optimal score of the global alignment between the final portion of the path p ∈ P that begins with the vertex v and the suffix s[i :] of the sequence s, such that the alignment begins with the extension of a gap in the query sequence; the extension of a gap in the graph.

Computing the displacement
We describe here the procedure to efficiently compute the displacement of a recombination (p1, p2, ρ, ψ) where p1 and p2 are two different paths of G = ⟨V, A, P, λ⟩, ρ and ψ are two vertices, not necessarily distinct, respectively of p1 and p2.Moreover, let T = ⟨v0, . . ., vz) be a topological sort of G = ⟨V, A, P, λ⟩, that is i < j for each arc (vi, vj ), and v0 (resp.vz) is the source (resp.the sink) of G = ⟨V, A, P, λ⟩.
Given the bitvectors A[•], for each pair (p1, p2) of paths we compute the bitvectors BV [p1, p2] and CV [p1, p2] containing respectively all branching vertices and all consolidating vertices of the two paths p1 and p2.In fact the vertex vi is a branching vertex for p1 and p2 if and only if two conditions hold: (1) ) + 1) (the vertex of that immediately follows vi in p1 is different from the vertex of that immediately follows vi in p2).A similar procedure computes CV [p1, p2].Since each rank and select require constant time, the overall time complexity to compute all bitvectors BV Then the branching vertex of the recombination (pi, pj , v l , vm) is the vertex v h where h is the largest integer such that BV [pi, pj ][h] = 1 and h ≤ i, h ≤ j, which can be computed in constant time via rank and select on BV .Computing the distance from v h to v l requires two selects on AV ; in fact it is equal to select A similar idea allows to compute in constant time the distances from v h to vm and the distance from v l (or vm) to their consolidating vertex and consequently to compute the displacement of the recombination in constant time.

Possible improvements
In this section we will describe some ideas to speed up the computation of the matrix M (cases 1a, 1b, 1c, 1d).Let us first describe a possible improvement for RecGraph algorithm in mode no recombination.Let us recall that this mode is related to the basic equations used by RecGraph for detecting the optimal alignment of sequence against a specific path p ∈ P of the variation graph, where P denotes a nonempty set of distinguished paths of the graph.A main improvement is reducing the number of matrices that we need to fill, given k distinct paths in the graph.Indeed, we may fill a unique bidimensional matrix M * with entries (v, i) and keep an additional matrix D to be accessed by using the entry v, i of M * and a specific path denoted π(v), called representative path, that is the path for which an optimal cost is achieved for the entry v, i.
The two matrices will replace the computation of matrix M (v, i, pj ) for all paths pi ∈ Q crossing vertex v, which would require a (|V | • |P |n) time complexity.As detailed below, we are able to express the time with the new formula for B the number of times where we have a change of the representative path, given a position j on the query sequence, with 1 ≤ j ≤ n.
The following proposition formalizes the main idea used for replacing the set Q of paths with a single representative path; it is a direct consequence of cases 1c and 1d where a gap is not introduced to compute the scores M .
Proposition 1 (Invariant scores) Let p1 and p2 be two paths traversing the arc (u, v), and let s[: i] be a prefix of the string s.Assume that the cases 1c, 1d that achieve the maximum for M [v, i, p1] are exactly those achieving the maximum for Proof The Proposition is a consequence of cases 1c, 1d, as by these equations it holds that M [w, j, 1] = M [u, j − 1, p1] + x and M [w, j, p2] = M [u, j − 1, p2] + x and thus the inequality holds in these cases.□ By Proposition 1, given p1 and p2 the two paths sharing the common edge (u, v), if no gap is introduced in position i then the differences M [v, i, p1]−M [u, i−1, p1] and M [v, i, p2]−M [u, i−1, p2] are the same.In other words, when computing M [v, i, pi] from M [u, i − 1, pi], for every path pi ∈ Q ⊆ P sharing edge (u, v) we get the same increment.
The iterated applications of Proposition 1 is that whenever a subset Q ⊆ P of all paths share a common subpath from u to v, we need to explicitly compute M * [u, i] = M [u, i, π(u)] for the representative path π(u) ∈ Q along the path from u to v to get the optimal cost in the vertex v reaching j in the input sequence.Then given M * [v, j], we can obtain all other values M [v, j, pi], for any other pi ∈ Q, by adding the difference M [u, i, pi] − M [u, i, π(u)], since that difference remains unchanged for all the vertices of the shared path from u to v. The difference from the optimal cost in vertex v of the value for path pi ∈ Q v, pi, j] is stored in memory in a vector D associated to v and a position j of the input sequence.
Thus only when the representative changes in a vertex v we need to update the vector D associated to v.
In other words the representative path π(v) is the only path for which the score M is explicitly computed, for all other paths, when we need to have an update of the cost, due to a change of the representative, we need to consider vector D. Clearly, the computation of matrix M keeps associated the information represented by the pair (π(v), v), i.e of the representative path and the vertex v, to which we associate the information D. Observe that the vector D may contain negative values.Now, we expect to compute the D vectors only for changes of the representative path and keep D updated for these changes.It follows that assuming that B is the number of times we change the representative path, then the computation of the costs for the paths is O(B • |P | • n) which is summed up to the cost of computing M for a single representative path.Clearly, B in the worst case is |V |, but if the graph has many vertices of degree 2, then we expect to have a lower value for B. Indeed, each time we change the representative in a vertex u for a given position j over the query sequence, then D(u, pi, j) is updated for any path pi crossing the same vertex of the representative path.
This fact can be exploited to speed up the computation of the optimal score M especially when the graph has a large number of non-branching paths, if no indel is introduced in aligning that portion of the path and the corresponding portion of the sequence.
To conclude the section, the discussed improvement can be applied in the implementation of the equation 2b.
Table S1.Information on bacteria species of Experiment 1.For each species, the table reports the number of strains, the total number of genes, the gene lengths (first quartile, average, and third quartile), and the number of recombinant strains extracted via our procedure.

No.
No  The figure consists of three parts: the variation graph is above, the sequence is below, and a representation of the alignment is in the middle.The alignment consists of 20 pairs, where the number above is the position relative to the path of the graph and the number below is the position relative to the sequence.Moreover each pair is connected via a dotted arc to the character of the path and/or the character of the sequence that form the pair.This connection is inspired by the notion of threading scheme of (Maier, 1978).In the representation of the alignment, squares with a white background and norepresent a match, squares with arepresent an indel, and squares with grey background represent a mismatch.

G
Example of recombination alignment of a sequence against a variation graph with a recombination.The recombination arc is represented by the thick black black arc connecting two vertices with gray background and is an arc already present in the graph.Thick vertices and arcs represents the portions of existing paths that form the alignment.In this case, the resulting alignment is a recombination between the red path and the green path.

Figure S1 :
FigureS1: Example of no-recombination alignment of a sequence against a variation graph.The figure consists of three parts: the variation graph is above, the sequence is below, and a representation of the alignment is in the middle.The alignment consists of 20 pairs, where the number above is the position relative to the path of the graph and the number below is the position relative to the sequence.Moreover each pair is connected via a dotted arc to the character of the path and/or the character of the sequence that form the pair.This connection is inspired by the notion of threading scheme of(Maier, 1978).In the representation of the alignment, squares with a white background and norepresent a match, squares with arepresent an indel, and squares with grey background represent a mismatch.

Figure S2 :Figure S3 :Figure S4 :
Figure S2: Example of canonical variation graph with three paths (one for each color).