A variant selection framework for genome graphs

Abstract Motivation Variation graph representations are projected to either replace or supplement conventional single genome references due to their ability to capture population genetic diversity and reduce reference bias. Vast catalogues of genetic variants for many species now exist, and it is natural to ask which among these are crucial to circumvent reference bias during read mapping. Results In this work, we propose a novel mathematical framework for variant selection, by casting it in terms of minimizing variation graph size subject to preserving paths of length α with at most δ differences. This framework leads to a rich set of problems based on the types of variants [e.g. single nucleotide polymorphisms (SNPs), indels or structural variants (SVs)], and whether the goal is to minimize the number of positions at which variants are listed or to minimize the total number of variants listed. We classify the computational complexity of these problems and provide efficient algorithms along with their software implementation when feasible. We empirically evaluate the magnitude of graph reduction achieved in human chromosome variation graphs using multiple α and δ parameter values corresponding to short and long-read resequencing characteristics. When our algorithm is run with parameter settings amenable to long-read mapping (α = 10 kbp, δ = 1000), 99.99% SNPs and 73% SVs can be safely excluded from human chromosome 1 variation graph. The graph size reduction can benefit downstream pan-genome analysis. Availability and implementation https://github.com/AT-CG/VF. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
High-throughput technologies enable rapid sequencing of numerous individuals in a species population and cataloging observed variants. This is leading to a switch from linear representation of a chosen reference genome to graph representations depicting multiple observed haplotypes. Graph representations more accurately reflect the sampled individuals within a population, and their use in genome mapping algorithms reduces reference bias and increases mapping accuracy when sequencing a new individual (Ballouz et al., 2019). There is abundant research on data structures designed for graph representations of genomes and pan-genomes (Garrison et al., 2018;Li et al., 2020), their space-efficient indexing (Chang et al., 2020;Ghaffaari and Marschall, 2019;Holley et al., 2016;Jain et al., 2019b;Kuhnle et al., 2020;Marcus et al., 2014;Sirén et al., 2014) and alignment algorithms (Darby et al., 2020;Ivanov et al., 2020;Jain et al., 2020;Kuosmanen et al., 2018;Rautiainen and Marschall, 2020) to map sequences to reference graphs. For review papers summarizing these developments, see Computational Pan-Genomics Consortium (2018), Eizenga et al. (2020), and Paten et al. (2017).
While graph representations have numerous advantages, complete variation graphs that include every variant have certain drawbacks. The graphs invariably contain paths combining variants across haplotypes, but never seen in any observed haplotype. The number of such recombinant paths increases combinatorically with graph size, and is particularly troublesome when mapping long reads which span greater distances. Inclusion of all variants also makes a pan-genome reference more repetitive, i.e. finding a unique base-to-base alignment per read becomes harder. Accuracy of sequence-to-graph mapping algorithms shows diminishing returns at larger graph sizes, and is even negatively affected eventually (Pritt et al., 2018;Sirén et al., 2020). A few attempts have been made to address the first issue by augmenting paths with haplotype information and specifically developing haplotype-aware indexing strategies (Iqbal et al., 2012;Mokveld et al., 2020;Sirén et al., 2020).
The aforementioned factors point to the need for variant selection algorithms which tame reference graph sizes, and strike the right balance for subsequent mapping accuracy and speed. This was primarily approached through selecting variants from a specific database (Danek et al., 2014;Liu et al., 2016;Schneeberger et al., 2009), based on allelic frequency (Eggertsson et al., 2017;Kim et al., 2018;Maciuca et al., 2016), or specific to a biological context such as limiting to a particular population (Sirén et al., 2014) or genome loci of interest (Dilthey et al., 2015;Jain et al., 2019a;Vijaya et al., 2012). Recently, Pritt et al. (2018) developed a more systematic approach FORGe by developing a mathematical model to prioritize variants, and selecting top scoring variants according to the model. In FORGe, the ranking of each variant is done based on its frequency in a population, and its contribution to run-time and space overhead of a read-to-graph mapper.
In this work, we propose a rigorous algorithmic framework for variant selection from the perspective of preserving subsequent mapping accuracy. Consider a complete variation graph constructed from a set of given haplotypes. Any substring of a haplotype has a corresponding path in the complete variation graph. Not including some variants will introduce errors in the corresponding paths. If the number of such errors is matched with the error tolerance built into sequence-to-graph mapping algorithms, the same identical paths can still be found. We make the following contributions: • We develop a novel mathematical framework for variant selection subject to preserving paths of length a while allowing at most d differences. We separately consider the problems of minimizing the number of positions at which variants are retained, and minimizing the total number of variants selected. • We show that both problems are optimally solvable in polynomial time when only SNPs are considered and the goal is to preserve all paths of length a found in the complete variation graph. • These problems become challenging when deletions and insertions are considered. We present efficient heuristics that guarantee preserving paths of length a while allowing at most d edits, but do not guarantee optimal reduction in graph size. • We empirically evaluate run-time performance and reduction in variation graph sizes achieved by the multiple algorithms that are proposed in this article. For testing, we utilize human chromosome sequences, SNPs and short indels from the 1000 Genomes Project (Consortium et al., 2015), and structural variants (SVs) from 15 diverse humans (Audano et al., 2019) of the above problems where the goal is to only preserve paths of length a actually found in the input haplotypes (i.e. not recombinant paths), and prove that they are N P-hard even for d ¼ 1.

Proposed framework
Let R 1 ; R 2 ; . . . ; R m be m input reference haplotype sequences. To be consistent with current literature, we assume one of these (say R 1 ) is a special reference and the other haplotypes are described as variations from it. A (complete) variation graph of these sequences is represented using an edge-labeled directed multigraph GðV; E; rÞ as follows. The graph consists of haplotype R 1 as a linear backbone, augmented with the set of variants present in R 2 ; R 3 ; . . . ; R m , assumed to be known a priori. Each variant represents a deviating base from R 1 (SNP) or an insertion/deletion (can be multiple bases). The function r : E ! R [ fg specifies edge labels, where R denotes the alphabet and denotes the empty character. The haplotype R 1 is represented in G as a directed chain with character labeled edges such that the chain spells the sequence R 1 . This chain will have jR 1 j þ 1 ordered vertices v 0 ; v 1 ; . . . ; v jR1j . These vertices serve as a convenient coordinate axis for the variation graph. Each SNP variant is an additional labeled edge between vertices at two adjacent coordinates. A deletion variant is an edge-labeled between a pair of vertices, whose coordinates are separated by the deletion length. An insertion variant is represented as a chain of one or more labeled edges that starts and ends at the same vertex. In this setup, the total number of variants at coordinate i (0 i jR 1 j) equals out-degree of the vertex v i minus one. See Figure 1 for an illustration.
Any path in graph G with a nonempty edges spells a string of length a. We place the restriction that a path is allowed to visit a vertex at most twice. This restriction avoids traversal of more than one insertion variant at the same coordinate. Note that any recombination of variants that occur at different positions is allowed. Thus, the graph contains paths corresponding to each haplotype and any substrings thereof, but also numerous additional paths (genotypes) that are not present in any haplotype. It is unknown whether such a recombinant genotype exists in the population or not. Restricting paths to only those that belong to at least one input haplotype can also be useful, and will be considered separately (Section 4).
We seek to compute a reduced variation graph G 0 ðV 0 ; E 0 ; r 0 Þ, where V 0 V; E 0 E, and for all e 2 E 0 ; r 0 ðeÞ ¼ rðeÞ. The reduced graph G 0 corresponds to removing some variants in graph G. Our goal is to reduce graph GðV; E; rÞ to the maximum extent possible while ensuring that any a-long string corresponding to a path in G can be mapped to the same starting vertex (coordinate) in G 0 without exceeding a user-specified error-threshold d. In practice, a should be a function of read lengths whereas d is determined based on sequencing errors and error tolerance of read-to-graph mapping algorithms.
We formulate four versions of the problem based on what types of variants are allowed and the reduction objective. First consider the case where all variants are SNPs. Definition 1.: Graph G 0 is said to be ða; dÞ h -compatible if all a-long strings in graph G can be mapped to their corresponding paths in graph G 0 with Hamming distance d.
Problem 1. Compute an ða; dÞ h -compatible reduced variation graph G 0 with minimum number of coordinates containing one or more variants.
Problem 2. Compute an ða; dÞ h -compatible reduced variation graph G 0 with minimum number of variants.
In Problem 1, we seek to 'linearize' the graph, whereas in Problem 2, we intend to remove as many variants as possible. A user can choose either version based on downstream analysis. For the next two problem versions, suppose the variant set also contains indels.
Definition 2.Graph G 0 is said to be ða; dÞ e -compatible if all a-long strings in graph G can be mapped to their corresponding paths in graph G 0 with edit distance d.
Problem 3. Compute an ða; dÞ e -compatible reduced variation graph G 0 with minimum number of coordinates containing one or more variants.
Problem 4. Compute an ða; dÞ e -compatible reduced variation graph G 0 with minimum number of variants.
In Problems 1 and 2 that consider only SNP variants, a-long paths will begin at a vertex along the coordinate axis as there are no other vertices introduced in the graph. In Problems 3 and 4, however, a path can also begin at other vertices due to insertion variants. In this case, we assume an a-long string that maps to G must also be mappable starting from the corresponding vertex in G 0 if that insertion variant is preserved. If the variant is not preserved, it must be mappable to the closest vertex along the coordinate axis.
3 Proposed algorithms 3.1 Results for variation graphs with SNPs 3.1.1 Greedy algorithm for Problem 1 Here, the goal is to minimize the count of coordinates (positions along the special reference R 1 ) at which variants occur. Based on this objective, we should either fully remove or fully retain all the variants at each variant coordinate. When removing variants at a coordinate, its outgoing edge label is chosen to be the base from R 1 . However, the ða; dÞ h -compatibility is sustained even if the base is chosen from a different haplotype, or any arbitrary character in R.
A path of length a naturally corresponds to a line segment of length a starting at an integer coordinate. Observe that in any a-long segment, we cannot remove variants at > d coordinates without violating the ða; dÞ h -compatibility of reduced graph G 0 (Fig. 2a). A variant coordinate i is contained in a segments of length a each, whose starting positions are in ½i À a þ 1; i. For each variant position, we associate two events with coordinates start i ¼ maxf0; i À a þ 1g and end i ¼ i, respectively. Assuming that the n SNP coordinates are given as sorted array, the corresponding 2n events can be sorted in O(n) time. When two events have equal coordinates, the start event type should be placed earlier than the end event type in the sorted order.
Our greedy algorithm works as follows. Begin by placing an along segment at position 0, and remove variants in the left-most d variant positions and retain the rest (if any). Keep a count of the number of positions within the current segment at which variants are removed. Iteratively consider each event in the sorted order. If the event is of type start i and the count is less than d, the variants at position i are removed and the count is incremented by one. If the event is of type start i but the count is equal to d, the variants at position i are retained. If the event is of type end i and the variants at i were previously removed, the count is decremented by 1. As can be seen, the algorithm maintains ða; dÞ h -compatibility and runs in O(n) time.
Proof of optimality: Suppose the greedy algorithm retains variants at coordinates g 1 ; g 2 ; . . . ; g p in ascending order. Let o 1 ; o 2 ; . . . ; o q be the ordered variant coordinates retained by an optimal solution. Let k be the first position where the solutions differ, i.e. g j ¼ o j for j < k and g k 6 ¼ o k . Due to our greedy strategy, o k < g k . Though o k was chosen by the optimal algorithm, ða; dÞ h -compatibility is not violated until start event for g k is reached. For any path starting at a later coordinate, retaining variants at g k offers the same benefit as retaining at o k . Thus, replacing o k with g k will maintain optimality and ða; dÞ h -compatibility. Hence, the greedy solution is also optimal. Lemma 1.: The above greedy algorithm solves Problem 1 in O(n) time.

A linear programming solution to Problem 2
Here, we seek to minimize the total number of variants retained. Interestingly, we can show that optimal solutions still retain or remove all variants at a coordinate.
Lemma 2.An optimal solution to Problem 2 either retains or removes all variants at a coordinate.
Proof. By contradiction. Suppose there exists an optimal reduced graph G 0 with partially removed variants at coordinate i. Coordinate i already induces an error in some a-long paths in G that contain the coordinate. Accordingly, removal of all variants at coordinate i can be tolerated by all a-long paths containing that coordinate, further implying that graph G 0 must be suboptimal. h Suppose we choose to remove all variants at coordinate i, then this choice reduces the overall count of variants by outðv i Þ À 1, where outðv i Þ is the out-degree of vertex v i . As can be seen, Problem 2 is harder than Problem 1 because the number of variants at different coordinates can be different, leading to variable gains. We address this problem by using an Integer Linear Programming (ILP) system that is polynomially solvable using LP relaxation.
Let p 1 ; p 2 ; . . . ; p n be the n variant coordinates in G in ascending order. Let X be an n Â 1 Boolean column vector where X½i ¼ 1 iff variants are removed at coordinate p i in creating G 0 . Let C be another n Â 1 column vector where C½i ¼ outðv pi Þ À 1, i.e. the reduction achieved in variant count by removing variants at p i . The goal is to maximize C T X. Next, we specify constraints to ensure ða; dÞ hcompatibility of graph G 0 , by not allowing removal of variants at > d coordinates in any a-long segment. Similar to the observation made while addressing Problem 1, it suffices to check this constraint only in the subset of a-sized segments that end at the n variants. Accordingly, let A be a Boolean n Â n matrix such that A½i½j ¼ 1 iff coordinate p j is within the a-sized segment range ðp i À a; p i of coordinate p i . Then, ILP constraints required to ensure ða; dÞ h -compatibility of G 0 can be specified as A Á X B, where B is an n Â 1 column vector with each value ¼ d. We also need to ensure that the X½i's are Boolean. This can be achieved by expanding A to a 2n Â n matrix with the bottom n rows being the n Â n identity matrix, and similarly expanding B to a 2n Â 1 vector with the bottom n entries set to 1. Now, maximizing C T X while satisfying A Á X B leads to an optimal reduced graph G 0 that is ða; dÞ h -compatible.
Run-time complexity: Matrix A exhibits a special structure that guarantees integral optimal LP solutions. Observe that A is a 0-1 matrix, and the 1's appear consecutively in each row of A which makes it an interval matrix (Fulkerson and Gross, 1965). As a result, the above ILP can be solved in polynomial time by solving the corresponding LP, which has Oðn x Þ run-time complexity where x is the exponent of matrix multiplication (van den Brand, 2020).
Lemma 3. The above LP-based algorithm solves Problem 2 in Oðn x Þ time.

Results for variation graphs with SNPs and indels
Variation graphs with indels introduce additional complexity. When considering only SNPs, we benefited from the fact that end vertices of any a-long paths will be located on the coordinate axis. In addition, right end of a path was a fixed distance away from its left along the coordinate axis. When indels are permitted, these properties are no longer true, making Problems 3 and 4 more challenging. We present two heuristic solutions, each of which can be used to solve either problem.

A greedy algorithm
We first propose a 'conservative' greedy heuristic which guarantees an ða; dÞ e -compatible reduced graph that is not necessarily optimal in terms of the desired reduction objectives. We choose to either retain or remove all variants at a coordinate vertex (vertex along R 1 ). Suppose a coordinate vertex v has all three types of variants, i.e. insertions, deletions and SNPs. We evaluate an upper bound of edit distance against any overlapping a-long path if we choose to drop all variants at v. Let D ins ; D del be the longest insertion and deletion variants at vertex v, respectively. Dropping all variants at v can contribute an edit distance of at most D ins þ D del . In cases where only a subset of variant types is present, the bound can be adjusted easily.
The following greedy algorithm is designed to select an appropriate set of coordinates where variants can be removed while ensuring that the graph remains ða; dÞ e -compatible. As before, let p 1 ; p 2 ; . . . ; p n be the n variant coordinates in G in ascending order. Note that an a-long path in graph G can span > a range along the coordinate axis by using deletion edges. For a variant position p i , consider the left-most position p j such that v pi can be reached from v pj by using any path that uses < a labeled edges. The rationale for choosing p j this way is that any a-long path which begins at a variant coordinate vertex prior to v pj cannot pass through vertex v pi . Such a window is precomputed for each variant position, and we ensure that dropped variants within each window collectively contribute to edit distance d. To achieve this, our greedy heuristic is to consider the variant positions from left to right. A variant position is removed if and only if the total sum of differences within its window remains d. It is straightforward to prove that the resulting reduced graph remains ða; dÞ e -compatible.
Run-time complexity: Computing window lengths for each coordinate vertex is the most time-consuming step in the above algorithm because the remaining steps have linear complexity either in terms of count of variants or count of variant positions in graph G. For calculating window lengths in the above algorithm, we can ignore SNP and insertion variants from G, and consider only deletion variants. If y denotes the count of deletion variants, then the modified graph will have exactly jR 1 j þ 1 coordinate vertices and jR 1 j þ y edges. Any vertex v i (i > 0) has exactly one incoming labeled edge (say, from vertex v i1 ) and ! 0 incoming unlabeled edges (say, from vertices v i2 ; v i3 ; . . . ; v i k ). Let the function f(v, x) indicate the left-most vertex that can be reached from v by using a path that uses < x labeled edges. Then, f ðv i ; aÞ equals the left-most vertex among f ðv i1 ; a À 1Þ; f ðv i2 ; aÞ; f ðv i3 ; aÞ; . . . ; f ðv i k ; aÞ. One way to compute this recursion is to compute a vector of values f ðv i ; xÞ 8x 2 ½1; a for each vertex along the coordinate axis going from left to right. This procedure requires Oða Á ðjR 1 j þ yÞÞ time. In practice, y ( jR 1 j, so this procedure effectively requires Oða Á jR 1 jÞ time.

An ILP-based algorithm
Alternatively, we can further improve the greedy heuristic by using ILP. This can be achieved by formulating the edit distance constraints discussed above for each window as a set of ILP constraints. Similar to our LP-based algorithm for Problem 2, A is an n Â n matrix, where row i contains nonzero values for those variants that are within the precomputed window of variant i. For instance, if coordinate p j is within the precomputed window span of coordinate p i ðj iÞ, then A½i½j is set to the estimated upper bound of differences induced by removing all variants at coordinate p j as discussed before. Variable X is an n Â 1 Boolean column vector, where X½i ¼ 1 iff variants at coordinate p i are removed. Then, ILP constraints required to ensure ða; dÞ e -compatibility can be specified as A Á X B, where B is a column vector with each value ¼ d. Define C to be an n Â 1 column vector. While addressing Problem 3, set C½i's as 1, and for Problem 4, set C½i ¼ outðv pi Þ À 1, i.e. the count of variants at coordinate p i . In both cases, the ILP objective is set to maximize C T X. These ILP formulations have higher run-time complexity when compared to the greedy solution, but are guaranteed to provide at least as good and possibly superior reduction for both Problems 3 and 4. Neither algorithm guarantees optimality.

Haplotype-aware variant selection
In the previous problem versions, we considered all a-long paths in graph G. Here, we address the important special case where paths are restricted to correspond to strings observed in haplotypes R 1 ; R 2 ; . . . ; R m . Due to this restriction, fewer a-long strings are checked for mappability. As a result, solutions to the previous problems are suboptimal for this case because further reduction may be possible. We start by making the simplifying assumption that the input haplotypes contain only SNPs, and have equal length. We do not require strings associated with haplotype R 1 (or any other haplotype) to be exactly preserved in a reduced graph. For example, if all SNPs are removed at a variant coordinate, then its single outgoing edge label can come from any of the m haplotypes. Graph G 0 is said to be ða; dÞ r h -compatible if all a-long restricted paths in G map to G 0 with Hamming distance d between the corresponding strings. In this scenario, consider the following problems: Problem 5. Compute an ða; dÞ r h -compatible reduced variation graph G 0 with minimum number of coordinates containing one or more variants.
Problem 6. Compute an ða; dÞ r h -compatible reduced variation graph G 0 with minimum number of variants.
We prove that solving the above problems is N P-hard. We give two reductions for Problem 5. The first is a general reduction whereas the second proves hardness for even d ¼ 1. These reductions trivially generalize to Problem 6. Consider the following decision version of Problem 5. Does there exist an ða; dÞ r h -compatible simplified graph G 0 with k coordinates containing one or more variants? Lemma 4. The decision version of Problem 5 is N P-complete.
Proof. Clearly, the problem is in N P. Recall the decision version of the closest string problem (CSP) (Lanctot et al., 2003). Given a set S of strings each of length l and a parameter d, CSP checks existence of a string that is within Hamming distance of d to each of the given strings. CSP is known to be N P-complete. CSP exhibits a trivial reduction to Problem 5: Assume the collection of reference haplotypes to be S. The following statements are equivalent: (i) there exists a string with Hamming distance d to each of the given strings in S, (ii) there exists an ðl; dÞ r h -compatible graph G 0 with no coordinate containing one or more variants. As a result, decision version of Problem 5, which is stated for an arbitrary value of k, is N P-complete. h CSP is known to be N P-complete even for a binary alphabet, thus also making Problem 5 N P-complete for a binary alphabet. However, CSP is fixed-parameter tractable relative to parameter d (Gramm et al., 2003). Consequently, the above claim does not resolve the complexity of Problem 5 for a constant d. For practical applications, d is expected to be small. We address this in the following lemma.
Lemma 5. The decision version of Problem 5 is N P-complete even if d ¼ 1.
Proof. Recall the decision version of the maximum independent set (MIS) problem. Given an undirected graph, the MIS problem asks for a set of !k vertices no two of which are adjacent. In an MIS graph instance G m ðV m ; E m ), let u 0 ; u 1 ; . . . ; u jVmjÀ1 be the vertices and e 0 ; e 1 ; . . . ; e jEmjÀ1 be the edges. We translate this into a multigraph instance of Problem 5 as follows. Let R ¼ fA; Cg. Define haplotype reference sequences R 0 ; R 1 ; . . . ; R jVmjþjEmj each of length jV m j. The first jE m j sequences are defined using the MIS graph instance G m while the rest are auxiliary: i < jE m jÞ; Observe that edge-labeled variation graph G built by using the above sequences has a coordinate axis with jV m j þ 1 vertices (Fig. 3). Each coordinate 2 ½0; jV m jÞ has two SNPs 'A' and 'C'. Claim: There exists a ksized independent set in G m ðV m ; E m Þ if and only if there exists a ðjV m j; 1Þ r h -compatible reduced variation with jV m j À k coordinates containing one or more variants. Consider the forward direction. Suppose there are k vertices in an independent set I . Build a reduced variation graph G 0 by removing 'C'-labeled outgoing edges from coordinates j (8j such that u j 2 I). Note that G 0 is ðjV m j; 1Þ r h -compatible. Next consider the backward direction. Suppose p 1 ; p 2 ; . . . ; p k are the k coordinates in a compatible simplified graph G 0 where variants are removed. If k ¼ 1, then finding an independent set I of size 1 is trivial. If k > 1, then we note that each of the k coordinates must have a single outgoing edge labeled with 'A' to ensure ðjV m j; 1Þ r h -compatibility with respect to the auxiliary reference sequences. It can be further deduced that fu p1 ; u p2 ; . . . ; u p k g is an independent set of graph G m . h In the formulations of Problems 5 and 6, haplotype R 1 is not given any special significance. An interesting question is whether the problems become tractable if we impose the additional constraint that edges associated with haplotype R 1 must be preserved (assuming R 1 is the standard genome reference). In this case, the reduction from the CSP (i.e. Lemma 4) is no longer applicable. However, it is still possible to design a reduction from the MIS problem (Lemma 5), with a few simple modifications. The revised proof is omitted for brevity.

Experimental results
Hardware and software: We provide Cþþ implementations of all the algorithms presented in Section 3 (https://github.com/at-cg/VF). Among these, the first two handle SNP-based variation graphs (Greedy s and LP s ), and the remaining (Greedy i , ILP v i and ILP p i ) are designed for a generic variation graph containing substitution, insertion and deletion events. Our ILP algorithm (Section 3.2.2) supports two different objective functions, the first minimizes count of variants, and the second minimizes variant-containing positions. Accordingly, their naming, i.e. ILP v i and ILP p i differentiates the two versions, respectively.
Using human variation graphs (Table 1), we assess the graph size reduction achieved by the various algorithms, and also evaluate their run-time performance and scalability. The LP s , ILP v i and ILP p i algorithms make use of Gurobi 9.1.0 solver for LP optimization. All the algorithms were tested on dual Intel Xeon Gold 6226 CPUs (2.70 GHz) processors equipped with 2 Â 12 physical cores and 384 GB RAM. Among the implemented algorithms, only the LP s , ILP v i and ILP p i take advantage of multiple cores via Gurobi, whereas the remaining two are sequential.
Variation graph construction: We tested our algorithms using variation graphs associated with human chromosome 1 (249 Mbp) and chromosome 22 (51 Mbp) respectively. For each chromosome, we built three types of variation graphs, corresponding to (a) SNPs, (b) SNPs and short indels, and (c) SVs, respectively. This is useful to contrast output quality while exploring variant types from point mutations (SNPs) to larger variants. Here, SVs include deletion and insertion events of size ! 50 bp as other type of SVs are currently not supported by our framework. Exclusion of SVs in variation graphs is naturally expected to introduce more differences in a-sized paths, and therefore SVs test the limits of our algorithms. SNP and short indel variants were downloaded from the 1000 Genomes Project Phase 3 (Consortium et al., 2015), and SVs were downloaded from a recent long-read-based SV survey of 15 diverse human genomes (Audano et al., 2019). We used vcftools (Danecek et al., 2011) to parse SNPs and indels from the 1000 Genomes Project variant files. Similarly, SVs other than insertions or deletions were filtered out from the SV files. Summary statistics of these variants, and graphs built using them, are listed in Table 1.
a and d parameters: We tested our algorithms using a values of 150 bp, 1 kbp, 5 kbp and 10 kbp. The first is useful for Illumina reads, whereas the latter are useful for different protocols available for long-read sequencing (e.g. DNA or RNA sequencing using either PacBio or ONT). For each a value, we experimented with d values that are 1%, 5% and 10% of a. Here, 1% corresponds to low error tolerance of a mapping algorithm, and 10% corresponds to significant tolerance.
Performance of Greedy s and LP s algorithms: These algorithms were tested using g_chr1_SNP and g_chr22_SNP graphs (Fig. 4). For both the algorithms, we report four statistics: (i) count of variants retained, (ii) count of variant-containing loci retained, (iii) runtime and (iv) peak memory-usage. LP s and Greedy s algorithms are guaranteed to return optimal graphs in terms of the objectives (i) and (ii), respectively. The results in Figure 4 suggest that the two algorithms perform almost equally well in terms of both objectives for all tested combinations of (a; d) values. Increasing a value while keeping d as a constant fraction of a naturally corresponds to fewer SNPs retained. The same is true when d is increased while keeping a fixed. These results corroborate the fact that longer reads and higher sensitivity of mapping algorithms result in retention of fewer variants in a variation graph. For instance with (a ¼ 10 kbp; d ¼ 1000), Fig. 3. Illustration of reduction used to prove Lemma 5. Vertices selected as independent set are highlighted in red (left). Accordingly, we can find an equivalent reduced variation graph where variants from two vertices are removed (removed edges are highlighted in gray) the Greedy s algorithm retained only 531 (0.009%) out of 6 234 046 SNPs using graph g_chr1_SNP. After this run, average distance between two adjacent loci containing SNPs increased from 39 to 225 963. This suggests that variation selection algorithms can potentially yield long stretches of variant-free regions in graph, where the usual read-to-sequence mapping algorithms can also operate. If run-time is considered, the Greedy s algorithm runs significantly faster than LP s , which was also reflected by our time complexity analysis in Section 3.1. Using graph g_chr1_SNP, LP s algorithm ran out of memory for a ¼ 10 kbp due to increased size of the matrix, i.e. count of nonzeros in the matrix which specifies the LP constraints. Taken together, Greedy s algorithm suffices for most practical purposes because it is fast and optimal in terms of its objective to minimize count of variant-containing loci retained. Greedy s algorithm does not account for the number of variants at a locus while deciding its fate, yet it achieves near-optimal reduction in terms of minimizing the count of variants. This is likely because most human SNPs are biallelic.
Performance of Greedy i and ILP-based algorithms: We tested our Greedy i , ILP v i and ILP p i heuristics using g_chr1_SV and g_chr22_SV graphs (Fig. 5). In contrast to SNPs which are single-base mutations, the sizes of SV indels in chromosome 1 computed by Audano et al. (2019) range from 50 bp to 33 kbp, with mean length 0.5 kbp. As a result, it is natural to expect that the fraction of variants retained will be much higher compared to SNPs. The ILP-based heuristics are guaranteed to achieve superior results than the greedy heuristic, i.e. ILP v i heuristic is expected to retain the smallest count of variants among the three, and similarly ILP p i heuristic will retain the smallest count of variant-containing loci. For instance, with long-read compatible settings (a ¼ 10 kbp; d ¼ 1000), the ILP v i , ILP p i and Greedy i heuristics retained 26.8%, 27.0% and 31.3% SVs, respectively in graph g_chr1_SV. Similarly, 24.8%, 25.0% and 30.6% SVs were retained in graph g_chr22_SV. However, with short-read compatible settings (a ¼ 150 bp; d ¼ 8), all SVs were retained by all three heuristics, as expected. These results are not necessarily optimal, but we do not expect them to deviate significantly from optimal numbers. The rationale is that not only SVs are bigger in size but also SV loci are known to be clustered in several known hot-spots of the human genome, e.g. within the last 5 Mbp of both chromosome arms (Audano et al., 2019).
In terms of count, SVs occur much less frequently as compared to SNPs or indels. As a result, run-time of all the three heuristics was dominated by their first step of computing the constraints required to ensure ða; dÞ e compatibility, which is common in all of them. As shown in Section 3.2, this step requires time proportional to a as well as the length of the reference sequence. Accordingly, we observe that running time is comparable among all three heuristics, appears to be independent of d, scales roughly linearly with a, and time spent is higher using graph g_chr1_SV than g_chr22_SV. With the largest a ¼ 10 kbp value, all three algorithms require about 6 and 1 min to process the two graphs, respectively.
The Greedy i , ILP v i and ILP p i heuristics were also separately tested using graphs containing SNPs and short indels, i.e. g_chr1_SNP_indel and g_chr22_SNP_indel. For convenience, we indicate their output graph statistics as a pair (x, y) such that x and y refer to the fraction of SNPs and indels retained, respectively. When using (a ¼ 150 bp; d ¼ 8) parameters on variation graph g_chr1_SNP_indel, the three heuristics Greedy i , ILP v i and ILP p i retained ð6:6%; 25:9%Þ, ð6:0%; 32:0%Þ and ð6:0%; 32:1%Þ of the variants, respectively. Using long-read settings (i.e. a ¼ 10 kbp; d ¼ 1000), all three heuristics retained only ð0:01%; 0:002%Þ variants. These results suggest that the fraction of indels that should be retained is higher than SNPs while mapping short reads. However, almost all SNPs and short indels can be excluded while mapping long reads.
Impact on sequence-to-graph mappers: We conducted a preliminary evaluation to assess the impact of variant selection on read-to-graph mapping. For this experiment, we considered the reduced graph (a) ( b) Fig. 4. Empirical evaluation of Greedy s and LP s algorithms using two human variation graphs g_chr1_SNP and g_chr22_SNP containing SNPs. These plots demonstrate reduction achieved in graph sizes while varying a and d parameters. Size of the complete variation graph (d ¼ 0%) is included for comparison. Numbers on top of bars present actual data, useful for comparison when both Greedy s and LP s achieve close results. Result of LP s algorithm is missing for a ¼ 10 000 (left-most column) because Gurobi LP solver crashed due to insufficient memory. Y axes are log-scaled in all the above plots obtained by ILP v i using g_chr1_SV graph as input. As discussed previously, the ILP v i heuristic retained 1748 of the 6512 SVs using a ¼ 10 kbp and d ¼ 1000 parameters. We built two variation graphs using VG (v1.29.0) tool corresponding to the complete set of SVs and the reduced set of SVs, respectively. The graph statistics (e.g. vertex degree distribution) were validated to ensure the presence of SVs in the respective graphs. Next, we simulated 10 000 long reads, each of length 10 kbp from randomly chosen paths of the complete variation graph with error-rate 5% using VG's read simulation feature. Subsequently, we made use of GraphAligner (v1.0.11) to map these reads to both variation graphs.
We observed the following. First, all 10 000 reads were successfully mapped by GraphAligner to both the graphs. Second, each read was mapped only using primary alignments, and there were no secondary alignments reported. This indicates that there was no mapping ambiguity while using the reduced variation graph. A direct comparison of mapping coordinates is not feasible because VG used different vertex identifiers in the two graphs which have different topology. However, VG includes a heuristic to project graph coordinates onto the linear reference genome using surject command. We used this command to project true read coordinates as well as the computed read alignments to the linear genome reference. A read is considered to be mapped correctly if any one of its alignments overlaps with ! 50% of the true interval. Using this criteria, 9922 and 9919 reads were found to be correctly mapped to the complete and the reduced variation graph, respectively. We additionally used a chain-like variation graph by removing all variations, and found that 9908 reads could be mapped correctly to this graph.
The impact of missing variations was also observed on the count of reads which had split alignments. Reads with split alignments increased from 2 in the complete variation graph to 209 in the reduced graph. Split read alignments are often used as a signature by variant callers to discover SVs (Rausch et al., 2012). Once the reduced variation graph is used to correctly anchor alignments, the full spectrum of variations in the aligning region can be used for effective genotyping. Upon further inspection of the split read alignments, we found the count of alignments per read varied from 2 to 3 in the complete graph, and from 2 to 5 in the reduced graph. We also observed 295 split read alignments if we map the simulated reads to the chain graph with no variation, but the alignments per read in this case ranged from 2 to 43. Here, 15 reads were found to have >5 alignments. These reads may be difficult to align due to significant edit distance with their closest matching path in the reduced graph. Similar anomalies were observed if we construct a graph by selecting a random subset of 1748 SVs, where 1748 is the count of the SVs retained by the ILP v i algorithm. We note that GraphAligner required similar run-time and memory in all scenarios (about 5 min). This is likely because the graph is nearly linear, due to limited count of SVs (6512) that were available in the complete chromosome 1 graph. This result is preliminary, but motivates a deeper investigation into the impact of the proposed algorithms on various sequence-to-graph mapping algorithms while using a much larger catalog of variants as input. Variant selection tool FORGe (Pritt et al., 2018) uses allelic frequency data as an input to its model. Currently, frequency assessment remains challenging in case of SVs due to the lack of appropriate tools as well as data (Mahmoud et al., 2019). A direct comparison with FORGe could not be carried out due to these limitations.

Conclusions and open problems
We developed a novel mathematical framework for variant selection, and presented multiple algorithms and complexity results for various problems arising from this framework. Experimental results demonstrate substantial reduction in the resulting variation graph sizes, while guaranteeing bounds on the number of errors tolerated while doing so. Implementations of all the four algorithms that are proposed in this paper are available as open-source, and can be used by practitioners for pan-genomic analysis. The path-length and error parameters (a; d) can be tuned to match the choice of sequencing (a) ( b) Fig. 5. Empirical evaluation of Greedy i , ILP v i and ILP p i algorithms using two human variation graphs g_chr1_SV and g_chr1_SV containing insertion and deletion SVs. These plots demonstrate reduction achieved in graph sizes while varying a and d parameters. Numbers on top of bars present actual data, useful for comparison when the three algorithms achieve close results technology, mapping algorithms and types of variants considered. The proposed framework makes the assumption that the mapping algorithms have uniform error tolerance throughout the reference. By experimenting with publicly available human genome variation data, we demonstrated that a significant fraction of small-scale variants, but no large-scale variants can be left out during short-read mapping to variation graphs. On the other hand, almost all smallscale variants and a significant fraction of large-scale variants can be excluded prior to long-read-based analysis.
The proposed variant selection framework underpins a rich class of problems making it fertile ground for future research. (i) Optimal algorithms for the two problems associated with indel variants (Problems 3 and 4) are unknown. In fact, it is not known whether these two problems can be solved in polynomial time. (ii) While we were able to prove that the haplotype-aware versions of the problem are N P-hard, efficient heuristics and approximation algorithms for these problems are yet to be developed. Haplotype-aware algorithmic extensions can result in further reduction of graph sizes because fewer paths need to be preserved. (iii) It may also be possible to further extend this framework and add constraints similar to allele frequency thresholding, e.g. by asking a reduced graph which is allowed to violate error-bound for up to 1% of haplotypes at any position. (iv) This work only considered predominant variant categories-SNPs, insertions and deletions; further research is needed to analyze other variant types such as duplications, inversions and complex genomic rearrangements.