A safety framework for flow decomposition problems via integer linear programming

Abstract

Motivation: Many important problems in Bioinformatics (e.g., assembly or multi-assembly) admit multiple solutions, while the final objective is to report only one. A common approach to deal with this uncertainty is finding "safe" partial solutions (e.g., contigs) which are common to all solutions. Previous research on safety has focused on polynomial-time solvable problems, whereas many successful and natural models are NP-hard to solve, leaving a lack of "safety tools" for such problems. We propose the first method for computing all safe solutions for an NP-hard problem, "minimum flow decomposition" (MFD). We obtain our results by developing a "safety test" for paths based on a general integer linear programming (ILP) formulation. Moreover, we provide implementations with practical optimizations aimed at reducing the total ILP time, the most efficient of these being based on a recursive group-testing procedure.

Results: Experimental results on transcriptome datasets show that all safe paths for MFDs correctly recover up to 90% of the full RNA transcripts, which is at least 25% more than previously known safe paths. Moreover, despite the NP-hardness of the problem, we can report all safe paths for 99.8% of the over 27,000 non-trivial graphs of this dataset in only 1.5 h. Our results suggest that, on perfect data, there is less ambiguity than thought in the notoriously hard RNA assembly problem.

Availability and implementation: https://github.com/algbio/mfd-safety


Introduction
In real-world scenarios where an unknown object needs to be discovered from the input data, we would like to formulate a computational problem loosely enough so that the unknown object is indeed a solution to the problem, but also tightly enough so that the problem does not admit many other solutions. However, this goal is difficult in practice, and indeed, various commonly used problem formulations in Bioinformatics still admit many solutions. While a naive approach is to just exhaustively enumerate all these solutions, a more practical approach is to report only those sub-solutions (or partial solutions) that are common to all solutions to the problem.
In the graph theory community such sub-solutions have been called persistent [14,21], and in the Bioinformatics community reliable [54], or more recently, safe [51]. The study of safe sub-solutions started in Bioinformatics in the 1990's [54,11,37] with those amino-acid pairs that are common to all optimal and suboptimal alignments of two protein sequences.
In the genome assembly community, the notion of contig, namely a string that is guaranteed to appear in any possible assembly of the reads, is at the core of most genome assemblers. This approach originated in 1995 with the notion of unitigs [25] (non-branching paths in an assembly graph), which were progressively [42,6] generalized to paths made up of a prefix of nodes with in-degree one followed by nodes with out-degree one [35,24,29] (also called extended unitigs, or Y-to-V contigs).
Later, [51] formalized all such types of contigs as those safe strings that appear in all solutions to a genome assembly problem formulation, expressed as a certain type of walk in a graph. [10,9] proposed more efficient and unifying safety algorithms for several types of graph walks. [45] recently studied the safety of contigs produced by state-of-the-art genome assemblers on real data.
Analogous studies were recently made also for multi-assembly problems, where several related genomic sequences need to be assembled from a sample of mixed reads. [8] studied safe paths that appear in all constrained path covers of a directed acyclic graph (DAG). Zheng, Ma and Kingsford studied the more practical setting of a network flow in a DAG by finding those paths that appear in any flow decomposition of the given network flow, under a probabilistic framework [34], or a combinatorial framework [58]. [27] presented a simple characterization of safe paths appearing in any flow decomposition of a given acyclic network flow, leading to a more efficient algorithm than the one of [58], which was further optimized by [28].
Motivation. Despite the significant progress in obtaining safe algorithms for a range of different applications, current safe algorithms are limited to problems where computing a solution itself is achievable in polynomial time. However, many natural problems are NP-hard, and safe algorithms for such problems are completely missing. Apart from the theoretical interest, usually such NP-hard problems correspond to restrictions of easier (polynomially-computable) problems, and thus, by definition, also have longer safe sub-solutions.
As such, current safety algorithms miss data that could be reported as correct, just because they do not constrain the solution space enough. A major reason for this lack of progress is that if a problem is NP-hard, then its safety version is likely to be hard too. This phenomenon can be found both in classically studied NP-hard problems - for example, computing nodes present in all maximum independent sets of an undirected graph is NP-hard [21] - as well as in NP-hard problems studied for their application to Bioinformatics, as we discuss further in the appendix.
We introduce our results by focusing on the flow decomposition problem. This is a classical model at the core of multi-assembly software for RNA transcripts [33,31,5,50] and viral quasi-species genomes [3,2,44,12], and also a standard problem with applications in other fields, such as networking [36,22,13,23] or transportation [39,38]. In its most basic optimization form, minimum flow decomposition (MFD), we are given a flow in a graph, and we need to decompose it into a minimum number of paths with associated weights, such that the superposition of these weighted paths gives the original flow. This is an NP-hard problem, even when restricted to DAGs [53,22]. Various approaches have been proposed to tackle the problem, including fixed-parameter tractable algorithms [30], approximation algorithms [36,7] and Integer Linear Programming formulations [15,46].
In Bioinformatics applications, reads or contigs originating from a mixed sample of genomic sequences with different abundances are aligned to a reference. A graph model, such as a splice graph or a variation graph, is built from these alignments. Read abundances assigned to the nodes and edges of this graph then correspond to a flow in the case of perfect data. If this is not the case, the abundance values can either be minimally corrected to become a flow, or one can consider variations of the problem where, e.g., the superposition of the weighted paths is closest (or within a certain range) to the edge abundances [50,5].
Current safety algorithms for flow decompositions such as [58,27,26,28] compute paths appearing in all possible flow decompositions (of any size), even though decompositions of minimum size are assumed to better model the RNA assembly problem [30,48,55]. Even dropping the minimality constraint, but adding other simple constraints, easily renders the problem NP-hard (see e.g., [56]), motivating further study of practical safe algorithms for NP-hard problems.
Contributions. Integer Linear Programming (ILP) is a general and flexible method that has been successfully applied to solve NP-hard problems, including in Bioinformatics. In this paper, we consider graph problems whose solution consists of a set of paths (i.e., not repeating nodes) that can be formulated in ILP. We introduce a technique that, given an ILP formulation of such a graph problem, can enhance it with additional variables and constraints in order to test the safety of a given set of paths. An obvious first application of this safety test is to use it with a single path in a straightforward avoid-and-test approach, using a standard two-pointer technique that has been used previously to find safe paths for flow decomposition. However, we find that a top-down recursive approach that uses the group-testing capability halves the number of computationally-intensive ILP calls, resulting in a 3x speedup over the straightforward approach.
Additionally, we prove that computing all the safe paths for MFDs is an intractable problem, confirming the above intuitive claim that if a problem is hard, then its safety version is also hard. We give this proof in the appendix by showing that the NP-hardness reduction for MFD by [22] can be modified into a Turing reduction from the UNIQUE 3SAT problem.
On the dataset [47] containing splice graphs from human, zebrafish and mouse transcriptomes, safe paths for MFDs (SafeMFD) correctly recover up to 90% of the full RNA transcripts while maintaining a 99% precision, outperforming, by a wide margin (25% increase), state-of-the-art safety approaches, such as extended unitigs [35,24,29], safe paths for constrained path covers of the edges [8], and safe paths for all flow decompositions [28,27,26,58]. On the harder dataset by [26], SafeMFD also dominates in a significant proportion of splice graphs (built from t ≤ 15 RNA transcripts), recovering more than 95% of the full transcripts while maintaining a 98% precision. For larger t, precision drastically drops (91% precision in the entire dataset), suggesting that in more complex splice graphs smaller solutions are introduced as an artifact of the combinatorial nature of the splice graph, and that the minimality condition [30,48,55] is thus incorrect in this domain.

Preliminaries
ILP models. In this paper we use ILP models as blackboxes, with as few assumptions as possible to further underline the generality of our approach. Let M(V, C) be an ILP model consisting of a set V of variables and a set C of constraints on these variables, built from an input graph G = (V, E). We make only two assumptions on M. First, that a solution to this model consists of a given number k ≥ 1 of paths P_1, . . ., P_k in G (in this paper, paths do not repeat vertices). Second, we assume that the k paths are modeled via binary edge variables x_uvi, for all (u, v) ∈ E and for all i ∈ {1, . . ., k}. More specifically, for all i ∈ {1, . . ., k}, we require that the edges (u, v) ∈ E for which the corresponding variable x_uvi equals 1 induce a path in G. For example, if G is a DAG, it is a standard fact (see e.g., [49]) that a path from a given s ∈ V to a given t ∈ V (an s-t path) can be expressed with the following constraints:

    Σ_{(s,v) ∈ E} x_svi = 1,   Σ_{(u,t) ∈ E} x_uti = 1,
    Σ_{(u,v) ∈ E} x_uvi = Σ_{(v,w) ∈ E} x_vwi   for all v ∈ V \ {s, t}.        (1)

If G is not a DAG, there are other types of constraints that can be added to the x_uvi variables to ensure that they induce a path; see, for example, the many formulations in [49]. We will assume that such constraints are part of the set C of constraints of M(V, C), but their exact formulation is immaterial for our approach. In fact, one could even add additional constraints to C to further restrict the solution space. For example, some ILP models from [15,46] handle the case when the input also contains a set of paths (subpath constraints) that must appear in at least one of the k solution paths.
Flow decomposition. In the flow decomposition problem we are given a flow network (V, E, f), where G = (V, E) is a (directed) graph with unique source s ∈ V and unique sink t ∈ V, and f assigns a positive integer flow value f_uv to every edge (u, v) ∈ E. Flow conservation must hold for every node different from s and t, namely, the sum of the flow values entering the node must equal the sum of the flow values exiting the node. See Figure 1(a) for an example. We say that k s-t paths P_1, . . ., P_k, with associated positive integer weights w_1, . . ., w_k, are a flow decomposition (FD) if their superposition equals the flow f. Formally, for every (u, v) ∈ E it must hold that

    Σ_{i ∈ {1,...,k} s.t. (u,v) ∈ P_i} w_i = f_uv.

See Figures 1(b) and 1(c) for two examples. The number k of paths is also called the size of the flow decomposition. In the minimum flow decomposition (MFD) problem, we need to find a flow decomposition of minimum size. On DAGs, a flow decomposition into paths always exists [1], but in general graphs, cycles may be necessary to decompose the flow (see e.g. [16] for different possible formulations of the problem).
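The superposition condition above can be checked directly. The following is a minimal sketch (the function name and toy network are ours, for illustration; this is not the SafeMFD code):

```python
def is_flow_decomposition(flow, paths, weights):
    """Check that the weighted s-t paths superpose exactly to the flow.

    flow: dict mapping each edge (u, v) to its positive integer flow value
    paths: list of s-t paths, each a list of nodes
    weights: list of positive integer path weights, one per path
    """
    superposition = {edge: 0 for edge in flow}
    for path, w in zip(paths, weights):
        for edge in zip(path, path[1:]):
            superposition[edge] = superposition.get(edge, 0) + w
    return superposition == flow

# Toy diamond network: 5 units along s-a-t and 3 units along s-b-t.
flow = {("s", "a"): 5, ("s", "b"): 3, ("a", "t"): 5, ("b", "t"): 3}
paths = [["s", "a", "t"], ["s", "b", "t"]]
print(is_flow_decomposition(flow, paths, [5, 3]))  # True
print(is_flow_decomposition(flow, paths, [4, 3]))  # False
```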
For concreteness, we now describe the ILP models from [15] for finding a flow decomposition into k weighted paths in a DAG. They consist of (i) modeling the k paths via the x_uvi variables (with constraints (1)), (ii) adding path-weight variables w_1, . . ., w_k, and (iii) requiring that these weighted paths form a flow decomposition, via the following (non-linear) constraint:

    Σ_{i ∈ {1,...,k}} x_uvi · w_i = f_uv   for all (u, v) ∈ E.

This constraint can then be easily linearized by introducing additional variables and constraints; see e.g. [15] for these technical details. However, as mentioned above, the precise formulation of the ILP model M for a problem is immaterial for our method. Only the two assumptions on M made above matter for obtaining our results.
Safety.Given a problem on a graph G whose solutions consist of k paths in G, we say that a path P is safe if for any solution P 1 , . . ., P k to the problem, there exists some i ∈ {1, . . ., k} such that P is a subpath of P i .
If the problem is given as an ILP model M, we also say that P is safe for M. We say that P is a maximal safe path if P is a safe path and there is no larger safe path containing P as a subpath. [27] characterized safe paths for all FDs (not necessarily of minimum size) using the excess flow f_P of a path P, defined as the flow on the first edge of P minus the flow on the edges out-going from the internal nodes of P and different from the edges of P (see Figure 1(d) for an example). It holds that P is safe for all FDs if and only if f_P > 0 [27].

[Figure 1 sub-captions: (b) An MFD into 4 paths of weights 5, 3, 7, 2, respectively; the green dashed path is a subpath of the orange path. (c) An MFD into 4 paths of weights 3, 4, 1, 9, respectively; the green dashed path is a subpath of the pink path. (d) The two subpaths (red and blue) of the green dashed path that are maximal safe paths for all FDs.]
Fig. 1: Flow decompositions and safe paths. The flow network in (a) admits different MFDs, in (b) and in (c). The path (s, a, b, c, d) (dashed green) is a maximal safe path for MFDs, i.e., it is a subpath of some path of all MFDs and it cannot be extended without losing this property. However, the path (s, a, b, c, d) is not safe for all FDs. Indeed, its two subpaths (s, a, b) (dashed red in (d)) and (b, c, d) (dashed blue in (d)) are maximal safe paths for all FDs. To see this, note that the excess flow of (s, a, b) is 3, while the excess flow of (s, a, b, c) (and of (s, a, b, c, d)) is −6.
The excess flow can be computed in time linear in the length of P (assuming we have pre-computed the flow out-going from every node), thus giving a linear-time verification of whether P is safe.
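As an illustration of this check, here is a small sketch with our own toy network (not the values of Figure 1, and not the SafeMFD code):

```python
def excess_flow(flow, path):
    """Excess flow of a path P: the flow on its first edge minus the total
    flow leaving P's internal nodes on edges that are not edges of P."""
    path_edges = set(zip(path, path[1:]))
    internal = set(path[1:-1])
    leak = sum(f for (u, v), f in flow.items()
               if u in internal and (u, v) not in path_edges)
    return flow[(path[0], path[1])] - leak

# Toy network: 8 units enter a; 5 continue to b, 3 branch off to c.
flow = {("s", "a"): 8, ("a", "b"): 5, ("a", "c"): 3,
        ("b", "t"): 5, ("c", "t"): 3}
print(excess_flow(flow, ["s", "a", "b"]))  # 8 - 3 = 5 > 0, safe for all FDs
print(excess_flow(flow, ["s", "a", "c"]))  # 8 - 5 = 3 > 0, safe for all FDs
```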
A basic property of safe solutions is that any sub-solution of them is also safe. Computing safe paths for MFDs can thus potentially lead to joining several safe paths for FDs, obtaining longer paths from the unknown sequences we are trying to assemble. See Figure 1 for an example of a maximal safe path for MFDs and two maximal subpaths of it that are safe for FDs.

Finding Maximal Safe Paths for MFD via ILP
We now present a method for finding all maximal safe paths for MFD via ILP. The basic idea is to define an inner "safety test" which can be repeatedly called as part of an outer algorithm over the entire instance to find all maximal safe paths. Because calls to the ILP solver are expensive, the guiding choice for our overall approach is to minimize the number of ILP calls. This inspires us to test the safety of a group of paths as the inner safety test, which we achieve by augmenting our ILP model so that it can give us information about the safety of the paths in the set. We use this to define a recursive algorithm to fully determine the safety status of each path in a group of paths. We can then structure the safety test in either a top-down manner (starting with long unsafe paths and shrinking them until they are safe) or a bottom-up manner (starting with short safe paths and lengthening them until they become unsafe).
Safety test (inner algorithm) Let M(V, C) be an ILP model as discussed in Section 2.1; namely, its k solution paths are modeled by binary variables x_uvi for each (u, v) ∈ E and each i ∈ {1, . . ., k}. We assume that M(V, C) is feasible (i.e., the problem admits at least one solution). We first show how to modify the ILP model so that, for a given set of paths, it can tell us one of the following: (1) a set of paths that are not safe (the remaining being of unknown status), or (2) that all paths are safe. The idea is to maximize the number of paths that can be simultaneously avoided from the given set of paths.
Let P be a set of paths. For each path P ∈ P, we create an auxiliary binary variable γ_P that indicates whether P is avoided by (i.e., is not a subpath of) every one of the k solution paths. Since the model solutions are paths (i.e., not repeating nodes), we can encode whether P appears in a solution path by whether all of the ℓ − 1 edges of P (where ℓ is the number of nodes of P) appear in it simultaneously. Using this fact, we add a new set of constraints R(P) that include the γ_P indicator variables for each path P ∈ P:

    Σ_{(u,v) ∈ P} x_uvi ≤ (ℓ − 1) − γ_P   for all i ∈ {1, . . ., k}.        (5)

Next, as the objective function of the ILP model, we require that it should maximize the number of avoided paths from P, i.e., the sum of the γ_P variables:

    max Σ_{P ∈ P} γ_P.        (6)

All paths P such that γ_P = 1 are unsafe, since they were avoided in some minimum flow decomposition. Conversely, if the objective value of Eq. (6) was 0, then γ_P = 0 for all paths in P, and it must be that all paths in P are safe (if not, at least one path could be avoided and increase the objective). We encapsulate this group-testing ILP in a function GroupTest(M, P) that returns a set N ⊆ P with the properties that: (1) if N = ∅, then all paths in P are safe, and (2) if N ≠ ∅, then all paths in N are unsafe (and |N| is maximized). We employ GroupTest(M, P) to construct a recursive procedure GetSafe(M, P) that determines all safe paths in P, as shown in Algorithm 1.

Fig. 2: Illustration of modeling a solution path and a tested path via binary edge variables and safety verification constraints. The i-th solution path P_i is shown in orange, and a tested path P is shown in dashed green. Constraint (5) includes x_sai + x_abi + x_bci + x_cdi + x_dei ≤ 5 − γ_P. This simplifies to γ_P ≤ 0, thus forcing γ_P = 0, which indicates P was not avoided in the solution.
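To make the semantics of the group test concrete, here is a brute-force analog that enumerates an explicit list of solutions instead of solving an ILP (the function names and toy data are ours; the actual method never enumerates solutions):

```python
def occurs_in(p, q):
    """True if path p appears as a contiguous subpath of path q."""
    return any(q[i:i + len(p)] == p for i in range(len(q) - len(p) + 1))

def group_test_bruteforce(solutions, candidates):
    """Return a largest set of candidate paths avoided simultaneously by
    some solution (each solution is a list of paths). An empty result
    certifies that every candidate is safe; a non-empty result is a set
    of unsafe paths."""
    best = set()
    for sol in solutions:
        avoided = {p for p in candidates
                   if not any(occurs_in(p, q) for q in sol)}
        if len(avoided) > len(best):
            best = avoided
    return best

# Two hypothetical solutions over the same toy graph:
solutions = [[(1, 2, 4), (1, 3, 4)], [(1, 2, 4), (1, 2, 3, 4)]]
# (1, 2) appears in both solutions (safe); (1, 3) is avoided by the second.
print(group_test_bruteforce(solutions, [(1, 2), (1, 3)]))  # {(1, 3)}
```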
Algorithm 1: Testing a set of paths P for safety.
Input: A feasible ILP model M(V, C), and a set of paths P Output: Those paths P ∈ P that are safe for M(V, C) 1 Procedure GetSafe(M, P) We note that in the special case that |P| = 1, GetSafe(M, P) makes only a single call to the ILP via GroupTest(M, P) to determine whether or not the given path is safe. With this safety test for a single path, we can easily adapt a standard two-pointer approach as the outer algorithm to find all maximal safe paths for MFD by starting with some MFD solution P_1, . . ., P_k of M(V, C). This same procedure was used in [26] to find all maximal safe paths for FD, using an excess flow check as the inner safety algorithm.
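The recursion of Algorithm 1 can be sketched as follows, with GroupTest abstracted as a callable (the mock oracle and all names are ours, for illustration only):

```python
def get_safe(group_test, paths):
    """Determine all safe paths in `paths` (sketch of Algorithm 1).

    group_test(paths) returns a maximal set of simultaneously avoidable
    (hence unsafe) paths; an empty set certifies that all are safe."""
    if not paths:
        return []
    unsafe = group_test(paths)
    if not unsafe:
        return list(paths)  # the group test certified every path safe
    # Paths outside `unsafe` have unknown status: recurse on them.
    return get_safe(group_test, [p for p in paths if p not in unsafe])

# Mock oracle: pretend exactly the paths in SAFE are safe and that one
# ILP call can avoid all unsafe paths of the group at once.
SAFE = {(1, 2), (2, 3)}
mock_group_test = lambda paths: {p for p in paths if p not in SAFE}
print(get_safe(mock_group_test, [(1, 2), (2, 3), (3, 4)]))  # [(1, 2), (2, 3)]
```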
Find all maximal safe paths (outer algorithm) We give two algorithms for finding all maximal safe paths. Both algorithms use a similar approach; however, the first is top-down: it starts from the original full solution paths, reports those that are safe (these must be maximal safe), and then trims the unsafe paths to find new candidate maximal safe paths. The second is bottom-up in that it tries to extend known safe subpaths until they cannot be further extended (at which point they must be maximal safe). We present the first algorithm in detail and defer discussion of the second to the appendix.
We say a set of subpaths T = {P_i[l_i, r_i]} is a trimming core provided that for any unreported maximal safe path P = P_i[l, r], there is some P_i[l_i, r_i] ∈ T with l_i ≤ l ≤ r ≤ r_i; that is, every unreported maximal safe path is contained in some core subpath. We will use the original k solution paths {P_i} as our initial trimming core; the complete algorithm is given in Algorithm 2. See Fig. 3 in the appendix for an illustration of the algorithm's initial steps. The algorithm first checks if any of the paths in T are safe; if so, these are reported as maximal safe. For those paths that were unsafe, it then considers trimming one vertex from the left and one vertex from the right to create new subpaths. Of these subpaths, some may be contained in a safe path in T; these subpaths can be ignored as they are not maximal safe. The algorithm recurses on those subpaths whose safety status cannot be determined (lines 6-10). In this way, the algorithm maintains the invariant that no paths in T are properly contained in a safe path; thus paths reported in line 4 must be maximal safe.
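A simplified sketch of the top-down trimming, using a full safety oracle in place of the ILP group test (the oracle and toy instance are hypothetical):

```python
def subpath_of(p, q):
    """True if p is a contiguous subpath of q."""
    return any(q[i:i + len(p)] == p for i in range(len(q) - len(p) + 1))

def all_max_safe_top_down(get_safe, core):
    """Report all maximal safe paths trimmable from a trimming core.

    get_safe(paths) returns the safe subset of `paths` (e.g., computed
    via the group-testing procedure of Algorithm 1)."""
    reported, T = [], list(core)
    while T:
        safe = get_safe(T)
        reported.extend(safe)  # core paths that are safe are maximal safe
        trimmed = []
        for p in (q for q in T if q not in set(safe)):
            for q in (p[:-1], p[1:]):  # trim one node from right / left
                if (len(q) >= 2 and q not in trimmed
                        and not any(subpath_of(q, s) for s in reported)):
                    trimmed.append(q)
        T = trimmed
    return reported

# Mock oracle: only subpaths of (1, 2, 3) are safe in this toy instance.
oracle = lambda paths: [p for p in paths if subpath_of(p, (1, 2, 3))]
print(all_max_safe_top_down(oracle, [(1, 2, 3, 4, 5)]))  # [(1, 2, 3)]
```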
Algorithm 2: An algorithm to compute all maximal safe paths that can be trimmed from a trimming core set T .
Input: An ILP model M and a trimming core set T Output: All maximal safe paths for M that are trimmed subpaths of T 1 Procedure AllMaxSafe-TopDown(M, T )

Experiments
To test the performance of our methods, we computed safe paths using different safety approaches and reported the quality and running time performances as described below. Additional details on the experimental setup are given in the appendix.
Implementation details - SafeMFD. We implemented the previously described algorithms to compute all maximal safe paths for minimum flow decompositions in Python. The implementation, SafeMFD, uses the package NetworkX [20] for graph processing and the package gurobipy [19] to model and solve the ILPs, and it is openly available. Our fastest variant (see Table 2 in the appendix for a comparison of running times) implements Algorithm 2 using the group testing in Algorithm 1. We used this variant to compare against other safety approaches. All tested variants of SafeMFD implement the following two optimizations:

1. Before processing an input flow graph we contract it using Y-to-V contraction [51], which is known [30] to maintain (M)FD solution paths. Moreover, since edges in the contracted graph correspond to extended unitigs [35,24,29], source-to-sink edges are further removed from the contracted graph and reported as safe. As such, our algorithms compute all maximal safe paths for funnels [17,26] without using the ILP.

2. Before testing the safety of a path we check if its excess flow [26] is positive. If this is the case, the path is removed from the corresponding test. Having positive excess flow implies safety for all flow decompositions, and thus also safety for minimum flow decompositions.
Safety approaches tested. We compare the following state-of-the-art safety approaches:

EUnitigs: Maximal paths made up of a prefix of nodes with in-degree one followed by nodes with out-degree one; also called extended unitigs [51,35,24,29]. We use the C++ implementation provided by Khan et al. [26] (which computes only the extended unitigs contained in FD paths).

SafeFlow: Maximal safe paths for all flow decompositions [26]. We use the C++ implementation provided by Khan et al. [26].

SafeMFD: Maximal safe paths for all minimum flow decompositions, as proposed in this work. Every flow graph processed is given a time budget of 2 minutes. If a flow graph consumes its time budget, the solution of SafeFlow is output instead.

SafeEPC: Maximal safe paths for all constrained path covers of edges. Previous authors [8,26] have considered safe path covers of the nodes, but for a more fair comparison, we instead use path covers of edges.
To this end, we transform the input graphs by splitting every edge by adding a node in the middle, and run the C++ implementation provided by the authors of [8]. Since flow decompositions are path covers of edges, safe paths for all edge path covers are subpaths of safe paths for MFD. However, we restrict the path covers to those of minimum size and minimum size plus one, as recommended by the authors of [8], to obtain good coverage results while maintaining high precision.
All safety approaches require a post-processing step for removing duplicates, prefixes and suffixes. We use the C++ implementation provided by [26] for this purpose.
Datasets. We use two datasets of flow graphs inspired by RNA transcript assembly. The datasets were created by simulating abundances on a set of transcripts and then perfectly superposing them into splice graphs that are guaranteed to respect flow conservation. As such, the ground truth corresponds to a flow decomposition (not necessarily minimum). To avoid a skewed picture of our results we filtered out trivial instances with a unique flow decomposition (or funnels, see [17,26]) from the two datasets.

Catfish: Created by [48], it includes 100 simulated human, mouse and zebrafish transcriptomes using Flux-Simulator [18], as well as 1,000 experiments from the Sequence Read Archive simulating abundances using Salmon [40]. We took one experiment per dataset, which corresponds to 27,696 non-trivial flow graphs.

RefSim: Created by [8] from the Ensembl [57] annotated transcripts of the GRCh38.104 homo sapiens reference genome, and later augmented by Khan et al. [26] with simulated abundances using the RNASeqReadSimulator [32]. This dataset has 10,323 non-trivial graphs.
Quality metrics. We use the same quality metrics employed by previous multi-assembly safety approaches [8,26]. We provide a high-level description of them for completeness.
Weighted precision of reported paths: As opposed to normal precision, the weighted version considers the length of the reported subpaths. It is computed as the total length of the correctly reported subpaths divided by the total length of all reported subpaths. A reported subpath is considered correct if and only if it is a subpath of some path in the ground truth (exact alignment of exons/nodes).

Maximum coverage of a ground truth path P: The length of the longest segment of P covered by some reported subpath (exact alignment of exons/nodes), divided by |P|.
We compute the weighted precision of a graph as the average weighted precision over all reported paths in the graph, and the maximum coverage of a graph as the average maximum coverage over all ground truth paths in the graph.
F-Score of a graph: Harmonic mean between weighted precision and maximum coverage of a graph, which assigns a global score to the corresponding approach on the graph.
These metrics are computed per flow graph and reported as an average. In the case of the Catfish dataset the metrics are computed in terms of exons (nodes), since genomic coordinates of exons are missing, whereas in the case of the RefSim dataset the metrics are computed in terms of genomic positions, as this information is present in the input.
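Concretely, with node counts as the length measure (as used for the Catfish dataset), the metrics can be sketched as follows (function names and toy data are ours):

```python
def subpath_of(p, q):
    """True if p is a contiguous subpath of q (exact node alignment)."""
    return any(q[i:i + len(p)] == p for i in range(len(q) - len(p) + 1))

def weighted_precision(reported, truth):
    """Total length of correctly reported subpaths over total reported length."""
    correct = sum(len(p) for p in reported
                  if any(subpath_of(p, t) for t in truth))
    return correct / sum(len(p) for p in reported)

def max_coverage(truth_path, reported):
    """Longest segment of a truth path covered by one reported subpath."""
    best = max((len(p) for p in reported if subpath_of(p, truth_path)),
               default=0)
    return best / len(truth_path)

truth = [(1, 2, 3, 4), (1, 5, 4)]
reported = [(1, 2, 3), (1, 5), (2, 5)]  # (2, 5) aligns to no truth path
print(weighted_precision(reported, truth))  # (3 + 2) / 7 ~= 0.714
print(max_coverage(truth[0], reported))     # 3 / 4 = 0.75
```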

Results and Discussion
In the Catfish dataset, EUnitigs and SafeFlow ran in less than a second, while SafeEPC took approximately 30 seconds to compute. On the other hand, solving a harder problem, SafeMFD timed out in only 54 graphs (we use a cutoff of 2 minutes), i.e., only 0.2% of the entire dataset, and took approximately 1.5 hours to compute on the rest of the dataset. This equates to only 0.2 seconds on average per solved graph, underlining the scalability of our approach.
Table 1 shows that SafeMFD, on average, covers close to 90% of the ground truth paths, while maintaining a high precision (99%). This corresponds to an increase of approximately 25% in coverage over its closest competitor, SafeFlow. SafeMFD also dominates in the combined metric of F-Score, being the only safe approach with an F-Score over 90%. Figure 4 in the appendix shows the metrics on graphs grouped by the number t of ground truth paths, indicating the dominance in coverage and F-Score of SafeMFD across all values of t, and indicating that the decrease in precision appears for large values of t (t ≥ 12). In the harder RefSim dataset, EUnitigs and SafeFlow also ran in less than a second, while SafeEPC took approximately 2 minutes. In this case, SafeMFD ran out of time in 1,562 graphs (15% of the entire dataset); however, recall that in these experiments we allow a time budget of only 2 minutes. On the rest of the dataset, it took approximately 7.5 hours in total, corresponding to only 3 seconds on average per graph, again underlining that our method, even though it solves many NP-hard problems for each input graph, overall scales sufficiently well.
Table 1 shows that again SafeMFD dominates in coverage, being the only approach obtaining coverage over 90%, which is a 15% improvement over SafeFlow. This time its precision drops to close to 90%, obtaining an F-Score of 90%, very similar to its closest competitor, SafeFlow. However, recall that coverage is computed only from correctly aligned paths; thus the drop in precision comes only from safe paths not counting towards the coverage metric. If we restrict the metrics to graphs with at most 15 ground truth paths, which is still a significant proportion (84%) of the entire dataset, then SafeMFD has a very high precision (98%) while improving coverage by 15% with respect to SafeFlow. Thus, the drop in precision occurs in graphs with a large number of ground truth paths, which can also be corroborated by Figure 5 in the appendix.
These drops in precision (both in RefSim and Catfish) for large t can be explained by the fact that a larger number of ground truth paths produces more complex splice graphs and introduces more artificial solutions of potentially smaller size. As such, the larger t is, the less likely it is that the ground truth is a minimum flow decomposition of the graph, and thus the more likely it is that SafeMFD reports incorrect solutions. This motivates future work on safety not only for minimum flow decompositions but also for flow decompositions of at most a certain size, analogously to how it is done for SafeEPC. This is still easily achievable with our framework by just changing the ILP blackbox, and keeping everything else unchanged (e.g., the inner and outer algorithms). Namely, instead of formulating the ILP model M(V, C) to admit solutions of exactly the optimal number k of paths, it can be changed to allow solutions of at most k′ paths, for some k′ greater than the optimal k. If k′ is also greater than the number of ground truth paths in these complex graphs, then safe paths are fully correct, meaning that we overall increase precision.

Conclusion
RNA assembly is a difficult problem in practice, with even the top tools reporting low precision values. While there are still many issues that can introduce uncertainty in practice, we can now provide a major source of additional information during the process: which RNA fragments must be included in any parsimonious explanation of the data? Though others have considered RNA assembly in the safety framework [58,26], we are the first to show that safety can be practically used even when we look for optimal (i.e., minimum) size solutions. Our experimental results show that safe paths for MFD clearly outperform other safe approaches on the Catfish dataset, commonly used in this field. On a significant proportion of the second dataset, safe paths for MFD still significantly outperform other safe methods.
More generally, this is the first work to show that the safety framework can be practically applied to NP-hard problems, where the inner algorithm is an efficient test of safety of a group of paths, and the outer algorithm guides the applications of this test. Because our method was very successful on our test datasets, there is strong motivation to try the approach on other NP-hard graph problems whose solutions are sets of paths. For example, we could study other variations of MFD, such as finding flow decompositions minimizing the longest path (NP-hard when flow values are integer [4,43]). The approach given in this paper can also be directly extended to find decompositions into both cycles and paths [16], though not trails and walks, because they repeat edges. We could also formulate a safety test for classic NP-hard graph problems like Hamiltonian path.

Fig. 3: (a) The first group test (using Algorithm 1) on the MFD solution paths {P_1, P_2, P_3, P_4}. (b) Suppose {P_1, P_3} were safe; these are then reported as maximal safe. (c) In this case we trim {P_2, P_4} on both the left and right and make the next group test.

B.1 The bottom-up algorithm
Algorithm 3, detailed below, uses a bottom-up group-testing strategy to find all maximal safe paths.
Definition 1. We say a set of subpaths E = {P_i[l_i, r_i]} is an extending core provided all paths in E are safe and, for any unreported maximal safe path P = P_i[l, r], there is a P_i[l_i, r_i] ∈ E with l ≤ l_i ≤ r_i ≤ r.
Note that the maximal FD-safe subpaths form an extending core (as does the set of all single-edge subpaths of each path). Algorithm 3 finds all maximal safe paths by group testing, starting from an extending core. The idea is to try both left-extending (by one) and right-extending (by one) each subpath in the core; if neither of these extensions is safe, then that core subpath must be maximal safe. Testing all extensions can be done quickly using Algorithm 1. We then recurse on a new core set consisting of those extensions that were found to be safe. Algorithm 3: An algorithm to output all maximal safe subpaths that extend some subpath of an extending core set E.
Input: An ILP model M and an extending core set E
Output: All maximal safe paths for M that extend some path from E
Procedure AllMaxSafe-BottomUp(M, E)
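The recursion above can be sketched as follows. This is a minimal Python illustration, not SafeMFD code: the ILP group test of Algorithm 1 is replaced by a mocked oracle (`MAX_SAFE`, `occurs_in`, and `group_safe` are our illustrative names) that declares a subpath safe iff it occurs inside one of a fixed set of maximal safe paths.

```python
# Bottom-up group-testing sketch (Algorithm 3), with a mocked safety oracle.
# In the real method, group_safe would be a single ILP call on the whole group.

MAX_SAFE = [("a", "b", "c", "d"), ("c", "d", "e")]  # assumed answer (mock only)

def occurs_in(sub, path):
    """True iff `sub` is a contiguous subpath of `path`."""
    n = len(sub)
    return any(tuple(path[i:i + n]) == tuple(sub) for i in range(len(path) - n + 1))

def group_safe(subpaths):
    """Mocked oracle: return the safe members of a group of (path, l, r) triples."""
    return {(p, l, r) for (p, l, r) in subpaths
            if any(occurs_in(p[l:r + 1], q) for q in MAX_SAFE)}

def all_max_safe_bottom_up(core, report):
    """Try left/right one-node extensions of each core subpath; a core subpath
    with no safe extension is maximal safe; recurse on the safe extensions."""
    next_core = set()
    for (p, l, r) in core:
        exts = set()
        if l > 0:
            exts.add((p, l - 1, r))
        if r < len(p) - 1:
            exts.add((p, l, r + 1))
        safe = group_safe(exts)
        if not safe:                 # neither extension is safe:
            report.add(p[l:r + 1])   # the core subpath itself is maximal safe
        next_core |= safe
    if next_core:
        all_max_safe_bottom_up(next_core, report)

P = ("a", "b", "c", "d", "e")                       # one solution path
core = {(P, i, i + 1) for i in range(len(P) - 1)}   # all single-edge subpaths
found = set()
all_max_safe_bottom_up(core, found)
```

Starting from the single-edge extending core, the recursion grows the subpaths until exactly the two mocked maximal safe paths are reported.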

B.2 The two-pointer algorithm
As we observed in Section 2.2, we can test whether a single path P is safe using one ILP call. We will assume that this test is encapsulated as a procedure IsSafe(M, P). Once we can test whether a single path is safe for M(V, C), we can adopt a standard approach to compute all maximal safe paths. Namely, we start by computing one solution P_1, ..., P_k of M(V, C), and then compute maximal safe paths by a two-pointer technique that, for each path P_i, finds all maximal safe paths with only a linear number of calls to the procedure IsSafe [26]. This works as follows. We use two pointers, a left pointer L and a right pointer R. Initially, L points to the first node of path P_i and R to the second node. As long as the subpath of P_i between L and R is safe, we move the right pointer to the next node on P_i. When this subpath is not safe, we output the subpath between L and the previous location of R as a maximal safe path, and we start moving the left pointer to the next node on P_i, until the subpath between L and R is safe. We stop the procedure once we reach the end of P_i. We summarize this procedure as Algorithm 4; see also Figure 6 for an example.
Fig. 6: Illustration of the two-pointer algorithm applied to a flow decomposition path P_i (in orange). In each sub-figure, the subpath P (dashed green) between the nodes pointed to by the left pointer L and the right pointer R is tested for safety by adding the constraints S(P). In (a), IsSafe(M, P) returns True, and the right pointer advances on P_i. In (b), IsSafe(M, P) returns False, and the previous subpath from (a) is output as a maximal safe path. In (c), the left pointer has advanced, and the new path P is tested for safety.
Algorithm 4: The two-pointer algorithm computing all maximal safe subpaths of a given solution path P_i
Input: An ILP model M and one of its k solution paths, P_i = (v_1, ..., v_t), t ≥ 2
Output: All maximal safe subpaths of P_i for M
Procedure AllMaxSafe-TwoPointer(M, P_i)
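The procedure can be sketched as follows. This is a minimal Python illustration, not SafeMFD's API: `is_safe` mocks the single-path ILP test IsSafe(M, P) by checking membership in a fixed set of maximal safe paths (`MAX_SAFE` is our illustrative name), and every single edge of a solution path is assumed safe, as it is for flow decompositions.

```python
# Two-pointer sketch (Algorithm 4), with a mocked single-path safety test.
# In the real method, each is_safe query costs one ILP call.

MAX_SAFE = [("a", "b", "c", "d"), ("c", "d", "e")]  # assumed answer (mock only)

def is_safe(sub):
    """Mocked IsSafe: a subpath is safe iff it occurs inside some MAX_SAFE path."""
    n = len(sub)
    return any(tuple(q[i:i + n]) == tuple(sub)
               for q in MAX_SAFE for i in range(len(q) - n + 1))

def all_max_safe_two_pointer(P):
    """Report all maximal safe subpaths of a solution path P (a node tuple, t >= 2)."""
    out = []
    L, R = 0, 1                       # current window is P[L..R]
    while R < len(P):
        if is_safe(P[L:R + 1]):
            R += 1                    # window safe: extend to the right
        else:
            out.append(P[L:R])        # P[L..R-1] was maximal safe: report it
            L += 1                    # then shrink from the left until safe again
            while L < R and not is_safe(P[L:R + 1]):
                L += 1
    out.append(P[L:R])                # the final safe window reaches the end of P
    return out

result = all_max_safe_two_pointer(("a", "b", "c", "d", "e"))
```

On this toy path the window grows to (a, b, c, d), fails on the full path, slides right, and finishes with (c, d, e), matching the mocked oracle.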

B.3 Running time experiments among different variants proposed
We conducted the experiments on an isolated Linux server with an AMD Ryzen Threadripper PRO 3975WX CPU with 32 cores (64 virtual) and 504 GB of RAM. Time and peak memory usage of each program were measured with the GNU time command. SafeMFD was allowed to run Gurobi with 12 threads. All C++ implementations were compiled with optimization level 3 (-O3 flag). Running time and peak memory are computed and reported per dataset.
SafeMFD includes the following four variants computing maximal safe paths:
TopDown: Implements Algorithm 2 using the group testing in Algorithm 1.
BottomUp: Implements Algorithm 3 (Appendix B.1) using the group testing in Algorithm 1.
TwoPointer: Implements Algorithm 4 (Appendix B.2), testing the safety of a single subpath per ILP call.
TwoPointerBin: Same as TwoPointer, but it additionally replaces the linear scan employed to extend and reduce the currently processed safe path by a binary search.
To compare our four variants we first ran them all on every dataset, and then filtered out those graphs that ran out of time in some variant. This ensures that no variant consumes its entire time budget, and thus our running time measurements are not skewed by the timeouts of unsuccessful inputs. Applying this filter removed 83 graphs from the Catfish dataset (0.3%) and 4,515 graphs from the RefSim dataset (43.74%).
Table 2 shows the running times and number of ILP calls of the different variants on both datasets. TopDown clearly outperforms the rest, being at least twice as fast and performing (roughly) half as many ILP calls. While BottomUp is analogous to TopDown, the superiority of the latter can be explained by the length of maximal safe paths: since maximal safe paths are long, it is faster to obtain them by reducing unsafe paths (TopDown) than by extending safe paths (BottomUp and both TwoPointer variants). On the other hand, TwoPointer is the slowest variant, and BottomUp and TwoPointerBin obtain similar improvements (over TwoPointer) by following different strategies. While BottomUp reduces the number of ILP calls more than TwoPointerBin (best appreciated on the RefSim dataset), the ILP calls of BottomUp take longer (since BottomUp tests several paths at a time and TwoPointerBin only one), and thus their total running times are similar. This motivates future work on combining both approaches, while processing the paths starting from unsafe ones (as in TopDown), for better performance.

C Hardness of Testing MFD Safety
In this section we give a Turing reduction from the UNIQUE 3SAT problem (U3SAT) to the problem of determining whether a given path P in a flow network G is safe for minimum flow decomposition (call this problem MFD-SAFETY). A 3SAT instance belongs to U3SAT if and only if it has exactly one satisfying assignment. U3SAT has been shown to be NP-hard under randomized reductions [52], but it is open whether it is NP-hard in general.
The reduction leverages the construction in [22] that reduces 3SAT to minimum flow decomposition. We first briefly review this construction. A variable gadget (see Fig. 4 in [22]) is created for each 3SAT variable x, and a clause gadget (see Fig. 5 in [22]) is created for each 3SAT clause. Positive literals in each clause receive flow from the left side of the corresponding variable gadget, whereas negative literals receive flow from the right side. Theorem VI.1 in [22] establishes that a 3SAT instance is satisfiable if and only if the constructed flow network has a minimum flow decomposition of a certain size. Any flow decomposition achieving this size must have a specific structure; in particular, there must be a flow path of weight 4 that travels up either the left side of the gadget (setting x to TRUE) or the right side (setting x to FALSE); see Fig. 7. We leverage this construction to reduce U3SAT to MFD-SAFETY.
Theorem 1. There is a polynomial-time Turing reduction from U3SAT to MFD-SAFETY.
Proof. To obtain the desired Turing reduction, instead of checking the size of the MFD, we sequentially check the MFD-SAFETY of the aforementioned side paths traveling up the left and right sides of each variable gadget. Provided each variable gadget has exactly one safe side path, we can then check whether the corresponding truth assignment satisfies each clause. If yes, we accept the instance as belonging to U3SAT; otherwise we reject. Suppose the instance does belong to U3SAT. In this case there is a satisfying assignment, so the MFD must have the structure described above. Furthermore, since there is exactly one satisfying assignment, exactly one side path of each variable gadget must be safe, and so our algorithm finds it and then verifies that the truth assignment satisfies each clause, thus accepting the instance. On the other hand, if the instance does not belong to U3SAT, it is either unsatisfiable or has multiple satisfying assignments. If it is unsatisfiable, then no matter whether the safety checks pass, the corresponding assignment will not satisfy all clauses, so the instance will be rejected. If there are multiple solutions, then any variable that can be both TRUE and FALSE will not have a safe side path in the MFD. This means the safety check will fail and the instance will again be rejected.
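The decision procedure of the proof can be sketched as follows. This is an illustrative Python mock-up, not an implementation of the actual reduction: the MFD-SAFETY oracle is simulated by brute-forcing satisfying assignments, using the property from [22] that a variable's side path is safe exactly when every minimum decomposition (i.e., every satisfying assignment in the construction) sets the variable to the corresponding value. Clauses are given as lists of signed integers (3 means x_3, -3 means NOT x_3); all names are ours.

```python
from itertools import product

def sat_assignments(n_vars, clauses):
    """Brute-force all satisfying assignments; a clause is a list of signed ints."""
    return [bits for bits in product([False, True], repeat=n_vars)
            if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses)]

def u3sat_via_safety(n_vars, clauses):
    """Decide U3SAT membership using a (mocked) MFD-SAFETY oracle per side path."""
    sols = sat_assignments(n_vars, clauses)

    def side_safe(i, value):
        # Mocked oracle: the weight-4 side path of variable i for `value` is
        # safe iff every satisfying assignment sets variable i to `value`.
        return bool(sols) and all(b[i] == value for b in sols)

    assignment = []
    for i in range(n_vars):
        left, right = side_safe(i, True), side_safe(i, False)
        if left == right:          # zero or two safe side paths: reject
            return False
        assignment.append(left)    # left side safe means TRUE
    # Verify that the derived assignment satisfies every clause.
    return all(any(assignment[abs(l) - 1] == (l > 0) for l in c) for c in clauses)
```

A uniquely satisfiable instance is accepted, while unsatisfiable instances (no safe side path survives the clause check) and instances with multiple solutions (some variable has no safe side path) are rejected, mirroring the case analysis in the proof.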

Fig. 3: Illustration of the initial group tests performed by Algorithm 2. Fig. 3(a) shows the first group test (using Algorithm 1) on MFD solution paths {P_1, P_2, P_3, P_4}; suppose {P_1, P_3} were safe (Fig. 3(b)); these are then reported as maximal safe. In this case we trim {P_2, P_4} on both the left and right and make the next group test shown in Fig. 3(c).

Fig. 4 :
Fig. 4: Quality metrics on graphs distributed by number of paths in the ground truth for the Catfish dataset. The metrics are computed in terms of exons/nodes.

Fig. 5 :
Fig. 5: Quality metrics on graphs distributed by number of paths in the ground truth for the RefSim dataset. The metrics are computed in terms of genomic positions.
Fig. 7: The variable gadget from [22], showing only the weight-4 edges (other edges have weights from {1, 2}). A key property established in [22] is that if the 3SAT instance is satisfiable, then in a minimum flow decomposition a weight-4 flow path must travel up either the left side of the gadget (as shown) or the right side. A left flow path indicates the variable should be set to TRUE, while a right one indicates FALSE.

Table 1: Summary of quality metrics for both datasets. For Catfish, the metrics are computed in terms of nodes/exons, and for RefSim in terms of genomic positions; t is the number of ground truth paths.

Table 2: Running times and number of ILP calls of four different variants of SafeMFD.