Coverage-preserving sparsification of overlap graphs for long-read assembly

Abstract Motivation Read-overlap-based graph data structures play a central role in computing de novo genome assembly. Most long-read assemblers use Myers’s string graph model to sparsify overlap graphs. Graph sparsification improves assembly contiguity by removing spurious and redundant connections. However, a graph model must be coverage-preserving, i.e. it must ensure that there exist walks in the graph that spell all chromosomes, given sufficient sequencing coverage. This property becomes even more important for diploid genomes, polyploid genomes, and metagenomes where there is a risk of losing haplotype-specific information. Results We develop a novel theoretical framework under which the coverage-preserving properties of a graph model can be analyzed. We first prove that de Bruijn graph and overlap graph models are guaranteed to be coverage-preserving. We next show that the standard string graph model lacks this guarantee. The latter result is consistent with prior work suggesting that removal of contained reads, i.e. the reads that are substrings of other reads, can lead to coverage gaps during string graph construction. Our experiments done using simulated long reads from HG002 human diploid genome show that 50 coverage gaps are introduced on average by ignoring contained reads from nanopore datasets. To remedy this, we propose practical heuristics that are well-supported by our theoretical results and are useful to decide which contained reads should be retained to avoid coverage gaps. Our method retains a small fraction of contained reads (1–2%) and closes majority of the coverage gaps. Availability and implementation Source code is available through GitHub (https://github.com/at-cg/ContainX) and Zenodo with doi: 10.5281/zenodo.7687543.


Introduction
Accuracy of long reads has improved tremendously over the last 3 years (Wenger et al. 2019). For instance, PacBio HiFi sequencing technology yields consensus sequencing reads that are both long (averaging 10-25 kb) and highly accurate (averaging 99:8%). Similarly, modal raw read accuracy above 99% has been reported using reads from Oxford Nanopore Technology (ONT) sequencing (Sereika et al. 2022). Consequently, long-read technologies have shown the greatest promise in computing high-quality haplotyperesolved genome assemblies .
The objective of genome assembly is to reconstruct the original sequence from a large number of reads. Read-overlap-based assembly algorithms work by constructing an overlap graph where each vertex corresponds to a read, and edges represent suffix-prefix overlaps between the reads. Practical algorithms account for the following two key challenges while computing genome assembly. First, due to the presence of repetitive sequences, it is usually possible to reconstruct different genomes using the same input set of reads, each of which is fully consistent with the data (Medvedev and Pop 2021). Accordingly, problem formulations for genome assembly which seek a single genome reconstruction, e.g. by finding a Hamiltonian cycle in an overlap graph, or computing an Eulerian cycle in a de Bruijn graph, are not used in practice. Instead, assemblers compute contigs which are long, contiguous segments that, in principle, correspond to substrings of the source genome (Tomescu and Medvedev 2017). Second, the presence of repeats, variable-length reads and read errors also leads to too many edges in the overlap graph. This challenge is addressed by using graph sparsification heuristics that exclude vertices and edges which are likely to be redundant or false.
The commonly used string graph model by Myers (2005) (i) removes reads that are contained as a substring of a longer read ( Fig. 1), (ii) ignores suffix-prefix overlaps that are shorter than the longest possible between a read pair, and (iii) removes transitive edges. More aggressive sparsification procedures also exist in the literature such as the best overlap graph approach of Miller et al. (2008) which retains only the best overlaps to a given read, where "best" is defined using some criterion. Despite the use of such graph sparsification techniques in most long-read assembly tools, theoretical understanding of these heuristics is fairly limited. A rigorous analysis of these sparsification techniques is important, especially in the context of diploid and polyploid genomes, because these can lead to loss of useful haplotype-specific information from an assembly graph. Shomorony et al. (2016) and Hui et al. (2016) gave provably efficient algorithms for overlap graph sparsification for fixed-length and variable-length reads, respectively. However, their formulations make a simplifying assumption that the input reads are long enough to avoid ambiguity caused by repeats, which may not hold in general.
Assuming a genome is sequenced with sufficient coverage, then there must exist walks in an assembly graph that can spell the true sequences corresponding to each chromosome. Due to repeats, these walks need not correspond to non-branching unitigs. An assembly graph which guarantees this property is said to be coveragepreserving (formally defined in Section 2). In this article, we make the following contributions: 1. We formulate the coverage-preserving property of genome assembly graph models, and perform a rigorous evaluation of three commonly used models: (i) de Bruijn graphs, (ii) overlap graphs, and (iii) string graphs. We show that de Bruijn graphs and overlap graphs are guaranteed to be coverage preserving, but string graphs are not. 2. Graph sparsification is a critical step during genome assembly to prune the overlap graph because it helps to compute longer contigs. We develop theoretical results to compute a sparse overlap graph while preserving the coverage-preserving property. 3. We extend the proposed theory into practical heuristics and a prototype software ContainX. The proposed heuristics are useful to identify redundant contained reads which can be excluded from overlap graph without violating the coverage-preserving property. 4. We conducted experiments by using long read datasets simulated from diploid HG002 human genome assembly while matching read length distributions to real PacBio HiFi and ONT data. We show that 1 and 50 coverage gaps are introduced on average by removing all contained reads in HiFi and ONT datasets, respectively. This is the first work to quantify the impact of graph sparsification heuristics in Myers's string graph model. 5. For HG002 genome assembly, ContainX algorithm excludes 98-99% contained reads from the graphs and leaves 10% coverage gaps. This result compares favorably to the performance of other known solutions for this problem.

Concepts and notations
We set the stage by defining important notations and definitions. We will recall commonly used assembly graph models.
Subsequently, we will give a formal definition of the desired coverage-preserving property in graphs.

Notation on strings
For a linear string x ¼ a 1 . . . a n over alphabet R ¼fA, C, G, Tg, jxj ¼ n is the length of x, x½i ¼ a i is the i th symbol of X, and x½i : j ¼ a i . . . a j is the substring from position i to position j. Let x i denote string x concatenated with itself i times. String x is said to have a suffix-prefix overlap of length l with string y when a proper suffix of length l in x equals a proper prefix of y. Recall that a proper prefix or a proper suffix of a string cannot be equal to the string itself.
We will use a circular genome model to avoid edge-effects associated with sequencing coverage. A circular string can be viewed as a traditional linear string which has the left-and the right-most symbols wrapped around and glued together. Circular string z ¼ ha 1 . . . a n i has length jzj ¼ n. Substring z½i : j; where i 2 ½1; n; j ! i, equals the finite substring of the linear infinite string ða 1 . . . a n Þ 1 from position i to position j. Note that offset j in z½i : j can be greater than jzj because z is a circular string. Further, substring z½i : j is said to be a repetitive string in z iff there exists z½i 0 : j 0 ¼ z½i : j; i 0 6 ¼ i. Suppose the true unknown genome is a set of circular strings, each representing a chromosome sequence. Let / be a known upper bound on the maximum length of a chromosome in the genome.
A sequence read sampled from the genome is a substring of one of the circular strings in the genome. An indexed multiset of the reads is represented using symbol R. Read x 2 R is labeled as a contained read if there exists a read y 2 R n fxg such that x is a substring of y. If y is also contained in x, then x and y are identical. In this case, break the tie by assuming that the read with lower index in R is contained within the read with higher index. The reads which contain read x are referred to as parent reads of x. We assume that either there are no sequencing errors or reads have been errorcorrected. In our theoretical results, we will also assume that DNA is a single-stranded molecule.

Graph models for genome assembly
De Bruijn graphs, overlap graphs and string graphs are popular graph-based frameworks used for de novo genome assembly. Let GðV; E; r; wÞ denote a directed assembly graph where vertices are labeled with strings, and edges indicate suffix-prefix overlap relations between the labels of connected vertices. Function r : V ! R þ assigns a string label to each vertex. Function w : E ! N assigns a weight to each edge. Weight of an edge v 1 ! v 2 signifies the length of the prefix of rðv 1 Þ that is not in the suffix-prefix match. In a de Bruijn graph B k ðRÞ, vertex set V corresponds to the set of all k-mer substrings in read set R, and an edge of weight one exists from vertex v 1 to vertex v 2 if and only if rðv 1 Þ½2 : k ¼ rðv 2 Þ½1 : k À 1 (Idury and Waterman 1995). This data structure is also sometimes referred to as node-centric de Bruijn graph (Chikhi and Rizk 2013).
Overlap graph OðRÞ is a directed multigraph where vertices correspond to reads, and edges correspond to suffix-prefix overlaps among the reads (Myers 1995). A directed edge v 1 ! v 2 with weight jrðv 1 Þj À l is drawn if and only if string rðv 1 Þ has a suffix-prefix overlap of length l with string rðv 2 Þ. A subgraph of overlap graph OðRÞ that contains only the edges representing suffix-prefix overlaps of length ! k is denoted as O k ðRÞ. Myers (1995Myers ( , 2005 introduced a sparse variant of overlap graph structure called as string graph. Unlike overlap graphs which are defined as directed multigraphs, string graph SðRÞ is a directed graph. A string graph includes only the longest suffix-prefix overlap between a read pair. Second, vertices associated with contained reads are excluded from a string graph. Finally, if vertex v 1 connects to v 2 , v 2 connects to v 3 and v 1 connects to v 3 using edges e 1 ; e 2 and e 3 , respectively, and wðe 1 Þ þ wðe 2 Þ ¼ wðe 3 Þ, then edge e 3 is excluded from the string graph because it is transitively deducible. Removal of transitive edges is referred to as transitive sparsification. A subgraph of string graph SðRÞ which only uses suffix-prefix overlaps of length ! k is denoted as S k ðRÞ. In practice, a string graph Figure 1 Illustration of an example where removal of vertices associated with contained reads leads to a coverage gap. Reads with IDs 2 and 3 are contained in 1 and 6, respectively. If the contained reads are ignored, there is no walk in the sparse graph that spells the genome has significantly fewer edges compared to the overlap graph which enables computation of longer assembly contigs.
By performing a closed walk in an assembly graph, one can spell a circular string. This string is formed by concatenating labels of the vertices in the walk without spelling the overlapped substrings twice.

Coverage-preserving graph models
Unless all repeats can be unambiguously resolved, one can compute different genome reconstructions, each of which is fully consistent with input reads (Medvedev and Pop 2021). An assembly graph spells the true genome if it spells all the possible genome reconstructions.
The depth of sequencing must be sufficient enough for assembling a genome (Lander and Waterman 1988). Informally, we will assume that reads "cover" the genome, and the length of suffixprefix overlap between "consecutive" reads is above a certain threshold. One way to achieve this is following. We say that there is sufficient coverage over circular string z if there exist parameters l 1 ; l 2 2 N such that l 1 > l 2 and all intervals of length l 1 À l 2 in z include the start of at least one substring of length ! l 1 that matches a read 2 R. See Fig. 2 for an example. This definition supports variable-length reads and allows a read to match at multiple distinct places in a genome. We will frequently use this assumption later in our proofs in Section 3. Let CðR; l 1 ; l 2 ; /Þ be the set of all candidate circular strings of length / which satisfy the stated coverage assumption using a given read set R and coverage parameters ðl 1 ; l 2 Þ. The true genome is a subset of CðR; l 1 ; l 2 ; /Þ.
Definition. Given a set of reads R, coverage parameters ðl 1 ; l 2 Þ and chromosome length threshold /, graph GðV; E; r; wÞ is said to be coverage-preserving if all candidate strings 2 CðR; l 1 ; l 2 ; /Þ can be spelled in the graph.
3 Analysis of known graph models THEOREM 1. de Bruijn graph B k ðRÞ is coverage-preserving for all k l 2 þ 1.
PROOF. If a circular string can be spelled in B l2þ1 ðRÞ, then it can certainly be spelled in B k ðRÞ for all k < l 2 þ 1. We will prove that B k¼l2þ1 ðRÞ is coverage-preserving by using contradiction. Suppose there exists a circular string z 2 CðR; l 1 ; l 2 ; /Þ which cannot be spelled in B k ðRÞ. Let k i ð1 i nÞ denote the k-mer substring starting from position i in z. If all k i s exist in the set of all k-mer substrings extracted from read set R, then it is trivial to construct a closed walk in B k ðRÞ which spells z. Therefore, at least one of the k i s must be missing. Consider the minimum i for which this is true. Next, consider the ðl 1 À l 2 Þ-long interval ending at the position i in the circular string z. Based on the coverage assumption, there is at least one read 2 R which matches a substring of length ! l 1 of z starting in this interval. Such a read must contain k-mer k i as its substring. h Next, we turn our attention to overlap graphs. Unlike de bruijn graphs, overlap graphs require a more careful analysis due to variable-length string labels on vertices and variable-length suffixprefix overlaps. Our aim is to prove the following.
THEOREM 2. Overlap graph O k ðRÞ is coverage-preserving for all k l 2 .
If a circular string can be spelled in O l2 ðRÞ, then it can be spelled in O k ðRÞ for all k < l 2 . For a circular string z 2 CðR; l 1 ; l 2 ; /Þ, we propose an algorithm to identify a closed walk in O l2 ðRÞ which spells z. For this, we first need some definitions. Let A i be the ðl 1 À l 2 Þ-long interval starting at position i in z, for all i ranging from 1 to jzj. For each interval A i , let z½A i :s : A i :e be the substring of length ! l 1 which starts in interval A i and matches a read 2 R. If there are multiple such substrings available, pick the ones with the least value of A i :s, and then with the least value of A i :e. Let A i :r 2 R denote a read that matches substring z½A i :s : A i :e. Suppose A q ; q 2 ½1; jzj is the interval which ends at position A 1 :s À 1 if A 1 :s > 1 or else at position jzj if A 1 :s ¼ 1. Recall that a circular string is equivalent to any of its cyclic rotated version, therefore, assume wlog that jA 1 :rj ¼ max i2½1;jzj jA i :rj. If jA 1 :rj ! jzj þ l 2 , it is trivial to construct a valid closed walk by using single vertex corresponding to read A 1 :r. In the following, we will assume that jA i :rj < jzj þ l 2 for all i 2 ½1; jzj. The inequalities l 1 jA i :rj and jA i :rj < jzj þ l 2 imply that l 1 À l 2 < jzj. Also jzj > 1 because PROOF. All intervals including A i and A iþ1 are of length l 1 À l 2 . A i :e À A iþ1 :s equals its minimum value l 2 À 1 when (a) A i :s equals the first position in interval A i , (b) A iþ1 :s equals the last position in interval A iþ1 , and (c) the substring z½A i :s : A i :e has minimum possible length l 1 . h We will find a subsequence ðT 1 ; T 2 ; . . . ; T p Þ of ðA 1 ; A 2 ; . . . ; A jzj Þ such that vertices corresponding to reads ðT 1 :r; T 2 :r; . . . ; T p :r; T 1 :rÞ can be connected to form a valid closed walk which spells z (Fig. 3). Let T 1 ¼ A 1 . Suppose a partial subsequence ðT 1 ; T 2 ; . . . ; T i ¼ A j Þ has been computed so far. Continue by selecting the first interval among ðA jþ1 ; . . . ; A jzj Þ as T iþ1 which satisfies the conditions T iþ1 :s > T i :s and T iþ1 :e > T i :e. This selection procedure ensures that T 1 :s < T 2 :s < . . . < T p :s and T 1 :e < T 2 :e < . . . < T p :e.
LEMMA 2. Length of the subsequence ðT 1 ; T 2 ; . . . ; T p Þ identified by the above procedure is at least two.
PROOF. By contradiction. Assume p ¼ 1, i.e. the computed subsequence is T 1 ¼ A 1 . Based on our previous arguments, we have jA 1 :rj < jzj þ l 2 , l 1 À l 2 < jzj and jzj > 1. Interval A q cannot contain position A 1 :s because size of each interval, i.e. l 1 À l 2 , is < jzj. Therefore, A 1 :s < A q :s jzj. As interval A q was not picked among the subsequence of intervals, A 1 :e must be ! A q :e. Next, we will Figure 2 Suppose l1 ¼ 4 and l2 ¼ 1. A circular string z ¼ hCATGAi has five intervals of length l1 À l2 ¼ 3 as shown in blue. Two reads ACATG and ATGA, which match substrings z½5 : 9 and z½2 : 5, respectively, are shown in green. z has sufficient coverage because all five intervals include at least one of the starting positions of the reads Figure 3 A graphic illustration of the subsequence computed from the complete sequence of intervals A1; A2; . . . ; A jzj . In the left figure, the gray-colored circle indicates circular string z, and the curved line segments indicate distinct substrings of z associated with intervals A1; A2 . . . ; A jzj . The right figure illustrates only the substrings associated with selected intervals in the subsequence T 1 ; T 2 ; T 3 ; T 4 identify the minimum possible value of A q :e to get a lower bound of A 1 :e. A q :e is lowest when the length of substring z½A q :s : A q :e is l 1 and A q :s equals the first position in A q . Using these conditions, observe that A q :e ! A 1 :s þ jzj þ l 2 À 1, and therefore, A 1 :e ! A 1 :sþ jzj þ l 2 À 1. This implies that the length of substring z½A 1 :s : A 1 :e is ! jzj þ l 2 . This is not possible when jA 1 :rj < jzj þ l 2 . h LEMMA 3. For any two consecutive intervals ðT i ; T iþ1 Þ in ðT 1 ; T 2 ; . . . ; T p Þ, T i :s < T iþ1 :s, T i :e < T iþ1 :e and T i :e ! T iþ1 :s þ l 2 À 1.
PROOF. The first two inequalities T i :s < T iþ1 :s and T i :e < T iþ1 :e are guaranteed by the selection procedure of T iþ1 . Suppose T i ¼ A j and T iþ1 ¼ A k . From Lemma 1, we know that A kÀ1 :e ! A k :s þ l 2 À 1. If j ¼ k À 1, then inequality T i :e ! T iþ1 :s þ l 2 À 1 holds. Otherwise, j < k À 1 implies that A kÀ1 was not picked among the subsequence of intervals. This would happen in either of the following two situations (i) A j :s ¼ A kÀ1 :s; A j :e ¼ A kÀ1 :e, (ii) A j :s < A kÀ1 :s; A j :e ! A kÀ1 :e. In either case, A j :e ! A kÀ1 :e. Therefore, T i :e ! T iþ1 :s þ l 2 À 1. h LEMMA 4. T p :s < T 1 :s þ jzj , T p :e < T 1 :e þ jzj and T p :e ! T 1 :sþ jzj þ l 2 À 1.
PROOF. The first inequality T p :s < T 1 :s þ jzj holds because T i :s 2 ½1; jzj 8i 2 ½1; p. For the second inequality, recall that jA 1 :rj ¼ max i2½1;jzj j A i :rj. If T p :e ! T 1 :e þ jzj, then jA 1 :rj < jT p :rj which cannot be true. The third inequality can be proved by using the arguments from the proof of Lemma 2. If q < jzj, then the intervals A qþ1 ; . . . ; A jzj will contain A 1 :s and won't be selected in the subsequence. As a result, either p ¼ q or p < q. In either case, T p :e ! A q :e. A q :e ! A 1 :s þ jzj þ l 2 À 1 implies that T p :e ! T 1 :sþ jzj þ l 2 À 1. h Lemmas 3 and 4 collectively prove that vertices associated with the reads T 1 :r, T 2 :r, . . ., T p :r; T 1 :r can be connected appropriately to build a valid closed walk in overlap graph O l2 ðRÞ. Along the walk, the nonoverlapping prefix of read T i :r ð1 i < pÞ spells z½T i :s : T iþ1 :s À 1. The prefix of the last read T p :r spells z½T p :s : jzj. When T 1 :s > 1, it also spells z½1 : T 1 :s À 1. This completes the proof of Theorem 2. Finally, we consider the string graph model.
Observation. String graph S k ðRÞ is not guaranteed to be coveragepreserving for any value of k.
The significance of the above result is that despite sufficient sequencing coverage, string graphs may lose coverage over the genome. This limitation can lead to fragmented assembly in practice (Feng et al. 2022;Nurk et al. 2022). Accordingly, the following questions need to be addressed: (i) To what extent does the lack of the above guarantee affect quality of string-graph-based long-read assemblies in practice? and (ii) Does there exist an alternate graph model derived from an overlap graph which is as sparse as the string graph in practice, but guaranteed to be coverage-preserving?
4 An alternative framework for sparsification of overlap graphs String graph is a subgraph of an overlap graph that loses the coverage-preserving guarantee after pruning selected vertices and edges. In this section, we propose techniques to compute a directed multigraph structure which is also a subgraph of overlap graph, and it is guaranteed to be coverage-preserving. Specifically, we propose safe graph sparsification rules for vertex and edge removal from overlap graph O k ðRÞ; k l 2 which ensure that all circular strings 2 CðR; l 1 ; l 2 ; /Þ can be spelled in the sparse graph. Suppose O 0 k ðRÞ equals a subgraph of O k ðRÞ after performing a sequence of safe operations. Let W / ðGÞ denote the set of circular strings of length / which can be spelled in graph G. From Theorem 2, we know that CðR; l 1 ; l 2 ; /Þ W / ðO k ðRÞÞ. Therefore, the next three lemmas hold true.
LEMMA 6. Edge uv l , i.e. the edge from vertex u to vertex v of weight l can be safely removed from LEMMA 7 (Myers 2005). If vertex v 1 connects to v 2 , v 2 connects to v 3 and v 1 connects to v 3 using edges e 1 ; e 2 and e 3 respectively such that wðe 1 Þ þ wðe 2 Þ ¼ wðe 3 Þ in graph O 0 k ðRÞ, then edge e 3 can be safely removed.
One needs to be careful while removing contained reads. The next lemma suggests a condition for when it is safe to remove a contained read.
LEMMA 8. The vertex corresponding to a contained read r 2 R can be safely removed from graph O 0 k ðRÞ if all the following conditions are satisfied: 1. Read r is a substring of exactly one candidate circular string z 2 CðR; l 1 ; l 2 ; /Þ.

2.
Read r matches a non-repetitive substring of z.
3. Graph O 0 k ðRÞ contains a vertex corresponding to a parent read of r.
PROOF. Let r p be a parent read of read r whose corresponding vertex exists in graph O 0 k ðRÞ. Denote the vertices corresponding to reads r and r p as u and u p respectively. Recall that all reads are substrings of the genome, and the genome is a subset of CðR; l 1 ; l 2 ; /Þ. If r is a substring of exactly one circular string z 2 CðR; l 1 ; l 2 ; /Þ, then r p must be a substring of z and no other circular string in CðR; l 1 ; l 2 ; /Þ. We need to show that z can still be spelled after removing vertex u from graph O 0 k ðRÞ. Suppose ðv 1 ; e 1 ; v 2 ; e 2 ; . . . ; v nÀ1 ; e nÀ1 ; v n Þ; n > 1; v n ¼ v 1 is a closed walk that includes vertex u to spell z. WLOG, assume that v 1 ¼ u. Note that neither of the vertices v 2 ; v 3 ; . . . ; v nÀ1 equal u because read r matches a non-repetitive substring of z. Since v 1 ¼ u, the first character of z is spelled by using the first character of r, and z½1 : j rj matches read r. Vertex v 1 ¼ u is said to span interval A 1 of length jrj starting from position one in z. Similarly, vertex v 2 spans interval A 2 of length jrðv 2 Þj starting from position wðe 1 Þ þ 1 Figure 4 A visualization of the counter-example used to analyze the string graph model in z. In general, each vertex v i in the closed walk spans a unique interval A i with starting position 2 ½1; jzj. Next, we will construct a new closed walk to spell z that uses vertex u p instead of vertex u. The interval spanned by vertex u p in z, say A new , can be uniquely identified because substring z½1 : jrj is non-repetitive. A new subsumes A 1 ¼ A n and possibly a few other adjacent intervals A 2 ; A 3 ; . . . and A nÀ1 ; A nÀ2 ; . . .. All vertices associated with the subsumed intervals can be removed and replaced with u p . With additional refinements, this procedure can be used to form a closed walk that spells z without using vertex u. h One can easily generalize Lemma 8 to a case where the contained read and its parent read have consistent matches in more than one candidate circular string. We will use Lemma 8 in Section 5 to motivate a practical heuristic. Finally, based on all such safe rules, it would be ideal to construct a "minimal" sparse overlap graph. However, implementing these rules in practice is not trivial. Suppose all sequencing errors are corrected by using an appropriate errorcorrection algorithm, we need to make an informed estimation of l 1 ; l 2 and / parameters. Computing the set of closed walks W / ðGÞ efficiently in large overlap graphs is also challenging.

A Proof-of-concept implementation
We build a practical algorithm for overlap graph sparsification that removes transitive edges by using the property in Lemma 7, and filters out a subset of contained reads by using heuristics that are inspired from Lemmas 5 and 8. Our implementation is currently designed for error-free long-read sequencing data sampled from both strands of DNA. Reverse complement of a string x is denoted as x. Reverse complements are handled by adding two separate vertices for each read: first for the read as given and second for its reverse complement. If vertex v corresponds to a read in either its forward or reverse-complemented orientation, then v refers to the vertex corresponding to the read in the other orientation. Therefore, rðvÞ ¼ rðvÞ. Graph GðV; E; r; wÞ satisfies the following two properties: (i) 8v 2 V; v 2 V, and (ii) for each edge e 2 E, say from v 1 to v 2 , an edge v 2 ! v 1 of weight jrðv 2 Þj À ðjrðv 1 Þj À wðeÞÞ belongs to E.
Our implementation requires the following inputs from a user for the initial graph construction: (i) error-free long reads, (ii) minimum overlap length threshold, (iii) exact suffix-prefix overlaps, and (iv) exact match coordinates of each contained read to each of its parent read. The minimum overlap length cutoff is set to 5 kb by default. For our experiments, we used minimap2 (Li 2018) to compute all-versus-all read alignments. Overlaps available from minimap2 were filtered to satisfy the alignment length and 100% identity constraints. The first step in our implementation is to build an overlap graph from the input suffix-prefix overlaps, and separately label reads as either contained or non-contained. For each contained read r, a set of 4-tuples is used to save match information with the parent reads. For instance, tuple ðp; i; j; þÞ indicates that read r matches substring p½i : j of read p 2 R. Similarly, tuple ðp; i; j; ÀÞ indicates that read r matches substring p½i : j, where p is the reverse complement of read p 2 R.

Transitive sparsification of a directed multigraph
The transitive sparsification property in Lemma 7 is inspired from Myers's string graph formulation. However, the linear OðjEjÞ expected time algorithm from Myers (2005) to label transitive edges is not applicable here because our graph is a multigraph, i.e. graph can have multiple edges between a pair of source and target vertices. Moreover, Myers's runtime analysis uses the property that the number of irreducible edges per vertex is constant in expectation, which does not hold in the presence of contained reads. We use a simple OðjEjD 2 Þ time procedure (Algorithm 1) where D is the maximum out-degree of a vertex in G. We did not explore faster algorithms here because Algorithm 1 uses a small fraction of the overall assembly runtime.

Remove non-repetitive haplotype-specific contained reads
Lemma 8 suggests that a contained read can be removed if (i) it is a substring of only a single haplotype, (ii) it is sampled from a nonrepetitive genomic region, and (iii) at least one parent read is available in the graph. The third condition is automatically satisfied because there is at least one parent read of each read which is noncontained, and therefore, will not be removed. We propose a heuristic to check the first two conditions by computing all-versus-all read alignments using Hifiasm (Cheng et al. 2021). For each read r, we inspect multiple sequence alignment (MSA) of read r and other reads overlapping with read r. We check if the count of reads in the MSA does not greatly exceed the sequencing coverage to ensure that the read is sampled from a non-repetitive genomic region. For a diploid genome, we additionally check for the presence of a heterozygous variant using the MSA (Fig. 5). Presence of a heterozygous variant implies that read r aligns to a single haplotype. If both the conditions are satisfied, we remove the two vertices corresponding to contained read r and its reverse complement respectively.

k-mer-based filtering heuristic for contained reads
We process the remaining contained reads using a second filter. In theory, we can try enumerating all possible string walks with and without contained read. Comparing the two will inform whether the read can be safely removed (Lemma 5). However, enumerating all possible string walks at chromosome length scale is computationally prohibitive. We address this by implementing a k-mer-based heuristic. We first compute the set of k-mers observed in user-specified bounded-length string walks from the vertex associated with contained read (Fig. 6). By default, the length of these walks is set to twice the length of the contained read. The set of observed k-mers is denoted as j 1 . Next, we compute the union of set of k-mers observed in bounded-length string walks from the vertices associated with the parent reads of the contained read. This set is denoted as j 2 . If j 1 j 2 , we mark the contained read for removal. This heuristic estimates if there exists a string which cannot be spelled after removal of the contained read. Figure 5 An illustration suggesting how multiple sequence alignment is used to detect non-repetitive haplotype-specific contained reads. We precompute the set of non-repetitive haplotype-specific reads by using a modified version of Hifiasm code (https://github.com/cjain7/hifiasm/tree/hifiasm_dev_debug)

Experimental results
Our evaluation addresses two questions: (a) How many coverage gaps are introduced if we follow the standard string graph model, i.e. discard all contained reads during graph construction?, and (b) How well does the proposed implementation (hereon referred to as ContainX) perform when compared to string graph as well as other existing methods?

Experimental setup
Read simulation from human haploid and diploid genomes We used two human genome assemblies for read simulation. The first is a CHM13 haploid genome assembly provided by the Telomere-to-Telomere consortium (GenBank id GCA_009914755.4) (Nurk et al. 2022). This haploid genome assembly of size 3.1 Gbp has 25 contigs and N50 length 150.6 Mbp. The second assembly is a long-read diploid genome assembly (ftp://ftp.dfci.harvard.edu/pub/hli/hifiasm-phase/v2/ HG002.hifiasm.trio.0.16.1.hap1.fa.gz and ftp://ftp.dfci.harvard.edu/pub/ hli/hifiasm-phase/v2/HG002.hifiasm.trio.0.16.1.hap2.fa.gz) of HG002 human sample computed using Trio-hifiasm (Cheng et al. 2021). This diploid genome assembly of size 6.0 Gbp has 970 contigs and N50 length 57.8 Mbp. From these genome assemblies, we simulated eight error-free read sets whose length distributions are compatible with real PacBio HiFi and ONT sequencing data. We used Seqrequester (https:// github.com/marbl/seqrequester) to simulate reads from random start positions in both forward and reverse orientations. Seqrequester allows users to specify a desired read length distribution. We used HiFi and ONT read length distributions from publicly-available long-read datasets of HG02080 human sample (Liao et al. 2022). Four long-read sets were simulated from haploid genome assembly with 20-fold coverage. The other four long-read sets were simulated from diploid genome assembly with 30-fold coverage (15-fold per haplotype). Length statistics and coverage information of the simulated read sets are shown in Table 1. The commands used to run the tools are listed in Supplementary Table S1.

Assessment of coverage gaps caused by removing contained reads
Benchmarking procedure We used the following method to estimate the count of coverage gaps that are introduced by removal of contained reads. We computed allversus-all read overlaps by using minimap2 v2.23 (Li 2018) to identify the set of contained reads. We labeled a read as contained if it matched a proper substring of a longer read. We can have false negatives because minimap2 uses k-mer-based heuristics. However, there cannot be false positives here because each overlap is verified using sequence alignment. In the second step, the set of non-contained reads was mapped to the genome. For the diploid genome, the set of noncontained reads was mapped to both paternal and maternal haplotypes, one by one. Minimap2 parameters were adjusted to enable reporting of multiple best alignments per read. Whenever a read aligned end-to-end against a genomic interval with 100% alignment identity, we recorded the interval. Finally, all the genomic intervals where no alignment coverage was observed were extracted using bedtools v2.29.1 (Quinlan and Hall 2010). Not all of these can be considered as coverage gaps caused by removal of contained reads. The sequencing coverage and read lengths at the two extreme ends of each contig are expected to be lower than the whole-genome average due to edge effect. Therefore, the intervals that overlap with either the first 25 kb or the last 25 kb bases of a contig are not considered. Similarly, the intervals that overlap with segments of the genome with zero sequencing coverage are also not considered. The remaining intervals are the coverage gaps introduced by removal of contained reads.

Results
The counts of coverage gaps observed in the four datasets are shown in Table 2. We observe that there are zero coverage gaps introduced in haploid genome for both ONT and HiFi read sets. In theory, haploid genomes can suffer coverage gaps due to identical repeats in the genome (Fig. 1). However, removal of contained reads does not cause any harm in practice. In diploid datasets, we observe 1 and 46-54 coverage gaps introduced in HiFi and ONT datasets, respectively. Compared to haploid genomes, coverage gaps in diploid genomes are more likely to happen because a longer read sampled from one haplotype can subsume all reads sampled from the homologous loci of the other haplotype, especially in regions with low heterozygosity. Moreover, ONT read length distribution is highly nonuniform which leads to a higher fraction of contained reads, and therefore, more coverage gaps. Supplementary Fig. S1 shows long-read length distributions.
We further investigated whether the observed gaps are clustered in a particular chromosome, but we did not observe such behavior. The coordinates of these coverage gaps in the HG002 assembly were mapped to the coordinates of GRCh38 human genome reference by using paftools (Li 2018). The complete lists of these coordinates are provided in Supplementary Tables S2-S5. We also evaluated heterozygosity rate by checking the count of heterozygous variants in these gaps. These gaps collectively span 1.26 million bases on GRCh38 genome reference when all the four diploid datasets are considered together. However, Dipcall  reported only 83 heterozygous variants. This rate is an order of magnitude lower relative to the whole-genome average heterozygosity rate 0.1% in HG002 genome. This observation confirms our expectation that the coverage Count of contained reads is estimated from all-to-all read alignments. Figure 6 An illustration for how k-mer sets j1 and j2 are computed using boundedlength string walks from contained read r and its parent reads, respectively, in an overlap graph. Colored dots indicate the k-mers extracted from the strings gaps in a diploid genome are more likely to happen in the regions with low heterozygosity.

Evaluation of the proposed sparsification algorithm
Benchmarking procedure We evaluated the performance of our proof-of-concept implementation ContainX that determines a subset of contained reads to be retained. We tested ContainX using the diploid datasets. We checked the performance by observing the following two parameters: (i) count of coverage gaps, and (ii) count of "junction" reads, i.e. the reads which correspond to vertices having either > 1 incoming or > 1 outgoing edges in the graph. Retaining more contained reads implies that their corresponding vertices and edges are retained in the graph. This may result in more number of junction reads which can further result in shorter unitigs. It is important to look at the two parameters simultaneously because the first parameter can be optimized by retaining all contained reads whereas the second parameter can be optimized by removing all contained reads. An ideal solution should optimize both by doing a careful selection of contained reads. To measure the first parameter, we used minimap2 to compute end-to-end 100% identical read alignments of the retained contained reads against the two haplotypes. We checked if these alignments closed the coverage gaps that were previously caused by discarding all contained reads.

Competing methods
We compared ContainX with four other methods. The first method, called as Retain-all, simply retains all contained reads. The second method, called as Remove-all, discards all contained reads. The third method is from Hui et al. (2016). It removes a contained read r if two of its parent reads are inconsistent. Two parent reads are said to be inconsistent if, when aligned with respect to read r, they disagree at some base. We also included Hifiasm in our benchmark. Unlike previous methods which stop at graph construction, Hifiasm is a full-fledged genome assembler that incorporates multiple heuristics (e.g. tip removal, bubble popping etc.) for graph sparsification. Hifiasm builds its initial overlap graph without using contained reads. It uses a heuristic to recall a few selected contained reads which can connect ends of two unitigs. Even though Hifiasm is specifically designed for HiFi reads, we tested it on ONT datasets as well because of error-free read simulation.

Results
In all four datasets, Retain-all method retained the highest number of contained reads as expected (Table 3). It resolved all coverage gaps, but the corresponding graph has significantly higher number of junction reads compared to Remove-all. Next, Hui-2016 filtering method appears to be conservative, i.e. it typically ends up retaining a majority of contained reads. In several cases, the contained read may have parent reads which agree at all bases (e.g. when they come from a homozygous region of the genome). There may also be cases when there is only a single parent read available. In all such scenarios, Hui-2016 heuristic will choose to retain the contained reads. This heuristic resolved all the coverage gaps. Remove-all method has the fewest junction reads among the four methods but has the highest number of unresolved coverage gaps as expected.
ContainX delivered favorable performance compared to the three methods. Using both datasets, ContainX retained 1-2% contained reads compared to Retain-all, yet it successfully resolved majority of the coverage gaps. ContainX resulted in only 2-5% junction reads when compared to Retain-all. This suggests that ContainX heuristics are promising in terms of improving assembly quality by avoiding coverage gaps. The count of false positives, i.e. the count of redundant contained reads retained by ContainX, is still notable. Most of these false positives correspond to contained reads that are sampled from long near-identical repetitive regions ( Supplementary Fig. S2). This is a limitation of our k-mer-based filtering heuristic (Section 5.3) because it may retain redundant contained reads if it finds neighboring vertices that correspond to paralogous sequences during graph traversal. It is unclear whether it is possible to remove such contained reads with a provably-good graph-theoretic strategy. In some other cases, a redundant contained read may be retained if it has suffixprefix overlaps with reads from both haplotypes while its parent reads correspond to a single haplotype. Finally, Hifiasm retained fewer contained reads than ContainX but it failed to resolve a majority of coverage gaps. This suggests that there is further scope to improve Hifiasm algorithm. The unitig graph of Hifiasm has the least number of junction reads because it does additional graph pruning which is necessary for computing longer unitigs. Incorporating ContainX heuristics inside Hifiasm code can be an interesting direction to explore.
The proposed graph sparsification algorithms implemented in ContainX are easy-to-implement, fast and space-efficient. The kmer-based heuristic (Section 5.3) is parallelized by considering each contained read independently. The time required for this step on a multicore AMD processor is about ten minutes. Our other heuristic that detects heterozygous variants by using MSA of overlapping reads (Section 5.2) can be assumed to incur no additional time because this step is a default requirement in overlap-graph-based genome assemblers. In summary, the time needed by the proposed heuristics is insignificant compared to the main time-consuming step of computing all-versus-all read alignments in long-read assemblers.

Discussion
This work formalized the coverage-preserving property in assembly graph models. Development of provably-good graph models will help automate the computation of high-quality haplotype-resolved assemblies of personal human genomes ). We showed a rigorous analysis of de Bruijn graph, overlap graph and string graph assembly models. We concluded that de Bruijn graph and overlap graph models offer the guarantee, but string graph model lacks this guarantee. This limitation of the string graph model can lead to fragmented assembly in practice (Feng et al. 2022;Nurk et al. 2022). Several graph sparsification heuristics, such as removal of contained reads, may be deemed safe at first due to a false intuition, e.g. it may appear that contained reads lack useful information for assembly. We have shown that contained reads should be assessed carefully to avoid suboptimal genome assembly. It may be possible that an optimally constructed sparse overlap graph is more tangled than a string graph, however, further graph simplification can be achieved with known techniques, e.g. aligning complementary sequencing data to the graph (Garg et al. 2018;Cheng et al. 2022;Rautiainen et al. 2023). We quantified the count of coverage gaps in a real haploid and diploid genome assembly, respectively, through our experiments done using simulated long reads. Based on these observations, we expect coverage gaps to occur even more frequently in polyploid genomes and metagenomes. These results are expected to inspire further research to develop alternative models and sparsification procedures for assembly graphs. We conducted our experiments by using error-free long reads to quantify the coverage gaps. If real data is considered, one would need to preprocess reads to correct errors by computing consensus among the overlapping reads (Cheng et al. 2021;Bankevich et al. 2022), but error-correction methods are not necessarily perfect. Our analysis with error-free reads is useful to explicitly evaluate the graph sparsification algorithms. When errors are considered, additional coverage gaps may occur due to incorrect bases in the error-corrected reads (Fu et al. 2019;Zhang et al. 2020). Depending on the error-rate in the corrected reads, one would also need to adjust the definition of a contained read as a substring of another read while allowing a bounded number of edits (Myers 2005). This relaxation can lead to a more aggressive sparsification in string graphs, and therefore, more coverage gaps.
We implemented novel heuristics for assessing contained reads in ContainX that were motivated from our safe graph sparsification policies. We compared ContainX to heuristics from Hifiasm, Hui et al. (2016) and two other extreme policies that remove and retain all the contained reads respectively. None of these methods is able to simultaneously eliminate all gaps and all the redundant contained reads. ContainX provides favorable performance by eliminating most of the redundant contained reads and resolving majority of the coverage gaps. Our future work will explore integration of ContainX heuristics in long-read assemblers. Extending the proposed heuristics to erroneous reads should not be challenging. For example, one can still detect haplotype-specific contained reads by considering MSA of overlapping reads and ignoring variants that are not supported by sufficient number of reads (Cheng et al. 2021). Similarly, the proposed k-mer-based filtering heuristic would be applicable if low-frequency k-mers are ignored.

Supplementary data
Supplementary data is available at Bioinformatics online.