Decomposing mosaic tandem repeats accurately from long reads

Abstract
Motivation: Over the past 30 years, extended tandem repeats (TRs) have been correlated with ∼60 diseases with high odds ratios, and most known TRs consist of single repeat units. However, in the last few years, long-read sequencing techniques have revealed that mosaic TRs composed of different units are associated with several brain disorders. Mosaic TRs are difficult-to-characterize sequence configurations that are usually confirmed by manual inspection. Widely used tools are not designed to solve the mosaic TR problem and often fail to properly decompose mosaic TRs.
Results: We propose an efficient algorithm that can decompose mosaic TRs in the input string with high sensitivity. Using synthetic benchmark data, we demonstrate that our program, named uTR, outperforms TRF and RepeatMasker in terms of prediction accuracy, especially when mosaic TRs are more complex, and that uTR is faster than TRF and RepeatMasker in most cases.
Availability and implementation: The software program uTR that implements the proposed algorithm is available at https://github.com/morisUtokyo/uTR.

Proof of NP-completeness of UEP
Theorem 1. The unit encoding problem (UEP) is NP-complete. Given an alphabet $\Sigma$, a string $S \in \Sigma^*$, a set of strings $U \subset \Sigma^*$, and an integer $T$, is there a decomposition $S = s_1 \cdots s_k$ such that all elements $s_i$ appear in $U$ and $\sum_{s \in \{s_1, \dots, s_k\}} \{|s| + o(s)\} \le T$, where $o(s)$ is the number of occurrences of $s$ in the decomposition?
To avoid double counting of the same substring appearing in multiple positions of the decomposition $\mathcal{D}$, $\{s_1, \dots, s_k\}$ represents the non-redundant set of substrings in the decomposition. According to this theorem, it is NP-hard to calculate an optimal decomposition that minimizes $\sum_{s \in \{s_1, \dots, s_k\}} \{|s| + o(s)\}$. The problem may become harder when $U$ is not provided, but we do not have an answer to the intractability of this extended situation.
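The penalty in the problem statement can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that o(s) counts how often a unit s occurs in the decomposition, and the function names and the toy instance over the alphabet {a, b} are hypothetical.

```python
from collections import Counter


def penalty(decomposition):
    """Sum |s| + o(s) over the distinct units s of a decomposition.

    `decomposition` is a list of units; each unit is a tuple of characters.
    o(s) is taken to be the number of times unit s occurs in the list.
    """
    occurrences = Counter(decomposition)
    return sum(len(unit) + count for unit, count in occurrences.items())


def is_valid(decomposition, S, U):
    """Check that the units concatenate to S and that every unit is in U."""
    flat = tuple(ch for unit in decomposition for ch in unit)
    return flat == tuple(S) and all(unit in U for unit in decomposition)


# Hypothetical toy instance, for illustration only.
S = "abab"
U = {("a",), ("b",), ("a", "b")}
D1 = [("a", "b"), ("a", "b")]          # one distinct unit, used twice
D2 = [("a",), ("b",), ("a",), ("b",)]  # two distinct units, each used twice

print(penalty(D1), penalty(D2))
```

Because each distinct unit is charged for its length only once, reusing a longer unit (D1) is cheaper here than spelling the string out letter by letter (D2).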
We show this theorem by giving a polynomial time reduction from the vertex covering problem, a well-known NP-complete problem: for a graph $G = (V, E)$ and an integer $K$, check whether or not there is a subset $V' \subset V$ of size $K$ or smaller such that for each edge $(u, v)$, either $u$ or $v$ is in $V'$ (Karp, 1972).
Let $G = (V, E)$ and $K$ denote an instance of the vertex covering problem. For convenience, we express $V$ as a set of integers $\{1, \dots, N\}$ and an edge as a pair of integers $(i, j)$ such that $i < j$. In addition, we define the maximum degree of $G$ as $A$. We assume $A > 1$ because the vertex covering problem is trivial if $A \le 1$. We convert $G$ and $K$ to an instance of the UEP $(\Sigma, S, U, T)$ to answer the vertex covering problem of $G$ and $K$.
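For reference, the source problem of the reduction can be stated as a brute-force check. The sketch below is only an illustration of the decision problem; the function names are hypothetical, and the path graph 1-2-3-4 is an assumption that matches the running example as far as it can be inferred from the penalties quoted in the text.

```python
from itertools import combinations


def is_vertex_cover(edges, subset):
    """True if every edge (u, v) has u or v in `subset`."""
    return all(u in subset or v in subset for u, v in edges)


def has_cover_of_size_at_most(nodes, edges, K):
    """Decide the vertex covering problem by exhaustive search (exponential time)."""
    for size in range(K + 1):
        for subset in combinations(nodes, size):
            if is_vertex_cover(edges, set(subset)):
                return True
    return False


def max_degree(nodes, edges):
    """The quantity A used in the reduction."""
    return max(sum(1 for e in edges if i in e) for i in nodes)


# Assumed running example: nodes {1, ..., N} and edges (i, j) with i < j.
nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (3, 4)]

print(has_cover_of_size_at_most(nodes, edges, 1))
print(has_cover_of_size_at_most(nodes, edges, 2))
print(max_degree(nodes, edges))
```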
In general, we can use a so-called adjacency list to represent a graph by a string. Formally, we treat each edge $(i, j) \in E$ as a character in $\Sigma$, encode each node $i \in V$ by a concatenation of all edges with $i$ as one node, which are of the form $(i, -)$ or $(-, i)$, and denote the concatenation by $S_i$. For example, $2 \in V$ is encoded by $(2, 3)(1, 2)$. By merging all $S_i$s, we obtain the adjacency list $S \in \Sigma^*$ of $G$. For example, the adjacency list $S$ of the graph with edges $(1, 2)$, $(2, 3)$, and $(3, 4)$ is
$$S = \overbrace{(1,2)}^{S_1}\,\overbrace{(2,3)(1,2)}^{S_2}\,\overbrace{(2,3)(3,4)}^{S_3}\,\overbrace{(3,4)}^{S_4}. \tag{1}$$
We represent a vertex covering $V'$ by decomposing $S$ into $S_i$s and subsequently $S_i$ into single letters if node $i$ is a member of $V'$. In other words, we connect a covering $V'$ of $G$ and a decomposition $\mathcal{D}$ of $S$ via the rule:
$$i \in V' \iff \mathcal{D} \text{ decomposes } S_i \text{ into single letters.} \tag{2}$$
In the running example, we represent an optimal covering, $\{1, 3\}$, as
$$(1,2) \cdot (2,3)(1,2) \cdot (2,3) \cdot (3,4) \cdot (3,4). \tag{3}$$
Here, we use "$\cdot$" as the symbol for decomposition. The penalty of $\mathcal{D}$ is 10 because each of $(1, 2)$, $(2, 3)(1, 2)$, and $(2, 3)$ occurs once, $(3, 4)$ occurs twice, and the penalty is $(1 + 1) + (2 + 1) + (1 + 1) + (1 + 2) = 10$.
Although the adjacency list representation seems to provide a way to convert vertex covering to UEP, there are three issues to be solved.
First, there is no guarantee that optimal decompositions of $S$ correspond to optimal coverings of $G$. Specifically, one of the optimal decompositions of $S$ is
$$(1,2)(2,3) \cdot (1,2)(2,3) \cdot (3,4)(3,4)$$
with a penalty of seven because $(1, 2)(2, 3)$ occurs twice, $(3, 4)(3, 4)$ once, and the penalty is $(2 + 2) + (2 + 1) = 7$. But this decomposition is inconsistent with the nodes grouped by the overbraces in Equation (1), and no coverings correspond to it.
Second, the decompositions for a non-covering and an optimal covering can have the same penalty. For example, suppose that $V'$ is $\{1\}$. $V'$ is not even a covering, and the corresponding decomposition is
$$(1,2) \cdot (2,3)(1,2) \cdot (2,3)(3,4) \cdot (3,4).$$
This decomposition and that of (3) share the same penalty, ten, because all substrings occur once, and the penalty is $(1 + 1) + (2 + 1) + (2 + 1) + (1 + 1) = 10$.
Finally, we cannot distinguish optimal coverings from non-optimal coverings. When $V'$ is a non-optimal vertex covering $\{1, 3, 4\}$, the decomposition is again the same as that in (3).
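The three issues can be reproduced numerically. The following sketch assumes the running example is the path graph with blocks $S_1 = (1,2)$, $S_2 = (2,3)(1,2)$, $S_3 = (2,3)(3,4)$, and $S_4 = (3,4)$, which is what the quoted penalties imply; all function names are hypothetical, and the exhaustive search is only feasible because the string has six characters.

```python
from collections import Counter
from itertools import combinations

# Assumed adjacency-list blocks of the running example (node -> S_i).
BLOCKS = {1: [(1, 2)], 2: [(2, 3), (1, 2)], 3: [(2, 3), (3, 4)], 4: [(3, 4)]}
S = [edge for i in sorted(BLOCKS) for edge in BLOCKS[i]]


def penalty(units):
    """Sum of |s| + o(s) over the distinct units of a decomposition."""
    return sum(len(u) + n for u, n in Counter(units).items())


def all_decompositions(seq):
    """Yield every way of cutting `seq` into contiguous units (no unit set U)."""
    n = len(seq)
    for r in range(n):
        for cuts in combinations(range(1, n), r):
            bounds = (0, *cuts, n)
            yield [tuple(seq[a:b]) for a, b in zip(bounds, bounds[1:])]


def decomposition_from_cover(cover):
    """Rule (2): split S_i into single letters exactly when node i is in the cover."""
    units = []
    for i in sorted(BLOCKS):
        if i in cover:
            units.extend((edge,) for edge in BLOCKS[i])
        else:
            units.append(tuple(BLOCKS[i]))
    return units


best_unrestricted = min(penalty(d) for d in all_decompositions(S))
p_optimal_cover = penalty(decomposition_from_cover({1, 3}))
p_non_cover = penalty(decomposition_from_cover({1}))
same_as_optimal = decomposition_from_cover({1, 3, 4}) == decomposition_from_cover({1, 3})

print(best_unrestricted, p_optimal_cover, p_non_cover, same_as_optimal)
```

Under these assumptions, the unrestricted optimum is lower than the penalty of any covering-based decomposition, the non-covering {1} ties with the optimal covering {1, 3}, and {1, 3, 4} yields the very same decomposition as {1, 3}.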
Considering these issues, it is necessary to refine this adjacency list representation. As the refinement includes complicated techniques, we provide the motivations, conversions, and explanations of why they work.
First, to prohibit decompositions that violate the overbraces in Equation (1), we introduce a set of units $U$ such that all substrings of $\mathcal{D}$ are in $U$. We introduce two new characters D and F to simplify the definition of $U$.
Second, we ensure that the penalty is much smaller for a valid decomposition built from a covering than for an invalid decomposition. For this purpose, the intuition is to append all of the edges at the tail of $S$:
$$\overbrace{(1,2)(2,3)(1,2)(2,3)(3,4)(3,4)}^{S}\,(1,2)(2,3)(3,4). \tag{4}$$
The penalty of an appended edge, say $(2, 3)$, increases by 1 if $V'$ is a covering because $(2, 3)$ should already appear as a substring in $\mathcal{D}$ (see $(2, 3) \cdot (3, 4)$ in Equation (3), for example). In contrast, it increases by 2 if $V'$ includes neither 2 nor 3. Therefore, the appended edge-characters work as "covering checkers" in that they increase the penalty every time $V'$ fails to cover an edge. To enlarge the incurred penalty sufficiently, we define a new auxiliary character E (not to be confused with the edge set $E$) and replace each edge-character, such as $(2, 3)$, with a long string $E \cdots E(2, 3)E \cdots E$.
The final aim is to link the penalties of decompositions with the sizes of the corresponding coverings, $|V'|$. With these motivations in mind, we newly define the alphabet $\Sigma$, the set of units $U$, the string $S$, and the threshold parameter $T$ from scratch. To begin with, we set $U = \emptyset$ and $\Sigma = \{(i, j) \mid (i, j) \in E\}$. We then introduce a new punctuation mark, denoted by D, and add D to $\Sigma$. We also treat D as a single-character string and add it to $U$. As long as no units contain D except for D itself, any decomposition $\mathcal{D}$ of $S$ isolates all occurrences of D as elements of $\mathcal{D}$.
We will then ensure that all of the decompositions of $S_i$s have the same penalty and are independent of each other. Let $A$ be the maximum degree of $G$ and $N$ be the number of nodes in the graph. We then extend the graph such that $\Sigma$ has $A$ characters of the form $(i, -)$ or $(-, i)$ for each $i$. In the working example, since $A = 2$, the characters corresponding to the edges are $\{(1, 2), (1, 5), (2, 3), (3, 4), (4, 5)\}$, where $(1, 5)$ and $(4, 5)$ are the newly introduced characters. In general, if the degree of $i$, $d(i)$, is less than $A$, we add $(i, N + 1), \dots, (i, N + A - d(i))$ to $\Sigma$, where $N + 1, \dots, N + A - d(i)$ are new nodes.
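The padding rule can be sketched as follows. The function name is hypothetical, and the path graph 1-2-3-4 is assumed as the working example because it reproduces the character set listed above.

```python
def pad_to_max_degree(n_nodes, edges):
    """Return (A, added), where `added` lists the edge characters
    (i, N + 1), ..., (i, N + A - d(i)) introduced for each node i with d(i) < A."""
    degree = {i: 0 for i in range(1, n_nodes + 1)}
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    A = max(degree.values())
    added = []
    for i in range(1, n_nodes + 1):
        for extra in range(1, A - degree[i] + 1):
            added.append((i, n_nodes + extra))
    return A, added


# Assumed working example: path graph on N = 4 nodes.
edges = [(1, 2), (2, 3), (3, 4)]
A, added = pad_to_max_degree(4, edges)
alphabet = sorted(edges + added)

print(A, added, alphabet)
```

After padding, every original node appears in exactly A edge characters, which is what later makes all $S_i$ the same length.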
To make the penalties of the decompositions of the $S_i$s independent of each other, no two $S_i$s should share a substring in $U$. For this purpose, let $L$ be a sufficiently large integer, e.g., $4|E| + (K + 3)N$. Then, we distinguish all edges by associating node $i$ of edge $(i, j)$ with a string that we denote by $P_i(i, j)$. For ease of understanding, we illustrate $P_i(i, j)$ and $P_j(i, j)$, where $E^m$ denotes $m$ copies of the auxiliary character E:
$$P_i(i,j) = E^{L+2i}\,(i,j)\,E^{L+2N-2i+1}, \qquad P_j(i,j) = E^{L+2j}\,(i,j)\,E^{L+2N-2j+1}.$$
It is clear that $P_i(i, j)$s contain their own edge characters $(i, j)$, are of the same length $2(L + N + 1)$, and have unique tags ($E^{L+2i}$) at the left side of $(i, j)$.
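The stated properties of $P_i(i, j)$ can be checked with a short sketch. It assumes the form $E^{L+2i}(i,j)E^{L+2N-2i+1}$, which is inferred from the stated left tag $E^{L+2i}$ and the stated total length $2(L + N + 1)$; the function name and the parameter values (path graph, $K = 2$) are illustrative assumptions.

```python
def build_P(i, edge, L, N):
    """Token list for P_i(edge), assuming the form E^{L+2i} edge E^{L+2N-2i+1}."""
    return ["E"] * (L + 2 * i) + [edge] + ["E"] * (L + 2 * N - 2 * i + 1)


# Assumed example parameters: path graph on N = 4 nodes, K = 2.
N, K = 4, 2
edges = [(1, 2), (2, 3), (3, 4)]
L = 4 * len(edges) + (K + 3) * N  # the "sufficiently large" choice from the text

# One string per (node, incident edge) pair.
units = {(i, e): build_P(i, e, L, N) for e in edges for i in e}

lengths = {len(p) for p in units.values()}
distinct = len({tuple(p) for p in units.values()}) == len(units)

print(L, lengths, distinct)
```

The left run of E encodes the node, and the edge character itself separates strings that share a node, so no two of these strings coincide.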
Then, we introduce another new punctuation character F to $\Sigma$ and concatenate all $P_i(i, j)$s and $P_i(j, i)$s into $S_i$, separating them with F, as illustrated by $S_1$ and $S_3$ in Figure 3. As designed, all $S_i$s now have the same length, $A(2(L + N + 1) + 1)$ (1 for F), and all $P_i(i, j)$s and $P_j(i, j)$s are distinct. We add all $P_i(i, j)$s and all $S_i$s to $U$.
Finally, it is necessary to implement the "covering checker" as illustrated in (4) so as to validate that the corresponding subset of nodes $V' \subset V$ is indeed a covering. For this purpose, we take the approach of computing the penalty of the optimal decomposition of $S_{i,j} = E^{2L}(i, j)E^{2L}$. Suppose we have decomposed $S_{i,j}$ into either of the following:
$$E^{L-2i} \cdot P_i(i,j) \cdot E^{L-2N+2i-1}, \tag{5}$$
$$E^{L-2j} \cdot P_j(i,j) \cdot E^{L-2N+2j-1}. \tag{6}$$
We can compute the penalties of $E^{L-2i}$, $E^{L-2N+2i-1}$, $E^{L-2j}$, and $E^{L-2N+2j-1}$ with ease by adding all of these patterns joined by D, $\prod_{i=1}^{2N} (E^{L-i} D)$, at the head of $S$ (see "auxiliary domain" in Figure 3). This is because, as $S$ should be partitioned by D, each of $E^{L-1}, \dots, E^{L-2N}$ occurs at least once in any decomposition. Now, the penalty of $S_{i,j}$ increases by 3 if either $P_i(i, j)$ or $P_j(i, j)$ is present in a decomposition and by more than 3 otherwise. In addition, there are no decompositions of $S_{i,j}$ other than (5) or (6) for $U$, because $(i, j)$ appears only in $P_i(i, j)$, $P_j(i, j)$, $S_i$, and $S_j$. Since $S_i$ and $S_j$ have at least one F, they cannot be substrings of $S_{i,j}$.
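The exponent bookkeeping behind this step can be verified mechanically. The sketch below assumes $P_i(i,j) = E^{L+2i}(i,j)E^{L+2N-2i+1}$ (inferred from the stated tag and length) and checks two things: flanking $P_i(i,j)$ with $E^{L-2i}$ and $E^{L-2N+2i-1}$ gives $E^{2L}$ on each side of the edge character, and the flanking runs needed over all nodes are exactly $E^{L-1}, \dots, E^{L-2N}$. Names and parameter values are illustrative.

```python
def flank_totals(i, L, N):
    """Lengths of the E-runs to the left and right of the edge character
    after concatenating E^{L-2i}, P_i(i, j), and E^{L-2N+2i-1}."""
    left = (L - 2 * i) + (L + 2 * i)                           # aux run + left run of P_i
    right = (L + 2 * N - 2 * i + 1) + (L - 2 * N + 2 * i - 1)  # right run of P_i + aux run
    return left, right


def auxiliary_exponents(L, N):
    """All exponents m such that E^m is needed as a flanking run for some node."""
    left_runs = {L - 2 * i for i in range(1, N + 1)}
    right_runs = {L - 2 * N + 2 * i - 1 for i in range(1, N + 1)}
    return sorted(left_runs | right_runs)


N, L = 4, 32  # illustrative values; the identities hold for any L > 2N
all_ok = all(flank_totals(i, L, N) == (2 * L, 2 * L) for i in range(1, N + 1))
exponents = auxiliary_exponents(L, N)

print(all_ok, exponents)
```

The left runs use the even offsets and the right runs the odd offsets, so the 2N auxiliary strings never collide.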
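Here, we summarize our argument so far. First, we convert the vertices and edges of the graph into the strings $P_i(i, j)$, $S_i$, and $S_{i,j}$ introduced above. Then, we define $S$ and $U$ from these strings together with the characters D and F and the auxiliary domain. Figure 3 shows the conversion of the running example.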
The correspondence between a covering $V'$ and a decomposition $\mathcal{D}$ is as follows: $\mathcal{D}$ splits $S_i$ by F if and only if $i \in V'$, and it decomposes each $S_{i,j}$ by (5) or (6). What is the penalty of this decomposition if $V'$ is a vertex covering with $K$ or fewer vertices? The leading $F \cdot D$ incurs a penalty of $(1 + 1) + (1 + 1) = 4$. For the subsequent string $\prod_{i=1}^{2N} (E^{L-i} D)$, as D exists uniquely in $U$, there is only one way to decompose: $E^{L-1} \cdot D \cdot \ldots \cdot E^{L-2N} \cdot D$. Given there are $2N$ Ds, the penalty is $(2L - 2N + 3)N$ in total.
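The closed form $(2L - 2N + 3)N$ for the auxiliary domain can be checked by direct summation. In this sketch, each $E^{L-i}$ is charged its length plus one occurrence, and each of the 2N copies of D is charged one occurrence only, on the assumption that the length of D is already counted in the leading $F \cdot D$. Function names are hypothetical.

```python
from collections import Counter


def auxiliary_domain(L, N):
    """Units of the forced decomposition E^{L-1} . D . ... . E^{L-2N} . D."""
    units = []
    for i in range(1, 2 * N + 1):
        units.append(("E",) * (L - i))
        units.append(("D",))
    return units


def auxiliary_penalty(L, N):
    """Penalty added by the auxiliary domain (requires L > 2N)."""
    total = 0
    for unit, count in Counter(auxiliary_domain(L, N)).items():
        if unit == ("D",):
            total += count              # occurrences only; |D| is charged earlier
        else:
            total += len(unit) + count  # length plus the single occurrence
    return total


L, N = 32, 4  # illustrative values
print(auxiliary_penalty(L, N), (2 * L - 2 * N + 3) * N)
```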
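To calculate the penalty corresponding to the nodes, we divide the cases according to whether each node is in the covering. Recall that the size of $S_i$ is $A(2(L + N + 1) + 1)$, and let $\tilde{L}$ denote the size $|S_i|$.
• If $i \in V'$, $S_i$ is decomposed into its $P_i(\cdot)$s and F, and the penalty is $\tilde{L} + A$: the $A$ units $P_i(\cdot)$ contribute their total length $\tilde{L} - A$ plus one occurrence each, and the $A$ occurrences of F in $S_i$ contribute $A$ more.
• If $i \notin V'$, $S_i$ is not decomposed, and the penalty is $|S_i| + 1 = \tilde{L} + 1$.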
In total, the penalty is
$$|V'|(\tilde{L} + A) + (N - |V'|)(\tilde{L} + 1) = N\tilde{L} + N + (A - 1)|V'| \le N\tilde{L} + N + (A - 1)K.$$
For $\prod_{(i,j) \in E} (S_{i,j} D)$, because $V'$ is a covering, the penalty of each $S_{i,j}$ is 3 as shown in (5) and (6). As there is an additional D for each $S_{i,j}$, the penalty is 4, summing up to $4|E|$ in total. By aggregating these penalties, we have
$$4 + (2L - 2N + 3)N + N\tilde{L} + N + (A - 1)K + 4|E|$$
as an upper bound of the penalty. Assigning the above bound to $T$ provides a positive answer to the unit encoding problem. To complete the proof, it remains to confirm the reverse direction: given a decomposition $\mathcal{D}$ of $S$ with a penalty of no greater than $T$, can we construct a vertex covering of size at most $K$?
First, because the character D uniquely exists in $U$, $\mathcal{D}$ splits $S$ at least at every occurrence of D. In addition, $\mathcal{D}$ must decompose $\prod_{i=1}^{2N} (E^{L-i} D)$ into $E^{L-1} \cdot D \cdot \ldots \cdot E^{L-2N} \cdot D$, the penalty of which is $(2L - 2N + 3)N$.
Considering how we build $S_i$, we have only two ways to decompose $S_i$: to partition it by F or to leave it as it is. These decompositions incur penalties of $\tilde{L} + A$ and $\tilde{L} + 1$, respectively. Let $M$ be the set of vertices $i$ such that $\mathcal{D}$ splits $S_i$ by F. Then, the penalty would be:
$$|M|(\tilde{L} + A) + (N - |M|)(\tilde{L} + 1) = N\tilde{L} + N + (A - 1)|M|.$$
Summing the penalties of D, F, $S_i$, and the auxiliary domain, the total penalty apart from the decompositions of the $S_{i,j}$s is
$$T' = 4 + (2L - 2N + 3)N + N\tilde{L} + N + (A - 1)|M| + |E|.$$
As we assume that the total penalty is $T$ or less, the penalty remaining for the $S_{i,j}$s is at most
$$T - T' = 3|E| + (A - 1)K - (A - 1)|M| \ (\le L).$$
We can only use (5) and (6) to decompose each $S_{i,j}$, with a unit $P_i(i, j)$ or $P_j(i, j)$ that already occurs because $\mathcal{D}$ splits either $S_i$ or $S_j$ by F. This is because, otherwise, there would be at least one $S_{i,j}$ incurring an additional $|P_i(i, j)| + 1 = |P_j(i, j)| + 1 \ (> L)$ penalty, which exceeds the remaining penalty. Thus, $M$ is a vertex covering of $G$, and the penalty is three for each $S_{i,j}$. The total penalty of the $S_{i,j}$s becomes $3|E|$. As the overall penalty is at most $T$, we have $T - (T' + 3|E|) = (A - 1)(K - |M|) \ge 0$, or equivalently, $|M| \le K$ because $A > 1$. Thus, $M$ is a vertex covering of $G$ of size at most $K$.
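The final budget argument is plain arithmetic and can be checked numerically. The sketch below assumes the threshold $T = 4 + (2L - 2N + 3)N + N\tilde{L} + N + (A - 1)K + 4|E|$, which is the value implied by the difference $T - T'$ given in the text, with $\tilde{L} = A(2(L + N + 1) + 1)$; the function names and the parameter values are illustrative.

```python
def block_length(A, L, N):
    """The common length of every S_i, written L-tilde in the text."""
    return A * (2 * (L + N + 1) + 1)


def threshold(A, K, L, N, n_edges):
    """Assumed threshold T (inferred from the stated value of T - T')."""
    return (4 + (2 * L - 2 * N + 3) * N + N * block_length(A, L, N)
            + N + (A - 1) * K + 4 * n_edges)


def fixed_penalty(A, L, N, n_edges, m):
    """T': penalty of D, F, the S_i, and the auxiliary domain when |M| = m."""
    return (4 + (2 * L - 2 * N + 3) * N + N * block_length(A, L, N)
            + N + (A - 1) * m + n_edges)


def slack(A, K, L, N, n_edges, m):
    """T - (T' + 3|E|): what is left after every S_{i,j} costs exactly 3."""
    return threshold(A, K, L, N, n_edges) - (fixed_penalty(A, L, N, n_edges, m) + 3 * n_edges)


# Illustrative parameters: path graph on 4 nodes, K = 2.
A, K, N, n_edges = 2, 2, 4, 3
L = 4 * n_edges + (K + 3) * N

for m in range(N + 1):
    print(m, slack(A, K, L, N, n_edges, m))
```

The slack equals $(A - 1)(K - |M|)$, so with $A > 1$ it is non-negative exactly when $|M| \le K$.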