Theory of local k-mer selection with applications to long-read alignment

Abstract

Motivation: Selecting a subset of k-mers in a string in a local manner is a common task in bioinformatics tools for speeding up computation. Arguably the most well-known and common method is the minimizer technique, which selects the ‘lowest-ordered’ k-mer in a sliding window. Recently, it has been shown that minimizers may be a sub-optimal method for selecting subsets of k-mers when mutations are present. There is, however, a lack of understanding behind the theory of why certain methods perform well.

Results: We first theoretically investigate the conservation metric for k-mer selection methods. We derive an exact expression for calculating the conservation of a k-mer selection method. This turns out to be tractable enough for us to prove closed-form expressions for a variety of methods, including (open and closed) syncmers, (a, b, n)-words, and an upper bound for minimizers. As a demonstration of our results, we modified the minimap2 read aligner to use a more conserved k-mer selection method and demonstrate that there is up to an 8.2% relative increase in the number of mapped reads. However, we found that the k-mers selected by more conserved methods are also more repetitive, leading to a runtime increase during alignment. We give new insight into how one might use new k-mer selection methods as a reparameterization to optimize for speed and alignment quality.

Availability and implementation: Simulations and supplementary methods are available at https://github.com/bluenote-1577/local-kmer-selection-results. os-minimap2 is a modified version of minimap2 and available at https://github.com/bluenote-1577/os-minimap2.

Supplementary information: Supplementary data are available at Bioinformatics online.

Case 1: if i = 1 or i = k − β, then trial j + 1 (where j = i + k + β − 1 is the last trial of the run) or trial i − 1, respectively, has to be a failure; otherwise the run is longer than k + β. The remaining trials can be successes or failures; it doesn't matter. Thus Pr(k + β successes in a row starting at i) = (1 − θ)^{k+β} θ for i = 1 or i = k − β. This contributes a 2 · (1 − θ)^{k+β} θ term to the sum. Case 2: if i ≠ 1 and i ≠ k − β, then both of the trials i − 1 and j + 1 have to be failures. Again, the remaining trials can take on any value. Thus Pr(k + β successes in a row starting at i) = (1 − θ)^{k+β} θ^2 for 1 < i < k − β. There are (k − β − 2) such terms in the sum.
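The two cases above can be sanity-checked numerically. The sketch below assumes the run of k + β successes lives in n = 2k − 1 i.i.d. trials (so that there are exactly k − β possible start positions, matching the case analysis) with per-trial success probability 1 − θ; the total number of trials is our assumption, not stated in this excerpt.

```python
import random

def closed_form(theta, k, beta):
    # Case 1 (two boundary start positions) + Case 2 (k - beta - 2 interior ones)
    p = (1 - theta) ** (k + beta)
    return 2 * p * theta + (k - beta - 2) * p * theta ** 2

def monte_carlo(theta, k, beta, trials=200_000, seed=0):
    """Estimate Pr(some maximal run of exactly k + beta successes occurs)."""
    rng = random.Random(seed)
    run_len = k + beta
    n = 2 * k - 1  # assumed total number of trials => k - beta start positions
    hits = 0
    for _ in range(trials):
        seq = [rng.random() < 1 - theta for _ in range(n)]  # True = success
        for i in range(n - run_len + 1):
            left_ok = (i == 0) or not seq[i - 1]            # boundary or failure
            right_ok = (i + run_len == n) or not seq[i + run_len]
            if left_ok and right_ok and all(seq[i:i + run_len]):
                hits += 1
                break  # start events are disjoint, count the string once
        # note: two disjoint runs of length k + beta cannot fit in 2k - 1 trials
    return hits / trials
```

The start events are pairwise disjoint (two runs of length k + β cannot fit in 2k − 1 trials), so the closed form is simply the sum over start positions.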
Figure 2: Graphic of how α consecutive k-mers give rise to a permutation in S_{k−s+α} where the first s-mer is the smallest, the second is the largest, etc. Colors show how k-mers correspond to windows in the permutation.
Proof. Let σ ∈ S_{k−s+α}. Note α is the number of windows in σ of size k − s + 1. For i ∈ A_α = {1, ..., α}, if σ(i) = 1 then the first s-mer in some window starting at i is the smallest, so this permutation is successful. Otherwise, for i ∈ B_α = {k − s + 1, ..., k − s + α}, if σ(i) = 1 then the last s-mer in some window is the smallest, so it is successful.
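The permutation correspondence above can be checked by brute force. The sketch below assumes the success event is the closed syncmer condition described in the proof: at least one of the α windows of size k − s + 1 attains its minimum at the first or last slot.

```python
from itertools import permutations

def cs_success_fraction(alpha, k, s):
    """Fraction of permutations in S_{k-s+alpha} for which at least one of the
    alpha windows of size k-s+1 has its minimum at the first or last position
    (the closed syncmer condition)."""
    n = k - s + alpha      # number of s-mers spanned by alpha consecutive k-mers
    w = k - s + 1          # number of s-mers per k-mer window
    good = total = 0
    for sigma in permutations(range(n)):
        total += 1
        for i in range(alpha):
            window = sigma[i:i + w]
            m = window.index(min(window))
            if m == 0 or m == w - 1:
                good += 1
                break
    return good / total
```

For α = 1 this recovers the well-known closed syncmer density 2/(k − s + 1), and the fraction is non-decreasing in α since the success event only grows.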
(Case 2: if β < τ + 1). In this case, β is left of position t. Notice that all windows containing position β can never be successful, since the first window contains β at a position < τ + 1, and the relative position of β in subsequent windows will be < τ + 1 as well.
This new permutation has to satisfy condition 2, and the number of such permutations is exactly OS(α − β, k, s, t). We have to multiply by an additional (k − s + α − 1)_{β−1} factor to count the possible values for the β − 1 entries to the left of β, each of which gives the same permutation in S_{k−s+α−β} after relabelling. Summing over β = 1, ..., ℓ_1 = τ gives the R(α, k, s, t, ℓ_1) term.
(Case 3: if β > τ + α). This case is identical to case 2, and the same argument works after flipping directions. This works by summing over the ℓ_2 = k − s − τ possible positions β ∈ {k − s + α, k − s + α − 1, ..., τ + 1 + α} and using the same relabelling after cutting off a portion of the permutation. The number of permutations for β = k − s + α − i is the same as for β = i by symmetry. Using this correspondence gives the R(α, k, s, t, ℓ_2) term and completes the proof.
We now prove the following theorem.
Proof of Lemma. We show OS(α, β − 1, t) ≥ OS(α, β, t) for any β, which implies the result. After normalizing each count by the total number of permutations, this is equivalent to showing that Pr(f, α − β + 1) ≥ Pr(f, α − β) when f is an open syncmer method with fixed parameters k, s, t, using our correspondence between random permutations and the event that a k-mer is selected by f. By definition, Pr(f, x) is the probability that at least one of x consecutive k-mers is selected by f, so Pr(f, x) is non-decreasing in x. Technically, the correspondence is only true up to a small error due to the chance of repeated k-mers appearing in a window, but one can make OS(x, k, s, t) (suitably normalized) arbitrarily close to Pr(f, x) by letting the alphabet be very large, making repeats unlikely (see Section 2.3.1 in [1]). The claim then follows from Pr(f, α − β + 1) ≥ Pr(f, α − β), and we're done.
Proof of Theorem S2.3. We use the same notation as in Lemma S2.4 for OS.
Observe that OS(α, k, s, t) = OS(α, k, s, k − s + 2 − t), since this just swaps the ℓ_1, ℓ_2 in the definition. Since k − s + 2 − t̂ = t̂ or t̂ + 1 depending on whether k − s + 1 is odd or even, we only need to prove that this inequality holds for t < t̂. We will assume k − s + 1 is odd for exposition, since the indices are easier to handle, but the result holds either way after a slight modification. We proceed by induction on α for OS(α, k, s, t). For the base case α = 1, notice that clearly OS(1, k, s, t) = OS(1, k, s, t̂) = (k − s)!. Let τ̂ = t̂ − 1 and τ = t − 1. Therefore we want the difference OS(α, k, s, t̂) − OS(α, k, s, t), expanded using the recursion from the previous section, to be positive. By the induction assumption, the first two sums are ≥ 0 since OS(α − β, k, s, t̂) ≥ OS(α − β, k, s, t) for β ≥ 1. For the last term, k − s + 1 odd gives us k − s − τ̂ = τ̂, so we can rewrite the last line accordingly. By the induction assumption, OS(α, (τ̂ − j + 1), t̂) ≥ OS(α, (τ̂ − j + 1), t), and using Lemma S2.4 finishes the proof because τ̂ − j + 1 ≤ τ̂ + j for all j ≥ 1. Figure 3 shows what Pr(f) looks like for syncmer methods over a range of t.
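The symmetry OS(α, k, s, t) = OS(α, k, s, k − s + 2 − t), the base case OS(1, k, s, t) = (k − s)!, and the advantage of the middle offset t̂ can all be verified by enumeration. This sketch assumes OS(α, k, s, t) counts permutations in S_{k−s+α} for which at least one of the α windows of size k − s + 1 has its minimum at relative position t (1-indexed), i.e. the open syncmer success event.

```python
from itertools import permutations

def os_count(alpha, k, s, t):
    """Brute-force OS(alpha, k, s, t): number of permutations in S_{k-s+alpha}
    where at least one of the alpha windows (size k-s+1) attains its minimum
    at relative position t (1-indexed)."""
    n, w = k - s + alpha, k - s + 1
    count = 0
    for sigma in permutations(range(n)):
        for i in range(alpha):
            window = sigma[i:i + w]
            if window.index(min(window)) == t - 1:
                count += 1
                break  # count each permutation once
    return count
```

For k = 5, s = 2 the window size k − s + 1 = 4 is even, so positions pair up under t ↔ k − s + 2 − t = 5 − t, and the base case gives (k − s)! = 6.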

S3.1 Context dependency
Recall the exact expression for calculating Cons(f, θ, k). Consider the case w = 2 when f is a random minimizer. For E_x, we must consider the 2(w − 1) + α = 3 relevant k-mers on S; call these k-mers x_1, x_2, x_3, where x_2 is the unmutated k-mer overlapping position i. We can see that Pr(E_x) = 2/3, either from recognizing that this is the density 2/(w + 1) or from realizing that this is equivalent to permutations in S_3 for which σ(2) = 1 or 2, where σ(2) is the relative ordering of x_2.
However, the only k-mer on S′ which is guaranteed to be unmutated is x_2, since α = 1. The surrounding k-mers x′_1, x′_3 may not be equal to x_1, x_3. If they are not, then consider a random ordering on x_1, x_2, x_3, x′_1, x′_3: if x_2 is the smallest or second smallest, it is selected on both strings, but if it is the third smallest then there are 8 permutations for which x_2 is not selected on one of the strings. Although Pr(⋃_{j=0}^{α} E_{j+x} ∩ E′_{j+x}) can be similarly phrased in terms of permutations, a trickier combinatorial problem arises.
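The "8 permutations" claim above can be verified by enumerating all orderings of the five k-mers. The sketch assumes the w = 2 minimizer setup described in the text: x_2 is selected on S if it beats at least one of its neighbours x_1, x_3, and on S′ if it beats at least one of x′_1, x′_3.

```python
from itertools import permutations

def count_rank2_failures():
    """Enumerate orderings (ranks) of (x1, x2, x3, x1', x3'); count those where
    x2 is the third smallest yet is not selected on one of the two strings."""
    bad = 0
    for x1, x2, x3, x1p, x3p in permutations(range(5)):  # lower rank = smaller
        on_s  = x2 < x1  or x2 < x3    # some window of S selects x2 (w = 2)
        on_sp = x2 < x1p or x2 < x3p   # some window of S' selects x2
        if x2 == 2 and not (on_s and on_sp):
            bad += 1
    return bad
```

Of the 24 orderings in which x_2 is third smallest, x_2 fails on one string exactly when both smaller elements come from the same pair ({x_1, x_3} or {x′_1, x′_3}), giving 4 + 4 = 8 failing orderings.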
S4 Proof of (a, b, n)-words method probability vector

Theorem S4.1. Pr(f, α) under the (a, b, n)-words method is

Pr(f, α) = Σ_{i ≥ 1} (−1)^{i+1} c(i, n + 1) (3^n / (4 · 4^n))^i,

where c(i, n + 1) is given by the following lemma.

Lemma S4.2. Given a set of α elements labelled {1, ..., α}, the number of ways c(i, n + 1) to choose i elements x_1, ..., x_i, ordered so that x_j < x_{j+1} for j = 1, ..., i − 1 and with |x_j − x_{j+1}| ≥ n + 1 for all j, is C(α − (i − 1)n, i).

Proof. Let y_0 = x_1 − 1, y_j = x_{j+1} − x_j − 1 for j = 1, ..., i − 1, and y_i = α − x_i. The y_j's represent the gaps between the x_j's and also the endpoints. A valid choice of x_j's corresponds exactly to a choice of y_j's such that y_j ≥ n for j = 1, ..., i − 1 and y_0, y_i ≥ 0. Furthermore, y_0 + y_1 + ... + y_i = α − i. We can take z_j = y_j − n for j = 1, ..., i − 1 and z_j = y_j otherwise to get the equivalent problem of finding z_j all ≥ 0 such that z_0 + z_1 + ... + z_i = α − i − (i − 1)n. This problem is equivalent to putting α − i − (i − 1)n indistinct balls into i + 1 distinct jars represented by the variables z_j. The solution is C(α − i − (i − 1)n + i, i) = C(α − (i − 1)n, i).

Proof of Theorem S4.1. The probability that at least one of the α k-mers is selected is Pr(⋃_{i=1}^{α} E_i), where E_i is the event that the i-th k-mer is selected. By inclusion-exclusion, we get Pr(⋃_{i=1}^{α} E_i) = Σ_{∅ ≠ I ⊆ {1,...,α}} (−1)^{|I|+1} Pr(E_I), where E_I = ⋂_{i∈I} E_i. Now note that the probability that E_i ∩ E_j occurs for |i − j| ≤ n is 0; k-mers with prefix abb...b may not be within distance n of each other. If the i k-mers are all at pairwise distance ≥ n + 1, then the probability of that event occurring is just (3^n / (4 · 4^n))^i, because this is just the sequence abb...b appearing i times in a string of i.i.d. random letters. Therefore, denoting c(i, n + 1) to be the number of ways to select i elements from {1, ..., α} such that the elements are pairwise at distance ≥ n + 1 apart, we get the claimed sum. Plugging in the above lemma finishes the proof.
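The inclusion-exclusion formula can be checked against simulation. The sketch below assumes the selection rule implied by the proof: a k-mer is selected when it starts with a fixed letter a followed by n letters each different from a (so a single k-mer is selected with probability 3^n / 4^{n+1}); this reading of the (a, b, n)-words rule is our assumption.

```python
from math import comb
import random

def pr_words(alpha, n):
    """Pr(f, alpha) for the (a,b,n)-words method via inclusion-exclusion:
    sum_i (-1)^(i+1) * C(alpha - (i-1)n, i) * (3^n / 4^(n+1))^i."""
    q = 3 ** n / 4 ** (n + 1)        # probability a single k-mer is selected
    total, i = 0.0, 1
    while alpha - (i - 1) * n >= i:  # c(i, n+1) > 0 only for these i
        total += (-1) ** (i + 1) * comb(alpha - (i - 1) * n, i) * q ** i
        i += 1
    return total

def pr_words_mc(alpha, n, trials=200_000, seed=1):
    """Monte Carlo estimate: letter 0 plays the role of 'a'; a k-mer is selected
    when it starts with 0 followed by n letters different from 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = [rng.randrange(4) for _ in range(alpha + n)]
        if any(s[j] == 0 and all(s[j + m] != 0 for m in range(1, n + 1))
               for j in range(alpha)):
            hits += 1
    return hits / trials
```

For n = 2 the single-k-mer probability is 9/64, matching the words method density quoted in Section S5.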

S5 Comparing Pr(f )
In Figure 5, we plot all Pr(f) and UB(d), where all methods have density d = 1/7 except for the words method, which has density 9/64 ≈ 1/7.11. This is due to the limited range of parameter choices for the words method.

S6 Defining the words sets
Here r = {A, G} and y = {C, T}, and by rryrry we mean all 6-mers that satisfy this condition. This set leads to a d = 1/4 method and was found by an optimization algorithm [2]. We take the words set W_8 analogously, with r = {A, G} and y = {C, T} and the set consisting of all 6-mers that satisfy the stated condition. This set leads to a d = 1/8 method and was found by an optimization algorithm [2].
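To make the pattern notation concrete, the sketch below expands a degenerate-base pattern such as rryrry into the explicit set of 6-mers it denotes; the helper name is ours.

```python
from itertools import product

R, Y = "AG", "CT"  # purines and pyrimidines

def expand(pattern):
    """Expand a pattern over {r, y} into the set of all matching DNA words."""
    choices = [R if c == "r" else Y for c in pattern]
    return {"".join(w) for w in product(*choices)}

words = expand("rryrry")  # 2^6 = 64 six-mers, e.g. AACAAC, GGTGGT, ...
```

Each of the six positions has two allowed letters, so the pattern denotes 2^6 = 64 distinct 6-mers.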

S7.1 Simulated PacBio Experiment
For the simulated reads, we used PBSIM [3] to simulate PacBio CLR reads at 0.5× coverage across GRCh38 with mean length 15 kb and 10% error rate. We report the mapping accuracy using paftools from [4]; a similar evaluation procedure is used in [5]. paftools calls an alignment correct if the overlap between an alignment and the true position of the simulated read is greater than a 0.1 fraction of all bases covering the true and mapped positions. We report accuracy and computational details in Table 1. The number of unmapped reads was negligible. The "% of unique selected k-mers" column counts k-mers selected by syncmers/minimizers from the reference that are unique over all indexed k-mers.
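The correctness criterion amounts to an interval-overlap test. Below is a minimal sketch in the spirit of paftools' evaluation; the function name and exact interval conventions are ours, not paftools'.

```python
def mapeval_correct(true_iv, mapped_iv, min_frac=0.1):
    """An alignment counts as correct if the intersection of the true and
    mapped reference intervals exceeds min_frac of the total span of bases
    covered by either. Intervals are (start, end) half-open coordinates on
    the same contig."""
    ts, te = true_iv
    ms, me = mapped_iv
    overlap = max(0, min(te, me) - max(ts, ms))
    span = max(te, me) - min(ts, ms)  # bases covering true and mapped positions
    return overlap > min_frac * span
```

For example, a 1 kb read mapped 100 bp away from its true position overlaps 900 of 1100 spanned bases and is counted correct at the 0.1 threshold, while a mapping on a different locus has zero overlap and is counted wrong.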