A simple refined DNA minimizer operator enables 2-fold faster computation

Abstract
Motivation: The minimizer concept is a data structure for sequence sketching. The standard canonical minimizer selects a subset of k-mers from a given DNA sequence by comparing the forward and reverse k-mers in a window simultaneously according to a predefined selection scheme. It is widely employed in sequence analysis tasks such as read mapping and assembly. k-mer density, k-mer repetitiveness (e.g. k-mer bias), and computational efficiency are three critical measurements for minimizer selection schemes, and there exist trade-offs between the various minimizer variants. High-performance minimizer algorithms should be generic, effective, and efficient.
Results: We propose a simple minimizer operator as a refinement of the standard canonical minimizer. It takes only a few operations to compute, yet it improves the k-mer repetitiveness, especially for the lexicographic order, and it applies to other selection schemes of total orders (e.g. random orders). Moreover, it is computationally efficient, and its density is close to that of the standard minimizer. The refined minimizer may benefit high-performance applications such as binning and read mapping.
Availability and implementation: The source code of the benchmark in this work is available at the GitHub repository https://github.com/xp3i4/mini_benchmark


Introduction
The minimizer concept is a data structure for sequence sketching. It was first introduced to sequence analysis by Roberts et al. (2004) to reduce the storage requirements of biological sequence data. It has since been applied in many other applications in the field, such as sequence binning (Deorowicz et al. 2015), sequence compaction (Chikhi et al. 2016), sequence classification (Wood and Salzberg 2014), and read mapping (Li 2016, Jain et al. 2020, Büchler et al. 2023).
Given a sequence, the minimizer is the minimum k-mer under a predefined ordering scheme in a window of w consecutive k-mers. Minimizer performance relates to several key measurements. Schleimer et al.'s (2003) study defined the density of a k-mer selection scheme as the fraction of selected k-mers. Formally, denote < the ordering scheme and X the selected k-mers in the sequence S, whose size |S| ≥ w + k. The density of the selection scheme is given by

    q = |X| / (|S| − k + 1)    (1)

where |X| and |S| are the sizes of X and S. Since density was first introduced to measure storage requirements, selection schemes are supposed to select a set of k-mers that is as sparse as possible so that storage requirements can be largely reduced. Novel selection schemes, such as those of Orenstein et al. (2016), Marçais et al. (2017), Jain et al. (2020), and Zheng et al. (2021), have been proposed to improve the minimizer density.
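As a minimal illustration of the density measurement, the following sketch scans every window, selects the lexicographically smallest k-mer, and reports the fraction of selected positions. The leftmost tie-breaking rule and the forward-strand-only scan are simplifying assumptions for illustration, not the paper's benchmark code.

```python
def minimizer_positions(s, w, k):
    """Positions of lexicographic minimizers: in each window of w
    consecutive k-mers, select the smallest k-mer (leftmost on ties)."""
    positions = set()
    for n in range(len(s) - (w + k - 1) + 1):
        # Pair each k-mer with its start; tuple min breaks ties leftmost.
        kmers = [(s[n + i:n + i + k], n + i) for i in range(w)]
        positions.add(min(kmers)[1])
    return positions

def density(s, w, k):
    """Fraction of selected k-mers: |X| / (|S| - k + 1)."""
    return len(minimizer_positions(s, w, k)) / (len(s) - k + 1)
```

On highly repetitive input the selected set collapses onto few positions, which is why density alone does not capture repetitiveness.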
The k-mer repetitiveness is another minimizer measurement. In practice, it is measured by the k-mer frequency. Formally, the frequency of a k-mer X = x in S is defined as its average occurrences in the sequence,

    v(x) = n(x) / |S|    (2)

where n(x) is the number of occurrences of the k-mer X = x. Let V denote the random variable over possible k-mer frequencies. It relates to the performance of applications such as: 1) read mapping, where for the anchoring (seeding) problem we need to find all matched pairs of minimizers in the reference and the read; and 2) binning, where, similar to read mapping, we need to find matched minimizers and cluster them into bins.
For both problems, we prefer selection schemes that generate minimizers of lower repetitiveness (Deorowicz et al. 2015), because highly repetitive minimizers would significantly decrease the matching accuracy and computational efficiency. Like many other fundamental data structures, computational efficiency is the third performance measurement. Although the time complexity of computing minimizers is commonly linear, optimizations of density or k-mer repetitiveness may significantly increase the runtime. For high-performance applications, such as population-scale read mapping, drops in computational efficiency may be non-negligible.
In general, there exist performance trade-offs between minimizer variants. For instance, the random ordering scheme (Chikhi et al. 2014) generates more uniformly and sparsely distributed minimizers than the lexicographic ordering scheme at the expense of increased runtime. In contrast, lexicographic minimizers are less affected by nearby mutations or sequencing errors than random minimizers, a property sometimes called "conservation" (Edgar 2021). Thus, they are beneficial to some matching applications, but the trade-off is less random sampling.
Here, we propose an operator as a refinement of the standard (canonical) minimizer.It has the following features.
1) It improves the k-mer repetitiveness of the standard minimizer: it is less biased toward small k-mers and distributes more uniformly.
2) It applies to any selection scheme of a total order (Davey et al. 2002) (e.g. lexicographic or random order).
3) Its density converges toward that of the standard minimizer.
4) It is commonly faster than the standard minimizer to compute, by up to a factor of two.
It is worth noting that the operator does not apply to noncanonical minimizers of single-strand sequences, such as RNA minimizers. However, canonical minimizers are essential to most sequence analysis applications, such as read mapping and genome assembly.
In the following sections, we first define the refined minimizer. Next, we prove three properties that are essential to the refined minimizer's performance. In the results, we compare the algorithmic complexity of computing the standard and refined minimizers. Then, we evaluate the statistics (e.g. repetitiveness, density) of standard and refined minimizers on real sequences. Finally, we analyze the statistics and discuss potential limitations and improvements.

Definitions
Operations: For high-performance applications, a preferable minimizer function should be simple and effective. The core idea of the refined minimizer is to define an appropriate decision function that makes the ordering scheme compute minimizers on only one strand of the sequence, such that the smallest k-mers are less likely to be selected repetitively. Provided |s| ≡ 1 (mod 2), we define an operator

    d(s) = (p_A + p_C) − (p_G + p_T)    (3)

where p_A, p_C, p_G, p_T are the occurrences of the characters A, C, G, T in s. The condition |s| ≡ 1 (mod 2) guarantees d(s) ≠ 0, which will be discussed in the properties. The refined minimizer h_r is then defined as

    h_r(s) = min_{x ∈ kmers(s)} h(x) if d(s) > 0; min_{x ∈ kmers(s')} h(x) otherwise    (4)

where s' is the reverse complement of s, whereas the standard canonical minimizer h_s(s) takes the minimum over the k-mers of both strands. Table 2 is an example comparing refined and standard minimizers. The lexicographic order of a given k-mer can be computed by

    h(x) = Σ_{i=0}^{k−1} a_i · 4^i

where a_i is the order of the ith (counted right to left) character of the k-mer, and a_i equals 0, 1, 2, 3 for A, C, G, T.
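A runnable sketch of these definitions, under the assumptions that d(s) counts A/C occurrences minus G/T occurrences and that the refined minimizer evaluates only the strand selected by the sign of d(s), while the standard canonical minimizer takes the minimum over both strands:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
ORDER = {"A": 0, "C": 1, "G": 2, "T": 3}

def revcomp(s):
    """Reverse complement of a DNA string."""
    return s.translate(COMPLEMENT)[::-1]

def d(s):
    """Refined operator: occurrences of A/C minus occurrences of G/T.
    For odd |s| this is never 0, so a strand is always selected."""
    return sum(1 if c in "AC" else -1 for c in s)

def lex_order(kmer):
    """Lexicographic order h(x) = sum a_i * 4^i, i counted right to left."""
    h = 0
    for c in kmer:
        h = h * 4 + ORDER[c]
    return h

def standard_minimizer(s, k):
    """Canonical: smallest order among forward and reverse-complement k-mers."""
    kmers = [s[i:i + k] for i in range(len(s) - k + 1)]
    kmers += [revcomp(x) for x in kmers]
    return min(lex_order(x) for x in kmers)

def refined_minimizer(s, k):
    """Evaluate only the strand selected by the sign of d(s)."""
    t = s if d(s) > 0 else revcomp(s)
    return min(lex_order(t[i:i + k]) for i in range(len(t) - k + 1))
```

Since d(s') = −d(s), both strands select the same sequence to scan, which is the strand-symmetry property; and the single-strand minimum can never undercut the two-strand minimum, which is the bias property.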

Properties
Here, we discuss three refined minimizer properties that are essential to the applications. They hold for all ordering schemes (w, k, <) defined above. The first guarantees strand symmetry, such that the computation of the refined minimizer is independent of the strand. The second guarantees that the refined minimizer is never smaller than the standard one. The third guarantees that the refined minimizers have a reasonable density, close to that of the standard one.
1) For any sequence s and its reverse complement s', h_r(s) = h_r(s'). Proof: Reverse complementation swaps A with T and C with G, so d(s') = −d(s). Moreover, |s| ≡ 1 (mod 2) guarantees d(s) ≠ 0, since p_A + p_C and p_G + p_T cannot be equal when their sum |s| is odd. Hence exactly one of the two strands satisfies d > 0, and both s and s' select that same strand according to the definition in expression (4).

2) For any total order < of Σ^k, h_s(s) ≤ h_r(s). Proof: h_s(s) is the minimum over the k-mers of both strands, whereas h_r(s) is the minimum over the k-mers of one strand only, so the former can only be smaller or equal. It implies that h_r(s) is less biased toward small k-mers than h_s(s).

3) Denote s_n = a_0 a_1 … a_{|s|−1} and s_{n+1} = a_1 a_2 … a_{|s|} the nth and (n+1)th subsequences, where a_i is the ith base, and denote d_n = d(s_n) the operator of s_n defined in expression (3). Provided the sequence is random, the following limit holds:

    lim_{|s|→+∞} P(h_r(s_n) ≠ h_r(s_{n+1})) = P(h_s(s_n) ≠ h_s(s_{n+1}))    (5)

There exist two cases in which h_s(s_n) ≠ h_s(s_{n+1}), namely the minimizer of s_n is its leftmost k-mer, or the minimizer of s_{n+1} is its rightmost one; otherwise s_n and s_{n+1} share the same minimizer.

We then prove the limit in expression (5). Since s_{n+1} can be iterated from s_n by removing the first character a_0 and appending the character a_{|s|} at the end, we have

    d_{n+1} = d_n + Δ_n,  Δ_n = δ(a_{|s|}) − δ(a_0)    (6)

where δ(x) = 1 if x ∈ {A, C} and δ(x) = −1 if x ∈ {G, T}. It is worth noting that d_n d_{n+1} ≠ 0, since d ≠ 0 has been proved in the first property. Whenever d_n d_{n+1} > 0, the two windows select the same strand, and by the definition in expression (4) the refined minimizer changes between s_n and s_{n+1} exactly when the single-strand minimizer does. The strands differ only if d_n d_{n+1} < 0, which, because Δ_n ∈ {−2, 0, 2}, requires |d_n| = 1: for d_n = 1 and d_{n+1} = −1 we need a_0 ∈ {A, C} and a_{|s|} ∈ {G, T}, and analogously for d_n = −1 and d_{n+1} = 1. For a random sequence, d_n is a sum of |s| independent ±1 terms by expressions (3) and (6), so P(|d_n| = 1) = O(1/√|s|), and therefore

    lim_{|s|→+∞} P(d_n d_{n+1} < 0) = 0

Therefore, the limit in expression (5) holds. Based on the discussion above, we have the expected k-mer density of refined minimizers

    q_r = q_s + O(P(d_n d_{n+1} < 0))    (7)

where q_s is the expected density of standard minimizers. Therefore, lim_{|s|→+∞} q_r = q_s.
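The vanishing strand-flip probability in the third property can be checked empirically. This quick simulation (trial count and seed are arbitrary choices, not part of the proof) estimates how often adjacent windows of a uniformly random sequence disagree on the sign of d:

```python
import random

def d(s):
    """Refined operator: occurrences of A/C minus occurrences of G/T."""
    return sum(1 if c in "AC" else -1 for c in s)

def sign_flip_rate(window_len, trials=2000, seed=1):
    """Fraction of adjacent windows s_n, s_{n+1} with d_n * d_{n+1} < 0
    in a uniformly random sequence (window_len should be odd)."""
    rng = random.Random(seed)
    flips = 0
    for _ in range(trials):
        # window_len + 1 characters give one pair of overlapping windows.
        seq = "".join(rng.choice("ACGT") for _ in range(window_len + 1))
        if d(seq[:-1]) * d(seq[1:]) < 0:
            flips += 1
    return flips / trials
```

For short windows the flip rate is roughly 10%, while for windows of length ~100 it falls toward a few percent, consistent with the O(1/√|s|) decay.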

Heuristics
Expression (7) suggests that we can improve the k-mer density without significantly impacting the selected minimizers by simply skipping the (n+1)th window if d_n d_{n+1} < 0. The core idea of the heuristic is to skip the "solo" windows, whose signs of d differ from those of both their predecessor and successor windows. Solo windows are a minority, especially for large |s|, yet they significantly increase P(d_n d_{n+1} < 0) in expression (7). The heuristic skips minimizers of solo windows while preserving minimizers of "non-solo" ones. For instance, if d_1, d_2, d_3 = −1, 1, −1, then skipping the solo window 2 will also drop its minimizer. However, if d_1, d_2, d_3 = −1, 1, 1, then skipping window 2, which is non-solo, may not affect its minimizer, since window 3 may preserve it.
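The solo-window rule can be sketched over a precomputed sequence of d values. This is a sketch only; treating the boundary windows as never solo is an assumption not specified above:

```python
def solo_mask(ds):
    """Mark 'solo' windows: sign of d differs from both neighbours.
    ds is the list of d values of consecutive windows; boundary
    windows are never marked solo here (a modelling assumption)."""
    solo = [False] * len(ds)
    for i in range(1, len(ds) - 1):
        if ds[i] * ds[i - 1] < 0 and ds[i] * ds[i + 1] < 0:
            solo[i] = True
    return solo
```

Windows flagged by the mask are skipped during minimizer selection, which removes exactly the isolated sign changes driving P(d_n d_{n+1} < 0).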

Runtime
Arbitrary windows: We compared the CPU cycles of computing the refined and standard minimizers in Algorithms 1 and 2. The loops in the pseudocode apply to arbitrary windows and to ordering schemes induced by a random hash function R, such as ntHash (Mohamadi et al. 2016), which directly computes random rolling hash values. CPU cycles for each step are listed in the comments of Algorithms 1 and 2. Algorithm 1 takes

    o_r = 10·o_1 + 2·o_2 + o_3 + w·(3·o_1 + o_R + o_3)

operations in sum, and Algorithm 2 takes o_s operations, counted analogously from its comments. The expected speedup of the refined minimizer is

    T_r = o_s / o_r

Hence T_r ∈ [0.949, 2), where T_r is minimized when w = 1 and o_R = 0 (lexicographic ordering) and maximized when w ≫ 1 or o_R ≫ 0. Therefore, the refined minimizer can be at most two times faster. Applications may apply heuristics to further improve minimizer performance. For instance, a more practical way to break ties (when the smallest k-mer appears multiple times) is to skip ties in adjacent windows, which creates an optimal spread in poly-X regions (e.g. repetitive AA…). Such heuristics introduce additional CPU cycles. However, heuristics for standard minimizers commonly apply to refined minimizers as well and can be integrated into the function R; hence the speedup upper bound is preserved in such cases.
Consecutive windows: Applications may use buffers to reduce the number of k-mer computations when computing minimizers in consecutive windows. The refined minimizer preserves the speedup upper bound in this case, as discussed in the Supplementary Notes. However, the speedup in practice can be washed out to some extent by additional buffer operations, such as reading, writing, and traversing. The exact trade-offs depend on w, k, the ordering scheme, the CPU architecture, etc. Optimizations of buffers can substantially improve the practical runtime in such cases.

Distributions
As discussed above, we ideally prefer selection schemes that generate k-mers of lower frequency for the read mapping and binning problems, and correspondingly, more uniformly distributed minimizers. We evaluated the key statistics shown in Table 3 as a sketch of the distribution of selected minimizers X, which were computed in consecutive windows by streaming GRCh38 (chr 1-22, X, Y). Runtime (T in the table) is the time of computing minimizers in consecutive windows with buffers rather than the runtime of Algorithms 1 and 2. Results for additional groups of |s| ≤ 45 and k ≤ 30 are presented in Supplementary Tables S1 and S2. It is worth noting that the tables only show statistics for even k to simplify the results. The refined minimizer concept also applies to odd k, and the corresponding results show no significant difference compared to those of even k. Supplementary Table S3 shows statistics of minimizers of minimap2 (Li 2018). We evaluated the 25th-95th percentiles of minimizer frequency V, as shown in the table. For instance, P_0.25 = 9.97 per megabase for the standard lexicographic minimizer with |s| = 15, k = 4 means that 25% of minimizer frequencies are lower than this value.
The column D_KL(X||U) is the Kullback-Leibler (KL) divergence between the distribution of X and the uniform k-mer distribution U. It is given by

    D_KL(X||U) = Σ_x P(X = x) log(P(X = x) / U(x)),  U(x) = 1/4^k

since there exist 4^k types of k-mers and each type has the same chance of being selected under U. A lower KL divergence implies that X is more uniformly distributed, and thus the scheme is less biased toward specific minimizers. As expected, the results reveal that refined minimizers have lower KL divergence. Therefore, we would expect refined minimizers to generate less biased k-mers.
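The KL divergence column can be reproduced from a list of selected minimizers as follows (a sketch; k-mers with zero count contribute nothing to the sum since P(x) = 0, and base-2 logarithms give the result in bits):

```python
from collections import Counter
from math import log2

def kl_to_uniform(minimizers, k):
    """D_KL(X || U) in bits, with U the uniform distribution over the
    4**k possible k-mers. Only observed k-mers contribute terms."""
    counts = Counter(minimizers)
    total = sum(counts.values())
    u = 1.0 / 4 ** k
    return sum((c / total) * log2((c / total) / u)
               for c in counts.values())
```

A perfectly uniform sample over all 4^k k-mers yields 0 bits; a sample concentrated on a single k-mer yields the maximum, log2(4^k) = 2k bits.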
The column E-hits is the expected number of hits, a metric introduced by Sahlin (2022). A lower E-hits may benefit applications such as read mapping. In the assessment it is computed as

    E-hits = q · Σ_i P(X = x_i) · v(x_i)

where P(X = x_i) = n(x_i)/|X| is the probability of drawing minimizer x_i from the selected set.
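One common formulation of the expected number of hits draws a selected minimizer at random and counts its occurrences; the density scaling used in the paper's assessment is omitted here, so this sketch shows only the frequency component:

```python
from collections import Counter

def e_hits(minimizers):
    """Expected hits when a selected minimizer is drawn at random:
    E = sum_i (n_i / N) * n_i, where n_i is the count of minimizer x_i
    and N = sum_i n_i. Repetitive minimizers inflate this value."""
    counts = Counter(minimizers)
    total = sum(counts.values())
    return sum(c * c for c in counts.values()) / total
```

With all minimizers distinct, every hit is unique and E-hits is 1; a fully repetitive set of size N yields N.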
It is therefore a comprehensive metric of density q and frequency v(x_i). Since the refined minimizers improve the k-mer frequency V at the cost of a limited increase in density q, we expect refined minimizers to improve E-hits, although the improvement is relatively lower than those of the percentiles and D_KL. E-hits for minimizers in GRCh38 are in line with expectations, as shown in Table 3 and Supplementary Tables S1-S3.

Figure 1 illustrates the empirical distribution of minimizer frequency V discussed above. It is log-scaled since the distribution is right-skewed, namely it has a long tail on the right side. Supplementary Fig. S1 shows the histogram version of the same data as a complement. As discussed above, we prefer small V for anchoring and binning problems, since the large values in the long tails would be the performance bottleneck. The figure reveals that for different |s| and k, standard minimizers have heavier tails, indicating larger V than refined minimizers. Therefore, refined minimizers generate more uniformly distributed k-mers. Figures for additional settings of |s| ≤ 45 and k ≤ 30 are presented in Supplementary Fig. S2.
Overall, the statistics, including the percentiles, D_KL, E-hits, and the distribution figures, suggest that refined lexicographic minimizers are less repetitive than standard lexicographic or random minimizers. Since the refined minimizer is also computationally efficient, we expect it to be better suited to high-performance minimizer applications.

Potential limitations
We can observe a drop in the benefits for the frequency-related statistics of refined minimizers for larger k and |s| (i.e. P_0.95, D_KL, E-hits, and the distributions in Supplementary Fig. S2). However, it is worth noting that the benefits depend on a latent factor, the sequence size. We use a coefficient, the average minimizer occurrences per k-mer type in the sequence, denoted by E(X, k), to describe this latent performance impact.
For instance, if we assess 20-mers in GRCh38 references of approximately 3 Gbp in size, then E(X, k) = q · 3 Gbp / 4^20 ≈ 0. This means that most types of 20-mers never occur in the minimizer set of GRCh38. As a result, the empirical distribution of minimizer frequency will not be close to the expected one due to insufficient minimizers (cf. the law of large numbers). Specifically, E(X, k) drops exponentially or linearly as k or |s| increases. Therefore, given a sequence of fixed size (e.g. GRCh38), we expect to observe significant or moderate drops in the statistics for large k or |s|. For validation, we assessed the empirical distributions of minimizer frequency V for |s| = 25, k = 10 in six sequences whose sizes |S| are 1, 4, 16, 64, 256, and 1024 Mbp, as shown in Supplementary Fig. S3. We can observe that the difference between the standard and refined minimizer distributions is insignificant in short sequences (e.g. 1 Mbp, 4 Mbp). However, the distributions become significantly different as the sequence size |S| increases exponentially. Therefore, the empirical distributions depend on the sequence size, and the practical benefits will increase as the sequence size grows.
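The coefficient E(X, k) is a one-line computation; a sketch with the numbers from the example above (the function name is an illustrative choice):

```python
def expected_occurrences(density, seq_len, k):
    """Average minimizer occurrences per k-mer type:
    E(X, k) = density * |S| / 4**k."""
    return density * seq_len / 4 ** k

# Even with density 1, 20-mers in a ~3 Gbp reference give
# 3e9 / 4**20 ~= 0.0027, so most 20-mer types never occur,
# while small k on megabase-scale input gives values well above 1.
```

The exponential denominator 4^k explains why frequency statistics saturate quickly as k grows on a fixed-size reference.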

Potential improvements
We have discussed the heuristic to improve the refined minimizer density in Section 2.3.There potentially exist other heuristics that can improve the refined minimizers in practice.
For instance, refined minimizers can possibly be improved for specific sequences, such as A,T- or C,G-enriched ones, where the sign of d is likely to change frequently. A potential improvement is to extend d as follows:

    d(s) = ω_1·(p_A − p_T) + ω_2·(p_C − p_G)

where the weights ω_1, ω_2 ≡ 1 (mod 2). Additionally, we can extend d based on the occurrences of 2-mers p_AA, p_AC, …, p_TT, or of q-mers (i.e. substrings of q characters). Generally, d based on the occurrences of q-mers can be defined as

    d(s) = Σ_i ω_i·(p_{q_i} − p_{q'_i})

where q_i, q'_i are the ith q-mer and its reverse complement. The weights ω_i can be optimized, provided the distributions of q-mers in the sequences are known. In practice, the distributions can be approximated by sampling q-mers in subsequences. Such heuristics may further improve the performance of refined minimizers.
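A sketch of the generalized q-mer operator; the enumeration of q-mer/reverse-complement pairs and the default unit weights are assumptions for illustration, and strand antisymmetry d(s') = −d(s) is preserved by construction:

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    """Reverse complement of a DNA string."""
    return s.translate(COMPLEMENT)[::-1]

def extended_d(s, q=2, weights=None):
    """Weighted q-mer operator: sum_i w_i * (p_{q_i} - p_{q'_i}),
    with q'_i the reverse complement of q-mer q_i. Each pair is
    counted once; palindromic q-mers (q_i == q'_i) contribute 0 and
    are skipped. Weights default to 1 (a hypothetical choice)."""
    counts = {}
    for i in range(len(s) - q + 1):
        qmer = s[i:i + q]
        counts[qmer] = counts.get(qmer, 0) + 1
    pairs, seen = [], set()
    for qmer in ("".join(p) for p in product("ACGT", repeat=q)):
        rc = revcomp(qmer)
        if qmer not in seen and qmer != rc:
            seen.update((qmer, rc))
            pairs.append((qmer, rc))
    if weights is None:
        weights = [1] * len(pairs)
    return sum(w * (counts.get(a, 0) - counts.get(b, 0))
               for w, (a, b) in zip(weights, pairs))
```

With q = 1 and unit weights this reduces to the original operator d(s) = (p_A + p_C) − (p_G + p_T).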

Conclusion
In this work, we proposed a refined DNA minimizer operator and discussed basic properties that are essential to applications. The refined minimizer is generic, computationally efficient, and can improve the k-mer repetitiveness, especially for the lexicographic order, at the cost of a limited increase in density. Moreover, simple heuristics, such as skipping "solo" windows, can further improve the performance. Assessments based on GRCh38 are in line with expectations. We expect the performance can be further improved with additional heuristics in practice.
A selection scheme (w, k, <) selects the minimum k-mer among w consecutive k-mers ∈ Σ^k, where Σ is the character set and the order < is commonly induced by a hash function h, which is an injection from Σ^k.

Table 1 .
CPU cycles for operations used to compute minimizers. Operations such as traversing an array will probably trigger an L1 cache read.

Table 2 .
Comparison of standard (Std) and refined (Rfd) minimizers in a DNA sequence s and its reverse complement s', where |s| = 11, k = 5. K is the minimizer. h(K) is the lexicographic order of the minimizer. Q_2(h) is the median of h(K). Values in bold text imply that h is less biased toward small values.
where o_1, …, o_3 are defined in Table 1 and o_R is the CPU cycle count of function R. Assuming o_1 = 1, o_2 = 3, and o_3 takes 10 cycles on average, then

Table 3 .
Statistics of standard (Std) and refined (Rfd) minimizers sampled consecutively in GRCh38: P_0.25-P_0.95 are percentiles of minimizer frequency per megabase. D_KL(X||U) is the Kullback-Leibler (KL) divergence between the distribution of X and the uniform k-mer distribution U. Large values such as E-hits are expressed in scientific notation. T is the runtime. Better values are in bold text.

Figure 1. Empirical distributions of V for k = 4, 8, 12 in rows and |s| = 15, 25 in columns. Rfd and Std are the refined and standard minimizers. The vertical axis equals the frequency of V = v, namely the empirical probability P(V = v). The horizontal and vertical axes are in log10 scale.