Locality-preserving minimal perfect hashing of k-mers

Abstract

Motivation: Minimal perfect hashing is the problem of mapping a static set of n distinct keys into the address space {1, . . ., n} bijectively. It is well-known that n log2(e) bits are necessary to specify a minimal perfect hash function (MPHF) f when no additional knowledge of the input keys is to be used. However, it is often the case in practice that the input keys have intrinsic relationships that we can exploit to lower the bit complexity of f. For example, consider a string and the set of all its distinct k-mers as input keys: since two consecutive k-mers share an overlap of k − 1 symbols, it seems possible to beat the classic log2(e) bits/key barrier in this case. Moreover, we would like f to map consecutive k-mers to consecutive addresses, so as to also preserve as much as possible their relationship in the codomain. This is a useful feature in practice as it guarantees a certain degree of locality of reference for f, resulting in better evaluation time when querying consecutive k-mers.

Results: Motivated by these premises, we initiate the study of a new type of locality-preserving MPHF, designed for k-mers extracted consecutively from a collection of strings. We design a construction whose space usage decreases for growing k and discuss experiments with a practical implementation of the method: in practice, the functions built with our method can be several times smaller and even faster to query than the most efficient MPHFs in the literature.


Introduction
Given a universe set U, a function f : U → [n] = {1, . . ., n} is a minimal perfect hash function (MPHF, henceforth) for a set S ⊆ U with n = |S| if f(x) ≠ f(y) for all x, y ∈ S, x ≠ y. In simpler words, f maps each key of S into a distinct integer in [n]. The function is allowed to return any value in [n] for a key x ∈ U \ S. A classic result established that n log2(e) ≈ 1.44n bits are essentially necessary to represent such functions for |U| ≫ n [Mehlhorn, 1982]. Minimal perfect hashing is a central problem in data structure design and has received considerable attention, both in theory and in practice. In fact, many practical constructions have been proposed (see, e.g., [Pibiri and Trani, 2021a] and references therein). These algorithms find MPHFs that take space close to the theoretical minimum, e.g., 2-3 bits/key, retain very fast lookup time, and scale well to very large sets. Applications of minimal perfect hashing range from computer networks [Lu et al., 2006] to databases [Chang and Lin, 2005], as well as language models [Pibiri and Venturini, 2019, Strimel et al., 2020], compilers, and operating systems. MPHFs have also been used recently in Bioinformatics to implement fast and compact dictionaries for fixed-length DNA strings [Pibiri, 2022b,a, Almodaresi et al., 2018, Marchet et al., 2021].
In its simplicity and versatility, the minimal perfect hashing problem does not take into account specific types of inputs, nor the intrinsic relationships between the input keys. Each key x ∈ S is considered independently from any other key in the set and, as such, P[f(x) = i] ≈ 1/n for any fixed i ∈ [n]. In practice, however, the input keys often present some regularities that we could exploit to let f act "less randomly" on S. This, in turn, would permit a lower space complexity for f.
We therefore consider in this paper the following special setting of the minimal perfect hashing problem: the elements of S are all the distinct sub-strings of length k, for some k > 0, of the strings in a given collection X. The elements of S are called k-mers. The crucial point is that any two consecutive k-mers of a string of X have a strong intrinsic relationship: they share an overlap of k − 1 symbols. It seems profitable to exploit this overlap information to preserve (as much as possible) the local relationship between consecutive k-mers, so as to reduce the randomness of f, thus lowering its bit complexity and evaluation time.
In particular, we are interested in the design of a locality-preserving MPHF in the following sense. Given a query sequence Q, if f(x) = j for some k-mer x ∈ Q, we would like f to hash Next(x) to j + 1, Next(Next(x)) to j + 2, and so on, where Next(x) is the k-mer following x in Q (assuming Next(x) and Next(Next(x)) are in X as well). This behavior of f is very desirable in practice, for at least two important reasons. First, it implies compression for the satellite values associated to k-mers. Typical satellite values are abundance counts, reference identifiers (sometimes called "colors"), or contig identifiers (e.g., unitigs) in a de Bruijn graph. Consecutive k-mers tend to have very similar, if not identical, satellite values, hence hashing consecutive k-mers to consecutive identifiers induces a natural clustering of the associated satellite values which is amenable to effective compression. The second important reason is, clearly, faster evaluation time when querying consecutive k-mers of a sequence. This streaming query modality is the one employed by k-mer-based applications [Almodaresi et al., 2018, Bingmann et al., 2019, Marchet et al., 2021, Robidou and Peterlongo, 2021, Pibiri, 2022b].
We formalize the notion of locality-preserving MPHF, along with other preliminary definitions, in Section 2. We show how to obtain a locality-preserving MPHF in very compact space in Section 3. To achieve this result, we make use of two algorithmic tools: random minimizers [Schleimer et al., 2003, Roberts et al., 2004] and a novel partitioning scheme for sub-sequences of consecutive k-mers sharing the same minimizer (super-k-mers), which allows a more parsimonious memory layout. The space of the proposed solution decreases for growing k, and the data structure is built in linear time in the size of the input (number of distinct k-mers). In Section 4 we present experiments across a breadth of datasets to show that the construction is practical too: the functions can be several times smaller and even faster to query than the most efficient, albeit "general-purpose", minimal perfect hash functions. We conclude in Section 5, where we also sketch some promising future directions. Our C++ implementation of the method is publicly available at https://github.com/jermp/lphash.

Notation and Definitions
Let X be a set of strings over an alphabet Σ. Throughout the paper we focus on the DNA alphabet Σ = {A, C, G, T} to better highlight the connection with our concrete application, but our algorithms can be generalized to work for arbitrary alphabets. A sub-string of length k of a string S ∈ X is called a k-mer of S.
Definition 1 (Spectrum). The k-mer spectrum of X is the set of all distinct k-mers of the strings in X. Formally: spectrum_k(X) := {x ∈ Σ^k | ∃S ∈ X such that x is a k-mer of S}.
Definition 2 (Spectrum-Preserving String Set). A spectrum-preserving string set (or SPSS) S of X is a set of strings such that (i) each string of S has length at least k, and (ii) spectrum_k(S) = spectrum_k(X).
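Definitions 1 and 2 can be made concrete with a short sketch (a hypothetical helper, not part of LPHash), computing the k-mer spectrum of a string collection:

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Sketch of Definition 1: the k-mer spectrum of a collection X is the set of
// all distinct k-mers of its strings.
std::unordered_set<std::string> spectrum(const std::vector<std::string>& X, size_t k) {
    std::unordered_set<std::string> S;
    for (const auto& s : X)
        for (size_t i = 0; i + k <= s.size(); ++i)
            S.insert(s.substr(i, k));
    return S;
}
```

By Definition 2, any SPSS of X has the same spectrum as X itself; for instance, {"ACGT", "CGTA"} is an SPSS of {"ACGTA"} for k = 3.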
Since our goal is to build a MPHF for the k-mers in a SPSS, we are interested in a SPSS S where each k-mer is seen only once, i.e., for each k-mer x ∈ spectrum_k(S) there is exactly one string of S where x appears, and it appears once in that string. We assume that no k-mer appearing at the end of a string shares an overlap of k − 1 symbols with the first k-mer of another string, otherwise we could reduce the number of strings in S and obtain a smaller SPSS. In the following, we make use of this form of SPSS, which is suitable for the minimal perfect hashing problem. We remark that efficient algorithms exist to compute such SPSSs (see, e.g., [Rahman and Medvedev, 2020, Břinda et al., 2021, Khan and Patro, 2021, Khan et al., 2022]).
The input for our problem is therefore a SPSS S for X with |S| strings and n > 1 distinct k-mers. Without loss of generality, we index k-mers based on their positions in S, assuming that an order S_1, S_2, S_3, . . . of the strings of S is fixed, and we indicate with x_i the i-th k-mer of S, for i = 1, . . ., n.
We want to build a MPHF f : Σ^k → [n] for S; more precisely, for the n distinct k-mers in spectrum_k(S). We remark again that our objective is to exploit the overlap of k − 1 symbols between consecutive k-mers of a string of S to preserve their locality, and hence reduce the bit complexity of f as well as its evaluation time when querying k-mers in sequence.
We define a locality-preserving MPHF, or LP-MPHF, for S as follows.
Definition 3 (LP-MPHF). Let f : Σ^k → [n] be a MPHF for S and let A := {x_i | f(x_{i+1}) = f(x_i) + 1, i = 1, . . ., n − 1}. We say that f is (1 − ε)-locality-preserving for S if ε = 1 − |A|/n.

Intuitively, the "best" LP-MPHF for S is the one having the smallest ε, so we look for practical constructions with small ε. On the other hand, note that a "classic" MPHF corresponds to the case where the locality-preserving property is almost always not satisfied and, as a consequence, ε will be approximately 1.
Two more considerations are in order. First, it should be clear that the way we define locality-preservation in Definition 3 is only pertinent to SPSSs, where having consecutive hash codes for consecutive k-mers is a very desirable property, as motivated in Section 1. A different definition of locality-preservation could instead be given if we were considering generic input keys. Second, we did not use the term order-preserving to stress the distinction from classic order-preserving functions in the literature [Fox et al., 1991] that make it possible to preserve any wanted order and, as such, incur an Ω(log n)-bit overhead per key that we wish to avoid. Here, we are interested in preserving only the input order of the k-mers, which is the one that matters in practice.
Definition 4 (Fragmentation Factor). The fragmentation factor of S is α := (|S| − 1)/n.

The fragmentation factor of S is a measure of how contiguous the k-mers in S are. The minimum fragmentation α = 0 is achieved for |S| = 1 and, in this case, x_i shares an overlap of k − 1 symbols with x_{i+1} for all i = 1, . . ., n − 1. This ideal scenario is, however, unlikely to happen in practice. On the other hand, the worst-case scenario of maximum fragmentation α = 1 − 1/n is achieved when |S| = n and k-mers do not share any overlap (of length k − 1). This is also unlikely to happen, given that k-mers are extracted consecutively from the strings of X and, as a result, many overlaps are expected. A more realistic scenario happens, instead, when |S| ≪ n, resulting in α ≪ 1. For the rest of the paper, we focus on this latter scenario to make our analysis meaningful.
From Definitions 3 and 4 it is easy to see that ε ≥ 1/n when α = 0, and ε = 1 when α = 1 − 1/n. In general, we have ε ≥ α + 1/n since there are at least |S| − 1 indexes i for which f(x_{i+1}) ≠ f(x_i) + 1. How small ε can actually be therefore depends on the input SPSS (and on the strategy used to implement f in practice, as we are going to illustrate in Section 3).
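The two quantities above can be sketched numerically (a hypothetical helper, assuming the fragmentation factor is defined as α = (|S| − 1)/n):

```cpp
#include <cassert>
#include <cstddef>

// Sketch of Definition 4 and of the lower bound eps >= alpha + 1/n,
// assuming alpha = (|S| - 1)/n (num_strings = |S|, n = number of k-mers).
double fragmentation(std::size_t num_strings, std::size_t n) {
    return (num_strings - 1.0) / n;
}

double eps_lower_bound(std::size_t num_strings, std::size_t n) {
    return fragmentation(num_strings, n) + 1.0 / n;
}
```

For |S| = 1 (a single string) this yields α = 0 and ε ≥ 1/n; for |S| = n (no overlaps at all) it yields α = 1 − 1/n and ε ≥ 1.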
Lastly in this section, we define minimizers and super-k-mers, which will be among the main ingredients used in Section 3.

Definition 5 (Random Minimizer of a k-mer). Given a k-mer x and a random hash function h, the minimizer of x is any m-mer µ of x such that h(µ) ≤ h(y) for any other m-mer y of x, for some m ≤ k.
In case the minimizer of x is not unique, we break ties by taking the leftmost m-mer in x. For convenience, we indicate with w = k − m + 1 the number of m-mers in a k-mer. (Note that Definition 5 defines a minimizer as a specific m-mer inside a k-mer, rather than as a specific k-mer in a window of w consecutive k-mers, which is the more standard definition found in the literature.) Since h is a random hash function (with a wide range, e.g., [1..2^64]), each m-mer in a k-mer has probability ≈ 1/w of being the minimizer of the k-mer. We say that the triple (k, m, h) defines a random minimizer scheme. The density of a minimizer scheme is the expected number of distinct selected minimizers per k-mer of the input.

[Fig. 1: a super-k-mer g with k-mers x_{g,1}, x_{g,2}, x_{g,3}, x_{g,4}, for k = 13 and minimizer length m = 7. The shaded boxes highlight the minimizer, whose start position is p_{g,i} in k-mer x_{g,i}; here p_{g,1} = 6, p_{g,2} = 5, p_{g,3} = 4, p_{g,4} = 3.]

Definition 6 (Super-k-mer). Given a string S, a super-k-mer g is a maximal sub-string of S where each k-mer has the same minimizer µ and µ appears only once in g.
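Definition 5 can be sketched as follows (an illustrative helper, not the paper's code; splitmix64 stands in for the random hash function h, and DNA symbols are packed 2 bits each):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>

// splitmix64, standing in for the random hash function h of Definition 5.
static uint64_t splitmix64(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

static uint64_t base_code(char c) { return c == 'A' ? 0 : c == 'C' ? 1 : c == 'G' ? 2 : 3; }

// Returns the random minimizer of kmer and its start position
// (1-based, as in the paper). Ties are broken by the leftmost m-mer.
std::pair<std::string, size_t> minimizer(const std::string& kmer, size_t m) {
    size_t best_pos = 0;
    uint64_t best_hash = UINT64_MAX;
    for (size_t p = 0; p + m <= kmer.size(); ++p) {
        uint64_t code = 0;
        for (size_t i = 0; i < m; ++i) code = code * 4 + base_code(kmer[p + i]);
        uint64_t h = splitmix64(code);
        if (h < best_hash) { best_hash = h; best_pos = p; }  // '<' keeps the leftmost on ties
    }
    return {kmer.substr(best_pos, m), best_pos + 1};
}
```

The returned position lies in [1..w], matching the w = k − m + 1 candidate m-mers of the k-mer.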

Locality-Preserving Minimal Perfect Hashing of K-Mers
In this section we describe an algorithm to obtain locality-preserving MPHFs for a spectrum-preserving string set S. The algorithm builds upon the following main insight.
Implicitly Ranking k-mers through Minimizers. Let g be a super-k-mer of some string S ∈ S and assume g is the only super-k-mer whose minimizer is µ. By definition of super-k-mer, all the k-mers x_{g,1}, . . ., x_{g,|g|−k+1} of g contain the minimizer µ as a sub-string, x_{g,i} being the i-th k-mer of g. If p_{g,1} is the start position of µ in the first k-mer, then p_{g,i} = p_{g,1} − i + 1 is the start position of µ in x_{g,i}, for 1 ≤ i ≤ |g| − k + 1. Fig. 1 gives a practical example for a super-k-mer g of length 16 and k = 13. The next property illustrates the relation between the size |g| − k + 1 of the super-k-mer g and the position p_{g,1} (we will come back later to the implications of this property).

Property 1. |g| − k + 1 ≤ p_{g,1} ≤ w for any super-k-mer g.
Proof. Since p_{g,1} is the start position of the minimizer in the first k-mer of g, there are at most p_{g,1} k-mers that contain the minimizer as a sub-string, hence |g| − k + 1 ≤ p_{g,1}. Moreover, p_{g,1} ≤ w since a k-mer contains only w m-mers.

Now, suppose we are given a query k-mer x ∈ S whose minimizer is µ. The k-mer must appear as a sub-string of g, i.e., it must be one among x_{g,1}, . . ., x_{g,|g|−k+1}. We want to compute the rank of x among the k-mers x_{g,1}, . . ., x_{g,|g|−k+1} of g, which we indicate by Rank(x) (assuming that it is clear from the context that Rank is relative to g). Let p be the start position of µ in x. We can use this positional information to compute Rank(x) as follows:

  Rank(x) = p_{g,1} − p + 1, provided that 1 ≤ p_{g,1} − p + 1 ≤ |g| − k + 1;   (2)

otherwise (p_{g,1} < p or p_{g,1} − p + 1 > |g| − k + 1), x cannot possibly be in g and, hence, is not indexed by f.
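Equation 2 can be sketched as follows (a hypothetical helper; positions are 1-based as in the paper, and size_g denotes |g| − k + 1):

```cpp
#include <cassert>
#include <optional>

// Sketch of Equation 2: the rank of a query k-mer inside a super-k-mer g,
// computed from the minimizer start position p in the query, the position
// p_g1 in the first k-mer of g, and the number of k-mers size_g = |g| - k + 1.
// Returns nothing when the query cannot belong to g.
std::optional<long> rank_in_superkmer(long p, long p_g1, long size_g) {
    long r = p_g1 - p + 1;
    if (r < 1 || r > size_g) return std::nullopt;  // x cannot be in g
    return r;
}
```

On the example of Fig. 1 (p_{g,1} = 6, |g| − k + 1 = 4), the k-mer whose minimizer starts at position 6 has rank 1 and the one with position 3 has rank 4.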
Algorithm 1 Evaluation algorithm for f, given the k-mer x. The helper function minimizer(x) computes the minimizer µ of x and the start position p of µ in x.
1: function f(x):
2:   (µ, p) = minimizer(x)
3:   i = f_m(µ)
4:   return L[i] + P[i] − p + 1   (i.e., L[i] + Rank(x), by Equation 2)
Lemma 1. The rank computation in Equation 2 bijectively maps the k-mers of g into [1..|g| − k + 1]; hence, assigning hash codes as

  f(x_{g,i}) = f(x_{g,1}) + Rank(x_{g,i}) − 1   (3)

guarantees that f(x_{g,i+1}) = f(x_{g,i}) + 1 for 1 ≤ i < |g| − k + 1.

To sum up, the position of the minimizer in the first k-mer of g, p_{g,1}, defines an implicit ranking (i.e., achieved without explicit string comparison) of the k-mers inside a super-k-mer.

Basic Data Structure
From Equation 3 it is evident that f(x_{g,1}) acts as a "global" component in the calculation of f(x_{g,i}), which must be added to a "local" component represented by Rank(x_{g,i}). We have already shown how to compute Rank(x_{g,i}) in Equation 2: Lemma 1 guarantees that this local rank computation bijectively maps the k-mers of g into [1..|g| − k + 1]. We are therefore left to show how to compute f(x_{g,1}) for each super-k-mer g. We proceed as follows.
Layout. Let M be the set of all the distinct minimizers of S. We build a MPHF for M, f_m : Σ^m → [|M|]. Assume, for ease of exposition, that each super-k-mer g is the only super-k-mer having minimizer µ. (We explain how to handle the case where more super-k-mers have the same minimizer in Section 3.3.) We allocate an array L[1..|M| + 1] where L[f_m(µ) + 1] stores the size |g| − k + 1 of the super-k-mer g whose minimizer is µ, and L[1] = 0. We then take the prefix-sums of L, so that L[f_m(µ)] indicates the number of k-mers before those of g (whose minimizer is µ) in the order given by f_m. The size of g can be recovered as L[f_m(µ) + 1] − L[f_m(µ)]. Lastly, we store the positions p_{g,1} in an array P[1..|M|], with P[f_m(µ)] = p_{g,1}, again in the order given by f_m. It follows that the data structure is built in O(n) time, since a scan over the input suffices to compute all super-k-mers, and f_m can be built in time linear in |M|.

Lookup. With these three components (f_m and the two arrays L and P) it is easy to evaluate f(x), as shown in Algorithm 1. The complexity of the lookup algorithm is O(w), since this is the complexity of computing the minimizer (assuming each hash calculation takes constant time) and of the overall evaluation of f_m as well, whereas accessing the arrays L and P takes O(1).
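The layout just described can be sketched as follows (a minimal illustration, not LPHash's implementation: a std::unordered_map stands in for the MPHF f_m, plain vectors stand in for the compressed arrays L and P, and indexing is 0-based):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of the basic (un-partitioned) layout: for each unique minimizer mu,
// L stores prefix-sums of super-k-mer sizes in the order given by f_m, and
// P stores the position p_{g,1} of mu in the first k-mer of its super-k-mer.
struct BasicLPHash {
    std::unordered_map<std::string, size_t> fm;  // minimizer -> rank in [0, |M|)
    std::vector<long> L;  // |M| + 1 entries, prefix-sums of |g| - k + 1
    std::vector<long> P;  // |M| entries, p_{g,1} (1-based)

    // f(x), given the minimizer mu of x and its (1-based) start position p in x;
    // assumes x is an indexed k-mer, so its rank is in range (Equation 2).
    long lookup(const std::string& mu, long p) const {
        size_t i = fm.at(mu);
        long size_g = L[i + 1] - L[i];  // |g| - k + 1
        long rank = P[i] - p + 1;       // Equation 2
        assert(rank >= 1 && rank <= size_g);
        return L[i] + rank;             // hash codes in [1..n]
    }
};
```

For example, with two super-k-mers of sizes 3 and 2, minimizer positions p_{g,1} = 3 and 2, the prefix-sums are L = {0, 3, 5} and the five k-mers receive the codes 1..5.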
Compression. The data structure for f is itself a compressed representation of f_m, L, and P. To compute the space taken by the data structure, we first need to know |M|, the expected number of distinct minimizers seen in the input. Assuming again that there are no duplicate minimizers, if d indicates the density of a random minimizer scheme, then:
• |M| = dn, and
• ε ≈ d, since f fails to be locality-preserving only at the boundaries between super-k-mers, which are |M| = dn in number.
In particular, a result due to Zheng et al. [Zheng et al., 2020, Theorem 3] allows us to compute d for a random minimizer scheme as d = 2/(w + 1) + o(1/w) if m > (3 + ε′) log_4(w + 1) for any ε′ > 0. We will always operate under the condition that m is sufficiently large compared to k, otherwise minimizers are meaningless.
Therefore any random minimizer scheme gives us a (1 − ε)-LP MPHF with ε = 2/(w + 1) (we omit lower-order terms for simplicity), as illustrated in the following theorem (see the Supplementary material for the proof).

Theorem 1 (informal). The data structure of this section is a (1 − ε)-LP MPHF for S whose space per k-mer is proportional to ε, where ε = 2/(w + 1) and b is a constant larger than log2(e), accounting for the bits per key spent by f_m.

Note that the space bound in Theorem 1 decreases as w grows; for example, when m is fixed and k grows. Next we show how to improve this result using some structural properties of super-k-mers.

Partitioned Data Structure
Property 1 states that |g| − k + 1 ≤ p_{g,1} ≤ w for any super-k-mer g. As an immediate implication, if |g| − k + 1 = w then also p_{g,1} = w (and, symmetrically, if p_{g,1} = 1 then |g| = k). This suggests that, whenever a super-k-mer contains a maximal number of k-mers, we can always implicitly derive that |g| − k + 1 = p_{g,1} = w. We can thus save the space for the entries dedicated to such super-k-mers in the arrays L and P. Note that the converse is not true in general, i.e., if p_{g,1} = w it could be that |g| − k + 1 < w. Nonetheless, we can still save space for some entries of P in this case.
Depending on the starting position of the minimizer in the first and last k-mer of a super-k-mer, we distinguish between four types of super-k-mers (Definition 7).
Definition 7 (FL rule). Let g be a super-k-mer and let p_f and p_l be the start positions of the minimizer in the first and last k-mer of g, respectively. The first/last (FL) rule assigns g one of four types:
• left-right-max, if p_f = w and p_l = 1;
• left-max, if p_f < w and p_l = 1;
• right-max, if p_f = w and p_l > 1;
• non-max, otherwise (p_f < w and p_l > 1).

[Fig. 3: Partitioned data structure layout and the flow of Algorithm 2 for a query k-mer x, whose minimizer is µ, and with i = f_m(µ). Different colors in R are used to distinguish between the different super-k-mer types.]
See Fig. 2 for a schematic illustration.
Layout. Based on the FL rule above, we derive a partitioned layout as follows. We store the type of each super-k-mer in an array R[1..|M|], in the order given by f_m. We can now exploit this labeling of super-k-mers to improve the space bound of Theorem 1 because:
• for all left-right-max super-k-mers, we store neither L nor P;
• for all left/right-max super-k-mers, we only store L (precisely, two arrays L_l and L_r for left-max and right-max super-k-mers, respectively);
• for all the other super-k-mers, i.e., non-max, we store both L and P as explained before (let us indicate them with L_n and P_n in the following).
Addressing the arrays L_l, L_r, L_n, and P_n can be achieved by answering Rank_t(i) queries on R: the result of this query is the number of super-k-mers of type t in the prefix R[1..i]. If i = f_m(µ), then we read the type of the super-k-mer associated to µ as t = R[i]. Then we compute j = Rank_t(i). Depending on the type t, we either perform no array access or access the j-th position of either L_l, or L_r, or L_n and P_n (see Algorithm 2).
A succinct representation of R that also supports Rank_t(i) and Access(i) queries is the wavelet tree [Grossi et al., 2003]. In our case, we only have 4 possible types, hence a 2-bit integer is sufficient to encode a type. The wavelet tree therefore represents R in 2|M| + o(|M|) bits¹ and supports both queries in O(1) time. The wavelet tree is also built in linear time, so the building time of the overall data structure remains O(n). Refer to Fig. 3 for a pictorial representation of this partitioned layout.
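The role of R can be illustrated with a naive stand-in (a hypothetical sketch: one byte per 2-bit symbol and linear-scan rank, whereas the wavelet tree answers the same Rank_t(i) query in O(1) within 2|M| + o(|M|) bits):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Naive stand-in for the wavelet tree over R. Each super-k-mer type is a
// 2-bit symbol (here: 0 = left-right-max, 1 = left-max, 2 = right-max,
// 3 = non-max), stored byte-wide for simplicity. rank_t(t, i) counts the
// occurrences of type t in R[1..i] by scanning.
struct TypeArray {
    std::vector<uint8_t> R;  // one symbol per minimizer, in the order of f_m

    size_t rank_t(uint8_t t, size_t i) const {  // i is 1-based, inclusive
        size_t c = 0;
        for (size_t j = 0; j < i; ++j) c += (R[j] == t);
        return c;
    }
};
```

The query j = rank_t(R[i−1], i) gives exactly the index used to address L_l, L_r, or L_n and P_n in Algorithm 2.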
Lookup. Algorithm 2 gives the lookup algorithm for the partitioned representation of f. The complexity of the algorithm is still O(w), like that of the un-partitioned counterpart, Algorithm 1. The evaluation algorithm must now distinguish between the four different types of super-k-mer. On the one hand, this distinction involves an extra array access (to R) and a rank query, as explained above; on the other hand, it saves 2 array accesses in the left-right-max case, or 1 in the left/right-max case, compared to Algorithm 1, which always performs 2 array accesses (one to L and one to P). Hence, the overall number of array accesses performed by Algorithm 2 is on average the same as that of Algorithm 1, assuming the four cases are equally likely (see next paragraph). For this reason we do not expect Algorithm 2 to incur a penalty at query time compared to Algorithm 1, despite its more complex evaluation.

¹ The o(|M|) term is the redundancy needed to accelerate the binary rank queries. In practice, the term o(|M|) can be non-negligible, e.g., as high as 2 · (|M|/4) bits using the Rank9 index [Vigna, 2008, Sec. 3], but it is necessary for fast queries in practice (namely, O(1) time). Looking at Table 1a from [Pibiri and Kanda, 2021], we see that the redundancy is between 3% and 25% of 2|M|.
Algorithm 2 Evaluation algorithm for a partitioned representation of f. The quantities n_lr, n_l, n_r, and n_n are, respectively, the number of left-right-max, left-max, right-max, and non-max super-k-mers of S.
 1: function f(x):
 2:   (µ, p) = minimizer(x)
 3:   prefix = 0, offset = 0, p_1 = 0
 4:   i = f_m(µ)
 5:   t = R[i]
 6:   j = Rank_t(i)
 7:   switch(t):
 8:     case left-right-max:
 9:       prefix = 0, offset = (j − 1)w, p_1 = w
10:       break
11:     case left-max:
12:       prefix = n_lr · w, offset = L_l[j], p_1 = L_l[j + 1] − L_l[j]
13:       break
14:     case right-max:
15:       prefix = n_lr · w + L_l[n_l + 1], offset = L_r[j], p_1 = w
16:       break
17:     case non-max:
18:       prefix = n_lr · w + L_l[n_l + 1] + L_r[n_r + 1], offset = L_n[j], p_1 = P_n[j]
19:   return prefix + offset + p_1 − p

Compression. Intuitively, if the fraction of left-right-max super-k-mers and that of left/right-max super-k-mers is sufficiently high, we can save significant space compared to the data structure in Section 3.1, which stores both L and P for all minimizers. We therefore need to compute the proportions of the different types of super-k-mers as given by the FL rule. For ease of notation, let P_lr = P[g is left-right-max], P_l = P[g is left-max], P_r = P[g is right-max], and P_n = P[g is non-max], for any super-k-mer g.
Remark 1. The FL rule is a partitioning rule, i.e., P_lr + P_l + P_r + P_n = 1 for any super-k-mer.
Our objective is to derive the expressions for the probabilities P_lr, P_l, P_r, and P_n, parametric in k (k-mer length) and m (minimizer length). To achieve this goal we propose a simple model based on a (discrete-time) Markov chain.
Let X : Σ^k → {1, . . ., w} be a discrete random variable, modelling the starting position of the minimizer in a k-mer. The corresponding Markov chain is illustrated in Fig. 4. Each state of the chain is labelled with the corresponding value assumed by X, i.e., with each value in {1, . . ., w}. Clearly, we have a left-right-max super-k-mer if, from state w, we transition to state w − 1, then to w − 2, . . ., down to state 1. Each state has a "fallback" probability to go to state w, which corresponds to the event that the right-most m-mer (the one coming next to the right) is the new minimizer. If the chain reaches state 1, instead, we know that we are always going to see a new minimizer next. If c ∈ [1..u] is the code assigned to the current minimizer by the hash function h, for some universe size u (e.g., if c is a 64-bit hash code, then u = 2^64), the probability for any new m-mer to become the minimizer is δ = (c − 1)/u, i.e., the probability that its code is strictly smaller than c. Vice versa, the probability of keeping the same minimizer when sliding one position to the right is 1 − δ. Whenever we change minimizer, we generate a new code c and, hence, the probability δ changes with every formed super-k-mer. Nonetheless, the following theorem shows that the probabilities P_lr, P_l, P_r, and P_n do not depend on δ.
[Fig. 4: the Markov chain for X, with states 1, . . ., w; from each state, the chain falls back to state w with probability δ and otherwise moves one state to the left with probability 1 − δ.]

Theorem 2. For any random minimizer scheme (k, m, h), the probabilities P_lr, P_l, P_r, and P_n admit closed-form expressions that are parametric in w and do not depend on δ.

We give the following lemma to prove Theorem 2. (When we write "first"/"last" k-mer we are going to silently assume "of a super-k-mer".)

Lemma 2. P_lr + P_r = P_lr + P_l = P[X = w].

Proof. We have the equivalence

  P_lr + P_r = P[the first k-mer has its minimizer starting at position w]   (6)

because the starting position of the minimizer of the first k-mer of any left-right-max and of any right-max super-k-mer is w. In a similar way, we have that

  P_lr + P_l = P[the last k-mer has its minimizer starting at position 1]   (7)

because the starting position of the minimizer of the last k-mer of any left-right-max and of any left-max super-k-mer is 1. Now note that P_l = P_r, by the symmetry of the random minimizer scheme. From P_l = P_r, we have P_lr + P_l = P_lr + P_r which, combined with Equation 6 and Equation 7, shows that both quantities equal P[X = w]. Now we prove Theorem 2.
Proof. Since the FL rule induces a partition:

  P_lr + P_r + P_l + P_n = 1 ⟺ P_lr + P_lr + P_r + P_l + P_n = 1 + P_lr.

Again exploiting the fact that P_lr + P_r = P_lr + P_l = P[X = w], we also have

  2 · P[X = w] + P_n = 1 + P_lr, hence P_lr = 2 · P[X = w] + P_n − 1,

and P_l = P_r = P[X = w] − P_lr. We therefore have to compute P_n to also determine P_lr, P_l, and P_r.
In Table 1 we report the probabilities P_lr, P_l, P_r, and P_n computed using Theorem 2 for some representative combinations of k and m (these combinations are some of those used in the experiments of Section 4; see also Table 2). For comparison, we also report the probabilities measured over the whole human genome. We see that the probabilities computed with the formulas in Theorem 2 accurately model the empirical probabilities.
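The proportions can also be estimated empirically with a small Monte Carlo sketch (a hypothetical experiment, not the paper's code: random DNA as input, splitmix64 standing in for h, and the few minimizers with duplicate occurrences ignored for simplicity):

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

static uint64_t mix64(uint64_t x) {  // splitmix64, standing in for h
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

static uint64_t base_code(char c) { return c == 'A' ? 0 : c == 'C' ? 1 : c == 'G' ? 2 : 3; }

// Estimate the FL-rule proportions {P_lr, P_l, P_r, P_n} on a random DNA string.
std::vector<double> fl_proportions(size_t len, size_t k, size_t m, uint32_t seed) {
    std::mt19937 rng(seed);
    std::string s(len, 'A');
    for (auto& c : s) c = "ACGT"[rng() % 4];
    size_t w = k - m + 1;
    std::vector<size_t> pos;  // 1-based minimizer start position inside every k-mer
    for (size_t i = 0; i + k <= s.size(); ++i) {
        uint64_t best = UINT64_MAX;
        size_t bp = 0;
        for (size_t j = 0; j < w; ++j) {
            uint64_t code = 0;
            for (size_t t = 0; t < m; ++t) code = code * 4 + base_code(s[i + j + t]);
            uint64_t h = mix64(code);
            if (h < best) { best = h; bp = j; }  // '<' keeps the leftmost on ties
        }
        pos.push_back(bp + 1);
    }
    // cut super-k-mers where the absolute minimizer position changes,
    // then classify each one with the FL rule
    std::vector<double> cnt(4, 0.0);
    double total = 0;
    size_t a = 0;  // index of the first k-mer of the current super-k-mer
    for (size_t i = 1; i <= pos.size(); ++i) {
        if (i == pos.size() || i + pos[i] != a + pos[a]) {
            size_t pf = pos[a], pl = pos[i - 1];
            size_t type;
            if (pf == w && pl == 1) type = 0;  // left-right-max
            else if (pl == 1) type = 1;        // left-max
            else if (pf == w) type = 2;        // right-max
            else type = 3;                     // non-max
            cnt[type] += 1;
            total += 1;
            a = i;
        }
    }
    for (auto& c : cnt) c /= total;
    return cnt;
}
```

For large w, all four estimated proportions should be close to 1/4, in line with the discussion below.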
The net result is that, for sufficiently large w, the probabilities in Theorem 2 are all approximately equal to 1/4, so that we have ≈ n/(2(w + 1)) super-k-mers of each type. This also implies that the choice of 2-bit codes for the symbols of R is essentially optimal. Under this condition, the partitioned data structure achieves an improved space bound, stated in Theorem 3 (see the Supplementary material for the proof).

Ambiguous Minimizers
Let G_µ be the set of super-k-mers whose minimizer is µ. The rank computation in Equation 2 can be used as long as |G_µ| = 1, i.e., whenever a single super-k-mer g has minimizer µ and, thus, the single position p_{g,1} unambiguously ranks all the k-mers x_{g,1}, . . ., x_{g,|g|−k+1}. When |G_µ| > 1 we say that the minimizer µ is "ambiguous". It is a known fact that the number of such minimizers is very small for a sufficiently long minimizer length m [Pibiri, 2022b, Jain et al., 2020, Chikhi et al., 2014], and that it decreases for growing m. For example, on the datasets used in Section 4, the fraction of ambiguous minimizers is between 1% and 4%. However, they must be dealt with in some way.
Let ξ be the fraction of k-mers whose minimizers are ambiguous. Our strategy is to build a fallback MPHF for these k-mers. This function adds ξ · b bits/k-mer on top of the space of Theorem 1 and Theorem 3, where b > log2(e) is the number of bits per key spent by the MPHF of choice. The fallback MPHF makes our functions (1 − (ε + ξ))-locality-preserving.
To detect ambiguous minimizers, one obvious option would be to explicitly use an extra 1-bit code per minimizer. This would however waste 1 bit for most minimizers, since we expect only a small percentage of them to be ambiguous. To avoid this waste, we use the following trick. Suppose µ is an ambiguous minimizer. We initially pretend that µ is not ambiguous. For the un-partitioned data structure from Section 3.1, we set L[f_m(µ)] = 0. A super-k-mer of size 0 is clearly not possible, thus we use the value 0 to indicate that µ is actually ambiguous. We do the same for the partitioned data structure from Section 3.2: in this case we set L_r[f_m(µ)] = 0, pretending the type of µ is right-max (but we could have also used the type left-max or non-max). To sum up, with just an extra check on the super-k-mer size we know whether the query k-mer must be looked up in the fallback MPHF or not.
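The lookup logic implied by this trick can be sketched as follows (the names fm, L, fallback, and regular_rank are illustrative stand-ins with toy contents, not the LPHash API):

```python
# Toy sketch of the sentinel trick for ambiguous minimizers: fm maps a
# minimizer to a slot, L stores super-k-mer sizes, and size 0 (impossible
# for a real super-k-mer) flags an ambiguous minimizer whose k-mers are
# stored in the fallback MPHF.
fm = {'ACG': 0, 'CGT': 1, 'TTA': 2}   # stand-in for the minimizer MPHF f_m
L = [5, 0, 3]                          # super-k-mer sizes; slot 1 is ambiguous
fallback = {'CGTCGTACGTACG': 7}        # stand-in for the fallback MPHF

def regular_rank(kmer, slot):
    # Placeholder for the rank computation of Equation 2.
    return slot

def lookup(kmer, minimizer):
    slot = fm[minimizer]
    if L[slot] == 0:                   # sentinel: ambiguous minimizer
        return fallback[kmer]
    return regular_rank(kmer, slot)

print(lookup('CGTCGTACGTACG', 'CGT'))  # 7 (routed to the fallback MPHF)
print(lookup('ACGAAATTTGGGC', 'ACG'))  # regular path
```

The point of the sketch is that the ambiguity check costs no dedicated storage: it reuses the size array that the regular path reads anyway.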
We leave the exploration of alternative strategies to handle ambiguous minimizers to future work. For example, one can imagine a recursive data structure where, similarly to [Shibuya et al., 2022], each level is an instance of the construction with a different minimizer length: if level i has minimizer length m_i, then level i + 1 is built with length m_{i+1} > m_i over the k-mers whose minimizers are ambiguous at level i.

Experiments
In this section we report on the experiments conducted to assess the practical performance of the method presented in Section 3, which we refer to as LPHash in the following. Our implementation of the method is written in C++ and available at https://github.com/jermp/lphash.

Implementation Details. We report here the major implementation details for LPHash. The arrays L and P are compressed with Elias-Fano [Fano, 1971, Elias, 1974] to exploit its constant-time random access (see also [Pibiri and Venturini, 2021, Sec. 3.4] for an explanation of this compressed encoding). Both the function f_m and the fallback MPHF are implemented with PTHash using parameters (D-D, α = 0.94, c = 3.0), unless otherwise specified. Under this configuration, the space taken by a PTHash MPHF is 2.3-2.5 bits/key.
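For intuition on the encoding, here is a minimal Python sketch of Elias-Fano (not the implementation used by LPHash): each value is split into low bits, stored verbatim, and a high part, stored as unary gaps; access(i) then reduces to a select query on the high part, which a real implementation answers in constant time with a small precomputed select index.

```python
import math

class EliasFano:
    """Minimal sketch of Elias-Fano encoding of a sorted list of integers
    drawn from [0, u). Plain Python lists stand in for packed bit-vectors."""
    def __init__(self, xs, u):
        n = len(xs)
        self.l = max(0, int(math.log2(u / n)))        # number of low bits
        self.low = [x & ((1 << self.l) - 1) for x in xs]
        self.high = []                                 # unary gaps of high parts
        prev = 0
        for x in xs:
            h = x >> self.l
            self.high += [0] * (h - prev) + [1]
            prev = h

    def access(self, i):
        # Select the (i+1)-th set bit of the high part. A real implementation
        # keeps a sampled select index to make this constant time.
        ones = -1
        for p, bit in enumerate(self.high):
            ones += bit
            if bit and ones == i:
                return ((p - i) << self.l) | self.low[i]

xs = [3, 4, 7, 13, 14, 15, 21, 43]
ef = EliasFano(xs, 64)
print([ef.access(i) for i in range(len(xs))])  # reproduces xs
```

In a bit-packed implementation this layout takes about n(2 + log2(u/n)) bits, close to the information-theoretic minimum for a sorted list.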
We do not compress the bit-vectors in the wavelet tree, and we add constant-time support for rank queries using the Rank9 index [Jacobson, 1989, Vigna, 2008]. The Rank9 index adds 25% more space at each level of the wavelet tree, making the wavelet tree take 2.5 bits per element in practice. Therefore, we estimate the little-oh factor in Theorem 1 and Theorem 3 to be 0.5.
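The rank support can be sketched as a simplified two-level layout in the spirit of Rank9 (the real Rank9 packs, per 512-bit block, a 64-bit absolute count plus seven 9-bit sub-counts, hence the 25% overhead); the sketch below is in Python rather than C++ for brevity:

```python
class RankBV:
    """Simplified two-level rank structure over a bit list: one cumulative
    popcount per 64-bit word, plus a popcount inside the word at query time."""
    def __init__(self, bits):  # bits: list of 0/1
        self.words, self.cum = [], []
        total = 0
        for j in range(0, len(bits), 64):
            w = 0
            for t, b in enumerate(bits[j:j + 64]):
                w |= b << t
            self.cum.append(total)      # number of ones before this word
            self.words.append(w)
            total += bin(w).count('1')
        self.total = total

    def rank1(self, i):
        """Number of set bits among bits[0..i)."""
        q, r = divmod(i, 64)
        if q == len(self.words):
            return self.total
        return self.cum[q] + bin(self.words[q] & ((1 << r) - 1)).count('1')

bits = [1, 0, 1, 1, 0, 0, 1, 0] * 40
bv = RankBV(bits)
print(bv.rank1(10))  # 5
```

Each rank query touches one precomputed counter and one word-level popcount, which is why rank over an uncompressed bit-vector is constant time.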
Competitors. We compare the space usage, query time, and building time of LPHash against PTHash [Pibiri and Trani, 2021b,a], the fastest MPHF in the literature, and the popular BBHash [Limasset et al., 2017]. Both competitors are also written in C++. Following the recommendations of the respective authors, we tested two example configurations each:
• PTHash-v1, with parameters (D-D, α = 0.94, c = 5.0);
• PTHash-v2, with parameters (EF, α = 0.99, c = 5.0);
• BBHash-v1, with parameter γ = 2;
• BBHash-v2, with parameter γ = 1.
We point the reader to the respective papers for an explanation of these parameters; we just note that they offer a trade-off between space, query efficiency, and building time, as is also apparent in the following experiments.
Testing Machine. The experiments were executed on a machine equipped with an Intel i9-9900K CPU (clocked at 3.60 GHz), 64 GB of RAM, and running the Linux 5.13.0 operating system. The whole code (LPHash and competitors) was compiled with gcc 11.2.0, using the flags -O3 and -march=native.
Datasets. We use datasets of increasing size in terms of number of distinct k-mers; namely, the whole genomes of: Saccharomyces Cerevisiae (Yeast, 11.6×10^6 k-mers), Caenorhabditis Elegans (Elegans, 96.5×10^6 k-mers), Gadus Morhua (Cod, 0.56×10^9 k-mers), Falco Tinnunculus (Kestrel, 1.16×10^9 k-mers), and Homo Sapiens (Human, 2.77×10^9 k-mers). For each dataset, we obtain the corresponding SPSS by first building the compacted de Bruijn graph using BCALM2 [Chikhi et al., 2016], then running the UST algorithm [Rahman and Medvedev, 2020]. At our code repository we provide detailed instructions on how to prepare the datasets for indexing. All datasets are also available at https://zenodo.org/record/7239205, already in processed form, so that it is easy to reproduce our results.

Table 3. Space in average bits/k-mer for PTHash and BBHash. As reference points, we also report the bits/k-mer for partitioned LPHash for three representative values of k (see also Fig. 5).

Space Effectiveness
To build an instance of LPHash for a given k, we have to choose a suitable value of the minimizer length m. A suitable value of m should clearly be neither too small (otherwise, most minimizers will appear many times), nor too large (otherwise, the space of f_m will be too large as well). In general, a good value for m can be chosen around log4(N), where N is the cumulative length of the strings in the input SPSS. Recall from our discussion in Section 3.3 that the fraction of ambiguous minimizers decreases for growing m. Therefore, testing LPHash for growing values of k allows us to progressively increase m, starting from m = log4(N), while keeping w = k − m + 1 sufficiently large and reducing the fraction of ambiguous minimizers as well. Following this principle, for each combination of k and dataset, we choose m as reported in Table 2. Fig. 5 shows the space of LPHash in average bits/k-mer, by varying k from 31 to 63 with a step of 4, for both un-partitioned and partitioned data structures. We report the actual space usage achieved by the

Building Time
We now consider the building time, which is reported in Table 5. Both LPHash and PTHash were built limiting to 8 GB the maximum amount of RAM to use before resorting to external memory. (There is no such capability in the BBHash implementation, so BBHash took more RAM at building time than the other two constructions.) The building time for un-partitioned and partitioned LPHash is the same. LPHash is competitive with the fastest BBHash and significantly faster than PTHash on the larger datasets. Specifically, it is faster than building PTHash over the entire set of k-mers since it builds two smaller PTHash functions (f_m and the fallback). The slowdown seen for Cod is due to the larger fallback MPHF, which is built with PTHash under a strict configuration (c = 3.0) that privileges space effectiveness (and query efficiency) over building time. One could in principle use BBHash instead of PTHash for the fallback function, hence trading space for better building time. For example, recall that we use c = 5.0 on Human for this reason.

Conclusion and Future Work
In this paper, we initiate the study of locality-preserving minimal perfect hash functions for k-mers. We propose a construction, named LPHash, that achieves very compact space by exploiting the fact that consecutive k-mers share overlaps of k − 1 symbols. This allows LPHash to break the theoretical log2(e) bits/key barrier for minimal perfect hash functions.
We show that a concrete implementation of the method is practical as well. Before this paper, a typical choice was to build a BBHash function over the k-mers and spend (approximately) 3 bits/k-mer and 100-200 nanoseconds per lookup. This work shows that it is possible to do significantly better when the k-mers come from a spectrum-preserving string set: for example, less than 0.6-0.9 bits/k-mer and 30-60 nanoseconds per lookup.
Our code is open-source.
As future work, we plan to further engineer the current implementation to accelerate construction and streaming queries. Other strategies for sampling the strings could be used instead of random minimizers [Frith et al., 2022]; for example, the Miniception [Zheng et al., 2020], achieving ε = 1.67/w + o(1/w). Evaluating the impact of such different sampling schemes is a promising avenue for future research. Lastly, we also plan to investigate other strategies for handling the ambiguous minimizers: a better strategy is likely to lead to improved space effectiveness and faster construction.

Fig. 2 .
Fig. 2. The four different types of super-k-mers. The example is for k = 13 and minimizer length m = 7, so w = k − m + 1 = 13 − 7 + 1 = 7. The shaded boxes highlight the minimizer sub-string inside a k-mer. The start position of the minimizer is marked with a solid border when it is either max (7) or min (1).

Theorem 1 .
Given a random minimizer scheme (k, m, h) with m > (3 + δ) log4(w + 1) for any δ > 0 and w = k − m + 1, there exists a (1 − ε)-LP MPHF for a SPSS S with n = |spectrum_k(S)| which takes n · (2/(w + 1)) · (log2(4(w + 1)^2) + b + o(1)) bits, where ε = 2/(w + 1) and b is a constant larger than log2(e).

Fig. 4 .
Fig. 4. The chain is in state 1 ≤ p ≤ w if the minimizer starts at position p in the k-mer. Different edge colors represent different probabilities.

Fig. 5 .
Fig. 5. Space in average bits/k-mer for LPHash by varying k, for both un-partitioned and partitioned data structures. The flat solid line at log2(e) = 1.442 bits/k-mer indicates the classic MPHF lower bound. Lastly, the dashed lines correspond to the space bounds computed using Theorem 1 and Theorem 3 with b = 2.5, including the space for the fallback MPHF.

Table 2 .
Minimizer length m by varying k on the different datasets.