ntHash2: recursive spaced seed hashing for nucleotide sequences

Abstract

Motivation: Spaced seeds are robust alternatives to k-mers in analyzing nucleotide sequences with high base mismatch rates. Hashing is also crucial for efficiently storing abundant sequence data. Here, we introduce ntHash2, a fast algorithm for spaced seed hashing that can be integrated into various bioinformatics tools for efficient sequence analysis with applications in genome research.

Results: ntHash2 is up to 2.1× faster at hashing various spaced seeds than the previous version and 3.8× faster than conventional hashing algorithms with naïve adaptation. Additionally, we reduced the collision rate of ntHash for longer k-mer lengths and improved the uniformity of the hash distribution by modifying the canonical hashing mechanism.

Availability and implementation: ntHash2 is freely available online at github.com/bcgsc/ntHash under an MIT license.

Supplementary information: Supplementary data are available at Bioinformatics online.



Background
Given a sequence s containing characters from an alphabet Σ, ntHash initially computes a base hash value H_0 for the first k-mer:

$$H_0 = \mathrm{rol}^{k-1}(h(s_0)) \oplus \mathrm{rol}^{k-2}(h(s_1)) \oplus \dots \oplus h(s_{k-1})$$

Next, for each succeeding k-mer, the contribution of the previous k-mer's first character is removed and the next character in the sequence is included using circular shifts and exclusive-or (XOR) operations. Formally, ntHash can be written as a recursive function:

$$H_{i+1} = \mathrm{rol}(H_i) \oplus \mathrm{rol}^{k}(h(s_i)) \oplus h(s_{i+k}) \quad (1)$$

where rol^x(w) is the result of rotating the 64-bit word w to the left x times and h is a lookup table mapping each character in Σ to a random 64-bit integer. ntHash also calculates the hash value of the reverse-complement of each k-mer in the same loop as the forward strand, which leads to faster canonical hash computation (Mohamadi, Chu, et al. 2016). In our implementation, as Σ = {A, C, G, T}, ntHash performs DNA hashing and ignores any character not present in this alphabet.
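The recursion above can be sketched as follows; the character table values are illustrative stand-ins, not necessarily ntHash's actual constants:

```python
# Minimal sketch of the ntHash recursion, assuming a toy character table.
MASK64 = (1 << 64) - 1

def rol(w, x):
    """Rotate the 64-bit word w left by x positions."""
    x %= 64
    return ((w << x) | (w >> (64 - x))) & MASK64

h = {"A": 0x3C8BFBB395C60474, "C": 0x3193C18562A02B4C,
     "G": 0x20323ED082572324, "T": 0x295549F54BE24456}  # illustrative values

def base_hash(s, k):
    """H_0: XOR of rol^{k-1-j}(h(s_j)) over the first k-mer."""
    out = 0
    for j in range(k):
        out ^= rol(h[s[j]], k - 1 - j)
    return out

def next_hash(H, s, i, k):
    """H_{i+1} = rol(H_i) XOR rol^k(h(s_i)) XOR h(s_{i+k})."""
    return rol(H, 1) ^ rol(h[s[i]], k) ^ h[s[i + k]]

# Rolling over a sequence reproduces the direct (non-recursive) hash.
seq, k = "ACGTTGCA", 4
H = base_hash(seq, k)
for i in range(len(seq) - k):
    H = next_hash(H, seq, i, k)
    assert H == base_hash(seq[i + 1:i + 1 + k], k)
```

The cancellation works because rol distributes over XOR: rotating H_i once lifts every remaining term by one position, the rol^k term removes the outgoing character, and the new character enters unrotated.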

Updated k-mer Hashing
In equation (1), since rotating any 64-bit word 64 times results in the same value, when k > 64, swapping two distinct characters 64 positions apart will not alter the resulting base hash value. Also, the rotation terms of two identical characters at a distance of 64 positions cancel each other out, with the same effect. Our solution to this issue is to replace rol with the split-rotation srol, which has a higher period. Because modern computers operate on 64-bit words, the period of srol is maximized when the word is split into sub-words of lengths x_1 = 31 and x_2 = 33, giving lcm(31, 33) = 1023, where lcm stands for the least common multiple. Although this operation can be generalized to longer periodicities by increasing the number of sub-words of co-prime lengths (Table S1), we implement srol with two sub-words in ntHash2. The theoretical maximum periodicity of the operation using 64-bit words is 2,042,040, attained with a pattern of seven sub-words (lcm(3, 5, 7, 8, 11, 13, 17) = 2,042,040). The hash value H_{i+1} for the (i+1)-th k-mer in the sequence is computed by ntHash2 according to the recursive function:

$$H_{i+1} = \mathrm{srol}(H_i) \oplus \mathrm{srol}^{k}(h(s_i)) \oplus h(s_{i+k})$$

where ⊕ is the bitwise XOR operator and h is a lookup table containing a fixed 64-bit word for each of the characters in the alphabet Σ. Also, srol is the split-and-rotate function, performed by shifting the input to the left and replacing the bits in positions 0 and 33 with the bits in positions 32 and 63 prior to the shift, as implemented in Fig. S1. ntHash2 is also able to compute a hash value for the reverse-complement of the input k-mer by modifying the recursion as:

$$H'_{i+1} = \mathrm{sror}\left(H'_i \oplus \mathrm{srol}^{k}(h(s'_{i+k})) \oplus h(s'_i)\right)$$

where s'_i is the complementary base of s_i and sror is the inverse of a single srol, implemented as in Fig. S2. In other words, sror(srol(w)) = w.
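A sketch of the split-rotation and its inverse, following the bit positions described above (low 33 bits and high 31 bits rotated independently):

```python
# Sketch of srol/sror: rotate the low 33 bits and the high 31 bits of a
# 64-bit word independently, so the period becomes lcm(31, 33) = 1023.
MASK33 = (1 << 33) - 1
MASK31 = (1 << 31) - 1

def srol(w):
    lo, hi = w & MASK33, w >> 33
    lo = ((lo << 1) | (lo >> 32)) & MASK33   # 33-bit rotate left
    hi = ((hi << 1) | (hi >> 30)) & MASK31   # 31-bit rotate left
    return (hi << 33) | lo

def sror(w):
    lo, hi = w & MASK33, w >> 33
    lo = (lo >> 1) | ((lo & 1) << 32)        # 33-bit rotate right
    hi = (hi >> 1) | ((hi & 1) << 30)        # 31-bit rotate right
    return (hi << 33) | lo

w = 0x0123456789ABCDEF
assert sror(srol(w)) == w                    # sror inverts srol
x = w
for _ in range(1023):
    x = srol(x)
assert x == w                                # full period lcm(31, 33) = 1023
```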
Additionally, ntHash2 can generate extra hash values for each k-mer using the function shown in Fig. S3. Note that k is the k-mer size and i > 0 is the index of the newly generated hash value.
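The exact function is given in Fig. S3; as a hypothetical stand-in, extra values can be derived from the base hash with a cheap multiply-and-xorshift mix, so only one rolling hash needs to be maintained. The constants below are illustrative, not taken from the paper:

```python
# Hypothetical multi-hash sketch (a stand-in for Fig. S3, not the exact
# published function): derive the i-th extra value from the base hash.
MASK64 = (1 << 64) - 1
MULTI_SEED = 0x90B45D39FB6DA1FA   # illustrative constant
MULTI_SHIFT = 27                  # illustrative constant

def extra_hash(base, i, k):
    """i-th derived hash (i > 0) for a k-mer whose base hash is `base`."""
    t = (base * ((i ^ (k * MULTI_SEED)) & MASK64)) & MASK64
    t ^= t >> MULTI_SHIFT          # xorshift mix to spread the product
    return t

# Distinct indices yield distinct, deterministic values.
vals = {extra_hash(0x1234ABCD, i, 64) for i in (1, 2, 3, 4)}
assert len(vals) == 4
```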
To reduce the work required to compute srol^x(h(c)) during runtime, we store the results of rotating the first 33 and last 31 bits of each h[c] (c ∈ Σ), for all possible rotation counts, in two separate predefined arrays called MS_TAB_33R and MS_TAB_31L. The pre-computed value of srol^x(h(c)) is the result of joining the corresponding elements of the two arrays using the bitwise OR operator. This functionality is implemented as Fig. S4. Finally, we speed up hashing the first k-mer by pre-computing the hash value of every possible sub-word of length ≤ 4 in four arrays. To form H_0, we first iterate over the 4-mers of the first k-mer and join the respective pre-calculations using srol and XOR operations. If k is not divisible by 4, we fetch the value of the last k mod 4 characters from the corresponding array.
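The table lookup can be sketched like so; the array layout is a guess at the described scheme, with only the low-33/high-31 split and the OR-join taken from the text, and the character constants are illustrative:

```python
# Sketch of the precomputed split-rotation tables: for each character, store
# every rotation of the low 33 bits and of the high 31 bits, then OR-join.
MASK33 = (1 << 33) - 1
MASK31 = (1 << 31) - 1

h = {"A": 0x3C8BFBB395C60474, "C": 0x3193C18562A02B4C,
     "G": 0x20323ED082572324, "T": 0x295549F54BE24456}  # illustrative values

def rot33(v, x):
    x %= 33
    return ((v << x) | (v >> (33 - x))) & MASK33

def rot31(v, x):
    x %= 31
    return ((v << x) | (v >> (31 - x))) & MASK31

MS_TAB_33R = {c: [rot33(w & MASK33, x) for x in range(33)] for c, w in h.items()}
MS_TAB_31L = {c: [rot31(w >> 33, x) << 33 for x in range(31)] for c, w in h.items()}

def srol_x(c, x):
    """srol^x(h[c]) via table lookup: OR of the two pre-rotated sub-words."""
    return MS_TAB_31L[c][x % 31] | MS_TAB_33R[c][x % 33]

assert srol_x("A", 0) == h["A"]
assert srol_x("A", 1023) == h["A"]    # full period lcm(31, 33)
```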
Naturally, split-rotation is a more complex function than simple rotation and requires more CPU instructions. Calling srol 10^10 times requires approximately 9.4 s to execute, whereas the same number of executions takes around 6.9 s for rol. However, the implementation optimizations described in this section compensate for this decrease in performance: calling the forward hash functions of ntHash1 and ntHash2 10^6 times on random 100-mers takes 6.3 s and 6.4 s respectively, averaged over 3 test repeats. In the end, the uniformity (Section 3) and the support for longer k-mer lengths that split-rotation brings to ntHash2 are valuable for bioinformatics applications, and the slight performance change is negligible in most cases.

Uniformity of the Canonical Hash Value
To obtain a single representation for the forward and reverse-complement hash values of k-mers, we define a canonical hash value. This feature is most useful in k-mer counting algorithms. In most bioinformatics applications, the lexicographically smaller of the forward and reverse-complement sequences is selected as the canonical k-mer. Accordingly, in ntHash the canonical hash value is defined as the minimum of the hashes computed for the forward and reverse-complement of each k-mer (Mohamadi, Chu, et al. 2016).
While ntHash was shown to distribute output values uniformly in the 64-bit hash space (Birol, Mohamadi and Chu 2018), we changed the canonical hashing mechanism in ntHash2 to improve its performance and uniformity. Specifically, we compute the canonical hash value as the sum of the hashes generated for the -mer's forward and reverse-complement. Here, we provide mathematical justification for the selection of addition for canonical hashing.
First, we show that the hash values of the forward and reverse-complements of the sequences are independently uniform. This was shown to be true in ntHash for the rol operator (Birol, Mohamadi and Chu 2018). To do so for ntHash2 and srol, we generate 10^6 random k-mers (k = 100 for this test) and plot the histogram of the generated hashes by counting the values separated into 1000 bins. As illustrated in Fig. S5, both the forward and reverse-complement hash functions are uniformly distributed; after normalization to the unit interval, they follow U(0,1).
Moreover, we perform the Kolmogorov-Smirnov (K-S) test (Massey Jr. 1951) to compare the distribution of the normalized hash values with U(0,1). With test statistics D_f = 0.0009 and D_r = 0.0008 and p-values of p_f = 0.48 and p_r = 0.39 for the forward and reverse strand hashes respectively, we fail to reject the null hypothesis that the normalized hash values are drawn from U(0,1). Therefore, the distributions of the hashes generated for the forward and reverse-complements of the k-mers are statistically indistinguishable from U(0,1).
After observing a uniform distribution for the output hash values, we prove that, in the case of ntHash2, the summation of the two uniform distributions results in another uniform distribution. The probability density function (PDF) of the sum X of n independent U(0,1) random variables is the Irwin-Hall distribution (Johnson, Kotz and Balakrishnan 1995):

$$f_X(x; n) = \frac{1}{2(n-1)!} \sum_{j=0}^{n} (-1)^j \binom{n}{j} (x-j)^{n-1} \, \mathrm{sgn}(x-j)$$

where sgn(·) is the sign function. The PDF is shown in Fig. S6. Note that, per the central limit theorem, the Irwin-Hall distribution approaches a normal distribution as n increases. In ntHash2, we have the special case of the Irwin-Hall distribution where n = 2:

$$f_X(x) = \begin{cases} x & 0 \le x \le 1 \\ 2 - x & 1 < x \le 2 \\ 0 & \text{otherwise} \end{cases}$$

Because ntHash2 operates in the 64-bit unsigned integer space, if the canonical hash value is greater than 2^64, integer overflow causes the values to wrap around. The summation of two 64-bit integer values a and b is simply performed as (a + b) mod 2^64 in many computer architectures, and if not, we ensure that this is how it is performed. In that case, the line segment of Fig. S6 for 1 < x ≤ 2 (after normalization) is transferred to 0 ≤ x < 1, and the superposition of the two line segments creates a uniform distribution.
Formally, the wrap-around version of the normalized operation can be formulated as

$$X' = (X_1 + X_2) \bmod 1$$

where X_1 and X_2 are the normalized forward and reverse-complement hashes, each following U(0,1). Because the two cases x' = x and x' = x - 1 in X' are mutually exclusive, its probability density can be written as

$$f_{X'}(x) = f_X(x) + f_X(x + 1) = x + \big(2 - (x + 1)\big) = 1, \quad 0 \le x < 1$$

defining a uniform distribution.
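The wrap-around argument can also be checked empirically; the simulation below (fixed seed, bin counts) mirrors the mod-2^64 overflow with mod 1 on normalized values:

```python
# Empirical sketch of the argument: (X1 + X2) mod 1 is uniform on [0, 1)
# when X1, X2 ~ U(0, 1), just as (a + b) mod 2^64 stays uniform for hashes.
import random

random.seed(42)
n, bins = 200_000, 20
wrapped = [0] * bins
plain = [0] * (2 * bins)    # un-wrapped sum on [0, 2), for contrast
for _ in range(n):
    s = random.random() + random.random()
    wrapped[int((s % 1.0) * bins)] += 1
    plain[int(s * bins)] += 1

expected = n / bins
# wrapped counts are flat; the raw sum is triangular with near-empty tails
assert all(abs(c - expected) < 0.06 * expected for c in wrapped)
assert plain[0] < 0.2 * expected
```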
Hence, the probability distribution of the canonical hash value defined as the summation of the forward and reverse hashes is uniform.
Other canonical hashing operators tested in previous releases of ntHash are shown in Fig. S7. We also tested employing extended hashes in various applications. We observe that the minimum operator without extended hash generation is the only operator that is non-uniform. To make ntHash2 uniform when using the minimum as the canonical hash operator, a second hash value must be generated, which increases hashing time. On the other hand, using XOR for generating canonical hash values results in higher hash collision rates: the computation of the forward and reverse hash values already consists of XOR operations between the characters, and XORing the polynomials of the two strands causes multiple terms to cancel out. By using addition as the canonical hash operator in ntHash2, we benefit from fast computation, low collision rates, and high uniformity without the need to generate extra hashes.

Spaced Seeds Hashing Algorithm
We summarize our hashing algorithm in the pseudocode presented in Fig. S8. The first stage of the algorithm parses the input seed strings into blocks and monomers, as implemented in parse_seed. Note that we repeat the same process for blocks of don't-care positions and use the '0' blocks as masks if the predicted workload is less than that of the blocks of care positions. We generate a hash value for the first k-mer with base_hash. The forward and reverse hash values before encoding the monomers are stored in H and H' respectively for future function calls. Finally, subsequent hashes are computed using the slide_hash function.
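To illustrate what seed parsing buys, here is a direct (non-rolling) spaced-seed hash sketch; the seed string and character constants are illustrative:

```python
# Direct sketch of spaced-seed hashing: only care ('1') positions contribute,
# so mismatches at don't-care ('0') positions leave the hash unchanged.
MASK64 = (1 << 64) - 1

def rol(w, x):
    x %= 64
    return ((w << x) | (w >> (64 - x))) & MASK64

h = {"A": 0x3C8BFBB395C60474, "C": 0x3193C18562A02B4C,
     "G": 0x20323ED082572324, "T": 0x295549F54BE24456}  # illustrative values

def parse_seed(seed):
    """Care positions of a seed string such as '1101011'."""
    return [j for j, c in enumerate(seed) if c == "1"]

def spaced_hash(s, seed):
    """XOR only the contributions of the care positions."""
    k = len(seed)
    out = 0
    for j in parse_seed(seed):
        out ^= rol(h[s[j]], k - 1 - j)
    return out

seed = "1101011"                       # don't-cares at positions 2 and 4
assert spaced_hash("ACGTACG", seed) == spaced_hash("ACCTGCG", seed)  # tolerated
assert spaced_hash("ACGTACG", seed) != spaced_hash("TCGTACG", seed)  # care hit
```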

Additional Features
Here, we provide a list of new features included in ntHash2:
• Streaming: Users can feed input characters to ntHash2 during runtime via the BlindNtHash class. Only the first k-mer needs to be given to the constructor; subsequent hash values are generated by passing the next character to the roll(char_in) function.
• Rolling Back: NtHash classes provide a roll_back function that reverts the previous roll operation, enabling bi-directional traversal.
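A toy version of the streaming interface can be sketched as follows; the class name, method signatures, and character constants are illustrative and may differ from the library's actual API:

```python
# Toy streaming roller with roll/roll_back, mimicking the BlindNtHash idea.
from collections import deque

MASK64 = (1 << 64) - 1

def rol(w, x):
    x %= 64
    return ((w << x) | (w >> (64 - x))) & MASK64

def ror(w, x):
    return rol(w, (64 - x) % 64)

h = {"A": 0x3C8BFBB395C60474, "C": 0x3193C18562A02B4C,
     "G": 0x20323ED082572324, "T": 0x295549F54BE24456}  # illustrative values

class ToyBlindHash:
    def __init__(self, first_kmer):
        self.k = len(first_kmer)
        self.window = deque(first_kmer)
        self.value = 0
        for j, c in enumerate(first_kmer):
            self.value ^= rol(h[c], self.k - 1 - j)

    def roll(self, char_in):
        """Slide right: drop the leftmost character, append char_in."""
        char_out = self.window.popleft()
        self.window.append(char_in)
        self.value = rol(self.value, 1) ^ rol(h[char_out], self.k) ^ h[char_in]

    def roll_back(self, char_in):
        """Revert a roll: char_in is the character re-entering on the left."""
        char_out = self.window.pop()
        self.window.appendleft(char_in)
        self.value = ror(self.value ^ rol(h[char_in], self.k) ^ h[char_out], 1)

th = ToyBlindHash("ACGT")
v0 = th.value
th.roll("C")
th.roll_back("A")                     # 'A' was the character that left
assert th.value == v0 and "".join(th.window) == "ACGT"
```

roll_back simply inverts the recursion algebraically: XOR removes the two character terms, then the rotation is undone with a right rotate.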

Spaced Seeds Used in Experiments
The seeds used in the experiments of the main text are provided in Table S2. Seeds 1 and 2 represent the upper and lower bounds for the number of blocks and monomers. Seed 3 is designed to maximize the hit probability. Seeds 4 and 5 are randomly generated strings with an average number of blocks and monomers. Seed 6 is designed to maximize sensitivity (Hahn, et al. 2016).

Extended Experiments and Results
To show that ntHash2 works well when used in tools that rely on k-mer or spaced seed hashing, we integrated it into our de novo genome assembler ABySS 2.4.3 (Jackman, et al. 2017) and assembled data from the N2 strain of C. elegans (SRA: DRR008444) and the human individual NA24385, i.e. the Ashkenazi son (SRA: SRR11321732) from the Genome in a Bottle project (Zook, et al. 2016), with ~58x coverage short read data. ABySS parameters are presented in Table S4. Evaluation results are obtained using QUAST (Gurevich, et al. 2013). Tables S3 and S4 show that using ntHash2 as a substitute for the previous version of ntHash does not negatively impact assembly quality. We observe that the improved uniformity does not lead to a significant increase in assembly quality in this experiment, as ntHash2 operates at a very low level of the assembly algorithm. Integrating ntHash2 into ntCard yields k-mer profile estimations as accurate as those of ntHash (0.021% absolute error vs. DSK on average). Again, although ntHash2 is a more uniform hash function than ntHash1, some applications such as ntCard may not show an immediate quality gain for all datasets and experiments.

Fig. S9. C. elegans k-mer profiles generated with ntCard using both the previous and new versions of ntHash (k=80). ntCard is also able to estimate k-mer profiles with low error when integrated with ntHash2.
Our final experiment shows an application of spaced seed hashing to long-read data. We demonstrate the effect of using spaced seeds for estimating the coverage of a dataset with ntCard. As the number of base pairs in the k-mers increases, the coverage profiles become more sensitive to errors, especially in datasets with longer reads and higher error rates such as the CHM13 dataset, whose coverage profile is shown in Fig. S10. The data used in this experiment contain approximately 56.8x nominal coverage of the PacBio HiFi data used in the assembly of the telomere-to-telomere human genome (Nurk et al., 2022). Here, we use gapped spaced seeds of length ℓ bases with b care positions. The seeds are built as two equal and contiguous care blocks separated by Δ = ℓ − b don't-care positions. By using gapped spaced seeds with the same lengths and different numbers of bases included as care positions, the peaks of the profiles remain in roughly the same position. On the other hand, by increasing the number of bases in k-mers, the peak shifts to the right. In other words, utilizing spaced seeds for cardinality estimation results in more stable profiles because of the error-tolerance they have compared to contiguous k-mers.
Fig. S10. Dataset coverage estimation using ntCard with k-mers and spaced seeds. The plot on the right is a zoomed-in version of the peak in the count range between 34 and 64. Spaced seeds are more stable compared to k-mers with the same number of bases and are more reliable for coverage estimation in long-read datasets.
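The gapped seeds described in this experiment can be constructed as follows, assuming an even number of care bases b:

```python
# Building the gapped seeds used in this experiment: two equal contiguous
# care blocks separated by (length - bases) don't-care positions.
def gapped_seed(length, bases):
    assert bases % 2 == 0 and bases <= length
    half = bases // 2
    return "1" * half + "0" * (length - bases) + "1" * half

s = gapped_seed(16, 10)
assert s == "1111100000011111"   # 5 care + 6 don't-care + 5 care
```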