RabbitKSSD: accelerating genome distance estimation on modern multi-core architectures

Abstract
Summary: We propose RabbitKSSD, a high-speed genome distance estimation tool. Specifically, we leverage load-balanced task partitioning, fast I/O, efficient intermediate result accesses, and high-performance data structures to improve overall efficiency. Our performance evaluation demonstrates that RabbitKSSD achieves speedups ranging from 5.7× to 19.8× over Kssd for the time-consuming sketch generation and distance computation on commonly used workstations. In addition, it significantly outperforms Mash, BinDash, and Dashing2. Moreover, RabbitKSSD can efficiently perform all-vs-all distance computation for all RefSeq complete bacterial genomes (455 GB in FASTA format) in just 2 min on a 64-core workstation.
Availability and implementation: RabbitKSSD is available at https://github.com/RabbitBio/RabbitKSSD.


Distance computation
In Kssd, distance computation relies on the intersection of hash sets within sketches. Kssd generates an indexed dictionary for each hash value in the reference sketches to store the IDs of reference sketches containing that specific hash. The process of computing the intersection between a query sketch and the reference sketches involves retrieving each hash value from the query sketch and checking it against the indexed dictionary. When the number of hash bits exceeds 32, Kssd splits the indexed dictionary into multiple sub-dictionaries. Consequently, Kssd retrieves the hash values of all query sketches within the sub-dictionaries sequentially. More detailed information is available in Algorithm 2.
Once the computation of the intersection between genomes A and B (I(A, B))

Parameters
Kssd is characterized by three critical parameters: half K (the k-mer size), drlevel (dimensionality reduction level), and space (the space allocated for the shuffled dictionary).

K-mer size
Similarities are more sensitive to smaller k values, as smaller k-mers are more likely to find matches. However, there is also a higher chance of random collisions inflating the proportion of shared k-mers, especially when dealing with large genomes using excessively small k-mer sizes. Kssd determines the optimal k-mer size using the equation $k = \log_4 \frac{3g}{2u}$, where g represents the genome size, and u is the specified upper bound for the probability of random k-mer collisions. Both RabbitKSSD and Kssd default to a k-mer size of 20 (half K = 10) to minimize the risk of random collisions, making it suitable for most use cases.

Dimensionality reduction level
The dimensionality reduction level (drlevel) is another crucial parameter that directly affects the sketch size. drlevel regulates the sub-sampling ratio of k-mers, with only $1/16^{\text{drlevel}}$ of the k-mers being selected to constitute the sketches. In practical terms, this means that the sketch size can be estimated as roughly $L \cdot 1/16^{\text{drlevel}}$, where L represents the number of k-mers in a genome, approximately equivalent to the genome's length.
Moreover, as illustrated in Figure 1, the number of bits in the final hash values within sketches is calculated as $4 \cdot (\text{half K} - \text{drlevel})$.
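The two estimates above are simple enough to check by hand. The following is a minimal sketch of that arithmetic; the function names are illustrative and are not part of Kssd or RabbitKSSD.

```python
def estimated_sketch_size(genome_length: int, drlevel: int) -> int:
    """Approximate number of k-mers kept in a sketch: L / 16**drlevel."""
    return genome_length // 16 ** drlevel


def hash_bits(half_k: int, drlevel: int) -> int:
    """Number of bits in each sketch hash: 4 * (half_k - drlevel)."""
    return 4 * (half_k - drlevel)


if __name__ == "__main__":
    # Default parameters from the text: half K = 10, drlevel = 3.
    # 4,287,069 bp is the mean genome size of the RefSeq bacteria dataset.
    print(estimated_sketch_size(4_287_069, 3))
    print(hash_bits(10, 3))
```

With the default parameters, a genome of about 4.3 Mbp yields a sketch of roughly a thousand hashes, each fitting comfortably in a 32-bit integer.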

Shuffled dictionary space
The parameter controlling the size of the shuffled dictionary is denoted space. The size of the shuffled dictionary is $2^{4 \cdot \text{space}}$. To improve robustness and efficiency, it is recommended that space be greater than or equal to drlevel + 2. The default value for space is 6, and there is generally no need to modify this setting.
2 Optimization techniques

2.1 Sketch generation

Balanced task partition
The sketching operation in Kssd encounters two primary performance bottlenecks: parsing genome files and generating sketches through k-mer sampling. Kssd employs pipe-streamed I/O APIs for opening genome files to facilitate sequence parsing. However, when employing multi-threading, these APIs can become inefficient due to thread safety issues. To resolve this issue, RabbitKSSD incorporates two high-performance sequence parsing tools: RabbitFX and klib. Additionally, we introduce a task partition strategy aimed at optimizing load balancing.
Kssd adopts a parallel genome file processing strategy, allocating one thread to each genome file. However, this approach can lead to inefficiencies when genome files exhibit substantial variations in size. In such cases, the parsing of the largest file will impede overall performance. It becomes especially problematic when the number of genomes ($N_g$) is smaller than the number of available threads ($P$), leading to the underutilization of resources, as only $N_g$ threads are actively engaged in sketch generation, leaving the remaining $P - N_g$ threads idle.
To fully leverage the computational capabilities of modern multi-core workstations, RabbitKSSD categorizes genomes into two distinct types: "big" and "small." This classification is determined based on the thread count ($P$) and the total genome file size ($S_t$). Genomes larger than $S_t / P$ are designated as "big" genomes, while the remainder are considered "small" genomes. In scenarios where the total genome count is less than the number of threads, all genomes are treated as "big" genomes.
In Figure 2, we present an exemplary comparison of sketch generation processes between Kssd and RabbitKSSD. When faced with one "big" genome file and three "small" files, Kssd employs four threads to simultaneously read and parse all four files. However, the threads handling the "small" files complete their tasks relatively quickly and subsequently remain idle, awaiting the completion of the thread responsible for processing the "big" file. This distribution of tasks across threads is not well balanced for multi-threading. In contrast, RabbitKSSD employs a more efficient approach. It utilizes RabbitFX, a highly efficient sequence parsing tool capable of parsing large genome files using multiple threads simultaneously. RabbitKSSD employs RabbitFX to parse the "big" genome file with multiple threads (e.g., four threads in Figure 2(b)), including one producer thread and multiple consumer threads. For the "small" files, RabbitKSSD leverages klib and assigns one dedicated thread per file, processing them in descending order of file size. This allocation ensures that RabbitKSSD avoids the idle time associated with waiting for the parsing of the "big" file, resulting in a more evenly balanced distribution of tasks.
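The classification rule described above can be sketched as follows. This is a minimal illustration under the stated rule (threshold $S_t / P$, all files "big" when there are fewer files than threads); the function name and the dictionary-of-sizes input are assumptions made for the example, not RabbitKSSD's actual interface.

```python
from typing import Dict, List, Tuple


def partition_genomes(sizes: Dict[str, int], threads: int) -> Tuple[List[str], List[str]]:
    """Split genome files into "big" and "small" lists.

    A file is "big" when its size exceeds S_t / P (total size divided by
    the thread count). If there are fewer files than threads, every file
    is treated as "big". Both lists are in descending order of file size.
    """
    by_size = sorted(sizes, key=sizes.get, reverse=True)
    if len(sizes) < threads:
        return by_size, []
    threshold = sum(sizes.values()) / threads
    big = [name for name in by_size if sizes[name] > threshold]
    small = [name for name in by_size if sizes[name] <= threshold]
    return big, small


if __name__ == "__main__":
    # One large file and three small ones with four threads, as in Figure 2.
    files = {"a.fna": 100, "b.fna": 5, "c.fna": 3, "d.fna": 2}
    print(partition_genomes(files, threads=4))
```

In this toy input, only `a.fna` exceeds the threshold of 27.5 and would be handed to the multi-threaded parser, while the three small files would each get a dedicated thread.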
[Figure 2: (a) Kssd sketch strategy: four threads parse 4 genome files into sequences and generate k-mers using a slide-window approach, with one thread dedicated to each file; each k-mer retrieves the whole shuffled dictionary for sub-sampling. (b) RabbitKSSD sketch strategy: for small files, multiple threads parse files into sequences, one thread per file.]

2.1.2 Retrieving map dictionary instead of the entire shuffled dictionary
Kssd generates sketches by sampling k-mers from the entire shuffled dictionary of the kssp (k-mer substring selection pattern) space array. These k-mers are generated in a slide-window manner, and the total number of k-mers is approximately equal to the total genome size. With a dimensionality reduction level parameter (drlevel) of $l$, only a fraction of $1/16^{l}$ of the k-mers are selected to compose the sketches. Taking the size of the entire shuffled dictionary in Kssd as $S_{sd}$, RabbitKSSD constructs a map dictionary of size $S_{sd}/16^{l}$ using a highly efficient robin-hood hashing map. RabbitKSSD then generates sketches by retrieving k-mers from this map dictionary, which is only $1/16^{l}$ the size of the whole shuffled dictionary. This design results in more efficient sketch generation in RabbitKSSD compared to Kssd.
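The idea of replacing the full shuffled dictionary with a much smaller map can be illustrated with a toy model. The sketch below assumes, purely for illustration, that the shuffled dictionary is a random permutation of $0 \ldots 16^{\text{space}} - 1$ and that a slot is selected when its shuffled value falls below $16^{\text{space}} / 16^{l}$; the real kssp selection rule may differ, and a Python `dict` stands in for the robin-hood hashing map.

```python
import random
from typing import Dict, Iterable, List


def build_shuffled_dictionary(space: int, seed: int = 0) -> List[int]:
    """Toy shuffled dictionary: a random permutation of 0 .. 16**space - 1."""
    values = list(range(16 ** space))
    random.Random(seed).shuffle(values)
    return values


def build_map_dictionary(shuffled: List[int], drlevel: int) -> Dict[int, int]:
    """Keep only the slots that pass sub-sampling (1 in 16**drlevel)."""
    limit = len(shuffled) // 16 ** drlevel
    return {slot: value for slot, value in enumerate(shuffled) if value < limit}


def sample_with_full(shuffled: List[int], slots: Iterable[int], drlevel: int) -> List[int]:
    """Sub-sample by consulting the whole shuffled dictionary."""
    limit = len(shuffled) // 16 ** drlevel
    return [shuffled[s] for s in slots if shuffled[s] < limit]


def sample_with_map(map_dict: Dict[int, int], slots: Iterable[int]) -> List[int]:
    """Sub-sample by consulting only the small map dictionary."""
    return [map_dict[s] for s in slots if s in map_dict]
```

Both sampling functions select the same values, but the map holds only $1/16^{l}$ of the entries, so far less memory is touched per lookup.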

Comparison of distance computation between Kssd and Mash
The sketching strategy based on kssp leads to a lower computational complexity compared to Mash, particularly for all-vs-all pairwise distance computations involving large-scale genomes.Algorithms 1 and 2 highlight the distinctions between Kssd and Mash in calculating pairwise distances between reference and query genomes.
For a scenario with $M$ reference genomes and $N$ query genomes, each genome having an average sketch size $S$, Mash necessitates $M \cdot N$ pairwise intersection computations. Within each pairwise calculation, there are approximately $2 \cdot S$ compare operations. Thus, the overall computational complexity for the Mash strategy is $2 \cdot M \cdot N \cdot S$. On the other hand, Kssd initially generates an indexed dictionary of hash values and genome IDs for reference sketches (depicted by the function getKssdIndexDict in Algorithm 2) with a computational complexity of $M \cdot S$. The intersection matrix elements are initialized to 0. For each query sketch and each hash value in the sketch, the intersection matrix is incrementally updated with the genome IDs from the indexed dictionary. The complexity of computing the intersection matrix amounts to $N \cdot S \cdot t$, where $t$ represents the average number of genomes containing a specific hash value and $t \ll M$. Consequently, the total complexity of the Kssd strategy is $(M + N t) \cdot S$.
When the number of reference genomes ($M$) and query genomes ($N$) is large, the condition $(M + N t) \ll (2 \cdot M \cdot N)$ holds, indicating that the Kssd strategy has a lower computational complexity than Mash. Additionally, it's worth noting that the operation for each pairwise intersection computation in Mash is also bottlenecked by branch mispredictions.
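The difference between the two strategies can be made concrete with a small example. The following is a simplified sketch, not the code of either tool: it assumes each sketch is a sorted list of distinct integer hashes, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def intersections_pairwise(refs: List[List[int]], queries: List[List[int]]) -> List[List[int]]:
    """Mash-style: merge two sorted hash arrays for every (reference, query) pair."""
    common = [[0] * len(queries) for _ in refs]
    for i, a in enumerate(refs):
        for j, b in enumerate(queries):
            x = y = 0
            while x < len(a) and y < len(b):  # about 2 * S comparisons per pair
                if a[x] < b[y]:
                    x += 1
                elif a[x] > b[y]:
                    y += 1
                else:
                    common[i][j] += 1
                    x += 1
                    y += 1
    return common


def build_index(refs: List[List[int]]) -> Dict[int, List[int]]:
    """Kssd-style indexed dictionary: hash -> IDs of references containing it (M * S work)."""
    index: Dict[int, List[int]] = defaultdict(list)
    for ref_id, sketch in enumerate(refs):
        for h in sketch:
            index[h].append(ref_id)
    return index


def intersections_indexed(refs: List[List[int]], queries: List[List[int]]) -> List[List[int]]:
    """Kssd-style: one lookup per query hash, then t counter updates (N * S * t work)."""
    index = build_index(refs)
    common = [[0] * len(queries) for _ in refs]
    for j, sketch in enumerate(queries):
        for h in sketch:
            for ref_id in index.get(h, ()):
                common[ref_id][j] += 1
    return common
```

Both functions fill the same `Common[M][N]` matrix; the indexed variant only touches the references that actually share a hash with the query, which is where the factor $t \ll M$ comes from.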

Comparison of distance computation between RabbitKSSD and Kssd
Though Kssd offers lower computational complexity, it faces practical efficiency challenges, particularly when applied to large-scale genomes. On one hand, it exhibits inefficiency when accessing and updating the intersection matrix stored on the hard disk. On the other hand, it necessitates significant hard disk storage space to accommodate the entire intersection matrix. For instance, when calculating all-vs-all distances for a dataset comprising 1,000,000 genomes, the intersection matrix can consume approximately 4 terabytes of hard disk storage. The distance computation in both Kssd and RabbitKSSD relies on the intersection of hash sets within sketches. Both tools generate an indexed dictionary for each hash, preserving the IDs of reference sketches that contain these hashes. The process of intersecting a query sketch with the reference sketches involves retrieving the query sketch hashes from the indexed dictionary. See details in Algorithms 2 and 3. The crucial aspect of this operation lies in hash value retrieval.
As shown in Section 1.1, each nucleotide is encoded into 2 bits within these hashes, and the number of hash bits in sketches is determined by the formula $2 \cdot 2 \cdot (\text{half K} - \text{drlevel})$, where half K and drlevel are parameters reflecting the k-mer size and dimensionality reduction level, respectively. For most scenarios, the count of hash bits is less than 32. However, in cases involving a large k-mer size and a small dimensionality reduction level, the count of hash bits may exceed 32 (half K − drlevel > 8).
In Kssd, the hashes are stored as 32-bit unsigned integers. When the count of hash bits is 32 or fewer, Kssd retrieves the indexed dictionary by directly using the hash value as the index. In this case, the index range spans from 0 to $2^{4 \cdot (\text{half K} - \text{drlevel})} - 1$, and the index file has a size of $4 \cdot 2^{4 \cdot (\text{half K} - \text{drlevel})}$ bytes. However, when the count of hash bits exceeds 32, 32-bit integers are insufficient to cover the range of hash values. To address this limitation, Kssd divides the indexed dictionary into $2^{t}$ distinct sub-dictionaries, where $t$ represents the number of hash bits in excess of 32 (see Algorithm 5). Each sub-dictionary has an index file with a size of 16 GB ($4 \cdot 2^{32}$ bytes). For example, when the k-mer size is 26 (half K = 13) and the dimensionality reduction level is 4096 (drlevel = 3), the count of hash bits is 40 ($4 \cdot (\text{half K} - \text{drlevel})$). Consequently, Kssd splits the indexed dictionary into 256 ($2^{40-32}$) sub-dictionaries. For each query sketch, Kssd must retrieve hash values from each sub-dictionary. Storing all 256 sub-dictionaries in memory (4 TB in total, with each sub-dictionary having an index file size of 16 GB) is impractical for most workstations. Thus, these sub-dictionaries have to be stored on the hard disk. Kssd updates the whole intersection matrix by loading each sub-dictionary from the hard disk into RAM. If the intersection matrix were divided into N parts (keeping only a part of the distance matrix in local RAM), these sub-dictionaries would have to be loaded from the hard disk N times, resulting in an unacceptable level of overhead. Thus, Kssd cannot keep only a part of the distance matrix in local RAM, and instead, it stores the whole intersection matrix as an intermediate result to minimize the overhead of loading the sub-dictionaries.
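The sizes quoted in this paragraph follow directly from the two parameters. The following is a small sketch of that arithmetic; the function name is illustrative, and sizes are computed in bytes with binary units (1 GB taken as $2^{30}$ bytes, matching the 16 GB figure above).

```python
from typing import Tuple


def kssd_index_layout(half_k: int, drlevel: int) -> Tuple[int, int, int]:
    """Return (number of sub-dictionaries, bytes per index file, total bytes).

    Each index entry is a 4-byte integer and a single index file covers at
    most 2**32 hash values, as described for Kssd in the text.
    """
    bits = 4 * (half_k - drlevel)          # hash bits in the sketch
    extra = max(0, bits - 32)              # bits beyond the 32-bit limit
    sub_dictionaries = 2 ** extra
    bytes_per_index = 4 * 2 ** min(bits, 32)
    return sub_dictionaries, bytes_per_index, sub_dictionaries * bytes_per_index


if __name__ == "__main__":
    n_sub, per_file, total = kssd_index_layout(half_k=13, drlevel=3)
    print(n_sub, per_file / 2 ** 30, total / 2 ** 40)
```

For half K = 13 and drlevel = 3 this reproduces the 256 sub-dictionaries of 16 GB each, 4 TB in total.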
In RabbitKSSD, when the count of hash bits is 32 or lower (half K − drlevel ≤ 8), we utilize 32-bit unsigned integers as hash values. As in Kssd, RabbitKSSD retrieves the indexed dictionary by directly using the 32-bit hash value as the index. When the count of hash bits exceeds 32, we employ 64-bit unsigned integers as hash values instead. Thus, the hash value range spans from 0 to $2^{4 \cdot (\text{half K} - \text{drlevel})} - 1$, and this range greatly exceeds the total number of hashes in the sketches, meaning that many hash values within this range do not correspond to any entries in the sketches. As demonstrated in the previous example, when the k-mer size is 26 and the dimensionality reduction level is 4096, resulting in a count of hash bits of 40, the hash value range spans from 0 to $2^{40} - 1$. However, the tested RefSeq bacteria dataset contains only about 200 million distinct hashes in the sketches, which is significantly less than the total hash range ($2^{40}$). As a result, a substantial number of hash values have no associated reference IDs in Kssd's multiple sub-dictionaries. Storing the indices of all hashes within the total range, including these "empty hashes" (the hashes that never occur in the sketches), consumes excessive resources and is inefficient. To address this issue, we construct a compact unified indexed dictionary that exclusively includes hashes present in the sketches. The unified indexed dictionary is implemented using a high-performance robin-hood unordered map. It eliminates the need to store indices for "empty hashes" and obviates the requirement for multiple sub-dictionaries. When calculating intersections of a query sketch against all references, RabbitKSSD retrieves every hash value from the query sketch and identifies all reference IDs associated with these hashes (see Algorithm 3).
Unlike the sub-dictionary-based approach in Kssd, RabbitKSSD utilizes the unified indexed dictionary encompassing all reference IDs, thereby eliminating the need for intermediate storage of the intersection matrix. Consequently, RabbitKSSD can efficiently keep only a part of the distance matrix in local RAM. Let $P$ be the thread count, $M$ the reference genome count, and $N$ the query count. Instead of storing the entire intersection matrix on the hard disk, RabbitKSSD maintains a sub-matrix of size $P \cdot M$ in RAM (see Figure 3). This approach significantly enhances both the speed of access and the efficiency of updates.
To conduct a comprehensive resource utilization assessment of Kssd and RabbitKSSD, we introduced an additional large-scale dataset denoted as the Genbank bacteria dataset. This dataset encompasses 1,009,738 genomes, with a cumulative size of 4.0 terabytes in FASTA format. When performing pairwise distance computations among the genomes within the Genbank bacteria dataset using default parameters (half K = 10, drlevel = 3), Kssd took 35,296 seconds (approximately 10 hours) on the W1 workstation and 58,020 seconds (about 16 hours) on the W2 workstation. Additionally, Kssd required 3.8 terabytes of hard disk space to store intermediate results in the form of an intersection matrix. In contrast, RabbitKSSD completed the same task in 5,516 seconds (roughly 1.5 hours) on the W1 workstation and 5,775 seconds (also about 1.5 hours) on the W2 workstation. The peak memory footprints of Kssd and RabbitKSSD are 11.2 GB (11.4 GB) and 12.0 GB (12.3 GB) on the W1 and W2 workstations, respectively. These results demonstrate that RabbitKSSD achieved speedups of 6.4x on the W1 workstation and 10.0x on the W2 workstation compared to Kssd. Moreover, RabbitKSSD avoids extensive hard disk usage at the cost of a marginal increase in memory consumption.
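The $P \cdot M$ sub-matrix strategy described above can be sketched as a generator that never materializes the full $N \times M$ intersection matrix. This is a serial, simplified illustration under the assumption that sketches are lists of distinct integer hashes; in a parallel version each row of the batch would be filled by its own thread. The names are hypothetical and a Python `dict` stands in for the robin-hood unordered map.

```python
from typing import Dict, Iterator, List, Tuple


def build_unified_index(refs: List[List[int]]) -> Dict[int, List[int]]:
    """Map every hash that occurs in a reference sketch to the reference IDs holding it."""
    index: Dict[int, List[int]] = {}
    for ref_id, sketch in enumerate(refs):
        for h in sketch:
            index.setdefault(h, []).append(ref_id)
    return index


def stream_intersections(
    refs: List[List[int]], queries: List[List[int]], threads: int
) -> Iterator[Tuple[int, List[int]]]:
    """Yield (query ID, intersection counts against all M references).

    Only a sub-matrix of at most `threads` rows by M columns exists at any
    time; each batch is emitted and then discarded.
    """
    index = build_unified_index(refs)
    m = len(refs)
    for start in range(0, len(queries), threads):
        batch = queries[start:start + threads]
        sub_matrix = [[0] * m for _ in batch]
        for row, sketch in zip(sub_matrix, batch):  # one row per thread when parallel
            for h in sketch:
                for ref_id in index.get(h, ()):
                    row[ref_id] += 1
        for offset, row in enumerate(sub_matrix):
            yield start + offset, row
```

Because each row is complete as soon as its query has been scanned, distances can be derived and written out per batch, which is what removes the need for an on-disk intermediate matrix.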
We conducted parameter tests on both Kssd and RabbitKSSD, setting half K to 12 and drlevel to 3, using the RefSeq bacteria dataset. During testing, Kssd failed to generate sketches and compute pairwise distances because it ran out of memory on both workstations, W1 and W2. In contrast, RabbitKSSD exhibited robust performance. On W1, it took 77.3 seconds for sketch generation and 35.7 seconds for distance computation. On W2, these times were 135.6 seconds for sketch generation and 58.9 seconds for distance computation. Additionally, the peak memory usage for sketch generation and distance computation was 1.4 GB and 3.0 GB on W1, and 1.3 GB and 2.9 GB on W2, respectively. These results demonstrate RabbitKSSD's enhanced robustness compared to Kssd, particularly when the count of hash bits exceeds 32. Notably, RabbitKSSD maintains similar sketching times to cases where hash bits are within the 32-bit limit. Compared to the runtime with default parameters (half K = 10 and drlevel = 3) in the main manuscript, the increased distance computation time when the count of hash bits exceeds 32 can be attributed to retrieval from the unified indexed dictionary, which is implemented using an unordered map. This process is slightly slower than directly accessing the offset array indexed by the hash value. However, when the count of hash bits exceeds 32, employing the offset array requires generating multiple sub-dictionaries, as in Kssd's approach, and becomes less efficient. Furthermore, the use of the unified indexed dictionary reduces memory consumption due to a reduction in "empty hashes".
In summary, RabbitKSSD exhibits greater robustness and efficiency across a broader range of parameters compared to Kssd.

Other optimization on distance computation
Furthermore, writing the result file can also pose a performance bottleneck. As illustrated in Figure 3, RabbitKSSD addresses this challenge by merging intersection sub-matrix computation, distance sub-matrix computation, and the generation of distance results into a single thread task. Each thread is responsible for generating a sub-file of the final result to prevent writing conflicts. Moreover, each thread keeps its results in a memory buffer and subsequently writes them to the file in buffer-sized units, thereby minimizing frequent writing overhead. Additionally, a global index file is generated to facilitate locating distances within the sub-files. If the total size of these sub-files is below 4 GB, they are merged into a single final result file.
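The output strategy above can be sketched as follows. This is a minimal illustration, not RabbitKSSD's implementation: the file names (`dist.part*`, `dist.out`), the buffer size, and the function names are assumptions, and the global index file is omitted for brevity.

```python
import os
import tempfile
import threading
from typing import List


def write_sub_file(path: str, lines: List[str], buffer_chars: int) -> None:
    """Collect result lines in memory and write them out in buffer-sized units."""
    buffer: List[str] = []
    buffered = 0
    with open(path, "w") as out:
        for line in lines:
            buffer.append(line)
            buffered += len(line)  # character count; equals bytes for ASCII output
            if buffered >= buffer_chars:
                out.write("".join(buffer))
                buffer.clear()
                buffered = 0
        out.write("".join(buffer))  # flush whatever is left


def write_results(per_thread_lines: List[List[str]], out_dir: str,
                  merge_limit: int, buffer_chars: int = 1 << 16) -> List[str]:
    """Write one sub-file per thread, then merge them if they are small enough."""
    paths = [os.path.join(out_dir, f"dist.part{t}") for t in range(len(per_thread_lines))]
    workers = [threading.Thread(target=write_sub_file, args=(p, lines, buffer_chars))
               for p, lines in zip(paths, per_thread_lines)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    if sum(os.path.getsize(p) for p in paths) >= merge_limit:
        return paths  # keep the sub-files as they are
    merged = os.path.join(out_dir, "dist.out")
    with open(merged, "w") as out:
        for p in paths:
            with open(p) as part:
                out.write(part.read())
            os.remove(p)
    return [merged]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(write_results([["a\tb\t0.1\n"], ["c\td\t0.2\n"]], tmp, merge_limit=4 << 30))
```

Giving each thread its own file avoids any locking on the output path, and the merge step only runs when the combined output is small.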

[Figure: Unified indexed dictionary. Three threads retrieve all hashes of 3 queries from the unified indexed dictionary.]

Set operations
Kssd and RabbitKSSD use a one-to-one encoding hash function, which differs from the many-to-one hash function used by other tools. This enables safe subtraction of the sketch hash sets. For datasets that share the same reference, the reference k-mers cover everything except the variants between genomes. Therefore, the variants can be enriched by subtracting the k-mer set of the reference from a given dataset. For example, Kssd generated sketches of genomes in the 1000 Genomes Project and then subtracted the sketch of the hg38 human reference from them. The pairwise distances of the remainder sketches were computed for clustering and detecting mislabeled genomes.
The performance of set operations in Kssd is bottlenecked by its single-threaded implementation. To address this limitation, we propose a producer-consumer multithreading strategy to accelerate the set operations. Figure 4 illustrates the comparison between the Kssd and RabbitKSSD set subtraction strategies. In the Kssd approach, a single thread is responsible for parsing the sketch file, computing the subtraction, and saving the result to the output file. RabbitKSSD, on the other hand, utilizes one thread as a producer to parse the sketch file and place the parsed sketches into a sketch queue. Multiple threads act as consumers, loading sketches from the queue, computing the subtraction, and saving the results to the output file. This parallelized approach allows RabbitKSSD to approach the peak I/O bandwidth of the SSD, resulting in significantly improved set operation performance.
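The producer-consumer structure described above can be sketched with a bounded queue. This is a structural illustration only: sketches are modeled as `(name, hash list)` pairs, an in-memory dictionary stands in for the output file, and the function name is hypothetical. Python threads will not reproduce the speedups of a native implementation.

```python
import queue
import threading
from typing import Dict, Iterable, List, Set, Tuple

_DONE = object()  # sentinel telling a consumer that no more sketches will arrive


def subtract_sketches(sketches: Iterable[Tuple[str, List[int]]],
                      reference: Set[int], consumers: int = 3) -> Dict[str, List[int]]:
    """Remove the reference hashes from every sketch using one producer and several consumers."""
    tasks: queue.Queue = queue.Queue(maxsize=2 * consumers)  # bounded sketch queue
    results: Dict[str, List[int]] = {}
    lock = threading.Lock()

    def producer() -> None:
        for item in sketches:  # stands in for parsing the sketch file chunk by chunk
            tasks.put(item)
        for _ in range(consumers):  # one sentinel per consumer
            tasks.put(_DONE)

    def consumer() -> None:
        while True:
            item = tasks.get()
            if item is _DONE:
                break
            name, hashes = item
            remainder = [h for h in hashes if h not in reference]
            with lock:  # stands in for writing to the shared output file
                results[name] = remainder

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The bounded queue keeps only a few parsed sketches in memory at a time, which matches the chunk-by-chunk processing that avoids loading the whole sketch file into RAM.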
3 Performance evaluation

Experimental data
The RefSeq bacteria dataset comprises the complete bacterial genomes from NCBI RefSeq database release 211. This dataset consists of 113,674 genomes, with a cumulative size of 455 GB in FASTA format. The genome sizes within this dataset vary, with the largest being 14,966,964 base pairs, the smallest 143,979 base pairs, and the mean size 4,287,069 base pairs.
The Genbank bacteria dataset comprises the complete bacterial genomes from NCBI Genbank database release 249. This dataset consists of 1,009,738 genomes, with a cumulative size of 4.0 TB in FASTA format. The genome sizes within this dataset vary, with the largest being 25,462,055 base pairs, the smallest 176,024 base pairs, and the mean size 4,374,294 base pairs.
The hg38 dataset represents the standard human genome reference and consists of a single genome file, hg38.fna. This genome is approximately 3.1 GB in FASTA format.
The 1000Genome dataset corresponds to the NCBI accession PRJEB31736, which involves the sequencing of 2,504 Phase 3 1000 Genome samples at 30X whole genome coverage. In total, 36,940 runs were successfully processed, resulting in a combined data size of approximately 586,862,543,699,180 base pairs, equivalent to around 533.75 terabytes. These runs vary in size, with the largest being 214,967,956,200 base pairs, the smallest being 218,700 base pairs, and the mean size being 15,886,912,390 base pairs. The data retrieval, dumping, and sketch generation processes were performed using a shell script to stream data. At a sketching dimensionality reduction of 4096-fold, the sketches generated from these runs occupy approximately 81 GB in sketch format. After subtracting the sketch of the hg38 human reference, the remaining 36,940 sketches have a combined size of 16 GB.
3.2 Accuracy comparison for estimated Jaccard Coefficients and mutation distance
Kssd utilizes the Pearson correlation coefficient to assess the accuracy of the mutation distance. However, since the ground truth of mutation distances on the real-world RefSeq bacteria dataset is not available, we simulated 900 genome sequences with mutation rates (mutation distances) ranging from 0.1% (0.001) to 30% (0.300), based on a 10,000,000 bp seed nucleotide genome. These mutation rates involve an equal probability of insertion, deletion, and substitution. The true Jaccard coefficients between the seed genome and these mutated genomes vary from 0.005 (mutation distance 0.299) to 0.970 (mutation distance 0.001). Table 2 and Table 3 display the SSEs and Pearson correlation coefficients of these tools on the simulated dataset, respectively. The testing results show that Kssd and RabbitKSSD tend to be slightly more accurate and exhibit better performance than other tools.
In summary, RabbitKSSD achieves at least comparable accuracy to other tools while exhibiting superior runtime efficiency.

Thread scalability
For sketch generation, RabbitKSSD achieves near-linear speedup when using up to 48 (32) threads on the Intel (AMD) workstation. The SSD on the Intel (AMD) workstation is a Samsung 980 PRO 1TB SSD (Samsung 970 EVO Plus 2TB SSD) with a peak sequential read speed of 7,000 (3,500) MB/s. RabbitKSSD can process FASTA genome files at up to 6 GB (3 GB) per second with 48 (32) or more threads on the Intel (AMD) workstation. This allows RabbitKSSD to scale effectively up to the peak bandwidth of the SSD.
For distance computation, thanks to the task partition method and efficient output strategy, RabbitKSSD achieves a nearly linear speedup as the number of threads increases. On the AMD workstation, computing all-vs-all distances for the RefSeq bacteria dataset takes 22.3 seconds (22.2 seconds) when employing 44 (48) threads. For this dataset of 113,674 genomes, the incremental gain in speed from further increasing the thread count does not sufficiently offset the additional overhead introduced by multi-threading.

Comparative evaluation of set operations between Kssd and RabbitKSSD
We evaluate the performance of set subtraction using the 1000 Genome sketch with Kssd and RabbitKSSD. The 1000 Genome sketch is generated in a streaming fashion by the shell script, with its performance mainly bottlenecked by download and dumping speeds. The resulting 1000 Genome sketch file is 81 GB in size, which is used to generate a 16 GB remainder sketch file by subtracting the hg38 reference sketch. The comparison of sketch subtraction and all-vs-all distance computation of the remainder sketches between Kssd and RabbitKSSD is summarized in Table 5. As shown in Table 5, RabbitKSSD achieves notable speedups, with performance gains of 7.6x and 14.0x on the Intel (W1) and AMD (W2) workstations, respectively, in the set subtraction phase. This improvement is due to the producer-consumer module in RabbitKSSD, which efficiently processes the original sketch files chunk by chunk. In contrast, Kssd's single-threaded module loads the entire original sketch files (81 GB) into RAM, resulting in high memory usage.
It's worth noting that the speedup in the all-vs-all distance computation for these remainder sketches is not as significant as that observed on the RefSeq bacteria dataset (i.e., ≈ 3x vs. ≈ 16x). The mean sketch size of the 16 GB remainder sketches and the RefSeq bacteria sketches is approximately 116,000 and 1,000, respectively. Given the larger number of hash values in the remainder sketches, the most time-consuming aspect of distance computation in both RabbitKSSD and Kssd is retrieving these hash values from the indexed dictionary. Because this retrieval dominates the runtime in both tools, the performance improvement achieved by RabbitKSSD is smaller than on the RefSeq bacteria dataset. Nonetheless, RabbitKSSD still outperforms Kssd in both robustness and speed.
1: i ← 0, j ← 0; 2: for i = 0; i < M; i++ do 3: for j = 0; j < N; j++ do end for 20: end for 21: end function
[Figure 4 (set subtraction): one producer thread for parsing the query sketch file; three threads perform subtraction of the next three query sketches.]

Figure 5 and Figure 6 depict the thread scalability of RabbitKSSD and Kssd on the 64-core Intel (W1) and 48-core AMD (W2) workstations. RabbitKSSD demonstrates superior thread scalability compared to Kssd for both compute-intensive sketch generation and distance computation operations. Table 4 presents the parallel efficiency of Kssd and RabbitKSSD on the two workstations.
Figure 5: Thread scalability of sketch generation and distance computation of RabbitKSSD and Kssd on the RefSeq bacteria dataset on a 64-core Intel workstation. (a) Runtime of sketch generation of RabbitKSSD and Kssd with different numbers of threads. (b) Runtime of distance computation of RabbitKSSD and Kssd with different numbers of threads. (c) Thread scalability for sketch generation with different numbers of threads. (d) Thread scalability for distance computation with different numbers of threads.
Figure 6: Thread scalability of sketch generation and distance computation of RabbitKSSD and Kssd on the RefSeq bacteria dataset on a 48-core AMD workstation. (a) Runtime of sketch generation of RabbitKSSD and Kssd with different numbers of threads. (b) Runtime of distance computation of RabbitKSSD and Kssd with different numbers of threads. (c) Thread scalability for sketch generation with different numbers of threads. (d) Thread scalability for distance computation with different numbers of threads.
][j] ← intersection(S_a[i], S_b[j]);
Updating the intersection matrix of Kssd
Input: Query sketch array S_b[N] contains N sketches; each sketch (S_b[i]) is a hash array; index is the index of the indexed dictionary or sub-dictionary; IDArr[] is the array of reference IDs for each hash; offsetArr[] is the array of offsets for retrieving the reference IDs.
Output: Intersection matrix Common[N][M];
1: function updateIntersectionMatrix(S_b[N], index, offsetArr[], IDArr[], Common[][])
f setArr[key − 1], end ← offsetArr[key];

Table 1: The Sum of Squared Error (SSE) between estimated Jaccard and true Jaccard of Mash, BinDash, Dashing2, Kssd, and RabbitKSSD on the sub bact dataset. Bold indicates the lowest error.

Table 2: The Sum of Squared Error (SSE) between estimated Jaccard and true Jaccard of Mash, BinDash, Dashing2, Kssd, and RabbitKSSD on the simulated dataset. Bold indicates the lowest error.

Table 3: The Pearson correlation coefficients between estimated Mash distance and true mutation distance of Mash, BinDash, Dashing2, Kssd, and RabbitKSSD on the simulated dataset. Bold indicates the highest correlation.

Table 5: Comparison of set subtraction (sub) and all-vs-all distance computation (dist) between RabbitKSSD (RKSSD) and Kssd on 1000Genome sketches.

Algorithm 1 Distance computation strategy of Mash
Input: Sketch array S_a[M] consists of M sketches, sketch array S_b[N] consists of N sketches; each sketch (S_a[i] or S_b[j]) is a sorted hash array;
Output: Intersection matrix Common[M][N];