CONSULT-II: accurate taxonomic identification and profiling using locality-sensitive hashing

Abstract

Motivation: Taxonomic classification of short reads and taxonomic profiling of metagenomic samples are well-studied yet challenging problems. The presence of species belonging to groups without close representation in a reference dataset is particularly challenging. While k-mer-based methods have performed well in terms of running time and accuracy, they tend to have reduced accuracy for such novel species. Thus, there is a growing need for methods that combine the scalability of k-mers with increased sensitivity.

Results: Here, we show that using locality-sensitive hashing (LSH) can increase the sensitivity of k-mer-based search. Our method, which combines LSH with several heuristic techniques, including soft lowest common ancestor labeling and voting, is more accurate than alternatives in both taxonomic classification of individual reads and abundance profiling.

Availability and implementation: CONSULT-II is implemented in C++, and the software, together with reference libraries, is publicly available on GitHub: https://github.com/bo1929/CONSULT-II.


Introduction
Metagenomic sequencing of microbial communities produces short DNA reads from unknown microorganisms (Handelsman 2004), leading to a need for taxonomic identification based on reference datasets. One approach is to taxonomically identify reads and summarize the results to obtain the taxonomic profile of a sample, showing the relative abundances of taxonomic groups. However, despite the availability of mature read classification and profiling tools, benchmarking has revealed major gaps in the accuracy of existing methods (McIntyre et al. 2017, Sczyrba et al. 2017, Meyer et al. 2019, Ye et al. 2019). Precise identification is often hampered by the novelty of queries versus the genome-wide reference datasets and by ambiguous matches. In addition, searching against large numbers of genomes is computationally demanding.
Taxonomic identification methods employ various strategies, including k-mer matching (Ames et al. 2013, Ounit et al. 2015, Lu et al. 2017, Lau et al. 2019, Wood et al. 2019), read mapping (Zhu et al. 2022), marker-based alignment (Liu et al. 2011, Segata et al. 2012, Sunagawa et al. 2013, Milanese et al. 2019), and phylogenetic placement (Truong et al. 2015, Asnicar et al. 2020, Shah et al. 2021). Regardless, they all essentially search for matches between reads in the sample and a reference set. The challenge is that a significant portion of the Earth's microbial diversity lacks close representatives in reference datasets (Choi et al. 2017), especially in poorly known microbial habitats like seawater or soil (Pachiadaki et al. 2019). Thus, most methods use some strategy to seek inexact matches between the query and references and use the results for classification and profiling.
Classification methods often exhibit reduced accuracy for novel sequences, which lack representation in reference sets (Nasko et al. 2018, von Meijenfeldt et al. 2019, Pachiadaki et al. 2019, Liang et al. 2020). For instance, Rachtman et al. (2020) found that a leading tool, Kraken-II (Wood et al. 2019), faced significant degradation in domain-level classification as the genomic distance to the closest reference increased beyond 10%. Analyses of reads from less commonly sampled environments often fail at classification, even at the phylum level (e.g. Pachiadaki et al. 2019). To tackle these challenges, efforts to build denser reference sets are ongoing (Parks et al. 2020, Wu et al. 2020, McDonald et al. 2023), but these databases remain incomplete compared to the estimated 10^12 microbial species (Locey and Lennon 2016). In addition, computational challenges arise in searching against large reference sets. Thus, we need accurate and scalable methods of identifying novel sequences with respect to distant reference genomes.
As reference sets grow larger, k-mer-based methods become more attractive than alignment-based approaches and phylogenetic placement due to their favorable balance between scalability, ease of use, and high accuracy (McIntyre et al. 2017, Ye et al. 2019). However, k-mer-based methods can be sensitive to reference set completeness if they only allow exact matches. The k-mer-based methods that rely on the presence/absence of long k-mers can accommodate novel sequences by allowing inexact matches. Kraken-II achieves this by masking some positions in a k-mer (default: 7 out of 31). Rachtman et al. (2021) showed that novel reads (e.g. those with 10%-15% distance to the closest match) can be identified with higher accuracy by making inexact matches a central feature of the search. The resulting method, CONSULT, uses locality-sensitive hashing (LSH) to partition k-mers in the reference set into fixed-size buckets such that, for a given k-mer, the reference k-mers with distance up to a certain threshold are in pre-determined buckets with high probability. By allowing inexact k-mer matching, CONSULT increased sensitivity without compromising precision in the contamination removal application (domain-level classification). However, CONSULT did not perform taxonomic identification at lower levels.
This paper adapts CONSULT and its increased k-mer matching sensitivity to the taxonomic classification problem. CONSULT estimates the Hamming distance (HD) between the query k-mer and its closest reference k-mers, a feature that Kraken-II lacks. Using these distances is the essence of our proposed approach, which we call CONSULT-II. To enable taxonomic classification, we need to track the reference genome(s) associated with each reference k-mer, a feature that CONSULT lacks and that can require unrealistically large memory if done naively. We propose a probabilistic method to retain a single taxonomic ID per k-mer, making it possible to fit the database in the memory of modern server nodes. The next challenge is producing a single assignment based on potentially conflicting signals of different k-mers; we address this need using a weighted voting scheme that accounts for distances. Finally, we use a two-level normalization scheme for producing abundance profiles of complex samples using the votes directly. We evaluate the resulting method, CONSULT-II, using a large reference dataset in simulation studies, and show improved accuracy.

Background: CONSULT
The core idea of CONSULT is to find low-HD matches efficiently using the bit-sampling LSH method (Har-Peled et al. 2012). The use of LSH for finding similar DNA sequences is not new (Buhler 2001, Rasheed et al. 2013, Berlin et al. 2015, Luo et al. 2019). For example, Brown and Truszkowski (2013) addressed the related problem of phylogenetic placement using LSH to limit the parts of the reference tree searched. The main focus and novelty of CONSULT, compared to existing work, is being able to search against a large number of reference sequences.
CONSULT tackles the following problem: are there any k-mers in a given set of reference k-mers with HD at most some threshold p to a query k-mer? By default, CONSULT uses k = 32 and p = 3 (these are adjustable) and can feasibly index a large set of reference k-mers (e.g. 2^33).
CONSULT employs two main data structures to represent a set of reference k-mers: an array K that encodes each 32-mer as a 64-bit number, and l-many (default: l = 2) fixed-size hash tables H_1, ..., H_l with 4-byte pointers to K (and an extra n + 1 bits when |K| > 2^(32+n)). Each hash table is a simple 2^(2h) × b matrix (default: h = 15 and b = 7) where each row is indexed by a hash value and the columns store pointers to k-mers in K. For each hash table H_i, we select h random but fixed positions of a 32-mer as its hash index. Thus, k-mer hashes are computed by simply extracting the corresponding bits from the 64-bit encoding of the k-mer, which is specifically designed to make these extractions efficient. For each query k-mer, the l hashes are computed, pointers from all ≤ b × l entries in H_1, ..., H_l are followed to the corresponding encodings in K, and the HD is explicitly computed for each such encoding. CONSULT returns a match if there exists a k-mer with distance ≤ p. As such, it has no false-positive matches, but false negatives (not finding a match) are possible. Using LSH, CONSULT limits the number of HD computations to a constant.
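As a concrete illustration of the hashing scheme, the sketch below (Python, with illustrative function names; the actual C++ implementation differs) encodes a 32-mer into a 64-bit integer and extracts h fixed, randomly chosen base positions as the LSH index, so that two k-mers agreeing at all sampled positions collide.

```python
import random

# 2-bit encoding per base; a 32-mer fits in a 64-bit integer.
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(kmer):
    v = 0
    for base in kmer:
        v = (v << 2) | ENC[base]
    return v

def make_hash_positions(k=32, h=15, seed=0):
    # h random but fixed base positions of the k-mer define one hash function.
    rng = random.Random(seed)
    return sorted(rng.sample(range(k), h))

def lsh_hash(encoding, positions, k=32):
    # Extract the 2 bits of each sampled position; concatenating them yields
    # a 2h-bit hash index into the 2^(2h) x b table.
    idx = 0
    for p in positions:
        shift = 2 * (k - 1 - p)
        idx = (idx << 2) | ((encoding >> shift) & 0b11)
    return idx
```

A k-mer that differs from a reference only at positions outside the sampled set receives the same hash index, which is exactly how inexact matches land in the same bucket.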
In our bit-sampling scheme (Har-Peled et al. 2012), two independent k-mers at HD = d have the same hash with probability (1 − d/k)^h. Hence, given two independent k-mers, the probability that at least one of the l hash functions is the same for both k-mers is given by

ρ(d) = 1 − (1 − (1 − d/k)^h)^l.    (1)

As desired, for d ≤ p, ρ(d) is close to 1. For d > p and for some small enough p (e.g. p = 3), it quickly drops to small values for several choices of l and h. Furthermore, since classification is done at the read level, we have L − k + 1 chances for a k-mer match (L = read length). While k-mer dependence across a read hampers computing the probability of having at least one LSH match between a read and a database (and an independence assumption would be too inaccurate; see Supplementary Fig. S1), we can still compute the expected number of such matches. Assuming the probability of a mismatch between each base pair of a read and a reference species is d/k, the expected number of matching k-mers is (L − k + 1)ρ(d), which can be a large value for realistic settings (Fig. 1a). For example, with the default settings of k = 32, h = 15, l = 2, for a 150 bp read at 25% distance from the closest reference, we still expect 3.2 k-mer matches and can potentially classify it (assuming that b is large enough to fit all reference k-mers). Had we used l = 1 tables, this expected value would have been 1.6, making it likely to miss many such reads. If we assume 3-4 expected matches provide a sufficiently high probability of at least one match, l = 1 would suffice for d ≤ 0.21, while l = 3 and l = 4 would only increase our tolerance to d ≤ 0.27 and d ≤ 0.28, respectively. Given the linear increase in memory with l, we choose l = 2 as a tradeoff.
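The expected-match calculation above is easy to reproduce. The short sketch below implements ρ(d) and (L − k + 1)ρ(d) under the stated independence assumption, and recovers the 3.2 and 1.6 values quoted for the default and l = 1 settings.

```python
def rho(d, k=32, h=15, l=2):
    # Probability that at least one of l bit-sampling hashes collides
    # for two k-mers at Hamming distance d: 1 - (1 - (1 - d/k)^h)^l.
    collide_one = (1 - d / k) ** h
    return 1 - (1 - collide_one) ** l

def expected_matches(dist_frac, L=150, k=32, h=15, l=2):
    # Expected number of matching k-mers for a read of length L whose
    # per-base mismatch probability to the reference is dist_frac.
    d = dist_frac * k
    return (L - k + 1) * rho(d, k, h, l)
```

For a 150 bp read at 25% distance, `expected_matches(0.25)` evaluates to about 3.2 with the defaults and about 1.6 with `l=1`, matching the text.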

Overview of the CONSULT-II changes versus CONSULT
To enable taxonomic classification, CONSULT needs to be extended to address several challenges. (i) CONSULT was designed for a fixed reference library size. As a result, all the hashing settings (h, l, b) were fixed for a library of roughly 2^33 32-mers. To make the method more usable and flexible, it needs to be adjusted to the size of the input library. CONSULT-II uses several heuristics [see Şapcı et al. (2023) and Supplementary Section S1] to estimate an efficient parameter configuration, as a function of the number of k-mers in the reference set and the probability of matching two k-mers with respect to distance, ρ(d). This heuristic enables adjusting the needed memory to the reference size. (ii) When building the reference library, we need to keep track of which k-mers belong to which set of taxonomic groups. Since keeping a fine-grained map will lead to an explosion of memory, we need heuristics (detailed below) to store some taxonomic information while keeping the memory manageable. (iii) At query time, we need some way of combining all the inexact matches from all the k-mers of a read to derive a final identification, and of summarizing the results across all reads in a sample into a final profile.

Soft lowest common ancestors per k-mer
Considering the density of modern reference datasets, k-mers can appear frequently in several reference species, despite being long (e.g. k = 32). The memory required for keeping the associated set of taxonomic groups for each k-mer would quickly become infeasible. However, for ≤ 65 536 taxonomic labels, keeping a single ID requires only 2 bytes. Hence, storing one ID per k-mer would consume 16 Gb for our standard libraries with 2^33 k-mers, which is doable. We choose a single taxonomic ID representing a 'soft' lowest common ancestor (LCA) of all the genomes that include the k-mer using the following procedure.
Let N_i denote the number of genomes that include k-mer x_i ∈ K, which can be easily computed via pre-processing. At any point in time during construction, a k-mer x_i ∈ K is assigned a 2-byte taxonomic ID, denoted by t_i. We process each reference genome g; for each k-mer x_i of g, if it can be added to the database, we set or update the taxonomic ID t_i to the LCA of the current t_i and the species of g with probability p_u(N_i) [Equation (2)], where w and s parameterize the rate of decrease and the offset of the probability function p_u, respectively. We set s = 5 and w = 4 as the default values (see Fig. 1b). Note that the order in which the reference genomes are processed has no significance, as every k-mer, including the first encountered, will be ignored with the same probability. Also, k-mers appearing in more than s genomes have a very small, but nonzero, probability of not having a taxonomic ID at all. The goal of the probabilistic soft LCA is to avoid pushing up taxonomic identifications due to errors in the reference libraries, as higher ranks are less informative. Imagine a k-mer that is found exclusively in 20 species of a particular genus, but is also found in one species of a completely different phylum.
Using the hard LCA would push the k-mer up to the kingdom level, whereas the soft LCA will stay at the genus rank with 85% probability. The probability function p_u is a heuristic without theoretical grounding but has two goals. First, it ensures k-mers are assigned an ID if they are rare among references (i.e. N_i ≤ s = 5). Second, the probability of ignoring a genome smoothly increases as N_i grows. The 1/s² term ensures that each k-mer has a nonzero probability of having a taxonomic ID associated with it, even if it is extremely common.
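The construction loop can be sketched as follows. Note that the decay function below is an illustrative stand-in with the qualitative properties described above (probability 1 for N_i ≤ s, a smooth decay governed by w, and a 1/s² floor); it is not the paper's exact Equation (2), and the helper names are hypothetical.

```python
import random

def p_update(n_occ, s=5, w=4):
    # Illustrative stand-in for the paper's p_u, NOT the published formula:
    # 1 for rare k-mers (N_i <= s), then a smooth decay controlled by w,
    # floored at 1/s^2 so every k-mer keeps a nonzero chance of an ID.
    if n_occ <= s:
        return 1.0
    return max(1.0 / s ** 2, 2.0 ** (-(n_occ - s) / w))

def soft_lca_build(kmer_to_genomes, lca, rng=random.Random(42)):
    # kmer_to_genomes: k-mer -> list of taxa containing it (N_i pre-computed).
    # lca: function mapping two taxon IDs to their lowest common ancestor.
    assigned = {}
    for kmer, genomes in kmer_to_genomes.items():
        n = len(genomes)  # N_i, known ahead of time
        for tax in genomes:  # processing order does not matter
            if rng.random() < p_update(n):
                assigned[kmer] = tax if kmer not in assigned else lca(assigned[kmer], tax)
    return assigned
```

In this sketch, a common k-mer ignores most genomes, so an outlier genome from a distant phylum is unlikely to push the stored ID up the taxonomy, which is the behavior the soft LCA is designed to achieve.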

Read-level taxonomic identification
For each read, CONSULT-II produces a list of matched k-mers; for each matched k-mer x, it outputs the soft LCA taxonomic ID and the distance between x and its closest match with the same hash index. To identify a read, we need to derive a single conclusion from all these potentially conflicting signals. We do so by considering each k-mer as providing a vote for the corresponding taxonomic ID, but weight votes by the match distance.
Let T denote the set of all taxonomic IDs, and K(t), t ∈ T, be all reference k-mers with t as their soft LCA. Each k-mer x in the set R of query-read k-mers might match multiple k-mers in the reference set K with varying distances. A match of lower distance should provide a stronger signal. CONSULT-II accounts for this by giving a k-mer x ∈ R a vote for the taxonomic ID t using:

v_x(t) = (1 − min_{y ∈ K(t)} hd(x, y) / k)^k,    (3)

where hd gives the Hamming distance. The voting function (3) drops close to exponentially with the distance min_{y ∈ K(t)} hd(x, y). Computing (3) exactly is intractable due to the large size of K(t). Instead, using LSH, we compare x only to k-mers y with the same hash index as x, finding matches with high probability. Moreover, we let x vote for only a single taxonomic ID, the one with the minimum distance (breaking ties arbitrarily). As LSH is not effective for high distances, we let a k-mer vote only if its minimum distance is at most a threshold d_max (default: round(3p/2) = 5). We set d_max higher than p because matches with distance above p might also be found; the LSH guarantees that k-mers with distance ≤ p are found with high probability, but more distant k-mers can also be found (see Fig. 1a).
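Assuming a vote of the form (1 − d/k)^k, which decays roughly as e^(−d) and is consistent with the vote magnitudes quoted in the results (e.g. two HD = 5 matches totaling just under 0.01), the per-k-mer vote can be sketched as:

```python
def vote(min_hd, k=32, d_max=5):
    # Vote of a k-mer for the taxon of its closest match; drops
    # near-exponentially with Hamming distance since (1 - d/k)^k ~ e^(-d).
    if min_hd > d_max:
        return 0.0  # too distant: the k-mer does not vote
    return (1 - min_hd / k) ** k
```

An exact match contributes a full vote of 1, while a match at the d_max = 5 cutoff contributes well under 0.01, so distant matches can only tip a classification in aggregate.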
Equation (3), however, is not enough, because a vote for a child should also count toward parent ranks. We recursively sum up individual votes in a bottom-up manner using the taxonomic tree to derive a total vote value for each taxonomic ID:

v̄(t) = Σ_{x ∈ R} v_x(t) + Σ_{t' ∈ C(t)} v̄(t'),    (4)

where C(t) is the set of children of the taxon t. By design, the votes v̄(t) increase at higher ranks and reach their maximum at the root (Fig. 1c). To balance specificity and sensitivity, we require a majority vote. Let τ = (1/2) max_{t ∈ T} v̄(t). CONSULT-II classifies the read with the taxonomic ID t̂ that belongs to the lowest rank satisfying the condition v̄(t̂) > τ. This choice of τ has a special property: only a single taxonomic ID can exceed τ at a given rank. Therefore, the taxonomic ID predicted by the described classification scheme is unique. Effectively, the classifier seeks a taxon at the lowest rank possible while requiring a certain level of confidence; hence, it immediately stops considering upper ranks once the vote value is large enough. In addition, to avoid classification based only on high-distance matches, we require v̄(t̂) to be greater than some small threshold, which we explore in our experimental results.

Taxonomic profiling
To derive taxonomic abundance profiles, instead of using read identifications, we use votes directly. For each taxonomic rank r (e.g. genus), we first normalize the total votes per read per rank, equalizing the contributions of each read to the profile (if it has any matches). For a read R_i, we simply set

ṽ_i(t) = v̄_i(t) / Σ_{t' ∈ T^r} v̄_i(t'),    (5)

where T^r is the set of all taxa at rank r. Next, we gather the normalized total vote values of all n reads R_1, ..., R_n in a sample, and normalize again to obtain the final profile. Let p^r = [p_t^r]_{t ∈ T^r} denote the relative abundance profile at rank r, summing up to 1. Then, we can set the relative abundance of taxon t to:

p_t^r = Σ_{i=1}^{n} ṽ_i(t) / Σ_{t' ∈ T^r} Σ_{i=1}^{n} ṽ_i(t').    (6)

Here, p_t^r estimates the ratio of reads belonging to the taxon t in a given sample. Often, we are instead interested in the relative abundances of cells belonging to a taxon t (denoted by p̃^r = [p̃_t^r]_{t ∈ T^r}), which requires incorporating genome sizes. We simply do so using:

p̃_t^r = (p_t^r / l_t) / Σ_{t' ∈ T^r} (p_{t'}^r / l_{t'}),    (7)

where l_t is the average genome length of all references in taxon t.
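The two-level normalization, with the optional genome-length correction, can be sketched as below (function and variable names are illustrative; the real implementation operates on the vote tables described above).

```python
def profile(per_read_votes, rank_taxa, genome_len=None):
    # per_read_votes: list of dicts, one per read: taxon -> total vote at rank r.
    # rank_taxa: taxa of rank r; genome_len: taxon -> mean genome length (optional).
    sums = {t: 0.0 for t in rank_taxa}
    for votes in per_read_votes:
        denom = sum(votes.get(t, 0.0) for t in rank_taxa)
        if denom == 0:
            continue  # a read with no matches contributes nothing
        for t in rank_taxa:
            sums[t] += votes.get(t, 0.0) / denom  # per-read normalization, Eq. (5)
    total = sum(sums.values())
    p = {t: sums[t] / total for t in rank_taxa}  # read-level profile, Eq. (6)
    if genome_len:  # convert read fractions to cell fractions, Eq. (7)
        q = {t: p[t] / genome_len[t] for t in rank_taxa}
        z = sum(q.values())
        p = {t: q[t] / z for t in rank_taxa}
    return p
```

The per-read normalization makes a read with many strong matches and a read with a single weak match contribute equally, which is exactly the design question revisited empirically later in the paper.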
For both p_t^r and p̃_t^r, relative abundances sum up to 1. By default, CONSULT-II relaxes this constraint by including an unclassified taxon. This is achieved by propagating votes down to an artificial lineage that corresponds to the unclassified group, as each k-mer match to an LCA taxon provides evidence for its children, though it is unclear for which one. In other words, we augment the taxonomic tree by adding a lineage under each taxon, continuing down to the species rank. Then, all votes for any non-species taxon are moved along this lineage to an artificial node at the species rank. This is equivalent to replacing the denominator of Equation (5) with the total vote at the root of the taxonomic tree.

Experimental setup
To benchmark CONSULT-II, we constructed reference libraries using the WoL microbial genomic dataset of Zhu et al. (2019a), which is composed of 10 575 species and a reference phylogeny. Five genomes with IDs missing from NCBI were excluded. All methods were run with the same reference set. The hashing parameters of CONSULT-II were set to h = 15, b = 7, l = 2, and k = 32 (minimized from canonical 35-mers). For other parameters, default values were used: w = 4 and s = 5 for the LCA probability function p_u (Fig. 1b) and d_max = 5 for the vote function. We used default settings for Kraken-II, without masking low-complexity sequences, as Rachtman et al. (2020) found default settings to be preferable for query identification. We also constructed the CLARK database using the standard parameters, e.g. k = 31, default classification mode, and species rank for classification. Note that, following Rachtman et al. (2021), 100 archaeal genomes were left out of the reference set and used as part of the query set.

Experiment 1: controlled novelty
We compared the classification performance of CONSULT-II with two popular methods: Kraken-II (Wood et al. 2019) and CLARK (Ounit et al. 2015), which are among the leading metagenomic identification tools based on benchmarking studies (McIntyre et al. 2017, Sczyrba et al. 2017, Meyer et al. 2019, Ye et al. 2019). Kraken-II maps each k-mer in a read to the LCA of all genomes that contain that k-mer and then counts the mapped k-mers on the taxonomic tree to infer a taxon prediction. CLARK is a supervised sequence classification method that also relies on exact k-mer matching; it uses the notion of discriminative k-mers to build a library of reference genomes. Here, we evaluate accuracy one read at a time, each simulated from a query genome.
Let the novelty of a query genome be defined as its minimum genomic nucleotide distance (i.e. one minus average nucleotide identity), as approximated by Mash (Ondov et al. 2016), to any genome in the reference database. We refer to this quantity as MinGND. We carefully selected query genomes to span a range of novelty (Supplementary Fig. S2), expecting that more novel queries will be more challenging. We created two sets of queries: bacterial and archaeal. For the bacterial set, we selected 120 bacterial genomes among genomes added to RefSeq after WoL was constructed. Queries range from near-identical to reference genomes to very novel (e.g. 22 with MinGND > 0.22; Supplementary Fig. S2). Query genomes span 29 phyla, and most queries are from distinct genera (102 genera across 120 queries); only two query genomes belong to the same species. The 100 archaeal queries were chosen by Rachtman et al. (2021) from the WoL set using a similar approach and were excluded from the reference set. We generated 150-bp synthetic reads using ART (Huang et al. 2012) at higher coverage and then subsampled down to 66 667 reads per query (i.e. 10 Mbp per sample).
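For reference, Mash estimates this distance from a sketch-based Jaccard index j as d = −(1/k) ln(2j / (1 + j)) (Ondov et al. 2016); a minimal sketch of the MinGND computation, assuming the per-reference Jaccard estimates are already available, is:

```python
import math

def mash_distance(jaccard, k=21):
    # Mash's distance estimate from a sketch-based Jaccard index j
    # (Ondov et al. 2016): d = -(1/k) * ln(2j / (1 + j)).
    if jaccard == 0:
        return 1.0  # no shared sketch k-mers: treat as maximally distant
    return -math.log(2 * jaccard / (1 + jaccard)) / k

def min_gnd(query_jaccards, k=21):
    # Novelty of a query: minimum estimated distance to any reference genome.
    return min(mash_distance(j, k) for j in query_jaccards)
```

Here k is Mash's sketch k-mer size (21 by default in Mash), distinct from the 32-mers used by the classifier itself.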
We evaluated the predictions of each tool with respect to the NCBI taxonomy, assessing each read separately at each taxonomic rank r. When the reference library had at least one genome matching the query taxon at rank r, we called it a positive: TP if a tool found the correct label, FP if it found an incorrect label, and FN if it did not classify at rank r. When the reference library did not have any genomes from the query taxon at rank r, we called it a negative: TN if a tool did not classify at rank r, and FP if it classified it, which would necessarily be false. We report the precision TP/(TP + FP), the recall TP/(TP + FN), and the F1 score 2TP/(2TP + FP + FN), which combines both. We ignored queries at levels where the true taxonomic ID given by NCBI is 0, which indicates a missing rank.
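These per-rank metrics are straightforward to compute from the counts above; a minimal helper (hypothetical name) is:

```python
def metrics(tp, fp, fn):
    # Precision, recall, and F1 as defined for the per-rank evaluation.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1
```

Note that TNs (correct refusals for taxa absent from the reference) do not enter any of the three formulas, so a method is not rewarded for abstaining except through avoided FPs.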

Experiment 2: abundance profiling
We also evaluated the ability of CONSULT-II to perform taxonomic abundance profiling using the CAMI-I (Sczyrba et al. 2017) and CAMI-II (Meyer et al. 2022) benchmarking challenges. We compared tools using metrics provided by the open-community profiling assessment tool (OPAL) (Meyer et al. 2019). For CONSULT-II, we allowed unclassified taxa in the profile.
The CAMI-I dataset contains five different high-complexity samples, each of size 75 Gbp, simulated to mimic the abundance distribution of the underlying microbial communities. Among many metrics, we chose two singled out in the original OPAL paper: the Bray-Curtis dissimilarity between the estimated profile and the true profile, and Shannon's equitability as a measure of alpha diversity. We report the summary of these two metrics across the five samples. We use the CAMI-I dataset for empirical evaluation of our method's heuristics. Here, we used the same reference libraries constructed for the controlled novelty experiments from the WoL dataset. As a result, we include only Bracken (Lu et al. 2017) and CLARK as alternatives. Bracken extends Kraken-II by combining its taxonomic identification results with Bayesian priors to obtain profiles. For both CLARK and Bracken, we estimated abundance profiles with their default parameters.
On CAMI-II queries, we evaluated CONSULT-II against a host of methods studied by CAMI-II. In particular, we focused on the ten-sample (5 Gbp each) marine dataset and the 100-sample (2 Gbp each) strain-madness dataset. For alternative methods, submitted results were available from CAMI-II. To make comparisons fair, a new CONSULT-II library was constructed using the reference genomes provided under the scope of the challenge (NCBI RefSeq snapshot as of 2019/01/08); among these, we randomly selected ⌊√n_s⌋ genomes for each species s represented by n_s genomes (for a total of 18 381 genomes) and included these in the library. We followed the same evaluation approach as the original paper and compared tools by measuring purity versus completeness and L1 norm versus weighted UniFrac error (Lozupone and Knight 2005). Note that we only included CAMI-II versions of the tools considered, and omitted earlier versions.

Accuracy of k-mer matches
We start by evaluating LCAs of matched k-mers across different ranks and HD values (Fig. 2a), keeping in mind that incorrect k-mer-level matches do not immediately lead to read-level errors. For the queries with low novelty (MinGND < 0.06), ~9% of query k-mers exactly match the reference and have the correct species-level ID; far fewer exact matches have a soft LCA label at higher taxonomic levels, and the proportion of true matches decreases rapidly with higher HD at all ranks. In contrast, less than 0.5% of the k-mers of novel queries (MinGND ≥ 0.22) match any reference, and correct matches peak at HD = 1 or HD = 2. For mildly novel genomes (0.06 ≤ MinGND < 0.22), inexact matches provide a substantial portion of all correct matches. Unlike true matches, the proportion of false matches tends to increase or remain flat as the HD increases. Nevertheless, in many cases, especially at higher taxonomic levels, there are more correct inexact matches with HD ≥ 1 than incorrect ones. For example, 80% of phylum-level inexact matches in the middle novelty range are correct.

Advantages of the soft LCA approach
We next compare the use of hard and soft LCA (Fig. 2b). Overall, soft LCA provides a dramatic improvement over naively computing the LCA. The improvements are especially large for less novel queries (MinGND < 0.12). Interestingly, at the species rank, hard LCA completely fails, while soft LCA has acceptable accuracy for the less novel queries. While a soft LCA approach is clearly helpful, the choice of the ideal probability function is unclear. Since tuning the parameters w and s with a validation set is not practically feasible, we only tested the sensitivity of CONSULT-II performance to them; the settings w = 2 and w = 4 are almost identical, with a negligible advantage for w = 4 (no more than 0.001 in terms of F1). Hence, we use w = 4 as the default value.

Impact of the total vote on classification
Our majority vote mechanism classifies a read at some level, even if the total vote in Equation (4) is small. However, the total vote correlates with whether a classification is correct (Fig. 2c). In particular, about 35% of the FP predictions have v̄(t) < 0.01 (corresponding to up to two k-mers with HD = 5), compared to ~2% of TP predictions. Similarly, around 55% of FPs have v̄(t) < 0.03 (i.e. less than two matches of HD = 4), compared to ~6% of TPs. Thus, filtering reads with low v̄(t) reduces the FPs at the expense of removing some TPs, improving precision and reducing recall (Supplementary Fig. S3b) with only small changes in F1 scores (Supplementary Fig. S3a). To ensure our precision levels are similar to other methods, we avoid classifying reads with low total votes (default: v̄(t) < 0.03).

Impact of normalization and unclassified on profiling
Since the total vote values at the root vary widely across reads, we face a question in profiling samples: should each read contribute equally to the abundance profile of a given sample? If so, we should normalize total vote values per read, as in Equation (5). Alternatively, we may skip this step and use the total vote values directly in Equation (6). In CONSULT-II v0.1.1, described in Şapcı et al. (2023), we followed the latter approach and skipped Equation (5); we also used a slightly different version of Equation (6) that takes the square root of v̄(t) (Supplementary Section S1.2). Empirically, adding read-level normalization results in a dramatic improvement in accuracy (Fig. 2d) measured by the Bray-Curtis dissimilarity (up to ~20%). Surprisingly, skipping the read normalization provides extremely accurate estimates of Shannon's equitability. Thus, we keep the non-normalized version as a non-default option (since v0.3.0). The use of genome sizes [Equation (6) versus Equation (7)] also improves profile accuracy (Supplementary Fig. S9). When we added the unclassified option to profiles, as much as 35% at the species rank and as little as 6% at the phylum rank was unclassified (Supplementary Fig. S8a). Adding unclassified taxa results in slightly more accurate relative abundances in terms of Bray-Curtis dissimilarity (Supplementary Fig. S8b). Thus, we include unclassified taxa in output profiles by default (since v0.4.0). Finally, note that Equation (6) reports independent profiles for each rank. Şapcı et al. (2023) instead reported metrics computed at the species level and aggregated to higher ranks (Supplementary Fig. S7).

Controlled novelty experiments
On the bacterial query set, CONSULT-II has better F1 scores than CLARK and Kraken-II at all levels above species (Fig. 3a). Only for queries with almost identical genomes in the reference set do all methods have high accuracy, with a slight advantage at upper ranks for CONSULT-II. As queries become more novel, accuracy drops across all ranks for all methods. However, CONSULT-II degrades much more slowly and shows clear improvements for novel genomes. For queries with MinGND > 0.05, CONSULT-II outperforms Kraken-II and CLARK across all levels above species with substantial margins. Moreover, the improvements become more substantial at higher taxonomic levels. For instance, with MinGND > 0.15, CONSULT-II has mean F1 scores that are 0.12, 0.13, 0.14, and 0.19 higher than the second-best method (Kraken-II) for the family, order, class, and phylum ranks, respectively. Similar patterns are observed. Note that CONSULT-II runs as two subprograms: the first finds matches and writes them to the disk; the second reads the results and performs prediction.
We also observed significant differences in terms of library construction times. Kraken-II was again the fastest tool (~3.5 h with 96 threads), followed by CLARK (~10 h with 96 threads). CONSULT-II took ~19 h (with 128 threads) to construct, and this time was dominated by the soft LCA computation step. Note, however, that library construction is a one-time operation; once constructed, libraries can be distributed.
CONSULT-II, in its default mode, has the highest memory footprint (140 Gb), followed by CLARK (130 Gb). Kraken-II has much better memory efficiency, using only 44 Gb. CLARK's memory requirement increased substantially during library building, exceeding 350 Gb. Although the increase is not as dramatic as CLARK's, both CONSULT-II and Kraken-II also consumed more memory during library building (16 and 1 Gb extra, respectively).
Since CONSULT-II uses more than three times as much memory as Kraken-II, we asked if its improved performance was because of its higher memory usage. To answer this question, we analyzed the performance of CONSULT-II with a much smaller library (32 Gb) by setting h = 14 and b = 10 to store fewer k-mers (see Supplementary Fig. S5). Compared to the default CONSULT-II library, the lightweight CONSULT-II had 13%, 12%, 5%, 11%, 10%, and 4% decreases in F1 for the phylum to species ranks. Nevertheless, compared to Kraken-II, CONSULT-II achieved 16%, 14%, 10%, 11%, and 5% higher F1 scores for the phylum to genus ranks, and 6% lower at species, despite using less memory (32 versus 44 Gb). Thus, the advantages of CONSULT-II over Kraken-II persist, especially at higher ranks, but at a lower level when using similar memory.

Discussion
We introduced CONSULT-II for taxonomic classification and abundance profiling. Our approach uses LSH to efficiently find inexact k-mer matches, and the match distances, between reference genomes and a query. Heuristics are then used to translate k-mer-level matches and their HDs into read-level classification and sample-level profiling. While they lack theoretical guarantees, these heuristics performed remarkably well in our experiments and outperformed popular k-mer-based methods. In particular, our equations for the LCA update probability (2) and vote-versus-distance (3) are based on intuitive assumptions and expectations, with little fine-tuning and very few parameters. Future research should explore alternative approaches, including using machine learning to automatically train parameter-rich functions instead of heuristics. To ensure robustness, it would be essential to evaluate such fine-tuned methods on more varied datasets. An alternative direction for future work is developing a theoretical framework for translating k-mer distances into taxonomic classifications. Connecting taxonomic profiling to distance-based phylogenetic placement could provide a framework to tackle this goal. Such a framework may allow us to go beyond taxonomic identification and could provide alignment-free phylogenetic placement of reads, as others have attempted (Blanke and Morgenstern 2020). Finally, future work should address the high memory consumption of CONSULT-II compared to alternatives, perhaps by smart subsampling of k-mers.

Figure 1. (a) The expected number of 32-mers matched by the LSH approach for a short read of length L = 150 as the normalized distance d of the read to the closest match varies: (L − k + 1)ρ(d). Lines show different settings of l and h for an infinite b, i.e. all reference k-mers are stored in the library. The black line corresponds to k = 35, h = 35 − 7, and l = 1, mimicking the default Kraken-II settings. (b) The probability of updating the LCA per each new k-mer goes down with the number of reference genomes in which that k-mer appears; we use w = 4 and s = 5 as shown. (c) Overview of the classification algorithm, consisting of three main stages: (i) finding the closest inexact k-mer match using LSH, (ii) computing vote values based on the Hamming distances (HD) and aggregating vote values on the taxonomic tree, and (iii) determining the most specific taxon, i.e. the lowest rank, above the threshold.
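The expected-match curve of panel (a) follows from a standard LSH calculation: a table keyed on h random positions of a k-mer catches it only if all h sampled positions avoid the mismatches, and l independent tables give at least one hit with probability ρ(d). The sketch below reproduces this reasoning under the infinite-b assumption of the figure; the particular h and l values are illustrative choices, not defaults.

```python
from math import comb

def rho(d, k=32, h=14, l=2):
    """Probability that a k-mer at normalized Hamming distance d from its
    closest reference k-mer is caught by at least one of l LSH tables, each
    keyed on h random positions (infinite-b setting: every reference k-mer
    is stored)."""
    m = round(d * k)  # number of mismatched positions
    if k - m < h:
        return 0.0
    p_one = comb(k - m, h) / comb(k, h)  # all h sampled positions match
    return 1.0 - (1.0 - p_one) ** l

def expected_matches(d, L=150, k=32, h=14, l=2):
    # The curve of panel (a): (L - k + 1) * rho(d) matched k-mers per read.
    return (L - k + 1) * rho(d, k, h, l)

print(expected_matches(0.0))  # → 119.0 (every k-mer of an exact read matches)
```

Setting h = k and l = 1 makes ρ(d) vanish for any d > 0, which is the exact-match regime that panel (a)'s black (Kraken-II-like) line approaches.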

Figure 2. Empirical analysis of the LCA update probability p_u and total vote values v(t) across different ranks. (a) The proportion of matched k-mers across different ranks, novelty bins (box rows, labeled by MinGND range), and Hamming distances. A match is TRUE if the LCA taxon of the matched k-mer is from the correct lineage, and matches are only counted for the rank of the LCA taxon. Note that a FALSE k-mer match at a particular rank can be correct at higher ranks but is not counted there. The y-axis is the ratio between k-mer matches of each type and all 119 × 66 667 k-mers of a query, averaged over all query genomes. (b) Comparison of soft and hard LCA, based on F1 across different ranks and bins. We compare two parameter settings for the p_u function of soft LCA. (c) The empirical cumulative distribution function (ECDF) of total vote values for correct (TP) and incorrect (FP) classifications, revealing the predictive power of votes. The shaded areas mark the 0.01 and 0.03 thresholds, used to filter out spurious classifications. (d) Comparison of two different modes of profiling according to Bray-Curtis dissimilarity and Shannon's equitability. The two measures are computed independently for CONSULT-II's report at each taxonomic rank. Although the default mode [with read-level normalization, using Equation (6)] predicts the actual composition more accurately, the optional mode (without read-level normalization) is a better estimator of Shannon's equitability.

Figure 4. (a) Results of abundance profiling experiments on all five samples of the high-complexity CAMI-I dataset, comparing CONSULT-II with Bracken and CLARK across different taxonomic ranks, using custom libraries constructed with the WoL reference database. Left: Bray-Curtis dissimilarity between estimated and true profiles; the smaller the value, the more similar the abundance profiles. Right: Shannon's equitability as a measure of alpha diversity; the closer the Shannon's equitability of the predicted profile is to that of the gold standard, the better it reflects the actual alpha diversity, in terms of the evenness of taxon abundances. Points are mean values across five samples, and horizontal bars show minimum and maximum errors. (b) Performance comparison on the CAMI-II benchmarking challenge, marine and strain-madness datasets. Following the original paper, we show the upper bound of the L1 norm (2) minus the actual L1 norm versus the upper bound of the weighted UniFrac error (16) minus the actual weighted UniFrac error, for genus and phylum ranks. Each data point is the average over 10 marine samples and 100 strain-madness samples. "Higher ranks" values are means over all ranks other than species. Metrics are computed using OPAL with default settings and -n. For clarity, axes are cut; see Supplementary Fig. S6 for full results, including all ranks.
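The two profile-level metrics used throughout panel (a) can be computed directly from a pair of relative-abundance profiles. These are the standard formulations; OPAL's implementation may differ in normalization details, and the example profiles are hypothetical.

```python
from math import log

def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two relative-abundance profiles
    (dicts taxon -> abundance); 0 = identical, 1 = completely disjoint."""
    taxa = set(p) | set(q)
    num = sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in taxa)
    den = sum(p.get(t, 0.0) + q.get(t, 0.0) for t in taxa)
    return num / den if den else 0.0

def shannon_equitability(p):
    """Shannon entropy of a profile normalized by log(#taxa);
    1.0 corresponds to perfectly even abundances."""
    probs = [a for a in p.values() if a > 0.0]
    if len(probs) <= 1:
        return 0.0
    entropy = -sum(a * log(a) for a in probs)
    return entropy / log(len(probs))

truth = {"A": 0.5, "B": 0.3, "C": 0.2}
estimate = {"A": 0.6, "B": 0.3, "D": 0.1}
print(round(bray_curtis(truth, estimate), 3))  # → 0.2
```

Note that the two metrics capture different errors: a profile can match the gold standard's equitability (evenness) while assigning abundance to entirely wrong taxa, which Bray-Curtis would penalize heavily.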