Alignment-free comparison of metagenomics sequences via approximate string matching

Abstract

Summary: Quantifying pairwise sequence similarities is a key step in metagenomics studies. Alignment-free methods provide a computationally efficient alternative to alignment-based methods for large-scale sequence analysis, and several neural network-based methods have recently been developed for this purpose. However, existing methods do not perform well on sequences of varying lengths and are sensitive to the presence of insertions and deletions. In this article, we describe the development of a new method, referred to as AsMac, that addresses the aforementioned issues. We proposed a novel neural-network structure based on approximate string matching for the extraction of pertinent information from biological sequences and developed an efficient gradient computation algorithm for training the constructed neural network. We performed a large-scale benchmark study using real-world data that demonstrated the effectiveness and potential utility of the proposed method.

Availability and implementation: The open-source software for the proposed method and trained neural-network models for some commonly used metagenomics marker genes are freely available at www.acsu.buffalo.edu/~yijunsun/lab/AsMac.html.

Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
Microbes play an essential role in processes as diverse as those related to human health and biogeochemical activities critical to life in all environments on earth. However, due to the inability of traditional techniques to cultivate most microbes, our understanding of complex microbial communities is limited. The advent of next-generation sequencing technology allows researchers to study genetic materials recovered directly from natural environments and thereby opens a new window to extensively probe the hidden microbial world (Wooley et al., 2010). Consequently, metagenomics, where the amplicon sequencing of prokaryotic marker genes (e.g. 16S and 23S rRNA genes) serves as a major probing tool, has recently become an exploding research area (Gilbert et al., 2016) and was selected as one of the 10 technical breakthroughs of 2013 by Science magazine (The Science Website, 2013).
A key step in metagenomics analyses is to quantify pairwise sequence similarities, which plays a fundamental role in various bioinformatics methods for database search, sequence annotation and sequence binning (Cai and Sun, 2011; Cai et al., 2017; Caporaso et al., 2010; Edgar, 2010; Zheng et al., 2019a). Alignment distances are generally considered the gold standard; however, due to the high computational complexity, alignment-based methods can only be applied to small sequence datasets. In a metagenomics study, tens of millions or even hundreds of millions of sequences are typically generated. In this case, alignment-free methods (Bonham-Carter et al., 2014; Song et al., 2014; Zielezinski et al., 2017; 2019) are perhaps the only computationally viable approach to estimating pairwise sequence similarities. Commonly used alignment-free methods can be broadly classified into two categories: (i) methods based on word frequency [e.g. k-mer (Karlin and Burge, 1995), FFP (Sims et al., 2009) and CV (Gao and Qi, 2007)], and (ii) methods based on substrings [e.g. ACS (Ulitsky et al., 2006), Kr (Haubold et al., 2009) and kmacs (Leimeister and Morgenstern, 2014)]. There also exist some methods based on information theory [e.g. IC-PIC (Gao and Luo, 2012)], but they are less commonly used. By transforming sequences into numerical vectors, alignment-free methods can be viewed as defining an implicit mapping function that projects sequences into an embedding space where pairwise sequence similarities are calculated. The methods in the first category are based on the statistics of fixed-length word frequency or on the information content of word frequency distributions, while the methods in the second category employ the similarities or differences of substrings in a pair of sequences. For substring-based methods, it is not necessary to specify a word length, and thus in general they can achieve better performance than those relying on a fixed word length. However, substring-based methods introduce new parameters [e.g. the number of mismatches in kmacs (Leimeister and Morgenstern, 2014)] that cannot be easily estimated. Moreover, computing features of variable word lengths usually requires more complex data structures and thus is computationally more expensive (Leimeister and Morgenstern, 2014). A major limitation of the above methods is that they are all data-independent approaches, where distance measures are predefined based on various heuristics and thus can only provide rough approximations to alignment distances.
Recently, several attempts have been made to use neural networks to develop data-dependent approaches for alignment-free sequence comparison [e.g. SENSE (Zheng et al., 2019b) and NeuroSEED (Corso et al., 2021)]. The basic idea is to use a neural network to learn an explicit mapping function through training and map sequences onto an embedding space so that the mean square error between alignment distances and pairwise distances defined in the embedding space is minimized. Since the mapping function is learned through training, it has been demonstrated that data-dependent methods can offer much more accurate solutions than data-independent counterparts (Corso et al., 2021; Zheng et al., 2019b). Moreover, neural network-based approaches are computationally very efficient, and run even faster than the k-mer method (Zheng et al., 2019b). However, existing methods suffer from several major limitations. First, biological sequences under comparison usually have varying lengths; however, SENSE can only be applied to sequences of equal length, and NeuroSEED uses zero padding to make sequences have the same length, which is not biologically meaningful and can lead to poor performance. Second, early attempts used the convolutional neural network (CNN) for sequence comparison. CNN was designed mainly for image analysis (LeCun et al., 2015) and cannot effectively model insertions and deletions (collectively called indels) due to the use of dot products to measure the similarity between a filter and a sequence. Although gated recurrent units and attention transformers have achieved great success for natural language data analysis (Sundermeyer et al., 2010; Wang et al., 2019), it was found that they were inferior to CNN for biological sequence comparisons (Corso et al., 2021).
In this article, we describe the development of a novel neural network-based method, referred to as AsMac, that addresses the aforementioned issues. Specifically, we used approximate string matching (ASM) to transform sequences into numeric vectors. Since ASM uses the edit distance to measure the similarity between a pattern and a sub-sequence, it is well suited for extracting pertinent information from biological sequences. To learn patterns from data automatically, we proposed a novel neural-network structure for the implementation of approximate string matching and an efficient gradient computation algorithm for training the constructed neural network. We performed a large-scale experiment on prokaryotic rRNA sequence datasets that demonstrated the effectiveness and utility of the proposed method. We also provide trained neural-network models for some commonly used prokaryotic marker genes on an accompanying website that researchers can use directly to process their own datasets.

Overview
We propose a novel method that uses a Siamese neural network and approximate string matching for alignment-free sequence comparison. Figure 1 depicts an overview of the proposed method. The basic idea is to use a neural network to project sequences into embedding vectors so that the difference between alignment distances and pairwise dissimilarities calculated in the embedding space is minimized. Since inputs and outputs are from the sequence and numerical domains, respectively, and are not directly comparable, we use a twin neural network (also called a Siamese neural network (Bromley et al., 1993)) to learn a mapping function. Specifically, given a pair of sequences $(S_1, S_2)$, each network takes one sequence as input and outputs $f(S_1|\Theta)$ and $f(S_2|\Theta)$ as the embedding vectors of the two sequences, respectively. Here, $f$ is the mapping function, and $\Theta$ denotes the parameters of the network to be optimized. To form the twin network, we propose a novel neural network for approximate string matching (detailed below). To train the network, we compute the alignment distance $d_a(S_1, S_2)$ by using the Needleman-Wunsch (NW) algorithm, take the cosine distance of the embedding vectors as the embedding distance $d_e(f(S_1|\Theta), f(S_2|\Theta))$, and optimize the neural network by using backpropagation to minimize the mean square error given by

$$C(\Theta) = \sum_{i,j} \big[ d_e(f(S_i|\Theta), f(S_j|\Theta)) - d_a(S_i, S_j) \big]^2. \qquad (1)$$

In the training process, the twin networks are forced to share the same parameters, including both initialization and gradient-descent updates. Once the Siamese neural network is optimized, one of the two networks is used to project sequences into embedding vectors.
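To make the training objective concrete, the loss in Eq. (1) can be sketched in a few lines of NumPy. This is an illustrative sketch under our own naming (`cosine_distance`, `siamese_loss`), not the released implementation:

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance between two embedding vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def siamese_loss(embeddings, align_dist):
    """Mean square error of Eq. (1): embeddings is an (n, d) array of
    f(S_i | Theta); align_dist is an (n, n) matrix of d_a(S_i, S_j)."""
    n = embeddings.shape[0]
    loss = 0.0
    for i in range(n):
        for j in range(n):
            d_e = cosine_distance(embeddings[i], embeddings[j])
            loss += (d_e - align_dist[i, j]) ** 2
    return loss
```

In practice this objective is minimized by backpropagation through the shared network, so both copies of the twin network see identical parameter updates.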

Approximate string matching
We start by giving a brief description of the approximate string matching (ASM) algorithm. Let $S = s_1 \ldots s_L$ be a sequence and $P = p_1 \ldots p_M$ be a pattern. We aim to identify a sub-sequence in $S$ that has the minimum edit distance to pattern $P$ among all sub-sequences of $S$. Dynamic programming provides the optimal solution for this purpose (Sellers, 1980). Specifically, we first construct a scoring matrix $F$ of size $(M+1) \times (L+1)$, where the $(m, \ell)$th entry $F_{m,\ell}$ is the negative minimum edit distance between the first $m$ characters of $P$ and any sub-sequence $S_{\ell',\ell} = s_{\ell'} \ldots s_\ell$ of $S$ that ends at position $\ell$. The scoring matrix can be constructed recursively:

$$F_{m,\ell} = \max\big( F_{m-1,\ell-1} - c,\; F_{m-1,\ell} - 1,\; F_{m,\ell-1} - 1 \big), \qquad (2)$$

with $F_{0,\ell} = 0$ and $F_{m,0} = -m$, where $c$ is the substitution cost that takes a value of 0 if $p_m = s_\ell$ and 1 otherwise. Once $F$ is computed, the maximum value in the last row is the optimal score, and the best-matched sub-sequence can be identified through backtracking. The computational complexity is on the order of $O(ML)$. Since pattern $P$ is usually much shorter than sequence $S$, the scoring matrix can be computed efficiently. Moreover, unlike CNN, approximate string matching uses the edit distance to measure the similarity between a pattern and a sub-sequence, allowing for deletions and insertions. Thus, it is well suited for extracting pertinent information from biological sequences.
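The recursion above can be implemented directly. The following minimal Python sketch (our own illustration, not the released code) fills the scoring matrix and returns the optimal score together with the end position of the best-matched sub-sequence:

```python
def asm_best_match(S: str, P: str):
    """Approximate string matching via dynamic programming (Sellers, 1980).
    Returns (score, end), where score is the negated minimum edit distance
    between P and the best-matched sub-sequence of S, and end is the
    1-indexed position in S where that sub-sequence ends."""
    M, L = len(P), len(S)
    # F[m][l]: negative minimum edit distance between P[:m] and any
    # sub-sequence of S ending at position l.
    F = [[0] * (L + 1) for _ in range(M + 1)]
    for m in range(1, M + 1):
        F[m][0] = -m                # aligning P[:m] to an empty sub-sequence
    # F[0][l] stays 0: the match may start at any position of S.
    for m in range(1, M + 1):
        for l in range(1, L + 1):
            c = 0 if P[m - 1] == S[l - 1] else 1   # substitution cost
            F[m][l] = max(F[m - 1][l - 1] - c,     # match / substitution
                          F[m - 1][l] - 1,          # gap in S
                          F[m][l - 1] - 1)          # gap in P
    best_end = max(range(L + 1), key=lambda l: F[M][l])
    return F[M][best_end], best_end
```

The best-matched sub-sequence itself can then be recovered by backtracking from `(M, best_end)`, exactly as in global alignment.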

Novel neural network for approximate string matching
A major issue with approximate string matching for our application is that patterns of interest are generally unknown and can only be predetermined heuristically. To address the issue, we propose a novel neural network for approximate string matching that enables the learning of patterns from data automatically. To cast the problem into a continuous optimization problem so that the constructed neural network can be trained through gradient descent, some modifications are merited. First, we use one-hot coding (Zheng et al., 2019b) to transform an input sequence $S$ into an $L \times 4$ matrix $\mathbf{S} = [\mathbf{s}_1^T, \ldots, \mathbf{s}_L^T]^T$, and pattern $P$ into an $M \times 4$ matrix $\mathbf{P} = [\mathbf{p}_1^T, \ldots, \mathbf{p}_M^T]^T$. Here, each element of $\mathbf{P}$ can be learned through training (detailed below). Second, we replace the substitution cost $c$ with $-\mathbf{p}_m \mathbf{s}_\ell^T$ to measure the cost of the mismatch between the $m$th vector of the pattern and the $\ell$th vector of the input sequence. Third, while in the original algorithm the gap penalty is set to 1, in order to make the constructed neural network more generally applicable, following Koide et al. (2018), we set it to be a learnable parameter. Finally, in order for the constructed mapping function to be differentiable, we replace the max function in Eq. (2) with a generalized maximization operator (Cuturi and Blondel, 2017), given by

$$\max\nolimits_\gamma(a_1, \ldots, a_J) = \gamma \log \sum_{j=1}^{J} \exp(a_j/\gamma), \qquad (3)$$

where $\gamma > 0$ is a smoothing parameter. With the above modifications, Eq. (2) can be re-formulated as

$$F_{m,\ell} = \max\nolimits_\gamma\big( F_{m-1,\ell-1} + \mathbf{p}_m \mathbf{s}_\ell^T,\; F_{m-1,\ell} - g,\; F_{m,\ell-1} - g \big), \qquad (4)$$

where $g$ is the gap penalty. Using the above-modified approximate string matching, we construct a neural network to map input sequences into embedding vectors. Specifically, given an input sequence $\mathbf{S}$ and $N$ patterns $\mathbf{P}_1, \ldots, \mathbf{P}_N$, we generate $N$ scoring matrices $F_1, \ldots, F_N$. We extract the maximum value $v_n$ from the last row of each scoring matrix $F_n$, concatenate the maxima into a vector $\mathbf{v}$, and compute an embedding vector $\mathrm{ReLU}(\mathbf{v} + \mathbf{b})$, where ReLU is the rectifier activation function (Nair and Hinton, 2010) and $\mathbf{b}$ is a bias term.
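To make the modified recursion concrete, here is a small Python sketch of the smoothed forward pass. The function names and the plain-list representation are our own illustration; a production implementation would use batched tensor operations:

```python
import math

def softmax_gamma(values, gamma):
    """Generalized maximization operator of Eq. (3): a smooth surrogate for
    max that approaches the exact max as gamma -> 0."""
    m = max(values)                     # subtract the max for numerical stability
    return m + gamma * math.log(sum(math.exp((v - m) / gamma) for v in values))

def soft_asm(S_onehot, P, g=1.0, gamma=0.01):
    """Smoothed ASM scoring matrix per Eq. (4): substitution cost -p_m . s_l,
    learnable gap penalty g, soft max in place of the hard max.
    S_onehot and P are lists of length-4 one-hot (or learned) row vectors."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    M, L = len(P), len(S_onehot)
    F = [[0.0] * (L + 1) for _ in range(M + 1)]
    for m in range(1, M + 1):
        F[m][0] = F[m - 1][0] - g       # first column: pattern vs empty prefix
    for m in range(1, M + 1):
        for l in range(1, L + 1):
            F[m][l] = softmax_gamma(
                [F[m - 1][l - 1] + dot(P[m - 1], S_onehot[l - 1]),
                 F[m - 1][l] - g,
                 F[m][l - 1] - g], gamma)
    return F
```

With a small `gamma`, the soft recursion closely tracks the hard one: matching a one-hot pattern against an identical one-hot sequence yields a final score close to the pattern length.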

Learning parameters through backpropagation
Finally, we discuss how to estimate the parameters of the proposed method, specifically the patterns used in approximate string matching and the gap penalty, efficiently through backpropagation by minimizing the mean squared error calculated by comparing embedding distances and alignment distances (see Eq. 1). While backpropagation can be easily implemented by using PyTorch (Paszke et al., 2019), the computation of the gradient of the mean squared error with respect to the patterns is not trivial. Since the patterns are independent of each other, we only need to consider one pattern.
Specifically, given pattern matrix $\mathbf{P}$, we need to calculate the partial derivative of the mean squared error with respect to each row vector $\mathbf{p}_m$ of the pattern matrix, which by the chain rule reduces to computing $\partial v / \partial \mathbf{p}_m$, where $v$ is the maximum value of the last row of scoring matrix $F$. We note that while the computation of $F$ depends on an input sequence, only the best-matched sub-sequence of the input sequence is involved in the computation of $v$. We utilize this fact to develop an efficient scheme for the computation of gradients. By definition, $v$ is the score of the global alignment between the pattern and its best-matched sub-sequence, and thus can be computed by using the NW algorithm (Needleman and Wunsch, 1970). An issue with the NW method is that its output is not differentiable.
To address the issue, we resort to the differentiable NW algorithm (Koide et al., 2018) to approximate the value of $v$ and compute the partial derivatives with respect to the pattern and the gap penalty. Let $\tilde{S} = \tilde{s}_1 \ldots \tilde{s}_{\tilde{L}}$ be the best-matched sub-sequence, and let $\tilde{F}$ be the scoring matrix of the global alignment between $\mathbf{P}$ and $\tilde{S}$ computed with the smoothed recursion of Eq. (4). By construction, $\tilde{F}_{M,\tilde{L}} \approx v$. Thus, the partial derivative of $v$ with respect to each row of pattern $\mathbf{P}$ can be approximated as

$$\frac{\partial v}{\partial \mathbf{p}_m} \approx \sum_{\ell} \frac{\partial \tilde{F}_{M,\tilde{L}}}{\partial \tilde{F}_{m,\ell}} \frac{\partial \tilde{F}_{m,\ell}}{\partial \mathbf{p}_m}.$$

It can be shown that $\partial \tilde{F}_{m,\ell} / \partial \mathbf{p}_m = \exp\big((\tilde{F}_{m-1,\ell-1} + \mathbf{p}_m \tilde{\mathbf{s}}_\ell^T - \tilde{F}_{m,\ell})/\gamma\big)\, \tilde{\mathbf{s}}_\ell$. Note that $\tilde{F}_{m,\ell}$ is involved in the calculation of its three neighbors $\tilde{F}_{m+1,\ell}$, $\tilde{F}_{m,\ell+1}$ and $\tilde{F}_{m+1,\ell+1}$. Hence, $\partial \tilde{F}_{M,\tilde{L}} / \partial \tilde{F}_{m,\ell}$ can be computed recursively: when $m \neq M$ and $\ell \neq \tilde{L}$, it is the sum over the three neighbors $(m',\ell')$ of $(\partial \tilde{F}_{M,\tilde{L}} / \partial \tilde{F}_{m',\ell'})(\partial \tilde{F}_{m',\ell'} / \partial \tilde{F}_{m,\ell})$. Likewise, the partial derivative $\partial v / \partial g$ can be approximated as $\partial \tilde{F}_{M,\tilde{L}} / \partial g$, which can be calculated recursively by taking the partial derivative of the recursion in Eq. (4) with respect to $g$: when $m \neq 0$ and $\ell \neq 0$, $\partial \tilde{F}_{m,\ell} / \partial g$ is a weighted combination of the partial derivatives of its three predecessors, where the weights are the softmax coefficients induced by Eq. (3) and the two gap branches each contribute an additional $-1$.
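The key analytic ingredient in these recursions is that the gradient of the generalized maximization operator is simply a softmax weight vector. The sketch below (our own illustration, not the paper's code) verifies that fact against finite differences:

```python
import math

def softmax_gamma(values, gamma):
    """Generalized maximization operator of Eq. (3)."""
    m = max(values)
    return m + gamma * math.log(sum(math.exp((v - m) / gamma) for v in values))

def softmax_gamma_grad(values, gamma):
    """Analytic gradient of max_gamma: the softmax weights exp(a_j/gamma)/Z.
    These are exactly the factors that propagate through the recursive
    gradient computations described above."""
    m = max(values)
    w = [math.exp((v - m) / gamma) for v in values]
    z = sum(w)
    return [wi / z for wi in w]

def numeric_grad(values, gamma, eps=1e-6):
    """Forward finite-difference approximation, used here only as a check."""
    base = softmax_gamma(values, gamma)
    grads = []
    for i in range(len(values)):
        bumped = list(values)
        bumped[i] += eps
        grads.append((softmax_gamma(bumped, gamma) - base) / eps)
    return grads
```

Because the weights sum to one, gradient mass is distributed across the three DP branches rather than routed through a single (non-differentiable) argmax path.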

Experiments
We performed a large-scale benchmark study on six sequence datasets, summarized in Table 1, to demonstrate the effectiveness and potential utility of the proposed method. The Zymo dataset (Callahan et al., 2019) required pre-processing: we used the removePrimers function from the DADA2 R package (Callahan et al., 2016) to remove primers and orient all the sequences in the forward direction. After pre-processing, the Zymo dataset contained 69 367 sequences with lengths ranging from 1187 to 1518 bps.

Experimental setting
We compared our method with seven other alignment-free methods, namely, k-mer (Karlin and Burge, 1995), ACS (Ulitsky et al., 2006), Kr (Domazet-Loso and Haubold, 2009), FFP (Sims et al., 2009), kmacs (Leimeister and Morgenstern, 2014), SENSE (Zheng et al., 2019b) and NeuroSEED (Corso et al., 2021). When evaluating an alignment-free method, estimation accuracy and computational efficiency are the two major considerations. Thus, we calculated the mean relative error (MRE) between alignment distances and distances estimated by an alignment-free method and recorded its computational time. By definition, MRE can be interpreted as the average percentage by which estimated distances deviate from alignment distances. Since it is computationally infeasible to align all sequence pairs in a dataset, we randomly sampled 1000 sequences without replacement and calculated the alignment distances of all 499 500 possible sequence pairs. The NW algorithm (Needleman and Wunsch, 1970) with the default setting was used for sequence alignment (match score = 2, mismatch score = -3, gap opening score = -5, gap extension score = -2). To minimize random variations, the above sampling process was repeated ten times. Therefore, we generated ten test datasets from each sequence dataset to evaluate the eight competing methods. In the result tables, the numbers in parentheses are standard deviations; when different parameters were used for a given method, the best result is boldfaced; and each P-value was computed by comparing the MRE result of AsMac with the best result of the competing method.
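The two quantities in this protocol, the number of pairs among 1000 sampled sequences and the MRE itself, are easy to compute; the sketch below is our own illustration with hypothetical distance lists:

```python
import math

def mean_relative_error(estimated, reference):
    """Mean relative error between alignment-free estimates and alignment
    distances, skipping zero reference distances to avoid division by zero."""
    errs = [abs(e - r) / r for e, r in zip(estimated, reference) if r > 0]
    return sum(errs) / len(errs)

# Sampling 1000 sequences yields C(1000, 2) distinct sequence pairs.
n_pairs = math.comb(1000, 2)
```

Here `n_pairs` evaluates to 499 500, matching the number of aligned pairs per test dataset.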
For kmacs, FFP (V3.19), Kr (V2.0.2) and NeuroSEED, we used the source code provided by the associated articles. Since kmacs is an extension of ACS, we used the kmacs binary with k = 0 as the implementation of the ACS method. For k-mer, we used our C++ implementation, which is optimized for amplicon sequence data by using a sparse k-mer count representation. For k-mer, kmacs and FFP, different parameters can be applied. Since there is no principled way to estimate the optimal parameter, we tested ten parameters for each method and recorded the best result. For SENSE, we used the default parameters. For NeuroSEED, we used the settings reported to achieve the best results in the original article. For AsMac, we set the number of pattern filters to 300 and the pattern length to 20. For AsMac, SENSE and NeuroSEED, training of a model for each sequence dataset is required. Since SENSE can only be applied to sequences of roughly equal length, it was trained and tested only on the Qiita and RT988 datasets. To form a training dataset, we randomly sampled sequences from a dataset without replacement and calculated the alignment distances of 5 million sequence pairs. To prevent information leakage, we verified that none of the sequences used for training were included in the test data. The Adam optimizer (Kingma and Ba, 2014) was employed to train the Siamese neural networks in SENSE and AsMac, with the learning rate set to 1e-4 and the number of training epochs set to 200. All experiments were performed on a 3.3 GHz Quad-Core Intel Core i5 with 16 GB memory.

Benchmark study of accuracy and efficiency
First, we applied the eight methods to the Qiita and RT988 datasets, in which sequences cover only a sub-region of the 16S rRNA gene and have similar lengths. Table 2 reports the MREs and CPU time of the eight methods, obtained by averaging over ten runs, and Supplementary Figures S1 and S2 present the estimated distances against the alignment distances for the two datasets, respectively. In terms of prediction accuracy, AsMac performed slightly better than SENSE and NeuroSEED and outperformed the best results of all other methods by a large margin on both datasets. In terms of running time, while k-mer is the fastest algorithm, its accuracy is not comparable to that of our method. Next, we applied the competing methods to the Greengenes, Silva-23S and Labonte lake datasets, where the sequences are of variable lengths. Since there is no biologically meaningful way to trim sequences to an equal length, SENSE could not be used. Table 3 reports the MREs and CPU time of the seven tested methods. Figure 2 and Supplementary Figures S3 and S4 present the estimated distances against the alignment distances for the three datasets, respectively. Similar to the results obtained on the Qiita and RT988 datasets, the prediction accuracies of AsMac were significantly better than the best results of all other competing methods.
Finally, we tested the generalization capability of AsMac and NeuroSEED by training a model on the Greengenes dataset and applying it to the Zymo dataset. The results are reported in Table 3 and Figure 3. Again, AsMac significantly outperformed the other approaches. We noted that, compared to the result obtained on the Greengenes dataset, the prediction accuracy of NeuroSEED declined dramatically. This is likely because the Zymo dataset contains a large number of similar sequences that differ by only a few indels, and the CNN structure used by NeuroSEED does not accommodate indels well. As shown in Figure 3b, NeuroSEED severely overestimated the pairwise distances of similar sequences. In contrast, by using the ASM structure, our method maintained a high level of accuracy in predicting alignment distances in this situation.

Taxonomy assignment
To further evaluate the effectiveness and utility of the proposed method, we conducted an experiment where we used our approach to perform taxonomy assignment of a given sequence dataset by searching against a reference database. For the purpose of this study, we constructed a query dataset by randomly sampling 3000 sequences from the Zymo dataset. We used as the reference database the GG97 dataset, a subset of the Greengenes database (McDonald et al., 2012). This is the default closed-reference database used by QIIME (Caporaso et al., 2010) and contains 30 592 sequences with complete taxonomy annotations at the genus level. We performed a database search using the NW algorithm and annotated each query sequence with the genus of the best-matched reference sequence. We used the result obtained by the NW algorithm as the ground truth and compared the performance of our method with the six competing methods used in the Greengenes study. For kmacs, k-mer and FFP, the parameters were set to those that achieved the best performance in the Greengenes study. All methods were run in parallel using four threads.
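With embedding vectors in hand, closed-reference annotation reduces to a nearest-neighbor search in the embedding space. A minimal sketch (our own, with hypothetical genus labels) might look like:

```python
import numpy as np

def annotate(query_emb, ref_emb, ref_genus):
    """Assign each query sequence the genus label of its nearest reference
    under cosine distance (equivalently, maximum cosine similarity).
    query_emb: (q, d) array; ref_emb: (r, d) array; ref_genus: r labels."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    nearest = np.argmax(q @ r.T, axis=1)   # best-matched reference per query
    return [ref_genus[i] for i in nearest]
```

Note that the normalized reference matrix `r` can be computed once and reused for every query batch, which is why embedding-based methods enjoy a large speed advantage in the database-search setting discussed below.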
Figure 4 reports the annotation accuracy and CPU time of the seven methods tested. Our method achieved 98.4% prediction accuracy, outperforming FFP and NeuroSEED by about 13%, and k-mer, ACS, Kr and kmacs by about 20%. We noticed that kmacs did not perform as well as it did in the Greengenes and Zymo studies (see Table 3). This may be because for distance estimation we are concerned with the accuracy over all sequence pairs, whereas for sequence annotation we are only interested in sequence pairs that are similar. In terms of computational efficiency, AsMac performed comparably with k-mer and NeuroSEED, and ran two orders of magnitude faster than Kr, ACS, kmacs and FFP. It is worth noting that, compared to the results reported in Table 2, AsMac, NeuroSEED and k-mer ran much faster relative to the other methods. This is because, for these three methods, the embedding vectors of the reference sequences can be pre-computed, while for Kr, ACS and kmacs, the pairwise sequence comparison can only be performed in the presence of both query and reference sequences. Although FFP can pre-process the reference sequences, it is computationally expensive to compute the Kullback-Leibler distances between the query and reference sequences.

Parameter sensitivity analysis
The proposed method has two parameters, namely, the pattern number and the pattern length. We performed a parameter sensitivity analysis to investigate how the method performs with respect to the two parameters. We applied the method to the Greengenes and Zymo datasets and reported in Figure 5 the MRE results obtained by using various pattern numbers and lengths. We can see that, as the number of patterns increased, the prediction errors dropped quickly and then flattened once the pattern number exceeded 300. We also observed that our method achieved similar results once the pattern length was larger than 20. To balance accuracy and efficiency, we set the pattern number to 300 and the pattern length to 20 as the default parameters.

Discussion and conclusion
In this article, we described a new neural network-based method for alignment-free sequence comparison. To demonstrate the effectiveness of the proposed method, we conducted a large-scale benchmark study using prokaryotic rRNA sequence datasets. The study showed that the new method performs much more accurately than its data-independent counterparts, is robust against the presence of deletions and insertions, and can handle sequences of varying lengths. The taxonomy assignment experiment further demonstrated the potential utility of our method for practical applications. Compared to data-independent methods, a major drawback of our approach is that it requires training. We should emphasize that all supervised learning-based approaches share this drawback. For example, the speech-recognition systems used in smartphones require training, which may take weeks; however, the installed system is a trained model, which can perform speech recognition in a matter of milliseconds. To address the aforementioned drawback, we have published five trained neural-network models for some commonly used prokaryotic marker genes (16S rRNA gene, 23S rRNA gene, 16S V3 region, 16S V3-V4 region and 23S V5 region) on an accompanying website that researchers can use directly to process their own datasets. We should point out that the models for the 16S rRNA and 23S rRNA genes were trained on sequences obtained from databases and may have better generalization capability than the other models, which were built using sequences obtained from specific studies. In future work, we will continue to refine the models by using sequence data from diverse studies. We also plan to build models for other marker sequences (e.g. other hyper-variable regions of the 16S rRNA gene, the 18S rRNA gene and internal transcribed spacer sequences of eukaryotes). We will explore the use of ASM for local sequence comparisons, and the aggregation of the ASM structure to form a multi-layer structure, which may further improve prediction accuracy. It will also be of interest to perform in-depth analyses to investigate what type of information in nucleotide sequences has been encoded by the patterns in a neural-network model.

Fig. 1. Overview of the proposed method, which combines a Siamese neural network and approximate string matching for alignment-free sequence comparison, and its training process

Fig. 4. Comparison of taxonomy prediction accuracy and CPU time of seven methods

Table 1 .
Summary of the six sequence datasets used in the study. The Qiita dataset was generated from 66 skin, saliva and fecal samples collected from the Amerindian Yanomami people (Jose et al., 2015). The Labonte lake dataset was generated from water samples collected from a eutrophic lake (Steven et al., 2012) and contains 41 647 sequences with a length ranging from 385 to 410 bps, covering the V5 region of the 23S rRNA gene. The Greengenes dataset was extracted from the Greengenes database (McDonald et al., 2012) and contains 925 718 unique full-length 16S rRNA gene sequences with a length ranging from 1325 to 1500 bps. The Silva-23S dataset was downloaded from the Silva database (Pruesse et al., 2007) and contains 39 176 full-length 23S rRNA gene sequences with a length ranging from 2750 to 3150 bps. The Zymo dataset was generated from a Zymo mock community sequenced by PacBio circular consensus sequencing technology (Callahan et al., 2019).

Table 2 .
CPU time (seconds) and MRE (%) of eight methods tested on Qiita and RT988 datasets