Seedability: optimizing alignment parameters for sensitive sequence comparison

Abstract
Motivation: Most sequence alignment techniques make use of exact k-mer hits, called seeds, as anchors to optimize alignment speed. A large number of bioinformatics tools employing seed-based alignment techniques, such as Minimap2, use a single value of k per sequencing technology, without a strong guarantee that this is the best possible value. Given the ubiquity of sequence alignment, identifying values of k that lead to more sensitive alignments is thus an important task. To aid this, we present Seedability, a seed-based alignment framework designed for estimating an optimal seed k-mer length (as well as a minimal number of shared seeds) based on a given alignment identity threshold. In particular, we were motivated to make Minimap2 more sensitive in the pairwise alignment of short sequences.
Results: The experimental results herein show improved alignments of short and divergent sequences when using the parameter values determined by Seedability in comparison to the default values of Minimap2. We also show several cases of pairs of real divergent sequences, where the default parameter values of Minimap2 yield no output alignments, but the values output by Seedability produce plausible alignments.
Availability and implementation: https://github.com/lorrainea/Seedability (distributed under GPL v3.0).


Introduction
Comparing genomic sequences is essential for genome-wide analyses, such as phylogenetic inference, genome annotation, and function prediction (Dewey 2012). Similarly, aligning protein sequences is required for template-based protein structure prediction and function annotation (Yan et al. 2013). Traditional techniques for global sequence alignments (Needleman and Wunsch 1970, Gotoh 1982), where entire sequences are to be compared, commonly use dynamic programming, which can be inefficient for very long sequences. This can also be particularly time-consuming when aligning a query sequence to a database of reference sequences, e.g. RefSeq (O'Leary et al. 2016).
Seed-based alignment techniques have become increasingly popular because, in comparison to the traditional dynamic-programming-based methods, they have moderate resource requirements while maintaining a high alignment accuracy. Many seed-based techniques make use of k-mers (Luczak et al. 2019, Alser et al. 2021), which are short substrings of fixed length k. In a nutshell, when a reference k-mer is found within a query sequence, the match is referred to as a hit or seed. The well-known BLAST software (Altschul et al. 1990) uses k-mer seeds, which are then chained and extended to produce alignment(s) between target and query sequences. Spaced-seeds (binary patterns of symbols 0 and 1, denoting a match and a wildcard, respectively) have also been used extensively to improve alignment results via higher sensitivity when compared to traditional seed-based techniques (Ma et al. 2002). For instance, Khiste and Ilie (2017) employ the notion of spaced-seeds for assembling PacBio data, showing higher alignment detection sensitivity in comparison to pre-existing tools. Other less common sequence comparison techniques include employing the notion of longest common substring (Leimeister and Morgenstern 2014) or common absent words (Charalampopoulos et al. 2018), or even employing Fourier transformations (Yin and Yau 2015).
Roberts et al. (2004) introduced the idea of sampling seeds using minimizers, where only a small fraction of seeds need to be stored during computations. Minimizers can speed up string-matching computations by a large factor while missing only a small fraction of the matches found using all seeds. Intuitively, this is because, when a target sequence exactly matches a query sequence, the same minimizers are sampled from both sequences. Minimap2 (Li 2018) is a versatile sequence alignment program that uses minimizers as seeds to compute alignments of DNA (or mRNA) sequences against a large set of reference sequences. Typical use cases of Minimap2 include, among others, aligning sequence reads to a reference genome or constructing whole-genome alignments between two closely-related species (for instance, with divergence below ~15%).
Minimap2 uses a default value for the seed length k, which can also be specified by the user on input. It should be clear that varying the length of seeds has an impact on the efficiency and the output alignments of Minimap2. For instance, setting a small value for k may increase alignment accuracy as it allows more seeds to be identified. However, this comes at the cost of increasing running time due to the increased number of identified seeds that require further processing. On the other hand, setting a large value for the seed length k reduces the running time but may result in poorer alignments and even in alignments that are entirely missed. We hypothesize that the performance of any seed-based alignment algorithm can be impacted by tuning the k value appropriately. Yet, it is unclear how a user may make an educated guess about setting k. Therefore, there is a need for an automated method for identifying appropriate values for k.
While optimizing the values for parameter k has been studied for genome assembly (Chikhi and Medvedev 2013), optimizing the seed length k appears to have only been studied for variants of BLAST (Gotea et al. 2003, Shiryev et al. 2007). To the best of our knowledge, recent methods for sequence comparison, in particular Minimap2, have not received the same treatment. To address this, we present Seedability, a framework designed for computing an optimal k-mer length as well as an accurate number of shared seeds for a given set of sequences.
In the following, we introduce a theoretical alignment framework and formulate the Seedability problem. The problem consists in finding optimal parameter values for an idealized version of seed-based alignment. The precise computational task is, given an alignment identity threshold, to estimate an optimal seed k-mer length as well as a minimal number t of shared seeds for aligning pairs of sequences in a given collection. One can then combine these parameter values to infer optimal parameter values in different alignment tools based on their underlying alignment mechanism. In particular, we demonstrate that the parameter values found by Seedability can be directly used to tune the alignment parameters for increasing the sensitivity of Minimap2 when aligning pairs of short sequences. We show, among other results, that in this new regime, Minimap2 becomes capable of aligning sequences of lengths 200, 300, 500, or 1000 base pairs (bp) with a divergence of 25% with an average alignment success rate improvement of 0.57, 0.65, 0.68, and 0.12 points, respectively, compared to when using its default values with preset option sr.
The paper is organized as follows. In Section 2, we provide the necessary definitions and notation. In Section 3, we present the Seedability framework. In Section 4, we present our results. We conclude in Section 5.

Definitions and notation
A string (or sequence) $x$ of length $|x| = m$ is an array $x[0 \ldots m-1]$, where every $x[i]$, $0 \le i < m$, is a letter drawn from some fixed alphabet $\Sigma$. An empty string is the string of length 0, and it is denoted by $\varepsilon$. A string $x$ is a substring (or fragment) of string $y$ if there exist two strings $u$ and $v$, such that $y = uxv$. When $x$ is a substring of $y$, we say that $x$ occurs in $y$. Each occurrence of $x$ can be specified by a position in $y$. We say that $x$ occurs at (the starting) position $i$ in $y$ when $y[i \ldots i+m-1] = x$. A k-mer, for any integer $k > 0$, is a string from $\Sigma^k$. For any two strings $x$ and $y$ and an integer $k > 0$, we define a seed (or hit) of $x$ and $y$ as a pair $(i, j)$ such that $x[i \ldots i+k-1]$ and $y[j \ldots j+k-1]$ are the same k-mer.
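As an illustration of the seed definition, the following minimal Python sketch enumerates every seed $(i, j)$ of two strings by indexing the k-mers of $y$. The function name `all_seeds` and the example strings are illustrative assumptions and are not taken from the Seedability implementation.

```python
from collections import defaultdict

def all_seeds(x, y, k):
    """Return every pair (i, j) such that x[i:i+k] == y[j:j+k].

    Hypothetical helper: pairs are listed by increasing i, then j.
    """
    index = defaultdict(list)              # k-mer -> starting positions in y
    for j in range(len(y) - k + 1):
        index[y[j:j + k]].append(j)
    seeds = []
    for i in range(len(x) - k + 1):
        for j in index.get(x[i:i + k], []):
            seeds.append((i, j))
    return seeds

# Two length-9 strings and k = 3 (illustrative input).
print(all_seeds("ACGTAGTAG", "ACGAGTAGG", 3))
# prints [(0, 0), (2, 4), (3, 5), (4, 3), (5, 4), (6, 5)]
```

Note that this lists all seeds, including pairs that cannot appear together in a single left-to-right alignment; selecting a consistent subset is a separate step.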
Given two strings $x$ and $y$ and an integer $k > 0$, we say that $x$ and $y$ share $t$ seeds, for some integer $t \ge 0$, if and only if there exists a sequence $i_1, \ldots, i_t$ of $t$ positions on $x$ and a sequence $j_1, \ldots, j_t$ of $t$ positions on $y$, such that all of the following hold:
For example, given $x = \mathrm{ACGTAGTAG}$, $y = \mathrm{ACGAGTAGG}$, and $k = 3$, $x$ and $y$ share $t = 4$ seeds. This is because there exists a sequence $0, 4, 5, 6$ of positions on $x$, and a sequence $0, 3, 4, 5$ of positions on $y$, where each pair of corresponding positions, $(0, 0)$, $(4, 3)$, $(5, 4)$, and $(6, 5)$, is a seed.
Given a string $x$ of length $m$ and a string $y$ of length $n$, the Levenshtein distance (or edit distance) (Levenshtein 1965), denoted by $d_L(x, y)$, is the minimum total number of elementary edit operations required to transform $x$ into $y$. In particular, the elementary edit operations we consider are:
• insertion: insert a letter of $y$ in $x$ at a given position;
• deletion: delete a letter of $x$ at a given position;
• substitution: substitute a letter of $x$ at a given position by a letter of $y$.
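The edit distance defined above can be computed with the textbook dynamic-programming recurrence. The sketch below is a standard two-row implementation given for illustration only; the function name `levenshtein` is an assumption and the code is not part of Seedability.

```python
def levenshtein(x, y):
    """Edit distance d_L(x, y) with unit-cost insertions, deletions and
    substitutions, in O(mn) time and O(n) space."""
    m, n = len(x), len(y)
    prev = list(range(n + 1))              # distances from the empty prefix of x
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion of x[i-1]
                          curr[j - 1] + 1,     # insertion of y[j-1]
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[n]

print(levenshtein("ACGTAGTAG", "ACGAGTAGG"))  # prints 2
```

For the two example strings, one deletion (the T at position 3 of $x$) and one insertion (a trailing G) suffice, and a single operation cannot transform one into the other, so the distance is 2.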
For any two strings $x$, $y$, the distance $d_L(x, y)$ can be computed in $O(mn)$ time (Levenshtein 1965). An alignment between $x$ and $y$ is another string $z$ on the alphabet of pairs of letters, more accurately on $((\Sigma \cup \{\varepsilon\}) \times (\Sigma \cup \{\varepsilon\})) \setminus \{(\varepsilon, \varepsilon)\}$, whose projection on the first component is $x$ and the projection on the second component is $y$. An insertion in $z$ is represented by $(\varepsilon, a)$, $a \in \Sigma$; a deletion in $z$ is represented by $(a, \varepsilon)$, $a \in \Sigma$; and a substitution in $z$ is represented by $(a, b)$, $a, b \in \Sigma$ and $a \ne b$. The cost of an alignment $z$ is the total number of insertions, deletions and substitutions in $z$. In our model, an alignment $z$ is optimal if and only if its cost is precisely $d_L(x, y)$.
The alignment identity $e_{x,y}(z)$ of an alignment $z$ of $x$ and $y$ is defined as
$$e_{x,y}(z) = \frac{|x| - (\Sigma_{\mathrm{sub}} + \Sigma_{\mathrm{del}})}{|x| + \Sigma_{\mathrm{ins}}},$$
where $\Sigma_{\mathrm{sub}}$ is the total number of substitutions, $\Sigma_{\mathrm{del}}$ is the total number of deletions, and $\Sigma_{\mathrm{ins}}$ is the total number of insertions in $z$. In other words, the alignment identity is the number of matches in the alignment expressed as a fraction of the alignment length. Note that the alignment length is equal to $|x|$ plus the total number of insertions in $z$. The divergence $d_{x,y}(z)$ is the complementary notion and it is equal to $1 - e_{x,y}(z)$. When $z$ is an optimal alignment of $x$, $y$, we call $e_{x,y}(z)$ and $d_{x,y}(z)$ the optimal alignment identity and the optimal divergence, respectively.
Given a string $x$ and an integer $j > 0$, a minimizer of $x$ is a lexicographically smallest $j$-mer in $x$. Given a string $x$ and two integers $j > 0$ and $w > 0$, the set of $(j, w)$-minimizers of $x$ is the set of positions of minimizers of all length-$(w + j - 1)$ fragments of $x$. If more than one $(j, w)$-minimizer exists in one fragment, we can consistently sample one of them; e.g. we can always choose the leftmost one as the minimizer.
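The definition of $(j, w)$-minimizers can be sketched directly in Python. The code below follows the lexicographic definition given here, with the leftmost occurrence chosen on ties; the function name `minimizers` is an assumption, and practical tools may order $j$-mers differently (for instance via a hash), so this is a sketch of the definition rather than of any particular implementation.

```python
def minimizers(x, j, w):
    """Positions of the (j, w)-minimizers of x.

    Each fragment of length w + j - 1 contains w consecutive j-mers; the
    lexicographically smallest one is sampled, leftmost on ties.
    """
    positions = set()
    for start in range(len(x) - (w + j - 1) + 1):
        # (j-mer, position) tuples compare by j-mer first, then by position,
        # so min() implements the leftmost tie-breaking rule.
        window = [(x[p:p + j], p) for p in range(start, start + w)]
        positions.add(min(window)[1])
    return sorted(positions)

print(minimizers("ACGTACGT", 2, 3))  # prints [0, 1, 4]
```

In this small example only three of the seven 2-mer positions are sampled, which illustrates why minimizers reduce the number of seeds that need to be stored.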

Methods
We start by formally defining the computational problem considered here. Let $S$ be a set of input sequences. For presentation purposes, we will assume that all sequences in $S$ have the same length. In practice, our algorithms will work on sequences that have different but similar lengths. We relate the proposed framework to the classic read-to-reference alignment framework (e.g. of Minimap2), where a set of input reads is to be aligned against several candidate positions of the reference. In such a scenario, one may convert a set $S_d$ of input sequences, where sequences have different lengths, to another set $S$, where all sequences have the same length. For instance, one can create $S$ such that it consists of all the length-$W$ substrings of the sequences of $S_d$, where $W$ is a chosen window length smaller than or equal to the length of the shortest sequence in $S_d$. Thus, in the rest of this section, we will assume that all sequences have the same length.
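The conversion from a set of sequences of different lengths to a set of equal-length sequences can be sketched as follows. The function name `fixed_length_windows` and the choice of defaulting $W$ to the shortest sequence length are assumptions made for illustration.

```python
def fixed_length_windows(sequences, W=None):
    """Return the set S of all length-W substrings of the input sequences.

    If W is not given, the length of the shortest input sequence is used
    (an illustrative default). The result is sorted for reproducibility.
    """
    shortest = min(len(s) for s in sequences)
    if W is None:
        W = shortest
    if not 0 < W <= shortest:
        raise ValueError("W must be positive and at most the shortest length")
    windows = {s[i:i + W] for s in sequences for i in range(len(s) - W + 1)}
    return sorted(windows)

print(fixed_length_windows(["ACGT", "ACG"]))  # prints ['ACG', 'CGT']
```

Because the result is a set, identical windows coming from different sequences are kept only once.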
Given $S$, we define the set $O_{\mathrm{truth},e}$ as the set of all pairs of sequences $(s_1, s_2)$, $s_1, s_2 \in S$, such that $s_1$ and $s_2$ have optimal alignment identity greater than or equal to $e$. We now formally define the problem in scope:
Problem 1 (Seedability). Given $S$ and an alignment identity threshold $e$, compute a set $O_{\mathrm{seed}}$ of pairs of sequences from $S$ and one pair $(t, k)$ of values, for every pair of sequences in $O_{\mathrm{seed}}$, such that the symmetric difference of $O_{\mathrm{seed}}$ and $O_{\mathrm{truth},e}$ is minimized.
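The objective of Problem 1 can be made concrete with a short sketch that measures the symmetric difference between a predicted set of pairs and the true set. The helper names and the sequence identifiers below are hypothetical; pairs are treated as unordered, which is an assumption of this sketch.

```python
def pair(a, b):
    """Canonical representation of an unordered pair of sequence names."""
    return tuple(sorted((a, b)))

def symmetric_difference_size(o_seed, o_truth):
    """Number of pairs that are in exactly one of the two sets."""
    false_positives = o_seed - o_truth     # reported but below the threshold
    false_negatives = o_truth - o_seed     # above the threshold but missed
    return len(false_positives) + len(false_negatives)

o_truth = {pair("s1", "s2"), pair("s1", "s3")}
o_seed = {pair("s2", "s1"), pair("s2", "s3")}
print(symmetric_difference_size(o_seed, o_truth))  # prints 2
```

The quantity is zero exactly when the two sets coincide, so minimizing it amounts to minimizing the total number of false positives and false negatives.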
By estimating a k-mer length for every pair of sequences in a given collection for a given alignment identity threshold, we can aggregate these $k$ values to infer an optimal $(j, w)$ value for Minimap2. This is precisely the main application of our alignment framework in this article.

The Seedability algorithm
We propose the following algorithm, which we call Seedability, as a heuristic approach to address Problem 1.
Let $S$ be a set $\{s_1, \ldots, s_r\}$ of $r$ sequences. The Seedability algorithm is a two-stage approach that is carried out for all $\min_k \le k \le \max_k$, where $[\min_k, \max_k]$ is defined by the user. The default value for $\min_k$ is 3. The default value for $\max_k$ is 15, which is the default value in Minimap2 for the length $j$ of minimizers. The two stages are:
1) Estimating $e_{s_i,s_j}$, for all $i \ne j \in [1, r]$, $s_i, s_j \in S$;
2) Constructing the set $O_{\mathrm{seed}}$.
The main idea of our algorithm is to use k-mers to identify seeds shared between the given pair of sequences. We then traverse through the seeds to estimate the numbers of insertions, deletions, and substitutions ($\Sigma_{\mathrm{ins}}$, $\Sigma_{\mathrm{del}}$, and $\Sigma_{\mathrm{sub}}$), thus estimating the alignment identity $e_{s_i,s_j}$. Finally, for every pair of sequences $(s_i, s_j)$, we want to output one $(t, k)$ value for which the corresponding $e_{s_i,s_j}$ is greater than or equal to the alignment identity threshold $e$. If such a $(t, k)$ value exists, then $(s_i, s_j)$ is added to $O_{\mathrm{seed}}$.

Estimating $e_{s_i,s_j}$
We next present the techniques which we employ to estimate $t$, the number of seeds shared by $s_i$ and $s_j$, which allows us to then estimate $e_{s_i,s_j}$. Given a seed $(p_i, p_j)$ on the pair of sequences $(s_i, s_j)$, if $s_i[p_i+1 \ldots p_i+k] = s_j[p_j+1 \ldots p_j+k]$, then $(p_i+1, p_j+1)$ is the next chosen seed. This can be easily checked in constant time. If $s_i[p_i+1 \ldots p_i+k] \ne s_j[p_j+1 \ldots p_j+k]$, then the following steps are carried out:
1) Let $s_i$ and $s_j$ be two sequences, where $s_{i_a}$ is an occurrence of k-mer $a$ in $s_i$ and $s_{j_a}$ an occurrence of the same k-mer in $s_j$. Let us assume that the pair $(s_{i_a}, s_{j_a})$ is a previously selected seed. (Note that we can always start with a dummy seed $(-1, -1)$.) Then let $s_{i_b}$ be the smallest occurrence of some k-mer $b$ in $s_i$ such that there exists an occurrence $s_{j_b}$ of the same k-mer in $s_j$ with $s_{i_b} > s_{i_a}$ and $s_{j_b} > s_{j_a}$. We find the occurrence $s_{j_s}$ of k-mer $s = b$ in $s_j$ (see Fig. 1) such that $|(s_{i_b} - s_{i_a}) - (s_{j_s} - s_{j_a})|$ is minimized and $|(s_{i_b} - s_{i_a}) - (s_{j_s} - s_{j_a})| \le k$. If this holds, for some $s_{j_s}$, then the occurrences $s_{i_b}$ and $s_{j_s}$ form a candidate seed. The inequality ensures that the pair of $b$ occurrences to be selected as a candidate seed are at a similar distance from the corresponding $a$ occurrences. For every $s_{i_b}$ (we have $O(|s_i|)$ of them), this check can be implemented in $O(k)$ time due to the condition $|(s_{i_b} - s_{i_a}) - (s_{j_s} - s_{j_a})| \le k$. This is because $s_{i_b}$, $s_{i_a}$, $s_{j_a}$ are fixed and the only unknown is $s_{j_s}$.
2) Again, let us assume that the pair $(s_{i_a}, s_{j_a})$ is a previously selected seed. Let $s_{j_c}$ be the smallest occurrence of some k-mer $c$ in $s_j$ such that there exists an occurrence $s_{i_c}$ of the same k-mer in $s_i$ with $s_{j_c} > s_{j_a}$ and $s_{i_c} > s_{i_a}$. We find the occurrence $s_{i_q}$ of k-mer $q = c$ in $s_i$ (see Fig. 2) such that $|(s_{j_c} - s_{j_a}) - (s_{i_q} - s_{i_a})|$ is minimized and $|(s_{j_c} - s_{j_a}) - (s_{i_q} - s_{i_a})| \le k$. If this holds, for some $s_{i_q}$, then the occurrences $s_{j_c}$ and $s_{i_q}$ form a candidate seed. This is precisely the symmetric computation of the first step. For every $s_{j_c}$ (we have $O(|s_j|)$ of them), this check can be implemented in $O(k)$ time due to the condition $|(s_{j_c} - s_{j_a}) - (s_{i_q} - s_{i_a})| \le k$.
3) The two candidate seeds are now compared to select one of them. Let the first one be $(s_{i_b}, s_{j_s})$ and the second one be $(s_{i_q}, s_{j_c})$. If $|(s_{i_b} - s_{i_a}) - (s_{j_s} - s_{j_a})| \le |(s_{i_q} - s_{i_a}) - (s_{j_c} - s_{j_a})|$, then $(s_{i_b}, s_{j_s})$ forms the next seed; otherwise, $(s_{i_q}, s_{j_c})$ forms the next seed. We proceed to the computation of the next shared seed (by memorizing the one we have just computed as the new previously selected seed) until no other seed can be selected.
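The seed selection steps above can be sketched in Python as follows. This is a simplified reading and not the authors' C++ implementation: the function names are hypothetical, k-mers are compared naively rather than in constant time, and we assume that each step scans positions from left to right and returns the first k-mer that has an occurrence in the other sequence within the distance bound.

```python
def next_seed(si, sj, prev, k):
    """Return the seed selected after `prev`, or None if none can be selected."""
    pi, pj = prev
    # Extension: the k-mers one position to the right of the seed still match.
    if (pi + 1 + k <= len(si) and pj + 1 + k <= len(sj)
            and si[pi + 1:pi + 1 + k] == sj[pj + 1:pj + 1 + k]):
        return (pi + 1, pj + 1)

    def candidate(a, b, pa, pb):
        # Scan a after pa; for each k-mer, search b after pb for the occurrence
        # whose distance to the previous seed differs the least (by at most k).
        for p in range(pa + 1, len(a) - k + 1):
            centre = pb + (p - pa)
            best = None
            for q in range(max(pb + 1, centre - k), min(len(b) - k, centre + k) + 1):
                if a[p:p + k] == b[q:q + k]:
                    diff = abs((p - pa) - (q - pb))
                    if best is None or diff < best[0]:
                        best = (diff, p, q)
            if best is not None:
                return best
        return None

    c1 = candidate(si, sj, pi, pj)   # Step 1: (diff, position in si, position in sj)
    c2 = candidate(sj, si, pj, pi)   # Step 2: (diff, position in sj, position in si)
    if c1 is None and c2 is None:
        return None
    if c2 is None or (c1 is not None and c1[0] <= c2[0]):   # Step 3
        return (c1[1], c1[2])
    return (c2[2], c2[1])

def chain_seeds(si, sj, k):
    """Select seeds one after another, starting from the dummy seed (-1, -1)."""
    seeds, prev = [], (-1, -1)
    while (prev := next_seed(si, sj, prev, k)) is not None:
        seeds.append(prev)
    return seeds

print(chain_seeds("GCGTGATTCG", "GCGGATTGAG", 3))  # prints [(0, 0), (4, 3), (5, 4)]
```

Both coordinates strictly increase from one selected seed to the next, so the loop terminates after at most $\min(|s_i|, |s_j|)$ iterations.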
The computation of shared seeds, between every pair of sequences in $S$, using the steps described above, allows us to estimate the alignment identity as follows. Let $(s_{i_a}, s_{j_a})$ and $(s_{i_b}, s_{j_b})$ be two consecutive encountered seeds. When a gap of size at most $k$ is encountered between the two seeds (i.e. $|(s_{i_b} - s_{i_a}) - (s_{j_b} - s_{j_a})| \le k$), the number of letters within the gaps in $s_i$ and $s_j$ are added onto the substitution, deletion, or insertion counts $\Sigma_{\mathrm{sub}}$, $\Sigma_{\mathrm{del}}$, or $\Sigma_{\mathrm{ins}}$. Specifically, if the size of the gap is the same in $s_i$ and $s_j$, then $\Sigma_{\mathrm{sub}}$ is incremented by the size of the gap. If the size of the gap in $s_i$ is larger than that in $s_j$, then $\Sigma_{\mathrm{sub}}$ is incremented by the size of the gap in $s_j$ and $\Sigma_{\mathrm{del}}$ is incremented by the difference in size of the gaps. If the size of the gap in $s_j$ is larger than that in $s_i$, then $\Sigma_{\mathrm{sub}}$ is incremented by the size of the gap in $s_i$ and $\Sigma_{\mathrm{ins}}$ is incremented by the difference in size of the gaps. The computation of $\Sigma_{\mathrm{sub}}$, $\Sigma_{\mathrm{del}}$, and $\Sigma_{\mathrm{ins}}$ results in the estimation of $e_{s_i,s_j}$ for the $(t, k)$ values considered. The total number of shared seeds is $O(|s_i| + |s_j|)$ and so this computation takes $O(|s_i| + |s_j|)$ time. Overall, the whole computation takes $O(k(|s_i| + |s_j|))$ time for any pair $s_i, s_j \in S$ of sequences and any $k \in [\min_k, \max_k]$.
For example, let $s_i = \mathrm{GCGTGATTCG}$, $s_j = \mathrm{GCGGATTGAG}$, and $k = 3$. Clearly, the first computed seed is $(s_{i_a}, s_{j_a}) = (0, 0)$, representing $a = \mathrm{GCG}$. Then, for the second seed, we have two candidates: the first candidate seed is $(s_{i_b}, s_{j_s}) = (3, 6)$, representing $b = \mathrm{TGA}$ (Step 1); the second candidate seed is $(s_{i_q}, s_{j_c}) = (4, 3)$, representing $c = \mathrm{GAT}$ (Step 2). The chosen seed is $(s_{i_q}, s_{j_c}) = (4, 3)$, computed in Step 2.
Alignments $z_1$ and $z_2$ correspond to the final alignments if the first candidate seed was chosen ($z_1$) in comparison to if the second candidate seed was chosen ($z_2$). If the first seed is chosen, there are no further seeds identified in $s_i$ and $s_j$, and the alignment identity is $6/13$. If, however, the second seed is chosen, there is one further seed identified in $s_i$ and $s_j$, and the alignment identity is $7/11 > 6/13$. In fact, this ($z_2$) is what our algorithm chooses.
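The two identities in this example can be checked with a small sketch of the gap-counting rule. The function `estimate_identity` is hypothetical; in particular, treating the regions before the first seed and after the last seed like gaps between seeds, and ignoring overlaps between consecutive seeds, are assumptions of this sketch that happen to reproduce the values $6/13$ and $7/11$ quoted above.

```python
from fractions import Fraction

def estimate_identity(seeds, len_i, len_j, k):
    """Estimate the alignment identity from a chain of seeds (p, q)."""
    sub = dele = ins = 0
    end_i = end_j = 0                  # first position not yet covered by a seed
    # A sentinel at (len_i, len_j) closes the region after the last seed.
    for p, q in list(seeds) + [(len_i, len_j)]:
        gap_i = max(0, p - end_i)      # uncovered letters in s_i
        gap_j = max(0, q - end_j)      # uncovered letters in s_j
        sub += min(gap_i, gap_j)
        if gap_i > gap_j:
            dele += gap_i - gap_j
        else:
            ins += gap_j - gap_i
        end_i = max(end_i, p + k)
        end_j = max(end_j, q + k)
    return Fraction(len_i - (sub + dele), len_i + ins)

# Chain obtained when the Step 1 candidate (3, 6) is taken.
print(estimate_identity([(0, 0), (3, 6)], 10, 10, 3))          # prints 6/13
# Chain obtained when the Step 2 candidate (4, 3) is taken.
print(estimate_identity([(0, 0), (4, 3), (5, 4)], 10, 10, 3))  # prints 7/11
```

Exact fractions are used so that the two estimates can be compared without rounding.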

Constructing the set $O_{\mathrm{seed}}$
Recall that we aim at minimizing the symmetric difference between $O_{\mathrm{seed}}$ and $O_{\mathrm{truth},e}$. Since for every pair $(s_i, s_j)$, $s_i, s_j \in S$, we have computed the quantities $e_{s_i,s_j}$ and $(t, k)$, we output a pair $(t, k)$ of values for every pair $(s_i, s_j)$ such that $e_{s_i,s_j} \ge e$, thus constructing $O_{\mathrm{seed}}$.
As there could be many values of $k$ satisfying $e_{s_i,s_j} \ge e$, we would like to choose among them a relatively large value (see Section 1). Let $e_{\mathrm{best}}$ be the highest alignment identity estimated over all considered $k$ values for $s_i, s_j \in S$, and $k_{\mathrm{best}}$ be the $k$ value corresponding to $e_{\mathrm{best}}$. Further let $d$ be an optional input threshold parameter (with its default value set to 0.05). Then, we choose the maximum $k$ value, which we denote by $k_d$, such that $e_{\mathrm{best}} - e_{k_d} \le d$, where $e_{k_d}$ is the alignment identity computed for $k = k_d$. We do that by iterating $k$ over $[k_{\mathrm{best}}, k_{\max}]$, where $k_{\max}$ is the user-defined upper bound on $k$. The default value for $d$ and its usefulness is justified in the experiments.
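The selection of $k_d$ can be sketched as follows, assuming the estimated identity for each considered $k$ is available in a dictionary. The function name `choose_k` and the identity values in the example are hypothetical and serve only to illustrate the rule.

```python
def choose_k(identity_by_k, d=0.05):
    """Return (k_best, k_d) given a mapping k -> estimated alignment identity.

    k_best attains the highest identity e_best; k_d is the largest k >= k_best
    whose identity is within d of e_best.
    """
    e_best = max(identity_by_k.values())
    k_best = min(k for k, e in identity_by_k.items() if e == e_best)
    k_d = max(k for k, e in identity_by_k.items()
              if k >= k_best and e_best - e <= d)
    return k_best, k_d

# Hypothetical identity estimates for k = 3, ..., 8.
estimates = {3: 0.80, 4: 0.88, 5: 0.90, 6: 0.87, 7: 0.86, 8: 0.70}
print(choose_k(estimates))         # prints (5, 7)
print(choose_k(estimates, d=0.0))  # prints (5, 5)
```

With $d = 0$ the rule falls back to the $k$ value that maximizes the estimated identity, whereas a positive $d$ trades a small loss in estimated identity for a longer seed.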
In the next section, we show how the output of Seedability can be directly used to tune the alignment parameters of Minimap2.

Results
Seedability was implemented using the C++ programming language, taking as input a set of sequences in multiFASTA format and an optional reference sequence in FASTA format. Seedability outputs optimal values for $(t, k)$ either for the estimated alignment of all pairwise sequences or for the estimated alignment of the reference sequence and every sequence.
The source code is distributed under the GNU General Public License (GPL v3.0) at https://github.com/lorrainea/Seedability. We have conducted experiments on a computer using an Intel Core i5-8265U CPU, running at 1.60 GHz, equipped with 8 GB of RAM, under GNU/Linux. Seedability was compiled with g++ version 9.3.0.
Minimap2 (Li 2018) is a widely-used bioinformatics tool for aligning DNA or mRNA sequences to a large reference database. To evaluate the accuracy of Seedability, we applied the output values of Seedability to Minimap2 to check how alignment scores were impacted. This was carried out on both synthetic and real data.
Minimap2 has a wide range of preset options, which include default values for $j$ (the minimizer's length) and $w$ (the number of consecutive $j$-mers considered for sampling). These preset options include:
1) map-ont: align noisy long reads of ~10% error rate to a reference sequence (default); $(j = 15, w = 10)$.
2) sr: short single-end reads without splicing; $(j = 21, w = 11)$.
The average k-mer length output by Seedability, denoted by $\lceil k_{\mathrm{avg}} \rceil$, was used to determine the $(j, w)$ values for Minimap2. We set $j = \lceil k_{\mathrm{avg}} \rceil$ and $w = \lceil \frac{2}{3} j \rceil$. The value for $w$ was determined using Minimap2's default value of $w = \frac{2}{3} j$. Table 1 shows the determined $(j, w)$ values. Figure 3a shows the average alignment identities (i.e. the total alignment identity score divided by the total number of pairs) output for the 100 pairs of sequences when using the default $(j, w)$ values in comparison to the $(j, w)$ values determined by Seedability. For the preset options we used: (i) the default preset option map-ont, if the average sequence length is greater than or equal to 1000; or (ii) the preset option sr, if the average sequence length is <1000. The parameter values produced by Seedability allow Minimap2 to maintain high alignment identities for longer sequences but also vastly improve the alignment identities for shorter sequences. Note that some of these alignments were unmapped, with Figure 3b showing the number of mapped alignments identified. Further, note that the alignment identities computed by Minimap2 were higher than the expected identities due to the mapping quality of Minimap2. The alignment identities are computed as a fraction of the number of matching bases over the total number of bases, including gaps as defined by Minimap2. Figure 4 shows the number of alignments produced out of the 100 pairs of sequences. In this case, a pair of sequences are said to be aligned if the alignment length is at least 90% of the
As previously mentioned, Seedability also estimates the number $t$ of shared seeds, for an output $k$ value, that can be found within an aligned pair of sequences. Table 2 shows the number of pairs of sequences out of 100 where $t$ is within ±10% of the number of seeds in an optimal alignment. We set $d = 0$ to evaluate how well Seedability can perform this task.
To evaluate the symmetric difference between $O_{\mathrm{seed}}$ and $O_{\mathrm{truth},e}$, we counted the number of alignments computed by Seedability which have an estimated alignment identity $e_{\mathrm{best}} \ge e$. We created two datasets both containing 200 pairs of sequences: the first one consisted of sequences with an average length of 200; and the second one with an average length of 500. Both datasets contained 100 pairs of sequences with a divergence of 0.10 and 100 pairs of sequences with a divergence of 0.20. Table 3 shows the number of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) identified within each dataset. When $e = 0.85$ and the average length of the sequences is 200, Seedability was able to identify 98 out of 100 pairs of sequences such that $e_{\mathrm{best}} \ge e$. For the same $e$ and when the average length of the sequences is 500, Seedability was able to identify 100 out of 100 pairs of sequences such that $e_{\mathrm{best}} \ge e$. In this second case, $O_{\mathrm{seed}} = O_{\mathrm{truth},e}$. The results show that, although Seedability underestimates alignment identity by a little (which is expected as it is not meant to compute optimal alignments), it minimizes the symmetric difference of $O_{\mathrm{seed}}$ and $O_{\mathrm{truth},e}$ by computing appropriate $(t, k)$ values. Furthermore, these results justify the existence of $d$ and its default value.
Figure 5a shows the average time required by Seedability to compute $(t, k)$. Figure 5b shows the average time required by Minimap2 to compute an alignment when using its default values and Fig. 5c shows likewise when using the values determined by Seedability. Figure 6 shows the same but for the average peak memory. Recall from Section 3.2 that the whole computation of Seedability takes $O(k(|s_i| + |s_j|))$ time for any pair $s_i, s_j \in S$ of sequences and any $k \in [\min_k, \max_k]$. It is also clear from the results (Figs 5a and 6a) that Seedability requires linear time and space for the values of $k$ used in practice. For divergence <0.25, Minimap2 takes a similar time when using the values determined by Seedability in comparison to when using its default values. Notably, for divergence 0.25, Minimap2 is faster when using the values determined by Seedability. When using the values determined by Seedability, Minimap2 uses similar peak memory as when its default parameter values are used. In these experiments, we have used the preset option map-ont. The Supplementary Material shows similar results for the other preset options.

Real data
To further highlight the usefulness of Seedability, we have considered real data. We looked at a Chimpanzee gene, in particular, gene ENSPTRG00000044036 (125 bp in length) as well as the orthologues of this gene with the following species: the Algerian Mouse (Mus spretus) (126 bp in length); the Northern American deer mouse (Peromyscus maniculatus bairdii) (125 bp in length); and the Shrew mouse (Mus pahari) (114 bp in length). These sequences have an optimal alignment identity of 0.744, 0.752, and 0.729, respectively. The orthologue identities were retrieved from the Ensembl genome browser (Howe et al. 2020). We used the preset option sr and default $(j, w)$ values of Minimap2 to align the sequences. There were no output alignments for the three pairs of sequences. Seedability was then used to identify optimal $(j, w)$ parameter values for Minimap2. The computed output values were (6, 4), (5, 4), and (5, 4), respectively, for the listed orthologues. The sequence pairs were re-aligned using the $(j, w)$ values determined by Seedability and the resulting alignment identities computed were 0.857, 0.845, and 0.895, respectively. Note that the alignment identities computed by Minimap2 were higher than the original identities due to the mapping quality of Minimap2.
We also carried out similar experiments for the RAB15EP gene in human chromosome 12 (ENSG00000174236) (708 bp in length) with orthologues of this gene with the following species: the Abingdon island giant tortoise (Chelonoidis abingdonii) (699 bp in length); the Argentine black and white tegu (Salvator merianae) (714 bp in length); and the Common wombat (Vombatus ursinus) (708 bp in length). These sequences have an optimal alignment identity of 0.619, 0.567, and 0.538, respectively. There were again no output alignments by Minimap2 for the three pairs of sequences. Seedability was then used to identify optimal $(j, w)$ values for Minimap2, which were computed to be (4, 3), (3, 2), and (4, 3), respectively, for the listed orthologues. The sequence pairs were re-aligned using the $(j, w)$ values determined by Seedability and the resulting alignment identities were 0.686, 0.635, and 0.734, respectively. Figure 7 shows a visual representation of the results for the alignment of ENSG00000174236 and ENSVURP00010006563_Vurs1 (Vombatus ursinus).
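The $(j, w)$ pairs reported for the orthologues are consistent with the rule used in the synthetic experiments, namely $j$ equal to the average k-mer length rounded up and $w$ equal to two thirds of $j$ rounded up. The sketch below performs this arithmetic and assembles a Minimap2 command line using the standard `-x` (preset), `-k` (minimizer length), and `-w` (window size) options. The function names and the file names `target.fa` and `query.fa` are hypothetical, and the exact command lines used in the experiments are not stated in the text, so this is an assumption about how the values could be passed to Minimap2.

```python
import math

def minimap2_params(k_avg):
    """Derive (j, w) from an average k-mer length: j = ceil(k_avg), w = ceil(2j/3)."""
    j = math.ceil(k_avg)
    w = (2 * j + 2) // 3               # integer form of ceil(2 * j / 3)
    return j, w

def minimap2_command(k_avg, preset="sr", target="target.fa", query="query.fa"):
    """Build an argument list for Minimap2 (file names are placeholders)."""
    j, w = minimap2_params(k_avg)
    # The preset is given first so that -k and -w override its defaults.
    return ["minimap2", "-x", preset, "-k", str(j), "-w", str(w), target, query]

for k_avg in (6, 5, 4, 3):
    print(minimap2_params(k_avg))
# prints (6, 4), (5, 4), (4, 3), (3, 2) on separate lines

print(" ".join(minimap2_command(5.4)))
# prints minimap2 -x sr -k 6 -w 4 target.fa query.fa
```

The argument list can be passed to `subprocess.run` on a system where Minimap2 is installed.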
The results produced in Table 1 can be used directly to improve the alignment identities of short sequences when mapping to predetermined candidate positions on a reference genome. We tested the $(j, w)$ values presented in Table 1 on simulated reads from Chromosome 1 of the human genome (version GRCh38.p14). We used PBsim (Ono et al. 2020), a sequence simulator, to generate four datasets using GRCh38.p14: one dataset with an average length of 200 and divergence 0.10; one dataset with an average length of 200 and divergence 0.15; one dataset with an average length of 500 and divergence 0.10; and one dataset with an average length of 500 and divergence 0.15. Pairs of sequences were created by taking each simulated read and its original sequence interval in the genome. All datasets contained 100 pairs of sequences. Figure 8a shows the number of sequences out of 100 that were mapped when using the default $(j, w)$ values for Minimap2. Figure 8b shows the number of sequences out of 100 that were aligned when using the $(j, w)$ values determined by Seedability in Table 1. Note that for all experiments, the default preset map-ont was used. The parameter values determined by Seedability were able to produce mapped alignments for all sequences, unlike when using the default parameter values of Minimap2. In particular, when using a divergence of 0.15 and length 200, the default parameter values of Minimap2 resulted in only 32 mapped alignments, that is, 68 fewer than when using the parameter values determined by Seedability.
Table 4 shows the average time in ms required to map the 100 sequences to candidate positions in Chromosome 1 of the human genome when using Minimap2's default $(j, w)$ values in comparison to those determined by Seedability. The datasets are presented in the table in the form A.B, where A is the average length of the sequences and B is their divergence. The difference in time between the two runs is negligible. In fact, for the datasets with an average length of 500, Minimap2 performed faster when using the parameter values determined by Seedability in comparison to the default parameter values. Table 5 shows similar results for the peak memory required to compute the mappings. Again it is clear that for the datasets with an average length of 500, Minimap2 used, on average, less peak memory when using the parameter values determined by Seedability in comparison to the default parameter values.
fixed-length k-mers as seeds. NGMLR (Sedlazeck et al. 2018) is designed to sensitively align PacBio or Oxford Nanopore reads to large reference genomes for structural variant calling. In many practical scenarios, identifying optimal k values is challenging, and default k values provide suboptimal results.
In this article, we presented Seedability, an alignment framework designed for estimating an optimal value for k as well as a minimal number t of shared seeds based on a given alignment identity threshold. Our extensive results, using both synthetic and real datasets, demonstrate that the (t, k) values determined by Seedability lead to improved alignments compared to the original alignments produced by Minimap2 for sequences of varying lengths and divergence. Notably, the parameter values determined by Seedability lead to meaningful alignments in some cases where no output alignments were produced using the default parameter values of Minimap2.
For future work, we would be interested in extending Seedability to support BLEND (Firtina et al. 2023), which hashes seeds to identify similarities between sequences as well as extending Seedability to support mapquik (Ekim et al. 2023), a tool that makes use of longer seeds through matches of k consecutively sampled minimizers.

Supplementary data
Supplementary data are available at Bioinformatics Advances online.

Conflict of interest
None declared.

Data availability
The data underlying this article are available either in https://github.com/lorrainea/Seedability or in the Ensembl database at www.ensembl.org, and can be accessed using the gene names ENSPTRG00000044036 and ENSG00000174236, or in the NCBI database at www.ncbi.nlm.nih.gov, and can be found using the reference sequence NC_000001.11.
Ayad et al.