Minimally overlapping words for sequence similarity search

Abstract

Motivation: Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via ‘seeds’: simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence.

Results: Here, we study a simple sparse-seeding method: using seeds at positions of certain ‘words’ (e.g. ac, at, gc or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed ‘minimizer’ sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it.

Availability and implementation: Software to design and test minimally overlapping words is freely available at https://gitlab.com/mcfrith/noverlap.

Supplementary information: Supplementary data are available at Bioinformatics online.


Every nth sparsity
Every-nth sparsity means using seeds starting only at every nth position in one of the two sequences being compared. We considered two slightly different ways to do that: every nth position starting from the first position in the sequence, versus every nth position starting from one of the first n positions, chosen randomly. With the random-origin method, the average fraction of selected seed positions is precisely 1/n. With the origin-1 method, on the other hand, the number of seed positions is the smallest integer that makes this fraction ≥ 1/n.
Since we test sensitivity with fairly short length-100 sequences, the difference is not negligible (Figure S1). In the rest of this study, the random-origin method is used.
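The two selection schemes can be sketched as follows (a minimal illustration with function names of our own choosing; real seeding would enumerate seed start positions, here simplified to all positions of a length-seq_len sequence):

```python
import random

def origin1_positions(seq_len, n):
    """Every nth position, always starting from the first position."""
    return list(range(0, seq_len, n))

def random_origin_positions(seq_len, n, rng=random):
    """Every nth position, starting from one of the first n positions chosen randomly."""
    return list(range(rng.randrange(n), seq_len, n))

# Origin-1 selects the smallest fraction of positions that is >= 1/n;
# averaged over random origins, the selected fraction is exactly 1/n.
print(origin1_positions(10, 8))  # [0, 8] -> fraction 0.2 >= 1/8
print(random_origin_positions(10, 8))
```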

Variance in count of words
In the main text, we derived an equation for the variance in number of occurrences of a set of words, in a random sequence of length s. That derivation omits some tedious algebra, so it may be hard to follow. Here is one intermediate step. Assuming that s ≥ 2k − 2:

Note that, if s < 2k, there is no pair of nonoverlapping word positions, so the p² term should be 0. Thus, the above equation is correct when s ≥ 2k − 2.

Figure S1: Sensitivity (y-axis: sensitivity relative to random-origin seeding) and spurious hit count (x-axis: number of seed hits between unrelated random sequences) for exact-match seeds with every-8th sparsity. Red: origin-1 seeding. Blue: random-origin seeding. Sensitivity was measured on sequence pairs with PAM distance 20. Seed lengths 5-14 were tested, shown in gray.
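The anti-clumping effect underlying this variance can also be checked numerically. The following sketch (an illustration of ours, not the study's code) estimates the variance-to-mean ratio of word-occurrence counts in random acgt sequences; minimally overlapping words should give a ratio below 1, and a self-overlapping word such as aa a ratio above 1:

```python
import random

def count_starts(seq, words):
    """Number of positions in seq where some word in the set starts."""
    k = len(next(iter(words)))
    return sum(seq[i:i+k] in words for i in range(len(seq) - k + 1))

def estimate_vmr(words, seq_len=100, trials=5000, seed=1):
    """Monte Carlo variance-to-mean ratio of the count, over random sequences."""
    rng = random.Random(seed)
    counts = [count_starts(''.join(rng.choices('acgt', k=seq_len)), words)
              for _ in range(trials)]
    mean = sum(counts) / trials
    var = sum((c - mean) ** 2 for c in counts) / trials
    return var / mean

print(estimate_vmr({'ac', 'at', 'gc', 'gt'}))  # well below 1: anti-clumped
print(estimate_vmr({'aa'}))                    # above 1: clumped
```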

Circular sequences
For a random circular sequence of length s, let us assume that s ≥ 2k − 1, to avoid cases where the end of word V overlaps the start of word W and the end of word W overlaps the start of word V. If we define the modular distance l = min(|i − j|, s − |i − j|), then

Finding low-variance words

Overlap scores
As a minor simplification, we minimized an integer ‘overlap score’, which is equivalent to minimizing variance. This is possible because we only considered the case of equal letter frequencies. So P^n_W (the product of probabilities of the first n letters in word W) equals a^−n, where a is the alphabet size (either 2 for the ry alphabet or 4 for the acgt alphabet). Thus, we can define an integer score: such that lower score implies lower variance. For VMR1, the overlap score is simpler:
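The exact score is defined in the main text; the following sketch only conveys the general idea, under our own assumption that the score adds a^n for every length-n overlap between an ordered pair of words (a suffix of one equal to a prefix of the other, including a word with itself). The function name and this exact weighting are ours:

```python
def overlap_score(words, a):
    """Hypothetical integer overlap score: for each ordered pair of words (V, W),
    including V = W, and each overlap length n where V's length-n suffix equals
    W's length-n prefix, add a**n (a = alphabet size).  Lower means less overlap."""
    score = 0
    for v in words:
        for w in words:
            for n in range(1, min(len(v), len(w))):
                if v[-n:] == w[:n]:
                    score += a ** n
    return score

print(overlap_score(['ac', 'at', 'gc', 'gt'], 4))  # 0: these words never overlap
print(overlap_score(['aa', 'ta', 'ca', 'ga'], 4))  # 16: each word's final 'a' can
                                                   # precede the word 'aa'
```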

Simulated annealing
For simulated annealing, we began with a random set of words. Then, at each step of the algorithm, we proposed replacing a random word in the set with a random word not in the set. If this replacement would decrease the overlap score, it was accepted. If it would increase the overlap score, it was accepted with probability exp[(oldScore − newScore)/t]. The temperature parameter t was initially set equal to the overlap score of the random starting word set, then multiplied by 0.9999999 at each iteration. The algorithm halted when t ≤ 0.05 (because the probability of accepting any integer score increase is then extremely low).
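A compact sketch of this procedure (with a toy stand-in score, and a faster cooling rate than the study's 0.9999999 so the demo finishes quickly; all names here are ours):

```python
import math
import random
from itertools import product

def toy_score(words):
    """Toy stand-in for the overlap score: count ordered pairs of words where
    one word's last letter equals another word's first letter."""
    return sum(v[-1] == w[0] for v in words for w in words)

def anneal(candidates, set_size, score, cooling=0.999, seed=0):
    """Simulated annealing as described above: start with a random word set,
    propose swapping a random member for a random non-member, always accept
    score decreases, and accept an increase with probability exp((old - new) / t).
    t starts at the initial set's score, is multiplied by `cooling` each step,
    and the algorithm halts when t <= 0.05."""
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)
    inside, outside = pool[:set_size], pool[set_size:]
    t = score(inside)
    while t > 0.05:
        i, j = rng.randrange(len(inside)), rng.randrange(len(outside))
        old = score(inside)
        inside[i], outside[j] = outside[j], inside[i]
        if score(inside) > old and rng.random() >= math.exp((old - score(inside)) / t):
            inside[i], outside[j] = outside[j], inside[i]  # reject: undo the swap
        t *= cooling
    return inside

all_2mers = [''.join(p) for p in product('acgt', repeat=2)]
best = anneal(all_2mers, 4, toy_score)
print(sorted(best), toy_score(best))
```

With this toy score, the search tends to end at a set of non-overlapping 2-mers (score 0), such as ac, at, gc, gt.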

Low-variance words
In the main text, we saw that sensitivity of word-restricted seeding can be boosted by using words with low variance (VMR1 and VMR2). The word sets depicted in Figure 3C,D are listed in Table S1. Some further, less impressive, results are shown here. Firstly, we sought even longer and lower-variance words, by heuristic search (simulated annealing). We found a set of 32 length-7 words (Table S2), which have lower VMR1 and VMR2 than the sixteen length-6 words. Surprisingly, however, their sensitivity is slightly worse (Figure S2). This indicates that neither VMR1 nor VMR2 is a perfect guide to sensitivity.
Secondly, we sought low-variance DNA words without restriction to a 2-letter alphabet. In particular, we sought to replace ryn with a less-overlapping set of length-3 words. We found a set with lower VMR1 and VMR2 (Table S2: atc, atg, caa, ...). The sensitivity of seeds starting at these words is indeed better than that of seeds starting at ry (Figure S2).

Figure S2: Sensitivity (y-axis: sensitivity relative to every-4th seeding) and spurious hit count (x-axis: number of seed hits between unrelated random sequences) for exact-match seeds with word-based sparsity. The tested schemes are: sparsity 4; the words atc,atg,caa,cac,cag,ccg,ctc,ctg,gaa,gcg,gtg,taa,tac,tag,ttc,ttg; ry; the sixteen 6-mers; the 32 7-mers. Sensitivity was measured on sequence pairs with PAM distance 20. Seed lengths 5-14 were tested.

Edge bias of minimizers
In the main text, we found minimizer positions by sliding a length-w window along the sequence, and taking all positions that are the minimum of at least one window. This procedure, termed "interior minimizers" [RHH+04], means that positions near sequence edges have fewer chances to be a minimizer. In particular, the first position in a sequence has only one chance to be a minimizer, because it is contained in only one length-w window. Thus, minimizer density is lower, on average, near sequence edges.
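The interior-minimizer procedure can be sketched as follows (a simple O(length × w) illustration of ours, not an efficient implementation; ties are broken by leftmost position):

```python
def interior_minimizers(seq, k, w):
    """Positions whose k-mer is the minimum of at least one window of w
    consecutive k-mer starting positions ("interior minimizers")."""
    kmers = [seq[i:i+k] for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(w), key=lambda x: (kmers[start + x], x))
        chosen.add(start + best)
    return sorted(chosen)

# Position 0 lies in only one window (and likewise the last k-mer position),
# so edge positions have fewer chances to be picked.
print(interior_minimizers('acgtacgtacgt', 2, 3))  # [0, 1, 4, 5, 8]
```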
Moreover, in the main text we measured minimizer density in length-10^6 sequences, but sensitivity was tested with length-100 sequences. This makes our tests biased against minimizers, because their density in the tested length-100 sequences is lower than their reported density in length-10^6 sequences, due to the edge effect.
To address this concern, we performed a further test of minimizer sensitivity with length-100 sequences. This time, when finding minimizer positions, we treated the sequences as circular, so there is no density reduction at sequence edges. Figure 6 in the main text, when modified by using "circular minimizers", becomes Figure S3 here. Recall that Figure S3A,B compares words to minimizers with slightly lower sparsity (higher density), giving an unfair advantage to the minimizers. As expected, circular minimizers are slightly more sensitive than interior minimizers (Figure S3 versus Figure 6). The main conclusion does not change: minimally overlapping words perform better than minimizers.

Figure S3: Sensitivity (y-axis: % of related sequence pairs found) and spurious hit count (x-axis: number of seed hits between unrelated random sequences) for exact-match seeds at either word positions or minimizer positions (words: abbb; minimizers: w=20, alphabetic; minimizers: w=16, abb). Seed lengths 5-14 were tested. Sensitivity was measured on sequence pairs with PAM distance 20 (A, C) or 50 (B, D). Here, "circular minimizers" were used.
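A circular variant can be sketched the same way (our own illustration: k-mers wrap around the sequence end, and every position lies in exactly w windows, so there is no edge-density bias):

```python
def circular_minimizers(seq, k, w):
    """Minimizer positions when the sequence is treated as circular: every
    position is in exactly w length-w windows, removing the edge bias."""
    s = len(seq)
    ext = seq + seq[:k-1]          # let k-mers wrap around the end
    kmers = [ext[i:i+k] for i in range(s)]
    chosen = set()
    for start in range(s):         # window starts also wrap around
        best = min(range(w), key=lambda x: (kmers[(start + x) % s], x))
        chosen.add((start + best) % s)
    return sorted(chosen)

print(circular_minimizers('acgtacgtacgt', 2, 3))  # [0, 1, 4, 5, 8, 9]
```

On this small input, the wrap-around windows select an extra position near the sequence end, illustrating how circular treatment restores density at the edges.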