Abstract

Motivation

Third generation sequencing techniques, such as the Single Molecule Real Time technique from PacBio and the MinION technique from Oxford Nanopore, can generate long, error-prone sequencing reads which pose new challenges for fragment assembly algorithms. In this paper, we study the overlap detection problem for error-prone reads, which is the first and most critical step in the de novo fragment assembly. We observe that all the state-of-the-art methods cannot achieve an ideal accuracy for overlap detection (in terms of relatively low precision and recall) due to the high sequencing error rates, especially when the overlap lengths between reads are relatively short (e.g. <2000 bases). This limitation appears inherent to these algorithms due to their usage of q-gram-based seeds under the seed-extension framework.

Results

We propose smooth q-gram, a variant of q-gram that captures q-gram pairs within small edit distances and design a novel algorithm for detecting overlapping reads using smooth q-gram-based seeds. We implemented the algorithm and tested it on both PacBio and Nanopore sequencing datasets. Our benchmarking results demonstrated that our algorithm outperforms the existing q-gram-based overlap detection algorithms, especially for reads with relatively short overlapping lengths.

Availability and implementation

The source code of our implementation in C++ is available at https://github.com/FIGOGO/smoothq.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Q-gram, also called n-gram, k-mer/shingle, has been used extensively in the areas of bioinformatics (Altschul et al., 1990; Berlin et al., 2015; Li, 2016; Myers, 2014; Pevzner et al., 2001), databases (Qin et al., 2011; Wang et al., 2012; Xiao et al., 2008), natural language processing (Manning and Schütze, 1999), etc. In particular, q-gram was used to construct the de Bruijn graph (Compeau et al., 2011; Pevzner et al., 2001), a data structure commonly exploited for fragment assembly in genome sequencing, especially for short reads obtained using next-generation sequencing (NGS) technologies (Miller et al., 2010). Another important application of q-gram in bioinformatics is for sequence alignment, which aims to detect highly similar regions between long strings (e.g. genomic sequences). Following the seed-extension approach, many sequence alignment algorithms, including the popular BLAST (Altschul et al., 1990) and more recent algorithms (Brudno et al., 2003; Kurtz et al., 2004; Schwartz et al., 2003), first search for q-gram matches (i.e. seeds) between each pair of input strings and then extend these matches into full-length alignment using dynamic programming algorithms.

Recently, the seed-extension approach was adopted for detecting overlaps between long, error-prone reads (Berlin et al., 2015; Chaisson and Tesler, 2012; Li, 2016, 2018; Myers, 2014) generated by single molecule sequencing technologies, including the PacBio Single Molecule Real Time (SMRT) technologies (Roberts et al., 2013) and Oxford MinION technologies (Mikheyev and Tin, 2014). Compared to the NGS reads, the single molecule sequencing technologies generate reads that are much longer but more error-prone. Notably, SMRT sequencers generate reads 1000–100 000 bps long with up to 12–18% sequencing errors (including most insertions/deletions and some substitutions); in comparison, Illumina sequencers generate reads of 100–300 bps long with <1% errors. As a result, two overlapping reads contain highly similar but not identical substrings (with a relatively small edit distance (The edit distance between two strings x and y is defined to be the minimum number of letter insertions, deletions and substitutions needed to transfer x to y.) due to sequencing errors), which should be addressed by an overlap detection algorithm.

A straightforward application of the seed-extension approach to overlap detection may be hurdled by an inherent limitation: Two strings sharing a highly similar substring may share only a small number of, or even zero, matched q-gram pairs (seeds) due to the pattern of sequencing errors within the shared substring. Consequently, a seed-extension algorithm may fail to detect some overlaps because of the lack of seeds between the reads.

To address this issue, we propose a variant of q-gram named the smooth q-gram, which we can use to identify not only those exactly matched q-gram pairs (with certainty) but also those q-gram pairs that have small edit distances (each with a high probability). Our smooth q-gram construction is based on a recent advance in metric embedding (Chakraborty et al., 2016) that maps a string from the edit distance space to the Hamming distance space while (approximately) preserving the distance; we will illustrate the details of this embedding method in Section 2.1. Our smooth q-gram-based approach can, with a high probability, find most pairs of q-grams of the two input strings whose edit distances are small. We introduce experiments to further show the advantages of smooth q-gram in Supplementary Material S1.1.

Using smooth q-grams as seeds, we designed a novel algorithm for the overlap detection problem. Our algorithm works for both PacBio and Nanopore sequencing data and only requires a single set of parameters on all datasets. Our experiments show that for all real-world datasets that we have tested, the smooth q-gram-based algorithm achieves a F1 score about 3.8% higher than the best existing algorithm on average. [The F1 score is the harmonic average of the precision and recall (see Section 4).] The advantage of our algorithm is even greater on pairs with short overlaps (e.g. with lengths 500–2000 bps), which are the relatively more challenging cases. For example, the recall of our algorithm for pairs with overlap lengths <2000 bps is about 7.0% higher than the best existing algorithm on average.

1.1 Related work

The only line of work, as far as we have concerned, that has a similar spirit as our smooth q-gram is the gapped q-gram (Burkhardt and Kärkkäinen, 2001, 2002, 2003), which is also referred to as the spaced seeds in bioinformatics applications (Keich et al., 2004; Ma et al., 2002). The idea of gapped q-gram is to make substrings of each string of a specific pattern. For example, the gapped 3-grams of the string ‘ACGTACGT’ with pattern ‘XX-X’ are {ACT, CGA, GTC, TAG, ACT}. That is, instead of taking the contiguous substrings as in the traditional q-gram approach, the gapped q-gram breaks the adjacency dependencies between the characters. Now, if we are allowed to choose multiple gapped q-gram patterns, then one will need more edits to make all gapped q-grams between two strings mismatched. However, the optimal pattern of gapped q-gram is difficult to find: it needs an exhaustive search on all possible patterns and the running time for the search has an exponential dependency on length of the pattern (Keich et al., 2004). In contrast, our smooth q-grams are systematically generated, and always have the same theoretical guarantees on all datasets.

The problem of detecting overlaps among long, error-prone reads has drawn significant attention in bioinformatics (Berlin et al., 2015; Chaisson and Tesler, 2012; Li, 2016, 2018; Myers, 2014; Sović et al., 2016). Existing overlap detection algorithms follow the ‘seed-extension’ approach and use conventional q-grams as seeds. We choose the most sensitive existing algorithms MHAP, Minimap2 and DALIGNER in our comparisons and introduce them in more detail in Section 4. Note that we find the GraphMap (Sović et al., 2016) is only optimized for Nanopore data and has worse performance compared with other algorithms. Thus, we did not include it in our comparison.

2 Smooth q-gram

As mentioned above, the innovation of this paper is to extend the conventional q-gram-based approach to the smooth q-gram-based approach for overlap detection. The main advantage of smooth q-gram is that it tolerates a small edit distance between matched q-grams, and is thus able to identify similar substrings at higher sensitivity. In this section, we present the details of smooth q-gram construction and its properties.

We will use m to denote the length of a smooth q-gram, and κ to denote the length of a q-gram after the CGK-embedding.

2.1 The CGK-embedding

The key tool that we will use in our construction of smooth q-gram is the CGK-embedding, which convert a string sΣq to s=Σκ for a value κ using a random string R1, where Σ is the alphabet (for nucleotides, Σ={A,C,G,T}) and R1 is a random string from {0,1}κ|Σ|.

More precisely, let j=1,2,,κ denote the time steps of the embedding. We also maintain a pointer i to the string s, initialized to be i = 1. At each step j, we first copy s[i] to s[j], and set jj+1. We then determine whether we should increment i or not. We sort characters in Σ in an arbitrary but fixed order. For a character σΣ, let Index(σ) be the index of σ in this order. We set

When i reaches q +1 while j<κ, we simply pad κj copies of ‘’ to s to make its length equal to κ, where Σ is an arbitrary character.

Denote the CGK-embedding as a function CGK(·,R1) for a fixed string (sampled randomly from {0,1}κ|Σ|). Given s,tΣq, let s=CGK(s,R1) and t=CGK(t,R1). It has been shown in Chakraborty et al. (2016) that for any κ2q+cq, for some large enough constant c, we have with probability 0.999 that
where ED(·,·) and HAM(·,·) denote the edit distance and the Hamming distance, respectively.

It is easy to see that after the CGK-embedding, q-grams with small edit distance will likely have small Hamming distance, whereas those with large edit distance will likely have large Hamming distance. In particular, if s = t, then we have s=t with certainty.

The CGK-embedding has recently been used for sketching edit distance by Belazzougui and Zhang (2016) and performing edit similarity joins by Zhang and Zhang (2017).

2.2 From q-gram to smooth q-gram

We show below how to construct a smooth q-gram from a conventional q-gram using random string R2. For convenience we will write ‘smooth q-gram’ instead of ‘smooth m-gram’ although the resulting smooth q-gram will have length m. Our algorithm is easy to describe (Algorithm 1): Given a q-gram s, we first perform the CGK-embedding on s to get a string s of length κ, and then construct a substring s¯ of length m by picking the coordinates i in s where R2[i]=1. We include running examples of Algorithm 1 in Supplementary Material S1.2.
Algorithm 1

Generate-Smooth-q-Gram(s,R1,R2)

Input:s: q-gram sΣq;

R1: random string from {0,1}κ|Σ|;

R2: random string from {0,1}κ under the constraint that there are m 1-bit;

Output:  s¯: smooth q-gram of s of size m

1: s CGK(s, R1)

2: s¯ is generated by removing coordinates i in s s.t. R2[i]=0

3: return  s¯

The motivation of introducing the smooth q-gram is that we expect the corresponding smooth q-grams of two q-grams s and t for which ED(s,t) is small, can be identical with a good probability. More precisely, let k=ED(s,t), and let s=CGK(s,R1) and t=CGK(t,R1). According to the property of the CGK-embedding, we know that HAM(s,t)k2. Let d=HAM(s,t). If we randomly sample without replacement m bits from two κ-bit strings s and t at the same indices, the probability that all the sampled bits are the same is
(1)

In our experiments, we typically choose m=κ/c for a constant c, and we are only interested in d being at most 4. In this case, we can approximate (1) as ((c1)/c)d for some constant c. In other words, a significant fraction q-gram pairs with distance of d or below will have their corresponding smooth q-grams to be identical. Finally, we note that for identical q-gram pairs, e.g. s = t, with fixed R1 and R2 we always have s¯=t¯.

We note that the construction algorithm for smooth q-gram is different from a simple subsampling of the original q-grams. Indeed, given two q-grams s and t with small edit distance i.e. ED(s,t)=2, if we just sample a constant fraction of symbols from s and t using common randomness, the resulting strings s¯ and t¯ will be different with high probability.

As mentioned in Section 1, if we are able to identify pairs of near-identical q-grams (under edit distance), then we are able to catch similar substrings between two strings that will otherwise be missed by the conventional q-gram-based approaches. Therefore, we can significantly improve the sensitivity of the overlap detection algorithm. Of course, by allowing approximate q-gram matching, we may also increase the number of false positives, i.e. dissimilar pairs of substrings may share some identical smooth q-grams, and will thus be considered as seeds. We need to check the actual edit distance between pairs of q-grams that share the same smooth q-grams, in the seed identification step for overlap detection.

3 Materials and methods

In this section, we show how to use smooth q-gram for overlap detection among long, error-prone sequence reads. In Table 1, we list a set of global parameters/notations that will be used in our algorithms. Let [n] denote the set {1,2,,n}.

Table 1.

List of global parameters

ParametersExplanation
mLength of a smooth q-gram
κLength of the q-gram after CGK-embedding
αFraction of positions selected as smooth q-grams
ηFrequency filtering threshold
KEdit distance threshold
CThreshold for #matched signatures between two overlapping reads
LOverlap length
ϵError rate tolerance
ΠΠ:Σm(0,1) a random hash function
R1A random string from {0,1}κ|Σ|
R2A random string from {x{0,1}κ|x1=m}
ParametersExplanation
mLength of a smooth q-gram
κLength of the q-gram after CGK-embedding
αFraction of positions selected as smooth q-grams
ηFrequency filtering threshold
KEdit distance threshold
CThreshold for #matched signatures between two overlapping reads
LOverlap length
ϵError rate tolerance
ΠΠ:Σm(0,1) a random hash function
R1A random string from {0,1}κ|Σ|
R2A random string from {x{0,1}κ|x1=m}
Table 1.

List of global parameters

ParametersExplanation
mLength of a smooth q-gram
κLength of the q-gram after CGK-embedding
αFraction of positions selected as smooth q-grams
ηFrequency filtering threshold
KEdit distance threshold
CThreshold for #matched signatures between two overlapping reads
LOverlap length
ϵError rate tolerance
ΠΠ:Σm(0,1) a random hash function
R1A random string from {0,1}κ|Σ|
R2A random string from {x{0,1}κ|x1=m}
ParametersExplanation
mLength of a smooth q-gram
κLength of the q-gram after CGK-embedding
αFraction of positions selected as smooth q-grams
ηFrequency filtering threshold
KEdit distance threshold
CThreshold for #matched signatures between two overlapping reads
LOverlap length
ϵError rate tolerance
ΠΠ:Σm(0,1) a random hash function
R1A random string from {0,1}κ|Σ|
R2A random string from {x{0,1}κ|x1=m}

Our main algorithm (Algorithm 2) consists of three stages. The first stage (Line 1–11) is signature generation: for each input sequence xi and each of its q-gram, we generate a smooth q-gram-based signature (seed). The second stage (Line 12–15) is subsampling and filtering: we sample a portion of signatures for each xi and only use them to detect candidate overlaps. The last stage (Line 16–31) is detection and verification: we detect candidate overlaps, verify and report true overlaps. Below, we will describe each stage in more detail.

3.1 Signature generation

In the signature generation stage (Line 1–11 of Algorithm 2), we use the following data structure to store useful information of a q-gram as a signature. 

Definition 1

(q-gram signature) Letδ(s,t,r,i,p)  be the signature of a q-gram; the parameters are interpreted as follows:

  • s: the q-gram;

  • t: the smooth q-gram of s;

  • r: the hash rank of t;

  • i, p: q-gram s is taken from the i-th input string xi from the position p, i.e. sxi[p,p+q1].

Obviously, t and r are fully determined by s, given the randomness R1, R2 and Π, but for the convenience of presentation, we still include t, r as parameters in the definition of the signature. We omit the signature generation for the reverse complement of sequences for the sake of convenience. In practice, for each input sequence, we consider both the sequence itself and its reverse complement, and generate signatures for both of them separately.

3.2 Subsampling and filtering

In the subsampling and filtering stage (Line 12–15 of Algorithm 2), we first perform a subsampling of an α-fraction of smooth q-grams for each sequence using a random hash function Π (Line 13). We then only focus on these sampled smooth q-grams to identify similar substrings. The subsampling step could improve the efficiency of our algorithm by reducing the number of smooth q-gram matches to consider.

Algorithm 2

Find-overlapping-Strings(X)

Input:  X={x1,,xn}: set of input strings;

Output:  O {overlapping pair (xi, xj) and their overlap region xi[pis,pie] and xj[pjs,pje]}

1: Initialize an empty table D

2: for each  i[n]  do

3:  Δi

4:  for each  p[|xi|q+1]  do

5:   sxi[p,p+q1]

6:   t Generate-Smooth-q-Gram(s,R1,R2)

7:   rΠ(t)

8:   δ(s,t,r,i,p)

9:   ΔiΔiδ

10:  end for

11: end for

12: for each  i[n]  do

13:  Construct Δi from Δi by keeping signatures in Δi with the smallest α|xi| of hash ranks r.

14: end for

15: Count for all t the number ct of signatures in the form of (·,t,·,·,·) in i[n]Δi, and remove all (s,t,r,i,p) in Δi with ct in the top η of all t for all i[n]

16: for each  i[n]  do

17:  Li

18:  for eachδ in Δi  do

19:   LiLi Search-Similar-q-Grams(δ, D)

20:  end for

21:  for eachj < ido

22:   Mij{(u,v)|(j,u,v)Li}

23:  end for

24: end for

25: for each  |Mij|C  do

26:  (o,Le) Verify(xi, xj, Mij)

27:  if  (o,Le)null  then

28:   (pis,pie,pjs,pje) Find-Shared-Substrings(xi, xj, o, Le, Δi, Δj)

29:   OO(xi,xj,[pis,pie],[pjs,pje])

30:  end if

31: end for

On the remaining subsampled signatures, we filter out those smooth q-grams whose frequency is above a certain threshold (Line 15). This is a common practice to reduce the number of false positives in overlap detection algorithms, such as MHAP (Berlin et al., 2015), Minimap2 (Li, 2016, 2018) and DALIGNER (Myers, 2014). The motivation behind the filtering step is that the highly frequent smooth q-grams often correspond to highly frequent q-grams, which do not carry many important features about the sequence (similar to the frequent words such as ‘a’ and ‘the’ in English sentences). On the other hand, these common smooth q-grams will contribute to false positives and consequently increase the running time of subsequent steps. As a result, with a properly chosen filtering threshold η, we could reduce the running time without much loss on the sensitivity of detecting true similar pairs.

3.3 Detection and verification

In the detection and verification stage (Line 16–31 of Algorithm 2), we use the smooth q-gram signatures to detect candidate overlaps, verify each of them and output the true overlaps. Below, we describe the detection and verification steps separately.

3.3.1 Detection

In the detection step, we find, for each pair of input sequences (xi, xj), their set of matching q-grams (i.e. those with edit distances less than or equal to K), using Algorithm 3. More precisely, in Algorithm 3, we try to find for a q-gram s, an (incomplete) list of matching q-grams s by considering all q-gram s falls into the same bucket in hash table D (Line 2). We then check if their edit distance ED(s,s)K; if so, then we record the pair and their positions into L. Finally, we add the signature of s into the table D to build the table gradually (Line 7).

We record all the matches between each pair of sequences (xi, xj) in Mi,j, where a match (u, v) indicates the u-th q-gram of xi matches the v-th q-gram of xj. For each pair of sequences with at least C matched q-grams, we move on to the verification step to see if this pair indeed share an overlap of a significant length, and if it is the case, we also identify the shared region in the pair.

Algorithm 3

Search-Similar-q-Grams(δ, D)

Input:  δ=(s,t,r,i,p): a signature for q-gram s (see Definition 1 for detailed explanation of the parameters); D: a table with buckets indexed by t;

Output:  L{(i,p,p)|δ=(s,t,r,i,p)Ds.t.ED(s,s)K}

1: L

2: for each  δ=(s,t,r,i,p) stored in D(t) do

3:  if  ED(s,s)K  then

4:   LL(i,p,p)

5:  end if

6: end for

7: Add δ to the D(t)

8: return  L

3.3.2 Verification

In the verification step, we employ two subroutines: Algorithm 4 to perform a basic verification that reports approximate overlap shift and length, and Algorithm 5 to compute the actual overlap regions. We describe Algorithms 4 and 5 in more detail.

Let M be the list of starting positions of the matching q-gram pairs in the input strings xi and xj. We construct a bipartite graph Gi,j with characters of xi as nodes on the left part and characters of xj as nodes on the right part. For each matching pair (u, v), an edge is connected between the nodes xi[u] and xj[v], denoted as (u, v), where (uv) represents the shift corresponding to the edge. Intuitively, if xi and xj overlap, there must be a large cluster of edges with similar shifts in Gi,j.

Algorithm 4 consists of two filtering steps. In the first step, we aim to find a dense cluster of edges with similar shifts and remove the remaining edges (Line 1–2). This could be implemented by finding an interval Io containing the maximum number of edges whose width is adjusted according to the maximum error rate and the minimum length of overlaps we want to detect. We set the default error rate tolerance ϵ to be 0.15, as PacBio and Nanopore reads have error rate around 15% (Berlin et al., 2015; Jain et al., 2015; Li, 2018). We then use the remaining edges to estimate the shift and the length of overlap (o,Le) by finding a reference edge (um, vm) with a median shift over all remaining edges (Line 3–5). In the second step, we try to find a dense region of edges with similar positions on xi and remove all the edges remaining (Line 6–7). This could be implemented again by finding an interval Ipos containing the maximum number of edges. Finally, we count the number of unremoved edges: if the number is at least C, then we consider (xi, xj) to be an overlap candidate pair and return its approximate shift and overlap length; otherwise, we simply return null (Line 8–12).

Algorithm 5 computes the locations of shared substrings between xi and xj more precisely. In this step, we exploit the complete set of matching q-gram pairs in order to generate more accurate overlap. We first construct the list M of matching q-grams, by a synchronized linear scan on the two sets Δi and Δj after sorting the tuples based on their r values. Then, we remove edges with shifts far deviated from our estimate (Line 2). Next, we perform a two-stage merging process to identify shared substrings. We represent a cluster of edges after merging as a window where window (pis,pie,pjs,pje) means xi[pis,pie] and xj[pjs,pje] are the shared substrings in xi and xj. We use cur to represent the current window to be considered. In the first merging process (Line 4–15), we determine if an edge (p,p) could be merged with cur. Considering the ‘boundary’ of cur: cur[2],cur[4] (the ending positions of shared substrings on both sequences) and (p,p), we compute their difference of shifts d and difference of locations step on both sequences. If d is no greater than ϵ·step (Line 9), then we merge the edge with cur. Otherwise, we create a new window cur starting from the edge and store the previous window in W. By the end of the first merge step, we have a collection of windows stored in W. In the second merging process (Line 16–28), we merge adjacent windows if they satisfy similar criteria and store the new windows in W. This process is mainly designed for Nanopore datasets where overlap regions are often broken into smaller sub-regions due to continuous sequencing errors. Finally, we pick the largest window in W as our output.

Algorithm 4

Verify(xi, xj, M)

Input:xi, xj: two input strings;

   M={(u,v)}: set of pairs of matched q-gram positions in xi and xj;

Output:o: estimated shift

   Le: estimated overlap length

1: Io[os,os+2ϵ·L] s.t. |{(u,v)|(u,v)M,(uv)Io}| is maximized

2: Remove all pairs (u,v)M whose (uv)Io

3: Find (um,vm)M with median (uv) value among all (u,v)M

4: oumvm

5: Lemin{um,vm}+min{|xi|um,|xj|vm}

6: Ipos[os,os+L] s.t. |{(u,v)|(u,v)M,uIpos}| is maximized

7: Remove all pairs (u,v)M whose uIpos

8: if  |M|<C  then

9:  returnnull

10: else

11:  return  (o,max(Le,L))

12: end if

Algorithm 5

Find-Shared-Substrings(xi, xj, o, Le, Δi, Δj)

Input:xi, xj: two input strings;

   o: estimated shift;

   Le: estimated overlap length

   Δi,Δj: sets of q-gram signatures of xi and xj

Output:  (pis,pie,pjs,pje): xi[pis,pie] and xj[pjs,pje] are shared substrings in xi and xj

1: M{(p,p)|(s,t,r,i,p)Δi,(s,t,r,j,p)Δj,t=t}

2: M{(p,p)M|(pp)[oϵ·Le,o+ϵ·Le]}

3: Sort matches (p,p)M using p in the increasing order

4: W{}

5: cur(p[1],p[1],p[1],p[1]), where (p[1],p[1]) is the first element in M, we represent the elements of cur as cur[i](i[4])

6: for each  (p,p)M  do

7:  d1pcur[2],d2pcur[4]

8:  d|d1d2|,stepmax(d1,d2)

9:  if  d20(dϵ·step) then

10:   cur(cur[1],p,cur[3],p)

11:  else

12:   WW{cur}

13:   cur(p,p,p,p)

14:  end if

15: end for

16: W{}

17: curW[1], where W[1] is the first element in W

18: for each  (pis,pie,pjs,pje)W  do

19:  d1piscur[2],d2pjscur[4]

20:  l1cur[2]+cur[4]cur[3]cur[1]2,l2pie+pjepispjs2

21:  d|d1d2|,stepd1+d22

22:  if  (step<max(l1,l2))(d2ϵ·step)  then

23:  cur(cur[1],max(cur[2],pie),min(cur[3],pjs),max(cur[4],pje))

24:  else

25:   WW{cur}

26:   cur(pis,pie,pjs,pje)

27:  end if

28: end for

29: return  (pis,pie,pjs,pje)W with largest pie+pjepispjs2

4 Results

We present our experimental studies in the section. We first introduce the setup of our experiments and then present our results.

4.1 Setup of experiments

4.1.1 Datasets

We test all algorithms on both simulated and real-world datasets (PacBio and Nanopore), on four genomes: Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster and Human. The PacBio datasets are downloaded from MHAP’s supporting data website (http://www.cbcb.umd.edu/software/PBcR/mhap). The Nanopore datasets are downloaded from the Sequence Read Archive at the National Center for Biotechnology Information website (https://www.ncbi.nlm.nih.gov). The details of these datasets are described in Supplementary Material S1.3.

We randomly sample 50 000 reads from each dataset in our experiments. The details of the subsampled datasets are shown in Table 2. We test algorithms on subsampled datasets because this can show the accurate performance of the overlap detection task while allowing all algorithms to finish in a reasonable amount of time. See Supplementary Material S1.6 for details.

Table 2.

Statistics of subsampled datasets

Dataset#ReadsCoverageAvg. lengthMedian lengthSubstitution error (%)Insertion error (%)Deletion error (%)
PacBio datasets
E.coli47 910848283.87477.52.118.353.37
S.cerevisiae50 000246038.85113.02.508.164.12
D.melanogaster50 00039324.88028.52.457.983.66
Human50 0000.117459.46533.03.038.884.63
Nanopore datasets
E.coli50 00012111407.79082.56.465.528.98
S.cerevisiae50 000307421.86440.07.385.358.30
D.melanogaster50 00037009.04675.05.714.436.23
Human50 0000.106633.36763.04.124.904.84
Dataset#ReadsCoverageAvg. lengthMedian lengthSubstitution error (%)Insertion error (%)Deletion error (%)
PacBio datasets
E.coli47 910848283.87477.52.118.353.37
S.cerevisiae50 000246038.85113.02.508.164.12
D.melanogaster50 00039324.88028.52.457.983.66
Human50 0000.117459.46533.03.038.884.63
Nanopore datasets
E.coli50 00012111407.79082.56.465.528.98
S.cerevisiae50 000307421.86440.07.385.358.30
D.melanogaster50 00037009.04675.05.714.436.23
Human50 0000.106633.36763.04.124.904.84
Table 2.

Statistics of subsampled datasets

Dataset#ReadsCoverageAvg. lengthMedian lengthSubstitution error (%)Insertion error (%)Deletion error (%)
PacBio datasets
E.coli47 910848283.87477.52.118.353.37
S.cerevisiae50 000246038.85113.02.508.164.12
D.melanogaster50 00039324.88028.52.457.983.66
Human50 0000.117459.46533.03.038.884.63
Nanopore datasets
E.coli50 00012111407.79082.56.465.528.98
S.cerevisiae50 000307421.86440.07.385.358.30
D.melanogaster50 00037009.04675.05.714.436.23
Human50 0000.106633.36763.04.124.904.84
Dataset#ReadsCoverageAvg. lengthMedian lengthSubstitution error (%)Insertion error (%)Deletion error (%)
PacBio datasets
E.coli47 910848283.87477.52.118.353.37
S.cerevisiae50 000246038.85113.02.508.164.12
D.melanogaster50 00039324.88028.52.457.983.66
Human50 0000.117459.46533.03.038.884.63
Nanopore datasets
E.coli50 00012111407.79082.56.465.528.98
S.cerevisiae50 000307421.86440.07.385.358.30
D.melanogaster50 00037009.04675.05.714.436.23
Human50 0000.106633.36763.04.124.904.84

We also test simulated datasets generated by NanoSim (Yang et al., 2017), a read simulator designed for Oxford MinION sequencers. Given the reference genome and a Nanopore dataset of the same species, NanoSim first analyzes the errors in the dataset to generate its error profile by mapping the dataset to the reference genome, and then utilizes the error profile to produce the set of simulated reads. For each simulated read, we know its exact mapping positions on the reference genome, and thus can use it as ground truth to test the performance of algorithms.

4.1.2 Measurements

We report recall, precision, F1 score, CPU time and memory usage in our experiments. To compute the precision and recall, we need to know ground truth, i.e. whether a pair of sequences overlaps or not. For the simulated datasets, we know the mapping of each read to the reference genome. For the real-world datasets, we compute such mapping using Blasr (Chaisson and Tesler, 2012).

We use the evaluation module in MHAP to calculate the recall and precision. It takes mapping positions of the reads as input and treats the read pairs sharing at least Γ bps region on the reference genome as true overlaps. To compute the recall, if a true overlap is reported by an overlap detection algorithm and the length difference between the reported and true overlaps is within 30% of the true overlap length, we consider this pair as a true positive. The evaluation program verifies all the true overlap pairs to compute recall. To compute the precision, the evaluation program first randomly samples 10 000 reported overlaps. Each sampled overlap pair is considered as true positive if the edit distance between the reported overlap region is not greater than 30% of the reported overlap length. The 30% criteria on both measurements comes from the 15% error rate in sequencing procedure since, in the worse case, two overlapping reads may contain a 30% fraction of error. We set the parameter Γ to be 500. We also compute the recall and precision for overlaps with lengths [500,2000] bps to show the advantage of SmoothQGram on short overlaps.

We compute F1 score to measure the overall performance of algorithms: F1=2×(precision×recall)/(precision+recall), which is an integrated metric evaluating both precision and recall.

We also report the running time and memory usage of all the algorithms. Since all algorithms use multiple threads in execution, we set the number of threads to be 16 for each algorithm and measure the CPU time for comparison. All experiments were conducted on Dell PowerEdge T630 server with two Intel Xeon E5-2667 v4 3.2 GHz CPU with eight cores each and 256 GB memory.

4.1.3 Algorithms and parameter selections

We compare SmoothQGram presented in Algorithm 2 with the state-of-the-art algorithms MHAP (Berlin et al., 2015), Minimap2 (Li, 2016, 2018) and DALIGNER (Chaisson and Tesler, 2012). (The implementation of MHAP is obtained from https://github.com/marbl/MHAP. The implementation of Minimap2 is obtained from https://github.com/lh3/minimap2. The implementation of DALIGNER is obtained from https://github.com/thegenemyers/DALIGNER.) We note that all these algorithms are using the same ‘seed-extension’ framework as ours: they first find all the matched q-grams between input sequence pairs, and then extend seeds to potential overlaps. The main difference between SmoothQGram and the other algorithms is that we have relaxed the strict q-gram matches to approximate q-gram matches (via smooth q-gram) to improve the sensitivity of overlap detection.

In the experiments, we run SmoothQGram with parameters q =14, m =16, κ = 35, α=0.2, K =2, C =5, ϵ=0.15 and L = 500. We run MHAP with the parameter ‘-num-hash 1256’, which corresponds to its sensitive mode that achieves higher sensitivity and F1 scores. For DALIGNER, we add the parameter ‘-H500’, which considers reads with the length at least 500 bps to improve its efficiency. Minimap2 provides two different sets of parameters for PacBio and Nanopore datasets. We use its default parameter set ‘-x ava-pb’ for PacBio datasets and ‘-x ava-ont’ for Nanopore datasets.

Under the default parameter set, Minimap2 runs extremely fast, but the recall is not satisfactory. We find that the performance of Minimap2 is mostly influenced by two parameters: the filter threshold (‘-f’) and the window size (‘-w’). The filter threshold controls the fraction of frequent signatures that the algorithm ignores, and the window size controls the size of consecutive q-grams in which one signature is generated. The lower these two parameters are, the more signatures are retained in the indexing. We try different combinations of these two parameters and find ‘-f = 1e−7, -w = 3’ achieves the highest F1 scores on most datasets. This is also the set of parameters that achieves the highest recall with the precision no smaller than 80%. Thus, beside the default parameters, we also compare the results from this alternative parameter set for Minimap2. We call the default version MM2-default and the alternative version MM2-alternative. Minimap2 under the alternative parameter set has the best recall compared with all other existing ‘seed-extension’ methods, but SmoothQGram still outperforms Minimap2.

We would like to note that we only select a single set of parameters for SmoothQGram to show its robustness. However, one could adopt different parameters according to datasets to further improve the sensitivity or running time of SmoothQGram. For instance, on the E.coli dataset, we could decrease α and η to select fewer signatures and reduce the running time, whereas on Human dataset, we could increase the α and η to improve our precision and recall.

4.2 Experimental results

4.2.1 Performance comparison

Table 3 shows the performance comparison (recall, precision, F1 score, running time and memory usage) of all algorithms. As we can see, SmoothQGram always achieves the best F1 score and has about a 3.8% advantage over the best competitor on average (over all real datasets). We also achieve the best recall on all datasets and the second-best precision on all real datasets except PacBio Human which ranks third. Among all the state-of-the-art algorithms, MM2-alternative always achieves the best F1 score, followed by MHAP in most cases.

Table 3.

Comparison of overlap detection algorithms on real datasets

Recall<2000 (%)Precision<2000 (%)Recall (%)Precision (%)F1 scoreCPU (s)Memory (G)
PacBio E.coli
MHAP34.5299.8860.9399.770.757378620.4
DALIGNER26.3695.2458.7397.410.73314768.4
MM2-default61.1199.9681.1299.970.8961955.1
MM2-alternative67.7999.8585.0199.890.9193327.5
SmoothQGram73.1199.8587.9899.930.936655030.6
PacBio S.cerevisiae
MHAP39.8696.6753.9793.090.683417622.2
DALIGNER10.0789.9031.9793.210.47615885.8
MM2-default8.0099.8312.4299.790.221743.4
MM2-alternative50.8095.6469.3593.280.79612309.0
SmoothQGram59.4999.2575.0099.340.855427424.3
PacBio D.melanogaster
MHAP28.5293.1641.1190.630.566397122.3
DALIGNER21.8869.9448.9376.260.59660 6439.3
MM2-default17.4598.6428.4898.330.4421235.1
MM2-alternative47.5292.9459.6688.020.71149 45932.8
SmoothQGram57.0095.5565.8796.980.785797635.4
PacBio Human
MHAP16.1474.7029.4172.790.419415222.3
DALIGNER11.2161.3431.7164.470.42528417.3
MM2-default15.0999.4527.2199.210.4271004.6
MM2-alternative28.3097.7746.1995.620.6233546.0
SmoothQGram33.2696.4351.3695.40.667555129.5
Nanopore E.coli
MHAP28.8797.0665.3396.930.781588721.8
DALIGNER10.7988.6232.7192.860.484492812.0
MM2-default33.3899.7773.8999.410.8483908.4
MM2-alternative38.8196.8276.5397.060.85688011.9
SmoothQGram48.3395.8079.6397.400.87614 40245.7
Nanopore S.cerevisiae
MHAP31.9898.6956.0499.210.716525122.0
DALIGNER3.3295.8514.8397.830.25839415.4
MM2-default4.8099.297.6599.260.1422355.6
MM2-alternative33.4898.0070.6197.320.818349022.1
SmoothQGram41.7895.7771.2597.450.823655533.4
Nanopore D.melanogaster
MHAP53.0590.7156.3989.520.692371322.2
DALIGNER14.9779.2631.9783.210.46218 2598.1
MM2-default15.9798.3029.1997.790.4501405.2
MM2-alternative57.0995.0065.5792.030.766641622.2
SmoothQGram61.5493.9567.4795.490.791481227.1
Nanopore Human
MHAP28.5939.6252.0442.210.46611 17022.4
DALIGNER19.6771.2144.4675.340.55936427.6
MM2-default23.3399.3833.0399.450.4961135.1
MM2-alternative32.6184.2865.4081.030.724258711.1
SmoothQGram38.1693.5667.3693.360.783548025.9
Recall<2000 (%)Precision<2000 (%)Recall (%)Precision (%)F1 scoreCPU (s)Memory (G)
PacBio E.coli
MHAP34.5299.8860.9399.770.757378620.4
DALIGNER26.3695.2458.7397.410.73314768.4
MM2-default61.1199.9681.1299.970.8961955.1
MM2-alternative67.7999.8585.0199.890.9193327.5
SmoothQGram73.1199.8587.9899.930.936655030.6
PacBio S.cerevisiae
MHAP39.8696.6753.9793.090.683417622.2
DALIGNER10.0789.9031.9793.210.47615885.8
MM2-default8.0099.8312.4299.790.221743.4
MM2-alternative50.8095.6469.3593.280.79612309.0
SmoothQGram59.4999.2575.0099.340.855427424.3
PacBio D.melanogaster
MHAP28.5293.1641.1190.630.566397122.3
DALIGNER21.8869.9448.9376.260.59660 6439.3
MM2-default17.4598.6428.4898.330.4421235.1
MM2-alternative47.5292.9459.6688.020.71149 45932.8
SmoothQGram57.0095.5565.8796.980.785797635.4
PacBio Human
MHAP16.1474.7029.4172.790.419415222.3
DALIGNER11.2161.3431.7164.470.42528417.3
MM2-default15.0999.4527.2199.210.4271004.6
MM2-alternative28.3097.7746.1995.620.6233546.0
SmoothQGram33.2696.4351.3695.40.667555129.5
Nanopore E.coli
MHAP28.8797.0665.3396.930.781588721.8
DALIGNER10.7988.6232.7192.860.484492812.0
MM2-default33.3899.7773.8999.410.8483908.4
MM2-alternative38.8196.8276.5397.060.85688011.9
SmoothQGram48.3395.8079.6397.400.87614 40245.7
Nanopore S.cerevisiae
MHAP31.9898.6956.0499.210.716525122.0
DALIGNER3.3295.8514.8397.830.25839415.4
MM2-default4.8099.297.6599.260.1422355.6
MM2-alternative33.4898.0070.6197.320.818349022.1
SmoothQGram41.7895.7771.2597.450.823655533.4
Nanopore D.melanogaster
MHAP53.0590.7156.3989.520.692371322.2
DALIGNER14.9779.2631.9783.210.46218 2598.1
MM2-default15.9798.3029.1997.790.4501405.2
MM2-alternative57.0995.0065.5792.030.766641622.2
SmoothQGram61.5493.9567.4795.490.791481227.1
Nanopore Human
MHAP28.5939.6252.0442.210.46611 17022.4
DALIGNER19.6771.2144.4675.340.55936427.6
MM2-default23.3399.3833.0399.450.4961135.1
MM2-alternative32.6184.2865.4081.030.724258711.1
SmoothQGram38.1693.5667.3693.360.783548025.9

Note: Recall<2000 and precision<2000 show the measurements for overlaps of length 500 to 2000 bps. Recall and precision show the measurements for all true overlaps.

Highest F1 score should be highlighted in Bold.

Table 3.

Comparison of overlap detection algorithms on real datasets

Recall<2000 (%)Precision<2000 (%)Recall (%)Precision (%)F1 scoreCPU (s)Memory (G)
PacBio E.coli
MHAP34.5299.8860.9399.770.757378620.4
DALIGNER26.3695.2458.7397.410.73314768.4
MM2-default61.1199.9681.1299.970.8961955.1
MM2-alternative67.7999.8585.0199.890.9193327.5
SmoothQGram73.1199.8587.9899.930.936655030.6
PacBio S.cerevisiae
MHAP39.8696.6753.9793.090.683417622.2
DALIGNER10.0789.9031.9793.210.47615885.8
MM2-default8.0099.8312.4299.790.221743.4
MM2-alternative50.8095.6469.3593.280.79612309.0
SmoothQGram59.4999.2575.0099.340.855427424.3
PacBio D.melanogaster
MHAP28.5293.1641.1190.630.566397122.3
DALIGNER21.8869.9448.9376.260.59660 6439.3
MM2-default17.4598.6428.4898.330.4421235.1
MM2-alternative47.5292.9459.6688.020.71149 45932.8
SmoothQGram57.0095.5565.8796.980.785797635.4
PacBio Human
MHAP16.1474.7029.4172.790.419415222.3
DALIGNER11.2161.3431.7164.470.42528417.3
MM2-default15.0999.4527.2199.210.4271004.6
MM2-alternative28.3097.7746.1995.620.6233546.0
SmoothQGram33.2696.4351.3695.40.667555129.5
Nanopore E.coli
MHAP28.8797.0665.3396.930.781588721.8
DALIGNER10.7988.6232.7192.860.484492812.0
MM2-default33.3899.7773.8999.410.8483908.4
MM2-alternative38.8196.8276.5397.060.85688011.9
SmoothQGram48.3395.8079.6397.400.87614 40245.7
Nanopore S.cerevisiae
MHAP31.9898.6956.0499.210.716525122.0
DALIGNER3.3295.8514.8397.830.25839415.4
MM2-default4.8099.297.6599.260.1422355.6
MM2-alternative33.4898.0070.6197.320.818349022.1
SmoothQGram41.7895.7771.2597.450.823655533.4
Nanopore D.melanogaster
MHAP53.0590.7156.3989.520.692371322.2
DALIGNER14.9779.2631.9783.210.46218 2598.1
MM2-default15.9798.3029.1997.790.4501405.2
MM2-alternative57.0995.0065.5792.030.766641622.2
SmoothQGram61.5493.9567.4795.490.791481227.1
Nanopore Human
MHAP28.5939.6252.0442.210.46611 17022.4
DALIGNER19.6771.2144.4675.340.55936427.6
MM2-default23.3399.3833.0399.450.4961135.1
MM2-alternative32.6184.2865.4081.030.724258711.1
SmoothQGram38.1693.5667.3693.360.783548025.9
Recall<2000 (%)Precision<2000 (%)Recall (%)Precision (%)F1 scoreCPU (s)Memory (G)
PacBio E.coli
MHAP34.5299.8860.9399.770.757378620.4
DALIGNER26.3695.2458.7397.410.73314768.4
MM2-default61.1199.9681.1299.970.8961955.1
MM2-alternative67.7999.8585.0199.890.9193327.5
SmoothQGram73.1199.8587.9899.930.936655030.6
PacBio S.cerevisiae
MHAP39.8696.6753.9793.090.683417622.2
DALIGNER10.0789.9031.9793.210.47615885.8
MM2-default8.0099.8312.4299.790.221743.4
MM2-alternative50.8095.6469.3593.280.79612309.0
SmoothQGram59.4999.2575.0099.340.855427424.3
PacBio D.melanogaster
MHAP28.5293.1641.1190.630.566397122.3
DALIGNER21.8869.9448.9376.260.59660 6439.3
MM2-default17.4598.6428.4898.330.4421235.1
MM2-alternative47.5292.9459.6688.020.71149 45932.8
SmoothQGram57.0095.5565.8796.980.785797635.4
PacBio Human
MHAP16.1474.7029.4172.790.419415222.3
DALIGNER11.2161.3431.7164.470.42528417.3
MM2-default15.0999.4527.2199.210.4271004.6
MM2-alternative28.3097.7746.1995.620.6233546.0
SmoothQGram33.2696.4351.3695.40.667555129.5
Nanopore E.coli
MHAP28.8797.0665.3396.930.781588721.8
DALIGNER10.7988.6232.7192.860.484492812.0
MM2-default33.3899.7773.8999.410.8483908.4
MM2-alternative38.8196.8276.5397.060.85688011.9
SmoothQGram48.3395.8079.6397.400.87614 40245.7
Nanopore S.cerevisiae
MHAP31.9898.6956.0499.210.716525122.0
DALIGNER3.3295.8514.8397.830.25839415.4
MM2-default4.8099.297.6599.260.1422355.6
MM2-alternative33.4898.0070.6197.320.818349022.1
SmoothQGram41.7895.7771.2597.450.823655533.4
Nanopore D.melanogaster
MHAP53.0590.7156.3989.520.692371322.2
DALIGNER14.9779.2631.9783.210.46218 2598.1
MM2-default15.9798.3029.1997.790.4501405.2
MM2-alternative57.0995.0065.5792.030.766641622.2
SmoothQGram61.5493.9567.4795.490.791481227.1
Nanopore Human
MHAP28.5939.6252.0442.210.46611 17022.4
DALIGNER19.6771.2144.4675.340.55936427.6
MM2-default23.3399.3833.0399.450.4961135.1
MM2-alternative32.6184.2865.4081.030.724258711.1
SmoothQGram38.1693.5667.3693.360.783548025.9

Note: Recall<2000 and precision<2000 show the measurements for overlaps of length 500 to 2000 bps. Recall and precision show the measurements for all true overlaps.

Highest F1 score should be highlighted in Bold.

Our advantage is greater when we consider the short overlaps with lengths [500,2000] where we gain on average about a 7.0% improvement on recall compared with the best competitor. Short overlaps are relatively not easy to detect as they contain fewer matched q-grams. Compared with existing algorithms, SmoothQGram using smooth q-grams as seeds is more sensitive and can capture short overlaps with higher probability. We observe that short overlap is the bottleneck for all the overlap detection algorithms: the recall on short overlaps is typical 20% lower than the overall recall for all algorithms.

For different datasets and species, we find that E.coli is the easiest one for all algorithms while D.melanogaster and Human are much more challenging, because there are larger numbers of repeated regions in these genomes. SmoothQGram is more robust on difficult datasets than the competitors. Moreover, we find that Nanopore datasets are more challenging than PacBio datasets for SmoothQGram. We notice that the overlaps in Nanopore are more likely to be broken into smaller sub-regions, which are more difficult to handle by our verification algorithm (Algorithm 4).

SmoothQGram’s running time is similar to MHAP but is worse than MM2-default. The reason is that instead of q-gram, SmoothQGram considers smooth q-gram, and thus it takes more time to Generate-Smooth-q-Gram signatures and verify candidate overlaps. On the other hand, this is also why SmoothQGram significantly improves the sensitivity for overlap detection. We note that in our experimental studies, we mainly focused on the accuracy; our codes were not optimized for running time.

We present the experimental results on simulated datasets in Supplementary Material S1.4.

4.2.2 The detection of short overlaps

When we focus on the detection of short overlaps (length 5002000 bps), we observe from Table 3 that SmoothQGram consistently outperforms other algorithms on all datasets. We would like to emphasize that the detection of short overlaps is critical in sequence assembly. This is because for long overlaps, the common mistake in the overlap detection is outputting overlaps with wrong lengths/locations, which may be corrected in the subsequent layout step. However, if short overlaps are missed in the overlap detection stage, they will not be recovered in future stages, which may induce more contigs in the final assembly, especially when the sequence coverage is not high.

4.3 Summary

In this section, we provide a thorough experimental study on smooth q-gram and its application to overlap detection. We observe that the smooth q-gram-based approach achieved better accuracy than the conventional q-gram-based approaches for overlap detection, which is mainly due to the fact that smooth q-gram is capable of capturing approximate matches between two sequences. Our algorithm is stable and robust on genome sequences that we have tested. The performance of our algorithm is more notable for short overlaps, and may improve the overall qualify of long-reads assembly.

5 Discussions

In this paper, we present SmoothQGram, a highly sensitive overlap detection algorithm for long, error-prone sequencing reads. We first propose smooth q-gram, a variant of q-gram which can capture q-grams with small edit distances, and then show that by using smooth q-gram-based signatures, the sensitivity of the overlap detection algorithm can be significantly improved. We also observe that SmoothQGram has greater advantage in detecting short overlaps, which has strong implications in improving the overall quality of genome assembly using long-reads.

SmoothQGram achieves robust performance on all the datasets with a single set of parameters. We may obtain even better sensitivity or running time if we adopt parameters for different datasets. In contrast, the commonly used Minimap2 used different parameters for PacBio and Nanopore data, and we have to select additional parameters to make its sensitivity satisfying. We observe that its results are highly sensitive with parameters.

Although we have only studied the overlap detection problem in this paper, we believe that smooth q-gram can be used in other cases, e.g. the other steps in fragment assembly (e.g. error corrections, consensus sequence generation) as well as other similarity search problems (e.g. sequence classification, sequence clustering), where sensitive and robust q-gram signatures are needed. We leave these applications as future work.

Funding

This work has been supported in part by the NSF CCF-1619081, IIS-1633215 and CCF-1844234.

Conflict of Interest: none declared.

References

Altschul
S.F.
 et al. (
1990
)
Basic local alignment search tool
.
J. Mol. Biol
.,
215
,
403
410
.

Belazzougui
D.
,
Zhang
Q.
(
2016
) Edit distance: sketching, streaming, and document exchange. In: IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016, 9–11 October 2016, Hyatt Regency, New Brunswick, New Jersey, USA, pp.
51
60
.

Berlin
K.
 et al. (
2015
)
Assembling large genomes with single-molecule sequencing and locality-sensitive hashing
.
Nat. Biotechnol
.,
33
,
623
630
.

Brudno
M.
 et al. (
2003
)
Lagan and multi-lagan: efficient tools for large-scale multiple alignment of genomic DNA
.
Genome Res
.,
13
,
721
731
.

Burkhardt
S.
,
Kärkkäinen
J.
(
2001
) Better filtering with gapped q-grams. In: Combinatorial Pattern Matching, 12th Annual Symposium, CPM2001 Jerusalem, Israel, July 1–4, 2001, Proceedings (Lecture Notes in Computer Science), Vol.
2089
, pp.
73
85
.

Burkhardt
S.
,
Kärkkäinen
J.
(
2002
) One-gapped q-gram filters for Levenshtein distance. In: Combinatorial Pattern Matching, 13th Annual Symposium, CPM 2002, Fukuoka, Japan, July 3-5, 2002, Proceedings (Lecture Notes in Computer Science), Vol.
2373
, pp.
225
234
.

Burkhardt
S.
,
Kärkkäinen
J.
(
2003
)
Better filtering with gapped q-grams
.
Fundam. Inform
.,
56
,
51
70
.

Chaisson
M.
,
Tesler
G.
(
2012
)
Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): theory and application
.
BMC Bioinform
.,
13
,
238
.

Chakraborty
D.
 et al. (
2016
) Streaming algorithms for embedding and computing edit distance in the low distance regime. In: Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC2016, Cambridge, MA, USA, June 18–21, 2016, pp.
712
725
.

Compeau
P.E.
 et al. (
2011
)
How to apply de Bruijn graphs to genome assembly
.
Nat. Biotechnol
.,
29
,
987
991
.

Jain
M.
 et al. (
2015
)
Improved data analysis for the minion nanopore sequencer
.
Nat. Methods
,
12
,
351
356
.

Keich
U.
 et al. (
2004
)
On spaced seeds for similarity search
.
Discrete Appl. Math
.,
138
,
253
263
.

Kurtz
S.
 et al. (
2004
)
Versatile and open software for comparing large genomes
.
Genome Biol
.,
5
,
R12
.

Li
H.
(
2016
)
Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences
.
Bioinformatics
,
32
,
2103
2110
.

Li
H.
(
2018
)
Minimap2: pairwise alignment for nucleotide sequences
.
Bioinformatics
,
34
,
3094
3100
.

Ma
B.
 et al. (
2002
)
Patternhunter: faster and more sensitive homology search
.
Bioinformatics
,
18
,
440
445
.

Manning
C.D.
,
Schütze
H.
(
1999
)
Foundations of Statistical Natural Language Processing
.
MIT Press
,
Cambridge, MA, USA
.

Mikheyev
A.S.
,
Tin
M.M.
(
2014
)
A first look at the Oxford Nanopore minion sequencer
.
Mol. Ecol. Resourc
.,
14
,
1097
1102
.

Miller
J.R.
 et al. (
2010
)
Assembly algorithms for next-generation sequencing data
.
Genomics
,
95
,
315
327
.

Myers
G.
(
2014
) Efficient local alignment discovery amongst noisy long reads. In: Algorithms in Bioinformatics – 14th International Workshop, Wroclaw, Poland, Proceedings (Lecture Notes in Computer Science), Vol.
8701
, pp.
52
67
.

Pevzner
P.A.
 et al. (
2001
)
An Eulerian path approach to DNA fragment assembly
.
Proc. Natl. Acad. Sci. USA
,
98
,
9748
9753
.

Qin
J.
 et al. (
2011
) Efficient exact edit similarity query processing with the asymmetric signature scheme. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12–16, 2011, pp.
1033
1044
.

Roberts
R.J.
 et al. (
2013
)
The advantages of SMRT sequencing
.
Genome Biol
.,
14
,
405
.

Schwartz
S.
 et al. (
2003
)
Human–mouse alignments with BLASTZ
.
Genome Res
.,
13
,
103
107
.

Sović
I.
 et al. (
2016
)
Fast and sensitive mapping of nanopore sequencing reads with graphmap
.
Nat. Commun
.,
7
,
11307
.

Wang
J.
 et al. (
2012
) Can we beat the prefix filtering? An adaptive framework for similarity join and search. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20–24, 2012, pp.
85
96
.

Xiao
C.
 et al. (
2008
)
Ed-join: an efficient algorithm for similarity joins with edit distance constraints
.
PVLDB
,
1
,
933
944
.

Yang
C.
 et al. (
2017
)
NanoSim: nanopore sequence read simulator based on statistical characterization
.
GigaScience
,
6
,
1
6
.

Zhang
H.
,
Zhang
Q.
(
2017
) Embedjoin: efficient edit similarity joins via embedding. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13–17, 2017, pp.
585
594
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Ponty Yann
Ponty Yann
Associate Editor
Search for other works by this author on:

Supplementary data