Bansho Masutani, Shinichi Morishita, A framework and an algorithm to detect low-abundance DNA by a handy sequencer and a palm-sized computer, Bioinformatics, Volume 35, Issue 4, February 2019, Pages 584–592, https://doi.org/10.1093/bioinformatics/bty663
Abstract
Detection of DNA at low abundance with respect to the entire sample is an important problem in areas such as epidemiology and field research, as these samples are highly contaminated with non-target DNA. Many methods have been developed to solve this problem, but all require additional time-consuming and costly procedures. Meanwhile, the MinION sequencer developed by Oxford Nanopore Technologies (ONT) is considered a powerful tool for tackling this problem, as it allows selective sequencing of target DNA. The main technology employed involves rejecting an undesirable read from a specific pore by inverting the voltage of that pore, a capability referred to as ‘Read Until’. Despite its usefulness, several issues remain to be solved in real situations. First, limited computational resources are available in field research and epidemiological applications. In addition, a high-speed online classification algorithm is required to make a prompt decision. Lastly, the lack of a theoretical approach for modeling selective sequencing makes it difficult to analyze and justify a given algorithm.
In this paper, we introduce a statistical model of selective sequencing, propose an efficient constant-time classifier for any background DNA profile, and validate its optimal precision. To confirm the feasibility of the proposed method in practice, we demonstrate on a pre-recorded mock sample that the method can selectively sequence a 100 kb region, constituting 0.1% of the entire read pool, and achieve approximately 500-fold amplification. Furthermore, the algorithm is shown to process 26 queries per second on a $500 palm-sized Next Unit of Computing box with an Intel® Core™ i7 CPU, without extended computing resources such as a GPU or high-performance computing. Next, we prepared a mixed DNA pool composed of Saccharomyces cerevisiae and lambda phage, in which any 200 kb region of S. cerevisiae constitutes 0.1% of the whole sample. From this sample, a 30–230 kb region of S. cerevisiae chromosome 1 was amplified approximately 30-fold. In addition, this method allows on-the-fly changing of the amplified region according to the uncovered characteristics of a given DNA sample.
The source code is available at: https://bitbucket.org/ban-m/dyss.
1 Introduction
After the appearance of long-read sequencers, such as the PacBio SMRT sequencer (Korlach et al., 2008; Levene et al., 2003) and the ONT MinION sequencer (Meller et al., 2000), reads of 50 kb or even longer are now possible. Long reads have numerous benefits, including de novo assembly, detection of novel structural variants, Iso-seq and gap-filling of reference genomes (Minot et al., 2013; Rhoads and Au, 2015).
However, making full use of its portability, the ONT MinION sequencer has found other niches, such as field research and epidemiology. For field research, sequencing studies conducted in the Antarctic (Johnson et al., 2017) and under microgravity (McIntyre et al., 2016) have demonstrated its feasibility in proof-of-concept experiments. Epidemiology and microbiology are other fields in which this method is useful. Detection of an epidemic and its associated pathogens has been an area of active research (Cao et al., 2016). For example, viral diagnosis using blood samples (Greninger et al., 2015) was reportedly successful under certain conditions. In addition, identification of plasmids, which are important in providing accessory functions to their host, is now possible (Margos et al., 2017). For clinical application, real-time detection of Zika virus and Ebola virus has been successfully carried out (Faria et al., 2016; Quick et al., 2016).
Despite this progress, rapid identification of pathogens of low abundance relative to the whole sample remains difficult. The reason for such low abundance is not only a low relative number of molecules, but also a much shorter sequence than that of its host. For example, bacteria have genome sizes on the mega-base pair order (Trevors, 1996), while plasmids have strands of 200 kb at most (Thomas and Summers, 2008). The DNA strands of DNA viruses that infect humans are usually dozens of kb, adding up to several hundred kb in total (Stano et al., 2016).
Although many methods to selectively increase the concentration of rare reads have been developed, all of these methods require additional preparation procedures that are costly or time-consuming (Djikeng et al., 2008; Hagberg et al., 2016; Kav et al., 2013; Matranga et al., 2014; Wylie et al., 2015). One may attempt to sequence a huge number of candidate reads to detect the target reads; however, this approach is only a partial solution and yields numerous irrelevant reads (Albertsen et al., 2013). Distinguishing focal reads from contamination is nontrivial, as indicated in a human virome study (Minot et al., 2013).
Thus, there is a pressing need to develop an online method for rapid amplification of target reads that make up as little as 0.01% in some situations, such as Zika virus (Quick et al., 2017).
To overcome these problems, selective sequencing using the ‘Read Until’ application programming interface (API) with the ONT MinION is promising, as it allows rejection of unwanted DNA fragments from MinION pores in real time, followed by loading of another fragment. One interesting feature of selective rejection of undesirable DNA fragments (e.g. those from the host) is that this function can be programmed into software without any additional preparation steps to concentrate the relevant reads. However, it is nontrivial to implement an efficient algorithm that will achieve selective sequencing in real time. In their pioneering research, Loose et al. used a Dynamic Time Warping (DTW) algorithm to process the signals from a MinION sequencer and demonstrated that a target DNA region of 20 kb in size could be effectively enriched (Loose et al., 2016).
However, some serious issues remain in practical applications. First, limited computational resources are available in field research and epidemiological applications, preventing a selection classifier from using massively parallel computation or GPU acceleration. In addition, a high-speed online classification algorithm must make a prompt decision. The classifier must keep up with the sequencer, which can currently sequence DNA molecules at up to 450 bases per second, one order of magnitude faster than in 2016, when Loose et al. proposed the DTW approach. Although state-of-the-art base callers are fast enough to process several kb per second, nanopore sequences are error-prone, and classification must be carried out by checking the first portion of these erroneous reads, which are likely to contain many indels relative to the entire read. Lastly, the lack of a theoretical approach for modeling selective sequencing makes it difficult to analyze and validate a given algorithm. Although DNA sequencing has been elucidated by information theory and discrete mathematics (Motahari et al., 2012), no serious studies on selective sequencing have yet been conducted. Overall, selective sequencing differs from traditional sequence analysis in its real-time nature and its large bias toward negative reads, as noted previously.
2 Approach
The principle of the nanopore sequencer is that the four DNA bases yield different current intensities when passing through a nanopore on a flow cell with a motor protein. A series of thousands of raw signals from each nanopore is transmitted independently and simultaneously every second in real time. This allows sequencing data to be analyzed online, as soon as the sequencer outputs it, whereas traditional high-throughput short-read sequencers are slow in calling each base for billions of reads per hour and allow processing of the results only after the procedure has finished, in a batch manner.
Moreover, a unique feature of nanopore sequencing is its function in providing reverse streams of information from a sequencer to each individual pore while monitoring sequencing on the fly. In particular, one can send a message to reverse the voltage of a specific pore such that the motor protein moves backward and eventually rejects the strand running through that pore.
Here, one may first consider converting the signals into four bases and then determining whether to accept the query. Although base calling and alignment have been accelerated in recent years, they still require substantial computational resources. Since we assume situations in which only relatively weak computational power is available at hand, we skip the base-calling step and instead compare the signal itself directly to the reference, as proposed in a previous study (Loose et al., 2016). Here, we develop a suite of methods to accelerate this approach so that it works on a palmtop computer.
To derive a very simple but strong classifier for read selection, we start by describing a formalization of selective sequencing and the criteria for better classifiers. Then, we propose an optimal algorithm that extends a traditional signal-processing algorithm to make full use of this formalization.
3 Materials and methods
3.1 Theoretical foundation of selective sequencing
Determining appropriate criteria for selective sequencing is nontrivial, because we consider the case in which only a small fraction of all queries is positive. There are similar problems in the machine-learning community, namely F-score optimization (Nan et al., 2012). Using an empirical utility maximization approach or a decision-theoretic approach, one can optimize the F-score if the training set is sufficiently large and the model family is well-specified (Nan et al., 2012), or if a necessary condition is met in a more generalized setting (Lipton et al., 2014). Unfortunately, these traditional approaches explicitly use the prior distribution of positive queries, which cannot be estimated before sequencing of the sample is complete.
In an effort to characterize the inherent nature of selective sequencing, we provide a formal definition of it using terms in online prediction theory. First, we define the policy function for deciding whether to accept a given query considering the score of the query.
(Policy). Let be real numbers and be positive real numbers. Policy π is a function , such that does not contain any isolated points.
Hereafter, we call the set representation of policy π. For example, a function with as its set representation is not a valid policy, while one with is a valid policy because it does not contain isolated point 4. The procedure of selective sequencing is defined as follows:
(Sequencing). Let be an odds ratio, be sequencing time, D0 and D1 be two probabilistic distributions on , and be the fraction of target reads at the time . A sequencing is a protocol between a player and an environment such that
For do
- The environment chooses a positive query (yt = 1) with probability ϵt and a negative query (yt = 0) otherwise; namely(1)
The environment sends to the player a raw signal, from which the player calculates a feature that follows the distribution through an arbitrary procedure. For example, qt may be a mapping score obtained by sequence comparison after base calling.
The player predicts whether or not the query is positive, denoted by , where π is a policy function.
Remark. The first term of (2) corresponds to precision, while the second corresponds to sensitivity. Note that this formalization differs from that given in traditional online prediction theory (Cesa-Bianchi and Lugosi, 2006), in that the answer yt is never revealed until the end of the entire protocol. Because the fraction of target reads depends on the time course t, this definition remains valid if we change the reference during sequencing. There are a couple of limitations to this formalization. First, the length of reads is treated as a constant. In addition, the fractions ϵt are assumed to be independent of each other, which is not the case in general. With these issues in mind, we will employ this formalization and analyze an algorithm.
One might consider using the expectation of with respect to queries as a criterion for a given algorithm, but this is analytically difficult. To approximate the value of , we instead use an alternative criterion:
We also denote as S(A), where A is the set representation of π.
The main question here is whether we can identify the optimal policy that maximizes the objective function . To this end, we derive the score for an arbitrary policy function π.
Substituting each term in (3) with these equations above yields (4).
In a real application, although the relative abundance of the positive query is very low, the absolute abundance is not, which means that we place more weight on precision than on sensitivity, i.e. we can assume λ is small and set it to 0.
To determine the optimal policy which maximizes (4), we will use the following lemma:
Remark. P is the probability that a query q is in A, given that the query comes from the positive distribution. N is almost the same as P, except that q comes from the negative distribution.
Proof. Let be the set representation of an arbitrary policy function. We want to show if and only if (8) and (9) hold.
By defining and taking the limit of , we can prove that (11) implies (8). (When x is at the boundary of A, we can solve this case similarly by defining or . Since A does not contain any isolated points, we have now exhausted all cases.)
Likewise, we can prove that (13) implies (9).
Likewise, we can prove that . Thus, we have for all . This proves that if and only if (8) and (9) hold.
We are now in a position to prove the following main theorem using the lemma above:
Theorem 3.3 (Optimality of the threshold). Let be a sequencing, be the probability density functions for D0 and D1, and . Suppose that λ = 0, is a strictly monotonically decreasing function, and some θ exists such that . The optimal policy A for S is .
Proof. Proof by contradiction. If there is some set such that for all , we can define two cases:
If there is some larger than θ, , this indicates that B is not optimal.
If there is some less than θ, , then B is not optimal.
We call this policy function ‘Thrθ’, where the optimal threshold θ does not depend on the fraction of target reads (ϵ).
Remark. One might be concerned that this theorem assumes two ideal conditions; namely, that the probability density functions and are known, and that is monotone. In reality, however, both assumptions are reasonable and are supported by the empirical data shown in Section 4. We also note that similar results can be achieved when is strictly monotonically increasing. This result might be useful for making decisions based on mapping score or quality.
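In code, the Thrθ policy reduces each accept/reject decision to a single comparison, which is what makes classification constant-time per query. A minimal Python sketch (the function name is ours; the comparison direction assumes higher scores indicate better matches, as in the candidate selection of Section 3.3, and should be flipped for a distance-style score):

```python
def make_thr_policy(theta):
    """Build the Thr_theta policy of Theorem 3.3: accept a query iff its
    score clears the threshold. One stored float, O(1) work per query.
    Assumes higher score = better match; flip the comparison for a
    distance-style score."""
    return lambda score: score >= theta

# Hypothetical usage with the threshold value estimated in Section 4.1.3.
accept = make_thr_policy(26.29)
print(accept(40.0), accept(10.0))
```

Because the optimal threshold does not depend on the fraction of target reads, the same policy object can be reused unchanged as the sample composition drifts during a run.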
3.2 Outline of our algorithm
Our algorithm consists of one offline step and three online steps (Fig. 1). In the offline step, we train the classifier Thrθ using pre-recorded event streams. We use the term 'event stream' for the nanopore signal after segmentation, which converts about seven raw signals into one event and thus reduces the time requirements.

In the online steps, a nucleotide sequence to be selectively sequenced is first converted into an event stream by a 6-mer hidden Markov model (HMM). Hereafter, we refer to the sequence to be amplified simply as the 'reference'. Second, as queries are supplied, the similarity between each query and the reference is computed. Using this similarity, our classifier determines whether the query is accepted. In all steps, we determine similarity by DTW, the same algorithm employed in the previous study (Loose et al., 2016), and accelerate the performance of DTW using additional heuristics.
The 6-mer HMM, which is used to convert reference sequences into nanopore-like signals, is an HMM , where π is the initial distribution, is the state, is the transition probability from s to t and is the normal distribution of the current intensity of a pore at state s. We used parameters that had already been trained by ONT.
Using this model, a given DNA string is converted into an event stream as follows:
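A sketch of the conversion in Python, using a toy 6-mer table in place of the ONT-trained parameters (the model is simplified to per-6-mer emission means and standard deviations; transition probabilities are omitted, and all values are hypothetical):

```python
import random

def sequence_to_events(seq, model, noise=0.0):
    """Convert a DNA string into a nanopore-like event stream by sliding
    a 6-mer window and emitting the model's mean current for each 6-mer.
    `model` maps each 6-mer to (mean, stdev); `noise` scales optional
    Gaussian perturbation of the emitted mean."""
    events = []
    for i in range(len(seq) - 5):
        mean, stdev = model[seq[i:i + 6]]
        events.append(mean + random.gauss(0.0, stdev) * noise)
    return events

# Toy model: one entry per 6-mer present in the example sequence.
toy_model = {"ACGTAC": (80.0, 1.5), "CGTACG": (95.0, 1.5), "GTACGT": (70.0, 1.5)}
print(sequence_to_events("ACGTACGT", toy_model))  # three events for an 8-mer
```

An n-base reference thus yields n − 5 events, one per overlapping 6-mer.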
3.3 Basic characteristics of DTW
DTW is a metric used to compute the similarity between a real-valued query and a reference (Vintsyuk, 1968). Combined with a k-nearest neighbor classifier, this metric has long been used in the field of signal processing as a powerful baseline (Bagnall et al., 2017).
Here, is an element-wise similarity function.
As in sequence comparison, one can trace back the dynamic programming table to determine where the alignment starts. Thus, it is straightforward to use the location where the alignment starts as a criterion, and this approach was used in previous research (Loose et al., 2016). However, this approach does not perform well unless the entire genomic sequence in a given sample is available beforehand, and the complete-knowledge assumption is too stringent in practice. For example, in an uncultured microbiome, a sample is often heavily contaminated with unknown, non-target bacteria. We can overcome this difficulty even when only the target reference to be amplified is known, because we can select candidate reads by checking whether their DTW scores against the target reference are higher than the threshold of an optimal policy for maximizing precision.
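The subsequence variant of DTW used here can be written down compactly: the query must be consumed in full, but the alignment may begin and end anywhere on the reference. Below is a textbook distance formulation in Python (a sketch only; the paper's tuned version uses a Hill-function score and the heuristics of Section 3.4):

```python
def sub_dtw(query, ref, dist=lambda a, b: abs(a - b)):
    """Subsequence DTW: align the whole query against any region of the
    reference. Free start/end on the reference axis; O(mn) time, O(n)
    space via two rolling rows."""
    INF = float("inf")
    prev = [0.0] * (len(ref) + 1)          # row 0: alignment may start anywhere
    for qi in query:
        cur = [INF] * (len(ref) + 1)
        for j, rj in enumerate(ref, start=1):
            cur[j] = dist(qi, rj) + min(prev[j - 1], prev[j], cur[j - 1])
        prev = cur
    return min(prev[1:])                   # alignment may end anywhere

print(sub_dtw([3, 4], [9, 9, 3, 4, 9]))    # 0.0: exact match inside the reference
```

Keeping the per-end-position row also yields the alignment end point, from which the start can be recovered by traceback as described above.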
3.4 Fast approximation
3.4.1 Existing algorithm/approximation
Many heuristics and algorithms have been proposed for DTW to date, but few methods are suited for selective sequencing, in which most queries are negative, and the ratio of the query to reference length is large.
To illustrate this situation in more detail, we here consider four representative methods: the Sakoe–Chiba band (Sakoe and Chiba, 1978), Sparse-DTW (Al-Naymat et al., 2009), Fast-DTW (Salvador and Chan, 2007) and the UCR suite (Rakthanmanon et al., 2012).
First, although the Sakoe–Chiba band and Sparse-DTW have been widely accepted as good for approximation and acceleration (Dau et al., 2017), they are used mainly for alignments between sequences of similar length. More precisely, if one extends these algorithms to sub-DTW, the worst-case time complexity becomes O(mn).
Unfortunately, the linear time heuristic Fast-DTW fails to approximate sub-DTW well. Figure 2 shows a typical example illustrating a large discrepancy between Fast-DTW scores and the optimal scores from sub-DTW. This failure occurs because the procedure first constructs a candidate path at a coarse resolution and refines the path as the resolution becomes finer, which does not work well when the reference is hundreds of times longer than the query.

Comparison between optimal scores and the Fast-DTW (Salvador and Chan, 2007) approximation. We used the pre-recorded dataset described in Section 4.1. The reference size was set to 200 K events, the query size to 500, and the radius parameter in Fast-DTW to 70. In the legend, FastSub means the Fast-DTW approximation of sub-DTW, while Sub means the optimal algorithm (sub-DTW). Positive scores were generated by mapping queries to their corresponding regions in the reference, while negative scores were generated by assigning queries to random, irrelevant regions. The two distributions of positive and negative scores under the Fast-DTW approximation overlap, which means we cannot distinguish positive instances from negative ones using Fast-DTW scores. Furthermore, the distributions differ from those of the optimal algorithm
Lastly, the UCR suite, which has multiple pruning systems and operates quite fast with a smooth signal, is also unsuccessful at detecting similarity, because nanopore signals do not satisfy this condition.
3.4.2 Re-chunking
To compute sub-DTW for two nanopore event sequences, our proposed method consists of three key heuristics or algorithms that make full use of the Thrθ classifier: re-chunking, seeding and pruning.
Re-chunking is a heuristic that compresses redundant parts of a query. More precisely, re-chunking sequentially looks up a query and computes the distance between the current average of the chunking region and the current value; then, if the distance is below the threshold τ, the current value is chunked, and the average value is updated. The entire algorithm is shown in Algorithm 1.
Algorithm 1 Re-chunking an m-length query
Input: query q1, …, qm and threshold τ
Output: the compressed query res
1: (res, prev, len) ← ([], q1, 1)
2: for i = 2, …, m do
3: if |qi − prev| < τ then
4: prev ← (prev · len + qi)/(len + 1)
5: len ← len + 1
6: else
7: res.push(prev)
8: (prev, len) ← (qi, 1)
9: res.push(prev)
10: return res.length, res
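The same procedure in runnable form (a Python sketch following the description above; the flush of the final chunk before returning is implied by that description):

```python
def rechunk(query, tau):
    """Compress runs of near-constant signal: merge each value into the
    current chunk while it stays within tau of the chunk's running mean;
    otherwise emit the mean and start a new chunk."""
    res = []
    prev, length = query[0], 1
    for x in query[1:]:
        if abs(x - prev) < tau:
            prev = (prev * length + x) / (length + 1)  # update running mean
            length += 1
        else:
            res.append(prev)
            prev, length = x, 1
    res.append(prev)  # flush the final chunk
    return len(res), res

print(rechunk([1.0, 1.1, 0.9, 5.0, 5.0], 0.5))  # two chunks survive
```

Because nanopore events within one 6-mer context tend to hover around the same current level, long near-constant runs collapse into single values, shrinking the query before the more expensive DTW step.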
3.4.3 Seeding
Seeding is a common technique in sequence comparison. In particular, when empowered by accurate index systems such as a suffix array, the Burrows–Wheeler transform with an FM-index, or a minimizer, seeding dramatically reduces the computational time (Altschul et al., 1990; Roberts et al., 2004). Even though some indexing methods have already been developed for the DTW metric (Luo and Shrivastava, 2017), we use the simplest seed-extending approach in this first attempt at a selective sequencing application.
In the proposed method, seeding (k, s) is decomposed into three steps: (i) aligning the subsequence of the query, (ii) picking up the top k smallest scores and their start positions and (iii) extending each candidate. For Step (i), we construct a subquery that is s times smaller than the original query by slicing off the first m/s elements. Selection of the top k candidates can be carried out in O(n) time via the Floyd–Rivest algorithm (Floyd and Rivest, 1975). Taking into account that , we assert that the total time complexity is , which is s times faster than that previously attained.
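The three steps can be sketched as follows in Python. This is a simplified illustration: `heapq.nsmallest` stands in for the Floyd–Rivest selection, a plain absolute-difference cost stands in for the tuned score, and the extension window size is our assumption rather than a value from the paper:

```python
import heapq

def _sub_dtw_rows(query, ref, dist):
    # Subsequence DTW returning per-end-position scores (free start on ref).
    INF = float("inf")
    prev = [0.0] * (len(ref) + 1)
    for qi in query:
        cur = [INF] * (len(ref) + 1)
        for j, rj in enumerate(ref, start=1):
            cur[j] = dist(qi, rj) + min(prev[j - 1], prev[j], cur[j - 1])
        prev = cur
    return prev

def seed_and_extend(query, ref, k, s, dist=lambda a, b: abs(a - b)):
    """Seeding (k, s): (i) align the first len(query)//s events of the
    query, (ii) keep the k best end positions, (iii) extend each candidate
    with a full alignment against a local window of the reference."""
    sub = query[: len(query) // s]
    ends = _sub_dtw_rows(sub, ref, dist)
    candidates = heapq.nsmallest(k, range(1, len(ref) + 1),
                                 key=lambda j: ends[j])
    best, window = float("inf"), 2 * len(query)   # assumed window size
    for j in candidates:
        lo = max(0, j - len(sub))
        hi = min(len(ref), lo + window)
        best = min(best, min(_sub_dtw_rows(query, ref[lo:hi], dist)[1:]))
    return best
```

Only k short windows are revisited at full query length, which is where the claimed s-fold speed-up over aligning the whole query everywhere comes from.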
3.4.4 Threshold pruning
For sub-DTW computation (20), we can easily prove the following lemma:
Therefore, when we use the Thrθ classifier, we can safely quit computation once holds for some query index j. Although this approach is similar to common pruning systems, this technique differs from other approaches in that it stops all computation for a query, not a portion of the procedure.
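The pruning idea can be illustrated on the plain distance formulation of sub-DTW: because every cell adds a non-negative cost, the minimum over a row never decreases, so once it exceeds the threshold the whole query can be rejected outright. This Python sketch restates the lemma in that setting; the paper states it for its own score in (20):

```python
def sub_dtw_pruned(query, ref, theta, dist=lambda a, b: abs(a - b)):
    """Sub-DTW with threshold pruning: row minima are non-decreasing
    because per-cell costs are non-negative, so once the best value in a
    row exceeds theta the final score must too, and the query is
    rejected immediately."""
    INF = float("inf")
    prev = [0.0] * (len(ref) + 1)
    for qi in query:
        cur = [INF] * (len(ref) + 1)
        for j, rj in enumerate(ref, start=1):
            cur[j] = dist(qi, rj) + min(prev[j - 1], prev[j], cur[j - 1])
        if min(cur[1:]) > theta:
            return None                    # reject: the score can only grow
        prev = cur
    return min(prev[1:])

print(sub_dtw_pruned([3, 4], [9, 9, 3, 4, 9], theta=2.0))   # 0.0 (accepted)
print(sub_dtw_pruned([3, 4], [9, 9, 9, 9, 9], theta=2.0))   # None (pruned)
```

Since most queries in selective sequencing are negative, most computations terminate after only a few rows, which is exactly the regime this pruning targets.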
4 Results
4.1 Pre-recorded samples
A real sequencing dataset that had been recorded previously was used for verification of our approach. Specifically, we downloaded data from the Loman lab’s public data and used 1D passed reads (see Data access).
4.1.1 Sequence bias
First, we checked whether the DTW score depends on the GC content of a reference genome. To this end, we synthesized genomes with various GC contents (from 40% to 60% in 1% steps) and calculated DTW scores for the pre-recorded data. The variance of the means was 0.224, while the mean of the means was 67.3. Since the Shapiro–Wilk test rejected normality for all GC contents with small P-values (<10^-15), we applied the Kruskal–Wallis test to the entire dataset, which yielded a P-value of 0.63. Although a statistical test is conclusive only when it rejects the null hypothesis, any remaining variance appears negligible as long as the background sequence is assumed to be random. Moreover, these results suggest that the DTW scores of a negative query may be invariant to the background DNA composition to some extent.
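The testing procedure above can be reproduced in outline with SciPy. The score samples below are synthetic stand-ins (random gamma draws, one group per simulated GC bin), not the recorded data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# One sample of DTW scores per simulated GC content (40%, 41%, ..., 60%).
groups = [rng.gamma(shape=20.0, scale=3.3, size=200) for _ in range(21)]

# Shapiro-Wilk per group: small p-values reject normality, which is what
# motivates using a rank-based test across the groups.
shapiro_ps = [stats.shapiro(g).pvalue for g in groups]

# Kruskal-Wallis: do the score distributions differ across GC bins?
h, p = stats.kruskal(*groups)
print(f"Kruskal-Wallis H = {h:.2f}, p = {p:.3f}")
```

A large Kruskal–Wallis P-value, as observed in the experiment, is consistent with (though does not prove) GC-independent negative scores.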
4.1.2 Parameter tuning
The dataset was created as follows. First, reads were mapped to an Escherichia coli reference genome using Minimap2 (Li, 2018) with the default parameters. Then, non-primary mappings and alignments with low mapping quality (<10) were filtered out. Next, we selected 500 events, excluding the first 50 noisy events, from these reads and aligned them to a subset of the reference signals converted from the reference genome after normalization. The reverse complement strand was used to construct negative queries.
Parameter tuning was carried out using 5-fold cross-validation for each parameter. Changing the score function from Euclidean distance to a fine-tuned Hill function improved the accuracy of the 15-nearest neighbor method (Fig. 3). The remaining parameters, two for seed-extending and one for re-chunking, were set to and , respectively. Even though the accuracy decreased (to ∼80%), the time needed to process a query became approximately 6-fold shorter.

Relationship between Hill function (21) parameters (α, β) and accuracy. Black line: accuracy when β was set to 2. Red line: accuracy when Euclidean distance was used (96.7%). Inset: heat map of accuracy with various α and β values. Using a Hill function with instead of naïve Euclidean distance, the accuracy was improved to 97.1%
Although the flow cell of the MinION has been revised a few times since the dataset was generated, the proposed method provided similar results after re-tuning only the re-chunking parameter, with the other parameters unchanged.
4.1.3 Verification of the thresholding algorithm
Next, we checked the practical validity of the condition required by Theorem 3.3, namely that the ratio of probability density functions is monotonically decreasing. Indeed, this condition seemed to hold for the proposed method when the probability density functions were estimated from empirical distributions (Fig. 4). Note that we cannot assert that this condition is true for all real values, as few reliable estimates exist far outside the observed range of scores. Nonetheless, this result suggests that the condition holds over a reasonably broad range, and thus the Thrθ algorithm is acceptable for practical application.

Ratio of each probability density function. The reference size was set to 200 kb, and queries were 250 events long. Green vertical line: the estimated threshold (26.29). Inset: histogram of dynamic time warping scores. Red represents negative scores, and blue represents positive scores
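The check behind this figure can be approximated as follows: estimate the two score densities with histograms and inspect their ratio over the bins where both are observed. The Gaussian scores below are synthetic placeholders; in the experiment the positive and negative scores come from mapped reads:

```python
import numpy as np

rng = np.random.default_rng(1)
pos = rng.normal(20.0, 4.0, 5000)   # scores of positive (matching) queries
neg = rng.normal(45.0, 6.0, 5000)   # scores of negative (random-region) queries

bins = np.linspace(0.0, 70.0, 36)
f1, _ = np.histogram(pos, bins=bins, density=True)
f0, _ = np.histogram(neg, bins=bins, density=True)

# The ratio f1/f0 is only defined where both densities were observed;
# the condition of Theorem 3.3 asks whether it is monotone over this range.
mask = (f0 > 0) & (f1 > 0)
ratio = f1[mask] / f0[mask]
print("bins where both densities were observed:", int(mask.sum()))
```

Far outside the observed score range both histograms are empty, which is exactly why the monotonicity condition can only be verified over a bounded interval, as noted above.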
4.1.4 Comparison with existing methods
Assuming that very few positive queries exist in the read pool in a real application, we configured a dataset containing only 0.1% positive queries to assess the performance of various methods. To evaluate speed, we used a dataset containing only negative queries. These two sets were disjoint from the training dataset used in the previous section. The baselines were naïve sub-DTW, which fills all cells in (20), Sakoe–Chiba banded DTW, and the seed-extending method, each using a 15-NN classifier. These experiments were carried out on a 2-inch × 4-inch × 4-inch next unit of computing (Intel® Core i7-7567U with a clock rate of 3.5 GHz and 16 GB primary memory).
Our results suggest that the proposed method is the fastest of the four tested (Table 1). Considering the trade-off between precision and sensitivity, our algorithm optimizes precision so as to detect rare positives; hence its sensitivity was lower than that of traditional sub-DTW, but the overall performance, i.e. the F-score, of the proposed method was higher than that of the traditional methods. In addition, when we used the metric defined as (), the Thrθ classifier was better than the 15-NN classifier as long as the parameter λ was reasonably small (Fig. 5). Concerning memory usage, our method requires only several additional megabytes, because the classifier needs just one variable for the threshold plus the memory for the DTW computation. Although there is uncertainty as to how much memory allocation occurs inside the API, 16 GB of physical memory is typically enough.
| Method | Time (sec) | Precision | Sensitivity | F-score | Parameters |
|---|---|---|---|---|---|
| Sub | 1.06 | 5.69% | 91.20% | 0.106 | t:0 |
| Sakoe | 3.09 | 16.30% | 88.30% | 0.273 | t:0.40, b:61 |
| Sakoe | 0.556 | 0.18% | 47.50% | 0.004 | t:0.40, b:11 |
| Seeding | 0.229 | 20.60% | 88.10% | 0.331 | t:0.37, s:2, k:14 |
| Seeding | 0.151 | 1.08% | 62.10% | 0.021 | t:0.36, s:3, k:16 |
| Proposed | 0.154 | 50.20% | 54.00% | 0.518 | t:0.38, s:3, k:14 |
| Proposed | 0.153 | 46.20% | 47.50% | 0.466 | t:0.35, s:3, k:14 |
Note: The reference size was 100 kb, and the queries were 250 events long. For all methods other than the proposed one, a 15-nearest neighbor classifier was used. In the parameter descriptions, t refers to the re-chunking parameter (1), tuned from 0 to 50; b denotes the bandwidth of the Sakoe–Chiba algorithm, ranging from 11 to 91; s and k denote the number of packs and the number of candidates in the seeding step (see Section 3.4.3).

Comparison between the 15-nearest neighbor and Thrθ algorithms. The seed-extending method was used to calculate scores from queries. Red line: 15-nearest neighbor with optimal parameters. Green line: 15-nearest neighbor with the same parameters as the blue line. Blue line: proposed method, with parameters as described in Table 1. Since the Thrθ algorithm was designed for the case λ = 0, the parameters for seeding and chunking were tuned to obtain the optimal score with λ = 0
4.2 Real experiment
To validate our method in a real application, we prepared a mixed DNA pool composed of Saccharomyces cerevisiae and lambda phage. The mass ratio of lambda phage to the S. cerevisiae genome was set to 4.2. This made the target region of the S. cerevisiae genome, 30–230 kb of chromosome 1, very rare in the whole sample. The reference genome was downloaded from NCBI (S288C). Sequencing was performed for 18.5 h at room temperature and repeated twice on the same 13-inch MacBook Pro (Intel® Core i5 with a clock rate of 2.0 GHz and 8 GB primary memory), each time using a different flow cell with more than 1000 active pores.
For a negative control, we applied the rejection procedure only to channels with odd numbers, excluding channels with even numbers. This made it possible to compare not only the precision but also the number of target reads that the sequencer obtained. Because the total base pair number depends strongly on very long reads (>100 kb), this metric serves as a good additional criterion.
The results showed that the fraction of target reads improved compared with the negative control (Table 2). Although the improvement in amplification rate was slight compared with that of pre-recorded samples, this is partly because the simulation did not take into account that selective sequencing must observe at least a few thousand signals before classifying a read. When relevant reads are very rare, the total number of non-target reads increases and influences the results.
Table 2. Results of selective sequencing of the S.cerevisiae and lambda phage mixture

| | Total kb | Total reads | Target kb | Target reads |
|---|---|---|---|---|
| Repl.1 | 175 467 | 97 704 | 1698 | 161 |
| Control | 1 319 785 | 73 779 | 1660 | 120 |
| Repl.2 | 119 430 | 65 949 | 1277 | 121 |
| Control | 1 283 144 | 68 119 | 1401 | 102 |
Note: The first two columns represent the results in terms of total base pairs and number of reads, while the following two columns represent the results for the target region. Repl. indicates biological replicates, and control denotes the results obtained without the ‘Read Until’ API. Minimap2 (Li, 2018) with the -k 10 option was used to map reads to references. Reads with mapping quality values equal to 0 were filtered out, and only primary mapping was used.
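The mapping filter described in the note can be sketched as a per-record predicate over minimap2's PAF output. This is an illustrative reconstruction, not the authors' script; it assumes minimap2's PAF layout, where column 12 is the mapping quality and the optional `tp:A:P`/`tp:A:S` tag marks primary/secondary alignments.

```python
def keep_alignment(paf_line: str) -> bool:
    """Keep a minimap2 PAF record only if its mapping quality is
    nonzero and it is a primary alignment (tp:A:P tag)."""
    fields = paf_line.rstrip("\n").split("\t")
    mapq = int(fields[11])            # column 12: mapping quality
    return mapq > 0 and "tp:A:P" in fields[12:]
```

Applied line by line to the PAF stream, this reproduces the "mapping quality > 0, primary mapping only" criterion used for the counts in Table 2.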
Indeed, when we filtered out reads shorter than 5 kb, the amplification rate of base pairs improved to 34.3× (from 0.123% to 4.23%) and 27.2× (from 0.107% to 2.91%) for the two biological replicates. This filtering procedure was unlikely to create a bias toward lambda phage sequences, because the median read lengths from lambda phage and S.cerevisiae in the negative control were 14 and 12 kb, respectively, for the first replicate and 19 and 12 kb for the second.
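The quoted fold-amplification values are the ratio of the target base-pair fraction under selection to that in the control, after length filtering. A minimal sketch of this arithmetic (the function names and toy numbers are illustrative, not the authors' code):

```python
def filtered_bp(read_lengths, min_len=5_000):
    """Total base pairs contributed by reads at least min_len long
    (the 5 kb filter applied in the text)."""
    return sum(l for l in read_lengths if l >= min_len)

def amplification_rate(target_bp_sel, total_bp_sel, target_bp_ctl, total_bp_ctl):
    """Fold amplification: the target fraction under selective
    sequencing divided by the target fraction in the control."""
    return (target_bp_sel / total_bp_sel) / (target_bp_ctl / total_bp_ctl)
```

For example, a target fraction of 4% under selection against 0.1% in the control corresponds to a 40-fold amplification.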
One might argue instead that the overall throughput decreased and that the ‘Read Until’ API therefore offered little improvement. The total number of bases did drop from 1.3 Gbp to 175 Mbp and from 1.3 Gbp to 119 Mbp in the two replicates, but this was probably due to the rejection of undesirable reads: the total amount of relevant bases did not decrease sharply (Table 2). Therefore, even though the total sequencing throughput decreased, the API likely did not reduce the throughput of relevant reads.
We also validated our method on another sample, composed of E.coli and lambda phage. Specifically, the mass ratio of lambda phage to the E.coli genome was set to 15, and the target sequence was the 0–200 kb region of the reference E.coli genome. After filtering out reads shorter than 5 kb, the amplification rates were approximately 10-fold for the 4.5 h run and 21-fold for the 18.5 h run (Table 3).
Table 3. Results of selective sequencing of the E.coli and lambda phage mixture

| | Total kb | Total reads | Target kb | Target reads |
|---|---|---|---|---|
| Repl.1 | 323 666 | 121 042 | 2905 | 570 |
| Control | 770 849 | 64 503 | 1769 | 289 |
| Repl.2 | 167 990 | 95 280 | 2297 | 317 |
| Control | 1 412 678 | 83 469 | 2514 | 226 |
Note: This table summarizes the results of selective sequencing using the mixture of E.coli and lambda phage, and is organized in the same way as Table 2. Repl.1 and Repl.2 represent two biological replicates of the mixture; the respective sequencing experiments were run for 4.5 h and 18.5 h to show how sequencing time may affect the amplification rate.
In addition, the reference was swapped during the experiment for the budding yeast sample. Specifically, after the 30–230 kb region of the first chromosome of S.cerevisiae was amplified for 5.5 h, the reference sequence was changed to the 30–230 kb region of the second chromosome, and the run was continued for another 9 h to obtain similar coverage. The results clearly showed that both regions were amplified (Fig. 6). Specifically, the relative amplification rates of base pairs were 20.1× (from 0.18% to 3.5%) and 23.8× (from 0.19% to 4.5%) for the first and second chromosomes, respectively, with respect to reads longer than 5 kb. Thus, when characteristics of a given sample become apparent during sequencing, one can change the focal region on the fly to maximize the sequencing capability.

Hot Swapping of references. Here, control refers to results obtained without using the ‘Read Until’ API, while target indicates those obtained using the API. The x-axis shows the position on the chromosome. The y-axis shows the relative abundance of each position, i.e. (coverage)/(total bp), smoothed with a 150 bp window using a simple script available from the bitbucket repository (see Data access). Minimap2 (Li, 2018) -k 10 was used to map the reads to references. The total number of relevant reads improved from 100 to 150 and from 70 to 121 for the two chromosomes
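The per-position quantity plotted in the figure, (coverage)/(total bp) smoothed with a 150 bp window, can be sketched as a centered moving average. The actual plotting script is available from the bitbucket repository; this is only an illustrative reimplementation with assumed names.

```python
def relative_abundance(coverage, total_bp, window=150):
    """Per-position relative abundance, (coverage)/(total bp),
    smoothed with a centered moving average over `window` positions.
    The window is truncated at the ends of the chromosome."""
    rel = [c / total_bp for c in coverage]
    half = window // 2
    smoothed = []
    for i in range(len(rel)):
        lo, hi = max(0, i - half), min(len(rel), i + half + 1)
        smoothed.append(sum(rel[lo:hi]) / (hi - lo))
    return smoothed
```

Dividing by the total base-pair yield makes the selective-sequencing and control runs comparable despite their very different throughputs.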
5 Discussion
Amplifying low-abundance reads using a purely software approach was shown to be achievable using the ‘Read Until’ API bundled with the MinION sequencer. Specifically, reads at very low abundance (0.1%) were amplified to approximately 4%. The method correctly rejected irrelevant reads during sequencing and thus reduced the total time required to identify target reads in a sample. Our results also suggest that it is feasible to swap the amplification region during sequencing. This capability makes our approach unique compared with CRISPR-Cas approaches, which process a sample irreversibly. Although not all reads can currently be classified when the pores on a flow cell are highly active (roughly the first 2 h of a run), algorithmic improvements such as hashing (Luo and Shrivastava, 2017) or base-calling approaches would make it possible to classify all queries.
In addition, we proposed a mathematical framework for selective sequencing. Since this formalization assumes the similarity score for each query is given, it is flexible enough to incorporate other approaches, such as base calling followed by sequence comparison. Herein, the optimality of our simple algorithm with respect to precision was demonstrated and found to hold regardless of the fraction of relevant reads.
We named our framework MoSS (a Model of Selective Sequencing) and our algorithm DySS (Dynamic time warping for Selective Sequencing). We demonstrated the merit of our method using mock samples, but applications of our method to more complex samples under different situations would help us improve our open source program, DySS.
Even though current pore occupancy depends strongly on flow-cell and library activity, future improvements in flow cells and library kits should bring results from real experiments closer to those from pre-recorded samples.
The portability of the MinION sequencer, combined with the ‘Read Until’ API, will drastically reduce total time and cost by enriching target reads, and thus advance research fields such as viromics, plasmidomics and pathogen detection. We also hope that this technology will be employed in the field of DNA storage (Organick et al., 2018) to facilitate random access to the entire DNA pool.
Acknowledgements
We would like to thank Oxford Nanopore Technologies for allowing us to use the Read Until API to determine the feasibility of our theory and algorithms for selective sequencing. We also thank Tsurutsuki Sugai, Shigeko Tsujikawa and Wei Qu for their technical support with nanopore sequencing experiments.
Data access
The source code used in the training and experiments is available from a bitbucket.org repository (https://bitbucket.org/ban-m/dyss). The R9.2 1D flow cell dataset used in Section 4 can be downloaded from Nicholas J. Loman’s website (http://lab.loman.net/2016/07/30/nanopore-r9-data-release/). The log files of the ‘Read Until’ API and the corresponding results are available from a bitbucket.org repository (https://bitbucket.org/ban-m/readuntillogfiles) and the SRA repository (SRA accession: SRP136587). The S.cerevisiae and E.coli genomes were purchased from NBRC, NITE, Kisarazu, Japan (sample IDs 1136G and 12713G).
Funding
This work was supported in part by the Core Research for Evolutional Science and Technology (CREST) program [JPMJCR13W3] from the Japan Science and Technology Agency (JST) to S.M. and the Advanced Genome Research and Bioinformatics Study to Facilitate Medical Innovation from the Japan Agency for Medical Research and Development (AMED) to S.M.
Conflict of Interest: none declared.
References