Abstract

Motivation

Detection of DNA at low abundance with respect to the entire sample is an important problem in areas such as epidemiology and field research, as these samples are highly contaminated with non-target DNA. To solve this problem, many methods have been developed to date, but all require additional time-consuming and costly procedures. Meanwhile, the MinION sequencer developed by Oxford Nanopore Technology (ONT) is considered a powerful tool for tackling this problem, as it allows selective sequencing of target DNA. The main technology employed involves rejection of an undesirable read from a specific pore by inverting the voltage of that pore, which is referred to as ‘Read Until’. Despite its usefulness, several issues remain to be solved in real situations. First, limited computational resources are available in field research and epidemiological applications. In addition, a high-speed online classification algorithm is required to make a prompt decision. Lastly, the lack of a theoretical approach for modeling of selective sequencing makes it difficult to analyze and justify a given algorithm.

Results

In this paper, we introduced a statistical model of selective sequencing, proposed an efficient constant-time classifier for any background DNA profile, and validated its optimal precision. To confirm the feasibility of the proposed method in practice, for a pre-recorded mock sample, we demonstrate that the method can selectively sequence a 100 kb region, consisting of 0.1% of the entire read pool, and achieve approximately 500-fold amplification. Furthermore, the algorithm is shown to process 26 queries per second with a $500 palm-sized next unit of computing box using an Intel® CoreTMi7 CPU without extended computer resources such as a GPU or high-performance computing. Next, we prepared a mixed DNA pool composed of Saccharomyces cerevisiae and lambda phage, in which any 200 kb region of S.cerevisiae consists of 0.1% of the whole sample. From this sample, a 30–230 kb region of S.cerevisiae chromosome 1 was amplified approximately 30-fold. In addition, this method allowed on-the-fly changing of the amplified region according to the uncovered characteristics of a given DNA sample.

Availability and implementation

The source code is available at: https://bitbucket.org/ban-m/dyss.

1 Introduction

After the appearance of long-read sequencers, such as the PacBio SMRT sequencer (Korlach et al., 2008; Levene et al., 2003) and ONT MinION sequencer (Meller et al., 2000), reads up to or even longer than 50 kb are now possible. There are numerous benefits of long reads, including de novo assembly, detection of novel structural variants, Iso-seq and gap-filling of reference genomes (Minot et al., 2013; Rhoads and Au, 2015).

However, making full use of its portability, the ONT MinION sequencer has found another niche such as roles in field research and epidemiology. For field research, sequencing studies conducted in the Antarctic (Johnson et al., 2017) and under microgravity (McIntyre et al., 2016) have shown its feasibility in proof-of-concept experiments. Epidemiology and microbiology are other fields in which this method is useful. Detection of an epidemic and its associated pathogens has been an area of active research (Cao et al., 2016). For example, viral diagnosis using blood samples (Greninger et al., 2015) was reportedly successful under certain conditions. In addition, identification of plasmids, which are important in providing accessory functions to their host, is now possible (Margos et al., 2017). For clinical application, real-time detection of Zika virus and Ebola virus has been successfully carried out (Faria et al., 2016; Quick et al., 2016).

Despite this progress, rapid identification of pathogens of low abundance relative to the whole sample remains difficult. The reason for such low abundance is not only a low relative number of molecules, but also a much shorter sequence than that of its host. For example, bacteria have genome sizes on the mega-base pair order (Trevors, 1996), while plasmids have strands of 200 kb at most (Thomas and Summers, 2008). The DNA strands of DNA viruses that infect humans are usually dozens of kb, adding up to several hundred kb in total (Stano et al., 2016).

Although many methods to selectively increase the concentration of rare reads have been developed, all of these methods require additional preparation procedures that are costly or time-consuming (Djikeng et al., 2008; Hagberg et al., 2016; Kav et al., 2013; Matranga et al., 2014; Wylie et al., 2015). One may attempt to sequence a huge number of candidate reads to detect the target reads; however, this approach is only a partial solution and yields numerous irrelevant reads (Albertsen et al., 2013). Distinguishing focal reads from contamination is nontrivial, as indicated in a human virome study (Minot et al., 2013).

Thus, there is a pressing need to develop an online method for rapid amplification of target reads that make up as little as 0.01% in some situations, such as Zika virus (Quick et al., 2017).

To overcome these problems, selective sequencing using the ‘Read Until’ application programming interface (API) with the ONT MinION is promising, as it allows rejection of unwanted DNA fragments from MinION pores in real time, followed by loading of another fragment. One interesting feature of selective rejection of undesirable DNA fragments (e.g. those from the host) is that this function can be programmed into software without any additional preparation steps to concentrate the relevant reads. However, it is nontrivial to implement an efficient algorithm that will achieve selective sequencing in real time. In their pioneering research, Loose et al. used a Dynamic Time Warping (DTW) algorithm to process the signals from a MinION sequencer and demonstrated that a target DNA region of 20 kb in size could be effectively enriched (Loose et al., 2016).

However, some serious issues in practical applications remain. First, limited computational resources are available in field research and epidemiological applications, preventing a selection classifier from using massively parallel computation or GPU acceleration. In addition, a high-speed online classification algorithm must make a prompt decision. The classifier must keep up with the sequencer, which currently can sequence DNA molecules at up to 450 bases per second, which is one order of magnitude faster than the processing time in 2016, when Loose et al. proposed the DTW algorithm. Although state-of-the-art base callers are fast enough to process several kb per second, nanopore sequences are error-prone, and the classification must be carried out by checking the first portion of these erroneous reads that are likely to contain many indels compared to the entire read. Lastly, the lack of a theoretical approach for modeling selective sequencing makes it difficult to analyze and validate a given algorithm. Although DNA sequencing has been elucidated by information theory and discrete mathematics (Motahari et al., 2012), no serious studies on selective sequencing have yet been conducted. Overall, selective sequencing differs from traditional sequence analysis in its real-time nature and large bias on negative reads, as noted previously.

2 Approach

The principle of the Nanopore sequencer is that the four DNA bases yield different current intensities when passing through a nanopore on a flow cell with a motor protein. A series of thousands of raw signals from each nanopore are transmitted to the sequencer independently and simultaneously every second in real time. This allows analysis of sequencing data as soon as the sequencer outputs it online, whereas traditional high-throughput short-read sequencers are slow in calling each base for billions of reads per hour and allow processing of the results only after finishing the procedure in a batch manner.

Moreover, a unique feature of nanopore sequencing is its function in providing reverse streams of information from a sequencer to each individual pore while monitoring sequencing on the fly. In particular, one can send a message to reverse the voltage of a specific pore such that the motor protein moves backward and eventually rejects the strand running through that pore.

Here, one may consider first to convert the signals into four bases, then determine whether to accept the query. Although base-calling step and alignment procedure have been accelerated in recent years, they still require substantial computational resources. Since we assume situations where relatively weak computational power is available at hand, we skip the base calling step and instead compare the signal itself directly to the reference, as proposed by the previous study (Loose et al., 2016). Here, we develop a suite of methods for accelerating the approach to work on a palm top computer.

To derive a very simple but strong classifier for read selection, we start by describing a formalization of selective sequencing and the criteria for better classifiers. Then, we propose an optimal algorithm, which is an extension of the traditional signal-processing algorithm to make full use of it.

3 Materials and methods

3.1 Theoretical foundation of selective sequencing

Determining appropriate criteria for selective sequencing is nontrivial, because we consider the case when only a small fraction of all queries is positive. There are similar problems in the machine-learning community, namely F-score optimization (Nan et al., 2012). Using an empirical utility maximization approach or decision-theoretic approach, one can optimize the F-score if the training set is sufficiently large, and the model family is well-specified (Nan et al., 2012), or if a necessary condition is met in a more generalized situation (Lipton et al., 2014). Unfortunately, these traditional approaches explicitly use the prior distribution of positive queries, which cannot be estimated a priori before one has completed sequencing of that sample.

In an effort to characterize the inherent nature of selective sequencing, we provide a formal definition of it using terms in online prediction theory. First, we define the policy function for deciding whether to accept a given query considering the score of the query.

 

Definition 1

(Policy). Let R be real numbers and R+ be positive real numbers. Policy π is a function R+{0,1}, such that {x|π(x)=1} does not contain any isolated points.

Hereafter, we call {x|π(x)=1} the set representation of policy π. For example, a function with [1,3]{4}[6,7] as its set representation is not a valid policy, while one with [1,3][6,7] is a valid policy because it does not contain isolated point 4. The procedure of selective sequencing is defined as follows:

 

Definition 2

(Sequencing). Let λR be an odds ratio, TN be sequencing time, D0 and D1 be two probabilistic distributions on R+, and ϵt(0,1) be the fraction of target reads at the time t=1,,T. A sequencing (λ,T,D0,D1,{ϵt}t=1T) is a protocol between a player and an environment such that

For t=1,,T do

  1. The environment chooses a positive query (yt = 1) with probability ϵt and a negative query (yt = 0) otherwise; namely
    (1)
  2. The environment sends to the player a raw signal, from which the player calculates a feature qtR+ that follows the distribution Dyt through an arbitrary procedure. For example, qt may be a mapping score obtained by sequence comparison after base calling.

  3. The player predicts whether or not the query is positive, denoted by y˜t=π(qt){1,0}, where π is a policy function.

We define the total benefit that the player gets as follows:
(2)
where y˜t=π(qt).

 

Remark. The first term of (2) corresponds to precision, while the second one does to sensitivity. Note that this formalization differs from that given in traditional online prediction theory (Cesa-Bianchi and Lugosi, 2006), in that the answer yt is never revealed until the end of the entire protocol. Because the fraction of target reads depends on the time course t, this definition remains valid if we change a reference during sequencing. There are a couple of limitations to this formalization. First, the length of reads is considered a constant. In addition, the fractions of ϵt for t=1,,T are assumed to be independent from each other, which is not the case in general. With these issues in mind, we will employ this formalization and analyze an algorithm.

 

Remark. We assume that the probability density functions f0(x) and f1(x) (for distribution D0, D1, respectively) are positive C1 functions, and A={x|π(x)=1} for a policy π satisfies the following conditions:
to eliminate trivial cases.

One might consider using the expectation of B(π) with respect to queries {Yt}t=1T as a criterion for a given algorithm, but this is analytically difficult. To approximate the value of B(π), we instead use an alternative criterion:

 

Definition 3
(Objective function). The objective function for a given policy π is
(3)

We also denote S(π) as S(A), where A is the set representation of π.

The main question here is whether we can identify the optimal policy that maximizes the objective function S(π). To this end, we derive the score for an arbitrary policy function π.

 

Lemma 3.1. Let π denotes a policy function for sequencing (λ,T,D0,D1,{ϵt}), and A denotes its set representation {x|π(x)=1}. Let ϵ be the average fraction of target reads, 1/Tϵt, and f0(x),f1(x) denote the probability density functions of D0, D1, respectively. Then, the equation below holds:
(4)

 

Proof. Based on the definition of sequencing, {Y˜t|t=1..T} can be regarded as independently sampled. Thus,
(5)
For each component,
(6)

Because Qt follows the distribution DYt, using Bayes’ theorem,
In the first term on the right side of the equation, Yt˜ does not depend on the answer Yt, since the player cannot predict the future. Therefore,
In combination with the definition of selective sequencing, we obtain:
(7)
Through similar calculations, we determine:

Substituting each term in (3) with these equations above yields (4).

In a real application, although the relative abundance of the positive query is very low, the absolute abundance is not, which means that we place more weight on precision than on sensitivity, i.e. we can assume λ is small and set it to 0.

To determine the optimal policy which maximizes (4), we will use the following lemma:

 

Lemma 3.2. Let (λ,T,D0,D1,{ϵt}) be a sequencing reaction and A be a set representation of a policy. When λ = 0, A is the optimal policy if and only if the inequalities below hold:
(8)
(9)

 

Remark. P is the probability that a query q is in A, given that the query comes from the positive distribution. N is almost the same as P, except that q comes from the negative distribution.

 

Proof. Let A˜ be the set representation of an arbitrary policy function. We want to show S(A)S(A˜)0 if and only if (8) and (9) hold.

(Only-if part) First, assume that A˜A, and let D=AA˜. Then, we have

As we assumed λ  = 0, we get
where Dp=Df1(x)dx, and Dn=Df0(x)dx. The denominator is positive, because A˜ and A are not empty. Thus, we have
(10)
(11)

By defining D=[xh,x+h] and taking the limit of h0, we can prove (11) (8). (When x is at the boundary of A, we can similarly solve this problem by defining D:=[xh,x] or D:=[x,x+h]. Since A does not contain any isolated points, we now exhaust all the cases.)

Second, assume AA˜, and let D=A˜A. Then, similarly, we have
(12)
(13)

Likewise, we can prove (13) (9).

(If-part) For the arbitrary set A˜R+, we can decompose A˜=(AB)C, such that BA, and AC=. Then we get
(14)
(15)
A simple calculation shows that the sign of S(A)S(A˜) is determined by:
(16)
where Bp=Bf1(x)dx,Bn=Bf0(x)dx,Cp=Cf1(x)dx,Cn=Cf0(x)dx.
From (8),

Likewise, we can prove that CpNPCn. Thus, we have S(A)S(A˜)0 for all A˜. This proves that S(A)S(A˜)>0 if and only if (8) and (9) hold.

We are now in a position to prove the following main theorem using the lemma above:

 

Theorem 3.3 (Optimality for threshold). LetS=(λ,T,D0,D1,ϵ)be a sequencing, f0(x),f1(x)be the probability density function for D0, D1andP(h)=0hf1(x)dx,N(h)=0hf0(x)dx. Suppose that λ = 0, f1(x)/f0(x)is a strictly monotonous decreasing function, and some θ exists such thatf1(θ)/f0(θ)=P(θ)/N(θ). The optimal policy A for S isA={x|xθ}.

 

Proof. Proof by contradiction. If there is some set BA such that S(B)S(A˜)>0 for all A˜R+, we can define two cases:

  • If there is some xB larger than θ, f1(x)/f0(x)<f1(θ)/f0(θ)=P(θ)/N(θ), this indicates that B is not optimal.

  • If there is some xB less than θ, f1(x)/f0(x)>f1(θ)/f0(θ)=P(θ)/N(θ), then B is not optimal.

We call this policy function ‘Thrθ’, where the optimal threshold θ does not depend on the fraction of target reads (ϵ).

 

Remark. One might be concerned that this theorem assumes two ideal conditions; namely, that the probability density functions f0(x) and f1(x) are known, and that f1(x)/f0(x) is monotonous. In reality, however, both assumptions are reasonable and are supported by empirical data, which are shown in Section 4. We also note that we can achieve similar results when f1(x)/f0(x) is strictly increasing monotonously. This result might be useful for making decisions based on mapping score or quality.

3.2 Outline of our algorithm

Our algorithm consists of one offline step and three online steps (Fig. 1). In the offline step, we train the classifier Thrθ using pre-recorded event streams. We use event streams to describe the nanopore signal after segmentation, which converts about seven raw signals into one event and thus reduces the time requirements.

A schematic view of our work-flow in online steps
Fig. 1.

A schematic view of our work-flow in online steps

In the online steps, a nucleotide sequence to be selectively sequenced is first converted into an event stream by a 6-mer hidden Markov model (HMM). We hereafter call the sequence to be amplified as just a ‘reference’. Second, as queries are supplied, the similarity between query and reference is computed. By using this similarity, our classifier determines whether the query is accepted. In all steps, we determine similarity by DTW, the same algorithm employed in the previous study (Loose et al., 2016), and accelerate the performance of DTW using additional heuristics.

The 6-mer HMM, which is used to convert reference sequences to nanopore-like signals, is a HMM (π,Σ6,R,P,{μs},{σs}), where π is the initial distribution, Σ6={A,T,G,C}6 is the state, Ps,t is the transition probability from s to t and N(μs,σs2) is the normal distribution of the current intensity of a pore at state s. We used parameters that had already been trained by ONT.

Using this model, a given DNA string r˜Σn is converted into an event stream rRn5 as follows: ri=μr˜ii+5

3.3 Basic characteristics of DTW

DTW is a metric used to compute the similarity between a real-value query and a reference (Vintsyuk, 1968). Combined with the k-nearest neighbor, this metric has long been used in the field of signal processing as a powerful baseline (Bagnall et al., 2017).

To determine the similarity between two sequences r=r0r1rn1Rn and q=q0q1qm1Rm of different lengths (nm), we employ a variation of DTW called sub-DTW, which is defined as follows:
(17)
(18)
where
(19)
(20)

Here, d:R×RR is an element-wise similarity function.

Intuitively, this function can be seen as a counterpart to semi-global alignment in sequence analysis, where the gap, match and mismatch cost are replaced with an element-wise similarity score and the max operator with a min operator. Thus, d should return a small positive value when the two arguments are close to each other and a saturated value when the two are far from each other. For this, we use a Hill function:
(21)
which gives a better result than does Euclidian distance or its square.

Similar to sequence comparison, one can trace back the dynamic programming table to determine where the alignment starts. Thus, it is straightforward to use the location where the alignment starts as a criterion, and this algorithm has been used in previous research (Loose et al., 2016). However, this algorithm does not perform well unless the entire genomic sequence in a given sample is available beforehand. The complete knowledge assumption is too stringent in practice. For example, in uncultured microbiome, a sample is often heavily contaminated with non-target unknown bacteria. We can overcome this difficulty even when we only know the target reference to be amplified because we can select candidate reads with the target reference by checking whether their DTW scores are higher than the threshold of an optimal policy for maximizing the precision.

3.4 Fast approximation

3.4.1 Existing algorithm/approximation

Many heuristics and algorithms have been proposed for DTW to date, but few methods are suited for selective sequencing, in which most queries are negative, and the ratio of the query to reference length is large.

To illustrate this situation in more detail, we here consider four representative methods: the Sakoe–Chiba band (Sakoe and Chiba, 1978), Sparse-DTW (Altschul et al., 1990), Fast-DTW (Salvador and Chan, 2007) and UCR suite (Rakthanmanon et al., 2012).

First, although the Sakoe–Chiba band and Sparse-DTW have been widely accepted as good for approximation and acceleration (Dau et al., 2017), they are used mainly for alignments between sequences of similar length. More precisely, if one extends these algorithms to sub-DTW, the worst-case time complexity becomes O(mn).

Unfortunately, the linear time heuristic Fast-DTW fails to approximate sub-DTW well. Figure 2 shows a typical example illustrating a large discrepancy between Fast-DTW scores and the optimal scores from sub-DTW. This failure occurs because the procedure first constructs a candidate path at a coarse resolution and refines the path as the resolution becomes finer, which does not work well when the reference is hundreds of times longer than the query.

Comparison between optimal scores and Fast-DTW (Hagberg et al., 2016) approximation. We used the pre-recorded dataset described in Section 4.1 of Section 4. The reference size was set to 200 K events, the query size to 500, and the radius parameter in Fast-DTW to 70. In the legend, FastSub means Fast-DTW approximation for sub-DTW, while Sub means the optimal algorithm (sub-DTW). Positive scores were generated by mapping queries to its corresponding regions in the reference, while negative scores by assigning queries to random, irrelevant regions. The two distributions of positive and negative scores of the Fast-DTW approximation overlap, which means we cannot distinguish positive instances from negative ones using Fast-DTW scores. Furthermore, the distributions are different from those of the optimal algorithm
Fig. 2.

Comparison between optimal scores and Fast-DTW (Hagberg et al., 2016) approximation. We used the pre-recorded dataset described in Section 4.1 of Section 4. The reference size was set to 200 K events, the query size to 500, and the radius parameter in Fast-DTW to 70. In the legend, FastSub means Fast-DTW approximation for sub-DTW, while Sub means the optimal algorithm (sub-DTW). Positive scores were generated by mapping queries to its corresponding regions in the reference, while negative scores by assigning queries to random, irrelevant regions. The two distributions of positive and negative scores of the Fast-DTW approximation overlap, which means we cannot distinguish positive instances from negative ones using Fast-DTW scores. Furthermore, the distributions are different from those of the optimal algorithm

Lastly, the UCR suite, which has multiple pruning systems and operates quite fast with a smooth signal, is also unsuccessful at detecting similarity, because nanopore signals do not satisfy this condition.

3.4.2 Re-chunking

To compute sub-DTW for two nanopore event sequences, our proposed method consists of three key heuristics or algorithms that make full use of the Thrθ classifier: re-chunking, seeding and pruning.

Re-chunking is a heuristic that compresses redundant parts of a query. More precisely, re-chunking sequentially looks up a query and computes the distance between the current average of the chunking region and the current value; then, if the distance is below the threshold τ, the current value is chunked, and the average value is updated. The entire algorithm is shown in Algorithm 1.

Algorithm 1

Chunking m length query

Input:  qRm,τR

Output:  kN,q˜Rk

1: (res, prev, len) ([ ],,1)

2: fori0m1do

3:  ifd(prev,qi)<τthen

4:   prev (prev*len+qi)/(len+1)

5:   len len+1

6:  else

7:   res.push(prev)

8:   (prev, len) (qi,1)

9: return res.length, res

3.4.3 Seeding

Seeding is a common technique used in sequencing comparisons. In particular, when empowered by accurate index systems such as a suffix array, Barrows Wheeler Transformation with FM-index or Minimizer, seeding dramatically reduces the computational time (Altschul et al., 1990; Roberts et al., 2004). Even though there have already been some indexing methods developed for the DTW metric (Luo and Shrivastava, 2017), we use the simplest seed-extending approach in our first attempt at a selective sequencing application.

In the proposed method, seeding (k, s) is decomposed into three steps: (i) aligning the subsequence of the query, (ii) picking up the top k smallest scores and their start positions and (iii) extending each candidate. For Step (i), we construct a subquery that is s times smaller than the original query by slicing off the first m/s elements. Selection of the top k candidates can be carried out in O(n) time via the Floyd–Rivest algorithm (Floyd and Rivest, 1975). Taking into account that nk,m, we assert that the total time complexity is O(nm/s), which is s times faster than that previously attained.

3.4.4 Threshold pruning

For sub-DTW computation (20), we can easily prove the following lemma:

 

Lemma 3.3 (Monotonicity of sub-DTW score). Let rRn and qRm be reference and query sequences, respectively, and D be the n × m DP-matrix computed in sub-DTW. Then, the inequality below holds:
(22)

Therefore, when we use the Thrθ classifier, we can safely quit computation once miniD[i][j]>θ holds for some query index j. Although this approach is similar to common pruning systems, this technique differs from other approaches in that it stops all computation for a query, not a portion of the procedure.

4 Results

4.1 Pre-recorded samples

A real sequencing dataset that had been recorded previously was used for verification of our approach. Specifically, we downloaded data from the Loman lab’s public data and used 1D passed reads (see Data access).

4.1.1 Sequence bias

First, we checked whether there was any dependency of the DTW score on the GC content of a reference genome. To this end, we synthesized genomes with various GC contents and calculated DTW scores for the pre-recorded data (from 40% to 60% by 1% step). The variance of the means was 0.224, while the mean of the means was 67.3. Since the Shapiro–Wilk test was rejected for all GC contents with a small P-value (<10e-15), we applied the Kruskal–Wallis test to the entire dataset, and the P-value obtained was 0.63. Although a statistical testing is only meaningful when the test fails, there might be a negligible amount of variance as long as the background sequence is assumed to be random. Moreover, these results suggest that DTW scores of a negative query may be invariant to background DNA composition to some extent.

4.1.2 Parameter tuning

The dataset was created as follows. First, reads were mapped to an Escherichia coli reference genome using Minimap2 (Li, 2018) with the default parameters. Then, non-primary mapping and alignments with a low mapping quality (<10) were filtered out. Next, we selected 500 events excluding the first 50 noisy events from these reads and aligned them to a subset of the reference signals converted from the reference genome after normalization. The reverse complement strand was used to construct negative queries.

Parameter tuning was carried out using 5-fold cross-validation for each parameter. By changing the score function from Euclidean distance to a fine-tuned Hill function, the accuracy was improved when using the 15-nearest neighbor method (Fig. 3). The remaining parameters, two for seed-extending and one for re-chunking, were set to s=3,k=14 and τ=0.36, respectively. Even though the accuracy was decreased (∼80%), the time needed for processing a query was approximately 6-fold faster.

Relationship between Hill function (21) parameters (α, β) and accuracy. Black line: accuracy when β was set to 2. Red line: accuracy when Euclidean distance was used (96.7%). Inset: heat map of accuracy with various α and β values. Using a Hill function with α=1.0,β=2 instead of naïve Euclidean distance, the accuracy was improved to 97.1%
Fig. 3.

Relationship between Hill function (21) parameters (α, β) and accuracy. Black line: accuracy when β was set to 2. Red line: accuracy when Euclidean distance was used (96.7%). Inset: heat map of accuracy with various α and β values. Using a Hill function with α=1.0,β=2 instead of naïve Euclidean distance, the accuracy was improved to 97.1%

Although the flow cell of the MinION has been changed a few times since the dataset was initiated, the proposed method provided similar results through re-tuning of the re-chunking parameter, while other parameters were unchanged.

4.1.3 Verification of the thresholding algorithm

Next, we checked the practical validity of the condition required by Theorem (3.3)—that the ratio of probability density functions was monotonously decreasing. Indeed, this condition seemed to hold in the proposed method when estimating the probability density function using empirical distributions (Fig. 4). Note that we cannot assert that this condition is true for all real values, as few reliable estimates might exist far outside the observed range of scores. Nonetheless, this result suggests that the condition holds over a reasonably broad range, and thus the Thrθ algorithm is acceptable for practical application.

Ratio of each probability density function. The reference size was set to 200 kb, and queries were 250 events long. Green vertical line: the estimated threshold (26.29). Inset: histogram of dynamic time warping scores. Red represents negative scores, and blue represents positive scores
Fig. 4.

Ratio of each probability density function. The reference size was set to 200 kb, and queries were 250 events long. Green vertical line: the estimated threshold (26.29). Inset: histogram of dynamic time warping scores. Red represents negative scores, and blue represents positive scores

4.1.4 Comparison with existing methods

Assuming there are very few positive queries in the read pool in a real application, we configured a dataset that contained only 0.1% positive queries to assess the performance of various methods. To evaluate speed, we used a dataset that contained only negative queries. These two sets were disjointed from the training dataset used in the previous section. The baseline was naïve sub-DTW, which fills all cells in (20), Sakoe–Chiba banded DTW and the seed-extending method using a 15-NN classifier. These experiments were carried out using a 2-inch × 4-inch × 4-inch next unit of computing (Intel[textregistered] Core i7-7567U with a clock rate of 3.5 GHz and 16 GB primary memory).

Our results suggested that the proposed method is the fastest of the four tested (Table 1). Considering the trade-off between the precision and sensitivity, our algorithm optimized the precision so as to detect rare positives, and hence its sensitivity was lower than a traditional sub-DTW, but the overall performance, i.e. the F-score, of the proposed method was higher than that of traditional ones. In addition, when we use the metric defined as (3), the Thrθ classifier was better than the 15-NN classifier as long as the parameter λ is reasonably small (Fig. 5). Concerning memory usage, our method only requires additional several Mbyte memory because classifier needs only one variable for the threshold and O(m+n) memory for DTW computation. Although there is uncertainty as to how much memory allocation inside the API would occur, 18 GB of physical memory is typically enough.

Table 1.

Evaluation of each method using a pre-recorded dataset

MethodTime (sec)PrecisionSensitivityF-scoreParameters
Sub1.065.69%91.20%0.106t:0
Sakoe3.0916.30%88.30%0.273t:0.40, b:61
Sakoe0.5560.18%47.50%0.004t:0.40, b:11
Seeding0.22920.60%88.10%0.331t:0.37, s:2, k:14
Seeding0.1511.08%62.10%0.021t:0.36, s:3, k:16
Proposed0.15450.20%54.00%0.518t:0.38, s:3, k:14
Proposed0.15346.20%47.50%0.466t:0.35, s:3, k:14
MethodTime (sec)PrecisionSensitivityF-scoreParameters
Sub1.065.69%91.20%0.106t:0
Sakoe3.0916.30%88.30%0.273t:0.40, b:61
Sakoe0.5560.18%47.50%0.004t:0.40, b:11
Seeding0.22920.60%88.10%0.331t:0.37, s:2, k:14
Seeding0.1511.08%62.10%0.021t:0.36, s:3, k:16
Proposed0.15450.20%54.00%0.518t:0.38, s:3, k:14
Proposed0.15346.20%47.50%0.466t:0.35, s:3, k:14

Note: The reference size was 100 kb, and the queries were 250 events long. Aside from the proposed method, a 15-nearest neighbor was used as a classifier. In the parameter descriptions, t refers to the re-chunking parameter (1). This parameter was tuned from 0 to 50. b denotes the bandwidth of the Sakoe–Chiba algorithm. This parameter ranged from 11 to 91. s and k denote the number of packs and number of candidates in the seeding step (see Section 3.4.3). The best score for each criterion is represented in bold.

Table 1.

Evaluation of each method using a pre-recorded dataset

MethodTime (sec)PrecisionSensitivityF-scoreParameters
Sub1.065.69%91.20%0.106t:0
Sakoe3.0916.30%88.30%0.273t:0.40, b:61
Sakoe0.5560.18%47.50%0.004t:0.40, b:11
Seeding0.22920.60%88.10%0.331t:0.37, s:2, k:14
Seeding0.1511.08%62.10%0.021t:0.36, s:3, k:16
Proposed0.15450.20%54.00%0.518t:0.38, s:3, k:14
Proposed0.15346.20%47.50%0.466t:0.35, s:3, k:14
MethodTime (sec)PrecisionSensitivityF-scoreParameters
Sub1.065.69%91.20%0.106t:0
Sakoe3.0916.30%88.30%0.273t:0.40, b:61
Sakoe0.5560.18%47.50%0.004t:0.40, b:11
Seeding0.22920.60%88.10%0.331t:0.37, s:2, k:14
Seeding0.1511.08%62.10%0.021t:0.36, s:3, k:16
Proposed0.15450.20%54.00%0.518t:0.38, s:3, k:14
Proposed0.15346.20%47.50%0.466t:0.35, s:3, k:14

Note: The reference size was 100 kb, and the queries were 250 events long. Aside from the proposed method, a 15-nearest neighbor was used as a classifier. In the parameter descriptions, t refers to the re-chunking parameter (1). This parameter was tuned from 0 to 50. b denotes the bandwidth of the Sakoe–Chiba algorithm. This parameter ranged from 11 to 91. s and k denote the number of packs and number of candidates in the seeding step (see Section 3.4.3). The best score for each criterion is represented in bold.

Comparison between the 15-nearest neighbor and Thrθ algorithms. The seed-extending method was used to calculate scores from queries. Red line: 15-nearest neighbor with optimal parameters. Since the Thrθ algorithm was designed for cases when λ = 0, the parameters for seeding and chunking were tuned to obtain the optimal score with λ = 0. Green line: 15-nearest neighbor. Parameters are the same as for the blue line. Blue line: Proposed method. The parameters are as described in Table 1
Fig. 5.

Comparison between the 15-nearest neighbor and Thrθ algorithms. The seed-extending method was used to calculate scores from queries. Red line: 15-nearest neighbor with optimal parameters. Since the Thrθ algorithm was designed for cases when λ = 0, the parameters for seeding and chunking were tuned to obtain the optimal score with λ = 0. Green line: 15-nearest neighbor. Parameters are the same as for the blue line. Blue line: Proposed method. The parameters are as described in Table 1

4.2 Real experiment

To validate our method in a real application, a mixed DNA pool composed of Saccharomyces cerevisiae and lambda phage was configured. The mass ratio of lambda phage to the S.cerevisiae genome was set to 4.2. This made the target region of the S.cerevisiae genome, 30–230 kb of chromosome 1, very rare in the whole sample. The reference genome was downloaded from NCBI(S288C). Sequencing was performed for 18.5 h at room temperature and repeated twice on the same 13-inch MacBookPro (Intel[textregistered] Core i5 with a clock rate of 2.0 GHz and 8 GB primary memory) using a different flow cell with more than 1000 active pores.

For a negative control, we applied the rejection procedure only to channels with odd numbers, excluding channels with even numbers. This made it possible to compare not only the precision but also the number of target reads that the sequencer obtained. Because the total base pair number depends strongly on very long reads (>100 kb), this metric serves as a good additional criterion.

The results showed that the fraction of target reads was improved compared with the negative control (Table 2). Although the amplification rate improvement was very slight compared with that of pre-recorded samples, this occurred partly because the simulation did not take into account that selective sequencing must observe at least a few thousand signals to classify them. When the relevant reads were very rare, the total number of non-target reads increased and influenced the results.

Table 2.

Results of selective sequencing using S.cerevisiae and lambda phage

Total kbTotal readsTarget kbTarget reads
Repl.1175 46797 7041698161
Control1 319 78573 7791660120
Repl.2119 43065 9491277121
Control1 283 14468 1191401102
Total kbTotal readsTarget kbTarget reads
Repl.1175 46797 7041698161
Control1 319 78573 7791660120
Repl.2119 43065 9491277121
Control1 283 14468 1191401102

Note: The first two columns represent the results in terms of total base pairs and number of reads, while the following two columns represent the results for the target region. Repl. indicates biological replicates, and control denotes the results obtained without the ‘Read Until’ API. Minimap2 (Li, 2018) with the -k 10 option was used to map reads to references. Reads with mapping quality values equal to 0 were filtered out, and only primary mapping was used.

Table 2.

Results of selective sequencing using S.cerevisiae and lambda phage

Total kbTotal readsTarget kbTarget reads
Repl.1175 46797 7041698161
Control1 319 78573 7791660120
Repl.2119 43065 9491277121
Control1 283 14468 1191401102
Total kbTotal readsTarget kbTarget reads
Repl.1175 46797 7041698161
Control1 319 78573 7791660120
Repl.2119 43065 9491277121
Control1 283 14468 1191401102

Note: The first two columns represent the results in terms of total base pairs and number of reads, while the following two columns represent the results for the target region. Repl. indicates biological replicates, and control denotes the results obtained without the ‘Read Until’ API. Minimap2 (Li, 2018) with the -k 10 option was used to map reads to references. Reads with mapping quality values equal to 0 were filtered out, and only primary mapping was used.

Indeed, when we filtered out reads shorter than 5 kb, the amplification rate of base pairs improved to 34.3× (from 0.123% to 4.23%) and 27.2× (from 0.107% to 2.91%) for the two biological replicates. This filtering procedure was not likely to create a bias toward lambda phage sequences, because the median lengths of the reads from lambda phage and S.cerevisiae in the negative control were 14 and 12 kb for the first replicate and 19 and 12 kb for the second replicate.

From an alternative viewpoint, the result might suggest that overall throughput decreased, and there was little improvement with the ‘Read Until’ API. The total number of bases decreased from 1.3 Gbp to 175 Mbp and from 1.3 Gbp to 119 Mbp in the two replicates, but this was probably due to the rejection of undesirable reads. This reasoning was supported by the finding that the total amount of relevant bases did not show a very sharp decrease (Table 2). Therefore, even though the total sequencing throughput decreased, the API likely did not decrease the throughput of relevant reads.

We also validated our method in another sample composed of E.coli and lambda phage. Specifically, the mass ratio of lambda phage to the E.coli genome was set to 15 and the target sequence was the 0–200 kb region of the reference E.coli genome. The amplification rates were about 10 times for 4.5 h run, 21 times for 18.5 h run, respectively after filtering out reads with shorter than 5 kb (Table 3).

Table 3.

Results of selective sequencing using E.coli and lambda phage

Total kbTotal readsTarget kbTarget reads
Repl.1323 666121 0422905570
Control770 84964 5031769289
Repl.2167 99095 2802297317
Control1 412 67883 4692514226
Total kbTotal readsTarget kbTarget reads
Repl.1323 666121 0422905570
Control770 84964 5031769289
Repl.2167 99095 2802297317
Control1 412 67883 4692514226

Note: The table summaries the results of selective sequencing using the mixture of E.coli and lambda phage, and is similar to Table 2. Repl.1 and Repl.2 represent two biological replicates of the mixture, but the respective sequencing experiments were performed for 4.5 h and 18.5 h to show how sequencing time may affect the amplification rate.

Table 3.

Results of selective sequencing using E.coli and lambda phage

Total kbTotal readsTarget kbTarget reads
Repl.1323 666121 0422905570
Control770 84964 5031769289
Repl.2167 99095 2802297317
Control1 412 67883 4692514226
Total kbTotal readsTarget kbTarget reads
Repl.1323 666121 0422905570
Control770 84964 5031769289
Repl.2167 99095 2802297317
Control1 412 67883 4692514226

Note: The table summaries the results of selective sequencing using the mixture of E.coli and lambda phage, and is similar to Table 2. Repl.1 and Repl.2 represent two biological replicates of the mixture, but the respective sequencing experiments were performed for 4.5 h and 18.5 h to show how sequencing time may affect the amplification rate.

In addition, the reference was changed experimentally for the budding yeast sample. Specifically, after the region from 30 to 230 kb on the first chromosome of S.cerevisiae was amplified for 5.5 h, the reference sequence was changed to the 30–230 kb region of the second chromosome, and the experiment was restarted for 9 h to obtain similar coverage. The results clearly showed that both regions were amplified (Fig. 6). Specifically, the relative amplification rates of base pairs were 20.1× (from 0.18% to 3.5%) and 23.8× (from 0.19% to 4.5%), for the first and second chromosomes, respectively, with respect to reads longer than 5 kb. During sequencing, when one uncovers the characteristics of a given sample, one can change the focal region accordingly on the fly to maximize the sequencing capability.

Hot Swapping of references. Here, control refers to results obtained without using the ‘Read Until’ API, while target indicates those obtained using the API. The x-axis shows the position on the chromosome. The y-axis shows the relative abundance of each position, i.e. (coverage)/(total bp), smoothed with a 150 bp window using a simple script available from the bitbucket repository (see Data access). Minimap2 (Li, 2018) -k 10 was used to map the reads to references. The total number of relevant reads improved from 100 to 150 and from 70 to 121 for the two chromosomes
Fig. 6.

Hot Swapping of references. Here, control refers to results obtained without using the ‘Read Until’ API, while target indicates those obtained using the API. The x-axis shows the position on the chromosome. The y-axis shows the relative abundance of each position, i.e. (coverage)/(total bp), smoothed with a 150 bp window using a simple script available from the bitbucket repository (see Data access). Minimap2 (Li, 2018) -k 10 was used to map the reads to references. The total number of relevant reads improved from 100 to 150 and from 70 to 121 for the two chromosomes

5 Discussion

Amplifying reads at low abundances using a purely software approach was shown to be achievable using the ‘Read Until’ API bundled with a MinION sequencer. Specifically, reads with very low abundance (0.1%) were amplified to approximately 4%. This method correctly rejected irrelevant reads during sequencing and thus reduced the total time required to identify target reads in a sample. Our results also suggest that it is feasible to swap the amplification region during sequencing. This finding makes our approach unique compared with CRISPR-Cas approaches, which process a sample irreversibly. Although not all reads can currently be classified when the pores on a flow cell are highly active (∼2 h from the beginning), algorithmic improvements such as hashing (Luo and Shrivastava, 2017) or base-calling approaches would make it possible to classify all queries.

In addition, we proposed a mathematical framework for selective sequencing. Since this formalization assumes the similarity score for each query is given, it is flexible enough to incorporate other approaches, such as base calling followed by sequence comparison. Herein, the optimality of our simple algorithm with respect to precision was demonstrated and found to hold regardless of the fraction of relevant reads.

We named our framework MoSS (a Model of Selective Sequencing) and our algorithm DySS (Dynamic time warping for Selective Sequencing). We demonstrated the merit of our method using mock samples, but applications of our method to more complex samples under different situations would help us improve our open source program, DySS.

Even though current pore occupancy depends strongly on flow cell and library activity, future improvements in the flow cell and library kit will lead to results from real experiments approaching those from pre-recorded samples.

The portability of the MinION sequencer, combined with the ‘ReadUntil’ API, will drastically reduce the total time and cost by amplifying the concentration of target reads and thus advance research fields such as viromics, plasmidomics and pathogen detection. We also hope that this technology will be employed in the field of DNA storage (Organick et al., 2018) to facilitate random access to the entire DNA pool.

Acknowledgements

We would like to thank Oxford Nanopore Technologies for allowing us to use the Read Until API to determine the feasibility of our theory and algorithms for selective sequencing. We also thank Tsurutsuki Sugai, Shigeko Tsujikawa and Wei Qu for their technical support with nanopore sequencing experiments.

Data access

The source code used in the training and experiments is available from a bitbucket.org repository (https://bitbucket.org/ban-m/dyss). The dataset for R9.2 1D flowcell used in Section 4 can be downloaded from Nicholas J Loman’s website (http://lab.loman.net/2016/07/30/nanopore-r9-data-release/). The log file of the ‘ReadUntil’ API and its result is found at bitbucket.org repository (https://bitbucket.org/ban-m/readuntillogfiles) and SRA repository (SRA accession: SRP136587). The S.cerevisiae genome and E.coli genome were purchased from NBRC, NITE, Kisarazu, Japan (sample IDs are 1136G and 12713G).

Funding

This work was supported in part by the Core Research for Evolutional Science and Technology (CREST) program [JPMJCR13W3] from the Japan Science and Technology Agency (JST) to S.M. and the Advanced Genome Research and Bioinformatics Study to Facilitate Medical Innovation from the Japan Agency for Medical Research and Development (AMED) to S.M.

Conflict of Interest: none declared.

References

Albertsen
 
M.
 et al. (
2013
)
Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes
.
Nat. Biotechnol
.,
31
,
533.

Altschul
 
S.F.
 et al. (
1990
)
Basic local alignment search tool
.
J. Mol. Biol
.,
215
,
403
410
.

Bagnall
 
A.
 et al. (
2017
)
The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances
.
Data Min. Knowl. Discov
.,
31
,
606
660
.

Cao
 
M.D.
 et al. (
2016
)
Streaming algorithms for identification of pathogens and antibiotic resistance potential from real-time MinIONTM sequencing
.
Gigascience
,
5
,
32.

Cesa-Bianchi
 
N.
,
Lugosi
G.
(
2006
)
Prediction, Learning, and Games
.
Cambridge University Press
,
Cambridge, England, UK
.

Dau
 
H.A.
 et al. (
2017
)
Judicious Setting of Dynamic Time Warping’s Window Width Allows More Accurate Classification of Time Series.
IEEE Conference Publications
,
Piscataway, New Jersey, US
.

Djikeng
 
A.
 et al. (
2008
)
Viral genome sequencing by random priming methods
.
BMC Genomics
,
9
,
5.

Faria
 
N.R.
 et al. (
2016
)
Zika virus in the Americas: early epidemiological and genetic findings
.
Science
,
352
,
345
349
.

Floyd
 
R.W.
,
Rivest
R.L.
(
1975
)
Algorithm 489: the algorithm select for finding the ith smallest of n elements [m1]
.
Commun. ACM
,
18
,
173.

Greninger
 
A.L.
 et al. (
2015
)
Rapid metagenomic identification of viral pathogens in clinical samples by real-time nanopore sequencing analysis
.
Genome Med
.,
7
,
99.

Hagberg
 
E.E.
 et al. (
2016
)
A fast and robust method for whole genome sequencing of the Aleutian Mink Disease Virus (AMDV) genome
.
J. Virol. Methods
,
234
,
43
51
.

Johnson
 
S.S.
 et al. (
2017
)
Real-time DNA sequencing in the Antarctic dry valleys using the Oxford Nanopore sequencer
.
J. Biomol. Tech
.,
28
,
2
.

Kav
 
A.B.
 et al. (
2013
)
A method for purifying high quality and high yield plasmid DNA for metagenomic and deep sequencing approaches
.
J. Microbiol. Methods
,
95
,
272
279
.

Korlach
 
J.
 et al. (
2008
)
Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures
.
Proc. Natl. Acad. Sci
.,
105
,
1176
1181
.

Levene
 
M.J.
 et al. (
2003
)
Zero-mode waveguides for single-molecule analysis at high concentrations
.
Science
,
299
,
682
686
.

Li
 
H.
(
2018
)
Minimap2: pairwise alignment for nucleotide sequences
.
Bioinformatics
,
1
,
7
.

Lipton
 
Z.C.
 et al. (
2014
)
Optimal thresholding of classifiers to maximize F1 measure
. In: Calders,T. (ed.)
Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2014. Lecture Notes in Computer Science, 8725
, pp.
225
239
.
Springer
,
Berlin, Heidelberg
.

Loose
 
M.
 et al. (
2016
)
Real-time selective sequencing using nanopore technology
.
Nat. Methods
,
13
,
751.

Luo
 
C.
,
Shrivastava
A.
(
2017
)
SSH (Sketch, Shingle, & Hash) for indexing massive-scale time series
. In:
Proceedings of the Time Series Workshop at NIPS 2016 in PMLR
,
55
, pp.
38
58
.

Margos
 
G.
 et al. (
2017
)
Lost in plasmids: next generation sequencing and the complex genome of the tick-borne pathogen Borrelia burgdorferi
.
BMC Genomics
,
18
,
422.

Matranga
 
C.B.
 et al. (
2014
)
Enhanced methods for unbiased deep sequencing of Lassa and Ebola RNA viruses from clinical and biological samples
.
Genome Biol
.,
15
,
519.

McIntyre
 
A.B.
 et al. (
2016
)
Nanopore sequencing in microgravity
.
NPJ Microgravity
,
2
,
16035.

Meller
 
A.
 et al. (
2000
)
Rapid nanopore discrimination between single polynucleotide molecules
.
Proc. Natl. Acad. Sci
.,
97
,
1079
1084
.

Minot
 
S.
 et al. (
2013
)
Rapid evolution of the human gut virome
.
Proc. Natl. Acad. Sci
.,
110
,
12450
12455
.

Motahari
 
A.
 et al. (
2012
) Information theory for DNA sequencing: part I: a basic model. In: Information Theory Proceedings (ISIT), 2012 IEEE International Symposium 2012, pp. 2741–2745. IEEE, Cambridge, MA.

Nan
 
Y.
 et al. (
2012
) Optimizing F-measure: a tale of two approaches. In:
Proceedings of the International Conference on Machine Learning, 2012
, pp. 289–296. Omnipress, New York, NY, USA.

Organick
 
L.
 et al. (
2018
)
Random access in large-scale DNA data storage
.
Nat. Biotechnol
,
36
,
242
248
.

Quick
 
J.
 et al. (
2016
)
Real-time, portable genome sequencing for Ebola surveillance
.
Nature
,
530
,
228.

Quick
 
J.
 et al. (
2017
)
Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples
.
Nat. Protoc
.,
12
,
1261.

Rakthanmanon
 
T.
 et al. (
2012
) Searching and mining trillions of time series subsequences under dynamic time warping. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 262–270. ACM, New York, NY, USA.

Rhoads
 
A.
,
Au
K.F.
(
2015
)
Pacbio sequencing and its applications
.
Genomics Proteomics Bioinform
.,
13
,
278
289
.

Roberts
 
M.
 et al. (
2004
)
Reducing storage requirements for biological sequence comparison
.
Bioinformatics
,
20
,
3363
3369
.

Sakoe
 
H.
,
Chiba
S.
(
1978
)
Dynamic programming algorithm optimization for spoken word recognition
.
IEEE Trans. Acoust. Speech Signal Process
.,
26
,
43
49
.

Salvador
 
S.
,
Chan
P.
(
2007
)
Toward accurate dynamic time warping in linear time and space
.
Intell. Data Anal
.,
11
,
561
580
.

Stano
 
M.
 et al. (
2016
)
viruSITE—integrated database for viral genomics
.
Database
. doi.org/10.1093/database/baw162.

Thomas
 
C.M.
,
Summers
D.
(
2008
) Bacterial plasmids. In:
Encyclopedia of Life Sciences
. Vol.
3
,
John Wiley & Sons, Ltd
,
Chichester, UK
, pp.
189
195
. http://www.els.net/WileyCDA/ElsArticle/refId-a0000468.html.

Trevors
 
J.
(
1996
)
Genome size in bacteria
.
Antonie Van Leeuwenhoek
,
69
,
293
303
.

Vintsyuk
 
T.K.
(
1968
)
Speech discrimination by dynamic programming
.
Cybernetics
,
4
,
52
57
.

Wylie
 
T.N.
 et al. (
2015
)
Enhanced virome sequencing using targeted sequence capture
.
Genome Res
.,
25
,
1910
1920
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Bonnie Berger
Bonnie Berger
Associate Editor
Search for other works by this author on: