Streaming algorithms for identification of pathogens and antibiotic resistance potential from real-time MinIONTM sequencing

The recently introduced Oxford Nanopore MinION platform generates DNA sequence data in real-time. This has great potential to shorten the sample-to-results time and is likely to have benefits such as rapid diagnosis of bacterial infection and identification of drug resistance. However, there are few tools available for streaming analysis of real-time sequencing data. Here, we present a framework for streaming analysis of MinION real-time sequence data, together with probabilistic streaming algorithms for species typing, strain typing and antibiotic resistance profile identification. Using four culture isolate samples, as well as a mixed-species sample, we demonstrate that bacterial species and strain information can be obtained within 30 min of sequencing and using about 500 reads, initial drug-resistance profiles within two hours, and complete resistance profiles within 10 h. While strain identification with multi-locus sequence typing required more than 15x coverage to generate confident assignments, our novel gene-presence typing could detect the presence of a known strain with 0.5x coverage. We also show that our pipeline can process over 100 times more data than the current throughput of the MinION on a desktop computer. Electronic supplementary material The online version of this article (doi:10.1186/s13742-016-0137-2) contains supplementary material, which is available to authorized users.


Background
Massively parallel, short-read sequencing has profoundly transformed genomics research [1,2] and has become the dominant technology for sequencing DNA. However, one inherent limitation of sequencing mil- 5 lions of sequence fragments in parallel one base at a time is that the sequencing run has to finish before the data analysis can begin. As a result, sequence analysis algorithms have been designed to make inference on a complete sequencing dataset. In contrast, streaming 10 algorithms are a class of algorithms which are applied to a sequence of data events and typically maintain an internal summary of the data as well as an approximation to the full inference without needing to store all of the observations [3]. Streaming algorithms have appli-1990s [5]. The key innovation of this device is that it measures the changes in electrical current as a singlestranded DNA passes through the nanopore and uses the signal to determine the nucleotide sequence of the DNA strand [6,7]. This sequence data can be retrieved 25 and analyzed as it is generated, providing the opportunity to obtain answers in the shortest possible time. Real-time sequencing has immense potential in many applications, especially in time-critical areas such as rapid clinical diagnosis. 30 In order to realise this potential there is a need to develop streaming bioinformatics algorithms which continually update inference about the sample as each sequence read is generated. To be of practical use -for example to know when to when to make a diagnosis in 35 the clinic -these algorithms must continuously update not only a point estimate (e.g. which species present and their proportions), but also confidence intervals in that estimate. Several systems incorporating real-time analysis of MinION data have been developed recently 40 such as the cloud based platform Metrichor (Oxford Nanopore), work by Quick et al [8] and MetaPORE [9], focusing on placing the sample on a phylogenetic tree but without providing an estimate of the confidence in this assignment.
Here we present a flexible framework for real-time analysis on MinION sequence data directly as it is sequenced and base-called. The framework can incorporate multiple real-time analyses to suit the problems at hand and can be deployed on a single com- 5 puter or on a high performance computing facility and computing cloud. We also present four streaming algorithms for identification and characterization of pathogen samples. These algorithms, which are seamlessly integrated into the pipeline, report analysis re-10 sults along with their confidence levels so that users can decide when to stop a sequencing run.
By sequencing three bacterial isolate samples and a mixture sample on the MinION sequencer, we demonstrate that we can reliably determine the species and 15 strain type of a sequenced sample with only 500 reads. This was achieved in less than half an hour of sequencing with the current throughput of the MinION. Furthermore, we show that we can identify the majority of the drug resistance genes present in a sample within 2 20 hours of sequencing, and the full drug resistance profile within 10 hours. We also show that MinION sequence data can be used for accurate Multi-Locus Sequence Typing (MLST), despite the relatively high error rates associated with the technology. The pipeline 25 can perform all these analyses on a single computer at a throughput of over 100 times higher than our best runs. As the throughput of nanopore sequencing is expected to increase, the time to obtain these results will be significantly shortened. Our findings support 30 the potential use of MinION sequencing for real-time analysis of clinical samples for species detection and analysis of antibiotic resistance.

Results and discussion
Real-time analysis framework 35 Our real-time analysis framework consists a number of streaming programs communicating to each other via the network sockets or the inter-process communication pipes provided by Unix-like operating systems. These programs typically take a sequence of items 40 as input and process after every some small number of items arrive. They either retain only the relevant statistics of the data, or upon processing any data items, immediately forward only the necessary information to the downstream programs for further pro- 45 cessing. Processing data in this way requires a much smaller memory footprint and hence is relevant for processing large amount of data, especially real-time data from MinION sequencing.
We developed a number of auxiliary programs to 50 facilitate setting up a real-time pipeline for analysis of MinION sequencing data. These include scripts for setting up communication channels in a pipeline, thereby allowing the pipeline to be deployed on a high performance computing cluster to scale with mas- 55 sive amounts of data. Programs for simple analyses of the MinION sequencing data such as initial analysis (npReader [10]) and read-filtering on the basis of read length and read quality are also provided. We developed streaming algorithms for a hand-60 ful of identification problems, namely species typing, multi-locus strain typing; gene-presence strain-typing and identification of antibiotic resistance profiles (see Methods). We integrated the implementations of these algorithms into the analysis pipeline (see Figure 1). 65 In this pipeline, npReader [10] continuously scans the folder containing sequencing data in parallel with the MinION sequencing. It picks up sequenced reads as soon as they are generated (from Metrichor), and simultaneously streams them through the pipeline for 70 the identification analyses. The pipeline also makes use of off-the-shelf bioinformatics tools such as BWA-MEM [11] as described later. In each step of this pipeline, data is piped from one process to the next without being written to disk, with the exception of 75 base-calling via Metrichor in which each read is written to disk once it has been base-called, and then read almost immediately by npReader.
We evaluated our real-time analysis pipeline and the accuracy of our algorithms using five MinION se-80 quencing data sets. Four of these datasets were collected before the pipeline was developed, and hence we emulated the timing of the sequencing for the evaluation from these datasets. Specifically, we extracted the time that each read was sequenced, and streamed the 85 sequence reads in the exact order and timing into the pipeline. With the emulation, we were able to stream the sequencing data with a hypothetical throughput of 120 times higher what we could obtain with the MinION. This allowed us to test the scalability of the 90 pipeline against the projected future throughput such as from the PromethION platform. The fifth dataset was passed through our pipeline as it was base-called from Metrichor, and thus represents a true demonstration of the real-time capability of the pipeline. Finally, 95 we validated the analysis results by sequencing these samples with Illumina MiSeq platform, where bioinformatics analysis methods are well-established.  blies of the five strains. With the assemblies, we were able to identify the strain types and the antibiotic resistance profiles of these strains (see Methods). These results were used as the benchmarks to validate the analysis of MinION sequencing data. 15 Sequencing yields and quality of MinION sequencing Sequence reads from the MinION were classified into three types: template, complement and higher quality 2D reads (i.e., reads resulted from computationally merging a template and a complement read). The 20 average Phred quality of template and complement reads across four runs was in the region of 5 while 2D reads were in higher quality, with average Phred quality about 9 (see Table 2 and Figure 3). The median read lengths of three K. pneumoniae samples were ap-25 proximately 5Kb, while the mixture sample was only less than 1Kb ( Figure 3). We observed a variation in terms of sequence yields across the four runs. While we obtained nearly 40000 reads (185Mb) for sample K. pneumoniae ATCC BAA-2146 after 60 hours of 30 sequencing, the run for sample K. quasipneumoniae ATCC 700603 yielded only 7092 reads (39Mb) with the same running time ( Figure 2). We sequenced sample K. pneumoniae ATCC 13883 and the mixture sample for 36 and 20 hours respectively both with the chem-35 istry 7.3 but and the yields were markedly different to each other. The read length and accuracy of our runs were consistent with other user reports [12][13][14][15].

Species detection
For real-time bacterial species detection, we built a 40 database from 2,785 complete genomes of 1,489 bacterial species available in GenBank (http://www.ncbi. nlm.nih.gov/genbank/, accessed Nov 2014), augmented with two K. quasipneumoniae genomes (which  was not the strain we sequenced) as none were present in the database. The database contained a number of K. pneumoniae, E. coli and S. aureus strains (10, 63 and 49 respectively), but none of the five strains in our samples were present. The pipeline aligns sequence 5 reads as they are generated from the sequencer to this database. The species typing algorithm periodically computes the simultaneous proportions of the species present in the sample and reports the 95% confidence intervals of these proportions (See Methods).
In both K. pneumoniae samples as well as the K. quasipneumoniae sample, we successfully detected the major species present in the isolate. This was achieved with as little as 120 sequence reads requiring only 5 minutes of sequencing time (Figures 4a), b) and 15 c)). For K. pneumoniae strains ATCC BAA-2146 and ATCC 13883, it required less than 500 reads (10 and 15 minutes of sequencing, respectively) to reach a 95% confidence interval of less then 0.05. For strain ATCC 700603 it required only 300 reads to correctly identify 20 K. quasipneumoniae as the species.
The pipeline accurately identified the two species in the mixture sample as E. coli and S. aureus after obtaining around 100 reads (5 minutes of sequencing). The reported proportions became stable after around 25 1200 reads (35 minutes of sequencing). E. coli was the predominant species type in the mixture sample and it was evident with high proportion of sequencing reads supporting the E. coli species.

30
K. pneumoniae and other bacteria are conventionally strain typed using a MLST system which requires accurate genotyping to distinguish the alleles of seven house-keeping genes [16]. Our analysis of MinION raw read quality (Fig. 3), together with other user re-35 ports [12][13][14][15], indicated high error rates in MinION sequencing in comparison to Illumina Miseq sequencing. This suggested that MLST typing was challenging with MinION sequence data, especially in real-time fashion. 40 We developed a method to carry out MLST typing using MinION sequence data. Our method selected only reads spanning one of the house-keeping genes. It then used multiple reads aligned to the same gene to correct error in the raw sequence reads and sub-45 sequently combined information across multiple alleles in a likelihood-based framework (see Methods). Table 3 presents the top five highest score types (in log-likelihood) forK. pneumoniae and K. quasipneumoniae strains using MinION sequencing. In all three 50 strains, the correct types were the highest score out of 1678 types available in the MLST database. We noticed that the typing system also outputted several other strain types with the same likelihood (e.g.,    . There was only one read aligned to these two genes by the end of the run due to the poor yield of this run, which may have also contributed to inability to differentiate these two strain types. While the results were encouraging, this also suggested that a more ac- 15 curate strain-typing methodology would need to consider all of the sequenced reads, rather than just those covering 7 house-keeping genes. Therefore we further devised a method for strain-typing which was based on presence or absence of genes. 20 Strain typing by presence or absence of genes We developed a novel strain typing method to identify the bacterial strain from the MinION sequence reads based on patterns of gene presence and absence. We downloaded the genome assemblies of all strains for K. 25 pneumoniae, E. coli and S. aureus species from Ref-Seq repository and identified their strain types using the relevant MLST schemes. This resulted in sets of 125 strain types for K. pneumoniae, 353 for E. coli and 107 for S. aureus. For each strain type, we picked 30 the highest quality assembly (in terms of N50 statistic) and extracted gene sequences from its RefSeq gene annotation. We then grouped genes from a species based on 90% sequence identity, and therein obtained the gene profile for each strain type. 35 Our pipeline identified genes present in the sample from sequence reads as they were generated by the MinION device. It then used this information to infer the posterior probability of each of the strain types, as well as the 95% confidence intervals in this estimate 40 3,000 S. aureus ST243 S. aureus ST12 S. aureus ST1460 S. aureus ST291 Sequencing yield (see Methods). For our K. pneumoniae and K. quasipneumoniae samples, we successfully identified the corresponding strain types from the sequence data with 95% confidence within 10 minutes of sequencing time and with as few as 200 sequencing reads (Figures 5a., 5 b., and c.). We streamed sequence reads from the mixture sample through the strain typing systems for E.
coli and S. aureus, and in both cases, the correct strain types of two species in the sample were also recovered. The correct type for E. coli strain in the 75%/ 25% E. 10 coli ,S. aureus mixture was recovered after 25 minutes of sequencing with about 1000 total reads (or approximately 750 E. coli derived reads). (Figure 5d.). The pipeline was able to correctly predict the S. aureus strain (which is known to have much less gene content variation) in this mixture sample after two hours of sequencing with about 2,800 total reads (or approximately 700 S. aureus derived reads).
Antibiotic resistance detection 5 The antibiotic resistance gene profiles of the samples were also characterized with MinION sequencing data. We obtained antibiotic drug resistance genes from ResFinder database [17] (https://cge.cbs.dtu.dk/ services/ResFinder/, accessed July 2015). This set contained over 2132 gene sequences, including variants of the same genes. We grouped these gene sequences based on 90% sequence identity into 609 groups. In this grouping, we found that sequences in a group were variants of the same gene.
Our antibiotic resistance profile identification pipeline aligned sequence reads to this antibiotic gene database. The algorithms retained reads that aligned to these genes, and periodically performed multiple alignment of reads that were aligned to the same gene. It then 20 generated a consensus sequence from these reads, and used a probabilistic Finite State Machine [18] to realign the consensus sequence to the gene sequence (see Methods). The pipeline reported the presence of a resistance gene as soon as the alignment score reached a 25 threshold. Table 4 shows the time-line of antibiotic genes detection from MinION sequencing of three K. pneumoniae strains. For the NDM-1 producing strain ATCC BAA-2146, we identified the presence of 26 antibiotic resis-30 tance genes in the MiSeq assembly of the strain. Our real-time pipeline identified all these 26 genes and an additional gene blaSHV from 10 hours of MinION sequencing. No further gene was detected thereafter. As gene blaSHV was reported with high confidence from 35 the our real-time analysis, we further investigated the alignment of the MiSeq assembly with this gene, and found that the gene was actually aligned to two contigs in the assembly suggesting the MiSeq assembly might have been fragmented in the middle of the gene. We 40 sourced a high quality assembly of the strain's genome using PacBio sequencing [19] and found that the assembly actually contained the gene. In other words, our pipeline detected precisely the antibiotic gene profile for this strain from 10 hours of MinION sequenc- 45 ing. We observed that the majority of these genes were identified in early stage of sequencing, i.e., three quarters of these genes were reported within 1.5 hours of sequencing, at less than 4000 reads (making up only a 3-fold coverage of the genome). We observed similar 50 performance for K. pneumoniae strain ATCC 13883 where 5 out of 6 genes after two hours of sequencing. The last gene (oqxB) was detected after 9.5 hours of sequencing, again recovering the full resistance profile without any false positive. For the multi-drug resistant 55 K. quasipneumoniae strain ATCC 700603, the pipeline only detected 8 out of 11 genes. The reduced sensitivity for this sample was most likely due to the low sequence yield (33Mb of data in total, or only 7-fold coverage of the genome).

60
Comparison with other methods To date, there are only a few existing pipelines for identification of species/subspecies from nanopore sequencing data, namely Metrichor [20], [8] and Meta-PORE [9]. These methods commonly place the sam-65 ple of question to a phylogeny taxonomy based on the number of reads that either are aligned to or have a similar k-mer profile to the taxon's reference genome. Our species typing method is somewhat similar to this approach, although it additionally esti-70 mates confidence intervals in the species assignment. While we found that this approach can successfully identify species within 500 reads, the signal to noise from nanopore sequencing is too low to use a similar approach to correctly discriminate at the strain level, 75 unless a large amount of data is available. Our strain typing uses a novel approach based on the presence and absence of genes and hence is able to make inference from a smaller number of reads.
Among the mentioned methods, only Metrichor [20] 80 and MetaPORE [9] support genuine real-time analysis. As MetaPORE only focuses on viral species identification, we could only directly compare the performance of our method to Metrichor. We uploaded the first 1000 reads from our single samples and the first 85 3000 reads from our mixture sample to the Metrichor What's In My Pot Bacteria k24 for SQK-MAP005 v1.27 (WIMP) workflow. Along with the species/subspecies and strains reported, WIMP provides a classification score filter where users can set the permissive-90 ness of reporting. Table 5 presents the bacterial taxa reported by WIMP workflow for our data with the default classification score. For sample K. pneumoniae ATCC BAA-2146, WIMP only returned the taxon K. pneumoniae at the species level. On the other hand, 95 for the second and third samples (K. quasipneumoniae ATCC 700603 and K. pneumoniae ATCC 13883), WIMP reported several K. pneumoniae strains but not the correct strain types of these samples (ST489 and ST3). For the mixture sample, two E. coli and 100 three S. aureus strains were reported, but also not the correct strain types (E. coli ST73 and S. aureus ST243). While it was unclear whether the strain types of these samples were included in WIMP's database, ST11 clearly was as it was reported in sample K. pneu-105 moniae ATCC 700603. However, WIMP was unable  to identify sample K. pneumoniae ATCC BAA-2146 to the strain level with 1000 reads, while our pipeline could do so in less than 400 reads ( Figure 5).
Antibiotic resistance genes detection from MinION sequencing was also explored in Judge et al [21]. Their 5 approach was broadly similar to ours in that it initially aligns sequence reads to a resistance gene database, and then constructs a consensus sequence from the multiple alignment of matched reads. Both pipelines reported close to perfect resistance gene identification 10 when compared with Illumina MiSeq sequencing. However, our pipeline uses a novel alignment parameter estimation using probabilistic Finite State Machines (see Methods). It is hence able to confidently report the presence of a resistance gene as soon as sufficient 15 supporting data are available. This is the essence of real-time analysis presented here.

Computational time
In our analyses, sequence reads were streamed through the pipeline in the exact order and timing as they 20 were generated. Analysis results were generated periodically (every minute for species typing and strain typing and every five minutes for resistance gene identification). We examined the scalability of the pipeline to higher throughput by running the pipeline on a 25 single computer equipped with 16 CPUs and streaming all sequence reads from the highest yield run (185Mb from sample K. pneumoniae ATCC BAA-2146) through the pipeline at 120 times higher speed than they were generated (e.g., data sequenced in 2 30 minutes were streamed within 1 second). Analysis re-sults were generated every 5 seconds for typing and every one minute for gene resistance analysis. With this hypothetical throughput, our pipeline correctly identified the species and strain of the sample in less than 35 20 seconds, upon which we could terminate the typing analyses. The pipeline then reported all the resistance genes in five minutes, which corresponded to the data generated in the first 10 hours of actual sequencing. This demonstrates the scalability of our pipeline to 40 higher throughput sequencing platforms in the future.

Real-time analysis of a clinical isolate
With the pipeline in place, we analysed a clinical K. pneumoniae isolate collected in Greece which was found to be resistant to a extensive range of antibi-45 otics. We sequenced the sample on the MinION with Chemistry R7.3 and ran the Metrichor service which performed basecalling and sample identification during the first three hours of the run. We also ran our pipeline in real-time on the base-called data returned 50 from the Metrichor service.
We observed a delay from the base-calling of the data; the first read was sequenced on the MinION within one minute from starting the run, but the basecalled data were received after 6 minutes. The delay 55 tended to increase as more data were generated. We found the base-called data returned during the three hour run of the Metrichor service were actually sequenced within 45 minutes on the MinION. This highlights the need for a local base-calling step to im-60 prove real-time analysis. Figures 6a and 6b  ple identification using our pipeline. The pipeline reported K. pneumoniae as the only species in the sample within 10 minutes, and reached a confidence interval of less than 0.1 in 40 minutes when approximately 200 reads were analysed. We noticed that these 200 5 reads were actually sequenced in 7 minutes by the Min-ION. For strain identification, our pipeline initially reported ST1199 but after 2.5 hours, reported ST258 as the sequence type for this isolate. It is worth noting that the two strains are highly similar; their MLST 10 profiles differ by only one SNP in the seven housekeeping genes. By sequencing the isolate on the Illumina MiSeq as described above, we confirmed that the sequence type for the strain is ST258. On the other hand, the sample identification from Metrichor 15 initially reported K. pneumoniae 1084 (ST23), but finally reported two strains namely K. pneumoniae JM45 (ST11) and K. pneumoniae HS11286 (ST11) after 3 hours (Figure 6c). During the three hour run with less than 4000 reads (16Mb of data), our pipeline 20 reported two antibiotic resistance genes, namely sul2 (sulphonamide) and tetA (tetracycline). Our analysis of the Illumina data for this strain confirmed the presence of these two genes. Clinical susceptibility testing also showed the resistance of this isolate to tetracy- 25 cline and Sulfamethoxazole-trimethoprim (MIC ≥ 16 µg/mL and ≥ 320 µg/mL, respectively analyzed by VITEK R 2 bioMérieux, Inc). Finally, we re-analysed the data from this run using the emulation described previously, and obtained the same results as from the 30 real-time analysis.

Discussion
In recent years HTS has become an integrative tool for infectious disease research [22,23]. There have been several reports emphasizing the use of HTS methods 35 to characterize clinical isolates, to study the spread of drug resistant microorganisms and to investigate outbreak of infections [24][25][26]. These studies predominantly use massively parallel short-read sequencing technologies such as the Illumina Miseq, NextSeq or 40 HiSeq. These sequencers achieve a very high base calling accuracy which makes them ideally suited to applications which require accurate calling of single nucleotide polymorphisms (SNPs), including reconstructing the evolutionary history of different bacterial 45 isolates; tracking transmissions during an outbreak; placing a new isolate on a phylogenetic tree and population genetic analyses. However, these technologies attain their high yield by sequencing a single base per cycle for millions of sequence fragments in parallel, 50 where each cycle takes at least 5 minutes.
The Oxford Nanopore MinION device, on the other hand, generated as many as 500 reads in the first 10 minutes of sequencing in our hands (which is 3 times lower than the theoretical maximum). The error rate of these reads was substantially higher than 5 the corresponding Illumina data. Existing bioinformatics algorithms -which have been developed for highly accurate Sanger and subsequently for short-read sequencing -rely on accurate base and SNP calling, which makes their application to MinION data challenging. As an example, most existing strain typing approaches often use a MLST system, either on a predefined set of house keeping genes [27], or on core genes set [28]. These approaches are highly standardized, reproducible and portable, and hence are routinely used 15 in laboratories around the world. Rapid genomics diagnosis tools using MLST from high-throughput sequencing such as SRST2 [29] have also been developed. While we showed that MLST can be adapted to identify bacterial strain type from nanopore sequenc- 20 ing, this requires high coverage sequencing of the gene set to overcome the high error rates. Similarly, other researchers have shown that error correction can overcome the high error rate providing enough coverage is obtained [15,30]. 25 The main contribution of this manuscript is to demonstrate that despite the higher error rate, it is possible to return clinical actionable information, including species and strain types from as few as 500 reads. We achieved this by developing novel ap-30 proaches which are less sensitive to base-calling errors and which use whatever subset of genome-wide information is observed up to a point in time, rather than a panel of pre-defined markers or genes. For example, the strain typing presence/absence approach relies only on 35 being able to identify homology to genes and also allows for a level of incorrect gene annotation.
Our species typing module has some similarities to the approach used by MetaPhlAn [31], in that we use the proportion of reads which map to different taxo-40 nomic groupings to estimate the proportion of different species in a sample. MetaPhlan optimises computational speed by aligning to a precomputed database of sequences which are pervasive within a single taxonomic grouping but not seen outside that grouping. 45 This allows it to blast against a database which is 20 times smaller than a full bacterial genomic database. This was designed to make metagenomics inference feasible on datasets with millions of reads. On the other hand, our species typing approach is designed 50 to make a similar inference using only hundreds of reads, and moreover, also continuously updates confidence intervals so the user knows when they can stop sequencing and make a diagnosis.
Our strain typing module has the advantage of being 55 able to rapidly type a known strain with a small number of low quality (i.e. mostly 1D) reads. Competiting approaches which use kmers, such as that implemented in 'WIMP' appear to require substantially more high quality data. The drawback of our approach is that if 60 a large number of genes are lost or gained in a single event, such as the gain or loss of a plasmid, the most likely strain may be incorrect, although the bootstrapderived confidence intervals will be wide in this case.
Our antibiotic resistance module is able to identify 65 the drug resistance potential of an isolate within a few hours of sequencing with very high specificity. In particular, with the most recent chemistry utilized in this paper (R7.3), we were able to identify the complete resistance potential of a K. pneumoniae isolate 70 without any false positives in 9.5 hours and with approximately 8000 reads (80% of the resistance genes were identified with 3000 reads in 2 hours). In order to achieve high specificity we designed a probabilistic Finite State Machine for error correction. This approach 75 continuously updates the consensus sequence from the multiple alignment of reads, and re-estimates the error profile of the consensus sequence. This allows the reporting of the presence of a resistance gene once sufficient accuracy is obtained, rather than waiting for 80 the full run to complete.
In summary, we have developed an open-source, flexible pipeline for real-time analysis of MinION sequencing data. The only step in our pipeline in which data are written to, and then re-read from disk is the base-85 calling step using Metrichor. npReader immediately identifies new reads as they are generated by Metrichor, however some delay can occur due to waiting for base-called data to be returned from Metrichor. Oxford Nanopore Technologies have recently opened up 90 the Application Programming Interface to extract raw data directly from the MinION. This, together with the recent development of open source base-calling algorithms [32,33] to run on the local machine, will allow future development of a completely streaming pipeline, 95 in the sense of never saving data to disk. Our pipeline can be deployed on a single 16 core computer, capable of analysing MinION data streaming at up to 120x the current rate of sequencing; or on a high performance computing cluster to scale with the potential 100 even higher throughput of forthcoming nanopore sequencing platforms. Our pipeline incorporates three streaming algorithms, but further algorithms can be flexibly integrated into this pipeline. Other algorithms such as for base-calling [32,33] or alignment of signal-105 level data can also be integrated into the pipeline to by-bass the base-calling currently from Metrichor.
Other investigators have focused on the long-read nature of MinION sequencing data, which enables complete genome assembly [30] as well as the identification of sites of integration of resistance islands [13]. Researchers have also recently reported that MinION se-5 quencing data could accurately identify bacterial outbreak strains within 50 minutes of sequencing [8] by placing reads onto a phylogenetic tree; and drug resistance profile of a S. aureus sample determined using a de-Bruijn graph approach from 8 hours of sequencing 10 data [34].
We have shown that switching from a traditional short-read sequencing pipeline coupled with standard, non-streaming bioinformatics algorithms, to a nanopore sequencing pipeline coupled with stream- 15 ing bioinformatics algorithms can dramatically cut the time taken from DNA library to results from at least 8 hours down to 30 minutes. With the time for library preparation for nanopore sequencing forecast to be shortened to 10 minutes, the major time bottleneck 20 then becomes the bacterial culture step (which can be 24 hours). The MinION sequencer can be used on clinical sample without culture, however this then dilutes the proportion of bacterial DNA present. Nevertheless, this may become a viable time-sensitive strategy as se- 25 quencing yield increases, particularly with high colonyforming-unit (CFU) infections. Another promising approach may be to use approaches to pre-concentrate bacterial DNA [35].
One of the major advantages of a whole-genome se-30 quencing approach to drug resistance profiling is that it is not necessary to restrict the analysis to a limited panel of drug-resistance tests but it is possible to discover the complete drug resistance profile in a sample. With a complete picture of the drug-resistance profile 35 within a few hours, a clinician may be able to design an antibiotic treatment regimen that is both more likely to succeed and less likely to induce further antibiotic resistance. However, even achieving completely accurate identification of resistance genes is only a first step 40 in accurately predicting the resistance profile, as mutations may effect the rate at which these genes are transcribed and also their antibiotic resistance activity. Prediction of antibiotic resistance from genotype is an area which warrants substantial further research. 45  For the library mixture sample, DNA concentration of each library was measured using Qubit Fluorimeter (Thermo Fisher Scientific). Based on the concentration, 75% of E. coli (ATCC 25922) library and 25% of 80 S. aureus (ATCC 25923) library were mixed prior to sequencing.

DNA extraction
For each sample a new MinION Flow Cell (R7 or R7.3) was used for sequencing. The library mix was loaded onto the MinION Flow Cell and the Genomic 85 DNA 48 hour sequencing protocol was initiated on the MinKNOW software.

MinION data analysis
The sequence read data were base called with Metrichor Agent (https://metrichor.com). We used 90 npReader [10] to convert base-called sequence data in fast5 format to fastq format. The npReader program also extracted the time that each read was sequenced and used this information to sort the read sequences in order they were produced. For the real-time anal-95 yses, we wrote a program to emulate the sequencing process in that it streamlined each read in the exact order it was sequenced. The program also allowed scaling up the sequencing emulation to a factor of choice. Our pipeline allows for filtering out 1D reads at mul-100 tiple stages (including via npReader). All subsequent analyses in this paper used both 1D and 2D reads.
MiSeq sequencing and data analysis Library preparation was performed using the Nexter-aXT DNA Sample preparation kit (Illumina) as recom-105 mended by the manufacturer. Libraries were sequenced on the MiSeq instrument (Illumina) with 300bp paired end sequencing, to a coverage of over 100-fold. Read data were trimmed with trimmomatic [36] (V0.32) and subsequently assembled using SPAdes [37] (V3.5), resulting in assemblies with N50 exceeding 200kb. Their strain types were identified by submitting the assembled genomes to the MLST servers https://cge.cbs. dtu.dk/services/MLST/ [38] for K. pneumoniae, E. coli (set #1) and S. aureus.
We identified the antibiotic resistance profiles of 10 these strains from their MiSeq assemblies. We used blastn (V2.29) to align these assemblies to the database of resistance genes obtained from ResFinder [17]. Genes which were covered at least greater than 85% by the alignments and with greater than 85% sequence identity were considered to be present in the sample. These gene profiles were used as a benchmark to validate the MinION sequencing analysis.

Species typing
We downloaded the bacterial genome database on 20 GenBank (ftp://ftp.ncbi.nlm.nih.gov/genomes/ Bacteria/, accessed 19 Nov 2014), which contained high quality complete genomes of 2785 bacterial strains from 1487 bacteria species. We expanded this database to include two K. quasipneumoniae genomes. 25 Our species typing pipeline streamed read data from npReader directly to BWA-MEM [11] (V0.7.10-r858) which aligned the reads to the database. Output from BWA in SAM format was streamed directly into our species typing pipeline, which calculated the 30 proportion of reads aligned to each of these species. Our species typing method considers the proportions {p 1 , p 2 , .., p k } of k species in the mixture as the parameters of a k-category multinomial distribution, and the read counts {c 1 , c 2 , .., c k } for the species as an observa- 35 tion from c 1 +c 2 +..+c k independent trials drawn from the distribution. It then uses the MultinomialCI package in R [39] to calculate the 95% confidence intervals of these proportions from the observation.

MLST typing 40
MinION sequence reads from K. pneumoniae strains were aligned to the seven house-keeping genes specified by the MLST system using BWA-MEM [11].
We then collected reads that were aligned to a gene and performed a multiple alignment on them using kalign2 [40]. The consensus sequence created from the multiple alignment was then globally aligned to all alleles of the gene using a probabilistic Finite State Machine (see below) for global alignment. The score of a MLST type was determined by the sum of the scores 50 of seven alleles making up the type.

Strain typing
We built gene profile databases for K. pneumoniae, E. coli and S. aureus from the RefSeq annotation. Specifically, we obtained the publicly available assemblies of 55 these species listed on RefSeq master database (ftp: //ftp.ncbi.nih.gov/genomes/ASSEMBLY_REPORTS/ assembly_summary_refseq.txt, accessed 17 July 2015). We used the relevant MLST schemes obtained from https://cge.cbs.dtu.dk/services/MLST/ [38] 60 to identify strain type of each assembly. For each strain type, we selected the assembly with highest N50 statistic and use the RefSeq gene annotation of the assembly to determine the gene content of the strain type.
In order to develop a simple probabilistic pres-65 ence/absence strain typing model, we consider the genomes of each of the strains simply as a collection of genes. Denote by St j=1..J the all the strains in our database (for a fixed species). Denote by g j,k the k th gene in the database for strain j, where the genes are 70 listed in no particular order. Denote by N j the total number of genes in St j .
We align each sequence read r i from the MinION device to the gene database using BWA-MEM [11]. We count the number of genes of each strain that are 75 aligned to some reads, denoted N j (r i ).
We describe below how we can calculate a likelihood, P (r i |St j ), of each strain generating each read, from which we can calculate the posterior probability of each strain St j conditional on observing the reads 80 r 1 . . . r m : The probability P (r i |St j ) could be calculated using a simple model as however, this model suffers from the problem that if we observe any read which overlaps a gene not in the reference genome for St j , then the posterior probability of that strain will become zero. Thus, this model 85 is very unstable. In order to make this estimate more stable, we use a mixture model which allows for the read to have been generated by a background model: The background model considers the probability that the read was generated from any of the strains: This makes the posterior probability estimates more stable. It also makes the model robust to incorrect annotation of the reads from the MinION sequencer and incorrect annotation of the reference genome. We have investigated use of c = 0.2, c = 0.1 and c = 0.05 5 and found that it has little impact on the results, with slightly smaller confidence intervals (data not shown). We choose c = 0.2 in order to conservatively estimate confidence intervals.
Finally, in order to calculate confidence intervals we 10 employ a bootstrap resampling approach in which we resample m reads from r 1 , . . . r m with replacement. This is repeated 1000 times, and the posterior probabilities are recalculated every iteration. We calculate the 95% confidence intervals from the empirical distri- 15 bution of these posterior probabilities.
To gain some insight into how this model works in response to gene presence, consider a gene g which is present in a fraction f of strains, including St j but not including St k . For simplicity assume that each 20 strain has N genes. The difference in log-likelihood St j and St k conditional on g can be approximated by log(1/c) + log(1/f ), showing that a more specific gene has a stronger effect in our model than a common gene in distinguishing strains. 25 To gain insight into the effect of gene absence in contrast to gene presence, assume instead that the only difference between St j and St k is that a single gene (g) is deleted in St j , and denote by N = N j = N k − 1. If we sequence N ln(2) genes from St j without see- 30 ing gene g, the difference in log-likelihood becomes N ln(2) * (log(N ) − log(N − 1)) ≈ 1 bit, corresponding to the likelihood for St j being twice as big as the likelihood of St k . For example, if a strain has 1000 genes, then we would need to observe 693 genes without ob- 35 serving g to be able to conclude that the observed data were twice as likely to be generated from the species with a single gene deletion. For comparison, we would need to only sequence 100 genes from St k to get an expected log-likelihood difference of 1 bits versus St j , 40 demonstrating the extra information in gene 'presence' versus 'absence' typing.

Antibiotic resistance gene classes detection
We downloaded the resistance gene database from Res-Finder (https://cge.cbs.dtu.dk/services/ResFinder/, 45 accessed July 2015). We aligned each gene to the collection of bacterial genomes in RefSeq using blastn [41], and used the best alignment of the gene to extract 100 bp sequences flanking the antibiotic resistance genes. We found that the inclusion of these flanking sequences 50 improved the sensitivity of mapping MinION reads to the gene database.
We then grouped these genes based on 90% sequence identity into 609 groups. We manually checked and found genes within a group were variants of the same 55 gene. We selected the longest gene in each group to make up a reduced resistance gene database. To create a benchmark of resistance gene for a sample, we blastn the Illumina assembly of the sample against this reduced gene database, and reported genes with greater 60 than 85% coverage and identity.
Our analysis pipeline aligned MinION sequencing data into this reduced resistance gene database using BWA-MEM [11] in a streamline fashion, and examined genes that had reads mapping to the whole gene 65 (not including flanking sequences). Due to the high error rates of MinION sequence data, we noticed a high rate of false positive genes. To reduce false positives, we used kalign2 [40] to perform a multiple alignment of reads that were aligned to the same gene. The consen-70 sus sequence resulting from the multiple alignment was then compared with the gene sequence using a probabilistic Finite State Machine (see below). The pipeline then reported gene classes based on the genes detected.
Sensitive alignment of noisy sequences with probabilistic 75 Finite State Machines Our methods for MLST strain typing and antibiotic resistance gene identification require the alignment of a consensus sequence to a gene or a gene allele. Such an alignment generally assumes a model and a set of pa-80 rameters of the differences between the sequences. It is widely recognised that the accuracy of the alignment is sensitive to these parameters [42][43][44]. However, in the context of real-time analysis of MinION sequencing, it is not possible to select in advance a sensible set 85 of parameters. On the one hand, the quality among sequence reads differs remarkably; as shown in Figure 3 and Table 2 -the majority (95%) of the reads across our four runs have the Phred score ranging between 3 and 7 for template and complement reads (corre-90 sponding to 50% -80% accuracy) and between 6 to 12 for 2D reads (75%-95% accuracy). On the other hand, a consensus sequence is computationally constructed from a set of reads. Its quality is hence contingent to not only the quality of the reads but also the number 95 of reads in the set.
We use a probabilistic Finite State Machine (pFSM) [45] to model the differences, and hence the simultaneous error profile of the consensus sequence. Briefly, a pFSM is a probabilistic model of genomic alignment 100  Figure 7 depicts a three-state pFSM which is equivalent to a affine gap penalty alignment model. In order to assess an alignment of two sequences A and 10 B, under a hypothesis specified by the parameters, the pFSM computes the cost to generate one sequence (say A) given the other (B). For example, while in state Copy, the machine consumes the next base in B, generates the next base in A; it is said to take action C if 15 the two bases are the same, or action S otherwise, and to follow either transition to state Copy. Alternatively, the machine can take either action D (consumes the next base in B without generating any base in A and moves to state Delete), or action I (generates the next 20 base in A without consuming a base in B and moves to state Insert). These actions are repeated until the whole sequence B is generated.
We use an information-theoretic measure where the cost of a transition is that of encoding the generated 25 base, or in other words, the negative logarithm of the probability of the associated action (c = −log 2 (P (a)). The foundation of this approach goes back to the 1960s when it was proposed as a basis for inductive inference [46,47]. It has since been used in a num-30 ber of bioinformatics applications such as for calculating the BLOSUM matrix [48] and modelling DNA sequences [49,50]. More importantly, this informationtheoretic framework allows one to estimate a sensible set of parameters for any related two sequences. 35 This is done via a Expectation-Maximisation process. This starts with an initial set of probabilities at each state. In the E-step, the best alignment (lowest cost) is calculated by a dynamic programming algorithm. The frequencies of actions at each state are then used 40 to re-estimate the probabilities in the M-step. A detailed discussion of this process is provided in Allison et al [45] and Cao et al [51]. The process is guaranteed to converse to an optimal, and it does so in only a few iterations in our experience. 45

Availability of supporting data
All scripts for the presented analyses are provided in https://github.com/mdcao/npAnalysis. The source code of the software is publicly available in github repository (https://github.com/mdcao/japsa). The 50 MinION sequencing data for the three single samples are available in European Nucleotide Archive Study Accession Number ERP010377 (http://www. ebi.ac.uk/ena/data/view/ERP010377). The Min-ION sequencing data for mixture sample and the Il-55 lumina sequencing are in the process of depositing to European Nucleotide Archive. They are made available via the links provided in https://github.com/ mdcao/npAnalysis.

List of abbreviations used
MLST: Multilocus sequence typing; pFSM: probabilistic Finite State Machine.
Competing interests MC is a participant of Oxford Nanopore's MinION Access Programme 5 (MAP) and received the MinION device, MinION Flow Cells and Oxford Nanopore Sequencing Kits in return for an early access fee deposit. None of the authors have any commercial or financial interest in Oxford Nanopore Technologies Ltd.
Author's contributions MDC, DG, MC and LC conceived the study, performed the analysis and wrote the first draft of the manuscript. AE performed the bacterial cultures and DNA extractions. DG performed the MinION sequencing. MDC and LC designed and developed the algorithms and the analysis framework. MDC, HZ, and LC performed the bioinformatics analyses. All authors contributed 15 to editing the final manuscript.