Chiron: translating nanopore raw signal directly into nucleotide sequence using deep learning

Abstract Sequencing by translocating DNA fragments through an array of nanopores is a rapidly maturing technology that offers faster and cheaper sequencing than other approaches. However, accurately deciphering the DNA sequence from the noisy and complex electrical signal is challenging. Here, we report Chiron, the first deep learning model to achieve end-to-end basecalling and directly translate the raw signal to DNA sequence without the error-prone segmentation step. Trained with only a small set of 4,000 reads, we show that our model provides state-of-the-art basecalling accuracy, even on previously unseen species. Chiron achieves basecalling speeds of more than 2,000 bases per second using desktop computer graphics processing units.

We tested the CPU rate on 4, 8 and 20 threads and observed full CPU utilisation; the basecalling speed increases proportionally as more threads are used, so we believe CPU resources are used efficiently in multi-threaded mode. We added an 8-core CPU rate to Table 2, and updated the table legend as follows: "Single core CPU rate is calculated by dividing the number of nucleotides basecalled by the total CPU time for the basecalling analysis. 8 core CPU rate is estimated by multiplying single core CPU rate by 8, based on observed 100% utility of CPU processors in multi-threaded mode on 8 cores."
Query: In the conclusion, you state that Chiron using a GPU is "faster than current data collection speed". However, at 450 bp/sec/pore (the current Nanopore sequencing rate), Chiron would only be able to keep up with about three in-strand pores. A MinION run can generate over 5 Gbp of reads, which would take over a month to basecall using your quoted GPU rate.

Response:
We have deleted this statement as it was misleading. We have included the following paragraph in the discussion to acknowledge the speed limitations of Chiron: "Our model is substantially more computationally expensive than Albacore and somewhat more computationally expensive than BasecRAWller. This is to be expected given the extra depth in the neural network. Our model can be run in a GPU mode, which makes computation feasible on small to medium sized datasets on a modern desktop computer. " Query: Consensus accuracy In addition to the error rate metrics for basecalled reads, I would like to see error rate metrics for the consensus sequences produced by each basecaller's reads. For researchers who work with assembly or other high-read-depth analyses, consensus accuracy may be more important than individual-read accuracy. I would suggest using either Racon or Canu to measure consensus accuracy, as they are widely used tools in the Nanopore sequencing community. I realise this would only be possible for your bacterial and viral read sets, where depth is sufficient for assembly and sequence consensus.
Response: Thank you for this suggestion. We have calculated the consensus rate for bacterial and viral datasets, using Miniasm + Racon. We describe the approach used in the methods section: "We assessed the quality of assemblies generated from reads produced by different base-callers. For each base-caller, a de-novo assembly is generated by the use of only Nanopore reads for the M. tuberculosis, E. coli and Lambda Phage genomes. We use Minimap2 and Miniasm to generate a draft genome, then Racon is used to polish on the draft genome for 10 rounds."
The results are presented in Table 2 and Figure 3, and summarised in the text as follows: "In order to assess the quality of genomes assembled from reads generated by each basecaller, we used Miniasm together with Racon to generate a de-novo genome assembly for each of the bacterial and viral genomes (see Methods). The results presented in Table 2 demonstrate that Chiron assemblies for Phage lambda and E-coli samples have approximately half as many errors as those generated from Albacore (v1 or v2) reads. For M. tuberculosis, Chiron has fewer errors than Albacore v1, but slightly more than Albacore v2. The identity rate and relative length for each round of polishing with Racon are shown in Figure 3." Reviewer #2: Query: In particular, it seems that the performance of Chiron is very similar to other available tools, and in many cases they seem to be very similar to e.g. Albacore-1.1 that uses the event segmentation.
Response: This is not correct. Table 1 shows that Chiron-BS is consistently better than Albacore v1.1 on bacterial and viral genomes at the read level. Moreover, following the suggestion of reviewer one, we have investigated the assembly-level accuracy (described above). We show that Chiron is superior to Albacore (v1 but also superior to v2) in generating highly accurate assemblies. We have added the following sentence in the discussion to reflect these new results: "Bacterial and viral genome assemblies generated from Chiron-basecalled reads all had less than 0.5% error rate, whereas those generated by Albacore had up to 0.8% error rate. This marked reduction in error rate is essential for generating accurate SNP genotypes, a pre-requisite for many applications such as outbreak tracking." These results conclusively demonstrate the benefits from removing the event segmentation step in base-calling.
Query: Moreover, design of the deep neural network underlying Chiron is much more complex than the one used in other currently available tools. In consequence, the tool is very slow and on CPU (even if parallelized) it would be very difficult to use. When using a high-end GPU card, Chiron can process ~1600bp per second. By a conservative estimate, a MinION run produces over 30000bp per second, so one would need approx. 19 of these GPU cards to keep up with the speed of sequencing (ONT Albacore would need about 10 CPU cores to process such run on-line according to the authors' measurements, which is a much more realistic setting). Consequently, Chiron cannot be considered a practical tool.
Response: As indicated above, we have removed the statement that indicated Chiron could be used as a real-time base-caller. However, we reject the characterization that Chiron is not a practical tool. In certain settings, obtaining the most accurate base-calls possible is extremely important. One such example is SNP calling, e.g. accurate identification of SNPs conferring drug resistance. The fact that Chiron leads to up to a 50% reduction in base-calling error rate makes it a valuable tool.
Moreover, there are approaches to accelerating neural networks which may be used to accelerate Chiron. We have indicated this in the discussion as follows: "Also there are several existing methods which can be used to accelerate NN-based basecallers such as Chiron. One such example is Quantization, which reformats 32-bit float weights as 8-bit integers by binning the weight into a 256 linear set. As neural networks are robust to noise this will likely have negligible impact of the performance. Weight Pruning is another method used to compress and accelerate NN, which prunes the weights whose absolute value is under a certain threshold and then retrains the NN\cite{han2015deep}." Query: One interesting point of the paper is that they only used a limited amount of data for training and the network seems to generalize well. It would be interesting to explore this issue. Would using significantly more data lead to a significantly better accuracy? Is the use of training data more efficient than in the case of other available tools? Response:

We agree that the fact that the Chiron Neural Network generalises well is an interesting feature. However, exploring this issue in depth is beyond the scope of this paper. Moreover, it would be extremely difficult to compare the generalisability of Chiron and Albacore precisely because it is impossible to 're-train' Albacore on less data, as it is a proprietary basecaller.

We note in response to an editorial query that Chiron is now registered in SciCrunch, RRID is SCR_015950, and this information is now included in the manuscript

Introduction
DNA sequencing via bioengineered nanopores, recently introduced to the market by Oxford Nanopore Technologies (ONT), has profoundly changed the landscape of genomics. A key innovation of the ONT nanopore sequencing device, MinION, is that it measures the changes in electrical current across the pore as a single-stranded molecule of DNA passes through it. The signal is then used to determine the nucleotide sequence of the DNA strand [1,2,3]. Importantly, this signal can be obtained and analysed by the user while the sequencing is still in progress. A large number of pores can be packed into a MinION device the size of a stapler, making the technology extremely portable. The small size and real-time nature of the sequencing open up new opportunities in time-critical genomics applications [4,5,6,7] and in remote regions [8,9,10,11,12]. While nanopore sequencing can be massively scaled up by designing large arrays of nanopores and allowing faster translocation of DNA fragments, one of the bottlenecks in the analysis pipeline is the translation of the raw signal into nucleotide sequence, or basecalling.
Prior to the release of Chiron, basecalling of nanopore data involved two stages. The raw data series is first divided into segments corresponding to signals obtained from a k-mer (segmentation), and a model is then applied to translate segment signals into k-mers. DeepNano [13] introduced the idea of using a bi-directional Recurrent Neural Network (RNN) that uses the basic statistics of a segment (mean signal, standard deviation and length) to predict the corresponding k-mer. The official basecallers released by ONT, nanonet and Albacore (prior to version 2.0.1), also employ similar techniques. As k-mers from successive segments are expected to overlap by k-1 bases, these methods use a dynamic programming algorithm to find the most probable path, which results in the basecalled sequence data.
BasecRAWller [14] uses a pair of unidirectional RNNs; the first RNN predicts the probability of a segment boundary for segmentation, while the second translates the discrete events into base sequence. As such, BasecRAWller is able to process the raw signal data in a streaming fashion.
In this article we present Chiron, which is the first deep neural network model that can translate raw electrical signal directly to nucleotide sequence. Chiron has a novel architecture which couples a convolutional neural network (CNN) with an RNN and a Connectionist Temporal Classification (CTC) decoder [15]. This enables it to model the raw signal data directly, without use of an event segmentation step. Oxford Nanopore Technologies have also developed a segmentation-free basecaller, Albacore v2.0.1, which was released shortly after Chiron v0.1. Chiron has been trained on a small data set sequenced from a viral and a bacterial genome, and yet it is able to generalise to a range of genomes such as other bacteria and human. Chiron is as accurate as the ONT designed and trained Albacore v2.0.1 on bacterial and viral basecalling and outperforms all other existing methods. Moreover, unlike Albacore, Chiron allows users to train their own neural network, and it is also fully open-source, enabling development of specialised basecalling applications, such as detection of base modifications.

Deep neural network architecture
We have developed a deep neural network (NN) for end-to-end, segmentation-free basecalling which consists of two sets of layers: a set of convolutional layers and a set of recurrent layers (see Figure 1). The convolutional layers discriminate local patterns in the raw input signal, whereas the recurrent layers integrate these patterns into basecall probabilities. At the top of the neural network is a CTC decoder [15] to provide the final DNA sequence according to the base probabilities. More details pertaining to the NN are provided in Methods.
Chiron is an end-to-end basecaller, in that it predicts a complete DNA sequence from raw signal. It translates sliding windows of 300 raw signals to sequences of roughly 10-20 base pairs (which we call slices). These overlapping slices are stacked together to get a consensus sequence in real-time. The window is shifted by 30 raw signals at a time; by processing these slices in parallel, the basecalling accuracy can be improved with little loss of speed.

Performance Comparison
For training and evaluating the performance of Chiron, a phage Lambda virus sample (Escherichia virus Lambda, provided by ONT) and an Escherichia coli (K12 MG1655) sample were sequenced using the 1D protocol on R9.4 flowcells for calibrating the MinION device (see Methods). 34,383 reads were obtained for the Lambda sample and 15,012 reads for E. coli, but only 2,000 reads were randomly picked from each sample to train Chiron. It took the model 10 hours to train 3 epochs with 4,000 reads (∼4 Mbp) on an Nvidia K80 GPU. Chiron was then cross-validated on the remainder of the reads from the two runs, and the model was further evaluated by testing its basecalling accuracy on other species. A Mycobacterium tuberculosis sample was sequenced, and a set of human data (chromosome 21, part 3) was downloaded from the Nanopore WGS Consortium [16], to test the generality of Chiron (see Table 4).
In order to establish the ground truth of the data, the E. coli and M. tuberculosis samples were sequenced using Illumina technology (see Methods) and assembled, which provided a reference with high per-base accuracy. The reference sequence for the Phage Lambda virus is NCBI Reference Sequence NC_001416.1, and for the human data the GRCh38 reference was used. The raw signals are labelled by identifying the raw signal segment corresponding to the nucleotide assumed to be in the pore at a given time-point (see Methods). Table 1 presents the accuracy of the four basecalling methods, including the Metrichor basecaller (the ONT cloud service), Albacore v1.1 (the official ONT local basecaller), BasecRAWller [14] and Chiron, with a greedy decoder (Chiron) and a beam search decoder (Chiron-BS), on the data. Chiron has the highest identity rate on the Lambda, E. coli and M. tuberculosis samples. Additionally, it had the lowest deletion and mismatch rates on Lambda, M. tuberculosis and E. coli, and the lowest insertion rate on Lambda and E. coli. On the Human dataset, where Chiron did not have the highest identity rate, it was no more than 0.01 from the best.
In addition, we compared the segmentation-free ONT basecaller Albacore v2.0.1 with Chiron-BS in Table 1. Chiron-BS had a consistently lower insertion rate across all species tested, as well as a lower deletion rate on Lambda and E. coli; however, it suffered a slightly higher mismatch rate on all species except E. coli. The performance is comparable to Albacore v2.0.1 on all species except Human; this is likely at least partially due to the fact that Chiron has not been trained on any human DNA.
In order to assess the quality of genomes assembled from reads generated by each basecaller, we used Miniasm together with Racon to generate a de-novo genome assembly for each of the bacterial and viral genomes (see Methods). The results presented in Table 2 demonstrate that Chiron assemblies for Phage lambda and E. coli have approximately half as many errors as those generated from Albacore (v1 or v2) reads. For M. tuberculosis, Chiron has fewer errors than Albacore v1, but slightly more than Albacore v2. The identity rate and relative length for each round of polishing with Racon are shown in Figure 3.
In terms of speed on a CPU, Chiron is slower (21 bp/s, or 17 bp/s using a beam-search decoder with a beam width of 50) than Albacore (2975 bp/s) and, to a lesser extent, BasecRAWller (81 bp/s). However, when run on an Nvidia K80 GPU, a basecalling rate of 1652 bp/s (1204 bp/s using a beam search decoder) is achieved. Chiron was also tested on an Nvidia GTX 1080 Ti GPU, where it reached a rate of 2657 bp/s. The GPU rates for the other two local basecallers are not included, as Albacore and BasecRAWller do not currently offer GPU support. Metrichor was not included in the speed benchmarking because, as a cloud basecaller, it does not expose information about CPU/GPU speed.

Discussion
Segmenting the raw nanopore electrical signal into piece-wise constant regions corresponding to the presence of different k-mers in the pore is an appealing but error-prone approach. Segmentation algorithms determine a boundary between two segments based on a sharp change of signal values within a window. The window size is determined by the expected speed of the translocation of the DNA fragment in the pore. We noticed that the speed of DNA translocation is variable during a sequencing run, which, coupled with the high level of noise in the raw data, can result in low segmentation accuracy. As a result, the segmentation algorithm often makes conservative estimates of the window size, resulting in segments smaller than the actual signal group for k-mers. While dynamic programming can correct this by joining several segments together for a k-mer, this affects the prediction model.
All existing nanopore basecallers prior to Chiron employ a segmentation step. The first nanopore basecalling algorithms [17,18] employed a Hidden Markov Model, which maintains a table of event models for all possible k-mers. These event models were learned from a large set of training data. More recent methods (DeepNano [13], nanonet) train a deep neural network for inferring k-mers from segmented raw signal data.
A recent basecaller named BasecRAWller [14] used an initial neural network (called a raw network) to output probabilities of boundaries between segments. A segmentation algorithm is then applied to segment these probabilities into discrete events. BasecRAWller then uses a second neural network (called the fine-tune network) to translate the segmented data into the base sequence.

Table 1. Results from the experimental validation and benchmarking of Chiron against three other segmentation-based Nanopore basecallers and Albacore v2 (which is also a segmentation-free basecaller).

Identity rate (%) is calculated by first shredding the assembly contigs into 10K read pieces and then taking the mean identity rate of the aligned reads; relative length (%) is defined as the sum of the lengths of all the aligned pieces divided by the length of the reference genome. E. coli-S10 and E. coli-S18 are reads from two independent sequencing runs.
Our proposed model is a departure from the above approaches in that it performs base prediction directly from raw data without segmentation. Moreover the core model is an end-to-end basecaller in the sense that it predicts the complete base sequence from raw signal. This is made possible by combining a multi-layer convolutional neural network to extract the local features of the signal, with a recurrent neural network to predict the probability of nucleotides in the current position. Finally, the complete sequence is called by a simple greedy algorithm, based on a typical CTC-style decoder [15], reading out the nucleotide in each position with the highest probability. Thus, the model need not make any assumption of the speed of DNA fragment translocation and can avoid the errors introduced during segmentation.
To improve the basecalling speed and to minimise its memory requirements, the neural network is run on a 300-signal sliding window (equivalent to approximately 20 bp), overlapping the sequences on these windows and generating a consensus sequence. Chiron has the potential to stream these input raw signal 'slices' into output sequence data, which will become an increasingly important aspect of basecalling very long reads (100 kb+), particularly if used in conjunction with the read-until capabilities of the MinION.
Our model was either the best or second-best in terms of read-level accuracy on all of the datasets we tested. This includes the human dataset, despite the fact that the model has not seen human DNA during training. Our model has only been trained on a mixture of 2,000 bacterial and 2,000 viral reads. The most accurate basecaller is the proprietary ONT Albacore basecaller. Chiron is within 1% of its accuracy on bacterial DNA, but only within 2% on human DNA. More extensive training on a broader spectrum of species, including human, can be expected to improve the performance of our model. There are also improvements in accuracy to be gained from a better alignment of overlapping reads and consensus calling. Increasing the size of the sliding window will also improve accuracy, but at the cost of increased memory and running time.
Bacterial and viral genome assemblies generated from Chiron-basecalled reads all had less than 0.5% error, whereas those generated by Albacore had up to 0.8% error (Figure 3). This marked reduction in error rate is essential for generating accurate SNP genotypes, a pre-requisite for many applications such as outbreak tracking. These results are consistent with those reported in a recent study of read-level and assembly-level accuracy for K. pneumoniae [19].
Our model is substantially more computationally expensive than Albacore and somewhat more computationally expensive than BasecRAWller. This is to be expected given the extra depth in the neural network. Our model can be run in a GPU mode, which makes computation feasible on small to medium sized datasets on a modern desktop computer. Our method can be further sped up by increasing the step size of the sliding window, although this may impact accuracy. There are also several existing methods which can be used to accelerate NN-based basecallers such as Chiron. One such example is quantization, which reformats 32-bit float weights as 8-bit integers by binning the weights into a linear set of 256 values. As neural networks are robust to noise, this will likely have negligible impact on performance. Weight pruning is another method used to compress and accelerate NNs; it prunes the weights whose absolute value is under a certain threshold and then retrains the NN [20].
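The two acceleration ideas mentioned above can be illustrated with a minimal sketch. It is not part of Chiron; the function names (`quantise`, `dequantise`, `prune`) and the min/max-based linear binning are assumptions made for illustration, and the retraining step that follows pruning in [20] is omitted.

```python
def quantise(weights, levels=256):
    """Bin float weights into `levels` evenly spaced values (8-bit codes for 256).

    Assumption: bins span [min(weights), max(weights)] linearly.
    Returns the integer codes plus the offset and scale needed to decode them.
    """
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (levels - 1) or 1.0  # fall back to 1.0 if all weights are equal
    codes = [round((w - lo) / scale) for w in weights]
    return codes, lo, scale


def dequantise(codes, lo, scale):
    """Recover approximate float weights from integer codes."""
    return [lo + c * scale for c in codes]


def prune(weights, threshold):
    """Zero out weights whose absolute value is under `threshold` (no retraining here)."""
    return [0.0 if abs(w) < threshold else w for w in weights]
```

With 256 levels, the reconstruction error of each weight is at most half a bin width, which is the sense in which quantization acts as a small amount of added noise.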

Conclusion
We have presented a novel deep neural network approach for segmentation-free basecalling of raw nanopore signal. Our approach is the first method that can map the raw signal data directly to base sequence without segmentation. We trained our method on only 4,000 reads sequenced from the simple genomes of Lambda virus and E. coli, but the method is sufficiently general to basecall data from other species, including human. Our method has state-of-the-art accuracy, outperforming the ONT cloud basecaller Metrichor as well as another third-party basecaller, BasecRAWller.

Deep neural network architecture
Our model combines a 5-layer CNN [21] with a 3-layer RNN and a fully connected network (FNN) in the last layer that calculates the probability for a CTC decoder to get the final output. This structure is similar to that used in speech recognition [22]. Both the CNN and RNN layers are found to be essential to the basecalling, as removing either will cause a dramatic drop in prediction accuracy, which is described further in the Training section.
The input signal is normalised by subtracting the mean of the whole read and dividing by the standard deviation: $s' = (s - \bar{s}) / \mathrm{std}(s)$.
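The read-level normalisation can be sketched in a few lines. This is an illustrative sketch rather than Chiron's own code; the function name `normalise_signal` and the choice of the population standard deviation are assumptions, since the text does not say which estimator is used.

```python
from statistics import fmean, pstdev


def normalise_signal(signal):
    """Return (s - mean(s)) / std(s) computed over the whole read."""
    mu = fmean(signal)
    sigma = pstdev(signal)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("a constant signal cannot be normalised")
    return [(x - mu) / sigma for x in signal]
```

After this step every read has zero mean and unit variance, so reads with different baseline current levels are brought onto a common scale.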
Then the normalised signal is fed into a residual block [23] combined with global batch normalisation [24] in the 5 convolution layers to extract the local pattern from the signal. The stride is set to 1 to ensure the output of the CNN has the same length as the input raw signal. The residual block is illustrated in Figure 1. A convolution operation with an $l \times m$ filter, $n \times p$ stride and $s$ output channels on a $k$-channel input is defined as:
$$\mathrm{Output}(i, j, s) = \sum_{di=0}^{l-1} \sum_{dj=0}^{m-1} \sum_{q=0}^{k-1} \mathrm{Input}(i \cdot n + di,\; j \cdot p + dj,\; q) \cdot \mathrm{Filter}(di, dj, q, s).$$
An activation operation is performed after the convolution operation. Various activation functions can be chosen; in this model a Rectified Linear Unit (ReLU) function is used, which has been reported to perform well in CNNs, defined as:
$$\mathrm{ReLU}(x) = \max(0, x).$$
Following the convolution layers are multiple bi-directional RNN layers [25]; an LSTM cell [26] is used as the RNN cell, with a separate batch normalisation on the inside cell state and input term [27].
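A typical batch normalisation procedure [24] is
$$\mathrm{BN}(x; \gamma, \beta) = \beta + \gamma \, \frac{x - \hat{E}[x]}{\sqrt{\widehat{\mathrm{Var}}[x] + \epsilon}},$$
where $x$ is an inactivation term, $\gamma$ and $\beta$ are the learned scale and shift, and $\epsilon$ is a small constant for numerical stability.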
Let $h^l_t$ be the output of the $l$-th RNN layer at time $t$. Batch normalisation for an LSTM cell is applied separately to the inside cell state and the input term [27]. The final output is passed through a fully connected network followed by a softmax operation, $\mathrm{softmax}(z)_j = e^{z_j} / \sum_{k} e^{z_k}$. The output $o_i$, $i = 1, 2, \dots, T$, predicts the symbol given the input vector $x$, $P(o_i = j \mid x)$. If the read is a DNA sequence then $j \in \{A, G, C, T, b\}$, where $b$ represents a blank symbol (Figure 1). During training, the CTC loss is calculated between the output sequence $o$ and the label $y$ [15], and back-propagation is used to update the parameters. An Adam optimizer [28] with an initial learning rate of 0.001 is used to minimise the CTC loss.
During inference, the final sequence is constructed using either a greedy decoder [15] or a beam-search decoder [29]. The greedy decoder works by first taking the symbol of maximum probability at each position of $o$, and then producing the sequence call by first collapsing consecutive repeats and then removing the blank symbols. For example, if the greedy path of an output $o$ is A A - - - A - - G -, where - represents the blank symbol, the consecutive repeats are collapsed first to give A - A - G -, and the blanks are then removed to give the final sequence AAG. The beam search decoder with beam width W maintains a list of the W most probable sequences (after collapsing repeats and removing blanks) up to position $i$ of $o$. To obtain this list at position $i+1$, it computes the probability of all possible extensions of the W most probable sequences at position $i$ by adding each symbol according to $p(o_i = j)$, collapsing and summing over repeated bases, or repeated blanks which are terminated by a non-blank. The greedy decoder is a special case of the beam-search decoder with a beam width of 1. It should be noted that the model can still call homopolymer repeats provided each repeated base is separated by a blank, which is typically the case.
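The greedy decoding rule can be written out as a short sketch. It is illustrative only: the function names and the ordering of symbols in `alphabet` are assumptions, and the input is taken to be a list of per-position probability rows rather than the tensors used by the actual model.

```python
BLANK = "-"


def greedy_ctc_collapse(path, blank=BLANK):
    """Collapse consecutive repeats in a best path, then drop blank symbols."""
    collapsed = []
    previous = None
    for symbol in path:
        if symbol != previous:  # keep only the first symbol of each run
            collapsed.append(symbol)
        previous = symbol
    return "".join(s for s in collapsed if s != blank)


def greedy_decode(probabilities, alphabet="AGCT-"):
    """Pick the most probable symbol at each position, then collapse the path.

    Assumption: each row of `probabilities` is ordered like `alphabet`,
    with the blank symbol last.
    """
    path = [alphabet[max(range(len(row)), key=row.__getitem__)] for row in probabilities]
    return greedy_ctc_collapse(path)
```

Applied to the path from the example in the text, `greedy_ctc_collapse("AA---A--G-")` yields `AAG`: the two leading A symbols merge into one, while the later A survives as a separate base because a blank separates it from the first.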
Convolutional network to extract local patterns: Filters with 256 channels are used for all five convolutional layers. In each layer, there is a residual block [23] (Figure 1) composed of two branches. A 1x1 filter is used for reshaping in the first branch. In the second branch, a 1x1 convolution filter is followed by a rectified linear unit (ReLU) [30] activation function, then a 1x3 filter with a ReLU activation function, and finally a 1x1 filter. All filters have the same channel number of 256. An element-wise addition is performed on the two branches, followed by a ReLU activation function. A global batch normalisation operation is added after every convolution operation. Larger kernel sizes (5, 7, 11) and different channel numbers (128, 1024) were also tested, and the above combination was found to yield the best performance.
Recurrent layers for unsegmented labelling: The local pattern extracted from the CNN described above is then fed to a 3-layer RNN (Figure 1). Under the current ONT sequencing settings, the DNA fragments translocate through the pore with a speed of roughly 250 or 450 bases per second, depending on the sequencing chemistry used, while the sampling rate is 4000 samples per second. Because the sampling rate is higher than the translocation rate, each nucleotide usually stays in the current position for about 5 to 15 samplings, on average. Furthermore, as a number of nearby nucleotides also influence the current, 40 to 100 samples (based on a 4- or 5-mer assumption) could contain information about a particular nucleotide. A 3-layer bidirectional RNN is used for extracting this long-range information. LSTM (Long Short Term Memory) cells [31,32] with 200 hidden units are used in every layer, and a fully connected neural network (FNN) is used to translate the output from the last RNN layer into a prediction. The output of the FNN is then fed into a CTC decoder to obtain the predicted nucleotide sequence for the given raw signals.
Improving basecalling performance: To achieve better accuracy and lower memory allocation, a sliding window (default of 300 raw signals) is applied to the long raw signal with a pre-set step size (default of 10% of the window size). This gives a group of short reads of uniform length (the window length) that overlap along the original long read. Basecalling is then run in parallel on these short reads, and the whole DNA sequence is reassembled by finding the maximum overlap between each pair of adjacent short reads and reading out the consensus sequence. Note that the reassembly is very easy because the order of the short reads is known. This procedure improves the accuracy of the basecalling and also enables parallel processing on one read.
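The slicing and reassembly procedure can be illustrated with the following sketch. It is a simplification under stated assumptions, not Chiron's implementation: the function names are hypothetical, trailing samples that do not fill a complete window are dropped, and adjacent short reads are joined at their longest exact suffix/prefix match instead of by the pair-wise alignment and consensus calling described in the text.

```python
def slice_signal(signal, window=300, step=30):
    """Cut a long raw signal into overlapping windows of equal length.

    Assumption: samples after the last complete window are discarded.
    """
    if len(signal) <= window:
        return [signal]
    return [signal[start:start + window]
            for start in range(0, len(signal) - window + 1, step)]


def merge_pair(left, right):
    """Append `right` to `left`, joined at the longest exact suffix/prefix overlap."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return left + right  # no overlap found: plain concatenation


def stitch(short_reads):
    """Reassemble basecalled slices, relying on their known left-to-right order."""
    if not short_reads:
        return ""
    merged = short_reads[0]
    for read in short_reads[1:]:
        merged = merge_pair(merged, read)
    return merged
```

Because the order of the slices is known, each slice only needs to be compared with the sequence assembled so far, which is why the reassembly stays cheap even for long reads.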

Data preparation
Sequencing: The library preparations of the E. coli and M. tuberculosis samples were done using the 1D gDNA selecting for long reads using SQK-LSK108 (March 2017 version) protocol with the following modifications: the incubation time was increased to 20 minutes in each end-repair and ligation step; 0.7x Agencourt® AMPure® XP beads (Beckman Coulter) were used immediately after the end-repair step, with the eluted beads incubated for 10 minutes; and elution buffer (ELB) warmed to 50 °C was used, with the eluted beads incubated at the same temperature.
Labelling of raw signal: Metrichor, the basecaller provided by ONT which runs as a cloud service, is first used to basecall the MinION sequencing data. Nanoraw [34] is then used for labelling the data. Briefly, the basecalled sequence data is aligned back to the genome of the sample, and from the alignment the errors introduced by Metrichor are corrected, to avoid the bias from Metrichor being learned by Chiron; the corrected data is then mapped back to the raw data. The resulting labelling consists of the raw signal data, as well as the boundaries of the raw signals when the DNA fragment translocates to a new base. We use the base-level segmentation of the raw data to obtain matched pairs of signal segments (of lengths 200, 400 and 1000) together with the corresponding DNA base sequence. From this point onwards, the exact matching of the signal to each base within a segment is disregarded.
Training dataset. A data set using 2,000 reads from E. coli and 2,000 reads from Phage Lambda is created for training Chiron. At the start of every training epoch, the dataset is first shuffled and then fed into the model in batches. Training on this mixed dataset gave the model better generality and accuracy, not only on E. coli and Phage Lambda but also on M. tuberculosis and Human data.

Training
The labelling from Metrichor described previously is used to train Chiron. Although the neural network architecture is translation invariant and not restricted by the sequence length, sequences of uniform length are suited to batch feeding and thus accelerate the training process. For this reason, the original reads were cut into short segments with uniform lengths of 200, 400 and 1000, and the model was trained on batches of these lengths in alternation. Several different neural network architectures were tested (see Table 5), with the CNN-RNN architecture having the best accuracy compared to a CNN-only or RNN-only network. Using more layers also seems to increase the performance of the model; however, the time consumed for training and basecalling also increases. In the final structure, a NN with 5 convolution layers and 3 recurrent layers is adopted, as adding layers beyond this gave negligible performance improvement but required more computation and increased the risk of overfitting.

Parameters for basecalling
All basecallers were invoked on the same set of reads for each sample. When basecalling with Chiron, the raw signal was first sliced with a window of length 300, slid in steps of 30; the sliced segments were then fed into the basecaller with a batch size of 1100. The output short reads were assembled by pair-wise alignment between neighbouring reads, and the consensus sequence was output from this alignment. All basecalling with Albacore (version 1.1.1 and version 2.0.1) and BasecRAWller [14] (version 0.1) was done with default parameters. For the configuration setting in Albacore, r94_450bps_linear.cfg was used for all samples, as this matches the flowcell and kit used for each sample.

Quality score
The quality score is calculated as $qs = 10 \cdot \log_{10}(P_1 / P_2)$, where $P_1$ is the probability of the most probable base in the current position and $P_2$ is the probability of the second most probable base in the current position.
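As a worked illustration of this formula, the sketch below computes the score for one position. The function name and the handling of a zero second probability are assumptions; the text does not say whether the blank symbol's probability is included among the candidates, so the caller decides which probabilities to pass in.

```python
import math


def quality_score(probabilities):
    """Return 10 * log10(P1 / P2) for the two largest probabilities at a position."""
    top, second = sorted(probabilities, reverse=True)[:2]
    if second <= 0:
        return math.inf  # assumption: an unchallenged call gets an unbounded score
    return 10 * math.log10(top / second)
```

A position where the best base is ten times more probable than the runner-up therefore scores 10, and a tie between the two best bases scores 0.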

Comparison of raw read accuracy
To assess the performance of each program, the resulting FASTA/FASTQ file from basecalling was aligned to the reference genome using graphmap [35] with the default parameters. The resulting BAM file is then assessed by the japsa error analysis tool (jsa.hts.errorAnalysis), which looks at the deletion, insertion and mismatch rates, the number of unaligned and aligned reads, and the identification rate compared to the reference genome. The identity rate is calculated as $\frac{\text{number of matched bases}}{\text{number of bases in reference}}$ and is the marker used here for basecalling accuracy.

Assembly Identity Rate Comparison
We assessed the quality of assemblies generated from reads produced by different basecallers. For each basecaller, a de-novo assembly is generated using only Nanopore reads for the M. tuberculosis, E. coli and Lambda Phage genomes. We use Minimap2 [36] and Miniasm [37] to generate a draft genome, then Racon [38] is used to polish the draft genome for 10 rounds.

Data availability
The M. tuberculosis sequencing data have been deposited in GenBank under project number PRJNA386696. The Human nanopore data were downloaded from https://github.com/nanopore-wgs-consortium/NA12878. The E. coli data are in the process of being deposited in GenBank.