Nanopore base calling on the edge

Abstract Motivation MinION is a portable nanopore sequencing device that can be easily operated in the field with features including monitoring of run progress and selective sequencing. To fully exploit these features, real-time base calling is required. Up to date, this has only been achieved at the cost of high computing requirements that pose limitations in terms of hardware availability in common laptops and energy consumption. Results We developed a new base caller DeepNano-coral for nanopore sequencing, which is optimized to run on the Coral Edge Tensor Processing Unit, a small USB-attached hardware accelerator. To achieve this goal, we have designed new versions of two key components used in convolutional neural networks for speech recognition and base calling. In our components, we propose a new way of factorization of a full convolution into smaller operations, which decreases memory access operations, memory access being a bottleneck on this device. DeepNano-coral achieves real-time base calling during sequencing with the accuracy slightly better than the fast mode of the Guppy base caller and is extremely energy efficient, using only 10 W of power. Availability and implementation https://github.com/fmfi-compbio/coral-basecaller Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
MinION by Oxford Nanopore Technologies (ONT) is a portable DNA sequencer that measures electric current as DNA passes through nanopores.Electrical signals produced by the device need to be translated into sequences by a base caller software.Base calling of nanopore reads is a non-trivial task, and existing tools require powerful hardware and high energy consumption to operate in real time.
In this paper, we present a new base caller DeepNano-coral, which runs on the Coral accelerator featuring the Edge tensor processing unit (TPU), a small, energy-efficient, cheap USB-connected device.DeepNano-coral can process approximately 1.5 million signals per second, which is enough to provide real-time base calling for a MinION device.This makes our base caller ideal for field sequencing applications, where power efficiency and low hardware requirements are highly desirable.Real-time base calling is also essential in unlocking some of the most promising MinION device capabilities, such as its ability to adapt the run length to the sample composition, or selective sequencing [15].
Current base callers are typically based on deep neural networks.Guppy, a base caller provided by ONT, is based on recurrent neural networks (RNN), and provides two different architectures: a fast base caller, which can base call with 85-92% median read accuracy in real time, using recent GPU cards, and a high-accuracy base caller (90-96% median read accuracy), which is too slow to arXiv:2011.04312v1 [cs.LG] 9 Nov 2020 be used in real time without specialized setup.DeepNano-blitz trades off a bit of accuracy in order to provide real-time base calling on a common CPU using a specifically engineered RNN, thus obviating the need for GPUs [1].Other RNNbased base callers, including Chiron [21], are too slow for real-time base calling.Another class of nanopore base callers is based on convolutional neural networks (CNN).In particular, Bonito v.0.2 [18] adapts Jasper/Quartznet [10,13] speech recognition architecture to base calling tasks.At the time of writing, Bonito provided the most accurate base calling, however, the time requirements exceed even the Guppy high-accuracy mode.
The Coral Edge TPU accelerator by Google is a limited device, which was designed mostly for vision tasks, such as image classification [3].It contains only 8 MB of memory (used for storing both model weights and intermediate tensors), it only works with 8-bit integers (while GPUs typically work with 32-bit floating point numbers), and the compiler and libraries provide only a limited set of building blocks, optimized mostly for CNNs with small receptive fields, which are typically used in image processing.Such a configuration mostly excludes possibility of adaptation of RNN-based architectures and even adapting CNN-based architectures, such as Bonito, is a challenge, due to large size of the network and the use of large receptive fields.
Our new base caller DeepNano-coral, running on the Edge TPU, provides real-time base calling that is significantly more energy efficient than existing approaches.To achieve this goal, we introduce the following innovations: -A novel component k-blueprint-separable-convolution, which replaces separable convolutions as a building block for CNNs.A separable convolution approximates a full convolution by using a depthwise operation and a pointwise operation, which are less computationally intensive.The k-blueprintseparable convolution factorizes the convolution into the two parts differently, in effect reducing the depthwise operation at the cost of increasing computation in the pointwise operation.Even though the new convolution component has a higher number of parameters and floating point operations (flops), it is more efficient on the Edge TPU, since in this architecture, depthwise operations do not fully utilize the hardware [24,5], possibly due to being bound by the memory bandwidth.-A new design of the residual block, which is a fundamental building block of the QuartzNet speech recognition architecture and was also deployed for base calling in Bonito.To improve its performance on the Edge TPU, we add a compression operation at the start of the residual block, taking x consecutive data samples of C channels each, and converting them into a single compressed sample of Cy channels.Using compression ratio x/y < 1, we save memory and allow subsequent convolutions to effectively mix x original samples and thus increase the receptive field of the block.-A surprising observation that identity initialization of some parts of the architecture helps the training and improves the prediction accuracy in some circumstances, which contradicts usual recommendations for initializing parameters of neural networks before training.
Our experiments show that DeepNano-coral achieves the accuracy comparable to other real-time base callers, Guppy fast and DeepNano-blitz.Such accuracy is sufficient for real-time monitoring tasks, such as monitoring barcode composition in pooled libraries or species composition in environmental or clinical samples [1].DeepNano-coral achieves this goal much more energy efficiently, using only 0.6-0.7 Wh of energy to base call a test sample of 40.8 Mbp of nanopore sequences at a speed of 1.54 million signal samples per second (on the same setup, the closest competitor, Guppy fast, uses 1.4 Wh of energy, processes 4.37 million signal samples per second, with up to 2 percentage points lower accuracy depending on the data set).
Background.Nanopore base calling translates the electrical signals produced by the sequencer into a sequence of DNA bases.The signal level depends on the context of about 5-12 DNA bases passing through the nanopore.The signal is read about 4000 times per second and DNA moves through the pore at the speed of approximately 450 bases per second, but the speed is rather uneven.This means that on average each shift of the context by one base corresponds to roughly 9 measured values with a large variance.This makes the problem somewhat similar to speech recognition.Note that different contexts may produce very similar signal levels and that there is a significant amount of noise present in the signal readouts, complicating the base calling problem.
The work in this paper is based on the QuartzNet architecture [10] for speech recognition, which has also been used in the Bonito base caller [18] developed by ONT.Briefly, a window of the raw signal of length T is used as an input to a deep CNN which uses several types of blocks to process the signal (see Figure 1).In the final decoder block, the network produces a tensor with 5 output channels.The five channels are converted by the softmax function into probability distributions over possible outputs A,C,G,T,-at each position, with dash corresponding to an empty output.Finally, the CTC layer [4] chooses a DNA sequence with the highest posterior probability.
In the QuartzNet / Bonito architecture, convolutions are organized into building blocks of two types B and C. The structure of a C-type block is simply a sequence of three layers: a convolutional layer, a batch normalization [8] (a layer that renormalizes channel values and stabilizes gradients for better training), and an activation function (Bonito uses Swish [19]).
The B-type blocks use residual skip connections.The input signal is split into two branches.The main branch consists of R copies of a C-type sub-block, with the last copy omitting the activation function.The second branch, called skip connection, consists of a pointwise convolution and batch normalization.The two branches are summed together and an activation is applied to the output.
The resulting network used in Bonito is large and computationally intensive.Some intermediate results reach size of up to B × T /3 × 464, where B is the number of sequences combined to a batch, and T is the length of the sequence.The network has 36 convolutional layers with 6.6 million parameters in total, requiring roughly 2.2 million multiplications per sample.

Methods
In this section, we present the architecture of our new base caller designed for the Edge TPU.Our architecture is inspired by the Bonito CNN, which was drastically scaled down and key components were replaced by the enhancements described here.Further technical details regarding adapting Bonito-like architecture to the Edge TPU are described in the Supplement.
The k-blueprint-separable convolutions.A convolutional layer is the dominant building block of many neural network architectures, mostly in the domain of image recognition, but recently also for automated speech recognition [13,10,22].In this paper, we consider 1D convolutions, which take as an input a tensor X of dimensions (T, C in ) representing a data stream of length T , each data point containing C in values called channels.To apply a convolution with odd depth D, the input tensor X is first padded with D 2 zeros at the beginning and at the end.Then the output tensor Y with dimensions (T, C out ) is computed as follows: Y t,j = 0≤d<D,0≤i<Cin X t+d,i W j,d,i + B j , where W and B are trained weights representing convolution kernel weights and bias.
An obvious drawback of full convolutions is a large number of parameters (C out DC in ) and required flops (T C out DC in ).A standard solution is to use a separable convolution [16], which is an approximation of the full convolution by a composition of two operations: depthwise and pointwise.The depthwise operation works on each channel separately: . This is followed by the pointwise operation, which mixes the channels at each time point: The ordering of pointwise and depthwise operations was chosen somewhat arbitrarily, and reversing it may improve the accuracy [6].The variant with the reversed order is called a blueprint-separable convolution.Figure 2 illustrates receptive fields of basic operations used in convolutions.
Recent works [24,5,14] indicate that separable convolutions do not always improve the speed on non-CPU architectures, because the depthwise operation requires a smaller ratio of flops to memory operations, which are generally slow.A full convolution with depth D = 3 can be faster than a separable convolution with the same depth [24].Full convolutions with small depths are thus feasible in image recognition, while in base calling, the kernels need to be much larger.
Our design of k-separable convolutions is heavily influenced by this observation.Our goal is to reduce the time-consuming depthwise operations using dilation with step size k, and compensate by replacing the pointwise operation by a convolution operating on a window of size k instead of a single point, as illustrated in Figure 3.
Namely, we start with what we call a fat-pointwise operation, which is a standard convolution of depth k: Z t,j = 0≤d<k,0≤i<Cin X t+d,i W The second step uses a dilated depthwise operation with depth D/k, which skips points by using dilation k: . This reduces the depthwise kernel (and thus memory I/O) by a factor of k, while retaining the receptive field D of the whole layer.
Note that the special case of k = 1 leads to a standard blueprint convolution, while we typically use k = 3, which on the Edge TPU roughly maintains the same computation time as separable convolutions, while increasing the accuracy.
Figure 4 demonstrates the performance of k-separable convolutions on two configurations used in our experiments in the next section.Our k-separable convolutions offer running times comparable to separable convolutions, while providing roughly k times more parameters, which increases their expressive power.
Residual block with depth-to-space compression.Our second change also targets reduction of the depthwise convolution.As shown in Figure 4, when we apply convolutions on a shorter input, we can use more channels in a comparable time.Our idea is to redesign the residual block of the CNN (B-type block in Figure 1) so that we compress its depth and increase the number of channels.In particular, compression with depth-to-space ratio x : y means converting input tensor (T, C) to tensor (T /x, Cy) using a strided convolution with both depth and stride set to x (see Figure 2).This convolution takes x consecutive data samples of C channels and converts them into a single compressed sample of Cy channels.At the end of the residual block, we restore the original dimensions with a strided transposed convolution.This makes the new block a drop-in replacement for the original B-type block design (see Figure 5).Compression ratio x/y < 1 saves memory, which is essential due to limited Coral resources.While compression may sometimes decrease accuracy, the network may learn to de-duplicate information from consecutive data samples, and thus prevent data loss.In fact, any subsequent pointwise operations effectively operate on x original samples, yielding increased receptive fields.Thus, we can further lower the depth of the depthwise operation in the block, offsetting larger computation of pointwise operations, which were increased by a factor of y 2 /x.In our experiments, compression ratio 3:2 works well on Coral.
To complete the residual block, we add the depthwise operation before the decompression.While the original B-type block repeats separable convolutions R times, we repeat them R − 2 times, since we consider the compression and decompression blocks as replacements for two separable convolutions.

Identity initialization.
A proper neural network initialization can affect both trainability and final accuracy of models [2,25,20,11,17].A standard way of initializing CNN architectures is to draw the entries of weight matrices from the uniform distribution U (−k, k), where k = 6/ √ C in + C out , and to set the bias terms to zero [2].The weighting factor k is used to keep the gradients from vanishing or exploding as the number of layers increases.Recent introduction of BatchNorm however obviates such problems, as the results are renormalized [8].
In some cases, task specific initialization may bring an improvement over the generic initialization strategies [11], and this proved to be the case for our base calling application as well.We initialize all k-separable blocks within the compressed main branch to near-identity, that is, depthwise kernels are initialized as W (D) d,j = δ (D/K)/2 ,d and fat pointwise kernels to W (P ) j,d,i ∼ δ K/2 ,d (δ i,j + U (− , )), where δ x,y = 1 if and only if x = y.We experimented with several other initializations and observed that setting the depthwise operations to identity helps the most, while setting pointwise operations to identity brings only a small additional improvement.On the other hand, initialization of the skip connection as well as of the compression/decompression block does not seem to affect the results significantly.In our experiments, the identity initialization described above speeds up the process of training and decreases overfitting (see Results).We believe that this surprising effect is explained by the properties of the base calling task.
Due to the nature of nanopore raw sequencing data, base calling is composed of two tasks.First, the input signal needs to be segmented into events, each event corresponding to the shift of the DNA currently read by the nanopore head by one base.The length distribution of these events is highly variable.The second task is to recognize the base under the nanopore head given the context of several events.Although initially base callers have performed these tasks separately, modern neural network approaches combine them into a single optimization problem.
One would assume that the second task of correctly identifying bases is the core of the problem.A quick experiment in which Bonito is provided with an additional binary input indicating event boundaries (as determined from a groundtruth alignment to a reference) shows otherwise.In particular, the additional input dramatically speeds up the training so that the network can in minutes outperform days-long Bonito training.While the modified network cannot be used for practical base calling (because the base caller obviously cannot receive ground-truth event boundaries as an input), it suggests that identification of events is in fact the harder part of the base calling task.This is further corroborated by the fact that even a simple logistic regression can distinguish purines A,G from pyrimidines C, T in a correctly segmented signal.
Our depthwise identity initialization indeed makes sense assuming that the network spends much more time learning how to split the raw signal into events rather than recognizing individual bases.Identity initialization may allow the network to learn the easy task of distinguishing bases first and then spend the rest of its capacity on learning intricate time-dependencies without the need for unlearning spurious long-range correlations that may have been introduced by random initial weights.

Results
In this section, we compare the speed, energy consumption, and accuracy of DeepNano-coral with other tools on a data set of R9.4.1 reads from K. pneumoniae [23] and human [9] (see Supplement).The base calls were mapped to the reference using minimap2 [12].DeepNano-coral slightly outperforms Guppy fast in most accuracy measures (Table 1).Guppy fast would currently be a method of choice for live base calling on a computer with a recent GPU card (compute capability 6.2, 4GB of memory).As demonstrated earlier [1], even slightly lower accuracy of DeepNano-blitz is sufficient for run monitoring, such as barcode composition or metagenomic analysis.Note that DeepNano-blitz provides real-time base calling on a CPU without the use of any accelerator.Guppy in the high accuracy (hac) mode illustrates accuracy gains possible with more extensive computational resources typically beyond the possibilities of real-time base calling.
We have measured the speed and energy consumption on two computers with different setups (Table 2), a desktop (i7-7700k 4 core CPU; NVIDIA GTX 1650 GPU) and a laptop (i7-7700HQ 4 core CPU) incapable of running the GPU version of Guppy.To run DeepNano-coral, we have attached the Coral Edge TPU device through USB 3.0 interface.On both computers, DeepNano-coral achieved the speed necessary for live base calling (1.5M signals per second) and used less than 11W (computed as a difference between the idle energy consumption and the consumption during base calling).On our testing set, the total energy spent on base calling was 0.58-0.68Wh,roughly half of the energy used by Guppy fast on the desktop.Although Guppy fast consumed less energy when baseline is included due to its shorter running time, in a practical setting, this would not translate to energy savings as the computer needs to run throughout the sequencing.
DeepNano-coral runs on GPU at lower speed and with higher energy consumption than GPU-and CPU-optimized software.This underlines the importance of optimization of the network architecture for a particular platform.
To further illustrate the impact of our new network designs on the base calling accuracy, we started with the small Bonito architecture (see the Supplement), in which we replaced various components by our new designs presented in the Methods section.In these experiments, we modify only B-type (residual) blocks, keeping the standalone C-type blocks the same.We however verified that altering configuration of these C-type blocks does not affect the accuracy significantly.
Figure 6 shows the accuracy and speed starting with the small Bonito and adding the following features: 3-blueprint-separable convolutions, compression with ratio 3:2, combination of the two, and finally the identity initialization.For each variant, we test several kernel depths.Note that 3-separable convolutions have a symmetrical receptive field only for depth of size k = 3(2n + 1).In most experiments, we stop at kernel size 21, because larger kernels lead to base calling speed below the speed of sequencing.In general, adding our modifications increases the accuracy at comparable speed, and the most accurate version is the one with all our improvements combined.We have presented a base caller capable of running on the Edge TPU in real time.To do so, we have designed new types of blocks which can be used as dropin replacements for separable convolutions and QuartzNet-style residual blocks, potentially improving their speed/accuracy tradeoff in other applications as well.
From a practical standpoint, our work enables real-time base calling with low energy consumption on modest hardware with addition of a $70 USB device.This contribution will help researchers attempting nanopore sequencing in field conditions with limited energy resources.Using Edge TPU as an alternative to GPU chips may also help to design new devices specifically targeted at nanopore sequencing, analogous to the MK1C device manufactured by ONT.
Further research into decreasing the size of the base calling neural networks may yield even better results on small accelerators.One option is to use knowledge distillation [7], where the smaller network is trained on outputs from a larger network.Another avenue is to consider a richer set of outputs from the network.In our case, the softmax layer output probabilities over the {A, C, G, T, −} alphabet, which is followed by CTC decoding.Guppy and Bonito v0.3 use a more complicated scheme, which could be adapted.The risk here is that we would need to do more intensive decoding on the CPU, which may become a bottleneck.

S1 Engineering neural networks for Coral Edge TPU
In this section, we discuss practical steps needed to adapt the Bonito GPU-tuned base caller architecture to run on and fully utilize Edge TPU accelerator.These observations can be useful also to others who want to optimize different neural network architectures for this platform.Our first step was to create a scaled-down version of Bonito, capable of running on the Edge TPU.Bonito uses large convolutions, with up to 464 output channels.Our experiments suggest that the performance of the Edge TPU accelerator severely deteriorates beyond approximately 128 channels.To stay safely within these bounds, we decrease the maximum number of channels to 128.We also scale down the depth of the depthwise operation, considering the range of 9-33, as the performance cost of larger depthwise kernels is noticeable (see the main text).The final version of DeepNano-coral uses kernel depth 21.
To avoid an extensive architecture search, we use a more uniform configuration of building blocks.In particular, all residual B-type blocks use five convolution blocks, which is a middle ground compared to the original Bonito configuration.
Another issue is related to the quantization.It would be ideal to fuse the activation function to the preceding convolution layer, but the development tool chain does not support this.It also does not support the Swish [19] activation function used in Bonito.For these reasons, we have used ReLU6 as the activation function, which can be fused into preceding convolution layers by a simple value clipping of int32 accumulator values used in matrix multiplication.
Further issues are related to limitations of the available development tools for the Edge platform.Typically, neural network inference is done in batches, with several inputs processed simultaneously to optimally utilize hardware capacity.However, this setup is not supported in the current development tool chain.Another problem is that 1D convolutions are internally converted to 2D convolutions, which adds a reshape operation before and after the convolution.This has a severe performance impact, since reshape operations are memory intensive.
To solve both these problems at the same time, we transform multiple 1D inputs into a single 2D "image" of dimensions B × T , where B is the batch size and T is the sequence length.The whole network is then rewritten to an equivalent network operating on this "image" using 2D convolutions.In our network, considering the utilization of the device and the target speed, we use B = 4 and T = 5004 (T must be a multiple of 9).Note that splitting the input signal into slightly overlapping chunks of 5004 observations is a reasonable compromise between overhead imposed by the overlaps and the capacity of the device.
After these modifications, we obtain a small Bonito-like architecture which the Edge TPU compiler is able to fit on the device.The overall architecture configuration is summarized in Table S1.Finally, the last softmax layer as well as CTC decoding are performed directly on the CPU.

S2 Training and Testing Sets
We have used the combination of the following public data sets to train the models: -Taiyaki set: Data set of 50k (downsampled to 5k) R9. (except flowcell FAB49164) from the human reference standard CEPH1463 from nanopore whole human genome sequencing project [9].
For testing, the following data sets were used: -Klebsiella: a benchmark set of native R9.4 K. pneumoniae reads [23].We only used reads before the sequencing restart.-Human testing: a sample of native R9.4.1 reads from chr14, chr15, chr16 (flowcell FAB49164) from human reference standard CEPH1463 from nanopore whole human genome sequencing project [9] The basic characteristics of all data sets are shown in Table S2.

S3 Training procedure details
We wrote our training pipeline in TensorFlow and the code is available at https://github.com/fmfi-compbio/coral-training.Our training schedule uses 6000 miniepochs, one miniepoch being 15 optimizer steps on batch of 100 sequences of length 5004 (5004 being the nearest multiple of 9 targeting our desired sequence length 5000).We use Adam optimizer with a schedule that uses rather high learning rate decaying linearly LR(t) = 0.01 * (1 − t) updated after each miniepoch, where 0 ≤ t ≤ 1 denotes progress.We have a short ramp-up period over first 10 miniepochs where we increase learning rate from 0 to the maximum value to avoid "shocking" the model with high learning rate from the start.At the end of training we use standard TensorFlow post-training quantization to convert the model into a format compatible with Edge TPU compiler.

Table 1 .
Comparison of base calling accuracy.Read accuracy is computed as one minus the ratio of the alignment edit distance and the base call length.We report the median read accuracy.

Table 2 .
Energy consumption and speed of different base callers (DN=DeepNano)

Table S1 .
Baseline small Bonito architecture we use in our experiments.

Table S2 .
Overview of training and testing sets used in the study.