RawHash: enabling fast and accurate real-time analysis of raw nanopore signals for large genomes

Abstract Summary: Nanopore sequencers generate electrical raw signals in real-time while sequencing long genomic strands. These raw signals can be analyzed as they are generated, providing an opportunity for real-time genome analysis. An important feature of nanopore sequencing, Read Until, can eject strands from sequencers without fully sequencing them, which provides opportunities to computationally reduce the sequencing time and cost. However, existing works utilizing Read Until either (i) require powerful computational resources that may not be available for portable sequencers or (ii) lack scalability for large genomes, rendering them inaccurate or ineffective. We propose RawHash, the first mechanism that can accurately and efficiently perform real-time analysis of nanopore raw signals for large genomes using a hash-based similarity search. To enable this, RawHash ensures the signals corresponding to the same DNA content lead to the same hash value, regardless of the slight variations in these signals. RawHash achieves an accurate hash-based similarity search via an effective quantization of the raw signals such that signals corresponding to the same DNA content have the same quantized value and, subsequently, the same hash value. We evaluate RawHash on three applications: (i) read mapping, (ii) relative abundance estimation, and (iii) contamination analysis. Our evaluations show that RawHash is the only tool that can provide high accuracy and high throughput for analyzing large genomes in real-time. When compared to the state-of-the-art techniques, UNCALLED and Sigmap, RawHash provides (i) 25.8× and 3.4× better average throughput and (ii) significantly better accuracy for large genomes, respectively. Source code is available at https://github.com/CMU-SAFARI/RawHash.


Introduction
High-throughput sequencing (HTS) devices can generate a large amount of genomic data at a relatively low cost. HTS can be used to analyze a wide range of samples, from small amounts of DNA or RNA to entire genomes. Oxford Nanopore Technologies (ONT) is one of the most widely used HTS technologies that can sequence long genomic regions, called reads, with up to a few million bases. ONT devices use the nanopore sequencing technique, which involves passing a single DNA or RNA strand through a tiny pore (a nanopore or channel) at an average speed of 450 bases per second [1] and measuring the electrical current as the strand passes through. Nanopore sequencing enables two key features. First, nanopores provide the electrical raw signals in real-time as the DNA strand passes through a nanopore. Second, nanopore sequencing provides a functionality, known as Read Until [2], that can partially sequence DNA strands without fully sequencing them. These two features of nanopores provide opportunities for 1) real-time genome analysis and 2) significantly reducing sequencing time and cost.
Real-time analysis of nanopore raw signals using Read Until can reduce the sequencing time and cost per read by terminating the sequencing whenever sequencing the full read is not necessary. The freed-up nanopore can then be used to sequence a different read. A purely computational mechanism can send a signal to eject a read from a nanopore by reversing the voltage if the partial sequencing of a read meets certain conditions for a particular genome analysis, such as 1) reaching a desired coverage for a species in a sample [3] or 2) identifying that a read does not originate from a certain genome of interest (i.e., a target region) [1,4] and hence does not need to be fully sequenced. By terminating the sequencing of reads that do not correspond to the target region, the sequencer can spend time and resources on higher coverage sequencing of the reads that correspond to the target. This process is referred to as nanopore adaptive sampling. By providing high coverage at target regions and avoiding unessential sequencing of reads outside those regions, this approach can improve the quality of sequencing and the downstream analysis utilizing the obtained data.
To effectively utilize adaptive sampling in nanopore sequencing, it is crucial to have computational methods that can accurately analyze the raw output signals from nanopores in real-time. These methods must provide 1) low latency and 2) throughput matching or exceeding that of the sequencer [1,4,5]. Several works propose adaptive sampling methods for real-time analysis of raw nanopore signals [1,3-11]. However, these works have three key limitations. First, most techniques mainly use powerful computational resources, such as GPUs [3,7], or specialized hardware [5,8] due to the use of computationally-intensive algorithms such as basecalling, as we explain in detail in Section 5. This can make real-time genome analysis challenging for portable and low-cost nanopore-based sequencers, such as the ONT Flongle or MinION, which are not typically equipped with such resources. These techniques are therefore difficult to use in resource-constrained environments. Second, the sheer size of genomic data at the scale of large genomes (e.g., human genome) makes it challenging to process the data in real-time. This is because such large genomes require efficient and accurate similarity identification across a large number of regions. This renders many current methods [1,4] inaccurate or useless for large genomes, as they either cannot provide accurate results or cannot match the throughput of nanopores for these genomes. Third, machine learning models used in past works [3,6,7,9,10] to analyze raw nanopore signals often require retraining or reconfiguring the model to improve accuracy for a certain experiment, which can be a barrier to flexibly and easily performing real-time analysis. To our knowledge, there is no work that can efficiently and accurately perform real-time analysis of raw nanopore signals on a large scale (e.g., whole-genome analysis for human) without requiring powerful computational resources, and that can easily and flexibly be applied to a wide range of applications that could benefit from real-time nanopore raw signal analysis.
Our goal is to enable efficient and accurate real-time genome analysis for large genomes. To this end, we propose RawHash, the first mechanism that can efficiently and accurately perform real-time analysis of raw nanopore signals for large genomes in resource-constrained environments. Unlike past works, RawHash can efficiently scale to large genomes and perform accurate real-time genomic analysis without requiring computationally-intensive algorithms such as basecalling. Our key idea is to encode regions of the raw nanopore signal into hash values such that similar signal regions can efficiently be identified by matching their hash values. However, enabling accurate hashing-based similarity identification in the raw signal domain is challenging because raw signals corresponding to the same DNA content are unlikely to have exactly the same signal amplitudes. This is because the raw signals generated by nanopores can vary each time the same DNA fragment is sequenced due to several factors impacting nanopores during sequencing, such as variations in the properties of the nanopores or the conditions in which the sequencing is performed [12]. Although the similarity identification of raw signals is possible via calculating the Euclidean distance between a sequence of signals in a multi-dimensional space [4], such an approach can become impractical when dealing with larger sequences, as the number of dimensions increases with the length of the sequences. This increase in dimensionality raises the computational cost and exposes the search to the curse of dimensionality, making it expensive and impractical.
To address these challenges, RawHash provides three key mechanisms for efficient signal encoding and similarity identification. First, RawHash encodes signal values that have a wider range of values into a smaller set of values using a quantization technique, such that signal values within a certain range are assigned to the same encoded value. This helps to reduce the probability of having varying signal values for the same DNA content and enables RawHash to directly match these values using a hashing technique. Second, RawHash concatenates the quantized values of multiple consecutive signals and generates a single hash value for them. The hashing mechanism enables RawHash to efficiently identify similar signal regions of these consecutive signal values by directly matching their corresponding hash values. Representing many consecutive signals with a single hash value increases the size of the regions examined during similarity identification without suffering from the curse of dimensionality. Using larger regions can substantially reduce the number of possible matching regions that need to be examined. RawHash is the first work that can accurately use hash values in the raw signal domain, which enables using efficient data structures commonly used in the sequence domain (e.g., hash tables in minimap2 [13]). Third, RawHash uses an existing algorithm, known as chaining [13], to find the colinear matches of hash values between signals to identify similar signal regions. These efficient and accurate mechanisms enable RawHash to perform real-time genome analysis for large genomes.
While our proposed three key mechanisms have the potential to be used for various purposes in raw signal similarity identification, we design RawHash as a tool for mapping nanopore raw signals to their corresponding reference genomes in real-time. RawHash performs the mapping in two steps: 1) indexing and 2) mapping. First, in the indexing step, RawHash 1) converts the reference genome sequence into expected signal values by simulating the expected behavior of nanopores based on a previously-known model, 2) generates the hash values from these signals, and 3) stores the hash values in a hash table for efficient matching. Second, in the mapping step, RawHash 1) generates the hash values from the raw signals in a streaming fashion, 2) queries the hash table from the indexing step with these hash values to find the matching regions in the reference genome with the same hash value, and 3) performs chaining to find the similar region between the reference genome and the raw signal of a read.
RawHash can utilize the unique functionalities of nanopore sequencing to reduce the sequencing time and cost in two ways. First, to avoid redundant sequencing and processing of each read, RawHash can use Read Until to eject a read before it is fully sequenced if RawHash identifies that the sequenced portion of the read can already be mapped to a reference genome. Second, to perform a cost- and time-efficient relative abundance estimation, RawHash can utilize Run Until to fully stop the entire sequencing of all subsequent reads after sequencing a number of reads that is sufficient to make an accurate relative abundance estimation. We refer to such usage during abundance estimation as Sequence Until. Avoiding the redundant sequencing of further reads that are unlikely to substantially change the relative abundance estimation has the potential to significantly reduce the sequencing time and cost. To utilize Sequence Until, RawHash integrates a confidence calculation mechanism that evaluates the relative abundance estimations in real-time and fully stops the entire sequencing run if using more reads does not change its estimation. Run Until can then be used to stop the entire sequencing run, which can enable better utilization of nanopores. We find that Sequence Until can be applied to other mechanisms (e.g., UNCALLED) that can perform real-time relative abundance estimations. Prior work [14] proposes a technique to terminate the sequencing process when species in the sample reach a certain coverage depth. The key difference of Sequence Until is that it reduces the cost of sequencing for relative abundance estimation and is based on our adaptive, accurate, and low-cost confidence calculation during real-time abundance estimation.
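The stopping idea behind Sequence Until can be illustrated with a small sketch. This section does not specify the confidence calculation, so the rule below is a hypothetical stand-in: it stops once the relative abundance estimates change by less than a tolerance over several consecutive checkpoints. The function names and the `tol` and `patience` parameters are assumptions made for illustration only.

```python
def relative_abundance(read_counts):
    """Turn per-genome mapped-read counts into fractions that sum to 1."""
    total = sum(read_counts.values())
    if total == 0:
        return {}
    return {genome: count / total for genome, count in read_counts.items()}


def should_stop(history, tol=0.01, patience=3):
    """Hypothetical stopping rule (not the paper's confidence calculation).

    history: list of abundance dicts, one per checkpoint, oldest first.
    Returns True when the largest per-genome change stayed within `tol`
    for the last `patience` consecutive checkpoint pairs.
    """
    if len(history) < patience + 1:
        return False
    recent = history[-(patience + 1):]
    for previous, current in zip(recent, recent[1:]):
        genomes = set(previous) | set(current)
        change = max(
            (abs(previous.get(g, 0.0) - current.get(g, 0.0)) for g in genomes),
            default=0.0,
        )
        if change > tol:
            return False
    return True
```

In a real run, a checkpoint would be taken after each batch of mapped reads, and a positive decision would trigger Run Until.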
We evaluate RawHash on three important applications that can benefit from real-time genome analysis: 1) read mapping, 2) relative abundance estimation, and 3) contamination analysis. We compare RawHash with the state-of-the-art approaches, UNCALLED and Sigmap, which can be used with nanopore sequencers that may not be equipped with GPUs, such as the MinION devices. We evaluate RawHash, UNCALLED, and Sigmap in terms of their performance, accuracy, and their estimated benefits in reducing the sequencing time and cost.
This paper provides the following key contributions and major results:
• We propose RawHash, the first mechanism that can efficiently and accurately find the similarities between raw nanopore signals and a reference genome for large genomes without requiring powerful computational resources such as GPUs.
• We propose the first sampling mechanism that can stop the entire sequencing run for certain applications when an accurate decision can be made without sequencing the entire sample, which we call Sequence Until.

Methods
We propose RawHash, a mechanism that can efficiently and accurately identify similarities between raw nanopore signals of a read and a large reference genome in real-time (i.e., while the read is sequenced). The raw nanopore signal of each read is a series of electrical current measurements as a strand of DNA passes through a nanopore. The reference genome is a set of strings over the alphabet {A, C, G, T}. RawHash provides the mechanisms for generating hash values from both a raw nanopore signal and a reference genome such that similar regions between the two can be efficiently and accurately found by matching their hash values.

Overview
Figure 1 shows the overview of how RawHash identifies similarities between raw nanopore signals of a read and a reference genome in four steps. First, RawHash pre-processes both 1) the raw nanopore signal and 2) the reference genome into values that are comparable to each other. For raw signals, RawHash segments the raw signal into non-overlapping regions such that each region is expected to contain the signal values that are generated from reading a fixed number k of DNA bases. Each such region is called an event [12]. Each event is usually represented with a value derived from the signal values in the segment. For the reference genome, RawHash translates each substring of length k (called a k-mer) into its expected event value based on the nanopore model. The event values from the reference genome are not directly comparable to the event values from raw nanopore signals because variability in the current measurements of nanopores generates slightly different event values for the same k-mer [12]. To generate the same values from slightly different events that may contain the same k-mer information, the second step of RawHash quantizes the event values from a larger set of values into a smaller set. The quantization technique ensures that the event values within a certain range are likely to be assigned to the same quantized value such that the effect of signal variation is alleviated, i.e., the same k-mer is likely assigned the same quantized value.
Due to the nature of nanopores, each event usually represents a very small k-mer of length around k = 6 bases, depending on the nanopore model [4]. Such a short k-mer is likely to exist in a large number of locations in the reference genome, making it challenging to efficiently identify the correct one. To make the events more unique (i.e., such that they exist only in a small number of locations in the reference genome), the third step of RawHash combines multiple consecutive quantized events into a single hash value. These hash values can then be used to efficiently identify similar regions between raw signals and the reference genome by matching the hash values generated from their events using efficient data structures such as hash tables.
Fourth, to map a raw nanopore signal of a read to a reference genome, RawHash uses a chaining algorithm [4,13] that finds colinear matching hash values generated from regions that are close to each other both in the reference genome and the raw nanopore signal.

Event Generation
Our goal is to translate a reference genome sequence and a raw nanopore signal into comparable values. To this end, RawHash converts 1) each k-mer of the reference genome and 2) each segmented region of the raw signal into its corresponding event.
Sequence-to-Event Conversion. To convert a reference genome sequence into a form that can be compared with raw nanopore signals, RawHash converts the reference genome sequence into event values in three steps, as shown in Figure 2. First, RawHash extracts all k-mers from the reference genome sequence, where k depends on the nanopore. The k-mer model of a nanopore includes the information about the expected k-mer length of an event and the expected average event value for each k-mer based on certain variables affecting the signal outcome of the nanopore's current measurements.
Second, RawHash queries the k-mer model for each k-mer of the reference genome to convert k-mers into their expected event values. Although the k-mer model of a nanopore provides an extensive set of information for each possible k-mer, RawHash uses only the mean value of each event (i.e., the average of the signals in that event), since these mean values provide sufficient information for comparison with the raw nanopore signals.
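The first two steps of the sequence-to-event conversion amount to a sliding-window lookup. The sketch below is a minimal illustration: `sequence_to_events` and `TOY_MODEL` are hypothetical names, and the current values in the toy table are made up for demonstration (a real k-mer model covers every k-mer, typically with k around 6).

```python
def sequence_to_events(sequence, kmer_model, k):
    """Map every overlapping k-mer of `sequence` to its expected mean event value.

    kmer_model: dict from k-mer string to expected mean current (illustrative).
    Raises KeyError if the sequence contains a k-mer missing from the model.
    """
    return [kmer_model[sequence[i:i + k]] for i in range(len(sequence) - k + 1)]


# Toy model with k = 2 and made-up values, only to show the data flow.
TOY_MODEL = {"AC": 80.0, "CG": 95.5, "GT": 70.25}

expected_events = sequence_to_events("ACGT", TOY_MODEL, k=2)
```

The resulting list holds one expected event value per reference position, which is the input to the normalization step.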
Third, RawHash normalizes the event values from the same reference genome sequence (e.g., entire chromosome sequence or a contig) by calculating the standard scores (i.e., z-scores) of these events. RawHash uses these normalized values as event values since the same normalization step is taken for raw signals to avoid certain variables that may affect the range of raw signal amplitudes during sequencing [1,4].
Signal-to-Event Conversion. Our goal is to accurately convert the series of raw nanopore signals into a set of values where each value corresponds to a DNA sequence of fixed length k. To achieve this, RawHash converts the raw signals into their corresponding values in three steps, as shown in Figure 3. First, to accurately identify the distinct regions in the raw signal that correspond to a certain k-mer from DNA, RawHash performs a segmentation step as described in a basecalling tool, Scrappie, and used by earlier works UNCALLED and Sigmap. The segmentation step aims to eliminate the factors that affect the speed of the DNA molecules passing through a nanopore, as the speed affects the number of signal measurements taken for a certain number of bases in DNA. To perform the segmentation step, RawHash identifies the boundaries in the signal where the signal value changes significantly compared to a certain number of previously measured signal values, which indicates a base change in the nanopore. Such boundaries are computed using a statistical test, known as Welch's t-test [16], over a rolling window of consecutive signals. RawHash performs this t-test for multiple windows of different lengths to account for the changes in the number of current measurements caused by the varying speed of DNA through a nanopore, known as skip and stay errors [12]. Signals that fall within the same segment (i.e., between the same measured boundaries) are usually called events since each event contains the signals from reading a fixed number of DNA bases (i.e., a k-mer).
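The boundary detection described above can be sketched with a rolling Welch's t-statistic. This is a simplified illustration, not the Scrappie or RawHash implementation: it uses a single window length instead of several, and the `window` and `threshold` values are assumptions chosen for the example.

```python
import math
from statistics import mean, variance


def welch_t(left, right):
    """Absolute Welch's t-statistic between two samples (unequal variances)."""
    standard_error = math.sqrt(variance(left) / len(left) + variance(right) / len(right))
    return abs(mean(left) - mean(right)) / max(standard_error, 1e-12)


def segment(signal, window=4, threshold=8.0):
    """Return indices where the signal level changes significantly.

    A boundary is reported at index i when the t-statistic between the
    `window` samples before i and the `window` samples from i onward
    exceeds `threshold` and is a local peak.
    """
    stats = {
        i: welch_t(signal[i - window:i], signal[i:i + window])
        for i in range(window, len(signal) - window + 1)
    }
    return [
        i for i, t in stats.items()
        if t > threshold and t >= stats.get(i - 1, 0.0) and t > stats.get(i + 1, 0.0)
    ]


# Two current levels with small noise; the level changes at index 10.
toy_signal = [0.1, -0.1] * 5 + [5.1, 4.9] * 5
toy_boundaries = segment(toy_signal)
```

Running the test over several window lengths, as the text describes, would make the detection less sensitive to how long the strand dwells on each k-mer.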
Second, since the number of signals that each event includes is not constant across different events due to the stay and skip errors, RawHash generates a single value for each event to mitigate these potential errors and other factors that cause variations when reading the same number of DNA bases. To this end, RawHash computes the mean value of the signals that fall within the same segment and uses this mean value for an event.
Third, since the amplitudes of the signal measurements may significantly vary when reading k-mers at different times, RawHash normalizes the mean event values in a streaming fashion, using the event values generated from the nanopore within the same time interval. Although this time interval parameter can be modified in our tool, the default configuration of RawHash processes the events of signals generated by the nanopore within one second. For normalization, RawHash uses the same z-score calculation that it uses for normalizing the event values generated from reference sequences as described earlier. RawHash uses these normalized values as event values when comparing with the event values from reference sequences.
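The second and third steps (one mean per segment, then z-score normalization within a chunk) can be sketched as follows. The function names are hypothetical, and the use of the population standard deviation is an assumption; the text only states that standard scores are computed.

```python
from statistics import mean, pstdev


def event_means(signal, boundaries):
    """Collapse each segment between consecutive boundaries into its mean value."""
    edges = [0, *boundaries, len(signal)]
    return [mean(signal[start:end]) for start, end in zip(edges, edges[1:]) if end > start]


def zscore(values):
    """Standard scores of `values` (population standard deviation assumed)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# One chunk of signal with three segments of different lengths.
chunk = [1.0, 1.0, 1.0, 5.0, 5.0, 9.0, 9.0, 9.0, 9.0]
events = event_means(chunk, boundaries=[3, 5])
normalized_events = zscore(events)
```

Because each segment contributes exactly one value, segments with more or fewer samples (stay and skip effects) yield the same kind of output.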

Quantization of Events
Our goal is to avoid the effects of generating different event values when reading the same k-mer content from nanopores so that we can identify k-mer matches by directly matching events. Although the segmentation and normalization steps explained in Section 2.2 can mitigate potential sequencing errors, such as stay and skip errors and significant changes in the current readings at different times, these approaches still do not guarantee exactly the same event values when reading the same k-mer content. This is because slight changes in the normalized event values may occur when reading the same DNA content due to the high sensitivity and stochasticity of nanopores [12]. Thus, it is challenging to generate the same event value for the same k-mer content after the segmentation and normalization steps. Since these event values generated from reading the same k-mer content are expected to be close to each other [4], we propose a quantization mechanism that encodes event values so that events with close mean values can have the same quantized value in two steps, as shown in Figure 4.
Figure 4 (panel labels: most significant Q = 9 bits; pruning p = 4 bits).
First, to increase the probability of assigning the same value for similar event values, RawHash trims the least significant fractional part of mean values by using only the most significant Q bits of these mean event values from their binary format, which we represent as E[1, Q] for simplicity, where E is the event value and E[1, Q] gives the most significant Q bits of E. We assume that the mean event values are represented by the standard single-precision floating-point format with the sign, exponent, and fraction bits. This enables RawHash to reduce the wide range of floating-point numbers into a smaller range without significantly losing accuracy, such that event values closer to each other can be represented by the same value in the smaller range of values. We can perform this trimming technique without significant sensitivity loss because we observe that these normalized event values mostly use at most six digits from the fractional part of their values, leaving a large number of fractional bits unused.
Second, to avoid using redundant bits that may carry little or no information in the most significant Q bits of an event value, RawHash prunes p bits after the most significant two bits of E[1, Q]. For simplicity, we show the quantized value of E as E_{Q,p}. By ignoring these p bits, we effectively pack Q bits into Q - p bits without losing significant information from event values. We can perform such a pruning operation because we observe that the normalized event values are usually in the range [-3, 3] such that these p bits provide little information in distinguishing different event values due to the small range of values. We note that these Q and p values are parameters to RawHash and can empirically be adjusted based on the required sensitivity and quantization efficiency. This quantization technique enables RawHash to assign the same quantized values for a pair of close event values, E and F, that may be generated from reading the same k-mer such that E_{Q,p} = F_{Q,p} where |E - F| < ε and ε is small enough for two events to represent the same k-mer content. RawHash always uses the most significant two bits as these two bits consistently carry the most significant information of the normalized event values, including the sign bit.
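The two quantization steps can be sketched directly on the IEEE 754 single-precision bit pattern. This is an illustration of the description above, not the RawHash source code; `quantize` is a hypothetical name, and the defaults Q = 9 and p = 4 are taken from the Figure 4 labels. Note that with Q = 9 only the sign and exponent bits of a float32 are kept, so the example is deliberately coarse.

```python
import struct


def quantize(event, Q=9, p=4):
    """Quantize a float32 event value into Q - p bits.

    Step 1: keep the most significant Q bits of the float32 bit pattern.
    Step 2: drop the p bits that follow the two most significant bits.
    """
    if not 2 + p <= Q <= 32:
        raise ValueError("need 2 + p <= Q <= 32")
    bits = struct.unpack(">I", struct.pack(">f", event))[0]  # raw 32-bit pattern
    top = bits >> (32 - Q)                                   # E[1, Q]
    kept_tail_bits = Q - 2 - p
    head = top >> (Q - 2)                                    # two most significant bits
    tail = top & ((1 << kept_tail_bits) - 1)                 # bits after the pruned span
    return (head << kept_tail_bits) | tail
```

For example, `quantize(1.0)` and `quantize(1.25)` return the same value because the two numbers differ only in fraction bits that are trimmed, while `quantize(-1.0)` differs because the sign bit is always preserved.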

Generating the Hash Values
Our goal is to generate values for large regions of raw nanopore signals and reference sequences such that these values can be used to efficiently and accurately identify similarities between raw signals and a reference genome. To this end, RawHash generates hash values using quantized values of events in two steps, as shown in Figure 5. First, to avoid finding a large number of matches, RawHash packs the quantized values of n consecutive events into n × (Q - p) bits while preserving the order information of these consecutive events. RawHash uses several consecutive events in a single hash value because matching a single event is likely to generate a large number of matches for larger genomes, as a single event usually corresponds to a k-mer of 6 to 9 bases depending on the nanopore model [12]. It is essential to use several consecutive events to reduce the number of matching regions between raw signals and the reference genome by increasing the region that these consecutive events span.
Second, to efficiently and accurately find matches between large regions of raw signals and a reference genome using a constrained space, RawHash uses a low collision hash function to generate a 32-bit hash value from n × (Q - p) bits of n consecutive quantized event values. Since n × (Q - p) can be larger than 32, using such a hash function is likely to increase the collision rate for dissimilar regions. To avoid inaccurate similarity identifications due to these incorrect collisions, RawHash requires several matches of hash values within close proximity for similarity identification, which we explain next.
Figure 5: Generating a hash value from n consecutive quantized event values (panel labels: quantized values, in binary, of n consecutive events; 32-bit hash value of n events).
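The packing and hashing steps can be sketched as follows. The text does not name the low collision hash function, so CRC-32 from Python's standard `zlib` module is used here purely as a stand-in that produces a 32-bit value; the function names are likewise hypothetical.

```python
import zlib


def pack_events(quantized, bits_per_event):
    """Concatenate quantized event values, first event in the highest bits."""
    packed = 0
    for value in quantized:
        if not 0 <= value < (1 << bits_per_event):
            raise ValueError("quantized value does not fit in bits_per_event")
        packed = (packed << bits_per_event) | value
    return packed


def hash_events(quantized, bits_per_event):
    """32-bit hash of n consecutive quantized events (CRC-32 as a stand-in)."""
    n_bytes = (len(quantized) * bits_per_event + 7) // 8
    return zlib.crc32(pack_events(quantized, bits_per_event).to_bytes(n_bytes, "big"))


def seed_hashes(quantized_events, n, bits_per_event):
    """One hash per window of n consecutive events, in signal order."""
    return [
        hash_events(quantized_events[i:i + n], bits_per_event)
        for i in range(len(quantized_events) - n + 1)
    ]
```

Because the events are concatenated in order before hashing, the same set of events in a different order produces a different packed value, which preserves the order information the text requires.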

Seeding and Mapping
To efficiently identify similarities, RawHash uses hash values generated from raw nanopore signals and the reference genome in two steps. First, RawHash efficiently identifies matching regions between raw nanopore signals and a reference genome by matching their hash values. These hash values used for matching are usually known as seeds. Matching seeds enable efficiently finding similar regions between raw nanopore signals and a reference genome. Second, RawHash uses the chaining algorithm proposed in Sigmap [4] to identify the best colinear matching seeds that are close to each other in both the raw nanopore signal and the reference genome. The region that the best chain of seed matches covers is the mapping position that RawHash identifies as a similar region. The chaining algorithm is useful for two reasons. First, the chaining algorithm can tolerate mismatches and indels as it allows including gaps between seed matches, which enables finding similar regions with many seed matches without requiring the entire region to match exactly, as shown in Supplementary Table S2. Second, incorrect seed matches, caused by hash collisions or by our quantization mechanism generating the same quantized value for distinctly dissimilar events, are likely to be filtered in the chaining step due to the difficulty of finding colinear seed matches in highly dissimilar regions. We note that we modify the original chaining algorithm in Sigmap by disabling the distance coefficient as RawHash does not calculate the distance between seed matches.
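The idea of colinear chaining can be illustrated with a simple quadratic dynamic program. This is a deliberately simplified sketch, not the Sigmap or minimap2 chaining algorithm: each seed match scores one point, there is no gap penalty, and the `max_gap` parameter is an assumption introduced for the example.

```python
def best_chain(anchors, max_gap=500):
    """Find the longest colinear chain of seed matches.

    anchors: list of (reference_position, signal_position) seed matches.
    Two anchors can be chained when both coordinates strictly increase
    and neither coordinate jumps by more than `max_gap`.
    Returns (chain_length, chain) with the chain in increasing order.
    """
    anchors = sorted(anchors)
    if not anchors:
        return 0, []
    score = [1] * len(anchors)
    previous = [-1] * len(anchors)
    for i, (ref_i, sig_i) in enumerate(anchors):
        for j in range(i):
            ref_j, sig_j = anchors[j]
            colinear = ref_j < ref_i and sig_j < sig_i
            close = ref_i - ref_j <= max_gap and sig_i - sig_j <= max_gap
            if colinear and close and score[j] + 1 > score[i]:
                score[i] = score[j] + 1
                previous[i] = j
    end = max(range(len(anchors)), key=score.__getitem__)
    chain = []
    while end != -1:
        chain.append(anchors[end])
        end = previous[end]
    return len(chain), chain[::-1]
```

Spurious matches, such as an isolated anchor far away on the reference, cannot join a long chain and are therefore ignored when the best chain is selected.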
To efficiently map raw signals to a reference genome, RawHash uses hash tables to store the hash values generated from reference genomes (i.e., the indexing step) and queries the same hash table with the hash values generated from the raw signal as the read is sequenced from a nanopore to find positions in the reference genome with matching hash values. RawHash uses the events in chunks (i.e., collection of events generated within a certain time interval) to find seed matches and perform chaining in a streaming fashion such that the chaining computation from previous chunks (i.e., seed matches) is transferred to the next chunk if the mapping is unsuccessful for the current chunk.
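The indexing and querying steps can be sketched with an ordinary dictionary standing in for the hash table. The function names are hypothetical, and a production index would use a more compact structure than Python lists.

```python
from collections import defaultdict


def build_index(reference_hashes):
    """Indexing step: map each hash value to all reference positions where it occurs."""
    index = defaultdict(list)
    for position, hash_value in enumerate(reference_hashes):
        index[hash_value].append(position)
    return index


def find_seed_matches(index, read_hashes):
    """Mapping step: return (reference_position, read_position) for every shared hash."""
    return [
        (reference_position, read_position)
        for read_position, hash_value in enumerate(read_hashes)
        for reference_position in index.get(hash_value, ())
    ]
```

In a streaming setting, `find_seed_matches` would be called once per chunk, and the matches from earlier chunks would be kept and extended if no mapping decision has been reached yet.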

Evaluation Methodology
We implement RawHash as a tool for mapping raw nanopore signals to a reference genome. Similar to regular read mapping tools, RawHash has two steps to complete the mapping process: 1) indexing the reference genome and 2) mapping raw signals.
Although indexing is usually a one-time task that can be performed prior to the mapping step, the indexing of RawHash can be performed relatively quickly within a few minutes for large genomes (Supplementary Table S3). RawHash provides the mapping information using a standard pairwise mapping format (PAF). In our implementation, we provide an extensive set of parameters that allow configuring several options to fit RawHash for many other applications and nanopore models that we do not evaluate, such as details about the nanopore model (e.g., number of bases per second), the number of events that can be included in a single hash value, the range of bits to quantize, and seeding techniques such as minimizers and fuzzy seed matching. We also provide a default set of parameters that we empirically choose for each common application of real-time genome analysis. These default parameters are set to accurately and efficiently analyze 1) very small (e.g., viral) genomes, 2) small and mid-sized genomes (i.e., genomes with less than a few hundred million bases), and 3) large genomes (e.g., genomes with a few billion bases such as a human genome). We show the details regarding these parameter selections and the versions of tools in Supplementary Tables S5, S6, and S7.
We evaluate RawHash in terms of its performance, peak memory usage, accuracy, and estimated benefits in sequencing time and cost compared to two state-of-the-art tools, UNCALLED and Sigmap. For performance, we evaluate the throughput and overall runtime of each tool, where throughput is the number of bases a tool can process per second. Throughput determines if the tool is at least as fast as the speed of DNA passing through a nanopore. For many nanopore models (e.g., R9.4), a DNA strand passes through a pore at around 450 bases per second [1,4]. It is essential to provide a throughput higher than the throughput of the nanopore to enable real-time genome analysis. To calculate the throughput, we use the tool that UNCALLED provides, UNCALLED pafstats, which measures the throughput of the tool from the number of bases that the tool processes and the time it takes to process those bases. Although theoretically, it is not possible to exceed the throughput of a nanopore due to the speed of raw signal generation, for comparison purposes, such a limitation is ignored by UNCALLED pafstats. For overall runtime, we calculate CPU time and real time using 32 threads. CPU time shows the overall amount of CPU seconds spent running a tool, while real time shows the overall elapsed (i.e., wall clock) time. All of these tools support multi-threading, where multiple reads can be mapped simultaneously using a single thread for each read. For all of these tools, assigning a larger number of threads enables processing a larger number of reads in parallel, similar to the behavior of nanopore sequencers with hundreds to thousands of pores (i.e., channels). We note that the throughput and mapping time per read values are not affected by the thread counts as 1) these are measured per read and 2) a single thread performs the mapping of a single read.
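The throughput criterion reduces to a simple ratio. The sketch below illustrates the arithmetic described in the text and is not the UNCALLED pafstats implementation; the function names are hypothetical, and the default of 450 bases per second is the pore speed quoted above.

```python
def throughput(bases_processed, seconds):
    """Bases processed per second of mapping time for a read."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bases_processed / seconds


def keeps_up_with_pore(bases_per_second, pore_speed=450.0):
    """True when a tool is at least as fast as DNA passing through the pore."""
    return bases_per_second >= pore_speed
```

For example, a tool that processes 9000 bases in 2 seconds has a throughput of 4500 bases per second, which is ten times the pore speed.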
For accuracy, we evaluate the correctness of the mapping positions that each tool provides when compared to the ground truth mapping positions. To generate the ground truth mapping, we use a read mapping tool, minimap2 [13], to map the basecalled sequences of raw nanopore signals to their corresponding whole-genome references. We use UNCALLED pafstats to compare the mapping output of a tool with the ground truth mapping to find the number of true positives or TP (i.e., correct mappings), false positives or FP (i.e., incorrect mappings), and false negatives or FN (i.e., unmapped reads that are mapped in ground truth). Correct and incorrect mappings are identified based on the distance of the mapping positions between ground truth and the tool. To evaluate the accuracy, we calculate the precision (P = TP/(TP + FP)), recall (R = TP/(TP + FN)), and F1 (F1 = 2 × (P × R)/(P + R)) values.
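The three accuracy metrics can be computed directly from the TP, FP, and FN counts. The sketch below follows the formulas in the text; the function name and the zero-denominator handling are assumptions added for illustration.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from true positive, false positive,
    and false negative counts. Returns 0.0 where a denominator is zero
    (an assumption made here to keep the sketch total)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1
```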
To estimate the benefits of each tool in sequencing time and cost, we calculate the average length of sequenced bases per read when using UNCALLED and RawHash, and the average number of sequenced signal chunks per read when using Sigmap and RawHash. We compare RawHash with Sigmap in terms of the number of chunks because Sigmap does not provide the number of bases when a read is unmapped, while both tools provide the number of chunks used when a read is mapped or unmapped. Each chunk includes the portion of the signal that a nanopore produces within a certain time interval, which is set by default to one second of data for both RawHash and Sigmap. The average length of bases and the number of chunks estimate how quickly each tool can make a mapping decision to activate Read Until before sequencing the remaining portion of a read, which indicates the potential savings in overall sequencing time and cost.
We evaluate RawHash, UNCALLED, and Sigmap for three applications 1) read mapping, 2) relative abundance estimation, and 3) contamination analysis.Read mapping aims to map the raw signals to their corresponding reference genomes.Relative abundance estimation measures the abundance of each genome relative to other genomes in the same sample by mapping raw signals to a given set of reference genomes.Contamination analysis aims to identify if a sample is contaminated with a certain genome (e.g., a viral genome) by mapping raw signals to the reference genome that the sample may be contaminated with.For each tool, we use their default parameter settings in our evaluation.
To evaluate each of these applications, we use real datasets that we list in Table 1. These datasets include both raw nanopore signals in the FAST5 format and their corresponding basecalled sequences in the FASTA format. We note that RawHash can also use POD5 files. For relative abundance estimation, we create a mock community using all the read sets from datasets D1 to D5, and the reference genome is the combination of reference genomes used in these datasets. We slightly modify the reference genome we use in the relative abundance estimation such that the sequence IDs in the reference genome provide additional information about the species (e.g., taxonomy IDs) to enable calculating relative abundance in real-time. For contamination analysis, we combine the SARS-CoV-2 read sets (D1) with human read sets (D5) to identify if the combined sample is contaminated with the SARS-CoV-2 sample by mapping raw signals in the combined set to the SARS-CoV-2 reference genome. For all evaluations, we use the AMD EPYC 7742 processor at 2.26GHz to run the tools.

Evaluating Sequence Until. Our goal is to avoid redundant sequencing to reduce sequencing time and cost for relative abundance estimation. We find that the Run Until mechanism can be utilized to fully stop the sequencing run when the real-time relative abundance estimation reaches a certain confidence level to achieve accurate estimations, which we call Sequence Until. While a similar mechanism is evaluated to enrich the coverage depth of low-abundance species [14] using Read Until, we evaluate the potential benefits of Run Until for low-cost relative abundance estimations. We integrate a real-time confidence calculation mechanism in RawHash to activate the Sequence Until mechanism in three steps. First, RawHash measures the relative abundance estimation after every n reads that can be mapped to a reference genome in real-time. Second, to identify if the recently mapped reads provide substantial changes in the abundance estimations, RawHash performs a cross-correlation calculation between the last w estimations. The cross-correlation identifies whether any estimation is an outlier that differs substantially from the other estimations, which indicates that recent reads can still change the relative abundance estimation and more reads should be sequenced from the sample. Third, RawHash activates Sequence Until by fully stopping the sequencing using Run Until when there are no outliers in the last w estimations, which indicates a convergence to a certain relative abundance estimation, and further sequencing is unlikely to change this estimation. RawHash provides parameters to configure the Sequence Until mechanism.
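The three steps of Sequence Until can be sketched as a streaming loop. This is a simplified illustration under stated assumptions, not the RawHash implementation: the cross-correlation outlier test is replaced here by a simpler stand-in that checks whether every estimation in the window lies within a hypothetical `tolerance` of the window mean. The parameters `n` and `w` follow the text; all function names and default values are illustrative.

```python
from collections import Counter, deque
from math import sqrt


def abundance(counts, species):
    """Relative abundance vector in a fixed species order."""
    total = sum(counts.values())
    return [counts[s] / total if total else 0.0 for s in species]


def has_converged(history, tolerance):
    """Stand-in for the outlier test: True when no estimation in the window
    lies farther than `tolerance` (Euclidean) from the window mean."""
    dims = len(history[0])
    mean = [sum(e[i] for e in history) / len(history) for i in range(dims)]
    return all(
        sqrt(sum((e[i] - mean[i]) ** 2 for i in range(dims))) <= tolerance
        for e in history
    )


def sequence_until(mapped_species, species, n=100, w=5, tolerance=0.01):
    """Consume the species label of each mapped read and stop once the last
    w estimations (taken every n mapped reads) agree.
    Returns (number of mapped reads used, final estimation)."""
    counts = Counter()
    history = deque(maxlen=w)  # keeps only the last w estimations
    used = 0
    for used, name in enumerate(mapped_species, start=1):
        counts[name] += 1
        if used % n == 0:  # step 1: re-estimate after every n mapped reads
            history.append(abundance(counts, species))
            # steps 2 and 3: stop when the window contains no outlier
            if len(history) == w and has_converged(history, tolerance):
                return used, history[-1]
    return used, abundance(counts, species)
```

In a real run, the return point is where the tool would trigger Run Until to stop the sequencer.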
We evaluate the benefits of Sequence Until by comparing 1) RawHash without Sequence Until and 2) RawHash with Sequence Until in terms of 1) the difference in the relative abundance estimations and 2) the estimated benefits in sequencing time and cost. To evaluate Sequence Until in a realistic sequencing environment where reads from different species can be sequenced in a random order, we randomly shuffle the reads in the relative abundance dataset and generate a set of 50,000 reads with a random order of species so that we can simulate this random behavior. We also find that Sequence Until can be applied to other mechanisms. To evaluate the potential benefits of Sequence Until for other tools, we simulate the benefits when using UNCALLED with Sequence Until and compare it with RawHash.

Performance and Peak Memory
Figure 6 shows the throughput of regular nanopores that we use as a baseline and the throughput of the tools when mapping raw nanopore signals to each dataset for read mapping, contamination analysis, and relative abundance estimation. Supplementary Figure S1 and Supplementary Tables S3 and S4 show the mapping time per read, and the computational resources required for indexing and mapping, respectively. We make six key observations. First, RawHash and UNCALLED are the only tools that can perform real-time genome analysis for large genomes, as they can provide higher throughputs than nanopores for all datasets. Sigmap cannot perform real-time genome analysis for large genomes as it can provide 0.7× and 0.6× throughput of a nanopore for human genome mapping and relative abundance estimations, respectively. RawHash can achieve high throughput as its seeding mechanism is based on efficiently matching hash values compared to the costly distance calculations that Sigmap performs for matching seeds, which shows poor scalability for larger genomes. Second, the throughput of UNCALLED is not affected by the genome size as it provides a near-constant throughput of around 16× for all applications. This is because UNCALLED uses FM-index [17] and a branching algorithm that provides robust scaling with respect to the reference genome size [1]. Third, the throughput of RawHash decreases with larger genomes as the seeding and chaining steps start taking up a larger fraction of the entire runtime of RawHash as shown in Supplementary Table S1. Fourth, RawHash provides an average throughput 25.8× and 3.4× better than UNCALLED and Sigmap, while providing an average mapping speedup of 32.1× and 2.1× per read, respectively. Higher throughput with faster mapping times suggests that the mapping time improvements of RawHash are mainly due to its computational efficiency rather than the ability to sequence shorter prefixes of reads than UNCALLED and Sigmap. Fifth, for indexing, Sigmap usually requires a larger amount of computational resources in terms of both runtime and peak memory usage. Sixth, for mapping, UNCALLED is the most efficient tool in terms of the peak memory usage as it requires at most 10GB of peak memory while 1) RawHash requires less than 12GB of memory for almost all the datasets and 2) Sigmap requires significantly larger memory space than both tools. RawHash has a larger memory footprint (∼52GB) than UNCALLED for large genomes. Although such large memory requirements for larger genomes can lead to challenges in using RawHash for mobile devices with limited computational resources, such a requirement can be mitigated by using more efficient seeding techniques such as minimizers, which we leave as future work. We conclude that RawHash provides significant benefits in improving the throughput and performance for the real-time analysis of large genomes while matching the throughput of nanopores.

Accuracy
Table 2 shows the accuracy results of tools for each dataset and application. We make four key observations. First, RawHash provides the best accuracy in terms of precision, recall, and F1 values compared to UNCALLED and Sigmap when mapping reads to large genomes (i.e., the human genome and the relative abundance estimation). RawHash can efficiently match several events using hash values, which is specifically beneficial in reducing the number of matching regions in large genomes and increasing the specificity due to finding longer matches compared to UNCALLED and Sigmap.
Second, RawHash and UNCALLED can accurately perform contamination analysis while Sigmap suffers from significantly lower precision and recall values.Due to the nature of a contamination analysis, it is essential to correctly eliminate the genomes other than the contaminating genome (precision) without missing the correct mappings of reads from the contaminating genome (recall).Unfortunately, Sigmap cannot provide high values in any of these categories, making it significantly unsafe for contamination detection.
Third, the precision of RawHash does not drop with the increased length in the reference genome due to the benefits of finding long matches, which provides a higher confidence in read mapping.
Fourth, although RawHash does not provide the best accuracy when mapping reads to genomes smaller than the human genome, its accuracy is on par with UNCALLED and Sigmap for these genomes. UNCALLED and Sigmap can achieve high recall values as their mechanisms are best optimized for accurately handling matches in relatively smaller genomes with fewer repeats and ambiguous mappings [1,4]. We conclude that RawHash is the only tool that can accurately scale to performing real-time genome analysis for large genomes, especially with significantly high precision rates.

Relative Abundance Estimations. Table 3 shows the relative abundance estimations that each tool makes and the Euclidean distance of their estimation to the ground truth estimation. We make two key observations. First, we find that RawHash provides the most accurate relative abundance estimations in terms of the estimation distance to the ground truth compared to UNCALLED and Sigmap. This observation correlates with the accuracy results we show in Table 2, where RawHash provides the best overall accuracy for relative abundance estimation, which results in generating the most accurate relative abundance estimations. Second, although Sigmap cannot perform real-time relative abundance estimation due to its throughput being lower than a nanopore (Figure 6), Sigmap provides accurate estimations that are on par with RawHash. This observation shows that while Sigmap provides mappings with more incorrect positions due to lower precision than RawHash (Table 2), these reads with incorrect mapping positions are mostly mapped to their correct species. We conclude that RawHash is the only tool that can accurately be applied to analyze relative abundance estimations while matching the throughput of nanopores at a large scale based on the prior knowledge of the set of reference genomes to map the reads.
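The distance metric used for Table 3 can be illustrated with a short sketch. The helper names are hypothetical, and treating a species missing from either side as having zero abundance is an assumption made for illustration.

```python
from math import sqrt


def relative_abundance(read_counts: dict[str, int]) -> dict[str, float]:
    """Convert per-species mapped read counts to fractions that sum to 1."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no mapped reads")
    return {name: count / total for name, count in read_counts.items()}


def euclidean_distance(estimate: dict[str, float], truth: dict[str, float]) -> float:
    """Euclidean distance between two abundance profiles.
    A species absent from one profile counts as 0.0 there (assumption)."""
    species = set(estimate) | set(truth)
    return sqrt(sum((estimate.get(s, 0.0) - truth.get(s, 0.0)) ** 2 for s in species))
```

A smaller distance means that the estimated profile is closer to the ground truth profile.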

Sequencing Time and Cost
Our goal is to estimate the benefits that each tool provides in reducing the sequencing time and cost. To this end, we measure the average length of sequenced bases and the average number of sequenced chunks per read. We make two key observations. First, RawHash provides significant benefits in reducing the sequencing time and cost for large genomes (e.g., Green Algae and Human) compared to UNCALLED, as RawHash can complete the mapping process per read by using smaller prefixes of reads. Second, RawHash uses on average 1.58× more chunks compared to Sigmap when mapping reads, which can proportionally lead to worse sequencing time and cost for RawHash compared to Sigmap. We conclude that although UNCALLED and Sigmap provide greater advantages in reducing sequencing time and cost for smaller genomes, RawHash can provide significant reductions in sequencing time and cost for larger genomes compared to UNCALLED.

Benefits of Sequence Until
Simulated Sequence Until. Our goal is to estimate the benefits of implementing the Sequence Until mechanism in UNCALLED and compare it with RawHash when they both use Sequence Until under the same conditions. To this end, we use shuf in Linux to randomly shuffle the mapping files that both RawHash and UNCALLED generate for relative abundance and extract a certain portion of the randomly shuffled file to identify their relative abundance estimations after 0.01%, 0.1%, 1%, 10%, and 25% of the overall reads in the sample are randomly sequenced from nanopores. Table 5 shows the distance of relative abundance estimations after a certain portion of the reads is randomly sequenced from nanopores. We make two key observations. First, both RawHash and UNCALLED can significantly benefit from Sequence Until by stopping sequencing after processing a smaller portion of the entire sample since their estimations using smaller portions are close to those using the entire set of reads (Table 3) in terms of their distance to the ground truth. This suggests that many other tools can benefit from Sequence Until as their sensitivity to relative abundance estimations may not significantly change while providing opportunities for reducing the sequencing time and cost up to a certain threshold based on the tool.
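The shuffle-and-truncate simulation described above can be sketched in Python instead of shuf. The sketch assumes tab-separated PAF records in which the sixth column holds the target sequence name, that this name identifies the species, and that unmapped reads carry `*` in that column; the function name and the seed parameter are hypothetical.

```python
import random
from collections import Counter


def subsample_abundance(paf_lines, fraction, seed=0):
    """Shuffle mapping records, keep the first `fraction` of them, and
    estimate relative abundance from the target names of mapped reads."""
    lines = list(paf_lines)
    random.Random(seed).shuffle(lines)  # plays the role of `shuf`
    keep = max(1, round(len(lines) * fraction))
    counts = Counter()
    for line in lines[:keep]:
        target = line.rstrip("\n").split("\t")[5]  # PAF column 6: target name
        if target != "*":  # assumption: '*' marks an unmapped read
            counts[target] += 1
    total = sum(counts.values())
    return {name: c / total for name, c in counts.items()} if total else {}
```

Calling the function with fractions such as 0.0001, 0.001, 0.01, 0.1, and 0.25 mirrors the portions evaluated in Table 5.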
Second, RawHash can provide more accurate relative abundance estimations when using only 0.1% of the reads than the estimation that UNCALLED provides using the entire set of reads (Table 3). We conclude that Sequence Until provides significant opportunities in reducing sequencing time and cost, while more accurate tools such as RawHash can benefit further from Sequence Until by using smaller portions of the entire read set than the portions that less accurate tools would need to achieve similar accuracy.

Sequence Until with RawHash. Our goal is to evaluate Sequence Until when used in real-time with RawHash for relative abundance estimation. Table 6 shows the relative abundance estimations that RawHash makes with and without Sequence Until. We note that the estimations we show for RawHash in Table 6 are different than the estimations in Table 3 since we randomly subsample the reads in the relative abundance estimation dataset, as explained in Section 3.1. We make two key observations. First, we observe that the distance between the relative abundance estimations of these two configurations of RawHash is small. This indicates that our outlier detection mechanism can accurately detect the convergence to the relative abundance estimations without using a full set of reads. Second, Sequence Until enables accurately stopping the entire sequencing after processing 7% of the reads in the entire set without substantially sacrificing accuracy. We conclude that Sequence Until has the potential to significantly reduce the sequencing time and cost by using fewer reads from a sample while producing accurate results.

Discussion
We discuss the benefits we expect RawHash can immediately make, the limitations of RawHash, and future work.We envision that RawHash can be useful mainly for two directions.First, RawHash provides a low-cost solution for analyzing large genomes in real-time.Such an analysis can be significantly useful when using nanopore sequencers with limited computational resources to enable portable real-time genome analysis at a large scale.
Second, we expect that RawHash can also be useful for genome analysis that does not require real-time solutions by reducing the time and energy that further steps in genome analysis may require. One of the immediate steps after generating raw nanopore signals is their translation to their corresponding DNA bases as sequences of characters with a computationally intensive step, basecalling. Basecalling approaches are usually computationally costly and consume significant energy as they use complex deep learning models [18,19]. Although we do not evaluate it in this work, we expect that RawHash can be used as a low-cost filter [20] to eliminate the reads that are unlikely to be useful in downstream analysis, which can reduce the overall workload of basecallers and downstream analysis.

Future work. We find three key directions for future work. First, we find that our efficient hash-based similarity identification mechanism can be used to efficiently find overlaps between signals as the reads are sequenced in real-time. Although we observe that our indexing technique is efficient in terms of the computational resources it requires to construct an index even for large genomes, such an overlapping technique requires substantially more optimized indexing methods and techniques that can efficiently find overlaps as more reads are sequenced and the index evolves. Finding overlaps between signals can be beneficial in 1) providing enriched information to basecallers to increase their accuracy and 2) identifying redundant signals that fully overlap with already sequenced reads in an effort to generate assemblies from signals.
Second, since RawHash generates hash values for matching similar regions, it provides opportunities to use the hash-based seeding techniques that are optimized for identifying sequence similarities accurately without requiring large memory space, such as minimizers [13,21], spaced seeds [22], syncmers [23], strobemers [24], and fuzzy seed matching as in BLEND [25]. Although we do not evaluate it in this work, we implement the minimizer seeding technique in RawHash. Our initial observations suggest that future work can exploit these seeding techniques, with slight modifications to their seeding mechanisms, to significantly improve the performance of certain applications without reducing the accuracy.
Third, we find that RawHash can also benefit from a GPU implementation as its low-cost and accurate implementation can effectively be scaled to nanopore sequencers that include thousands of nanopores such that these pores can be analyzed in parallel with an efficient GPU implementation, which we leave as future work.

Related Work
To our knowledge, RawHash is the first mechanism to efficiently and accurately perform real-time analysis of raw nanopore signals for large genomes.We discuss related work in 1) basecalling, 2) accelerating genome analysis after the basecalling step, and 3) real-time genome analysis with limited computational resources.
Basecalling. Deep learning-based models are utilized by modern basecallers to considerably enhance the precision of identifying a nucleotide base from raw signals compared to traditional non-deep learning-based basecallers [18,26–30]. Deep learning models can successfully basecall genomes due to the developments and advancements in their architecture, which enables them to model and accurately recognize spatial characteristics in the raw data. Many basecallers have been proposed using modern deep learning-based architectures [31–40]. However, the use of complex deep learning models makes basecalling slow and memory-hungry, bottlenecking all genomic analyses that depend on it [18]. Recent works focus on developing methods to speed up the basecalling process. One approach to basecalling acceleration is to use specialized hardware, such as field-programmable gate arrays (FPGAs) [41–45] or processing-in-memory (PIM) [19,46,47], to perform the basecalling computations. These specialized hardware devices can perform many calculations in parallel, allowing for significant speedups in the basecalling process. Another approach is to use machine learning-based compression techniques to improve the performance of the basecalling process. RUBICON [18] provides a framework to develop hardware-optimized basecallers using neural architecture search [48], knowledge distillation [49], and pruning [50]. Dorado [51], a basecaller by ONT, uses quantization [52] to reduce the bit-width precision at which neural network calculations are performed. All the above works accelerate the basecalling step without eliminating the wasted computation in basecalling. TargetCall [20] proposes a pre-basecalling filter that eliminates the wasted computation in basecalling by leveraging the observation that the majority of reads are discarded after basecalling. However, RawHash is different from these works as its goal is to perform real-time analysis of raw signals without performing the computationally intensive basecalling step.
Accelerating the genome analysis after basecalling. There are several works that aim to accelerate the entire genome analysis pipeline by accelerating one or multiple steps in the pipeline after basecalling the raw nanopore signals [53,54]. These works accelerate the pre-alignment filtering and read classification [55–67], chaining [68,69], read mapping and sequence alignment steps. Although these works can significantly improve the performance of the genome analysis pipeline, unlike RawHash, these works cannot perform real-time genome analysis while the raw nanopore signals are generated from nanopore sequencers.

Real-time analysis of raw nanopore signal. Several works perform real-time genome analysis of raw nanopore signals by utilizing adaptive sampling [1,3–8,11]. SquiggleFilter [5] uses an ASIC accelerator that quickly filters non-related raw electrical signals before basecalling for viral detection. HARU [8] is an FPGA accelerator that accelerates real-time selective genome sequencing on resource-constrained devices for detecting viral genomes. RawHash differs from these works as it does not require specialized hardware design and can scale to analyze large genomes while matching the throughput of nanopores.
SquiggleNet [7], DeepSelectNet [10], and RawMap [11] require training with machine learning techniques using sequencing reads as training data without using reference genomes.These works train their models to classify raw nanopore signals without mapping them to the reference genome, which is different than RawHash as it maps raw signals to a reference genome.These works often require retraining and reconfiguring the neural network model and architectures.Although such classification approaches can provide high accuracy in labeling reads as target or non-target reads based on a target genome of interest, it can be challenging to easily perform real-time analysis with high accuracy without retraining or reconfiguring these models.RawHash is different than these works as it can map reads to any reference genome using easily configurable parameter settings.
ReadFish [3] and ReadBouncer [9] can scale to mapping reads to large genomes such as a human genome using GPUs or CPUs (e.g., DeepNano-Blitz [128]) for performing basecalling. Similar to ReadFish and ReadBouncer, RUBRIC [6] uses a basecalling approach followed by mapping the basecalled raw signals to analyze raw nanopore signals in real-time. These basecalling approaches are optimized to use the entire raw nanopore signal of a read rather than the portions of raw signals produced in real-time, which makes it challenging to generate an accurate mapping with a small number of basecalled signals [1,4]. RawHash differs from ReadFish and ReadBouncer as it does not require powerful computational resources for basecalling, which may not be immediately available for portable sequencers such as ONT MinION. RawHash can directly and accurately map a small number of raw signals (e.g., signals produced in one second) to a reference genome without basecalling them.
We note that ReadFish and ReadBouncer use an interface, MinKNOW, required for adaptive sampling in nanopore sequencing. MinKNOW enables tools to analyze the raw nanopore signals and perform adaptive sampling by using functionalities such as Read Until. However, the throughput of these tools using MinKNOW cannot exceed the throughput of a nanopore sequencer. Thus, it becomes challenging to fairly compare these tools with the other tools, such as RawHash and Sigmap, for two reasons. First, the throughput of RawHash and Sigmap can be significantly larger than the throughput of a nanopore (Figure 6) due to the lack of support for the MinKNOW interface in these tools. Second, the parameters of RawHash are empirically chosen to provide the best throughput and accuracy without the potential effects of MinKNOW. It is likely that the accuracy of RawHash can improve while providing the same throughput as a nanopore sequencer. We leave MinKNOW support for RawHash, as well as the comparison of RawHash with ReadFish and ReadBouncer, as future work.
UNCALLED [1] and Sigmap [4] are the most relevant works to RawHash.These works map raw nanopore signals to a reference genome without using powerful computational resources (e.g., GPUs), which can be directly used with portable nanopore sequencers.UNCALLED detects events from raw signals, and the probability of k-mers that each event can represent is calculated using k-mer models.UNCALLED identifies the sequence of matching k-mers between the most probable k-mers of events and a reference genome using an FM-index [17].However, it becomes challenging to accurately identify the matching regions with such a probabilistic model from a large number of matches as the genome size increases [1] (Table 2).Thus, UNCALLED is highly accurate for small genomes (e.g., E. coli and Yeast genomes) due to the smaller number of probabilistic matches in the reference genome that can be identified accurately.
Sigmap can map raw nanopore signals to genomes larger than the Yeast genome (e.g., Green Algae with around 100M bases).To achieve this, Sigmap converts the k-mers of the reference genome into events and matches the events between raw nanopore signals with the events of the reference genome.Since events are not necessarily identical when reading the same DNA content, it is challenging to find accurate matches between them due to the signal variations we discuss in Section 2.2.To address this challenge, Sigmap creates a vector from each n consecutive events (i.e., n-dimensional vector space) from the reference genome (i.e., the indexing step) and measures the Euclidean distance between these vectors and the vectors generated from raw nanopore signals (i.e., the mapping step) using a k-d tree structure.Although the distance between vector of events generated from similar regions is close, such a distance calculation is computationally costly and suffers from the curse of dimensionality that fundamentally prevents accurately and efficiently increasing the number of events within a single vector, which makes it ineffective for larger genomes.
RawHash is different from UNCALLED and Sigmap as it identifies similarities between a reference genome and a raw nanopore signal by efficiently and accurately matching the hash values generated from them without using 1) a probabilistic model, as proposed in UNCALLED, that can be inaccurate for large genomes or 2) costly distance calculations, as performed in Sigmap.
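To illustrate how hash matching avoids both probabilistic models and distance calculations, the following sketch quantizes normalized event values, packs several consecutive quantized events into one integer key, and looks the key up in a hash table. It is a simplified illustration under stated assumptions and not the RawHash implementation: uniform bucketing over a fixed range stands in for the actual quantization, and the names, the value range, and the defaults `n` and `bits` are hypothetical.

```python
def quantize(value, lo=-3.0, hi=3.0, bits=4):
    """Map a normalized event value to one of 2**bits buckets so that
    slightly different values fall into the same bucket."""
    levels = 1 << bits
    clipped = min(max(value, lo), hi)
    bucket = int((clipped - lo) / (hi - lo) * levels)
    return min(bucket, levels - 1)  # value == hi belongs to the top bucket


def hash_events(events, n=3, bits=4):
    """Pack every n consecutive quantized events into one integer key.
    Returns (key, start position) pairs."""
    q = [quantize(e, bits=bits) for e in events]
    keys = []
    for i in range(len(q) - n + 1):
        key = 0
        for v in q[i:i + n]:
            key = (key << bits) | v
        keys.append((key, i))
    return keys


def build_index(ref_events, n=3, bits=4):
    """Hash table from key to the reference positions where it occurs."""
    index = {}
    for key, pos in hash_events(ref_events, n, bits):
        index.setdefault(key, []).append(pos)
    return index


def seed_hits(index, read_events, n=3, bits=4):
    """(read position, reference position) pairs with identical keys."""
    return [
        (i, pos)
        for key, i in hash_events(read_events, n, bits)
        for pos in index.get(key, [])
    ]
```

For example, with reference events `[-1.0, 0.2, 1.3, -0.4, 0.9]` and noisy read events `[0.25, 1.28, -0.45]`, the read yields the same key as the reference window starting at position 1, so a single exact lookup finds the match despite the small signal differences.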

Conclusion
We propose RawHash, a novel mechanism that provides a low-cost and accurate approach for real-time genome analysis for large genomes. RawHash can efficiently and accurately perform real-time analysis of raw nanopore signals to identify similarities between the signals and a reference genome in real-time at a large scale (e.g., whole-genome analysis for human or communities with multiple samples). To efficiently and accurately identify similarities, RawHash 1) generates events from both raw signals and the reference genome, 2) quantizes the events into values such that slightly different events that correspond to the same DNA content can have the same value, and 3) generates hash values from multiple events to efficiently find matching regions between raw signals and a reference genome using hash values with efficient data structures such as hash tables. We compare RawHash with the state-of-the-art approaches, UNCALLED and Sigmap, on three important applications in terms of their performance, accuracy, and estimated benefits in reducing sequencing time and cost. Our results show that RawHash 1) is the only tool that can be accurately applied to analyze raw nanopore signals at a large scale, 2) provides 25.8× and 3.4× better average throughput, and 3) can map reads 32.1× and 2.1× faster than UNCALLED and Sigmap, respectively.

the nanopore model used for sequencing. A more generic k-mer model that can accurately represent all nanopores is needed to easily adapt RawHash to all possible nanopore models that may be released in the future.
Second, RawHash starts providing lower recall values as the genome size increases, which indicates that a larger portion of correct reads cannot be mapped by RawHash due to the increase in the number of false negatives.Although such an increase in false negatives does not substantially affect some applications, such as contamination analysis, where providing higher precision is more critical to correctly identify the contaminated sample, improving it is useful to provide more accurate genome analysis overall.
Third, we perform our relative abundance estimations based on a priori knowledge of reference genomes. While such an experiment can still be useful in practical scenarios, this is not the common case in metagenomic analysis, where a sample is searched against a significantly larger set of species. We expect that our mechanism can still scale to such metagenomic analyses given that many metagenomic databases are efficiently constructed by including less but useful information for each species [4], as opposed to our analysis, where we include whole-genome references.
Fourth, we observe that the throughput of RawHash is expected to reach the throughput of a nanopore when analyzing reference genomes slightly larger than a human genome.Such a limitation can be alleviated by applying 1) seeding techniques that provide faster and more space-efficient searches in large spaces and 2) chaining algorithms that are optimized for hash-based seed matches without the notion of distance between seeds, unlike the chaining algorithm used in Sigmap.

Figure 2: Converting sequences to event values based on the k-mer model of a nanopore.

length k, k-mers, and consecutive values differ by one base. To achieve this, RawHash converts the raw signals into their corresponding values in three steps, as shown in Figure 3.

First, to accurately identify the distinct regions in the raw signal that correspond to a certain k-mer from DNA, RawHash performs a segmentation step as described in a basecalling tool, Scrappie, and used by earlier works UNCALLED and Sigmap. The segmentation step aims to eliminate the factors that affect the speed of the DNA molecules passing through a nanopore, as the speed affects the number of signal measurements taken for a certain number of bases in DNA. To perform the segmentation step, RawHash identifies the boundaries in the signal where the signal value changes significantly compared to a certain number of previously measured signal values, which indicates a base change in the nanopore. Such boundaries are computed using a statistical test, known as Welch's t-test [16], over a rolling window of consecutive signals. RawHash performs this t-test over multiple windows of different lengths to account for the variations in the number of current measurements caused by the varying speed of DNA through a nanopore, known as skip and stay errors [12]. Signals that fall within the same segment (i.e., between the same pair of measured boundaries) are usually called events, since each event contains the signals from reading a fixed number of DNA bases, a k-mer.

Second, since the number of signals that each event includes is not constant across different events due to the stay and skip errors, RawHash generates a single value for each event to quickly mitigate these potential errors and other factors that cause variations when reading the same number of DNA bases. To this end, RawHash computes the mean value of the signals that fall within the same segment and uses this mean value as the event value.

Third, since the amplitudes of the signal measurements may vary significantly when reading k-mers at different times, RawHash normalizes the mean event values using the event values generated from the nanopore within the same time interval in a streaming fashion. Although this time interval parameter can be modified in our tool, the default configuration of RawHash processes the events of signals generated by the nanopore within one second. For normalization, RawHash uses the same z-score calculation that it uses for normalizing the event values generated from reference sequences, as described earlier. RawHash uses these normalized values as event values when comparing with the event values from reference sequences.
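The three steps above can be illustrated with a minimal sketch. This is not the RawHash or Scrappie implementation: it assumes a single rolling window instead of multiple window lengths, a simple peak-picking rule for boundaries, and a population standard deviation for the z-score. The function names (`welch_t`, `detect_boundaries`, `event_means`, `zscore`) and the default `window` and `threshold` values are hypothetical and chosen only for illustration.

```python
import math
from statistics import fmean, pstdev


def welch_t(left, right):
    """Absolute Welch's t-statistic between two windows of signal samples."""
    n1, n2 = len(left), len(right)
    m1, m2 = fmean(left), fmean(right)
    # Unbiased sample variances of each window.
    v1 = sum((x - m1) ** 2 for x in left) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in right) / (n2 - 1)
    # The small epsilon avoids division by zero on perfectly flat windows.
    return abs(m1 - m2) / math.sqrt(v1 / n1 + v2 / n2 + 1e-12)


def detect_boundaries(signal, window=4, threshold=4.0):
    """Step 1: indices where the rolling t-statistic peaks above a threshold.

    Assumption: one window length only; `window` and `threshold` are
    illustrative values, not parameters taken from the paper.
    """
    scores = [0.0] * len(signal)
    for i in range(window, len(signal) - window + 1):
        scores[i] = welch_t(signal[i - window:i], signal[i:i + window])
    # Keep local maxima of the score that exceed the threshold.
    return [
        i
        for i in range(1, len(signal) - 1)
        if scores[i] > threshold
        and scores[i] >= scores[i - 1]
        and scores[i] > scores[i + 1]
    ]


def event_means(signal, boundaries):
    """Step 2: one mean value per segment between consecutive boundaries."""
    edges = [0] + list(boundaries) + [len(signal)]
    return [fmean(signal[a:b]) for a, b in zip(edges, edges[1:]) if b > a]


def zscore(values):
    """Step 3: z-score normalization (population standard deviation assumed)."""
    mean, std = fmean(values), pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Toy signal with three current levels and small noise (made-up numbers).
signal = [80.1, 79.9, 80.2, 79.8, 80.0, 80.1,
          100.2, 99.8, 100.1, 99.9, 100.0, 100.2,
          60.1, 59.9, 60.0, 60.2, 59.8, 60.0]
boundaries = detect_boundaries(signal)
events = event_means(signal, boundaries)
normalized = zscore(events)
```

In a streaming setting, `zscore` would be applied to the events produced within each time interval (one second by default, as stated above) rather than to the whole read at once.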

Figure 4: Quantization of two event values.

Figure 6: Throughput of each tool. Values inside the bars show the throughput ratio between each tool and a nanopore.

Figure S1: Average time spent per read by each tool in real-time. Values inside the bars show the speedups that RawHash provides over other tools in each dataset.
• We extensively evaluate RawHash by comparing it with state-of-the-art approaches, UNCALLED and Sigmap, on various datasets ranging from small genomes (i.e., genomes with up to 100 million bases) to large genomes (e.g., human genome). Our results show that RawHash provides 1) comparable accuracy to UNCALLED and Sigmap for small genomes and 2) significantly better accuracy for large genomes than UNCALLED and Sigmap.
• We show that Sigmap cannot perform real-time genome analysis for large genomes as it cannot match the throughput of nanopores.
• We provide the open source implementation of RawHash and the complete set of scripts to reproduce the results shown in this paper at https://github.com/CMU-SAFARI/RawHash.

Table 1: Details of datasets used in our evaluation.
Best results are highlighted with bold text.

Table 4: The average sequenced length of bases and the number of chunks.

Table 5: Relative abundance with simulated Sequence Until. Percentages show the portion of the overall reads used. Best results are highlighted with bold text.

Table 6: Relative abundance with Sequence Until. Percentages show the portion of the overall reads used.