SquiggleKit: a toolkit for manipulating nanopore signal data

Abstract Summary The management of raw nanopore sequencing data poses a challenge that must be overcome to facilitate the creation of new bioinformatics algorithms predicated on signal analysis. SquiggleKit is a toolkit for manipulating and interrogating nanopore data that simplifies file handling, data extraction, visualization and signal processing. Availability and implementation SquiggleKit is cross platform and freely available from GitHub at (https://github.com/Psy-Fer/SquiggleKit). Detailed documentation can be found at (https://psy-fer.github.io/SquiggleKitDocs/). All tools have been designed to operate in python 2.7+, with minimal additional libraries. Supplementary information Supplementary data are available at Bioinformatics online.


Comparison to accessible methods with related functions
• Picopore -removes data from fast5 files to reduce size (Gigante, 2017) • Japsa -manages files, streams data and performs base space analysis (Nguyen et al., 2017) Analysis: • ReadUntil -local dynamic programming, real time (Loose et al., 2016) • Nanopolish -File management, segmentation, aligns signal to base (Loose et al., 2016;Loman et al., 2015;Loman and Quinlan, 2014)  It allows for great visualisation and annotation of signals in a bulk fast5 file. This is similar to SquigglePlot, but are used for different applications

Differences
SquigglePlot and segmenter work with regular fast5 files, or any signal input, where BulkVis plots bulk fast5 files.
Comments and use It has a specific use case around examining reads using MinKNOW's classifications and requires bulk fast5 file.

Info
Poretools provides a wealth of utilities and data exploration tools.

Info
HDFView is a visual tool written in Java for browsing and editing HDF5 files. View a file hierarchy in a tree structure. Create new files, add or delete groups and datasets. View and modify the content of a dataset. Add, delete and modify attributes.

Similarities
HDFView allows for basic visualisation of raw signal data, with overlap with SquigglePlot Differences SquigglePlot has many more features than HDFView for visualisation, and can plot many files at once with custom settings, at high DPI for figures.
Comments and use HDFView is excellent for exploring a fast5 file for information and structure or visualising one signal. When working with many signals, or just wanting to look at the signal directly, SquigglePlot is superior.

Comparison Japsa
Info Japsa has many packages that do a variety of nanopore data analysis and data streaming. Similarities Japsa has packages for real time file management and analysis during sequencing.

Differences
Japsa employs many base space sequence analysis methods like barcode demultiplexing. It does not however do any analysis, extraction, or visualisation of raw signal Comments and use Japsa seems to have been built as a pipeline for many real time analysis methods. It does not have any significant overlap other than file some file management.

Info
ReadUntil has a function, squiggle_search2(), to find if a particular sequencing matches a selected region or not, using the coordinates/position of the match. This is specifically designed for ReadUntil, and the scripts and methods can not be used for general exploration.
No plotting of raw signal.

Last update 2016
Links https://www.nature.com/articles/nmeth.3930 http://mattloose.github.io/RUscriptsdocs/ https://github.com/mattloose/RUscripts/tree/master/ReadUntil Similarities ReadUntil takes a reference sequence region and converts it into the event space using the models for basecalling. It then uses dynamic programming on the first portion of an incoming read to test if it matches within the selected region.

Differences
MotifSeq is a general method for finding the approximate position of where a particular sequence motif aligns with the raw signal data.
ReadUntil and MotifSeq do this is similar ways, however while ReadUntil does the comparisons in event space with streaming data, MotifSeq does the comparisons in raw signal space.
Comments and use squiggle_search2() from ReadUntil finds if a particular sequence matches a selected region or not, using the coordinates/position of the match. This is specifically designed for ReadUntil, and the scripts and methods can not be used for general exploration as is.

Tombo/NanoRaw
Info Nanoraw and tombo operate based on the genome_resquiggle/resquiggle method, mapping the raw signal with basecalls and reference alignment. All visualisation is centred around this method, and though it has the potential to be general use, it is not implemented or designed that way.

Similarities
Tombo can visualise raw signal data similar to SquigglePlot. Tombo can centre the visualisation around a particular motif, and extract that information with the API to find the signal associated with the basecalls, similar to MotifSeq. Tombo can extract the signal data, similar to SquigglePull.

Differences
Tombo can visualise multiple squiggles on one plot, and can associate them with the basecall as well creating a squiggle pileup plot.
While Tombo focuses on multiple squiggles, SquigglePlot focuses on one squiggle at a time.
Tombo uses the resquiggle algorithm to match the aligned base sequence to the raw signal.
MotifSeq converts the sequence motif to a signal using a model, and uses a local dynamic programming method to identify the position of the motif. The signal extraction in SquigglePull is more general than tombo's pipeline.
Comments and use Tombo is a fully fledged toolkit for finding differences in signal space, and exploring the raw signal of a dataset. Tombo and NanoRaw were designed to find differences in the signal to identify modifications, and help train models to detect them.
Fast5_fetcher may be of use for filtering large datasets to be analysed with tombo Use of either toolkit would be based on specific goals

Comparison Nanopolish
Info Nanopolish is a software package for signal-level analysis of Oxford Nanopore sequencing data. Nanopolish can calculate an improved consensus sequence for a draft genome assembly, detect base modifications, call SNPs and indels with respect to a reference genome and more.

Similarities
File management, segmentation, aligns signal to base are similar to fast5_fetcher, segmenter, and MotifSeq respectively.

Differences
Nanopolish is aimed at specific problems and the tool is created as an integrated method of analysis, where SquiggleKit tools can all work standalone and in a more general way.
Comments and use Nanopolish is best suited to standard methods of analysis. SquiggleKit is best used for developing tools like nanopolish, or exploring nanopore signal data.

Standardisation of signal for comparison
Many factors can contribute to non-sequence dependent variability in nanopore sequencing data (temperature, voltage, etc), which can lead to progressive distortion of raw current values for a single read. The three main types of raw signal distortion are shift, scale, and drift, which can be mitigated by normalising or standardising raw signal using global properties of the data. Standardisation is required to compare 2 raw signals against each other, or a simulated signal from the basecalling model and a raw signal, as implemented in MotifSeq.
Z-score scaling is performed using the function z = ( x -u ) / s, where u and s are the mean and standard deviation of the signal, respectively. It is more sensitive to any large spikes in the signal (which are relatively common), thus impacting the overall mean and standard deviation. One method of offsetting this is to filter large spikes above and below a threshold, as is done with the --scale_hi/--scale_low flags in MotifSeq, set at 900 and 0 respectively for DNA.
Alternatively, raw signal can be shifted and scaled by converting to pico ampres, then scaled. This is done by extracting the digitisation, offset, and range values from the fast5 files, and converting the raw data points using the function pA = (x + offset) * (range/digitisation). This can be done automatically when the raw signal is extracted using the --convert flag in SqugglePull. There is no significant difference in MotifSeq scores when using Z-score or Z-score pA scaling (Supplementary Figure 1C).
MotifSeq can do Z-score or median-absolute-difference scaling, in a similar way, but which are more or less sensitive to large spikes in the signal. The med-MAD scaling uses the following function: z = (x -med) / MAD, where med and MAD are the median and Median-Absolute-Difference of the signal respectively. Med-MAD is more robust at handling noise in nanopore signals than z-score scaling, where the mean and standard deviation can be significantly impacted by the occasional spike in current. Med-MAD is used by nanopore basecalling software for this reason. This is evidenced in Supplementary Figure 1A-B by slightly lower MotifSeq scores, indicative of lower penalties for outlier spikes in current.
All 3 normalisation methods have been implemented in MotifSeq, which uses med-MAD by default given the benchmarking results described below (see section 5).
Sup. Figure 1: Comparison of raw signal scaling strategies on MotifSeq scoring (A-C) Pairwise comparison of MotifSeq scores for all combinations of raw Z-score scaling (raw z-scaled), raw Z-score scaling with conversion to picoAmperes (z-scaled pA converted), and median-median-absolute-difference (median-MAD scaled). Scores were generated by running MotifSeq with 12 different k-mers (D) corresponding to the 3' end used in the main example against the raw signals of 10,000 randomly selected cDNA reads, which were scaled using either of the 3 methods described above.

Benchmarking
Benchmarking was carried out to assess speed and accuracy for certain SquiggleKit tools. Below is a comparison of speed for fast5_fetcher, segmenter, and MotifSeq. Accuracy benchmarking was performed for Segmenter and MotifSeq. All local benchmarks were carried out on the same hardware, as described below.
System specifications:

Fast5_fetcher speed
By taking a dataset with 7,332,202 single fast5 files, tarballed into groups of 4000, the average runtime (after 5 replicates) of fast5_fetcher extraction was measured at different magnitudes (i.e. number of files to extract).

Segmenter speed
Segmenter identifies regions of low complexity using a greedy algorithm in a manner similar to a Markov chain. It processes raw signal as follows: 1. Calculate the median current intensity of the full read; 2. Set thresholds about the median using a fraction of the standard deviation (parameter --std_scale, default 0.75); 3. For each data point, calculate the difference to the median. Stretches of signal are returned if at least w (parameter --window, default 150) consecutive data points are within the standard deviation threshold; 4. Outliers are tolerated (parameter --error, default 5) as are consecutive low complexity regions within a given boundary (parameter --seg_dist, default 50).   193 This is comparable to the speed attained by (Loose et al., 2016) in their speed testing on "Intel(R) Xeon(R) E5-2690 version 3 central processing units (CPUs) running at 2.60 GHz (server)" attaining ~0.3 seconds per read for a k-mer length of 256.

Segmenter accuracy
Segmenter is designed to identify low complexity regions in raw nanopore signal above a given window length. It was designed as a qualitative annotation tool to identify homopolymer boundaries and polymer translocation stalling. Although it could be used to find associations between poly-A tail lengths and signal length, this is not the subject of this toolkit.
Identifying low complexity regions in raw signal is not straightforward to benchmark given the stochastic nature of single molecule sensing. Consequently, we evaluated the accuracy of Segmenter by manually inspecting 100 raw cDNA signals using Segmenter's "-v" argument (Supplementary Figure 2). Homopolymer boundaries corresponding to poly-A regions in cDNA reads were visually identified, their position recorded and compared to the those automatically identified by Segmenter (Table 5).
Sup. Figure 2: Segmenter boundary assessment Low complexity regions identified by Segmenter were manually inspected (top) to identify precise boundaries visually using matplotlib functionality. The positions on the x-axis corresponding to boundary signals (red lines) were recorded and compared to those output by Segmenter (bottom, blue lines), As can be seen, start accuracy is quite good, with the largest variation in the end position targeting. A mean length difference of 9.75 data points approximately corresponds to 1 nucleotide worth of signal (~8-11 data points per nucleotide for DNA pore models).

MotifSeq accuracy
We benchmarked MotifSeq accuracy by comparing minimum dynamic programming scores across positive and negative control datasets, which were generated as follows: (i) Using a sequin synthetic RNA spike-in control cDNA sequencing run (ENA accession number PRJEB33439), we randomly selected a sequin isoform from all isoforms with at least 10,000 uniquely mapping reads (minimap2 using the preset "-x map-ont") and a MAPQ score of 60 via the resulting .paf file.
(iii) For each k-mer, positive and negative control sets were generated by identifying all reference sequin isoforms with or without the selected k-mer, respectively, using direct sequence matching.
(iv) A random selection of 5000 cDNA reads were selected that uniquely map to the reference isoforms for both control sets defined in (iii). We ensure the mapped sequences overlapped the position of the k-mer identified in the reference sequences. Reads were selected with fast5_fetcher and raw signals extracted using SquigglePull.
(v) Each k-mer motif was submitted to MotifSeq comparisons across both positive and negative control datasets (Supplementary Figure 3).

Sup. Figure 3: MotifSeq scoring benchmark
The MotifSeq mean score (sDTW distance) and standard deviation (error bars) for 5000 comparisons against positive (red, top) and negative (green, bottom) controls is plotted for 5 different k-mer sizes in conjunction with (A) z-scaling and (B) med-MAD scaling of raw signal. A linear regression trend line is fitted to 5 different k-mer sizes with associated slopes and correlation of determination (R 2 ). (C) Distribution of MotifSeq scores for 50-mers using zscaling and (D) med-MAD scaling.
As expected, the separation of true positive scores from true negative scores increases with k-mer length. The separation further increases when using med-MAD scaling vs z-scaling, highlighting the improved tolerance of med-MAD scaling for outlier current spikes through slightly lower scores overall. MotifSeq scores against the negative control appear to be normally distributed (Supplementary Figure 3C-D, in red or on the right) suggesting that they can be interpreted as Z-scores to assess the confidence of hits via the background mean and standard deviation, which can be converted to a probability using the formula: Where is the mean, q is the query mean, b is the background mean, b is background standard deviation.
Furthermore, the seemingly linear correlation between MotifSeq scores and k-mer lengths suggests that linear regression can be used to extrapolate background models to a given motif size (Table 6). Consequently, we incorporated this additional scoring metric into MotifSeq, which now