Roberto Semeraro, Alberto Magi, PyPore: a python toolbox for nanopore sequencing data handling, Bioinformatics, Volume 35, Issue 21, November 2019, Pages 4445–4447, https://doi.org/10.1093/bioinformatics/btz269
Abstract
Recent technological improvements in Oxford Nanopore sequencing have pushed the throughput of these devices to 10–20 Gb, allowing the generation of millions of reads. For these reasons, the availability of fast software packages for evaluating experimental quality by generating highly informative and interactive summary plots is of fundamental importance.
We developed PyPore, a three-module Python toolbox designed to handle raw FAST5 files from quality checking to alignment to a reference genome and to explore their features through the generation of browsable HTML files. The first module provides an interface to explore and evaluate the information contained in FAST5 files and to summarize it into informative quality measures. The second module converts raw data to FASTQ format, while the third module makes it easy to run three state-of-the-art aligners and collects mapping statistics.
PyPore is open-source software written in Python 2.7. The source code is freely available for all OS platforms on GitHub at https://github.com/rsemeraro/PyPore
Supplementary data are available at Bioinformatics online.
1 Introduction
Nanopore sequencing platforms are entering the market with the aim of achieving an unprecedented combination of speed, accuracy and flexibility (Sedlazeck et al., 2018a), introducing a new era of fast sequencing applications and allowing the investigation of microbial communities (Quick et al., 2015), the identification of complex structural variants in cancer genomes and the reconstruction of full-length transcripts in RNA-seq studies (Magi et al., 2017). The Oxford Nanopore Technologies (ONT) MinION and GridION are the first commercially available devices that use nanopores as biosensors to sequence very long single-stranded DNA molecules. The current ONT flow cells, the cartridges into which DNA samples are loaded, contain 2048 individual protein nanopores arranged in 512 channels, each containing four pores and sensors. When a DNA strand passes through a nanopore, each sensor measures ionic current changes at a constant sampling frequency, and the raw current data are then base called by means of a machine learning approach to obtain a consensus sequence (Jain et al., 2016). Although the earliest MinION flow cells (R6.0) generated a total throughput on the order of tens of Mb, the more recent R9.4 and R9.5 flow cells raised the throughput to 10–20 Gb, making it possible to generate millions of reads. For each read, all data related to the sequencing process are stored in FAST5 format, a variant of HDF5, which is a very flexible data model able to store an unlimited variety of datatypes, such as the results of current signal segmentation and the metadata associated with the sequencing process.
At present, few software tools are available to deal with this file format and to facilitate downstream analyses starting from it (Legett et al., 2015). Moreover, most of these tools generate only basic quality plots, and none of them can perform data conversion and quality assessment at once. In this scenario, the availability of fast software packages for evaluating the quality of nanopore runs by generating highly informative and interactive summary plots is of fundamental importance. For these reasons, we developed PyPore, a Python toolbox for fast and accurate quality control, conversion and alignment of nanopore sequencing data.
2 Materials and methods
2.1 General overview
The PyPore suite consists of three modules: seqstats, fastqgen and alignment.
Seqstats provides an interface to explore the information related to a dataset of single- or multi-read FAST5 files and, optionally, to convert and gather them into FASTQ data. At the end of the run, two HTML files are generated: a sequencing_summary and a pore_activity_map. The first reports three plots summarizing run statistics over time (mean read length, number of bases and number of reads) and two histograms showing the distributions of GC-content and quality values (Fig. 1A). All the information related to the activity of each channel, e.g. number of reads and bases over time, is stored in an interactive map that faithfully represents the nanopore flow cell layout (unlike other tools, which generate improper representations; Fig. 1B). In the summary file (Fig. 1A), the cumulative pass/fail ratio (based on read quality larger or smaller than 7) is displayed as a pie chart in the top right corner, while in the activity map (Fig. 1B) it is shown for each channel by means of a colored bar positioned at the bottom of the screen. All plots are mouse responsive, allowing the user to zoom and pan the picture and to retrieve all the information related to the experiment. For instance, as shown in Figure 1A, moving the mouse over the run statistics plot shows this information at different time points, while hovering over the length histogram displays the portion of reads contained in each bin. The same holds for the activity map: clicking on a channel, represented as a small rectangle, shows all the information concerning its occupancy in a line chart that splits the data by number of reads and number of base pairs, for all four pores.
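The run-level measures described above (mean read length, base and read counts, GC-content and the pass/fail ratio at quality 7) can be illustrated with a minimal Python 3 sketch. It is not PyPore's own code: the function names are hypothetical, reads are assumed to be already extracted as (sequence, Phred+33 quality string) pairs, the mean quality is assumed to be computed by averaging error probabilities, and a read with quality exactly 7 is assumed to pass.

```python
import math

PASS_THRESHOLD = 7.0  # pass/fail cutoff on read quality, as stated in the text


def gc_content(sequence):
    """Return the fraction of G and C bases in a read (0.0 for an empty read)."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def mean_quality(quality_string, offset=33):
    """Mean read quality from a Phred+33 string.

    Assumption: scores are averaged as error probabilities and converted
    back to the Phred scale; a plain arithmetic mean is the other common choice.
    """
    if not quality_string:
        return 0.0
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def summarize(reads):
    """Summarize (sequence, quality_string) pairs into run-level measures."""
    n = len(reads)
    if n == 0:
        return {"reads": 0, "bases": 0, "mean_length": 0.0,
                "mean_gc": 0.0, "pass_fraction": 0.0}
    lengths = [len(seq) for seq, _ in reads]
    # Assumption: a read with quality exactly at the threshold counts as "pass".
    passed = sum(1 for _, qual in reads if mean_quality(qual) >= PASS_THRESHOLD)
    return {
        "reads": n,
        "bases": sum(lengths),
        "mean_length": sum(lengths) / n,
        "mean_gc": sum(gc_content(seq) for seq, _ in reads) / n,
        "pass_fraction": passed / n,
    }
```

Calling summarize on the reads that fall within each time window would give the kind of over-time series plotted in the summary file.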

Fig. 1. PyPore outputs. (A) The sequencing_summary layout. (B) The pore_activity_map layout. The color gradient expresses the productivity of each channel. Channels shown in black did not work for the whole duration of the experiment
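The per-channel aggregation behind such an activity map can be sketched as follows. This is an illustrative Python 3 example under stated assumptions, not the PyPore implementation: the input tuples, the one-hour bin width, the function names and the 1-based channel numbering are all hypothetical choices.

```python
from collections import defaultdict

N_CHANNELS = 512  # channels per flow cell, as described in the Introduction


def channel_activity(reads, bin_seconds=3600):
    """Aggregate reads per channel and time bin.

    `reads` is an iterable of (channel, start_time_seconds, read_length).
    Returns {channel: {bin_index: [read_count, base_count]}}.
    """
    activity = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for channel, start_time, length in reads:
        cell = activity[channel][int(start_time // bin_seconds)]
        cell[0] += 1        # one more read in this channel and time bin
        cell[1] += length   # its bases added to the running total
    return activity


def inactive_channels(activity, n_channels=N_CHANNELS):
    """Channels (assumed 1-based) that produced no reads during the run."""
    return [ch for ch in range(1, n_channels + 1) if ch not in activity]
```

Channels returned by inactive_channels correspond to those a map would draw in black, while the per-bin counts would feed a per-channel line chart.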
The second function provided by PyPore is a FAST5 converter, fastqgen. It is an optimized alternative to seqstats for FASTQ extraction. As input, it takes a folder containing the FAST5 files and a label for the resulting FASTQ.
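For readers unfamiliar with the target format, the following Python 3 sketch shows how reads can be gathered into a single FASTQ file. It assumes the read identifier, sequence and integer Phred scores have already been extracted from the FAST5 files; reading the FAST5 container itself is left out, and the function names are hypothetical rather than part of fastqgen.

```python
def fastq_record(read_id, sequence, qualities, offset=33):
    """Format one read as a four-line FASTQ record with Phred+33 qualities."""
    if len(sequence) != len(qualities):
        raise ValueError("sequence and quality lengths differ")
    quality_string = "".join(chr(q + offset) for q in qualities)
    return "@{}\n{}\n+\n{}\n".format(read_id, sequence, quality_string)


def write_fastq(reads, path):
    """Write (read_id, sequence, qualities) tuples to a single FASTQ file."""
    with open(path, "w") as handle:
        for read_id, sequence, qualities in reads:
            handle.write(fastq_record(read_id, sequence, qualities))
```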
The last feature of our tool is an alignment module based on three state-of-the-art long-read aligners and capable of producing interactive summaries. To generate an alignment file (BAM format), a label and a FASTA reference sequence are required. The alignment outcome can be assessed by enabling the –alignment_stats option. The resulting summary reports the error rate for each small variant category (substitutions, insertions and deletions), the fraction of mapped sequences for size-binned reads and the experimental coverage distribution along the reference genome (Supplementary Fig. S1). These summary statistics can help to plan a read error correction step and to evaluate its effect. Finally, by means of the –aligner option, it is possible to customize the list of aligners, composed of minimap2 (Li, 2018), bwa (Li and Durbin, 2010) and ngmlr (Sedlazeck et al., 2018b), by removing some of them or changing their execution order.
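As an illustration of the statistics listed above, the Python 3 sketch below derives per-category error rates from CIGAR strings and the mapped fraction per read-length bin. How PyPore computes these values is not described here, so this is only one plausible approach: it assumes extended CIGAR strings (with = and X operators), expresses rates per aligned column, uses a hypothetical bin width of 1000 bases, and all function names are illustrative.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def error_rates(cigars):
    """Per-category error rates from extended CIGAR strings.

    Assumption: alignments use '=' (match) and 'X' (mismatch) operators;
    a plain 'M' does not distinguish the two and is rejected here.
    Rates are expressed per aligned column (=, X, I, D).
    """
    counts = {"=": 0, "X": 0, "I": 0, "D": 0}
    for cigar in cigars:
        for length, op in CIGAR_OP.findall(cigar):
            if op == "M":
                raise ValueError("ambiguous 'M' operator; extended CIGAR required")
            if op in counts:  # clipping, skips and padding are ignored
                counts[op] += int(length)
    total = sum(counts.values())
    if total == 0:
        return {"substitutions": 0.0, "insertions": 0.0, "deletions": 0.0}
    return {
        "substitutions": counts["X"] / total,
        "insertions": counts["I"] / total,
        "deletions": counts["D"] / total,
    }


def mapped_fraction_by_size(reads, bin_size=1000):
    """Fraction of mapped reads per length bin.

    `reads` is an iterable of (read_length, is_mapped).
    Returns {bin_start: mapped_fraction}, ordered by bin start.
    """
    totals, mapped = {}, {}
    for length, is_mapped in reads:
        start = (length // bin_size) * bin_size
        totals[start] = totals.get(start, 0) + 1
        mapped[start] = mapped.get(start, 0) + (1 if is_mapped else 0)
    return {start: mapped[start] / totals[start] for start in sorted(totals)}
```

Comparing these rates before and after a read correction step is one way to evaluate its effect, as suggested in the text.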
3 Discussion
We developed PyPore to accommodate all ONT data users, giving them an easy and fast tool to delve into FAST5 data, convert them and align them. The rapid spread of this technology is mainly due to its low price, simple sample preparation and the advantages of long reads; it also relies on the portability of ONT sequencers.
Thanks to these properties, even small bioinformatics laboratories can now study genome structure entirely in-house. PyPore shares various features with the current state-of-the-art programs, but also implements several specific characteristics that make it unique, such as interactive result summaries, GC-content estimation and an alignment module equipped with a dedicated plotting function, among others.
Moreover, given the still increasing amounts of data generated by this technology, we developed PyPore paying particular attention to data gathering and manipulation in order to optimize resource and time consumption.
In conclusion, our tool is a new method devised to manage and analyze extensive nanopore datasets, allowing users to extract information, convert data and align them in a short time thanks to multiprocessor support. Moreover, PyPore produces browsable, interactive result summaries reporting a wide spectrum of information about sequencing and/or alignment, offering a new way to deal with nanopore data by evaluating their global quality, discovering experimental biases and making decisions on future analyses.
Funding
This work was supported by the Associazione Italiana per la Ricerca sul Cancro (AIRC Investigator Grant 20307, ‘Third Generation Cancer Genomics’).
Conflict of Interest: none declared.
References