NanoPipe—a web server for nanopore MinION sequencing data analysis

Abstract Background The fast-moving progress of the third-generation long-read sequencing technologies will soon bring the biological and medical sciences to a new era of research. Altogether, the technique and experimental procedures are becoming more straightforward and available to biologists from diverse fields, even without any profound experience in DNA sequencing. Thus, the introduction of the MinION device by Oxford Nanopore Technologies promises to “bring sequencing technology to the masses” and also allows quick and operative analysis in field studies. However, the convenience of this sequencing technology dramatically contrasts with the available analysis tools, which may significantly reduce enthusiasm of a “regular” user. To really bring the sequencing technology to every biologist, we need a set of user-friendly tools that can perform a powerful analysis in an automatic manner. Findings NanoPipe was developed in consideration of the specifics of the MinION sequencing technologies, providing accordingly adjusted alignment parameters. The range of the target species/sequences for the alignment is not limited, and the descriptive usage page of NanoPipe helps a user to succeed with NanoPipe analysis. The results contain alignment statistics, consensus sequence, polymorphisms data, and visualization of the alignment. Several test cases are used to demonstrate the efficiency of the tool. Conclusions Freely available NanoPipe software allows effortless and reliable analysis of MinION sequencing data for experienced bioinformaticians, as well for wet-lab biologists with minimum bioinformatics knowledge. Moreover, for the latter group, we describe the basic algorithm necessary for MinION sequencing analysis from the first to last step.


Background
Recent years have witnessed a DNA sequencing boom due to the constantly improving technologies and, consequently, the accessibility of sequencing to a large spectrum of customers including scientists and medical practitioners. Researchers in many fields, from metagenomics to plant physiology to medicine, have been implementing sequencing experiments into their research. Oxford Nanopore Technologies (ONT) essentially accelerated this process by introducing the MinION sequencer, a portable device with minimum requirements for technical skills and bioinformatics knowledge. Thus, DNA sequencing experiments became feasible, even in field studies, in small laboratories, and soon for medical applications in clinics [1].
The NCBI PubMed database includes 261 scientific articles containing the phrase "Oxford Nanopore" published between 2009 and 2018 (by September 1, 2018), the majority of which were published in the last three years (see Supplementary Fig.  S1). This is pointing both at the increased popularity of the ONT sequencers and at the considerable improvement of the technology and sequencing quality in the last three years. For example, the R10 version of the MinION flow cell was recently announced and it promises to improve sequencing quality, also for homopolymer stretches. High throughput and long reads allow diverse applications of MinION: virology [2,3], plant pathology and agriculture [4,5], tuberculosis studies [6], metagenomics and diet [7], and veterinary research [8]; as a portable platform, field biodiversity studies [9], detection of Ebola virus in patients on the spot [10,11], and sequencing in space [12]. Not to forget fundamental applications for long-read sequencing studies, such as de novo genome assembly, improvement of existing genome assemblies, and discovery of structural variants and long repeats [13][14][15][16]. ONT sequencing is favorable for microbiology research as small-sized bacterial genomes can be covered in just one Min-ION read [17,18], thus providing high resolution in genome architecture.
ONT supplies its users with the necessary software to perform base calling, i.e., converting of MinION's electrical signals into a sequence of nucleotides: the on-run MinKNOW and offline Albacore. Both applications utilize the complex, recurrent neural network algorithm, which is recently very popular in computer science. It allows the software to learn from existing data and improve its performance. It is worth noting that the base-calling process is central to improving the accuracy of ONT sequencing technology, and its algorithm is being constantly improved and updated. The output is a collection of FAST5 and/or FASTQ files containing the base-called sequences. These are the files that are used for any sequence analysis in bioinformatics, thus, the base-caller can be called a "gate" from MinION into data interpretation. Nevertheless, the range of ONT-provided analysis tools is limited and concerns only specific applications, excluding general processing, which is left to the user. For example, the EPI2ME software suite, based on the Metrichor platform, includes applications for barcode analysis, metagenomics, antimicrobial resistance analyses, and some technical tests [19]. Several research groups have been recently focusing on the development of MinION-specific bioinformatics tools [20][21][22], although most of these require considerable bioinformatics knowledge. These conditions impede benefits that MinION-based DNA sequencing could bring to medical practitioners and researchers with less information technology (IT) experience. To fill this gap, we developed NanoPipe, a web-driven automatic pipeline that can quickly and effortlessly process data produced by MinION, as well provide necessary files for further bioinformatics analysis if required.

Methods
NanoPipe can be conceptually divided into four stages: (1) data uploading in FASTA or FASTQ formats, (2) alignment of Min-ION reads against the target sequences, (3) alignment analysis, and (4) results display (see Fig. 1). It was developed with no-ITexperienced users in mind, hence, it provides a web-driven interface with the usage page describing the main features of the tool. The start-of-analysis page is simple and intuitive. A user must, first, choose a target genome from a NanoPipe's list (see further) or upload their own target sequence. The next step is the essential part of the pipeline: mapping of sequencing reads to the target using version 946 of the LAST sequence aligner (LAST, RRID:SCR 006119). LAST accounts for the MinION-specific sequencing errors, thus it generates highly reliable results. It can determine the rates of insertion, deletion, and each kind of substitution in a type of data (e.g., MinION reads of AT-rich Plasmodium DNA) [23]. It then uses these rates to determine the most probable alignments. LAST also finds the most probable division of each read into one or more parts together with the most probable alignment of each part with the last-split function [24,25], i.e., if LAST finds a better scoring alternative alignment of the read where it is being split into parts and mapped to different regions of the target, such alignment is submitted to the results. This is a principled way to handle complex DNA rearrangements, gene fusions in RNA, chimeric host/viral sequences, etc. The tasks of detecting polymorphisms and distinguishing viral serotypes can be performed more precisely for MinION data when based on a LAST alignment, because the tool estimates the probability that each base is correctly aligned. The probability is low if there is an ambiguity, i.e., the base could align to more than one place. The default NanoPipe parameters for the LAST alignment are efficient for most cases but can be easily adjusted by the user. We use last-train to find the optimal alignment parameters for MinION sequencing [23]. For advanced users, it is recommended, although not necessary, to be acquainted with the settings of LAST and last-train [26].
After the alignment is completed, NanoPipe evaluates the nucleotide variation for each position and, based on this analysis, generates a consensus sequence and a list of possible single nucleotide polymorphisms (SNPs). The minimum nucleotide count per position should be at least 10 (i.e., at least 10 reads), otherwise a gap will be assigned at the position in the consensus sequence. The consensus sequence is calculated based on the majority rule, i.e., the nucleotide with the higher count at a particular position is assigned to the consensus. If the counts for any two nucleotides differ from each other by not more than by 20%, both nucleotides are included in the consensus with the use of the IUPAC nomenclature [27]. Statistical evaluation of nucleotide variation is presented in a separate table and suggests a polymorphism candidate if an alternate nucleotide has coverage of at least 20% of all reads.
To distinguish between artifacts and true polymorphisms, we have set three additional filters. First, a SNP candidate should have read coverage of at least 30% from the maximum coverage in the respective contig, otherwise it is not listed in the polymorphisms table and not counted as a SNP. Second, the probability of a SNP at the position is calculated. For that, the relative nucleotide frequency is multiplied by a custom weight factor; transversions are weighted by the factor of 1 and transitions by the factor of 2. This is based on the assumption that transitions are two times more likely than transversions. This weighted probability for each SNP candidate is rescaled for convenience so that the maximum value is 1. The probability is displayed for each nucleotide in the polymorphisms table; a SNP candidate is most likely to be the true SNP if its joint probability is 1. Third, the analysis is refined by an assessment of the alignment quality around a potential SNP by estimating a p-error. The p-error is calculated based on the formula used in the LAST methods to calculate the probability of the alignment for a single nucleotide [28]. In NanoPipe the p-error is estimated over a region of a maximum of 10 nucleotides before and after the SNP position (excluding the SNP itself) and for all the read alignments at the region (Eq. 1). It is based on the LAST reliability score assigned to each base pair of the alignment. In addition, polymorphism candidates for human and Plasmodium are linked to the public SNP databases.
Equation (1) is used in NanoPipe for the p-error calculations, where i is the total number of nucleotides around the SNP for all mapped reads, i.e., i = N(nucleotides around SNP) * N(reads mapped to the evaluated region). ASCII values are extracted from the LAST alignment and indicate the reliability of each base's alignment; see [28] for more information about LAST quality symbols.
The results are supplemented with a number of useful pages, such as alignment of consensus sequences against target sequences, read length distribution and individual read's alignment length distribution, and nucleotide plots showing distribution of nucleotides from all reads at each position of the consensus sequence. The latter is the interactive visual representation of the results and enables the user to monitor any nucleotide variations by eye. NanoPipe also provides the necessary files for browsing the alignment in the IGV genome browser [29], where each individual read alignment is displayed.

Findings
The experience of our working group in conducting MinION sequencing and analysis workshops [30] for medical doctors and wet-lab researchers revealed the general gaps in bioinformatics knowledge among these groups. While the experimental part of the workshop was easily performed by the participants, even if mastered for the first time, the processing of the sequenced data proved to be the most enduring and difficult part. Therefore, keeping in mind the weakest spots of non-bioinformatician researchers, we describe here some features of NanoPipe usage in detail.
To start a new analysis, a query (sequencing reads) and a target (a genome or a region of interest) should be provided. The query (1D or 1D 2 ) can be uploaded via the NanoPipe's web interface from the user's computer in one FASTA or FASTQ file, also in archived format. As the ONT-provided base callers output sequenced reads in multiple files, it is important to know how to merge them into one file, on an operating system of the user (Windows, Mac OS, Linux). As an alternative, NanoPipe can also handle multiple FASTA or FASTQ files when they are archived (zipped), e.g., with "zip" command on Linux or Otherwise, DNA sequences of whole genomes, transcriptomes, or genes for different organisms can be accessed via particular species databases or in the corresponding databases at NCBI [31]. The NanoPipe-prepared targets include the precompiled, best-fitting alignment parameters and substitution matrices (based on last-train calculations). The substitution matrix is an important part of any alignment and contains information about the mismatch cost between any pair of nucleotides. Any uploaded targets will be used with the NanoPipe default parameters that are suitable for most cases. To avoid "noise," i.e., mapping of too short reads, the user can set the read length limit, e.g., to 200 nucleotides (depends on the experiment type and purpose). A unique name can be assigned to a job and used later within one month to retrieve the results from the NanoPipe server. After one month the data will be deleted from the server.
Depending on the query and target size, a NanoPipe analysis can last from several minutes to several hours, the server's memory used for the calculations is limited to 16 GB. The user will receive the notification via email (if the email address was provided) when his/her job is completed. The summary of the completed analysis depicts the LAST parameters that were used and the mapping statistics, i.e., how many reads were mapped altogether and the reads distribution per chromo-somes/scaffolds of the target. This table can be sorted in increasing/decreasing order. Mapping distribution statistics show how long the reads in the query were and allow estimating whether the sequencing resulted in the expected read lengths. Thus, for whole-genome sequencing by MinION, a researcher can expect read lengths of more than 1 million nucleotides; targeted sequencing results in read lengths corresponding to the target length; transcriptome sequencing or RNA sequencing (RNAseq) experiments should provide reads with length typical of particular species' transcript lengths (around one-to-two thousand nucleotides for most organisms). Alignment distribution statistics inform about the quality of the alignment, i.e., whether the whole read or a part of it could be mapped to the target.
Alignment results are visualized in NanoPipe in several ways: ordinal line-by-line pairwise alignment, a BAM file, and a graphical representation via nucleotide plots. Binary alignment (BAM) and indexed BAM files can be easily downloaded from the results page and further used for an interactive genome browsers, e.g., IGV. The target FASTA files are required by IGV and can be downloaded together with BAM files. IGV is free software and can be accessed from [32]. Nucleotide plots represent the colored mapping scheme at each position of the sequence, each nucleotide marked with its specific color. This enables easy monitoring for gaps and possible nucleotide substitutions. Navigating along the plot is enabled via right and left shift, as well by entering a nucleotide's coordinate in the search field. For long target regions, nucleotide plots provide a zoom-out preview at the bottom of the page.
Each chromosome/region of the target is supported with an individual consensus that can be seen and downloaded in FASTA format. Positions that cannot be defined (not enough information in the input data) are designated as "N"; gaps are designated as "-"; if a position is occupied by any of the four nucleotides, it is designated as "X"; other ambiguous positions are designated using IUPAC nomenclature.
The polymorphisms table lists SNP candidates and provides joint probabilities for each candidate (maximum = 1), as well raw counts for each nucleotide at the target position. If the data are available, the corresponding SNP IDs (identifiers) will be retrieved from the existing databases (currently available only for human [33] and Plasmodium [34]). The alignment quality around a SNP candidate is reported as p-error; the higher the p-error (maximum 1), the lower the alignment quality. Low alignment quality might indicate a region of sequencing errors around an individual SNP and, thus, signify the lower reliability of the candidate detection. However, a cluster of closely located SNPs within a distance of less than 10 nucleotides would have a similar effect. Therefore, the p-error is an additional parameter that might be taken into consideration by the user. Detailed analysis of the nucleotide plot or the alignment in the IGV viewer around questionable candidates may help in making a decision, including consideration of biological relevance.
To view the result pages for nucleotide plots, consensus, polymorphisms, and alignment, the user needs to choose a particular chromosome/region. This approach accelerates the data display and prevents the web browser from overloading.

Plasmodium polymorphism detection
The targeted regions were first amplified using the standard PCR protocol [35]. The resulting amplicons were sequenced with the MinION using the ONT sequencing kit SDK-MAP005 or SQK-MAP006 for the library preparation and the flow cell version FLO-MAP003. The sequences were aligned against the Plasmodium genome (P. falciparum reference genome based on 3D7 strain). Exhaustive discussion of the ONT utility for Plasmodium SNP calling and the library preparation methods are presented elsewhere [35]. NanoPipe succeeded in mapping 99.9% of all query reads and detecting the expected mutations (see Fig. 2). The specific characteristic of the Plasmodium genome, multiple AT/T repeats, can be easily observed in the nucleotide plot. The full analysis of a 21 MB (FASTA format) query against the P. falciparum genome with NanoPipe took less than 4 minutes. This example demonstrates that the analysis of just 10,157 MinION reads with a high AT content on NanoPipe results in reliable data. Detailed screen shots of the NanoPipe results for this study case can be found in the Supplementary Material (Supplementary Figs. S2-S10).

Targeted sequencing of EGFR transcript from human lung adenocarcinoma cell line H1975
The region of the human EGFR cDNA corresponding to exons 17-22 was amplified using the primers CTAAGATCCCGTCCATCGCC (forward) and ACATATGGGTGGCTGAGGGA (reverse). The library preparation was performed following the manufacturer's recommendation using the SQK-LSK108 kit from ONT, and then sequencing was done with the MinION using flow cell version FLO-MIN107 R9. The raw 1D reads (900 MB, FASTA format) were uploaded to NanoPipe using the human reference genome as a target; the analysis took 135 minutes. A total of 79.9% of all reads were mapped to the EGFR gene on chromosome 7. Most of the remaining reads (13.5%) were mapped to chromosome 11, to regions corresponding to the cortactin gene ( NM 138565) that is overexpressed in different cancers [36], and the gene encoding subunit 2 of the splicing factor 3b protein complex ( NM 006842). This gene might be differentially expressed in tumor tissues [37], nevertheless, there is no definite research about possible roles of this gene in cancers. It may be the case that the gene is overexpressed in cell line H1975; therefore, its transcript was sequenced as an abundant contaminant, similar to the cortactin transcript. The four mutations expected within this region in this cell line [38] were detected, see Fig. 3. This example demonstrates the suitability of NanoPipe for cancer sequencing analysis.

RNA-seq analysis of the ant species Camponotus maculatus
Monarch Total RNA Miniprep kit (NEB) was used for the poly-A RNA extraction according to the manufacturer's recommendations. The library was prepared using the SQK-PCS108 kit from ONT and then sequenced with MinION using FLO-MIN107 R9 flow cell version following the manufacturer's recommendations. The raw 1D reads in FASTA format were uploaded to NanoPipe; the reference genome of Camponotus floridanus was used as the target genome, as neither genome nor transcriptome of C. maculatus are available. The analysis of the 1.5 GB query ran for 3 hours. Out of 1,814,750 raw reads, 1,773,747 (97.7%) were mapped to the target, spanning 431 scaffolds out of 657, including 150 scaffolds with coverage of more than 1,000 reads per scaffold. This result is consistent with the number of reads sequenced and the fact that the target was the genome of a different species. Thus, NanoPipe can be used for studying newly sequenced species.

Direct RNA-seq of poxvirus isolated from the host cells of green monkey
We examined the recent MinION direct RNA-seq data of the Vaccinia virus WR mRNA isolated from the kidney fibroblast cells of Chlorocebus sabeus (for details, see [39]). Using the virus genome and the monkey genome as targets in the two separate runs, we could separate the reads coming from the two organisms. From the tested 29,846 reads, 1,314 were mapped to the virus genome (GenBank accession LT966077.1) and 14,714 reads were mapped to the C. sabeus genome (GenBank accession GCA 000 409795.2), which is consistent with the published results. The option of the direct target upload was also tested on the small and large input files (the virus genome size is 198 KB, and the green monkey genome is 2.82 GB). Both runs were finished successfully in 12 seconds and in 51 minutes, respectively, including the construction of the target databases. NanoPipe proved to be useful in analyzing the mixed long reads of virus and its host.

Discussion
We have developed NanoPipe, a web-driven application that enables easy analysis of MinION sequencing data that is suitable for both experienced bioinformaticians and biologists with lim-ited IT knowledge. NanoPipe provides users with the consensus sequence of the studied DNA and a list of putative SNPs. In this work, we used four different experimental datasets: targeted sequencing of the AT-rich genome of P. falciparum; targeted sequencing of EGFR human cDNA; direct RNA-seq of V. virus and its host, green monkey C. sabeus; and RNA-seq of the ant P. maculatus that has, to date, no fully sequenced genome or transcriptome. These four test cases represent different tasks that biologists from different fields of study could be interested in: sequencing and SNP detection in repeat-rich genomes with low sequencing yield (study case with P. falciparum); mutation detection in human cancer samples (study case with lung adenocarcinoma cell line H1975); RNA-seq of a newly sequenced organism (study case with C. maculatus); and of a mixed species sample (virus-host study). NanoPipe proved to be a reliable tool for all of these tasks. It detected all expected mutations for Plasmodium and the EGFR transcript region and succeeded in mapping more than 90% RNA-seq reads from C. maculatus to the reference genome. In the case of virus-host direct RNA-seq, NanoPipe successfully segregated the mixed reads between the virus genome and its host genome.
We have tested NanoPipe during two workshops at the International Summer School for MinION sequencing in Bangkok (2017) and Manado (2018) with participants having no bioin- formatics experience from the medical and biological research fields. NanoPipe proved to be efficient and understandable for these users. Nevertheless, our team also noted during these practical studies that the participants lacked some basic knowledge necessary for conducting successful sequencing experiments. Therefore, we described here in detail the general scheme of working with long-read sequencing data. The flexibility of NanoPipe allows researchers to study any sequence of interest (including cancer samples in human patients), as the list of targets is not limited. With this software, we hope to make MinION sequencing even more accessible to medical researchers and biologists without sophisticated IT resources and expertise. The next step for the NanoPipe project will include more sophisticated statistics for SNP detection and evaluation. Also, we are planning to cover the microbiology field and implement an option for metagenomics analysis.

Availability of source code and requirements
Project name: NanoPipe Project home page: http://bioinformatics.uni-muenster.de/too ls/nanopipe, GitHub page: https://github.com/IOB-Muenster/na nopipe2. The NanoPipe package for local installation is available at the NanoPipe GitHub page. The explanation on installation and check-install procedures can be found in theGitHub directory. The sequencing was performed in the Institute of Bioinformatics, Muenster, Germany; the library was prepared in the group of Juergen Gadau, University of Muenster 3.4 Raw reads in a single FASTA file, archived using zgip command 3.5 Larvae and adult individuals from the lab culture 3.6 Size: 1.5 GB, archived: 309 MB Supporting data and snapshots of the GitHub are also archived in the GigaScience GigaDB repository [40]; the sequences used in the test cases can be accessed at the European Nucleotide Archive under the study ID PRJEB30433. Figure S1. Number of scientific publications that contain the phrase "Oxford Nanopore" in their abstracts. Figures S2-S10. The results pages for the P. falciparum test case.