FLAS: fast and high-throughput algorithm for PacBio long-read self-correction

Bao, Ergude; Xie, Fei; Song, Changjin; Song, Dandan

doi:10.1093/bioinformatics/btz206

Abstract

Motivation

The third generation PacBio long reads have greatly facilitated sequencing projects with very large read lengths, but they contain about 15% sequencing errors and need error correction. For the projects with long reads only, it is challenging to make correction with fast speed, and also challenging to correct a sufficient amount of read bases, i.e. to achieve high-throughput self-correction. MECAT is currently among the fastest self-correction algorithms, but its throughput is relatively small (Xiao et al., 2017).

Results

Here, we introduce FLAS, a wrapper algorithm of MECAT, to achieve high-throughput long-read self-correction while keeping MECAT’s fast speed. FLAS finds additional alignments from MECAT prealigned long reads to improve the correction throughput, and removes misalignments for accuracy. In addition, FLAS also uses the corrected long-read regions to correct the uncorrected ones to further improve the throughput. In our performance tests on Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana and human long reads, FLAS can achieve 22.0–50.6% larger throughput than MECAT. FLAS is 2–13× faster compared to the self-correction algorithms other than MECAT, and its throughput is also 9.8–281.8% larger. The FLAS corrected long reads can be assembled into contigs of 13.1–29.8% larger N50 sizes than MECAT.

Availability and implementation

The FLAS software can be downloaded for free from this site: https://github.com/baoe/flas.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

The third generation sequencing technology is advantageous over the second generation in its much larger read lengths (Eid et al., 2009). As a representative of the third generation sequencing technology, the PacBio Single Molecule Real-Time technology can currently generate long reads of 5–15 K base pairs on average with cost of $0.4–0.8 per million base pairs (Lee et al., 2016; Rhoads and Au, 2015). The long reads can overcome length limitation of the second generation short reads, and can thus facilitate assembling or analyzing complex genome regions such as GC islands and repeats in downstream. As a result, more and more sequencing projects have incorporated the long reads. The projects can be classified into two categories as below.

Long and short read hybrid projects use long and short reads together (Chen et al., 2014; Philippe et al., 2013). In these projects, long reads of low to moderate coverage are used as a complement to short reads of relatively high coverage, to obtain higher quality results than the short reads alone.
Long read only projects use long reads alone (Baker et al., 2014; Chaisson et al., 2015). Though more expensive, long reads of high coverage can be used without short reads, to guarantee high quality results while simplifying the sequencing workflow.

Nevertheless, the long reads contain about 15% sequencing errors dominated by insertions and deletions, so it is important to correct these errors. The error correction can be made in two levels: biological level by Circular Consensus Sequencing (CCS) technology and computational level by error correction algorithms (Eid et al., 2009). The CCS technology has to reduce read lengths to a large extent for correction, so the error correction algorithms are usually in favor over the CCS. The current error correction algorithms can be classified into two categories as below.

Short read assisted correction algorithms align the corresponding short reads from the same species to the long reads and correct them, so they are suitable for the long and short read hybrid projects. PBcR (Koren et al., 2012), LSC (Au et al., 2013), Proovread (Hackl et al., 2014) and CoLoRMap (Haghshenas et al., 2016) align the uncorrected short reads to the long reads for correction, while ECTools (ECTools and HGAP, FALCON, Canu and MECAT below are essentially pipeline algorithms including both error correction and downstream assembly algorithms.) (Lee et al., 2014), LoRDEC (Salmela and Rivals, 2014), Jabba (Miclotte et al., 2016) and HALC (by ourselves Bao and Lan, 2017) align the long reads to a de Bruijn graph constructed or contigs assembled from the short reads for correction.
Self-correction algorithms align the long reads to themselves and find multiple sequence alignments among the long reads to correct them, suitable for the long read only projects. HGAP (Chaisson and Tesler, 2012; Chin et al., 2013) finds seeding k-mers among the long reads and then the alignments by dynamic programing, and FALCON (Chin et al., 2016; Myers, 2014) applies fast techniques to find the k-mers and alignments. Canu (Berlin et al., 2015; Koren et al., 2017) weights and filters the k-mers to reduce load of finding the alignments, and MECAT (Xiao et al., 2017) takes into consideration correlation of the k-mers for the filtration. Differently, LoRMA (Salmela et al., 2016) aligns the long reads to de Bruijn graphs constructed from themselves for correction.

For the self-correction algorithms, there are two challenges to address as below.

It is challenging to make correction with fast speed, because it is time consuming to align the long reads of usually several millions to each other. HGAP, FALCON, Canu and MECAT all seek to address this challenge.
It is also challenging to correct a sufficient amount of read bases, i.e. to achieve high-throughput self-correction, because it is also difficult to align the long reads of about 15% errors to each other. LoRMA seeks to address this challenge.

Note that for the self-correction algorithms, though it is important to distinguish genome repeats in the long reads during correction, it is not as challenging as the above two, because the repeats are usually shorter than the long reads, and have only moderate to low influence on long-read alignment and correction. This is different from the short read assisted correction algorithms.

MECAT has successfully addressed the first challenge, and is currently among the fastest self-correction algorithms (Xiao et al., 2017). Nevertheless, its throughput is relatively small. This is not economical considering the higher cost of long reads compared to short reads, and may also affect quality of the downstream results (Lee et al., 2014). In order to address both of the two challenges, here we propose FLAS, a fast and high-throughput algorithm for long-read self-correction. FLAS is a wrapper algorithm of MECAT, to achieve high throughput while keeping MECAT’s fast speed. FLAS has two novelties as below.

FLAS finds additional alignments from MECAT prealigned long reads to improve the correction throughput, and removes misalignments for accuracy. For the prealigned long reads, those aligned with each other are likely to be from the same genome region, so are found and put into the same set. For two such sets, if there are shared long reads existing in both of them, it is possible all the long reads are from the same genome region, so additional alignments could be found between the two sets for correction; alternatively, it is possible the shared long reads are misaligned in one set, so misalignments could be removed from the set.
It also uses the corrected long-read regions to correct the uncorrected ones to further improve the throughput. A graph is constructed from alignments of the corrected long-read regions, and context of the graph is examined to accurately align the uncorrected long-read regions to the corrected ones for further correction.

2 Materials and methods

2.1 Overview

The FLAS algorithm consists of two steps as below, corresponding to the two novelties above, respectively.

Correct with refined MECAT prealigned long reads.
- Align long reads with each other by MECAT, construct an overlap graph based on the alignments (Myers, 2005), and find maximal cliques with a modified Bron–Kerbosch algorithm in the graph (Eppstein et al., 2010; Eppstein and Strash, 2011).
- For two maximal cliques with shared vertices, find additional alignments between the corresponding long reads or remove misalignments depending on number of the shared vertices. Make error correction with the refined alignments.
Correct uncorrected long-read regions with corrected ones.
- Align the corrected long-read regions with each other, and construct a second overlap graph based on the alignments.
- Align all the long reads obtained from step 1 to paths of the graph. For each long-read region with multiple aligned paths, find the correct one by referring to not only alignment identities but also expected amounts of aligned long reads to the paths. Make error correction for the uncorrected long-read regions with the alignments.

In step 1, we construct an overlap graph recording all alignment information by MECAT, find maximal cliques in the graph as sets containing long reads aligned with each other (step 1a), and use the maximal cliques to refine the alignments for error correction (step 1b). In step 2, we construct a second overlap graph recording alignment information among the corrected long-read regions (step 2a), and align all the long reads to paths of the graph to correct the uncorrected regions (step 2b). Details of these steps are as below. Figure 1 shows an illustration of the algorithm.

Fig. 1.

Open in new tab Download slide

Illustrations on the FLAS algorithm. (A) Uncorrected long reads r_A–r_L are generated from two similar genome regions in the underlining target genome. r_C (shaded) may not be fully corrected by MECAT, since it is only aligned to long reads r_A and r_E, and not aligned to r_B and r_D from the same genome region; r_D (shaded) may be miscorrected, since it is misaligned to long reads r_H, r_I and r_J from the other genome region; r_K and r_L (shaded) may not be corrected, since they are not aligned to any long read from the same genome region. (B) In algorithm step 1, an overlap graph is constructed from the uncorrected long-read alignments. In the graph, there are four maximal cliques: $C_{1} = {A, C, E}, C_{2} = {A, B, D, E}, C_{3} = {D, E, F, G}$ and $C_{4} = {H, D, I, J}$ ⁠, where C₁ and C₂, C₂ and C₃, C₂ and C₄ and C₃ and C₄ are two cliques with shared vertices. C₁ and C₂ are classified in case (i), and additional alignments are obtained between r_C and r_B, as well as r_C and r_D (indicated with dashed lines). As a result, r_C can be fully corrected. C₂ and C₄ are classified in case (iii), and misalignments are removed between r_D and r_H, r_D and r_I, as well as r_D and r_J (indicated with crosses). As a result, r_D can be accurately corrected. (C) In algorithm step 2, a second overlap graph is constructed from alignments of the corrected long reads. With limited errors in paths of the graph, r_K could be aligned to two alternative paths: $P_{1} = A \to E$ and $P_{2} = H \to J$ ⁠. Despite P₁’s higher alignment identity $i_{1} = 88 %$ than P₂’s $i_{2} = 85 %$ ⁠, r_K is aligned to P₂ of much higher expected amount of aligned long reads $m_{2} = 3$ than P₁’s $m_{1} = 1$ ⁠. As a result, r_K can be corrected with accuracy. Similarly, r_L can also be corrected (not shown for simplicity)

2.2 Maximal cliques in overlap graph

We align the long reads with each other by MECAT. The parameter settings are the default of moderate alignment sensitivity and accuracy. This guarantees sparsity of the constructed overlap graph (see below).

Based on the alignments, we construct an overlap graph. Each vertex is constructed for one long read, and each edge is constructed between two vertices if the corresponding long reads are aligned with a sufficient overlap length. Transitive edges are not removed, because all the alignment information should be kept (Myers, 2005). The constructed graph is a sparse graph with maximum vertex degree d, where d is the maximum number of candidate alignments for each long read, usually 50–200 controlled by the aligner. The graph’s degeneracy is thus at most d. Sparsity of the graph is useful to guarantee small running time of the modified Bron–Kerbosch algorithm (see below).

In the graph, we find maximal cliques with a modified Bron–Kerbosch algorithm (Eppstein and Strash, 2011). In an arbitrary graph, the maximal clique finding problem is NP-hard, but because the overlap graph is a sparse graph with degeneracy at most d, the modified Bron–Kerbosch algorithm has polynomial time complexity $O (dn 3^{d / 3})$ ⁠, where n is number of the long reads or vertices. Therefore, running time of the modified Bron–Kerbosch algorithm is guaranteed relatively small.

2.3 Alignment refinement and error correction

Given two cliques C₁ of size c₁ and C₂ of size c₂ with s shared vertices existing in both cliques, there are basically three cases for them as below.

Case (i): the corresponding long reads of C₁ and C₂ are from the same genome region, and the long reads of shared vertices are correctly aligned.
Case (ii): the corresponding long reads of C₁ and C₂ are from two different genome regions, and the long reads of shared vertices span the genome regions and are correctly aligned.
Case (iii): the corresponding long reads of C₁ and C₂ are from two different genome regions, and the long reads of shared vertices are from one of the regions and misaligned to the other.

In cases (i) and (ii), s should be relatively large compared to $min {c_{1}, c_{2}}$ ⁠, while in case (iii), s should be relatively small. This is based on the observation that among all alignments, number of the correct alignments is much larger than the misalignments. In addition, in case (ii), each long read of a shared vertex should have one region alignable to the long reads of C₁, and a different region to those of C₂.

Therefore, to distinguish between cases (i)/(ii) and (iii), we check s and $min {c_{1}, c_{2}}$ ⁠. If $s \geq α_{1} min {c_{1}, c_{2}}$ ⁠, C₁ and C₂ are classified in case (i)/(ii); otherwise, they are in case (iii), where α₁ is a fraction value with default setting 50%. To further distinguish between cases (i) and (ii), we check s and $s'$ ⁠, where $s'$ is number of the shared vertices whose long reads have one region alignable to the long reads of C₁, and a different region to those of C₂. If $s' \geq α_{2} s$ ⁠, C₁ and C₂ are classified in case (ii); otherwise, they are in case (i), where α₂ is a fraction value also with default setting 50%.

For any two cliques classified in case (i), we realign the corresponding long reads of the cliques by MECAT to find more alignments. MECAT’s parameter settings are tuned for high alignment sensitivity and relatively low accuracy. The −k parameter, specifying the minimum number of matching k-mers to detect overlap candidates, is lowered from its default value to 2 to detect, form and report more overlaps. The −n parameter, specifying the minimum number of overlap candidates for extension to form overlaps, is lowered from its default value to 50 to form and report more overlaps. The −a parameter, specifying the minimum overlap lengths between reads to report the overlaps, is lowered from its default value to 1000 to report more overlaps. This could improve the error correction throughput. For any two cliques classified in case (iii), we calculate for each long read of a shared vertex, its average alignment identity to the long reads of each clique, and remove it from the clique of lower identity. This could improve the accuracy. After the refinement of alignments, we make error correction by MECAT with default parameter settings. Note that if two cliques are classified in case (ii), we do not do anything, because though the long reads of shared vertices overlap with those in both cliques, they are simply sufficiently long spanning genome regions, and do not imply any additional or incorrect alignment.

2.4 Correction of uncorrected regions with corrected regions

Similar to Section 2.2, we align the corrected long-read regions with each other, and construct a second overlap graph. Transitive edges are removed in the graph. Then we align all the long reads to paths of the graph for error correction. For the long reads containing both corrected and uncorrected regions, the corrected regions are aligned uniquely to paths in the graph, and the uncorrected regions could thus be aligned to paths in between for correction. For the long reads containing only uncorrected regions, they could also be aligned to paths of the graph for correction, because of the graph’s limited errors.

It is possible that an uncorrected long-read region has multiple paths for alignment. To find the correct path, we compare not only alignment identities, but also expected amounts of aligned long reads to the paths. The latter is a measurement of paths’ capability to accommodate aligned long reads, and a long read has higher probability to align to the path of larger capability (see below). Given a long-read region R with two aligned paths P₁ of identity i₁ and P₂ of identity i₂, if $| i_{1} - i_{2} | \geq β_{1}$ ⁠, we find the path of larger identity as the correct one, where β₁ is a difference value with default setting 5%. Otherwise, the expected amounts of aligned long reads to the paths m₁ and m₂ are calculated, respectively, and if $| \frac{m_{1} - m_{2}}{m_{1} + m_{2}} | \geq β_{2}$ ⁠, we find the path of larger amount as the correct one, where β₂ is also a difference value with default setting 20%. For long-read regions with more than two aligned paths, alignment identities and expected amounts of aligned long reads are compared in pairs to find the correct path.

Here, the expected amount of aligned long reads to a path P is calculated as the expected amount of long reads m_g from the path’s genome region minus the amount of long reads $m_{p}^{'}$ forming the path. $m_{p}^{'}$ can be recorded when the graph is constructed, and m_g is calculated as below. Because the probability that a long read of length l is from a genome region of length Δ_g is $\frac{Δ_{g} + 2 (l - o)}{G}$ ⁠, m_g is calculated as $\sum_{i = 0}^{n} \frac{Δ_{g} + 2 (l_{i} - o)}{G}$ ⁠, where o is minimum required overlap length between the long read and genome region, G is size of the genome, l_i is length of the ith long read and again, n is number of the long reads. Furthermore, because Δ_g is about P’s length Δ_p, and G can be estimated as $\frac{\sum_{j = 0}^{n} l_{j}}{c}$ ⁠, m_g is further refined as $\sum_{i = 0}^{n} \sum_{j = 0}^{n} \frac{Δ_{p} + 2 (l_{i} - o)}{l_{j} / c}$ ⁠, where c is coverage of the long reads. Note that this calculation assumes a path does not correspond to multiple genome regions. This is reasonable, since one alternatively aligned path of a long-read region represents a single genome region in most cases in overlap graph (Kamath et al., 2017; Myers, 2005).

2.5 Implementation of the FLAS software

The FLAS software is implemented in C++ for Linux platform. It contains data structures as below.

An n item adjacency list for the overlap graph in algorithm step 1. Its lists have the maximum length d, because of the graph’s sparsity.
An n item hash table for the read-clique mapping. It is used to find all the cliques with shared vertices.
An $O ((n - d) 3^{d / 3})$ item hash table for the clique-read mapping, where $O ((n - d) 3^{d / 3})$ is the maximum possible number of cliques in the graph (Eppstein and Strash, 2011). It is used to find all the shared vertices for any two cliques.
An $n'$ item adjacency list for the second overlap graph in algorithm step 2, where $n'$ is number of the corrected long-read regions.

Input of FLAS is the uncorrected long reads, and the outputs include the trimmed long reads that do not contain uncorrected regions in read heads and tails, and split long reads that do not contain any uncorrected region (Salmela and Rivals, 2014).

3 Evaluation

3.1 Experimental design

3.1.1 Test of error correction performance

To evaluate FLAS’s performance, we ran FLAS, MECAT and the other self-correction algorithms HGAP, LoRMA, FALCON and Canu on three sets of long reads from species: Escherichia coli, Saccharomyces cerevisiae and Arabidopsis thaliana of genome sizes 5, 12 and 125 Mbp, respectively. Canu’s genomeSize parameter could be set large values for high throughput, so it was run in both default mode and this high-throughput mode. The latter is represented as Canu+. We also ran FLAS and MECAT on another set of long reads from human of much larger genome size 3 Gbp. The other algorithms besides MECAT were not compared on the human long reads, because they need long running time (Xiao et al., 2017). Coverage of the first three long reads was all about 100×, meeting the requirement of most long read only projects, while that of the human long reads was 50×, due to its large genome size. In addition, on the first three long reads, we ran some typical short read assisted correction algorithms PBcR, CoLoRMap and ECTools, together with the corresponding short reads from the same species. Coverage of the short reads was all about 50×, meeting the requirement of most long and short read hybrid projects. After error correction, we aligned the corrected long reads to the corresponding target genomes, to assess quality of the long reads. Note that not all the short read assisted correction algorithms were compared, because they are in a different category from FLAS.

3.1.2 Test with various long-read coverage and parameter settings

FLAS’s performance is mainly affected by coverage of the long reads, because it is inputted with long reads alone. Hence, it is interesting to see FLAS’s performance with various long-read coverage, and especially, the performance change compared to MECAT and the other self-correction algorithms. To do this, we varied coverage of the E.coli long reads from 50 to 200×, and ran and compared FLAS, MECAT, HGAP, LoRMA, FALCON, Canu and Canu+.

In addition, FLAS’s performance could also be affected by its parameter settings. Specifically, the parameters include the two fraction values α₁ and α₂ to distinguish between cases (i)–(iii) as discussed in Section 2.3, and the two difference values β₁ and β₂ to find the correct aligned paths as discussed in Section 2.4. It is also interesting to see FLAS’s performance with various settings of the parameters. To do this, we varied α₁ and α₂ from 30 to 70%, varied β₁ from 1 to 15% and varied β₂ from 10 to 40%, and ran FLAS on the E.coli long reads of 100×. Similar to above, after error correction, we aligned the corrected long reads to the corresponding target genome, to assess quality of the long reads.

3.1.3 Test of long-read assemblies

FLAS’s performance influences downstream assembly and analysis as discussed in Section 1. Hence, it is interesting to see some downstream assembly results from the FLAS corrected long reads. To do this, we used FALCON’s assembly algorithm FALCON* (FALCON’s assembly algorithm and Canu’s assembly algorithm below are represented as FALCON* and Canu*, respectively, to distinguish from the error correction algorithms.) (Chin et al., 2016) and Canu’s assembly algorithm Canu* (Koren et al., 2017), and assembled the FLAS corrected S.cerevisiae and A.thaliana long reads. To make comparison, we also assembled the MECAT corrected long reads. Then we assessed quality of the contigs. Note that we did not assemble the corrected long reads by the other algorithms in this test, because this study improves MECAT rather than the other algorithms, with a focus on long-read error correction rather than assembly. Also note that MECAT does not have a standalone assembly algorithm, but applies Canu’s assembly algorithm.

All of the algorithms’ software was in default settings, except Canu+ and the test of various parameter settings as discussed in Section 3.1.2. Only the corrected split long reads were compared and used for assemblies in these tests, so that the results could not be affected by the uncorrected bases. All experiments were performed in a computing node of a computer cluster with 32 cores of 2.3 GHz and 1024 GB memory.

3.2 Datasets and performance measurements

3.2.1 Datasets

The long reads of E.coli, S.cerevisiae, A.thaliana and human were the filtered subreads downloaded from the MHAP website (Berlin et al., 2015). Additional E.coli long reads for the test with various coverage were downloaded from NCBI accession SRX1155577, and filtered by the Single Molecule Real-Time Analysis quality control. The corresponding target genomes were from NCBI accessions NC_000913.3, NC_001133.9, the Ensembl Plant FTP and Ensembl FTP, respectively. The corresponding short reads of E.coli, S.cerevisiae and A.thaliana were from NCBI accessions ERR022075, SRR567755 and ERR469286, respectively.

3.2.2 Performance measurements

We aligned the error corrected split long reads and uncorrected long reads to the corresponding target genomes to assess their quality. The BWA-MEM aligner was used for these alignments, because it is a typical aligner for genomic sequences with fast speed and relatively high sensitivity (Li and Durbin, 2010). We made the following measurements: (i) throughput is the number of corrected and outputted bases, and throughput ratio (THR) is the throughput over the total number of uncorrected long-read bases; (ii) alignment ratio is the number of aligned bases over the total number of outputted bases; (iii) alignment identity is the identity of aligned bases; (iv) genome fraction is the number of genome bases aligned by long reads over the total number of genome bases; (v) time is the elapsed time with 16 threads; (vi) memory is the corresponding memory usage.

Then we used the Error Correction Evaluation Toolkit for split long reads implemented by ourselves (Bao and Lan, 2017), and obtained in the outputted bases, the number of corrected errors (true positive or TP), the number of falsely converted correct bases (false positive or FP), the number of uncorrected errors (false negative or FN) and the number of unconverted correct bases (true negative or TN). With these numbers, due to the errors’ uniform distribution in long reads, we can estimate the total number of errors in the uncorrected long reads as the number of errors in the outputted bases over the THR, i.e. $EI = \frac{TP + FN}{THR}$ ⁠. We can also estimate the total number of correct bases in the uncorrected long reads as the number of correct bases in the outputted bases over the THR, i.e. $CI = \frac{TN + FP}{THR}$ ⁠, and thus the number of correct bases in the discarded bases as the total number of correct bases in the uncorrected long reads minus the number of correct bases in the outputted bases, i.e. $CD = CI - (TN + FP)$ ⁠. Therefore, we made the following measurements: (vii) sensitivity is calculated as $\frac{TP}{EI}$ ⁠; (viii) specificity is calculated as $\frac{TN + CD}{CI}$ ⁠; (ix) gain is the number of errors effectively corrected without introducing new ones over the total number of errors in the uncorrected long reads, and calculated as $\frac{TP - FP}{EI}$ ⁠.

For the assembled contigs, following Chin et al. (2016), we used the QUAST toolkit (Gurevich et al., 2013) and made the following measurements: (x) assembly size is the total length of contigs; (xi) N contigs is the number of contigs; (xii) N50 size is the contig size at 50% of the total number of contig bases; (xiii) N N50 is the number of contigs of lengths larger than the N50 size; (xiv) Max contig size is the maximum contig size. We also aligned the contigs to the corresponding genomes, and generated MUMmer plots to assess accuracy of the contigs (Kurtz et al., 2004).

3.3 Results

3.3.1 Results of error correction performance

Results on the E.coli long reads are listed in Table 1a. The throughput of MECAT is 284 Mbp, and the alignment ratio and alignment identity are about 100%, while the throughput of FLAS is 347 Mbp, and the alignment ratio and alignment identity are also about 100%. The throughput of FLAS is 22.0% larger than MECAT. The running time and memory usage of the other self-correction algorithms HGAP, LoRMA, FALCON, Canu and Canu+ are 0.3–2.0 h and 2.7–46.2 GB, respectively, while those of FLAS are 0.2 h and 2.9 GB, respectively. FLAS is 2–10× faster than all of them. In addition, the throughput of these algorithms is 158–316 Mbp, and that of FLAS is also 9.8–119.4% larger. As to the short read assisted correction algorithms PBcR, CoLoRMap and ECTools, FLAS is 6–16× faster than all of them, and interestingly, the throughput of FLAS is also 4.7–48.9% larger, even though the latter were inputted with additional 50× short reads. This is probably because the short reads have limited lengths, and thus may not be sufficiently aligned to the long reads for correction. Note that for PBcR and FALCON, the sensitivity, gain and specificity are not available without the correspondence information between the uncorrected long reads and corrected ones.

Table 1.

Evaluation of error correction performance

Method	Throughput (bp)	Alignment ratio (%)	Alignment identity (%)	Genome fraction (%)	Sensitivity (%)	Gain (%)	Specificity (%)	Time (h)	Memory (GB)
(a) Long reads of E.coli (397 486 752 bp)
PBcR^a	318 955 887	99.6	99.9	96.3	—	—	—	3.6	2.4
CoLoRMap	232 836 623	99.9	99.9	91.8	58.2	56.7	99.4	1.2	18.4
ECTools	331 259 738	99.9	99.9	94.1	83.0	82.6	99.7	3.2	11.2
HGAP	173 564 650	95.4	99.4	98.3	41.9	32.9	98.2	2.0	7.2
LoRMA	310 602 850	92.7	99.9	98.5	76.0	65.1	97.3	0.9	36.4
FALCON^a	245 895 710	97.0	99.4	98.8	—	—	—	1.3	46.2
Canu	158 007 936	95.6	98.9	97.7	38.0	28.1	98.0	0.3	2.7
Canu+	315 881 753	96.5	99.2	99.3	76.2	59.0	96.3	1.5	3.4

MECAT	284 138 161	97.2	99.6	98.4	68.8	55.5	97.2	0.1	2.9
FLAS	346 737 024	96.8	99.6	98.5	83.9	67.6	96.5	0.2	2.9

(b) Long reads of S.cerevisiae (1 405 936 758 bp)
PBcR^a	841 341 658	99.6	99.9	99.7	—	—	—	21.2	4.4
CoLoRMap	668 145 143	99.9	99.9	99.8	47.5	45.6	99.7	1.3	53.2
ECTools	429 253 952	99.9	99.4	99.8	26.8	26.7	99.9	90.6	11.2
HGAP	452 022 161	99.9	99.5	99.9	32.0	29.4	99.7	6.3	16.8
LoRMA	896 121 171	99.9	99.9	99.9	64.1	63.1	99.4	7.5	68.2
FALCON^a	603 240 666	96.3	99.6	99.9	—	—	—	1.5	75.4
Canu	388 286 736	95.0	99.3	99.9	27.4	24.6	99.6	3.4	4.6
Canu+	893 339 818	96.6	99.4	99.9	63.0	56.7	99.1	4.5	5.2

MECAT	726 605 009	99.9	99.6	99.9	51.4	47.6	99.4	0.3	8.5
FLAS	1 032 185 610	99.9	99.6	99.9	73.2	67.7	99.2	0.6	8.5

(c) Long reads of A.thaliana (18 557 597 273 bp)
PBcR^a	9 166 306 215	99.3	99.8	99.9	—	—	—	320.6	38.9
CoLoRMap	9 734 458 732	99.6	99.8	99.9	57.6	51.1	93.7	24.4	102.4
FALCON^a	12 110 519 562	99.9	99.3	99.9	—	—	—	80.8	104.2
Canu	4 090 897 122	99.8	99.1	99.9	23.9	19.9	99.4	28.5	44.3
Canu+	9 119 913 928	99.7	99.0	99.9	53.1	41.7	98.3	68.4	48.4

MECAT	10 373 391 686	99.9	99.4	99.9	61.0	52.7	98.6	6.7	13.8
FLAS	15 618 210 837	99.7	99.2	99.9	91.7	79.0	97.9	16.2	13.8

(d) Long reads of human (168 421 410 338 bp)
MECAT	74 946 808 568	97.3	99.7	98.7	43.4	37.3	99.1	163.4	24.9
FLAS	93 838 595 967	97.2	99.6	98.9	54.3	46.4	98.8	370.0	24.9

Method	Throughput (bp)	Alignment ratio (%)	Alignment identity (%)	Genome fraction (%)	Sensitivity (%)	Gain (%)	Specificity (%)	Time (h)	Memory (GB)
(a) Long reads of E.coli (397 486 752 bp)
PBcR^a	318 955 887	99.6	99.9	96.3	—	—	—	3.6	2.4
CoLoRMap	232 836 623	99.9	99.9	91.8	58.2	56.7	99.4	1.2	18.4
ECTools	331 259 738	99.9	99.9	94.1	83.0	82.6	99.7	3.2	11.2
HGAP	173 564 650	95.4	99.4	98.3	41.9	32.9	98.2	2.0	7.2
LoRMA	310 602 850	92.7	99.9	98.5	76.0	65.1	97.3	0.9	36.4
FALCON^a	245 895 710	97.0	99.4	98.8	—	—	—	1.3	46.2
Canu	158 007 936	95.6	98.9	97.7	38.0	28.1	98.0	0.3	2.7
Canu+	315 881 753	96.5	99.2	99.3	76.2	59.0	96.3	1.5	3.4

MECAT	284 138 161	97.2	99.6	98.4	68.8	55.5	97.2	0.1	2.9
FLAS	346 737 024	96.8	99.6	98.5	83.9	67.6	96.5	0.2	2.9

(b) Long reads of S.cerevisiae (1 405 936 758 bp)
PBcR^a	841 341 658	99.6	99.9	99.7	—	—	—	21.2	4.4
CoLoRMap	668 145 143	99.9	99.9	99.8	47.5	45.6	99.7	1.3	53.2
ECTools	429 253 952	99.9	99.4	99.8	26.8	26.7	99.9	90.6	11.2
HGAP	452 022 161	99.9	99.5	99.9	32.0	29.4	99.7	6.3	16.8
LoRMA	896 121 171	99.9	99.9	99.9	64.1	63.1	99.4	7.5	68.2
FALCON^a	603 240 666	96.3	99.6	99.9	—	—	—	1.5	75.4
Canu	388 286 736	95.0	99.3	99.9	27.4	24.6	99.6	3.4	4.6
Canu+	893 339 818	96.6	99.4	99.9	63.0	56.7	99.1	4.5	5.2

MECAT	726 605 009	99.9	99.6	99.9	51.4	47.6	99.4	0.3	8.5
FLAS	1 032 185 610	99.9	99.6	99.9	73.2	67.7	99.2	0.6	8.5

(c) Long reads of A.thaliana (18 557 597 273 bp)
PBcR^a	9 166 306 215	99.3	99.8	99.9	—	—	—	320.6	38.9
CoLoRMap	9 734 458 732	99.6	99.8	99.9	57.6	51.1	93.7	24.4	102.4
FALCON^a	12 110 519 562	99.9	99.3	99.9	—	—	—	80.8	104.2
Canu	4 090 897 122	99.8	99.1	99.9	23.9	19.9	99.4	28.5	44.3
Canu+	9 119 913 928	99.7	99.0	99.9	53.1	41.7	98.3	68.4	48.4

MECAT	10 373 391 686	99.9	99.4	99.9	61.0	52.7	98.6	6.7	13.8
FLAS	15 618 210 837	99.7	99.2	99.9	91.7	79.0	97.9	16.2	13.8

(d) Long reads of human (168 421 410 338 bp)
MECAT	74 946 808 568	97.3	99.7	98.7	43.4	37.3	99.1	163.4	24.9
FLAS	93 838 595 967	97.2	99.6	98.9	54.3	46.4	98.8	370.0	24.9

Note: The long reads in tests (a)–(d) are from E.coli, S.cerevisiae, A.thaliana and human, respectively. The corrected long reads by MECAT and FLAS (below dashed line), and the other self-correction algorithms HGAP, LoRMA, FALCON, Canu and Canu+ (above dashed line), and by short read assisted correction algorithms PBcR, CoLoRMap and ECTools (in gray) using additional short reads are compared in each test. The best result for the self-correction algorithms in each column is shown in boldface. The performance measurements are listed in Section 3.2.2.

a

Some measurements are not available without the correspondence information between the uncorrected long reads and corrected ones.

Open in new tab

Table 1.

Evaluation of error correction performance

Method	Throughput (bp)	Alignment ratio (%)	Alignment identity (%)	Genome fraction (%)	Sensitivity (%)	Gain (%)	Specificity (%)	Time (h)	Memory (GB)
(a) Long reads of E.coli (397 486 752 bp)
PBcR^a	318 955 887	99.6	99.9	96.3	—	—	—	3.6	2.4
CoLoRMap	232 836 623	99.9	99.9	91.8	58.2	56.7	99.4	1.2	18.4
ECTools	331 259 738	99.9	99.9	94.1	83.0	82.6	99.7	3.2	11.2
HGAP	173 564 650	95.4	99.4	98.3	41.9	32.9	98.2	2.0	7.2
LoRMA	310 602 850	92.7	99.9	98.5	76.0	65.1	97.3	0.9	36.4
FALCON^a	245 895 710	97.0	99.4	98.8	—	—	—	1.3	46.2
Canu	158 007 936	95.6	98.9	97.7	38.0	28.1	98.0	0.3	2.7
Canu+	315 881 753	96.5	99.2	99.3	76.2	59.0	96.3	1.5	3.4

MECAT	284 138 161	97.2	99.6	98.4	68.8	55.5	97.2	0.1	2.9
FLAS	346 737 024	96.8	99.6	98.5	83.9	67.6	96.5	0.2	2.9

(b) Long reads of S.cerevisiae (1 405 936 758 bp)
PBcR^a	841 341 658	99.6	99.9	99.7	—	—	—	21.2	4.4
CoLoRMap	668 145 143	99.9	99.9	99.8	47.5	45.6	99.7	1.3	53.2
ECTools	429 253 952	99.9	99.4	99.8	26.8	26.7	99.9	90.6	11.2
HGAP	452 022 161	99.9	99.5	99.9	32.0	29.4	99.7	6.3	16.8
LoRMA	896 121 171	99.9	99.9	99.9	64.1	63.1	99.4	7.5	68.2
FALCON^a	603 240 666	96.3	99.6	99.9	—	—	—	1.5	75.4
Canu	388 286 736	95.0	99.3	99.9	27.4	24.6	99.6	3.4	4.6
Canu+	893 339 818	96.6	99.4	99.9	63.0	56.7	99.1	4.5	5.2

MECAT	726 605 009	99.9	99.6	99.9	51.4	47.6	99.4	0.3	8.5
FLAS	1 032 185 610	99.9	99.6	99.9	73.2	67.7	99.2	0.6	8.5

(c) Long reads of A.thaliana (18 557 597 273 bp)
PBcR^a	9 166 306 215	99.3	99.8	99.9	—	—	—	320.6	38.9
CoLoRMap	9 734 458 732	99.6	99.8	99.9	57.6	51.1	93.7	24.4	102.4
FALCON^a	12 110 519 562	99.9	99.3	99.9	—	—	—	80.8	104.2
Canu	4 090 897 122	99.8	99.1	99.9	23.9	19.9	99.4	28.5	44.3
Canu+	9 119 913 928	99.7	99.0	99.9	53.1	41.7	98.3	68.4	48.4

MECAT	10 373 391 686	99.9	99.4	99.9	61.0	52.7	98.6	6.7	13.8
FLAS	15 618 210 837	99.7	99.2	99.9	91.7	79.0	97.9	16.2	13.8

(d) Long reads of human (168 421 410 338 bp)
MECAT	74 946 808 568	97.3	99.7	98.7	43.4	37.3	99.1	163.4	24.9
FLAS	93 838 595 967	97.2	99.6	98.9	54.3	46.4	98.8	370.0	24.9

Method	Throughput (bp)	Alignment ratio (%)	Alignment identity (%)	Genome fraction (%)	Sensitivity (%)	Gain (%)	Specificity (%)	Time (h)	Memory (GB)
(a) Long reads of E.coli (397 486 752 bp)
PBcR^a	318 955 887	99.6	99.9	96.3	—	—	—	3.6	2.4
CoLoRMap	232 836 623	99.9	99.9	91.8	58.2	56.7	99.4	1.2	18.4
ECTools	331 259 738	99.9	99.9	94.1	83.0	82.6	99.7	3.2	11.2
HGAP	173 564 650	95.4	99.4	98.3	41.9	32.9	98.2	2.0	7.2
LoRMA	310 602 850	92.7	99.9	98.5	76.0	65.1	97.3	0.9	36.4
FALCON^a	245 895 710	97.0	99.4	98.8	—	—	—	1.3	46.2
Canu	158 007 936	95.6	98.9	97.7	38.0	28.1	98.0	0.3	2.7
Canu+	315 881 753	96.5	99.2	99.3	76.2	59.0	96.3	1.5	3.4

MECAT	284 138 161	97.2	99.6	98.4	68.8	55.5	97.2	0.1	2.9
FLAS	346 737 024	96.8	99.6	98.5	83.9	67.6	96.5	0.2	2.9

(b) Long reads of S.cerevisiae (1 405 936 758 bp)
PBcR^a	841 341 658	99.6	99.9	99.7	—	—	—	21.2	4.4
CoLoRMap	668 145 143	99.9	99.9	99.8	47.5	45.6	99.7	1.3	53.2
ECTools	429 253 952	99.9	99.4	99.8	26.8	26.7	99.9	90.6	11.2
HGAP	452 022 161	99.9	99.5	99.9	32.0	29.4	99.7	6.3	16.8
LoRMA	896 121 171	99.9	99.9	99.9	64.1	63.1	99.4	7.5	68.2
FALCON^a	603 240 666	96.3	99.6	99.9	—	—	—	1.5	75.4
Canu	388 286 736	95.0	99.3	99.9	27.4	24.6	99.6	3.4	4.6
Canu+	893 339 818	96.6	99.4	99.9	63.0	56.7	99.1	4.5	5.2

MECAT	726 605 009	99.9	99.6	99.9	51.4	47.6	99.4	0.3	8.5
FLAS	1 032 185 610	99.9	99.6	99.9	73.2	67.7	99.2	0.6	8.5

(c) Long reads of A.thaliana (18 557 597 273 bp)
PBcR^a	9 166 306 215	99.3	99.8	99.9	—	—	—	320.6	38.9
CoLoRMap	9 734 458 732	99.6	99.8	99.9	57.6	51.1	93.7	24.4	102.4
FALCON^a	12 110 519 562	99.9	99.3	99.9	—	—	—	80.8	104.2
Canu	4 090 897 122	99.8	99.1	99.9	23.9	19.9	99.4	28.5	44.3
Canu+	9 119 913 928	99.7	99.0	99.9	53.1	41.7	98.3	68.4	48.4

MECAT	10 373 391 686	99.9	99.4	99.9	61.0	52.7	98.6	6.7	13.8
FLAS	15 618 210 837	99.7	99.2	99.9	91.7	79.0	97.9	16.2	13.8

(d) Long reads of human (168 421 410 338 bp)
MECAT	74 946 808 568	97.3	99.7	98.7	43.4	37.3	99.1	163.4	24.9
FLAS	93 838 595 967	97.2	99.6	98.9	54.3	46.4	98.8	370.0	24.9

Note: The long reads in tests (a)–(d) are from E.coli, S.cerevisiae, A.thaliana and human, respectively. The corrected long reads by MECAT and FLAS (below dashed line), and the other self-correction algorithms HGAP, LoRMA, FALCON, Canu and Canu+ (above dashed line), and by short read assisted correction algorithms PBcR, CoLoRMap and ECTools (in gray) using additional short reads are compared in each test. The best result for the self-correction algorithms in each column is shown in boldface. The performance measurements are listed in Section 3.2.2.

a

Some measurements are not available without the correspondence information between the uncorrected long reads and corrected ones.

Open in new tab

Results on the S.cerevisiae long reads are listed in Table 1b. The throughput of FLAS is 42.1% larger than MECAT. FLAS is 3–13× faster than the other self-correction algorithms HGAP, LoRMA, FALCON, Canu and Canu+, and the throughput of FLAS is also 15.2–165.8% larger than all of them. Results on the A.thaliana long reads are listed in Table 1c. The throughput of FLAS is 50.6% larger than MECAT. FLAS is 2–5× faster than the other self-correction algorithms FALCON, Canu and Canu+, and the throughput of FLAS is also 29.0–281.8% larger than all of them. Results on the human long reads are listed in Table 1d. The throughput of FLAS is 25.2% larger than MECAT. Note that results of ECTools, HGAP and LoRMA on the A.thaliana long reads are not shown, because they were designed for relatively small genomes and could not finish processing for the larger genome within relatively short time. These results indicate FLAS can achieve high throughput while keeping MECAT’s fast speed.

3.3.2 Results with various long-read coverage and parameter settings

Results with various coverage of the E.coli long reads are listed in Supplementary Table S1. For all the algorithms, the throughput and running time increase together with the coverage. For all the coverage, the throughput of FLAS is 15.2–22.4% larger than MECAT. In general, FLAS is 2–11× faster than the other self-correction algorithms HGAP, LoRMA, FALCON Canu and Canu+, and the throughput of FLAS is also 9.8–249.0% larger. These results indicate FLAS is scalable and efficient with various long-read coverage.

Results with various parameter settings are not shown, because difference of the results is within <5% and relatively small. The difference with various α₁ and α₂ is small, because their influence is within algorithm step 1b, while the remaining algorithm steps 2a and 2b are complementary to this step. The difference with various β₁ and β₂ is also small, because it is not a dominating situation to align one long-read region to multiple paths of the overlap graph in algorithm step 2b. However, this does not downgrade the importance to decide the correct paths in this step. As shown in Figure 2, on the E.coli long reads, a long-read region in the 12 033th long read from target genome region [3 203 249; 3 205 477] could be aligned to two paths of genome regions [1 428 679; 1 430 845] and [3 203 249; 3 205 477], and FLAS can find the correct path of genome region [3 203 249; 3 205 477] for it. In addition, a long-read region in the 20 672th long read from target genome region [3 181 373; 3 182 998] could be aligned to three paths of genome regions [2 079 308; 2 080 933], [3 181 373; 3 182 998] and [4 364 860; 4 366 488], and FLAS can also find the correct path of genome region [3 181 373; 3 182 998] for it.

Fig. 2.

Open in new tab Download slide

Illustrations on FLAS’s function in finding correct paths on the E.coli long reads. (A) A long-read region in the 12 033th long read from target genome region [3 203 249; 3 205 477] could be aligned to two paths of genome regions [1 428 679; 1 430 845] and [3 203 249; 3 205 477] (shaded), and FLAS can find the correct path of genome region [3 203 249; 3 205 477] for it. (B) A long-read region in the 20 672th long read from target genome region [3 181 373; 3 182 998] could be aligned to three paths of genome regions [2 079 308; 2 080 933], [3 181 373; 3 182 998] and [4 364 860; 4 366 488] (shaded), and FLAS can also find the correct path of genome region [3 181 373; 3 182 998] for it

3.3.3 Results of long-read assemblies

Results on the S.cerevisiae long reads are listed in Table 2a and b and 18–20 contigs are assembled from the MECAT corrected long reads, of N50 sizes 720 657–777 561 bp and max contig sizes 1 075 704–1 537 692 bp, while 16–20 contigs are assembled from the FLAS corrected long reads, of N50 sizes 912 129–927 803 bp and max contig sizes 1 526 323–1 544 348 bp. The N50 sizes with FLAS are 19.3–26.6% larger than MECAT, and the max contig sizes are 41.9% larger or comparable. Results on the A.thaliana long reads are listed in Table 2c and d. The N50 sizes with FLAS are 13.1–29.8% larger than MECAT, and the max contig sizes are comparable. Interestingly, the numbers of contigs with FLAS are a little larger than MECAT. This is probably because more alleles could be assembled from the FLAS corrected long reads.

Table 2.

Evaluation of long-read assemblies

Method	Assembly size (bp)	N Contigs	N50 size (bp)	N N50	Max contig size (bp)
(a) Contigs of S.cerevisiae by FALCON*
MECAT	11 986 444	18	777 561	6	1 537 692
FLAS	12 080 029	16	927 803	6	1 544 348
(b) Contigs of S.cerevisiae by Canu*
MECAT	11 949 806	20	720 657	7	1 075 704
FLAS	11 976 512	20	912 129	6	1 526 323
(c) Contigs of A.thaliana by FALCON*
MECAT	120 803 609	64	10 410 339	5	15 926 937
FLAS	121 875 742	81	11 775 451	5	15 950 055
(d) Contigs of A.thaliana by Canu*
MECAT	120 987 757	60	9 028 586	5	15 898 934
FLAS	122 399 256	89	11 721 501	5	15 900 235

Method	Assembly size (bp)	N Contigs	N50 size (bp)	N N50	Max contig size (bp)
(a) Contigs of S.cerevisiae by FALCON*
MECAT	11 986 444	18	777 561	6	1 537 692
FLAS	12 080 029	16	927 803	6	1 544 348
(b) Contigs of S.cerevisiae by Canu*
MECAT	11 949 806	20	720 657	7	1 075 704
FLAS	11 976 512	20	912 129	6	1 526 323
(c) Contigs of A.thaliana by FALCON*
MECAT	120 803 609	64	10 410 339	5	15 926 937
FLAS	121 875 742	81	11 775 451	5	15 950 055
(d) Contigs of A.thaliana by Canu*
MECAT	120 987 757	60	9 028 586	5	15 898 934
FLAS	122 399 256	89	11 721 501	5	15 900 235

Note: The contigs in tests (a)–(d) are of S.cerevisiae or A.thaliana by FALCON* or Canu*, assembled from the MECAT and FLAS corrected long reads. The contigs from MECAT corrected long reads and those from FLAS corrected long reads are compared in each test. The best result in each column is shown in boldface. The performance measurements are listed in Section 3.2.2.

Open in new tab

Table 2.

Evaluation of long-read assemblies

Method	Assembly size (bp)	N Contigs	N50 size (bp)	N N50	Max contig size (bp)
(a) Contigs of S.cerevisiae by FALCON*
MECAT	11 986 444	18	777 561	6	1 537 692
FLAS	12 080 029	16	927 803	6	1 544 348
(b) Contigs of S.cerevisiae by Canu*
MECAT	11 949 806	20	720 657	7	1 075 704
FLAS	11 976 512	20	912 129	6	1 526 323
(c) Contigs of A.thaliana by FALCON*
MECAT	120 803 609	64	10 410 339	5	15 926 937
FLAS	121 875 742	81	11 775 451	5	15 950 055
(d) Contigs of A.thaliana by Canu*
MECAT	120 987 757	60	9 028 586	5	15 898 934
FLAS	122 399 256	89	11 721 501	5	15 900 235

Method	Assembly size (bp)	N Contigs	N50 size (bp)	N N50	Max contig size (bp)
(a) Contigs of S.cerevisiae by FALCON*
MECAT	11 986 444	18	777 561	6	1 537 692
FLAS	12 080 029	16	927 803	6	1 544 348
(b) Contigs of S.cerevisiae by Canu*
MECAT	11 949 806	20	720 657	7	1 075 704
FLAS	11 976 512	20	912 129	6	1 526 323
(c) Contigs of A.thaliana by FALCON*
MECAT	120 803 609	64	10 410 339	5	15 926 937
FLAS	121 875 742	81	11 775 451	5	15 950 055
(d) Contigs of A.thaliana by Canu*
MECAT	120 987 757	60	9 028 586	5	15 898 934
FLAS	122 399 256	89	11 721 501	5	15 900 235

Note: The contigs in tests (a)–(d) are of S.cerevisiae or A.thaliana by FALCON* or Canu*, assembled from the MECAT and FLAS corrected long reads. The contigs from MECAT corrected long reads and those from FLAS corrected long reads are compared in each test. The best result in each column is shown in boldface. The performance measurements are listed in Section 3.2.2.

Open in new tab

Figure 3 shows MUMmer plots of the A.thaliana contigs by Canu*, assembled from the MECAT corrected long reads and FLAS corrected long reads, respectively. As shown in the plots, the latter contigs are as accurate as the former. Supplementary Figures S1, S2 and S3 show the remaining plots, and similar results can be observed in each figure. These results indicate FLAS corrected long reads can be assembled into quality contigs.

Fig. 3.

Open in new tab Download slide

MUMmer plots of the A.thaliana contigs by Canu*, assembled from the MECAT corrected long reads (A) and FLAS corrected long reads (B), respectively

4 Conclusions

This study introduces FLAS, a wrapper algorithm of MECAT, to achieve high-throughput long-read self-correction while keeping MECAT’s fast speed. FLAS finds maximal cliques in an overlap graph constructed from MECAT prealigned long reads, and refines the alignments for correction. FLAS also aligns the long reads to paths of a second overlap graph constructed from the corrected long-read regions for further correction, considering not only alignment identities but also expected amounts of aligned long reads to the paths. As a result, FLAS can achieve 22.0–50.6% larger throughput than MECAT. Compared to the self-correction algorithms other than MECAT, FLAS is 2–13× faster than all of them, and its throughput is also 9.8–281.8% larger. The FLAS corrected long reads can be assembled into contigs of 13.1–29.8% larger N50 sizes than MECAT. In the future, we will adapt FLAS to process long reads of different characteristics from other platforms, e.g. Oxford Nanopore platform.

Acknowledgements

We thank Lingxiao Lan in our group, for implementing the Error Correction Evaluation Toolkit for split long reads. We acknowledge the support of core facilities at the Institute for Integrative Genome Biology (IIGB), the University of California, Riverside.

Funding

This work was supported by grants from the National Natural Science Foundation of China [61502027 to E.B.]; and the Fundamental Research Funds for the Central Universities [2015RC045 to E.B.].

Conflict of Interest: none declared.

References

Au

K.F.

et al. (

2013

)

Characterization of the human esc transcriptome by hybrid sequencing

.

Proc. Natl. Acad. Sci. USA

,

110

,

E4821

–

E4830

.

Google Scholar

Crossref

WorldCat

Baker

K.S.

et al. (

2014

)

The extant world war 1 dysentery bacillus nctc1: a genomic analysis

.

Lancet

,

384

,

1691

–

1697

.

Bao

E.

,

Lan

L.

(

2017

)

HALC: high throughput algorithm for long read error correction

.

BMC Bioinformatics

,

18

,

204.

Berlin

K.

et al. (

2015

)

Assembling large genomes with single-molecule sequencing and locality-sensitive hashing

.

Nat. Biotechnol

.,

33

,

623

–

630

.

Chaisson

M.J.

,

Tesler

G.

(

2012

)

Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory

.

BMC Bioinformatics

,

13

,

238

.

Chaisson

M.J.

et al. (

2015

)

Resolving the complexity of the human genome using single-molecule sequencing

.

Nature

,

517

,

608

–

611

.

Chen

X.

et al. (

2014

)

The architecture of a scrambled genome reveals massive levels of genomic rearrangement during development

.

Cell

,

158

,

1187

–

1198

.

Chin

C.-S.

et al. (

2013

)

Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data

.

Nat. Methods

,

10

,

563

–

569

.

Chin

C.-S.

et al. (

2016

)

Phased diploid genome assembly with single molecule real-time sequencing

.

Nat. Methods

,

13

,

1050.

Eid

J.

et al. (

2009

)

Real-time DNA sequencing from single polymerase molecules

.

Science

,

323

,

133

–

138

.

Eppstein

D.

,

Strash

D.

(

2011

) Listing all maximal cliques in large sparse real-world graphs. In: SEA'11 Proceedings of the 10th International Symposium on Experimental Algorithms. Springer, pp.

364

–

375

.

Eppstein

D.

et al. (

2010

) Listing all maximal cliques in sparse graphs in near-optimal time. In: International Symposium on Algorithms and Computation. pp.

403

–

414

. Springer.

Gurevich

A.

et al. (

2013

)

QUAST: quality assessment tool for genome assemblies

.

Bioinformatics

,

29

,

1072

–

1075

.

Hackl

T.

et al. (

2014

)

proovread: large-scale high-accuracy PacBio correction through iterative short read consensus

.

Bioinformatics

,

30

,

3004

–

3011

.

Haghshenas

E.

et al. (

2016

)

CoLoRMap: correcting long reads by mapping short reads

.

Bioinformatics

,

32

,

i545

–

i551

.

Kamath

G.M.

et al. (

2017

)

HINGE: long-read assembly achieves optimal repeat resolution

.

Genome Res

.,

27

,

747

–

756

.

Koren

S.

et al. (

2012

)

Hybrid error correction and de novo assembly of single-molecule sequencing reads

.

Nat. Biotechnol

.,

30

,

693

–

700

.

Koren

S.

et al. (

2017

)

Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation

.

Genome Res

.,

27

,

722

–

736

.

Kurtz

S.

et al. (

2004

)

Versatile and open software for comparing large genomes

.

Genome Biol

.,

5

,

R12.

Lee

H.

et al. (

2014

)

Error correction and assembly complexity of single molecule sequencing reads

.

bioRxiv

,

006395

.

Google Scholar

OpenURL Placeholder Text

WorldCat

Lee

H.

et al. (

2016

)

Third-generation sequencing and the future of genomics

.

bioRxiv

,

048603

.

Google Scholar

OpenURL Placeholder Text

WorldCat

Li

H.

,

Durbin

R.

(

2010

)

Fast and accurate long-read alignment with burrows–wheeler transform

.

Bioinformatics

,

26

,

589

–

595

.

Miclotte

G.

et al. (

2016

)

Jabba: hybrid error correction for long sequencing reads

.

Algorithms Mol. Biol

.,

11

,

1

.

Myers

E.W.

(

2005

)

The fragment assembly string graph

.

Bioinformatics

,

21

(Suppl. 2),

ii79

–

ii85

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

Myers

G.

(

2014

) Efficient local alignment discovery amongst noisy long reads. In: International Workshop on Algorithms in Bioinformatics. pp.

52

–

67

. Springer.

Philippe

N.

et al. (

2013

)

Pandoraviruses: amoeba viruses with genomes up to 2.5 Mb reaching that of parasitic eukaryotes

.

Science

,

341

,

281

–

286

.

Rhoads

A.

,

Au

K.F.

(

2015

)

Pacbio sequencing and its applications

.

Genomics Proteomics Bioinformatics

,

13

,

278

–

289

.

Salmela

L.

,

Rivals

E.

(

2014

)

LoRDEC: accurate and efficient long read error correction

.

Bioinformatics

,

30

,

3506

–

3514

.

Salmela

L.

et al. (

2016

)

Accurate self-correction of errors in long reads using de Bruijn graphs

.

Bioinformatics

,

33

,

799

–

806

.

Google Scholar

OpenURL Placeholder Text

WorldCat

Xiao

C.-L.

et al. (

2017

)

MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads

.

Nat. Methods

,

14

,

1072

–

1074

.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Associate Editor:

Download all slides

Month:	Total Views:
March 2019	57
April 2019	25
May 2019	37
June 2019	10
July 2019	3
August 2019	11
September 2019	44
October 2019	135
November 2019	69
December 2019	39
January 2020	17
February 2020	7
March 2020	14
April 2020	22
May 2020	10
June 2020	8
July 2020	5
August 2020	11
September 2020	14
October 2020	24
November 2020	10
December 2020	16
January 2021	8
February 2021	10
March 2021	14
April 2021	42
May 2021	41
June 2021	18
July 2021	49
August 2021	37
September 2021	40
October 2021	44
November 2021	17
December 2021	24
January 2022	26
February 2022	32
March 2022	32
April 2022	48
May 2022	35
June 2022	38
July 2022	20
August 2022	35
September 2022	95
October 2022	46
November 2022	25
December 2022	65
January 2023	27
February 2023	20
March 2023	25
April 2023	32
May 2023	28
June 2023	15
July 2023	15
August 2023	22
September 2023	16
October 2023	20
November 2023	23
December 2023	35
January 2024	23
February 2024	26
March 2024	39
April 2024	34

Article Contents

FLAS: fast and high-throughput algorithm for PacBio long-read self-correction

Abstract

1 Introduction

2 Materials and methods

2.1 Overview

2.2 Maximal cliques in overlap graph

2.3 Alignment refinement and error correction

2.4 Correction of uncorrected regions with corrected regions

2.5 Implementation of the FLAS software

3 Evaluation

3.1 Experimental design

3.1.1 Test of error correction performance

3.1.2 Test with various long-read coverage and parameter settings

3.1.3 Test of long-read assemblies

3.2 Datasets and performance measurements

3.2.1 Datasets

3.2.2 Performance measurements

3.3 Results

3.3.1 Results of error correction performance

3.3.2 Results with various long-read coverage and parameter settings

3.3.3 Results of long-read assemblies

4 Conclusions

Acknowledgements

Funding

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

Article Contents

FLAS: fast and high-throughput algorithm for PacBio long-read self-correction

Abstract

1 Introduction

2 Materials and methods

2.1 Overview

2.2 Maximal cliques in overlap graph

2.3 Alignment refinement and error correction

2.4 Correction of uncorrected regions with corrected regions

2.5 Implementation of the FLAS software

3 Evaluation

3.1 Experimental design

3.1.1 Test of error correction performance

3.1.2 Test with various long-read coverage and parameter settings

3.1.3 Test of long-read assemblies

3.2 Datasets and performance measurements

3.2.1 Datasets

3.2.2 Performance measurements

3.3 Results

3.3.1 Results of error correction performance

3.3.2 Results with various long-read coverage and parameter settings

3.3.3 Results of long-read assemblies

4 Conclusions

Acknowledgements

Funding

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

This Feature Is Available To Subscribers Only