AccuVIR: an ACCUrate VIRal genome assembly tool for third-generation sequencing data

Abstract Motivation RNA viruses tend to mutate constantly. While many of the variants are neutral, some can lead to higher transmissibility or virulence. Accurate assembly of complete viral genomes enables the identification of underlying variants, which are essential for studying virus evolution and elucidating the relationship between genotypes and virus properties. Recently, third-generation sequencing platforms such as Nanopore sequencers have been used for real-time virus sequencing for Ebola, Zika, coronavirus disease 2019, etc. However, their high per-base error rate prevents the accurate reconstruction of the viral genome. Results In this work, we introduce a new tool, AccuVIR, for viral genome assembly and polishing using error-prone long reads. It can better distinguish sequencing errors from true variants based on the key observation that sequencing errors can disrupt the gene structures of viruses, which usually have a high density of coding regions. Our experimental results on both simulated and real third-generation sequencing data demonstrated its superior performance on generating more accurate viral genomes than generic assembly or polish tools. Availability and implementation The source code and the documentation of AccuVIR are available at https://github.com/rainyrubyzhou/AccuVIR. Supplementary information Supplementary data are available at Bioinformatics online.

1 Edit distance calculation in the experiment results Given a sequence S and a reference sequence R, we calculate the minimum number of edits (including deletion, insertion, and substitution) needed to transform S into R. This number is used as the edit distance in our evaluation metric.
However, in the benchmark experiment, many contigs Canu generated are two or three times longer than the target genome. To evaluate its results in a fair way, we want to evaluate only the region that contains the viral genome. We align the contigs to the backbone sequence that is used to build the graph and keep the longest aligned region for edit distance calculation.

Algorithms
Algorithm 1 Selection of homopolymer region Input: Alignment Graph G(V, E), V is topologically sorted Output: Homopolymer regions Homo_loc Main Algorithm: homo_begin ← succeeding node of first node that is not base B in window append (homo_begin,homo_end) to Homo_loc a Prec(v i ) is the set of preceding nodes for v i .
Algorithm 2 Branching on a given path Candidate nodes set CandiN odes[c 1 , c 2 , ..., c m ] Output: An alternative path based on P Main Algorithm: a A patch could be in the region that has been replaced by prior patches, will then ignore it. b v pend is the ending node in a patch. c Succ(v i ) is the set of succeeding nodes for v i . d v pbeg is the beginning node in a patch. Tables   Table S1: Results of different tools on simulated HIV-1 Nanopore datasets. Canu's output is used as input to all polish tools. The best result for each column is shown in bold. "Largest align/Total align": the tool generating the largest alignment closest to the genome length is the best. "Unalign": the tool with the shortest unaligned length is the best. "Mismatch", "Indels", and "Indel Len": the tool with the fewest errors is the best.   Mismatch*, Indels*, Indel Len*: they are presented by the number of errors in the coding region / the total number of errors. SPAdes* and rnaSPAdes*: the evaluation tool QUAST generate many short alignments that might be caused by lots of mismatch and indel errors for results of SPAdes and rnaSPAdes. To evaluate them in a fair way, we use BLAST to generate longer local alignments and present longest alignment, mismatches, and gapopens (equal to indel numbers). PBDAG-Con* uses the graph constructed by AccuVIR in these experiments.  Mismatch*, Indels*, Indel Len*: they are presented by the number of errors in the coding region / the total number of errors. SPAdes* and rnaSPAdes*: the evaluation tool QUAST generate many short alignments that might be caused by lots of mismatch and indel errors for results of SPAdes and rnaSPAdes. To evaluate them in a fair way, we use BLAST to generate longer local alignments and present longest alignment, mismatches, and gapopens (equal to indel numbers). PBDAG-Con* uses the graph constructed by AccuVIR in these experiments.   Mismatch*, Indels*, Indel Len*: they are presented by the number of errors in the coding region / the total number of errors. SPAdes* and rnaSPAdes*: the evaluation tool QUAST generate many short alignments that might be caused by lots of mismatch and indel errors for results of SPAdes and rnaSPAdes. To evaluate them in a fair way, we use BLAST to generate longer local alignments and present longest alignment, mismatches, and gapopens (equal to indel numbers). PBDAG-Con* uses the graph constructed by AccuVIR in these experiments. In this experiment, we test AccuVIR and PBDAG-Con on real Nanopore sequencing data of Ebola virus. The data was sequenced in Guinea and submitted to NCBI in 2015 (ERR1248093). Because the variant is unknown for this dataset, we only compare the length of the final output and genome coverage with respect to a reference Ebola genome from NCBI (NC_002549.1). Available assemblers cannot generate a complete contig on this dataset. Canu can only generate 12 segmented short contigs (See Fig. S1). Thus, we used a sequenced Ebola viral strain MK672825.1 as the backbone for AccuVIR. Table S3 presents the results of PBDAG-Con and AccuVIR, both of which have stable outputs according to our experiments on simulated data. As the comparison shows, AccuVIR outputs more complete contigs than PBDAG-Con does. Without a ground truth genome, we cannot provide meaningful evaluations for the polish tools and thus did not include them. Figure S1: Assembly results of Canu and AccuVIR on real Ebola Nanopore data.

Paths' quality with different τ
Our default value for the cutoff τ is determined empirically. Fig. S2 shows the change of the sampled paths' quality using different τ . Besides our default graph named ec_ass graph, which is constructed by aligning all error-corrected reads to Canu's assembled contig, we also tested τ using other graphs. In particular, we investigate how the method works if we use raw reads without error correction. align n raw reads means that the we align n raw reads in addition the error corrected reads to the backbone. We can see that when we add raw reads to the graph, the path quality will improve, most likely because the coverage (i.e. edge weight) becomes more accurate. However, the main caveat is that the graph size also increases significantly. And the edit distance is more volatile when we add raw reads. Thus, in our final implementation, we use error-corrected reads. Figure S2: The change of sampled paths' edit distances with different τ on HIV-1 data. x-axis: threshold = 1 − τ ;

Combination of Diverse Beam Search and MRR
Although Diverse Beam Search (DBS) generates a set of high-quality paths, a higher path score does not always lead to the fewest errors or the smallest edit distance with the target genome. As shown in Fig. S3 (a), there are many cases where paths with higher DBS scores have fewer errors despite the strong correlation. Combining DBS and the gene prediction score of these similar paths helps pinpoint the best path ( Fig. S3 (b)).

BLASTX results on real SARS-CoV-2 Nanopore data
In Fig. S4, we align Medaka's and AccuVIR's output sequences from real dataset 3 against ORF1b of SARS-CoV-2 using BLASTX. There are nine short and fragmented local alignments in Medaka's output. In contrast, AccuVIR's result has three longer alignments on ORF1b, all with higher identities. Figure S4: Alignment of Medaka's and AccuVIR's output sequences against ORF1b of SARS-CoV-2 (on real Nanopore dataset 3).