WebQUAST: online evaluation of genome assemblies

Abstract
Selecting a proper genome assembly is key for downstream analysis in genomics studies. However, the availability of many genome assembly tools and the huge variety of their running parameters make this task challenging. The existing online evaluation tools are limited to specific taxa or provide only a one-sided view of assembly quality. We present WebQUAST, a web server for multifaceted quality assessment and comparison of genome assemblies based on the state-of-the-art QUAST tool. The server is freely available at https://www.ccb.uni-saarland.de/quast/. WebQUAST can handle an unlimited number of genome assemblies and evaluate them against a user-provided or pre-loaded reference genome or in a completely reference-free fashion. We demonstrate key WebQUAST features in three common evaluation scenarios: assembly of an unknown species, a model organism, and a close variant of it.


INTRODUCTION
Despite the ongoing long-read sequencing revolution, it is still impossible to read entire chromosomes for most species in a single run (1). Researchers use so-called genome assembly software that combines the sequencing reads into longer genome fragments commonly referred to as contigs. Dozens of genome assemblers exist nowadays (2). These tools rely on different heuristics that greatly vary their output. Moreover, even different settings of the same tool may result in substantially diverging assemblies. The quality assessment and comparison of multiple genome assemblies are of utmost importance since the assembly choice greatly affects the downstream analysis (3).
The existing assembly evaluation tools comprise two major categories. The reference-based tools, such as GAGE (4), use gold-standard reference genomes to evaluate assemblies on model datasets. The reference-free methods either rely on read mapping back to assemblies to check their consistency with the input data and detect assembly errors, such as REAPR (5) and Inspector (6), or look for conserved genes to estimate the assembly completeness, such as BUSCO (7, 8) and CEGMA (9). Previously, we developed QUAST, an ensemble method that incorporated the best software from both categories, enhanced them with in-house quality metrics and plots, and became the state-of-the-art quality assessment tool for genome assemblies (10, 11). However, QUAST intrinsically inherited the limitations of the embedded tools, which are available only for a few platforms (usually Linux) and have a command-line interface, making them hardly suitable for researchers with a limited computational background.
Here, we present WebQUAST, a web server complementing QUAST with a user-friendly graphical interface and providing its functionality on any platform. In contrast to the few existing web tools for genome assembly evaluation, WebQUAST is not restricted to specific taxa, unlike gEVAL (12) and GenomeQC (13); performs versatile assembly evaluation rather than only completeness estimation, unlike gVolante (14); and supports an unlimited number of input assemblies. The WebQUAST evaluation reports can be browsed online, downloaded locally, and shared privately with colleagues. We show WebQUAST performance using a sample dataset of four E. coli assemblies.

Web server overview
Workflow. A user uploads genome assemblies in the FASTA format (gzipped files are supported), configures the evaluation parameters, such as the minimal contig length cut-off and the organism type (eukaryote or prokaryote), and optionally selects a reference genome. The user may choose it from the list of pre-loaded genomes or upload a custom FASTA file that will be stored privately and can be reused later. Once the user clicks on the Evaluate button, WebQUAST transfers the input data to the QUAST processing engine.
If a reference genome is provided, the assemblies are aligned against it using minimap2 (15). If the BUSCO checkbox is selected, the assemblies are screened for single-copy orthologues from the corresponding BUSCO database (8). If gene finding is requested, the assemblies are processed with the GlimmerHMM gene prediction software (16). QUAST combines the outputs of all employed modules to compute numeric quality metrics, create assessment plots and Icarus viewers (17), and generate a single evaluation report. WebQUAST assigns the report a unique web link and renders it for the user. The link enables browsing the results online and sharing them. The user can download the full standalone report to store it permanently. The standalone report also provides additional insights into the analysis, such as the running commands of the embedded tools or the list of identified misassemblies in the GFF format.
Software implementation. The server is built on top of the Python web framework Django. A MySQL instance is used to record users, sessions, and analysis requests. To support long-running analyses, the requests are processed and added to the asynchronous task queue Celery. A queued job is a simple script that calls the command-line QUAST tool, which allows us to keep the main codebase agnostic to the web implementation. The front-end component is based on the jQuery framework.
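As an illustration of the minimal contig length cut-off, the following minimal sketch reads a plain or gzipped FASTA file and discards short contigs. It is not the QUAST implementation: the function names and the example contigs are hypothetical, and we assume that a contig exactly as long as the cut-off is kept.

```python
import gzip
import io
from typing import Dict, IO


def open_fasta(path: str) -> IO[str]:
    """Open a FASTA file as text; files ending in .gz are decompressed."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_contig_lengths(handle: IO[str]) -> Dict[str, int]:
    """Return {contig name: length} for every record in a FASTA stream."""
    lengths: Dict[str, int] = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Fall back to a generated name if the header is empty.
            name = header[0] if header else f"unnamed_{len(lengths) + 1}"
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def apply_min_contig(lengths: Dict[str, int], min_contig: int) -> Dict[str, int]:
    """Keep only contigs that are at least `min_contig` bases long (assumed rule)."""
    return {name: n for name, n in lengths.items() if n >= min_contig}


# Example with an in-memory FASTA (hypothetical contigs).
example = io.StringIO(">contig_1 first\nACGTACGT\nACGT\n>contig_2\nACG\n")
lengths = read_contig_lengths(example)
kept = apply_min_contig(lengths, min_contig=10)
print(kept)
```

For a file on disk, `read_contig_lengths(open_fasta("assembly.fasta.gz"))` would be used instead of the in-memory stream; the file name here is a placeholder.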
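A queued job of this kind could look roughly like the sketch below, which only assembles and launches a QUAST command line. It is not the WebQUAST code: the function names and the mapping from form fields to options are assumptions, and the flag names (`-o`, `-r`, `--min-contig`, `--eukaryote`, `--gene-finding`, `--glimmer`, `--conserved-genes-finding`) follow the QUAST manual as we understand it and should be checked against the installed version. The Celery wrapper is omitted to keep the example free of third-party dependencies.

```python
import shlex
import subprocess
from typing import List, Optional


def build_quast_command(
    assemblies: List[str],
    output_dir: str,
    reference: Optional[str] = None,
    min_contig: int = 500,
    eukaryote: bool = False,
    gene_finding: bool = False,
    busco: bool = False,
) -> List[str]:
    """Translate hypothetical web-form settings into a QUAST argument list."""
    cmd = ["quast.py", "-o", output_dir, "--min-contig", str(min_contig)]
    if reference:
        cmd += ["-r", reference]          # enables the alignment-based metrics
    if eukaryote:
        cmd.append("--eukaryote")         # the default organism type is prokaryote
    if gene_finding:
        cmd += ["--gene-finding", "--glimmer"]  # assumed flags for GlimmerHMM
    if busco:
        cmd.append("--conserved-genes-finding")  # assumed flag for BUSCO
    cmd += assemblies
    return cmd


def run_quast(cmd: List[str]) -> None:
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


# Show the command that a job would execute (file names are placeholders).
cmd = build_quast_command(
    ["spades.fasta", "megahit.fasta.gz"], "report_dir",
    reference="reference.fasta", busco=True,
)
print(shlex.join(cmd))
```

Keeping the job as a thin wrapper around the command line mirrors the design choice described above: the web layer only needs to know which arguments to pass.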

Sample data preparation
To demonstrate WebQUAST performance, we generated sample assemblies of a well-studied short-read Escherichia coli K-12 MG1655 dataset (SRA accession: ERR008613). The choice of a genome assembler might be influenced by many factors, and one popular, yet often suboptimal, strategy is to choose among the most-cited methods (18). We mimicked this behavior by collecting information on short-read genome assemblers (Table 1) and selecting the five most-cited tools. We further excluded SOAPdenovo (19) since the authors discontinued it and recommended using MEGAHIT (20), which was already shortlisted.
Some of the selected assemblers do not include a read error correction module, so we cleaned the raw sequencing data beforehand to make the comparison fair. We checked the reads with FastQC and trimmed low-quality ends with Trimmomatic (37). All assemblers but ABySS were run with default parameters or based on the recommendations in the documentation wherever available. We used the GAGE-B recipe (38) for ABySS since its default assembly was of very poor quality. All tools were installed via Bioconda (39); the installation and running commands are in the Supplementary Material.

RESULTS
Here we illustrate three typical WebQUAST usage scenarios. In each case, we evaluated the same four assemblies of the E. coli K-12 MG1655 dataset but selected the reference genome differently. We assumed the reference was unknown in Case 1, exactly matched the dataset in Case 2, and was closely related to the dataset in Case 3.

Use Case 1: reference-free evaluation
When a reference genome is unavailable, WebQUAST computes 30 quality metrics and draws three assessment plots that mainly address the contiguity and completeness of the provided assemblies (Figure 1A, Supplementary Figure S1). The heatmaps help to detect the best-performing tools in each category. Figure 1A shows that there is no single winner in all metrics. Compared to the three other methods, ABySS produced the largest (4.8 Mb versus 4.6 Mb) but also the most fragmented assembly (176 contigs versus 90-95 for Velvet, SPAdes and MEGAHIT). SPAdes assembled larger contigs on average (the best values of N50, N90 and auN, the area under the Nx curve, with Velvet and MEGAHIT being close runners-up) and has the largest contig overall (285 versus 265, 248 and 236 kb for Velvet, ABySS and MEGAHIT). The MEGAHIT assembly does not contain uncalled bases ('N'), while Velvet has the most of them (94 per 100 kb). All four assemblies are equally complete in terms of fully assembled representative bacterial single-copy orthologs (98.7%

Nucleic Acids Research, 2023, Vol. 51, Web Server issue W603

Figure S1D), though we cannot exclude the presence of an organism with similar G + C content.
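The contiguity metrics named above can be computed from contig lengths alone. The sketch below uses the common definitions (Nx as the length of the shortest contig in the smallest set of longest contigs covering at least x% of the assembly, and auN as the length-weighted mean contig length); it is a minimal illustration with hypothetical function names and toy input, not the QUAST code.

```python
from typing import Iterable, Sequence


def nx(lengths: Sequence[int], x: float) -> int:
    """Nx: walk contigs from longest to shortest until x% of the total is covered."""
    threshold = sum(lengths) * x / 100.0
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return 0  # empty assembly


def aun(lengths: Sequence[int]) -> float:
    """auN (area under the Nx curve): sum of squared lengths over total length."""
    total = sum(lengths)
    return sum(n * n for n in lengths) / total if total else 0.0


def uncalled_per_100kb(sequences: Iterable[str]) -> float:
    """Number of uncalled bases ('N') per 100 kb of assembled sequence."""
    total = 0
    uncalled = 0
    for seq in sequences:
        total += len(seq)
        uncalled += seq.upper().count("N")
    return uncalled * 100_000 / total if total else 0.0


# Toy contig lengths (hypothetical, not taken from the E. coli assemblies).
toy = [80, 70, 50, 40, 30, 20, 10]
print(nx(toy, 50), nx(toy, 90), aun(toy))
```

Because auN weights every contig by its own length, it changes smoothly when contigs are joined or split, whereas N50 can jump or stay unchanged.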

Use Case 2: reference-based evaluation
A reference genome enables accurate and versatile evaluation by WebQUAST in all four quality categories: contiguity, correctness, completeness, and contamination. In this mode, the tool reports more than 60 quality metrics accompanied by eight assessment plots and two Icarus viewers (Figure 1B, Figure 2A, Supplementary Figures S2-S4). By default, WebQUAST displays only 18 key metrics and hides the rest behind the Extended report button (Figure 1B). As in Use Case 1, there is no undisputed best assembly in Figure 1B. However, we can now investigate some quality categories in more detail. The increased Duplication ratio for ABySS (1.04 versus 1.00 for the other assemblers) indicates that this method assembled many genomic regions more than once. Still, ABySS assembled the highest percentage of the genome (98.7 versus 98.0-98.4% for Velvet, SPAdes and MEGAHIT), but its lead is not as evident as it appeared when we compared the total assembly lengths. SPAdes and ABySS have the best per-base quality, with SPAdes being twice as good as the runner-up (1.0 versus 2.1 mismatches and 0.3 versus 0.6 indels per 100 kb). MEGAHIT and SPAdes made no large assembly errors, while Velvet and ABySS have four misassemblies each. However, the largest contigs in all four assemblies are error-free since their lengths exactly match the largest alignments. The Icarus viewer can be used for deep inspection of the misassembly locations (Figure 2, Supplementary Figure S4).
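To make the alignment-based metrics concrete, the sketch below derives the genome fraction, the duplication ratio and a per-100-kb rate from alignment intervals on the reference. It is a simplified illustration rather than the QUAST implementation: intervals are assumed to be 1-based and inclusive, the number of aligned assembly bases is approximated by the reference span of each alignment (indels are ignored), and all names and numbers are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on the reference, 1-based, inclusive


def covered_length(intervals: List[Interval]) -> int:
    """Number of reference positions covered by at least one alignment."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)  # overlapping or adjacent: extend the block
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def genome_fraction(intervals: List[Interval], reference_length: int) -> float:
    """Percentage of the reference covered by aligned contigs."""
    return 100.0 * covered_length(intervals) / reference_length


def duplication_ratio(intervals: List[Interval]) -> float:
    """Aligned assembly bases divided by covered reference bases."""
    aligned = sum(end - start + 1 for start, end in intervals)
    return aligned / covered_length(intervals)


def per_100kb(count: int, aligned_bases: int) -> float:
    """Normalize an event count (mismatches, indels) to 100 kb of aligned sequence."""
    return count * 100_000 / aligned_bases


# Toy alignments on a 400 bp reference; the first two overlap by 50 bp.
toy = [(1, 100), (51, 150), (201, 300)]
print(genome_fraction(toy, 400), duplication_ratio(toy), per_100kb(5, 250_000))
```

A ratio above 1.0, as in this toy case, means that some reference positions are represented by more than one aligned assembly segment.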

Use Case 3: evaluation based on a close reference
The true reference genome is rarely known in real studies, but a close reference is often available. Here we used [...] Supplementary Figures S5-S7). Naturally, the absolute values of many alignment-based metrics, such as the lengths of misassembled and unaligned contigs, deteriorated substantially because of the actual differences between the sequenced organism and the provided reference genome. However, they are still useful for determining the best assembly among the available options. Figure 2 highlights the substantially increased number of misassemblies compared to the evaluation based on the true reference genome (49 versus 8 extensive misassemblies in total). A closer look at the misassembly locations, however, suggests that almost all of them are the same in all assemblies, which likely means that they are true structural variations rather than assembly errors and can be ignored for evaluation purposes (Figure 2B and Supplementary Figure S7). Still, we cannot exclude the possibility that several assemblers made the same error in a complex genomic region, especially when comparing tools inspired by the same computational approach, such as de Bruijn graph-based assembly (41).
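The reasoning about shared misassembly locations can be expressed as a simple filter: a breakpoint reported at nearly the same reference position in every assembly is a candidate structural variation rather than an assembly error. The sketch below is one possible way to do this, not a WebQUAST feature; the function name, the 1 kb tolerance, the assembly labels and the positions are all hypothetical, and extracting breakpoints from the report is not shown.

```python
from typing import Dict, List


def shared_breakpoints(
    breakpoints: Dict[str, List[int]], tolerance: int = 1000
) -> List[int]:
    """Return breakpoints of the first assembly that every other assembly
    also reports within `tolerance` bases on the reference."""
    names = list(breakpoints)
    if not names:
        return []
    first, others = names[0], names[1:]
    shared = []
    for pos in sorted(breakpoints[first]):
        # The position counts as shared only if each other assembly has a nearby one.
        if all(
            any(abs(pos - other_pos) <= tolerance for other_pos in breakpoints[name])
            for name in others
        ):
            shared.append(pos)
    return shared


# Hypothetical breakpoint positions (reference coordinates) for three assemblies.
toy = {
    "assembly_a": [1000, 50000, 90000],
    "assembly_b": [1200, 50500],
    "assembly_c": [900, 49800, 70000],
}
print(shared_breakpoints(toy))
```

Breakpoints left out by the filter (90000 and 70000 in the toy data) are the ones worth inspecting as assembler-specific errors, with the caveat stated above that related assemblers may share a genuine error.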

CONCLUSION
Selecting the best or, more precisely, the most suitable genome assembly is crucial for downstream analysis. While many post-processing steps, such as structural and functional annotation (42) or genome mining (43), have been available online for years, the assembly validation step is still mainly done with Linux-based command-line tools. Here, we presented WebQUAST, a web server for genome assembly evaluation that greatly facilitates this task for users with any operating system and computational background and helps them to make an informed choice. Since our tool is suitable for any organism and sequencing technology, we expect it to benefit the broad genomics community. Furthermore, WebQUAST is already incorporated in several bioinformatics massive open online courses (MOOCs), so we hope it will also help to educate the future generation of researchers.

DATA AVAILABILITY
WebQUAST is freely available at https://www.ccb.uni-saarland.de/quast/. The source code for the server is at https://github.com/ablab/quast-website and for the core QUAST tool is at https://github.com/ablab/quast. The sequencing data for the E. coli K-12 MG1655 dataset are available from the National Center for Biotechnology Information (NCBI) Sequence Read Archive under accession number ERR008613. The E. coli strain K-12 reference genomes and gene annotations are available from NCBI under accession numbers NC_000913.3 and AP009048.1 for substrains MG1655 and W3110, respectively. The ABySS, MEGAHIT, SPAdes, and Velvet assemblies generated in this study and their interactive evaluation reports are available from the WebQUAST front page and in Zenodo at https://doi.org/10.5281/zenodo.7863703.