Assessing species coverage and assembly quality of rapidly accumulating sequenced genomes

Abstract
Background: Ambitious initiatives to coordinate genome sequencing of Earth's biodiversity mean that the accumulation of genomic data is growing rapidly. In addition to cataloguing biodiversity, these data provide the basis for understanding biological function and evolution. Accurate and complete genome assemblies offer a comprehensive and reliable foundation upon which to advance our understanding of organismal biology at genetic, species, and ecosystem levels. However, ever-changing sequencing technologies and analysis methods mean that available data are often heterogeneous in quality. To guide forthcoming genome generation efforts and promote efficient prioritization of resources, it is thus essential to define and monitor taxonomic coverage and quality of the data.
Findings: Here we present an automated analysis workflow that surveys genome assemblies from the United States NCBI, assesses their completeness using the relevant BUSCO datasets, and collates the results into an interactively browsable resource. We apply our workflow to produce a community resource of available assemblies from the phylum Arthropoda, the Arthropoda Assembly Assessment Catalogue. Using this resource, we survey current taxonomic coverage and assembly quality at the NCBI, examine how key assembly metrics relate to gene content completeness, and compare results from using different BUSCO lineage datasets.
Conclusions: These results demonstrate how the workflow can be used to build a community resource that enables large-scale assessments to survey species coverage and data quality of available genome assemblies, and to guide prioritizations for ongoing and future sampling, sequencing, and genome generation initiatives.


Reviewer #2: The publication describes a useful tool to quickly survey a range of QC metrics for genomes available in NCBI. The a3cat toolkit can be used to set up as well as update the assessment results for public or private assemblies for a user-defined taxon. Overall, the website and the workflow on gitlab are a useful resource for the genomics community to ask a number of comparative genomics questions. I enjoyed reading this manuscript and only have minor comments. I would like to bring some more use cases to the attention of the authors that can enrich the discussion. The authors have already presented nuggets from the data mining of results but here are a few thoughts to add to the value of results reported here, as that can be further improved. Response: We thank Reviewer #2 for their appreciation of the workflow and the community resources we have developed, and we respond accordingly to the suggestions below.
Given an assembly from an insect with an approximate taxonomic classification based on morphology or genetic markers, can the a3cat results be used to figure out the best reference genome or a set of closely related genomes for comparative analysis of the gene space? One idea could be to use the overlap of lineage specific BUSCO genes found in the new genome with BUSCO genes present in other assemblies to identify related genomes.
Response: This is an interesting idea that could certainly be explored with the BUSCO assessment data we have generated and shared through the A3Cat. The resource in its current form is designed to be clean and simple by offering querying and data retrieval rather than data analysis. An analysis platform would require substantial additional development; we therefore offer the data in an easily findable way so that community users can then perform tailored downstream analyses such as the overlap analyses suggested here.
The discussion covers results when the results are filtered by level (contig, scaffold, chromosome) or type (haploid, principal or alternate pseudohaplotype). It might be worthwhile to further segment the results based on input raw data (e.g. short reads, short reads + mate pair, long reads) to explore if the contiguity of the assembly and completeness and duplication of the gene space is impacted by the proportion of indels in the raw reads irrespective of the length of the reads. There are a number of other relevant variables like assembly algorithm and parameters but that can lead to very sparse data. Response: The metadata available from the NCBI describing these additional variables are somewhat inconsistently applied, heterogeneous in the labels used, and often missing or unclear, making automated analyses unfeasible. Furthermore, explorations of these factors were recently presented in a snapshot assessment of 601 insect assemblies (Hotaling et al. 2021). We therefore decided to focus on aspects that were readily automatable, in the context of providing an updateable resource. We have now added a specific point to the main text to highlight that additional partitioning can be performed but would normally require metadata curation.
The authors talk about the proportion of repeat content in larger genomes. This might be a valuable resource to add to the a3cat results as initiatives like Ag100Pest and DToL are producing high quality insect genomes >1-2Gbp with a large number of repeats that are going to be better assembled than ever before with high fidelity long reads. Adding the results of a widely used de novo repeat identification tool like RepeatModeler based on the DFAM database will provide a consistent measure of repeat content across all analyzed genomes and add to the value of this toolkit. In case some of this information is already available in NCBI, it can be pulled using the API avoiding the need for this massive compute job. Response: We certainly consider this as a useful extra attribute to add to the catalogue of assembly metrics, so we had already investigated the current feasibility of retrieving these metadata from the NCBI. For those assemblies with RepeatMasker outputs provided on the NCBI FTP sites it is possible to extract some relevant metadata on repeat content, but this is not trivial. The continued development of the NCBI API tools greatly facilitates database querying and data retrieval, so when repeat content metadata are made accessible through the API we look forward to being able to add these features to future versions of the workflow and catalogue.
This next issue is related to BUSCO but affects the results and conclusions of the a3cat tool. Is it possible that some of the BUSCO marker genes (from OrthoDB9 or 10) are based on short read assemblies with minor errors in gene models? When run on recent assemblies based on high fidelity long reads with the correctly assembled gene model, BUSCO might report the marker as missing or fragmented. I understand this is outside the scope of this paper but if this is possible, it should be mentioned as a potential pitfall. Response: We agree that this is more of a technical discussion on how BUSCO performs and thus out of scope here. We do acknowledge that it is important to mention possible limitations of BUSCO given that the a3cat is presenting BUSCO results. We have therefore added additional information on how to interpret BUSCO results to the a3cat website (About page).
A common problem with bioinformatics resources is the lack of a sustainability plan. I know this is difficult to pin down for the mid or long term in the face of unpredictable funding but I would like to encourage the authors to present a plan to manage and update the web resource if at all possible. Response: The management and updating of the current resource is relatively lightweight in terms of overheads as the whole workflow is designed to produce easily updatable results that add to the existing results. The workflow implementation and dependencies management makes it easy to run on any other compute platform where future updates (even by other groups) would not require re-running any existing assessments. We would rather not detail specific long-term sustainability plans in the manuscript itself as this implies some sort of obligation, but we are happy to elaborate on alternative solutions already considered: (i) the BUSCO developers themselves (Zdobnov Group, University of Geneva, https://www.ezlab.org/) could take over or provide support to sustain our resource; (ii) the Swiss Institute of Bioinformatics could also take over or provide support to sustain our resource as they do for several other bioinformatics tools and resources (https://www.sib.swiss/researchinfrastructure/database-software-tools/sib-resources); or (iii) the arthropod genomics community resource, the i5k Workspace @NAL (https://i5k.nal.usda.gov/) could provide an alternative hosting solution as they are already mandated to serve the needs of the community.
For future work, it might be a good idea to consider the extension of the a3cat toolkit to include other metrics beyond the current contiguity and gene space completeness measures. Mash or ANI distances are becoming computationally tractable for large data sets. I have already mentioned the repeat content issue. Long range similarity measures based on Hi-C data or nucleotide composition based on kmer analysis might be other items to ponder. Response: Here we have focused on attributes that are (i) straightforward to obtain directly from the NCBI, and (ii) produced as outputs of running BUSCO. We believe that these provide a solid baseline overview that already serves the most pressing needs of the community. We do however agree that adding features would enhance the utility of the resource, especially for more specialist users interested in more detailed metrics. In balancing sustainability and updatability with feature comprehensiveness we have opted for a combination of attributes that is useful but not overly onerous or compute intensive to generate. The workflow of course remains fully extensible, and thus we have now mentioned in the manuscript some of these added extras and possible features for future development.

Minor revisions
Since the logic and applicability of this work is so straightforward, some of the text can be shortened to reduce duplication. For e.g. on Pg 4 this paragraph can be shortened, "Using their Complete Proteome…. for selected groups of species from their field of interest." In the same paragraph, I see "(i) aid project design, particularly in the context of comparative genomics analyses; (ii) simplify comparisons of the quality of their own data with that of existing assemblies; and (iii) provide a means to survey accumulating genomics resources of interest to their ongoing research projects." Can the difference between (i) and (iii) be clearly explained? Response: We have been through the text and removed or simplified statements that are somewhat repetitive, particularly in the paragraph mentioned. The intended distinction between (i) and (iii) was that (i) refers specifically to deciding which species to include in a given comparative genomics analysis while (iii) refers more generally to simply keeping up-to-date with accumulating resources. We have edited these to improve clarity.

Typographical errors
Response: Thank you for pointing these out, they have now both been corrected.
On Pg 8, the abbreviation CoL needs an explanation. On Pg 12, can the term span be elaborated?

at multiple levels, as well as to benefit human welfare [1,2]. Investigating such questions using genomic data often requires comprehensive multi-species comparative analyses that benefit from high quality assemblies [3,4]. It is therefore essential to be able to define the current taxonomic coverage of high-quality assemblies in order to guide forthcoming sequencing efforts and promote efficient prioritisation of resources globally.
Methods to gauge assembly quality include two main families of metrics [5]. One summarises contiguity using metrics like N50 length, where half the assembly comprises sequences of length N50 or longer, or L50 count, the smallest number of sequences whose lengths sum to 50% of the assembly. Complementary approaches estimate completeness by examining gene or protein content, e.g. the DOmain-based General Measure for transcriptome and proteome quality Assessment, DOGMA [6,7], or the Benchmarking Universal Single-Copy Orthologues, BUSCO [8,9]. BUSCO has emerged as a standard and is used by UniProt [10] and the United States National Center for Biotechnology Information (NCBI) [11], as well as by genomics data quality assessment pipelines like MultiQC [12] and BlobToolKit [13]. BUSCO is based on the evolutionary expectation that single-copy orthologues found in nearly all species from a given taxon should be present and single-copy in any newly sequenced species from the same clade.
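The two contiguity metrics defined above can be illustrated with a minimal Python sketch. The function name and the example lengths are illustrative assumptions and are not taken from the workflow itself.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of sequence lengths.

    N50: length such that sequences of this length or longer make up
         at least half of the total assembly span.
    L50: smallest number of sequences whose lengths sum to at least
         half of the total assembly span.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    half_span = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_span:
            return length, count


# Hypothetical assembly of seven sequences spanning 300 units in total.
example_lengths = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = n50_l50(example_lengths)
print(n50, l50)  # 70 2
```

Here the two longest sequences (80 + 70 = 150) already reach half of the 300-unit span, so N50 is 70 and L50 is 2.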
BUSCO datasets are built for multiple taxonomic lineages by identifying near-universal groups of single-copy orthologues from OrthoDB [14,15]. For assembly evaluations, sequence searches followed by gene predictions and orthology classifications identify complete, duplicated, or fragmented BUSCOs. The proportions recovered indicate the completeness in terms of expected subsets of evolutionarily conserved genes.
Extrapolating from these, a high BUSCO completeness score suggests that the sequencing and assembly procedure has successfully reconstructed a reliable representation of the full set of genes.
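As an illustration of how these proportions are typically reported, the sketch below parses a BUSCO-style one-line score string into its components. It assumes the compact `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` notation; the function name and the example values are illustrative assumptions and do not describe a real assembly.

```python
import re

# Compact BUSCO-style score notation: complete [single-copy, duplicated],
# fragmented, missing, and the number of BUSCOs in the lineage dataset.
SCORE_PATTERN = re.compile(
    r"C:(?P<complete>[\d.]+)%"
    r"\[S:(?P<single_copy>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<total>\d+)"
)


def parse_busco_scores(line):
    """Extract percentages (floats) and dataset size (int) from a score line."""
    match = SCORE_PATTERN.search(line)
    if match is None:
        raise ValueError(f"no BUSCO score string found in: {line!r}")
    scores = {
        key: float(value)
        for key, value in match.groupdict().items()
        if key != "total"
    }
    scores["total"] = int(match.group("total"))
    return scores


# Made-up example line, for illustration only.
example = "C:95.0%[S:93.0%,D:2.0%],F:3.0%,M:2.0%,n:1013"
print(parse_busco_scores(example))
```

The complete percentage is the sum of the single-copy and duplicated percentages, and complete, fragmented, and missing together account for the whole lineage dataset.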
Using their Complete Proteome Detector algorithm, UniProt classifies proteomes as 'standard', 'close to standard', or 'outlier', and provides BUSCO proteome completeness summaries. For assemblies, the NCBI Assembly database provides summary statistics and metadata for each record. Querying these can provide snapshots of taxonomic coverage and data quality, but researchers currently lack access to comprehensive and standardised assessments of available assemblies. These would allow data producers to compare their assemblies to existing data at the most relevant taxonomic level. They would also provide researchers with comprehensive overviews of resources for their focal taxa. Such communities would benefit from being able to survey coverage and quality of available genomic resources for selected groups of species from their field of interest.
This would (i) aid project design, particularly in the context of comparative genomics analyses; (ii) simplify comparisons of the quality of their own data with that of existing assemblies; and (iii) provide a means to keep up-to-date with accumulating genomics resources relevant to their ongoing research projects.
To address these needs, we developed an automated analysis workflow that performs BUSCO assessments of assemblies for user-selected taxa from the NCBI, concurrently collating assembly metadata to build a catalogue of metrics in a taxonomically-aware framework. To demonstrate the utility of standardised evaluations for a clade, we applied our workflow to the phylum Arthropoda, for which genome data are supporting research on a wide range of topics including their roles as pests and disease vectors [16]. Since sequencing the fruit fly genome [17], sampling of arthropods has included ants and other Hymenoptera [18,19], arachnids [20], beetles [21], butterflies and other Lepidoptera [22], flies and other Diptera [23,24], hemipterans [25], and many others [26,27]. Through efforts such as the i5k 5000 arthropod genomes initiative [28] and others, the arthropod genomics community has worked to overcome challenges in genome sequencing, assembly, and annotation [29][30][31]. Despite encompassing only a tiny fraction of all arthropod diversity and showing taxonomic biases in sampling, assemblies are accumulating rapidly and are now publicly available for hundreds of species [32,33].
Our large-scale assessments allowed us to (i) survey the current taxonomic coverage and assembly quality across Arthropoda; (ii) examine how key assembly metrics relate to gene content completeness; (iii) quantify effects on assessment resolution using different BUSCO lineage datasets; (iv) compare the results of BUSCO v3 with the newer BUSCO v4; and (v) demonstrate how our workflow can be used to build a community resource.
We provide the catalogue as an open resource for the arthropod genomics community, and the standalone open-source workflow for users to build their own catalogues tailored to the needs of their research communities. Enabling user-customisable, taxonomically-aware, standardised, and updatable quality assessments of available genome assemblies will empower genomics data producers and users, as well as helping to prioritise species for genomic sequencing of Earth's biodiversity.

Results and Discussion
An automated workflow for assembly assessments We developed an automated analysis workflow to build and maintain NCBI genome assembly assessment catalogues for selected taxa. This workflow performs the following steps: 1) query the NCBI GenBank Assembly database [11] to retrieve information about available assemblies and corresponding metadata for a user-defined taxonomic group; 2) identify all relevant BUSCO lineages based on species taxonomy for each assembly; 3) run BUSCO on each assembly using each relevant lineage dataset; 4) generate a summary table that collates all BUSCO results with assembly metrics and metadata; and 5) generate an HTML / JavaScript interactive table containing all data from the summary (Figure S1). Assembly metadata are integrated into a summary file along with five metrics obtained from the results of running BUSCO on each assembly with each relevant lineage: the percentages of complete, complete single-copy, complete duplicated, fragmented, and missing BUSCOs. The workflow allows users to systematically assess all assemblies available at the NCBI for a given taxon of interest. Importantly, it is also designed to perform on-demand updates to assess assemblies added to NCBI GenBank since the last run. The final output provides all the information retrieved for each assembly in both JSON and tab-separated formats, and an HTML / JavaScript table is generated to display the data. This output is saved in a summary folder each time the workflow is run. The workflow is implemented using the Snakemake workflow management engine [34,35] and all software dependencies are managed by the Conda package manager. It is fully automated and can be configured using a YAML file to specify the query to use for the NCBI Assembly database, BUSCO parameters, and the information to display in the output tables. The code and documentation are available from https://gitlab.com/evogenlab/a3cat-workflow [36].
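The collation step can be pictured with the following Python sketch, which flattens per-assembly metadata and the five BUSCO metrics for each lineage into a tab-separated table. The record layout, column names, and accession are hypothetical assumptions chosen for illustration; they are not the workflow's actual schema.

```python
import csv
import io

# The five BUSCO metrics collated per assembly and lineage dataset.
METRICS = ["complete", "single_copy", "duplicated", "fragmented", "missing"]


def collate(assemblies, lineages):
    """Build a tab-separated summary with one row per assembly.

    assemblies: dict mapping accession -> {"metadata": {...},
                "busco": {lineage: {metric: percentage}}}
    lineages:   ordered list of lineage names to report as column groups.
    """
    header = ["accession", "species", "assembly_level"]
    header += [f"{lineage}_{metric}" for lineage in lineages for metric in METRICS]
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for accession in sorted(assemblies):
        record = assemblies[accession]
        metadata = record["metadata"]
        row = [accession, metadata.get("species", ""), metadata.get("assembly_level", "")]
        for lineage in lineages:
            # Lineages that were not relevant for this assembly are marked NA.
            scores = record["busco"].get(lineage, {})
            row += [scores.get(metric, "NA") for metric in METRICS]
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical record; the accession and values are placeholders.
records = {
    "GCA_000000001.1": {
        "metadata": {"species": "Example species", "assembly_level": "Scaffold"},
        "busco": {
            "arthropoda": {
                "complete": 95.0, "single_copy": 93.0, "duplicated": 2.0,
                "fragmented": 3.0, "missing": 2.0,
            }
        },
    }
}
print(collate(records, ["arthropoda", "insecta"]))
```

Keeping one row per assembly and one column group per lineage means that assemblies added in a later run can simply be appended as new rows without recomputing existing ones.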

A survey of arthropod genome assembly resources
Applying the assembly assessment workflow to the phylum Arthropoda on June 11, 2021 resulted in the retrieval of a total of 2083 assemblies from 1387 species, providing a snapshot of the taxonomic coverage of available genome resources for arthropods at the NCBI. Of the ~120 arthropod orders recognised by the NCBI Taxonomy database [37] or the Catalogue of Life [38], 48 are represented by at least one genome assembly, with 21 orders represented by five or more assemblies (Figure 1 Hymenoptera. An exception to this observation is Coleoptera (beetles, weevils, etc.), which has the highest number of described species to date with currently available genome assembly resources for only 0.007% of these species. Excluding Lepidoptera, which are skewed by a large number of poor-quality assemblies [39], median N50 lengths per order represented by at least five assemblies (shown in Figure 2C) range from 11.6 Kbp for Sarcoptiformes (mites, 15 assemblies for 12 species) to 96.3 Mbp for Xiphosura (horseshoe crabs, 8 assemblies for 4 species). The horseshoe crabs have large genomes of 1.7-2.2 Gbp, for which concerted efforts have been successful in producing contiguous assemblies [40][41][42][43]. The mite genomes are all much smaller, with a median assembly span (total length) of just 88.5 Mbp, where the latest assembly for the parasitic mite, Sarcoptes scabiei, provides an example of how long-read technologies are helping to improve available genomic resources [44].
Median BUSCO completeness scores per order represented by at least five assemblies for the Arthropoda lineage dataset (Figure 2D) are less variable than the N50 lengths and, excluding Lepidoptera, range from 72.1% for Sarcoptiformes to above 97% for Diplostraca (clam shrimps and waterfleas, 9 assemblies for 7 species), Blattodea (cockroaches and termites, 6 assemblies for 5 species), Diptera, and Hymenoptera.
Although within-order distributions can be highly variable, all but two of the 21 orders (Sarcoptiformes and Trombidiformes mites) are represented by at least one assembly with more than 90% complete BUSCOs. These contiguity and completeness distributions include all available assemblies, i.e. not filtered by level (contig, scaffold, chromosome) or type (haploid, principal or alternate pseudohaplotype, etc.). The completeness of contig-level assemblies is expectedly lower than that of scaffold- or chromosome-level (Figure S2B) assemblies, and although alternate pseudohaplotype assemblies can achieve high BUSCO completeness scores, they are generally lower than for principal pseudohaplotypes (Figure S2C). Additional partitioning of the datasets by sequencing technologies, assembly algorithms, etc. is feasible where the metadata labels are applied consistently, or after metadata curation as for previous assessments of insects that contrasted short- and long-read technologies [33]. These phylum-wide comparisons of the qualities of available genome assemblies highlight the unbalanced order-level species representation as well as the variable levels of contiguity and completeness within and amongst arthropod orders.
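The per-order summaries described above (medians restricted to orders represented by at least five assemblies) can be sketched in Python as follows. The function name and input data are hypothetical and serve only to show the grouping logic.

```python
from collections import defaultdict
from statistics import median


def median_by_order(records, min_assemblies=5):
    """Median of a metric per order, for orders with enough assemblies.

    records: iterable of (order_name, metric_value) pairs, one per assembly.
    """
    grouped = defaultdict(list)
    for order, value in records:
        grouped[order].append(value)
    return {
        order: median(values)
        for order, values in grouped.items()
        if len(values) >= min_assemblies
    }


# Hypothetical completeness scores (%) for two placeholder orders.
example = [
    ("Order A", 91.0), ("Order A", 95.5), ("Order A", 97.2),
    ("Order A", 88.4), ("Order A", 99.1),
    ("Order B", 72.0), ("Order B", 80.3),
]
print(median_by_order(example))  # {'Order A': 95.5}
```

Order B is dropped because it has only two assemblies, mirroring the five-assembly threshold used for the per-order comparisons.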

Arthropod assembly contiguity, size, and completeness
With 2083 assemblies exhibiting variable contiguities and sizes, the survey results provide the opportunity to examine expectations of how assembly contiguity and size relate to gene content completeness. Although long-read sequencing technologies are producing improved results [33], large genomes have often been challenging to assemble due to expanded proportions of repetitive sequences [31]. Even for smaller genomes, repeats can hinder scaffolding of contigs, reducing contiguity and possibly adding undetermined gap regions to the assembly. Less contiguous assemblies are thus expected to have more genes split across scaffolds, or partially or completely missing, resulting in lower completeness scores [45].
The Earth BioGenome Project [2] criteria for a reference quality assembly include obtaining a complete and single-copy BUSCO score above 90% and having the majority of sequences assigned to chromosomes. While 828 of the assessed arthropod assemblies achieve a complete and single-copy BUSCO score >90%, only 229 of these are also labelled as chromosome-level assemblies. Indeed, comparing assembly N50 values with their completeness scores shows that obtaining >90% complete BUSCOs can be achieved across a wide range of contiguities (Figure 3A).

The largest assemblies span more than 5 Gbp, with the maximum reported for the tick, Haemaphysalis longicornis, at 7.3 Gbp that shows 92% complete BUSCOs (Figure 3B).
The estimated genome size for this tick however is only 3.4 Gbp, and a duplicated BUSCO score of 74.4% suggests that the applied assembly methods failed to collapse the alternative haplotypes. Indeed, an alternative assembly for this tick spans just 2.6 Gbp and scores 89.5% complete and 2.1% duplicated BUSCOs. A handful of other large assemblies with high duplicated scores are annotated as being non-collapsed, but others with many duplicated BUSCOs are also likely diploid or partially diploid (Figure S3). The smallest reported genome size for an arthropod to date is that of the tomato russet mite, Aculops lycopersici (Trombidiformes), exceptionally streamlined at only 32.5 Mbp [47]. It achieves a Eukaryota completeness score of 83%, but only 67% Arthropoda complete, which could reflect the evolutionary streamlining process but may also be related to challenges during gene prediction in such a gene-dense genome where genes have also experienced large-scale intron losses. The smallest assembly with a >80% Arthropoda completeness score is that of a grasshopper, Xenocatantops brachycerus (42 Mbp, 92% complete); however, inspecting the metadata reveals this to be a transcriptome rather than a genome assembly [48]. Amongst the smallest true genome assemblies achieving >80% completeness are other Trombidiformes as well as Sarcoptiformes, e.g. the house dust mite Dermatophagoides farinae (54 Mbp, 84% complete). Although there are fewer large assemblies spanning >1 Gbp, across the full range of their sizes most achieve good completeness scores of >90%, indicating that sequencing technologies and assembly methods are able to overcome challenges often associated with large genomes.
Comparing assembly N50s and sizes with BUSCO duplicated scores (Figure S3) identifies several assemblies with high duplication levels. Some of these are labelled as 'unresolved-diploid' assemblies, which explains these high duplication levels, but this mechanism to inform users about the non-strictly-haploid status of certain assemblies is neither widely nor consistently applied. Fragmented BUSCO scores (Figure S4) are expectedly higher for most of the less contiguous assemblies, highlighting those where many genes are likely split across two or more scaffolds. The survey results therefore provide the community with a comprehensive overview of genomic dataset qualities and of how contiguity and size relate to gene content completeness across currently available arthropod genome assemblies.
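The two kinds of checks discussed in this section can be sketched in Python as follows: a test against the two reference-quality criteria mentioned above, and a simple flag for assemblies that may contain uncollapsed haplotypes. The function names, the use of the chromosome-level label as a proxy for chromosome assignment, and both thresholds in the second function are assumptions made for illustration; they are not criteria defined in this study.

```python
def meets_reference_criteria(single_copy_pct, assembly_level, min_single_copy=90.0):
    """Check the two criteria discussed above.

    Uses the NCBI 'chromosome' level label as a proxy for having the
    majority of sequences assigned to chromosomes (an assumption).
    """
    return single_copy_pct > min_single_copy and assembly_level.lower() == "chromosome"


def flag_possible_uncollapsed(duplicated_pct, span_bp=None, estimated_size_bp=None,
                              max_duplicated=10.0, max_span_ratio=1.5):
    """Return reasons to suspect uncollapsed haplotypes (empty list if none).

    Both thresholds are arbitrary illustrative defaults.
    """
    reasons = []
    if duplicated_pct > max_duplicated:
        reasons.append("high duplicated BUSCO score")
    if span_bp is not None and estimated_size_bp:
        if span_bp / estimated_size_bp > max_span_ratio:
            reasons.append("span well above estimated genome size")
    return reasons


# Values echo the tick example in the text: a 7.3 Gbp assembly with 74.4%
# duplicated BUSCOs versus a 2.6 Gbp alternative with 2.1% duplicated,
# for an estimated genome size of 3.4 Gbp.
print(flag_possible_uncollapsed(74.4, span_bp=7.3e9, estimated_size_bp=3.4e9))
print(flag_possible_uncollapsed(2.1, span_bp=2.6e9, estimated_size_bp=3.4e9))

# Hypothetical assemblies for the reference-quality check.
print(meets_reference_criteria(96.2, "Chromosome"))  # True
print(meets_reference_criteria(96.2, "Scaffold"))    # False
```

Combining the duplicated score with the ratio of span to estimated genome size gives two independent signals, which is useful because labels such as 'unresolved-diploid' are not consistently applied.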

BUSCO dataset lineage and version comparisons
The reference BUSCO lineage datasets are defined at different taxonomic levels that capture sets of near-universal single-copy orthologues from OrthoDB [49] at ancient, intermediate, and younger nodes of the tree of life [8,9]. As duplication and loss events over evolutionary time erode the numbers of identifiable BUSCOs, datasets defined for more ancient lineages are smaller than for the younger ones, e.g. n=255 for Eukaryota and n=954 for Metazoa, versus n=3285 for Diptera and n=13780 for Primates (OrthoDB
Our results provide the opportunity to compare the scores obtained using different lineage datasets for a large number of arthropod assemblies (Figure 4). Comparing percentages of complete BUSCOs identified with the Eukaryota (n=255) and the Arthropoda (n=1013) lineage datasets for a total of 1977 arthropod assemblies shows highly linearly correlated scores, especially for the highest-scoring assemblies (Figure 4A). For those scoring <80% there is a small but noticeable shift towards Arthropoda producing slightly higher scores than Eukaryota, indicating that proportionately more of the larger set of Arthropoda BUSCOs can be recovered from lower quality assemblies. Outlier points above the identity (y=x) axis suggest that the lower-resolution Eukaryota lineage dataset occasionally produces over-estimates of completeness, where proportionately more of the smaller set of ancient Eukaryota BUSCOs are recovered. Similar trends are observed when comparing the Arthropoda results to the higher resolution Insecta (n=1367) lineage dataset, with highly linearly correlated scores and occasional small over-estimates of completeness using the Arthropoda lineage dataset (Figure S5A).
Comparing Arthropoda results to those from four insect order-level lineage datasets shows high agreements for the highest-scoring assemblies (Figure 4B). For lower-scoring assemblies, results from applying the Lepidoptera and Hemiptera lineage datasets tend towards slightly higher scores than for Arthropoda. In contrast, using the Hymenoptera and Diptera lineage datasets generally produces lower completeness scores than for Arthropoda. These shifts could arise from the uneven representations of these orders in the 90-species Arthropoda lineage dataset which is dominated by 20 hymenopterans and 15 dipterans, with only 9 species each for Lepidoptera and Hemiptera. The same trends are observed when comparing results from the order-level lineage datasets to those from the Insecta dataset (Figure S5B).
In addition to updates to the codebase, BUSCO v4 was released with updated lineage datasets based on orthology data from OrthoDB v10 [49], while BUSCO v3 used data from OrthoDB v9 [50]. Comparing completeness scores using the two Arthropoda datasets shows high levels of agreement for the highest-scoring assemblies with a consistent shift towards lower scores reported by BUSCO v4 for lower-quality assemblies (Figure 4C). A similar pattern is observed when comparing results from the two Insecta datasets (Figure S5C). The Diptera comparisons on the other hand reveal some score variations, which nevertheless agree well over the full range of assembly qualities (Figure 4D), similarly to results from the Hymenoptera datasets (Figure S5D). The different versions therefore produce generally consistent and comparable estimates of completeness, with a tendency for the OrthoDB-v10-based Arthropoda and Insecta datasets to report lower scores, especially for lower-quality assemblies. For objective quantitative comparisons it is thus necessary to assess assemblies using the same BUSCO versions, parameters, and lineage datasets, as presented here for the phylum-wide assessments of available arthropod genome assemblies.
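Pairwise comparisons of this kind, whether between lineage datasets or between BUSCO versions, can be summarised with a linear correlation coefficient and the mean shift between paired scores. The Python sketch below shows one way to compute both; the function names and the paired scores are hypothetical and are not results from this study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return covariance / sqrt(var_x * var_y)


def mean_shift(xs, ys, below=None):
    """Mean of (y - x) over pairs, optionally only where x is below a cutoff."""
    pairs = [(x, y) for x, y in zip(xs, ys) if below is None or x < below]
    if not pairs:
        return None
    return sum(y - x for x, y in pairs) / len(pairs)


# Hypothetical completeness scores (%) for the same five assemblies
# assessed with two different lineage datasets.
dataset_a = [98.0, 95.0, 90.0, 70.0, 50.0]
dataset_b = [98.0, 95.5, 91.0, 73.0, 54.0]
print(round(pearson(dataset_a, dataset_b), 3))
print(mean_shift(dataset_a, dataset_b, below=80.0))  # 3.5
```

Restricting the shift to lower-scoring assemblies (here, below 80%) makes it easier to see trends that are masked by the close agreement among the highest-scoring assemblies.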

Conclusions
Results from applying the assessment workflow to the phylum Arthropoda demonstrate the utility of building resources that provide a standardised overview of the current taxonomic coverage and quality of genome assembly resources available from the NCBI.
The large-scale dataset also offers the opportunity to examine how widely used assembly

Assembly selection and assessment workflow implementation
Accession numbers for all assemblies in the user-specified taxon are retrieved by  Figure S1). For each assembly, the data package is downloaded to a temporary zip file using the datasets command-line utility (version 11.22.0 in version 1.0 of the a3cat-workflow). The nucleotide sequence and metadata are extracted from each data package with the ncbi-datasets-pylib library and stored as fasta and JSON files, respectively (Step 2 in Figure S1). For each assembly, complete taxonomic information is retrieved from the NCBI Taxonomy database [37] using the ete3 Python module [55] (version 3.1.2 in version 1.0 of the a3cat-workflow) and stored in a JSON file (Step 3 in Figure S1). Taxonomic information is used to determine all BUSCO lineage datasets relevant for each assembly (Step 4 in Figure S1). During this step, assemblies are filtered by size, scaffold N50, and a manual filter list to discard assemblies which are too short and/or fragmented to contain any BUSCOs; this is necessary because BUSCO returns an error if no BUSCOs are found. The completeness of each assembly is assessed using BUSCO in genome mode with all other settings left at their defaults (version 4.1.4 in version 1.0 of the a3cat-workflow) for each applicable lineage dataset (Step 5 in Figure S1). The results folder generated by BUSCO is saved as a compressed archive with the exception of the BLAST database (blast_db) and BLAST input sequences (<run_name>/blast_output/sequences). The full results  Figure S1). This JSON file is converted into a table with formatted headers stored in a tab-separated file where columns represent metadata and BUSCO scores and each line corresponds to an assembly (Step 7 in Figure S1).
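Steps 4 and 5 can be sketched in Python as follows: matching an assembly's taxonomic lineage against available BUSCO lineage datasets, applying a pre-filter, and building a BUSCO command in genome mode. This is a minimal sketch and not the workflow's actual code: the function names, the short list of available lineages, the filter thresholds, and the file names are all assumptions made for illustration.

```python
# Illustrative subset of lineage dataset names (assumption, not the full list).
AVAILABLE_LINEAGES = {
    "eukaryota", "metazoa", "arthropoda", "insecta",
    "diptera", "hymenoptera", "lepidoptera", "hemiptera",
}


def relevant_lineages(taxonomy_lineage, available=AVAILABLE_LINEAGES, suffix="_odb10"):
    """Return dataset names for every taxon in the lineage that has a dataset.

    taxonomy_lineage: taxon names ordered from root to species.
    """
    return [name.lower() + suffix for name in taxonomy_lineage if name.lower() in available]


def passes_prefilter(accession, span_bp, scaffold_n50_bp,
                     min_span_bp, min_n50_bp, excluded=frozenset()):
    """Discard assemblies that are too short, too fragmented, or manually excluded."""
    if accession in excluded:
        return False
    return span_bp >= min_span_bp and scaffold_n50_bp >= min_n50_bp


def busco_command(fasta_path, lineage, run_name):
    """Build a BUSCO command line in genome mode with other settings at defaults."""
    return ["busco", "-i", fasta_path, "-l", lineage, "-m", "genome", "-o", run_name]


# Hypothetical assembly: a fly with an abbreviated root-to-genus lineage.
lineage = ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda",
           "Insecta", "Diptera", "Drosophila"]
if passes_prefilter("GCA_000000001.1", span_bp=140_000_000, scaffold_n50_bp=1_000_000,
                    min_span_bp=1_000_000, min_n50_bp=1_000):
    for dataset in relevant_lineages(lineage):
        print(" ".join(busco_command("assembly.fasta", dataset, f"run_{dataset}")))
```

Each matching lineage dataset yields one BUSCO run, so a dipteran assembly in this sketch would be assessed five times, from the Eukaryota dataset down to the Diptera dataset.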
Finally, an interactive  Figure S1). The entire workflow is implemented using the Snakemake workflow management engine [34,35] and all software dependencies are managed by the Conda package manager; this implementation ensures that the workflow is portable and entirely reproducible. Parameters for each step of the workflow are specified in a YAML file and additional configuration files can be used to customize the table and HTML output.

Assessment workflow deployment and data analyses
Results presented in this study were obtained by running version 1.0 of the a3cat-workflow on 2021-06-11. Species estimates were retrieved from the NCBI Taxonomy database using ete3 (version 3.1.2) on 2025-08-21 and from the Catalogue of Life version 2021-06-10. Phylogenetic trees were automatically generated from NCBI taxonomy data with ete3. BUSCO scores for version 4.1.4 were obtained directly from the output of the a3cat-workflow, while scores for version 3.12 were obtained with a development release version of the workflow available from https://gitlab.com/evogenlab/a3cat-workflow/-/releases/paper-busco-v3 [57]. Figures were generated with ggplot2 version 3.3.5 [58] and ggtree version 3.0.1 [59] in R version 4.1.0 [60]. All data-related figures, numbers, and supplementary material were generated with a Snakemake workflow [35] available from https://gitlab.com/evogenlab/paper-a3cat [61] using Snakemake version 6.3.0.