De novo genome assemblies of butterflies

Abstract Background The availability of thousands of genomes has enabled new advancements in biology. However, many genomes have not been investigated for their quality. Here we examine quality trends in a taxonomically diverse and well-known group, butterflies (Papilionoidea), and provide draft, de novo assemblies for all available butterfly genomes. Owing to massive genome sequencing investment and taxonomic curation, this is an excellent group to explore genome quality. Findings We provide de novo assemblies for all 822 available butterfly genomes and interpret their quality in terms of completeness and continuity. We identify the 50 highest quality genomes across butterflies and conclude that the ringlet, Aphantopus hyperantus, has the highest quality genome. Our post-processing of draft genome assemblies identified 118 butterfly genomes that should not be reused owing to contamination or extremely low quality. However, many draft genomes are of high utility, especially because permissibility of low-quality genomes is dependent on the objective of the study. Our assemblies will serve as a key resource for papilionid genomics, especially for researchers without computational resources. Conclusions Quality metrics and assemblies are typically presented with annotated genome accessions but rarely with de novo genomes. We recommend that studies presenting genome sequences provide the assembly and some metrics of quality because quality will significantly affect downstream results. Transparency in quality metrics is needed to improve the field of genome science and encourage data reuse.


Introduction
The explosion of available genomes across the Tree of Life has created entirely new fields of science and is changing how we investigate long-standing questions in biology. Studies of gene family evolution and gene mutation have expanded from single genes to mapping the architecture of entire genomes. Macroevolutionary studies using genomic data are now regularly being generated at impressive scales, e.g., complete class [1], continent [2], and spanning up to 500 million years [3]. As the scope of questions addressed with genomic data continues to expand, determining the effect of read length and genome completeness on results is vital. One metric that is often applied to assembled genomes is the N50 score, a weighted median statistic of contig continuity that describes the distribution of contig lengths. The N50 value indicates that half of the assembly is contained in contigs or scaffolds equal to or larger than the value. Assemblies with low N50s are more fragmented and have contigs or scaffolds with less overlap with one another. Completeness of a draft assembly can also be assessed using BUSCO scores [4]. This measure uses a taxonomically informed set of "core" protein-coding orthologs that are theoretically present in a given taxon to evaluate genomic completeness. BUSCO may detect both haplotypes sequenced from diploid tissue with adequate genome coverage. However, high heterozygosity can lead to more fragmented assemblies (low N50), potentially reducing the number of complete protein-coding genes recovered. These scores can be influenced by biological variation through natural variation in chromosome length or lineage-wide loss of core orthologs, but also by systematic error, as in poor sequencing depth [4]. Genomes may be of low quality in terms of continuity, completeness, or a com-bination of these 2 metrics. Understanding how genomes with low-quality metrics affect downstream analyses is critical.
Here, we provide draft de novo genome assemblies and quality metrics for butterflies that will be useful for studying Lepidoptera evolution, gene discovery, and genomics. To understand how genome quality varies across taxa, we examine genome assembly quality in this exemplar group of organisms that has >935 published genomes. Additionally, we explore potential uses of these data, bearing in mind their draft nature, and discuss the state of butterfly genomics in light of genome quality.
Gene family evolution and mutation holds immense potential in uncovering the mechanisms behind rapid functional adaptation and potential subsequent speciation [5,6], and significant progress is being made in this area with the inclusion of genomic data [3]. De novo genome assemblies allow for the discovery of novel genes with important ecological implications. For example, genes and gene duplications associated with plant detoxification can be identified [7]. Additionally, expansions of a particular gene copy can be indicative of functional adaptation (e.g., [8,9]). However, inaccurate assessment of gene copy number will lead to false interpretations. Denton et al. [10] document a pattern of gene misassembly and false gene duplication rates in draft genomes, with gene number either overestimated or underestimated in 40% of all gene families. The mechanism of such error is closely tied to N50, such that when genes are fragmented (low N50), multiple contigs are assembled into non-biological contigs [10]. These types of errors will present as misidentification of gene duplication and loss, as well as nonbiological mutations. Gene family evolution and mutation holds immense potential in uncovering the mechanisms behind rapid functional adaptation and potential subsequent speciation [5,6], and significant progress is being made in this area with the inclusion of genomic data [3]. Including sequences of known identity to identify regions of sequencing artefacts or incorrect annotation and implementing assembly error estimation [11] may mitigate these challenges.
Phylogenetic studies stand to gain enormous taxonomic ground into the 2020s, primarily owing to the explosion of lowcoverage genomes that are particularly well suited for phylogenetic studies. Taxonomic coverage in phylogenetic studies is increasing exponentially with the ability to sequence genomes from historical or museum specimens. Advances in both cost and quality of sequencing, as well as the ability to sequence DNA from degraded museum samples [12][13][14][15], allow researchers to now produce phylogenies including all extant, and even extinct species in a taxonomic group [16]. Stringency standards for including genomes in phylogenetic studies are not well established, and poor-quality genomes can produce erroneous assemblies of genes of interest [10], as detailed above. Furthermore, quality scores that highlight the completeness of a genome may serve an important quality control step for the inclusion of genomes in phylogenies, and we recommend that researchers prioritize this quality metric for phylogenetic inference. A more complete genome suggests that the sample possesses common and complete protein-coding genes, and thus it is more likely to include the researcher's set of orthologs. By assessing genome completeness, future systematic error due to taxa with low matrix occupancy may be avoided [17].
Here, we provide 822 draft de novo genome assemblies and quality metrics for a taxonomically diverse, well known group, butterflies, that will be useful for studies on their evolution, gene discovery, and genomics. We explore potential uses of these data, bearing in mind their draft nature, and discuss the state of butterfly genomics in light of genome quality.
We trimmed reads using TrimGalore [32] requiring a quality score of 20 and read length of 30. We assembled reads using SPAdes (SPAdes, RRID:SCR 000131) v3.13 [33] using paired reads and allowing values of K to vary based on read length. For the majority of the de novo genomes, 32 CPUs and 128 GB of memory were sufficient. Forty genomes required additional memory; we ran these genomes with 24 threads with 720 GB of memory, potentially due to deeper sequencing or greater genomic complexity.
Following assembly, we performed several post-processing steps to ensure sequence integrity. First, we identified and removed contigs composed of <200 bp using SeqTK (Seqtk, RRID: SCR 018927) [34]. We scanned for evidence of vector contamination using VecScreen (VecScreen, RRID:SCR 016577) [35] and removed affected contigs. Then, we used the NCBI contaminant screening database to identify common contaminants, such as from fungi or bacteria, and removed those contaminant sequences.
To assess assembly quality, we first used assembly-stats [36] to quantify scaffold N50 for each cleaned, contaminant-free assembly. This measure estimates the contiguity of assembly contigs and describes the contig length of half of the genome; i.e., 50% of the genome includes contigs greater than or equal to this length. We also used BUSCO (BUSCO, RRID:SCR 015008) v3.02 [4] to determine the presence of a set of 1,658 core insect singlecopy genes (version 9) that are highly conserved across insects and give an approximation of the completeness of the assembly. Herein, we evaluate only the BUSCO Complete score, which requires each of the 1,658 core ortholog genes in the assembly to include both start and stop codons. For the full BUSCO score report, see Supplemental Tables S1-S3. A custom script, filter seqs by NCBI.py (Supplementary File 1), was created to automate NCBI required edits. This script uses the text feedback file from NCBI and will be useful for researchers willing to make their assemblies available on NCBI.

Results
We assembled 873 papilionoid genomes using raw reads from the NCBI SRA database and downloaded 62 pre-assembled genomes from the NCBI Assembly database [18]. These 935 butterfly samples with genomic data represent 665 unique species because some species have multiple subspecies sequenced or have replicate genomes (Supplementary Table S1). We did not attempt to combine genomic reads from multiple conspecific individuals because this will artificially increase heterozygosity and inevitably affect assembly quality [37]. All genomes assembled for this study (Supplementary Table S1) are available for download through the TPA Database (BioProject PRJNA606954) Figure 1: Pre-assembled and de novo assembled genomes for each butterfly and subfamily shown on a topological sketch of [38]. Species-richness estimates and topology are presented for comparison only. and quality statistics calculated for each genome are listed in Supplementary Table S1.
Pre-assembled genomes from NCBI and LepBase span 6 butterfly families and 12 subfamilies; our de novo assembled genomes represent 6 families and 24 subfamilies (Fig. 1). The only family for which no public genomic data are available is the Hedylidae, a family with only 36 described neotropical species [39]. Hesperiidae has the greatest number of species with available genomic data (472), more than half of which are in subfamily Pyrginae (310), largely due to research by Grishin and colleagues [13, 20-27, 30, 31] (Fig. 1). The Nymphalidae, the most species-rich family of butterflies, have 287 genomes available, and 210 of these genomes are in the genus Junonia (Fig. 1). The Lycaenidae have comparatively few genomes available (10), given its high species richness (Fig. 1).
The metrics that we used revealed large variance in genome assembly quality. N50 and BUSCO scores were often similar, such that the highest quality genomes typically had both high N50 and high BUSCO scores, although this was not always the case (Fig. 2). These quality statistics measure 2 different aspects of quality and should be used in conjunction because length distribution may not be associated with gene content [4]. Pre-assembled genomes downloaded from NCBI and LepBase on average had high quality scores (Supplementary Table S2 Quality scores varied widely among the draft de novo genome assemblies (Fig. 2). In 51 cases, we found that assemblies comprised only short (<200 bp) fragments and contaminants. In these cases, we removed the assembly and report the N50 score as zero (Supplementary Tables S1 and S3). N50 ranged from 249 bp in Junonia evarete nigrosuffusa (SRR10765819; Nymphalidae) to 43,550 bp in Sertania guttata guttata ( Fig. 2E; SRR10158585; Riodinidae). One hundred seven de novo genomes resulted in a BUSCO score of 0% (Supplementary Tables S1 and S3), meaning that these genomes recovered none of the core insect orthologs. Seven had BUSCO scores of ≥90%, with the greatest Figure 2: Natural log-normalized N50 and BUSCO scores plotted for both pre-assembled (squares) and de novo (circles) genome assemblies. Colors denote taxonomic family designation, as in Fig. 1. Letters correspond to inset images of representative species.

Discussion
High-quality genomes are required for studies that span the biological sciences, from gene family, mutation research to macroevolutionary phylogenetics and population dynamics. Our results show that available genomes vary widely in quality and taxonomic coverage. The significant variance in N50 and BUSCO scores highlights an important message: in the scientific literature, a "genome" can range from genomic fragments to fully annotated chromosomes. Large-scale genomic studies, especially those that sequence species in an entire clade or geographic region, represent great scientific feats, but if they are based on many low-quality genomes, they may not be useful for subsequent studies. We encourage peer-reviewed journals and public databases to require authors to report genome quality via N50 and BUSCO, which can be accessioned with the assembly on NCBI as Global Statistics. Doing so provides maximum transparency, reproducibility, and a holistic view of future data reuse. In this way, users can easily evaluate whether the quality of the genome is high enough to investigate gene family diversification (prioritize N50) or phylogenetic systematics (prioritize BUSCO).
Our analyses highlight the extensive variation in genome quality. Part of this discrepancy can be alleviated with changes in language. Perhaps we should begin referring to low-quality genomes, such as J. evarete nigrosuffusa (SRR10765819; N50 = 249 bp; BUSCO = 0.1%), as "genomic data," as opposed to the potentially misleading term, "genome." Next, accessioning all assemblies would save countless hours of computation time and allow for the validation of results. In addition, assemblies would also allow results (e.g., gene family evolution, sequence identification, ortholog determination) from previous studies to be validated. Accessioning should include low-coverage draft genome assemblies, which can also be deposited in the NCBI's Assembly database. These assemblies have notably lower N50 and BUSCO scores when compared to the average assembly from NCBI and LepBase [19]. Quality metrics of our de novo assembled genomes were, in many cases, comparable to the 5 Heliconius genomes that we investigated, suggesting that even lowquality genome assemblies can and should be accessioned. Including quality scores (as Global Statistics) for each draft assembly via the NCBI Assembly database (in addition to taxon-specific genome databases, such as Lepbase [19]) would provide a transparent overview of available genomes for future studies.
Genome assembly requires considerable computational resources, and assessing genome quality simply from raw file size on GenBank can be misleading. Many studies in the biological and medical sciences rely on existing genomes and their annotations (e.g., [40]). If researchers independently assemble genomes, this can lead to duplicated effort and significant time investment. Furthermore, if raw data quality is poor, assemblies likely will not be useful. In our study, we found that ≥51 of the 873 genomes that we assembled are ultimately unusable, and another 67 should be reused only with caution (Supplementary  Table S1). These 118 samples produced assemblies that entirely comprised contamination or contigs <200 bp or were devoid of core insect genes, or a combination of these factors. However, it is possible that alternate assembly methods could produce a better assembly. Low N50 and low BUSCO assemblies are likely composed of fragmented genes, and, most likely, the contigs that are present are the result of very low sequence coverage. This low coverage is indicative of a high error rate and greater likelihood of incorrect sequence frame. As such, while we provide these extremely low-quality genomes, users should exercise caution in mining genes from these samples owing to the high probability of error. Reporting N50 and BUSCO, as well as genome assemblies, in manuscripts and databases promotes transparency and discourages needless computation. Contamination has been shown to be a pervasive pattern in genome and transcriptome sequencing projects, especially those that use multiplexed sequencing approaches [41][42][43]. In a recent study, Allio et al. [29] found that cross-contamination accounted for 0.26% of assembly contigs. While contaminants were removed from Allio et al. [29] using CroCo [44] and thus do not affect their results, it remains unknown how much these contaminant sequences will impact future studies that reuse these genomic data. The authors did not accession genome assemblies that had contaminants removed, and contaminants remain in accessioned reads. Furthermore, it is impossible to repeat these necessary decontamination steps without detailed information regarding multiplex strategy [44]. Accessioning decontaminated assemblies to NCBI is a necessary and easy solution.
Our study reveals a significant lack of standardization and reporting across genomic studies because many do not provide genome assemblies and necessary quality metrics. Our main conclusions are that: 1. We provide draft assemblies and quality metrics for all butterfly genomes available at the time of this study (available through NCBI TPA database) (Supplementary Table S1). We synthesize these data into tables of the 50 highest quality genomes, as well as exemplar genomes for each subfamily. 2. We found that the ringlet, Aphantopus hyperantus, has the highest quality papilionoid genome, and that ≥51 of 873 genomes that we assembled are ultimately unusable, and another 67 should be reused only with caution. Long and contiguous reads, indicated by high N50 values, are 1 quality metric that should be reported in all studies, especially those of gene mutation, duplication, or genomic architecture. 3. Quality metrics, such as sequence length, whether sequences are contiguous, and N50 and BUSCO scores, should be reported in all studies. Phylogenetic studies are strengthened when genomes with a high completeness score, such as BUSCO, are used. 4. Researchers should provide draft assemblies in all genome publications and databases. Accessioning quality scores will enhance transparency and avoid unnecessary use of computational resources. Accessioning assemblies further promotes the FAIR principles of interoperability and reuse by limiting contaminant sequences and allowing confirmation of results.

Data Availability
See Supplementary Tables S1-S3 for genomic read accession numbers used in this study and associated metadata. The 822 viable genome assemblies produced using SPAdes v3.13 are available in the NCBI TPA repository and can be accessed with Bio-Project PRJNA606954. The sequence assemblies, BUSCO files, scripts, and other supporting data underlying this article are also available via the GigaScience database, GigaDB [45].

Additional Files
Supplementary Table S1. Sample ID, N50, BUSCO, and sequencing metadata for de novo assembled genomes. Supplementary Table S2. Sample ID, N50, BUSCO, and sequencing metadata for pre-assembled genomes. Supplementary Table S3. Sample ID, N50, BUSCO, and sequencing metadata for de novo genomes resulting in extremely poor quality assemblies.
Supplementary File S1. Filter seqs by NCBI.py script used to automatically update assemblies with the feedback file from NCBI during the NCBI Accession process.