A reference genome for Bluegill (Centrarchidae: Lepomis macrochirus)

Abstract North American sunfishes (Family Centrarchidae) are among the most popular sportfish throughout the United States and Canada. Despite the popularity of sunfishes, their ecological importance, and their extensive stocking and aquacultural history, few molecular studies have examined the evolutionary relationships and species boundaries among members of this group, many of which are known to hybridize. Here, we describe a chromosome-scale genome assembly representing Bluegill (Lepomis macrochirus), one of the most widespread centrarchid species. By combining long-read, Oxford Nanopore sequencing data with short-insert, whole-genome and HiC sequence reads, we produced an assembly (Lm_LA_1.1) having a total length of 889 Mb including 1,841 scaffolds and having a scaffold N50 of 36 Mb, L50 of 12, N90 of 29 Mb, and L90 of 22. We detected 99% (eukaryota_odb10) and 98% (actinopterygii_odb10) universal single-copy orthologs (BUSCOs), and ab initio gene prediction performed using this new assembly identified a set of 17,233 genes that were supported by external (OrthoDB v10) data. This new assembly provides an important addition to the growing set of assemblies already available for spiny-rayed fishes (Acanthomorpha), and it will serve as a resource for future studies that focus on the complex evolutionary history of centrarchids.


Introduction
North American sunfishes (Family Centrarchidae) constitute one of the most popular recreational fisheries in North America (Page and Burr 2011). Sunfishes are also ecologically important freshwater predators (Aday et al. 2009). The family contains 38 species distributed among 8 genera (Fricke et al. 2022) and includes well-known fishes such as largemouth and other black bass (Micropterus spp.), crappies (Pomoxis spp.), and sunfish (Lepomis spp.). Aside from the California endemic Sacramento Perch, Archoplites interruptus (Girard 1854), sunfishes are native east of the Rocky Mountains, and their range extends north into Canada and south into northern Mexico (Page and Burr 2011). The most widespread sunfish species is Bluegill, Lepomis macrochirus (Rafinesque 1819), which are native to the St. Lawrence/Great Lakes region and basins of the Mississippi River down to the Gulf of Mexico (Fig. 1). Bluegill are also native to drainages along the southern Atlantic Coast of North America in the east and the Rio Grande drainage in Texas and Mexico to the west (Page and Burr 2011). Bluegill have been introduced outside of their native range to the western United States and also to localities including Africa (Ndaleni et al. 2018), Oceania (Yamamoto 1992), and Asia (Kawamura et al. 2006), where they have created management problems due to competition with native species (Maezono and Miyashita 2003).
Despite the widespread popularity of sunfishes, their ecological importance, and their aquacultural history (Regier 1962), few molecular studies examining the evolutionary relationships and species limits within centrarchids have occurred during the past 20 years (Near et al. 2004(Near et al. , 2005Near and Kim 2021). As such, genomic resources for this group, including genome assemblies, are lacking for all species except for 3 species of black bass (Supplementary Table 1; Sun et al. 2021).
Although genome data for sunfishes are few, genome-enabled studies would significantly advance our understanding of the group. For example, some sunfishes of the genus Lepomis display considerable color and meristic variation across their ranges, and these variable traits have historically caused taxonomic confusion (Near and Koppelman 2009). Whole-genome resequencing data aligned to a high-quality reference assembly for 1 or more species of sunfish would help clarify our understanding of species limits in the group while also enabling studies of the genomic basis for these incredibly variable traits. Hybridization is also common among centrarchids, with 31 species pairs known to hybridize in the wild (Bolnick 2009) and reports of hybrids between species of different genera (Burr 1974). Yet, the extent of introgression between species, the effects of introgression on the delineation of species boundaries, and the role of introgression on species diversity in this group are largely unknown-a highquality reference assembly for sunfishes would enable these studies. Finally, under certain circumstances, Bluegill diverge into pelagic and benthic ecomorphs (Uchii et al. 2007), similar to Threespine Sticklebacks (Gasterosteus aculeatus; Rundle et al. 2000), and a high-quality reference assembly for this species would provide an important comparative resource for studying the evolution of these forms.
Here, we expand the genomic resources available for Centrarchidae by describing a chromosome-scale assembly (Lm_LA_1.1) we produced for a vouchered (Buckner et al. 2021), male Bluegill collected from Louisiana.

Methods
We collected muscle, gill, fin, and liver tissues from a male Bluegill captured at the Sherburne Wildlife Management Area (30.515441, −91.7164) during 2018 under Louisiana Department of Fish and Wildlife Collecting Permit SCP167 and LSU IACUC 18-065. Tissues were flash-frozen immediately in the field. After tissue collection, we prepared a specimen for the LSU Museum of Natural Science (LSUMNS) Collection of Fishes (LSUMZ 21031), and we stored tissue samples from this specimen in the LSUMNS Collection of Genetic Resources (LSUMZ 10149). We subsampled ∼25 mg of gill tissue to prepare a short-insert library for this individual, and we subsampled and shipped ∼25-30 mg of liver tissue to Dovetail Genomics to prepare and sequence longread and HiC libraries.
We extracted DNA from the 25-mg subsample of gill tissue using a Qiagen DNeasy extraction kit and quantified extracted DNA using a fluorometer (Life Technologies, Inc.). After quantification, we sheared 650 ng DNA to a modal size of 500-600 bp using a sonicator (Qsonica, Newtown, CT, USA; 12 cycles of 20 s on and 20 s off), and we input 250 ng of sheared DNA to a commercial library preparation kit (Kapa HyperPrep; F. Hoffmann-La Roche AG) following the PCR-free protocol to incorporate unique dual index adapters (Integrated DNA Technologies, Inc.). Following library preparation, we performed a 1.8× (v/v) SPRI bead cleanup (Rohland and Reich 2012) followed by a column-based cleanup (Qiagen GeneRead Size Selection Kit) and quantified the cleaned product using a fluorometer. Then, we determined the insert size distribution of the library using a Bioanalyzer (Agilent, Inc.) and quantified the library using a commercial qPCR kit (Kapa Library Quantification Kit; F. Hoffmann-La Roche AG). We sequenced the library as part of a paired-end (PE), 150 bp lane of Illumina NovaSeq (Novogene, Inc.), targeting ∼50× coverage after assuming a genome size of ∼1 Gb (Ragland and Gold 1989). After sequencing, we used jellyfish (v2.3.0; Marcais and Kingsford 2011) to count kmers (kmer size = 21), and we input the kmer histogram to GenomeScope (Vurture et al. 2017) to estimate genome size.
Dovetail staff extracted DNA from a subsample of the liver tissue shipped to their facility following the Qiagen Genomic DNA extraction protocol for tissues (QIAGEN 2015) and using a Qiagen Tip-100 Midi Column, and they prepared Oxford Nanopore 1D libraries (Rapid Sequencing Kit SQK-RAD004) from extracted DNA with slight modifications to the protocol. Modifications included using variable amounts of input DNA (3-4 µg), using smaller amounts of fragmentation mix (1-2.5 µl), and extending the ligation time to 20 min for most reactions (Supplementary Table 1). After preparation, long-read libraries were sequenced on an Oxford Nanopore MinION using an R9.4 flowcell, and basecalling was performed using MinKnow 1.15.1 (Oxford Nanopore Technologies PLC). Data were generated from all libraries to achieve an approximate depth of 33× assuming a genome size of 1 Gbp. Dovetail staff also prepared 3 HiC libraries following a protocol similar to that described in Lieberman-Aiden et al. (2009)   generated data from each HiC library using PE 150 BP sequencing on an Illumina HiSeq X targeting 150-250 million read pairs per library.
We assembled the long-read FASTQ data received from Dovetail using wtdbg2 (v2.5; Ruan and Li 2020) and flye (v2.9-b1774; Kolmogorov et al. 2019) on a 1.5 TB RAM compute node and computed contiguity and completeness metrics of the assemblies using assembly-stats (Wellcome Sanger Institute 2022a) and BUSCO (eukaryota_odb10; Manni et al. 2021). The BUSCO results suggested that flye produced a more complete assembly, so we ran 2 additional rounds of long-read polishing in flye (for a total of 3), and we used the resulting flye assembly in all subsequent steps. We performed 1 round of short-read polishing by trimming the adapters and low-quality bases from the short insert, Illumina data using trimmomatic, aligning the trimmed data to the flye contigs using BWA (v0.7.17;Li 2013) and SAMtools (v1.10; Li et al. 2009), and using Pilon (v1.23;Walker et al. 2014) to fix "-all" of the issues identified (where possible).
After short-read polishing, we trimmed the HiC libraries for adapters and low-quality bases using trimmomatic, we combined all trimmed read files, and we used the juicer workflow (v1.6;  to align the trimmed HiC data to the polished assembly, remove duplicates, and compute HiC library metrics. Then, we generated temporary scaffolds using 3D-DNA (v180922; Dudchenko et al. 2017) with error correction turned off, manually corrected the temporary scaffolds using JuiceBox (v1.11.08; Durand, Robinson, et al. 2016) where the HiC contact map suggested a misjoin, and rescaffolded the assembly using the 3D-DNA post-review assembly workflow. To improve the orientation of contigs within scaffolds, we ran the resulting assembly through HiCHiker (v1.0.0; Nakabayashi and Morishita 2020), and we used faFilter (Kent et al. 2002) to remove contigs/ scaffolds from the assembly that were shorter than 1,000 bp. We also used BlobTools (v2.6.3; Laetsch and Blaxter 2017) to compute (long read) coverage of the assembly, perform taxonomic partitioning of the scaffolds/contigs, and remove scaffolds having <5× coverage. We mapped 1 library of HiC read pairs (DTG-HiC-732) to the remaining scaffolds and contigs using the Arima Genomics Mapping Pipeline (commit b001ebc; Arima Genomics 2019), BWA (v0.7.17), and SAMtools (v1.10), and we used PretextMap (v0.1.9) and PretextView (v0.2.5; Wellcome Sanger Institute 2022b) to produce a visual representation of the contact map.
After removing low-coverage contigs, we used the Dfam TE Tools container (v1.3.1; Dfam-Consortium 2022) to run RepeatModeller (v2.0.2; Flynn et al. 2020) identification of transposable elements (including the "-LTRStruct" option), and we input the repeat models to RepeatMasker (v4.1.2-p1; Smith et al. 2022). We used the general feature format file output by RepeatMasker with BEDTools (v2.17.0; Quinlan and Hall 2010) to soft mask the assembly. After soft-masking, we renamed the scaffolds and contigs; sorted the contigs and scaffolds by name/size using SeqKit (v2.2; Shen et al. 2016); removed 1 scaffold that represented a long, improperly linearized version of the L. macrochirus mitochondrial genome that we identified using a minimap2 (v2.17-r941; Li 2018) alignment to the L. macrochirus mitochondrial RefSeq (NC_015984.2); and computed a final set of contiguity and completeness metrics using assembly-stats and BUSCO (with both eukaryota_odb10 and actinopterygii_odb10 databases). We also estimated assembly completeness and consensus quality value (QV) by counting kmers in short insert, Illumina data using meryl (v1.3) with a k-value of 20 and inputting the meryl database, along with the final version of the assembly, to Merqury (v1.3; Rhie et al. 2020). To recover a properly linearized assembly of the mitochondrial genome, we input the long read and the short insert, Illumina data to mitoVGP (v2.2; Formenti et al. 2021).
Finally, we performed a single round of ab initio gene prediction using a containerized build (Faircloth 2022) of Braker2-GenemarkEP+-Augustus (v2.1.6; Lomsadze et al. 2005;Stanke et al. 2006Stanke et al. , 2008Gotoh 2008;Iwata and Gotoh 2012;Buchfink et al. 2015;Hoff et al. 2016Hoff et al. , 2019Brůna et al. 2020Brůna et al. , 2021 and a file of vertebrate protein sequences from OrthoDB v10 (Kriventseva et al. 2019). We functionally annotated the predicted protein sequences output by Braker2 using InterProScan (v5.57-90.0;Jones et al. 2014), and we used an accessory script from the braker2 repository (selectSupportedSubsets.py), along with custom Python code, to produce a filtered version of the predicted transcript sequences that were "fully supported" by external evidence. The braker2 accessory script describes "fully supported" gene sequences as those transcripts: (1) that are complete, (2) where all introns in a transcript are supported by external (OrthoDB protein) evidence, and (3) that have start and stop codons supported by external (OrthoDB protein) evidence, when transcripts are composed of a single exon.

Results and discussion
Illumina sequencing of the short-insert library produced 195,820,475 read pairs with an average insert size of 467 bp. GenomeScope results using a kmer size of 21 estimated that the length of the haploid Bluegill genome was 0.751-0.752 Gb, suggesting that the short-insert reads approximated 78× coverage. Eight flowcells of nanopore sequencing produced a total of 4.8 million reads (Supplementary Table 1) having an average length (across all flowcells) of 7 kb and totaling 33 Mb of sequence data (∼44× coverage given the estimated genome size), and sequencing each HiC library produced 150, 156, and 241 M read pairs (total: 547 M).
The contig assembly produced by flye was more complete than that produced by wtdbg2 (Table 1), and short-read polishing of the flye contigs using Pilon corrected 1.3 million SNPs, 2.9 million small insertions, and 2.4 million small deletions totaling 7.4 million base pairs (∼1% of the total contig length). After trimming, 526 M HiC read pairs were aligned to the polished contigs using the juicer workflow, ∼382 M read pairs were unique, and the juicer software identified 290 M HiC contacts that were used to scaffold the assembly. After scaffolding and manually correcting the assembly using 3D-DNA and JuiceBox, contiguity and completeness metrics substantially improved (Table 1). Taxonomic partitioning using BlobTools did not identify any contigs that aligned to unexpected taxonomic groups, and the final rounds of filtering for short-and/or low-coverage contigs had minimal impact on contiguity and BUSCO metrics (Table 1) of the final assembly, which we refer to as Lm_LA_1.1. The final assembly contained 24 large scaffolds (scaffold1-scaffold24; >25 Mb; Fig. 2a), a number equal to the count of Bluegill chromosomes (Roberts 1964). The next largest scaffold (scaffold25) showed a substantial reduction in length (1.2 Mb), suggesting that it, and the remaining scaffolds, are unplaced components of the Bluegill chromosomes. Merqury estimated that assembly completeness was 92.5% and the consensus QV score was 31 (>99.9% accuracy). Copy number and assembly spectrum plots produced by Merqury are provided in Fig. 2, b and c.
Repetitive elements identified using our de novo repeat models comprised 37% of the Lm_LA_1.1 assembly (Supplementary Table 2), which is a value similar to that estimated for other Complete centrarchids (Sun et al. 2021). Retroelements and DNA transposons comprised approximately equal percentages (9%) of the total repeat content, and ∼15% of the total repeats were "unclassified." Gene prediction using Braker2 with vertebrate protein sequences from OrthoDB identified a total of 76,741 possible gene regions, of which 17,233 were fully supported by external data. The highly contiguous, chromosome-scale assembly we produced contributes to the growing number of genome assemblies representing the enormously diverse (Near et al. 2013) group of spiny-rayed fishes known as the acanthomorphs. Lm_LA_1.1 is the third assembly representing a centrarchid species that has been scaffolded to chromosome level (Supplementary Table 3) and the first assembly representing a member of the widespread sunfishes (Lepomis spp.). This assembly will facilitate studies of species relationships and species limits within this group, enable researchers to gain a better understanding of the degree and effects of introgression among Lepomis species, and serve as a tool to study the evolution of pelagic and benthic ecomorphs in a new organismal model.

Data availability
All sequencing data and the final assembly, Lm_LA_1.1, are available from NCBI BioProject (PRJNA830889). Short-insert, Nanopore, and HiC reads are also available from the NCBI SRA (SRP372356), and the Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession JALXJV000000000. The version described in this manuscript is version JALXJV020000000. The Supplemental Tables, a list of steps used to assemble and annotate the genome, PretextMap, Merqury results, RepeatMasker annotations, and gene predictions are available from FigShare (https://doi.org/10.6084/m9.figshare.21215777).
Supplemental material available at G3 online.