Chromosomal DNA sequences of the Pacific saury genome: versatile resources for fishery science and comparative biology

Abstract Pacific saury (Cololabis saira) is a commercially important small pelagic fish species in Asia. In this study, we conducted the first-ever whole genome sequencing of this species, with single molecule, real-time (SMRT) sequencing technology. The obtained high-fidelity (HiFi) long-read sequence data, which amount to ~30-folds of its haploid genome size that was measured with quantitative PCR (1.17 Gb), were assembled into contigs. Scaffolding with Hi-C reads yielded a whole genome assembly containing 24 chromosome-scale sequences, with a scaffold N50 length of 47.7 Mb. Screening of repetitive elements including telomeric repeats was performed to characterize possible factors that need to be resolved towards ‘telomere-to-telomere’ sequencing. The larger genome size than in medaka, a close relative in Beloniformes, is at least partly explained by larger repetitive element quantity, which is reflected in more abundant tRNAs, in the Pacific saury genome. Protein-coding regions were predicted using transcriptome data, which resulted in 22,274 components. Retrieval of Pacific saury homologs of aquaporin (AQP) genes known from other teleost fishes validated high completeness and continuity of the genome assembly. These resources are available at https://treethinkers.nig.ac.jp/saira/ and will assist various molecular-level studies in fishery science and comparative biology.


Introduction
The Pacific saury or mackerel pike (Fig. 1; Cololabis saira Brevoort, 1856) is a carnivorous, marine thermophilic shallow-water dweller and is one of the most popular dominant epipelagic nekton species in the North Pacific.2][3] Analyses based on comparative molecular biology and genomics are expected to address anatomical and behavioural questions.However, the lack of molecular sequence information is an obstacle.
For its importance in fisheries, molecular-level studies of the Pacific saury are severely lacking.As of May 2023, the number of entries of nucleotide and protein sequences in NCBI 4 are 150 and 99, respectively, most of which are fragments of mitochondrial DNA and peptide sequences encoded by them.In the Beloniformes that the Pacific saury belongs to 337 species have been described as of May 2023 according to Eschmeyer's Catalog of Fishes, 5 among which the wholegenome sequences of the nuclear DNA have been sequenced and made publicly available only for nine species including as many as seven species that belongs to medaka fish genus Oryzias.
The nuclear DNA content (or genome size) of the Pacific saury has not been reported to date, while its karyotype was reported to be 2n = 42. 6Importantly, this karyotype report lacked a detailed documentation of the karyotypic variation based on the observation of metaphase spreads of chromosomes prepared from multiple cells.Furthermore, the reported diploid number of chromosomes, namely 42, is distinct from that of most of the close relatives of this group, namely 48 or 50, 7 which requires reevaluation.
To build a basis for molecular-level research, in this study, we demonstrated a modern art of whole genome sequencing employing the high-fidelity long-read technology and obtained a chromosome-scale genome assembly by means of Hi-C scaffolding.Our product provides a fundamental resource for evidence-based analysis of fishery science and comparative biology.

Animals
The male individual of the Pacific saury used in this study (ToLID; fColSai1) was sampled in May 2022 from the exhibition tank of the public aquarium Aquamarine Fukushima, Iwaki, Japan.Animal handling and sample collections at the aquaria were conducted by veterinary staff without restraining the individuals, in accordance with the Husbandry Guidelines approved by the Ethics and Welfare Committee of the Japanese Association of Zoos and Aquariums.

Gonadal histology
Histological analysis was performed using the gonad of the individual sampled above in May 2022 for genome sequencing.Also, tissues of a male and a female collected in June 2022 were fixed in Bouin's fixative and used for histological observations as a reference to the gonadal structure of each sex.All tissues were embedded in paraffin (Leica Biosystems, Nussloch, Germany), sectioned in 5 μm, and stained with Mayer's haematoxylin (Fujifilm Wako Pure Chemical Co., Osaka, Japan) and eosin Y (Fujifilm Wako Pure Chemical Co.).

Genome sequencing and assembly
High molecular weight DNA was extracted from blood cells using a NucleoBond AXG column (Macherey-Nagel, Düren, Germany), which was followed by purification with phenolchloroform.The concentration of the extracted DNA was measured with Qubit 4 (ThermoFisher, MA, USA), and their size distribution was first analysed with TapeStation 2100 (Agilent Technologies, CA, USA) to ensure high integrity and later analysed with pulse-field gel electrophoresis on CHEF DR-II (BioRad, CA, USA) to ensure the size range between 20 kb and 100 kb.The DNA was fragmented with g-TUBE (Covaris, MA, USA) and size-selected with BluePippin according to the official protocol.A SMRT sequence library was constructed with an SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, CA, USA) and was sequenced in a single 8M SMRT cell on a PacBio Sequel IIe system (Pacific Biosciences).The sequencing output was processed to generate circular consensus sequences (CCS) to obtain a total of 33.7 Gb HiFi sequence reads (Accession ID, DRR486909).From these reads, adapter sequences were removed using the program HiFiAdapterFilt. 8The obtained HiFi sequence reads were assembled using the program hifiasm v0.16.1 9 with its default parameters.The obtained contigs were subjected to haplotig purging by using the program purge_haplotigs 10 with the options '-l 5 -m 23 -h 45'.

Hi-C data production and genome scaffolding
The obtained contigs were scaffolded using Hi-C read pairs obtained as follows.The Hi-C library was prepared using the muscle tissue of the individual fColSai1 used for genome sequencing, according to the iconHi-C protocol employing restriction enzymes DpnII and HinfI, 11 and it was sequenced on a HiSeq X sequencing platform (Illumina Inc., CA, USA).The obtained Hi-C read pairs (Accession ID, DRR507759) were processed with the program Trim Galore! v0.6.8 (https:// www.bioinformatics.babraham.ac.uk/projects/trim_galore/) specifying the options '--phred33 --stringency 2 --quality 30 --length 25 --paired' and aligned to the HiFi sequence contigs with the program Juicer v1.6, 12 and using its results, HiFi sequence contigs were scaffolded with 3d-dna 13 (version 201008) specifying the options '-m haploid -i 5000 --editorrepeat-coverage 15 -r 2' to be consistent with the chromatin contact profiles.The continuity and completeness of the resultant genome assembly, designated as fColSai1.1,as well as the predicted protein-coding gene set (see below), was assessed with the webserver gVolante v2.0.0 14 in which the pipeline BUSCO v5.1.2 is implemented, 15 consistently using the ortholog set 'actinopterygii_odb10' supplied with BUSCO.Especially, the completeness of the nucleotide sequences of the genome assembly was assessed with compleasm 16 (formerly called minibusco) which is thought to achieve higher accuracy due to the use of miniprot. 17The genome assembly was deposited in NCBI under the accession ID PRJNA1027303.

Annotation of non-coding sequences
The obtained chromosome scale and other genomic sequences were subjected to the de novo detection of repetitive elements with RepeatModeler v2.0.4 with the -LTRStruct option.The detected repeat sequences were input in RepeatMasker v4.1.4for repeat-masking in the sensitive mode (with the option '-s').Simple tandem repeat detection was further reinforced by the use of the program tantan. 18Putative transfer RNAs (tRNAs) were detected by the program tRNAscan-SE version 2.0.12 and the detected regions excluding those labelled as 'pseudogenized' were counted.Detection of the rDNA-containing regions was performed with the program barrnap version 0.9 (https://github.com/tseemann/barrnap)with the aid of the partial sequence entry spanning the Pacific saury rDNA locus (OP151180.1)and the human 45S preribosomal sequence (NR_145819.1).Detection of canonical telomeric repeats (TTAGGG)n was performed with tidk version 0.2.31 (https://github.com/tolkit/telomeric-identifier).

Genome size estimation
The genome size was quantified with the qPCR-based protocol sQuantGenome 19 (https://github.com/Squalomix/cvalue)with the same DNA sample used for whole genome sequencing.Three protein-coding genes, ACAT1, DLD, and RFC3 from the single-copy orthologous gene set CVG 20 (Core Vertebrate Genes), whose transcript sequences (Supplementary Table S1) were identified in the transcriptome sequence assembly (see below).The primers used for qPCR are included in Supplementary Table S2.Genome size in nucleotide base pairs was calculated based on the formula 21 that the number of base pairs corresponds to mass in pg × 0.978 × 10 9 .

Transcriptome sequencing and assembly
For transcriptome data acquisition, an adult individual of unknown sex was sampled in July 2022 and was dissected.Total RNA extraction was performed for the heart, muscle, gill, liver, gut, and eyeball of this adult fish as well as a whole larvae body, using TRIzol reagent (ThermoFisher).After DNaseI digestion, strand-specific RNA-seq libraries were prepared using 150 ng of each of the extracted total RNAs (Supplementary Table S3), with Illumina Stranded mRNA Prep kit (Illumina, Cat.No. 20040534) and IDT for Illumina RNA UD Indexes Set A Ligation (Illumina, Cat.No. 20040553) according to its standard protocol unless stated otherwise below.Before the total volume PCR amplification was performed, we performed a preliminary PCR using a 1.5 μl aliquot of 10 μl DNA from the previous step, with KAPA Real-Time Library Amplification Kit (Kapa Biosystems, cat.No. KK2702).This demonstrated that the amplification of the products reached Standard 1 accompanying this kit between three and four PCR cycles, which instructed us to perform the full-volume PCR with three PCR cycles, introducing the minimal amplification (Supplementary Table S3).The libraries were sequenced in 0.6 lanes to obtain PE150 reads on a HiSeq X (Illumina).Quality control of the obtained fastq files for individual libraries was performed with FASTQC v0.11.5 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

Molecular phylogenetics analysis
Protein sequences were collected from the NCBI and Ensembl databases, and their accession IDs used for the phylogenetic analysis are included in Supplementary Table S4.The deduced amino acid sequences were aligned with the MAFFT v7.505 using the L-INS-i method.The aligned sequences were trimmed with trimAl v1.4.rev15 22 to remove unreliably aligned sites using the '-gappyout' option.The maximumlikelihood tree was inferred with RAxML v8.2.12 23 using the PROTCATWAG model, and for evaluating the confidence of the nodes, the rapid bootstrap resampling with 100 replicates was performed.A molecular phylogenetic tree employing the Bayesian framework was inferred with PhyloBayes v4.1c using the CAT-WAG-Γ model.

Gene prediction
Protein-coding genes were predicted with the program pipeline Braker3 24 on the genome assembly sequences in which interspersed repeats and simple tandem repeats were softmasked with RepeatModeler/RepeatMasker 25 (see above) and tantan. 18The gene prediction incorporated all peptide sequences registered in the file 'odb11_vertebrata_fasta' provided by the OrthoDB database, 26 as well as the output of paired-end RNA-seq read mapping with hisat2. 27

Animal sampling and sex identification
An adult of the Pacific saury was sampled from the exhibition tank of Aquamarine Fukushima and was subjected to histological analysis for identifying the sex and the stage of gonadal maturity, in parallel with high-molecular-weight DNA extraction.Our histological observation of the Pacific saury individual revealed sexually mature testes, and it was judged to be a male (Fig. 2A-C).The testes showed a tubular-type structure (Fig. 2D) and had cysts containing various stages of germ cells; spermatozoa (Fig. 2E), spermatids (Fig. 2F), and spermatocytes (Fig. 2G).Abundant spermatozoa produced and stored in cysts close to the central cavity (Fig. 2E) indicate that this individual was sexually mature at the time of our sampling for genome sequencing.

Genome size estimation
The extracted genomic DNA was used for the estimation of nuclear DNA content with the method using quantitative PCR (qPCR). 19Three protein-coding genes included in Table 1 were selected from those contained in the transcript contigs (see Materials and methods) and detected as singlecopy genes.We designed specific primers for these genes and prepared standard molecules for each of them by PCR.Using the standard molecules controlled with sequencing specificity and quantity, we performed qPCR and quantified the DNA amount that yields a haploid DNA molecule.This resulted in the haploid nuclear DNA content of 1.19 pg, which corresponds to 1.17 Gb.

Genome sequencing and assembly
The extracted high molecular weight DNA was used for the SMRTbell library and subjected to circular consensus sequencing (CCS) with single molecule, real-time (SMRT) sequencing technology (see Materials and methods).This yielded 33.6 Gb high-fidelity long (HiFi) reads of ~30-folds of its haploid genome size (1.17Gb).K-mer distribution in these reads suggested a heterozygosity of 1.87 % (Supplementary Fig. S1).These sequence reads were assembled into 1,555 contigs with an N50 length of 6.3 Mb.These contigs exhibited a percentage of duplicated one-to-one orthologs of 7.7 % and a total sequence length of 1.33 Gb that remarkably exceeded the genome size that was estimated independently (see above).We imputed these features to the relatively high heterozygosity (1.87%; see above) and purged redundant haplotigs.This resulted in 672 contig sequences with a reduced redundancy of one-to-one orthologs down to 1.84 % and a total sequence length of 1.15 Gb that is close to the genome size, 1.17 Gb (Table 2).
We prepared a Hi-C library using a combination of restriction enzymes DpnII and HinfI (see Materials and methods) with the iconHi-C protocol (Kadota et al., 2020).This Hi-C library was sequenced on a short-read sequencing platform to obtain 142 million paired-end reads.Using these Hi-C reads, the contigs obtained above were scaffolded consistently with chromatin contact profiles (Fig. 3A).Subsequent manual curation by referring to the Hi-C contact map 28 resulted in a whole genome assembly containing 24 chromosome-scale sequences, with a total nucleotide length of 1.146 Gb and a scaffold N50 length of 47.7 Mb.Of those chromosome-scale sequences, the longest sequence amounts to 57.6 Mb and the shortest one to 27.1 Mb.The whole assembly including non-chromosomal sequences were subjected to an assessment of the completeness of the protein-coding landscape of the genome, which exhibited percentages of the conserved orthologs identified as complete and that including those identified as fragmented of 98.9 and 99.3 %, respectively (Table 2).Separately, we identified a 16,499 bp-long whole mitochondrial DNA sequence assembled from the obtained HiFi reads that was identical to the nucleotide sequence of the existing NCBI entry NC_003183.1 for the Pacific saury whole mitochondrial DNA.The whole genome assembly excluding this mitochondrial DNA sequence was designated fColSai1.1 (Accession ID at NCBI Genome, GCF_033807715.1) and was used in downstream analyses.The genome assembly, as well as other sequence resources obtained in this study, are available as databases for BLAST 29 searches at the original webserver https://treethinkers.nig.ac.jp/saira/blast/ built with SequenceServer v2.0.0. 30e compared the Pacific saury genome assembly produced in this study with those of typical teleost fish model species and other beloniform species (Table 2).This comparison highlighted a particularly high proportion of the sequenced genomic regions contained in the chromosome-scale sequences for our Pacific saury assembly (96.8%).It also showed comparably high coverages (>98%) of the protein-coding landscape of the genome indicated by the number of pre-selected Gaps stand for a string of no shorter than five undetermined ('N') bases.These scores stand for the percentage of reference orthologues identified in the input genome assembly and were obtained with the program compleasm (formerly, called minibusco) using the ortholog set Actinopterygii provided by the BUSCO developer (see Materials and methods).conserved one-to-one orthologs identified, among the species (Table 2).The Pacific saury assembly has fewer gaps (strings of undetermined nucleotide 'N' with no less than five bases, in this case) than most of the other chromosome-scale genome assemblies (Table 2).These metrics show a high utility of the Pacific saury genome assembly produced in this study.The contents of the genomic sequences can be further scrutinized by aligning them with genomic sequences of closely related sequences.For this purpose, we compared the Pacific saury genome assembly with that of the Japanese medaka 31 which is visualized in a dot plot (Fig. 4).This comparison resulted in the regions of cross-species similarity from a chromosome pair residing mostly in a single grid in Figure 4.This suggests high reliability of intra-chromosomal linkages of the Pacific saury genomic contigs that support our chromosome-scale scaffolding as well as high conservation of genomic structure between the two species, with far more intra-chromosomal rearrangements rather than interchromosomal ones.

Non-coding landscape of the genome
Our identification process of repetitive elements, including species-specific ('de novo') elements, detected repetitive elements in 51.3% of the whole Pacific saury genome, while those in the medaka genome occupied only 39.5%, based on the same method and criterion.When broken into different repeat classes, the Pacific saury genome showed a particularly higher proportion of DNA transposons than that of the medaka (12.9 vs. 3.16 %) (Fig. 5).
Chromosome ends consistently exhibited higher GC content and higher repeat element frequency (Fig. 3B).Among the simple repeats, canonical telomeric repeats (TTAGGG)n were mostly detected closer to, but not always on, the ends of chromosome-scale scaffold sequences (Fig. 3B), which necessitates in-depth investigation to examine whether the chromosome ends are undergoing any lineage-specific structural modification or includes any misjoins in the genome assembly.
We also performed a genome-wide scan of genomic segments containing ribosomal DNA (rDNA) whose intricate genomic structure was one of the major confounding factors in completing the human genome sequencing. 32,33This analysis was aided by the existing sequence entry OP151180.1 partially covering a Pacific saury rDNA region (6,848 bp) in the NCBI Nucleotide database.Our search detected two units of a 23 kb-long string of 28S, 5.8S, and 18S rDNA  genes on chromosome 15 (Fig. 6).We observed more regions with high similarity, but none of those with >90% sequence identity was included in chromosome-scale sequences (scaffold s1-24).The retrieval of the rDNA loci in the obtained Pacific saury genome assembly shows higher completeness of the non-coding landscape than the currently available medaka genome assembly (GCF_002234675.1) in which no intact 28S, 5.8S, and 18S rDNA gene cluster is included in its chromosome-scale sequences.Still, the exact number of repetitive units of the rDNA loci remains to be rigidly confirmed with longer sequence reads.
Our assessment of the non-coding landscape of the genome encompassed the census of transfer RNAs (tRNAs) whose abundance and distribution in vertebrate genomes are rarely studied 34,35 .Our search, based on both sequence similarity and secondary structure, detected 11,088 candidates for tRNA decoding 20 standard amino acids in the Pacific saury genome, while the medaka genome harboured only 1,894 on the same detection method and criterion.This discrepancy demands in-depth comparative analysis including more species to characterize its biological significance.

Inference of transcribed and translated genomic regions
Animal dissection was performed for a different individual of unknown sex to sample the heart, muscle, gill, liver, gut, and eyeball of an adult as well as a whole larvae body (Supplementary Table S3).The RNAs extracted from these tissues were subjected to library preparation for short-read sequencing.We performed a comprehensive detection of protein-coding regions in the obtained genome assembly with the program pipeline Braker3 using an alignment of the transcriptome short reads from adult and larvae Pacific saury tissues as evidence (398 million read pairs in total) as well as an exploitative protein sequence data set as hints (see Materials and methods).This resulted in 22,274 predicted proteincoding genes, which are contained in a ready-to-use gene model file that is released under our original website https:// treethinkers.nig.ac.jp/saira/.Of these predicted genes, only 24 were predicted outside the 24 chromosome-scale scaffolds.
We compared the protein-coding landscape of the Pacific saury genome with a close relative medaka, with reference to a remarkable difference in the genome size (1.17Gb versus 0.80 Gb 31 ) between the two species.On the ground of similar numbers of protein-coding genes (22,274 vs. 22,059), the Pacific saury had comparable mean gene length and mean intron length (Table 3).This suggests that the difference in the genome size is attributed to uncharacterized differences outside protein-coding regions.

Census of aquaporin (AQP) gene family members
We scanned the obtained genome assembly to identify the coding genomic regions of the genes encoding aquaporin (AQP) proteins that function mainly as water channels. 36The coverage of AQP genes, often intervened by long introns, is expected to allow an additional metric for evaluating the continuity of genomic sequences.Our search identified 15 Pacific saury AQP genes, and they were categorized into eleven major subgroups that presumably existed in the common jawed vertebrate ancestor (Fig. 7) by our molecular phylogeny inference (Supplementary Fig. S3).Because of the teleost-specific whole genome duplication, 38 two paralogs were identified for AQP0, -9, -10, and -11, as previously shown for some teleost fishes. 39n the Pacific saury genome, we could not detect the multiplicity  The statistics for medaka were computed for the gene model in the .gfffile registered for the genome assembly GCF_002234675.1 31 .
of the AQP1, -3, and -8 which was previously observed in the zebrafish but not in the medaka, suggesting that their absences in the medaka is not due to false negative detection.The identified Pacific saury AQP genes possess the conserved organizations of coding exons expected from their Japanese medaka orthologs (Supplementary Fig. S4).The retrieval of all the AQP genes possessed by the medaka that belong to the same taxon Beloniformes ascertains the high completeness of the Pacific saury genome assembly obtained in this study.

Figure 1 .
Figure 1.Pacific saury.(A) Adult fish of ~29.5 cm in fork length.(B) Position of the Pacific saury in the phylogeny of selected teleost fish species.The phylogenetic position is based on molecular phylogeny presented in existing literature. 1,2

Figure 2 .
Figure 2. Histological sex identification.Gonadal cross sections of the individual whose genome was sequenced in this study (A) were sampled in May 2022, and the reference male (B) and female (C) individuals sampled in June 2022, stained with hematoxylin and eosin staining (scale bar = 1 mm).(D-G) Detailed testis structure of the individual whose genome was sequenced in this study.Cysts are arranged depending on the developmental stages of the containing germ cells (D, scale bar = 200 μm).Germ cells located on the centrally, spermatozoa (E, scale bar = 20 μm); intermediate, spermatids (F, scale bar = 20 μm); outside, spermatocytes (G, scale bar = 20 μm).

Figure 4 .
Figure 4. Cross-species comparison of chromosomal sequences.Chromosome-scale sequences of the Pacific saury obtained in this study (total length: 1,111 Mb) and those of the Japanese medaka (Oryzias latipes) (total length: 734 Mb) were input as target and query in the D-Genies (https://dgenies.toulouse.inra.fr/)respectively, and the alignment between the two whole genome assemblies was performed with minimap2 (version 2.24).Diagonal lines show regions of continuous similarity.

Figure 5 .
Figure 5.Comparison of repetitive element breakdown into different repeat classes.See Materials and Methods for details.Only chromosome-scale genome sequences were subjected to repeat detection.Distributions of the divergence of detected repetitive sequences are shown in Supplementary Fig. S2.

Figure 6 .
Figure 6.Ribosomal DNA loci in the Pacific saury genome.The 34 kb-long genomic region (position 6,433,000-6,467,000 on Scaffold15) containing two units of the ribosomal DNA cluster (28S-5.8S-18S) is shown.(A) Location of individual rDNA loci.(B) Dot matrix comparing the rDNA-containing Pacific saury genomic region and the human 45S pre-rDNA sequence (NR_145819.1).(C) Dot matrix comparing the rDNA-containing Pacific saury genomic region with itself to indicate internal repeats in it.A light purple background shows ~14 kb-long rDNA-containing regions, and a grey background shows the rest of the repetitive unit.

Table 1 .
Summary of genome size estimation by qPCRGenome size in nucleotide base pairs calculated according to the previously proposed formula (see Materials and methods).
b Mean value of the C-value estimated for the three individual genes.c

Table 2 .
Genome assembly statistics in comparison with other teleost fishes

Table 3 .
Pacific saury genome annotation and its cross-species comparison