Abstract

Pacific saury (Cololabis saira) is a commercially important small pelagic fish species in Asia. In this study, we conducted the first-ever whole genome sequencing of this species, with single molecule, real-time (SMRT) sequencing technology. The obtained high-fidelity (HiFi) long-read sequence data, which amount to ~30-folds of its haploid genome size that was measured with quantitative PCR (1.17 Gb), were assembled into contigs. Scaffolding with Hi-C reads yielded a whole genome assembly containing 24 chromosome-scale sequences, with a scaffold N50 length of 47.7 Mb. Screening of repetitive elements including telomeric repeats was performed to characterize possible factors that need to be resolved towards ‘telomere-to-telomere’ sequencing. The larger genome size than in medaka, a close relative in Beloniformes, is at least partly explained by larger repetitive element quantity, which is reflected in more abundant tRNAs, in the Pacific saury genome. Protein-coding regions were predicted using transcriptome data, which resulted in 22,274 components. Retrieval of Pacific saury homologs of aquaporin (AQP) genes known from other teleost fishes validated high completeness and continuity of the genome assembly. These resources are available at https://treethinkers.nig.ac.jp/saira/ and will assist various molecular-level studies in fishery science and comparative biology.

1. Introduction

The Pacific saury or mackerel pike (Fig. 1; Cololabis saira Brevoort, 1856) is a carnivorous, marine thermophilic shallow-water dweller and is one of the most popular dominant epipelagic nekton species in the North Pacific. Inside the taxon Beloniformes, the lineages of the Pacific saury and Japanese medaka (Oryzias latipes) diverged from each other for ~74 million years.1–3 Analyses based on comparative molecular biology and genomics are expected to address anatomical and behavioural questions. However, the lack of molecular sequence information is an obstacle.

Pacific saury. (A) Adult fish of ~29.5 cm in fork length. (B) Position of the Pacific saury in the phylogeny of selected teleost fish species. The phylogenetic position is based on molecular phylogeny presented in existing literature.1,2
Figure 1.

Pacific saury. (A) Adult fish of ~29.5 cm in fork length. (B) Position of the Pacific saury in the phylogeny of selected teleost fish species. The phylogenetic position is based on molecular phylogeny presented in existing literature.1,2

For its importance in fisheries, molecular-level studies of the Pacific saury are severely lacking. As of May 2023, the number of entries of nucleotide and protein sequences in NCBI4 are 150 and 99, respectively, most of which are fragments of mitochondrial DNA and peptide sequences encoded by them. In the Beloniformes that the Pacific saury belongs to 337 species have been described as of May 2023 according to Eschmeyer’s Catalog of Fishes,5 among which the whole-genome sequences of the nuclear DNA have been sequenced and made publicly available only for nine species including as many as seven species that belongs to medaka fish genus Oryzias.

The nuclear DNA content (or genome size) of the Pacific saury has not been reported to date, while its karyotype was reported to be 2n = 42.6 Importantly, this karyotype report lacked a detailed documentation of the karyotypic variation based on the observation of metaphase spreads of chromosomes prepared from multiple cells. Furthermore, the reported diploid number of chromosomes, namely 42, is distinct from that of most of the close relatives of this group, namely 48 or 50,7 which requires reevaluation.

To build a basis for molecular-level research, in this study, we demonstrated a modern art of whole genome sequencing employing the high-fidelity long-read technology and obtained a chromosome-scale genome assembly by means of Hi-C scaffolding. Our product provides a fundamental resource for evidence-based analysis of fishery science and comparative biology.

2. Materials and methods

2.1. Animals

The male individual of the Pacific saury used in this study (ToLID; fColSai1) was sampled in May 2022 from the exhibition tank of the public aquarium Aquamarine Fukushima, Iwaki, Japan. Animal handling and sample collections at the aquaria were conducted by veterinary staff without restraining the individuals, in accordance with the Husbandry Guidelines approved by the Ethics and Welfare Committee of the Japanese Association of Zoos and Aquariums.

2.2. Gonadal histology

Histological analysis was performed using the gonad of the individual sampled above in May 2022 for genome sequencing. Also, tissues of a male and a female collected in June 2022 were fixed in Bouin’s fixative and used for histological observations as a reference to the gonadal structure of each sex. All tissues were embedded in paraffin (Leica Biosystems, Nussloch, Germany), sectioned in 5 μm, and stained with Mayer’s haematoxylin (Fujifilm Wako Pure Chemical Co., Osaka, Japan) and eosin Y (Fujifilm Wako Pure Chemical Co.).

2.3. Genome sequencing and assembly

High molecular weight DNA was extracted from blood cells using a NucleoBond AXG column (Macherey-Nagel, Düren, Germany), which was followed by purification with phenol–chloroform. The concentration of the extracted DNA was measured with Qubit 4 (ThermoFisher, MA, USA), and their size distribution was first analysed with TapeStation 2100 (Agilent Technologies, CA, USA) to ensure high integrity and later analysed with pulse-field gel electrophoresis on CHEF DR-II (BioRad, CA, USA) to ensure the size range between 20 kb and 100 kb. The DNA was fragmented with g-TUBE (Covaris, MA, USA) and size-selected with BluePippin according to the official protocol. A SMRT sequence library was constructed with an SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, CA, USA) and was sequenced in a single 8M SMRT cell on a PacBio Sequel IIe system (Pacific Biosciences). The sequencing output was processed to generate circular consensus sequences (CCS) to obtain a total of 33.7 Gb HiFi sequence reads (Accession ID, DRR486909). From these reads, adapter sequences were removed using the program HiFiAdapterFilt.8 The obtained HiFi sequence reads were assembled using the program hifiasm v0.16.19 with its default parameters. The obtained contigs were subjected to haplotig purging by using the program purge_haplotigs10 with the options ‘-l 5 -m 23 -h 45’.

2.4. Hi-C data production and genome scaffolding

The obtained contigs were scaffolded using Hi-C read pairs obtained as follows. The Hi-C library was prepared using the muscle tissue of the individual fColSai1 used for genome sequencing, according to the iconHi-C protocol employing restriction enzymes DpnII and HinfI,11 and it was sequenced on a HiSeq X sequencing platform (Illumina Inc., CA, USA). The obtained Hi-C read pairs (Accession ID, DRR507759) were processed with the program Trim Galore! v0.6.8 (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) specifying the options ‘--phred33 --stringency 2 --quality 30 --length 25 --paired’ and aligned to the HiFi sequence contigs with the program Juicer v1.6,12 and using its results, HiFi sequence contigs were scaffolded with 3d-dna13 (version 201008) specifying the options ‘-m haploid -i 5000 --editor-repeat-coverage 15 -r 2’ to be consistent with the chromatin contact profiles. The continuity and completeness of the resultant genome assembly, designated as fColSai1.1, as well as the predicted protein-coding gene set (see below), was assessed with the webserver gVolante v2.0.014 in which the pipeline BUSCO v5.1.2 is implemented,15 consistently using the ortholog set ‘actinopterygii_odb10’ supplied with BUSCO. Especially, the completeness of the nucleotide sequences of the genome assembly was assessed with compleasm16 (formerly called minibusco) which is thought to achieve higher accuracy due to the use of miniprot.17 The genome assembly was deposited in NCBI under the accession ID PRJNA1027303.

2.5. Annotation of non-coding sequences

The obtained chromosome scale and other genomic sequences were subjected to the de novo detection of repetitive elements with RepeatModeler v2.0.4 with the -LTRStruct option. The detected repeat sequences were input in RepeatMasker v4.1.4 for repeat-masking in the sensitive mode (with the option ‘-s’). Simple tandem repeat detection was further reinforced by the use of the program tantan.18 Putative transfer RNAs (tRNAs) were detected by the program tRNAscan-SE version 2.0.12 and the detected regions excluding those labelled as ‘pseudogenized’ were counted. Detection of the rDNA-containing regions was performed with the program barrnap version 0.9 (https://github.com/tseemann/barrnap) with the aid of the partial sequence entry spanning the Pacific saury rDNA locus (OP151180.1) and the human 45S pre-ribosomal sequence (NR_145819.1). Detection of canonical telomeric repeats (TTAGGG)n was performed with tidk version 0.2.31 (https://github.com/tolkit/telomeric-identifier).

2.6. Genome size estimation

The genome size was quantified with the qPCR-based protocol sQuantGenome19 (https://github.com/Squalomix/c-value) with the same DNA sample used for whole genome sequencing. Three protein-coding genes, ACAT1, DLD, and RFC3 from the single-copy orthologous gene set CVG20 (Core Vertebrate Genes), whose transcript sequences (Supplementary Table S1) were identified in the transcriptome sequence assembly (see below). The primers used for qPCR are included in Supplementary Table S2. Genome size in nucleotide base pairs was calculated based on the formula21 that the number of base pairs corresponds to mass in pg × 0.978 × 109.

2.7. Transcriptome sequencing and assembly

For transcriptome data acquisition, an adult individual of unknown sex was sampled in July 2022 and was dissected. Total RNA extraction was performed for the heart, muscle, gill, liver, gut, and eyeball of this adult fish as well as a whole larvae body, using TRIzol reagent (ThermoFisher). After DNaseI digestion, strand-specific RNA-seq libraries were prepared using 150 ng of each of the extracted total RNAs (Supplementary Table S3), with Illumina Stranded mRNA Prep kit (Illumina, Cat. No. 20040534) and IDT for Illumina RNA UD Indexes Set A Ligation (Illumina, Cat. No. 20040553) according to its standard protocol unless stated otherwise below. Before the total volume PCR amplification was performed, we performed a preliminary PCR using a 1.5 μl aliquot of 10 μl DNA from the previous step, with KAPA Real-Time Library Amplification Kit (Kapa Biosystems, cat. No. KK2702). This demonstrated that the amplification of the products reached Standard 1 accompanying this kit between three and four PCR cycles, which instructed us to perform the full-volume PCR with three PCR cycles, introducing the minimal amplification (Supplementary Table S3). The libraries were sequenced in 0.6 lanes to obtain PE150 reads on a HiSeq X (Illumina). Quality control of the obtained fastq files for individual libraries was performed with FASTQC v0.11.5 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

The obtained sequence reads in the fastq files were processed with Trim Galore! v0.6.8 with the options ʻ--phred33 --stringency 2 --quality 30 --length 25 --paired’. The reads after adaptor trimming were assembled with the program Trinity v2.14.0 with the options ʻ--trimmomatic --SS_lib_type RF’. Among the resultant contig sequences, those matching PhiX, and mitochondrial DNA in the BLASTN v2.2.30 + 9 results executed with the options ‘-perc_identity 95’ were removed.

2.8. Molecular phylogenetics analysis

Protein sequences were collected from the NCBI and Ensembl databases, and their accession IDs used for the phylogenetic analysis are included in Supplementary Table S4. The deduced amino acid sequences were aligned with the MAFFT v7.505 using the L-INS-i method. The aligned sequences were trimmed with trimAl v1.4.rev1522 to remove unreliably aligned sites using the ‘-gappyout’ option. The maximum-likelihood tree was inferred with RAxML v8.2.1223 using the PROTCATWAG model, and for evaluating the confidence of the nodes, the rapid bootstrap resampling with 100 replicates was performed. A molecular phylogenetic tree employing the Bayesian framework was inferred with PhyloBayes v4.1c using the CAT-WAG-Γ model.

2.9. Gene prediction

Protein-coding genes were predicted with the program pipeline Braker324 on the genome assembly sequences in which interspersed repeats and simple tandem repeats were soft-masked with RepeatModeler/RepeatMasker25 (see above) and tantan.18 The gene prediction incorporated all peptide sequences registered in the file ‘odb11_vertebrata_fasta’ provided by the OrthoDB database,26 as well as the output of paired-end RNA-seq read mapping with hisat2.27

3. Results and discussion

3.1. Animal sampling and sex identification

An adult of the Pacific saury was sampled from the exhibition tank of Aquamarine Fukushima and was subjected to histological analysis for identifying the sex and the stage of gonadal maturity, in parallel with high-molecular-weight DNA extraction. Our histological observation of the Pacific saury individual revealed sexually mature testes, and it was judged to be a male (Fig. 2A–C). The testes showed a tubular-type structure (Fig. 2D) and had cysts containing various stages of germ cells; spermatozoa (Fig. 2E), spermatids (Fig. 2F), and spermatocytes (Fig. 2G). Abundant spermatozoa produced and stored in cysts close to the central cavity (Fig. 2E) indicate that this individual was sexually mature at the time of our sampling for genome sequencing.

Histological sex identification. Gonadal cross sections of the individual whose genome was sequenced in this study (A) were sampled in May 2022, and the reference male (B) and female (C) individuals sampled in June 2022, stained with hematoxylin and eosin staining (scale bar = 1 mm). (D-G) Detailed testis structure of the individual whose genome was sequenced in this study. Cysts are arranged depending on the developmental stages of the containing germ cells (D, scale bar = 200 μm). Germ cells located on the centrally, spermatozoa (E, scale bar = 20 μm); intermediate, spermatids (F, scale bar = 20 μm); outside, spermatocytes (G, scale bar = 20 μm).
Figure 2.

Histological sex identification. Gonadal cross sections of the individual whose genome was sequenced in this study (A) were sampled in May 2022, and the reference male (B) and female (C) individuals sampled in June 2022, stained with hematoxylin and eosin staining (scale bar = 1 mm). (D-G) Detailed testis structure of the individual whose genome was sequenced in this study. Cysts are arranged depending on the developmental stages of the containing germ cells (D, scale bar = 200 μm). Germ cells located on the centrally, spermatozoa (E, scale bar = 20 μm); intermediate, spermatids (F, scale bar = 20 μm); outside, spermatocytes (G, scale bar = 20 μm).

3.2. Genome size estimation

The extracted genomic DNA was used for the estimation of nuclear DNA content with the method using quantitative PCR (qPCR).19 Three protein-coding genes included in Table 1 were selected from those contained in the transcript contigs (see Materials and methods) and detected as single-copy genes. We designed specific primers for these genes and prepared standard molecules for each of them by PCR. Using the standard molecules controlled with sequencing specificity and quantity, we performed qPCR and quantified the DNA amount that yields a haploid DNA molecule. This resulted in the haploid nuclear DNA content of 1.19 pg, which corresponds to 1.17 Gb.

Table 1.

Summary of genome size estimation by qPCR

Target geneDNA copy number calculated by qPCRa
(copies/μl DNA)
Estimated
C-value (pg)
Mean
C-value (pg)b
Estimated haploid
genome size (Gbp)a
ACAT15,025.251.141.191.17
DLD4,335.211.32
RFC35,066.121.13
Target geneDNA copy number calculated by qPCRa
(copies/μl DNA)
Estimated
C-value (pg)
Mean
C-value (pg)b
Estimated haploid
genome size (Gbp)a
ACAT15,025.251.141.191.17
DLD4,335.211.32
RFC35,066.121.13

aA uniform amount of gDNA (5.71 ng/ μl) was used in qPCR.

bMean value of the C-value estimated for the three individual genes.

cGenome size in nucleotide base pairs calculated according to the previously proposed formula (see Materials and methods).

Table 1.

Summary of genome size estimation by qPCR

Target geneDNA copy number calculated by qPCRa
(copies/μl DNA)
Estimated
C-value (pg)
Mean
C-value (pg)b
Estimated haploid
genome size (Gbp)a
ACAT15,025.251.141.191.17
DLD4,335.211.32
RFC35,066.121.13
Target geneDNA copy number calculated by qPCRa
(copies/μl DNA)
Estimated
C-value (pg)
Mean
C-value (pg)b
Estimated haploid
genome size (Gbp)a
ACAT15,025.251.141.191.17
DLD4,335.211.32
RFC35,066.121.13

aA uniform amount of gDNA (5.71 ng/ μl) was used in qPCR.

bMean value of the C-value estimated for the three individual genes.

cGenome size in nucleotide base pairs calculated according to the previously proposed formula (see Materials and methods).

3.3. Genome sequencing and assembly

The extracted high molecular weight DNA was used for the SMRTbell library and subjected to circular consensus sequencing (CCS) with single molecule, real-time (SMRT) sequencing technology (see Materials and methods). This yielded 33.6 Gb high-fidelity long (HiFi) reads of ~30-folds of its haploid genome size (1.17 Gb). K-mer distribution in these reads suggested a heterozygosity of 1.87 % (Supplementary Fig. S1). These sequence reads were assembled into 1,555 contigs with an N50 length of 6.3 Mb. These contigs exhibited a percentage of duplicated one-to-one orthologs of 7.7 % and a total sequence length of 1.33 Gb that remarkably exceeded the genome size that was estimated independently (see above). We imputed these features to the relatively high heterozygosity (1.87%; see above) and purged redundant haplotigs. This resulted in 672 contig sequences with a reduced redundancy of one-to-one orthologs down to 1.84 % and a total sequence length of 1.15 Gb that is close to the genome size, 1.17 Gb (Table 2).

Table 2.

Genome assembly statistics in comparison with other teleost fishes

SpeciesCololabis sairaOryzias latipesDanio rerioGasterosteus aculeatusHirundichthys speculigerXenentodon cancila
Pacific sauryMedakaZebrafishThree-spined sticklebackMirrorwing flyingfishFreshwater garfish
AssemblyfColSai1.1UTGRCz11GAculeatus_UGA_version5GCA_018398805.1GCA_014839995.1
(fXenCan1.pri)
Total length (Mbp)1,1467901,6794721,043736
N50 scaffold length (Mbp)47.7315220129
Largest scaffold length (Mbp)57.6387834936
Number of scaffolds > 10 Mbp24242522024
Number of scaffolds > 1 Mbp2425252225925
% of sum length of sequences > 10 Mbp96.792.980.195.80.096.4
% of sum length of sequences > 1 Mbp96.893.180.195.88.597.1
# Seqs7589201,9222,9373,052410
# Gaps189649118,7963,1222061,373
% Gap10.040.060.280.760.051.02
% Complete ortholog198.999.297.798.799.398.9
% Complete + fragment ortholog199.399.599.199.499.799.3
% Duplicate ortholog11.840.417.01.63.90.2
% Missing ortholog10.660.50.90.60.30.6
SpeciesCololabis sairaOryzias latipesDanio rerioGasterosteus aculeatusHirundichthys speculigerXenentodon cancila
Pacific sauryMedakaZebrafishThree-spined sticklebackMirrorwing flyingfishFreshwater garfish
AssemblyfColSai1.1UTGRCz11GAculeatus_UGA_version5GCA_018398805.1GCA_014839995.1
(fXenCan1.pri)
Total length (Mbp)1,1467901,6794721,043736
N50 scaffold length (Mbp)47.7315220129
Largest scaffold length (Mbp)57.6387834936
Number of scaffolds > 10 Mbp24242522024
Number of scaffolds > 1 Mbp2425252225925
% of sum length of sequences > 10 Mbp96.792.980.195.80.096.4
% of sum length of sequences > 1 Mbp96.893.180.195.88.597.1
# Seqs7589201,9222,9373,052410
# Gaps189649118,7963,1222061,373
% Gap10.040.060.280.760.051.02
% Complete ortholog198.999.297.798.799.398.9
% Complete + fragment ortholog199.399.599.199.499.799.3
% Duplicate ortholog11.840.417.01.63.90.2
% Missing ortholog10.660.50.90.60.30.6

1Gaps stand for a string of no shorter than five undetermined (‘N’) bases. These scores stand for the percentage of reference orthologues identified in the input genome assembly and were obtained with the program compleasm (formerly, called minibusco) using the ortholog set Actinopterygii provided by the BUSCO developer (see Materials and methods).

Table 2.

Genome assembly statistics in comparison with other teleost fishes

SpeciesCololabis sairaOryzias latipesDanio rerioGasterosteus aculeatusHirundichthys speculigerXenentodon cancila
Pacific sauryMedakaZebrafishThree-spined sticklebackMirrorwing flyingfishFreshwater garfish
AssemblyfColSai1.1UTGRCz11GAculeatus_UGA_version5GCA_018398805.1GCA_014839995.1
(fXenCan1.pri)
Total length (Mbp)1,1467901,6794721,043736
N50 scaffold length (Mbp)47.7315220129
Largest scaffold length (Mbp)57.6387834936
Number of scaffolds > 10 Mbp24242522024
Number of scaffolds > 1 Mbp2425252225925
% of sum length of sequences > 10 Mbp96.792.980.195.80.096.4
% of sum length of sequences > 1 Mbp96.893.180.195.88.597.1
# Seqs7589201,9222,9373,052410
# Gaps189649118,7963,1222061,373
% Gap10.040.060.280.760.051.02
% Complete ortholog198.999.297.798.799.398.9
% Complete + fragment ortholog199.399.599.199.499.799.3
% Duplicate ortholog11.840.417.01.63.90.2
% Missing ortholog10.660.50.90.60.30.6
SpeciesCololabis sairaOryzias latipesDanio rerioGasterosteus aculeatusHirundichthys speculigerXenentodon cancila
Pacific sauryMedakaZebrafishThree-spined sticklebackMirrorwing flyingfishFreshwater garfish
AssemblyfColSai1.1UTGRCz11GAculeatus_UGA_version5GCA_018398805.1GCA_014839995.1
(fXenCan1.pri)
Total length (Mbp)1,1467901,6794721,043736
N50 scaffold length (Mbp)47.7315220129
Largest scaffold length (Mbp)57.6387834936
Number of scaffolds > 10 Mbp24242522024
Number of scaffolds > 1 Mbp2425252225925
% of sum length of sequences > 10 Mbp96.792.980.195.80.096.4
% of sum length of sequences > 1 Mbp96.893.180.195.88.597.1
# Seqs7589201,9222,9373,052410
# Gaps189649118,7963,1222061,373
% Gap10.040.060.280.760.051.02
% Complete ortholog198.999.297.798.799.398.9
% Complete + fragment ortholog199.399.599.199.499.799.3
% Duplicate ortholog11.840.417.01.63.90.2
% Missing ortholog10.660.50.90.60.30.6

1Gaps stand for a string of no shorter than five undetermined (‘N’) bases. These scores stand for the percentage of reference orthologues identified in the input genome assembly and were obtained with the program compleasm (formerly, called minibusco) using the ortholog set Actinopterygii provided by the BUSCO developer (see Materials and methods).

We prepared a Hi-C library using a combination of restriction enzymes DpnII and HinfI (see Materials and methods) with the iconHi-C protocol (Kadota et al., 2020). This Hi-C library was sequenced on a short-read sequencing platform to obtain 142 million paired-end reads. Using these Hi-C reads, the contigs obtained above were scaffolded consistently with chromatin contact profiles (Fig. 3A). Subsequent manual curation by referring to the Hi-C contact map28 resulted in a whole genome assembly containing 24 chromosome-scale sequences, with a total nucleotide length of 1.146 Gb and a scaffold N50 length of 47.7 Mb. Of those chromosome-scale sequences, the longest sequence amounts to 57.6 Mb and the shortest one to 27.1 Mb. The whole assembly including non-chromosomal sequences were subjected to an assessment of the completeness of the protein-coding landscape of the genome, which exhibited percentages of the conserved orthologs identified as complete and that including those identified as fragmented of 98.9 and 99.3 %, respectively (Table 2). Separately, we identified a 16,499 bp-long whole mitochondrial DNA sequence assembled from the obtained HiFi reads that was identical to the nucleotide sequence of the existing NCBI entry NC_003183.1 for the Pacific saury whole mitochondrial DNA. The whole genome assembly excluding this mitochondrial DNA sequence was designated fColSai1.1 (Accession ID at NCBI Genome, GCF_033807715.1) and was used in downstream analyses. The genome assembly, as well as other sequence resources obtained in this study, are available as databases for BLAST29 searches at the original webserver https://treethinkers.nig.ac.jp/saira/blast/ built with SequenceServer v2.0.0.30

Pacific saury genome assembly. (A) Hi-C contact map. (B) The genomic landscape showed for individual chromosomes (black letters). This circular view shows the contents of GC nucleotides (grey; 40-50%), simple tandem repeats (red; 0-90%), interspersed repeats (blue; 30-90%), and canonical vertebrate telomeric repeats (TTAGGG)n counted for 10 kb-long non-overlapping windows (green; 0-300 units), as well as the distribution of tRNA-coding regions (purple) and rDNA loci (red; indicated with an arrowhead).
Figure 3.

Pacific saury genome assembly. (A) Hi-C contact map. (B) The genomic landscape showed for individual chromosomes (black letters). This circular view shows the contents of GC nucleotides (grey; 40-50%), simple tandem repeats (red; 0-90%), interspersed repeats (blue; 30-90%), and canonical vertebrate telomeric repeats (TTAGGG)n counted for 10 kb-long non-overlapping windows (green; 0-300 units), as well as the distribution of tRNA-coding regions (purple) and rDNA loci (red; indicated with an arrowhead).

We compared the Pacific saury genome assembly produced in this study with those of typical teleost fish model species and other beloniform species (Table 2). This comparison highlighted a particularly high proportion of the sequenced genomic regions contained in the chromosome-scale sequences for our Pacific saury assembly (96.8%). It also showed comparably high coverages (>98%) of the protein-coding landscape of the genome indicated by the number of pre-selected conserved one-to-one orthologs identified, among the species (Table 2). The Pacific saury assembly has fewer gaps (strings of undetermined nucleotide ‘N’ with no less than five bases, in this case) than most of the other chromosome-scale genome assemblies (Table 2). These metrics show a high utility of the Pacific saury genome assembly produced in this study.

The contents of the genomic sequences can be further scrutinized by aligning them with genomic sequences of closely related sequences. For this purpose, we compared the Pacific saury genome assembly with that of the Japanese medaka31 which is visualized in a dot plot (Fig. 4). This comparison resulted in the regions of cross-species similarity from a chromosome pair residing mostly in a single grid in Figure 4. This suggests high reliability of intra-chromosomal linkages of the Pacific saury genomic contigs that support our chromosome-scale scaffolding as well as high conservation of genomic structure between the two species, with far more intra-chromosomal rearrangements rather than inter-chromosomal ones.

Cross-species comparison of chromosomal sequences. Chromosome-scale sequences of the Pacific saury obtained in this study (total length: 1,111 Mb) and those of the Japanese medaka (Oryzias latipes) (total length: 734 Mb) were input as target and query in the D-Genies (https://dgenies.toulouse.inra.fr/) respectively, and the alignment between the two whole genome assemblies was performed with minimap2 (version 2.24). Diagonal lines show regions of continuous similarity.
Figure 4.

Cross-species comparison of chromosomal sequences. Chromosome-scale sequences of the Pacific saury obtained in this study (total length: 1,111 Mb) and those of the Japanese medaka (Oryzias latipes) (total length: 734 Mb) were input as target and query in the D-Genies (https://dgenies.toulouse.inra.fr/) respectively, and the alignment between the two whole genome assemblies was performed with minimap2 (version 2.24). Diagonal lines show regions of continuous similarity.

3.4. Non-coding landscape of the genome

Our identification process of repetitive elements, including species-specific (‘de novo’) elements, detected repetitive elements in 51.3% of the whole Pacific saury genome, while those in the medaka genome occupied only 39.5%, based on the same method and criterion. When broken into different repeat classes, the Pacific saury genome showed a particularly higher proportion of DNA transposons than that of the medaka (12.9 vs. 3.16 %) (Fig. 5).

Comparison of repetitive element breakdown into different repeat classes. See Materials and Methods for details. Only chromosome-scale genome sequences were subjected to repeat detection. Distributions of the divergence of detected repetitive sequences are shown in Supplementary Fig. S2.
Figure 5.

Comparison of repetitive element breakdown into different repeat classes. See Materials and Methods for details. Only chromosome-scale genome sequences were subjected to repeat detection. Distributions of the divergence of detected repetitive sequences are shown in Supplementary Fig. S2.

Chromosome ends consistently exhibited higher GC content and higher repeat element frequency (Fig. 3B). Among the simple repeats, canonical telomeric repeats (TTAGGG)n were mostly detected closer to, but not always on, the ends of chromosome-scale scaffold sequences (Fig. 3B), which necessitates in-depth investigation to examine whether the chromosome ends are undergoing any lineage-specific structural modification or includes any misjoins in the genome assembly.

We also performed a genome-wide scan of genomic segments containing ribosomal DNA (rDNA) whose intricate genomic structure was one of the major confounding factors in completing the human genome sequencing.32,33 This analysis was aided by the existing sequence entry OP151180.1 partially covering a Pacific saury rDNA region (6,848 bp) in the NCBI Nucleotide database. Our search detected two units of a 23 kb-long string of 28S, 5.8S, and 18S rDNA genes on chromosome 15 (Fig. 6). We observed more regions with high similarity, but none of those with >90% sequence identity was included in chromosome-scale sequences (scaffold s1–24). The retrieval of the rDNA loci in the obtained Pacific saury genome assembly shows higher completeness of the non-coding landscape than the currently available medaka genome assembly (GCF_002234675.1) in which no intact 28S, 5.8S, and 18S rDNA gene cluster is included in its chromosome-scale sequences. Still, the exact number of repetitive units of the rDNA loci remains to be rigidly confirmed with longer sequence reads.

Ribosomal DNA loci in the Pacific saury genome. The 34 kb-long genomic region (position 6,433,000-6,467,000 on Scaffold15) containing two units of the ribosomal DNA cluster (28S-5.8S-18S) is shown. (A) Location of individual rDNA loci. (B) Dot matrix comparing the rDNA-containing Pacific saury genomic region and the human 45S pre-rDNA sequence (NR_145819.1). (C) Dot matrix comparing the rDNA-containing Pacific saury genomic region with itself to indicate internal repeats in it. A light purple background shows ~14 kb-long rDNA-containing regions, and a grey background shows the rest of the repetitive unit.
Figure 6.

Ribosomal DNA loci in the Pacific saury genome. The 34 kb-long genomic region (position 6,433,000-6,467,000 on Scaffold15) containing two units of the ribosomal DNA cluster (28S-5.8S-18S) is shown. (A) Location of individual rDNA loci. (B) Dot matrix comparing the rDNA-containing Pacific saury genomic region and the human 45S pre-rDNA sequence (NR_145819.1). (C) Dot matrix comparing the rDNA-containing Pacific saury genomic region with itself to indicate internal repeats in it. A light purple background shows ~14 kb-long rDNA-containing regions, and a grey background shows the rest of the repetitive unit.

Our assessment of the non-coding landscape of the genome encompassed the census of transfer RNAs (tRNAs) whose abundance and distribution in vertebrate genomes are rarely studied34,35. Our search, based on both sequence similarity and secondary structure, detected 11,088 candidates for tRNA decoding 20 standard amino acids in the Pacific saury genome, while the medaka genome harboured only 1,894 on the same detection method and criterion. This discrepancy demands in-depth comparative analysis including more species to characterize its biological significance.

3.5. Inference of transcribed and translated genomic regions

Animal dissection was performed for a different individual of unknown sex to sample the heart, muscle, gill, liver, gut, and eyeball of an adult as well as a whole larvae body (Supplementary Table S3). The RNAs extracted from these tissues were subjected to library preparation for short-read sequencing. We performed a comprehensive detection of protein-coding regions in the obtained genome assembly with the program pipeline Braker3 using an alignment of the transcriptome short reads from adult and larvae Pacific saury tissues as evidence (398 million read pairs in total) as well as an exploitative protein sequence data set as hints (see Materials and methods). This resulted in 22,274 predicted protein-coding genes, which are contained in a ready-to-use gene model file that is released under our original website https://treethinkers.nig.ac.jp/saira/. Of these predicted genes, only 24 were predicted outside the 24 chromosome-scale scaffolds.

We compared the protein-coding landscape of the Pacific saury genome with a close relative medaka, with reference to a remarkable difference in the genome size (1.17 Gb versus 0.80 Gb31) between the two species. On the ground of similar numbers of protein-coding genes (22,274 vs. 22,059), the Pacific saury had comparable mean gene length and mean intron length (Table 3). This suggests that the difference in the genome size is attributed to uncharacterized differences outside protein-coding regions.

Table 3.

Pacific saury genome annotation and its cross-species comparison

FeaturePacific sauryMedaka
Number of protein-coding genes22,26622,059
Mean gene length (nt)17,82020,607
Mean intron length (nt)1,8881,886
Longest gene (nt)488,4971,049,862
Longest cds (nt)26,66194,659
Longest exon (nt)13,17819,832
Longest intron (nt)459,5151,035,396
Number of predicted tRNA311,0881,894
Repetitiveness (%)151.339.5
FeaturePacific sauryMedaka
Number of protein-coding genes22,26622,059
Mean gene length (nt)17,82020,607
Mean intron length (nt)1,8881,886
Longest gene (nt)488,4971,049,862
Longest cds (nt)26,66194,659
Longest exon (nt)13,17819,832
Longest intron (nt)459,5151,035,396
Number of predicted tRNA311,0881,894
Repetitiveness (%)151.339.5

The statistics for medaka were computed for the gene model in the .gff file registered for the genome assembly GCF_002234675.131.

Table 3.

Pacific saury genome annotation and its cross-species comparison

FeaturePacific sauryMedaka
Number of protein-coding genes22,26622,059
Mean gene length (nt)17,82020,607
Mean intron length (nt)1,8881,886
Longest gene (nt)488,4971,049,862
Longest cds (nt)26,66194,659
Longest exon (nt)13,17819,832
Longest intron (nt)459,5151,035,396
Number of predicted tRNA311,0881,894
Repetitiveness (%)151.339.5
FeaturePacific sauryMedaka
Number of protein-coding genes22,26622,059
Mean gene length (nt)17,82020,607
Mean intron length (nt)1,8881,886
Longest gene (nt)488,4971,049,862
Longest cds (nt)26,66194,659
Longest exon (nt)13,17819,832
Longest intron (nt)459,5151,035,396
Number of predicted tRNA311,0881,894
Repetitiveness (%)151.339.5

The statistics for medaka were computed for the gene model in the .gff file registered for the genome assembly GCF_002234675.131.

3.6. Census of aquaporin (AQP) gene family members

We scanned the obtained genome assembly to identify the coding genomic regions of the genes encoding aquaporin (AQP) proteins that function mainly as water channels.36 The coverage of AQP genes, often intervened by long introns, is expected to allow an additional metric for evaluating the continuity of genomic sequences. Our search identified 15 Pacific saury AQP genes, and they were categorized into eleven major subgroups that presumably existed in the common jawed vertebrate ancestor (Fig. 7) by our molecular phylogeny inference (Supplementary Fig. S3). Because of the teleost-specific whole genome duplication,38 two paralogs were identified for AQP0, -9, -10, and -11, as previously shown for some teleost fishes.39 In the Pacific saury genome, we could not detect the multiplicity of the AQP1, -3, and -8 which was previously observed in the zebrafish but not in the medaka, suggesting that their absences in the medaka is not due to false negative detection. The identified Pacific saury AQP genes possess the conserved organizations of coding exons expected from their Japanese medaka orthologs (Supplementary Fig. S4). The retrieval of all the AQP genes possessed by the medaka that belong to the same taxon Beloniformes ascertains the high completeness of the Pacific saury genome assembly obtained in this study.

Aquaporin (AQP) gene identification. Orthology-based schematic categorization of Pacific saury genes. Different AQP subtype genes are positioned based on the orthology suggested by the molecular phylogenetic tree in Supplementary Fig. S3. The numbers in the boxes indicate the multiplicity of the orthologs generated by lineage-specific gene duplications, while no box shows the absence of ortholog in the currently available genome assemblies. Telest fish gene repertoires are highlighted in cyan. At the top is the classification into different subfamilies including glpAQPs (aquaglyceroporins) based on existing literature.36,37
Figure 7.

Aquaporin (AQP) gene identification. Orthology-based schematic categorization of Pacific saury genes. Different AQP subtype genes are positioned based on the orthology suggested by the molecular phylogenetic tree in Supplementary Fig. S3. The numbers in the boxes indicate the multiplicity of the orthologs generated by lineage-specific gene duplications, while no box shows the absence of ortholog in the currently available genome assemblies. Telest fish gene repertoires are highlighted in cyan. At the top is the classification into different subfamilies including glpAQPs (aquaglyceroporins) based on existing literature.36,37

Acknowledgements

We thank Mayuki Ando and Hiroaki Chiba for their helpful advice and assistance during the histological experiments, Miyuki Mekuchi, Yoji Nakamura, and Motoshige Yasuike at FRA for insightful discussion about fishery science of the Pacific saury, Taiki Niwa for early trials of gene prediction, Yoshinobu Uno for discussion on karyotyping reports, Osamu Nishimura for quality control of short read sequencing data, Yoshinori Hasegawa for conducting HiFi sequence data production, Rintaro Ishii, Toshiaki Mori, Hikari Yoshizato for their maintaining fish colonies, and Takeshi Furukawa for supervising fish maintenance.

Funding

This research was supported by a ‘Strategic Research Projects’ grant from ROIS (Research Organization of Information and Systems).

Conflict of interest statement

None declared.

Data availability

Raw sequence reads and the genome assembly have been deposited in NCBI under the BioProject ID PRJNA1027303. The predicted coding sequences (in both nucleotides and amino acids) and the gene models (in the .gff3 file format), as well as the genome assembly, are also available at https://figshare.com/projects/Pacific-saury-genome/178218.

References

1.

Lovejoy
,
N.R.
 
2000
,
Reinterpreting recapitulation: systematics of needlefishes and their allies (Teleostei: Beloniformes)
,
Evolution
,
54
,
1349
62
.

2.

Miya
,
M.
,
Kawaguchi
,
A.
, and
Nishida
,
M.
 
2001
,
Mitogenomic exploration of higher teleostean phylogenies: a case study for moderate-scale evolutionary genomics with 38 newly determined complete mitochondrial DNA sequences
,
Mol. Biol. Evol.
,
18
,
1993
2009
.

3.

Hedges
,
S.B.
,
Dudley
,
J.
, and
Kumar
,
S.
 
2006
,
TimeTree: a public knowledge-base of divergence times among organisms
,
Bioinformatics
,
22
,
2971
2
.

4.

Sayers
,
E.W.
,
Bolton
,
E.E.
,
Brister
,
J.R.
, et al.  
2023
,
Database resources of the National Center for Biotechnology Information in 2023
,
Nucleic Acids Res.
,
51
,
D29
38
.

5.

Fricke
 
R
,
Eschmeyer
 
WN
,
Van der Laan
 
R
,
2023
,
Eschmeyer’s Catalog of Fishes: Genera, Species, References
. http://researcharchive.calacademy.org/research/ichthyology/catalog/fishcatmain.asp (
14 July 2023,
date last accessed).

6.

Yabu
,
H.
and
Ishii
,
K.
 
1981
,
Chromosomes in Pacific saury, Cololabis saira
. Bull. Jap. Soc. Sci. Fish
,
47
,
559
.

7.

Arai
,
R.
 
2011
,
Fish Karyotypes
.
Springer
,
Tokyo, Japan
.

8.

Sim
,
S.B.
,
Corpuz
,
R.L.
,
Simmonds
,
T.J.
, and
Geib
,
S.M.
 
2022
,
HiFiAdapterFilt, a memory efficient read processing pipeline, prevents occurrence of adapter sequence in PacBio HiFi reads and their negative impacts on genome assembly
,
BMC Genomics
,
23
,
157
.

9.

Cheng
,
H.
,
Concepcion
,
G.T.
,
Feng
,
X.
,
Zhang
,
H.
, and
Li
,
H.
 
2021
,
Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm
,
Nat. Methods
,
18
,
170
5
.

10.

Roach
,
M.J.
,
Schmidt
,
S.A.
, and
Borneman
,
A.R.
 
2018
,
Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies
,
BMC Bioinf.
,
19
,
460
.

11.

Kadota
,
M.
,
Nishimura
,
O.
,
Miura
,
H.
,
Tanaka
,
K.
,
Hiratani
,
I.
, and
Kuraku
,
S.
 
2020
,
Multifaceted Hi
C benchmarking: what makes a difference in chromosome-scale genome scaffolding
?
GigaScience
,
9
,
giz158
.

12.

Durand
,
N.C.
,
Shamim
,
M.S.
,
Machol
,
I.
, et al.  
2016
,
Juicer provides a one-click system for analyzing loop-resolution hi-C experiments
,
Cell Syst
,
3
,
95
8
.

13.

Dudchenko
,
O.
,
Batra
,
S.S.
,
Omer
,
A.D.
, et al.  
2017
,
De novo assembly of the Aedes aegypti genome using Hi
C yields chromosome-length scaffolds
,
Science
,
356
,
92
5
.

14.

Nishimura
,
O.
,
Hara
,
Y.
, and
Kuraku
,
S.
 
2017
,
gVolante for standardizing completeness assessment of genome and transcriptome assemblies
,
Bioinformatics
,
33
,
3635
7
.

15.

Seppey
,
M.
,
Manni
,
M.
, and
Zdobnov
,
E.M.
 
2019
,
BUSCO: assessing genome assembly and annotation completeness
,
Methods Mol. Biol.
,
1962
,
227
45
.

16.

Huang
,
N.
and
Li
,
H.
 
2023
,
compleasm: a faster and more accurate reimplementation of BUSCO
,
Bioinformatics
,
39
,
btad595
.

17.

Li
,
H.
 
2023
,
Protein-to-genome alignment with miniprot
,
Bioinformatics
,
39
,
btad014
.

18.

Frith
,
M.C.
 
2011
,
A new repeat-masking method enables specific detection of homologous sequences
,
Nucleic Acids Res.
,
39
,
e23
.

19.

Kadota
,
M.
,
Tatsumi
,
K.
,
Yamaguchi
,
K.
,
Uno
,
Y.
, and
Kuraku
,
S.
 
2023
,
Shark and ray genome size estimation: methodological optimization for inclusive and controllable biodiversity genomics
,
bioRxiv
,
12
,
1204
.

20.

Hara
,
Y.
,
Tatsumi
,
K.
,
Yoshida
,
M.
,
Kajikawa
,
E.
,
Kiyonari
,
H.
, and
Kuraku
,
S.
 
2015
,
Optimizing and benchmarking de novo transcriptome sequencing: from library preparation to assembly evaluation
,
BMC Genomics
,
16
,
977
.

21.

Dolezel
,
J.
,
Bartos
,
J.
,
Voglmayr
,
H.
, and
Greilhuber
,
J.
 
2003
,
February, Nuclear DNA content and genome size of trout and human
,
Cytometry A
,
51
,
127
8
.

22.

Capella-Gutiérrez
,
S.
,
Silla-Martínez
,
J.M.
, and
Gabaldón
,
T.
 
2009
,
trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses
,
Bioinformatics
,
25
,
1972
3
.

23.

Hübner
,
L.
,
Kozlov
,
A.M.
,
Hespe
,
D.
,
Sanders
,
P.
, and
Stamatakis
,
A.
 
2021
,
Exploring parallel MPI fault tolerance mechanisms for phylogenetic inference with RAxML-NG
,
Bioinformatics
,
37
,
4056
63
.

24.

Gabriel
,
L.
,
Brůna
,
T.
,
Hoff
,
K.J.
, et al.  
2023
,
BRAKER3: fully automated genome annotation using RNA-Seq and protein evidence with GeneMark-ETP, AUGUSTUS and TSEBRA
,
bioRxiv
, doi:10.1101/2023.06.10.544449

25.

Tempel
,
S.
 
2012
,
Using and understanding RepeatMasker
,
Methods Mol. Biol.
,
859
,
29
51
.

26.

Kuznetsov
,
D.
,
Tegenfeldt
,
F.
,
Manni
,
M.
, et al.  
2023
,
OrthoDB v11: annotation of orthologs in the widest sampling of organismal diversity
,
Nucleic Acids Res.
,
51
,
D445
51
.

27.

Kim
,
D.
,
Paggi
,
J.M.
,
Park
,
C.
,
Bennett
,
C.
, and
Salzberg
,
S.L.
 
2019
,
Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype
,
Nat. Biotechnol.
,
37
,
907
15
.

28.

Yamaguchi
,
K.
,
Kadota
,
M.
,
Nishimura
,
O.
,
Ohishi
,
Y.
,
Naito
,
Y.
, and
Kuraku
,
S.
 
2021
,
Technical considerations in Hi-C scaffolding and evaluation of chromosome‐scale genome assemblies
,
Mol. Ecol.
,
30
,
5923
34
.

29.

Altschul
,
S.F.
,
Madden
,
T.L.
,
Schäffer
,
A.A.
, et al.  
1997
,
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
,
Nucleic Acids Res.
,
25
,
3389
402
.

30.

Priyam
,
A.
,
Woodcroft
,
B.J.
,
Rai
,
V.
, et al.  
2019
,
Sequenceserver: A modern graphical user interface for custom BLAST databases
,
Mol. Biol. Evol.
,
36
,
2922
4
.

31.

Ichikawa
,
K.
,
Tomioka
,
S.
,
Suzuki
,
Y.
, et al.  
2017
,
Centromere evolution and CpG methylation during vertebrate speciation
,
Nat. Commun.
,
8
,
1833
.

32.

Nurk
,
S.
,
Koren
,
S.
,
Rhie
,
A.
, et al.  
2022
,
The complete sequence of a human genome
,
Science
,
376
,
44
53
.

33.

Hori
,
Y.
,
Shimamoto
,
A.
, and
Kobayashi
,
T.
 
2021
,
The human ribosomal DNA array is composed of highly homogenized tandem clusters
,
Genome Res.
,
31
,
1971
82
.

34.

Ottenburghs
,
J.
,
Geng
,
K.
,
Suh
,
A.
, and
Kutter
,
C.
 
2021
,
Genome size reduction and transposon activity impact tRNA gene diversity while ensuring translational stability in birds
,
Genome Biol. Evol
,
13
,
evab016
.

35.

Rak
,
R.
,
Dahan
,
O.
, and
Pilpel
,
Y.
 
2018
,
Repertoires of tRNAs: the couplers of genomics and proteomics
,
Annu. Rev. Cell Dev. Biol.
,
34
,
239
64
.

36.

King
,
L.S.
,
Kozono
,
D.
, and
Agre
,
P.
 
2004
,
From structure to disease: the evolving tale of aquaporin biology
,
Nat. Rev. Mol. Cell Biol.
,
5
,
687
98
.

37.

Finn
,
R.N.
,
Chauvigné
,
F.
,
Hlidberg
,
J.B.
,
Cutler
,
C.P.
, and
Cerdà
,
J.
 
2014
,
The lineage-specific evolution of aquaporin gene clusters facilitated tetrapod terrestrial adaptation
,
PLoS One
,
9
,
e113686
.

38.

Kuraku
,
S.
and
Meyer
,
A.
 
2009
,
The evolution and maintenance of Hox gen in vertebrates and the teleost-specific genome duplication
,
Int. J. Dev. Biol.
,
53
,
765
73
.

39.

Tingaud-Sequeira
,
A.
,
Calusinska
,
M.
,
Finn
,
R.N.
,
Chauvigné
,
F.
,
Lozano
,
J.
, and
Cerdà
,
J.
 
2010
,
The zebrafish genome encodes the largest vertebrate repertoire of functional aquaporins with dual paralogy and substrate specificities similar to mammals
,
BMC Evol. Biol.
,
10
,
38
.

Author notes

Mana Sato, Kazuya Fukuda, Mitsutaka Kadota, Hatsune Makino-Itou and Kaori Tatsumi contributed equally.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.