Chromosome-level genome assembly of the shuttles hoppfish, Periophthalmus modestus

Abstract Background The shuttles hoppfish (mudskipper), Periophthalmus modestus, is one of the mudskippers, which are the largest group of amphibious teleost fishes, which are uniquely adapted to live on mudflats. Because mudskippers can survive on land for extended periods by breathing through their skin and through the lining of the mouth and throat, they were evaluated as a model for the evolutionary sea-land transition of Devonian protoamphibians, ancestors of all present tetrapods. Results A total of 39.6, 80.2, 52.9, and 33.3 Gb of Illumina, Pacific Biosciences, 10X linked, and Hi-C data, respectively, was assembled into 1,419 scaffolds with an N50 length of 33 Mb and BUSCO score of 96.6%. The assembly covered 117% of the estimated genome size (729 Mb) and included 23 pseudo-chromosomes anchored by a Hi-C contact map, which corresponded to the top 23 longest scaffolds above 20 Mb and close to the estimated one. Of the genome, 43.8% were various repetitive elements such as DNAs, tandem repeats, long interspersed nuclear elements, and simple repeats. Ab initio and homology-based gene prediction identified 30,505 genes, of which 94% had homology to the 14 Actinopterygii transcriptomes and 89% and 85% to Pfam familes and InterPro domains, respectively. Comparative genomics with 15 Actinopterygii species identified 59,448 gene families of which 12% were only in P. modestus. Conclusions We present the high quality of the first genome assembly and gene annotation of the shuttles hoppfish. It will provide a valuable resource for further studies on sea-land transition, bimodal respiration, nitrogen excretion, osmoregulation, thermoregulation, vision, and mechanoreception.


Introduction
Mudskippers are of the subfamily Oxudercinae and the family Oxudercidae, which was recently separated from the family Gobiidae [1], and the largest group of amphibious teleost fishes, which are uniquely adapted to live on mudflats [2]. They can survive on land for extended periods by breathing through their skin and through the lining of the mouth and throat. They propel themselves over land on their sturdy forefins, and some of them are also able to climb trees and skip atop the surface of the water [3]. They inhabit tropical, subtropical, and temperate regions, including the Indo-Pacific and the Atlantic coast of Africa [4].
The family Oxudercidae has 10 genera and 42 species in Fish-Base. Among them, 4 species have been sequenced for the draft genome [2]. However, only Boleophthalmus pectinirostris is useful as a draft genome.
In this study, we present a chromosome-level high-quality genome of Periophthalmus modestus (NCBI:txid146921; Fishbase ID: 54509) using Pacific Biosciences (PacBio) long-read, Illumina shortread, 10X linked read, and Hi-C sequencing. P. modestus [5] is a species of the shuttles hoppfish occurring worldwide in tropical and temperate near shore-marine habitats, including the northwestern Pacific Ocean from Vietnam to Korea, as well as Japan [6]. P. modestus can reach a length of 10 cm (Fig. 1) and was known to have 23 chromosomes [7]. We performed structural gene annotation and repeats analysis. Comparative genomics with 16 Actinopterygii genomes identified synteny map, orthologous gene families, evolutionary divergence, and expanded and contracted gene families.

Methods
Sample collection and extraction of genomic DNA and total RNA P. modestus samples were collected from Gochang-gun, Jeollabukdo, South Korea ( 35 20 24.0 N, 126 22 12.0 E), in May 2018. Total DNA was isolated from the muscle of P. modestus using the DNeasy Blood & Tissue kit (Qiagen, Germantown MD USA), following the manufacturer's protocol.
For species identification, the mitochondrial DNA cytochrome b gene barcode region was amplified using PCR as described in [8]. The PCR product of ∼803 bp was purified using the QIAquick PCR purification kit (Qiagen, Germantown MD, USA) and sequenced on an ABI 3730xl DNA Analyzer (Applied Biosystems 3730xl Capillary Genetic Sequencer, RRID:SCR_018059) with the same PCR primer set. The sequence data were edited and aligned using the ATGC 4.0 software (Genetyx, Japan).
Organs of specimens collected in July 2019 were manually dissected for eye, brain, liver, gut, muscle, and fin tissues, and total RNA was extracted from the dissected organs using the RNeasy Mini Kit (Qiagen, Germantown MD USA). The RNA preparation was repeated 3 times, and then 3-replicate RNA samples were mixed and processed for RNA sequencing (RNA-seq) and isoform sequencing (Iso-seq).

DNA library construction and sequencing
For short-read sequencing, a paired-end library with insert sizes of 550 bp was constructed using Illumina TruSeq DNA Nano Prep Kit (Illumina, San Diego CA USA) and sequenced on an Illumina HiSeq 4000 instrument (Illumina HiSeq 4000 System, RRID:SCR_016386). For long-read sequencing, a 20-kb SMRTbell library (PacBio, Menlo Park CA USA) was prepared and sequenced on a PacBio Sequel (PacBio Sequel System, RRID:SCR_017989) using 11 cells. To increase continuity in the genome assembly, we further produced linked reads and Hi-C reads. For linked-read sequencing, a 10X Chromium genome v2 library (10X Genomics, Pleasanton CA USA) was constructed and sequenced on an Illumna NovaSeq 6000 instrument. For long-range scaffolding, a Dovetail Hi-C library was prepared with Dovetail Hi-C Library kit (Dovetail Genomics, Scotts Valley CA USA) and sequenced on an Illumina NovaSeq 6000 instrument (Illumina NovaSeq 6000 Sequencing System, RRID:SC R_016387).

RNA library construction and sequencing
For RNA-seq, paired-end libraries with insert size of 150 bp were prepared with the Truseq mRNA Prep kit (Illumina, San Diego CA USA) from total messenger RNA (mRNA), which was subsequently sequenced on an Illumina HiSeq 2500 (Illumina HiSeq 2500 System, RRID:SCR_016383). For PacBio Iso-seq, 3 libraries of length 1-2, 2-3, and 3-6 kb were prepared from polyadenylated RNA according to the PacBio Iso-seq protocol (PacBio, Menlo Park CA USA). Six SMRT cells were run on a PacBio RS II system (PacBio RS II Sequencing System, RRID:SCR_017988).

Comparative genomics
Chromeister [49] performed all pairwise comparison with 17 Actinopterygii genomes to generate a synteny map. OrthoMCL (OrthoMCL DB: Ortholog Groups of Protein Sequences, RRID: SCR_007839) [50] identified orthologous gene families among 15 Actinopterygii transcriptomes (Supplementary Table S1). GO (Gene Ontology, RRID:SCR_002811) enrichment was performed using the Fisher exact test and false discovery rate correction to identify functionally enriched GO terms among gene families relative to the "genome background," as annotated by Pfam.  Table S12 shows the software versions, settings, and parameters.

Chromosome-level genome assembly
We generated 39.6, 80.2, 52.9, and 33.3 Gb (46×, 94×, 62×, and 39× coverage) of Illumina, PacBio, 10X linked, and Hi-C data, respectively, for genome sequencing (Supplementary Table S2). The genome size was estimated at 729 Mb using the 17-mer peak and distribution from cleaned Illumina data ( Supplementary Fig. S1). MiniMAP2 and MiniASM followed by polishing using RACON and Pilon generated 3,839 contigs (854 Mb and N50 of 579 kb) using PacBio sequencing data. Tigmint, ARCS, and LINKS generated 2,170 scaffolds (854 Mb and N50 of 1.5 Mb) using 10X linked data, and Dovetail HiRise finally generated 1,419 scaffolds including 23 pseudo-chromosomes (854 Mb and N50 of 33 Mb) using Hi-C data ( Table 1). The pseudo-chromosomes were anchored by a Hi-C contact map (Supplementary Fig. S2), and corresponded to the top 23 longest scaffolds, of which the sum of lengths was close to

Genome annotation
Repetitive elements predicted by the 3 ways were merged to a total of 452 Mb, which covered 44% of the genome: 11, 6, 5, 10, and 17% for DNA, long interspersed nuclear elements, simple repeats, tandem repeats, and unknown, respectively (Supplementary Table S5). We compared P. modestus with 16 Actinopterygii species for repeats (Supplementary Table S7). As shown in Fig. 2, P. modestus had more simple and tandem repeats than the other Actinopterygii species.
For ab initio gene prediction, we generated 172 Gb and 125 Mb of RNA-seq and PacBio data, respectively, which yielded 366,298 and 131,807 hints for introns. BRAKER with GeneMark and AUGUSTUS predicted 132,821 genes. For homology-based gene prediction, we used 14 Actinopterygii species (Supplementary Table S1). A pipeline of TBLASTN, GenBlastA, and Exonerate predicted 22,721 genes. Merging the 2 outputs and filtering incomplete genes produced 30,505 genes and 34,916 transcripts (Supplementary Table S6), of which 94% had homology to the 14 Actinopterygii transcriptomes. As a result of InterProScan annotation, 27,048 genes had 5,489 Pfam families, 25,995 genes had 5,121 InterPro domains, 17,310 genes had 2,277 GO terms, and 6,059 genes had 2,166 pathways.

Synteny map
The 17 Actinopterygii genomes (Supplementary Table S1) were compared to identify a synteny map using Chromeister. Supplementary Fig. S3 shows dot plots in the upper triangular matrix and distance scores in the lower triangular matrix. As expected, the pair of P. modestus and Periophthalmus magnuspinnatus had the lowest score, meaning the closest pair. The second and third lowest score corresponded to the pair of Boleophthalmus pectinirostris with P. magnuspinnatus and P. modestus, respectively. Note that the scores of Danio rerio and Lepisosteus oculatus with the others were >0.99 because of the evolutionary distances.

Orthologous gene family
The 15 Actinopterygii whole-genome gene datasets (Supplementary Table S1) were compared to identify orthologous gene families using orthoMCL. Among Fig. 3, P. modestus had more families than the others and the number of common families in ≥13 species were dominant. The unique gene families of P. modestus were enriched in negative regulation of RNA metabolic and biosynthetic process, nucleic acid-templated, transcription DNAtemplated, nucleobase-containing, biosynthetic process, and cellular macromolecule (Supplementary Table S8).

Phylogenetic relationships and divergence time
All genomes had 281 single-copy orthologous gene families, which were used to construct a phylogenetic tree and estimate divergence time. The TimeTree database [56] was used to take calibration times between L. calcarifer-S. maximus, K. marmoratus-O. latipes, and T. rubireps-T. nigroviridis divergence as 70-94, 76-114, and 42-59 MYA. As shown in Fig. 4, the infraclass Teleostei was separated at ∼320 MYA, consistent with the previous study [57], the order Cypriniformes at ∼287 MYA, the order Esociformes at ∼224 MYA, and the order Gobiiformes at ∼141 MYA. P. modestus clustered with the other species in the order Gobiiformes, and diverged from P. magnuspinnatus and B. pectinirostris during the late and mid-Cenozoic era (15 and 25 MYA), respectively.

Gene family expansion and contraction
Orthologous gene families among the 15 Actinopterygii genomes were used for analyzing gene family expansion and contraction. The number of expanded and contracted gene families of P. modestus with its common ancestor were 411 and 225, while those of P. magnuspinnatus, the closest genome, were 257 and 442, respectively (Fig. 4). The expanded gene families of P. modestus were enriched in base-excision repair, transmembrane receptor protein tyrosine kinase signaling pathway, and enzyme linked receptor protein signaling pathway (Supplementary Table S9), while the contracted gene families of P. modestus were in FMN binding, ion binding, and reactive oxygen species metabolic process (Supplementary Table S10). Supplementary Fig. S4 shows a word cloud for GO term description enriched in unique, expanded, and contracted gene families of P. modestus.

Conclusions
We present a chromosome-level high-quality genome assembly of P. modestus with N50 length of 33 Mb using Illumina, PacBio, 10X, Hi-C, RNA, and Isoform sequencing, respectively. The completeness of the genome was confirmed by the BUSCO score of 96.3%. The top 23 longest scaffolds were >20 Mb in size and close to the estimated genome size of 728 Mb. P. modestus had various repetitive elements in 43.8% of the genome and more repetitive elements than the 16 Actinopterygii genomes. We predicted 34,871 protein-coding and 7,865 non-coding genes, and 93% of the protein-coding genes had homology to the 14 Actinopterygii transcriptomes. This dataset will provide a valuable resource for further studies on sea-land transition, bimodal respiration, nitrogen excretion, osmoregulation, thermoregulation, vision, and mechanoreception.

Data Availability
All raw sequencing reads underlying this article are available in the NCBI SRA (Supplementary Table S2) and can be accessed with BioProject No. PRJNA660579. The assembled genome was submitted to NCBI Assembly. Gene annotation and transcript sequences are provided as supplementary files. JBrowse [58] was set up on ht tp://magic.re.kr/gbrowser/jb/mabik/?data=shuttles_hoppfish. All supporting data and materials are available in the GigaScience Gi-gaDB database [59].

Additional Files
Supplementary Figure S1   ; the second (red) and third (blue) numbers represent the number of expanded and contracted, respectively, gene families identified by CAFE; the geologic timescale, earth impacts, oxgen, carbon dioxide, and solar luminosity were generated on the TimeTree database.