Chromosome-level genome assembly of the female western mosquitofish (Gambusia affinis)

Abstract Background The western mosquitofish (Gambusia affinis) is a sexually dimorphic poeciliid fish known for its worldwide biological invasion and therefore an important research model for studying invasion biology. This organism may also be used as a suitable model to explore sex chromosome evolution and reproductive development in terms of differentiation of ZW sex chromosomes, ovoviviparity, and specialization of reproductive organs. However, there is a lack of high-quality genomic data for the female G. affinis; hence, this study aimed to generate a chromosome-level genome assembly for it. Results The chromosome-level genome assembly was constructed using Oxford nanopore sequencing, BioNano, and Hi-C technology. G. affinis genomic DNA sequences containing 217 contigs with an N50 length of 12.9 Mb and 125 scaffolds with an N50 length of 26.5 Mb were obtained by Oxford nanopore and BioNano, respectively, and the 113 scaffolds (90.4% of scaffolds containing 97.9% nucleotide bases) were assembled into 24 chromosomes (pseudo-chromosomes) by Hi-C. The Z and W chromosomes of G. affinis were identified by comparative genomic analysis of female and male G. affinis, and the mechanism of differentiation of the Z and W chromosomes was explored. Combined with transcriptome data from 6 tissues, a total of 23,997 protein-coding genes were predicted and 23,737 (98.9%) genes were functionally annotated. Conclusions The high-quality female G. affinis reference genome provides a valuable omics resource for future studies of comparative genomics and functional genomics to explore the evolution of Z and W chromosomes and the reproductive developmental biology of G. affinis.


Background
The western mosquitofish (Gambusia affinis) is a well-known invasive species of the Poeciliidae family, native to North America. To date, G. affinis has invaded many countries worldwide, competing successfully with native fish everywhere and destroying the ecological balance, leading to recognition by the World Conservation Union as one of the world's top 100 invasive alien species.

DNA library construction and sequencing
Genomic DNA was extracted from the whole body (excluding the gut) using a Qiagen GenomicTip100 (Qiagen, Hilden, Germany). The Illumina TruSeq Nano DNA Library Prep Kit (Illumina, CA, USA) was used to construct an Illumina library with insert sizes of 350 bp, which were then sequenced on an Illumina NovaSeq platform (150-bp paired-end reads). The raw data were filtered using the following strategies: (i) filtered reads with adapters; (ii) removing reads with ≥10% unidentified nucleotides (N); (iii) removing reads with >50% of bases having a phred quality of <5; (iv) removing reads with >10 nt (nucleotide) aligned to the adapter, allowing ≤10% mismatches; and (v) removing putative PCR duplicates generated by PCR amplification in the library construction process. Clean reads were used for subsequent kmer analysis and nanopore data correction.
Approximately 8 μg of genomic DNA was prepared; Blue Pippin (Sage Science, Beverly, MA, USA) and Ligation sequencing 1D kit (SQK-LSK108; ONT, UK) were used for size selection (>10 kb) and nanopore library construction according to the manufacturer's instructions. Two nanopore libraries were constructed and sequenced on 2 different FlowCells using the PromethION sequencer (ONT). Subread quality control was subsequently executed on fast5 files using ONT Albacore software (v0. 8.4) [13], and the "passed filter" reads (higher quality reads) were used for subsequent analysis.

RNA library construction and sequencing
For RNA analyses, 6 tissues (brain, liver, heart, gills, gonads, and muscles) were extracted using an RNeasy Plus Mini Kit (Qiagen) from 5 individuals. The RNA purity, degradation/contamination, concentration, and integrity were measured using NanoDrop One (Thermo Fisher Scientific, MA, USA), 1% agarose gels, Qubit RNA Assay Kit with a Qubit 3.0 Fluorometer (Life Technologies, CA, USA), and the RNA Nano 6000 Assay Kit with a Bioanalyzer 2100 system (Agilent Technologies, CA, USA), respectively. The RNA quality criteria for the RNA samples were as follows: RNA integrity number >8.0 and OD 260/280 between 2.0 and 2.2. Validated RNA samples (from brain, liver, heart, gill, gonad, and muscle tissues) were used for Illumina library construction and sequencing and Pacific Biosciences (PacBio) library preparation (pooled samples), construction, and sequencing.
For Illumina paired-end sequencing (Illumina Novaseq platform, 150-bp), the complementary DNA (cDNA) library was prepared using a TruSeq Sample Preparation Kit (Illumina). The clean data were obtained by removing reads containing adapters, reads containing poly-N, and low-quality reads from the raw data. Qualified RNA from 6 tissues were mixed in equal amounts, reverse-transcribed using a Clontech SMARTer PCR cDNA Synthesis Kit (TaKaRa, Beijing, China), and subjected to PCR amplification using a PrimeSTAR GXL DNA polymerase, and the obtained 0.5-6-kb fragments were retained for PacBio sequencing library construction using a SMRTbell Template Prep Kit (PacBio, CA, USA). Finally, a library for single-molecule realtime sequencing (SMRT) cell was sequenced using polymerase and V2.1 chemistry on a PacBio Sequel platform with 10 h of movie time.

Genomic features from k-mer analysis and nanopore assembly building
Clean reads obtained from the Illumina NovaSeq platform were applied to estimate the genome size and heterozygosity of the western mosquitofish by k-mer analysis (17-mer frequency distribution) using jellyfish (jellyfish, RRID:SCR 005491) v2.0 [14].

Genome scaffolding with BioNano auxiliary assembly
High molecular weight DNA was isolated from the whole body (excluding the gut) and then labeled with Labeling Master Mix and DLE-1; next, the DNA was imaged automatically with a Bio-Nano Saphyr system. BioNano raw BNX files were de novo assembled into genome maps with BioNano Solve [20]. The sorted and autodenoised single molecules were subjected to pairwise comparisons by RefAligner [21] to identify molecule overlaps, and consensus maps were constructed. All molecules were then mapped back to the consensus maps, and the maps were recursively refined and extended (2 times).

Chromosomal-level genome assembly by Hi-C
The Hi-C library was prepared following a previously described procedure [22] with some modifications. Briefly, fresh wholebody samples (excluding the gut) were cut into 2-cm pieces and treated with 1% formaldehyde for 10 min at room temperature to induce cross-linking. The reaction was quenched by adding 2.5 M glycine to a final concentration of 0.2 M for 5 min. Nuclei were digested, marked, and ligated using DpnII, biotin-14-dCTP (Invitrogen, Carlsbad, CA, USA), and T4 DNA Ligase, respectively. After incubation overnight for reverse cross-linking, the ligated DNA was sheared into 300-600-bp fragments. The DNA fragments were blunt-end repaired and A-tailed, followed by purification through biotin-streptavidin-mediated pulldown. Finally, the Hi-C libraries were quantified and sequenced using an Illumina Hiseq platform (150-bp paired-end reads).
As a result, 141 million uniquely mapped paired-end reads were generated, of which 76.82% were valid interaction pairs. Combined with the valid Hi-C data, we subsequently used the LACHESIS (LACHESIS, RRID:SCR 017644) [25]  The interaction heat map of the initial assembly results of LACHESIS was drawn, according to the interaction between different scaffolds, the position and direction of the scaffolds that obviously did not meet the chromosome interaction characteristics in the figure were adjusted. Of note, if there were situations in a scaffold itself that did not meet the chromosome interaction characteristics, the scaffold was interrupted. Next, the scaffolds were adjusted separately until the overall heat map conformed to the characteristics of chromosome interaction.
We used the same method to assemble the genome of a published male G. affinis ( GCA 003097735.1) and obtained chromosomal-level genomic data.

Evolutionary and comparative genomic analyses
We used OrthoMCL (OrthoMCL, RRID:SCR 007839) version 2.0.9 [51] to cluster the female G. affinis annotated genes with an Evalue cutoff of 1e-5 and Markov chain clustering with default inflation parameters for an all-to-all BLASTP (BLASTP, RRID:SCR 0 01010) analysis of entries for the reference genomes of 11 fishes, including G. affinis in this study and 10 other published fishes reported to date (Poecilia reticulata, P. formosa, Poecilia latipinna, Poecilia mexicana, Xiphophorus couchianus, N. furzeri, Cyprinodon variegatus, Fundulus heteroclitus, Lepisosteus oculatus, and Oreochromis niloticus). CAFE (CAFE, RRID:SCR 005983) v4.0.1 [52] was used to identify expanded and contracted gene families in G. affinis, and these data were then used for GO and KEGG enrichment analyses.
The orthologous genes were then used in PAML codon substitution models and likelihood ratio tests (codeml) based on the branch-site model to calculate Ka and Ks, yielding positively selected genes, which were then used for GO and KEGG enrichment analyses.

Recognition and comparison of the W and Z chromosomes of G. affinis
Mummer (Mummer, RRID:SCR 018171) v3.0 [58] was used for aligning entire genomic DNA sequences from X. couchianus, and male and female G. affinis, to make the chromosome numbering system of both species the same. The W chromosome was identified by the specificity of the female G. affinis W chromosome length, and then, the Z chromosome in the male G. affinis was identified on the basis of synteny. Mummer was also used for aligning entire genomic DNA sequences from the Z and W chromosomes, and Circos plot distributions of homologous sequence pairs among the Z and W chromosome pairs were plotted using Circos (Circos, RRID:SCR 011798) v0.69-6 [59].
According to the results of the RepeatMasker analysis of female and male G. affinis genomes, the length and distribution of TEs on chromosomes Z and W were compared. OrthoFinder (OrthoFinder, RRID:SCR 017118) v2.3.8 [60] was used to compare the genes on chromosomes Z and W.

TE insertion time analyses
We calculated female and male G. affinis TE insertion times in genomes using the algorithm T = K/2r, where K is the Kimura distance-based copy divergence of TEs and r is the nucleic acid substitution rate. The K-value was obtained from Re-peatMasker. To estimate r-values for G. affinis, we used LASTZ (LASTZ, RRID:SCR 018556) v1.04.00 [61], chainNet (v2) [62], and MULTIZ (v11.2) [63], along with genomes used in evolutionary analyses and the female G. affinis genome as a reference sequence. With the whole-genome alignments, we used the msa view tool in the PHAST package (PHAST, RRID:SCR 00320 4) v1.2.1 [64] to extract 4D site alignments, based on the female G. affinis gene annotations. The phyloFit program in the PHAST package was used to estimate the phylogenetic model, with tree topology (result of evolutionary analyses) as an input parameter. The branch length results were represented as units of substitutions per site. We calculated the root-to-tip substitution rates from the most recent common ancestor of selected species to each fish lineage, and then divided the rootto-tip substitution rates by the divergence time (314.47 million years ago [Mya]) of most recent common ancestor of selected species.

Female G. affinis genome initial characteristics
A total of 30.5 Gb Illumina clean reads were used for analyzing female G. affinis genome size and heterozygosity using k-mer analysis. Based on 26,474,864,304 17-mers and a peak 17-mer depth of 37 ( Supplementary Fig. S1), the estimated heterozygosity rate was ∼0.42%, and the estimated genome size of female G. affinis was ∼715 Mb. Of note, the estimated genome size is similar to that of the nuclear DNA content estimated in a previous study using flow cytometry (0.75 pg, ∼733 Mb) [65].

De novo assembly of a female G. affinis reference genome
Next, 74.4 Gb (2,866,145 reads, average read length of 25.98 kb, N50 35.86 kb, and longest read length of 273 kb) ONT clean long reads were used to construct the reference genome. We obtained a 662-Mb genomic DNA sequence by assembly with a contig N50 length of 12.9 Mb.
The long-reads assembly result consisted of 217 contigs, and the longest contig was 28.6 Mb. Then, BUSCO was used to assess the completeness of the assembled genome. Approximately 97.2% of the complete genes could be detected in the genome of female G. affinis, confirming the completeness of the genome. Assembly results of long-reads scaffolds obtained using optical maps were assembled with 80-Gb BioNano molecules. The final assembly contained 125 scaffolds with a scaffold N50 size of 26.4 Mb. Finally, we used the Hi-C technique to anchor the assembly scaffolds in 24 chromosomes of female G. affinis (Supplementary Table S1). We found that 141,528,181 unique mapped paired-end reads were generated and occupied ∼51.5% of the total clean paired-end reads (274,658,176). Then, the frequency of scaffold interactions was estimated on the basis of the pairs mapped to the scaffolds. We found that 113 scaffolds were successfully anchored in 24 chromosomes (Fig. 2, Supplementary Fig. S2a), consistent with the records of the chromosome number by cytogenetic analysis [10,66], representing 90.4% of all scaffolds and 97.9% of all scaffold nucleotide bases. The total assembly size of the chromosomes was ∼679.4 Mb (Table 1). In the male G. affinis genomic DNA sequences, 734 scaffolds were successfully anchored in 24 chromosomes (Fig. 2, Supplementary Fig. S2b), and the total assembly size was 592.7 Mb.

Genome annotation
Assembled chromosome-level genome of female G. affinis was used to predict repeat sequences. In total, 5,630,271 SSRs were identified, including 5,478,552 mono-, 100,272 di-, 28,431 tri-, 19,721 tetra-, 2,048 penta-, and 1,247 hexa-nucleotide repeats. Overall, the combined homology-based and de novo prediction results indicated that TEs accounted for 22.54% of the assembly genome (Supplementary Table S2). Additionally, class I TEs (RNA transposons) occupied ∼5.15% of the assembly genome. The most abundant RNA transposons found in the G. affinis assembly genome were long interspersed nuclear elements, which constituted 54.37% of all identified RNA transposons. The female G. affinis genome was very rich in class II TEs (DNA transposons), which occupied 11.83% of genome content.
For genome annotation, 23,997 protein-coding genes were predicted in the female G. affinis genome. Compared with other existing published poeciliid fish annotated information, the number of genes in female G. affinis was similar to those in P. formosa (23,615 genes) and X. maculatus (23,628 genes) (Supplementary Table S3 and Fig. S3). BUSCO gene prediction was carried out using the actinopterygii odb9 single-copy homologous gene to predict the existing sequence of the genome. Approximately 97% of complete gene components could be found in this gene set, indicating that most of the conserved genes were well predicted and that the prediction results were relatively reliable (Supplementary Table S4). Finally, 23,737 genes were annotated in ≥1 of the databases (KOG, KEGG, NR, Swis-sProt, GO), and up to 98.92% of G. affinis genes were functionally annotated (Supplementary Table S5). Finally, 143 snR-NAs, 220 rRNAs, 371 microRNAs, and 3,885 tRNAs were also identified.

Genome evolution
To determine the evolutionary relationships between G. affinis and other vertebrates, a phylogenetic tree was reconstructed on the basis of 6,457 single-copy orthologous genes from 10 other vertebrate genomes (Fig. 3a). L. oculatus and O. niloticus were used as outgroups. As a species of the family Poeciliidae, G. affinis clustered into 1 branch with other fish from Poeciliidae. Compared with 6 other sequenced members of the Poeciliidae family, G. affinis had a closer relationship with X. couchianus, consistent with previously published phylogenies [67]. Next, a timetree was created on the basis of the above 6,457 single-copy orthologous genes, and the estimated divergence time between G. affinis and X. couchianus was found to be ∼16.57 Mya (Fig. 3; Supplementary  Fig. S3). In addition, the divergence time between G. affinis and 4 other members of the Poeciliidae family was ∼22.75 Mya.
To examine the evolutionary history of gene families, we performed gene family expansion and contraction analysis with the female G. affinis genes. We found 652 expansion gene families and 1,046 contraction gene families (Fig. 3b). Expansion gene families were enriched in 44 GO (Supplementary Table S6) categories and 34 KEGG pathways (Supplementary Table S7), most of which were related to oxygen metabolism, olfactory pathways, and visual pathways. Next, codeml was used to calculate the average Ka/Ks values and conduct branch-site likelihood ratio analyses to detect positively selected genes in the female G. affinis genome. The results showed that there were 590 positively selected genes in the female G. affinis genome. The positively selected genes were enriched in 12 GO categories (aspartic-type endopeptidase activity, DNA repair, microtubule binding, insulin-like growth factor binding, tRNA aminoacylation for protein translation, microtubule motor activity, rRNA processing, microtubule-based movement, protein dephosphorylation, protein tyrosine phosphatase activity, chromatin binding, and nucleus) and 3 KEGG pathways (complement and coagulation cascades, peroxisome, and platelet activation).

Recognition and evolution of sex chromosomes
Genetic-controlled sex determination systems in fish are variable, ranging from XX/XY to ZZ/ZW [68]. Fish generally do not have highly morphologically differentiated sex chromosomes, making it difficult to distinguish between autosomes and sex chromosomes. Hence, there are only a few fish species for which there is known information on sex determination mechanisms and sex chromosome systems. Therefore, a suitable experimental model is required for the identification and elucidation of the mechanisms of fish sex chromosome evolution, and the female G. affinis is a suitable and consistent model.
Early karyotype analysis demonstrated that female G. affinis shows heterogamy of the ZW type, and its W chromosome is much longer than other chromosomes [10,66]. The longest sequence was selected from the assembly results at the chromosome level as the W candidate chromosome, and 1 femalespecific DNA marker [69] was used for confirmation. In the end, the marker has been aligned to the W candidate chromosome but was not found in the genomic DNA sequences of the male G. affinis. Analysis of the synteny of the whole genomes of female and male mosquitofish by Mummer demonstrated that the Z chromosome was also present in the male G. affinis genomic DNA sequences (Fig. 2). By comparing the Z and W chromosomes (Fig. 4a), we found the length of the W and Z chromosome repeat sequences to be ∼8.5 and 5.0 Mb, respectively. Among them, the length and content of the Helitron superfamily of the 2 chro-  Homology analysis showed that there were 794 oneto-one pairs. There were 118 and 85 genes on the W and Z chromosomes unassigned to any gene groups, and the others were of the one-to-many and many-to-many types; these results provide research directions for our future analyses on functional genomics (results have been submitted to GigaDB). Some researchers have studied the role of transposons in sex chromosome differentiation, and they found that TEs seem to play an important role in the evolution of sex chromosomes, with their accumulation and loss having huge effects on the lengths of sex chromosomes [70][71][72]. However, differences in TE contents between Z and W chromosomes alone cannot determine the true course of differentiation, e.g., whether the increased length of the W chromosome compared with that of the Z chromosome is caused by extension of the W chromosome or by degeneration of the Z chromosome. There is no substantial evidence to explain this observation. Therefore, the introduction of the time factor is extremely important. Gambusia holbrooki and G. affinis are so closely related that for a long time, biologists thought they were the same species. Phylogenetic analyses estimated that their divergence time was ∼2-7 Mya [69,73,74], and other researchers showed that the XY and ZW sex determination mechanisms had independent origins in G. holbrooki and G. affinis, respectively [68]. Therefore, we speculate that the differentiation of Z and W sex chromosomes is a very recent event. Additionally, previous studies suggested that this process may be enriched on the W chromosome by TEs, leading to an increase in the sex chromosome size during the early phase of differentiation and the subsequent reduction in size later during evolution [75]. If this hypothesis is correct, then we should be able to observe a large number of transposons inserted in the W chromosome in the recent past (between 2 and 7 Mya). Indeed, our results indicated very recent mass insertion events of TEs into the W chromosome (Fig. 4b), and the insertion time characteristics of the TEs into the W chromosome were specific be- cause its insertion time trends were dramatically different from those of autosomal and Z-chromosomal TEs ( Fig. 4c and Supplementary Fig. S5). Moreover, we speculate that most of the long gaps (Fig. 4a) on the W chromosome were also caused by the aggregation of too many highly similar TE sequences to form TE clusters through the recent activation of TEs. Thus, we expected that the TE content of the W chromosome of G. affinis should be much higher than that observed to date. Accordingly, our results showed that the cause of sex chromosome differentiation in female G. affinis was likely to be related to extension of the W chromosome.

Conclusions
In this study, we assembled the chromosome-level female western mosquitofish genome using the most mainstream technology available. In terms of parameters such as contig N50, scaffold N50, and gene annotation number, these are high-quality genomic data. Evolutionary analysis provides ideas for future work; e.g., oxygen transport in mosquitofish deserves attention.
We conducted a preliminary study on W and Z sex chromosome differentiation based on the specificity of the sex chromosome in female western mosquitofish and provided data to support the previous hypothesis that a longer W chromosome is associated with the activity (insertion) of TEs. In conclusion, our highquality genomic data lay the foundation for the study of chromosome evolution, reproductive characteristics, and sexual dimorphism in western mosquitofish.

Availability of Supporting Data and Materials
The raw genome and RNA sequencing data were deposited in the SRA under Bioproject No. PRJNA599452. The chromosome-level genome, annotation, and other supporting data are also available via the GigaScience database, GigaDB [76]. Table S1: Result of female and male Gambusia affinis genomic assembly at chromosome level.   Table S3: Comparative analysis of the annotated gene set of female Gambusia affinis with those of 5 teleosts. Table S4: Assessment of female Gambusia affinis genome completeness by BUSCO. Table S5: Statistics for gene function annotation in female Gambusia affinis genome. Table S6: Expansion gene families of female Gambusia affinis were enriched in 44 GO categories. Table S7: Expansion gene families of female Gambusia affinis were enriched in 34 KEGG pathways. Figure S1: Frequency distribution of the 17-mer graph analysis used to estimate the size of female Gambusia affinis. Figure S2: Western mosquitofish genome scaffold contact matrix using Hi-C data. (a) Female western mosquitofish. (b) Male western mosquitofish. The color bar indicates the contact density from red (high) to white (low). Figure S3: The comparisons of coding sequence length, exon length, exon number, gene length, intron length, and intron number in the genomes of female Gambusia affinis and other teleosts. Figure S4: Divergence time of Gambusia affinis and other fish species. Figure S5: Distribution of transposon activity time for different autosomes of female Gambusia affinis. Additional File: All genes located on the Z and W sex chromosomes with their locations.