A high-quality chromosome-level genome assembly of rohu carp, Labeo rohita, and its utilization in SNP-based exploration of gene flow and sex determination

Abstract Labeo rohita (rohu) is a carp important to aquaculture in South Asia, with a production volume close to Atlantic salmon. While genetic improvements to rohu are ongoing, the genomic methods commonly used in other aquaculture improvement programs have historically been precluded in rohu, partially due to the lack of a high-quality reference genome. Here we present a high-quality de novo genome produced using a combination of next-generation sequencing technologies, resulting in a 946 Mb genome consisting of 25 chromosomes and 2,844 unplaced scaffolds. Notably, while approximately half the size of the existing genome sequence, our genome represents 97.9% of the genome size newly estimated here using flow cytometry. Sequencing from 120 individuals was used in conjunction with this genome to predict the population structure, diversity, and divergence in three major rivers (Jamuna, Padma, and Halda), in addition to infer a likely sex determination mechism in rohu. These results demonstrate the utility of the new rohu genome in modernizing some aspects of rohu genetic improvement programs.


Introduction
Labeo rohita (rohu; rui), a carp naturally found in the Indo-Gangetic and surrounding river systems , is an important aquaculture fish in many areas of South Asia (FAO 2020). The annual aquaculture production of L. rohita in Bangladesh was 386.3 thousand tonnes in the 2019-2020 fiscal year, the second-highest among all aquaculture species in the country (DoF 2020). Annual aquaculture production of the species is approximately 2.0 million metric tons (Mt) globally, a volume comparable with Salmo salar (Atlantic salmon; 2.4 Mt); however, study and understanding of L. rohita genomics is not commensurate with its global significance (Rasal and Sundaray 2020). Although there is increasing interest in applying next-generation sequencing (NGS) and other highthroughput methods to L. rohita (Robinson et al. 2014;Rasal et al. 2017;Hamilton et al. 2019;Sahoo et al. 2021), to date, most studies have been conducted in the absence of a genome sequence. Recently, a draft genome was published for L. rohita  to provide a unifying resource for NGS analysis; however, the quality of the genome limits the development of a robust genomic framework for the species.
Genetically improved L. rohita seed is increasingly available to farmers, from both mass-selection (e.g. "Subarna Rohu" in Bangladesh and "Ayeyarwady Hatchery" in Myanmar) (Hamilton 2019;SZA 2021) and family-based (i.e. pedigree-based) improvement programs (e.g. "Jayanti" in India and "WorldFish Genetically Improved Rohu" in Bangladesh) (Das Mahapatra et al. 2007;Rasal et al. 2017;Hamilton et al. 2019;Hamilton et al. 2022). However, genomic methods routinely applied in other aquaculture species (e.g. parentage assignment and genomic selection) have yet to be routinely applied in L. rohita genetic improvement programs (Sahoo et al. 2017;Rasal and Sundaray 2020), primarily due to a historical focus on improving growth rate (directly assessable at low cost on selection candidates), limited financial resources, and the absence of a genome sequence. As existing family-based programs expand to include additional traits (e.g. carcass traits, feed conversion ratio, tolerance to extreme environments, and disease resistance) (Rasal and Sundaray 2020), the advantages afforded by improved genomic resources in L. rohita will become increasingly compelling.
The mechanism of sex determination (SD) in L. rohita is a lingering question with applications to aquaculture, as understanding SD mechanisms in other species has been used to prevent precocious maturation, exploit sexual dimorphism in growth rate, improve carcass quality, and protect both environmental values and intellectual property (Budd et al. 2015). Despite its relevance to aquaculture and genetic improvement, SD in L. rohita has been understudied (Sahoo et al. 2021) both due to the high diversity of teleost SD mechanisms (Heule et al. 2014) and the lack of high-quality genomic resources (Sahu et al. 2013).
Here we present a new de novo high-quality genome for L. rohita that improves sequence contiguity and reduces duplication. We use this reference to assess diversity among populations of L. rohita from three different rivers and to preliminarily describe the gametic system of SD in L. rohita, demonstrating the utility of this improved sequence to increase understanding and facilitate aquacultural production and genetic improvement.

Sample collection, DNA extraction, and sequencing
Blood samples were collected from five male Labeo rohita (henceforth referred to as Rohu-1 through Rohu-5) from a fish farm located in the District of Rangpur, Bangladesh. The fish were handled as per guidelines of the Ethics Standard Review Committee of Bangladesh Agricultural University (BAU) involving fish and animals (approval no. BAURES/ESRC/2019/Fish/01). Each fish was euthanized using clove oil, dissected, and blood was collected from the heart using a syringe. Each blood sample was placed in an ethylenediaminetetraacetic acid containing vial, and vials were shipped in an insulated container to Mississippi State University for DNA extraction.
High-molecular-weight (HMW) genomic DNA for whole genome sequencing was extracted from 150 µl of blood from Rohu-1 using CTAB lysis buffer followed by the phenol/chloroform purification procedure (Doyle and Doyle 1987). The concentration and purity of extracted genomic DNA samples were measured by a NanoDrop ND-1000 spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA). The quality of genomic DNA was validated by electrophoresis on a 0.8% w/v agarose gel.
The genomic DNA from Rohu-1 was used to prepare 10 Oxford Nanopore R9.4 MinION flow cells. For each flow cell, 2 to 2.5 µg of genomic DNA and a Nanopore Genomic DNA Ligation Sequencing Kit SQK-LSK 109 (Oxford Nanopore Technologies, Oxford, UK) were used to create a DNA library. For each of the 10 libraries, 700-750 ng of DNA was loaded onto a Nanopore Flow Cell R9.4.1 (Oxford Nanopore Technologies, Oxford, UK) and sequenced on a GridION sequencer (Oxford Nanopore Technologies, Oxford, UK) for 48 h.
Rohu-1 genomic DNA was also sequenced on an Illumina HiSeq X-Ten (2 × 150 bp). In brief, 2 µg of Rohu-1 genomic DNA was used with an Illumina TruSeq DNA PCR-free Library Prep Kit (Illumina, San Diego, CA, USA) to create an Illumina sequencing library. The final DNA-Seq library, which had an insert size range of 350-450 bp, was submitted to Novogene (www.en.novogene.com) for two lanes of PE150 on an Illumina HiSeq X-Ten (Illumina, San Diego, CA, USA) sequencer.
A Hi-C library also was prepared using 100 µl of Rohu-1 blood with the Proximo Hi-C Animal Kit (Phase Genomics, Seattle, WA, USA). The final Hi-C DNA-Seq library was submitted to Novogene (www.en.novogene.com) for one lane of PE150 Illumina HiSeq X-Ten (Illumina, San Diego, CA, USA) sequencing.
Lastly, Rohu-1 blood cells were embedded in agarose and HMW DNA was isolated according to the Bionano Prep Frozen Blood Protocol (Bionano Genomics, San Diego, CA) . The extracted DNA molecules were labeled with the Direct Label and Stain (DLS) DNA Labeling kit (Bionano Genomics, San Diego, CA). Once labeled and stained, the DNA was imaged on the Bionano Saphyr instrument (Bionano Genomics, San Diego, CA).

Genome size estimation
The genome size of L. rohita was estimated using two independent methods: flow cytometry and k-mer profiling.

Assembly and annotation
Nanopore sequence data was filtered to remove the control lambda-phage and sequences shorter than 1,000 bases using the nanopack tool suite [v1.0.1] (De Coster et al. 2018). Trimmomatic [v0.32] (Bolger et al. 2014) was used to remove adapters, trim lowquality bases, and filter out reads shorter than 85 bp. The filtered nanopore data were assembled into contigs using wtdbg2 [v2.4] (Ruan and Li 2020 (Smit et al. 2013) were used to create a species-specific repeat database, and this database was subsequently used by RepeatMasker to mask those repeats in the genome. All available RNA-seq libraries for L. rohita (comprising brain, pituitary, gonad, liver, pooled, and whole body tissues for both sexes; Supplementary  (Holt and Yandell 2011) predicts genes based on the new Augustus, GeneMark, and SNAP models derived from Braker2 along with the Mikado predicted transcripts as an external ab-initio source, modifying the predictions based on the available RNA and protein evidence from the Cyprinidae family in the NCBI RefSeq database. Any predicted genes with an annotation edit distance (AED) above 0.47 were removed from further analysis. The remaining genes were functionally annotated using InterProScan  (Manni et al. 2021) was used to verify the completeness of both the genome and annotations against the actinopterygii_odb10 database. Lastly, genes spanning large gaps or completely contained within another gene on the opposite strand were removed using a custom Perl script (https:// github.com/IGBB/rohu-genome/).

Comparative genomics
The assembly statistics, length distributions, BUSCO completeness scores, and sequence similarity via dot-plots were compared between the IGBB L. rohita genome (reported here) and the L. rohita genome reported by Das et al. (2020) (CIFA, Refseq accession GCA_004120215.1), as well as all 12 annotated Cypriniformes genomes from NCBI (Table 1). Assembly statistics were calculated using abyss-fac from ABySS [v2.3.4] (Jackman et al. 2017). Length distributions were calculated using samtools [v1.9] (Danecek et al. 2021) and graphed using R [v4.0.2] (R Core Team 2020) with the tidyverse package (Wickham et al. 2019). Minimap2 [v2.17-r941] and the pafCoordsDotPlotly R script (https://github.com/tpoorten/ dotPlotly) were used to create dot-plots. For the Cypriniformes data-set, only chromosome level assemblies were included in the dot-plots. The Danio rerio (zebrafish) and Triplophysa tibetana genomes were also excluded from the dot-plots since few of the alignments passed the default quality filter in pafCoordsDotPlotly. BUSCO with the actinopterygii_odb10 database was used to find the BUSCO scores for each genome. The annotated genes from this new assembly were also compared to all annotated Cypriniformes using OrthoFinder [v2.5.4] (Emms and Kelly 2019).

ddRAD-seq sample collection and library prep
Fin clips were taken from the founders of the WorldFish Rohu Genetic Improvement Program, as described in Hamilton et al. (2019). A custom R script (https://github.com/IGBB/rohugenome/) was used to minimize sampling putatively related founders . In total, fin clips from 64 male and 56 female L. rohita were sampled, sourced from the Halda (39), Jamuna (38), and Padma (43) rivers.
Genomic DNA was extracted from the samples using the Qiagen DNeasy Blood & Tissue Mini kit (Qiagen, Valencia, CA, USA). The concentration and purity of extracted genomic DNA samples were evaluated using a NanoDrop ND-1000 spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA). The quality of genomic DNA was validated by electrophoresis on a 0.8% w/v agarose gel. The ddRAD-Seq libraries were made using the method described in Magbanua et al. (2022) with minor modifications. Briefly, NsiI and MspI were used to digest the genomic DNA and the adapters (Supplementary Table 2) were ligated into the digested genomic DNA. Polymerase chain reaction was used to attach the i5 and i7 index primers [Nextera XT Index Kit v2 Set A (FC-131-2001) and Set B (FC-131-2002), Illumina, San Diego, CA] to the ligation products to provide unique dual barcodes to each sample while generating the sequencing libraries. The libraries were submitted to Novogene (www.en.novogene. com) for a total of two lanes of PE150 Illumina HiSeq X-Ten (Illumina, San Diego, CA, USA) sequencing.

Single-nucleotide polymorphism discovery and population analyses
Reads were mapped to the L. rohita genome using bwa [v0.7.17]. Single-nucleotide polymorphisms (SNPs) were called over two rounds using the Sentieon pipeline (Kendig et al. 2019) [Spack version sentieon-genomics/201808.01-opfuvzr] and following the DNAseq guidelines. Briefly, SNPs were predicted for the ddRAD-seq samples using the DNAseq pipeline for all samples, and these SNPs were used as known sites during base quality score recalibration (BQSR) in the second iteration of the DNAseq pipeline. The final SNP set was filtered via vcftools [Spack version 0.1.14-v5mvhea] (Danecek et al. 2011) to remove sites with insufficient representation (i.e. present in <90% of samples). The filtered SNP set was used with the R packages LEA (Frichot and François 2015) for the population structure analysis and SNPRelate (Zheng et al. 2012) for the principle component analysis. Nucleotide diversity (π) and divergence (πxy, or dxy) were calculated in 10 kb windows using pixy v1.2.5.beta1 (Korunes and Samuk 2021)  The "L" column is an abbreviation of the assembly level: (S)caffold and (C)hromosome.

Sex-associated fragments
To find regions of the L. rohita genome associated with sex, twosample Monte Carlo tests comparing the high-quality read mappings for male and female samples were run for each fragment between the two digestion sites. The digestion site fragments for the L. rohita genome were found using egads (https://github.com/ IGBB/egads). The high-quality read mappings for each sample were calculated by first filtering high-quality (mapq >= 30) mappings using samtools [v1.9] (Danecek et al. 2021), and then using the bedtools [v2.28.0] (Quinlan and Hall 2010) coverage to count the number of mappings to each fragment. Given the maximum selected size (613 bp) and the paired read size (300 bp), fragments with less than half of the sequence covered in a sample were removed from further analysis. Fragments with fewer than 50 samples (90% of the smallest sample group) surviving the filter were removed altogether. The fragment read mappings for each sample were normalized based on the total number of high-quality read mappings within a sample. Permutation tests were run on each fragment for 100,000 replicates, and the resulting P-values were adjusted using the Benjamini-Hochberg method.
Fragments with an adjusted P-value less than 0.05 were considered associated with sex. The commands and code used can be found at https://github.com/IGBB/rohu-genome/y-link.

Genome size estimation
The C-value of Labeo rohita was previously reported as 1.99 pg (∼1.95 Gb) based on Feulgen densitometry (Patel et al. 2009) and 1.427Gb in the currently available assembly ); however, our flow cytometry results based on five individuals and our k-mer-based genome size estimation suggest that the L. rohita genome size is 50-65% the size previously reported by Patel et al. (2009) and Das et al. (2020), respectively. Our flow cytometry results indicate a C-value of 0.99 pg (∼0.97 Gb) with a standard deviation of only 0.02 across all measurements (Supplementary  Table 3). Moreover, our k-mer-based estimate using GenomeScope (complete results in Supplementary Table 4 and Supplementary  Fig. 1) is 0.97 Gb, the same value determined by our flow cytometry analysis. Lastly, our final genome assembly size for L. rohita is 0.95 Gb. Notably, the Feulgen densitometry estimate reported in Patel et al. (2009) for a second fish, Labeo catla (synonymous with Catla catla), was also approximately twice that later reported , perhaps suggesting stochastic differences, including cryptic variation in ploidy and/or differences in measurement techniques (Greilhuber 2005). Figure 1 shows the genome size comparison of all samples mentioned above.

Genome assembly and annotation
Genome assembly was started with (a) a total of 130.5 Gb (138X coverage) of Nanopore data, derived from 44.7 million reads, (b) 261 Gb (276X coverage) of Illumina short reads (870 million 150 bp paired-end reads), and (c) 382 million 150 bp paired reads (114 Gb) from a Hi-C library. The initial de novo assembly was generated using the Nanopore data and polished with the short insert Illumina data, resulting in 4,999 contigs with an N50 of 1.28 Mb.
After the Bionano and Hi-C data were incorporated, the total number of sequences dropped to 2,899 and the N50 increased to 29.9 Mb. These sequences were ordered and oriented by RagTag using the Onychostoma macrolepis reference to produce a final assembly with 25 chromosome-length scaffolds (deemed Chr01 through Chr25- Supplementary Table 5) and 2,844 unplaced scaffolds, which ranged in size from 1,479 bp to 7.18 Mb. The chromosome scaffolds were composed of one to eight sequences each, with all but three composed of three or fewer sequences. The final assembled genome size was 945.5 Mbp, representing 97.9% of the estimated genome size (see Table 2 for assembly statistics at each step). RepeatModeler2 predicted 3,851 repeat families. Interestingly, while over three-quarters of the predicted TEs remain uncategorized (due to lack of related representatives), L. rohita has a relative abundance of LTR-retrotransposons vs other types of elements (e.g. LINEs and Class II elements; 730 vs <100 each), which is in contrast to the model fish (i.e. D. rerio), where DNA elements are   (Chang et al. 2022); however, because so few elements are categorized for L. rohita (∼24%), it is impossible to determine if this represents a lineage-specific difference or technical noise. Using these repeats, RepeatMasker masked 41.25% of the genome. The annotation pipeline identified 51,079 primary transcripts, of which 31,274 survived the AED, gap, and overlapping filter criteria. BUSCO analysis shows the genome includes complete copies of 98.1% of the 3,640 orthologs in the actinopterygii_odb10 database with 37 (1%) duplicated. The filtered transcriptome contains 84.5% of the total orthologs complete with 74 (2%) duplicated. An overview of the BUSCO analyses can be found in Table 3.

Comparative genomics
Our assembly (IGBB) was compared with the published and publicly available L. rohita assembly (CIFA), and annotated Cypriniformes assemblies from NCBI that were scaffold level or higher. Both the scaffold N50 and maximum length of the IGBB assembly are 30 Mb longer than the CIFA assembly ( Table 2). The length distributions (Supplementary Fig. 2) show a similar separation, with overall greater contiguity in the IGBB genome. Interestingly, when the two L. rohita assemblies were pairwise aligned and plotted (Fig. 2), the CIFA assembly shows a few large gaps, specifically in Chr09 and Chr19, despite being larger in size. Due to the twofold size difference between the assemblies and the fragmentation of the CIFA assembly, the inverse comparison (i.e. IGBB aligned to CIFA) was not informative. Dot-plot alignments of the chromosome level Cypriniformes assemblies ( Supplementary Fig. 3) generally exhibited similar chromosome structures, with some duplications and/or rearrangements. The assemblies for Danio rerio and Triplophysa tibetana were removed from the dot-plot grid since very few of the alignments passed the graphing threshold. Comparing the BUSCO results for the L. rohita assemblies, the IGBB assembly had fewer duplicate, fragmented, and missing BUSCOs than the CIFA assembly. Furthermore, the IGBB assembly had the most single-copy BUSCOs of any Cypriniformes (Fig. 3), even surpassing the model fish D. rerio. Notably, Carassius auratus and Cyprinus carpio are both allotetraploid fishes (Xu et al. 2019;Braasch 2020) and therefore exhibit a good deal of duplication in the dot-plots and BUSCO results. Lastly, the annotations for the Cypriniformes were compared using OrthoFinder. Of the 31,274 genes annotated, 29,904 (95.6%) were placed into 18,740 orthogroups, which comprise 63.5% of the total orthogroups found. Table 4 contains the summary statistics for all species used in the OrthoFinder analysis.

SNP discovery and population similarities among L. rohita fisheries
Aquaculture is an agricultural growth industry, producing 46% of the fish consumed worldwide. Over 50 million tonnes of finfish are raised in aquaculture each year, with the vast majority of aquaculture occurring in Asia (FAO 2020). Farm-raised L. rohita comprises 3.7% of the finfishes produced annually and represents the 11th most commonly farmed finfish (FAO 2020). Consumer preferences have been surveyed, identifying traits (e.g. length and weight) to prioritize in improvement programs (Mehar et al. 2022) along with disease resistance, some of which may be multigenic and complex. Genetically improved L. rohita seed is increasingly available to farmers (Das Mahapatra et al. 2007;Rasal et al. 2017;Hamilton 2019;Hamilton et al. 2019;SZA 2021;Hamilton et al. 2022); however, there is interest in further improving the characteristics of farmed L. rohita. Here we used ddRAD-sequencing in conjunction with the reference genome to provide insight into diversity and divergence among L. rohita in the Halda, Jamuna, and Padma river systems. Patterns of divergence between the river systems ( Fig. 4a and Supplementary Table 6, Supplementary Fig. 4a, Supplementary  Fig. 5a) suggest that the geographically proximal Padma and Jamuna river systems (the Jamuna flows into the Padma) exhibited far less differentiation than either does to the hydrologically isolated and geographically distant Halda river system. While this pattern is similar to what was observed with silicoDArT markers , the greater number of nuclear sites surveyed here (i.e. 1.4 million) suggests that the differentiation between fish inhabiting these river systems is somewhat greater than previously reported using <2,000 SNP sites (Supplementary Table 6). These results (i.e. low differentiation between Padma  and Jamuna and greater differentiation than previously reported) are congruent with an analysis of population structure (k = 2) that reveals similar profiles for Padma-and Jamuna-based fish and a more divergent profile for fish from the Halda river system (Fig. 4b), which is reiterated in a principal component analysis (PCA) of fish from these rivers where fish from the Halda river system were adjacent to, but not intermingled with, fish from the Padma and Jamuna river systems (Fig. 4c). Interestingly, however, population structure analysis reaches optimization for these fish at k = 1 (Fig. 4d), and the proportion of variation explained by the first two principal components is low (∼1.5% total), possibly indicating greater than expected admixture between Halda fish and those from the other two, geographically distant rivers (Fig. 4a). Diversity among fish within each river was remarkably similar, ranging from 0.00063 in Halda to 0.00065 in Jamuna (π; Supplementary Table 7, Supplementary Fig. 4b, Supplementary  Fig. 5b). Notably, these estimates were nearly identical to the estimates of between-population divergence (πxy; Supplementary Table 8, Supplementary Fig. 4c, Supplementary Fig. 5c), which was 0.00064 for Padma-Halda and 0.00065 for both Padma-Jamuna and Jamuna-Halda, possibly indicating that these river populations are still representative of their shared ancestry. Diversity within populations (π) and divergence between (πxy) populations were distributed relatively evenly across the chromosomes; however, in both cases, chromosomes 3, 4, and 22 were the only chromosomes that exhibited greater than average π and πxy, possibly indicating differences in selection and/or permeability on those chromosomes. Interestingly, while Fst for chromosomes 3 and 4 were not considerably different from many of the other chromosomes, chromosome 22 exhibited the greatest relative population divergence (Fst; Supplementary

Sex-associated fragments
The genetics underlying SD in fish can be complicated and variable even within species (Devlin and Nagahama 2002;Volff et al. 2007;Parnell and Streelman 2013;Heule et al. 2014;Nguyen et al. 2021); however, controlling the sex ratio is essential to optimizing farming of finfish (Martínez et al. 2014). L. rohita breeding, for example, requires specific environmental conditions [i.e. monsoon (Qasim and Qayyum 1962;Natarajan and Jhingran 1963)], which is currently circumvented using hormonal induction (Bhattacharya 1999).
Despite its importance to aquaculture, the mechanisms governing SD in L. rohita are currently unknown. Karyotypic analyses suggest that if L. rohita has sex chromosomes, they are likely homomorphic (Bhatnagar et al. 2014), similar to many other fish (Heule et al. 2014), and are indistinguishable from the remaining chromosomes. We screened the L. rohita genome for regions linked to sex by evaluating read mapping in each ddRAD region from female vs male fish. Between 9.8 and 23.4 million (M) reads were uniquely mapped per sample to the 473,345 genomic regions Fig. 4. a) Map of the river locations (dot), diversity (π) within each population (number within fish), and divergence (πxy) between each population (number between fish); b) LEA predicted population structure (k = 2) separated by river of origin. The vertical columns show the proportion (Q) assigned to each population for each individual; c) PCA plot of the filtered SNP set, colored by river of origin (D) Cross-entropy summary for the LEA analysis, using K = 1 to 10 with 10 repetitions each.
occurring between the two restriction sites, as predicted by egads. Approximately 42% of these regions (200,543) had at least 50 samples with >50% of the region covered and were retained for two-sample Monte Carlo testing. Monte Carlo testing highlighted 25 fragments from three chromosomes/scaffolds (Chr25, scaf-fold_1958, and scaffold_971) as significantly (BH adj. P-value <= 0.05) different between females and males with respect to read coverage ( Fig. 5 and Supplementary Table 9). Interestingly, the seven significant fragments on Chr25 are (1) contiguous, (2) cover approximately 30 kb (26,052,083,955), and (3) have no female samples mapping, suggesting that this may be a malespecific region of chromosome 25. The five fragments on scaf-fold_1958 show a similar pattern, albeit with a shorter total length (6.1 kb) and with female reads present, although significantly diminished, for a single region of the scaffold (∼100 bp). Conversely, both male and female samples map to the 13 significant regions of scaffold_971 with reasonable coverage; however, the female samples generally had around double the mapping rate relative to the male samples, suggesting that this region may be represented by a greater copy number in females vs males. Together, these results suggest that L. rohita has a maleheterogametic (XX/XY) system of SD. Furthermore, since the sex chromosomes are indistinguisable by karyotype (Bhatnagar et al. 2014) and the uniquely male regions comprise only a small region of the chromosome, L. rohita may only have a Y-specific region (or "young Y"), similar to Oryias latipes (Kondo et al. 2006); however, sequences similar to the O. latipeshomolog for male-determination (i.e. dmY (Matsuda et al. 2002;Hornung et al. 2007)) were not found in the L. rohita assembly, indicating that further study is needed.

Conclusion
Despite its importance to aquaculture, Labeo rohita has only recently been studied using modern molecular techniques. Our flow cytometry, k-mer analysis, and NGS assembly of the L. rohita genome indicate a genome size of 0.97 Gb, a size 50-65% smaller than previously reported. Our IGBB reference-quality genome for L. rohita both improves contiguity and removes the excessive redundancy of the previously existing CIFA draft genome sequence. The IGBB reference genome is a valuable resource for breeding programs and evolutionary biologists as demonstrated in our initial ddRAD-seq experiments. We find that, while fish from the connected rivers (Jamuna and Padma) are more similar in relative divergence, there remains a question of whether these populations are recently diverged from the Halda river system or if there remains some gene flow between all three, despite the hydrological and geographical isolation of the Halda. We also report candidate regions for SD in L. rohita that may underlie a maleheterogametic (XX/XY) system. While greater sampling will be required to understand the genetics underlying L. rohita SD and the population dynamics of these river systems, the present information provides a foundation for breeders to facilitate aquacultural improvement of L. rohita.

Data availability
The data used for the Labeo rohita genome and annotation are available at NCBI under the BioProject PRJNA650519. The assembled genome sequence and annotations are available at GenBank under accessions JACTAM000000000. The raw data is available at the SRA (Sequence Read Archive) under accessions SRR12580210-SRR12580221. The ddRAD-seq data used for SNP discovery, population analyses, and sex-associated fragment analysis are available under the BioProject PRJNA841581 and the SRA accessions SRR19358298 -SRR19358417. The RepeatModeler and RepeatMasker analysis output, along with the unfiltered ddRAD vcf, are available at https://doi.org/10. 5281/zenodo.7377776.

Fig. 5.
Regions of statistically significant differences between male and female L. rohita read counts. Each pair shows the average read counts for male and female samples for a ddRAD fragment. Opacity of the bars vary with the number of samples present. The fragments are ordered but not spaced according to position. Contiguous statistically significant fragments are outlined. staff based in Jashore, Bangladesh for managing and sampling fish, and Mahirah Mahmuddin for sample collation and management.

Funding
The authors declare that this work was in part supported through the United States Agency for International Development (USAID) "Innovate4Fish Feed the Future Fish Innovation Lab-Quick Start" (Grant ID 7200AA18CA00030). Fin clip samples were collected with funding from the United States Agency for International Development (USAID) Aquaculture for Income and Nutrition project (Grant ID EEM-G-00-04-00013-00).

Conflicts of interest
The authors declare no conflict of interest.
Supplemental material available at G3 online.