Draft genome of Glyptosternon maculatum, an endemic fish from Tibet Plateau

Abstract

Background
Mechanisms for high-altitude adaption have attracted widespread interest among evolutionary biologists. Several genome-wide studies have been carried out for endemic vertebrates in Tibet, including mammals, birds, and amphibians. However, little information is available about the adaptive evolution of highland fishes. Glyptosternon maculatum (Regan 1905), also known as Regan or barkley and endemic to the Tibetan Plateau, belongs to the Sisoridae family, order Siluriformes (catfishes). This species lives at an elevation ranging from roughly 2,800 m to 4,200 m. Hence, a high-quality reference genome of G. maculatum provides an opportunity to investigate high-altitude adaption mechanisms of fishes.

Findings
To obtain a high-quality reference genome sequence of G. maculatum, we combined Pacific Biosciences single-molecule real-time sequencing, Illumina paired-end sequencing, 10X Genomics linked-reads, and BioNano optical map techniques. In total, 603.99 Gb of sequencing data were generated. The assembled genome was about 662.34 Mb with scaffold and contig N50 sizes of 20.90 Mb and 993.67 kb, respectively, which captured 83% complete and 3.9% partial vertebrate Benchmarking Universal Single-Copy Orthologs. Repetitive elements account for 35.88% of the genome, and 22,066 protein-coding genes were predicted from the genome, of which 91.7% have been functionally annotated.

Conclusions
We present the first comprehensive de novo genome of G. maculatum. This genetic resource is fundamental for investigating the origin of G. maculatum and will improve our understanding of high-altitude adaption of fishes. The assembled genome can also be used as a reference for future population genetic studies of G. maculatum.

Background
Mechanisms for high-altitude adaption have attracted widespread interest among evolutionary biologists. Several genome-wide studies have been carried out for endemic vertebrates in Tibet, including mammals, birds, and amphibians. However, little is known about the adaptive evolution of highland fishes. Glyptosternon maculatum (Regan, 1905), also known as Regan or barkley, is a fish endemic to the Tibetan Plateau that belongs to the family Sisoridae, order Siluriformes (catfishes). This species lives at elevations ranging from roughly 2,800 m to 4,200 m. Hence, a high-quality reference genome of G. maculatum provides an opportunity to investigate high-altitude adaption mechanisms of fishes.

Findings
To obtain a high-quality reference genome of G. maculatum, we combined PacBio single-molecule real-time sequencing, Illumina paired-end sequencing, 10X Genomics linked-reads, and BioNano optical map techniques. In total, 603.99 Gb of sequencing data were generated. The assembled genome was about 662.34 Mb with scaffold and contig N50 sizes of 20.90 Mb and 993.67 kb, respectively, which captured 83% complete and 3.9% partial vertebrate Benchmarking Universal Single-Copy Orthologs (BUSCO). Repetitive elements account for 35.88% of the genome, and 22,066 protein-coding genes were predicted from the genome, of which 91.7% have been functionally annotated.

Conclusions
We provide the first comprehensive de novo genome of G. maculatum. This genetic resource is fundamental for investigating the origin of G. maculatum and will improve our understanding of high-altitude adaption of fishes. The assembled genome can also be used as a reference for future population genetic studies of G. maculatum.
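For readers less familiar with the statistic, the scaffold and contig N50 values reported in the Findings are the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The following is a minimal Python sketch of that definition; the function name `n50` and the toy lengths are illustrative assumptions and are not part of the assembly pipeline used in this work.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths, not data from this assembly).
toy_lengths = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_lengths)
```

Applied to the lengths of all scaffolds or all contigs in a FASTA file, the same function would yield the scaffold N50 or contig N50, respectively.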
How many iterations of Quiver and Pilon were performed? Current recommendations are to use Quiver to correct SNPs and indels in PacBio assemblies, then to use Pilon to only correct indels since short Illumina reads may be misaligned in repetitive regions.
Reply: This is an important question. We performed one round each of Quiver and Pilon correction, using PacBio and NGS data, respectively. Because we observed a benefit from indel correction with Pilon in our previous analysis (figure below), both SNPs and indels were corrected in the Pilon step.
Please state whether the Illumina reads that were mapped with BWA were from the reference individual or another individual. Reply: Thanks for the reminder. We used Illumina sequencing reads from the reference individual. We have stated this detail in the revised manuscript.
Please provide discussion as to why some Trinity contigs only aligned at low coverages (75-85%). Reply: Thanks for the reviewer's suggestion. We searched our mRNA sequencing reads against the NT database and found that the top 5 hits were all from closely related fish species, such as Ictalurus punctatus and zebrafish. External contamination was therefore ruled out (SI Table 7). We attribute the low coverage of some Trinity contigs to two reasons: 1) potential chimeric transcripts generated during transcriptome assembly with Trinity, especially for genes with multiple alternative splicing isoforms; 2) fragmentation of the genomic contig sequences, which prevents some assembled transcripts from aligning over their full length. We have discussed the reasons for the low coverage in our manuscript, and the revisions are highlighted in red.
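To make the coverage figure concrete, the sketch below shows one way the aligned fraction of a single Trinity contig could be computed from its alignment blocks. This is a minimal illustration under stated assumptions: the function name `alignment_coverage`, the half-open 0-based interval convention, and the toy coordinates are hypothetical and do not describe the exact procedure used in the manuscript.

```python
def alignment_coverage(contig_length, blocks):
    """Fraction of a contig covered by aligned blocks.

    blocks: iterable of (start, end) half-open intervals on the contig,
    0-based. Overlapping blocks are merged so no base is counted twice.
    """
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    covered = 0
    current_start = current_end = None
    for start, end in sorted(blocks):
        if current_end is None or start > current_end:
            # Close the previous merged interval before starting a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent block: extend the current interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / contig_length


# Toy example (hypothetical coordinates): two overlapping blocks on a 100 bp contig.
toy_coverage = alignment_coverage(100, [(0, 50), (40, 80)])
```

Under this definition, a chimeric transcript or a transcript spanning a gap between genomic contigs would show a coverage below 1.0 even when each aligned block matches well.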
There is confusion in the text when describing Figure 1b and Figure 1c. See first paragraph of Background information, First line of Sample collection and sequencing. Please clarify. Also, can the map be magnified to better locate the location of the reference sample? Reply: Thanks for the reviewer's suggestion. Figures 1a and 1b show G. maculatum, and Figure 1c shows the location of the sample collection. We have magnified the map in Figure 1c to the Tibetan Plateau according to the reviewer's suggestion.
Please provide more justification as to why the species in Figure 2 were chosen. Hopefully it is more than just because the data was available. If the purpose is to focus on the divergence between the two Siluriform catfish then are all the other species necessary? Reply: The reviewer is correct. The divergence between G. maculatum and I. punctatus was the main purpose of the phylogenetic analysis. The analysis provides useful information about species divergence times and the relationships among fish species. For this purpose, we used 12 other fish genomes to construct the phylogenetic tree, not only because genome data were available for those evolutionarily close species, but also because more species (typically 10 or more in previous studies) are needed to calibrate the phylogenetic relationships and species divergence times.
On Line 7 under Functional annotation: "refers" Reply: Thanks for the reviewer's reminder. We have corrected the sentence, and the revision is highlighted in red in the manuscript.
On Line 20: "protein data" Reply: We have corrected this according to the reviewer's comment. The revision is highlighted in red in the manuscript.
Lines 43 and 44: "were" is used twice in the sentence and should only be used once Reply: We have deleted the extra "were" from the sentence. The revision is highlighted in red in the manuscript.
On Line 13 under Conclusion: "Glyptosternoids". Reply: Thanks for the reviewer's correction. The revision is highlighted in red in the manuscript.
Reviewer #2: This is a purely descriptive paper reporting the sequencing and genome annotation of Glyptosternon maculatum, an endemic catfish species from the Tibet plateau. This is a straightforward paper and a valuable resource, which deserves publication after minor revision.
-It might be also of interest to predict long non-coding RNA genes (not presented in "all kinds" of non-coding RNA in Tab S5). Reply: LncRNAs are important non-coding genes in the regulation of gene expression. However, the transcriptome data used in this work were generated with oligo(dT) enrichment and are therefore not suitable for lncRNA prediction with a reference genome. We therefore did not annotate lncRNA genes in this work. Nevertheless, lncRNA regulation at high altitude would be an interesting direction, and related work will be performed in our future studies.
-To affirm that 228 genes are species-specific sounds always strange to me. More precise comparative analysis of their presence/absence in other (related) species should be performed to confirm this. Reply: Thanks a lot for the reviewer's suggestion. We further searched the 228 genes against the NCBI NR (non-redundant) database with BLAST and found that 142 genes had hits at an e-value threshold of 1e-5; however, 86 genes still failed to hit any protein sequence in the database. The functional analysis of those genes is an interesting topic for our future studies.
We have changed the term "species-specific genes" to "genes without significant homologous hits" and added the additional analysis to the manuscript. The corrections are highlighted in red.
-Species names should be italicized in Fig.2 Reply: We thank the reviewer for the reminder. Species names are now italicized in Fig. 2.
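As an illustration of the homology screen described in the reply on the 228 genes, the sketch below partitions query genes into those with and without a database hit at a given e-value cutoff. It is a minimal sketch under stated assumptions: it expects standard BLAST tabular output (`-outfmt 6`, where column 1 is the query ID and column 11 is the e-value), and the function name `split_by_blast_hit`, the `max_evalue` parameter, and the toy records are hypothetical rather than taken from the manuscript.

```python
def split_by_blast_hit(query_ids, blast_lines, max_evalue=1e-5):
    """Partition query IDs by whether they have a hit at or below max_evalue.

    blast_lines: lines of BLAST tabular output (-outfmt 6), in which
    column 1 is the query ID and column 11 is the e-value.
    Returns (queries_with_hit, queries_without_hit) as two sets.
    """
    with_hit = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:
            with_hit.add(fields[0])
    queries = set(query_ids)
    return queries & with_hit, queries - with_hit


# Toy data (made up for illustration; not results from the manuscript).
toy_queries = ["geneA", "geneB", "geneC"]
toy_hits = [
    "geneA\tsp1\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-50\t300",
    "geneB\tsp2\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25",
]
hit, no_hit = split_by_blast_hit(toy_queries, toy_hits)
```

In the toy data, only geneA passes the cutoff; geneB has a hit above the threshold and geneC has no hit at all, so both fall into the second set. The counts reported in the reply are consistent with this kind of partition: 142 genes with hits plus 86 without gives the 228 genes examined.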
-The paper should be edited for typos/grammatical errors Reply: Thanks a lot for the reviewer's suggestion. We have revised the manuscript and corrected typos and grammatical errors throughout. The corrections are highlighted in red.