Chromosome-level assembly of the mustache toad genome using third-generation DNA sequencing and Hi-C analysis

Abstract Background The mustache toad, Vibrissaphora ailaonica, is endemic to China and belongs to the Megophryidae family. Like other mustache toad species, V. ailaonica males temporarily develop keratinized nuptial spines on their upper jaw during each breeding season, which fall off at the end of the breeding season. This feature is likely result of the reversal of sexual dimorphism in body size, with males being larger than females. A high-quality reference genome for the mustache toad would be invaluable to investigate the genetic mechanism underlying these repeatedly developing keratinized spines. Findings To construct the mustache toad genome, we generated 225 Gb of short reads and 277 Gb of long reads using Illumina and Pacific Biosciences (PacBio) sequencing technologies, respectively. Sequencing data were assembled into a 3.53-Gb genome assembly, with a contig N50 length of 821 kb. We also used high-throughput chromosome conformation capture (Hi-C) technology to identify contacts between contigs, then assembled contigs into scaffolds and assembled a genome with 13 chromosomes and a scaffold N50 length of 412.42 Mb. Based on the 26,227 protein-coding genes annotated in the genome, we analyzed phylogenetic relationships between the mustache toad and other chordate species. The mustache toad has a relatively higher evolutionary rate and separated from a common ancestor of the marine toad, bullfrog, and Tibetan frog 206.1 million years ago. Furthermore, we identified 201 expanded gene families in the mustache toad, which were mainly enriched in immune pathway, keratin filament, and metabolic processes. Conclusions Using Illumina, PacBio, and Hi-C technologies, we constructed the first high-quality chromosome-level mustache toad genome. This work not only offers a valuable reference genome for functional studies of mustache toad traits but also provides important chromosomal information for wider genome comparisons.


Data Description
The mustache toad, Vibrissaphora ailaonica (NCBI:txid428466), is an amphibian belonging to the Megophryidae family that is endemic to China (including the China-Vietnam border) [1][2][3]. This mustache toad species exhibits many interesting features, including unique keratinized spines along the upper jaw [1,[4][5][6]. These spines grow repeatedly in sexually mature males during the breeding season, and fall off at the end of this process [5][6][7][8] (Fig. 1). This morphological difference between males and females is further highlighted by their sexual dimorphism in body size (males are significantly larger than females). The spines (and body size) may be used as a weapon for sexually mature individuals to compete for nests and mating opportunities [7,9,10]. Another unique aspect of the mustache toad is that breeding occurs during the cold season, whereas most frogs and toads breed in the warmer months [1]. However, despite the importance of the mustache toad in terms of dynamic spine development and sexual dimorphism in body size, few genomic resources exist for this species. In fact, to date, no next-generation sequencing data have been reported in the Vibrissaphora genus. The lack of genome sequence and transcriptome data for V. ailaonica has hindered identification of functional genes related to their attractive and dynamic appearance (e.g., spine and body size). The shortage of amphibian genomes represented in the Genome 10K/Vertebrate Genomes Projects makes it necessary to analyze other important genomes to study phylogenetic relationships in amphibians on a larger scale [11].
In the present study, we combined genomic sequencing data from Illumina short reads, Pacific Biosciences (PacBio) long reads, and Hi-C data to generate the first chromosome-level reference genome for the mustache toad. The completeness and continuity of the genome were comparable with those of other important amphibian species. The high-quality reference genome generated in this study will facilitate research on population genetic traits and functional gene identification related to important characteristics of the mustache toad.

Sampling and sequencing
During the breeding season (in February), a male mustache toad (V. ailaonica) with keratinized nuptial spines on its upper jaw was caught for sequencing from Ailao Mountain (Fig. 1). To obtain sufficient high-quality DNA for the PacBio Sequel platform (Pacific Biosciences, USA), the mustache toad was dissected, and fresh liver tissue was used for DNA extraction using phenol/chloroform extraction. DNA quality was checked by agarose gel electrophoresis, and high-integrity DNA molecules were obtained. Other tissues, including spines, brain, stomach, intestine, liver, lung, spleen, blood, and tongue, were snap-frozen in liquid nitrogen for 10 min. These 9 organs/tissues were stored at −80 • C for RNA-sequencing (RNA-seq) analysis. Isolated total RNA was used to isolate intact poly (A) + RNA using the NEBnext Ultra-Directional RNA Library Prep kit (NEB, protocol B) for library construction. The messenger RNA (mRNA) was further fragmented and randomly primed during first-strand synthesis by reverse transcription. This procedure was followed by second-strand synthesis with DNA polymerase I to create double-stranded complementary DNA fragments using Transcriptor First Strand cDNA Synthesis Kit (Roche).
For the Hi-C experiments, collected blood was used for library construction. The blood sample (150 μL) was cross-linked for 10 min with formaldehyde (1% final concentration), after which glycine (0.2 M final concentration) was added for 5 min to stop the cross-linking process. The sample was then stored until required for further analysis. Extracted DNA was sequenced using the Illumina and PacBio Sequel platforms. Short reads generated from the Illumina platform were used to estimate genome size and to correct errors in the assembled genome, and the PacBio long reads were used for genome assembly. To this end, 5 libraries with insertion lengths of 220 or 500 bp were sequenced on an Illumina HiSeq 2500 platform, generating 150-bp paired-end reads. A 20-kb library was constructed using the PacBio platform, according to the manufacturer's protocols. Finally, we obtained 225.03 Gb of Illumina short reads and 277.15 Gb of PacBio long reads (Table 1, Additional Tables S1 and S2). The mean N50 length of subreads was 14.78 kb, providing ultra-long genomic sequences for the following assembly and analysis (Additional Table S2).
RNA-seq samples were obtained by mixing an equal amount of RNA extracted from each tissue that had been stored and used for library construction. After sequencing on the Illumina HiSeq 4000 platform, we obtained 14.18 Gb of sequencing data ( Table 1,  Additional Table S3). Four Hi-C libraries were constructed using the same sample with same parameters, and sequenced on the Illumina Hiseq X-ten platform, which generated 378.78 Gb of clean data (Table 1, Additional Table S4).

Genome characteristic estimation
Illumina short reads were filtered for quality as follows. First, adaptors were removed from the sequencing reads. Then, read pairs were excluded if any 1 read had >10% "N," and read pairs with >50% low-quality bases were removed. Finally, PCR duplicates produced during library construction were removed.
Filtered reads were used to estimate genome size and other characteristics. Using the k-mer method, we calculated the 17mer depth frequency distribution in the mustache toad. Genome size was estimated as follows: where TKN 17-mer is the total k-mer number and PKFD 17-mer is the peak k-mer frequency depth of 17-mer. We estimated a genome size of 3.52 Gb (peak = 54) and found heterozygous, repeated sequence peaks, suggesting that the mustache toad genome exhibits complex genome assembly (Fig. 2).

Genome assembly using PacBio long reads and Hi-C data
Based on 38 single-molecule real-time cells, and using the PacBio Sequel platform, we generated 277.15 Gb of subreads (Table 1, Additional Table S2). The mean and N50 length of subreads was 9.65 and 14.78 kb, respectively (Additional Table S2). All long reads were assembled using wtdbg software [12] (WTDBG, RRID: SCR 01 7225). As a result, we obtained a 3.95-Gb genome assembly, with a contig N50 length of 739.54 kb. However, although the size of the genome assembly was comparable to the estimated kmer result, the end result was a slightly larger. This may be associated with the complexity of the mustache toad genome (which has a high rate of heterozygosity and repetitive sequences). Redundancy in the genome assembly was removed using Redundans software (v0.13c) [13], with an identity of 0.7 and overlap of 0.7. This resulted in a genome assembly of 3.58 Gb and a contig N50 length of 834.90 kb. To ensure that all contigs removed were not real sequences, we used BUSCO [14] and the mapping ratio   of Illumina reads in both the raw genome and the redundancyfiltered genome. Results of these checks indicated that the parameters used in the redundancy-filtered step were appropriate for this study (Additional Table S5). To further improve the quality and accuracy of our genome assembly, Illumina short reads were used to polish the genome using Pilon software (Pilon, RRID: SCR 01 4731, v1.21) [15] at the single-base level.
Hi-C data were used to improve the connection integrity of the contigs (15,899 contigs). We obtained 378.78 Gb of Hi-C sequencing data, which was first filtered using Hic-Pro (v2.10.0) [16] (Table 1, Additional Table S4) and then mapped to the polished mustache toad genome [17]. The locations and directions of the contigs were determined by 3D de novo assembly (3d-DNA) software (v180419) [18], with default parameters. Most contigs were then successfully clustered and anchored in 13 groups (Fig. 3) [19]. Finally, we obtained the first chromosome-level, high-quality mustache toad assembly (3.53 Gb) with a scaffold N50 length of 412.42 Mb, which provides a solid genomic resource to assist further study of the mustache toad (Table 2).

Genome assembly evaluation
The quality of a genome assembly is directly related to the accuracy and completeness of protein-coding gene prediction. Therefore, we evaluated the assembled mustache toad genome using 3 methods. First, the assembled genome was compared against the core gene set in BUSCO (BUSCO, RRID:SCR 015008, v2.0) [14]. We found 245 (80.8%) and 833 (85.1%) conserved core genes in the mustache toad genome using the eukaryote and metazoan databases, respectively (Table 3). When we further considered the fragmented BUSCO genes found in the genome, there were 272 (89.7%) and 881 (90.1%) conserved core genes in the eukaryote and metazoan databases, respectively. These results indicated that the assembled mustache toad genome is comparable with published amphibian genomes (Table 3).
Second, all filtered short reads generated from the Illumina platform were aligned to the genome using BWA software (BWA, RRID:SCR 010910, v0.7.12) [20]; 1,778 million clean reads could be mapped to the genome, accounting for 97.78% of total clean reads (Additional Table S6).
Third, RNA-seq reads were de novo assembled using Bridger software (Bridger, RRID:SCR 017039, version: r2014-12-01) [21], with redundant transcripts removed by TGICL [22]. This resulted in 19,876 transcripts (Additional Table S7). These transcripts were then aligned to the genome, with 17,878 transcripts (89.95%) found in the assembled genome, and 94.52% of transcripts being longer than 1 kb (Additional Table S8). Analysis of N50 length and BUSCO results revealed that the mustache toad genome was comparable to that of other published amphibian genomes (Tables 2-4), indicating that our assembled mustache toad genome exhibited high completeness and accuracy.
The GC distribution of the mustache toad genome, and that of other vertebrate species, was calculated using the slide window method. GC distributions were similar, with a mean GC content of 43.68% in the mustache toad, and 36.60-44.49% in other species (Additional Figure S1).

Genome annotation
Tandem Repeats Finder (TRF, v4.04) [23] was used to identify repetitive elements, and RepeatModeler software (RepeatModeler, RRID:SCR 015027, v1.0.4) was used to detect transposable elements (TEs) in the mustache toad genome. Then, the de novo library of repeats produced by RepeatModeler analysis and the repbase (RepBase16.02) database were used for RepeatMasker (RepeatMasker, RRID:SCR 012954, version: open-4.0) [24] analysis to identify homologous repeats. RepeatProteinMask was used to query the TE protein database at the protein level. Last, we identified 2.45 Gb of repeat sequences, accounting for 69.48% of the estimated genome size (Additional Table S9). Among these repeat sequences, 60.87% (2.15 Gb) was predicted by the de novo method (Table 5).
Gene functional annotation can help to elucidate gene function. Thus, we aligned all 26,227 protein-coding genes to protein databases, including InterProScan, KEGG, SwissProt, and TrEMBL. Results showed that most of the genes obtained could be annotated from these functional databases (Table 6).
The 238 single-copy genes were aligned using MUSCLE software (MUSCLE, RRID:SCR 011812, v3.8.31) [41,42] and concatenated to supergenes for maximum-likelihood-based phylogenetic analyses. We performed phylogenetic analysis, with zebrafish as the outgroup, using RAxML software (RAxML, RRID: SCR 006086, v8.2.3) [43], with the parameter "-m" for PROTGAM-MAAUTO. Results indicated that the mustache toad has a close relationship with the ancestor of the marine toad (R. marina), bullfrog (R. catesbeiana), and Tibetan frog (N. parkeri), with topological relationships in other clades found to be the same as reported previously (Fig. 6). To further investigate the diver- Note: These data pertain to genome assembly. Wtdbg contig was the genome assembled by wtdbg and 2-round pilon error correction. Hi-C scaffold was the genome finished by Hi-C assembly.

Relative evolutionary rate of species
The evolutionary rate of species can reflect its evolutionary history and status. The relative evolutionary rate of the mustache toad to other closely related species was analyzed using LINTRE [47] and MEGA (MEGA, RRID:SCR 000667, v7.0.26) software. Twocluster analysis was applied to test the molecular evolution of multiple sequences in a phylogenetic context, based on concatenated supergenes (protein sequences) using "tpcv" (a module in LINTRE software). Concatenated supergenes were also used for Tajima's relative rate test. We used zebrafish as the outgroup in both methods and found that, except for the axolotl, the mustache toad had a relatively faster evolutionary rate than its closely related species (e.g., X. tropicalis, R. marina, R. catesbeiana, and N. parkeri) (Additional Tables S15 and S16). The crocodile had a slower evolutionary rate, relative to its closely related species, which is consistent with previous work [48] (Additional Tables S15 and S16).

Discussion
Using Illumina, PacBio, and Hi-C sequencing technologies, we report the first chromosome-level genome assembly of the mustache toad. We successfully annotated the high-quality proteincoding genes by integrating results from 3 different methods. Phylogenetic analysis indicated that the mustache toad is closely related to the marine toad, bullfrog, and Tibetan frog. Analysis showed that the mustache toad had a faster evolutionary rate, relative to most other closely related species studied. Analysis of the expansion and contraction of gene families identified several biological processes and pathways, such as metabolism and intermediate filaments, suggesting that these terms may relate to the special adaptations of the mustache toad to its habitat. This work not only offers a valuable chromosomelevel genomic data for comparative genomics analysis but also provides important genomic data for studying the mustache toad traits.