Whole genome hybrid assembly and protein-coding gene annotation of the entirely black native Korean chicken breed Yeonsan Ogye

Yeonsan Ogye (YO), an indigenous Korean chicken breed (gallus gallus domesticus), has entirely black external features and internal organs. In this study, the draft genome of YO was assembled using a hybrid de novo assembly method that takes advantage of high-depth Illumina short-reads (232.2X) and lowdepth PacBio long-reads (11.5X). Although the contig and scaffold N50s (defined as the shortest contig or scaffold length at 50% of the entire assembly) of the initial de novo assembly were 53.6Kbp and 10.7Mbp, respectively, additional and pseudo-reference-assisted assemblies extended the assembly to 504.8Kbp for contig N50 (pseudo-contig) and 21.2Mbp for scaffold N50, which included 211,827 retrotransposons and 27,613 DNA transposons. The completeness (97.6%) of the YO genome was evaluated with BUSCO, and found to be comparable to the current chicken reference genome (galGal5; 97.4%), which was assembled with a long read-only method, and superior to other avian genomes (92~93%), assembled with short read-only and hybrid methods. To comprehensively reconstruct transcriptome maps, RNA sequencing (RNA-seq) data were analyzed from twenty different tissues, including black tissues. The maps included 15,766 protein-coding and 6,900 long non-coding RNA genes. By comparing to the gallus gallus red junglefowl chicken reference genome and re-sequencing data from additional YO chickens, 551 large structural variants, 895,988 single nucleotide polymorphisms, and 82,392 small insertions/deletions were detected across the YO genome.


Background
The Yeonsan Ogye (YO), a designated natural monument of Korea (No. 265), is an indigenous Korean chicken breed that is notable for its entirely black plumage, skin, beak, comb, eyes, shank, claws, and internal organs [1]. In terms of its plumage and body color, as well as its number of toes, this unique chicken breed resembles the indigenous Indonesian chicken breed Ayam cemani [2][3][4]. YO also has some morphological features that are similar to those of the Silkie fowl (called Ogolgye in Korea), except for a veiled black walnut comb and hair-like, fluffy plumage that is white or variably colored [5,6]. Although the exact origin of the YO breed has not yet been clearly defined, its features and medicinal usages were To date, a number of avian genomes from both domestic and wild species have been constructed and compared, revealing genomic signatures associated with the domestication process and genomic differences that provide an evolutionary perspective [8]. The chicken reference genome was first assembled using the Red junglefowl [9], first domesticated at least five thousand years ago in Asia; the

Hybrid whole genome assembly
The YO genome was assembled using our hybrid genome assembly pipeline, employing the following four steps: preprocessing, hybrid de novo assembly, super-scaffolding, and polishing (Figures 1b and S1).
During the preprocessing step, the errors in the Illumina short-reads were corrected by KmerFreq and Corrector [13] using sequencing quality scores. In turn, using the corrected short-reads, the sequencing errors in the PacBio long reads (SRR6189090) were corrected by LoRDEC [14].
In hybrid de novo genome assembly, the initial assembly was done with the error-corrected short reads from the paired-end and mate-pair libraries using ALLPATHS-LG [15] with the default option, producing contigs and scaffolds. The resulting contigs and scaffolds showed 53.6 Kb and 10.7 Mb of N50 (Table S1), respectively. Next, the scaffolds were additionally connected using SSPACE-LongRead [16] and OPERA [17] with corrected PacBio long reads and FOSMID reads. The gaps within and between scaffolds were re-examined with GapCloser [13] with error-corrected short-reads. All resulting scaffolds were aligned to the galGal4 genome (GenBank assembly accession: GCA_000002315.2) by LASTZ [18] to detect putative mis-assemblies, verified by paired-end and mate-pair reads mapped to the scaffolds  (Table S1). A pseudo-contig is defined as a sequence broken by gaps of >1bp or a single N, which are assumed to be gaps or errors.
In the super-scaffolding stage, pseudo-reference-assisted assembly was done using LASTZ, BWA-MEM, PBJelly [21], and SSPACE-LongRead to enhance the quality of assembly using errorcorrected PacBio long-reads to reduce the topological complexity of the assembly graphs [22]. Because even scaffolding with long-reads can be affected by repetitive sequences, the results of mapping scaffolds to each chromosome were transformed into a hierarchical bipartite graph ( Figure S3) to minimize the influence of repetitive sequences. The hierarchical bipartite graph was built by mapping PacBio (errorcorrected) reads to scaffolds using BWA-MEM and, in turn, mapping scaffolds to the galGal4 genome (GCA_000002315.2) using LASTZ. Using the hierarchical bipartite graphs, all scaffolds and PacBio reads were finally assigned to each chromosome. Based on the results, super-scaffolding and additional gap-filling was performed by SSPACE-LongRead and PBJelly, respectively, resulting in scaffold N50 of 21.2Mbp ( Figure 1C and Table S1). In the last stage, nucleotide errors or ambiguities were corrected by the GATK pipeline [23] with paired-end reads, and in turn, any vector contamination was removed using VecScreen with UniVec database [24]. The final assembly results showed that the gap percentage and (pseudo-)contig N50 were significantly improved, from 1.87% and 53.6 Kb in the initial assembly to 0.85% and 504.8 Kbp in the final assembly, respectively. Among avian genome assemblies, these results are second best and the scaffold N50 is the best ( Figure 1C). The complete genome sequence at the chromosome level was built by connecting final scaffolds in the order of appearance in each chromosome with the introduction of 100 Kbp 'N' gaps between them. To evaluate the completeness of the genome, the YO draft genome was compared to the galGal4 (short read-based assembly) and galGal5 (long readbased assembly) genomes, with respect to 2,586 conserved vertebrate genes, using BUSCO [25]. The results indicated that the YO genome contains more complete single-copy BUSCO genes, suggesting that the YO genome is slightly more complete than the others ( Table 2).

Large structural variations
When the Ogye_1 genome assembly was compared to two versions of the chicken reference genome assembly, galGal4 and 5, using LASTZ [18] and in-house programs/scripts, 551 common large (≥1 Kb) structural variations (SVs) evident in both assembly versions were detected by at least two different SV prediction programs (Delly, Lumpy, FermiKit, and novoBreak) [26][27][28][29] (Figure 2A; Table S1). SVs Although the Fibromelanosis (FM) locus, which contains the hyperpigmentation-related edn3 gene, is known to be duplicated in the genomes of certain hyperpigmented chicken breeds, such as Silkie and Ayam cemani [3,6] , the exact structure of the duplicated FM locus in such breeds has not been completely resolved due to its large size (~1Mbp). A previous study suggested that the inverted duplication of the FM locus could be explained by three possible mechanistic scenarios ( Figure 2B) [3].
To understand more about the mechanism of FM locus SV in the YO genome, we compared it to the same locus in the galGal4 genome. Aligning paired-end reads of the YO genome to the galGal4 genome, we found higher read depth at the FM locus in YO, indicating a gain of copy number at the locus ( Figure 2C top). The intervening region between the two duplicated regions was estimated to be 412.6 Kb in length.
Regarding possible mechanistic scenarios, mate-pair reads (3-10 Kbp, and FOSMID) mapped to the locus supported all three suggested scenarios, but an alignment of chromosome 20 from Ogye_1 and galGal4 showed that the intervening regions, including inner-partial regions in both duplicated regions, were inverted at the same time, which supports the first mechanistic scenario in Figure 2B. Given the resulting alignments, the FM locus of the Ogye_1 genome was updated according to the first scenario. The size of Gap_1 and Gap_2 were estimated at 164.5 Kb and 63.3 Kb, respectively.
Additionally, a large inversion was detected near a locus including the tyrp1 gene ( Figure 2D), which is known to be related to melanogenesis [30,31]. However, when resequencing data from white leghorn (white skin and plumage), Korean black (white skin and black plumage), and white silkie (black skin and white plumage) were compared to the galGal4 or 5 genome assemblies, the same break points related to the inversion were detected, suggesting that the inversion including tyrp1 is not specifically related to skin hyperpigmentation.

Repeats
Repeat elements in the Ogye and other genomes were predicted by a reference-guided approach, RepeatMasker [32], which utilizes Repbase libraries [33]. In the Ogye_1 genome, 211,827 retrotransposable elements (8.01%), including LINEs (6.58%), SINEs (0.07%) and LTR elements (1.36%), 27,613 DNA transposons (1.02%), 8,131 simple repeats (0.97%), and 45,264 low-complexity repeats (0.21%) were annotated (Figure 3 and Supplementary Table S2). It turns out that the composition of repeats in the Ogye_1 genome is similar to that in other avian genomes (Figure 3), although the total percentage is slightly greater in Ogye_1 than in other avian genomes, with the exception of the galGal4 genome. Compared with other avian genomes, the Ogye_1 genome is more similar to galGal4 and 5 in terms of repeat composition except for the fractions of simple repeats (0.14% for Ogye_1 and 1.47% for galGal5) and satellite DNA repeats (0.01% for Ogye and 6% for galGal5).

SNPs/INDELs
To annotate SNPs and INDELs in the YO genome, resequencing data from 20 YO individuals were produced and mapped to the Ogye_1 genome using BWA-MEM, and a series of post-processes including deduplication were performed by Picard modules [34]. Using Genome Analysis Toolkit (GATK) modules,  Figure 4A.

Protein-coding genes
By analyzing large-scale of RNA-seq data from twenty different tissues through our protein-coding gene annotation pipeline, 15,766 protein-coding genes were annotated in the YO genome (see Figure 4B and C), including 946 novel genes and 14,820 known genes. 164 protein-coding genes annotated in galGal4 were missing from the Ogye_1 assembly. To sensitively annotate protein-coding genes, all paired-end RNAseq data were mapped on the Ogye_1 genome by STAR [35] for each tissue and the mapping results were then assembled into potential transcripts using StringTie [36]. Assembled transcripts from each sample were merged using StringTie and the resulting transcriptome was subjected to the prediction of coding DNA sequences (CDSs) using TransDecoder [37]. For high-confidence prediction, transcripts with intact gene structures (5'UTR, CDS, and 3'UTR) were selected. To verify the coding potential, the candidate sequences were examined using CPAT [38] and CPC [39]. Candidates with a high CPAT score (>0.99) were directly assigned to be protein-coding genes, and those with an intermediate score (0.8-0.99) were re-examined to determine whether the CPC score is >0. Candidates with low coding potential or that were partially annotated were examined to determine if their loci overlapped with annotated protein-coding genes from galGal4 (ENSEMBL cDNA release 85). Overlapping genes were added to the set of Ogye protein-coding genes. Finally, 164 genes were not mapped to the Ogye_1 genome by GMAP, 131 of which were confirmed to be expressed in Ogye (≥ 0.1 FPKM) using all Ogye RNA-seq data. However, expression of the remaining 33 genes was not confirmed, suggesting that they are not expressed in Ogye (< 0.1 FPKM) or have been lost from the Ogye genome. The missing genes are listed in Table S3. In contrast, the 946 newly annotated Ogye genes appeared to be mapped to the galGal4 or galGal5 genomes ( Figure 4C).

Discussion
In this work, the first draft genome from a Korean native chicken breed, Yeonsan Ogye, was constructed with genomic variation, repeat, and protein-coding and non-coding gene maps. Compared with the chicken reference genome maps, many more novel coding and non-coding elements were identified from large-scale RNA-seq datasets across twenty different tissues. Although the genome completeness evaluated by BUSCO was better than the existing chicken reference genomes including galGal4 and 5, the Ogye_1 genome seems to lack simple and long repeats compared to galGal5, which was assembled from high-depth PacBio long-reads (50X) [10] that can capture simple and long repeats. Although we also used Pacbio long reads, because their sequence depth is shallow (11.5X), they were only used for scaffolding and gap filling, so that some simple and satellite repeats would be missed during assembly. A similar tendency can be seen in the Golden-collared manakin genome (ASM171598v1) (Figure 3), which was also assembled in a hybrid manner using MaSuRCA assembler with Illumina short-reads and PacBio long-reads.
Annotated genomic variations and comparative analysis of coding and non-coding genes will provide a resource for understanding genomic differences and evolution of Ogye as well as identifying functional elements related to its unique phenotypes, such as fibromelanosis. Additionally, such analyses will be useful for future genome-based animal genetics.

Availability of data
All of our sequencing data and the genome sequence have been deposited in NCBI's BIOPROJECT under the accession number PRJNA412424. The raw sequence data have been deposited in the Short Read Archive (SRA) under accession numbers SRR6189081-SRR6189098 (Table 1). Figure S1. Assembly statistics of Ogye_1 gnome assembly at each step.    Illumina, "P" is PacBio, "S" is Sanger, and "4" is Roche454).

B.
A.