Whole genome and transcriptome maps of the entirely black native Korean chicken breed Yeonsan Ogye

ABSTRACT Background Yeonsan Ogye (YO), an indigenous Korean chicken breed (Gallus gallus domesticus), has entirely black external features and internal organs. In this study, the draft genome of YO was assembled using a hybrid de novo assembly method that takes advantage of high-depth Illumina short reads (376.6X) and low-depth Pacific Biosciences (PacBio) long reads (9.7X). Findings The contig and scaffold NG50s of the hybrid de novo assembly were 362.3 Kbp and 16.8 Mbp, respectively. The completeness (97.6%) of the draft genome (Ogye_1.1) was evaluated with single-copy orthologous genes using Benchmarking Universal Single-Copy Orthologs and found to be comparable to the current chicken reference genome (galGal5; 97.4%; contigs were assembled with high-depth PacBio long reads (50X) and scaffolded with short reads) and superior to other avian genomes (92%–93%; assembled with short read-only or hybrid methods). Compared to galGal4 and galGal5, the draft genome included 551 structural variations including the fibromelanosis (FM) locus duplication, related to hyperpigmentation. To comprehensively reconstruct transcriptome maps, RNA sequencing and reduced representation bisulfite sequencing data were analyzed from 20 tissues, including 4 black tissues (skin, shank, comb, and fascia). The maps included 15,766 protein-coding and 6,900 long noncoding RNA genes, many of which were tissue-specifically expressed and displayed tissue-specific DNA methylation patterns in the promoter regions. Conclusions We expect that the resulting genome sequence and transcriptome maps will be valuable resources for studying domestic chicken breeds, including black-skinned chickens, as well as for understanding genomic differences between breeds and the evolution of hyperpigmented chickens and functional elements related to hyperpigmentation.


Background
The Yeonsan Ogye (YO), a designated natural monument of Korea (No. 265), is an indigenous Korean chicken breed that is notable for its entirely black plumage, skin, beak, comb, eyes, shank, claws, and internal organs [1]. In terms of its plumage and body color, as well as its number of toes, this unique chicken breed resembles the indigenous Indonesian chicken breed Ayam cemani [2][3][4]. YO also has some morphological features that are similar to those of the Silkie fowl, except for a veiled black walnut comb and hair-like, fluffy plumage that is white or variably colored [5,6]. Although the exact origin of the YO breed has not yet been clearly defined, its features and medicinal usages were recorded in Dongui Bogam [7], a traditional Korean medical encyclopedia compiled and edited by Heo Jun in 1613.
To date, a number of avian genomes from both domestic and wild species have been constructed and compared, revealing genomic signatures associated with the domestication process and genomic differences that provide an evolutionary perspective [8]. The chicken reference genome was first assembled using the Red junglefowl [9], first domesticated at least five thousand years ago in Asia; the latest version of the reference genome was released in 2015 (galGal5, GenBank Assembly ID GCA_000002315.3) [10]. However, because domesticated chickens exhibit diverse morphological features, including skin and plumage colors, the genome sequences of unique breeds are necessary for understanding their characteristic phenotypes through analyses of single nucleotide polymorphisms (SNPs), insertions and deletions (INDELs), structural variations (SVs), and coding and non-coding transcriptomes. Here, we provide the first version of YO genome (Ogye_1) and transcriptome maps, which include annotations of large SVs, SNPs, INDELs, and repeats, as well as coding and non-coding transcriptome maps along with DNA methylation landscapes across twenty different tissues of YO.
corrected PacBio long-reads to reduce the topological complexity of the assembly graphs [21]. Because even scaffolding with long-reads can be affected by repetitive sequences, the results of mapping scaffolds to each chromosome were transformed into a hierarchical bipartite graph ( Figure S3) to minimize the influence of repetitive sequences. The hierarchical bipartite graph was built by mapping PacBio (errorcorrected) reads to scaffolds using BWA-MEM and, in turn, mapping scaffolds to the galGal4 genome (GCA_000002315.2) using LASTZ. Using the hierarchical bipartite graphs, all scaffolds and PacBio reads were finally assigned to each chromosome. Based on the results, super-scaffolding and additional gap-filling was performed by SSPACE-LongRead and PBJelly, respectively, resulting in scaffold N50 of 21.2Mbp ( Figure 1C and Figure S1). In the last stage, nucleotide errors or ambiguities were corrected by the GATK pipeline [22] with paired-end reads, and in turn, any vector contamination was removed using VecScreen with UniVec database [23]. The final assembly results showed that the gap percentage and (pseudo-)contig N50 were significantly improved, from 1.87% and 53. 6 Kbp in the initial assembly to 0.85% and 504.8 Kbp in the final assembly, respectively. Among avian genome assemblies, these results are second best and the scaffold N50 is the best (Figure 1C). The complete genome sequence at the chromosome level ( Figure S4) was built by connecting final scaffolds in the order of appearance in each chromosome with the introduction of 100 Kbp 'N' gaps between them. To evaluate the completeness of the genome, the YO draft genome was compared to the galGal4 (short read-based assembly) and galGal5 (long read-based assembly) genomes, with respect to 2,586 conserved vertebrate genes, using BUSCO [24]. The results indicated that the Ogye_1 genome contains more complete single-copy BUSCO genes, suggesting that the Ogye_1 genome is slightly more complete than the others ( Table 4).
Although the Fibromelanosis (FM) locus, which contains the hyperpigmentation-related edn3 gene, is known to be duplicated in the genomes of certain hyperpigmented chicken breeds, such as Silkie and Ayam cemani [3,6] , the exact structure of the duplicated FM locus in such breeds has not been completely resolved due to its large size (~1Mbp). A previous study suggested that the inverted duplication of the FM locus could be explained by three possible mechanistic scenarios ( Figure 2B) [3].
To understand more about the mechanism of FM locus SV in the Ogye_1 genome, we compared it to the same locus in the galGal4 genome. Aligning paired-end reads of the Ogye_1 genome to the galGal4 genome, we found higher read depth at the FM locus in YO, indicating a gain of copy number at the locus ( Figure 2C top). Also, we have identified same mate-pair information using paired-end and mate-pair reads (see Supplementary Figure S5). The intervening region between the two duplicated regions was estimated to be 412.6 Kbp in length in Ogye_1. Regarding possible mechanistic scenarios, mate-pair reads (3-10 Kbp, and FOSMID) mapped to the locus supported all three suggested scenarios, but an alignment of chromosome 20 from Ogye_1 and galGal4 showed that the intervening regions, including inner-partial regions in both duplicated regions, were inverted at the same time, which supports the first mechanistic scenario in Figure 2B. Given the resulting alignments, the FM locus of the Ogye_1 genome was updated according to the first scenario. The size of Gap_1 and Gap_2 were estimated at 164.5 Kbp and 63.3 Kbp, respectively.
Additionally, a large inversion was detected near a locus including the tyrp1 gene ( Figure 2D), which is known to be related to melanogenesis [29,30]. However, when resequencing data from White Leghorn (white skin and plumage), Korean Black (white skin and black plumage), and white Silkie (black skin and white plumage) were compared to the galGal4 or 5 genome assemblies, the same break points related to the inversion were detected (see supplementary Figure S6 and Figure S7), suggesting that the inversion including tyrp1 is not specifically related to skin hyperpigmentation.
The density of TEs across all chromosomes was depicted in Figure 4A.

Protein-coding genes
By analyzing large-scale of RNA-seq data from twenty different tissues through our proteincoding gene annotation pipeline ( Figure 4B), 15,766 protein-coding genes were annotated in the Ogye_1 genome ( Figure 4C), including 946 novel genes and 14,820 known genes. 164 protein-coding genes annotated in galGal4 were missing from the Ogye_1 assembly. The density of protein-coding genes across all chromosomes was depicted in Figure 4A.
To sensitively annotate protein-coding genes, all paired-end RNA-seq data were mapped on the Ogye_1 genome by STAR [34] for each tissue and the mapping results were then assembled into potential transcripts using StringTie [35]. Assembled transcripts from each sample were merged using StringTie and the resulting transcriptome was subjected to the prediction of coding DNA sequences (CDSs) using TransDecoder [36]. For high-confidence prediction, transcripts with intact gene structures (5'UTR, CDS, and 3'UTR) were selected. To verify the coding potential, the candidate sequences were examined using CPAT [37] and CPC [38]. Candidates with a high CPAT score (>0.99) were directly assigned to be protein-coding genes, and those with an intermediate score (0.8-0.99) were re-examined to determine whether the CPC score is >0. Candidates with low coding potential or that were partially annotated were examined to determine if their loci overlapped with annotated protein-coding genes from galGal4 (ENSEMBL cDNA release 85). Overlapping genes were added to the set of Ogye_1 protein-coding genes.
Finally, 164 genes were not mapped to the Ogye_1 genome by GMAP, 131 of which were confirmed to be expressed in YO (≥ 0.1 FPKM) using all paired-end YO RNA-seq data. However, expression of the remaining 33 genes was not confirmed, suggesting that they are not expressed in YO (< 0.1 FPKM) or have been lost from the Ogye_1 genome. The missing genes are listed in Table S4. In contrast, the 946 newly annotated Ogye_1 genes appeared to be mapped to the galGal4 or galGal5 genomes ( Figure 4C).

lncRNAs
In total, 6,900 YO lncRNA genes, including 5,610 novel loci and 1,290 known loci, were confidently annotated from RNA-seq data using our lncRNA annotation pipeline, adopted from our previous study [39] (Figure 5A).
To annotated and profile lncRNA genes, pooled single-and paired-end RNA-seq reads of each tissue were mapped to the Ogye_1 genome (PRJNA412424) using STAR [34], and subjected to transcriptome assembly using Cufflinks [44], leading to the construction of transcriptome maps for twenty tissues. The resulting maps were combined using Cuffmerge and, in total, 206,084 transcripts from 103,405 loci were reconstructed in the Ogye genome. We removed other biotypes of RNAs (the sequences of mRNAs, tRNAs, rRNAs, snoRNAs, miRNAs, and other small non-coding RNAs downloaded from ENSEMBL biomart) and short transcripts (less than 200nt in length). 54,760 lncRNA candidate loci (60,257 transcripts) were retained, and which were compared with a chicken lncRNA annotation of NONCODE (v2016) [45]. Of the candidates, 2,094 loci (5,215 transcripts) overlapped with previously annotated chicken lncRNAs. 52,666 non-overlapping loci (55,042 transcripts) were further examined to determine whether they had coding potential using CPC score [38]. Those with a score greater than -1 were filtered out, and the remainder (14,108 novel lncRNA candidate loci without coding potential) were subjected to the next step. Because many candidates still appeared to be fragmented, those with a single exon but with neighboring candidates within 36,873bp, which is the intron length of the 99th percentile, were re-examined using both exon-junction reads consistently presented over twenty tissues and the maximum entropy score [46], as done in our previous study [39]. If there were at least two junction reads spanning two neighboring transcripts or if the entropy score was greater than 4.66 in the interspace, two candidates were reconnected, and those with a single exon were discarded. In the final version, 9,529 transcripts from 6,900 lncRNA loci (5,610 novel and 1,290 known) were annotated as lncRNAs (see Figure 5B), which included 6,170 (89.40 %) intergenic lncRNAs and 730 (10.57 %) antisense ncRNAs. Consistent with other species [40][41][42][43], the median gene length and the median exon number of Ogye lncRNAs were less than those of protein-coding genes ( Figure 5C and D).
Although 13,540 of 15,766 protein-coding genes (92%) were redetected by our transcriptome assembly and protein-coding gene annotations (see Figure 4C), 8,204 of NONCODE lncRNAs were missed in our Ogye lncRNA annotations (Figure 5B), the majority of which were either fragments of protein-coding genes or not expressed in all twenty Ogye tissues ( Figure 5B). Only 276 were missed in our transcriptome assembly, and 648 were missed in our draft genome.

Coding and non-coding transcriptome maps
Using paired-end YO RNA-seq data, the expression levels of protein-coding and lncRNA genes were As lncRNAs tend to be specifically expressed in a tissue or in related tissues, they could be better factors for defining genomic characteristics of tissues than protein-coding genes. To prove this idea, principle component analyses (PCA) were performed with tissue-specific lncRNAs and protein-coding genes using reshape2 R package (Figure 7) [47]. As expected, the 1st, 2nd, and 3rd PCs of lncRNAs enabled us to predict the majority of variances, and better discerned distantly-related tissues and functionally and histologically-related tissues (i.e., black tissues and brain tissues) ( Figure 7B) than those of protein-coding genes ( Figure 7A).

DNA methylation maps
RRBS reads were mapped to Ogye draft genome ( Table 3), and calculated DNA methylation signals (C to T changes in CpGs) across chromosomes using Bismark in each sample [48]. The DNA methylation landscape in protein-coding and lncRNA genes were shown in Figure 8A. Based on CpG methylation pattern in the promoter of genes, hierarchical clustering was performed using rsgcc R package, and clusters including adjacent or functionally similar tissues, such as cerebrum and cerebellum, immature and mature eggs, or comb and skin were identified ( Figure 8B). Of all CpG sites in genomes, 31~65% were methylated across tissues while only 19~43% were methylated in the promoter ( Table 5), indicating hypomethylation status in the promoters of expressed genes. Exceptionally, the methylation levels of spleen (40.28% in all genomic regions; 24.92% in promoter region) and liver (30.82% in all genomic region; 18.68% in promoter regions) displayed much less methylation signal, compared to those of others.
In fact, examining the averaged methylation landscapes over protein-coding and lncRNA loci showed that the methylation levels in gene body regions were much higher than those in 2 Kbp upstream regions from transcript start sites (TSS) (Figure 8C and D). To confirm that CpG methylation represses the expression, 280 protein-coding and 392 lncRNA genes of with highly tissue-specific expression pattern were selected. The methylation level of highly expressed genes was much lower than others ( Figure 8E and F).
To correlate the expression of genes with their methylation levels, only tissues in which a certain position had sufficient read coverage (at least five) were considered for measuring the correlation. The Spearman correlation coefficients between the expression and methylations levels were observed over twenty tissues (Figure 9). 454 protein-coding and 25 lncRNA genes displayed a negative correlation to methylation levels in promoter regions, while 157 protein-coding and 20 lncRNA genes have a positive correlation (box plots in Figure 9)

Discussions
In this work, the first draft genome from a Korean native chicken breed, YO, was constructed with genomic variation, repeat, and protein-coding and non-coding gene maps. Compared with the chicken reference genome maps, many more novel coding and non-coding elements were identified from largescale RNA-seq datasets across twenty different tissues. Although the Ogye_1 genome is comparable with galGal5 with respect to genome completeness evaluated by BUSCO, the Ogye_1 seems to lack simple and long repeats compared with galGal5, which was assembled from high-depth PacBio long-reads (50X) [10] that can capture simple and long repeats. Although we also used PacBio long reads, some simple and satellite repeats would be missed during assembly, because we used the PacBio data of shallow depth (11.5X) for scaffolding and gap filling. A similar tendency can be seen in the Golden-collared manakin genome (ASM171598v1) (Figure 3), which was also assembled in a hybrid manner using MaSuRCA assembler with high-depth Illumina short-reads and low-depth PacBio long-reads. Annotated genomic variations and comparative analysis of coding and non-coding genes will provide a resource for understanding genomic differences and evolution of YO as well as identifying functional elements related to its unique phenotypes, such as fibromelanosis. Additionally, such analyses will be useful for future genome-based animal genetics.

Availability of data
All of our sequencing data and the genome sequence have been deposited in NCBI's BIOPROJECT under the accession number PRJNA412424. The raw sequence data have been deposited in the Short Read Archive (SRA) under accession numbers SRR6189081-SRR6189098 (Table 1), SRX3223583-SRX3223622 ( Table 2), and SRX3223667-SRX3223686 (Table 3).

Additional files
The supplementary Figures and tables have been included in a supplementary file: Figure S1. Assembly statistics of Ogye_1 gnome assembly at each step. Figure S2. Filtration of noise and mis-assembly detection using Lastz Figure S3. Hierarchical mapping information in the reference-assisted additional assembly pipeline. Figure S4. Alignment of the Ogye_1 genome to galGal4/5 drawn by MUMmer.          indicates Illumina, "P" is PacBio, "S" is Sanger, and "4" is Roche454).       respectively. The methylation level is indicated with a color-coded z-score, described in the key.

B.
A.