Genome sequencing and analysis of two early-flowering cherry (Cerasus × kanzakura) varieties, ‘Kawazu-zakura’ and ‘Atami-zakura’

Abstract To gain genetic insights into the early-flowering phenotype of ornamental cherry, also known as sakura, we determined the genome sequences of two early-flowering cherry (Cerasus × kanzakura) varieties, ‘Kawazu-zakura’ and ‘Atami-zakura’. Because the two varieties are interspecific hybrids, likely derived from crosses between Cerasus campanulata (early-flowering species) and Cerasus speciosa, we employed the haplotype-resolved sequence assembly strategy. Genome sequence reads obtained from each variety by single-molecule real-time sequencing (SMRT) were split into two subsets, based on the genome sequence information of the two probable ancestors, and assembled to obtain haplotype-phased genome sequences. The resultant genome assembly of ‘Kawazu-zakura’ spanned 519.8 Mb with 1,544 contigs and an N50 value of 1,220.5 kb, while that of ‘Atami-zakura’ totalled 509.6 Mb with 2,180 contigs and an N50 value of 709.1 kb. A total of 72,702 and 69,528 potential protein-coding genes were predicted in the genome assemblies of ‘Kawazu-zakura’ and ‘Atami-zakura’, respectively. Gene clustering analysis identified 2,634 clusters uniquely presented in the C. campanulata haplotype sequences, which might contribute to its early-flowering phenotype. Genome sequences determined in this study provide fundamental information for elucidating the molecular and genetic mechanisms underlying the early-flowering phenotype of ornamental cherry tree varieties and their relatives.


Introduction
Flowering cherry, called sakura in Japanese, is an ornamental plant popular worldwide. A major Cerasus Â yedoensis cultivar 'Somei-Yoshino', which is an interspecific hybrid of Cerasus spachiana and Cerasus speciosa, 1 usually blooms from March to April in Japan. In addition, early-flowering sakura species, such as Cerasus campanulata, usually bloom 1-2 months earlier than 'Somei-Yoshino', and its interspecific hybrids such as Cerasus Â kanzakura also exhibit early flowering. C. Â kanzakura is considered a hybrid between C. campanulata and Cerasus speciosa and/or Cerasus jamasakura, 2 but its origin is still debated. Two C. Â kanzakura cultivars, 'Kawazu-zakura' and 'Atamizakura', also bloom early (January and February, respectively); however, the molecular mechanisms underlying their early-flowering phenotype remain unknown. Although the mechanisms of early flowering in Rosaceae family members, Japanese plum (Prunus mume) and peach (Prunus persica), which flower in February and March, respectively, are well known, 3 it remains unclear whether these mechanisms are common between Cerasus and Prunus.
Genome sequence analysis provides information on nucleotide polymorphisms and gene copy number variation, which can lead to phenotypic differences among individuals and cultivars. 4 Pangenomics, which involves de novo genome sequencing of multiple lines within a species, is conducted to obtain information on variation in all genes within a species to understand the origin of the organism under study. 5,6 In 'Somei-Yoshino', haplotype-phased genome sequences have been reported, and comprehensive changes in gene expression during floral bud development that contribute towards flowering have been revealed by time-course transcriptome analysis. 7 Therefore, comparative genomics of multiple lines of flowering cherry varieties, such as 'Kawazu-zakura', 'Atami-zakura' and 'Somei-Yoshino', could provide genetic insights into their earlyflowering phenotypes.
A trio-binning strategy, 8 previously used in a bovine F1 hybrid to resolve two haplotype-phased genome sequences, was recently applied to 'Somei-Yoshino'. 7 Genes associated with the earlyflowering phenotype of 'Kawazu-zakura' and 'Atami-zakura' were assumed to be encoded by the C. campanulata haplotype sequences. Therefore, in this study, we used the trio-binning strategy to determine the haplotype-phased sequences of 'Kawazu-zakura' and 'Atami-zakura'. Comparative analysis of three sakura genomes ('Kawazu-zakura', 'Atami-zakura' and 'Somei-Yoshino') facilitated the identification of genes unique to the C. campanulata haplotype sequences of 'Kawazu-zakura' and 'Atami-zakura' as candidates responsible for the early-flowering phenotype of these varieties.

Plant materials and DNA extraction
Two early-flowering cherry (Cerasus Â kanzakura) varieties, 'Kawazu-zakura' and 'Atami-zakura', were used in this study. Both varieties were planted at the orchard of Kyoto Prefectural University (Kyoto, Japan). Genome DNA was extracted from young leaves by a modified sodium dodecyl sulphate (SDS) method. 9

Genome size estimation
Software tools used for data analyses are listed in Supplementary  Table S1. Genome libraries for short-read sequencing were prepared with the TruSeq DNA PCR-Free Sample Prep Kit (Illumina, San Diego, CA, USA) and sequenced on the NextSeq 500 platform (Illumina, San Diego, CA, USA) in paired-end, 150 bp mode. The genome size was estimated with Jellyfish.

De novo genome sequence assembly and reference-guided contig ordering and orientation
Genomes of the two cherry varieties were sequenced using the single-molecule real-time (SMRT) sequencing technology. Long-read DNA libraries were constructed using the SMRTbell Express Template Prep Kit 2.0 (PacBio, Menlo Park, CA, USA) and sequenced on SMRT cells (1M v3 LR) in a PacBio Sequel system (PacBio). Raw sequence reads of each variety were divided into two subsets with the trio-binning strategy 8 using the short-read data of C. campanulata ('Kanhi-zakura') and C. speciosa ('Ohshimazakura') together with six lines (Supplementary Table S2), which are representatives of 139 flowering cherries (DDBJ sequence archive accession no.: DRA008096). 7 The sequence read subsets were assembled separately with Falcon or Canu to build haplotypephased diploid genome sequences. Sequence errors in the contigs were corrected twice using long reads with ARROW. Potential contaminating sequence reads from organelle genomes were identified by alignments with the chloroplast and mitochondrial genome sequences of Prunus avium (GenBank accession nos: MK622380 and MK816392) with Minimap2 and then removed from the final assemblies. Haplotype-phased sequences, based on binning with C. campanulata and C. speciosa, were aligned against the C. spachiana and C. speciosa haplotype sequences, respectively, of the 'Somei-Yoshino' genome using Ragoo to build pseudomolecule sequences. Genome sequences were compared with D-Genies, and coverage was calculated with BEDTools. Two haplotype sequences were aligned with Minimap2 to identify sequence variants by paftools implemented in Minimap2.

Gene prediction and repetitive sequence analysis
Potential protein-coding genes were predicted with the MAKER pipeline, which was based on peptide sequences predicted from the genome sequences of sweet cherry (PAV_r1.0), 10 peach (v2.0.a1) 11 and Japanese plum. 12 Short genes (<300 bp) as well as genes predicted with an annotation edit distance >0.5, which is proposed as a threshold for good annotations in the MAKER protocol, were removed to facilitate the selection of high-confidence (HC) genes. Functional annotation of the predicted genes was performed with Hayai-Annotation Plants. Gene clustering was performed with OrthoFinder and visualized with UpSetR.
Repetitive sequences in the pseudomolecules were identified with RepeatMasker using repeat sequences registered in Repbase and a de novo repeat library built with RepeatModeler. The identified repetitive sequences were classified into nine types, in accordance with RepeatMasker: short interspersed nuclear elements (SINEs), long interspersed nuclear elements (LINEs), long terminal repeat (LTR) elements, DNA elements, small RNAs, satellites, simple repeats, lowcomplexity repeats and unclassified.
Long-read data (34.9 Gb) of 'Kawazu-zakura' obtained from two SMRT cells were evenly divided into two subsets (17.2 and 17.6 Gb), in accordance with the short-read data of potential parental species, C. campanulata and C. speciosa, 7 respectively (Supplementary Table S2). Reads in each subset were independently assembled with Falcon or Canu to construct contigs representing the two haplotype sequences (Supplementary Table S3). We employed the Falcon assembly for further analysis, because the contig number was less than that in Canu assembly. Potential errors in the haplotype sequences were corrected with long reads, and sequences of organelle genomes were removed to obtain the final assembly of the diploid genome of 'Kawazu-zakura'. The resulting assemblies consisted of C. campanulata (262.2 Mb, N50 ¼ 1.4 Mb) and C. speciosa (257.6 Mb, N50 ¼ 1.1 Mb) haplotypes (Table 1) and were designated as KWZcam_r1.0 and KWZspe_r1.0, respectively. Although the total assembly size was shorter than the estimated size, the complete BUSCO scores of KWZcam_r1.0 and KWZspe_r1.0 were 93.1% and 96.7%, respectively, indicating that the assemblies were complete ( Table 1). The two assemblies were merged to generate KWZ_r1.0, with a complete BUSCO score of 98.0%.
The 'Atami-zakura' genome was sequenced in parallel with the 'Kawazu-zakura' genome. Long-read data of 'Atami-zakura' (14.3 Gb) were obtained from two SMRT cells and divided into two subsets (7.4 and 6.8 Gb) using the short-read data of C. campanulata and C. speciosa, 7 respectively (Supplementary Table S2). The reads were assembled with Falcon or Canu to generate two haplotype contig sequences (Supplementary Table S3). The size of the Falcon assembly was much smaller than the Canu assembly. Therefore, we used the Canu for further assembly. This was followed by potential sequence error correction and organelle genome sequence removal. The sizes of the resultant assemblies were improved to 267.4 Mb (N50 ¼ 853.5 kb) and 242.2 Mb (N50 ¼ 569.4 Mb) for the C. campanulata and C. speciosa haplotypes, respectively (Table 1), and the assemblies were designated as ATMcam_r1.0 and ATMspe_r1.0, respectively. The complete BUSCO scores were 93.4% and 93.5% for ATMcam_r1.0 and ATMspe_r1.0, respectively (Table 1), and 98.2% for the merged assembly (ATM_r1.0).

Gene and repetitive sequence predictions
A total of 36,281 and 36,421 HC protein-coding genes were predicted in KWZcam_r1.0 and KWZspe_r1.0 assemblies, respectively ( Table 2). The complete BUSCO scores of genes in the KWZcam_r1.0 and KWZspe_r1.0 were 88.3% and 86.6%, respectively, while the BUSCO score of all 72,702 genes was 97.0%. Functional gene annotation revealed that 9,430, 17,907 and 12,603 sequences were assigned to Gene Ontology (GO) slim terms in the biological process, cellular component and molecular function categories, respectively, and 2,264 genes had enzyme commission numbers.
On the other hand, 36,264 and 33,264 HC genes were predicted in ATMcam_r1.0 and ATMspe_r1.0 assemblies, respectively ( Table 2). Complete BUSCOs of genes in ATMcam_r1.0 and ATMspe_r1.0 were 88.3% and 86.6%, respectively, while that of all 69,528 genes was 96.8%. According to the functional gene annotation, 9,836, 18,586 and 13,020 sequences were assigned to GO slim terms in the biological process, cellular component and molecular function categories, respectively, and 2,301 genes had enzyme commission numbers.

Gene clustering and sequence variant analyses in early-flowering cherry varieties
Four sets of genes predicted in the haplotype-phased genomes of 'Kawazu-zakura' and 'Atami-zakura' clustered with two sets of genes in the two haploid sequences of 'Somei-Yoshino'. A total of 35,226 clusters were obtained, of which 10,702 were common across all six gene sets (Fig. 3). The early-flowering phenotype of C. Â kanzakura could be explained by genes uniquely present in the C. campanulata haplotype sequences. In the C. campanulata haplotype sequences of 'Kawazu-zakura' and 'Atami-zakura' genomes, a total of 2,634 clusters were found to include 3,123 and 3,113 genes, respectively (Supplementary Table S4).

Conclusion and future perspectives
Here, we report haplotype-phased genome assemblies of two earlyflowering cherry (C. Â kanzakura) cultivars, 'Kawazu-zakura' and 'Atami-zakura', both of which are interspecific hybrids derived from C. campanulata and C. speciosa. Although the origin of C. Â kanzakura remains unclear, C. campanulata and C. speciosa and/or C. jamasakura are considered as its potential parents. 2 Another possibility is that 'Atami-zakura' originated from C. jamasakura and C. campanulata. 13 This is supported by the fact that our attempt to divide the long reads of 'Atami-zakura' into two subsets using shortread data of C. serrulata (closely related to C. jamasakura 7 ) and C. campanulata failed (Supplementary Table S2). Therefore, we used short reads of C. campanulata and C. speciosa for both 'Kawazuzakura' and 'Atami-zakura'. This result suggests that both 'Kawazuzakura' and 'Atami-zakura' are closely related to C. campanulata and C. speciosa. Clustering analysis of genes predicted in the genomes of 'Kawazu-zakura' and 'Atami-zakura' together with those of 'Somei-Yoshino' revealed that 2,634 gene clusters were uniquely present in the genome of C. campanulata but absent from the genomes of C. spachiana and C. speciosa (Fig. 3, Supplementary  Table S4). Such copy number variation (or presence/absence variation) of genes could explain the early-flowering phenotype of 'Kawazu-zakura' and 'Atami-zakura'. In addition, approximately 1 million sequence variants were found between the two haplotype sequences in both 'Kawazu-zakura' and 'Atami-zakura' (Supplementary Table S3). Previously, we performed a timecourse transcriptome analysis of the floral buds and flowers of 'Somei-Yoshino' to clarify gene expression patterns during flowering. 7,14 A similar time-course transcriptome analysis could be applied to 'Kawazu-zakura' and 'Atami-zakura'. Comparative transcriptome analysis of three cultivars could identify the genes responsible for the early-flowering phenotype of sakura. Furthermore, comparative transcriptome analysis of Japanese apricot and peach 3 could reveal the genetic mechanisms controlling flowering time across all Prunus and Cerasus species.
Although several flowering cherry cultivars are known to bloom in late-spring, fall and winter seasons, 7 genome sequences of only a few of these cultivars are publicly available. 7,15,16 Comparative genomics and transcriptomics, also known as pan-genomics, 4  would provide insights into the origins of these cultivars and their flowering mechanisms, which could facilitate the development of new cultivars with attractive flower characteristics and provide us with the ability to forecast the date of sakura blooming.