Genome sequence of Hydrangea macrophylla and its application in analysis of the double flower phenotype

Abstract Owing to its high ornamental value, the double flower phenotype of hydrangea (Hydrangea macrophylla) is one of its most important traits. In this study, genome sequence information was obtained to explore effective DNA markers and the causative genes for double flower production in hydrangea. Single-molecule real-time sequencing data followed by a Hi-C analysis were employed. Two haplotype-phased sequences were obtained from the heterozygous genome of hydrangea. One assembly consisted of 3,779 scaffolds (2.256 Gb in length and N50 of 1.5 Mb), the other also contained 3,779 scaffolds (2.227 Gb in length, and N50 of 1.4 Mb). A total of 36,930 genes were predicted in the sequences, of which 32,205 and 32,222 were found in each haplotype. A pair of 18 pseudomolecules was constructed along with a high-density single-nucleotide polymorphism (SNP) genetic linkage map. Using the genome sequence data, and two F2 populations, the SNPs linked to double flower loci (djo and dsu) were discovered. DNA markers linked to djo and dsu were developed, and these could distinguish the recessive double flower allele for each locus, respectively. The LEAFY gene is a very likely candidate as the causative gene for dsu, since frameshift was specifically observed in the double flower accession with dsu.


Introduction
Hydrangea macrophylla (Thunb.) Ser., commonly known as hydrangea, originated in Japan, and since it is the place of origin, there are rich genetic resources for this species there. Wild hydrangea accessions with superior characteristics have been used as breeding parents to create attractive cultivars, and it has a long history of use as an ornamental garden plant in temperate regions. In hydrangea, there are both decorative and non-decorative flowers within groups of flowers on the stems of the Hydrangea inflorescence. Decorative flowers have large ornamental petaloid sepals that attract pollinators, whereas non-decorative flowers have inconspicuous perianths with normal sepals that instead of play a major role in seed production. [1][2][3] In hydrangea, there are two decorative flower phenotypes: a single flower and double flower. Single flowers generally have four petaloid sepals per decorative flower, while in double flowers, it is approximately fourteen. 4 Double flowers do not have stamens or petals in the decorative flower. 4 Therefore, petals and stamens could be converted to petaloid sepals since the number of petaloid sepals increases, and stamens and petals are lost. Because of their high ornamental value, the production of double flowers is an important breeding target in hydrangea cultivation.
The double flower cultivars 'Sumidanohanabi' (Fig. 1A) and 'Jogasaki' (Fig. 1B) have been used as breeding parents in Japan 4 to obtain double flower progenies. Both cultivars lack stamens and petals and show increased petaloid sepals in their decorative flowers compared with single flower cultivars (Supplementary Fig. S1A-C). For non-decorative flowers, while they lack stamen similar to decorative flowers, slightly sepaloid petals are observed which increased its number (Supplementary Fig. S1D-F). Sepals in non-decorative flowers are observed in both single and double flower accessions (Supplementary Fig. S1G and H). Because both double flower cultivars lack stamen, they could not be used as pollen parents. However, these cultivars still have fertile pistils, they could be used as seed parent. Previous studies have suggested that the double flower phenotype is a recessive trait controlled by a single major gene. 4,5 Suyama et al. 4 found that crosses between the progeny of 'Sumidanohanabi' and the progeny of 'Jogasaki' produced only single flower descendants. Thus, it was also suggested that the genes controlling the double flower phenotype are different. 4 We named the double flower locus d su as the locus controlling the double flower phenotype of 'Sumidanohanabi' and the double flower locus d jo as the locus controlling the double flower phenotype of 'Jogasaki.' Waki et al. 5 identified d su on the genetic linkage map. They also found that the DNA marker STAB045 was the closest marker to d su and that STAB045 could help distinguish flower phenotypes with a high degree of agreement. Contrarily, d jo has not been identified, and the DNA marker linked to d jo has not been developed. It is still not known whether d jo and d su are at the same loci.
The mechanisms and genes controlling the double flower phenotype in hydrangea have not been clarified. Waki et al. 5 hypothesized that the mutation of C-class genes could be associated with the double flower phenotype of 'Sumidanohanabi' since the C-class gene mutant of Arabidopsis thaliana and C-class gene-repressed petunias produce double flowers. 6 However, the double flower phenotype of hydrangea is morphologically different from that of Arabidopsis thaliana and petunia: petals and stamens converted to petaloid sepals, while stamens converted to petals in Arabidopsis thaiana and petunia. This suggests that the genes controlling double flower production in hydrangea differ from genes controlling double flower production in other plant species. Identification of the genes controlling double flower production in hydrangea could reveal novel regulatory mechanisms of flower development.
Genomic information is essential for DNA marker development and identification of genes controlling specific phenotypes. However, no reference genome sequence is publicly available for hydrangea so far. Although a genome assembly of hydrangea (1.6 Gb) using only short-read data has been reported, 7 the resultant assembly is so fragmented that it comprises 1,519,429 contigs with an N50 size 2,447 bp, and has not been disclosed. Improved, advanced long-read technologies and bioinformatics methods would make it possible to determine the sequences of complex genomes. An assembly strategy for single-molecule real-time sequencing data followed by a Hi-C analysis has been developed to generate haplotype-phased sequences in heterozygous regions of diploid genomes. 8 Genome sequences at the chromosome level could be obtained with a Hi-C clustering analysis 9 and a genetic linkage analysis. 10 Such a genomic sequence could provide basic information to identify genes and DNA markers of interest and to discover allelic sequence variations. In this study, we constructed the genomic DNA sequence, obtained single-nucleotide polymorphisms (SNPs) information, and performed gene prediction. We also developed DNA markers linked to d jo using SNP information obtained by double digest restriction site associated DNA sequence (ddRAD-Seq) analysis of the F 2 mapping population 12GM1, which segregated double flower phenotypes of d jo . In addition, we attempted to identify the causative genes for d jo and d su .

De novo assembly of the hydrangea genome
For genomic DNA sequencing, H. macrophylla 'Aogashima-1,' collected from Aogashima Island of the Izu Islands in the Tokyo Prefecture, Japan, was used. Genomic DNA was extracted from the young leaves using Genomic-Tip (Qiagen, Hilden, Germany). First, we constructed a sequencing library (insert size 500 bp) with TruSeq DNA PCR-Free Library Prep Kit (Illumina, San Diego, CA, USA) to sequence on HiSeqX (Illumina). The size of the 'Aogashima-1' genome was estimated using Jellyfish v2. 1.4. 11 After removing the adapter sequences and trimming low-quality reads, high-quality reads were assembled using Platanus. 12 The resultant sequences were designated HMA_r0.1. The completeness of the assembly was assessed with sets of BUSCO v.1.1b. 13 Next, an PacBio single molecule, real-Time (SMRT) library was constructed with the SMRTbell Express Template Prep Kit 2.0 (PacBio, Menlo Park, CA, USA) in accordance with the manufacturer's protocol and sequenced using SMRT Cell v2.1 on a Sequel System. The sequence reads were assembled using FALCON v.1.8.8 14 to generate primary contig sequences and associate contigs representing alternative alleles. Haplotype-resolved assemblies (i.e. haplotigs) were generated using FALCON-Unzip v.1.8.8. 14 Potential sequence errors in the contigs were corrected twice with ARROW v.2.2.1 implemented in SMRT Link v.5.0 (PacBio) followed by one polishing with Pilon. 15 The resultant sequences were designated HMA_r1.0. Subsequently, a Hi-C library was constructed with Proximo Hi-C (Plant) Kit (Phase Genomics, Seattle, WA, USA) and sequenced on HiSeqX (Illumina). After removing the adapter sequences and trimming the low-quality reads, high-quality Hi-C reads were used to generate two haplotype-phased sequences (Phase 0 and Phase 1) from the primary contigs and haplotig sequences with FALCON-Phase. 8 The resultant sequences were designated HMA_r1.1. Haplotype phase 0 was derived from one of a pair of homologous chromosomes of 'Aogashima-1', and the haplotype phase 1 was derived from other 'Aogashima-1' chromosomes.
To validate the accuracy of the sequences, we developed a genetic map based on SNPs, from a ddRAD-Seq analysis on the 12GM1, F 2 mapping population (n ¼ 147), maintained at the Fukuoka Agriculture and Forestry Research Center, Japan. The 12GM1 population was generated from a cross between 'Posy Bouquet Grace' (Fig. 1C) and 'Blue Picotee Manaslu' (Fig. 1D). Genomic DNA was extracted from the leaves with a DNeasy Plant Mini Kit (Qiagen). A ddRAD-Seq library was constructed as described by Shirasawa et al. 16 and sequenced with HiSeq4000. Sequence reads were processed as described by Shirasawa et al. 16 and mapped on the HMA_r1.1 as a reference. From the mapping alignment, highconfidence biallelic SNPs were obtained with the following filtering options: -minDP 5 -minQ 10 -max-missing 0.5. The genetic map was constructed using Lep-Map3. 17 Potential mis-jointed points in Phase 0 and 1 sequences of HMA_r1.1 were cut and re-joined, based on the marker order in the genetic map. The resultant sequences were named HMA_r1.2. Haplotype-phased pseudomolecule sequences were generated based on HMA_r1.2 and genetic map using ALLMAPS. 18 The resultant sequences were named HMA_r1.2.pmol, as two haplotype-phased pseudomolecule sequences of the 'Aogashima-1' genome. Sequences that were unassigned to the genetic map were connected and termed chromosome 0. The constructed pseudomolecule sequence was corresponded with the previously reported SSR marker-based genetic linkage map 5 by a BLAST search of simple sequence repeat (SSR) marker sequences. SSR markers that were included in the hydrangea genetic linkage map 5 were searched for HMA_r1.2.pmol. When more than two SSR markers in a linkage group were located on the same pseudomolecule, the pseudomolecule was treated as corresponded to the genetic linkage map. The nomenclature of the pseudomolecule number was assigned in accordance with linkage group number of the genetic map. 5

Gene prediction
For gene prediction, we performed an Iso-Seq analysis. Total RNA was extracted from 12 samples of 'Aogashima-1': flower buds (2 stages); decorative flowers (2 stages); coloured and colourless nondecorative flowers; fruits; stem; roots; leaf buds and one-day lightintercepted leaves and buds. In addition, the mixed sample (Sample No. 29) listed in Supplementary Table S1 was included. Iso-Seq libraries were prepared following the manufacture's Iso-Seq Express Template Preparation (PacBio) protocol and sequenced on a Sequel System (PacBio). The raw reads obtained were treated with ISO-Seq3 pipeline, implemented in SMRT Link v.5.0 (PacBio) to generate fulllength, high-quality consensus isoforms. In parallel, RNA-Seq data were also obtained from the 16 samples listed in Supplementary  Table S1. Total RNA extracted from the samples was converted into cDNA and sequenced on HiSeq2000, Hiseq2500 (Illumina), and NovaSeq6000 (Illumina). The Iso-Seq isoform sequences and the RNA-Seq short-reads were employed for gene prediction.
We used ab-initio-, evidence-, and homology-based gene prediction methods to identify putative protein-encoding genes in the genome assemblies. For this prediction, unigene sets generated from (i) the Iso-Seq isoforms; (ii) de novo assembly of the RNA-Seq shortreads with Trinity-v2.4.0 19 ; (iii) peptide sequences predicted from the genomes of Arabidopsis thaliana, Arachis hypogaea, Cannabis sativa, Capsicum annuum, Cucumis sativus, Populus trichocarpa, and Quercus lobata; and (iv) ab-initio genes, were predicted with Augustus-v3.3.1. 20 The unigene sequences were aligned onto the genome assembly with BLAT, 21 and the genome positions of the genes were listed in general feature format version 3 with blat2gff.pl (https://github.com/vikas0633/perl/blob/master/blat2gff.pl)(12 November 2020, date last accessed). Gene annotation was performed with Hayai-annotation Plants. 22 Completeness of the gene prediction was assessed with sets of BUSCO v4.0.6. 13

Detection of SNPs linked to double flower phenotype
For identification of SNPs linked to double flower loci d jo and d su , ddRAD-Seq data analysis was performed. ddRAD-Seq data of the 12GM1 population described above were used to identify d jo . For identification of SNPs linked to double flower locus d su , KF population 5 -93 F 2 individuals of 'Kirakiraboshi' (Fig. 1E) and 'Frau Yoshimi' (Fig. 1F)-were used for ddRAD-Seq analysis. The KF population was maintained at the Tochigi Prefectural Agricultural Experimental Station, Japan. ddRAD-Seq analysis of the KF population was performed using the same method used for the 12GM1 population.
ddRAD-Seq data of the 12GM1 and KF populations were processed as follows: low-quality sequences were removed, and adapters were trimmed using Trimmomatic-0.36 23 (LEADING : 10, TRAILING : 10, SLIDING WINDOW: 4:15, MINLEN : 51). BWA-MEM (v 0.7.15-r1140) was used for mapping to the genome sequence. The resultant sequence alignment/map format files were converted to binary sequence alignment/map format files and subjected to SNP calling using the mpileup option of SAMtools 24 (v 1.4.1) and the view option of BCFtools (parameter -vcg). If the DP of called SNP in individuals was under 5%, the genotype was treated as missing. SNPs with 5% or more of missing genotype were filtered out. Each SNP was evaluated for its degree of agreement with the flower phenotype. The degree of agreement with the flower phenotype was calculated as the percentage of individuals agreeing with the model (double flower phenotype is observed when homozygous of allele is from the double flower parent, while single flower phenotype is observed when heterozygous or homozygous of allele is from the single flower parent) among the population.

DNA marker development and analysis for d jo
Cleaved amplified polymorphic sequence (CAPS) markers were designed based on SNPs that were completely co-segregated to the double flower locus d jo . Primers were designed using Primer3 25 under conditions with product size ranging from 150 to 350 bp, primer size from 18 to 27 bp, and primer TM from 57 to 63 C. PCR assays were conducted in a total volume of 10 lL, containing 5 lL of GoTaq Master Mix (Promega, Madison, WI, USA), 1 mM each of forward and reverse primer, and 5 ng of template DNA. The PCR conditions were 94 C for 2 min, 35 cycles of denaturation at 94 C for 1 min, annealing at 55 C for 1 min, and extension at 72 C for 1 min; and a final extension step at 72 C for 3 min. Subsequently, restriction enzyme assay was conducted in a total volume of 10 lL, containing 5 lL PCR product and 10 units of restriction enzyme. The restriction assay product was stained with 1Â GRRED (Biocraft, Tokyo, Japan) and separated in 1.5% (w/v) agarose gel in Tris-acetate-EDTA (TAE) buffer. The CAPS marker, whose genotype was coincident with the SNP genotype in the 12GM1 population, was selected as the relevant marker. The relevant CAPS marker named J01 was designed from SNP on scaffold : 0008 F-2, position: 780104. The primer sequences of J01 were: Forward: 5 0 -CTG GCAGATTCCTCCTGAC-3 0 and Reverse: 5 0 -TATTTCCTTGGGG AGGCTCT-3 0 . A restriction enzyme assay was performed at 65 C for 3 h using Taq I (New England Biolabs, Ipswich, MA, USA). CAPS marker J01 was applied to the 14GT77 population (64 F 2 individuals of 'Posy Bouquet Grace' Â 'Chibori') and the 15IJP1 population (98 F 1 individuals of 'Izunohana' Â 03JP1) in addition to the 12GM1 population that segregate the double flower locus d jo .  24 High-confidence variants were selected using VCFtools 28 v. 0.1.12b with parameters of -minDP 10, -maxDP 100, -minQ 999, -maxmissing 1.

Resequencing and comparison of LEAFY gene sequence and DNA marker development
For comparison of the LEAFY (LFY) sequence in 'Kirakiraboshi,' 'Frau Yoshimi,' 'Posy Bouquet Grace,' and 'Blue Picotee Manaslu,' BLAST analysis was performed to confirm detected sequence variants using the genomic sequence of LFY (Scaffold 0577 F, position 678200-684639) as query, and the genomic DNA sequence of each cultivar as the database. These data analyses were performed using CLC main workbench (Qiagen). INDEL marker S01 that amplifies the second intron of LFY, was designed by visual inspection (Forward: 5 0 -CA TCATTAATAGTGGTGACAG-3 0 , Reverse: 5 0 -CACACATGAATTAG TAGCTC-3 0 ). The PCR conditions were 94 C for 2 min, 35 cycles of denaturation at 94 C for 1 min, annealing at 55 C for 1 min, extension at 72 C for 1 min; and a final extension step at 72 C for 3 min. The PCR product was stained with 1x GRRED (Biocraft) and separated in 2.5% (w/v) agarose gel in TAE buffer.

Cloning and sequence determination of LFY gene of 'Kirakiraboshi' and 'Frau Yoshimi'
Total RNA was isolated from flower buds of 'Kirakiraboshi' and 'Frau Yoshimi' using RNAiso Plus (TaKaRa, Japan), and reverse transcribed using a PrimeScript II 1st strand cDNA Synthesis Kit (TaKaRa, Japan). The sequence of the LFY gene was amplified by PCR in a 50-mL reaction mixture using TaKaRa Ex Taq Hot Start Version (TaKaRa Bio, Shiga, Japan) and the LFY specific primer (Forward: 5 0 -ATGGCTCCACTACCTCCACC-3 0 and Reverse: 5 0 -CTAACACCCTCTAAAAGCAG-3 0 ). These PCR products were purified and inserted into a pMD20-T vector using a Mighty TAcloning kit (TaKaRa Bio). The sequence of the LFY coding sequence (CDS) in the pMD20-T vector was analysed using a 3130xl Genetic Analyzer (Applied Biosystems, Foster City, CA, USA) for DNA sequencing. Sequence alignments were obtained using the CLC main workbench (Qiagen).

DNA marker assessment across hydrangea accessions
Thirty-five H. macrophylla accessions were used for the assessment of DNA markers for the double flower phenotype. Genotyping for J01 was performed as described above. Genotyping for S01 was performed by fragment analysis as follows. PCR amplification was performed in a 10-lL reaction mixture containing 5 lL of GoTaq Master Mix (Promega), 5 pmol FAM-labeled universal primer (5 0 -FAM-gctacggactgacctcggac -3 0 ), 2.5 pmol forward primer with universal adapter sequence (5 0 -gctacggactgacctcggac CATCATTAATAGTGGTGACAG -3 0 ), 5 pmol reverse primer, and 5 ng of template DNA. DNA was amplified in 35 cycles of 94 C for 1 min, 55 C for 1 min, and 72 C for 2 min; and a final extension of 5 min at 72 C. The amplified PCR products were separated and detected with a PRISM 3130xl Genetic Analyzer (Applied Biosystems, USA). The sizes of the amplified bands were scored against internal-standard DNA (400HD-ROX, Applied Biosystems, USA) with GeneMapper software (Applied Biosystems, USA).

Draft genome assembly with long-read and Hi-C technologies
The size of the hydrangea genome was estimated by k-mer-distribution analysis with the short-read of 132.3 Gb data. The resultant distribution pattern indicated two peaks, representing homozygous (right peak) and heterozygous (left peak) genomes, respectively ( Supplementary Fig. S2). The haploid genome of hydrangea was estimated to be 2.2 Gb in size. The short-reads were assembled into 612,846 scaffold sequences. The total length of the resultant scaffolds, i.e. HMA_r0.1, was 1.7 Gb with an N50 length of 9.1 kb (Supplementary Table S2). Only 72.2% of complete single-copy orthologues in plant genomes were identified by BUSCO analysis (Supplementary Table S2).
Next, we employed long sequence technology to extend the sequence contiguity and to improve the genome coverage. A total of 106.9 Gb of reads (48.6Â genome coverage) with an N50 read length of 28.8 kb was obtained from 14 SMRT Cells. The long-reads were assembled, followed by sequence error corrections into 15

Pseudomolecule sequences based on genetic mapping
To detect potential errors in the assembly and to assign the contig sequences onto the hydrangea chromosomes, we established an F 2 genetic map based on SNPs derived from ddRAD-Seq analysis. Approximately 1.8 million high-quality ddRAD-Seq reads per sample were obtained from the mapping population and mapped to the Phase 0 and Phase 1 sequences, with alignment rates of 88.4% and 88.7%, respectively. A set of SNPs detected from the alignments were classified into 18 groups and ordered to construct two genetic maps for the two-phased sequences (2,849.3 cM in length with 3,980 SNPs 2,944.5 cM in length with 4,071 SNPs). The phased sequences were aligned on each genetic map to establish haplotypephased, chromosome-level pseudomolecule sequences. During this process, one contig was cut due to possible mis-assembly and was designated as HMA_r1.2. The resultant pseudomolecule sequence HMA_r1.2.pmol for Phase 0 had 730 contigs with a total length of 1,078 Mb, and the other for Phase 1 had 743 contigs spanning 1,076 Mb. The pseudomolecule was named in accordance with the nomenclature of the previous genetic map based on SSRs. 5 HMA_r1.2.pmol included 36 pseudomolecule sequences of CHR01 to CHR18 for Phase 0 and Phase 1, which corresponded to the chromosome number of the diploid H. macrophylla genome (2n ¼ 36). 29 Correspondence between the pseudomolecule sequence and previous genetic linkage map 5

Transcriptome analysis followed by gene prediction
In the Iso-Seq analysis Circular Consensus Sequence (CCS) reads were generated from the raw sequence reads. The CCS reads were classified as full-length and non-full-length reads, and the full-length reads were clustered to produce consensus isoforms. In total, 116,634 high-quality isoforms were used for gene prediction, while, in the RNA-Seq analysis, a total of 80.7 Gb reads were obtained and assembled into 12,265 unigenes. The high-quality isoforms and unigenes, together with gene sequences predicted from the Arabidopsis thaliana, Arachis hypogaea, Cannabis sativa, Capsicum annuum, Cucumis sativus, Populus trichocarpa, and Quercus lobate genomes were aligned onto the assembly sequence of the hydrangea genome. By adding ab-initio on the genes, a total of 36,930 putative proteinencoding genes were predicted in the hydrangea genome, out of which 32,205 and 32,222 genes were found in the Phase 0 and Phase 1 sequences. The 36,930 genes included 89.9% complete BUSCOs.

Identification of SNPs tightly linked to double flower phenotype d jo
To identify SNPs tightly linked to the d jo derived double flower phenotype of 'Jogasaki,' ddRAD-Seq analysis was performed on the 12GM1 population, which segregates the double flower phenotype of 'Jogasaki.' According to data of variety under Plant Variety Protection (http://www.hinshu2.maff.go.jp/)(12 November 2020, date last accessed), 'Posy Bouquet Grace' was obtained by a cross between 'Jogasaki' derived progenies. Therefore, the double flower phenotype of 'Posy Bouquet Grace' and its F 2 population 12GM1 should contain the d jo mutation. As a result, 14,006 SNPs were called by ddRAD-Seq analysis of the 12GM1 population. In this population, the double flower phenotype was expected when the plant was homozygous for the 'Posy Bouquet Grace' genotype, and the single flower phenotype was expected when the plant was homozygous for 'Blue Picotee Manaslu' or was heterozygous. Each SNP was tested for its degree of agreement with this model. As a result, nine SNPs were found to have more than a 95% degree of agreement, and six SNPs were completely co-segregated with flower phenotype (Table 1). While these SNPs were located on 33.7 Mb to 43.8 Mb on CHR17, d jo was suggested to be located at approximately 33.7 Mb to 43.8 Mb on CHR17 (Fig. 2).

Identification of SNPs tightly linked to double flower phenotype d su
The KF population that segregates the double flower phenotype derived from 'Sumidanohanabi' was used to identify SNPs linked to the double flower phenotype of 'Sumidanohanabi.' First, we attempted to find co-segregated scaffolds with the double flower phenotype using ddRAD-Seq analysis of the KF population. As a result of the ddRAD-Seq analysis, 15,102 of SNPs were called. In this population, the double flower phenotype was expected when the plant was homozygous for the 'Kirakiraboshi' genotype, and the single flower phenotype was expected when the plant was homozygous for 'Frau Yoshimi' or was heterozygous. Each SNP was tested for its degree of agreement with this model. As a result, five SNPs on three scaffolds were found to have more than a 95% degree of agreement with the model ( Table 2). Since the SNPs on scaffold 3145 F all had the same genotype across the KF population, three loci-on scaffold 0577 F, 3145 F, 0109 F-were detected. Although the numbers of heterozygous and homozygous of 'Frau Yoshimi' were the same (Table 2), recombination was observed in two individuals between 0577 and 3145 F ( Table 2). According to genotypes of the KF population, these three loci were tightly linked within 5 cM; 0109 F (0 cM)-3145F (3.9 cM)-0577F (5.0 cM). The scaffold 3145 and 0577 F sequences were not included in the pseudomolecule sequence because no segregated SNPs were detected on 3145 and 0577 F in the 12GM1 population. However, the SNP at position 868,569 in 0109 F was found at position 57,436,162 in CHR04 (the terminal region of CHR04). Thus, these three loci and locus d su , which control the double flower phenotype of 'Sumidanohanabi,' were suggested to be located at the terminal region of CHR04 (Fig. 2). Waki et al. 5 reported that the d su locus was located on linkage group KF_4, which corresponded to pseudochromosome CHR04 in this study. Therefore, our results were coincident with the previous report. It has been suggested that genes controlling the double flower phenotype differed between 'Jogasaki' and 'Sumidanohanabi' based on confirmation of the segregation ratio of crossed progenies. 4 Our study revealed that the double flower phenotype of 'Jogasaki' was controlled by the single d jo locus on CHR17, and the double flower phenotype of 'Sumidanohanabi' was controlled by the single d su locus on CHR04. It is confirmed that genes controlling the double flowers d jo and d su are different as suggested by Suyama et al 4 .

Prediction of genes controlling double flower for d su and d jo from genome sequence
To find the gene controlling d su and d jo , we searched the homeotic genes on the scaffolds shown in Tables 1 and 2. We did not find any notable homeotic gene controlling flower phenotype for d jo . However, for d su , the g182220 gene that encoded the homeotic LFY gene was found on scaffold 0577 F. To investigate the possibility that it was the causative gene for d su , variants on LFY genomic sequence were searched to identify the 'Kirakiraboshi' specific mutation, using the resequencing data of 'Kirakiraboshi,' 'Frau Yoshimi,' 'Posy Bouquet Grace,' and 'Blue Picotee Manaslu.' As a result, four INDELs and five sequence variants were 'Kirakiraboshi' specific mutations (Fig. 3A).
Cloning and sequencing of the LFY CDS were performed on 'Kirakiraboshi' and 'Frau Yoshimi.' From 'Frau Yoshimi,' a single CDS comprising three exons was obtained. From 'Kirakiraboshi,' two CDSs with splice variants were obtained. While splicing variant 1 resulted in three exons, splicing variant 2 resulted in only two exons, corresponding to the first and third splice products of splicing variant 1 (Fig. 3B, Supplementary Fig. S3). The deduced amino acid sequences were aligned using the CDSs of 'Frau Yoshimi' and 'Kirakiraboshi,' g182220 sequence, protein LFY of Arabidopsis thaliana, and protein FLO of Antirrhinum majus. While the deduced amino acid sequences of 'Frau Yoshimi' and g182220 showed sequence similarity in the entire region, frameshift occurred in the two splicing variants obtained from 'Kirakiraboshi' and the resulting products had no sequence similarity across the latter half (Fig. 4). The frameshift observed in splicing variant 1 was due to one basepair of DNA insertion in the second exon, at position 1,931 (Fig. 3A, Supplementary  Fig. S3). Contrarily, the frameshift observed in splicing variant 2 was due to the complete loss of exon 2 (Fig. 3B, Supplementary  Fig. S3).

LFY gene function in double flower of hydrangea
In the genomic sequence of 'Kirakiraboshi,' an insertion was detected in the second exon of the LFY gene. This insertion resulted in a frameshift of cloned mRNA in splicing variant 1. For splicing variant 2, the lack of a second exon resulted in frameshift. Because frameshift occurred in both splicing variants of 'Kirakiraboshi,' it was speculated that the function of LFY was suppressed or lost in 'Kirakiraboshi.' LFY and its homolog FLO have been identified in many plants such as Arabidopsis thaliana and Antirrhinum majus, and are known as transcription factors for major flowering signals. [30][31][32] Additionally, many phenotypes of Arabidopsis lfy mutants have been reported. 32,33 In the lfy strong phenotype, most organs are sepal-like, or mosaic sepal/carpel organs and the sepal-like organs are characteristic of wild-type cauline leaves. 34 A similar phenotype has been reported in the LFY homolog mutants or transgenic plants such as the flo mutant of Antirrhinum majus, 35 uni mutant of pea, 36 and co-suppressed NFL transgenic tobacco. 37 As stated above, when the LFY gene function is lost, petals, stamens, and carpels are likely to be replaced by sepal-like organs. In addition, when male sterility was observed in the Arabidopsis lfy mutant, female fertility was reduced due to the lack of stamens, but remained due to the persistence of the carpel. 34 In the decorative flowers of hydrangea, sepals show petaloid characteristics, including pigmentation and enlarged organ size. 1 In addition, the double flower derived from d su showed male sterility and reduced female fertility. These phenotypes in double flower hydrangea are similar to lfy mutants. Thus, it is reasonable to assume that the double flower phenotype of d su is derived from the transformation of petals and stamens into sepal-like organs caused by an LFY mutation. We assumed that LFY is a causative gene of the double flower phenotype of 'Sumidanohanabi. ' However, there remain several unexplained observations in this study. The double flower of 'Kirakiraboshi' did not exhibit the same phenotype as the lfy mutant. Generally, the flowers of lfy or its orthologous gene mutants have only leaf-like or sepal-like organs that have chlorophyll, stomata, and trichomes, and these organs have almost no petal identity. 34,35 When flowering signals in lfy mutants were completely lost, floral organs were not fully formed. [34][35][36] It has also been reported that lfy mutants with an intermediate or weak phenotype sometimes develop petaloid organs. 34 In the double flowers of 'Kirakiraboshi,' the floral organs keep their petal identity, have papilla cells, and are pink or blue. These phenotypes of 'Kirakiraboshi' might reflect partially remnant LFY function. It should be considered that another LFY gene possibly exists in the genome, and it partially compensates for the LFY function. However, according to the genomic and Iso-Seq sequences of H. macrophylla, no other LFY gene was observed. The mechanism of keeping petal identity in the floral organ of the double flower cultivar is still unknown. Comparing gene expression of LFY and LFY regulated genes between double and single flower cultivars might reveal aspects of the mechanism.

Candidate causative gene for d jo
While LFY was found as a candidate gene for controlling the double flower phenotype of d su , we could not find any candidate gene for the d jo locus. One possible reason was that SNPs were not called in scaffold with the causative gene. In pseudomolecules, about half of the total length of the scaffold was not included since relevant SNPs were not called. Improvement of SNP density would be effective for discovering additional scaffolds that are tightly linked to d jo . Although a candidate gene for d jo could not be identified from the completely co-segregated scaffold sequencing, we predicted several candidate genes based on gene function.
Because the phenotype of the d jo mutant is very similar to the d su mutant, the d jo causative gene may be a gene associated with LFY, which was suggested as a candidate causative gene for the d su mutant. The absence of petals and stamens in lfy mutant flowers in Arabidopsis has been traced back to a failure to activate the petaland stamen-specific B-class genes APETALA3 (AP3) and PISTILLATA (PI). 34 For activation of gene expression of AP3, UNUSUAL FLORAL ORGANS (UFO) is required as a transcriptional cofactor of LFY. 38 It is reported that the ufo mutant showed a similar phenotype with the lfy mutant in Arabidopsis. 39 According to the regulatory network, the mutation in UFO or B-class genes, could cause an increase in sepals and lack of stamens and petals. Previously, HmPI, HmAP3, and HmTM6 were identified as Bclass genes in hydrangea. 40,41 As HmAP3 was located on CHR13 as 0773 F.1_g216110, it was not considered as a causative gene for d jo . Although HmPI was found as 0880 F.1_g231800 and HmTM6 as 0266 F.1_g601980, they were not included in the pseudomolecule. Ascertaining the loci of these genes might reveal the causative gene for d jo . On the contrary to B-class genes, the UFO gene has not been reported in hydrangea. According to HMA_r1.2 scaffold sequences, the UFO gene was found as 0185 F.1g081460. Because this gene was located on pseudomolecule CHR17_ 53812239-53813558, which was adjacent to d jo , it should be a candidate gene for d jo . Although we could not find a loss of function mutation of UFO CDS in the d jo mutant cultivar 'Posy Bouquet Grace' (data are not shown), it is still possible that the UFO gene function is lost. One possibility is that the loss of gene expression occurs owing to large INDEL in the promoter region of UFO. Further genome sequence comparison could clarify these possibilities.

DNA marker development for d jo and d su
For the development of the CAPS marker linked to d jo , SNPs that were completely co-segregated with the double flower phenotype were selected. As a result, the CAPS marker J01 was successfully developed based on the SNP at scaffold 0008F-2_780104. The J01 CAPS marker amplified 167 bp of the fragment by PCR, and digestion with Taq I restriction enzyme generated 50 and 117 bp fragments in the double flower allele (Fig. 5). The J01 marker showed a high degree of agreement with the flower phenotype at 99.0% in the 15IJP1 and 100% in the 14GT77 populations that segregated the double flower phenotype of 'Jogasaki' (Supplementary Tables S4 and  S5). This indicated that the J01 marker was tightly linked to the d jo locus, and J01 could effectively distinguish the recessive allele from the dominant allele.
For the development of the DNA marker for d su , an INDEL marker named S01 was developed on the LFY gene. To develop a DNA marker for distinguishing the d su recessive allele from the dominant alleles in the LFY genomic sequence, we focused and designed a DNA marker on the 'Kirakiraboshi' specific 14 bp deletion at position 3,617 from the initiation codon (Fig. 3A). We developed an S01 marker amplified 236 bp fragment for the double flower allele of 'Kirakiraboshi,' and 250 bp and 280 bp fragments for the single flower allele of 'Frau Yoshimi' (Fig. 6A). Three types of alleles resulted from the presence or absence of a 30 bp deletion at position 3,513 in addition to the 14 bp deletion. These were 30 bp and 14 bp deletions on the 236 bp allele, 30 bp deletion on the 250 bp allele, and no deletion on the 280 bp allele (Fig. 6B).

DNA marker application for hydrangea accessions
Since the J01 marker could distinguish d jo alleles and the S01 marker could distinguish d su alleles, the combined use of J01 and S01 DNA markers was expected to reveal the origin of the double flower phenotype, d jo or d su , in various accessions. Therefore, we performed DNA marker genotyping on H. macrophylla accessions using two DNA markers, J01 and S01. All tested double flower accessions showed homozygous genotypes of J01 or S01; 10 of the double flower accessions were homozygous of 117_50 in J01, and 4 were homozygous of 236 in S01 (Table 3). Contrarily, all single flower accessions showed other genotypes. Therefore, developed DNA markers J01 and S01 could successfully identify recessive double flower alleles for d jo and d su , respectively. Both markers showed a high degree of agreement with phenotype and were applicable to the examined H. macrophylla accessions. These DNA markers could be useful in marker-assisted selection (MAS) of double flower progenies. Identification of flower phenotype at the seedling stage by MAS would enable the discarding of single flower individuals and allow the growth of double flower individuals. The developed DNA markers should accelerate the breeding of double flower phenotypes.