A draft genome, resequencing, and metabolomes reveal the genetic background and molecular basis of the nutritional and medicinal properties of loquat (Eriobotrya japonica (Thunb.) Lindl)

Loquat (Eriobotrya japonica) is a popular fruit and medicinal plant. Here, a high-quality draft genome of the E. japonica ‘Big Five-pointed Star’ cultivar that covers ~98% (733.32 Mb) of the estimated genome size (749.25 Mb) and contains a total of 45,492 protein-coding genes is reported. Comparative genomic analysis suggests that the loquat genome has evolved a unique genetic mechanism of chromosome repair. Resequencing data from 52 loquat cultivars, including 16 white-fleshed and 36 yellow-fleshed variants, were analyzed, and the flower, leaf, and root metabolomes of ‘Big Five-pointed Star’ were determined using a UPLC-ESI-MS/M system. A genome-wide association study identified several candidate genes associated with flesh color in E. japonica, linking these phenotypes to sugar metabolism. A total of 577 metabolites, including 98 phenolic acids, 95 flavonoids, and 28 terpenoids, were found, and 191 metabolites, including 46 phenolic acids, 33 flavonoids, and 7 terpenoids, showed no differences in concentration among the leaves, roots, and flowers. Candidate genes related to the biosynthesis of various medicinal ingredients, such as phenolics, flavonoids, terpenoids, and polysaccharides, were identified. Some of these genes were confirmed to be members of expanding gene families, suggesting that the high concentrations of beneficial metabolites in loquat may be associated with the number of biosynthetic genes in this plant. In summary, this study provides fundamental molecular insights into the nutritional and medical properties of E. japonica.


Introduction
Eriobotrya japonica (Maloideae: Rosaceae), commonly known as loquat, is a type of evergreen fruit crop with a delicious taste and high nutrient contents 1 . According to documentary records and archaeological relics, E. japonica was first domesticated during the Han dynasty in China, 2000 years ago 2 . Today, it has been planted in more than 30 countries, including Japan, the United States, France, Italy, Egypt, and Spain. Its annual yield exceeds 1.2 million tons worldwide 3 . E. japonica cultivars can be divided into two groups based on their pulp and pericarp color: white-or yellow-fleshed. White-fleshed cultivars have higher sucrose contents than their yellowfleshed counterparts, making them taste better. Yellowfleshed fruit cultivars, which are the dominant cultivars, have higher nutritional value than their white-fleshed counterparts due to increased carotene contents 4 . E. japonica is also an important medicinal plant; its roots, leaves, and flowers have long been used in traditional Chinese medicine for the treatment of inflammation, diabetes, cancer, bacterial infection, aging, pain, and allergy [5][6][7] .
Whole-genome sequencing has been performed on other important fruit-producing crops and ornamental plants in the Rosaceae family, including Malus domestica 8 , Prunus persica 9 , Pyrus bretschneideri 10 , Pyrus betuleafolia 11 , Fragaria vesca 12 , Prunus mume 13 , Prunus avium 14 , Prunus yedoensis 15 , Rubus occidentalis 16 , Fragaria ananassa 17 , and Rosa multiflora 18 . The corresponding genomic data have provided global genetic information associated with their growth and development, ecological adaptive, and horticultural traits, which has been invaluable for breeding new varieties and tracking the complex evolution of these species. An increasing number of medicinal plants have appeared on the list of genome-sequenced species, which has significantly enhanced our understanding of the genetic background and molecular basis associated with the biosynthesis of medicinal components by these plants [19][20][21][22] .
As a popular fruit and medicinal plant, E. japonica has received particular attention from scientists and is the subject of horticultural, biological, and pharmaceutical research. However, there remains a gap in our knowledge regarding the genetic background of E. japonica, particularly with respect to the molecular basis of medicinal compound biosynthesis. Clarifying this issue would undoubtedly represent important progress in our understanding of the molecular pharmacognosy of E. japonica.
Here, a draft genome of the 'Big Five-pointed Star' yellow-fleshed E. japonica cultivar was assembled and annotated. Resequencing data from 52 E. japonica cultivars, including 16 white-fleshed and 36 yellow-fleshed cultivars, were analyzed, and the metabolite profiles of leaf, flower, and root tissues from the 'Big Five-pointed Star' cultivar were determined. The major aims of this study were to construct an additional high-quality reference genome for further research and utilization and to evaluate and provide insights into the nutritional and medicinal properties of E. japonica.

Results
Sequencing and assembly of a high-quality loquat draft genome An individual 'Big Five-pointed Star' plant was selected for sequencing. Approximately 688.18 million clean short reads and a total of 51.54 Gb of data were generated using the HiSeq 4000 sequencing platform (Illumina, San Diego, CA, USA) (Table S1). These data were combined with a k-mer analysis to estimate a 'Big Five-point Star' genome size of 749.25 Mb (Table S2; Fig. S1), which was almost identical to that (749 Mb) determined using flow cytometry 23 , and the heterozygosity and GC content of the genome were found to be 0.31% and 38.58%, respectively (Table S2). Wholegenome sequencing was then performed using PacBio long-read sequencing technology (Pacific Biosciences, Menlo Park, CA, USA), and more than six million clean subreads with an average length of 6121 bp (N50 = 11,469 bp) were obtained (Table S3, Fig. S2). With these clean subreads, an initial draft genome composed of 3677 contigs with 733.32 Mb of nonredundant sequences was assembled (Table S4), covering~97.87% of the estimated genome size. Three measures were adopted to evaluate the completeness of the initial draft genome assembly. First, the screening of 458 core eukaryotic genes and 248 conserved sequence datasets in the Core Eukaryotic Genes Mapping Approach (CEGMA) database 24 identified 447 (97.06%) and 238 (95.97%) matches, respectively (Table S5). Second, using the Benchmarking Universal Single-copy Orthologs (BUSCO) database 25 , which contains 2326 plantspecific orthologous genes, a total of 2170 (93.29%) genes were identified, among which 1450 were single, 689 were duplicated complete, and 31 were fragmented. The number of missing BUSCO genes was only 156 (6.71%) (Table S5). Last, by mapping the short-read data onto the draft genome, it was found that 93.81% of the draft genome could be aligned (Table S5). The above results suggested that the initial draft genome had good assembly completeness. Approximately 96 Gb of data from~321.2 million reads generated on the Illumina HiSeq 4000 sequencing platform were used to locate the contigs on chromosomes with Hi-C technology (Table S6). Among these reads,~159.7 million read pairs were uniquely mapped to the initial draft genome, and more than 74.6 million read pairs were shown to represent valid interactions (Table S6). These read pairs were used to scaffold the contigs onto 17 chromosomes (Fig. S3), and the number of contigs was finally corrected to 3938, among which 3725 contigs (727.40 Mb, covering 99.19% of the draft genome sequence) were anchored to chromosomes. The order and direction on the chromosomes of 2181 contigs (644.88 Mb) could be determined (Table S7). These results indicate that the final assembled draft genome had good integrity and can be employed as reference whole-genome resequencing data and for other purposes.

Genome element annotation
Approximately 516.11 Mb of repetitive sequences were identified in the E. japonica draft genome, accounting for 70.38% of all sequences (Table S8) (Table S8). By integrating de novo prediction, homologous species prediction, and transcriptome prediction to determine protein-coding genes in nonrepetitive regions of the draft genome, a total of 45,492 protein-coding genes, with an average length of 3420 bp and an average exon length of 1532 bp, were identified (Table S9, 10). Among these genes, 45,090 (99.12%) could be annotated (Table S11; Supplementary data file 1). In addition, 10,426 rRNA genes belonging to four different families, 165 miRNA genes belonging to 25 families, 691 tRNA genes belonging to 24 families, 197 snRNAs, 1023 snoRNAs, and 8314 pseudogenes were also identified in the final assembled draft genome (Table  S12; Supplementary data file 2). The distribution pattern of protein-coding genes and RNA genes on the chromosomes was very uneven (Fig. 1).
The E. japonica genome contained a relatively high proportion of repetitive sequences and large numbers of protein-coding and rRNA genes. Nevertheless, the gene density was lower than in the other sequenced diploid Rosaceae species (Table 1). A high degree of correlation was observed among genome size, repetitive sequence length (r = 0.98, p < 0.05), and putative protein-coding gene numbers in all nine species (r = 0.97, p < 0.05). These results provide statistical evidence that repetitive sequences are the major determinants of genome size in Rosaceae species; the expansion of genome size may be accompanied by an increased gene number in the Rosaceae lineage.
Unique molecular mechanisms underlying genome recombination and repair in E. japonica To understand the evolutionary pattern of the E. japonica genome, a comparative genomics analysis was performed using eight sequenced diploid Rosaceae species, including apple (M. domestica), pear (P. betuleafolia), peach (P. persica), sweet cherry (P. avium), Chinese plum (P. mume), black raspberry (R. occidentalis), woodland strawberry (F. vesca), and rose (R. chinensis), and the model plant Arabidopsis thaliana. The protein-coding genes of all ten species were clustered into gene families according to sequence similarity, and a total of 34 895 families were classified (Supplementary data file 3). Among these families, 544 families containing 1632 genes were shown to be unique to E. japonica (Supplementary data file 4). Gene Ontology (GO) enrichment analysis showed that these genes were  primarily involved in biological process categories such as 'DNA integration', 'RNA-intended biological process', 'DNA recombination' and 'DNA metabolism', and in molecular functions, such as 'RNA-directed DNA polymerase activity', 'RNA-DNA hybrid ribonuclease activity' and 'RNA binding'. In addition, Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment annotation showed that these E. japonica genes were primarily involved in metabolic pathways including 'homologous recombination', 'base excision repair', 'nucleotide excision repair', 'DNA replication', and 'mismatch repair' (Fig. S4; Supplementary data file 4). These results imply that E. japonica may have evolved unique genetic and molecular mechanisms for genome recombination and repair.
Loquat originated earlier than apple and pear according to phylogenomics Apple, pear, and loquat all belong to the Amygdaloideae subfamily. However, the phylogenetic relationships between these three species have not been determined 26,27 . Here, a phylogenetic tree (Fig. 2a) containing nine Rosaceae species with A. thaliana as the outgroup was constructed using the protein sequences of 594 single-copy gene families (Supplementary data file 5). The analysis placed loquat, pear, and apple at the distal end of the outgroup, suggesting that the speciation time of the Amygdaloideae lineages was relatively late. The topological position of the nine Rosaceae species in the phylogenetic tree was consistent with that indicated by a previous study by Xiang et al. 26 . Molecular clock analysis indicated that the loquat lineage originated~23 million years ago (MYA), with a 95% confidence interval of~17-36 MYA, which is close to the beginning of the Neogene Period of the Cenozoic Era. All species of the Rosaceae family separated from the common ancestor~82 MYA (95% confidence interval:~46-111 MYA,) in the Late Cretaceous period, consistent with a previous dating analysis by Forest & Chase 28 . These results describe the time and order of species differentiation and differ from the data described by Jiang et al. 29 , which put pear before loquat and indicated that the Amygdaloideae subfamily origi-nated~8.6 MYA. The most likely reason for these differences was the different data used to construct the phylogenetic trees. Here, 594 single-copy gene sequences were used, while only 51 single-copy genes were used in the study by Jiang et al. 29 .
Genes enriched for genetic repair functions have undergone positive selection Further investigations of the single-copy gene families were then completed to establish if any family was under positive selection. Among the 594 single-copy gene families used to construct the phylogenetic tree, 90 (15.15%) families from E. japonica were found to have undergone strong positive selection. These genes were mainly enriched for functions including 'metabolic process' and 'cellular process' in the 'biological process' category, 'cell part', 'cell' and 'organelle' in the 'cellular component' category, and 'catalytic activity' and 'binding activity' in the 'molecular function' category ( Fig. S5a; Supplementary data file 6). These genes were also found to be enriched in metabolic pathways such as 'nucleotide excision repair' (Fig. S5b; Supplementary data file 6). These results provide additional genetic evidence that E. japonica has evolved a unique set of DNA repair and recombination mechanisms.

Expanded gene families linked to both medicinal compound biosynthesis and fruit flavor
A total of 483 gene families with significant expansion were detected in the E. japonica genome (Fig. 2b). These included 4472 genes in total, which were most enriched for GO terms, in the 'biological process' ('metabolic process' and 'cellular process'), 'cellular component' ('cell part', 'cell', and 'organelle') and 'molecular function' ('catalytic activity' and 'binding activity') categories. KEGG analysis indicated that these genes were enriched for metabolic pathways involving 'monoterpenoid biosynthesis' and 'starch and sucrose metabolism' (Fig. S6; Supplementary data file 7). These results suggest a change in the genetic mechanism related to the metabolism of terpenoids and soluble polysaccharides in E. japonica.

Genome-wide association study (GWAS) of flesh color
Flesh color is one of the most important agricultural characteristics of E. japonica and other fruit crops. A GWAS for flesh color traits was performed using the EMMAX program with an efficient mixed-model association based on a linear mixed model 30 and the linear mixed model program FaST-LMM 31 . EMMAX analysis identified four single nucleotide polymorphism (SNP) loci that were significantly associated with flesh color (p < 0.005) on chromosomes 1, 3, 9, and 11, including 70 gene loci located within 100 kb of these SNPs (Table S13;   the mass spectrometry data from these samples were reliable and that the metabolite profiles of the same organ from different samples were more similar than those from different organs from the same E. japonica plant. A total of 577 metabolites were detected in total, including 193 phenols, 33 alkaloids, 28 terpenoids, and one steroid. More metabolites (573) were found in the flowers than in the leaves (565) or roots (509) ( Table 2; Supplementary data file 9, 10). This is in reasonable agreement with the 536 metabolites, including 60 organic acids, identified in a previous study of E. japonica fruits 32 . More metabolites and fewer organic acids were found in E. japonica leaves and flowers, respectively. The flowers contained 89, 178, and 310 metabolites that were upregulated, downregulated, and unchanged, respectively, compared to those in the root metabolome ( Fig. 4a; Supplementary data file 11). These values were 76, 94, and 407 when the flower metabolome was compared with that of the leaf ( Fig. 4b; Supplementary data file 12) and 189, 84, and 304 when comparing the leaf and root metabolomes, respectively ( Fig. 4c; Supplementary data file 13). There were significant differences in the quantities of 51 metabolites among the leaves, flowers, and roots; however, there were no significant differences in 192 metabolites between these organs ( Fig. 4d;  Supplementary data file 14). These results demonstrated that the E. japonica flowers, roots, and leaves were all rich in metabolites, many of which did not show significant differences in accumulation among these three organs, which could explain why E. japonica roots, leaves, and flowers all have medicinal value. A total of 271 of these metabolites could be annotated using the KEGG database (Supplementary data file 9, 15). However, only 110 metabolites, including 15 phenolic acids, 6 flavonoids, and 12 alkaloids, could be assigned to specific pathways, such as 'metabolic pathways' (ko01100), 'biosynthesis of secondary metabolites' (ko01110), 'phenylpropanoid biosynthesis' (ko00940), 'flavonoid biosynthesis' (ko00941), 'stilbenoid, diarylheptanoid and gingerol biosynthesis' (ko00945), and 'isoflavonoid biosynthesis' (ko00950) (Supplementary data file 10). Phenolic acids in E. japonica flowers, leaves, and roots and the genes associated with their biosynthesis Previous studies have indicated that the principal components responsible for the medicinal value of E. japonica are their phenolics, terpenoids, and polysaccharides [33][34][35] .
Phenolic compounds, the aromatic secondary metabolites that occur in plants, can be clustered into different families, including phenolic acids, flavonoids, lignans, coumarins, and tannins. They have considerable potential benefits for human health, including antiaging effects and reducing the  Phenylpropanoids play a central role in the biosynthesis of phenolic compounds 38 . Here, a total of 13 compounds, including p-coumaryl alcohol, L-phenylalanine, coniferaldehyde, caffeic acid, and coniferyl alcohol, and 286 predicted protein-coding genes were annotated within the metabolic pathway of 'phenylpropanoid biosynthesis' (Ko00940) (Supplementary data files 10

Polysaccharides in E. japonica leaves, flowers, and roots and their biosynthetic genes
In addition to phenolic compounds, terpenoids, alkaloids, and polysaccharides have been shown to have important medicinal properties 46  Many E. japonica protein-coding genes were linked to various pathways associated with polysaccharide metabolism (e.g., 111 related to 'fructose and mannose metabolism' (Ko00051) and 92 related to 'galactose metabolism' (Ko00052)) (Supplementary data file 1). Among these genes, 11 (Fig. S18) and D-fructose ( Figure S19), respectively. Notably, both the OG0000179 and OG0000175 gene families have undergone expansion during their evolution (Supplementary data file 7). Phylogenetic trees of the OG0000179 and OG0000175 gene families revealed that some genes have undergone increases in in copy number only in the E. japonica genome. These E. japonica-specific replication pairs in the OG0000179 gene family included EVM0017523 and EVM0017523 and EVM0012272 and EVM0044131 (Fig. 5a). The OG0000175 gene family had similar replications in EVM0039482 and EVM0005339; EVM0020813 and EVM0040115; EVM0007102; and EVM0029488 (Fig. 5b). In addition, D-glucose and many of its derivatives were found in the E. japonica flowers, leaves, and roots. At the same time, none of these tissues produced any fructose (Supplementary data file 9, 10), suggesting that fructose is only abundant in E. japonica fruit 32 . These results provide a meaningful link between the genetic background, such as the expansion of gene family members, and the accumulation of related metabolites.

Discussion
In recent years, E. japonica has become an increasingly important fruit worldwide. However, due to the lack of information on its genome, studies on the genetics and molecular biology of E. japonica are limited. This situation is now changing, as a recent publication described a draft genome of the E. japonica 'Seventh Star cultivar' 29 , and the present study describes a new high-quality draft genome of the E. japonica 'Big Five-pointed Star' cultivar. The 'Big Five-pointed Star' cultivar is known for its yellow/red flesh and has the largest cultivated area in China 47 . In contrast, 'Seventh Star' is a mutant cultivar with white flesh that was only recently bred into existence. Both the previous and current draft genomes show high assembly quality. In the former, 99.66% of the genome consisted of contig sequences, 89.27% of which were clustered into pseudochromosomes, as shown using Hi-C protocols. The corresponding statistics for this study are 99.19% and 87.94%, respectively. A comparative analysis showed that only three chromosomes in the draft genomes of the 'Big Five-pointed Star' and 'Seventh Star' cultivars present the same orientation and that 14 chromosomes could be considered complementary (Table  S14). The predicted genome size and number of proteincoding genes in 'Seventh Star' are ∼710.83 Mb and 45 473, while those for 'Big Five-pointed Star' are ∼749. 25 Mb and 45 492, respectively. This suggests active genome differentiation between different E. japonica varieties. The current E. japonica draft genome of 'Big Five-pointed Star', together with the predicted gene sequences, will be released to provide a greater variety E. japonica reference genomes for further molecular biology, genetic, and breeding studies.
Flesh color is an important horticultural and commodity trait that differs among E. japonica cultivars. This phenotype is controlled by the genotype but not by external environmental conditions, similar to many other horticultural and commodity traits. Flesh color is associated with markedly higher levels of colored carotenoids in the flesh tissue of redfleshed cultivars than in the flesh tissue of white-fleshed cultivars 48 . The genetic basis of this difference has been investigated by different methods, including biochemical assays and functional genomics [49][50][51] , suggesting that different carotenoid types and contents lead to phenotypic differences in E. japonica flesh color. However, it is difficult to identify specific gene loci or alleles responsible for flesh color variations. GWAS offers a complementary and powerful tool for elucidating the relationship between genotype and phenotype 52,53 . Numerous studies have revealed substantial genotype-phenotype associations in crops, highlighting the value of GWAS in functional genomic studies 54 . This study identified a set of SNPs and alleles that are significantly associated with flesh color and contributed to the identification of potential candidate genes for phenotypes located near these SNPs in the genome. However, after careful testing, none of these genes were directly involved in the metabolism of carotenes. However, some of these genes were involved in sugar metabolism. These results provide several molecular clues linking sugar contents to the differing nutrient properties of red-fleshed and white-fleshed E. japonica cultivars.
Some important metabolites with potential medicinal value, including ursolic acid, ursolic acid methyl ester, acetyl ursolic acid, oleanolic acid, chlorogenic acid, neochlorogenic acid, and caffeic acid, have been detected in different organs of E. japonica using classical instrumental analyses 55,56 . However, traditional phytochemistry methods are time-consuming and labor-intensive and present a low throughput. There are limited data describing the metabolome of E. japonica, including its production of phenols, flavonoids, terpenes, and polysaccharides with potential health benefits. This lack of information is not conducive to further research on the medicinal properties of E. japonica and its utilization. In recent years, our collective understanding of traditional Chinese medicine has been considerably advanced by the use of various analytical technologies and genomics, proteomics, and metabolomics research 57,58 . Among these approaches, metabolomics has been particularly valuable for analyzing the chemical components, including various metabolites used in traditional Chinese medicine 59 . Here, the metabolomes of E. japonica leaves, flowers, and roots were determined using a widely targeted metabolomic analysis method based on liquid chromatography and tandem mass spectrometry (LC-MS/ MS), as described by Chen et al. 60 , and many additional metabolites, including phenols, terpenoids, and alkaloids, were detected. These results supplement available information on the medicinally valuable biochemical substances present in E. japonica. These data were also used to identify and annotate some of the genes encoding key biosynthetic enzymes related to phenol, terpenoid, and polysaccharide biosynthesis in E. japonica. The data allowed the further evaluation of the evolutionary relationships among these genes. Taken together, these results provide valuable insights into the molecular mechanisms and genetic background facilitating the production of several important medicinal compounds in this species and lay a foundation for conducting further studies on the molecular pharmacology of E. japonica.

Conclusion
This study produced a high-quality draft genome of yellow-fleshed E. japonica and high-throughput metabolomes of its leaf, flower, and root tissues. A total of 45,492 putative protein-encoding genes and 577 metabolites were identified. In addition, 91 phenols, 81 flavonoids, 28 terpenoids, and some saccharide metabolites with potential health benefits and genes related to the biosynthesis of these metabolites were highlighted. Overall, this study describes a high-quality draft genome and provides a global view of the fundamental molecular components that contribute to the medicinal value of E. japonica.

Cultivar description and genome materials
The 'Big Five-pointed Star' cultivar was subjected to sequencing to assemble a draft genome, which was then used as a reference genome for the SNP mapping of resequencing data. The 'Big Five-pointed Star' drupe has excellent characteristics, including a high average single fruit weight, high edible rate, high soluble solid contents, juiciness and delicious flavor. As a result, the Big Fivepointed Star has become a popular cultivar, with the most rapid development and expansive planting area in China 47 . Descriptions of the flesh color and geographical origins of the 53 E. japonica cultivars used for GWAS are provided in Table S15.

DNA extraction
Total DNA was extracted from the young leaves of 'Big Five-pointed Star' using a modified protocol based on the CTAB method 61 and was then treated with RNase (Thermo Fisher Scientific, Waltham, MA, United States) at 37°C for 1 h. The quality and concentration of the total extracted genomic DNA were determined using agarose gel electrophoresis and ND-1000 spectrophotometry (NanoDrop Technologies Inc., Wilmington, DE, USA).

Illumina short-read library construction and sequencing
The Illumina paired-end library (350 bp) was constructed according to the manufacturer's instructions via the following steps: qualified DNA was fragmented, and segments of~350 bp in length were selected on a 3% agarose gel for further analysis. End repair and A-tailing were performed, and Illumina-compatible adaptors were added to the selected DNA fragments before PCR amplification using Illumina adapter-specific primers, which completed paired-end sequencing library construction. Raw short-read sequences of~150 bp in length were generated on the Illumina HiSeq 4000 platform (Illumina Inc.).
Survey of the 'Big Five-pointed Star' genome based on short-read data by k-mer analysis A k-mer analysis was performed using the K-mer Analysis Toolkit (KAT) program 62 to determine an initial estimate based on genome size, heterozygosity, and the repetitive rate of the 'Big Five-pointed Star' genome. The following formula was used: genome size = (total nucleotide number)/(average sequencing depth) = (total number of k-mers)/(average k-mer depth). The K value used the maximum number of odd numbers that met the following criteria: 4^K/genome >200.
PacBio long-read library construction, sequencing, and raw data statistics The long-read sequencing library (20 kb) was constructed according to the PacBio guidelines and was completed as follows: G-tube fragmentation of genomic DNA, damage repair of fragmented DNA, end-repair of fragmented DNA, ligation of fragmented DNA with dumbbell-shaped adaptors, digestion of DNA segments using exonuclease, and selection of target segments using BluePippin. Long-read sequences were generated on the PacBio sequencing platform (Pacific Biosciences Inc., California, USA).

Assembly and evaluation of the integrity of the draft genome
To assemble the draft genome of E. japonica using the long-read sequencing data, subreads with low quality (<Q20) and short lengths (<500 bp) were removed, and the remaining subreads were corrected using Canu software 63 . The corrected data were then assembled into a draft genome sequence by using WTDBG (https://hpc.ilri. cgiar.org/wtdbg2-software) with parameters '-p 19 -AS 2' (wtdbg, RRID:SCR_017225), Falcon software 64 with the parameters set to the defaults, and Canu software with the parameters set to canu: errorRate 0.045. The results of these three analyses were then optimized using the Quickmerge ideology 65 under its default parameters and were improved by correcting errors by combining the short-read data using Pilon software 66 with default parameter settings.
The following three methods were then used to evaluate the completeness of the draft genome. The first consisted of BLAST searches of the assembled draft genome with a standard of more than 70% identity against the CEGMA database 24 , which included 458 core eukaryotic genes (CEGs) and 248 highly conserved CEGs. The second consisted of BLAST searches of the assembled draft genome with at least 70% identity against the embryophyta_odb10 dataset in the BUSCO v4.0 database (https://busco.ezlab. org/busco_v4_data.html), which included 1440 conserved core plant genes. Finally, the short-read sequencing data were mapped to the assembled draft genome using BWA software 67 (v0.7.10-r789; aln model; other parameters were set to default).

Hi-C sequencing library construction
A Hi-C sequencing library was constructed according to protocols described by Servant et al. 68 and Burton et al. 69 . Briefly, the cells of young leaves of the 'Big Five-pointed Star' cultivar were fixed with formaldehyde and then dissociated, and the cross-linked products were treated with restriction endonucleases to produce cohesive ends. A biotin marker was introduced at the cohesive ends, which were repaired to produce blunt ends. The blunt ends were ligated; the cross-links were released to separate DNA from proteins; the DNA was extracted; a Covaris E220 instrument (Covaris, Brighton, UK) was used to fragment DNA to the correct size and then repair the ends; the fragmented DNA segments were purified by gel electrophoresis and recycled with a QIAquick Gel Extraction Kit (Qiagen Inc., Germany); those DNA segments without a biotin marker were removed; Poly(A) sequences were added to the remaining DNA segments including biotin markers; PCR adaptors were added; PCR was performed; and the PCR products were purified by gel electrophoresis and recycled by using a QIAquick Gel Extraction Kit (Qiagen Inc.).

Repetitive sequence prediction and annotation
A unique database for identifying repetitive sequences in the genome was constructed with the help of LTR_FINDER v1.05 71 , MITE-Hunter 72 , RepeatScout v1.0.5 (Price et al.) 73 , and PILER-DF v2.4 software 74 based on structure and de novo prediction. This unique database was then merged with the Repbase database 75 to generate the final repetitive sequence database, and PASTEClassifier software 76 was used to classify the database. Finally, Repeatmasker v4.0.6 software 77 was used to predict repetitive sequences in the draft genome based on a well-constructed, repeating sequence database.

Protein-coding gene prediction and functional annotation
Protein-coding genes based on nonrepetitive sequences in the draft genome were predicted using three methods: (1) De novo prediction (Ab initio) using Genscan software 78

Identification of gene families
The gene sets of the above species and E. japonica were aligned using all-against-all Blastp 91 according to evalues ≤ 1e5 and ≤500 hits. OrthoFinder v2.3.7 software 92 was used for the gene family classification of the protein sequences of the ten species, and the 'Protein Analysis Through Evolutionary Relationships' (PANTHER v15) database (http://pantherdb.org/) was used for the annotation of the obtained gene families. Finally, GO and KEGG enrichment analyses were performed by using clusterProfile v3.14.0 software 93 .

Phylogenetic tree construction
The protein sequences of 594 single-copy genes were used to construct an evolutionary tree using IQ-TREE v1.6.11 software 94 . Specifically, MAFFT v7.205 software 95 was used to align the sequences, and Gblocks v0.91b 96 was used to remove regions with poor sequence alignment or significant differences using the following parameters: -b5 = H. ModelFinder 97 was used for model detection, and the best-obtained model was JTT + F + G4, which was then used to construct a maximum likelihood (ML) evolutionary tree with the number of bootstraps set to 1000 and A. thaliana as the outgroup. The MCMCTree module of PAML v4.9i software 98 was used to calculate the divergence times between species. Finally, the evolutionary tree with divergence times was graphically presented using MCMCTreeR 99 .

Identification and enrichment analysis of single-copy genes under positive selection
The modular CodeML built-in PAML v4.9d package 98 was used to detect the selection pressure on specific single-copy genes in each of the nine species based on the construction of phylogenetic trees. Single-copy orthologs with nonsynonymous/synonymous (D N /D S ) substitution ratios >1 (p ≤ 0.05 by chi-square test) were indicated to be under positive selection (rapid evolution or adaptive evolution). These genes were then subjected to GO and KEGG enrichment using GOseq 100 and KOBAS 101 software, respectively.

Gene family expansion and contraction and functional enrichment analysis
The expansion and contraction of the gene families of E. japonica and those of eight other Rosaceae species were analyzed using CAFÉ software 102 based on the phylogenetic evolutionary tree. Significant expansion or contraction was indicated by p < 0.05. GO and KEGG enrichment analyses were performed on the genes in the expanded families using GOseq software 100 and KOBAS 101 software, respectively.

GWAS
The resequencing data of the 52 E. japonica cultivars and the methods for SNP dataset detection have been previously reported 103 . The EMMAX 30 and Fastlmm 31 programs, with a compressed mixed linear model and linear mixed model, respectively, were used to perform the GWAS based on the original SNP dataset filtered according to a threshold minor allele frequency <0.05 and locus integrity >0.8. Manhattan and quantile-quantile plots were constructed using the R package (https:// rstudio.com/products/rpackages/).

Sampling, sample preparation, and metabolite extraction
A total of nine samples (three from leaves, three from flowers, and three from roots) of the 'Big Five-pointed Star' cultivar were collected at different developmental stages and subjected to metabolome analysis. The freezedried samples were crushed using an MM 400 mixer mill (Retsch, Haan, Germany) and zirconia beads for 1.5 min at 30 Hz. Then, 100 mg of each sample was subjected to 70% methanol extraction overnight at 4°C. After centrifugation at 10,000 × g for 10 min, the extracts were absorbed (CNWBOND Carbon-GCB SPE Cartridge, 250 mg, 3 mL; ANPEL Laboratory Technologies, Shanghai, China) and filtered (SCAA-104, 0.22 μm pore size; ANPEL) before being subjected to UPLC-MS/MS analysis.

UPLC conditions
Sample extracts were analyzed using a UPLC-ESI-MS/ MS system (UPLC: Shim-pack UFLC CBM30A system, Shimadzu, Kyoto, Japan; MS: Applied Biosystems 4500 QTRAP, AB Sciex, Framingham, MA, USA). The analytical conditions were as follows: Agilent SB-C18 UPLC column (1.8 µm, 2.1 mm × 100 mm; Agilent Technologies, Santa Clara, CA, USA) and a mobile phase comprising solvent A (pure water with 0.1% formic acid) and solvent B (acetonitrile). Sample measurements were performed using a gradient program that started with 95% A and 5% B. Within 9 min, a linear gradient with an endpoint of 5% A and 95% B was programmed, and the composition of 5% A and 95% B was maintained for 1 min. This was then reversed to 95% A and 5% B within 1.10 min, which was maintained for 2.9 min. The column oven temperature was set at 1-40°C, and the injection volume was 4 μL. The effluent was connected to ESI-triple quadrupole linear ion trap (QTRAP)-MS.

ESI-Q TRAP-MS/MS
LIT and triple quadrupole scans were acquired using a triple quadrupole-linear ion trap mass spectrometer (QTRAP; API 4500 QTRAP UPLC/MS/MS System) equipped with an ESI Turbo Ion-Spray interface, operating in both positive and negative ion modes and were controlled using Analyst Software 1.6.3 (https://sciex. com/products/software/analyst-software; AB Sciex). The ESI source operation parameters were as follows: ion source, turbo spray; source temperature, 550°C; ion spray voltage: 5500 V (positive ion mode)/-4500 V (negative ion mode); and ion source gas I (GSI), gas II (GSII), and curtain gas (CUR) values set at 50, 60, and 30 psi, respectively. Instrument tuning and mass calibration were performed using 10 and 100 μmol/L polypropylene glycol solutions in triple quadrupole and LIT modes, respectively. Triple quadrupole scans were acquired after the completion of a multiple reaction monitoring (MRM) experiment with the collision gas (nitrogen) set to 5 psi. The declustering potential and collision energy of individual MRM transitions were then determined before further declustering potential and collision energy optimization. A specific set of MRM transitions was monitored in each period according to the metabolites eluted within that period 60 .

Qualitative and quantitative analysis of metabolites
The qualitative analysis of metabolites was performed using secondary spectrum information with reference to the MetWare database (Maiwei Metabolism, Wuhan, China). Furthermore, repeat signals, including those from the K + , Na + , and NH 4 + ions, as well as repeated signals from fragmented ions, indicating a high molecular weight substance, were removed. Then, metabolites were quantified via the MRM mode analysis of triple quadrupole mass spectrometry data 104 .

MS data analysis
The mass spectrum data were processed using Analyst Software 1.6.3 (https://sciex.com/products/software/analystsoftware), and samples from the same organ were treated as repeats. Principal component analysis, Pearson's correlation analysis, differential expression analysis, and heat map generation were all performed using the statistical module in R (version 3.1.1) (https://www.r-project.org/). The KEGG database (https://www.kegg.jp/) was used to annotate and elucidate the biosynthetic pathways of different metabolites.