Genome-wide sequencing of longan (Dimocarpus longan Lour.) provides insights into molecular basis of its polyphenol-rich characteristics

Abstract Longan (Dimocarpus longan Lour.), an important subtropical fruit in the family Sapindaceae, is grown in more than 10 countries. Longan is an edible drupe fruit and a source of traditional medicine with polyphenol-rich traits. Tree size, alternate bearing, and witches' broom disease still pose serious problems. To gain insights into the genomic basis of longan traits, a draft genome sequence was assembled. The draft genome (about 471.88 Mb) of a Chinese longan cultivar, “Honghezi,” was estimated to contain 31 007 genes and 261.88 Mb of repetitive sequences. No recent whole-genome-wide duplication event was detected in the genome. Whole-genome resequencing and analysis of 13 cultivated D. longan accessions revealed the extent of genetic diversity. Comparative transcriptome studies combined with genome-wide analysis revealed polyphenol-rich and pathogen resistance characteristics. Genes involved in secondary metabolism, especially those from significantly expanded (DHS, SDH, F3΄H, ANR, and UFGT) and contracted (PAL, CHS, and F3΄5΄H) gene families with tissue-specific expression, may be important contributors to the high accumulation levels of polyphenolic compounds observed in longan fruit. The high number of genes encoding nucleotide-binding site leucine-rich repeat (NBS-LRR) and leucine-rich repeat receptor-like kinase proteins, as well as the recent expansion and contraction of the NBS-LRR family, suggested a genomic basis for resistance to insects, fungus, and bacteria in this fruit tree. These data provide insights into the evolution and diversity of the longan genome. The comparative genomic and transcriptome analyses provided information about longan-specific traits, particularly genes involved in its polyphenol-rich and pathogen resistance characteristics.

the molecular mechanisms of the biosynthesis of polyphenol, and the pathogen resistance of longan. Together, these results provided insights into the evolution and diversity of the longan genome and will help to improve the efficiency of longan conventional breeding by integrating biotechnological tools.

Genome sequencing and assembly
We selected the D. longan "HHZ" cultivar for genome sequencing. In brief, a total of 316.84 Gb of raw data was generated by Illumina sequencing of 12 genome shotgun libraries with different fragment lengths ranging from 170 bp to 40 kb (Additional file 1: Table S1). After stringent filtering and correction steps, 121.68 Gb of high-quality sequence data, representing 273.44fold coverage of the entire genome, were obtained (Additional file 1: Table S2). Based on K-mer frequency methods [21], the D. longan genome was estimated to be 445 Mb with a 0.88% heterozygosity rate (Additional file 2: Fig. S1, Additional file 1: Table S3). Compared with other sequenced fruit tree genomes, the D. longan genome was bigger than papaya [9], orange [13], peach [14], and plum [12], and smaller than grape [10], apple [11], pear [15], pineapple [17], and kiwifruit [16]. Longan trees are generally thought to have highly heterozygous traits. The estimated 0.88% heterozygosity rate in the whole genome of the longan "HHZ" cultivar is reported here for the first time. This heterozygosity rate is higher than the rates reported for kiwifruit (0.536%) [16], plum (0.03%) [12,22], and poplar (about 0.5%) [23], and lower than the rates for pear (1-2% sequence divergence) [15] and pineapple (1.89% in F153, 1.98% in MD2, 2.93% in CB5) [17]. These results imply that the idea that fruit trees always have high heterozygosity, which may be due to artificial grafting and/or asexual reproduction.
Using the SOAPdenovo program [24], all the high-quality reads were assembled into 51 392 contigs and 17 367 scaffolds (≥200 bp) totaling 471.88 Mb excluding gaps (Table 1). These  assembled sequences accounted for approximately 106.04% of the estimated longan genome, which conflicts with previously reported genome assemblies where the sequences accounted for less than 100% of the estimated genome [13][14][15]. The higher percentage might be due to the high heterozygosity of the longan genome, suggesting that, in the future, a single-molecule sequencing technology should be used to correct the longan genome assembly. Here, the N50s of contigs and scaffolds were 26.04 kb (longest, 173.29 kb) and 566.63 kb (longest, 6942.32 kb), respectively (Table 1), suggesting the high quality of the assembly. The GC content of the D. longan genome was 33.7%, which is comparable with the GC content of the genomes of pineapple (33%) [17], jujuba (33.41%) [19], and orange (34.06%) [13], but lower than the GC content of the genomes of kiwifruit (35.2%) [16], papaya (35.3%) [9], and grape (36.2%) [10] ( Table 2, Additional file 2: Fig. S2). Analysis of the percent GC content among different fruit trees can provide important clues about gene density, gene expression, replication timing, recombination, and evolutionary relationships [25]. The GC depth graph and distribution indicated no contamination of any bacterial sequence in the longan genome assembly, and 99.2% of the assembly was sequenced with more than 20× coverage (Additional file 2: Fig. S3). The statistics and comparison of the D. longan assembly with 12 other fruit tree genomes are shown in detail in Table 2. The quality of the assembly was assessed by aligning the scaffolds to a longan transcriptome assembly from the NCBI Sequence Read Archive (SRA; SRA050205). Of the 96 251 longan transcriptome sequences (≥100) reported previously [26], 97.55% were identified in the genome assembly (Additional file 1: Table S4), confirming the high quality of the assembly.

Benchmarking Universal Single-Copy Orthologs analysis
We further evaluated the quality and completeness of the draft longan genome assembly using the Benchmarking Universal Single-Copy Orthologs (BUSCO) data sets [27]. Of the total of 956 BUSCO ortholog groups searched in the longan assembly, 900 (94%) BUSCO genes were "complete single-copy," 288 (30%) were "complete duplicated," 16 (1.6%) were "fragmented," and 40 (4.1%) were "missing" (Additional file 1: Tables S5). The percentage of missing BUSCO genes was comparable to the percentages missing in the assemblies of banana (3%), Brassica napus (3%), and Arabidopsis (2%), which have served as well-assembled standards at the chromosomal level [28], further suggesting the high quality of our assembly.

Repetitive elements and gene annotation
Repetitive elements are major components of eukaryotic genomes, and they have been used extensively to analyze genome structure, karyotype, ploidy, and evolution. In the longan assembly, we found that a total of 261.88 Mb (52.87%, 445 Mb) was repetitive sequences (Additional file 1: Table S6), which is higher than the amount observed in orange (20%, 367 Mb) [13], peach (29.6%, 265 Mb) [14], kiwifruit (36%, 758 Mb) [16], pineapple (38.3%, 526 Mb) [17], grape (41.4%, 475 Mb) [10], jujuba (49.49%, 444 Mb) [10], and papaya (51.9%, 372 Mb) [9], and lower than the amount reported in pear (53.1%, 527 Mb) [15] and apple (67.4%, 742.3 Mb) [11] (Table 2), indicating that the sizes of fruit tree genomes differed as a result of the variable amounts of repetitive elements that they contained. Accordingly, the bigger plant genomes often possessed higher percentages of repetitive elements than the smaller plant genomes. Most plant genomes appear to contain abundant long-terminal repeat (LTR) retrotransposons and a small number of short interspersed elements (SINEs) and long interspersed elements (LINEs) [29]. We found that the repetitive fraction of the longan genome comprised LTR retrotransposons, which were the most abundant (36.54%), and SINEs (2.43%) and LINEs (0.04%), which were the least abundant; other repeats, including tandem repeats and unknown repeats, made up 7.59% and 7.71% of the repetitive fraction, respectively (Additional file 1: Table S7). A large number of the unknown repetitive sequences may be longan specific. The characterization of repetitive sequences is of primary importance for understanding the structure and evolution of the longan genome.

Gene family evolution and comparison
Orthologous clustering analysis was conducted with the longan genome and eight other selected plant genomes: Arabidopsis, orange, papaya, grapevine, banana, peach, kiwifruit, and apple. Of the 31 007 protein-coding genes in the genome, 26 261 were grouped into 14 961 gene families (763 of which were longan-unique families), giving an average of 1.76 genes per family (Additional file 1: Table S11). The remaining 5834 genes were classed as un-clustered genes. Among the 31 007 genes, 4653 were longan-unique paralogs, 5184 were multiple-copy orthologs, 3606 were single-copy orthologs, and 12 818 were other orthologs (Fig. 1b). Comparative analysis of the longan genome with eight other selected plant genomes indicated that the number of gene families in the longan genome was similar to the numbers in the genomes of orange (15 000) [13] and peach (15 326) [14], higher than in banana (12 519) [18], Arabidopsis (13 406), grape (13 570) [10], kiwifruit (13 702) [16], and papaya (13 763) [9], and lower than in apple (17 740) [11] Table S11). These comparisons indicated that differences in gene families in plant genomes may be important sources of genetic traits and adaptation in different species. Comparative analysis of the longan genome with the genomes of citrus, banana, peach, and Arabidopsis showed that these five species contained a core set of 9215 genes in common, whereas 1207 genes were specific to longan, which is more than the numbers of genes specific to citrus and Arabidopsis, and lower than the numbers specific to M. acuminate and peach (Fig. 1d). Expansion or contraction of gene families may provide clues to the evolutionary forces that have shaped plant genomes and have an important role in the diversification of plants. In this study, we used CAFÉ [31] to identify gene families that had potentially undergone expansion or contraction in the longan genome. We found a total of 2849 expanded gene families and 2842 contracted families; however, only 386 expanded families (7839 genes) and 12 contracted families (53 genes), accounting for 19.96% and 0.13% of the total coding genes (39 282), respectively, were found to be statistically significant at P < 0.05 (Supplemental Excel Files 2 and 3). The genes in the significantly expanded and contracted families (P < 0.05) were annotated with gene ontology (GO) terms. Genes in a total of 32 (expanded) and 11 (contracted) families were assigned GO terms under the three GO categories: biological process, cellular component, and molecular function. Almost all the expanded or contracted families contained genes that were assigned terms under biological process, and a few genes in the contracted families were assigned terms under the cellular component and molecular function categories (Additional file 2: Fig. S4a and b). The dominant terms in the expanded or contracted gene families were "cellular component organization," "locomotion," "auxiliary transport protein," and "binding," revealing important clues to the evolutionary forces that may have shaped the longan genomes.

Genome evolution
Whole-genome duplication is common in most plant species, and it represents an important molecular mechanism that has shaped modern plant karyotypes [32]. Characterization and annotation of the longan genome provided comprehensive information for us to further investigate the evolutionary history of longan. Single-copy nuclear genes from orange, Arabidopsis, cacao (Theobroma cacao), poplar (Populus trichocarpa), grape, apple, papaya, soybean, peach, kiwifruit, and banana [18] were used in a genome-scale phylogenic analysis using the maximum likelihood method. The phylogenetic analysis showed that longan was phylogenetically closest to orange, close to papaya, Arabidopsis, and cacao, and most distant from monocotyledon fruits (banana). From the phylogenetic tree, we estimated that longan diverged about 69.3 million years ago (Fig. 1a). To determine the nature of the evolutionary events that led to the modern longan genome structure, we analyzed the syntenic relationships between longan and poplar. We detected a total of 2106 and 883 syntenic blocks containing 17 901 and 17 447 colinear genes for longan and poplar, respectively (Additional file 1: Table S12), which supported the reported conserved colinearity and close evolutionary relationship in these two plant species. To further analyze the evolutionary divergence and the relative age of duplication events in longan and other related species, we calculated the distance-transversion rates at 4-fold degenerate sites (4DTv) (Fig. 1c). The 4DTv value peaked at 0.3 for paralog pairs in Arabidopsis, highlighting the recent whole-genome duplication in this species. Two 4DTv values that peaked at 0.72 and 0.6 for orthologs between longan and ba-nana, and between longan and Arabidopsis (data no shown), respectively, supported species divergence. These results are consistent with the more ancient divergence between monocotyledons and dicotyledons. The orthologs between longan and grape (data no shown), longan and peach, and longan and orange showed 4DTv distance peaks at 0.36, 0.36, and 0.26, respectively, which is consistent with the 4DTv peaks reported previously for Vitaceae and Rosaceae species, and more ancient than the 4DTv values for Rutaceae or Sapindaceae. In longan, the analysis showed ancient duplication events (the 4DTv peak at about 0.55) but did not reveal a recent whole-genome duplication. These results complement the results for the longan genome and will contribute to studies into ancestral forms and arrangements of plant genes [33].

Assessment of genetic diversity in longan germplasm
A representative characteristic of longan cultivars is their high heterozygosity, which has resulted in the low efficiency of longan germplasm management and utilization. Traditionally, molecular markers (RAPD, AFLP, SCAR, SCTP, and SRAP) and single nucleotide polymorphisms (SNPs) based on transcriptome data [34] have been used for accurate identification of longan varieties. However, the extent of heterozygosity in the whole genome is not well understood [7]. The availability of the longan draft genome provided the foundation with a comprehensive assessment of heterozygosity in the longan genome.
We selected 13 representative commercially cultivated accessions with early-maturing, middle-maturing, late-maturing, multiple-flowering, aborted-seeded, and disease-resistant characteristics for whole-genome resequencing (Additional file 1: Table S13). A total of 45.77 Gb of raw data was generated by Illumina sequencing. After alignment of the clean reads corresponding to 5.02-to 7.31-fold depths and >78% coverage to the reference genome (Additional file 1: Table S14), we identified 357 737 SNPs (Additional file 1: Table S15) and 23,225 small insertions/deletions (indels) (Additional file 1: Table S16). The overall polymorphism density was 0.05-0.12 SNPs and 0.004-0.007 indels per 10 kb of the genome sequence, which is much lower than the diversity reported in orange [13]. Notably, the major variations existed among the "FY," "MQ," and "SJM" accessions, whereas variations within the cultivated longan accessions, particularly the "LDB" accessions, were relatively low (Additional file 1: Tables S15 and S16).
To further investigate the population structure and relationships among the longan accessions, we constructed a neighborjoining tree (Fig. 2a) and carried out a principal component analysis (PCA) (Fig. 2b). The neighbor-joining tree, constructed based on all the identified SNPs, indicated that the 13 longan accessions clustered into two subfamilies. The first subfamily consisted only of "FY," which showed the highest variations and clear separation from other cultivars. This result is quite different from results reported previously [35,36]. In previous studies using molecular markers, "FY," which originated from Quanzhou, China, was found to cluster together with other Chinese longan accessions. In our study, which was conducted at an overall genomic level, "FY" was found to possess more genetic differences compared with the other longan accessions tested. This result might be due to the special traits of "FY," such as witches' broom disease-resistant, middle-maturity, and canned processing products. This result also supports the observed diversity of "FY" at the overall genomic level. The second subfamily neighbor-joining tree consisted of three clades (Fig. 2a). The first clade included "JHLY," "WLL," "JYW," and "SN1H"; the second contained "MQ," "SX," "SJM," and "SEY"; and the third consisted of "DB," "HHZ," "LDB," and "YTB." Moreover, the PCA showed that the samples that originated from China tended to cluster together ("HHZ," "DB," "JYW," "LDB," "WLL," "SN1H," "YTB," "SEY," "JHLY," and "SX"). The PCA also showed the clear separation of "FY," "SJM," and "MQ." The "SJM" and "MQ" accessions, which originated from Southeast Asia and Thailand, respectively, possessed apparent differences compared with the Chinese longan accessions tested in this study. Together these results indicated geographic patterns of genetic differentiation, which agree with findings reported previously [34]. The relatively low levels of genetic variation among the Chinese cultivars also suggested that they might have suffered a bottleneck during domestication [7,34]. These results suggested that the relationship among the 13 selected longan accessions was, at least partly, determined by their geographical distributions.
An additional analysis of the population structure was conducted using the FRAPPE program [37] with K (the number of populations) set from 2 to 7 (Fig. 2c). For K = 7, a new subgroup was detected among the 13 longan accessions. This subgroup had characteristics, such as various maturity levels, high yielding, aborted seeding, disease resistance, and multiple flowering. The cultivars "SX" and "YTB," which are susceptible to disease, contained more variations in resistance genes, such as nucleotide-binding site leucine-rich repeat (NBS-LRR) and leucine-rich repeat receptor-like kinases (LRR-RLK), than the disease-resistant cultivars ("FY," "SN1H," "MQ," "LDB," and "JYW") (Supplemental Excel Files 4 and 5). These results provided a measure of the changes in genetic diversity and a theoretical estimate of the genetic relationships among the selected longan cultivars.

RNA sequencing revealed SNPs, indels, differentially expressed genes, and alternative splicing events in different tissues of "SJM" longan
To improve the gene annotation of the longan genome sequence and get more information about longan traits, we constructed nine cDNA libraries corresponding to nine different organs (root, stem, mature leaf, flower bud, flower, young fruit, pericarp, pulp, and seed) from a representative "SJM" cultivar. "SJM," which originated in Southeast Asia, blossoms and bears fruit throughout the year, with no requirement of environmental control [38]. Here, 490 502 822 clean reads from nine RNA sequencing (RNA-seq) data sets were obtained after removing low-quality reads and adaptor sequences, and about 53.55-79.40% of the clean reads mapped to the longan draft genome (Additional file 1: Table S17). This percentage of mapped reads is lower than the 90% previously reported in peach [39], suggesting that the "SJM" cultivar contained high variations compared with the sequenced "HHZ" genome, probably because of their different origins. Moreover, the BUSCO analysis [27] showed that 483 (87%) of BUSCO genes were "complete single-copy," 352 (36%) were "complete duplicated," 53 (5.5%) were "fragmented," and 68 (7.1%) were "missing" (Additional file 1: Table S18), indicating the high quality of our assembled transcriptome.
The transcribed regions/units were constructed independently for individual tissues. We found that transcripts/genes ranged from 19 322 (pulp) to 23 118 (flower bud), completely or partially (49.18-58.85%) overlapped with 39 282 annotated genes in the longan genome. The numbers of expressed transcripts in each longan tissue were much lower than the numbers previously reported in Brassica rapa (32 335 genes expressed in at least one tissue, equivalent to 78.8% of the 41 020 annotated genes) [40]. The lower numbers of transcripts detected in each tissue may be due to the high variations and genetic heterozygosity in the "SJM" cultivar. The coverage of the longan gene set by our transcripts indicated the broad representation of our unigenes and provided the opportunity to identify alternative splicing (AS) events. In addition to the predicted genes, novel transcripts, which ranged from 1621 (stem) to 1999 (young fruit), were detected across all nine samples. Among the novel transcripts, 798 (flower) to 988 (young fruit) contained open reading frames, while 820 (stem) to 1011 (young fruit) were identified as non-coding RNAs in the longan genome (Additional file 1: Table S17). Most of these non-coding RNAs were longer than 200 nt and had no ORFs encoding sequences longer than 300 amino acids, suggesting that they may be long intergenic noncoding RNAs [41] or cis-natural antisense transcripts [42], which will need further analysis. The numbers of novel encoding and non-coding transcripts in young fruit were the highest among the nine samples, suggesting that the development of young fruit required more complicated gene regulatory networks than the other stages. To further optimize the structure of the transcripts, we compared the assembled transcripts and annotated genes from the reference longan genome and extended the 5 or 3 ends of the transcripts according to the annotated gene information. In total, the extending 5 or 3 end of annotated genes ranged from 8126 (pulp) to 9995 (flower bud) across nine tissues, and almost half the number of total genes extended by 5 end in each sample. We identified a total of 1 255 816 SNPs and 34 390 indels across the nine longan tissues and found that the highest number of SNPs and indels were detected in young fruit (161 897) and leaf (4673), respectively, suggesting that the expressed transcripts may be more diverse in these two tissues. Notably, the lowest frequencies of SNPs and indels were detected in pulp (105 007 and 2587, respectively). The SNPs and indels detected in the transcript sequences will be a valuable resource from which to identify candidate genes, analyze population structures and evolution, and accelerate plant breeding [39]. The identification of novel genes, extended annotated genes, SNPs, and in-dels from different developmental stages implies that our gene set can serve as a valuable complementary resource for longan genomics.
To identify significantly differentially expressed genes (DEGs), we used 12 pair-wise comparisons among the nine samples as follows: root VS stem, root VS leaf, leaf VS stem, flower bud VS flower, flower bud VS young fruit, flower VS young fruit, young fruit VS pulp, young fruit VS seed, pericarp VS pulp, pericarp VS seed, and pulp VS seed. Among the detected DEGs (Additional file 2: Fig. S5), an average of 3922 ± 2391 were upregulated and an average of 4859 ± 2666 were down-regulated in the 12 comparisons. The highest number of DEGs was detected in young fruit VS seed (9737), followed by root VS leaf (9702) and flower VS young fruit (9101), and the lowest number of DEGs was detected in flower bud VS flower (3722). The numbers of organ-specific genes ranged from 87 in young fruit to 530 in root, and the significantly differentially expressed transcription factors in each comparison ranged from 272 (flower bud VS flower) to 732 (young fruit VS pulp). To evaluate the potential functions of the DEGs, we annotated them by assigning GO terms under the three main categories: biological process, cellular component, and molecular function. DEGs in each pair were categorized into 43 (flower bud VS flower) to 47 (young fruit VS pulp). Details of the GO annotations are provided in Additional file 2: Fig. S6. The dominant terms in all 12 comparisons were "metabolic process," "cellular process," "cell," "cell part," "catalytic activity," and "binding," similar to results previously reported in the "SJM" and "LDB" cultivars [43]. To further understand the biological functions of the DEGs, we carried out a Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway-based analysis. In nine of the 12 comparisons, the highest numbers of DEGs were involved in "metabolic pathway," followed by the "biosynthesis of secondary metabolites" and "plant-pathogen interaction" pathways. In pericarp VS seed, root VS leaf, and pericarp VS pulp, "biosynthesis of secondary metabolites," "pyrimidine metabolism," and "stilbenoid, diarylheptanoid and gingerol biosynthesis" were the most represented pathways, respectively (Additional file 2: Fig. S7). These results are fully consistent with the view that D. longan contains high levels of polyphenolic compounds and a large number of pathogen resistance genes [44,45].
To determine the types of AS events represented in our assembled transcripts data set, we used the software TopHat [46]. First, the nine longan tissues were analyzed at the exon level, which can provide important information about the types of gene isoforms that are expressed and variable [47]. Expressed exons were detected in the range of 96 105 (pulp) to 111 476 (flower bud) across the nine tissues (Additional file 1: Table S17). A total of 298 914 AS events were detected across all the tissues, representing the four known types of AS, namely intron retention, exon skipping, alternative 5 splice site donor, and alternative 3 splice site acceptor. Alternative transcripts have been shown to be tissue or condition specific [47,48]. We also found that the largest numbers of AS events were detected in leaf (37 216), followed by young fruit (35 998) and pericarp (35 384), and the smallest numbers were found in pulp (28 058), corresponding to the least expressed exons. The predominant and rare types of AS events in all nine tissues were intron retention and exon skipping, respectively. This result is consistent with prior findings in rice [49], Arabidopsis [50], grape [48,51], and B. rapa [40], but contradicts a previous finding that exon-skipping was predominant in peach [39] and metazoans [52], indicating the complexity of the AS landscape in plants and the important consequences this may have on plant/crop phenotypes.

Biosynthesis of polyphenols and MYB transcription factors in longan
Polyphenols, potential antioxidative compounds, are the major category of secondary metabolites in longan leaf, flower, fruit, and seed [4]. Phenolic compounds are derived primarily through the shikimic acid, phenylpropanoid, and flavonoid pathways. Our transcriptome data showed that the significant DEGs in the nine longan tissues were involved mainly in "biosynthesis of secondary metabolites." To further assess changes between the primary and secondary metabolism of polyphenols during the longan vegetative and reproductive growth stages, the copy numbers of 26 selected structural genes within the shikimate acid, phenylpropanoid, and flavonoid biosynthesis pathways were compared with those in corresponding pathways of Arabidopsis, orange, peach, grape, poplar, and eucalyptus (Fig. 3a, Supplemental Excel File 6).
Comparison analysis showed that the 26 structural genes showed up and down variations in copy numbers among the seven plants tested (Supplemental Excel File 6). The significant expanded gene families in longan, orange, peach, poplar, and eucalyptus were DHS, SDH, F3 H, ANR, and UFGT, when compared with the corresponding families in grape, which is considered to be the oldest among the seven selected plants in evolutionary history [53]. SDH catalyzes the NADPH-dependent reduction of 3-dehydroshikimate to shikimate in the fourth step of the shikimate pathway, which is the metabolic route required for the biosynthesis of the aromatic amino acids. SDH had six copy numbers in longan, which is the same as in Populus, but much higher than in Arabidopsis (1 copy), peach and grape (2 copies each), and orange and eucalyptus (3 copies each). F3 H is involved in flavonoid biosynthesis and is important for flower color and fruit skin. We found 65 copies of F3 H in the eucalyptus genome, 35 in longan, 28 in peach, 25 in orange, 26 in Populus, and only 12 in grape and 10 in Arabidopsis, suggesting that the F3 H family was significantly expanded in woody plants and a little contracted in herbs. These findings may provide important clues for the mechanism of flavonoid biosynthesis in plants. The gene encoding ANR, which is involved in the biosynthesis of proanthocyanidins (also called condensed tannins), had higher copy numbers (6) in longan than in Arabidopsis (2), orange (1), peach (1), grape (4), and Populus (5), implying that the expanded ANR numbers may play a role in proanthocyanidin biosynthesis. Significantly smaller numbers of the structural genes PAL, CHS, and F3 5 H were detected in longan (6,14,3), Arabidopsis (4, 1, 1), orange (4,15,4), peach (3,7,4), eucalyptus (9,16,8), and Populus (5, 12, 2), compared with the higher numbers detected in grape (13,34,12). PAL and CHS are involved in the key regulatory step in the branch pathway of phenylpropanoid biosynthesis specific to synthesis of ubiquitous flavonoid pigments [54], and F3 5 H is important for determining flower color [55], which may suggest that the PAL, CHS, and F3 5 H encoding genes that were discarded in the evolution history of longan, Arabidopsis, orange, peach, eucalyptus, and Populus, compared with grape, were functionally redundant. Besides the expanded and contracted numbers of structural genes, other structural genes, namely DHS, DHQS, SK, EPSP, CS, CM, ADT, C4H, 4CL, CHI, F3H, DFR, and ANS, showed little variation in copy numbers among longan, Arabidopsis, orange, peach, grape, poplar, and eucalyptus, which indicated their evolutionary conservation in different plant species. Overall, the expended, contracted, and conserved copy numbers of the 26 selected structural genes among the seven selected plants defined the different characteristics of polyphenol biosynthesis in the different species.
To further understand the functions of the 26 structural genes, we measured their expression levels between primary and secondary metabolism during longan vegetative and reproductive growth (Fig. 3b, Supplemental Excel File 7). The PCA showed that all the genes related to the biosynthesis of polyphenols were similarly expressed in leaf, pulp, and pericarp, but their expression levels differed among root, stem, flower bud, flower, young fruit, and seed (Fig. 3b), suggesting that these genes may have tissue-specific roles in longan. Thirteen of the 26 structural genes were found to be expressed in specific tissues, such as root, flower, flower bud, and/or seed (Supplemental Excel File 7). For example, two members of the SDH family, Cs9g05070.1-D1 and Cs9g05070.1-D5, showed high expression levels during the vegetative and reproductive stages, especially in pulp and pericarp, while the other members of the family were barely detectable, suggesting that Cs9g05070.1-D1 and Cs9g05070.1-D5 may play major roles in the shikimate acid pathway. The six members of the PAL family all exhibited low or undetectable expression levels in pulp, two had the highest expression levels in stem, and the other four were strongly expressed in stem, root, leaf, flower, and pericarp. The tissuespecific expression pattern of PAL further confirmed that PAL was related to lignin, the structural component of the cell wall in longan [56]. Five of the 14 members of the CHS family were barely detectable among the nine samples; among the other members, the highest expression levels were observed for four in seed, three in flower bud, and two in root, suggesting that CHS played an important role in the synthesis of flavonoid pigments in flower bud and seed. The 35 members of the F3 H family (Fig. 3c) exhibited different temporal and spatial expression levels (Fig. 3d). Among them, the highest expression levels were observed for one of the members in root, two in stem, five in leaf, 11 in flower bud, three in flower, six in young fruit, three in pericarp, and three in seed, while 11 F3 H family members were barely detectable in pericarp, pulp, and seed. For the three members of the F3 5 H family, one was detected only in root and one only in flower bud, implying that F3 H and F3 5 H both played major roles in determining longan flower colors. Proanthocyanidin synthesis involves both LAR and ANR (Fig. 3c). The six ANR family members and two of the four LAR members were barely detectable in pulp, and all the ANR and LAR genes were highly expressed in pericarp and relatively less expressed in seed (Fig. 3d). Previous studies of 12 varieties of Chinese longan fruit have shown that total polyphenols, tannins, and proanthocyanidins were most abundant in pericarp, followed by seed and pulp [57]. The high expression levels of ANR and LAR in pericarp and seed and the low expression levels in pulp indicated that they may determine the tannin composition of longan fruit, further indicating why whole longan fruit is dried for use in sweet desserts and soups for human health [58].
The MYB family of TFs is involved in the regulation of flavonoid biosynthesis [59]. To further investigate the biosynthesis of polyphenols in longan, we compared the numbers of MYBencoding genes in longan with their numbers in Arabidopsis, orange, peach, and grape. We also investigated their expression levels in longan using the genome and transcriptome data. We detected 94 R2R3-MYB genes in longan, which was more than in orange (74) and peach (88) but less than in grape (116) and Arabidopsis (141) (Fig. 4a). A neighbor-joining tree of the MYB gene family was constructed (Fig. 4b). The expression profiles of the MYB gene family in each tissue were clustered by PCA. The plots showed that the expression profiles in three of the tissues (stem, pericarp, and seed) formed one cluster, while the expression profiles of the other tissues were independently separated, implying that each had a distinct MYB expression profile (Fig. 4c). All members of the MYB gene family were expressed at varying levels among the nine vegetative growth and reproductive growth tissues, with some preferentially expressed in specific tissues (Fig. 4d, Supplemental Excel File 8). In Arabidopsis, specific R2R3-MYB family members, namely MYB3-5, -7, -11, -12, -32, -75, -90, -111, -113, -114, and -123, are known to be involved in regulating the flavonoid pathway [59]. In longan, only four R2R3-MYB genes, which are homologs of AtMYB4, -12, and -123, were found. In Arabidopsis, AtMYB4 down-regulated C4H and controlled sinapate ester biosynthesis in a UV-dependent manner; AtMYB12 up-regulated CHS, CHI, F3H, and F3 H, and controlled flavonol biosynthesis in all the tissues tested; and AtMYB123 up-regulated DNS and controlled the biosynthesis of proanthocyanidins in the seed coat [59]. In longan, three of the four homologous R2R3-MYB genes reached peaks in root but were undetected or lowly expressed in pericarp, pulp, and seed (Fig. 4d). The tissue-specific expression of these genes indicated that they may be required for flavonoid biosynthesis.

Identification and classification of genes encoding NBS-LRR and LRR-RLK
Transcriptome data analysis showed that longan contained a large number of significantly differentially expressed plant pathogen resistance genes. To further investigate the molecular basis for longan pathogen susceptibility, we searched for two classes of resistance genes in the longan genome, those encoding NBS-LRR proteins and those encoding LRR-RLK. We identified 594 NBS-LRR and 338 LRR-RLK encoding genes, which accounted for approximately 1.51% and 0.86% of the annotated protein-coding genes in longan, respectively. These numbers of NBS-LRR and LRR-RLK coding genes in the longan genome were more than those in orange (509, 325) [13], grape (341, 234) [10], kiwifruit (110, 259) [16], peach (425, 268) [14], mei (411, 253) [12], and papaya (60, 134) [9], but nearly half that in apple (1035, 477) [11] (Additional file 1: Table S19). NBS and LRR existed before the divergence of prokaryotes and eukaryotes, but their fusion has been detected only in land plant lineages [60], which are assumed to have originated from a common ancestor. A previous study showed that grape was the oldest among the fruits tested [53]. In this study, the numbers of NBS-LRR and LRR-RLK genes were either more or less in longan, orange, kiwifruit, peach, papaya, mei, and apple compared with grape. Detail analysis showed that the total number of genes encoding NBS and LRR was not associated with genome expansion or the total number of protein-coding genes in the selected genomes, which is similar to what was found in grass species [60]. Moreover, the NBSand LRR-encoding genes were significantly more in apple than in the other selected fruits, possibly as a result of a whole-genomewide duplication event in the apple [53]. The uneven distribution of NBS-and LRR-encoding genes on chromosomes was reported previously in Arabidopsis, rice, grapevine, and poplar [61]. These results suggest that changes in the numbers of genes encoding NBS-LRR and LRR-RLK in different species may alter the resistance of these species to different diseases.
The   Table  S19). Previous studies have shown that the deduced NBS-LRR proteins can be divided into two subfamilies, TIR and non-TIR proteins, based on their N-terminal features [62]. The TIR family of NBS-LRR genes probably originated earlier than the non-TIR family [60]. Here, the numbers of genes encoding the TIR proteins (TIR-NBS-LRR and TIR-NBS) varied from one (kiwifruit) to 288 (apple), and the numbers of genes encoding the non-TIR proteins were 567 in longan, 415 in orange, 320 in grape, 109 in kiwifruit, 282 in peach, 53 in papaya, and 753 in apple. The ratio of TIR to non-TIR genes was found to differ markedly in different species [62], suggesting ancient origins and subsequent divergence between the two NBS gene types. The distribution of resistance genes in the longan genome and the encoded domains is similar to those of the resistance proteins in other sequenced genomes, as shown in Additional file 1: Table S19. In addition, we noted that allelic variations due to the presence of SNPs in NBSencoding genes were associated with the phenotypic divergence between resistant ("FY," "SN1H," "MQ," "LDB," and "JYW") and susceptible ("SX," and "YTB") longan accessions. Such detailed knowledge of the longan genome will help to accelerate the development of genetic strategies to counter fruit loss caused by diverse pathogens [30].

Conclusions
Here, a draft genome of D. longan is presented for the first time. The assembled genome sequence is 471.88 Mb with 273.44-fold coverage obtained by paired-end sequencing. Whole-genome resequencing and analysis of 13 representative cultivated D. longan accessions revealed the extent of genetic diversity and contributed to trait discovery. Annotation of the protein-coding genes, comparative genomic analysis, and transcriptome analyses provided insights into longan-specific traits, particularly those involved in the biosynthesis of secondary metabolites and pathogen resistance.

DNA extraction, library construction, whole-genome shotgun sequencing and assembly
Whole-genome shotgun sequencing was performed using the Illumina HiSeq 2000 system. Genomic DNA was extracted from fresh mature leaves of the D. longan "HHZ" cultivar using the modified SDS method. DNA sequencing libraries were constructed according to the standard Illumina library preparation protocols. A total of 12 paired-end sequencing libraries, spanning 170, 250, 500, 800, 2000, 5000, 10 000, 20 000, and 40 000 bp, were constructed and sequenced on an Illumina HiSeq 2000 system. After stringent filtering and correction steps using Kmer frequency-based methods [21], a total of 121.68 Gb of data was obtained and then assembled using SOAPdenovo and SS-PACE software [63]. To check the completeness of the assembly, a longan transcriptome assembly comprising 68 925 unigenes (SRA050205) was mapped to the genome assembly using BLAT32 with various sequence homology and coverage parameters. The BUSCO pipeline [27] was also used to check the genome completeness.

Repetitive elements identification
Tandem repeats and interspersed repeats are two main types of repeats found in genomes. Tandem repeats were identified using LTR FINDER [64] with the default parameters. Interspersed repeats were identified by Repeat Masker (http://www.repeat masker.org/) and RepeatProteinMask using the Repbase library [65] and the de novo transposable element library. Identified repeats were then classified into different known classes, as previously described [33].

Gene prediction and annotation
For gene prediction, the scaffolds were first repeat-masked [65]. Then, three de novo homology-based and RNA-seq unigenesbased prediction methods, Augustus [66], GENSCAN [67], and GlimmerHMM [68], were used with parameters trained on Arabidopsis thaliana and Carica papaya. The de novo predictions were then merged into a unigene set. For the homology search, translated protein sequences from three sequenced plant genomes (Glycine max, Populus trichocarpa, and Vitis vinifera) were mapped to the longan genome assembly using TBLASTN (E-value cutoff 1 × 10 −5 ). To extract accurate exon-intron information, the homologous genome sequences were aligned against the matching proteins using GeneWise [69]. Subsequently, the Illumina RNAseq unigenes sequences [26] were aligned to the longan genome assembly using BLAT [70] to detect spliced alignments.
Finally, to generate the consensus gene set, the results obtained using the three methods described above were integrated using the GLEAN program [71]. The final gene set contained 39,282 genes. TFs were identified and classified using the TAK program [72]. Non-coding RNAs were predicted and classified as previously described [73]. Functions of the predicted protein genes were obtained by BLAST searches (E-value cutoff 1 × 10 −5 ) against the InterproScan [74], GO [75], KEGG [76], SwissProt [77], and TrEMBL databases.

Gene families and phylogenetic analysis
To identify gene families, the translated protein sequences from T. cacao, C. sinensis, A. thaliana, C. papaya, Populus trichocarpa, Glycine max, V. vinifera, M. acuminata, P. persica, A. chinensis, and M. domestica genomes were scanned using BLASTP (E-value cutoff 1e−5), and gene family clusters among the different plant species were identified by OrthoMCL [78]. Single-copy families that were represented in all the selected species were aligned using MUSCLE [79]. 4DTv in the 12 species, including longan, were used to construct a phylogenetic tree by MRBAYES [80]. The divergence time was estimated using the software MultiDivtime [79]. Colinearity between D. longan and P. trichocarpa was computed by SyMAP v3.4 [81]. Subsequently, TF families were identified using the IPR2genomes tool in GreenphylDB v. 2.0 [82] based on InterPro domains, and gene family expansion and contraction within phylogenetically related organisms were detected by CAFÉ, a tool for computational analysis of gene family evolution [31].

Resequencing, SNPs, indels, and sequence variations analysis
Paired-end Illumina libraries for 13 D. longan cultivars were prepared following the manufacturer's instructions and sequenced on an Illumina HiSeq 2000 system. After stringent filtering and correction steps, the resulting sequence data were uniquely aligned to the reference longan genome. SNPs, indels, and sequence variations were identified using SOAPsnp (http://soap.genomics.org.cn/soapsnp.html), SOAPindel [83], and SOAPsv [84].
We used all and high-quality SNPs to infer the phylogeography and population structure for D. longan. A phylogenetic tree was subsequently generated using the neighbor-joining method implemented in TreeBeST. The bootstrap was set at 1000 replicates.
Population structure was examined primarily via PCA using our own program and model-based clustering algorithms implemented in FRAPPE v1.1 (http://med.stanford.edu/ tanglab/software/frappe.html). We increased the pre-defined genetic clusters from K2 to K7 and ran the analysis with 10 000 maximum iterations.

Transcriptome sequencing
Transcriptome sequencing was performed on the Illumina HiSeq 2000 system. Total RNAs from the samples descried above were isolated using a TRIzol Reagent kit (Invitrogen, Carlsbad, CA). cDNA libraries were constructed and sequenced using the Illumina protocols. All the raw reads were first processed to remove the adaptor sequences, low-quality reads, and possible contaminations from chloroplast, mitochondrion, and ribosomal DNA. The clean reads were then aligned to the longan genome sequence using TopHat [46] to identify exons and splice junctions ab initio. The expression levels of matched genes in each cDNA library were derived and normalized to fragments per kilobase of exon per million fragments mapped. Cluster 3.0 [85] was used to analyze hierarchical clustering of genes. DEGs among different samples were identified using the EBSeq packages [86]. Subsequently, GATK (http://www.broadinstitute.org/gatk/) with default parameters was used to call SNPs based on the transcript sequence data.