The chromosome-level Stevia genome provides insights into steviol glycoside biosynthesis

Stevia (Stevia rebaudiana Bertoni) is well known for its very sweet steviol glycosides (SGs) consisting of a common tetracyclic diterpenoid steviol backbone and a variable glycone. Steviol glycosides are 150–300 times sweeter than sucrose and are used as natural zero-calorie sweeteners. However, the most promising compounds are biosynthesized in small amounts. Based on Illumina, PacBio, and Hi-C sequencing, we constructed a chromosome-level assembly of Stevia covering 1416 Mb with a contig N50 value of 616.85 kb and a scaffold N50 value of 106.55 Mb. More than four-fifths of the Stevia genome consisted of repetitive elements. We annotated 44,143 high-confidence protein-coding genes in the high-quality genome. Genome evolution analysis suggested that Stevia and sunflower diverged ~29.4 million years ago (Mya), shortly after the whole-genome duplication (WGD) event (WGD-2, ~32.1 Mya) that occurred in their common ancestor. Comparative genomic analysis revealed that the expanded genes in Stevia were mainly enriched for biosynthesis of specialized metabolites, especially biosynthesis of terpenoid backbones, and for further oxidation and glycosylation of these compounds. We further identified all candidate genes involved in SG biosynthesis. Collectively, our current findings on the Stevia reference genome will be very helpful for dissecting the evolutionary history of Stevia and for discovering novel genes contributing to SG biosynthesis and other important agronomic traits in future breeding programs.


Introduction
High-sugar diets are known to cause severe health problems such as obesity and diabetes 1 . Some countries have levied sugar taxes to reduce the consumption of high-calorie sugars, a recommended strategy to reduce sugar consumption by encouraging substitution with zero-calorie sweeteners 2 . Steviol glycosides, extracted from the leaves of Stevia, contain no calories and have desirable natural sweetness 3 . Stevia rebaudiana (2n = 22) is a sweet herb native to Paraguay, and its leaf extract has been used as a natural sweetener for centuries in South America 4 . In addition to sweetness, the two abundant components of SGs, stevioside and rebaudioside A (Reb A), may also provide therapeutic benefits for type 2 diabetes, as these compounds can directly enhance insulin secretion by potentiating TRPM5 channel activity in animal models 5,6 . Stevia is widely cultivated in Asia, North America, and Europe for its use as a natural sweetener and traditional medicine.
The genus Stevia belongs to the Eupatorieae tribe within the Asteraceae family. Among the~230 species of the genus Stevia, S. rebaudiana is the only one that contains SGs 3,7 . SGs have a core diterpenoid steviol backbone (aglycone) decorated with different glycosylation patterns at the C-13 and C-19 positions 3,8 . These diterpenoid glycosides occur almost exclusively in Stevia leaves, accounting for up to~20% of the dry weight 9,10 . Stevioside and Reb A are the two main components of SGs, followed by Reb C, Reb F, dulcoside A, Reb D, and Reb M. Labeling experiments have revealed that the backbone of SGs is biosynthesized from 5-carbon isoprenoid units, which are predominantly derived from the methylerythritol phosphate (MEP) pathway 11 . The biosynthetic pathways of SG and gibberellic acid (GA) share four steps, and the last common substrate of these two labdane-type diterpenoids is ent-kaurenoic acid 12-14 . In the SG biosynthesis pathway, ent-kaurenoic acid hydroxylase (ent-KAH) catalyzes the 13-hydroxylation of ent-kaurenoic acid to form ent-13-hydroxy kaurenoic acid (steviol), which serves as the backbone for all SGs 14,15 . Then, a series of glycosylation processes of the aglycone (steviol) catalyzed by a set of cytosolic UDP-dependent glycosyltransferases (UGTs) leads to production of diverse types of SGs. Glucose is a major sugar moiety in all SGs, while rhamnose and xylose are present in only a few SGs, such as Reb C and dulcoside A 3 .
Stevia is unique in its accumulation of SGs, which are secondary metabolites of diterpenoids and have initial biosynthetic steps similar to those of GAs. GAs are essential for normal plant growth and development, and genes involved in the GA biosynthesis pathway are conserved and strictly regulated in higher plants 16,17 . It seems that the biosynthesis of SGs and GAs should be spatially or temporally separated to avoid disturbing the normal metabolism of GAs 12 ; however, the evolution of SG accumulation and the separation mechanisms of SG and GA biosynthesis in Stevia remain elusive. After the formation of the core tetracyclic diterpenes (GA 12 and steviol), oxidation and glycosylation take place to yield the final bioactive compounds: GAs and SGs, respectively. Unlike the oxidase genes involved in GA biosynthesis, UGT genes participating in diterpenoid glycosylation have rarely been documented 18,19 . Stevia is an ideal plant model for the study of diterpenoid glycosylation, not only because its leaves accumulate more than 30 types of SGs but also because of its short growth cycle and easy reproduction. Due to the absence of the genome sequence of Stevia, most studies identifying UGT genes involved in glycosylation of SGs have been based on expressed sequence tags or transcriptomic sequences, and only three UGTs have been characterized to contribute to the biosynthesis of SGs so far 20,21 . This deficiency has hindered research on SG biosynthesis and, consequently, has hindered comprehensive understanding of the evolution of Stevia in Asteraceae.
In the present study, we generated a high-quality reference genome sequence for Stevia (cv. 'Zhongshan No. 7') through a combination of PacBio sequencing and Hi-C approaches. Based on this genome sequence, we performed an evolutionary analysis of Stevia in the Asteraceae family. Furthermore, candidate gene sets involved in SG biosynthesis were identified. This reference genome will be very helpful for the evolutionary understanding of SG biosynthesis and for quality improvement of Stevia in the future.

Genome sequencing and assembly
The leaves of the Chinese Stevia cultivar 'Zhongshan No. 7' were collected for de novo genome sequencing and assembly. Based on K-mer analysis, we estimated a genome size of 1.16 Gb for Stevia, with heterozygosity rates of 0.43% and 73.13% repeats (Supplementary Table 1 and Supplementary Fig. 1). To accurately assemble this complex genome with a high rate of repeat sequences, we used a combination of short-read Illumina sequencing, longread PacBio sequencing, and Hi-C sequence approaches. We obtained 114.96 Gb of PacBio sequencing subreads, providing~99.5-fold coverage of the Stevia genome (Supplementary Table 2). These subreads were assembled into a 1405 Mb genome that contained 6978 contigs, with a contig N50 value of 616.85 kb, and the longest contig was 26.27 Mb in length (Supplementary Table 3). A total of 76.86 Gb of Hi-C clean data were generated, of which 90.42% reads were mapped to the assembled contigs (Supplementary Table 4). Under the guidance of the Hi-C data, we successfully clustered 6358 contigs into 11 pseudochromosomes and oriented 91.28% of the assembly according to a hierarchical clustering strategy 22 (Supplementary Table 5 and Supplementary Fig. 2). The final chromosome-level Stevia assembly was 1416 Mb in length, with an N50 scaffold size of 106.55 Mb (Table 1).
We used a combination of three data sources to assess the completeness of the Stevia genome assembly. First, we aligned our Illumina data to the assembled genome, and 98.14% of the clean reads were mapped (Supplementary Table 6). A total of 451 of the 458 core eukaryotic genes (CEGs) were identified in our current Stevia genome  Table  8). All these findings suggested that the assembled Stevia genome had high completeness and accuracy.

Annotation of the Stevia assembly
We annotated the repetitive elements of the Stevia genome through homology-based methods and in silico prediction, and 80.11% of the assembly was determined to be composed of repetitive elements. Among them, retrotransposons accounted for 69.45%, and DNA transposons accounted for 5.83%. More than 65% of the Stevia genome consisted of long terminal repeat retrotransposons (LTR-RTs), 32.30% of which belonged to the Copia lineage and 66.76% of which belonged to the Gypsy lineage (Supplementary Table 9). It was not surprising that such a high content of LTR-RTs was present in the Stevia genome since LTR-RTs are also present in large proportions in other Asteraceae species, such as Helianthus annuus (sunflower) 24 , Lactuca sativa (lettuce) 25 , and Chrysanthemum nankingense 26 . For protein-coding gene annotation, we used a combination of three methods: homology-based, ab initio, and RNA Seq-assisted prediction methods. Finally, 44,143 protein-coding genes were predicted in our current Stevia assembly, with an average gene length of 3493 bp (Table 1). Transcriptome analysis showed that 37,489 predicted genes (84.93%) were supported by at least one of the seven organs (root, stem, and leaves at five different developmental stages). Overall, 41,801 protein-coding genes (94.69%) were assigned functions in at least one of the five databases (NR, TrEMBL, KOG, KEGG, and GO) (Supplementary Table 10), and 40,355 were anchored on the 11 pseudochromosomes. Fig. 1b-e shows the Copia density, Gypsy density, gene density, and transcriptional level of each pseudochromosome.

Evolutionary history of S. rebaudiana
To understand the evolutionary relationship between Stevia and other Asteraceae plants, we performed a comparative genomic analysis using Vitis vinifera, Solanum lycopersicum, Daucus carota, and five Asteraceae plants (L. sativa, C. nankingense, Artemisia annua, H. annuus, and S. rebaudiana). Phylogenetic analysis based on 799 single-copy orthologous genes identified in these eight species confirmed the close relationship between Stevia and sunflower (Heliantheae alliance) and between C. nankingense and A. annua (Anthemideae tribe) (Fig. 2a). The estimated divergence time of Stevia and sunflower was~28-31 million years ago (Mya). The most recent common ancestor (MRCA) of Stevia and sunflower diverged from the MRCA of C. nankingense and A. annua~37-38 Mya, which is consistent with the findings of Song et al. 26 . The MRCA of Stevia and C. nankingense diverged from lettucẽ 39-40 Mya (Fig. 2a). We further investigated whole-genome duplication (WGD) events during Stevia evolution since WGD has been considered a significant source of plant genetic, biochemical, and evolutionary novelty [27][28][29] . We identified 5209 paralogous gene pairs that accounted for 20.69% of the predicted Stevia genes using the MCScanX package 30 . The Ks distribution of these duplicated gene pairs peaked at 0.53, reflecting the occurrence of a WGD event 32.1 Mya (Fig. 2b). Based on the Ks distribution of the orthologous gene pairs between Stevia and sunflower, we estimated that their divergence time was~29.4 Mya, indicating that they diverged soon after the WGD event (WGD-2) experienced by their common ancestor 24 . The Ks distribution of homologous gene pairs clearly illustrated that whole-genome triplication (WGT-1, 45.5-51.5 Mya) occurred in lettuce (Fig. 2b), which is also believed to have occurred in the ancestry of most Asteraceae [24][25][26]31 . This analysis showed that Stevia experienced a complicated evolutionary history characterized by a recent WGD-2 shared with sunflower, the basal WGT-1 in Asteraceae, and the ancestral paleohexaploidy event (WGT-γ) that occurred in all eudicots 32 . Thus, for any ancestral region from the MRCA of Asteraceae (post-WGT-1), two inherited regions are currently expected to exist in the Stevia and sunflower genomes compared to the lettuce genome (Fig. 2c). Although Stevia and sunflower experienced the same paleopolyploidy events (WGT-γ, WGT-1, and WGD-2) and there were many collinear regions between their genomes ( Supplementary Fig. 3), these two species may have undergone different chromosome rearrangement patterns and duplicated gene loss after divergence, resulting in different chromosome numbers and genome sizes.

Gene family analysis
Based on sequence homology, a total of 41,701 gene families containing 276,277 genes were identified using the predicted genes of the above eight plants (seven from asterids and V. vinifera as an outgroup) ( Fig. 3a and Supplementary Table 11). Of these, 5749 gene families consisting of 68,964 genes were shared among all eight plants, and 12,326 were shared among the five Asteraceae plants (Fig. 3b). We assigned 40,214 Stevia genes to 20,147 families and found that 1057 gene families contained 4281 genes unique to Stevia. Gene Ontology (GO) enrichment analysis revealed that these unique gene families were mostly involved in RNA-directed DNA polymerase activity (GO:0003964), aspartic-type endopeptidase activity (GO:0004190), and RNA binding (GO:0003723) ( Supplementary Fig. 4).
Further gene family analysis demonstrated that 323 gene families were expanded in the Stevia genome, while 346 were contracted (Fig. 3a). GO enrichment analysis of the expanded gene families of Stevia revealed that they were enriched for transferase activity (GO:0016758), monooxygenase activity (GO:0004497), ADP binding (GO:0043531), catalytic activity (GO:0003824), and terpene synthase activity (GO:0010333) ( Supplementary Fig.  5). KEGG enrichment analysis revealed that the expanded gene families of Stevia were enriched mainly for phenylpropanoid biosynthesis (ko00940), terpenoid backbone biosynthesis (ko00900), flavonoid biosynthesis (ko00941), monoterpenoid biosynthesis (ko00902), cyanoamino acid metabolism (ko00460), and sesquiterpenoid and triterpenoid biosynthesis (ko00909) (Supplementary Fig. 6). The GO and KEGG enrichment analyses demonstrated that a considerable number of these expanded gene families participated in the biosynthesis of specialized metabolites. As the terpenoid biosynthesis pathway was enriched several times, we further investigated the expansion pattern of the terpene synthase (TPS) gene family, which drives the diversification of terpenoids. Eighty-two TPS genes were identified in the Stevia genome, which could be divided into five subfamilies. More than three-quarters of the Stevia TPS genes were classified into the TPS-a and TPS-b subfamilies, indicating significant expansion of these two subfamilies (Fig. 3c).

Genes involved in SG biosynthesis
Accumulation of large amounts of SGs in leaves is the most notable feature of Stevia. Although the SG biosynthetic pathway has been extensively studied during the past two decades, and although some of the critical UGT genes have been well characterized 21,33 , we obtained new insights into SG biosynthesis by combining genomic and transcriptome analyses. All terpenes are derived from 5carbon isoprenoid units produced through either the MEP pathway or the mevalonate (MVA) pathway 34 . Candidate genes in the MEP and MVA pathways were identified using homolog searching and functional annotation methods. Transcriptome analysis revealed that almost all the candidate genes in the MEP pathway were expressed in seven selected tissues, including leaves at different developmental stages, with high accumulation of SGs (Fig. 4). However, the expression levels of HMGR and MK in the MVA pathway were deficient in the leaves, indicating that the 5-carbon isoprenoid unit for SG biosynthesis comes from mainly the MEP pathway instead of the MVA pathway, which is consistent with the conclusions derived from the results of labeling experiments 11 .
Since the biosynthesis of SGs shares four steps with the biosynthesis of GAs before the generation of ent-kaurenoic acid, we identified all candidate genes involved in this common pathway, including 14 geranylgeranyl diphosphate (GGPP) synthase (GGPPS) genes, seven ent-copalyl diphosphate synthase (ent-CPS) genes, five ent-kaurene synthase (ent-KS) genes, and six ent-kaurene oxidase (ent-KO) genes (Fig. 4). All four types of genes were multicopy genes in the Stevia genome, and the homologous genes showed expression differentiation (Fig. 4b), reflecting subfunctionalization or neofunctionalization of these duplicated genes. Stevia has evolved to biosynthesize SGs based on the conserved early steps of the GA biosynthesis pathway in vascular plants.
Steviol glycoside biosynthesis diverges from GA biosynthesis with the 13-hydroxylation of ent-kaurenoic acid by ent-KAH. The evolution of ent-KAH from numerous P450s was the key step of SG biosynthesis in Stevia; however, cloning of its encoding gene has not been successful yet 3,35 . Several potential ent-KAH genes of Stevia have been deposited in the NCBI database, and most belong to the CYP716 family. In contrast, CYP714A2, a member of the CYP714 family of Arabidopsis thaliana, has been reported to catalyze the 13-hydroxylation of entkaurenoic acid when expressed in yeast 36 . Thus, we identified all putative members of the CYP716 and CYP714 families in the Stevia genome. There was significant expansion of CYP716 genes, and four tandemly duplicated genes (Streb.1G007430, Streb.1G007460, Streb.1G007470, and Streb.1G007480) were consistently highly expressed in leaves with SG biosynthesis (Supplementary Fig. 7). In contrast, there was only one member of the CYP714 family in the Stevia genome (Streb.8G019490), and it was hardly expressed in the leaves (Fig. 4b). After the formation of steviol, continuous glycosylation processes catalyzed by a set of UGTs lead to different types of SGs (Fig. 4a). UGT genes were highly enriched (GO:0016758, P < 0.001) among the expanded gene families of Stevia. In total, we identified 259 putative UGT genes in the Stevia genome, and 86 UGT genes were expressed in at least two of the five selected leaf tissues ( Supplementary Fig. 8), including three UGT genes that have been reported to be involved in SG biosynthesis (UGT85C2, UGT74G1, and UGT76G1). Unearthing of these candidate UGT genes through genomic and transcriptome analyses will accelerate the identification of UGT genes involved in specific SG glycosylation.

Discussion
Obesity is a serious global health issue that affects millions of people, with a high-sugar diet being one of the leading causes of obesity. Reducing sugar intake by substitution with zero-calorie sweeteners is an effective way to reduce dietary energy consumption. Although artificial sweeteners such as saccharin, aspartame, and sucralose are widely added to various food products in daily use, longterm uptake of these sugar substitutes may pose health risks 37,38 . There is a strong demand for natural zero-calorie sweeteners, and SGs may be the most promising candidates. Steviol glycosides have been approved by the foremost regulatory authorities worldwide for use in foods and beverages. High yields and improvements in the levels of the best-tasting SGs, such as Reb A, Reb D, and Reb M, are currently the main objectives for breeding of Stevia. Thus far, there have been no comprehensive analyses combining genomic and transcriptomic methods to provide in-depth insights into the unique diterpenes (SGs) of Stevia. It is still very challenging to construct a high-quality Stevia genome assembly based only on second-generation sequencing owing to the large size and high complexity of the genome as well as the percentage of repeats 39 . Here, we propose a chromosomelevel genome assembly of Stevia. The Stevia genome spanning 1416 Mb was obtained, with a contig N50 of 616.85 kb and a scaffold N50 of 106.55 Mb. We predicted 44,143 protein-coding genes in our current assembly using homology-based, ab initio, and RNA Seq-assisted prediction methods; this number is almost twice that of the previous genome assembled using second-generation short sequences (24,994), probably because only 411 Mb of the genome had previously been assembled 39 . Therefore, the genome assembled using long sequences and Hi-C approaches in this study is superior to the genome previously assembled using only short sequences.
More than four-fifths of the Stevia genome consisted of repetitive elements, of which 21.02% belonged to the Copia lineage and 43.44% belonged to the Gypsy lineage. In sunflower, more than three-quarters of the genome was composed of LTR-RTs, 59.9% of which belonged to the Gypsy lineage and 25.8% of which belonged to the Copia lineage 24 . In C. nankingense, repetitive elements accounted for 69.6% of the genome, among which LTR-RTs (Gypsy and Copia) were the most abundant 26 . Having many repetitive sequences, especially LTR-RTs, might be a significant feature of the Asteraceae family, contributing to the genome sizes of its members.
Comparative genomic analyses of Stevia and other Asteraceae plants have provided crucial clues regarding Stevia genome evolution in Asteraceae. Our results showed that Stevia and sunflower diverged~29.4 Mya, shortly after the WGD event (WGD-2,~32.1 Mya) that occurred in their MRCA (Fig. 2a, b). Stevia, a member of the Asteraceae family, also experienced a basal WGT-1 event in Asteraceae, and a WGT-γ event occurred in all eudicots 24,25,32 . Most of the syntenic blocks present in the Fig. 4 The SGs biosynthesis pathway in Stevia. a Diagram depicting the pathway of SGs biosynthesis. Inside the dashed box is the unique SGs biosynthesis pathway in Stevia. b Expression patterns of candidate genes of the SGs biosynthesis pathway. RS root at the seedling stage, SS stem at the seedling stage, LS leaf at the seedling stage, LV leaf at the vegetative stage, LB leaf at the bud stage, LIF leaf at the initial flowering stage, LPF leaf at the peak flowering stage Stevia genome were derived from the recent WGD-2 event, while syntenic blocks derived from ancient WGT-1 events were rarely preserved (Fig. 2b). After it diverged from an ancestor shared with sunflower, Stevia evolved to synthesize SGs, unique metabolites not found in other plants. The expansion of specific gene families may play important roles in promoting phenotypic diversification as well as in the evolution of novel traits in plants 40,41 . The expanded genes in Stevia were mainly enriched for biosynthesis of specialized metabolites, especially biosynthesis of terpenoid backbones, and for further oxidation and glycosylation of these compounds (Supplementary Figs. 5,6). We further identified all candidate genes in the pathway of SG biosynthesis based on the genome sequences and found that the essential genes responsible for steviol biosynthesis were multiple-copy genes (Fig. 4b). These duplicated genes might be important contributors to the ability of Stevia to synthesize SGs. Thus, this high-quality chromosome-level genome assembly will undoubtedly benefit researchers in the exploration of Stevia characteristics.

Materials and methods
Leaf collection, DNA library construction, and genome sequencing Fresh young leaves present during the seedling stage of 'Zhongshan No. 7', a cultivated diploid Stevia species, were collected from the Stevia germplasm resource laboratory located at the Nanjing Botanical Garden Mem. Sun Yat-Sen. Genomic DNA was isolated for Illumina and PacBio sequencing. For Illumina sequencing, a short-read (270 bp) library was constructed and sequenced on an Illumina HiSeq platform (Illumina, CA, USA), and 141.90 Gb of clean reads were obtained. For PacBio sequencing, genomic DNA was fragmented to~20 kb to construct a long-read library according to the manufacturer's instructions (Pacific Biosciences, CA, USA), and then the library was sequenced on a PacBio Sequel platform. After filtering out the low-quality reads and sequence adapters, we obtained 114.95 Gb of clean subreads with an N50 value of 12.82 kb.

Genome assembly
For de novo genome assembly, we first used Canu (v1.5) 42 to correct for potential errors in the PacBio subreads. Then, the high-quality PacBio subreads were independently assembled using WTDBG (v1.2.8), FAL-CON (v0.7) 43 and Canu (v1.5). These three assembly strategies yielded 1.36, 3.25, and 2.03 Gb assemblies, and the contig N50 sizes of these three assemblies were 205.51, 59.71, and 277.70 kb, respectively. We further merged the well-assembled WTGDB and Canu assemblies using Quickmerge 44 . To correct the indel and SNP errors in the assembly sequence, we mapped paired-end Illumina reads to the merged assembly using Pilon 45 . Finally, the size of the genome assembled using PacBio long reads was 1.40 Gb with a contig N50 value of 616.85 kb (Supplementary Table 3). To evaluate the quality of the assembly, we used BWA 46 to map the short paired-end reads to the optimized contigs and then performed CEGMA 47 and BUSCO 23 analyses.

Chromosome-level assembly
Hi-C sequencing for the chromosome-level genome assembly was performed as previously described 48 . Briefly, fresh young leaves present at the seedling stage of 'Zhongshan No. 7' were collected and fixed in formaldehyde solution. HindIII was used to digest the chromatin extracted from the fixed leaves. The DNA fragments were then ligated together to form chimeric junctions after biotinylation. Next, the enriched chimeric junctions were physically sheared into DNA fragments of 300-700 bp in length. Biotin-containing DNA fragments were enriched through streptavidin pulldown and then subjected to Illumina HiSeq sequencing. Finally, 76.86 Gb of clean Hi-C reads were generated. HiC-Pro 49 was used to assess the Hi-C sequencing data. The Hi-C sequencing data were mapped to assembled contigs using BWA-aln 46 . The preassembled scaffolds were split into 50 kb segments on average and conjoined with unique mapped reads for assembly using LACHESIS software 22 . To evaluate the final chromosome assemblies, we divided them into bins of equal lengths (100 kb) and visualized the interaction matrix in a heat map.

Genome annotation
De novo searches and homology-based alignments were used to predict repetitive sequences across the Stevia genome. We first used PILER-DF (v2.4) 50 , RepeatScout (v1.0.5) 51 , and LTR_FINDER (v1.05) 52 to predict de novo repetitive sequences and classified them into families using PASTEClassifier 53 . We then used RepeatMasker (v4.0.6) 54 to scan the integrated database of the de novo repetitive sequences and the known Repbase 55 TE library.
We used homology-based, ab initio, and RNA Seqassisted approaches to predict protein-coding genes in the Stevia genome assembly. Augustus 56 , GlimmerHMM 57 , SNAP 58 and GeneID 59 were used for ab initio programs. In homologous predictions, the protein sequences of A. thaliana, H. annuus, L. sativa, and Oryza sativa from Phytozome were downloaded and aligned to the assembled Stevia genome using TBLASTN 60 . We then aligned the homologous genomic sequences against matching proteins to construct the exact protein-coding gene models using GeMoMa 61 . For the RNA Seq-assisted predictions, RNA sequencing reads from different organs of Stevia were mapped to the assembly, and transcripts from these mapping results were identified using GeneMarkS-T 62 and TransDecoder (https://github.com). We integrated all gene models obtained from the above three annotation procedures with EVM 63 to construct a final consensus set and then filtered it with PASA (v2.0.2) 64 . These predicted protein-coding genes were then assigned to BLAST against public databases, including TrEMBL 65 and NCBI nonredundant protein databases. Blast2GO 66 was used to determine the functions and pathways based on the GO 67 and KEGG 68 databases.

WGD analysis
All-versus-all protein sequence comparisons were performed using BLASTP (E-value: 1e−05) to identify homologous gene pairs. Syntenic blocks within and between species were determined using MCScanX 30 . The Perl script 'add_ka_and_ks_to_collinearity.pl'. implemented in the MCScanX package was used to calculate the Ks values of the collinear homologous gene pairs. For five Asteraceae species, the neutral substitution rate of asterids (r = 8.25E−9) was applied to calculate the divergence date of the WGD or speciation events 24 . A Stevia-lettuce-sunflower syntenic block diagram and dot plot of coding genes between Stevia and sunflower were drawn with TBtools 75 . An image of the syntenic blocks and genomic features in the Stevia genome was produced with Circos (v0.69) 76 .