Chromosome-level genome assembly of Salvia miltiorrhiza with orange roots uncovers the role of Sm2OGD3 in catalyzing 15,16-dehydrogenation of tanshinones

Abstract Salvia miltiorrhiza is well known for its clinical practice in treating heart and cardiovascular diseases. Its roots, used for traditional Chinese medicine materials, are usually brick-red due to accumulation of red pigments, such as tanshinone IIA and tanshinone I. Here we report a S. miltiorrhiza line (shh) with orange roots. Compared with the red roots of normal S. miltiorrhiza plants, the contents of tanshinones with a single bond at C-15,16 were increased, whereas those with a double bond at C-15,16 were significantly decreased in shh. We assembled a high-quality chromosome-level genome of shh. Phylogenomic analysis showed that the relationship between two S. miltiorrhiza lines with red roots was closer than the relationship with shh. It indicates that shh could not be the mutant of an extant S. miltiorrhiza line with red roots. Comparative genomic and transcriptomic analyses showed that a 1.0 kb DNA fragment was deleted in shh Sm2OGD3m. Complementation assay showed that overexpression of intact Sm2OGD3 in shh hairy roots recovered furan D-ring tanshinone accumulation. Consistently, in vitro protein assay showed that Sm2OGD3 catalyzed the conversion of cyptotanshinone, 15,16-dihydrotanshinone I and 1,2,15,16-tetrahydrotanshinone I into tanshinone IIA, tanshinone I and 1,2-dihydrotanshinone I, respectively. Thus, Sm2OGD3 functions as tanshinone 15,16-dehydrogenase and is a key enzyme in tanshinone biosynthesis. The results provide novel insights into the metabolic network of medicinally important tanshinone compounds.


Introduction
Salvia miltiorrhiza Bunge is one of the most well-known traditional Chinese medicine materials and a model system for medicinal plant biology [1]. It has been clinically used to treat heart and cardiovascular diseases and is beneficial for management of various other diseases, such as cancer and neurodegenerative diseases [2,3]. Lipophilic tanshinones that are mainly produced in the roots are a major class of bioactive compounds in S. miltiorrhiza [4,5]. Some of them, cryptotanshinone (CT), 15,16dihydrotanshinone I (15,16-DHT), 1,2,15,16-tetrahydrotanshinone I (THT), and methylenedihydro-tanshinquinone, have a single bond at C-15,16, whereas some of the others, tanshinone I (TAI), 1,2-dihydrotanshinone I (1,2-DHT), tanshinone IIA (TAII), tanshinone IIB, and methylenetanshinquinone, possess a double bond at C-15,16 [6,7]. Biosynthesis of tanshinones can be generally divided into four stages, including the biosynthesis of C5 isoprene unit isopentenyl diphosphate (IPP) and its isomer dimethylallyl diphosphate (DMAPPP), the formation of intermdediate diphosphate precursor GGPP, the biosynthesis of parent terpene carbon skeletons copalyl diphosphate (CPP) and miltiradiene, and the formation of different tanshinone compounds [1]. So far, the reactions at the first three stages and the genes encoding the enzymes that catalyse these reactions have been elucidated [8][9][10][11][12]. However, due to the complexity, it is still largely unknown about the fourth stage that involves multiple modifications, such as hydroxylation, oxidation, heterocyclization, dehydrogenation, and demethylation.
Recently studies suggest that several cytochromes P450 (CYPs) belonging to the CYP76 and CYP71 subfamilies are involved in tanshinones biosynthesis. SmCYP76AH1 catalyzes the hydroxylation of miltiradiene into ferruginol. Subsequently, ferruginol is further converted to 11-hydroxyferruginol, sugiol and 11-hydroxysugiol under the catalysis of SmCYP76AH3 [13,14]. The generated 11hydroxyferruginol and 11-hydroxysugiol are then hydroxylated at C-20 by SmCYP76AK1. SmCYP71D411 catalyzes the hydroxylation of sugiol to 20-hydroxysugiol [14,15]. In addition, SmCYP71D375 is involved in the hydroxylation and heterocyclization of miltirone, 4-methylenemiltirone and Ro to form the D-ring that has a single bond at C-15, 16. Formation of such a D-ring on Ro can also be catalyzed by SmCYP71D373 [15].
In addition to CYP450s, members of the 2-oxoglutarate dependent dioxygenase (2OGD) superfamily have been implied in catalysis of tanshinone biosynthesis [16][17][18]. 2OGDs are a class of iron-containing non-heme oxygenases localizing in the cytosol and the most versatile oxidative enzymes in nature [19]. They contain the conserved 2-His-1-carboxylate facial triad (HX(D/E)X50-210H) and the conserved RXS motif and catalyze the oxidation of organic substrates using ferrous iron Fe (II) as the active site cofactor and 2-oxoglutarate (2OG) and molecular oxygen as the co-substrates [20]. The reactions they catalyzed include hydroxylation, de-alkylation, demethylation, desaturation, epoxidation, epimerization, cyclization, halogenation, peroxide formation, and ring expansion/contraction [19][20][21][22][23]. The best-studied examples of 2OGDs include GA 20-oxidase, GA 2-oxidase, and GA 3-oxidase that are involved in gibberellin metabolism and f lavone synthase I, f lavonone 3β-hydroxylase, f lavonol synthase and leucoanthocyanidin synthase that are involved in f lavonoid metabolism [22,23]. Compared with CYP450s, 2OGDs are relatively less studied in S. miltiorrhiza, although they are often considered as candidate genes involved in tanshinones biosynthesis [16,24]. S. miltiorrhiza has a total of 132 Sm2OGD genes [16]. It is reported that down-regulation of Sm2OGD5 gene expression significantly reduced the contents of miltirone, CT, and TAII in transgenic S. miltiorrhiza hairy roots [16]. The reaction catalyzed by 2OGD5 protein is unknown. In addition, it has been shown that Sm2OGD25 catalyzes the hydroxylation of sugiol to produce hypargenin B and crossogumerin C [17]. Sm2-ODD14 converts CT and isocryptotanshinone to TAII and isotanshione IIA, respectively [18].
The roots and rhizomes of S. miltiorrhiza have been used as traditional Chinese medicine materials for a long time. They are called Danshen and mean 'red root' in Chinese due to the substantial accumulation of red tanshinones, such as TAII and TAI in root periderm [4,5,25,26]. In this study, we investigated a S. miltiorrhiza line shh that has orange roots. In comparison with tanshinone contents in red roots of normal S. miltiorrhiza plants, the contents of tanshinones with a single bond at C-15,16 were increased in orange roots, whereas the contents of those unsaturated at C-15 and C-16 were significantly decreased. In order to obtain more comprehensive information to explain the large variations of tanshinone contents in shh, we sequenced the whole genome and assembled it to the chromosome level. The comparative genomic and transcriptomic analyses showed that the Sm2OGD3 gene was mutated by fragment deletion in shh. Further enzyme activity analysis and transgenic hairy roots experiment of intact Sm2OGD3 protein revealed its function in catalyzing the dehydrogenation of tanshinones at C-15 and C-16. Our results suggest that fragment deletion in Sm2OGD3 gene in shh could be responsible for the lack of 15,16-unsaturated tanshinones.

Comparative metabolome and UPLC analysis of tanshinones in S. miltiorrhiza lines with orange roots (shh) and red roots (99-3)
Tanshinones are enriched in the periderm of S. miltiorrhiza roots [4,5]. Many tanshinone compounds show colors, ranging from orange to red [25,26]. For instance, the methanolic solution of TAII, TAI, and 1,2-DHT are red, whereas CT, 15,16-DHT, and THT are orange (Fig. 1d). Normal S. miltiorrhiza plants, such as line 99-3, are brick-red, because of the accumulation of TAII, TAI, 1,2-DHT, and other red tanshinones (Fig. 1b). However, the roots of a nature line of S. miltiorrhiza were orange (Fig. 1a). This line was originally found in Shaanxi in China, in 1999. It has been transplanted to the field nursery of the Institute of Medicinal Plant Development of the Chinese Academy of Medical Sciences and propagated through root cuttings for about 20 years. This line was named Shanhuang (shh).
In order to analyse the changes of chemical compositions in shh, we performed comparative metabolomic analysis of roots from lines 99-3 and shh. The content of metabolites was evaluated using OrthogonalPartial Least Squares Discrimination Analysis (OPLS-DA) with the threshold for significant differences determined by variable importance in the projection (VIP) >1.0, P-value <0.05, and fold change >1.5 or <0.67. It resulted in the identification of 29 compounds with contents significantly changed in shh, among which TAII, CT, THT, and TAI were the four compounds with the largest VIP (Table S1, see online supplementary material). TAII, TAI, and methyl tanshinoate that possess a double bond at C-15,16 were decreased to 17.78-, 46.24-, and 39.06-folds, respectively. Tanshinones with a single bond at C-15,16, such as THT, CT and its derivatives, cryptomethyltanshinoate and anhydride of CT, were increased to 2.24-, 1.76-, 3.87-, and 19.76-folds, respectively. In addition, the upstream products of the tanshinone biosynthetic pathway, such as sugiol, miltirone I and dehydromiltirone, were increased to 2.00-, 1.94-, and 2.56-folds, respectively.

Chromosome-level genome assembly of S. miltiorrhiza line with orange roots (shh)
So far, the whole genome assemblies have been available for four S. miltiorrhiza lines with red roots [15,[27][28][29]. However, there is no genomic sequence available for S. miltiorrhiza plants with orange roots. In order to investigate the genes responsible for the variations in tanshinone production in shh, we sequenced and assembled the whole genome of shh to chromosome level (Fig. 2a). The K-mer 17 distribution analysis of 28.73 Gb Illumina short reads estimated that the size of shh genome was 581.68 Mb and the heterozygosity is 1.34% (Table S2, [15,[27][28][29] (Table 2). Furthermore, 133.51 Gb Hi-C data were used to assist the assembly [30]. A total of 371 scaffolds covering 496.5 Mb (93.51%) of the assembled genome were anchored into eight pseudochromosomes (Figs S1 and S2, see online supplementary material). A super-scaffold N50 of 60.28 Mb and a maximum scaffold length of 77.73 Mb were attained after Hi-C (Table 1; Table S4, see online supplementary material).
To assess the completeness of the assembly and the homogeneity of the sequencing, Illumina short reads were mapped to the genome using BWA [31]. It exhibits exceptional alignments with the mapping rate of 95.96%, the coverage of 99.74%, and the average sequence depth of 46.77%. BUSCO [32] and CEGMA [33] assessments were also performed, resulting in the identification of 88% single-copy orthologs and 89.52% core eukaryotic conserved genes, respectively (Table S5, see online supplementary material). It indicates that the chromosome-level genome assembly of S. miltiorrhiza line shh has a high degree of continuity and completeness.
Whole genome repeats were identified by homology alignment and de novo search (Tables S6 and S7, see online supplementary material). The results showed that shh genome contained about 56.65% repetitive sequences (Fig. 2a). Long terminal repeats (LTRs) were the predominant transposable elements, which represented 49.30% of the whole genome. Based on ab initio prediction, homology-based prediction and RNA-Seq assisted prediction, a total of 32 191 protein-coding genes with an average CDS length of 1191 bp were predicted (Table 1; Table S8, see online supplementary material). The number of predicted gene models was similar to previously reported S. miltiorrhiza genomes [15,[27][28][29] (Table 2). Among the 32 191 predicted protein-coding genes, 30 538 (94.9%) were functionally annotated by mapping to protein databeses, including Swiss-Prot, NR, KEGG, InterPro, GO, and Pfam (Table S9, see online supplementary material).

Phylogenomic analysis revealed the evolutionary history of the orange root line (shh)
With the whole genome assembly available, we asked how shh was evolved. To address this question, shh genome assembly was compared with the whole genomes from the other plant species, including five Salvia species (S. miltiorrhiza_99-3, S. mil-tiorrhiza_unknown, S. miltiorrhiza_DSS3, Salvia bowleyana, Salvia hispanica, Salvia splendens), seven Lamiales species (Erythranthe guttata, Sesamum indicum, Utricularia gibba, Olea europaea, Scutellaria baicalensis), two Solanaceae species (Solanum lycopersicum, Solanum tuberosum), two Apiales species (Panax ginseng, Daucus carota), Arabidopsis thaliana, and Populus trichocarpa. The phylogenomic tree was constructed based on single-copy gene families using maximum likelihood estimation.
The phylogenomic tree showed that three S. miltiorrhiza lines were clustered with S. splendens, whereas S. splendens diverged from the common ancestor of S. miltiorrhiza approximately 36.3 million years ago (MYA) (Fig. 2c). Interestingly, two S. miltiorrhiza lines having red roots showed a close relationship with the divergence time about 16.2 MYA, whereas the divergence of shh from the common ancestor of the S. miltiorrhiza lines with red roots occurred about 18.6 MYA (Fig. 2c). However, when we added S. miltiorrhiza_DSS3, S. bowleyana, S. hispanica and S. baicalensis to the new phylogenetic tree, S. bowleyana was clustered with several S.  (1). Tracks (2) to (5) represent density of gene, differential expressed gene between young root and mature roots of shh, long terminal repeats, and transposable elements, respectively. Tracks (6) and (7) [34]. This indicates that the two plants are very similar at the molecular level. The roots of S. bowleyana have been reported to be used as a substitute for S. miltiorrhiza [34]. The divergence of shh from the common ancestor of S. miltiorrhiza 99-3 and S. bowleyana occurred about 13.8 MYA (Fig. S4, see online supplementary material). S. miltiorrhiza line shh with orange roots was clustered with S. miltiorrhiza lines with red roots and S. bowleyana. It indicates that shh could not be the mutant of an extant S. miltiorrhiza line with red roots. The common ancestor of shh, S. miltiorrhiza with red roots and S. bowleyana may be a red-rooted species. We next asked whether there was a lineage-specific wholegenome duplication (WGD) event occurring in S. miltiorrhiza plants with orange roots and this event resulted in the changes of tanshinone production. To address this question, distribution of synonymous substitutions per synonymous site (Ks) analysis for paralogous genes and syntenic blocks was carried out. The results showed that S. splendens experienced twice lineage-specific WGD events after the divergence from S. miltiorrhiza, whereas there was no recent WGD event occurring in shh (Fig. 2b). Lack of recent WGD events was also found in S. miltiorrhiza lines with red roots (Fig. 2b). It indicated that no recent WGD could be a common phenomenon for all S. miltiorrhiza plants and the changes of tanshinone production in shh were not caused by WGD events.

Comparative transcriptomic analysis of known genes involved in tanshinone biosynthesis in orange roots (shh) and red roots (99-3) of S. miltiorrhiza lines
Tanshinones are a gathering of lipophilic diterpenoid natural products biosynthesized through the terpenoid biosynthetic pathway. So far, a total of 19 gene families have been found to encode enzymes associated with tanshinones production in the S. miltiorrhiza plants with red roots [1]. To get the first-hand information for these genes in shh, we performed genome-wide identification of shh homologs for those known enzyme-encoding genes [8,9,[11][12][13]15]. Homologs for all of the known enzyme-encoding genes implicated in tanshinone production could be found in the genome assembly of shh (Table S10, see online supplementary material). Comparative transcriptomic analysis showed that there was no significant difference for their expression in the periderm of mature roots of shh and 99-3 (Fig. 3). The results suggested that the variations of tanshinone production in shh were not caused by loss-of-function or significant expression level changes of those known enzyme-encoding genes.

Comparative genomic and transcriptomic analyses showed that Sm2OGD3 gene was mutated by fragment deletion in the orange root line (shh)
To further investigate the reasons for the variations of tanshinone contents in shh, comparative transcriptomic analysis was performed using our RNA-seq data from mature periderm of 99-3 and shh roots [24]. A total of 3416 differentially expressed genes (DEGs) with a significance level of padj <0.05 were identified between 99-3 and shh root periderm (Fig. S5, see online supplementary material). Among them, 2162 DEGs were down-regulated in shh. It includes 12 Sm2OGD gene models (Fig. 4a). 2OGDs are a class of oxygenases encoding by a gene superfamily. Members of the S. miltiorrhiza 2OGD gene family have been recently implied in the catalysis of tanshinone biosynthesis [16][17][18]24]. In order to check whether the Sm2OGDs were involved in the production of tanshinones with a double bond at C-15,16, we screened candidate Sm2OGD genes based on the following criteria, including (i) significantly down-regulated in shh periderm in comparison with 99-3, (i) highly expressed in 99-3 periderm, and (iii) specifically highly expressed in 99-3 roots among different organs (f lower, stem, leaf, and root). As shown in the clustering heatmap, among the 12 Sm2OGD gene models, SMil_00016113, SMil_00018668, and SMil_00020342 co-expressed with SmCYP76AH1, SmCYP76AH3, SmCYP76AK1, SmCYP71D411, SmCYP71D373, and SmCYP71D375 that were involved in tanshinone biosynthesis ( Fig. 4a; Fig. S6, see online supplementary material). Based on the previously published article [16], SMil_00020342 was expressed much higher in stems than roots of 99-3 (Fig. S7, see online supplementary material), indicating it was not related to tanshinone biosynthesis. Thus, SMil_00016113 and SMil_00018668 were selected for further analysis.
SMil_00018668 was named Sm2OGD3 in a previously published article [16]. However, BLASTp analysis against the NCBI NR database showed that the proteins encoded by SMil_00016113 and SMil_00018668 were partial. SMil_00016113, which includes a DIOX_N domain, was mapped to the 5 region of a 2OGD. SMil_00018668 that includes a 2OG-FeII_Oxy domain was mapped to the 3 region (Fig. 4b). To address this problem, we performed BLASTn analysis of SMil_00016113 and SMil_00018668 against the genome assemblies of three S. miltiorrhiza lines with red roots, including 99-3 [28], bh2-7 [15], and DSS3 [29]. SMil_00016113 and SMil_00018668 were mapped to a Sm2OGD gene with three exons in the genome assemblies of bh2-7 [15] and DSS3 [29]. SMil_00016113 corresponds to the first exon, whereas SMil_00018668 corresponds to the second and third exons (Fig. 4b). When mapping SMil_00016113 and SMil_00018668 to the genome assembly of 99-3 [28], we found that SMil_00016113 and SMil_00018668 were located at scaffold5164 and scaffold2881, respectively. Because the quality of 99-3 genome assembly was relatively low, it is possible that the sequences of SMil_00016113 and SMil_00018668 were incorrectly assembled. To exam whether SMil_00016113 and SMil_00018668 also came from a Sm2OGD gene in 99-3, primers were designed (Table S11, see online supplementary material) and PCR amplification were carried out using gDNA and cDNA from 99-3 roots as the templates. The results showed that SMil_00016113 and SMil_00018668 indeed came from a Sm2OGD gene in 99-3 (Fig. 4b). Based on these results, we corrected the full-length genomic sequence and CDS of Sm2OGD3 in 99-3 (Table S12,  BLASTn analysis of SMil_00016113 and SMil_00018668 against the genome assembly of shh identified a 1.2 kb genomic sequence. Its 5 region corresponds to SMil_00016113, its 3 region corresponds to the 3 region of SMil_00018668, whereas the genomic region corresponding to the 5 region of SMil_00018668 is missing (Fig. 4b). Compared to the genomic sequence of Sm2OGD3 in 99-3, the identified 1.2 kb genomic sequence of shh lacked a DNA fragment of approximately 1.0 kb in length, which corresponds to the second exon and partial of the first and the second intron sequences (Fig. 4b). We named this mutated gene in shh as Sm2OGD3m. Again, we performed PCR amplification using gDNA and cDNA templates from shh roots and the same primers. According to the full-length genomic sequence and CDS of Sm2OGD3m in shh (Table S12, see online supplementary material), we found that Sm2OGD3m underwent an exon skip type of variable splicing due to a fragment deletion that disrupted the original splice site, with only the first and third exons remaining intact in the CDS. However, the misalignment of the reading frame resulted in the premature termination of translation. The first stop codon occurs at 541-543 nt and the truncated protein is only 180aa. It results in deprivation of the important conserved 2-His-1-carboxylate facial triad and the conserved RXS motif [20].

Sm2OGD3 recombinant protein catalyzed 15,16-dehydrogenation of tanshinones
To test whether Sm2OGD3 protein can catalyze 15,16dehydrogenation of tanshinones, full-length Sm2OGD3 cDNA was amplified from the roots of 99-3. It was then cloned into an Escherichia coli expression vector pET-30a (Fig. S8a, see online supplementary material). Recombinant protein of Sm2OGD3 was expressed in E. coli strain BL21(DE3) (Fig. S8b, see online supplementary material). Enzyme activity was analyzed in vitro. E. coli harboring the empty pET-30a vector was used as a control. The crude Sm2OGD3 enzymes were incubated with CT, THT, or 15,16-DHT that has a single bond at C-15,16. The ethyl acetate extracts obtained from these incubations were analysed using LC-MS. MS-MS data of samples were compared to standards. The results showed that CT, THT, and 15,16-DHT were converted into TAII, 1,2-DHT, and TAI, respectively (Fig. 5). No such conversion was found for the crude extract proteins of E. coli cells carrying the empty pET-30a vector (Fig. 5). It suggests that the Sm2OGD3 recombinant protein can convert the single bond at C-15,16 of tanshinones to double bond. Furthermore, we examined the effect of different components of the reaction system on the enzyme activity using DHT as the substrate (Fig. S9, see online supplementary material). When Lascorbate was absent, elevated levels of 2-oxoglutarate and Fe 2+ did not significantly contribute to the reaction rate (reaction 1, 2, 6, and 7). The reaction rate for the removal of Fe 2+ alone was similar to that for the removal of L-ascorbate alone (reaction 3). When both Fe 2+ and L-ascorbate were present in the system, the reaction rate was significantly higher (reaction 4 and 5). To optimise the reaction conditions, protein activity was evaluated at different temperatures (20 • C, 30 • C, and 37 • C) and pH values (5.8, 6.5, 7.3, and 8.0) using CT as the substrate. The results showed that the optimal conditions were 30 • C and pH 8.0 (Fig. S10, see online supplementary material). The enzymatic rate toward each substrate was measured at different substrate concentrations using the optimal conditions and reaction system five. The Km values of Sm2OGD3 toward CT, 15,16-DHT, and THT were 32.15, 55.42, and 20.92 μM, respectively and the estimated V max values were 0.3318, 1.228, and 0.7303 μM/min, respectively (Fig. S11, see online supplementary material). The results showed that 2OGD3 had a higher affinity for THT, followed by CT. In the roots of S. miltiorrhiza, the content of CT was higher than that of THT and the content of TAII was higher than that of 1,2-DHT (Fig. 1c). The content of the products was similar to that of the corresponding substrates, respectively (Fig. 1c). This implies that the conversion efficiency of Sm2OGD3 seems to be more dependent on the substrate concentration.

Intact Sm2OGD3 recovered the production of tanshinones with a double bond at C-15,16 in line shh transgenic hairy roots
Agrobacterium rhizogenes strain ACCC10060 containing recombinant vector with intact Sm2OGD3 ORF was used to transform S. miltiorrhiza line shh for the generation of transgenic hairy roots. ACCC10060 strain without Sm2OGD3 ORF was used as a control.

Orthologous genes of Sm2OGD3 in salvia species
When we were conducting this study, a Sm2-ODD14, which was designated as S. miltiorrhiza tanshinone IIA synthase (SmTIIAS), was identified to convert CT and isocryptotanshinone to TAII and isotanshione IIA, respectively (Fig. 7) [18]. The ORF of SmTIIAS showed 98.6% identity with Sm2OGD3, and the amino acid sequence of SmTIIAS showed 97.3% identity with Sm2OGD3. Both Sm2OGD3 and SmTIIAS could convert CT to tanshinone TAII. However, other substrates that they could accept were different. SmTIIAS showed strict substrate specificity to CT and isocryptotanshinone. Differently, Sm2OGD3 showed a broader substrates spectrum. It could accept CT, 15,16-DHT, and THT as substrates. First, to clarify the identity of SmTIIAS and Sm2OGD3, we respectively identified these two genes in two S. miltiorrhiza lines (993 and DSS3) and several salvia plants (S. bowleyana [34], S. splendens [35] and Salvia officinalis [36]) with published genomes. Genomic location information of Sm2OGD3 and SmTIIAS is identical in all five salvia plants (Table S13,   To further confirm the catalytic function of these alleles and homologs, we synthesized SmTIIAS and these newly identified genes (Table S12 and Fig. S12, see online supplementary material). Enzymatic reactions show that, as with Sm2OGD3, SmTIIAS, Sm2OGD3_DSS3, and Sb2OGD3 from S. bowleyana could all convert CT, 15,16-DHT, and THT to TAII, TAI, and 1,2-DHT, respectively (Fig. S13, see online supplementary material). Ss2OGD3_like from S. splendens and So2OGD3_like from S. officinalis failed to catalyze 15,16-dehydrogenation (Fig. S13, see online supplementary material). This result is consistent with the metabolite phenotype of the plants. The roots of S. bowleyana could be used as a substitute for S. miltiorrhiza due to its high contents of tanshinones [34]. The content of CT, TAII, and TAI are extremely low or undetected in S. officinalis and S. splendens [35,36].

Discussion
S. miltiorrhiza has attracted widespread research interests due to its high economic and therapeutic value and has developed as a model system for medicinal plant biology [1]. Numerous transcriptome and sRNAome data has been generated from various S. miltiorrhiza tissues with or without treatments [1]. The whole genomes of four S. miltiorrhiza lines with brick-red roots have been sequenced and assembled [15,[27][28][29]. In this study, a novel S. miltiorrhiza line, named shh, was sequenced and subsequently assembled through a combination of different technologies, such as Illumina paired-end, 10X Genomics, and PacBio SMART sequencing. It resulted in the acquirement of a chromosome level assembly of shh with high degree of continuity and completeness. The assembled shh genome has a total length of 530.97 Mb, 93.51% of which was anchored into eight pseudochromosomes with a super-scaffold N50 of 60.28 Mb and a maximum scaffold length of 77.73 Mb. It gives a strong groundwork for functional genomic analysis of bioactive compound biosynthesis in S. miltiorrhiza. Due to poor quality of the 99-3 genome assembly, we selected the S. miltiorrhiza line DSS3 genome assembled to the chromosome level for blastp-based collinearity analysis with the shh genome using MCScanX. The reasults from MCScanX showed that the encoding genes in shh assembly and DSS3 assembly had a high degree of genome-wide collinearity (Fig. S14, see online supplementary material). Sm2OGD3_DSS3 (ID: GWHGAOSJ020467) is located in DSS3_7 pseudochromosomes and it does not have any alignments. Sm2OGD3m is located in SHH_7 pseudochromosomes (position: 311089-309 911). However, there did not appear to be some collinear blocks in the shh pseudochromosomes for some segments on the DSS3_3 and DSS3_5 pseudochromosomes (Fig. S14, see online supplementary material). To follow up with a more comprehensive analysis of genomic structural variation, we need to perform DNA sequencebased genome global alignment and obtain second and third generation sequencing reads of DSS3 and shh for identification by means of more systematic bioinformatic analysis. Different from normal S. miltiorrhiza lines with brick-red roots, shh is characteristic with orange roots. Metabolomic and UPLC analyses showed increased contents of tanshinones with a single bond at C-15,16 and significantly decreased contents of those with a double bond at C-15,16. The variations of tanshinone compounds in shh were similar to that of the other S. miltiorrhiza line with orange roots [25]. Previously, systematic analysis of transcriptome profiles showed that four genes related to endoplasmic reticulum (ER)-associated degradation of proteins were up-regulated in orange roots [25] and no genes were downregulated in the CYP, NADP/NAD-dependent dehydrogenases, and reductases [25]. Zhan et al. [25] proposed that, in orange S. miltiorrhiza roots, the decrease of tanshinones with a furan D-ring could be caused by incorrect folding and endoplasmic reticulum (ER)-associated degradation of enzymes catalyzing 15,16dehydrogenation. In this study, the enzyme catalyzing 15,16dehydrogenation has been shown to be Sm2OGD3. Moreover, Sm2OGD3 in shh lost the second exon and forms a truncated protein Sm2OGD3m. It seems to indicate that the accumulation of tanshinone compounds with a single bond at C-15,16 and the reduction of those with a double bond at C-15,16 could be caused by loss of Sm2OGD3 function in shh. According to Zhan et al. [25], we identify these genes except c91931_g1 in S. miltiorrhiza 99-3 (Table S14, see online supplementary material). Among them, SMil_00017121 was significantly upregulated in shh with log2FoldChange 2.6 and padj 0.0015. SMil_00006173 and SMil_00005538 were not the differentially expressed genes between 99-3 and shh root periderm. Gene c80749_g2 was annotated as ER luminal binding protein (BIP) [25]. Similarly, SMil_00017121 was annotated as luminal-binding protein belonging to heat shock proteins (HSP70) against the NR and SMART databases. These results seem to imply that misfolding of the truncated protein in shh triggers endoplasmic reticulum-associated protein degradation. In our study, Sm2OGD3 was mutated in the orange root of shh, so are the same mutations present in other orange roots of S. miltiorrhiza lines? Our findings are an important contribution to the analysis of the mechanism of S. miltiorrhiza orange roots formation.
The plant 2OGD superfamily is usually divided into three classes, including DOXA, DOXB, and DOXC [23]. Members of the DOXA and DOXB classes are involved in primary metabolic networks, such as DNA repair, histone demethylation, and posttranslational modification [23]. Members of the DOXC class not only participate in the conserved pathways of phytohormone metabolism but also contribute to the production of highly specialized metabolites, such as phenolic acids and alkaloids [21]. S. miltiorrhiza line 99-3 has 132 Sm2OGDs, of which three belong to DOXA, eight belong to DOXB, 118 belong to DOXC, and the other three are unclassified [16]. The 118 DOXC members can be further divided into 13 clades, including ACO, AOP, CODM/NCS, D4H/GSLOH/BX6, DAO, F3H, FLS/ANS, GA2ox, GA3ox, GA20ox, H6H, S3H, and unknown [16]. Sm2OGD3 was assigned to the D4H/GSLOH/BX6 clade. The 12 significantly down-regulated Sm2OGD gene models in shh belong to six clades of the DOXC classes. Through genomic and transcriptomic comparison, in vitro protein assay and in vivo complementation assay, we found that Sm2OGD3 was responsible for 15,16-dehydrogenation of tanshinones. In addition, several homologous proteins of Sm2OGD3 from other S. miltiorrhiza lines and S. bowleyana also could convert CT, 15,16-DHT, and THT into TAII, TAI, and 1,2-DHT, respectively. The results confirmed the catalytic roles of Sm2OGD3 in 15,16-dehydrogenation, a key reaction in the tanshinone biosynthetic pathway (Fig. 7). Identification of the enzyme catalyzing this reaction is significant for elucidation of tanshinone biosynthesis and regulatory mechanisms. The results can also contribute to synthetic biology of tanshinones, a class of medicinally and economically important bioactive compounds.

Library construction and sequencing
Genomic DNA (gDNA) of shh leaves was isolated by the cetyl trimethyl ammonium bromide (CTAB) method as per the manufacturer's protocol [37]. For Illumina sequencing, a shortinsert library was constructed by randomly broking gDNA into fragments using Covaris ultrasonic fragmentation, followed by end repair, addition of poly-A, ligation of adapters, template purification, and PCR amplification. The constructed library was sequenced using Illumina paired-end sequencing technology (San Diego, CA, USA). For PacBio SMRT sequencing, a SMRTbell library was first built by ligating hairpin adapters to the gDNA, and then primers and polymerase were added. DNA sequencing was running on the PacBio Sequel instrument (Menlo Park, CA, USA). For 10× Genomics libraries, Illumina P5 adapters, barcodes, Illumina sequencing primers, and random primers linked to Gel Beads and then Gel Beads combined with DNA fragments and wrapped in oil. After PCR amplification, Illumina P7 adapters were added to the libraries for illumina sequencing.
Phasing of the resulting contigs was carried out using FALCON-Unzip. The contigs were polished with PacBio reads using the Quiver software [38]. Illumina reads were analysed using pilon [39] to improve the quality of assembly. Because of the high heterozygosity of S. miltiorrhiza genome, the genome assembly was dehybridized using the purge_haplotigs [40]. The filtered assembly was scaffolded on the basis of 10× Genomics Linked-reads using fragScaff software [41]. To assess the integrity of the assembly and the homogeneity of the sequencing, Illumina reads were mapped onto the assembly via BWA software [31] and calculated for mapping rate, genome coverage, and sequencing depth. CEGMA [33] and BUSCO [32] were applied to assess the completeness of the assembly.

Hi-C library construction and assignment of the chromosome
To construct Hi-C library, the gDNA was fixed with paraformaldehyde, digested and biotin-labeled. Cells were treated with paraformaldehyde to immobilise the DNA. The cross-linked DNA was then cut using restriction endonucleases. For end repair, biotin was added to label the oligonucleotide ends. Ligation of adjacent chromatin DNA fragments with T4 DNA ligase. Proteinase was added to digest the ligase and uncross-link the proteins and DNA. Genomic DNA was extracted and randomly broken into fragments of 350 bp. DNA marked with biotin was captured by affinity beads, followed by end-repair, A-tailing, adapter ligation, PCR amplification, and template purification. The libraries were sequenced on Illumina HiSeq PE150.
The filtered reads were mapped to the draft assembly. Duplicated mapping and unmapping reads were cut out using SAMtools [42] with the parameter rmdup. The reads attached to the ligation sites were picked out for ancillary assembly. The contigs were assigned, ordered, and oriented using LACHESIS (version-201 701) (http://shendurelab.github.io/LACHESIS/).

Genome annotation
Repeat elements were annotated based on homology alignment and de novo search [43]. Homolog prediction was carried out through alignment of the Repbase database with the genome assembly using RepeatMasker (http://www.repeatmasker.org/). Tandem Repeat was extracted from the genome assembly using TRF (http://tandem.bu.edu/trf/trf.html). De novo repetitive element database was constructed through ab initio prediction using LTR_FINDER [44], RepeatScou, and RepeatModeler. All of the repeat sequences with lengths greater than 100 bp and gap 'N' less than 5% were used to construct the raw transposable element (TE) library. Repbase and TE library were handled by UCLUST [45] to yield a non-redundant library which was given to RepeatMasker for repeat identification.
The functions of predicted protein sequences were analysed against the Swiss-Prot, NR, and KEGG databases using Blastp with a threshold of E-value ≤1e-5. Motifs and domains were assigned by InterProScan70 (v5.31) [56] through searching the predicted proteins against Pfam and InterPro databases. Gene Ontology (GO) IDs of each gene were accredited based on the related InterPro entry.

Comparative genomic analysis
The genome and annotation data for S. miltiorrhiza_99-3, S. mil-tiorrhiza_unknown, S. splendens, E. guttata, S. indicum, U. gibba, Olea europaea, S. lycopersicum, S. tuberosum, P. ginseng, D. carota, A. thaliana, and P. trichocarpa were downloaded. The relationship of orthologous genes in shh and the other 13 plant species was inferred through running all-against-all blastp program. Gene family clustering was performed using OrthoMCL [57] with 'mode 3 -inf lation 1.5'. The software CAFE was applied to measure the expansion and contraction of orthologous gene families [58]. For each single-copy gene family, an alignment was produced using Muscle [59]. Ambiguously aligned posiions were trimmed by Gblocks [60]. Phylogenomic tree was constructed with RAxML (v.8.2.12). Divergence times were estimated based on the mcmctree program implemented in PAML [61] with default parameters. Syntenic blocks identified within S. miltiorrhiza_shh by MCScanX [62] were used for WGD analysis. The codeml program in the PAML package was used to estimate Ks for all pairwise paralogous genes.

Metabolite analysis
Fresh powder (0.1 g) was weighed into 1 ml of methanol, followed by sonicate for 20 min, and centrifuged at 8000 rpm for 5 min. The extraction was repeated once. The two supernatants from extracts were combined and filtered using a 0.22 μm microporous membrane. Metabolites were quantified using ultra-highperformance liquid chromatography (UPLC, Waters, MA, USA) equipped with a C18 column (1.7 μm, 2.1 × 100 mm; Waters). The detection wavelength, column temperature and f low rate were maintained at 270 nm, 25 • C, and 0.3 ml min −1 , respectively. The mobile phase was comprised of water with 0.1% (v/v) formic acid (A) and 100% acetonitrile (B). The gradient elution was performed as follows: 0 min-5 min, 40% B; 5 min-25 min, 40%-60% B.
For metabolomic analysis, Dionex Ultimate 3000 UPLC coupled with a Thermo Syncronis C18 column (2.1 mm × 100 mm, 1.7 μm) was used. A quality control sample mixing all samples is used to monitor the stability of the system. Q Exactive Orbitrap mass spectrometry equipped with electrospray ionization (ESI) source was performed to analyse metabolite profile. The settings were as follows: simultaneous positive and negative ion scanning mode; ion spray voltage (IS), 2.8 kV; Sheath gas f low rate, 35 arb; f low rate of the auxiliary gas, 10 arb; and capillary temperature, 320 • C. A full scan with a resolution of 70 000 and a scanning range of 70-1050 (m/z) was adopted. The resolution of secondary data dependence scan (full MS/dd-MS) was 17 500. The Stepped NCE values are 20 V, 40 V, and 60 V.

Analysis of genes involved in tanshinone biosynthesis
To identify the genes participating in the MVA and MEP pathways, the related protein sequences from A. thaliana and S. miltiorrhiza were used as queries to search against the shh genome assembly using BLASTP-algorithm with an E-value ≤1e-5. The gene sequences encoding S. miltiorrhiza CPS, KSL, CYP76AH1, CYP76AH3, CYP76AK1, CYP71D411, CYP71D375, and CYP71D373 were downloaded from NCBI. The 2OGD gene family sequences were acquired from the genome annotation file of S. miltiorrhiza_99-3. These gene sequences were used as baits to identify the corresponding genes in shh genome with the filtering parameters, E-value ≤1e-50, identity ≥80%, and coverage ≥80%. The candidates were submitted to pfam and SMART databases for domain and motif analysis. Proteins with same conserved domains were considered to be homologs. Transcriptome data published previously [22] were analysed for the expression level of tanshinone biosynthesis-related genes using align-free software Salmon [63]. The heat map was generated using TBtools [64].

In vitro analysis of Sm2OGD3 protein activity
The open reading frame (ORF) of Sm2OGD3 from mature roots of 99-3 were PCR-amplified and cloned into the PET-30a vector. E. coli strain BL21(DE3) harboring the recombinant plasmid was grown in LB medium at 37 • C until OD 600 reached to 0.6. Subsequently, the culture temperature was cooled to 20 • C. Induction of recombinant proteins was carried out using 0.5 mM isopropyl β-Dthiogalactopyranoside (IPTG) at 20 • C for 20 h. Cells from 100 ml of culture were harvested by centrifugation and resuspended at 4 • C in 15 ml PBS buffer (pH 7.2-7.4) containing Na 2 HPO 4 , NaH 2 PO 4 , and NaCl [65]. Cells were lysed by sonication for 30 min (lysed for 3 s, paused for 10 s) with 187.5 watt. The crude extracts were obtained by centrifugation and then incubated with 160 μM 2oxoglutarate, 50 μM ferrous sulfate, and 10 to 100 μM substrate in a final volume of 0.5 ml PBS buffer for 0.5-1 h at 30 • C [65].
As a negative control, proteins were similarly prepared from cells harboring an empty PET-30a plasmid. Reaction products were extracted twice in 0.5 ml ethyl acetate and then evaporated using a nitrogen blower. The residue was dissolved in 0.2 ml of methanol, filtered and analysed using LC-MS (UPLC-electrospray ionization (ESI)-MS, Waters). MS-MS data were analysed using MssLynx V4.1 software (Waters).

Transgenic analysis of Sm2OGD3
Sm2OGD3 ORF were PCR-amplified and cloned into pTOPO-Blunt vector. After verification using sanger sequencing, the ORF was digested with BstE II and Nco I and then inserted into pCAMBIA1391 vector. A. rhizogenes strain ACCC10060 was utilized for plant transformation. Transgenic hairy roots were generated according the Wei's procedure [66]. The forward primer 35S-F and the reverse primer Sm2OGD3-R were used for PCR identification of Sm2OGD3 transgenic lines. PCR-positive hairy roots were transferred to 6,7-V liquid medium and sub-cultured every 3 weeks. After three months, hairy roots were harvested and frozen in liquid nitrogen. The frozen samples were ground into 0.2 g powders and extracted with 1 ml methanol. The extracts were evaporated and re-dissolved in 0.2 ml methanol. The filtrates were analyzed using UPLC.