Integrative analysis of the shikonin metabolic network identifies new gene connections and reveals evolutionary insight into shikonin biosynthesis

Summary Plant specialized 1,4-naphthoquinones present a remarkable case of convergent evolution. Species across multiple discrete orders of vascular plants produce diverse 1,4-naphthoquinones via one of several pathways using different metabolic precursors. Evolution of these pathways was preceded by events of metabolic innovation and many appear to share connections with biosynthesis of photosynthetic or respiratory quinones. Here, we sought to shed light on the metabolic connections linking shikonin biosynthesis with its precursor pathways and on the origins of shikonin metabolic genes. Downregulation of Lithospermum erythrorhizon geranyl diphosphate synthase (LeGPPS), recently shown to have been recruited from a cytoplasmic farnesyl diphosphate synthase (FPPS), resulted in reduced shikonin production and a decrease in expression of mevalonic acid and phenylpropanoid pathway genes. Next, we used LeGPPS and other known shikonin pathway genes to build a coexpression network model for identifying new gene connections to shikonin metabolism. Integrative in silico analyses of network genes revealed candidates for biochemical steps in the shikonin pathway arising from Boraginales-specific gene family expansion. Multiple genes in the shikonin coexpression network were also discovered to have originated from duplication of ubiquinone pathway genes. Taken together, our study provides evidence for transcriptional crosstalk between shikonin biosynthesis and its precursor pathways, identifies several shikonin pathway gene candidates and their evolutionary histories, and establishes additional evolutionary links between shikonin and ubiquinone metabolism. Moreover, we demonstrate that global coexpression analysis using limited transcriptomic data obtained from targeted experiments is effective for identifying gene connections within a defined metabolic network.


Introduction
The shikonins are a group of red-pigmented naphthoquinones produced in the root periderm of many members of Boraginaceae [1,2]. They include shikonin ( Fig. 1), its enantiomer alkannin, and several shikonin/alkannin derivatives that are excreted into the rhizosphere, where they function in defense, mediate plant-microbe interactions, and/or elicit allelopathic effects on other plants. For example, the invasion success of Paterson's curse (Echium plantagineum) in southeast Australia is attributed, at least in part, to the synthesis and release of shikonins [3]. Shikonins are also the bioactive compounds responsible for the various pharmacological properties of medicinal plants like red gromwell (Lithospermum erythrorhizon) [4] and have emerged as scaffolds for semi-synthesis of novel cancer therapeutics [5].
The structure of shikonin is comprised of a redoxactive naphthazarin (5,6-dihydroxy-1,4-naphthoquinone) ring fused with a 1-hydroxy-4-methyl-3-pentenyl side chain (Fig. 1). The hydroxybenzene ring, ring A, of shikonin's naphthazarin moiety is derived from Lphenylalanine via cinnamic acid and 4-hydroxybenzoate (4-HBA) [6,7]. This is the same route predominantly responsible for forming the benzoquinone ring of ubiquinone (coenzyme Q) in plants [8,9]. Many of the genes responsible for synthesis of the 4-HBA precursor of shikonin have already been cloned and investigated (e.g. [10][11][12]). In contrast, the genetic basis and regulation underlying the unique formation of the prenyl diphosphate precursor providing shikonin's quinone ring, ring B, and its six-carbon atom isoprenoid side chain is not as well characterized. The shikonin pathway begins with the conjugation of 4-hydroxybenzoate (4-HBA) and geranyl diphosphate (GPP) catalyzed by p-hydroxybenzoate:geranyltransferase (PGT) [13] to produce 3-geranyl-4-HBA [14] (Fig. 1). GPP and other prenyl diphosphates are synthesized from the condensation of the five-carbon building blocks isopentenyl diphosphate (IPP) and dimethylallyl diphosphate (DMAPP). In plants, GPP is typically produced by plastidial GPP synthases (GPPSs) that catalyze the condensation of one IPP and one DMAPP derived from the methylerythritol phosphate (MEP) pathway localized in plastids [15]. Plants also produce IPP and DMAPP via the mevalonic acid (MVA) pathway, a route separated from the MEP pathway that is compartmentalized across the cytoplasm, endoplasmic reticulum, and peroxisomes [15][16][17]. The MVA pathway is generally considered to generate isoprenoid precursors for farnesyl diphosphate (FPP) synthases (FPPSs), which catalyze the condensation of one DMAPP with two IPP molecules to produce FPP and two molecules of pyrophosphate in the cytoplasm. Experimental evidence by Gaisser and Heide [18] and others (reviewed in Widhalm and Rhodes [4]) long suggested that shikonin biosynthesis unconventionally relies on GPP produced by a cytoplasmic GPPS. The recent discovery and biochemical characterization of L. erythrorhizon GPPS (LeGPPS, Fig. 1) [9] revealed that it is a neofunctionalized cytoplasmic farnesyl diphosphate synthase (FPPS) and that mutation(s) adjacent to the first aspartate-rich motif resulted in acquisition of GPPS activity [19].
Previous analysis by our group of the L. erythrorhizon genome uncovered an evolutionary link between PGTs and the ubiquinone prenyltransferase gene, demonstrating that retrotransposition-derived gene duplication and subsequent neofunctionalization contributed to the evolution of PGT genes [20]. Coupled with the evolution of LeGPPS from a cytoplasmic FPPS [19] and whole genome duplication (WGD) in the Boraginaceae [20,21], the evolutionary history of the shikonin pathway appears to be marked by several events of metabolic innovation. Taken together, this raises the prospect of additional evolutionary links between the shikonin and ubiquinone pathways and opens new questions about the metabolic intersection of the isoprenoid, phenylpropanoid, ubiquinone, and shikonin pathways in the Boraginaceae.
In this study, we investigated the metabolic connections linking shikonin biosynthesis with its precursor pathways by downregulating expression of LeGPPS and testing the capacity of the MEP and MVA pathways to supply GPP for shikonin production. We also explored whether network analysis of transcript abundances could identify genes coexpressed with LeGPPS and other established shikonin pathway genes. Integrative computational analyses of candidate genes identified by the model suggest likely metabolic roles for these genes and give insight into the evolution of metabolic innovation in the shikonin pathway. Our study provides evidence of crosstalk between the MVA, MEP, and phenylpropanoid pathways and reveals additional evolutionary links between shikonin and ubiquinone biosynthesis. Given the other links between specialized and primary quinone metabolism [22,23], the mechanistic insights uncovered here are expected to broadly guide investigation into the convergent evolution of specialized 1,4-naphthoquinone metabolism in plants.

Cytoplasmic LeGPPS supplies GPP to the shikonin pathway using MVA pathway-derived IPP/DMAPP
To investigate the in vivo role of LeGPPS, which sits at the interface between the MVA, phenylpropanoid, and shikonin pathways, we knocked down expression of its encoding gene in L. erythrorhizon hairy roots. Several independent LeGPPS-RNAi (LeGPPSi) lines were generated, excised, and transferred to B5 selection media plates and subsequently screened for levels of total shikonins excreted into the growth media 4 d after transfer to M9 and darkness. Analysis of 10 independent LeGPPS-RNAi lines revealed that total shikonins were reduced by more than 95% compared to the lowest producing emptyvector control line (Fig. 2a). Further analysis of two independent lines, LeGPPSi-45 and LeGPPSi-75, revealed that LeGPPS expression was reduced by more than 95% compared to empty-vector control EV-26 without any affect on expression of the canonical plastidial GPPS gene LepGPPS (Fig. 2b). Re-analysis of total shikonins excreted from LeGPPSi lines 45 and 75 confirmed nearly 95% reduction compared to EV-26 (Fig. 2c), thus indicating that cytoplasmic LeGPPS is predominantly responsible for supplying GPP precursor to the shikonin pathway.
Concievably, MEP pathway-derived GPP could contribute to the shikonin pathway if it or MEP pathwayderived IPP/DMAPP were exported from the plastid to the cytoplasm and used as substrate by LeGPPS. To test for MEP pathway involvement in shikonin production we carried out two inhibitor experiments on the EV-26 and LeGPPSi-45 lines (Fig. 1). We predicted that if the MVA pathway is predominantly responsible for supplying IPP/DMAPP to LeGPPS, then treatment with the MVA pathway inhibitor mevinolin should decrease shikonin accumulation in EV-26 lines but not in the LeGPPSi-45 RNAi line. Indeed, total shikonins produced by mevinolin-treated EV-26 lines were reduced by 76% compared to those in the EV-26 control lines (Fig. 3a), while shikonins in mevinolin-treated LeGPPSi-45 lines were unchanged compared to the LeGPPSi-45 control lines (Fig. 3b). If the MEP pathway does not supply IPP/DMAPP precursor to the shikonin pathway, we expected no change in shikonin accumulation in EV-26 lines treated with the MEP pathway inhibitor fosmidomycin compared to controls. If, however, the MEP pathway is contributing to the remaining shikonin produced by LeGPPSi-45 lines, treatment with fosmidomycin should further reduce shikonin accumulation compared to LeGPPSi-45 controls. Instead, we observed that shikonin production increased by 73% and 108%, respectively, in EV-26 and LeGPPSi-45 lines treated with fosmidomycin compared to their corresponding controls (Fig. 3a,b). This result points to crosstalk between the MEP and MVA pathways such that when f lux through the MEP pathway is impaired, flux through the MVA pathway is increased. Taken together, our genetic and inhibitor studies support the work of Ueoka et al. [19] by showing that LeGPPS is required for shikonin formation and it shows that the MEP pathway does not supply IPP or DMAPP substrates, or direct GPP precursor to the shikonin pathway.

Downregulation of LeGPPS reveals crosstalk between phenylpropanoid and isoprenoid metabolism
The observed increase in shikonin content in LeGPPSi-45 RNAi lines treated with the MEP pathway inhibitor fosmidomycin (Fig. 3b) led us to hypothesize that the smaller pool size of shikonin in LeGPPS-RNAi lines (Fig. 2) may be due, in part, to an upstream effect on the MVA pathway. To investigate if MVA pathway gene expression is changed, we performed RNA-seq Expression levels of LeGPPS and the canonical plastid-localized GPPS gene (LepGPPS) in hairy roots of two independent LeGPPSi lines compared to an empty-vector control line (EV-26) (b). Analysis of total shikonin in the same cultures used to measure expression in panel b (c). All data are means ± SEM (n = 3-4 biological replicates). Different letters indicate significant differences via analysis of variance (ANOVA) followed by post-hoc Tukey test (α = 0.05). In panel b, lowercase and capital letters correspond to statistical comparisons for LeGPPS and LepGPPS expression, respectively. analysis of LeGPPSi-45 lines compared to EV-26 control. This analysis confirmed that LeGPPS but not LepGPPS is significantly downregulated the LeGPPSi-45 line (Fig. S1a). Our analysis showed 6115 differentially expressed genes (DEGs); 2903 genes were significantly overexpressed or the MEP pathway inhibitor fosmidomycin (+ fos). Inhibitor treatments were administered immediately upon transfer of 14-d-old hairy roots to M9 and darkness. Total shikonins were measure at 6 d after transfer of 14-d-old hairy roots to M9 and darkness. All data are means ± SEM (n = 3-4 biological replicates). Statistically significant differences are indicated ( * P < 0.05, Student's t test).
in LeGPPSi-45 lines compared to EV-26 while 3212 were significantly underexpressed (Table S1), including shikonin pathway genes LePGT1, LePGT2, CYP76B100, and CYP82AR (Fig. S1b). Kyoto Encyclopedia of Genes and Genomes (KEGG) term enrichment analysis of genes underexpressed in the LeGPPSi-45 line revealed an enrichment of genes involved in various metabolic pathways connected to shikonin metabolism (BH-adjusted p-value <0.05; Fig. 4a, Fig. S2). The category "monoterpenoid biosynthesis," which encompases metabolic genes downstream of GPP was significantly enriched among underexpressed genes. The KEGG category "terpenoid backbone biosynthesis," which contains the MVA and MEP pathway genes, was not significantly enriched (BH-adjusted p-value = 0.097; Fig. S2). Yet, 11 of the 17 MVA pathway genes involved in IPP biosynthesis were found to be significantly underexpressed in LeGPPSi-45. This included six of the eight genes encoding 3hydroxy-3-methylglutaryl-CoA reductase (HMGR), which is generally considered to catalyze the rate-limiting step of the MVA pathway (Fig. 4a, Table S1) [24]. This suggests that lower expression of upstream MVA pathway genes may have contributed to reduced shikonin production in LeGPPS-RNAi lines (Fig. 2). These data also point to an unknown factor connecting downregulation of LeGPPS with reduced expression of upstream MVA pathway genes.
The KEGG pathway analysis also revealed that genes involved in "phenylpropanoid biosynthesis" and "ubiquinone and other terpenoid-quinone biosynthesis" were enriched in those underexpressed in LeGPPSi-45 (Fig. S2). This is noteworthy because the phenylpropanoid pathway supplies p-coumaroyl-CoA to make the 4-HBA precursor that becomes the hydroxybenzene ring, ring A, of shikonin's naphthazarin moiety (Fig. 1) and of ubiquinone's benzenoid moiety [8]. Further examination of genes underexpressed in LeGPPSi-45 showed that several genes in the core phenylpropanoid pathway are underexpressed, including multiple genes encoding phenylalanine ammonia-lyases (PALs) (Fig. 4b, Table S1). Moreover, one copy of the At4g19010-like peroxisomal p-coumarate-CoA ligase genes (Leryth_018919) was significantly underexpressed (Fig. 4b, Table S1). In Arabidopsis thaliana, it was demonstrated that At4g19010 is responsible for activating the propyl side chain of pcoumarate for β-oxidative shortening to supply 4-HBA precursor for ubiquinone biosynthesis [8]. These results suggest that, like the MVA pathway, an unknown factor links downregulation of LeGPPS to reduced expression of phenylpropanoid and benzenoid pathway genes.
Decreased accumulation of transcripts encoding core phenylpropanoid and β-oxidative benzenoid biosynthetic genes (Fig. 4b) raises the possibility that 4-HBA availability might also limit shikonin production in GPPSi-RNAi lines (Fig. 2). To test this, shikonin levels were determined in EV-26 and LeGPPSi-45 lines supplied with exogenous 4-HBA. The amount of shikonin produced, however, remained unchanged compared to the unfed controls ( Fig. S3) suggesting that 4-HBA availability does not limit shikonin production in LeGPPS knockdown lines. Taken together, the in vivo investigation of LeGPPS demonstrates that in addition to LeGPPS being involved in shikonin biosynthesis, the expression of LeGPPS is highly connected to other genes in the larger shikonin metabolic network including those in the shikonin, MVA, phenylpropanoid, and benzenoid pathways.

Coexpression network analysis recovers known shikonin pathway gene associations and predicts new connections
We hypothesized that LeGPPS and other known shikonin biosynthesis genes, including LePGT1, would appear as hub genes that we could use to identify coexpressed genes with roles in shikonin biosynthesis. To construct a transcriptional network model with a high likelihood of recovering the shikonin biosynthetic pathway as a module, we used publicly available comparative RNAseq experiments from tissues and conditions divergent in their shikonin levels (NCBI Sequence Read Archives PRJNA596998 [20] and PRJNA331015). These included whole L. erythrorhizon root tissue versus above ground tissue; root periderm versus root vascular (inner) tissue; and hairy root cultures grown in M9 media in the dark versus roots grown in B5 media in light conditions. In all three experiments, the former tissue or condition in each comparison was previously shown to contain higher LePGT1 expression and shikonin content [20]. Like LeGPPS, LePGT1 functions at the interface of the phenylpropanoid, MVA, and shikonin pathways (Fig. 1). Therefore, we constructed the model based on the hypothesis that genes involved in the shikonin pathway and upstream metabolism would also be more highly expressed in the same tissues or conditions as LePGT1.
One potential source of noise in coexpression analyses is the inclusion of genes that are either not expressed or are constitutively expressed at a constant level across all conditions. These genes may appear significantly coexpressed with other genes in the dataset artifactually [25]. Although this is less of a concern when working with dozens or hundreds or RNA-seq samples [26], our dataset consisted of only 14 RNA-seq samples across six total tissues or conditions. To control for this source of false positive coexpression, we only included genes likely to be DEGs in at least one of the three comparisons at a false discovery rate (FDR) cutoff of ≤0.1. A total of 8680 transcripts were included using this approach (Table S2). Within this set of transcripts, 23.9% (2077) were overexpressed in whole root; 37.9% (3290) were overexpressed in root periderm; and 21.6% (1876) were overexpressed in hairy roots sampled in the dark (Table S3). The overlap of all three comparisons contained 374 genes that were overexpressed in the shikonin accumulating condition (Fig. 5a). These included several genes already implicated in shikonin biosynthesis: LePGT1, LePGT2, two additional PGT-like genes [20], LeGPPS [19], CYP82AR2 [27], L. erythrorhizon pigment callus-specific gene 2 (LePS-2) [28], and LeMYB1 [29] (Table S4).
The 8680 DEGs were used as input for a global coexpression network analysis (Table S5). Pairwise measurements of gene coexpression were specified as mutual ranks (MRs), which are calculated as the geometric mean of the rank of the Pearson's correlation coefficient (PCC) of gene A to gene B and the PCC rank of gene B to gene A [30]. Ranking the PCCs in this manner has been shown to improve the recovery of known pathways as discrete subgraphs in global coexpression networks [31]. We constructed four MRbased networks (N1-N4), using different coexpression thresholds for assigning edge weights (i.e. connections) between nodes (i.e. genes) in the network. Networks were ordered by size (i.e. total number of edges between nodes), such that N1 represents the smallest network and N4 represents the largest network. Graph-clustering implemented by ClusterONE [32] was used to discover coexpressed subgraphs (hereafter referred to as gene modules) within the global networks (Dataset S1). The benefit of using ClusterONE over other graph-clustering methods, e.g. MCL [33] is its capacity to assign genes to multiple overlapping modules, which is more reflective of complex biological networks. We chose to focus our analysis on four target genes based on evidence of their involvement in shikonin metabolism: LeGPPS [19] ( Fig. 2), LePGT1 [20], LeCYP76B101 [34,35], and LeMDR [36]. Because ClusterONE modules can overlap, each target gene was assigned to multiple modules within the larger networks. For example, LePGT1 was found in 3, 2, 3, and 6 different modules in network N1, N2, N3, and N4, respectively (Dataset S1). To address this redundancy, we collapsed all modules within a network that contained one or more of the four target shikonin metabolic genes into non-intersecting metamodules ( Fig. 5b; Fig. S4) [26]. Collectively, these metamodules are models, which we refer to as shikonin metabolic subnetworks.
The number of genes recovered in the shikonin metabolic subnetworks varied from 102, 152, 359, and 1268 genes in networks N1, N2, N3, and N4, respectively (Tables S6-S9). We focused our subsequent analyses on the N2 network, which contained a large number of candidate genes to investigate while also limiting the number of peripheral genes that appeared only weakly connected to shikonin biosynthesis (Fig. S4). The N2 shikonin metabolic subnetwork was comprised of two metamodules (Fig. 5b). The first N2 metamodule contained 125 genes including LeGPPS, LePGT1, and LeCYP76B101; whereas, the second metamodule contained 27 genes including LeMDR. To be considered coexpressed in our analysis two genes must have at least one shared module within the larger metamodule. For example, LeGPPS and CYP76B101 were coexpressed with one another, being members of three shared modules: N2M94, N2M298, and N2M317 (Dataset S1). Within metamodule 1, 60 genes were coexpressed with LeGPPS and CYP76B101; 6 genes were uniquely coexpressed with LeGPPS; and 59 genes were uniquely coexpressed with LePGT1 ( Fig. 5b; Table S7). Four genes (Leryth_014746, Leryth_025160, Leyrth_004583, Leryth_002195) coexpressed with all three LeGPPS, LePGT1, and LeCYP76B101 ( Fig. 5b; Table S7). Although the genes of metamodule 2, including LeMDR, were not coexpressed with the three other target shikonin genes in network N2, the two larger networks N3 and N4 did show a small amount of overlap (Fig. S4).
The N2 shikonin subnetwork was enriched in 62 Gene Ontology (GO) categories (BH-adjusted p-value <0.05; Table S10) including broad enzymatic categories . Nodes in the map represent genes, and edges connecting two genes represent the weight (transformed MR score) for the association. Genes are colored according to its coexpression status with known shikonin genes (grey). Network maps were drawn using a Fruchterman-Reingold force-directed layout using the edge-weighted spring embedded layout in cytoscape (https://cytoscape.org) such as GO:0016491 oxidoreductase activity (18 genes), and GO:0016740 transferase activity (36 genes). Another enriched category, ATPase-coupled intramembrane lipid transport activity (2 genes; Leryth_023505, Leryth_019206) is of high interest because the previous implication of an ARF/GEF-like system required for shikonin transport [41].
To identify shared 5 cis regulatory regions among the coexpressed genes, we performed a motif enrichment analysis on the genes of the N2 shikonin subnetwork using Motif Indexer [42]. The most overrepresented motif within the upstream region of shikonin subnetwork genes was AmrGTCwA (p-value = 9.67x10 −10 ; FDR = 0.007; Table S11), the reverse compliment of which (TwGACykT) is similar to the canonical W-box element sequence motif (T)TGAC(C/T) recognized by the WRKY family of transcription factors [43]. Of the 152 genes in the N2 shikonin subnetwork, 47.37% of genes (N = 72) contained this motif including all four target genes: LePGT1, LeGPPS, LeMDR, and CYP76B101. Five WRKY transcription factors were identified in the N2 shikonin subnetwork (Fig. 5b) two of which (Leryth_027519 and Leryth_002564) were also significantly overexpressed in all three conditions where shikonin was abundant (Table S7).

Expansion of the LeFPPS gene family in the Boraginales gave rise to LeGPPS
We next performed a phylogenetic analysis to gain insight into the evolutionary events giving rise to genes in the shikonin metabolic network. Previous work demonstrated that LeGPPS encodes an enzyme having GPPS-like activity but is a member of the FPPS gene family [19]. To better understand the evolutionary history of LeGPPS, we reconstructed the phylogeny of the FPPS gene family using homologous sequence groups downloaded from the PLAZA 4.0 database (Table S12) [44]. In addition to L. erythrorhizon, we included in our analysis de novo transcriptome-based proteomes from 18 additional Boraginales species including three other shikonin producing plants (E. plantagineum, Arnebia euchroma, and Lithospermum officinale), one additional Boraginaceae that does not produce shikonin (Mertensia paniculata), and 14 additional Boraginales species that do not produce shikonin (Table S13) [20,45]. The Boraginales contain two distinct subfamilies in the FPPS gene family phylogeny (Fig. S7). Subfamily I contains LeGPPS and was present in 17 of the 19 Boraginales in the analysis, including all four shikonin-producing species (Fig. S7). Subfamily I was absent in Heliotropium karwinsky and Heliotropium sp. Subfamily II, the canonical FPPS group, contains LeFPPS1 (Leryth_005102) and 29 other sequences. Subfamily II was present in all four shikoninproducing species and absent in Heliotropium calcicole and Heliotropium texanum. The absence of subfamily I or II in some Heliotropium species is likely artifactual due to these gene sets being transcriptome derived. Subfamily II also contained two additional genes from L. erythrorhizon, Leryth_007856 (referred to as LeFPPS2 by Ueoka et al. [19]) and Leryth_010152 (hereafter LeFPPS3) (Fig. S7). We searched for shared synteny between genome assembly contigs containing FPPS genes in L. erythrorhizon to investigate whether whole genome duplication (WGD) was involved in the evolution of the subfamily I. A WGD is proposed for the Boraginaceae roughtly 25 MYA [21] and L. erythrorhizon and E. plantagineum have similar distributions of synonymous substitution (Ks) between syntenic paralogs at 0.45 and 0.417, respectively [20,21]. The contigs containing LeFPPS1 and LeFPPS3 (Fig. S5) were syntenic and the syntelogs in these two contigs have a median Ks value of 0.484 (Table S14). The median Ks of this syntenic block is similar to the peaks in Ks distribution described by Auber et al. [20] and Tang et al. [21], consistent with the Boraginaceae WGD giving rise to LeFPPS1 and LeFPPS3. In contrast, the lack of shared synteny between LeGPPS and any of the three genes in the FPPS group suggests that LeGPPS did not arise via WGD. Intron position is conserved between LeGPPS and the three FPPS genes, with only LeFPPS3 showing some divergent intron positioning toward its 3 end (Fig. S6), which is consistent with segmental duplication or DNA transposition giving rise to the LeGPPS homolog rather than retrotransposition.
Previous work by Ueoka et al. [19] demonstrated that the histidine (His) residue adjacent to the first aspartaterich motif in LeGPPS was responsible for its GPPS-like activity. Examination of our FPPS gene family sequence alignment shows that this His residue is present in all sequences of the GPPS group (subfamily I), with the exception of two transcriptome-derived sequences from two non shikonin-producing species Ehretia acuminata and Heliotropium greggii that are both missing this region (Fig. S7). In contrast, all sequences in the FPPS group (subfamily II) contain the canonical leucine (Leu) residue adjacent to the Asp-rich motif, with the exception of three transcriptome-derived sequences that are missing the region (Fig. S7). We identified a His residue in place of Leu in three additional sequences from Fragaria vesca (woodland strawberry), Pyrus bretschneideri (Chinese white pear), and Solanum tuberosum (potato) (Figs. S7,S8). Like the Boraginales, each of these three species maintained a second FPPS gene that retains the canonical Leu adjacent to the Asp-rich motif (Fig. S8). A similar observation was made with FPPS homologs from Fragaria x ananassa (strawberry), Malus domestica (apple), and Prunus persica (peach) [19]. Thus, the recruitment of a cytoplasmic FPPS to function as a GPPS convergently evolved multiple times in plants and has likely contributed to the diversification of plant terpenoid metabolism.

Shikonin pathway gene candidates provide insights into specialized metabolic innovation in the Boraginaceae
We extended our phylogenetic analysis to additional shikonin gene candidates (Table 1). We first considered gene candidates that could be responsible for missing enzymes in the shikonin pathway. It is estimated that 97% of cytochromes P450 in plants are associated with specialized metabolic pathways [46]. Therefore, considering that missing steps in the shikonin pathway require decarboxylation, hydroxylations, or carbon-carbon ring closure, we examined cytochromes P450 in the coexpression network. In addition to LeCYP76B101, which was used as a known target in metamodule construction, three additional cytochromes P450 were recovered in the N2 shikonin subnetwork, all three of which were coexpressed with LeMDR metamodule 2 ( Fig. 5b; Table S7). None of the three additional cytochromes P450 correspond to the LeCYP82AR2 recently described to catalyze deoxyshikonin hydroxylation in vitro [27]. Although LeCYP82AR2 (Leryth_026973 , Table S3) was not recovered as a candidate in the N1 or N2 shikonin subnetworks, it was coexpressed with LePGT1, LeGPPS, and LeCYP76B101 in our N3 network (Fig. S4c) and was also overexpressed in all shikonin-abundant conditions ( Table S3).
One of the three cytochromes P450 identified was Leryth_021809, which encodes a CYP76B6-like enzyme and was significantly overexpressed in all shikoninabundant conditions (Table S3). Other CYP76B genes, including CYP76B74 in A. euchroma [35] and CYP76B100/101 [34] in L. erythrorhizon (Fig. 1), have already been implicated in oxidative reactions in shikonin biosynthesis, but the evolutionary relationships between these genes has been unclear. A phylogeny of CYP76B6-like genes reveals that AeCYP76B74 and LeCYP76B101 are orthologs (Fig. S9). LeCYP76B100 is the mostly closely related paralog to LeCYP76B101 but groups more closely to other sequences in A. euchroma, E. plantagineum, and Mertensia paniculata (Fig. S9). This indicates that the gene duplication event that gave rise to LeCYP76B100/101 occurred in the last common ancestor of these Boraginaceae species. Leryth_021809, the additional CYP76B6-like gene recovered in the coexpression analysis, is within a separate clade that has expanded within shikonin producing species. The closest homolog in M. paniculata (the only Boraginaceae in our analysis that does not produce shikonin) groups closer to a different cytochrome P450 in L. erythrorhizon (Leryth_021691; Fig. S9).
A phylogeny of sequences homologous to the second cytochrome P450 candidate (Leryth_001242), which encodes a CYP76A2-like enzyme, also shows an expansion of gene copies in the Boraginaceae (Fig. S10). This gene is one of three homologs in a tandem repeat, including Leryth_001243 and Leryth_001244 indicating that tandem gene duplication has expanded this cytochrome P450 subfamily in L. erythrorhizon (Fig. S11). Leryth_001243 and Leryth_001244 were not captured in the shikonin subnetwork but their expression is greater in whole root versus above ground tissue (Table S3). Lastly, the phylogeny of sequences homologous to the third cytochrome P450 candidate (Leryth_000257), which encodes a CYP89A2-like enzyme, shows a smaller group of Boraginales sequences without the rounds of expansion present in the other two trees (Fig. S12).
The production of geranylhydroquinone (GHQ) from 3-geranyl-4HBA (Fig. 1) may occur via decarboxylation and subsequent hydroxylation or a single oxidative decarboxylation event [4]. In addition to cytochromes P450, we examined the generated shikonin network for non-cytochrome P450 candidate genes that may function in either hypothesized mechanism. One candidate to consider is a prephenate dehydrogenase-like (PDHlike) gene. PDH catalyzes oxidative decarboxylation of prephenate to 4-hydroxyphenylpyruvate for synthesis of tyrosine [47]. Leryth_001358 encodes a PDH-like protein and is coexpressed with LePGT1 in the N2 subnetwork ( Fig. 5b; Table S7). Similar to other genes in our analysis, the phylogeny of PDHs shows an expansion of this gene family within the Boraginaceae (Fig. S13). Coexpression of the PDH-like gene may simply be related to the connection between shikonin and aromatic amino acid metabolism via phenylpropanoid metabolism, further research is needed to determine if a duplicated PDH could evolve to utilize another 4-hydroxylated substrate.
Given the dozens of shikonin and alkannin derivatives collectively present in the Boraginaceae [48], we looked for genes in the shikonin subnetwork that may encode tailoring enzymes involved in the synthesis of shikonin derivatives. Recently, two BAHD acyltransferases, shikonin O-acyltransferase (LeSAT1) and alkannin O-acyltransferase (LeAAT1), were discovered to mediate enantiomer-specific acylation in L. erythrorhizon [49]. Neither LeSAT1 nor LeAAT1 were recovered in the coexpression networks but expression of both is more abundant in at least one shikonin-abundant condition (Table S3). In our N2 shikonin subnetwork ( Fig. 5b; Table S7), two additional genes encoding putative transferases (Leryth_012925, Leryth_015823) were recovered. Phylogenetic analysis of the Leryth_015823 transferase and its homologs places Leryth_015823 in a group that contains all Boraginaceae species in our analysis (Fig. S14). In contrast, phylogenetic analysis of Leryth_012925 and its homologs shows Leryth_012925 on a long branch and lacking closely related homologs in other Boraginaceae (Fig. S15), which may make it a potential candidate for a L. erythrorhizon-specific shikonin/alkannin tailoring enzyme that is absent in the other shikonin-producing species in our analysis.

Coexpression network analysis reveals candidates with links to ubiquinone biosynthesis
It has already been demonstrated that LePGT1 and LePGT2 evolved via duplication of a primary metabolic prenyltransferase involved in ubiquinone biosynthesis [20]. Given this previously observed connection, the coexpression of a COQ4 ubiquinone biosynthesis-like gene with LePGT1, LeGPPS, and LeCYP76B101 in the N2 network appeared remarkable ( Fig. 5b; Table S7). Although the precise biochemical function of COQ4 is unknown it is thought to function as a scaffold protein binding proteins and lipids required for efficient ubiquinone biosynthesis [50]. The phylogenetic tree of the COQ4 gene family (Fig. S16) is strikingly similar to that of the ubiquinone prenyltransferase gene family [20]. Both phylogenies contain two subfamilies of Boraginales sequences. One subfamily has shorter branch lengths and contains a single sequence per species, suggesting that this subfamily has retained the ancestral COQ4 ubiquinone biosynthesis activity (Fig. S16). The second Boraginales subfamily has longer branches and shows a radiation of COQ4 paralogs and includes the candidate gene (Leryth_002195; Fig. S16), which was overexpressed in all shikonin-abundant conditions (Table S3). Given the similarities in the precursors and biosynthetic steps in the ubiquinone and shikonin pathways [20], this COQ4 paralog (Leryth_002195) could fulfill an analogus function and participate in assembling a shikonin biosynthesis metabolon.
In addition to the COQ4-like genes, we also identified two COQ3-like O-methyltransferase genes in the N2 shikonin subnetwork (Leryth_019821 and Leryth_021171). Leryth_019821 was coexpressed with LeMDR and Leryth_ 021171 was coexpressed with LeGPPS and LeCYP76B101 (Fig. 5b). A phylogenetic tree of the COQ3 gene family suggests that the two copies in L. erythrorhizon diverged in an ancestor of the Boraginaceae; the Leryth_021171 subfamily contained a sequence from M. paniculata, whereas the Leryth_019821 subfamily appears to be unique to shikonin producing species (Fig. S17). The metabolic significance of this network connection remains enigmatic, though it is possible that these enzyme could function in formation of shikonin derivatives.
A final connection to ubiquinone metabolism uncovered in the coexpression analysis was the recovery of a quinoprotein dehydrogenase gene (Leryth_020454) in the largest N4 subnetwork that coexpressed with LeGPPS, LePGT1, and LeCYP76B101 ( Fig. S4d; Table S9). Leryth_020454 was also significantly overexpressed in all shikonin-abundant conditions (Table S3). Quinoprotein dehydrogenases catalyze the oxidation of glucose to gluconate with concomitant reduction of ubiquinone to ubiquinol [51]. It is conceivable that such an enzyme could function to maintain shikonins and/or pathway intermediates in reduced states to protect the cell. Alternatively, it could function to ensure a pathway intermediate(s) remains in its reduced form. A similar chemical prerequisite is necessary for transmethylation of the 1,4-naphthoquinone ring of demethylphylloquinone in the vitamin K1 pathway [52]. The phylogeny of quinoprotein dehydrogenases shows two copies of this gene in the Boraginaceae (Fig. S18). The clade that contains Leryth_020454 appears unique to shikonin producers and is absent in M. panticulata (Fig. S18). Collectively, the analyses provided here suggest there are multiple genes in the shikonin coexpression network that originated from duplication of ubiquinone pathway genes.

Discussion
In this study, we downregulated expression of LeGPPS to explore the connections linking the shikonin pathway with the pathways supplying its metabolic precursors. In doing so, we showed that the recently discovered LeGPPS, an FPPS with evolved GPPS activity [19], is required for shikonin production (Fig. 2) and that LeGPPS supplies GPP precursor to the shikonin pathway using MVA-pathway derived IPP/DMAPP (Fig. 3). We also performed a series of computational analyses to investigate the evolutionary history of metabolic innovation in the shikonin pathway. Synteny analysis of the L. erythrorhizon genome revealed one syntenic block in contigs containing LeFPPS1 and LeF-PPS3 (Fig. S5) suggesting that WGD in the Boraginaceae was responsible for a duplication giving rise to these canonical FPPS paralogs (Fig. S7). However, the absence of shared synteny between LeGPPS and any other FPPS genes, suggests that LeGPPS did not arise via WGD. There is also no clear evidence of tandem duplication, and the presence of introns likely rules out retro duplication similar to what occurred with PGT evolution [20]. Instead, conservation of intron positions between LeGPPS and other FPPS genes (Fig. S6) is consistent with a segmental or DNA transposition event.
Wisecaver et al. [26] previously showed that network analysis based on abundant coexpression data (i.e. hundreds of RNA-seq and/or microarray samples) is a powerful strategy for high-throughput discovery of genes involved in specialized metabolic pathways in plants. We utilized a similar computational approach here with a limited but strategically selected set of transcriptome samples (N = 14) to construct a shikonin metabolic network model. We chose to focus our analysis on LeGPPS [19] (Fig. 2), LePGT1 [20], LeCYP76B101 [34,35], and LeMDR [36] given their demonstrated roles in shikonin metabolism. Using conventional differential gene expression analysis to refine the gene coexpression matrix, we uncovered a L. erythrorhizon shikonin gene network model that predicts strong associations between MVA pathway genes and known shikonin biosynthesis genes, as well as links between shikonin genes and several uncharacterized enzyme-coding genes (Fig. 5) that present new candidates for missing shikonin biosynthesis steps (Fig. 1). Moreover, L. erythrorhizon produces high amounts of rosmarinic acid and other specialized metabolites [53]. It is therefore important to note that the gene connections uncovered in the shikonin coexpression subnetworks may extend beyond shikonin biosynthesis. However, examining expression of the rosmarinic acid biosynthesis gene CYP98A654 (Leryth_006600) and five other CYP98A6-like genes present in the L. erythrorhizon genome (see Table S1 and S3) showed that none were recovered in any of our shikonin subnetworks, including the largest (N4). Plotting the spearman's correlation of each CYP98A6 homolog against the eigengene [54] for each module within the N2 shikonin subnetwork shows that CYP98A6 homologs are poorly correlated with the shikonin subnetwork (Fig. S19). This is consistent with results from hierarchical clustering analysis of proteomic data showing that CYP98A6 clusters separately from shikonin biosynthesis proteins [55].
Similar to LeGPPS (Fig. S7) and the PGTs [20], Boraginalesspecific gene family expansions were observed in the phylogenies (Figs. S8,S9,S11-S17) of the genes identified by coexpression network modeling (Table 1). Therefore, gene duplication appears to be the primary mechanism contributing to metabolic innovation in the Boraginales. Synteny analysis suggests that WGD was unlikely to be responsible for the expansion in the gene families for these candidates (data not shown). Furthermore, examination of the genomic regions surrounding these candidates suggests that tandem duplication did not contribute to their respective gene family expansions either, except for the cytochrome P450 encoded by Leryth_001242 (Fig. S11). Though Leryth_001243 and Leryth_001244 were not candidates identified in the coexpression network, their transcript abundance is higher in roots than in aboveground tissues (Table S3). These cytochromes P450 are predicted to encode CYP76A2-like enzymes. Other CYP76A members have been found to catalyze oxidation cascades involved in formation of terpenoid-derived specialized metabolites [56,57], thus making Leryth_001242 and its paralogs intriguing shikonin pathway candidate genes.
The shikonin pathway relies on precursors from both isoprenoid and phenylpropanoid metabolism. Inhibitor experiments with LeGPPS-RNAi lines led us to discover an additional layer of regulatory complexity coordinating f lux between the phenylpropanoid, MVA, and MEP pathways. Inhibition with the MEP pathway inhibitor fosmidomycin, for example, unexpectedly led to increased shikonin levels in both the EV-26 and LeGPPSi-45 lines (Fig. 3). This not only provides further evidence that neither IPP/DMAPP derived from the MEP pathway, nor GPP produced from MEP pathway-derived IPP/DMAPP, is exported to the cytoplasm for shikonin biosynthesis but it likely points to an increase in f lux through the MVA pathway due to the impairment of the MEP pathway.
To test if impairment of the MVA pathway affects regulation of the MEP pathway, we examined expression of MVA and MEP pathway genes in EV-26 lines treated with mevinolin (Fig. S20). Treatment with mevinolin increased expression of MVA pathway genes and decreased expression of early MEP pathway genes, including one of the copies encoding the first and rate-limiting enzyme 1-deoxy-D-xylulose-5-phosphate synthase (DXS). These data implicate the existence of unknown factors coordinating f lux from central carbon metabolism into the MVA and MEP pathways, adding another level of control to the complex regulation of these parallel routes in plants [15].
Comparative RNA-seq analysis of EV-26 and LeGPPSi-45 hairy root lines revealed that downregulation of LeGPPS results in transcriptional changes of genes throughout the terpenoid and phenylpropanoid metabolic networks (Fig. 4). The decreased expression of upstream MVA pathway genes and increased expression of genes encoding cytoplasmic enzymes utilizing IPP/DMAPP (i.e. NUDX1 [17] and FPPS) may indicate that IPP/DMAPP accumulates when the LeGPPS step is limiting. The increased pool of IPP/DMAPP may then be sensed by the cell leading to transcriptional reprogramming of isoprenoid metabolism to redirect the C5 building blocks toward other products. While levels of sterols (cytoplasmic IPP/DMAPP-derived product) and abscisic acid (plastidial IPP/DMAPP-derived product) were not significantly different, the levels of ubiquinones (mitochondrial IPP/DMAPP-derived product) were increased by 36% in LeGPPSi-45 lines compared to EV-26 lines (Fig. S21). The observed increase in ubiquinone levels is also noteworthy because it further suggests that its precursor pools are shared with the shikonin pathway.
The WRKYs are strong candidates for factors coordinately regulating expression of phenylpropanoid and terpenoid metabolic genes. As one of the largest classes of plant transcription factors, they are involved in regulating processes in response to a number of developmental cues and environmental stimuli. Moreover, they can act as activators or repressors and in doing so they create a regulatory network modulating signaling events from organelles and the cytoplasm to the nucleus [58]. Here, we found that 72 of the 152 genes in the N2 shikonin subnetwork, including LePGT1, LeGPPS, LeMDR, and CYP76B101, contain a canonical W-box element sequence motif (T)TGAC(C/T) (Table S11) recognized by the WRKY family of transcription factors [43]. From our analyses we identified five candidate transcription factors containing WRKY domains in the N2 shikonin subnetwork (Fig. 5b) including two, Leryth_027519 and Leryth_002564, which were both overexpressed in all shikonin-abundant conditions in the analyzed RNA-seq datasets (Table S3).
In addition to sharing 4-HBA and MVA-derived prenyl diphosphate metabolic precursors and having a common origin of their prenyltransferase genes, the shikonin and ubiquinone pathways rely on multiple analogous biochemical ring modifications [20]. This raises the prospect that neofunctionalization of duplicated ubiquinone biosynthesis genes facilitated evolution of the shikonin pathway. Considering this hypothesis, we explored the shikonin coexpression subnetworks for other connections to ubiquinone biosynthesis-like genes. Interestingly, COQ3-like O-methyltransferase and a quinoprotein dehydrogenase genes were found in the coexpression network that are unique to shikoninproducing species (Figs. S16 and S17). Whether these genes function in shikonin metabolism or point to another functional connection between shikonin and ubiquinone remains unclear. Moreover, we identified that in addition to encoding a canonical COQ4, L. erythrorhizon has a COQ4-like gene that was coexpressed with LePGT1, LeGPPS, and LeCYP76B101 in the N2 network and was overexpressed in shikonin-abundant conditions ( Fig. 5b; Tables S3 and S7). COQ4 is a scaffold protein found in plants, fungi, and animals, including humans, that is required for ubiquinone biosynthesis. While its specific function is unknown, it binds proteins and lipids and thus likely assembles a metabolon for efficient ubiquinone biosynthesis [50]. Whether COQ4-like functions similarly in shikonin biosynthesis is an open question that should be explored, especially considering any insight may inform the function of the canonical COQ4 found throughout eukaryotes. Given that shikonin is abundant and non-vital, it may provide a better model for genetically studying the COQ4 gene family.
In summary, our study has i) indicated transcriptional and metabolic connections linking the shikonin pathway with it precursor pathways; ii) established a shikonin coexpression network model that includes genes encoding candidates for missing shikonin pathway steps and regulatory factors; iii) revealed instances of Boraginales-specific gene family expansion facilitated by duplication events for genes in the shikonin metabolic network; and iv) uncovered evolutionary links between shikonin metabolic network genes and ubiquinone pathway genes. The evolution of other plant specialized 1,4-naphthoquinone pathways appears to be linked to primary metabolic quinone pathways [22,23]. Thus, we expect that the evolutionary mechanistic insights gained here, combined with the demonstration that a robust coexpression network can be built from a small set of RNA-Seq experiments relying on spatial-and conditionspecific metabolite correlations, can be used to guide further investigation into the convergent evolution of specialized 1,4-naphthoquinone metabolism in plants.

Plant materials and hairy root culturing
L. erythrorhizon (accession Siebold & Zucc.) seeds were obtained from the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) seed bank (Gatersleben, Germany). Propagation of plants to bulk seeds and the generation and maintenance of hairy roots were performed as done previously [20].
L. erythrorhizon hairy root GPPSi lines were generated by applying prepared cultures of A. rhizogenes containing the pB2GW7-GPPSi construct to wounded stems of L. erythrorhizon plants in tissue culture as previously described [20]. Emergent roots from plants 2-4 weeks after infection were excised and transferred to Gamborg B5 media plates containing 3% sucrose and 200 μg/mL cefotaxime to eliminate A. rhizogenes. After 2 weeks, hairy roots were transferred to Gamborg B5 media containing 3% sucrose and 10 mg/L Basta for selection for 2 weeks. Hairy root lines transformed by A. rhizogenes carrying an empty pB2GW7 vector were generated in parallel as controls.

Metabolite extraction and quantification
Extraction and analysis of ABA by liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) was performed as previously described [64]. Extraction of total shikonins from growth media of hairy root cultures and quantification on an Agilent 1260 Infinity high performance liquid chromatography with diode array detection (HPLC-DAD) system (Agilent Technologies) was done as previously described [20]. Sterols were extracted from 100-200 mg of ground flash-frozen hairy root tissue, derivatized with BSTFA, and analyzed on an Agilent 7890B gas chromatograph (GC) coupled with a 5977A mass spectrometer (MS) equipped with a DB-5MS column (30 m × 0.25 mm × 0.25 μm film; Agilent Technologies) and employing Chemstation software as previously described [16]. Ubiquinones were extracted from 100-200 mg of ground flash-frozen fresh tissue in 3 mL of 95% ethanol spiked with 4 nmol ubiquinone-4 internal standard and incubated overnight with shaking at 4 • C. The next day, samples were centrifuged at 500 x g to pellet debris. Then, 1.5 mL of water was added to supernatant and partitioned twice with 4.5 mL hexane. The hexane layers were combined and concentrated under nitrogen gas at 37 • C. Nearly dry samples were resuspended in 1 ml 90:10 methanol:dichloromethane and filtered through 0.2 μm PTFE syringe filters. Care was taken throughout the extraction process to protect samples from light. Samples were analyzed by HPLC-DAD on an Agilent Zorbax SB-C18 column (5 μm, 250 x 4.6 mm) thermostatted at 25 • C and eluted in isocratic mode with 30% 60:40 isopropanol:hexanes and 70% 80:20 methanol:hexanes [8]. Ubiquinones were detected spectrophotometrically at 255 nm and had retention times of 4.8 min for ubiquinone-4, 11.4 min for ubiquinone-9, and 14.3 min for ubiquinone-10. Instrument operation and data analysis steps were performed through the Agilent ChemStation software. Quantification of ubiquinones was done by DAD using signals obtained in the linear range of calibration standards (0.0313, 0.0625, 0.125, 0.250 and 0.500 nmol). The data were corrected for recovery according to the ubiquinone-4 internal standard, and final quantifications were made using linear regression. Differences in total shikonin and ubiquinone-9 and ubiquinone-10 content produced by empty-vector control and LeGPPSi lines (n = 4 biological replicates) were analyzed using one-way ANOVA and means were compared with Tukey's HSD post-hoc test at a 95% significance level.

RNA-sequencing analysis of LeGPPSi and empty-vector control lines
For RNA-seq analysis of L. erythrorhizon EV-26 and LeGPPSi-45, three independent hairy root cultures of each line were started in liquid Gamborg B5 media containing 3% sucrose and grown at 28 • C in 100 μE m −2 s −1 light. After two weeks, the hairy roots were transferred to M9 media containing 3% sucrose and darkness for six days. The hairy roots were then frozen in liquid nitrogen, ground by mortar and pestle, and RNA was extracted from ∼100 mg of tissue as described above. For RNA-seq analysis of L. erythrorhizon EV-26 lines, three independent hairy root cultures were grown as just described. Mock (control) and 100 μM mevinolin treatments were administered immediately upon transfer of 14-d-old hairy roots to M9 and darkness. Total RNA was extracted at 6 d after transfer of 14-d-old hairy roots to M9 and darkness.
Library construction (NEBNext Ultra RNA Library Prep Kit, New England Biolabs Inc.) from 1 μg RNA, Illumina sequencing, and analyses of DEGs were performed by Novogene Corporation Inc. (Sacramento, CA). Paired-end clean reads were mapped to the L. erythrorhizon reference genome [20] using HISAT2 software [65] For each sequenced library, read counts were adjusted by TMM [66] and DEG analysis was performed using DESeq2 [67] with p-value adjusted using an FDR calculated with Benjamini-Hochberg (BH) methods [68]. Genes were considered significantly differentially expressed if they had a BH-adjusted p-value of 0.005 and a log2 fold change of 1. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses of DEGs was implemented by the clusterProfiler R package [69] and KEGG pathways with BH-adjusted p-value <0.05 were considered significantly enriched. The raw data were submitted to the Sequence Read Archive (http://www.ncbi.nlm.nih.go v/sra/) and are available at the NCBI Sequence Read Archive (PRJNA811172).

Analysis of transcriptomes used to build shikonin gene coexpression networks
Illumina RNA-seq reads of L. erythrorhizon root periderm, root vascular, and hairy root cultures were generated as described [20] and are available at the NCBI Sequence Read Archive (PRJNA596998). Additional Illumina RNAseq reads of L. erythrorhizon whole roots and above ground tissue (pooled leaves and stems) were downloaded from the NCBI SRA database, experiments SRR3957230 and SRR3957231 respectively. L. erythrorhizon gene functional annotations were downloaded from [20].
L. erythrorhizon RNA-seq raw reads were error corrected using the Tadpole (default parameters; software last modified June 27, 2017) program from the BBMap software package (https://sourceforge.net/projects/bbmap/). Gene expression was quantified with Kallisto [70] by aligning the error corrected reads to a collection of the longest transcript per gene of the L. erythrorhizon genome. The LePS-2 gene was previously implicated in shikonin biosynthesis [28] but was not present in v1.0 of the L. erythrorhizon gene set [20]. Therefore, we identified a putative coding sequence for LePS-2 in the L. erythrorhizon genome assembly manually and added its sequence to the total gene set prior to gene expression quantification.
Analyses of differential gene expression was performed using the edgeR package [71]. Gene expression counts were normalized using the TMM (trimmed mean of M values) method [66]. Exact tests were conducted using a trended dispersion value and a double tail reject region. FDRs were calculated using the BH procedure [68]. Genes that did not have a significant differential expression status in at least one comparison (FDR < 0.1) were excluded from downstream coexpression analyses.

Coexpression network analysis
Raw gene expression counts were normalized using the transcripts per million method and transformed using the variance-stabilizing transformation method in DESeq2 [67], and global gene coexpression networks were constructed as previously described [26]. Briefly, a Pearson's correlation coefficient (PCC) was calculated between gene pairs and converted into a mutual rank (MR) using scripts available for download on GitHub (https://github.rcac.purdue.edu/jwisecav/coexp-pipe). MR scores were transformed to network edge weights using the exponential decay function e −(MR−1/x) ; four different networks were constructed with x set to 5, 10, 25, and 50, respectively. Edges with a weight < 0.01 were trimmed from the global network. Modules of coexpressed genes were detected using ClusterOne v1.0 using default parameters [32]. Module eigengenes were calculated using WGCNA [54,72]. Overlapping modules within each coexpression network were combined by collapsing all modules containing the known Shikonin pathway genes LeGPPS [19], LePGT1 [13], LeCYP76B101 [35] and LeMDR [73] into a subnetwork. Modules were visualized in Cytoscape using the spring embedded layout. Tests for functional enrichment of Gene Ontology (GO) terms in the different shikonin subnetworks (Table S10) were performed using hypergeometric tests using the SciPy library hypergeom, and p-values were adjusted for multiple comparisons using the StatsModels library multitest using the BH procedure [68]. GO terms and other gene functional annotations were taken from [20].

Promoter analysis
Nucleic acid sequence motifs enriched in promoter regions of genes in the N2 shikonin subnetwork (N = 152) were identified with Motif Indexer [42] using a 1000 base pair window upstream of all transcriptional start sites using the same upstream region of all L. erythrorhizon genes as background. Identified motifs were consolidated and ranked using the KeyMotifs.pl perl script provided by Motif Indexer. To calculate a false discovery rate, 1000 random sets of 152 genes were run through Motif Indexer determine a p-value threshold. No motif identified from a random gene set had a p-value less than 1x10 −9 .

Phylogenetic analysis
To construct gene phylogenies, the gene family containing the best A. thaliana BLAST hit to the query gene was downloaded from the PLAZA 4.0 Dicots comparative genomics database [44]. Homology between the predicted proteomes of L. erythrorhizon and 18 additional Borginales was determined with OrthoFinder v2.1.2 using the following parameters: -S diamond -M msa -T fasttree [74]. OrthoFinder orthogroups containing the query gene were combined with the Plaza 4.0 gene family to obtain the final sequence sets. Sequences were aligned with MAFFT [75] using the E-INS-I strategy and following parameters: -maxiterate 1000 -bl 45op 1.0 -retree 3. The maximum likelihood phylogeny was constructed using IQ-TREE [76] using the built in ModelFinder to determine the best-fit substitution model [77] and performing SH-aLRT and the ultrafast bootstrapping analyses with 1000 replicates each. For the cytochrome P450 and acetyltransferase gene candidates, because the PLAZA 4.0 gene families were so large, a quick guide tree of the entire gene family was built using FastTree [78]. Regions of the guide tree that contained candidate genes of interest were identified; sequences within these regions were realigned using MAFFT, and phylogenies were built using IQ-TREE as described above.

Synteny analysis
Regions of shared synteny within the genome of L. erythrorhizon were detected using SynMap2 on the online Comparative Genomics Platform (CoGe) using default settings with the exception that the merge syntenic blocks algorithm was set to Quota Align Merge, syntenic depth algorithm was set to Quota Align, and the CodeML option was activated to calculate substitution rates between syntenic CDS pairs. For syntenic blocks containing genes of interest and their homologs, the encompassing contigs were aligned using promer of the MUMmer4 alignment system [79].