Natural products of medicinal plants: biosynthesis and bioengineering in post-genomic era

Abstract Globally, medicinal plant natural products (PNPs) are a major source of substances used in traditional and modern medicine. As humanity faces the tremendous public health challenges posed by emerging infectious diseases, antibiotic resistance, and surging drug prices, harnessing the healing power of medicinal plants is more urgent than ever in helping us meet future challenges in a sustainable way. PNP research in the pre-genomic era focused on discovering bioactive molecules with pharmaceutical activities and identifying individual genes responsible for their biosynthesis. Critically, systems biology and multi- and inter-disciplinary approaches that integrate and interrogate all accessible data from genomics, metabolomics, structural biology, and chemical informatics are necessary to accelerate the full characterization of the biosynthetic and regulatory circuitry for producing PNPs in medicinal plants. In this review, we provide a brief update on current research into PNPs in medicinal plants by focusing on how different state-of-the-art biotechnologies facilitate their discovery, the elucidation of the molecular basis of their biosynthesis, and synthetic biology. Finally, we offer an outlook on research trends for understanding the biology of medicinal plants in the coming decades.


Natural products in medicinal plants: hidden treasures with healing power
Plants have existed on Earth for hundreds of millions of years and have evolved ingenious chemical factories to survive exogenous and endogenous stresses [1]. These chemicals, known as secondary metabolites or natural products, are synthesized by plants to accommodate environmental changes without greatly disrupting their cellular and developmental processes [2]. To date, more than 100 000 natural products are known from the plant kingdom, primarily involved in plant defense against biotic and abiotic stresses [3]. These powerful substances are also important chemical signals that mediate plant communication with symbiotic microorganisms and attract pollinators and seed dispersers. Derived from primary metabolites, secondary metabolites accumulate at the cellular, tissue, and organ levels through diverse biosynthetic pathways [4]. Plant natural products (PNPs) are generally divided into three classes, phenolics, terpenoids, and alkaloids [5], and are broadly used as pharmaceuticals, nutraceuticals, cosmetics, and fine chemicals. Phenolics are synthesized through the shikimic acid pathway, in which the final products are formed after phenylalanine and other aromatic amino acids undergo deamination, hydroxylation, and coupling reactions [7] (Fig. 1A). In general, phenolics consist of monophenols such as benzenoids, and polyphenols such as flavonoids, stilbenoids, and coumarins [6]. Terpenoids are biosynthesized through the mevalonic acid and methylerythritol phosphate pathways from isopentenyl diphosphate (IPP, C5), the precursor and fundamental structural unit of all terpenoids, including monoterpenoids (C10), sesquiterpenoids (C15), diterpenoids (C20), and triterpenoids (C30) [8] (Fig. 1B). Alkaloids are a large group of nitrogen-containing plant compounds with a broad range of pharmaceutical activities, such as painkilling (e.g. morphine), cough-suppressing (e.g. noscapine), anti-inflammatory (e.g. sanguinarine, berberine), and anti-cancer (e.g. vinblastine, noscapine) activities (Fig. 2). Originating from amino acid and isoamylene biosynthetic pathways, alkaloids are generally classified into the sparteine, quinine, mescaline, coniine, and aconitine types (Fig. 2). The biosynthetic pathways of different alkaloids are diverse and often independent.
PNPs are bioactive substances dispensable to normal plant cellular functions yet vital to the biodefense and environmental adaptation of plants with a sessile lifestyle. As a major source of traditional and modern medicine, PNPs have had broad implications for human health as herbal remedies for thousands of years and have changed the course of human civilization and history [9]. The healing power of medicinal herbs has long been recognized and harnessed by our ancestors, who learned to use plant-based folk medicine to treat ailments such as headache, fever, and pain. For example, archaeologists found evidence in a grave at Shanidar, an archaeological site in Iraq, that a Neanderthal man may have used plant-based medicine around 60 000 BCE, the earliest record of human use of herbal medicine [10]. In addition, ancient Europeans had started to cultivate and use different varieties of opium poppy at least 5000 years ago [11]. Another example is tobacco, which was historically used to treat various ailments including yaws, syphilis, and the Black Death, given its strong antimicrobial activities [12,13]. Much of early human knowledge about medicinal herbs is documented in ancient texts such as the Treatise on Cold Pathogenic and Miscellaneous Diseases by Zhang Zhongjing (Eastern Han dynasty) and the Compendium of Materia Medica by Li Shizhen (Ming dynasty). However, the use of medicinal herbs in traditional medicine has been largely empirical, with little knowledge of the chemical properties of the effective PNPs. The first PNP to be isolated was morphine, a benzylisoquinoline alkaloid from opium poppy (Papaver somniferum), by the German pharmacist Friedrich Sertürner around 1817, marking the birth of the modern chemistry of natural products [14]. Later, another alkaloid, quinine, isolated from the bark of the cinchona tree (Cinchona officinalis), became the first effective medicine against malaria, which is caused by mosquito-transmitted Plasmodium species [15]. Salicin, the original source of aspirin, identified in the bark of the willow tree (Salix babylonica), is another widely used PNP, employed prominently as a pain reliever [16]. To date, hundreds of plant-based bioactive compounds have been identified and many are used as effective treatments for human diseases, such as ginsenoside (anti-tumor), paclitaxel/taxol (anti-tumor), and artemisinin (anti-malaria). The sesquiterpene endoperoxide artemisinin from Artemisia annua is recommended by the WHO as the most effective drug against malaria [17]. Paclitaxel (taxol), isolated from the bark of trees of the genus Taxus, has been approved for the treatment of ovarian, breast, and lung cancer, as well as Kaposi's sarcoma [18]. These examples demonstrate that medicinal plants and their natural products have great healing power and have shaped the course of human history.
Throughout history, the human race has battled many infectious diseases such as tuberculosis, cholera, malaria, the Black Death, influenza, and smallpox. The most recent public health challenge comes from the coronavirus pandemic that began in 2019 (COVID-19), which continues to threaten the world through the emergence of new variants. Individually and collectively, these pandemics have led to significant progress in human knowledge of medical sciences and to new public health solutions in which medicinal plants have played important roles. Because antibiotics that target bacterial and fungal pathogens are futile against viral infection, the current defense against viral infections such as influenza and coronaviruses is primarily vaccination, whereas effective virus-killing drugs are highly desired but scarce. Whereas Western medicines against COVID-19, such as sabizabulin [19], remdesivir [20], and hydroxychloroquine [21], remain under development and in clinical trials, PNPs with antiviral activities have been reported to contain viral replication and alleviate patient symptoms. For example, flavonoids such as neohesperidin, hesperidin, baicalin, kaempferol 3-O-rutinoside, and rutin from different sources, and a series of xanthones from Swertia plants, could effectively interact with SARS-CoV-2 targets [22]. In addition, molecular docking was recently performed using flavonoids from the fruit peel of Citrus reticulata 'Chachi' against the spike protein, 3CLpro, PLpro, and RdRp of SARS-CoV-2; the results suggested that many flavonoids have stronger affinity for the targets than the positive control drugs [23]. Recently, in a multicenter, prospective, randomized controlled clinical trial, Lianhuaqingwen capsule, a manufactured product based on a traditional Chinese medicine (TCM) formula, was reported to significantly inhibit SARS-CoV-2 replication [24]. Despite the promising therapeutic effect of medicinal herbs against COVID-19, the active substances and their mechanisms of action against the virus remain unknown. Furthermore, because the biosynthetic pathways of most PNPs are elusive, the active molecules cannot be obtained in sufficient quantities to satisfy the need for drugs to control global pandemics. For decades, the biochemical pathways of PNP biosynthesis, genome architecture, the regulation of PNP production, and, most importantly, bioengineering for large-scale production of bioactive PNPs have all been extensively researched areas of medicinal plant biology and chemistry. Here, we review current research on PNPs by focusing on how different state-of-the-art multidisciplinary technologies facilitate their discovery, the elucidation of the molecular basis of their biosynthesis, and bioengineering via breeding and synthetic biology. Finally, we highlight a few research trends regarding the exploitation of medicinal plants in the coming decades [25].

Decoding biosynthetic pathways of PNPs is key to their applications
The challenge of acquiring PNPs en masse stems from the fact that medicinal plants tightly regulate the production of these chemicals, which is mostly restricted to specific cells and tissues under certain conditions [26]. The biosynthesis of PNPs typically proceeds through a cascade of enzymatic reactions that convert primary metabolites into structurally diverse secondary metabolites. Although some reactions can occur spontaneously in nature, most steps require catalysis by enzymes such as cytochrome P450s, methyltransferases, O-methyltransferases, deaminases, and UDP-glucuronosyltransferases. A major challenge in the exploitation of PNPs is to understand their biosynthetic pathways so that bioengineering approaches can be applied to produce them on a massive scale through plant breeding or synthetic biology. Fully resolving the biosynthetic pathway of any PNP is a daunting task, because plants have evolved a complex cellular network of metabolic pathways [27], formed by various enzymes catalyzing a myriad of biochemical reactions and by regulatory proteins fine-tuning the spatial and temporal accumulation of PNPs. After years of research effort, the full biochemical pathways remain unknown for the majority of medicinal PNPs, highlighting the difficulty of decoding PNP biosynthetic pathways.
Before the genomic era, plant biosynthetic pathway characterization was laborious and time-consuming, relying either on approaches such as isotope labeling and forward genetics (for example, creating random mutants and then analysing their metabolic profiles) or on sequence homology-based gene cloning to identify individual biosynthetic enzymes [28]. These studies typically identify genes encoding parts of the pathways through guilt-by-association, as loss- or gain-of-function mutations of true biosynthetic genes can alter metabolic profiles. However, they are usually unable to resolve all the components and reconstruct the pathway, owing to the pleiotropic or promiscuous nature of the biosynthetic enzymes. The majority of our knowledge about PNP biosynthetic pathways so far is derived from such homology-based or transcriptome-based gene mining [29,30]. With growing volumes of genomic data available for medicinal plants, it is now common to exploit high-throughput data mining combined with experimental validation to untangle the complex biosynthetic pathways and networks underlying PNP accumulation in medicinal herbs. PNP biosynthetic genes are usually co-expressed and co-regulated in specific tissues and growth stages. Therefore, in the post-genomic era, gene expression profiling techniques such as whole-transcriptome sequencing (RNA-seq) enable researchers to quickly narrow down the co-expressed candidate genes encoding specific biosynthetic pathways, typically via comparative transcriptome analysis of plant samples with contrasting levels of metabolite production. Transcriptomic analysis combined with pathway inference based on chemical logic and experimental validation in heterologous hosts is routinely used to identify biosynthetic genes for PNPs such as thebaine and noscapine of opium poppy (P. somniferum) [31], sanguinarine and chelerythrine of Macleaya cordata [32], vinblastine of Madagascar periwinkle [33], colchicine of Colchicum autumnale [34], and strychnine of Strychnos nux-vomica [35]. Alternatively, proteomic profiling is used to identify proteins corresponding to specific biosynthetic pathways, capturing translational and post-translational modifications unseen in transcriptomic data. Candidate genes identified by omic profiling are then validated experimentally to ascertain their biochemical functions, such as catalyzing specific reactions, metabolite transport, or transcriptional regulation. Validation typically involves heterologously expressing the candidate gene(s) in microbial cells, including bacteria (Escherichia coli [36], Corynebacterium glutamicum [37]) and yeasts (Saccharomyces cerevisiae [38], Pichia pastoris [39]), or in plant chassis such as tobacco (Nicotiana benthamiana [40]) and algae (Chlamydomonas reinhardtii [41], Phaeodactylum tricornutum [42]), followed by detection of the target metabolites using untargeted metabolomics. Metabolomics studies the full complement of plant metabolites through high-performance liquid chromatography (HPLC) or gas chromatography (GC) coupled with mass spectrometry (MS). Integrated mining of multi-dimensional data, such as transcriptomic or proteomic data combined with metabolomic profiles from different tissues and growth stages of medicinal plants, enables gene-metabolite association. Network mining of these large datasets can reconstruct the gene co-expression networks correlated with the accumulation of PNPs, thus generating hypotheses for downstream experimental validation [43]. This approach still suffers from the fact that many biosynthetic genes may be cryptic or expressed at low levels, leaving them undiscovered by expression-based methods. Thus, systematic decoding of PNP biosynthetic pathways will require a framework for high-throughput analysis of omic data from plants growing under multiple conditions or at multiple developmental stages. For example, two recent studies acquired transcriptomic and metabolomic data from tomato and rice plants across the full spectrum of growth stages, and integrative data analysis revealed gene modules associated with natural products [44,45]. Similar analyses across the full growth stages of medicinal plants are expected to facilitate the discovery of biosynthetic genes and yield critical insight into the regulatory networks underlying PNP biosynthesis.
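The gene-metabolite association described above can be illustrated with a minimal Python sketch. It assumes that expression values and metabolite abundances have already been measured across the same ordered set of samples, and it ranks genes by Pearson correlation with the metabolite profile. The gene names, the toy values, and the 0.8 threshold are hypothetical choices for illustration; published studies generally rely on dedicated network methods and statistical testing rather than a single correlation cut-off.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-variation signal
    return cov / sqrt(var_x * var_y)


def rank_candidates(expression, metabolite, min_r=0.8):
    """Return (gene, r) pairs with r >= min_r, strongest first.

    min_r is an illustrative threshold, not a recommended value.
    """
    hits = []
    for gene, profile in expression.items():
        r = pearson(profile, metabolite)
        if r >= min_r:
            hits.append((gene, r))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical expression values across five tissues or growth stages.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],  # rises with the metabolite
    "geneB": [5.0, 4.0, 3.0, 2.0, 1.0],  # falls as the metabolite rises
    "geneC": [2.0, 2.0, 2.0, 2.0, 2.0],  # constitutive, uninformative
    "geneD": [3.0, 1.0, 4.0, 2.0, 5.0],  # only weakly correlated
}
# Hypothetical abundance of the target metabolite in the same samples.
metabolite = [0.5, 1.0, 1.5, 2.0, 2.5]

candidates = rank_candidates(expression, metabolite)
```

In this toy data set only geneA passes the threshold. As noted in the text, genes that are cryptic or expressed at low levels would be missed by this kind of expression-based ranking.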

Medicinal plant genomes yield insights into composition and evolution of biosynthetic pathways
The genome sequence is the foundation of the biological functions of all life forms. Despite the progress made by homology- and transcriptome-based approaches, the elucidation of full biosynthetic pathways is often hampered by the lack of reference genome sequences for medicinal plants. A reference genome provides the basis for identifying all protein-coding genes and regulatory DNA elements and, importantly, their precise genomic locations. Next-generation sequencing (NGS) and third-generation sequencing (TGS) technologies have revolutionized biological sciences by changing the way genomes are decoded. The first human and plant (Arabidopsis thaliana) genome assemblies were achieved using Sanger sequencing. Despite its high accuracy, Sanger sequencing is too expensive and time-consuming for generating the data needed to assemble mid-sized to large eukaryotic genomes. Since around 2005, genome assembly projects have adopted high-throughput sequencing platforms such as Solexa, Ion Torrent, and later Illumina, giving rise to the first reference genomes for many organisms. However, contig-level assemblies built from short reads have low contiguity (low N50) and numerous assembly errors owing to the high repeat content of plant genomes, even when high coverage and long-insert data such as mate-pair libraries or linked reads are used. Scaffold-level assemblies are often improved by using genetic maps, for example those built by genotyping by sequencing (GBS), or optical maps to anchor the contigs to linkage groups. A high-resolution genetic map is essential for reducing contig misplacement, but is often unavailable for non-model plants. TGS technologies developed by Pacific Biosciences (PB) and Oxford Nanopore Technologies (ONT) produce single-molecule DNA sequencing reads of 20 kb or longer, albeit with error rates of up to 15%. TGS became a game changer for genome assembly because long reads can often span most repetitive regions. In addition, technologies such as chromatin conformation capture sequencing (e.g. Hi-C) and Bionano optical mapping are now routinely used to anchor contigs to chromosomes and to correct misassemblies present in NGS and TGS draft assemblies. As a result, for model organisms and many agricultural organisms, genome assembly quality and contiguity have leaped to much higher levels through the combination of long-read and short-read sequencing data.
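Because contiguity is summarized above by N50, a short Python sketch of the standard definition may be useful: N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The two toy assemblies below are invented solely to contrast a fragmented assembly with a contiguous one of the same total size.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half the assembly size.
    """
    if not contig_lengths:
        raise ValueError("no contigs supplied")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer-safe test for running >= total / 2
            return length


# Hypothetical assemblies of the same genome (lengths in arbitrary units).
fragmented = [10] * 10          # ten equally short contigs
contiguous = [60, 25, 10, 5]    # a few long contigs

fragmented_n50 = n50(fragmented)
contiguous_n50 = n50(contiguous)
```

Both toy assemblies total 100 units, yet the contiguous one has a sixfold higher N50, which is the kind of gain the text attributes to long-read sequencing.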
Medicinal plant genomes span a wide range of sizes and have high heterozygosity rates and repeat contents, making them difficult to assemble correctly [46]. Leveraging these different technologies together with improved bioinformatic algorithms has yielded a growing number of high-quality medicinal plant genome assemblies and annotations. Nearly 100 species of medicinal plants (Table 1) have at least one version of a reference genome available, although the quality of current assemblies varies depending on the genome complexity of the plant, the choice of sequencing technologies, and the computational tools used in assembly. The herbgenomics initiative, first proposed in 2010, has greatly promoted the elucidation of biosynthetic pathways for many medicinal bioactive ingredients [48]. Under this initiative and many independent genome projects, several medicinal plants have had reference genomes assembled to chromosome level, such as P. somniferum [49], Camptotheca acuminata [50], Scutellaria baicalensis [51], Panax notoginseng [52], Tripterygium wilfordii [53], Salvia miltiorrhiza [54], Taxus wallichiana [55], and Erigeron breviscapus [56]. Lately, a Chinese consortium has officially launched the 1 K Herb Genomes Project to produce high-quality genome sequences for 1000 high-value TCM plants in order to promote the study and exploitation of their PNPs.
Chromosome-level genome assemblies are instrumental to robust downstream genomic analyses such as comparative genome analysis, chromosome evolution analysis, and gene cluster characterization. For example, a chromosome-level assembly of P. somniferum allowed Guo et al. to discover a gene cluster (the BIA gene cluster) that encodes biosynthetic pathways for two benzylisoquinoline alkaloids, morphine and noscapine [49]. Two additional assemblies of Papaver species produced by Yang et al. enabled them to reconstruct the evolutionary history of Papaver karyotypes, showing that morphinan biosynthetic pathways followed a punctuated pattern of evolution [57]. Tu et al. [53] assembled a high-quality chromosome-scale Tripterygium wilfordii genome and found a recent duplication of triptolide biosynthetic pathway genes. Multiple omics methods were then integrated to construct a gene-to-metabolite network, and a cytochrome P450, CYP728B70, that participates in triptolide biosynthesis was finally identified. The genome assemblies for C. acuminata [50] and Catharanthus roseus [33] also provide comprehensive genomic resources for the analysis of the camptothecin and vinblastine biosynthetic pathways, which share common upstream steps producing loganic acid before flux diverges into two independent branches. The C. roseus genome, coupled with chemical investigations, enabled the discovery of the last two enzymes responsible for vinblastine biosynthesis, precondylocarpine acetate synthase and dihydroprecondylocarpine synthase [33,58], resolving the long-standing question of how vinblastine and vincristine are synthesized and making their heterologous production possible. Analysis of the C. acuminata genome identified two secologanic acid synthases that channel the loganic acid flux into camptothecin production, and the downstream candidate genes lay the foundation for fully elucidating the camptothecin biosynthetic mechanism [50]. Furthermore, high-quality genomes are also essential in genome-wide association studies (GWAS) to identify quantitative trait loci [59]. The availability of high-quality reference genomes thus provides critical resources for the elucidation of the biosynthetic pathways of high-value PNPs in medicinal plants.

Discovery of metabolic gene clusters through genomic mining
Microbial genes controlling secondary metabolite biosynthesis are typically clustered in certain genomic regions, known as metabolic gene clusters (MGCs). Unlike those of microbes, plant biosynthetic genes are usually dispersed throughout the genome, and MGCs have long been considered rare in plants. However, MGCs have recently been identified in several plants, including Arabidopsis [140,141], rice [142], maize [143], and medicinal plants such as opium poppy [49] and Taxus [55,139]. An MGC contains at least three non-homologous genes and ranges from tens to several hundred kilobases in total length. To date, over 30 MGCs in plants have been reported to encode PNP biosynthetic pathways based on experimental evidence [143], producing terpenoids [140-142], alkaloids [49], steroidal glycoalkaloids [144], and fatty acids [145]. Some of the PNPs encoded by gene clusters have medicinal value, such as morphine and noscapine, while many have known roles in antimicrobial and allelopathic activity and in plant defense against herbivores and pathogens. Many of these functionally characterized MGCs were discovered even before a reference genome was available. In the post-genomic era, the availability of reference genomes has expedited the identification of MGCs in plants through genomic mining. Recently, computational tools such as plantiSMASH [146], PhytoClust [147], and PlantClusterFinder [148] have been developed to predict MGCs encoding potential biosynthetic pathways of secondary metabolites. A large number of potential MGCs encoding unknown biosynthetic pathways have been found in plants, highlighting the limits of our knowledge of what plants can produce chemically and how. For example, genome mining of the P. somniferum genome revealed 84 MGCs, among which one cluster encoding the pathway for morphinan and noscapine has been functionally validated [49,149]. Tomato (Solanum lycopersicum) has 47 predicted MGCs, four of which have been associated with alpha-tomatine [144], lycosantalonol [150], fatty acid [145], and hydroxycinnamic acid amide [151] biosynthesis. A six-gene cluster for taxadiene biosynthesis, involved in the first two biosynthetic steps, was recently identified in the Taxus genome and should help to decode the complete taxol biosynthetic pathway in the future [139]. Wheat genome mining combined with transcriptomic analysis recently identified six pathogen-induced biosynthetic pathways encoded by MGCs, producing flavonoids and terpenes that could potentially be used as phytoalexins in disease control [152]. Combining transcriptomic and metabolic profiling data is useful for linking MGCs with particular metabolites, as in the association of a four-gene cluster with falcarindiol biosynthesis in tomato [145], although this strategy is limited by the lack of co-expression in many MGCs.
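The working definition given above, at least three non-homologous biosynthetic genes located close together, can be turned into a deliberately simple Python sketch of positional genome mining. This is not the algorithm used by any of the published tools cited in the text; the 50 kb maximum gap between neighbouring enzyme genes, the gene identifiers, and the enzyme family labels are hypothetical, and real predictors add domain annotation, co-expression evidence, and further filters.

```python
def find_candidate_clusters(genes, max_gap=50_000, min_families=3):
    """Group neighbouring enzyme genes and keep groups with diverse families.

    genes: iterable of (gene_id, start_bp, enzyme_family) on one chromosome;
           enzyme_family is None for genes without a biosynthetic annotation.
    max_gap: largest distance in bp allowed between consecutive enzyme genes
             in one group (an illustrative assumption).
    min_families: minimum number of distinct enzyme families per cluster,
                  used here as a stand-in for "non-homologous genes".
    """
    enzymes = sorted(
        (g for g in genes if g[2] is not None), key=lambda g: g[1]
    )
    clusters = []
    current = []
    for gene in enzymes:
        # Close the running group when the next enzyme gene is too far away.
        if current and gene[1] - current[-1][1] > max_gap:
            if len({g[2] for g in current}) >= min_families:
                clusters.append([g[0] for g in current])
            current = []
        current.append(gene)
    # Evaluate the final group after the loop ends.
    if len({g[2] for g in current}) >= min_families:
        clusters.append([g[0] for g in current])
    return clusters


# Hypothetical annotation for one chromosome.
genes = [
    ("g1", 10_000, "P450"),
    ("g2", 30_000, None),                 # non-enzyme gene, ignored
    ("g3", 45_000, "methyltransferase"),
    ("g4", 80_000, "terpene_synthase"),
    ("g5", 500_000, "P450"),              # tandem duplicates only:
    ("g6", 520_000, "P450"),              # one family, so not a cluster
    ("g7", 900_000, "acyltransferase"),
]

clusters = find_candidate_clusters(genes)
```

Only the first group (g1, g3, g4) is reported in this toy example, because the tandem P450 pair contains a single enzyme family. A candidate found this way is merely a hypothesis that still requires the functional validation discussed in the text.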
Despite the genomic discovery of hundreds of plant MGCs, questions remain regarding this specialized genetic architecture. First, how did plants gain gene clusters during evolution? MGCs in bacteria and fungi are commonly formed through gene duplication, translocation, and horizontal gene transfer (HGT). With a few exceptions [153,154], HGT is uncommon in plants, so there must be special mechanisms for MGC formation in plants. Recent genomic analyses [57,155] have shown that structural variation events such as whole-genome duplication, chromosome fission and fusion, and gene duplication, translocation, and loss are implicated in the birth and evolution of biosynthetic gene clusters, supporting the theory of punctuated evolution in the formation of plant specialized metabolites. Overall, the investigation of plant MGC evolution remains in its infancy and requires comparative analysis of a large number of plant genomes. Such analysis will help us understand the major driving forces of PNP evolution and its ecological impact in nature.
Second, how do the many predicted MGCs actually contribute to PNP biosynthesis? As more candidate MGCs continue to be identified from plant genomes in silico, it is critical to functionally characterize predicted MGCs of unknown function and associate them with potential natural products. Expressing candidate MGCs in a heterologous host such as yeast or E. coli can reveal the synthesized product, as shown by several recent examples [151,156]. For instance, Kong et al. used yeast as a chassis to investigate the function of a tomato gene cluster and discovered a novel naringenin chalcone synthase responsible for the production of dihydro-coumaroyl anthranilate amide [151]. The caveat of this approach is that most predicted plant MGCs are quite large (up to hundreds of kb), presenting a huge challenge to cloning them into bacterial or yeast expression vectors. The plant MGCs successfully cloned and expressed in microbial hosts are mostly mini gene clusters of several kb containing only a few open reading frames. In addition, plant proteins are not always functionally expressed in yeast or E. coli cells, owing to differences in codon usage and post-translational modification between the hosts and plants. Alternatively, validation in a plant host such as Nicotiana benthamiana enables expression of candidate MGCs via Agrobacterium infiltration of plant leaves or cells, followed by metabolomic detection, although it faces the same problem of delivering long MGC fragments into tobacco cells and expressing them there. Recently, a platform for high-throughput secondary metabolite discovery was developed for filamentous fungi by cloning fragmented genomic DNA containing MGCs into fungal artificial chromosomes and expressing it, followed by metabolomic profiling [157]. Application of this or a similar approach has not been reported in plants, probably because of the low genomic fraction of plant MGCs and the lack of a suitable artificial chromosome cloning system. A high-throughput MGC validation platform would expedite the identification and utilization of novel PNPs.

Bioengineering of PNP through plant biotechnology
In nature, PNPs accumulate at low abundance and only in specific tissues and developmental stages of medicinal plants, for two major reasons. First, biosynthesis of these compounds consumes energy and competes with normal plant vegetative growth and reproduction. Second, most PNPs are toxic to cells, so plants have to protect themselves by detoxifying them, storing them in compartments, or producing the toxic chemicals only when and where needed. Evolutionarily, PNPs have probably undergone natural as well as human selection. For example, P. somniferum accumulates high levels of the painkiller morphine in capsules rather than in other tissues, and the amount of morphine produced differs among cultivars [158]. By contrast, its close relative Papaver rhoeas produces only a trace amount of morphine [57]. This suggests that the ability to produce morphine has been under natural selection in poppy plants and has likely also been selected during domestication and breeding.
The naturally low content of PNPs in medicinal herbs is a major bottleneck in drug development and clinical therapeutics. The structures of PNPs are often too complex for profitable production by total chemical synthesis. Therefore, plant extraction remains the primary commercial source of most PNPs for pharmaceuticals, causing over-exploitation of natural resources and destabilizing the Earth's ecosystems. For instance, demand for taxol, a well-known anticancer drug ingredient derived from the bark of the yew tree, once put the yew on the verge of extinction through exhaustive exploitation. The demand for medicinal PNPs thus stimulates the breeding of superior germplasm for the sustainable use of medicinal herbs. There are many challenges in breeding medicinal plants for high PNP yield, including limited understanding of exactly how PNPs are made and regulated by plants, the lack of high-quality genome sequences, annotations, and molecular markers, long breeding cycles, and the difficulty of genetic transformation. Herb genomic research has accelerated the identification of functional genes and genome-wide molecular markers, linked molecular markers with desired traits, and improved the breeding of medicinal herbs. Many efforts have been made to increase PNP yield through plant breeding and biotechnological improvement, including artemisinin in A. annua [159,160], morphine in P. somniferum [161], and THC (tetrahydrocannabinol) in Cannabis sativa [162]. A. annua has been the primary source of the anti-malaria drug artemisinin. Transcriptome sequencing of A. annua enabled the construction of a genetic linkage map and the identification of quantitative trait loci (QTL) that control artemisinin yield [160], providing genetic resources for molecular breeding. Phenotypic selection coupled with molecular breeding in the past decade has led to the production of Artemisia F1 Seed (https://www.artemisiaf1seed.org), increasing the artemisinin yield from 5 kg per hectare to 55 kg per hectare, with an artemisinin content of 1.44% of dry weight [159]. To date, A. annua remains the sole source of artemisinin globally, although metabolic engineering of artemisinin has been reported [163].
The rise of modern biotechnology provides a novel strategy for precise and expedited medicinal plant breeding. A key breakthrough is genome editing technology, most notably the CRISPR-Cas9 system, which allows precise modification of genome bases to obtain traits of interest at an unprecedented pace [164]. This biotechnology has powered the next generation of plant breeding to improve crop yield and quality traits such as PNP content. Unlike in model and crop plants, genome editing of medicinal herbs is still in its infancy, hindered by the lack of genomic information and of reliable genetic transformation systems. Nevertheless, CRISPR-Cas9-based genome editing aimed at optimizing the production of pharmacological components has been reported in several medicinal plants, including P. somniferum [165], S. miltiorrhiza [166], Dendrobium officinale [167], and Camelina sativa [168]. Notably, genome editing successfully targeted three FAD2 (fatty acid desaturase 2) genes in allohexaploid Camelina sativa and enhanced seed fatty acid levels [168]. CRISPR-Cas9-mediated gene deletion significantly decreased the benzylisoquinoline alkaloid flux in transgenic opium poppies [165]. Plant genetic engineering offers clear advantages in improving the yield of PNPs [169], as plant chassis naturally carry many fundamental plant biosynthetic gene circuits, making them natural cell factories for producing PNPs of interest. A paradigmatic case is Golden Rice [170], in which the whole β-carotene biosynthetic pathway was introduced into rice endosperm using Agrobacterium-mediated co-transformation to generate rice plants with a carotenoid content of up to 1.6 µg/g in the endosperm. Moreover, a high-efficiency vector system was developed for transgene stacking to engineer anthocyanin biosynthesis in rice endosperm [171]. In addition, Zhu et al. [172] developed a novel method called combinatorial nuclear transformation to generate multiplex-transgenic plants, allowing five carotenogenic genes to be simultaneously transferred into a white maize through biolistic transformation and resulting in transgenic plants with elevated levels of β-carotene.
Besides modifying existing metabolic pathways, genetically engineered plant chassis offer a cheap and sustainable source of high-value PNPs. The de novo production of PNPs in model organisms such as Arabidopsis, tobacco, tomato, and moss has progressed rapidly in recent years. For example, Fuentes et al. [173] transferred the entire artemisinic acid metabolic pathway from A. annua into the tobacco chloroplast genome using combinatorial supertransformation of transplastomic recipient lines (COSTREL). Plants with high artemisinic acid levels were then isolated by screening large populations of transplastomic lines. In addition, strategies to increase terpenoid production, including overexpression of rate-limiting enzymes, chloroplast-compartmentalized engineering, and integration of transcription factors, have been applied to the production of momilactone [174] and taxadiene [175] in tobacco. Momilactones are a group of diterpenes with allelopathic activity found predominantly in rice. By changing the subcellular localization of prenyltransferase and diterpene synthases, diterpene biosynthesis was rerouted from the chloroplast to the cytosolic mevalonate (MVA) pathway, significantly promoting the production of momilactone in N. benthamiana [175]. Notably, this strategy also enabled the discovery of missing steps in the momilactone B pathway, providing insights into pathway reconstitution and elucidation for desired products.
Metabolic engineering of in vitro plant tissue cultures, such as suspension cell cultures and hairy root cultures, is another efficient approach to producing valuable phytochemicals. Hairy roots are induced by Agrobacterium rhizogenes-mediated transformation and can serve as high-capacity bioreactors that produce PNPs without the need for light or hormones. Hairy root cultures have been successfully induced to overproduce cannabinoids in C. sativa [176], phytosterols and ginsenosides in Panax ginseng [177], and curcumin in Atropa belladonna [178]. Suspension cell culture has also made great advances in producing valuable PNPs at high yields. Plant Cell Fermentation (PCF®) Technology (https://phytonbiotech.com/) can produce natural taxol directly from plant cells of Taxus chinensis var. mairei, while metabolically engineered grapevine cells were able to produce resveratrol derivatives when elicited with methyl jasmonate (MeJA) and methylated cyclodextrins [179]. In suspension cell culture systems, the addition of heterologous elicitors induces biosynthetic gene expression and increases the production of PNPs. Glandular trichomes are hairy structures differentiated from epidermal cells and are characterized by their enormous capacity to synthesize, store and secrete large quantities of diverse metabolites, making them an excellent platform for decoding PNP biosynthetic pathways and efficient phytochemical factories for producing PNPs [180]. Kortbeek et al. [181] engineered tomato glandular trichomes by overexpressing a farnesyl diphosphate synthase, which resulted in a decline of monoterpenoid production in the trichomes.

Synthetic biology: a green revolution for PNP bioengineering and industrialization
Although plant breeding can produce cultivars that accumulate higher levels of metabolites than others, in most cases it remains inefficient and environmentally unsustainable for mass production for commercial use. Synthetic biology, in which microbial chassis (e.g. yeast and bacteria) are used to produce PNPs at scale, offers a more effective and environmentally friendly alternative. Synthetic biology aims to design microbial cell factories carrying genetic circuits made of biosynthetic genes for the heterologous production of PNPs. It has several advantages over plant-based extraction, such as circumventing the need to grow plants, rapid production, and little interference from the natural environment [182]. With the development of synthetic biology tools and growing knowledge of biosynthetic pathways, PNPs such as artemisinic acid [183], amorphadiene [163], taxadiene [36], cannabinoids [184], morphine [38], and noscapine [185] have been successfully produced in engineered microbial cells. A prominent example is the semi-synthesis of artemisinin [183], in which an engineered amorphadiene-producing yeast [38] produces artemisinic acid at a titer of 25 g/L, which is then converted to artemisinin by chemical synthesis. This opens a route to the industrial production of artemisinin to meet the urgent demand for anti-malarial drugs. In parallel, Luo et al. [184] partitioned the cannabinoid metabolic pathway into three modules: an engineered S. cerevisiae mevalonate pathway that directs more flux to geranyl pyrophosphate; a hexanoyl-CoA biosynthetic pathway together with several Cannabis genes to accumulate more olivetolic acid; and a heterologous downstream pathway to form the corresponding cannabinoids. The engineered yeast strains yielded 1.6 mg/L of cannabinoid from the simple sugar galactose, laying a foundation for the large-scale production of cannabinoids.
Despite this progress, it remains challenging to use microbes as chassis to synthesize high-value PNPs [182]. First, decoding the biosynthetic pathway is a prerequisite for successful heterologous production but remains a challenge for the vast majority of PNPs. For instance, previous attempts to reconstruct the taxol pathway in microbes, made without knowledge of the downstream steps that convert taxadiene to taxol, yielded only taxadiene, the first committed intermediate [185]. Combining bioinformatics with functional validation in heterologous hosts to resolve PNP biosynthetic pathways from sequencing data plays a key role in synthetic biology today (Fig. 3). For example, analysis of the Taxus genome identified a functional grouping of CYP725As and a taxadiene gene cluster, which will facilitate the future elucidation of taxol biosynthesis [55,146]. Second, choosing and optimizing a microbial host is essential to maximize the yield of PNPs. E. coli and S. cerevisiae are two of the most widely used microorganisms for engineering biosynthetic pathways, given their fast growth, well-characterized genetic backgrounds and well-established genetic manipulation methods [36,38,163,183-185]. However, it is challenging to express functional plant-derived cytochrome P450 genes in the prokaryote E. coli, which lacks an intracellular organelle system, post-translational modification and electron transfer machinery, limiting its application in the heterologous production of many PNPs [36]. For example, E. coli engineered to produce artemisinic acid reached a high titer of 20 g/L amorphadiene but achieved only 1 g/L artemisinic acid [186], much lower than the aforementioned 25 g/L artemisinic acid in S. cerevisiae. Apart from the commonly used E. coli and S. cerevisiae, non-conventional chassis cells are also used, for example in the industrial production of 4-hydroxybenzoic acid by C. glutamicum [187] and of (+)-nootkatone by Pichia pastoris [39].
Third, reconstituting an efficient module is key to the heterologous production of PNPs. To construct heterologous expression cassettes, codon optimization of plant genes is often required for microbial expression [36,38]. Complex biosynthetic pathways are partitioned into several modules [36,184], and the resulting strains are then optimized with synthetic biology tools.
Finally, the balance between microbial growth and PNP yield should be considered, as many PNPs and their intermediates, such as alkaloids and phenols, are toxic to the microorganisms. Taken together, with the rapid development of technologies including genome sequencing, multi-omic data mining, structural biology, directed evolution of proteins, and design of novel proteins, the efficiency of synthesizing PNPs in microorganisms is expected to improve greatly in the near future.
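As a rough illustration of the codon optimization step mentioned above, the following minimal sketch re-encodes a plant coding sequence with one preferred synonymous codon per amino acid. The preference table PREFERRED_CODON and the short test sequence are illustrative assumptions, not measured S. cerevisiae codon usage; practical tools also weigh factors such as GC content, mRNA structure and restriction sites, which are ignored here.

```python
# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))

# ASSUMPTION: one "preferred" codon per amino acid for a yeast-like host.
# These choices are illustrative only and are not taken from the review.
PREFERRED_CODON = {
    "A": "GCT", "R": "AGA", "N": "AAC", "D": "GAT", "C": "TGT",
    "Q": "CAA", "E": "GAA", "G": "GGT", "H": "CAC", "I": "ATT",
    "L": "TTG", "K": "AAG", "M": "ATG", "F": "TTC", "P": "CCA",
    "S": "TCT", "T": "ACT", "W": "TGG", "Y": "TAC", "V": "GTT",
    "*": "TAA",
}


def translate(cds: str) -> str:
    """Translate a coding DNA sequence into one-letter amino acids."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def codon_optimize(cds: str) -> str:
    """Replace every codon with the host-preferred synonymous codon."""
    return "".join(PREFERRED_CODON[aa] for aa in translate(cds))


if __name__ == "__main__":
    plant_cds = "ATGGCCCTGAGCTGA"  # hypothetical fragment encoding M-A-L-S-stop
    optimized = codon_optimize(plant_cds)
    print(optimized)
    # The encoded protein must be unchanged by the re-encoding.
    print(translate(optimized) == translate(plant_cds))
```

The key property of this design is that only the nucleotide sequence changes; the protein encoded before and after optimization is identical.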

Opportunities and challenges for future PNP research
It is an exciting time to study PNPs in this golden era of chemical biology. New technologies in both experimental and computational sciences are emerging at an unprecedented pace. These technologies have enabled the untangling of complex biosynthetic pathways and networks using systems biology approaches, providing insight into the mechanistic and evolutionary basis of the chemical and structural diversity of natural products. Accurate and complete assembly of medicinal plant genomes is the key to understanding genome function and evolutionary patterns and to improving plant traits through breeding and synthetic biology. Horticultural plant genomes are notoriously difficult to assemble because of their wide range of genome sizes (up to hundreds of Gb), various ploidy levels, high repeat content and heterozygosity. Yet with genome technologies advancing quickly in recent years, chromosome-scale assembly, once a rarity, is now a realistic target for most genome projects. A typical genome project now combines different technologies, including long-read and short-read sequencing data, 10× Genomics data, Hi-C sequencing data and Bionano optical genome maps. Bioinformatic algorithms that leverage these technologies continue to be developed to solve the complex problems of genome assembly and annotation [243]. As a result, a growing number of medicinal plants now have chromosome-level genome assemblies, and some offer obvious improvements over previous versions of the assembly and annotation [73,244]. Despite this progress, a long journey lies ahead to resolve the complete genome sequences of medicinal plants.

[Table 2 excerpt (transcription factors; target genes; metabolites): GbMYBF2 [195]; GbCHS, GbF3H, GbPAL, GbFLS, GbANS, GbCHI; quercetin, kaempferol, anthocyanin. P. hybrida: AN2 [191], AN4 [191]; dfrA, Pmyb27; anthocyanin. M. rubra: MrMYB1 [192]; MrCHI, MrF3'H, MrDFR1, MrANS, MrUFGT; anthocyanin. A. majus: AmMYB305 [196], AmMYB340 [197], Rosea1 [198], Rosea2 [198], Venosa [198]; PAL. Benzyl isoquinoline alkaloid.]
Twenty years after the first plant reference genome was produced, the majority of plant reference genomes initially assembled from short reads remain unimproved. Recently, a telomere-to-telomere (T2T) genome assembly was produced for the human cell line CHM13, resolving a nearly complete haploid human genome sequence [245]. The achievement relied largely on high-fidelity PacBio long reads (HiFi reads), in addition to ultra-long ONT reads and Hi-C reads. This is not just a milestone in human genomics; it also has a profound impact on animal and plant genomic research by kicking off an ambitious journey to resolve the complete genome sequences of all known living organisms on this planet. In fact, shortly after the publication of the human T2T genome assembly, the nearly complete genome sequence of A. thaliana [246,247] and gap-free reference genomes of rice [248] and watermelon [249] were reported. The nearly complete plant genome sequences reveal the tandem satellite repeats located in centromeres and uncover new genes that were not accessible in previous assembly versions. T2T plant genome sequences will have a huge impact on plant biology, allowing researchers to grasp the full complement of genetic elements associated with traits including growth, development, and physiology. Another technical challenge for plant genome assembly is genome phasing: rather than generating a collapsed diploid assembly, producing a haplotype-resolved assembly that separates the paternal and maternal haplotypes. A phased assembly helps in understanding the mechanisms behind heterosis and allele-specific gene regulation, which contribute to many plant biological processes. To date, only a handful of plant genomes have been assembled and phased, including tea [250], lychee [251], and pear [252].
Commonly, genome phasing is performed by trio binning, in which the genomes of the parents are used to untangle the two haplotypes of their offspring, although parental information is often unavailable for medicinal plants. In such cases, recently developed genome assembly methods such as hifiasm [253] can resolve haplotypes using HiFi reads and Hi-C data without relying on sequencing data from family trios.
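The idea behind trio binning can be sketched in a few lines: k-mers found in only one parent are used to assign each offspring read to a haplotype before assembly. The toy example below is a simplified illustration under stated assumptions; the sequences, the k-mer size and the function names are hypothetical, and short parental sequences stand in for the parental read sets that real pipelines count k-mers from. It is not the implementation used by any published assembler.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(paternal: str, maternal: str, k: int):
    """Split k-mers into those unique to the father and unique to the mother."""
    pat, mat = kmers(paternal, k), kmers(maternal, k)
    return pat - mat, mat - pat


def bin_read(read: str, pat_only: set[str], mat_only: set[str], k: int) -> str:
    """Assign a read to the haplotype whose private k-mers it shares most."""
    read_kmers = kmers(read, k)
    pat_hits = len(read_kmers & pat_only)
    mat_hits = len(read_kmers & mat_only)
    if pat_hits > mat_hits:
        return "paternal"
    if mat_hits > pat_hits:
        return "maternal"
    return "unassigned"  # no informative k-mers, or a tie


if __name__ == "__main__":
    K = 4  # toy value; real data needs much longer k-mers
    paternal = "ACGTACGTTAGC"  # hypothetical haplotypes differing at one base
    maternal = "ACGTACGATAGC"
    pat_only, mat_only = parent_specific_kmers(paternal, maternal, K)
    for read in ["ACGTTAGC", "ACGATAGC", "ACGTACG"]:
        print(read, bin_read(read, pat_only, mat_only, K))
```

Reads that cover only sequence shared by both parents stay unassigned, which is why heterozygosity and read length strongly affect how much of a genome can be phased this way.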
Unraveling the regulatory mechanisms underlying PNP biosynthesis will help improve the quality of traditional Chinese medicine through genetic breeding and metabolic engineering. Genomics and metabolomics of plant tissues have shown that the genes of PNP biosynthetic pathways are often co-expressed in specific tissues [49,139]. So far, it remains elusive how PNP biosynthetic enzymes and pathways are regulated in such a spatially and temporally fine-tuned manner. Studies of the transcriptional regulation of PNP biosynthetic pathways are still limited, and most focus on specialized medicinal plants with well-known PNPs, high-quality genomic data, and an established transgenic system. The expression of PNP biosynthetic pathways can be controlled by epigenetic mechanisms such as chromatin topology dynamics, as shown by Nützmann et al. in the model plant Arabidopsis [254]. Thus, it will be exciting to reveal what roles 3D genome architecture and organization play in regulating the biosynthesis of flavonoids, terpenoids, and alkaloids in medicinal plants. In addition, several transcription factors that regulate the process have been identified over the years (Table 2), offering targets for enhancing PNP accumulation, potentially via overexpression (activators) or silencing (repressors). However, stably transforming most medicinal plants remains a challenge, and obtaining transgenic plants often takes years even when the approach shows promise of significantly improving PNP levels. Another caveat is that overexpressing any epigenetic or transcriptional regulator, which is usually entangled in complex regulatory networks, can lead to undesired traits. Therefore, a systems biology approach is essential to tease apart the PNP-specific gene circuits for precise modulation of target traits.
Moreover, the picture of the cell heterogeneity that accounts for the cell-type-specific expression of biosynthetic enzymes, transporters, and gene regulators is still blurred. Recently, cutting-edge technologies such as single-cell transcriptomics (scRNA-seq) and spatial transcriptomics have been widely applied to mammalian and plant tissues to generate cell atlases and identify cell types within these tissues. The use of such technologies in PNP research had yet to be reported at the time of writing this review, although spatial metabolomics has been used to investigate the distribution of plant metabolites [255,256]. It will be very interesting to identify the specific cell types expressing PNP biosynthetic genes using scRNA-seq and spatial transcriptomics, guiding the precise design and bioengineering of gene circuits for PNP improvement. Key challenges in applying single-cell genomics to medicinal plants include the lack of high-quality reference genomes and gene annotations, single-cell preparation protocols, tissue- and cell-type-specific markers, and robust methods for integrative analysis of omic profiles at the single-cell level. Protocols and methods developed for human and other mammalian samples are well established but remain to be tested and optimized for medicinal plant studies.
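Because pathway genes tend to be co-expressed across tissues, as noted above, a common first pass at finding candidate pathway genes is to rank genes by how closely their expression profile follows that of a known pathway gene. The sketch below does this with a plain Pearson correlation; the gene names, tissue order and expression values are hypothetical and serve only to show the calculation, and real analyses would use normalized expression data and correct for multiple testing.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / (sd_x * sd_y)


def rank_coexpressed(bait: str, profiles: dict[str, list[float]]):
    """Rank all other genes by correlation with the bait gene, highest first."""
    scores = [
        (gene, pearson(profiles[bait], values))
        for gene, values in profiles.items()
        if gene != bait
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical expression values in root, stem, leaf and flower.
    profiles = {
        "known_enzyme": [1.0, 2.0, 8.0, 9.0],
        "candidate_A": [2.0, 4.0, 16.0, 18.0],
        "candidate_B": [9.0, 8.0, 2.0, 1.0],
        "housekeeping": [5.0, 5.0, 5.0, 5.0],
    }
    for gene, r in rank_coexpressed("known_enzyme", profiles):
        print(f"{gene}\t{r:.2f}")
```

Genes at the top of such a ranking are only candidates; as the text stresses, functional validation in a heterologous host is still needed to confirm their role.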
Last but not least, plant-microbe interactions also have a big influence on the profiles and abundance of PNPs. The importance of growing location and environment to the medicinal value and pharmaceutical properties of TCM has long been recognized and documented in traditional medicine scriptures. The differences could be due to a combination of factors such as the ecological environment, the weather and perhaps the microbiota inhabiting the soils where the TCM plants grow. Although the exact composition of the microbial communities involved in modulating PNPs and the mechanisms of regulation remain elusive, an association between microbiota and medicinal properties in plants has been suggested in several recent studies. An outstanding example of such an association has been reported in the model plant A. thaliana, which produces specific types of triterpenoids to selectively modulate bacteria in the root microbiome [257]. Interestingly, isolation and re-inoculation of these bacteria in the roots can induce increased production of these very PNPs. It remains to be studied how different types of microbes form microbial communities and signaling networks to interact with plants, either internally (endophytes) or externally (e.g. on the surface of roots and leaves), and contribute to the accumulation of specific PNPs. Equally interesting is the regulation of plant microbiota by plant metabolites during plant-microbe interactions. Compared with model and crop plants, microbiome studies of medicinal plants have been quite limited. Combining culture-based and culture-free microbiome analysis with functional metabolomics of the medicinal plant rhizosphere and endophytes will help identify associations between plant microbiota and specialized metabolites.
With this knowledge, it will be possible in the future to modulate the production of specialized natural products in controlled settings such as greenhouses and plant factories, far more efficiently and effectively than harvesting them from traditionally grown authentic medicinal herbs.