Genome sequencing of four culinary herbs reveals terpenoid genes underlying chemodiversity in the Nepetoideae

Abstract Species within the mint family, Lamiaceae, are widely used for their culinary, cultural, and medicinal properties due to production of a wide variety of specialized metabolites, especially terpenoids. To further our understanding of genome diversity in the Lamiaceae and to provide a resource for mining biochemical pathways, we generated high-quality genome assemblies of four economically important culinary herbs, namely, sweet basil (Ocimum basilicum L.), sweet marjoram (Origanum majorana L.), oregano (Origanum vulgare L.), and rosemary (Rosmarinus officinalis L.), and characterized their terpenoid diversity through metabolite profiling and genomic analyses. A total 25 monoterpenes and 11 sesquiterpenes were identified in leaf tissue from the 4 species. Genes encoding enzymes responsible for the biosynthesis of precursors for mono- and sesqui-terpene synthases were identified in all four species. Across all 4 species, a total of 235 terpene synthases were identified, ranging from 27 in O. majorana to 137 in the tetraploid O. basilicum. This study provides valuable resources for further investigation of the genetic basis of chemodiversity in these important culinary herbs.


Introduction
The Lamiaceae (mint) family is among the largest angiosperm families, containing $7,000 species that occupy a wide geographic distribution 1 and are commonly recognized by their square stems, opposite leaves, and lobed inflorescences. The Lamiaceae is not only rich in species number and diversity, but also in the production of specialized metabolites. The largest class of these metabolites, terpenes, exhibit substantial chemodiversity and broad variation in abundances across Lamiaceae. 2,3 Recent molecular-based phylogenetic analyses support 10 À 12 major clades within the mint family, the largest of which is the Nepetoideae that contains approximately half of all Lamiaceae species. 3,4 The Nepetoideae also has the greatest diversity of monoterpenes 5 among the mint clades, making it a robust clade to study the relationship between terpenoid diversity and species diversity.
All terpenoids are derived from two universal precursors, isopentenyl diphosphate (IPP) and its isomer dimethylallyl diphosphate (DMAPP), which are synthesized in plants via two independent pathways: the methylerythriol phosphate (MEP) pathway in the plastid and the mevalonic acid (MVA) pathway distributed among the cytosol endoplasmic reticulum and peroxisomes. IPP and DMAPP then serve as substrates for short-chain prenyltransferases, which produce the prenyl diphosphates, geranyl diphosphate (GPP) in the plastid and farnesyl diphosphate (FPP) in the cytosol. Finally, GPP and FPP are converted into monoterpenes and sesquiterpenes, respectively, by the action of enzymes of the terpene synthase (TPS) superfamily. Product promiscuity of TPSs and enzymes modifying TPS products are the main sources of terpene structural diversity in plants. 6 A number of Nepetoideae species are used as culinary herbs due to their production of specialized metabolites that impart unique flavour profiles. Sweet basil (Ocimum basilicum L.), e.g. exhibits a wide range of phenotypes and chemotypes 7 and is commonly used in pesto sauce. Although ploidy varies within O. basilicum, the cultivar 'Genovese' is tetraploid (2n ¼ 4x ¼ 48) with a genome size of 4.1 À 4.7 Gb estimated by flow cytometry. 8,9 Rosemary (Rosmarinus officinalis L.), in contrast, is an evergreen shrub with greyÀgreen needle-like leaves that emit a strong fragrance due to multiple aromatic volatiles. 10,11 Rosmarinus officinalis and its essential oils have been widely used in cultural practices, culinary flavourings, medicinal remedies, and pest deterrents. 12 Oregano (Origanum vulgare L.) and sweet marjoram (Origanum majorana L.) are both members of the Origanum genus, having a shrub-like architecture and small, ovate leaves. Their relatedness is evidenced biologically and culturally as Origanum spp. can make interspecific hybridizations 13 and some regions of the world refer to oregano and marjoram interchangeably. Young leaves of both Origanum spp. are harvested, dried, ground, and added to culinary dishes to bestow a spicy, bittersweet flavour. Three major chemotypes have been described for O. vulgare: acyclic, cymyl, and sabinyl. 14 Chemotypes of O. majorana are less-defined due to nomenclature challenges and the presence of distillation artefacts 15,16 but both cymyl and sabinyl compounds have been widely documented. 17,18 All four species were part of the 1k Plant Transcriptome Initiative 19,20 and leaf transcriptomes were generated to examine the evolution of green plants as well as for understanding the evolution of chemodiversity within the Lamiaceae. 3 In addition, for O. basilicum, transcriptomes were generated for cultivars 'Tiguillo' and 'Red Rubin' 21 as well as 'CIM Saumya', 22 and a genome assembly of the O. basilicum cultivar 'Perrie' has recently been reported although the actual sequence is not currently available. 23 There is a growing number of Lamiaceae species with assembled genomes (see Supplementary Dataset S1) that have enabled identification of genes involved in specialized metabolism and a broader understanding of genome organization with respect to specialized metabolism. 24 However, the genetic repertoire encoding chemical diversity within the culinary herbs remain largely unexplored. Here, we here report the genome sequence, annotation, and metabolite profiling of four culinary herbs and describe their repertoire of terpenoid biosynthetic genes that will provide a resource for data-mining not only terpenoid biosynthetic pathway genes but also other genes that function in specialized metabolism. were grown in a growth chamber under a 14-/10-h day/night cycle with a daytime temperature of 27 C and a night-time temperature of 15 C; light intensity in the chamber was 210 lE m À2 s À1 . Origanum majorana and O. vulgare were grown in a greenhouse under a 15-/9h day/night cycle at a temperature of 26.6 C. Plant management included weekly fertilizing and pesticide application as necessary. Flow cytometry was performed on leaf samples at the Benaroya Research Institute (Seattle, WA), and k-mer estimated genome sizes were determined using Jellyfish v2.2.0 25 with a k-mer size of 31 and adjusted for heterozygous sequences using the R package findGSE v0.1.0. 26

DNA and RNA isolation
Nuclei were isolated from leaf tissue following a previously described protocol 27 with an input of 1 À 2 g of ground tissue; spin speeds were used based on estimated genome size, 2,700, 3,030, 3,030, and 2,900 g for O. basilicum, O. majorana, O. vulgare, and R. officinalis, respectively. DNA was extracted using the Nanobind Plant Nuclei Big DNA Kit (Circulomics, Baltimore, MD, Cat No. NB-900-801-01). RNA was extracted from mature leaf tissue using a hot phenol protocol, 28 and DNA was removed using the TURBO DNAfree TM Kit (Invitrogen, Carlsbad, CA, Cat No. AM1907). Quality and concentrations were verified by Nanodrop, Qubit, and agarose gel electrophoresis.

Library construction, sequencing, and expression abundance estimation
Genomic libraries were constructed using 10Â Genomics Technology (Chromium TM Genome Library Kit & Gel Bead Kit v2; Pleasanton, CA, USA) and sequenced at the Roy J. Carver Biotechnology Center at the University of Illinois at Urbana-Champaign. Sequencing was performed on an Illumina NovaSeq 6000 at 150 nt in paired-end mode. Libraries were pooled with an aim of 65Â coverage for each species. RNA-Seq libraries were constructed using the Illumina TruSeq Stranded mRNA Kit with polyA mRNA selection and IDT for Illumina Unique Dual Index primers (Illumina, San Diego, CA, USA) and sequenced at the Michigan State University Research Technology Support Facility. RNA-Seq libraries were sequenced on an Illumina HiSeq 4,000 at 150 nt in paired-end mode. Reads were cleaned with Cutadapt v2.3, 29 which trimmed adapters and 3 0 bases with a quality score <10, and only kept reads of at least 100 nt. After cleaning, reads were aligned to their respective genomes using HISAT2 v2.1.0 30 with the following parameters set: -dta-cufflinks, -max-intronlen 5000, and -rna-strandness RF. Cufflinks v2.2.1 31 was run in stranded mode to generate expression abundances (fragments per kilobase of exon model per million mapped reads).

Genome assembly and annotation
The 10Â Genomics reads were demultiplexed and assembled using Supernova v2.1.1, 32 with -maxreads set to 900 million, 330 million, 259 million, and 450 million reads for O. basilicum, O. majorana, O. vulgare, and R. officinalis, respectively. Scaffolds containing only N sequences were removed from the final assemblies. Custom repeat libraries (CRL) were generated for each species using RepeatModeler v1.0.8 (http://www.repeatmasker.org; last accessed April 2020) as described previously. 33 Genome assemblies were masked with their respective CRLs using RepeatMasker v4.0.6 (http://www.repeat masker.org; last accessed April 2020). Gene prediction on the masked assembly was performed using Augustus v3.1 34 with a matrix trained for the Nepetoideae species, Hyssopus officinalis L. 35 To refine the gene models, leaf RNA-Seq libraries were cleaned and used to generated genome-guided transcript assemblies using Trinity v2.3.2 36 with a maximum intron size of 5,000 and a minimum contig length of 500 nt in stranded mode. The genome-guided transcript assemblies were used with PASA2 v2.3.3 37 to create the working gene model set. To identify high confidence gene models from the working gene model set, the gene models were searched against PFAM v32.0 38 using HMMER v3.2.1 (hmmer.org) with search cutoffs-domE 1eÀ3 ÀE 1eÀ5, and gene abundances of the leaf RNA-Seq library were calculated using Kallisto v0.46.0. 39 Gene models that were not partial models, did not contain an internal stop codon, not transposable element-related, and had a PFAM domain match or a TPM > 0 were selected as high confidence models. Functional annotation of the high confidence gene models was generated as described previously. 33 Gene ontology (GO) terms were assigned to the representative high confidence gene models using IPRscan v5.34.73.0. 40

Genome sequence and annotation quality assessment
To assess the completeness of the assembly, whole genome shotgun libraries were processed using Cutadapt v2.3 29 and reads were aligned to their respective assemblies with BWA-MEM v0.7.16a. 41 Paired-end RNA-Seq libraries constructed from leaf tissue were aligned to the assemblies using HISAT2 v2.1.0 30 using stranded mode with a maximum intron length of 5,000 bp. Coverage of the genic space was assessed using BUSCO v3.0.2b 42,43 with the Embryophyta odb9 dataset (creation date: 2016-02-13, number of species: 30, number of BUSCOs: 1,440) to detect conserved orthologs in the assemblies.

Extraction and analysis of terpenoids by gas chromatographyÀmass spectrometry
The same leaf tissue harvested for RNA extraction was used for terpenoid profiling. Tissue from each species (0.2 g) was ground in liquid nitrogen and extracted overnight with shaking at room temperature with 5 ml of dichloromethane containing 6.6 lg of the internal standard naphthalene. After centrifugation, the solvent containing the extracted metabolites was transferred to a new glass tube and concentrated to $180 ml under nitrogen gas. 44 Subsequently, gas chromatographyÀmass spectrometry (GCÀMS) analysis was performed on an Agilent 6890 gas chromatograph (Agilent Technologies) equipped with a HP-5MS column (30 m, 0.25 mm, 0.25 lm; Agilent Technologies) and coupled to an Agilent 5975B insert MSD quadrupole mass spectrometer (Agilent Technologies). Each sample (2 ll) was injected at a pulsed splitless mode at 250 C. The column temperature was held at 50 C for 2 min, followed by increased to 320 C at 20 C min À1 , and held at 320 C for 4.5 min. Helium was applied as a carrier gas at a flow rate of 1 ml min À1 . MS ionization energy was set at 70 eV, and the mass spectrum was scanned from 50 to 300 amu. Three biological replicates were used for metabolite profile analysis for each species. Compounds were identified by comparing retention times and mass spectra with those of commercially available authentic standards including a-pinene, bpinene, a-phellandrene, a-terpinene, cis-b-ocimene, c-terpinene, terpinolene, linalool, geraniol, b-caryophyllene, and caryophyllene oxide as well as by comparing mass spectra to the National Institute of Standards and Technology (NIST) Mass Spectral Library v2.2. Quantification of terpenoids was performed using the Mass Hunter quantitative software (Agilent Technologies, v. B. 07.01) using response factors relative to the internal standard determined experimentally for the commercially available authentic standards a-pinene (representative monoterpene for a-thujene, a-pinene, camphene), b-pinene (representative monoterpene for b-pinene and b-myrcene), a-phellandrene (representative monoterpene for a-phellandrene and b-phellandrene), a-terpinene (representative monoterpene for a-terpinene, c-terpinene, o-cymene, cis-b-ocimene, terpinolene), geraniol (representative monoterpene alcohol), b-caryophyllene (representative sesquiterpene), and nerolidol (representative sesquiterpene alcohol) and normalized to the fresh weight of the tissue.

Comparative genome analyses
Representative peptides from teak (Tectona grandis L.f.) 33 and Arabidopsis thaliana Col-0 45 were included in comparative genome analysis as outgroups for Nepetoideae and Lamiaceae, respectively. Predicted teak peptides were downloaded from GigaDB (http://dx. doi.org/10.5524/100550) on 26 November 2019 and A. thaliana peptides were downloaded from Araport11 on 13 November 2019. Orthofinder2 v2.3.7 46 was run using default settings to identify orthologous and paralogous TPSs in each species. Orthologous groups represented by all species were used to construct and root a consensus species tree with the STAG 47 and STRIDE 48 algorithms, respectively. Gene family expansion and contraction were determined with CAFE v4.2.1 49 using default settings and an ultrametric tree rooted at 125 million years ago based on estimates from multiple studies. [50][51][52][53] Enrichment of GO terms was performed using the Bioconductor package TopGO v2.38.1. 54

Identification of TPS orthologs
Manually reviewed cloned TPS genes from Lamiaceae were retrieved from SwissProt (Supplementary Table S1). TPSs were selected to include species within and outside the Nepetoideae subfamily of interest, as well as discrete clades within the Nepetoideae. Lamiaceae TPSs were used along with annotated A. thaliana TPSs to identify putative orthologs in the predicted proteomes of the four culinary herbs.

Data availability
Raw sequences are available in the National Center for Biotechnology Information Sequence Read Archive under BioProject PRJNA592145. Large files associated with the genomes including genome sequence, annotation, gff, and expression matrices are available in the Dryad Digital Repository under doi https://doi.org/ 10.5061/dryad.jwstqjq6t.

Genome assembly and annotation
Flow cytometry of the four culinary herbs revealed estimated haploid genome sizes consistent with previous studies. The flow cytometry haploid genome estimation of the tetraploid O. basilicum var. 'Genovese' was 2.34 Gb which is within previous estimates of 2.04 Gb 8 to 2.37 Gb. 9 The flow cytometry estimate of haploid genome size for O. majorana (880.2 Mb) is comparable to a recent estimate of 846 Mb, 55 S1); overall, estimation of genome sizes between flow cytometry and k-mer frequency were comparable. We utilized Supernova 32 to assemble the genomes of the four species. As shown in Table 1 Although the Supernova assembler was originally designed for human genomics applications, it has been used to assemble nonhuman animal species like perch 59 and rice coral, 60 as well as diploid plant species such as pepper, 61 snowberry, 62 and maize. 63 Supernova assemblies have been generated for polyploid species including proso millet (Panicum miliaceum), an allotetraploid, 63 and potato (Solanum tuberosum subsp. andigena), an autopolyploid. 64 Our successful assembly of O. basilicum further supports the use of Supernova to generate quality assemblies of polyploid species.
To assess the completeness and representation of genic sequences, whole genome shotgun and RNAseq reads were aligned to their cognate genome assembly. At least 95.8% of whole genome shotgun reads aligned to the assemblies (Supplementary Table S3 (Table 2). Approximately half of the O. basilicum orthologs were present as multiple copies, consistent with its tetraploidy. 8 Of the species assembled in this study, a previous assembly was reported for O. basilicum cultivar, 'Perrie' 23 while the present study assembled the cultivar 'Genovese'. Both cultivars are tetraploid (2n ¼ 4x ¼ 48) yet flow cytometry estimates of haploid genome size differ; 'Perrie' was estimated at 1.59 Gb, 59 while our flow cytometry estimate of 'Genovese' was 2.34 Gb is in agreement with previous estimations 8,9 and estimated size using k-mer frequency as well as from the Supernova programme. The 'Perrie' assembly was 2.13 Gb and the 'Genovese' assembly size generated in this study was 2.07 Gb; these differences may reflect variation in genome size among and within Ocimum spp. 8 Table S5). Assessment of genic completeness using 1,440 BUSCO genes revealed 93.0 and 86.7% of complete genes in 'Perrie' and 'Genovese', respectively; however, our 'Genovese' assembly contained 30.5% of these genes as single-copy compared with 18.5% for the 'Perrie' assembly.
The four genomes were annotated using the gene finder Augustus 34 and the resulting gene models were refined with PASA2 37 using the genome-guided transcript assemblies. The initial working gene model sets were filtered for high confidence genes using expression evidence and/or PFAM domains, resulting in high confidence gene models for O. basilicum  Table S6). The annotation data sets were assessed for completeness using BUSCO, revealing a high proportion of complete single copy orthologs. Specifically, single copy orthologs were considered complete in the high confidence representative gene model set at frequencies of 88.3% (O. basilicum), 90.6% (O. majorana), 89.7% (O. vulgare), and 89.2% (R. officinalis). As expected, the tetraploid O. basilicum gene model sets contained substantially more duplicated orthologs compared with the three other diploid species.
Repeat-masking was performed on the culinary herb genome assemblies to mask repetitive elements (Supplementary Table S7). The proportion of masked bases was not dependent on genome assembly size, as R. officinalis

Mono-and sesqui-terpene profiles of culinary herbs
As terpenoids are produced primarily in the leaves, 60 metabolite profiling of leaf terpenoids from the four species was performed by GCÀMS (Fig. 1). Spectrometric analyses revealed that these plants produce both monoterpenes and sesquiterpenes, with monoterpenes contributing to a higher degree (Supplementary Table S8). A total of 25 different monoterpenes were identified with only b-myrcene produced in all cultivars. Ten monoterpenes were species-specific (e.g. carvacrol was found only in O. vulgare), while the others, such as c-terpinene, were shared by two or three species. The highest monoterpene diversity was detected in R. officinalis, which produced 17 monoterpenes, while the other cultivars synthesized 10 À 11 compounds. The total amounts of produced monoterpenes also varied between the species, ranging from 8.99 mmol g FW À1 in O. vulgare to just 0.52 mmol g FW À1 in O. majorana (Supplementary Table S8). The obtained metabolic profiles were generally consistent with literature reports. [61][62][63][64] In contrast to rich chemical diversity observed for monoterpenes, the amount and spectrum of sesquiterpenes was significantly lower. A total of eleven sesquiterpenes were detected, five of which were unique to a single species. There was no sesquiterpene shared by all four species, although b-caryophyllene was produced by three species. While O. basilicum produced the most diverse spectrum of sesquiterpenes, the highest amount of sesquiterpenes was found in O. vulgare, suggesting that this species is the highest producer of both mono-and sesquiterpenes. Comparative analysis of the most abundant compounds revealed that in species with relatively high levels of terpenoids, O. vulgare and R. officinalis, these are mostly monoterpenes, while in low terpene producers, such as O. basilicum and O. majorana, sesquiterpenes contribute to the overall terpenoid profile. This analysis also revealed that the spectra of most abundant compounds are mostly species-specific (Fig. 2).

Orthologous and paralogous clustering
Orthofinder2 is a software programme that partitions genes according to their phylogenetic ancestry 46 and clusters them into orthologous (orthogroups) and paralogous clusters. Identification and comparison of orthologs within orthogroups may reveal gene duplication or loss over evolutionary time. Thus, we performed this type of analysis for the four Nepetoideae species used in this study along with two additional species included in the analysis as outgroups. Tectona grandis (teak) was used as a non-Nepetoideae Lamiaceae species along with the model species A. thaliana, a member of the Brassicaceae. High confidence representative predicted peptides for these six species, along with curated Lamiaceae TPS obtained from SwissProt, were used as input for Orthofinder2; in total, 219,047 predicted peptides were included. Of these, 200,920 (91.7%) were assigned to 25,660 orthogroups ( Fig. 3A; Supplementary Dataset S2).
Orthogroup occupancy by species was similar for the Lamiaceae species, ranging from 65.5% (T. grandis) to 76.6% (O. basilicum), while orthologous genes from the non-Lamiaceae outgroup A. thaliana were only present in 53.9% of orthogroups. A rooted species tree 47,48 revealed a topology in agreement with a previous Lamiaceae cladogram (Fig. 3B). 3 As expected, O. majorana and O. vulgare were closely related, and teak and A. thaliana were more distantly related to the rest of the Nepetoideae species.

Gene family analysis
Of the 25,660 orthologous groups identified by the Orthofinder analysis, 12,615 were inferred to be present in the most recent common ancestor and were used in the CAFE analysis along with the ultrametric tree. The number of gene families in the observed Lamiaceae species was found to be generally consistent over evolutionary time (Fig. 3C). Compared with the rest of the culinary herbs, a noticeable gene family contraction occurred in the Origanum genus, while O. basilicum shows significant gene family expansion, likely due in part to its tetraploidy. Both non-Nepetoideae outgroups share a similar number and proportion of contracted gene families.
To better understand evolutionary relationships of these four culinary herbs among the ever-growing list of sequenced Lamiaceae spp., we conducted an Orthofinder analysis for all available Lamiaceae predicted proteomes (Supplementary Dataset S1 and Fig.  S2), characterizing 34,998 orthologous groups. Approximately 30% of the orthologous groups contained at least one ortholog from each species. Intra-genus orthologous groups for the Origanum spp. and Nepeta spp. contained the second-and third-most number of nonencompassing intersections. Orthologs from Pogostemon cablin, an octoploid, were represented in the most orthologous groups. The species tree derived from ancestral gene families was in agreement with previous cladograms, 3,4,67 confirming the monophyly of Nepetoideae subfamily and the polyphyly of the Salvia genus, as described previously. 68

Identification of precursor genes and TPS in four culinary herbs
Terpenes are synthesized from common IPP and DMAPP precursors via the MEP and MVA pathways. To examine the terpenoid biosynthetic pathway in the culinary herbs, orthologous groups were queried for genes belonging to A. thaliana MEP and MVA pathways, as identified previously (Supplementary Table S10). 3 All 22 of these A. thaliana MEP/MVA genes clustered into 17 orthologous groups (Supplementary Table S11). Six additional A. thaliana genes also clustered with the MEP/MVA orthogroups OG0001733 and OG0006021, representing five geranylgeranyl phosphate synthases and a putative 1-deoxyxylulose 5-phosphate synthase, respectively. A total of 148 culinary herb orthologs were present among the MEP/ MVA orthologous groups. In 13 of the 17 orthogroups, each culinary herb contained equal to or greater numbers of orthologs than A. thaliana. However, the difference in the number of MEP/MVA orthologs across all species was small as each culinary herb contained one to ten MEP/MVA orthologs, compared with one to six orthologs in A. thaliana and one to three orthologs in T. grandis (Fig. 4A).
TPS enzymes synthesize terpenoids from the GPP and FPP end products of the MEP and MVA pathways. To identify TPS genes in the four culinary herbs, orthogroup occupancy of previously published A. thaliana TPSs 3 as well as curated Lamiaceae TPSs was investigated. The A. thaliana TPSs (n ¼ 34) belong to TPS subfamilies TPSa, TPSb, TPSc, TPSe/f, and TPSg; these genes clustered into 12 orthologous groups and 6 singletons ( Table 3; Supplementary Table  S12). Six of the 12 orthologous groups containing A. thaliana TPSs were unique to A. thaliana; the other 6 orthologous groups contained a total of 151 putative TPSs from the four culinary herbs. Curated Lamiaceae TPSs (n ¼ 26) were also included in the analyses to identify Lamiaceae-specific TPSs (Supplementary Table S1); these 'bait' Lamiaceae TPS genes clustered into 7 orthologous groups containing a total of 212 putative TPS across the culinary herbs (Table 3). Within these orthogroups, O. basilicum contained the most TPSs (n ¼ 128) of the culinary herbs, followed by R. officinalis (n ¼ 35), O. vulgare (n ¼ 26), and O. majorana (n ¼ 23). In these same orthogroups, 57 orthologs from T. grandis were detected along with 8 orthologs from A. thaliana. Orthologous groups OG0000008, OG0000079, and OG0001915 contained TPSs from both A. thaliana and Lamiaceae bait TPSs, representing TPS subfamilies TPSa, TPSb, and TPSc, respectively.
Overall, the culinary herb genomes encoded a total of 235 putative TPS occupying ten orthologous groups with TPS from A. thaliana or the Lamiaceae bait. Nearly four to five times as many TPSs were found in O. basilicum (n ¼ 137) compared with the diploid culinary herbs that contained substantially fewer TPSs: R. officinalis (n ¼ 38), O. vulgare (n ¼ 33), O. majorana (n ¼ 27) (Fig. 4B). Consistent with the relatively high levels and rich chemical diversity of monoterpenes in these culinary herbs (Supplementary Table S8), these TPSs were mainly represented by members of the TPSb subfamily, which includes most of the angiosperm monoterpene synthases. 69 Among the six orthogroups jointly occupied by A. thaliana TPSs and culinary herb orthologs, five orthogroups contained Lamiaceae orthologs in the same or greater quantity than the A. thaliana TPSs (Table 3). Whereas A. thaliana contained one to six orthologs in both the MVA/MEP and TPS related orthogroups, the culinary herbs generally contained more TPS orthologs compared with MEP/MVA orthologs ( Fig. 4; Supplementary Table S11). For example, orthogroup OG0000008 contained six A. thaliana TPSs, all belonging to the TPSb subfamily, while all other mint species were represented by 12 (O. majorana and R. officinalis) to 38 (O. basilicum) orthologs. In OG0000079, a single A. thaliana sesquiterpene synthase gene, TPS21 (AT5G23960) was present along with one and two orthologs in O. majorana and O. vulgare, respectively, and 32 orthologs in O. basilicum. However, this trend did not hold for all orthologous groups. For example, the highest number of orthologs in orthogroups OG0003155 (n ¼ 7) and OG0001915 (n ¼ 4) belonged to T. grandis; the A. thaliana TPSs in these orthogroups were associated with TPSe/f and TPSc subfamilies, respectively. Lineage-specific A. thaliana TPS orthogroups included OG0016169 (n ¼ 5), OG0018766 (n ¼ 3), OG0018926 (n ¼ 3), OG0020906 (n ¼ 2), OG0020971 (n ¼ 2), and OG0021104 (n ¼ 2) in addition to the six singletons.

Physical clustering of specialized metabolite pathways
Physical clustering within the genome has been reported for numerous specialized metabolism biosynthetic pathways, 70 thus to identify putative clusters of enzymes involved in secondary metabolite synthesis, plantiSMASH v1.0 analysis 71 was performed on each culinary herb genome assembly with its high confidence representative gene set (Supplementary Table S13). The quantity and classification of clusters detected varied by species. In total, there were 104 clusters detected in O. basilicum, 36 clusters detected in O. majorana, 38 clusters detected in O. vulgare, and 22 clusters detected in R. officinalis. In particular, the four species were enriched in clusters related to terpene and saccharide production. Among the culinary herbs, the most terpene clusters were found for the O. basilicum assembly, with 226 genes located across 24 clusters. The O. majorana and O. vulgare assemblies contained eight and nine terpene clusters, with 65 and 83 corresponding genes, respectively. The smallest number of terpene clusters was found in R. officinalis, with 26 genes in 2 clusters. In comparison, A. thaliana had 7 putative terpene clusters containing 129 genes, and T. grandis had 6 clusters with 72 genes. Other secondary metabolite clusters were represented by a combination of terpene-related genes along with other secondary metabolites such as alkaloids, lignans, and polyketides.
culinary herbs: O. basilicum, O. majorana, O. vulgare, and R. officinalis. Genome sequencing, assembly, and annotation of these herbs revealed a diversity of genes involved in terpenoid biosynthesis. In addition, targeted metabolic profiling revealed the diversity of monoterpenes and sesquiterpenes in these species and exemplified unique terpenoid profiles for each species. Our study showcases the genomic and metabolomic characterization of these four herbs that can be used to further explore terpene biosynthesis.