The reference genome of Camellia chekiangoleosa provides insights into Camellia evolution and tea oil biosynthesis

Abstract Camellia oil extracted from Camellia seeds is rich in unsaturated fatty acids and secondary metabolites beneficial to human health. However, no oil-tea tree genome has yet been published, which is a major obstacle to investigating the genetic improvement of oil-tea trees. Here, using both Illumina and PacBio sequencing technologies, we present the first chromosome-level genome sequence of the oil-tea tree species Camellia chekiangoleosa Hu. (CCH). The assembled genome consists of 15 pseudochromosomes with a genome size of 2.73 Gb and a scaffold N50 of 185.30 Mb. At least 2.16 Gb of the genome assembly consists of repetitive sequences, and the rest involves a high-confidence set of 64 608 protein-coding gene models. Comparative genomic analysis revealed that the CCH genome underwent a whole-genome duplication event shared across the Camellia genus at ~57.48 MYA and a γ-WGT event shared across all core eudicot plants at ~120 MYA. Gene family clustering revealed that the genes involved in terpenoid biosynthesis have undergone rapid expansion. Furthermore, we determined the expression patterns of oleic acid accumulation- and terpenoid biosynthesis-associated genes in six tissues. We found that these genes tend to be highly expressed in leaves, pericarp tissues, roots, and seeds. The first chromosome-level genome of oil-tea trees will provide valuable resources for determining Camellia evolution and utilizing the germplasm of this taxon.


Introduction
World-famous Camellia (Theaceae) plants are valued not only for their aesthetic contributions to landscaping but also for the nutritional and health benefits of beverages containing their compounds and of their edible oils. "Oil-tea trees", one of the four major oil-bearing groups of trees worldwide, is the general name for several Camellia species whose seeds have a high oil content and are cultivated for their edible value. Camellia oil, or tea oil, extracted from Camellia seeds is rich in unsaturated fatty acids (UFAs) and a variety of secondary metabolites beneficial to human health and is known as "oriental olive oil" due to its high oleic acid content and antioxidant activity [1]. Camellia oil and its by-products are also widely used in the medicinal and cosmetic industries. Oil-tea tree species are documented as traditional woody edible oil crop species in East and Southeast Asia, with their cultivation and use for edible oil in China dating back more than 2000 years [2]. In 2020, the planting area of oil-tea trees was approximately 4.3 million hectares in China, accounting for more than 95% of the global camellia oil resources, and its annual output value exceeded 116 billion yuan. China is currently the only country with a substantial production of tea oil, and the main cultivated species are Camellia oleifera, Camellia meiocarpa, and Camellia chekiangoleosa [3].
Camellia oil has an ideal fatty acid profile and is a natural competitor of olive oil. UFAs are an important component of biomembranes and play a critical role in regulating the biological processes of cells. UFAs are also essential for human survival and play a vital role in regulating physiological processes, such as maintenance of the nervous system and regulation of glucose and lipid metabolism [4]. The biosynthesis of UFAs is a very complex process. Acetyl coenzyme A is polymerized in the chloroplast stroma to form saturated fatty acids (SFAs) with 16-18 carbons via the catalysis of fatty acid synthases, and then SFAs are converted to palmitoleic acid and oleic acid via Δ9 fatty acid desaturase [stearoyl-ACP desaturase (SAD)] [5]. Oleic acid is desaturated by Δ12 fatty acid desaturase (FAD2 and FAD6) to form linoleic acid, which is further desaturated by Δ15 desaturase (FAD3, FAD7, and FAD8) to synthesize α-linolenic acid (ALA) or by Δ6 desaturase to synthesize γ-linolenic acid (GLA). ALA and GLA are further processed into docosahexaenoic acid and arachidonic acid, respectively, both of which are essential for humans [5].
As important bioactive components of camellia oil, secondary metabolites, including phenols, proanthocyanidins, tocopherols, and carotenoids, are gaining increasing attention because of their health benefits [6]. Various triterpenoids have been isolated from Camellia and investigated for their bioactivity. Cycloartanol, β-amyrin, and squalene are the three main triterpenoids in camellia oil, with gross contents of 1043.30 mg/kg, 878.24 mg/kg, and 133.26 mg/kg, respectively. A triterpenoid saponin from camellia oil exerts both antioxidant and antimutagenic properties in humans and animals. In camellia oil, squalene, a common precursor of triterpenoids, has a variety of physiological activities, such as antiaging, antitumor, and antioxidant activities [7]. In the biosynthesis of triterpenoids, two molecules of farnesyl diphosphate (FPP) are condensed by squalene synthase (SQS) to synthesize squalene, which is then converted by squalene epoxidase (SQE) to 2,3-oxidosqualene. Finally, 2,3-oxidosqualene undergoes a series of protonation, cyclization, rearrangement, and deprotonation steps catalyzed by different types of oxidosqualene cyclases (OSCs) to form a triterpenoid backbone [8].
Oil-tea trees are monoecious and allogamous plants that can be vegetatively propagated and live for thousands of years. Although the naturally long generation time of oil-tea trees has traditionally hindered the breeding of these species, considerable efforts have been made over the last several decades to study flowering, fruiting, and the biosynthesis of camellia oils from biochemical, physiological, or molecular genetic perspectives. The oil-tea tree species Camellia chekiangoleosa Hu. (CCH), a diploid in the genus Camellia, is naturally distributed in the mountainous areas in Fujian, Jiangxi, Hunan, Zhejiang, and Anhui provinces and has been introduced to Europe, America, and Australia as an ornamental plant species [9]. Extensive trials on CCH have been conducted in various provinces of China to improve its oil quality and yield [10]. Dried CCH kernels consisted of 60.3% crude fats, 8.8% crude protein, and 10.3% camellia saponins, and the oil content exceeded 60%. CCH oil mostly consists of oleic acid, linoleic acid, stearic acid, and palmitic acid, of which UFAs (mainly oleic acid and linoleic acid) account for approximately 90%, giving it high nutritional and health-promoting value [11]. However, the genetic bases for the growth and development of these Camellia species, especially the biosynthesis of bioactive compounds in camellia oil, are not yet understood due to the lack of a reference genome. A high-quality reference genome would greatly facilitate such research. Several tea (Camellia sinensis) genome sequences have recently been released, and a growing number of multiomics studies based on genomes, transcriptomes, proteomes, non-coding RNA, and genome-wide association analyses have revealed the biological properties of tea and provided researchers with new insights to better study its medicinal and economic value [12][13][14].
The lack of a reference genome sequence is a major obstacle for basic and applied biology on oil-tea trees. We herein present a high-quality genome sequence of CCH, and reveal the biosynthesis of UFAs and terpenoids in camellia oil. This genome sequence will facilitate the understanding of Camellia genome evolution and tea oil biosynthesis and will promote germplasm utilization for breeding improved oil-tea tree varieties.

Chromosome-level assembly of the CCH genome
The genome size of CCH was evaluated by two methods before formal assembly (Supplementary Fig. S1). To obtain a chromosome-level reference genome of CCH, we generated PacBio HiFi reads (51.09 Gb, ∼19-fold genome coverage) and Illumina Hi-C reads (283.40 Gb, ∼102-fold genome coverage). By using the hifiasm software as an assembly tool, we ultimately obtained a 2.73 Gb genome of CCH that covers 97.40% of the scaffolds and consists of 15 pseudochromosomes (scaffold N50 = 185.30 Mb) (Fig. 1). Repetitive sequences totaled at least 2.16 Gb, accounting for 79.09% of the CCH genome assembly. Long terminal repeat (LTR) retrotransposons are the most dominant class of transposable elements in the CCH genome, accounting for approximately 64.56% of the genome, among which Copia and Gypsy elements are the two most dominant classes of LTR retrotransposons, constituting 6.55% and 34.47% of the genome, respectively (Supplementary Table S1). Through ab initio modeling, protein-based searches, and transcript analysis of long-read isoform sequencing and short-read RNA sequencing data, a high-confidence set of 64 608 protein-coding gene models encoding a total of 66 579 proteins, 64 130 of which were annotated in at least one database, was established (Table 1, Supplementary Fig. S2). Two methods were used to assess the completeness of the CCH genome. First, the statistical results of BUSCO showed that the CCH genome covered 2177 (93.59%) of 2326 complete gene models, of which 1940 (83.40%) genes were present as single copies and 237 genes were present as multiple copies. Second, an LTR assembly index (LAI) score of 11.53 indicated that the CCH genome has high sequence continuity. Taken together, these results indicate that the assembly of the CCH genome is of high quality and meets the reference genome standards (Supplementary Fig. S3).
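The summary statistics quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch rather than the pipeline used in this study: the scaffold lengths are a hypothetical toy list, and only the BUSCO counts (2177 and 1940 of 2326) are taken from the text.

```python
def n50(lengths):
    """Return the N50: the length of the sequence at which the running
    total of lengths, sorted from longest to shortest, first reaches
    half of the total assembly size."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


def percent(part, whole):
    """Express part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Hypothetical scaffold lengths (arbitrary units), for illustration only.
toy_scaffolds = [200, 180, 150, 90, 40, 10]
print(n50(toy_scaffolds))  # 180

# BUSCO counts reported in the text: complete and single-copy of 2326.
print(percent(2177, 2326), percent(1940, 2326))
```

The N50 is insensitive to many small scaffolds, which is why it is usually reported together with BUSCO and LAI rather than on its own.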

Phylogenetic status of CCH
A total of 21 747 gene families were identified via comparisons of protein sequences homologous to CCH and 15 other species, of which the number of single-copy orthologous genes was 154. To confirm the relationship between CCH and other species, we constructed a high-confidence phylogenetic tree with Zea mays and Elaeis guineensis as the outgroup species by using 154 single-copy orthologous genes shared between CCH and 15 other species. According to the results, CCH and C. sinensis were clustered on the same branch, which belongs to the family Theaceae, and the result is consistent with research based on chloroplast genomes (Fig. 2a) [15]. Furthermore, we also constructed a phylogenetic tree of divergence time, and the results showed that the family Theaceae (CCH and C. sinensis) separated from the family Actinidiaceae (Actinidia chinensis) ∼71.22

Expansion and contraction of gene families
We compared the gene families of CCH, Elaeis guineensis, Vernicia fordii, Diospyros oleifera, and Olea europaea. A total of 18 577 gene families were found across the five species, and KEGG enrichment analysis of the CCH-specific gene families (2264 gene families) revealed that "fatty acid biosynthesis" (ko00061, enrichment score = 2.83) was one of the significantly enriched terms (Supplementary Fig. S5, Supplementary Fig. S6, Supplementary Table S2). We identified 3017 and 2941 gene families in the CCH genome that underwent expansion and contraction, respectively, with 414 gene families (8168 genes) undergoing rapid expansion and 131 gene families (265 genes) undergoing rapid contraction (Fig. 2a, Supplementary Fig. S7, Supplementary Table S3). KEGG enrichment analysis of the 414 rapidly expanding gene families revealed that 15 KEGG pathways were significantly enriched (P-value <0.05), with "sesquiterpenoid and triterpenoid biosynthesis" (ko00909, enrichment score = 11.06) and "monoterpenoid biosynthesis" (ko00902, enrichment score = 12.39) among the significantly enriched pathways, suggesting that the rapid expansion of genes associated with monoterpene, sesquiterpene, and triterpene biosynthesis may be related to the unique properties of CCH (Fig. 2b, Supplementary Table S4).

Whole-genome duplication and collinearity
To identify the whole-genome duplication (WGD) events experienced by CCH, we analyzed the Ks distribution of CCH, C. sinensis, E. guineensis, and Vitis vinifera. The distribution of Ks showed one peak at ∼1.3-1.5 in the genomes of CCH, C. sinensis, E. guineensis, and V. vinifera, indicating that these species shared an ancient γ-WGT (whole-genome triplication) event that was also shared by all core eudicot plants (Fig. 2c). We also noticed one peak at ∼0.3-0.4 in the genomes of CCH and C. sinensis, indicating that they shared a recent WGD event that was shared across members of the genus Camellia (Fig. 2c). Studies have shown that tea plants experienced only one WGD event after the γ-WGT event, and some genes involved in catechin and caffeine biosynthesis expanded and were retained following the WGD event, contributing to the flavor compounds of tea plants [16]. A number of studies have found that the ancestors of eudicots had a γ-WGT event 120 million years ago, and the genes preserved from this ancient WGT event were mostly associated with water acquisition and salt stress. Because the event occurred during the arid Cretaceous period, it is assumed that this whole-genome triplication provided the genetic basis for plant species to adapt to the harsh environment of this period [17,18]. Several tea genome sequences support the conclusion that the genus Camellia, as a member of the eudicots, also underwent the γ-WGT event [19,20]. According to Fig. 2c, the grape underwent only one polyploidization event (γ-WGT), whereas the genus Camellia had an additional, more recent WGD event (∼57.48 million years ago). We think that this recent event may have obscured the signal of the paralogous gene pairs retained from the γ-WGT event, making their peak less easily observed in the region of ∼1.3-1.5.
The peak value of orthologous gene pairs between CCH and C. sinensis (Ks = 0.5) was lower than both the value between CCH and V. vinifera (Ks = 0.8) and the peak value between CCH and E. guineensis (Ks = 1.6), implying that the divergence time between CCH and C. sinensis occurred later; these results correspond to the phylogenetic relationships (Fig. 2a, Supplementary Fig. S8).
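A Ks peak is commonly related to an age through T = Ks / (2r), where r is the synonymous substitution rate per site per year. The text does not state which rate or method was used to date the events, so the sketch below only illustrates the relation. Taking 0.35 as the midpoint of the ∼0.3-0.4 peak is an assumption, and the rate it yields is an implied value, not one reported in this study.

```python
def divergence_time_mya(ks, rate_per_site_per_year):
    """Convert a Ks value to an age in million years via T = Ks / (2r)."""
    return ks / (2 * rate_per_site_per_year) / 1e6


def implied_rate(ks, time_mya):
    """Back-calculate the substitution rate r that links a Ks peak to an age."""
    return ks / (2 * time_mya * 1e6)


# Assumption: 0.35 is used as the midpoint of the reported 0.3-0.4 Ks peak;
# 57.48 MYA is the age of the Camellia WGD given in the text.
rate = implied_rate(0.35, 57.48)
print(f"implied rate: {rate:.3e} substitutions per site per year")

# Round trip: the same rate maps the peak back onto the reported age.
print(f"{divergence_time_mya(0.35, rate):.2f} MYA")
```

Because the result scales inversely with r, ages derived this way are only as reliable as the rate estimate for the lineage in question.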
To better understand the evolution of CCH, we determined the collinearity relationship between CCH and CSS-BY. The results showed that there was high collinearity between CCH and CSS-BY, indicating that there was no large-scale structural variation after the divergence of CCH and CSS-BY. For most collinear regions, one chromosome of CCH corresponded to one chromosome of CSS-BY; specifically, CchChr1, CchChr2, CchChr3, CchChr4, CchChr5, CchChr6, CchChr7, CchChr8, CchChr9, CchChr10, CchChr11, CchChr12, CchChr13, CchChr14, and CchChr15 of CCH corresponded to CsChr3, CsChr8, CsChr1, CsChr6, CsChr4, CsChr9, CsChr2, CsChr5, CsChr10, CsChr12, CsChr14, CsChr7, CsChr13, CsChr15, and CsChr11 of CSS-BY, respectively (Fig. 2d, Supplementary Fig. S9a). In addition, we analyzed the intergenomic collinearity between CCH and DASZ. Although one chromosome of CCH usually corresponds to one chromosome of DASZ, the collinearity between the two is weaker than that between CCH and CSS-BY, indicating that they are more distantly related (Fig. 2d, Supplementary Fig. S9b).
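The chromosome correspondence listed above is easier to query as a lookup table. The following sketch encodes exactly the pairs given in the text and checks that the mapping is one-to-one; the variable names are illustrative only.

```python
# CSS-BY partner of CchChr1..CchChr15, in the order given in the text.
cs_partners = [3, 8, 1, 6, 4, 9, 2, 5, 10, 12, 14, 7, 13, 15, 11]

# Forward map: CCH chromosome -> CSS-BY chromosome.
cch_to_cs = {
    f"CchChr{i}": f"CsChr{j}" for i, j in enumerate(cs_partners, start=1)
}

# Reverse map: CSS-BY chromosome -> CCH chromosome.
cs_to_cch = {cs: cch for cch, cs in cch_to_cs.items()}

# The mapping is one-to-one if no CSS-BY chromosome is used twice.
is_one_to_one = len(cs_to_cch) == len(cch_to_cs) == 15

print(cch_to_cs["CchChr10"])  # CsChr12
print(cs_to_cch["CsChr1"])    # CchChr3
print(is_one_to_one)          # True
```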

UFA biosynthesis-associated genes
Studies have shown that FAD and SAD genes are essential for the biosynthesis of UFAs in a variety of oilseed plant species [21,22]. A total of 32 genes encoding acyl-lipid desaturases and acyl-ACP desaturases were identified in the CCH genome in this study, including 11 CchSAD and 17 CchFAD genes. We found that all SAD genes clustered onto one branch according to our phylogenetic tree based on the homologous sequences of CCH, Glycine max, Oryza sativa, Sesamum indicum, Olea europaea, Arabidopsis thaliana, and Arachis hypogaea, indicating that the expansion of SAD genes occurred after the divergence of these species (Fig. 3). Among the FAD genes, FAD1/2/3/6/7/8 clustered onto one branch, while FAD4/5 clustered onto another branch. Among FAD1/2/3/6/7/8, the products of FAD6/7/8 are localized in plastids and have similar structures; the products of FAD2 and FAD3 are localized in the endoplasmic reticulum and use phosphatidylcholine as the preferred substrate; FAD2 and FAD3 are key genes in fatty acid desaturation and encode key enzymes for oleic and linoleic acid desaturation, respectively [23]. In fact, some of the genes involved in fatty acid biosynthesis are expressed specifically in seeds, which are important organs for the rapid accumulation of lipids. For example, one type of the CchFAD2 gene, CchFAD2A (Cch15G000175), is expressed over one hundred times more in the seeds than in other tissues, while another type, CchFAD2B (Cch10G003830), is highly expressed only in the shoots (Fig. 4). A member of the CchSAD gene family, CchSAD2 (Cch05G001837), which is closely involved in fatty acid synthesis and highly expressed in seeds, may be closely related to the accumulation of UFAs (Fig. 4).

Terpenoid biosynthesis pathway in CCH
Camellia oil is rich in a variety of secondary metabolites that are beneficial to human health, such as terpenoids, and elucidating the potential terpenoid biosynthesis pathway of CCH will help us better understand the medicinal value of camellia oil. We found that genes involved in terpenoid biosynthesis underwent rapid expansion in the CCH genome, and a total of 86 genes involved in terpenoid biosynthesis were identified in the whole-genome sequence of CCH in this study, including 61 CchTPS, five CchSQS, five CchSQE, and 15 CchOSC genes, which were named in order of their position on the chromosome (Fig. 5). Chromosomal localization of these genes revealed that most of them were arranged in clusters consistent with tandem duplication (Fig. 5a, Supplementary Fig. S10a). To investigate the potential functions of these genes in CCH, we analyzed the expression of the 86 genes in six tissues; they were found to be highly expressed in the shoots and stems but expressed at low levels in the leaves, pericarp tissues, roots, and seeds (Fig. 5c, Supplementary Fig. S11).

Discussion
In this study, by combining PacBio, Hi-C, and Illumina sequencing technologies, we generated the first chromosome-level reference genome of an oil-tea tree species. The assembled genome consisted of 15 pseudochromosomes with a size of approximately 2.73 Gb. Both BUSCO and LAI scores indicated that the assembly quality was high and met the reference genome standard. The percentage of repetitive sequences in the genome of CCH was lower than that in DASZ (3.11 Gb) and CSS-SCZ (2.94 Gb) and higher than that in CSS-BY (3.25 Gb) (Table 1) [20,29,30].
Combining the gene families that expanded and contracted during the evolution of CCH and the genes subjected to positive selection, we found that genes related to fatty acid synthesis expanded and those related to linoleic acid synthesis contracted throughout evolution. The genes subjected to positive selection include those with fatty acid degradation- and FAD-binding-related functions, all of which are inextricably linked to lipid metabolism. We speculate that during the evolutionary process of CCH, to continuously adapt to the environment, genes related to lipid synthesis evolved adaptively, eventually leading to the gradual development of high-lipid characteristics. Two types of FAD2 and SAD2 genes were found in several species: those expressed specifically in the seeds and those constitutively expressed, with the former highly expressed in the seeds and the latter expressed evenly across different tissues. In this study, CchFAD2A and CchSAD2 were expressed specifically in the seeds, while CchFAD2B was constitutively expressed. Our previous study showed that the expression of CchFAD2A was much higher than that of CchFAD2B in the seeds, while the expression of CchFAD2B was more similar across various tissues.
Terpenoids are among the most diverse plant secondary metabolites and have significant research value. Terpenoid biosynthesis begins with acetyl coenzyme A or pyruvate and glyceraldehyde-3-phosphate; the former undergoes a six-step condensation reaction to produce isopentenyl diphosphate (IPP), while the latter undergoes a seven-step condensation reaction to produce IPP [31]. IPP and its isomer, dimethylallyl diphosphate (DMAPP), generate monoterpenes, sesquiterpenes, diterpenes, and triterpenes under the action of different enzymes, including those encoded by TPS genes associated with monoterpene, sesquiterpene, and diterpene biosynthesis [32], and SQS, SQE, and OSC genes associated with triterpene biosynthesis [33][34][35]. The number of TPS genes in angiosperms ranges from 40 to 152 [36], and a total of 61 CchTPS genes containing both C- and N-terminal structural domains were identified in the CCH genome in this study, the number of which was much greater than the 23 CsTPS genes identified in CSS-SCZ and the 45 CsTPS genes identified in Tieguanyin [37], indicating that the CchTPS gene family in the CCH genome expanded (Fig. 5a). The phylogenetic tree showed that CchTPSs could be divided into five subfamilies, of which the TPS-a subfamily was the largest, which was consistent with the findings in CSS-SCZ [37], Arabidopsis thaliana [38], and S. lycopersicum, in contrast to the results in Tieguanyin, in which TPS-b was found to be the largest subfamily. Functional annotation of the 61 CchTPSs revealed that 22 CchTPSs were associated with monoterpene synthesis, 34 CchTPSs were associated with sesquiterpene synthesis, and five CchTPSs were associated with diterpene synthesis; in addition, a significant expansion of monoterpene- and sesquiterpene-associated genes occurred in the CCH genome compared with the Tieguanyin and CSS-SCZ genomes (Fig. 2b, Fig. 4) [37].
We also found that CchTPS genes were unevenly distributed on chromosomes and were mostly tandemly repeated, suggesting that CCH underwent genetic expansion during evolution, which is consistent with the findings of studies in S. lycopersicum, Tieguanyin, Ricinus communis, Avena sativa, M. truncatula, A. thaliana, and Lotus corniculatus [25,27,28]. The expression of CchTPSs was significantly different in the six different studied tissues, with most CchTPS genes tending to be highly expressed in the shoots and stems but expressed at low levels in the leaves, pericarp tissues, roots, and seeds; however, in CSS-SCZ, most CsTPSs tended to be highly expressed in the flowers and leaves. CchTPS genes are specifically expressed in different tissues and organs, suggesting that these genes play different functions (Fig. 5a).
SQS is a key enzyme in triterpene biosynthesis and has six relatively conserved structural domains. Two AthSQS genes have been identified in A. thaliana, but the product of only one has catalytic activity [39]. In this study, five CchSQS genes were identified in CCH, with CchSQS03 and CchSQS05 being more highly expressed in the roots. SQE is widely found in plants, animals, and humans, and it usually catalyzes squalene to form 2,3-oxidosqualene. In this study, five CchSQE genes were identified, CchSQE01 was expressed at the highest level in the buds, CchSQE02 was expressed at the highest level in the seeds, CchSQE04 was expressed at the highest level in the pericarp tissues, and CchSQE03 and CchSQE05 were expressed at the highest level in the roots; moreover, seven TwSQE genes were identified in Tripterygium wilfordii and had the highest expression in the roots and flowers [40]. Two SgSQE genes were identified in Siraitia grosvenorii, and their expression was greatest on day 15 of fruit development. OSC catalyzes the generation of sterols and triterpenoid precursors from 2,3-oxidosqualene, which is a key step in the product diversity of triterpenoids. In this study, 15 CchOSC genes were identified in the CCH genome, the number of which was much greater than the four CoOSC genes found in C. oleifera, suggesting that CchOSC genes also underwent expansion. These findings imply that the high number of CchOSC gene family members may be closely related to their triterpene products (Fig. 5b).

Conclusions
In this study, the first high-quality chromosome-level reference genome of oil-tea trees was obtained by various techniques, and gene structure and comparative genomic analyses were performed. We identified the expression pattern of UFA- and terpenoid biosynthesis-associated genes in six tissues. It was found that genes involved in the biosynthesis of monoterpenes, sesquiterpenes, and triterpenes in the CCH genome underwent rapid expansion. Our study provides a useful reference genome for revealing Camellia evolution and investigating its medicinal value.

Library construction and genome sequencing
High-quality genomic DNA was extracted from fresh leaves of CCH using the Plant Genomic DNA Kit (Tiangen, China) according to the manufacturer's instructions. The library was constructed using the TruSeq DNA LT Sample Prep Kit according to the manufacturer's instructions, and sequencing was performed on the Illumina HiSeq X platform (Illumina, San Diego, CA, USA) in 150 bp paired-end mode. Hi-C libraries were prepared according to a previous report [41], and paired-end 150 bp sequencing was performed on the Illumina HiSeq platform (Illumina, San Diego, CA, USA) following quality control measures. The raw sequences were filtered using fastp with the default parameters [42]. A SMRT library was then generated and sequenced on the PacBio Sequel II platform (Pacific Biosciences, Menlo Park, CA, USA) according to the manufacturer's instructions.

Genome assembly and annotation
Genome assembly was performed using hifiasm in "no purge" mode, and redundant sequences arising from heterozygosity were removed using purge_dups to obtain the draft genome of CCH [43]. The chromosomes were clustered, sorted, and corrected based on Hi-C interaction information using 3D-DNA [44]. Finally, the Hi-C interaction matrix was imported into Juicebox for manual inspection. The completeness of the assembled genome was subsequently assessed using BUSCO [45].

Gene family clustering and phylogenetic analysis
After filtering out protein sequences shorter than 30 amino acids, we clustered the remaining protein sequences of CCH, Z. mays, E. guineensis, S. indicum, Theobroma cacao, V. vinifera, G. max, A. thaliana, V. fordii, A. chinensis, C. sinensis, D. oleifera, O. europaea, S. lycopersicum, Solanum tuberosum, and Populus alba based on similarity using OrthoFinder [51].
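The length filter applied before clustering can be sketched as follows. This is an illustrative stand-alone implementation, not the script used in this study: the FASTA records and gene identifiers are hypothetical, and ignoring a trailing stop symbol (`*`) when measuring length is an assumption.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_short_proteins(records, min_length=30):
    """Keep proteins with at least min_length residues.

    Assumption: a trailing '*' (stop symbol) is not counted as a residue.
    """
    return [(h, s) for h, s in records if len(s.rstrip("*")) >= min_length]


# Hypothetical input: geneA has 35 residues over two lines, geneB has 5.
toy_fasta = ">geneA\n" + "M" + "A" * 20 + "\n" + "A" * 14 + "\n>geneB\nMKVLA*\n"
kept = drop_short_proteins(parse_fasta(toy_fasta))
print([header for header, _ in kept])  # ['geneA']
```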
MAFFT was used to perform the multiple sequence alignment [52]. Afterward, conserved regions were extracted using trimAl [54], and RAxML was used to construct a phylogenetic tree from the concatenated alignment of the 154 single-copy orthologous genes [53]. The calibration time points were obtained from TimeTree [55], and the divergence time was estimated using PAML [56]. Gene family expansion and contraction were detected using CAFE [57] and visualized using ggtree [58].
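Concatenating per-gene alignments into a single supermatrix, as described above, amounts to joining each species' aligned sequences gene by gene. The sketch below is a minimal illustration under the assumption that every species is present in every alignment, which holds for single-copy orthologs; the species labels and sequences are hypothetical.

```python
def concatenate_alignments(alignments, species):
    """Join per-gene alignments (dicts of species -> aligned sequence)
    into one concatenated sequence per species."""
    parts = {name: [] for name in species}
    for index, alignment in enumerate(alignments):
        # All sequences within one alignment must have the same width.
        widths = {len(alignment[name]) for name in species}
        if len(widths) != 1:
            raise ValueError(f"alignment {index} has unequal sequence lengths")
        for name in species:
            parts[name].append(alignment[name])
    return {name: "".join(chunks) for name, chunks in parts.items()}


# Hypothetical trimmed alignments for two single-copy ortholog groups.
toy_alignments = [
    {"CCH": "MK-L", "Csin": "MKAL"},
    {"CCH": "GG", "Csin": "GA"},
]
supermatrix = concatenate_alignments(toy_alignments, ["CCH", "Csin"])
print(supermatrix)  # {'CCH': 'MK-LGG', 'Csin': 'MKALGA'}
```

Raising an error on unequal widths guards against passing unaligned or untrimmed sequences into the tree-building step.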

WGD and collinearity
We analyzed the Ks distribution to discover WGD events within CCH, C. sinensis, E. guineensis, and V. vinifera. KaKs_Calculator was used to calculate Ks values [59]. The collinearity relationships between CCH, CSS-BY, and DASZ were determined using JCVI after extracting the longest protein sequence of each gene [60].

Tandem duplication and positive selection
To identify tandemly duplicated genes in the CCH genome, we first extracted the longest protein sequences of each gene and then used blastp to identify homologous genes.
The protein sequences of single-copy gene family members shared across CCH, C. sinensis, D. oleifera, O. europaea, S. indicum, C. sinensis, S. lycopersicum, and Solanum tuberosum were aligned by MAFFT [52]. Then, we detected the positively selected genes using the branch-site model, with CCH serving as the foreground branch and the other seven species constituting the background branches [56].

Terpenoid-related gene identification
Genes involved in terpenoid biosynthesis were identified using HMMER against the proteome of CCH (E-value < 1e-5). PF03936 and PF01397 were used to identify the CchTPS genes; PF00494, the CchSQS genes; PF08491, the CchSQE genes; and PF13243 and PF13249, the CchOSC genes. Sequences shorter than 200 amino acids were excluded from further analysis.
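The selection rules above can be expressed as a small classifier applied to already-parsed domain hits. This is a sketch under stated assumptions, not the authors' script: the input is a hypothetical dictionary of Pfam accession to best E-value per protein, and requiring every listed domain for a family (both TPS domains, both OSC domains) is an assumption. The text states that the identified CchTPS genes contain both terminal domains but does not state the rule for OSC.

```python
# Pfam accessions per gene family, as listed in the text.
REQUIRED_DOMAINS = {
    "TPS": {"PF03936", "PF01397"},
    "SQS": {"PF00494"},
    "SQE": {"PF08491"},
    "OSC": {"PF13243", "PF13249"},
}
MAX_EVALUE = 1e-5   # E-value cutoff from the text
MIN_LENGTH = 200    # minimum protein length (amino acids) from the text


def classify(protein_length, domain_evalues):
    """Return the gene family of a protein, or None if it fails a filter.

    domain_evalues: hypothetical dict of Pfam accession -> best E-value.
    Assumption: a family is assigned only if all of its domains are found.
    """
    if protein_length < MIN_LENGTH:
        return None
    significant = {
        acc for acc, evalue in domain_evalues.items() if evalue < MAX_EVALUE
    }
    for family, required in REQUIRED_DOMAINS.items():
        if required <= significant:
            return family
    return None


print(classify(550, {"PF03936": 1e-40, "PF01397": 1e-30}))  # TPS
print(classify(150, {"PF03936": 1e-40, "PF01397": 1e-30}))  # None
```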