Exploring the evolutionary process of alkannin/shikonin O-acyltransferases by a reliable Lithospermum erythrorhizon genome

Abstract Increasing genome data are coming out. Genome size estimation plays an essential role in guiding genome assembly. Several months ago, other researchers were the first to publish a draft genome of the red gromwell (i.e. Lithospermum erythrorhizon). However, we considered that the genome size they estimated and assembled was incorrect. This study meticulously estimated the L. erythrorhizon genome size to should be ∼708.74 Mb and further provided a reliable genome version (size ≈ 693.34 Mb; contigN50 length ≈ 238.08 Kb) to support our objection. Furthermore, according to our genome, we identified a gene family of the alkannin/shikonin O-acyltransferases (i.e. AAT/SAT) that catalysed enantiomer-specific acylations in the alkannin/shikonin biosynthesis (a characteristic metabolic pathway in L. erythrorhizon’s roots) and further explored its evolutionary process. The results indicated that the existing AAT/SAT were not generated from only one round of gene duplication but three rounds; after different rounds of gene duplication, the existing AAT/SAT and their recent ancestors were under positive selection at different amino acid sites. These suggested that a combined power from gene duplication plus positive selection plausibly propelled AAT/SAT’s functional differentiation in evolution.


Introduction
Red gromwell (Fig. 1A), i.e. Lithospermum erythrorhizon Siebold & Zucc., is a traditional Chinese medicine plant [former name: L. officinale var. erythrorhizon (Siebold & Zucc.) Maxim. 1 ; No. of chromosomes: all records 2n ¼ 28 2 ]. In the past, L. erythrorhizon was recognized as a variant of L. officinale L. (former name: L. officinale var. stewartii Kazmi 1 ; No. of chromosomes: most records 2n ¼ 28 2 ), although they are now separated species. Besides, based on current molecular evidence, L. erythrorhizon is still determined as the closest species of L. officinale. 3 Lately, Auber et al. 4 published a hybrid assembled genome of L. erythrorhizon (estimated genome size % 369.34 Mb; assembled genome size % 367.41 Mb) using our short Illumina data (NCBI ID: SRX2882373; Supplementary Table S1) plus their long ONT data (NCBI IDs: SRX7432848-SRX7432852; Supplementary Table S1). However, Pustahija et al. 5 reported that L. officinale's genome size was $743 Mb (1C % 0.76 pg), significantly greater than the L. erythrorhizon's genome size estimated and assembled by Auber et al. 4 Since the chromosome numbers between L. erythrorhizon and L. officinale are almost identical, 2 we consider that this significant difference is not due to polyploidization but to Auber et al.'s misestimation and misassembly. 4 Therefore, in this study, we carried out a rigorous genome size estimation and further provided a new version of the L. erythrorhizon genome to support our objection.

Plant materials
The seeds of L. erythrorhizon were purchased from ShiJie Seed Breeding Company (https://cfsjzy.1688.com/), located in Chifeng, Inner Mongolia Autonomous Region, China. The healthy seeds were germinated in several pots and then cultured in a greenhouse. Young leaves from flowering individuals were applied for genome sequencing.
Based on the codon model, the L. erythrorhizon's AAT/SAT-like family members' nucleotide sequences were aligned via PRANK v170427 51 (key parameter: -F -codon). Then, the preliminary alignment was trimmed by trimAl v1.4.1 49 (key parameter: -gt 0.50). The trimmed alignment was transformed back to amino acid sequences, and this amino acid alignment was used to construct a phylogenetic tree via MEGA-X 52 based on the ML method (best-fit model: JTT þ G4; bootstrap replications: 1,000). Besides, this tree and its trimmed codon alignment were used for the following selection pressure analysis.

Ks calculation
Total 12 L. erythrorhizon's AAT/SAT-like family members combined to produce 66 gene pairs C122. Each gene pair was aligned via MUSCLE v3.8.31 48 based on the corresponding amino acid sequences, and each alignment was transformed back to nucleotide sequences. Ks values for each gene pair were calculated via KaKs_Calculator v2.0 53 (key parameter: -m NG).

Gene duplication identification
Our L. erythrorhizon gene set was applied for all-vs.-all similarity searches via DIAMOND v2.0.5 41 (key parameter: -f 6 -more-sensitive -e 1e-30 -k 6). The results plus the corresponding gff file of the gene set were further input into the 'duplicate_gene_classifier' module in MCScanX 54 to identify duplication types for each gene (priority: WGD/Segmental > Tandem > Proximal > Dispersed > Singleton).

Selection pressure analysis
According to the branch-site models, the CodeML module in PAML v4.9j 55 was used to detect positive sites on foreground branches: (i) first, a target foreground branch was labelled in the corresponding tree; (ii) an alternative model (i.e. Model A) was set to that sites were under positive selection on the labelled foreground branch (key parameter: model ¼ 2, NSsites ¼ 2, fix_omega ¼ 0, omega ¼ 1.5); (iii) a null model (i.e. Model A null) was then set to that sites were under neutral selection on the labelled foreground branch (key parameter: model ¼ 2, NSsites ¼ 2, fix_omega ¼ 1, omega ¼ 1); (iv) the likelihood ratio test (i.e. LRT) 56 was then applied to determine which model was accepted [threshold: when P < 0.05, the alternative model (i.e. Model A) was accepted], (v) furthermore, the bayes empirical bayes test (i.e., BEB) 57 was used to determine which site was under positive selection (threshold: when posterior probabilities > 0.90, that site probably was under positive selection).

Lithospermum Erythrorhizon genome
Based on our Illumina data [NCBI ID: SRX2882373 (SRR5644206); Supplementary Table S1], Auber et al. 4 estimated L. erythrorhizon's genome size to be $369.34 Mb using GenomeScope v1.0 21 with default parameters (i.e. parameter 'Kmer length' ¼ 21 and parameter 'Max kmer coverage' ¼ 1e þ 03). We repeated their calculation and obtained an identical result (Supplementary Table S3 and Fig. S1). The original intention of setting parameter 'Max kmer coverage' ¼ 1e þ 03 was to avoid interference from high-frequency non-nuclear reads (e.g. organelle reads and contamination reads). 21 However, the practice had proven that this obsolete default parameter (i.e. 'Max kmer coverage' ¼ 1e þ 03) was improper (https://github.com/schat zlab/genomescope/issues/22; https://github.com/schatzlab/genome scope/issues/28). Thus, software developers suggested this parameter to be set to 1e þ 06 (https://github.com/schatzlab/genomescope/ issues/30), and further changed this default from 1e þ 03 to all (i.e. 'Max kmer coverage' ¼ -1) in the GenomeScope latest version (i.e. v2.0). 22 For Spermatophyta, high-frequency non-nuclear reads primarily come from chloroplast because current materials used for genome sequencing are generally green leaves rather than etiolated leaves. Accordingly, through applying GenomeScope v1.0, 21 GenomeScope v2.0, 22 and GCE v1.0.2, 23 we calculated the L. erythrorhizon's genome size at five thresholds of parameter 'Max kmer coverage' (i.e. 1e þ 03, 1e þ 04, 1e þ 05, 1e þ 06, and all) with three levels of parameter 'Kmer length' (i.e. 17, 19, and 21), based on total Illumina data and corresponding chloroplast-filtered data (i.e. cpclean data; Supplementary Table S4). The results ( Fig. 2A and Supplementary  Table S3) showed that: (i) different software (or versions) and parameter 'Kmer length' had little effect on genome size estimation when parameter 'Max kmer coverage' was fixed; (ii) the estimated genome sizes continued to increase as 'Max kmer coverage' became large; (iii) when 'Max kmer coverage' ! 1e þ 04, the genome sizes estimated by total data were significantly greater than that estimated by cpclean data; but, the corresponding differences remained almost constant ($36.0 Mb; Fig. 2A) when 'Max kmer coverage' ! 1e þ 05; these suggested that chloroplast reads significantly skewed the estimated genome size, and these reads mainly concentrated at around 'Kmer coverage' % 1e þ 04, consistent with the kmer distribution (Fig. 2B) and previous study 21 ; (iv) coincidently, the genome size ($707.03 Mb) estimated by total data at 'Max kmer coverage' ¼ 1e þ 06 was approximately equal to the size ($708.74 Mb) estimated by cpclean data at 'Max kmer coverage' ¼ all, due to the increased size caused by chloroplast reads exactly offset the decreased size caused by a lack of high-kmer reads (i.e. the reads at 'Max kmer coverage' > 1e þ 6); this probably was why developers first suggested parameter 'Max kmer coverage' to be set to 1e þ 06 and further changed it to all; in other words, to make the calculation more accurate, parameter 'Max kmer coverage' was recommended to be set to all when high-frequency non-nuclear reads can be filtered out, whereas this parameter was suggested to be set to 1e þ 06 as an empirical value when high-frequency non-nuclear reads cannot be filtered out due to a lack of reference databases (e.g. a chloroplast genome). 22 Therefore, we believed that the actual L. erythrorhizon's genome size should be $708.74 Mb, which approached the L. officinale's genome size ($743 Mb) as Pustahija et al. reported 5 Table S7); and, our genome also contained 35,932 protein-coding genes, in which 28,995 genes ($80.69%) were supported by transcriptome data (Supplementary Table S8 and Fig. S2).

AAT/SAT-like family
Lithospermum erythrorhizon belongs to Boraginales; and, Boraginales, together with three other orders (i.e. Solanales, Gentianales, and Laminates), are the four core groups in the lamiids clade. 6,58 Therefore, through sequence similarity search, we identified 1,233 AAT/SAT-like genes (Supplementary Table S9) in L. erythrorhizon, six other lamiids species (i.e. two Solanales species: Solanum lycopersicum, Ipomoea trifida; two Gentianales species: Coffea canephora, Catharanthus roseus; two Lamiales species: Tectona grandis, Callicarpa americana), and three outgroup species (i.e. Rhododendron simsii, Actinidia eriantha, and Arabidopsis thaliana) (Supplementary Table S2). As expected, we found that the AAT's equivalent was LE32265.1, and the SAT's equivalent was  Table S9). After removing redundant sequences and unusual encoding sequences, a total of 674 genes were retained (Supplementary Table S10). Since AAT/SAT contained one characteristic domain (i.e. PF02458jTransferase) and two characteristic motifs (i.e. HXXXD and DFGWG) (Supplementary Tables S11 and S12), 44-46 the sequences containing abnormal domains and motifs were further filtered. Finally, a total of 563 genes ( Fig. 3 and Supplementary Table S12) were retained as AAT/SAT-like superfamily members, in which at least 18 members had been verified by previous studies [i.e. 2 (i.e. AAT/SAT) þ 16 from the Swiss-Prot database (Supplementary Table S13)]. According to the above structural and functional information, the AAT/SAT-like superfamily should be the BAHD superfamily (i.e. benzylalcohol O-acetyltransferase, anthocyanin O-hydroxycinnamoyltransferase, N-hydroxycinnamoyl anthranilate benzoyltransferase, and deacetylvindoline O-acetyltransferase superfamily), which catalysed various acylation reactions in plant metabolism (e.g. lignins, anthocyanins, terpenoids, and various esters). [44][45][46] To further classify the AAT/SAT-like superfamily, we constructed a phylogenetic tree. The results (Fig. 3) showed that: (i) this superfamily was roughly divided into three sections, i.e. Sub-clade I, Subclade II, and some oddments; furthermore, the Sub-clade I was roughly divided into four broad categories and five sub-categories  SAT-like family' in this study; in addition, this AAT/SAT-like family (i.e. the sub-category I.A.1) contained a total of eight verified members, in which CRO_120021.mRNA1 (i.e. Swiss-Prot ID: Q9ZTK5jdeacetylvindoline O-acetyltransferase from C. roseus; Supplementary Table S13) was used to name the BAHD superfamily in previous studies [44][45][46] ; (iii) although S. lycopersicum and I. trifida belonged to Solanales, the numbers of the AAT/SAT-like family members they each contained were significantly different (i.e. S. lycopersicum: 15 vs. I. trifida: 4; Fig. 3), and a similar numerical difference was also found in two Gentianales species (i.e. C. canephora: 2 vs. C. roseus: 13; Fig. 3); in addition, two Lamiales species (i.e. T. grandis and C. americana) did not contain any members in this AAT/SAT-like family, although they owned abundant members in the Sub-clade I and the whole superfamily (Fig. 3); therefore, all these indicated that the number of AAT/SAT-like family members significantly expanded or contracted in different species, which might be related to the species-specific properties.

AAT/SAT's evolutionary process
In the AAT/SAT-like family (i.e. the sub-category I.A.1), AAT/SAT (i.e. LE32265.1 and LE01141.1) plus two other members (i.e. LE03170.1 and LE25525.1) seem to converge into a common clade (Fig. 3). Therefore, we named this clade as 'AAT/SAT clade' in this study. These four members should be real gene loci because the evidence collectively supports them on three fronts (i.e. Ab initio þ Homology þ Transcriptome; Supplementary Table S8). Based on the tree reconstructed only using 12 L. erythrorhizon members ( Supplementary Fig. S3) Table S15). These were consistent with their loci information in the genome: LE01141.1, LE03170.1, and LE25525.1 were closely located in the Contig00446, whereas LE32265.1 was located in the Contig00079 alone (Fig. 4A). The phylogenetic relationship in the AAT/SAT clade had been confirmed as ((LE32265.1, LE03170.1) Node A , (LE01141.1, LE25525.1) Node B ) Node C ( Fig. 3 and Supplementary  Fig. S3), and the Ks values between these four members were also known (Supplementary Table S14). Therefore, (i) one round of dispersed duplication should occur in Node A at Ks Node A ¼ Ks LE03170.1 vs. LE32265.1 ¼ 0.4477 because only LE32265.1 was identified as 'dispersed duplication'; (ii) one round of proximal duplication should occur in Node B at Ks Node B ¼ Ks LE01141.1 vs. LE25525.1 ¼ 0.3245 because both LE01141.1 and LE25525.1 were identified as 'proximal duplications'; (iii) and, another round of proximal duplication should occur in Node C at Ks Node C ¼ Ks LE03170.1 vs. LE01141.1 ¼ 0.4976 because there was only one round of dispersed duplication in the AAT/SAT clade (thus, Ks LE32265.1 vs. LE01141.1 and Ks LE32265.1 vs. LE25525.1 were excluded), and Ks Node C must be greater than Ks Node A and Ks Node B (thus, Ks LE03170.1 vs. LE25525.1 were excluded) (Fig. 4A). To sum up, we inferred that the AAT/SAT's evolutionary process probable underwent three rounds of gene duplication (Fig. 4A): (i) first, one round of proximal duplication occurred in Node C at Ks ¼ 0.4976 and made ancestor C produce ancestor A and ancestor B; ancestor A probable located on the existing LE03170.1 locus, and ancestor B probable located on the existing LE01141.1 locus, due to Ks LE03170.1 vs. LE01141.1 was assigned to Ks Node C ; (ii) subsequently, one round of dispersed duplication occurred in Node A at Ks ¼ 0.4477 and made ancestor A produce the existing LE03170.1 and LE32265.1 (i.e. AAT); (iii) finally, another round of proximal duplication occurred in Node B at Ks ¼ 0.3245 and made ancestor B produce the existing LE01141.1 (i.e. SAT) and LE25525.1.
Furthermore, we detected whether positive selection sites existed on each branch inside the AAT/SAT clade. The results showed two potential positive sites (i.e. sites 267 and 269) were on the branch LE32265.1 and one potential positive site (i.e. site 389) was on the branch ancestor B (Fig. 4B and Supplementary Table S16). In other words, (i) after the proximal duplication in Node C, ancestor B was possibly subjected to positive selection, while ancestor A was not; (ii) after the dispersed duplication in Node A, LE32265.1 was possibly subjected to positive selection, while LE03170.1 was not; (iii) after the proximal duplication in Node B, both LE01141.1 and LE25525.1 were not under positive selection. To sum up, the above evidence suggested that gene duplication and positive selection collectively propelled AAT/SAT's functional differentiation in evolution.