Genome analysis of Taraxacum kok-saghyz Rodin provides new insights into rubber biosynthesis

The Russian dandelion Taraxacum kok-saghyz Rodin (TKS), a member of the Composite family and a potential alternative source of natural rubber (NR) and inulin, is an ideal model system for studying rubber biosynthesis. Here we present the draft genome of TKS, the first assembled NR-producing weed plant. The draft TKS genome assembly has a length of 1.29 Gb, containing 46 731 predicted protein-coding genes and 68.56% repeats, in which the LTR-RT elements predominantly contribute to the genome enlargement. We analyzed the heterozygous regions/genes, suggesting its possible involvement in inbreeding depression. Through comparative studies between rubber-producing and non-rubber-producing plants, we found that enzymes of the mevalonate (MVA) pathway and rubber elongation might be critical for rubber biosynthesis, and several key isoforms have been isolated and shown to be predominantly expressed in the latex, indicating their crucial functions in rubber biosynthesis. Moreover, for two important families in rubber elongation, the CPT/CPTL and REF/SRPP families, diverse evolutionary tracks have been revealed. These results provide valuable resources and new insights into the mechanism of NR biosynthesis, and facilitate the development of alternative NR-producing crops.


INTRODUCTION
Natural rubber (NR) is an isoprenoid polymer synthesized on small dispersed rubber particles in the latex of many plant species. In 2015, the global consumption of NR reached over 12.14 million tons, with a value of ∼17 billion US dollars [1], and demand is still steadily increasing. By now, the Pará rubber tree (Hevea brasiliensis L.) is nearly the exclusive source of NR; however, further increase in the yield is severely impaired due to its limited planting area, narrow genetic backgrounds, severe diseases and laborious work requirements [2,3]. Therefore, it is vital to explore an alternative source and a model plant for NR production and research.
There are more than 2500 plant species that could generate NR in the latex, but high-molecular-weight rubber is only identified in a few plants; of these, Taraxacum kok-saghyz Rodin (TKS) has drawn special attention since the 1940s [4]. TKS is a perennial plant from the Composite family, originating from the Tian Shan mountain regions in China and Kazakhstan [5], which can be widely planted in highor low-latitude climates. The root of TKS could produce a large amount of NR (up to ∼20% dry weight), as it shows even higher molecular weight than the Pará rubber tree [6]. In addition, the root of TKS could also produce an abundance of inulin, which can be used for bioethanol production. The advantages of broad planting area, high content and quality of rubber, easy planting and harvest, and short maturation time make TKS an excellent alternative source of NR. Moreover, TKS has a relatively simple genome of an estimated 1.4 Gb [5], forming 2N = 2X = 16 chromosomes [7], and more importantly it can easily be genetically manipulated. These make TKS an ideal model plant for NR research. The future of domesticating wild TKS is promising [4], but its high heterozygosity and inbreeding depression due to its self-incompatibility raise serious challenges for generating a complete genome sequence and developing TKS as an NR-producing crop [5].
Understanding the NR biosynthetic pathway is always of great interest. In dandelion and some other rubber-producing plants, several rubberbiosynthesis-related genes have been identified, including cis-prenyltransferases (CPTs) that control the NR chain elongation [8,9], small rubber particle proteins (SRPPs) that maintain the stability of the rubber particles [10,11], rubber elongation factors (REF) that improve the CPT activity [12], and rubber transferase activators (RTA or CPTL) that interact with CPTs [13][14][15]. However, the knowledge is still segmental and the mechanisms of rubber biosynthesis in different rubber-producing plants remain elusive. Therefore, to better understand the biosynthesis and regulation of NR and to accelerate the development of TKS for NR production, here we present the de novo genome assembly of TKS line 1151. The draft genome sequence provides new insights into genome evolution, inbreeding depression and genetic mechanisms underlying the biosynthesis of rubber and inulin.

Sequencing, assembly and annotation
We carried out de novo sequencing and genome assembly of the TKS line 1151 based on the PacBio RSII platform that generates ∼49.76 Gb of genomic sequence with 48-fold coverage and the Illumina HiSeq 2500 platform that produces ∼60.48 Gb of genomic sequence with 58-fold coverage (see Supplementary Fig. S1 and Supplementary Table S1). We first confirmed the karyotype of TKS as 2N = 2X = 16 (see Supplementary Fig. S2) and estimated the genome size in the range of 1.04 Gb based on a 19-mer analysis (see Supplementary Fig. S3). This is slightly smaller than the size predicted as ∼1.18 Gb by flow cytometry (see Supplementary Fig. S4). To assist the scaffold construction, we also generated an additional ∼123.31 Gb high-quality matepair data with insertion sizes from 5000 to 13 000 bp (see Supplementary Fig. S5 and Supplementary  Table S1). Finally, we obtained a 1.29 Gb genome assembly, which contains 19 227 scaffolds with N50 sizes of 100.21 kb and 31 965 contigs with N50 sizes of 47.63 kb (see Table 1 Supplementary Fig. S8). To evaluate the assembled genome, we aligned 248 highly core-eukaryotic genes to the assembled genome, and identified highconfidence hits of 238 (95.97%) with full length and 241 (97.18%) with partial length (see Supplementary Table S4). We also gauged the completeness and accuracy of the assembled TKS genome using the 956 Benchmarking Universal Single-Copy Orthologs (BUSCO) gene set [16], and found complete genes of 91.7% (876) and fragmented genes of 2.6% (25) (see supplementary data online). This evidence indicated the high quality and coverage of the assembled TKS genome.
To facilitate the gene annotation of the assembled TKS genome and construct the gene expression atlas, we performed RNA-seq with three replicates for latex and other 11 representative tissues, generating a total of ∼401 Gb data (see Fig. 1 and Supplementary Table S5). Using the combined method of ab initio, assembled transcripts and protein homologues, we predicted 46 731 proteincoding genes (see Table 1 and Supplementary  Fig. S9), with ∼82.90% supported by various public protein-function databases (see Supplementary  Table S6). In addition, we also identified a large number of non-coding RNAs, including 162 rRNAs, 836 tRNAs, 265 miRNAs, 22 SRPRNAs, 167 snR-NAs, 594 snoRNAs and 214 other ncRNAs (see Supplementary Table S7). Globe artichoke is a non-rubber-producing plant and the nearest species to TKS with an assembled genome. A total of 173 genomic synteny regions were found between TKS and globe artichoke, and the largest collinearity region exists between TKS scaffold22 (841 702 bp; 68 genes) and globe artichoke LG9 (18 344 014 bp; 1011 genes), in which 23 genes are matched overall (see Fig. 2A). We also identified two scaffolds (scaffold1261 and scaf-fold896) containing one CPTL gene and one SRPP gene in the assembled TKS genome (see Supplementary Fig. S10). In the TKS genome assembly, segments on a certain scaffold are collinear to different chromosomes of globe artichoke, indicating that some structural variations occur during genome evolution (see Supplementary Fig. S11). In addition, we found 23 genomic synteny regions and a total length of 2.78 Mb homologous genome sequence between the TKS and Hevea genomes. We also identified 51 435 recent segmental duplications with a total length of 26.90 Mb (2.08%) in the assembled TKS genome (see Fig. 2B). However, we should be cautious and these segmental duplications need to be validated by different methods. These genomic results suggest high collinearity with the globe artichoke genome and are useful for studying the TKS evolution and biological functions of rubber biosynthesis.

Heterozygosity
TKS is reported to be self-incompatible [5], which usually leads to high heterozygosity of the genome and results in inbreeding depression, such as the reduced survival and fertility of offspring. When exam-ining the sequencing depth distribution, we found two peaks (see Supplementary Figs S12 and S13) at ∼18 × depth (the first peak) and ∼33 × depth (the second peak), indicating that the genomic regions with the first peak might be underassembled allelic regions with high heterozygosity. A total of 9241 contigs with half of the sequencing depth were identified as heterozygous genomic regions with a total length of ∼306.47 Mb (see Supplementary  Fig. S14). These contigs indeed explain the first peak (∼18 ×) for the sequencing depth, as the depth distribution becomes normal after excluding them (see Supplementary Fig. S12). Of these 9241 contigs, 3261 (∼110.68 Mb in length) were independent contigs, each of which constitutes one scaffold, which might be an allelic copy of the other 5980 integrated contigs (see supplementary data online). Therefore, we aligned these independent contigs to the integrated ones using LASTZ [17]. The results showed that 997 independent contigs (24.21 Mb in length) may be the extra allelic copy because they can be aligned to 427 integrated contigs (28.56 Mb in length) with more than 50% identity (see Supplementary Fig. S15). We examined the functions of 301 genes in these independent contigs (see Supplementary Table S8), and they were significantly enriched in carboxylic acid metabolic pathways, methylation, phosphatase complex, and ion binding and transporter activity pathways by GO analysis (see Supplementary Table S9). These observations indicate that these heterozygous regions or genes are possibly related to inbreeding depression caused by homozygous recessive deleterious effects [18].

LTR-RT elements drive the TKS genome enlargement
We identified 875.81 Mb (68.56% of the assembled genome length) as repetitive sequences in the TKS genome (see Table 1 and Supplementary Table S10), which is higher than the 58.4% in globe artichoke [19], 6.25% in horseweed [20] and 54.4% in danshen [21], but slightly lower than the 71.2% in rubber tree [22]. Among the transposable elements (TE), long terminal repeats (LTRs) were found to be the predominant TE family in TKS, accounting for 40.73% (∼520 Mb) of the assembled genome, and Copia (∼260.5 Mb, 20.39%) and Gypsy-type (∼252.9 Mb, 19.79%) LTRs were the most abundant TE subfamilies. Complete LTR-RTs were identified for four Asteraceae species (TKS, globe artichoke, horseweed and sunflower), one Lamiaceae species (danshen) and six other model plants using the LTR-FINDER [23] (see Supplementary Fig. S16 and Supplementary Table S11). The TKS genome assembly contains 6154 complete LTR-RTs, which is the highest in all species. The average LTR-RT nucleotide distance (d-value = 0.023) in the TKS genome assembly is significantly lower than the other three Asteraceae species (see Supplementary  Table S11) with a peak of substitution distance around 0.0025, suggesting that a burst of LTR-RT insertion occurred later than other Asteraceae species at ∼0.24 Mya (see Supplementary Fig. S17). Taken together, these results suggest that the enlargement of the TKS genome might be driven by LTR insertion.

Genome evolution
Based on protein homologies within and between TKS and ten other species, we identified 16 169

RESEARCH ARTICLE
gene families in the assembled TKS genome, 1907 of which were TKS specific and 10 021 were shared with globe artichoke and sunflower (see Fig. 2C). We detected 340 genes in the TKS-specific gene families with tissue-specific expression patterns, and they are significantly enriched in polygalacturonase, responses to stresses or defenses, and cell redox homeostasis by GO analysis (see Supplementary  Table S12), which may be associated with biotic and abiotic stress resistance.
A mean family size of 3.9 genes per family was found in 7396 gene families except for 8773 orphan genes, which is at a medium level in the eudicot plants (average size of 3.5 genes per family). To analyze gene family gain-loss events, 64 single-copy gene families were used to construct a maximum likelihood (ML) phylogenetic tree between TKS and ten species (see Fig. 2D). Among these species, TKS has the most gene families in the Asteraceae and Lamiaceae species. Meanwhile, we detected 3046 gene families that were expanded in TKS compared with globe artichoke, and GO analysis showed significant enrichment (P < 0.01) of these genes in protein kinases, ATP binding, macromolecular complex, fatty acid biosynthesis and cell wall organization (see Supplementary Table S13). Notably, we also detected the CPT gene family in these expanded families, suggesting the importance of these expansions in TKS compared to non-rubber-producing plants. The divergence time of the Asteraceae species estimated by the nucleotide substitution rate is 52.89 Mya (see Fig. 2D), which is consistent with the 47.5 Mya estimated by the fossil time [24,25].

Rubber and inulin biosynthetic pathways in TKS
NR biosynthesis is initiated by the priming allylic molecules (initiators) and elongated by the cisconfiguration sequential condensation of isopentenyl pyrophosphate (IPP), an important precursor of NR formed from acetyl-CoA via the cytosolic mevalonate (MVA) pathway or pyruvate and glyceraldehyde 3-phosphate via the plastidic 2-C-methyl-D-erythritol-4-phosphate (MEP) pathway. To investigate rubber biosynthetic pathways of TKS, we performed a homolog analysis of rubber biosynthetic genes (see Supplementary Fig. S18). A total of 102 candidate rubber biosynthesis-related genes were identified in the genome assembly, including 40 genes in 6 processes of the MVA pathway, 23 in 7 processes of the MEP pathway and 19 for initiator synthesis, in which geranyl pyrophosphate (GPP), famesyl pyrophosphate (FPP) and geranylgeranyl pyrophosphate (GGPP) are involved [26]. In ad-dition, 20 genes for 'rubber elongation' on rubber particles were also predicted (see Fig. 3 and Supplementary Table S14). We compared these genes with their homologs in the rubber-producing species (Hevea) and non-rubber-producing species (globe artichoke), and found that at least one enzyme exists in each process in all three species, and the gene number is similar for enzymes in the MEP pathway and rubber initiator synthesis, but differs in the MVA pathway and rubber elongation (see Supplementary Table S15). These results suggest that the MVA pathway and rubber elongation-related enzymes might be more important for rubber biosynthesis during genome evolution. Consistent with this, for each process in the MVA pathway in TKS, at least one enzyme shows predominant expression in latex and roots (see Fig. 3), but most genes in the MEP pathway have a medium or low expression level in latex. In Hevea it is also reported that the MVA pathway shows latex-biased abundant expression [22,27,28]. These data suggest that the different activities and spatiotemporal expression of these related enzymes may be crucial for their function in different species, and the MVA pathway, rather than the MEP pathway, might be the main source of IPP for rubber biosynthesis in both roots of TKS and inner bark of Hevea. Based on this, we found that the two genes, TkHMGR1 and TkHMGR2, which encode 3-hydroxy-3-methylglutaryl-coenyme, a reductase that catalyzes the key step in the MVA pathway, were predominantly expressed in roots, with the highest expression level in latex. We further found that the TkCPT1, TkCPT2 and TkCPTL1 genes may play a critical role for the elongation of rubber polymers because they encompass the highest expression levels in latex and roots and are homologous with that in T. brevicorniculatum. Furthermore, seven of the nine TkSRPP genes were found highly expressed in latex and roots. These data demonstrate that the complete genome information together with the gene expression profile could facilitate the identification and isolation of key genes of the rubber biosynthetic pathway.
In addition to NR, abundant secondary metabolites, especially inulin, are also important economic outcomes of TKS. In the assembled genome, 15 candidate genes encoding the proteins required for biosynthesis and metabolism of inulin were detected (see Supplementary Fig. S19), including six sucrose transporters (SUCs), two fructosyltransferases, and seven fructan 1-exohydrolases (1-FEHs). Both the two essential genes encoding sucrose:sucrose 1-fructoslytransferase (1-SST) and fructan:fructan 1-fructosyltransferase (1-FFT) were found strongly expressed in roots, but most of the 1-FEH genes were not highly expressed.

The CPT/CPTL gene family in TKS
We identified eight TkCPT and two TkCPTL genes for 'rubber elongation' on rubber particles in the assembled TKS genome. Comparative analysis revealed that the CPT family genes possess five highly conserved regions, Region I to Region V (see Supplementary Fig. S20), in which the conserved Asp residue in Region I and the five conserved residues (F93, S94, R243, R249 and E307) in Regions II to V are essential for catalysis and substrate recognition in the rubber elongation, respectively [29,30]. In addition, the eight TkCPT genes could be further classified into two types (see Supplementary Figs S20 and S21), in which type I is highly conserved between Regions III and IV, whereas type II is diverse from them.
The phylogenetic tree divides the CPT/CPTL gene family from eighteen species into five groups (see Fig. 4A), and Group 3 and Group 4 bear dicotand monocot-specific CPT genes, respectively. All four type I CPT genes in the assembled TKS genome were situated within Group 1, which also contains the genes of three T. brevicorniculatum CPTs, two lettuce CPTs and two Hevea CPTs (see Fig. 4A), suggesting that this clade is very important for NR biosynthesis and the rubber chain elongation pathway is conserved among different rubber-producing species. Furthermore, we found that both TkCPT1  and TkCPT2 are predominantly expressed in latex (see Fig. 3), indicating that they may be responsible for rubber biosynthesis in TKS. Interestingly, among the four type II TKS CPT genes in Group 3, TkCPT7 is highly expressed in roots and latex, but it contains deletions of conserved F93 and S94 residues and a substitution of R for S at the 249 position, suggesting that TkCPT7 may have other diverse functions in roots and latex.
Recently, the study of T. brevicorniculatum [13], lettuce [15] and Hevea [14] has shown that CPTL genes are essential for rubber biosynthesis. We observed a distinct clade (Group 5) that includes 13 CPTLs from plants and human/yeast (see Fig. 4A), indicating that the divergence of CPT and CPTL genes is prior to the split of the plant and animal lineages. The protein sequences of CPTLs have low homology to the CPTs, and only a part of three conserved regions (III, IV and V) were found in the CPTLs (see Supplementary Fig. S22), suggesting that these CPTLs may have lost the catalytic functions. Indeed, in lettuce, LsCPTL2 shows no catalytic activity in vitro [15]. Meanwhile, two other conserved regions were detected in the CPTLs but were lost in CPTs; these were predicted to be a transmembrane domain [31]. In TKS, TkCPTL1 was expressed highest in latex, but its homologous TkCPTL2 showed low expression levels in all tissues (see Fig. 3). Further research on TkCPTL1 will provide new insights in understanding the NR biosynthetic pathway.

The REF/SRPP gene family in TKS
REF/SRPP family proteins are important components for the biogenesis and stability of rubber particles. In the assembled TKS genome, we detected a family consisting of one TkREF and nine TkSRPP genes. Based on the phylogenetic tree of REF/SRPP proteins from rubber-producing and non-rubberproducing plants, we found that these proteins could be assigned into five clades (see Fig. 4B), including Clades I, II, III and IV from the eudicots, and Clade V from the monocots and bryophyte/ lycophyte, suggesting that the REF/SRPP genes of Asterids and Rosids may originate from a common ancestor. In the eudicots, some REF/SRPP genes from rubber-producing plants were interspersed with other SRPP-like genes from non-rubberproducing plants. However, we found that most of the TKS and Hevea SRPP genes belong to two separate clades, which may suggest that the mechanism of rubber particle stabilization is different in the two species because of independent evolution in rubber biosynthesis. Similar results were also observed in previous studies [22,32].
Most of the REF/SRPP genes in the assembled TKS genome exhibit specific high expression levels in latex and roots (see Fig. 3). Four SRPP isoforms (TkSRPP1, TkSRPP2, TkSRPP3 and TkSRPP4) have considerably higher RPKM values (>140) in latex than in other tissues (RPKM values of ∼6.23 on average), and another four isoforms have strikingly high RPKM values of >100 in roots and >2700 in latex, which is over 55 times higher than those in non-root tissues (RPKM values of ∼11.65 on average), suggesting that these REF/SRPP genes may play important roles in the biosynthesis of NR in TKS.

DISCUSSION
In this study, we present a draft sequence of T. kok-saghyz, which is a potential alternative resource for NR production and an ideal model system for NR research. TKS is the first assembled genome sequence of a rubber-producing wild weed plant. The advent of the TKS genome will facilitate and accelerate understanding of the molecular mechanisms underlying rubber biosynthesis by the cloning of key genes, genome-wide association studies and genetic improvement of TKS breeding. The TKS genome consists of a higher proportion of repetitive content (∼70%) and heterozygous rate (∼1%), making it difficult to assemble the genome. Indeed, using the Illumina sequencing data, we only obtained ∼4.2 kb length of the contig N50. Similar difficulties were also observed for the barley genome, containing ∼84% repetitive DNA, and the rubber tree genome, containing ∼78% repetitive DNA, the assembled genomes of which had ∼1.4 kb and ∼3.0 kb lengths of the contig N50, respectively [33,34]. Recently, in several species, the PacBio sequencing data has helped to greatly improve the contiguity compared to the Illumina sequencing data; e.g. the contig N50 length of the sunflower genome increased from 21-25 kb to 399 kb (∼20-fold) [35], and the contig N50 length of the maize genome increased from 40 kb to 1.2 Mb (∼30-fold) [36]. We therefore used long PacBio reads to assemble the TKS genome, and produced a much better assembled genome, which achieved a greater than 13-fold increase in the contig N50 length and more complete LTRs than other model plants. Furthermore, the PacBio assembly also helped us to get a fine gene set, which covered 91.7% genes with full length in BUSCO evaluation. However, further improvement of the TKS genome assembly is still needed. Next-generation DNA sequencing technologies offer new and powerful tools for the construction of genetic linkages. In particular, a high-density genetic linkage map is an effective strategy to validate assemblies and assign short sequences to chromosomal linkage groups. Therefore, the construction of a high-resolution genetic linkage map to improve the assembled TKS genome will be an important goal in future.
Based on the comparative analysis of CPT/CPTL and REF/SRPP gene families, we found several clues for common and unique evolution and mechanisms for rubber biosynthesis in rubber-producing plants. First, the divergence of CPT and CPTL genes is prior to the split of the plant and animal lineages, and for rubber-producing plants the CPT genes were clustered into two groups by their functions rather than species (see Fig. 4A), suggesting a similar mechanism for the rubber chain elongation in these species. Second, we detected very high sequence identity (more than 98%) among REF/SRPP homologs from TKS and T. brevicorniculatum (see Fig. 4B), indicating their close relatives between genomes [37]. Third, no expansion of the REF/SRPP gene family was found in the assembled TKS genome, which is different from the Hevea genome [22]. Similarly, the phylogenetic analysis of REF/SRPP gene family showed that most of the TkSRPP, TbSRPP and LsSRPP genes belong to the same clade, but most of the Hevea SRPP genes are assigned to a different clade, suggesting that the gene function and mechanism of rubber particle stabilization might be diversified during evolution between TKS and Hevea. These evolutionary traces raise new research interests and require further molecular and genetic evidence. For example, the rubber content in the TbREF-RNAi plant was lower than that in the wild type [12], but silencing its highest homologous gene LsSRPP8 in lettuce showed RESEARCH ARTICLE no effect on its rubber biosynthesis [38], indicating different molecular mechanisms of rubber particle stabilization in these two species. The assembled genome of T. kok-saghyz will provide fundamental information to facilitate the molecular dissection of NR biosynthesis and regulation, contribute to a better understanding of the adaptation of TKS to environmental stresses and provide a more powerful tool to develop a new crop in producing NR.

METHODS
The dandelion (Taraxacum kok-saghyz Rodin) line 1151 was used for genome sequencing and assembly. The genome assembly was sequenced using the whole-genome shotgun approach with PacBio RSII platforms and assembled with smartdenovo. Detailed information on materials, sequencing, assembly, annotation, and genome and transcriptome analyses is given in the supplementary data online. The T. kok-saghyz genome, gene annotation, and nucleotide and protein sequences were deposited in the Genome Warehouse (GWH; http://bigd.big.ac.cn/gwh/) under the accession number PRJCA000437.