Evolution of complex genome architecture in gymnosperms

Abstract Gymnosperms represent an ancient lineage that diverged from early spermatophytes during the Devonian. Their long fossil record and the low diversity of living species point to a complex evolutionary history that included ancient radiations and massive extinctions. Because of their ultra-large genome sizes, whole-genome assemblies of gymnosperms have only been generated in the past 10 years and are now being expanded to cover more taxonomic groups. Here, we provide an overview of the publicly available gymnosperm genome resources and discuss their assembly quality and recent findings on large genome architectures. In particular, we describe the genomic features most related to changes affecting the whole genome. We also highlight new insights into repetitive sequence dynamics, paleopolyploidy, and long introns. Based on the results of relevant genomic studies of gymnosperms, we suggest that additional efforts should be made toward exploring the genomes of medium-sized (5–15 gigabases) species. Lastly, more comparative analyses among high-quality assemblies are needed to understand the genomic shifts and the early species diversification of seed plants.

Extant gymnosperms comprise ~1,100 species encompassing four major lineages: cycads, Ginkgo, conifers, and gnetophytes (Fig. 1A). Due to the conifers' immense ecological and economic value, great efforts were made to examine the whole genomes of this group [6]. The conifers consist of approximately 615 species covering enormous regions of the Northern Hemisphere and serving as the major backbone of worldwide forest ecosystems [7] (Fig. 1A).
A milestone report from early 2013 presented a 23-Gb assembly of loblolly pine (Pinus taeda), the first draft genome of a gymnosperm species [8,9]; a pre-publication release of the initial assembly was made in 2012 [10]. Notably, at least ten conifer genome projects were underway at that time [8]. Another sequencing study on Norway spruce (Picea abies) conducted a comparative analysis of the genome architectures of seed plants [11]. Two sets of annotated coding genes (high-confidence and low-confidence) with a BUSCO (Benchmarking Universal Single-Copy Orthologs) ratio <30% indicated that there were still considerable gaps and redundancies in this assembly. The small size of the scaffolds (those >10 kb total only 4.3 Gb) also reflected the inherent limits of short-read sequencing, even when using high-coverage Illumina data [11]. Based on samples of the protein-coding and -noncoding fractions of the assembly, a plausible model for conifer genome evolution was proposed: slow rates of activity for a diverse set of retrotransposons, and a much lower frequency of recombination in noncoding regions compared to angiosperms [11]. Subsequent investigations revived the scenario of genomic dynamics in conifers, enabling the establishment of giant genomes [12][13][14][15] and the study of ecological adaptiveness and phenotypic stasis [16,17]. With increased data, including transcriptomes and plastid genomes, studies focusing on the phylogenetic relationships among extant gymnosperms triggered great debates over the placement of various lineages, as these studies were based on different data matrices and/or analytical approaches. One of the most controversial issues is the placement of gnetophytes. Several hypotheses have been put forward, suggesting that gnetophytes are sister to Pinaceae (the 'Gnepine' hypothesis), cupressophytes (the 'Gnecup' hypothesis), all conifers (the 'Gnetifer' hypothesis), or all the other gymnosperms [18][19][20][21][22].
The unresolved phylogenetic relationships have encouraged new efforts toward filling in the taxonomic sampling gaps. In the last five years, draft maps of Ginkgo, gnetophytes, cupressophytes (Conifer II), and cycads have been produced and refined with an improved assembly quality [6,23-28]. In addition, genome-wide investigations have revealed typical signatures of the gymnosperm genomes, such as ubiquitously large introns and the higher expression levels of long genes [11,15,26,29]. However, the reasons behind the preservation of long genes remain poorly understood.
Here, we summarized the progress made in the whole-genome assembly of gymnosperms and described the considerably varied genomic features observed in different lineages, focusing on the early genome divergence patterns of gymnosperms. We also discussed the concerns relative to inferred paleopolyploid events and provided insights for future research directions.
Additionally, we reviewed the current knowledge on the effect of genomic changes on the diversification of gymnosperms and suggested that more efforts should be focused on medium-sized genomes. Finally, to understand the function of long introns, we recommended further examinations with reverse-genetic tools, which can enhance our understanding of plant genome evolution and adaptation.

The pulsed rises in the whole-genome assembly of gymnosperms
Thus far, compared with flowering plants, the quantities and qualities of the assembled genomes of gymnosperms are relatively lower, with an average BUSCO value of 56.92% computed from 15 decoded species (Fig. 1B). These low values derive from time-consuming projects that were launched several years ago: decades before long-read technologies were developed and became widely used. Also, the species-specific gene sets included in the library may have contributed to the underrepresented annotation of gymnosperms [6]. In terms of high-throughput Illumina sequencing platforms, it often takes 4-6 months to obtain clean reads, as a 100× coverage is required for a typical genome of 15 Gb in size and high heterozygosity [30]. Upon the completion of sequencing, the subsequent assembly has further costs, requiring more time and advanced technology. This is because large genomes commonly comprise a variety of repetitive sequences (hereafter called 'repeats'), which are untenable with short-read sequencing approaches based on overlapping reads [31,32]. For example, in the genome project of loblolly pine, although various strategies have been adopted (including fosmid and bacterial artificial chromosome (BAC) clones combined with whole-genome shotgun sequencing (WGS), RNAseq, and Bionano-seq), it was challenging to gain good contiguous contigs, a critical requirement for gene annotation [13]. Additionally, investments in both computational and analytical resources further burdened the progress of genomics research since most assemblers could not handle the incredibly large amount of input sequences from the high coverage sequencing [33][34][35][36][37].
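To give a sense of the data volume implied by the coverage figure above, the following minimal sketch computes the raw sequence needed for 100× coverage of a 15-Gb genome. The 150-bp read length is an assumption made for illustration only; it is not stated in the studies cited.

```python
def required_bases(genome_size_bp: int, coverage: int) -> int:
    """Total raw bases needed to reach a target mean coverage."""
    return genome_size_bp * coverage


def required_reads(genome_size_bp: int, coverage: int, read_length_bp: int) -> float:
    """Number of reads of a given length needed for the target coverage."""
    return required_bases(genome_size_bp, coverage) / read_length_bp


GENOME_BP = 15_000_000_000   # 15 Gb, the example genome size used in the text
COVERAGE = 100               # 100x, as stated in the text
READ_LEN = 150               # assumed short-read length (illustrative only)

total = required_bases(GENOME_BP, COVERAGE)
reads = required_reads(GENOME_BP, COVERAGE, READ_LEN)
print(f"raw sequence needed: {total / 1e12:.1f} Tb")
print(f"reads needed at {READ_LEN} bp: {reads / 1e9:.0f} billion")
```

Under these assumptions, the project must generate and store about 1.5 terabases of raw reads before assembly even begins, which helps explain the computational burden described above.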
Thanks to the advanced sequencing technologies of the PacBio RSII and Oxford Nanopore platforms, there has recently been a dramatic increase in high-quality assemblies of these gigantic genomes (Fig. 1B and Table 1). For instance, a refinement of the previous Ginkgo draft raised the contig N50 from 48 kb to 1.58 Mb [23,26]; also, nearly 95% (9.33 Gb) of the scaffolds were anchored onto pseudochromosomes (Fig. 1B). The genomes of two iconic species from the Cupressaceae family, the giant sequoia (Sequoiadendron giganteum, 8.1 Gb) and the coast redwood (Sequoia sempervirens, a hexaploid genome of 26.5 Gb), were successively decoded with conspicuously enhanced contiguity [6,38]. Additionally, three assembly data resources for a single genus, Taxus, were released almost simultaneously, reflecting the great interest in gymnosperm genomes [22,39,40]. Notably, all of these records provided impressively complete genomes, as suggested by assembly contiguity (contig N50 = 2.44 Mb in Taxus chinensis, 2.89 Mb in Taxus yunnanensis, and 8.60 Mb in Taxus wallichiana) and the coverage of the core Embryophyta gene library [41] (Fig. 1B). Moreover, the recent sequencing of the haploid megagametophytes of Cycas panzhihuaensis showed outstanding assembly quality, with a contig N50 length of 12 Mb [28]. Integrative strategies that combine long-read mapping with short-read polishing have proven feasible for almost all species. Also, high-throughput chromosome conformation capture (Hi-C) can further assist the ordering of sequences [15,42].
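Because contig N50 is the contiguity metric quoted throughout this section, a short sketch of its usual definition may help: N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The function name and the toy contig lengths are illustrative and are not taken from any of the assemblies discussed.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downward until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example: the total length is 54, so N50 is reached once 27 is accumulated.
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))
```

A higher N50 for the same total assembly length means that fewer, longer contigs carry most of the sequence, which is why the jump from 48 kb to 1.58 Mb in Ginkgo is read as a major improvement.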

Insights into the repetitive sequence dynamics in gymnosperms
Comparative genomic studies revealed that angiosperm genomes are considerably flexible and dynamic in terms of the rate of DNA sequence integration and elimination [43][44][45]. Apart from the insertion of viral DNAs, plastids, and mitochondrial sequences, the fluctuation of plant genome sizes is mainly attributed to the historical and ongoing activity of (retro)transposable elements (TEs) (i.e., long terminal repeat retrotransposons (LTR-RTs), which are a major component contributing to the non-coding genomic regions of most seed plant genomes [46][47][48]). However, many of the angiosperm genomes have a fast turnover of a few million years (Ma) via the proliferation of retrotransposons and unequal recombinations (URs) [49]. Thus, the inevitable genome enlargement was efficiently counteracted by a high rate of DNA excisions [50]. In contrast, the ultra-large (>10 Gb) genomes of gymnosperms are commonly characterized by a relatively low frequency of UR, as evidenced by surveys of the ratio of intact long terminal repeats (LTRs) and solitary LTRs (solo-LTRs) (Fig. 1C). The URs between LTR-RTs often remove the intervening sequences and lead to the formation of solo-LTRs, enabling the ratio of intact versus solo-LTRs to be an indirect proxy for the removal mechanism [51,52].
The genome-skimming of P. abies and Pinus tabuliformis identified lopsided numbers of LTRs, with many more intact LTRs than solo-LTRs [11,15]. This is consistent with the patterns observed in other conifers (P. taeda and Picea glauca) [24,52]. However, such a signature is atypical in non-conifer gymnosperms and, more specifically, in non-Pinaceae species, regardless of the genome size. Numerous solo-LTRs (60,623), in contrast to far fewer intact LTRs (14,128), were detected in the 9.88-Gb Ginkgo genome [27]. Likewise, a high ratio of solo- to intact LTRs (5.5:1) was reported in T. wallichiana (10.9 Gb), a species belonging to the cupressophytes [40]. Moreover, two gnetophyte species, Gnetum montanum (4.13 Gb) and Welwitschia mirabilis (6.86 Gb), showed an elevated frequency of the recombination-based removal of retroelements [24,27]. Hence, the greatly reduced TE elimination activity revealed in Pinaceae might be a family-specific feature generated after their separation from the main conifer clade. Potentially, the kinetics of TE removal might have diverged independently within lineages, considering the incomplete examination of Pinaceae, especially of groups with relatively smaller genomes (e.g., Larix). Furthermore, the low occurrence rate of solo-LTRs in Pinaceae was mostly inferred from either fragmented assemblies [11,52] or the manual examination of randomly sampled contigs/scaffolds [15]. More integrative and genome-wide identifications of these LTRs in high-quality genomes of Pinaceae are needed before we can fully understand the formation of ultra-large genomes. Apart from infrequent URs, the reduced activity of other co-occurring processes, such as 'illegitimate recombinations', may also affect the steady growth of genomes in the long term [53]. Mobile elements like LTRs that are repaired by non-homologous end joining and single-strand annealing may generate truncated or solitary elements, resulting in genome shrinkage [50,54].
These disarmed LTRs may no longer be autonomous and thus cannot contribute to genome expansion [54]. More data needs to be collected concerning the DNA repair by-products of gymnosperms. Also, the comparison between gymnosperms and angiosperms of the proteins and genes (i.e., Ku70/Ku80 [55] and AtBRCC36A [56]) involved in such processes is required, especially among those species with distinct genome sizes.
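The solo-to-intact LTR ratio used above as a proxy for recombination-based removal is a simple quotient. The sketch below applies it to the Ginkgo counts quoted in the text; the function name is illustrative, and the counts are the only values taken from the source.

```python
def solo_to_intact_ratio(solo_ltrs: int, intact_ltrs: int) -> float:
    """Ratio of solo-LTRs to intact LTR retrotransposons.

    Values well above 1 suggest frequent unequal recombination between
    LTRs; values below 1 suggest that removal is comparatively rare.
    """
    if intact_ltrs <= 0:
        raise ValueError("intact LTR count must be positive")
    return solo_ltrs / intact_ltrs


# Counts reported for the Ginkgo genome in the text above.
ginkgo_ratio = solo_to_intact_ratio(60_623, 14_128)
print(f"Ginkgo solo:intact = {ginkgo_ratio:.2f}:1")
```

On these counts Ginkgo comes out at roughly 4.3:1, of the same order as the 5.5:1 reported for T. wallichiana. Note that such ratios are only comparable between species when the LTR annotation pipelines and assembly qualities are similar.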
As the prevalent class of TEs, LTRs have had, through their historical activity, a crucial influence on the genome size and the gene structure of plants [57,58]. All gymnosperms likely share a common feature of repeat dynamics: an ancient but continuous amplification of LTRs within a range of 5-50 Ma [28,40]. The insertion date is usually determined from the synonymous substitutions per synonymous site (Ks) between the 5′-LTR and 3′-LTR sequences of each element, converted to time using an appropriate mutation rate (per base per year) [59]. The intergenic nucleotide substitution rate of 2.2×10⁻⁹ is normally adopted, assuming that gymnosperms evolved at a slower pace than angiosperms. Thus, the various ages estimated by different studies for the LTR outbreaks of the same gymnosperm could be partially explained by the different neutral mutation rates assigned (e.g., 7.3×10⁻¹⁰ was used for T. yunnanensis and T. chinensis var. mairei [22,40]). It is worth mentioning that the outlier Welwitschia has undergone a very recent expansion of both autonomous and non-autonomous LTRs within the last 1-2 Ma, which probably resulted from a cascade of events triggered by intense aridity [27]. High-resolution categories of retroelements and the use of appropriate mutation rates [60] are both required to distinguish the species-specific expansions that contribute to the diversity in genome growth rhythms [61,62].
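The dependence of the estimated insertion age on the assumed mutation rate can be made concrete with the commonly used relation T = K / (2r), where K is the divergence between the two LTRs of one element (called Ks in the text above) and r is the substitution rate per site per year. The sketch below is illustrative: the divergence value of 0.05 is hypothetical, the two rates are those quoted in the text, and the Jukes-Cantor correction is one standard option that the cited studies may or may not have applied.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Correct an observed proportion of differing sites for multiple hits."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor model")
    return -0.75 * math.log(1 - 4 * p / 3)


def insertion_time_years(divergence: float, rate_per_site_per_year: float) -> float:
    """Estimate LTR insertion age as T = K / (2r).

    The factor of 2 reflects that both LTRs accumulate substitutions
    independently after insertion.
    """
    return divergence / (2 * rate_per_site_per_year)


K = 0.05  # hypothetical divergence between the 5' and 3' LTRs of one element
for rate in (2.2e-9, 7.3e-10):  # the two rates mentioned in the text
    age_ma = insertion_time_years(K, rate) / 1e6
    print(f"rate {rate:.1e}: about {age_ma:.1f} Ma")
```

For the same element, switching from 2.2×10⁻⁹ to 7.3×10⁻¹⁰ roughly triples the inferred age (about 11 Ma versus about 34 Ma here), which illustrates why LTR burst dates from different studies are not directly comparable.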
The subsequent ancient insertions and the unusual recent burst of LTRs raise an intriguing question regarding the differences in TE surveillance between gymnosperms and angiosperms since the genome size is generally smaller in the latter. The necessity of TE silencing has been widely acknowledged, and the epigenetic control of DNA sequences is considered the vital nuclear defence system of plant genomes against the destructive potential of TEs [63]. RNA-directed DNA methylation (RdDM) is an epigenetic pathway that evolved to guide the modelling of DNA condensation and TE silencing [68]. This complicated pathway was first observed in transgenic tobacco infected with viroids, plant pathogens containing solely nonprotein-coding RNA [69]. Despite the limited epigenetic investigations in gymnosperms, several instructive studies have provided the general landscape of DNA methylation in the gymnosperm genome [70,71]. For example, CpG and non-CpG methylations are both surprisingly high in P. tabuliformis (88.4% for CG; 81.6% for CHG) and W. mirabilis (78.32% for CG; 76.11% for CHG) [15,27], consistent with previous observations in P. abies [72].
Furthermore, global methylation levels positively correlate with genome sizes due to the widespread distribution of TEs along the genome [73,74]. In addition, the representative genes associated with various methylation pathways have mostly been identified in gymnosperms, implying the probable functional conservation of these pathways across seed plants [70]. The activity of RdDM was further validated by dynamic changes in the methylation level of specific sequence contexts among different tissue types [27,70]. The oscillating abundance of 21 nucleotide (nt), 22 nt, and 24 nt sRNAs indicated that both canonical and non-canonical RdDM may play a role in TE control [15,27], complementing previous hypotheses that 24 nt sRNAs are restricted to the reproductive tissue in P. abies [11]. Thus, TE silencing is particularly reinforced by non-canonical RdDM in gymnosperms, which mildly differs from the primary role of 24 nt-RdDM in angiosperms [15,72]. However, assessing the extent to which the epigenetic mechanisms contribute to genome methylation and how they contribute to developmental processes is a highly anticipated direction for the genomic studies of gymnosperms. Incidentally, H3K9me, a mark for heterochromatin, showed contrasting distribution patterns between angiosperms and gymnosperms (P. abies and Pinus sylvestris), implying potentially distinctive genome silencing mechanisms [4,73].
A fundamental shift in repeat dynamics has been observed in giant genomes, as indicated by the changes in repeat abundance and the curvilinear relationship between genome size and repeat proportion among 101 seed plant species (the samples have an approximately 2,400-fold range from 0.063-88.55 Gb in genome size) [74]. In particular, genomes larger than 10 Gb are characterized by a conspicuous increase in non-repetitive and low-copy DNA sequences (excluding genes) and a relative decrease in medium-copy repeats (>20 copies). Most of these repeats seem to have been slowly degraded and fossilized into very low copy numbers due to epigenetic suppression and limited recombination [74]. In turn, these highly heterogeneous repeats contribute to the formation of interstitial heterochromatin with heavily methylated DNA [57,75]. Hence, large genomes have "one-way tickets to genomic obesity" [74,76]. Such genome evolutionary patterns involving derivative retrotransposons may help explain the observation that low-repetitive DNA components are overrepresented in the pine genome [61,77].

Controversy regarding paleopolyploidy and its implications for gymnosperm diversification
The extant gymnosperms paint quite a different picture: ancient polyploidizations, known as whole-genome duplications (WGDs), are rare in this group but frequent in flowering plants [20,78] (Fig. 1C). This rarity has been suggested as a determining factor in the lower species richness of gymnosperms relative to angiosperms [4,11,79,80]. Since postpolyploid diploidization often occurs rapidly and gives rise to many unpredictable consequences, such as chromosome number shifts and DNA loss [81], the inference of ancient WGDs remains highly challenging due to the long-term erosion of genome doubling signals (e.g., loss of duplicates and saturation of synonymous distances [82,83]).
Combining syntenic analysis with the Ks distribution of all paralogous pairs has been vital for distinguishing WGD-derived and small-scale duplication-derived paralogues [84,85].
However, due to the intermittent release of high-quality genome assemblies of gymnosperms, significant efforts have shifted to comparing genic signatures with improved phylogenomic approaches [20,78]. Heuristic gene tree-species tree reconciliation methods are broadly employed to search for evidence of ancient WGDs based on transcriptome data [83,86,87]. As tabuliformis, Sequoiadendron giganteum, and Ginkgo biloba [15]. Furthermore, Li et al. confirmed the seed plant WGD (named ζ-) and predicted that a lineage-specific WGD occurred in Welwitschia; the latter prediction was validated in a recent Welwitschia genome investigation [27]. Another comprehensive study of WGD mapping with a considerably large RNA-seq sample suggested that a shared WGD might have occurred before all extant gymnosperms diverged [17]. However, such a hypothetical WGD cannot be corroborated by most taxonomically oriented genomic studies [15,23,26,40] (Fig. 1C). Among these genomes, a common feature was the lack of recent species-specific WGDs since only a few intragenomic blocks and syntenic gene pairs could be detected. However, all of the candidate old WGDs The contentious species-tree topologies probably led to differences in gene duplication mapping, despite the fact that specific nodes were examined [17,20]. Alternatively, the duplicated genes introduced by the ζ-WGD were preferentially retained over the duplicates derived from the gymnosperm-WGD in all the species surveyed. In addition, a Ks peak (~0.8) that was recently observed in the Cycas genome was similar to the Ks peak of Ginkgo [28], suggesting an ancient WGD shared by the two lineages as proposed by Roodt et al. (2017) [89].
This ancient WGD (named ω-) was further dated to the most recent common ancestor (MRCA) of all gymnosperms and supported by both transcriptome data and multispecies syntenic block alignments [28]. However, a probabilistic WGD inference across 21 representative seed plants provided clear evidence of the ζ-WGD but not of the ω-WGD, rendering the placement of the Cycas+Ginkgo WGD highly controversial [26,83] (Fig. 1C).
Given the considerable number of predicted ancient WGDs, based at least on the increased signals of gene duplication (restricted to the WGD-derives) [17,20], the question was raised regarding how polyploidy contributes to the evolution of gymnosperms. A recent comprehensive measurement of the traits from living and fossil records suggested that two ancient pulsed rises of morphological innovation occurred in seed plants' evolutionary history: the incipient diversification of gymnosperms (ca. 400 Ma) and the subsequent prosperity of angiosperms during the Late Cretaceous (ca. 100 Ma) [90]. The first increase represented by gymnosperms seems to result from the most commonly shared ζ-WGD and can be extended to the hypothetical ω-WGD. Two direct correlations between the conifers' WGD and their diversification shifts [17] likely suggest the potential roles of WGD in the culmination of early gymnosperms (Cupressophyta-WGD and Pinaceae-WGD occurred ca. 200-342 Ma [88]).
Besides, considerable evolutionary stasis persisted in the morphological complexity of gymnosperms and was further exacerbated by the emergence of flowering plants [90]. One report linked to a genetic map analysis showed that many more ζ -duplicates (688 gene pairs) than conifer-specific tandem duplicates (87 pairs) were preserved in the Pinaceae genomes. A highly conserved genome macrostructure was found between spruce and pine, which diverged at least 120 Ma ago [91]. The large excess of ancestral duplicates and the remarkable level of synteny indicated the much slower pace of evolution in Pinaceae, which can be considered evidence of their relative stasis. Interestingly, a karyotype comparison between Pinaceae and Cupressaceae suggested that substantial chromosomal shuffling likely commenced after their split [92]. Interspecies alignments within the Cupressaceae and other families are required to determine if the shuffling is a common feature of low-frequency genome rearrangements. This would help our understanding of the conifer cladogenesis resulting in speciation and diversity.
Moreover, the case of coast redwood (S. sempervirens) implied that a very slow diploidization process followed WGD, with multisomic inheritance persisting in this hexaploid species (2n = 66). These findings may contribute to explaining why there are so few polyploid species in modern gymnosperms [92]. Normally, the long-term benefits of polyploidy require divergence among homologous chromosomes, which can only happen once loci are diploidized [81,93]. In turn, the reduced selection for efficient meiosis in Sequoia would preclude the emergence of any evolutionary advantages in polyploid lineages. Hence, Scott et al. (2016) [93] proposed that such an intriguing evolutionary strategy was additionally reinforced by asexual reproduction, self-compatibility, and extreme longevity, which likely took place in other conifers, such as Fitzroya cupressoides [94]. Aside from this, the fundamental dynamic shift in repeats is noteworthy, assuming that the genomic shift occurred early in gymnosperms, probably before most modern lineages diverged. The ancestral genome size of gymnosperms has been estimated to have been ~12.375-15.75 Gb [95]. If so, heterogeneous rates of genome size evolution should be expected considering the large range in 1C-DNA content (i.e., from 2.21 Gb in Gnetum ula to 35.28 Gb in Pinus ayacahuite) exhibited across gymnosperms [15] (Fig. 1D and E). The shift in genomic dynamics could directly lead to the unfavourable architecture of those large genomes, such as constrained chromosomal homogenization. Together with the slow pace of diploidization, these factors make polyploidy a burden rather than a boon in gymnosperms. Therefore, the extraordinarily massive loss of duplicates should not be surprising given the highly structured chromosomes and severely limited recombination of these genomes [4]; hence, most signals of WGD in the doubled genome were expunged (e.g., to date, W. mirabilis is the only gymnosperm species known to have a family-specific WGD that occurred ~86 Ma ago while showing an extremely low level of intrachromosomal syntenic relationships compared to angiosperms) [27]. The unusually low rate of WGD duplicate retention could further restrain the morphological and biological diversity of these lineages, given that polyploidy often introduces sub- or neofunctionalization and increases variations in dosage-sensitive genes and pathways [96][97][98].
To conclude, the concomitant problems imposed by an enlarged genome could affect the diverse physiological processes of plants, such as longer cell cycles [99,100] and higher nutrient costs [4], which eventually impact the competitiveness of the species.

Intriguing intron morphology and evolution in gymnosperms
The presence of astonishingly long genes has been extensively reported in many gymnosperms from distinct lineages [11,15,23] (Fig. 1C). These long genes are often associated with large amounts of intronic sequences characterized by cumulative size distributions, including numerous atypical long ones (>20 kb) [11,15,23,28]. Why these very long introns are preserved and how they influence the evolution and function of genes in gymnosperms remain largely obscure [15].
It has long been acknowledged that the genome size may be correlated with the intron size across broad phylogenetic groups. However, such a pattern translated poorly to some narrower taxonomic groups of angiosperms [101]. A pioneering description and comparison of the gene structures of P. glauca and P. taeda with data from BAC clones and genome scaffolds indicated a relatively conserved signature in the long introns [29]. Moreover, the high frequency (32%) of TEs found in captured sequences, even in introns <1 kb, suggested the important role of such invasive elements in the long gene space [29]. Niu et al. (2022) [15] tabulated the characteristics of the gene structures among 68 recently sequenced seed plants. They found a positive correlation between the ratio of total intron/exon length and the genome size, especially in gymnosperm lineages (Fig. 1C). Collectively, this robust evidence supports the claim that genic expansion was coupled with genome upsizing in the majority of gymnosperms, which is probably attributable to the slow growth and accumulation of repeats [15]. Additionally, Nystedt et al. (2013) [11] first provided insights into the presence of long introns by comparing the orthologues of the normal-sized (50-300 bp) and long (1-20 kb) introns of P. abies, P. sylvestris, and G. montanum. They suggested that an early intron expansion might have already occurred in the MRCA of all conifers, which would explain the identical trend in the increased length of orthologous introns. However, this point of view was changed by subsequent comparisons conducted across more species of early diverged seed plants [24]. Similar growth patterns of the intron size and content were observed in orthologues between Ginkgo and P. taeda with the accumulation of LTR-RTs (especially Ty1-copia elements). By contrast, a high proportion of long interspersed nuclear elements (LINEs) was found in orthologous long introns between G. montanum and Amborella trichopoda (the 'basal' angiosperm [102]), and long introns have expanded in both species, consistent with the overall intron morphology of G. montanum and A. trichopoda [24]. This result might indicate different repeat dynamics within the introns of G. montanum compared with other gymnosperms, and the level of Ty1-copia activity in introns might be more ancient and could be traced back to the origin of gymnosperms. Likewise, LINEs could be partially involved in the intron evolution of ancestral seed plants [24]. However, these hypotheses require more investigations using closely related or representative species like Welwitschia, Ephedra, and even cycads, because the evolution of the gene structure of plants was determined by many more interacting forces than classically expected (e.g., the selective recombination rate [103,104] and the species-specific TE activity [105,106]). Indeed, a large portion of unknown sequences has been found in Cycas introns, which is quite different from the pattern of LTR or LINE dominance found in other gymnosperms [28].
Exploring the biological relevance of long introns could be insightful for addressing a fundamental scientific inquiry: "Why are some genomes really big and others quite compact?"
Unfortunately, this matter has been poorly addressed in gymnosperms [29] except for a very recent description of gene expression profiles, alternative splicing, and DNA methylation [15].
The atypically long introns seem to have minimal influence on transcript accuracy, probably facilitated by different levels of CpG and non-CpG methylations among exons and introns [15].
These results call for similar examinations in other giant gymnosperm genomes, such as Ginkgo or Welwitschia, considering their lower effective population size compared to conifers since the loosening of natural selection often allows the fixation of potentially deleterious mutations in the genome [107]. In addition, long genes tend to have higher expression levels in P. tabuliformis, similar to the situation observed in P. glauca, Oryza sativa, and A. thaliana [29,108]. However, such a pattern contrasts with other organisms, like Physcomitrium patens [109], Caenorhabditis elegans, and Homo sapiens [110], where compact genes are highly expressed. If so, the 'low-cost transcription hypothesis' is probably unsuitable for gymnosperms. Alternatively, the length of introns is likely less relevant to the expression level since introns are involved in a variety of regulatory phenomena (e.g., post-transcriptional gene regulation [111], nucleosome formation, and chromatin organization [112][113][114]). Nevertheless, the correlation between gene length and gene expression should be interpreted with caution since it is likely caused by technical issues: the statistical bias in RNA-sequencing data due, for instance, to the over-counting of reads from long transcripts [102].
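The length bias mentioned above arises because, at equal molar abundance, a longer transcript yields more sequenced fragments. A minimal sketch of one standard length-aware normalization, transcripts per million (TPM), is given below. The gene names, counts, and lengths are hypothetical, and the cited studies may have used other measures.

```python
def tpm(counts: dict, lengths_bp: dict) -> dict:
    """Convert raw read counts to transcripts per million (TPM).

    Each count is first divided by the transcript length, and the
    resulting per-base rates are rescaled to sum to one million.
    """
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total_rate = sum(rates.values())
    if total_rate == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: rate / total_rate * 1e6 for gene, rate in rates.items()}


# Hypothetical example: the long transcript collects 10x more reads only
# because it is 10x longer, not because it is more abundant.
raw_counts = {"long_gene": 1000, "short_gene": 100}
transcript_lengths = {"long_gene": 10_000, "short_gene": 1_000}

for gene, value in tpm(raw_counts, transcript_lengths).items():
    print(f"{gene}: {value:.0f} TPM")
```

In this toy case both genes receive the same TPM despite a tenfold difference in raw counts. The lengths used should be those of the mature transcript (exons only), so intron length does not enter the normalization directly; residual length effects can still remain after such a correction.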

Conclusion and perspectives
In this review, while appreciating the advances in our knowledge of the genome evolution of gymnosperms, we demonstrated that some essential characteristics, such as repeat dynamics, ancient WGDs inference, and the biological relevance of long introns, are far from understood.
The state of 'genome paralysis' may be confined to Pinaceae rather than all conifers or gymnosperms since a high frequency of TE removal does exist in cupressophytes, gnetophytes, and Ginkgo. The hypothetical ω-WGD is still highly contested and needs to be reconsidered by future studies. The sporadic and long-awaited releases of genome drafts inevitably limit the conclusions of species-specific cases. Despite the low level of cladogenesis and the rarity of polyploids, the fundamental shift of genomic dynamics and the potential signature of the slow process of diploidization probably offer new insights into the complex evolution of the genome architectures of gymnosperms. Additionally, the dominant model of recent allopolyploid speciation in Ephedra [115], and the growing number of species on the list of hybridization and polyploidization in Juniperus [116], contrast with the gymnosperm reputation of being composed of ancient species. These results could be explained by the resurgence of gymnosperm diversification and the increase in habitat ranges [17]. With regard to all these aspects, we envisage that gymnosperms could be a candidate model to investigate the changes in genome dynamics and their influence on species diversifications (Fig. 1E). However, in-depth studies on the wealth of information contained within these genomes cannot be conducted without generating more high-quality assemblies. The investigation of interspecific variations and diverse properties in gymnosperms would be more profound if the data sampled were consistent, as in many excellent works conducted on animals or crops [117,118]. Considering the intricate evolutionary history of gymnosperms, we propose that, in the future, attention should be paid to at least the following four aspects. First, more integrative estimations of TE eliminations are needed, and a high-resolution subclassification of the TEs would help to distinguish family-specific expansion patterns.
Intensive studies on the many repetitive relics with a low copy number would also enable us to illustrate the formation of the highly structured and less dynamic chromosomes of gymnosperms [4,11,75]. In addition, the rapid accumulation of epigenetic data is imperative since variable repeat dynamics and sophisticated epigenetic machinery play crucial roles in gymnosperms. These data should either resolve DNA methylation at single-base resolution or allow methylomes to be compared among different tissues. Second, ancestral paleopolyploidy inferences should be investigated by large-scale multi-alignments of more complete gymnosperm assemblies with fully considered phylogenies. In particular, the structural evidence of intra- and inter-species collinearity may be essential to clarify the number and timing of these ancient duplications [82]. Moreover, the comprehensive evaluation of the loss and retention of duplicate genes could help elucidate the potential heterogeneity in the genome evolution of gymnosperms. Third, it may be worthwhile to include intron length and expression characteristics in future whole-genome studies of gymnosperms. Also, more investigations on alternative splicing patterns should be carried out and analysed together with DNA methylation footprints. Despite the lack of appropriate genetic transformation tools for long-lived perennial species, it might be insightful to conduct analogous molecular experiments in model plant systems concerning the potential biological functions of ultra-long genes [15,119]. Fourth, more chromosome-level genomes of gymnosperms are needed. In particular, we suggest that additional efforts should be made to sequence medium-sized (5-15 Gb) species and refine the short-read drafts released for conifers, especially Pinaceae.

Data Availability
Not applicable.

Competing interests
The authors declare no competing interests.