A De Novo Transcriptome Assembly of Ceratopteris richardii Provides Insights into the Evolutionary Dynamics of Complex Gene Families in Land Plants

Abstract As the closest extant sister group to seed plants, ferns are an important reference point to study the origin and evolution of plant genes and traits. One bottleneck to the use of ferns in phylogenetic and genetic studies is the fact that genome-level sequence information of this group is limited, due to the extreme genome sizes of most ferns. Ceratopteris richardii (hereafter Ceratopteris) has been widely used as a model system for ferns. In this study, we generated a transcriptome of Ceratopteris, through the de novo assembly of the RNA-seq data from 17 sequencing libraries that are derived from two sexual types of gametophytes and five different sporophyte tissues. The Ceratopteris transcriptome, together with 38 genomes and transcriptomes from other species across the Viridiplantae, were used to uncover the evolutionary dynamics of orthogroups (predicted gene families using OrthoFinder) within the euphyllophytes and identify proteins associated with the major shifts in plant morphology and physiology that occurred in the last common ancestors of euphyllophytes, ferns, and seed plants. Furthermore, this resource was used to identify and classify the GRAS domain transcriptional regulators of many developmental processes in plants. Through the phylogenetic analysis within each of the 15 GRAS orthogroups, we uncovered which GRAS family members are conserved or have diversified in ferns and seed plants. Taken together, the transcriptome database and analyses reported here provide an important platform for exploring the evolution of gene families in land plants and for studying gene function in seed-free vascular plants.


Introduction
Ferns (Monilophyta) are the second most diverse group of vascular plants, with >10,000 species (PPG I 2016). Ferns are also a unique plant lineage, possessing both ancestral traits (e.g., spores and independent gametophytes) as well as many derived traits (e.g., vascular systems and true leaves) that they share with their larger and iconic sister lineage, the seed plants (Banks 1999;Plackett et al. 2015). Despite the fact that ferns represent a critical branch of the land plant tree of life, they are often absent from investigations into the origin and evolution of plant traits (Plackett et al. 2015;Rensing 2017). The reason for this omission is practical; ferns have extremely large genomes that average over 10 Gb in length (Sessa and Der 2016). Genome assemblies are available for only two fern species, Azolla filiculoides and Salvinia cucullata, both of which are heterosporous and have small genomes (0.75 Gb and 0.25 Gb, respectively) that have been secondarily reduced (Li et al. 2018). This leaves the homosporous ferns and their enormous genomes with little to no representation in genome-level analyses of plant evolution.
The homosporous fern Ceratopteris is a model system for studying many aspects of fern biology and especially gametophyte biology, including sex determination Banks et al. 1993;Banks 1994Banks , 1997Banks , 1999Atallah and Banks 2015), physiology (McAdam et al. 2016), spore germination (Chatterjee and Roux 2000;Salmi et al. 2005), gametophyte meristem development (Banks 1999;Plackett et al. 2018), and photomorphogenesis (Cooke et al. 1995). Although a partial assembly representing 38% of the nuclear genome of Ceratopteris was recently published (Marchant et al. 2019), the large size (11.25 Gb) and high complexity of the Ceratopteris genome has hindered comprehensive investigation into the gene repertoire of this model fern. The 1,000 plants (oneKP) initiative published de novo transcriptome assemblies for Ceratopteris and several other diverse fern species (Leebens-Mack et al. 2019). However, the oneKP transcriptomes are mainly derived from sporophyte leaf tissue, which is sufficient for taxonomic studies of conserved gene families but will miss critical genes that are not expressed in leaves (e.g., gametophyte-specific genes).
Here, we report a Ceratopteris transcriptome database, which we de novo assembled using 17 RNAseq libraries from two sexual types of gametophytes (Atallah et al. 2018) and five different sporophyte tissues. This assembly captures protein sequences from 44,668 predicted loci, of which 83% are assigned a functional annotation. We leverage the genomes and transcriptomes of 38 additional plant species to assign Ceratopteris predicted proteins to 17,489 orthologous families. Phylogenomic analysis of these gene families identifies dynamic gain, loss and amplification of different orthogroups in the last common ancestor (LCA) of the major plant lineages, including in the sister lineages of ferns and seed plants. We then characterize the evolutionary history of the GRAS domain proteins, an important superfamily of regulatory proteins with a complex evolutionary history (Cheng et al. 2019). GRAS proteins were assigned to 15 orthologous gene families with varying taxonomic representation. We further refine these orthogroups into 19 subfamilies, including one (GRAS-I) that is ubiquitous in ferns and absent in most seed plants, including the model plant Arabidopsis thaliana (hereafter Arabidopsis). Taken together, the results of our study illustrate the utility of our Ceratopteris transcriptome, which will be a valuable genetic resource to facilitate molecular, evolutionary, and genetic studies in ferns.

Plant Materials and Growth Conditions
The Ceratopteris accession Hn-n ) was used in this study. Sporophytes were generated by self-fertilizing wild-type hermaphroditic gametophytes grown on FM plates containing 0.7% (w/v) agar (Sigma-Aldrich). Once sporophytes developed roots, they were transferred to soil. Young sporophytes were grown at 28 C, under continuous light. Adult sporophytes were grown at 28 C in greenhouse at Purdue.  1D) were harvested from adult sporophytes. Calli were induced from shoot tips or fronds of young Ceratopteris sporophytes ($3 weeks old) on the callus induction plates (pH 5.8) containing 1Â MS salts (PhytoTechnology Laboratories), 2% (w/v) sucrose and 1 mg l À1 benzylaminopurine (BAP), and 0.7% agar (Sigma-Aldrich). Fully induced calli ( fig. 1E) were collected for RNA isolation. Gametophyte sample preparation and RNA-seq are described previously (Atallah et al. 2018).

RNA Isolation, Library Construction and Sequencing
Total RNAs were isolated using the RNeasy Mini Kit (Qiagen). TruSeq DNA PCR-Free Low Throughput Library Prep Kit (Illumina) was used to construct cDNA libraries from each sample except for bulbil and root samples. Due to relatively Significance Ferns are an important reference point to study the origin and evolution of plant genes and traits. Ceratopteris richardii (Ceratopteris) has been widely used as the model system for ferns. Although a partial assembly representing 38% of the nuclear genome of Ceratopteris was recently published, the large size (11.25 Gb) and high complexity of the Ceratopteris genome has hindered the comprehensive investigation into the protein coding repertoire of this model species. In this study, we report a transcriptome of Ceratopteris, through the de novo assembly of RNA-seq libraries derived from two sexual types of gametophytes and five different sporophyte tissues. We leverage the genomes and transcriptomes of 38 additional plant species across the Viridiplantae to assign Ceratopteris predicted proteins to 17,489 orthologous families. Phylogenomic analysis of these gene families shows the dynamic gain, loss and amplification of different orthogroups in the last common ancestor of the major plant lineages, including in the sister lineages of ferns and seed plants. Using this resource, we also characterize the evolutionary history of GRAS domain proteins and uncover both conserved and lineage-specific GRAS subfamilies in land plants.
low amount of RNA in the bulbil and root samples, the Universal Plus mRNA Kit (NuGen) was used to construct cDNA libraries for these tissues. All libraries were sequenced on the NovaSeq 6000 Sequencing System (Illumina). Library construction and sequencing were performed at the Purdue Genomics Core Facility. The gametophyte RNA-seq data were acquired previously (Atallah et al. 2018).
We checked the transcriptome for possible contamination using the Alien Index (AI) pipeline (https://github.rcac.purdue. edu/jwisecav/phylo-pipe; last updated August 26, 2019) as previously described (Wisecaver et al. 2016;Gonc¸alves et al. 2018). Briefly, each predicted protein sequence was queried against the NCBI RefSeq database (retrieved June 2018) using DIAMOND (v0.8.36) (Buchfink et al. 2015), and the AI score was calculated based on the output. The AI score is given by the formula: AI¼nbsO-nbsV, where nbsO is the normalized bit score of the best hit to a species outside of the eudicot lineage, nbsV is the normalized bit score of the best hit to a species within the Viridiplantae lineage (i.e., green plants and algae). AI scores range from -1 to 1, being greater than zero if the predicted protein sequence had a better hit to a nonviridiplantae species, suggestive of either horizontal gene transfer (HGT) or contamination (Wisecaver et al. 2016;Gonc¸alves et al. 2018). Although HGT is a potential source of alien genes and has been documented in ferns (Li et al. 2014), it is difficult to differentiate between true HGT and contamination in a de novo transcriptome due to the inability to confirm true integration of a foreign gene into the host nuclear genome. Therefore, proteins with an AI score ! 0.05 were considered possible contaminants from other species and excluded from downstream analyses. In addition, contigs matching sequences from the Ceratopteris chloroplast genome, Arabidopsis mitochondrial genome, and SILVA database (for rRNA) detected using BLASTn (BLASTþ, v2.7.1; E-value 10) were also excluded. Lastly, a BUSCO (v2.0) analysis was performed to assess the completeness of the final transcriptome using the eukaryote_odb9 (303 conserved genes) and embry-ophyta_odb9 (1,440 conserved genes) data sets (Simão et al. 2015).
Orthogroup evolutionary changes including gains, losses, expansions and contractions were inferred with the program Count (Cs} uö s 2010) using Dollo parsimony and unweighted Wagner parsimony. In both types of parsimony analyses, an orthogroup gain is defined as a shift from orthogroup absence at the preceding node to presence at the node of interest; whereas, orthogroup loss is the opposite transition. Under Dollo parsimony an orthogroup may be lost multiple times but gained only once. This approach effectively pushes any gain back in time to the LCA of any species present in that orthogroup. Therefore, Dollo parsimony is the more conservative estimator of orthogroup gain at leaf nodes as well as interior nodes that are closer to the leaves. Conversely, a Wagner parsimony approach, in which gene gain and gene loss are weighted the same, is a more conservative method for estimating gene gain for nodes closer to the root of the species tree. For example, when dealing with a hypothetical orthogroup that is present in Ceratopteris and the chlorophyte Chlamydomonas reinhardtii yet absent from all other species, Dollo parsimony would predict that the orthogroup was gained in the LCA of these two species (corresponding to the root of the species tree) and subsequently lost multiple times. In contrast, Wagner parsimony would predict that the orthogroup was gained independently in the two species, requiring no subsequent losses. One biological explanation for an orthogroup being acquired independently in different species is HGT; however, HGT appears to be relatively rare in land plants. Methodological artifacts in the OrthoFinder prediction (which would cause genes to be assigned to orthogroups incorrectly) could also result in an orthogroup being gained multiple times in our analysis under Wagner parsimony. For these reasons, the two parsimony methods represent realistic bounds for estimating the number of gene changes at a node.
In addition to comparing orthogroup gain and loss, we also investigated how orthogroup copy number evolved across the species tree. Using Wagner parsimony, orthogroup expansion is defined as a shift from a single-copy state at the preceding node to a multi-copy state at the node of interest; whereas, an orthogroup contraction is the shift from multi-copy to single-copy. Changes in copy number were also inferred from the OrthoFinder analysis directly; duplications with clade support ! 0.50 were parsed from the OrthoFinder duplications.csv output.
GO annotations for Arabidopsis and Ceratopteris were assigned via InterProScan (v5.29-68.0) (Jones et al. 2014). Tests for functional enrichment were performed in R Bioconductor v3.11 by first creating custom orgDBs using AnnotationForge v1.30.1. GO enrichment tests were then performed by clusterProfiler v3.16.1 using the function enrichGO. P-values were adjusted for multiple comparisons using the Benjamini & Hochberg (BH) method. Network maps of enriched terms were also created by clusterProfiler using the function emapplot.

Phylogenetic Analysis of GRAS Family Proteins
We performed separate phylogenetic analyses for all the orthogroups containing sequences that were annotated by InterProScan with the GRAS domain (PF03514). Sequences in each orthogroup were aligned with MAFFT (v7.407) using the E-INS-I strategy and following parameters: -maxiterate 1000 -bl 30 -op 1.0 -retree 1 (Katoh and Standley 2013). Maximum likelihood trees were constructed using IQ-TREE (v1.6.10) (Nguyen et al. 2015) using the built in ModelFinder to determine the best-fit substitution model (Kalyaanamoorthy et al. 2017) and performing SH-aLRT and ultrafast bootstrapping analyses with 1,000 replicates each. Following initial tree building, several sequences were excluded due to their long branch lengths, including Ginko_biloba_GBI00007557 and Ginko_biloba_GBI00011220 in OG0001609, Amborella_trichopoda_ATR0582G205 and Populus_trichocarpa_Potri.004G208700 in OG0003235, Polytrichum_commune_1kpSZYG2036234 in OG0005632, Picea_abies_PAB00030745 in OG0007617, Sceptridium_ dissectum_1kpEE in OG0007977, and Amborella_trichopoda_ATR0174G041 in OG0010265. Alignments and phylogenetic trees were reconstructed following sequence curation. Trees were visualized and annotated using iTOL (v4) (https://itol.embl.de/itol.cgi, last accessed March 9, 2021) (Letunic and Bork 2019). The distribution of GRAS proteins was displayed using TBtools (Chen et al. 2020).

Results
The Ceratopteris richardii Transcriptome To generate a comprehensive transcriptome of Ceratopteris, independent RNA libraries from two gametophyte tissues (males and hermaphrodites) (Atallah et al. 2018) and five sporophyte tissues (shoot apices, fertile fronds, roots, bulbils, and calli) ( fig. 1A-E) were generated and sequenced resulting in $2.6 billion cleaned paired-end reads (supplementary table S1, Supplementary Material online). The de novo assembly consisted of 64,974 contigs with an average GC content of 43.04% and an N50 of 2,704 bp. Assembly contigs were collapsed into 44,668 Trinity genes, which are defined by the Trinity assembler as clusters of similar contigs predicted to arise from the same locus in the genome (Grabherr et al. 2011) and are hereafter referred to simply as genes. A BUSCO analysis (Simão et al. 2015) was performed to evaluate the completeness of the Ceratopteris transcriptome, recovering 274 of the 303 conserved eukaryotic genes (97%) and 1,023 of 1,440 conserved embryophyta genes (71%) ( fig. 1F). This result is comparable to the similar BUSCO analysis that was conducted using the transcriptome of another homosporous fern, Polypodium amorphum, which recovered 1,028 of 1,440 conserved embryophyta genes (71%) (Sigel et al. 2018).
Within the Ceratopteris transcriptome, 68.2% (30,449) of the total genes are expressed within both gametophyte and sporophyte tissues ( fig. 1G and H). In contrast, in the moss Physcomitrella patens (a species with a gametophytedominate life cycle), 85.5% of genes are detected in both gametophyte and sporophyte phases (Ortiz-Ramirez et al. 2016). In the angiosperm Arabidopsis, only 30.6% of genes are shared in both sperm cells and seedlings (Borges et al. 2008). These data demonstrate a trend of decreased number of genes commonly expressed in both gametophyte and sporophyte stages and is consistent with the highly reduced gametophytes in seed plants and the transition from gametophyte-dominant to sporophyte-dominant life cycles.
To investigate the evolutionary dynamics of plant gene families, we performed an OrthoFinder analysis (Emms and Kelly 2015)  A moderate number of duplication events were also predicted at the euphyllophyte LCA (n ¼ 136; fig. 2B  In agreement with our analysis, the radiation of the COBRA gene family, which is necessary for oriented cell expansion in Arabidopsis, has previously been noted (Sorek et al. 2016). All enriched GO terms related to amino acid/anion transport are attributed to the GLUTAMINE DUMPER gene family (supplementary table S4, Supplementary Material online). Lastly, the enrichment GO terms involved in nitrate response was driven by the HT1 protein kinase, which regulates stomata opening in response to red-light and CO 2 (Matrosova et al. 2015).
To evaluate the types of gene functional categories that were lost in the seed plant LCA (and therefore absent in Arabidopsis), we performed GO term enrichment analyses on the Ceratopteris genes that were present in these orthogroups, identifying 42 and 37 enriched GO terms in the Dollo and Wagner analyses, respectively (supplementary  table S4 (Li et al. 2018), and these findings could be suggestive of HGT, a process that has already been speculated to play a role in the evolution of triterpenoid synthases in ferns (Frickey and Kannenberg 2009;Li et al. 2018).

Gene Family Evolution in the Fern LCA
The fern LCA showed a net trend of more orthogroups lost than gained (Dollo n ¼ 429; Wagner n ¼ 18; fig. 2B, supplementary table S3, Supplementary Material online). However, we are cautious not to overinterpret a trend of gene loss at this internode as all but two species of ferns are represented by transcriptome data only. Genes may be absent from a de novo transcriptome if they show little to no expression in the sampled tissues and will therefore appear lost in our analysis. Moreover, because of the evolutionary distance between Ceratopteris and model seed plants, assigning GO terms based on homology is also less accurate.
We did not identify any enriched GO terms in the Ceratopteris gene sets that were gained at this internode under either Dollo or Wagner parsimony. Although the number of gene families gained in the fern LCA was low, there was a moderate number of lineage-specific gene amplifications at this internode (expansions n ¼ 190, duplications n ¼ 255; fig. 2B To evaluate the types of gene functional categories that may have been lost in the fern LCA (and therefore absent in Ceratopteris), we performed GO term enrichment analyses on the Arabidopsis genes that were present in these orthogroups, identifying 15 and 52 enriched GO terms in the Dollo and Wagner analyses, respectively (supplementary table S4, Supplementary Material online). Focusing on the Dollo results, the enriched GO terms in this gene set include photosynthesis and a variety of related categories broadly involved in tissue and organ development ( fig. 3D). Enrichment of the photosynthesis GO terms was driven by five genes that encode proteins that are components of photosystem I and II (supplementary

Phylogenetic Analysis of GRAS Domain Proteins
To further illustrate the utility of our gene family analysis, we selected the GRAS domain proteins to investigate in more detail. This superfamily of proteins is thought to have evolved as crucial regulators in control of various plant growth and developmental processes, including shoot and root development, stem cell homeostasis, light and hormone signaling, responses to biotic and abiotic stresses, and symbiosis with microorganisms (Hirsch and Oldroyd 2009;Bolle 2016). GRAS proteins clustered into 15 orthogroups that were variably present in the different plant lineages ( fig. 4). To better understand the evolutionary history of GRAS proteins, maximum-likelihood phylogenetic trees were built for each orthogroup (supplementary figs. S2-S16 and supplementary notes S2-S16, Supplementary Material online). OG0000320 was one of two fully conserved GRAS-containing orthogroups present in all species in the analysis ( fig. 4). This orthogroup contained the functionally characterized Arabidopsis GRAS homolog AtSCL14 (AT1G07530) (supplementary fig. S2, Supplementary Material online), which is likely involved in the detoxification of xenobiotics (Fode et al. 2008). The second orthogroup containing all species in the analysis was OG0000323, which consisted of two subfamilies, with one subfamily containing the Arabidopsis GRAS homologs AtSCL4 (AT5G66770) and AtSCL7 (AT3G50650) and the second subfamily containing the Arabidopsis GRAS homologs AtSCL1 (AT1G21450) and AtPAT1 (AT5G48150) (supplementary fig. S3, Supplementary Material online).
Three orthogroups (OG0000755, OG0001609, and OG0003235) were present in all lineages and the majority of species in the analysis ( fig. 4). OG0000755, a large orthogroup with 171 sequences, was subdivided into four major subfamilies ( fig. 4, supplementary fig. S4, Supplementary Material online). OG0000755 was the only GRAS orthogroup that contained a significant number of multi-domain proteins; 34 proteins in this orthogroup were annotated with the DELLA Pfam domain (PF12041) in addition to GRAS (PF03514) (supplementary fig.  S4, Supplementary Material online). The majority of DELLAcontaining proteins in this orthogroup were from angiosperms, including the five Arabidopsis DELLA proteins AtRGL1 (AT1G66350), AtRGL2 (AT3G03450), AtRGL3 (AT5G17490), GAI (AT1G14920), and RGA (AT2G01570). In Arabidopsis DELLA proteins function as crucial repressors in the gibberellin signaling pathway (Vera-Sirera et al. 2016). DELLA domains were also identified in two Ceratopteris sequences ( fig. 4, supplementary fig. S4, Supplementary Material online), suggesting ferns and seed plants share conserved components to transduce the gibberellin signal. A second subfamily within the larger OG0000755 orthogroup contained three sequences with shared conserved roles in control of axillary bud initiation and lateral shoot development (Schumacher et al. 1999;Greb et al. 2003;Li et al. 2003): LATERAL SUPPRESSOR (LAS) in Arabidopsis (AT1G55580) (Greb et al. 2003); MONOCULM 1 (MOC1) in Oryza sativa (Os06g40780) (Li et al. 2003); and Lateral suppressor (LS) in Solanum lycopersicum (Solyc07g066250.1) (Schumacher et al. 1999) (fig. 4, supplementary fig. S4, Supplementary Material online). A third subfamily in OG0000755 is present in at least one species in each of the five plant lineages in our analysis and is fully present in all angiosperms, including Arabidopsis AtSCL28 (AT1G63100). Lastly, OG0000755 contained one more characterized GRAS protein from Medicago truncatula (hereafter Medicago) (RAM1), which participates in the symbiosis with arbuscular mycorrhizal fungi (Gobbato et al. 2012). RAM1 homologs are present in all angiosperm genomes with the notable exception of Arabidopsis ( fig. 4, supplementary fig. S4, Supplementary Material online). Outside of the angiosperms, the RAM1 subfamily appears to group closely with one additional sequence from the lycophyte Selaginella moellendorffi and appears to be absent in the bryophytes, ferns, and gymnosperms in this analysis ( fig. 4,  . HAM proteins control the determinacy and proliferation of shoot apical meristems and the de novo formation of axillary meristems in petunia and Arabidopsis (Stuurman et al. 2002;Schulze et al. 2010;David-Schwartz et al. 2013;Fan et al. 2015;Zhou et al. 2015;Hendelman et al. 2016;Zhou et al. 2018;Han, Liu, et al. 2020;Han, Geng, et al. 2020). Members of the HAM subfamily are present in almost all the species in the analysis, with the exception of the liverwort Marchantia polymorpha (fig. 4). Although taxon sampling is different, the topology of the HAM phylogenetic tree in this study is generally in agreement with the result we showed previously (Geng et al. 2021). Lastly, OG0003235 contained the Arabidopsis GRAS homolog AtSCL3 (AT1G50420) (supplementary fig.  S6, Supplementary Material online), which antagonizes the DELLA proteins and regulates the gibberellin signaling pathway in Arabidopsis (Zhang et al. 2011).
Six orthogroups (OG0009315, OG0002549, OG0005632, OG0007977, OG0002843, and OG0006078) were well represented in angiosperms but variably present in the other four lineages ( fig. 4). OG0009315 contained the Medicago protein RAD1, which, similar to RAM1 in OG0000755, is involved in the symbiosis with arbuscular mycorrhizal fungi (Rey et al. 2017). Also like the RAM1 subfamily, RAD1 is conspicuously  fig. S9, Supplementary Material online) together contain three functionally characterized proteins in Arabidopsis: SCR (AT3G54220), SCL23 (AT5G41920), and SHR (AT4G37650), which collectively determine both the root and shoot radial patterning (Cui et al. 2007;Long et al. 2015;Yoon et al. 2016). Both SCR and SCL23 belong to OG0002549 and they are derived from a duplication in the LCA of seed plants (

A Ceratopteris Transcriptome Database Provides a Platform for Functional Genomic Studies in Ferns
To explore the protein coding gene space and to generate a comprehensive catalog of transcripts in Ceratopteris , 17 independent sequencing libraries from gametophyte and sporophyte tissues were sequenced at great depth ($2.6 billion cleaned paired-end reads) in total (supplementary table S1, Supplementary Material online) prior to de novo assembly. The BUSCO analysis showed that this transcriptome covers 97% eukaryotic genes and 71% of conserved embryophyta genes. We also performed OrthoFinder analysis (Emms and Kelly 2015) using this Ceratopteris transcriptome plus 38 additional genomes and transcriptomes from species across the Viridiplantae. 17,489 orthogroups were assigned to Ceratopteris, which is comparable to the number of orthogroups identified from the published genomes of two aquatic ferns, S. cucullata (n ¼ 17,593) and A. filiculoides (n ¼ 29,456) (supplementary table S2, Supplementary Material online) (Li et al. 2018). Among them, the GRAS domain homologs from the Ceratopteris transcriptome and the two aquatic fern genomes are consistently present or absent in each orthogroup, and they are clustered together with most GRAS homologs identified from transcriptomes of 14 additional ferns. Collectively, our data indicate that the do novo assembled Ceratopteris transcriptome is accurate and likely comprehensive.
Ceratopteris is a member of the Polypodiales, the most species-rich clade in the fern lineage (PPG I 2016), which retains many characteristics representative of ferns (e.g., homospory and extremely large genomes). The LCA between the Polypodiales and S. cucullata and A. filiculoides (the two ferns with sequenced genomes) existed over 200 Ma (Kumar et al. 2017), which has allowed for a significant number of accumulated differences between Ceratopteris and our current fern reference genomes (Sessa et al. 2014). Therefore, the incorporation of Ceratopteris, in addition to S. cucullata and A. filiculoides, in comparative studies of ferns provides tremendous opportunities identifying genes involved the evolution of different developmental or physiological processes within the ferns, including those associated with adaptations to an aquatic environment. These resources are also an entry point for understanding how heterospory arose from homospory within a lineage, which represents a major shift in reproduction (Sussex 1966), and occurred independently in different lineages (Sussex 1966;Bateman and DiMichele 1994).
The Ceratopteris transcriptome and its comparison to other plant genomic resources allowed us to identify genes that were gained or expanded as well as lost or contracted in the LCA of different plant lineages. This analysis is novel in that it leverages the Ceratopteris transcriptome and other recently acquired genome/transcriptome resources from additional ferns to distinguish euphyllophytes from lycophytes and, within the euphyllophyte lineage, ferns from seed plants. The gene gains and losses in the LCA of these lineages are likely to underly the major shifts in plant morphology and physiology that occurred in each of these ancestors, for example, the loss of flagella and shift to sporophyte-dominant lifecycle in seed plants. Due to the difficulty in differentiating true gene loss from missing data in de novo transcriptome assemblies, more genome assemblies for diverse ferns are needed to provide further insight into the dynamics of gene families in land plants.

The Classification and Evolution of GRAS Domain Proteins in Land Plants
We also analyzed the evolution of the GRAS gene superfamily, which contains crucial regulators in control of various land plant growth and developmental processes (Hirsch and Oldroyd 2009;Bolle 2016). The phylogenies of GRAS proteins from a number of seed plants (Tian et al. 2004;Lee et al. 2008;Wu et al. 2014;Cenci and Rouard 2017;Cheng et al. 2019) and from a few non seed plants Wu et al. 2014;Cheng et al. 2019) have been characterized. However, the phylogeny of GRAS proteins from homosporous ferns is lacking. In this study, we performed comprehensive phylogenetic analyses of GRAS proteins from 36 species in all the representative land plant lineages including bryophytes, lycophytes, ferns, gymnosperms, and angiosperms. Ferns serve as a very informative middle point in the evolution of land plants. By including several distantly related heterosporous and homosporous ferns in the study, we classified the GRAS domain proteins into 15 orthogroups including at least 19 distinct subfamilies ( fig. 4).
Among them, 11 subfamilies (SCL14, SCL4/7, SCL1/PAT1, SCL28, LAS, DELLA, HAM, SCR, SCL3, SHR, and SCL32) contain GRAS homologs from all the representative land plant lineages ( fig. 4). These subfamilies include many crucial components in plant growth and development (Hirsch and Oldroyd 2009;Bolle 2016), suggesting that ferns and seed plants may share and exploit similar components to control their body formation. In addition, the sex of the Ceratopteris gametophytes is determined by the antheridiogen (Scott and Hickok 1987;Banks et al. 1993;Banks 1994Banks , 1999, which is one type of gibberellins (Yamane 1998). Since DELLA proteins play important and conserved roles in repressing the gibberellin signaling (Vera-Sirera et al. 2016), the identification of DELLA domain GRAS proteins in Ceratopteris will allow us to perform genetic studies and establish the molecular linkage between the sex determination and gibberellin signaling in ferns in the future.
Members of other four subfamilies-RAM1, RAD1, NSP1, and NSP2 contain essential regulators during the establishment of arbuscular mycorrhizal symbiosis in angiosperms (Delaux et al. 2014(Delaux et al. , 2015. Despite the wide distribution of the NSP2 subfamily, the other three families (RAM1, RAD1, NSP1) are largely missing in both eusporangiate and leptosporangiate ferns ( fig. 4), reflecting the potentially reduced arbuscular mycorrhizal dependency in the fern lineage. Additionally, the GRAS-I subfamily is absent in angiosperms but present in ferns. In contrast, the GRAS-II, GRAS-III, and GRAS-IV subfamilies are present in angiosperms but absent in ferns ( fig. 4), suggesting the diversification of these four subfamilies between ferns and angiosperms. Future work to identify the lineage-specific roles of these subfamily members will provide important insights into the evolution of the GRAS domain proteins in land plants.

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.