Large-scale phylogenomic analysis resolves a backbone phylogeny in ferns

Abstract Background Ferns, originated about 360 million years ago, are the sister group of seed plants. Despite the remarkable progress in our understanding of fern phylogeny, with conflicting molecular evidence and different morphological interpretations, relationships among major fern lineages remain controversial. Results With the aim to obtain a robust fern phylogeny, we carried out a large-scale phylogenomic analysis using high-quality transcriptome sequencing data, which covered 69 fern species from 38 families and 11 orders. Both coalescent-based and concatenation-based methods were applied to both nucleotide and amino acid sequences in species tree estimation. The resulting topologies are largely congruent with each other, except for the placement of Angiopteris fokiensis, Cheiropleuria bicuspis, Diplaziopsis brunoniana, Matteuccia struthiopteris, Elaphoglossum mcclurei, and Tectaria subpedata. Conclusions Our result confirmed that Equisetales is sister to the rest of ferns, and Dennstaedtiaceae is sister to eupolypods. Moreover, our result strongly supported some relationships different from the current view of fern phylogeny, including that Marattiaceae may be sister to the monophyletic clade of Psilotaceae and Ophioglossaceae; that Gleicheniaceae and Hymenophyllaceae form a monophyletic clade sister to Dipteridaceae; and that Aspleniaceae is sister to the rest of the groups in eupolypods II. These results were interpreted with morphological traits, especially sporangia characters, and a new evolutionary route of sporangial annulus in ferns was suggested. This backbone phylogeny in ferns sets a foundation for further studies in biology and evolution in ferns, and therefore in plants.

concatenation-based methods were applied to both nucleotides and amino acids 22 sequences in species tree estimation. Among the mainly consistent and strongly 23 supported cladograms, coalescent-based method using nucleotides sequence 24 yielded the most robust cladogram. 25 Conclusions: Our result confirmed that Equisetales is sister to the rest of ferns, and 26 Dennstaedtiaceae is sister to eupolypods. Moreover, our result strongly supported 27 some relationships new to the current view of fern phylogeny, including that 28

Psilotaceae and Ophioglossaceae form a monophyletic clade which is sister to 29
Marattiaceae; Gleicheniaceae and Hymenophyllaceae form a monophyletic clade 30 which is sister to Dipteridaceae; and that Aspleniaceae is sister to the rest groups in 31 eupolypods II. These results were interpreted with morphological traits, especially 32 sporangia characters, and a new evolutionary route of sporangia annulus in ferns 33 was suggested. This backbone phylogeny in ferns sets a foundation for further 34 studies in biology and evolution in ferns, and therefore in plants. 35 Key Words: phylogenomic, monilophytes, evolution, sporangium, transcriptome 36 Background 37 Phylogeny, which reflects natural history, is fundamental to understanding evolution 38 and biodiversity. Ferns (monilophytes), originated about 360 million years (MY) ago, 39 are the sister group of seed plants [1,2]. With estimated 10,578 extant living species 40 globally [3], they are the second most diverse group in vascular plants. Phylogenetic 41 studies for ferns, especially based on molecular evidences, have been widely carried 42 in recent decades. These studies have revolutionized our understanding of the 43 evolution in ferns, among which the milestones being setting ferns as the sister group 44 of seed plants [1,2], placing Psilotaceae and Equisetaceae within ferns [2,4,5], and 45 revealing a major polypods radiation following the rise of angiosperms [6,7]. 46 Resolution at shallow phylogenetic depth among families or genera have also been 47 improved remarkably [8][9][10][11][12][13][14]. 48 However, previous researches on fern phylogeny have mostly relied on plastid 49 genes [10,12,13], some combined with a few nuclear genes [4,5,14] or 50 morphological traits [5,11]. Due to incomplete lineage sorting (ILS), genes from 51 different resources often show conflicting evolutionary patterns, especially when 52 based on a limited number of samples, some deep relationships in fern phylogeny 53 remain controversial (Figure 1). In the latest PPG I system [3], which has derived 54 from many recent phylogenetic studies, some important nodes remain uncertain, Transcriptome sequencing (RNA-Seq) represents massive transcript information 59 from the genome. Phylogenetic reconstructions basing on RNA-Seq are more 60 efficient and cost-effective than traditional PCR-based or EST-based methods when 61 lacking whole-genome data [15]. Successful cases in recent years include mollusks 62 [16], insects [17], the grape family [18], angiosperms [19], and land plants including 63 six ferns [20]. Here, with the aim to reconstruct the framework of fern phylogeny, we 64 4 sampled abundant fern species representing all important linages and applied latest 65 phylogenomic analysis basing on  To reconstruct a robust and well-resolved phylogeny in ferns, applying multiple 67 methods of phylogenomic analysis is extremely important. Since concatenation-68 based estimations of species tree usually have good accuracy under low level of ILS, 69 while coalescent-based methods are developed to overcome the effect of ILS, but 70 are sensitive to gene tree estimation error [21], so both concatenation-based and 71 coalescent-based estimations are applied. Moreover, due to the fact that amino-acid 72 sequence is more conserved than nucleotide sequence, it may be more suited to 73 estimate relationships among distant taxa. While for close related taxa, the higher 74 variability of nucleotide sequence brings useful information to reconstruct 75 relationships that might not be differentiated using amino-acid sequence. Therefore, 76 both nucleotide and amino-acid sequences are used in phylogeny reconstruction. 77 In the aspect of morphology, fern sporangium is an organ for enclosing and 78 dispersing spores, most of which functions like a unique catapult with annulus [22]. 79 During the last centuries, Bower's hypothesis on the evolution of sporangia with a 80 focus on annulus [23] had been one of the most important cornerstones to fern 81 phylogeny based on morphology [24,25]. However, this hypothesis has been 82 challenged by somewhat conflicting frameworks of fern phylogeny [4,10,12,14,26]. 83 A robust framework in fern phylogeny which reflects the evolutionary history will 84 improve our understanding for the evolution of fern sporangia as well as other 85 characters. 86 Data description 87 Taxa sampling and RNA-Seq 88 We chose 69 fern species from 38 families according to PPG I system (totally 48 fern 89 families), covering all the 11 orders (Equisetales,Psilotales,Ophioglossales,90 Marattiales, Osmundales, Hymenophyllales, Gleicheniales,Schezaeles,Salviniales,91 Cyatheales, and Polypodiales). Information about the location and time for sampling 92 is given in Table S1. All the sampled species were collected under the permissions of 93 the natural reserves and Shanghai Chenshan Botanical Garden in China. 94 Sporophyll or/and trophophyll were collected and frozen in liquid nitrogen 95 immediately, and preserved in Ultra-low temperature refrigerator at -80°C before 96 RNA extraction. Total RNA was extracted using TRIzol (Life Technologies Corp.) 97 according to the manufacturer's protocols. The RNA concentration was determined 98 using a NanoDrop spectrophotometer, and RNA quality was assessed with an Agilent 99 Bioanalyzer. Paired-end reads were generated by Majorbio Company (Shanghai,100 China) using the HiSeq 2500 system. Raw reads were deposited in GenBank [27] . 101

Transcriptomes assembly and orthology assignment 102
Transcriptomes data were generated from 69 fern species (Table 1). After filtration, 103 about 2,726.9 million pair-end DNA sequence reads (about 313 Gbp) were retained. 104 We assembled these reads de novo and obtained a total of 5,449,842 contigs [28]. 105 In order to obtain a reliable phylogenetic relationship, we selected four species 106 as the outgroup, representing the main lineage of land plants: Amborella trichopoda 107 (representing angiosperms), Picea abies (representing gymnosperms), Selaginella 108 moellendorffii (representing lycophytes), Physcomitrella patens (representing 109 bryophytes). The translated ORF (protein) sequences of these four species were 110 6 downloaded from Phytozone [29] and used in the following analysis. 111 To ensure the consistency of phylogenomic analysis, we used a phylogenetic-112 based ortholog selection method, and obtained two subsets of " one to one" 113 orthologous genes that differed in gene number and species occupancy rate, named 114 "Matrix 1" and "Matrix 2" [30]. Matrix 1 consists of 2391 genes that are present in at 115 least 52 taxa (that is 75% of the 69 taxa in total), resulted in 2,024,565 nucleotide 116 and 674,855 amino acid positions, the gene and character occupancy were 88% and 117 85% respectively. Matrix 2 consists of 1334 genes that are present in at least 62 taxa 118 (that is 90% of the 69 taxa in total), resulted in 1,171,332 nucleotide and 390,444 119 amino acid positions, the gene and character occupancy reached 94% and 90% in 120 each. For each orthologues gene set, coalescent-based and concatenation-based 121 methods were applied separately to both nucleotides and amino acids sequences. A 122 working flow diagram showing the major processes in this study is given in Figure 2. 123

Comparison of cladograms estimated by various methods 140
By comparing cladograms estimated by coalescent-based and concatenation-based 141 method using both nucleotide and amino-acid sequences (Table 2), we find that the 142 cladograms yielded from coalescent-based and concatenation-based methods using 143 nucleotide sequence are mostly consistent, except the location of Angiopteris 144 fokiensis. Cladograms yielded from coalescent-based method using nucleotide 145 sequence and amino-acid sequence showed three sites of inconsistency, all of which 146 belong to eupolypods. Since eupolypods have experienced rapid evolutionary 147 radiation in Cenozoic (Figure 3), and nucleotide sequences usually provide more 148 information to reconstruct relationships among close related taxa, we consider the 149 cladogram yielded from coalescent-based method using nucleotide sequence maybe 150 more reliable. However, the inconsistent sites among cladograms often show 151 relatively lower supporting values, and they are often controversial nodes among 152 different researches based on different genes, we suggest these different results may 153 be aroused partially by LIS and reticulate evolution. 154

35]. 205
Our new phylogram confirmed the morphology-based hypothesis that 206 Dennstaedtiaceae with two indusial, rather than Pteridaceae with one false indusium, 207 is more close to eupolypod ferns [39]. In Pteridaceae, the unstable structure of 208 spherical sporangial, including variable annulus and short sporangial stalk, indicates 209 these sporangial are relatively primitive and are close to the sporangial with oblique 210 annulus in early leptosporangiate [23]. We also noticed that the spherical sporangia 211 with slightly oblique annulus in Monachosorum should be more primitive than the 212 flattened sporangia with typical vertical annulate in other genera of Dennstaedtaceae. 213 For distinguishing eupolypods I and eupolypods II, the number and shape of the 214 vascular bundles at the base of petiole have been demonstrated to be a powerful 215 diagnostic character [35,38]. 216

The evolution of sporangia annulus in ferns 217
By observing the character of sporangia annulus of abundant samples in each fern 218 group, and reconstructing these characters onto our well-resolved backbone 219 phylogeny (Figure 3), here we reconstructed the evolutionary history of sporangia 220 annulus in ferns ( Figure 4). First, exannulate sporangia, as in Equisetaceae,221 Psilotaceae, and Ophioglossaceae, is the original type in ferns; followed by 222 rudimentary multiseriate annulus, which is inverse U-shaped in Marattiaceae (a), and 223 U-shaped in Osmundaceae (b); and by equatorial transverse-oblique uniseriate 224 annulus, as in Gleicheniaceae and Hymenophyllaceae. After that, the main route 225 divides into two subroutes, one is towards apical annulus as in Lygodium and 226 Schizaea, followed by vestige or disappeared annulus as in Salviniales (aquatic 227 ferns); the other is towards oblique annulus as in Cyatheales (tree ferns), followed by 228 vertical annulus as in polypods. Inconsistent with Bower's hypothesis [23], our results 229 showed that sporangia with apical annulus as in Schizaeales are no longer the 230 primitive type in ferns but a specialized one. Moreover, the oldest fossils of 231 Schizaeaceae is now believed to appear in Jurassic (201-145 Ma BP) rather than 232 formerly thought Carboniferous (359-252 Ma BP) [40]. 233

De novo transcriptome assembly 250
For each paired-end library, we first removed the Illumina adapter of raw reads using 251 Scythe (32) and trimmed the poor quality bases using DynamicTrim Perl script of the 252 SolexQA package with default parameters [41]. Next, de novo transcriptome 253 assembly of each species was conducted using the Trinity package (version: 254 trinityrnaseq_r20140413) with default parameters [42]. To discard the duplicated 255 sequences, the obtained contigs were clustered using CD-HIT-EST (v4.6.1) to 256 generated a non-redundant contigs. All contigs with lengths greater than 200 bp were 257 used for downstream analysis. We used the transDescoder, a program in the Trinity 258 package, to identify the candidate coding sequences (CDSs) from the contigs with 259 default criteria. Finally, the translated protein sequences of CDSs were searched by 260 BLASTP against the NCBI nr protein database with an e-value threshold of 1E-5. 261 These BLASTP hit sequences were used for further analysis. 262

Orthology assignment, alignment, and alignment masking 263
The orthology assignment for the 69 sample assemblies together with the four 264 outgroup species employed a phylogenetic based clustering method described 265 previously [16]. In short, all-vs-all BLAST search of amino acid sequence was 266 performed among every species, the BLAST results were clustered using MCL [43] 267   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 software with the parameters '-I 2-tf ′gq(20)′'. Optimization of the inflation parameter 268 (I) was conducted as described previously [44], the default value 2.0 was selected 269 ultimately. To reduce the complexity of each group, we removed all sequences of the 270 species that had more than 10 sequences in this group. Then, groups with at least 35 271 (50%) ferns species were aligned using einsi command, implemented in MAFFT sequences. We pruned the paralogous subtrees from the homologous gene trees 277 until only one monophyletic subtree retained. Next, the resulted orthologous gene 278 trees were further filtered by the criteria that each species should be represented by 279 only one sequence, this resulted subset genes were referred to "one to one 280 orthologs", which were largely free of gene duplication. Then, we extracted both the 281 CDSs (nucleotide sequence) and translated amino acid sequence from the each 282 orthologous gene group, followed by aligning with MAFFT and trimming with 283 Gblocks. The alignment which with coding and corresponding translated sequences 284 lengths greater than 150 bp (or 50 amino acids) were kept for the further analysis. 285

BUSCO analysis 286
The Basic Universal Single Copy Orthologs (BUSCOs), which employ a core set of 287 orthologs conserved in all eukaryotic species to determine the gene coverage degree 288 of each assembly [49], was employed to assess the completeness of the 289 transcriptome assembly we obtained (Table S2) [50]. A total of 303 BUSCOs were 290 employed to blast against by translated amino acid of the assemblies using BLASTP. 291 Then the number of complete and parcially matched gene from each assembly was 292 counted respectively. Out of 69 samples in total, 65 samples (that is 94.2% of total) 293 were defined to have a relatively higher gene coverage degree. In these samples, at 294   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 least 251 complete genes (up to 295) could be identified, making the coverage rate 295 exceeded 80%. Unexpectedly, among our total assemblies, 1 sample (Aleuritopteris 296 chrysophylla, named RS_72) present extremely low gene coverage degree, in which 297 only 72 (23.8%) complete housekeeping genes could be found (Supplementary Table  298 2). However, when the sample is deleted from the matrix used to construct the 299 backbone of the phylogenetic tree, the cladogram remains unchanged, indicating that 300 the lower completeness in this sample doesn't affect our results (data not shown). 301

Phylogenetic analysis 302
The coalescent-based species tree was reconstructed by ASTRAL v4.

Reconstruction of the evolution of sporangia annulus 318
Characters of sporangia annulus of the sampled species were observed using a 319 polarized light microscope (Axio Scope.A1, ZEISS) after the fresh and mature 320 rudimentary annulus, ex-annulus, apical annulus, transverse annulus, and vestigial 324 annulus) were treated as unordered and equally weighted. To reconstruct character 325 evolution, a maximum likelihood approach using Markov k-state 1 parameter model 326 [57] was applied. To account for phylogenetic uncertainty, the "Trace-characters-over-327 trees" command was used to calculate ancestral states at each node including 328 probabilities in the context of likelihood reconstructions. To carry out these analyses, 329 characters were plotted onto 100 trees that were sampled in the ML analyses of the 330 combined dataset using RAxML v7. The results were finally summarized as 331 percentage of changes of character states on a given branch among all 100 trees 332 utilizing the option of "Average-frequencies-across-trees". 333      [5, 12-14, 26, 33]. 513

. Cladograms (a-f) adapted from published results
Branches with support < 75% were shown using dotted lines; and taxa which differ in 514 their phylogeny locations were shown in different colors. 515
Ferns are the sister group of seed plants. However, the relationships among major fern lineages remain controversial. Here, we carried a large scale phylogenomic analysis using high-quality transcriptome sequencing data which covered 69 fern species from all the 11 orders. By comparing the cladograms yielded from various methods of species tree estimation, we obtained a robust fern phylogeny. Our results are interpreted with sporangia characters, and a new evolutionary route of sporangia in ferns are suggested. This backbone phylogeny in ferns sets a foundation for further studies in biology and evolution in ferns, and therefore in plants, especially when fern genomes are not available.
We have adopted all the suggestions in our revised manuscript. The major revisions include: 1 during species tree estimation, we applied both coalescent-based and concatenation-based methods to both nucleotides and amino acids sequences, and compared the results; 2 we used fossil records to estimate the divergence times; 3 before discussing the evolution of sporangia in ferns, we reconstruct the ancestral state of sporangium annulus; 4 we added a working flow diagram (Figure 2) to show the major processes of data production and analysis; 5 we deposited the datasets, trees, and scripts in open repositories, including GenBank, figshare, and github; 6 we improved our "tree thinking" and writing.
Thank you very much for handling our manuscript. I am looking forward to hearing your decision soon.

Yue-Hong Yan
Response to the review comments: Reviewer #1: Shen et al presented an impressive dataset on fern transcriptomes, and attempted to build a better backbone phylogeny of ferns. Despite that I believe the transcriptome data will be invaluable for the community, authors' phylogenetic analyses, and the interpretation of the results, are inadequate for publication. My major concerns are: (1) The authors concatenated all the loci for phylogenetic reconstruction, a practice that has been shown prone to give high supports on wrong relationships. There are numerous simulation and empirical studies on the danger of data concatenation in phylogenomics. Just name a few: http://currents.plos.org/treeoflife/article/concatenation-analyses-in-the-presence-of-incom plete-lineage-sorting/ and http://www.nature.com/nature/journal/v497/n7449/abs/nature12130.html. Concatenation is particularly inappropriate when there are incomplete lineage sorting, and I believe some of the "novel" relationships the authors found may not be true, but due to the pitfalls of their phylogenetic methodology. The better approach would be to use multi-species coalescent method, like ASTRAL (Mirarab et al 2014 Bioinformatics 30: i541-i548).

R:
We thank the reviewer for that these suggestions are very helpful. We have applied both coalescent-based and concatenation-based methods in species tree estimation in the revised manuscript (Line 303-312). The coalescent-based species tree was reconstructed by ASTRAL (v4.10.4) (Line 303-304). When nucleotide sequence is used, the cladograms yielded from coalescent-based and concatenation-based methods are highly consistent, except the location of Angiopteris fokiensis (Table 2, Line 531).
(2) The authors did not address how they deal with transcript isoforms from the Trinity output. When there are multiple isoforms, which one did you include in the alignment? R: We thank the reviewer for the suggestion. In our pipeline, we used CD-HIT-EST (v4.6.1) to cluster contigs, followed by discard the duplicated sequences (Line 255-257). The modification has been incorporated in the revised version of the manuscript.
(3) I don't have the access to the alignments to assess the quality. And the alignment and tree files should be deposited in Dryad, TreeBase, or other open repositories.

R:
We thank the reviewer for the suggestion. We have deposited the datasets in the open repository "Figshare" , including: -The 4 alignment sets of single orthologous gene (including 2 matrices both in DNA and Protein sequences)，available at https://figshare.com/s/f835735cb66911ff1ffd The datasets for concatenation based phylogenetic tree, including: -The 4 concatenation matrices (including 2 matrices both in DNA and Protein sequences); -The results of model selection for Protein concatenation tree (We did not adopt partition method here, the model for each gene was the same; For DNA concatenation tree, the default model GTR + Γ 4 + I was used); -The 4 resulting concatenation tree files (inferred by 2 matrices both in DNA and Protein sequences).
These data are available at https://figshare.com/s/8af236b660f61078e40b. For coalescent-based species tree, the deposited data including: -The 4 single orthologous gene matrices sets (including 2 matrices both in DNA and Protein sequences); -The best gene trees selected from the 100 replicated tree inferred by each gene matrices, which were used to calculate the topology of the consensus coalescent-based species tree; -The 100 random gene trees of each gene matrices, which were used to calculate the bootstrap of the consensus coalescent-based species tree; -The 4 coalescent-based species trees in newick format (inferred by 2 matrices both in DNA and Protein sequences).
(4) The authors do not have the correct "tree-thinking". The extant Equisetum is not more primitive/basal/earlier than say Polypodium. Their interpretation on annulus evolution also assumed a "ladderized" progression from Equisetales, Ophioglossales, Marattiales, Gleicheniales, Schizaeales, to others. But remember the trees can be freely rotated! If you want to make claims on the evolutionary "route", use fossils and/or character state reconstruction. Please refer to Stacey Smith and David Baum's Tree Thinking book, and also this blog post http://for-the-love-of-trees.blogspot.com/2016/09/the-ancestors-are-not-among-us.html.

R:
These suggestions are very helpful. We have referred to the literature which the reviewer suggested, and improved our "tree-thinking". We have removed words like "basal", "primitive" when we describe phylogenetic relationships and have used "sister to" instead. In the revised manuscript, to analyze the evolutionary route of sporangia annulus, we have used fossil records (Line 311-316, Figure 3) to estimate the divergence times, and applied character state reconstruction (Line 317-332, Figure 4).
(6) The writing needs to be tightened up a lot. There are many typos and awkward grammar.

R:
We thank the reviewer for the suggestion. We have improved the writing and checked the grammar carefully.
In summary, I think this study by Shen et al lacks the rigor and to some extent the novelty, despite having generated this large amount of data. The authors should consider improving their phylogenetic methods, and rethink about what new insights can be generated from their dataset. Good luck : ) R: These suggestions are very helpful. In the revised manuscript, we have improved the phylogenetic methods greatly, such as to estimate species tree using both coalescent-based and concatenation-based methods; and to reconstruct character state of sporangium annulus; and improved our "tree thinking" also. We have found some new insights, such as that eusporangiate ferns except Equisetales form a monophyletic clade, and that Gleicheniaceae and Hymenophyllaceae form a monophyletic clade which is sister to Dipteridaceae.
Reviewer #2: I review the paper not as an expert in fern evolution but just to assess the appropriateness of the methodology and data reporting. The paper report on a concatenation analysis of transcriptomic data for 69 fern species. The problem addressed is interesting, and the dataset is large and promising. The language is mostly clear, although improvements are needed (examples of problematic sentences are shown in the attached file). The paper certainly has merit.
There are three issues with the current version of the manuscript. The first two issues have to be solved before the paper becomes suitable for publication at GigaScience. The third issue raised can be taken as a suggestion.
1-Reproducibility: A major premise of journals like GigaScience is that data and methods used should be made available. The authors have deposited transcriptomes to GenBank. But this is not nearly enough. They should also make available: -Their ALIGNED orthologous gene sets -Their filtered data (after GBlocks) in form of the concatenation matrix used.
-The results of model selection (I assume one model per gene, but this was not clear).
-Resulting trees, in newick or other machine-readable formats. Authors put some newick strings (I don't think for the main tree) in the supplement. This is not the most useful way to publish newick trees. Instead, they should be made available as files. Places like TreeBase and dryad can be used for depositing the trees.

R:
We thank the reviewer for the suggestion. We have deposited the datasets in the open repository "Figshare" , including: -The 4 alignment sets of single orthologous gene (including 2 matrices both in DNA and Protein sequences)，available at https://figshare.com/s/f835735cb66911ff1ffd; The datasets for concatenation based phylogenetic tree, including: -The 4 concatenation matrices (including 2 matrices both in DNA and Protein sequences); -The results of model selection for Protein concatenation tree (We did not adopt partition method here, the model for each gene was the same; For DNA concatenation tree, the default model GTR + Γ 4 + I was used); -The 4 resulting concatenation tree files (inferred by 2 matrices in both DNA and Protein sequences).
For coalescent-based species tree, the deposited data include: -The 4 single orthologous gene matrices sets (including 2 matrices in both DNA and Protein sequences); -The best gene tree selected from 100 replicated trees which were inferred from each gene matrices. These best gene trees were used to calculate the topology of the consensus coalescent-based species tree; -The 100 random gene trees from each gene matrices, which were used to calculate the bootstrap of the consensus coalescent-based species tree; -The 4 coalescent-based species tree in newick format (inferred by 2 matrices in both DNA and Protein sequences).
In addition, the authors describe the methods but do not make any of their scripts available. Making those available would be important for reproducibility. At a minimum, the exact commands used for running various tools (e.g., MCL, Mafft, Trinity, RAxML, etc.) should be provided in the supplement. In absence of details I had to assume certain things. For example, it seemed like the analyses were partitioned, but details were not clear.
2-It was not clear to me how the dataset was divided into primitive and derived taxa. The criteria should be described clearly. The abstract should not use these two terms assuming their meaning is clear to the reader. In general, calling taxa derived and primitive tends to be controversial (to say the least).

R:
We thank the reviewer for the suggestion. In the revised manuscript, we did not divide the sampled species into primitive and derived taxa; instead, we have grouped the species as Eusporangiate, Early leptosporangiate, and Polypodiales according to the phylogeny. Moreover, since words like primitive and derived taxa are not correct "tree thinking", we have avoided using them in the revised version.
3-The methods used are OK, but do not include the types of species tree analyses that are the norm in modern phylogenomics. The authors have >1000 genes. They can estimate individual gene trees and then combine them using a summary method (e.g., NJst/ASTRID, ASTRAL, MP-EST, etc). This approach would provide an alternative analysis that can be compared with concatenation. The authors should feel free to argue in favor of one type of analysis over the other, but not doing any species tree analysis makes this into a paper that one would have expected to see 5 years ago but not now. At a minimum, the authors should discuss why they don't think a species tree method is needed or relevant.

R:
We thank the reviewer for the suggestion. In this revised manuscript, we have applied both coalescent-based and concatenation-based methods in species tree estimation (Line 303-312). The coalescent-based species tree was reconstructed by ASTRAL (v4.10.4) (Line 303). We also compared the cladograms estimated by coalescent-based and concatenation-based methods ( Table 2, Line 531).

Reviewer #3:
In "Large scale phylogenomic analysis resolves a backbone phylogeny in ferns" Hui Shen and colleague collect RNA-seq data and perform phylogenetic analyses of a large group of ferns to address existing phylogenetic uncertainties. As agreed by the editor, I am not qualified to assess the biological significance of the findings and accordingly my review will be limited to the technical aspects and the presentation.
Through the application of RNA-Seq, the authors estimate a comprehensive phylogenetic tree of 69 species of ferns covering the major groups, resolving some outstanding placement issues with sufficient confidence. Moreover, they utilize their newly obtained tree to revise the morphological evolution of sporangia, offering a novel hypothesis (this could be better reflected in the title). Overall, I found the study well design and conducted and a valuable addition. I offer some recommendations for some potential improvements below.
For their main result (the estimation of a robust phylogenetic backbone tree) the authors apply a multi-step process composed of sequencing QC, transcriptome assembly, orthology assignment and tree estimation by maximum likelihood from (translated) amino acid sequences. While these are pretty standard procedures, the description of the exact steps could be improved and their purpose more clearly stated: a flow diagram could greatly improve understanding. For example it is not clear which alignments are of nucleotides and which of proteins (eg. line 128 vs. 134). The code used for the various steps was not initially provided and was subsequently made available by the authors as a set of scripts. Some steps of the pipeline are however missing (removal of sequences of species with more than 10 sequences in a group, line 127; tree estimation within groups of orthologues, line 130) and it is not obvious that the script 4_Runall_for_multi_alignment.pl is (probably) run also after #2. Overall, these are documentation problems rather than methodological issues and the approach used in this study is sufficiently robust and appropriate. The strong bootstrap support and the topological consistency validate the chosen approach. As mentioned above, more detailed explanations (possibly along with a flow diagram) would help clarify the matter.

R:
We thank the reviewer for the comments and suggestions.
-We have added a flow diagram as Figure 2 to show the major processes and methods in this study.
-For the missing step of "removal of sequences of species with more than 10 sequences in a group" in the pipeline, we have added a new script named "3_Runall_for_mci_result_analysis.pl" which is run for the masking of the resulting homologous gene families obtained by MCL. Within this script, lines 49-53 are coded for "removal of sequences of species with more than 10 sequences in a group".
-For the missing step of "tree estimation within groups of orthologues", we have added a new script named "5_ Runall_for_raxml.pl", which is run for construction of Raxml tree of each homologous gene family using protein sequence.

R:
We thank the reviewer for the comments. As we have applied both coalescent-based and concatenation-based methods to both nucleotides and amino acids sequences in species tree estimation, and have compared the four resulting cladograms (Table 2, Line 531); the AU test seems not necessary, and has been removed from the revised manuscript.
Below, I offer some recommendations that could make the paper more accessible to non-experts in the field of fern phylogeny and evolution.
Generally, the authors do a good job at introducing the uncertainties in fern phylogeny that they wish to address, however the usage of common names should be moved to the introduction, rather than being left to the Discussion. Similarly, for the non-expert, it is necessary to specify which species belong to which group (Family/Order), either as part of Table 1 and/or as part of Figure 2. R: Thanks for the suggestions. Since only a few species have common names and they were not referred to in the introduction, we have removed the common names from the revised manuscript. In order to help the readers find the Order/group to which a certain species belongs, we have color the species names in Figure 3(a) the same as their correspondence groups in Figure 3(b).
However, much of the focus of the study is devoted to addressing the evolution of sporangia given the newly obtained phylogeny. As a new evolutionary pathway is proposed as an important novel result, I suggest introducing the current view in the Introduction, possibly with a diagram (similar to Figure 3), rather than scattered in the Discussion section. I understand that the current phylogeny is (or historically has been) informed by the morphology of the sporangium, therefore being able to map the morphology on an independently obtained phylogeny is a major advancement.
It's surprising that one species (Cyclopeltis crenata) only had 23% coverage of its BUSCO set, compared with most of the others above 90%. This observation, its causes and its potential consequences are not addressed in the text.

R:
We thank the reviewer for the comment. We have perform an analysis to assess the potential affection caused by the relative lower assembly quality of Cyclopeltis crenata. A RAxML tree was constructed using concatenation method, with the matrix that excluded all the gene sequence of Cyclopeltis crenata. We compared the obtained topology with our main tree results, and found no difference. This result implied that the low coverage of BUSCO in the sample Cyclopeltis crenata does not affect our main phylogeny conclusion. In addition, it seems that the orthologous gene presented in Cyclopeltis crenata is not deceased even the lower assembly quality: 2146 in the matrix 1, 2391 in total；1279 in the matrix 2, 1334 in total.
My major issue is with Figure 2, which is the centerpiece of the study. In its current form it is underwhelming and I strongly recommend improving the figure and especially the legend. The phylogram should definitely include the group membership of the various species (possibly currently indicated by the A-D and I-IV markings on the side, but not explained) and the legend should explain the changes in sporangium.

R:
We thank the reviewer for the comments. We have revised Figure 3 (formerly Figure 2), such as marking the group names on the side of the species names, and adding the time scale. The change of sporangia and their annulus are explained in Figure 4 and the main text (Line 216-228) in the revised manuscript.
Finally, I would strongly urge the authors to deposit the tree and the underlying matrices to a domain specific repository like TreeBASE (https://treebase.org) and/or OpenTreeOfLife (https://tree.opentreeoflife.org/about/open-tree-of-life).

R:
We thank the reviewer for the comments. We have deposited the trees and the underlying datasets in the open repository "Figshare" , including:

-
The 4 alignment sets of single orthologous gene (including 2 matrices both in DNA and Protein sequences)，available at https://figshare.com/s/f835735cb66911ff1ffd The datasets for concatenation based phylogenetic tree, including: -The 4 concatenation matrices (including 2 matrices both in DNA and Protein sequences); -The results of model selection for Protein concatenation tree (We did not adopt partition method here, the model for each gene was the same; For DNA concatenation tree, the default model GTR + Γ 4 + I was used); -The 4 resulting concatenation tree files (inferred by 2 matrices both in DNA and Protein sequences).
These data are available at https://figshare.com/s/8af236b660f61078e40b. For coalescent-based species tree, the deposited data including: -The 4 single orthologous gene matrices sets (including 2 matrices both in DNA and Protein sequences); -The best gene trees selected from the 100 replicated tree inferred by each gene matrices, which were used to calculate the topology of the consensus coalescent-based species tree; -The 100 random gene trees of each gene matrices, which were used to calculate the bootstrap of the consensus coalescent-based species tree; -The 4 coalescent-based species trees in newick format (inferred by 2 matrices both in DNA and Protein sequences).
Other minor points "basing on" should be replaced with "based on" throughout. l. 51: "phylogenetic studies in ferns" l. 52: "our understanding of fern evolution" The sentence in lines 69-70 should be removed as it's factually incorrect. l. 61: PPG is only spelled out in l. 200 l. 81: "cornerstone of fern phylogeny" ll. 83-85: unclear what is meant by "A natural framework" l. 148: a reference to "hypothesys_tree.doc file" is not clear. ll. 159-160 ("together with..."): incomplete sentence. l. 182-183: "which is adapted" seems to refer to the AU test, rather than the five topologies.

R:
We thank the reviewer for the comments. We have revised these inappropriate words/sentences mentioned above, and checked the grammar all over the revised manuscript carefully.