Trochodendron aralioides, the first chromosome-level draft genome in Trochodendrales and a valuable resource for basal eudicot research

Abstract Background The wheel tree (Trochodendron aralioides) is one of only 2 species in the basal eudicot order Trochodendrales. Together with Tetracentron sinense, the family is unique in having secondary xylem without vessel elements, long considered to be a primitive character also found in Amborella and Winteraceae. Recent studies however have shown that Trochodendraceae belong to basal eudicots and demonstrate that this represents an evolutionary reversal for the group. Trochodendron aralioides is widespread in cultivation and popular for use in gardens and parks. Findings We assembled the T. aralioides genome using a total of 679.56 Gb of clean reads that were generated using both Pacific Biosciences and Illumina short-reads in combination with 10XGenomics and Hi-C data. Nineteen scaffolds corresponding to 19 chromosomes were assembled to a final size of 1.614 Gb with a scaffold N50 of 73.37 Mb in addition to 1,534 contigs. Repeat sequences accounted for 64.226% of the genome, and 35,328 protein-coding genes with an average of 5.09 exons per gene were annotated using de novo, RNA-sequencing, and homology-based approaches. According to a phylogenetic analysis of protein-coding genes, T. aralioides diverged in a basal position relative to core eudicots, ∼121.8–125.8 million years ago. Conclusions Trochodendron aralioides is the first chromosome-scale genome assembled in the order Trochodendrales. It represents the largest genome assembled to date in the basal eudicot grade, as well as the closest order relative to the core-eudicots, as the position of Buxales remains unresolved. This genome will support further studies of wood morphology and floral evolution, and will be an essential resource for understanding rapid changes that took place at the base of the Eudicot tree. Finally, it can further genome-assisted improvement for cultivation and conservation efforts of the wheel tree.


Introduction
The Trochodendraceae family (order Trochodendrales) includes only 2 species (Trochodendron aralioides and Tetracentron sinense), both of which are commercially used and widely cultivated. T. aralioides (wheel tree) is a native species of the forests of Japan (Honshu-southwards from Yamagata Prefecture, Shikoku, Kyushu, Ryukyu Islands) and Taiwan. Although its hardiness extends to lower temperatures, it is generally restricted to lower temperate montane mixed forests between 600 and 1,700 m in Japan. In Taiwan, the range is more extensive, occurring in broad-leaved evergreen forest (2,000-3,000 m) in the central mountain ranges and in northern Taiwan between 500 and 1,250 m forming monotypic stands [1]. Over the past century, it has been repeatedly reported from Korea [2][3][4][5][6][7][8][9][10] although these occurrences are not confirmed in online repositories (e.g., [11]). Properties of Trochodendron (e.g., mild-warm temperate range, restricted elevational intervals, and natural occurrence in small discontinuous populations) make it difficult to predict the sensitivity of T. aralioides to the effects of projected changes in climate [12]. The fossil record shows that both the species diversity and distribution of the family were much more extensive and continuous during the Eocene (50-52 million years ago [Mya]) to Miocene [13,14]. Unique for basal eudicots, Trochodendraceae have secondary xylem without vessel elements, a property only found in Amborella and Winteraceae [15]. This raises interesting questions on the biological conditions or triggers giving rise to such anatomical reversals, and the evolutionary and ecological consequences inherent in them.
Here, we constructed a high-quality chromosome-level reference genome assembly for T. aralioides (NCBI:txid4407) using long reads from the Pacific Biosciences (PacBio) DNA sequencing platform and a genome assembly strategy taking advantage of the Canu assembler [16]. This assembly of T. aralioides genome is the first chromosome-level reference genome constructed for the Trochodendrales order, and the closest relative to core eudicots sequenced to date. The completeness and continuity of the genome will provide high-quality genomic resources for studies on floral evolution and the rapid divergence of eudicots.

Genomic DNA extraction, Illumina sequencing, and genome size estimation
High-quality genomic DNA was extracted from freshly frozen leaf tissue of T. aralioides (Fig. 1) using the Plant Genomic DNA Kit (Tiangen, Beijing, PR China ), following the manufacturer's instructions. After purification, a short-insert library (300-350 bp) was constructed and sequenced on the Illumina NovaSeq platform (Illumina Inc., San Diego, CA, USA), according to manufacturer guidelines. A total of ∼124.6 Gb of raw reads were generated.
Sequencing adapters were then removed from the raw reads, and reads from non-nuclear origins (e.g., chloroplast, mitochondrial, bacterial, and viral sequences) screened by aligning them to the nr database (NCBI [17]) using megablast v2.2.26 with the parameters " -v 1 -b 1 -e 1e-5 -m 8 -a 13 ". The in-house script duplication rm.v2 was used to remove the duplicated read pairs, and low-quality reads were filtered as follows: 1) reads with ≥10% unidentified nucleotides (N) were removed; 2) reads with adapters were removed; 3) reads with >20% bases having Phred quality <5 were removed.
After the removal of low-quality and duplicated reads, ∼124.3 Gb of clean data (Supplementary Table S1) were used for the genome size estimation.
The k-mer peak occurred at a depth of 51 (Fig. 2), and we calculated the genome size of T. aralioides to be 1.758 Gb, with an estimated heterozygosity of 0.86% and a repeat content of 69.31%. This estimate is slightly smaller than the previously reported size, based on cytometry estimate (1.868 Gb) [18]. The GC content was 39.58% ( Supplementary Fig. S1). A first genome assembly, using the Illumina data and the assembly program SOAPdenovo (SOAPdenovo, RRID:SCR 010752) [19], was ∼1.324 Gb total length, with a contig N50 of 740 bp and a scaffold N50 of 1.079 kb. This first attempt to assemble the wheel tree genome was of low quality, likely due to its high genomic repeat content and high heterozygosity level.

PacBio sequencing
High molecular weight (HMW) genomic DNA (gDNA) was sheared using a g-TUBE device (Covaris, Brighton, UK) with 20-kb settings. Sheared DNA was purified and concentrated with AmpureXP beads (Agencourt, Bioscience Corp., Beverly, MA, USA) and then used for single-molecule real time (SMRT) bell sequencing library preparation according to manufacturer's protocol (Pacific Biosciences; 20-kb template preparation using BluePippin size selection). Size-selected and isolated SM-RTbell fractions were purified using AmpureXP beads (Agencourt, Bioscience Corp., Beverly, MA, USA) and these purified SM-RTbells were finally used for primer-and-polymerase (P6) binding according to manufacturer's protocol (Pacific Biosciences). DNA-Polymerase complexes were used for MagBead binding and loaded at 0.1 nM on-plate concentration in 35 SMRT cells. Singlemolecule sequencing was performed on a PacBio Sequel platform, yielding a total of 177.80 Gb filtered polymerase read bases (Supplementary Table S1).

10X Genomics sequencing
Libraries were built using a Chromium automated microfluidic system (10X Genomics, San Francisco, CA, USA) that allows the combination of the functionalized gel beads and HMW gDNA with oil to form a "gel bead in emulsion (GEM)." Each GEM contains ∼10 molecules of HMW gDNA and primers with unique barcodes and P5 sequencing adapters. After PCR amplification, P7 sequencing adapters are added for Illumina sequencing. Data were processed as follows: First, 16-bp barcode sequences and the 7-bp random sequences are trimmed from the reads, as well as low-quality pairs. We generated a total of 186.95 Gb raw data, and 183.52 Gb clean reads (Supplementary Table S1).

Hi-C sequencing data
To build a Hi-C library [20], nuclear HMW gDNA from T. aralioides leaves was cross-linked, then cut with the DpnII GATC restriction enzyme, leaving pairs of distally located but physically interacted DNA molecules attached to one another. The sticky  ends of these digested fragments were biotinylated and then ligated to each other to form chimeric circles. Biotinylated circles, that are chimeras of the physically associated DNA molecules from the original cross-linking, were enriched, sheared, and sequenced on an Illumina platform as described above. After adapter removal and filter of low-quality reads, there were a total of 193.90 Gb clean Hi-C reads. Sequencing quality assessment is shown in Supplementary Table S2.

De novo genome assembly
Short PacBio reads (<5 kb) were first used to correct the PacBio long reads using the "daligner" option in FALCON (Falcon, RRID: SCR 016089) [21], and to generate a consensus sequence. Following this error correction step, reads overlap was used to construct a directed string graph following Myers' algorithm [22]. Contigs were then constructed by finding the paths from the string graph. Error correction of the preceding assembly was performed using the consensus-calling algorithm Quiver (PacBio Inc., Menlo Park, CA, USA) [23]. Reads were assembled and errorcorrected with FALCON and Quiver to generate 4,226 contigs with a contig N50 length of 702 kb and total length of 1.607 Gb.
FragScaff [24] was used for 10X Genomics scaffolding, as follows: 1) Linked reads generated using the 10X Genomics library were aligned with BOWTIE v2 (Bowtie, RRID:SCR 005476) [25] against the consensus sequence of the PacBio assembly, to obtain Super-Scaffolds; 2) With increasing distance to consensus sequence, the number of linked reads supporting scaffold connection will decrease. Consensus sequences without linked read supports were then filtered and only the consensus sequence supported by linked reads was used for the subsequent assembly. FragScaff scaffolding resulted in 1,469 scaffolds, with a scaffold N50 length of 3.38 Mb.
To assess the completeness of the assembled T. aralioides genome, we performed a BUSCO analysis by searching against the plant (BUSCO, RRID:SCR 015008, version 3.0) [26]. Overall, 91.4% and 2.8% of the 1,440 expected genes were identified in the assembled genome as complete and partial, respectively (Supplementary Table S3). Overall, 94.2% (1,356) genes were found in our assembly. We also assessed the completeness of conserved genes in the T. aralioides genome by Core Eukaryotic Genes Mapping Approach (CEGMA, RRID:SCR 015055) [27]. According to CEGMA, 232 conserved genes in T. aralioides were identified, which have 93.55% completeness compared to the sets of CEGMA (Supplementary Table S3).
The Hi-C clean data were aligned against the PacBio reads assembly using BWA (BWA, RRID:SCR 010910) [28]. Only the read pairs with both reads aligned to contigs were considered for scaffolding. For each read pair, its physical coverage was defined as the total bp number spanned by the sequence of reads and the gap between the 2 reads when mapping to contigs. Per-base physical coverage for each base in the contig was calculated as the number of read pairs' physical coverage it contributes too. Misassembly can be detected by the sudden drop in per-base physical coverage in a contig.
Following the physical coverage of the resulting alignment, any misassembly was split to apply corrections. Using the clustering output, the order and orientation of each contig interaction was assessed on intensity of contig interaction and the position of the interacting reads. Combining the linkage information and restriction enzyme site, the string graph formulation was used to construct the scaffold graph using LACHESIS [29], and 1,469 scaffolds of our draft genome were clustered to 19 chromosomes ( Figure 3, Supplementary Table S4). The T. aralioides genome information is summarized in Supplementary Table S5.

Repeat sequences in the wheel tree genome
Transposable elements (TEs) in the genome assembly were identified both at the DNA and protein level. RepeatModeler (Re-peatModeler, RRID:SCR 015027) [30], RepeatScout (RepeatScout, RRID:SCR 014653) [31], and LTR FINDER (LTR Finder, RRID:SCR 0 15247) [32] were used to build a de novo TE library with default parameters. RepeatMasker [33] was used to map the repeats from the de novo library against Repbase [34]. Uclust [35] was then used with the 80-80-80 rule [36] to combine the results from the above software. At the protein level, RepeatProteinMask in the Repeat-Masker package was used to identify TE-related proteins with WU-BLASTX searches against the TE protein database. Overlapping TEs belonging to the same type of repeats were merged.
Repeat sequences accounted for 64.2% of the T. aralioides genome, with 57.2% of the genome identified from the de novo repeat library (Table 2). Approximately 53.2% of the T. aralioides genome was identified as long terminal repeat (LTR) (most often TEs). Among them, LTRs were the most abundant type of repeat sequences, representing 53.249% of the whole genome. Long interspersed nuclear elements (LINEs) and DNA TE repeats accounted for 0.837% and 2.416% of the whole genome, respectively ( Table 2, Supplementary Fig. S2).

Annotation
Gene annotation was performed using 3 approaches: r A homology-based approach, in which the protein sequences from O. sativa, Aquilegia coerulea, Fraxinus excelsior, Nelumbo nucifera, Quercus robur, and Vitis vinifera were aligned to the genome by using TblastN (TBLASTN, RRID:SCR 011822) [38] with an E-value cutoff by 1E−5. BLAST hits were conjoined with Solar software [44]. For each BLAST hit, Genewise (Ge-neWise, RRID:SCR 015054) [45] was used to predict the exact gene structure in the corresponding genomic regions.
All gene models predicted from the above 3 approaches were combined by EVM into a non-redundant set of gene structures. Then, we filtered out low-quality gene models, defined as follows: (i) coding region lengths ≤150 bp, (ii) models supported only by ab initio methods and with FPKM < 1.
We identified an average of 5.1 exons per gene (mean length of 10.622 kb) in the T. aralioides genome ( Table 1). The gene number, gene length distribution, coding sequence (CDS) length distribution, exon length distribution, and intron length distribution were all comparable to those of selected angiosperm species (Supplementary Table S8, Supplementary Fig. S3).

Gene family identification and phylogenetic analyses of wheel tree
Orthologous relationships between genes of Amaranthus hypochondriacus, Amborella trichopoda, Annona muricata, A. coerulea, A. thaliana, Helianthus annuus, Cinnamomum kanehirae, Musa acuminata, N. nucifera, O. sativa, Q. robur, and V. vinifera were inferred through all-against-all protein sequence similarity searches with OrthoMCL [54] and only the longest predicted transcript per locus was retained ( Supplementary Figs S4 and  S5). This resulted in 31,290 orthologous groups with representatives in each species being identified. Among these gene families, 484 were single-copy orthologs, and used for phylogenomic reconstruction and positive selection analyses. The similarity among these orthologs ranged from 34.8% to 80.3% (mean: 60.64% ± 7.33%, distribution shown in Supplementary  Fig. S6a).
To identify gene families that experienced a significant expansion or contraction during the evolution of the wheel tree, we used the likelihood model implemented in CAFE (CAFÉ, RRID: SCR 005983) [60], with default parameters. The phylogenetic tree topology and branch lengths were taken into account to infer the significance of change in gene family size in each branch (Fig. 4). The gene families that experienced the most significant expansions were mainly involved in pathogen/stress response (e.g., the cyanoamino-acid metabolism, P = 2.35 × 10 −28 ; the plant-pathogen interaction map, P = 2.29 × 10 −22 ; the tryptophan metabolism, P = 5.85 × 10 −10 ) (Supplementary Table S10).

Positive selection
Positive selection is a major driver of biological adaptation. Using the protein-coding sequences, we calculated the number of synonymous substitutions per site (Ks) and nonsynonymous substitutions per site (Ka) and assessed the deviation from zero of the difference Ka − Ks. A Ka/Ks > 1 represents evidence of positive selection. The Ka/Ks ratio ranged from 0.0061 to 5.7689 (mean: 0.6976 ± 0.9173, see Supplementary Fig. S6b), with Ks ranging from 0.0030 to 0.6281 (mean: 0.1585 ± 0.0749). MUSCLE [55] was used to align the protein and nucleotide sequences; then we used Gblocks [56] with default parameters (-b3 8; -b4 10; -b5 n) to eliminate poorly aligned positions and divergent regions from the alignment. The maximum likelihood-based branch length test of the PAML package [58] was used for comparisons, and the ratio Ka/Ks was calculated over the entire length of the protein-coding gene. Using T. aralioides as the foreground branch and A. hypochondriacus, H. annuus, N. nucifera, and A. coerulea as background branches, we identified 238 genes that were considered as candidate genes under positive selection (P-value < 0.01, false discovery rate < 0.05) using a maximum likelihood-based branch length test. The GO terms and KEGG pathways for these genes showed that positive selection was especially detected in cell metabolism (such as vitamins and amino acid biosynthesis; Supplementary Table S11).

Whole-genome duplication analysis
MCScan [61] was used to identify collinear segments within Trochodendron and between the T. aralioides and other angiosperm genomes. The sequences of the gene pairs contained in the (inter-) collinear segments of the genome were extracted and the codeml tool in the PAML package [58] was used to calculate the value of the 4-fold synonymous third-codon transversion rate (4dTv). The distribution of 4dTv can reflect whether genome-wide replication events occur in the evolutionary history of species, the relative time of genome-wide replication events, and the divergent events among species.
The 4dTv values of all paralogous gene pairs in T. aralioides were calculated, as well as those in A. coerulea, H. annuus, and A. muricata for comparisons. In addition, the 4dTv values were calculated for all ortholog gene pairs between T. aralioides and A. coerulea, H. annuus, and A. muricata to observe species divergence events. The peak of 4dTv distribution in the T. aralioides genome was around ∼0.1, whereas the peaks for interspecific comparisons with A. coerulea and H. annuus were around ∼0.3 and ∼0.2, respectively ( Supplementary Fig. S7). All together, these 4dTv distributions indicate that a whole-genome replication event occurred in T. aralioides after its divergence from both other basal angiosperms and core eudicots.

Conclusions
We successfully assembled the genome of T. aralioides and report the first chromosome-level genome sequencing, assembly, and annotation based on long reads from the third-generation PacBio Sequel sequencing platform for basal eudicotyledons. The final draft genome assembly is ∼1.614 Gb, which is slightly smaller than the estimated genome size based on k-mer analysis (1.758 Gb) and on cytometry (1.868 Gb, [18]). With a contig N50 of 691 kb and a scaffold N50 of 73.37 Mb, the chromosomelevel genome assembly of T. aralioides is the first high-quality genome in the Trochodendrales order. We also predicted 35,328 protein-coding genes from the generated assembly, and 95.4% (33,696 genes) of all protein-coding genes were annotated. We found that the divergence time between T. aralioides and its common ancestor with the core eudicots was ∼124.2 Mya. The chromosome-level genome assembly together with gene annotation data generated in this work will provide a valuable resource for further research on floral morphology diversity, on the early evolution of eudicotyledons, and on the conservation of this iconic tree species.

Availability of Supporting Data and Materials
Raw reads were deposited to EBI (project PRJEB32669), and supporting data and materials are available in the GigaScience Gi-gaDB database [62].