The origin, diversiﬁcation and adaptation of a major mangrove clade (Rhizophoreae) revealed by whole-genome sequencing

Mangroves invade some very marginal habitats for woody plants—at the interface between land and sea. Since mangroves anchor tropical coastal communities globally, their origin, diversification and adaptation are of scientific significance, particularly at a time of global climate change. In this study, a combination of single-molecule long reads and the more conventional short reads are generated from Rhizophora apiculata for the de novo assembly of its genome to a near chromosome level. The longest scaffold, N50 and N90 for the R. apiculata genome, are 13.3 Mb, 5.4 Mb and 1.0 Mb, respectively. Short reads for the genomes and transcriptomes of eight related species are also generated. We find that the ancestor of Rhizophoreae experienced a whole-genome duplication ∼ 70 Myrs ago, which is followed rather quickly by colonization and species diversification. Mangroves exhibit pan-exome modifications of amino acid (AA) usage as well as unusual AA substitutions among closely related species. The usage and substitution of AAs, unique among plants surveyed, is correlated with the rapid evolution of proteins in mangroves. A small subset of these substitutions is associated with mangroves’ highly specialized traits (vivipary and red bark) thought to be adaptive in the intertidal habitats. Despite the many adaptive features, mangroves are among the least genetically diverse plants, likely the result of continual habitat turnovers caused by repeated rises and falls of sea level in the geologically recent past. Mangrove genomes thus inform about their past evolutionary success as well as portend a possibly difficult future.


INTRODUCTION
One of the most productive and diverse environments for many life forms is at the interface between land and sea. Woody plants, however, are an exception to this species richness in intertidal zones. Globally, no more than 80 tree species have succeeded in invading intertidal zones to become mangroves, compared to over 10 000 that are found at the land-water interface in non-saline systems. (The term 'mangrove' refers to many independently evolved lineages of woody plants that occupy these land/saltwater interfaces.) Remarkably, the small number of mangrove species anchors tropical intertidal communities globally by providing key ecological services that include carbon sequestration [1], sediment accretion, seashore protection and ecosystem productivity [2].
How mangroves became adapted to the intertidal environments is thus a most interesting question. Mangroves differ from other plants living in hypersaline habitats [3,4] because their environments are 722 Natl Sci Rev, 2017, Vol. 4, No. 5 RESEARCH ARTICLE stressful in multiple dimensions including high salinity, hypoxia, strong UV light and anaerobic soils [5]. All these stresses fluctuate daily as the tides ebb and flow. In the adapation, mangroves have evolved specialized traits that include viviparous embryonic development, aerial roots and high tannin content [2,6,7]. While there have been some attempts at understanding the molecular basis of these adaptations [8][9][10][11], a systematic investigation is hindered by the absence of genomic data in any mangrove species. This absence will be remedied in this study.
In this study, we sequenced the genomes and/or transcriptomes of nine species using the latest sequencing technology supplemented by the more conventional platforms. Among these species, seven are mangroves from the Rhizophoreae tribe, the most mangrove-rich taxon comprising 20 typical mangrove species. The remaining two are inland species most closely related to Rhizophoreae.
There is an urgency at this time for studying mangroves because of the impending sea-level rises. Plants of the tropical coasts are expected to be affected disproportionately and mangroves will likely bear the brunt of these environmental changes. Indeed, there have been several warnings for 'a world without mangroves' [12]. The availability of genome sequences may help to reveal the history of mangrove colonization and mechanisms of adaptation to intertidal zones. Perhaps most important of all, the genomic resources may help to spur research on these most interesting adaptations. Research activities in themselves could offer some needed protections for these fragile intertidal ecosystems.

MANGROVE GENOME SEQUENCE AND COMPOSITION
We obtained 16 [13]). This indicates that our assembly approaches the chromosome-level completeness.
To correct for long-read sequencing errors, we also generated 89.3 Gb of short paired-end reads, which are generally more accurate (Supplementary  Note and Supplementary Table 3, available as Supplementary Data at NSR online). Nevertheless, had we used only the Illumina short reads, the N50 and N90 scaffolds would have decreased by 80% in length. Most crucially, we obtained contig N50 of 2.45 Mb using the SMRT assembly, whereas the short pair-end reads yielded an N50 of only 9.7 Kb (Supplementary Table 4, available as Supplementary Data at NSR online).
In order to study the evolution of mangroves that form the Rhizophoreae tribe, we further sequenced the genomes of R. mangle, R. stylosa and R. mucronata at lower depth (Supplementary Fig. 5 and Supplementary Table 5, available as Supplementary Data at NSR online). In the companion studies, we generated transcriptomes of Kandelia obovata, Bruguiera gymnorrhiza and Ceriops tagal, also from the Rhizophoreae tribe, as well as Pellacalyx yunnanesis and Carallia brachiata from the closest non-mangrove genera in the Rhizophoraceae family [14,15] (Supplementary Table 6, available as Supplementary Data at NSR online).
The quality of the assembly is reflected in the following statistics: 96% of the expressed sequences (Supplementary Note, available as Supplementary Data at NSR online) could be mapped to the genome; 93% of the core eukaryotic genes [16] are present and 99% of previously identified R. apiculata genes [17] could be uniquely mapped. Details of the procedures are given in the Supplementary Note and summarized in Supplementary Table 7 (both available as Supplementary Data at NSR online). The statistics indicate that these genomes are of the high quality necessary for advancing to the next stage of global mangrove research.
Using a combination of homology-based search and de novo prediction, we estimate that 29% of the R. apiculata genome consists of repetitive sequences (Supplementary Table 8, available as Supplementary Data at NSR online). The repetitive portions of the genome, comprising predominantly transposable element (TE) families, are drastically reduced in R. apiculata compared to the closely related nonmangrove plants (Lyu et al., unpublished data). By examining the long terminal repeats of many TEs, we conclude that the reduction is due to a lower rate of transposition, rather than a higher rate of TE loss. The underlying mechanisms of TE reductions are similar across independent mangrove lineages of Rhizophora, Avicennia and Sonneratia, the latter two being presented in the companion studies. The lower birth rate of TEs hints that active repression of TE jumping is a common strategy of mangrove genomes

WHOLE-GENOME DUPLICATION AND THE ORIGIN-DIVERSIFICATION OF RHIZOPHOREAE
Whole-genome duplication (WGD) is a feature of many taxa. WGD is, after all, an efficient way of expanding the genome [18]. We wish to know whether WGDs may have happened in Rhizophoreae. In particular, the timing of WGD in relation to geological events that permit the colonization of the intertidal zones has never been explored before.
We used MCScanX [19] (see 'Methods' and Supplementary Notes, available as Supplementary Data at NSR online) to align the R. apiculata genome to itself. We define 'collinear blocks' as regions of the genome that harbor at least five genes with homologs elsewhere in the genome and in the same order. We identified 377 such blocks that together cover 10 846 protein-coding genes (41% of all genes; Fig. 1). These genes are distributed among 4615 pairs, as well as some triplets/quadruplets, of genes. The extensive collinear blocks indicate at least one WGD event in the past.
To estimate the timing of WGD, we calculate the synonymous divergence (dS) between paralogous genes of the same genome (Fig. 2b). The distribution of dS between paralogous genes is uni-modal, suggesting a single WGD in the ancestor of R. apiculata. Since the mean dS between paralogs (0.35, red line in Fig. 2b) is larger than that between R. apiculata and Ca. brachiate/Pe. yunnanessis (0.25) but smaller than that between R. apiculata and P. trichocarpa (0.75, green line; Supplementary Fig. 9, available as Supplementary Data at NSR online), the WGD likely happened between these two time points, as indicated by the star in Fig. 2a  What then may be the timing of WGD in relation to mangrove emergence in the phylogeny? Did it occur before or after the origin of Rhizophoreae mangroves? To answer this question, we reconstruct the phylogeny of the 11 mangrove and nonmangrove species ( Fig. 2a; Supplementary Table 6, available as Supplementary Data at NSR online). The tree topology was reconstructed using PhyML [20]. Ca. brachiata and Pe. yunnanensis are the nonmangrove members of the same family [21]. The divergence time is estimated by the MCMCTREE program from the PAML package [22,23] (see 'Methods') based on three dated events for calibration and confirmation. First, the root node of the common ancestor of Rhizophoraceae, Euphorbiaceae (Ri. communis) and Salicaceae (P. trichocarpa) has been placed in the interval of 105-120 Myr before present [24,25]. Second, a most recent fossil recognized as ancestral Rhizophora has been dated to the late Eocene (33.9-38 Mya) [26,27]. These two time points are used to constrain the estimation of substitution rates ( Supplementary  Fig. 11, available as Supplementary Data at NSR online). The third dated event is given by fossils of the ancestor of Rhizophoreae from the early Eocene (47.8-56 Myr ago) [27,28], which is used to corroborate the MCMCTREE estimates.
We estimate that the mangrove-non-mangrove divergence happened 54.6 Myr ago, while the most recent common ancestor of Rhizophoreae mangroves is estimated to be 40. The invasion of the intertidal zones by Rhizophoreae appears to have coincided with a brief period of extreme global warming referred to as the Paleocene-Eocene Thermal Maximum (PETM) that occurred approximately 55.5 Myr ago [29] (Fig. 2a). Eustatic sea levels rose during PETM, likely submerging the angiosperms living at the margins of rainforests and forcing them to adapt to the new environment. Therefore, the emergence of mangroves may have been aided first by the genetic WGD event and then by suitable ecological conditions during the PETM.
It has been suggested that WGDs played important roles in the origin and diversification of many angiosperms [30]. The connection between genome duplication and evolutionary innovation seems particularly relevant in the emergence of mangroves. In addition to Rhizophoreae, two other major clades, Avicennia and Sonneratia, also experienced independent WGDs before their invasions of the intertidal zones (He et al., unpublished data).  that of its closest relative, P. trichocarpa. The comparison between the two species is done in the context of 48 other species of plants, whose AA usages are shown in the shaded area by the quartile. It is apparent that the AA usage in R. apiculata, but not P. trichocarpa, deviates strongly from the norm for plants. (b) dN/dS ratio along each branch of the phylogenetic tree. Mangrove lineages are colored red. (c) dN/dS ratios among genes grouped by GO terms in R. apiculata vs. in Ca. brachiata. The lower line indicates equal ratios and the upper line indicates a two-fold increase in R. apiculata. GO terms above the upper line are marked in red.
A common pattern after WGD is that one of the two duplicated copies becomes lost shortly afterward [31]. Those genes that retain both copies are therefore of great interest. Across the genome, 2878 pairs are retained when we examined genes with the dS values in the 0.25-0.70 range. Genes with dS outside of this range may have unusual or complicated evolutionary dynamics and are excluded here. The retainers are enriched for ontology terms related to regulation and stress response ( Fig. 2c; see also Supplementary Note, available as Supplementary Data at NSR online). The preferential retention of regulatory genes supports the evolutionary model of genome duplication [32], while the retention of stress response genes may pertain to the invasion of the intertidal zones.

ADAPTATION AT THE WHOLE-GENOME LEVEL
Inhabiting highly saline habitats relative to other woody plants, mangroves have to regulate intracellular salt levels to mitigate the effect of the environment [33]. However, daily sea-level fluctuations due to tides prevent an effective regulation. Indeed, it may take more than a week to reach an equilibrium when the salt concentration changes [9,33]. Thus, intracellular proteins have to adapt to an increase in environmental salinity.
We surveyed amino acid (AA) compositions across all proteins among 50 plant species (Fig. 3a). The AA usages in R. apiculata (red bars) and P. trichocarpa (its closest non-mangrove relative with complete genome sequences, blue bars) are compared in the context of the other plant species. When the AAs are ranked from under-utilization to over-utilization, the divergence between this mangrove-non-mangrove pair is striking. In Rhizophora, two groups of AAs are found under-or overutilized as shown on the opposite ends of Fig. 3a. In the more extreme cases, P. trichocarpa deviates from the mean in the opposite direction to R. apiculata (Asn, Arg and Ala). Overall, AA usage in R. apiculata deviates from the norm across the entire proteome.
It is most striking that other mangroves show the same trend in AA usage. In fact, R. apiculata has the least deviant AA usage among the three mangrove taxa that include Sonneratia alba and Avicennia marina (He et al., unpublished data). These two lineages, outside of the phylogeny in Fig. 2a, represent independent evolutionary events of mangrove emergence.
Given the unusual AA usage, we ask whether non-synonymous nucleotide substitutions might be more frequent in mangroves than in their relatives. This can be measured by the dN/dS ratio (ω) where dN and dS are, respectively, the number of nonsynonymous and synonymous substitutions per site. In Fig. 3b, ω values along all lineages are calculated using the PAML suite of programs [22]. It shows that ω is elevated in the Rhizophoreae clade relative to its inland relatives. The ω values on mangrove branches are generally larger than 0.25, whereas they are typically smaller than 0.2 in non-mangrove lineages (Fig. 3b). Although a high ω ratio is indicative of selection, it is usually attributable to the relaxation of negative selection against deleterious AA substitutions, unless ω > 1. In other words, stronger positive selection driving a higher ω ratio cannot be distinguished from weaker selection that also permits a higher ω. Only positive selection leads to adaptation.
To distinguish between positive selection and relaxation of negative selection, we scanned the genome using the 'branch-site method' in PAML, which applies a likelihood ratio test to compare models that permit or forbid ω >1 [34]. We identified 209 genes that harbor codons with ω > 1. Of these, 19 are implicated in embryonic development of A. thaliana (Supplementary Table 15, available as Supplementary Data at NSR online). Three of these, EMB88 (embryo defective 88), EMB2768 and EMB3137, will be used in further analyses below.
Whole groups of genes may also show the sign of adaptive evolution. The average dN/dS ratios among genes grouped by ontology are given in Fig. 3c. Eight categories show a greater than twofold increase in R. apiculata over its closest relative, Ca. brachiata (see Supplementary Table 16, available as Supplementary Data at NSR online, for detailed analyses), notably 'cell redox homeostasis' and 'cellular homeostasis'. Because various stressors in the intertidal zone can break cellular homeostasis, especially redox homeostasis, the rapid evolution of these genes in R. apiculata deserves future investigation.
The unusual AA usage and high rate of nonsynonymous changes can be observed in greater detail when we examine AA substitutions between R. apiculata and R. mangle. Previous studies have shown that AA substitutions among broad taxa follow a common, or nearly universal, pattern in which certain pairs of amino acids are rarely exchanged, even though their codons differ by only one bp [35,36]. Interestingly, these infrequent AA substitutions are unusually common between R. apiculata and R. mangle. Such patterns of AA substitution are also observed between closely related species from the Avicennia and Sonneratia genera (He et al., unpublished data). Thus, the dynamic pattern of AA substitutions, like the static pattern of AA usage reported in Fig. 3, may be quite general among mangroves.

SPECIALIZED ADAPTIVE TRAITS: VIVIPARY AND THE TANNIN CONTENT (THE RED BARK)
Two traits are particularly common in mangroves and rare in other woody plants: vivipary and the reddish bark. Vivipary broadly means the ability of embryos to germinate while still attached to the parent (Fig. 4a). Previous works have suggested that viviparous embryos are protected from high salinity and other stresses during early development [2,37]. To identify candidate genes for this trait, we use a branch-site model implemented in PAML (modified A) [22], which focuses on a pre-determined set of genes to detect adaptive signals on a specific branch of phylogeny. We chose 255 genes from our orthologous gene set that are also found in the SeedGenes database [38]. These loci have been empirically confirmed to play a role in embryonic development in Arabidopsis. We focused on the branch spanning the interval of 47.8-54.6 Myr before present (Fig. 2a), which represents the most recent common ancestor of Rhizophoreae.
Five genes are identified by the branch-site test (Supplementary Table 17, available as Supplementary Data at NSR online). The most dramatic gene is SAE2 (SUMO-activating enzyme 2), which carries 11 Rhizophoreae-specific AA substitutions (Fig. 4b) [39]. Seven of these changes show signs of positive selection, including a site predicted to be functionally critical (predicted by PROVEAN [40]    The second specialized trait in Rhizophoreae mangroves is the characteristic red bark that earns mangroves the nickname of 'red trees' in some languages. The red color is caused by a high concentration of tannins, a collection of flavonoid polymers, in several tissues (Fig. 4c) [42,43]. Tannin and other polyphenols have antioxidant activities and play a role in photoprotection, pathogen and herbivory resistance as well as salt tolerance [43][44][45]  The changes in N e are inferred using the PSMC method [48], which relies on the varying level of genetic diversity in different DNA segments across the genome as a basis for the inference of historical N e changes. In this graph, the generation time (g) is set to 20 years and the mutation rate (μ) is 1.6 × 10 −8 /bp/generation. Historical sea-level fluctuations are plotted for comparison (blue line) [50].

A S E S T P T P I M N A S E R T A M P I M N A G E R S A T P I M N A K E N T S P P I M N A R E N S S P P I M N S T K I A L E A V L I S T K I A L E A V L I S T K I A L E A V L I S T K I A L E A V L I
Vivipary and tannin concentration are only two of the most conspicuous traits in mangroves. When recent expansions of gene families are analysed, it is evident that many have evolved to cope with the various stresses in these inhospitable environments. Using a maximum-likelihood method implemented in the CAFE software [46], we identify 112 gene families that have expanded in R. apiculata during recent evolution ( Supplementary Figs 20 and 21, available as Supplementary Data at NSR online). The expanded families are enriched mostly for pathways involved in plant-pathogen interaction and biosynthesis of secondary metabolites (Supplementary Table 20, available as Supplementary Data at NSR online). We also find 2963 (11% of the total) R. apiculata genes that have been duplicated in tandem. Many of these genes are in the category of 'response to chemical stimulus' (Supplementary  Table 21, available as Supplementary Data at NSR online).

DISCUSSION
The extensive set of high-quality genomic sequences provides a glimpse into the emergence, diversification and adaptation of Rhizophoreae, the largest monophyletic group of mangroves.
Rhizophoreae mangroves originated from inland ancestors about 50 Myr ago during PETM and after a WGD event. We suggest that the WGD event provided the genetic material and the PETM provided the suitable ecological conditions for this emer-gence. The genome-wide AA usage modification and acceleration of substitution rates, together with positive selection on AA sequence or copy number of genes underlying specific traits, contributed to the adaptation and diversification of this most successful mangrove clade.
The greater significance of the genomic sequences of mangroves lies in the future research possibilities. In addition to viviparous embryo and high tannin content, Rhizophoreae mangroves have other specialized traits, such as the aerial roots and cuticular waxes, the molecular bases of which have never been investigated. At the population level, Rhizophoreae is much more prone to undergo speciation than all other mangroves. This tendency could be due to both its genetic architecture and ecological conditions [47,48]. Furthermore, the independent evolution of mangroves makes them ideal candidates for studying convergent evolution. A recent analysis [41] provides a glimpse of the possible extent of molecular convergence among mangrove clades.
Despite their prominent global presence on the tropical coasts, mangroves should not be considered abundant in the genetic sense. All four Rhizophora species have extremely low genetic diversity: 3.1-5.5 × 10 −4 per bp (Supplementary Note, available as Supplementary Data at NSR online). For a confirmation, we use the pairwise sequentially Markovian coalescent analysis (PSMC) [49] to infer the recent demographic histories of the Rhizophora species based on their whole-genome sequences. The model suggests a decrease in effective RESEARCH ARTICLE population size (N e ) starting about 100 000 years ago. Interestingly, this drop coincides with a dramatic change in global sea level ( Fig. 5; Supplementary Fig. 22, available as Supplementary Data at NSR online). Although sea levels started to rise in the last 20 000 years, and mangrove populations expanded to colonize the newly available habitat, genetic diversity has yet to recover from the earlier reduction in effective population size. This trend is observable across all species.
While mangroves and the tropical intertidal ecosystems appear vibrant at present, genetic data suggest that this may not have been the case in recent geological times. Sea-level changes may have taken their toll until recently, when the levels became relatively stable in the last 7000 years. With sea levels projected to rise, mangrove populations could conceivably recede to levels even lower than those indicated by their low extant genetic diversity, especially when we factor in human-driven disturbance of their habitats.
The analyses and research resources provided by this study are significant because they will enable modern evolutionary, ecological and genomic research to expand to mangroves. The transition from inland to intertidal zones is an important model of adaptation and species proliferation. The genomewide changes in AA usage are but one example of adaptation in this transition. Further active research on mangroves will also be crucial for the understanding and appreciation of the tropical coastal ecosystems anchored by these 'red trees'. Since a large fraction of Earth's human population lives near them, a sense of urgency should be very appropriate.

ONLINE METHODS
Genome sequencing and assembly. Tissues from one mature individual of Rhizophora apiculata. (Qinglan Harbor, Hainan, China (19 • 37 N, 110 • 48 E)) were collected for DNA extraction. Genomic DNA was extracted from leaves using the CTAB method [51] and total RNA was extracted from leaves, roots, flowers and stems using the modified CTAB method [52]. Short-reads libraries were constructed following the TruSeq DNA Sample Preparation Guide. Ten libraries with different DNA fragment sizes were sequenced using Illumina Hiseq 2000 platform. 20 kb Single Molecule Real Time (SMRT) long-read library were prepared following PacBio SMRTbell 20 kb Template Preparation BluePippin Size Selection protocal and were sequenced using Biosciences RS II platform. The sequencing data of R. mangle, R. mucronata and R. stylosa were produced in the same way as the short-reads libraries. The transcriptome of R. apiculata and other five species in the Rhizophoraceae family (Kandelia obovata, Bruguiera gymnorrhiza, Ceriops tagal, Pellacalyx yunnanensis and Carallia brachiata) was sequenced on the Illumina Hiseq 2000 platform with insert size of 300 bp.
Before assembling, PCR duplication, adaptor contamination and low-quality reads were filtered out. The SMRT long reads and Illumina short reads were combined to assemble a draft genome. The de novo assembled genome based on the SMRT long reads was produced using four programs: falcon (https://github.com/PacificBiosciences/ FALCON/), DBG2OLC [53], smartdenovo (https://github.com/ruanjue/smartdenovo) and wtdbg (https://github.com/ruanjue/wtdbg). The result obtained with smartdenovo was used as the final assembly because of its superior quality. Genome polishing was performed using Quiver [54] to further improve site-specific consensus accuracy. Illumina reads were then mapped to the polished genome assembly by BWA [55]. SNPs as well as small indels were called and corrected by SAMTOOLS [56] and in-house scripts. Finally, gap-filling were performed on the scaffolds with SSPACE 3.0 [57] using 10 Kb mate-pair sequences with the key parameters set as: -x 1 -m 50 -o 10 -z 200 -p 1.
The sequences of the transcriptome of R. apiculata, 458 core eukaryotic genes (CEGMA) [16] and 79 randomly selected genes from our previous work were used to evaluate the genome coverage and structural accuracy of the genome assembly (Supplementary Note).
The three re-sequenced congeneric genomes were mapped to the de novo assembled R. apiculata genome for comparison. Transcriptomes of the other five species (K. obovata, B. gymnorrhiza, Ce. tagal, Pe. yunnanensis and Ca. brachiata) were assembled and annotated using a common procedure [14] (see also in Supplementary  [63] algorithms were used to predict protein coding genes ab initio. Thirdly, RNA-seq reads were mapped to the genome using Tophat (version v2.1.1) [64], and gene models from spliced transcripts were identified using cufflinks (version v2.2.1) [65]. Finally, the three sets of predicted genes were combined using EVidenceModeler (EVM) [66] to generate a weighted and non-redundant consensus set of gene structures.
To annotate the functions of genes, coding sequences were aligned against the SwissProt, TrEMBL [67] and NCBI non-redundant protein databases using BLAST (v2.2.6) with an e-value threshold of 1 × 10 −5 . Gene Ontology (GO) annotation was obtained by aligning against the Pfam database [68] using HMMER2GO (https://github.com/sestaton/HMMER2GO). The protein sequences were also searched against the KEGG database [69] for KO (KEGG Orthology) assignments and pathway annotation.

Phylogenetic analyses and time dating.
The de novo genome of R. apiculata, the short-read sequences of R. mangle, R. mucronata and R. stylosa, the published genomes of P. trichocarpa and Ri. communis and the transcriptome data from five other species of Rhizophoraceae were used to reconstruct phylogenetic trees as well as estimate divergence times.
Orthologous genes were identified using the Or-thoMCL software [70]. Phylogenetic trees were built using PhyML [20]. The species-divergent times were estimated using the program MCMC-TREE from the PAML 4.8 package [22] with the HKY85+gamma model, assuming an independent rate for each branch.

Collinearity analysis.
To detect the signature of whole-genome duplication, self-alignment was performed on protein sequences of R. apiculata using BLASTp (with an e-value cutoff of 1 × 10 −5 , identity ≥40%), followed by identification of syntenic blocks using MCScanX [19]. Collinear blocks having at least five paired homologous genes were accepted as duplicated blocks in this study. Genome distribution of the collinear blocks was visualized using the Circos software (v0.65) [71]. The time of WGD events was estimated following methods described in the Supplementary Note, available as Supplementary Data at NSR online.
Gene family analysis. OrthoMCL software [70] was used to identify orthologous and paralogous groups of genes from four genomes (R. apiculata, A. thaliana, Ri. communis and P. trichocarpa). For genes with alternative splicing, the longest transcript was selected to represent the gene. All proteins from these four species were merged to perform an all-vs.all alignment using BLASTp with an e-value cutoff of 1 × 10 −10 . The alignments were fed into a standalone OrthoMCL program with the default MCL inflation parameter (2.0). In the next step, CAFE [46] took the gene family sizes as input and used a stochastic birth and death process to model the evolution of gene family sizes across a given phylogenetic tree and detected expanded or contracted gene families with P-values < 0.05.
Heterozygosity and demographic history. Using the aligner bowtie2 [72], clean short reads from R. apiculata (insert size: 200 bp, 300 bp, 400 bp and 600 bp), R. mangle (insert size: 300 bp), R. mucronata (insert size: 300 bp) and R. stylosa (insert size: 300 bp) were mapped to the assembled reference genome to identify the single nucleotide polymorphism sites (SNPs). Several filters were applied to ensure the accuracy of SNP calling: (i) removing potential PCR-duplicated, single-end mapped and improperly paired mapped reads; (ii) only sites having adequate sequencing depth (20-200× for R. apiculata, 15-80× for R. mangle, R. mucronata and R. stylosa) were used; (iii) the called heterozygous sites had to have minor allele frequency larger than 0.15. More than 99.9% of heterozygous sites were retained according to the binomial function, assuming that the two alleles are equally sequenced, which indicated a good quality of the SNP data set. Heterozygosity was estimated as the number of identified heterozygous sites divided by the total number of sites meeting our depth criteria.

SUPPLEMENTARY DATA
Supplementary Data are available at NSR online.

ACKNOWLEDGEMENT
We thank X. He, J. Lu, J. Liu and F. He for insightful comments.