Chromosome-level genome assembly of the aquatic plant Nymphoides indica reveals transposable element bursts and NBS-LRR gene family expansion shedding light on its invasiveness

Abstract Nymphoides indica, an aquatic plant, is an invasive species that causes both ecological and economic damage in North America and elsewhere. However, the lack of genomic data of N. indica limits the in-depth analysis of this invasive species. Here, we report a chromosome-level genome assembly of nine pseudochromosomes of N. indica with a total size of ∼ 520 Mb. More than half of the N. indica genome consists of transposable elements (TEs), and a higher density of TEs around genes may play a significant role in response to an ever-changing environment by regulating the nearby gene. Additionally, our analysis revealed that N. indica only experienced a gamma (γ) whole-genome triplication event. Functional enrichment of the N. indica-specific and expanded gene families highlighted genes involved in the responses to hypoxia and plant–pathogen interactions, which may strengthen the ability to adapt to external challenges and improve ecological fitness. Furthermore, we identified 160 members of the nucleotide-binding site and leucine-rich repeat gene family, which may be linked to the defence response. Collectively, the high-quality N. indica genome reported here opens a novel avenue to understand the evolution and rapid invasion of Nymphoides spp.


Introduction
Nymphoides indica (L.) Kuntze (Menyanthaceae) is a perennial floating-leaved aquatic plant (water snowflake) and widely distributed in pantropical regions, such as tropical America, Asia, and Australia. [1][2][3] It is a heterostylous self-incompatible species that can reproduce sexually by seeds, asexually by producing a type of turion in the roots that can easily separate from the parent plant, or by adventitious plantlet growth. 4,5 The plant grows mainly in lakes, ponds, rice fields, and slow-moving rivers. Due to human activities, the local populations of N. indica in Japan have been greatly reduced and it is now listed as an endangered species in the country. 5 However, it has been intentionally introduced in many countries or regions as an ornamental plant in waters where it escapes into the wild and rapidly expands. 6 In Florida, Texas, and North Carolina in the USA, N. indica has been found in numerous water bodies and listed as a noxious weed. [6][7][8] Similar to N. indica, several other species in the genus Nymphoides, such as N. cristata (crested floating heart) and N. peltata (yellow floating heart), also have wide distribution ranges and are regarded as noxious weeds in some regions and countries. [9][10][11] Owing to their welldefined vegetative propagation capability, 10 these species can rapidly spread into various waters and form dense floating canopies covering the water surface. This can significantly reduce light penetration into water and the dissolved oxygen in water, causing serious damage to aquatic ecosystems and further reducing native standing plant biomass and biodiversity. 12,13 However, little is known about the mechanisms underlying this rapid invasion.
Studying the genetic basis of weeds/invasive species enables the identification of the factors responsible for their successful range expansion. Whole-genome sequencing offers opportunities to link the environment with innovation. For instance, by analysing the ancient whole-genome duplications (WGDs) of flowering plants, many studies show that WGDs are common and coincide with several paleoclimate change events. 14,15 The expansion of genetic material from the genomic duplication may result in key innovations and phenotypic novelty, which could promote adaptation and enhance survival. [16][17][18] By studying transposable elements (TEs) at the whole genomic scale, researchers found that TEs may influence regulatory networks and provide a mechanism for genome evolution and adaptation. 19,20 In addition, several population genomic studies have confirmed that in the process of plant invasions, TEs can produce genetic variation and generate rapid phenotypic variations to facilitate adaptation. 21,22 By investigating the expansions and contractions of gene families in plant genomes, such as the nucleotide-binding site and leucine-rich repeat (NBS-LRR) gene family, which includes the majority of disease-resistance genes in plants, previous studies have found that the rapid expansion and/or contraction of the NBS-LRR gene family plays a significant role in quick adaptation to challenges from novel pathogens. 23,24 To date, only a few weedy/invasive plant species genomes have been sequenced (e.g. Conyza canadensis; 25 Mikania micrantha; 26 Microstegium vimineum ;27 and Phragmites australis 28 ) no genome has been reported for the family Menyanthaceae (Asterales). Our understanding of the adaptations of the weeds or widely distributed plant species is still limited, especially for aquatic plants that include a large number of globally notorious invasive species.
In this study, we present a high-quality chromosome-level genome assembly of N. indica using a combination of PacBio sequencing, Illumina sequencing, and HiC technology. Comparative genomic analyses were conducted to investigate WGDs and gene family expansions and contractions, with an emphasis on the NBS-LRR gene family. In addition, we explored the effect of TEs on the rapid invasion of N. indica. The results provide vital insights into the quick dispersal of aquatic plants to diverse aquatic environments and their potential invasiveness.

Genome sequencing
Wild N. indica samples used for sequencing were collected from Haikou, Hainan Province, China and planted in a greenhouse at the Wuhan Botanical Garden. The leaves of N. indica (2n ¼ 18) were collected for de novo genome sequencing. 29 High-quality genomic DNA was extracted from fresh leaves using a modified CTAB method and sequenced on the Illumina HiSeq X Ten (Illumina Inc., San Diego, CA, USA) and PacBio Sequel (Pacific Biosciences of California, Menlo Park, CA, USA) platforms.
RNA data were extracted separately from the four tissues (root, stem, leaf, and flower) using the NEBNext V R Ultra RNA Library Prep Kit for Illumina V R (NEB, USA). According to the manufacturer's instructions, each RNA-seq library was purified on a cBot Cluster Generation System using TruSeq PE Cluster Kit v3-cBot-HS (Illumina), and PE150 sequencing was conducted on an Illumina NovaSeq 6000 platform. Quality control of primitive RNA-seq reads was performed using FastQC v0.11.7 (http://www.bioinformatics. babraham.ac.uk/projects/fastqc/, 10 January 2022, date last accessed), and poor-quality reads were trimmed using Trimmomatic v0.36. 30 Briefly, the Hi-C libraries of N. indica were constructed using the following steps: (i) fresh leaves were cut and fixed in formaldehyde for crosslinking, chromatin digestion with the four-cutter restriction enzyme MboI, (ii) DNA ligation with the T4 DNA ligation enzyme, and (iii) purification and fragmentation (the purified DNA was sheared to a length of $ 400 bp). Finally, the Hi-C libraries were sequenced on the Illumina HiSeq X Ten platform (San Diego, CA, USA) in PE150 mode.

Genome size evaluation and genome assembly
We estimated the genome size of N. indica using the k-mer frequency analysis. Illumina clean reads were used to calculate k-mer distribution by Jellyfish v2.2.3 31 and then to estimate the genome size and the level of heterozygosity of the genome using GCE v1.0.2. 32 For genome assembly, quality-controlled PacBio long reads were corrected using the NextCorrect module of NextDenovo v2.1 (https://github.com/Nextomics/NextDenovo, 10 January 2022, date last accessed) with default parameters. Corrected reads were trimmed and assembled using Canu software v1.8 33 with the Corrected-Error-Rate parameter set at 0.035. Subsequently, the draft assembly was polished using NextPolish v1.3.1. 34 All PacBio long reads and Illumina short reads were used in all three genome editing iterations for genome polishing. We then performed Purge Haplotigs v1.1.1 35 to remove redundant sequences and obtain the final contiglevel assembly.
In this study, we used Hi-C reads to construct chromosome-level assemblies. The original Hi-C sequenced data were filtered using Hic-Pro v2.11.1. 36 Then, the clean data were compared with the genome using Juicer v1.6. 37 Contig ordering, orientation, and chimera splitting was conducted using the 3D-DNA pipeline 38 under default parameters. Contig misassemblies and scaffold misjoins were manually detected and corrected based on interaction densities from visualization in Juicebox. 39 Finally, we evaluated the integrity and accuracy of the genome assembly using Benchmarking Universal Single-Copy Orthologs (BUSCO; v4.0.1) 40 in the Embryo phyta_odb10 database.

Genome annotation
We performed RepeatMasker v4.1.0 (http://www.repeatmasker.org, 10 January 2022, date last accessed) to identify and classify the repetitive elements in the N. indica genome using the de novo repeat library and the Repbase database. A de novo repeat library for N. indica was built using RepeatModeler v2.0.1 (http://www.repeat masker.org, 10 January 2022, date last accessed) and the Repbase 41 database was downloaded from http://www.girinst.org/repbase/) (10 January 2022, date last accessed. Three different procedures were used to predict protein-coding genes in the N. indica genome assembly: (i) AUGUSTUS v3.3.3 42 was used to perform de novo prediction; (ii) for homology prediction, the homologous sequences of three Asteraceae species, Carthamus tinctorius, Chrysanthemum nankingense, and Cynara cardunculus were aligned against the N. indica genome using TBLASTN with an e-value cutoff of 1eÀ5, and the gene structure was predicted using GeneWise v2.4.1; 43 (iii) Trinity v2.11.0 44 was conducted to de novo assemble all RNA-seq data for transcriptome-based prediction. The assembled transcripts were applied to PASA v2.4.1 45 and TransDecoder v5.5.0 (https:// github.com/TransDecoder/TransDecoder, 10 January 2022, date last accessed) to identify protein-coding regions. Finally, the gene models identified by each method were integrated using EvidenceModeler v1.1.1 46 and then updated using PASA v2.4.1. 45 For the functional annotation of genes, all predicted proteincoding genes were mapped to the Swiss-Prot and NR databases using BLASTP searches with an e-value cutoff of 1eÀ5 and aligned to the eggNOG database v5.0 47 using eggNOG-mapper v2. 48 InterProScan v5.47 49 was used to identify the functional domains and motifs of the gene set with default parameters. We also used PlantRegMap 50 to identify transcription factors (TFs) in the N. indica genome.

TE identification
We used EDTA v1.9.7, 51 an automated whole-genome de novo TE annotation software, for de novo prediction of TEs in N. indica. The Perl script 'solo_finder.pl' within the LTR_retriever v2.9.0 52 package was used to identify solo-LTRs. In short, an independent LTR sequence located in regions without a similar LTR sequence (presented at a 300 bp distance) that covered at least 80% of the library LTR entry with an alignment score >300 and length >100 bp were identified as solo LTR. To estimate the removal ratio of long terminal repeat retrotransposons (LTR-RTs), the ratio of solo LTRs to intact LTRs (S/I) was calculated based on the number of solo LTR over the number of intact LTR.
The insertion time of the LTR-RTs was estimated using the formula T ¼ K/2r, where the substitution rate of r ¼ 7 Â 10 À9 per site per year 53 was adopted, and the sequence divergence (K) was estimated using the Kimura two-parameter model.
In addition, we compared the distribution of TEs in 100-bp bins in the 2 kb upstream and downstream regions of genes between the invasive species N. indica, C. tinctorius, C. cardunculus, C. nankingense, and M. micrantha, and non-invasive species Coffea canephora and Vitis vinifera.

Whole-genome duplication
We performed a comparison between N. indica and two other species (V. vinifera and C. canephora) to detect WGD events in the N. indica genome. Paralogous and orthologous gene pairs were searched via all-versus-all protein sequence comparisons using BLASTP with an e-value cutoff of 1e-5. Intragenomic and intergenomic syntenic blocks were searched for using MCScanX 63 with default parameters. The Perl script 'add_ka_and_ks_to_collinearity.pl' from the MCScanX package was employed to calculate synonymous substitutions per synonymous site (Ks) in N. indica, V. vinifera, and C. canephora. The fitting Ks distribution was plotted using WGDI v0.4.7. 64 To further confirm the WGD events, synteny patterns were analysed between N. indica, V. vinifera, and C. canephora based on the intergenomic synteny regions using MCscan (Python-version, https://github.com/tanghaibao/jcvi/wiki/MCscan , 10 January 2022, date last accessed).

Analysis of NBS-LRR genes
All nucleotide-binding site leucine-rich repeat (NBS-LRR) protein sequences of A. thaliana were downloaded from RefPlantNLR 65 and the NBS (NB-ARC) domain (PF00931) was retrieved from the Pfam database (http://pfam.xfam.org/ , 10 January 2022, date last accessed ). First, the NBS-LRR protein sequences of A. thaliana were used as queries to search for potential NBS-LRR genes in the N. indica genome using BLASTP (e-value ¼ 1eÀ10). Then, based on the HMM model of NBS, HMMER v3.2.1 (http://hmmer.org/ , 10 January 2022, date last accessed) was used to identify NBS-LRR candidate genes. All candidate NBS-encoding genes from the above methods were integrated and submitted to the HMMER website (https://www. ebi.ac.uk/Tools/hmmer/search/hmmscan , 10 January 2022, date last accessed) to search the TIR, RPW8, CCX, NBS, and LRR domains.
All protein sequences of A. thaliana and N. indica were aligned to the NB-ARC1_ARC2 HMM 66 of the NB-ARC domain using the hmmalign script from the HMMER v3.2.1 package (http://hmmer. org/ , 10 January 2022, date last accessed). Then we performed Belvu 67 to trim the alignment to the NBS-ARC domain region and removed columns and sequences with over 95% gaps. Then, we manually verified the NB-ARC domain of the sequence alignments by Jalview. 68 Finally, A maximum-likelihood phylogenetic tree was constructed using RAxML v8.2.12 69 based on the JTT model for amino acid substitution, with other parameters following Kourelis et al. 65 The gene cluster was defined with the criteria that in the range of up to 200 kb, non-NBS-LRR genes were restricted to eight within the cluster. 70 TBtools v1.098 71 was used to visualize the conserved domain of N. indica NBS-LRR proteins identified by MEME (http://meme-suite.org/tools/meme , 10 January 2022, date last accessed) (the maximum number of motifs set to 20) and to display the physical location of NBS genes in the N. indica genome.
All NBS protein sequences from N. indica were used to construct a phylogenetic tree following the method illustrated above. RNA-seq data from different tissues of N. indica were mapped back to the N. indica genome using Hisat2 v2.0.0 72 with default settings. Then, the value of FPKM (fragments per kilobase of exon per million fragments) was calculated by StringTie v1.3.3 73 and standardized using Log 2 FPKM to quantify the expression level of NBS-LRR genes for each tissue. A heatmap of the gene expression profile was then drawn using the HeatMap tool of TBtools v1.098. 71 3. Results and discussion

Genome assembly and annotation
We first generated $ 72.88 Gb Illumina paired-end reads (150 bp) (Supplementary Table S2). Based on the 17 k-mer analysis, the N. indica genome was estimated to be 620 Mb in size with a high heterozygous ratio of 1.10% (Supplementary Fig. S1). In addition, $ 141 Gb of PacBio sequencing subreads and $ 93 Gb of Hi-C reads were generated for draft genome assembly. They were corrected, trimmed, and assembled to preliminarily produce a 1.01 Gb genome that contained 1,891 contigs with a contig N50 of 2.39 Mb. After removing redundancy and polishing, we obtained a 520 Mb draft genome containing 283 contigs, with a contig N50 size of 7.17 Mb. We then anchored these contigs to nine pseudochromosomes using the Hi-C data, and 460 Mb (83.87%) of contigs were clustered and oriented into nine pseudochromosomes with sizes ranging from 66.87 Mb to 40.70 Mb (Supplementary Fig. S2 and Supplementary Table S3). The final chromosome-level assembly was 520 Mb in length with a scaffold N50 of 49.18 Mb, constituting 89.65% of the predicted genome size (Table 1). A total of 94.8% of the complete BUSCOs were found in the final assembly (Supplementary Table S4) and the Hi-C interactions plot shows a matrix with a clear plaid pattern indicating that the N. indica genome is nearly complete and of high quality (Supplementary Fig. S2).
By combining de novo, homology-based, and transcriptomebased approaches, a total of 29,983 protein-coding genes were predicted in the assembled genome with an average gene length of 6,473 bp (Supplementary Table S5). In total, 24,871 (82.9%) coding proteins were assigned functions by searching the eggNOG, InterPro, NR, and Swiss-Prot databases (Supplementary Table S6). Among these genes, 1,275 TFs belonging to 57 families were identified (Supplementary Table S7).

Repetitive content and recent LTR-RT expansion
The overall integrity and accuracy of the N. indica genome have enabled a comprehensive analysis of TEs. By integrating homologybased and de novo approaches, we found that $ 61.45% of the assemblies were repetitive elements, and more than half (52.29%) of the N. indica genome consisted of TEs (Supplementary Table S8). Approximately 65% of TEs are LTR-RTs. Notably, the most abundant LTR-RTs family was Gypsy, occupying 41.36% of all LTR elements, followed by Copia at 24.31% (Supplementary Table S8). Most LTR-RTs accumulate around centromeres and pericentromeric regions in N. indica, which may reduce the impact directly compared with insertion into genes (Fig. 1). TEs have been confirmed to escape silencing and to experience a quick outburst. 74 Previous studies have confirmed that TE insertions could produce new genetic variants and then promote speciation. [75][76][77] Such as in Oryza, the rapid and recent evolution of LTR retrotransposons has been reported to drive the speciation and diversification of rice. They have led to the variation and evolution of the rice genome structure and played an important role in influencing the expression and functional differentiation of rice functional genes during the formation and diversity of rice species. 78 Also, in the rye genome, the increased TE bursts contribute to the increase of transposed duplicated genes, such as starch pathway synthesis-related genes, in which new functional differentiation of genes has occurred, suggesting that TE-driven gene duplication can provide new resources for the generation of new functional differentiation of genes, and thus provide the necessary basis for the adaptive evolution and speciation. 79 In addition, in Tibetan peach, SINE-type TEs were expanded and may play a crucial role in adapting to highplateau environments and speciation. 80 In conclusion, we here observed that LTR-RTs experienced a significant burst in the past 1 Mya (Supplementary Fig. S3), corresponding to the speciation time of N. indica. We inferred that TE mobilization during the burst might contribute to the speciation of N. indica.
However, as a ubiquitous component of eukaryotic genomes, the burst of LTR-RTs is an important force for plant genome expansion, and the removal of LTR-RTs can also cause rapid genome shrinkage. [81][82][83] Unequal recombination (UR) to excise TEs is the general method for genome contraction, yielding solo-LTRs that reflect the rate of TE removal in a genome. 83,84 In N. indica, relatively high removal rates 3.24 (S: 31,025, I: 9,565) were observed in N. indica, compared with C. tinctorius (S/I: 2.917; S: 78,664; I: 26,961) and C. canephora (S/I: 2.66; S: 13,800; I: 4,529). These results imply that the recent expansion of LTR-RTs in N. indica failed to facilitate genome size evolution. Moreover, the higher UR rate in N. indica may lead to genome downsizing.
As an invasive species, rapid adaptation to a novel habitat is a critical prerequisite for successful invasion. 21,85 TE insertions near genes influence the expression of neighbouring genes in response to environmental changes. 86,87 For example, a Copia-type retrotransposon named ONSEN confers heat-mediated activation to nearby genes. 88,89 Similarly, the tomato Copia LTR retrotransposon family Rider may confer drought-responsive expression to neighbouring genes. 90 We explored whether the density of TEs located near genes differs between invasive and non-invasive plants. Our comparative analyses revealed that the density of TEs was higher in the upstream and downstream regions of genes in invasive species (N. indica, C. tinctorius, C. cardunculus, C. nankingense, and M. micrantha) than in non-invasive species (V. vinifera and C. canephora) (Fig. 2), suggesting that a relatively higher TE density may contribute to key adaptive variation in response to changing environments. Similar to previous reports, TEs are highly enriched in the wider-distribution species Capsella rubella compared with its outcrossing sister species Capsella grandiflora, suggesting that TEs can potentially contribute to adapting to novel environments that experience a genetic bottleneck. 22 Additionally, LTRs can also confer stress-inducible expression patterns on their flanking TFs, such as Ruby, an MYB transcriptional activator of anthocyanin production in Citrus sinensis, which is regulated by a nearby Copia-like retrotransposon under cold induction. 91 In N. indica, as the key regulator of abiotic stresses, APETALA2/ETHYLENE RESPONSIVE FACTOR (AP2/ERF) family TFs are located in LTR-rich regions compared with the other TFs, indicating the potential impact of LTR-RTs on neighbouring AP2/ERF TFs (Supplementary Fig. S4). It can be inferred that TEs might influence the expression of AP2/ERF TFs in response to environmental stress during the rapid adaptation of N. indica.

Genome evolution and expansion of gene families
A phylogenomic tree with estimated divergence times for the 14 species was constructed based on a concatenated sequence alignment of 230 common single-copy genes. Our phylogenetic results revealed a close relationship between N. indica and Asteraceae, which diverged from the most recent common ancestor of C. cardunculus and H. annuus at $ 68 Mya, after separating V. vinifera at $ 91.5 Mya (Fig. 3). Based on sequence homology, a total of 389,445 (88.3%) genes were clustered into 29,220 orthogroups, 11,678 of which were species-specific orthogroups and 6,885 gene families were shared by all 14 species. In N. indica, 27,286 protein-coding genes were assigned to 12,600 families and 926 gene families containing 4,249 genes were N. indica-specific (Supplementary Table S9). GO and KEGG enrichment analysis showed that these unique gene families were enriched for GO terms, such as 'GO : 0009553, embryo sac development', 'GO : 0000741, karyogamy' and 'GO : 0004869, cysteine-type endopeptidase inhibitor activity' and KEGG categories such as 'ko00750, Vitamin B6 metabolism' and 'ko01040, Biosynthesis of unsaturated fatty acids' (Supplementary Tables S10 and S11). Comparative genomic analysis showed that 2,055 gene families were expanded in N. indica, whereas 4,777 were contracted. The expanded gene families of N. indica were significantly enriched in catalytic activity, brassinosteroid biosynthesis, and plant-pathogen interactions (Supplementary Tables S12 and S13). In addition, rapidly expanded gene families showed high enrichment in response to hypoxia, decreased oxygen levels, and plant-pathogen interactions (Supplementary Tables S14 and S15). We observed that the process of plant-pathogen interaction was enriched in rapidly expanded gene families of N. indica (Supplementary Table S15). This may explain why N. indica adapts quickly to diverse environments and is widely distributed in neotropical regions. 2 To further understand the paleohistory of the N. indica genome, we analysed WGD events. Dot plots of orthologs between N. indica and V. vinifera showed a 3-3 orthology ratio (Fig. 3), indicating a similar evolutionary history between the genomes of N. indica and V. vinifera. The Ks distribution of the intragenomic paralogs in the N. indica genome showed a single peak at Ks values of $ 0.98, which was close to the peak of V. vinifera (1.21) and C. canephora (1.08) (Fig. 3). This clearly indicated that these three species may have experienced a shared WGD event in their common ancestor. Given that the V. vinifera and C. canephora genomes have been confirmed to experience only an ancient whole-genome triplication (WGT) event, 92,93 N. indica likely experienced this WGT event shared by the recent common ancestor of core eudicots. 94 Furthermore, intragenomic syntenic analysis of N. indica (1-3 chromosomal relationships) revealed a WGT event ( Supplementary Fig. S5). We also compared the syntenic depth ratios of N. indica, V. vinifera, and C. canephora. All paired genome comparisons detected an identical 2:2 syntenic depth ratio (Supplementary Figs. S6-S8). The evidence provided here is sufficient to prove that the N. indica genome experienced only the gamma (c) WGT event shared by all core eudicots.
WGDs result in the complete doubling of the genome and produce abundant duplicated genes to facilitate adaptation to changing environments. 95 We performed GO and KEGG enrichment analyses of the duplicated genes generated by WGT-c. These WGT-related gene pairs were significantly enriched in some processes that are essential for growth and development, such as nucleobase-containing compound biosynthetic process, regulation of nucleic acid-templated transcription, carbon fixation in photosynthetic organisms, and regulation of gene expression and biosynthetic process (Supplementary Tables S16 and S17). This is consistent with previous reports on M. micrantha; only genes that are crucial for survival were retained during the evolution procedure, whereas redundant genes were lost. 26 Moreover, in comparison with the ancestral eudicot karyotype genome, only 4,154 (13.85%) genes were identified in N. indica, lower than 6,828 (25.9%) genes in V. vinifera, and 13,932 (34.9%) in C. tinctorius. 96 Despite a lack of recent WGDs, 41% of the genes were duplicated through dispersed duplications. A substantial number of dispersed duplicates may have been generated after the core-eudicot c triplication, and they may contribute to phenotypic diversification, such as stress-resistance traits. 97 In the present study, a large number of dispersed duplicates were detected in the N. indica genome, suggesting multiple chromosome rearrangements that occurred after the WGT-c event, which may be a critical factor in the success of N. indica colonization of a novel environment.

Analysis of NBS-LRR gene family
RefPlantNLR 65 is an accurate and comprehensive database, including 481 NLRs grouped in 31 genera and 11 orders of angiosperms. In this database, NBS-LRR were divided into four clades, TIR-NLRs, CC R -NLRs, CC G10 -NLRs (containing autoactive coiled-coil domains; hereafter termed ANLs) and CC-NLRs. According to the RefPlantNLR database, 65 Tables S18 and S19). Similar to the previous study, the ANL subclass is a monophyletic group that is distinguished from the other CNL clades. 65,98 In addition, all NBS-LRR genes were unevenly distributed across all nine chromosomes ( Supplementary Fig. S11). Previous studies have revealed that a substantial number of NBS-LRR genes are clustered across chromosomes and are associated with tandem duplication. 99,100 In N. indica, 26 gene clusters containing 131 NBS-LRR genes were identified. Additionally, NBS-LRR genes located in the same cluster showed a close relationship in the phylogenetic tree, indicating that tandem duplication is responsible for NBS-LRR gene expansion in N. indica. The expression patterns of the NBS-LRR genes in the four tissues were significantly different in N. indica. The leaves had 44 NBS-LRR genes that exhibited higher expression levels than other tissues ( Fig. 4 and Supplementary Table 19), which can explain why leaves come into contact with water to resist pathogens. It has been reported that many aquatic plant genomes, such as S. polyrhiza, Z. marina, and Utricularia gibba, have convergently lost the majority of their NLR genes and well-known downstream immune signalling complex ENHANCED DISEASE SUSCEPTIBILITY 1 (EDS1)/PHYTOALEXIN DEFICIENT 4 (PAD4). 101 On the contrary, we identified several homologs of known immunity genes in the aquatic plant N. indica (Supplementary Table 20), such as EDS1, PAD4, and SENESCENCE ASSOCIATED GENE 101 (SAG101), which indicated that signalling components are likely conserved in N. indica and these genes were lost independently in different lineages. 101 In addition, our results well supported the view that the loss of EDS1/PAD4, might be limited to a few species with a low number of NLR genes. 101

Conclusions
We assembled and annotated a chromosome-scale reference genome of N. indica using PacBio, Illumina, and Hi-C sequencing data, which will provide an opportunity to explore the potential mechanisms underlying its invasiveness. Our study highlights the important role of TEs in the rapid invasion of N. indica in response to environmental change, colonization, and expansion as invasive species. Based on the gene family analysis, the expanded gene families were significantly enriched in the function of plant-pathogen interaction and rapidly expanded gene families were enriched in response to hypoxia, decreased oxygen levels, and plant-pathogen interaction, which may coincide with the adaptation of N. indica to the low oxygen level in semi-arid land as well as the eutrophic body of water. Furthermore, we identified the NBS-LRR gene family, which will be valuable in future evolutionary and functional studies. As the first sequenced species in Menyanthaceae, our results provide insights into the evolutionary origin and ecological control of Nymphoides.

Acknowledgments
We thank the two anonymous reviewers for providing valuable and critical comments on this study.

Data availability
All the data sets used in this study are available at the China National GeneBank DataBase (CNGBdb, https://db.cngb.org/) website under the accessions CNS0525744-CNS0525745 and CNS0548186-CNS0548187 with CNGB-Project ID CNP0002767.