Chromosome Level Genome Assembly and Annotation of Highly Invasive Japanese Stiltgrass (Microstegium vimineum)

Abstract The invasive Japanese stiltgrass (Microstegium vimineum) affects a wide range of ecosystems and threatens biodiversity across the eastern USA. However, the mechanisms underlying rapid adaptation, plasticity, and epigenetics in the invasive range are largely unknown. We present a chromosome-level assembly for M. vimineum to investigate genome dynamics, evolution, adaptation, and the genomics of phenotypic plasticity. We generated a 1.12-Gb genome with scaffold N50 length of 53.44 Mb respectively, taking a de novo assembly approach that combined PacBio and Dovetail Genomics Omni-C sequencing. The assembly contains 23 pseudochromosomes, representing 99.96% of the genome. BUSCO assessment indicated that 80.3% of Poales gene groups are present in the assembly. The genome is predicted to contain 39,604 protein-coding genes, of which 26,288 are functionally annotated. Furthermore, 66.68% of the genome is repetitive, of which unclassified (35.63%) and long-terminal repeat (LTR) retrotransposons (26.90%) are predominant. Similar to other grasses, Gypsy (41.07%) and Copia (32%) are the most abundant LTR-retrotransposon families. The majority of LTR-retrotransposons are derived from a significant expansion in the past 1–2 Myr, suggesting the presence of relatively young LTR-retrotransposon lineages. We find corroborating evidence from Ks plots for a stiltgrass-specific duplication event, distinct from the more ancient grass-specific duplication event. The assembly and annotation of M. vimineum will serve as an essential genomic resource facilitating studies of the invasion process, the history and consequences of polyploidy in grasses, and provides a crucial tool for natural resource managers.


Introduction
Invasive species cause billions of dollars in damage annually, and are considered the second greatest threat to native biodiversity after habitat loss (Pejchar and Mooney 2009;Simberloff 2013). Yet, genomic resources for invasive species are generally lacking relative to other economically important species such as crops, microbial pathogens, and many animal systems (McCartney et al. 2019). Almost half of the native species in the United States are at risk of extinction either due the direct effects of introduced species or impacts combined with other processes (Pimentel et al. 2005). Efforts to identify and eradicate newly introduced species are hampered by the lack of resources needed to predict how and why some species will become invasive. Genomics has become an increasingly valuable and cost-efficient tool to predict and diagnose invasions (Chown et al. 2015;Hamelin and Roe 2020). Genomics can provide novel insights on the roles of genetic variation, multiple introductions, admixture, introgression, and rapid adaptation (Schrader et al. 2014;Kreiner et al. 2019;Bertolotti et al. 2020;Olazcuaga et al. 2020;Yainna et al. 2020;Malinsky et al. 2021). For instance, a high-quality genome is useful for genome-wide scans of selection, trait association mapping, and timing invasion events (Nielsen et al. 2005;DeGiorgio et al. 2016;North et al. 2021). With improved understanding and forecasting at each stage of the invasion process, managers can make decisions on invasions much more accurately than in the past (Bergeron et al. 2019;Keriö et al. 2020). Hence, sequencing whole genomes for these nonmodel organisms provide crucial tools to efficiently manage and predict future invasions.
Japanese stiltgrass (Microstegium vimineum) is a shadetolerant, annual, C4 grass introduced to the eastern USA from Asia in the early 1900s that has spread to 30 US states and Canada. This species invades a range of habitats in the United States, displays a high degree of phenotypic plasticity, has a mixed mating system (outcrossing and self-fertilization), and exhibits prolific reproductive output with seeds being viable in the soil up to 5 years (Barden 1987;Redman 1995;Gibson et al. 2002;Nees 2016;Culpepper et al. 2018). Considerable research interest has been focused on unraveling potential links between ploidy levels and invasiveness, as most invasive plant species are polyploids (Pandit et al. 2011;te Beest et al. 2012). Japanese stiltgrass is an ideal system for the study of rapid adaptation of invasive species, being a putative polyploid in addition to the aforementioned features (2N ¼ 20 as opposed to the "base" 2N ¼ 10 among members of Andropogoneae; Watson and Dallwitz 1992).
Here, we present a high-quality, chromosome-level assembly, and annotation for M. vimineum, by integrating PacBio sequencing, Omni-C scaffolding, and RNAseq. The genome will lay groundwork for further investigation of traits allowing M. vimineum to adapt and thrive as an invasive species. Further, this genome will provide an important genomic resource for studies of rapid adaptation in invasive plants, help elucidate the history and consequences of polyploidy in grasses, and provide a tool for natural resource scientists and managers.

Genome Sequencing and Assembly
We generated a high-quality, chromosome-level genome assembly of M. vimineum using PacBio and Dovetail Omni-C libraries. Using approximately 60 Gb of PacBio long read data, we initially assembled 5,261 de novo contigs with N50 of 605 kb. In parallel, a total of 73.21 Gb (30Â coverage) of short read sequence data were produced by Illumina HiSeqX from Dovetail's Omni-C libraries to achieve chromosome-scale scaffolding. The initial assembly was significantly improved with Omni-C data using the HiRise pipeline ( fig. 1A), which produced a final assembly consisting of 462 scaffolds spanning 1.1 Gb in length, with the scaffold N50 size of 53 Mb (table 1). The final assembly covers 99.96% of 1.3 Gb genome size and interestingly, about 99.11% of assembled genome were anchored into 23 pseudochromosomes (size range 20.9-68.32 Mb), corresponding closely to the expected number of 20 chromosomes ( fig. 1A).

Repeat and Gene Annotation
Over half of the genome is composed of repetitive elements (66.68%, 745.92 Mb; table 1). Class I long-terminal repeat (LTR) retrotransposons are predominant, constituting 39.08% of the assembled genome. Similar to other grasses, the most abundant LTR-retrotransposon family present in M. vimineum genome is Gypsy (41.07%), followed by Copia (32%) (Baucom et al. 2009;Paterson et al. 2009;Schnable et al. 2009;Tian et al. 2009). Gypsy elements are distributed in gene-poor regions in most pseudochromosomes, whereas Copia shows a more even distribution ( fig. 1B). Calibrated sequence divergence of 5 0 -and 3 0 -terminal repeats revealed that most LTR-retrotransposons insertions appear to have occurred 1-2 Ma ( fig. 1C), suggesting recent activity of LTR-retrotransposons and preponderance of young LTR lineages in the genome.
We predicted 39,604 genes spanning 55.22 Mb (approximately 4.9%) of the genome, with an average gene length of 1,394 bp (table 1). A total of 26,230 genes were functionally annotated. We evaluated the completeness of the predicted gene sets and extent of gene duplication with 4,896 BUSCOs from the Poales database (v10; Manni et al. 2021), of which 3,930 (80.2%) were complete, indicating a relatively complete genome assembly and gene prediction (table 1). An interesting observation among the complete BUSCO's was the presence of 1,159 (30%) complete duplicated copies. This degree of duplication is comparable with, but lower than that seen in the polyploids Miscanthus sinensis (Mitros et al. 2020)  The Ks peaks in figure 1G indicate two WGD events: 1) a paleoduplication event shared by all grasses at Ks ¼ 1.04, estimated at 80-90 Ma (Paterson et al. 2004), and 2), and a M. vimineum-specific WGD at Ks ¼ 0.09. The majority of duplicates in M. vimineum were derived from WGD/segmental (28.5%) and dispersed (27.5%) duplications, corroborating polyploidization followed by considerable chromosomal reshuffling in M. vimineum ( fig. 1H). At a minimum, this suggests the M. vimineum-specific duplication likely occurred in   the last $10 Ma, but additional taxon sampling is needed to more accurately estimate the timing of this event.

Conclusion
We generated a high-quality, chromosome-scale genome assembly, and annotation for M. vimineum using PacBio sequencing and Omni-C technology. Genome quality assessment indicated a highly contiguous, accurate assembly and annotation, revealing recent WGD and transposon activity. Given the paucity of sequenced genomes for invasive species, this genome will serve as an important resource to study invasive species at the genomic level. Due to the varying abilities of introduced species to establish in a new environment, decision-making regarding resource allocation, mitigation, and management has always been uncertain; availability of genomic information for nonnative species may provide new solutions (Hamelin and Roe 2020). Whole-genome information expedites downstream population genomic studies on the role of multiple introductions, admixture, and adaptive ramifications of novel genotypes allowing "exploration" of novel phenotypic space, phenologies, and ecological interactions (Bertolotti et al. 2020). In addition, this genome will facilitate studies on the role of epigenetic variation and mobile elements of the genome to delineate their roles in rapid adaptation to the introduced range. These latter processes may allow novel phenotypes and gene expression modifications against the predicted genomic background of low allelic diversity in many invasive species (M erel et al. 2021). Further, comparative genomics and evolutionary studies of invasive versus noninvasive grasses or other plants, animals, and microbes may help to identify genomic commonalities characteristic of successful invaders.

Sample Collection and DNA Extraction
Florets containing seeds were collected and mixed from three populations in the Potomac Ranger District (PRD) of the Monongahela National Forest (MNF) near Petersburg, WV, and three populations in the Cheat Ranger District (CRD) of the same forest near Parsons, WV. Florets were air-dried for 3 months, and cold-dry stratified at 4 C for 1 year. One plant was also grown from seed-bank soil collected along the Monongahela River Rail Trail (RT) in Morgantown, WV. Seeds were germinated over 2 weeks in a Conviron growth chamber under temperatures of 25 C/15 C (12-h day/12h night), approximately 70% humidity, and 500 mmol m À2 s À1 light. RT seedlings were transplanted into potting soil. After germination, day length was increased to 14 h and night temperature was increased to 20 C. The complete shoot of one individual was harvested from each location (PRD, CRD, and RT). Twenty-five grams of fresh, young, green leaf tissue from one PRD accession was chosen for genome sequencing; tissue was flash-frozen in liquid nitrogen and stored at À80 C for 1 month before shipping on dry ice. The remainder of these individuals were stored at À80 C upon flowering with a voucher specimen of each deposited at the Northern Research Station, USDA Forest Service Herbarium. Further, tissue was harvested from these frozen samples for RNAseq analysis. Approximately 0.2 g of tissue was harvested from you, developing tissues for: 1) leaves, 2) roots, 3) cleistogamous inflorescences (covered by leaf sheaths at the nodes), and 4) apical, chasmogamous inflorescences. Tissues were flash frozen as above, stored at À80 C, and shipped on dry ice to GeneWiz, Inc. (South Plainfield, NJ) for RNA sequencing.

PacBio Library Sequencing
Total genomic DNAs were extracted from leaf tissues to construct sequencing libraries (see Supplementary Material online). PacBio SMRTbell libraries ($20 kb) were constructed using the SMRTbell Express Template Prep Kit 2.0 (PacBio, Menlo Park, CA), following the manufacturer's protocol. Libraries were bound to polymerase using the Sequel II Binding Kit 2.0 (PacBio) and loaded onto a PacBio Sequel II at Dovetail Genomics, LLC. Sequencing was performed on two PacBio Sequel II 8M SMRT cells. PacBio reads were assembled using the Wtdbg2 pipeline (Ruan and Li 2020). Contaminants and "haplotigs" (contigs from a single, alternative haplotype) were filtered using Blobtools v1.1.1 (Laetsch

Dovetail Omni-C Library Preparation and Sequencing
For Dovetail Omni-C libraries, chromatin was fixed with formaldehyde, extracted, and randomly digested with DNAse I. Chromatin ends were repaired and ligated to a biotinylated bridge adapter, followed by proximity-ligation of adaptercontaining ends. After proximity ligation, crosslinks were reversed and DNA was purified. Purified DNA was treated to remove biotin that was not internal to ligated fragments, and sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters. Biotincontaining fragments were isolated using streptavidin beads before PCR enrichment of each library. The library was sequenced on an Illumina HiSeqX platform to produce approximately 30Â sequence coverage depth. HiRise was used for scaffolding, a pipeline designed specifically for proximity ligation data (Putnam et al. 2016), requiring mapping quality >50 reads. Dovetail OmniC library sequences were aligned to the draft input assembly using bwa (version 0.7.17; https://github. com/lh3/bwa; Li and Durbin 2009). Separations of OmniC read pairs mapped within draft scaffolds were analyzed by HiRise to produce a likelihood model for genomic distance between read pairs, and used to identify and break putative mis-joins, to score and make prospective joins.

RNA-Seq
Total RNAs were extracted using the QIAGEN RNeasy Plus Kit following manufacturer protocols. Total RNAs were quantified using the Qubit RNA Assay and a TapeStation 4200. Prior to library preparation, DNase treatment was performed followed by AMPure (Beckman Coulter Life sciences) bead cleanup and QIAGEN FastSelect HMR rRNA (QIAGEN) depletion. Libraries were prepared with the NEBNext Ultra II RNA Library Prep Kit following manufacturer protocols and run on an Illumina NovaSeq6000 in 2 Â 150 bp configuration.

Assessment of Genome Assembly Quality
Completeness of the genome and predicted gene quality was assessed using Benchmarking Universal Single-Copy Orthologs (BUSCO v3.0.1; Simão et al. 2015). The poale-s_odb10 lineage-specific profile that contains 4,896 BUSCO gene groups was evaluated against our chromosome-level assembly.

Detection of WGD Events
To investigate WGD events in M. vimineum genome, the distribution of synonymous substitution (Ks) rates was obtained from protein-coding sequences and compared with closely related grasses, for example, S. bicolor, Coix lacryma-jobi, and Z. mays. Paralog and ortholog pairs were detected from protein sequence data and the associated Ks values were calculated using the tool "ksrates" (https://github.com/VIB-PSB/ ksrates; Sensalari et al. 2021 (Wang et al. 2012), by parsing the all_Vs_all BlastP results.

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.