A chromosome-level reference genome of Ensete glaucum gives insight into diversity, chromosomal and repetitive sequence evolution in the Musaceae

Background
Ensete glaucum (2n = 2x = 18) is a giant herbaceous monocotyledonous plant in the small Musaceae family along with banana (Musa). A high-quality reference genome sequence of E. glaucum offers a vital genomic resource for functional and evolutionary studies of Ensete, the Musaceae, and more widely in the Zingiberales.

Findings
Using a combination of Illumina and Oxford Nanopore Technologies (ONT) sequencing, genome-wide chromosome conformation capture (Hi-C), and RNA survey sequence, we report a high-quality assembly of the 481.5 Mb genome with 9 pseudochromosomes and 36,836 genes (BUSCO 94.7%). A total of 55% of the genome is composed of repetitive sequences with LTR-retroelements (37%) and DNA transposons (7%) predominant. The 5S and 45S rDNA were each present at one locus, and the 5S rDNA had an exceptionally long monomer length of c. 1,056 bp, contrasting with the c. 450 bp monomer at multiple loci in Musa. A tandemly repeated c. 134 bp satellite, 1.1% of the genome (with no similar sequence in Musa), was present around all nine centromeres, with a LINE retroelement also found at Musa centromeres. The assembly, including centromeric positions, enabled us to characterize in detail the chromosomal rearrangements occurring between the x = 9 species and x = 11 species of Musa. Only one chromosome has the same gene content as M. acuminata (ma). Three ma chromosomes represent part of only one E. glaucum (eg) chromosome, while the remaining seven ma chromosomes are fusions of parts of two, three, or four eg chromosomes, demonstrating complex and multiple evolutionary rearrangements in the change between x = 9 and x = 11.

Conclusions
The advance towards a Musaceae pangenome including E. glaucum, tolerant of extreme environments, makes a complete set of gene alleles available for crop breeding and understanding environmental responses.
The chromosome-scale genome assembly shows how chromosome number evolves, and reveals features of the rapid evolution of repetitive sequences.

Genome size and heterozygosity estimation
The final genome assembly after Hi-C scaffolding (481,507,213 bp) anchored 97.2% of the sequences of the contig-level assembly (495,175,598 bp; Table 2). Some arrays of tandem repeats, including the rDNA (see below), were collapsed and chromosome termini were not fully assembled. Around 55% of the assembled genome was estimated to be repeat sequences (RepeatMasker; Table 2). The genome size was estimated as 563,295,571 bp (highest 17-mer peak frequency), with slightly higher estimates made by findGSE software (588,939,614 bp). The genome size of E. glaucum is similar to that of the x = 11 Musa species (see [20]) using sequencing methods, and to estimates of both genera by flow cytometry [21].
The heterozygosity rate of E. glaucum was 0. Avena atlantica and 0.12% A. eriantha [27]); it is, however, low compared to other species (e.g. walnut, Juglans nigra 1.0%, [28]; Nyssa sinensis 0.87%, [29]) and in particular many Musa species, some with known hybrid genome composition [26]. The low value seen in species including E. glaucum here is consistent with frequent self-pollination and inbreeding, or a population bottleneck of this monocarpic tropical plant [30].
Ensete glaucum chromosome-scale genome assembly. Wang et al. Page 7 of 57.
not identify satellite sequences. 5S, 45S and Egcen were identified manually in assemblies and the abundance measured in raw read data; microsatellite abundance was calculated from the assemblies. (See Supplementary Tables S1, S2, S3, S10, S12, S14)
Xiao et al. [35] discuss the importance of an HLH factor involved in starch degradation during fruit ripening. Ensete and Musa differ in these characteristics, so it will be interesting to analyze differences in the transcription factors responsible.

Repetitive DNA analysis

Repeat identification
A range of different programs were applied for repeat analysis, and, as has been considered previously [36], there were differences in the repeats identified between approaches, and small changes in parameters and reference sequences give substantial changes. Repeated elements in the genome assembly were identified by RepeatMasker ( to-tail junctions), potentially artefacts from both strands of the DNA molecule passing sequentially through one pore, and these junctions need further investigation.
Fig. 4A compares the abundance and species-distributions of major repeat classes in the Musaceae using the comparative genome analysis function of RepeatExplorer. All species shared many transposons and rDNA sequences (Fig. 4A, central region). However, genus-specific retroelement variants were identified in Musa (Fig. 4A, left) and Ensete-with-Musella (Fig. 4A, right), showing the separation of the two phylogenetic branches, supported by extensive divergence of the repetitive sequence sub-families, and evolution in copy number. Notably, satellite sequences (Fig. 4A, center-right) were much more abundant and some sequences (see centromere sequence below) were present exclusively in Ensete.

Transposable Elements
The most abundant class of repetitive elements was the transposable elements, in particular LTR retroelements. The distributions of Copia and Gypsy LTR retroelements along assembled pseudo-chromosomes ( Fig. 2  Gypsy families show relatively constant activity over the last 2.5 My, with a major peak of insertion activity 3.5 to 5.5 Mya (Fig. 4B, C)  Copia elements leading to a higher proportion of Gypsy elements within the genome of E. glaucum compared to Musa (see above, and Supplementary Tables S10 and S11). This

Tandem (satellite) repeats and centromeric sequences
The repeat analysis revealed the presence of an abundant tandemly repeated sequence with a monomer length of c. 134 bp (Fig. 5A) (Table S14). An average of one SSR was found per 4000 bp, with the density lowest around the centromere and higher at the telomeres ( genome are motif-specific, and, if SSR markers were to be used for genetic mapping, those associated with genes (such as AG/CT) would potentially be more useful.
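To make the notion of a perfect SSR concrete, the following is a minimal Python sketch of a regex-based scan for perfect mono- to hexa-nucleotide repeats. It is not the SSR mining pipeline used in the study. The function name `find_perfect_ssrs` and the `MIN_REPEATS` table are hypothetical, and the threshold values are illustrative assumptions patterned on the minimum repeat numbers described in the Methods (the pentanucleotide value in particular is an assumption).

```python
import re

# Hypothetical thresholds (motif length -> minimum number of repeats).
# Illustrative values only; the pentanucleotide entry is an assumption.
MIN_REPEATS = {1: 12, 2: 8, 3: 5, 4: 5, 5: 5, 6: 4}


def find_perfect_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # A k-mer followed by at least n-1 further exact copies of itself.
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(2)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "AGAG" is reported as the dinucleotide "AG" instead).
            if any(motif == motif[:d] * (k // d)
                   for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), motif, len(m.group(1)) // k))
    return sorted(hits)


def ssr_density(n_ssrs, genome_length):
    """Average spacing in bp between SSRs (e.g. about 4000 bp per SSR)."""
    return genome_length / n_ssrs


example = "TTT" + "AG" * 10 + "CCC" + "A" * 12 + "GT"
ssrs = find_perfect_ssrs(example)
```

Because the scan reports the leftmost match, the motif is returned in the phase in which it is first encountered (for example "GA" rather than "AG" if the array is preceded by a G); a production pipeline would normally canonicalize motifs such as AG/CT across phases and strands.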

5S and 45S rDNA and rRNA genes
Tandem repeats of the rDNA were predominantly located within extended, complex, loci on chromosomes eg05 (5S rDNA) and eg06 (45S rDNA) (  BluePippin (Sage Sciences), and the SQK-LSK109 kit (Oxford Nanopore) was used to build a library that was sequenced using PromethION. The base calling was performed with Guppy and reads with mean_qscore_template (Phred) > 7 (base call accuracy >80%) were selected. A total of 129 Gb ONT reads (~250X coverage) was generated. fastp v0.19.7 [58] was used for quality control, including adaptor trimming and removal of reads with too many Ns or a mean q score lower than 7, leaving 109 Gb of clean data (Table 1). The mean read length was c. 20 kb, with the longest >120 kb (Supplementary Fig. S11).
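The relationship between the Phred cutoff and base-call accuracy quoted above, and the approximate fold coverage, can be checked with a short Python sketch. The function names are hypothetical; the inputs are the figures given in the text, and the choice of the assembled length versus the k-mer estimate as the denominator is an assumption that moves the coverage figure to either side of ~250X.

```python
def phred_to_accuracy(q):
    """Expected base-call accuracy for Phred score q: 1 - 10^(-q/10)."""
    return 1.0 - 10.0 ** (-q / 10.0)


def fold_coverage(total_bases, genome_size):
    """Sequenced bases divided by genome size."""
    return total_bases / genome_size


# Phred 7 corresponds to an error probability of about 0.2.
accuracy_q7 = phred_to_accuracy(7)

# 129 Gb of raw ONT reads against two genome-size figures from the text.
coverage_assembly = fold_coverage(129e9, 481_507_213)  # assembled length
coverage_kmer = fold_coverage(129e9, 563_295_571)      # 17-mer estimate

print(f"Q7 accuracy: {accuracy_q7:.3f}")
print(f"coverage vs assembly: {coverage_assembly:.0f}X")
print(f"coverage vs k-mer estimate: {coverage_kmer:.0f}X")
```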

Hi-C chromatin interaction data
The Hi-C library was prepared following a procedure with an improved modification [59].  (Table 2 and Supplementary Tables S1, S2). Read pairs from the Hi-C data were mapped to the draft assembly using bowtie2 v2.   Supplementary Fig. S5). Visualization (Fig. 3B was used to model the evolution of gene family sizes and stochastic birth and death processes and summarized in the phylogenetic tree (Fig. 3C).  Table 2 (Supplementary Tables S11 and S12)  based on the divergence of the 5' and 3' end LTRs, and these two LTRs of every LTR retrotransposon were extracted into separate files with a custom script.  Figure S3) and compared with RepeatExplorer2 following the "comparative repeat analysis" protocol. The results were visualized with the R script "plot_comparative_clustering_summary.R" (Fig. 4A).
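Dating an LTR retrotransposon insertion from the divergence of its 5' and 3' LTRs, as mentioned above, is commonly done with T = K / (2r), where K is the corrected distance between the two LTRs and r is the substitution rate. The Python sketch below illustrates that calculation under stated assumptions: the Jukes-Cantor correction and the default rate of 1.3e-8 substitutions per site per year are assumptions for illustration and are not stated in the text, and all function names are hypothetical.

```python
import math


def p_distance(ltr5, ltr3):
    """Proportion of mismatched sites between two aligned LTRs (gaps skipped)."""
    if len(ltr5) != len(ltr3):
        raise ValueError("sequences must be aligned to equal length")
    pairs = [(a, b) for a, b in zip(ltr5.upper(), ltr3.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance: K = -3/4 * ln(1 - 4p/3)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def insertion_time(k, rate=1.3e-8):
    """Insertion time in years, T = K / (2r).

    The default rate is an assumed value, not one taken from the text.
    """
    return k / (2.0 * rate)


p = p_distance("ACGTACGTAC", "ACGTTCGTAC")  # one mismatch in ten sites
k = jukes_cantor_distance(p)
t_years = insertion_time(k)
```

The factor of two reflects that both LTRs, identical at the time of insertion, accumulate substitutions independently afterwards.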

SSR Tandem Repeats
The genome assembly was searched for SSR (microsatellite) motifs using the SSR mining pipeline developed by Biswas et al. [46]. Searches were standardized for mining perfect SSRs from mono- to hexa-nucleotide repeats (minimum repeat number of 12 for mononucleotides, 8 for di-, 5 for tri-, tetra- and penta-, and 4 repeats for penta- and hexa-nucleotides). SSR abundance and nature were analyzed based on density in the genome (about 1 per 4000 bp), array length ( Figure 7B  in 10 mM citric acid/sodium citrate buffer (pH 4.6) for 3-5 h at 37°C and then kept in buffer for 12-30 h at 4°C. Meristems were dissected in 60% acetic acid and routinely 2-6 slide preparations were made from each root. Slides were stored at -20°C until FISH.
The 45S rDNA probe was labelled by random priming (Invitrogen) with digoxigenin dUTP (Roche) using the linearized clone pTa71 (Gerlach and Bedbrook 1979) containing the 45S rDNA repeat unit of Triticum aestivum. 50-100 ng of labelled probe was used per slide and detection of hybridization sites was carried out with Fluorescein-conjugated anti-digoxigenin (Roche). The remaining probes were designed from the consensus sequence of the centromeric repeat Egcen (Fig. 5) and the 5S rDNA (Fig. 7) or as simple sequence repeats
TCT CTC TCT CTC TCT CTC TCT CTC TCT CTC
TCT CTC TCT CTC T
For hybridization, probes were prepared in 40% (v/v) formamide, 20% (w/v) dextran sulphate, 2x SSC (sodium chloride sodium citrate), 0.03 μg of salmon sperm DNA, 0.12% SDS (sodium dodecyl sulphate) and 0.12 mM EDTA (ethylenediamine-tetra acetic acid). Chromosomes and 40-50 µl of probe mixture were denatured together at 72°C for 8 min, cooled down slowly and allowed to hybridize overnight at 37°C. Post-hybridization washes were at 42°C in 0.1x SSC, giving a stringency of 80-85% for the short oligo probes, and 70-75% for the 45S rDNA probe.
Chromosomes were counterstained with 4 µg/ml DAPI (4′,6-diamidino-2-phenylindole) and mounted in Citifluor AF. Slides were examined using a Nikon Eclipse 80i microscope and images were captured with a DS-QiMc monochrome camera and NIS-Elements v2.34 (Nikon, Tokyo, Japan). Overlays of hybridization signal and DAPI images were enhanced with Adobe Photoshop CC2018 using only cropping and functions that treat all pixels equally. Seven FISH runs with different combinations of probes and replicates were performed, and between 5 and 15 metaphases per slide (99 metaphases in total from 15 slides) were analyzed in detail. Musa acuminata