Genome Sequencing of Musa acuminata Dwarf Cavendish Reveals a Duplication of a Large Segment of Chromosome 2

Different Musa species, subspecies, and cultivars are currently investigated to reveal their genomic diversity. Here, we compare the genome sequence of one of the commercially most important cultivars, Musa acuminata Dwarf Cavendish, against the Pahang reference genome assembly. Numerous small sequence variants were detected and the ploidy of the cultivar presented here was determined as triploid based on sequence variant frequencies. Illumina sequence data also revealed a duplication of a large segment on the long arm of chromosome 2 in the Dwarf Cavendish genome. Comparison against previously sequenced cultivars provided evidence that this duplication is unique to Dwarf Cavendish. Although no functional relevance of this duplication was identified, this example shows the potential of plants to tolerate such aneuploidies.

Here we report about our investigation of the genome of M. acuminata Dwarf Cavendish, one of the commercially most important cultivars. We identified an increased copy number of a segment of the long arm of chromosome 2, indicating that this region was duplicated in one haplophase.

MATERIALS AND METHODS
Plant material and DNA extraction Musa acuminata Dwarf Cavendish tissue culture seedlings were obtained from FUTURE EXOTICS/SolarTek (Düsseldorf, Germany) ( Figure 1). Plants were grown under natural daylight at 21°. Genomic DNA was isolated from leaves following the protocol of Dellaporta et al. (1983).
Library preparation and sequencing Genomic DNA was subjected to sequencing library preparation via the TrueSeq v2 protocol as previously described (Pucker et al. 2016). Paired-end sequencing was performed on an Illumina HiSeq1500 and NextSeq500, respectively, resulting in 2x250 nt and 2x154 nt read data sets with an average Phred score of 38. These data sets provide 55x and 65.6x coverage, respectively, for the approximately 523 Mbp (D'Hont et al. 2012) haploid banana genome.  acuminata Dwarf Cavendish reads against the DH Pahang v2 reference sequence assembly revealed a 6.2 Mbp tetraploid region on the long arm of chromosome 2 in Dwarf Cavendish (see enlarged box in the upper right). Apparent large scale deletions, indicated by regions with almost zero coverage, are technical artifacts caused by large stretches of ambiguous bases (Ns) in the Pahang assembly that cannot be covered by reads; these artifacts are marked with horizontal gray lines. Plots with higher per chromosome resolution data are presented in Supplementary File S1.
Read mapping, variant calling, and variant annotation All reads were mapped to the DH Pahang v2 reference genome sequence via BWA-MEM v0.7 (Li 2013) using -M to flag short hits for downstream filtering. This read mapping was analyzed by the HaplotypeCaller of the Genome Analysis ToolKit (GATK) v3.8 (McKenna et al. 2010;Van der Auwera et al. 2013) to identify sequence variations in single nucleotides, called "single nucleotide variants" (SNVs), and also insertions/deletions (InDels). SNVs and InDels were called using the following filter rules in accordance with the GATK developer recommendation: 'QD , 2.0', 'FS . 60.0', and 'MQ , 40' for SNVs and 'QD , 2.0' and 'FS . 200.0' for InDels. An InDel length cutoff of 100 bp was applied to restrict downstream analyses to a set of high quality variants called from 2x250nt reads. Only variants supported by at least five reads were kept. The resulting variant set was subjected to SnpEff (Cingolani et al. 2012) to assign predictions about the functional impact to the variants in the set. Variants with disruptive effects were selected using a customized Python script as described earlier (Pucker et al. 2016).
The genome-wide distribution of SNVs and InDels was assessed based on previously developed scripts (Baasner et al. 2019). The length distribution of InDels inside coding sequences was compared to the length distribution of InDels outside coding sequences using a customized Python script (Pucker et al. 2016).
De novo genome assembly Trimmomatic v0.38 (Bolger et al. 2014) was applied to remove low quality sequences (i.e., four consecutive bases below Phred 15) and remaining adapter sequences (based on similarity to all known Illumina adapter sequences). Different sets of trimmed reads were subjected to SOAPdenovo2 (Luo et al. 2012) for assembly using optimized parameters  including avg_ins = 600, asm_flags = 3, rd_ len_cutoff = 300, pair_num_cutoff = 3, and map_len = 100. K-mer sizes ranged from 67 to 127 in steps of 10. Resulting assemblies were evaluated using previously described criteria  including general assembly statistics (e.g., number of contigs, assembly size, N50, and N90) and a BUSCO v3 (Simão et al. 2015) assessment. Polishing was done by removing potential contaminations and adapters as described before . The DH Pahang v2 assembly (D'Hont et al. 2012;Martin et al. 2016) was used in the contamination detection process to distinguish between bona fide banana contigs and sequences of unknown origin. Contigs with high sequence similarity to non-plant sequences were removed as previously described . Remaining contigs were sorted based on the DH Pahang v2 reference genome sequence and concatenated to build pseudochromosomes to facilitate downstream analyses. A de novo Dwarf Cavendish assembly generated with a K-mer size of K = 127 was choosen to give statistics.

Data availability
Supplemental tables and figures are available at GSA figshare. File S1 shows the per chromosome read coverage distribution of Dwarf Cavendish reads. File S2 presents a comparison of SNVs in the duplicated segment on the long arm of chromosome 2 to all other SNVs in the genome. The higher read coverage at variants indicates a duplication of this region. File

Structural variants
Mapping of M. acuminata Dwarf Cavendish reads against the DH Pahang v2 reference sequence assembly revealed several copy number variations in different parts of the genome (Figure 2, File S1). The most remarkable difference between the Dwarf Cavendish and Pahang genome sequence is the amplification of an about 6.2 Mbp continuous region (length deduced from the reference genome) on the long arm of chromosome 2 (Figure 2, File S1, S2). An investigation of allele frequencies in the duplicated segment on chromosome 2 revealed that this duplication originates from a haplophase with high similarity to the reference sequence (Figure 3). Such a duplication was not observed in any of the other publicly available genomic sequencing data sets when compared against the DH Pahang v2 genome sequence (File S3, S4). Apparently, read mapping also indicates at least four large scale deletions in Dwarf Cavendish compared to Pahang v2 on chromosomes 2, 4, 5 and 7 ( Figure 2). However, analysis of the underlying sequence revealed long stretches of ambiguous bases (Ns) at these positions in the Pahang assembly as the cause for these pseudo low coverage regions.

Ploidy of M. acuminata Dwarf Cavendish
Based on allele frequency of small sequence variants (SNVs and InDels), the ploidy of Dwarf Cavendish was identified as triploid ( Figure 3). Many heterozygous variant positions display a frequency of the reference allele close to 0.33 or close to 0.66. This fits the expectation for two copies of the reference allele and one copy of a different allele, or vice versa. Deviation from the precise values is explained by random fluctuation of the read distribution at the given position. Since the peak around 0.66 for the frequency of the allele identical to the reference is substantially higher than the peak around 0.33, it is reasonable to assume that two haplophases are very similar to the reference. The third haplophase is the one that contains more deviating positions and differs more from the reference. It is likely that reads of the divergent haplophase are mapped with a slightly reduced rate. This might explain why the peak at 0.66 is slightly more than twice the size of the peak at 0.33. In the duplicated segment on chromosome 2 the Figure 4 Genome-wide distribution of small sequence variants. SNVs (green) and InDels (magenta) distinguish Dwarf Cavendish from Pahang. Variants were counted in 100 kb windows and are displayed on two different y-axes to allow maximal resolution (Pucker et al. 2016). allele frequency peaks are shifted to 0.25 and 0.75 (Figure 3), indicating a tetraploid region with three haplophases identical to the reference and one haplophase divergent from the Pahang reference.
To be able to test and prove or disprove hypotheses regarding differences of the haplophases of the Dwarf Cavendish genome, a high continuity phased assembly would be needed. Up-to-date long read sequencing technologies like Single Molecule Real-Time (Pacific Biosciences) or nanopore sequencing (Oxford Nanopore Technologies) in principle allow to generate such assemblies. However, successful phase separation currently requires tools like TrioCanu (Koren et al. 2018) which use Mendelian relationships between parents and F1 (i.e., crosses) for assignment of reads to phases. Generation of such data sets will be very difficult for banana and goes significantly beyond the scope of this study.
Genome-wide distribution of small sequence variants In total, 10,535,983 SNVs and 1,466,047 InDels were identified between the Dwarf Cavendish reads and the Pahang v2 assembly (see Data availability above). The genome-wide distribution of these variants is shown in Figure 4. As previously observed in other re-sequencing studies (Pucker et al. 2016), the number of SNVs exceeds the number of InDels substantially. Moreover, InDels are more frequent outside of annotated coding regions. Inside coding regions, InDels show an increased proportion of lengths which are divisible by 3, a bias introduced due to the avoidance of frameshifts.
SnpEff predicted 4,163 premature stop codons, 3,238 lost stop codons, and 8,065 frameshifts based on this variant set (File S5). Even given the larger genome size, these numbers are substantially higher than high impact variant numbers observed in re-sequencing studies of homozygous species before (Pucker et al. 2016;Xu et al. 2019). One explanation could be the presence of three alleles for each locus leading to compensation of disrupted alleles. Since banana plants are propagated vegetatively, breeders do not suffer inbreeding depressions.
De novo genome assembly To facilitate wet lab applications like oligonucleotide design and validation of amplicons, the genome sequence of Dwarf Cavendish was assembled de novo. The assembly comprises 256,523 scaffolds with an N50 of 5.4 kb (Table 1). Differences between the three haplophases are one possible explanation for the low assembly contiguity. The assembly size slightly exceeds the size of one haplotype. Due to the low contiguity of this assembly and only minimal above 50% complete BUSCOs (Benchmarking Universal Single-Copy Orthologs) (Simão et al. 2015), annotation was omitted. Nevertheless, we successfully used the produced genome assembly for primer design and detection of small sequence variants.

ACKNOWLEDGMENTS
We thank Joachim Weber for great technical assistance. We acknowledge support for the Article Processing Charge by the Deutsche Forschungsgemeinschaft and the Open Access Publication Fund of Bielefeld University.