Chromosome-Level Assembly of Drosophila bifasciata Reveals Important Karyotypic Transition of the X Chromosome

The Drosophila obscura species group is one of the most studied clades of Drosophila and harbors multiple distinct karyotypes. Here we present a de novo genome assembly and annotation of D. bifasciata, a species which represents an important subgroup for which no high-quality chromosome-level genome assembly currently exists. We combined long-read sequencing (Nanopore) and Hi-C scaffolding to achieve a highly contiguous genome assembly approximately 193 Mb in size, with repetitive elements constituting 30.1% of the total length. Drosophila bifasciata harbors four large metacentric chromosomes and the small dot, and our assembly contains each chromosome in a single scaffold, including the highly repetitive pericentromeres, which were largely composed of Jockey and Gypsy transposable elements. We annotated a total of 12,821 protein-coding genes and comparisons of synteny with D. athabasca orthologs show that the large metacentric pericentromeric regions of multiple chromosomes are conserved between these species. Importantly, Muller A (X chromosome) was found to be metacentric in D. bifasciata and the pericentromeric region appears homologous to the pericentromeric region of the fused Muller A-AD (XL and XR) of pseudoobscura/affinis subgroup species. Our finding suggests a metacentric ancestral X fused to a telocentric Muller D and created the large neo-X (Muller A-AD) chromosome ∼15 MYA. We also confirm the fusion of Muller C and D in D. bifasciata and show that it likely involved a centromere-centromere fusion.

emergence of metacentric chromosomes and determine if metacentric pericentromeres are conserved over evolutionary time (Figure 1). Second, the configuration of the Muller A chromosome (the ancestral X chromosome in Drosophila) is particularly interesting, since it became fused to Muller D in some members of the obscura group 15 million years ago (Figure 1) thereby creating a large neo-sex chromosome (Carvalho and Clark 2005). The location of the centromere (metacentric or telocentric) prior to the fusion is not known, and the A-to-D fusion has been a matter of some debate (Schaeffer 2018). If Muller A was metacentric prior to the fusion, that could explain the presence of ancestral Muller A genes on the long arm of the fused A-D chromosome (denoted XR in D. pseudoobscura) (Mahajan et al. 2018, Bracewell et al. 2019) (hereafter referred to as Muller A-AD). Third, the putative Muller C-D fusion is only present in some species of the obscura subgroup, suggesting it occurred recently. How the chromosomes fused is unknown (centromerecentromere, centromere-telomere, telomere-telomere) and the relative size and gene content of this new pericentromeric region is unknown. Here, we report on our genome assembly and annotation of D. bifasciata and we characterize chromosome structure, the distribution of transposable elements (TE), and explore the putative Muller C-D fusion.

METHODS AND MATERIALS
Genome sequencing and assembly We sequenced the D. bifasciata isofemale line 14012-0181.02, which was originally collected in Hokkaido, Japan and obtained from the National Drosophila Species Stock Center at Cornell University. High molecular weight DNA for sequencing was extracted from 60 female flies using a Qiagen Blood & Cell Culture DNA Midi Kit and the resulting DNA was size selected for fragments .15 kb using BluePippin (Sage Science). For size selection, we used 6 mg of total DNA (100 ng/ml) run in two wells. The elute was bead purified, resulting in a total of 2.7 mg total DNA in a 50 ml solution. We generated longreads using Nanopore and the SQK-LSK109 sequencing kit on one 9.4.1RevD flow cell and with the minKNOW software version 3.1.13. Raw output files from our sequencing run were base called using Albacore Sequencing Pipeline software version 2.3.3 (Oxford Nanopore Technologies) with default parameters for quality score filtering.
We used Canu version 1.8  to first error-correct the raw sequencing reads using slightly modified parameters (correc-tedErrorRate = 0.065 corMinCoverage = 8 batOptions="-dg 3 -db 3 -dr 1 -ca 500 -cp 50" trimReadsCoverage = 4 trimReadsOverlap = 500 genomeSize = 200m). The resulting error-corrected reads were then assembled into contigs using the WTDBG2 assembler (Ruan and Li 2019) with default settings. We then BLAST searched all contigs ,1 MB to the nt database and returned the top two hits to identify any contigs from non-target species (typically Acetobacter and Saccharomyces).
After removing contaminant contigs we polished the genome assembly using three rounds of Racon (Vaser et al. 2017) followed by three rounds of Pilon (Walker et al. 2014). This method of combining multiple rounds of Racon and Pilon has been shown to increase genome assembly quality in other Drosophila species . To polish with Racon we mapped our raw nanopore reads each round using minimap2 and specified -x ava-ont (Li 2018). For genome polishing with Pilon we used reads derived from our high coverage Hi-C Illumina data (below). Because of the inherent properties of Hi-C data (paired-end reads with atypical orientations, highly variable insert sizes, chimeric reads) that could lead to spurious genome polishing, we first mapped our Illumina reads to the genome using BWA mem (Li and Durbin 2009) and extracted only those reads with correct pairing using samtools (view -bf 0x2) ). We then used those reads as single-end reads for genome polishing. A fraction of the single-end reads will be chimeric but read mapping with BWA mem soft-clips reads and these soft-clipped reads should be randomly distributed across the genome ( Figure S1) and not contribute significantly to genome polishing. At each step of assembly and polishing we assessed genome completeness using BUSCO v3 (Simão et al. 2015) and the odb9 eukaryota database.

Hi-C scaffolding
Prior to scaffolding we compared our polished contigs with other chromosome-length genome assemblies from obscura group species (Mahajan et al. 2018, Bracewell et al. 2019) using whole genome alignments with D-Genies (Cabanettes and Klopp 2018). We then identified the largest contigs belonging to Muller elements to help guide any potential manual manipulations during Hi-C scaffolding. To scaffold the assembly, we used chromatin conformation capture to generate Figure 1 Evolutionary relationships and karyotype transitions of obscura group flies. The ancestral karyotype of the obscura group (shown here as Drosophila subobscura) consists of five large and one small pair of telocentric chromosomes, referred to as Muller elements A-F (reviewed in Schaeffer 2018), and shown color coded. Significant karyotypic changes have occurred across the obscura group (highlighted with gray boxes) with chromosomal fusions and centromere movement altering chromosome structure (Bracewell et al. 2019). Drosophila bifasciata represents an important karyotype to understand evolutionary transitions since Muller A (the X chromosome), B and E are thought to be metacentric and Muller A is unfused (Moriwaki and Kitagawa 1955). In D. bifasciata, it is thought that Muller C and D fused, although C-D fusions are only present in some obscura subgroup species (Buzzati-Traverso and Scossiroli 1955). Shown phylogenetic relationships adapted from (Gao et al. 2007) with subgroup designations shown along the branches.
Hi-C data (Lieberman-Aiden et al. 2009). We generated Hi-C libraries as outlined in Bracewell et al. (2019) using a DNase digestion method (Ramani et al. 2016). The resulting DNA library was prepped using an Illumina TruSeq Nano library prep kit and was sequenced on a HiSeq 4000 with 100 bp PE reads. We used Juicer (Durand et al. 2016b) to map raw Hi-C reads and generate contact maps based on 3D interactions to scaffold the genome assembly. We then used the 3D-DNA pipeline (Dudchenko et al. 2017) to orient and place contigs. 3D-DNA output files were visualized and checked for accuracy using Juicebox (Durand et al. 2016a) with verification and modifications to scaffolding done using built-in tools. The final assembly was scaffolded together with 300 Ns between each contig.
Estimating residual isofemale line genetic variation and genome assembly accuracy Residual genetic variation can complicate genome assembly and lead to varying assembly quality across chromosomes. To characterize genomic patterns of variation in the sequenced isofemale line, we used the mapped Illumina polishing reads (above) and GATK's UnifiedGenotyper (DePristo et al. 2011) to call single nucleotide polymorphisms (SNPs). We filtered SNPs using VCFtools (Danecek et al. 2011) (-minGQ 20 -minDP 3 -min-alleles 2) and estimated nucleotide diversity (p) in 50 kb non-overlapping windows, with the expectation that diversity should be nearly zero for genomic regions that are isogenic. To estimate base level accuracy (QV) of the genome assembly, we followed methods outlined in Koren et al. (2018) and used FreeBayes (Garrison and Marth 2012) to identify variants in our mapped Illumina polishing reads (same as above) with the command fasta. Heterozygous calls (0/1) with a reference allele were filtered out and QV was calculated as: where B is the sum of all bases changed from indels and homozygous alternate SNPs with sequencing depth $3·, and T is total bases with sequencing depth $3·.

Repetitive element identification and genome masking
We first used REPdenovo (Chu et al. 2016) to identify novel repeats from our single-end Hi-C Illumina sequencing data (above) using parameters described in detail in Bracewell et al. (2019). We then concatenated the REPdenovo repeats with the Repbase Drosophila repeat library (downloaded March 22, 2016, from www.girinst.org) and used this combined file to mask the genome with RepeatMasker version 4.0.7 using the -no_is and -nolow flags. To characterized the genomic distribution of specific transposable element (TE) families we used a TE library developed from obscura group flies (Hill and Betancourt 2018) and again used RepeatMasker and then bedtools coverage (Quinlan and Hall 2010) to determine the proportion of masked bases per TE family.

Genome annotation and characterization of assembly
To annotate our D. bifasciata genome assembly we used the REPdenovo/Repbase repeat-masked genome (above) and the MAKER annotation pipeline (Campbell et al. 2014) to identify gene models. The ab initio gene predictors SNAP (Korf 2004) and Augustus (Stanke and Waack 2003) were used to guide the annotation and we used protein sets from D. pseudoobscura and D. melanogaster (FlyBase) to aid in gene prediction. We used karyoploteR (Gel and Serra 2017) to plot features of the D. bifasciata genome assembly.
Gene orthologs, genome synteny, and Muller element fusion orientation To compare our genome assembly with D. athabasca which has metacentric Muller A-AD, B and E chromosomes (Figure 1), and D. subobscura, which harbors the ancestral karyotype and is composed entirely of telocentric chromosomes (Figure 1), we performed BLASTP reciprocal best hit searches between proteins from our annotations of each species (Bracewell et al. 2019). We used the blast_rbh.py script (Cock et al. 2015) and genomic coordinates of reciprocal best hits were plotted using the genoPlotR package (Guy et al. 2010). To determine if the Muller C-D fusion in D. bifasciata was the result of a centromere-centromere, centromere-telomere, or telomere-telomere fusion, we identified the 50 most proximal pericentromeric genes from telocentric Muller C and D in D. subobscura and plotted the location of orthologs in D. bifasciata.

Data availability
Raw Nanopore and Hi-C (Illumina) reads have been deposited in the NCBI SRA and are under the BioProject (PRJNA565796). The genome assembly and annotation have been deposited with NCBI (accession WIOZ00000000.1). Figure S1 shows an Integrative Genomics Viewer image of a region on Muller A of the genome assembly with mapped Hi-C reads filtered for Pilon polishing. Figure S2 contains a plot of Illumina and Nanopore sequencing coverage over the draft genome assembly. Figure S3 shows the Hi-C heatmap and scaffolding as in Figure 1 with nucleotide diversity (p) estimated from the isofemale line in 50 kb non-overlapping windows across the assembly. Figure S4 displays the genomic distribution of the ten most frequently encountered transposable elements (TEs) for each Muller element in the genome assembly. Figure S5 shows the genomic location of D. subobscura pericentromeric orthologs from Muller C and D on the fused Muller CD of D. bifasciata. Supplemental material available at figshare: https://doi.org/10.25387/g3.11561892.

RESULTS AND DISCUSSION
Using one Nanopore flow cell, we generated 538,757 reads that passed Albacore's standard quality filtering. Our Nanopore reads had an N50 read length of 23,957 bases and provided 45· coverage over the genome given an estimated genome size of 200 Mb for D. bifasciata. Our initial hybrid Canu/WTDBG2 assembly resulted in a genome assembly that consisted of 796 contigs with an N50 of 2,325,530. BLAST results flagged multiple putative bacterial contigs (primarily Acetobacter) and 49 contigs (5.5 Mb of total sequence) were removed. As expected, rounds of Racon polishing (3·) and subsequent Pilon polishing (3·) led to an appreciable increase in our BUSCO scores (Table 1) although the most significant increases in genome completeness were detected after the initial round of Racon or Pilon. Pilon polishing did not lead to as dramatic an increase in genome completeness as seen in other studies (Bracewell et al. 2019) and this was likely due to limitations of our Illumina polishing data that was single-end and was of modest coverage (mean 18·) over the genome ( Figure S2). However, we did see a significant increase in genome completeness suggesting that polishing the genome with Hi-C reads can be a viable strategy for increasing genome assembly quality. Our polished genome assembly consisted of 747 contigs with an N50 of 2,386,451. The longest contig was 18,852,285 bp with a total genome assembly length of 192,589,718 bp.
Our Hi-C library generated a total of 13,018,415 sequenced read pairs of which 73.8% were alignable to the draft genome. The Juicer pipeline identified 6,734,204 Hi-C contacts which were used to scaffold the genome. The final scaffolded genome assembly placed 126 contigs to the fused Muller CD, 54 to Muller A, 238 to Muller B, 119 to Muller E, 1 to Muller F, and 209 were left unplaced (Figure 2A). Hi-C scaffolding revealed clear associations between euchromatic arms of the same chromosome thereby increasing our confidence in the assembly of metacentric chromosomes (Figure 2A). For example, Muller CD is thought to be the result of a fusion of telocentric Muller C and D elements (Moriwaki and Kitagawa 1955) and our assembly showed clear associations between the C and D arms (Figure 2A). Importantly, there were also clear associations between Muller C and D euchromatic arms with adjacent pericentromeric contigs (Figure 2A), thus providing evidence for the placement of the repeat-rich pericentromeric sequence as well ( Figure 2B). Muller A also showed clear associations that extend into highly repetitive pericentromeric regions highlighting this chromosome is indeed metacentric (Figure 2A). The combination of inter-arm and arm-pericentromere Hi-C associations allowed us to determine the correct orientation for all arms of the D. bifasciata chromosomes.
BUSCO results suggest our final scaffolded genome assembly is of high quality and 95.7% of BUSCOs were found complete (Table 1). We found the BUSCO statistics to be slightly lower than our other high-quality obscura group assemblies which average 98.7% complete (Bracewell et al. 2019). To investigate this reduction, we looked for missing BUSCOs in a species with a similar karyotype and higher score (D. athabasca) and found that 49% of missing BUSCOs (20 of 41 total) were in pericentromeric regions. Therefore, residual genome assembly and polishing issues of highly repetitive pericentromeric regions are likely the main contributor to the slightly lower scores of D. bifasciata.
Base level accuracy (QV) was found to be rather high at 35.9. We identified 25,889 indels and 19,151 homozygous SNPs that differed from the genome assembly over the 174,670,905 bases with coverage $3·. The assembly QV for D. bifasciata is slightly lower than assemblies of the reference genome strain of Drosophila melanogaster , Solares et al. 2018 but slightly higher than that of the domestic goat (Bickhart et al. 2017). A likely contributor to the lower accuracy than D. melanogaster (Solares et al. 2018) was the levels of genetic variation we identified in the isofemale line used for sequencing ( Figure S3). We found moderately high nucleotide diversity over most of Muller CD, and the long arms of Muller B and E. In contrast, the n■  Figure 2 Chromosome-level genome assembly of Drosophila bifasciata using Hi-C. A) Hi-C heatmap showing long-range contacts and scaffolding of the genome assembly. Green and blue squares denote contigs and chromosomes, respectively. Euchromatic chromosome arms and heterochromatic pericentromeres for each chromosome show distinct and primarily isolated associations that resemble a 'checkerboard' pattern. Note that chromosome arms on opposite sides of a pericentromere often show associations on the diagonal confirming their placement (yellow arrow) while pericentromeres show finer-scale associations with their chromosome arms (blue arrow). B) Shown is the D. bifasciata genome assembled into Muller elements (color coded as in Figure 1), scaffolding stitch points, gene density (genes per 100 kb) and repeat content (proportion of bases repeat-masked in 100 kb non-overlapping windows). Boxes around highly repetitive regions indicate putative pericentromere boundaries (defined as $40% repeat-masked sequence in sliding windows away from the center).
short arms of Muller B and E, all of Muller A, and Muller F showed very low levels of variation, consistent with being nearly isogenic ( Figure S3). These patterns of elevated diversity are likely driven by chromosomal inversions still present within the sequenced isofemale line. A total of 57,947,182 bp of the genome assembly was identified as being repetitive (30.1% of the total length of the assembly) and large fractions of all Muller elements were repeat-masked ( Figure 2B). The exceptionally high level of repeat-masking located in the middle of chromosome-length scaffolds is indicative of pericentromeric regions on metacentric chromosomes that harbor large numbers of TEs (Kaminker et al. 2002, Bracewell et al. 2019. Indeed, we find that TEs from a few specific families are highly abundant in the pericentromeric region of all Muller elements (Figure 3). Gypsy and Jockey elements are frequently encountered in the pericentromeres of D. bifasciata (Figure 3). Nearly 10 Mb, and over 5 Mb, of assembled sequence (28.4% and 15.8% of all bases masked for TEs) was classified as either Gypsy or Jockey elements, respectively. One specific element, Daff_Jockey_18, is at high frequency in all pericentromeres of D. bifasciata ( Figure S3) and was also the most frequently encountered TE in D. athabasca, which also has large metacentric chromosomes (Bracewell et al. 2019).
Our MAKER annotation identified a total of 12,821 protein coding genes models in our D. bifasciata genome assembly. This number is very similar to other obscura groups species, which have been found to harbor anywhere from 12,714 -14,547 genes (Mahajan et al. 2018, Puerma et al. 2018, Bracewell et al. 2019, Karageorgiou et al. 2019. We find a total of 2,279 protein-coding genes on Muller A, 2,499 on Muller B, 4,599 on the fused Muller CD, 3,276 on Muller E and 90 on Muller F ( Table 2). Comparisons of orthologs between D. bifasciata and D. athabasca (affinis subgroup), which also has a metacentric Muller A-AD, Muller B, and Muller E indicates that the large pericentromeric region in these species is homologous (Figure 4). Surprisingly, the pericentromeric regions in D. bifasciata are remarkably similar in size to those of Drosophila athabasca suggesting some level of pericentromere stability over long periods of evolutionary time. Conservation of the pericentromere for Muller A between D. bifasciata and D. athabasca strongly suggests the fusion between Muller A and D involved a telomere-centromere or telomere-telomere fusion between the metacentric Muller A and the telocentric Muller D. This type of fusion would have resulted in the large neo-X (Muller A-AD) we see in species from the pseudoobscura/affinis subgroup and would account for the excess of Muller A genes on XR of the fused chromosome. For Muller B and Muller E, we find clear evidence of multiple paracentric inversions that differentiate the D. bifasciata and D. athabasca chromosomes ( Figure 4). However, we find no signatures of pericentric inversions, and each arm of Muller B and E appears to be conserved n■ Table 2 Genome assembly and annotation results

Contigs
Length ( ( Figure 4). This pattern contrasts with Muller A where we find evidence of both paracentric and pericentric inversions that differentiate these species (Figure 4). We also sought to determine the orientation of the fusion between Muller C and D in D. bifasciata and we find that the current configuration most likely occurred via a fusion of the two chromosomes at their centromeres. Orthologs of pericentromeric C and D genes in D. subobscura are adjacent to one another in our scaffolded assembly ( Figure S4) and Hi-C results strongly support this relationship ( Figure 2A). Interestingly, the pericentromeric region of the fused C-D chromosome appears smaller than all other pericentromeres in our assembly (Figure 2A). Although speculative, this may be due to the young age of this pericentromere which may be just beginning to expand through the proliferation of repetitive sequences. For example, the 50 pericentromeric C genes in D. subobscura are in a 1.0 Mb region while orthologs in D. bifasciata are spread out across 4.6 Mb ( Figure S4).
In conclusion, our chromosome-level assembly of D. bifasciata provides a valuable resource for future work in this species and will allow for more comprehensive comparative genomic analyses of Drosophila. Our genome assembly method highlights how long-read Nanopore sequencing combined with Hi-C scaffolding can assemble long stretches of highly repetitive pericentromeric sequence, resulting in the assembly of entire metacentric chromosomes. These chromosome-level assemblies allow for evolutionary comparisons of pericentromeric regions that until recently have not been possible. As more chromosome-level genome assemblies become available, we will begin to better understand large-scale changes in chromosome morphology and their impact on genome architecture, gene evolution and speciation.

ACKNOWLEDGMENTS
We thank K Wei and D Mai for discussions of repetitive DNA and genome assembly. This work was supported by NIH grants R01GM076007, R01GM101255 and R01GM093182 to D. Bachtrog and 5F32GM123764-02 to R. Bracewell.