Whole Genome Sequence of the Parasitoid Wasp Microplitis demolitor That Harbors an Endogenous Virus Mutualist

Microplitis demolitor (Hymenoptera: Braconidae) is a parasitoid used as a biological control agent against larval-stage Lepidoptera and serves as a model for studying the function and evolution of symbiotic viruses in the genus Bracovirus. Here we present the M. demolitor genome (assembly version 2.0), with a genome size of 241 Mb and N50 scaffold and contig sizes of 1.1 Mb and 14 kb, respectively. Using RNA-Seq data and manual annotation of genes of viral origin, we produced a high-quality gene set that includes 18,586 eukaryotic and 171 virus-derived protein-coding genes. Bracoviruses are dsDNA viruses with unusual genome architecture, in which the viral genome is integrated into the wasp genome and comprises two distinct components: proviral segments that are amplified, circularized, and packaged into virions for export into the wasp’s host via oviposition; and replication genes. This genome assembly revealed that at least two scaffolds contain both nudivirus-like genes and proviral segments, demonstrating that at least some of these components are near each other in the genome on a single chromosome. The updated assembly and annotation are available in several publicly accessible databases, including the National Center for Biotechnology Information and the Ag Data Commons. In addition, all raw sequence data available for M. demolitor have been consolidated and are available for visualization at the i5k Workspace. This whole genome assembly and annotation represents the only genome-scale, annotated assembly from the lineage of parasitoid wasps that has associations with bracoviruses (the ‘microgastroid complex’), providing important baseline knowledge about the architecture of co-opted virus symbiont genomes.

dispersed in the wasp genome and organized in ways that enable formation of replication-defective virions that wasps use to infect hosts (Bézier et al. 2009; Bézier et al. 2013; Herniou et al. 2013). The elements of Microplitis demolitor bracovirus (MdBV) within the M. demolitor genome have been described in depth, using the assembly named Mdem1 as a reference. Although the M. demolitor genome sequence was generated primarily to study MdBV, few genomic resources are available for braconid wasps and other parasitoids, which makes the wasp genome useful for researchers in other fields (e.g., Geib et al. 2017; Bewick et al. 2017; Zhou et al. 2015). In this manuscript, we announce the full genome sequence of M. demolitor with an improved assembly and an annotated gene set for both wasp and viral genes. This publicly available genome assembly will continue to facilitate research on bracoviruses and will also provide a resource to help address other questions specific to M. demolitor and to enable comparative analyses with other insect species.

Wasp samples
Wasp samples were derived from a culture maintained at the University of Georgia as described previously (Burke 2016). DNA was isolated from single and pooled male wasps with a high-salt precipitation method, which maintains the integrity of high molecular weight DNA, as described previously.

Whole genome sequencing and assembly
In addition to the previously reported sequencing libraries (180 bp, 1.5 kb, 5 kb, and 10 kb), a new 20 kb long-insert mate-pair library was constructed from pooled adult male DNA using Illumina's Nextera Mate-Pairs Sample Prep Kit. All libraries were sequenced for 100 cycles on a HiSeq2000 using TruSeq chemistry. Raw reads were trimmed, filtered, and error-corrected as described previously. The Mdem1 assembly was further improved by additional scaffolding with the 20 kb Nextera mate-pair library and by using GapCloser v1.12 with short paired-read data to close gaps generated in the scaffolding process (Luo et al. 2012). The genome assembly was screened by NCBI during the whole genome submission process to filter out adapter, vector, and other contaminant sequences. Methods used to generate RNA-Seq and viral DNA libraries and sequence data used for mapping have been described previously (Bitra et al. 2016; Burke and Strand 2012; Burke 2016).

Automated annotation of the M. demolitor genome
Structural and functional annotation of genes was performed with the NCBI Eukaryotic Genome Annotation Pipeline. This automated pipeline utilized short read transcript evidence from existing RNA-Seq data for M. demolitor, in addition to the MdBV proviral segments present in GenBank (Webb et al. 2006), NCBI RefSeq protein sets for Fopius arisanus, Nasonia vitripennis and Apis mellifera, and 81,697 protein sequences from GenBank derived from the Insecta. Alignments were used to inform gene model prediction using the NCBI eukaryotic gene prediction tool GNOMON. Details of the annotation process can be accessed at: https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/. The completeness of the annotated gene set was analyzed by identifying the number of arthropod Benchmark Universal Single-Copy Orthologs (BUSCOs) (Simão et al. 2015). BUSCO v.1.1b1 was run on the RefSeq Gene set at the predicted peptide level ("-m OGS"). BUSCO results were compared to the RefSeq Gene sets for the braconid species F. arisanus and D. alloeum as well as Nasonia vitripennis, for which a large portion of the genome is mapped to one of five chromosomes (Werren et al. 2010).

Manual annotation of M. demolitor genes of viral origin
Manual verification or correction of nudivirus-like replication genes and proviral genes was performed using the M. demolitor jBrowse/Apollo instance hosted at the USDA National Agricultural Library i5k Workspace. Protein sequences from the previously published manually curated viral gene set from the Mdem1 assembly were aligned to the genome using a modified version of exonerate v. 2.3.0 in which the gff3 output is compatible with jBrowse for upload as a custom track (available at https://github.com/hotdogee/exonerate-gff3). Exonerate alignments were used as the basis for correcting existing gene models or adding gene models missing from the Mdem2 annotation. The boundaries of proviral segments and replication units in the Mdem2 assembly were identified by searching for sequence motifs that define these regions, along with short read mapping data from existing deep sequencing of MdBV viral DNA and of DNA isolated from ovaries when replication and associated amplification of viral DNA is at its peak (Burke et al. 2015). Short read data were filtered with the FASTX-Toolkit to retain reads with a Phred score equivalent >30 for >90% of bases within a read. Quality-filtered reads from sequenced DNAs were mapped to the Mdem2 assembly (Kim et al. 2015). Any reads that did not map to the proviral segments were removed using samtools v.1.3.1 (Li et al. 2009).
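The read-level quality criterion described above (a minimum Phred score across a minimum fraction of bases in each read) can be sketched in Python as follows. This is an illustrative sketch only, not the FASTX-Toolkit implementation: the function names are hypothetical, the thresholds are treated as inclusive, and Phred+33 quality encoding is assumed.

```python
def passes_quality_filter(quality_string, min_phred=30, min_percent=90.0, offset=33):
    """Return True if at least min_percent of bases have Phred >= min_phred.

    Assumptions (not taken from the source): Phred+33 encoding and
    inclusive thresholds.
    """
    if not quality_string:
        return False
    scores = [ord(ch) - offset for ch in quality_string]
    good = sum(1 for score in scores if score >= min_phred)
    return 100.0 * good / len(scores) >= min_percent


def filter_fastq_records(records, **thresholds):
    """Yield (header, sequence, quality) tuples whose quality string passes."""
    for header, sequence, quality in records:
        if passes_quality_filter(quality, **thresholds):
            yield header, sequence, quality


if __name__ == "__main__":
    # Hypothetical reads: 'I' encodes Phred 40 and '5' encodes Phred 20.
    reads = [
        ("read1", "ACGTACGTAC", "IIIIIIIII5"),  # 90% of bases at Phred >= 30
        ("read2", "ACGTACGTAC", "IIIIIIII55"),  # 80% of bases at Phred >= 30
    ]
    for header, _, _ in filter_fastq_records(reads):
        print(header)
```

Whether a read sitting exactly at a threshold is retained depends on how the cutoff is defined, so the comparison operators may need adjusting to match a particular tool.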

Data Availability
All raw sequencing data are available from the NCBI Sequence Read Archive (see Table 1 for accessions).

RESULTS AND DISCUSSION
In total, approximately 17.5 Gb of small-insert sequence data were generated from a single male adult wasp for the Mdem1 assembly, along with 129.4 Gb of data generated from larger insert libraries (1.5, 5, 10, and 20 kb insert sizes) for scaffolding purposes (Table 1). The data derived from the 20 kb library were not included in the previous assembly, Mdem1. Assembly of these sequence data with SOAPdenovo resulted in a new assembly (Mdem2) that consisted of 1,794 scaffolds with an N50 size of 1.1 Mb and 27,508 contigs with an N50 of 14.12 kb (Table 2). The assembly was 241.2 Mb in total length, which agrees closely with the genome size estimated by flow cytometry (241 ± 6 Mb). Only 14.6% of the genome assembly consisted of sequence gaps. The overall G + C nucleotide content was 33.1%. These assembly statistics are a large improvement over the Mdem1 assembly, with approximately 65% fewer scaffolds and an N50 size 3.6x longer (Table 2). Genome assemblies are available for three other braconid wasp species, while sequences are available for a fourth (Cotesia vestalis) but have not been scaffolded. The Mdem2 assembly statistics are similar to those of these other braconids and Nasonia vitripennis (family Pteromalidae) in both genome size and G + C content (Table 2).
Annotation using the NCBI Eukaryotic Annotation Pipeline yielded 12,755 genes or pseudogenes, including 12,144 containing protein-coding regions. A total of 19,597 transcripts were annotated, with a mean of 1.54 (median 1) transcripts per gene (Table 3). Evidence for gene annotations was derived from RNA-Seq data from adult wasp ovaries, venom glands, and teratocytes, and from larvae (Table 4), from proteins from related species, or from ab initio evidence predicted by GNOMON. A large proportion of transcripts (16,219 of 18,586; 87.2%) were fully supported with experimental evidence. A total of 526 non-coding genes were identified, including tRNAs, lncRNAs and others.
Details of the annotation are presented in Table 3 as well as online at https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Microplitis_demolitor/101/. BUSCO analysis revealed that the M. demolitor genome assembly and annotation are highly complete, with 97% of all BUSCOs conserved in Insecta identified in the protein-coding gene set (Table 5). Only 1.2% of BUSCOs were present as fragments in the M. demolitor annotation, and 0.7% were missing. These results are very similar to BUSCO analyses of other hymenopteran genomes (Table 5).
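For readers less familiar with the N50 statistic used above to summarize scaffold and contig contiguity, the following Python sketch shows one common way to compute it from a list of sequence lengths. It is a generic illustration under the usual definition (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly), not the software used to produce the values in Table 2, and the example lengths are invented.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and summed until the
    running total reaches at least half of the combined length; the
    length of the sequence that crosses that point is the N50.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Invented toy lengths; the total is 32, so the halfway point is 16.
    toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
    print(n50(toy_lengths))  # prints 8
```

Because N50 depends only on the length distribution, adding long-range scaffolding information can raise the scaffold N50 substantially while leaving the contig N50 largely unchanged.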
As previously noted, BV genomes are integrated into the genomes of wasps. They also consist of two distinct components: proviral segments and nudivirus-like replication genes (Bézier et al. 2009; Figure 1). Expression of nudivirus-like replication genes in wasp ovaries results in formation of virions, while proviral segments, bounded by excision motifs targeted by specific nudivirus-like replication genes, are amplified in regions known as Replication Units (RUs), circularized and packaged into virions (Burke et al. 2013; Bézier et al. 2009; Annaheim and Lanzrein 2007; Savary et al. 1997; Bézier et al. 2013; Burke et al. 2015; Louis et al. 2013). This results in virions that package genes on proviral segments but lack all nudivirus-like replication genes. The genes located on proviral segments are often short and many contain introns (Webb and Strand 2005; Desjardins et al. 2008; Espagne et al. 2004). In contrast, no introns have been described for the nudivirus-like replication genes in bracoviruses (but see below) (Bézier et al. 2009). M. demolitor genes of viral origin were previously described from manual annotation of the Mdem1 assembly. The genome contained 26 proviral segments that are amplified in eight replication units located at eight loci on M. demolitor scaffolds. Ninety-five genes were identified in proviral segments, while 76 nudivirus-like replication genes were located on 30 different genome scaffolds. Evidence for these gene models was derived from RNA-Seq data from wasp cells and tissues as above and also from MdBV-infected hemocytes (Table 4). Only a single nudivirus-like replication gene was located on the same scaffold as a proviral segment (HzNVorf93-like and Segment T).
The Mdem2 automatic annotation performed by GNOMON correctly recovered 90% of the M. demolitor viral genes. Eighteen genes that were either missing or incorrectly annotated were manually corrected using alignment with older gene models in the M. demolitor Mdem2 jBrowse/Apollo instance hosted at the i5k Workspace. An additional four gene models (lef-8, lef-9, HzNVorf128-like, and K425_12) were edited to reflect the presence of introns that were previously unidentified.
The architecture of the proviral portion of the M. demolitor genome did not change appreciably between the Mdem1 and Mdem2 assemblies, with proviral segments still located in 8 loci across 9 scaffolds. Coordinates for proviral segments and replication units in the Mdem2 assembly are listed in Table 6. The nudivirus-like replication genes were located on 24 different scaffolds (5 fewer than in the Mdem1 assembly). One major difference was that an additional link between proviral segments and nudivirus-like replication genes was identified. Locus 2, containing Segments V, W, E, C and X, was located on a 2.4 Mb scaffold approximately 75 kb away from the nudivirus-like gene p74, and more than 323 kb from several other nudivirus-like replication genes (35a-8 to 35a-14; odv-e66-9 to odv-e66-20; 35a-6 and 35a-7; and helicase). The entire set of proviral segments, replication units, and viral genes is available as gff3 and sequence files at AgDataCommons (http://dx.doi.org/10.15482/USDA.ADC/1432667) and can be uploaded as custom tracks for visualization at the i5k Workspace.
In addition to updating annotation of the regions of viral origin in the M. demolitor genome, we also consolidated all sequence-based resources we have available for M. demolitor on the jBrowse/Apollo instance of the genome hosted at the i5k Workspace. Genome resources include the most recent genome assembly (Mdem2) and gene sets from NCBI Annotation Release 101. Transcriptome data (in the form of BigWig coverage plots and mapped reads) are available for ovary, teratocyte, venom gland, and larval samples from wasps (Table 4). We have also contributed transcriptome data for all MdBV genes that are expressed in infected host caterpillars. These include data from hemocytes and whole-body samples of the permissive host Chrysodeixis includens and the semipermissive host Trichoplusia ni (Table 4). Finally, mapped DNA data are available from deep sequencing of DNAs isolated from MdBV virions and M. demolitor ovaries when proviral segment amplification is at its peak (Table 4). These data will facilitate the exploration of the evolution and function of MdBV and other viral symbionts in the future.
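As an illustration of how linkage between proviral segments and nudivirus-like replication genes might be checked from a gff3 file, the Python sketch below groups features by scaffold and reports the smallest gap between the two feature classes on any scaffold that carries both. It is a minimal sketch under stated assumptions: the feature type labels ("proviral_segment", "nudivirus_like_gene"), scaffold names, and coordinates are hypothetical and are not taken from the deposited files, which may encode these classes differently.

```python
from collections import defaultdict


def parse_gff3(lines):
    """Yield (seqid, feature_type, start, end) from GFF3 text lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # not a complete nine-column feature line
        yield cols[0], cols[2], int(cols[3]), int(cols[4])


def gap_between(a_start, a_end, b_start, b_end):
    """Bases separating two 1-based inclusive intervals (0 if they overlap or touch)."""
    return max(0, max(a_start, b_start) - min(a_end, b_end) - 1)


def shared_scaffolds(features, segment_type, gene_type):
    """Map each scaffold carrying both feature types to its smallest segment-gene gap."""
    by_scaffold = defaultdict(lambda: {segment_type: [], gene_type: []})
    for seqid, ftype, start, end in features:
        if ftype in (segment_type, gene_type):
            by_scaffold[seqid][ftype].append((start, end))
    result = {}
    for seqid, groups in by_scaffold.items():
        segments, genes = groups[segment_type], groups[gene_type]
        if segments and genes:
            result[seqid] = min(
                gap_between(s0, s1, g0, g1)
                for s0, s1 in segments
                for g0, g1 in genes
            )
    return result


# Hypothetical records; type labels and coordinates are invented for illustration.
EXAMPLE_GFF3 = [
    "##gff-version 3",
    "scaffold_A\tmanual\tproviral_segment\t1000\t5000\t.\t+\t.\tID=seg1",
    "scaffold_A\tmanual\tnudivirus_like_gene\t80001\t81000\t.\t-\t.\tID=gene1",
    "scaffold_B\tmanual\tnudivirus_like_gene\t200\t900\t.\t+\t.\tID=gene2",
]

if __name__ == "__main__":
    linked = shared_scaffolds(
        parse_gff3(EXAMPLE_GFF3), "proviral_segment", "nudivirus_like_gene"
    )
    print(linked)  # {'scaffold_A': 75000}
```

Such a check only detects linkage within scaffolds; components on different scaffolds may still lie on the same chromosome.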
The M. demolitor genome described herein represents a high-quality assembly. The assembly of the genome has greatly benefitted from a sequencing strategy in which contigs were built from sequences derived from a single haploid male wasp, followed by scaffolding using sequence data from large-insert libraries. The Mdem1 assembly was also significantly improved with the addition of sequence data derived from a large-insert (20 kb) mate-pair library used in the Mdem2 assembly. The M. demolitor annotated gene set is similar to those of select other parasitic Hymenoptera in terms of numbers of genes and estimated completeness.
The Mdem2 assembly also provides a more complete picture of the architecture of the MdBV genome in the wasp genome. While proviral segments share no similarity with sequences from pathogenic nudiviruses, prior results showing that nudivirus-like integrases recognize excision motifs on proviral segments strongly suggest that the proviral segments and nudivirus-like replication genes have shared ancestry (Burke et al. 2013). While it is unclear how rearrangement of the viral genome was achieved in the wasp genome, the physical location of several nudivirus-like replication genes and proviral segments in neighboring regions of M. demolitor chromosomes provides further evidence for their shared evolutionary history. Future assemblies with new long-read sequencing technologies generating chromosome-length scaffolds will provide further insight into the location of viral genome components relative to each other. These data will help to determine whether proviral segment loci and nudivirus-like replication genes are limited to either single or multiple chromosomes in the wasp genome, which will provide information about the events leading to the inception of viral sequences in the wasp genome and their maintenance over time.

ACKNOWLEDGMENTS
This work was supported by the National Institutes of Health (F32 AI096552) (GRB), the US National Science Foundation (IOS-12611328) (MRS) and (DEB-1622986) (GRB), the US Department of Agriculture

a Each proviral segment and its associated locus is listed in a row along with the M. demolitor genome scaffold where it is located. Scaffold accession numbers are indicated along with the coordinates for the boundaries of each proviral segment. Amplification start and end coordinates are listed for each RU that contains one segment. For multi-segment RUs, the amplification start and end coordinates correspond to the outermost segments. ">" signs indicate that gaps in scaffolds or scaffold termini prevent determination of segment or replication unit ends. "*" is similar to ">", but segment ends are detected in smaller contigs that were not incorporated into scaffolds (e.g., ends of Segments V and W are in NW_014463725.1, Mdem_contig_4120015, while the end of Segment I is in NW_014463324.1, Mdem_contig_4046930).