Genome Report: Whole Genome Sequence and Annotation of the Parasitoid Jewel Wasp Nasonia giraulti Laboratory Strain RV2X[u]

Jewel wasps in the genus of Nasonia are parasitoids with haplodiploidy sex determination, rapid development and are easy to culture in the laboratory. They are excellent models for insect genetics, genomics, epigenetics, development, and evolution. Nasonia vitripennis (Nv) and N. giraulti (Ng) are closely-related species that can be intercrossed, particularly after removal of the intracellular bacterium Wolbachia, which serve as a powerful tool to map and positionally clone morphological, behavioral, expression and methylation phenotypes. The Nv reference genome was assembled using Sanger, PacBio and Nanopore approaches and annotated with extensive RNA-seq data. In contrast, Ng genome is only available through low coverage resequencing. Therefore, de novo Ng assembly is in urgent need to advance this system. In this study, we report a high-quality Ng assembly using 10X Genomics linked-reads with 670X sequencing depth. The current assembly has a genome size of 259,040,977 bp in 3,160 scaffolds with 38.05% G-C and a 98.6% BUSCO completeness score. 97% of the RNA reads are perfectly aligned to the genome, indicating high quality in contiguity and completeness. A total of 14,777 genes are annotated in the Ng genome, and 72% of the annotated genes have a one-to-one ortholog in the Nv genome. We reported 5 million Ng-Nv SNPs which will facility mapping and population genomic studies in Nasonia. In addition, 42 Ng-specific genes were identified by comparing with Nv genome and annotation. This is the first de novo assembly for this important species in the Nasonia model system, providing a useful new genomic toolkit.

Nasonia has been a good model for insect research (Werren and Loehlin 2009;Lynch 2015;Whiting 1967;Beukeboom and Desplan 2003). Whole-genome sequencing efforts have been made in Nv, Ng and Nl (Werren et al. 2010). The Nv genome was sequenced with 6X coverage Sanger sequencing to generate a de novo assembly, whereas Ng and Nl genomes were sequenced with 1X coverage supplemented with short-read sequences, and aligned to the Nv assembly for reference-based genomes (Werren et al. 2010). Plenty of datasets have been published for Nv genome and transcriptomes after its reference genome was available. Crosses between Nv and Ng have been extremely successful for mapping and positional cloning of genes involved in species differences (Werren and Loehlin 2009;Niehuis et al. 2013), in some cases using chromosomal regions of Ng introgressed into an Nv background (Hoedjes et al. 2014). Comparative genomics between Ng and Nv is informative to investigate many aspects in Nasonia biology, such as behavior (Raychoudhury et al. 2010), development (Loehlin and Werren 2012), pheromones (Niehuis et al. 2013), sex determination (Verhulst et al. 2010), gene expression (Wang et al. 2015Rago et al. 2020), venom evolution (Martinson et al. 2017) and regulation by DNA methylation (Beeler et al. 2014;Pegoraro et al. 2016;Wang et al. 2013). Therefore, a well-assembled reference genome of Ng will advance utility of the system by the research community. In this study, we generated a high-quality reference genome assembly for N. giraulti, which will provide essential new genomic tools for Nasonia research.

MATERIALS AND METHODS
DNA extraction, library preparation, and sequencing DNA was extracted from 24-hour male adults of the N. giraulti RV2X [u] strain. High molecular weight (HMW) genomic DNA (gDNA) was isolated using MagAttract HMW DNA Mini Kit (Qiagen, MD). The quality of extracted gDNA was examined on a Qubit 3.0 Fluorometer (Thermo Fisher Scientific, USA). The size distribution of the extracted gDNA was accessed using the genomic DNA kit on Agilent TapeStation 4200 (Agilent technologies, CA).
A 10X Genomic library was prepared with the Chromium Genome Reagent Kits v2 on the 10X Chromium Controller (10X Genomics Inc., CA). In brief, HMW gDNA was diluted from original concentrations to 0.9 ng/ml with EB buffer. The diluted denatured gDNA, sample master mix and gel beads were loaded to the genomic chip, and then ran on 10X Chromium Controller to create Gel Bead-In-EMulsions (GEMs). After the run, the obtained GEMs were used for the subsequent incubation and cleanup. Chromium i7 Sample Index was used as the library barcode. Quality control of post library construction was accessed with Qubit 3.0 Fluorometer and Agilent TapeStation 4200. The prepared 10X genomic library was sequenced on a HiSeq X sequencer at the Genomic Services Lab at the HudsonAlpha Institute for Biotechnology. An Illumina short-read resequencing library (300 bp insert size) was made from genomic DNA samples extracted from six N. giraulti adult males (whole body), using TruSeq DNA Sample Prep Kit. Approximately 50X paired-end sequencing was done using Illumina HiSeq 2000 platform.
Total RNA extraction, library preparation, and sequencing of developmental stage samples Male and female N. giraulti RV2X(u) strain samples were collected at five developmental stages: 0-10hr early embryo, 14-24hr late embryo, 44-54hr larva, yellow pupa and 1-day adult. Sarcophaga bullata pupae were inserted into foam plugs, with only anterior available for oviposition. To obtain the male samples, two host pupae were provided to two virgin female wasps, allowing host feeding for 48 hr. These unmated females lay unfertilized eggs and produce all-male progeny. For female sample collection, mated females will produce more than 90% daughters under the experimental conditions, allowing the expression quantification of mostly female progeny for embryo and larva stages. Six individuals were pooled per stage, except early embryos for which 40 individuals were pooled due to the small size. All samples were homogenized in 1 mL TRIzol and stored at -80C freezer. Total RNA extractions, quantification, library preparation and sequencing protocol were previously described (Martinson et al. 2017).

Genome assembly and assessment
The raw sequencing reads from both 10X library and Illumina resequencing library were checked for sequencing quality by FastQC v11.5 (Andrews 2010) before used for genome assembly. The genome assembly strategy of N. giraulti includes the constructions of three draft de novo assemblies using different assemblers and a final step to reconcile three draft assemblies into a final high-quality assembly. The first de novo assembly of N. giraulti genome was performed with Figure 1 Image of N. giraulti and its geographic distribution in the North America based on Darling & Werren (1990).
the Supernova 2.0 assembler (Weisenfeld et al. 2017) using linked reads from 10X Genomics library. To achieve the best de novo assembly result, we examined a grid of barcode subsampling percentage parameters and the maximum number of input reads including no barcode subsampling with all linked reads. A second de novo assembly was conducted by MEGAHIT v1.2.9 (Li et al. 2015). The 10X linked reads were transferred to regular paired-end Illumina sequencing reads by trimming the barcode sequences and potential adaptor sequences with Trimmomatic v0.38 (Bolger et al. 2014). All trimmed sequencing reads were used for the second de novo assembly using MEGAHIT v1.2.9 (Li et al. 2015) with all default parameter settings. In addition, a third de novo assembly (ngirB_goodCOV) was generated by velvet v1.2.10 (Zerbino and Birney 2008) using sequencing reads from the Illumina short-read resequencing library.
A final high-quality assembly was generated by merging these three draft assemblies using an assembly reconciliation tool Metassembler v1.5 (Wences and Schatz 2015). All reverse complementary scaffolds with same length, coverage, A/T/C/G counts, as well as the duplicated scaffolds identified by self-BLAT version 35 (Kent 2002) were removed from the final assembly. In addition, potential contaminating bacterial scaffolds were checked and removed from the assembly, using a combination of methods mentioned in our previous publications (Wang et al. 2019;Wheeler et al. 2013;Ferguson et al. 2020). To estimate the contiguity and completeness of our genome assembly, three evaluation pipelines were performed: (1) genome sequencing reads were aligned to our assembly with BWA-MEM aligner version 0.7.17 (Bernt et al. 2013); (2) transcriptomic data of different developmental stages and sexes were mapped to the current assembly using Tophat v2.1.1 (Trapnell et al. 2009); (3) The BUSCO (Seppey et al. 2019) score of our genome assembly was calculated by aligning to arthropoda_odb9 with a total of 1,066 orthologs.

Genome annotation
The annotation of the N. giraulti genome was performed using MAKER version 2.31.9 (Cantarel et al. 2008) based on the following pipeline: (1) A custom N. giraulti repeat database constructed with RepeatModeler v.1.08 using the default parameter settings, with low complexity repeat regions soft-masked by MAKER; (2) A de novo assembly of the N. giraulti transcriptomes by Trinity v 2.4.0 (Haas et al. 2013) and pre-aligned transcripts annotated by Cufflinks v2.2.1 (Trapnell et al. 2012).For gene annotation, ab initio gene prediction algorithms were trained to predict gene models using protein and transcriptome evidences by EST2GENOME and PROTEIN2GE-NOME in MAKER. After filtered based on gene length and quality, the predicted genes were then used to train both the SNAP and the AUGUSTUS gene predictors. The results were fed to MAKER to repeat this procedure for another round, to generate the final predicted genes in N. giraulti genome. Default parameters were used except where otherwise noted.
Comparitive ananlysis between N. giraulti and N. vitripennis genomes To compare the genome structure between N. giraulti and N. vitripennis genomes, we conducted whole-genome alignment of our Ng assembly and the recent Nv genome assembly of (Dalla  using NUCmer in the MUMmer v4.0 program suite with default p parameter settings (Kurtz et al. 2004). The pairwise alignments (match length longer than 500bp) between Ng scaffolds and Nv chromosomes were visualized using Mummerplot (Kurtz et al. 2004).
To identify the candidate Ng specific genes, genes with no assigned orthogroup between N. giraulti and N. vitripennis were generated using OrthoFinder v2.2.7 (Emms and Kelly 2019). The Ng genes identified to have no assigned orthogroup with Nv were potential candidates for Ng specific genes. To ensure the absence of these candidates in Nv genome, protein sequences of these candidate Ng-specific genes were BLASTed to two Nv genome assemblies, including the Nv reference genome assembly (GCA_000002325.2) (Werren et al. 2010) and the newly released Nv PSR1.1 genome assembly using PacBio and Nanopore platforms (GCA_009193385.1) (Dalla  with an E-value cutoff of 1E-5 and protein length larger than 30. Genes with no BLAST hit to the two Nv genome assemblies were then aligned to the annotated Ng transcripts. The annotated Ng transcripts were generated with available Ng RNA-Seq data from different n■

Reference
This study (Werren et al. 2010) developmental stages and sexes (Embryo stage of 0-10 hr, 10-24 hr, 24-36 hr, female and male pupa and adult) using Cufflinks (Trapnell et al. 2012). Genes with support from annotated transcripts were kept as Ng-specific candidates. The protein sequences of these genes were aligned to the Nv PSR1.1 and Trichomalopsis sarcophagae assemblies using tBLASTn with an E-value cutoff of 1E-5. The final genes were annotated using both Blast2GO and KofamKOALA with an E-value cutoff of 1E-4.

Phylogenomic analysis
We conducted a phylogenomic analysis using our assembled

Data availability
The Ng genome assembly is available in GenBank with accession number QLYP00000000. Raw sequencing data are available in the NCBI Sequence Read Archive under the accession number PRJNA476699. Supplemental material available at figshare: https:// doi.org/10.25387/g3.12433559.

RESULTS AND DISCUSSION
Genome assembly and assessment Supernova 2.0 assembler (Weisenfeld et al. 2017) was used for the Ng genomic assembly with the barcode subsampling strategy. The best Supernova assembly has a contig N50 of 36.14 Kb and a scaffold N50 of 400.25 Kb, which was obtained by using 20% barcode subsampling of 140 million input reads. Interestingly, using all available reads with no barcode subsampling provided the worst assembly result. This can be caused by the overkill of reads coverage (.600X), which might lead to fragmented assembly due to the presence of sequencing errors. The draft de novo assembly was found to contain some artifacts, which was also reported for this assembler in a recent study (Helmkampf et al. 2019). We removed all the identical or nearly identical scaffolds as well as reverse complementary scaffolds prior to subsequent analyses. All these three de novo assemblies generated from different algorithms were further reconciled using an assembly reconciliation tool Metassembler (Wences and Schatz 2015). To identify the mitochondrial scaffold, we aligned the final assembly to the previously assembled mitochondrial genome of N. giraulti. Scaffolds with high identity (.90%) and high coverage (.16,000X) were assigned as mitochondrial scaffolds (Supplemental Figure S1). The detailed genome statistics of our final assembly of N. giraulti and all other available wasp genomes, including previous assembled genomes are listed in Table 1. The final genome assembly of N. giraulti is a total of 259,040,977 bp in 3,160 scaffolds. The contig N50 is 34,917 bp and the scaffold N50 is 545,346 bp, respectively. The previous Ng assembly was based on 1X Sanger and 10X Illumina short-read alignments to an earlier Nv assembly (Werren et al. 2010). Comparing to reference-assisted Ng assembly, our de novo assembly was significantly improved in contig level with much lower number of contigs and larger contig N50. The gap percentage is only 1.5% of the whole assembly, which surpasses most of the previous Nasonia genome assemblies. Although the scaffold N50 of the whole Ng genome is 545 kb, the scaffold N50 of the protein coding gene-contained scaffolds (a total of 1,393 scaffolds) is 664.6 Kb, indicating the high quality of our current assembly in the genic regions.
The 10X Genomics reads were aligned to the final assembly to compute the summary statistics. The average scaffold coverage is 671.87X and the GC-content is 41.4% (Supplemental Figure 1). RNAseq reads from different development stages (see Methods) of N. giraulti were also aligned to the final assembly with an average mapping percentage of 97%, indicating a high-quality assembly of Ng genome. To assess the completeness of this genome, the BUSCO scores of all five genome assemblies were generated (Table 1). The BUSCO completeness score for the current assembly of N. giraulti is 98.6% (N = 1,066; Complete: 98.6%; Duplicated: 3.0%; Fragmented:0.4%; Missing:1.0%), indicating a high level of completeness of our genome assembly.
Genome comparison between N. giraulti and N. vitripennis Ng scaffolds were mapped to each chromosome of the Nv assembly (GCA_000002325.2) (Werren et al. 2010) with BWA-MEM aligner (Bernt et al. 2013). Overall the alignments are consistent between Ng and Nv with a few insistencies (Figure 2). A total of 1,137 Ng scaffolds were aligned to Nv chromosomes (Table 2 and Supplemental Table 1), accounting for 89.3% of the total chromosome length in Nv. The average sequence identity in these aligned regions is 93.23%. As a useful tool for comparative analysis and interspecific mapping, we provide a set of 5,147,972 high-quality single nucleotide polymorphisms between the Ng and Nv genome assemblies (Supplemental Data 1). The SNPs fall 6.1% percent into exons (3.4% of these are synonymous and 2.7% are nonsynonymous), 16.3% percent in introns, and 77.6% percent are intragenic. These represent either species-specific or strain-specific differences, which will be resolved in the resequencing of multiple Ng strains in future work.

Genome annotation
In our current N. giraulti assembly, we have identified a total repeat content of 83,899,561 bp, by using an Ng specific repeat library, consisting of approximately 32.39% of the genome assembly (Table  2). Among the classified repetitive elements, the top three repeat types are DNA elements (7.58%), LINEs (6.71%) and SINEs (6.68%) ( Table 3). After all the repeat regions were soft-masked by MAKER, the final annotation resulted in 14,777 protein coding genes. By comparing the annotated genes in Ng with Nv, there are 10,640 1:1 orthologs between Ng and Nv, and 83.7% Ng genes were assigned in orthogroups between Ng and Nv.

Identification of genes present in Ng but not Nv
We further compared the Ng gene sets with the Nv annotated gene set OGS2 (Rago et al. 2016) to determine if there are any candidates for Ng-specific genes (see Method and Supplemental Figure S2). A total of 2,361 Ng-specific candidate genes were generated by Orthofinder (Emms and Kelly 2019). The protein sequences of these candidate genes were BLASTed to the Nv genome. A total of 112 Ng candidate genes showed no hits to the reference and Nv PSR1.1 genome n■ assemblies (Dalla . To exclude potential pseudogenes in Ng, these 112 candidate genes were then aligned to the Ng transcripts annotated by Cufflinks (Trapnell et al. 2012) and 45 genes were retained. The protein sequences of these genes were aligned to Nv PSR1.1 again using tBLASTn and three more genes were excluded (E-value cutoff 1E-5), resulting in final list of 42 Ng-specific genes (Supplemental Data S2). 28 of these Ng-specific genes have a tBLASTn hit in Trichomalopsis sarcophagae (TSAR), which is a sister species to the Nasonia genus, suggesting that they could be degenerated genes in Nv. We therefore divide this class further into 28 "Nv absent" genes, which are not present in the annotated Nv genome but are found in the closely related species Trichomalopsis sarcophagae, and 14 candidate "Ng novel" genes, which are not found in either Nv or TSAR. Among these Ng-specific genes, eight genes are annotated with E-value , 1E-4 and identity .40% to the NCBI NR database. These include hypothetical protein TSAR_007225, NADH dehydrogenase (ubiquinone) flavoprotein 3, T-complex protein 1 subunit eta, gem associated protein 4, PREDICTED uncharacterized protein LOC107980813, collagen type II alpha, [histone H4]-N-methyl-L-lysine20 N-methyltransferase, and neuropeptides capa receptor-like gene. The BLAST2GO functional analysis revealed that these 42 genes are enriched for genes involved in gluconate transmembrane transporter activity (Supplemental Figure S3 and Data S2). The genes warrant further study to investigate their possible origins and functions.

Phylogenomic relationship with arthropod genomes
We compared the Ng genome to 8 other sequenced arthropod genomes (fruit fly, pea aphid, honey bee, water flea, human lice, mosquito, silk moth and jewel wasp Nv), to identify a core gene set for phylogenomic analysis. A total of 348 single-copy 1:1 orthologs (listed in Supplemental Data S3) were identified. Ng is most closely related, to Nv, and they cluster with honey bee, another Hymenoptera species (Figure 3). These 348 single-copy ortholog provide a useful gene set for evolutionary analysis.

CONCLUSIONS
This study describes the assembly and annotation of the genome for Nasonia giraulti, a key model organism in speciation and evolutionary studies that range in focus from pheromones and sex determination to behavior and memory. The assembly of 259 Mbp is very complete with a 98.6% BUSCO completeness and aligns to 89% of the genome of its sister species, Nasonia vitripennis. We predicted and analyzed 14,777 protein-coding genes that offer insights into the development and evolution of N. giraulti. We identified 5 million SNPs and 42 genes that are unique to N. giraulti when compared to N. vitripennis. This de novo assembled genome will provide a powerful tool in comparative genomics and evolution to the model parasitoid wasp N. vitripennis and will enhance future studies in the behavior, development, pheromones, repeat evolution, mitochondrianuclear interaction, and parasitoid-host biology. Figure 3 Phylogenetic relationships of N. giraulti with eight selected arthropod species. A phylogenetic tree of N. giraulti with 8 other arthropod species was constructed based on a total of 348 single-copy 1:1 orthologs. The selected arthropod genomes are from fruit fly, pea aphid, honey bee, water flea, human lice, mosquito, silk moth and jewel wasp Nasonia vitripennis.