Genome assembly and annotation of the European earwig Forficula auricularia (subspecies B)

Abstract The European earwig Forficula auricularia is an important model for studies of maternal care, sexual selection, sociality, and host–parasite interactions. However, detailed genetic investigations of this species are hindered by a lack of genomic resources. Here, we present a high-quality hybrid genome assembly for Forficula auricularia using Nanopore long-reads and 10× linked-reads. The final assembly is 1.06 Gb in length with 31.03% GC content. It consists of 919 scaffolds with an N50 of 12.55 Mb. Half of the genome is present in only 20 scaffolds. Benchmarking Universal Single-Copy Orthologs scores are ∼90% from 3 sets of single-copy orthologs (eukaryotic, insect, and arthropod). The total repeat elements in the genome are 64.62%. The MAKER2 pipeline annotated 12,876 protein-coding genes and 21,031 mRNAs. Phylogenetic analysis revealed the assembled genome as that of species B, one of the 2 known genetic subspecies of Forficula auricularia. The genome assembly, annotation, and associated resources will be of high value to a large and diverse group of researchers working on dermapterans.


Introduction
Insects have been at the forefront of genetic research for various biological questions (Wilson-Sanders 2011; Mukherjee et al. 2015;Simons and Tibbetts 2019). However, most of the genetic studies are carried out on a small number of holometabolous insects that undergo true metamorphosis. In contrast to Holometabola, hemimetabolous insects undergo incomplete metamorphosis with a series of nymphal molts that increasingly resemble the adult form (Truman 2019). It is widely accepted that Holometabola branched out from hemimetabolous ancestors during the Permian 300 Mya (Labandeira and Phillips 1996;Yang 2001). Yet the conserved mode of development, embryonic organization, and the adult body plan of hemimetabolous insects offer a unique model for the study of developmental and evolutionary mechanisms. However, even with the increasing number of sequenced genomes, the majority belong to the Holometabola (Ylla et al. 2021). This has been a bottleneck for the exploration of the diverse biology and life history of hemimetabolous insects. To address this paucity, we report a high-quality annotated genome of the European earwig, Forficula auricularia (Dermaptera: Forficulidae).
The European earwig F. auricularia is widely distributed, comprising 2 recognized subspecies, A and B (Wirth et al. 1998). They are native to the western Eurasian region and were introduced to North America, Australia, and New Zealand, where they have quickly adapted and became abundant throughout the regions (Quarrell et al. 2018;Tourneur and Meunier 2019). The 2 subspecies A and B differ through mitochondrial divergence and in their reproductive life histories (Guillet, Guiller, et al. 2000;Guillet, Josselin, et al. 2000). Subspecies A is found in relatively colder climates and is univoltine with a long gregarious phase, whereas subspecies B is found in temperate and oceanic climates and is bivoltine (Lamb and Wellington 1974). In laboratory conditions, they fail to produce offspring by cross mating (Wirth et al. 1998). Their propensity to dwell on flower and kitchen gardens can cause significant damage to crops, flowers, and commercial vegetables and make them important agricultural pests (Campos et al. 2011;Hill et al. 2019).
They have been of particular interest for many researchers not just because of their importance in the agricultural ecosystem (Binns et al. 2021) but also their importance as a research model for various biological and evolutionary phenomena like sexual selection, maternal care, family interactions, reproductive strategy, and social behavior (Forslund 2000;Falk et al. 2014;Kramer et al. 2015;Van Meyel and Meunier 2020). They have been extensively studied by behavioral ecologists for the early evolution of group-living and family life (Falk et al. 2014). The male earwigs also show an unusual bias in their use of lateral left and right sexual organs without any conspicuous anatomical differentiation (Kamimura 2006). Like the right-handedness in humans, 90% of males of giant earwig Labidura riparia show a preference for the right penis for copulation, providing insights into the evolutionary origin of lateralization (Kamimura et al. 2021). Similarly, they are an excellent lab model to study extended phenotypes as they exhibit strange suicidal water-seeking behavior during the late stages of infection by mermithid nematodes (Herbison et al. 2019). However, their use as a genetic model has been severely limited by the lack of a reference genome.
Here, we have sequenced, assembled, annotated, and analyzed the genome of the European Earwig, F. auricularia and confirmed the subspecies identity of the individuals we used. This genome will help researchers study multiple facets of this insect's exciting biology and evolutionary characters and broaden our understanding of insect and genome evolution.

Methods and materials
Sample collection and preparation Earwigs (F. auricularia) were field collected from the Dunedin Botanic Garden (À45 51 0 27.59 00 S, 170 31 0 15.56 00 E) and reared in a temperature-controlled room (temperature: cycling from 15 to 12 C, day/night; photoperiod of L:D 16:8) in the Department of Zoology, University of Otago, Dunedin. Earwigs were snap-frozen in liquid nitrogen and stored at À80 C before dissection and subsequent nucleotide extraction. Earwigs were dissected in 1Â PBS buffer under a dissection microscope to check for nematode parasites, and only nonparasitized individuals were used in this study. The head, wings and muscles from the thorax region were used for DNA extraction to avoid the gut microbiota. Juvenile instars required for RNA extraction were obtained directly from the field.

DNA extractions
DNA was extracted using either the Nanobind Tissue Big DNA kit (Circulomics, USA) for high molecular weight DNA or DNeasy Blood & Tissue Kit (Qiagen, Germany) by following the manufacturer's protocol. Tissues from a single individual were used for each extraction. After the extraction, RNase treatment was performed using 4 ml of RNase A (10 mg/ml) per 200 ml of DNA elute. DNA was quantified in a Qubit 2.0 Fluorometer (Life Technologies, USA) and quality analyzed using Nanodrop. Low-quality DNA samples were further cleaned with 1.8Â by volume AMPure XP beads (Beckman Coulter, USA), wherever applicable, following the manufacturer's instructions and eluted in 55 ml of molecular grade water. High-quality DNA samples were stored at À20 C and were used within a week of extraction.

Linked-read library preparation and sequencing
Linked-read library was prepared at the Genetic Analysis Services (GAS), University of Otago (Dunedin, New Zealand). DNA from an adult male was extracted using the Nanobind kit and sizeselected for fragments over 40 kbp using Blue Pippin (Sage Science, USA). A 10Â linked-reads (10Â Genomics, USA) library was prepared following the manufacturer's instructions. The library was sequenced on the Illumina Nova-seq platform to generate 2 Â 151-bp paired-end reads (Garvan Institute, Australia).

Long-read library preparation and sequencing
Five long-read sequencing libraries for Oxford Nanopore MinION were prepared using the Ligation Sequencing Kit (SQK-LSK109) (Oxford Nanopore Technologies, Oxford, UK) following the manufacturer's instructions. To increase the raw Nanopore read N50, the first and the second libraries were prepared using 1.75 and 0.75 mg of DNA extracted via a Circulomics kit from 2 adult male earwigs. Both libraries were sequenced in a single Minion flow cell, flushing the flow cell to remove remains of the first library before loading the second library with a Flow Cell Wash Kit (EXP-WSH004) (Oxford Nanopore Technologies, Oxford, UK).
To increase the total raw output, the third and the fourth libraries were prepared with DNA from 2 adult female earwigs, both extracted with a DNeasy Blood & Tissue, Qiagen kit followed by the AMPure XP beads clean-up step. Input DNA for these 2 libraries were 2.6 and 3.2 mg. These were each sequenced on an individual minion flow cell. The fifth library was prepared using 3.0 mg of DNA from an adult male earwig. As before DNA was extracted using a DNeasy Blood & Tissue, Qiagen kit followed by AMPure XP beads clean-up. However, before library preparation, the DNA was sheared 5 times using a 26 G Â 0.5 00 needle (Terumo, Japan). Since the sample type and the extraction method can impact the molecular weight of extracted DNA and the nanopore sequencing output, we tried samples from both sexes and different extraction protocols to optimize our sequencing output. All prepared libraries were sequenced with R9 chemistry MinION flow cell (FLO-MIN106) (Oxford Nanopore Technologies, UK) on a MinION connected to a laptop and operated with MinKNOW (v.2.0) interface.

RNA extraction and sequencing
Total RNA from the different developmental stages, sex, and tissues was extracted using a Direct-zol RNA MicroPrep kit (Zymo Research, USA) with an on-filter DNAse treatment following the manufacturer's instructions. Samples included: whole body (gut removed) of juvenile instars 1-2 and juvenile instars 3-4, dissected tissues (antennae, head, thorax, abdomen, legs, and gonads) of adult males and females. RNA from each individual and tissue type was extracted separately. RNA was quantified on a Qubit 2.0 Fluorometer (Life Technologies, USA) and initially quality checked using a nanodrop. Only high-quality extracts were further processed and were stored at À80 C until use.
RNA integrity was evaluated on a Fragment Analyzer (Advanced Analytical Technologies Inc., USA) at the Otago Genomics Facility (OGF), University of Otago, Dunedin, New Zealand. As with most of the insect RNA extracts (Winnebeck et al. 2010) RNA quality number (RQN) values ranged from 2.5 to 10 due to the collapsing of the 28S peak; quality was thus determined via the trace rather than RQN. Four pools of samples at equimolar concentration underwent library preparation. Pools consisted of: 8 whole body extractions for juvenile instar 1-2, 8 whole body extractions for juvenile instar 3-4, individual body tissues from 5 adult males, and individual body tissues from 5 adult females. TruSeq stranded mRNA libraries were prepared and sequenced as 2 Â 100-bp paired-end reads across 2 lanes of HiSeq 2500 Rapid V2 flowcell at the OGF.

Genome size estimation
Flow cytometry and k-mer-based approach with short-read data were used to estimate the genome size. Flow cytometry analysis was performed on a single head of earwig with 2 biological replicates at Flowjoanna (Palmerston North, NZ, USA). Briefly, the earwig's head was dissociated with a pestle in 500 ml of the stock solution containing 0.1% w/v trisodium citrate dihydrate, 0.1% v/v IGEPAL, 0.052% w/v spermine tetrahydrochloride, and 0.006% sigma 7-9 (all Sigma-Aldrich, USA). Rooster red blood cells (RRBC) derived from the domestic chicken (Gallus gallus), stored in citrate buffer, were used as reference samples. Test samples were filtered through a 35-ml filter cap and further dissociated by adding 100 ml of 0.21 mg/ml trypsin followed by 75 ml of 2.5 mg/ml trypsin inhibitor (both Sigma-Aldrich) for 10 min at 37 C. Nuclei were stained using 100 ml of prestain (containing 416 mg/ml propidium iodide with 500 mg/ml RNAse in-stock solution). Two sample tubes, 1 prepared with RRBC and 1 prepared without, were then processed on a FACSCalibur (BD Biosciences, USA).
The instrument was equipped with a 488-nm laser to produce fluorescence collected using the FL-2-Area signal (585/42 BP), along with forward scatter and side scatter signals that enabled RRBC nuclei to be resolved from earwig nuclei. Data were analyzed using Flowjo (BD Biosciences, USA) and the pg/nuclei of the sample calculated.

Phylogenetic analysis
Two sibling species of F. auricularia have been described (Wirth et al. 1998). To assess which of these we sequenced, nucleotide sequences covering the COI and COII region from 34 isolates of F. auricularia were downloaded from NCBI. Those included 15 isolates reported by Wirth et al. (1998) originally used to infer sibling species A and B and other isolates from Belgian orchards submitted to NCBI. Nucleotide sequence covering COI and COII regions from the assembled genome was extracted through BLAST hits. To ensure that a single subspecies was sequenced across all the individuals, raw reads from each run were blasted back to this sequence to ensure the presence of a single haplotype. The same genomic region extracted from the mitochondrial genome of Euborellia arcanum was used as an outgroup. Nucleotide sequences were aligned using Clustal Omega (v1.2.3) (Goujon et al. 2010). The evolutionary history was inferred using the Neighbor-Joining method (Saitou and Nei 1987) with 1,000 bootstrap replicates (Felsenstein 1985). The evolutionary distances were computed using the Maximum Composite Likelihood method (Tamura et al. 2004) and are in the units of the number of base substitutions per site. All ambiguous positions were removed for each nucleotide sequence pair (pairwise deletion option). There were a total of 799 positions in the final dataset. The optimal tree is presented and the evolutionary analyses were conducted in MEGA11 (Tamura et al. 2021).

Bioinformatic pipeline
All the scripts used for genome assembly, de novo repeat library construction, and annotation are available on GitHub (https:// github.com/upendrabhattarai/Earwig_Genome_Project). The bioinformatics software and packages were run in New Zealand eScience Infrastructure. Below is a description of the pipeline (Fig. 1).

Genome assembly
Paired-end Illumina reads from the Chromium library were assembled using Supernova (v.2.1.1) (Weisenfeld et al. 2017). Assembly metrics such as N50 values and contig/scaffold number were assessed using Quast (v.5.0.2) (Gurevich et al. 2013) and the presence of the single-copy ortholog genes was assessed using the insecta_odb10 database in BUSCO (v.5.1.3) (Simão et al. 2015). BUSCO score from Quast analysis wherever mentioned used BUSCO version 3.0.2 and the eukaryote_odb9 database. Based on several trial assemblies, we down-sampled the total input to 660 million paired-end reads using "-maxreads" option with "supernova run" to produce an assembly with better completeness and contiguity. The assembled fasta sequence was obtained with "pseudohap" style of the supernova "mkoutput" function.

Repeat content analysis
To assist with annotation a custom repeat library was generated for the Earwig genome using different de novo repeat and homology-based identifiers, including LTRharvest (v.1.5.10) (Ellinghaus et al. 2008 We concatenated the individual libraries, and sequences with more than 80% similarity were merged to remove redundancy using usearch (v.11.0.667) (Edgar 2010). The library was then classified with RepeatClassifier (v.2.0) (Flynn et al. 2020). Sequences with unknown categories in the library were mapped against the UniProtKB/Swiss-Prot database (e-value <1eÀ01); if sequences were not annotated as repeat sequences they were removed from the library. The final repeat library was used in RepeatMasker (v.4.1.2) (Chen 2004) to generate a report for genome repeat content and provided to the MAKER2 pipeline to mask the genome.

Genome annotation
Genome annotation was carried out with 3 iterations of the MAKER2 (v.2.31.9) (Holt and Yandell 2011) pipeline combining evidence-based and ab initio gene models. The first round of MAKER2 used evidence-based models and the other 2 rounds were run using ab initio gene models. For the first round, we provided the MAKER2 pipeline with 180,119 mRNA transcripts denovo assembled via the Trinity pipeline (v.2.13.2) (Grabherr et al. 2011) along with 26,414 mRNA and 1,529 protein sequences of dermapterans from NCBI and 779 dermapteran protein sequences from the Uniprot database.
Augustus was trained using BRAKER (v.2.16) (Hoff et al. 2019) and SNAP was trained after each round of MAKER2 to use for ab initio gene model prediction. For the functional annotation, we ran InterProScan (v.5.51-85.0) (Jones et al. 2014) for the predicted protein sequences obtained from MAKER2 and retrieved InterPro ID, PFAM domains, and Gene Ontology (GO) terms. Furthermore, we ran BLASTp (Altschul et al. 1990) with the Uniprot database to assign gene descriptors to each transcript based on the best BLAST hit.

Genome size estimates
The flow cytometer estimated the genome size of 968.226 20.747 Mb (mean 6 SD) for the earwig genome. Similarly, the k-mer-based approach using adapter removed paired-end data from linkedread sequencing estimated the male earwig to be 988 Mb. Whereas an earlier estimation of an unknown dermapteran (earwig) species genome size was 1.4 Gb (Gregory 2005) showing a variable genome size within the order.

Phylogenetic analysis
The phylogenetic analysis showed 2 distinct subspecies groups within F. auricularia (Fig. 2), in agreement with Wirth et al. (1998). One clade includes 24 individuals including 9 originally identified as species A (green circle labels, Fig. 2). While the remaining 11 individuals cluster into a separate clade that include the assembled genome herein (red square label, Fig. 2) and 6 individuals originally identified as species B (green square labels, Fig. 2). The analysis confirmed that the genome reported in this article (Dunedin, NZ) belongs to the subspecies B of F. auricularia. This is also in accordance with the report from Quarrell et al. (2018) where 2 isolates from New Zealand were reported as subspecies B of F. auricularia.

Genome assembly
A total of 799.6 million paired-end reads was generated from 10Â linked-read sequencing. Downsampled to 660 million paired-end reads, Supernova estimated the genome size of 1.22 Gb, raw coverage of 82.02%, effective coverage of 39.50% and weighted mean molecule size of 22.45 kb. The Supernova assembly was 1.15 Gb in size and had 145,055 contigs, with an N50 of 0.03 Mb and L50 of 7,500. Quast reported a complete BUSCO of 64.69% and a partial BUSCO of 9.24% from the eukaryotic database.
The Nanopore sequencing yielded approximately 10.7 Gb of data, consisting of over 3 million reads. The median read length was 897 bp with an N50 length of 11,986 bp (Supplementary Table  1). The median read Phred quality was 13.34. Flye produced an assembly of 1.1 Gb, comprised of 18,766 contigs with N50 of 0.18 Mb and L50 of 1,832. Quast reported a complete BUSCO of 82.18% and a partial BUSCO of 9.24%. The long-read assembly was more complete based on the BUSCO scores and demonstrated better contiguity, so we merged the 2 assemblies using the Flye assembly as the primary assembly (Table 1).
The BlobTools2 filtering produced a clean assembly with only 215 contigs out of 2.7 K assigned as no-hits and all other contigs with blast hits to the Arthropoda database ( Supplementary  Fig. 1).The final hybrid assembly has a size of 1.06 Gb. It has 919 scaffolds with an N50 of 12.55 Mb, which shows that the assembly is highly contiguous. Half of the genome is present in just 20 scaffolds, as denoted by the L50 number (Table 1). Assembly has 846.85 "N's" per 100kbp. The BUSCO score from the insect database (n ¼ 1,367) for the assembly is 87.1% complete, among which 4.1% were duplicated, and 3.1% fragmented BUSCO ( Supplementary Fig. 2). Improvement in assembly statistics after each processing step is given in Supplementary Table 2. The only other whole-genome sequence publicly available from the Dermaptera order is of the earwig Anisolabis maritima [GenBank assembly accession: GCA_010014785.1, available to download from InsectBase (v.2)]. The A. maritima genome assembly is 649.7 Mb with a N50 of 1.4 Mb, (Mei et al. 2022), while its BUSCO score is 83.4% complete and 10.8% fragmented using the insect database (n ¼ 1,367). In comparison, the F. auricularia genome assembly has a better gene model and contiguity.

Genome repeat contents
Repeat analysis of the assembly showed that interspersed repeats comprised 686.43 Mb (64.62%) of the F. auricularia genome. Fig. 2. The phylogenetic relationships of F. auricularia obtained from different geographic regions inferred from COI and COII using a Neighbour-Joining method and Maximum Composite Likelihood approach in MEGA11. All ambiguous positions were removed for each nucleotide sequence pair (pairwise deletion). The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (1,000 replicates) are shown next to the branches. The tree is drawn to scale, with branch lengths in the same units as those of the evolutionary distances used to infer the phylogenetic tree. Species labeled with the colored squares are subspecies B. The red square (Dunedin NZ) is the one for which the genome is reported in this article. Green squares are the species categorized as subspecies B by Wirth et al. (1998) and the purple squares are others for which the nucleotide sequences were downloaded from NCBI. Species labeled with colored circles belong to subspecies A. Green circles represent subspecies A inferred by Wirth et al. (1998) and blue are other species for which nucleotide sequences were downloaded from NCBI. E. arcanum is the outgroup labelled with a black triangle. The Supernova and the Flye assembly statistics are for the assembly right after the assembler and no further processing, whereas the Final hybrid assembly shows the statistics of the assembly through all the assembly process as described in this article. Quast scores are to its default Eukaryota database.  (Table 2). Unusually large and variable genome sizes characterize Hemimetabolans (Wu et al. 2017). Comparative analysis in 6 species of Gomphocerine grasshoppers showed a strong positive correlation between repeat content and genome size. Genome size ranged from 8.2 to 13.7 Gb in these 6 species with a repeat content ranging from 79% to 87%, with the exception of Stauroderus scalaris whose genome is 96% repetitive DNA and the second-largest insect genome documented. Our estimation of genome size for F. auricularia does not show gigantism (968.22 Mb, flow cytometer estimate). However, its repeatome (64.62%) is almost twice that of other hemimetabolous insects like Gryllus bimaculatus (33.69%) and Laupala kohalensis (35.51%) (Ylla et al. 2021). This fold increase in the repeatome is surprising given both G. bimaculatus and L. kohalensis have bigger genomes (1.6 Gb) than F. auricularia.

Genome annotation
Combining evidence-based and ab initio gene models in the MAKER2 pipeline, we identified 12,876 genes and 21,031 mRNAs in the genome assembly. The mean gene length is 12,096 bp and the total gene length is 155.75 Mb, which makes 14.7% of the whole assembly. The longest gene annotated is 412,198 bp and the longest CDS is 19,035 bp (Table 3). 61.35% of total predicted mRNAs and 59.53% of predicted proteins are also functionally annotated through either 1 or more of InterPro, GO, and Pfam databases (Supplementary Table 3). The annotated transcriptome and proteome had a complete BUSCO score of 73.4% and 70% respectively using the insect database ( Supplementary Fig.  3). 98.3% of the gene models have AED score of 0.5 or less, assuring highly confident gene prediction ( Supplementary Fig. 4). The GC content of the F. auricularia genome is 31.03%, far greater than the 19.3% GC in the genome of the earwig A. maritima reported in InsectBase2 database (Mei et al. 2022). So we compared the GC content between different regions of F. auricularia genome to see if there are any abnormal distributions. Our analysis showed that exons have higher GC content (0.372 6 0.087) (mean 6 SD) and introns have lower (0.267 6 0.075) when compared between intergenic regions (N ¼ 823,037), genes (N ¼ 12,876), exons (N ¼ 145,003), introns (N ¼ 123,973), and nonoverlapping 10-kb windows throughout the genome (N ¼ 106,686) (Fig. 3). GC content for 10-kb windows was 0.308 6 0.032, which resembles the mean GC content of the whole genome (0.310). This finding is not unexpected as a higher GC content in exons vs. introns is common across the animal and plant kingdom because of the evolutionary selection of exon regions (Amit et al. 2012). There was a significant difference for each pairwise comparison using ANOVA followed by Tukey HSD with P < 0.0001.
Recently there has been a growing interest in hemimetabolous insects for use as genetic research models and hence sequencing and analyzing their genomes (Adamski et al. 2019;Ylla et al. 2021). Because of their primitive yet successful biology, the evolutionary insights they can offer for various biological traits are enormous. Genomes of milkweed bug (Oncopeltus fasciatus) (Panfilio et al. 2019) and field cricket (G. bimaculatus) (Ylla et al. 2021) have been instrumental for developmental biology research. Similarly, the genome of Rhodnius prolixus, a medically important hemimetabolous insect vector, provides key insights into the genetic re-organization contributing to the evolution of a blood-feeding lifestyle (Mesquita et al. 2015). Furthermore, the genome of Halyomorpha halys has informed research on polyphagy and insecticide resistance and contributed to advances in research on insect-pest control strategies (Sparks et al. 2020). In this context, we believe that, the genome of F. auricularia will be a key resource to develop this important insect species as a genetic model. We anticipate this will enhance the genetic study on various aspects of its biology, including developmental biology, sociality, and evolutionary characteristics.

Data availability
The genome assembly and annotation of F. auricularia are available through FigShare https://doi.org/10.6084/m9.figshare. 19092044. The raw sequencing reads are deposited in NCBI with accession number PRJNA800435. The scripts used for genome assembly, repeat library preparation and masking, and genome annotation are available at GitHub under GNU GPLv3 license (https://github.com/upendrab hattarai/Earwig_genome_project).
Supplemental material is available at G3 online.