Unveiling the Genetic Blueprint of a Desert Scorpion: A Chromosome-level Genome of Hadrurus arizonensis Provides the First Reference for Parvorder Iurida

Abstract Over 400 million years old, scorpions represent an ancient group of arachnids and one of the first animals to adapt to life on land. Presently, the lack of available genomes within scorpions hinders research on their evolution. This study leverages ultralong nanopore sequencing and Pore-C to generate the first chromosome-level assembly and annotation for the desert hairy scorpion, Hadrurus arizonensis. The assembled genome is 2.23 Gb in size with an N50 of 280 Mb. Pore-C scaffolding reoriented 99.6% of bases into nine chromosomes and BUSCO identified 998 (98.6%) complete arthropod single copy orthologs. Repetitive elements represent 54.69% of the assembled bases, including 872,874 (29.39%) LINE elements. A total of 18,996 protein-coding genes and 75,256 transcripts were predicted, and extracted protein sequences yielded a BUSCO score of 97.2%. This is the first genome assembled and annotated within the family Hadruridae, representing a crucial resource for closing gaps in genomic knowledge of scorpions, resolving arachnid phylogeny, and advancing studies in comparative and functional genomics.


Introduction
Arachnids emerged on land over 400 million years ago (Coddington and Colwell 2001) and are a diverse taxonomic group containing over 100,000 known species (Proctor et al. 2015).They inhabit a range of habitats, including terrestrial and aquatic environments (Kuntner 2022).Despite this diversity, genomic research within Arachnida has primarily focused on Araneae (spiders) (Sanggaard et al. 2014) and Parasitiformes (ticks) (Pagel Van Zee et al. 2007).
The order Scorpiones comprises 23 families and over 2,800 species in two main parvorders: Buthida and Iurida (Santibáñez-López et al. 2022, 2023).Most are nocturnal, solitary predators and the majority of species fluoresce under UV light.Being Arachnopulmonates, they possess book lungs for respiration; however, unlike other arachnids, their body is uniquely segmented into a prosoma, mesosoma, and metasoma (Howard et al. 2019).They have a pair of chelate pedipalps for defense and prey acquisition (prosoma), ventral sensory organs called pectines (mesosoma), and a segmented tail that ends in a telson and stinger (metasoma) to deliver venom.Scorpion venom is often the subject of research for human health and biotechnological applications (Kerkis et al. 2017).Despite this, significant gaps persist in regards to the diversity and evolution of scorpions (Kuntner 2022).
The higher-level relationships among scorpions have been controversial, especially with regard to monophyly and trait evolution (Lozano-Fernandez et al. 2019;Santibáñez-López et al. 2020;Ballesteros et al. 2022).The contribution of more high-quality genomes from underrepresented families, apart from the four buthid genomes currently available, can improve our understanding of relationships among scorpions and the evolution of unique traits (Long et al. 2003;Rittschof and Robinson 2016).
This study presents the first chromosome-scale genome assembly for the desert hairy scorpion, Hadrurus arizonensis.This species is an iconic inhabitant of the Mojave and Sonoran deserts of North America, and one of the largest scorpions found on the continent (Fig. 1A).It also provides the first genome assembly for any species of the parvorder Iurida (Fig. 1B).The sequenced genome of H. arizonensis, known to be significantly impacted by historical climate shifts (Graham et al. 2013), also offers a valuable foundation for future studies of adaptation in extreme environments.

Genome Assembly
Flye and Hifiasm, with and without Purge Haplotigs (PH), were used to assemble the first genome (Fig. 1C S1, Supplementary Material online).Flye PH had a BUSCO score of C: 98.1%, D: 3.7%, and F&M: 1.9%.Flye PH, with the greatest contiguity and completeness, was selected for scaffolding.Contigs less than 3 Kb were removed from Flye PH prior to scaffolding.The final filtered Flye PH assembly had an N50 of 10.9 Mb, 850 contigs, a BUSCO score of C: 98.2% [S: 94.6%, D: 3.6%], F&M: 1.8%, and a Merqury QV score of 41.27.
Retrotransposons were the most abundant transposable element (TE) class in the H. arizonensis genome, comprising 29.86% of the repetitive elements identified (Fig. 2A).Long interspersed nuclear elements (LINEs) represented the majority of retrotransposons (29.39%), in stark contrast with short interspersed nuclear elements (SINEs) (0.16%) and long terminal repeat elements (LTRs) (0.31%) (supplementary table S3, Supplementary Material online; Fig. 2B).Interestingly, Bovine-B (RTE/Bov-B) was the most abundant LINE (27.07%).Bov-B elements have a widespread and patchy distribution in eukaryotes and phylogenetic analysis of these elements has identified potential horizontal transfer vectors in Arthropoda (Ivancevic et al. 2018).Approximately 13.57% of the genome's repeat elements consist of DNA transposons, with Tc1-IS630-Pogo accounting for 9.77%.This family has been identified in several species, including Drosophila, plants, even vertebrates (Gao et al. 2020) and, in some instances, greatly expanded (Marburger et al. 2018).

Protein-coding Genes
Six H. arizonensis RNA libraries exported from NCBI had an average mapping rate of 92.44% (supplementary table S4, Supplementary Material online).Combined, the filtered EASEL and GeneMark structural annotation yielded 31,841 genes and 90,343 transcripts.These numbers were reduced to 18,996 and 75,256, respectively, following PFAM (protein domain) filtration (Fig. 2B).With alternative transcripts present, the mono:multi exonic ratio was 0.135 and BUSCO completeness was 97.2% [S: 9.7%, D: 87.5%] .After taking the longest isoform, representing the unique gene space, BUSCO completeness was 91.6% [S: 87.6%, D: 4.0%] and the mono:multi exonic ratio remained mostly unchanged at 0.14 (supplementary table S5, Supplementary Material online).The reciprocal BLAST annotation rate of the longest isoform proteins against the RefSeq database was 59.21%.In comparison, the annotation rate for gene family assignment using the EggNOG v4 database was 74.33%.Combined, the final annotation rate for the extracted longest isoforms reached 75.94% (supplementary table S6, Supplementary Material online).While 18,996 genes may be an underestimation in comparison to the 24,591 predicted protein-coding genes in the Centruroides sculpturatus genome (GCF_000671375.1;Schwager et al. 2017), they are complete and in range with other Araneae (Thomas et al. 2020).

Scorpion Genomes
Presently, only 12 of 237 arachnid genomes hosted at NCBI are chromosome-scale.Prior to assembling H. arizonensis, there were no chromosome-scale genome assemblies within the order Scorpiones.As such, this high-quality reference represents the first, with a completeness and contiguity that surpasses the scorpion assemblies available to date (Figs 1B and 2C).
frozen in liquid nitrogen.The tissue was ground in liquid nitrogen to a fine powder and stored at −80 °C.

DNA Extraction
Hadrurus arizonensis DNA was extracted using the Monarch Ⓡ high molecular weight (HMW) DNA Extraction Kit: Tissue Protocol (NEB, T3060).Changes made to the protocol include: adding 580 μL of HMW gDNA Tissue Lysis Buffer to approximately 30 mg of frozen ground tissue, lysis incubation was at 56 °C for 45 min at 700 rpm and the binding of the gDNA to beads was performed on a vertical rotating mixer at 8 rpm rather than 10 rpm.DNA concentration was measured using a Qubit fluorometer.The gDNA for the standard library was sheared using a Covaris Ⓡ G-tube to approximately 25 Kb and small fragments were removed using PacBio's Short Read Eliminator XS kit.For the ultralong library, three separate Monarch Ⓡ extractions were pooled and eluted in Oxford Nanopore's EEB buffer.

Genomic Library Preparation
Oxford Nanopore Technologies Ⓡ Ligation Sequencing Kit V14 (SQK-LSK114) was used to prepare the library on the extracted HMW DNA.DNA repair was performed at 20 °C for 20 min followed by 10 min at 65 °C to inactivate the enzymes.DNA was eluted at 37 °C.The final library was quantified using a Qubit fluorometer.The flow cell was loaded 3 times each with 10.5 fmol of library.For ultralong genomic library preparation, the Ultra-Long DNA Kit V14 (SQK-ULK114) was used on the pooled HMW DNA.Both PromethION flow cells ran for 72 h.

Pore-C Library Preparation and Sequencing
A total of 150 mg of ground tissue was used for cell crosslinking.Crosslinked cells were resuspended in permeabilization solution and incubated on ice.The chromatin was denatured and the permeabilized cells were digested with restriction enzyme NlaIII.To reverse the crosslinking of the ligated chromatin, the sample suspension was incubated in a thermomixer with periodic rotation.Full details on the library preparation are available (supplementary file S1, Supplementary Material online).A total of 1.5 μg of purified DNA was used as input material for the SQK-LSK114 (Oxford Nanopore Technologies, UK) kit and protocol.DNA repair was performed at 20 °C for 20 min, followed by 65 °C for 10 min to inactivate the enzymes.The PromethION flow cell was run for 96 h.

Genome Annotation
Transposable elements were identified with RepeatModeler (v2.02) and the reference was softmasked with RepeatMasker (v4.1.4)(Smit et al. 2015;Flynn et al. 2020).Six H. arizonensis RNA libraries were imported from NCBI (PRJNA340270) and used in the EASEL (v1.5) pipeline along with the softmasked draft genome and arthropoda OrthoDB (v11) protein sequences to predict protein-coding genes (Webster et al. n.d.).The invertebrate training set was configured to filter false-positive predictions.The same inputs were utilized in the GeneMark-ETP (Bruna et al. 2024) gene prediction tool, which is part of the BRAKER3 (v3.0.2) pipeline (Gabriel et al. 2024).EASEL (filtered) and GeneMark (hmm) genes were independently mapped to the Pore-C chromosome-level assembly with Liftoff (v1.6.3)(Shumate and Salzberg 2021).The resulting GFF files were combined using the AGAT (v1.2) toolkit (Dainat et al. 2023).Protein sequences were extracted and scanned for protein domains using   (Eddy 2011;Mistry et al. 2021).Sequences without a domain were removed.Final summary statistics, including BUSCO, were run on the filtered proteins.Finally, EnTAP (v1.0.1) was run with the complete RefSeq database (v208) at 70/70 coverage to functionally annotate the predicted proteins (Hart et al. 2020).Summary statistics, BUSCO, and EnTAP were also run on the longest isoforms to represent the unique gene space.
FIG. 1.-A)Range map of H. arizonensis with a point (star) denoting the approximate sampling location.The arrow extends to a photo of this region (near Oatman, Arizona, USA).The image on the right is a photo of H. arizonensis [both images are courtesy of Brent E. Hendrixson].B) Phylogeny of scorpion species with genomes currently available on NCBI, informed by previous molecular phylogenetic analyses(Lozano-Fernandez et al. 2019; Santibáñez-López et al.  2023).Symbols appended to the species name indicate assembly status: "*" Scaffold level, "**" Contig level, "+" Complete chromosome-scale assembly.The lighter shade (red) segments indicate parvorder Buthida while the darker shade (blue) segments indicate parvorder Iurida.C) Contiguity (N50 and total number of scaffolds/contigs) and completeness (BUSCO) of genome assembly approaches for H. arizonensis.These include Flye and Hifiasm, with and without purge haplotigs (PH).The final chromosome-scale assembly is denoted with a star.Each assembly is represented by a pie chart to visualize BUSCO completeness.The darker shade (dark purple) indicates complete BUSCOs while the lighter shade (yellow) indicates missing and/or fragmented BUSCOs.D) Chromatin contact map generated from Pore-C data shows the nine chromosomes (2n = 18) that represent 99.56% of the assembled H. arizonensis genome.