Reference genome of the long-jawed orb-weaver, Tetragnatha versicolor (Araneae: Tetragnathidae)

Abstract
Climate-driven changes in hydrological regimes are of global importance and are particularly significant in riparian ecosystems. Riparian ecosystems in California provide refuge to many native and vulnerable species within a xeric landscape. California Tetragnatha spiders play a key role in riparian ecosystems, serving as a link between terrestrial and aquatic elements. Their tight reliance on water, paired with the widespread distributions of many species, makes them ideal candidates for understanding the relative roles of waterways versus geographic distance in shaping the population structure of riparian species. To support studies of population structure, we constructed a reference genome assembly for Tetragnatha versicolor using long-read sequencing, scaffolded with proximity ligation Omni-C data. The near-chromosome-level assembly comprises 174 scaffolds spanning 1.06 Gb, with a scaffold N50 of 64.1 Mb and BUSCO completeness of 97.6%. This reference genome will facilitate future study of T. versicolor population structure associated with the rapidly changing environment of California.


Introduction
Climate change has had, and will continue to have, a significant effect on the world's natural ecosystems and consequently on the services that these systems provide to humanity (IPCC 2022). However, it is clear that successful local and global mitigation efforts can help vulnerable ecosystems persist through forthcoming changes (Greenwood et al. 2016; Owen 2020). Riparian ecosystems have been shown to be among the most vulnerable to climate change and anthropogenic disturbances due to the acute dependency of these systems on water availability, flow rate and direction, as well as temperature and habitat modifications (Capon et al. 2013). Riparian ecosystems in California are in a particularly precarious position because they exist in a heavily populated and fundamentally xeric landscape where the effects of anthropogenic disturbance and climate change are widespread (Seavy et al. 2009; Rohde et al. 2021). These riparian ecosystems provide refuge to many native and imperiled species, both aquatic and terrestrial, and therefore will play a key role in mediating the adaptability of California biodiversity in response to climate change (Capon et al. 2013; Bogan et al. 2019). Without effective conservation and management strategies, these ecosystems are at risk. Critical to any management approach is an understanding of the connectivity within and between the various rivers and subwatersheds.
Members of the spider family Tetragnathidae (Araneae), the long-jawed spiders, and in particular the genus Tetragnatha, are prevalent in riparian habitats, characterized by their orb-webs constructed over bodies of water. They play an integral part in the riparian ecosystem as both predator and prey, providing a trophic connection between terrestrial and aquatic ecosystems (Akamatsu et al. 2004). Species are often widely distributed but depend on bodies of water as they are sensitive to desiccation (Gillespie 1987; Adams 2022). In California, there are 6 species of Tetragnatha spiders: T. versicolor, T. laboriosa, T. pallescens, T. nitens, T. elongata, and T. guatemalensis. T. versicolor Walckenaer, 1841 is the most abundant and widespread species, and is found across a wide range of ecoregions (Levi 1981), though always in strict association with water. This narrow niche requirement, in addition to its widespread nature, makes T. versicolor an ideal taxon to study how populations are structured, whether by the waterways themselves or by geographic distance. Insights into how the species is being affected by recent shifts in hydrological regimes connected to anthropogenic change require more detailed genomic assessment. To this end, we constructed a reference genome assembly of T. versicolor as part of the California Conservation Genomics Project (CCGP; Shaffer et al. 2022) to facilitate an investigation of the population structure of California Tetragnatha species and its response to climate change in California.

Biological materials
Adult female T. versicolor (Fig. 1) were hand collected from plants on the banks of the Eel River in Angelo Coast Range Reserve in Branscomb, California (39.72114°N, 123.648963°W) on 2 October 2020. Specimens were kept alive until DNA extraction. One individual (CCGP_27_UCD_003; SAMN29044170) was used for HiFi library preparation and sequencing while another individual (CCGP_27_UCSC_010; SAMN29044171) was used for Omni-C library preparation and sequencing. Female specimens were chosen because females have the X1X1X2X2 sex chromosome system and are generally larger than males, which have the X1X20 sex chromosome system.
Nucleic acid library preparations and DNA sequencing

DNA extraction.
One entire female spider (CCGP_27_UCD_003) was flash frozen in liquid nitrogen (LN2) and homogenized by grinding in a mortar and pestle in the presence of LN2. Homogenized tissue was lysed with 2 ml of lysis buffer containing 100 mM NaCl, 10 mM Tris-HCl pH 8.0, 25 mM EDTA, 0.5% SDS, and 100 µg/ml proteinase K overnight at room temperature. The lysate was treated with 20 µg/ml RNase at 37 °C for 30 min and cleaned with equal volumes of phenol/chloroform using phase-lock gels (Quantabio, Beverly, Massachusetts; Cat. #2302830). The DNA was precipitated by adding 0.4× volume of 5 M ammonium acetate and 3× volume of ice-cold ethanol. The DNA pellet was washed twice with 70% ethanol and resuspended in an elution buffer (10 mM Tris, pH 8.0). Genomic DNA (gDNA) purity was assessed using a NanoDrop ND-1000 spectrophotometer, which returned a 260/280 ratio of 1.9 and a 260/230 ratio of 2.0. DNA yield (12 µg total) was quantified using a Qubit 2.0 Fluorometer (Thermo Fisher Scientific, Waltham, Massachusetts). Integrity of the high molecular weight (HMW) gDNA was verified on a Femto Pulse system (Agilent Technologies, Santa Clara, California), where 60% of the DNA was found in fragments above 50 kb and 50% of the DNA was found in fragments above 120 kb.

HiFi library preparation and sequencing.
A HiFi SMRTbell library was constructed using the SMRTbell Express Template Prep Kit v2.0 (Pacific Biosciences-PacBio, Menlo Park, California, Cat. #100-938-900) according to the manufacturer's instructions. HMW gDNA was sheared to a target DNA size distribution between 15 and 20 kb. The sheared gDNA was concentrated using 0.45× of AMPure PB beads (PacBio, Cat. #100-265-900) for the removal of single-strand overhangs at 37 °C for 15 min, followed by further enzymatic steps of DNA damage repair at 37 °C for 30 min, end repair and A-tailing at 20 °C for 10 min and 65 °C for 30 min, ligation of overhang adapter v3 at 20 °C for 60 min and 65 °C for 10 min to inactivate the ligase, and nuclease treatment at 37 °C for 1 h. The SMRTbell library was purified and concentrated with 0.45× AMPure PB beads (PacBio, Cat. #100-265-900) for size selection using the BluePippin/PippinHT system (Sage Science, Beverly, Massachusetts; Cat. #BLF7510/HPE7510) to collect fragments greater than 7 to 9 kb. The 15 to 20 kb average HiFi SMRTbell library was sequenced at University of California Davis DNA Technologies Core (Davis, California) using two 8M SMRT cells, Sequel II sequencing chemistry 2.0, and 30-h movies each on a PacBio Sequel II sequencer.

Omni-C library preparation and sequencing.
The Omni-C library was prepared using the Dovetail Omni-C Kit (Dovetail Genomics, Scotts Valley, California) according to the manufacturer's protocol with slight modifications. First, specimen tissue (using whole individual spider CCGP_27_UCSC_010) was thoroughly ground with a mortar and pestle while cooled with LN2. Subsequently, chromatin was fixed in place in the nucleus. The suspended chromatin solution was then passed through 100 and 40 μm cell strainers to remove large debris. Fixed chromatin was digested under various conditions of DNase I until a suitable fragment length distribution of DNA molecules was obtained. Chromatin ends were repaired and ligated to a biotinylated bridge adapter followed by proximity ligation of adapter-containing ends. After proximity ligation, crosslinks were reversed and the DNA purified from proteins. Purified DNA was treated to remove biotin that was not internal to ligated fragments, and an NGS library was generated using an NEB Ultra II DNA Library Prep kit (New England Biolabs, Ipswich, Massachusetts) with an Illumina-compatible y-adaptor. Biotin-containing fragments were then captured using streptavidin beads, and the post-capture product was split into 2 replicates prior to PCR enrichment to preserve library complexity, with each replicate receiving unique dual indices. The libraries were sequenced at the Vincent J. Coates Genomics Sequencing Lab (Berkeley, California) on an Illumina NovaSeq platform (Illumina, San Diego, California) to generate approximately 100 million 2 × 150 bp read pairs per Gb of genome size.

Nuclear genome assembly
We assembled the genome of the long-jawed spider following the CCGP assembly pipeline Version 4.0, as outlined in Table 1, which lists the tools and non-default parameters used in the assembly. The pipeline uses PacBio HiFi reads and Omni-C data to produce high quality and highly contiguous genome assemblies while minimizing manual curation. We removed remnant adapter sequences from the PacBio HiFi dataset using HiFiAdapterFilt (Sim et al. 2022) and obtained the initial dual or partially phased diploid assembly (http://lh3.github.io/2021/10/10/introducing-dual-assembly) using HiFiasm (Cheng et al. 2022) with the filtered PacBio HiFi reads and the Omni-C dataset. We tagged output haplotype 1 as the primary assembly, and output haplotype 2 as the alternate assembly. We identified sequences corresponding to haplotypic duplications, contig overlaps and repeats on the primary assembly with purge_dups (Guan et al. 2020) and transferred them to the alternate assembly. We aligned the Omni-C data to both assemblies following the Arima Genomics Mapping Pipeline (https://github.com/ArimaGenomics/mapping_pipeline) and then scaffolded both assemblies with SALSA (Ghurye et al. 2017, 2019). We generated Omni-C contact maps for both assemblies by aligning the Omni-C data with BWA-MEM (Li 2013), identified ligation junctions, generated Omni-C pairs using pairtools (Goloborodko et al. 2018), generated a multiresolution Omni-C matrix with cooler (Abdennur and Mirny 2020), and balanced it with hicExplorer (Ramírez et al. 2018). We used HiGlass (Kerpedjiev et al. 2018) and the PretextSuite (https://github.com/wtsi-hpag/PretextView; https://github.com/wtsi-hpag/PretextMap; https://github.com/wtsi-hpag/PretextSnapshot) to visualize the contact maps and checked the contact maps for major misassemblies.
If we identified a strong off-diagonal signal and a lack of signal in the consecutive genomic region in the proximity of a join that was made by the scaffolder, we marked the join. Afterwards, all marked joins were dissolved by cutting the scaffolds at the coordinates of the joins. After this process, no further manual joins were made. Some of the remaining gaps (joins) were closed using the PacBio HiFi reads and YAGCloser (https://github.com/merlyescalona/yagcloser). We then checked for contamination using the BlobToolKit Framework (Challis et al. 2020). Finally, we trimmed remnants of sequence adaptors and mitochondrial contamination identified during NCBI contamination screening.
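The step of dissolving marked joins can be illustrated with a minimal Python sketch. It is not the script used in the CCGP pipeline; the function name `break_scaffold`, the 0-based half-open gap coordinates, and the naming of the resulting pieces are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def break_scaffold(name: str, seq: str, gaps: List[Tuple[int, int]]) -> Dict[str, str]:
    """Split one scaffold at each marked join, dropping the gap bases.

    `gaps` holds 0-based, half-open (start, end) coordinates of the N-runs
    that represent the scaffolder joins to dissolve (assumed convention).
    """
    pieces: Dict[str, str] = {}
    cursor = 0
    for index, (start, end) in enumerate(sorted(gaps), start=1):
        if not (cursor <= start < end <= len(seq)):
            raise ValueError(f"gap {(start, end)} overlaps a previous gap or is out of range")
        pieces[f"{name}_{index}"] = seq[cursor:start]  # sequence before this join
        cursor = end                                   # skip the gap itself
    pieces[f"{name}_{len(gaps) + 1}"] = seq[cursor:]   # sequence after the last join
    return pieces


# Toy scaffold with two joins (runs of N) marked for dissolution.
example = break_scaffold(
    "scaffold_1",
    "ACGT" + "NNNN" + "GGCC" + "NN" + "TTA",
    [(4, 8), (12, 14)],
)
```

In this toy case the scaffold is cut into three pieces, one on each side of every marked join; joins that are not listed in `gaps` are left intact.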

Mitochondrial genome assembly
We assembled the mitochondrial genome of the long-jawed spider from the PacBio HiFi reads using the reference-guided pipeline MitoHiFi (https://github.com/marcelauliano/MitoHiFi; Allio et al. 2020). The mitochondrial sequence of T. nitens (NCBI:NC_028068.1; Wang et al. 2016) was used as the starting reference sequence. After completion of the nuclear genome, we searched for matches of the resulting mitochondrial assembly sequence in the nuclear genome assembly using BLAST+ (Camacho et al. 2009) and filtered out contigs and scaffolds from the nuclear genome with a percentage of sequence identity >99% and size smaller than the mitochondrial assembly sequence.
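The filtering rule described above (identity >99% and sequence shorter than the mitochondrial assembly) can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's own code: it assumes BLAST+ default tabular output (`-outfmt 6`, where column 2 is the subject name and column 3 the percent identity), that the mitochondrial assembly was the query and the nuclear sequences were the subjects, and that sequence lengths are supplied separately. The function name and the example hits are hypothetical.

```python
import csv
import io
from typing import Dict, Set


def mito_like_sequences(blast_tsv: str, seq_lengths: Dict[str, int],
                        mito_length: int, min_identity: float = 99.0) -> Set[str]:
    """Return names of nuclear sequences flagged as mitochondrial contamination."""
    flagged: Set[str] = set()
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        subject, identity = row[1], float(row[2])
        # Both conditions must hold: near-identical hit AND a sequence
        # shorter than the mitochondrial assembly itself.
        if identity > min_identity and seq_lengths[subject] < mito_length:
            flagged.add(subject)
    return flagged


# Hypothetical hits in BLAST+ tabular format (12 default columns).
hits = "\n".join([
    "mito\tctg_small\t99.85\t12000\t18\t0\t1\t12000\t1\t12000\t0.0\t22000",
    "mito\tscaffold_big\t99.90\t900\t1\t0\t1\t900\t5000\t5899\t0.0\t1650",
    "mito\tctg_divergent\t91.20\t800\t70\t0\t1\t800\t1\t800\t1e-100\t900",
])
lengths = {"ctg_small": 13950, "scaffold_big": 64_100_000, "ctg_divergent": 9000}
flagged = mito_like_sequences(hits, lengths, mito_length=14426)
```

Here only `ctg_small` is flagged: `scaffold_big` has a near-identical hit but is far longer than the mitochondrial genome, and `ctg_divergent` falls below the identity threshold.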

Genome size estimation and quality assessment
We generated k-mer counts from the PacBio HiFi reads using meryl (https://github.com/marbl/meryl). The k-mer database was then used in GenomeScope2.0 (Ranallo-Benavidez et al. 2020) to estimate genome features including genome size, heterozygosity, and repeat content. To obtain general contiguity metrics, we ran QUAST (Gurevich et al. 2013). To evaluate genome quality and completeness we used BUSCO (Manni et al. 2021) with the arthropoda ortholog database (arthropoda_odb10), which contains 1,013 genes. Assessment of base level accuracy (QV) and k-mer completeness was performed using the previously generated meryl database and merqury (Rhie et al. 2020). We further estimated genome assembly accuracy via BUSCO gene set frameshift analysis using the pipeline described in Korlach et al. (2017). Measurements of the size of the phased blocks are based on the size of the contigs generated by HiFiasm in Hi-C mode. We follow the quality metric nomenclature established by Rhie et al. (2021), with the genome quality code x.y.P.Q.C, where x = log10[contig NG50]; y = log10[scaffold NG50]; P = log10[phased block NG50]; Q = Phred base accuracy QV (quality value); C = % genome represented by the first "n" scaffolds, following a known karyotype of 2n = 24 inferred from the congeneric T. maxillosa (The Animal Chromosome Count database-V1.0.0; https://cromanpa94.github.io/ACC/). Quality metrics for the notation were calculated on the primary assembly.
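For readers unfamiliar with these metrics, the following Python sketch shows one way N50/NG50 and the exponents of the quality code could be computed. It is a simplified illustration, not the implementation used by QUAST or merqury. Taking the floor of each log10 value, the conversion of a Phred QV to an error probability as 10^(-QV/10), and all function names are assumptions of this sketch.

```python
import math
from typing import Iterable, Optional


def n50(lengths: Iterable[int], genome_size: Optional[int] = None) -> int:
    """Return N50, or NG50 when an estimated genome size is given.

    The result is the length L such that sequences of length >= L cover
    at least half of the assembly (N50) or of `genome_size` (NG50).
    """
    ordered = sorted(lengths, reverse=True)
    target = (genome_size if genome_size is not None else sum(ordered)) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return 0  # the assembly covers less than half of genome_size


def log10_exponent(ng50: int) -> int:
    """Exponent used for x, y, and P (flooring is an assumption here)."""
    return math.floor(math.log10(ng50))


def phred_to_error(qv: float) -> float:
    """Per-base error probability implied by a Phred-scaled QV."""
    return 10 ** (-qv / 10)


toy_lengths = [5, 4, 3, 2, 1]
toy_n50 = n50(toy_lengths)                   # half of 15 is reached at length 4
toy_ng50 = n50(toy_lengths, genome_size=20)  # half of 20 is reached at length 3
```

Under these assumptions, a contig NG50 of about 9 Mb would give x = 6 and a scaffold NG50 of about 64 Mb would give y = 7; a QV of 30 corresponds to one expected error per 1,000 bases.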

Results
The Omni-C and PacBio HiFi sequencing libraries generated 91.4 million read pairs and 3.29 million reads, respectively. The latter yielded 55.64-fold coverage (N50 read length 17,735 bp; minimum read length 49 bp; mean read length 17,343 bp; maximum read length 61,245 bp) based on the GenomeScope2.0 genome size estimation of 1.09 Gb. Based on PacBio HiFi reads, we estimated a 0.148% sequencing error rate and a 1.9% nucleotide heterozygosity rate. The k-mer spectrum based on PacBio HiFi reads shows a bimodal distribution with 2 major peaks at 27- and 55-fold coverage, corresponding to the heterozygous and homozygous states of a diploid species, respectively (Fig. 2A). This distribution is consistent with a highly heterozygous genome. The final assembly (qqTetVers1) consists of 2 pseudo haplotypes, primary and alternate, both with sizes similar to the value estimated by GenomeScope2.0 (Fig. 2A). The primary assembly consists of 174 scaffolds spanning 1.06 Gb with contig N50 of 9 Mb, scaffold N50 of 64.1 Mb, longest contig of 38.5 Mb, and largest scaffold of 88.6 Mb. GC content for the primary assembly is 33.5% and AT content 66.5%. The alternate assembly is similar, although it consists of 539 scaffolds spanning 1.03 Gb with contig N50 of 7.6 Mb, scaffold N50 of 55.1 Mb, largest contig of 24 Mb, and largest scaffold of 82.5 Mb. GC and AT content for the alternate assembly are similar to the primary, with 33.6% GC content and 66.4% AT content. Detailed assembly statistics are reported in Table 2, and graphically for the primary assembly in Fig. 2B (see Supplementary Figure 1 for the alternate assembly). The primary assembly has a BUSCO completeness score of 97.7% using the Arthropoda gene set, a per-base quality (QV) of 66.35, a k-mer completeness of 76.05, and a frameshift indel QV of 52.6, while the alternate assembly has a BUSCO completeness score of 96.1% using the same gene set, a per-base quality (QV) of 65.08, a k-mer completeness of 74.02, and a frameshift indel QV of 55.48.

Fig. 2. Visual overview of Tetragnatha versicolor genome assembly metrics. A) K-mer spectrum output generated from PacBio HiFi data without adapters using GenomeScope2.0. The bimodal pattern observed corresponds to a diploid genome and the k-mer profile matches that of high heterozygosity. K-mers at lower coverage and high frequency correspond to differences between haplotypes, whereas the higher coverage and low frequency k-mers correspond to the similarities between haplotypes. B) BlobToolKit Snail plot showing a graphical representation of the quality metrics presented in Table 2 for the T. versicolor primary assembly (qqTetVers1.0.p). The plot circle represents the full size of the assembly. From the inside out, the central plot covers length-related metrics. The red line represents the size of the longest scaffold; all other scaffolds are arranged in size-order moving clockwise around the plot and drawn in gray starting from the outside of the central plot. Dark and light orange arcs show the scaffold N50 and scaffold N90 values. The central light gray spiral shows the cumulative scaffold count with a white line at each order of magnitude. White regions in this area reflect the proportion of Ns in the assembly; the dark vs. light blue area around it shows mean, maximum and minimum GC vs. AT content at 0.1% intervals (Challis et al. 2020). C and D) Hi-C contact maps for the primary (2C) and alternate (2D) genome assembly generated with PretextSnapshot. Hi-C contact maps translate proximity of genomic regions in 3-D space to contiguous linear organization. Each cell in the contact map corresponds to sequencing data supporting the linkage (or join) between 2 such regions. Scaffolds are separated by black lines, and a higher density of lines may correspond to higher levels of fragmentation.
We identified 21 misassemblies, 9 on the primary and 12 on the alternate, and broke the corresponding joins made by SALSA. We were able to close a total of 14 gaps, 9 on the primary assembly and 5 on the alternate. Finally, we filtered out 3 contigs, 1 from the primary and 2 from the alternate assembly, corresponding to mitochondrial contamination. No further contigs were removed. The Omni-C contact maps show that both assemblies are highly contiguous (Fig. 2C and D). We have deposited both assemblies on NCBI (see Table 2 and Data Availability for details).
The assembled mitochondrial genome is 14,426 bp in length, with base composition of A = 39.95%, C = 17.11%, G = 10.38%, T = 32.56%, and includes 20 unique transfer RNAs and 12 protein coding genes.
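As a small worked illustration of how such a base composition is derived, the Python sketch below tallies the four nucleotides in a sequence and reports percentages. The function name and the toy sequence are hypothetical; only the reported C and G percentages come from the text above, and summing them gives a GC content of 27.49% for the mitochondrial genome.

```python
from collections import Counter
from typing import Dict


def base_composition(seq: str) -> Dict[str, float]:
    """Percentage of A, C, G, and T among the unambiguous bases of `seq`."""
    counts = Counter(seq.upper())
    total = sum(counts[base] for base in "ACGT")  # ignore N and other symbols
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return {base: 100 * counts[base] / total for base in "ACGT"}


toy = base_composition("AACGTTAA")   # toy sequence, not real data
toy_gc = toy["G"] + toy["C"]

# GC content implied by the reported mitochondrial composition.
reported_gc = round(17.11 + 10.38, 2)
```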

Discussion
Of the 16 spider genome assemblies that have been published at time of writing (Supplementary Table 1), T. versicolor has the smallest assembled genome size, at 1.060 Gb. Its congeneric and close relative, T. kauaiensis, has a very similar genome assembly size of 1.085 Gb (Cerca et al. 2021). The remaining spider genome assemblies range in size from 1.222 Gb (Latrodectus hesperus, Thomas et al. 2020) to 6.255 Gb (Acanthoscurria geniculata, Sanggaard et al. 2014; Supplementary Table 1). Our sequence assembly generated a reference genome with contig N50 of 9 Mb. The data clustered into 13 chromosome-candidate scaffolds (Fig. 2), which is consistent with a karyotype of 2n = 26 chromosomes (Pajpach 2018). Therefore, our assembly is comparable to the 3 chromosome-level spider genome assemblies available at the time of writing: Argiope bruennichi (Sheffer et al. 2021), Trichonephila antipodiana (Fan et al. 2021), and Dysdera silvatica (Escuer et al. 2022).
Using this T. versicolor reference genome, we are conducting more detailed genomic analyses of Tetragnatha spiders using genome resequencing as part of the CCGP. These data will allow us to investigate the importance of water proximity, drought stress, and habitat connectivity in structuring populations across ecoregions, as well as enabling the study of adaptive responses to climate and other anthropogenic changes. Additionally, we will be able to assess demographic patterns and ask how habitat conversion, aridification, and alteration in waterways through time have structured Tetragnatha populations. These data, in combination with other freshwater aquatic taxa included in the CCGP, will help build our overall understanding of the genetic structure and patterns of gene flow among riverine taxa that is central to management and conservation efforts (Shaffer et al. 2022), as well as filling in a key position in our understanding of the phylogenetic diversity of California taxa. We will also use the T. versicolor genome to understand venom evolution, and the possible role of venoms in mate recognition (Zobel-Thropp et al. 2018). Moreover, comparison with the closely related Hawaiian T. kauaiensis will provide insights into the loss of web-building behaviors and the loss of an obligate association with water that characterize many of the Hawaiian Tetragnatha including T. kauaiensis (Berger et al. 2021).
Studying genomic structure and response to drought in Tetragnatha spiders will also provide valuable information on the health of riparian biological communities. Tetragnatha play crucial roles in riparian ecosystems as the primary predators of insects emerging from aquatic larval stages (Okuma 1968;Barrion and Litsinger 1984) and as a source of prey for other predatory taxa, notably birds (Gunnarsson and Wiklander 2015). Both their integral nature in riparian communities and their sensitivity to changes in water availability make them important bioindicators (Reyes-Maldonado et al. 2017). Understanding the genetic diversity, demographic structure, and population connectivity of Tetragnatha across habitats experiencing different levels of environmental change will therefore provide valuable insight into the response of the ecosystem to climate change.

Data availability
Data generated for this study are available under NCBI BioProject PRJNA851111. Raw sequencing data for samples CCGP_RG_IND1 and TVERS_CCGP_IND2 (NCBI BioSamples SAMN29044170 and SAMN29044171) are deposited in the NCBI Sequence Read Archive (SRA) under SRR20722016 for PacBio HiFi sequencing data, and SRR20722014-5 for the Omni-C Illumina sequencing data. GenBank accessions for the primary and alternate assemblies are JANDEH000000000 and JANDEI000000000, and for genome sequences GCA_024610705.1 and GCA_024610695.1. The GenBank organelle genome assembly for the mitochondrial genome is CM045180.1. Assembly scripts and other data for the analyses presented can be found at the following GitHub repository: www.github.com/ccgproject/ccgp_assembly

Author contributions
SAA, NRG, MMS, ECS, and RGG conceived of and designed the study. SAA collected specimens and sent them out for genomic work. EB, NC, CF, MPAM, ON, SS, RS, and WS performed lab work. ME assembled the genome and ran statistical analysis. SAA, NRG, AJH, MMS, ECS, and ME wrote the initial draft of the manuscript, and all authors read, edited, and approved the final manuscript. HBS and ET provided infrastructure and financial support, as well as overall project support for the CCGP.