The draft genome sequence of the spider Dysdera silvatica (Araneae, Dysderidae): A valuable resource for functional and evolutionary genomic studies in chelicerates

Abstract Background We present the draft genome sequence of Dysdera silvatica, a nocturnal ground-dwelling spider from a genus that has undergone a remarkable adaptive radiation in the Canary Islands. Results The draft assembly was obtained using short (Illumina) and long (PaciBio and Nanopore) sequencing reads. Our de novo assembly (1.36 Gb), which represents 80% of the genome size estimated by flow cytometry (1.7 Gb), is constituted by a high fraction of interspersed repetitive elements (53.8%). The assembly completeness, using BUSCO and core eukaryotic genes, ranges from 90% to 96%. Functional annotations based on both ab initio and evidence-based information (including D. silvatica RNA sequencing) yielded a total of 48,619 protein-coding sequences, of which 36,398 (74.9%) have the molecular hallmark of known protein domains, or sequence similarity with Swiss-Prot sequences. The D. silvatica assembly is the first representative of the superfamily Dysderoidea, and just the second available genome of Synspermiata, one of the major evolutionary lineages of the “true spiders” (Araneomorphae). Conclusions Dysderoids, which are known for their numerous instances of adaptation to underground environments, include some of the few examples of trophic specialization within spiders and are excellent models for the study of cryptic female choice. This resource will be therefore useful as a starting point to study fundamental evolutionary and functional questions, including the molecular bases of the adaptation to extreme environments and ecological shifts, as well of the origin and evolution of relevant spider traits, such as the venom and silk.

Photo credit Miquel Arnedo proximately 45,000 spider species have been recorded to date [1]. The nocturnal ground family Dysderidae, ranks 17th out of 118 currently accepted spider families in number of species. The type genus of the family, Dysdera Latreille, 1804, includes half of the family diversity (282 species). This genus is remarkable in several aspects. First, it represents one of the few cases of stenophagy, i.e. prey specialization, across spiders [2]. Many species in the genus have evolved special morphological, behavioral and physiological adaptations to feed on woodlice, including modi cations of mouthparts, unique hunting strategies and e ective restriction to assimilation of metals into its tissues [3,4,5,6,7]. Because of their chemical defenses and ability to accumulate heavy metals from the soil, woodlice are usually avoided as prey by most spiders, including generalist Dysdera [2,4,5,7]. Although mostly circumscribed to the Mediterranean region, Dysdera has colonized all the Macaronesian archipelagoes, and has undergone a remarkable species diversi cation in the Canary Islands [8]. As many as 55 species have been recorded across the seven main islands and islets of this archipelago, being most of them single-island endemics [9]. Although multiple colonization events may account for the initial origin of species diversity the bulk of this diversity is the result of in situ diversi cation [8]. Dysdera spiders have adapted to a broad range of terrestrial habitats within Canary Islands [9]. Interestingly, many co-occurring species signi cantly di er in mouthpart sizes and shapes, presumably due to adaptations to a specialized diet [6,7], suggesting that stenophagy has evolved multiple times independently in these islands [10]. Although behavioral and physiological experiments have revealed a close correlation between morphological traits and prey preference in Dysdera, little is known about the molecular basis of trophic adaptations in this genus.
Here we present the draft assembly and functional annotation of the genome of the Canary Island endemic spider Dysdera silvatica Schmidt, 1981 (NCBI Taxon ID: 477319; Figure 1). This study is the rst genomic initiative within its family and just the second within the Synspermiata [11], a clade that includes most of the families formerly included in Haplogynae, which was recently shown to be paraphyletic [12,13] (Figure 2).
Remarkably, a recent review on arachnid genomics identi ed the superfamily Dysderoidea (namely Dysderidae, Orsolobidae, Oonopidae and Segestriidae) as one of the priority candidates for genome sequencing [14]. The new genome, intended to be a reference genome for genomic studies on trophic specialisation, will also be a valuable source for the ongoing studies on the molecular components of the chemosensory system in chelicerates [15]. Besides, because of the numerous instances of independent adaptation to caves [16], the peculiar holocentric chromosomes [17], and the evidence for cryptic female choice mechanisms [18,19] within the family, the new genome will be a useful reference for the study of the molecular basis of adaptation to extreme environments, karyotype evolution and sexual selection. Additionally, a new fully annotated spider genome will greatly improve our understanding of key features, such as the venom and silk. The availability of new genomic information in a sparsely sampled section of the tree of life of spiders [14] will further provide valuable knowledge about relevant scienti c questions, such as gene content evolution across main arthropod groups, including the consequences of whole genome duplications, or the phylogenetic relationships with Araneae.

Sampling and DNA extraction
We sampled adult individuals of D. silvatica in di erent localities of La Gomera (Canary Islands) in March 2012 and June 2013 (Supplementary Table S1-1). The species was con rmed in the laboratory and samples were stored at -80ºC until its use. For Illumina and PacBio libraries (see below), we extracted genomic DNA (gDNA) using Qiagen DNeasy Blood & Tissue Kit (Qiagen, Germany) according to the manufacturer's protocol. For the Oxford Nanopore libraries, we used a modi ed version of the Blood & Cell Culture DNA Mini Kit (Qiagen, Germany). Due to the high amount of chitin present in spiders we incubated fresh original samples 48 h at 32ºC, avoiding a centrifugation step prior to sample loading to Qiagen Genomic tips permitting the solution to precipitate by gravity. We also added an extra wash with 70% Ethanol and centrifuged the solution at > 5,000 g for 10 min at 4ºC. We quanti ed the gDNA in a Qubit uorometer (Life Technologies, Thermo Fisher Scienti c) using the dsDNA BR (double stranded DNA Broad Range) Assay Kit, and checked its purity in a NanoDrop 2000 spectrophotometer (Thermo Fisher Scienti c).

DNA sequencing
We obtained the genome of D. silvatica using four di erent sequencing platforms (Table 1; Supplementary Table S1-2). First, we used the Illumina HiSeq2000 to obtain the genome sequence of a single male (100 bp, paired-end reads, 100 PE; TruSeq library). The ow-cell lane generated about 51 Gb of sequence, representing a genome coverage of 30X (assuming a genome size of~1.7 Gb; see below). The genome of a female was sequenced using a Mate Pair approach; for that we used Nextera 5 kb-insert 100 PE libraries and the HiSeq2000 to generate

D. silvatica chromosome and genome size
D. silvatica has a diploid chromosome set of 6 pairs of autosomes and two (females are XX; 2n = 14) or one (males are X0) sex chromosomes (M. A. Arnedo, unpublished results). Using ow cytometry and the genome of the German cockroach Blattella germanica (1C = 2.025 Gb, J. S. Johnston, personal communication; see also [23] as reference, we determined that the haploid genome size of D. silvatica is~1.7 Gb. For the analysis, we adapted the Hare and Johnston [24] protocol for spiders species, without using male palps and chelicers to avoid analyzing haploid or endoreplicated cells, respectively [25,26]. Shortly, we isolated cells from the head of male cockroach, and legs and palps from female spiders. We incubated the cells in LB0.1 with 2% of tween [27], Propidium Iodide (50 µg/mL) and RNAse (40 µg/mL). After 10 minutes, the processed tissue was ltered using a nylon mesh of 20 µm. We determined the DNA content of the diploid cells through the relative G0/G1 peak positions of the stained nuclei using a Gallios ow cytometer (Beckman Coulter, Inc, Fullerton, CA); the results were based on the average of 3 spider replicates, counting a minimum of 5,000 cells per individual.
Quality-based trimming and ltering was performed according to the chemistry, technology and library used (Supplementary Table S1-4). For the short-insert 100 PE library, we used Trimmomatic v0.36 (Trimmomatic, RRID:SCR_011848) [33] with speci c lists of adapters of the TruSeq v3 libraries to lter all reads shorter than 36bp or with minimum quality scores < 30 along a 4 bp sliding windows. We also ltered trailing and leading bases with a quality score < 10. Long-insert MP libraries were pre-processed using NxTrim v0.4.1 [34] with default parameters (Supplementary Table S1-4a & S1-4b). We pre-processed the raw PacBio reads using the SMRT Analysis Software (SMRT Analysis Software, RRID:SCR_002942)

De novo genome assembly
We used MaSuRCA v3.2.9 (MaSuRCA, RRID:SCR_010691) [37] for an hybrid de novo assembly of the D. silvatica genome (Supplementary Figure S2). Additionally, we performed a sca olding phase using AGOUTI (minimum number of joining reads pairs support, k = 3) [38], and the raw reads from a D. silvatica RNAseq experiment [39] (Supplementary Table S1-5 & S1-6). During the assembly phase, we chose for each software the parameter values that generated the best assembly (Supplementary Table S1-7) in terms of, i) continuity and contig size statistics, such as the N50, L50, and the total number of sequences and bases assembled, and ii) completeness measures, obtained as the fraction (and length) of a series of highly conserved proteins present in the draft genome. Particularly, we used ve data sets, BUSCO v3 (BUSCO, RRID:SCR_015008) with genome option [40] using i) the Arthropoda or ii) the Metazoa dataset, iii) the 457 Core Eukaryotic Genes (CEG) of Drosophila melanogaster [41], iv) the 58,966 transcripts in the D. silvatica transcriptome [39] and, v) the 9,  Table S1-7). We discarded 16 contaminant sequences > 5 kb. The nal assembly size of the D. silvatica genome (Dsil v1.2) was~1.36 Gb, with a N50 of about 38 kb (Table 2).
We also studied the distribution of the high covered genome regions to describe the spacing pattern among repetitive se- quences. In particular, we searched for genomic regions that have a higher than average sequencing coverage above a particular threshold. Since repetitive regions are more prone to form chimeric contigs in the assembly step, we only used MaSuRCA super reads, and longer than 10 kb and free of Ns (34,937 contigs; 1.12 Gb). We estimated the coverage after mapping the short reads (from the 100PE library) to those contigs. We dened as high coverage regions (HCR) those with a coverage equal or greater than a 2.5 or 5 times the genome-wide average (~30X), in a region of at least (150, 500, 1000 or 5000 bp, Supplementary Figure S4a; Supplementary Table S2). We found a large number of contigs encompassing one or more HCR. For instance, 21,614 contigs (~61.9%) include at least one 150 bp HCR region with more than 2.5x coverage (an average of 2.48 HCRs per contig; 77.7 HCR per Mb) (Supplementary Table S2-2a). For HCRs of more than 5x coverage, the results are also remarkable (10,604 contigs have at least one 150 bp HCR, corresponding to 25.6 HCR per Mb). As expected, the longer the HCR the smaller the fraction in the genome; indeed, we found that the genome is encompassing~5 HCR per Mb (HCR, longer than 1 kb at 2.5x). The distances between consecutive HCRs does not show clear di erences between the 2.5x and 5x thresholds ( Supplementary Figure 4b & S5; Supplementary Table S2-2b).
We found a strong relationship between the length of the HCR and the type of the included repetitive elements ( Figure  3; Supplementary Table S2-3). For instance, while LINEs represent 8.62% of the repetitive elements in the whole genome, they are clearly enriched in the HCRs (36.12% in HCRs longer than 150 bp; 12.08% in HCRs longer than 5,000 bp) (Figure 3; Supplementary Table S2-3a); the same was found for the small RNA fraction (rRNA). In contrast, the fraction of low complexity repetitive sequences is much less represented in small HCRs than in the whole genome (about 1.3%). We also found that the coverage threshold has little e ect on the results (Supplementary Table S2 Given that the HCR analysis covers an important fraction of the assembled bases (~82%), current results can likely be extrapolated to the whole genome. Therefore, the relatively low N50 of the D. silvatica genome draft is very likely to be caused by

Completeness
We determined the completeness of the D. silvatica genome assembly (Table 3) using BLASTP (E-value cuto <10 -3 ; >30% of alignment length and identity > 50%). We searched for homologs of the functionally annotated peptides (36,398) in, i) among CEG genes of Drosophila melanogaster [41], ii) among the predicted peptides of Parasteatoda tepidariorum, a spider with a well annotated genome [61], iii) among the 9,473 1:1 orthologs across ve Dysdera species and, iv) among the 2,198 single copy genes identi ed in all spiders and available in OrthoDB v10 [55]. We found in D. silvatica a high fraction of putative homologs (95.8% of CEG genes, and 97.4% spider-speci c single copy genes (Table 3). Furthermore, the analysis based on the putative homologs of the single-copy genes included in the BUSCO data set (BUSCO, RRID:SCR_015008) [40], applying the default parameters for the genome and protein mode, also demonstrated the high completeness of the genome draft. Indeed the analysis recovered the~90% of Metazoa or Arthropoda genes (v9), and nearly 70% of them are complete in D. silvatica.
We extended the search for D. silvatica homologs to a broader taxonomic range (Figure 2; Supplementary Table S1-11) by including other metazoan lineages and performing a series of local BLASTP searches (E-value cuto < 10 -3 ; >30% align-a) b) 10 Figure 2. b) Homology relationships among D. silvatica (Dsil) and di erent chelicerates genomes available in OrthoDB v10 [55], Parasteatoda tepidariorum (Ptep), Stegodyphus mimosarum (Smim), Ixodes scapularis (Isca) and Tetranychus urticae (Turt). Red and orange bars indicate the fraction of single copy genes (1:1 orthologs) identi ed in all species, and in all but one (e.g., missing in one species), respectively. The dark and light green bar indicate the fraction of orthologs present in all species and in all but one, respectively, that are not included previous categories. The blue bar (other orthology/homology) shows other more complex homologous relationships. The results were generated uploading D. silvatica proteins to the OrthoDB web server. ment length). We found that a great majority of D. silvatica genes are shared among arthropods (57.9%), being 11,995 of them (32.95%) also present in Ecdysozoa (Figure 4a). Remarkably, 9,560 genes appears to be spider-speci c, being 4,077 of them speci c (unique) of D. silvatica. Despite almost all these species-speci c genes have interproscan signatures, the annotation metrics are poor compared with genes having homologs in other species (Supplementary Table S1

Mitochondrial genome assembly and annotation
We assembled the mitochondrial genome of D. silvatica (mtDsil) from 126,758 reads identi ed in the 100PE library by the software NOVOPlasty [62]. Our de novo assembly yielded a unique contig of 14,440 bp (coverage of 878X) (Supplementary Table S1-13). CGVIEW (CGVIEW, RRID:SCR_011779) [63] was used to generate a genome visualization of the annotated mtDsil genome (Supplementary Figure S10). We identi ed 2 rRNAs, 13 protein-coding genes and 15 tRNAs (out of the putative 22 tRNAs). Based on the contig length and the inability of standard automatic annotation algorithms to identify tRNA with missing arms, as reported for spiders [64], the complete set of tRNAs is most likely present for this species.

Conclusion
We have reported the assembly and annotation of the nuclear and mitochondrial genomes of the rst representative of the spider superfamily Dysderoidea and the second genome of a Synspermiata, one of the main evolutionary lineages within the "true spiders" (Araneomorphae) and still sparsely sampled at the genomic level [14]. Despite the high coverage and the hybrid assembly strategy, the repetitive nature of the D. silvatica genome precluded obtaining a high continuity draft. The characteristic holocentric chromosomes of Dysderidae [17] may also explain the observed genome fragmentation; indeed, it has been recently shown that genome-wide centromere-speci c repeat arrays are interspersed among euchromatin in holocentric plants (Rhynchospora, Cyperceae) [65].
Nevertheless, the completeness and the extensive annotations achieved for this genome, as well as the new referenceguided transcriptome, make this draft an excellent source tool for further functional and evolutionary analyses in this and other related species, including the origin and evolution of relevant spider traits, such as venom and silk. Moreover, the availability of new genomic information in a lineage with remarkable evolutionary features such as recurrent colonisations of the underground environment or complex reproductive anatomies indicatives of cryptic female choice, to cite two examples, will further provide valuable knowledge about relevant scienti c questions, such as the molecular basis of adaptation to extreme habitats or the genetic drivers of sexual selection, along with more general aspects related to gene content across main arthropod groups, the consequences of whole genome duplications or phylogenetic relationships with the Araneae. Additionally, since this genus experienced a spectacular adaptive radiation in Canary Islands, current genome draft could be very useful to further studies understanding of the genomic basis of island radiations.

Availability of supporting data and materials
The whole genome shotgun project has been deposited at DDBJ/ENA/GenBank under accession number QLNU00000000 and project id PRJNA475203. The version described in this paper is version QLNU01000000. This project repository includes raw data, sequencing libraries information and assemblies of the mitochondrial and nuclear genomes. Other relevant datasets such as annotation, reference-guide assembled transcripts, repeat and HCR data, as other data relevant for the reproducibility of results are available in the GigaDB dataset [66]. Tables Click here to access/download; Table;Tables.xlsx    Please find enclosed the revised version of our manuscript GIGA-D-19-00156R2 entitled "The draft genome sequence of the spider Dysdera silvatica (Araneae, Dysderidae): A valuable resource for functional and evolutionary genomic studies in chelicerates" in which we have fixed some points in your annotated version.

In particular
For the numbers in the text and tables, you wrote like this "39 609 522 995", please replace the space with comma in all numbers. Done And you can also revise the following things this time: 1) RRIDs for software: eg. you wrote "with Jellyish v.2.2.3 (RRID:SCR_005491) [28].", it should be "with Jellyish v.2.2.3 (Jellyish, RRID:SCR_005491) [28]." Please revise all these writings. Done 2) For the QIAGEN protocol issue Scott mentioned in last email, if you think it's a better way to present the protocol, you can add it this time. I think there is no need. We have used the original protocol of QIAGEN, with some very minor adjustments that are also indicated in the paper (extra wash; one centrifugation step less). Moreover, there is no any entry in protocols.io for the original QUIAGEN.
With best regards, Sincerely,

Dr. Julio Rozas Professor of Genetics
Cover Click here to access/download;Personal Cover;CoverLetter_GIGA.pdf