The draft genome sequence of the grove snail Cepaea nemoralis

Abstract Studies on the shell color and banding polymorphism of the grove snail Cepaea nemoralis and the sister taxon Cepaea hortensis have provided compelling evidence for the fundamental role of natural selection in promoting and maintaining intraspecific variation. More recently, Cepaea has been the focus of citizen science projects on shell color evolution in relation to climate change and urbanization. C. nemoralis is particularly useful for studies on the genetics of shell polymorphism and the evolution of “supergenes,” as well as evo-devo studies of shell biomineralization, because it is relatively easily maintained in captivity. However, an absence of genomic resources for C. nemoralis has generally hindered detailed genetic and molecular investigations. We therefore generated ∼23× coverage long-read data for the ∼3.5 Gb genome, and produced a draft assembly composed of 28,537 contigs with the N50 length of 333 kb. Genome completeness, estimated by BUSCO using the metazoa dataset, was 91%. Repetitive regions cover over 77% of the genome. A total of 43,519 protein-coding genes were predicted in the assembled genome, and 97.3% of these were functionally annotated from either sequence homology or protein signature searches. This first assembled and annotated genome sequence for a helicoid snail, a large group that includes edible species, agricultural pests, and parasite hosts, will be a core resource for identifying the loci that determine the shell polymorphism, as well as in a wide range of analyses in evolutionary and developmental biology, and snail biology in general.


Introduction
Studies on the shell color and banding polymorphism of the grove snail Cepaea nemoralis (Figure 1), and its sister taxon Cepaea hortensis, played a prominent role in demonstrating how selective forces and random processes drive or maintain morphological variation, and contributed to the establishment of the field of ecological genetics (Jones et al. 1977;Cook 1998;Ozgo 2008). Alongside the peppered moth, the shell polymorphism of Cepaea snails is still the classic text book example used to illustrate natural selection and micro-evolution. Recently, C. nemoralis has been the focus of citizen science projects which studied shell color evolution in association to climate change and urbanization (Silvertown et al. 2011;Kerstes et al. 2019). Being relatively easily maintained and bred in captivity, this snail is also particularly appropriate for evo-devo studies of shell biomineralization (Mann and Jackson 2014;Jackson and Degnan 2016) and pigmentation (Kerkvliet et al. 2017;Affenzeller et al. 2020).
Previous work has shown that the shell polymorphism is controlled by a series of nine or more loci, of which five or more are tightly linked in a single "supergene" (Cook 1998;Gonzalez et al. 2019). This, combined with the advantages mentioned above, means that Cepaea has great potential to provide insights into supergene evolution and the role of genome structure in adaptation. However, progress in understanding the genetic basis of its color pattern formation has been slow, in contrast to other classical systems such as mimicry in Heliconius butterflies (Nadeau et al. 2016) and industrial melanism in the peppered moth ( Van 't Hof et al. 2016). Although some advancement toward identifying the supergene has been made recently (Richards et al. 2013;Kerkvliet et al. 2017), a lack of genomic resources has largely prevented further analyses.
Here, we present a draft assembly and annotation of the C. nemoralis genome, the first available genome for helicoid snails (Wade et al. 2007) and the second for a terrestrial mollusk, after the giant African snail Achatina fulica (Guo et al. 2019). Helicoidea is a large group of stylommatophoran land snails that includes not only important models for studies of shell formation and chirality (e.g. the genus Euhadra, see Davison 2020), but also several edible species (e.g. including Cepaea, but especially the genera Helix and Cornu) and agricultural pests. In addition, many Helicoidea are intermediate hosts of various parasites (e.g. Gé rard et al. 2020), and therefore are important subjects in studies of human and animal disease prevention.
Despite great ecological, economical, and medical importance, stylommatophoran land snails have been underrepresented in whole genome sequencing projects (Yang et al. 2020), mainly because of their large repetitive genomes (C-values between 1.68 and 4.00, see http://www.genomesize.com/). Usually, sequencing coverage above 30Â is recommended to overcome this problem (Dominguez Del Angel et al. 2018), but this is often financially challenging for individual research groups. Here, we took advantage of recent technological and computational breakthroughs to produce the draft assembly of C. nemoralis genome based on lower coverage PacBio sequencing. Even though the assembly presented here is rather fragmented, it should be a key resource for researchers working on diverse aspects of land snail biology, including the identification of genes involved in developmental processes, e.g. shell formation and color patterning. Furthermore, it will open up new research avenues for understanding such important biological processes as adaptation to urban environments and climate change, interactions with parasites, and reproduction.

Estimation of genome size by flow cytometry
The haploid chromosome number in C. nemoralis is 22 (Page 1978). We performed flow cytometry analysis to estimate the haploid genome size using zebrafish Danio rerio as a reference and the "CyStain PI Absolute P" reagent kit (Sysmex Europe, Germany). Briefly, zebrafish tail and snail foot tissues were chopped with a sharp razor blade in 500 mL ice-cold nuclei extraction buffer in a petri dish and incubated for 1 min. Then, the tissues were incubated for 30 minutes in 2.0 mL of staining buffer containing the fluorescent dye propidium iodide (50 mg/mL), RNAse (10 mg/mL), 0.1% dithiothreitol, and 1% polyvinylpyrolidone. The processed sample was passed through a nylon 50 mm filter. The DNA content of stained nuclei was determined using CyFlow-Cube-6 flow cytometer (Sysmex Europe, Germany) as an average of three replicates.

Sample preparation
A single mid-banded hyalozonate snail with yellow ground color was used for the construction of the reference genome. This individual (C981) is the offspring of cross #13 described in Gonzalez et al. (2019), partially inbred, with additional information on and DNA from five generations of the relatives available for future work. High-molecular-weight genomic DNA (HMW-gDNA) was extracted from frozen snail foot tissue using the CTAB (cetyl trimethylammonium bromide) protocol as described in Richards et al. (2013) and Gonzalez et al. (2019). In brief, slices of snail tissue were incubated at 65 C in extraction solution (3% CTAB, 100 mM Tris-HCl, pH 7.5, 25 mM EDTA, pH 8, 2 M NaCl) with 0.2 mg/mL proteinase K and 80 lg/mL RNase. Upon lysis, a chloroform extraction was performed, then three volumes of CTAB dilution solution were added (1% CTAB, 50 mm Tris-HCl, pH 7.5, 10 mM EDTA, pH 8). Samples were mixed until a precipitate appeared, then the supernatant was removed. The pellet was washed twice in 0.4 M NaCl in TE (0.4 M NaCl, 10 mM Tris-HCl, pH 7.5, 1 mm EDTA, pH 8), redissolved in 1.42 M NaCl in TE (1.42 M NaCl, 10 mM Tris-HCl, pH 7.5, 1 mM EDTA, pH 8), then precipitated in ethanol, spooled out, washed in 70% ethanol, and air dried. The integrity of extracted HMW-gDNA was evaluated by performing pulsed-field agarose gel electrophoresis, whereas the purity and concentration were measured by spectrophotometry (with Nanodrop 2000, Thermo Fisher Scientific Inc.) and fluorometry (with Qubit 3.0, Thermo Fisher Scientific Inc.), respectively.

Whole genome sequencing and quality control
We sequenced the genome of C. nemoralis using PacBio singlemolecule real-time (SMRT) and Illumina platforms. PacBio library preparation and sequencing were performed at Leiden Genome Technology Center (Leiden, the Netherlands). Without additional shearing, 4 mg of HMW-gDNA was converted into a SMRTbell library using "Procedure & Checklist-Preparing >30 kb Libraries Using SMRTbell Express Template Preparation Kit" (Pacific Biosciences). The insert size of the final library was then determined on Fragment Analyzer (Agilent Technologies). To increase the sequencing read length, an additional damage repair was performed on the library. The library was annealed with sequencing primer V4 and binding was done using binding kit version 3. The library was sequenced with 20 h movie-time using Sequel Sequencing kit v3.0 chemistry on 12 PacBio Sequel SMRT cells (PacBio Sequel System, RRID: SCR_017989), generating 7,202,997 subreads, or 80 Gb of sequence data (i.e. 23Â genome coverage). The polymerase read length N50 (18,196 bp) was only slightly higher than the subread length N50 (16,882 bp), indicating that the majority of data consists of continuous long reads (CLRs). In addition, 17,390 circular consensus sequencing (CCS) reads of >99% accuracy were generated as well.
For Illumina sequencing, HMW-gDNA was sheared with the Covaris M220 (Covaris Inc., Woburn, MA, USA), set to 500-bp fragment size. A paired-end library was prepared using NEBNext Ultra II DNA Library Prep Kit (New England Biolabs) and sequenced on the Illumina NovaSeq 6000 Sequencing System (RRID: SCR_016387). Illumina sequencing was performed at BaseClear B.V. (Leiden, the Netherlands). Initial quality assessment was based on data passing the Illumina Chastity filtering. Subsequently, reads containing PhiX control signal were removed using an in-house filtering protocol. In addition, reads containing (partial) adapters were clipped (up to a minimum read length of 50 bp). The second quality assessment of the remaining reads was done with FASTQC v0.11.5 (Andrews 2014). We obtained $400 million of filtered 150 bp paired-end reads, or 120 Gb of sequence data, representing $34Â coverage of a 3.5 Gb genome.

Heterozygosity estimation
Illumina paired-end reads were used to estimate heterozygosity of the sequenced individual by k-mer analysis. We used Jellyfish v2.3.0 (Jellyfish, RRID: SCR 005491) (Marcais and Kingsford 2011) to count canonical 31-mers from the sequencing data and to produce the k-mer count histogram with max coverage threshold set to 1,000,000. The latter was analyzed by GenomeScope (Vurture et al. 2017) to estimate the heterozygosity.

Genome annotation
The annotation was performed on the soft-masked assembly to avoid missing (parts of) coding sequences due to overlap with masked areas of the genome. We used the MAKER v2.31.10 pipeline (MAKER, RRID: SCR_005309) (Cantarel et al. 2007;Campbell et al. 2014) in three consecutive rounds, combining ab initio gene predictions with sequence-based evidence. In the first round, the available transcriptome generated from foot and mantle tissues of four C. nemoralis snails (147,397 contigs, see Kerkvliet et al. 2017), as well as the protein dataset of A. fulica snail (23,726 predicted proteins, see Guo et al. 2019), were aligned to the genome with BLASTn (BLASTN, RRID: SCR_001598) and BLASTx (BLASTX, RRID: SCR_001653) algorithms from BLAST v2.9.0þ (NCBI BLAST, RRID: SCR_004870), respectively (est2genome and protein2genome options in MAKER configuration file). After further refinement of these alignments with respect to splice sites using Exonerate v2.4.0 (Exonerate, RRID: SCR_016088) (Slater and Birney 2005), MAKER generated gene models and calculated their annotation edit distance (AED) scores in order to assess the quality of gene prediction (i.e. the lower AED value the smaller the difference between the predicted protein and the transcript/protein evidence). Out of 308,927 genes models generated in the first round, 89% had an AED <0.5, indicating that the annotation is well-supported by transcript and/or protein evidence. The second and third rounds of MAKER were performed on the gene models with AED < 0.4 obtained from the first and second runs, respectively. MAKER scripts maker2zff, fathom, forge, and hmm-assembler.pl were used to create snaphmm files (snaphmm option in maker configuration file) to train ab initio gene predictor SNAP (SNAP, RRID: SCR_002127) (Korf 2004). Another ab initio gene predictor, Augustus v3.3.3 (Augustus, RRID: SCR_008417) (Stanke et al. 2006), was selftrained running BUSCO v4.0.2 with the specific parameter (long); the generated "retraining parameters" file for C. nemoralis was included in the second and third rounds of MAKER annotation. The third and final round of MAKER generated 173,620 gene models with AED <0.5. As the annotation was performed on the soft-masked assembly, many of these putative genes could be derived from repetitive sequences, explaining such a high number. Hence, we removed gene models with >50% overlap within a single repeat region as annotated by Repeat Masker (see above). This resulted in the final set of 43,519 predicted protein-coding genes (Supplementary Files S2 and S3) with average AED of 0.27.
We performed functional annotation of predicted proteins using three automated methods. First, we applied Diamond (Buchfink et al. 2015) BLASTp searches (-sensitive -max-targetseqs 1 -outfmt 6 qseqid sallseqid pident evalue bitscore -evalue 1e-5) against UniProt reference proteome database (v2019_09, composed of 561,176 Swiss-Prot and 180,179,667 TrEMBL entries) and the NCBI nonredundant protein database (downloaded on 26 May 2020 and composed of 287,467,303 entries). Second, we used KEGG Automatic Annotation Server (KAAS) (Moriya et al. 2007) with eukaryotic species set and the bi-directional best-hit method to assign KEGG orthology (Kanehisa et al. 2012) to gene models. Finally, we used InterProScan (Jones et al. 2014) and Blast2GO (Gö tz et al. 2008) functions in the OmicsBox to examine motifs, domains, and signatures in the protein sequences and to assign gene ontology (GO) terms to the gene models.

Data availability
This C. nemoralis whole genome sequencing project has been submitted to NCBI with BioProject accession number PRJNA646049. Sequencing reads from Illumina and PacBio platforms have been deposited at NCBI Sequence Read Archive (SRA) under the accession numbers SRX8724912 and SRX8724913, respectively. The assembled genome sequence has been deposited at DDBJ/ENA/ GenBank under the accession JACEFZ000000000.  Figure S1 describes main characteristics of the predicted protein-coding genes.

Genome size and heterozigosity estimation
We used flow cytometry to determine that the haploid genome size of C. nemoralis is 2.06 times larger than that of the zebrafish (C-value $1.7, see Vinogradov 1998;Ciudad et al. 2002) and is therefore $3.5 picogram, or $3.42 Gb. When taking the total length of the most recent zebrafish genome assembly of 1.68 Gb (cf. Genome Reference Consortium, https://www.ncbi.nlm.nih. gov/grc/zebrafish/data, last accessed on 13-01-2021) as a reference, the genome size of C. nemoralis is calculated at $3.46 Gb. This fits within the range of estimated genome sizes for others members of the family Helicidae (C-values between 2.88 and 4.00, see http://www.genomesize.com/, last accessed on 13-01-2021). The 31-mer based estimate of genome size provided by GenomeScope ($3.1 Gb, see Figure 2 and Supplementary Table S1) is smaller than the flow cytometry estimate. Such discrepancy is often found in repeat-rich genomes (e.g. Edwards et al. 2018), because high-frequency repeats are difficult to model accurately, leading to an underestimation of total repeat length and therefore genome size. The heterozygosity of the individual C981 (Gonzalez et al.

Genome assembly and quality evaluation
We used 4.8 million PacBio long reads, or 73.7 Gb of sequence data, to assemble the genome of C. nemoralis. The assembly was polished with PacBio subreads and with highly accurate Illumina short reads. The final genome assembly has total length of 3.5 Gb and is composed of 28,537 contigs with N50 length of 333 kb ( Table 1). The mapping rate of Illumina reads agains the final assembly was rather high, with about 99.3% of the reads aligned, and 93.5% properly paired (i.e. both reads of the pair mapped to the same contig).

Figure 2
GenomeScope k-mer profile plot for the genome of C. nemoralis individual C981, based on 31-mers in Illumina reads. The observed k-mer frequency distribution is depicted in blue, whereas the GenomeScope fit model is shown as a black line. The unique and putative error k-mer distributions are plotted in yellow and red, respectively. Blobtools analyses indicated no substantial contamination with bacterial DNA (Figure 3 and Supplementary Table S2). About 75% of the contigs were assigned to Mollusca, whereas $20% were assigned to Chordata and Arthropoda. Closer examination of such cases revealed that the assignment to these two orders is due to a chance blast match with relatively high similarity over a small region of the contig (i.e. top hit is to a vertebrate/arthropod species but multiple other hits with a slightly lower bit score are to a mollusk species).
Finally, assembly completeness was assessed with BUSCO v4.0.2 (Simão et al. 2015), the tool that looks for Benchmarking Universal Single-Copy Orthologs (BUSCOs) that should be present in a metazoan genome. Out of the 954 metazoan BUSCOs, 832 (87.2%) were identified in the draft assembly of C. nemoralis genome as complete (709, or 74.3% as single copy, and 123, or 12.9% as duplicated), 36 (3.8%) as fragmented, and only 86 (9.0%) as missing (Supplementary Table S3). High levels of duplicated genes indicate that, despite haplotig removal, some genomic regions were assembled as separate contigs, most likely due to the high heterozygosity of the genome.

Genome annotation
We estimated the total repeat content of the C. nemoralis genome to be around 77% (Figure 4), comparable to the 71% found in A. fulica (Guo et al. 2019) and expected for such a large genome. Nearly 45% of the genome can be attributed to TEs: nonLTR retrotransposons such as LINEs (long interspersed nuclear elements) and SINEs (short interspersed nuclear elements), LTR (long terminal repeat) retrotransposons, and DNA transposons; $6.4% of the repeats were predicted to be small RNAs (i.e. transfer RNAs and Figure 3 BlobPlot of the C. nemoralis genome assembly. Each contig is represented by a circle, colored according to the best match to taxonomic annotation (e.g. Mollusca, Chordata, and so on) and distributed according to the proportion GC (x-axis) and read coverage (y-axis). The upper-and righthand panels show the distribution of the total span (kb) of contigs for a given coverage (right panel) or GC (upper panel) bin. small nuclear RNAs), satellites, simple and low-complexity repeats (Table 2 and Supplementary Table S4).
We annotated the genome using MAKER v2.31.10 (Cantarel et al. 2007;Campbell et al. 2014), by supplementing the ab initio gene predictions with the C. nemoralis transcriptome (Kerkvliet et al. 2017) and the protein dataset of the snail A. fulica (Guo et al. 2019), and two additional rounds of further refinement of gene models with multiple tools integrated into the MAKER pipeline. The final assembly contains 43,519 predicted protein-coding genes (Supplementary Files S2 and S3). Length distribution for genes, exons, and introns is comparable to those of other mollusks (Guo et al. 2019) (Table 3 and Supplementary Figure S1). About 93.1% of the predicted genes have multiple exons (4.7 on average), which is slightly lower than in other mollusks (Kenny et al. 2020). This could be explained by some degree of fragmentation in the gene models, especially those in small contigs. In addition, 97.3% of the predicted protein-coding genes had a hit to at least one of the databases (Table 4) and were functionally annotated (Supplementary Table S5).

Conclusions and perspectives
We performed whole-genome assembly of C. nemoralis using a combination of PacBio long-read technology and Illumina shortread sequencing. This $3.5 Gb draft assembly is composed of 28,537 contigs with the N50 length of 333 kb; repetitive regions cover over 77% of the genome. BUSCO analysis showed that only 9.0% of metazoan orthologs were missing, indicating high genome completeness. More than 43,000 protein-coding genes were identified in the genome, and more than 97.0% of these were functionally annotated from either sequence homology or protein signature searches. To our best knowledge, this is the largest gastropod genome sequenced and assembled to date. Compared to other gastropods (e.g. Guo et al. 2019;Gomes-dos-Santos et al. 2020;Sun et al. 2020), the genome of C. nemoralis is characterized by a very high content of repetitive sequences.
Despite its large size and the abundance of repeats, the assembly presented here is of high quality, and will be a valuable resource for the land snail research community. In particular, it will facilitate the identification of genes that drive the extraordinary diversity of shell colors and patterns in C. nemoralis, and the sister species C. hortensis, as well as comparative work in other stylommatophoran snails. In addition, the genome assembly described here will directly enable a wide range of studies on various aspects of terrestrial snail biology, from early development and biomineralization to physiology, behavior, and population genomics.

Acknowledgments
We thank Susan Kloet and Rolf Vossen from Leiden Genome Technology Centre for advice on PacBio sequencing and technical support. Figure 4 Repetitive content of the assembled C. nemoralis genome as identified by RepeatMasker. Numbers indicate percentages of the genome size. NonLTR retrotransposons of the LINE type and LTR retrotransposons, as well as unclassified sequences, dominate the repetitive content.