Chromosome-Level Genome Assembly of the Butter Clam Saxidomus purpuratus

Abstract
Herein, we provide the first whole-genome sequence of the purple butter clam (Saxidomus purpuratus), an economically important bivalve shellfish. Specifically, we sequenced and de novo assembled the genome of Sa. purpuratus using PromethION long reads and Hi-C data. The 978-Mb genome of Sa. purpuratus comprises 19 chromosomes with 36,591 predicted protein-coding genes. The N50 length of the Sa. purpuratus genome is 52 Mb, making it the most contiguous assembly among bivalve genomes. The Benchmarking Universal Single-Copy Orthologs (BUSCO) assessment indicated that 95.07% of the metazoan universal single-copy orthologs (n = 954) were present and complete in the assembly. Approximately 51% of the Sa. purpuratus genome comprises repetitive sequences. Using the high-quality Sa. purpuratus genome, we found that half of the genes encoding immune-associated scavenger receptor (SR) proteins are collinear with those in the closely related Cyclina sinensis genome, suggesting a high degree of conservation among immune-associated genes. Twenty-two (19%) SR proteins are tandemly duplicated in the Sa. purpuratus genome, suggesting putative convergent evolution. Overall, the Sa. purpuratus genome provides a new resource for the discovery of economically important traits and immune-response genes.


Introduction
The purple butter clam (fig. 1a), Saxidomus purpuratus (NCBI: txid311201), is an economically important marine clam belonging to the family Veneridae, subclass Heterodonta, and class Bivalvia. It inhabits mud at depths of up to 30 m in the intertidal zone of southwestern Korea (water temperature: 3-26 °C; salinity: 30-33‰). The shell of the purple butter clam is the heaviest and hardest among Korean shellfish and is composed of outer calcite crystals and inner aragonite layers (Jiao et al. 2015). Previous omic studies have reported the mitochondrial genome of Sa. purpuratus (Bao et al. 2016), as well as the transcriptome sequence for primary gene annotation and marker development (Li et al. 2017). In the current study, we generated the first whole-genome assembly of Sa. purpuratus and performed comparative genomic analysis, revealing that gene expansion is associated with adaptation to past marine chemical changes.

Results and Discussion
Genome Assembly of Sa. purpuratus
This estimate is within the range of 843 Mb (Scapharca broughtonii) to 1,071 Mb (Ruditapes philippinarum) and is similar to the sizes of previously assembled clam genomes (Bai et al. 2019; Yan et al. 2019; Wei et al. 2020). Based on the estimated genome size, our long and short reads covered the Sa. purpuratus genome 229- and 83-fold, respectively. To mitigate the effects of high heterozygosity, we assembled phased long reads and obtained a 1.06-Gb Sa. purpuratus assembly (table 1). For scaffolding, we sequenced 129.8 Gb of Hi-C reads and constructed 2,175 scaffolds (table 1). We observed the highest number of complete metazoan single-copy orthologs relative to ten other bivalve genomes and a relatively low number of complete duplicates (supplementary table S3, Supplementary Material online). This suggests a high-quality chromosome-level assembly of the Sa. purpuratus genome.
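The fold coverage and N50 statistics used to describe the assembly follow from simple arithmetic. The following Python sketch illustrates both under stated assumptions: the function names and the toy input values are hypothetical and are not taken from the study.

```python
def fold_coverage(total_bases, genome_size):
    """Sequencing depth: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of the total bases."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy values for illustration only (not data from this study).
toy_scaffolds = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_scaffolds))
print(fold_coverage(100e9, 1e9))
```

A larger N50 indicates that more of the assembly is contained in long sequences, which is why it is commonly reported as a contiguity measure.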

Phylogenomics and Gene Family Evolution
We analyzed genome conservation in Veneridae clams by comparing the high-quality genomes of Sa. purpuratus and Cyclina sinensis (Wei et al. 2020). We identified 14,771 collinear gene pairs in 4,019 syntenic blocks (fig. 1c), representing 12,824 (30.42%) Sa. purpuratus genes and 13,518 (49.04%) C. sinensis genes. We also analyzed the genome-wide distribution of scavenger receptor (SR) proteins (supplementary table S4, Supplementary Material online), which are known to be involved in the immune response of clams (Yan et al. 2019). Through an in-depth analysis, we identified nine SR families across 38 orthologous groups (OGs), namely, SR-A4, SR-A6, SR-B1, SR-E3, SR-F2, SR-H2, SR-I1, SR-L1, and SR-L2. SR proteins are distributed more widely throughout bivalve genomes than in gastropod (e.g., Haliotis discus) genomes (Nam et al. 2017). SR-F2 is the most abundant SR family gene in bivalve genomes (supplementary table S4, Supplementary Material online). We then examined which Sa. purpuratus SR-protein-coding genes are collinear with those in C. sinensis. A total of 62 (53.45%) Sa. purpuratus SR proteins retained collinearity with those of the closely related C. sinensis (fig. 1c). Genes encoding 22 (18.97%) Sa. purpuratus SR proteins and 19 (22.89%) C. sinensis SR proteins were tandemly duplicated in their genomes. In particular, three SR family genes, namely, SR-A4, SR-L1, and SR-L2, were observed to be expanded in the Sa. purpuratus genome. A previous study reported that SR-A4 induces an immune response by recognizing lipoproteins and oxidatively modifying low-density lipoproteins (Selman et al. 2008). Meanwhile, SR-L1 recognizes a myriad of cargo ligands or bioactive molecules (Herz and Strickland 2001), and SR-L2 binds to various internal ligands, including leptin, insulin, and amyloid peptide (Bartolome et al. 2017). In fact, mice lacking SR-L2 in brain endothelial cells exhibit neuroinflammation (Bartolome et al. 2017).
Moreover, a previous functional study on SR proteins has revealed an association with the evolution of clam immunology, in particular, via recognition of a wide range of common ligands (Zani et al. 2015). Taken together, these results suggest that SR proteins have evolved independently in a specific lineage, which may explain both evolutionary consensus and divergence of SR proteins.
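As an illustration of how tandemly duplicated genes can be flagged, the following Python sketch marks same-family genes that lie next to each other on a chromosome. It is a simplified stand-in and does not reproduce the MCScanX classifier; the input tuple layout, the function name, and the `max_gap` parameter are assumptions made for this example.

```python
from collections import defaultdict


def tandem_duplicates(genes, max_gap=1):
    """Return IDs of genes whose nearest same-family neighbor on the same
    chromosome lies within max_gap positions in gene order.

    genes: iterable of (gene_id, chromosome, order_index, family) tuples
    (a hypothetical layout chosen for this sketch).
    """
    by_chrom = defaultdict(list)
    for gene_id, chrom, order, family in genes:
        by_chrom[chrom].append((order, gene_id, family))

    tandem = set()
    for entries in by_chrom.values():
        entries.sort()
        last_seen = {}  # family -> (order_index, gene_id) of the previous member
        for order, gene_id, family in entries:
            if family in last_seen:
                prev_order, prev_id = last_seen[family]
                if order - prev_order <= max_gap:
                    tandem.update((prev_id, gene_id))
            last_seen[family] = (order, gene_id)
    return tandem


# Toy gene list for illustration only (not data from this study).
toy_genes = [
    ("g1", "chr1", 0, "SR-A4"),
    ("g2", "chr1", 1, "SR-A4"),
    ("g3", "chr1", 5, "SR-A4"),
    ("g4", "chr2", 0, "SR-L1"),
    ("g5", "chr1", 2, "SR-L1"),
]
print(sorted(tandem_duplicates(toy_genes)))
```

With `max_gap=1`, only directly adjacent family members are counted; a larger value would also capture nearby duplicates separated by a few intervening genes.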

Conclusion
The genome of Sa. purpuratus, the purple butter clam, comprises 19 pseudo-chromosomes with 36,591 protein-coding genes. Evolutionary comparison of the SR-protein-coding genes revealed the expansion of SR-A4, SR-L1, and SR-L2 in Sa. purpuratus compared with those in other clam genomes. Half of the SR-protein-coding genes were collinear with those in the C. sinensis genome, whereas approximately 20% of them were tandemly duplicated. Provision of this reference genome of an economically important bivalve shellfish could be a useful scientific resource for genetic studies in areas such as ecology and environmental adaptation.

Fig. 1 (partial caption): Hi-C contact map shows 19 pseudo-chromosomes of the Sa. purpuratus genome. (c) Circos diagram represents collinear gene pairs (gray lines) between Sa. purpuratus (SPU) and Cyclina sinensis (CSI). Colored lines represent scavenger receptor (SR) proteins with evolutionary relationships predicted using MCScanX (Wang et al. 2012).

Genome Biol. Evol. 14(7) https://doi.org/10.1093/gbe/evac106 Advance Access publication 26 July 2022

Sample Collection and Genomic DNA
Saxidomus purpuratus samples were obtained from Eunpa Fisheries Company (Sadeung, Republic of Korea; juveniles, shell width of approximately 10 mm) and Jangmok Bay (Geoje, Gyeongnam, Republic of Korea; 34°59′21.2″N 128°40′52.4″E; adults, shell width of approximately 70 mm). The total DNA of Sa. purpuratus muscle tissue was extracted and processed as previously described (Kim et al. 2019).
RNA was extracted using 700 µl of water-saturated phenol. A 1/3 volume of 8 M LiCl was added to the retained aqueous phase, which was maintained at 4 °C for 2 h. RNA was precipitated after centrifugation at 16,000 × g for 30 min, followed by resuspension in 300 µl of diethylpyrocarbonate (DEPC)-treated water. Next, RNA was reprecipitated with 1/10 volume of 3 M sodium acetate (pH 5.2) and isopropanol. The precipitated RNA was rinsed with 70% ethanol (diluted in DEPC-treated water) and dissolved in an appropriate volume of DEPC-treated water (30-40 µl). The RNA library of Sa. purpuratus soft muscle was constructed using the Illumina TruSeq Stranded mRNA LT Sample Prep Kit (Illumina, Inc., San Diego, CA, USA) and sequenced on the NovaSeq 6000 platform (Macrogen, Inc., Seoul, Republic of Korea).

Short-Read Sequencing and Genome Size Estimation
For short reads, DNA libraries were constructed using the TruSeq Nano HT Sample Preparation Kit (Illumina, Inc.), and paired-end reads were generated on the NovaSeq 6000 platform (Illumina, Inc.) according to the manufacturer's instructions. For quality control of the short reads, we trimmed adapters and low-quality reads (Q < 20) using Trimmomatic (ver. 0.64; RRID: SCR_011848; Bolger et al. 2014).

Hi-C Long-Range Mapping-based Data Generation and Sequencing
To construct a Hi-C library, we collected Sa. purpuratus muscle tissues from the same individuals used for long- and short-read sequencing. The Arima-Hi-C kit (Arima Genomics, Inc., San Diego, CA, USA) was used according to the manufacturer's instructions. The Hi-C library was sequenced using the NovaSeq 6000 platform.

De novo Assembly of RNA-sequencing Data
Quality control of the RNA-sequencing reads was achieved by trimming adapter sequences and low-quality reads below a Phred score of 20. Contaminated reads were removed as described for the genomic short reads. De novo assembly of the transcriptome was performed using the Trinity assembler (ver. 2.11.0; RRID: SCR_013048; Grabherr et al. 2011). Finally, we extracted coding regions within the assembled transcripts using TransDecoder (ver. 5.3.0; RRID: SCR_017647; https://github.com/TransDecoder/TransDecoder/).
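To make the Phred threshold concrete, the following Python sketch decodes a FASTQ quality string and trims low-quality bases from the 3' end of a read. It is a simplified illustration and not the algorithm of any specific trimming tool; it assumes Phred+33 encoding, and the function names are hypothetical.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Base-call error probability for Phred score q: P = 10^(-q/10)."""
    return 10 ** (-q / 10)


def trim_3prime(seq, quality, min_q=20):
    """Remove trailing bases whose Phred score is below min_q."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


# Toy read for illustration only: 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2.
print(trim_3prime("ACGTAC", "II5I##"))
print(error_probability(20))
```

A Phred score of 20 corresponds to an expected error rate of 1 in 100 base calls, so the threshold keeps bases called with at least 99% accuracy.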
We assessed the Sa. purpuratus genome using the BUSCO analysis with molluscan OrthoDB (ver. 5.2.1) and compared the BUSCO values with those of ten bivalve genomes, including two scallops (Atlantic bay scallop [Argopecten irradians; Liu et al. 2020

Orthologous Gene Family and Synteny Analysis
For effective comparative analysis, representative bivalve genomes with a high N50 were selected and analyzed using long-read-based assembly (supplementary table S3, Supplementary Material online). We collected ten bivalve genomes including those of two scallops, two oysters, and four clams.

Classification of Scavenger Receptors
We collected 66 previously classified SR proteins in humans and mice (supplementary table S5, Supplementary Material online; Zani et al. 2015) and identified their domains using Pfam (supplementary table S5, Supplementary Material online). These data were used to classify the SR proteins in our samples. The protein sequences were subjected to homology searches against human and mouse SR proteins (e-value < 1e-10), and SR-coding domains were identified. We mapped putative SR proteins to the OGs (supplementary table S4, Supplementary Material online). Considering that several orthologous genes classified by the OrthoMCL algorithm were found to lack the SR-coding domain, we defined an OG as an SR-protein OG when >50% of its protein members conserved the SR-coding domain and it included more than ten members. Based on these criteria, we identified 38 OGs for nine SR-protein families and manually identified species-specific expansion of the SR proteins in each OG.
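The two criteria for calling an SR-protein OG can be expressed as a short filter. The following Python sketch is a minimal illustration, assuming that each OG is given as a list of protein IDs and that the proteins carrying an SR-coding domain are given as a set; the function and variable names are hypothetical.

```python
def is_sr_orthogroup(members, sr_domain_proteins, min_fraction=0.5, min_members=10):
    """Apply both criteria: more than min_members proteins in the OG, and
    more than min_fraction of them carrying an SR-coding domain."""
    n_members = len(members)
    if n_members <= min_members:
        return False
    n_with_domain = sum(1 for protein in members if protein in sr_domain_proteins)
    return n_with_domain / n_members > min_fraction


# Toy orthologous group for illustration only (not data from this study).
toy_og = ["p%d" % i for i in range(12)]
toy_domain_hits = set(toy_og[:7])
print(is_sr_orthogroup(toy_og, toy_domain_hits))
```

Both comparisons are strict, so an OG with exactly ten members, or with exactly half of its members carrying the domain, does not qualify.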

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.