Draft genome of the Peruvian scallop Argopecten purpuratus

Abstract Background The Peruvian scallop, Argopecten purpuratus, is mainly cultured in southern Chile and Peru was introduced into China in the last century. Unlike other Argopecten scallops, the Peruvian scallop normally has a long life span of up to 7 to 10 years. Therefore, researchers have been using it to develop hybrid vigor. Here, we performed whole genome sequencing, assembly, and gene annotation of the Peruvian scallop, with an important aim to develop genomic resources for genetic breeding in scallops. Findings A total of 463.19-Gb raw DNA reads were sequenced. A draft genome assembly of 724.78 Mb was generated (accounting for 81.87% of the estimated genome size of 885.29 Mb), with a contig N50 size of 80.11 kb and a scaffold N50 size of 1.02 Mb. Repeat sequences were calculated to reach 33.74% of the whole genome, and 26,256 protein-coding genes and 3,057 noncoding RNAs were predicted from the assembly. Conclusions We generated a high-quality draft genome assembly of the Peruvian scallop, which will provide a solid resource for further genetic breeding and for the analysis of the evolutionary history of this economically important scallop.


Introduction
The Peruvian scallop (Argopecten purpuratus), also known as the Chilean scallop, is a medium-sized bivalve with a wide distribution in Peru and Chile [1]. In Chile, the cultured scallops reach a commercial size of around 9 cm in shell height within 14-16 months [2]. It is a relatively stenothermic species as its natural habitat is largely under the influence of upwelling currents from Antarctica [3]. Unlike other Argopecten scallops, the Peruvian scallop normally has a long life span of up to 7-10 years [4,5]. This scallop was introduced into China in the late 2000s and has played an important role in stock improvement of Argopecten scallops via interspecific hybridization with bay scallops [6,7].

Whole genome sequencing
Genomic DNA was extracted from an adductor muscle sample of a single A. purpuratus (Fig. 1), which was obtained from a local scallop farm in Laizhou, Shandong Province, China. A whole genome shotgun sequencing strategy was then applied. Briefly, six libraries with different insert length (250 bp, 450 bp, 2 kb, 5 kb, 10 kb, and 20 kb) were constructed according to the standard protocol provided by Illumina (San Diego, CA, USA). In de- tail, the DNA sample was randomly broken into fragments using covaris ultrasonic fragmentation apparatus. The library was prepared following end repair, adding sequence adaptor, purification, and polymerase chain reaction amplification. The matepair libraries (2 kb, 5 kb, 10 kb, and 20 kb) and paired-end libraries (250 bp, 450 bp) were all sequenced on the Illumina HiSeq4000 platform with paired-end 150 bp. In addition, SMRTbell libraries were prepared using either 10-kb or 20-kb preparation protocols. Briefly, the DNA sample was sheared by Diagenode Megarup-tor2 (Belgium), the SMRTbell library was produced by ligating universal hairpin adapters onto double-stranded DNA fragments. Adapter dimers were efficiently removed using Pacific Biosciences' (PacBio's) MagBead kit. The final step of the protocol was to remove failed ligation products through the use of exonucleases. After the exonuclease and AMPure PB purification steps, sequencing primer was annealed to the SMRTbell templates, followed by binding of the sequence polymerase to the annealed templates. Subsequent sequencing was performed on PacBio Sequel instrument with Sequel TM Sequencing Kit 1.2.1 (Pacific Biosciences, California, USA). Finally, the 10X Genomics library was constructed and sequenced with paired-end 150 bp on the Illumina Hiseq platform. The Chromium TM Genome Solution (10X Genomics, USA) massively partitions and molecularly bar codes DNA using microfluidics, producing sequencing-ready libraries with >1000,000 unique bar codes. In total, 463.19 Gb raw reads were generated, including 75.72, 70. 22, 19.21, 45.71, 28.34, 11.78, 18.01, and 194.20 Gb from the 250-bp, 450-bp, 2-kb, 5-kb, 10kb, and 20-kb libraries, PaBbio sequencing library, and 10X Genomics library, respectively. The raw reads were trimmed before being used for subsequent genome assembly. For Illumina HiSeq sequencing, the adaptor sequences, the reads containing more than 10% ambiguous nucleotides, as well as the reads containing more than 20% low-quality nucleotides (quality score less than 5),were all removed. For PacBio sequencing, the generated polymerase reads were first broken at the adaptor positions, and the subreads were generated after removal of the adaptor sequences. The subreads were then filtered by a minimum length = 50.

Estimation of the genome size and sequencing coverage
The 17-mer frequency distribution analysis [8] was performed on the remaining clean reads to estimate the genome size of the Peruvian scallop using the following formula: genome size = kmer number/peak depth. Based on a total number of 6.22 10 10 k-mers and a peak k-mer depth of 69, the estimated genome size was calculated to be 885.29 Mb (Table 1) and the estimated repeat sequencing ratio was 33.74%.

De novo genome assembly and quality assessment of A. purpuratus genome
All the pair-end Illumina reads were first assembled into scaffolds using Platanus v1.2.4 (Platanus, RRID:SCR 015531) [9], and the gaps were then filled by GapCloser v1.12-r6 (GapCloser, RRID: SCR 015026) [10]. Subsequently, the PacBio data were used for additional gap filling by PBJelly v14.1 (PBJelly, RRID:SCR 012091) with default parameters [11], and then all of the Illumina reads were used to correct the genome assembly by Pilon v1.18 (Pilon, RRID:SCR 014731) for two rounds [12]. After that, the 10X linked-reads were used to link scaffolds by fragScaff 140 324.1 [13]. First, in order to solve the issue of heterozygosity, in our assembly process we chose 19-kmer to draw k-mer distribution histogram and classified all the kmers into homozygous kmer and heterozygous kmer according to the coverage depth. Second, we utilized 45-kmer to construct the de Bruijn figure and combine the bubbles for heterozygous sites, according to the sequences with longer length and deeper coverage depth. Then, the pair-end information was used to determine the connection between the heterozygous parts and filter the contigs lacking support. Finally, the heterozygous contigs and homozygous contigs were distinguished based on contig coverage depth. After assembly, the reads from short insert length libraries were mapped onto the assembled genome. Only one peak was observed in the sequencing depth distribution analysis with the average sequencing depth of 148.2×, which is consistent with the sequencing depth, indicating high quality of the assembled scallop genome. Finally, a draft genome of 724.78 Mb was assembled (accounting for 81.87% of the estimated genome size of 885.29 Mb), with a contig N50 size of 80.11 kb and scaffold N50 size of 1.02 Mb (Table1). With this initial assembly, we mapped the short insert library reads onto the assembled genome using BWA 0.6.2 (BWA, RRID:SCR 010910) software [14] to calculate the mapping ratio and assess the assembly integrity. In summary, 91.05% of the short reads were mapped onto the assembled genome with a coverage of 89.40%, indicating high reliability of genome assembly. CEGMA v2.5 (Core Eukaryotic Genes Mapping Approach; CEGMA, RRID:SCR 015055) defines a set of conserved protein families that occur in a wide range of eukaryotes and presents a mapping procedure to accurately identify their exon-intron structures in a novel genomic sequence [15]. A protein is classified as complete if the alignment of the predicted protein to the HMM profile represents at least 70% of the original KOG domain, or otherwise classified as partial. Through mapping to the 248 core eukaryotic genes, 222 genes (89.52%) were identified. BUSCO v3 (Benchmarking Universal Single-Copy Orthologs; RR ID:SCR 015008) provides quantitative measures for the assessment of genome assembly completeness, based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs [16]. We confirmed that 89% of the 843 single-copy genes were identified, indicating good integrity of the genome assembly.

Gene functional annotation
Gene functions were predicted from the best BLASTP (e-value ≤10 −5 ) hits in SwissProt databases [30]. Gene domain annotation was performed by searching the InterPro (InterPro, RRID:SC R 006695) database [31]. All genes were aligned against Kyoto Encyclopedia of Genes and Genomes (KEGG, RRID:SCR 012773) [32] to identify the best hits for pathways. Gene ontology terms for genes were obtained from the corresponding InterPro entry [33]. Finally, among these annotated genes, 70.3% of encoded proteins showed homology to proteins in the SwissProt database, 91.1% were identified in the nonredundant database, 70.4% were identified in the KEGG database, 72.1% were identified in the In-terPro database, and 92.1% could be mapped onto the functional databases.

Noncoding RNA annotation
The noncoding RNA genes, including miRNAs, rRNAs, snRNAs, and tRNAs, were identified. The tRNAscan-SE 2.0 (tRNAscan-SE, RRID:SCR 010835) software with eukaryote parameters [34] was used to predict tRNA genes. The miRNA and snRNA genes in the assembled genome were extracted by INFERNAL 1.1.2 software [35] against the Rfam (Rfam, RRID:SCR 007891) database [36] with default parameters. Finally, 1132 miRNAs, 1664 tRNAs, 41 rRNAs, and 220 snRNAs were discovered from the Peruvian scallop genome. were analyzed. All data were downloaded from Ensemble [21] or National Center for Biotechnology Information (NCBI) [37]. For each protein-coding gene with alternative splicing isoforms, only the longest protein sequence was kept as the representative. Gene family analysis based on the homolog of gene sequences in related species was initially implemented by the alignment of an "all against all" BLASTP (with a cutoff of 1e-7) and subsequently followed by alignments with high-scoring segment pairs conjoined for each gene pair by TreeFam 3.0 [38]. To identify homologous gene pairs, we required more than 30% coverage of the aligned regions in both homologous genes. Finally, homologous genes were clustered into gene families by OrthoMCL-5 (OrthoMCL DB: Ortholog Groups of Protein Sequences, RRID:SCR 007839) [39] with the optimized parameter of "-inflation 1.5."All protein-coding genes from the 18 examined genomes were used to assign gene families. In total, the protein- coding genes were classified into 45,268 families and 108 strict single-copy orthologs (Fig.2).

Phylogenetic analysis
Evolutionary analysis was performed using these single-copy protein-coding genes from the 18 examined species. Amino acid and nucleotide sequences of the ortholog genes were aligned using the multiple alignment software MUSCLE (MUSCLE, RRID: SCR 011812) with default parameters [40]. A total number of 108 single-copy ortholog alignments were concatenated into a super alignment matrix of 242,085 nucleotides. A maximum likelihood method deduced tree was inferred based on the matrix of nucleotide sequences using RAxML v8.0.19 (RAxML, RRID:SCR 006086) with default nucleotide substitution model-PROTGAMMAAUTO [41]. Clade support was assessed using bootstrapping algorithm in the RAxML package with 100 alignment replicates (Fig. 3) [42]. The constructed phylogenetic tree (Fig. 3) indicated that the Peruvian scallop and Yesso scallop were clustered closely first and then clustered with oysters and mussels, which is in consistent with their putative evolution relationships [43][44][45][46].

The estimation of divergence time
The species divergence times were inferred with MCMCTree included in PAML v4.7a (PAML, RRID:SCR 014932) [47] with the parameter set as "burn-in = 1000, sample-number = 1000 000, sample-frequency = 2,"and evolutionary analysis was performed using single-copy protein-coding genes from the 18 examined species. Based on the phylogenetic tree (Fig.3), the molecular clock was calibrated based on the fossil records according to previous studies [48][49][50]. Finally, we estimated that the divergence between the Peruvian scallop and Yesso scallop happened at 113.6 million years ago.

Conclusions
In the present study, we report the first whole genome sequencing, assembly, and annotation of the Peruvian scallop (A. purpuratus), an economically important bivalve in Chile, Peru, and China. The assembled draft genome of 724.78 Mb accounts for 81.87% of the estimated genome size (885.29 Mb). A total of 26,256 protein-coding genes and 3,057 noncodingRNAs were predicted from the genome assembly. This genome assembly will provide solid support for in-depth biological studies. With the availability of these genomic data, subsequent development of genetic markers for further genetic selection and molecular breeding of scallops could be realized. The current genome data will also facilitate genetic analyses of the evolutionary history of the abundant scallops in the world.