The pearl oyster Pinctada fucata martensii genome and multi-omic analyses provide insights into biomineralization

Abstract Nacre, the iridescent material found in pearls and shells of molluscs, is formed through an extraordinary process of matrix-assisted biomineralization. Despite recent advances, many aspects of the biomineralization process and its evolutionary origin remain unknown. The pearl oyster Pinctada fucata martensii is a well-known master of biomineralization, but the molecular mechanisms that underlie its production of shells and pearls are not fully understood. We sequenced the highly polymorphic genome of the pearl oyster and conducted multi-omic and biochemical studies to probe nacre formation. We identified a large set of novel proteins participating in matrix-framework formation, many in expanded families, including components similar to that found in vertebrate bones such as collagen-related VWA-containing proteins, chondroitin sulfotransferases, and regulatory elements. Considering that there are only collagen-based matrices in vertebrate bones and chitin-based matrices in most invertebrate skeletons, the presence of both chitin and elements of collagen-based matrices in nacre suggests that elements of chitin- and collagen-based matrices have deep roots and might be part of an ancient biomineralizing matrix. Our results expand the current shell matrix-framework model and provide new insights into the evolution of diverse biomineralization systems.


Background
Biomineralization is an extraordinary process where minerals form not following rules of inorganic chemistry but through active biological facilitation and control.
Biomineralization is widely distributed and essential to the lives of diverse organisms, ranging from algae to vertebrates that rely on mineralized materials for morphology, structure, protection, movement and feeding. Three principal classes of skeletal biominerals exist on earth: calcium carbonate, calcium phosphate and silica [1].
Nacre is the remarkable biomineral found in pearls and shells of molluscs that provides lustre and enhanced toughness. The formation of lustrous pearls and shells in molluscs such as the pearl oyster Pinctada fucata martensii has long fascinated humans. The biomineralization process of nacre formation is complex and involves sophisticated organic matrices as well as cells, many aspects of which remain elusive [4][5][6][7][8]. The origin and homology of nacre formation with other biomineralization processes such as crustacean shell and vertebrate bone formation is not understood [9].
Studies of biomineralization and other fundamental questions in biology and evolution can be greatly empowered by whole genome analyses, which have been difficult in molluscs owing to challenges in assembling their highly polymorphic and complex genomes [6,10]. To understand the biomineralization process of nacre, we sequenced and assembled the P. f. martensii genome and generated transcriptomes from 11 organs/tissues and 12 developmental stages, along with the proteomes of shell organic matrices.

Data description
We used a pearl oyster from a third-generation line selected for fast growth for sequencing and assembly using a BAC-to-BAC strategy. In addition to BAC sequencing, we also constructed whole genome shotgun (WGS) libraries including 3 with short insert-sizes and 4 with long insert-sizes. We used a draft assembly from a previous study [10], Sanger-sequenced BACs and transcripts generated by RNA-seq to assess the integrity of our assembly. Furthermore, to anchor scaffolds to chromosomes, we constructed a genetic map using restriction-site associated DNA sequencing (RAD-seq) using 148 F1 offspring obtained by crossing two genetically distant parents.
To determine gene expression profile in different organs or tissues, we performed transcriptome sequencing on eleven organs and tissues, including adductor muscle, 4 mantle, mantle pallium, mantle edge, hepatopancreas, hemocyte, gonad, gill, foot, and pearl sac at 180 days (d) after nucleus transplantation. Further, we performed transcriptome sequencing on 12  To understand gene regulation during nacre formation, we analyzed transcriptome data from mantle edge, mantle pallial and two entire mantle tissues representing fast and slow growing pearl oysters with WGCNA (weighted-gene co-expression network analysis), and obtained co-expression network patterns. All sequencing and genome data were uploaded to GigaDB under the accession number XXX.

Results
Genome assembly and characterization. As our initial assembly of ~130 Gb (134-fold coverage) of whole-genome shotgun (WGS) Illumina sequences (Additional file 1: Table S1) was too fragmented for annotation and analysis, probably due to high polymorphism and repetitive sequences (Additional file 2: Figure S1a and b), we subsequently adopted a BAC-to-BAC (bacterial artificial chromosome) sequencing strategy [6,11]. We sequenced 46,080 BACs (5-fold genome coverage) to a depth of 100X using Illumina next-generation sequencing (NGS), assembled each BAC separately (Additional file 2: Figure S1c), and then built supercontigs after merging and filtering redundant sequences. After constructing scaffolds and filling gaps with WGS reads, we obtained a final assembly of 990,658,107 bp with a contig N50 size of 21 kb and a scaffold N50 of 324 kb (Additional file 1: Table S2), which was a significant improvement compared with the contig N50 of 1.6 kb of the previous draft assembly [10].
Searches against public databases showed that 84.0% of the gene models matched known proteins (Additional file 1: Table S8). Further, BUSCO analysis shows that 82.8% of predicted genes are completed and 7.4% of them are fragmented, indicating our assembly is adequate for further analysis. To assess the impact of selection, we determined codon usage, GC content of intron, exon and inter-genic regions, and GC content at each codon position, which were similar in P. f. martensii and other 8 species (Additional file 4: Figure S3).
Phylogenetic analysis of the sequenced genomes of P. f. martensii, C. gigas and L.
gigantea along with selected model organisms provided estimates of divergence times: 485 million years ago (mya) between P. f. martensii (Bivalvia) and L. gigantea (Gastropoda) and 316 mya between P. f. martensii (Pteriidae) and C. gigas (Ostreidae) (Additional file 5: Figure S4). These estimates are in agreement with the most up-to-date phylogenetic analyses of molluscan evolution [12]. Compared to Homo sapiens and Danio rerio, molluscan genomes do not have transforming growth factor (TGF)-beta factors but only bone morphogenetic proteins (BMPs), but these two proteins share a common origin with TGF-beta being derived from BMPs (Additional file 1: Table S10, S11 and S12; Additional file 6: Figure S5,). TGF-beta factors are crucial in regulating osteoblast proliferation, differentiation and bone matrix maturation in vertebrates [13,14]. This finding suggests that molluscs have maintained an ancient BMP-regulatory system for shell formation [15], while TGF-beta emerged in vertebrates to regulate bone matrix.   Figure S6a). 6 Transcriptome analysis of different tissues indicated that some chitin synthases (CHSs) and chitinase were highly expressed in the mantle and pearl sac, the two main calcifying tissues responsible for shell and pearl formation (Additional file7: Figure   S6b and S6c). During larval development, some CHSs and chitinases were highly expressed at the trochophore and post-veliger stages (Additional file 7: Figure S6d), corresponding to prodissoconch and dissoconch/adult shell formation, respectively. Furthermore, the gene family of CHS was significantly expanded in P. f. martensii and other shelled molluscs, but not in molluscs without shells, Octopus bimaculoides (Additional file 1: Table S10). These results suggest that chitin is a key component of the shell matrix, and CHS genes in P. f. martensii and other shelled molluscs might have played crucial roles in the evolution of advanced shells in molluscs.
The presence and involvement of VWA containing proteins (VWAP). According to the current model, silk proteins are major components of the organic matrix in molluscan shells. We searched for silk proteins in the P. f. martensii genome and the proteome of the shell matrix but found none. Interestingly, a total of 10 VWAPs were detected in the nacre proteome with 372 unique spectra, compared with 146 spectra in the prismatic layer proteome (Additional file 8: Datasets S1). Among the 10 VWAPs, 8 VWAPs specifically existed in the nacre and not found in the prismatic layer proteome, and 8 VWAPs contained VWA domains that show highest sequence homology with VWA domains of human or mouse collagens (Additional file 1: Table   S13). Corresponding to the abundance of VWAPs in the nacre proteome, the P. f. martensii genome has an expanded family of 164 VWAPs, similar to the 162 found in C. gigas [6] but more than the 94 found in L. gigantea and 91 in humans (Fig. 2a).
The 10 VWAPs were highly expressed in the mantle pallium and pearl sac, which are responsible for nacreous layer production (Fig. 2b). All 10 VWAPs were up-regulated (at least 5X of the level in egg) in post-veliger larvae with nacreous/aragonite shells, again suggesting their crucial role in nacreous matrix formation. Meanwhile, two of the 10 VWAPs (Pma_10019835, Pma_10011421) was significantly up-regulated (40X of egg) at the trochophore stage, in correlation with aragonite shell formation (Fig. 2c).
After inhibition of six VWAPs (Pma_44.534, Pma_530.149, Pma_10011175, Pma_10019835, Pma_10019836, Pma_10015641) by RNA interference, the microstructure of the nacre showed disordered growth, as observed by scanning electron microscopy (SEM) (Fig. 2d and Additional file 9: Figure S7). These results suggest that VWAPs are a major component of the nacreous organic matrix and play a key role in nacreous shell formation in P. f. martensii. Structure analysis indicated two of the 10 VWAPs (Pma_10015641 and Pma_10011421) has a chitin-binding domain, supporting its possible function in interacting with the chitin framework during matrix formation. Surprisingly, there are no collagens containing both VWA and triple helix repeat (THRs) in genomes of the three molluscs analyzed (P. f. martensii, C. gigas and L. gigantea). Collagens with VWA and THR are found in some invertebrates (C. teleta, H. robusta and Mytilus coruscus [16]) (Additional file 10: Figure S8), suggesting that THRs may be lost in some lineages. THR containing genes showed contraction in the mollusc genomes, over expansion in annulata and vertebrates (Fig. 2a).

Acidic glycosaminoglycans (GAGs) constitute a gel-like substance.
According to the matrix model, the shell matrix contains a gel-like substance where acidic proteins induce the nucleation of calcium carbonate crystals [17]. Consistent with the model and previous reports, we identified a list of acid proteins that might be involved in shell formation (Additional file 8: Datasets S2). In addition to the acidic proteins that are unique to molluscan shells, we also found acidic GAGs, fibronectin-like proteins and chondroitin sulfotransferases that are characteristic components of vertebrate bone matrices. By Alcian blue-periodic acid Schiff staining (AB-PAS), we found that the organic matrix extracted from nacreous shells contained large amounts of acidic GAGs compared with mainly neutral GAGs in prismatic layers of P. f. martensii and C. gigas shells. In addition, we detected acid GAGs in secretory cells of the mantle pallium of P. f. martensii, but mainly neutral GAGs in the mantle of C. gigas (Fig. 3a).

Discussion
The assembly of highly polymorphic genomes and gene prediction in non-model organisms remain challenging. Software based on de Bruijn Graph, such as SOAPdenovo [18], is inadequate in producing satisfactory results due to the increased complexity of de Bruijn graph structure. Overlap-Layout-Consensus assembler, such as Celera Assembler [19], based on the data of fosmids or BACs hierarchical sequencing and third-generation long reads (such as PacBio long reads) are employed to overcome such problems. However, the best choice for assembling complex genomes is to sample haploid or homozygous sequences. For the ab initio gene prediction software, such AUGUSTUS [20], the aim is to find potential coding sequences with sufficiently long open reading frames, but the translated regions may be too short making the absence of stop codons meaningless. The similarity-based approaches including homologous protein sequences, EST sequences and transcripts assembled from RNA-seq reads can produce biologically relevant predictions, but they may not cover all coding exons. Considering their strengths and weaknesses, synthesis software, such as GLEAN [21] and MAKER [22], were used to synthesize these evidences obtained from ab initio gene predictions and similarity-based approaches into the final gene annotation. BUSCO [23] analysis indicates our assembly is sufficiently complete.
The aragonite nacre that gives pearls and certain shells their lustre and enhanced toughness is the target of many studies and modelling. According to the matrix model of molluscan shell formation, the mineralization of calcium carbonate is directed by a mantle-secreted organic matrix [24,25], which is not fully understood but may contain chitin [26][27][28] and silk fibroin [29][30][31] for the structural framework and soluble acidic proteins for crystal nucleation [32][33][34]. Alternatively, the cellular hypothesis argues that biomineralization may be directed by hemocytes [7,35] although there is no dispute about the involvement of organic matrices which are the focus of our study. Chitin is an ancient macromolecule and the primary framework component of organic matrices in cell walls of fungi and diatoms, sponge skeletons and arthropod shells [3]. It is possible that the chitin component of lophotrochozoan 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 and ecdysozoan shells and of sponge skeletons constitute a shared feature and have the same ancient origin. Our results provide strong evidence that chitin is the basic component of P. f. martensii shell matrices.
While silk proteins, which are also considered as the major components of the organic matrix in the molluscan shell, were not found in the P. f. martensii genome and the proteome of the nacreous shell matrix, abundant signatures of expanded VWAPs were detected in the nacre proteome. VWAPs were also reported to be found in C. gigas, Mytilus edulis and Pecten maximus [36]. The VWA domains are a family of 200-amino-acid residues and function as interaction modules in many intra-and extracellular proteins, such as copines, integrins, von Willebrand factor, complement factors B and C2, matrilins, and collagens [37]. Collagens are a large family of extracellular matrix proteins with typical THRs. Eight of the 28 known collagens (collagen VI, VII, XII, XIV, XX, XXI, XXII, and XXVIII) contain VWA domains in addition to THRs. The finding that VWAs of VWAPs from shell matrix show the highest homology with VWAs of vertebrate collagens, suggest that these VWAPs and vertebrate collagens may have a common origin. It is possible that collagens with VWAs are evolved from VWAPs through the addition of THRs, and VWAPs of P. f. martensii represents an ancient form that never acquired THRs. It is also possible and that some of the VWAPs were collagens that lost THRs. Collagens with both VWAs and THRs are found in some invertebrates such as C. teleta, H. robusta and Mytilus coruscus [16], but not in P. f. martensii, C. gigas and L. gigantea, which argues for the loss of THRs in some molluscan lineages. THRs are crucial for the self-assembly of collagen subunits into triple-helix protomers and the formation of fibrillar collagens [38,39]. The absence of THRs in VWAPs indicates that VWAPs may function differently in the nacreous shells of P. f. martensii from collagens in bone. VWAPs may not self-assemble into fibrous structures but instead cross-link with each other and other matrix proteins to form a network structure [40][41][42]. The finding of VWAPs with chitin-binding domains further highlights their function in interacting with the chitin framework during matrix formation. In addition, VWA domains bind to positive ions that attract water, and may cooperate with GAGs or other proteins and provide initial hydrogel properties for biomineralization [37].
Mammalian cartilage and bone matrices consist of collagen fibrils and a gel-like ground substance that is rich in chondroitin-containing proteoglycans, fibronectins 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 and link proteins [43]. Our results confirm the presence of fibronectin-like proteins in shells of P. f. martensii and C. gigas [6]. Proteoglycans or GAGs, which have strong water-binding capabilities and have been detected in the shell [29], may function as the gel-like substance [44]. In the nacre and the secretory cells of the mantle pallium of P. f. martensii, we found large amounts of acidic GAGs, which have also been detected in coral [45] and bone [46], and this finding argues that the acidic GAGs might also play key roles in crystal nucleation during nacre formation. Combining this finding with the findings of collagen-related VWAPs and other elements shared by bone formation, our results suggest that the nacreous shell matrix, while having a chitin-based framework, also possesses key elements of collagen-based matrices, such as fibronectins, proteoglycans and chondroitin sulfotransferases. Chitin-and collagen-based matrices are considered as two basic types of biomineralizing framework, and our results suggest that they may share some basic components and have a common origin, or might have co-existed as parts of an ancient/ancestral matrix with dual-elements, despite subsequent divergence in different taxa into chitinor collagen-based organic matrices.
The shell organic matrix, rather than being a simple self-assembling structure, might instead be a complex and dynamic matrix that requires active construction, regulation and remodelling. Tyrs, which can catalyse the formation of dopa and dopaquinone, were highly abundant in the shell of bivalves, and may function in mediating intermolecular cross-links [6,47], or as a structural component of the shell.
Tyrs belong to the "type-3 copper" family and have a conserved active site of six histidine residues mediating the binding of copper ion as cofactor [48]. Metal ions such as Cu 2+ , Zn 2+ and Mg 2+ are important factors for stabilizing the crystalline form of calcium carbonate [49][50][51]. Therefore, the deposition of Tyrs and associated metal ions in the matrix may regulate metal ion concentration in the extrapallial fluid and help to stabilize the crystalline form. Interestingly, we found that the histidine residues were retained in the 4 prism-specific Tyrs but mostly lost in the two nacre-specific Tyrs (Pma_10005159 and Pma_10016044), suggesting possible divergence in metal ion binding capability between nacre-specific and prism-specific Tyrs. It should be noted that many of expanded Tyrs may be unrelated to shell formation as shell-less Octopus bimaculoides also shows some expansion (Table S10), and instead they may function in their well-established roles in melanin pigment production, wound healing and immune responses in P. f. martensii also [52]. 12 The complexity of the shell matrix and biomineralization processes is further demonstrated by co-expression network analysis, which indicates that genes related to polysaccharide metabolism are significantly co-expressed with nacre proteins. This result is consistent with the abundance of chitin and acid GAGs in the nacreous layer.
Interestingly, nacre proteins were also co-expressed with ABC-transporters known as ATP-dependent transport proteins. ABC-transporters may mediate the secretion of proteins without signal peptide [53], which are not uncommon among nacre proteins and may be also secreted through other mechanisms such as exosomes [6]. Some nacre proteins without signal peptide may be due to assembly and annotation errors.
More importantly, signal pathway related to bone formation, such as Wnt signalling pathway and osteoclast differentiation signalling pathway, were also implicated.
Together, these results suggest that molluscan shell formation is an elaborate and dynamic process that shares certain basic elements with mammalian bone formation, but with added complexity. Although molluscan shells have a chitin-dominated framework, the identification of key elements shared by collagen-based matrices supports a single origin for the two types of matrices or a common set of tools that may have been lost, modified and reorganized during evolution to produce diverse forms of biomineralized structures in adaptation to new environments and in assuming new functions.
In conclusion, we sequenced and assembled the highly polymorphic genome of P. f. martensii using NGS and the BAC-to-BAC strategy. Based on genomic, transcriptomic, and proteomic analyses and experimental studies, we identified a large number of genes related to shell nacre formation, which helped us to re-construct the shell matrix model (Fig. 5). The identification of collagen-related VWAPs and other elements of collagen-based matrices in the chitin-rich nacre matrix supports the homology and single evolutionary origin of the common biomineralization toolkit.
The hypothesis of a single evolutionary origin challenges the prevailing idea of independent evolution [2] and may stimulate homology-based studies towards a better understanding of the diverse forms of biomineralization.
Hierarchical BAC-to-BAC assembly strategy. We used a hierarchical BAC-to-BAC assembly approach as used for the moth genome [11]. Before the hierarchical assembly of BACs, we used SOAPdenovo to assemble the reads of each BAC with odd numbered K-mers from 27 to 63 and selected the best results with the longest scaffold N50 and total length, as primary scaffolds. Then, we used the paired-end reads information of the BACs and locally assembled the reads in the gap regions to fill in the gaps within the primary BAC scaffolds. Our custom assembly software (Rabbit) [11] was used to assemble scaffolds of BACs with large overlaps. After finding relationship among sequences, merging overlapping sequences and removing redundant sequences, we obtained longer segments as secondary scaffolds. Finally, SSPACE was used to join the secondary scaffolds to form final scaffolds, and SOAP-Gapcloser was used to fill in the gaps in the final scaffolds using all WGS reads with short insert sizes.
Linkage group construction. We constructed a genetic map using RAD-seq of 148 F1 progeny from a family obtained by crossing two genetically distant parents. We used SOAP2 [54] to map the reads to the reference genome sequences of P. f. martensii (scaffolds) and performed SNP calling using SOAPsnp [55]. After SNP calling, we extracted genotypes by combining all SNPs among the 148 progeny and the 2 parents and constructed linkage map using JoinMap 4.1 [56].
Phylogenetic tree construction and divergence time estimation. We used Treefam to obtain gene families and one-to-one orthologs, and used MrBayes to construct the phylogenetic tree.
Transcriptome analysis. We extracted total RNA from each sample and isolated mRNA using oligo (dT) magnetic beads. Then, the mRNA was fragmented into short fragments (200~500 bp) for construction of RNA-seq libraries that were sequenced on an Illumina HiSeq2000. Using SOAP2, all clean reads were mapped to the genome assembly with less than 5 mismatches. We used the RPKM method (Reads per kilobase transcript per million mapped reads) to calculate the gene expression levels.
We also tested TPM (Transcripts Per Million) [57] for quantifying gene expression and found excellent correspondence between RPKM and TRM for our samples.
Peptide mass tolerance was set to 0.05Da and fragment mass tolerance was set to 0.01Da. We used target-decoy search strategy [58] to identify the matrix proteins, and the False Discovery Rate (PDR) was ≤1%.
Extraction of matrix proteins from the nacre and prismatic layer. Shells of freshly collected oysters were thoroughly cleaned by hand and treated with sodium hypochlorite solution (6-14% active chlorine) to remove organic surface contaminants [59]. The prismatic layer was separated from the edges of pearl oyster shells without nacre. The nacre was directly scraped from the internal shell surfaces dominated by aragonite. These samples were thoroughly ground and soaked in acetic acid solution (5%, v/v) for at least 12 h to dissolve calcium carbonate, before being centrifuged at 14,000 g and 4 °C for 1 h. Acid-soluble proteins were in the supernatant, and acid-insoluble proteins were in the residue.
Samples were electrophoresed on 12% polyacrylamide gels and stained with Coomassie blue R-250. The extracted peptides were dried and stored at -80 °C until liquid chromatography/tandem mass spectrometry (LC-MS/MS) analysis.
Chitin identification in shell matrix. We decalcified the shells in 1 M acetic acid at 4 °C for one week, and the acid-insoluble material was collected. This insoluble material was washed with distilled water and embedded in paraffin for sectioning. The sections were placed on slides and stained for 5 min with 0.1% Calcofluor White M2R (FlupstainⅠ) (Sigma-Aldrich). Excess dye was rinsed off with distilled water.
The stained specimens were observed under a confocal laser microscope using filters with 492 nm excitation and 520 nm emission [60].

Nitrobluetetrazolium (NBT)/glycinate assay for dopa and dopaquione protein.
Sections of decalcified shells were stained with 100 μL of solution containing 0.24 mM NBT and 2 M potassium glycinate (pH10) for nearly 5 min in darkness until violet positive signals appeared [62]. The sections were rinsed with double-distilled water to stop the reaction and then mounted for microscopic examination.
Co-expression network analysis. We used WGCNA to reconstruct the co-expression network for biomineralization [63]. A weighted correlation network was constructed between all pairs of genes across four mantle tissue samples. The adjacency matrix was calculated through a so-called 'soft' thresholding framework (power β=9) that converted the co-expression measure to a connection weight. Based on the adjacency matrix, we implemented a topological overlap dissimilarity measure to reflect relative inter-connectedness, which may represent a meaningful biological network. Hub genes (highly connected genes), by definition, tend to have high connectivity in the constructed network.