A high-quality, haplotype-phased genome reconstruction reveals unexpected haplotype diversity in a pearl oyster

Abstract Homologous chromosomes in the diploid genome are thought to contain equivalent genetic information, but this common concept has not been fully verified in animal genomes with high heterozygosity. Here we report a near-complete, haplotype-phased, genome assembly of the pearl oyster, Pinctada fucata, using hi-fidelity (HiFi) long reads and chromosome conformation capture data. This assembly includes 14 pairs of long scaffolds (>38 Mb) corresponding to chromosomes (2n = 28). The accuracy of the assembly, as measured by an analysis of k-mers, is estimated to be 99.99997%. Moreover, the haplotypes contain 95.2% and 95.9%, respectively, complete and single-copy BUSCO genes, demonstrating the high quality of the assembly. Transposons comprise 53.3% of the assembly and are a major contributor to structural variations. Despite overall collinearity between haplotypes, one of the chromosomal scaffolds contains megabase-scale non-syntenic regions, which necessarily have never been detected and resolved in conventional haplotype-merged assemblies. These regions encode expanded gene families of NACHT, DZIP3/hRUL138-like HEPN, and immunoglobulin domains, multiplying the immunity gene repertoire, which we hypothesize is important for the innate immune capability of pearl oysters. The pearl oyster genome provides insight into remarkable haplotype diversity in animals.

proximity ligation data to scaffold genome assembly 11 . Hi-C assembly steps were performed by Dovetail Genomics.
The final MK assembly includes 414 scaffolds with an N50 length of 72.75 megabases (Supplementary Table S3 Table S6), indicating that genome sequence redundancy derived from heterozygosity was collapsed in the final MK assembly.

Estimation of heterozygosity in inbred line
To estimate the level of heterozygosity in the inbred line, we conducted a k-merbased statistical assessment with short read data. Approximately 60-fold paired-end Illumina short read sequences for P. fucata genome were analyzed using Jellyfish (version 2.2.10) 12 with different k-mer values ranging from 17 to 61. Next, the obtained frequency distribution of all k-mers was analyzed using GenomeScope 13 . Processed Illumina short reads were mapped to the MK genome assembly using BWA (version 0.7.15) 14 . Genotypes were retrieved using BCFtools (version 1.10) mpileup option 15 . A homozygous peak is observed at 60x coverage when k-mer=19 (Supplementary Figure   S17).

Supplementary Figures
Supplementary Figure S1. K-mer analysis plots for sequencing reads and assembled scaffolds of the AI assembly. A-B) GeneScope plots of HiFi reads (A) and Omni-C reads (B). Note that the haploid coverage estimate in HiFi is wrong, although no peak actually exists at around 20x coverage. On the other hand, the curve fitting in Omni-C seems correct and the estimated genome size is close to the value based on flow cytometry. C-D) PloidyPlots (improved version of SmudgePlot) of the HiFi reads (C) and the Omni-C reads (D). E) Merqury's spectra-asm plot with the two phased scaffold sets and the HiFi reads.

9
Supplementary Figure S2. Omni-C contact maps around manually curated locations before (left) and after (right) manual curation. Blue lines are boundaries between scaffolds. Gray areas indicate no unique mappings between the two locations due to repeats or misassemblies. Figure S3. Pairwise alignments of homologous chromosomes in haplotype A and B assemblies. Alignment dot plots were generated using MUMmer 3.23 16 . Aligned segments are represented as red (forward alignment) or blue (reverse alignment) dots. Heatmaps of tandem repeat density are shown along with x and y axes.

Supplementary
Blue bars in the heatmaps indicate gap positions in the scaffolds. Omni-C contact maps for each scaffold are also shown at the x and y axes. Length differences between haplotype A (left) and B (right) due to unassembled repetitive sequences are exemplified by asterisks in scaffolds 2, 8, and 11. In scaffold 2, a Helitron-rich region is found only in scaffold 2B, while the corresponding position in scaffold 2A was bridged by gaps (red arrowhead). In scaffolds 8 and 11, centromeric regions were fully sequenced and assembled only in haplotype A.              In this 3rd-generation individual, extremely reduced heterozygosity regions are found on scaffolds 3, 6, 9, 10, 12, 13, and 14, presumably due to autozygosity. Supplementary Table S1. Summary of P. fucata genome sequence data.