An improved reference of the grapevine genome reasserts the origin of the PN40024 highly homozygous genotype

Abstract The genome sequence of the diploid and highly homozygous Vitis vinifera genotype PN40024 serves as the reference for many grapevine studies. Despite several improvements to the PN40024 genome assembly, its current version PN12X.v2 is quite fragmented and only represents the haploid state of the genome with mixed haplotypes. In fact, being nearly homozygous, this genome contains several heterozygous regions that are yet to be resolved. Taking the opportunity of improvements that long-read sequencing technologies offer to fully discriminate haplotype sequences, an improved version of the reference, called PN40024.v4, was generated. Through incorporating long genomic sequencing reads to the assembly, the continuity of the 12X.v2 scaffolds was highly increased with a total number decreasing from 2,059 to 640 and a reduction in N bases of 88%. Additionally, the full alternative haplotype sequence was built for the first time, the chromosome anchoring was improved and the number of unplaced scaffolds was reduced by half. To obtain a high-quality gene annotation that outperforms previous versions, a liftover approach was complemented with an optimized annotation workflow for Vitis. Integration of the gene reference catalogue and its manual curation have also assisted in improving the annotation, while defining the most reliable estimation of 35,230 genes to date. Finally, we demonstrated that PN40024 resulted from 9 selfings of cv. “Helfensteiner” (cross of cv. “Pinot noir” and “Schiava grossa”) instead of a single “Pinot noir”. These advances will help maintain the PN40024 genome as a gold-standard reference, also contributing toward the eventual elaboration of the grapevine pangenome.


Introduction
Cultivated grapevine (Vitis vinifera ssp. vinifera) was the fourth plant whose genome was sequenced and assembled (Jaillon et al. 2007). Because of the grapevine's high level of heterozygosity [one Single Nucleotide Polymorphism (SNP) per 100 bp and one Indel per 450 bp, Velasco et al. 2007], the genotype selected for sequencing was PN40024, whose ∼475 Mb genome (Lodhi and Reisch 1995) is nearly homozygous (estimated at ∼93%). PN40024 was indeed generated through 9 rounds of selfing and supposedly originated from "Pinot noir", hence its identification as "PN". This unique genome characteristic allowed a high-quality wholegenome shotgun assembly based on 8X coverage Sanger reads (Jaillon et al. 2007). In 2009, a 4X coverage was added, which improved the overall coverage of the genome (from 68.9% for the 8X version to 91.2% for the 12X.v0) (http://urgi.versailles.inra.fr/ Species/Vitis/Data-Sequences/Genome-sequences; FN597015-FN597047 at EMBL, release 102; Supplementary File 1 and Supplementary Fig. 1). In 2017, a third assembly version, named 12X.v2, was published as the result of a large anchoring effort using 6 dense parental genetic maps (Canaguier et al. 2017).
Despite these advances, no additional sequencing efforts have been made and although it is of very high quality, the 12X.v0 Sanger contigs are numerous (14,642), the 12X.v2 scaffolds are composed of large N gaps (3.1% of the cumulative scaffold size) and the 19 pseudomolecules are quite fragmented (19.3 scaffolds on average per pseudomolecule).
In recent years, the advent of third generation sequencing technologies, especially those from the Pacific Biosciences (PacBio) platform, have allowed the assembly of grapevine diploid genomes with a higher level of contiguity compared to the 12X.v2 version of the PN40024 genome (e.g., cv. "Cabernet Sauvignon" genome assembly, Massonnet et al. 2020).
Along with the versions of each genome assembly, several versions of gene annotations were made available (Supplementary File 1 and Supplementary Fig. 1). The first version of the grapevine genome assembly, 8X, was published along with the prediction of 30,434 gene models based on the GAZE software (Howe et al. 2002;Jaillon et al. 2007). For the 12X.v0, 3 different versions of gene predictions were made available: the v0 version (26,346 gene models), based on the GAZE software (Howe et al. 2002), the CRIBIv1 version (29,971 gene models), based on the JIGSAW software (Allen and Salzberg 2005), and the CRIBIv2 version (31,845 gene models), with an effort made on the discovery of splicing variants (Vitulo et al. 2014). For the 12X.v2, the International Grapevine Genome Program (IGGP) led the initiative of merging annotations from NCBI Refseq, CRIBIv1, and VCost, which was based on the Eugene software (Sallet et al. 2019) and was generated in the frame of the COST Action FA1106. This version, called VCost.v3, resulted in an exhaustive view of the PN40024 grapevine gene content with its 42,413 gene models (Canaguier et al. 2017). However, after several years as the reference annotation by the grapevine scientific community, it appeared that the great increase in number of gene models for VCost.v3 compared to all the previous annotation versions was caused by many small and fragmented predictions that were probably erroneous.
By combining the top-quality Sanger contigs from the 12X version and long reads generated here by Single-Molecule Real-Time (SMRT) sequencing (PacBio), we provide an improved version of the PN40024 genome sequence assembly, referred to as PN40024.v4. Along with this new assembly, we also provide a new version of the gene annotation, PN40024.v4.2, based on a newly developed annotation workflow, RNA-Seq datasets and an exhaustive manual curation of a set of catalogued genes of functional interest to the community. Finally, we demonstrate that PN40024 originates from selfings of the "Helfensteiner" cultivar instead of "Pinot noir".
One gram of young leaves (1 cm 2 ) of PN40024 (ID code FRA038-40024.Col.1) was collected and DNA was extracted using QIAGEN Genomic-tips 100/G kit. SMRT sequencing on a Sequel I machine (3 SMRTCells; PacBio) and dedicated library preparation were performed according to provider procedures.
All data generated in the frame of this study were submitted under the ENA Study Accession PRJEB45423.
The anchoring of the new haploid scaffolds was performed using the 6 genetic maps used for the same purpose by Canaguier et al. (2017) and 2 new genetic maps from cv. "Riesling" and cv. "Gewurztraminer" derived from GBS. To transfer the markers from Canaguier et al. (2017) from PN12X.v2 to the scaffolds of PN40024.v4, BLAST (v2.2.28) (Altschul et al. 1990) or ipcress (ipcress from exonerate v2.2.0) (Slater and Birney 2005) was used to align the markers and find the position of each on the scaffolds of PN40024.v4 REF and ALT. A total of 2,333 markers for REF and 2,326 markers for ALT were used from these 6 maps to anchor the scaffolds. For the 2 new genetic maps from "Riesling" and "Gewurztraminer", 5,884 ("Riesling") and 5,840 ("Gewurztraminer") SNP markers were available for REF and 5,866 ("Riesling") and 5,832 ("Gewurztraminer") for ALT. The SNP markers were derived from GBS data (ERR8657388 to ERR8657647) and were analyzed with Fast-GBS (Torkamaneh et al. 2017) with modifications to allow paired-end read analysis (https://forgemia.inra.fr/sophie.blanc/gbs). The 2 genetic maps were built using R ASMap package with the "kosambi" parameter (Taylor and Butler 2017). A first run of Allmaps (v0.9.13) (Tang et al. 2015) was performed with the "merge" command to merge all genetic maps and then "split", "gaps", "refine" and "build" commands to create breakpoints (58 for REF scaffolds and 47 for ALT scaffolds), with default parameters. Subsequently, all maps were recreated for new scaffolds and then orientation and anchoring of new haploid scaffolds on the 19 pseudochromosomes were performed using Allmaps with the "merge" command to merge all Assembly process for the PN40024.v4 genome sequence assembly. a) Initial datasets: Sanger-based scaffolds of PN12X.v2 with unknown bases ("N's"), genomic PacBio SMRT reads, and genomic Illumina short reads. Erroneous bases are represented by vertical lines. b) Scaffold assembly steps. Dark regions represent newly incorporated PacBio SMRT assembled regions. c) Pseudomolecule construction using the new scaffolds and genetic maps. The new scaffolds are a mosaic of 12X.v2 scaffolds and newly incorporated PacBio SMRT assembled regions. maps and "path" command to anchor, with default parameters (Fig. 1c).

Quality assessment of the PN40024.v4 genome sequence assembly
A quality analysis of the genome assembly was done with Merqury v1.3 (Rhie et al. 2020). Since PN40024 is a "Helfensteiner" selfing (demonstrated below) and since "Helfensteiner" originated from a cross between "Pinot noir precoce" and "Schiava grossa", "Schiava grossa" was used as the maternal parent. The run was carried out on the scaffolds using genomic paired-end short reads of PN40024 as the child data (SRR8835144), short reads of "Pinot noir" as the paternal data (ERR8014965) and short reads of cv. "Schiava grossa" as the maternal data (ERR8014964). A k-mer database was built for the 3 read datasets with k = 19, the Merqury hap-mer databases were computed and the PN40024.v4 genome assembly was evaluated using "num_switch 100" and "short_range 20,000". For comparison reasons, the Merqury quality analysis was carried out on PN12X.v2 using the same k-mer databases.
PN12X scaffolds were mapped against PN40024.v4 REF pseudomolecules using NUCmer (MUMmer v3.1) (Kurtz et al. 2004) with "-maxmatch -l 100 -c 500" parameters. The output file was filtered using MUMmer show-coords command with "-l -g -I 99.5" parameters. The resulting file was formatted into BED format and merged with the bed file corresponding to N gap regions in the PN40024.v4 assembly. Pseudomolecule regions over 100 bp that did not correspond to either PN12X scaffolds or N gap regions were identified as "newly assembled" PacBio long read-based regions.
The identification of variants between PN40024 paired-end Illumina resequencing (SRR8835144) and PN40024.v4 REF and ALT pseudomolecules was performed as described in the section "Origin of PN40024". The homozygous calls "1/1" were considered as assembly errors. The densities of the heterozygous calls "0/1" along the REF and ALT pseudomolecules were used to define 7 heterozygous regions of the PN40024 genome.
The haplotypic blocks were defined after segmentation of homozygous SNP densities along the chromosomes using the R package changepoint (v2.2.2) (Killick and Eckley 2014) with command "cpt.mean" and the parameters method="PELT" and penalty="AIC". Some manual curation of the segments was performed to join directly adjacent segments of the same origin ("Pinot noir" or "Schiava grossa"). The size of the segments was used to calculate the proportion of "Pinot noir", "Schiava grossa", and common haplotypes.
Additionally, Arabidopsis thaliana protein sequences (UniProt/ SwissProt release 2020_02), eudicotyledone protein sequences (UniProt/SwissProt release 2020_02, OrthoDB10 v1), and Viridiplantae and Vitales sequences (UniProt/SwissProt release 2020_02) were aligned on 12X.v0 and on PN40024.v4 REF and ALT with pBLAT v1.9 (Wang and Kong 2019), a parallel implementation of the original blat algorithm (Kent 2002). The genome regions on which the protein data mapped were extracted and the protein sequences were aligned to these regions with exonerate v2.4.0 (Slater and Birney 2005). Only the proteins that aligned on the reference genome with an identity of 25%, a similarity of 50% and with a sequence alignment coverage of at least 80%, were retained and included in the gene prediction.
The last ab initio gene prediction was done on the PN40024.v4 genome assembly with GeneID v1.4.5-master-20200916 and the publicly available V. vinifera parameter set using default settings. To add the VCost.v3 gene annotation to the set of predictions, an annotation liftover was performed with liftoff v1.5.1 (Shumate and Salzberg 2021) with default parameters onto the PN40024.v4 genome assembly.
To combine the predictions and evidence data into an overall gene model set, the GlimmerHMM, SNAP, BRAKER2, and GeneID ab initio gene prediction as well as the lifted VCost.v3 annotation, the stranded and unstranded transcriptome assemblies, the GFF file with the aligned protein data, the repeat annotation GFF file, and the PN40024.v4 genome assembly were given to EvidenceModeler v1.1.1 (Haas et al. 2008). The used weights are listed in Supplementary File 2 and Supplementary Table 3.
Subsequently, the raw gene models were quality filtered. Gene models only supported by ab initio predictors were kept if at least 2 gene prediction programs predicted them, if the start and stop codon was present and if the gene length was equal or larger than 300 bp. However, ab initio supported gene models not matching these constraints were kept if they had a database hit with the UniProt/SwissProt or NCBI nonredundant database. To obtain that, a blastp search of the protein sequences against the 2 databases was run, allowing hits with an e-value <1e −6 . Of the gene models only supported by evidence data or by VCost.v3 lifted annotation, those gene models with missing start and stop and a gene length <300 bp were discarded.
The gene models generated by EvidenceModeler were finally processed by PASA (v2.4.1, default parameters) using the stranded transcriptome assembly as a reference to add UTR regions and to calculate alternatively spliced models. Genes with overlapping UTRs were shortened. tRNAs were predicted with tRNAscan-SE-2.0 (Chan et al. 2021) on the PN40024.v4 genome assembly.
To retain gene naming of VCost.v3 gene models, a reciprocal best blast hit (RBH) search between protein sequences of PN40024.v4.1 gene models and protein sequences of VCost.v3 gene models was carried out. For the RBH search, only the longest protein sequence per gene was used, the e-value was set to 1e −4 and the query coverage and identity was set to 70%. Moreover, only RBHs with genes on the same pseudochromosome and showing collinearity with other genes were considered valid. Thus, genes with a valid RBH were named according to the VCost.v3 gene, novel genes received the prefix "04" at the start of the gene number and genes predicted for alternative heterozygous sequence regions received the suffix "_alt" (Supplementary File 2  and Supplementary Table 4).

Manual gene model curation
For manual gene model curation, an Apollo Webserver v2.6.4 (https://github.com/GMOD/Apollo/blob/master/README.md) (Dunn et al. 2019) was set up for the PN40024.v4 genome assembly and provided with different data tracks such as PN40024.v4.1 and previous gene annotations, RNA-Seq mappings and exonerate protein mappings (see Gene prediction). By these means, gene models were manually inspected and curated if needed or also new genes were added following dedicated guidelines offered to the community (https://integrape.eu/resources/data-management/). Using Apollo, the plant core genes classified as fragmented or missing by BUSCO were manually curated and adapted if necessary. In the frame of this study, we also began to manually curate genes present in the grape reference catalogue (Navarro-Payá et al. 2022; https://grapedia.org/genes/). A home-made python script was used to generate the PN40024.v4.2 version of gene annotations including those manually curated (https://gitlab.com/ MSVteam/pn40024-visualization-tools/-/tree/master/update_gff3_ script).

Improved metrics for the genome assembly of PN40024
A hybrid strategy was developed to assemble the genome of PN40024 genotype using 27X of SMRT long reads along with the PN12X scaffolds and 15X PN40024 Illumina paired-end resequencing data (Fig. 1). This new assembly was named PN40024.v4. Six hundred and forty scaffolds were produced with a N50 size of 6.5 Mb for a cumulative size of 474.5 Mb for the PN40024.v4 REF haplotype (Table 1). Compared to the former PN12X.v2, the number of scaffolds was reduced by a factor of 3 and the N50 was doubled. Moreover, the number of unknown bases, marked as N in the new scaffold sequences, represents 1.8 Mb and 0.4% of the assembly size versus 15.0 Mb and 3.1% for PN12X.v2 scaffolds. Thus, PN40024.v4 REF is more contiguous and has more informative sequences than PN12X.v2. Also, the PN40024.v4 assembly size is closer to the grapevine genome size of 475 Mb, estimated using flow cytometry by Lodhi and Reisch (1995). Phasing efforts on the partially heterozygous genotype resulted in the reconstruction of the second PN40024 haplotype (PN40024.v4 ALT) with 485 scaffolds and a total genome assembly size of ∼463 Mb (Supplementary File 2 and Supplementary Table 5). Thus, the PN40024.v4 genome assembly now represents both haplotypes of the diploid PN40024 genome.
There are 7,640 newly assembled PacBio long read-based regions that were identified as missing from PN12X.v2 scaffolds. Their cumulative size is 24.1 Mb, that is 5.1% of the total PN40024.v4 genome assembly size (average = 3,152 bp; median = 558 bp; max = 32,650 bp).
By aligning PN40024 Illumina paired-end reads against PN40024.v4 genome assembly, we identified 101,778 heterozygous variations. Using their density along the chromosomes, we were able to identify 7 well-defined heterozygous regions in PN40024.v4 genome assembly as it was the first time that a software dedicated to diploid assembly (Haplomerger2) was used to assemble the PN40024 genome. These regions were located on chromosomes 2, 3, 4, 7, 10, 11, and 15 with the 2 largest regions being on chromosome 7 and 10 (11.4 and 5.5 Mb, respectively) (Fig. 3). Their overall cumulative size of 20.6 Mb represents 4.3% of PN40024.v4, which is less than the residual heterozygosity size of 7%, estimated by Jaillon et al. (2007) based on genetic markers. Using the same procedure, we identified 6 heterozygous regions in PN12X.v2 assembly on the same chromosomes as PN40024.v4 except for the one on chromosome 15. Their overall cumulative size of 16.6 Mb represents 3.4% of PN12X.v2 and 4 Mb less than the heterozygous regions anchored on the PN40024.v4 chromosomes. These sequences were badly resolved and mostly located in the unanchored fraction of PN12X.v2  Supplementary Fig. 2). Thus, we conclude that PN40024.v4 is a better diploid assembly compared to PN12X.v2.

Quality of the PN40024.v4 genome assembly
The BUSCO analysis performed on the PN40024.v4 genome assembly confirmed that the gene space was more complete with 98.1% of the 2,326 total searched Eudicots BUSCO genes being complete, compared to PN12X.v2 with 97.6% (Fig. 6). The FT gene is conserved among all flowering plants as it promotes transition from vegetative growth to flowering. However, its sequence could only be found on an unanchored scaffold in the PN8X version and was totally missing in PN12X.v0 and PN12X.v2. It is now present on chromosome 7 of the PN40024.v4 assembly and also on its allelic region, chromosome 7_ALT sequence. Similarly, the APRT3 gene, located in the sex determination locus of grapevine, was present on chromosome 2 in the PN8X version and was truncated in PN12X.v0 and PN12X.v2. It is now fully retrieved on chromosome 2 of PN40024.v4 assembly and on its allelic region, chromosome 2_ALT sequence. These 2 examples, along with the BUSCO analysis, show that the PN40024.v4 assembly is more complete, especially in the residual heterozygous regions that are now more accurately exposed. The alignment metrics of PN40024 genomic Illumina paired-end reads have always been better against PN40024.v4 compared to PN12X.v2, either for overall percentage of mapped  reads (97.58% vs 96.58%) or for properly mapped pairs of reads (85.81% vs 82.82%) (Fig. 2). This confirms that the PN40024.v4 assembly is more complete and with a more accurate structure than PN12X.v2. Moreover, we compared alignments of 11 genomic Illumina paired-end read datasets from various cultivars against PN40024.v4 and PN12X.v2 assemblies, but also against "Cabernet Sauvignon" (Massonnet et al. 2020) haplotype 1, whose assembly metrics and technology were similar to PN40024.v4. Again, PN40024.v4 performs best for each dataset, even when "Cabernet Sauvignon" was aligned against its own assembly (Fig. 2). These results confirm that PN40024.v4 shows a quality suitable to become the new grapevine reference genome assembly, as it performs well with aligning genomic reads of various V. vinifera cultivars.
The error rate at nucleotide level was assessed by calling homozygous variations between PN40024 genomic Illumina paired-end reads aligned against the PN40024.v4 genome assembly. We identified 28.7 compared to 8.4 errors/Mb in the PN12X.v2 genome assembly. However, they are unevenly distributed along the chromosomes and they mostly co-localize with the newly assembled long read-based regions and the 7 heterozygous regions (Fig. 3). A higher density of errors was also detected in the heterozygous regions of the PN12X.v2 genome assembly (Supplementary File 1 and Supplementary Fig. 3). We detected 284.4 errors/Mb in PN40024.v4 heterozygous regions and 83.1 errors/Mb in PN12X.v2 heterozygous regions, which is, respectively, about 10 times denser than their average error rate. Thus, the overall increase of error rate in the PN40024.v4 assembly is mostly due to the use of SMRT long reads to improve the completeness of the reference genome assembly. Using Merqury, the base level quality value (QV) of the PN40024.v4 genome assembly was estimated to be 36.02, which is slightly worse than QV of 37.43 of the PN12X.v2 genome assembly (Table 2). This result confirms that additional SMRT sequences are not as accurate as Sanger-based sequences and they slightly decrease overall accuracy of the assembly. Also, the error rate of the PN40024.v4 genome assembly was increased by 0.00006964% compared to PN12X.v2, but still represents an accuracy of 99.999749801%, a metric associated with high-quality genome assemblies.
Nevertheless, the k-mer completeness was raised from 96.79% to 96.96% for the PN40024.v4 assembly. Based on k-mer profiles of PN40024 and its parents (see The origin of the PN40024 genotype section for details), Merqury computed the inheritance spectrum ( Supplementary File 1 and Supplementary Fig. 4) showing a low portion of read-only missing k-mers that are unique for the child read set (paired-end short reads of PN40024). The few missing sequences are probably due to sequencing errors, k-mers of novel variations or contamination from microbiome in PN40024 short reads, indicating an almost fully complete PN40024.v4 genome sequence assembly. Also, as the spectrum shows a single 2-copy peak around 12× and that no 1-copy peak was observed at half the size, the k-mer analysis supports the assumptions of an almost homozygous grapevine genotype.

The origin of the PN40024 genotype
So far, the PN40024 genotype was supposed to be originally derived from cv. "Pinot noir" (Jaillon et al. 2007). However, we found 1,415,200 homozygous variants between "Pinot noir" and PN40024.v4 (versus 17,696 homozygous variants of PN40024 against its own assembly), meaning that "Pinot noir" haplotypes were completely missing at these locations. These homozygous "Pinot noir" variants were unevenly distributed along the chromosomes and formed blocks (Fig. 4). We identified that the haplotypes of unknown origins could be assigned to "Schiava grossa" (synonyms: "Trollinger" and "Frankenthal") as already suspected by Jaillon et al. (2007). There were 953,735 homozygous variants found between cv. "Schiava grossa" and PN40024.v4 and the formed haplotype blocks were highly complementary to "Pinot noir" haplotype blocks (Fig. 4). As a negative control, the same analysis was performed with cv. "Araklinos" and 2,273,888 homozygous variants were identified, evenly distributed along the chromosomes (Supplementary File 1 and Supplementary  Fig. 5).
Using Merqury, only a small portion of hap-mer specific k-mers (parental specific k-mers of the assembled F1) were found in the PN40024.v4 genome assembly (Supplementary File 1 and Supplementary Fig. 4). With the use of read data from both parents and child, Merqury was able to compute haplotype blocks by using the parental specific k-mers as anchors. A total of 1,454 haplotype blocks were computed for PN40024.v4 sequences with additional 289 haplotype blocks for alternative heterozygous sequence regions and 2,575 haplotype blocks for the 12X.v2 genome assembly (Table 3). The N50 was measured to 2.05 Mb (REF), 0.25 Mb (ALT), and 1.76 Mb (PN12X.v2). Compared to the PN12X.v2 genome assembly, PN40024.v4 presented less haplotype blocks, but comprised almost all bases showing a higher N50 value, that is its haplotype blocks are more contiguous.
A greater amount of paternal ("Schiava grossa") than maternal ("Pinot Noir") specific k-mers were identified. After identifying the origin of each haplotype block using segmentation, it is estimated that 41% of the genome harbors a "Schiava grossa"-specific haplotype and 27% a "Pinot noir"-specific haplotype. It is estimated that 32% of the genome shares a common haplotype between the 2 parents, that is that these regions could originate either from "Pinot noir" or "Schiava grossa" indicating that ∼57% could originate from "Schiava grossa" and ∼43% from "Pinot noir".
The switch error rate was determined to 0.96% (REF), to 4.76% (ALT), and to 0.77% (PN12X.v2). Some of the switches are probably due to sequencing errors in the additional long read-based sequences. Moreover, as the error rate of ALT sequences was measured to ∼4.76%, portions of the alternative sequences are a mixture of the maternal and paternal haplotype, confirming that despite the improved separation of the 2 haplotypes in PN40024.v4, phasing is still not perfect.
By exploring the VIVC database (www.vivc.de), the "Helfensteiner" cultivar was found to originate from a cross between "Pinot noir precoce" (a clone of "Pinot noir") and "Schiava grossa". By performing the same variant calling analysis, 53,671 homozygous variants were found between cv. "Helfensteiner" and PN40024.v4, with 543 homozygous variants/Mb in the heterozygous regions and 93 homozygous variants/Mb in the homozygous regions (Fig. 5). As a negative control, "Araklinos" showed 3,967 homozygous variants/Mb in the heterozygous regions and 4,818 homozygous variants/Mb in the homozygous regions). Thus, the "Helfensteiner" homozygous variants are almost 6 times denser in error-prone regions of the PN40024.v4 assembly, which makes them probable "false positive" homozygous variants. Apart from heterozygous regions, no blocks of homozygous variants could be identified, meaning that one of the 2 "Helfensteiner" haplotypes is always present in the PN40024 genome. This confirms that the "Helfensteiner" variety is the true parent of the first selfing, from which the PN40024 genotype was created after 8 more selfings.

PN40024.v4.1 gene prediction, functional annotation, and manual curation
The PN40024.v4.1 gene annotation of REF haplotype comprises 35,922 gene models of which 35,197 are protein-coding and 725 encode for tRNAs (Table 4). In particular, 1,572 novel protein-coding genes were annotated in the newly assembled long read-based regions. For heterozygous regions, 1,855 and 1,809 protein-coding genes were predicted for REF and ALT haplotypes, respectively (Table 5). Most genes were predicted on the ∼11 Mb heterozygous region on chromosome 7 with 830 on the reference sequence and 792 on the alternative sequence followed by the ∼5 Mb region on chromosome 10 with 650 and 623 protein-coding genes.
To check for completeness of the gene models, the plant core genes of the database eudicots_odb10 were predicted with BUSCO (Fig. 6). Of the 2,326 searched plant core genes, 2,296 or 98.7% were classified as complete in the PN40024.v4.1 gene annotation. Only 16 were predicted as fragmented and only 14 were not found.
To help the community in the transfer of information across versions (i.e. correspondences), we retained as many gene names from VCost.v3 in PN40024.v4.1 as possible. We adopted a strategy based on RBHs followed by some filtering steps which allowed us to transfer names for 66% (23,206) of PN40024.v4.1 gene models with the nomenclature VitviXXg0YYYY (XX being the chromosome number and YYYY a sequential number below 4,000). One third (11,991) of PN40024.v4.1 gene models could not be named with a VCost.v3 identifier and were named with the nomenclature VitviXXg04ZZZ (XX being the chromosome number and ZZZ a sequential number below 1,000). The detailed nomenclature for PN40024.v4.1 gene annotations is given in Supplementary File 2 and Supplementary Table 4.
The functional annotation of PN40024.v4.1 was performed using Blast2GO and resulted in at least one Gene Ontology term for 87% (30,689) of the genes and one Enzyme Code for 41% (14,512) of them. The main classes and ontologies are detailed in Supplementary File 1 and Supplementary Figs. 6 and 7.
A subset of the RNA-Seq data published by Palumbo et al. (2014) was used to compare the results of a differential gene expression analysis performed with PN12X.v2/VCost.v3 and PN40024.v4/ PN40024.v4.1. In terms of mapping, the percentage of aligned reads was equivalent or slightly better when using PN40024.v4 Fig. 4. Density of "Pinot noir" and "Schiava grossa" homozygous SNPs compared to the PN40024.v4 genome assembly. The x-axis shows the 19 main pseudochromosomes and the artificial chrUn ("Un"). The y-axis shows the base position in [bp]. Where density of "Pinot noir" SNPs is high, it means PN40024.v4 carries the "Schiava grossa" haplotype and vice versa. The regions where both "Pinot noir" and "Schiava grossa" SNP density is low correspond to regions where both genomes share a common haplotype.     Supplementary Fig. 8). This result along with the exhaustive functional annotation of PN40024.v4.1 shows that this new version of the PN40024 reference genome and annotation is a very efficient resource to perform transcriptomics and functional enrichment analyses. Despite marked improvement of the PN40024.v4.1 automated annotation with respect to the previous VCost.v3 annotation, some recently expanded gene families have not been comprehensively annotated, such as the stilbene synthase gene family. Therefore, 1,641 genes (1,579 edited and 62 deleted) were manually curated using a purpose-built Apollo server (http://138.102.159. 70:8080/apollo) providing a wide range of transcriptomic and genomic data for PN40024.v4. In an effort to preserve previous VCost.v3 manual curation and functional annotation efforts, a particular focus was given to genes present in the reference catalogue (Navarro-Payá et al 2022). The PN40024.v4.1 automated annotation including the manually curated features was called PN40024.v4.2, which metrics are presented in Table 4. An automated annotation from PN40024.v4.1 that was manually curated was deleted and replaced by its curated version in PN40024.v4.2. Also, the same rules were applied for gene name transfer and nomenclature for PN40024.v4.1 and PN40024.v4.2. The BUSCO analysis performed on PN40024.v4.2 shows that the fragmented plant core genes were reduced to 6 and the missing genes to 8 (Fig. 6). Thus, PN40024.v4.2 gene models comprise 2,308 or 99.2% complete plant core genes.
In conclusion, the here provided PN40024.v4 assembly is the most suitable grapevine reference genome sequence assembly as it notably outperforms PN12X.v2. In terms of genomic and transcriptomic read mapping, the assembly also outperforms other high-quality V. vinifera genome assemblies, something that occurs even when reads from these recently sequenced cultivars are used. Having a fully resolved alternative haplotype sequence, more continuous sequences and resolving many up-to-now unknown bases, PN40024.v4 represents the near complete diploid genome of the PN40024 genotype. Despite many improvements and advances in PN40024.v4, the genome sequence is still not perfect in regard to haplotype switching and newly introduced errors by implementation of long genomic reads. Further improvements should focus on these regions. Nevertheless, the gene annotation of PN40024.v4 should be used as the most updated resource for transcriptomics and functional enrichment analyses, while the genes of heterozygous regions that are likely represented on both haplotypes will allow exploration of heterozygous genetic traits.
Supplemental material available at G3 online.

Funding
This work was supported by INRAE department "Biologie et Amélioration des Plantes", by the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) and by COST (European Cooperation in Science and Technology).

Conflicts of interest
The author(s) declare no conflict of interest.