The complete reference genome for grapevine (Vitis vinifera L.) genetics and breeding

Abstract
Grapevine is one of the most economically important crops worldwide. However, the previous versions of the grapevine reference genome typically consist of thousands of fragments with missing centromeres and telomeres, limiting the accessibility of the repetitive sequences, the centromeric and telomeric regions, and the study of inheritance of important agronomic traits in these regions. Here, we assembled a telomere-to-telomere (T2T) gap-free reference genome for the cultivar PN40024 using PacBio HiFi long reads. The T2T reference genome (PN_T2T) is 69 Mb longer with 9018 more genes identified than the 12X.v0 version. We annotated 67% repetitive sequences, 19 centromeres and 36 telomeres, and incorporated gene annotations of previous versions into the PN_T2T assembly. We detected a total of 377 gene clusters, which showed associations with complex traits, such as aroma and disease resistance. Even though PN40024 derives from nine generations of selfing, we still found nine genomic hotspots of heterozygous sites associated with biological processes, such as the oxidation–reduction process and protein phosphorylation. The fully annotated complete reference genome therefore constitutes an important resource for grapevine genetic studies and breeding programs.


Introduction
Since the first human genome was published in 2000, hundreds of reference genomes have successively been assembled in a variety of species [1][2][3]. A reference genome is essential for biological and genetic studies, so obtaining high-quality genomes has been a persistent goal. Despite this, many segments are still missing because of highly repetitive sequences clustered across the genome, especially in three representative regions: telomeres, centromeres, and ribosomal DNA (rDNA) [3][4][5].
The centromere, which hosts CENPA/CENH3-variant nucleosomes and where the kinetochore forms and attaches to spindle microtubules, plays an essential role during cell division. It consists of alpha satellites, highly repetitive DNA sequences. The alpha satellite is composed of monomers that range from 100 to 200 bp and are arranged into higher-order repeats (HORs) [6][7][8]. Despite their conserved function across species, the structure and sequence of centromeres can change rapidly within and between species, and diverse organizations can be observed from one species to another. Nevertheless, centromeres show concerted evolution within genomes [7][9][10][11]. Currently, the centromere remains mostly unknown to researchers.
Telomeres are mostly unknown as well. They are composed of tandem repeats of relatively conserved microsatellite sequences located at the ends of chromosomes in eukaryotes [12,13]. Telomeres are important for protecting chromosome terminal sequences during cell division [14][15][16][17]. Ribosomal DNA (rDNA) is one of the most abundant repetitive elements in a genome, and plays an essential role in ribosome formation while driving cell growth and cell proliferation [18][19][20].
Because of the missing information on previously assembled genomes, the investigation of centromeres, telomeres, and rDNA has been extremely limited in the past two decades. Fortunately, benefiting from the improvement of sequencing technologies and computational algorithms, genome assembly has ushered in a new era: that of telomere-to-telomere (T2T) sequencing [21]. Compared with fragmented genomes, a T2T genome has fewer or no gaps at all. It is based on third-generation sequencing platforms, including PacBio high-fidelity long reads (HiFi), ultra-long Oxford Nanopore Technologies (ONT), and Hi-C data. Moreover, the T2T genome includes nearly complete information on the telomere, centromere, and rDNA regions [22,23]. Promisingly, the T2T genome allows us to access these regions, opening a window into understanding the structure of these regions and the function of genes in these regions. Since the first complete human X chromosome was published in 2020, T2T assembly has quickly become a research hotspot [22,23]. In plants, the first T2T genome was reported in Arabidopsis thaliana in 2021 [7,24]. At present, T2T genome assemblies have been obtained in several species, such as rice, banana, and watermelon, fascinating researchers into genomic structure and function and their relation to crop breeding traits [25][26][27][28].
The grapevine (Vitis vinifera ssp. vinifera), a fruit tree that originated in the Near East, is one of the most widely cultivated and economically valuable crops worldwide [29]. Domesticated grapes often have highly heterozygous genomes [30], which greatly impedes the acquisition of high-quality genomes. For instance, ∼15% of genes are hemizygous in the 'Chardonnay' genome [31]. Fortunately, the PN40024 genotype, a highly homozygous cultivar derived from selfing of cv. 'Helfensteiner' [31], became the reference genome of grapevine, first obtained in 2007 (8X), and was the first fruit crop to be sequenced [32]. Subsequently, several updated versions have been released: the 12X.v2 version and its upgraded annotation VCost.v3 in 2017, and the PN40024.v4.1 version in 2021 [33]. The grape gene reference catalogue now includes a full correspondence between all of their annotation versions [34]. In addition, fragmented genome assemblies of various grape cultivars have been produced in recent years, such as those for 'Black Corinth' [35], 'Cabernet Franc' [36,37], 'Cabernet Sauvignon' [37][38][39], 'Carménère' [40], 'Chardonnay' [30,41], 'Merlot' [35], and 'Nebbiolo' [42]. As the grapevine is a representative dicotyledonous plant among fruit trees, its high-quality genome will greatly facilitate research on gene function, genetic structure, and evolution of Vitis and eudicot species.
Despite the great number of grape genome sequences available, these genome assemblies are incomplete in repetitive regions, centromeres, and telomeres. Here we generated a T2T-level gap-free grape genome of the PN40024 reference and aimed to address four main analyses. The application of third-generation sequencing and assembly technologies to high-fidelity long reads has contributed to gap-free genome assemblies [43,44]. Thus, our first question was to see whether we could complete the grape reference genome using these new sequencing and assembly approaches. Second, as studies on the centromere, telomere, and rDNA have long been neglected, we analyzed the features, structure, and distribution of these regions based on the assembled gapless grape genome. Third, the annotation of transposable elements (TEs) and genes in highly repetitive regions was improved based on the T2T genome, which could further improve our understanding of their biological functions, especially those of gene clusters. Finally, the PN40024 genome is almost fully homozygous [32], but some sites remain heterozygous after nine generations of selfing. It is worthwhile to investigate the genomic distribution and genetic effects of such heterozygous sites.

Results
A telomere-to-telomere gap-free reference genome for grapevine
PN40024, a highly homozygous inbred line originating from 'Helfensteiner', was used for T2T genome assembly. In total, 21 Gb (21 024 461 524 bp, ∼42× coverage) of HiFi reads were generated on the PacBio platform. For the preliminary assembly, hifiasm was used to assemble the HiFi reads. We then used MUMmer and the 12X.v0 genome version (V. vinifera genome assembly 12X.v0; https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_000003745.3) to order the 38 contigs into 19 chromosomes (Fig. 1). Only one gap was left after initial assembly into contigs (Supplementary Data Fig. S1). After filling the gap with continuous long reads of PN40024.v4, a gap-free PN_T2T genome was finally generated (494.87 Mb), being 69 Mb longer than 12X.v0 (426.18 Mb, Table 1) using the same statistical method. The k-mer metric was used to evaluate genomic homozygosity, estimated at 99.8% (Supplementary Data Fig. S2A-D). BUSCO (Benchmarking Universal Single-Copy Orthologs) was used to evaluate genomic completeness; 98.5% of the core conserved plant genes were found complete in the genome assembly (Supplementary Data Fig. S2E), which is 4.8 percentage points more than in 12X.v0 (93.7%, Table 1).
Compared with the 12X.v0 genome, a substantial improvement of several metrics was observed in our PN_T2T assembly. The contig N50 length of PN_T2T was ∼250 times higher than that of 12X.v0 (25.93 Mb versus 102 kb), and all 9429 gaps in 12X.v0 and 3391 gaps present in PN40024.v4 were filled in the PN_T2T genome (Table 1, Supplementary Data Table S1, Fig. 1A). As shown in Fig. 1C, 28 gaps in 12X.v0 within a 1-Mb syntenic region on chromosome 18 were filled in PN_T2T, the largest being 16 951 bp. Misassemblies in 12X.v0, such as inversions and translocations relative to PN_T2T, were also corrected (Fig. 1A, Supplementary Data Fig. S3). For example, two large inversions of 4.9 and 1.9 Mb, located around the centromere of chromosome 3 and at the ends of chromosome 5, respectively, were observed between the two versions of the assembly (Fig. 1A and B, Fig. S8 Table S2).
Based on the species-specific pan-TE database constructed by RepeatModeler2, the repeats were detected with the pipeline shown in Fig. 2A. Finally, 66.47% of our gap-free grape genome was marked as repetitive sequences (Fig. 1D). As a comparison, 62.47% of the 12X.v0 genome was identified as repetitive sequences using the same pipeline (Supplementary Data Table S3). Among the repeats predicted in the PN_T2T genome, the largest portion comprises TEs (63.90%), with a total length of 316 Mb (59.96% and 292 Mb in 12X.v0). The TEs mainly consisted of the long terminal repeat (LTR) type (47.54%), predominantly Gypsy (20.22%) and Copia (19.67%) elements. In total, we detected 276 rDNA sequences, representing 0.019% of the genome.
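The contiguity and coverage figures above can be reproduced with simple arithmetic. The following is a minimal sketch, not the authors' code: the function name `n50` and the toy contig lengths are illustrative assumptions, while the read total, assembly size, and N50 values are those quoted in the text.

```python
def n50(lengths):
    """Return the N50: the contig length L such that contigs of
    length >= L together cover at least half of the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("empty list of contig lengths")

# Toy contig lengths (hypothetical), total 150: 50 + 40 = 90 >= 75.
toy_n50 = n50([50, 40, 30, 20, 10])

# Figures quoted in the text.
coverage = 21_024_461_524 / 494.87e6   # HiFi bases / PN_T2T assembly size
n50_fold = 25.93e6 / 102e3             # PN_T2T contig N50 / 12X.v0 contig N50
size_gain_mb = 494.87 - 426.18         # PN_T2T minus 12X.v0, in Mb

print(f"coverage ~{coverage:.1f}x, N50 fold ~{n50_fold:.0f}, gain {size_gain_mb:.2f} Mb")
```

The results agree with the rounded values reported above (∼42× coverage, ∼250-fold N50 improvement, 69 Mb of additional sequence).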

Identification of telomeres and centromeres
To access the telomeric and centromeric regions in PN_T2T, we identified the telomeres and centromeres using the pipeline described in Fig. 2A. For telomeres, we checked the 150-kb sequences at both ends of each chromosome, and the length of the telomere repeat unit was set to range from 5 to 12 bp. Finally, the telomere repeat unit (TTTAGGG/CCCTAAA) was detected, which was the most abundant in the genome and carried by all chromosomes. The same telomere repeat unit was reported in grapes by Melters et al. [11] and Castro et al. [45]. We further identified telomeres at 36 of the 38 chromosome ends in the PN_T2T genome, the exceptions being the short arms of chromosome 15 and chromosome 17 (Figs 1A and 2B, Supplementary Data Table S4). Among them, the longest telomere (31 kb) was in the short arm of chromosome 8, with 4479 repeats, while the shortest telomere (1260 bp) was in the long arm of chromosome 7, with only 180 repeats.
To detect centromeric regions, we scanned candidate repeats from 30 to 500 bp along the genome. Tandem Repeats Finder (TRF) found 470 different repeat units in the PN_T2T genome. The 107-bp repeat was the most abundant unit in the whole genome, with 182 620.5 repetitions (copies ≥2) accounting for ∼3.95% of the total genome sequence, followed by 321 bp (2.45%), 214 bp (1.94%), and 135 bp (1.05%) (Fig. 3A). Interestingly, we found that the sequences of the 214- and 321-bp repeat units consisted of two and three copies of the 107-bp repeat unit, respectively. The TE analyses also supported the centromeric feature of this 107-bp repetitive unit (Fig. 2). Thus, the centromeres were recognized mainly based on 107-bp repeats, and localized on all 19 chromosomes (Figs 1A and 2B, Supplementary Data Table S5). As shown in Fig. 3B, the total length of 107-bp repeats varied from 1.4 kb to 3.8 Mb, but the sequences of the 107-bp repeats were highly conserved among chromosomes (Fig. 3C). The 107-bp repeats were the most abundant in all chromosomes, except chromosomes 3, 14, and 18 (Fig. 3D-H, Supplementary Data Table S6). We found that the 187-bp repeat was the main repeat unit in chromosome 14 and was scattered throughout the whole chromosome, and that the 51-, 56-, 105-, and 107-bp repeat units overlapped extensively and were enriched in the centromere, which showed a core region in the chromosome through IGV visualization (Supplementary Data Fig. S4). The centromeric repeat unit in chromosome 3 was the 135-bp repeat and its integer multiples (270 and 405 bp). For chromosome 18, 66 bp and its integer multiple 132 bp were the main repeat units (Supplementary Data Fig. S4).
To locate the centromeric repeats, we further examined the relationship between TEs and centromeres. LTR retrotransposons or centromeric retrotransposons (CRs) were usually mixed with tandem repeats and enriched in plant centromeric regions [46,47]. We found (Fig. 4A) that the genes and TE repeats, such as LTR  Table S5). The pattern of 107 bp was the target, which was highly linked with the centromeric region in grapes. However, there were likely different repeat units and patterns that appeared on chromosomes 3, 14, and 18 (Fig. 3F-H). The scattering of transposons and the distribution of the centromere showed that specific sequence-defined repeat superfamilies were correlated or anticorrelated, to various levels, with centromeric proximity (Figs 2B and 4A), forming density gradients that are the main chromosome-scale repeat-associated features, presumably reflecting overall chromatin structure (Supplementary Data Fig. S4).
To detect captured genes, we then screened all genes in the highly linked centromeric regions. Interestingly, we found 343 genes (Supplementary Data Tables S7 and S8) captured in the centromeres, which included 179 genes with a UniProt ID assigned through BLASTP. Through GO (Gene Ontology) functional annotation, 12 genes were enriched in protein binding (molecular function, MF), such as VviAMP1 (UniProt ID Q9M1S8), involved in ethylene, gibberellin, and abscisic acid signaling pathways [48,49]. In addition, we found 10 genes enriched in the cellular component (CC) terms cytosol, mitochondrion, and cytoplasm, including auxin transport protein VviBIG (UniProt ID Q9SRU2), which influences general growth and development in plants [50]; fumarate hydratase 1 VviFUM1 (UniProt ID P93033), a mitochondrial Krebs cycle-associated enzyme [51]; and 6-phosphogluconate dehydrogenase, decarboxylating 2 VviPGD2 (UniProt ID Q9FWA3), which plays a key role in the development of the male gametophytes and the interaction between the pollen tube and the ovule [52]. Moreover, RNA modification, protein autophosphorylation, DNA integration, DNA recombination, and photomorphogenesis appeared enriched while exploring biological process (BP) related terms (Fig. 4C).
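As an illustration of the telomere scan described above, the sketch below counts perfect tandem copies of the TTTAGGG/CCCTAAA unit near each chromosome end. It is a simplified stand-in under stated assumptions, not the TIDK-based procedure used in this study: the function names and the toy sequence are hypothetical, and only uninterrupted, mismatch-free runs are counted.

```python
import re

# Telomere repeat unit and its reverse complement, as reported in the text.
UNITS = ("TTTAGGG", "CCCTAAA")

def longest_run(seq, unit, min_copies=2):
    """Copy number of the longest perfect tandem run of `unit` in `seq` (0 if none)."""
    pattern = re.compile("(?:%s){%d,}" % (unit, min_copies))
    return max((len(m.group(0)) // len(unit) for m in pattern.finditer(seq)), default=0)

def telomere_copies(chrom_seq, window=150_000):
    """Longest telomeric run (in repeat copies) within `window` bp of each end."""
    seq = chrom_seq.upper()
    ends = {"start": seq[:window], "end": seq[-window:]}
    return {name: max(longest_run(s, u) for u in UNITS) for name, s in ends.items()}

# Toy chromosome (hypothetical): 5 copies at the start, 3 copies at the end.
toy = "CCCTAAA" * 5 + "ACGTACGTAC" * 4 + "TTTAGGG" * 3
copies = telomere_copies(toy, window=40)

# Lengths implied by the copy numbers quoted in the text (7-bp unit).
longest_bp = 4479 * len(UNITS[0])   # chromosome 8, short arm
shortest_bp = 180 * len(UNITS[0])   # chromosome 7, long arm
```

With a 7-bp unit, 4479 and 180 copies correspond to 31 353 bp and 1260 bp, consistent with the ∼31 kb and 1260 bp telomeres reported above.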
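The observation that longer repeat units are integer multiples of shorter ones can be checked mechanically. The sketch below is an assumption-laden illustration rather than the analysis performed here: `copies_of` is a hypothetical helper that only recognizes exact copies (real centromeric monomers diverge), and the toy sequences are invented.

```python
def copies_of(short_unit, long_unit):
    """Return k if `long_unit` is k exact tandem copies of `short_unit`
    (in any rotation, since TRF may report any phase), else 0."""
    if not short_unit or len(long_unit) % len(short_unit) != 0:
        return 0
    k = len(long_unit) // len(short_unit)
    # Any rotation of k tandem copies is a substring of k + 1 tandem copies.
    return k if long_unit in short_unit * (k + 1) else 0

# Toy example (hypothetical sequences): a rotated dimer of a 5-bp unit.
dimer_k = copies_of("ACGTT", "GTTACGTTAC")

# Length relationships reported in the text: (short unit, long unit) in bp.
pairs = [(107, 214), (107, 321), (135, 270), (135, 405), (66, 132)]
multiples = {long: long // short for short, long in pairs if long % short == 0}
```

All five length pairs quoted in the text divide evenly, giving two or three copies of the base unit in each case.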

Gene clusters in the grapevine reference genome
To infer the gene clusters in the grapevine genome, protein-to-protein alignments among the PN40024 protein-coding genes exposed a rich panoply of duplication structures in terms of genomic positions and functions. Prominent and complex tandem-like blocks of high-similarity genes could be seen via visualizations of all-versus-all alignments (Supplementary Data Fig. S5). We found a total of 377 gene clusters in the grapevine reference genome (Supplementary Data Table S9). These duplications often involved local rearrangements and could extend to megabases with dozens to hundreds of genes involved (Fig. 5).

Heterozygous regions remaining after nine generations of selfing
Based on the PN_T2T genome assembly, the resequencing data of four PN40024 clones were downloaded from NCBI and analyzed [32,53]. A total of 244 215 SNPs were detected, among which 208 330 SNPs (85.3%) were shared in all four samples while the other 35 886 SNPs were only present in one to three samples (Fig. 6A). Interestingly, we found nine hotspots of heterozygous SNPs on chromosomes 1, 2, 3, 4, 7, 10, 11, and 16 ( Fig. 5A, Supplementary Data Fig. S6). To further investigate the highly heterozygous regions, we examined the top 5% heterozygosity windows and identified a total of nine large continuous fragments (chromosome 1, 1. . The GO enrichment analysis of the genes in these regions showed that the most significantly enriched terms were response to water deprivation, protein phosphorylation, cell division, response to oxidative stress, and response to salt stress, which were closely associated with key physiological activities in plants (Supplementary Data Tables S10 and S11, Fig. 6C, Supplementary Data Fig. S7). We further phased these nine hotspots of heterozygous regions on the PN_T2T reference genome (Supplementary Data File 2).
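The selection of the top 5% heterozygosity windows can be sketched as follows. This is a minimal illustration under assumptions: the window size, the function names, and the toy SNP positions are hypothetical, since the text does not state the window length used.

```python
from collections import Counter

def window_counts(positions, window):
    """Count heterozygous SNPs per fixed-size, non-overlapping window."""
    return Counter(pos // window for pos in positions)

def top_fraction_windows(counts, n_windows, fraction=0.05):
    """Indices of the top `fraction` of windows ranked by SNP count."""
    ranked = sorted(range(n_windows), key=lambda i: counts.get(i, 0), reverse=True)
    keep = max(1, round(n_windows * fraction))
    return sorted(ranked[:keep])

# Toy chromosome (hypothetical): 20 windows of 100 bp, one SNP each,
# plus a dense block of 10 extra SNPs in window 7.
positions = [i * 100 + 5 for i in range(20)] + [700 + j for j in range(10, 20)]
counts = window_counts(positions, window=100)
hotspots = top_fraction_windows(counts, n_windows=20)

# Share of SNPs common to all four clones, from the counts in the text.
shared_pct = 100 * 208_330 / 244_215
```

Adjacent top-ranked windows would then be merged into continuous fragments; that merging step is omitted here.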

Discussion
A complete reference genome is essential for crop genetic studies and breeding purposes. The latest version of the PN40024.v4 assembly improved the reference resource by including long-read sequences and by gathering a gold-standard annotation [31]. Nevertheless, these previous versions still possessed thousands of gaps and lacked repetitive regions, centromeres, and telomeres, all of which limited access to variants within these regions. On occasion such unreachable regions underlie quantitative trait loci (QTLs) for important agronomic traits, such as berry color and sex determination on chromosome 2 [30][54][55][56] and disease resistance on chromosome 14 [57,58]. A full reference genome therefore has great potential to reveal the missing heritability of important polygenic agronomic traits, increasing genetic gain in grapevine breeding.
More and more investigations suggest the important functions of gene clusters, with a total of 377 gene clusters being detected in PN_T2T. The grapevine genome is also widely used in studies of plant evolution and comparative genomics because of its important phylogenetic position in the evolution of eudicots [32]. The T2T version could be widely used in plant evolutionary genomics, especially the repetitive sequences, centromeres, and telomeres. The T2T gap-free reference genome has incorporated gene annotations of previous versions with more accurate TE annotation (up to ∼67% of the genome), which will be an important resource for grapevine functional genomics and breeding.

Architecture and context of plant centromeres
The centromeric region ranges from kilobases to gigabases in length, including >90% tandem repeats [59]. The centromere is among the last great unknowns in genomics, since it was inaccessible by previous sequencing technologies. Assemblies often collapse due to the highly repetitive nature of the centromeric region. We assembled and annotated centromeres for all 19 chromosomes of the grapevine genome (Fig. 1). Most of the chromosomes have a single centromere while others could have multiple centromeric regions, the so-called holocentromere [60,61]. On chromosomes 16 and 18 we found tandem repeats in many regions, while on other chromosomes only a single peak was detected (Fig. 2B), suggesting that the structure of the centromeric region might be more complicated and requires further investigation.
In the PN40024 grapevine reference genome there are three major repetitive patterns across the 19 chromosomes, suggesting different chromosomal evolutionary histories (Fig. 3D-H). On chromosomes 3, 14, and 18, we found 135-, 56-, and 66-bp tandem repeats, respectively (Supplementary Data Fig. S4), while on other chromosomes the major unit of tandem repeats was 107 bp (Figs 3D-H and 4B and D). The evolutionary histories of the centromeres of each grapevine chromosome are still an open question to be addressed with all Vitis genomes. Previous comparative genomic analyses suggested that the centromere is conservative among closely related species with a constant number of chromosomes [9]. Transformation of centromeric structures occurs during chromosome division and fusion when the number of chromosomes changes throughout evolution. The muscadine grape (Vitis rotundifolia) has 20 chromosomes, with chromosomes 7 and 13 collinear with subgenus Vitis chromosome 7, which is associated with a chromosome fusion event [62]. Only one centromeric region is left on chromosome 7 in our grapevine reference genome (Fig. 2B, Supplementary Data Fig. S4), suggesting one centromere was lost during the evolution of the genus Vitis.
Centromeric architecture shapes the content within the genome, population genetic diversity within species, and genetic differentiation among species. Population genetic analyses have previously revealed that the genetic variants in the centromeric region are highly linked, with much lower genetic diversity compared with chromosome arms [63]. The centromeres capture tens to thousands of genes that are highly linked to the centromeric tandem repeats. These genes, along with the centromeric region, function as supergenes [64]. In total, we found 343 captured genes (Supplementary Data Table S7) in the centromeric region in the grapevine reference genome. Interestingly, the genes are mainly involved in the ethylene, gibberellin, and abscisic acid signaling pathways [48,49].

Hotspots of heterozygosity in a nearly homozygous genotype
The current plant used to build the grapevine reference genome originated from the 'Helfensteiner' cultivar selfed for nine generations, which resulted in a 99.8% homozygous genome (Supplementary Data Fig. S2A-D). The remaining heterozygous sites are still of interest as they could represent hotspots of required heterozygosity, with lethal consequences if found in the homozygous state. Thus, we collected Illumina resequencing reads for four clones of PN40024 maintained in different international laboratories. Interestingly, the heterozygous SNPs and structural variants (SVs) were enriched in specific regions when mapped to PN_T2T. In total, we found 208 330 heterozygous SNPs shared by the four samples, and 35 886 SNPs specific to one to three samples. The former are more likely the original variants of PN40024 remaining after nine generations of selfing, while the latter could be somatic variants generated during distribution and tissue culture in the different laboratories. Interestingly, we found that hotspots of common variants were enriched in central biological processes, including the oxidation-reduction process and protein phosphorylation. The hotspots on chromosome 2 also covered the sex-determination QTL region (Fig. 6), which complicated the mining of the sex-determination genes [30,56], because the candidate genes were not present in the old version of the reference genome. It has been reported that, during the clonal reproduction of fruit trees, such heterozygous deleterious variants accumulate in the genome [30,65]. The clonal processes hide recessive deleterious variants, including small SNPs and indels and large structural variants, in a heterozygous state [30,55]. Strong inbreeding depression has been commonly observed in clonal crops, including potato, cassava, citrus, and grapevine [55][66][67][68], since the strongly deleterious variants in these genomic regions have been exposed to lethal or strong recessive selection during selfing cycles.
In grapevine breeding, inbreeding and outcrossing depression were commonly detected because the hidden heterozygous recessive deleterious variants that increased during clonal propagation were exposed during sexual reproduction.
Altogether, and still acknowledging all previous sequencing efforts, our work represents the completion of a full T2T sequence of the grape reference genome. This assembly, together with the previous manually curated annotation, currently being transferred into PN_T2T, should represent the gold standard for the grapevine community. In line with this forecast, the T2T assembly and its updated annotation are available for download at the Grape Genomics Encyclopedia (GRAPEDIA; https://grapedia.org/), where it will be used along with different application program interfaces, including gene cards, transcriptomic data visualizations, and software for variation-gene expression-phenotype associations.

Sample collection and genome sequencing
PN40024 is one of the near-homozygous lines originally derived from the 'Helfensteiner' cultivar [31] by successive selfing steps, estimated to be close to 97% homozygosity as tested by SSR markers [32]. We obtained this inbred material from INRAE under a Material Transfer Agreement (MTA) and transplanted it in the greenhouse belonging to AGIS (Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China) for subsequent experiments.
Young leaves and ovules from PN40024 were flash-frozen in liquid nitrogen. Genomic DNA and RNA were isolated using the DNeasy Plant Mini Kit (Qiagen) following the manufacturer's instructions. For PacBio HiFi sequencing, two single-molecule real-time cells were sequenced on a PacBio Sequel II platform, and a total of 21 Gb of HiFi reads was generated using CCS (https://github.com/PacificBiosciences/ccs) with the default parameters for the sequenced accessions. For RNA-seq, 10 μg of poly(A) mRNA that was isolated from total RNA was used for preparing Illumina RNA-seq libraries for each sample. These libraries were then sequenced using the Illumina HiSeq™ 2000 system in accordance with the manufacturer's instructions.

Telomere-to-telomere genome assembly
Initially, the PN40024 genome was assembled by incorporating PacBio single-molecule real-time long-read sequences. Reads generated by the PacBio Sequel II platform were self-corrected, trimmed, and assembled by hifiasm, using default parameters (https://github.com/chhylp123/hifiasm) [43]. The initial output of hifiasm (v.0.13) yielded the p_ctg draft assembly. Genome heterozygosity was estimated using a k-mer-based approach by GenomeScope 2.0 [69]; it was estimated to be close to 99.8% homozygosity (Supplementary Data Fig. S2A-D). Then, homology-based scaffolds were generated with MUMmer (v.4.0.0) [70] 'scaffold', using the 12X.v0 reference genome (Supplementary Data Fig. S3). By applying MUMmer tools, we ordered and oriented the contig-level assemblies into 19 chromosomes, and joined adjacent contigs with 100 Ns to generate scaffolds. Finally, we adjusted the assembly manually by aligning the genome sequencing data from the previous version of PN40024, which were mapped to the genome assembly by minimap2 (v.2.21) and visualized in IGV (v.2.12.3) software to observe whether the gap regions were supported by reads (Supplementary Data Fig. S1). Filling and closing of the gaps with the selected and assigned contigs were performed by mapping the 50-bp sequences around the gap to continuous long reads of PN40024.v4, yielding the gapless T2T PN40024 assembly for all 19 grape chromosomes. The assembly was inspected based on BUSCO [71] completeness and the duplication score. For the phasing of highly heterozygous regions, minimap2 was used to align all reads to the PN_T2T assembly. The primary contigs assembled by hifiasm and ragtag were used to phase these contigs into two haplotypes.
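The scaffolding convention described above, in which adjacent contigs are joined by a run of 100 Ns that is later replaced by real sequence, can be illustrated with a short sketch. The function names and toy contigs are hypothetical, and the gap-filling step here is a plain string substitution rather than the read-mapping procedure used in this study.

```python
import re

def join_contigs(contigs, gap_len=100):
    """Join ordered, oriented contigs into one scaffold with N-runs between them."""
    return ("N" * gap_len).join(contigs)

def find_gaps(seq):
    """Return (start, end) coordinates (0-based, end-exclusive) of every N-run."""
    return [(m.start(), m.end()) for m in re.finditer(r"N+", seq)]

# Toy example (hypothetical contigs of 20 bp each).
scaffold = join_contigs(["ACGT" * 5, "GGCC" * 5])
gaps_before = find_gaps(scaffold)

# Replace the placeholder gap with a (hypothetical) spanning sequence.
start, end = gaps_before[0]
filled = scaffold[:start] + "TTGACA" + scaffold[end:]
gaps_after = find_gaps(filled)
```

An assembly is gap-free in this sense when `find_gaps` returns an empty list for every chromosome.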

Annotation of genes and transposable elements
We used a self-developed method for genome annotation. The putative genes were first searched for by using transcripts and UniProt as evidence. A preliminary gene model was then built for the putative genes and a further search was performed using AUGUSTUS (v.3.4.0) [72]. All the putative gene fragments found were then filtered to remove genes involving duplicated regions, genes with coding sequence lengths shorter than 90, and genes not supported by any evidence. We attempted to complement missing genes, and the complete genes were subjected to alternative splicing analyses. Finally, all the results were examined by hidden Markov models downloaded from the Pfam database to obtain the final gene models. Interproscan (v.5.56-89.0) [73] was used for functional annotation of our assembly, and Pfam (v.34.0) [74] and Coils (v.2.2.1) [75] were used for the identification of structural domains (https://github.com/unavailable-2374/Genome-Wide-Annotation-Pipeline).
The primary repeat analysis is outlined in Fig. 2A and began with the construction of a pan-Vitis database of repeat families by RepeatModeler (open-2.0.3) [76] and a series of scripts, which was then applied with RepeatMasker (open-4.1.2). For building this pan-Vitis repeat database we downloaded 17 Vitis genomes from NCBI, then used RepeatModeler2 to identify TE families. After that, we got 17 consensus fasta files of TE families and by removing the single-copy and failed annotations we aggregated these files. We used NCBI-BLAST+2.9.0 [77] to remove some redundant sequences (−i 80%, −l 80%). Next, we got the final file of repeat identity, then used deepTE [78] with the Plant model to classify the unclassified repeat elements. Finally, the repetitive sequence of the complete reference genome was annotated by RepeatMasker.

Identification of telomeres and centromeres
The telomere repeat units were explored by using the TIDK (v.0.2.0) (https://github.com/tolkit/telomeric-identifier) with options tidk explore -f genome.fa -minimum 5 -maximum 12 -o tidk_explore -t 2 -log -dir telomere_find -extension TSV. Then the whole genome was searched using the following parameters: tidk search -f genome.fa -s TTTAGGG -o tidk_search -dir telomere_find. Finally, we completed the rapid statistics of telomeres based on the TIDK plot and used the R script to visualize the telomere peak.
To detect the functions of the genes captured in the centromeric regions, we downloaded the protein sequence library of Swiss-Prot (2022/08/30, https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/) for a local blast. After this, we extracted all the protein sequences of PN_T2T blasted by diamond (v.2.0.15) (parameter: -k 1 -e 0.00001, https://github.com/python-diamond/Diamond). We further uploaded the Swiss-Prot ID to DAVID (https://david.ncifcrf.gov/tools.jsp) and completed GO enrichment and annotation. Finally, data visualization was completed by our R scripts.

Identification of gene clusters
To define the clustered genes in the reference genome, protein sequences were extracted using gffread and then filtered by e-value <1e-5 and similarity >30% using BLASTP for all-versus-all alignments. The filtered alignment results were combined with functional annotations to filter out alignment results that did not share the same structural domains. Finally, we determined the presence of gene clusters by identifying three consecutive identical Pfam accessions (https://www.ebi.ac.uk/interpro/entry/pfam/#table), using such Pfam accessions as seeds, and going up and down 30 genes to find genes with the same Pfam accessions. In total, 377 gene clusters were found (Supplementary Data Table S8).
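One possible reading of the seed-and-extend rule above is sketched below. It is an interpretation under assumptions, not the authors' pipeline: each gene is reduced to a single Pfam accession (real genes may carry several domains), the function name and toy accessions are hypothetical, and overlapping clusters from nearby seeds are not merged.

```python
def find_clusters(pfams, seed_len=3, flank=30):
    """Find gene clusters in a list of per-gene Pfam accessions (None if
    unannotated) given in chromosomal order. A seed is `seed_len`
    consecutive genes with the same accession; the cluster collects all
    genes with that accession within `flank` genes of the seed run."""
    clusters = []
    n = len(pfams)
    i = 0
    while i <= n - seed_len:
        acc = pfams[i]
        if acc is not None and all(pfams[i + k] == acc for k in range(seed_len)):
            j = i + seed_len
            while j < n and pfams[j] == acc:   # extend the consecutive run
                j += 1
            lo, hi = max(0, i - flank), min(n, j + flank)
            members = [g for g in range(lo, hi) if pfams[g] == acc]
            clusters.append((acc, members))
            i = j
        else:
            i += 1
    return clusters

# Toy chromosome (hypothetical accessions): genes 2-4 form a seed,
# gene 6 shares the accession and lies within the flanking range.
toy = ["PF1", None, "PF2", "PF2", "PF2", None, "PF2", "PF3"]
clusters = find_clusters(toy)
```

In this toy case a single cluster is returned for the hypothetical accession PF2, containing the three seed genes plus the nearby fourth member.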

Heterozygosity in PN40024 clones
Four resequencing samples were downloaded from the NCBI database (SRR6156373, SRR8835144, SRR8835157, SRR8835168) and mapped to the newly assembled PN_T2T genome for SNP calling. Quality-controlled reads were mapped to the genome using bwa (v.0.7.15) with the default parameters. SAMtools (v.1.4) and GATK (v.4.1.8) were used for sorting and indexing the duplicate-free BAM files. The gvcf files were combined in GATK and were used to jointly call SNPs across all samples. To obtain high-quality SNPs, we performed strict filtering of the SNP calls based on the following criteria: (i) SNPs with more than two alleles were removed in all samples in vcftools with parameters --min-alleles 2 --max-alleles 2; (ii) we removed SNPs with genotype quality scores (GQ) <30 (--minGQ 30) and SNPs with any missing genotypes (--max-missing 1); (iii) we retained only SNPs with minor allele frequencies (MAFs) ≥0.01 to remove the invariable sites.
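The three filtering criteria can be expressed compactly in code. The sketch below is an illustrative restatement under assumptions, not a replacement for vcftools: the function name and the genotype representation (a tuple of two allele indices plus GQ per sample, or None when missing) are hypothetical.

```python
def passes_filters(alt_alleles, genotypes, min_gq=30, min_maf=0.01):
    """Apply the three criteria from the text to one SNP site.

    alt_alleles: list of ALT allele strings at the site.
    genotypes:   one (allele1, allele2, GQ) tuple per sample, with allele
                 indices 0 (REF) or 1 (ALT); None marks a missing genotype.
    """
    if len(alt_alleles) != 1:                       # (i) biallelic sites only
        return False
    if any(g is None for g in genotypes):           # (ii) no missing genotypes
        return False
    if any(gq < min_gq for _, _, gq in genotypes):  # (ii) GQ >= 30 in all samples
        return False
    alleles = [a for a1, a2, _ in genotypes for a in (a1, a2)]
    alt_freq = sum(alleles) / len(alleles)
    maf = min(alt_freq, 1 - alt_freq)
    return maf >= min_maf                           # (iii) drop invariable sites

# Toy sites for four samples (hypothetical genotypes).
shared_het = passes_filters(["T"], [(0, 1, 99)] * 4)
invariant = passes_filters(["T"], [(0, 0, 99)] * 4)
low_quality = passes_filters(["T"], [(0, 1, 99)] * 3 + [(0, 1, 12)])
multiallelic = passes_filters(["T", "G"], [(0, 1, 99)] * 4)
```

With only four samples, any site carrying at least one ALT allele has a MAF of at least 0.125, so criterion (iii) acts mainly to discard sites that are invariant across the clones.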