High-quality chromosome-scale genomes facilitate effective identification of large structural variations in hot and sweet peppers

Abstract Pepper (Capsicum annuum) is an important vegetable crop that has been subjected to intensive breeding, resulting in limited genetic diversity, especially for sweet peppers. Previous studies have reported pepper draft genome assemblies using short read sequencing, but their capture of the extent of large structural variants (SVs), such as presence–absence variants (PAVs), inversions, and copy-number variants (CNVs) in the complex pepper genome falls short. In this study, we sequenced the genomes of representative sweet and hot pepper accessions by long-read and/or linked-read methods and advanced scaffolding technologies. First, we developed a high-quality reference genome for the sweet pepper cultivar ‘Dempsey’ and then used the reference genome to identify SVs in 11 other pepper accessions and constructed a graph-based pan-genome for pepper. We annotated an average of 42 972 gene families in each pepper accession, defining a set of 19 662 core and 23 115 non-core gene families. The new pepper pan-genome includes informative variants, 222 159 PAVs, 12 322 CNVs, and 16 032 inversions. Pan-genome analysis revealed PAVs associated with important agricultural traits, including potyvirus resistance, fruit color, pungency, and pepper fruit orientation. Comparatively, a large number of genes are affected by PAVs, which is positively correlated with the high frequency of transposable elements (TEs), indicating TEs play a key role in shaping the genomic landscape of peppers. The datasets presented herein provide a powerful new genomic resource for genetic analysis and genome-assisted breeding for pepper improvement.


Introduction
The availability of reference genomes for many crops has generated a powerful platform for discovering genetic variation and identifying marker-trait associations for agriculturally important traits. Until recently, small variations between plant genomes were identified by comparing a single genome typically not of high quality but rather a single pass with short reads to a high-quality reference genome [1,2]. This approach has supported the efficient discovery of small insertion/deletion (InDel) variants and single-nucleotide polymorphisms (SNPs) between the two genomes of interest but has been less useful, or even failed, in identifying large structural variants (SVs) when they are either missing or highly diverged from the reference genome. Furthermore, a single genome cannot encompass the full complement of the genetic diversity of a species [3,4]. SVs, such as presence-absence variation (PAV) and copynumber variation (CNV), cover large genomic regions from 1 kb to several megabases and can functionally inf luence the genome to a greater extent than SNPs or small InDels [3][4][5][6]. However, capturing the full spectrum of all genomic variants in plant species with complex genomes remains challenging.
Pepper (Capsicum spp.) is one of the earliest domesticated crops in the Solanaceae family and is believed to have originated in Central and South America [21]. Pepper accessions can be broadly classified as either "hot" or "sweet" according to the pungency of their fruits. As a result of domestication, breeding, and market demand, hot peppers usually bear small fruits, whereas sweet peppers tend to have larger fruits. Capsicum species display extensive variation in agriculturally important traits, such as plant architecture, fruit color and morphology, accumulation of secondary metabolites, yield, and response to a number of biotic and abiotic stresses [22,23]. Efficient utilization of pepper genetic resources is essential to help face challenges arising from adverse climatic change and rapidly evolving pathogens and, ultimately, to meet the current and future global demand for pepper fruits [24].
Following the release of the first C. annuum reference genome in 2014 [25,26], multiple pepper genome assemblies have been released using short read sequencing approaches [27][28][29] and have opened the door to better understanding pepper genome organization and to accelerating breeding. In addition, these initial genome sequences have led to the identification of specific genomic regions subjected to domestication and improvement [24,[30][31][32]. An earlier version of a pepper pan-genome using C. annuum "Zunla" as a reference genome facilitated genomewide association studies (GWAS) for carotenoid contents in Capsicum species [33]. The complement of genomic variation was identified in these studies based on pepper genome assemblies developed from short reads, which have inherent limitations, such as low continuity and low coverage of genetic variants [25,26,29], as well as poor resolution, especially in low-recombining regions [28]. Overall, the current state of pepper genome assemblies has limited applications for the capture of large genome SVs, including CNVs and PAVs, by short read-based genome sequencing and reference-based approaches. Therefore, a pan-genome developed using highquality reference genomes is an important and necessary step toward a comprehensive survey of domestication, improved utilization of genomic and genetic resources, and a more inclusive characterization of pepper gene functions and breeding potential.
In this study, we assembled a chromosome-scale reference genome of the C. annuum sweet pepper-type cultivar 'Dempsey' using a combination of long-read and short-read sequencing, complemented with high-resolution chromosome conformation capture (Hi-C) scaffolding and optical mapping technologies, with the aim to improve the overall assembly quality. We exploited the resulting high-quality 'Dempsey' reference genome for the detection of novel loci associated with fruit orientation. We also sequenced the genomes of important pepper accessions including sweet and hot peppers by long-read and/or linked-read methods. We combined all genome assemblies into a pepper pangenome using 'Dempsey' as the reference, or pivot genome, followed by the identification of large SVs. Finally, we validated SVs associated with important agricultural traits using this pepper pan-genome.

De novo genome assembly of the 'Dempsey' reference genome
We sequenced and assembled a high-quality genome for the C. annuum cultivar 'Dempsey' by integrating Pacific Biosciences (Pacbio, Menlo Park, USA) long reads, Illumina short reads (Illumina Inc., Carlsbad, USA), optical mapping, and Hi-C (Table S1). We generated approximately 279 Gb (∼90× coverage) of raw longread data using the PacBio Sequel platform. The assembled contigs after error correction covered 3.03 Gb of sequence, with an N50 (the size of the shortest contig fragment from 50% of the total genome) value of 18.3 Mb. We then used 815 million Hi-C sequencing reads to align the assembled PacBio contigs above, which increased the N50 value to 257.6 Mb (Table S1, S2). Optical mapping generated 157× genome coverage, with an N50 value of 236.7 kb (Table S3). The integration of all scaffolds obtained from PacBio, Hi-C, and optical mapping data covered 3.05 Gb of genome sequence, with a final scaffold N50 value of 260.6 Mb, which is close to chromosome size (Table S1). For any incongruences between the Hi-C and optical maps, we manually adjusted BioNano super-scaffolds with previously published pepper genetic maps (Fig. 1a) and 'Perennial' × 'Dempsey' genetic map [34,35] (Fig. 1b). Pearson's correlation coefficients between the 'Dempsey' genome assembly and the genetic maps ranged from 0.928 to 1 for all chromosomes ( Supplementary Fig. S1).
The final 'Dempsey' genome assembly consisted of 3.03 Gb of sequence arranged into 141 scaffolds, with a scaffold N50 value of 260.6 Mb. We then assembled them into 15 super-scaffolds and 18 scaffolds representing the 12 'Dempsey' chromosomes ( Fig. 1a; Table S4; Supplementary Dataset S1), with much improved genome quality (smaller gap sizes and considerably fewer gaps) than previous pepper genomes (Table S5).

Use of 'Dempsey' reference genome for GWAS of fruit orientation
We performed a GWAS for fruit orientation using an association panel of 288 pepper accessions and the high-quality Dempsey v1.0 reference genome of this study. We used PepperSNP16K array SNP data [35] and compared it to the Dempsey v1.0 reference. GWAS identified 28 SNPs associated with the fruit orientation trait, with most significant associations detected on chromosomes 10, 11, and 12 of the newly assembled 'Dempsey' genome ( Fig. 2a,b; Supplementary Fig. S2, Supplementary Dataset S2, Supplementary Dataset S3). The most significant SNP on chromosome 12 (UC_Pep_Genomic_6969647) explained over 20% of the phenotypic variation (PVE) of the trait, and was located 1.8 kb from DEMF12G21950, encoding a protein with a zinc-binding in the reverse transcriptase (zf-RVT) domain (Pfam domain: PF13966) (Supplementary Dataset S1). A previous GWAS consisting of 196 diverse pepper accessions detected a significant association in this region as well [36], suggesting that DEMF12G21950 is a strong candidate gene for controlling pepper fruit orientation. In addition, we generated a boxplot for visualization of correlation between the genotype and phenotype across the association panel of 288 accessions for the three most significant SNPs from chromosome 10 (UC_Pep_Genomic_5119667), chromosome 11 (UC_Pep_Genomic_7186095), and chromosome 12 (UC_Pep_Genomic_6969647) (Fig. 2c). Most of the erect peppers carried the B haplotype on chromosome 12, while the contrasting pendent phenotype was associated with haplotype B on pepper chromosome 10 or 11. These results suggest that a factor on chromosome 12 is required for erect fruit phenotype, while factors

Genomic information of pan-genome related pepper accessions
To construct a pepper pan-genome, we selected 11 additional C. annuum cultivars, 'CV3', 'ThaiHot', 'Maor', 'I19-702-1', 'LaMuyo-01', 'PG1', 'UCD10X', 'Perennial', 'CC-090', 'CC-260', and 'MR', which represented much of the genetic diversity among the 288 accessions of the C. annuum core collection, as ref lected by their phylogenetic relationships ( Supplementary Fig. S3). Pepper cultivars, 'CV3' and 'ThaiHot' were sequenced using a linkedread sequencing approach (to ∼100× coverage), and 'Perennial', 'CC-090', 'MR', and 'CC-260' were sequenced using Nanopore (Oxford, UK) sequencing (∼30× coverage with Illumina HiSeq 150 bp reads) ( Table 1). We performed reference-guided genome assembly for each pepper accession. The scaffold N50 sizes of 11 pepper genome assemblies ranged from 3.69 to 260.6 Mb ( Table 1). The sizes for the final genome assemblies ranged from 3.03 to 3.22 Gb, with an average of 96.3% scaffolds anchored to chromosomes (Table 1). Consistent with this very high percentage of anchored scaffolds, we obtained similar numbers of genes in each pepper genome. The annotation of all pepper genomes identified between 35 018 and 39 938 gene models (Table 1), which was close to the 34 903 genes annotated in the reference genome of the 'CM334' pepper (v.1.55) [25]. Moreover, benchmarking universal single- (c) Boxplot analysis of GWAS results from the three highly significant SNPs (Chr 10, Chr 11, Chr 12) for fruit orientation. Each A/B genotype represents the SNPs in three significant associated GWAS regions. Chromosome numbers of each genotype were designated with A/B genotype on the horizontal axis of the boxplot graph. Numbers above the graph indicate the number of accessions for each genotype. For phenotypic analysis (y-axis), a low number represents the erect phenotype; a high number represents the pendent phenotype.
copy ortholog (BUSCO) analysis (Embryophyta odb10) revealed >96% completeness for all pepper genome assemblies reported here (Table 1). Repetitive DNA made up >84% of each pepper genome, ranging from 84.9% to 88.4% (Table 1). The most abundant repetitive sequences were retrotransposons (RTs) (an average of 69.6% of 12 pepper genome assemblies), of which RTs with long terminal repeats (LTR RTs) accounted for an average of 66.05% of the pepper genome assemblies (Table S6). The most frequent RT belonged to the Gypsy family, covering about 56.0% to 57.1% of the pepper genome and representing 60.2% of all repeat elements. The second most abundant RTs were Copia elements, making up 6.1-6.2% of the pepper genomes and 6.2% of all repeat elements (Table S6). Long interspersed nuclear elements (LINEs), non-LTR RTs, and short interspersed nuclear elements (SINEs) represented another 3.6-3.65% of the pepper genomes (Table S6). DNA transposons, also called class II elements (DNA transposons and Helitrons), accounted for 8.7-9.2% of the genome (Table S6).

Gene family analysis of 12 pepper genomes
A search for gene families classified all genes annotated across the 12 pepper genomes into 42 972 gene families (Fig. 3a). The total number of gene families increased with each addition of a genome, but the number of core-and pan-gene families were nearly f lattened when 10 or more genomes were added, ref lecting the good representation of the pan-gene landscape offered by the 12 selected pepper accessions. Based on the genome size of the Dempsey v1.0 assembly presented here (3.05 Gb), cumulative lengths of 79.39-149.70 Mb additional genomic sequences (including PAVs and CNVs) were found in the other pepper genome sequences. Of the 42 972 gene families, we classified 19 662 (276 220 genes) as core gene families (present in 12 genomes) ( Fig. 3b and c). Another 23 115 gene families (178 616 genes) were non-core (present in 2-11 accessions) (Fig. 3b), which represented about 39.19% of all genes in the pan-genome (Fig. 3b). The percentage of specific genes ranged from 0.05 to 0.68% in the different genomes. Gene Ontology (GO) term enrichment analysis on core and noncore genes indicated that core genes were most often enriched for GO terms related to biosynthetic process, gene expressions, biosynthetic process, biological regulation, while non-core genes appeared to be enriched for nucleotide/protein binding, hydrolase activity, DNA metabolic process, and nuclease activity (Fig. 3d,e).

Genomic SVs between pan-genome pepper lines
To identify SVs, we aligned the assembled genome sequences of the 12 pepper accessions including recently published high quality pepper genome, 'CA59' [37], to the pivot genome, Dempsey v1.0. The alignments showed good co-linearity between pan-genome individuals and the 'Dempsey' reference genome ( Supplementary Fig. S4). We determined that an average of 94.9% (93.09-96.36%) of the 'Dempsey' genome exhibits a syntenic relationship with each of the other 11 pepper genomes (Table S7) Table S8). The UCD10X genome assembly stood out with many more INVs relative to 'Dempsey' than the other 11 pepper genomes, which might ref lect the quality of the UCD10X assembly and the unknown orientation of its genome contigs. In addition to INVs, we detected an average of 131 972 InDels (with sizes no more than 50 bp) (Fig. 4b) and an average of 7 279 285 SNPs across the pan-genome. PAVs were usually enriched in distal regions ( Fig. 1c; Supplementary Fig. S5); we also identified genomic regions as potential hotspots for PAVs ( Supplementary Fig. S5; Supplementary Dataset S5). The hotspot region on chromosome 3 harbored many genes involved in stress responses (NB-ARC, protein kinase, and cytochrome P450) and was characterized by a complex genetic architecture (Fig. 4c-e).
We observed similar distribution patterns for PAVs within hot and sweet peppers genomes, however sweet peppers exhibited lower number of PAVs in 5 UTRs than hot peppers, which had more InDels than sweet peppers (Fig. 4a,b), indicating the complex and distinct genomic architecture of hot and sweet peppers. Combining all PAVs from the 12 pepper genomes, we constructed a pan-genome by following a graph-based strategy [38,39].

Validation of known SVs in constituent accessions of the pan-genome
To test the power of a graph-based pepper pan-genome for the detection of genetic variants associated with agricultural traits, we first focused on the genomic regions harboring the genes Pun1, Capsanthin/capsorubin synthase (CCS), potyvirus resistance6 (pvr6), and FASCICULATE (FA) to search for variation that might correlate with the known phenotypes of the pan-genome accessions [2,[40][41][42] (Table S9 and S10). Non-pungent peppers carry a deletion in the 5 region of Pun1; we identified a deletion within the coding region of Pun1 in the genomes of all sweet peppers from this study ('Dempsey', 'LaMuyo-01', 'I19-702-1', 'PG1', and 'Maor') (Supplementary Fig. S6; Table S10). All sweet pepper genomes had a 2.57-kb deletion spanning the promoter, 5 UTR, and first exon of Pun1, whereas we detected no such deletion in the hot pepper genomes ('Perennial', 'ThaiHot', 'MR', 'CC260', 'CC090', and 'CV3'). For the UCD10X genome, an F 1 hybrid between pungent and nonpungent pepper accessions, the non-pungent allele types were detected in the chromosome 2 while the other allele was detected on the other haplotig [28]. CCS is an important gene associated with variation for fruit color in pepper. The genome of accession 'I19-702-1', characterized by yellow fruits, showed a large deletion of 4430 bp in the CCS gene resulting in a non-functional CCS allele. The pvr6 mutation is a recessive allele that confers resistance to potyvirus and was originally identified in the cultivar 'Perennial'. The causal mutation was mapped to a 83-bp deletion in the first exon of Eukaryotic initiation factor 4E isoform (eIF(iso)4E) [41]. The wild-type Pvr6 allele lacking this 83-bp deletion is present in the genomes of potyvirus-susceptible pepper accessions. We detected the truncated nonfunctional pvr6 allele in the 'CV3' pepper genome based on our pan-genome study (Supplementary Fig. S6). Nonfunctional alleles of FA are responsible for the fasciculate fruit clusters mutation [42]. We discovered a 428-bp deletion spanning the second exon of FA in the genome of the accession 'ThaiHot' (Supplementary Fig. S6), which bears fasciculate fruit clusters. Together, these results highlight the utility of a pangenome for the identification of PAVs in pepper genomes, which will constitute a useful resource for pepper molecular breeding research.

SVs analysis of hot and blocky sweet peppers using the pan-genome PAVs
Using the pepper pan-genome, we calculated the frequency distribution of PAVs along the chromosomes of hot and sweet pepper accessions using a sliding window approach. Specific chromosomal regions exhibited higher frequencies of PAVs than other regions (Fig. 5a). We selected the top 5% genomic regions with high PAV frequency (covering 230.6 Mb size) that differentiate hot and sweet peppers ( Fig. 5a; Supplementary Table S11). These regions contain many genes of repeat domains (PF00078, PF00665, PF17917, and PF17921) (Fig. 5b). In addition, we compared the whole genome distribution of transposable elements (TEs) with respect to the PAVs distribution (Fig. 5c). Both whole genome and top 5% ranked region comparisons revealed proportionally higher number of PAVs in genomic areas with high TEs (Figure 5c). However, in case of top 5% enriched PAVs genomic regions, higher number of PAVs were largely correlated with relatively higher number of LTRs presence, suggesting a key role for LTRs in larger genomic variations, such as PAVs.
Furthermore, to explore genomic clues for fruit-related traits in pepper, we investigated the regions with high PAV frequency within the GWAS regions for fruit shape [1,43]. All previously detected GWAS regions on chromosome 11 associated with fruit shape traits were included in ±1 Mb region of the top 5% of regions with high PAV frequency, as expected (Supplementary Table S12). We also detected a single PAV located in the GWAS region on chromosome 11 associated with fruit shape. This region harbored a gene encoding Ribosomal protein S4 (RPS4) (DEM11G09440), which was absent in hot pepper accessions (Fig. 6a), suggesting a possible role for RPS4 in pepper domestication. Additionally, we detected PAVs in fruit shape related candidate genes in 5 , coding, and intergenic regions from the high PAV regions (top 5%) on chromosome 2, 6, 7, 10, 11, and 12 (Supplementary Table S13).
Using PAV information from the graph-based pan-genome, we identified PAVs co-segregating with fruit orientation in the 5 UTR of pepper genes encoding tyrosylprotein sulfotransferase (TPST) (400 bp upstream) on chromosome 11 and a putative zinc-finger including ribonuclease H protein (zf-RVT) (1.1 kb upstream) on chromosome 12, both located within the GWAS significant regions (Fig. 6b). Notably, the genomes of the accessions with erect fruits ('Perennial', 'ThaiHot', 'CV3', and 'MR') showed a 572-bp deletion in the promoter region of the zf-RVT-like gene. These results demonstrate the broad utility of the newly assembled 'Dempsey' reference for comparative genomics and genetic dissection of agriculturally important traits.

Discussion
Genome SVs represent an important component of genetic diversity in plants that have gained substantial interest due to their inf luence on phenotypic variation. In the early days of plant genomic analyses, technological limitations and reference genomes and assemblies of lower quality precluded a comprehensive exploration of SVs, especially in plants with large and complex genomes [44]. With recent improvements in DNA sequencing technologies, as well as with the use of advanced bioinformatics tools and pipelines, the identification and analysis of SVs is becoming increasingly more common and within reach, including for plant genomes [45,46]. Nevertheless, it is difficult to accurately detect large SVs and their intricate structures based on short-read sequences, simply because the read length is very often shorter than the SV size. Furthermore, short reads (c) Comparative analysis of PAVs associated with transposable elements (TEs). "Total" represents whole genome PAVs/TEs generated from the pan-genome analysis. "Top 5% ranked" represents top 5% ranked PAV frequency regions between hot and sweet peppers.
pose challenges for the detection and characterization of highly repetitive regions in particular in highly complex genomes such as pepper [47]. Despite numerous improvements in sequencing, whole-genome mapping studies, algorithms, and tools, a highquality and accurately assembled Capsicum genome is lacking, limiting comparative genome-wide SV studies across Capsicum species. To fill this gap, we constructed a high-quality genome assembly of the 'Dempsey' using a combination of PacBio longreads, Hi-C, short reads, optical mapping, and classical genetic maps. The final Dempsey v1.0 genome assembly had a contig N50 value of 18.3 Mb.
Compared to existing pepper genome assemblies, the contig N50 value of the Dempsey v1.0 genome assembly is six times greater, and the scaffold N50 value is at least 80 times higher than earlier pepper genomes [25,28]. In addition, only a small fraction (0.4% or 9.33 Mb) of the Dempsey v1.0 genome remains unanchored, which marks a considerable improvement over previous pepper reference genomes, with unanchored sequences account-ing for ∼17-20% (or 500-800 Mb) of the entire genome [25,28]. The final 'Dempsey' genome assembly (v1.0) comprises 121 scaffolds, and the 12 pseudochromosomes, which represented 99.6% of the entire assembly, are congruent with multiple pepper genetic maps [30,34,35]. To date, pepper genomic studies have had to rely on the reference genomes of hot or F 1 hybrid (hot × sweet) peppers. The new high-quality 'Dempsey' reference genome is the first pepper genome based on a bell-type pepper, which is the main non-pungent pepper type grown worldwide.
In this study, we applied a novel graph-based approach to produce a pepper pan-genome assembly using 12 long-and linked-read-sequenced accessions. Graph-based pan-genomes allow nearly all structural variants to be represented more efficiently, represent a more sustainable approach than linear references, and are useful to find agriculturally important variants at the population level [39,48]. The pepper pan-genome constructed here based on 12 accessions revealed that 48.67% of pan-genes are core genes, while another 31.47% of genes are non-core. Previous studies suggest that the number of genomes included in the pan-genome construction can greatly inf luence the pan-genome metrics and analyses [9,13,49]. Based on a comparative de novo assembly of three divergent rice species, Schatz et al. [50] defined 92% of all genes as core genes and only ∼8% as variable genes. Similarly, a pan-genome for cabbage (Brassica oleracea) constructed based on the genomes of 8 cultivars determined that ∼56% are core gene clusters, while ∼42% are non-core gene clusters.
Pan-genomes aid in the discovery of genes missing from the original reference genomes, highlighting the significance of sequencing more than one genome to capture the broad genetic diversity of a given species [15,17]. To-date several reports in pepper have been published using resequencing with short reads and mapping to reference genome, including reports of Kim et al. [27], Ou et al. [33] and Acquadro et al. [29]. In contrast, we present a comparison of structural variants of 11 de novoassembled genomes using a comprehensive long-read assembly of the cultivar 'Dempsey' as the pivot genome. By doing so we describe not only SNP and indel, but also large SVs including PAVs, CNVs, and INVs not captured in resequencing experiments (Supplementary Table S8). In addition, we identified SVs in the pepper pan-genome in the important genes FA, pvr6, CCS, and Pun1 in agreement with the corresponding phenotypes of the individual accessions. This indicates that PAVs identified in the present study may be a powerful new tool to identify their largely unknown phenotypic manifestations through reverse genetics.
Furthermore, using the graph-based pan-genome defined here, we mapped a PAV within the GWAS region associated with pepper fruit shape [1,43] and fruit orientation. The GWAS region for fruit shape harbors a gene coding for Ribosomal protein S4 (DEM11G09440), which is exclusively present in blocky peppers, implicating this protein in an unknown mechanism yielding bell-or blocky-type fruits. GWAS identified 28 SNPs associated with fruit orientation on chromosomes 7, 8, 10, 11 and 12 of the Dempsey v1.0 genome, accounting for 7-20% PVE. We noticed that the SNPs highlighted by GWAS map close to TPST (and the nearby Cytochrome P750A2 gene [CYP75A2]) on chromosome 11 and to a gene encoding a zf-RVT domain-containing protein and BIG GRAIN-like on chromosome 12, associated with fruit orientation (Fig. 6b). However, the physiological roles of these genes are not well established. Homologs to TPST are involved in fatty acid metabolism in maize and are thought to play a key role in plant growth and development [51]. The zf-RVT domain is normally found in reverse transcriptases [52]. BIG GRAIN-like was previously reported to be involved in auxin transport and to regulate plant growth, development, and grain size in rice [53], making its pepper homolog a likely candidate gene for controlling variation in fruit orientation.
In summary, we developed a high-quality genome assembly for the pepper 'Dempsey' by combining long-read and short-read sequencing technologies. Using the new 'Dempsey' genome as a pivot reference, we constructed a graph-based pan-genome, detected large structural variants across the genome in C. annuum accessions, and identified important genetic variants for the determination of hot or sweet pepper fruits. Overall, the highquality 'Dempsey' reference genome and C. annuum pan-genome provide an important resource for facilitating fundamental research and genetic improvement of the pepper crop.  [54]. Genomic DNA was then extracted from these nuclei using the Qiagen MagAttract HMW DNA Kit (Cat. no. 67563).

Construction of the pivot reference genome for the 'Dempsey' cultivar
High-quality genomic DNA for the 'Dempsey' cultivar was used for sequencing on the PacBio Sequel platform using CLR mode sequencing method. Sequencing was performed at DNAlink company (Seoul, Korea). De novo assembly was conducted using FALCON-Unzip assembler using default options with the exception of a length cut-off of 20.3 kb [55]. The FALCON-Unzip assembler and Arrow consensus algorithm were used to polish and generate primary PacBio de novo-phased diploid contigs. For error correction, primary PacBio contigs consisting of haplotigs were used as a reference to align Illumina NovaSeq data (80× coverage) using BWA version 0.7.10 and GATK version 3.5. For scaffolding the PacBio contigs using Hi-C technology, a Dovetail Hi-C library was prepared according to the manufacturer's instructions using a Dovetail™ Hi-C Kit (Dovetail Genomics, Chicago, IL, USA). The final libraries were sequenced on an Illumina NovaSeq 6000 system as 150-bp paired-end reads. Hybrid scaffolding was performed using HiRise software with Hi-C data. Optical mapping was performed using Irys optical mapping technology (BioNano Genomics, San Diego, CA, USA). DNA was labeled at DLE-1 (CTTAAG) sites using the IrysPrep kit. Labeled DNA samples were loaded onto IrysChips and run on the Irys imaging instrument Saphyr (BioNano Genomics). The IrisView software package was used to produce BioNano genome maps. BioNano genome maps were validated with two independently generated genome maps, BioNano + PacBio and BioNano + PacBio + Hi-C maps; any inconsistency between maps was corrected using genetic maps. To construct pseudomolecules, ALLMAPS software [56] was used to combine the four previously reported pepper genetic maps: a 'Perennial' × 'Dempsey' intraspecific C. annuum bin map [30], a 'NuMex RNaky' × 'BG2814-6' interspecific map (C. annuum × C. frutescens, FA) [34], an 'Early Jalapeno' × 'CM334' intraspecific C. annuum bin map (NM) [34], and a 'Tabasco' × 'P4' interspecific Illumina array map [35]. For the two interspecific maps, the known translocations between chromosomes 1 and 8 were curated manually [28]. Information from these combined genetic maps was used to build and anchor the pseudomolecules of the 'Dempsey' genome assembly. Scaffolds spanning two linkage groups were manually corrected according to the 'Perennial' × 'Dempsey' genetic map. To manually fill gaps in the 'Dempsey' genome, PCR primers were designed to anneal 500 bp upstream and downstream of a given gap sequence. Amplified single-copy PCR fragments were sequenced by Sanger dideoxy sequencing (Macrogen, Seoul, Korea).

Gene prediction and functional annotation
First, to increase the accuracy of pepper genome annotation, pepper transcript dataset from NRGene was aligned to each pepper genome and used this alignment file for a hintsfile in Augustus annotation. Augustus 3.2.3 were used for gene prediction in 12 pepper genomes with the species option as tomato, which was closest Solanaceae crops with Capsicum species. Raw gene annotation results were filtered out by aligning predicted genes against previous published pepper peptide sequence datasets (identity >70%; coverage >70%).

SV variant analysis
MUMMER v4.0.0 was used for genome-to-genome alignment (nucmer -g 1000 -c 90 -l 64), and PAVs were identified using Assemblytics with default options [62,63] using 'Dempsey' as a reference genome. PAVs were also identified using another pipeline, minimap2 (−ax asm5 -eqx) and SyRI v1.5 (−k -F -S) [64,65]. Because of the higher sensitivity of deletions than insertions in genome alignment step7, reciprocal genome alignment was also performed. Subsequently, PAVs were extracted from the Insertion/Deletion calls from Assemblytics and Insertion/Deletion/HDR (Highly divergent region) calls from the SyRI results. Common calls between Assemblytics and SyRI were considered for PAVs. For CNV identification, Assemblytics calls annotated as "Deletion" or "Tandem_contraction" were identified in pairwise and reciprocal genome alignments. SyRI calls annotated as "DUPAL", "HDR", or "CPL" were identified in both pairwise and reciprocal alignment. Then, common variants of Assemblytics and SyRI analyses were considered as CNVs. For small indel extraction, similar approach to the PAV extraction was performed with delmit option set to 5-49 bp. For INVs and SNPs, SyRI calls (INVAL and SNP) were used for downstream analysis.
We used 'Dempsey' as the reference genome and nonredundant PAV variant calling format (VCF) files to construct the pangenome graph using construct module as implemented in the vg toolkit program [38].

Construction of the phylogenetic tree of the C. annuum clade
To infer the phylogenetic relationship of 12 accessions in the pepper pan-genome, a phylogenetic tree was constructed using GBS SNP data from a pepper core collection [23]. First, C. annuum biallelic GBS SNPs were filtered using vcftools (-remove-indels -maf 0.005 -minDP 3 -minGQ 20 -min-alleles 2 -maxalleles 2) and merged with the pan-genome SNP dataset. Then, the resulting merged SNP data were filtered using vcftools (max-missing 0.5) and used to construct a phylogenetic tree. Iqtree v1.6.2 (−bb 1000 -m GTR + ASC) was used to generate a phylogenetic tree using a phylip-formatted SNP dataset [59]. An admixture plot was generated using ADMIXTURE with the k = 3 option [60].

Pan-genome analysis
To generate core and variable gene lists for the C. annuum pangenome, predicted genes were clustered using OrthoFinder v2.27 (−M msa) [61]. According to previous pan-genome analysis in rapeseed (Brassica napus) [13], gene clusters were classified as core (genes present in 12 genomes), non-core (genes present in two to 11 genomes), and specific (genes present in only one genome). Core and variable gene clusters were visualized using custom R scripts and TBTools [62]. Enrichment for GO terms among core and non-core genes was analyzed using AgriGO version 2 62 and visualized using custom R scripts.

Repeat annotation
To annotate repeats in the 'Dempsey' genome, a combination of de novo and homology-based searches were performed using RepeatMasker and RepeatModeler (http://www.repeatmasker. org/). First, custom repeat libraries were generated using RepeatModeler. Then, RepeatMasker was employed to identify repeat sequences in the pepper genome against all repeat libraries from C. annuum custom repeat libraries.

Identification of PAV hotspot regions
Total PAVs generated from pan-genome analyses were used for downstream analysis. PAV density was calculated by a sliding window approach (1-Mb window with 200-kb increments) using custom Python scripts; the top 1% were considered as PAV hotspot regions, with overlapping regions merged manually. These regions were compared and visualized using the MCscanX tool [64] and Artemis Comparison Tool (ACT) [65].
To calculate distinctive regions between hot and sweet peppers, the relative PAV frequency was manually calculated by the gap between the average PAV number in hot peppers ('CC-090', 'CC-260', 'Perennial', 'ThaiHot', 'MR', and 'CV3') and that of sweet peppers ('LaMuyo-01', 'I19-702-1', 'PG1', and 'Maor'). The frequency at each PAV position was visualized using a sliding window approach generated from custom Python scripts (1-Mb window with 200kb increments). The top 5% ranked PAV frequency regions were considered high PAV frequency regions between hot and sweet peppers. Then, a previously reported GWAS region associated with fruit shape [1,43] was compared to the high PAV regions above, and overlapping regions were used to find significantly different regions between hot and sweet peppers.

Chip genotyping
The PepperSNP16K array [35] was used to genotype 288 C. annuum complex accessions from the pepper core collection (Illumina, Inc., San Diego, California, USA). To locate chip marker positions in the 'Dempsey' genome, ∼100 bp of sequence f lanking either side of each chip marker (SNP) was searched in the 'Dempsey' genome by BLAST. Markers that aligned at four or more sites, as well as markers that were not aligned to 'Dempsey', were eliminated. By filtering markers with minor allele frequencies below 0.05 and call rates below 90%, we obtained 13 146 markers.

GWAS
GWAS was conducted using the genotypic data obtained by array and BLINK models, which were provided by the R package "GAPIT3" [66]. The following approach was used to designate the phenotype for fruit position: 1 for erect position, 3 for intermediate phenotype between erect and pendent position, and 6 for pendent position. The average measurements over a 3-year period from 2015 to 2018 were utilized as phenotypic data. The Bonferroni correction method was used to compute the adjusted P value by dividing statistical significance by the number of markers to avoid error accumulation. Markers below the Bonferroni-adjusted P value (<0.05) were considered significant.