Chromosome-scale assembly and population diversity analyses provide insights into the evolution of Sapindus mukorossi

We report the first chromosome-scale genome assembly of Sapindus mukorossi, covering ~391 Mb with a scaffold N50 of 24.66 Mb. Population genetic analyses showed that genetic diversity in the southwest of the distribution area is relatively higher than that in the northeast. Genome-wide selective sweep analysis showed that the selection of a large number of genes is involved in defense responses, growth, and development. Our identified candidate genes are associated with major agronomic traits, and this information can be used for further reference in the selection of superior germplasm resources.


Introduction
Sapindus mukorossi (Fig. 1), also known as soapberry tree, is a deciduous tree of the Sapindaceae family and is widely distributed in Southeast Asia, Japan, and southern and southwestern China [1]. Li Shizhen wrote 'By using S. mukorossi, shampoo can go to head wind, and washing can whiten' in The Compendium of Materia Medica, highlighting that S. mukorossi is a natural excellent detergent with good foaming and decontamination abilities [2]. As recorded in Pu Ji Fang, 'S. mukorossi can also treat toothache, sore throat, acute gastroenteritis and so on', showing that S. mukorossi is a very important Chinese herbal medicine with multiple biomedical functions [3]. The fruit peel of S. mukorossi is rich in saponins (up to 10.76%) with physiological activities, such as antifungal, antidandruff and anti-itching activities [4,5]. The seeds of S. mukorossi are rich in oil (up to 42.70%) with a high proportion of unsaturated fatty acids (up to 86.63%), rendering them ideal materials for developing biodiesel and coping with the energy crisis [6,7]. With its richness in saponins in the peel and oil content in kernels, S. mukorossi has become an important industrial tree species that can provide multifunctional raw material for producing biological chemicals, biodiesel, and biomedicine [8,9]. Additionally, with a beautiful crown shape, strong adaptability, and resistance to sulfur dioxide and industrial dust pollution, S. mukorossi is regarded as a new type of low-carbon, environmentally friendly renewable energy source for ecological restoration of barren mountains and rural greening [10,11].
Based on a large number of studies, research on S. mukorossi mainly focuses on genetic and ecological characteristics, seed germination, landscaping, active ingredient extraction, etc. [12][13][14][15][16]. For example, Liu et al. comprehensively evaluated superior germplasms from natural germplasm resources of three species and one variety of Sapindus in China and Vietnam and divided them into three categories, each with 10 selected superior germplasm accessions for oil and saponins [17]. Phenotypic and molecular markers (inter simple sequence repeat, simple sequence repeat and random amplified polymorphic DNA) were used to analyze the diversity of natural germplasm resources in India and China in different geographical regions. The results showed that the variation between individuals within populations is higher than that between populations [18][19][20][21][22]. Fan et al. investigated 122 accessions from nine provinces by measuring the main economic traits (saponins, oil, age, height, altitude, etc.), which showed that various economic traits of trees of different origins have obvious differences, and the main economic traits of different trees from the same area are also different [11]. However, previous studies have focused on small samples, resulting in only partial representation of S. mukorossi based on fruit and seed traits, and no studies have explored the related omics and molecular breeding of S. mukorossi. The abundance of germplasm resources is a prerequisite for the selection of improved varieties, and the genetic diversity of germplasm resources is the material basis for biological evolution and breeding [23]. Population resequencing has become an important strategy for further crop domestication, improvement, and breeding, as recently reported in jujube [24], soybean [25], chickpea [26], and peach [27]. Population resequencing studies have been performed to examine the genetic diversity and genetic basis of evolution and adaptation in species such as Salix brachista [28] and tea [29]. In theory, different species have experienced reproductive or geographical isolation. Due to the long evolutionary history of S. mukorossi, after long-term natural selection various morphological variations and biological genetic diversity in geographical areas are objectively present in nature. Moreover, different S. mukorossi varieties are easily cultivated by hybridization, complicating accurate tracing of the origin of the offspring. Although previous studies have attempted to explore this issue, the lack of genome data and annotated gene sets for S. mukorossi has limited functional studies and evolutionary analyses.
To gain a better understanding of the origin and evolution of S. mukorossi, we used PacBio and Hi-C technologies to assemble a genome, providing a good resource for investigating genetic diversity and evolution. We also collected 104 S. mukorossi resources for resequencing from nine provinces of China covering nearly all S. mukorossi planting areas at the wholegenome level, including 47 Table S1). Using these sequencing data, we demonstrated the evolutionary history of S. mukorossi and revealed the phylogenetic relationship between these populations by analyzing genomic characteristics, population structures, and genetic diversity. To identify potential candidate genes associated with fruit weight, peel-to-fruit ratio, saponin contents, seedto-fruit ratio, kernel-to-fruit ratio, and oil contents, we also performed genome-wide association studies (GWAS) using resequencing data from 57 accessions (Supplementary Table S2). Our results provide a new resource for further molecular breeding and studies of S. mukorossi biology. However, genome-wide association analyses of economic traits (saponins, oil, etc.) remain to be conducted in the future.

Sequencing and assembly of the S. mukorossi genome
Approximately 21 Gb (∼47×) of Illumina paired-end sequences were generated and used to estimate genome size and heterozygosity using Jellyfish software, which indicated that the size of the S. mukorossi genome was 432.  Fig. S1). The assembled genome size was smaller than those of other closely related species in Sapindales, such as Dimocarpus longan (∼471 Mb) and Acer yangbiense (∼666 Mb) [30,31]. Regarding the chromosome number, S. mukorossi is a diploid species, which has a karyotype of 2n = 2x = 28 homologous pairs of chromosomes based on karyotyping analysis ( Supplementary Fig. S3). To obtain a chromosome-scale assembly of S. mukorossi, a total of 40.96 Gb of clean sequences were generated by Hi-C sequencing. Using the contact matrix and the agglomerative hierarchical clustering method in LACHESIS, 90 Tables S6 and S7).
To estimate the nature of whole-genome duplication (WGD) events in S. mukorossi, the Ks distribution of homologous genes from S. mukorossi, D. longan, and A. yangbiense was characterized. Two Ks values peaked at 0.18 and 0.36 for orthologs between S. mukorossi and D. longan and between S. mukorossi and A. yangbiense, suggesting speciation events at ∼15 and ∼31 Mya, which is consistent with the findings of the phylogenetic tree. The major peak of duplication in S. mukorossi was 1.90, which indicates that an ancient WGD event had occurred at ∼161 Mya but did not reveal a recent WGD event, which is consistent with the Ks peaks reported for D. longan and A. yangbiense (Fig. 2c). These results indicated that the ancient WGD event of S. mukorossi occurred before the divergence of S. mukorossi and D. longan. Although synteny block analysis revealed strong collinearity in S. mukorossi, D. longan, and A. yangbiense, large interchromosomal rearrangements were detected for the homologous chromosome pairs AyChr03/SmChr04, SmChr04/DlChr02, AyChr08/SmChr08, AyChr01/SmChr01, SmChr02/DlChr04, SmChr03/DlChr05, SmChr07/DlChr08, and SmChr13/ DlChr14 (Fig. 2d). In addition, partial chromosomal fusions were also observed in the homologous chromosome pairs among these species (Fig. 2d). These results suggested that major chromosomal fusions and rearrangements may have led to the divergence of the three species in Sapindaceae, which might be one of the reasons for the difference in chromosome formation.

Population genetic analysis
A total of 104 S. mukorossi accessions were collected from nine provinces of China covering nearly all S. mukorossi planting areas (Supplementary Table S1). Resequencing of the 104 S. mukorossi accessions by Illumina HiSeq PE150 generated a total of 545.83 Gb of clean data with an average depth of 12.27× per or 4.91 Gb of clean data per sample (Supplementary Table S3). After mapping against the assembled reference genome, we identified a total of 19 443 452 high-quality single-nucleotide polymorphisms (SNPs) and 7 960 863 InDels after filtering. Phylogenetic tree and principal component analysis (PCA) revealed that 104 samples were clustered into three independent groups corresponding to the Group I, Group II, and Group III populations ( Fig. 3a and b). Group I mainly consisted of accessions from the northeast distribution area, with enrichment of accessions from FJ and ZJ. Group II mainly consisted of accessions from the southern distribution area, with enrichment of accessions from GX and JX. The accessions of the southwestern distribution area (YN) were placed in an independent branch as Group III, showing stronger geographical distribution patterns than the other two groups. To further analyze the genetic relationship between these accessions, we performed a structure analysis using ADMIXTURE (K, 2-10) (Fig. 3c). At a K value of 2 or 3, Group III was one separate cluster, demonstrating that Group III has had a linear evolutionary route due to geographical isolation. Additionally, the genes of Group III gradually penetrated into Group I and Group II accessions. At a K value of 3, the accessions were clearly divided into three separate clusters: Group I, Group II, and Group III, which is consistent with the above PCA and phylogenetic tree. Group I and Group II showed a relatively high level of admixture, which may be attributable to a shared ancestor or recent introgression events. To further understand the evolution of the S. mukorossi population from the perspective of genomics, we analyzed the decay rate of linkage disequilibrium (LD) of the three subpopulations (indicated by r 2 ). Based on the one-half r 2 maximum of pairwise SNPs among these three groups, the LD decay rates in Group I, Group II, and Group III were 459 bp (r 2 = .075), 643 bp (r 2 = .0703), and 403 bp (r 2 = .1446), respectively (Fig. 3d). The rapid decline in LD in each population may be due to the high density of SNPs and the high degree of recombination in the S. mukorossi genome. Similarly, a study of the quail population structure also revealed a genomic background of low-level LD bird populations, which promote adaptive variation. Genetic diversity (π ) increased from 5.42 × 10 −3 in Group I to 5.59 × 10 −3 in Group II and 6.71 × 10 −3 in Group III. We also identified the direction of gene f low from Group III (YN) to Group II (HB) to Group I (AH, FJ, JX, GD), indicating that the species in the southwest may be the donor population for S. mukorossi planting areas in China ( Supplementary Fig. S6). By calculating the fixation index value (F ST ) between these three groups, we found that the F ST values were ∼0.42 between Group I and Group III populations and ∼0.41 between Group II and Group III populations, and the F ST between the Group I and Group II populations was only 0.027. These results further indicated that Group I and Group II populations have relatively low genetic differentiation. Moreover, Group I (π = 5.42 × 10 −3 ) and Group II (π = 5.59 × 10 −3 ) harbored nearly the same level of genetic diversity with little genetic differentiation (F ST = 0.027). Negative mean Tajima's D values between Group I (−0.24) and Group II (−0.41) showed that the S. mukorossi population has recently accumulated more low-frequency gene mutations, and regional amplification may have occurred in the population history ( Supplementary Fig.  7). Overall, Group I and Group II populations, which existed in close geographical areas, formed one robust clade or a single cluster in structure analyses, with K values ranging from 2 to 3, confirming their close genetic relationship.

Demographic history
To explore the demographic history of S. mukorossi, we estimated N e using the pairwise sequentially Markovian coalescent (PSMC) method and assumed a mutation rate of 1.25 × 10 −8 per generation for a generation time of 5 years. The N e values of the three groups ref lected one dramatic decline period and a recovery period (Fig. 3e). Group I, Group II, and Group III reached the maximum N e at ∼10 Mya (million years ago) and then began to decline dramatically. One sharp reduction may have resulted from climatic f luctuations corresponding to the start of the Quaternary Ice Age (∼2 Mya) [24]. The recovery period occurred ∼180 Kya (thousand years ago) during the interglacial period of Marine Isotope Stage 5 (∼130 Kya), which may have resulted from a warming climate [39]. According to the results of demographic trajectories, we found that the N e values of Group I and Group II were basically identical and higher than that of Group III. The results may be due to relatively low genetic diversity and geographical distributions during the evolution of S. mukorossi.

Discussion
To obtain a high-confidence SNP set from resequencing data, a reference genome resource is very important. Several genome assemblies in Sapindaceae have been reported, including A. yangbiense (∼665 Mb) and D. longan Lour. (∼471 Mb) [30,31]. However, no report has involved the genome assembly and annotation of S. mukorossi. Diverse chromosome numbers (the chromosome numbers are 11, 13, 14, 15, etc.) and high heterozygosity (1.25%) in Sapindaceae have hindered genome assembly. S. mukorossi is an autodiploid with 2n = 2x = 28 chromosomes based on the results obtained for the molecular karyotype and K-mer analysis, differing significantly from the chromosome numbers of previously reported Sapindaceae species (A. yangbiense, 2n = 2x = 26; D. longan, 2n = 2x = 30) [30,31]. Using the latest sequencing and assembly technology, we obtained the genome of S. mukorossi at the chromosome level, revealing a genome size of ∼391 Mb with a contig N50 length of 24.66 Mb. By thoroughly mapping resequencing data, 104 accessions generated ∼545.83 Gb of clean data and identified 4 351 219 high-quality SNPs and 7 960 863 InDel datasets, yielding the richest genomic resource for further molecular breeding and biological studies of S. mukorossi to date.
The resequencing data of the 104 accessions collected from nine provinces of China covered nearly all S. mukorossi planting areas. The nucleotide diversity of S. mukorossi (7.53 × 10 −3 ) was significantly higher than that of other woody plants, such as peach (1.10 × 10 −3 ), tea (6.12 × 10 −3 ), apple (2.20 × 10 −3 ), jujubes (1.73 × 10 −3 ), pear (5.50 × 10 −3 ), and Prunus mume (2.42 × 10 −3 ), indicating that the evolutionary process of S. mukorossi is characterized by relatively high genetic diversity, which is conducive to adaption to various adverse environments. We further calculated the F ST between these three groups and found that the FST of the Group I and Group II accessions, at 0.002, was lower than that of the Group III accessions. These results also suggested that the close relation between accessions from the northeast and south may be attributable to a shared ancestor or recent introgression events due to close geographical areas.
KEGG and GO analyses showed that some of the specific or expanded genes in S. mukorossi were significantly enriched in disease resistance, hormone signaling pathways, and lipid metabolism, including SmRPM1, SmRPS2, acetyl-CoA acyltransferase 1, lipoxygenase and linoleate 9S-lipoxygenase, SmARR-B, and SmB3. Linoleate 9S-lipoxygenase and lipoxygenase, as key enzymes in linoleic acid and α-linolenic acid metabolism, could enhance key oxylipin synthesis pathways in tillers when upregulated [51]. Expansion of these genes is speculated to affect the accumulation of lipid metabolites. Kernel oil content plays an important role in S. mukorossi breeding. Moreover, Group I accessions were also characterized by the selection of a large number of RPSs compared with Group III accessions, including SmRPS2, SmRPS4, SmRPS7, SmRPS10, SmRPS12, and SmRPS23. The resistance genes RPS3/5 in Arabidopsis mediate resistance to P. syringae and DNA damage, and P. syringae has been reported to infect many economically important crops [52]. It is speculated that the strong selectivity of these genes enhances environmental adaptability, resulting in better cold tolerance in the northeast than in the southwest. Genome-wide selective sweep analysis revealed that most candidate genes were highly involved in lipid metabolism and terpenoid and polyketide metabolism, including SmDAD1, SmACOX4, SmOPR3, SmACOT1, SmFAB2, SmgalA, SmPPT, SmSGPL1, SmACAT, SmSQLE, SmKAO, SmGA3ox, SmGA2ox, SmFOLK, SmHMGCR, SmGGPS, SmF3H, SmCYP17A, SmCYP81F, and SmCYP1A1. SmGA3ox and SmGA2ox are diterpenoid plant hormones with a key role in regulating various processes throughout the life cycles of plants [53]. SmGGPS encodes an intermediate in terpenoid biosynthesis that is synthesized by geranylgeranyl diphosphate synthetases [54]. Many studies have focused on the identification of CYP450s for terpenoid biosynthesis, such as CYP716A47, which is related to ginsenoside biosynthesis [55]; CYP76AH1, which catalyzes the turnover of miltiradiene in tanshinone biosynthesis [56]; and CYP71AV14 and CYP71AX30, which are related to sesquiterpene lactone biosynthesis [57]. We speculate that differential selection in these regions may result in richer saponin and oil contents in northeastern accessions than in southwestern accessions, which is consistent with the phenotypic data.
Fruit weight, the peel-to-fruit ratio, saponin contents, the seed-to-fruit ratio, the kernel-to-fruit ratio, and oil contents are important agronomic traits for selecting excellent germplasm resources. To identify potential candidate genes associated with these traits, we performed GWAS using resequencing data from 57 accessions. We found that four peel-to-fruit ratio-and seedto-fruit ratio-related LRR-RK genes have been reported to regulate cell proliferation, stem cell maintenance, hormone perception and other developmental processes in Arabidopsis [58]. Similarly, we detected one SNP located upstream of cellulose synthases (SmCSLD1, whz_008313) that is responsible for the kernel-to-fruit ratio. SmCSLD1 is a component of the cell wall biosynthetic machinery and participates in the production of energy [59].
Saponins and oils are also important active compounds in S. mukorossi, and different saponin and oil contents may contribute to variation in economic value with respect to breeding. Interestingly, we found that two saponin-related genes were located in strong association peaks, including SmTMT (whz_003247), which is involved in ubiquinone and other terpenoid-quinone biosynthesis [60], and SmbHLH1 transcription factor (whz_003247), which regulates triterpene saponin biosynthesis in Medicago truncatula [61]. Our analysis identified candidate genes associated with major agronomic traits, and this information can be used for further reference in the selection of superior germplasm resources. Chromosomelevel reference genome assembly and population resequencing data have also substantially expanded basic and applied research for woody trees of Sapindaceae.

Molecular karyotype analysis and genome size estimation of S. mukorossi
S. mukorossi samples were obtained from Jianou, Nanping, Fujian, China (118.09 20 E, 27.03 01 N). Fluorescence in situ hybridization and karyotype analysis were carried out according to a previously reported method [62]. The size of the S. mukorossi genome was estimated by an optimized DNA f low cytometry method according to Huang et al. [63]. The genome size (1C) of S. mukorossi (∼380 Mb) was estimated to be ∼0.43-fold that of tomato (∼880 Mb) as an internal reference. Jellyfish software was used to estimate genome size and heterozygosity by 21-K-mers with ∼21.0 Gb (47×) of clean short reads on the Illumina HiSeq X Ten platform [64].

Sequencing and assembly of the S. mukorossi genome
Genomic DNA was isolated from young leaves using the CTAB method according to a previously reported protocol [65]. One SMRTbell library was constructed and sequenced on a PacBio Sequel II for HiFi CCS. A total of 409.15 Gb of CCS subreads were generated with a read N50 length of 13.45 kb and an average length of 13.16 kb. After trimming and quality control of all reads, ∼22.96 Gb of HiFi CCS data were used for genome assembly with hifiasm software (https://github.com/ chhylp123/hifiasm), followed by removal of heterozygous sequences with Khaper (https://github.com/lardo/kha per) based on the 21-K-mers result. Fresh leaves of S. mukorossi were harvested and placed in formaldehyde for 20 min for cross-linking, followed by endonuclease digestion (Dpn II), end repair, cyclization, and purification. The captured DNA fragments were used to construct a Hi-C sequencing library. The scaffolds/contigs of the assembled genome sequence were divided, sorted, and oriented into chromosomes using LACHESIS with parameters: CLUSTER_NONINFORMATIVE_RATIO = 1.4, CLUSTER_MAX_LINK_DENSITY = 2.5, CLUSTER_MIN_RE_ SITES = 100, ORDER_MIN_N_RES_IN_TRUNK = 60, and ORDER_MIN_N_RES_IN_SHREDS = 60 [66]. Placement and orientation errors exhibiting obvious discrete chromatin interaction patterns were manually adjusted.

Genome quality assessment
Genome completeness was assessed using the BUSCO databases with the Embryophyta model. High-quality ESTs were mapped to the S. mukorossi genome using BLAST (version 0.35) [67]. After trimming and quality control, second-generation data from the Illumina platform were mapped to the genome assembly using Bowtie2 to assess coverage [68].

Genome annotation
Repeat sequences were identified by combining de novo annotation and homology-based methods. For de novo annotation, RepeatModeler (version 1.0.4) was used to search for repetitive sequences in the S. mukorossi genome, and then the results were used to build a repetitive sequence library [69]. We further used Repeat-Masker (version 2.1) to identify repetitive sequences in a previously built repetitive sequence library [70]. For homology-based prediction, a homology-based detection procedure was performed using a conserved BLASTN search in Repbase of RepeatMasker [71]. The repetitive sequence results by de novo annotation and homologybased annotation were combined by removing redundant transposable elements. LTR_FINDER (version 1.05) was used to search for intact LTR retrotransposons in the S. mukorossi genome [72]. rRNAs and tRNAs were predicted using RNAmmer (version 1.2) and tRNAsan-SE (version 1.23), respectively [73]. In addition, miRNAs and snRNAs were identified using INFERNAL (version 1.1.1) to search the Pfam database [74].
Protein-coding genes were annotated by the MAKER pipeline (version 2.31.9) based on an integrated strategy, including homology-based prediction, RNA-seq data and ab initio annotation [75]. For de novo gene prediction, we first used BRAKER software to automatically generate full gene structure models with genomic and RNA-seq data [76]. Augustus (version 3.0.3) was applied with the default parameters, and the protein sequences of the training sets used were from the related species A. yangbiense and D. longan [77]. All of the predicted genes were functionally annotated according to homologous alignments with BLASTP (e-value ≤1e−5, max_target_seqs 1) against the NR, Swiss-Prot, GO, KOG, and KEGG databases [78][79][80][81][82]. TFs were identified by performing BLASTP (evalue ≤1e−5) searches against the plant transcription factor database (PlantTFDB version 5.0).

Construction of a phylogenetic tree and estimation of gain and loss of gene families
To identify the expanded and contracted S. mukorossi gene families, a phylogenetic tree was constructed using orthologous genes for Punica granatum, R. argentea, S. oleosum, Corymbia citriodora, E. grandis, R. rubrinervis, C. sativa, R. chinensis, F. vesca, M. baccata, C. clementina, C. unshiu, C. sinensis, D. longan, S. mukorossi, A. yangbiense, Manihot esculenta, Salix dunnii, S. brachista, P. alba, and P. euphratica. A total of 718 single-copy gene families were shared by the 21 analyzed genomes and used to construct a phylogenetic tree using FastTree (version 2.1.9) and RAxML (version 8.2.12) based on the maximum likelihood method [83]. MCMCtree in PAML (version 4.9) and the TimeTree website (http://www.timetree.org/) were used to estimate the time of species divergence with estimated confidence intervals at the 95% level [84]. CAFE software (version 1.6) was used to calculate the expansion and contraction of gene families [85]. Significantly expanded or contracted genes (P ≤ .01) were annotated and enriched by GO and KEGG analyses, respectively.

Synteny and whole-genome doubling analysis
We analyzed WGD events and determined the source of the homologous genes in S. mukorossi. The protein sequences from S. mukorossi and other species were searched to identify the best matching pairs using BLASTP (e-value ≤1e−5). Syntenic block analysis was performed using MCScanX with parameters -s 5 -m 5, and Ks values (synonymous substitution rate) for syntenic gene pairs were calculated using KaKs_Calculator software [86,87].

Positively selected genes in S. mukorossi
Orthologous gene clusters in S. mukorossi and closely related plants were identified using OrthoFinder (version 2.3.12) [88]. The single-copy genes of S. mukorossi and closely related species were aligned using MUSCLE [89]. Positive selection sites were detected using Codeml software of the PAML package. The Ka/Ks ratios of the singlecopy genes were evaluated with a threshold of Ka/Ks ≥ 1 and P value ≤.05.

DNA preparation for population sequencing, variant calling and annotation
A total of 104 S. mukorossi strains collected from various parts of the world were sequenced with the Illumina HiSeq X Ten platform with an insert size of ∼400 bp. DNA of each sample was isolated using the CTAB method according to a previously reported protocol [65]. The complete raw data of the 104 samples were processed for quality control using Trimmomatic with the default parameters [90], and the clean reads were then aligned to S. mukorossi using Bwa [91]. The alignments were sorted using SAMtools (version 1.9) [92], and PCRgenerated duplicates were removed using Picard (version 2.17). Variant detection was performed for each sample using the GATK (version 3.8.0) HaplotypeCaller tool and subsequently joint-called across all samples using the GATK GenotypeGVCFs tool [92]. To obtain highquality SNPs, a GATK hard filter was used to filter and merge VCF files with the parameters QD < 2.0 || FS ≤ 60 || MQRankSum< −12.5 || ReadPosRankSum< −8.0 || SOR > 3.0 || MQ < 40.0. The marker set was annotated using SnpEff (version 4.3) with our gene annotation for S. mukorossi [93].

Population genetics analysis
A subset of SNPs from 104 resequencing samples were filtered using PLINK (version 1.9) with the default parameters -geno 0.1, -maf 0.01 [94]. The final SNP set was used to construct a phylogenetic tree using PHYLIP (version 3.2) software based on the neighbor-joining method and implemented within FigTree (version 1.4.2) (http:// tree.bio.ed.ac.uk/software/figtree/). PCA of the final SNP set was performed with PLINK (version 1.9), and patterns of genetic variation were visualized using the R package (version 3.4) [95]. To further illustrate the evolutionary history of the S. mukorossi genome, a model-based clustering algorithm implemented in ADMIXTURE was used to estimate the relative genome composition for each sample. Population structure analysis was carried out using ADMIXTURE (version 1.3.0), which was run for K values from 1 to 10 with 20 bootstrapping replicates [96]. To perform TreeMix analysis, the per-site allele count of each group was calculated using PLINK with thewithin command, and the results were then converted into TreeMix input format using the plink2treemix.py script. We then ran TreeMix to construct a maximumlikelihood population tree [97]. LD between pairs of SNPs at up to 100-kb intervals was calculated using PLINK (version 1.9), and PopLDdecay (version 3.26) was used to plot the LD decay graphs by setting the parameter MaxDist to 1000 [98]. To identify candidate regions potentially affected by selection, nucleotide diversity (π ), and genetic differentiation (F ST ) were calculated using VCFtools (version 0.1.13) in a non-overlapping 100-kb sliding window with a step size of 10 kb [99]. Tajima's D was calculated using VariScan (version 2.0.3) in a non-overlapping 100kb window with a step size of 10 kb [100].

Genome-wide selective sweep analysis
Whole-genome detection of selective sweeps was conducted using F ST between Group I (47) and Group III (18) populations. To detect the regions under selective sweeps, we used the XP-CLR score to confirm the selective sweeps, with the highest being 5% via the XP-CLR method. The CLR test was implemented in SweepFinder2 for multiple markers across a contiguous grid space [101]. GO term enrichment analysis was performed for candidate genes located within these sweep regions.

Genome-wide association analysis
We used 57 S. mukorossi accessions to perform GWAS on fruit weight, the peel-to-fruit ratio, saponin contents, the seed-to-fruit ratio, the kernel-to-fruit ratio, and oil contents. A subset of SNPs from 57 resequencing samples were filtered using PLINK (version 1.9) with the default parameters -geno 0.1, -maf 0.05 [94]. The final SNP set was used for GWAS by EMMAX and a mixed linear model, and the calculation formula was as follows: where Y is the observed value vector of the population quantitative trait phenotype; μ is the overall mean; b is the fixed effect vector; γ is the SNP effect vector; u is the polygenic effect vector; λ is the primary weight vector; and X and Z are the correlation matrices corresponding to the fixed effect vectors b and γ , respectively. M is the correlation matrix corresponding to the random effect vector u; P is the correlation matrix corresponding to the primary weight vector λ; and e is the random residual error.
To minimize false-positives, a kinship matrix was estimated with the EMMAX-kin-intel64 program. The Bonferroni threshold for genome-wide significant truncation was set at 0.05/total SNPs. P values were used to screen out potential candidate SNPs. Manhattan diagrams and QQ plots were used to represent and evaluate the effect of association analysis, respectively.