Genomic basis of selective breeding from the closest wild relative of large-fruited tomato

Abstract The long and intricate domestication history of the tomato (Solanum lycopersicum) includes selection sweeps that have not been fully explored, and these sweeps show significant evolutionary trajectories of domestication traits. Using three distinct selection strategies, we represented comprehensive selected sweeps from 53 Solanum pimpinellifolium (PIM) and 166 S. lycopersicum (BIG) accessions, which are defined as pseudo-domestication in this study. We identified 390 potential selection sweeps, some of which had a significant impact on fruit-related traits and were crucial to the pseudo-domestication process. During tomato pseudo-domestication, we discovered a minor–effect allele of the SlLEA gene related to fruit weight (FW), as well as the major haplotypes of fw2.2/cell number regulator (CNR), fw3.2/SlKLUH, and fw11.3/cell size regulator (CSR) in cultivars. Furthermore, 18 loci were found to be significantly associated with FW and six fruit-related agronomic traits in genome-wide association studies. By examining population differentiation, we identified the causative variation underlying the divergence of fruit flavonoids across the large-fruited tomatoes and validated BRI1-EMS-SUPPRESSOR 1.2 (SlBES1.2), a gene that may affect flavonoid content by modulating the MYB12 expression profile. Our results provide new research routes for the genetic basis of fruit traits and excellent genomic resources for tomato genomics-assisted breeding.


Introduction
Artificial selection of vegetable crops has changed human dietary habits during the last 10,000 years. The impacts of the long and intricate history of pseudo-domestication on the tomato (Solanum lycopersicum) could be manifested in the tomato genome sequence [1]. The history of tomato pseudo-domestication has been described as a "two-step" process, beginning with domestication from the blueberry-sized S. lycopersicum var. pimpinellifolium (PIM) to the cherry-sized S. lycopersicum var. cerasiforme (CER), and improvement from CER to the large-fruited S. lycopersicum var. lycopersicum (BIG) [2,3]. However, CER, a weedy or wild species from central South America, may have existed before tomato domestication because of its complicated genetic mixing of both PIM and BIG [4,5].
Larger fruits, improved taste, and an overall more robust plant were selected throughout tomato evolution because they are valuable to humans and essential for plant survival [6]. One of the most crucial agronomic traits, fruit weight (FW), has been selected by human for hundreds of years. Several genes/quantitative trait loci (QTLs) controlling fruit size were selected and identified during the consecutive domestication and breeding of tomato [7][8][9][10][11], including fw2.2/cell number regulator (CNR), fw3.2/SlKLUH, fw11.3/cell size regulator (CSR), lc/WUSCHEL (SlWUS), and fas/CLAVATA3 (SlCLV3). By changing the cell division rate, cell number, and meristem size during fruit development, these genes increase the cell layer and carpel/locule number (LN). The first cloned fruit mass gene, fw2.2/CNR, negatively regulates cell division and accounts for 30% of fruit size variation contributing to increased FW [7]. The second fruit mass gene, fw3.2/SlKLUH, primarily affects the pericarp and septum in large-fruited tomatoes by increasing the cell number [8]. The third fruit mass QTL, fw11.3/CSR, explains about 8% of the phenotypic variation [9]. Brief ly, cell division and expansion are main drivers of organ growth in plants. In addition, SlWUS and SlCLV3 participate in the WUS-CLV pathway, which controls fruit size in tomato by regulating the LN [11,12].
In addition, selection of many other important morphological traits, including some specific metabolites [13], inf lorescence architecture, LN, and fruit shape, has been accompanied by dramatic, relatively rapid changes in fruit size. The artificially selected fw11.3 gene hitchhiked during tomato domestication, altering the content of eight metabolites and thus affecting fruit quality [13]. Several genes play a crucial role in increasing tomato yield, among which SINGLE FLOWER TRUSS (SFT) is the antagonist of SELF PRUNING (SP), which could increase tomato yield [14]. In order to increase fruit size, the AP2/ERF transcription factor EXCESSIVE NUMBER OF FLORAL ORGANS (ENO) shows synergistic effects with SlWUS and SlCLV3 [15]. Furthermore, it was discovered that the GLOBE gene, which encodes brassinosteroid hydroxylase, determines the shape and size of the tomato fruit [16].
The highly diversified crop metabolome is often regarded as a link between the genome and the phenome. Breeders use a new strategy called the integrative analysis of multi-omics data to investigate the genetic basis of crop metabolic and agronomic traits. Several loci for tomato fruit quality traits have been identified, including sucrose, ascorbic acid, malic acid, citric acid, and volatiles [17][18][19]. Metabolomics data were used to predict the performance of several agronomic traits in wheat (Triticum aestivum), including the grain number per spike and plant height [20]. In addition, 36 candidate genes were found to regulate the levels of metabolites that are of potential physiological and nutritional importance in rice (Oryza sativa) [21]. Meanwhile, 1035 metabolomes were utilized to analyze the natural variation and genetic control of maize drought adaptation in maize (Zea mays) [22].
In this study, we performed EigenGWAS, nucleotide diversity analysis, and the cross-population composite likelihood ratio test (XP-CLR) strategies on the PIM and BIG groups to identify selective sweeps during tomato pseudo-domestication. In total, we identified 390 putative selective sweeps, some of which were critical to the pseudo-domestication process and affected fruit-related traits. Also, we detected six plausible candidate loci that may have an impact on fruit mass during tomato pseudo-domestication. In addition, GWASs identified BRI1-EMS-SUPPRESSOR 1.2 (SlBES1.2), a transcription factor of BES1 family that may affect f lavonoid content by regulating SlMYB12 expression, as well as the causative variation that may cause the divergence of fruit f lavonoids among the large-fruited tomatoes. Our findings offer valuable data for illustrating the genetic architecture of tomato fruit mass and metabolite content, which will boost tomato breeding efforts in future.

Pseudo-domestication from the PIM to BIG group
In this study, we explored the phylogenetic relationships among 219 diverse accessions, including 53 wild tomatoes (Solanum pimpinellifolium; PIM) and 166 large-fruited cultivars (S. lycopersicum var. lycopersicum; BIG) using 46,850 single nucleotide polymorphisms (SNPs) (minor allele frequency [MAF] >5%, missing data <10%, and r 2 threshold <0.2). The neighborjoining tree largely approves the division of PIM and BIG groups (Supplemental Fig. 1). To identify comprehensive selective sweeps of tomato in the process of pseudo-domestication, we performed three genomic analyses: EigenGWAS, nucleotide diversity (π ), and the XP-CLR test on these accessions. We identified a total of 390 putative pseudo-domestication sweeps covering 329.32 Mb Using the three strategies, a total of 15,942 genes were identified in these selective sweeps (Supplemental Tables 5-7). Among these genes, 1,330 pseudo-domestication genes were identified by three strategies simultaneously. In addition, we found that 3,721 and 2,315 PDS E genes overlapped with PDS π and PDS X genes, respectively, and 1,992 PDS π genes overlapped with PDS X genes (Fig. 1E). Furthermore, we found that 36 genes or loci were located in these potential pseudo-domestication sweeps, including several related to fruit mass and LN (Supplemental Table 8). Among these 36 loci, two major genes related to fruit mass, fw2.2 and fw11.3, were consistently located within pseudo-domestication sweeps using these three strategies, whereas fw3.2 was identified by only two of the three (π and XP-CLR) ( Fig. 1A-C). Our results indicate that these three strategies can be combined for the study of tomato pseudo-domestication.
To assess the variation of the transcriptome level during tomato pseudo-domestication, we estimated gene expression distribution and transcript abundance for the PIM and BIG groups from a previous report [13]. A total of 4,724 differentially expressed genes (DEGs) were detected between the PIM and BIG groups (Supplemental Fig. 2), of which 1,445 were selected by humans during tomato pseudo-domestication, including 951 down-regulated and 494 up-regulated genes (Fig. 1F). GO analysis showed that these DEGs were involved (P < 0.01) in the following biological processes: oxidation-reduction, glutamate metabolic, glutamine family amino acid metabolic, molecular function regulator, and enzyme inhibitor/regulator activities (Supplemental Fig. 3A and Supplemental Table 9). Furthermore, Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis revealed that these genes were enriched in pathways including ribof lavin and tyrosine metabolism, phenylpropanoid biosynthesis, phagosome, and limonene and pinene degradation (Supplemental Fig. 3B and Supplemental Table 10). Furthermore, among these DEGs, we also discovered the Solyc03g097580 and Solyc03g097870 for glucose content [19], Solyc04g015530 (ps-2) for functional sterility [23], Solyc06g074240 (B/OG) for β-carotene [24], and Solyc12g008980 (Del) for carotenoid biosynthesis [25], indicating the importance of these DEGs in tomato pseudo-domestication.

Genomic selection for FW during tomato pseudo-domestication
After spreading from the Andes Mountains in South America to the rest of the world, the fruit mass and quality of tomato fruit have improved significantly from PIM to BIG lines. Some key genes for these traits have been identified, including fw2.2, fw3.2, fw11.3, fas, sun, and lc. However, the genomic selective characteristics related to fruit mass during tomato pseudo-domestication have not been thoroughly explored. To identify potential selection signals, we analyzed pseudo-domestication sweeps related to fruit mass together with GWAS results ( Fig. 2A-C). A total of 40 significant outlier regions were identified during tomato pseudo-domestication, accounting for 13.27 Mb of the tomato reference genome ( Fig. 2A). Further analysis found that 976 candidate genes were located within these outlier regions, and 171 genes were located in swept regions identified by the three strategies ( Fig. 2A and D). Through GO analysis of these candidate genes, their function showed in structural molecule, nutrient reservoir, and N-acetyltransferase activity (Fig. 2E). Intriguingly, three known genes, fw2.2, fw3.2, and fw11.3, were found around the peak SNPs on chromosomes 2, 3, and 11, respectively, and were located within pseudo-domestication sweeps (Fig. 2C).
To identify the regions related to fruit mass, we constructed an F 2 segregating population using a cross between the PIM (TS-19) and the BIG (TS-400) tomato accession. Using bulked segregant analysis of the F 2 population, we found one significant interval with FW at the distal end of chromosome 9 (Fig. 2F). In addition, we identified one significant outlier region (P SNP:chr9:63405303 = 3.14 × 10 −8 and P window:chr9:63370000_63,470 000 = 1.973; around 63.32-63.51 Mb) on the pseudo-domestication sweep (PDS E263 = 8.29, PDS X255 = 38.80) of chromosome 9 (Fig. 2G). Through functional analysis, a candidate gene (Solyc09g082110) encoding a seed maturation protein/late embryogenesis abundant (SlLEA) protein was located within this interval (Fig. 2G). The haplotype analysis showed that Solyc09g082110 had one major haplotype (GG) in the PIM group, whereas there were two haplotypes (AA and GG) in the BIG group (Fig. 2H). Furthermore, in the BIG group, we found that compared with haplotype GG, haplotype AA of Solyc09g082110 significantly increased the FW (Fig. 2I). The experimental analysis showed that the expression levels of Solyc09g082110 of the BIG accessions carrying haplotype AA were higher than those harboring haplotype GG during the fruit expansion stages ( Fig. 2J and Supplemental Fig. 4A). Meanwhile, we found that haplotype AA could produce more locules (Supplemental Fig. 4B and C). These results suppose that Solyc09g082110 could be a candidate gene for determining fruit mass by altering the LN in the BIG group, and the haplotype AA/GG might affect the transcriptional level of Solyc09g082110 at the fruit expansion stage. However, the mechanism and causal variation of Solyc09g082110 need to be further validated functionally.

Selection of agronomic traits related to FW
In the process of tomato pseudo-domestication, humans tend to select those larger and tastier fruits, which are accompanied by change of inf lorescence architecture. To reveal the selection of agronomic traits related to FW, we exploited 28 agronomic traits for 219 diverse tomato accessions, divided into three categories: plant architecture, f loral architecture, and fruit mass. We found that six traits, including LN, ovary transverse diameter (OTD), sepal number (SN), sepal length (SL), fruit stalk diameter (FSD), and fruit stalk length (FSL), were highly correlated (r > 0.5) with FW (Fig. 3A) and exhibited significant differences between the PIM and BIG groups (Supplemental Fig. 5).
In the PIM and BIG groups, we performed large-scale GWASs on these six agronomic traits highly related to FW (Fig. 3B and C). A total of 18 significant signals (P < 1.05 × 10 −7 ) were identified in the tomato genome ( Fig. 3B and Supplemental Table 11). Among these signals, five were shared among FW and other traits, including fw2.2 for FSD, OTD, and LN on chromosome 2; fw3.2 for FSD and OTD on chromosome 3; and fw11.3 for SN, OTD, LN, and FSL on chromosome 11 (Fig. 3B). Furthermore, haplotype analysis on the three FW-related loci (Chr02:41796211, Chr03:57938377, and Chr11:52136718) showed eight major haplotypes in the BIG group (Fig. 3D). Among these haplotypes, the RRR haplotype is predominant in the PIM group, and we found that the major haplotype (AAA) in the BIG group contributed to the larger fruit and greater organ number (Fig. 3E). Interestingly, five novel loci related to SN, FSD, OTD, FSL, and SL were identified (Supplemental Table 11). The above results indicate that tomato FW and the agronomic traits related to FW may have shared a part of common genetic basis during the process of tomato pseudo-domestication.

Differentiation of flavonoid content in large-fruited tomatoes
Flavonoids are an important metabolite that determinants tomato fruit quality, which can affect human consumption and acceptability. During tomato breeding, these compounds have gradually become the key indicators for farmers and breeders to cultivate tomato varieties. However, the genetic basis of divergence of f lavonoid biosynthesis has not been fully studied among large-fruited tomatoes. In this study, we found that a series of genes involved in f lavonoid biosynthesis, including cinnamate-4-hydroxylase (SlC4H), 4-coumarate-coenzyme A ligase (Sl4CL), MYB12 (SlMYB12), chalcone synthase 1 (SlCHS1), chalcone synthase 2 (SlCHS2), f lavanone 3-hydroxylase (SlF3H), f lavanone 3 -hydroxylase (SlF3 H), and f lavonol synthase 1 (SlFLS1), were more highly expressed in high f lavonoid-accumulating tomatoes than those with low f lavonoidaccumulating content (Fig. 4A and B).
In the coding region of Solyc02g063010/SlBES1.2, we found that a polymorphism at position 167 bp (SNP BES1.2 T/C) caused the substitution from serine to proline in SlBES1.2 ( Fig. 4D and E). According to the haplotype analysis, the CC allele (proline) was found to mainly exist in high f lavonoid-accumulating accessions (Fig. 4D). Interestingly, the Arabidopsis thaliana ortholog of SlBES1.2, AtBES1, could repress MYB12 expression and reduce f lavonoid biosynthesis in Arabidopsis [29]. The phylogenetic tree of homologous proteins showed that SlBES1.2 had a similar function to AtBES1 [29], indicating that SlBES1.2 played a crucial role for the differentiation of f lavonoids in tomato divergence (Fig. 4F).
To investigate whether SlMYB12 is regulated by SlBES1.2, we conducted a dual-luciferase reporter assay, consisting of a reporter construct containing the SlMYB12 promoter and effector constructs carrying either SlBES1.2 167S or a mutated SlBES1.2 167P (Fig. 4G). Co-transformation experiments in Nicotiana benthamiana leaves showed significantly lower LUC activity than the control (Fig. 4H). To verify the causative role of this gene in varying tomato f lavonoid content, we used high/low f lavonoid tomato lines overexpressing SlBES1.2 (SlBES1.2-OE). We found that the relative expression of f lavonoid-related genes, such as SlMYB12 and SlCHS1, decreased in the SlBES1.2-OE lines (Fig. 4I). Taken together, our results indicate that the transcription factor Solyc02g063010/SlBES1.2 is involved in the tomato f lavonoidbiosynthesis-related metabolic pathways.

Discussion
As the world's most important vegetable, the commercial tomato contains more nutrients than its wild germplasm, including abundant soluble solids, f lavonoids, vitamins, and antioxidants [30]. In order to improve the f lavor of tomato fruit and increase its resistance to pathogens, genomic fragments from the wild germplasm were introgressed into cultivated tomato during breeding [2,31]. The evolution of tomato genome is described as a two-step process with an increase in fruit mass: from PIM to CER, and then from the CER to BIG groups [2]. However, the CER in South America is native to the Ecuadorian and Peruvian Andes, and is considered as an evolutionary intermediate between the PIM and BIG groups [2], or alternatively, an admixture produced by extensive hybridization [4,5,32,33]. To better understand the genetic mechanism of tomato domestication, we studied the small-fruited wild tomatoes (S. pimpinellifolium) native to the Andean regions of South America and the large-fruited cultivated tomatoes grown worldwide.
FW is one of the most important quantitative inherited traits controlled by multiple genetic loci. To date, researchers have found about 30 QTLs related to fruit size and shape [2,34,35]. However, only three genes affecting FW were found in tomatoes, including fw2.2/CNR, fw3.2/SlKLUH, and fw11.3/CSR [7][8][9]. In our study, we found a leading SNP (chr09:63405303) in the fw9.3 locus on chromosome 9 by analyzing the PIM and BIG groups. A candidate gene (Solyc09g082110) encoding a seed maturation/late embryogenesis abundant (LEA) protein was found in this locus. Previous studies have shown that the LEA gene indeed regulated plant growth and organ development. Orthologs of Solyc09g082110 play a pivotal role in osmotic regulation and salt stress response in wheat [36]. In addition, the rice LEA protein HVA1 promotes root development through an auxin-dependent process [37], and the Brassica napus LEA3 could improve photosynthetic efficiency to increase the accumulation of oil content in seeds [38]. Furthermore, TaHVA1 improved the biomass productivity and water use efficiency in wheat [39]. In this study, we also found that the mutation of the leading SNP upstream of the SlLEA gene was significantly associated with FW. During the expansion  stage of the ovary, the relative expression level of this gene was upregulated significantly in the BIG AA group compared to the BIG GG group. We hypothesized that this leading SNP could regulate the expression level of the Solyc09g082110 and further inf luence fruit enlargement during the developmental stage of the ovary in the BIG group.
After the pseudo-domestication of tomato, different tomato varieties were developed in accordance with human preferences. Among these diversified varieties, pink tomatoes are preferred, particularly popular with consumers in Asia. Meanwhile, hundreds of metabolites were modified during the pink tomato breeding [13]. Flavonoids accumulate in tomato fruits and contribute to its fruit color and ultraviolet protection [28]. In addition to the previously reported gene SlMYB12 [40], we also identified a candidate gene (Solyc02g063010) encoding the transcription factor BRI1-EMS-SUPPRESSOR 1.2 (SlBES1.2), which is the primary regulator of brassinosteroid signaling transduction. In plants, BES1 could directly inhibit biosynthesis of the brassinosteroids and jasmonates by interacting with several MYB genes, thus inf luencing the growth-defense tradeoff [41]. In Arabidopsis, BES1 negatively regulated the expression level of the transcription factor, MYB11, MYB12, and MYB111, in f lavonoid biosynthesis [29], which is consistent with the MYB12 expression pattern of and f lavonoid content in tomato. Moreover, the previous studies reported that BRASSINOSTEROID-INSENSITIVE2 (BIN2) kinase could phosphorylate BES1 and inhibit its activity [42], which indicates that the content of f lavonoids may be regulated through the phosphorylation of BES1 in plants. It indicates that BES1 does not only play important roles in stress response but also inf luence fruit quality in tomato. However, the molecular mechanism of this candidate gene needs further functional verification.
Collectively, our study indicated artificial selection for fruit mass during tomato pseudo-domestication. The divergence of f lavonoid biosynthesis was clarified by the distinct differentiation among large-fruited tomatoes (Fig. 5). These findings not only provide insights into tomato pseudo-domestication and divergence, but they will also facilitate de novo domestication of wild relatives [43,44] and future variome-guided tomato breeding.

Selective sweep detection
We used three strategies to identify the selected genomic regions, including EigenGWAS, nucleotide diversity (π ) analysis, and XP-CLR methods. For EigenGWAS analysis [46,47], we used the first eigenvector of the Principal Component Analysis (PCA) as "phenotype" and performed GWAS on all tomato accessions using TASSEL software (version 5) [48] with the default parameters. The top 5% (6.398) windows were determined as candidate selected regions. For π analysis, we scanned the whole genomic regions using the PopGen module in Bioperl (version 1.7.8) [2]. The top 5% (16.466) of ratios (π PIM /π BIG ) were considered as candidate selected regions. For XP-CLR analysis [49], we exploited a composite likelihood method for detecting selected sweeps between the PIM and BIG groups, with the following parameters: -maxsnps 600 -size 100 000 -step 10 000. The top 5% (21.56) of the entire genome with the highest XP-CLR values were considered as candidate regions. Finally, we merged windows that were less than 100 kb into one selected region. Genes within these selected regions were defined as pseudo-domesticated genes.

RNA sequencing analysis
The transcriptome data of 26 PIM and 259 BIG tomato accessions from the previous study [13] were obtained to identify the DEGs using HISAT2 (version 2.1.0) [50] with the default parameters. The aligned reads are submitted to StringTie (version 2.0.3) [51] for transcript assembly with the default parameters. Finally, we filtered out genes with FPKM equal to zero in all tomato accessions and identified DEGs between the PIM and BIG groups (unpaired samples) using the DEGseq2 program. GO (gene ontology) enrichment analysis was performed using the TopGO program and KEGG enrichment analysis using the clusterProfiler program.

Bulked segregant analysis of the F 2 population
The data of TS-19 (PIM, 1.7 g), TS-400 (BIG, 260.1 g), and 500 F 2 individuals were obtained from the previous study [13]. We aligned the short-read data to the reference genome using Burrows-Wheeler Aligner (version 0.7.16a) [52]. The SNPs between TS-19 and TS-400 lines were identified using samtools (version 1.5) [53] and bcftools (version 1.9) [54]. We calculated the SNP index and the mean SNP index of bulk samples using the sliding window method (window: 100 kb, step size: 10 kb).

GWAS analysis
The SNPs were filtered with MAF >5%, missing rate <10%. After filtering, a total of 2,154,571 SNPs in the PIM and BIG groups were used for GWAS with the EMMAX software [55]. The kinship matrix was measured with the default parameter, and the first five principal components were considered as fixed effects. The suggested P value (2.11 × 10 −6 ) and the significant P value (1.05 × 10 −7 ) were calculated by the 474,845 effective SNPs using the GEC software (version 0.2) [56]. The significant differences of these phenotypes were measured using the ANOVA and Wilcoxon test. The correlation and Gaussian distribution of these phenotypes were analyzed and displayed using the PerformanceAnalytics program.

Quantitative real-time PCR (qRT-PCR) analysis
We extracted the total RNA of the fruit ovary in the preanthesis (I), full-bloom (II), 5 days post-anthesis (III), and 10 days post-anthesis (IV) stages using the Quick RNA Isolation Kit (Huayueyang Biotechnology Company). The relative expression of target genes was calculated through the 2 -Ct , and the SlEXP (Solyc07g025390) gene was used as an internal control. The significant differences were calculated according to Student t test.

Dual-luciferase assay
We constructed the reporter construct (proSlMYB12-LUC) through transferring the promoter sequence of SlMYB12 into the pGreenII 0800-LUC vector, and two different effector constructs through inserting the coding sequences of SlBES1.2 into pGambia1300-GFP. The GV3101 (Agrobacterium tumefaciens strain) carrying the above constructs was injected into the young leaves of N. benthamiana. After 2 days, the LUC and REN activities in the leaves were measured using the Dual-Luciferase Reporter Assay Kit (Vazyme, DL101-01) in three separate experiments. The results were inferred from at least three biological replicates in each experiment.

Overexpression vector construction and plant transformation
We transferred the full-length CDSs of SlBES1.2 167S and SlBES1.2 167P into pDONR221 using Gateway BP Clonase II (Invitrogen, 11 789 020). Then, their sequences were reconstituted into the destination vector pH7WG2D using Gateway LR Clonase II (Invitrogen, 11 791 020). The A. tumefaciens strain EHA105 carrying the final vector infected the tomato fruit (Micro Tom) at the breaker stage [64]. After 6 days, the total RNA of transgenic fruit pericarp tissues was extracted, and the expression levels of these genes were measured using qRT-PCR analysis. The primer sequences used in this study were shown in Supplemental Table 13.