Distribution of SUN, OVATE, LC and FAS in the Tomato Germplasm and the Relationship to Fruit Shape Diversity

Phenotypic diversity within cultivated tomato (Solanum lycopersicum) is particularly evident for fruit shape and size. Four genes that control tomato fruit shape have been cloned. SUN and OVATE control elongated shape whereas FASCIATED (FAS) and LOCULE NUMBER (LC) control fruit locule number and flat shape. We investigated the distribution of the fruit shape alleles in the tomato germplasm and evaluated their contribution to morphology in a diverse collection of 368 predominantly S. lycopersicum and S. lycopersicum var. cerasiforme accessions. Fruits were visually classified into eight shape categories that were supported by objective measurements obtained from image analysis using the Tomato Analyzer software. The allele distribution of SUN, OVATE, LC and FAS in all accessions was strongly associated with fruit shape classification. We also genotyped 116 representative accessions with an additional 25 markers distributed evenly across the genome. Through model-based clustering we demonstrated that shape categories, germplasm classes and the shape genes were non-randomly distributed among five genetic clusters (p < 0.001), implying that selection for fruit shape genes was critical to subpopulation differentiation within cultivated tomato. Our data suggested that the LC, FAS and SUN mutations arose in the same ancestral population while the OVATE mutation arose in a separate lineage. Furthermore, LC, OVATE and FAS mutations may have arisen prior to domestication or early during the selection of cultivated tomato whereas the SUN mutation appeared to be a post-domestication event arising in Europe.

Chi-square analyses were conducted to determine whether the observed combinations of alleles (number) were higher or lower than expected (number in parentheses). The Chi-square values obtained in each paired analysis of the mutant alleles appear in the second row; ** p < 0.01, **** p < 0.0001. The Chi-square values obtained in the analysis of combinations of three or more fruit shape genes were all highly significant.


INTRODUCTION
Tomato (Solanum section Lycopersicon) is native to western South America, from Ecuador and Peru to Bolivia and northern Chile. Cultivated tomato (S. lycopersicum L.) is postulated to have been domesticated in Mexico (Jenkins, 1948) with Peru suggested as an alternative location (De Candolle, 1886). The fruit of S. lycopersicum var. cerasiforme, also known as "cherry tomato", are typically larger than fruit of the wild species but smaller than those of cultivated tomato. Wild cherry tomato is hypothesized to be a direct progenitor of cultivated tomato (Rick and Holle, 1990). However, others consider "var. cerasiforme" a revertant from cultivation (i.e., feral plants) or a possible hybrid between wild and weedy taxa (Peralta et al., 2008). Indeed, previous studies have shown that most accessions of S. lycopersicum var. cerasiforme are more closely related to cultivated tomato than to wild relatives, whereas others are an admixture between S. lycopersicum and the wild relative S. pimpinellifolium, possibly resulting from frequent hybridization between them (Nesbitt and Tanksley, 2002; Ranc et al., 2008).
Some of the most important changes that occurred during the domestication and improvement of tomato were increased fruit weight and the emergence of variable fruit shapes and colors (Paran and van der Knaap, 2007). Tomato varieties have been classified based on fruit morphology into shape categories described by the International Union for the Protection of New Varieties of Plants (UPOV) and the International Plant Genetic Resources Institute (IPGRI) (IPGRI, 1996; UPOV, 2001). In addition to the fruit shape categories, tomatoes have also been categorized into germplasm classes based on geographic origin and/or age. However, definitions for germplasm classes are neither accepted by all nor clearly delineated, with categories partially overlapping. Tomatoes classified as regional or landraces are farmer- or gardener-selected and are adapted to the local environment, typically in areas of local subsistence (Male, 1999). The terms "vintage" and "contemporary" (modern) tomato refer to tomato accessions released before (vintage) or after (contemporary) a certain year (Williams and St Clair, 1993; Park et al., 2004; Sim et al., 2009). Breeding and elite breeding lines are used in current breeding programs that seek to develop commercially competitive varieties. U.S. heirloom tomatoes comprise a diverse and loosely defined group. Heirlooms have been referred to as accessions handed down from generation to generation, old commercial varieties, contemporary varieties created to fill niche markets, those of mysterious origin, or "treasured" (Male, 1999). Many heirloom varieties were brought to North America by European settlers and therefore could be considered as regional accessions or landraces in the country from where they originated.
Fruit shape and locule number are quantitatively inherited characters with estimates of QTL number ranging from four to 17 with the major loci explaining 19% to 79% of the genetic variation (Barrero and Tanksley, 2004; Brewer et al., 2007; Gonzalo and van der Knaap, 2008).
The sun and ovate loci control fruit elongation and the underlying genes are known. SUN encodes a protein that is a positive regulator of growth resulting in elongated fruit and is hypothesized to alter hormone or secondary metabolite levels (Xiao et al., 2008). The mutation is the result of a gene duplication event that was mediated by the retrotransposon Rider (Xiao et al., 2008; Jiang et al., 2009). OVATE encodes a negative regulator of growth, presumably by acting as a repressor of transcription and thereby reducing fruit length (Liu et al., 2002; Hackbusch et al., 2005; Wang et al., 2007). The OVATE allele that conditions an elongated fruit carries a premature stop codon and is presumed to be a null allele (Liu et al., 2002). Locule number, which has a pleiotropic effect on fruit shape and size, is controlled by the fas and lc loci. FAS encodes a YABBY transcription factor and downregulation of the gene is caused by a large insertion in the first intron (estimated to be 6 to 8 kb) resulting in fruits with high locule number (Cong et al., 2008). The molecular nature of LC was recently identified (Muños et al., unpublished data). Two single nucleotide polymorphisms (SNPs) were found to be critical in controlling the locule number phenotype and were located ~1200 bp downstream of the stop codon of a gene encoding a WUSCHEL homeodomain protein, members of which regulate stem cell fate in plants (Mayer et al., 1998).
The goals of this study were to determine whether allelic distribution of SUN, OVATE, LC, and FAS was associated with fruit shape, genetic background as well as geographical and historical origin in a diverse collection of cultivated accessions. This information would offer important insights into the number of genes involved and their effect on fruit shape in the tomato germplasm. Moreover, the knowledge of the distribution of the fruit shape gene alleles would allow us to examine the molecular events that accompanied domestication and selection of this important crop. We showed that the diversity in tomato fruit morphology was explained to a large extent by mutations in the SUN, OVATE, LC, and FAS genes. We analyzed the genetic clustering of this dataset relative to fruit shape category and germplasm class, and demonstrated non-random distribution of the major fruit shape alleles. Moreover, our data suggested that FAS and SUN arose in the LC mutant background. OVATE on the other hand arose in a different ancestral population. Finally, our studies offered valuable insights into the evolution of tomato from a round berry to a fruit with diverse shapes.

Tomato germplasm and fruit shape categorization
We visually classified 368 tomato accessions according to eight fruit shape categories: "flat", "round", "rectangular", "ellipsoid", "heart", "long", "obovoid" and "oxheart" (Figure 1, Table S1). These accessions represented eight germplasm classes based on geographic origin and/or history (Table I). Some of the fruit shape categories were represented by accessions from all germplasm classes whereas other shape categories contained accessions from only a few classes (Table I). For example, fruit in the ellipsoid category was found in nearly all germplasm classes with the exception of the seven wild accessions, which produced only round fruits. In contrast, long tomatoes were commonly found among U.S. heirloom and regional Spanish accessions and rectangular shape was represented mostly by Italian accessions (Table I).

Phenotypic diversity analyzed by Tomato Analyzer
To analyze fruit shape objectively, and determine whether the visual categories were supported by quantitative shape measurements, we obtained the values for 36 fruit attributes using image analysis. A subset of 120 accessions (hereafter called the subcollection) that represented the diversity of the fruit shape and germplasm classes observed in the larger collection of 368 accessions was phenotypically evaluated using Tomato Analyzer (TA) (Brewer et al., 2006; Gonzalo et al., 2009). Iterations of Linear Discriminant Analysis (LDA) led to the identification of the seven most important attributes that define the fruit shape categories. The final set of attributes identified for their predictive value were: fruit shape index, distal end protrusion, widest width position, the proximal end blockiness value at 20% from the proximal end, rectangular, distal angle at 20% along the boundary from the tip of the fruit and proximal eccentricity. The range of these attributes for each shape category is listed in Table S2. Using the measurements of these attributes, 83% of fruit could be accurately classified by a simple linear discriminant function according to the cross validation test (Table II). The relatively few discrepancies between the visual classification and the objective classification based on TA attributes arose with fruits classified as ellipsoid, obovoid or rectangular. Each of these three categories had some accessions classified as one or the other. Upon closer inspection, approximately one third of these were misclassified during the initial visual scoring of the collection. The remaining misclassified accessions were the result of transitioning and overlapping values for the seven attributes in the ellipsoid, obovoid and rectangular shape categories (Table S2). Regardless, the objective assessment using the seven attributes in TA provided a robust classification of fruit shape categories consistent with subjective international descriptors.

Allelic distribution of SUN, OVATE, LC and FAS according to fruit shape category and germplasm class
We determined the alleles for SUN, OVATE, LC, and FAS in the 368 accessions comprising the entire collection. The data showed that all obovoid, and many of the ellipsoid (83%), rectangular (59%) and heart (48%) tomatoes carried the mutant allele of OVATE whereas most of the long (88%) and oxheart (83%) tomatoes carried the mutant allele of SUN (Table I). The most frequent mutation in flat tomatoes was LC (82%) followed by FAS (28%). Many of the long tomatoes also carried the mutation in the LC gene (63%), a finding that was supported by genetic evidence for the lc QTL in a population that segregated for elongated fruit shape (Gonzalo et al., 2009). All oxheart tomatoes carried the LC mutation in addition to SUN and/or FAS. Most round tomatoes carried the wild type allele at the four shape loci, with the LC mutation most prevalent at 33%. The majority of the round tomatoes with the LC mutation were S. lycopersicum var. cerasiforme lines (Table S1). To evaluate whether fruit shape category and shape gene mutations were correlated with one another, we conducted a Chi-square test. The test corroborated the lack of independence between fruit shape categories and alleles of SUN, OVATE, LC, and FAS (χ² = 790, df = 84, p < 0.0001), indicating that the tomato shape mutations have a major impact on fruit morphology. These results were further supported by a quantitative association of fruit shape alleles with specific shape attributes (Table S3).
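As an illustration of the kind of test reported above, the following minimal sketch computes a Pearson Chi-square statistic and its degrees of freedom for a contingency table of counts. The function name and the toy counts are assumptions made for illustration only; they are not the paper's data, and the sketch does not reproduce the software used in the study.

```python
def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom) for an r x c table of counts.

    Assumes every row and column total is greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column classifications were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Hypothetical counts (NOT from the paper):
# rows = two shape categories, columns = two allele classes.
toy_table = [
    [30, 10],
    [10, 30],
]
toy_stat, toy_df = chi_square_independence(toy_table)
```

The degrees of freedom follow the usual (rows - 1) x (columns - 1) rule; the statistic would then be compared with the Chi-square distribution to obtain a p-value.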
The mutant alleles of LC and OVATE were well represented in all germplasm classes with the exception of the wild species (Table I). The FAS mutation was often present in the U.S. heirloom and regional Italian collections whereas the SUN mutation was often present in U.S. heirloom and regional Spanish accessions. The most common mutant alleles in the Italian collection were for OVATE and LC. The FAS mutation was more common in the regional than in the contemporary Italian collection. There were very few Italian accessions that carried the mutation in the SUN gene (Table I).
We also investigated the distribution of alleles in the Latin American germplasm. As expected, only wild type alleles for the fruit shape genes were found in the wild species (Table I). Moreover, none of the Latin American regional and cerasiforme accessions carried the mutation in SUN, suggesting that the gene duplication resulting in elongated fruit shape did not occur in South or Central America. In contrast, the mutations in OVATE, LC, and FAS were found in Latin American regional and cerasiforme accessions. The most common mutant allele was for LC, followed by OVATE and FAS (Table I). Two cerasiforme accessions (LA1655 and VIR739) carried both the LC and FAS mutations and exhibited flat fruits and higher locule number (6.8 and 7.5, respectively) than the LC mutation alone (3.4 locules) in this background (Table S1).

Genetic analysis of the subcollection
To evaluate genetic structure in relation to germplasm and morphological categories, we genotyped 120 accessions with an additional 25 loci that were randomly chosen based on their distribution across the genome. With the exception of two cerasiforme accessions, these 118 represented cultivated tomato. After the genotypic evaluations, we removed four accessions because they were identical to another accession in the subcollection. The number of alleles per locus ranged from two to nine for ten SSR markers with a mean of 5.5 (Table S4). The remaining markers had two alleles per locus with the exception of the markers SP and LeOH16.2 which had three. The resulting dataset of 114 cultivated and two cerasiforme accessions and 29 markers (including the four fruit shape genes) was analyzed with the STRUCTURE 2.2 software (Pritchard et al., 2000). We tested population structure for K = 1-15 and determined that the best number of clusters is 5 (Evanno et al., 2005) (Figures S1 and S2). We tested the consistency among five different runs at K = 5 after which we determined the ranking of inferred ancestry among accessions within each cluster as well as the stability of accessions in the same cluster (Figure 2). This analysis indicated that none of the accessions changed from one to another cluster and that the grouping of the accessions was robust. The analysis was repeated with identical settings for burn-in and iterations but only including the 25 randomly chosen markers and omitting the four fruit shape genes. This analysis did not define a best K for the number of clusters (Figure S1). Therefore, the alleles of SUN, OVATE, LC, and FAS appeared to be essential in determining genetic structure for this dataset, suggesting that selection for fruit shape was responsible for differentiating statistically distinct subpopulations within the collection of 116 cultivated accessions.
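The choice of K cited above relies on the ΔK statistic of Evanno et al. (2005), which divides the absolute second-order change in the log-likelihood by its standard deviation across replicate runs. The sketch below is a minimal illustration, assuming ΔK is computed from the per-K means of the replicate ln P(D) values; the function name and all numbers are made up for the example and are not STRUCTURE output from this study.

```python
from statistics import mean, stdev


def evanno_delta_k(lnp):
    """lnp maps K -> list of ln P(D) values from replicate runs.

    Returns {K: delta_K} for every K whose neighbours K-1 and K+1 are present.
    Delta K = |L(K+1) - 2 L(K) + L(K-1)| / sd(L(K)), with L the mean over runs.
    """
    avg = {k: mean(values) for k, values in lnp.items()}
    delta = {}
    for k in sorted(lnp):
        if k - 1 in avg and k + 1 in avg:
            second_diff = abs(avg[k + 1] - 2 * avg[k] + avg[k - 1])
            sd = stdev(lnp[k])
            if sd > 0:  # skip K values with no run-to-run variation
                delta[k] = second_diff / sd
    return delta


# Made-up replicate log-likelihoods (NOT from the paper).
toy_runs = {
    3: [-3200.0, -3210.0, -3190.0],
    4: [-3000.0, -3010.0, -2990.0],
    5: [-2700.0, -2702.0, -2698.0],
    6: [-2690.0, -2720.0, -2660.0],
    7: [-2685.0, -2730.0, -2640.0],
}
toy_delta = evanno_delta_k(toy_runs)
toy_best_k = max(toy_delta, key=toy_delta.get)
```

In this toy example the likelihood rises steeply up to K = 5 and then plateaus with larger run-to-run variance, so ΔK peaks at K = 5. ΔK is undefined for the smallest and largest K tested because it needs both neighbours.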

Pairwise FST and Nei's genetic distances of the STRUCTURE clusters
To test the significance of the genetic clusters at K = 5, we conducted pairwise FST (θ) (Weir and Cockerham, 1984) analyses. We found that the STRUCTURE clusters were significantly different from one another, strongly supporting the genetic grouping (Table S5). Based on Nei's genetic distances (Nei, 1978), one of the most distinct STRUCTURE groups was represented by cluster 5, comprised of contemporary accessions, sharing the fewest common alleles with other clusters. This finding is not surprising because recent introgression of disease resistance genes from wild relatives has enhanced the diversity in this germplasm class (Ruiz et al., 2005; Sim et al., 2009). Clusters 1 and 2 as well as clusters 3 and 4 shared more alleles relative to other pairwise comparisons (Table S5).
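To make the notion of a genetic distance between clusters concrete, the sketch below computes Nei's standard genetic distance, D = -ln(Jxy / sqrt(Jx Jy)), from per-locus allele frequencies. This is the uncorrected form; the Nei (1978) estimator cited above adds a small-sample correction that is deliberately omitted here, so the sketch is a simplification and not the study's computation. The function name and frequencies are hypothetical.

```python
import math


def nei_standard_distance(pop_x, pop_y):
    """Nei's standard genetic distance between two populations.

    pop_x, pop_y: lists with one dict per locus mapping allele -> frequency.
    Assumes the populations share at least one allele (otherwise Jxy = 0
    and the distance is undefined).
    """
    jx = jy = jxy = 0.0
    for freq_x, freq_y in zip(pop_x, pop_y):
        alleles = set(freq_x) | set(freq_y)
        jx += sum(freq_x.get(a, 0.0) ** 2 for a in alleles)
        jy += sum(freq_y.get(a, 0.0) ** 2 for a in alleles)
        jxy += sum(freq_x.get(a, 0.0) * freq_y.get(a, 0.0) for a in alleles)
    # Sums over loci are used instead of means; the ratio is unchanged.
    return -math.log(jxy / math.sqrt(jx * jy))


# Hypothetical allele frequencies at two loci (NOT from the paper).
cluster_a = [{"A": 1.0}, {"A": 0.5, "B": 0.5}]
cluster_b = [{"A": 0.5, "B": 0.5}, {"A": 0.5, "B": 0.5}]
toy_distance = nei_standard_distance(cluster_a, cluster_b)
```

Clusters that share most alleles at similar frequencies give a distance near zero, while a cluster carrying alleles rare elsewhere gives larger values.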

OVATE, LC and FAS genes.
Germplasm classes including regional Italian, regional Spanish and U.S. heirloom lines and fruit shape categories including flat, ellipsoid and obovoid were associated with genetic clusters at p < 0.001 (Table S6A,B). Specifically, heirloom accessions were significantly over-represented in cluster 3, regional Spanish accessions were significantly over-represented in cluster 4 whereas regional Italian accessions were significantly over-represented in cluster 1 (p < 0.05) (Table S6A). In addition, three of four contemporary U.S. lines and four of nine contemporary Italian lines grouped together in cluster 5 (Figure 2). We did not statistically analyze these two germplasm classes because there were fewer than 10 accessions in each group. Nevertheless, these results suggested that the contemporary accessions, whether American or Italian, carried similar alleles, implying that the accessions were related to one another or that selection through breeding favored the same allele at multiple loci. With respect to shape category, the flat category was significantly over-represented in clusters 3 and 4, and the ellipsoid category in clusters 1 and 2. The obovoid category was randomly distributed among the different genetic clusters (p > 0.05, Table S6B).
The accessions that carry the mutation in the LC gene were over-represented in clusters 3 and 4 (Table S6C), which was consistent with the over-representation of flat fruit in these clusters (Table S6B). OVATE was significantly over-represented in clusters 1 and 2 (Table S6C), which was consistent with the over-representation of ellipsoid fruit in these clusters (Table S6B). The mutations in FAS and SUN were significantly over-represented in cluster 3, representing all the oxheart, and most of the flat and long fruit (Figure 2).
The occurrence of mutant alleles in certain genetic clusters might indicate separate origin of the mutations. The most wide-spread mutations in the tomato germplasm were for the OVATE and LC genes. These mutations were found in distinct genetic clusters, indicating separate origin. The data showed that accessions with both OVATE and LC mutations were indeed rare (Table III). It is possible that the lack of coinheritance was reinforced by repulsion phase linkage of OVATE and LC on chromosome 2. Accessions that carry both SUN and OVATE as well as FAS and OVATE were also found less often than would be expected by chance (Table III). This observation suggested an independent origin for the OVATE mutation and genetic isolation of lineages carrying the mutant allele. On the other hand, SUN, LC and FAS are found more often together in the same accession than expected by chance, indicating that the mutations arose in the same ancestral population (Table III).
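The comparison of observed and expected allele combinations can be sketched as follows. Under independence, the expected number of accessions carrying two mutations is the product of the two carrier counts divided by the total number of accessions; a 2 x 2 Chi-square statistic then measures the departure. The function name and the counts are hypothetical and are not taken from Table III.

```python
def cooccurrence_test(n_total, n_a, n_b, n_both):
    """Compare observed and expected co-occurrence of two mutations.

    n_a, n_b: accessions carrying mutation A or B (each includes n_both).
    Returns (expected_both, chi_square, direction).
    Assumes no row or column total of the 2 x 2 table is zero.
    """
    observed = [
        [n_both, n_a - n_both],                        # A mutant: B mutant / B wild type
        [n_b - n_both, n_total - n_a - n_b + n_both],  # A wild type
    ]
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n_total
            stat += (observed[i][j] - expected) ** 2 / expected
    expected_both = n_a * n_b / n_total
    direction = "deficit" if n_both < expected_both else "excess or equal"
    return expected_both, stat, direction


# Hypothetical counts (NOT from the paper): 100 accessions, 40 carry
# mutation A, 50 carry mutation B, only 5 carry both.
toy_expected, toy_chi2, toy_direction = cooccurrence_test(100, 40, 50, 5)
```

A deficit of double mutants, as in this toy case, is the pattern described above for combinations involving OVATE, whereas an excess corresponds to the pattern described for SUN, LC and FAS.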

Presence of fruit shape genes in commercially grown fresh market tomato varieties
Nearly all tomato accessions evaluated in this study were heirloom, regional and contemporary accessions with a few exceptions (Table S1). As was evident from our analysis, selection for mutant alleles of the four fruit shape genes played a key role in defining population structure in the subpopulation dataset. To determine the relevance of these mutant alleles in commercial germplasm, we obtained varieties from local retail stores and evaluated them for shape and the alleles of the shape genes. We found that all the store-bought varieties carried one or more mutant alleles for the four shape genes. For example, ellipsoid shaped grape tomatoes, which occupy a distinct niche market, carried the null allele of OVATE, indicating that the introduction of this group was accompanied by selection for the mutation in this gene.
Unexpectedly, while the heirloom Roma carried the OVATE mutation (Table S1), all three store-bought ellipsoid shaped Roma-type tomatoes carried the mutation in SUN instead. Two of the three store-bought Roma tomatoes also carried the LC mutation. Flat shaped and high loculed varieties carried the LC mutation and one of the three carried the FAS mutation as well. Similar to our findings with the older accessions, several round and low loculed varieties carried LC, indicative of modifiers of this mutation. It appears that the SUN, LC, FAS, and OVATE mutant alleles are not unique to heirloom, regional and contemporary accessions but are found in commercially grown varieties sold at retail stores and, thus, are highly relevant today.

DISCUSSION
In contrast to wild relatives that carry round and two-loculed fruit, cultivated tomato fruit is highly diverse in shape. In this study, we demonstrate that the diversity in fruit morphology in the cultivated germplasm is explained to a large extent by mutations in the SUN, OVATE, LC, and/or FAS genes. Individually, the alleles of these genes explain as much as 71% of the observed variation for specific fruit shape attributes (Table S3). At the same time, it is evident that interactions between genes and uncharacterized modifiers also affect fruit shape. For example, store-bought Roma types and Sun1642 carry the duplication of the SUN gene and exhibit an ellipsoid instead of a long shaped fruit. Sun1642 is a contemporary U.S. accession that clusters genetically with similar accessions such as M82 and Rio Grande demonstrating that the mutant SUN allele was introgressed from another accession, most likely an heirloom. Moreover, differences in fruit shape of varieties carrying the OVATE, FAS and LC mutations also suggest that suppressors and enhancers of these genes are present within the cultivated germplasm. For example, accessions that carry the OVATE mutation display a range of fruit shapes from long (e.g. LYC1340) and obovoid (e.g. Yellow Pear) to round (e.g. Gold Ball Livingston) whereas accessions carrying LC mutation produce long (e.g. Howard German), oxheart (e.g. Cuore de Toro), round (e.g. LA1215), or flat (e.g. Druzba) fruit. Although Howard German fruit has on average five locules controlled by LC (Gonzalo et al., 2009), the effect of SUN is dominant over LC in controlling overall fruit shape (i.e. long), even though locule number is impacted. When adding the FAS mutation to the SUN-LC mutant background, the fruit is less elongated and, instead, oxheart in shape.
Fruit shape in rectangular, ellipsoid and heart shaped varieties that do not carry the OVATE or SUN mutation might be controlled by genes such as those underlying the shape QTL fs8.1 and/or tri2.1/dblk2.1 controlling fruit elongation (Grandillo et al., 1996;Ku et al., 2000;Brewer et al., 2007;Gonzalo and van der Knaap, 2008). However, without the knowledge of the underlying genes at fs8.1 and tri2.1/dblk2.1 and the mutations that gave rise to the altered fruit shape phenotype, it is not possible to survey the alleles at these QTL throughout the germplasm. Alternative explanations to describe variation in fruit shape are also plausible.
While it is likely there are more than two alleles for each fruit shape gene with the exception of SUN, it is not clear whether any of the other alleles would result in the fruit shape changes described for the known mutant alleles. For example, LC exhibits multiple alleles but only two SNPs are associated with changes in locule number (Muños et al., unpublished data). These two SNPs were genotyped in our collection. Limited sequencing of OVATE in several accessions also showed there are more than 2 alleles. However, only one SNP was associated with elongated fruit shape (Rodriguez, unpublished data). The SUN mutation was likely to have occurred recently based on findings presented herein. In fact, there are no nucleotide differences between the ancestral and derived locus with the exception of the template switch that accompanied the transposition event (Xiao et al., 2008). Thus, the existence of other alleles of SUN that feature elongated fruit shape is extremely unlikely. A third allele has been reported for FAS (Cong et al., 2008), although we did not find this allele in any of our accessions. Because we did not search for additional alleles of FAS, the existence of more than two alleles that would result in highly loculed fruit is possible albeit unlikely.
The population structure analysis resulted in the identification of five genetic clusters, some exhibiting significant associations with fruit shape category and germplasm class. The fruit shape alleles of SUN, OVATE, FAS and LC are responsible for the observed clustering indicating that selection for fruit shape is responsible for the underlying genetic structure in tomato. Genetic groupings according to tomato fruit morphology have been reported in other studies (Mazzucato et al., 2008), supporting the notion that selection of diverse fruit shapes played a critical role in tomato domestication. We demonstrate that the fruit shape controlled by SUN and FAS is [...] (Table S6C). In addition, the analysis of Nei's genetic distances shows that clusters 1 and 2 share a large number of alleles as do clusters 3 and 4. This observation and the distribution of fruit shape alleles suggest a separate origin of the OVATE mutation (Table S5). The OVATE lineage remained separate from the lineages carrying mutations in the other fruit shape genes during the domestication and selection of tomato, otherwise random interbreeding would have resulted in more accessions carrying mutations in OVATE and one of the other genes. However, it is also conceivable that the combination of these two mutations results in seed sterility and/or reduced plant viability which would preclude the formation of lineages that carry both OVATE and LC mutations.
Evidence about the consumption of tomato before and immediately after the discovery of America by Christopher Columbus is extremely limited. As described in the Florentine Codex writings by de Sahagún who lived between 1499 and 1590, tomatoes were eaten with salt and chili pepper (De Sahagún, 1959). Historical evidence demonstrated that tomato arrived from Mexico to Spain and Italy following Columbus' exploration of the Americas. The first written record of tomato in Europe was in 1544 where it was described as flat and segmented fruit (Matthiolus, 1544). Other descriptions of fasciated fruit followed soon thereafter (Oellinger, 1941) [...] (Table I). The latter finding demonstrates that the OVATE mutation underlying the classical Italian paste tomato is in fact very wide-spread in its germplasm. All four tomato fruit shape gene mutations are wide-spread in the cultivated tomato germplasm, including commercial fresh market varieties sold at the present time in grocery stores. This finding clearly shows that the fruit shape gene mutations, whether maintained for curiosity's sake, for cultural and culinary purposes, or to develop high yielding and uniquely shaped varieties, played key roles in the selection of tomato and that these mutations are still highly relevant today.
Although speculative, it has not escaped our notice that the distribution of the four fruit shape genes in the tomato germplasm enables us to develop a model for the domestication and selection of tomato. Based on the data presented in this work, we hypothesize how tomato evolved from a spherical to a variably shaped fruit, and where and when the fruit shape [...] (Table S1). This result supports the notion that SUN arose in Europe and came to North America as an heirloom. It is highly unlikely, however, that the SUN mutation arose in Italy since only six out of 109 elongated fruit accessions from the Italian collection carry SUN.
It is likely that SUN arose in the LC background in cultivated tomato. In general, the older accessions representing the heirloom and regional accessions carry both SUN and LC mutations. [...] Legs is a created heirloom that results from a deliberate cross (Male, 1999), and represents an admixture genotype in our cluster analysis. We also assume the admixture genotype and deliberate crosses that generated Long John, Orange Banana, LYC1903, LYC1744, and T923 accessions (Figure 2). The latter three are Italian accessions and since the SUN mutation is quite rare in this germplasm, it is likely that the mutation was bred from an accession that originated elsewhere. In all, these findings suggest that like FAS, SUN most likely arose in the LC mutant background.

Plant material, DNA extraction and fruit scanning
A total of 368 tomato accessions were grown in the field in Wooster, Ohio, USA in the summers of 2005 to 2007 (Table S1). The collection includes U.S. heirloom (47 accessions), contemporary U.S. (13), regional Spanish (22), regional Latin American (20), S. lycopersicum var. cerasiforme (46) and wild (7) accessions. The Italian germplasm was obtained from two sources (Table S1) and was divided into "regional" (162) and "contemporary" (38) based on the presence of the uniform ripening locus u located on chromosome 10 (Kinzer et al., 1990; Philouze). Older tomato accessions such as those found in the heirloom and regional categories often carry fruit with green shoulders when unripe, whereas accessions in the contemporary category lack green shoulders and ripen evenly. The seeds were obtained from a variety of sources (Table S1 and [...] longitudinally through the center, placed cut side-down on a scanner and digitized at 100 dots per inch (dpi) as previously described (Brewer et al., 2006). Total genomic DNA was isolated from young leaves as described previously (Bernatzky and Tanksley, 1986; Fulton et al., 1995).

Fruit shape categories
The fruit shape terms and the number of categories in UPOV (UPOV, 2001) and IPGRI (IPGRI, 1996) classification systems are not completely consistent. Moreover, terms from an older version of UPOV (1992) are not the same as those in the most recent version (2001). The UPOV and IPGRI fruit shape terms are also inconsistent with the prevailing ontology terms (http://solgenomics.net/tools/onto/) (SP:0001000, Solanaceae phenotype ontology). Therefore, to maintain consistency with terms present in the trait ontology database, we renamed categories and merged ones for which varieties were often classified in both (Table S7). We merged the "flattened" and "slightly flattened" categories into just one category entitled "flat". The term "round" is used for spherical shaped fruit. The category "ellipsoid" represented oval shaped fruit.
The "heart-shaped" category in UPOV and IPGRI was renamed "oxheart". Fruit categorized as "oxheart" tended to be large and tapered with prominent shoulders. The term "heart" represented fruit that were larger toward the proximal end than the distal end, had less prominent shoulders than oxheart, and had a distinct tip at the distal end. Instead of "pear-shaped" or "pyriform", we adopted the term "obovoid". The category "long" included varieties that produced very elongated, cylindrical, tapered and often slightly curved fruit. The term "rectangular" remained the same.

Phenotypic analysis of the tomato subcollection of 120 accessions
We selected 120 representative cultivated tomato accessions from the larger set of 368 examined. This selection was balanced to equally represent the eight shape categories as well as regional representation. Two S. lycopersicum var. cerasiforme accessions were also included whereas more distant wild relatives were not included. Phenotypic data were collected using Tomato Analyzer software program (TA) (Brewer et al., 2006;Gonzalo et al., 2009). The TA attribute values were subjected to multiple iterations of Linear Discriminant Analysis (LDA) to determine which attributes were most important for defining shape. First, all 36 attributes were subjected to LDA after which seven were removed because they showed high correlations with one or more attributes that were kept in the analysis. Then, several combinations of three attributes were subjected to LDA. The highest value of the proportion of correct assignments in the cross validation test was obtained when fruit shape index, distal end protrusion and width at the widest position were included. Additional attributes were added one by one and kept only if they led to an increase in the correct proportion of assignments in the cross validation test. This process led to the selection of seven TA-defined attributes for an objective classification scheme (Table S2). The predictive accuracy of an objective image-based classification scheme using a fixed linear discriminant function based on these seven attributes was then assessed. The analyses were carried out with MINITAB 15.1.0.0 Software.
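The attribute-selection procedure described above (start from a small attribute set, add attributes one at a time, keep each only if cross-validated accuracy improves) can be sketched as follows. This is an illustration only: a nearest-centroid classifier is swapped in for LDA to keep the example dependency-free, leave-one-out cross validation is assumed, and all names and values are hypothetical rather than Tomato Analyzer output or the MINITAB procedure used in the study.

```python
def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a nearest-centroid classifier (LDA stand-in)
    restricted to the given feature indices."""
    correct = 0
    for i in range(len(rows)):
        sums, counts = {}, {}
        # Build class centroids from every sample except the held-out one
        for j, (row, label) in enumerate(zip(rows, labels)):
            if j == i:
                continue
            acc = sums.setdefault(label, [0.0] * len(features))
            for k, f in enumerate(features):
                acc[k] += row[f]
            counts[label] = counts.get(label, 0) + 1

        def sq_dist(label):
            return sum((rows[i][f] - sums[label][k] / counts[label]) ** 2
                       for k, f in enumerate(features))

        if min(sums, key=sq_dist) == labels[i]:
            correct += 1
    return correct / len(rows)


def forward_select(rows, labels, start, candidates):
    """Add candidate attributes one by one, keeping each only if it raises
    the cross-validated proportion of correct assignments."""
    chosen = list(start)
    best = loo_accuracy(rows, labels, chosen)
    for f in candidates:
        score = loo_accuracy(rows, labels, chosen + [f])
        if score > best:
            chosen.append(f)
            best = score
    return chosen, best


# Hypothetical measurements (NOT from the paper); columns are three
# made-up attributes, the third being uninformative noise.
toy_rows = [
    (1.0, 0.0, 5.0), (1.1, 0.1, 1.0), (1.6, 0.0, 3.0),   # "round"
    (2.0, 1.0, 4.0), (2.1, 0.9, 1.5), (1.5, 1.0, 3.5),   # "long"
]
toy_labels = ["round"] * 3 + ["long"] * 3
toy_chosen, toy_score = forward_select(toy_rows, toy_labels, start=[0], candidates=[1, 2])
```

In the toy data, attribute 0 alone misclassifies two fruits, adding attribute 1 resolves them, and the noise attribute is rejected because it cannot raise the accuracy further.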

Marker development for the fruit shape genes
Two alleles are known for the SUN, OVATE and LC genes (Gonzalo and van der Knaap, 2008; Muños et al., unpublished data; LC sequence accession numbers: JF284938 and JF284939). FAS is represented by more than two alleles. For this study, however, we focused on the allelic variant carrying the proposed 6-8 kb insertion in the first intron, which underlies the FAS mutation in the cultivated germplasm. Another FAS allelic variant (Cong et al., 2008) appears to be unique, since it has not been found in other accessions.
For OVATE, a dCAPS marker was developed using a fluorescently labeled M13 F primer. Amplification across the HindIII site was conducted using primers EP1031 (5'-AGCATCAATTAGCACTCTTCG-3') and EP1032 (5'-GCTGCAAAGGCAACAGTACA-3'), resulting in a 4 kb band that was sequenced in its entirety. This analysis permitted the identification of the breakpoint at the 3' end of the insertion as well as part of the insertion sequence. Southern blot analysis using the insertion fragment as probe revealed that the inserted region was unique in the tomato genome and that it had rearranged (moved, not duplicated) from a region that mapped very close to fas (data not shown). Based on a comparison of the insert sequence with that of the genomic DNA, we developed three primers: EP1069 (5'-CCAATGATAATTAAGATATTGTGACG-3'), EP1070 (5'-ATGGTGGGGTTTTCTGTTCA-3') and EP1071 (5'-CAGAAATCAGAGTCCAATTCCA-3'). When the insertion is present, EP1069 and EP1071 amplify a band of 466 bp; when the insertion is absent (wild type), EP1070 and EP1071 amplify a band of 335 bp. The amplification products were separated on 2% agarose gels.
For SUN, we were unable to develop a reliable PCR-based marker. Instead, we used Southern blot analysis to detect the alleles at this locus. DNA digested with EcoRV was hybridized with a probe amplified with primers EP45 (5'-TTTACCCGATGTGAAAACGA-3') and EP46 (5'-CATCAATAGTCCAAGGGGAAA-3'). An extra 4.3 kb fragment signifies the presence of the gene duplication that leads to an elongated fruit shape, whereas its absence signifies the wild type allele at sun.
AAAGTAGTACGAATTGTCCAATCAGTCAG-3') that are included in the same PCR master mix. When the cultivated allele is present, lcn-SNP695-F and lcn-SNP695-R-lev amplify a band of 533 bp; when the wild type allele is present, lcn-SNP695-F-cer and lcn-SNP695-R amplify a band of 395 bp. A common band of 872 bp, resulting from amplification with lcn-SNP695-F and lcn-SNP695-R, is always present. The amplification products were separated on 2% agarose gels.
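Scoring the allele-specific PCR assays from the band sizes reported above could be sketched as follows. The function name `score_bands`, the size tolerance, the "heterozygous" call when both diagnostic bands appear, and the use of the common band as a PCR control are illustrative assumptions, not part of the published protocol.

```python
def score_bands(observed_bp, variant_bp, wild_type_bp, control_bp=None, tolerance_bp=20):
    """Call a genotype from the band sizes (in bp) read off an agarose gel.

    tolerance_bp is an assumed allowance for sizing error on the gel.
    """
    def seen(expected_bp):
        return any(abs(size - expected_bp) <= tolerance_bp for size in observed_bp)

    # Assumption: if a common control band is expected, its absence means
    # the reaction failed and no genotype should be called.
    if control_bp is not None and not seen(control_bp):
        return "failed PCR"

    has_variant = seen(variant_bp)
    has_wild_type = seen(wild_type_bp)
    if has_variant and has_wild_type:
        return "heterozygous"  # assumed interpretation of both bands
    if has_variant:
        return "variant"
    if has_wild_type:
        return "wild type"
    return "no call"
```

For the three-primer insertion assay the call would be `score_bands(bands, 466, 335)`; for the assay with the common 872 bp product it would be `score_bands(bands, 533, 395, control_bp=872)`.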

Marker selection
To genotype the 120 accessions, we assessed alleles for 10 SSR, 14 CAPS and one InDel marker. These markers were selected from the Tomato Mapping Resource Database (http://www.tomatomap.net/) and chosen based on their polymorphism within cultivated tomato as well as their random distribution across the genome. Consequently, two or three markers per chromosome were employed, except for chromosome 10, which was genotyped with only one marker. CAPS markers were scored on 2 to 4% agarose gels, whereas the InDel and SSR markers were scored on the LI-COR IR2 4200 (LI-COR Biosciences, Lincoln, Nebraska, USA). Details on the markers used are given in Table S4.

Genetic cluster analysis
Clusters of similar genotypes were delineated using STRUCTURE version 2.2 (Pritchard et al., 2000). To avoid redundancy and bias in our subcollection, we removed four accessions whose genotype was identical to that of another accession based on the alleles for the 25 randomly distributed markers and four fruit shape genes. The accessions that were removed were Jersey Devil (identical to Howard German), PI513088 and PI513036 (identical to Opalka) and UPV11936 (identical to Yellow Plum). A model assuming admixture and independent allele frequencies was selected. We used a burn-in period of 500,000 Markov Chain Monte Carlo (MCMC) iterations and then 1,000,000 iterations after burn-in to estimate the parameters. The selected run length was much longer than suggested by Pritchard and colleagues (Pritchard et al., 2007) in order to minimize the effect of the starting configuration and to obtain the most accurate parameter estimates. Twenty independent runs were done for each K (= number of clusters) from 1 to 15. The optimal K was defined according to the method proposed by Evanno et al. (2005). The strong modal signal at the true K = 5 (Figure S1) was also supported by the plateau observed for parameter P(X|K) at K = 5 (Figure S2), the rate of change of the likelihood distribution, and the absolute values of the second-order rate of change of the likelihood distribution.
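The ΔK statistic of Evanno et al. (2005) can be computed from the log-likelihoods of the replicate runs, as in the sketch below. It uses the common simplification of taking the second-order difference of the mean ln P(X|K) values and dividing by the standard deviation across runs at K; the function name `evanno_delta_k` and the input layout are assumptions for illustration.

```python
from statistics import mean, stdev


def evanno_delta_k(lnp_by_k):
    """Return {K: delta K} from {K: [ln P(X|K) of each replicate run]}.

    delta K = |L(K+1) - 2 L(K) + L(K-1)| / sd(L(K)), where L is the mean
    log-likelihood over runs. The smallest and largest K tested have no
    delta K because one neighbour is missing.
    """
    avg = {k: mean(values) for k, values in lnp_by_k.items()}
    delta = {}
    for k in sorted(lnp_by_k):
        if k - 1 in avg and k + 1 in avg:
            sd = stdev(lnp_by_k[k])
            if sd > 0:  # delta K is undefined when all runs agree exactly
                delta[k] = abs(avg[k + 1] - 2 * avg[k] + avg[k - 1]) / sd
    return delta
```

The K with the largest ΔK would then be `max(delta, key=delta.get)`; with runs for K = 1 to 15, ΔK is available for K = 2 to 14.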
The assignment of individuals to clusters was quite robust when compared to pre-defined classes. Some shape categories, genotype classes and fruit shape genes were overrepresented in particular clusters (Table S6).
Table I. Morphological diversity and allelic distribution of SUN, OVATE, FAS, and LC genes in tomato based on fruit shape category and germplasm class.
Table S1). Attributes included in LDA: fruit shape index, distal end protrusion, widest width position, proximal end blockiness at 20% of the height from the top of the fruit, rectangular, distal angle at 20% along the boundaries from the tip of the fruit, and proximal eccentricity. A description of the attributes is given in Brewer et al. (2006) and Gonzalo et al. (2009).
Chi-square analyses were conducted to determine whether the observed number of each allele combination was higher or lower than expected (expected number in parentheses). The Chi-square values obtained in each paired analysis of the mutant alleles appear in the second row; ** p < 0.01, **** p < 0.0001. The Chi-square values obtained in the analysis of combinations of three or more fruit shape genes were all highly significant.
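A paired comparison of two mutant alleles of the kind summarized in the Chi-square note can be sketched as a 2 x 2 test of independence. The sketch assumes a Pearson Chi-square without continuity correction (the source does not state which variant was used), and the function name `chi_square_2x2` is hypothetical.

```python
import math


def chi_square_2x2(both, only_a, only_b, neither):
    """Pearson Chi-square for co-occurrence of mutant alleles at genes A and B.

    Arguments are accession counts: both mutant, mutant at A only,
    mutant at B only, wild type at both. Returns the expected number of
    accessions carrying both mutant alleles, the Chi-square value and p.
    """
    table = [[both, only_a], [only_b, neither]]
    n = both + only_a + only_b + neither
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("each allele must occur in at least one accession")
    # Expected count under independence: row total x column total / n.
    expected = [[row_totals[i] * col_totals[j] / n for j in range(2)] for i in range(2)]
    chi2 = sum(
        (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return expected[0][0], chi2, p_value
```

An observed count of double mutants above the returned expected value indicates that the two alleles co-occur more often than independence predicts; a count below it indicates the opposite.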