Genome-wide association studies dissect the genetic architecture of seed shape and size in common bean

Abstract Seed weight and size are important yield components. Thus, selecting for large seeds has been a key objective in crop domestication and breeding. In common bean, seed shape is also important since it influences industrial processing and plays a vital role in determining the choices of consumers and farmers. In this study, we performed genome-wide association studies on a core collection of common bean accessions to dissect the genetic architecture and identify genomic regions associated with seed morphological traits related to weight, size, and shape. Phenotypic data were collected by high-throughput image-based approaches, and utilized to test associations with 10,362 single-nucleotide polymorphism markers using multilocus mixed models. We searched within genome-associated regions for candidate genes putatively involved in seed phenotypic variation. The collection exhibited high variability for the entire set of seed traits, and the Andean gene pool was found to produce larger, heavier seeds than the Mesoamerican gene pool. Strong pairwise correlations were verified for most seed traits. Genome-wide association studies identified marker–trait associations accounting for a considerable amount of phenotypic variation in length, width, projected area, perimeter, and circularity in 4 distinct genomic regions. Promising candidate genes were identified, e.g. those encoding an AT-hook motif nuclear-localized protein 8, type 2C protein phosphatases, and a protein Mei2-like 4 isoform, known to be associated with seed size and weight regulation. Moreover, the genes that were pinpointed are also good candidates for functional analysis to validate their influence on seed shape and size in common bean and other related crops.


Introduction
Common bean (Phaseolus vulgaris L., 2n ¼ 2x ¼ 22) is the most widely consumed legume grain in the human diet (Broughton et al. 2003) and a valuable source of protein, energy, and micronutrients for more than 300 million people in the developing world (Gepts et al. 2008;Blair et al. 2010;Petry et al. 2015). Its relevance to global food security is becoming increasingly important due to markedly high population growth in the leading common bean consumption countries (Akibode and Maredia 2011;Bellucci et al. 2014) and the rising importance of protein resources of vegetable origin (Petrusá n et al. 2016;Kumar et al. 2017). Brazil is one of the largest consumers and producers of common bean, harvesting around 3 million metric tons per year (FAOSTAT 2018).
Plant domestication can be understood as long-term experimentation involving the multifaceted relationships between environmental, anthropological, and evolutionary forces (Gepts 2004; Chacó n-Sá nchez 2018). In common bean, the occurrence of dual independent domestication events led to the establishment of 2 gene pools, the Mesoamerican and Andean. These gene pools display many morphological and genetic features (Singh et al. 1991b;Kwak and Gepts 2009), and especially traits related to a reduction in pod shattering and improvement in seed size, making P. vulgaris a good model for studies on crop evolution and domestication (Bitocchi et al. 2017). Accessions in the Mesoamerican gene pool exhibit greater genetic diversity (Mamidi et al. 2013) and are characterized by small-to-medium-sized seeds with "S" or "B" phaseolin types, while Andean seeds are larger, with "T," "C," "H," and "A" phaseolin patterns (Gepts et al. 1986;Singh et al. 1991a).
Seed weight and size are important yield components and thus selecting for large seeds has been a key objective in crop domestication (Gavazzi and Sangiorgio 2017). A superficial consideration of the relationship between yield and seed size might suggest that these traits are positively correlated, since seed size should have a positive influence on yield. However, for common bean, in which hundred seed mass can range from 15 to over 60 g, strong negative correlations with yield have been reported (Adams 1967;Nienhuis and Singh 1986;White and Gonzá lez 1990), since heavier seeds result in fewer seeds per plant. This negative correlation has been of great interest to bean breeders because selecting for higher yields can produce accessions with small, lightweight seeds (White and Gonzá lez 1990). Nevertheless, on the major common bean markets such as Latin America, Africa, and Asia, larger seeds are more attractive to consumers and drive up market prices (Nienhuis and Singh 1986).
Seed shape is also an important commercial criterion for the processing industry, consumers, and farmers. For instance, information on bean morphological seed features such as length, width, thickness, projected area, perimeter, weight, circularity, and length-to-width ratio are critical for designing seed drills and systems for sorting, sizing, and handling (Mazhar et al. 2013;FAO and AfricaSeeds 2018;Shaw et al. 2020). Hence, bean breeders have become increasingly interested in understanding the genetic architecture of seed shape and size traits, mainly for identifying accessions that combine high yield, large seeds, adaptation to industrial processing, and consumer appeal (White and Gonzá lez 1990;Park et al. 2000;Blair et al. 2009;Lei et al. 2020).
Since the early days of genetics as a science, common bean seed traits such as weight and size were reportedly subject to quantitative inheritance (Johannsen 1911;Sax 1923), leading the botanist Wilhelm Johannsen to coin terms such as "genotype," "phenotype," and "gene", and crucially providing evidence to resolve the classical biometric-Mendelian controversy (Roll-Hansen 1989). Further down the road, molecular markers were employed to dissect the genetic architecture of complex traits in plants based on 2 main approaches, known as quantitative trait locus (QTL) mapping and genome-wide association studies (GWAS) (Mitchell-Olds 2010;Xu et al. 2017). In family-based QTL mapping, designed crosses are used to unravel the genetic variations that differentiate progeny, in which meiotic cross-over events allow the identification of genetic loci segregating together with phenotypic variation (Lander and Botstein 1989). This methodology has been used to identify QTLs linked to many yield-related traits in common bean (Gonzá lez et al. 2017), including seed weight, size, and shape (Koinange et al. 1996;Park et al. 2000;Guzmá n-Maldonado et al. 2003;Blair et al. 2009;P erez-Vega et al. 2010;Checa and Blair 2012;Yuste-Lisbona et al. 2014;Geng et al. 2017).
The development of next-generation sequencing technologies significantly improved genotyping capacity and cut costs (Varshney et al. 2009), boosting the popularity of GWAS, which takes advantage of the historical accumulation of recombination events, obviating the need to synthesize segregating populations. This approach has enabled researchers to explore a wider genetic base and provide higher mapping resolution for studies focused on understanding the genetic underpinnings of quantitative traits (see Zhu et al. 2008;Korte and Farlow 2013;Huang and Han 2014;Giordani et al. 2021). Furthermore, high-throughput approaches involving image acquisition and analysis have also decisively favored GWAS, allowing large mapping populations, germplasm collections, and other breeding resources to be screened with progressively greater precision and accuracy for several phenotypic traits, including seed attributes (see Mir et al. 2019). Regarding bean seed size, Lei et al. (2020) applied GWAS to search for associations between 116 simple sequence repeat (SSR) markers and 4 seed traits. Although they were successful in finding marker-trait associations, the density of markers used was certainly not sufficient to represent all linkage disequilibrium (LD) blocks in the common bean genome. An essential step in this direction is to consider how LD decays across the common bean genome (Blair et al. 2018), especially when the biases ascribed to population structure and kinship are taken into account (Diniz et al. 2018). Therefore, there is still a scarcity of studies on combining GWAS with high-density single-nucleotide polymorphisms (SNPs) and high-throughput seed phenotyping to dissect common bean seed morphology, including not only seed size and weight but also other morphological aspects.
The aim of this study was to perform GWAS on a core collection of 180 common bean accessions with image-based phenotyping, and genotyping based on 10,362 SNPs to dissect the genetic architecture and identify genomic regions associated with 7 seed morphological traits related to weight (hundred seed mass), size (width, length, projected area, and perimeter), and shape (circularity and length-to-width ratio). Finally, we searched within genome-associated regions for candidate genes putatively involved in seed phenotypic variation.

Plant material and growing conditions
This study was based on a core collection of 180 common bean accessions representing the genetic diversity of over 1,800 accessions from one of the largest Brazilian P. vulgaris germplasm collections at the Agronomic Institute of Campinas (IAC), Brazil (Perseguini et al. 2015). It is a well-known, high-variability repository for many traits of interest to breeders and consumers, such as seed morphology, biotic and abiotic stress resistance, and nutritional value, previously classified according to commercial group, origin, and type of phaseolin (see Perseguini et al. 2015;Diniz et al. 2018).
The seeds used for phenotypic evaluation were obtained from plants grown in the greenhouse in a fully randomized experimental design with 2 replicates, conducted in the State of São Paulo, Brazil (22 42 0 30 00 S, 47 38 0 00 00 W), during the 2017-2018 growing season (December-May). Each plot consisted of 1 pot containing 2 plants. Considering that all accessions were inbred lines grown under optimal conditions, high seed homogeneity was expected. Briefly, seeds were germinated in a biochemical oxygen demand (BOD) incubator at 26 C for 5 days prior to transplanting. Three seedlings were transferred to each 3 l pot containing soil and sand (3:1). After establishment, 1 plant was discarded from each pot. The remaining 2 were fertilized with 5 N-10 P-30 K according to crop recommendations and irrigated periodically until harvesting.

Image-based acquisition and processing
In order to evaluate seed size, shape, and weight, 7 specific traits were measured on 50 seeds randomly selected from each plot. Seed size was determined by measuring seed length (LENGTH), width (WIDTH), perimeter (PERIM), and projected area (AREA). Length-to-width ratio (LWR) and circularity (CIRC) were calculated to reflect seed shape. Finally, seed weight was estimated on the basis of hundred seed mass (HSM). Seed size and shape properties were evaluated using highthroughput image processing. Seeds were photographed in a dark room with a supplementary light source and using a digital SLR camera (EOS 1100D, Canon, Tokyo, Japan) fixed perpendicularly on an image acquisition station, 35 cm above the seeds. The background color was either white or blue, to contrast with the tegument. A 10-mm crossmark was added to the background to convert from pixels to mm and indicate actual seed size on the image (Fig. 1). HSM in grams was determined on an analytical balance (Mark 1300, Technal, Brazil). A total of 18,000 seeds were phenotyped.
Images were analyzed using SmartGrain software (Tanabata et al. 2012), which can automatically detect every seed in the image and assess size and shape. It is recommended for high-throughput seed phenotyping in genetic mapping studies. The software processes the images rendering sequential coordinates along the seed perimeter. Briefly, the first step is to load the image and convert it from 24-bit (full color) to 1-bit (black and white) using a segmentation method. Next, the "cvErode" and "cvDilate" OpenCV functions are used to analyze seed morphology and exclude features such as awn, pedicel (for Poaceae), and funicle remnant (for Fabaceae). The outlines are then detected, all seeds labeled, and AREA, PERIM, and center of gravity computed using the "cvFindContour" function. The system automatically finds a set of perimeter coordinates for each seed in the image using: . . . ; n; where the origin ð0; 0Þ for all P i is at the top-left corner of the image. The "cvContourArea" function then computes the area within the set of perimeter coordinates for each seed, and "cvArcLength" computes the PL (OpenCV Developers Team 2012). To calculate the LENGTH and WIDTH of each seed, the longitudinal and transversal axes are estimated from the set of perimeter points. For LENGTH, the algorithm detects the longest distance between 2 points on the perimeter coordinates and WIDTH is estimated based on the longest segment perpendicular to LENGTH. Finally, LWR and CIRC are computed using the following expressions: In our study, all JPEG images were stored in a dedicated folder and batch analyzed. Finally, output images were manually checked to correct possible errors.

Phenotypic data analysis
Linear mixed models were created to estimate the variance components and obtain adjusted means using restricted maximum likelihood (REML), implemented in ASReml-R software (Butler et al. 2009): where Y ijk is the phenotypic value of accessions i within gene pool j; l is an intercept; g iðjÞ is the random effect of accessions i within gene pool j, where g iðjÞ $ N 0; r 2 g ; q j is the fixed effect of gene pool j; and e ij is the experimental error associated with accessions i within gene pool j, where e ij $ N 0; r 2 À Á . Likelihood ratio tests (LRT) were run to verify the statistical significance of the accession factor and the Wald test run to verify the significance of the gene pool. The best linear unbiased predictors were applied to compute the genotypic correlations using Pearson's coefficient, and standardized to perform principal component analysis (PCA). Broad sense heritabilities for all 7 seed traits were estimated as the ratio of genotypic to phenotypic variance, expressed as a percentage using the formula: Ã 100, wherer 2 g is the genotypic variance andr 2 e is the residual variance.

Genotypic data
Genotyping by sequencing (GBS) had already been used to genotype the IAC core collection panel, as reported by Diniz et al. (2018). The procedures were carried out at the Genomic Diversity Facility of Cornell University's Institute of Biotechnology. Briefly, DNA was extracted from young leaves using the DNeasy 96 Plant Kit (Quiagen, Hilden, Germany) and following the manufacturer's protocol. Libraries were built using restriction enzyme ApeKI (5 0 C/WGC 3 0 ), and barcode adapters ligated to each individual sample. The samples were then pooled and amplified by PCR as described by Elshire et al. (2011), establishing 2 libraries in 95-plex. Single-end sequencing of up to 100 bp was run on the HiSeq 2500 platform (Illumina, San Diego, CA, USA) in a single lane. Sequences from the 2 libraries can be downloaded from the GenBank database, BioProject PRJNA336556, BioSamples SAMN05513252, and SAMN05513251. Bioinformatics procedures for SNP calling were performed by processing GBS raw sequences through the TASSEL-GBS pipeline, with the P. vulgaris (G19833) genome as reference (Schmutz et al. 2014). SNP data were filtered using the following selection criterion: minimum coefficient of inbreeding 0.9; minor allele frequency (MAF) ! 0.05; call rate < 0.9; and heterozygous loci set to missing data. A total of 10,362 SNPs were detected and are available in Diniz et al. (2018).

GWAS and candidate gene search
GWAS on the 10,362 SNP markers and seed traits were performed using multilocus mixed modeling (MLMM; Segura et al. 2012), accounting for the kinship matrix, implemented in the GAPIT-R package (Lipka et al. 2012). In short, the model incorporates significant effects via a forward-backward stepwise approach. At each step, the variance components are re-estimated. If the newly included fixed effects are really influential, they might significantly reduce the residual variance and efficiently diminish the restrictions imposed by the mixed model on other markers correlated with population structure. This in turn improves statistical power and results in a lower false discovery rate compared to singlemarker scanning and stepwise linear regression, especially if a conservative threshold such as Bonferroni (a/number of markers) is used. To avoid bias ascribed to double-shrinkage, best linear unbiased estimators (BLUEs) of the genotypic values for each trait were used in the association analysis.
The proportion of variance explained (PVE) by each association (SNP and seed trait) was calculated using the expression: (1 -freq)effect 2 and V pheno is the variance of the adjusted means for each trait.
The squared coefficient of correlation (r 2 ) between the significant SNPs and neighboring SNPs (1 MB up and downstream) was computed in order to establish the LD block inherited together with the associated SNP, or inherited more frequently than could be expected by randomness. To do this, we used the LDheatmap R-package (Shin et al. 2006) and the expression below: where P A is the allelic frequency of A; P B is the allelic frequency of B, and P AB is the allelic frequency of the AB haplotype. The genomic intervals investigated for candidate genes were then defined based on the LD block, as proposed by Assefa et al.  (Conesa et al. 2005) with an e-value cutoff of 1 Â 10 À5 and a maximum of 20 hits. In addition, we used Gene Ontology (GO) mapping to assign GO terms by running Blas2GO with default parameters. Finally, aiming to assess the expression patterns of these genes during seed development, we consulted the common bean expression atlas (O'Rourke et al. 2014). The normalized expression data (reads/kb/million; RPKM) were obtained from 8 tissues: young flowers (FY), young pods (PY), pods approximately 9 cm long (PH), pods between 10 and 11 cm long (P1), pods between 12 and 13 cm long (P2), heart stage seeds (SH), stage 1 seeds (S1), and stage 2 seeds (S2). Moreover, in order to investigate relative gene expression among these tissues, we calculated the Z-score using: Z ¼ ðx À lÞ=r, where x is the RPKM value for each tissue; m is the mean of RPKM values across tissues and r the standard deviation. The application of this classical normalization allowed us to standardize RPKM values across a range of experiments, conditions and tissues, and contrast independently collected data (Cheadle et al. 2003).

Phenotypic analysis
Phenotypic evaluation of the core collection panel revealed wide variations across the whole set of seed traits ( Fig. 2 and Supplementary Fig. 1). Low standard deviations for all seed size and shape traits were observed within each plot, mirroring the high seed homogeneity, and the quality of image-based phenotyping (Supplementary File 1). The descriptive statistics for LENGTH, WIDTH, PERIM, AREA, LWR, CIRC, and HSM are summarized in Table 1. Examining the dispersion of phenotypic data, broad amplitudes were found, especially for AREA, PERIM, and HSM. The coefficient of variation (CV) values between different replicates for all the traits were low to medium, ranging from 3.1% (CIRC) to 22.6% (AREA), indicating good experimental precision. The only traits with CVs > 10% were AREA and HSM, with respective values of 22.6% and 17.2%. Similarly, high broad-sense heritabilities (H2) were verified for all traits, varying from 75.9% (CIRC) to 90.9% (HSM). These results show that much of the observed phenotypic variation can be attributed to the genetic component, supporting the use of the core collection for GWAS purposes.
The suitability of the core collection for the GWAS was also supported by LRT, which confirmed a significant accession effect for the 7 seed traits, further evidence that genetic factors underlie common bean seed traits (Table 2). Of the 180 accessions evaluated, "Apetito Blanco," "Branco Argentino," and "Jalo 110" produced larger, heavier seeds, while some of the lowest values were found for "Sanilac," "Rio Tibagi," and "IAC Carioca Arua." In terms of shape (CIRC and LWR), "Garbancillo," "Sanilac," and "Vermelhinho" exhibited the most circular seeds and "Red Kidney," "Jalo," and "Cal 143," the most elongated ( Supplementary Fig. 1).
Significant gene pool effects were also detected for all traits (P < 0.01) except WIDTH (P ¼ 0.12; Table 2), which could be due to slightly higher residual variance. However, for all sets of traits, low residual variances and high genetic variances were obtained, indicating wide genetic variability and high-accuracy phenotyping. The means for the Andean gene pool were higher than those for the Mesoamerican pool in respect of LENGTH, PERIM, AREA, LWR, and HSM, indicating larger, heavier seeds. On the other hand, the seeds produced by Mesoamerican accessions were more circular and less elongated (Table 1 and Fig. 3). It is important to note that, despite significant differences between gene pools, there was also considerable variation within the pools themselves ( Figure 3 and Supplementary Fig. 1).

Genotypic correlation and principal component analysis
Pairwise genetic correlation analysis enabled us to detect at least 3 groups of correlated traits. Pearson coefficients were strongly positive between all pairs of seed size traits, including weight ( Fig. 4, a and b). Among these traits, correlations varied from 0.69 (WIDTH-LENGTH) to 0.98 (AREA-PERIM). For instance, the strongest correlations were found between PERIM-AREA, PERIM-LENGTH, and AREA-LENGTH (0.98, 0.98, and 0.95, respectively). In terms of seed shape (LWR and CIRC), distinct behavior was detected; even though they showed strong negative correlation with each other (À0.77), correlations with the remaining traits differed, as shown by the network plot (Fig. 4b). LWR exhibited moderately positive correlations with LENGTH, PERIM, AREA, and  a Seed length (LENGTH, in mm); width (WIDTH, in mm); perimeter (PERIM, in mm); projected area (AREA, in mm2); length-to-width ratio (LWR); circularity (CIRC); hundred seed mass (HSM, in g). b Broad-sense heritability.
HSM, and a weak negative correlation with WIDTH. In contrast, CIRC exhibited moderately negative correlations with all seed traits except WIDTH, which was close to zero (Fig. 4). PCA was performed to ascertain the key sources of variation in the set of seed traits and to check whether the 7 evaluated traits were consistent for differentiating the accessions according to gene pool origin. The first 2 principal components (PC) accounted for 93.58% of the variation, in which PC1 mainly described the variation in LENGTH, PERIMETER, AREA, WIDTH and HSM, and PC2 captured most of the variance in CIRC and LWR (Fig. 5), corroborating the correlation clusters verified by the Pearson coefficient (Fig. 4b). The strong negative correlation between CIRC and LWR can also be detected by PCA, since the loadings representing these traits point in opposite directions. Finally, although it was not possible to clearly cluster the gene pools, most of the a Seed length (LENGTH); width (WIDTH); perimeter (PERIM); projected area (AREA); length-to-width ratio (LWR); circularity (CIRC) and hundred seed mass (HSM). Genome-wide association studies GWAS for the 7 seed traits were conducted on a total of 10,362 SNPs. As shown in the quantile-quantile (QQ) plots (Fig. 6b), according to the MLMM model the observed P-values are close to the expected values under the null hypothesis for all traits, indicating that there were no systematic false positive associations.
Seven marker-trait associations with P-values below the Bonferroni threshold (P < 4.82E-06) were considered significant for LENGTH, WIDTH, AREA, PERIM, and CIRC, and were located in 4 distinct genomic regions of 3 chromosomes (Pv02, Pv08, and Pv11; Fig. 6). Of the 7 associations, 2 SNPs were associated with more than 1 trait. For LENGTH, the 2 significantly associated SNPs (S1_325844152 and S1_507205534) were positioned on loci 1797591 and 42751382 bp of chromosomes Pv08 and Pv11, respectively. These SNPs had effects of similar magnitude (0.69 and À0.67 mm) and explained 5.94% and 6.39% of the phenotypic variation. SNP S1_325844152 was also found to be associated with PERIM and CIRCULARITY. For PERIM, it accounted for 7.16% of the phenotypic variation, while for CIRC the phenotypic variation explained (PVE) by the association was as high as 17.6% (Table 3, Fig. 6, and Supplementary Fig. 2).
AREA-associated SNPs (S1_53011276 and S1_379992973) were located on Pv02 and Pv08 at 805645 bp and 55946412 bp, respectively. For this trait, the effect of S1_379992973 was 6.01 mm 2 with a PVE of 5.18%, while S1_53011276 explained an even higher portion of the phenotypic variation (23.3%), with an additive effect of 10.86 mm 2 , resulting mainly from its influence on seed width. For WIDTH, this SNP also accounted for 23.7% of the total phenotypic variation and the effect of the variant allele on this trait was 0.55 mm. No significant associations were detected for LWR and WEIGHT (Table 3 and Fig. 6).
For the 4 significantly associated SNPs, both the variant and the reference allele were found in both gene pools. However, the minor allele variant always occurred at more balanced frequencies in both gene pools. Corroborating the phenotypic analysis which revealed larger seed size in the Andean accessions than in the Mesoamerican accessions, the alleles with positive effects were also present at higher frequencies in the Andean gene pool, except for S1_507205534 (Table 3 and Supplementary Fig. 3).

Candidate genes
With the aim of delimitating the genomic window for candidate gene investigation, LD analysis was performed on the borders of the significant SNP-seed trait associations. The genomic intervals varied between 358.4 kb for S1_53011276 and 558.6 kb for S1_507205534 ( Supplementary Fig. 4). Supplementary Table 1 gives a complete list of 192 genes predicted in the set of associated regions, 188 of which were annotated; 50 in the genomic region harboring SNP S1_53011276; 64 in S1_325844152; 45 in S1_325844152; and 29 in S1_507205534. To give a clearer picture of the potential functions of these genes, functional groups were classified based on the GO database. Among the inferred genes, 32 had no functional annotation or were classified as proteincoding genes of unknown function.
In order to narrow down the number of candidate genes and obtain further evidence of their importance in regulating common bean seed size and shape, we consulted the P. vulgaris expression atlas for 8 tissues and stages related to seed development (O'Rourke et al. 2014), extending from flowering to late phase seed formation. Among the 188 annotated genes, 156 were expressed in one or more tissues or stages and 13 were considered the most promising (Table 4 and Supplementary Table 2) with homologs already reported as putatively involved in seed development or related traits in crops or model species. In the LD block containing SNP S1_53011276, associated with WIDTH and AREA, several genes could be pinpointed as promising candidates (Table 4). S1_53011276 is positioned within an exon of Phvul.002G007100, which encodes the SUPERMAN transcriptional regulator, well known for playing a role in Arabidopsis reproductive organ morphology (Sakai et al. 1995(Sakai et al. , 2000. In terms of the Z-score, substantial negative values were observed for this gene in most tissues, indicating that compared to the whole set of genes and tissues analyzed herein, Phvul.002G007100 was the one with lowest general expression ( Supplementary Fig. 5). Note that the expression of this gene was verified at the 3 stages of seed development (SH, S1, and S2), with stronger expression in the 2 later stages (S1; RPKM ¼ 69 and S2; RPKM ¼ 53). Likewise, in this genomic region, Phvul.002G004400 encodes for a pentatricopeptide repeat-containing protein, already reported as a prominent player in seed development in various crops (Guti errez- Marcos et al. 2007;Liu et al. 2013a;Li et al. 2014;Verma et al. 2015) and highly expressed at the initial stage of seed development (SH; RPKM ¼ 122) (Supplementary Table 2). Phvul.002G006100 is annotated as a protein phosphatase 2C 29-like gene and exhibited remarkably strong expression in all the 8 evaluated tissues, with RPKM varying from 111 in P1 to 2029 in SH, and particularly high Z-scores for SH (1.22), S1 (1.57), and S2 (1.66) (Supplementary Table 2 and Supplementary Fig. 5). Finally, Phvul.002G006700 is a gene predicted as an AT-hook motif nuclear-localized protein 8 which has been shown to underlie seed size in other species (Sharma Koirala and Neff 2020), with high RPKM values for FY, PH, and the 3 seed stage tissues (Supplementary Table 2).
Associated with LENGTH, PERIM, and CIRC, SNP S1_325844152 is located within an exon of Phvul.008G020600, a gene that encodes a crowded nuclei 1 protein. Consistent with the expression atlas, this gene exhibited the greatest expression for FY, PV, PH, P1, P2, and SH, and also the highest Z-scores for most tissues, especially in the early developmental stages ( Supplementary Fig. 5). In the same haplotype block, a homolog of Phvul.008G019500, a protein Mei2-like 4 isoform, was recently reported as a modulator of grain size and weight in Oryza sativa (Lyu et al. 2020) and was highly expressed in common bean during the early seed developmental stage (SH; RPKM ¼ 488). In addition, this region comprises several genes encoding for pentatricopeptide repeat-containing proteins (Phvul.008G022600, Phvul.008G023900, Phvul.008G024300, and Phvul.008G024400), and Phvul.008G022600 and Phvul.008G024400 Figure 5. PCA of the IAC common bean core collection panel based on the phenotypic variability of 7 seed traits: seed length (LENGTH); width (WIDTH); perimeter (PERIM); projected area (AREA); length-to-width ratio (LWR); circularity (CIRC); and hundred seed mass (HSM). Orange and blue dots indicate accessions of Andean and Mesoamerican origin, respectively. Figure 6. GWAS on 7 common bean seed traits for the 180 accessions of the IAC core collection panel. Seed length (LENGTH) is shown in red; width (WIDTH) in green; perimeter (PERIM) in blue; projected area (AREA) in light blue; length-to-width ratio (LWR) in pink; circularity (CIRC) in yellow and hundred seed mass (HSM) in gray. The heatmap bars in the Joint Manhattan plot (a) depict SNP density along the 11 P. vulgaris chromosomes, and the dashed line indicates the Bonferroni threshold (P < 4.82E-06). In the QQ plot (b), the shaded area indicates a 95% confidence interval.  Table 2). Furthermore, in Pv08, significant for AREA, the S1_379992973 LD region contains a putative B3 domain-containing protein (Phvul.008G244300) that is worthy of attention since the B3 domain has been found in transcription factors involved in seed development control (Swaminathan et al. 2008). Finally, SNP S1_507205534, significantly associated with LENGTH, is positioned in an exonic region of Phvul.011G163100, a ribosomal protein L11 methyltransferase whose expression is highlighted in SH (RPKM ¼ 59) and S1 (RPKM ¼ 48) tissues (Supplementary Table 2). This SNP has an LD with a genomic locus that harbors other promising candidate genes, such as Phvul.011G160400, a cup-shaped cotyledon 2-like protein with high expression in FY (RPKM ¼ 205) and PH (RPKM¼ 275) tissues (Supplementary Table 2).

Variance components and heritabilities
In common with the majority of crops, breeding programs for common bean are primarily aimed at releasing high-yield cultivars with resistance to biotic and abiotic stresses (Miklas et al. 2006;Beaver and Osorno 2009;Singh and Schwartz 2010;Assefa et al. 2019). However, bean breeders face additional challenges since they need to pay close attention to the end consumers' needs (Kelly and Miklas 1998;Santalla et al. 2004;Beaver and Osorno 2009;Assefa et al. 2019). Therefore, understanding the genetic architecture of key seed morphological traits is a critical step toward the development of cultivars combining high yield, industrial compatibility, and consumer acceptance (White and Gonzá lez 1990;Park et al. 2000;Blair et al. 2009;Lei et al. 2020). In addition, since seed size has played a crucial role in common bean domestication, identifying the genes responsible for phenotypic variation can help us understand the genetic basis of adaptation (Chacó n-Sá nchez 2018).
Herein we studied the genetic architecture of 7 common bean seed traits including those related to size, shape, and weight. We used a core collection panel derived from one of the largest Brazilian germplasm collections. Despite its modest size, the panel embodies broad genetic diversity (see Perseguini et al. 2015 andDiniz et al. 2018) and is considered a high-variability repository for many important traits, including seed morphology and composition, and resistance to biotic and abiotic stresses. Furthermore, this panel was recently used to successfully map genomic regions associated with the common bean's response to root-knot nematodes (Giordani et al. 2021), further evidence of its suitability for GWAS.
We observed wide variability for all the traits analyzed and coefficients of variation were predominantly low (<10%), indicating that image analysis of seed properties leads to a slight experimental error, as reported by Fıratlıgil-Durmus¸et al. (2010). Significant accession effects were observed for all traits, and combined with high broad-sense heritabilities (>75.9%), show that a substantial portion of phenotypic variation might be ascribed to the genetic component, highlighting its importance in the expression of these quantitative traits, and thus lending weight to the use of this core collection for GWAS (Table 1). These results corroborate the findings of other studies on bean seed morphology, also reporting high heritabilities (Motto et al. 1978;Yuste-Lisbona et al. 2014;Rana et al. 2015;Singh et al. 2018;Wu et al. 2020).
Even though both domestication events increased seed size and weight compared to their wild ancestors (Gepts 2004; Chacó n-Sá nchez 2018), it is a fact that Andean accessions produce larger seeds than Mesoamericans (Gepts and Bliss 1988;Singh et al. 1991b;Logozzo et al. 2007;Lei et al. 2020). We once again confirmed the larger seed size of Andean accessions based on LEGNTH, PERIM, AREA, and HSM. No significant differences in WIDTH (P ¼ 0.12) were exhibited by the individual gene pools. These findings suggest that the main reasons for the seed size discrepancy between the 2 pools are related to the shape aspect of longitudinal development, an observation supported by the higher LWR of the Andean seeds and the higher CIRC values for Mesoamerican accessions.

Pairwise correlations and principal component analysis
Corroborating other evidence, the core collection herein evaluated showed strong positive pairwise correlations among seed size traits (P erez-Vega et al. 2010; Rana et al. 2015;Herron et al. 2020;Lei et al. 2020;Murube et al. 2020). We verified that the overall size, considering both weight and projected area, was influenced principally by an increase in the length dimension. Even though less pronounced, we observed a negative correlation between CIRC and all the other size traits except WIDTH, suggesting that small seeds are usually rounder while larger seeds tend Table 4. Candidate genes identified in GWAS-associated genomic regions able to modulate seed shape and size in common bean. Significantly associated SNPs according to the Bonferroni multiple test correction (P < 4.82E-06). a Seed length (LENGTH); width (WIDTH); perimeter (PERIM); projected area (AREA); and circularity (CIRC).
to be elongated. Hence, our results show that an extension in LENGTH does not necessarily imply an increase in the distance between one side and the other at the hilum region. Consequently, since WIDTH shows lower genetic variance (Table 2), LENGTH is suggested to be the main seed size parameter affecting shape proprieties. PCA provided further insights into the relationships among seed traits. PC1 accounted for over 70% of total variability, explaining mainly the variation in seed size traits, while for PC2, the most prevalent traits were related to seed shape. Although PCA did not clearly group the gene pools, the majority of the accessions positioned further away from the intersection of the biplot were found to be Andean.

SNP-trait associations
The GWAS approach is fundamentally based on LD and aimed at identifying genetic markers physically linked to the actual causal variant for a given trait. Nonetheless, several other factors over and above physical linkage may influence LD estimates. In breeding programs, intense selection, admixture of populations and crossing using a limited number of elite genotypes reduces genetic diversity and may strongly impact LD patterns and consequently affect GWAS resolution and statistical power (Delfini et al. 2021). In general, LD decay is slower in autogamous species, such as common bean, in which recombination is less effective than in allogamous species (Flint-Garcia et al. 2003).
In the populations typically used for GWAS, related individuals share both causal and noncausal alleles, and thus LD between these loci can result in spurious associations (Korte and Farlow 2013). For this reason, kinship and population structure are usually modeled in the association analysis (Zhao et al. 2007;Zhang et al. 2010). For the core collection used herein, our group has previously studied LD showing that kinship bias is heavier than the bias ascribed to population structure. We proved that when kinship is taken into account, population structure bias is no longer observed in the collection (Diniz et al. 2018). Additionally, we have shown that controlling these biases, LD in P. vulgaris decays to 0.1 at a distance of 1 Mb, indicating that LD for this collection is fairly widespread, in contrast to that observed in cultivars of soybean (133 kb) (Zhou et al. 2015) and rice (123 kb) (Huang et al. 2010).
Although it has been shown that LD is not uniformly distributed across the genome, the overall slower LD decay indicates that the SNP density used in this study could be adequate for tagging most of the LD blocks. However, we expected to find noncausal alleles among the significant associations. For this reason, our strategy of defining the entire loci in disequilibrium harboring the significantly associated SNPs was imperative for finding candidate genes.
In the present study, we have confirmed the previous results and demonstrated that, for all seed traits, when fitting kinship to the MLMM models, P-values exhibit low general inflation, since they closely adhered to the expected results under null hypothesis, suggesting goodness of fit and low false discovery rates (Fig. 6b).
In terms of the genetic architecture of seed size and shape, our results indicate that, although exhibiting quantitative behavior, some genomic regions do account for a considerable amount of variation. GWAS analysis allowed us to identify marker associations with moderate to high effects in 4 chromosome regions. On Pv02 at 805645 bp, S1_53011276 was significantly associated with WIDTH and AREA and explained over 23% of the phenotypic variance. This SNP is a promising candidate for marker-assisted selection, especially since the beginning of chromosome Pv02 has also been reported as linked to seed width (Geng et al. 2017Geravandi et al. 2020) and other seed traits P erez-Vega et al. 2010;Lei et al. 2020;Murube et al. 2020). Our findings also endorse Pv08 as a key chromosome underlying the size and shape aspects of common bean seeds. At positions 1797591 and 55946412 bp, 2 SNPs (S1_325844152 and S1_379992973) were associated with 4 traits (LENGTH, AREA, PERIM, and CIRC). Among these associations, the high PVE (17.6%) found between S1_325844152 and CIRC is particularly significant. Remarkably, this chromosome has already been reported to harbor seed size QTLs (Park et al. 2000;Blair et al. 2006;P erez-Vega et al. 2010;Wright and Kelly 2011;Geravandi et al. 2020). Since there are still no genetic mapping studies for CIRC and considering that it exhibits a marked influence on LENGTH, the significant effect of QTLs previously reported for seed length (Park et al. 2000;P erez-Vega et al. 2010) not only confirm the role of Pv08 in determining LENGTH but also suggest that it does contribute to CIRC. Finally, at position 42751382 bp of Pv11, the association between LENGTH and SNP S1_507205534 corroborates previous findings reporting QTLs underlying seed length at a similar location on the same chromosome (Park et al. 2000;Murube et al. 2020).
In common bean, dual domestication significantly impacted traits related to a reduction in pod shattering and improvement in seed size. These traits merit singular attention because of their role in early adaptation and importance for plant breeding (Chacó n-Sá nchez 2018).
Regarding the differences between the 2 gene pools, it is noteworthy that the 4 associated SNPs are segregating in both pools ( Supplementary Fig. 3). This suggests that such informative markers can be useful for improving genotypes of both origins using marker-assisted selection, and consequently afford good opportunities for breeding, since crossing Andean and Mesoamerican accessions is typically difficult.

Candidate genes
In order to define the genomic window on which to base the search for candidate genes, loci spanning significant associations were investigated based on LD ( Supplementary Fig. 4). Of the 192 genes predicted in those regions, 188 were annotated and 156 classified against the GO database (Supplementary Table 1) and expressed in one or more tissues related to seed development according to the P. vulgaris expression atlas (O'Rourke et al. 2014). Herein, we have highlighted some of the most promising candidates that may influence WIDTH, AREA, LENGTH, PERIM, and CIRC.
For WIDTH and AREA, homologs of Phvul.002G004400, Phvul.002G006100, Phvul.002G006700, and Phvul.002G007100 have been reported to be involved in the morphology of seed and related tissues. Interestingly, the significantly associated SNP S1_53011276 was positioned within an exon of Phvul.002G007100. Although this does not necessarily indicate a cause-effect relationship, this gene is a good candidate since it encodes a SUPERMAN transcriptional regulator, one of the best characterized TFIIIA zinc finger proteins associated with plant development, including leaf, shoot, and floral organ morphogenesis and gametogenesis. According to the atlas, strong expression of this gene was not observed, and it was noticeably expressed only in late seed development (Supplementary Tables 1 and 2) and thus had high negative Z-scores compared to the whole gene set (Supplementary Table 2). Similarly, Huang et al. (2006) indicated that a SUPERMAN-like gene, namely GmZFP1, was intronless and expressed principally in the late developing seeds and reproductive organs of soybean. The authors further suggested that this gene might act through a MADS-box transcription factor-mediated signal network, as occurs in Arabidopsis. In the same LD block, Phvul.002G004400 was consistently expressed at the initial stage of seed development (SH) and belongs to a protein family recognized as a pentatricopeptide repeat-containing protein, a modular RNA-binding protein which intermediates numerous aspects of gene expression, such as RNA processing, splicing, editing, stability, and translation. Already associated with several phenotypes, this family is recognized as an important player in seed development in various crops (Guti errez-Marcos et al. 2007;Li et al. 2014;Verma et al. 2015;Lyu et al. 2020). In maize, empty pericarp5 and small kernel 1 mutants, both encoding pentatricopeptide repeat proteins, exhibited embryo abortion, small endosperm, and stunted seed development in the early stages (Liu et al. 2013b;Li et al. 2014;Qi et al. 2017). Furthermore, many other related genes have also been reported as candidates for seed size and weight regulation in genetic mapping studies of many crops (Zhou et al. 2017;Du et al. 2019), including Fabaceae (Verma et al. 2015). Phvul.002G006100, predicted as a protein phosphatase 2C 29-like gene, exhibited remarkable expressed in all 8 evaluated tissues. Phosphatase 2C proteins are crucial negative regulators of ABA signaling (Bai et al. 2013), and have been reported in soybean as important regulators of seed weight and size. Lu et al. (2017) used whole-genome sequencing of a recombinant inbred line population to map QTLs for seed weight, and noticed that a phosphatase 2C-1 allele from wild soybean had a marked effect in increasing seed weight/size, an observation also validated in transgenic plants. Finally, Phvul.002G006700, an AT-hook motif nuclear-localized protein 8, exhibited strong expression for FY, PH, and tissues at the 3 seed stages. For instance, Sharma Koirala and Neff (2020) showed that an AT-hook motif (AtSOB3-D) modulates seed size as well as hypocotyl length in Camelina sativa and A. thaliana.
Associated with LENGTH, PERIM, and CIRC, the polymorphism of S1_325844152 is located inside an exon of Phvul.008G020600, a crowded nuclei 1 protein, which exhibited the strongest absolute expression and highest Z-scores in most tissues considered herein, especially in the flowering and early developmental stages. Although this gene has not yet been reported as a direct modulator of seed size, the control it exercises over nuclear cell size was detailed , as well as its role in ABAcontrolled seed germination (Zhao et al. 2016). In the same genomic haplotype block, Phvul.008G019500, a protein Mei2-like 4 isoform, is noteworthy for its strong expression in early-stage seed development (SH). Mei2-like proteins with conserved RNA recognition motifs have been identified in several plant species and seem to play a role in plant development (Anderson et al. 2004). In rice, an Mei2-like 4 protein (OML4) is phosphorylated by a glycogen synthase kinase 2 and negatively controls seed size and weight. Thus, loss of function of OML4 mutants produces larger and heavier seeds, while the overexpression of this gene leads to smaller, lighter grains (Lyu et al. 2020;Zhang 2020). As observed in the association of S1_53011276 with WIDTH and AREA, this region comprises genes encoding for pentatricopeptide repeat-containing proteins (Phvul.008G022600, Phvul.008 G023900, Phvul.008G024300, and Phvul.008G024400), in which Phvul.008G022600 and Phvul.008G024400 were abundantly expressed in all the tissues investigated.
The region containing the association between AREA and S1_379992973, a putative B3 domain-containing protein (Phvul.008G244300), exhibited low transcriptional levels in the 8 tissues considered herein. The plant-specific B3 superfamily encompasses well-studied protein families, including prominent transcriptional elements, such as auxin response factors (Swaminathan et al. 2008). Transcription factors associated with this domain have been proved to be involved in activating and repressing seed development and grain filling in several crops and model species (Guo et al. 2013;Peng and Weselake 2013;Ahmad et al. 2019;Yang et al. 2021). Finally, an association between SNP S1_507205534 and LENGTH was found in an exonic region of a ribosomal protein L11 methyltransferase gene (Phvul.011G163100). In this region, a cup-shaped cotyledon 2-like gene (Phvul.011G160400) has a homolog previously reported to be involved in ovule development (Gonc¸alves et al. 2015). Aida et al. (1997) studied mutations in 2 homologs of this gene and found defects in the separation of cotyledons that can lead to notable modifications in seed shape.
Although GWAS has been very widely used to investigate the genetic control of several important common bean traits (see  (Diaz et al. 2020), and flavor (Bassett et al. 2021), as far as we know ours is the first study to combine high-density genotyping and highthroughput image-based seed analysis to perform association mapping and dissect the genetic architecture of both seed size and shape in common bean. Our findings indicate that 4 genomic regions could explain a substantial amount of the phenotypic variation. From a practical standpoint, this study provides breeders with SNP markers that, given the concordance with the literature, are very good resources for use in marker-assisted selection. Moreover, several genes were pinpointed as interesting candidates underlying phenotypic variation in LENGTH, WIDTH, AREA, PERIM, and CIRC. These genes are promising choices for functional analysis, e.g. gene editing, which could validate their influence on seed shape and size in common bean and other related crops.
Supplemental material is available at G3 online.