Impact of early genomic prediction for recurrent selection in an upland rice synthetic population

Abstract Population breeding through recurrent selection is based on the repetition of evaluation and recombination among best-selected individuals. In this type of breeding strategy, early evaluation of selection candidates combined with genomic prediction could substantially shorten the breeding cycle length, thus increasing the rate of genetic gain. The objective of this study was to optimize early genomic prediction in an upland rice (Oryza sativa L.) synthetic population improved through recurrent selection via shuttle breeding in two sites. To this end, we used genomic prediction on 334 S0 genotypes evaluated with early generation progeny testing (S0:2 and S0:3) across two sites. Four traits were measured (plant height, days to flowering, grain yield, and grain zinc concentration) and the predictive ability was assessed for the target site. For days to flowering and plant height, which correlate well among sites (0.51–0.62), an increase of up to 0.4 in predictive ability was observed when the model was trained using the two sites. For grain zinc concentration, adding the phenotype of the predicted lines in the nontarget site to the model improved the predictive ability (0.51 with two-site and 0.31 with single-site model), whereas for grain yield the gain was less (0.42 with two-site and 0.35 with single-site calibration). Through these results, we found a good opportunity to optimize the genomic recurrent selection scheme and maximize the use of resources by performing early progeny testing in two sites for traits with best expression and/or relevance in each specific environment.


Introduction
Population improvement strategies are recognized as methods to exploit the genetic diversity of a crop and enrich the genetic basis of breeding programs. In rice, population breeding through recurrent selection (RS) was suggested as a valuable option in countering the decline in genetic diversity among the improved rice germplasm from Latin America and the Caribbean (LAC; Cuevas-Pé rez et al. 1992;Guimarães et al. 1996). RS in rice started in South America in 1985 (Taillebois and Guimarães 1989) and later spread to most of the continent through a Food and Agriculture Organization funded initiative (Châ tel et al. 2005;Martínez et al. 2014). In the region, RS was applied to rice synthetic populations, each composed of several elite materials, carefully chosen as founders, which had intercrossed for various generations (Guimarães 2005). Following recurrent cycles of selection and recombination, several thousand S 0 plants (S being used here to define the number of selfing cycles) are available for use in the breeding program either as new parents for population improvement or as S 0 progenies for variety development. The particularity of RS breeding in rice as performed in various countries in LAC is that it uses a recessive nuclear male sterility (ms) gene to facilitate outcrossing (reviewed in Frouin et al. 2014). This gene allows random recombination among a large number of parental plants at each cycle. Different ways are used to improve populations carrying the ms gene (Châ tel and Guimarães 1997). The most common practice is to evaluate a moderate number (200-300) of candidates randomly drawn from the synthetic population. The evaluation of the candidates is then performed through progeny testing, with more or less fixed families (S 0:2 , S 0:3 , or S 0:4 depending on the trait and required experimental design, obtained through several cycles of inbreeding and bulk harvest). Subsequently, parental lines are selected to be used for the next recombination cycle. Among others, two compromises have to be made that have a direct impact on the genetic gain achieved by the RS breeding scheme: (1) the number of candidate units evaluated through progeny testing with direct impact on the selection intensity and; (2) the required degree of fixation of those progenies prior to phenotyping, which would affect the breeding cycle length and influence the precision of genetic variance estimates.
Since its introduction by Meuwissen et al. (2001), genomic prediction (GP) has been widely adopted by animal and plant breeders alike. By allowing rapid selection of superior genotypes and accelerating the breeding cycle, GP has shown great potential since the advent of this new breeding paradigm in crop species in 2007 (Bernardo and Yu 2007). The value of GP in the context of RS is fairly evident as the selection based on genomic estimated breeding value (GEBV) can be applied to a very large population of genotyped entries through the calibration of a prediction model performed on a reduced set of training units. Furthermore, the average progeny phenotypic values associated with the genomic matrix of the respective S 0 individuals could allow a more precise estimate of the genetic variation in the case of early generation segregating candidate units. GP was simulated on multiparent populations (Heffner et al. 2011;Guo et al. 2012;Bian and Holland 2017;Allier et al. 2020a) directly related or not to a breeding program to assess the potential use of GP in genetic improvement through RS. However, few simulation studies have assessed the potential value of GP for crop synthetic populations (Mü ller et al. , 2018Schopp et al. 2017). Theoretically and through simulation approaches, recurrent genomic selection has the particular advantage of managing both the genetic gain and the maintenance of genetic diversity in the breeding program (Gorjanc et al. 2018;Allier et al. 2020b). In a simulated wheat breeding program, the inclusion of a step of population improvement with rapid recycling of early material proved to be greatly superior in terms of genetic gain compared with a program relying solely on biparental crosses between elite material to generate diversity (Gaynor et al. 2017). Similarly, recurrent genomic selection in soybean (Ramasubramanian and Beavis, 2020) and maize (Zhang et al. 2017) showed the long-term potential of RS combined with GP. GP has already been applied to material from RS for rice in single (Grenier et al. 2015; Morais Jú nior et al. 2017) and multi-environment contexts (Morais Jú nior et al. 2018). In these studies, the results showed relatively good predictive ability (PA) for various simple and complex traits such as plant height (PH), flowering date (FL), and grain yield (YLD). In both cases, however, the calibrations were based on material that underwent some degrees of fixation through plant selection and a few cycles of selfing. A significant jump in efficiency in these schemes is expected by calibrating on early generation candidates from S 0 progenies to save time in building the models and to accelerate the recycling of the selected germplasm.
GP integrating genotype by environment interaction (GxE) has proven successful, showing greater PA than the single environment prediction, provided environments are positively correlated. An approach to multi-environment GP was proposed and applied by Burgueño et al. (2012) where the authors modeled the environment and genotype covariance structure and used it within a mixed model framework. Later, GxE was incorporated in a GP model by separately capturing the main marker effect, common to all environments, and an environment-specific marker effect (Lopez-Cruz et al. 2015). This method is easy to implement and showed good results for wheat breeding under multiple environmental conditions. Additionally, it has the advantage that it enables working with different genotype covariance structures. Genotype covariances based on either a linear kernel (GxE GBLUP) or a Gaussian kernel (GxE RKHS) have been tested, and the Gaussian kernel allows a more flexible structure than the linear kernel and potentially better prediction (Cuevas et al. 2016). To optimize calibration with the multi-environment data, various strategies of genome-based models including GxE were proposed (Jarquín et al. 2020). The authors compared different partitioning of the calibration sets among the multiple sites where the population was tested, with different degree of overlapping of the genotypes between environments. Sparse testing designs in which subset of the genotypes are tested in each location was presented as a method to reduce the experimental effort and optimize the use of breeding program resources.
This study was conducted in the context of a collaborative rice breeding program between CIAT (International Center for Tropical Agriculture, member of the CGIAR centers) and Cirad (French Agricultural Research Centre for International Development). The CIAT-Cirad rice breeding program has historically conducted population development and improvement through RS. Its current RS program based on progeny testing is conducted in two locations; at CIAT-HQ in Palmira, where rice is cultivated all year round under irrigated conditions, and in Santa Rosa (SRO), an experimental site where rice is grown under rainfed conditions during the main cropping season. Although aiming to implement early GP in our RS scheme, we were also interested in making optimal use of all the data gathered in both locations (target and not target) for the breeding program. The main objective was to evaluate the PA of the GP model including the GxE interaction to obtain reliable estimates of the breeding value of selection candidates in the target site.

Development of PCT27 population
The genetic material used in this study belongs to the tropical japonica group of cultivated rice (Oryza sativa L.). Several synthetic populations developed in the CIAT-Cirad rice breeding program were improved for adaptation to upland ecosystems and acid soils. In 2015, Grenier et al. (2015) used a training set defined with 348 S 2:4 lines derived from four populations to study the potential of GP in an RS scheme. Of the 348 families at the S 2 generation, marker-assisted-selection for the ms gene (Frouin et al. 2014) helped to select [ms:MS] male fertile plants in 35 randomly sampled families. One single plant per family was selfed, and the seeds of each of 35 plants were mixed in equal proportion to generate a candidate population hereafter referred to as PCT27 ( Figure 1). Two recombination cycles were performed at CIAT-HQ in Palmira under irrigated conditions in a bundled field isolated from other rice experimental plots by at least 50 m to avoid pollen contamination and without any selection pressure. At each cycle, a population of about 3000 plants was established with male sterile and male fertile plants randomly distributed within the plot. The recombining units were then collected by harvesting male sterile plants pollinated by any male fertile plants in the vicinity. At the third cycle of recombination, 334 S 0 fertile plants were randomly extracted from the population to constitute our reference population. All entries were advanced to the S 0:2 and then S 0:3 generation by bulk harvesting seeds from 15 to 20 male fertile plants per line per generation. Additionally, 50 temporal checks from the same population were also advanced by bulk method to the generations S 0:2 and S 0:3 and were used to test the generation effect and the year effect within the site. The terms line and genotype were used indifferently in this work to refer to the S 0 plants and their bulked offspring at either generation S 0:2 or S 0:3 if specified.

Genotyping
Leaf tissues were sampled on the 334 S 0 plants and DNA extraction was performed as in Grenier et al. 2015. Genotyping was done by genotyping-by-sequencing (GBS) approach (Elshire et al. 2011). The detailed method is described in Appendix A and the genetic characterization of the population can be seen in Supplementary Tables and Figures. As a result of the genotyping and subsequent genetic analysis, the population was characterized by 9928 SNP markers fairly well distributed among the 12 rice chromosomes (Supplementary Table S1 and Figure S1). The MAF distribution among the 334 S 0 reflects a population where rare alleles were not depleted, which fits well with long-term objectives of a population breeding program (Supplementary Figure S2). The degree of allelic fixation varied greatly between the genotypes but remained relatively low for individuals at the S 0 generation (Supplementary Table S1). Considering the rather large average LD (Supplementary Table S2) and the slow LD decay observed (Supplementary Figure S3), the average marker density (1 SNP every 40 kb) was considered good enough to allow the capture of all linked QTLs with the SNP matrix in hand. The whole population was characterized, with a total absence of structure, which provides a good base for setting up a GP scheme through CV (Supplementary Figure S4).

Field trial and phenotyping
Field phenotyping was performed at two locations in Colombia. One site was an experimental field at CIAT-HQ in Palmira (PAL) located in the Valle del Cauca, Colombia (3.50 N-76.35 W, 1000 masl). At this location, rice evaluation trials are conducted under irrigated conditions and can be performed all year round due to favorable environmental conditions and good water availability that enable the irrigation scheme throughout the crop cycle. As it is not a rice prone area, no severe disease pressure is naturally present, and the rice crop usually expresses its full potential. On the other hand, SRO is an experimental site, owned by the Colombian National Federation of rice growers (Fedearroz) located in the Oriental plains of Colombia, in the department of Meta, Colombia (4.03 N-73.48 W, 300 masl). At this site, the rice crop is established through direct seeding and the trials are conducted under rainfed conditions during the main cropping season, between May and September. The predominance of rice Figure 1 Process followed for the development of the PCT27 population. Populations PCT4-C0, PCT4-C1, PCT4-C2 and PCT4-C3 were described in Grenier et al. (2015). Each population contains about 3,000 plants with half male fertile plants (! ? ) that can be selfed and half male sterile plants ($).
"SSD" is the single descend method of generation advance applied to 100 male fertile plants per population. indicates the selfing process. The "MAS" (marker-assisted selection) process was performed for the selection of S 2 plants based on genotypic profile at the ms gene. cultivation in this area, the climatic conditions of hot and humid summers during the main growing season and the natural occurrence of various strains of pathogens (bacterial, fungal, or viral) make this site a hot spot for disease screening. Four trials were conducted during two consecutive years, 2017 and 2018, using different semesters for each location. Field trials were established in PAL on December 4, 2017 and December 10, 2018 and in SRO on December 5, 2017 and May 30, 2018. At each site, the experimental design followed a lattice with 16 blocks and three repetitions. The 50 temporal check lines (only S 0:2 in the 2017 trials and S 0:2 and S 0:3 lines in the 2018 trials) were randomly distributed across the design within each repetition of the two sites and 2-year trials. In PAL, the trials were established after transplanting 3-week-old seedlings in a bundled field. The plot size was two rows of 17 plants with 25 cm between plants and between rows. Fertilizer application followed a split application with N, P and K nutrients added at 25 and 35 days after transplanting. Irrigation was maintained continuously in order to ensure a 25 cm layer of water in the field until a week prior to the crop maturation period. In SRO, the trials were established by direct sowing of two 4-m long rows, spaced by 26 cm at a density of one gram of seed per linear meter. Split fertilizer application was performed according to the recommended application for growing tropical japonica rice in upland soil conditions. Phytosanitary treatment was applied in SRO to prevent blast outbreaks. For all four trials, a similar design was applied, but with a different randomization.
Four traits were measured following the IRRI Standard Evaluation System (IRRI 2013) on the whole training population including the 50 temporal checks. FL was expressed as the number of days after crop establishment-being either the date after either transplantation (PAL) or sowing (SRO)-when 50% of the plants within a plot reached anthesis. PH was calculated as the average height measured in centimeters of five plants with their panicle extended. YLD was obtained by weighing the grains collected within each plot after discarding the plants at the start and end of each plot. For each harvested plot, percent humidity was measured and used to correct the weight of collected grains, expressed in grams per plot, for a relative humidity of 14%. The YLD value was neither adjusted for the plot size nor for the count of fertile plants. The grain zinc concentration (ZN), expressed in parts per million, was measured on a subsample of collected grains polished in Teflon equipment, using energy dispersive X-ray fluorescence spectrometry (X-supreme 8000, Oxford Instrument, Shanghai, CN) available at the CIAT-HQ Nutritional Laboratory. The exact same procedure was used for generation S 0:2 in 2017 and generation S 0:3 in 2018.
The 50 temporal checks were phenotyped as S 0:2 in 2017 and as S 0:2 and S 0:3 in 2018. This allowed measurement of the nonconfounded year within site effect on the S 0:2 and the generation effect in 2018 by analyzing the data from the S 0:2 and S 0:3 lines as presented in Appendix B.

Statistical models for GP
Raw data were visually explored for outliers as described in Appendix C. Based on clean data, Pearson's correlation between phenotypic BLUPs obtained in PAL and SRO was computed for generations S 0:2 and S 0:3 using the 334 S 0 families phenotyped in both generations.
All the models were estimated using ASReml-R v3.0 (Butler et al. 2007). GP was done independently in each generation. For single-site calibration, the following model was used: The fixed effects were the intercept l and the replicate effect r i . The random part was composed of the block effect b ij nested in replicate with distribution b $ N 0; Ir 2 b , the genotype effect g k that represents the progeny means with distribution g $ N 0; Mr 2 g and the residual e ijk with e $ Nð0; Ir 2 e ). The variance r 2 b is associated with the blocks, while r 2 g and r 2 e are the genotypic and error variances, respectively. The two variance-covariance matrices used are I for the identity matrix and M representing the genotype variance-covariance computed according to either of the two prediction methods described below.
For the two-site approach the following model was used: The fixed effects were the same as for Model 1, with an additional fixed site effect s i . The random part of Model 1 was completed with the genotype (progeny means) by site interaction In addition to the three variances described in Model 1, Model 2 includes two site-specific genotype by site interaction variances r 2 gsPAL and r 2 gsSRO as well as two site-specific error variances r 2 ePAL and r 2 eSRO . The error variance-covariance is modeled by the Kronecker product of the identity matrix and the variances matrix.
To compute the variance structure (M) for the genotype effect and genotype by site interaction (M PAL ; M SRO ), two different kernels were used. In the first approach, GBLUP, where M was based on the linear kernel M ¼ XX N (Lopez-Cruz et al. 2015), a proportional of the matrix proposed by VanRaden (2008) was used, with X being the genomic data with genotypes coded as À1, 0, 1 and N the number of markers. The second approach, RKHS, was based on the reproducing kernel Hilbert space approach by Gianola and van Kaam (2008). Three different variance-covariance structures were computed: one for the complete data ðM 0 Þ and one for each site independently M PAL ; M SRO ð Þ ; all based on the Gaussian kernel K e x me ; x ne ð Þ¼ exp Àh i k ð x me À x ne k 2 Þ, for x me ; x ne being two marker genotype vectors and m; n ð Þ2 1; . . . ; N f g 2 . The bandwidth h controls the decay rate of the correlation between the lines, smaller h giving a sharper correlogram. We computed h with the method proposed by Pé rez-Elizalde et al. (2015) and the provided R function marg.fun. A gamma prior distribution for h was used, the shape parameter was set at 3 and the scale parameter set at 1.5. Three different bandwidth parameters were computed as the method relies partially on phenotypes, and hence yields different kernels depending on the site. New bandwidth parameters were estimated at each cross-validation (CV) cycle based on the BLUP-adjusted phenotypes of the sampled training set, as in Pé rez-Elizalde et al. (2015). For both methods, the genotypic information was based on 9928.
Models 1 and 2 with identity matrix as variance-covariance matrices were used to compute broad sense heritability. H 2 at trial level (generation within site) was used as a measure for repeatability and computed using the formula: , where NR represents the harmonic mean number of plots per genotype. The global H 2 per generation (considering the two sites for each generation) was computed with Model 2 following the formula: , where NE is the harmonic mean of the number of sites per genotype, r 2 gsPAL and r 2 gsSRO are the genotype by site interaction variance for each site, NR is the number of plots per genotype across both sites and r 2 ePAL and r 2 eSRO are the residual variances for respectively PAL and SRO, respectively. Hence, H 2 was based on the genetic variance and the mean interaction variance and mean residual variance across the two sites (Holland et al. 2003, Isik et al. 2017. H 2 was computed with the function pin from R-package nadiv, adapted to use external harmonic means.

CV schemes for evaluating PA
Several CV schemes were used with different partitioning of the population among the two sites (Figure 2 and Supplementary  Table S3).
In the first instance, only phenotypic data from the target site of selection SRO was considered ( Figure 2). In that scenario, predictions were obtained based on Model 1 with a calibration based on a single site (SIN SRO ). Various calibration set sizes (s) were tested, s e f25, 50, 100, 200g.
For the two-site CV procedures, Model 2 was used. Calibrations were constructed with either a balanced (BAL) or imbalanced (IMB) representation of both sites. BAL1 represents a calibration method where both sites were represented by an equal number of phenotyped S 0 families. Sets of "s" S 0 were selected and their phenotypes in both sites were used for the training ( Figure 2). This corresponds to a CV1 in Burgueño et al. (2012). For BAL2, "s" refers to the number of S 0 families observed in SRO and in PAL, however, only a fraction of the families was observed in both sites (i.e., the overlap), the remaining families being observed in only one of them. An overlap of 50% of the total number families included in the calibration was targeted. For "s" S 0 families observed in both sites, the total number of S 0 families was then 3 2 s. For the IMB scenario, the whole population was phenotyped in PAL and only a fraction of size "s" was phenotyped in SRO ( Figure 2).
The same CV procedures were applied to each generation and with both GP models (GBLUP and RKHS). The GEBVs in SRO were obtained for the S 0 included in the validation set, defined as the set for which no phenotype at SRO was recorded. In each scenario, 100 alternative samplings were performed for which the PA was measured as corŶ; GEBV À Á . The referenceŶ was obtained with the complete SRO phenotypes using Model 1 and M ¼ I, I being an identity matrix and computed asŶ k ¼ l þ g k . GEBVs were obtained with the models including molecular information aŝ Y k ¼ l þ g k for SIN SRO orŶ SRO;l ¼ l þ s SRO þ g l for the other predictions (BAL1, BAL2, and IMB). For each scenario, the mean and the standard deviation of PA were computer on the 100 iterations.
To ensure that the variation in accuracy between the CV procedures was only due to the size differences in the training set, the correlations were always computed on the predictions for 100 genotypes randomly selected from the validation sets. However, for BAL2 with a training set size of 200, the validation set was reduced to 34 genotypes as those were the only genotypes with no phenotypic records that could be used for the validation with this strategy (Supplementary Table S3). The PA was still computed, but as the correlation was computed on only 34 points, the results must be considered with caution.

Effects of the calibration parameters on PA
To investigate the response of the PA to the calibration parameters, linear models were fitted to the PA obtained from the 100 iterations with each scenario. Depending on the scenario, the independent variables were year, GP method, CV scenario, training set size, and all their combinations. Proportion of variance associated with one or more main effect, errors or interactions were estimated through the Eta 2 , as Eta 2 ¼ SSq effect /SSq total , where SSq effect is the sum of squares for the effect under consideration and SSq total is the total sum of squares of all effects, errors, and interactions in the ANOVA study. Throughout the text, this ratio is expressed as a percentage.

Effect of sites and generations on the phenotypic performance
The phenotypic data were collected in two sites and on the same S 0 progeny at two generations. In each site, the phenotyping was done in 2017 for 334 families at the S 0:2 generation and in 2018 for the same 334 families at the S 0:3 generation.
For most traits recorded in the two locations, the mean phenotypic values differed between sites (Table 1 and Figure 3). Although the differences between sites were moderate for FL and PH, they were large for YLD and ZN, with more than 60% change in the 2017 trials. The S 0:2 families evaluated in 2017 had later flowering, shorter PH, lower yield and higher zinc concentration in SRO than in PAL. However, this tendency did not hold for the 2018 trials. The differences between sites in PH were greater at the 2018 trials, with taller S 0:3 plants in SRO. For each trait, the spread of the data was consistent across site and year with 0.4-4 points of difference in the coefficient of variation. The highest coefficient of variation was observed for YLD in 2018 (34%), and was higher than in the 2017 trial (27%). The trait broad sense heritability (H 2 ) at trial level showed large differences between traits and across sites and years. This measure of trial repeatability ranged from 0.52 for YLD in the PAL_2017 trial to 0.96 for FL in Figure 2 The four scenarios of CVs to evaluate the prediction accuracy in Santa Rosa (SRO). The first scenario (SIN SRO ) uses phenotypic information from a single site, whereas the three others include Palmira (PAL) phenotypes in two-site models. In the latter case, the level of information between locations is either balanced (BAL) or imbalanced (IMB). The gray area represents the genotypes included in the training set with a varying size "s" to calibrate the model and the green area represents the validation set fixed to 100 genotypes. SRO_2017. Heritability was systematically higher in SRO than in PAL, and similar or slightly increased in the 2018 trials for all traits in all locations, but for FL and YLD measured in SRO.
As the year and the generation effect were confounded, 50 temporal checks were used to untangle the potential effects of generation and year. The significance of the fixed effect and variance decomposition among the 50 temporal checks showed that differences were exclusively due to year effect and neither a significant generation effect nor a significant genotype by generation interaction could be observed (Supplementary Table S4).
For each trait scored in each year, an analysis of the variance components was performed on the combined data from both sites using Model 2 ( Table 2). The proportion of variance explained by the genotype effect was greater that of the combined genotype by site interaction effects from both sites (GxS PAL and GxS SRO ) only for FL in 2018, and PH recoded in both years. As a result, greater heritability was observed for these traits/year combination with H 2 ¼ 0.57, 0.50, and 0.62 for FL_2018, PH_2017, and PH_2018, respectively. The lowest genotype contribution to the explanation of variance was encountered for YLD, with large interaction effects and error effects associated with a particular site for each year, resulting in low H 2 in both years (H 2 ¼ 0.19 and 0.11 in 2017 and 2018, respectively). For ZN, the genotype effect represented a third of the combined GxS interaction variances in both year trials, leading to similar and moderate H 2 for both years (H 2 ¼ 0.38 and 0.40 for 2017 and 2018, respectively). The variance decomposition for each trait was coherent with the site correlation observed within years (Table 3). The highest correlations between SRO and PAL were observed for PH (r ¼ 0.62) and FL (r

Predictive abilities with calibration using single environment data
The effects of different parameters used for the calibration of the model were first investigated for the PA from the single environment CV in SRO; SIN SRO ( Figure 4). Similar global average PA were achieved for all traits combining set sizes, years and GP methods (PA ¼ 0.30, 0.33, 0.27, and 0.24 for FL, PH, YLD, and ZN, respectively; Supplementary Table S5). The linear model including all the factors taken individually, their first-order interaction and one second-order interaction explained 33-59% of the observed variation of PA (Table 4), indicating that a large proportion of the variability was due to the sampling of the CV method.
The training set size accounted for most of the PA variance explained by the model for all the traits. The largest training set size greatly improved the PA for all the traits (Eta 2 ¼ 22%, 53%, 39%, and 33% for FL, PH, YLD, and ZN, respectively). The year factor described a lower proportion of the total explained variance, with a maximum of 11% of the explained variance in PA for FL. For all traits GP model explained only a very limited proportion (<1%) of the variance. For most traits (PH, YLD, and ZN), the average PA was greater when predictions were performed with the GBLUP model. For this reason, the rest of the article will focus on the results achieved with the GBLUP model. However, the results for RKHS can be found in Supplementary Material (Supplementary Table S8).

Predictive abilities with calibration using single and two-environment data
Two CV scenarios including SRO and PAL (BAL1 and BAL2) were compared with SIN SRO including only SRO data to investigate the combined effect of the training set composition and its size ( Figure 5). The calibrations were tested in the two different years for their ability to predict line performance in SRO.
When the two sites were included in the training set, the main source of variation was the number of phenotypes from SRO and PAL included in the training set. Comparing the PA associated with the set size "s" in the case of BAL1 and BAL2 with the PA obtained with the same "s" in SIN SRO allowed us to assess the effect due to the addition of phenotypes from PAL to the training set. Globally, across all set sizes, PA in the BAL2 scenario was greater for all the traits considered (Supplementary data Table  S6), with average PA ranging from 0.23 for ZN to 0.38 for PH.
Although training set size was the factor explaining most of PA variation (>22%) for all traits, year effect had some importance (11%) but only for FL. CV methods on the other hand accounted only for a small fraction of the PA variation. The highest gains in PA provided by any two-site CV scenarios compared with the single-site model were obtained for the training set size of 50 to predict PH_2017 (PA increase of þ0.07) using the BAL2 model.

Two-site calibration as a sparse testing approach
So far, we have compared single-site prediction with two-site prediction methods to predict the phenotype of families that were never observed, based solely on between-family information exchange. Another possible approach is to take advantage of the population information by phenotyping all the families in one environment other than the one targeted for the prediction. As PAL is easier to manage, being free of main rice pathogens and closer to the research institute, we tested a scenario with unbalanced representation of the sites in the training sets (IMB), where all 334 families phenotyped in PAL and only a subset of a varying set size "s" phenotyped in SRO were considered. The PA were improved by including the phenotypes of the whole population in PAL in the training set, and this was consistently observed for all traits, although to a different extent ( Figure 6 and Supplementary Table S7). The largest differences in average PA were observed for FL (SIN SRO ¼ 0.29, IMB ¼ 0.56) and PH (SIN SRO ¼ 0.33, IMB ¼ 0.62). However, for both traits the increase of "s" did not yield a much higher PA with the IMB method. Average ZN predictions also benefited from PAL information, but less so (SIN SRO ¼ 0.24, IMB ¼ 0.45). For those three traits, the average PA with the IMB method was rather close to the phenotypic correlation between the two sites (dotted line in Figure 6). Conversely, for YLD the average PA was similar between SIN SRO (0.27) and IMB (0.34), with values above the indirect phenotypic prediction as represented by the site correlation. The partition of factor effects in the linear model revealed that the proportion of variance explained by the CV method depended on the traits (7% for YLD compared with !50% for all other traits; Table 5). Only for YLD did the set size account for a large fraction (32%) of the explained PA variance. The contribution of the year effect to the total PA variance was low ( 1%) for YLD and ZN while still contributing to a small portion of the variance for FL and PH (10% and 6.5%, respectively). For both traits, average PA was higher in the S 0:3 2018 trials.

Evaluation of early generation progenies
The training population with which we tested the various CV scenarios had the expected characteristics for applying GP, both in terms of marker density relative to the specific population LD and total absence of structure among the 334 S 0 genotypes (Appendix A and Supplementary Tables and Figures).
Our progeny phenotyping method could not capture the within-line variation, as we recorded traits as the mean of the evaluated plot (FL and PH) or from the bulked harvested plot (YLD and ZN). For most combinations of traits and sites, the difference in H 2 between the S 0:2 and S 0:3 progeny testing was limited and fell within the confidence interval of each other.
However, the H 2 of the S 0:2 progenies was significantly higher for FL and YLD in SRO and was significantly lower for PH in PAL. This lack of consistency suggested that changes were driven more by   environmental causes than by the degree of allelic fixation within the genetic material. This was supported by the temporal checks for which a significant year effect could be observed for all traits and sites, while no effect of the generation was observed. We concluded that the changes in mean between S 0:2 and S 0:3 within sites were essentially driven by the environment effect. As the phenotypic variance due to generation was minor compared with the variance associated with the year, generation advance did not seem to influence the PA. For time and economic reasons, calibration on S 0:2 phenotypes could thus be preferred as it allows a reduction of the breeding cycle length and cost.

Potential of early GP
We first tested GP models on the early generation phenotypes collected in a single environment. As expected, regardless of the generation, the four traits showed differences in mean PA. FL and PH were overall the best predicted traits, followed by ZN and YLD. This was fairly consistent with what is reported in the literature where FL and PH generally show high PA in absolute terms and relative to yield parameters (Combs and Bernardo 2013;Spindel et al. 2015;Ben Hassen et al. 2018a,b). However, when comparing with another GP study performed on families derived from rice synthetic populations much higher PA for FL was achieved than in Grenier et al. (2015), where average PA for FL reached only a maximum of 0.29 for the population of 343 S 2:4 lines. Conversely, maximum PA for PH (0.46) was comparable with the PA obtained for the 343 S 2:4 (0.50) (Grenier et al. 2015), The results are for the CV SIN SRO scenario. Eta 2 is the proportion of variance associated with each effect and R 2 is the coefficient of determination obtained from a linear model applied to the data from the 100 iterations (n ¼ 1600).    (Spindel et al. 2015), but lower than that reported for the 174 S 1:3 lines (0.44) (Morais Jú nior et al. 2018), despite an H 2 for YLD that was higher in our study (H 2 ¼ 0.58) than in the two others aforementioned (0.44 in S 1:3 lines and 0.32 in the diversity panel).
Overall, for these commonly reported traits, the PA obtained in our study did not greatly differ from what was reported for GP in rice diversity panels or synthetic populations (as reviewed in Ahmadi et al. 2020).
Although various studies on maize and spring wheat have proven the effectiveness of the GP-based approach for kernel zinc concentration, to our knowledge no study applying GP to rice for grain zinc concentration has yet been reported. Grain zinc concentration is a complex trait greatly influenced by soil and other associated factors (Jin et al. 2013;Hindu et al. 2018;Velu et al. 2018;Naik et al. 2020), so there are great hopes that GP will simplify the process of breeding rice for nutritional quality. On average, the PA for ZN in a single environment was low (0.26 and 0.24, for 2017 and 2018, respectively). However, the maximum PA in SIN SRO reached 0.36 with 200 S 0:2 progenies (2017 data and RKHS model), which is comparable to the average estimated PA obtained with the fivefold CV1 model applied to the HarvestPlus association mapping panel of 330 wheat lines (PA ¼ 0.36; Velu et al. 2018).

Effect of the GP methods on PA
In the context of single-site analysis, we found that the two prediction methods, GBLUP and RHKS, induced some differences in PA only for FL. Although GBLUP uses a linear kernel that models only the additive effects, RKHS uses a Gaussian kernel that carries the additive effects and the additive-additive epistatic effects at every possible order (Jiang and Reif 2015). RKHS has been reported to perform better than the linear model in the presence of epistasis (Gonzá lez-Camacho et al. 2012;Jiang and Reif 2015;Onogi et al. 2015). Epistasis has been reported in FL (Hori et al. 2016), PH (Yu et al. 2002;Shen et al. 2014), YLD (Luo et al. 2001;Xing et al. 2002), and ZN (Lu et al. 2008;Norton et al. 2010); however, both GP methods performed similarly for the traits we looked at in our population. The phenotypes we considered were all progeny means, which represent the breeding value or additive effect of our tested S 0 (Falconer 1960). Different and opposed epistatic effects can appear in the same family and have probably impeded RKHS from capturing them accurately. Limited differences between the two GP methods have also been reported in previous studies testing predictions for rice collections of fixed accessions (reviewed by Ahmadi et al. 2020) or S 1:3 lines extracted from synthetic populations (Morais Jú nior et al. 2018). Given our phenotypes and considering the PA, GBLUP appeared as the most appropriate method in our context of population breeding considering single or two-site phenotyping data in our calibration models.
Prediction of the target environment using the two-site calibration model Most of the contrasts in phenotypic records observed between the two sites were due in large part to the differences in crop establishment, soil conditions, climatic, and biotic constraints as well as field management. Between irrigated and rainfed conditions, not only yield performance was expected to be affected by the environmental conditions, but also the grain zinc concentration, these two traits showing lower correlation between sites. Under flooded conditions, the soil oxygen and redox potential will drop and trigger the formation of non-available zinc or its adsorption onto different compounds, depending on the soil type (reviewed in Rehman et al. 2012). As PAL is subject to continuous flooding, low zinc availability was expected and, consequently, observed ZN was much lower than in SRO.
Knowing the environment effect on the trait expressions and the phenotypic correlation between the sites, we tested the potential of GP including two sites with various CV schemes involving several factors. Of all the factors tested in the scenarios, the training set size had the most influence on PA. Training set size explained most of the differences observed for all the traits. The year of phenotyping was best in explaining the PA variations only for FL, which could be related to climatic differences and/or small changes in crop establishment date, which both are known to affect crop phenology. The CV methods SIN SRO , BAL1, and BAL 2, accounted for a small portion of the variance explained by the models. In general, the two BAL scenarios showed a limited advantage over SIN SRO for all traits. The prediction of unobserved genotypes for a specific environment using a two-site model was as precise as that obtained with a single-site model. Indeed, the prediction of gs SRO is based on the same amount of information as the g k from a SIN SRO calibration. For this reason, the two-site calibration could perform better only if g k is more precise and relatively larger (larger associated variance) than gs SRO , but this is expected only for well-correlated environments. BAL1 and BAL2 differed in the number of genotypes repeated over the two sites. In BAL1, 100% of the genotypes included in the calibration had phenotypes in both sites (the overlapping proportion), whereas only 50% of the included genotypes had phenotypes of both sites in BAL2. The effect of the overlap proportion was tested by Jarquín et al. (2020) in a study that assessed the effects of data allocation on the PA of genomic-enabled prediction models. With their GxE model (M3 as presented in their article), the use of overlapping sets of genotypes improved the precision. In our case, the tendency was the reverse. Maintaining similar efforts in phenotyping in both sites while reducing the overlap (BAL1 and BAL2 with the same "s" progenies) resulted in higher precision in the predictions, but only for a specific case of PH_2017 with small training set size of 50 genotypes. For PH, exploring more of the population genetic variability within relatively small training set sizes might have had a greater impact thanks to a higher phenotypic correlation between sites.
Although neither BAL1 nor BAL2 could greatly improve PA compared with SIN SRO , calibrating with the whole population phenotyped in PAL and only a subset "s" of the population in SRO for predicting in SRO (IMB) generated substantial improvement of PA for all traits. The interest of this sparse testing method lies in borrowing information within lines across environments (Lopez-Cruz et al. 2015). However, if the phenotypes are not correlated between sites, benefits from the inclusion of both environments are expected to be low, as we found with YLD, where less improvement of PA was achieved through IMB than for the other traits. Generally, sparse testing is in most cases more precise than the prediction of unobserved genotypes in known environments, regardless of the calibration method used (Burgueño et al. 2012;Jarquín et al. 2014;Lopez-Cruz et al. 2015;Ben Hassen et al. 2018a;Millet et al. 2019). However, as the predicted lines must be observed in at least one environment, the burden on the phenotyping still remains, but the effort can lead to an increase in PA for traits with strong to moderate environment correlation, as was the case for FL, PH, and ZN. For ZN, which has only a moderate site correlation, the IMB yielded a large gain in PA even with a drastic reduction of the phenotyping effort in SRO (from SIN SRO _s25 ¼ 0.14 to IMB_s25 ¼ 0.44 with the 2018 data). Overall, the sparse testing provided an improvement in the prediction of ZN in the rice synthetic population, with average PA (IMB_s200 ¼ 0.51 with the 2018 data) in the range of those reported for spring wheat (Velu et al. 2016 ) and maize (Mageto et al. 2020).

Optimization of calibration procedure for GP
In our study, we tested the calibration of a GP model using phenotypic records gathered from early progeny testing in two sites. The potential of using two-site data and sparse testing for the model calibration, was considered as a satisfactory measure to predict most traits, even for YLD, despite a slightly reduced advantage compared with what was reported for the other traits.
We have demonstrated that the calibration using phenotypic data collected on progeny testing at two successive early generations could deliver relatively good and comparable PA. This opens up possibilities for rapid cycling RS, with recycling of parental lines from the genotyping of S 0 plants, based on the breeding value of the S 0 . Yet, there is still a need to confirm that the models do predict well the performance of more advanced generations for inbred line development. Indeed, the units to derive in the pedigree breeding scheme should be selected on the basis of "varietal ability" (Gallais 1979), which is the expected value of all lines within a family at fixation. This will be explored in our next study, with an external validation of the GP models using a different set of S 0 progenies extracted from the PCT27 and brought to near fixation.
We are aware that the optimized scheme we suggest, based on random sampling of the training set, genome-wide markers considered as random effects, and random allocation of genotypes to sparse testing could be improved further still by considering other criteria known to increase the performance of GP. It remains to be seen whether PA can be improved by optimized assembly of the training set as performed in various studies (Rincent et al. 2012(Rincent et al. , 2017Bustos-Korts et al. 2016; Akdemir and Isidro-Sá nchez 2019; Mangin et al. 2019), by inclusion of particular weights for some specific loci (Spindel et al. 2016;Bhandari et al. 2019;Frouin et al. 2019) or by use of an efficient method to proceed to sparse testing in the context of GxE models (Ahmadi et al. 2020).
Notwithstanding optimization of the calibration to develop efficient prediction models to fit our scheme, we ought also to consider the gain of applying GP-aided RS in our rice breeding program. So far, only the PA within generations has been tested, starting with the extraction of S 0 fertile plants of the C n cycle. Prediction of S 0 in C nþ1 would be done with calibration based on data from the previous cycle C n . This has been tested through simulation Ramasubramanian and Beavis 2020) and showed that the persistency of PA across cycles could be achieved with the accumulation of data from several past cycles. Simulation studies will be performed on our population to optimize the long-term use of GP-aided RS and define how and when it is best to upgrade the calibration model. The simulation will also offer the opportunity to improve the prediction and apply genomic selection while maintaining enough genetic diversity for further use of the population.

Data availability
All supplementary tables, figures, and the data used in this study are available at figshare: https://doi.org/10.25387/g3.14139806.