- Split View
-
Views
-
Cite
Cite
Zerui Zhang, Lizhi Wang, A look-ahead approach to maximizing present value of genetic gains in genomic selection, G3 Genes|Genomes|Genetics, Volume 12, Issue 8, August 2022, jkac136, https://doi.org/10.1093/g3journal/jkac136
- Share Icon Share
Abstract
Look-ahead selection is a sophisticated yet effective algorithm for genomic selection, which optimizes not only the selection of breeding parents but also mating strategy and resource allocation by anticipating the implications of crosses in a prespecified future target generation. Simulation results using maize datasets have suggested that look-ahead selection is able to significantly accelerate genetic gain in the target generation while maintaining genetic diversity. In this paper, we propose a new algorithm to address the limitations of look-ahead selection, including the difficulty in specifying a meaningful deadline in a continuous breeding process and slow growth of genetic gain in early generations. This new algorithm uses the present value of genetic gains as the breeding objective, converting genetic gains realized in different generations to the current generation using a discount rate, similar to using the interest rate to measure the time value of cash flows incurred at different time points. By using the look-ahead techniques to anticipate the future gametes and thus present value of future genetic gains, this algorithm yields a better trade-off between short-term and long-term benefits. Results from simulation experiments showed that the new algorithm can achieve higher genetic gains in early generations and a continuously growing trajectory as opposed to the look-ahead selection algorithm, which features a slow progress in early generations and a growth spike right before the deadline.
Introduction
The core of plant breeding is in the selection of breeding parents to improve traits of interest, such as yield, tolerance to environmental stress, resistance to pests, most of which are quantitatively inherited (Wricke and Weber 2010). Traditional selection strategies used to focus on observable phenotype or a handful of assisting markers related to the desired traits. However, these methods are not applicable to polygenic traits, i.e. traits consisting of many small-effect alleles for which effects are scattered and difficult to determine reliably (Jannink et al. 2010). With the development of high-throughput genotyping and single nucleotide polymorphism (SNP) effect estimation (Heffner et al. 2009; Jannink et al. 2010) shedding lights on the field, genomic prediction emerged as a technique for linking genomic information of quantitative traits to phenotypic values. Genomic selection (GS) is an approach to exploiting genomic markers to cater to novel breeding programs and evaluation. In this technique, the genomic estimated breeding value (GEBV), i.e. the sum of the estimated marker effects for a specific individual becomes a popular criterion to evaluate the breeding potential for certain traits without relying on individual’s phenotype (Goddard 2009).
Improvements in genomic prediction models for complex patterns, such as genotype by environment interactions (G × E), have received great attention. Much of the work on GS has been on the design and execution of field trials (Heslot et al. 2015; Crossa et al. 2017; Hickey et al. 2017). What appears to be missing in the body of literature is computational algorithms that use genomic prediction to intelligently select individuals or groups worthy of breeding. These algorithms should be able to not only select optimal breeding parents but also strategically determine the number of crosses to make and progenies to produce under resource and time constraints.
Conventional genomic selection (CGS) selects the individuals with the highest GEBVs as breeding parents (Meuwissen et al. 2001), which are assumed to be most likely to produce superior offspring. CGS has been widely adopted in both plant and animal breeding practices due to its simplicity and effectiveness. However, this truncation approach often leads to loss of genetic diversity after only a few breeding cycles. Several more sophisticated selection algorithms have been proposed to address such limitations. Weighted genomic selection modifies the SNP effects with weights to render the inheritance of favorable alleles with low population frequencies (Heffner et al. 2010). Optimal haploid value (OHV) evaluates a breeding parent not by its own genetic value but by the genetic value of the best gamete that it can produce in the immediate next generation (Daetwyler et al. 2015). OHV also aggregates adjacent SNP markers as recombination blocks distributed across chromosomes to accelerate the computation. Optimal population value (OPV) introduces the concept of group selection and selects a group of breeding parents that possess favorable alleles in complementary loci, thus can produce the best progeny in the long term (Goiffon et al. 2017). Look-ahead selection (LAS) extends the concept of OHV and OPV by anticipating the implications of crosses in the current generation on the progeny produced in a prespecific future generation. By selecting optimal crosses to maximize the long-term performance, LAS has been found to be able to not only accelerate genetic gains but also preserve genetic diversity (Moeinizade et al. 2019).
Building on the prior work, we propose an algorithm to overcome 2 limitations of LAS, which are the challenges to specify an appropriate deadline in the context of continuous genetic improvement and the weak performance before the deadline. The new algorithm borrows the concept of present value (PV) from the field of finance and uses it to define a new breeding objective, which converts genetic gains in different future generations over a planning window back to the current time using a discount rate, and the trade-off between short-term and long-term performances can be adjusted using the parameters of window length and discount rate. Details of this method are developed in the next section.
Materials and methods
Nomenclature
Here, we define some of the parameters and variables used for modeling GS.
N | Number of the individuals in a population, a scalar |
L | Number of SNPs of an individual, a scalar |
T | Number of generations in a breeding program, a scalar |
S | Number of breeding parents to be selected, a scalar |
G | Genotype of a population, a binary matrix , with element indicating whether the allele in the first (m = 1) or second (m = 2) chromosome of diploid individual i at locus l is a major allele () or a minor allele () |
SNP effect, a vector β , with βl being the allele effect for locus l | |
r | Recombination frequencies, a vector , with rl being the recombination frequency between loci l and l + 1 |
v | GEBVs, a vector , with vi being the GEBV of individual i |
N | Number of the individuals in a population, a scalar |
L | Number of SNPs of an individual, a scalar |
T | Number of generations in a breeding program, a scalar |
S | Number of breeding parents to be selected, a scalar |
G | Genotype of a population, a binary matrix , with element indicating whether the allele in the first (m = 1) or second (m = 2) chromosome of diploid individual i at locus l is a major allele () or a minor allele () |
SNP effect, a vector β , with βl being the allele effect for locus l | |
r | Recombination frequencies, a vector , with rl being the recombination frequency between loci l and l + 1 |
v | GEBVs, a vector , with vi being the GEBV of individual i |
N | Number of the individuals in a population, a scalar |
L | Number of SNPs of an individual, a scalar |
T | Number of generations in a breeding program, a scalar |
S | Number of breeding parents to be selected, a scalar |
G | Genotype of a population, a binary matrix , with element indicating whether the allele in the first (m = 1) or second (m = 2) chromosome of diploid individual i at locus l is a major allele () or a minor allele () |
SNP effect, a vector β , with βl being the allele effect for locus l | |
r | Recombination frequencies, a vector , with rl being the recombination frequency between loci l and l + 1 |
v | GEBVs, a vector , with vi being the GEBV of individual i |
N | Number of the individuals in a population, a scalar |
L | Number of SNPs of an individual, a scalar |
T | Number of generations in a breeding program, a scalar |
S | Number of breeding parents to be selected, a scalar |
G | Genotype of a population, a binary matrix , with element indicating whether the allele in the first (m = 1) or second (m = 2) chromosome of diploid individual i at locus l is a major allele () or a minor allele () |
SNP effect, a vector β , with βl being the allele effect for locus l | |
r | Recombination frequencies, a vector , with rl being the recombination frequency between loci l and l + 1 |
v | GEBVs, a vector , with vi being the GEBV of individual i |
Formulation of GS
GS benefits from high-density markers used in whole-genome prediction models, which pave the way for effect estimation of quantitative trait locus for the traits of interest. Here, we use SNPs data, which are a common type of genetic variation among the population. With assumptions of linear additive SNP effects and appropriate high-dimensional point estimation methods, one can model the quantitative relationship between and the GEBV of individual i as , where μ is overall mean (Desta and Ortiz 2014).
Here,
Objective function represents the genetic breeding value or other appropriately defined breeding objectives. For example, CGS uses the total GEBVs of all selected breeding parents as its objective function: .
Decision variables for all are a binary variable that indicates whether individual i is selected or not , as shown in constraint (3).
Review of the look-ahead selection algorithm
Here,
decision variable xi indicates whether individual i is selected as a breeding parent (xi = 1) or not (xi = 0);
decision variable indicates whether individual i is mated with j () or not ();
parameter represents the current generation number;
is the GEBV of the best progeny in the final generation that has a probability of occurrence of at least ;
parameter γ is a risk tolerance parameter, a larger (smaller) value of which will incentivize the model to maximize the performance in more (less) optimistic scenarios; and
is the GEBV of a random progeny in the final generation T, created using the breeding decisions x and Y through a look-ahead simulation proposed in Moeinizade et al. (2020).
The introduction of the additional decision variable Y allows the model to further accelerate genetic responses by optimizing mating strategies of the selected individuals (Toro and Varona 2010; Wang et al. 2018). Constraint (5) defines , which is the γ quantile of g, the GEBV of a random progeny in the final generation T. This constraint also interprets the objective function (4), which is to maximize the genetic gain in the best possible scenario with a probability of at least . Constraint (6) ensures that only selected individuals be mated with each other. Constraints (7) and (8) make sure that a total of S crosses are made and that the mating is symmetric. Constraint (9) requires all decision variables to be binary.
Motivation for improvement
Despite the effectiveness of the LAS algorithm in accelerating genetic gain while preserving genetic diversity, it has 2 major limitations that motivated the design of the proposed algorithm.
First, it is challenging to determine an appropriate value for the breeding deadline, since the goal of breeding projects is usually to achieve continuous genetic improvements. To address this limitation, we replace the fixed deadline with a rolling horizon, which restarts its planning timeline over an adjustable time interval. The interval acts as a sliding window to smooth the variability in future offspring and consider the dependence along the breeding process.
Second, the LAS algorithm focuses exclusively on maximizing genetic gain in the terminal generation without considering earlier performance, potentially resulting in unacceptably low short-term genetic gains. In the proposed model, we redefine the breeding objective as the PV of genetic gains over the rolling horizon, which takes a user-defined discount rate, λ, accounting for the time value of genetic gains and the market value of successful release of new commercial lines from the breeding program. A genetic gain achieved in generation τ would be only times as valuable as the same genetic gain would have in generation 0. A larger λ assigns a higher time value of genetic gain, putting higher weights on shorter-term performances.
As a result, the new method is expected to achieve higher genetic gains in earlier generations albeit at the cost of a weaker performance in the final generation, as illustrated in Fig. 1. This is achieved by maximizing the PV of genetic gains in all generations, with earlier gains having higher values than later ones, similar with the financial concept of “time value of money.” Similar with the compound interest rate, the discount rate can be used to adjust the trade-off between short-term and longer-term genetic gains.
PV-based look-ahead selection algorithm
Here,
parameter W is the length of the sliding window, indicating the number of generations to look ahead;
parameter λ is the discount rate, indicating the time value of genetic gains. Specifically, the genetic gain in τ generations is as valuable as in the current generation; and
is the γ quantile of a random progeny’ GEBV in the τth future generation given the current decision variables x and Y.
The model introduces 2 additional parameters: the length of the sliding window W and the discount rate λ. The window length defines the longest term that the breeders look ahead in the planning horizon, whereas the discount rate determines the trade-off between short-term and long-term genetic gains. A larger (smaller) λ places a higher emphasis on shorter-term (longer-term) performances.
PV-LAS can be seen as an extension of LAS with a modified approach to looking ahead. LAS maximizes the performance at a predefined target deadline, which gradually reduces the planning horizon as the breeding process progresses toward the deadline. In contrast, a moving horizon is used in PV-LAS, where the planning horizon is always the length of the sliding window, W. As such, PV-LAS is more applicable to breeding programs with goals for continuous genetic improvement. The change in planning horizon requires a different formula to calculate the recombination frequency in a future generation. In particular, equation (13) for LAS in Moeinizade et al. (2019) should be modified as the following for PV-LAS.
Optimization framework
We propose a heuristic search approach to find optimal selection and mating decisions for PV-LAS by iteratively searching the solution space. The workflow of this heuristic algorithm is described as follows:
Input: G, , r, W, λ, S, K, and γ.
Step 1: Identify a feasible solution and use it as the incumbent solution. Denote the corresponding objective value of the incumbent as .
Step 2: Randomly choose such that . For all , evaluate the new solution , which is defined as and . If is feasible and has a better objective value than the incumbent, then update the incumbent solution and its corresponding objective value . Repeat this step until no further improvement of the incumbent can be made.
Output: the locally optimal x and Y.
From our computational experiences, the LAS algorithm almost always found selfing as the optimal breeding strategy for the last generation. Our explanation for this observation is that when the breeding goal is to maximize the genetic gain in the immediate next generation, the value of a breeding parent can be largely determined by its best gamete that can be produced (within certain risk tolerance). As such, selfing the top #1 breeding parent would be more likely to produce a better progeny than crossing the top #1 with top #2. While this strategy has produced satisfactory results for the case study, it is unnecessarily the optimal strategy for all breeding programs.
Illustrative example
We use a toy example to illustrate the difference between LAS and PV-LAS. Suppose we aim to make S = 3 crosses from a group of N = 8 diploid individuals, each genotyped with L = 10 SNPs. The breeding deadline in LAS and the length of sliding window in PV-LAS were both set as . The arbitrarily simulated SNP effect , recombination frequency and finalized selection and mating results for LAS and PV-LAS are showed in Fig. 2. For each of these 3 future generations, K = 500 future gametes have been simulated. We used .
Figure 2 illustrated the solutions given by the 2 methods. Both methods included a cross between the second and the eighth individuals from the left, whereas the remaining 2 crosses were different. Figure 3 shows vertical histograms of GEBVs of 500 random progeny in 3 generations using LAS and PV-LAS, which demonstrated the major differences of these algorithms: LAS resulted in higher genetic gains in the final generation, whereas PV-LAS improved the growth in the first 2 generations with a slightly compromised performance in the third.
Results and discussion
Data sources
We conducted computational experiments using the same data set as Moeinizade et al. (2019): the genotypic data G contains genotypes of 369 maize inbred lines consisting of L = 1,406,757 SNPs distributed across 10 maize chromosomes (Goiffon et al. 2017). These SNPs were aggregated into 10,000 blocks to increase the computational speed. SNP effects were estimated on the basis of 369 shoot apical meristem phenotypes using the BayesB model (Meuwissen et al. 2001; Leiboff et al. 2015). Recombination rates were based on the genetic map developed from maize nested association mapping (Yu et al. 2008). We assume that the marker effects and recombination rates have been estimated reasonably accurately and they stay fixed in our simulations. As a caveat for this simplifying assumption, however, we point out that LAS and PV-LAS are more sensitive to the accuracy of allele effects and recombination frequencies than CGS, because the former 2 extract more information from such data to make more sophisticated decisions.
Simulation setting
The aim of the simulation was to evaluate the performance of 3 algorithms: CGS, LAS, and PV-LAS with respect to genetic improvement throughout multiple breeding generations. For a group of offspring G with a size of N in at the end of the breeding program, the following criteria were used for evaluating the 3 algorithms:
Mean of GEBV: . This criterion measures the average performance of genetic gains.
Lower potential of GEBV: . This criterion gives the theoretical lower bound of GEBV based on the remaining genetic diversity.
Upper potential of GEBV: . This criterion gives the theoretical upper bound of GEBV based on the remaining genetic diversity.
PV of GEBV: . This criterion measures the PV of GEBVs over a period of T generations for a given discount rate λ.
Each of the simulations consisted of 3 steps: initialization, selection, and reproduction. In the initialization step, N = 200 individuals were randomly selected among the given 369 maize inbred lines. In the selection step, GS algorithms were used to select the optimal crosses. In the reproduction step, crosses from the previous step are made, each producing N = 200 progeny, and then it goes back to the selection step for the next generation until the final generation T = 10. Each algorithm was tested for 500 independent simulations. Parameters and were used for all experiments.
For CGS, S = 10 pairs of individuals with the highest GEBVs were selected and randomly mated to make 10 crosses, each producing 20 progenies.
For LAS, S = 10 pairs of individuals were selected and mated.
For PV-LAS, S = 10 pairs of individuals were selected and mated.
Performance comparison
Results for genetic gain comparison are shown in Fig. 4. The GEBV mean in Fig. 4b matched our expected trajectory in Fig. 1. Figure 4, a and c showed how PV-LAS compromised genetic diversity, with respect to LAS, to achieve higher growth in early generations, although PV-LAS still largely outperformed CGS in both genetic gain and genetic diversity. Figure 4, d–f showed mean, lower potential, and higher potential of GEBV together for the 3 selection methods. Although LAS and PV-LAS use a more forward-looking selection strategy than CGS, they have narrower ranges between upper and lower bounds in the first 2 generations. This counter-intuitive result is because they only select parents whose favorable alleles can be aggregated within time and resource constraints, which means that some otherwise high-performing parents may be discarded if their favorable alleles require more time or resources than available to be integrated with the selected ones. On the other hand, CGS will select all high-performing parents without anticipation of their future progeny, which may lead to higher genetic diversity in the first generations but inevitable loss of genetic gain and diversity in the longer term.
Figure 5 compares the empirical cumulative distribution functions of the PVs of GEBVs resulted from different selection methods. We observe that PV-LAS is on the right-hand side of LAS in almost all quantiles, which indicates the stochastic dominance of PV-LAS over LAS in terms of the PV of genetic gains. This was expected because PV-LAS was designed to optimize the PV of GEBVs.
Sensitivity analysis on PV-LAS parameters
Sensitivity analysis was performed to test the influence of window length W and discount rate λ on the performance of PV-LAS. Figure 6 shows the average (over 500 independent experiments) differences in genetic gains between the benchmark case of W = 1 and other window lengths. When W = 1, the model focused on genetic gain in the immediate next generation; as a result, GEBVs jumped to a plateau after the first selection and lost diversity and potential for future growth. The figure shows that the effect of W is not monotonic and that a balance between short-term and long-term growth requires a window length that is neither too short nor too long. For this particular case study, W = 3 achieved the best performance between generations 6 and 10, but it is unnecessarily optimal for other studies or datasets. In general, the sensitivity analysis should be done for each new dataset to identify the best set of parameters.
Figure 7 shows the differences in genetic gains between the benchmark case of λ = 0 and other discount rates, with window length fixed at W = 3. When λ = 0, the model focused on the nominal genetic gain and ignored its time value. Larger (smaller) λ values put higher (lower) emphasis on the time value of genetic gain and led to higher (lower) growth in the short-term and weaker (stronger) performance in later generations.
Conclusion
The introduction of the PV concept to GS is central to this work. PV-LAS uses the PV of GEBVs over a certain window period as the breeding objective by discounting genetic gain in future generations back to the current time. As such it balances the short-term and long-term benefits of genetic growth and provides a continuous growth trajectory. At the same time, PV-LAS makes a moderate compromise on genetic diversity in order to achieve a higher growth in early generations.
Computational results demonstrated the effectiveness of the new algorithm. Optimal window length W and discount rate λ for specific data sets can be determined using sensitive analyses.
Several research directions are worthy of future investigation. First, the simulations in this paper were based on the assumption that the estimates of additive effects of SNPs and recombination frequencies are reasonably accurate. Analysis should be conducted to assess the effects of inaccurate estimations and how to mitigate such effects. Second, we can extend PV-LAS to accommodate both additive and nonadditive effects, such as dominance and epistasis effects. Third, PV-LAS does not explicitly address G × E, which can often be substantial and create problems in finding consistently superior genotypes, leading to reduced heritability and overall genetic gain. Future research should focus on the convergence of crop modeling and machine learning approaches to explore more advanced strategies to address G × E in the breeding process.
Data availability
The datasets used the computational experiments were derived from sources in the public domain as described in Data Sources.
Acknowledgments
The authors are grateful to the editor and anonymous reviewers for their insightful feedback, which greatly improved the quality of this manuscript. LW is thankful to Dr. Jean-Luc Jannink for inspiring conversations that contributed to the motivation of this work.
Funding
This work was primarily supported by The National Science Foundation under LEAP HI and GOALI programs (grant number 1830478) and EAGER program (grant number 1842097) and by Plant Sciences Institute at Iowa State University. Computing resources were kindly provided by the High Performance Computing Facility at Iowa State University.
Conflicts of interest
None declared.
Author contributions
LW and ZZ conceived the project and wrote the manuscript. ZZ coded the algorithms and performed the computational experiments.