Genomic mating in outbred species: predicting cross usefulness with additive and total genetic covariance matrices

Abstract Diverse crops are both outbred and clonally propagated. Breeders typically use truncation selection of parents and invest significant time, land, and money evaluating the progeny of crosses to find exceptional genotypes. We developed and tested genomic mate selection criteria suitable for organisms of arbitrary homozygosity level where the full-sibling progeny are of direct interest as future parents and/or cultivars. We extended cross variance and covariance variance prediction to include dominance effects and predicted the multivariate selection index genetic variance of crosses based on haplotypes of proposed parents, marker effects, and recombination frequencies. We combined the predicted mean and variance into usefulness criteria for parent and variety development. We present an empirical study of cassava (Manihot esculenta), a staple tropical root crop. We assessed the potential to predict the multivariate genetic distribution (means, variances, and trait covariances) of 462 cassava families in terms of additive and total value using cross-validation. Most variance (89%) and covariance (70%) prediction accuracy estimates were greater than zero. The usefulness of crosses was accurately predicted with good correspondence between the predicted and the actual mean performance of family members breeders selected for advancement as new parents and candidate varieties. We also used a directional dominance model to quantify significant inbreeding depression for most traits. We predicted 47,083 possible crosses of 306 parents and contrasted them to those previously tested to show how mate selection can reveal the new potential within the germplasm. We enable breeders to consider the potential of crosses to produce future parents (progeny with top breeding values) and varieties (progeny with top own performance).


Introduction
Diverse crops ranging from staples (e.g., cassava and potato) to cash crops (e.g., cacao) to forestry products (e.g., eucalyptus) are both outbred and clonally propagated (Gemenet and Khan 2017). In these crops, exceptional genotypes can be immortalized and commercialized as clonal varieties. Few clonal crops are also inbred thus, like livestock, each cross segregates phenotypically to different degrees. Unlike seed crops (e.g., maize, wheat), inbreeding is unnecessary for product development. Consider a breeding program implementing some form of genomic selection (GS) (Heffner et al. 2009;Jannink et al. 2010) on a population. All extant members and future progeny are or will be genotyped using genome-wide markers. Field evaluations are conducted at species-and trait-appropriate stages for one or more traits, on at least a subset of the genotypes. Genomic prediction is used to increase selection intensity and decrease cycle times by providing selection criteria for more genotypes, faster than would have otherwise been possible (Hickey et al. 2017). The breeding scheme can be further divided into two parts (Gaynor et al. 2017;Santantonio and Robbins, 2020;Werner et al. 2020) consisting of (1) population improvement by recurrent selection (RS) and (2) a variety development pipeline (VDP). RS is done in order to manage and improve the frequency of beneficial alleles in the population over time. The VDP consists of a series of field trials in which candidates' performance is evaluated. For clonal crops, germplasm is advanced from one VDP stage to the next by vegetative propagation.
The importance of matings and the need for mate selection criteria Every cross is important. Crosses imply an opportunity and a risk. New matings generate genetic variation, the substrate on which selection can operate. However, for a breeder, new crosses require investment of time, land, and money, especially considering the added costs of genotyping. Moreover, crosses may exhibit inbreeding depression or heterosis. Thus, matings serve the multiple purposes of producing new candidate breeding parents for RS and/or cultivars for VDP, potentially evaluated for multiple product profiles characterized by unique selection indices (SI). Selection to drive improvement in the population's mean over time to meet the objective of RS centers on allele-substitution effects and the breeding value (BV). For the VDP, selecting clones to advance for testing should be based on the total genetic value (TGV) of an individual which includes nonadditive genetic effects such as dominance.

Genomic predictions incorporating nonadditive effects
Nonadditive effects can be included in genomic predictions in a number of ways (Vitezica et al. 2013;Varona et al. 2018). Most of the literature so far has dealt with including nonadditive effects in the prediction of the genetic values of an existing pool of selection candidates (Varona et al. 2018). Nonadditive predictions have often been shown to increase prediction accuracy (Heslot et al. 2012;Wolfe et al. 2016a;Werner et al. 2020). In addition, the mean performance (mean TGV) of the progeny can deviate from the prediction based on the mean BV of the parents in the presence of nonadditive effects. Genomic predictions of cross mean TGV have been applied to hybrid performance (Alves et al. 2019) and mate allocation (Toro and Varona 2010). Predictions can also include genome-wide inbreeding/overdominance effects, also referred to as directional dominance; Xiang et al. (2016) and this has recently been shown to be advantageous in a simulated two-part clonal crop breeding scheme (Werner et al. 2020).

Genomic mate selection for outbred, clonal crops
When one or both parents are heterozygous, offspring are expected to segregate for their BV and TGVs. The relative advantage of possible pairwise matings can best be distinguished when predictions of both the genetic mean and variance are available. The usefulness criterion (UC) or simply "usefulness" of a cross is a prediction of the mean performance of the selected superior fraction of the progeny: UC ¼ l þ i Â r, where r is the predicted genetic standard deviation of the progeny and i is the standardized selection intensity (Zhong and Jannink 2007;Segelke et al. 2014;Lehermeier et al. 2017b). The additive genetic variance of an infinite pool of progeny from a cross can be predicted deterministically using the combination of genome-wide marker effects, a genetic map, and phased parental haplotypes (Lehermeier et al. 2017b). This approach has almost exclusively been applied to the prediction of additive genetic variance and covariance . Bonk et al. (2016) showed that dominance in addition to additive within-family variances can be deterministically predicted in outbred species based on gametic variances of putative parents (Bijma et al. 2020). Most other applications are predictions of the variance of inbred lines derived from inbred founders (Zhong and Jannink 2007;Lehermeier et al. 2017b;Allier et al. 2019b;.

Criteria and methods developed in this study
In this study, we extend the deterministic prediction of progeny variances in several ways to maximize the utility and practicality of implementing genomic mate selection. First, we show how to include dominance in the prediction of cross genetic variance and we do so for founders of arbitrary inbreeding level. Next, we distinguish two types of cross usefulness: usefulness for RS (i.e., the predicted mean BV of offspring selected as parents; UC parent ) and usefulness for variety development (i.e., the predicted mean TGV of clones advanced as varieties in the VDP; UC variety ). Finally, since matings are usually chosenbased on multiple traits, we extend the prediction to cross variance on SI. We show that to predict index variance, we must predict the full matrix of trait genetic variances and covariances (Bonk et al. 2016;Allier et al. 2019b;. We implement the core functions for multi-trait prediction of outbred cross variances including additive and dominance effects in an R package predCrossVar.

Empirical study of cassava
We present an empirical study of the accuracy for predicting additive and nonadditive genomic mate selection criteria. We set up a cross-validation scheme that measures the accuracy of predicting means, variances and usefulnesses of previously untested crosses using data from a real cassava (Manihot esculenta) breeding program. Cassava is one of the most important tropical staple foods, especially in Africa (http://faostat.fao.org). Among outbred, clonal crops, GS is relatively mature in cassava breeding (de Oliveira et al. 2012;Ly et al. 2013;Wolfe et al. 2016aWolfe et al. ,b, 2017Elias et al. 2018;Yonis et al. 2020;Okeke et al. 2017;Ozimati et al. 2018) because of the Next Generation Cassava Breeding Project (http:// www.nextgencassava.org, est. 2012), and the species can serve as a model for many others. We leverage a validated GS pedigree with genome-wide phased haplotypes and a genetic map . We used a directional dominance model (Xiang et al. 2016) to make first-time estimates of genome-wide inbreeding (homozygosity) effects in cassava. We report our empirical study in a fully reproducible and documented framework (https://wolf emd.github.io/PredictOutbredCrossVar/).

Formulation of genomic predictions and selection criteria
Below, we describe predictions that are applicable as selection criteria, first for genomic truncation selection G TS , followed by extensions that enable mate selection G MS . Throughout, we distinguish selection criteria based on their suitability for evaluating the potential of individuals (for G TS ) or crosses (for G MS ) for RS vs VDP.
G TS : Selecting genotypes with predictions about generation t Genomic recurrent TS (G TS ) evaluates existing individuals, either for their potential as parents (without regards to specific mates) and/or their potential as clonal cultivars. Under a nonepistatic model, the TGVs of individuals in the current population (time t) can be partitioned into a BV (g BV ) and a dominance deviation g DD .
Consider a diploid population with n individuals genotyped at p biallelic genomic loci.
In this linear model, the n Â 1 vector of phenotypic observations, y is modeled according to a combination of genetic and nongenetic effects. Fixed experimental design-related effects estimates are given by b and its corresponding incidence matrix X(½n Â N fixed ) where N fixed is the number of fixed factors. The elements of the ½n Â p matrices Z and W contain column-centered marker genotypes: Here, p j and q j are the population allele frequencies, as opposed to the within-parent allele frequencies, which are referred to later on. This encoding of genotypes results in marker effects (a and d) that correspond to allele substitution and dominance deviation effects (Vitezica et al. 2013). The marker effects can then be used to predict genomic estimated TGVs (GETGV,ĝ TGV ) as the sum of the genomic estimated BV (GEBV,ĝ BV ¼ Z^) and a corresponding dominance deviation (GEDD,ĝ DD ¼ Wd). The GEBV predicts the mean offspring of a clone mated at random and as such is suitable for truncation RS of parents. The GETGV predicts the performance of each clone, rather than any property of its offspring and is useful for selection for variety advancement.
G MS : Selecting crosses with predictions about generation t þ 1 GEBV and GETGV enable us to do truncation selection. In order to implement mate selection, criteria that distinguish crosses are needed. Progeny of crosses may segregate for both their breeding and TGVs. Crosses may thus differ in their likelihood of producing progeny that are superior varieties (high g TGV;tþ1 ) and/or parents (high g BV;tþ1 ). We focus here on distinguishing the best crosses on the basis of both their predicted genetic means and variances.

Predicted cross means
The family mean, l BV can be predicted as the mean of parental BVs.
Dominance deviation can be included in order to predict the mean TGV, l TGV according to Equation 14.6 (Falconer and Mackay 1996;Toro and Varona 2010;Varona et al. 2018;Werner et al. 2020).
Here, p ik and q ik are the allele frequencies of the counted (alternative) and the noncounted (reference-genome) allele, respectively, for one of the two parents (indexed by i). The difference in frequency between the parent one (indexed by i) and the parent two (indexed by j) is,y k ¼ p ik À p jk and the summation is over the p markers indexed by k. Note that a k is the average effect and not the allele substitution effect, a estimated by the additivedominance parameterization presented above. As a result, predicting l TGV with the formula above may not be appropriate. We adopt a suitable additive-dominance partition, described below, in our primary analyses.

Predicted cross variances
The within-cross additive genetic variance can be predicted deterministically, relying on the formula for the genetic variance under linkage disequilibrium using Equation 5.16a (Lynch and Walsh 1998;Lehermeier et al. 2017b). Below, we use Equation 5.16b (Lynch and Walsh 1998) to predict dominance variance deterministically in an infinite population of diploid full-siblings (Bonk et al. 2016 indicating element-wise (Hadamard) multiplication of D, having the effect of squaring all elements.r 2 TGV ¼r 2 BV þr 2

DD
The p Â p variance-covariance matrix, D, is the expected linkage disequilibrium among full-siblings by considering the expected pairwise recombination frequency and each parent's haplotype phase.
are simply the p Â p covariance matrices associated with each parent's respective 2 Â p haplotype matrix (H P1orP2 ), where elements are 1 if the counted allele is present, 0 otherwise. We computed D haplos ¼ 1 2 H T H À pp T , where p is a vector of withinindividual, per-SNP allele frequencies (Alachiotis et al. 2016).
The p Â p pairwise recombination frequencies matrix is c and can be derived from a genetic map. D

gametes P1
and D

gametes P2
are the covariance matrices for each parents pool of possible gametes, whose covariances sum to give the expected covariances genotypes in the cross, D genotypes P1ÂP2 . The genetic variancesr 2 BV andr 2 DD are thus predicted as above by using D ¼ D genotypes P1ÂP2 .

Usefulness criteria (UC)-mean of superior family members
Given that predictions of genetic means and variances for a cross are available, they can be combined into a single cross selection criterion. We focus here on the UC, which predicts the mean (BV) of the superior progeny from a cross, i.e., the mean after selection (Schnell and Utz 1975;Zhong and Jannink 2007;Lehermeier et al. 2017b). We note that predictions of cross means and variances may be used in other ways (Bijma et al. 2020), but focus on UC. The UC ¼ l þ i Â r, where l is the predicted mean of the cross, i is the standardized within-family selection intensity and r is the predicted cross standard deviation. In the context of the two-part breeding scheme for GS in clonal crops, crosses may be useful for producing both new parents and new varieties. We, therefore, define e therefore distinguish two UCs: UC parent and UC variety ( Table 1). Notice that in addition to separate predictions of mean and variance for UC parent vs UC variety , two-part GS implies that the within-family intensity of selection for RS does not necessarily equal that of the VDP (Santantonio and Robbins 2020).

Extension to multi-trait selection indices
Parent selection is often done based on a multi-character selection index (SI). Crosses can be considered for their potential to produce progeny with good merit on one or more SI by first predicting the variances and covariances for each trait on the SI (Bonk et al. 2016; Allier et al. 2019b;. We can therefore predict the mean and variance of a cross on the SI as follows: 1) Predict (co)variances for all traits on SI. Consider an index with two traits, T1 and T2.
Apply to dominance by substituting a with d and squaring elements of D.
1) Compute the predicted mean and variance on the SI.
The n Â T matrixĝ BV contains the GEBV for each trait and the T Â 1 vector w are the index weights. The T Â T matrix G is the additive (or total) genetic variance-covariance matrix for traits on the index.
TraitT ;TraitT 2 6 6 4 3 7 7 5 Based on these predictions of family means, variances and trait-covariances, we can compute the mean of selected family members on the index (i.e., the UC SI ).

Including directional dominance as a genome-wide inbreeding effect
Many outbred, clonal crops are known to suffer from inbreeding depression. The typical genome-wide regression models the marker effects as drawn from a normal distribution, with mean zero and an estimated variance parameter. To include directional dominance, we model the genome-wide proportion of loci that are homozygous with the 1 Â p vector, f, as a fixed-covariate, leading to: The scalar b is the estimated linear effect of overall homozygosity, interpreted as inbreeding depression or heterosis depending on its direction relative to each trait (Xiang et al. 2016). The effect of over/under-dominance measured by b can be incorporated into the predicted means and variances by dividing b by the number of effects (p) and subtracting that value from the vector of dominance effects, to get d ¼ d Ã À b p (Xiang et al. 2016;Varona et al. 2018;Werner et al. 2020). It is important to note that the partition of genetic effects in this model corresponds to the "biological" (or genotypic) parameterization (Vitezica et al. 2013). The dominance coding in the matrix C is As a result the effects a and d do not correspond to allele substitution and dominance deviation effects directly, but the sum of variance components still equals the r 2 TGV and allele substitution effects can be recovered as a ¼ a þ dðq À pÞ in order to predict g BV (Vitezica et al. 2013;Varona et al. 2018;Werner et al. 2020).

Empirical assessment of the accuracy predicting means, variances, covariances, and usefulnesses in cassava crosses
Since 2012, the Next Generation Cassava Breeding project (http:// www.nextgencassava.org) has implemented GS in African and Latin American breeding programs (de Oliveira et al. 2012;Ly et al. 2013;Wolfe et al. 2017). Cassava breeding programs are wellpoised to adopt G MS if, in addition to prediction of means, variances and covariances can be accurately predicted.
Cassava data: pedigree, genetic map, and phased haplotypes We chose a publicly available, previously published pedigree, genetic map, and phased marker-dataset as the best starting point for our analysis ; https://www.biorxiv.org/con tent/10.1101/794339v1.full). The pedigree and germplasm chosen represent parents and offspring from the first three cycles of GS conducted at the International Institute of Tropical Agriculture (IITA). These germplasm and genomic selections have been described in greater detail previously (Rabbi et al. , 2020Wolfe et al. 2016aWolfe et al. ,b, 2017Wolfe et al. , 2019. We note that each cycle of selection was done by recurrent genomic truncation selection using a SI similar (but not identical) to the one described below. The base generation (C0) was the top-ranked clones among a larger collection of the diverse but interrelated elite as well as landrace germplasm. Chan et al. (2019) implemented a number of procedures to ensure the quality of the data. First, technical replications of the original genotyping-by-sequencing (GBS) were validated with BIGRED (Chan et al. 2018) and reads were combined to reduce missingness and increase read-depth-per-sample. Next, a multipass analysis using the pedigree-validation software, AlphaAssign (Whalen et al. 2018) was used to ensure only relationships supported by the data were assumed downstream. Genotypes were called using validated pedigree information, and sites with more than 30% missing data were removed, leaving 206,539 out of 336,692 sites (summed across all 18 chromosomes) for analysis. The filtered dataset was used as input for phasing and imputation. Pedigree-guided imputation and phasing were accomplished by SHAPEIT2/duoHMM (O'Connell et al. 2014). Finally, the authors constructed a genetic linkage map based on crossover events observed in the dataset. We restricted our analysis to only the 3199 individuals comprising 462 full-sibling families (and their parents), in which both parents were validated/known.

Cassava data: traits, trials, and selection indices
We chose four focal cassava traits: dry matter percentage (DM), fresh root yield in natural-log tons-per-hectare (logFYLD), season-wide mean cassava mosaic disease severity (1-5 scale; MCMDS), and total carotenoids by color chart (1-8 scale; TCHART). These traits include both polygenic (DM and logFYLD) and mono/oligenic architectures (MCMDS and TCHART). Two of the traits are known to have important dominance variance (logFYLD and MCMDS), while DM has been shown to be largely additive (Wolfe et al. 2016a,b). From these traits, we composed two hypothetical SI, which represent two real and disparate breeding goals (Supplementary Table S1). Both indices target increased DM and logFYLD and reduced MCMDS. We refer to the first index as the "Standard SI" (or StdSI) as it emphasizes yield and disease resistance in a white-fleshed background. The second index "Biofortification SI" (or BiofortSI) focuses on breaking a historically negative genetic correlation between DM and carotenoid content by weighting most heavily the combination of yellow-flesh (high TCHART) and high DM. We note that the pedigree and germplasm analyzed here arose from genomic truncation selection for the equivalent of the StdSI. For this reason, our population and analyses should not be considered as representative or definitive regarding biofortification breeding goals. We started with unscaled, noneconomic weights and scaled them by dividing by the standard deviation of phenotypic BLUPs (see below) for each trait (Supplementary Table S1).
We used pre-adjusted phenotypes, namely, de-regressed BLUPs as input for our downstream analyses. The field trial data used span from 2013 to 2019 and are available directly from http://www.cassavabase.org. The download, quality control, formatting and mixed-model analysis that produced the BLUPs are fully documented and reproducible here: https://wolfemd.github. io/IITA_2019GS/. The BLUPs produced and used in this study of cross variance prediction were originally used for GS conducted during summer 2019. The entire raw IITA trial download was too large for GitHub and is therefore stored here: https://cassava base.org/ftp/marnin_datasets/NGC_BigData/.

Parent-wise cross-validation scheme
We devised a cross-validation scheme that: (1) allowed measurement of the accuracy of predicting means, variances, and covariances in previously unobserved crosses, and (2) enabled us to distinguish accuracy predicting BV from TGV. First, define a vector, P of the parents listed in the pedigree. Define also a second vector C listing the genotypes (clones) in the pedigree, including the parents (P & C).
We conducted five replications of the following procedure: 1) Define parent-wise cross-validation folds: randomly assign the parents in P into k-folds. We chose k ¼ 5 folds or about 42 of 209 parents in P per fold (defined as P k TST , the list of "test" parents in the kth-fold. 2) For each of the k-folds (set of 42 "test" parents), divide the clones vector C into two mutually exclusive sets: "training" (C TRN ) and "validation" (C VLD ). From the set C TRN , we exclude all descendants (offspring, grandchildren, great grandchildren, etc.) of P k TST . We include the P k TST themselves (phenotyping the parents before predicting their offspring) and any nondescendents. Define C VLD simply as the set difference between C and C TRN . 3) Estimate marker effects independently by fitting mixedmodels (see section below for further details) to C VLD and C TRN corresponding to each P k TST . 4) For each P k TST , define the set of crosses to predict, X k toPred to include any of the 462 actual families (sire-dam pairs) in the pedigree, in which the P k TST were involved. By construction, the real family members that have been observed for each of the X k toPred were excluded from the model used to get marker effects for C TRN , and included in the model for C VLD . Predict the means, variances and covariances for each focal trait in each cross, X k toPred using the C TRN marker effects only. 5) For each family in X k toPred , using all existing family members, compute the sample means, variances, and covariances for GEBV and GETGV as predicted by the C VLD marker effects. 6) Calculate the accuracy of prediction for each mean (cor BV l T ; cor TGV , only families with greater than two members were able to be included, and we weighted the correlation between the predicted and sample (co)variance of GEBV/GETGV according to the family size (R package::function psych::cor.wt). For sake of comparison, we also include accuracies in the supplement where predicted values are correlated to phenotypic (rather than genomicpredicted) BLUPs, e.g., cor l T BV;BLUP ; cor l T TGV;BLUP , and so on.
The cross-validation scheme is numerically summarized in Supplementary Table S2 (see also Supplementary Tables S3-S5).

Multi-trait Bayesian ridge regressions
We used the multi-trait Bayesian ridge regression (MtBRR) implemented in the development version of the BGLR R package (https://github.com/gdlc/BGLR-R), which is itself a direct port of the model implemented in the package MTM (de los Campos and Grü neberg 2016). The MtBRR models marker effects as being drawn from a multivariate-normal distribution with mean effects of 0 for each trait and variance-covariance parameters jointly estimated from the posterior distribution of the Gibbs chain. We ran each chain for 30,000 iterations, discarded the first 5000 as burn-in and thinned to every 5th sample. The number of iterations was chosen based on prior univariate analyses using 10,000 iterations . Convergence was confirmed visually during initial test runs. We used de-regressed BLUPs as responses in each model to match the approach used for GS ), but BGLR does not currently support weighted observations in the multi-trait model. Our main focus was on the directional dominance model described above. However, we also fit a nondirectional additive plus dominance model to which we make some comparisons in the Supplementary material.
We fit an MtBRR to each C TRN and C VLD as described above. In addition, we analyzed the entire population ("All" samples) and the component genetic groups, which are: GG (or C0; the original progenitors chosen from a population known as the "Genetic Gain"), TMS13, TMS14, and TMS15, which represent the offspring from 2013 (C1), 2014 (C2), and 2015 (C3), respectively.

Predicting cross means, variances, and usefulnesses
We predicted cross means using the posterior mean marker effects. For variance predictions, Lehermeier et al. (2017a,b) used the posterior mean variance (PMV), which is effectively the mean of the variances predicted by each MCMC-sample of marker effects (see Equations 7-10 in that study). The alternative approach, referred to as the variance of posterior means (VPM), is to make variance predictions simply with the posterior mean marker effects. The PMV is expected to be less biased compared to the VPM but is considerably more computationally intensive. Moreover, PMV requires the on-disk storage of massive posterior marker-effects arrays. We computed the PMV for each prediction in the cross-validation study and in estimating population genetic variances. In the Supplementary Appendix, we made a brief comparison of PMV and VPM and based on these results, used VPM in the exploratory predictions, which are described below. We computed SI means and variances using the predicted (and sample) means, variances and covariances of the component traits, and the index weights, given in Supplementary Table S1.

Realized selection intensities (measuring postcross selection)
We used GEBV and GETGV based on test-set marker-effects to compute observed (or realized) usefulness criteria i.e., UC parent and UC variety and measure prediction accuracy as follows. For UC parent , we computed the mean GEBV of family members who were themselves later used as parents. We computed UC variety using the mean GETGV of family members advanced to the penultimate stage of the VDP, the advanced yield trial (AYT). In order to combine predicted means and variances into usefulness criteria, i.e.,ÛC parent andÛC variety , we first calculated the realized intensity of within-family selection (i RS and i VDP ). For i RS , we used the proportion of family members who themselves appear in the pedigree as parents. For the i VDP , we used the raw plot-basis data to compute the proportion of clones from each family with at least one plot in the aforementioned AYT stage of the VDP, as of July 2019. We computed standardized selection intensity in R using i ¼ dnormðqnormð1 À propSelÞÞ=propSel, where propSel is the proportion selected.

Exploratory analysis: predictions of previously untested crosses
We conducted a prediction exercise evaluating the interest of possible future crosses compared to those previously made in terms of additive and total merit, i.e., UC parent and UC variety . We predicted the means and variances of all possible pairwise matings between the union of 209 parents already used and the 100 clones with top rank on the StdSI, of which only 3 overlapped (N ¼ 306 parents). This resulted in 47,083 crosses to predict. We used marker-effects from the full-model (all clones included). We predicted means, variances, and covariances for all four traits and subsequently used these to compute StdSI and BiofortSI means and variances.
The dataset we analyzed does not include all traits or germplasm relevant to the IITA breeding program. For that reason, our results especially regarding the potential benefits of new matings are meant as an example. Assessment of the actual best new matings to make in the ongoing breeding program will rely on a broader analysis.

Results
Results along with code generating summaries, figures, and related tables are also available as part of the workflowr R markdown website (Results, Figures, Supplementary Figures,

and Supplementary Tables).
Pedigree and Germplasm: There were 3199 individuals in 462 families, derived from 209 parents in our pedigree. Parents were used an average of 31 (median 16, range 1-256) times as male and/or female parents in the pedigree. The mean family size was 7 (median 4, range 1-72). The average proportion of homozygosity was 0.84 (range 0.76-0.93) across the 3199 pedigree members (computed over 33,370 variable SNP; Supplementary Table S14). As expected for a population under RS, the homozygosity proportion increased with each generation with C0, C1, C2, and C3 having homozygosity proportion of 0.826, 0.835, 0.838, and 0.839, respectively (Supplementary Figure S1).  Figure S2). The correlation between DM and TCHART BLUPs was À0.29.

Accuracy of family mean prediction
Across traits, most accuracy estimates (more than 75%) were lower for prediction of family-mean TGV than for mean BV (median difference TGV-BV ¼ À0.017). The only exception was for yield (logFYLD), where TGV>BV, median increase of 0.13 ( Figure  1, Supplementary Table S10). We note that accuracy is higher for BiofortSI compared to StdSI, which makes sense given that BiofortSI emphasizes DM and TCHART, which have higher accuracy than logFYLD and MCMDS.

Accuracy of within-family variance and covariance prediction
Most (89%) of variance prediction accuracies were greater than zero, with median accuracy 0.14 across traits (Figure 2A, Supplementary Table S11). For covariances, prediction accuracy was lower (median 0.07) and 70% of accuracy estimates were greater than zero ( Figure 2B, Supplementary Table S11). In contrast to results for predicting family-means, the most accurately predicted trait-variances were MCMDS, TCHART, and logFYLD with median accuracies (proportion accuracies ! 0) of 0.25 (0.92), 0.17 (0.84), and 0.15 (1.0), respectively. Var(DM), for example, had among the lowest median accuracies at 0.07. Interestingly, the DM-TCHART covariance was also very well predicted with median accuracy 0.23 (97% of accuracies ! 0). Accuracy for the SI variances were intermediate with median StdSI accuracy ¼ 0.17 (0.92) and BiofortSI ¼ 0.09 (0.78) compared to the component traits. Like the SI accuracy for family-means, accuracy for variances was related to the accuracy of the component traits. In contrast to predicting SI cross-means, for variances, the StdSI > BiofortSI. This makes sense as the StdSI emphasized logFYLD and MCMDS, whose variance were better predicted than those of DM, TCHART, and related covariances. There were, overall, only small differences in accuracy between r 2 BV and r 2 TGV with the median difference being À0.003.

Accuracy predicting the usefulness of crosses
The observed UC are the mean GEBV or GETGV of family members who were themselves later used as parents or advanced on the VDP. In order to compute the UC, we first calculated the realized intensity of within-family selection (i RS and i VDP ) (Supplementary Figure S3; Supplementary Table S13). There were 48 families with a mean intensity of 1.59 (mean 2% selected) that themselves had members who were parents in the pedigree and could be used to validate UC parent predictions. There were 104 families for validation of UC variety predictions, with mean intensity 1.46; mean 5% members selected and advanced to the AYT stage of the VDP. On a per-repeat-fold basis, the number of families with observed usefulness for measuring prediction accuracy was limited. For UC parent there were an average of 17 families (min 9, max 24). For UC variety the sizes for the focal AYT stage of the VDP were an average depended on the VDP stage, for the focal stage UC ½AYT variety , mean number of 37 families was 37 (min 25, max 50) per-repeat-fold.
Most estimates (95%) of UC accuracy were greater than zero, with per-trait accuracies largely similar to the family-mean predictions. Indeed, the overall correlation between mean and UC accuracies was 0.75. As might be expected, given the incorporation of variance-predictions and the more limited validation sample size, the UC accuracy was on average lower by À0.09 compared to the family-mean accuracy (Figure 3, Supplementary Table S12). In contrast to predictions of cross variances, the median UC for the BiofortSI was higher (0.58) compared to the StdSI (0.49). Among component traits, median accuracy ranks TCHART (0.83) > DM (0.65) > logFYLD (0.24) > MCMDS (0.10). As with the mean, there was a tendency (62% of estimates) for UC parent to be slightly better predicted than UC variety (median magnitude of difference ¼ À0.06). Prediction accuracy for UC was similar when setting a constant intensity of 2.67 instead of using familyspecific realized intensity (Supplementary Table S12).

Population estimates of the importance of dominance variance
Our focus is mainly on distinguishing among crosses, and the accuracy of cross-based predictions. Detailed analysis of the additive-dominance genetic variance-covariance structure in Figure 1 Accuracy predicting family means. Fivefold parent-wise cross-validation estimates of the accuracy predicting the cross means on SI (A) and for component traits (B), are summarized in boxplots. Accuracy (y-axis) was measured as the correlation between the predicted and the sample mean GEBV or GETGV. For each trait, accuracies are given for two prediction types: family mean BV vs TGV.

Figure 2
Accuracy predicting genetic (co)variances. Fivefold parent-wise cross-validation estimates of the accuracy predicting the genetic variance of crosses on SI (A) and for component trait variances (B) and covariances (C). Accuracy (y-axis) was measured as the correlation between the predicted and the sample (co)variance of GEBV or GETGV. For each trait (panel), accuracies for two prediction types are given: VarBV and VarTGV. cassava (sub)-populations is an important topic, which we mostly leave for future study. We make a brief examination of the genetic variance-covariance estimates associated with the overall population and component genetic groups. We report all PMV-covariance estimates in Supplementary Table S15 and complete BGLR output in the repository associated with this study. We focus here on genetic variance-covariance accounting for LD, as in Lehermeier et al. (2017a), "Method 2." Over all genetic groups analyzed, across trait and SI variances, dominance accounted for an average of 24% (range 6-53%).
Dominance was most important (mean 46% of genetic variance) for yield (logFYLD) and least important for TCHART (mean 11%) (Figure 4). For several estimates, there was an opposing sign between additive and dominance components, e.g., positive dominance but negative additive genetic covariance for DM-logFYLD.

Population estimates of inbreeding effects
We found that genome-wide estimates of the effect of homozygosity were consistently negative for logFYLD with a mean Figure 3 Accuracy predicting cross usefulness (the expected mean of future selected offspring). Fivefold parent-wise cross-validation estimates of the accuracy predicting the usefulness of crosses on the SI (A) and for component traits (B), are summarized in boxplots. Accuracy (y-axis) was measured as the family-size weighted correlation between the predicted and observed usefulness of crosses for breeding parents (UC parent ) or varieties (UC variety ).

Figure 4
Population-level measures of the importance of dominance genetic effects. The genetic variance estimates from the models fitted to the overall population ("All") and also to its four genetic groups (x-axis) are presented in these barplots. Each panel contains results for a trait variance or covariance. For SI (A) and component traits (B) the proportion of genetic variance accounted for by dominance is shown on the y-axis. For covariances between component traits (C) the estimates themselves are plotted. In C, fill color indicates variance component (additive vs dominance). directional dominance regression coefficient of À2.75 log(tons/ ha) across genetic groups (mean effect À3.88 across crossvalidation folds). In addition, DM estimates indicated inbreeding depression effects in several genetic groups and the majority of cross-validation folds with mean directional dominance regression coefficient of À4.82 percent dry matter across genetic groups (mean effect À7.85 across cross-validation folds) and similarly for MCMDS, mean inbreeding effect of 0.32 worse disease severity across genetic groups (mean effect 1.27 across cross-validation folds). This corresponds to higher homozygosity being associated with lower DM, lower yield, and greater disease severity ( Figure 5, Supplementary Table S16).

Exploring predictions about untested crosses
We made 8 predictions (2 SIs Â 2 selection targets [BV, TGV] Â 2 criteria [Mean, UC ¼ Mean þ i*SD]) for each of 47,083 possible crosses of 306 parents. We examined the correlation structure among these predictions in order to understand the multivariate decision space they describe ( Figure 6, Supplementary Figures S4  and S5).
The two SI are (by design) disparate breeding goals. The mean correlation (across var. components) between SIs was low for predictions of the family mean (0.20) and lower for the UC (0.14), but high for the SD (0.91). The predictions of BV and TGV were strongly correlated with 0.95 (0.96) for predicted cross means on the StdSI (BiofortSI), 0.88 (0.91) for predicted genetic standard deviation, and 0.93 (0.95) for UC.
The predicted cross means and variances had a low, but negative correlation ( Figure 6A). Across traits and variance components, the average correlation between predicted mean and standard deviation (cor l;r ) was À0.37. At the standardized intensity of 2.67 (1% selected) the predicted UC was dominated by the mean (average cor l;UC ¼ 0.995) and there was a small negative correlation between variance and UC (average cor r;UC ¼ À0.26). We wanted to know how selections of crosses-to-make would be affected by our choice of criteria. Separately, for each of the 8 predictions of 47,083 crosses, we selected the top 50 ranked crosses (Supplementary Table S19). In total, only 202 unique crosses were selected based on their rank on at least one of the 8 predictions. Of those, 112 were selected for the StdSI (90 Biofort) and included only 7 (6) self-crosses. No crosses were selected for both SI. None of the selected crosses have previously been tested in the IITA breeding program. We plotted the predicted l vs the r ( Figure 6A) and the UC parent vs the UC variety ( Figure 6B). We highlight the unique new crosses proposed and contrast them to the 462 previously made, distinguishing genetic groups (selection-cycle-of-origin, C0, C1, and C2) by colors, in order to illustrate the extent to which our genomic mate selection criteria propose novel and putatively better crosses. For simplicity, we plotted predictions for StdSI only.
There were 44 parents represented among the 112 "best" crosses for StdSI with a median usage in 3 families each (range 1-70, most popular parent ¼ TMS13F1095P0013). Only 33 parents among 90 "best" crosses were indicated for the BiofortSI with a median contribution to 4 (range 1-81, most popular parent ¼ IITA-TMS-IBA011371) crosses. Figure 7 breaks down the selections on the StdSI according to prediction and variance components as a network where selected parents are nodes and matings are edges. For the StdSI, only 17 of 112 crosses (30 of 90 for BiofortSI) were selected jointly for both BV and TGV. Selfcrosses were only selected based on BV. In fact, 22 of 44 parents selected on the StdSI were chosen only for the TGV of their crosses and 4 only for their BVs (Figure 7). For the BiofortSI, one parent was chosen only for BV, but 14 of 33 were only interesting for their TGV. Only 27 crosses for the StdSI (14 for BiofortSI) were selected only based on the UC (i.e., selected for their variance but not their mean). As judged by the number of times a cross was chosen given the four selection criteria, there are 58 unique crosses in the top 50 for the StdSI and 66 for the BiofortSI. This demonstrates a relatively simple approach for selecting the overall best crosses based on the four predicted mate selection criteria.

Discussion
We developed and tested genomic mate selection criteria suitable for multi-trait index selection in organisms of arbitrary homozygosity level where the F 1 (full-sibling progeny) are of direct Figure 5 Estimates of the genome-wide effect of inbreeding. For each trait (panels), the fixed-effect for genome-wide proportion of homozygous sites is shown on the y-axis, as estimated by a directional dominance model. For the overall population ("All") and four genetic groups ("C0" C1"C2"C3"), the posterior mean estimate and its standard deviation (bars) are shown on the x-axis. For comparison a boxplot showing the distribution of estimates from models fit to parent-wise cross-validation training and validation sets ("ParentwiseCV") is also shown.
interest as future parents and/or cultivars (varieties). We focused on the prediction of the SI-associated genetic variance of crosses based on the haplotypes of proposed parents, estimates of marker effects, and estimates of recombination frequencies between marker loci. We combined the predicted mean and variance of a cross into usefulness criteria for parent (UC parent ) and variety (UC variety ) development, by predicting the genetic variance of both breeding (r 2 BV ) and TGVs (r 2 TGV ¼ r 2 BV þ r 2 DD ).

Sufficiency and implications of prediction accuracy estimates
We worked with 462 real cassava families of heterogeneous size. We made practical use of the available data in implementing the parent-wise cross-validation scheme. We found that prediction accuracy for family means were largely similar to our previously published estimates (Wolfe et al. 2016a. Variance and UC prediction accuracies were lower than mean prediction accuracies in general. Given that variances are the second-moment of the distribution, it makes sense that accuracies for variances are lower than for means (Zhong and Jannink 2007;Osthushenrich et al. 2018;. The accuracy predicting the family mean for a given trait was not well correlated with the accuracy estimate for predicting family variances (r ¼ À0.22). This suggests that, for a given phenotype, breeding programs cannot simply rely on available estimates of family-mean prediction accuracy to determine the adequacy of family-variance predictions. The UC and family-mean accuracy estimates were reasonably correlated (r ¼ 0.75). Many factors contribute to achieving optimal accuracy and those factors are well understood in the literature. We focused here on getting an assessment of the overall ability to distinguish crosses with high vs low genetic variances. Previous studies of variance-prediction accuracy evaluated relatively few families, but with larger size (Osthushenrich et al. 2018;Yao et al. 2018;. Interestingly, we found that traits with the most accurately predicted variances had less accurately predicted means, including the SIs (mean: StdSI < BiofortSI; variance: StdSI > BiofortSI; Figures 1 and 2). This Figure 6 Genomic mate selection criteria for the StdSI predicted for previously untested crosses. We predicted 47,083 crosses among 306 parents. We made four predictions: 2 variance components [BV, TGV] Â 2 criteria [Mean, UC ¼ Mean þ 2*SD]. Two-dimensional contour lines show the distribution of all predicted crosses. For each of the predictions, we took the top 50 ranked crosses and then selected the union of crosses selected by at least one metric. The 462 crosses previously made are also shown and genetic groups (C0, C1, and C2) are distinguished by color from the 112 new crosses to highlight the opportunity for improvement. Selfs are shown as triangles, outcrosses as circles. The predicted cross genetic mean is plotted against the predicted family genetic standard deviation (Sd, r) for BV and TGV (panel rows) (A). The UC parent is plotted against the UC variety with a red one-to-one line in B.
does not seem initially explainable by our priors regarding trait genetic architectures; DM and FYLD are both generally considered as polygenic/infinitesimal traits, while MCMDS and TCHART are expected to be closer to mono-or oligogenic (Wolfe et al. 2016b;Rabbi et al. 2020). Differences in accuracy between mean and variance predictions should instead have to do with the nature of linkage disequilibrium, especially as it affects marker-causal relationships (de Los Campos et al. 2015;Lehermeier et al. 2017a). Similar to the simulations and empirical results of  we found that the DM-TCHART covariance was particularly well predicted, corresponding to hypothesized tight linkage between QTL on chromosome 1 ); a region known to contain large low-recombination regions of historical introgression from the wild relative M. glaziovii (Wolfe et al. 2019).
The actual accuracy achieved should be higher than our estimates. Our empirical estimates of progeny variance, against which we validate our predictions, are subject to both Mendelian sampling and effect estimation error. That error decreases the correlation, biasing all our estimates downward. Put another way, we make predictions of the variance of an infinite number of progeny, but are only able to correlate those predictions to a real sample of families with finite and heterogeneous numbers of offspring.
Several conditions for the implementation of cross-variance predictions and mate selection need to be met. First, predictions of GEBV or GETGV are considered suitable for genomic truncation selection; for example, based on cross-validation and/or crossgeneration prediction accuracy estimates. Second, genetic maps are established and trusted. Finally, accurate marker data phasing for candidate parents must be available. If these criteria are met the logistics of mate selection are feasible.
The possibility remains that estimates of variance might be poor enough to contribute more noise than signal to the selection decisions. The answer is hard to intuit and decisions must be made on a program-specific basis. By obtaining a prediction of cross-variance we add a component of information to the crossmean predictions we had before. We also add a potential source of error. One suggestion might simply be to incorporate crossvariance predictions into selections via the UC cautiously by choosing a relatively low-standardized selection intensity value when incorporating the mean and variance predictions. Field validating variance predictions across multiple large families and simulating long-term outcomes may offer the best viable additional sources of decision support regarding the use of usefulness predictions.

The importance of nonadditive effects and the effect on inbreeding
Nonadditive effects are important in cassava, accounting for an average of 24% of genetic variance in this study. Our results are consistent with previous studies that highlight the importance of nonadditive effects for fresh root yields but not for dry matter or total carotenoid content (Esuma et al. 2016;Wolfe et al. 2016aWolfe et al. , 2017Nduwumuremyi et al. 2018;Andrade et al. 2019). To our knowledge, we are the first to report partitions of trait-trait genetic covariance into additive and dominance components, though we do not comment on it in detail in this study. We also report the first estimates of genome-wide marker-based directional dominance in cassava. Using the model of Xiang et al. (2016), we found notable evidence of inbreeding depression, for every trait except TCHART, but especially yield. Our results match several previous estimates of inbreeding depression based on field observation of selfed (S 1 ) progeny (Pujol and McKey 2006; Figure 7 Network plot of selected parents and matings for the StdSI. There were 44 parents and 112 crosses chosen because they were in the top 50 for at least one of four predictions: 2 variance components [BV, TGV] Â 2 criteria [Mean, UC ¼ Mean þ 2*SD]). Parents are shown as nodes, with size proportional to their usage (number of connections). Matings are shown as edges, with linetype distinguishing selection based on Mean (solid) and UC (dashed) and color depicts selection for BV vs TGV. Rojas et al. 2009;Kawuki et al. 2011;de Freitas et al. 2016). Theory and data (reviewed in Kristensen and Sørensen 2005) indicate that traits more closely associated with fitness (in cassava, this would be traits related to root and stem production, for example) should be more impacted by directional dominance (inbreeding depression). These results also make sense in light of the evidence of deleterious genetic load in cassava (Ramu et al. 2017) and balancing selection for heterozygosity in introgression regions (Wolfe et al. 2019). It was also interesting to observe that there are unique matings with elite rank only for TGV (nonadditive effects) and still others uniquely interesting for BV. The network plot in Figure 7 shows a pattern of crosses within vs among particular parents selected for exploiting either BV or TGV that warrants future investigation. Our cross-validation scheme allows us to distinguish TGV and BV accuracy by using genomic estimates of BV and TGV as validation data. Nevertheless, we found only small differences between TGV and BV accuracy for both mean and variance-related predictions. The composition of a mating plan based on UC parent and UC variety is still an important decision point for breeders. To a certain extent, choices depend on a breeder's emphasis on matings to produce varieties vs to improve the population overall. In the future, numerically optimized mating plans that balance investment in crosses to maximize the value of parents and varieties produced by each crossing block can be developed.

Caveats, limitations, and future directions for GMS in outbred, clonal crops
In some circumstances, for computational efficiency, it may be desirable to use the VPM rather than the PMV approach to predict cross variances. Our results show that the correlation between VPM and PMV predictions is very high but their magnitude is different, as is the accuracy estimate (see Supplementary Appendix). If any bias is consistent, then ranking differences between PMV and VPM (or REML) predictions of cross variance will not occur. Ultimately, if implementing mate selections via the usefulness criteria, choosing the VPM method would mostly have the consequence of shrinking the predicted values toward the mean.
Other critical considerations for practical implementation include the necessary phasing quality and method. We leveraged a dataset imputed and phased using a validated pedigree ; many plant breeding programs may not have suitable pedigree or depth of relationships to enable this. We do not rule out using "standard" population-based imputation and phasing (e.g., Browning and Browning 2016). Promising also will be the development of a practical haplotype graph suitable for outbred diploids like cassava (Jensen et al. 2020;Zou et al. 2020). In addition, the necessary marker density for accurate prediction should be considered as it has a very significant effect on computational speed.
Several extensions and future directions are of interest moving forward from the current study. We have only addressed dominance, but extensions of variance prediction to include epistasis or even nonlinear kernel types should be straightforward (Alves et al. 2019). The directional dominance model and its assumption of uncorrelated additive and dominance effects and linear genome-wide effects on phenotype of increasing homozygosity need evaluation (Xiang et al. 2018). We note that many outbred, clonal crops are actually polyploids. For organisms with such genomes, further developments in recombination mapping, phasing, and prediction models will be required, but are expected to be possible. In our study, we focused on trait-associated variance prediction. Considerable development of mate selection criteria has concerned the avoidance of genetic diversity loss generally, these are approaches that constrain inbreeding (Kinghorn 2011;Woolliams et al. 2015) and are distinct from trait-associated predictions presented here. We note that Allier et al. (2019b) recently described prediction of the variance in parental contribution in a family (i.e., the variance in inbreeding level) as a correlated trait, using an extension of the approach for prediction trait-associated variance.

Conclusions
By providing predictions of the selection-index-associated means and variances in arbitrary crosses for additive and dominance variances, we provided a suite of genomic mate selection criteria suitable for the complexities of a modern (cassava) breeding program. We presented a simple approach for genomic truncation mate selection that identifies a profile of crosses collectively interesting because of the predicted merit of their progeny in terms ofl BV ; l TGV ; r BV , and r TGV . Ultimately, crossing plans can be numerically optimized (Akdemir and Sá nchez 2016;Gorjanc and Hickey 2018;Akdemir et al. 2019;Allier et al. 2019a) to consider trait-associated means and variances as well as inbreeding levels, to provide a high degree of control for the management of breeding populations.
Lukas Mueller and Prasad Peteti for data hosting and curation respectively. They would also like to thank three anonymous reviewers and the editor whose constructive feedback substantially improved the paper.

Funding
We thank the UK's Foreign, Commonwealth & Development Office (FCDO) and the Bill & Melinda Gates Foundation (Grant INV-007637 http://www.gatesfoundation.org) as well as the CGIAR Research Program on Roots, Tubers, and Bananas for their support.