Pitfalls and Remedies for Cross Validation with Multi-trait Genomic Prediction Methods

Incorporating measurements on correlated traits into genomic prediction models can increase prediction accuracy and selection gain. However, multi-trait genomic prediction models are complex and prone to overfitting which may result in a loss of prediction accuracy relative to single-trait genomic prediction. Cross-validation is considered the gold standard method for selecting and tuning models for genomic prediction in both plant and animal breeding. When used appropriately, cross-validation gives an accurate estimate of the prediction accuracy of a genomic prediction model, and can effectively choose among disparate models based on their expected performance in real data. However, we show that a naive cross-validation strategy applied to the multi-trait prediction problem can be severely biased and lead to sub-optimal choices between single and multi-trait models when secondary traits are used to aid in the prediction of focal traits and these secondary traits are measured on the individuals to be tested. We use simulations to demonstrate the extent of the problem and propose three partial solutions: 1) a parametric solution from selection index theory, 2) a semi-parametric method for correcting the cross-validation estimates of prediction accuracy, and 3) a fully non-parametric method which we call CV2*: validating model predictions against focal trait measurements from genetically related individuals. The current excitement over high-throughput phenotyping suggests that more comprehensive phenotype measurements will be useful for accelerating breeding programs. Using an appropriate cross-validation strategy should more reliably determine if and when combining information across multiple traits is useful.

Genomic Selection (GS) aims to increase the speed and accuracy of selection in breeding programs by predicting the genetic worth of candidate individuals or lines earlier in the selection process, or for individuals that cannot be directly phenotyped (Meuwissen et al. 2001;Hayes et al. 2009;Crossa et al. 2017). Genomic selection works by training statistical or Machine Learning models on a set of completely phenotyped and genotyped individuals, and then using the trained model to predict the genetic worth of unmeasured individuals. If the predictions are reasonably accurate, selection intensity can be increased either because the population size of candidate individuals is larger or their true genetic worth is estimated more accurately.
Predictions of genetic values are usually based only on the genotypes or pedigrees of the new individuals. However predictions can in some cases be improved by including measurements of "secondary" traits that may not be of direct interest but are easier or faster to measure (Thompson and Meyer 1986;Pszczola et al. 2013;Lado et al. 2018). This is one goal of multi-trait genomic prediction. Multi-trait prediction is most useful for increasing the accuracy of selection on a single focal trait when that trait has low heritability, the "secondary" traits have high heritability, and the genetic and non-genetic correlations between the traits are large and opposing (Thompson and Meyer 1986;Jia and Jannink 2012;Cheng et al. 2018). With the advent of cheap high-throughput phenotyping, there is great interest in using measurements of early-life or easily accessible traits to improve prediction of later-life or more expensive traits, and multi-trait prediction models are attractive methods for leveraging this information (Pszczola et al. 2013;Rutkoski et al. 2016;Fernandes et al. 2017;Lado et al. 2018).
A large number of genomic prediction methods are available, and the best model varies across systems and traits (Heslot et al. 2012;de Los Campos et al. 2013). Due to their complexity and often high-dimensional nature, genomic prediction methods are prone to overfitting and require regularization to perform well on new data. Therefore, comparing models based on their ability to fit existing data (ex. with R 2 ) is unreliable; every candidate model could explain 100% of the variation in a typical-size dataset.
Instead, prediction models are generally compared by crossvalidation (Meuwissen et al. 2001;Utz et al. 2000;Gianola and Schon 2016). The basic idea of cross-validation is to separate the model fitting and tuning process from the model evaluation process by using separate datasets for each (Hastie et al. 2009). This penalizes models that fit too closely to one data set at the expense of generalization. In this way, cross-validation is meant to accurately simulate the real-world usage of the model: predicting the genetic values of un-phenotyped individuals; i.e., those not available during the model fitting process itself. Rather than requiring new data per se, cross-validation works by splitting an existing dataset into nonoverlapping "training" and "testing" partitions, fitting the candidate model to the former, and then evaluating it on its accuracy at predicting the latter. Common measures of accuracy include Pearson's r or the square root of the average squared error (RMSE) . This process of splitting, training, and predicting is typically repeated several times on the same dataset to get a combined or averaged measure of accuracy across different random partitions of the data.
Estimates of model accuracy by cross-validation are not perfect (Hastie et al. 2009). They are subject to sampling error as are any other statistic. They are also typically downwardly biased because smaller training datasets are used for the cross-validation than in the actually application of a model. However in typical cases, this downward bias is the same for competing models and thus does not impact model choice (Hothorn et al. 2005).
However, cross-validation can give upwardly biased estimates of model accuracy when misused due to various forms of "data-leakage" between the training and testing datasets, leading to overly optimistic estimates of model performance (Kaufman et al. 2012). Several potential mistakes in cross-validation experiments are well known: Biased testing data selection. The individuals in the model testing partitions should have the same distribution of genetic (and environmental) relatedness to the training population as individuals in the remaining target population (Amer and Banos 2010;Daetwyler et al. 2013). For example, if siblings or clones are present in the data, they should not be split between testing and training partitions unless siblings or clones of individuals in the training partition are also at the same frequency in the target population. Similarly, if the goal is to predict into a diverse breeding population, the cross-validation should not be performed only within one F2 mapping population. Overlap between the testing and training datasets. The observations used as testing data should be kept separate from the training data at all stages of the cross-validation procedure. For example, if data from individuals in the testing dataset are used to calculate estimated genetic values (EBVs) for model training, then the testing and training datasets are overlapping, even if the testing individuals themselves are excluded from model training (Amer and Banos 2010). Pre-selection of features (e.g., markers) based on the full dataset before cross-validation. All aspects of model specification and training that rely on the observed phenotypes should be performed only on the training partitions, without respect to the testing partition. For example, if a large number of candidate markers are available but only a portion will be included in the final model, the selection of markers (i.e., features) should be done using only the training partition of phenotypes and the selection itself should be repeated each replicate of the cross-validation on each new training dataset. If the feature selection is only done once on the whole dataset before cross-validation begins, this can lead to biased estimates of model accuracy (Hastie et al. 2009).
If these mistakes are avoided, cross-validation generally works well for comparing among single-trait methods, and in some cases for multi-trait methods. However, our goal in this paper is to highlight a challenge with using cross-validation to choose between single-trait methods and multi-trait methods; specifically multi-trait methods that use information from "secondary" traits measured on the target individuals to inform the prediction of their focal trait(s). In this case, standard cross-validation approaches lead to biased results. As we discuss below, the source of bias is not data leakage between the training and testing data per se, but correlated errors with respect to the true genetic merit between the secondary traits in the training data and the focal train in the testing data. Note that this issue only occurs when the multiple traits are measured on the same individuals, and the traits share non-genetic covariance. When traits are measured on different individuals, the standard cross-validation approach is appropriate.
In the following sections, we first describe the opportunity offered by multi-trait genomic prediction models in this setting, and the challenge in evaluating them. We then develop a simulation study that highlights the extent of the problem. Next, we propose three partial solutions that lead to fairly consistent model selections between single and multi-trait models under certain situations. Finally, we draw conclusions on when this issue is likely to arise and when it can be safely ignored.

General Setting
Multi-trait genomic prediction is useful in two general settings: 1) When the overall value of an individual depends on each trait simultaneously (ex. fruit number and fruit size) and these traits are correlated, and 2) When a focal trait is difficult or expensive to measure on every individual, but other correlated traits are more readily available (Thompson and Meyer 1986;Pszczola et al. 2013;Lado et al. 2018). While multi-trait models are clearly necessary in the first setting, in the second the value of the secondary traits depends on several factors including i) the repeatability of the focal and secondary traits, ii) the correlations among the traits and the cause of the correlations (i.e., genetic vs. non-genetic), and iii) the relative expenses of collecting data on each trait.
Here we focus on the goal of predicting a single focal trait using information from both genetic markers (or pedigrees) and phenotypic information on other traits. Even within this context, there are also two distinct prediction settings: 1) Predicting the focal trait value for new individuals that are yet to be phenotyped for any of the traits, and 2) Predicting the focal trait value for individuals that have been partially phenotyped; phenotypic values for the secondary traits are known and we wish to predict the individual's genetic value for the focal trait. These settings were described by (Burgueño et al. 2012) as CV1 and CV2, respectively, although those authors focused on multi-environment trials rather than single experiments with multiple traits per individual. The same naming scheme has since been extended to the more general multiple-trait prediction scenarios (Lado et al. 2018).
The key difference between CV1 and CV2-style multi-trait prediction is that in the former, the secondary traits help refine estimates of the genetic values of relatives of the individuals we wish to predict, while in the latter, the secondary traits provide information directly about the genetics of the target individuals themselves. This direct information on the target individuals is generally useful (as we demonstrate below). However, it comes with a cost for the evaluation of prediction accuracy by cross-validation. Since we do not know the true genetic values for the testing individuals, we must either use a model to estimate the genetic values or simply use their phenotypic value as a proxy. Unfortunately, if we use our genetic model to estimate these values, we are breaking the independence between the testing and training data, and therefore have biased estimates of cross-validation accuracy. On the other hand, if we simply use the phenotypic values of the focal trait as our predictand, these may be biased toward or away from the true genetic values depending on the non-genetic correlation between the focal and secondary traits. This leads to either overor under-estimation of the prediction accuracy of our multi-trait models. In realistic scenarios, this can lead users to select worse models.

MATERIALS AND METHODS
We used a simulation study to explore conditions when naive crossvalidation experiments as described above lead to sub-optimal choices between single and multi-trait genomic prediction methods. Our simulations were designed to mimic the process of using cross-validation to compare single and multi-trait models based on their prediction accuracies. We repeated this simulation across scenarios with different genetic architectures for two traits: a single "focal" trait and a single "secondary" trait. Specifically, we modified the heritability and correlation structure of the two traits. These are the most important parameters for determining the relative efficiencies of single-and multi-trait prediction models (Thompson and Meyer 1986). Sample size and level of genomic relatedness will also affect the comparisons, but are likely to only quantitatively (but not qualitatively) change the relative performances of the models and the accuracy of cross-validation.
To make our simulations realistic, we based them on genomic marker data from 803 lines from a real wheat breeding program (Lopez-Cruz et al. 2015). We downloaded the genomic relationship matrix K based on 14,217 GBS markers from this population. We used this relationship matrix to generate a set of simulated datasets covering all combinations of the following parameters: the relative proportions of genetic and non-genetic variation for each trait (h 2 ¼ f0:2; 0:6g), and the genetic and non-genetic correlations between the traits r g ¼ f0; 0:3; 0:6g, r R ¼ f20:6; 2 0:4; 2 0:2; 0; 0:2; 0:4; 0:6g, drawing trait values for each simulation from multivariate normal distributions. In particular, we set: where MNð:Þ is the Matrix normal distribution, Y ¼ ½y 1 ; y 2 are the phenotypic values for the two traits in the n individuals, U ¼ ½u 1 ; u 2 are the true genetic values for the two traits, and E ¼ ½e 1 ; e 2 are the true non-genetic deviations for the two traits. We repeated this process 500 times for each of the 42 combinations of the genetic architecture parameters and for each of the simulation settings we describe below. To improve the consistency of the simulations, we used the same draws from a standard-normal distribution for all 42 parameter combinations, but new draws for each of the 500 simulations. After creating the 803 simulated individuals, we randomly divided them into a training partition and a testing partition. We arranged the rows of Y so that the testing individuals were first, and correspondingly partitioned K into: Here and below, the subscript n refers to the testing partition (i.e., "new" individuals) and the subscript o refers to the training partition (i.e., "old" individuals). We use the hat symbol (^) to denote parameter estimates or predictions. We then fit single-and multi-trait linear mixed models to the training data and used these model fits to predict the genetic values for the focal trait (trait 1) in the testing partition.
Specifically, for the single-trait method we fit a univariate linear mixed model to the training data y o1 : by Restricted Maximum Likelihood using the relmatlmer function of R package (Ziyatdinov et al. 2018) and extracted the BLUPsû o1 . Note: an expanded version of these derivations are provided in the Appendix. We then calculated predicted genetic values for the testing partition u n1 as:û For the multi-trait model, we stacked the vectors of the two traits in the training dataset into the vector y o ¼ y o1 y o2 ! and fit: using the relmatLmer function, extracted estimatesm ¼ ½m ⊺ 1 ;m ⊺ 2 ⊺ ,Ĝ, R, and BLUPsû o .
To make predictions of the genetic values for the focal trait in the testing partition in the CV1 case without use of y n2 , we calculated: which has the same form as for the single trait model, but the input BLUPsû o1 are different.
To make predictions of the genetic values for the focal trait in the testing partition in the CV2 case, using the phenotypic observations of the secondary trait y n2 , we used a two step method. First, we estimated u o above based on both traits in the training data. Then we combined these estimates with the observed phenotypes of the testing data to calculate genetic predictions for the testing data: whereV c ¼ĝ 22 ðK 21 Þ nn þr 22 I n . This two-step method will be slightly less accurate than a one-step method that used y n2 during the estimation ofû o , but is much easier to implement in breeding programs because no genotype or phenotype data of the evaluation individuals is needed during the model training stage.
We measured the accuracy of these three predictions by calculating the correlation between the predictionû ðiÞ n1 and three predictands over the 500 simulations: u n1 : The true genetic value. y n1 : The phenotypic values of the testing individuals. u n1 : The estimated genetic values of the validation individuals using the full dataset (including y n1 ).
For the second accuracy measure that uses phenotypic values as predictands, we "corrected" the correlations by dividing by the true value of ffiffiffiffi ffi h 2 p to account for the larger variance of y n1 relative to u n1 . This impacts the denominator of the correlation , but since it is the same across methods, does not impact their comparison.
As described below, we also simulated phenotypes for an additional set of individuals y x not included in either the validation or testing partitions. These individuals were selected to be close relatives of each of the validation partition individuals but experienced different micro-environments.
For each combination of genetic parameters, we declared the "best" prediction method to be the one with the highest average correlation with the true genetic values across the 500 simulations. Then we counted the proportion of the simulations in which this "best" method actually had the highest estimated accuracy when scored against y n1 .

RESULTS
Although we ran simulations for two levels of heritability for the focal trait (h 2 1 ¼ f0:2; 0:6g) we present results only for h 2 1 ¼ 0:2. This is the "most-difficult" setting for prediction-when the heritability of the trait is low-but also the setting when we would expect the greatest benefit of using multi-trait models. Results for h 2 1 ¼ 0:6 were qualitatively similar, but with higher overall prediction accuracies of all methods.

Accuracy of single and multi-trait methods in simulated data
With h 2 1 ¼ 0:2 the true accuracy of prediction was moderate for all methods (corðû n1 ; u n1 Þ $ 0:4 2 0:6, Figure 1). Prediction accuracies for the single-trait method were constant across settings with different correlation structures because information from the secondary trait was not used.
The "standard" muti-trait model (i.e., CV1-style) that used phenotypic information only on the training partition slightly out-performed the single-trait model in some settings, more-so when the genetic and non-genetic correlations between traits were large and opposing and when the genetic determinacy of the secondary trait was high (Thompson and Meyer 1986). However it performed slightly worse whenever the genetic and residual correlations between traits were low. This was caused by inaccuracy in the estimation of the two covariance parameters (ĝ 12 ;r 12 ). Neither multi-trait model performed worse than the single-trait model when the true G and R matrices were used ( Figure S1), which we also verified by calculating the expected prediction accuracies analytically (See Appendix). In real data, multi-trait models require estimating more (co)variance parameters and therefore can show reduced performance when data are limited.
The CV2-style multi-trait method, which leverages additional phenotypic information on the secondary trait from the testing partition itself, showed dramatic improvements in prediction accuracy whenever genetic correlations among traits were large, irregardless of the non-genetic correlation between the traits. This is similar to the benefits seen by (Rutkoski et al. 2016) and (Lado et al. 2018). When the heritability of the secondary trait was high, the improvement in prediction accuracy was particularly dramatic (increasing to $ r ¼ 0:6). This is the potential advantage of incorporating secondary traits into prediction methods. However, the CV2 method also requires estimating G and R, and its performance was lower than the single-trait method whenever both genetic and residual correlations were low.
Therefore, multi-trait methods will not always be useful and it is important to test the relative performance of the different methods in real breeding scenarios. Unfortunately, we never know the true genetic values (u n1 ), and so must use proxy predictands to evaluate our methods in real data Legarra and Reverter 2018). In Figures 2A-B, we compare the prediction accuracies of the three methods using two candidate predictands: the observed phenotypic values (y n1 ) and estimated genetic values from a joint model fit to the complete dataset (ũ n1 ).
Using the observed phenotypic values (y n1 ) as the predictand, the estimated accuracy of both the single-trait and CV1-style multi-trait prediction methods consistently under-estimated their true prediction accuracies. This is expected because in this setting 80% of the phenotypic variation is non-genetic and cannot be predicted based on relatives alone. We therefore follow common practice to report a "corrected" estimate of the prediction accuracy by dividing by ffiffiffiffi ffi h 2 p in Figure 2A. This correction factor itself must be estimated in real data, but when comparing models the same value ofĥ 2 should be used for each model so that differences in these estimates do not bias model selection.
In contrast, the estimated accuracy of the CV2-style multi-trait method varied dramatically across simulated datasets. We tended to overestimate the true accuracy when both genetic and non-genetic correlations were large and in the same direction, and dramatically underestimate the true accuracy when the two correlations were opposing. Importantly, there are situations where the CV2-style method appears to perform worse than the single-trait method based on y n1 but actually performs better. Therefore, the observed phenotypic values are not reliable predictands to evaluate CV2-style methods when the intent is to estimate true genetic values and r R 6 ¼ 0.
On the other hand, using estimated genetic values from a joint model fit to the complete dataset (ũ n1 ) as the predictand led to dramatic overestimation of the true prediction accuracy for all methods. This is also expected because the training data are used both to train the prediction model and also to create the testing dataset, a clear violation of the crossvalidation rules that these datasets must be kept separate at all stages of the analysis. Again, the bias was most severe for the CV2-style method. Since this method is clearly invalid, we do not consider it further.

Effects of predictand on model selection
To demonstrate the impact of biased estimates of model accuracy using y n1 on the effectiveness of model selection, we assessed in each simulation whether the single-trait or multi-trait methods had a higher estimated accuracy, and compared this result to the true difference in prediction accuracies in that simulation setting. Figure 3 shows that selecting between the single-trait and CV1-style multi-trait models based on estimated accuracy using y n1 generally works well. Whenever one method is clearly better, we are able to choose that method . 50% of the time. But we never choose correctly , 50% of the time, even when the methods are approximately equivalent.
In contrast, when selecting between the single-trait and CV2-style multi-trait methods based on estimated accuracy using y n1 , the differential bias in estimated accuracy between the two methods frequently lead to sub-optimal model selection ( Figure 3B). With opposing genetic and non-genetic covariances between the two traits, the better model was chosen , 10% of the time. In these situations, using y n1 to select a prediction method will obscure real opportunities to enhance prediction accuracy using multi-trait prediction models.

Alternative estimates of multi-trait prediction accuracy
The CV2-style prediction method can be powerful because y n2 provides information on the genetic value of the testing individuals themselves (through u n2 ), while y o1 only provides indirect information on the genetic values of the testing individuals through the relatives. However, estimating prediction accuracy using y n1 fails for the CV2-style prediction method because both the focal and secondary traits are observed on the same individual and therefore share the same non-genetic sources of variation. Since the CV2 method uses y n2 , non-genetic deviations for the secondary trait e n2 pushû n1 either toward or away from y n1 depending on the estimated correlationr 12 . This either inflates or deflates the estimated accuracy, leading to incorrect model choices.
We now compare the effectiveness of three strategies for estimating cross-validation accuracy of CV2-style methods. To our knowledge, the second and third strategies are novel. Because the three methods have different data requirements, we implemented different experimental designs for each evaluation strategy.
Parametric estimate of accuracy: Our predictionû n1 is similar to a selection index because it combines multiple pieces of information into a linear prediction. The accuracy of an index I is: cor g ðI; yÞ ffiffiffiffi ffi h 2 I p , the genetic correlation between the index and phenotype multiplied by the heritability of the index (Falconer and Mackay 1996;Lopez-Cruz et al. 2019). Neither the genetic correlation nor the heritability can be directly observed, but we can estimate both as parameters of a multi-trait linear mixed model with the same form as (5). To be a valid cross-validation score, these parameters must be estimated with data only in the validation partition, rather than reusing estimates from model training.
Since both model training and model evaluation equally require estimates of G and R, we divided the data 50:50 into training and validation partitions in each simulation, thus using 404 lines to train the prediction models and 403 lines to evaluate the prediction accuracy.
The parametric estimates of prediction accuracy for the CV2 method were less biased than the corðû ð3Þ n1 ; y n1 Þ, the non-parametric estimates using y n1 as a predictand ( Figure 4A, compare to Figure 2). This led to more consistent model selections between the CV2 and single-trait methods ( Figure 4B). However, the parametric approach still underestimated the accuracy of the CV2 method when the genetic and residual correlations were in opposite directions, leading to model selection accuracies ,50%. This negative bias was due to poor estimation of G and R for the selection indices, given the limited sample sizes remaining after the data were partitioned.
Semi-parametric estimate of accuracy: In principle, we can correct for the bias in the non-parametric accuracy estimate (corðû ð3Þ n1 ; y n1 Þ) from the CV2-style method by calculating an adjustment factor based on the theoretical bias relative to the true accuracy (corðû ð3Þ n1 ; u n1 Þ). This is similar to the semi-parametric accuracy estimates presented by (Legarra and Reverter 2018), and the "correction" of accuracy estimates by 1= ffiffiffiffi ffi h 2 p used above to account for the difference in variance between y n1 and u n1 . As we derive in the Appendix, the difference between the true correlation from a CV2-style methods and its CV2 cross-validation estimate when a single secondary trait is used is: with V c defined above and S ¼ I 2 11 ⊺ n . The bias is a function of the the correlation among traits through the productĝ 12 r 21 (as the second term does not involve these parameters, and in most cases is % 1), and is large and positive (i.e., accuracy is overestimated) whenĝ 12 and r 12 are large and in the same direction, and large and negative (i.e., accuracy is underestimated) when these covariances are in opposite directions. Given this result, we can correct corðû ð3Þ n1 ; y n1 Þ by subtracting 8 from the estimated correlation, again corrected by 1= ffiffiffiffi ffi h 2 p ( Figure 5). Figure 1 True prediction accuracy of single-trait and multi-trait prediction methods in simulated data. 500 simulations were run for each heritability of the secondary trait (h 2 2 ¼ f0:2; 0:6g), and each combination of genetic and non-genetic correlation between the two traits (r g ¼ f0; 0:3; 0:6g; r R ¼ f20:6; 2 0:4; 2 0:2; 0; 0:2; 0; 4; 0:6g), all with h 2 1 ¼ 0:2. For each simulation, we used 90% of the individuals as training to fit linear mixed models (either single or multi-trait), predicted the genetic values of the remaining validation individuals, and then measured the Pearson's correlation between the predicted (û n1 ) and true (u n1 ) genetic values. In the CV1 method, we used only information on the training individuals to calculateû n1 . In the CV2 method, we used the training individuals to calculateû o and combined this with the observed phenotypes for the secondary trait on the validation individuals (y n2 ). Curves show the average correlation for each method across the 500 simulations. Ribbons show 61:96 · SE over the 500 simulations.
Clearly, the quality of this correction will depend on the accuracy of g 12 andr 12 as estimates of g 12 and r 12 . In Figure 5A, we show that the corrected correlation estimate has greatly reduced bias, particularly the dependence of the bias on the non-genetic covariance between the traits r 12 . However the correction is not perfect. Corrected accuracy estimates tend to overestimate the true accuracy. This over-estimation is caused by error inĜ andR as estimates of the true covariances: The correction factor is nearly perfect when the true covariance matrices are used in place of their estimates ( Figure S2).
Using the semi-parametric accuracy estimates, we are more successful at selecting the best model over the range of genetic architectures ( Figure 5B). The frequency of selecting the correct model rarely drops below 50% and is relatively constant with respect to the residual correlation between traits.
CV2Ã cross-validation strategy: Since the biased estimate of prediction accuracy for CV2-style methods is due to non-genetic correlations between y n2 used for prediction and the predictand y n1 , an alternative strategy, which we call CV2 Ã , is to use phenotypic information on close relatives of the testing individuals (y x1 ) to validate the model predictions in place of their own focal trait phenotypes (y n1 ). These "surrogate" validation individuals must also be excluded from the model training and raised so that they do not share the same nongenetic deviations as the testing individuals: corðe x1 ; e n1 Þ ¼ 0. Therefore,û x1 will not be artificially pushed toward or away from u x1 (measured on relatives) by y n2 (measured on testing individuals), preventing this source of bias in the estimated accuracy.
We implemented the CV2 Ã cross-validation strategy in two ways, simulating two different breeding schemes.
First, we considered the situation common in plant breeding where inbred lines (i.e., clones) are tested, and each line is grown in several plots in a field Bernardo (2002). Here, we can use one set of clones for prediction (y n2 ), and the other set of clones as trait-1 surrogates (y x1 ). Since they are clones, u x1 ¼ u n1 and y y2 is just as good for predicting u x1 as y x2 . Generally in this type of experiment, replicate plots of each line will be combined prior to analysis into a single line mean (or BLUP). But since we require y n2 and y x1 to be recorded from separate individuals, each value will have 2 · the residual variance because it is based on 1=2 as much data as the line means used for model training. Therefore, in our simuulations we drew two independent residual values for each line in the validation partition, each with a variance of 2R. For these simulations, we used a 90:10 training:validation split.
Second, we considered the situation more common in animal breeding where clones are not available. In this case, the best option for CV2 Ã would be to select pairs of closely related individuals to include in the training set; we use the first individual of the pair as y n2 and the second as y x1 . To implement this strategy, we again started with a validation partition of 10% of the lines. Then for each Figure 3 Impact of using phenotypic data to select between singletrait and multi-trait prediction methods. For each of the 500 simulations per genetic architecture described in Figure 1, we compared the estimated accuracy of a multi-trait prediction to the single-trait prediction. We then calculated the fraction of times that the selected model had higher average true accuracy in that setting (as shown in Figure 1). line, we selected the most closely related remaining line (arg max j K ij for validation line i) and held this additional set of 10% of the lines as y x1 . This left a training partition with only 80% of the lines. The average genetic relatedness of validation partition pairs in these simulations was 0.38. Figure 6A shows that for the first setting with split clones, estimates of prediction accuracy for CV2-style predictions by CV2 Ã are vastly more accurate than the naive estimates based on y n1 , but they are slightly downwardly biased because of the increased residual variance of y n1 and y x2 . Model selection works fairly well across all settings when clones are used ( Figure 6B, blue lines), although with slightly lower success rates than for the semi-parametric method. However, when we implementing the second approach with nearest relatives (not clones), model selection was rarely successful -we consistently chose the wrong model across most simulation settings unless the genetic and residual correlations were opposing. This is because the validation pairs were too distantly related to provide any additional information on genetic merit relative to individuals in the training partition. Interestingly, this method is relatively successful in the situations where the parametric method fails (see Figure 4B), and so may be complimentary.

DISCUSSION
Our study highlights a potential pitfall in using cross-validation to estimate the accuracy of multi-trait genomic prediction methods. When secondary traits are used to aid in the prediction of focal traits and these secondary traits are measured on the individuals to be tested, cross-validation evaluated against phenotypic observations can be severely biased and result in poor model choices. Unfortunately, we rarely know the true genetic value of any individual and therefore can only evaluate our models with phenotypic data (since multitrait-derived estimated genetic values are even more severely biased as we demonstrated above ( Figure 2B)). We cannot find earlier discussions of this problem in the literature. However a growing number of studies aim to use cheap or early-life traits to improve predictions of genetic worth for individuals in later-life traits (ex. Pszczola et al. 2013;Rutkoski et al. 2016;Fernandes et al. 2017;Lado et al. 2018). Therefore the issue is becoming more important.
The problematic bias in the cross-validation-based accuracy estimates is caused by non-genetic correlations between the predictors that we want to use (i.e., the secondary traits) and our best predictand (the phenotypic value of the trait in the testing individuals)non-genetic correlations between two traits measured on the same individual are expected. However, in some cases this correlation is zero by construction, and standard cross-validation approaches can be valid. For example, in the original description of the CV2 cross-validation method by (Burgueño et al. 2012), each trait was measured in a different environment. In this case, the traits were measured on different individuals and therefore did not share any non-genetic correlation. Also, CV1-style methods do not suffer from this problem because phenotypic information on the secondary traits in the testing individuals is not used for prediction. Similarly, this bias does not occur when the target of prediction is the phenotypic value itself (rather than the individual's genetic value). For example, in medical genetics the aim is to predict whether or not a person will get a disease or not, not her genetic propensity to get a disease had she been raised in a different environment (ex Spiliopoulou et al. 2015;Dahl et al. 2016).
We note that the common strategy of two-step genome selection: using single-trait methods to calculate estimated genetic values for each line:trait and then using these estimated genetic values as training (and validation) data, does not get around the problem identified here. Using estimated genetic values instead of phenotypic values will tend to increase the genetic repeatability of the training and validation values, and therefore increase the overall prediction accuracy of all methods. But these estimated genetic values will still be biased by the non-genetic variation, and the biases across traits will still be correlated by the non-genetic correlations. Therefore the same issue will arise.
Also, while we have used a GBLUP-like genomic prediction method for the analyses presented here, the same result will hold for any multi-trait prediction method that aims to use information from y n2 when there are non-genetic correlations with y n1 , i.e., any method that is evaluated with the CV2 cross-validation method on multiple traits measured on the same individual (Calus and Veerkamp 2011;Jia and Jannink 2012;Fernandes et al. 2017). This includes multi-trait versions of the Bayes Alphabet methods (Calus and Veerkamp 2011; We presented three partial solutions to this problem, spanning from fully parametric to fully non-parametric. The parametric solution relies on fitting a new multi-trait mixed model to the predicted values and the predictand, with the accuracy estimated as the genetic correlation scaled by the heritability of the prediction. This solution is always available as long as the individuals in the validation partition have non-zero genomic relatedness and the full dataset is large enough to estimate genetic correlations in both training and validation partitions. However it generally worked poorly in our simulations because G and R were not estimated accurately. It may work better with very large datasets. Also, because this parametric approach relies on the same assumptions about the data (i.e., multivariate normality) as the prediction model, it loses some of the guarantees of reliability that completely non-parametric crossvalidation methods can claim.
The semi-parametric solution aims to correct the non-parametric correlation estimate for the bias caused by the non-null residual correlation among traits. This correction factor is only needed for CV2-style multi-trait prediction approaches, and is similar to the approach of (Legarra and Reverter 2018) for single-trait models. We show that this correction factor can work well, particularly if the covariances among traits are well estimated. We only derived this correction method for prediction methods based on linear mixed effect models with a single known genetic covariance structure (i.e., GBLUP and RKHS-style methods with fixed kernels), although the approximationĝ 12r12 ffi ffi ð p varðûÞvarðuÞ will probably be approximately correct for other methods. However, when covariances are poorly estimated, the correction factor can still lead to biased estimates of model  accuracy. We are currently investigating whether Bayesian methods that sample over this uncertainty can be useful, and will implement this method in JWAS (Cheng et al. 2018). This method is semi-parametric, so also relies on distributional assumptions about the data and may fail when these assumptions are not met. As a third alternative, we proposed the CV2 Ã cross-validation method, a fully non-parametric approach for assessing CV2-style multi-trait prediction accuracy. CV2 Ã uses phenotypic values of the focal trait from relatives of the testing individuals in place of the phenotypic values of that trait from the testing individuals themselves. If the close relatives are raised independently, they will not share non-genetic variation, removing the source of bias in the cross-validation estimate ( Figure 6A). The CV2 Ã method works best when clones of the testing individuals are available. With clones, secondary trait phenotypes of the testing individuals can be used directly to predict focal trait genetic values of their clones because the genetic values are identical. Replicates of inbred lines are frequently used in plant breeding trials (Bernardo 2002). In this case, all replicates should be held-out as a group from the training data. Then the replicates can be partitioned again into two sets; secondary trait phenotypes from one set can be incorporated into the genetic value predictions for the lines, and these predictions evaluated against the phenotypic values of the other set. To compare this estimate of CV2-style prediction accuracy to the prediction accuracy for a single-trait method, the singletrait method's predictions should be compared against the same set of replicates of each line (i.e., not a joint average over all replicates of the line as would be typical for single-trait cross-validation). However, because of the separation of the replicates, each replicate will have higher residual variance, which reduces the accuracy of this method. Clones are less common outside of plant breeding, so more distant relatives need to be used instead. In this case, the estimated prediction accuracies of CV2-style methods will be downwardly biased. In our simulations, despite relatively close relatives for each validation line being available, this approach was not successful.
In our simulations, the semi-parametric approach was the most reliable, and the fully parametric approach the least reliable. However the fully parametric approach is always possible to implement while our semi-parametric and non-parametric approaches may not be possible depending on the prediction model used and the structure of the experimental design.

CONCLUSIONS
We expect that multi-trait methods for genomic prediction carry great promise to accelerate both plant and animal breeding. However there is a need to design better methods to evaluate and train the prediction methods to ensure that models can be accurately compared. We have presented and compared three contrasting methods to evaluating multitrait methods. Each of these methods is preferred to naive cross-validation when secondary traits of the target individuals are used to predict their focal traits. However, the methods can give contrasting answers for different datasets, so careful consideration of which evaluation method to use is critical when choosing among prediction methods.

APPENDIX
Here, we derive the genomic predictionsû n1 given y for the three prediction models that we use in the main text, and then evaluate the expected covariances between these predictions and the predictands u n1 and y n1 . We derive these relations for the more general situation with p $ 1 "secondary" traits and a single "focal" trait.
We start with a phenotypic data matrix Y with n individuals and p þ 1 traits, where the first trait (first column of Y) is the "focal" trait, and the other p traits are "secondary" traits. We first divide Y into a training partition ("old" individuals) and a testing partition ("new" individuals), and arrange them with the testing partition first, so we can partition Y . We then work with stacked versions of these phenotype matrices: y ¼ vecðYÞ; y n ¼ vecðY n Þ; y o ¼ vecðY o Þ. Our genetic model for y is: where G and R are genetic and phenotypic covariance matrices for the p þ 1 traits, and K is the n · n genomic relationship matrix among the lines. For convenience below, we partition the following matrices as follows: We partition the trait vectors for the training individuals and covariance matrices between the "focal" (index 1) and "secondary traits" (index 2): where scalars are normal text, vectors are bold-face lower case letters, and matrices are bold-face capital letters. Partitions for the testing individuals are similar. We also partition the genomic relationship matrix and its inverse between the training and testing individuals:

Single trait predictions
For the single-trait prediction, we begin by estimatingĝ 11 ,r 11 andb 1 by REML using only y o1 . The joint distribution of u n1 and y o1 is: ; g 11 K nn g 11 K no g 11 K on g 11 K oo þ r 11 I ! ! : To simplify, note that the joint distribution of u o1 and y o1 in the training data are: ; g 11 K oo g 11 K oo g 11 K oo g 11 K oo þ r 11 I ! ! Therefore,û o1 y o1 ¼ĝ 11 K oo b V 21 o1 ðy o1 2 X o1b1 Þ. Rearranging and plugging this in above simplifies to:û CV1-style multi-trait predictions For CV1-style multi-trait prediction, we begin by estimatingĜ,R andb by REML using y o . The joint distribution of u n1 and y o is: As above, to simplify this expression, we form the joint distribution of u o and y o in the training data as: Rearranging and plugging this in above simplifies to:û CV2-style multi-trait predictions For our CV2-style multi-trait prediction, we take a two-step approach. We first estimateû o from the training individuals and then supplement this with y n2 from the testing individuals. The joint distribution of u n1 , y n2 and u o is: ! would be: which simplifies to: nn þ R 22 5I nn . Now, conditioning on observed values of both u o from the training data and y n2 from the testing data, the expectation of u n1 would be: c À y n2 2 X 2 b 2 2 K no K 21 oo u o2 Á : Using this, we form our prediction as: whereû o1 andû o2 are extracted from the calculation ofû o for the CV1-style prediction. Plugging in the solutions for these values expands to: Expectations of prediction accuracy Now, we evaluate the expected correlation between a random sample of pairs of elements from our three candidate predictions and the predictand y n1 . We compare these expected correlations with the expected "true" correlations with u n1 . Below, let varðxÞ denote the variance of a random sample from a random vector x; covðx; yÞ and corðx; yÞ denote the covariance and correlation between a random sample of pairs of elements from x and y; and Covðx; yÞ denote the covariance matrix between vectors x and y. We use the following results: where trðÁÞ is the matrix trace, and m x ¼ 0 and/or m y ¼ 0. Therefore, the expected correlation between x and y is approximately: E½corðx; yÞ % 1 n 2 1 trðSCovðx; yÞÞ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi E½varðxÞE½varðyÞ p : Our goal with cross-validation is to estimate corðû n1 ; u n1 Þ. Since we do not know u n1 , we approximate the correlation with corðû n1 ; y n1 Þ= ffiffiffiffi ffi h 2 1 p . The factor of ffiffiffiffi ffi h 2 1 p corrects the correlation for the larger variance of y n1 relative to u n1 . Otherwise, any difference between these two correlations must be due to their numerators: trðSCovðû n1 ; u n1 ÞÞ and trðSCovðû n1 ; y n1 ÞÞ. Thus, for each of the three prediction methods we compare these two numerators to evaluate the accuracy and bias in the approximation.

Single trait predictions
The numerator of the expected correlation between u ð1Þ n1 and the true genetic values u n1 is: where we assume thatb 1 ¼ b 1 and Covðe o1 ; u n1 Þ ¼ 0. The same result for the numerator of the expected correlation between u ð1Þ n1 and the observed phenotypic values y n1 is: where we additionally assume Covðu o1 ; e n1 Þ ¼ 0 and Covðe o1 ; e n1 Þ ¼ 0. Therefore, the numerators are the same, and corðû ð1Þ n1 ; y n1 Þ= ffiffi f h p 2 1 is a consistent estimator for corðû ð1Þ n1 ; u n1 Þ.

CV1-style multi-trait predictions
The numerator of the expected correlation between u ð2Þ n1 and the true genetic values u n1 is: tr SCov û ð2Þ n ; u n1 ¼ tr again assumingb ¼ b and now also Covðe o ; u n1 Þ ¼ 0. The same result for the numerator of the expected correlation between u ð2Þ n1 and the observed phenotypic values y n1 is: where we additionally assume Covðu o ; e n1 Þ ¼ 0 and Covðe o ; e n1 Þ ¼ 0. Therefore, the numerators are the same, and corðû ð2Þ n1 ; y n1 Þ= ffiffi f h p 2 1 is a consistent estimator for corðû ð2Þ n1 ; u n1 Þ.