Jeffrey B Endelman, Jean-Luc Jannink, Shrinkage Estimation of the Realized Relationship Matrix, G3 Genes|Genomes|Genetics, Volume 2, Issue 11, 1 November 2012, Pages 1405–1413, https://doi.org/10.1534/g3.112.004259
Abstract
The additive relationship matrix plays an important role in mixed model prediction of breeding values. For genotype matrix X (loci in columns), the product XX′ is widely used as a realized relationship matrix, but the scaling of this matrix is ambiguous. Our first objective was to derive a proper scaling such that the mean diagonal element equals 1+f, where f is the inbreeding coefficient of the current population. The result is a formula involving the covariance matrix for sampling genomic loci, which must be estimated with markers. Our second objective was to investigate whether shrinkage estimation of this covariance matrix can improve the accuracy of breeding value (GEBV) predictions with low-density markers. Using an analytical formula for shrinkage intensity that is optimal with respect to mean-squared error, simulations revealed that shrinkage can significantly increase GEBV accuracy in unstructured populations, but only for phenotyped lines; there was no benefit for unphenotyped lines. The accuracy gain from shrinkage increased with heritability, but at high heritability (> 0.6) this benefit was irrelevant because phenotypic accuracy was comparable. These trends were confirmed in a commercial pig population with progeny-test-estimated breeding values. For an anonymous trait where phenotypic accuracy was 0.58, shrinkage increased the average GEBV accuracy from 0.56 to 0.62 (SE = 0.002) when using random sets of 384 markers from a 60K array. We conclude that when moderate-accuracy phenotypes and low-density markers are available for the candidates of genomic selection, shrinkage estimation of the relationship matrix can improve genetic gain.
When molecular markers are available, it is often assumed that the goal is to estimate the probability of IBD, but in fact, the goal is to estimate the genetic covariance, which depends on the genotypes of the causal loci and is fundamentally a state property. It follows that for a complex trait for which the infinitesimal model is a suitable approximation, G depends on the probability that the alleles at a random locus are identical in state, or IBS (Yang et al. 2010; Powell et al. 2010). Our first objective was to develop a theoretical framework for estimating the (realized) relationship matrix that is suitable for inbred lines and consistent with the IBS approach (i.e. without invoking a base population).
(Beginning with Equation 3, the symbol A denotes the IBS relationship matrix, and the angular brackets denote the average with respect to an index, in this case, i.) Equation 3 requires a concept for inbreeding that is consistent with the IBS framework. Following Powell et al. (2010), we define the inbreeding coefficient for a single locus as the intra-individual gametic correlation, but our extension to the multi-locus case is different and emerges as an algebraic necessity during the derivation.
The strategy embodied in Equation 2, in which the IBS properties of the markers are used as a proxy for the IBS properties of any two genomic loci, requires the number of markers to be much larger than the number of lines (m >> n). However, to minimize genotyping costs in breeding programs, it is common to use low-density (e.g. 384) SNP arrays, in which case the number of lines may exceed the number of markers. To develop a suitable estimator for this situation, we express the realized relationship matrix in terms of the n×n variance-covariance matrix (Σ) for genomic loci (i.e. when sampling columns of the genotype matrix). Equation 2 is equivalent to estimating Σ with the sample covariance S, which in the large m limit is asymptotically optimal with respect to mean-squared error (MSE) (Casella and Berger 2002).
When the number of lines exceeds the number of markers, the sample covariance matrix is no longer optimal with respect to MSE because there are too many parameters to estimate (n²/2) relative to the number of marker data points (nm). This type of phenomenon is well known in the statistics literature under the name Stein’s paradox (Stein 1956; Efron 1975), and it was James and Stein (1961) who first proposed shrinkage to reduce the MSE. Yang et al. (2010) have proposed a shrinkage estimator for the realized relationship matrix, but it does not preserve Equation 3. We propose an alternative estimator that does not shrink the inbreeding coefficient, and we investigate its impact on the accuracy of breeding value prediction in rice, barley, maize, and pig populations.
THEORY
Derivation of A in terms of causal loci
The subscript u on the variance operator indicates that it is with respect to the random genetic effects and not the genotypes—the latter are simply given and not assumed to follow any distribution. For the last step in Equation 5, we have assumed the uk are i.i.d. with constant variance , which is appropriate for a complex trait with many causal loci of comparable effect (i.e. well described by the infinitesimal model). The variance per locus is scaled by so that is an intensive property that does not depend on the number of causal loci.
In Appendix 2, we show that fk is also the deviation from Hardy-Weinberg proportions and thus interpretable as the inbreeding coefficient for the population.
Estimating A from markers
METHODS
Data sets
Genotypes for several publicly available populations were used in this study:
Maize diversity panel (Cook et al. 2012) (available at http://www.panzea.org/dynamic/derivative_data/Cook_etal_2012_SNP50K_maize282_AGPv1-111202.zip)
Rice diversity panel (Zhao et al. 2011) (available at ftp://ftp.gramene.org/pub/gramene/CURRENT_RELEASE/data/diversity/data_download/hapmap_plink_files/div_rice34.RiceDiversity44K.hapmap.tar.gz)
Commercial pig population (Cleveland et al. 2012) (available at http://www.g3journal.org/content/suppl/2012/04/06/2.4.429.DC1/FileS1.zip)
Advanced breeding lines from the North Dakota State University 2006–2009 two-row and six-row barley breeding programs (available by querying the database at http://hordeumtoolbox.org)
For the pig population, we also used phenotypes and progeny-test-estimated breeding values (pEBV) for three anonymous traits, downloaded from the same source. Genotypes were curated by eliminating markers with more than 10% missing data and lines with more than 15% missing data. The number of lines and markers after curation are shown in Table 1. Missing marker scores were imputed with the population mean for each marker.
Table 1. Populations
| Population | Lines (n) | SNPs (m) | fᵃ | 1st PCᵇ | CVᶜ | n/CV²ᵈ |
|---|---|---|---|---|---|---|
| Pig | 3534 | 52,843 | 0.03 | 0.06 | 4.7 | 1 |
| Maize | 274 | 44,431 | 0.97 | 0.05 | 1.3 | 1.01 |
| 2-row Barley | 383 | 2398 | 0.95 | 0.08 | 2.6 | 0.35 |
| 2+6-row Barley | 763 | 1884 | 0.97 | 0.32 | 9.0 | 0.06 |
| Rice | 407 | 31,443 | 0.96 | 0.34 | 7.2 | 0.05 |

ᵃ Inbreeding coefficient, estimated from the relationship matrix.
ᵇ Fraction of total variance captured by the first principal component (PC).
ᶜ Coefficient of variation (1 = 100%) for the eigenvalues of the covariance matrix.
ᵈ Quantities are relative to the pig population (= 1).
Shrinkage intensity
This shrinkage algorithm has been implemented as part of the rrBLUP package for R, version 4.0 (Endelman 2011; R Development Core Team 2011).
Simulation and analysis
Simulated traits were constructed by first generating additive genetic values from the multivariate normal distribution, with variance equal to the full-marker relationship matrix (hence, σ2 = 1). Independent normal deviates with variance were added to generate phenotypes, and the parameter was modulated to simulate traits with different phenotypic accuracies. Figure 3 was generated with = 3 for the plant species and = 2 for the pigs. Figure 4 is based on 10,000 simulations, with log2 chosen from a uniform(−1,7) distribution, and results were binned by realized phenotypic accuracy in 0.1 increments.
Accuracy was defined as the Pearson correlation coefficient between the genomic estimated breeding values (GEBV = ) and either the true breeding values (in the simulation) or the progeny-test-estimated breeding values (pEBV) for the pig traits.
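The simulation and accuracy steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the code used in the study: it assumes the relationship matrix has the form WW′/c, with W the genotype matrix centered by marker means and c = 2Σp(1−p), so that drawing i.i.d. marker effects with variance 1/c yields genetic values whose covariance is that matrix. The names `simulate_trait`, `pearson`, and `resid_var` (the variance of the independent normal deviates) are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate_trait(genotypes, resid_var, rng):
    """Return (g, y): simulated genetic values and phenotypes.

    genotypes: one list per line, holding allele counts (0, 1, or 2).
    resid_var: assumed name for the variance of the residual deviates.
    """
    n, m = len(genotypes), len(genotypes[0])
    # Center each marker by its mean over the lines in the population.
    means = [sum(row[k] for row in genotypes) / n for k in range(m)]
    # Assumed scaling constant c = 2 * sum of p(1 - p) over markers.
    scale = 2 * sum((mu / 2) * (1 - mu / 2) for mu in means)
    # i.i.d. marker effects with variance 1/c give Var(g) = W W' / c.
    u = [rng.gauss(0, math.sqrt(1 / scale)) for _ in range(m)]
    g = [sum((row[k] - means[k]) * u[k] for k in range(m)) for row in genotypes]
    # Phenotype = genetic value + independent normal deviate.
    y = [gi + rng.gauss(0, math.sqrt(resid_var)) for gi in g]
    return g, y

# Toy example: six inbred lines, four markers.
toy = [[0, 2, 2, 0], [2, 2, 0, 0], [0, 0, 2, 2],
       [2, 0, 0, 2], [2, 2, 2, 0], [0, 0, 0, 2]]
g, y = simulate_trait(toy, resid_var=3.0, rng=random.Random(1))
phenotypic_accuracy = pearson(g, y)  # correlation of phenotype with true value
```

Raising `resid_var` lowers the phenotypic accuracy, which is how traits of different accuracy can be simulated from the same genotypes.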
RESULTS
Table 1 lists several attributes of the five populations used in this study. The population sizes ranged from n = 274 (maize) to n = 3534 (pig). The pig, maize, and rice populations had 30–50K SNPs, whereas only 2K SNPs were available for the barley populations. The inbreeding coefficient (f) for each of the four plant populations, calculated from the mean diagonal element of the relationship matrix, was near 1 as expected for inbred lines (imputing missing markers with the population mean introduced low levels of heterozygosity). The pig population was outbred with f = 0.03.
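As a rough illustration of how f can be read off the mean diagonal element, the sketch below builds a relationship matrix with the first formula of VanRaden (2008), which the Conclusion identifies as the high-density estimator. It assumes genotypes coded as allele counts (0, 1, 2) and allele frequencies estimated from the same lines; the function names are hypothetical.

```python
def realized_relationship(genotypes):
    """Relationship matrix W W' / (2 * sum p(1-p)) from allele counts (0/1/2)."""
    n, m = len(genotypes), len(genotypes[0])
    # Marker means over lines; the allele frequency is mean / 2.
    means = [sum(row[k] for row in genotypes) / n for k in range(m)]
    scale = 2 * sum((mu / 2) * (1 - mu / 2) for mu in means)
    # Center every marker by its population mean.
    w = [[row[k] - means[k] for k in range(m)] for row in genotypes]
    return [[sum(a * b for a, b in zip(wi, wj)) / scale for wj in w] for wi in w]

def inbreeding_coefficient(a):
    """f = (mean diagonal element of the relationship matrix) - 1."""
    n = len(a)
    return sum(a[i][i] for i in range(n)) / n - 1

# Four fully inbred lines (homozygous at every marker), three markers.
inbred = [[0, 2, 2], [2, 2, 0], [0, 0, 2], [2, 0, 0]]
f_inbred = inbreeding_coefficient(realized_relationship(inbred))

# One marker in Hardy-Weinberg proportions (1:2:1 at p = 0.5).
outbred = [[0], [1], [1], [2]]
f_outbred = inbreeding_coefficient(realized_relationship(outbred))
```

In this toy data the fully inbred lines give f = 1 and the Hardy-Weinberg sample gives f = 0, matching the two extremes discussed for the plant and pig populations.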
Both structured and unstructured populations were included. The rice population was a diverse panel of several distinct types (indica, japonica, Aus, etc.) identifiable with principal component (PC) analysis (Zhao et al. 2011). The observation that 34% of the total variation was captured by the first PC indicates its highly structured nature (1st PC in Table 1). We intentionally grouped the 2-row and 6-row barley lines, which as separate populations are unstructured (1st PC < 10%) and derived from different breeding programs, into one population to create a second structured population for analysis (32% explained by 1st PC). The pig and maize populations were relatively unstructured (1st PC < 10%).
Population structure can also be detected from a histogram of the realized relationship coefficients. Figure 1 contrasts the unstructured 2-row barley population with the structured 2+6-row population. Because the relationship coefficients are expressed relative to the current population, the mean of the off-diagonal elements (left panel) is −(1 + f)/n, which is essentially 0 for populations with hundreds of lines or more. Despite having the same mean, the histogram for the 2-row population is unimodal, whereas that for the 2+6-row population is bimodal. The positive peak in the bimodal distribution arises from relationships between lines with the same row number, while the negative peak corresponds to relationships between lines with different row numbers. The highly structured rice population also has a diffuse distribution of off-diagonal elements, whereas the pig and maize distributions are unimodal (supporting information, Figure S1).
Figure 1. Histograms of entries in the realized relationship matrix for the 2-row and 2+6-row barley populations. The diagonal elements have a mean of 1 + f ≈ 2 for inbred lines, while the off-diagonal elements have a mean of −(1 + f)/n ≈ 0. The bimodal distribution of the off-diagonal elements reveals the highly structured nature of the 2+6-row barley population. The positive peak contains relationships between lines with the same row number, while the negative peak is between lines with different row numbers.
The right panel in Figure 1 shows the distribution of diagonal elements in the realized relationship matrices for the 2-row and 2+6-row barley populations. Although the mean of the diagonal elements is 1+f and thus at most 2, the individual coefficients can be larger than 2, unlike the diagonal elements of the numerator relationship matrix. The interpretation of the diagonal coefficients in terms of inbreeding is discussed below.
Shrinkage to minimize MSE
For each of the five populations, relationship matrices were estimated from random subsets of markers, with the shrinkage intensity chosen to minimize the expected MSE. As shown in Figure 2, for every population, the shrinkage intensity approached zero as marker number increased, but there were clear differences in the amount of shrinkage at low marker density. With 384 markers, the two structured populations (rice and 2+6-row barley) had less than 3% shrinkage compared with nearly 20% shrinkage for the 2-row barley and over 30% shrinkage for the maize and pig populations.
Figure 2. Shrinkage intensity to minimize the expected MSE. Each point is the mean from 20 random subsets of markers (SE < 0.01). As expected, the optimal shrinkage decreased as the number of markers increased. There was little shrinkage for the structured populations (rice, 2+6-row barley) because of their high eigenvalue dispersion (see CV in Table 1).
These trends can be understood in terms of the heuristic in Equation 19, in which (for a given marker density) the shrinkage intensity depends on the ratio n/CV² between population size (n) and the squared coefficient of variation (CV) for the eigenvalues of the n×n covariance matrix. Because the leading principal components in a structured population account for a large amount of the total variation, such populations have high eigenvalue CV. As shown in Table 1, the rice and 2+6-row barley populations had the highest CV values (7.2 and 9.0, respectively), while the maize population had the lowest at 1.3. The final column in Table 1 shows the ratio n/CV² relative to the pig population (= 1). Although the pig population was nearly 13 times the size of the maize population, its CV was 3.6 times larger, leading to nearly identical n/CV² ratios and shrinkage intensities in Figure 2. The two structured populations had the smallest n/CV² ratios and thus also the least shrinkage in Figure 2. The 2-row barley population was intermediate between these extremes.
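The final column of Table 1 can be reproduced from the n and CV columns. The short sketch below does only that arithmetic; the dictionary and function names are hypothetical, and the values are copied from Table 1.

```python
# (n, eigenvalue CV) for each population, copied from Table 1.
table1 = {
    "Pig": (3534, 4.7),
    "Maize": (274, 1.3),
    "2-row Barley": (383, 2.6),
    "2+6-row Barley": (763, 9.0),
    "Rice": (407, 7.2),
}

def relative_ratio(table, reference="Pig"):
    """n / CV^2 for every population, expressed relative to the reference."""
    ratios = {name: n / cv ** 2 for name, (n, cv) in table.items()}
    ref = ratios[reference]
    return {name: round(r / ref, 2) for name, r in ratios.items()}

relative = relative_ratio(table1)
# relative["Maize"] is 1.01 and relative["Rice"] is 0.05, as in Table 1.
```

The pig-to-maize comparison in the text falls out directly: 3534/274 is about 12.9 and (4.7/1.3)² is about 13.1, so the two ratios nearly cancel.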
The shrinkage intensities in Figure 2 were based on minimizing the expected MSE, as determined from a reduced marker set. Figure 3 (using m = 384 markers) shows that this approach did in fact minimize the actual MSE between the full-marker relationship matrix and that based on the reduced marker set (see Figure S2 for 2+6-row barley). The solid lines show the MSE as a function of the shrinkage intensity (in 0.05 increments), and in every case, the minimum was attained near the value indicated in Figure 2. For the rice and 2+6-row barley populations, the minimum MSE was attained at 5% shrinkage vs. 2–3% shrinkage based on the expected MSE. The correspondence was equally good for the unstructured populations: 2-row barley = 15% actual vs. 19% expected; pig = 35% actual vs. 34% expected; maize = 30% actual vs. 32% expected.
Figure 3. Maximizing accuracy vs. minimizing MSE. At shrinkage intensities ranging from 0 to 0.7, with 0.05 increments, the relationship matrix was calculated for random sets of 384 markers. In each replicate, the MSE was calculated relative to the full marker relationship matrix (MSE = $n^{-2}\lVert A_{384} - A_{\mathrm{full}}\rVert^{2}$), and GEBV accuracy was estimated using simulated phenotypes. The two curves (dashed = accuracy, solid = MSE) show the mean from 40 simulations (SE less than 3% of the mean).
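The MSE criterion in the Figure 3 caption is simple to compute. The sketch below assumes the norm is the Frobenius norm, so that dividing by n² gives the mean squared difference per matrix entry; the caption does not name the norm, and the function name is hypothetical.

```python
def frobenius_mse(a_reduced, a_full):
    """MSE = n^-2 * squared Frobenius norm of (a_reduced - a_full)."""
    n = len(a_full)
    squared = sum((a_reduced[i][j] - a_full[i][j]) ** 2
                  for i in range(n) for j in range(n))
    return squared / n ** 2

# Shrinkage grid described in the caption: 0 to 0.7 in 0.05 increments.
grid = [round(0.05 * k, 2) for k in range(15)]

# Toy check: the 2x2 identity differs from the zero matrix in two entries.
example = frobenius_mse([[1, 0], [0, 1]], [[0, 0], [0, 0]])  # 0.5
```

Evaluating this function at each intensity in `grid` and taking the minimum is the kind of search summarized by the solid curves.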
Maximizing accuracy
Minimizing the MSE, although theoretically tractable, is not in itself particularly useful. A more meaningful criterion is maximizing the accuracy of breeding value prediction. The dashed curves in Figure 3 show the effect of shrinkage on prediction accuracy, as measured by the correlation between GEBV (using the shrunken relationship matrix and all phenotypes for training) and true breeding values simulated with the full marker matrix. The results indicate that shrinkage based on minimizing MSE is somewhat conservative with respect to maximizing accuracy. This follows from the observation that the maximum in the accuracy curve occurred at higher shrinkage than where MSE was minimized. For the maize, rice, and pig populations, the shrinkage intensity needed to minimize MSE was 0.20–0.25 less than for maximizing accuracy. This difference was somewhat smaller for the barley populations, but they only had 2K markers for estimating the full marker relationship matrix.
Figure 4 compares GEBV accuracy against phenotypic accuracy in the maize population for a range of simulated heritabilities. The three curves correspond to (1) using all 44K markers, (2) using a random set of 384 SNPs with shrinkage, and (3) using 384 SNPs without shrinkage. For all three methods, the maximum GEBV accuracy relative to phenotypic accuracy was observed at a phenotypic accuracy of 0.3 (SE < 0.004). Comparing the two lower curves, one sees that shrinkage improved GEBV accuracy with 384 markers, and the accuracy gain increased with heritability. At a phenotypic accuracy of 0.9, shrinkage improved GEBV accuracy by 0.07 on average.
Figure 4. Prediction accuracy for simulated phenotypes in the maize population. The three curves show the difference between GEBV accuracy and phenotypic accuracy as a function of phenotypic accuracy (SE < 0.004 not shown). GEBV accuracy was highest using all markers, followed by 384 SNPs with shrinkage. All three prediction methods peaked when phenotypic accuracy was 0.3, while the accuracy gain due to shrinkage increased monotonically with phenotypic accuracy. Phenotypic accuracies between 0.4 and 0.6 represented a “sweet spot” for shrinkage: in this range, heritability was high enough for shrinkage to substantially improve GEBV accuracy but not so high that phenotypes were more accurate.
Figure 4 also illustrates that phenotypic accuracy can be superior to GEBV accuracy for highly heritable phenotypes. When phenotypic accuracy was above 0.6, it surpassed GEBV accuracy using random sets of 384 SNPs without shrinkage, and the crossover with shrinkage occurred at phenotypic accuracy equal to 0.8. This phenomenon arises because low-density markers sample the genome incompletely, leading to discrepancy between the true and estimated relationship matrices. If the sampling error is large enough, the accuracy of the phenotypes is corrupted rather than improved through the mixed model analysis. The “sweet spot” for shrinkage in this simulation was at phenotypic accuracies between 0.4 and 0.6. In this range, GEBV accuracy was substantially improved by shrinkage and was also higher than phenotypic accuracy.
These trends were confirmed by our analysis of three anonymous traits in the pig population, for which progeny-test-estimated breeding values (pEBV) are available to calculate accuracy (Cleveland et al. 2012). Table 2 compares the accuracy of phenotypes, high-density SNPs (53K), and low-density SNPs (random sets of 384), both with and without shrinkage. The top row for each trait shows the accuracy for individuals with measured phenotypes; the bottom row is for individuals without a measured phenotype. Looking at the last two columns, one sees a clear benefit to using shrinkage for predicting the breeding value of phenotyped individuals, and this benefit increased with heritability. For trait T3 (h² = 0.38), shrinkage increased 384 SNP GEBV accuracy from 0.56 to 0.62, a gain of 0.06 (P < 10⁻¹⁰ by paired t-test). For traits T4 and T5 (h² ≈ 0.6), the accuracy gain from shrinkage was 0.09 and 0.10, respectively, but phenotypic accuracy was still higher. With a phenotypic accuracy of 0.58, trait T3 appears to be in the sweet spot: GEBV accuracy was improved by shrinkage and was also higher than phenotypic accuracy.
Table 2. Prediction accuracies for pig traits
| Trait | h²ᵃ | n | Phenotypic accuracyᵇ | GEBVᶜ accuracy, 53K SNP | GEBV accuracy, 384 SNP + shrinkage | GEBV accuracy, 384 SNP, no shrinkage |
|---|---|---|---|---|---|---|
| T3 | 0.38 | 3141ᵈ | 0.580 | 0.690 | 0.617 (0.002)ᵉ | 0.561 (0.002) |
| | | 393 | – | 0.465 | 0.370 (0.007) | 0.370 (0.007) |
| T4 | 0.58 | 3152 | 0.751 | 0.809 | 0.718 (0.002) | 0.630 (0.002) |
| | | 382 | – | 0.569 | 0.469 (0.004) | 0.469 (0.004) |
| T5 | 0.62 | 3184 | 0.734 | 0.765 | 0.678 (0.003) | 0.584 (0.003) |
| | | 350 | – | 0.520 | 0.429 (0.012) | 0.429 (0.012) |

ᵃ Heritability reported by Cleveland et al. (2012).
ᵇ Accuracy = correlation with progeny-test-estimated breeding values.
ᶜ Genomic-estimated breeding values (GEBV) calculated using all phenotyped individuals.
ᵈ Within each trait, the top row is for individuals with a measured phenotype; the bottom row is for individuals without a phenotype.
ᵉ Mean and SE based on 20 random sets of 384 markers.
Table 2 shows that shrinkage did not improve GEBV accuracy for the unphenotyped pigs, nor have we observed any benefit in simulations. For example, even with as few as 96 markers, where the gains in GEBV accuracy were 0.1–0.2 in the maize population when training on all phenotypes, there was no accuracy gain when predicting unphenotyped lines.
DISCUSSION
In the numerator relationship matrix, each diagonal element equals one plus the probability that the two alleles at a randomly chosen locus are IBD from the base population. As this probability lies between 0 and 1, the diagonal elements in the numerator relationship matrix range from 1 to 2. It was evident from Figure 1 that the diagonal elements in the realized relationship matrix can fall outside this range. In the UAR model, the diagonal elements have been modified to lie in the range 0–2 (Yang et al. 2010; Powell et al. 2010), but this has the effect of creating an improper covariance matrix for the breeding values (i.e. it may no longer be positive semidefinite).
Because the allele content at each locus is centered by the population mean, our realized relationship matrix is positive semidefinite but not strictly positive definite (there is at least one zero eigenvalue). This means the breeding values follow a singular normal distribution, but this poses no problem from the perspective of mixed model theory (Searle et al. 1992).
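The zero eigenvalue follows directly from the centering. As a short worked argument, write W for the genotype matrix after each locus (column) has been centered by its population mean and c > 0 for the scaling constant, so that A = c⁻¹WW′; the symbols W and c are introduced here only for illustration. Each column of W sums to zero over lines, so the vector of ones lies in the null space of A, while the quadratic form is never negative:

$$
W'\mathbf{1} = \mathbf{0}
\;\Longrightarrow\;
A\mathbf{1} = c^{-1} W W' \mathbf{1} = \mathbf{0},
\qquad
\mathbf{v}' A \mathbf{v} = c^{-1}\lVert W'\mathbf{v}\rVert^{2} \ge 0 \quad \text{for all } \mathbf{v}.
$$

The first statement gives the zero eigenvalue (with eigenvector proportional to the vector of ones) and the second gives positive semidefiniteness.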
Heritability
When the genetic covariance is written as proportional to the numerator relationship matrix, the proportionality constant is the additive genetic variance in the outbred base population. Because the IBS-relationship matrix uses the current population as the “base,” one might expect its proportionality constant, (Equation 11), to equal the genetic variance of the current population, but this is not true for inbred lines. As originally shown by Fisher (1941) [see also Kempthorne (1957) and Lynch and Walsh (1998)], the additive genetic variance for a single locus with no dominance is . Compared with the coefficient of the relationship matrix, the additive genetic variance is larger by a factor of (1+f).
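The factor (1+f) can be checked numerically for a single biallelic locus. The sketch below uses the standard genotype frequencies under inbreeding coefficient f and confirms that the variance of the allele count is 2p(1−p)(1+f); with a purely additive effect per allele copy, the additive variance is that effect squared times this quantity. The function names are hypothetical.

```python
def genotype_frequencies(p, f):
    """Frequencies of allele counts 0, 1, 2 at allele frequency p, inbreeding f."""
    q = 1 - p
    return {0: q * q + f * p * q, 1: 2 * p * q * (1 - f), 2: p * p + f * p * q}

def allele_count_variance(p, f):
    """Variance of the allele count (0/1/2) implied by the frequencies above."""
    freqs = genotype_frequencies(p, f)
    mean = sum(x * w for x, w in freqs.items())
    return sum((x - mean) ** 2 * w for x, w in freqs.items())

p = 0.3
outbred_var = allele_count_variance(p, f=0.0)  # 2p(1-p)
inbred_var = allele_count_variance(p, f=1.0)   # 4p(1-p), twice the outbred value
```

For fully inbred lines the variance is double the Hardy-Weinberg value, which is the (1+f) = 2 case relevant to the plant populations.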
Equation 32 can also be used to estimate h² by replacing the variance components with their ML or REML estimates.
Shrinkage
Yang et al. (2010) proposed using the identity matrix as a low-dimensional target when shrinking the estimate of the relationship matrix: . For inbred populations, this estimator is not ideal because it shrinks the off-diagonal and diagonal elements with the same intensity. By contrast, our estimator does not shrink the inbreeding coefficient.
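The contrast between the two targets can be illustrated with a toy matrix. The sketch below is an assumption-laden illustration rather than either published estimator: it treats shrinkage as a convex combination of the estimate and a target, uses the identity matrix as one target, and uses the identity scaled by the mean diagonal element as the other, which is one simple way to leave the mean diagonal (and hence f) unchanged. Both function names are hypothetical.

```python
def shrink_to_identity(a, delta):
    """(1 - delta) * a + delta * I: pulls the mean diagonal toward 1."""
    n = len(a)
    return [[(1 - delta) * a[i][j] + (delta if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def shrink_to_scaled_identity(a, delta):
    """(1 - delta) * a + delta * mean_diag * I: keeps the mean diagonal fixed."""
    n = len(a)
    mean_diag = sum(a[i][i] for i in range(n)) / n
    return [[(1 - delta) * a[i][j] + (delta * mean_diag if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def mean_diagonal(a):
    return sum(a[i][i] for i in range(len(a))) / len(a)

# Two inbred lines: diagonal elements near 1 + f = 2.
a = [[2.0, 0.5], [0.5, 2.0]]
plain = shrink_to_identity(a, 0.3)          # mean diagonal drops to 1.7
scaled = shrink_to_scaled_identity(a, 0.3)  # mean diagonal stays at 2.0
```

In both cases the off-diagonal element shrinks from 0.5 to 0.35; only the identity target also changes the implied inbreeding coefficient.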
Using both real and simulated phenotypes, we have demonstrated that shrinkage can substantially increase the accuracy of GEBVs for phenotyped individuals (or lines), but not for unphenotyped ones. Although the term “genomic selection” is typically used in the context of predicting unphenotyped individuals, it also encompasses the selection of phenotyped individuals for mating based on GEBV, which is important in plant and animal breeding. In plant breeding, we also see potential to use the realized relationship matrix with single-replicate or unbalanced multi-environment yield trials to more accurately advance lines for variety or hybrid development, and shrinkage may be beneficial in these applications.
Conclusion
There were two objectives in this study. The first was to formulate the realized relationship matrix based on identity-by-state at causal loci and by requiring the mean diagonal element to equal 1+f for the current population. For high-density markers, the optimal estimator of this relationship matrix is equivalent to the first formula of VanRaden (2008). The second objective was to explore shrinkage estimation of the relationship matrix at low marker density. In unstructured populations with more lines than markers, shrinkage estimation can increase the accuracy of GEBVs for phenotyped lines; there is no benefit without phenotypes. Particularly when phenotypes have moderate accuracy, e.g. from preliminary yield trials in plant breeding, shrinkage estimation has the potential to improve the selection of lines as parents or for variety development.
Acknowledgments
Support for this research was provided by the USDA-ARS and the Bill and Melinda Gates Foundation.
Appendix 1
Appendix 2
Literature Cited
Footnotes
Communicating editor: J. B. Holland
Author notes
Supporting information is available online at http://www.g3journal.org/lookup/suppl/doi:10.1534/g3.112.004259/-/DC1.



