-
PDF
- Split View
-
Views
-
Cite
Cite
David Gerard, Luís Felipe Ventorim Ferrão, Priors for genotyping polyploids, Bioinformatics, Volume 36, Issue 6, March 2020, Pages 1795–1800, https://doi.org/10.1093/bioinformatics/btz852
- Share Icon Share
Abstract
Empirical Bayes techniques to genotype polyploid organisms usually either (i) assume technical artifacts are known a priori or (ii) estimate technical artifacts simultaneously with the prior genotype distribution. Case (i) is unappealing as it places the onus on the researcher to estimate these artifacts, or to ensure that there are no systematic biases in the data. However, as we demonstrate with a few empirical examples, case (ii) makes choosing the class of prior genotype distributions extremely important. Choosing a class is either too flexible or too restrictive results in poor genotyping performance.
We propose two classes of prior genotype distributions that are of intermediate levels of flexibility: the class of proportional normal distributions and the class of unimodal distributions. We provide a complete characterization of and optimization details for the class of unimodal distributions. We demonstrate, using both simulated and real data that using these classes results in superior genotyping performance.
Genotyping methods that use these priors are implemented in the updog R package available on the Comprehensive R Archive Network: https://cran.r-project.org/package=updog. All code needed to reproduce the results of this article is available on GitHub: https://github.com/dcgerard/reproduce_prior_sims.
Supplementary data are available at Bioinformatics online.
1 Introduction
Because of their importance in agriculture (Udall and Wendel, 2006), evolution (Soltis et al., 2014) and biodiversity (Soltis and Soltis, 2000), scientists are increasingly interested in studying the genomic landscape of polyploid organisms (organisms with more than two copies of their genome). In these studies, researchers usually first estimate the number of copies of each allele (dosage) at given polymorphic loci in a sample. These dosages may then be fed into downstream analyses such as genome-wide association studies (Benevenuto et al., 2019; Ferrão et al., 2018; Rosyara et al., 2016), genomic prediction (Amadeu et al., 2019; de Bem Oliveira et al., 2019; Endelman et al., 2018; de C Lara et al., 2019) and genetic mapping (Ferreira et al., 2019; Shirasawa et al., 2017). To determine allele dosage, researchers first measure the reference and alternative counts of small reads from some sequencing technology (Baird et al., 2008; Elshire et al., 2011). Based on these read counts, they then infer the dosage of reference and alternative alleles using statistical methods.
Due to constraints in sequencing depth, many genotyping methods incorporate empirical Bayes techniques to borrow strength between individuals (Blischak et al., 2018; Clark et al., 2019; Gerard et al., 2018; Maruki and Lynch, 2017; Serang et al., 2012; Voorrips et al., 2011). These methods posit some class of prior distributions for the possible genotypes, select a distribution among this class using maximum marginal likelihood, and incorporate this prior genotype distribution in a Bayesian inference scheme to derive a posterior genotype distribution from which genotype calls are made. Adaptively estimating the prior distribution allows methods to automatically gauge the level of prior information in the data and use this prior information to improve genotyping.
In (empirical) Bayesian inference, researchers typically first navigate toward prior classes that are either ‘uninformative’ or ‘flexible’. In the case of genotyping, a typical ‘uninformative’ prior would be the discrete uniform distribution, which assumes that each genotype is equally probable. This distribution was one of the first used in the literature (McKenna et al., 2010). If a researcher wanted no constraints on the genotype distribution, this can be accomplished with the class of general categorical distributions, which allows each genotype to occur with any frequency, subject to the constraint that the genotype frequencies sum to one. That is, it places no assumptions on the structure of the prior distribution and simply tries to fit the best genotype distribution to the data. This class was also one of the early priors used for genotyping polyploids (DePristo et al., 2011; Li, 2011).
Besides the genotype of the individual, there are usually other, more technical, aspects of the data that should be taken into consideration during allele dosage inference, including sequencing error rate, allele bias and overdispersion (Gerard et al., 2018). A principled empirical Bayes approach would integrate over the uncertainty in these other parameters prior to estimating the genotype prior. However, computation is a major concern as there are often tens or hundreds of thousands of SNPs to genotype in a single dataset, and even increases of seconds can make a genotyping method untenable for applied researchers. Thus, for computational reasons, genotyping methods either assume these parameters are known a priori (Blischak et al., 2018; Clark et al., 2019; Maruki and Lynch, 2017) or estimate them simultaneously with the prior genotype distribution (Gerard et al., 2018; Voorrips et al., 2011).
Assuming these parameters are known a priori is unappealing as it prohibits adapting to the features present in the data. However, using the second strategy of simultaneously estimating technical features with the genotype prior makes the choice of the class of priors of vital importance—priors that are either too flexible or too constrained perform poorly in practice. To demonstrate this, we present a genotype plot (Gerard et al., 2018) of a single SNP from the autotetraploid potato data from Uitdewilligen et al. (2013),Figure 1A. A genotype plot contains the reference counts of an individual on the y-axis and the alternative counts on the x-axis. Individuals with a dosage of four reference alleles would lie near the x = 0 vertical line, while individuals with a dosage of zero reference alleles would lie near the y = 0 horizontal line. The plot seems to display two clusters of individuals, which would indicate to a researcher that there is a group of potatoes with a genotype of four reference alleles and a group of potatoes with a genotype of three reference alleles. However, when we fit updog (the software implementation of the empirical Bayes genotyping method described in Gerard et al., 2018) to these data, we obtain unreasonable genotype estimates when we either use the class of general categorical distributions (labeled ‘General’ in Fig. 1B) or we use the discrete uniform distribution (labeled ‘Uniform’ in Fig. 1C). These results generalize to other datasets (Supplementary Fig. S1). As a preview, Figure 1D demonstrates that using one of our proposed classes of prior distributions results in intuitive genotype estimates.

The number of reference reads (y-axis) versus the number of alternative reads (x-axis) from 84 autotetraploid potatoes for a SNP from Uitdewilligen et al. (2013). The raw unannotated data are presented in (A). The other panels are color-coded by estimated genotype according to updog fits using either (B) the general categorical class of priors, (C) the discrete uniform prior distribution or (D) the class of proportional normal priors. The dashed lines in (B) through (D) are the estimated expected counts, and are functions of the estimated allele bias and estimated sequencing error rate (see Gerard et al., 2018). Only the proportional normal class of priors provides intuitive genotyping
This indicates that using a class of priors that is either too flexible (the general categorical) or too restrictive (the discrete uniform) can result in poor genotyping. In the case of the discrete uniform prior, updog effectively tries to estimate the allele bias and sequencing error to make the empirical genotype distribution as close to a discrete uniform as possible. As the real genotype distribution is not discrete uniform, this results in extreme estimates of the allele bias and unreasonable genotype estimates. In the case of the general categorical prior, since we do not have a data-adaptive prior distribution on the bias, there is little loss in marginal likelihood for having an extreme allele bias. And so the bias parameter is free to vary as long as it increases the marginal likelihood. Indeed, if we include a slight penalty on the bias parameter, the general categorical class of prior distributions results in reasonable genotype estimates (Supplementary Fig. S2). These results indicate that having too flexible a class of prior distributions can result in unstable genotype estimates; while having too constrained a class of prior distributions can result in the likelihood parameters adapting to the prior distribution instead of having the prior distribution improve our estimates of the genotypes.
In this article, we propose two classes of prior distributions for genotyping polyploids which are designed to have intermediate levels of flexibility (Section 2). One is the class of discrete unimodal distributions, for which we provide a complete characterization. The other is the class of distributions whose probability masses are proportional to a normal density. We provide optimization details for these classes, along with other classes of priors used in the literature. We demonstrate, through simulation and on real data, that these two classes, and in particular the proportional normal class, are more robust to varying genotype distributions (Section 3). The proportional normal prior is the default in the updog software available on the Comprehensive R Archive Network: https://cran.r-project.org/package=updog.
2 Materials and methods
These posterior probabilities are then used to make genotype calls. Deriving a general-purpose expectation-maximization algorithm (Dempster et al., 1977) to obtain and is not difficult (Supplementary Section S2.1), though special optimization considerations are needed for some of the priors we discuss in this section (Supplementary Sections S2.1 and S2.2).
Prior distributions that have been used in the past (though not necessarily in the above framework) include:
Equation (4) is merely a convolution of two hypergeometric distributions. The and parameters are the genotypes of the two parents. We will denote this distribution as the ‘F1 distribution’ as it results from an F1 cross of autopolyploid individuals.
The α parameter is the allele frequency of the reference allele. The binomial distribution results by assuming the individuals are in Hardy-Weinberg equilibrium.
There are many issues with these previously used priors. The issues with the discrete uniform and general categorical distributions were already noted in Section 1. The F1 and binomial distributions are only applicable in specific situations. The F1 distribution can only be applied if the individuals are known siblings; the binomial distribution relies on the strong assumptions of Hardy-Weinberg equilibrium (Crow, 1988). Though the beta-binomial distribution can account for overdispersion, it is unable to account for underdispersion, which may be caused by, e.g. unobserved relatedness (e.g. the F1 distribution is generally less dispersed than a binomial, and hence a beta-binomial).
Motivated by the fact that most of the prior distributions previously proposed in the literature are unimodal, our second proposal is to model genotype probabilities within the flexible class of unimodal probability mass functions. In Supplementary Section S1 we exactly characterize this class of distributions using mixtures of discrete uniform distributions. This idea was motivated by the work of Stephens (2017) who approximated continuous densities using mixtures of continuous uniform distributions. Our characterization lends itself to a convex optimization problem during the M-step of the EM algorithm (Supplementary Section S2.1). But rather than use standard convex optimization techniques, we found it faster to develop a weighted EM scheme to optimize over the class of unimodal distributions (Supplementary Section S2.2).
All priors explored in this article are summarized in Supplementary Table S2.
3 Results
3.1 Simulation results
In this section we compare, through simulation, the seven different classes of prior distributions from Section 2. Our strategy was to generate genotypes under seven different genotype distributions, where each genotype distribution was designed to be favorable to one of the classes of prior distributions. The discrete uniform, binomial (Hardy-Weinberg), beta-binomial and F1 favorable genotype distributions were all generated according to their assumed priors. For the genotype distribution favorable to the proportional normal prior, we generated genotypes according to the proportional normal distribution with a small variance since we motivated this distribution by the need to account for underdispersion. For the unimodal favorable genotype distribution, we allowed genotype probabilities to decline linearly from K to 0 since none of the other more structured prior classes have such linear declines. Finally, for the genotype distribution favorable to the general categorical prior, we specified non-unimodal genotype proportions, which none of the other priors should be able to accommodate. The specific genotype distributions are listed in Supplementary Section S3 and are graphically represented in Supplementary Figure S6.
Given individual genotypes, we generated sequencing data according to the likelihood in Gerard et al. (2018), setting the sequencing error rate to 0.001, varying the overdispersion parameter over (where larger values indicate greater dispersion), and varying the allele bias parameter over (where values further from 1 indicate greater bias). Each repetition, we generated thirty hexaploid individuals at a sequencing depth of 100 reads. For each unique combination of allele bias, overdispersion, and genotype distribution, we generated 500 datasets. For each dataset, we fit updog assuming one of the seven prior classes described in Section 2.
The results under conditions of no allele bias and no overdispersion are presented in Figures 2 and 3. The results under varying levels of bias and overdispersion are presented in Supplementary Figures S7–S27. Figure 2 plots the proportion of individuals genotyped correctly (with larger values being better) whereas Figure 3 plots the difference between the estimated proportion of individuals misclassified and the actual proportion of individuals misclassified (with values closer to 0 being better). In Figures 2 and 3, we see that each class of prior distributions performs best when the genotype distribution is designed to be favorable to it. Except for the classes of proportional normal distributions and unimodal distributions, every class performs very poorly under some genotype distribution. However, the class of unimodal distributions performs poorly under some genotype distributions under larger levels of bias and overdispersion (Supplementary Figs S8, S11, S15 and S18). The class of proportional normal distributions only performs poorly under the (unrealistic) general categorical-favorable genotype distribution with large levels of bias and overdispersion (Supplementary Figs S13 and S20). Only the class of proportional normal priors performs adequately when the gentoypes are very underdispersed (Supplementary Figs S11 and S18). Estimates of the allele bias when using the class of proportional normal priors appear to be unbiased (Supplementary Figs S23–S27). These results indicate that the class of proportional normal distributions performs the best under a wide variety of settings.

Proportion of individuals genotyped correctly (y-axis) stratified by the assumed class of prior distributions (x-axis) used in an updog fit. The facets index the distributions used to generate the genotypes, with each genotype distribution designed to be favorable to one of the classes of prior distributions (orange). The distributions are ordered roughly from least flexible (the discrete uniform) to most flexible (the general categorical class). The read counts were generated assuming no allele bias and no overdispersion

Estimated proportion of individuals incorrectly genotyped subtracted from the actual proportion of individuals incorrectly genotyped (y-axis) stratified by the assumed class of prior distributions (x-axis) used in an updog fit. The horizontal line (at y = 0) indicates unbiased estimation of the misclassification error rate. The facets index the distributions used to generate the genotypes, with each genotype distribution designed to be favorable to one of the classes of prior distributions. The distributions are ordered roughly from least flexible (the discrete uniform) to most flexible (general categorical). The read counts were generated assuming no allele bias and no overdispersion
3.2 Sweet potato results
In this section, we compare the seven classes of prior distributions discussed in Section 2 using the data from Shirasawa et al. (2017). The hexaploid sweet potatoes in these data are known siblings, resulting from a single generation of selfing (an S1 cross). Thus, the class of genotype distributions is known to be F1 (Equation 4) with the constraint that (Gerard et al., 2018), which we will simply call ‘S1’. We fit updog, using all seven prior classes in Section 2 as well as the S1 prior class, to the 1000 SNPs with mean read-depth closest to 100. We ran updog only on a sample of twenty individuals, as having fewer individuals makes the class of prior distributions most important. For each SNP, we calculated the Euclidean distances between all pairs of posterior mean genotypes resulting from the eight different fitted prior classes. We also calculated the Hamming distances (divided by the number of individuals) between the posterior mode genotypes resulting from the eight different fitted prior classes.
The results are presented in Figure 4. As the genotype distribution is known to belong to the S1 class, we will consider its genotypes a proxy for the true genotypes. Thus, prior classes that are closest to S1 (in terms of either Euclidean distance from posterior mean genotypes or Hamming distance from posterior mode genotypes) perform the best. We see in Figure 4 that F1 performs the best. This is expected since S1 is a small subclass of the class of F1 distributions. The second best performance comes from the class of proportional normal priors. Thus, on real data, we see the gains attained by using the class of proportional normal priors. Mean distances between the results of all methods are presented in Supplementary Figures S28 and S29.

Boxplots of distances (x-axis) stratified by prior class (y-axis). The left facet contains the Euclidean distances between posterior means when either using the S1 prior or the specified prior on the y-axis. The right facet contains the proportion of individuals genotyped differently when either using the S1 prior or the specified prior on the y-axis. Distances closer to 0 indicate superior performance
4 Discussion
In this article, we reviewed the classes of prior distributions used to genotype polyploids, found these classes either too flexible or too restrictive, and proposed two new classes of priors of intermediate levels of flexibility. This required us to characterize the class of discrete unimodal distributions using mixtures of discrete uniform distributions. We then demonstrated, through simulation, that our proposals work better under the genotyping inference scheme in updog. In particular, we recommend to use by default the class of proportional normal distributions as it resulted in robust performance under different genotype distributions. The proportional normal prior is the default in the updog software available on the Comprehensive R Archive Network: https://cran.r-project.org/package=updog. Our classes of priors are also more generally applicable to other inference schemes.
Our priors are appropriate defaults for a wide variety of genotyping scenarios. For example, breeding populations might exhibit violations in Mendelian segregation proportions (Equation 4) due to double reduction (Stift et al., 2010), preferential pairing (Voorrips and Maliepaard, 2012), or resulting from an allele being semi-lethal. Additionally, unstructured populations might exhibit violations from Hardy-Weinberg equilibrium (Equation 5) due to selection or non-random mating (Crow, 1988). Our proportional normal and unimodal classes of priors may be used in all of these scenarios without explicitly modeling these processes.
Accurate dosage estimation is vital for many statistics/bioinformatics procedures. We note, however, that not all genomic studies require dosage estimates. For example, transcriptional (Colle et al., 2019), genomic rearrangement (Meng et al., 2019) and even genomic prediction studies (de Bem Oliveira et al., 2019; Sverrisdóttir et al., 2017) have been performed without dosage estimates.
Here, we only analyzed one SNP at a time, though some recent approaches have also attempted to borrow strength between SNPs (Blischak et al., 2018; Clark et al., 2019). In particular, Clark et al. (2019) tries to account for linkage disequilibrium (LD) between SNPs. However, their method, though intuitive, is somewhat ad hoc. Generalizing our proportional normal prior would allow for a more principled approach to accounting for LD by simply letting the genotype distribution across multiple SNPs be proportional to a multivariate normal. To account for the high-dimensionality present due to the large number of SNPs, one could impose strong structural constraints on the covariance matrix, such as letting it decay exponentially according to recombination rate (Wen and Stephens, 2010).
Acknowledgements
All graphics were made using ggplot2 (Wickham, 2016) in the R statistical language (R Core Team, 2019). We would like to thank Matthew Stephens for some useful conversations. We thank the College of Arts and Sciences at American University for providing a travel grant from the Mellon Fund. This allowed the authors to meet and collaborate on this article.
Conflict of Interest: none declared.
References
R Core Team. (