A generic hidden Markov model for multiparent populations

Abstract A common step in the analysis of multiparent populations (MPPs) is genotype reconstruction: identifying the founder origin of haplotypes from dense marker data. This process often makes use of a probability model for the pattern of founder alleles along chromosomes, including the relative frequency of founder alleles and the probability of exchanges among them, which depend on a model for meiotic recombination and on the mating design for the population. While the precise experimental design used to generate the population may be used to derive a precise characterization of the model for exchanges among founder alleles, this can be tedious, particularly given the great variety of experimental designs that have been proposed. We describe an approximate model that can be applied for a variety of MPPs. We have implemented the approach in the R/qtl2 software, and we illustrate its use in applications to publicly available data on Diversity Outbred and Collaborative Cross mice.


Introduction
Multiparent populations (MPPs) are valuable resources for the analysis of complex traits (de Koning and McIntyre 2017), including the mapping of quantitative trait loci (QTL). A wide variety of MPPs have been developed, including heterogeneous stock (HS) in mice (Mott et al. 2000) and rats (Solberg Woods et al. 2010), eightway recombinant inbred lines (RIL) in mice (Complex Trait Consortium 2004) and Drosophila (King et al. 2012), and multiparent advanced generation intercross (MAGIC) populations in a variety of plant species including Arabidopsis (Kover et al. 2009), wheat (Cavanagh et al. 2008), maize (Dell'Acqua et al. 2015), and rice (Bandillo et al. 2013).
QTL mapping in MPPs can be performed through statistical tests at individual single nucleotide polymorphisms (SNPs), as used in genome-wide association studies. However, many investigators first seek to reconstruct the mosaic of founder haplotypes along the chromosomes of MPP individuals and use this reconstruction to test for association between founder alleles and the quantitative phenotype. This approach was first introduced by Mott et al. (2000) for the analysis of HS mice, implemented in the HAPPY software, and has been continued in packages such as R/ mpMap (Huang and George 2011), DOQTL (Gatti et al. 2014), and R/qtl2 (Broman et al. 2019a).
The process of genotype reconstruction in an MPP individual is illustrated in Figure 1. The genotypes in the founder strains ( Figure 1A) and the MPP offspring ( Figure 1B) are used to calculate the probability of each possible founder genotype at each position along the chromosome ( Figure 1C). Thresholding of these probabilities can be used to infer the founder genotypes and the locations of recombination breakpoints ( Figure 1D).
Such genotype reconstructions are valuable not just for QTL analysis but also for data diagnostics (Broman et al. 2019b). For example, the inferred number of recombination breakpoints is a useful diagnostic for sample quality. Further, the reconstructed genotypes can be used to derive predicted SNP genotypes; comparing these to the observed SNP genotypes can help to identify problems in both samples and SNPs.
The probability calculation in Figure 1C depends on a model for the process along MPP chromosomes in Figure 1D. In the HAPPY software for HS mice, Mott et al. (2000) used a model of random mating in a large population. Broman (2005) extended the work of Haldane and Waddington (1931) to derive two-locus genotype probabilities in multiparent RIL. This was later developed for the case of multiparent advanced intercross populations (Broman 2012a(Broman , 2012b, including Diversity Outbred (DO) mice (Churchill et al. 2012).
Genotype reconstruction for a variety of MPP designs has been implemented in the R/qtl2 software (Broman et al. 2019a, https:// kbroman.org/qtl2). But it can be tedious analytical work to derive the appropriate transition probabilities for each new MPP design that is proposed. An alternative is to develop a more general approach for genotype reconstruction, such as used in the software RABBIT (Zheng et al. 2015). However, this approach has a variety of parameters that can be difficult to specify.
Here, we propose a similarly general method for genotype reconstruction in MPPs. We imagine that an MPP was derived from a population of homozygous founder strains at known proportions, a i , followed by n generations of random mating among a large number of mating pairs. We can derive the exact transition probabilities for this situation. The a i should be simple to specify from the MPP design, and the effective number of generations of random mating, n, can be determined by computer simulation, to match the expected density of recombination breakpoints.
Our approach has been implemented in R/qtl2. While we currently focus on data with SNP genotype calls, such as from microarrays, our model could potentially be incorporated into methods for genotype imputation from low-coverage sequencing, such as that of Zheng et al. (2018). We illustrate our approach through application to publicly available datasets on DO (Al-Barghouthi et al.

Methods
For genotype reconstruction in an MPP, we use a hidden Markov model (HMM; see Rabiner 1989). Our basic approach is as described in Broman and Sen (2009, Appendix D) for a biparental cross; the extension to an MPP is straightforward and described below.
Consider an MPP derived from k inbred lines. We focus on a single individual, and on a single chromosome with M marker positions (including pseudomarkers: positions between markers at which we have no data but would like to infer the underlying genotype). Let G m be the underlying genotype at position m. In a homozygous population, such as RIL, the G m take one of k possible values, the k homozygous genotypes. In a heterozygous population, such as advanced intercross lines (AIL), the G m take one of k 2 þ k possible values, the k 2 heterozygotes and k homozygotes. Let O m be the observed SNP genotype at position m (possibly missing). We assume that the G m form a Markov chain (that G 1 ; . . . ; G mÀ1 are conditionally independent of G mþ1 ; . . . ; G M , given G m ), and that O m is conditionally independent of everything else, given G m . The forward-backward algorithm (see Rabiner 1989) takes advantage of the conditional independence structure of the HMM to calculate PrðG m jOÞ.
The key parameters in the model are the initial probabilities, p g ¼ PrðG 1 ¼ gÞ, the transition probabilities, t m ðg; g 0 Þ ¼ PrðG mþ1 ¼ g 0 j G m ¼ gÞ, and the emission probabilities, e m ðgÞ ¼ PrðO m jG m ¼ gÞ.
A particular advantage of the HMM for genotype reconstruction is the easy incorporation of a model for genotyping errors (Lincoln and Lander 1992), which is done through the emission probabilities, which condition on the founder SNP genotypes but allow some fixed probability that the observed SNP genotype in the MPP individual is in error and incompatible with the underlying genotype G m and the SNP genotypes in the founder lines.
The initial and transition probabilities govern the underlying Markov chain, including the relative frequency of founder alleles and the frequency of recombination breakpoints along MPP chromosomes. In principle, these probabilities may be derived on the basis of the crossing design for the MPP. In practice, the transition probabilities can be tedious to derive, and exact calculations may provide no real advantage for genotype reconstruction.
Here, we derive the transition probabilities for a generic MPP design, which may then be applied generally. We consider a founder population with k inbred lines in proportions a i , and imagine subsequent generations are produced by random mating with a very large set of mating pairs.
Consider a pair of loci separated by a recombination fraction of r (assumed the same in both sexes) and let p ðnÞ ij be the probability of that a random haplotype at generation n has alleles i and j. At n ¼ 0, we have just the founding inbred lines, and so p The probabilities from one generation to the next are related by a simple recursion, as in Broman (2012b). Consider a random haplotype at generation n. It was either a random haplotype from generation nÀ1 transmitted intact without recombination, or it is a recombinant haplotype bringing together two random alleles. Thus Using the same techniques described in Broman (2012b), we find the solutions: The transition probabilities along a haplotype are derived by dividing the above by the marginal probability, a i . Thus if G 1 and G 2 are the genotypes at the two loci, we have the following transition probabilities.
For a heterozygous population (such as HS or DO mice), an individual will have two random such haplotypes. For homozygous population (such as MAGIC), we treat them like doubled haploids, by taking a single random chromosome and doubling it.
For the X chromosome, we use the same equations but replace n with ð2=3Þn, since recombination occurs only in females, so in 2/3 of the X chromosomes. This provides a remarkably tight approximation.
You can potentially use the expected number of crossovers to calibrate the number generations of random mating, or the map expansion, which is the relative increase in the number of crossovers. Let R(r) be the chance that a random haplotype has an exchange of alleles across an interval with recombination fraction r, so RðrÞ ¼ 1 À ii . The map expansion is dR/dr evaluated at r ¼ 0 (see Teuscher and Broman 2007). Using Equation (2) above, we then get that the map expansion in this population is nð1 À P a 2 i Þ. In the special case that a i 1=k for all i, this reduces to nðk À 1Þ=k.
The map expansion at generation s in DO mice on an autosome is ð7=8Þðs À 1Þ þ M 1 where M 1 is the weighted average of map expansion in the pre-CC founders (Broman 2012b), or about ð7s þ 37Þ=8. Equating this with ð7=8Þn, we can thus take n % s þ 5 when using this model to approximate the DO. For the CC, Broman (2005) showed that R ¼ 7r/(1 þ 6r), and so the map expansion is 7. Thus we can take n ¼ 8 as the effective number of generations of random mating.

Applications
We illustrate our approach with application to datasets on DO mice (Al-Barghouthi et al. 2021) and CC mice (Srivastava et al. 2017). In both cases, the approach provided results that were generally equivalent to those from the more exact model, though with important differences in the results for the X chromosome in the CC application.

DO mice
The DO mouse data of Al-Barghouthi et al. (2021) concerns a set of 619 mice from DO generations 23-33, in 11 batches by generation and including 304 females and 315 males. The mice were genotyped on the GigaMUGA array (Morgan et al. 2016) and the cleaned data consist of genotypes at 109,427 markers. A wide variety of phenotypes are available; we focus on the 20 contributing to the results in Table 1 of Al-Barghouthi et al. (2021).
We performed genotype reconstruction using the transition matrices derived specifically for DO mice (Broman et al. 2019b) as well as by the approximate model proposed above. For the DO mice at generation n, we used the transition probabilities for general eight-way AIL at n þ 5.
Following Al-Barghouthi et al. (2021), we assumed a 0.2% genotyping error rate and used the Carter-Falconer map function (Carter and Falconer 1951). Calculations were performed in R (R Core Team 2021) with R/qtl2 (Broman et al. 2019a), on an 8-core Linux laptop with 64 GB RAM. The calculations with the DOspecific model took approximately 35 min, while those with the general AIL model took 27 min, an almost 25% reduction in computation time.
The transition probabilities used by the two models are only subtly different and become less different in later generations. The probability of an exchange across an interval on a random DO chromosome, as a function of the recombination fraction for the interval and the number of generations, is shown in Figure 2.
QTL analysis proceeded by the method described in Gatti et al. (2014) and also used by Al- Barghouthi et al. (2021). Namely, we fit a linear mixed model assuming an additive model for the founder haplotypes, with a residual polygenic effect to account for relationships among individuals with kinship matrices calculated using the "leave-one-chromosome-out" method (see Yang et al. 2014), and with a set of fixed-effect covariates defined in Al- Barghouthi et al. (2021).
The genotype probabilities were almost indistinguishable. The maximum difference was 0.011 on the X chromosome followed by a difference of 0.007 on chromosome 8. For that reason, the QTL mapping results were hardly different. Across all 20 traits considered, the maximum difference in LOD scores in the two sets of results was 0.02.
The LOD curves by the two methods for tissue mineral density (TMD) and the differences between them are shown in Figure 3. The QTL on chromosomes 1 and 10 have LOD scores of 23.9 and 14.6, respectively, but the maximum difference in LOD, genome wide, between the two methods is just 0.014.

CC mice
As a second application of our approach, we consider the data for a set of 69 CC lines (Srivastava et al. 2017). These are eight-way RIL derived from the same eight founders as the DO mice, as the DO was formed from 144 partially inbred lines from the process of developing the CC .
Each CC line was formed from a separate "funnel," bringing the eight founder genomes together as rapidly as possible, for ex- , where the female parent is listed first in each cross. Inbreeding was accomplished by repeated mating between siblings.
The recombination probabilities for the autosomes in the CC do not depend on the order of the founders in the funnel for a line (Broman 2005). This is in contrast with the case of eight-way RIL by selfing (see Broman 2005, Table 2). For the X chromosome, however, the cross order is important, as only five of the eight founders can contribute. For example, in a line derived from the H)], the single-locus genotype probabilities on the X chromosome are 1/6 each for alleles A, B, E, and F, and 1/3 for allele C, while alleles D, G, and H will be absent. And note that the mitochondrial DNA will come from founder A, while the Y chromosome will be from founder H.
The cross funnel information was missing for 14 of the 69 CC lines. While the sources of the mitochondria and Y chromosome were provided for all lines, there were several inconsistencies in these data: line CC013/GeniUnc has the same founder listed as the source for its mitochondria and Y chromosome, and for three lines (CC031/GeniUnc, CC037/TauUnc, and CC056/GeniUnc) the founder on the Y chromosome is also seen contributing to the X chromosome. We used the genotype probabilities reported in Srivastava et al. (2017) to construct compatible cross funnels, with small modifications to handle the inconsistent information. We performed genotype reconstruction using the transition matrices derived specifically for CC mice (Broman 2005) as well as by the approximate model proposed above, using n ¼ 8 generations of random mating, chosen to match the expected frequency of recombination breakpoints.
The resulting probabilities were nearly identical on all autosomes in all CC lines. The maximum difference in probabilities on the autosomes was just 0.0006.
There were some important differences on the X chromosome, however. There were no cases with high probability pointing to different founder alleles by the two models, but there were several cases where two or more founders cannot be distinguished, but some would be excluded by the assumed cross design.
For example, in Figure 4, we show the genotype probabilities along the X chromosome for strain CC038/GeniUnc, as calculated with the more-exact model ( Figure 4A) and with the approximate model ( Figure 4B). We also include the results for the case that the more-exact model but when an incorrect cross design was used ( Figure 4C). Note the segment near 135 Mbp, which is inferred to be from founder NOD with the more-exact model but is equally likely B6 or NOD with the approximate model; the B6 and NOD founder strains are identical in the region, but the assumed cross design for the CC038/GeniUnc strain excluded B6. For the results using the incorrect cross design (which excluded not just B6 but also 129 and NOD), the results across the entire chromosome become a chopped-up mess, with an apparent 39 recombination breakpoints, vs 5 when the correct cross information is used.
Overall, there were seven strains where the maximum difference in the probabilities from the more-exact model and the proposed approximate model were in the range 0.25-0.50, and another eight strains with maximum difference in the range 0.10-0.25. All of the differences concern cases where multiple founders are identical for a region and either some would be excluded by the cross design, or where the difference in prior frequencies affects the results. For example, in the cross H)], the frequency of the C allele on the X chromosome is twice that of A, B, E, and F.

Discussion
We have proposed an approximate model for use with genotype reconstruction in MPPs. We derived the two-point probabilities on autosomes in the case of random mating in large, discrete generations, derived from a founder population of a set of inbred lines in known proportions. We use the same frequencies for the X chromosome, but with 2/3 the number of generations. The approach is shown to give equivalent results for the mouse DO and CC populations, though with important differences for the X chromosome in CC lines, where some founder alleles can be excluded based on the cross design. The more-exact model for the X chromosome in the CC excludes three of the eight founders based on the cross design. This is particularly useful in cases that multiple founders are identical by descent across a region. However, the approximate model is not affected by errors in the specified cross design (see Figure 4).
The value of this generic model points toward the general usefulness of the original software for MPPs, HAPPY (Mott et al. 2000), developed for the analysis of mouse HS. The results may depend on marker density and informativeness, but with a dense set of informative markers, a generic approach can provide goodquality genome reconstructions. The HMM itself is an approximation. Meiosis generally exhibits positive crossover interference, but the Markov property is closer to being correct in MPPs with multiple generations of mating, because nearby recombination events come from independent generations. This was apparent in the three-point probabilities derived by Haldane and Waddington (1931) for twoway RIL and was further explored in Broman (2005) for multiway RIL.

more−exact model
The proposed method has been implemented in the R/qtl2 software (Broman et al. 2019a). It requires specification of the founder proportions and one other parameter (the number of generations of random mating) which governs the frequency of recombination breakpoints. The founder proportions should be straightforward from the cross design; the effective number of generations of random mating may require some calibration, such as through computer simulation to match the expected frequency of recombination breakpoints.