Linkage Analysis and Haplotype Phasing in Experimental Autopolyploid Populations with High Ploidy Level Using Hidden Markov Models

Modern SNP genotyping technologies allow measurement of the relative abundance of different alleles for a given locus and consequently estimation of their allele dosage, opening a new road for genetic studies in autopolyploids. Despite advances in genetic linkage analysis in autotetraploids, there is a lack of statistical models to perform linkage analysis in organisms with higher ploidy levels. In this paper, we present a statistical method to estimate recombination fractions and infer linkage phases in full-sib populations of autopolyploid species with even ploidy levels for a set of SNP markers using hidden Markov models. Our method uses efficient two-point procedures to reduce the search space for the best linkage phase configuration and reestimate the final parameters by maximizing the likelihood of the Markov chain. To evaluate the method, and demonstrate its properties, we rely on simulations of autotetraploid, autohexaploid and autooctaploid populations and on a real tetraploid potato data set. The results show the reliability of our approach, including situations with complex linkage phase scenarios in hexaploid and octaploid populations.

Polyploids are organisms with more than two sets of chromosomes. They are very important in agriculture and play a fundamental role in evolutionary processes, such as differentiation of species (Soltis et al. 2014a). The number of sets of chromosomes in an organism is called ploidy level. These multiple chromosome sets can originate from the combination of genomes from different, but related species, or from duplicated genomes from the same species (Birchler 2012;Comai 2005). In the first scenario, they are called allopolyploids; in the second, autopolyploids. Polyploid organisms are also characterized according to their pattern of inheritance. In general, allopolyploids exhibit diploid-like (or disomic) segregation, since homologous chromosomes, or homologs, tend to form bivalents within each sub-genome. Autopolyploids, however, have more than two homologs per homology group, forming either random bivalents or multivalents during meiosis, resulting in polysomic segregation (Sybenga 1975;Soltis et al. 1993;Osborn et al. 2003). Since the molecular mechanics of polyploid organisms are quite complex, this dichotomy is often broken, and polyploids can display intermediate modes of inheritance (Otto and Whitton 2000;Osborn et al. 2003). Throughout this paper, the term autopolyploid (or autotetraploid, autohexaploid, etc.) will refer to polyploid organisms that exhibit polysomic segregation.
Despite advances in genetic studies in autotetraploids, (Mather 1936;Fisher 1943Fisher , 1947Hackett et al. 2001;Luo et al. 2004Luo et al. , 2006Wu et al. 2004;Leach et al. 2010;Li et al. 2010;Hackett et al. 2013;Xu et al. 2013;Rehmsmeier 2013;Zheng et al. 2016), there is still a shortage of statistical methods to address organisms with higher ploidy levels, such as sweet potato (Kriegner et al. 2003;Arizio et al. 2014;Shirasawa et al. 2017), sugarcane Garcia et al. 2013), some ornamental flowers and forage crops (reviewed in (Soltis et al. 2014b)). In this work, we denote as high-level autopolyploids those autopolyploid organisms with ploidy level greater than four. A fundamental class of statistical methods that have lagged behind in high-level autopolyploid studies is the construction of genetic maps. A reliable genetic map is a crucial step in quantitative trait loci (QTL) analysis, as well as the assembly of reference genomes and the study of evolutionary processes (Lewin et al. 2009;Luo et al. 2013;Lemmon and Doebley 2014). Although understanding the concept of genetic mapping is rather easy, the construction of such maps in high-level autopolyploids is challenging. Even under bivalent pairing, there are many possible configurations during meiosis, and the number of possibilities gets exponentially larger as the ploidy level increases. Denoting m as the ploidy level, it is possible to find up to m different alleles for a locus in one individual. Furthermore, if some of those alleles are not distinguishable, it is necessary to consider the number of copies of each different allelic form, also known as dosage.
The construction of a genetic map in a full-sib population can be summarized in five basic steps: i) estimation of pairwise recombination fractions and associated statistical tests; ii) separation of markers into linkage groups; iii) ordering of markers within each linkage group using an optimization technique; iv) parental phasing, recombination fraction updating and likelihood computation (or other objective function) and v) if the order is optimal, the map is complete, otherwise, return to step iii. Historically, genetic maps in high-level autopolyploids have been constructed using only alleles present in one homolog, called single-dose or simplex markers (Wu et al. 1992;Sorrells 1992). In a full-sib population, these markers segregate in a 1:1 ratio (if they are present only in one parent), or in a 1:2:1 ratio (if present in both parents, also called double simplex). Given this level of simplification, it is possible to use the five-step procedure coupled with a standard software suitable for diploid populations. Nevertheless, it is well accepted that the use of single-dose markers imposes limitations on the construction of adequate genetic maps. These approaches sub-sample the genome (Hackett et al. 2013;Garcia et al. 2013), which precludes further consideration of multiallelic effects in models for QTL mapping and subsequent studies. Moreover, there is low statistical power to detect linkage when markers are in repulsion phase configurations (Wu et al. 1992;Ripol et al. 1999). Although some authors have addressed this problem by including multi-dose (or multiplex) markers when constructing genetic maps and performing QTL mapping (Ripol et al. 1999;Doerge and Craig 2000), the limitations on the genotyping technologies at the time required that the allelic dosage had to be inferred based on expected segregation rates. Because of the high amount of hidden information imposed by marker systems on those studies (Wu et al. 1992;Ripol et al. 1999), the estimation of recombination fraction between multi-dose markers was highly impaired.
Quantitative genotyping technologies for single nucleotide polymorphism (SNPs) evaluation have opened the door for further genetic mapping studies in high-level autopolyploids. It is now possible to measure the abundance of specific alleles within a locus in a polyploid genome (Voorrips et al. 2011;Serang et al. 2012;Hackett et al. 2013;Garcia et al. 2013;Bargary et al. 2014;Mollinari and Serang 2015). This technology, combined with the genotypic distribution in the population, makes it possible to infer the allelic dosage by using the ratio between the abundances of the two alternative alleles (Serang et al. 2012). Once the dosage of the markers is estimated, the construction of linkage maps can be significantly improved by taking this information into account, as previously done in autotetraploids by Hackett et al. (2013Hackett et al. ( , 2014. Genetic linkage maps can be constructed based on two-point or multipoint estimates of the recombination fraction. Two-point methods use information on pairs of markers, and even though they are less computationally demanding than multipoint methods, they require a higher amount of information in the markers to provide reliable results. Only recently, using a two-point-based method, van Geest et al. (2017) published an integrated hexaploid chrysanthemum genetic map using scripts implemented in the R package polymapR (Bourke et al. 2018). Multipoint approaches, instead, use information of multiple markers present in a linkage group, increasing the statistical efficiency of the analysis Jiang and Zeng 1997;Mollinari et al. 2009;Leach et al. 2010). This feature is particularly important in polyploid linkage analysis, where markers are mostly partially informative. One widely used procedure to obtain multipoint estimates is the hidden Markov model (HMM) . The construction of the genetic map using this method provides the estimates of the recombination fractions between all adjacent markers in a linkage group, as well as the multipoint likelihood, which has been shown to be an excellent criterion to evaluate and compare linkage phase configurations and orders of markers (Mollinari et al. 2009). Leach et al. (2010) presented a statistical framework in which HMMs were applied to reconstruct genetic linkage maps, but it was limited to autotetraploids. Recently, software packages such as polymapR (Bourke et al. 2018), pergola (Grandke et al. 2017) and netgwas (Behrouzi and Wit 2017), have been developed to build genetic maps in high-level autopolyploids. However, only polymapR is capable of estimating recombination fractions and inferring parental linkage phases in outcrossing populations, though it does not use multipoint procedures to perform those tasks.
The main challenges we address in this paper are the inference of the haplotypes of the multiple homologs and the multipoint estimation of recombination fractions in high-level autopolyploids. Although Zheng et al. (2016) proposed a probabilistic multilocus haplotype reconstruction model for autotetraploids considering double reduction, this remains as an open question for organisms with higher ploidy levels. Our method relies on an HMM and is developed for species with even ploidy levels under random chromosome segregation (complete polysomic inheritance). We also present a two-point method which is capable of dealing with hundreds of markers even in high ploidy level scenarios. Hence, we are proposing solutions for steps i and iv in high-level autopolyploids.
Step ii is straightforward from step i using clustering algorithms, as proposed by (Van Ooijen and Jansen 2013). Even though step iii is a challenging task in genetic mapping, it can be addressed using pairwise recombination fractions or the resulting likelihood of the Markov model as it has been proposed by several studies Buetow and Chakravarti 1987;Doerge 1996;Van Os et al. 2005;Wu et al. 2008;Preedy and Hackett 2016;Wang et al. 2016). To evaluate our method, and to show its properties, we rely on simulations of autotetraploid, autohexaploid, and autooctaploid data and on a real tetraploid potato data set. We also perform a set of hexaploid simulations to compare our method to the one implemented in polymapR (Bourke et al. 2018). The R computer codes to reproduce all simulations and analysis are publicly available.

MATERIALS AND METHODS
In this section, we define the notation used throughout this article and present the probabilistic model for the gamete formation in autopolyploids. The mathematical derivation of the HMM, including the estimation of the model parameters, is based on the work of Rabiner (1989), which presents the hidden Markovian process using tree major elements, namely, the transition probability function (Equation 6), the initial state function (Equation 7), the emission probability function (Equations 8 and 9). In the genetic mapping context, the first connects adjacent marker loci in function of their recombination fraction, the second is the prior probability of the genotypes in the mapping population, and the third connects the observed marker dosage to the complete multi-allelic hidden genotypic states. While these ideas are widespread in the genetic mapping literature, for instance in ; Jiang and Zeng (1997); Hackett and Broadfoot (2003); Hackett et al. (2013);Leach et al. (2010), here we present a generalization to any even ploidy level. We avoid using matrix notation throughout our model derivation since their high dimensionality would precludes the application of HMM techniques in polyploids with high ploidy level. We conclude this section by explaining the complexity of estimating linkage phases between markers, presenting an efficient two-point algorithm that simplifies the problem in a way that allows the phasing to be inferred using real data.

Notation
Consider a mapping population derived from a cross between two autopolyploid individuals P and Q with the same ploidy level (full-sib family). The ploidy level is denoted by m, and can be any even number greater than zero. Let the vectors P m k ¼ fP i k g and P m kþ1 ¼ fP i kþ1 g, and Q m k ¼ fQ i k g and Q m kþ1 ¼ fQ i kþ1 g, i ¼ 1; ⋯ ; m, denote the genotype of two adjacent multiallelic loci k and k þ 1 in P and Q, respectively. The superscript i indicates one of the possible alleles for the loci, and each locus has m different alleles in each parent. For example, for a cross between two autohexaploid individuals, P 6 k ¼ fP 1 k ; P 2 k ; ⋯ ; P 6 k g; similarly, this can be done for P 6 kþ1 , Q 6 k and Q 6 kþ1 . All alleles denoted by the same superscript number are in the same homologs (e.g., P 1 k and P 1 kþ1 are in homolog 1, etc). The following assumptions are made to ensure random chromosome segregation (Muller 1914;Haldane 1930) and no double reduction (Burnham 1962): i) there is only formation of bivalents during the meiosis; ii) there is no preferential pairing during the formation of bivalents; iii) all bivalents have the same recombination fraction between loci k and k þ 1; iv) bivalents are independent; v) there is separation of sister chromatids during meiosis II and vi) there is no chromatid interference. Consequences of violations of these assumptions will be addressed later using simulations. Although each bivalent is composed by a pair of chromosomes with two sister chromatids each, given assumption vi, we will consider only one sister chromatid per homolog during the derivation of our model.

Bivalent formation
Bivalent formation occurs during meiosis I (more specifically, at the pachytene stage of prophase). In diploid cells, there is only one possible pairing configuration: two duplicated homologs from a homology group pair to form one bivalent. However, in autopolyploid cells, given the previous assumptions, the expected number of possible pairing configurations, i.e., the number of possible bivalent chromosomal pairings for a given homology group during meiosis can be obtained by sequentially choosing pairs out of m homologs without replacement, divided by all possible permutations of the chosen pairs The orientation of the bivalents does not affect the expected frequencies of each gamete type, and therefore will not be considered. For example, as showed by Hackett (2001) in autotetraploids, there are two bivalents and three possible bivalent configurations: homolog pair as 1 with 2, and 3 with 4; or, 1 with 3 and 2 with 4; or 1 with 4 and 2 with 3. We denote C ¼ fc j g, j ¼ 1; ⋯ ; w m a set of all bivalent configurations for a given ploidy level.
Expected gametic frequency for a given bivalent configuration We will present the expected gametic frequencies considering parent P. Since parent Q undergoes a similar process, it is possible to combine the expected gametic frequencies to obtain the expected genotypic frequency in the full-sib population. Each of the bivalents obtained for a given configuration c j can result in two types of chromosomes for loci k and k þ 1: parental, which results from bivalents with zero or any other even number of recombinations between k and k þ 1; and recombinants, which results from bivalents with any odd number of recombinations. As presented by Doerge and Craig (2000), the probabilities of all chromosome types for any single bivalent can be represented always as where r k is the recombination fraction between k and k þ 1, i 6 ¼ i9.
For a given configuration c j , the expected frequencies for all possible gametes derived from that configuration is where 5 denotes the Kronecker product of matrices and subscripts in V indicate the corresponding bivalent. All elements of this product are of the form where l denotes the number of total recombinant bivalents between loci k and k þ 1, l 2 f0; ⋯; m=2g. From this, we can define the probability of observing any gamete (for two loci) given a bivalent configuration c j as Pr where vectors p k and p kþ1 denote a subset of m 2 alleles present in P m k and P m kþ1 , respectively; fp k ; p kþ1 g indicates a gamete for loci k and k þ 1 from parent P. Consistent means that the gamete can be produced from bivalent configuration c j . Notice that some gametes cannot be obtained from c j once the bivalents are formed.
Since we assume that alleles with the same superscript are in the same homolog, l can be obtained by a simple examination of superscripts of elements contained in p k and p kþ1 . Consider, for example, c 1 ¼ fð1; 2Þ; ð3; 4Þ; ð5; 6Þg (m ¼ 6, Figure 1). If one observes p k ¼ fP 1 k ; P 3 k ; P 5 k g and p kþ1 ¼ fP 1 kþ1 ; P 4 kþ1 ; P 6 kþ1 g, the number of recombinant chromosomes is l ¼ 2. Therefore, PrðfP 1 k ; P 3 k ; P 5 k g; fP 1 kþ1 ; P 4 kþ1 ; P 6 kþ1 gjc 1 Þ ¼ ð1 2 r k Þðr k Þ 2 2 3 . On the other hand, PrðfP 1 k ; P 2 k ; P 5 k g; fP 1 kþ1 ; P 2 kþ1 ; P 5 kþ1 gjc 1 Þ ¼ 0, since it is impossible Figure 1 One possible pairing configuration in an autohexaploid, namely c 1 . P i k denotes one allele present in homolog i for locus k in parent P. Notice that some allelic configurations, such as ðfP 1 k ; P 2 k ; P 5 k g; fP 1 kþ1 ; P 2 kþ1 ; P 5 kþ1 gÞ, are impossible to be obtained in this bivalent pairing. In this case, the homologs containing alleles P 1 k and P 2 k will migrate to opposite poles of the cell during meiosis I. Therefore, P 1 k and P 2 k will not be present in the same gamete.
to obtain this gamete from configuration c 1 , i.e., it is not consistent with c 1 .

Expected gametic frequency unconditional to bivalent configurations
In reality c j is unknown, thus the conditional probability given by Equation (2) must be considered for all possible c j . The probability of observing a gamete fp k ; p kþ1 g, unconditional to c j , can be expressed as It is important to notice that only a subset of C is consistent with the observed gamete, and consequently Prðp k ; p kþ1 c j Þ . 0 only for some c j 's. Figure 2 shows a graphical representation of Equations 2 and 3 for autohexaploid gametes.
The probability of observing a specific gamete is always the same for each c j in this consistent subset (Equation 2). Therefore, under random pairing (assumption ii), our task reduces to finding the number of elements in this subset that are consistent with the observed gamete and multiply Prðp k ; p kþ1 c j ÞPrðc j Þ by this number. The result is the probability of observing a gamete unconditional to the bivalent configuration.
For every gamete, l can change from zero to m=2 recombinant homologs. The observed gamete is the result of homologs that migrate to one pole of the cell at anaphase I with a subsequent migration to opposite poles at anaphase II. Since we are assuming that there is separation of sister chromatids during anaphase II, if l ¼ 0 (all chromosomes are of parental type), there is no information about the pairing configuration of the homologs that migrate to the opposite pole of the cell. In this situation, there are !. This is precisely the number of elements in the subset of C consistent with the observed gamete. Given the assumption of no preferential pairing during the formation of bivalents, Prðc j Þ ¼ 1 wm , the probability of a gamete fp k ; p kþ1 g, unconditional to c j , can be simplified to

Map reconstruction via hidden Markov model
The construction of a genetic map involves the estimation of the genetic distance and order between markers within linkage groups. If the origin of the haplotypes (i.e., linkage phase) for the parents of the mapping population is unknown, it also needs to be estimated. For several years, hidden Markov models have been proven to be an excellent avenue for obtaining these estimates Jiang and Zeng 1997;Mollinari et al. 2009;Leach et al. 2010). The multipoint likelihood obtained using HMMs is employable as a criterion to compare marker orders and judge which one is best, and also to provide a reliable estimation of recombination fraction and linkage phases. (Rabiner 1989) defines an HMM as a generative process composed of three well-defined probability distributions: transition, initial state and emission. In genetic mapping context, the transition probability distribution is defined as the probability of having a particular genotype  2). The rows and the columns indicate gametic configurations for loci k and k þ 1, respectively. For simplification, only the superscripts of the gametic configurations were presented. For example, row 123, column 123, represent the gamete ðfP 1 k ; P 2 k ; P 3 k g; fP 1 kþ1 ; P 2 kþ1 ; P 3 kþ1 gÞ. Colored cells indicate the probability of gametic configurations consistent with the bivalent configuration c : . The color scale indicates the number of recombinant bivalents associated to the gametic probability varying from 0 (dark blue) to 3 (light blue). Blank cells indicate non-consistent configurations. The far right full table represents the sum over all c configurations, weighted by their probability (Equation 3).
at position k þ 1, given the genotype at position k. Using Equation (4) the gametic transition probabilities Prðp kþ1 p k Þ, or the conditional probability of a gamete genotype at locus k þ 1 given the gamete genotype at locus k, is simply all possible genotypes that p k can assume for locus k. Also, assume that genotypes in Q m P are arranged according to the lexicographical order of their superscripts. For example, in an autotetraploid, k Þg for locus k. After some simplifications (see Supplementary Information, File S1) the transition probability, i.e., the conditional probability of a gametic genotype u m P;i in locus k þ 1 given the gametic genotype u m P;i9 in locus k, is . The initial state and the emission probability distributions will be addressed in the next section (Equations 7 to 9).

Including information from both parents
Any given individual in a full-sib population is formed by the union of gametes from both parents, P and Q. Each parent can form m m 2 ! different gametes for locus k. Since the formation of gametes in both parents is independent, the genotypic transition probability distribution can be written as where G m k;j denotes the genotype of an individual derived from the union of gametes u m P;i and u m Q;h at locus k. The same reasoning applies . l P and l Q denote the number of recombinant bivalents between loci k and k þ 1 in parents P and Q, respectively.
denote the number of possible genotypes derived from the cross between individuals P and Q. For simplification and without loss of generality, let t k ðj; j9Þ ¼ PrðG m kþ1;j9 G m k;j Þ. For a comprehensive autotetraploid example of the transition probabilities (similar to that presented by Hackett (2001)) and the indexation used in Equation 6, see Table S3.8, File S3 in Supplementary Information.
Given a ploidy level m and a recombination fraction r k , the only information required to obtain t k ðj; j9Þ in Equation (6) is l P and l Q . Since the genotypes in Q m P and Q m Q are arranged according to the lexicographical order of their superscripts, it is possible to obtain ðl P ; l Q Þ for any given pair ðj; j9Þ using the algorithm presented in Supplementary Information, File S2. Although the number of possible transitions between positions k and k þ 1 is ðg m Þ 2 , which can be a very large number even for modest ploidy levels, it is possible to obtain the transition between any specific genotypes in j and j9 without computing the entirety of the transition space.
The initial state distribution is the probability of observing a specific genotype. Given the assumption that there is no preferential pairing during the formation of bivalents, a uniform probability density function can be employed as the initial state probability function To this point, both transition and initial state distributions consider different allelic variants for all m homologs in both parents. This scenario can only be achieved when using fully informative markers.
In reality, autopolyploid species may have the same allelic variant in some homologs. Besides, even if a particular locus have different allelic form in all homologs, modern genotyping platforms are usually capable of detecting polymorphisms at the nucleotide level (SNPs), which are essentially biallelic. Due to this lack of identity between the observed data and the full transition space, we make use of the emission function, which is defined as the probability of observing a molecular phenotype given a genotype G m k;j . The detection of the allelic variants in modern genotyping platforms is based on the abundance of different alternative nucleotides. In the autopolyploid setting, this can be translated as the dosage of a SNP at a specific locus. The dosage of a SNP can be estimated using the ratio between the abundance of its two allelic forms. Several methods were proposed to perform this task including (Voorrips et al. 2011), (Serang et al. 2012) and (Bargary et al. 2014). Here we introduce a biallelic derivation of the emission probability distribution. Although the function presented here use biallelic information, other distributions can be derived for partial informative multiallelic marker systems following the same reasoning.
Let d k P ; d k Q 2 f0; ⋯ ; mg denote the observed dosage of one allelic form in locus k for parents P and Q, respectively. The choice of the allelic form denoted by d k P is arbitrary, as long as the same allelic form is used in d k Q . The dosage observed in parent P can be originated from alleles present in d k P of the m homologs. Let containing all possible subsets in P m k that originate the observed dosage d k P . The operator #f:g is the cardinality of a set. The same reasoning applies for f k Q . For instance, in an autotetraploid, if d k P ¼ 3, the three doses present in locus k can be derived from four distinct subsets f k P ¼ fðP 1 k ; P 2 k ; P 3 k Þ; ðP 1 k ; P 2 k ; P 4 k Þ; ðP 1 k ; P 3 k ; P 4 k Þ; ðP 2 k ; P 3 k ; P 4 k Þg. Given two particular subsets u k P and u k Q in f k P and f k Q , each one of the g m genotypic states in the full transition space can be associated to a dosage. The observed dosage O associated to the j-th state is obtained by counting the number of alleles present in the intersection between the parental allelic set ðu k P [ u k Q Þ and G m k;j . Thus, the emission function can be defined as and e denotes the global genotype error rate. In addition to the point estimate of the dosage, the genotyping calling methods cited above also provide the probability distribution of the dosages for a particular marker for all individuals of the biparental population. If this information is available, a more general emission function can be derived. Instead of modeling a global error rate e, we use the prior information provided by the genotyping calling procedure. Let p k ¼ fp k i g ð1·mþ1Þ denote the probability distribution vector associated to the dosages 0; ⋯; m at position k for a particular individual in the biparental population. For example, denotes a tetraploid individual with probabilities 1 6 , 2 3 and 1 6 of having one, two and three doses, respectively, and zero for the remaining ones. Then, the emission probability function can be written as In this case, the observation O can be any dosage from 0 to m and the information about the genotypes will be contained in the probability distribution of the dosages p k . Thus, the probability of observing any dosage given a genotype G m k;j associated to a particular dosage dðk; jÞ can be obtained by simply assessing the corresponding value in the probability distribution provided by the genotype calling procedure. Notice that Equation 8 can be reduced to Equation 9 using the appropriate p k . For example, in autotetraploids, when the observed dosage for locus k is

Multipoint likelihood and the estimation of recombination fraction
Suppose there are z markers in a homology group in a known order represented by M 1 ; ⋯ ; M k ; ⋯ ; M z . Let r ¼ ðr 1 ; ⋯; r k ; ⋯; r z21 Þ denote the recombination fraction vector between all marker intervals in this sequence. Also, assume linkage phase configurations in parents P and Q denoted respectively by F P ¼ ðu 1 P ; ⋯; u k P ; ⋯; u z P Þ and The sequence of observations for the z markers is denoted by ðO 1 ; ⋯; O k ; ⋯; O z Þ and its underlying probability distributions are denoted by P ¼ ðp 1 ; ⋯; p k ; ⋯; p z Þ. The likelihood of M 1 ; ⋯; M k ; ⋯; M z can be obtained using Equations (6), (7) and (9) following the classical forward procedure (Rabiner 1989). Let a k ðjÞ ¼ PrðO 1 ; ⋯; O k ; G m k;j jr; F P ; F Q ; PÞ denote the probability of the partial observation sequence ðO 1 ; ⋯; O k Þ and genotype G m k;j , j 2 f1; ⋯; g m g given the sequence of recombination fractions r, the linkage phase configurations F P and F Q and the probability distributions for the sequence of observations P. The forward procedure follows the steps below: 1. Initialization: 2. Induction: where k ¼ 1; ⋯; z 2 1 and j9 ¼ 1; ⋯; g m 3. Termination: Then, the likelihood of the model is defined as where n is the number of individuals in the full-sib population, O 1;i ; ⋯ ; O z;i is the sequence of marker observations for individual i and P i is a ðm þ 1Þ · z matrix where the k-th column denotes the probability distributions associated to the marker M k , individual i. The multipoint maximum likelihood estimate of r can be obtained using the forward-backward procedure coupled with the EM algorithm (Rabiner 1989). For the backward procedure, consider the variable b k ðjÞ ¼ PrðO kþ1 ; ⋯; O z G m k;j ; r; F P ; F Q ; PÞ as the probability of the partial observation sequence from k þ 1 to z, given the genotype G m k;j , the recombination fraction vector r, the linkage phase configurations F P and F Q and the probability distributions for the sequence of observations P. The solution to b k ðjÞ was also described by (Rabiner 1989) as follows: where k ¼ z 2 1; z 2 2; ⋯; 1 and j ¼ 1; ⋯; g m To estimate the recombination fraction for all intervals in the marker sequence we need to define j k ðj; j9Þ as the probability of state G m k;j at position k and state G m kþ1;j9 at position k þ 1 given the sequence of observations O 1 ; ⋯O z and their underlying probability distributions P, the recombination fraction vector r and the linkage phase config- The recombination frequency r k can be estimated through an iterative process using where j k ðj; j9jr s Þ is calculated for individual i, fðj; j9Þ ¼ ðlPþlQÞ m is the proportion of recombinations between markers k and k þ 1 for individuals with genotypes G m k;j and G m kþ1;j9 and r s is the vector of recombination fractions in the iteration ðsÞ and r sþ1 is the updated recombination fraction vector (Broman and Sen 2009).

Estimation of linkage phase
Let the Cartesian product zg denotes a set containing all possible linkage phase configurations in parent P. Also, let , denote a set containing all possible linkage phase configurations in both parents. The probability of the linkage phase configurations can be obtained using Bayes' rule where O is an array containing the observation for z markers in n individuals, and P is the underlying probability distribution for all marker observations. Since the prior probability PrðF u Þ can be assumed to be uniform, the posterior probability is proportional to the likelihood of the model, which can be used to select the best linkage phase configuration. Depending on the dosage and number of markers, some of these configurations are equivalent and will result in the same likelihood. The search space for the best linkage phase configuration can be unwieldy depending on the ploidy level, dosage and number of markers. Also, the transition space on the HMM gets larger as the ploidy level increases. To circumvent these problems, we propose a very efficient two-point procedure to reduce the search space for linkage phases. With the estimates in hands, it is possible to compare two-point likelihoods of alternative linkage phase configurations, eliminating those that do not meet a given threshold. The remaining configurations will be evaluated using the multipoint approach. The adequate threshold level will be discussed in the Simulations section.

Two-point algorithm for high-level autopolyploids
When the linkage analysis is conducted only between two markers (twopoint analysis), the information contained in these markers does not propagate into the rest of the chain. Thus, based on the dosage and linkage phase configuration of the markers involved in the analysis, the g m genotypic states present in the full transition space can be collapsed into a small number of states, and a straightforward likelihood function can be derived. It is worthwhile to mention that the estimates obtained using the two-point procedure are the same as those obtained using the multipoint algorithm for two markers. However, the twopoint computation is extremely faster. Consider a biallelic marker in an autopolyploid biparental cross with ploidy m. The number of possible genotypic states in the progeny for a given locus at position k is uðd k P Þ þ uðd k Q Þ þ 1, where the operator uðxÞ ¼ kx 2 m 2 2 m 2 and j:j denotes module. For example, in an autohexaploid biparental cross, if the dosage of the marker at position k in parent P is two (d k P ¼ 2) and in parent Q is three (d k Q ¼ 3), the number of possible genotypic classes expected in the progeny is six. Depending on the linkage phase configuration, each of the g m genotypic states in the full transition space corresponds to one of these expected genotypic classes, as presented in the emission function (Equations 8 and 9). Thus, in the previous example, all the g m states could be collapsed into six different classes. To perform this reduction of dimensionality, let D m k 2 f0; ⋯; mg denote one of the possible genotypes based on the observed dosage of one individual in the progeny of an autopolyploid biparental cross for position k with ploidy m. The joint probability of D m k and D m k9 , for a given genotypic configuration at positions k and k9 can be written as where T k ¼ fjjdðk; jÞ ¼ D m k ; j ¼ 1; ⋯; g m g and dðk; jÞ was defined in Equation 8; the same applies to T k9 . Since in a two-point analysis the probability distribution of the genotypic states in locus k can be assumed to be uniform, i.e., PrðG m k;j Þ ¼ 1 gm , Equation (19) can be rewritten as a sum of weighted terms from Equation (6) Pr where z Tk;Tk9 ðl P ; l Q Þ ¼ 1 g m X j2Tk X j92Tk9 hðj; j9; l P ; l Q Þ hðj; j9; l P ; l Q Þ is 1 if ðj; j9Þ corresponds to ðl P ; l Q Þ according to the procedure described in Supplementary Information, File S2, and zero otherwise. Equation 20 can be expressed in matrix form as where A u k P ;u k9 P ;u k Q ;u k9 Q ðr k Þ is a ðm þ 1Þ · ðm þ 1Þ matrix. Yet, in a twopoint analysis with biallelic markers, the linkage phase configuration can be summarized in an ordered pair ðw k;k9 P ; w k;k9 Q Þ indicating the number of homologs that share allelic variants for loci k and k9 in parents P and Q, respectively. For a given pair ðu k P ; u k9 P Þ, w k;k9 P ¼ #fx k P \ x k9 P g, where x k P and x k9 P denote the set of homologs inherited by parent P in positions k and k9, which can be assessed using the superscripts in u k P and u k9 P . #f:g indicates the cardinality of the set. Notice that u k P and u k9 P can assume several linkage phase configurations resulting in the same w k;k9 P . Let F k;k9 P ¼ f k P · f k9 P denote a set containing all possible pairs ðu k P ; u k9 P Þ for aup given pair ðd k P ; d k9 P Þ. In this set, there are minfuðd k p Þ; uðd k9 p Þg þ 1 partitions, each one corresponding to a different w k;k9 P . Figure 3 shows an example of F k;k9 P for ðd k P ¼ 2; d k9 P ¼ 2Þ in an autotetraploid homology group. The size of the set is 36, and it can be subdivided into three partitions where w k;k9 P ¼ 2, w k;k9 P ¼ 1 and w k;k9 P ¼ 0. In a two-point context, the likelihood function derived from any of the configurations belonging to the same partition (same w k;k9 P ) will be the same. Thus, any of them can be used to obtain the likelihood function for a given w k;k9 P . Let ðu k P ; u k9 P Þ Ã denote one of the possible pairs ðu k P ; u k9 P Þ that correspond to w k;k9 P . The same reasoning applies to parent Q. Without loss of generality, the two-point likelihood function of biallelic observed molecular phenotypes for markers k and k9 given w k;k9 P and w k;k9 Q is L r k w k;k9 where n is the number of individuals and T denotes transposition of a vector. In Equation (22), r k can be estimated using iterative procedures such as EM or Newton-Raphson. As in Equation (18), it is possible to list all linkage phase configurations and evaluate them based on their likelihood. Here we use the LOD Score (base-10 logarithm of likelihood ratios) in relation to the highest likelihood. Thus, models with high likelihoods will yield LOD Scores close to zero. We also use the LOD Score to assess the evidence for linkage between the two markers using the ratio between the model under H a : r ¼r and under the null hypothesis of no linkage H o : r ¼ 0:5, given a linkage phase configuration. As previously shown, it is possible to enumerate all linkage phase configurations for parent P using the Cartesian product f 1 P · f 2 P · ⋯ · f z P . To reduce this Cartesian space based on two-point analysis, we add a restriction where all pairs ðu k P ; u k9 P Þ in a sequence of configurations ðu 1 P ; ⋯; u z P Þ must be contained in F k;k9 P ðhÞ, where F k;k9 P ðhÞ is a subset of all partitions in F k;k9 P in which the associated LOD Score is smaller than h. Thus, a reduced subset of linkage phases in parent P based on two-point analysis can be obtained using It is important to note that it is not necessary to represent the whole Cartesian space fF P g to restrict the linkage phase configurations to the condition ðu k P ; u k9 P Þ 2 F k;k9 P ðhÞ. This procedure can be done through the sequential addition of markers from M 1 to M z . For each marker M k9 added to the end of the chain, the ordered pair ðk; k9Þ, k9 ¼ 2; ⋯; z and k ¼ k9 2 1; ⋯; 1, is evaluated and only linkage phase configurations that meet the condition ðu k P ; u k9 P Þ 2 F k;k9 P ðhÞ "k 2 fk9 2 1; ⋯; 1g are considered.
Some of the configurations selected using the previous procedure can be equivalent once they are products of a permutation of the same set of homologs. In order to remove this redundancy, let each one of the selected configurations be represented as a binary matrix of dimensions ðm · k9Þ such as where u 2 f1; ⋯; Ug, U is the number of selected linkage phase configurations, and k9 indicates that M k9 was the last marker inserted in the chain. The rows of matrix H u k9 represent the homologs for the u-th linkage phase configuration with the insertion of the k9-th marker at the end of chain; 1 denotes the presence of an allelic variation, and 0 denotes its absence. If a matrix H k9 could be obtained from a matrix H u9 k9 just by permuting the rows (permuting the order of the homologs), these two linkage configurations yield the same likelihood. Thus, one of the configurations should be excluded from consideration. The same reasoning applies to parent Q. This procedure can be done recursively until all redundancy is eliminated. The reduced linkage phase configurations search space considering both parents is obtained using FðhÞ ¼ F P ðhÞ · F Q ðhÞ, such as #fFðhÞg ( #fFg, combined with the redundancy elimination for homology groups. This sequential procedure results in a set of linkage phase configurations containing markers up to M k9 , which are evaluated using the HMM likelihood. A LOD Score threshold in relation to the most likely configuration is assumed to determine which configurations should be taken into consideration in the next round of marker inclusion (Figure 4). Additionally, it is possible to limit the two-point search space reduction to a window of SNPs in the terminal part of the chain to speed up the phasing process. Finally, with all markers inserted, the multipoint likelihood of the whole map is used Figure 3 Example of F k;k9 P ¼ f k P · f k9 P for an autotetraploid homology group with observed dosages d k P ¼ 2 and d k9 P ¼ 2 homologs sharing alleles. In this case, f k P denotes a set of size six, containing all possible subsets of size two in P 4 k ¼ fP 1 k ; P 2 k ; P 3 k ; P 4 k g. The same reasoning applies to f k9 P . The horizontal bars represent homologs forming a homology group and the dots represent allelic variations of a biallelic marker. The number below each homology group represents the number of homologs that share allelic variants (w k;k9 P ). This defines three partitions: w k;k9 P ¼ 2, w k;k9 P ¼ 1 and w k;k9 P ¼ 0. Notice that, from a homology group within a specific partition, it is possible to obtain the same linkage phase configuration observed in another homology group within that partition by permuting the its homologs.
to find the best configuration among the remaining ones, and the recombination fractions are reestimated. To demonstrate the mechanics of the two-point analysis coupled with the multipoint procedure, a simple example is presented in Supplementary Information, File S3.

Data availability
All the methods and procedures described here are available in the R package MAPpoly, which is freely available from https://github.com/ mmollina/mappoly. R scripts to perform the simulations and the potato map construction presented in this article can be accessed at https:// github.com/mmollina/Autopolyploid_Linkage. The tetraploid potato data set is available through the Solanaceae Coordinated Agricultural Project at http://solcap.msu.edu/potato_infinium.shtml. Supplemental material available at FigShare: https://doi.org/10.25387/g3.8218325.

Simulations
Simulation 1 -local performance under random bivalent pairing: the aim of this simulation study was to evaluate the local performance of the algorithm considering three ploidy levels (m ¼ 4, m ¼ 6 and m ¼ 8) under the mapping model assumptions (i.e., random pairing and bivalent formation). To be in accordance with molecular data that have been made available through sequence technologies, we simulated bi-allelic markers that can be observed in terms of dosage in parents and progeny. Three different linkage phase scenarios were simulated. In scenario A, for each marker, one of the allelic variants was assigned to the first homolog in the homology group and the remaining variants of the same type were assigned to the subsequent homologs. In scenario B, the allelic variant was randomly assigned to one of the first m 2 homolog and the remaining were assigned to the subsequent homologs. In scenario C, the allelic variants were randomly assigned to the m homologs. Thus, it is expected an increasing difficulty to detect recombination events from scenario A, where the allelic variants were concentrated in the same homologs, to scenario C, where they were randomly distributed. For each combination of ploidy level and linkage phase scenario, we simulated five different parental haplotypes. In total, 45 parental configurations were considered (3 · 3 · 5, Supplementary Information, Figure S4). For autotetraploid and autohexaploid configurations, we simulated 1000 full-sib populations. For autooctaploids, this number was reduced to 200 due to the high demand of computer processing required to reconstruct such maps. Each population was comprised of 200 individuals with one linkage group containing 10 markers positioned at a fixed distance of 1 centimorgan (cM) between them. For each combination, the percentage of correctly estimated linkage phase configuration in each parent was recorded. Also, for the cases where the linkage phases were correctly estimated, we calculated the average Euclidean distance between the distances of the estimated and simulated maps using ( ðd 2 dÞ T ðd2dÞ z21 ) 2 1 2 whered is the vector of distances for a estimated map, d is the vector of distances for the simulated map, z is the number of markers and T indicates vector transposition. For example, a value of 1 cM indicates that the maps differ 1 cM in average from each other (Mollinari et al. 2009). We used the sequential two-point procedure to reduce the search space assuming that linkage phase configurations with associated LOD , 3:0 should be investigated using HMM multipoint strategies (h ¼ 3). For the remaining configurations evaluated using HMM, we kept those with LOD , 10:0 to be evaluated in the next round of marker insertion.
Simulation 2 -chromosome-wide performance under preferential pairing and multivalent formation: In this simulation study, we evaluated the performance of the algorithm in dense maps, allowing for multivalent formation and preferential pairing. We used Scenario C from the previous study as a template to simulate five tetraploid and five hexaploid parental haplotypic configurations, each one comprising 200 equally spaced markers with a final length of 100.0 cM (Supplementary Information, Figure S5). For each parental configuration, we simulated 200 full-sib populations of 200 offspring considering a combination of three levels of preferential pairing (0.00, 0.25 and 0.50) and three levels of cross-like quadrivalent formation proportion (0.00, 0.25 and 0.50) with the position of the pairing partner switch varying across simulations. No hexavalents were simulated in this study. For autohexaploids, the multivalent configurations were always composed by a cross-like quadrivalent plus a bivalent. The centromere was positioned at 20.0 cM from the beginning of the chromosome (subtelocentric centromere with arms ratio 1:4) to study the effect of the double reduction which is more pronounced at the end of both chromosome arms. All simulations were conducted using the software PedigreeSim (Voorrips and Maliepaard 2012). In addition to the statistics recorded in Simulation 1, we computed the rate of double reduction observed in each marker for all constructed maps using the "founderalleles" file provided by Pedigree-Sim. We also evaluate two values for the LOD Score threshold associated to the two-point analysis (h ¼ 3 and h ¼ 5). We used a multipoint LOD Score threshold of 10.0 and also limited the twopoint search to a 50 SNP window in the terminal part of the map. Markers presenting higher doses than the sum of the doses in both parents, originated from double reduced gametes were filtered-out and assumed as missing data.

Simulation results
Simulation 1: Table 1 shows the percentage of data sets where the linkage phase configuration was correctly estimated in both parents P and Q. In scenario (A) the method was capable of recovering the correct linkage phase configuration in all situations for all ploidy levels. In scenarios (B) and (C) there was a slight decrease in the ability to correctly estimate the linkage phase configuration, especially for m ¼ 6 and m ¼ 8. Although in these cases the percentages of correctly estimated linkage phases was lower, they were nevertheless high, varying from 100 to 88.8%. This indicates a very good performance to estimate the linkage phase configurations, even using the two-point procedure to narrow the search space. Figure 5 shows the distributions of the average Euclidean distances between the estimated and simulated distance vectors for the correctly estimated linkage phase configuration. In all cases, the majority of the recombination fractions were consistently estimated once the medians of all distributions are very close 0.5 cM, with no practical problems in terms of mapping construction. These results show that, apart from a relatively small percentage of entangled linkage phase configurations, the method successfully performed the phasing and managed to estimate the recombination fraction of 10 markers in all situations evaluated.
Simulation 2: The proportion of correctly estimated linkage phase configurations for the dense chromosome-wise map is shown in Table 2. In general, results for tetraploid maps were superior when compared to results for hexaploid maps. It is also possible to observe a better performance for the threshold level h ¼ 5 in comparison to h ¼ 3. Similarly to Simulation 1, maps resulting from configurations with no preferential pairing or quadrivalent formation showed a high proportion of correctly estimated linkage phase configurations. Results ranged from 100 to 99% for tetraploid maps and from 100 to 84% for hexaploid maps. Different levels of quadrivalent formation rate had no substantial influence in estimating the correct linkage phase configurations in tetraploids. Within the preferential pairing level 0.0, the percentage of maps with correct linkage phases varied from 100 to 90%. For hexaploids, there was a decrease in this percentage as the quadrivalent formation increases from 0.0 to 0.50, with proportions varying from 100 to 70.5%. Especially for autohexaploids, there was considerable variation between the five simulated configurations. This occurred because the effect of the quadrivalent formation can be more pronounced depending on the level of information contained in a particular configuration. Also, the use of a more stringent two-point threshold h ¼ 5, improved the performance of the phasing algorithm.
Within the preferential pairing level 0.25, results showed decay of correctly estimated linkage phases, which was more pronounced for hexaploid cases with threshold level h ¼ 3, reaching a minimum value of 52.5% for parent Q in configuration 1. Again, the use of a higher two-point threshold level, h ¼ 5, helped to improve this number to 68.5%. For preferential pairing level 0.50, there was a clear distinction between the results in tetraploid and hexaploid cases. In the former, the effect was not as pronounced as it was in the latter, where in several cases, the proportion of correctly estimated linkage phases was close to zero. As expected, the usage of a higher threshold level of h ¼ 5 helped to improve the number of corrected estimated linkage phase configurations. Interestingly, for both cases with preferential pairing (0.25 and 0.50), the formation of quadrivalents had an overall tendency to improve the algorithm's performance. This improvement was expected because when a quadrivalent is formed, each chromosome involved can exchange segments with two others, providing more information regarding their phase configuration. Given a correctly estimated linkage phase, the recombination fractions were consistently estimated for all levels of preferential pairing with no quadrivalent formation ( Figure 6). However, they were overestimated in the presence of quadrivalent formation. This effect was mainly observed at the terminal regions of the chromosome, especially in the long arm, where double reduction is more pronounced. In this case, tetraploid maps were the most affected. This is in agreement with our expectations since in autohexaploid simulations, there was always the formation of a bivalent which was not involved in the double reduction process (although the rates of double reduction were very similar in both ploidy levels, Figure 6). In addition to the quadrivalent, the bivalent serves as an extra source of information to access the recombination events. The average Euclidean distances reflect the overestimation of recombination fractions in cases with quadrivalent formation, showing distributions with higher medians and interquartile ranges in tetraploid cases when compared to hexaploids (Supplementary Information, Figure S6). Nevertheless, all the Euclidean distances distributions were located relatively close to zero, with a maximum value of 1.4 cM, indicating that although we observed overestimated recombination fractions toward the terminal ends of the chromosome, they were equally distributed, causing no severe disturbances in the final map. Figure S7, in Supplementary Information, shows an example of the effect of increasing quadrivalent formation rate in autotetraploid and autohexaploid maps. As the markers get further away from the centromere, the recombination fractions become overestimated.

Analysis of real tetraploid potato SNP data set
We applied our method to construct a genetic map of the B2721 population which is a cross between tetraploid potato varieties Atlantic and B1829-5. The population comprises 160 offsprings genotyped with the SolCAP Infinium 8303 potato array. The genotype calling was performed using fitTetra R package (Voorrips et al. 2011). We obtained 4017 SNPs and computed all the pairwise recombination fraction between them for all possible linkage phase configurations. For each pair, we selected the configuration that yields the higher likelihood and applied the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) clustering algorithm to assign markers into 12 linkage groups. Within each linkage group, we ordered the markers using the MDSMap R package (Preedy and Hackett 2016). We applied the unconstrained multidimensional scaling algorithm (MDS) to the pairwise marker distance obtained using Haldane's mapping function and LOD 2 weighting. We performed two rounds of marker removal by inspecting the nearest neighbor measure scatter plot and also the monotony of the resulting recombination fraction matrix. Given the marker order obtained for each group, we applied our algorithm using h ¼ 10. For each round of marker inclusion, we limited our phase search to the last 100 markers inserted at the end of the map and eliminated markers that caused map inflation greater than 10 cM. The resulting map consisted of 3348 SNPs (58% simplex and double simplex markers and 42% multiplex) distributed in 12 linkage groups with lengths varying from 165.2 cM to 332.5 cM with no visible gaps between markers (Supplementary Information, Table S8.1). These values seem inflated compared to tetraploid potato maps available in the literature (Hackett et al. 2013;Sharma et al. 2013;Massa et al. 2015;Bourke et al. 2015;Rak et al. 2017). The map expansion observed here was mainly caused by two sources of error, namely, local marker misplacement and genotyping errors. Both factors cause the detection of spurious recombination events which are propagated through the HMM causing a global overestimation of recombination fractions.
While obtaining a de novo order based on multiploint likelihood is not feasible in our current implementation, we used the multipoint likelihood as an objective function to compare the map obtained using the genomic order from the Solanum tuberosum genome version 4.03 (Sharma et al. 2013) and the MDS based order. When using the genomic order, the length of all linkage groups are smaller, and the likelihoods were substantially superior when compared to the de novo MDS-based order (Supplementary Information, Table S8.1 and Figure S8.1). Furthermore, our algorithm estimated the same linkage phase in both cases, indicating the robustness of the phasing method to local marker misplacement. To address the genotyping errors, we used the approaches Figure 5 Distributions of the average Euclidean distances between the estimated and simulated distance vectors considering correctly estimated linkage phase configurations. The order of boxplots is the same as the order of haplotypes in Figure S4. Each column indicates the results for different linkage phase configuration scenarios, namely, A, B and C, and each row indicates a different haplotypic configuration within three ploidy levels.
presented in Equations 8 and 9, i.e., ðiÞ use the probability distribution provided by the mixture proportions of the doses from fitTetra software and ðiiÞ assuming a global genotyping error; in this case we assumed an ad hoc error rate of 5%. We applied this prior information in both de novo MDS-based and the genome-based orders. The result also can be observed in Table S8.1 and Figure S8.1. Both approaches produced smaller maps when compared to their relative original maps. However, since ðiÞ relies on the dosage proportions based in single SNPs, the adjustment of the map was not as flexible as observed in ðiiÞ, which assumes an equal global error for all markers. Thus the usage of global error allowed genotypes to conform to a global chromosomal structure, rather than restrain the markers in certain genotypic classes proposed by the classification method. Furthermore, the usage of a global error mitigates the effect of the local marker misplacement caused by the MDS algorithm by clustering markers into linkage disequilibrium blocks. This effect can be observed in linkage groups 1, 5, 8, 10, 11, and 12, where the difference between de novo MDS-based and the genome-based order when modeling a global error was less than 10 cM.
Comparison with polymapR software Among the available methods to construct maps in high-dose autopolyploids, namely, pergola (Grandke et al. 2017), netgwas (Behrouzi and Wit 2017), and polymapR (Bourke et al. 2018), only the latter is capable of inferring parental haplotypes and estimating recombination fraction in outcrossing populations. Thus, we limited our comparison to polymapR software. To assess the performance of the methods, we simulated 50 full-sib hexaploid populations with 200 each individuals in five marker density scenarios: 200, 400, 600, 800, and 1000 equally spaced markers in a 100 cM linkage group. Similarly to Simulation 2 described before, we randomly assigned allelic variants, from 0 to 3 doses, to the six homologs. Additionally, in this study, we considered two n  different dosage proportions scenarios: the first one considers a higher proportion of simplex and double simplex markers, with 40% of the simulated markers being nulliplex, 40% simplex, 10% duplex and 10% triplex in both parents; the second, considers equal proportions for all doses, with 25% for all dosage types, from nulliplex to triplex (Supplementary Information, Figure S9). In total, 500 populations were simulated (5 · 2 · 50) and for each, we obtained the phased map using polymapR and our HMM-based procedure. All simulations were performed using the software Pedi-greeSim (Voorrips and Maliepaard 2012) considering no preferential pairing and no quadrivalent formation. For both methods, we recorded the percentage of correctly phased markers, the final map length, and the number of markers inserted in each phased map.
To employ our HMM-based method, we ordered the markers using the unconstrained MDS algorithm (Preedy and Hackett 2016) weighted by LOD 2 with no rounds of marker removal. Given the MDS order, two levels of h were used: 3 and 5; the same levels were used for the multilocus LOD threshold. The phase search was limited to the last 50 markers inserted at the end of the map. To construct maps using polymapR, we first applied the function cluster_SN_markers to perform a grid search from LOD Scores 1 to 20 and chose the lowest one that yields six homologs based on simplex markers in coupling linkage phase. In the next step, we assigned double simplex and duplex markers to the linkage group using the function assign_linkage_group, and the remaining marker types were assigned using the function homolog_lg_assignment. This procedure was performed for both parents assuming LOD Score thresholds of 3 and 5. All remaining pairwise recombination fractions were computed, and the MDS algorithm (Preedy and Hackett 2016) was used to order the markers. Marker positions were estimated using the projection of the MDS result onto a single dimension principal curve. Finally, a phased map was created using the function create_phased_maplist. We used polymapR version 1.0.19. Differently from the HMM-based method, where the h indicates the LOD threshold from which the multipoint likelihood should be used to chose the best phase configuration, polymapR uses LOD thresholds to make decisions whether a marker should be used in a certain mapping context, such as, clustering homologs or to assign it into assembled linkage groups. Thus, they are not directly comparable. Table S10.1 in Supplementary Information, shows the results obtained using both methods. Overall, both methods recovered the vast majority of the linkage phases of the markers across all simulations. In the presence of a high number of single-dose markers both methods recovered the correct linkage phase from 99.7 to 100% of the included markers with polymapR positioning in average 94.3% of the markers when using LOD ¼ 3:0 and 85.0% when using LOD ¼ 5:0 and our HMM-based method positioned in average 100.0% of the markers when using LOD ¼ 3:0 and 99.8% when using LOD ¼ 5:0. In cases where the dosages were uniformly assigned, 96.2-100% of the markers were correctly phased when using polymapR and from 99.1 to 100% when using the HMM-based method. In this case, the number of positioned markers varied from 16.9 to 77.5% for LOD ¼ 3 and from 14.4 to 57.0% for LOD ¼ 5 for polymapR and from 99.3 to 99.9% for LOD ¼ 3 and from 97.8 to 99.9% for LOD ¼ 5 when using our HMM-based method. As pointed out by van Geest et al. (2017), when using polymapR's method, the presence of a sufficiently large number of simplex markers uniformly distributed throughout the genome is essential to define each homologs and be able to assign multiple dose markers to the framework map. It is important to mention that using modern genotyping technologies, most data sets have high proportions of single dose markers. While this is generally true, single dose markers can be absent in certain chromosome regions or even along entire homologs due to recent duplication events. Table S10.2, Supplementary Information, shows the average map length and the associated standard deviation obtained in all simulations. In cases with a high number of single-dose markers, the average map length produced by polymapR ranged from 87.0 to 89.9 cM, while in the case where the dosages where uniformly assigned, map lengths ranged from 76.7 and 80.1 cM. These results confirm the underestimation tendency in the MDS algorithm when using when using LOD 2 as weighting function, as observed by Preedy and Hackett (2016). In our HMM-based method, considering the MDS order, the maps lengths were highly overestimated, ranging from 143.0 cM to 548.0 cM. Since we did not introduce errors in our simulation procedures, the observed map inflation is exclusively due to local marker misplacement caused by the MDS algorithm. Nonetheless, even with local marker misplacements, the linkage phase configuration was correctly estimated in the vast majority of the cases. To accommodate the local marker misplacements, we used an ad hoc global error of 5% in the HMM emission function. As observed in the tetraploid potato analysis, this strategy allowed markers in the wrong order, but in the same linkage disequilibrium block, being positioned closely together through the HMM estimation process. The resulting maps lengths were close to the simulated 100 cM (Supplementary Information,  Table S10.2 and Figure S10.1). Although in general, our method yielded denser maps, it is worthwhile to mention that, polymapR's method is substantially faster than ours, notably when our method uses high values for h, in which case the HMM computations play a significant role in the phasing procedures (Supplementary Information, Table S10.3). Nonetheless, it was precisely the multipoint procedure that allowed our method to position more markers when compared to polymapR.

DISCUSSION
Although the concept of linkage mapping is relatively simple, the combinatorial properties and increasingly missing information that arise from the multiple sets of chromosomes make the construction of genetic maps in high-level autopolyploids challenging. In this work, we frame and solve two fundamental steps toward the construction of such maps, namely multipoint recombination fraction estimation and linkage phase estimation. We showed that, combined with standard grouping and ordering procedures (Preedy and Hackett 2016), these maps could be reliably constructed. Our method can be applied to biallelic codominant markers and, due to the flexibility of the HMM framework upon which it was derived, it is extendable to any codominant molecular marker. The HMM used in this work takes into account the linkage phase configuration of the whole linkage group to estimate the recombination fractions between adjacent markers. An efficient two-point approach was also presented to reduce the search space of linkage phase configurations. As a result, our method provides the likelihood of the model, which can be used as an objective function to compare different map configurations, including linkage phases and marker orders. When considering experimental populations, our method is a generalization, for any even ploidy level, of well established genetic linkage mapping methods. For diploid (m ¼ 2) populations derived from biparental crosses, our method is equivalent to the influential Lander and Green algorithm ; considering full-sib phase-unknown crosses, it is equivalent to Wu et al. (2002). For tetraploids (m ¼ 4) the method is equivalent to Leach et al. (2010), disregarding double reduction.
To assess the statistical power of our method, we conducted two simulation studies. In simulation 1, we demonstrated that our model was capable of correctly estimating the majority of parental linkage phase configurations and recombination fractions in a limited number of markers, even for complex linkage phase configurations and high ploidy levels. Since other methods are based on single-dose markers to assemble homology groups, to the best of our knowledge, this is the only method capable of phasing markers in high-dose autopolyploid genomes in small regions. These well-assembled regions could function as multiallelic codominant markers which propagate their information through the HMM to the rest of the chain, improving the quality of the final map. In simulation 2, we analyzed a sequence of 200 markers in combinations of different levels of preferential pairing and rates of quadrivalent formation. In this situation, quadrivalent formation rate had a marginal effect on the phasing procedure, whereas preferential pairing reduced its performance, especially for autohexaploids. The usage of a higher two-point threshold (h) improved the linkage phase estimation in all cases. This fact indicates that the haplotype phasing is more accurate when HMM-based likelihood is used as objective function to evaluate linkage phases. We also observed that quadrivalent formation yield overestimated recombination fractions between adjacent markers located further away from the centromere. Interestingly, Bourke et al. (2015Bourke et al. ( , 2016 found that higher quadrivalent rates had little effect on the recombination fractions. Most likely, the source of this discrepancy was the different usage of information in both approaches. While in both studies (Bourke et al. 2015(Bourke et al. , 2016 used exclusively two-point recombination fraction estimates, here the information of the markers propagates along the linkage group and the recombination fraction between two adjacent markers depends on the behavior of the whole chromosome. Moreover, the overestimation of the recombination fraction was expected since our model disregards double reduction and, consequently, was not able to correctly assess the number of crossing over events when this phenomenon was present. Although our model is robust enough to cope with certain levels of preferential pairing and tetravalent rate formation, it is possible to include both phenomena in specific points of its derivation. Preferential paring can be included in Equation 4 by not considering Prðc j Þ as uniformly distributed. Double reduction can be included in the definition of the genotypic states in the full transition space (Equation 5). These two phenomena add extra layers of complexity to the genetic mapping of polyploid organisms with high ploidy levels and should be addressed in future studies.
We also build a tetraploid map using the ideas presented in this study coupled with standard grouping and ordering procedures. While the choice of the genotype error rate depends on specific characteristics of the data set, we demonstrate that it is possible to use prior information on the HMM framework, including the probability distribution of the marker dosages for each SNP and a global error to avoid map inflation caused by local marker misplacement and genotyping errors. Finally, we compared our HMM-based method to the polymapR two-point based method and, as already pointed out by (van Geest et al. 2017), we concluded that with a number sufficiently large of single-dose markers uniformly distributed across the homologs, both methods performed well. However, when those markers are absent in a specific homologs or chromosome region, our method was able to build denser maps when compared to polymapR. Moreover, some autopolyploids with ploidy level higher than six, such as sugarcane (Aitken et al. 2007) and garden dahlias (Schie et al. 2014), could benefit only from our method, since polymapR is limited only to tetraploid and hexaploid species. The difficulty in correctly estimating linkage phase configurations where multi-dose allelic variants are spread randomly in all homologous chromosomes lies in two significant aspects of the experiments studied here: (i) the outbred nature of the experimental crosses and (ii) the incomplete information of the markers based on dosage (i.e., by not being multiallelic). In experimental population derived from inbred lines, the origin of the haplotypes can be easily inferred from the genetic design. However, obtaining pure inbred lines in high-level autopolyploids has been proven to be impractical due to the high number of crosses and generations necessary to achieve homozygous genotypes and to the inbreeding depression which some species undergo (Gallais 2003). In our method, the linkage phase configuration is obtained by comparing the likelihood of a set of models with different linkage phase configurations (Equation 18). The capability of estimating the correct configuration is directly related to the information contained in the marker data. Some of these limitations are overcome through the use of HMMs which take into account the information of the linkage group as a whole.
HMMs provide an excellent avenue to assemble genetic maps in complex scenarios, but they are remarkably computational demanding and, in some cases, unfeasible to use. Apart from parallel computing, which can greatly speed up the estimation process and is ubiquitous nowadays, the usage of two-point approaches is a viable option to reduce the dimension of the original problem efficiently. The dimension reduction is achieved by collapsing genotypic states in the full transition space according to the marker information. However, in several cases, the two-point based method can result in low statistical power which is related to the amount of information contained in markers in certain combinations of allelic dosage and linkage phase configurations. This lack of information is exacerbated as markers get distant from each other. Figure 7 shows nine possible configurations of pairs of markers in one autohexaploid parent. Considering one of the parents non-informative, we computed the Fisher's information equations based on the likelihood Equation (22) (Mather 1957;Ripol et al. 1999;Luo et al. 2004). The equations were plotted as a function of the recombination fraction. The information profiles are related to the number of different haplotypes present on the parental configuration for a given marker dosage. For instance, for two single-dose markers (Figure 7, panel A), when the alleles share the same homolog (w k ¼ 1), it is always possible to detect if the gamete contains at least one recombinant chromosome. However, when the alleles are in different homologs (w k ¼ 0), the detection of recombination events is limited to meiotic configurations containing a bivalent where these chromosomes paired with each other. Intermediate situations involving multi-dose markers can be observed in the other panels in Figure 7. Additionally, the model proposed here contemplates both parents on the analyses, leading to more complicated linkage phase configurations and information equations. The lack of information for some phase configurations in two-point procedures is essentially caused by the biallelic nature of the dosagebased markers. However, in several situations, genomic and transcriptomic references are available for related diploid species and often provide the physical order of the SNPs in small regions. Our phasing procedure could be applied in these regions to obtain local haplotypes, which could function as multiallelic markers improving the information in a two-point analysis. Moreover, in a multipoint context, when using multiallelic markers, the number of visited states in the Markov model can be significantly reduced, making the HMM procedure much more efficient. Ideally, in a full-sib population, the number of different alleles should be as high as two times the ploidy level (fully informative). In this case, the Markov model would be fully observed and, the task of estimating recombination fraction reduces to count the number of recombinant events given a linkage phase configuration. Since our algorithm does not need the entire transition space to work, only a subset of states should be visited, making the calculation much faster when compared to the biallelic case.
Once the map is assembled, given the HMM framework, it is a trivial exercise to obtain the probability of a specific genotype at any map position, conditioned on the whole linkage group and to compute the probability of any unobserved genotype given the genetic map using this information. These conditional probabilities are the basis for answering a series of fundamental questions about quantitative trait loci analysis in high-level autopolyploids, such as the effect of the dosage level on the variation of quantitative traits, the interaction of the alleles within (dominance effects) and between loci (epistatic effects). Therefore, the present study provides a sound basis for unveiling the complex structure of autopolyploid genomes through genetic mapping.