Constructing a linkage–linkage disequilibrium map using dominant-segregating markers

The relationship between linkage disequilibrium (LD) and recombination fraction can be used to infer the pattern of genetic variation and evolutionary process in humans and other systems. We described a computational framework to construct a linkage–LD map from commonly used biallelic, single-nucleotide polymorphism (SNP) markers for outcrossing plants by which the decline of LD is visualized with genetic distance. The framework was derived from an open-pollinated (OP) design composed of plants randomly sampled from a natural population and seeds from each sampled plant, enabling simultaneous estimation of the LD in the natural population and recombination fraction due to allelic co-segregation during meiosis. We modified the framework to infer evolutionary pasts of natural populations using those marker types that are segregating in a dominant manner, given their role in creating and maintaining population genetic diversity. A sophisticated two-level EM algorithm was implemented to estimate and retrieve the missing information of segregation characterized by dominant-segregating markers such as single methylation polymorphisms. The model was applied to study the relationship between linkage and LD for a non-model outcrossing species, a gymnosperm species, Torreya grandis, naturally distributed in mountains of the southeastern China. The linkage–LD map constructed from various types of molecular markers opens a powerful gateway for studying the history of plant evolution.


Introduction
Linkage disequilibrium (LD), a concept to describe the non-random association of alleles at different loci, has been a focus of population genetic studies during the last several decades. 1 However, since LD is affected by many evolutionary forces, the use of LD alone to infer the genetic structure of populations may generate spurious results. 2 For this reason, how LD can be served as a more efficient tool has been one of the most important issues in population and evolutionary genetics. One of the strategies to resolve this issue is constructing a LD map from which to infer population history by visualizing the decline pattern of LD with genetic distance. This strategy has been widely used in human genetics 3,4,5 and increasingly recognized in other species. 6,7 Many of these LD maps are constructed from the relationship of pairwise LD with the physical distance of the same marker pair, which do not estimate the frequency of recombination between marker loci. The basic principle by which LD is used for historical inference results from its relationship with the recombination rate. 1 Therefore, the estimation of the linkage, apart from estimating LD, is an essential step towards constructing a LD map. Wu and Zeng 8 pioneered the application of a sampling design to simultaneously estimate these two parameters. By sampling parents randomly from a natural population and the seeds of the sampled parents, this design constructs a two-level hierarchic structure of molecular data, which enables the characterization of how different markers are associated in the original population and how the markers co-transmit their alleles in a Mendelian fashion from the parent to offspring. Lou et al. 9 derived a close-form EM algorithm to estimate the LD and recombination rate within a unifying framework. Such a joint linkage-LD analysis has been applied to the genetic mapping of complex traits, leading to the identification of biologically validated quantitative trait loci (QTLs) for drought resistance in maize. 10 More recently, this strategy has been modified to accommodate to the estimation of genetic imprinting 11 and genetic variance. 12 Pikkuhookana and Sillanpaa 13 implemented a Bayesian algorithm for parameter estimation from this strategy.
In this article, we described a general computational framework built on Wu and Zeng's 8 open-pollinated design to construct a linkage-LD map using biallelic co-dominant markers. To make this framework more useful for a broader area of applications, we extended it to enable the utilization of dominant-segregating markers. Several recent studies have shown that epigenetic variation provides a source for the generation of phenotypic diversity in natural populations 14,15 and also epigenetic marks, such as differential cytosine methylation, may be inherited and have experienced the pressure of natural selection. 16 Thus, it has become increasingly important to construct a more comprehensive linkage-LD map by including methylation markers. In epigenetic population studies, there are many ways to score and analyse methylation-sensitive amplification polymorphisms, of which one common approach is to score those fragments that stay unmethylated as 1 and all others methylated as 0. This scoring approach leads to the segregation pattern of the so-called single methylation polymorphism (SMP) markers equivalent to that of dominant genetic markers. 17 Lu et al. 18 found a possibility of using three-point analysis to enhance the precision and power of linkage detection for dominant markers. Likewise, Li et al. 19 developed a three-point analysis to analyse LD among three dominant markers and establish a procedure for testing and estimating multiple disequilibria at different orders. However, the simultaneous estimates of LD and recombination fraction between dominant markers are methodologically challenging, because their genotypes can little explain the information of allelic segregation. We implemented a two-level EM algorithm for joint linkage and LD analysis by modelling and retrieving the unobservable feature of segregating genotypes for dominant-inherited markers. An example was demonstrated to show the utility and usefulness of the model by analysing a real data collected from an OP design of an outcrossing species, Torreya grandis, naturally distributed in mountains of the southeastern China.

Sampling strategy
From a natural plant population at Hardy-Weinberg equilibrium (HWE), we randomly sample n unrelated maternal individuals and open-pollinated seeds from each sampled plant. This constitutes a two-level hierarchic sampling setting in which both parental plants and their offspring are genotyped by the same set of molecular markers. Consider a pair of biallelic markers A and B, which generate nine joint genotypes, AABB (coded as 1), AABb (coded as 2), . . ., aabb (coded as 9). Let n i denotes the number of mother plants with marker genotype i, and n j i denotes the number of offspring with marker genotype j derived from mother genotype i. Depending on the genotype of a mother, all offspring from her have different numbers of marker genotypes. Table 1 tabulates genotypic observations for the two-level hierarchic setting.
Let p AB , p Ab , p aB and p ab denote haplotype frequencies for AB, Ab, aB and ab, respectively. The four haplotype frequencies are expressed as where allele frequencies are defined as p A and p a (p A + p a = 1) for marker A and p B and p b (p B + p b = 1) for marker B, respectively, and D is the LD between the two markers. Under the assumption of HWE, the expected frequency of two-marker genotype i in the parental population (P i ) is expressed as the product of the two corresponding haplotype frequencies. Based on the principle of co-transmission of two genes from a parent to its progeny, we derived the expected frequency of two-marker genotype j in the progeny population from mother genotype i (P j i ), expressed as a function of the recombination fraction θ for the double heterozygous mother genotype. All these maternal and offspring genotype frequencies are given in Table 2. Table 1. Data structure of two co-dominant markers typed for a panel of half-sib families, each composed of the mother and offspring, sampled at random from a natural population

. Likelihood
We use M m and M o to denote observed maternal genotypes and offspring genotypes for markers A and B. Let (Ω p , θ) denote the unknown parameters including all haplotype frequencies (arrayed in Ω p ) and the recombination fraction θ. A unifying log-likelihood that integrates two-level maternal and progeny genotype data can be expressed as where the first term of the right side is the upper level likelihood constructed by maternal genotype observations and haplotype frequencies Ω p that form expected maternal genotype frequencies (Table 2) and the second term is the lower level likelihood constructed by maternal and offspring genotypes, haplotype frequencies Ω p and the recombination fraction θ. The upper level likelihood is constructed by mother genotype observations and expected mother genotype frequencies, expressed as log LðΩ p jM m Þ ¼ constant þ n 1 logð p 2 AB Þ þ n 2 logð2 p AB p Ab Þ þ n 3 logð p 2 Ab Þ þ n 4 logð2 p AB p aB Þ þ n 5 logð2 p AB p ab þ 2 p Ab p aB Þ þ n 6 logð2 p Ab p ab Þ þ n 7 logð p 2 aB Þ þ n 8 logð2 p aB p ab Þ þ n 9 logð p 2 ab Þ ð2Þ For double heterozygote AaBb, its observed genotype may be derived from two possible diplotypes, AB|ab (with probability of p AB p ab ) or Ab|aB (with probability of p Ab p aB ), where the vertical lines are used to separate the two underlying haplotypes of a diplotype. For a given parental genotype combination, a certain group of offspring genotypes is produced. For a mother with genotype AaBb, there will be two possible diplotypes, AB|ab and Ab|aB, whose relative frequencies are ϕ ¼ p AB p ab p AB p ab þ p Ab p aB ; 1 À ϕ ¼ p Ab p aB p AB p ab þ p Ab p aB ; respectively. Both the diplotypes will produce haplotypes AB, Ab, aB and ab with frequencies defined as follows: Let Thus, overall haplotype frequencies produced by this parent are calculated as ω 1 for AB or ab and ω 2 for Ab or aB.

Estimation
We implement two EM algorithms to estimate the unknown parameters. The first is implemented to estimate the haplotype frequencies and therefore allelic frequencies and linkage disequilibria by jointly maximizing log-likelihoods (2) and (4). The second is implemented to estimate the recombination fraction that is contained with double heterozygote by maximizing the log-likelihood (4). In the E step of the second EM algorithm, we calculate the probability with which a considered haplotype produced by double heterozygote parent is the recombinant type using In the M step, the estimate of the recombination fraction is obtained by where m equals the sum of following terms, The E and M steps are iterated between Equations (5) and (6) until convergence.

Hypothesis testing
After genetic parameters are estimated, we test whether the two markers are associated and/or linked on the same genomic region. This can use the following hypotheses: The likelihoods under the H 0 and H 1 are calculated from which a log-likelihood ratio is calculated. By comparing this test statistic with a χ 2 threshold with two degrees of freedom, we can accept or reject H 0 .
It is also needed to test the significance of D and θ separately, showing how the two markers are related. Under the null hypothesis H 0 : D ¼ 0; parental diplotype and genotype frequencies can be simply expressed as a function of allele frequencies which can be estimated with no need of the EM algorithm. Similarly, under the null hypothesis H 0 : θ ¼ 0:5, offspring diplotype and genotype frequencies within each family are simply expressed as function of the Mendelian segregation ratio so that no parameter needs to be estimated.

Estimation
Methylation-sensitive amplification polymorphisms can be scored as a dominant marker. 17 For a dominant marker, the homozygote AA for the dominant allele cannot be distinguished from the heterozygote Aa. Thus, these two genotypes are observed as a single 'phenotype' (A_). For two dominant markers A and B, some cells for the observations and expected genotype frequencies in Tables 1 and 2 are collapsed in a way as shown in Tables 3 and 4, respectively. Let n jA= jB denote the observed number of observations of a two-dominant marker genotype j A /j B , j A = A_ (coded as 1) or aa (coded as 0) and j B = B_ (coded as 1) or bb (coded as 0), in the parental population. Similarly, let n kA=kB jA= jB denote the observation of progeny marker genotype k A /k B given its parent genotype j A /j B ( Table 3). Frequencies of offspring genotypes from different mother genotypes are shown in Table 4.
A two-level hierarchical likelihood (1) is formulated to jointly estimate haplotype frequencies and recombination fraction by implementing two EM algorithms. The first EM algorithm is used to estimate haplotype frequencies by jointly maximizing the upper and lower level likelihood, whereas the second EM algorithm used to estimate the recombination by maximizing the lower level likelihood. Here, we show how the second EM algorithm is implemented to estimate the recombination fraction. In the E step, we calculate the overall frequencies of the genotype with the progeny cells (Table 4) using where þ 2ω 2 p Ab p aB ð2 p AB þ p Ab þ p aB Þ δ 2 ¼ p AB p Ab ð p Ab þ p ab Þ þ 2ω 1 p AB p ab ð p Ab Þ þ 2ω 2 p Ab p aB ð2 p Ab þ p ab Þ δ 3 ¼ p AB p aB ð p AB þ p ab Þ þ 2ω 1 p AB p ab ð p aB Þ þ 2ω 2 p Ab p aB ð p aB þ p ab Þ We interpret Φ 1 as a probability that the offspring genotype is AABB and the mother genotype is AaBb while both offspring and mother genotypes are observed as A_B_. The other Φ's can be interpreted in a similar way.
In the M step, we estimate the recombination fraction by 1=1 and m is the sum of following terms, expressed as with ψ 1 and ψ 2 defined as the probabilities with which a considered haplotype produced by a double heterozygote parent AaBb is the recombinant type, i.e. ψ 1 ¼ ð1 À ϕÞθ for haplotype ABor ab ψ 2 ¼ ϕθ for haplotype Abor aB and with ψ 3 and ψ 4 defined as the probabilities with which the double heterozygote parent AaBb produce haplotype AB, ab or Ab, aB, i.e. ψ 3 ¼ ð1 À ϕÞθ þ ϕð1 À θÞ for haplotype ABor ab ψ 4 ¼ ϕθ þ ð1 À ϕÞð1 À θÞ for haplotype Abor aB where φ is defined in Equation (3).

Hypothesis testing
We formulate the hypothesis tests for the significance of the LD and linkage. The estimation of allele frequencies of two dominant markers under the null hypothesis of no LD should also be based on the EM algorithm; i.e. in the E step, we calculate In the M step, we estimate the allele frequencies of markers A and B by using The E and M steps between Equations (9) and (10) are iterated until the estimates are stable. The log-likelihood ratio under the null and alternative hypotheses is calculated and compared with a threshold determined from a χ 2 distribution.

Application
We used a real data analysis to demonstrate how the model is used to construct a linkage-LD map. According to Wu and Zeng's 8 design, we sampled a natural population of Torreya grandis distributed in the southeastern China. 20 In spite of the economic value of T. grandis, this species has not been extensively studied in population genetics. Zeng et al. 21 constructed a first low-density genetic map for genus Torreya using an open-pollinated progeny derived from half-sib seeds of a landrace T. grandis 'Merrillii', providing basic information for marker genotyping of this species. We sampled 50 unrelated trees randomly from a natural population of T. grandis and 20 progeny for each sampled tree. In total, we obtained (50 + 50 × 20) = 1,050 trees, which were genotyped by 233 sequence-related amplified polymorphism (SRAP) markers. SRAPs are dominant-segregating markers, 22 providing an excellent demonstration for the practical utility of our model. This data set constitutes a two-level hierarchic framework with a high level from the parents and a lower level from the progeny. We analysed each pair of these markers using the dominant model to simultaneously estimate the LD and recombination fraction and further test the significance of these two parameters.
Of 233 × 232/2 = 27,028 pairs, 5,733 (21.21%) display significant non-random associations, and 2,140 (7.92%) are significantly linked. It was seen that much fewer pairs (1,239 or 4.58%) are both associated in the original natural population and linked when they co-transmitted during meiosis from the parent to progeny. All this suggests that, for many marker pairs, significant associations are inconsistent with significant linkage. In other words, a pair of unlinked markers may be associated with each other, and also a pair of linked markers may not necessarily have a significant LD. Significant associations of unlinked markers may be due to the impact of some recent evolutionary forces on these markers, whereas the absence of associations between linked markers implies that this particular region of the genome has experienced the random mating of numerous generations. 23 Based on the estimated pair-wise recombination fractions, we constructed a genetic linkage map using MapMaker software. Under the thresholds of θ = 0.3 and LOD = 3.0, 233 markers were grouped into 8 linkage groups, but with 73 markers unlinked. Markers in each linkage group were ordered with an objective function of the sum of adjacent recombination fractions. When the optimal order of a linkage group was determined, the map distance between any two adjacent markers was calculated by Haldane's map function. To the end, we obtained a low-density genetic linkage map for T. grandis (Fig. 1). The total length of the map is 533.2 cM, with an average marker interval of 3.33 cM.
By plotting pair-wise LD over the genetic distance, we constructed a LD map from which to infer the population history of T. grandis (Fig. 2). In general, the LD declines markedly with marker distance within the first 10 cM of genome, and this decrease quickly becomes gradual after this length. This trend suggests that the population of T. grandis sampled may have experienced a long evolutionary history in the environment where this species grows. However, there are quite a few pairs of unlinked markers beyond 10-20 cM of genetic distance which are associated with a large R 2 , suggesting that these loci may be subjected to some recent evolutionary forces. Further studies from single-nucleotide polymorphisms (SNPs) are needed to characterize the biological function of these loci and relate this function to possible anthropological selection or climate change towards an in-depth understanding of the evolutionary mechanisms of T. grandis.
From the distribution of all pair-wise LD, it was found that most pairs of markers do not display a large LD value (Fig. 3), conforming to the result inferred from Fig. 2 that this population may have undertaken a long evolutionary history. Although the LD coefficients tend to be larger between markers located within the same linkage group than between markers from different linkage groups (Fig. 3), a portion of between-group markers has a large LD. This suggests that the genome harbouring these markers may be under recent evolutionary forces. Differences in LD occurrence within and among linkage groups are visualized in Fig. 4. It can be seen that markers in linkage Group 7 are not only rarely associated with those from other linkage groups, but also display a sparse distribution of LD with those within the same linkage group. More specifically, of all 13 × 12/2 = 78 possible combinations between 13 markers of this group, only 15 (19.2%) pairs display significant associations. Yet, such percentages for other linkage groups, such as Groups 2 and 8, were observed to be >60%. A reduced frequency of significant associations in linkage Group 7 suggests that this part of the T. grandis genome may have experienced a long evolutionary history. Relative to linkage Group 7, linkage Groups 2 and 8 has much more frequent LD between different loci from the same and different linkage groups, implying that some recent pressure of natural selection may have taken place in this part of the genome.

Computer simulation
To examine the statistical properties of the model for constructing the LD map with dominant markers, we performed computer simulation by mimicking a natural population at HWE. We randomly sample a panel of unrelated open-pollinated families (each including a female parent and multiple offspring). Given a total of 1,000 progeny, the simulation considers three sampling strategies, 1,000 × 1 (1,000 maternals with a single offspring), 200 × 5 (200 maternals with 5 offspring) and 50 × 20 (50 maternals with 20 offspring). For each strategy, we simulated two co-dominant markers with strong and weak LD, D = 0.15 and 0.02, respectively, in the population. The allele frequencies for the two markers are p A = 0.6 vs. p a = 0.4 and p B = 0.5 vs. p b = 0.5, respectively. The two markers are linked with two sizes of then recombination  fraction θ = 0.20 and 0.05. In each design, 1,000 simulation replicates were performed to estimate the means of the MLEs for each parameter and their standard deviations. By collapsing the simulated co-dominant marker genotypes into a dominant setting, we can test how well our model performs to construct a dominant LD map. Table 5 gives the results of parameter estimates from simulation studies under different designs. As expected, because of more information contained, co-dominant markers provide better estimation of each parameter than dominant markers, although the drawback of the latter can be overcome by choosing an optimal sampling strategy. General trends of estimation precision of parameters are summarized as follows: (i) LD can be estimated with high accuracy and precision for both co-dominant and dominant markers under all simulation schemes considered. However, as expected, more small families perform better than fewer larger families, because the estimate of LD is based on the sampled parents from the original population. (ii) The estimation of the recombination fraction θ is first dependent on the size of LD, followed by the degree of linkage and the sampling strategy. If LD is near zero, then φ is close to 1 2 so that ω 1 and ω 2 will not contain θ. Thus, θ is not estimable when there is no association between the two markers. To better estimate the linkage, precise estimation of LD is essential.
An additional scenario of simulation was conducted by collapsing only one of the two co-dominant markers into a dominant status. As expected, this scenario was intermediate in the precision of parameter estimation between those in which both markers are co-dominant and dominant, respectively. We have also performed simulation studies using the same schemes described above, but by quantitatively changing the values of LD within its interval. This simulation allows us to determine the minimum value of LD beyond which θ can be well estimated. In the package of software, we provide the function of determining such an LD value given a sampling strategy and allele frequencies.

Discussion
Similar to the HWE, significant departure from linkage equilibrium (LE) indicates that the population studied is undergoing some evolutionary pressure by extensive inbreeding, gene flow, genetic drift, mutation, natural selection, etc. However, unlike HWE, LE cannot be established in one generation of random mating, rather than it needs a number of generations to be reached, because LD declines at a rate that depends on the recombination fraction. 1 For this reason, a test of LD about its departure from LE may tell us more stories about the evolutionary history of the population. Just because the use of LD to infer a population's past events is founded on its relationship with the frequency of recombination, a joint estimation of the LD and recombination fraction can provide more precise information about evolutionary inference. 8 In this article, we extended Wu and Zeng's 8 open-pollinated progeny design to construct a linkage-LD map and particularly showed how this design can accommodate to missing information of dominant-segregating markers such as cytosine methylation markers. DNA methylation, as a covalent base modification of plant nuclear genomes, is thought to be accurately inherited through both mitotic and meiotic cell divisions. 14 Also, similarly to spontaneous mutations in DNA, errors in the maintenance of methylation states would violate the equilibrium of natural populations, leading to changes in associations between epialleles at different methylated loci. Thus, by constructing a linkage-LD map using those so-called SMPs, we can infer evolutionary pasts of the natural populations from a different perspective. 23,15 Indeed, as a simple and cheap technique, dominant markers, such as SRAP markers, 22 are still being used for many under-represented species including forest trees and wildlife species. 24,25,26 Thus, the dominant model described can widen the usefulness of the openpollinated design in practical population studies. Simulation studies have determined an appropriate sampling strategy to construct a linkage-LD map using dominant markers. Since the precise estimation of LD is of primary importance to linkage estimation, we recommend using many smaller families over small larger families. In addition, the efficiency of linkage-LD map construction can be enhanced by threepoint analysis, which has proved to not only provide more information about the genome structure and organization, but also reduce a possibility of biased estimation of the linkage when LD has a small value. 27,28 This is especially true for dominant markers.
Although the original model for joint linkage and LD analysis was proposed more than a decade ago, its practical use has not occurred until recent years when the collection of molecular markers for underrepresented species has been feasible. The current study presents one of the first applications of Wu and Zeng's 8 open-pollinated design to study the population structure and history of an outcrossing species. Torreya grandis is a gymnosperm tree species with a large size, endemic to the eastern and southeastern China. 20 Because of economic and ecological values, this species has been increasingly studied in terms of its evolutionary history and the genetic control of complex traits. 21 The results from a joint linkage and LD analysis with dominant markers suggest that this species has experienced a long history of evolution, but some regions of its genome are subject to a certain recent evolutionary forces. This information will provide guidance for better germplasm management of this important woody plant. Advances in understanding the evolutionary history of Torreya can be made by sampling multiple populations in a range of its distributions. This study, along with the previous one based on half-sib seeds from a single tree of T. grandis reporting a linkage map covering a total of 7,139.9 cM in 10 groups, 21 was among the first to construct genetic linkage maps for genus. It is important to align the two maps into a single one for a better coverage of the Torreya genome. Also, much more markers that can align those unlinked markers detected from the current and Zeng et al.'s studies are needed to completely cover 11 chromosomes of T. grandis. A complete coverage of markers allows more extensive studies of variation and examination of LD patterns, which will better reveal levels of complexity for this species.
It has been recognized that genetic mapping based on LD analysis helps to fine map complex traits or disease, but this approach may have a high likelihood to detect spurious signals of association, because allelic association can also be due to evolutionary forces rather than physical linkage. 2 A joint linkage and LD analysis can overcome this false-positive discovery. 27 Thus, the LD map constructed from genetic and epigenetic markers will provide an important fuel to map key QTLs that affect quantitative traits of economic and environmental importance. 7