Abstract

Motivation

Estimating haplotype frequencies from genotype data plays an important role in genetic analysis. In silico methods are usually computationally involved since phase information is not available. Due to tight linkage disequilibrium and low recombination rates, the number of haplotypes observed in human populations is far less than all the possibilities. This motivates us to solve the estimation problem by maximizing the sparsity of existing haplotypes. Here, we propose a new algorithm by applying the compressive sensing (CS) theory in the field of signal processing, compressive sensing haplotype inference (CSHAP), to solve the sparse representation of haplotype frequencies based on allele frequencies and between-allele co-variances.

Results

Our proposed approach can handle both individual genotype data and pooled DNA data with hundreds of loci. The CSHAP exhibits the same accuracy compared with the state-of-the-art methods, but runs several orders of magnitude faster. CSHAP can also handle with missing genotype data imputations efficiently.

Availability and implementation

The CSHAP is implemented in R, the source code and the testing datasets are available at http://home.ustc.edu.cn/∼zhouys/CSHAP/.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Inferring haplotype pairs from unphased genotypes is of great importance in genetic analysis. The key point is to solve the phase ambiguity from genotype data, and there have been many methods for haplotype inference (Browning and Browning, 2011; Liu et al., 2008; Niu et al., 2002). Likelihood-based approaches using the expectation-maximization (EM) algorithm and related methods are proved to be statistically efficient under Hardy-Weinberg equilibrium (HWE) (Excoffier and Slatkin, 1995; Qin et al., 2002). On the other hand, a number of Bayesian methods that incorporate complex biological knowledge have been widely studied in literature (Lin et al., 2002; Niu et al., 2002; Stephens et al., 2001; Stephens and Donnelly, 2003; Stephens and Scheet, 2005; Xing et al., 2007; Zhang et al., 2006). Among those, the PHASE algorithm (Stephens and Scheet, 2005) had been considered as a gold standard, in which an approximate coalescent prior was used to model the clustering property of haplotypes. For modern phasing methods, hidden Markov models (HMMs) are often used. Under the assumption that haplotypes tend to cluster into groups over short regions, Scheet and Stephens (2006) proposed the fastPHASE algorithm based on HMM. This algorithm made it possible to a phase larger number of markers and samples. To reduce the computational time of PHASE, Delaneau et al. (2008) used the binary trees to represent the sets of possible haplotypes and proposed a linear complexity phasing method Shape-IT (Delaneau et al., 2011). By combining the local windows approach of phasing in Impute2 (Howie et al., 2011), Delaneau et al. (2013) proposed Shape-IT v2, which has been widely used in the 1000 Genomes Project (Delaneau et al., 2014).

Due to low rates of mutations and re-combinations in genetic evolution and the resultant high linkage disequilibrium (LD) in the genome, only a few haplotypes out of the large number of possible haplotypes are present in population (Daly et al., 2001; Patil et al., 2001). Many algorithms have been developed to maximize different measurements of parsimony of population haplotypes.

Clark (1990) suggested a heuristic approach to reconstruct haplotypes that use the least number of distinct haplotypes. Gusfield (2001) reformulated Clark’s algorithm into a conceptual integer linear programing (ILP) problem, then made a further modification (Gusfield, 2003) to make it practical. Computational results reported that the ILP method is very accurate (Gusfield and Orzack, 2005) for tightly linked polymorphisms. Moreover, Gusfield (2003) proposed an approach to the haplotype inference problem called Pure-Parsimony approach, which tries to find a solution that minimizes the total number of distinct haplotypes used, or equivalently, the 0 norm of haplotype frequency vector. Unfortunately, the Pure-Parsimony method is NP-hard, which cannot be computed efficiently. Jajamovich and Wang (2012) proposed to obtain a sparse solution by minimizing the Tsallis entropy of the frequency vector, which is still an NP-hard problem.

The most impressive conclusion in compressive sensing (CS) theory in the field of signal processing is that sparse or compressible signals could be accurately and efficiently recovered by minimizing the 1 norm of signals subject to the measurements from well-behaved under-determined linear sensing systems. Based on this, we propose a regularization algorithm, CSHAP, to reconstruct haplotypes by maximizing an 1 measure of parsimony defined as the weighted sum of haplotype frequencies, subject to constraints of a series of under-determined linear equations determined by estimated allele frequencies and pairwise covariance (or LD coefficients) of genotypes. Although the use of the first two moments as constraints may lose information, but as we will see later the loss is acceptable and exact or almost exact reconstruction is possible if haplotypes that present in population are sparse.

In this article, we propose a CS framework, CSHAP, for haplotype inference. We first introduce the approach based on the sparse representation of the exact allele frequencies and LD coefficients. Then we relax the constraints by allowing estimation errors in allele frequencies and LD coefficients. We will show that the method is applicable to both individual and pooling designs and works under both HWE and Hardy-Weinberg disequilibrium (HWD). Furthermore, by applying a modified EM algorithm, we can improve the accuracy of the CSHAP algorithm and make it possible to deal with missing data. Finally, based on the idea of partition-ligation (PL) first proposed by Niu et al. (2002), we further extend the algorithm to support long-range haplotypes. Performance of CSHAP is evaluated and compared with other methods by simulation studies and real data analysis.

2 Materials and methods

2.1 Notations

Consider q single nucleotide polymorphism (SNP) loci. Each locus is diallelic and thus theoretically there are r=2q possible haplotypes. Denote the minor allele at each locus by 1 and major allele by 0. We index the haplotypes by the sequence of coefficients of binary expansion of 0,1,,r, i.e. let hj=(h1j,,hqj) represent the jth haplotype vector.

For a specific locus of an individual, if both alleles have a 0 (or 1), this site is called homozygous and is encoded with a 0 (or 2). Conversely, if the two alleles are different, the site is heterozygous and the genotype for the locus is 1. The above encoding also works for pooling designs. Suppose we have T pooled genotype observations at the q loci, each is from a pool of N individuals (or 2N chromosomes equivalently). Let Gi=(Gi1,,Giq) be the pooled genotypes of the ith pool. Since Gi is a sum of 2N haplotypes, we have 0Gik2N,i=1,,T,k=1,2,,q. If N = 1, then the observations are individual genotypes and T stands for the number of individuals. Our objective is to estimate the haplotype frequencies p based on the genotype observations G=(G1,,GT).

2.2 Sparse representation of haplotype frequency

The minor allele frequency (MAF) for each locus and the joint probability of minor alleles at any two loci can be represented linearly using the haplotype frequencies p. In fact, the MAFs have a linear form representation ω0=Hp, and the joint probabilities of minor alleles can be expressed as η0=HΛH, where H=(h1,,hr) is a q × r matrix and Λ=diag{p} is a diagonal r × r matrix with diagonal elements p (Sham et al., 2002). Although η0 is a matrix consisting of all pairwise allele probabilities of any two loci, i.e. η0i,j=P(locus i=1, locus j=1). These two representations are true whether the HWE holds or not. In addition, we have the natural constraint 1p=1 since p is a probability distribution. Now we write the above three equations into a single linear equation:
Ψp=b,
(1)
with
Ψ=(1HHH)
where the q(q1)2×r matrix HH denotes the matrix whose rows are the logical product of all pairs of H, and b consists of the scalar 1, the vector ω0 and vector of the off-diagonal entries of η0, corresponding to the above three constraints, respectively.
Next, we normalize matrix Ψ such that its columns have unit 2 norms. Denote Λτ=diag(τ1,,τr), where τj is the 2 norm of the jth column of Ψ. We denote the normalized matrix as Φ=ΨΛτ1 and will call it the sensing matrix. Define p*=Λτp. Then (1) can be rewritten as
Φp*=b
(2)

Note that the sensing matrix Φ has dimension q2+q+22×r and the (2) is underdetermined when q2+q+22<r=2q, or equivalently q > 3. Therefore, b can be regarded as the noiseless partial observations of p* from underdetermined linear sensing system Φ.

Because only relatively few haplotypes can be present in a population when the number of possible haplotypes is large, the principle of parsimony has been emphasized in haplotype reconstruction (Clark, 1990; Gusfield, 2003). Noticing the equivalency between the sparsity of p and p*, the sparse signal p* can be recovered by solving the following optimization problem:
minp*R+r||p*||0subject to Φp*=b.
(3)

Solving (3) is equivalent to seeking the sparsest solution in the feasible space, which is a combinatorial optimization problem and is NP-hard.

The 1 norm is known as the convex envelope of the 0 norm over convex set Δ={xRr:||x||1} when ||x||=max xi. In other words, ||x||1 is the best pointwise approximation to ||x||0 among the set of convex function on Δ (Hiriart-Urruty and Lemaréchal, 1993; Recht et al., 2010). Thus, instead of minimizing the 0 norm of p* directly, we suggest the 1 minimization:
minp*R+r||p*||1subject to Φp*=b.
(4)
which is the typical form in CS theory and can be solved efficiently by linear programing.

The above Equation (4) is based on known exact value of b, however, b is unknown and needs to be estimated based on genotype observations. In the next subsection, we will make some modifications of (4) by allowing b with estimation errors.

2.3 Presence of noise

The MAFs can be readily estimated by sample means ω^0=iGi/(2NT). Under the assumption of HWE, the variance–covariance matrix of the q alleles can be estimated by sample variance–covariance matrix Σ^0=iGiGi/(2NT)2Nω^0ω^0 (Zhang et al., 2008). Then the two-locus joint probabilities η0=Σ0+ω0ω0 can be estimated by η^0=Σ^0+ω^0ω^0. Then a natural estimate of b is b^ consisting of 1, ω^ and the off-diagonal entries of η^.

Now allowing error of b^ in measuring Φp*, we transform the optimization Equation (4) into the following robust problem:
minp*R+r||p*||1subject to||Φp*b^||2ϵ
(5)
where ϵ is a tuning parameter controlling the error of estimates. This problem is a special case of second order cone programing (Boyd and Vandenberghe, 2004), which can be solved efficiently by many standard approaches such as iteratively reweighted least squares or interior-point algorithms.

The haplotype frequency pj is then estimated by p^j=p^j*/τj,j=1,,r,p* is the solution to (5). Necessary normalization is carried out to ensure p^ is a proper probability distribution.

2.4 Selection of tuning parameter ϵ

To solve the Equation (5), we need to select a proper tuning parameter ϵ. Given the fact that ϵ = 0 when b^=b, the ϵ should be as small as possible as the 2 estimation error of the first two moments should not be too large. However, we found that there is usually no solution for (5) if ϵ = 0 as typically b^b. On the other hand, the estimated frequencies are degenerate to (1,0,,0) as ϵ.

It is reasonable to believe that, for sufficiently large T and N, the estimates b^ is very close to b with certain probability, thereby the level of noise (2 error of estimates) is bounded. The ϵ should be related to the signal-to-noise ratio, which is directly determined by T, N and b.

From the Central Limit Theorem in statistics, we know that if the pool size N is large, then Gi will be asymptotically normally distributed with mean μG=2Nω0 and covariance matrix ΣG=2NΣ0 (Kuk et al., 2008). Using this fact we obtain (Bilodeau and Brenner, 2008)
2NT(ω^0ω)dN(0,ΣG)2N(T1)Σ^0dWq(T1,ΣG)
(6)

Here d represents convergence in distribution and Wq(T1,ΣG) represent the Wishart distribution with T − 1 degrees of freedom and scale matrix ΣG.

Denote e=b^Φp*, approximately we have eN(0,Σe), where Σe=Var(b^). Consider the squared norm of the error ||e||22=ee, then:
E(ee)=tr (Σe)Var(ee)=2 tr (ΣeΣe)
holds asymptotically. Based on the well-known concentration inequalities, the probability that ||e||22 exceeds its mean plus 2 or 3 standard deviations is small (<5 or 2%, respectively). We then solve (5) with
ϵλ2=tr (Σe)+λ2 tr (ΣeΣe)
and simply select λ = 2.

Since all the components of the vector b^ are made up of the elements in ω^0 and Σ^0, and we have obtained the asymptotic distribution of them in (6). The variance–covariance matrix Σe of b^, can be derived or estimated from ω^0 and Σ^0 naturally (details can be found in the Supplementary Section 1.1).

2.5 The deviation of HWE

Likelihood-based methods for estimating haplotype frequencies often require the assumption of HWE, but which may not be true if there exists population substructure. Here we introduce the inbreeding coefficient ρ to describe the possible dependence of the two haplotypes within a subject (Zeng and Lin, 2005). The distribution of diplotype H=[hk,hl] is assumed as:
P(H|p,ρ)=(1ρ)pkpl+ρpkIk=l,
(7)
where Ik=l is the indicator of whether k = l. Larger ρ means more similar between H1 and H2. Note that if one takes ρ = 0, then this is equivalent to the HWE assumption. Under model (7), the variance–covariance matrix of the genotype data is Σρ=(1+ρ)Σ0=(1+ρ)(η0ω0ω0) which can be estimated by Σ^ρ=12NTiGiGi2Nω^ρω^ρ and ω^ρ=ω^0=12NTiGi. The joint probabilities ηρ therefore are estimated by η^ρ=Σ^ρ/(1+ρ^)+ω^ρω^ρ, where ρ^ is an estimate of ρ. In this article, unless when HWE is explicitly assumed, we adopt the following estimate
ρ^=i=1qVii=1qG¯i(1G¯i/2N)1,
where G¯i, Vi is the sample mean and variance of genotypes of ith locus, respectively.
Then b can be readily estimated by the estimator b^ρ=b^(ω^ρ,η^ρ). Now problem (5) is then modified to
minp*R+r||p*||1subject to||Φp*b^ρ||2ϵ
(8)
where ϵ is the 2 error bound.
On the other hand, introducing a new estimator ρ^ will increase the variance, especially when ρ gets close to zero. In this case, the departure from HWE is negligible and it is not worth considering an extra inbreeding parameter, since our EM algorithm is relatively robust to deviations from HWE. An adaptive way to test the hypothesis
H0:ρ=0H1:ρ>0
(9)

In the simulation experiment, we compute ρ^L=ρ^1.645v^ρ, the 95% lower one-sided confidence bound, where the v^ρ is the variance of ρ^ obtained by bootstrap. If ρ^L>0, we will use the modified Equation (8). If ρ^L0, we will ignore the inbreeding parameter and still use (5).

2.6 CSHAP for individual DNA data

For individual genotyping design (N = 1), the Equation (5) can accurately identify the major haplotypes which really exist in the population, but the values may be biased.

Notice that the Equation (5) does not guarantee the compatibility of haplotypes with all the observed genotypes, especially when ϵ is relatively large, while too small ϵ may leads to no solution. Therefore, we first find the smallest ϵ0 such that solution exists. Then we calculate the largest ϵλ described in Section 2.4. Now we have a range of ϵ values [ϵ0,ϵλ]. For every ϵ that falls within this interval, our equation will return a series of solutions with different compatibility and sparsity. In all these compatible solutions, we choose the sparsest one as p^.

Once we get the above solution p^, we obtain a set of haplotypes H based on the non-zero entries of p^. Now we can use p^ and H as the input of the standard EM algorithm. In each iteration, we will only consider the haplotypes contained in H, and only update their corresponding frequencies in p^.

In fact, the p^ is quite close to the true haplotype frequency, especially for common-frequency (pi5%) variants. Correspondingly, H is very similar to the haplotypes that really present in sample genotypes, so that we can regard H as a haplotype reference panel. For most individuals, we can find only one compatible diplotype configurations in the reference panel; then, we regard these solutions as ‘solved’ and exclude them in the following iterations. Although most of the phasing algorithm needs to update the estimated haplotypes for every individuals in each iteration, we only consider the remaining ones. Furthermore, when the total number of individuals T is large, we only use the distinctive genotypes that appear in the remaining individuals, since the number of distinctive genotypes is very limited compared with T. These modifications will significantly reduce the number of individuals that need to be considered in each iteration.

Another important feature of our algorithm is that it can deal with missing data easily. We only need to estimate sample moments in Equation (5) based on samples which are complete on the locus or loci involved. The missing locus is then imputed with H in the subsequent EM procedures (details can be found in the Supplementary Sections 1.2 and 1.3).

The resulting hybrid algorithm, referred as the CSHAP algorithm, incorporates the advantages of both CS theory and EM algorithm. Our algorithm can be more accurate than purely using Equation (5) and more computationally efficient than the current state-of-the-art algorithms.

2.7 CSHAP for large DNA pools

The haplotype estimation methods of pooled DNA data can be divided into two categories: To focus on a small number of pool sizes N (each pool contains about two to three individuals) with large number of markers q, or to consider on a small number of markers q but lager pool sizes N. The former can be solved with EM algorithm (Yang et al., 2003), however, for the other case, the EM algorithm is computationally involved and is not feasible when pool size N10. Kuk et al. (2008) proposed an approximate EM (AEM) algorithm based on Central Limit Theorem, which makes the method based on EM algorithm computationally feasible for large N, with substantial improvement in accuracy simultaneously. However, the time consumption of AEM grows exponentially with q, which limits the performance of AEM when q is larger.

In this case, we substitute the EM procedure in CSHAP with the AEM algorithm. The resulting hybrid algorithm runs nearly two orders of magnitude faster than AEM while exhibiting almost same accuracy.

2.8 CSHAP for long-range haplotypes

Due to the computer memory constraint, our algorithm supports at most 24–27 loci for the length of each block. When the number of loci q exceeds this limit, we apply a special PL strategy (Niu et al., 2002).

Assuming the genotype data consists of L=K×M loci, where M is the number of blocks and K is the length of each block. Without loss of generality, the genotype data can be divided into M contiguous ‘atomistic’ blocks. In most of the previously published articles (Delaneau et al., 2011; Niu et al., 2002; Qin et al., 2002; Stephens and Scheet, 2005;), K was usually set between 5 and 8 due to the limitation of algorithms. In contrast, our method allows setting K = 16–20 while maintaining efficiency. Obviously, larger K and smaller M can help avoid the local-mode problem (Qin et al., 2002) of EM.

Once we have conducted the CSHAP procedure for each of M atomistic blocks, we iteratively combine the (2i1)th block with the subsequent (2i)th block until all blocks are ligated as one. Most of the PL methods usually take D haplotypes with top estimated frequencies in each block, where D can be specified between 40 and 50 (Niu et al., 2002) or be determined separately by some frequency thresholds (Stephens and Donnelly, 2003). This ‘thrown away’ procedure can limit the number of possible concatenated haplotypes to D2, but may lead the algorithm to trap into a local mode simultaneously. Here we use all estimated haplotypes in each block instead. However, the total number of all possible concatenated haplotypes may be significantly larger than D2. Notice that only relatively few concatenated haplotypes are true among all those possible haplotypes, we use CS again in this ligation step.

Next we demonstrate how to combine the block A with the adjacent block B. From the estimated haplotype frequencies p^A we obtain a haplotype set HA with nA haplotypes. Denote HA=(hn1,,hnA) is a qA×nA matrix obtained by combining the haplotype vectors in HA by column. Similarly, we have HB and HB for block B. Denote the set of all possible concatenated haplotypes as HA×B. Define HAB is a (qA+qB)×(nAnB) matrix which can be regard as the Cartesian product of HA and HB. Similar to the structure of Ψ in the (1), we have:
ΨAB=(1HABHABHAB)
and ΦAB,ΛAB,pAB*,bAB correspondingly.
Now we transform the ligation problem into this following problem, which has been already resolved in the above.
minpAB*R+||pAB*||1subject to||ΦABpAB*b^AB||2ϵ

The non-zero entries of p^AB correspond to the estimated haplotypes and frequencies in HA×B. We can also use the EM algorithm to improve the accuracy of p^AB. This process is repeatable until all blocks are connected as one.

When the number of loci increases to, say, more than 1000s, the frequency estimation is meaningless, since almost every haplotypes are rare. What we actually do at this time is ‘phasing’, which is a totally different task. However, with a few changes, our algorithm can be used as a phasing tool under this situation. We describe the details and results in the Supplementary Section 1.4.

3 Results

We demonstrate the performances of our algorithm by simulations. First we compare the estimation accuracies of CSHAP, PHASE v2.1.1, fastPHASE v1.4, Shape-IT v2.17 and PL-EM v1.0 (with default settings) using individual genotyping data under the assumption of HWE as well as HWD. In addition, we illustrate the computational efficiency of CSHAP in comparison with the others.

For pooling design, we compare the estimation accuracies of CSHAP, PoooL and AEM with various pool sizes and sample sizes. Meanwhile, we show the running time of CSHAP and AEM. All the simulations are repeated 10 000 times if not specified explicitly. Our platform is a desktop computer with Ubuntu 16.04 64-bit, Intel Core i5-7400 CPU@3.00 GHz and 8.0 GB RAM.

To summarize the accuracies of our methods, and to compare the accuracies between the different algorithms, we use two measures: the absolute discrepancy (Excoffier and Slatkin, 1995) and the χ2 distance. The asbolute discrepancy between the estimated haplotype frequencies and their true values is defined as i=1r|p^ipi|/2, and the χ2 distance is defined as i=1, pi0r(p^ipi)2/pi.

3.1 Individual design

We generate T unrelated individual genotypes according to the commonly used 10-locus haplotype frequencies of angiotensinogen (AGT) gene considered in Yang et al. (2003) by assuming HWE, T=10,20,50,100,200,500,1000,2000. Part of the results is demonstrated in Supplementary Table S2. We can see that for small sample size, the performances of CSHAP are better than the others in the sense of bias and the effective accumulated probability (EAP) of p. Also CSHAP has smaller estimation bias of almost every haplotype than PHASE’s and Shape-IT’s. For larger sample sizes (T100), the precision of CSHAP is very close to the PHASE and PL-EM, and slightly better than Shape-IT.

Simulation results are summarized in Figure 1. We can see that, for small sample size (T <  100), the absolute discrepancy of CSHAP is smaller than those of the others, and the solution of CSHAP is much sparser than those of PHASE and Shape-IT. When T exceeds about 100, CSHAP behaves as accurate as PHASE, while slightly better than fastPHASE and Shape-IT.

Fig. 1.

Average accuracy comparison in 10 000 trials with increasing numbers of sample size T. The top panel shows the χ2 distance between the true frequency and estimated frequencies. The second panel shows the absolute discrepancy. The bottom panel shows the sparsity (the number of existing haplotypes in solution), respectively. The true haplotype dataset has 11 unique existing haplotypes

Comparisons of computational efficiencies are given in Table 1. The CSHAP is at least two to three orders of magnitude faster than PHASE and fastPHASE. For T = 100 in our simulation, the CSHAP is 365 times faster more efficient than PHASE while maintaining the same accuracy. For T = 2000, the CSHAP is 1775 times faster than PHASE, 2010 times faster than fastPHASE, 468 times faster than Shape-IT and 19 times faster than PL-EM. Notice that the running time of most other methods shows approximately linear trends with sample size T, and in contrast, our method only in the order of a logarithmic scale. More details are in Supplementary Figure S1.

Table 1.

Running times of CSHAP, PHASE, fastPHASE, Shape-IT and PL-EM algorithm for 1000 replicates of simulation

TCSHAPPHASEfastPHASEShape-ITPL-EM
1010.2741338120140
2012.91459651139142
5017.537261589246146
10021.177273145410156
20026.098516309731188
50029.116 61915 7462007258
100030.828 47132 0196014384
200032.657 99065 66215 318645
TCSHAPPHASEfastPHASEShape-ITPL-EM
1010.2741338120140
2012.91459651139142
5017.537261589246146
10021.177273145410156
20026.098516309731188
50029.116 61915 7462007258
100030.828 47132 0196014384
200032.657 99065 66215 318645

Note: Using Yang et al. (2003)s 10-locus haplotype frequencies of AGT dataset for individual data. The unit of running time is second. T stands for the sample size.

Table 1.

Running times of CSHAP, PHASE, fastPHASE, Shape-IT and PL-EM algorithm for 1000 replicates of simulation

TCSHAPPHASEfastPHASEShape-ITPL-EM
1010.2741338120140
2012.91459651139142
5017.537261589246146
10021.177273145410156
20026.098516309731188
50029.116 61915 7462007258
100030.828 47132 0196014384
200032.657 99065 66215 318645
TCSHAPPHASEfastPHASEShape-ITPL-EM
1010.2741338120140
2012.91459651139142
5017.537261589246146
10021.177273145410156
20026.098516309731188
50029.116 61915 7462007258
100030.828 47132 0196014384
200032.657 99065 66215 318645

Note: Using Yang et al. (2003)s 10-locus haplotype frequencies of AGT dataset for individual data. The unit of running time is second. T stands for the sample size.

3.2 Haplotype diversity and missing data

To measure the performances under different haplotype diversities, we generate T = 100 individual genotypes according to the 11-locus G6PD haplotypes in Sabeti et al. (2002). The G6PD haplotypes are different among the following six ethnic populations: African American, Asian, Beni (Nigeria), European American, Shona (Zimbabwe) and Yoruba (Nigeria). The published haplotypes and frequencies are given in Supplementary Table S3. In addition, we randomly mask about 5% of the data as missing sites.

As Figure 2 shows, our CSHAP method performs well under varying degrees of diversities, as well as missing data imputation. In fact, the χ2 distances and the absolute discrepancies of PHASE, fastPHASE, CSHAP are similar. The Shape-IT identified too many non-existent haplotypes, which leads to higher discrepancies. PL-EM seems to have problems accurately imputing missing data in the Asian population. Our CSHAP method has almost the same precision as PHASE, while fastPHASE is only a little less accurate than PHASE, but our method gives a sparser solution with the same precision.

Fig. 2.

Measures of performance of PHASE, fastPHASE, CSHAP, Shape-IT and PL-EM based on the simulated G6PD gene datasets among six different populations. Sample size T = 100, all the simulations are repeated 10 000 times

3.3 The case of HWD

We generate T = 100 haplotype pairs from (7) using the 10-locus AGT gene with a series of inbreeding coefficients ρ=0.05,0.1,0.15,0.2,0.3. For each of 10 000 simulation trials, we first construct a bootstrap sample by resampling with replacement from these sample genotypes 1000 times to estimate the variance of ρ^ in (9). The proportion of times out of the simulations that ρ^L0 is 0.888,0.750,0.502,0.303,0.065, for the above mentioned different values of ρ, respectively.

Results are summarized in the Supplementary Table S4. We can see that introducing a ρ^ helps to reduce the impact of deviations from HWE, but the SDs are slightly larger. Our adaptive estimator ρ^L can make this tradeoff conveniently.

3.4 Pooling design

Next we consider large pooling design. The simulations are based on pooled 10-locus AGT gene data with N = 50, 100 and T = 50, 100 where HWE is assumed. We investigate the performance of CSHAP with PoooL and AEM. To compare with previous published results by Kuk et al. (2008) and Zhang et al. (2008), we use the averaged Euclidean distance d=i=1r(p^ipi)2/r instead of χ2 distance.

Figure 3 shows summarized results. The performance of PoooL is not optimal, since the distance and discrepancy are many times higher than those of AEM and CSHAP. Meanwhile, CSHAP shows comparable performance with AEM, the distance and discrepancy are slightly higher, but CSHAP can obtain sparser results than AEM. However, the computational efficiency of CSHAP is about two orders of magnitude faster than that of AEM, Supplementary Table S5 demonstrates that CSHAP runs about 120 times faster than AEM in our simulations.

Fig. 3.

Measures of performance of PoooL, AEM and CSHAP. T stands for the number of pools, and each pool contains N individuals. For each pooling design and method, simulation was repeated 10 000 times

3.5 Long-range haplotypes

In this section, we use the human angiotensin converting enzyme (ACE) dataset. This dataset contains q = 52 SNPs and 11 individuals, the genotypes have up to 37 heterozygous sites. For all individuals, 13 unique haplotypes from the 22 chromosomes are resolved in Rieder et al. (1999). We generate T=50,100,200,500,1000,2000 unrelated individual genotypes by assuming HWE and test all the methods in Section 3.1. The result is shown in Figure 4. Our method, PHASE and PL-EM produce the best solution, but CSHAP costs much less time. The biases of Shape-IT and fastPHASE are relatively slightly larger.

Fig. 4.

Measures of performance of PHASE, fastPHASE, CSHAP, Shape-IT and PL-EM. For each method and sample size T, simulation was repeated 1000 times. The Shape-IT failed to estimate haplotype frequencies when T = 2000, since the program aborted with an error

The running time of all methods are described in Table 2. CSHAP is about two to three orders of magnitude faster than PHASE and fastPHASE, while providing the most accurate results. For example, when T = 2000, the CSHAP is 3000 times faster than PHASE, 1634 times faster than fastPHASE, 303 times faster than Shape-IT and 10 times faster than PL-EM. More details are in Supplementary Figure S2.

Table 2.

Running times of CSHAP, PHASE, fastPHASE, Shape-IT and PL-EM algorithm for 100 replicates of simulation

TCSHAPPHASEfastPHASEShape-ITPL-EM
509.547068219863
1009.79571168422884
20010.912 176332445793
50012.019 68782831196107
100014.133 78316 4773250140
200020.260 69033 0836153211
TCSHAPPHASEfastPHASEShape-ITPL-EM
509.547068219863
1009.79571168422884
20010.912 176332445793
50012.019 68782831196107
100014.133 78316 4773250140
200020.260 69033 0836153211

Note: Using Rieder et al. (1999)s 52-locus haplotype frequencies on ACE dataset. The unit of running time is second. T stands for the sample size.

Table 2.

Running times of CSHAP, PHASE, fastPHASE, Shape-IT and PL-EM algorithm for 100 replicates of simulation

TCSHAPPHASEfastPHASEShape-ITPL-EM
509.547068219863
1009.79571168422884
20010.912 176332445793
50012.019 68782831196107
100014.133 78316 4773250140
200020.260 69033 0836153211
TCSHAPPHASEfastPHASEShape-ITPL-EM
509.547068219863
1009.79571168422884
20010.912 176332445793
50012.019 68782831196107
100014.133 78316 4773250140
200020.260 69033 0836153211

Note: Using Rieder et al. (1999)s 52-locus haplotype frequencies on ACE dataset. The unit of running time is second. T stands for the sample size.

4 Conclusion

In this study, we propose an efficient algorithm, CSHAP, for estimating haplotype frequencies from individual or pooled DNA data, under the HWE assumption or not. The CSHAP algorithm minimizes the weighted sum of haplotype frequencies under constraints on the allele frequencies and covariances (i.e. LD coefficients). This method is based on the maximum parsimony principle of Gusfield (2003), which was to minimize the total number of distinct haplotypes, subject to the condition that the solutions are consistent with genotype observations. In our approach, we substitute the 0 norm by the 1 norm and reduce the consistency condition to a system of linear constraints on the first two moments of genotype observations. Besides, we use a modified EM algorithm to boost accuracy efficiently.

Extensive simulation studies show that our method is comparable to or better than the existing methods but has significant computational advantage. An outstanding feature of our method is its computing efficiency for both individual and pooled DNA data. We showed that if the number of existing haplotypes in population is small, our method can recover the haplotypes with high accuracy. In addition, by introducing the inbreeding coefficients the HWE assumption is not required in our approach and this quantity can also capture the inflation of variance in genotype observations caused by genotype errors, population substructures etc.

Furthermore, by introducing the divide-and-conquer idea of PL, our method is able to handle long-range haplotypes. The specially designed PL algorithm can help us to overcome the local-mode problem while maintaining efficiency. In conclusion, our method can be a powerful and efficient approach in genome-wide association studies.

Acknowledgements

We appreciate the insightful comments and helps from Dr Xing Hua and Dr Yifang Yang during preparation of this article.

Funding

This work has been supported by the National Science Foundation of China [NSFC 11671375 and 11271346].

Conflict of Interest: none declared.

References

Bilodeau
 
M.
,
Brenner
D.
(
2008
)
Theory of Multivariate Statistics
.
Springer
,
New York
.

Boyd
 
S.
,
Vandenberghe
L.
(
2004
)
Convex Optimization
.
Cambridge University Press, Cambridge
.

Browning
 
S.R.
,
Browning
B.L.
(
2011
)
Haplotype phasing: existing methods and new developments
.
Nat. Rev. Genet
.,
12
,
703
714
.

Clark
 
A.G.
(
1990
)
Inference of haplotypes from pcr-amplified samples of diploid populations
.
Mol. Biol. Evol
.,
7
,
111
122
.

Daly
 
M.J.
 et al.  (
2001
)
High-resolution haplotype structure in the human genome
.
Nat. Genet
.,
29
,
229
232
.

Delaneau
 
O.
 et al.  (
2008
)
Shape-IT: new rapid and accurate algorithm for haplotype inference
.
BMC Bioinformatics
,
9
,
540.

Delaneau
 
O.
 et al.  (
2011
)
A linear complexity phasing method for thousands of genomes
.
Nat. Methods
,
9
,
179
181
.

Delaneau
 
O.
 et al.  (
2013
)
Improved whole-chromosome phasing for disease and population genetic studies
.
Nat. Methods
,
10
,
5
6
.

Delaneau
 
O.
 et al.  (
2014
)
Integrating sequence and array data to create an improved 1000 genomes project haplotype reference panel
.
Nat. Commun
.,
5
,
3934.

Excoffier
 
L.
,
Slatkin
M.
(
1995
)
Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population
.
Mol. Biol. Evol
.,
12
,
921
927
.

Gusfield
 
D.
(
2001
)
Inference of haplotypes from samples of diploid populations: complexity and algorithms
.
J. Comput. Biol
.,
8
,
305
323
.

Gusfield
 
D.
(
2003
) Haplotype inference by pure parsimony. In:
Proceedings of the 14th Annual Conference on Combinatorial Pattern Matching, CPM’03
,
Springer-Verlag
,
Berlin, Heidelberg
, pp.
144
155
.

Gusfield
 
D.
,
Orzack
S.
(
2005
)
Haplotype Inference. CRC Handbook on Bioinformatics, Chapter 1
.
CRC Press
,
Boca Raton, USA
.

Hiriart-Urruty
 
J.-B.
,
Lemaréchal
C.
(
1993
)
Convex Analysis and Minimization Algorithms II
.
Springer
,
Berlin, Heidelberg
.

Howie
 
B.
 et al.  (
2011
)
Genotype imputation with thousands of genomes
.
G3
,
1
,
457
470
.

Jajamovich
 
G.H.
,
Wang
X.
(
2012
)
Maximum-parsimony haplotype inference based on sparse representations of genotypes
.
IEEE Trans. Sig. Process
.,
60
,
2013
2023
.

Kuk
 
A.Y.
 et al.  (
2009
)
Computationally feasible estimation of haplotype frequencies from pooled DNA with and without Hardy-Weinberg equilibrium
.
Bioinformatics
,
25
,
379
386
.

Lin
 
S.
 et al.  (
2002
)
Haplotype inference in random population samples
.
Am. J. Hum. Genet
.,
71
,
1129
1137
.

Liu
 
N.
 et al.  (
2008
)
Haplotype-association analysis
.
Adv. Genet
.,
60
,
335
405
.

Niu
 
T.
 et al.  (
2002
)
Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms
.
Am. J. Hum. Genet
.,
70
,
157
169
.

Patil
 
N.
 et al.  (
2001
)
Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21
.
Science
,
294
,
1719
1723
.

Qin
 
Z.S.
 et al.  (
2002
)
Partition-ligation–expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms
.
Am. J. Hum. Genet
.,
71
,
1242
1247
.

Recht
 
B.
 et al.  (
2010
)
Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization
.
SIAM Rev
.,
52
,
471
501
.

Rieder
 
M.J.
 et al.  (
1999
)
Sequence variation in the human angiotensin converting enzyme
.
Nat. Genet
.,
22
,
59.

Sabeti
 
P.C.
 et al.  (
2002
)
Detecting recent positive selection in the human genome from haplotype structure
.
Nature
,
419
,
832
837
.

Scheet
 
P.
,
Stephens
M.
(
2006
)
A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase
.
Am. J. Hum. Genet
.,
78
,
629
644
.

Sham
 
P.
 et al.  (
2002
)
Dna pooling: a tool for large-scale association studies
.
Nat. Rev. Genet
.,
3
,
862
871
.

Stephens
 
M.
,
Donnelly
P.
(
2003
)
A comparison of bayesian methods for haplotype reconstruction from population genotype data
.
Am. J. Hum. Genet
.,
73
,
1162
1169
.

Stephens
 
M.
,
Scheet
P.
(
2005
)
Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation
.
Am. J. Hum. Genet
.,
76
,
449
462
.

Stephens
 
M.
 et al.  (
2001
)
A new statistical method for haplotype reconstruction from population data
.
Am. J. Hum. Genet
.,
68
,
978
989
.

Xing
 
E.P.
 et al.  (
2007
)
Bayesian haplotype inference via the Dirichlet process
.
J. Comput. Biol
.,
14
,
267
284
.

Yang
 
Y.
 et al.  (
2003
)
Efficiency of single-nucleotide polymorphism haplotype estimation from pooled dna
.
Proc. Natl. Acad. Sci. USA
,
100
,
7225
7230
.

Zeng
 
D.
,
Lin
D.Y.
(
2005
)
Estimating haplotype-disease associations with pooled genotype data
.
Genet. Epidemiol
.,
28
,
70
82
.

Zhang
 
H.
 et al.  (
2008
)
PoooL: an efficient method for estimating haplotype frequencies from large DNA pools
.
Bioinformatics
,
24
,
1942
1948
.

Zhang
 
Y.
 et al.  (
2006
)
A coalescence-guided hierarchical bayesian method for haplotype inference
.
Am. J. Hum. Genet
.,
79
,
313
322
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Jonathan Wren
Jonathan Wren
Associate Editor
Search for other works by this author on:

Supplementary data