-
PDF
- Split View
-
Views
-
Cite
Cite
Heng Li, A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data, Bioinformatics, Volume 27, Issue 21, November 2011, Pages 2987–2993, https://doi.org/10.1093/bioinformatics/btr509
- Share Icon Share
Abstract
Motivation: Most existing methods for DNA sequence analysis rely on accurate sequences or genotypes. However, in applications of the next-generation sequencing (NGS), accurate genotypes may not be easily obtained (e.g. multi-sample low-coverage sequencing or somatic mutation discovery). These applications press for the development of new methods for analyzing sequence data with uncertainty.
Results: We present a statistical framework for calling SNPs, discovering somatic mutations, inferring population genetical parameters and performing association tests directly based on sequencing data without explicit genotyping or linkage-based imputation. On real data, we demonstrate that our method achieves comparable accuracy to alternative methods for estimating site allele count, for inferring allele frequency spectrum and for association mapping. We also highlight the necessity of using symmetric datasets for finding somatic mutations and confirm that for discovering rare events, mismapping is frequently the leading source of errors.
Availability: http://samtools.sourceforge.net
Contact: [email protected]
1 INTRODUCTION
The 1000 Genomes Project (1000 Genomes Project Consortium, 2010) sets an excellent example on how to design a sequencing project to get the maximum output pertinent to human populations. An important lesson from this project is to sequence many human samples at relatively low coverage instead of a few samples at high coverage. We adopt this strategy because with higher coverage, we will mostly reconfirm information from other reads, but with more samples, we will be able to reduce the sampling fluctuations, gain power on variants present in multiple samples and get access to many more rare variants. On the other hand, sequencing errors counteract the power in variant calling, which necessitates a minimum coverage. The optimal balancing point is broadly regarded to be in the 2–6 fold range per sample (Le and Durbin, 2010; Li et al., 2011), depending on the sequencing error rate, level of linkage disequilibrium (LD) and the purpose of the project.
A major concern with this design is that at 2–6 fold coverage per sample, non-reference alleles may not always be covered by sequence reads, especially at heterozygous loci. Calling variants from each individual and then combining the calls usually yield poor results. The preferred strategy is to enhance the power of variant discovery by jointly considering all samples (Depristo et al., 2011; Le and Durbin, 2010; Li et al., 2011; Nielsen et al., 2011). This strategy largely solves the variant discovery problem, but acquiring accurate genotypes for each individual remains unsolved. Without accurate genotypes, most of the previous methods [e.g. testing Hardy–Weinberg equilibrium (HWE) and association mapping] would not work.
To reuse the rich methods developed for genotyping data, the 1000 Genomes Project proposes to impute genotypes utilizing LD across loci (Browning and Yu, 2009; Howie et al., 2009; Li et al., 2009b, 2010a). Suppose at a site A, one sample has a low coverage. If some samples at A have high coverage and there exists a site B that is linked with A and has sufficient sequence support, we can transfer information across sites and between individuals, and thus make a reliable inference for the low-coverage sample at A. The overall genotype accuracy can be greatly improved.
However, imputation is not without potential concerns. First, imputation cannot be used to infer the regional allele frequency spectrum (AFS) because imputation as of now can only be applied to candidate variant sites, while we need to consider non-variants to infer AFS. Second, the effectiveness of imputation depends on the pattern of LD, which may lead to potential bias in population genetical inferences. Third, the current imputation algorithms are slow. For a thousand samples, the fastest algorithm may be slower than read mapping algorithms, which is frequently the bottleneck of analyzing NGS data (H.M.Kang, personal communication). Considering more samples and using more accurate algorithms will make imputation even slower.
These potential concerns make us reconsider if imputation is always preferred. We notice that we perform imputation mainly to reuse the methods developed for genotyping data, but would it be possible to derive new methods to solve classical medical and population genetical problems without precise genotypes?
Another application of NGS that requires genotype data is to discover somatic mutations or germline mutations between a few related samples (Conrad et al., 2011; Ley et al., 2008; Mardis et al., 2009; Pleasance et al., 2010a, b; Roach et al., 2010; Shah et al., 2009). For such an application, samples are often sequenced to high coverage. Although it is not hard to achieve an error rate one per 100 000 bases (Bentley et al., 2008), mutations occur at a much lower rate, typically of the order of 10−6 or even 10−7. Naively calling genotypes and then comparing samples frequently would not work well (Ajay et al., 2011), because subtle uncertainty in genotypes may lead to a bulk of errors. From another angle, however, when discovering rare mutations, we only care about the difference between samples. Genotypes are just a way of measuring the difference. Is it really necessary to go through the genotype calling step?
This article explores the answer to these questions. We will show in the following how to compute various statistics directly from sequencing data without knowing genotypes. We will also evaluate our methods on real data.
2 METHODS
This section presents the precise equations on how to infer various statistics such as the genotype frequency and AFS, and to perform various statistical test such as testing HWE and associations. Some of these equations have already been described in the existing literature, but for theoretical completeness, we give the equations using our notations. The last subsection reviews the existing methods and summarizes the differences between them, as well as between ours and the existing formulation.
In this section, we suppose there are n individuals with the i-th individual having mi ploidy. At a site, the sequence data for the i-th individual is represented as di and the genotype is gi which is an integer in [0, mi], equal to the number of reference alleles in the individual. Table 1 gives notations common across this section. The detailed derivation of the equations in this article is presented in an online document (http://bit.ly/stmath).
Symbol . | Description . |
---|---|
n | Number of samples |
m i | Ploidy of the i-th sample (1≤i≤n) |
M | Total number of chromosomes in samples: M=∑imi |
d i | Sequencing data (bases and qualities) for the i-th sample |
g i | Genotype (the number of reference alleles) of the i-th sample (0≤gi≤mi)a |
ϕk | Probability of observing k reference alleles (∑k=0Mϕk=1) |
Pr{A} | Probability of an event A |
![]() | Likelihood function for the i-th sample: ![]() |
Symbol . | Description . |
---|---|
n | Number of samples |
m i | Ploidy of the i-th sample (1≤i≤n) |
M | Total number of chromosomes in samples: M=∑imi |
d i | Sequencing data (bases and qualities) for the i-th sample |
g i | Genotype (the number of reference alleles) of the i-th sample (0≤gi≤mi)a |
ϕk | Probability of observing k reference alleles (∑k=0Mϕk=1) |
Pr{A} | Probability of an event A |
![]() | Likelihood function for the i-th sample: ![]() |
aIn this article, we only consider biallelic variants.
Symbol . | Description . |
---|---|
n | Number of samples |
m i | Ploidy of the i-th sample (1≤i≤n) |
M | Total number of chromosomes in samples: M=∑imi |
d i | Sequencing data (bases and qualities) for the i-th sample |
g i | Genotype (the number of reference alleles) of the i-th sample (0≤gi≤mi)a |
ϕk | Probability of observing k reference alleles (∑k=0Mϕk=1) |
Pr{A} | Probability of an event A |
![]() | Likelihood function for the i-th sample: ![]() |
Symbol . | Description . |
---|---|
n | Number of samples |
m i | Ploidy of the i-th sample (1≤i≤n) |
M | Total number of chromosomes in samples: M=∑imi |
d i | Sequencing data (bases and qualities) for the i-th sample |
g i | Genotype (the number of reference alleles) of the i-th sample (0≤gi≤mi)a |
ϕk | Probability of observing k reference alleles (∑k=0Mϕk=1) |
Pr{A} | Probability of an event A |
![]() | Likelihood function for the i-th sample: ![]() |
aIn this article, we only consider biallelic variants.
2.1 Assumptions
2.1.1 Site independency
We assume data at different sites are independent. This may not be true in real data, because sequencing and mapping are context dependent; when there is an insertion or deletion (INDEL) error or INDEL polymorphism, sites nearby are also correlated in alignment. Nonetheless, most of the existing methods make this assumption for simplicity. The effect of site dependency may also be reduced by post-filtering and properly modeling the mapping and alignment errors (Li et al., 2008; Li, 2011).
2.1.2 Error independency and sample independency

In real data, errors may be dependent of sequence context (Nakamura et al., 2011). The independency assumption may not hold. It is possible to model error dependency within an individual (Li et al., 2008), but the sample independency assumption is essential to all the derivations below.
2.1.3 Biallelic variants
We assume all variants are biallelic. In the human population, the fraction of triallelic SNPs is ~0.2% (Hodgkinson and Eyre-Walker, 2010). The biallele assumption does not have a big impact to the modeling of SNPs, though it may have a bigger impact to the modeling of INDELs at microsatellites.
2.2 Computing genotype likelihoods
For one sample at a site, the sequencing data d is composed of an array of bases on sequencing reads plus their base qualities. As we only consider biallelic variants, we may focus on the two most evident types of nucleotides and drop the less evident types if present. Thus, at any site we see at most two types of nucleotides. This treatment is not optimal, but sufficient in practice.

2.3 Inferences from multiple samples
2.3.1 Estimating the site allele frequency




When the signal from the data is strong, or equivalently for each i, one of i(g) is much larger than others, the EM algorithm converges faster than the direct numerical solution using Brent's method. However, when the signal from the data is weak, numerical method may converge faster than EM (Kim et al., 2011). In implementation, we apply 10 rounds of EM iterations. If the estimate does not converge after 10 rounds, we switch to Brent's method.
2.3.2 Estimating the genotype frequencies









2.3.3 Estimating haplotype frequencies between loci







When sample genotypes are all certain, this EM iteration is reduced to the standard EM for estimating haplotype frequencies using genotype data (Excoffier and Slatkin, 1995).
The time complexity of computing Equation (10) is O(n · 4k) and thus it is impractical to estimate the haplotype frequency for many loci jointly. A typical use of Equation (10) is to measure LD between two loci.
2.3.4 Testing associations






We have not found a powerful test statistic robust to HWE violation. For practical applications, we propose to take the P-value computed with Da1, while filtering candidates having a low Da2 to reduce false positives caused by HWE violation (see Section 3).
2.3.5 Estimating the number of non-reference alleles
In this section, we use the term site reference allele count to refer to the number of reference alleles at one single site. Allele count is a discrete number while allele frequency is contiguous.









Although the computation of the likelihood function (k) is more complex than of
(ψ),
(k) is discrete, which is more convenient to maximize or sum over. This likelihood function establishes the foundation of the Bayesian inference.
2.3.6 Numerical stability of the allele count estimation






As another implementation note, most yjl are close to zero and thus ynk can be computed in a band rather than in a triangle. This may dramatically speed up the computation of the likelihood.
2.3.7 Calling variants




2.3.8 Estimating the sample AFS
For variant calling [Equation (20)], we typically take the Wright–Fisher AFS as the prior. We can also estimate the sample AFS with the maximum-likelihood inference when the Wright-Fisher prior deviates from the data.

We call this method of estimating AFS as EM-AFS. Alternatively, we may also acquire the max-likelihood estimate of the allele count at each site using Equation (16). The normalized histogram of these counts gives the AFS. We call this method as site-AFS. We will compare the two methods in Section 3.
2.4 Discovering somatic and germline mutations














This equation has an intuitive interpretation: we are certain about a candidate somatic mutation only if both genotypes in both samples are clearly better than other possible genotypes.





Although most of the derivation in this article assumes that variants are biallelic, we drop this assumption in the implementation for methods described in this subsection. We have observed false somatic/germline mutations caused by the mismodeling of triallelic variants (M.Depristo, personal communication). The biallelic assumption may lead to false positives.
2.5 Related works
During SNP calling, Thunder (Li et al., 2011) and glfMultiples (http://bit.ly/glfmulti) compute the site allele frequency by numerically maximizing the likelihood [Equation (2)]. Genome Analysis Toolkit (GATK; Depristo et al., 2011) infers the frequency with EM [Equation (5)]. Kim et al. (2011) infers the frequency with both the numerical and the EM algorithms. Li et al. (2010b) derived an alternative method to estimate the site allele frequency, which is not covered in this article. SeqEM (Martin et al., 2010) estimates the genotype frequency using EM [Equation (7)] with a different parameterization. Le and Durbin (2010) derived Equation (16). The conclusion is correct, but the derivation is not rigorous: the binomial coefficient in Equation (13) was left out. Yi et al. (2010) came to a similar set of equations to Equations (15) and (20), but the prior is taken from the estimated site allele frequency. To the best of our knowledge, Kim et al. (2010) is the first to use genotype likelihood-based LRT to compute P-value of associations [Equation (11)] with more thorough evaluation in a recent paper (Kim et al., 2011). Nielsen et al. (2011) further proposed to test associations with a score test (Schaid et al., 2002). Except Kim et al. (2010), all the previous works focus on diploid samples, while many equations in this article can be in theory applied to multiploidy samples and pooled samples.
In this article, our contribution includes testing HWE, estimating haplotype frequency, the proposal of two-degree association test, a simple but effective model for discovering somatic mutations, the rigorous derivation and numerically stable implementation of a discrete allele count estimator and an EM algorithm for inferring AFS.
3 RESULTS
3.1 Implementation
Most of equations for diploid samples (m=2) have been implemented in the SAMtools software package (Li et al., 2009a), which is distributed under the MIT open source license, free to both academic and commercial uses. The exact Equations (17)–(19) have also been implemented in GATK as the default SNP calling model.
The SAMtools package consists of two key components samtools and bcftools. The former computes the genotype likelihood (g) using an improved version of Equation (2) that considers error dependencies; the latter component calls variants and infers various statistics described in this article. To clearly separate the two steps, we designed a new Binary variant call format (BCF), which is the binary representation of the variant call format (VCF; Danecek et al., 2011) and is more compact and much faster to process than VCF. On real data, computing genotype likelihoods especially for INDELs is typically 10 times slower than variant calling. The separation of genotype likelihood computation and subsequent inferences enhances the flexibility and improves the efficiency for inferring AFS. Bcftools also directly works with VCF files, but is less efficient than with BCF files.
Table 2 shows how VCF information tags generated by SAMtools are related to the equations in this article. We refer to the SAMtools manual page for detailed description.
INFOa . | Equationb . | Description . |
---|---|---|
AF1 | 3, 5 | Non-reference site allele frequency |
G3 | 7 | Diploid genotype frequency |
HWE | 8 | P-value of Hardy–Weinberg equilibrium |
NEIR | 10 | Neighboring r2 linkage disequilibrium statistic |
LRT | 11 | One-degree association test P-value |
LRT2 | 12 | Two-degree association test P-value |
AC1 | 17, 18, 19 | Non-reference site allele count |
FQ | 20 | Probability of the site being polymorphic among samples |
CLR | 22, 23 | Log likelihood ratio score for de novo mutations |
INFOa . | Equationb . | Description . |
---|---|---|
AF1 | 3, 5 | Non-reference site allele frequency |
G3 | 7 | Diploid genotype frequency |
HWE | 8 | P-value of Hardy–Weinberg equilibrium |
NEIR | 10 | Neighboring r2 linkage disequilibrium statistic |
LRT | 11 | One-degree association test P-value |
LRT2 | 12 | Two-degree association test P-value |
AC1 | 17, 18, 19 | Non-reference site allele count |
FQ | 20 | Probability of the site being polymorphic among samples |
CLR | 22, 23 | Log likelihood ratio score for de novo mutations |
aTag at the VCF additional information field (INFO).
bRelated, though not exact, equations for computing the values.
INFOa . | Equationb . | Description . |
---|---|---|
AF1 | 3, 5 | Non-reference site allele frequency |
G3 | 7 | Diploid genotype frequency |
HWE | 8 | P-value of Hardy–Weinberg equilibrium |
NEIR | 10 | Neighboring r2 linkage disequilibrium statistic |
LRT | 11 | One-degree association test P-value |
LRT2 | 12 | Two-degree association test P-value |
AC1 | 17, 18, 19 | Non-reference site allele count |
FQ | 20 | Probability of the site being polymorphic among samples |
CLR | 22, 23 | Log likelihood ratio score for de novo mutations |
INFOa . | Equationb . | Description . |
---|---|---|
AF1 | 3, 5 | Non-reference site allele frequency |
G3 | 7 | Diploid genotype frequency |
HWE | 8 | P-value of Hardy–Weinberg equilibrium |
NEIR | 10 | Neighboring r2 linkage disequilibrium statistic |
LRT | 11 | One-degree association test P-value |
LRT2 | 12 | Two-degree association test P-value |
AC1 | 17, 18, 19 | Non-reference site allele count |
FQ | 20 | Probability of the site being polymorphic among samples |
CLR | 22, 23 | Log likelihood ratio score for de novo mutations |
aTag at the VCF additional information field (INFO).
bRelated, though not exact, equations for computing the values.
3.2 Inferring the allele count
We downloaded the chromosome 20 alignments of 49 Pilot-1 CEU samples sequenced by the 1000 Genomes Project using the Illumina technology only. We called the SNPs with SAMtools and imputed the genotypes with Beagle under the default settings. At 32 522 sites genotyped using the Omni genotyping chip and polymorphic in the 49 samples, the root-mean-square deviation (RMSD) between the allele count acquired from Omni genotypes and the estimate using Equation (16) equals 3.7, the same as the RMSD between the Omni and the Beagle-imputed genotypes. Not surprisingly, imputed genotypes are more accurate when there is a tightly linked SNP nearby, while the imputation-free estimate is less affected (Fig. 1).

Correlation of the site allele count accuracy with LD. The site allele count is estimated with Beagle imputation (solid line) and with Equation (16) (dashed line) at sites typed by the Omni genotyping chip. For each Omni SNP, the maximum r2 LD statistic between the SNP and 20 nearby SNPs called by SAMtools (10 upstream and 10 downstream) is computed from imputed genotypes. Omni SNPs are then ordered by the maximum r2 and approximately evenly divided into 15 bins. For each bin, the RMSD between the Omni allele count and the estimated allele count is computed as a measurement of the allele count accuracy.
However, on the unreleased European data from the 1000 Genomes Project consisting of 670 samples, Beagle imputation is better than our imputation-free method [RMSD(imput) = 12.7; RMSD (imput-free) = 15.0]. We conjure that this is because with more samples, it is more frequent for two samples to share a long haplotype. The LD plays a more important role in counteracting the lack of coverage. Nonetheless, we should beware that sites selected on the Omni genotyping chip may not be a good representative of all SNPs. For example, for the sites on the Omni chip, only 8% of SNPs do not have a nearby SNP with r2>0.05 in a 20-SNP window (the ‘nearby SNPs’ include all SNPs discovered in the 670 samples), but this percentage is increased to 30% for all SNPs. The large fraction of unlinked SNPs might hurt the accuracy of imputation-based methods.
We have also evaluated our method on an unpublished target reqsequencing dataset consisting of ~2000 samples (Haiman,C. and Henderson,B., personal communication). The imputation-based method does not perform well [RMSD(imput) = 54.8; RMSD(imput-free) = 42.5], probably due to the lack of linked SNPs around fragmented target regions.
3.3 Inferring the AFS
To evaluate the accuracy of the estimated AFS, we compared the AFS obtained from the low-coverage data produced by the 1000 Genomes Project and from the high-coverage data released by Complete Genomics (http://bit.ly/m7LzvF). Figure 2 reveals that we can infer a fairly accurate AFS using the EM-AFS method with 3-fold coverage per sample. On the other hand, the site-AFS estimate is less stable, though the overall trend looks right. To estimate properties across multiple sites, summing over the posterior distribution using EM-AFS is more appropriate.

The derived AFS conditional on heterozygotes discovered in the NA18507 genome (Bentley et al., 2008; AC:SRA000271). Heterozygotes were called with SAMtools on BWA (Li and Durbin, 2009) alignment. The ancestral sequences were determined from the Ensembl EPO alignment (Paten et al., 2008), with the requirement of the chimpanzee and orangutan sequences being identical. The AFS at these heterozygotes were computed in three ways: (i) from the nine independent Yoruba individuals sequenced by Complete Genomics (Drmanac et al., 2010) and analyzed using CGA Tools version 1.10.0; (ii) from nine random Pilot-1 Yoruba individuals released by the 1000 Genomes Project using the EM-AFS method and (iii) from the same 9 Pilot-1 individuals using site-AFS.
3.4 Performing association test
To evaluate the performance of the association test statistics Da1 [Equation (11)], we constructed a perfect negative control using the 1000 Genomes data and derived the empirical distribution of Da1. We expect to see no associations. Figure 3 shows that Da1 largely follows the one-degree χ2 distribution. However, this method also produces one false positive SNP (P<10−6). Closer investigation reveals that the SNP significantly violates HWE [P<10−6, computed with Equation (8)], and thus violates the assumption behind the derivation of Da1. In fact, if we test the association with Da2 which does not assume HWE, the false positive will be suppressed (P>0.001). To test association using the one-degree likelihood-ratio test statistic, it is important to control HWE.
![QQ-plot comparing the association test statistics to the one-degree and the two-degree χ2 distribution. The 49 CEU samples sequenced by the 1000 Genomes Project using the Illumina technology were randomly assigned to two groups of size 24 and 25, respectively. (A) two association test statistics were computed on chromosome 20 between the two groups: one by the one-degree likelihood ratio test [Equation (11)] and the other by the canonical one-degree χ2 test based on Beagle imputed genotypes; (B) the two-degree likelihood rate test statistic [Equation (12)].](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/bioinformatics/27/21/10.1093_bioinformatics_btr509/1/m_bioinformatics_27_21_2987_f3.jpeg?Expires=1744735093&Signature=eWYJ3PC4CHQZ-d4~O2~wEgPfA~JCPeNhAmGhCintgLvaFIghN9qyi6rTBiuABg4olDqPIjn4t~zUdnOCUDX0TLpSU0jgvMoNKfd6fB-D-XKX8lN7z7ouLiWdk9FYG0AQoobtVEXQ7t3TKWBFYrIsoRb8p60QuBvK7AZeQzEjHfn5qsgXp7TlM2lRBl0wjcmhB4HkgQ8uzMvdemEmgCEewbsBDmyxXue940gTWf1DW1EMcUbGRd~hdlShA5pot0GMNEkWms9VurbC0a~iF8FzGLlA8YaKjFpPdr6JyoLYtqoKfaI7gxK-u3R0rGpR6A6fQcKV33IDp1vWG8xCO5IFGA__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
QQ-plot comparing the association test statistics to the one-degree and the two-degree χ2 distribution. The 49 CEU samples sequenced by the 1000 Genomes Project using the Illumina technology were randomly assigned to two groups of size 24 and 25, respectively. (A) two association test statistics were computed on chromosome 20 between the two groups: one by the one-degree likelihood ratio test [Equation (11)] and the other by the canonical one-degree χ2 test based on Beagle imputed genotypes; (B) the two-degree likelihood rate test statistic [Equation (12)].
3.5 Comparing sequencing data from the same individual
3.5.1 Comparing datasets of similar characteristics
We acquired the NA12878 data used by Depristo et al. (2011). This sample was sequenced with HiSeq2000 using two libraries with each put on eight lanes and each sequenced to about 30-fold coverage. We split the data in two by library and computed Dp [Equation (22)] at each base on chromosome 20 to identify sites that are called differently between the two libraries. With a stringent threshold Dp≥30 and without any filtering, 32 differences are called between the libraries and most from the centromere. Since the libraries were made from the same DNA at almost the same time, we expect to see no difference between the libraries. Seeing 32 differences is very unlikely. To explore if this is due to mismapping, we extracted reads around the 32 sites and remapped them with BWA-SW (Li and Durbin, 2010). Four differences remain around the centromere, implying that most of the differences between libraries are caused by the variation in read mapping. We further mapped the reads around the four sites to a version of the human reference genome used by the 1000 Genomes Project for phase-2 mapping (http://bit.ly/GRCh37d). No differences are left. This exercise reveals that when we come to very rare events, mapping errors, instead of sequencing errors, lead to most of the artifacts.
3.5.2 Comparing datasets of different characteristics
We also did a harder version of the exercise above: comparing this 60-fold HiSeq data to the old Illumina data for the same individual obtained >2 years ago by the 1000 Genomes Project. We note that although DNA used in the two datasets was originated from the same individual, somatic mutations in cell lines, which are of the order of 1000 per diploid genome (Conrad et al., 2011), may be present. If the cell lines used in two studies have greatly diverged, we might see up to a dozen somatic mutations on chromosome 20.
This time with a threshold Dp≥30 and a maximum depth filter 150, we identified 667 single-base differences between the two datasets, far more than our expectation. Again we sought to reduce mapping errors by remapping reads with BWA-SW to the 1000 Genomes Project phase-2 reference genome. The number of differences between the HiSeq and the old Illumina data quickly drops to 33. If we further filter out clustered SNPs using a 100 bp window, 13 potential differences are left, 2 of the initial candidates. This exercise again proves that mismapping is the leading source of errors.
To see if the simple likelihood ratio [Equation (22)] is comparable to more sophisticated methods, we briefly tried SomaticSniper (Larson et al., in press) on our data. With a somatic score cutoff 65, which is about 30 in the ‘2log’ scale as in Dp, SomaticSniper identified 1826 differences. SAMtools called fewer, because it limits the mapping quality of reads with excessive mismatches and applies base alignment quality (Li, 2011) to fix alignment errors around INDELs. With the two features switched off, SAMtools called 1696 differences, half of which overlap the differences found by SomaticSniper. Calls unique to one method tend to have a mutation score close to the threshold.
4 DISCUSSIONS
We have proposed a statistical framework for SNP calling as well as analyzing sequencing data but without explicitly calling SNPs or their genotypes. With this framework, we can discover somatic and germline mutations with appropriate input data, efficiently estimate site allele frequency, allele frequency spectrum and linkage disequilibrium, and test Hardy–Weinberg equilibrium and association. On real data, we have demonstrated that our method is able to achieve comparable accuracy to the best alternative methods. We have also extensively evaluated the performance of our method on several unpublished datasets and got sensible results. Thus, we conclude that useful information can be obtained directly from sequencing data without SNP calling or imputation.
Here we also want to emphasize a few findings in our evaluation of the methods. First, we confirmed that imputation is a viable method for transferring our knowledges on genotyping data to low-coverage sequencing data. It is likely to have higher accuracy than our method given homogeneous whole-genome data consisting of many samples. Nonetheless, we showed that the accuracy of imputation depends on the LD nearby, which has long been speculated but without direct evidence from real data until our work. Second, our proposed EM-AFS method is able to accurately estimate AFS from low-coverage sequencing data. It is more appropriate than estimating the site frequency separately and then doing a histogram. Third, we observed that violation of HWE may cause false positives in association mapping with the one-degree likelihood ratio test (Kim et al., 2011). A two-degree likelihood ratio test is a conservative way to avoid such an artifact. At last, we highlighted the importance of using data of similar characteristics in the discovery of somatic mutations. We also want to put a particular emphasis on the necessity of controlling mapping errors when looking for very rare events such as somatic mutations, germline mutations and RNA editing. It may be necessary to use two distinct mapping algorithms to call variants and then take the intersection.
Frequently, we require to know the exact DNA sequences or genotypes only to estimate parameters or compute statistics. In these cases, the sequences and genotypes are just intermediate results. When the sequence itself is uncertain, mostly due to the uncertainty in sequencing and mapping, it may sometimes be preferred to directly work with the uncertain sequence, which may carry more information than an arbitrarily ascertained sequence. We have showed that many population genetical parameters and statistical tests can be adapted to work on uncertain sequences, and believe more existing methods can be adapted in a similar manner. Knowing the exact sequence is convenient, but not always indispensable.
ACKNOWLEDGEMENTS
We are grateful to Christopher Haiman for providing the unpublished dataset for assessing the performance, to Petr Danecek for evaluating the methods in this article on large-scale datasets, to Rasmus Nielsen for the observation on the occasional slow convergence of the EM algorithm and to Si Quang Le and Richard Durbin for the help on understanding the QCall model. We also thank the 1000 Genomes Project analysis subgroup and the GSA team at Broad Institute for various helpful discussions, and thank all the SAMtools users for evaluating the software package.
Funding: National Institutes of Health (1U01HG005208-01).
Conflict of Interest: none declared.
REFERENCES
Author notes
Associate Editor: Jeffrey Barrett