-
PDF
- Split View
-
Views
-
Cite
Cite
Heng Li, Improving SNP discovery by base alignment quality, Bioinformatics, Volume 27, Issue 8, April 2011, Pages 1157–1158, https://doi.org/10.1093/bioinformatics/btr076
Close - Share Icon Share
Abstract
Summary: I propose a new application of profile Hidden Markov Models in the area of SNP discovery from resequencing data, to greatly reduce false SNP calls caused by misalignments around insertions and deletions (indels). The central concept is per-Base Alignment Quality, which accurately measures the probability of a read base being wrongly aligned. The effectiveness of BAQ has been positively confirmed on large datasets by the 1000 Genomes Project analysis subgroup.
Availability: http://samtools.sourceforge.net
Contact: hengli@broadinstitute.org
1 INTRODUCTION
One of the leading sources of errors in SNP discovery is errors caused by indels (Li and Homer, 2010). Current solutions include realignment and filtering SNPs around predicted indels. However, realignment is computationally intensive; filtering SNPs around predicted indels is hampered by indel discovery which itself is a harder problem. This article aims to provide an effective and efficient solution to sorting out SNPs caused by misalignments.
To begin with, we need to make a distinction between read mapping and read alignment, which are often taken as synonymous. I define the alignment of a read as the set of coordinate pairs of read and reference bases that are placed together, while define the mapping of a read as the coordinate interval between the first and the last reference bases inclusive in the alignment. We say an alignment is correct if all bases are aligned correctly and say a mapping is correct if it overlaps the true mapping. Therefore, an alignment can be wrong even if the underlying mapping is correct.
Wrong alignments are mostly caused by the ambiguity in the presence of indels when we are unsure whether differences should be explained by mismatches or by indels. They tend to reoccur in the same context and deceive SNP callers into calling false SNPs.
To account for the intrinsic alignment ambiguity, I model the read alignment with a profile HMM and compute a per-Base Alignment Quality (BAQ) to directly evaluate the probability of misalignment of each base. I will show that by replacing the original base quality with the minimum between the base quality and BAQ we can dramatically improve the SNP accuracy.
2 METHODS
2.1 The profile HMM for computing BAQ
Let the nucleotide reference sequence be x = r1r2…rL (in practice x is the reference subsequence around a mapping) and the read sequence be y = c0c1…clcl+1 where c0 ≡ ‘ˆ’ marks the start of the read and cl+1 ≡‘$’ marks the end. Let ϵi, i = 1…l, be the substitution probability associated with ci, which in practice is set to be the maximum between the scale mutation rate and the sequencing error probability deduced from the base quality. We can construct a profile HMM to simulate how to generate the read sequence y from the reference x without considering introns (Fig. 1).

The topology of the profile HMM for BAQ computation. It consists of five types of states: alignment matches (M), insertions to the reference (I), deletions (D), alignment start (S) and alignment end (E). The S state points to every M and I state while every M and I points to E. States S and E are plotted together to avoid excessive dotted lines in the figure.


2.2 The forward and the backward algorithms






The probability of the read being generated from the reference equals fl+1,E = b0,S. In development, evaluating the equality helps to check the correctness of the implementation.
2.3 Computing BAQ


As Q(i|A) is mostly above 40, the logarithm scale is appropriate. In SNP calling, we update the i-th base quality to min{qi, Q(i|A)} where qi is the original base quality. The SNP calling algorithm may not need to be changed.
2.4 Numerical stability and banded acceleration

represents any state and the scaling factor si is chosen such that
. I refer to Durbin et al. (1998) and the source code for details on computing
,
and s.Another concern with the computation of BAQ is the quadratic time and space complexity O(L · l). This can be resolved by only computing the forward and the backward matrices within a band that is large enough to contain the likely alignments.
3 RESULTS AND DISCUSSIONS
BAQ is implemented in the SAMtools software package (Li et al., 2009), distributed under the MIT open source license.
I applied the method to the chromosome 20 alignment of 60 low-coverage pilot CEU samples from the 1000 Genomes Project (1000 Genomes Project Consortium, 2010), which is done in 12 CPU hours with 110 MB memory. Figure 2 compares the quality of the call sets with and without BAQ applied. Note that for human, the transition/transversion ratio (ts/tv) is above 2, while ts/tv of random errors is 0.5. Thus, a worse call set tends to have a lower ts/tv. The much higher ts/tv with BAQ applied indicates that BAQ has effectively suppressed many false SNPs.

Transition–transversion ratio (ts/tv) as a function of the number of SNP calls. SNPs are sorted by the posterior probability of the site being a SNP (SNP probability). Given a threshold on the SNP probability, the number of SNPs of higher probability and their ts/tv are plotted. For the solid line, filters in use are as follows: (i) total depth below 500; and (ii) root mean square mapping quality above 10; (iii) P-value of reference and non-reference bases being evenly distributed on both strands is above 10−4 (by exact test).
In conclusion, BAQ successfully resolves false SNPs caused by misalignment and improves the accuracy of SNP discovery. In addition to base quality and mapping quality (Li and Homer, 2010), BAQ is another useful statistic towards accurate SNP calling.
ACKNOWLEDGEMENTS
I am grateful to Gabor Marth and Hyun Min Kang whose works have inspired me to develop BAQ, to the 1000 Genomes Project analysis subgroup for the helpful discussions and to the three reviewers whose comments have helped me to improve the manuscript.
Funding: NIH 1000 Genomes Project (grant 1U01HG005208-01).
Conflict of Interest: none declared.
REFERENCES
Author notes
Associate Editor: John Quackenbush