Abstract

Summary: Hidden Markov models are widely applied within computational biology. The large data sets and complex models involved demand optimized implementations, while efficient exploration of model space requires rapid prototyping. These requirements are not met by existing solutions, and hand-coding is time-consuming and error-prone. Here, I present a compiler that takes over the mechanical process of implementing HMM algorithms, by translating high-level XML descriptions into efficient C++ implementations. The compiler is highly customizable, produces efficient and bug-free code, and includes several optimizations.

Availability: http://genserv.anat.ox.ac.uk/software

Contact: gerton.lunter@dpag.ox.ac.uk

1 INTRODUCTION

Hidden Markov models (HMMs) are very suitable for sequence analysis, and have found wide application in computational biology. These applications include gene finding (Burge and Karlin, 1997), motif discovery and searching (Bailey and Elkan, 1995; Baldi et al., 1994; Bateman et al., 2002; Eddy, 1998), identification of CpG islands (Durbin et al., 1998), genotyping (Scheet and Stephens, 2006), detection of recombination events (Hobolth et al., 2007), sequence alignment (Holmes, 2003) and many more.

Applications in computational biology typically involve large data sets, making high-level toolkits such as R or MatLab less suitable. Another common approach, of which GHMM (Schliep et al., 2003) is but one example, is to use a library implementation which operates on HMMs defined by some data structure. In practice, either the flexibility or the efficiency of such libraries often falls short of practical requirements, especially for pair- and higher-dimensional HMMs. For these reasons, researchers often resort to hand-coding their algorithms. While straightforward, hand-coding remains tedious and error-prone, particularly when models become more complex and optimized code is required.

Code generation could, in principle, combine flexibility, usability and efficiency. An early effort in this direction is the PROLOG-based language GenLang, which is capable of handling context-free grammars as well as HMMs (Searls, 1993). Another approach was taken by the dynamic-programming code generator Dynamite (Birney and Durbin, 1997), which produces highly efficient C code for Viterbi-like algorithms, at the cost of some flexibility. Dynamite later developed into the gene annotation toolset Exonerate, which is based on the code generator C4 (Slater and Birney, 2005).

These code generators, being somewhat tied to their original application domain, lack direct support for probabilistic algorithms. Here I present an HMM compiler, HMMoC, that aims to fill this gap. HMMoC is both efficient and flexible, and provides support for all standard HMM algorithms. The input to the compiler consists of a succinct XML file that defines the topology of the HMM. The probabilities associated with transitions and emissions are defined in the same file, using arbitrarily parameterized C code fragments. From these, the compiler produces C++ header and source files that implement the required HMM algorithms.

2 OVERVIEW

2.1 Supported algorithms

The compiler supports the Forward and Backward algorithms, which compute posterior probabilities conditional on one or more input sequences. These algorithms can be configured to return the dynamic programming (DP) table if desired. Baum–Welch parameter estimation is also implemented, when required, as part of either the Forward or Backward algorithm. Parameter estimates are computed in two steps: first a standard Forward DP table is computed, followed by a Baum–Welch backward iteration (or vice versa). For efficiency, either or both of the transition and emission parameters can be computed.

The Viterbi algorithm is implemented as two separate procedures, which compute the Viterbi table (and likelihood) and the most likely path, respectively. Finally, a sampling algorithm may be generated, which samples paths from the posterior distribution. Unconditional sampling from the HMM is not directly supported, but can be implemented by defining a ‘silent’ HMM producing no output, and computing a conditional sample from that.

2.2 Supported HMM architectures

HMMoC supports HMMs with any number of outputs, including pair HMMs, triple HMMs and phylogenetic HMMs. Emissions, from a user-defined alphabet, may be associated with states (‘Moore machines’) or with transitions (‘Mealy machines’), and mixtures are also possible. Higher-order Markov chains (where probabilities depend on a limited number of previously emitted symbols) are also supported, and the order of states may vary over the network. HMMoC includes support for inhomogeneous Markov chains, allowing probabilities to depend on the position within the sequence. This can be used to implement, e.g. inference algorithms for continuous-time Markov chains with variable intervals between measurements, or may be used to incorporate prior knowledge. Finally, HMMoC supports any number of silent states, and the required transition probabilities are computed by a matrix inversion step (‘wing retraction’; Eddy, 1998) whenever necessary.

2.3 Extended-exponent real-number type

Underflows are a frequent problem for HMM implementations operating on large data sets. Working in log space only partially solves this problem, since most algorithms use addition as well as multiplication, necessitating slow conversions to and from log space. HMMoC provides an 8-byte extended-exponent real type, termed BFloats (for ‘buoyant floats’), which combines the precision of a float with an essentially unbounded exponent. An efficient template library implementation provides fast in-line code, resulting in comparable running times for algorithms using doubles and BFloats. In addition, a logspace type is provided, which is slightly more efficient than BFloats when used in the Viterbi algorithm.

2.4 Efficiency and optimizations

For efficiency, states can be grouped into cliques, over which the HMM induces an acyclic graph. This structure is used by HMMoC to efficiently allocate memory for the DP table. For instance, the start and end states may always form their own clique, and need not be represented in the main DP table. Some algorithm-specific optimizations are provided; for instance, when no DP table is required as output from a Forward or Backward iteration, a lower-dimensional DP table is allocated.

Several time optimizations are implemented as well. HMMoC interprets the emission patterns to determine the range of coordinates at which each state may be visited, and uses this information to restrict computation to reachable states. In the main loops, HMMoC chooses an order of computation that minimizes the number of accesses to DP table entries, resulting in considerable improvements, particularly for sparse DP tables where such accesses are relatively expensive (see below). In addition, constant transition probabilities are pre-computed outside the main loop whenever possible. Finally, all loops over transitions and states are unrolled, reducing the number of run-time decisions and lookups. As a result of these optimizations, HMMoC produces code approaching the efficiency of the hand-optimized HMMER package (Eddy, 1998); see Figure 1.

Fig. 1.

Execution times for performing a database search using HMM profiles of the TK and ZnF_C2HC domains (38 and 56 states, respectively), in a database of 4.84 million amino acids on a desktop computer (2.33 GHz E5345 Xeon, 4 MB cache). For this comparison, HMMER and HMMoC performed a Viterbi and a Forward recursion. The HMMoC-generated algorithm uses logspace real numbers for the Viterbi recursion, and BFloats for the Forward recursion.

2.5 Banding

Banding refers to the technique of traversing only a user-defined portion of a DP table. When used correctly, this can result in dramatic improvements in time and memory usage, with small to negligible loss of accuracy. HMMoC allows the user to specify an arbitrary iterator to traverse the DP table. When banding is used, HMMoC uses a sparse DP table, backed by a hash map storing only entries that are visited. All DP tables provide accessor functions to hide implementation details from the user.

2.6 Robustness

HMMoC provides several consistency checks, both at compile and at run time. For instance, HMMoC checks the consistency of state orders, by insisting that an order can increase, by one, only following an emission; the induced graph on cliques is also checked for cycles. At run time, warnings are issued when an iterator follows an inconsistent or out-of-bounds path, or when negative probabilities are encountered. Care has been taken to produce clean and relatively readable C++ code which should compile without warnings, allowing potential problems to be easily identified.

2.7 Examples and documentation

The package includes four examples demonstrating the use of HMMoC: the classic ‘dishonest casino’ HMM (Durbin et al., 1998), a CpG-island finder, a probabilistic global pairwise aligner with Baum–Welch parameter optimization, and a conversion script for HMMER profile HMMs that was used to produce the test results reported above. Efforts are underway to develop a graphical programming interface that abstracts the XML layer to simplify model design (Dombai and Miklós, personal communication); a prototype web implementation is available at http://dbalazs.web.elte.hu/Phyl4/.

ACKNOWLEDGEMENTS

HMMoC's design was inspired by Ian Holmes’ Telegraph, an XML-based language to define HMMs. Helpful discussions with Jotun Hein and István Miklós are gratefully acknowledged.

Conflict of Interest: none declared.

REFERENCES

Bailey,T.L. and Elkan,C. (1995) The value of prior knowledge in discovering motifs with MEME. Proc. Int. Conf. Intell. Syst. Mol. Biol., 3, 21–29.

Baldi,P. et al. (1994) Hidden Markov models of biological primary sequence information. Proc. Natl Acad. Sci. USA, 91, 1059–1063.

Bateman,A. et al. (2002) The Pfam protein families database. Nucleic Acids Res., 30, 276–280.

Birney,E. and Durbin,R. (1997) Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison. In Proceedings of the Fifth International Conference on Intelligent Systems in Molecular Biology. AAAI Press, Menlo Park, pp. 56–64.

Burge,C. and Karlin,S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268, 78–94.

Durbin,R. et al. (1998) Biological Sequence Analysis. Cambridge University Press, Cambridge.

Eddy,S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763.

Hobolth,A. et al. (2007) Genomic relationships and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden Markov model. PLoS Genet., 3, e7.

Holmes,I. (2003) Using guide trees to construct multiple-sequence evolutionary HMMs. Bioinformatics, 19 (Suppl. 1), i147–i157.

Scheet,P. and Stephens,M. (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet., 78, 629–644.

Schliep,A. et al. (2003) Using hidden Markov models to analyze gene expression time course data. Bioinformatics, 19 (Suppl. 1), i255–i263.

Searls,D.B. (1993) String variable grammar: a logic grammar formalism for the biological language of DNA. J. Log. Program., 24, 73–102.

Slater,G.S. and Birney,E. (2005) Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics, 6, 31.

Author notes

Associate Editor: Alex Bateman