MRHMMs: Multivariate Regression Hidden Markov Models and the variantS

Summary: Hidden Markov models (HMMs) are flexible and widely used in scientific studies. Particularly in genomics and genetics, there are multiple distinct regimes in the genome within each of which the relationships among multivariate features are distinct. Examples include differential gene regulation depending on gene functions and experimental conditions, and varying combinatorial patterns of multiple transcription factors. We developed a software package called MRHMMs (Multivariate Regression Hidden Markov Models and the variantS) that accommodates a variety of HMMs that can be flex-ibly applied to many biological studies and beyond. MRHMMs sup-plements existing HMM software packages in two aspects. First, MRHMMs provides a diverse set of emission probability structures, including mixture of multivariate normal distributions and (logistic) regression models. Second, MRHMMs is computationally efficient for analyzing large data-sets generated in current genome-wide studies. Especially, the software is written in C for the speed advantage and further amenable to implement alternative models to meet users’ own purposes. and Supplementary information: Supplementary data are available at Bioinformatics online.


INTRODUCTION
Multivariate features are abundant in genomic and genetic studies. A central theme is to understand the differential relationships among features and their contribution to gene functions and phenotypes. For example, the gene expression variation can be predicted using distinct sets of histone modification levels in gene function groups (Jung and Kim, 2012). The joint occupancy of multiple transcription factors may change depending on the genome context, such that differential regulatory modules are observed and linked to distinct regulation mechanisms. MRHMMs (Multivariate regression hidden Markov models and the variantS) is based on the premise that biological factors do not interact with each other in a uniform manner across the genome. A regression hidden Markov model (rHMM), for example, can be used to segment the genome or genes into groups in each of which there is a unique relationship among biological factors. In addition to rHMM, MRHMMs includes five other hidden Markov model (HMM) variant structures that can be alternatively applied to suit specific studies and data characteristics.

MATERIALS AND METHODS
An HMM can be viewed as a mixture model in which the latent (hidden) state has the Markov property. Given the number of states M, an HMM is constructed by specifying the initial state probability, the transition probability and the emission probability distribution. Let O ¼ ðO 1 , . . . , O T Þ denote the observation of data at time points t 1 , . . . , t T . The time points will correspond to genomic locations in our setting. Let d ðd 2 , . . . , d T Þ denote the distance of the two adjacent observations, where d k ¼ t k À t kÀ1 for k ¼ 2, . . . , T, and let q ðq 1 , . . . , q T Þ denote the hidden states of the HMM. The initial state probability is denoted by ¼ ð 1 , . . . , M Þ, where i ¼ Pðq 1 ¼ s i Þ and s i represents the ith state for i ¼ 1, . . . M. The transition probability from state s i at time t to state s j is denoted by a ij ðtÞ ¼ Pðq tþ1 ¼ s j jq t ¼ s i , d tþ1 Þ, where t 2 ft 1 , . . . , t T g. See the Supplementary material for the details. Let A t , t 2 ft 1 , . . . , t T g denote the transition probability matrix at time t, and let A ¼ fA 1 , . . . , A T g.
Let b i ðO t Þ ¼ pðO t jq t ¼ s i , Þ denote the emission probability distribution given state s i , where denotes a collection of parameters for the emission probability distribution. The joint distribution pðO, qj, A, Þ of the HMM is then written as When the emission probability distribution is a mixture of K distributions, the last term in Equation (1) can be replaced by the following: where h t is the mixture component at time t When the observation O t consists of explanatory and response variables, denoted by x t and y t , respectively, the last term in Equation (2) can be further expressed as This model is more general than both the regular HMM and the rHMM. The regular HMM has pðy t jx t , q t , Þ ¼ 1 and the rHMM has pðx t jq t , Þ ¼ 1. *To whom correspondence should be addressed. (3):

MRHMMs incorporates six different emission probability structures by modifying Equation
Within a state, the emission probability densities in Models 1-3 consist of a single parametric distribution, while Models 4-6 consist of a mixture of multiple distributions. Model 1 represents the regular HMM, Model 3 corresponds to rHMM and Model 2 incorporates information in both the explanatory variables and their relationships with the response variables. Models 4 and 5 can be considered as an extension of Models 1 and 2 to non-parametric emission densities, i.e. the actual emission densities in unknown distribution families are approximated by a mixture of parametric densities. Model 6 includes a single relationship between explanatory and response variables, whereas the explanatory variables are further modeled non-parametrically using a mixture model.
The unknown parameters in MRHMMs, (, A, ), are estimated using the Baum-Welch algorithm (Baum et al., 1970). The hidden states are estimated using the Viterbi algorithm (Viterbi, 1967) that finds the most likely state sequence. The number of states is determined by Bayesian Information Criterion.
The information of the model structure and options that a user would like to build are carried through a text file, which we refer to as an input file. The explanatory, response and location files are space-delimited and need to be prepared separately. MRHMMs automatically generates five output files whose names are indicated by the number of states and the number of mixture components (if Models 4-6 are used). The output files are the initial state and transition probabilities (hmm), the parameter estimates of emission probability distributions (parm), the log likelihood and the (posterior) state probabilities (loglike), the loglikelihood of each independent run (RepLoglike) and the HMM states inferred by the Viterbi algorithm (viterbi). The details of input and output files and available options are provided in the Supplementary material.

EXAMPLE
We present an application of MRHMMs to a genome-wide study of the dynamics of GATA1 binding in erythroid cells in mouse. To illustrate, we applied MRHMMs using Model 3 (rHMM) to segment the mouse (mm8) chromosome 7 using the relationships between four explanatory variables: H3K4me3, H3K27me3 (histone modification levels), DNase I hyper sensitivity and Tal1 occupancy, and one response variable: GATA1 occupancy. The data are generated by Wu et al. (2011) using the ChIP-seq technology. We used the log-transformed maximums ðlogðx þ 1ÞÞ over 1 000-bp non-overlapping windows as the variables. We quantile normalized the explanatory variables and standardized the response variables. MRHMMs found 16 states as determined by Bayesian Information Criterion using 10 independent runs for each number of states M ¼ 2 $ 20. We relabeled the states by an increasing order of the GATA1 averages in each state. The regression coefficients and the boxplots of GATA1 occupancy in each state are shown in Figure 1. The statistically insignificant regression coefficients are marked by 'X'. It is observed that the last two states (states 15 and 16) have notably high GATA1 signals, and the genome segments in these two states are the potential candidates of GATA1 binding sites. The regression coefficients of the two states suggest distinct relationships between GATA1 binding and the four explanatory factors: state 15 demonstrates mild correlation but state 16 indicates strong relationships. Based on the results from McLean et al. (2010), we found that state 15 is enriched with cis-regulatory regions of genes related with hemoglobin complex, whereas state 16 is enriched with cytosolic part (Binom Raw P-values are 8.6 Â 10 À8 and 4.8 Â 10 À8 , respectively).
Funding: The project described was supported by the National Center for Research Resources and the National Center for Advancing Translational Sciences, National Institutes of Health, through grants UL1TR000127 and NSF ABI-1262538. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Conflict of Interest: none declared.