DNA motif elucidation using belief propagation

Protein-binding microarray (PBM) is a high-throughput platform that can measure the DNA-binding preference of a protein in a comprehensive and unbiased manner. A typical PBM experiment can measure the binding signal intensities of a protein against all possible DNA k-mers (k = 8-10); such comprehensive binding affinity data usually need to be reduced and represented as motif models before they can be further analyzed and applied. Since proteins can often bind to DNA in multiple modes, one of the major challenges is to decompose the comprehensive affinity data into multimodal motif representations. Here, we describe kmerHMM, a new algorithm that uses Hidden Markov Models (HMMs) and belief propagation to derive precise and multimodal motifs. kmerHMM accepts raw PBM probe data and preprocesses them into median binding intensities of individual k-mers. The k-mers are ranked and aligned for training an HMM as the underlying motif representation. Multiple motifs are then extracted from the HMM using belief propagation. Comparisons of kmerHMM with other leading methods on several data sets demonstrated its effectiveness and uniqueness; in particular, it achieved the best performance on more than half of the data sets. In addition, the multiple binding modes derived by kmerHMM are biologically meaningful and will be useful in interpreting other genome-wide data such as those generated from ChIP-seq. The executables and source code are available at the authors' websites: e.g. http://www.cs.toronto.edu/∼wkc/kmerHMM.

Motif comparison. The motif logos are mapped and plotted from the most probable state transition paths of the HMMs trained by kmerHMM, found using the max-product algorithm. The most probable state transition paths of kmerHMM encode patterns similar to the motif matrices found by other methods.
Figure S7. Empirical p-values estimated from about 2 million random pairwise distances.
Figure S8. Motif logos mapped and plotted from the most probable state transition paths of the HMMs trained by kmerHMM on Array #1, found using the max-product algorithm. Trailing gaps are trimmed. RC stands for Reverse Complement.
Figure S9. Motif logos mapped and plotted from the most probable state transition paths of the HMMs trained by kmerHMM on Array #2, found using the max-product algorithm. Trailing gaps are trimmed. RC stands for Reverse Complement.
Figure S10. Receiver Operating Characteristic (ROC) curves for kmerHMM trained on Array #1 and #2 and tested on Array #2 and #1, respectively. The positive (bound) DNA probe sequences in each data set are defined using the robust estimate in RankMotif++ (57): a probe seq_j is positive if its normalized signal intensity I_j > m + 4σ, where m is the median of all the probe normalized intensities {I_1, I_2, ..., I_n} and σ is their median absolute deviation (MAD) divided by 0.6745 (the MAD of the unit normal distribution). The remaining probes are defined as negative and account for 94.6% to 99.1% of each data set. Given this two-class classification setting, the predicted binding preference B(seq_j) of each probe sequence seq_j is thresholded to estimate the true positive rates at different levels of false positive rates; AUC stands for the Area Under Curve.
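The robust positive-probe threshold described above can be sketched as follows (a minimal illustration, assuming normalized intensities arrive as a plain array; `positive_probe_mask` is a hypothetical helper name, not from the original software):

```python
import numpy as np

def positive_probe_mask(intensities, k=4.0):
    """Flag probes whose normalized intensity exceeds m + k*sigma, where m
    is the median and sigma is the median absolute deviation (MAD) divided
    by 0.6745 (the MAD of the unit normal distribution)."""
    x = np.asarray(intensities, dtype=float)
    m = np.median(x)
    sigma = np.median(np.abs(x - m)) / 0.6745
    return x > m + k * sigma
```

With this robust estimate, a handful of strongly bound probes cannot inflate the threshold, which is why the negative class ends up covering the vast majority of each data set.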

BAUM-WELCH ALGORITHM
After a set of positive k-mers is selected, the k-mers are aligned using a multiple sequence alignment method. The aligned k-mers then serve as input for training an HMM to represent the binding preferences of the DNA-binding protein of interest, using the Baum-Welch training algorithm (65). Mathematically, the Baum-Welch training algorithm can be described as follows.
Input: a set of aligned k-mers S = {s_1, s_2, s_3, ..., s_M} of length L. Each k-mer s_m can be represented as s_m = s_m1 s_m2 ... s_mL, where s_mp is the p-th nucleotide of the aligned k-mer s_m.
Output: an HMM model θ = (a_ij, b_i(x), π_i) trained to represent the aligned k-mers, where a_ij is the transition probability from state i to state j; b_i(x) is the emission probability of emitting x at state i; and π_i is the initial state probability for state i.
As Baum-Welch training is an Expectation-Maximization (EM) algorithm, we randomly initialize the HMM model parameters θ_0 at the beginning and iteratively refine them in each iteration.
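The random initialization of θ_0 can be sketched as follows (a minimal sketch; the state count, the four-letter DNA alphabet encoding, and the name `init_hmm` are assumptions for illustration):

```python
import numpy as np

def init_hmm(num_states, num_symbols=4, seed=0):
    """Draw random HMM parameters theta_0 = (a, b, pi) and normalize each
    distribution: a[i, j] transitions, b[i, x] emissions, pi[i] initials."""
    rng = np.random.default_rng(seed)
    a = rng.random((num_states, num_states))
    b = rng.random((num_states, num_symbols))
    pi = rng.random(num_states)
    a /= a.sum(axis=1, keepdims=True)   # each row is a distribution
    b /= b.sum(axis=1, keepdims=True)
    pi /= pi.sum()
    return a, b, pi
```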
In the expectation step (E-step) of the l-th iteration, we calculate the expected values of being in state i based on the current parameter estimates θ_l. Specifically, we calculate:

$$\gamma^m_p(i) = \frac{\alpha^m_p(i)\,\beta^m_p(i)}{P(s_m;\theta_l)}$$

where γ^m_p(i) is the expected probability of being in state i at the p-th nucleotide position for the aligned k-mer s_m; α^m_p(i) and β^m_p(i) are the forward and backward probabilities of the aligned k-mer s_m being in state i at the p-th nucleotide position, as calculated by the dynamic programming approach (65). P(s_m;θ_l) is the probability of observing s_m given the existing HMM model parameters θ_l, which can be calculated as $P(s_m;\theta_l) = \sum_{i=1}^{N} \alpha^m_p(i)\,\beta^m_p(i)$ for any position p. In addition, we also calculate the expected values of state transitions from state i to state j:

$$\zeta^m_p(i,j) = \frac{\alpha^m_p(i)\,a_{ij}\,b_j(s_{m,p+1})\,\beta^m_{p+1}(j)}{P(s_m;\theta_l)}$$

where ζ^m_p(i,j) is the expected probability of transiting from state i at nucleotide position p to state j at nucleotide position p+1 for the aligned k-mer s_m, given the current parameter estimates θ_l.
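The E-step quantities γ and ζ can be sketched with a standard forward-backward pass (a sketch assuming each aligned k-mer is encoded as integer symbols 0-3; `e_step` is a hypothetical name):

```python
import numpy as np

def e_step(seq, a, b, pi):
    """Forward-backward pass for one aligned k-mer.  Returns
    gamma[p, i] = P(state i at position p | seq) and
    xi[p, i, j] = P(state i at p, state j at p+1 | seq),
    plus the sequence likelihood P(seq; theta)."""
    L, N = len(seq), a.shape[0]
    alpha = np.zeros((L, N))
    beta = np.zeros((L, N))
    alpha[0] = pi * b[:, seq[0]]
    for p in range(1, L):                       # forward recursion
        alpha[p] = (alpha[p - 1] @ a) * b[:, seq[p]]
    beta[L - 1] = 1.0
    for p in range(L - 2, -1, -1):              # backward recursion
        beta[p] = a @ (b[:, seq[p + 1]] * beta[p + 1])
    likelihood = alpha[L - 1].sum()             # P(seq; theta)
    gamma = alpha * beta / likelihood
    obs = np.asarray(seq[1:])
    # xi[p, i, j] = alpha[p, i] * a[i, j] * b[j, seq[p+1]] * beta[p+1, j] / P
    xi = (alpha[:-1, :, None] * a[None, :, :]
          * b[:, obs].T[:, None, :] * beta[1:, None, :]) / likelihood
    return gamma, xi, likelihood
```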
In the maximization step (M-step) of the l-th iteration, the model parameters are refined to the maximum likelihood estimates given those expected values:

$$a_{ij} = \frac{\sum_{m=1}^{M} \sum_{p=1}^{L-1} \zeta^m_p(i,j)}{\sum_{m=1}^{M} \sum_{p=1}^{L-1} \gamma^m_p(i)} \qquad b_i(x) = \frac{\sum_{m=1}^{M} \sum_{p:\,s_{mp}=x} \gamma^m_p(i)}{\sum_{m=1}^{M} \sum_{p=1}^{L} \gamma^m_p(i)} \qquad \pi_i = \frac{1}{M} \sum_{m=1}^{M} \gamma^m_1(i)$$

The new HMM model parameters θ_{l+1} are used in the next iteration. We repeat the E-step and M-step alternately until the HMM model parameters no longer change; in other words, the difference between θ_l and θ_{l+1} converges to a numerically negligible value, at which point a local optimum is found.
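The M-step re-estimation can be sketched as follows, aggregating the per-k-mer expectations from the E-step (a sketch; the function and variable names are assumptions for illustration):

```python
import numpy as np

def m_step(gammas, xis, seqs, num_symbols=4):
    """Re-estimate theta from per-sequence expectations: transitions are
    expected transition counts over expected state occupancy; emissions are
    expected symbol counts per state; pi averages position-1 occupancy."""
    N = gammas[0].shape[1]
    a_num = sum(xi.sum(axis=0) for xi in xis)             # sum_m sum_p xi
    a_den = sum(g[:-1].sum(axis=0) for g in gammas)[:, None]
    a = a_num / a_den
    b = np.zeros((N, num_symbols))
    for g, s in zip(gammas, seqs):
        for p, x in enumerate(s):                          # sum over p with s_mp = x
            b[:, x] += g[p]
    b /= b.sum(axis=1, keepdims=True)
    pi = sum(g[0] for g in gammas) / len(gammas)
    return a, b, pi
```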

MAX-PRODUCT ALGORITHM
In this study, the most probable state transition path Y = (y_1, y_2, y_3, ..., y_L) is calculated for each trained HMM θ using the max-product algorithm. Mathematically, the max-product algorithm can be described as follows.
Input: an HMM model θ = (a_ij, b_i(x), π_i) trained to represent the input aligned k-mers, where a_ij is the transition probability from state i to state j; b_i(x) is the emission probability of emitting x at state i; and π_i is the initial state probability for state i.
Output: the most probable state transition path Y* = (y*_1, y*_2, y*_3, ..., y*_L) in the input HMM model θ:

$$Y^* = \arg\max_Y P(Y \mid \theta)$$

which can be calculated using a dynamic programming approach. If we expand P(Y|θ) to the classic likelihood function P(Y,O|θ) of an HMM, we have:

$$P(Y, O \mid \theta) = P(y_1) P(y_2 \mid y_1) P(y_3 \mid y_2) \cdots P(y_L \mid y_{L-1}) \, P(o_1 \mid y_1) P(o_2 \mid y_2) \cdots P(o_L \mid y_L)$$

Since we seek the most probable state transition path itself, rather than the path for any particular emitted sequence, the emission terms are dropped:

$$\Rightarrow \arg\max_Y P(y_1) P(y_2 \mid y_1) P(y_3 \mid y_2) \cdots P(y_L \mid y_{L-1})$$

The problem is thus reduced to finding the most probable state transition path for the discrete Markov chain of the HMM θ. It can be computed using a dynamic programming approach (69) as follows:

$$\max_{y_1} P(y_1) \max_{y_2} P(y_2 \mid y_1) \max_{y_3} P(y_3 \mid y_2) \cdots \max_{y_L} P(y_L \mid y_{L-1})$$

In essence, we maximize each state y_p one by one with the help of an auxiliary function f_p that factors out the previous state y_{p-1}, for each value of p from the end, until only a single state y_1 remains to be maximized. This exploits the Markov property: only two states need to be maximized at each step, so linear time complexity is achieved with respect to the path length L. Backward tracing is then applied to recover the most probable state transition path Y* = (y*_1, y*_2, y*_3, ..., y*_L). The N-max-product algorithm works on the same principle but keeps N paths during the maximization procedure; in other words, each argument maximization operator arg max_{y_p} is replaced by a similar operator that maximizes and stores the top N arguments during the dynamic programming shown above.
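The reduced Markov-chain maximization with backward tracing can be sketched as follows (a sketch of the max-product recursion over transitions only; `most_probable_path` is a hypothetical name):

```python
import numpy as np

def most_probable_path(a, pi, L):
    """Max-product over the discrete Markov chain of a trained HMM:
    delta[p, j] is the probability of the best path ending in state j at
    position p; psi stores backpointers for the final traceback."""
    N = len(pi)
    delta = np.zeros((L, N))
    psi = np.zeros((L, N), dtype=int)
    delta[0] = pi
    for p in range(1, L):
        scores = delta[p - 1][:, None] * a   # scores[i, j]: from state i to j
        psi[p] = scores.argmax(axis=0)
        delta[p] = scores.max(axis=0)
    path = [int(delta[-1].argmax())]
    for p in range(L - 1, 0, -1):            # backward tracing
        path.append(int(psi[p, path[-1]]))
    return path[::-1]
```

Keeping the top N entries per column instead of only the single maximum at each step yields the N-max-product variant.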

MOTIF MATRIX DISTANCE
Given two motif matrices A and B of sizes 5×n_A and 5×n_B, respectively, we can align them using a profile alignment method (e.g. Needleman-Wunsch global alignment (64)). Once we have aligned A and B, we calculate the aligned distance d_a(A,B):

$$d_a(A,B) = \sum_{i=1}^{n_A} \sum_{j=1}^{n_B} \delta(i,j)\, d(a_i, b_j)$$

where δ(i,j) = 1 when the i-th column of A is aligned to the j-th column of B, and δ(i,j) = 0 otherwise; a_i and b_j denote the i-th column of A and the j-th column of B, respectively; and d(a_i, b_j) is the Euclidean distance between a_i and b_j.
In addition, we also take into account the gap distance d_g(A,B):

$$d_g(A,B) = d_{max} \left( \min(n_A, n_B) - \sum_{i=1}^{n_A} \sum_{j=1}^{n_B} \delta(i,j) \right)$$

where d_max is the maximal possible value of the column distance d(a_i, b_j) in the context of motif matrices. Thus we penalize each unaligned column by the constant d_max, where the number of unaligned columns is min(n_A, n_B) − Σ_i Σ_j δ(i,j). Finally, we combine and normalize the two distances into a single distance measure d(A,B), taking the smaller distance of the forward and reverse-complement orientations:

$$d(A,B) = \frac{\min\big(\, d_a(A,B) + d_g(A,B),\ \ d_a(A, B_{rc}) + d_g(A, B_{rc}) \,\big)}{\min(n_A, n_B)}$$

where A_rc and B_rc are the reverse complement motif matrices of the motif matrices A and B, respectively; aligning A with B_rc is equivalent to aligning A_rc with B, so either orientation comparison suffices.
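Given the column pairs produced by the profile alignment, combining d_a and d_g can be sketched as follows (a sketch; `motif_distance` is a hypothetical name, the aligned pairs are assumed precomputed, and the reverse-complement comparison is left to the caller):

```python
import numpy as np

def motif_distance(A, B, pairs, d_max):
    """Aligned distance d_a: sum of Euclidean distances between aligned
    columns; gap distance d_g: d_max per unaligned column; the combined
    distance is normalized by min(n_A, n_B)."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    d_a = sum(np.linalg.norm(A[:, i] - B[:, j]) for i, j in pairs)
    n_min = min(A.shape[1], B.shape[1])
    d_g = d_max * (n_min - len(pairs))     # penalize unaligned columns
    return (d_a + d_g) / n_min
```

Two identical, fully aligned matrices give a distance of zero, while every unaligned column contributes the maximal per-column penalty d_max before normalization.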