An improved compound Poisson model for the number of motif hits in DNA sequences

Abstract Motivation Transcription factors play a crucial role in gene regulation by binding to specific regulatory sequences. The sequence motifs recognized by a transcription factor can be described in terms of position frequency matrices. When scanning a sequence for matches to a position frequency matrix, one needs to determine a cut-off, which then in turn results in a certain number of hits. In this paper we describe how to compute the distribution of match scores and of the number of motif hits, which are the prerequisites to perform motif hit enrichment analysis. Results We put forward an improved compound Poisson model that supports general order-d Markov background models and which computes the number of motif-hits more accurately than earlier models. We compared the accuracy of the improved compound Poisson model with previously proposed models across a range of parameters and motifs, demonstrating the improvement. The importance of the order-d model is supported in a case study using CpG-island sequences. Availability and implementation The method is available as a Bioconductor package named ’motifcounter’ https://bioconductor.org/packages/motifcounter. Supplementary information Supplementary data are available at Bioinformatics online.

N (a 0 · · · a d−1 , a d ) is then used to estimate the transition probabilities (see Equation (1) in the main text).
From the acquired estimates of the transition probabilitiesπ, we obtain the stationary probabilities µ(a 1 · · · a d ) by utilizing the power method for eigenvalue decomposition (Karlin and Taylor, 1981). 1 In this section, we discuss a recursive algorithm for computing the distribution of the scores P (S = s). Originally an algorithm that assumes an underlying order-0 background model was described by several groups (Rahmann et al., 2003;Grant et al., 2011;Touzet et al., 2007). It was also pointed out that this algorithm can be extended to general order-d background models (Beckstette et al., 2006;Touzet et al., 2007). An implementation of such an algorithm was reported in Thomas-Chollier et al., 2008, without describing the full details of the implementation. Therefore, we state the algorithm for computing the score distribution based on an order-d background model in this section.

Discretization of the score range
In order to derive an efficient algorithm for the score distribution P (S = s) it is necessary to discretize the real-valued score range. As a consequence, the score distribution will not be exact, with its accuracy depending on discretization granularity which has to be chosen as a compromise between accuracy and runtime efficiency.
To discretize the score range, we determine the maximally and minimally achievable score, denoted by s max and s min , for the given PFM of length M and a background model with order d. s max and s min can be obtained efficiently in O(M |A| d+1 ).
Subsequently, the score range is split into G equally spaced intervals using a predefined score granularity ∆s according to G = s max − s min ∆s .
Our decision to use the natural logarithm to obtain the motif score in Equation (2) (see main text) influences the choice of the score granularity ∆s. Note, however, that since the logarithm is a monotonic function, one can find a corresponding score granularity also for a different logarithm base (e.g. base = 2) which would yield a similar discretization error. Hence, the choice of the logarithm base does not affect the results.
For all experiments discussed in the main article, we used a score granularity of ∆s = 0.1 which seems to be a reasonable compromise between accuracy and runtime efficiency.

Computation of the score distribution
Similarly as described in (Rahmann et al., 2003), we seek to establish the distribution of the scores by a recursive procedure. To this end, we successively determine the score distribution Q j of a sub-motif ranging from position 1 through j by reusing the score distribution up to the previous position Q j−1 and incorporating the scores at position j. In addition to that the distribution must be computed with respect to all possible prefix words of length d in order to correctly account for local dependencies between neighboring nucleotides in an order-d background model.
We define the initial score distribution at position c = max(1, d) for all prefix words a 1 · · · a c ∈ A c and all discrete scores s ranging from s min to s max using the stationary distribution µ of the background model according to where we made use of the discretized score l c w.r.t. the first c positions of the motif defined by .
The floor operator denotes the discretization of the scores.
Next, we define the score distribution Q j recursively by extending Q j−1 the score contributions at position j. Therefore, we evaluate the following equation for all words a 1 · · · a d a d+1 ∈ A d+1 and for all discrete scores s ranging from s min to s max using the transition probabilities of the background model π according to Q j (s, a 2 · · · a d+1 ) := a 1 ∈A Q j−1 (s − l j (a 1 · · · a d , a d+1 ), a 1 · · · a d ) × π(a 1 · · · a d ; a d+1 ). (4) Here we made use of the discretized score contribution at position j defined by .
The score distribution is determined in an ordered fashion starting with j = max(1, d) and terminates when j = M . Finally, we obtain the score distribution by averaging over all prefix words a 1 · · · a d ∈ A d P (s) = a 1 ···a d Q M (s, a 1 · · · a d ).

Runtime
The runtime of the algorithm for computing the score distribution is given by where M denotes the length of the motif, |A| the alphabet size, d the background order and G the number of bins for discretizing the score range.

The joint distribution of a pair of scores
In this section, we describe the computation of the joint distribution of the scores P (S 0 = s, S k = s ) for two motif start positions 0 and k. The algorithm assumes that the underlying DNA sequence was generated by a order-d Markov model. This algorithm extends the originally proposed algorithm (Pape et al., 2008) which built on the assumption of an order-0 background. On the other hand, the algorithm for computing P (S 0 = s, S k = s ) accounts for the prefix words of length d in a similar manner as the algorithm for computing P (S = s) (see above).
We shall ultimately use the algorithm to acquire the marginal overlapping hit probabilities that were described in the main text.

Computation of the joint score distribution
Along the line of the previous section, it is necessary to discretize the score range with a predefined score granularity ∆s. For this purpose, we use the same granularity to discretize the scores s and s which always was chosen to be ∆s = 0.1 for the experiments in this article.
Next, we define the local score contributions in as similar way as in the previous section as In order to keep the notation uncluttered, we shall use l j in the following without explicitly indicating its dependence on the sequence a 1 · · · a d+1 .
We proceed by deriving an recursive algorithm, similarly as described in the previous section.
We initialize the score distribution at position c = max(1, d) for all prefix words a 1 · · · a c ∈ A c and for all discrete scores s and s that range from the respective minimally and maximally achievable scores using Subsequently, Q j is recursively expressed using Q j−1 and the score contribution at position j. Therefore, for all words a 1 · · · a d+1 ∈ A d+1 and for all discrete scores s and s ranging from the respective minimally and maximally achievable scores, we (7) where π denotes the transition probability of the order-d background model.
We evaluate the score distribution in an ordered fashion by starting Q max (1,d) and iterating the process until we obtain Q M +k . Finally, we obtain the joint score distribution by averaging over all prefix words a 1 · · · a d

Scanning both strands
As we are interested in finding motif hits on both strands of the DNA, we scan the sequence once with the original motif and once with the reverse complemented motif.
Therefore, analogously to the local score contributions for the forward strand l j , we introduce the local score contributions on the reverse strand as l j . l j is obtained by considering the reverse complemented instead of the original motif. Otherwise, the algorithm remains the same.

Runtime
The runtime of the algorithm presented above is O(M |A| d+1 G 2 ), where M denotes the motif lengths, |A| the alphabet size, and G the granularity of the score range.
However, there are a number of ways how the algorithm can be optimized: Firstly, and most importantly, after each iteration i with 1 ≤ i ≤ M + k any probability mass associated with scores that are certainly below t α after the final iteration can be dropped. Similarly, any probability mass that is certainly above t α after the final iteration can be aggregated (Pape et al., 2008). Both of these considerations reduce the region over which to perform the convolution and thereby the number of operations substantially. Secondly, we keep the motif sub-score distributions for each position when computing the score distribution at one position and reuse them for computing the joint distribution of scores. A similar approach was also proposed by Pape et al., 2008. This avoids redundant computational effort for initializing the 2D arrays. Third, we propose a novel optimization strategy: We initialize the joint distribution of scores by means of an enumerative approach (that iterates over all words) for the initial m ≤ M + k positions in the motif. While, the asymptotic runtime of the enumerative approach depends exponentially on m, it might still be efficient to use for small m, because for the first few positions, most of the entries in Q i (s, s , w i · · · w i+d−1 ) are zero. In that case, computational resources are wasted by multiplications with zeros, while the enumerative method more efficiently populates Q i (s, s , w i · · · w i+d−1 ) with non-zero entries. We compute optimal subsequence length m, up to which point the enumerative approach can be done faster than the recursive approach by summing up the number of multiplications that are necessary for each approach.

Principal overlapping hit probabilities
In this section, we derive an approximation of the principal overlapping hit probabilities that were introduced in the main paper.
The main difference of the principal overlapping hit probabilities compared to the marginal overlapping hit probabilities is that the former ones explicitly exclude intermediate motif hits. Therefore, the principal overlapping hit probabilities quantify the probability of obtaining a motif hit non-redundantly, whereas the marginal overlapping hit probabilities might contain redundant information about overlapping motif hits.

Overlapping hits with respect to a single DNA strands
For simplicity, we first derive a recursive approach for computing the overlapping hit probability with respect to scanning only a single strand. First, we introduce the base case for which there exists no intermediate event.
Subsequently, we derive a recursive approximation for β k for k > 1. The approximation rests on the assumption that having observed Y m = 1, the events Y l and Y n are statistically independent of one another for l < m < n. That is, events prior to and after a motif hit are independent. Note that this assumption is exactly met if the TFBSs can be described by a single word. If multiple words are compatible with the TF motif under the given score threshold, this assumption holds true only approximately.
Theorem 1. Assuming that the events Y l and Y n are independent if Y m = 1, the principal overlapping hit probability can be expressed as where l < m < n and k > 1.

Overlapping hits with respect to both DNA strands
In this section, we state the approximation of the principal overlapping hit probabilities for the case that both DNA strands are scanned for motif hits. We consider the ordered sequence of events Y 1 Y 1 Y 2 Y 2 Y 3 Y 3 · · · from left to right and assume that relative to that ordering, events that are separated by a hit (e.g. Y l = 1 or Y l = 1) are rendered independent of one another.
Accordingly, we identify the following base cases for which there exist no intermediate events, respectively.
Theorem 2. Assuming that events that are separated by a hit (e.g.
we can express the principal overlapping hit probabilities as Proof. The proof is a analogous of the proof for Theorem (1).

Adaptation of the original compound Poisson model
We reimplemented the compound Poisson approximation according to Pape et al., 2008 with a few changes, in order to ensure a fair comparison with the improved version. In this section, we briefly summarize the changes to the original version.
First, while the original version supports only order-0 background models, we utilized the afore-mentioned algorithm for computing the two-dimensional distribution of the scores which assumes a higher-order background model. This enables us to compare the original and the improved model across different background model orders.
Second, while in the original version two kinds of overlapping hit probabilities were introduced, namely γ k and γ k = γ 3 ,k , we suggested a third case which is represented by γ 5 ,k , because γ 5 ,k and γ 3 ,k are generally not symmetric (see main text). We adapted the original model to include γ 5 ,k in the derivation of the clump size probabilities.
Third, in contrast to the original model where a 1-clump is always contained on the forward strand hit, we allowed the 1-clump to either contain a forward strand hit or a reverse strand hit.     Motif α d(P E , P N Clump ) d(P E , P P Clump ) E47 10