How sequence alignment scores correspond to probability models

Abstract Motivation Sequence alignment remains fundamental in bioinformatics. Pair-wise alignment is traditionally based on ad hoc scores for substitutions, insertions and deletions, but can also be based on probability models (pair hidden Markov models: PHMMs). PHMMs enable us to: fit the parameters to each kind of data, calculate the reliability of alignment parts and measure sequence similarity integrated over possible alignments. Results This study shows how multiple models correspond to one set of scores. Scores can be converted to probabilities by partition functions with a ‘temperature’ parameter: for any temperature, this corresponds to some PHMM. There is a special class of models with balanced length probability, i.e. no bias toward either longer or shorter alignments. The best way to score alignments and assess their significance depends on the aim: judging whether whole sequences are related versus finding related parts. This clarifies the statistical basis of sequence alignment. Supplementary information Supplementary data are available at Bioinformatics online.

1 Efficient algorithms for local alignment As described previously [ZPM97,CWC04], Algorithm II can be made more efficient by using these variables: The algorithm can then be performed as follows: Algorithm S.I W 00 = 0 Optimal alignment score = m max i=0 n max j=0 W ij The main advantages are that this algorithm uses only three old values of W ,Ŷ , andẐ, whereas Algorithm II uses five old values, and if a D = a I (as is often the case) one addition is saved. A further nuance is described in [CWC04].

Alternatives
An equally-efficient alternative is to use these variables: Another possibility is to useX ij = X i+1 j+1 . These changes of variables can also be used to accelerate the Forward algorithms described in this study. In practice, a fast implementation should consider processor hardware details in order to optimize the speed and parallelism of calculation [Far07,Rog11,SK18]. Some aspects of this are quite general, e.g. maximizing parallelism by minimizing the number of bits needed to hold each value, and minimizing critical path length [SK18].

More on the linear-gap model
This section provides more information on the simple gapped PHMM in Figure 4 of the main text, in particular, an alternative explanation of balanced length probability.

Forward algorithm
The total probability of all possible alignments can be calculated like this: Algorithm S.II W 00 = 1

Conserved probability ratio
Let us consider a different explanation of balanced length probability. The idea is to have transition probabilities (arrow probabilities) that favor neither longer nor shorter alignments. For the gapless model (Figure 2), this means: γ = ω D ω I . Intuitively, for the linear-gap model, it seems that γ should be less than ω D ω I , because an alignment can be extended in alternative ways (with gaps). Precisely, this idea means conservation of transition probability ratio relative to a null alignment. Algorithm S.II outputs a probability ratio, where the denominator (µ G ) is the probability of a null alignment. The algorithm calculates this from probability ratios of sequence prefixes. In the algorithm, W ij supplies probability ratio to W i+1 j , W i j+1 , and W i+1 j+1 . Specifically, it supplies: • probability ratio W ij · a I to W i j+1 , • transition probability ratio W ij · c to W i+1 j+1 .
Conservation means that the total probability ratio emanating from W ij neither increases nor decreases: This is equivalent to Equation 27 in [YH01].  Figure S1: Change in prior length probability for an unaligned flank of a local alignment, when comparing two equal-length sequences using gapped versus gapless alignment models with balanced length probability. Black curves: lineargap parameters a D = a I = 0.1. Red curves: linear-gap parameters a D = a I = 0.3.

Balanced flank length probabilities
Let us see what balanced length probability means for the unaligned flanks of a local alignment. For gapless alignment ( Figure 2 in the main text), it means that all possible alignments have uniform prior probability. So the prior probability of a flank's length being x is proportional to the number of possible alignments with that flank. If we compare two sequences of length n, the number of possible alignments with a particular flank of length x is: (n − x + 1)(n + x + 2)/2.
For the linear-gap model, we can calculate the prior probability of a flank's length being x, by a variant of Algorithm S.II. The result is close to the gapless result, and gets closer as n increases ( Figure S1).
3 More on affine-gap model A 3.1 Ambiguity and alignment scores As explained in the main text, model A ( Figure 5A) is ambiguous, meaning that different paths through the PHMM yield indistinguishable alignments. In the main text we assumed that "alignment" means "path", and described probabilities and scores of alignments defined in this way. However, it is also possible to define an alignment to be a set of indistinguishable paths.
Let us explore the case where consecutive deletions of lengths d 1 and d 2 , and a single deletion of length d 1 + d 2 , are regarded as the same alignment (and likewise for insertions). The probability of a length-d deletion starting just after R c is now: Likewise, the probability of a length-d insertion starting just after Q c is: Therefore, the gap score parameters in this case are: The practical consequence is that scores for gaps of length > 1 are less negative, compared to the case where alignment = path. Furthermore, the optimal path might not be a member of the optimal alignment. If we wish to reverse-engineer a PHMM from score parameters, then we need to decide whether the score parameters apply to paths or to alignments (sets of paths). In the latter case, we could at first calculate the path parameters: Once we know the path parameters, the situation is the same as in the main text. Some examples suggest that the difference between gap scores for paths and for alignments is typically small (Table S1).

Linear gap costs and affine parameterization
These gap score parameters have some properties worth noting. On one hand, there is a convenient guarantee:ā D ≤b D andā I ≤b I . On the other hand, the linear gap cost caseā D/I =b D/I corresponds to b D/I = −∞ and b D/I = 0. Therefore, it is desirable for the Forward algorithm to allow b D/I = 0. The Forward algorithm presented here (Algorithm III in the main text) does allow this, but only because it uses the right parameterization of affine gaps. The two common parameterizations of affine score for a gap of length k are: a + b × (k − 1) and a + b × k. Algorithm III is based on the former, whereas the latter does not support b = −∞.

Further ambiguity
There is another potential ambiguity in model A: an insertion followed by a deletion might be regarded as indistinguishable from a deletion followed by an insertion. However, if we define these as being the same alignment, then alignment scores do not have the standard affine-gap form (consecutive inserts and deletes score less negatively than non-consecutive ones).

Conserved probability ratio
Similarly to the linear-gap model, balanced length probability can be understood as conservation of transition probability ratio relative to a null alignment. The Forward algorithm (Algorithm III in the main text) outputs a probability ratio, where the denominator (µ G ) is the probability of a null alignment. The algorithm calculates this from probability ratios of sequence prefixes. Conservation is easier to see if we rescale Y and Z : The recurrences in Algorithm III can be written in terms of Y and Z : The point of this rescaling is that the total probability ratio emanating from Y ij and Z ij neither increases nor decreases. The probability ratio emanating from , which sums to Y ij . Likewise for Z ij . The probability ratio (of transition probabilities) emanating from W ij is: Conservation of probability ratio means that this should equal W ij . Equivalently: This section deals with the PHMM in Figure 5B of the main text. The math turns out to become simpler if we use the M→M transition probability, Γ = γ(1 − α D − α I ), instead of γ.

Constraint on δ
If we desire standard affine-gap alignment, δ should be set such that adjacent deletions and insertions have the same probability as non-adjacent deletions and insertions. That is, these two configurations should have the same probability: which implies that:

Scores in terms of model probabilities
Consider an alignment where the first c letters of R and Q are aligned, the next d letters of R are inserted, the next e letters of R and Q are aligned, the next f letters of Q are inserted, and the final g letters of R and Q are aligned. The probability of this alignment under the model is: We can rearrange this to: Dividing by µ (defined in the main text) gives: Now we can express the score parameters in terms of the model probabilities: Finally, substituting in the constrained value of δ gives:

Symmetric insertions and deletions
Suppose we wish to make the model as symmetric as possible, and so we specify that ω D = ω I , β D = β I , and a D = a I . In that case, α D and α I are related like this:

Limits to degrees of freedom
Model B has several degrees of freedom, meaning that we can vary the model probabilities with no effect on alignment scores. In particular, we can freely vary ω D and ω I , provided we co-vary β D , β I , α D , α I , and γ so as to keep these values fixed: We cannot, however, have γ ≥ 1. Thus: Equivalently: We can express Γ + α D + α I in terms of ω D , ω I , and the fixed values, as follows: Thus, ω D and ω I are restricted to values that make the right-hand side of the above equation < 1.

Balanced length probability
Balanced length probability occurs when: In this case, we can freely vary ω D , ω I , and γ arbitrarily close to 1. This is equivalent to Equation 28 in [YH01].

Forward algorithm
The Forward algorithm for model B can be expressed like this: Algorithm S.III

Conserved probability ratio
Let us consider transition probabilities that conserve probability ratio relative to a null alignment. Similarly to model A, conservation is easier to see if we rescale Y and Z : If the Forward algorithm is written in terms of Y and Z , it can be verified that the total probability ratio (of transition probabilities) emanating from Y ij and Z ij neither increases nor decreases. The probability ratio emanating from X ij is: Conservation of probability ratio means that this should equal X ij . Equivalently: Models that conserve probability ratio are precisely those with balanced length probability.

Maximum likelihood parameters
A standard way of selecting PHMM parameters is by seeking a maximumlikelihood fit to some training data [DEKM98]. This procedure (i) calculates the expected count of each transition and emission, and (ii) selects model parameters that maximize the likelihood of those counts.
Step (ii) is affected if we impose constraints on the parameters (such as the constraint on δ), so let us examine step (ii) here. Suppose we are given these expected counts: The probability of observing these counts under the model is proportional to: We wish to select parameters (Greek letters) that maximize this probability, or equivalently, maximize its logarithm: Substituting in the constrained value of δ gives: Without the constraint, the maximum-likelihood parameters are straightforward: With the constraint, the maximum-likelihood β D and β I are unchanged, but unfortunately α D , α I , and γ do not appear to be simple.

Yet another model of gapped local alignment
Model C ( Figure S2) is yet another model of gapped local alignment: this one is considered here because it produces recurrence relations of the form in [YH01].

Constraint on δ
If we desire standard affine-gap alignment, δ should be set such that adjacent deletions and insertions have the same probability as non-adjacent deletions and insertions. By similar reasoning to that used above for model B:

Scores in terms of model probabilities
Similarly to the other PHMMs, the alignment probabilities can be connected to scores. First, define µ C = prob(null path, R, Q), where "null path" means a path that never uses the α D , α I , or γ arrows: Then, for any path, t ln(prob(path, R, Q)/µ C ) is a sum of these affine-gap scores:

Symmetric insertions and deletions
Suppose we wish to make the model as symmetric as possible, and so we specify that ω D = ω I , β D = β I , and a D = a I . In that case, α D and α I are related like this:

Limits to degrees of freedom
Model C has several degrees of freedom, meaning that we can vary the model probabilities with no effect on alignment scores. In particular, we can freely vary ω D and ω I , provided we co-vary β D , β I , α D , α I , and γ so as to keep these values fixed: We cannot, however, have γ ≥ 1. We can express γ in terms of ω D , ω I , and the fixed values, as follows: Thus, ω D and ω I are restricted to values that make the right-hand side of the above equation < 1. This turns out to be the same restriction as for model B!

Forward algorithm
The Forward algorithm for model C can be expressed like this: Algorithm S.IV

Conserved probability ratio
The transition probabilities of model C conserve probability ratio in the same situation as model B (because the Forward algorithms are similar): For model C, this is equivalent to: Again, parameter sets that conserve probability ratio are those with balanced length probability (where we can freely vary ω D , ω I , and γ arbitrarily close to 1).

Maximum likelihood parameters
It is useful to select PHMM parameters that have maximum-likelihood fit to some given sequence data. As mentioned above for model B, this involves finding model parameters that maximize the likelihood of some expected counts (c m , c D , c I , c d , c i , c p , defined above). The probability of observing these counts under the model is proportional to: We wish to select parameters (Greek letters) that maximize this probability, or equivalently, maximize its logarithm: