A Model of Indel Evolution by Finite-State, Continuous-Time Machines

How do instantaneous rate models of insertion-deletion processes relate to distributions over pairwise sequence alignments? The only exactly-solved model is the 1991 Thorne.....

the set of all possible DNA or protein sequences. For practical purposes, we often need to summarize paths through this process, and it is worth distinguishing between different ways of doing so. We will use three progressively detailed descriptions of the evolutionary path which we refer to as alignments, trajectories, and histories (Figure 1), as described below.

Alignments
A pairwise alignment consists of the observed initial and final state of the process (the ancestral and descendant sequences), with gap characters to show which residues are descended from which. An example alignment is shown in Figure 1A. Most of our discussion will be at this level of summarization.

Trajectories
A trajectory includes all the intermediate sequences from ancestor to descendant. Transitions between the intermediate sequences correspond to instantaneous changes. A trajectory uniquely implies an alignment, but there are many trajectories consistent with each alignment: the most plausible trajectory for alignment 1A is shown in 1B, but the longer trajectories in 1C and 1D are also consistent. We will refer to this level of summarization when discussing some previous methods for indel analysis.

Histories
A history consists of a trajectory fully annotated with the time of each indel event. This is the most detailed description, being a complete specification of the path of the stochastic process. For each nontrivial trajectory, there is a continuum of possible histories. For example, history 1E is consistent with trajectory 1B, with an event time u in the range 0 # u # t. We will not refer much to this level as it contains more information than we usually care about.
Reflecting this hierarchy of summarization, we can write where (u 1 , u 2, u 3. . . ) represents all the event times in a history. Note that the top-level summation is over alignments. Many scenarios demand that we marginalize ambiguous or uncertain alignments. For example, the alignments in 1F are plausible alternatives to 1A; in 1G, the ordering of insertions and deletions may be considered irrelevant for many purposes; and the placement of the second gap in 1H admits some uncertainty.
If the alignment likelihood can be represented as a path probability through a pair HMM, F, then we can perform this sum over alignments using the forward algorithm (Durbin et al. 1998), writing the result as This paper focuses on the probability distribution of alignment gaps. In general, when we refer to a gap, we will mean a run of adjacent indels in any order, as in 1G. Because of the possibility of overlapping indel events, as in 1C, these gaps can arise in a number of different ways.
The general geometric indel model Our starting point for defining an evolutionary process is the point substitution model, applied to a sequence. In such a model, each residue evolves according to a substitution rate matrix R, such as Kimura's two-parameter model for DNA (Kimura 1980) or Dayhoff's PAM model for proteins (Dayhoff et al. 1978).
We generalize this by allowing instantaneous insertion and deletion events as well as point substitution events. We do not want to be forced always to count the insertion of multiple adjacent residues as separate events (as in 1D), since this leads to inferential artifacts such as trajectories with too many events, alignments with scattered gaps, rates that are too fast, or times that are too long. Consequently, our model should allow events that insert or delete multiple residues instantaneously (as in 1B), with the indel length being a random variable.
For simplicity, we want to keep the number of parameters minimal, so we specify only the mean lengths of insertion and deletion events. The maximum entropy distribution for this parameterization is the geometric distribution. Thus, the probability that a given event involves n residues is x n21 1 2 x ð Þfor an insertion, and y n21 1 2 y ð Þ for a deletion, with mean lengths 1= 1 2 x ð Þ and 1= 1 2 y ð Þ. If the rate of insertions is l and the rate of deletions is m, then, at any given site, an event that inserts n residues has rate lx n21 1 2 x ð Þ3 P I 1 . . . I n ð Þ , where I 1 . . . I n represents the actual n residues that were inserted, while an event that deletes n residues has rate my n21 1 2 y ð Þ. So, for example, the instantaneous event EALGVK/EALGKLGVK in the history shown in Figure 1E, which occurs during the time interval [u, u + du) and inserts the three residues GKL, has instantaneous rate lx 2 1 2 x ð Þ3 P GKL ð Þ and infinitesimal probability lx 2 1 2 x ð ÞP GKL ð Þdu. The inserted residues are independently drawn from the stationary distribution of the substitution model, so P I 1 . . . I n ð Þ¼ Q n k¼1 r I k where rR = 0. Thus, P GKL ð Þ¼ r G r K r L . By contrast, deletion rates are completely independent of the residues being deleted.
To summarize, the parameters of our indel model are Q ¼ ðl; m; x; y; RÞ consisting of indel rates (l, m) and indel length parameters (x, y), together with a substitution rate matrix R. We call this model the general geometric indel (GGI) model, following De Maio (2020). The GGI model is the simplest continuous-time Markov chain over sequences that is local, allows multiresidue indels, and does not enforce reversibility. We may contrast the locality with the Poisson indel process, where the indel rate per site varies inversely with the total sequence length (Bouchard-Côté and Jordan 2013). As for multiresidue indels, other models such as TKF92 do allow this, but they do so by introducing unobservable auxiliary information into the state space; specifically, TKF92 introduces fragment boundaries. We can further constrain the parameters in various ways if desired; for example, by insisting that the model be reversible ly 1 2 x ð Þ¼mx 1 2 y ð Þ ½ , as with the long indel model of Miklós et al. (2004); or by requiring perfect symmetry between insertions and deletions (l = m and x = y), as in the simulations of De Maio (2020); or by restricting indels to single residues (x = y = 0), so all trajectories look like Figure 1D, as with the TKF91 model of Thorne et al. (1991).

Derivation of alignment likelihoods from indel processes
In the previous section, we described the GGI model with instantaneous rates (l, m) and extension probabilities (x,y). We now review previous approaches to calculating alignment gap likelihoods under this model and related models. These methods include the pair HMMs we evaluate in this paper: TKF91 (Thorne et al. 1991), TKF92 (Thorne et al. 1992), MLH04 (Miklós et al. 2004), LG05 (Löytynoja and Goldman 2005), RS07 (Redelings and Suchard 2007), LAHP19 (Levy Karin et al. 2019), and DM20 (De Maio 2020).
All of these methods exploit the property of the GGI model that the indel and substitution processes are independent of one another. A pairwise alignment (1H) has a gap profile (1I) that is like a residue-masked silhouette of the alignment, comprising three types of column: matches (M), in which ancestral and descendant residues are aligned, and insertions (I) and deletions (D), in which either ancestor or descendant contains a gap. We can factorize the alignment likelihood into a term for the gap profile (written as a sequence of M's, I's, and D's) and a conditionally independent set of terms for the actual residue content: Figure 1 Three views of evolutionary processes-alignments, trajectories, and histories-represent different levels of summarization. Alignments include no information about intermediate events except the positions of homologous residues; trajectories include intermediate sequences and the transition events between them, but not the times at which those events occurred; histories include transition events and times. Panels are illustrated using examples from the HOMSTRAD database (Mizuguchi et al. 1998) (J) A multiple alignment whose misaligned gap boundaries do not seem to support the TKF92 model's assumption that multiresidue gaps arise from indivisible sequence fragments.
Here, M XY is the probability that a descendant residue is Y, conditional on the ancestor being X; while r Y is the probability that an inserted residue is Y. Since deletion events are residueblind in the GGI model, and we have already conditioned on the ancestral sequence, we do not to include terms for the probability that the two deleted ancestral residues are N and K; the gap profile tells us what positions the deleted residues were at, and that is enough.
This decomposition of indel and substitution probabilities is naturally expressed in terms of a pair HMM with M, I, and D states. We can think of the gap profile term as the probability of the state path through the HMM, while the substitution terms correspond to the emission probabilities from those states. The emission part is well understood (Thorne et al. 1991): M and r can be linked to an underlying point substitution rate matrix R [by the matrix exponential M = exp(-Rt) and the stationary distribution rR = 0]. Our focus is on the likelihood of the gap profile: we seek a similar relationship F t ð Þ ' exp ℝt ð Þ between the transition probabilities of the pair HMM, F t ð Þ, and the GGI model's rate matrix over sequences, ℝ.
We now review previous work in this area.
TKF91: The first approach, TKF91, addresses a restricted version of the GGI model allowing only single-residue indels. This reduces to a linear birth-death process, which can be solved exactly (Thorne et al. 1991). The probability distribution over alignments can be represented as a pair HMM (Holmes and Bruno 2001). Being exactly solvable, TKF91 has become the canonical example of an indel model. However, as noted previously, it leads to systematic biases during inference, imputing trajectories with too many events, as in Figure 1D.
TKF92: Attempting to address the deficiencies of TKF91, the TKF92 model (Thorne et al. 1992) posits a similar birth-death process, but on indivisible multiresidue fragments instead of single residues. Each fragment contains a random, geometrically distributed number of residues. TKF92 has a closed-form pair   Table 4, models alignments at divergence time t + Dt. It is the machine product of F t ð Þ and G Dt ð Þ: each state has the form XY where X is an F state and Y is a G state. Uppercase is used to indicate that a component machine makes a transition when the compound state is entered. So, for example, when FG makes the transition mI / MM, the transition weight is the product of a (for F's self-looping M / M transition) and 1 2 x (for G's I / M transition). However, if then makes the transition MM / Dm, the transition weight is just c (for F's M / D transition), since G stays in the M state without making a transition. This structure arises from simple rules for transition synchronization in multiplied machines (Westesson et al. 2011(Westesson et al. , 2012. HMM solution, rather like TKF91, but with the introduction of a new parameter corresponding to the mean fragment length. However, TKF92 is also somewhat unrealistic in practice. The idea that multiresidue gaps arise from unbreakable fragments is artificial, as can be illustrated with reference to the multiple sequence alignment of Figure 1J. The gap boundaries in (1J) do not align, and this is not uncommon: empirically, there is no evidence that TKF92's indivisible fragments are real. In practice, when using TKF92, it is common to simply marginalize over the fragment boundaries, effectively treating TKF92 as an ad hoc approximation to the GGI model. Therefore, by defining a suitable mapping between TKF92's fragment parameters and the GGI model's indels, we can evaluate it on this basis, as an approximation to GGI.
LG05 and RS07: Similarly, the LG05 pair HMM used in the PRANK program (Löytynoja and Goldman 2005) and the RS07 pair HMM used in BAliPhy (Redelings and Suchard 2007) introduce fragment length parameters that can be related (with some hand-waving) to the indel length parameters of GGI. In this paper, we evaluate TKF91, TKF92, LG05, and RS07 as approximations to GGI, but we do not evaluate some other indel models that are a little harder to reconcile with GGI because of extra parameters (Rivas and Eddy 2015) or incompatible assumptions (Bouchard-Côté and Jordan 2013).

MLH04:
The MLH04 approach, developed by Miklós et al. (2004), computes lower-bound likelihoods for alignment gaps by considering short trajectories like those in Figure 1, B-D, integrating out the event times from the corresponding histories to find a likelihood for each such trajectory. To calculate the likelihood of an alignment gap, MLH does a brute-force exhaustive enumeration of all consistent trajectories, up to a given number of indel events and a maximum gap length. Under the assumption of an infinite sequence, the resulting distribution is technically still a pair HMM, albeit one with an infinite number of states (corresponding to every possible size of gap). As our simulations in the Results section demonstrate, this approach is extremely slow, and effectively impossible for trajectories with more than three overlapping indel events; however, for very short evolutionary times, MLH04 remains the most accurate approximation to GGI, short of direct simulation.
In the special cases of TKF91 and TKF92, the alignment gap lengths are geometrically distributed. This is not necessarily true in general for the GGI model (Rivas and Eddy 2015): alignment gap lengths are not geometrically distributed even though the underlying indel event lengths are. Thus, a simple three-state pair HMM-whose gap lengths are geometrically distributedcannot be an exact solution to GGI. Nevertheless, MLH04 shows that the exact solution is, in fact, an infinite-state pair HMM, so a smaller pair HMM may be a reasonable approximation.
LAHP19: A purely simulation-based approach to estimating the gap probabilities of the GGI model has recently been described (Levy Karin et al. 2019). In the limit of an infinite number of random trials, this approach is exact. We use such simulations as a gold standard to evaluate other approximations. However, the number of trials required to sample rare outcomes (i.e., long gaps, particularly those involving multiple-event trajectories) is large, and the simulations become computationally expensive with longer sequences. The performance and sampling limitations of this approach are further discussed in the Results.

DM20:
The DM20 method is a recent breakthrough in approximating the GGI model (De Maio 2020). Starting from the assumption that the alignment likelihood can be approximated by a product of geometric distributions over insertion and deletion lengths, De Maio derived ordinary differential equations (ODEs) for the evolution of the mean lengths of these distributions, yielding transition probabilities for the pair HMM. DM20 is a more accurate approximation to the multiresidue indel process than all previous attempts, although it has limitations: it does not allow deletions to directly follow insertions in the alignment (thus limiting its ability to model covariation between insertion and deletion lengths), it is inexact for the special case of the TKF91 model, and it requires laborious manual derivation of the underlying ODEs.
H20: The H20 method, developed in this paper, builds on DM20 to develop a systematic differential calculus for finding HMM-based approximate solutions of continuoustime Markov processes on strings that are "local" in the sense that the infinitesimal generator is a pair HMM. Our approach addresses the limitations of DM20, identified in the previous paragraph. It does allow deletions to follow insertions, so as to better account for covariation between insertion and deletion gap sizes. The TKF91 model emerges as a special case: the closed-form solutions to TKF91 are also exact solutions to our model. Finally, although our equations can be derived without computational assistance, the analysis is greatly simplified by the use of symbolic algebra packages, both for the manipulation of equations, for which we used Mathematica (Wolfram Research, Inc.) (version 2020), and for the manipulation of state machines, for which we used our recently published software Machine Boss (Silvestre-Ryan et al. 2020). Table 1 Interpretation of states in machine G Dt ð Þ ( Figure 2B, defined in Infinitesimal-time machine); here, v in ; v out 2 V represent input and output tokens from the residue alphabet The central idea of our approach is that the application of the infinitesimal generator to the approximating HMM generates a more complicated HMM that, by a suitable coarsegraining operation, can be mapped back to the simpler structure of the approximating HMM. By matching the expected transition usages of these HMMs, we derive ODEs for the transition probabilities of the approximator. Our approach is justified by improved results in simulations, yielding greater accuracy and generality than all previous approaches to this problem, including DM20 (which can be seen as a restricted version of our method). Our approach is further justified by the emergence of the TKF91 model as an exact special case, without the need to introduce any additional latent variables such as fragment boundaries.
While we focus here on the multiresidue indel process, the generality of the infinitesimal automata suggests that other local evolutionary models, such as those allowing neighbordependent substitution and indel rates, might also be productively analyzed using this approach.
The sequence rate matrix and the infinitesimal-time machine: We now give a concise preview of the approach described in detail in the Materials and Methods.
The rate matrix ℝ of the GGI model is, for two sequences FðM/IDN/XÞ: where V is the residue alphabet (e.g., nucleotides or amino acids), V* is the set of all sequences over that alphabet (including the empty sequence e), V N is the set of all sequences of finite length N, B is the deleted sequence, C is the inserted sequence, and A; D 2 V * are flanking sequences (we will mostly be considering the infinite-sequence approximation, where X; Y; A; D 6 ¼ e).
Suppose that c t ð Þ 2 V * is a sequence evolving under the GGI model. Consider G Dt ð Þ, the pair HMM defined in Figure  2B and Table 1. Assuming c (t) is infinitely long, the forward algorithm for G Dt ð Þ computes the conditional distribution over an instant of evolutionary time: where I is the identity matrix over sequences ( We match expected transition counts between classes of states in F t ð ÞG Dt ð Þ to their representative transitions in Figure 3 Versions of the first two pair HMMs of Figure 2 that include start and end states, and so can be used for finite sequences. As in Figure 2, match states (s M ) are orange, insert states (s I ) are green, delete states (s D ) are red, and null states (s N ) are uncolored. (A) A version of GðDt) with start and end states, where deletion rates at the end of the sequence are elevated by a factor 1=ð1 À yÞ so that the total rightward deletion rate at any residue is m, independent of its distance from the end. (B) A version of FðtÞ with start and end states.
F t þ Dt ð Þ, and take the limit Dt / 0 to derive differential equations for the transition weights of A 0 ¼ Q ðMIDN/YÞ . A note on finite sequences: to model these we can set ℝ AB;A ¼ my N21 for B 2 V N , dropping the 1 2 y term for deletions that remove the end of the sequence. This ensures the total rightward deletion rate starting at any residue is m, regardless of its distance from the end. We can then define G Dt ð Þ as in Figure 3A. Imposing reversibility on finitesequence models takes slightly more care (Miklós et al. 2004). The general geometric indel model Introduction TKF91 The links model, a special case of GGI for single-residue indels Thorne et al. (1991) TKF92 Sequel to the TKF91 model allowing multiresidue indels Thorne et al. (1992) MLH04 Approximation to GGI that enumerates short trajectories Miklós et al. (2004) LG05 Pair HMM used by PRANK alignment software Löytynoja and Goldman (2005) Figure 2B and Table 1 F t ð ÞG Dt ð Þ Automata product of F t ð Þ and G Dt ð Þ. Abbreviated to FG Figure 2C and Table 4 Finite-State, Continuous-Time Machines

Materials and Methods
As noted in the Introduction, our approach makes use of pair HMMs. We assume some familiarity with these models, which are standard in bioinformatics. A tutorial introduction can be found in Durbin et al. (1998).
The pair HMMs we will use are normalized for the conditional probability P descendant ancestor j ð Þ [rather than for the joint distribution P ancestor; descendant ð Þ as is often seen in the bioinformatics literature]. Such conditionally normalized pair HMMs are sometimes called input-output automata, or transducers. Rather than simultaneously generating two output sequences, these state machines read a specified input sequence and generate a probabilistic output. Following the convention of Durbin et al. (1998), these pair HMMs have Match (input-output), Insert (output-only), and Delete (input-only) states corresponding to the M, I, and D columns in a pairwise alignment (see e.g., Figure 1I), as well as Null (N) states that do not input or output anything.
A key result enabling our approach is that two automata A; B ð Þcan be multiplied together: the output of A is fed into the input of B. The serial operation of both machines can be represented by a single compound machine AB, constructed algorithmically from the two component machines. The algorithm takes a Cartesian product of the two machines' state spaces, then synchronizes transitions in this joint space so that each output-writing transition of A coincides with an input-reading transition of B, with some ordering of updates so that indels are not double-counted. The algorithms for doing these multiplications are published (Westesson et al. 2011(Westesson et al. , 2012, and software implementations are available (Silvestre-Ryan et al. 2020).
For our purposes, the only machine-multiplication that we need is the one shown in Figure 2. A three-state machine representing a distribution over finite-time pairwise alignments [F t ð Þ, Figure 2A] is multiplied by another threestate machine representing the action of the GGI model over an infinitesimal time interval [G Dt ð Þ, Figure 2B], yielding a nine-state machine [F t ð ÞG Dt ð Þ, Figure 2C]. In the following sections, we show that F t ð ÞG Dt ð Þ can be systematically mapped back to F t þ Dt ð Þby a coarse-graining operation that involves finding the expected number of transitions of each type (I/I, I/D, etc.) in walks through the state space that begin and end in the M state. In the following sections, we describe how to calculate these expectations, with expository examples relating to F t ð Þ, which we then define in detail. We then apply this to map F t ð ÞG Dt ð Þ back to F t þ Dt ð Þ, and, by taking the limit Dt / 0, derive differential equations for the transition probabilities of F t ð Þ. Finally, we show that this reduces correctly to the TKF91 model when x = y = 0. Table 2 gives a glossary of mathematical terms used throughout this section. In the Supplemental Material, we give some additional lemmas and a conjecture relating this work to the expectation-maximization algorithm for parameter estimation.

Expected transition usage
Suppose that we have a pair HMM, M, with K states that can be partitioned into match s M , insert s I , delete s D , and null s N states. As a shorthand we will write s ID for s I [ s D , and so on. Thus s MIDN is the complete set of K states.
We will be considering models with only one match state, which, by convention, will always be the first state, so Þdenote the set of state paths with the following properties: The path begins in an s X state; The path ends in an s Z state; For paths with more than just a begin and end state, the intermediate states are all s Y states.
Let J X/Y ð Þ be a matrix that selects transitions from s X to s Y,   3 Interpretation of states in machine F t ð Þ (Figure 2A, defined in Three-state HMM); here, v in ; v out 2 V represent input and output tokens from the residue alphabet

State
Name Class On entry Input Output Figure 2A

I. Holmes
Consider a random walk f 2 F M/IDN/M ð Þ that begins and ends in the match state, passing only through nonmatch states in between. Let Þgj count the number of transitions from X states to Y states if null states are removed from the walk. In other words this is the number of subpaths of f that go from a state in s X , via zero or more null states, to a state in s Y .
To find the expected value of T XY in walks that begin and end in the match state, we break down paths into three segments: M to X (via insert, delete, and null states), X to Y (via null states), and Y to M (via insert, delete, and null states). The first and last segments are only required if M 6 ¼ X and Y 6 ¼ M. The corresponding sets of state paths are To sum over all paths in a given set F X/Y/Z ð Þ , we can use the geometric series formula B the effective X / Z transition probabilities are the nonzero entries of C ¼ Q We can further simplify the formulae, for example by noting that Using the methods of the previous paragraphs, the expectation of T XY is Let S X f ð Þ be the number of X states in f, excluding the final state. Thus, Three-state HMM Consider the machine F t ð Þ shown in Figure 2A with with a + b + c = 1, f + g + h, and p + q + r = 1. Table 3 gives the emission probabilities.
Here, t will play the role of a time parameter. Let where the expectations are as defined in (1) By (1), The essence of our approach is to use transducer composition to study infinitesimal increments in T XY t ð Þ, and thereby obtain differential equations that can be solved to find these parameters.
Continuing with the example of machine F t ð Þ in Figure 2A, the state path (1, 2, 2, 2, 1) has state types (M, I, I, I, M ) and thus transitions (MI, II, II, IM), so the transition counts are T MI = T IM = 1 and T II = 2. For an example of a machine with more complex structure including null states, consider F t ð ÞG Dt ð Þ of Figure 2C. A state path 1; 2; 3; 9; 9; 5; 1 ð Þ through this machine has state types (M, I, I, N, N, D, M). When we remove the null states, this becomes (M, I, I, D, M) and so the transitions are (MI, II, ID, DM). Thus the transition counts are T MI ¼ T II ¼ T ID ¼ T DM ¼ 1.

Infinitesimal-time machine
The infinitesimal transducer G Dt ð Þ of Figure 2B has states See Table 1 for emission probabilities. This describes the GGI model as defined in the introduction. The model parameters are the insertion and deletion rates (l, m) and extension probabilities (x, y), and the time interval Dt ( 1= l þ m ð Þ.

Rate of change of expected transition counts
Composing F t ð Þ (Figure 2A) with G Dt ð Þ ( Figure 2B) yields F t ð ÞG Dt ð Þ, the machine of Figure 2C. This machine has states s M ¼ 1 f g, s I ¼ 2; 3; 4 f g, s D ¼ 5; 6; 7; 8 f g , s N ¼ 9 f g, and transition matrix Table 4 describes the interpretation of each state, and its emission probability distribution. The transducer composition F t ð Þ 3 G Dt ð Þwas performed using the automata algebra program Machine Boss (Silvestre-Ryan et al. 2020). A general procedure for doing this for any two machines A; B involves taking the Cartesian product of the two machines' Here, v in ; v out ; v thru 2 V represent input, output, and pass-through tokens from the residue alphabet. Each state has the form XY where X is an F state and Y is a Gstate. Each transition of FG can involve an F-transition, a G-transition, or both. Uppercase (XY) is used to indicate that a component machine makes a transition when the compound state is entered; lowercase (xy) indicates the component machine makes no transition. Thus, transitions into MM; IM; MD; Dm; Dd; Di; ID f g involve an F-transition; transitions into MM; mI; IM; iI; MD; ID f g involve a G-transition. This structure arises from simple rules for transition synchronization in multiplied machines (Westesson et al. 2011(Westesson et al. , 2012. By these rules, G can only make an input-reading transition when F makes an output-writing transition, and vice versa. So, for example, when FG makes the transition mI / MM, what happens internally is that F makes a self-looping M / M transition while G makes an I / M transition, and an (unobserved) token v thru is passed through from F to G. However, if FG then makes the transition MM / Dm, internally F makes a M / D transition without outputting anything, so G just stays in the M state without making a transition. state spaces and then synchronizing their transitions so that the output of A drives the input of B. This ensures that, if M XY represents the result of the forward algorithm for machine M (with null states eliminated) and sequences X,Y, then AB ð Þ XZ ¼ P Y Figure 5 Moments of the gap length distribution P (S I ,S D ) as revealed by simulation, compared to the predictions of various approximate methods. The approximation methods are TKF91 (Thorne et al. 1991), TKF92 (Thorne et al. 1992), MLH04 (Miklós et al. 2004), LG05 (Löytynoja and Goldman 2005), RS07 (Redelings and Suchard 2007), and DM20 (De Maio 2020), reviewed in the Introduction; and H20 (the present method), defined in the Materials and Methods. The simulation procedure is defined in the Results. The parameter range explored is l = m = 1, x = y = 0.5, and 2 27 # t # 2 1 , corresponding to panel A of Figures 4 and 6. Panel A shows the expected insertion length as a function of time, and panel B shows the probability that there is no gap between adjacent ancestral residues.
The parameters (a, b, c, f, g, h, p, q, r) are defined by (3) for The remaining counts are obtained from (2): The expected state occupancies are governed by the following equations: TKF91 model When x = y = 0, our model reduces to the TKF91 model (Thorne et al. 1991). Holmes and Bruno (2001) showed that the solution to the TKF91 model can be expressed  Figure 2A, with parameters It can readily be verified that this is an exact solution to Equation (3) through Equation (7), when x = y = 0. Thus, our model reduces exactly to the TKF91 model when indels involve only single residues. In this case, the equivalence F t þ Dt ð Þ¼F t ð ÞG Dt ð Þ is exact in the limit Dt / 0.

Using the model for alignment
To apply this model to sequence alignment, we need to specify a start and end state for G, rather than implicitly assuming infinite-length sequences as we have done up to this point.
A version of G with start and end states is shown in Figure  3A. This can be carried throughout the analysis by also specifying start and end states for F and deriving differential equations for the transitions involving these states. Since this complicates the formulae considerably, we have omitted it. Instead, we propose a heuristic modification of F that includes ad hoc transitions from/to start and end states, shown in Figure 3B.

Parameterizing the model
In principle, the likelihood function is sufficient to parameterize the model: we can compute the gradient numerically to locate the maximum-likelihood parameters, or use Markov Chain Monte Carlo sampling to find the posterior.
In practice, it might be more efficient to use an expectationmaximization algorithm tailored to this model. In the Supplemental Material, we conjecture that a simple expectationmaximization algorithm does exist for this model, and we outline one way it might be arrived at.

Data availability statement
Our code implementing this model is available under an opensource license at https://github.com/ihh/trajectory-likelihood/ tree/benchmark. Figure 7 The running time of the MLH04 method is prohibitive and increases steeply for long gaps, since it requires the explicit enumeration of all intermediate indel states in the trajectory (Miklós et al. 2004). In this plot, each data point is an average of 100-1000 repetitions on a late-2014 iMac (4GHz quad-core Intel i7 CPU, 32GB 1600MHz DDR3 RAM). The running times of DM20 and H20 are negligible in comparison (, 1 ms).
File S1 (gzipped tarball) contains JSON files describing the state machines in Figure 2, a Makefile and short script for manipulating these state machines, a Mathematica notebook deriving the equations in this paper, and a text file listing the contents of the tarball. Supplemental material available at figshare: https://doi.org/10.25386/genetics.13040585.

Results
We implemented the H20 likelihood calculations described in the Materials and Methods and those of the models (TKF91, TKF92, MLH04, LG05, RS07, and DM20) reviewed in the Introduction. We also implemented a simulator for the underlying indel process, similarly to LAHP19 but using a succinct (run-length encoded) representation of the alignment. We performed simulations at various parameter settings and calculated summary statistics for all methods, including the relative entropy from the simulated to approximate joint distribution P (S I , S D ) and various marginals and moments of this distribution. Our implementations and simulation results are available at https://github.com/ihh/trajectory-likelihood/ tree/benchmark.
In Figures 4-6, we show summary statistics for several sweeps of the GGI model parameters (l, m, x, y, t) in the ranges 2 27 # t # 2 1 , 0 # x # 0.7, 0 # y # 0.65, 0 # x # 0.8 with y = x, 0 # x # 0.7, 0 , l # 1, and 0 , m # 1. All sweeps are based around the point l = m = 1, x = y = t = 0.5, which was chosen to be indel-symmetric (l = m and x = y) and representative of amino acid indel lengths: the setting y ' 0:5 is consistent with a previous maximum-likelihood estimate based on indel lengths in the HOMSTRAD database (Miklós et al. 2004). These parameter sweeps have the effect of exploring each parameter individually, as well as (in the case of the x-sweep where y = x) varying the length of insertions and deletions simultaneously.
To map the GGI model parameters onto those of the models being evaluated, we used the mappings defined by De Maio (2020), with some adjustments to allow for cases where insertions and deletions were asymmetric, which were not reported by De Maio. These mappings are defined in the Supplemental Material. Our criteria for evaluating a model was based on the obviousness of these mappings; so, for example, we did not include models significantly more parameter-rich than GGI (Rivas and Eddy 2015), where to define a mapping would have required so many choices as to have created a new model.
In each experiment we performed N simulations on a sequence of length L and estimated gap sizes up to G residues, discounting gaps at the end of the sequence (which have different statistics). In most cases these settings were L = 10 3 and G = 10 2 , with N = 10 7 for t , 2 25 , N = 10 6 for 2 25 # t , 2 24 , and N = 10 5 for 2 24 # t , 1. For t $ 1, and also for y . 0.6, we set L = 10 5 , G = 10 3 , and N = 10 2 . The higher N at low t was to ensure sufficient sampling of infrequent events over short evolutionary timespans. The higher values of (G, L) at high (t, y) were to mitigate end effects as deletions become longer (which happens as t or y get large). Because of the O L ð Þ time complexity of finding a position in a sequence and then inserting or deleting elements, the simulation time is O NL 2 t À Á yielding O NLt ð Þ indel events. Thus, when increasing L, we also reduced N. The time required for simulation handily dominated the total CPU time taken by the experiment, in almost all cases; the longest-running data points of Figure 4A each required over 10 min on our late-2014 iMac (4GHz quad-core Intel i7 CPU, 32GB 1600MHz DDR3 RAM). Even with these run times, the observed counts were zero for a majority of the (S I ,S D ) tuples in many cases. This illustrates a fundamental problem with the purely simulation-based LAHP19 approach; running it enough times to sample all cases is impractical. This may be one reason why the authors of LAHP19 limited the maximum gap length G to 50 residues (Levy Karin et al. 2019). A potential solution to this problem would be to use a limited sample to fit a parametric model, although we have not explored that approach. Of course, other indel simulation programs may run faster than ours.
For the MLH04 method, we limited G to 30 in all cases, since it takes impractically long to calculate likelihoods of longer gaps. This is illustrated by Figure 7, which plots the runtime of MLH04 as a function of gap length. To calculate likelihoods of gap lengths up to 30 residues at a particular parameter setting, MLH04 takes roughly 10 sec on current desktop hardware, which is vastly slower than the microsecond-scale runtime of all other methods (with the exception of direct simulation, which requires many repetitions to achieve statistical accuracy). Because we only considered shorter gaps for MLH04, when calculating the relative entropy from the simulated gap length distribution, we used the truncated distribution P S I ; S D S I # G; S D # G j ð Þ to avoid infinities that would otherwise occur due to MLH04 assigning zero probability to longer gaps. Figure 4 shows the relative entropy (Kullback-Leibler divergence) between the simulated and various approximated distributions for the six parameter sweeps: t, x, y (separately and together), l, and m. Starting with the time sweep in Figure 4A, we see a pattern that was broadly repeated in time sweeps across other parameterizations (data not shown): at very short times, the MLH04 trajectory-enumerating approximation is the best fit to the simulated distribution (with the proviso that it can only handle short gaps, and takes significant time to compute). At these low times, the DM20 and H20 approximations are almost indistinguishable, and are the second-best fit; TKF92 is the next best after that, followed by the PRANK and BAliPhy HMMs. However, when t gets large enough (in Figure 4A it occurs around t * 0:4), the divergence of MLH04 shoots up, to the point where it quickly becomes the worst or second-worst fit. This is presumably because, at higher t, there is a significant probability of having more than three events in the trajectory (MLH04 is limited to three events due to the combinatorial complexity of enumerating longer trajectories). This leaves DM20 and H20 as the best methods. Shortly after this point, DM20 and H20 start to separate, so that H20 has a slight edge over DM20. Meanwhile, LG05 (the PRANK HMM) is not defined for all parameterizations, since it contains probabilities proportional to 1 2 2d where d ¼ 1 2 exp 2 lt 1 2 x À Á . Thus, at high enough t or x, we have d . 0.5 and so the probabilities become negative. The RS07 HMM (used by BAliPhy) remains defined but becomes inaccurate at high t. It should be noted that the time values mt . 1 2 y correspond to a scenario where the sequence is saturated with deletions, which may be of less relevance to many applications in alignment and phylogenetics. However, it is at the borderline of this region where the differences between the methods are most pronounced.
Seeking to explore these differences further, we varied x and y in Figure 4, B and C. The general trend is consistent: H20 and DM20 are the most accurate methods, LG05 is not defined for most of this parameter regime, and the other methods cluster in a pack, affected to varying degrees by the parameter sweep. TKF92 (which models multiresidue indels) is consistently a better fit than TKF91 (which does not), as might be expected. Both TKF91 and TKF92 explicitly assume reversibility between insertions and deletions, and appear to be quite strongly affected by deviations from symmetry.
When x and y are varied together, as in Figure 4D, we see quite different behavior across the different approximation methods. In the special case x = y = 0, when indel events can include only a single residue, the GGI model is essentially identical to TKF91. Unsurprisingly TKF91, TKF92, and H20-which all admit exact solutions in this special casehave effectively zero relative entropy. By contrast, the MLH04 model performs very poorly when x = y = 0, since this parameterization requires at least K separate events to explain a gap of length K, and so MLH04 assigns zero probability to any gap of length $3, yielding an infinite Kullback-Leibler divergence from the simulated distribution. As soon as x and y become nonzero, the MLH04 divergence becomes finite again, as (from the other direction) do those of TKF91, TKF92, and H20. All relative entropies continue to rise as the mean indel length increases, MLH04 rising most slowly. Eventually, H20's error approaches DM20's error from below. At x * 0:78, the RS07 probabilities-which drop off rapidly with increasing gap length-are rounded to zero by floatingpoint precision errors, including for some gap lengths that are reached by the simulation, leading to infinities in the relative entropy for RS07.
Varying l and m (Figures 4, E and F) reveals that H20, DM20, TKF92, and MLH04 rise monotonically in inaccuracy as these rate parameters increase from zero. TKF91 is a worse approximation than all these methods when l = 0 or m = 0, but stays mostly flat as they are increased, to the point where it eventually beats MLH04 as an approximation. RS07's inaccuracy decreases monotonically with l, but has a minimum as a function of m.
LG05, as with the other sweeps, performs weakly and is only defined for part of the parameter regime. One notable point is that H20 and DM20 have almost identical accuracy when l = 0 or m = 0, but diverge as those rates increase, with H20 performing better than DM20.
To summarize Figure 4: at all parameterizations, H20 is more accurate than DM20, and is the most accurate of all the three-state pair HMMs. H20 is outperformed only by MLH04, and then only at very low values of t and for very short gaps (taking a very long time to compute). In the parameter range where most alignment occurs (mt ( 1), DM20 and H20 are roughly equivalent; they diverge as gaps become very frequent. Of the closed-form approximations, TKF92 is most accurate, but is significantly less reliable than H20.
To understand these differences better, we examined moments of the simulated and approximate distributions. These are plotted in Figures 5 and 6. Figure 5A plots the expected insertion length, focusing on the t-parameter sweep (similar trends were apparent in the other sweeps). All the methods that find explicit closed-form formulae for the expected insertion length as a continuoustime process (which is to say TKF91, TKF92, DM20, and H20) show an exact fit to the simulated distribution, while RS07 deviates more significantly, and LG05 as previously noted is only defined for part of the time range. MLH04 underestimates the expected insertion length significantly after t * 2 23 , again presumably because MLH04 is a poor approximation when there are multiple expected indel events per observed alignment gap. Figure 5B plots the probability that adjacent ancestral residues have no gap between them, P(S I = S D = 0). As with Figure 5A, this is plotted as a function of time; and once again, DM20 and H20 closely match the simulation, while RS07 is an overestimate. However, for this statistic (unlike the expected insertion length), MLH04 and (where defined) LG05 are as accurate as DM20 and H20, while TKF91 underestimates the probability and TKF92 overestimates it.
To summarize the results plotted in Figure 5, the expected insertion length and empty-gap probability are uninformative as to the differences between DM20 and H20: both methods seem to get these statistics right. However, noting that the main difference between the DM20 and H20 is that DM20 does not allow transitions back and forth between the I and D states, instead requiring deletions to precede insertions, we might expect the covariance between insertion and deletion lengths to be a better diagnostic. Figure 6 plots the insertion-deletion covariance for all parameter sweeps, using the same ordering of subfigures (one per parameter sweep) as the relative entropy plots in Figure  4. What we see in these plots is that the relative accuracy of the covariances for DM20 and H20 closely tracks the relative entropies of their gap length distributions. In the time sweep, DM20 and H20 perform identically at low times, but gradually diverge; in the x and y sweeps, they are separated throughout the parameter range; and in the l and m sweeps, they are close when the rate parameter is zero, but then diverge rapidly. The most significant single factor determining the covariance between insertion length S I and deletion length S D is the joint probability P(S I = S D = 0) that both are zero; however, Figure 5B shows that both DM20 and H20 basically get this probability correct. The most obvious remaining factor that might explain the different covariances is the absence of an I / D transition in the DM20 pair HMM, which De Maio explicitly suggested might be a potential limitation on the accuracy of DM20 (De Maio 2020). Thus, we propose this as an explanation for the improved performance of H20, while noting that this improvement is relatively small.
As for the covariance of the other methods, the results in Figure 6 are broadly consistent with Figure 4, although they do not provide as a coherent single explanation of the differences between methods, as is the case when comparing DM20 and H20. Two points are worth remarking on. First, in Figure 6A, the covariance of MLH04 dips sharply around t * 2 23 , and in fact eventually becomes negative (although it is not shown on this plot, due to the logarithmic y-axis). The negative covariance can be explained by MLH04's restriction to a small finite number of indel events in any given trajectory, which implies that every insertion event is one event that cannot be a deletion, and vice versa. The other point worth noting is that, as is the case with the relative entropies, Figure 6D clearly shows that TKF91, TKF92, and H20 are an exact fit to the simulated data when x = y = 0, which reduces the GGI model to the TKF91 model.
We can summarize all the simulation results as follows. In virtually all cases, H20 is the most accurate of pair HMM methods, outperformed only by MLH04 when the rate of indels is slow enough that there is negligible probability that an observed alignment gap can be explained by multiple overlapping indel events. DM20 is very close behind H20; the differences between DM20 and H20 appear to be explainable in terms of H20's better modeling of covariation between insertion and deletion lengths, which is probably attributable to an additional I / D transition in the H20 pair HMM.

Discussion
We have shown that an evolutionary model which can be represented infinitesimally as an HMM can be formally connected to a pair HMM that approximates its finite-time solution. This may be viewed as an automata-theoretic framing of the Chapman-Kolmogorov equation.
Ours is a coarse-graining technique. It is generally the case that composing two state machines will yield a more complex machine, since the composite state space is the Cartesian product of the components. We approximate this more complex machine with a renormalizing operation that eliminates null states and maps the remaining states of each type back to a single representative state in the approximator.
We used this approach to derive ODEs for the transition probabilities of a minimal (three-state) pair HMM that approximates a continuous-time indel process with geometrically distributed indel lengths. We have implemented numerical solutions to these equations, and demonstrated that they outperform the previous best methods. The improvement in accuracy over DM20, the closest method, is small but significant; further, our approach puts the process of deriving the approximation on a more systematic footing. In the Supplemental Material, we also conjecture similar ODEs for the posterior expectations of the sufficient statistics that would be required to fit this model by expectation maximization, although we have not yet tested this approach.
Point substitution models are the foundation of likelihood phylogenetics. There is, additionally, a substantial literature combining such models with HMMs and stochastic contextfree grammars, for the purposes of genome annotation and other sequence analysis. Indels are a potential annotation signal: phase-preserving indels, for example, are a common signature of protein-coding selection. As well as being useful tools to represent and (approximately) solve such models, automata theory can be used to build sampling kernels (Redelings and Suchard 2007) and reconstruct ancestral sequences (Westesson et al. 2012).
Our emphasis on the GGI model, a continuous-time Markov process defined on sequences of residues, somewhat disadvantages models like TKF92, which technically defines a process on sequences of multiresidue fragments. We have argued that there is no evidence such indivisible fragments really exist, so instead we evaluated TKF92 as an approximation to the GGI model. However, the routine usage of amino acid fragment models to predict protein tertiary structure suggests a valid counterargument: such models may usefully capture some aspects of selection. Further, TKF92 can be generalized in other ways, allowing for richer models of fragment mutation; for example, to model the evolution of RNA structure. In this context, it is promising that our method recovers TKF91 (and therefore TKF92) as special cases.
Input-output automata are well suited to modeling indels in statistical phylogenetics. It seems possible that the method of this paper might be applied to other instantaneous rate models of local evolution where the infinitesimal generator can be represented as an HMM. It is tempting to speculate that a similar approach may also be productively applied to stochastic contextfree grammars, so as to analyze RNA; or to model literary texts, phonemes, vocabularies, music, source code, bird song, or other alignable sequences that evolve by local edits over time.