Abstract
Motivation: We introduce a novel approach to multiple alignment that is based on an algorithm for rapidly checking whether single matches are consistent with a partial multiple alignment. This leads to a sequence annealing algorithm, which is an incremental method for building multiple sequence alignments one match at a time. Our approach improves significantly on the standard progressive alignment approach to multiple alignment.
Results: The sequence annealing algorithm performs well on benchmark test sets of protein sequences. It is not only sensitive, but also specific, drastically reducing the number of incorrectly aligned residues in comparison to other programs. The method allows for adjustment of the sensitivity/specificity tradeoff and can be used to reliably identify homologous regions among protein sequences.
Availability: An implementation of the sequence annealing algorithm is available at
Contact: sariel@cs.berkeley.edu
1 INTRODUCTION
The multiple alignment problem is to find all homologous amino acids, or nucleotides, among multiple sequences. This problem is similar in some respects to the protein folding problem: each multiple alignment can be evaluated according to the homology relationships it specifies in the same way that the Gibbs free energy can be computed for a protein conformation. However the differences between the problems reveal distinct fundamental hurdles that lead to very different computational problems. In protein folding the free energy of a conformation is derived from a clear understanding of the underlying physics, whereas in multiple alignment the mechanisms of evolution are not well understood leading to uncertainty over how to evaluate multiple alignments. There is another fundamental difference: while the space of protein conformations is infinite and difficult to explore completely, there are natural ‘moves’ in the space based on changing backbone torsion angles or side chain conformations. The space of multiple alignments, while also difficult to explore completely, is discrete and finite. However it has been unclear how to perform ‘small’ steps in the space, making it difficult to accurately align single amino acids or nucleotides. In this paper we show that it is possible to efficiently explore the space of multiple alignments using the smallest possible steps.
Existing programs for multiple sequence alignment are mostly based on the idea of progressive alignment (Feng and Doolittle, 1987). We refer the reader to Batzoglou (2005) for a review of this approach to alignment. Here we note that it is widely used, by first generation programs such as CLUSTALW (Thompson et al., 1994) and TCoffee (Notredame et al., 2000), as well as by more recent programs such as Alignm (Van Walle et al., 2005), MUSCLE (Edgar, 2004) and ProbCons (Do et al., 2005). In terms of the space of multiple alignments, we can think of progressive alignment as beginning with the ‘null alignment’ where no two sequences are aligned and proceeding via large steps to arrive at a complete alignment. Each step consists of aligning either a pair of sequences to each other or the alignment of a two partial alignments of subsequences to each other. This is a fundamental weakness of progressive alignment. For example, at the very first step two entire sequences are aligned, possibly incorrectly, because other informative sequences are not used. This has been addressed by iterative refinement (Gotoh, 1996), which improves on progressive alignment but still often fails to correct complex alignment errors involving multiple sequences.
Another approach to multiple sequence alignment was introduced by Morgenstern et al. (1996) and pursued in a series of papers developing the DIALIGN program (Morgenstern et al., 1998; Subramanian et al., 2005). The main idea was to address the problems of progressive alignment by incrementally aligning matching segments to each other, while preserving the consistency of the alignment. This segmenttosegment alignment approach effectively reduced the size of the steps taken by progressive alignment methods. A key ingredient was the careful formulation of the multiple alignment problem via precise definitions of partial and global multiple alignments (Morgenstern et al., 1999). We discuss these ideas in detail in Section 2.
In Section 3 we introduce the method of sequence annealing. We build alignments one match at a time, ‘annealing’ positions that are more likely to be homologous, first. Using fast algorithms for online topological ordering (Katriel and Bodlaender, 2006), we are able to rapidly construct a consistent global multiple alignment. We remove the requirement of DIALIGN so that alignment proceed by segmenttosegment comparison and allow for segments of size 1. This eliminates the need for many of the heuristics incorporated in DIALIGN. In order to correctly align single residues, we compute posterior probabilities for matches and gaps, using methods described in Holmes and Durbin (1998), Durbin et al. (1998), Do et al. (2005) and Schwartz et al. (2006).
In Section 4 we show that sequence annealing improves on all existing methods for protein multiple sequence alignment. Not only is sequence annealing more accurate, it is also very fast, even though it is based on performing very small steps in multiple alignment space.
2 ALIGNMENT POSETS
We use the notation to denote the ith element of a sequence of length n. By a set of sequence characters S = {σ^{1}, … , σ^{k}} we mean the set of n_{1} + n_{2} ⋯ + n_{k} sequence characters that form the sequences σ^{1}, σ^{2}, … , σ^{k} of lengths n_{1}, n_{2}, … , n_{k} respectively. A partial global multiple alignment of sequence characters S = {σ^{1}, … , σ^{k}} is a partially ordered set P = {c_{1}, … , c_{m}} together with a surjective function ϕ : S → P such that if i < j. The elements of P correspond to columns of the multiple alignment, and the partial order specifies the order in which columns must appear. We call P an alignment poset, and note that unless P is a total order, there are columns of the partial multiple alignment whose order is unspecified. A linear extension of a partially ordered set P = {c_{1}, … , c_{m}} is a permutation of the elements c_{1}, … , c_{m} such that whenever c_{i} < c_{j}, i < j. A global multiple alignment is a partial global multiple alignment together with a linear extension of the alignment poset P (Fig. 1). A similar representation using directed acyclic graphs instead of postes is used by the Partial Order Alignment program (Lee et al., 2002), and has also been used in the ABruijn alignment program (Raphael et al., 2004).
There is an important (trivial) case of a partial multiple alignment:
Example (null alignment): A null global multiple alignment of k sequence S = {σ^{1}, … , σ^{k}} of lengths n_{1}, … , n_{k} is a partial global multiple alignment M_{Null} where the alignment poset P has size . Note that in this case P must be a disjoint union of k chains.
Let be the set of all partial multiple alignments of a set of sequences S. A scoring function for multiple alignments is a function f : which assigns a ‘score’ to each partial multiple alignment. The problem of specifying biologically realistic scoring functions that can be evaluated efficiently is an important open problem that we do not address in this paper. In what follows, we make use of a scoring function defined for alignments of pairs of sequences, namely the alignment metric accuracy (Schwartz et al., 2006). The score of a multiple alignment is defined to be the sumofpairs score for all of the pairwise alignments. This scoring function has useful properties explained in Schwartz et al. (2006) most notably that the sumofpairs score is meaningful for producing an alignment ‘close’ to the true alignment.
3 SEQUENCE ANNEALING
The goal of global multiple alignment is to find argmax_{M∈} (f(M)) where M ranges over all partial multiple alignments and f is a scoring function. In principle, f can be evaluated at every partial multiple alignment but this is not feasible in practice, as the set is large (Dress et al., 1998).
We formalize a greedy approach to multiple alignment as follows. Let S be a collection of k sequences of length n_{1}, … , n_{k} and let . A sequence annealing for S with scoring function f is a nesting of partial global multiple alignments M_{L} ⊃ M_{L} − 1 ⋯ ⊃ M_{r} where the alignment poset P_{i} associated to M_{i} has ∣P_{i}∣ = i for all i, P_{L} is a null multiple alignment of S and f (M_{i+1}) ≤ f(M_{i}). Note that each multiple alignment M_{i} consists of a function
By definition, the sequence annealing process can only proceed by minimal steps that reduce the number of columns in the (partial) alignment by one. This can only be done by merging two existing columns into , such that if , or otherwise. The difficulty in finding a sequence annealing is that not all pairs of columns can necessarily be merged in a partial multiple alignment. However it is natural to consider the algorithm in Figure 2.
The time complexity of the algorithm depends on (1) the number of iterations of the main loop, (2) the time it takes to check the condition in line 3 and (3) the time it takes to merge the two columns. The time for (1) is simply L − r + 1. (3) can be done in O(k) time, since there are at most k − 1 sequence elements that are affected by each merge operation. The challenging step of the algorithm is (2).
In order to perform step (2) efficiently a weight is assigned to each pair of columns in M_{L}. Candidate pairs with positive weights are placed in a heap in O(Lk log(Lk)) time.1 At every iteration of the algorithm the candidate pair p with the highest weight is drawn from the heap in O(1) time. The weight of p is recalculated to account for merge operations that involve the positions in p. If the new weight w′ is lower than the weight of the current top candidate in the heap the edge is reinserted into the heap with the new weight w′ in O(log(Lk)) time, otherwise merge(p) is performed.
The merge operation can be done efficiently using an online topological ordering algorithm. Given a directed acyclic graph G, the topological ordering problem is to find a function ord: V → N such that if i → j then ord(i) < ord(j). Directed acyclic graphs are equivalent to partially ordered sets, and the topological ordering problem is just the problem of finding a linear extension of the poset (the former terminology is used in computer science, and the latter terminology in mathematics). It is easy to see that the topological ordering problem is trivial (Tarjan, 1972). The online topological ordering problem is the topological ordering problem where edges appear one at a time. This problem was first considered in Alpern et al. (1990) and MarchettiSpaccarnela et al. (1996). Significantly for our application, the problem admits efficient algorithms. We omit a detailed complexity analysis in this paper and refer the reader to Ajwani et al. (2006). In our implementation we adopted the algorithm of Pearch and Kelly (2004), which has good time complexity and is relatively easy to implement.
The correctness of the algorithm requires that the weight function w(p) has the property that merging a pair with positive weight will result in M_{i−1} for which f(M_{i−1}) ≥ f(M_{i}). In addition, a good weight function should be correlated with δ = f(M_{i−1}) − f(M_{i}), such that pairs that have a higher potential contribution to the objective function will be merged earlier in the sequence annealing process. Since w can change after each merge operation, a naive algorithm will have to update the weights for all the affected pairs after every such operation and reinsert them to the heap with the updated weights. To reduce the complexity of the algorithm we currently restrict w to be independent of merge operations or to have the property that the weight of a pair can only be reduced after a merge operation. The second property allows for a rapid evaluation of w. Only when a pair p_{i} is fetched from the top of the heap, its weight w′(p_{i}) is recalculated. Since w′(p_{l}) ≤ w(p_{l}), if and only if w′(p_{l}) ≥ w(p_{h}), where p_{h} is the next top candidate pair in the heap H.
In order to specify a weighting function w, we first need to define an objective function for the global multiple alignment problem. Such an objective function should be derived from the quality measures used to evaluate alignments. The most widely used accuracy measure for multiple sequence alignment is the developer score (f_{D}) (Sauder et al., 2000), which is equivalent to the sumofpairs score (SP) (Thompson et al., 1994). f_{D} is a sensitivity measure, defined as the number of correctly aligned residue pairs in the evaluated alignment divided by the number of aligned residue pairs in the reference alignment. In Schwartz et al. (2006) we show that use of f_{D} to derive the objective function leads to the over alignment problem, in which many unrelated positions are aligned without much support, resulting in low specificity, especially when aligning unrelated sequences or sequences with unalignable regions. Moreover, sine most alignment programs are compared based on f_{D} alone, programs that are considered to be very accurate can actually produce poor alignments that do not distinguish between related (homologous) and unrelated sequenceregions.
To solve this problem we introduced in Schwartz et al. (2006) a new accuracy measure for global multiple sequence alignment named alignment metric accuracy (AMA), which is based on a metric for multiple alignments. AMA is defined as the fraction of residues that are aligned correctly to another residue or to a gap, summed over all sequence pairs. We refer the reader to the original paper for a formal definition of the alignment metric, the AMA score, and further discussion about their properties.
Given a probabilistic model for pairwise alignments, the match posterior probabilities assign a probability to the event that is aligned to given the probabilistic model specified by the parameters θ and the data (Durbin et al., 1998). Although gap posterior probabilities and are not discussed in Durbin et al. (1998), they are important and easy to derive.
A key point is that there is a family of objective functions, based on posterior decoding, that can be used to optimize the expected f_{D} score, as well as the expected AMA score:
In fact, these objective functions can be used to obtain a wide range of alignments, from the most specific to the most sensitive. When the gapfactor parameter G_{f} is set to 0, f^{0} evaluates the expected f_{D} score. f^{0.5} evaluates the expected AMA score. In general, setting G_{f} to higher values results in alignments with higher specificity, while reducing the value of G_{f} results in higher sensitivity.
We propose two weight functions that are derived from f as defined in (1). Both have a static and a dynamic version. The static weights are computed once using the null alignment and stay constant during the sequence annealing process, while the dynamic weights are recomputed (rapidly) for each top candidate column pair (c_{k}, c_{i}):
then andThe first weight function assigns the highest weight to the pair of columns p_{max}, for which the increase in the objective function divided by the number of new matched pairs is maximal. It is easy to see that this weight function satisfies the requirement that weights can only decrease following a merge operation, since it is calculated by averaging over all newly matched residue pairs.
While is a good hillclimbing heuristic and is useful for finding a local maxima of at the final alignment M_{r}, it has no guarantees for the alignments produced during the sequence annealing process. The second weight function addresses this issue. It is motivated by the empirical observation that when the optimal pairwise alignment is found using dynamic programming, 99.8% of the pairs that are matched in alignments that use higher gapfactor values for increased specificity also appear in alignments that use lower gapfactors for better sensitivity. Sequence annealing with w_{tgf} emulates the process of slowly reducing the temperature (gapfactor in our analogy), allowing pairs of columns whose weights become positive to align. Therefore at any step of the sequence annealing process the scoring function that is being optimized is f^{temp}(M), where .
The following lemma shows that can only decrease after a merge operation.
Proof: We may assume without loss of generality that . If x_{k} < y_{k} for all k, then , which is a contradiction.□
4 RESULTS
We implemented sequence annealing by extending our existing protein alignment software that already performs posterior decoding based alignments. The new program, AMAP 2.0, is abbreviated to AMAP in this paper2.
The following variations of AMAP were tested:

AMAP—the program with its default settings. with dynamic weights.

AMAP_{sn}—adjusted to maximize sensitivity. with static weights and two rounds of consistency transformations. This version of the program is best suited for comparison with existing multiple alignment programs that focus on maximizing sensitivity.

AMAP_{sp}—adjusted for better specificity. with dynamic weights.
We compared the performance of AMAP with other alignment programs on the SABmark 1.65 datasets (Van Walle et al., 2005). SABmark includes two sets of reference alignments with a known structure from the ASTRAL (Brenner et al., 2000) database. The Twilight Zone set contains 1740 sequences with <25% identity divided into 209 groups based on SCOP folds (Murzin et al., 1995). The Superfamilies set contains 3280 sequences with <50% identity divided into 425 groups. In addition, each dataset has a ‘false positives’ version, which contains unrelated sequences with the same degree of sequence similarity in addition to the related sequences. Programs tested include Alignm 2.3 (Van Walle et al., 2005), CLUSTALW 1.83 (Thompson et al., 1994), DIALIGNT 0.2.1 (Subramanian et al., 2005), MUSCLE 3.52 (Edgar, 2004), ProbCons 1.1 (Do et al., 2005) and TCoffee 3.84 (Notredame et al., 2000). Our main results are as follows:
The sensitivity of AMAP_{sn} averaged over all positions in both SABmark datasets (with and without falsepositives) is higher than all tested programs and is achieved with the highest specificity of all tested programs. Remarkably, the specificity of AMAP_{sn} is almost three times higher than the closest competitors in sensitivity, when false positives are introduced.
Both AMAP and AMAP_{sp} get better modeler and AMA scores than all other programs, including Alignm, which is designed to optimize specificity rather than sensitivity.
The running time of AMAP is comparable to (and in some cases faster than) that of DIALIGNT, MUSCLE, ProbCons and TCoffee. This was achieved without any optimization for speed in the current prototype.
These results are detailed in Tables 1 and 2 that show the performance of all the alignment programs we tested measured with the developer, modeler and AMA accuracy measures on the SABmark 1.65 datasets without and with false positives, respectively. The developer score (f_{D}) measures sensitivity and the modeler score (f_{M}) measures specificity. The AMA measure, discussed earlier in the paper, provides a balanced assessment of the overall accuracy of an alignment.
Twilight (209)  Superfamilies (425)  Overall by alignments  Overall by positions  Average time  

Program  f_{D}  f_{M}  AMA  f_{D}  f_{M}  AMA  f_{D}  f_{M}  AMA  f_{D}  f_{M}  AMA  Seconds 
Alignm  21.6  23.6  51.7  49.2  45.6  56.9  40.1  38.3  55.2  35.2  45.4  56.6  12.7 
CLUSTALW  25.6  14.7  24.9  54.0  38.1  43.8  44.7  30.4  37.6  33.6  19.5  28.2  0.4 
DIALIGNT  21.3  19.8  45.5  49.9  44.9  54.8  40.4  36.6  51.7  33.9  38.6  52.5  1.4 
MUSCLE  27.3  16.4  27.6  56.3  40.3  46.4  46.8  32.4  40.2  37.5  22.5  31.7  2.1 
ProbCons  32.1  21.1  37.4  59.8  44.4  51.8  50.7  36.7  47.0  43.0  34.3  47.0  4.5 
TCoffee  26.7  18.1  35.2  56.5  42.8  50.3  46.7  34.7  45.3  39.4  31.5  44.5  11.3 
AMAP_{sn}  30.9  22.4  40.9  58.8  45.3  53.3  49.6  37.8  49.2  43.3  39.1  51.9  2.4 
AMAP  24.0  28.3  51.2  52.8  54.6  59.5  43.3  45.9  56.8  32.5  59.7  59.6  1.7 
AMAP_{sp}  14.5  41.5  56.5  38.7  69.4  60.2  30.7  60.2  59.0  20.7  78.1  58.9  1.4 
Twilight (209)  Superfamilies (425)  Overall by alignments  Overall by positions  Average time  

Program  f_{D}  f_{M}  AMA  f_{D}  f_{M}  AMA  f_{D}  f_{M}  AMA  f_{D}  f_{M}  AMA  Seconds 
Alignm  21.6  23.6  51.7  49.2  45.6  56.9  40.1  38.3  55.2  35.2  45.4  56.6  12.7 
CLUSTALW  25.6  14.7  24.9  54.0  38.1  43.8  44.7  30.4  37.6  33.6  19.5  28.2  0.4 
DIALIGNT  21.3  19.8  45.5  49.9  44.9  54.8  40.4  36.6  51.7  33.9  38.6  52.5  1.4 
MUSCLE  27.3  16.4  27.6  56.3  40.3  46.4  46.8  32.4  40.2  37.5  22.5  31.7  2.1 
ProbCons  32.1  21.1  37.4  59.8  44.4  51.8  50.7  36.7  47.0  43.0  34.3  47.0  4.5 
TCoffee  26.7  18.1  35.2  56.5  42.8  50.3  46.7  34.7  45.3  39.4  31.5  44.5  11.3 
AMAP_{sn}  30.9  22.4  40.9  58.8  45.3  53.3  49.6  37.8  49.2  43.3  39.1  51.9  2.4 
AMAP  24.0  28.3  51.2  52.8  54.6  59.5  43.3  45.9  56.8  32.5  59.7  59.6  1.7 
AMAP_{sp}  14.5  41.5  56.5  38.7  69.4  60.2  30.7  60.2  59.0  20.7  78.1  58.9  1.4 
Entries show the average developer (f_{D}), modeler (f_{M}) and AMA scores. Best results are shown in boldface. All numbers have been multiplied by 100.
TwilightFP (209)  SuperfamiliesFP (425)  Overall by alignments  Overall by positions  Average time  

Program  f_{D}  f_{M}  AMA  f_{D}  f_{M}  AMA  f_{D}  f_{M}  AMA  f_{D}  f_{M}  AMA  Seconds 
Alignm  17.8  6.4  81.5  44.8  16.8  77.5  35.9  13.4  78.9  28.6  17.6  83.3  158.9 
CLUSTALW  20.4  2.4  35.5  50.9  7.4  37.0  40.8  5.7  36.5  31.2  4.0  34.2  1.7 
DIALIGNT  17.0  4.5  74.1  46.7  14.0  71.5  36.9  10.9  72.4  30.2  13.5  78.5  5.7 
MUSCLE  19.4  2.3  37.1  49.7  7.5  38.9  39.7  5.8  38.3  30.7  4.1  35.7  19.2 
ProbCons  26.8  4.4  55.6  56.0  10.9  55.0  46.4  8.8  55.2  34.8  6.5  54.2  28.5 
TCoffee  13.0  2.3  56.5  42.5  9.3  56.6  32.8  7.0  56.6  26.7  6.6  62.6  61.2 
AMAP_{sn}  27.3  6.6  68.3  56.1  14.1  63.8  46.6  11.6  65.3  35.7  17.7  81.1  13.5 
AMAP  19.2  9.8  84.4  46.4  27.0  84.2  37.4  21.4  84.2  27.4  30.1  88.5  11.2 
AMAP_{sp}  12.7  17.3  91.0  35.7  45.9  91.1  28.1  36.5  91.1  19.1  50.3  91.4  10.0 
TwilightFP (209)  SuperfamiliesFP (425)  Overall by alignments  Overall by positions  Average time  

Program  f_{D}  f_{M}  AMA  f_{D}  f_{M}  AMA  f_{D}  f_{M}  AMA  f_{D}  f_{M}  AMA  Seconds 
Alignm  17.8  6.4  81.5  44.8  16.8  77.5  35.9  13.4  78.9  28.6  17.6  83.3  158.9 
CLUSTALW  20.4  2.4  35.5  50.9  7.4  37.0  40.8  5.7  36.5  31.2  4.0  34.2  1.7 
DIALIGNT  17.0  4.5  74.1  46.7  14.0  71.5  36.9  10.9  72.4  30.2  13.5  78.5  5.7 
MUSCLE  19.4  2.3  37.1  49.7  7.5  38.9  39.7  5.8  38.3  30.7  4.1  35.7  19.2 
ProbCons  26.8  4.4  55.6  56.0  10.9  55.0  46.4  8.8  55.2  34.8  6.5  54.2  28.5 
TCoffee  13.0  2.3  56.5  42.5  9.3  56.6  32.8  7.0  56.6  26.7  6.6  62.6  61.2 
AMAP_{sn}  27.3  6.6  68.3  56.1  14.1  63.8  46.6  11.6  65.3  35.7  17.7  81.1  13.5 
AMAP  19.2  9.8  84.4  46.4  27.0  84.2  37.4  21.4  84.2  27.4  30.1  88.5  11.2 
AMAP_{sp}  12.7  17.3  91.0  35.7  45.9  91.1  28.1  36.5  91.1  19.1  50.3  91.4  10.0 
Entries show the average developer (f_{D}), modeler (f_{M}) and AMA scores. Best results are shown in boldface. All numbers have been multiplied by 100.
It is important to note that all programs except for Alignm aim at maximizing sensitivity at the expense of specificity. It is therefore not surprising that these programs clearly have better f_{D} scores than f_{M} scores. On the other hand, AMAP allows to control the sensitivity/specificity tradeoff and is able to achieve best results on both measures. This is clear by examining Figure 3, which shows sensitivity and specificity of AMAP at various stages of the sequence annealing on the SABmark dataset with no falsepositives (averaged over all positions). All the points on the AMAP curve were produced with AMAP, except for the most sensitive point, which was produced by AMAP_{sn}. The figure depicts one crucial difference between AMAP and all the other current alignment programs. While every other program produces a single point, the sequence annealing method used in AMAP produces a series of alignments, starting with very specific alignments with low sensitivity and ending at very sensitive alignments with lower specificity. It is clear that the AMAP line dominates all other programs, since for every alignment produced by some other program, there is a point on the AMAP line that has better sensitivity and specificity.
Although we disagree that alignment programs should be evaluated based on sensitivity alone, we will note that the f_{D} scores of AMAP_{sn} are comparable to ProbCons, which is arguably the most accurate program in terms of sensitivity. This is not surprising since AMAP and ProbCons use similar posterior probabilities. While ProbCons has slightly better sensitivity than AMAP_{sn} on the sets without falsepositives when scores are averaged over alignments, AMAP_{sn} has a slightly better score when the scores are averaged over all positions in all alignments together. This indicates that the sequence annealing method finds better alignments when the total number of residues in an alignment increases. The same trend is evident in the falsepositives sets, in which AMAP_{sn} achieves better scores than ProbCons on all measures and datasets. This is again due to the fact that the falsepositives sets are larger and therefore the alignments are less robust to errors made in early stages of the progressive alignment procedure used in ProbCons. Even more striking is the fact that while both programs achieve very similar f_{D} scores, AMAP_{sn} has much better f_{M} and AMA scores than ProbCons (81.1 AMA compared with 54.2 on the falsepositive sets averaged over all alignments), even though it is tuned to optimized sensitivity alone. Again, this can be explained by the fact that sequence annealing takes the smallest step in the space of multiple alignments and fixes matches based on their overall contribution to the objective function score. In progressive alignment, once two sequences are chosen to be aligned, their entire alignment is determined without regard to the fact that the alignment of some residues may be unclear until later in the procedure.
Both AMAP and AMAP_{sp} get better f_{M} and AMA scores than all other programs, including Alignm, which is designed to optimize specificity rather than sensitivity. AMAP achieves high specificity while still producing respectable sensitivity scores. On most datasets, AMAP_{sp} has better AMA scores than AMAP, despite the fact that AMAP is tuned to optimize the AMA scores. This can be explained by the fact that the parameters used to produce the posterior probabilities do not fit the SABmark data very well and probably underestimate the gap probabilities.
AMAP compares favorably with most other programs in terms of running time, as well. It takes only 13.5 seconds on average to align a group of sequences from the falsepositives sets with current version of AMAP, which is still a prototype that has not yet been optimized for speed, compared with 19.2 seconds per query with MUSCLE, which is known to be optimized for fast running time. In fact, most of AMAP's running time is spent on computing the posterior probabilities and not on the actual sequence annealing. Using sequence annealing also allows AMAP to produce in one run of the algorithm a range of multiple sequence alignments, ranging from very specific to very sensitive alignments, since at every point of the algorithm a valid multiple sequence alignment can be produced from the current (partial) alignments. This allows users to see how the alignment is being built and decide which alignment looks best for their purpose.
5 SUMMARY AND FUTURE DIRECTIONS
We have shown that sequence annealing is a practical alternative to progressive alignment that can be used for multiple alignment of protein sequences. In addition to being more accurate than existing methods, sequence annealing has the advantage that the intermediate alignments produced during the annealing process are of use in identifying reliable portions of the multiple alignment. They provide the user with the ability to explore the tradeoff between sensitivity and specificity.
Although we have only considered multiple alignment of protein sequences in this paper there is no reason to expect that the methods we have developed cannot be applied to DNA multiple alignment. In the case of DNA, it may be necessary to further develop the sequence annealing approach to allow for melting (in analogy with simulated annealing algorithms). Indeed, by revisiting certain annealing steps, it may be possible to avoid local maxima of the scoring function. Indeed, we do not, at this time, have a clear understanding of the landscape of the scoring function we have proposed. This is a promising direction for future research.
Sequence annealing can also be used for local multiple alignment. A partial local multiple alignment of sequence characters S = {σ^{1}, … , σ^{k}} is a partially ordered set P = {c_{1}, … , c_{m}} together with a surjective function ϕ : S → P such that if x, y ∈ P with x < y then if and , i < j. Note that every partial global multiple alignment is a partial local multiple alignment (but not vice versa). A null local multiple alignment is the poset consisting of a disjoint union of singletons. Sequences can be annealed with the added ‘move’ of connecting components of the alignment posets without collapsing elements.
In summary, sequence annealing opens up the possibility of exploring the space of multiple alignments and allows for global optimization methods to be applied directly to the multiple alignment problem.
Ariel Schwartz was supported by NSF grant EF 0331494. Lior Pachter was supported by NIH grant R01HG2362 and NSF grant CCF0347992.
Comments