Abstract
We present a computational scheme to locally align a collection of RNA sequences using sequence and structure constraints. In addition, the method searches for the resulting alignments with the most significant common motifs, among all possible collections. The first part utilizes a simplified version of the Sankoff algorithm for simultaneous folding and alignment of RNA sequences, but maintains tractability by constructing multi-sequence alignments from pairwise comparisons. The algorithm finds the multiple alignments using a greedy approach and has similarities to both CLUSTAL and CONSENSUS, but the core algorithm assures that the pairwise alignments are optimized for both sequence and structure conservation. The choice of scoring system and the method of progressively constructing the final solution are important considerations that are discussed. Example solutions, and comparisons with other approaches, are provided. The solutions include finding consensus structures identical to published ones.
Introduction
Locating sequence as well as structure motifs in a set of RNA sequences is of general interest. For example, all of the methods that do structure prediction based on phylogenetic data require that the alignment of the sequences be known in advance. That alignment process is usually done by hand and is one of the biggest problems in using that approach. The method presented here promises to automate the alignment and structure determination process, and can be used on normal phylogenetic data, on SELEX (1) type data where the RNAs have been selected in vitro, and when one has a collection of genes that are coordinately regulated at the translational level. In contrast to many other RNA folding and aligning methods, we present a method which performs local structural alignment of RNA sequences. The work here is an extended version of (2).
Much work has been put into sequence alignment, e.g. (3,4), including methods to align multiple sequences, e.g. (5–9), but RNA sequences are often conserved more in their structure than in their sequence, so alignments of them based solely on sequence conservation are usually incorrect. There has also been much work on RNA secondary structure prediction, for example through free-energy minimization of structures (10,11). But these methods work on single sequences and are not generally reliable enough to accurately predict the structures of entire sets of RNA sequences. There is a method to build a multiple alignment based on structure predictions of individual sequences (12), but this ignores the sequence component of the alignment. Simulated annealing has also been applied to the problem of aligning multiple RNA structures (13) and to fold individual sequences (14). Also a genetic algorithm has been applied to fold individual sequences (15). It is generally agreed that comparative methods are the most reliable for determining the structure of a set of related RNA sequences (16,17). However, those methods require that the alignment be known in advance. If reasonably good alignments can be obtained, a very effective method is an iterative procedure that uses the alignment to predict the structure, then refines the alignment based on the structure, and repeats the process until no further improvement is seen. However, the actual use of such methods are not very automated and usually require the careful attention of an expert to attain the final resulting alignment. Perhaps the most promising automated approach is the use of stochastic context-free grammars (SCFG) (18,19) which perform global alignment considering both sequence and structure conservation. Eddy et al. (18) even show how this method can be used to discover the structure model that is common to the set of sequences. However, these methods do not work well on local alignments. This is the problem we are most interested in, where the conserved motifs represent only a portion of the available sequences. Such is usually the case with in vitro selected RNAs, and also with regulatory regions in RNA. In fact, interest in identifying regulatory regions in DNA sequences (20–22) has motivated this work. The problem is more complicated because the motifs being searched for contain structural, as well as sequence, conservation, and also because it is imperative to allow gaps in the alignment. If gaps are not required, then a simple extension to the sequence alignment method can identify the common motifs quite well (23).
With real data there is always the possibility that some of the N sequences are not really related to the rest, but have been included in the set erroneously, or for other reasons, for example that the N sequences are functionally related but fall into two (or more) structural classes so that there is not a single motif that is common to all of them. In general we want to identify M ≤ N sequences that contain the most significant common motif. In our approach that common motif will appear in the alignment of the M sequences, and failure to get good alignments with the rest of the sequences will indicate that they do not belong to the same structural class. However, for N sequences, there are essentially 2N subsets of sequences, and it would not be practical to examine all these subsets to identify the one with the most significant common motif. Thus we include an approach in which poorly aligned subsets are discarded.
In 1985 David Sankoff (24) published an algorithm to optimally fold and align multiple RNA sequences. The main limitation of his algorithm is that its time complexity is O(L3N) for N sequences of length L. That makes it impractical for any but the smallest problems. However, it is easily computed for two sequences of moderate length. Therefore, we have adapted an approach used in pairwise progressive alignment methods, such as CLUSTAL (5,6), in which each step of the procedure requires alignment of only two entities. Each entity can be either an individual sequence or an alignment of sequences. We also use the strategy of the greedy algorithm in CONSENSUS (21), of maintaining many intermediate solutions so as to minimize the likelihood of missing the optimal solution. While this approach does not guarantee that we will find the optimal solution, as does the Sankoff algorithm, it does have one advantage (besides tractability) over that method, which is that it can locate the subset with the most significant core structure.
We have simplified the basic Sankoff algorithm in two important ways. His algorithm minimizes the total cost of the alignment and, simultaneously, the energy of the structures. Because we are specifically looking for local alignments, we adopt the approach of the Smith-Waterman algorithm for local sequence alignments (3). Therefore we maximize a score based on a combination of sequence similarity and structure. For convenience we choose to score the structure based only on the number of base pairs, and not their stacking energies (10). This change allows us to use the basic Sankoff algorithm to find the highest scoring local alignment of the RNA sequences. (See below for further details of the scoring system.)
The second major simplification reduces the time complexity to O(L4), rather than the O(L6) for two sequences of the standard algorithm. We do this by not allowing for branching structures. We justify this approach by claiming that we do not expect our program to identify the complete structural motif in a single pass, but rather to give us a good ‘core structural alignment’ which can be used to obtain the complete motif by existing methods. We certainly expect that the motifs may contain branching structures. They may even contain pseudoknots, and if that is true even the complete O(L3N) Sankoff algorithm will not find them, nor will SCFG methods. However, given a good alignment of a ‘core’ region of the common motif, an SCFG method can extend and refine that to capture more of the motif. And maximum-weightedmatching (MWM) methods can be used to identify pseudoknots in the aligned sequences (25,26). The most difficult part of the problem, given existing programs, seems to be identifying a good candidate local alignment from which the total motif can be identified. The O(L4) algorithm seems quite capable of that, and its increased speed (compared to the O(L6) version) allows us to do many more comparisons between different subsets of sequences.
Data Sets
We select four published data sets for investigation, all from SELEX (1) experiments and for which a consensus structure has been proposed. The first set (H1) of RNA was found to bind to the human immunodeficiency virus type 1 rev protein, and consists of three families of hairpin loops (27). The second set (H2) contains a pseudoknot with specific affinity for HIV-1-RT. This data set also contains erroneous sequences which have been found to be retained by nitrocellulose filters in the absence of protein (28). The third set (THEO) of RNA binds to the bronchodilator theophylline and has a conserved sequence pattern and a consensus structure consisting of a circularly permuted and broken hairpin loop (29). In particular it has a conserved CCU bulge. The fourth set (R17) is RNA ligands for the bacteriophage R17 coat protein (30), which has characteristic tetra-loops and requires an A-bulge in the stem. The length of all the sequences is in the range of 30–50 nt, and the data set sizes range from 13 to 36 sequences. For each data set we have used only the variable part of the sequences. If we had included the constant regions our program would simply have aligned by them. Since, for some of the selected sequences, the common motif overlaps the constant region, these will not be included in the subset of sequences identified by the program. However, they can easily be identified by later searching for the motif in the entire sequence, including the constant region.
Method
The basic principles derived by Sankoff (24) are simplified and extended to perform alignment between two collections (entities) of already aligned sequences, merging them into a new collection of sequences. We extend the 2-D dynamic programming to 4-D dynamic programming which includes folding and show that this is similar to the 2-D case and can be done with a similar score matrix. Finally we add the feature of pruning away low score collections (filtering) from the set of all collections of aligned sequences. The collections with highest scores are the ones used to suggest consensus structures for the sequences within the given collection.
From 2-D to 4-D dynamic programming for comparing two sequences
For comparison with our new algorithm, we briefly describe the algorithm of Smith and Waterman (3) for finding the maximum scoring local alignment between two sequences, and
. This algorithm essentially comprises the alignment part of our algorithm. We are given a ‘scoring matrix’, Aab, which defines the similarity between any two bases, a and b. The scoring matrix also includes scores (or penalties) for aligning a base with a gap, Aa− and A−b. (To simplify the notation, in the following we write Aik to mean
, and similarly for gaps, Ai− to mean
.) The highest scoring alignment of subsequences is then found using a matrix
, initialized as
Another helpful comparison to our method is the sequence folding algorithm of Nussinov and Jacobson (10). The basic idea of folding is to find the maximum number of basepairs possible for any folding of an arbitrary subsequence (ai, ...., aj). This is done by testing aj's ability to pair with any base ak with i ≤ k ≤ j − 1. This folding allows for branching structures, and the optimal fold is found from the following recursion:
in which each element in the matrixWe may now extend the ideas to finding the best subsequence alignment between two sequences, allowing for conserved basepairing within the two subsequences. Formally, we find the best alignment of the subsequences (ai, ...., aj) and (bk, ...., bl), in which the score of the alignment includes terms for sequence similarity between and
and basepairing within
and
that is conserved at aligned positions. Whereas the Sankoff (24) algorithm minimizes a combination of alignment cost and basepairing energy, we maximize the alignment similarity and the number of basepairs that can be simultaneously formed in both sequences. We define a scoring matrix, Sij,kl, over all quadruples of bases, including gaps. Details of the scoring matrix are provided below, but letting Dij,kl denote the best score of the subsequences (ai, ...., aj) and (bk, ...., bl), then given any scoring matrix and the constraint of non-branching structures, the maximum scoring subsequence alignments can be obtained from the following recursion:
From the definition of Sij,kl (see next section) it follows that we only need to take the maximum of cases (a), (d) and (f) as the remaining cases can be constructed from these [assuming that the alignment of two -'s (gaps) gets a score of 0]. For example, an alignment of class (b) can be obtained by an alignment of class (f) followed by an alignment of class (d). This reduces the scheme to only seven cases.
The recursion in Figure 1 (3) is systematically applied to all possible combinations of i, j, k and l in the two sequences by considering a window of varying length along each sequence, as illustrated in Figure 2. Since the subsequence comparisons in larger windows use the results of comparisons in smaller windows, each comparison score is stored for later use in the matrix Dij,kl. As in other dynamic programming approaches, upon completion of the recursive computation of scores, the maximum element in is the starting point for retracing steps. However, now the backward trace continues until the interior of the subsequence is found; that is, until the previous subsequence consists of a single base.
Inside the vertical dashed bars is the section that is already structurally aligned, and corresponds to the D portion of the right side of equation 3. The bases listed outside of the dashed bars are the new ones being aligned, with each other or with gaps, and the combined score from all of them is included in the S portion of the recursion equations. The cases illustrated as (a)–(f) correspond to the first example in each class listed in equation 3.
Inside the vertical dashed bars is the section that is already structurally aligned, and corresponds to the D portion of the right side of equation 3. The bases listed outside of the dashed bars are the new ones being aligned, with each other or with gaps, and the combined score from all of them is included in the S portion of the recursion equations. The cases illustrated as (a)–(f) correspond to the first example in each class listed in equation 3.
The scoring matrix for aligning and folding
Consider the score matrix for structurally aligning a sequence against a sequence
. As before, we reserve the indices i and j for respective positions in sequence
, and the indices k and l for respective positions in sequence
. For structural alignment between a and b we construct the total score for each quadruplet (i,j,k,l) from two independent contributions, one for sequence alignment and one for alignment of basepairs. We define the score matrix
to be the sum of matrices
(alignment) and
(basepairing)
The scenario of computing the score between all size windows against all size windows. Here the sliding window of sequence is of size Wa = 4, and for sequence
, Wb = 6.
The scenario of computing the score between all size windows against all size windows. Here the sliding window of sequence is of size Wa = 4, and for sequence
, Wb = 6.
This matrix gives a score of +4 for basepairing and alignment matches, and a score +3 for any other basepairing without alignment matches (i.e. for compensating changes). Those are the two highest scores obtainable. All other scores are derived from considering only the matching and mismatching of the aligned positions. Note that the computation of score between two given subsequences for which the respective positions are considered fixed (gaps might have been included), may be done by first computing the score for sequence alignment, then by computing the score for structure alignment.
To construct the alignment contribution of this 25 × 25 matrix, we first generalize the classic alignment approach to include alignment of any two bases against any two bases (including gaps). In the standard method of aligning two sequences one defines a similarity of substitutions for some position i in the one
sequence and some position k in another, listed in a score matrix (calledWhen constructing the score matrix for basepairing one might simply put values directly into the 25 × 25 matrix. However, it can also be constructed from two simple components, a matrix of complementary bases and another type of score matrix. First we define the complementary weight matrix by
where d is the minimal allowed loop size (typically 3), and where accounts for the basepairings. In this example we weight the basepairing G·U the same as any regular Watson-Crick basepair. To construct the score matrixThe scenario of computing the score between all size windows against all size windows for two collections of aligned sequences. As before the sliding window of sequence is of size Wa = 4, and for sequence
, Wb = 6.
The scenario of computing the score between all size windows against all size windows for two collections of aligned sequences. As before the sliding window of sequence is of size Wa = 4, and for sequence
, Wb = 6.
Using the values listed in equations 7 and 10 we obtain the total score matrix for listed above, equation 5.
Multiple sequence comparison
We now expand the 4-D dynamic programming algorithm to include alignment between two entities or collections of sequences, which do not overlap in the sequences they contain. Consider the case of aligning a collection of p aligned sequences to a collection of q aligned sequences. The aligned columns within each separate collection remain unchanged, but insertions and deletions can occur between the two sets. The score for the alignment is the sum of all the pairwise scores among all r = p + q sequences. However, we only add in the structural component of the score if every sequence in both sets is complementary at the aligned position pairs. As for the two sequence comparisons, a similar score matrix formalism for multiple sequence comparison has been constructed, but not discussed here. Figure 3 illustrates the computation of the score between two windows from the two collections. The computation is similar to what is done in (6,7).
As extra features we include different penalties for initiation and elongation of gaps. Also we include an optional basepair elongation factor, ε, which scores basepairing such that stacking is favored. When ε = 0 the same number of basepairs throughout an alignment will give the same score no matter where the basepairings are located. When ε > 0 basepairs are scored Bij,kl + εc, where c is the number of nested stacked pairs up to, but not including (i,j) and (k,l).
The greedy algorithm
Here we briefly outline the basic ideas of applying the greedy algorithm to build up multiple alignments. This has similarities to CLUSTALW (6) and CONSENSUS (21). For a set of N different sequences there are 2N − (N + 1) ≈ 2N different subsets of two or more sequences. In order to find the best M ≤ N aligned sequences, the best scores from many different subsets will have to be compared.
The raw pairwise greedy algorithm is as follows. First all individual sequences are compared to each other, then all the pairwise alignments are compared to all the individual sequences, such that no sequence appears more than once in each comparison. The next greedy step would be to align all the triplet alignments to the individual sequences, and compare all the pairwise alignments to each other, still such that no sequence appears more than once in each final alignment. The algorithm may then be continued until all sequences have been compared in a final alignment. This approach is exponential in time, requiring O(L4NN) time. However, many of these comparisons/alignments are redundant and many may also be of such low score that they are unlikely to contribute to the final aligned subset. We can save many comparisons by eliminating alignments that appear unproductive.
Further work is ongoing to optimize the greedy strategy. For the results presented in this paper we have used two means of limiting the number of comparisons. First, one entity is always a single sequence. In the notation introduced above, alignments of r sequences are made by aligning one sequence with r − 1 sequences. Second, after the initial alignment of all pairs of sequences, we only store a fixed number of the best scoring alignments, typically 30. For example, we compare each single sequence to each aligned pair of sequences to generate alignments of three sequences. But we only keep the 30 highest scoring alignments for comparison with single sequences to generate the alignments of four sequences. This approach has a time complexity of O(L4N4).
Finally, in selecting the ‘best’ alignment of M ≤ N sequences, there is a problem that scores increase with the number of sequences, and so different sized sets are difficult to compare. We need stopping criteria that indicate the largest set of well-aligned sequences occurs at some particular round. As a first approach we look at a few best alignments of each round r, and compare score versus alignment length. We find a trade off from which the final alignment is chosen by plotting the scores as a function of r, and stopping at the round for which the rate of change decreases, i.e. at the empirical inflection point. Then we look at the five best scored alignments of r and r − 1 sequences. Work is ongoing to make the selection of the ‘best’ alignment more rigorous and more automated.
Results
For comparison, we attempt to find the alignment for each data set using three publicly available programs. Depending on the data set, and type and amount of conservation, these programs perform with varying degrees of success. None of them is consistently able to identify the structural motif common to the set of sequences as well as our program FOLDALIGN, an implementation of the 4-D alignment and greedy schemes presented here.
Clustalw
We performed multiple alignment of the SELEX data by using the program CLUSTALW (5,6) with default parameters. We used the full dynamic programming alignment between two sequences for the progressive alignment (6,32) performed by CLUSTALW. Since this program performs multiple alignment based on sequence conservation alone, we would not expect it to identify structurally conserved regions. As expected, it does fairly well on data sets with significant amounts of sequence conservation. For example, it properly aligns the conserved hairpin loop, and the conserved bulged A, in the R17 data (33), but fails to align consistently the conserved stem region. It also adds spuriously aligned bases near the 5′ end of the sequences that are not part of the functional motif. The H2 data (28) also contain significant sequence conservation in the first stem of a pseudoknot, and CLUSTALW aligns that region well, although with no suggestion of the conserved structure. The THEO data (29) contains sequence conservation, but it is misaligned in the two subclasses. The H1 data (27) has less sequence conservation than the others and is not well aligned by CLUSTALW.
COVE: covariance models of RNA
The COVE program by Eddy and Durbin (18) finds tree representations of secondary structures by applying a dynamic programming algorithm to find pairwise mutual information scores of all nucleotide positions. A covariance model is then derived from the resulting optimal tree representation of the structure. The algorithm is also similar to proposed stochastic context-free grammar models developed by the Haussler group (19,34). COVE performs global alignment on a collection of sequences and has been shown to perform well on tRNA for which a global (consensus) structure is defined. As options in the program one can either construct a model from a set of sequences with known structure and apply it to find structures for new sequences, or utilize it to suggest a common structure for a set of sequences. Additional features, like using alignment of the sequences, are also available. It is well known that it is hard to find local features using a global alignment procedure, so we did not expect this package to perform well on the SELEX data sets, and it did not. Using the same data sets as for CLUSTALW we did not find any strong signals for consensus structures, not even if we used the CLUSTALW alignments as input to the program. The multiple structural alignment method presented here can be used to construct a core model which can then be used by COVE to extend and refine that model.
Tools from the Vienna RNA package
The Vienna RNA package by Hofacker et al. (12) includes a program RNAfold to find the minimum free energy structures, and a program RNAdistance to find the distance between two structures in terms of the smallest cost along the editing path when representing the structures as trees. The RNAfold program is based on the work in (11,35,36), and the RNAdistance program on the work in (37–39). We folded each sequence in the data sets and found that many of the structures resembled the published structures, when neglecting the H2 pseudoknot. Using RNAdistance to pairwise compare the structures and appropriate cut-off scores to select reasonable comparisons, we found a number of sequences with similar structures (in the respective data sets) were clustered together. However, we did not find a strongest common structure, or the consensus structure for the respective data sets.
Foldalign
We determined structural alignments for each of the data sets and compared them to the published consensus structures. For each of the investigated data sets we did find the subset with the strongest common structure, matching what has been published. The first data set, H1 (27), consists of 20 sequences and has been assigned three structural classes, the classes all containing the same structural elements. The largest class consists of 10 sequences, however three of them use the constant SELEX regions (see ref. 1) in the basepairing (which were not included in our data sets), so we would not expect to find more than seven of them with conserved structure. For those seven sequences FOLDALIGN finds the proper alignment and consensus structure matching the published one. This is shown in Table 1. The additional structure identified by FOLDALIGN (Table 1), but not included in the published consensus structure, is included in the consensus structure for the largest class (figure 3 of ref. 27). FOLDALIGN has captured the interacting bases in an internal loop. Furthermore FOLDALIGN succeeds in merging the two strongest classes, missing one sequence in the largest class and getting only one sequence wrong in the alignment (not shown). All of the sequences in the third class utilize part of the constant region in their structures, and so cannot be aligned properly with the other sequences when using only the variable region.
Even better results are obtained for the data set H2 (28). As mentioned this data set contains a pseudoknot, two overlapping stem-loop regions, and therefore violates the knot constraint in the dynamic programming. One stem region is highly conserved in sequence, and the other has almost no sequence conservation. FOLDALIGN aligns the sequence conserved regions based on their sequence alignment, but at the same time aligns the other stem region which is only conserved in structure (see Table 2). This is a good example of the application of the method. Programs like MWM (25,26) could refine the alignment to identify the entire motif, including the pseudoknot.
FOLDALIGN for the R17 set of the sequences which have at least six basepairs in the stem
FOLDALIGN for the R17 set of the sequences which have at least six basepairs in the stem
FOLDALIGN for the R17 set of the sequences which have at least five basepairs in the stem
FOLDALIGN for the R17 set of the sequences which have at least five basepairs in the stem
The logos for the data. The symbols for sequence alignment are A, C, G, U and —. The letter displaying the mutual information between basepaired positions is M. The sequence-structure logos are for (a) the data set H1, (b) the data set H2, (c) the data set THEO and (d) the data set R17. Only positions with positive information content are shown. The letters ‘a’ and ‘b’ indicate basepairing for respective regions. For the data set H2 we inserted the known pseudoknot for illustration.
The logos for the data. The symbols for sequence alignment are A, C, G, U and —. The letter displaying the mutual information between basepaired positions is M. The sequence-structure logos are for (a) the data set H1, (b) the data set H2, (c) the data set THEO and (d) the data set R17. Only positions with positive information content are shown. The letters ‘a’ and ‘b’ indicate basepairing for respective regions. For the data set H2 we inserted the known pseudoknot for illustration.
The third data set, THEO (29), consists of two structural classes which are circular permutations of each other. FOLDALIGN identifies the proper motif from the largest class, getting the alignment exactly right for six of the eight sequences (see Table 3). The two remaining sequences contain the shortest stems, only two basepairs in one case, and require the most gaps for proper alignment. The second class could not be aligned with the first due to the circular permutation, but their common structure should be identifiable if they are treated as a separate class (not tested).
The final set, R17 (30), consists of 36 sequences. For many of these, part of the structural motif is contained in the constant region of the sequence, and so is not available to the program for alignment. However, we obtained a perfect alignment for the subset of nine sequences that have at least six basepairs in the stem (Table 4). We also found a perfect alignment for 12 of 16 sequences with at least five basepair stems (Table 5). Alignment of larger subsets are nearly correct, although they do contain a few misaligned sequences. In general, for this data set and the others, the identification of subsets of sequences with conserved core structures should be enough information to extend and refine the complete motif utilizing existing programs mentioned above.
In the results presented above, a few different scoring matrices were used, with essentially the same results for each. Work in progress will further explore various scoring schemes and their impact on resulting alignments. In real time the runs (alignments) were completed in 4–12 h depending on the set sizes. The runs were performed on a Silicon Graphics power challenge IRIX64 machine.
To display the sequence structure content of the alignments we show four examples of structure logos (40) in Figure 4. The structure consists of a sequence part, which is a generalized form of the Schneider and Stephens sequence logo (33). On top of this sequence logo the mutual information between corresponding basepairs has been displayed using the letter ‘M’.
Conclusion
We have presented a method to structurally align a set of RNA sequences, as well as select the subsets containing the most significant alignments. The method, which consists of 4-D dynamic programming and a greedy algorithm for pairwise comparison of sequences, was able to fully find the published alignments of conserved motifs. The complete structure was not always obtained, as in the case of the pseudoknot due to the dynamic programming limitation. But the core alignment that is obtained can be used by existing methods to complete the motif identification. For this type of problem the method clearly outperforms methods based on sequence alignment alone, or structure based methods like stochastic context-free grammars and free energy minimization, the latter including pairwise alignment of structures only. We conclude that our method, to a very large extent, can replace the alignments currently made by hand, or provide significant hints to assist with ‘hands-on’ methods.
Further work is ongoing to improve the identification of the most significant subset alignments. We are also working to fully automate the detection of subclasses of motifs; the current version is really only capable of finding the single most significant common structure, but alignments based on disjoint subsets of the sequences can be utilized for classification into distinct motif classes. The method is also very amenable to parallelization, which should greatly facilitate more extensive comparisons and analyses.
Acknowledgements
Thanks to B.Javornik and NexStar for providing the data electronically. We thank S.Eddy for advice on the use of COVE and an anonymous reviewer for many helpful comments. This work was sponsored in part by NIH grant HG00249 to GDS. JG was supported by the Danish National Research Foundation.

Comments