Motivation: Alignment of RNA has a wide range of applications, for example in phylogeny inference, consensus structure prediction and homology searches. Yet aligning structural or non-coding RNAs (ncRNAs) correctly is notoriously difficult as these RNA sequences may evolve by compensatory mutations, which maintain base pairing but destroy sequence homology. Ideally, alignment programs would take RNA structure into account. The Sankoff algorithm for the simultaneous solution of RNA structure prediction and RNA sequence alignment was proposed 20 years ago but suffers from its exponential complexity. A number of programs implement lightweight versions of the Sankoff algorithm by restricting its application to a limited type of structure and/or only pairwise alignment. Thus, despite recent advances, the proper alignment of multiple structural RNA sequences remains a problem.
Results: Here we present StrAl, a heuristic method for alignment of ncRNA that reduces sequence–structure alignment to a two-dimensional problem similar to standard multiple sequence alignment. The scoring function takes into account sequence similarity as well as up- and downstream pairing probability. To test the robustness of the algorithm and the performance of the program, we scored alignments produced by StrAl against a large set of published reference alignments. The quality of alignments predicted by StrAl is far better than that obtained by standard sequence alignment programs, especially when sequence homologies drop below ∼65%; nevertheless StrAl’s runtime is comparable to that of ClustalW.
Supplementary information: Supplementary data available at Bioinformatics online.
Non-coding RNAs (ncRNAs) are RNA molecules or elements that do not code for proteins but nevertheless are functional in biological processes, including localization, replication, translation, degradation and stabilization of biological macromolecules (for review and further references see Eddy, 2001; Storz, 2002; Winkler and Breaker, 2005; Fedor and Williamson, 2005; Gottesmann, 2005). Prominent examples are small nuclear RNAs, which are involved in mRNA splicing, and riboswitches, which are located in non-translated regions of mRNAs, where they bind metabolites and control gene expression.
For analysis of ncRNA function, knowledge about their secondary and tertiary structure is crucial. Structure prediction for single sequences is performed by dynamic programming, which allows the thermodynamically optimal structure or structure ensemble to be found (Zuker, 2000, 2003; Hofacker, 2003; Rivas and Eddy, 1999). These algorithms rely on correctness of thermodynamic parameters, neglect the influence of kinetics on structure formation and are not able to take into account interactions with other macromolecules. The alternative method—called comparative sequence analysis—needs a set of homologous RNAs and predicts base–base interactions on the basis of compensatory mutations (Chiu and Kolodziejczak, 1991; Gautheret et al., 1995; Lescoute et al., 2005). Its reliability increases with the number and divergence of sequences, but it needs an alignment (nearly) perfect with respect to sequence and structure. Other approaches for consensus structure prediction based on RNA alignments include PFold (Knudsen and Hein, 2003) and RNAalifold (Hofacker et al., 2002). Furthermore, RNA alignments are an essential basis for phylogeny inference (e.g. Hudelot et al., 2003), homology searches (e.g. Gräf et al., 2006; Eddy, 2002) and approaches to searching for new ncRNAs (e.g. Washietl et al., 2005).
The structurally correct alignment of RNA sequences is, however, a difficult problem. An algorithm for simultaneously optimizing the sequence and structure of an RNA set was published by Sankoff in 1985; however, the algorithm is not employable due to its computational complexity and memory usage for m sequences of length n. Thus, several variants of this algorithm have been developed which are restricted to pairwise alignment only and implement other simplifications to make the calculation tractable (Mathews and Turner, 2002; Hofacker et al., 2004; Havgaard et al., 2005a; Holmes, 2005).
This situation resembles that of (pure) sequence alignment: the algorithm for aligning two sequences is relatively cheap (Smith and Waterman, 1981), whereas the same approach cannot be applied to multiple sequence alignment due to its complexity of (Fuellen, G., 1997). This led to the development of very successful heuristic alignment methods, for example Clustal (Thompson et al., 1994a). Here we follow the same heuristic multiple sequence alignment approach but enhance it using a scoring function that emphasizes structural features. For this purpose, we project the structure features calculated by a thermodynamic approach (RNAfold; Hofacker, 2003) on top of the sequence. Similar approaches have already been proposed in the literature (Bonhoeffer et al., 1993; Yang and Blanchette, 2004) but have not been implemented. We call our program StrAl.
To test the performance of StrAl, we compare it with two sequence alignment programs—Clustal (Thompson et al., 1994a) and ProAlign (Löytynoja and Milinkovitch, 2003)—and three different structure alignment programs: In the following, we present our string-like alignment algorithm and test the performance of the respective program StrAl on several hundred sets of homologous RNAs. The quality of alignments produced by the Sankoff-like algorithm implemented in Stemloc is clearly superior to that produced by StrAl; this significant difference is, however, accompanied by a huge factor in computing time and memory usage in favor of StrAl. The quality of alignments predicted by StrAl is similar to that of standard sequence alignment programs if sequence similarity is >65%, but is far better with lower sequence similarities; nevertheless StrAl’s runtime is close to that of ClustalW.
PMstring (PMmulti in string-like alignment mode; see Hofacker et al., 2004) follows a concept related to that of our approach: it uses strings of pairing probabilities obtained from Sankoff-like pairwise alignments to produce a guide tree for a multiple alignment. This procedure, however, ‘often produces misaligned [sequence] pairs’ (Hofacker et al., 2004).
MARNA (Siebert and Backofen, 2005) uses a distantly related approach: it performs pairwise alignments of RNAs using sequence and structure information (based on RNAfold predictions) and combines the pairwise, weighted information into a multiple alignment with T-Coffee (Notredame et al., 2000).
Stemloc (Holmes, 2005) implements the pairwise Sankoff algorithm using a pair stochastic context-free grammar and a heuristic for multiple alignment. To reduce computing cost and memory usage, constraints are imposed by ‘fold and alignment envelopes’, which take into account, for example, the 1000 best single sequence structure predictions and the 100 best alignments.
2 SYSTEMs AND METHODS
StrAl is implemented in C and should compile under any Unix system. We used the GNU C compiler (GCC) versions 3 and 4. The program was thoroughly tested on several Linux distributions, including Debian Version 3.1 and Red Hat Fedora Core 3. To facilitate installation, support for GNU autotools is built in. We also provide a precompiled Debian package. StrAl requires RNAfold's RNAlib (Hofacker et al., 1994; Hofacker, 2003) version 1.5 and the squid library (Eddy, 2005) version 1.9g. Static versions of these libraries compiled for i386 are included in the package, which can be downloaded at . All alignments have been computed using StrAl version 0.5.2. Runtime comparison was performed on a 1.8 GHz Dual-Opteron machine with 4 GB memory; Stemloc computations had to be performed on 2.4 GHz Opterons with 16 GB memory. Programs have been compiled with optimization level 3 (GCC 4).
2.1 Reference alignments
As reference alignments we used dataset-1 from the RNA alignment benchmark database BRAlibase (Gardner et al., 2005). This dataset consists of 388 alignments of Group I introns, 5S rRNAs, tRNAs, and U5 spliceosomal RNAs with five sequences per alignment.
2.2 Scoring of alignments
As proposed by Gardner et al. (2005), we used two independent yet complementary scores to evaluate alignment quality: the widely used sum-of-pairs score (SPS) implemented in BaliScore (Thompson et al., 1999) and the Structure Conservation Index (SCI; see Washietl et al., 2005). The SPS measures the level of sequence consistency between a test and a reference alignment by comparing all possible character pairs per column between both alignments; it ranges from 0 to 1 (complete agreement). The SCI is a measure for structural conservation and works independently of a reference alignment. It ranges from 0 (no detectable conservation) to values slightly above 1.0 (sequence agreement and structure conservation). For statistical analysis of results we used R ().
2.3 Parameters of alignment programs
The parameter choices for StrAl are described in Section 4.1. ProAlign v0.5a3 was allowed to use up to 256 MB of memory and a bandwidth of 400 nt (
The steps of the algorithm include This strategy follows that of Clustal (Thompson et al., 1994a).
a pairwise alignment of all sequences of a set of homologous ncRNAs,
production of a guide tree using the alignment scores from step 1, and
a progressive alignment of the sequences, guided by the tree.
3.1 Pairwise alignment
We use the RNAfold library (Hofacker et al., 1994; Hofacker, 2003) to compute the partition function and matrix of base-pairing probabilities Pij of base i with base j for a single sequence. This probability matrix is then condensed into three linear vectors (Bonhoeffer et al., 1993), holding for each base i the probabilities of being paired downstream , paired upstream or unpaired , respectively. Thus we lose the specific pairing information but we can apply an alignment method that is cheap in terms of computing costs while still using thermodynamic information. Next, we choose a particular combination of these vectors—a structural part Sstruct = f(p1, p2) and a sequence part Sseq = f(p0)—as a similarity score
Let Vi,k be the value of the optimal alignment of prefixes A[1…i] and B[1…k] with base conditions V0,0 = 0, Vi,0 = Ei,0 = −go −i · ge, V0,k = F0,k = −go −k · ge and an affine gap weight model. That is, a single gap of length q is given by weight go + q · ge; go and ge denote gap-open and gap-extension values, respectively. If we do not charge end gaps, which is the default setting, set Vi,0 = V0, k = 0. Then follow the dynamic programming recurrences (Gusfield, 1999; Gotoh, 1999) for aligning two sequences:
3.2 Guide tree
3.3 Progressive alignment
During the progressive alignment, the sequences of the set S = S(1), … , S(m) are aligned according to their position in the tree, starting with the two sequences with lowest distance, resulting in a group of aligned sequences. To this group, a further sequence or a group of sequences is added, which makes necessary modifications of equations (1) and (2), and the base conditions. We measure the SPS, so Vi,0 = Ei,0 = n · (−go −i · ge) and V0,k = F0,k = n · (−go − k · ge), with n being the number of all pairwise alignments of groups A and B. In the case of free end gaps, Vi,0 = V0,k = 0.
4.1 Choice of parameters
For parameter optimization we used the RNA alignment benchmark datasets published by Gardner et al. (2005) (see also Section 2). The determination of ‘correct’ parameters was quite difficult; for example, the ratio α of structure over sequence similarity and gap values go and ge depend on each other. So we examined a rather huge parameter space. Contrary to our expectations, however, modification of the final parameters by a factor of 2 leads to only marginally different alignments.
4.1.1 Scoring function
The scoring function (1) implies a large influence of sequence on the alignment only in predicted loop regions and not in predicted helical regions. A major improvement in alignment quality arose by skipping this restriction; that is, formally we set the probability of a base to be unpaired . Consequently, sequence information is used for both structured and unstructured regions.
4.1.2 Substitution matrix d
The 4 × 4 single nucleotide substitution matrix given by Gotoh (1999) emphasizes identity substitutions [d(X, X) = 4] and allows for pyrimidine to pyrimidine and purine to purine substitutions only [d(Y, Y) = d(R, R) = 1]. The RIBOSUM85-60 matrix from Klein and Eddy (2003) overall contains higher values [3.51 ≤ d(X, X) ≤ 4.70, d(Y, Y) = 1.43, d(R, R) = 1.02]; only d(G, C) = 0. The former performed better for RNA sets of high similarity; the latter yielded a slightly better performance over the full similarity range.
4.1.3 Structure over sequence ratio α
Overall, scoring values were relatively constant in an α range from 3 to 11. A factor α = 7 produced the best results (see Table 1) when end gaps were free of costs. Nevertheless, scoring values improved when a higher α was applied to sequence sets containing fewer than five sequences and/or a sequence similarity lower than ∼50%. In contrast, α values lower than 2 did not lead to an improvement on high-similarity sets.
Values for gap-opening go and gap extension ge were fixed at 8.0 and 0.5, respectively, with free end gaps. The significance of SPS·SCI obtained using the different α values was determined by Friedman tests against α = 7; only α = 0 was significantly different (worse) than higher values.
4.1.4 Gap values
To determine appropriate values for gap costs, we aligned the complete dataset-1 several times with different gap opening, gap extension and α values. The alignments were then scored by the product of SPS and SCI. These values were averaged over all 388 alignments for each parameter combination. An example with α = 7 is shown in Figure 1. Several parameter combinations are near optimal. We implemented a gap opening value of go = 8 and a gap extension value of ge = 0.5 as default, as this combination produced high-scoring alignments for other values of α, too (data not shown). Overall alignment quality, however, does not vary significantly upon 2-fold changes of either gap opening or gap extension values.
4.1.5 Guide tree
Alignments based on guide trees derived from UPGMA showed the highest accuracy. Trees produced by the other methods showed similar alignment accuracy, except for the trees derived from the midpoint method (data not shown).
To demonstrate the power of StrAl we compared its performance with that of Clustal (Thompson et al., (1994a), ProAlign (Löytynoja and Milinkovitch, 2003) (the best-performing programs according to the evaluation by Gardner et al., 2005), PMmulti in string-like alignment mode (PMstring; see Hofacker et al., 2004, MARNA (Siebert and Backofen, 2005) and Stemloc (Holmes, 2005). For the comparison we used the multiple RNA sequence set (dataset-1) from the above-mentioned benchmark. Aligning all 388 RNA sets (with five sequences each) from this data set took ∼106 s using StrAl. This time reduced to only ∼9 s when probability matrices were precomputed. ClustalW, ProAlign, PMstring, MARNA, and Stemloc needed ∼8 s, ∼15 min, ∼1 h, ∼5 h and nearly 2 days, respectively. Note that PMstring is only prototyped in Perl; for Stemloc calculations a (faster) machine with more memory had to be used (see Section 2). MARNA was not able to align those 46 sequence sets containing ambiguity code, whereas Stemloc failed to align 5 sequence sets for unknown reasons.
Program performance was measured using SPS and SCI and plotted as a function of the sequence similarity (see Fig. 2). The structure and sequence alignment program Stemloc performed best, but at the cost of long time and high memory usage. With the exception of Stemloc, StrAl clearly outperformed all other programs: alignments produced by StrAl not only showed higher structural conservation (Fig. 2B) than the structure alignment programs marna and PMstring but also achieved more conservation on the sequence level (see Fig. 2A) than the pure sequence alignment programs CLustal and ProAlign. The performance difference became drastic when sequence similarity dropped below 60%. Here, the performance of StrAl is clearly better as the alignment process is guided by structural features, too, whereas pure sequence alignment programs cannot handle these sets, which are only poorly conserved on the sequence level.
A rather ad hoc approach to measuring RNA alignment quality is to compare consensus structures predicted from calculated alignments. Figure 3 shows such consensus structures predicted by RNAalifold (Hofacker et al., 2002) on the basis of different alignments of aphto- and cardiovirus IRES regions (see also Hofacker et al., 2004). Clustal clearly failed to create a correct alignment which comprised enough conserved structure. The consensus structures predicted from alignments computed by StrAl and PMmulti (slow variant) are almost identical. A more detailed comparison of the predicted structures is given in Supplementary Table S1.
To give a visual impression of the differences in alignment quality obtained using ClustalW and StrAl, alignments of 14 selenocysteine insertion sequences (SECIS; taken from Kryukov and Gladyshev, 2004) from methanogenic organisms are shown in Figure 4. In the Clustal alignment (Fig. 4C) the thermodynamically predicted stem–loop structures are not superimposed (top right triangle) and no statistically significant base pairs are predicted (lower left triangle); the ‘sequence alignment’ has a length of 44 nt. In contrast, the alignment computed by means of StrAl (Fig. 4B) is close to the ‘correct’ one (Fig. 4A), which was manually refined by means of ConStruct (Lück et al., 1999): the thermodynamically predicted stem–loop structures are perfectly superimposed (except for a single sequence), the highly conserved internal loop nucleotides are aligned and alignment length is only 41 nucleotides.
We have implemented a multiple RNA alignment program named StrAl that combines structural and sequence information in a ‘cheap’ dynamic programming approach. That is, when pairing vectors are precomputed, StrAl is nearly as fast as ClustalW. StrAl requires computational resources similar to other sequence alignment programs with time and memory cost, whereas true structure alignment programs such as Dynalign (Mathews, 2005), Foldalign v. 2 (Havgaard et al., 2005a,b); PMcomp (Hofacker et al., 2004) and Stemloc (Holmes, 2005), have costs of at least for pairwise alignment. Nevertheless, it has been shown that these structural alignment programs do not necessarily produce high-quality alignments (Gardner et al., 2005).
The parameters used in the algorithm of StrAl have been optimized using the benchmark data set BRAliBase (Gardner et al., 2005). Clearly, the inclusion of sequence and structure into the scoring function improves predictions in comparison with both pure sequence alignment and pure structure alignment (as, for example, in PMstring). With respect to all parameters, StrAl's performance is quite robust to modifications by a factor of ∼2 from the default parameters. It is likely, however, that the performance of other programs will also be improved by such a parameter optimization (e.g. cf. Katoh et al., 2005).
The use of a condensed vector representation of the pairing probabilities instead of the full pairing matrix—as done in PMmulti in pairwise mode—allows for a very fast pairwise alignment computation during every alignment step. Yet a future improvement could be the use of such ‘perfect’ pairwise alignments in combination with StrAl for the multiple alignment step(s). A further improvement of StrAl will be the inclusion of a recursive step in addition to the purely progressive approach (for review, see Gotoh, 1999).
Our approach offers a fast and reliable compromise between the computationally very demanding true structural alignment and pure sequence alignment.
Acknowledgement is made to Dr M. Schmitz for critical reading of the manuscript. We thank Drs I. Hofacker and P. Stadler for providing the IRES reference alignment. A.W. was supported by a grant from the German National Academic Foundation.
Conflict of Interest: none declared.