Abstract

Motivation: Alignment of RNA has a wide range of applications, for example in phylogeny inference, consensus structure prediction and homology searches. Yet aligning structural or non-coding RNAs (ncRNAs) correctly is notoriously difficult as these RNA sequences may evolve by compensatory mutations, which maintain base pairing but destroy sequence homology. Ideally, alignment programs would take RNA structure into account. The Sankoff algorithm for the simultaneous solution of RNA structure prediction and RNA sequence alignment was proposed 20 years ago but suffers from its exponential complexity. A number of programs implement lightweight versions of the Sankoff algorithm by restricting its application to a limited type of structure and/or only pairwise alignment. Thus, despite recent advances, the proper alignment of multiple structural RNA sequences remains a problem.

Results: Here we present StrAl, a heuristic method for alignment of ncRNA that reduces sequence–structure alignment to a two-dimensional problem similar to standard multiple sequence alignment. The scoring function takes into account sequence similarity as well as up- and downstream pairing probability. To test the robustness of the algorithm and the performance of the program, we scored alignments produced by StrAl against a large set of published reference alignments. The quality of alignments predicted by StrAl is far better than that obtained by standard sequence alignment programs, especially when sequence homologies drop below ∼65%; nevertheless StrAl’s runtime is comparable to that of ClustalW.

Availability: StrAl is implemented in C. Source code (under GNU public license) as well as a precompiled Debian package can be downloaded at

Contact:stral@biophys.uni-duesseldorf.de

Supplementary information: Supplementary data available at Bioinformatics online.

1 INTRODUCTION

Non-coding RNAs (ncRNAs) are RNA molecules or elements that do not code for proteins but nevertheless are functional in biological processes, including localization, replication, translation, degradation and stabilization of biological macromolecules (for review and further references see Eddy, 2001; Storz, 2002; Winkler and Breaker, 2005; Fedor and Williamson, 2005; Gottesmann, 2005). Prominent examples are small nuclear RNAs, which are involved in mRNA splicing, and riboswitches, which are located in non-translated regions of mRNAs, where they bind metabolites and control gene expression.

For analysis of ncRNA function, knowledge about their secondary and tertiary structure is crucial. Structure prediction for single sequences is performed by dynamic programming, which allows the thermodynamically optimal structure or structure ensemble to be found (Zuker, 2000, 2003; Hofacker, 2003; Rivas and Eddy, 1999). These algorithms rely on correctness of thermodynamic parameters, neglect the influence of kinetics on structure formation and are not able to take into account interactions with other macromolecules. The alternative method—called comparative sequence analysis—needs a set of homologous RNAs and predicts base–base interactions on the basis of compensatory mutations (Chiu and Kolodziejczak, 1991; Gautheret et al., 1995; Lescoute et al., 2005). Its reliability increases with the number and divergence of sequences, but it needs an alignment (nearly) perfect with respect to sequence and structure. Other approaches for consensus structure prediction based on RNA alignments include PFold (Knudsen and Hein, 2003) and RNAalifold (Hofacker et al., 2002). Furthermore, RNA alignments are an essential basis for phylogeny inference (e.g. Hudelot et al., 2003), homology searches (e.g. Gräf et al., 2006; Eddy, 2002) and approaches to searching for new ncRNAs (e.g. Washietl et al., 2005).

The structurally correct alignment of RNA sequences is, however, a difficult problem. An algorithm for simultaneously optimizing the sequence and structure of an RNA set was published by Sankoff in 1985; however, the algorithm is not employable due to its computational complexity O(n3m) and memory usage O(n2m) for m sequences of length n. Thus, several variants of this algorithm have been developed which are restricted to pairwise alignment only and implement other simplifications to make the calculation tractable (Mathews and Turner, 2002; Hofacker et al., 2004; Havgaard et al., 2005a; Holmes, 2005).

This situation resembles that of (pure) sequence alignment: the algorithm for aligning two sequences is relatively cheap (Smith and Waterman, 1981), whereas the same approach cannot be applied to multiple sequence alignment due to its complexity of O(nm) (Fuellen, G., 1997). This led to the development of very successful heuristic alignment methods, for example Clustal (Thompson et al., 1994a). Here we follow the same heuristic multiple sequence alignment approach but enhance it using a scoring function that emphasizes structural features. For this purpose, we project the structure features calculated by a thermodynamic approach (RNAfold; Hofacker, 2003) on top of the sequence. Similar approaches have already been proposed in the literature (Bonhoeffer et al., 1993; Yang and Blanchette, 2004) but have not been implemented. We call our program StrAl.

To test the performance of StrAl, we compare it with two sequence alignment programs—Clustal (Thompson et al., 1994a) and ProAlign (Löytynoja and Milinkovitch, 2003)—and three different structure alignment programs: In the following, we present our string-like alignment algorithm and test the performance of the respective program StrAl on several hundred sets of homologous RNAs. The quality of alignments produced by the Sankoff-like algorithm implemented in Stemloc is clearly superior to that produced by StrAl; this significant difference is, however, accompanied by a huge factor in computing time and memory usage in favor of StrAl. The quality of alignments predicted by StrAl is similar to that of standard sequence alignment programs if sequence similarity is >65%, but is far better with lower sequence similarities; nevertheless StrAl’s runtime is close to that of ClustalW.

  • PMstring (PMmulti in string-like alignment mode; see Hofacker et al., 2004) follows a concept related to that of our approach: it uses strings of pairing probabilities obtained from Sankoff-like pairwise alignments to produce a guide tree for a multiple alignment. This procedure, however, ‘often produces misaligned [sequence] pairs’ (Hofacker et al., 2004).

  • MARNA (Siebert and Backofen, 2005) uses a distantly related approach: it performs pairwise alignments of RNAs using sequence and structure information (based on RNAfold predictions) and combines the pairwise, weighted information into a multiple alignment with T-Coffee (Notredame et al., 2000).

  • Stemloc (Holmes, 2005) implements the pairwise Sankoff algorithm using a pair stochastic context-free grammar and a heuristic for multiple alignment. To reduce computing cost and memory usage, constraints are imposed by ‘fold and alignment envelopes’, which take into account, for example, the 1000 best single sequence structure predictions and the 100 best alignments.

2 SYSTEMs AND METHODS

StrAl is implemented in C and should compile under any Unix system. We used the GNU C compiler (GCC) versions 3 and 4. The program was thoroughly tested on several Linux distributions, including Debian Version 3.1 and Red Hat Fedora Core 3. To facilitate installation, support for GNU autotools is built in. We also provide a precompiled Debian package. StrAl requires RNAfold's RNAlib (Hofacker et al., 1994; Hofacker, 2003) version 1.5 and the squid library (Eddy, 2005) version 1.9g. Static versions of these libraries compiled for i386 are included in the package, which can be downloaded at . All alignments have been computed using StrAl version 0.5.2. Runtime comparison was performed on a 1.8 GHz Dual-Opteron machine with 4 GB memory; Stemloc computations had to be performed on 2.4 GHz Opterons with 16 GB memory. Programs have been compiled with optimization level 3 (GCC 4).

2.1 Reference alignments

As reference alignments we used dataset-1 from the RNA alignment benchmark database BRAlibase (Gardner et al., 2005). This dataset consists of 388 alignments of Group I introns, 5S rRNAs, tRNAs, and U5 spliceosomal RNAs with five sequences per alignment.

2.2 Scoring of alignments

As proposed by Gardner et al. (2005), we used two independent yet complementary scores to evaluate alignment quality: the widely used sum-of-pairs score (SPS) implemented in BaliScore (Thompson et al., 1999) and the Structure Conservation Index (SCI; see Washietl et al., 2005). The SPS measures the level of sequence consistency between a test and a reference alignment by comparing all possible character pairs per column between both alignments; it ranges from 0 to 1 (complete agreement). The SCI is a measure for structural conservation and works independently of a reference alignment. It ranges from 0 (no detectable conservation) to values slightly above 1.0 (sequence agreement and structure conservation). For statistical analysis of results we used R ().

2.3 Parameters of alignment programs

The parameter choices for StrAl are described in Section 4.1. ProAlign v0.5a3 was allowed to use up to 256 MB of memory and a bandwidth of 400 nt (

-Xmx256m -bwidth=400
). PMmulti v1.1 was used either in its slow and thorough variant or in string-like alignment mode (
- -fast_progressive
). Other programs (ClustalW v1.83, MARNA, Stemloc v0.19b) were used with default options.

3 ALGORITHM

The steps of the algorithm include This strategy follows that of Clustal (Thompson et al., 1994a).

  1. a pairwise alignment of all sequences of a set of homologous ncRNAs,

  2. production of a guide tree using the alignment scores from step 1, and

  3. a progressive alignment of the sequences, guided by the tree.

3.1 Pairwise alignment

We use the RNAfold library (Hofacker et al., 1994; Hofacker, 2003) to compute the partition function and matrix of base-pairing probabilities Pij of base i with base j for a single sequence. This probability matrix is then condensed into three linear vectors (Bonhoeffer et al., 1993), holding for each base i the probabilities of being paired downstream pi1=j>iPij, paired upstream pi2=j<iPij or unpaired pi0=1(pi1+pi2), respectively. Thus we lose the specific pairing information but we can apply an alignment method that is cheap in terms of computing costs while still using thermodynamic information. Next, we choose a particular combination of these vectors—a structural part Sstruct = f(p1, p2) and a sequence part Sseq = f(p0)—as a similarity score  

si,k=α(Sstruct)+Sseq=α(pAi1pBk1+pAi2pBk2)+pAi0pBk0·d(Ai,Bk)
for matching bases i and k from different sequences A and B, respectively. The idea for this score is to favor structural alignment of paired nucleotides from both sequences as well as sequence alignment of unpaired nucleotides. The factor α gives the ratio of structure over sequence similarity. The positive 4 × 4 single nucleotide substitution matrix d for aligning single-stranded regions is adapted either from Klein and Eddy (2003) (RIBOSUM85-60) or from Gotoh (1999) (see Section 4.1.2).

Let Vi,k be the value of the optimal alignment of prefixes A[1…i] and B[1…k] with base conditions V0,0 = 0, Vi,0 = Ei,0 = −goi · ge, V0,k = F0,k = −gok · ge and an affine gap weight model. That is, a single gap of length q is given by weight go + q · ge; go and ge denote gap-open and gap-extension values, respectively. If we do not charge end gaps, which is the default setting, set Vi,0 = V0, k = 0. Then follow the dynamic programming recurrences (Gusfield, 1999; Gotoh, 1999) for aligning two sequences:  

(2a)
Ei,k=max{Ei,k1,Vi,k1go}ge
 
(2b)
Fi,k=max{Fi1,k,Vi1,kgo}ge
 
(2c)
Gi,k=Vi1,k1+si,k
 
(2d)
Vi,k=max{Ei,k,Fi,k,Gi,k}
G(i, k) is the maximum value of an alignment of A[1…i] and B[1…k], where A(i) and B(k) are aligned opposite each other. E(i, k) aligns A(i) to the left of B(k); thus, the alignment ends in a gap in A. F(i, k) aligns A(i) to the right of B(k); thus, the alignment ends in a gap in B. V(i, k) is defined as the maximum value of the three terms E(i, k), F(i, k) and G(i, k). This pairwise alignment is carried out for all m(m − 1)/2 pairs of the m sequences S = S(1), … , S(m) of the ncRNA set.

3.2 Guide tree

The m(m − 1)/2 scores from the previous step are converted into a distance matrix  

DSp,Sq=log(VSp,SqVmin+1VmaxVmin)with Vmin=min1p<qm(VSp,Sq) and Vmax=max1p<qm(VSp,Sq).
This is used as the starting point for the construction of a guide tree. We implemented four different methods: UPGMA (Sokal and Michener, 1958), neighbor joining (NJ; Saitou and Nei, 1987; Felsenstein, 1989, 1997), weighted neighbor joining (Weighbor; Bruno et al., 2000) and BIONJ (Gascuel, 1997); the default is UPGMA. In the case of an unrooted tree, it is possible to root the tree using the midpoint method (Thompson et al., 1994b).

3.3 Progressive alignment

During the progressive alignment, the sequences of the set S = S(1), … , S(m) are aligned according to their position in the tree, starting with the two sequences with lowest distance, resulting in a group of aligned sequences. To this group, a further sequence or a group of sequences is added, which makes necessary modifications of equations (1) and (2), and the base conditions. We measure the SPS, so Vi,0 = Ei,0 = n · (−goi · ge) and V0,k = F0,k = n · (−gok · ge), with n being the number of all pairwise alignments of groups A and B. In the case of free end gaps, Vi,0 = V0,k = 0.

For (2c) the scoring function (1) has to be modified. Let C be the desired alignment of groups A and B (C = A ∪ B); then  

si,k=S(t)CS(u)C1t<uC[α(pS(t)i1pS(u)k1+pS(t)i2pS(u)k2)+pS(t)i0pS(u)k0·d(S(t)i,S(u)k)].
To speed up calculation, the scores of individual groups are stored in a double linked list. For (2a) and (2b) the gap values have to be modified according to the size of groups A and B:  
gopen=(Ai=1Aδi)·go withδi={1if Ai is gap,0elsegext=A·ge.
These equations accordingly hold for sequences in B. We have not tried to reduce computing costs by introducting true profiles (Gribskov et al., 1990); with our scoring function, this would force a grouping of individual scores (si,k0,+) and gap sizes, respectively, into small numbers of classes (Gotoh, 1993).

4 RESULTS

4.1 Choice of parameters

For parameter optimization we used the RNA alignment benchmark datasets published by Gardner et al. (2005) (see also Section 2). The determination of ‘correct’ parameters was quite difficult; for example, the ratio α of structure over sequence similarity and gap values go and ge depend on each other. So we examined a rather huge parameter space. Contrary to our expectations, however, modification of the final parameters by a factor of 2 leads to only marginally different alignments.

4.1.1 Scoring function

The scoring function (1) implies a large influence of sequence on the alignment only in predicted loop regions and not in predicted helical regions. A major improvement in alignment quality arose by skipping this restriction; that is, formally we set the probability of a base to be unpaired pi0=1. Consequently, sequence information is used for both structured and unstructured regions.

4.1.2 Substitution matrix d

The 4 × 4 single nucleotide substitution matrix given by Gotoh (1999) emphasizes identity substitutions [d(X, X) = 4] and allows for pyrimidine to pyrimidine and purine to purine substitutions only [d(Y, Y) = d(R, R) = 1]. The RIBOSUM85-60 matrix from Klein and Eddy (2003) overall contains higher values [3.51 ≤ d(X, X) ≤ 4.70, d(Y, Y) = 1.43, d(R, R) = 1.02]; only d(G, C) = 0. The former performed better for RNA sets of high similarity; the latter yielded a slightly better performance over the full similarity range.

4.1.3 Structure over sequence ratio α

Overall, scoring values were relatively constant in an α range from 3 to 11. A factor α = 7 produced the best results (see Table 1) when end gaps were free of costs. Nevertheless, scoring values improved when a higher α was applied to sequence sets containing fewer than five sequences and/or a sequence similarity lower than ∼50%. In contrast, α values lower than 2 did not lead to an improvement on high-similarity sets.

Table 1

SPS and SCI dependence upon structure-over-sequence ratio α

α SPS SCI SPS·SCI Sign. 
0.838 0.741 0.621 
0.849 0.773 0.656 — 
0.852 0.794 0.677 — 
0.855 0.800 0.684 — 
0.854 0.809 0.691 — 
0.855 0.818 0.700 — 
0.855 0.820 0.701 — 
0.854 0.821 0.701 — 
0.851 0.817 0.696 — 
0.847 0.814 0.690 — 
10 0.844 0.813 0.686 — 
11 0.839 0.809 0.678 — 
12 0.836 0.803 0.671 — 
13 0.834 0.800 0.667 — 
14 0.834 0.792 0.657 — 
15 0.824 0.786 0.648 — 
α SPS SCI SPS·SCI Sign. 
0.838 0.741 0.621 
0.849 0.773 0.656 — 
0.852 0.794 0.677 — 
0.855 0.800 0.684 — 
0.854 0.809 0.691 — 
0.855 0.818 0.700 — 
0.855 0.820 0.701 — 
0.854 0.821 0.701 — 
0.851 0.817 0.696 — 
0.847 0.814 0.690 — 
10 0.844 0.813 0.686 — 
11 0.839 0.809 0.678 — 
12 0.836 0.803 0.671 — 
13 0.834 0.800 0.667 — 
14 0.834 0.792 0.657 — 
15 0.824 0.786 0.648 — 

Values for gap-opening go and gap extension ge were fixed at 8.0 and 0.5, respectively, with free end gaps. The significance of SPS·SCI obtained using the different α values was determined by Friedman tests against α = 7; only α = 0 was significantly different (worse) than higher values.

4.1.4 Gap values

To determine appropriate values for gap costs, we aligned the complete dataset-1 several times with different gap opening, gap extension and α values. The alignments were then scored by the product of SPS and SCI. These values were averaged over all 388 alignments for each parameter combination. An example with α = 7 is shown in Figure 1. Several parameter combinations are near optimal. We implemented a gap opening value of go = 8 and a gap extension value of ge = 0.5 as default, as this combination produced high-scoring alignments for other values of α, too (data not shown). Overall alignment quality, however, does not vary significantly upon 2-fold changes of either gap opening or gap extension values.

Fig. 1

Alignment quality measured by SPS·SCI averaged over all 388 alignments is shown for several gap parameter combinations. The bar at the right gives averaged SPS·SCI values from 0.654 to 0.702 in grey scale. The structure-over-sequence ratio α is fixed to a value of 7 in this example. Note that absolute values change only slightly; that is alignment quality is rather robust to the choice of gap parameters.

Fig. 1

Alignment quality measured by SPS·SCI averaged over all 388 alignments is shown for several gap parameter combinations. The bar at the right gives averaged SPS·SCI values from 0.654 to 0.702 in grey scale. The structure-over-sequence ratio α is fixed to a value of 7 in this example. Note that absolute values change only slightly; that is alignment quality is rather robust to the choice of gap parameters.

4.1.5 Guide tree

Alignments based on guide trees derived from UPGMA showed the highest accuracy. Trees produced by the other methods showed similar alignment accuracy, except for the trees derived from the midpoint method (data not shown).

4.2 Benchmark

To demonstrate the power of StrAl we compared its performance with that of Clustal (Thompson et al., (1994a), ProAlign (Löytynoja and Milinkovitch, 2003) (the best-performing programs according to the evaluation by Gardner et al., 2005), PMmulti in string-like alignment mode (PMstring; see Hofacker et al., 2004, MARNA (Siebert and Backofen, 2005) and Stemloc (Holmes, 2005). For the comparison we used the multiple RNA sequence set (dataset-1) from the above-mentioned benchmark. Aligning all 388 RNA sets (with five sequences each) from this data set took ∼106 s using StrAl. This time reduced to only ∼9 s when probability matrices were precomputed. ClustalW, ProAlign, PMstring, MARNA, and Stemloc needed ∼8 s, ∼15 min, ∼1 h, ∼5 h and nearly 2 days, respectively. Note that PMstring is only prototyped in Perl; for Stemloc calculations a (faster) machine with more memory had to be used (see Section 2). MARNA was not able to align those 46 sequence sets containing ambiguity code, whereas Stemloc failed to align 5 sequence sets for unknown reasons.

Program performance was measured using SPS and SCI and plotted as a function of the sequence similarity (see Fig. 2). The structure and sequence alignment program Stemloc performed best, but at the cost of long time and high memory usage. With the exception of Stemloc, StrAl clearly outperformed all other programs: alignments produced by StrAl not only showed higher structural conservation (Fig. 2B) than the structure alignment programs marna and PMstring but also achieved more conservation on the sequence level (see Fig. 2A) than the pure sequence alignment programs CLustal and ProAlign. The performance difference became drastic when sequence similarity dropped below 60%. Here, the performance of StrAl is clearly better as the alignment process is guided by structural features, too, whereas pure sequence alignment programs cannot handle these sets, which are only poorly conserved on the sequence level.

Fig. 2

Performance of alignment programs on the BRAliBase benchmark: all 388 RNA sets from dataset-1 (Gardner et al., 2005) were aligned and each alignment scored. The performance is plotted against sequence similarity measured as mean pairwise sequence identity. Curves are fitted using Lowess smoothing. A: Performance measured as SPS. B: Performance measured as SCI. See Section 2.2 for an explanation of these scores; for a color version of this figure, see Supplementary Figure S1.

Fig. 2

Performance of alignment programs on the BRAliBase benchmark: all 388 RNA sets from dataset-1 (Gardner et al., 2005) were aligned and each alignment scored. The performance is plotted against sequence similarity measured as mean pairwise sequence identity. Curves are fitted using Lowess smoothing. A: Performance measured as SPS. B: Performance measured as SCI. See Section 2.2 for an explanation of these scores; for a color version of this figure, see Supplementary Figure S1.

A rather ad hoc approach to measuring RNA alignment quality is to compare consensus structures predicted from calculated alignments. Figure 3 shows such consensus structures predicted by RNAalifold (Hofacker et al., 2002) on the basis of different alignments of aphto- and cardiovirus IRES regions (see also Hofacker et al., 2004). Clustal clearly failed to create a correct alignment which comprised enough conserved structure. The consensus structures predicted from alignments computed by StrAl and PMmulti (slow variant) are almost identical. A more detailed comparison of the predicted structures is given in Supplementary Table S1.

Fig. 3

IRES consensus structures predicted by RNAalifold on the basis of a manually constructed reference alignment (A; see Hofacker et al., 2004, and references therein) and of alignments predicted by StrAl (with NJ tree) (B), PMmulti (slow, thorough variant) (C) and ClustalW (D), respectively. Consistent and compensatory mutations are indicated by circles. Grey letters indicate inconsistent mutations. The ClustalW prediction shows several inconsistent mutations and the overall structure is different from the reference. Predictions made by PMmulti and StrAl are almost identical and share only one inconsistent mutation. For further details, see Supplementary Table S1.

Fig. 3

IRES consensus structures predicted by RNAalifold on the basis of a manually constructed reference alignment (A; see Hofacker et al., 2004, and references therein) and of alignments predicted by StrAl (with NJ tree) (B), PMmulti (slow, thorough variant) (C) and ClustalW (D), respectively. Consistent and compensatory mutations are indicated by circles. Grey letters indicate inconsistent mutations. The ClustalW prediction shows several inconsistent mutations and the overall structure is different from the reference. Predictions made by PMmulti and StrAl are almost identical and share only one inconsistent mutation. For further details, see Supplementary Table S1.

To give a visual impression of the differences in alignment quality obtained using ClustalW and StrAl, alignments of 14 selenocysteine insertion sequences (SECIS; taken from Kryukov and Gladyshev, 2004) from methanogenic organisms are shown in Figure 4. In the Clustal alignment (Fig. 4C) the thermodynamically predicted stem–loop structures are not superimposed (top right triangle) and no statistically significant base pairs are predicted (lower left triangle); the ‘sequence alignment’ has a length of 44 nt. In contrast, the alignment computed by means of StrAl (Fig. 4B) is close to the ‘correct’ one (Fig. 4A), which was manually refined by means of ConStruct (Lück et al., 1999): the thermodynamically predicted stem–loop structures are perfectly superimposed (except for a single sequence), the highly conserved internal loop nucleotides are aligned and alignment length is only 41 nucleotides.

Fig. 4

Visualization as base-pairing dotplots, drawn by ConStruct (Lück et al., 1999), of alignments produced manually (A) or by programs StrAl (B) and Clustal (C), respectively. Each top-right triangle shows the thermodynamic base-pairing probability of individual sequences; the horizontal and vertical bars denote gaps; the lower-left triangle shows mutual information content normalized by pair entropy (Martin et al., 2005). For further details see Supplementary Figure S2.

Fig. 4

Visualization as base-pairing dotplots, drawn by ConStruct (Lück et al., 1999), of alignments produced manually (A) or by programs StrAl (B) and Clustal (C), respectively. Each top-right triangle shows the thermodynamic base-pairing probability of individual sequences; the horizontal and vertical bars denote gaps; the lower-left triangle shows mutual information content normalized by pair entropy (Martin et al., 2005). For further details see Supplementary Figure S2.

5 DISCUSSION

We have implemented a multiple RNA alignment program named StrAl that combines structural and sequence information in a ‘cheap’ dynamic programming approach. That is, when pairing vectors are precomputed, StrAl is nearly as fast as ClustalW. StrAl requires computational resources similar to other sequence alignment programs with O(k2n2) time and O(n2) memory cost, whereas true structure alignment programs such as Dynalign (Mathews, 2005), Foldalign v. 2 (Havgaard et al., 2005a,b); PMcomp (Hofacker et al., 2004) and Stemloc (Holmes, 2005), have costs of at least O(n4) for pairwise alignment. Nevertheless, it has been shown that these structural alignment programs do not necessarily produce high-quality alignments (Gardner et al., 2005).

The parameters used in the algorithm of StrAl have been optimized using the benchmark data set BRAliBase (Gardner et al., 2005). Clearly, the inclusion of sequence and structure into the scoring function improves predictions in comparison with both pure sequence alignment and pure structure alignment (as, for example, in PMstring). With respect to all parameters, StrAl's performance is quite robust to modifications by a factor of ∼2 from the default parameters. It is likely, however, that the performance of other programs will also be improved by such a parameter optimization (e.g. cf. Katoh et al., 2005).

The use of a condensed vector representation of the pairing probabilities instead of the full pairing matrix—as done in PMmulti in pairwise mode—allows for a very fast pairwise alignment computation during every alignment step. Yet a future improvement could be the use of such ‘perfect’ pairwise alignments in combination with StrAl for the multiple alignment step(s). A further improvement of StrAl will be the inclusion of a recursive step in addition to the purely progressive approach (for review, see Gotoh, 1999).

Our approach offers a fast and reliable compromise between the computationally very demanding true structural alignment and pure sequence alignment.

Acknowledgement is made to Dr M. Schmitz for critical reading of the manuscript. We thank Drs I. Hofacker and P. Stadler for providing the IRES reference alignment. A.W. was supported by a grant from the German National Academic Foundation.

Conflict of Interest: none declared.

REFERENCES

Bonhoeffer
S.
, et al.  . 
RNA multi-structure landscapes. A study based on temperature dependent partition functions
Eur. Biophys. J.
 , 
1993
, vol. 
22
 (pg. 
13
-
24
)
Bruno
W.
, et al.  . 
Weighted neighbor joining: A likelihood-based approach to distance-based phylogeny reconstruction
Mol. Biol. Evol.
 , 
2000
, vol. 
17
 (pg. 
189
-
197
)
Chiu
D.
Kolodziejczak
T.
Inferring consensus structure from nucleic acid sequences
Comput. Appl. Biosci.
 , 
1991
, vol. 
7
 (pg. 
347
-
352
)
Eddy
S.
Noncoding RNA genes and the modern RNA world
Nat. Rev. Genet.
 , 
2001
, vol. 
2
 (pg. 
919
-
929
)
Eddy
S.
SQUID—C function library for sequence analysis.
2005
 
Eddy
S.R.
A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure
BMC Bioinformatics
 , 
2002
, vol. 
3
 pg. 
18
 
Fedor
M.
Williamson
J.
The catalytic diversity of RNAs
Nat. Rev. Mol. Cell. Biol.
 , 
2005
, vol. 
6
 (pg. 
399
-
412
)
Felsenstein
J.
PHYLIP—Phylogeny Inference Package (Version 3.2)
Cladistics
 , 
1989
, vol. 
5
 (pg. 
164
-
166
)
Felsenstein
J.
An alternating least squares approach to inferring phylogenies from pairwise distances
Syst. Biol.
 , 
1997
, vol. 
46
 (pg. 
101
-
111
)
Fuellen
G.
A gentle guide to multiple alignment
Complexity International
 , 
1997
, vol. 
4
  
Gardner
P.P.
, et al.  . 
A benchmark of multiple sequence alignment programs upon structural RNAs
Nucleic Acids Res.
 , 
2005
, vol. 
33
 (pg. 
2433
-
2439
)
Gascuel
O.
BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data
Mol. Biol. Evol.
 , 
1997
, vol. 
14
 (pg. 
685
-
695
)
Gautheret
D.
, et al.  . 
Identification of base-triples in RNA using comparative sequence analysis
J. Mol. Biol.
 , 
1995
, vol. 
248
 (pg. 
27
-
43
)
Gotoh
O.
Optimal alignment between groups of sequences and its application to multiple sequence alignment
Comput. Appl. Biosci.
 , 
1993
, vol. 
9
 (pg. 
361
-
370
)
Gotoh
O.
Multiple sequence alignment: algorithms and applications
Adv. Biophys.
 , 
1999
, vol. 
36
 (pg. 
159
-
206
)
Gottesmann
S.
Micros for microbes: non-coding regulatory RNAs in bacteria
Trends Genet.
 , 
2005
, vol. 
21
 (pg. 
399
-
404
)
Gräf
S.
, et al.  . 
Nellen
W.
Hammann
C.
A computational approach to search for non-coding RNAs in large genomic data
Small RNAs: Analysis and Regulatory Functions, volume 17 of Nucleic Acids and Molecular Biology
 , 
2006
Springer Verlag
(pg. 
57
-
74
)
Gribskov
M.
, et al.  . 
Profile analysis
Methods Enzymol.
 , 
1990
, vol. 
183
 (pg. 
146
-
159
)
Gusfield
D.
Algorithms on Strings, Trees, and Sequences. Computer Science and Computational Biology
 , 
1999
Cambridge
Cambridge University Press
Havgaard
J.H.
, et al.  . 
Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%
Bioinformatics
 , 
2005
, vol. 
21
 (pg. 
1815
-
1824
)
Havgaard
J.H.
, et al.  . 
The FOLDALIGN web server for pairwise structural RNA alignment and mutual motif search
Nucleic Acids Res.
 , 
2005
, vol. 
33
 (pg. 
W650
-
W653
)
Hofacker
I.
, et al.  . 
Fast folding and comparsion of RNA structures
Monatsh. Chem.
 , 
1994
, vol. 
125
 (pg. 
167
-
188
)
Hofacker
I.
, et al.  . 
Secondary structure prediction for aligned RNA sequences
J. Mol. Biol.
 , 
2002
, vol. 
319
 (pg. 
1059
-
1066
)
Hofacker
I.
, et al.  . 
Alignment of RNA base pairing probability matrices
Bioinformatics
 , 
2004
, vol. 
20
 (pg. 
2222
-
2227
)
Hofacker
I.L.
Vienna RNA secondary structure server
Nucleic Acids Res.
 , 
2003
, vol. 
31
 (pg. 
3429
-
3431
)
Holmes
I.
Accelerated probabilistic inference of RNA structure evolution
BMC Bioinformatics
 , 
2005
, vol. 
6
 pg. 
73
 
Hudelot
C.
, et al.  . 
RNA-based phylogenetic methods: application to mammalian mitochondrial RNA sequences
Mol. Phylogenet. Evol.
 , 
2003
, vol. 
28
 (pg. 
241
-
252
)
Katoh
K.
, et al.  . 
MAFFT version 5: improvement in accuracy of multiple sequence alignment
Nucleic Acids Res.
 , 
2005
, vol. 
33
 (pg. 
511
-
518
)
Klein
R.
Eddy
S.
RSEARCH: finding homologs of single structured RNA sequences
BMC Bioinformatics
 , 
2003
, vol. 
4
 pg. 
44
 
Knudsen
B.
Hein
J.
Pfold: RNA secondary structure prediction using stochastic context-free grammars
Nucleic Acids Res.
 , 
2003
, vol. 
31
 (pg. 
3423
-
3428
)
Kryukov
G.V.
Gladyshev
V.N.
The prokaryotic selenoproteome
EMBO Rep.
 , 
2004
, vol. 
5
 (pg. 
538
-
543
)
Lescoute
A.
, et al.  . 
Recurrent structural RNA motifs, isostericity matrices and sequence alignments
Nucleic Acids Res.
 , 
2005
, vol. 
33
 (pg. 
2395
-
2409
)
Löytynoja
A.
Milinkovitch
M.C.
A hidden Markov model for progressive multiple alignment
Bioinformatics
 , 
2003
, vol. 
19
 (pg. 
1505
-
1513
)
Lück
R.
, et al.  . 
ConStruct: a tool for thermodynamic controlled prediction of conserved secondary structure
Nucleic Acids Res.
 , 
1999
, vol. 
27
 (pg. 
4208
-
4217
)
Martin
L.C.
, et al.  . 
Using information theory to search for co-evolving residues in proteins
Bioinformatics
 , 
2005
, vol. 
21
 (pg. 
4116
-
4124
)
Mathews
D.
Turner
D.
Dynalign: an algorithm for finding the secondary structure common to two RNA sequences
J. Mol. Biol.
 , 
2002
, vol. 
317
 (pg. 
191
-
203
)
Mathews
D.H.
Predicting a set of minimal free energy RNA secondary structures common to two sequences
Bioinformatics
 , 
2005
, vol. 
21
 (pg. 
2246
-
2253
)
Notredame
C.
, et al.  . 
T-Coffee: a novel method for fast and accurate multiple sequence alignment
J. Mol. Biol.
 , 
2000
, vol. 
302
 (pg. 
205
-
217
)
Rivas
E.
Eddy
S.
A dynamic programming algorithm for RNA structure prediction including pseudoknots
J. Mol. Biol.
 , 
1999
, vol. 
285
 (pg. 
2053
-
2068
)
Saitou
N.
Nei
M.
The neighbor-joining method: a new method for reconstructing phylogenetic trees
Mol. Biol. Evol.
 , 
1987
, vol. 
4
 (pg. 
406
-
425
)
Sankoff
D.
Simultaneous solution of the RNA folding, alignment and protosequence problems
SIAM J. Appl. Math.
 , 
1985
, vol. 
45
 (pg. 
810
-
825
)
Siebert
S.
Backofen
R.
MARNA: multiple alignment and consensus structure prediction of RNAs based on sequence structure comparisons
Bioinformatics
 , 
2005
, vol. 
21
 (pg. 
3352
-
3359
)
Smith
T.
Waterman
M.
Identification of common molecular subsequences
J. Mol. Biol.
 , 
1981
, vol. 
147
 (pg. 
195
-
197
)
Sokal
R.
Michener
C.
A statistical method for evaluating systematic relationships
The University of Kansas Scientific Bulletin
 , 
1958
, vol. 
38
 (pg. 
1409
-
1438
)
Storz
G.
An expanding universe of noncoding RNAs
Science
 , 
2002
, vol. 
296
 (pg. 
1260
-
1263
)
Thompson
J.
, et al.  . 
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice
Nucleic Acids Res.
 , 
1994
, vol. 
22
 (pg. 
4673
-
4680
)
Thompson
J.
, et al.  . 
Improved sensitivity of profile searches through the use of sequence weights and gap excision
Comput. Appl. Biosci.
 , 
1994
, vol. 
10
 (pg. 
19
-
29
)
Thompson
J.
, et al.  . 
A comprehensive comparison of multiple sequence alignment programs
Nucleic Acids Res.
 , 
1999
, vol. 
27
 (pg. 
2682
-
2690
)
Washietl
S.
, et al.  . 
Fast and reliable prediction of noncoding RNAs
Proc. Natl Acad. Sci. U.S.A.
 , 
2005
, vol. 
102
 (pg. 
2454
-
2459
)
Winkler
W.
Breaker
R.
Regulation of bacterial gene expression by riboswitches
Ann. Rev. Microbiol.
 , 
2005
, vol. 
59
 (pg. 
487
-
517
)
Yang
Q.
Blanchette
M.
StructMiner: a tool for alignment and detection of conserved secondary structure
Genome Inform Ser Workshop Genome Inform.
 , 
2004
, vol. 
15
 (pg. 
102
-
111
)
Zuker
M.
Calculating nucleic acid secondary structure
Curr. Opin. Struct. Biol.
 , 
2000
, vol. 
10
 (pg. 
303
-
310
)
Zuker
M.
Mfold web server for nucleic acid folding and hybridization prediction
Nucleic Acids Res.
 , 
2003
, vol. 
31
 (pg. 
3406
-
3415
)

Author notes

Associate Editor: Alex Bateman

Comments

0 Comments