Abstract
Motivation: Searching for non-coding RNA (ncRNA) genes and structural RNA elements (eleRNA) are major challenges in gene finding today as these often are conserved in structure rather than in sequence. Even though the number of available methods is growing, it is still of interest to pairwise detect two genes with low sequence similarity, where the genes are part of a larger genomic region.
Results: Here we present such an approach for pairwise local alignment which is based on foldalign and the Sankoff algorithm for simultaneous structural alignment of multiple sequences. We include the ability to conduct mutual scans of two sequences of arbitrary length while searching for common local structural motifs of some maximum length. This drastically reduces the complexity of the algorithm. The scoring scheme includes structural parameters corresponding to those available for free energy as well as for substitution matrices similar to RIBOSUM. The new foldalign implementation is tested on a dataset where the ncRNAs and eleRNAs have sequence similarity <40% and where the ncRNAs and eleRNAs are energetically indistinguishable from the surrounding genomic sequence context. The method is tested in two ways: (1) its ability to find the common structure between the genes only and (2) its ability to locate ncRNAs and eleRNAs in a genomic context. In case (1), it makes sense to compare with methods like Dynalign, and the performances are very similar, but foldalign is substantially faster. The structure prediction performance for a family is typically around 0.7 using Matthews correlation coefficient. In case (2), the algorithm is successful at locating RNA families with an average sensitivity of 0.8 and a positive predictive value of 0.9 using a BLAST-like hit selection scheme.
Availability: The program is available online at http://foldalign.kvl.dk/
Contact:gorodkin@bioinf.kvl.dk
INTRODUCTION
Non-coding RNAs (ncRNAs) have just within the last few years been recognized as a highly abundant class of genes, involved in various regulatory mechanisms in the cell (Eddy, 2002a). A variety of ncRNAs have received much attention recently, in particular the micro RNAs (Lagos Quintana et al., 2001; Lau et al., 2001; Lee and Ambros, 2001; Huttenhofer et al., 2002). A related class of regulatory elements are the RNA structural cis regulatory elements (eleRNAs), and counting these together with the ncRNA genes, there exist about 380 such entries in the current Rfam database (Griffiths-Jones et al., 2003). This class of genes is rapidly growing, and it is of increasing interest to detect them computationally.
Note that until just a few years ago ncRNAs were widely ignored in the gene prediction literature and to some extent in genome projects. In fact, even though ncRNAs now have attention in genome projects, little has been possible due to the high computational complexity of the methods for searching/screening for ncRNAs.
However, the recent introduction of the cm2hmm RNA family scanning tool is a breakthrough that seems to make it possible to systematically include ncRNA annotation in genome projects (Weinberg and Ruzzo, 2004a,2004b) at a higher level than just a few specific families (Lowe and Eddy, 1997, http://genome.wustl.edu/eddy/tRNAscan-SE/, 1999). These methods are based on stochastic context-free grammars (SCFGs) and hidden Markov models (HMMs). Previous approaches for scanning used in Rfam include the Infernal implementation by Eddy (2002b) that also allow for partial structural matches. Also, Brown (2000) constrained SCFGs with HMMs to match long RNA sequences (small subunit rRNA). A faster method, RSEARCH, is also useful (Klein and Eddy, 2003). Whereas these methods are well suited to locate known ncRNA families, they are not likely to find novel classes of ncRNAs. An approach suited for finding novel ncRNAs is QRNA which exploits a semi-conserved pairwise sequence alignment to detect significant covariance between the sequences (Rivas and Eddy, 2001). A recent approach using folding energy in related sequences has also been introduced to scan for ncRNAs (Washietl and Hofacker, 2004).
Other methods can also find a common motif in a set of multiple related Sequences. (Macke et al., 2001; Eddy, 2002b; Perriquet et al., 2003; Pavesi et al., 2004; Ji et al., 2004). However, these are not expected to work well on just two sequences. Other methods work well in finding the secondary structure once a good multiple alignment has been obtained (Knudsen and Hein, 1999; Hofacker et al., 2002).
If the structure searched for is energetically different from its sequence context, energy based approaches such as that of Hofacker et al. (2004b) might be of interest. Nonetheless, all the methods mentioned above require some sort of biological knowledge that goes into the problem prior to prediction. Here we address the basic problem of finding common local structure between just two sequences. To test the method we try to remove prior knowledge by considering sequences with low similarity, <40%. Furthermore, as some ncRNA sequences have a folding energy very different from their surrounding sequence context, we also discard such sequences. Sequences with low sequence similarity was also the major challenge for Weinberg and Ruzzo (2004b).
In principle the algorithm by Sankoff (1985) copes with many of the limitations mentioned above, but it is computationally intractable for more than two sequences. An early implementation of the Sankoff algorithm, foldalign, was in principle designed to handle the problem, but, was due to the computational capacities at that time, limited to stem-loop structures only (Gorodkin et al., 1997a, 1997b). Later heuristics were employed to speed up the computation (Gorodkin et al., 2001a,2001c). Other methods have implemented pairwise versions of the Sankoff algorithm (Mathews and Turner, 2002; Holmes, 2004; Hofacker et al., 2004a). Here, we not only include (non-pseudoknotted) bifurcated structures, but we also present an effective approach for conducting mutual simultaneous scan of two sequences to search for common local structural motifs.
Furthermore, in contrast to the former version of foldalign, which is limited to maximizing a score, the version presented here employs substitution matrix schemes similar to Klein and Eddy (2003) including parameters corresponding to energy terms with similarities to those in Dynalign (Mathews and Turner, 2002). However, foldalign still maximizes a score which enables it to perform local alignments.
ALGORITHM
The foldalign is a simplified version of the Sankoff algorithm (Sankoff, 1985) which uses dynamic programming to locally align two sequences of lengths L1 and L2. Previous versions (Gorodkin et al., 1997a,1997b, 2001b,c) used base-pair substitutions and a simple stacking scoring scheme. It was limited to stem-loop structures due to the computational complexity. The version presented here can find any non-pseudoknotted structure. Furthermore, the scoring scheme has been extended with some of the rules and parameters used in energy minimization (Mathews and Turner, 2002; Mathews et al., 1999; Xia et al., 1998; Zuker and Stiegler, 1981) as well as the computation of substitution matrices which now has similarities to RIBOSUM (Klein and Eddy, 2003) and BLOSSUM (Henikoff and Henikoff, 1992).
Computing scores and costs
The full recursion for computing the score Dij,kl of aligning subsequence (i,j) to subsequence (k,l) can be written as
The dynamical computation of Sninj, nknl depends on the five structural contexts: hairpin-loop, stem, internal-loop, bulge, and multibranch-loops (Fig. 1). In some contexts a previous computed cost S needs to be recalculated. How the cost Sninj, nknl is computed from the score matrices S(ni, nj, nk, nl) is outlined below. The context of each Dij,kl is stored in a separate matrix. The energy rules used in the cost calculation are a subset of the Turner energy rules described in the mfold documentation (Zuker et al., 1999; Mathews et al., 1999).
Hairpin-loop
An alignment is always initiated in the hairpin-loop context. The hairpin-loop context is the single-stranded part of a structure which is closed by two nucleotides that base-pair. The cost for aligning two hairpin-loops is:
The cost Slength = Shp_length(j − i + 1) + Shp_length(l − k + 1) is a loop size dependent cost, calculated from energy parameters stored in the hairpin column of the loop length costs table in the score matrix.
The default minimum hairpin-loop length is three nucleotides. For hairpin-loops longer than three nucleotides there is an energy cost for stacking the two unpaired nucleotides (ni, nj) at the end of the loop on top of the base-pair (ni−1, nj+1) which closes the hairpin-loop. The cost Sstack = Shp_stack(ni, nj, ni−1, nj+1) + Shp_stack(nk, nl, nk−1, nl+1) is calculated from the energy parameters stored in the hairpin close matrix in the score matrix. The hairpin-loops of both sequences have to be longer than three nucleotides before the stacking cost is added. The stacking values have a non-GC cost added. This cost can be explained by the distribution of the hydrogen bonds in the nearest neighbor model (Xia et al., 1998).
Stem
A stem is a series of stacked base-pairs at least two base-pairs long. A single base-pair is not allowed and is recalculated as part of the surrounding loop. The score for aligning two stems has two parts:
The stacking cost of the base-pair (i,j) depends on the base-pair (i − 1, j + 1). For the last base-pair special rules apply (see internal-loops and multibranch-loops). The cost Sstack = ∑Sstacking(ni, nj, ni−1, nj+1) + ∑ Sstacking(nk, nl, nk−1, nl+1) is the sum of the two independent sums of stackings in the two stems. The Sstacking cost is found in the stacking matrix in the score matrix.
When aligning two stems to each other, two gaps can be inserted into one of the subsequences making that stem one base-pairs longer than the other. This is only allowed when the stem is at least two base-pairs long and the inserted base-pair is surrounded by standard base-pairs (AU, GC, GU). An example of an insert base-pair calculation is
Internal-loop
Internal-loops and bulges are single-stranded parts of an RNA structure surrounded by stems. A bulge is a single-strand region in only one side of the structure (Fig. 1). An internal-loop contains single-stranded nucleotides on both sides of the structure. In the alignment an internal-loop aligned to a bulge is treated as two internal-loops. The score of an internal-loop has four parts:
The length cost is a function of the sum of lengths of each side of the internal-loops. For an internal-loop spanning positions i′ − i and positions j′ − j, the internal-loop lengths are Δli = i − i′ +1 and Δlj = j − j′ + 1. The cost Slength = Sinternal(Δli + Δlj) + Sinternal(Δlk + Δll) is calculated using the energy parameters stored in the internal column of the loop length costs table.
The two sides of an internal-loop do not have to have same lengths, but the length difference has a cost which is proportional to the length difference. Sasymmetry = Sasym(|Δli − Δlj|) + Sasym(|Δlk−Δll|) is the sum of the two asymmetry costs.
The first pair of nucleotides in an internal-loop stack with the last base-pair in the previous stem. Likewise, the last pair of nucleotides in the loop stack on the first base-pair in the next stem. The cost Sstack = Sil_stack(ni′,nj′,ni′+1,nj′−1) + Sil_stack(nk′,nl′,nk′+1,nl′−1) + Sil_stack(ni,nj,ni−1,nj+1) +Sil_stack(nk,nl,nk−1,nl+1) is the sum of these four costs for the two structures in the alignment. The internal-loop stacking values are those in the internal-loop stacking matrix in the score matrix. The internal-loop stacking parameters have also had a non-GC cost added.
Bulge-loop
Bulges are similar to internal-loops in many ways. The cost of a bulge is therefore calculated as an internal-loop until it is certain that it really is a bulge. Then the bulge cost is recalculated. The cost of a bulge is:
Multibranched-loop
Multibranched-loops are regions where ≥2 stems meet. The cost of a multibranch-loop is:
Backtracking
When the score D has been calculated for all values of (i, j, k, l), the structure of the best scoring alignment can be found by backtracking the D matrix.
Score matrices and parameters
Two types of parameters are used: energy and sequence similarity. The energy parameters are taken from mfold (Zuker et al., 1999; Mathews et al., 1999). All energy parameters have been multiplied by −10 and rounded to the nearest integer. The similarity parameters are log-odds substitution rates similar to RIBOSUM-matrices (Klein and Eddy, 2003) with two exceptions. No intra-cluster counts are allowed, and no low similarity cut-off is used (Henikoff and Henikoff, 1992). The two substitution matrices are scaled independently. The alignment used for making the substitution matrices is the 1995 version of the small subunit ribosomal RNA alignment from the European Ribosomal RNA Database (Van de Peer et al., 1994) supplied and described by Klein and Eddy (2003). The problem of estimating parameters from one family and applying these to others has previously been addressed, where it was found that parameters estimated from rRNA and tRNA could be applied to HIV-1 RNA analysis (Knudsen et al., 2004).
The stacking bonus can be estimated from a matrix just like the substitution costs. In this way the bonus becomes the cost of substituting one stacked base-pair with another stacked base-pair. The number of different stacked base-pairs is 64 = 1296 which is a very high number of parameters to estimate. Assuming that the stackings are independent of each other, the substitution cost is approximated as the substitution cost plus the two log-odds stackings bonus. This gives only 62 = 36 parameters to estimate. When testing the program, we had to choose between the experimentally measured stacking parameters and those estimated from the SSU alignment. We chose to use the energy stacking parameters.
Each score matrix has five parameters which are not determined experimentally or estimated from multiple alignments: gap opening, gap elongation, similarity clustering percentage, single-strand substitution matrix scale and base-pair substitution scale. These five values need to be chosen before running foldalign. If one wishes to conduct the structural alignment unbiased by sequence similarity, the single-strand substitution matrix and the base-pair substitution matrix are simply scaled to zero (Hofacker et al., 2004a).
Mutual scan of two sequences and the algorithmic complexity
The algorithm as presented so far has a time complexity of
The memory complexity of L1L2 λ δ can be reduced further by splitting the first sequence into smaller subsections. Each section is then aligned to a sliding window on the second sequence (Fig. 2). Since the program aligns all subsequences of lengths up to λ, each section has to overlap the next section by λ − 1 nucleotides. The alignments at each position k in sequence 2 are only dependent on the positions (k) to (k + λ −1). It is therefore only necessary to keep a window of size λ of the D matrix in the memory at any time. To ease the implementation the second sequence is scanned from end to start. In this way the values calculated for the (k +1) to (k + λ) window can be reused directly. The full two sequences are still kept in memory giving the program a memory requirement of O(lsλ2δ + L1 + L2), where ls is length of the subsections.
Splitting one sequence into overlapping sections increases the run time of the program. If the sequence is split into ⌈(L1 − (λ −1))/(ls − (λ −1))⌉ sections of length ls, the time complexity becomes O((L1ls/ls − λ) L2 λ2δ2). The foldalign uses ls = 2λ as the default ls value. This subsection length gives a doubling of the run time compared to not splitting the sequence. When the alignment step is done, the regions of interest must be realigned one by one to get the structure. This adds another O(λ4δ2) per hit region. Currently, only the best scoring hit is backtracked.
The total run time scales as O(L1L2λ2δ2 + Nλ4δ2), where N is the number of structures to be backtracked.
To speed up the program, parts (b) and (e) of Equation (1) have not been implemented. This does not affect the time and memory complexities. Since the alignments of cases (b) and (e) can be done by combinations of some of the other cases this does not change the types of structures that can be aligned.
A constraint has also been placed on when part (g) of Equation (1) is calculated. The (g) part combines two substructures. To speed up the the program the calculation is only performed when the left part of the structure (Din,km) has base-pairs between (i & n) and (k & m). The right part of the structure (D(n+1)j, (m+1)l) is limited to the cases where j is base-paired and l is base-paired. These constraints do not limit the kinds of structure that can be found since any unconstrained structures that can be found, can also be found as a constrained structure extended using the (a), (c), (d) and (f) parts of Equation (1).
Selection of a hit region
It is possible for two sequences to share more than one common locally conserved structure. Two such structures cannot overlap each other in both sequences. A list of the best scoring, non-overlapping regions in the two sequences can be made by the following simple algorithm. In a matrix of all the best local alignment scores:When the list of alignments has been made, a cutoff for considering an alignment real or random has to be chosen. For one scan this can sometimes be seen simply by looking at the score, but when there are more than a few scans an automated method is needed. Three approaches are tested:
Find the best scoring alignment Dij,kl.
Remove all alignments Di′j′,k′l′ for which i ≤ i′ ≤ j or i ≤ j′ ≤ j, and k ≤ k′ ≤ l or k ≤ l′ ≤ l.
Repeat until all alignments have been removed or a predetermined threshold is reached.
The best n alignments: For each scan the n best scoring alignments are chosen. Simply selecting the best alignment from each scan is the n = 1 case. This is the default behavior of the foldalign program.
Z-value: The Z-score for all alignments in the list is calculated based on all the scores in the (i,k) matrix.
Extreme value distribution: The scores from the alignment are assumed to follow an extreme value distribution in the same way as the scores from the BLAST program (Karlin and Altschul, 1990). The extreme value distribution is characterized by two parameters Λ and K (Λ is usually called λ in other texts). The island method allows Λ and K to be estimated from a few random alignments scores (Olsen et al., 1999; Altschul et al., 2001). Due to the high time complexity of the algorithm Λ and K are sometimes estimated directly from the scores of the alignments in the alignment list. The calculations follow those in Altschul et al. (2001). The mean
\(\widehat{D}\), of the alignment scores for alignments with a score above a threshold C is calculated. Λ and K are then estimated as:where nD>C is the number of alignments in the list scoring better than the threshold C. The estimated Λ and K are now used to calculate a rough estimate of the probability for seeing a random alignment with the score D or better.\[{\Lambda }_{C}=ln\left(1+\frac{1}{\widehat{D}-C}\right),\hbox{ \hspace{1em} }k=\frac{{n}_{D > C}{exp}^{{\Lambda }_{c}C}}{{L}_{1}{L}_{2}},\]The probability score should not be taken as a good estimate of the real probability.\[\hbox{ Prob }(\hbox{ Score }\ge D)\approx 1-exp(-K{L}_{1}{L}_{2}exp(-\Lambda D\left)\right).\](8)
The work in Heyer (2000) also describes related statistical properties of foldalign scores.
Performance evaluation
Two cases of performance are evaluated: the structure of the alignment and the localization of alignments. The structure performance measures how well the predicted structure fits to a known curated structure. The localization performance tells us how well the correct alignments are separated from the random alignments. This is our ability to select the right region among many hits. Three measures are used to evaluate the predictions: the Matthews correlation coefficient (CC) (Matthews, 1975), the sensitivity (SEN) and the positive predictive value (PPV) (sometimes called the specificity). These are defined as:
Structure
The evaluation of structure predictions follow Mathews and Turner (2002) and Mathews et al. (1999). A predicted base-pair between positions i and j is considered as correctly predicted, true positive, if a base-pair is annotated between i and j, i ± 1 and j, or i and j ± 1. Predicted base-pairs which pair other positions are counted as false positives. Annotated base-pairs which are not predicted are counted as false negatives.
Localization
A predicted hit (alignment) is counted as a true positive, if for both sequences at least half the positions predicted as part of the alignment overlap an annotated motif. An alignment where less than of half the predicted positions are annotated on one or both sequences is counted as a false positive alignment. From the annotation of the sequences it is known which regions should be aligned. Any alignments between two RNAs from the same family missing are counted as false negative alignments. For correct alignments and completely missed alignments the family of the alignment is taken from the annotations. A false positive alignment which overlaps an annotated gene with at least half of its positions is counted as belonging to the family of the annotated RNA. All other false positives are counted as a part of the unknown family. Predicted alignments between two different types of RNAs are counted as unknown false positives.
IMPLEMENTATION
The foldalign algorithm is implemented as a C++ program. The program has been tested on machines running different types of Linux. The implementation uses an extra matrix besides the score and the context matrices already described. This matrix, the length matrix, keeps track of the number of times the same context has been used in a row. This is used to speed up the backtracking during alignment. The memory requirement of the length matrix scales in the same way as the other two matrices.
The program has a number of command-line options. The most important ones control the λ and the δ parameters. The number of overlapping nucleotides during scans can also be set. These three parameters greatly influence the memory consumption and the run time of the program. By default λ and δ are set to the sequence length. The number of overlapping nucleotides ls is set to 2λ by default. If this is longer than the sequence length, then the sequence length is used. The program performs a local alignment by default, but a global alignment can be chosen using an option. A score matrix different from the default can be read from a file. The default score matrix file is part of the software package. Using option ‘help’ will print a small help text.
The default sequence input format is fasta. The program can also read a space/tab separated file with the format name<space>sequence<space>not_used<space> comments. A third file format is the pair format. In this format each line in the input file contains name1<space>sequence1<space>name2<space>sequence2<space>options. The program will align sequence 1 with sequence 2 using the options specified. Finally two sequences can be specified on the command-line.
RESULTS
The foldalign predicts a local alignment and structure for two sequences containing a common conserved RNA-motif. The version presented here has been improved over earlier versions (Gorodkin et al., 1997b) in two ways.
The memory complexity of the algorithm has been lowered to a point where sequences of any length can be scanned given enough time. However, the length of the RNA-motifs is still limited by the memory of the computer. As in the previous versions of the algorithm two heuristics are used to limit the calculation time and the memory consumption. The RNA-motif length is limited to λ nucleotides. This speeds up the calculation, but more importantly it allows the sequences to be aligned in smaller sections (Fig. 2). The second heuristic limits the length difference between two subsections being aligned. Limiting this length difference to δ nucleotides greatly lowers the computation time and also lowers the memory requirement. To illustrate the necessity and efficiency of these heuristics, two sequences, 300 nucleotide long, were aligned using the different time and memory complexities shown in Table 1. The machine used was an SGI Altrix 3000 with 64 Intel Itanium 2900 MHz processors and 190 GB of memory. Even with the relatively short sequences used in this example there is a large drop in the needed time and memory. The memory requirement is reduced to <1% of the unlimited case. For the longer sequences used later the savings are even greater. This represents an effective constraint on the Sankoff algorithm when the motif is limited in size.
The second improvement is the use of a subset of the rules used in energy minimization folding. Instead of just using single-strand substitutions, base-pair substitutions and a limited form of stacking, the algorithm now also utilizes multiple forms of stacking and length dependencies of different structural elements. The energy parameters used are a subset of those used by mfold (Zuker et al., 1999; Mathews et al., 1999). The substitution matrices are based on the RIBOSUM-matrices (Klein and Eddy, 2003), but follow Henikoff and Henikoff (1992) more strictly.
The program was tested using sequences selected using two main criteria. Two RNA-motifs being compared must be <40% similar. Furthermore, the folding energy of the RNA-motifs must not be significantly different from the folding energy of the surrounding sequence. The Z-value for the folding energy of the motif compared to the folding energy of the surrounding sequence has to be −2 or more. To keep the computation time down a motif-length limit of ∼150 nucleotides was used. Sequences were extracted from the 5S rRNA database (Szymanski et al., 2002), the tRNA database (Sprinzl et al., 1998) and the U1 database (Zwieb, 1996). To get the surrounding sequence an exact match for each sequence in the databases was found in GenBank (Benson et al., 2004). For sequences with multiple hits an entry was chosen at random. Furthermore, sequences from the purine and THI families of RFAM version 5.0 (Mandal et al., 2003; Mandal and Breaker, 2004; Rodionov et al., 2002; Miranda Rios et al., 2001; Sudarsan et al., 2003; Winkler et al., 2002; Kubodera et al., 2003Griffiths-Jones et al., 2003) were used. The RFAM structures were compared and corrected to fit those in the publications. Unpaired nucleotides at the ends of the RFAM structures were removed. The surrounding sequence of the purine and THI sequences were located using their RFAM names. RNA-motifs which have a sequence similarity of ≥90% are considered to be the same motif. Each RNA-motif was allowed to be in a maximum of three sequence pairs. Due to the high number of known tRNA sequences these were only allowed in one pair. Two versions of the dataset were made: One consisting of only the RNA-motifs, and one containing 500 nucleotide long sequences each containing one of the RNA-motifs. Some of the long sequences contain RNAs other than those in the dataset. These were annotated, using GenBank and RFAM, in the long sequences but not included in the motif dataset. In some cases a sequence contains one or two other RNAs which were only partially included. These RNAs were not annotated. Hits to such partial genes are counted as false positives. All GenBank sequences were manually checked. The motif dataset contains 99 RNA pairs and the long dataset also has 99 sequence pairs but a total of 295 gene pairs. As an extra test set eight sequence pairs were taken from the SRP database (Rosenblad et al., 2003). Since the SRP motifs are longer than the other motifs used, 1000 nt long sequences were extracted from GenBank.
The algorithm's performance was measured for two different cases: structure and selection. To align two subsequences correctly the algorithm has to find a common structure. How well the predicted structure fits with the structure annotated in the database is the first performance test. The second performance test tells how well the algorithm locates RNA-elements in long sequences.
Structure predictions
To evaluate the algorithm's ability to find the correct structure shared between two sequences the RNA-motif sequences were aligned without surrounding sequence. λ was set to the sequence length, and δ to 25. Different values of the five parameters were tested. The performance evaluation for the score matrix which best predicts the structures can be seen in Table 2 in the column foldalign. The best results for predicting each RNA-family separately can be seen in the foldalign family column. The correlation coefficient, CC for predicting the structures of 5S rRNA and U1 families are close to 0.7 for the same set of parameters. The tRNA performs a little better. The performance on the Purine family is better than on the other four families. The sequences in this set all have a very well conserved structure. There are no long inserts or deletions, and no extra or missing stems which explains the better performance. The THI family performs worse than average. The structure of the THI element is less well conserved. It can have very large inserts in stem 2, and stem 3 is often missing (see Figure 2 in Rodionov et al., 2002). The most important parameter for the structure prediction is the gap opening cost, and the weight of the single-strand substitution matrix. The gap elongation cost is of less importance and was usually fixed to half the gap opening cost. The single-strand substitution matrix affects the performance more than the base-pair substitution matrix. The best performing score matrix without any similarity has a CC of 0.6. This shows that even though sequence similarity between the sequences is low there is still some useful information left, which may explain why foldalign does slightly better than Dynalign which uses no sequence information.
To compare how well foldalign performs compared to Dynalign the motif dataset was aligned using Dynalign. Dynalign has two variable structural parameters: the gap penalty and whether to allow single base-pair insertions or not. A number of test alignments showed that allowing base-pair insert gave the best performance. Single base-pair insertions were allowed for alignments used. The sequence pairs were aligned using gap penalties between 0 and 2.0 kcal/mol with 0.1 kcal/mol increments. The performance of Dynalign using the best set of parameters to predict the structures of all families can be seen in the Dynalign column of Table 2. The best performances for each family can be seen in the Dynalign family column. While there are differences for the individual families, the overall performances of the two programs are very similar.
Dynalign has a constraint parameter, M. The positions of aligned nucleotides must be within M nucleotides of each other. It is similar to the foldalign δ constraint in the sense that two subsequences aligned can only have a length difference of M nucleotides. Therefore it could be argued that comparing a foldalign alignment made with δ = 25 to a Dynalign alignment made with M = 15 is not fair. The foldalign was rerun with the same parameters as in the foldalign column except that δ was changed to 15. There was no difference in performance.
Table 3 shows the time and memory needed by foldalign and Dynalign to align three sequence pairs of different lengths. It can be seen that while foldalign is much faster, it also uses more memory. Dynalign owes its memory efficiency to the M parameter. By requiring that the sequences are aligned to within M nucleotides instead of just limiting the length differences, one of the dimensions of the dynamic programming matrices is reduced from alignment length to M. Since foldalign does not have the prealignment requirement, the whole sequence section has to be kept in memory resulting in the higher memory requirement.
The parameters found to be optimal for predicting structures were used to predict the structures of the sequences in the SRP dataset. The average CC for the structure prediction was 0.65. The sensitivity and positive predictive values were 0.58 and 0.74, respectively.
Localization
To test foldalign's performance as a local alignment tool the dataset containing the RNA-motifs with their surrounding sequence was aligned. A number of different parameters were tested. λ was chosen to be approximately the size of the RNA-motif, and δ was fixed at 15. The most important parameter was the gap opening. It was found that higher gap opening penalties were needed for localization compared to structure prediction. Similarity scores was found to improve the localization performance.
To get a better estimate of the probabilities that the alignments found are random alignments, the sequences were shuffled ten times in a way which conserved the dinucleotide distribution (Workman and Krogh, 1999).
Table 4 shows the performance for hit selection using Z-value scores and probabilities based on extreme value distributions. For each sequence pair the parameters of the two distributions were estimated from the alignment of the shuffled sequences. Only hits with a Z-value >8 were reported. The performance is perfect except for tRNAs and THI elements.
The results reported in the Z-value column assume that the Z-values from one scan can be compared with those from another. This is most likely not the case. The foldalign scores are expected to follow an extreme value distribution like the scores from BLAST (Karlin and Altschul, 1990; Heyer, 2000). The probabilities of getting random alignments with the score of the predicted alignments are estimated. The estimates are very crude, and the estimated probabilities are far from correct, but they can be used to select the correct alignments. Alignments with a probability score <0.015 were selected as the predicted alignments. The performance figures for each family can be seen in the extreme value distribution part of Table 4. The performance for the THI family is again much worse than for the other families, but the tRNA sensitivity has improved significantly.
Table 5 shows how many good alignments would be missed or found if a fixed number of alignments were considered correct in each scan. With the exception of the THI and tRNA family all the correct alignments are found for all families even if only the best hit is chosen. Since tRNA genes are often clustered in the genome, the number of hits chosen for this family should be higher than for the other families. An example of multiple correct alignments can be seen in Figure 3.
The detection of tRNAs are made harder by the fact that they are clustered. The clustering raises the number of tRNA gene pairs in two sequences being compared. The methods using the extreme value distribution and the Z-value are better able to handle this than simply taking the n-best alignments. The THI family is hard to locate for any of the three selection methods. This is most likely due to the less well conserved structure of these elements.
The program was used to predict the localization of the SRP genes. In all the eight scans the correct SRP-alignment was the best scoring alignment. Since the sequences are long (1000 nt) and λ large (∼300 nt) it was not feasible to do random alignments. When the extreme value distribution was estimated directly from the alignment scores of the unshuffled sequences, seven of the correct alignments had a probability of <0.001 for being random; for the last alignment this probability was 0.028. The probability for the least likely false positive was 0.142.
DISCUSSION
We addressed the problem of finding common local structural motifs in two RNA sequences with low sequence similarity (<40%) as well as having the sequences energetically indistinguishable from their surrounding sequence context. We introduced a Sankoff (1985) based approach to conduct simultaneous mutual scan of two sequences for non-coding RNAs (ncRNAs). This implementation is an extension of the previous foldalign implementation as it now allows for bifurcated structure, includes RIBOSUM (Klein and Eddy, 2003) like substitution matrices and includes structural energy parameters in score computation. The approach presented here conducts pairwise structural alignments with the same quality as an existing method (Mathews and Turner, 2002) and obtains an overall performance of 0.72 using Matthews correlation coefficient.
Furthermore, as the method still maximizes a score it can perform local alignments. The local alignment approach was implemented such that it was memory efficient while conducting mutual scan of two sequences. The ability to make local alignments is desirable as a local alignment approach is expected to be more sensitive in detecting motifs smaller than the window size used in the scanning procedure, than in a global alignment approach (Mathews and Turner, 2002; Hofacker et al., 2004a) if used for scanning.
Different approaches for selecting significant hits were tested, and we found that the probability scores based on the estimated extreme value distribution worked best. This approach is likely to be refined, so even better hit selection can be made. The obtained performance was 0.82 in sensitivity and 0.88 in positive predictive value.
One limitation is that foldalign does not deal with structural inserts, which is the likely explanation why it did not perform well on the THI element.
The algorithm can clearly be improved with respect to structural inserts using the principles presented by Eddy (2002b). Also, the ability to select hit region candidates for ncRNAs can be improved.
The approach presented here should also be well suited to be included in a scheme that searches for a common motif in a set of related sequences and should thereby improve the earlier foldalign implementation for multiple sequences (Gorodkin et al., 1997b, 2001c). In fact, the greedy algorithm might be completely replaced by other methods, such as the stochastic context-free grammar implementation Infernal (Eddy, 2002b).
In conclusion, the new version of foldalign applies well to many classes of low-sequence similarity ncRNA sequences and complements existing methods for computational searches for ncRNAs.
The cost depends on five structural contexts: hairpin-loop, stem, internal-loop, bulge and multibranch-loop.
The cost depends on five structural contexts: hairpin-loop, stem, internal-loop, bulge and multibranch-loop.
The first sequence is split into smaller sections — a, b, c and d. Each section overlaps the next section with λ −1 nucleotides. Each section is aligned to the second sequence. During the alignment a window of λ nucleotides from sequence 2 is aligned to the current section. Then the window is moved by one-nucleotide, and the new window is aligned to the section. By starting from the end and moving toward the beginning of sequence 2 the information stored about the previous position can be easily reused. When all of sequence 2 has been aligned to the section, the alignment process starts over with the next section. Due to the sequence overlap foldalign with the default section length, ls = 2λ − 1, runs half as fast as when the section length equals the length of sequence 1.
The first sequence is split into smaller sections — a, b, c and d. Each section overlaps the next section with λ −1 nucleotides. Each section is aligned to the second sequence. During the alignment a window of λ nucleotides from sequence 2 is aligned to the current section. Then the window is moved by one-nucleotide, and the new window is aligned to the section. By starting from the end and moving toward the beginning of sequence 2 the information stored about the previous position can be easily reused. When all of sequence 2 has been aligned to the section, the alignment process starts over with the next section. Due to the sequence overlap foldalign with the default section length, ls = 2λ − 1, runs half as fast as when the section length equals the length of sequence 1.
The time and memory needed by the foldalign program
| Case | Time | Memory | Time | Memory |
|---|---|---|---|---|
| No heuristics | \({L}_{1}^{3}{L}_{2}^{3}\) | \({L}_{1}^{2}{L}_{2}^{2}\) | ∼3 days | 27.8 GB |
| Scanning | L1L2 λ2 δ2 | λ3 δ | 49 min 31 s | 273 MB |
| Case | Time | Memory | Time | Memory |
|---|---|---|---|---|
| No heuristics | \({L}_{1}^{3}{L}_{2}^{3}\) | \({L}_{1}^{2}{L}_{2}^{2}\) | ∼3 days | 27.8 GB |
| Scanning | L1L2 λ2 δ2 | λ3 δ | 49 min 31 s | 273 MB |
Two sequences, 300 nucleotide long, were aligned. In the scanning run λ = 100 and δ = 15 were used.
Structure prediction performance for foldalign and Dynalign
| RNA type | Number of pairs | foldalign | Dynalign | foldalign family | Dynalign family | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CC | SEN | PPV | CC | SEN | PPV | CC | SEN | PPV | CC | SEN | PPV | ||
| All | 99 | 0.72 | 0.68 | 0.78 | 0.69 | 0.69 | 0.71 | ||||||
| 5S rRNA | 2 | 0.70 | 0.66 | 0.75 | 0.61 | 0.59 | 0.65 | 0.74 | 0.61 | 0.95 | 0.64 | 0.46 | 0.93 |
| Purine | 5 | 0.93 | 0.91 | 0.95 | 0.90 | 0.87 | 0.95 | 0.93 | 0.92 | 0.94 | 0.97 | 0.99 | 0.96 |
| THI | 21 | 0.45 | 0.31 | 0.69 | 0.45 | 0.40 | 0.51 | 0.61 | 0.58 | 0.65 | 0.56 | 0.52 | 0.60 |
| U1 | 6 | 0.65 | 0.67 | 0.63 | 0.69 | 0.77 | 0.63 | 0.71 | 0.77 | 0.66 | 0.71 | 0.77 | 0.65 |
| tRNA | 65 | 0.86 | 0.84 | 0.89 | 0.82 | 0.83 | 0.82 | 0.86 | 0.84 | 0.89 | 0.84 | 0.84 | 0.83 |
| RNA type | Number of pairs | foldalign | Dynalign | foldalign family | Dynalign family | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CC | SEN | PPV | CC | SEN | PPV | CC | SEN | PPV | CC | SEN | PPV | ||
| All | 99 | 0.72 | 0.68 | 0.78 | 0.69 | 0.69 | 0.71 | ||||||
| 5S rRNA | 2 | 0.70 | 0.66 | 0.75 | 0.61 | 0.59 | 0.65 | 0.74 | 0.61 | 0.95 | 0.64 | 0.46 | 0.93 |
| Purine | 5 | 0.93 | 0.91 | 0.95 | 0.90 | 0.87 | 0.95 | 0.93 | 0.92 | 0.94 | 0.97 | 0.99 | 0.96 |
| THI | 21 | 0.45 | 0.31 | 0.69 | 0.45 | 0.40 | 0.51 | 0.61 | 0.58 | 0.65 | 0.56 | 0.52 | 0.60 |
| U1 | 6 | 0.65 | 0.67 | 0.63 | 0.69 | 0.77 | 0.63 | 0.71 | 0.77 | 0.66 | 0.71 | 0.77 | 0.65 |
| tRNA | 65 | 0.86 | 0.84 | 0.89 | 0.82 | 0.83 | 0.82 | 0.86 | 0.84 | 0.89 | 0.84 | 0.84 | 0.83 |
The main columns are the RNA type and the overall average, All. The foldalign and Dynalign columns are the performances when one set of parameters is used to predict the structures of all families. The foldalign and Dynalign families show the performance when an optimal set of parameters is used for each family. For each main column there are three subcolumns with the correlation coefficient CC, the sensitivity SEN and the positive predictive value PPV.
The time and memory needed for foldalign and Dynalign to align two sequences
| RNA type | Gb accession 1 | Gb accession 2 | Length 1 | Length 2 | foldalign | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| δ = 15 | δ = 25 | Dynalign | ||||||||
| CPU time | Memory (MB) | CPU time | Memory (MB) | CPU time | Memory (MB) | |||||
| tRNA | AF047724 | AD000019 | 72 | 72 | 16 s | 45 | 28 s | 64 | 6 min 41 s | 18 |
| THI | AE007806 | AP005276 | 91 | 93 | 46 s | 98 | 1 min 30 s | 138 | 19 min 42 s | 25 |
| U1 | AC004546 | X63783 | 164 | 163 | 7 min 37 s | 521 | 16 min 18 s | 779 | 2 h 10 min 1 s | 70 |
| RNA type | Gb accession 1 | Gb accession 2 | Length 1 | Length 2 | foldalign | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| δ = 15 | δ = 25 | Dynalign | ||||||||
| CPU time | Memory (MB) | CPU time | Memory (MB) | CPU time | Memory (MB) | |||||
| tRNA | AF047724 | AD000019 | 72 | 72 | 16 s | 45 | 28 s | 64 | 6 min 41 s | 18 |
| THI | AE007806 | AP005276 | 91 | 93 | 46 s | 98 | 1 min 30 s | 138 | 19 min 42 s | 25 |
| U1 | AC004546 | X63783 | 164 | 163 | 7 min 37 s | 521 | 16 min 18 s | 779 | 2 h 10 min 1 s | 70 |
The machine used has two 2.4 MHz Intel XEON processors and 2 GB of memory.
The location performance for each family
| RNA | Z-value | Extreme value distribution | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Pt | Pf | Nf | SEN | PPV | Pt | Pf | Nf | SEN | PPV | |
| 5S rRNA | 2 | 0 | 0 | 1.00 | 1.00 | 2 | 0 | 0 | 1.00 | 1.00 |
| Purine | 5 | 0 | 0 | 1.00 | 1.00 | 5 | 1 | 0 | 1.00 | 0.83 |
| THI | 13 | 6 | 8 | 0.62 | 0.68 | 8 | 1 | 13 | 0.38 | 0.89 |
| U1 | 6 | 0 | 0 | 1.00 | 1.00 | 6 | 1 | 0 | 1.00 | 0.86 |
| tRNA | 134 | 23 | 109 | 0.55 | 0.85 | 174 | 44 | 69 | 0.72 | 0.80 |
| Unknown | 30 | 25 | ||||||||
| RNA | Z-value | Extreme value distribution | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Pt | Pf | Nf | SEN | PPV | Pt | Pf | Nf | SEN | PPV | |
| 5S rRNA | 2 | 0 | 0 | 1.00 | 1.00 | 2 | 0 | 0 | 1.00 | 1.00 |
| Purine | 5 | 0 | 0 | 1.00 | 1.00 | 5 | 1 | 0 | 1.00 | 0.83 |
| THI | 13 | 6 | 8 | 0.62 | 0.68 | 8 | 1 | 13 | 0.38 | 0.89 |
| U1 | 6 | 0 | 0 | 1.00 | 1.00 | 6 | 1 | 0 | 1.00 | 0.86 |
| tRNA | 134 | 23 | 109 | 0.55 | 0.85 | 174 | 44 | 69 | 0.72 | 0.80 |
| Unknown | 30 | 25 | ||||||||
This shows how well the correct alignmentsy are selected from the random alignments. Two methods were used to predict the alignments that were hit. The first used Z-scores and the second a probability estimated from the extreme value distribution. The true positives, Pt, are alignments where at least half of the two RNAs are predicted to align. The false positives, Fp, are the hits where one RNA of this type either overlaps less than half of another RNA or something that has not been annotated. When the two regions, which are not annotated, are aligned then they are counted as being part of the ‘Unknown’ family. False negatives, Fn, is the number of alignments which should have been found but were not. SEN is the sensitivity and PPV the positive prediction value.
The columns indicate how many good alignments would be missed or found if the number of alignments indicated in the column header were chosen
| RNA | Number of hits chosen | |||||||
|---|---|---|---|---|---|---|---|---|
| 100 | 20 | 10 | 5 | 4 | 3 | 2 | 1 | |
| Missed | ||||||||
| 5S rRNA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Purine | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| THI | 0 | 0 | 3 | 7 | 7 | 8 | 8 | 11 |
| U1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| tRNA | 0 | 6 | 29 | 64 | 82 | 108 | 149 | 190 |
| Found | ||||||||
| 5S rRNA | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Purine | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| THI | 21 | 21 | 18 | 14 | 14 | 13 | 13 | 10 |
| U1 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| tRNA | 243 | 237 | 214 | 179 | 161 | 135 | 94 | 53 |
| RNA | Number of hits chosen | |||||||
|---|---|---|---|---|---|---|---|---|
| 100 | 20 | 10 | 5 | 4 | 3 | 2 | 1 | |
| Missed | ||||||||
| 5S rRNA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Purine | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| THI | 0 | 0 | 3 | 7 | 7 | 8 | 8 | 11 |
| U1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| tRNA | 0 | 6 | 29 | 64 | 82 | 108 | 149 | 190 |
| Found | ||||||||
| 5S rRNA | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Purine | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| THI | 21 | 21 | 18 | 14 | 14 | 13 | 13 | 10 |
| U1 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| tRNA | 243 | 237 | 214 | 179 | 161 | 135 | 94 | 53 |
(A) Shows the Z-value for the best local alignment starting at each pair of positions along the two sequences. The parameters of the Z-value distribution is calculated using the scores of the highest scoring alignment starting at all pairs of positions between the two sequences. The sequence on the x-axis contains one tRNA starting at position 297. The sequence on the y-axis has three tRNAs annotated. These are located at positions 12, 89 and 173. The three regions where the genes align are easily spotted in the plot. The tRNA at position 89 is the one with similarity <40% to the tRNA in the x-axis sequence. The database structures and the foldalign structure prediction for this region can be seen in (B). The two gaps are placed differently and the final nucleotide is missing but the rest of the predicted structure is correct.
(A) Shows the Z-value for the best local alignment starting at each pair of positions along the two sequences. The parameters of the Z-value distribution is calculated using the scores of the highest scoring alignment starting at all pairs of positions between the two sequences. The sequence on the x-axis contains one tRNA starting at position 297. The sequence on the y-axis has three tRNAs annotated. These are located at positions 12, 89 and 173. The three regions where the genes align are easily spotted in the plot. The tRNA at position 89 is the one with similarity <40% to the tRNA in the x-axis sequence. The database structures and the foldalign structure prediction for this region can be seen in (B). The two gaps are placed differently and the final nucleotide is missing but the rest of the predicted structure is correct.
This work was supported by the Danish Technical Research Council, the Ministry of Food, Agriculture and Fisheries, and the Danish Center for Scientific Computing. We thank Dave Mathews for the help with Dynalign.




Comments