Abstract

Summary: Assessing the statistical significance of structured RNA predicted from multiple sequence alignments relies on the existence of a good null model. We present here a random shuffling algorithm, Multiperm, that preserves not only the gap and local conservation structure in alignments of arbitrarily many sequences, but also the approximate dinucleotide frequencies. No shuffling algorithm that simultaneously preserves these three characteristics of a multiple (beyond pairwise) alignment has been available to date. As one benchmark, we show that it produces shuffled exonic sequences having folding free energy closer to native sequences than shuffled alignments that do not preserve dinucleotide frequencies.

Availability: The Multiperm GNU Cb++ source code is available at http://www.anandam.name/multiperm

Contact:anandam@u.washington.edu

Supplementary information:Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

An important question in any computational genomic search is whether its predictions are sufficiently surprising to suggest biological significance. A standard approach is to re-run the prediction algorithms on random data that are similar to real genomic data, and compare the score distribution of the resulting (presumably false) predictions to those from native sequences.

For prediction of structured RNAs, the folding free energy is a critical attribute. Since G–C base pairs are more energetically favorable, random sequences generated as controls should ideally exhibit the same mononucleotide frequencies as native sequences. More subtly, dinucleotide frequencies also matter, since stacked pairs in helices also affect folding energy. Several studies, (e.g. Clote et al., 2005; Workman and Krogh, 1999) convincingly demonstrate that this is more than just a theoretical concern.

A very elegant algorithm for permuting a single sequence while preserving dinucleotide frequencies performs an Eulerian walk on a directed multigraph constructed from the original sequence (Altschul and Erickson, 1985). They also show that shuffling is preferable to sampling from a Markov chain, which only preserves expected frequencies.

With multiple sequence alignments, competing preservation requirements make shuffling more challenging. The mononucleotide frequencies, gap structure and local conservation patterns are all important (Washietl and Hofacker, 2004); additionally preserving dinucleotide frequencies reduces the number of possible shuffles unacceptably—often no non-native alignment is possible. Nevertheless, several authors have lamented this situation, as failure to preserve dinucleotide frequencies in pairwise alignments can result in a drastic underestimate of false-positive rates in RNA prediction (Babak et al., 2007).

We present here a shuffling algorithm for arbitrary multiple sequence alignments that preserve mononucleotide frequencies, gap structure and local conservation patterns exactly, while preserving dinucleotide frequencies approximately. The number of distinct shuffles that can be produced by the algorithm is very large for the vast majority of multiple sequence alignments. Together these characteristics provide a much more realistic null model for RNA prediction studies than previously available.

2 METHODS

The essential idea behind the algorithm is to permute columns of the multiple alignment so as to preserve dinucleotide frequencies in the alignment's consensus sequence, hence approximately in the individual sequences, while respecting constraints on gap and local conservation patterns.

We partition columns in the multiple sequence alignment into conservation classes based on the positions of gaps and a coarse representation of the nucleotide conservation pattern in each column as in Washietl and Hofacker (2004). Specifically, a class in an n-sequence alignment is the concatenation of a binary n-vector indicating presence of gaps, and an integer c summarizing conservation in the column. The extent of coarse-graining of the latter attribute is controlled by the level parameter. At level 0, c equals the mean pairwise identity (MPI) of the column rounded to zero digits after the decimal point (thus, only 0 or 1). At level 1, c is the MPI of the column rounded to one digit after the decimal point (11 possible bins). Since there are fewer conservation classes at level 0, the number of possible unique shuffles is greater at level 0 than at level 1.

Our shuffling heuristic is inspired by Altschul and Erickson's Euler tour technique, but differs in important ways. In outline, we augment the Altschul–Erickson multigraph with conservation class information, then randomly ‘walk’ the graph guided by that information. Our multigraph has up to four vertices, one for each type of nucleotide that appears in the consensus sequence for the multiple alignment. For each dinucleotide in the consensus sequence, there is an edge from the start nucleotide to the end nucleotide, labeled by the end column, and hence its consensus character and conservation class. The crux of Altschul and Erickson (1985) is that the consensus character labels on a random Euler tour of this graph (i.e., a tour crossing each edge once) exactly preserves the dinucleotide frequencies of the underlying sequence (and uniformly samples all such sequences).

Our algorithm uses conservation classes of successive columns in the input alignment as a script to restrict selection of the next edge (alignment column) from the edge list at the current vertex, selecting randomly among those unused outgoing edges whose labels match the desired conservation class. If no such edge is available, pick uniformly at random among unused edges with the correct end class leaving other vertices; this produces a violation in the preservation of consensus dinucleotide frequencies (but preserves conservation class sequence). If such violations are infrequent, and if the consensus nucleotides are indeed representative of the nucleotides in each column, we will have good approximate preservation of dinucleotide frequencies. Our implementation includes two small refinements of this basic scheme. First, to avoid biasing dinucleotide violations towards the end of the alignment, we start the walk at a randomly chosen column (and wrap from end to beginning) instead of always beginning at the first column. Second, regions of the alignment where the majority of characters in a column are gaps are specially handled. Specifically, for such regions we attempt to preserve the dinucleotide formed by the pair of consensus nucleotides bracketing the region, in preference to the consensus pairs entering and leaving the region, as the former is likely to make a bigger contribution to the overall dinucleotide statistics. Supplementary Figures S1 and S2 illustrate the workings of the algorithm.

3 RESULTS

In addition to dinucleotide statistics, a key concern is whether a large number of unique shuffles remain possible for a given multiple sequence alignment. As in Babak et al. (2007), we use the fraction of nucleotides in the alignment that change identity to quantify the degree of shuffling. We define the shufflability of a multiple sequence alignment to be the mean of this quantity for 1000 random alignments. At level 0, we found that about 90% of a sample of the UCSC 17-way vertebrate alignments had a shufflability above 0.5, comparable to the 0.53 reported by Babak et al. (2007) in pairwise alignments.

Early work highlighting the importance of dinucleotide frequencies focused on coding sequence (Clote et al., 2005; Workman and Krogh, 1999). As a benchmark, we compared shuffling algorithms on the 570 multiple alignments within exons in human chromosome Y. We shuffled each alignment 100 times as follows: (i) each sequence independently preserving mononucleotide frequencies, (ii) each sequence independently preserving dinucleotide frequencies [Altschul-Erickson algorithm from Clote et al. (2005)], (iii) the whole multiple alignment using

rnazRandomize.pl
from the RNAz package at levels 0 and 1 and (iv) the whole multiple alignment using Multiperm at levels 0 and 1. For each native and shuffled sequence, we used RNAfold (Hofacker et al., 1994) to obtain its minimum free energy, and calculated a Z-scores for the native with respect to the shuffled background.

Results are shown in Table 1. The independent mono- and dinucleotide shuffles extend the results of Workman and Krogh (1999), showing in a larger dataset that significant dinucleotide effects are visible in individual exons (not just full length mRNAs) and pervasive in vertebrates. (Intriguingly, the shift of approximately 0.44 SD is quite stable among the homeotherms tested, while the frog and fish species all show shifts near 0.30. However, the smaller datasets and weaker alignments to these four species may contribute to the apparent discrepancy.) In nearly all cases, preserving dinucleotide frequencies produces sequences whose minimum free energies on average are nearly the same as native sequences, unlike the mononucleotide shuffle. Of course, independently shuffling aligned sequences destroys the alignment, making it a poor null model. The RNAz shuffle does preserve this similarity but the resulting Z-scores are not substantially different from mononucleotide shuffles of the individual sequences. The Multiperm Z-scores are significantly closer to native, while still preserving other key characteristics of the alignments. The data thus underscore the importance of preserving dinucleotide frequencies when generating random alignments, and show that our method better approximates this characteristic.

Table 1.

Minimum free energy Z-scores using different shuffles

 Num Len G+C Independent
 
RNAz
 
Multiperm
 
   (%) Mono di 
hg18 570 144 44 −0.46 −0.03 −0.47 −0.45 −0.24 −0.27 
panTro1 532 145 44 −0.43 0.00 −0.44 −0.42 −0.21 −0.24 
canFam2 528 145 48 −0.47 −0.05 −0.47 −0.47 −0.23 −0.29 
bosTau2 519 148 49 −0.49 −0.06 −0.49 −0.48 −0.27 −0.31 
rheMac2 500 147 45 −0.50 −0.04 −0.51 −0.48 −0.25 −0.29 
mm8 441 142 43 −0.54 −0.08 −0.53 −0.52 −0.35 −0.37 
loxAfr1 428 147 46 −0.43 0.02 −0.43 −0.42 −0.19 −0.23 
rn4 422 143 45 −0.50 −0.07 −0.51 −0.49 −0.32 −0.36 
dasNov1 371 148 46 −0.44 0.06 −0.46 −0.44 −0.21 −0.26 
monDom4 367 151 42 −0.54 −0.10 −0.55 −0.51 −0.37 −0.37 
oryCun1 366 140 48 −0.52 −0.08 −0.52 −0.52 −0.31 −0.36 
echTel1 360 143 47 −0.46 0.03 −0.42 −0.40 −0.16 −0.21 
galGal2 250 143 42 −0.50 −0.05 −0.48 −0.46 −0.31 −0.34 
xenTro1 213 139 41 −0.57 −0.25 −0.58 −0.56 −0.45 −0.46 
tetNig1 195 130 56 −0.23 0.04 −0.24 −0.27 −0.18 −0.21 
fr1 164 139 53 −0.18 0.14 −0.17 −0.16 −0.09 −0.10 
danRer3 157 141 48 −0.27 0.04 −0.26 −0.26 −0.19 −0.20 
 Num Len G+C Independent
 
RNAz
 
Multiperm
 
   (%) Mono di 
hg18 570 144 44 −0.46 −0.03 −0.47 −0.45 −0.24 −0.27 
panTro1 532 145 44 −0.43 0.00 −0.44 −0.42 −0.21 −0.24 
canFam2 528 145 48 −0.47 −0.05 −0.47 −0.47 −0.23 −0.29 
bosTau2 519 148 49 −0.49 −0.06 −0.49 −0.48 −0.27 −0.31 
rheMac2 500 147 45 −0.50 −0.04 −0.51 −0.48 −0.25 −0.29 
mm8 441 142 43 −0.54 −0.08 −0.53 −0.52 −0.35 −0.37 
loxAfr1 428 147 46 −0.43 0.02 −0.43 −0.42 −0.19 −0.23 
rn4 422 143 45 −0.50 −0.07 −0.51 −0.49 −0.32 −0.36 
dasNov1 371 148 46 −0.44 0.06 −0.46 −0.44 −0.21 −0.26 
monDom4 367 151 42 −0.54 −0.10 −0.55 −0.51 −0.37 −0.37 
oryCun1 366 140 48 −0.52 −0.08 −0.52 −0.52 −0.31 −0.36 
echTel1 360 143 47 −0.46 0.03 −0.42 −0.40 −0.16 −0.21 
galGal2 250 143 42 −0.50 −0.05 −0.48 −0.46 −0.31 −0.34 
xenTro1 213 139 41 −0.57 −0.25 −0.58 −0.56 −0.45 −0.46 
tetNig1 195 130 56 −0.23 0.04 −0.24 −0.27 −0.18 −0.21 
fr1 164 139 53 −0.18 0.14 −0.17 −0.16 −0.09 −0.10 
danRer3 157 141 48 −0.27 0.04 −0.26 −0.26 −0.19 −0.20 

The number of multiple alignments (Num) containing the species, the length of the consensus sequence (Len), the G+C content and the computed minimum free energy Z-scores are shown for multiple alignments within exons in human chromosome Y.

ACKNOWLEDGEMENTS

We thank Robert Gentleman for useful discussions.

Funding: Danish Research Council.

Conflict of Interest: none declared.

REFERENCES

Altschul
SF
Erickson
BW
Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage
Mol. Biol. Evol.
 , 
1985
, vol. 
2
 (pg. 
526
-
538
)
Babak
T
, et al.  . 
Considerations in the identification of functional RNA structural elements in genomic alignments
BMC Bioinformatics
 , 
2007
, vol. 
8
 pg. 
33
 
Clote
P
, et al.  . 
Structural RNA has a lower folding energy than random RNA of the same dinucleotide frequency
RNA
 , 
2005
, vol. 
11
 (pg. 
578
-
591
)
Hofacker
IL
, et al.  . 
Fast folding and comparison of RNA secondary structures
Monatshefte f. Chemie
 , 
1994
, vol. 
125
 (pg. 
167
-
188
)
Washietl
S
Hofacker
IL
Consensus folding of aligned sequences as a new measure for the detection of functional RNAs by comparative genomics
J. Mol. Biol.
 , 
2004
, vol. 
342
 (pg. 
19
-
30
)
Workman
C
Krogh
A
No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution
Nucleic Acids Res.
 , 
1999
, vol. 
27
 (pg. 
4816
-
4822
)

Author notes

Associate Editor: Dmitrij Frishman

Comments

0 Comments