SHAPE directed RNA folding

Summary: Chemical mapping experiments allow for nucleotide resolution assessment of RNA structure. We demonstrate that different strategies of integrating probing data with thermodynamics-based RNA secondary structure prediction algorithms can be implemented by means of soft constraints. This amounts to incorporating suitable pseudo-energies into the standard energy model for RNA secondary structures. As a showcase application for this new feature of the ViennaRNA Package we compare three distinct, previously published strategies to utilize SHAPE reactivities for structure prediction. The new tool is benchmarked on a set of RNAs with known reference structure. Availability and implementation: The capability for SHAPE directed RNA folding is part of the upcoming release of the ViennaRNA Package 2.2, for which a preliminary release is already freely available at http://www.tbi.univie.ac.at/RNA. Contact: michael.wolfinger@univie.ac.at Supplementary information: Supplementary data are available at Bioinformatics online.


Availability
The complete set of benchmark data is available for downloaded at http:// github.com/dluntzer/shapebenchmark. These data may be of use for future comparisons against novel or updates approaches.

Related Work
Soft constraints are naturally implemented in folding algorithms using bonus energies. Since the size limit on Application Notes makes it impossible to include a proper review of the pertinent literature in the main text, we briefly summarize the relevant literature in this section of the Supplement.
The idea of guiding RNA folding by adding pseudo-energies to specific terms in the recursion goes back all the way to early implementations of hard constraints for excluding or forcing base pairs in early versions of mfold Zuker et al. (1991) and the ViennaRNA Package Hofacker et al. (1994). Although this particular form of pseudo-energies has been replaced by hard constraints on the recursions in more recent implementations of RNA folding algorithms, see e.g. Mathews et al. (2004), the concept of pseudo-energies terms has been persistent. * These authors contributed equally to this work † To whom correspondence should be addressed For a while, exceptions to general energy rules were handled as positionspecific bonus energies on top of a standard energy model (e.g. in the Turner 1999 model, Mathews et al. (1999). Probably the best known example are the bonus energies for extra-stable tetraloops.
Bonus energies, in this case derived from sequence covariation, were also used to guide consensus folding in RNAalifold Hofacker et al. (2002); Bernhart et al. (2008). Similarly, TurboFold Harmanci et al. (2011) makes use pseudo-energies rewarding conservation of local structure.
Soft constraints have gained a more focussed interest recently in the context of analyzing chemical and enzymatic probing experiments. In RNAstructure Deigan et al. (2009);Zarringhalam et al. (2012); Hajdin et al. (2013), the first application of this type, SHAPE reactivities are converted to position-specific stabilizing energies for unpaired bases. The same idea, albeit with different models of bonus energies, has been used successfully also for other types of chemical probing e.g. using DMS Cordero et al. (2012) and to enymatic probing (PARS Kertesz et al. (2010); Wan et al. (2014)). A variation on this theme has been proposed in Washietl et al. (2012). Instead of computing the bonus energies directly from the reactivities, they are determined here as the solution of an optimization problem that tries to balance the experimental signal and the thermodynamic folding model. Full probabilistic models for the task are advocated in Eddy (2014).

Minimum free energy structure
In order to rate the quality of RNA secondary structure prediction results in terms of prediction accuracy, the predicted minimum free energy structures are usually compared to known reference secondary structures. Suboptimal folds, yielding a free energy within a certain range from the minimum free energy structure, may also contain structures with similar or even better quality than the MFE structure. However, there is no way to rate the quality of suboptimal folds when the reference structure is unknown, which is usually the case. As a result, suboptimal folds can not be used in a meanigful to evaluate the quality of a secondary structure prediction algorithm.
The correctness of the minimum free energy structure prediction is evaluated by comparing the predicted base pairs against the base pairs determined from the reference structure. In order to measure the quality, two parameters are most commonly used to describe the amount of correct and wrong base pair predictions: The Sensitivity represents the percentage of base pairs in the reference structure, which are also found in the prediction. However, many RNA secondary structure prediction algorithms tend to predict additional pairs, which can not be verified with experimental methods. As a result the Positive predictive value (PPV ), that is, the fraction of predicted base pairs that are also present in the reference structure, is used to measure the amount of false positives.

Partition function
In contrast to the MFE structure, the partition function approach is used to model the whole ensemble of structures instead of predicting just one or a few promising secondary structure candidates. Several different parameters can be used to describe the agreement of the predicted ensemble with the known reference structure.
The Pairing Probability Score is defined as the arithmetic mean of the predicted pairing probabilities p ij of all pairs contributing to the reference structure S and shows the agreement of the pairing probability matrix derived from the ensemble of all possible structures with one single reference structure: The Ensemble Diversity d shows the mean distance between predicted pairs, which can be obtained from the predicted pairing probabilities. Since the algorithms for incorporating experimental data tend to favor motifs that are in agreement with the observed experimental data, while penalizing disagreeing motifs, the Ensemble Diversity is used to illustrate to which extent the shift towards the experimental data influences the variability of the secondary structures represented by the ensemble. The Ensemble Diversity of a thermodynamic based prediction depends on the energy model and its parameters. Since the incorporation of additional experimental information is usually done by adding additional constraints, a decrease in the Ensemble Diversity in contrast to the thermodynamic based prediction is expected. However, a large decrease in ensemble diversity indicates a major shift towards probing data, thus rendering equilibrium properties such as base pair probablilities uninformative. Since the amount of possible pairs raises with growing sequence lengths, the Ensemble Diversity is normalized through division by the length n of the RNA to ensure comparability between RNAs of different size.
The Ensemble Distance is the L 1 -distance between the predicted ensemble and the reference structure: It measures the agreement of the predicted ensemble with the accepted target structure quantitatively. In contrast to the Pairing Probability Score, which focuses on the predicted pairing probabilities for base pairs present in the target structure, here the probabilities for all possible basepairs are taken into account.
Since the incorporation of experimental data into prediction algorithms tends to prefer structures being in accordance with the determined structural features, the ensemble distance can be used to quantify the expected shift of the whole ensemble towards the reference structure. It should be noted that all of the ensemble properties mentioned above cannot be used by themselves as measures of prediction quality. They are useful, however, to help interpret the predicted ensembles in comparison to each other. In particular, they provide insights how strongly the ensemble of structures is distorted by the constraints. Furthermore, the extent of the resulting measures generally grows linearly with sequence length, and should therefore be normalized by the sequence length n. This ensures compareability of predictions for RNAs of different length.

Conversion of SHAPE reactivities into pseudo free energy terms
In this section we briefly discuss the implementation of three different version for how to include chemical probing data, in particular SHAPE reactivities, into the ViennaRNA Package.
The Deigan et al. (2009) method was the first approach to incorporate SHAPE data to direct RNA folding. It uses the following simple heuristic model for the pseudo-energies ∆ G SHAPE (i) = m ln(SHAPE reactivity(i) + 1) + b.
as a function of the measured SHAPE reactivity values for a given nucleotide i. This energy contributes to a stacked pair (Deigan et al., 2009). A positive slope m penalizes high reactivities in paired regions, while a negative intercept b results in a confirmatory "bonus" free energy for correctly predicted base pairs. Since the energy evaluation of a base pair stack involves two pairs, the pseudo energies are added for all four contributing nucleotides. Consequently, the energy term is applied twice for pairs inside a helix and only once for pairs adjacent to other structures. For all other loop types the energy model remains unchanged even when the experimental data highly disagrees with a certain motif.
A somewhat more principled model considers nucleotide-wise experimental data in all loop energy evaluations (Zarringhalam et al., 2012). First, the observed SHAPE reactivity of nucleotide i is converted into the probability q i that position i is unpaired by means of a non-linear map. Then pseudo-energies of the form are computed, where x i = 0 if position i is unpaired and x i = 1 if i is paired in a given secondary structure. The parameter β serves as scaling factor. The magnitude of discrepancy between prediction and experimental observation is represented by |x i − q i |.
These two methods incorporate pseudo-energies even when the observed data are consisted with an unaided secondary structure prediction. In an attempt to avoid pseudoenergy contribution to positions that are already predicted correctly based by the thermodynamic model, Washietl et al. (2012) suggested to phrase the choice of the bonus energies as an optimization problem aiming to find a perturbation pseudo-energy vector . The perturbation is chosen in such a away that the discrepancy between the observed and predicted probabilities to see position i unpaired, respectively, is minimized. At the same time, the perturbation should be as small as possible. The tradeoff between the two goals is naturally defined by the relative uncertainties inherent in the SHAPE measurements and the energy model, respectively. An appropriate error perturbation vector thus satisfies Here, µ is the perturbation energy for a certain structural element µ and the variances τ 2 µ and σ 2 i serve as weighting factors for the relative influence of the structure predicted from the standard energy model compared to the experimental data.
In this setting, the energy model is only adjusted when necessary. If the thermodynamic model already yields a perfect prediction, the resulting perturbation vector vanishes and the folding recursions remain unbiased. Otherwise the perturbation vector is used to guide the folding process by adding a pseudoenergy i whenever nucleotide i appears unpaired in the folding recursions.
Since the pairing probabilities are derived from the whole ensemble of secondary structures, the algorithm of Washietl et al. (2012) tends to decrease structural diversity only slightly, which makes it applicable to RNAs with several distinct low free energy structures. Furthermore, the inferred perturbation energies identify sequence positions that require major adjustments of the energy model to conform with the experimental data. High perturbation energies for just a few nucleotides are therefore indicative of posttranscriptional modifications or intermolecular interactions that are not explicitly handled by the energy model. A major drawback of this approach is its asymptotic time complexity of O(n 4 ), which renders it very expensive for long sequences. This can be alleviated by a sampling strategy for estimating the gradient of the error functional F and provides a viable alternative to the exact numeric solution that reduces the time complexity to O(n 3 ) again.

Benchmark Data
The test set created by Hajdin et al. (2013) was used for benchmarking the accuracy of secondary structure predictions including SHAPE data (http:// www.chem.unc.edu/rna/data-files/ShapeKnots_DATA.zip). It consists of 24 sequences with their corresponding SHAPE data sets and reference structures, which are required to score the prediction results. The reference structures were derived from X-ray crystallography experiments, or predicted by comparative sequence analysis. As shown in Figure 1, the benchmark shows a high diversity regarding the length and prediction performance of the involved RNAs.
The test set described above has been designed for benchmarking an O(n 6 ) algorithm that predicts secondary structures containing pseudoknots. As a result the longest RNA sequences in the test set have a length of just about 500 nucleotides. The benchmark set also contains target structures without pseudoknots. The ViennaRNA Package does not support pseudoknot prediction. Of course, this prohibits perfect predictions for benchmark structures containing pseudoknots. On the up side, the computational cost is only O(n 3 ), thus allowing much larger RNAs, including in particular SSU and LSU ribosomal RNAs, to be processed. All three methods compared here depend on a set of carefully adjusted parameters. For the methods of Deigan et al. (2009), andZarringhalam et al. (2012) we use the latest published default parameters of m = 1.8, b = −0.6, and β = 0.8, respectively. Since there are no default parameters available for the stochastic version of the Washietl et al. (2012) method, we performed an exhaustive evaluation of its parameter space, see Figure 2. From this analysis, we selected new default parameters which will be described below. The result of the corresponding leave-one-out cross validation is shown in Figure 3. It should be noted, that the above mentioned parameters for the Deigan et al. (2009) method were already trained on a majority of the 24 reference RNAs from our benchmark set in another study Hajdin et al. (2013).
Our implementation of the method by Washietl et al. (2012) in the program RNApvmin defaults to an estimation of the gradient by drawing 1000 sample structures from the Boltzmann ensemble. This not only considerably speeds up the optimization routines, but also enables their application to rugged landscapes where an exact gradient approach could easily trap the optimization process in a local minimum. Based on the data from an exhaustive parameter space evaluation, we selected the following default combination for this approach: τ /σ = 1.0, minimizer tolerance m = 0.001, initial step size of the minimizer method s m = 0.01.  The benchmark results for all three methods that correspond to their default parameters are listed in Table 1. Estimation of the distribution of PPVs are shown in terms of 95% confidence intervals that we derive from a bootstrapping analysis with 1000 iterations, see   Although there is a large overlap between the different methods, the approach of Deigan et al. (2009) shows the best average performance on our benchmark set.

Theophylline sensing riboswitch
As an additional, particular example for comparing the three methods we investigated the artificially designed RNA switch theo-P-is10 described by Qi et al. (2012). This RNA consists of a theophylline sensing aptamer part followed by a ncRNA expression platform. The switching principle follows a regular ONswitch behavior, where under sufficiently high concentrations of theophylline, the aptamer part of the structure is thermodynamically favored, and the downstream located ncRNA part of the sequence folds into its active state. On the other hand, in the absence of theophylline, the expression platform misfolds into an inactive state.
In the original design theo-P-is10 forms a pseudoknot interaction between the aptamer stem and the expression platform in the inactive state. However, by using RNAfold we explicitly exclude pseudoknots, which is also true for almost all other secondary structure prediction programs available. Nevertheless, the data that comes with the work of Qi et al. (2012) provides a rich source of interesting SHAPE probing data, since it consists of normalized reactivities from two experiments: (i) the RNA folds in theophylline free solution, and (ii) the RNA folds in the presence of 0.5 mM theophylline, respectively. Since the designed pseudoknot is only present in the inactive state of the RNA switch, i.e. in theophylline-free solution, we do not emphasize too much on the correctness of the structure prediction in this case.
To compare the different variants of guided RNA secondary structure prediction through SHAPE reactivity data incorporation for theo-P-is10, we computed the ground state structures, and base pair probabilities for the two corresponding experimental data sets. Instead of using two dotplots for comparison, we create differential RNAbow plots (Aalberts and Jannen, 2013) to visualize the difference in base pair probability predictions. Here, a differential RNAbow plot consists of two sets of arcs located on the upper and lower half of the horizontally aligned nucleotide sequence, showing the predictions for both experiments, respectively. The strength/width of the arcs represents the pairing probability (thicker lines mean higher probability), whereas arcs are colored with an intensity corresponding to the absolute value of difference in predictions (red in the upper half, blue in the lower half) only, if pairing probability is higher when compared to the other experiment. Otherwise, arcs are drawn in gray, indicating lower probability compared to the other experiment. For better visualization we restrict the RNAbow plots to pairing probabilities of 0.1 and above.  Figures 6, 7, 8, and 9, respectively. It can be easily seen that the method of Deigan et al. (2009) using default parameters clearly misses the proposed ground state structures and essentially yields results as obtained by RNAfold without incorporation of SHAPE reactivities. On the other hand, using the two parameters m = 3.4, and b = −0.5, both SHAPE reactivity data sets yield high probabilities for the aptamer pocket and the functional ncRNA part, although only in presence of theophylline the aptamer pocket is fully formed. Using the method of Zarringhalam et al. (2012) both predicted ground state structures again correspond to the active conformation of the designed RNA switch. However, the proposed pseudoknot interaction between the two hairpin loops of the inactive state becomes visible in the base pair plot. This effect is even more pronounced when using the method of Washietl et al. (2012). Here, the pairing probabilities for the pseudoknot interaction are much higher in the absence of theophylline, whereas the probabilities of the base pairs involved in the formation of the aptamer pocket and the ncRNA part are increased in the presence of theophylline. Nevertheless, both ground state structures are virtually identical and represent the active conformation.
In contrast to the above methods, the implementation of so-called softconstraints in the ViennaRNA Package 2.2 (published elsewhere) also allows for a direct inclusion of binding free energies of the ligand to the aptamer pocket. For this purpose, the ensemble of structures is modified such that all structures that exhibit the aptamer pocket receive an additional stabilizing free energy of E s = −9.22 kcal/mol, according to the dissociation constant of K d = 0.32µM taken from Jenison et al. (1994). The resulting constrained secondary structure prediction is shown in Figure 10. Here, the shift towards the functional ligand binding state of the RNA switch under presence of theophylline is clearly visible in the base pair probabilities.  A Base pair probabilities of the two proposed secondary structure states as computed by

G G U G A U A C C A G A U U U C G C G A A A A A U C C C U U G G C A G C A C C U C G C A C A U C U U G U U G U C U G A U U A U U G A U U U U U C G C G A A A C C A U U U G A U C A U A U G A C A A G A U U G A G
RNAfold depicted by an RNAbow diagram. Similar to the differential RNAbow plot, base pair probabilities are depicted by the strength/width of the arcs. However, the upper half (red arcs) shows the base pairs of the MFE structure (supposedly OFF state, since the actual conformation has not been validated experimentally), and the lower half (blue arcs) the ON-state, respectively, instead of the difference between two distinct predictions. Secondary structure drawings of the MFE structure and the ON-state with a color encoding of their respective SHAPE reactivity as measured by Qi et al. (2012) are shown in B, and C, respectively.  A Differential RNAbow diagram of the base pair probabilities (see text). SHAPE reactivity data for the theophylline-free probing experiment (upper half), and a concentration of 0.5 mM theophylline (lower half) are taken from Qi et al. (2012). Although the (short-range) aptamer part and ncRNA part of theo-P-is10 show relatively high pairing probabilities, especially in the presence of theophylline (lower half, blue arcs), long-range interactions dominate the structure ensemble in both cases. The two secondary structure plots B, and C show the predicted MFE structures for the theophylline-free system, and under presence of theophylline, respectively. Colored nucleotides show the corresponding SHAPE reactivity data, whereas grey circles indicate the absence of probing data.  The two secondary structure plots B, and C show the predicted MFE structures for the theophylline-free system, and under presence of theophylline, respectively. Colored nucleotides show the corresponding SHAPE reactivity data, whereas grey circles indicate the absence of probing data.  A Differential RNAbow diagram of the computed base pair probabilities. SHAPE reactivity data for the theophylline-free probing experiment (upper half), and a concentration of 0.5 mM theophylline (lower half) are taken from Qi et al. (2012). Both data sets result in high probabilities for the aptamer, and ncRNA part. However, in the presence of theophylline, the possible long-range interaction between the two complementary hairpin loop sequences of these parts (upper half, large red arcs) falls below the probability threshold of 0.1 used for the construction of the RNAbow plot. This goes along with a high shift of the ensemble towards the ON-state of the riboswitch (lower half, blue arcs).
The two secondary structure plots B, and C show the predicted MFE structures for the theophylline-free system, and under presence of theophylline, respectively. Colored nucleotides show the corresponding SHAPE reactivity data, whereas grey circles indicate the absence of probing data. Both SHAPE reactivity data sets result in a MFE prediction of the ON-state, with the only difference of an additional A-U base pair in the ligand binding pocket of the aptamer part under the presence of theophylline.  Here, the two provided SHAPE reactivity data sets were used to compute a position-wise perturbation vector for the usage in guided secondary structure prediction. A Differential RNAbow diagram of the computed base pair probabilities. SHAPE reactivity data for the theophylline-free probing experiment (upper half), and a concentration of 0.5 mM theophylline (lower half) are taken from Qi et al. (2012). The two different modes (OFF/ON state) are clearly visible in the differential probabilities. While under theophylline-free condition (upper half) a long range interaction between the two complementary hairpin loop parts of the aptamer and the ncRNA show rather high probability (red arcs), this is not the case under the presence of theophylline (lower half). Here, the base pairs of the aptamer part as well as the ncRNA part of the RNA switch exhibit distinctly high pairing probability (dark blue arcs). The two secondary structure plots B, and C show the predicted MFE structures for the theophylline-free system, and under presence of theophylline, respectively. Colored nucleotides show the corresponding SHAPE reactivity data, whereas grey circles indicate the absence of probing data. Although the pairing probabilities as shown in A would suggest otherwise, both predicted ground state structures represent the ON-state of the RNA switch.    In this benchmark a common method was used to compute the required pairing probabilities based on the experimentally determined SHAPE reactivities. The application RNAplfold was used to predict the pairing probabilities for all sequences of the benchmark. The predicted pairing probabilities of all nucleotides were then compared with the determined SHAPE reactivities. The dataset containing about 4500 observations showed a significant correlation between the logarithm of the SHAPE reactivity and the probability for a certain nucleotide to be unpaired. However, as shown in figure 11, the experimental signal shows a high variation and high reactivities can also be observed for paired nucleotides, and vice versa. Nevertheless, a linear model is suitable for converting the logarithm of the SHAPE reactivity to the probability for being unpaired. The best fit is q = 5 8 (2.29 + log(SHAPE reactivity)) Since the equation above may also lead values of q below 0 or larger than 1, all results exceeding those limits are replaced by 0 or 1, respectively.

Running time
The impact of the incorporation of additional soft constraints onto the required amount of computational time was benchmarked for the whole dataset using a workstation (Intel Core 2 Quad 2.83 GHz, GCC 4.8.2). The running time for the folding recursion are reported as averages over 10 runs. As shown in figure  12, the incorporation of additional constraints results in a slight increase of the required computational time. However, the effect is less pronounced for the  Figure 11: Relation of measured SHAPE reactivities to predicted probabilities for being unpaired approach of Deigan et al., which may be explained by the fact that in contrast to the other approaches, which apply pseudo energies for every paired/unpaired nucleotide, the free energy is only adapted when evaluating stacked pairs. The overall running time for the prediction according to Washietl et al. can be separated into two phases. First, a perturbation vector is calculated by numerically minimizing an objective function. This step requires most of the computational resources since the exact evaluation of the gradient at various points of the minimization algorithm scales at O(N 4 ). However, the evaluation can be done much faster when the gradient is estimated from a number of sampled sequences. Second, the calculated perturbation vector is used to constrain the secondary structure prediction.