The Role of Nucleobase Interactions in RNA Structure and Dynamics

The intricate network of interactions observed in RNA three-dimensional structures is often described in terms of a multitude of geometrical properties, including helical parameters, base pairing/stacking, hydrogen bonding and backbone conformation. We show that a simple molecular representation consisting in one oriented bead per nucleotide can account for the fundamental structural properties of RNA. In this framework, canonical Watson-Crick, non-Watson-Crick base-pairing and base-stacking interactions can be unambiguously identified within a well-defined interaction shell. We validate this representation by performing two independent, complementary tests. First, we use it to construct a sequence-independent, knowledge-based scoring function for RNA structural prediction, which compares favorably to fully atomistic, state-of-the-art techniques. Second, we define a metric to measure deviation between RNA structures that directly reports on the differences in the base-base interaction network. The effectiveness of this metric is tested with respect to the ability to discriminate between structurally and kinetically distant RNA conformations, performing better compared to standard techniques. Taken together, our results suggest that this minimalist, nucleobase-centric representation captures the main interactions that are relevant for describing RNA structure and dynamics.


INTRODUCTION
Ribonucleic acid (RNA) plays a fundamental role in a large variety of biological processes, such as enzymatic catalysis (1), protein synthesis (2) and gene regulation (3). RNA molecules fold in a well-defined three-dimensional structure dictated by their nucleotide sequence (4), that can be determined by means of X-ray crystallography or nuclear magnetic resonance experiments (5). Despite recent important advances in the field, RNA structure determination is still a complex and expensive procedure. Computational/theoretical models, and in particular structure prediction methods, ideally complement experiments, as they can in principle provide the tertiary, native structure for a given RNA sequence.
Although traditionally considered simpler than the protein folding problem (4), de novo prediction of RNA threedimensional structure still represents a challenging task (6). Atomistic molecular dynamics methods have been so far limited to the predictive folding of very small systems (7,8), whereas coarse-grained, physics-based models have proven useful in folding larger RNA structures (9,10). However, the most popular and successful RNA structure prediction methods are the so-called knowledge-based techniques, that typically employ fragment libraries for constructing RNA structures. These fragments are either combined using secondary structure information (11,12) or used to generate a large number of plausible, putative structures that are subsequently ranked according to a scoring function (13)(14)(15)(16)(17). Knowledge-based methods often rely on two assumptions: first, that the main structural features of RNA molecules can be described in terms of few relevant observables. Second, that the distribution of these observables in a data set of experimental structures displays specific features, that can be used for generating and/or scoring threedimensional structures (18). A crucial issue is the choice of the aforementioned observables. While backbone atoms and torsional angles are the most natural choice for representing protein structures (19), RNA molecules are more difficult to treat, due to the presence of a larger number of flexible backbone degrees of freedom and because of the important structural role of base-base interactions. For this reason RNA molecules are often described in terms of several structural observables, including backbone dihedrals (20,21), pseudo dihedrals (22), helical parameters (23), hydrogen-bond networks and stacking interactions (24,25).
The proper choice of simplified, insightful descriptors for RNA molecules is not limited to the development of structure prediction methods, but is a general and relevant is-sue in RNA structural analysis, as in the definition of metrics for comparing three-dimensional structures. The wellknown root mean square deviation (RMSD) after optimal superposition (26) is known to be suboptimal in the analysis of RNA structure, as it does not provide reliable information about the differences in the base interaction network (6,27).
This motivated the development of alternative RNAspecific deviation measures based on selected geometrical properties. These measures can be used for comparing observed structures with a priori known patterns, as is done for instance in annotation methods (28) and in motif recognition algorithms (29)(30)(31). Other RNA specific similarity measures were introduced with the purpose of assessing the quality of RNA structural predictions (6) or for clustering/identifying recurrent structural motifs in large databases of known structures (21,29). The definition of a proper structural deviation measure is also a central issue in the construction of Markov state models (32), that typically assume the geometric similarity to provide an adequate measure of kinetic proximity.
The analysis of RNA three-dimensional structure, the construction of a knowledge-based structure prediction method and the definition of a structural deviation share a common trait: all of them are based on the apparently arbitrary choice of the structural observables used to represent RNA. Is there a simple representation of an RNA molecule that recapitulates its key structural and dynamical properties?
In this Paper we introduce a minimalist representation for RNA molecules consisting in one oriented bead per nucleotide. We first show that this description captures the fundamental base-pair and base-stacking interactions observed in experimental structures. Using this representation we define a simple sequence-independent scoring function for RNA structure prediction, which performs on par or better compared to existing atomistic techniques. Based on the same molecular description, we then introduce a metric for calculating structural deviation. When compared with the standard RMSD measure, this metric correlates better with the physical time required to explore the conformational space in molecular dynamics simulations. As a further proof of the capability of this metric to distinguish relevant structural differences, we show that it can be successfully used for recognizing local structural motifs, reproducing results obtained with state-of-the-art techniques. Finally, in the Discussion section, we summarize the analogies between these applications and we proceed with an extensive comparison with existing techniques. The software for performing the calculations is freely available at https://github.com/srnas/barnaba.

METHODS
This section is structured as follows: first, we introduce a simple representation for RNA three-dimensional structures that has an intuitive interpretation in terms of basestacking and base-pairing interactions. We then use this molecular description to define i) a scoring function for RNA structure prediction (ESCORE) and ii) a metric for calculating deviations between RNA three-dimensional structures (ERMSD).

Molecular representation
We construct a local coordinate system in the center of the six-membered rings, as shown in Figure 1a. Following this definition, the relative position and orientation between two nucleobases is described by a vector r, that is conveniently expressed in cylindrical coordinates , and z ( Figure 1b). Note that r is invariant for rotations around the axis connecting the six-membered rings. We highlight that this definition is similar to the local referentials introduced by Gendron and Major (25). The use of a nucleotide-independent centroid makes it straightforward to compare and combine collection of position vectors deriving from different combinations of nucleobases. This is of particular importance for constructing the knowledge-based scoring function (see below). The position vector r has an intuitive interpretation in terms of base-stacking and base-pairing interactions. This aspect is illustrated in Figure 1c, that shows the distribution of r vectors for all neighboring bases in the crystal structure of the Haloarcula marismortui large ribosomal subunit (PDB code 1S72) (2) projected on the and z coordinates. In the figure, different colors correspond to different types of interactions detected by MC-annotate. Due to steric hindrance, no points are observed in a forbidden ellipsoidal region. Furthermore, almost all the base-stacking and base-pairing interactions (≈99.6%) belong to a welldefined ellipsoidal shell. It is therefore useful to introduce the anisotropic position vector with a = 5Å and b = 3Å, so that pairs of bases in the interaction shell are such that 1 <r < √ 2.5. The majority of base-base contacts lying in this interaction shell are annotated either as Watson-Crick/non-Watson-Crick or as base stacking, as detailed in Table 1. Within this region we distinguish a pairing zone and a stacking zone, according to the type of featured interactions. The tri-modal histogram in Figure 1c shows that these two zones can be defined without ambiguity considering pairs such that the projection of r along the z axis is larger (stacking) or smaller (pairing) than 2Å.
It is well known that the strength and nature of pairing and stacking interactions depend on the base-base distance, on the angle as well as on other angular parameters (e.g. twist, roll, tilt) in a non-trivial manner (24,33). Such dependence can be observed in Figure 2, where the points belonging to the pairing and stacking zone of Figure 1c are projected on two separate -planes. These distributions give an average picture containing contributions from different base pair types (purine-purine, purine-pyrimidine and pyrimidine-pyrimidine) and with weights dictated by the employed data set. Nevertheless, the observations below hold also when considering the 16 possible combinations of base pairs individually and different data sets (see Supporting Data (SD) Figrues S1-S3). In the pairing zone ( Figure 2, left panel) we first observe a dominant peak centered around ( =5.6Å, =60 • ), corresponding to the position of canonical Watson-Crick base pairs as well as wobble (GU) base pairs. The two other peaks correspond instead to base pairs interacting through the Hoogsteen or sugar edge (13). One can also appreciate the absence of bases in the region occupied by the sugar (190 • < < 290 • ). The probability distribution in the stacking zone ( Figure 2, right panel) shows a broad peak in the proximity of the origin and extending up to ≈ 4Å, which can be compared to the typical radius of the six-membered ring (≈1.4Å). This means that partial or negligible ring overlap is very frequent in RNA structures, as also observed in a seminal paper by Bugg et al. (34). This feature is more evident in pyrimidine-pyrimidine and purine-purine pairs, for which high overlap is the exception rather than the rule (see Supplementry Figure S3), whereas overlap is systematically observed in pyrimidinepurine pairs. The fact that bases in the stacking zone are very often 'imbricated,' similarly to roof tiles, rather than literally stacked one on top of the other, does not imply that they are not interacting. Indeed, base-base interaction is not limited tostacking but also includes electrostatic effects, London dispersion attraction, short range repulsion as well as backbone-induced effects (35).

Scoring function
The empirical distribution of all r vectors observed in experimental RNA structures displays very specific features, as described above. We use these geometrical propensities to construct a knowledge-based scoring function for RNA structure prediction. More precisely, we define a function of the atomic coordinates (in this case of the r vectors in a molecule) that quantifies the compatibility of a given RNA 3D conformation with respect to the expected distribution observed in native RNA structures. This scoring function, called ESCORE, is defined as a weighted sum over all pairs of bases in a molecule: Notice that both permutations of the j, k indexes should be included in the sum since the vectors r jk and r kj provide two partially independent pieces of information. The weights p(r) are given by the empirical probability distribution of nucleobases in the crystal structure of the H. marismortui large ribosomal subunit. With this definition, a lower weight is assigned to structures with pairs of bases in relative positions not compatible with the training set, including those with steric clashes. The probability distribution p(r) is calculated considering only pairs of bases within the interaction shell (r < √ 2.5) and using a Gaussian kernel density estimation with bandwidth = 0.25Å. Note that p(r) does not depend on the identity of the nucleotides, and the RNA molecule is therefore treated as a homopolymer. The sum of the probabilities is used instead of their product, in order to decrease the impact of lowcount regions on the scoring function. This procedure has been first introduced in the field of hidden Markov modeling for speech recognition (36) and has a formal justification known as 'summation hack' (T. Minka, The 'summation hack' as an outlier model, http://research.microsoft.com/ en-us/um/people/minka/papers/minka-summation.pdf). A non-redundant set of high-resolution structures (16) was also employed as a training data set for the ESCORE, producing similar results.

Structural deviation
The collection of the scaled vectorsr (see Eq. 1) in an RNA molecule provides information about the relative base arrangement. One could thus define a metric for measuring the distance between two RNA structures, ␣ and ␤, as Similarly to Eq. 2, also here both permutations of the j, k indexes should be included in the sum. The d quantity is highly non-local, because large differences in distant pairs give large contributions to the sum. To remove this effect it is convenient to introduce a cutoff distancer cutoff in the same spirit as we did for the scoring function discussed above. However, we notice that a cutoff would introduce discontinuities in the metric. A possible solution is to introduce a function that maps the vectorsr to new vectors G(r) with the following properties: (1) |G(r α ) − G(r β )| ≈ |r α −r β | ifr α ,r β r cutoff .
where γ = π/r cutoff and is the Heaviside step function. We note that the dependence of G on the direction ofr vanishes asr approaches the cutoff valuer cutoff . We thus setr cutoff = 2.4, so that for the typical distances between stacked and paired bases (r < √ 2.5, see Figure 1c) the contribution of ther directionality is significant. In-depth discussion on Eq. 4 can be found in SD Text 4. Having defined the mapping function G(r), the ERMSD distance reads Note that ERMSD is proportional to the Euclidean distance between G vectors, which are non-linear functions of the atomic coordinates. As such, ERMSD is positive semidefinite, symmetric, and satisfies triangular inequality.
During the developmental stage we tested a non-vectorial version of the ERMSD: instead of using the vectorial function G(r), we considered the scalar, continuous function which is similar to a contact-map distance (37,38). This scalar version differs from ERMSD when considering structures with four nucleotides or less, while the two quantities are highly correlated when analysing larger structures (Supplementry Figure S5).

Scoring function for RNA structure prediction
We assess the quality of the ESCORE on its capability of recognizing the folded state among a set of decoys (39). We consider here a total of 39 different decoy sets previously used to benchmark other knowledge-based scoring functions: 20 decoy sets generated de novo using the FARNA algorithm (13) and 19 additional decoy sets from the work of Bernauer et al. (16) obtained by gradual deformation of the native structures. As a measure of performance we use the normalized rank (40), defined as the percentage of decoys scoring better than the native structure. In order to simulate a realistic structure prediction experiment, we additionally report the RMSD from native of the best scoring structure in the decoy set. Because the RMSD has been reported to be a suboptimal indicator of structural proximity for RNA, we also compute the interaction fidelity network (INF), which measures the overlap between annotations in two structures (27), as well as the ERMSD introduced here. Equivalent analyses are performed using two state-of-theart scoring functions, namely the FARFAR (15) and the allatom knowledge-based scoring function RASP (17). Both FARFAR and RASP are defined at atomistic resolution. FARFAR combines the low-resolution FARNA potential with explicit terms for solvation and hydrogen bonding, while RASP is a statistical potential based on pairwise interatomic distances. These two scoring functions are among the best available (6) and the only ones for which it was possible to perform automatic scoring of a large set of decoys. Figure 3 summarizes the results obtained on all the 39 decoy sets. For each method we report the number of decoy sets below the value indicated on the horizontal axis. Thus, the better is a scoring function, the faster the curve reaches the maximum number 39. In general, whereas FAR-FAR performs well in ranking and RASP in finding the best decoys, ESCORE performs well in both tasks (See also Supplementry Table S6). This is a remarkable result considering that ESCORE employs a coarse-grained representation (one bead per nucleotide) and is not sequence dependent. The scoring results on all RNA decoys, together with a short discussion of the cases for which ESCORE fails to identify correctly the native structure can be found in Supplementry Figures S7 and S8.

RNA structural deviation
The definition of an accurate measure of structural deviation for comparing RNA three-dimensional structures is a relevant problem, as the standard RMSD approach does not report on differences of base-stacking and base-pairing patterns, that are the fundamental, key features of RNA molecules (25). As a striking example, the RMSD between an ideal A-form double helix GGGG CCCC and the mismatched structure GGGG CCCC is 1.9Å, meaning that the threshold of 4Å that is sometimes employed to consider two structures as similar (8) may not be sufficiently strict. A similar issue arises in kinetic modeling of proteins, where the assumption that geometrically close configurations (in terms of the RMSD distance) are also kinetically close has been shown to be  (27) and ERMSD of the best scoring decoy to the native structure, as labeled. See Supplementry Table S6 for a complete list of results. problematic in discretizing the register shift dynamics of ␤ strands (32).
In order to elucidate the difference between the standard RMSD and the ERMSD, we compute these two quantities on a series of snapshots taken from an atomistic molecular dynamics simulation of a short stem-loop. These structures have been obtained by repeatedly melting and folding (see SD Text 9), and thus include near-native structures, partially folded, misfolded as well as extended conformations. Figure 4 shows that structures with similar RMSD do not necessarily present the same base-base interaction network of the native state: in fact, the secondary structure can be substantially different. Conversely, the ERMSD discriminates structures with near-native base-base contacts (ERMSD <0.8) from structures with non-native basebase interactions (ERMSD>1). Note that in general low ERMSD correspond to high INF values, except for the region around ERMSD=1.5, RMSD=2.7. In this region we observe structures with formed stem and distorted loop (high INF) but also structures with formed loop and incorrect stem (low INF). While Figure 4 serves as an illustrative example, the validity of the ERMSD is more rigorously assessed by performing two tests that probe the accuracy of the method in the analysis of structural and dynamical properties of RNA molecules. More precisely, in the following two subsections we demonstrate i) that the ERMSD is a good measure of kinetic proximity between configurations, performing considerably better compared to standard structural distances and ii) that the ERMSD is highly spe- Structural and kinetic proximity. The correspondence between a structural deviation (RMSD, ERMSD, etc.) and the kinetic distance (defined as the time needed to evolve from one configuration to the other), can be quantitatively assessed by computing the coefficient of variation (CV).
The CV of a structural deviation d is defined as the ratio of the standard deviation to the mean calculated on structures separated by a time lag To a large CV corresponds a large standard deviation, and therefore a high uncertainty between geometric and kinetic distance (41). Note that the standard deviation is normalized with the mean, and the CV is thus scale-invariant. In the approximation that the deviations at fixed time lag are normally distributed, this analysis is equivalent to the histogram overlap discussed in Ref. (38). The connection between kinetic and structural distance is inevitably limited by the fact that the typical time necessary for a transition is in general different from the time necessary for the reversed transition, in contrast with the symmetric definition of structural deviations. Nevertheless, this analysis provides insightful and intuitive information for the purpose of qualitatively comparing different metrics. In Figure 5 we show the CV as a function of time separation in molecular dynamics simulation of the add riboswitch (42,43). The molecular dynamics simulation was initialized in the native state and remained stable throughout the 100 ns run, while experiencing non trivial dynamics such as base-pairing breakage/formation and double-helix breathing. The CV for ERMSD is compared with RMSD and with dRMSD. Additionally, the scalar variant of ERMSD, based on Eq. 6, is also presented.
We first observe that for short time separations the correspondence between geometric and kinetic distance is high (low CV) in all cases, and it diminishes at large time lag. The CV for the ERMSD is consistently lower compared to both RMSD and dRMSD, suggesting this measure to better reflect the temporal separation between configurations. We also note that, in this test, the scalar and vectorial version of the ERMSD behave very similarly.
Search of structural motifs. RNA molecules display a great variety of base-base interaction patterns (RNA motifs) that play a key role in RNA folding and that mediate important biological processes (44). A structural deviation should therefore accurately capture such RNA-specific features. This property can be assessed by searching for known RNA structural motifs in the database of experimentally solved structures. Here, we use the ERMSD to search for five known internal loop motifs (K-Turn, C-loop, Sarcin Ricin, triple sheared, double sheared) and two hairpin loops (Tloop and GNRA) within the RNA structure atlas database (45) release 1.43, composed by 772 crystal structures. As a reference, we compared the results with 'find RNA 3D' (FR3D) (29), the most extensively used and carefully tested method for RNA motif recognition. FR3D employs a reduced representation with one centroid per nucleobase, and computes the geometrical deviation by combining fitting and orientation errors after optimal superposition. Reference and motif structures were obtained from the RNA 3D motif atlas release 1.13. Except for the K-turn motif, we were able to correctly identify almost all FR3D matches, including motifs with bulged bases ( Table 2). By visual inspection, it can be seen that most of the false positive (i.e. motifs that were found using the ERMSD and not by FR3D) are either bona fide matches or known motif variants, while the residual discrepancies are due to differences in the employed cutoff value (45). We refer the reader to SD Text 10 for a detailed description of the search strategy and to SD Text 11 for the full list of results.

DISCUSSION
A fully atomistic description of an RNA molecule requires on average ≈20 non-hydrogen atoms per nucleotide, and thus ≈60 degrees of freedom per nucleotide. Biologically relevant RNA structures are composed of hundreds or even thousands of nucleotides that display an intricate network of interactions. The complexity of the problem is further increased when considering that many RNA molecules are highly dynamic entities, and multiple chain configurations are needed for describing their mechanism and function (3). In order to provide a simpler representation, RNA structures are usually analysed in terms of sugar pucker conformations, backbone dihedrals (20,21) or pseudo dihedrals (22). Complementary approaches consider instead basepair parameters (propeller, slide, roll, twist, etc.) and hydrogen bonding to classify base-pair interactions in terms of recurrent geometrical configurations observed in experimentally solved RNA structures (23)(24)(25).
A key result of the present work is that the main structural and dynamical properties of RNA can be described in terms of the spatial arrangement of the nucleobases only. Since nucleobases are substantially rigid, this implies 6 degrees of freedom per nucleotide. This observation is in full accordance with previous analysis of RNA crystal structures (25,46), where the position and orientation of nucleobases only were considered. This work complements the analysis by showing that nucleobase position and orientation are sufficient to correctly describe the RNA dynamics as obtained from explicit solvent molecular dynamics simulations.
As shown in Figure 1, RNA base-base interactions mainly occur in a restricted neighborhood of the nucleobase that is well approximated by an ellipsoidal shell. Since specific regions of the ellipsoidal shell surrounding nucleobases correspond to different types of pairing and stacking interactions, it is in principle possible to infer secondary structure information directly from the pairing and stacking plots of Figure 2. This procedure is formally similar to the one employed in the analysis of protein structures, where the angular preferences of the backbone approximately correspond to different secondary structures (19). Our approach depends on nucleobase positions only and is thus complementary to that proposed in Ref. (22), where backbone pseudo dihedrals were used to classify RNA secondary structures.
By considering only the relative spatial arrangement of nucleobases, we introduce a knowledge-based scoring function for RNA structure prediction (ESCORE). In contrast with several studies underlining the importance of an atomistic description for an accurate scoring (15)(16)(17), our work shows that a minimalist description of nucleobase arrangement is sufficient to produce equivalent or better results. Two aspects are worth highlighting. First, while the backbone can be ignored for scoring, one cannot avoid including the chain connectivity in the sampling algorithm used to generate the decoys. Second, the outcome of the scoring benchmarks is strongly dependent on the employed decoy sets. Some of the decoy sets generated using the FARNA algorithm are very challenging, as all scoring functions (ESCORE, FARFAR and RASP) are highly degenerate, and assign good scores to structures that are very far from native (see e.g. Supplementry Figure S6, decoy set 1A4D and Ref. (16)). These decoys can be therefore used to improve the current methodologies for de novo predictions of RNA structure.
A second aspect intimately connected with the choice of molecular representation is the definition of a proper structural proximity measure. Typically, structural deviations are used either to analyse and compare three-dimensional structures (e.g. RMSD, dRMSD, INF (27)) or to recognize and cluster RNA structural motifs (29)(30)(31). The ERMSD introduced here is suitable for both tasks. A distinctive feature of the ERMSD is the use of the ellipsoidal metricr , that naturally stems from the empirical data and that takes into account both relative distance and orientation of nucleobases. Interestingly, the intrinsic base anisotropy has been also discussed in the context of the electronic polarizability of nucleobases (35) and anisotropic interactions have been used in coarse-grain models for proteins (47). We note that the ERMSD is conceptually reminiscent of the INF measure (27), that also considers the differences in the network of base-pairing and base-stacking interactions. With respect to INF, however, the ERMSD presents important advantages: i) it does not depend on the annotation scheme employed and therefore on the quality of the structure, ii) it is a continuous function of the atomic coordinates and iii) it is well defined even when the annotation is not informative (e.g. for unfolded states, unstructured, or singlestranded RNAs). The last two items make the ERMSD particularly well-suited for studying dynamical processes, such as RNA folding or binding events. The ERMSD also shares a number of similarities with the structural distance used in FR3D for recognizing three-dimensional motifs (29). However, in the ERMSD calculation we introduce important simplifications: i) the definition of the local coordinate system is not nucleobase-specific and requires the knowledge of three atoms per nucleobase only (Figure 1a), ii) the ERMSD calculation does not require least square fitting and iii) the problem of combining the positional and orientational base-base distance is circumvented by introducing an ellipsoidal metric. The fact that the two methods produce similar results ( Table 2) strongly suggests that for most practical uses the ERMSD is accurate enough to capture the specific pattern of base-base interactions.
In this study we introduce a minimalist representation of RNA three-dimensional structures that solely considers the relative orientation and position of nucleobases. Based on this representation, we construct a scoring function for RNA structure prediction and we define a metric for calculating deviation between RNA three-dimensional structures. In both cases, the results are on par or better compared with state-of-the-art techniques. Taken together, the presented results suggest that this minimalist representation captures the main interactions that are relevant for describing RNA structure and dynamics. As a final remark, we note that the methodology presented here can be readily applied to DNA molecules as well.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.

ACKNOWLEDGMENTS
We are grateful to Francesco Colizzi, Richard Andre Cunha, Eli Hershkovits, Alessandro Laio, Gian Gaetano Tartaglia, Eric Westhof and Craig Zirbel for carefully reading the manuscript and providing several enlightening suggestions. Massimiliano Bonomi, Thomas Hamelryck and Andrea Pagnani are also acknowledged for useful discussions.