Evolutionary analysis of base-pairing interactions in DNA and RNA secondary structures

Pairs of nucleotides within biologically functional nucleic acid secondary structures often exhibit evidence of coevolution that is consistent with the maintenance of canonical base-pairing. MESSI is a sequence evolution model that infers substitution rates associated with base-paired sites in alignments of DNA or RNA sequences. MESSI can estimate these whilst simultaneously accounting for the uncertainty associated with an unknown RNA or DNA secondary structure shared across an alignment of sequences. Moreover, the unknown structure can be predicted, or a base-pairing probability matrix calculated. MESSI optionally leverages CUDA GPU parallelism to accelerate inference. MESSI was used to infer coevolution rates associated with GC, AU (AT in DNA), GU (GT in DNA) pairs in non-coding RNA alignments, and single-stranded RNA and DNA virus alignments. Inferred rates of GU pair coevolution were found to be higher at base-paired sites in single-stranded RNA viruses and non-coding RNAs than those of GT pairs in single-stranded DNA viruses, suggesting that GT pairs do not stabilise DNA secondary structures to the same extent as GU pairs in RNA. The relative coevolution rates associated with GC, AU, and GU pairs were largely consistent with their relative chemical base-pairing stabilities (GC base-pairs being more stable than AU base-pairs, and AU base-pairs being more stable than GU base-pairs). Additionally, MESSI estimates the degrees of coevolution at individual base-paired sites in an alignment. These estimates were computed for a SHAPE-MaP-determined HIV-1 NL4-3 RNA secondary structure and two corresponding alignments. MESSI’s estimates of coevolution were significantly more strongly correlated with experimentally-determined SHAPE-MaP pairing scores as compared to three non-evolutionary measures of base-pairing covariation. Finally, to assist researchers in prioritising substructures with potential biological functionality, MESSI automatically identifies substructures and ranks them by degrees of coevolution at base-paired sites within them. Such a ranking was created for an HIV-1 subtype B alignment, revealing an excess of top-ranking substructures that have been previously identified in the literature as having structure-related functional importance, and a number of top-ranking structures that have not yet been characterised.


Introduction
The primary role of nucleic acid molecules, such as DNA (deoxyribonucleic acid) and RNA (ribonucleic acid), is encoding genetic information for storage and transfer. However, both types of molecules can form structures with additional functions (Mattick, 2003). DNA is ordinarily thought of as a doublestranded molecule forming the now iconic double helical configuration (Watson and Crick, 1953) (Heaphy et al., 1990;Daugherty et al., 2010).
The structures that nucleic acid molecules form are commonly referred to as their secondary or tertiary structures. Secondary structure is defined as the set of hydrogen bonding interactions between the constituent bases of a nucleic acid molecule; tertiary structure is defined as the arrangement of the constituent atoms of a nucleic acid molecule in threedimensional space. This study focuses exclusively on RNA and DNA secondary structures.
Both computational (Markham and Zuker, 2008;Sükösd et al., 2012;Bernhart et al., 2006) and hybrid experimental-computational techniques (Wilkinson et al., 2006) for secondary structure prediction exist. However, even if the secondary structure of an RNA sequence can be accurately determined, this does not immediately say anything about the potential functional or biological importance of the identified structure. Many RNA secondary structures are known to have specific biological functions, and it is expected that these structures might detectably impact patterns of evolution in the sequences in which they occur.
One evolutionary signal that can be leveraged to identify selectively maintained secondary structures is nucleotide coevolution. Nucleotide coevolution is expected at base-paired nucleotide positions within these structures (Eddy and Durbin, 1994;Tuplin et al., 2002;Cheng et al., 2012). Many pairs of nucleotides within RNA molecules exhibit evidence of coevolution, such that whenever a substitution occurs in one partner of the pair, complementary substitutions are selected for in the other partner in a manner that is consistent with the selective maintenance of canonical base-pairing (Cheng et al., 2012).
In this study we consider the canonical RNA basepairs to be the two Watson-Crick base-pairs, GC and AU, and the weaker GU wobble base-pair (GC, AT, and GT base-pairs in DNA, respectively).
Methods for detecting coevolution, such as mutual information (Eddy and Durbin, 1994;Lindgreen et al., 2006), can be used to aid the computational inference of secondary structures. Accordingly, some RNA comparative secondary structure prediction approaches, such as PPfold (Sükösd et al., 2012), use information about coevolving nucleotides inferred from sequence alignments to more accurately predict secondary structures. Conversely, within a given secondary structural element, evidence that paired bases are coevolving is evidence of the functional importance of that element (Tuplin et al., 2002;Cheng et al., 2012;Muhire et al., 2014).
Standard approaches for measuring coevolution (or more accurately: covariation), such as mutual information, are non-evolutionary in that they do not take into account the phylogenetic relationships of the sequences being analysed. Founder substitutions can, by chance, induce correlations between bases in a large number of observed species (Bhattacharya et al., 2007), which may be mistaken for strong evidence of coevolution if the phylogeny is not accounted for. Substitution models provide a statistical framework for modelling of both phylogenetic relationships and underlying substitution processes. However, the computational cost of applying such models can severely limit their utility.
In this study, we extend the M95 (Muse, 1995) paired site model of RNA base-pairing evolution and implement a new software tool called MESSI (Modelling Evolution of Secondary Structure Interactions). In what follows, the underlying models and methodologies are detailed. Relative rates of coevolution amongst different base-pairs are measured, degrees of coevolution at base-paired sites are measured, secondary structure uncertainty is accounted for and predicted, and substructures are identified and ranked by their expected structure-related biological functionality. Muse (1995) developed a paired site model, henceforth referred to as the M95 model. M95 accounts for RNA base-pairing constraints by modelling pairs of nucleotide positions using a 16 × 16 matrix. The model generalises upon standard 4×4 nucleotide substitution models, such as the GTR model, by introducing a coevolution parameter, λ, that is intended to capture substitutions at paired positions that are consistent with the maintenance of canonical RNA base-pairing. We define the set of canonical basepairs as follows:

The Muse 1995 model
Here we present a version of the original M95 paired model based on a GTR model Q and a set of canonical base-pairs C: a / ∈ C and b ∈ C, e.g. a=AC → b=AU q ab * pairing unchanged a, b / ∈ C or a, b ∈ C, e.g. a=AU → b=GU q ab * /λ pairing lost a ∈ C and b / ∈ C, e.g. a=AU → b=AC 0 2 differences e.g. a=AU → b=GC Where a and b are nucleotide pairs, q ab * is the entry of the GTR matrix Q corresponding to the nucleotide position within the nucleotide pair that underwent a substitution, and λ is a parameter capturing the degree of RNA coevolution; i.e. the degree to which canonical RNA base-pairing is evolutionary maintained (λ > 1) or disrupted (λ < 1). Note that λ = 1 represents the neutral case where each of the two nucleotide positions in a pair are treated as evolving independently under the GTR model specified by

Q.
Furthermore, let π dinuc denote a length 16 vector of paired frequencies. π dinuc is the union of two mutually exclusive cases: π dinuc = π unpaired π paired , π unpaired represents the cases where the target pair d ij is not in the set of canonical base-pairs (d ij / ∈ C), and π paired represents the cases where the target pair d ij is in the set of canonical base-pairs (d ij ∈ C), respectively: Note that i and j correspond to the first and second positions of the target pair, respectively. Where π i is the equilibrium frequency under the GTR model, Q, of the nucleotide in the first position of the target pair d ij , and similarly π j is the equilibrium frequency of the nucleotide in the second position. κ 1 = 1 + 2(π G π C + π A π U + π G π U )(λ 2 − 1) is a normalising constant that ensures the entries of π d sum to one.
Note that within the set of canonical base-pairs,  (Rousset et al., 1991). In light of this, in the next section we extend the M95 model such that substitutions affecting the three canonical base-pairs are not constrained to have the same coevolution rate.

Differentiating between types of base-pairing substitutions
We extend the M95 model to differentiate between the three different canonical base-pairs, by introducing potentially distinct coevolution rates (λ GC , λ AU , and λ GU ) for each of three different base-pairs (GC, AU, and GU, respectively). Using similar notation as before the extended rate matrix is given as follows:  3) and the corresponding paired frequencies are: Where

Stationarity and time-reversibility
We show that the paired frequencies, π, given in (4) correspond to the stationary distribution of M by verifying that: and that time-reversibility of M holds: where a and b represent nucleotide pairs. The conditions in (5) and (6) were verified using the symbolic math package, SymPy (Joyner et al., 2012), as implemented in the musesymbolic.py script in the Supplementary Material.

Modelling variable degrees of coevolution
In the M95 model (2) the rate of coevolution, as specified by λ GC , λ AU , and λ GU , was assumed to be the same at each base-paired site within a secondary structure S. However, it is expected that the strength of the selective forces maintaining canonical base-pairing will vary amongst base-paired sites in S.
In this section, we extend the M95 model such that the degree of coevolution, denoted by η q,r , is able to vary from base-paired site to base-paired site. η q,r is drawn independently for each base-paired site (described in the next section), and acts to scale the 4 three coevolution rates as follows: λ q,r GC = (λ GC − 1)η q,r + 1 λ q,r AU = (λ AU − 1)η q,r + 1 λ q,r GU = (λ GU − 1)η q,r + 1 where λ GC ≥ 1, λ AU ≥ 1, and λ GU ≥ 1 are the base-pairing substitution rates shared across all paired sites. This parametrisation was chosen so that λ q,r GC = λ q,r AU = λ q,r GU = 1 when η q,r = 0. In addition to allowing the rate of coevolution, η, to vary across base-paired sites, we also allow substitution rates to vary from site to site following the gamma distributed sites rate approach of (Yang, 1993(Yang, , 1994. For unpaired sites this is modelled using a standard GTR+Γ model. For base-paired sites slightly more care needs to be taken (see Supplementary Section

Testing neutrality of coevolution
To test the hypothesis that two nucleotide positions within a particular base-paired site are evolving neutrally, i.e. the substitutions at each of the two sites are occuring independently rather than actively favouring the maintenance canonical base-pairing, we assume that the degree of coevolution, η q,r , at each base-paired site is distributed as follows: η q,r = 0 with probability w η (the neutral, independent case), or η q,r is drawn from a M-way discretised gamma distribution with probability 1 − w η (the dependent case). Note that η q,r ≥ 0 and therefore the case where

Priors
This section lists the priors (and hyperpriors) over parameters used in the most general version of the implemented model (the unconstrained model). Note that for some analyses we perform Bayesian inference, whereas for others we perform maximum likelihood (ML) inference. The priors over the parameters specified here are those used for Bayesian inference, however, we also indicate how the parameters are treated during ML inference (either estimated whilst ignoring the prior, or using the prior and fully marginalised).
The prior probability of neutral coevolution, w η (estimated during ML inference whilst ignoring the prior), is drawn from a beta distribution: For each paired position q, r a Bernoulli random variable, X q,r (marginalised during ML inference), indicating neutral coevolution is drawn from a Bernoulli distribution with probability w η : A tied shape and rate parameter, c (estimated during ML inference whilst ignoring the prior), specifying a prior over the variation in coevolution rates is drawn from an exponential distribution: For each paired position q, r the coevolution rate, η q,r , is set equal to zero if X q,r = 1, otherwise if X q,r = 0 the coevolution rate, η q,r (marginalised during ML inference), is drawn from a discretised gamma distribution with shape c and rate c: A tied shape and rate parameter, d (estimated during ML inference whilst ignoring the prior), specifying a prior over the variation in site-to-site substitution rates is drawn from an exponential distribution: For each nucleotide position q a substitution rate, ρ q (marginalised during ML inference), is drawn from a discretised gamma distribution with shape d and rate d: The prior probabilities of the four nucleotides, π (estimated during ML inference), is given by a flat Dirichlet distribution: (π A , π C , π G , π T ) ∼ Dirichlet(1, 1, 1, 1) The six symmetric nucleotide substitution rates (estimated during ML inference whilst ignoring the prior) are each drawn from exponential distributions: The three coevolution rates (estimated during ML inference whilst ignoring the prior) corresponding to the three canonical base-pairs are each drawn from exponential distributions right shifted by 1: The secondary structure, S (marginalised during ML inference), is drawn from the KH99 SCFG (discussed in Section 2.9): Finally, the phylogenetic tree,T , relating the align- D. Extended dot bracket notation Figure 1: Examples of secondary structure representations. Above (A) is a dot bracket representation of a secondary structure, and the corresponding VARNA and circular visualisations (B and C, respectively) produced by VARNA Darty et al. (2009). Below (D) is an extended dot bracket notation format with an additional bracket type, <>, that allows a pseudoknotted structure to be represented unambiguously. E and F are the corresponding VARNA visualisations for D. Note how the overlapping bonds in the circular visualisation (F) demonstrate that the secondary structure is pseudoknotted. ii. k ≤ j implies that i < k < l < j Vertices that are not contained within the edge set B S are termed unpaired. Condition (i) implies that each vertex (nucleotide) belongs to at most one basepair. Condition (ii) prevents pseudoknotting, i.e.
non-nested base-pairs. Note that pseudoknotting is physically possible in both real RNA and DNA structures, but is excluded in many definitions of secondary structures as efficient algorithms exist for marginalising or maximising over secondary structures when assuming (ii). MESSI permits a canonical secondary structure with pseudoknots to be specified a priori, however, if the user instead treats the structure as unknown, MESSI will strictly sum over non-pseudoknotted structures only.  Figure 1D).

Likelihood
Conditioned on a secondary structure, S, unpaired nucleotide positions within S, denoted by · q, and basepaired nucleotide positions within S, denoted by q, r, are assumed to be independent. The likelihood of an alignment, D, is given by a simple product of unpaired and paired site likelihoods: WhereT is a phylogenetic tree and Felsenstein's pruning algorithm (Felsenstein, 1981) was used to calculate both the unpaired site likelihoods, p(D q | · q,T , θ), and the paired site likelihoods, p(D q,r | q, r,T , θ). Paired sites were modelled using the unconstrained M95 model, whereas unpaired sites were modelled using the GTR+Γ model that is nested within the unconstrained M95 model.

Prior over RNA secondary structures
Equation 18 assumes that the secondary structure S is known a priori either through experimental or computational methods of structure prediction. However, it also possible to treat the secondary structure as unknown, by placing a prior probability distribution, p(S), over secondary structures and marginalising S.
One way of introducing a prior over secondary structures is by using a Stochastic Context Free Grammar (SCFG). A SCFG is probabilistic extension of a context-free grammar (CFG). A CFG is a type of grammar that defines a set of rules for generating all possible strings in a given formal language. A SCFG extends this notion by assigning probabilities to each possible string in the given language. RNA SCFGs are SCFGs that give probability distributions over strings of base-paired and unpaired nucleotides representing RNA secondary structures (Anderson, 2014).

The KH99 grammar
The KH99 SCFG (Knudsen and Hein, 1999) was chosen as a prior over secondary structures. The rules 7 and associated probabilities are given as follows: Note that S is the start symbol.
The KH99 assigns probabilities to all strings of a specified length that can be written in dot-bracket notation, with at least two unpaired nucleotides separating every base-pair.

Structure-integrated likelihood
Using Bayes' rule, the probability of a secondary structure, S, conditional on the data, D, and phylogenetic parameters, θ, is given by: The structure-integrated likelihood term in the de- Note that structure-integrated likelihood in the denominator of (20) is given by: A recursive formula for calculating the outside probabilities is as follows: where G is a RNA secondary structure grammar. Note that the inside and outside algorithms also require a vector, S, of length of L single site likelihoods, where each element corresponds to p(D q | · q,T , θ).
However, this is fast to compute compared to B.

Sampling secondary structure configurations
The inside probability matrix can be used to sample secondary structure configurations: Sampling terminal strings (secondary structures in our case) using an SCFG is analogous to sampling hidden state sequences using the forward-filtering backward-sampling algorithm for HMMs (Frühwirth- Calculation of (S,F) Figure 2: Illustrations of the inside algorithm showing CPU and GPU parallelism schemes. The light to dark blue gradient moving across each matrix from the central diagonal to the top right-hand corner indicates the order in which each diagonal is computed. The light red elements indicate the data dependencies required to compute the single bright red entry of the inside matrix. The lower half of each matrix with each cell crossed out is not computed and can be ignored. Note that the top-right element corresponds to the structure-integrated likelihood term and is therefore always the last element to be calculated, as it depends on all other elements having been computed first. Schnatter, 1994). An algorithmic description for sampling secondary structures from an RNA SCFG is given in the Supplementary Methods Section

Bayesian posterior inference
The posterior distribution of the continuousparameters, θ, conditional on the data D and a secondary structure S can be sampled using the Metropolis-Hastings algorithm and the relationship given by Bayes' formula: Where the likelihood term, p(D|S, θ), is given by (18) and p(θ) is the prior.
We can also treat the secondary structure as unknown and assume a RNA SCFG prior, p G (S), over secondary structures. Using the structure-integrated likelihood p S (D|θ): However, note that the structure-integrated likelihood term is computed every time a new set of parameters is proposed. As mentioned previously, this requires computing a matrix B of paired site likelihoods (requiring O(L 2 ) computational steps) and calculating the final structure-integrated likelihood term using the inside algorithm (requiring O(L 3 ) computational steps). Therefore gathering enough samples to ensure an adequate sample size will be relatively slow. However, given that we can sample the conditional distribution, p(S|D, θ), using the sampling procedure outlined in Section 2.11 this leads to a potentially more efficient Metropolis-within-Gibbs approach. This approach works by alternatively sampling from the full conditional distribution: using the sampling procedure outlined in Section 2.11 and θ (k+1) ∼ p(θ|D, S (k) )  (27) only requires O(L) operations and can be repeated for multiple iterations following the Gibbs sampling step.
In our implementation we repeat the Metropolis-Hastings step 50 times following the Gibbs sampling step. Together these give a Markov Chain Monte Carlo algorithm whose stationary distribution, p(S, θ|D), and associated marginals, p(S|D) and p(θ|D), are the distributions of interest.

Maximum likelihood inference
The COBYLA optimization algorithm (Powell, 1994) in the NLOpt library (Johnson, 2014) was used to find the maximum likelihood (ML) parameters via the structure-integrated likelihood (20). Note that when doing so the priors over the continuous parameters were either ignored and estimated using ML, or the priors were used and the parameters were fully marginalised (as specified in the Priors section).

Site permutations
To test whether secondary structure dependencies present in real datasets influence model fit, each alignment was taken and its sites randomly permuted. Two such permuted datasets (p 1 and p 2 ) were generated for each real dataset. ML estimation using the structure-integrated likelihood was used to fit the parameters of each permuted dataset under the unconstrained model and the secondary structure information entropy was calculated.

Site permutation benchmarks
To assess the degree to which secondary structure dependencies present in real datasets influence model fit, ML inference was performed on real and permuted datasets, and their structure-integrated likelihoods and structure information entropies were compared (Section 2.14 in Methods). The structure-integrated likelihoods for the permuted datasets were expected to be lower than those of the real datasets. Conversely, the structure information entropies were expected to be higher for the permuted datasets than for the real datasets. Unlike the real datasets, the patterns of coevolution in the permuted datasets were not expected to coincide with stable secondary structure configurations, thereby spreading the probability mass over a larger number of secondary structure configurations.
The maximum likelihood estimates of the structure-integrated likelihoods were indeed lower for the permuted datasets in every instance (Supplementary Table   3.2 Benchmarks of RNA structure prediction Whilst, our model was not designed to predict RNA secondary structure, the expected base-pairing and unpairing probabilities can be calculated (see Sup-plementary Section MESSI has lower precision but higher recall than the other two methods, implying that it predicts more base-pairs (higher recall), but with a higher number of false-positives (lower precision; Figure 3). For the F1-score and MCC measures, both of which combine precision and recall, MESSI performs slightly better than RNAalifold and PPfold. MESSI also performs marginally better with respect to the mountain similarity measure -a measure that takes into account the overall 'shape' of the secondary structures being compared, rather than thee exact matching of basepairs.
Overall, our method performs on a par with the other two well-established methods of comparative RNA structure prediction. This was surprising given that the model was not tuned for secondary structure prediction. Maximum likelihood inference was used to determine the parameters, where the coevolution parameters (λ GC , λ AU , and λ GU ) were free to vary with the only restriction being: λ GC ≥ 1, Figure 3: Summary of secondary structure prediction benchmarks. Structure predictions were performed on 22 RFAM datasets using three different comparative structure prediction methods (MESSI, RNAalifold, and PPFold).
λ AU ≥ 1, and λ GU ≥ 1. Although not tested here, it might be possible to improve the predictive accuracy of MESSI's structure predictions by performing Bayesian or MAP inference of the parameters using a prior learnt from a training dataset of alignments and corresponding structures.

CPU and GPU timing benchmarks
The  The speed-ups seen here are significant, enabling us to analyse datasets which would typically be considered intractable. Note that CPU and GPU implementations were also developed for the outside algorithm with similar speed-ups obtained (Figure

The role of GU and GT base-pairs in single-stranded RNA and DNA
It is well-established that GU pairs can hydrogen bond in RNA to form base-pairs, although they are chemically weaker than GC and AU basepairings (Rousset et al., 1991). The relative chemical strengths of GC, AU, and GU base-pairs are partially due to the number of hydrogen bonds that form between their constituent bases: three for GC base-pairs, two for AU base-pairs, and two for GU base-pairs. Although GU pairs form the same number of hydrogen bonds as in AU pairs, the geometry of the bases leads to the GU pairing being substantially weaker than the AU pairing (Varani and McClain, 2000).
Despite the weaker chemical interaction, GU basepairings are known to be involved in functional RNA structures (Gautheret et al., 1995). Less well understood is the role of GT base-pairings in DNA.
There are some reports of GT base-pairings in doublestranded DNA helices (Early et al., 1978;Ho et al., 1985), but none that were found in a literature search of GT base-pairings in single-stranded DNA. Whilst, we were unable to measure the chemical strength of these base-pairing interactions in the present study, we used MESSI to analyse alignments for evidence of evolutionary forces favouring the maintenance of

Relative coevolution rates
The relative selective strengths of the coevolution rates associated with GC, AU and GU pairs were with both being substantially stronger than GU basepairings.
To assess whether λ GC := λ AU is a reasonable assumption, we performed LRTs comparing the unconstrained model to a λ GC := λ AU constrained model.   pected. There is broad evidence to suggest that base-paired sites in a functionally important RNA structure tend to be be more conserved (less variable) due to being under selective constraint (Muhire et al., 2014;Tuplin et al., 2004) and that doublestranded RNA (i.e. base-paired positions) is less susceptible to basal mutational processes (Lindahl and Nyberg, 1974). Conversely, unpaired sites are expected to undergo relatively higher rates of mutation.
These higher rates of mutation may cause the three non-evolutionary measures of covariation to be erroneously inflated, given that they do not fully account  (Heaphy et al., 1990;Daugherty et al., 2010).
The longest continuous helix identified in both the SHAPE-MaP and MESSI structures, was ranked 2nd in the SHAPE ranking and 8th in the consensus ranking, respectively. The SHAPE-MaP analysis revealed that this helix is highly stable, although its function is unknown. The significant degrees of coevolution detected at base-paired sites within this substructure and the fact that MESSI detects it as conserved across all HIV-1 subtype sequences provides further evidence of its likely functional importance.
Portions of the 3' and 5' untranslated regions (UTRs) were ranked 3rd and 4th in the SHAPE ranking, respectively. This was not surprising given that    Watts et al. (2009). Depicted within each nucleotide is a sequence logo summarising the nucleotide composition at the corresponding alignment position. Mean degrees of coevolution inferred using MESSI are depicted for each base-pair using coloured links (blue-green-yellow gradient).
such as those expected within coding sequences. The third was to permit the strength of coevolution to vary across base-paired sites, enabling the identification of sites that are evolving independently versus those that are coevolving in a manner that favours canonical base-pairing. The fourth and final extension was to allow for an unknown secondary structure by using an RNA SCFG prior over secondary structures, and marginalising over structures using the inside algorithm during ML inference or a Metropoliswithin-Gibbs procedure during Bayesian inference.
Extending the model to permit an unknown secondary structure posed significant computational challenges. The first challenge was the need to compute likelihoods using Felsentein's algorithm for all L 2 paired sites. Fortunately, a large number of redundant calculations could be avoided due to a large proportion of paired sites sharing the same partial sites patterns (Pond and Muse, 2004), resulting in at least a 5× speed-up. Additionally, a further 50× speed-up was achieved using a GPU implementation of Felsenstein's algorithm. The second bottleneck was the need to marginalise an unknown secondary structure using the inside algorithm. Computational speed-ups by factors of 50× -200× were achieved using a GPU implementation of the inside algorithm. For Bayesian inference a Metropoliswithin-Gibbs procedure was implemented to further avoid calculating the paired matrix likelihoods and inside probabilities at every iteration.
Both ML and Bayesian inference were used for different analyses. ML inference allowed us to perform likelihood ratio tests of various hypotheses, where Bayesian model comparison was computationally intractable. Bayesian inference was used to obtain posterior distributions over various parameters, including the rates of coevolution associated with the canonical base-pairs, and posterior probabilities and degrees of coevolution at base-pair sites.
To perform an initial validation of our model, site permutations of alignment datasets were performed to disrupt the secondary structure dependencies expected in real datasets. Consistent with the model behaving desirably, the structure-integrated maximum likelihood values were lower, and the structure information entropy values higher for the permuted datasets overall.
The ability to marginalise an unknown secondary structure shared amongst an alignment of sequences, implies that MESSI is also capable of secondary structure prediction. Although MESSI was not designed with structure prediction in mind, we found that it performed similarly to two popular comparative secondary structure prediction methods: RNAalifold (Hofacker, 2009) and PPfold (Sükösd et al., 2012). This result served to further validate our model.
Strong evidence was found for GU pairs being selectively favoured at base-paired sites in five noncoding RNA datasets and four of five RNA virus genome datasets. Whereas strong evidence for GT pairs being selectively favoured at base-paired sites only found for one out of five of the DNA virus dataset tested. The notion that GU pairs play a role in stabilizing RNA secondary structures is consistent with numerous phylogenetic, and experimental analyses of RNA molecules (Woese et al., 1980;Eddy and Durbin, 1994;Deigan et al., 2009). The role of GT base-pairings in stabilizing DNA genomic secondary structures remains unclear.
The model was applied to the HIV-1 NL4-3 secondary structure and two corresponding alignment datasets containing large numbers of HIV-1 sequences, and an SIVmac239 secondary structure and a corresponding alignment of SIV sequences. It was found that correlations between the SHAPE-MaPdetermined quantities and degrees of coevolution as detected using MESSI were stronger than correlations detected when comparing the determined quantities and degrees of covariation, as measured using three non-evolutionary methods. The ability to infer degrees of coevolution in a statistically rigorous manner, together with MESSI's visualisation and ranking features is expected to assist researchers in focusing their experimental analyses on those portions of large RNA structures that are most likely to be biologically relevant.