Concurrent prediction of RNA secondary structures with pseudoknots and local 3D motifs in an integer programming framework

Abstract Motivation The prediction of RNA structure canonical base pairs from a single sequence, especially pseudoknotted ones, remains challenging in a thermodynamic models that approximates the energy of the local 3D motifs joining canonical stems. It has become more and more apparent in recent years that the structural motifs in the loops, composed of noncanonical interactions, are essential for the final shape of the molecule enabling its multiple functions. Our capacity to predict accurate 3D structures is also limited when it comes to the organization of the large intricate network of interactions that form inside those loops. Results We previously developed the integer programming framework RNA Motifs over Integer Programming (RNAMoIP) to reconcile RNA secondary structure and local 3D motif information available in databases. We further develop our model to now simultaneously predict the canonical base pairs (with pseudoknots) from base pair probability matrices with or without alignment. We benchmarked our new method over the all nonredundant RNAs below 150 nucleotides. We show that the joined prediction of canonical base pairs structure and local conserved motifs (i) improves the ratio of well-predicted interactions in the secondary structure, (ii) predicts well canonical and Wobble pairs at the location where motifs are inserted, (iii) is greatly improved with evolutionary information, and (iv) noncanonical motifs at kink-turn locations. Availability and implementation The source code of the framework is available at https://gitlab.info.uqam.ca/cbe/RNAMoIP and an interactive web server at https://rnamoip.cbe.uqam.ca/.


Introduction
The rise of RNA therapeutics (Wang et al. 2020, Yu et al. 2020) is due to technical and computational advances in our understanding of RNA sequence-structure-function paradigm.While the prediction of all-atoms RNA structure from sequence still remains a challenge (Miao et al. 2020), many different approaches have allowed to reach interesting results in various facets of the problem.
Taking advantage of the hierarchical folding of RNA, the secondary structure composed of strong stems of canonical and Wobble base pairs form first (Tinoco Jr and Bustamante 1999), many efficient theoretical approaches have been developed to predict this secondary structure.Nonetheless, the most accurate and feasible models as in the ViennaRNA package (Lorenz et al. 2011) or RNAstructure (Reuter and Mathews 2010) assume that there are no crossing interactions, no pseudoknots, since that assumption adds complexity and decreases accuracy in the thermodynamic parameters making it often impractical to use.Yet pseudoknots are abundant and important.To predict accurate secondary structure with them would be invaluable for the main 3D reconstruction tools that rely on that secondary structure (Watkins et al. 2020).
The prediction of RNA with pseudoknots is in all generality in the nearest neighbor model NP-Hard (Lyngsø and Pedersen 2000).The dynamic programming algorithm solving exactly the minimal free energy (MFE) structure problem and with the most general classes of pseudoknots is PKnots (Rivas and Eddy 1999) with prohibitive complexity in time of Oðn 6 Þ and space of Oðn 4 Þ.Instead of exact methods, heuristics have also been developed as HotKnots (Ren et al. 2005).Reducing the set of achievable pseudoknot configurations (while keeping known important ones) combined with sparsification techniques Knotty (Jabbari et al. 2018) is able to achieve a time complexity of Hðn 3 þ ZÞ (with in practice Z < Oðn 4 Þ).While deep neural networks as SPOT-RNA (Singh et al. 2019) on the subjects have been published, rigorous benchmarks show that they still lack generability (Szikszai et al. 2022, Justyna et al. 2023).By formulating the problem as an Integer Programming problem IPknot (Sato et al. 2011) has proven to be effective to predict general pseudoknotted secondary structures when maximizing base pair probabilities generated by RNAfold (Lorenz et al. 2011), leveraging fast general solvers and without sacrificing possible outputs.Methods as BiokoP (Legendre et al. 2018) developed a different approach by targeting Pareto fronts with Integer Programming, showing that the pseudoknotted structure can be predicted with greater accuracy when combining the MFE and MEA.
More recent work improved IPknot using clever heuristic to solve long sequences in reasonable time, combining a dynamic threshold with a linear-time approximation of the RNA folding partition (Sato and Kato 2022).
Beyond the secondary structure RNA are composed of many different important interactions.The Leontis-Westhof (LF) classification (Leontis and Westhof 2001) defines 12 types of possible base pairs, between any nucleotides.When describing the loops between the rigid stems using those noncanonical interactions, different methods have shown that conserved sub structures are present and important (Leontis et al. 2006, Petrov et al. 2013, Reinharz et al. 2018).These databases of motifs can be leverage for not only more accurate structure prediction, but also to include geometric information beyond canonical and Wobble base pairs.
In previous work, we used conserved structural motifs to select an optimal secondary structure and ease all-atoms 3D reconstruction (Reinharz et al. 2012, Yao et al. 2017).Subsequently a different group developed BiORSEO (Becquey et al. 2020) that computes the Pareto front of an objective function balancing structures with pseudoknots and motifs insertions.They enforce stricter constraints on the motif insertions.As discussed later, interior loop and multiloop motifs are composed of nonsequential strands, which when inserted are connected by base pairs with BiORSEO.It is not the case in RNA Motifs over Integer Programming (RNAMoIP).The size of the Pareto front becomes also prohibitive to compute for longer sequences.
In this paper, we expand on our IP framework RNAMoIP (Reinharz et al. 2012) to achieve the simultaneous prediction of secondary structure with pseudoknots and structural motifs insertion with or without alignments incorporating ideas from IPknot (Sato et al. 2011), using a newly designed local structural modules dataset computed from Soule ´et al. (2021).
In Section 2.3 the unified IP equations are presented.We discuss in Section 3.9 how a sequence alignment can be used, a feature implemented in our software.We benchmark the secondary structure prediction on all known nonredundant RNAs with a determined pseudoknotted structure below 150 nucleotides in Section 3.4, and how the predictions fare for the canonical and noncanonical interactions inside the motifs in Sections 3.5 and 3.6.We then evaluate how a good sequence alignment can improve the prediction in Section 3.9 with a set of hand aligned structures by Rfam.We conclude on analysis in Section 3.10 by looking specifically at the kinkturn motif that was present in four previous structures and how using an alignment or not influences its prediction.

Materials and methods
Let x an RNA sequence, and X its secondary structure.A base pair ði; jÞ 2 X must be canonical (G-C or A-U) or Wobble (G-U), and have j À i > 3. The Leontis-Westhof (LW) classification of RNA base pairs (Leontis and Westhof 2001) The secondary structure X can be decomposed in an ensemble of pseudoknot-free structures X 1 ; . . .; X m .X q is pseudoknot-free if there is no crossing between any base pairs, formally for all ði; jÞ; ðk; lÞ 2 X q ) i < k < l < j or k < i < j < l.
The workflow of our framework is presented in Fig. 1: 1) Given a sequence x and a secondary structure X simultaneously: a) Decomposed X in pseudoknot-free structures, for each compute a constrained version of a classic folding algorithm to obtain a base pairing probability matrix (if no structure is provided, the algorithm runs without constraints once), sum all the matrices.b) Find all possible motifs location using pattern match in the input sequence x. 2) Solve the IP model to find the optimal combination of base pairs with pseudoknots and motifs given our objective function detail in Section 2.3.3.3) Iterate until: a) Two iterations give the identical solution.b) A threshold in the number of iterations or time is reached.

Structural motifs
The validation of structure prediction using motifs database is challenging to ensure that the benchmark is not biased.We decided to use the same database as in the first version of RNAMoIP built in 2012 (Reinharz et al. 2012).Additional measures to avoid overfitting are discussing in Section.3.3.This database is based on the detection of similar networks of interactions among all RNA 3D structures (Reinharz et al. 2018).The entire dataset is built upon 398 unique common subgraphs (or RINs-Recurrent Interaction Networks), that can be divided into 5278 different nucleotide sequences.Those sequences are composed of multiple strands, which can represent motifs like hairpins, interior loops, and multi-loops.

Base pair probabilities
Different tools as the ones provided by ViennaRNA (Lorenz et al. 2011) and RNAstructure (Reuter and Mathews 2010) can accurately, in a thermodynamic setting, utilize the most recent set of nearest neighbor parameters to compute base pairing probabilities for pseudoknot-free structures.Following the model of IPknot (Sato et al. 2011), the secondary structure X is decomposed in an ensemble of pseudoknotfree structures X 1 ; . . .; X m .For each, a base pair probability (BPP) matrix can be computed such that the base pairs in that sub-structure are enforced as hard constraints, and any position in another sub-structure are forbidden to pair.In the IP formulation, the weight of the base pair ði; jÞ will be the pseudo-probability pði; jÞ ¼ p 1 i;j þ Á Á Á þ p m i;j .Note that during the evaluation of the BPPs the base pairings are considered as hard constrained, they must be preserved, there is no such condition in the IP model.

Integer programming model
The integer programming model is quite complex and we reproduce here all equations for sake of completeness.Section 2.3.1 describes how the motifs are encoded.The Section 2.3.2 lists all the model variables.Then the objective function is detailed in Section 2.3.3.The constraints in regards to the base pairs placement are in Section 2.3.4 and the ones regarding the motifs insertion are in Section 2.3.5.All modifications to the original equations of RNAMoIP are in green, and the complete model is in Supplementary Material S6.1.

Input
We denote an RNA sequence as x and X as a secondary structure compatible with it.x i is the nucleotide at position i and must be in fA; C; G; Ug.The structure can be empty and may contain crossing interactions.We use n ¼ jxj as the length of the sequence.
Each motif in the database can be composed of a set of different sequence strands (e.g. an interior loop has two strands).The equation will differentiate between hairpins (1 strand), interior loops and bulges (two strands), and k-way junctions (3 or more strands).Mot j is the set of motifs with j-strands and each is composed of its list of strands.For any motif x 2 Mot, the length jxj represents how many nucleotides it contains.Formally, a position x b a in a strand can be A, C, G, U, or the wild card *, and we have: The strands are ordered in the 5 0 to 3 0 order of the sequence they are extracted from.The model needs to know where the i À th strand of any motif composed of j-parts can be inserted.These can be of different lengths, so the ensemble Seq j i will be a set of triplets with the name of the motif and the first and last positions where the ith strand of that motif can be inserted.The same strand can be placed in many different places.Formally: x 2 Mot j and x i 1 ; . . .; x i ki ¼ x a ; . . .; x aþkiÀi g:

Variables
To keep in line with the previous implementation of RNAMoIP, two type of binary variables are being tracked.
First, four things need to be kept track for the motifs: (i) x the name of the motif (ii) if it is the ith strand of that motif, and (iii,iv) the interval k, l where it is inserted.This will be kept done by the variable C x;i k;l representing the insertion of the ith component of the motif x at position ðk; lÞ of the sequence.One such variable exists for every element of every set Seq j i .Second, for every pair of position where the BPP is above a certain threshold, the model needs to know if it is instantiated and at which level.Each level will contain a pseudoknot-free structure, and will have to be crossing a base pair in each level below it.The binary variable D q u;v will be 1 if there is a base pair between x u and x v at level q.The set B will contain all pairs of positions ðu; vÞ with a potential base pair.

Objective
Intuitively, we want to maximize the pseudo-base pair probabilities as the amount of information in the sequence.The IPknot objective function is based on the MEA and tries to maximize an approximate gain that does not take into account positions unpaired in the pseudoknotted secondary structure.With RNAMoIP we enhance the objective function by giving a bonus to unpaired positions that are known to fit into an insertable motif, compensating some of the necessary simplifications of the model.In Reinharz et al. (2012), this is achieved by maximizing the square of the length of the motifs inserted, which will push to insert the smallest amount of largest motifs possible.As in Section 2.3.1 the number of nucleotides in motif x is denoted jxj.For each potential base pair between positions ðu; vÞ the pseudo-BPPs probability pðu; vÞ (see Section 2.2) is used.As in the first version of RNAMoIP, the factor of 10 is added to normalize the scale between the pairing probabilities and the motifs insertion, which allows the a parameter to be calibrated more easily.A parameter a will be used to combine the pseudo-BPPs maximization and motifs insertion.A weight b q is used to balance the different level of pseudoknots.Following IPknot we select b 1 ¼ 0:5; b 2 ¼ 0:25; b 3 ¼ 0:125 and b 4 ¼ 0:0625.Formally, the objective function is: (1)

Base pairs constraints
The first three equations will ensure that each position is only in one base pair [Equation ( 2)], that each level q contains a pseudoknot-free structure [Equation (3)], and that base pairs are added to a level if and only if they cross another base pair in every lower level [Equation ( 4)].Two additional properties we enforce are that base pairs must be stacked on the same level [Equations ( 5) and ( 6)) and that at least 25% of positions must be in a base pair [Equation ( 7)].Note that naturally, due to the weights in the objective function, most of the base pairs will concentrate on lower levels.

Motifs constraints
Following the original formulation of RNAMoIP, we reproduce here all the equations necessary for the insertion of motifs.While most of them are exactly similar, there is a notable difference.One of the main conditions for the insertion of any strand is that it must be stacked or overlapping the last base pair of a strand.Since the motifs database is defined over the loops of a pseudoknot-free secondary structure, only the base pairs in the first level are considered.We give a brief overview of each equation but more details can be found in Reinharz et al. (2012).Finally, to unify both models, it is important to avoid clashes between the motifs and base pairs at the different levels.This is the role of Equation ( 8).Since the structural motifs are defined on the secondary structure, we insert them in relation to the base pairs in D 1 and forbid other base pairs to form inside them.We assume that motifs are a cohesive geometric unit.

Incorporating evolutionary information from sequence alignments
Sequence alignment contain large amount of evolutionary information that can be leveraged for better prediction.When an alignment is provided to RNAMoIP the execution logic is slightly adapted in two ways to take this new data into account.
First, instead of relying on RNAfold, the base pairings probabilities matrix is calculated taking the alignment into account using RNAalifold (Bernhart et al. 2008) which is part of the ViennaRNA Package.
Second, each motif insertion score is weighted in function of its compatibility with the alignment.A motif can now be inserted if it (i) perfectly matches the input sequence at that position, or (ii) is at most at Hamming distance 1 of at least 50% of the alignment at these positions.In the objective function [Equation (1)] the term C x;1 k;l representing motif x inserted in position k, l is now weighted by the sum for each strand of the fraction of its match with the alignment.Details are shown in Supplementary Material S6.4.

Implementation
The Integer Programming framework is implemented in Python 3, with an interface to facilitate usage with different solvers.In this study we used the open-source solver Or-Tools (OR-Tools 2021), giving better performance.Instructions are also provided to use the open source IP solver CBC (Coin or 2022) through the MIP library (Department of Computing 2021), or the proprietary Gurobi (Gurobi Optimization, LLC 2022) solver.We ran our benchmarks on Ubuntu 21.04 on an Intel Xeon Processor W-2295 with 512 GB 8Â64 GB DDR4 2933.The source code, data, and results, are available at https://gitlab.info.uqam.ca/cbe/RNAMoIP.

Dataset
For benchmarking, all RNA structures between 20 and 150 nucleotides in the PDB (Berman et al. 2000) were selected, filtering for identical sequences.To avoid molecular redundancies, we kept one structure per nonredundant class as defined by the BGSU RNA Structure Atlas (B.R. Group 2022) v3.208.
The canonical base pairs in the secondary structure can be deconvoluted in different ways into a main knot-free structure and an ensemble of pseudoknots of increasing complexity (Smit et al. 2008).A reference secondary structure from which pseudoknots are defined was determined using RNApdbee (Zok et al. 2018).The benchmark dataset is composed of the remaining 101 structures with at least a single pseudoknot.

Solver
Version 2.6.4 of ViennaRNA is used to compute the base pair probability matrices.The terminating conditions are set to a maximum of three iterations, or two iterations with the same results.A time of 10 4 s was allotted for each prediction and sequences.While many optimal solutions can exist for one IP formulation only the first one achieved was used.
To avoid overfitting, before each sequence prediction all motifs belonging to any structure in the same RNA Structure Atlas v3.208 nonredundant class were removed (B.R. Group 2022).

Motifs insertion improve secondary structure prediction
To evaluate the capacity of RNAMoIP to predict the secondary structure, the set of True Positives (TP) consists of canonical and Wobble base pairs.The Positive Predicted Value (PPV: TP TPþFP ), sensitivity (STY: TP TPþFN ) and F1 ( 2ÁPPVÁSTY PPVþSTY ) are used as metrics and shown in Fig. 2. We compare the results of RNAfold and RNAMoIP with different values of a.When a ¼ 0, no motifs are considered, and the model is equivalent to IPknot.Note that RNAfold STY cannot reach 1, since only pseudoknotted structures are in the benchmark set, but RNAfold cannot predict crossing interactions.And this is what we observe, where it has lower STY than almost all values of a.On the other hand, RNAfold predictions are in general more sensitive than the IP model, especially when no motifs are used.The average values over all finished models are shown in Table 1.The optimal F1, balancing the amount of base pairs predicted and their sensitivity, is achieved with a ¼ 0:15, complementing the base pairs with motifs information.
The statistics shown previously are computed over all the secondary structures.Our models allow up to two crossing levels between the base pairs yet 95 of the benchmarked structures have only one level of crossing interactions.No overprediction of the pseudoknots level was observed, as shown in Table 2. Pseudoknots are in fact usually under-predicted.When no motifs are inserted, around 15% of structures have a pseudoknot level too low, up to 50% when motifs are added to the model.Nonetheless, the improvement in PPV and of over 10% in F1 measure indicates that although no pseudoknot is predicted the structure is much more accurate.a As a is increased, we underestimate the lvl of pseudoknots that are present in the structure.The complexity of the IP model increases and makes it more challenging to find a feasible optimal solution in time.
Concurrent prediction of RNA secondary structures

Recovering canonical and Wobble interactions in the motifs
For each motif inserted, we retrieved from the RNA structure atlas all canonical and Wobble interactions at the inserted positions to build our positive examples.Note that these interactions can be crossing inside the loops and do not need to be stacked, therefore they are not necessarily part of the secondary structure.Since a motif sequence in our database can match different sub-structures, the one with the best structural match was used in this and the subsequent section.The same metrics as previously, PPV, STY (available in Supplementary Fig. S1), and F1 are computed and averaged over all motifs in all structures.Inside the motifs, an F1 value around 40% is achieved, a slightly lower precision for predicting the canonical and Wobble pairs in the motifs than the pseudoknotted secondary structure, as shown in Fig. 3.

Noncanonical interactions remain challenging to predict
In the minority and hard to predict, noncanonical interactions have been shown to be necessary for many RNA functions.RNAMoIP maximizes the insertion of motifs based solely on a sequence match and the structural context.We expect similar loops in different RNA to adopt the same topology, but it is not given since each motif inserted comes from an RNA from a different redundancy class.As in the previous section, for each motif the positive set consists in the ensemble of all noncanonical interactions between inserted positions.The distribution of F1 is shown in Fig. 4 (PPV and STY in Supplementary Fig. S2) highlighting how the motifs as of now are still limited to predict this finer grain information.
While looking dramatic, it can be explained in two different ways.First, their number is really low, as shown in Supplementary Fig. S3.The Y-axis shows that at these positions rarely >4 noncanonical interactions are present per RNA, and rarely >2 found in the inserted motifs.Second, this is underestimating the noncanonical interactions since many cannot be predicted by our model.On top of those that are at locations where no motifs are predicted, many can be linking motifs together, but those cannot be found in the dataset used.

Performance
Integer programming is known to be NP-complete, but decades of optimization have allowed to leverage efficient implementations to express complex models.We remind that a maximum of 10 4 s was enforced and the 14 sequences without a solution at the time cutoff with a ¼ 0:1 were discarded.We show in Supplementary Fig. S8 the execution time in seconds.More heuristics as the ones developed by IPknot for long sequences could be used, at the cost of a decrease in the accuracy.We expect that the optimal gains would be achieved by optimizing the location where we allow motifs to be inserted.

Tools comparison
When benchmarked only on the canonical and Wobble base pairs, while a value of a ¼ 0.15 seems to give marginally better results in the secondary structures, it is a bit worse inside the motifs.We therefore selected a ¼ 0:1 as default parameter value.
We evaluated in Fig. 5 RNAMoIP with a ¼ 0:1 against 11 other tools: RNAfold (Lorenz et al. 2011), our implementation of IPknot (Sato et al. 2011) (i.e. a ¼ 0), PKnots (Rivas and Eddy 1999), HotKnots (Ren et al. 2005) Knotty (Jabbari et al. 2018), SPOT-RNA (Singh et al. 2019), MXfold2 (Sato et al. 2021), LinearFold (Huang et al. 2019), pAliKiss (Janssen and Giegerich 2015), BiokoP (Legendre et al. 2018) and BiORSEO (Becquey et al. 2020).While in the task of predicting base pairs of the secondary structure RNAMoIP clearly outperforms most of the competitors.When compared with Knotty it produces as good results while providing more information in the form of motifs inserted.Since we cannot re-train SPOT-RNA, it was run on the subset of the test sequences that did not share a BGSU RNA Structure Atlas redundancy class with its training set.We show in the supplementary material the results when run over the entire set, and we can see how the overfitting greatly improves its result.As mentioned in recent papers (Szikszai et al. 2022, Justyna et al. 2023), if more data was available, a benchmark based on Rfam families would probably decrease further the results.

Including alignments increases predictability
Our model was extended in Section 2.4 to incorporate evolutionary information as sequence alignments.The Rfam  Out of those, they belong to 23 different families (as defined by RNA Structure Atlas).The representative of each family was selected, or a random for the two families with a structure in an alignment but not its representative.The name of the 23 structures is provided in the directory with the code in the results folder.We describe in Supplementary Material S6.4 how we modified the procedure.
To the best of our knowledge, only pAliKiss (Janssen and Giegerich 2015) can use an alignment to predict secondary structures with pseudoknots.Due to the limited amount of tools and data, only 23 structures, we used as baseline RNAalifold (Lorenz et al. 2011)-which does not take into account pseudoknots-and evaluated the number of base pairs statistically supported by the alignment using R-scape (Rivas et al. 2017).We show the comparisons with RNAMoIP F1 score in Fig. 6 (PPV and STY can be found in Supplementary Fig. S6).We compare with RNAfold and RNAMoIP on these sequences without the alignments in Supplementary Fig. S4 and Supplementary Fig. S5.
A drastic increase in PPV is observed as much for RNAalifold versus RNAfold as for RNAMoIP.R-scape indicates that only a fraction of the base pairs are directly supported by the sequence alignment.We see that with a sequence alignment, there isn't a significant improvement brought by the motifs under our scheme.The F1 distributions for RNAMoIP at a ¼ 0 where no motif is taken into account or a ¼ 0:1 are almost similar, and increasing a slowly decreases the F1 value.These results are of equivalent quality to the ones returned by pAliKiss.We note that pAliKiss produced in 60% of the cases multiple solutions, up to 47, but as shown there is no observable difference between its best and a random structure.Incorporating motif prediction using our simple scheme does not decrease accuracy while providing additional information, as we will explore in the next section.

Kink-turn
A particularly interesting geometrically complex local motif is the kink-turn (Klein et al. 2001, Matsumura et al. 2003) which is linked to many different biological processes (Huang and Lilley 2018).Since they greatly constrain the structure, they are key pieces completely ignored in the secondary structure representation.
The RNA Structure Atlas (Petrov et al. 2013) maintains different classes of kink-turn as all their occurrences in known RNA structures.From the previous set of 23 structures with alignments, four have a kink-turn annotated in them all in group IL_29549.4,in PDB and chains 3V7E-C, 4AOB-A, 4KQY-A, and 5FJC-A.Since all four belong to the same group, they are represented with the same noncanonical diagram.
We investigate where motifs are predicted at the position of the kink turn with and without the alignment in Fig. 7.In both cases, different interior loop is colored as green or blue, and hairpins as orange.When exactly the right base pair was predicted, it is also colored, the other ones inside the predicted motifs are omitted.
We observe immediately that there is always an interior loop matching on each strand of the kink-turn, in green.The prediction without alignment seems to be more cohesive, having always a match on all the position forming a cycle with the triangular pattern.But we can see using the alignment that the match on 5FJC chain A recover three out of the four non-Watson-Crick/Watson-Crick base pairs.Since the noncanonical interaction do not necessitate an exact match to reflect similar geometries (Stombaugh et al. 2009) a more in depth study out of the scope of the present work is needed.Nonetheless, RNAMoIP predicts relevant geometric information that cannot be represented by the secondary structure at the kink-turn location.Loyer and Reinharz correspond to each motif found are shown in their respective tab, with all their canonical and noncanonical interactions as a dynamic 3D visualization allowing superposition of the different occurrences of the motifs.

Conclusion
In this work, an integer programming framework allowing simultaneous prediction of the secondary structure with pseudoknots, and insertion of structural motifs, is presented.The implementation in RNAMoIP is benchmarked over the 101 nonredundant pseudoknotted RNAs with known structure.We show that combining the approach of IPknot to construct pseudoknotted structures based on the base pair probability matrix obtained by the standard thermodynamic model, implemented in the ViennaRNA, with the insertion of known conserved structural motifs, allows to: (i) increase the accuracy of the prediction of secondary structure with pseudoknots, (ii) generates accurate knowledge about canonical and Wobble interactions present inside structural motifs, which might not belong to the secondary structure, (iii) improves drastically under a simple scheme to incorporate evolutionary information from multiple sequence alignment as validate with 23 independent structures aligned by Rfam, and (iv) predict noncanonical interaction motifs at kink-turn locations.
Two main limitations are highlighted by our work.First, motifs can be inserted now based on a perfect sequence match.More advanced probabilistic techniques, as RMDetect (Cruz and Westhof 2011), JAR3D (Roll et al. 2016) or BayesPairing (Sarrazin-Gendron et al. 2020), would allow to integrate a more rigorous term in the objective function, as match motifs with altered sequence, increasing diversity and therefore the range of predictable structure.Second, the database of motifs only incorporates loops (i.e.hairpin, interior loops, multi-loops).These approaches can directly use multiple sequence alignment which would not only increase the general base pairs as shown in this work, but would probably show a more significant boost from incorporating the motifs.
Advances in structural molecular biology are pushing against the limitation of the nearest neighbor model.While the biological importance of networks of noncanonical interactions are becoming more and more evident, the capacity to predict them lags far behind.The IP programs remains a promising direction for RNA structure determination due to the flexibility of their formulation allowing to go above the nearest neighbor model.Expending to more complex conserved structures, as groups of interacting and conserved loops containing pseudoknots described in Carnaval (Reinharz et al. 2018, Soule ´et al. 2021) would allow to take fully advantage of the IP formulation and extend the notion of pseudoknots prediction to all noncanonical interactions.This flexible formulation will also allow to give specific rules to help incorporate chemical modifications and other features that are absent from the nearest neighbor model.

Figure 1 .
Figure1.The RNAMoIP workflow.Left top: the sequence with a structure (which can be empty).The structure is decomposed in pseudoknot-free substructures, for each a constrained BPP is computed, and they are all summed together.Left bottom: a database of structural motifs containing hairpins, interior loops and bulges, and k-way junctions.Right: outputs an optimal combination between a secondary structure with pseudoknots and motifs inserted in sequence compatible locations.Each motif strand must stack or overlap by one position a base pair in the secondary structure

Figure 2 .
Figure 2. Pseudoknots prediction accuracy.Comparing results for RNAfold (cannot predict crossing interactions), without motifs insertion (a ¼ 0), and for different values of a.When a > 0, all base pairs inserted in the same motifs are counted as true positives.In (a) we compare the positive predice value and sensitivity, and in (b) the F1 scores.

Figure 3 .
Figure 3. Prediction accuracy of canonical and Wobble interactions in motifs.For a values of 0.05, 0.1, 0.15 that more than half of the canonical and Wobble base pairs in the motifs are correctly predicted, and 40% of them are generally captured

Figure 4 .
Figure 4. Predicting noncanonical base pairs in motifs.True positives are the noncanonical base pairs at positions where one motif is inserted in the sequence.They are composing at most 15% of the interactions in the inserted motifs, and are hard to predict (see Supplementary Fig. S3 in Supplementary Material)

Figure 5 .
Figure 5. Tools comparison of F1 scores.RNAMoIP with parameter a ¼ 0:1 is compared with 13 other tools.Knotty does not provide additional geometric information.SPOT-RNA is evaluated on the subset of sequences, not in its training set

Figure 7 .
Figure 7.For the four PDB with a kink-turn the representation of its group and colored the position where RNAMoIP predicted motifs on the kink-turn as the exact base pair

Figure 6 .
Figure 6.Alignment-based secondary structure prediction accuracy.Result of the alignment's dataset with the help of the alignment information.pAliKiss and R-scape were also added for comparison 8 defines 12 different geometries possible combining two edges between Watson-Crick (W), Hoogsteen (H), Sugar (S) and an orientations cis (c) or trans (t).The canonical and Wobble base pairs are all of type cis Watson-Crick/Watson-Crick (cWW).Generally, any combination of nucleotides can form any type of base pair.Stability in the nearest neighbor model is obtained from stacked base pairs (Turner and Mathews 2010), forbidding lonely base pairs implies formally that if ði; jÞ 2