When will RNA get its AlphaFold moment?

Abstract The protein structure prediction problem has been solved for many types of proteins by AlphaFold. Recently, there has been considerable excitement to build off the success of AlphaFold and predict the 3D structures of RNAs. RNA prediction methods use a variety of techniques, from physics-based to machine learning approaches. We believe that there are challenges preventing the successful development of deep learning-based methods like AlphaFold for RNA in the short term. Broadly speaking, the challenges are the limited number of structures and alignments making data-hungry deep learning methods unlikely to succeed. Additionally, there are several issues with the existing structure and sequence data, as they are often of insufficient quality, highly biased and missing key information. Here, we discuss these challenges in detail and suggest some steps to remedy the situation. We believe that it is possible to create an accurate RNA structure prediction method, but it will require solving several data quality and volume issues, usage of data beyond simple sequence alignments, or the development of new less data-hungry machine learning methods.


INTRODUCTION
RNA molecules perform many key functions within cells. Perhaps the most striking example is in translation, where it has been shown that the ability to build proteins is orchestrated by ribosomal particles, with the crucial catalytic step being performed by the ribosomal RNA itself and amino acid residues delivered specifically by transfer RNAs. Untranslated regions of mRNAs and viruses harbor numerous regulatory elements. There are also a large number of noncoding RNAs (ncRNA) whose functions, despite decades of research, are only scantily understood. An example is the large class of long noncoding RNAs in animal genomes. These RNA genes are numerous, perhaps exceeding the number of protein-coding genes, and seem to play a range of subtle regulatory roles (1). Many ncRNA functions depend on the stable (ribosome, tRNA) or transient (spliceosome) structure of RNA. Knowledge of RNA structures can answer basic scientific questions and can be of great help in the design of new types of drugs and therapies. Structures can help answer the fundamental evolutionary question of whether life started with RNA as an 'RNA World' (2) or with other, perhaps peptide-type, molecules. Rational drug design would without a doubt benefit from reliable predictions of RNA structures. Increasingly, the growing issue of bacterial drug resistance is approached from different perspectives, but specific inhibition of ribosome particles offers a promising route to effective treatment (3). RNA therapies are attracting more attention from large pharmaceutical companies (4).
RNA building blocks, nucleotides, are chemically complex with aromatic nitrogenous bases, chiral ribose sugar rings and phosphate groups. The bases are able to stack on each other by van der Waals interactions, but they also carry large electrical moments and can form strong hydrogen bonds. Ribose rings strongly constrain backbone geometries by their puckers; the C3'-endo pucker prevails in RNA, but a ribose can also locally adopt the C2'-endo pucker, thus radically changing the backbone geometry. The phosphate groups are perhaps structurally the most complex parts of the RNA molecules due to d-orbitals in phosphorus atoms. Both torsion angles describing the conformations around the phosphodiester bonds O3'-P and P-O5', called ζ and α, prefer -gauche orientations, but the torsions can adopt any other combinations of -gauche, trans and +gauche (+60°) conformations. Phosphates in nucleic acids under normal conditions are charged and render whole RNA or DNA molecules strongly negative, which needs to be neutralized by interacting positive ions. The single negative charge of each phosphate is distributed between its unbound oxygen atoms, which are highly polarizable and capable of forming hydrogen bonds to other RNA atoms, proteins and water, but also of forming charge-charge interactions to amino acids, other cellular components such as amines and prominently also to metals. All intra- and inter-molecular interactions in which RNA molecules are involved determine their structures. Figure 1 illustrates at least some of these physically complex interactions as they were observed in a small six-nucleotide loop from an 80-nt fragment of rRNA from a crystal structure 4qvi (5).
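To make the backbone torsions concrete, the following is a minimal sketch of how a torsion angle such as those around the O3'-P and P-O5' bonds can be computed from four atom positions and binned into -gauche, trans and +gauche. The function names and the 120° bin boundaries are assumptions made for illustration; they are not taken from any particular refinement or validation package.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def torsion_deg(p0, p1, p2, p3):
    """Signed torsion angle (degrees) defined by four consecutive atoms."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def torsion_class(angle_deg):
    """Bin a torsion into three 120-degree ranges (illustrative boundaries)."""
    a = angle_deg % 360.0
    if a < 120.0:
        return "+gauche"   # centered near +60 degrees
    if a < 240.0:
        return "trans"     # centered near 180 degrees
    return "-gauche"       # centered near -60 (300) degrees

# Toy coordinates, not taken from a real structure.
angle = torsion_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(round(angle, 1), torsion_class(angle))
```

For a real dinucleotide, the four points would be the coordinates of C3', O3', P, O5' (for the torsion around O3'-P) or O3', P, O5', C5' (for the torsion around P-O5') read from a structure file.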

RNA 3D STRUCTURE PREDICTION: STATE OF THE ART
In the 1960s, first attempts began to reconstruct in silico the 3D structures of RNA molecules based on sequence homology (6). These efforts became more frequent with a growing number of experimentally determined 3D RNA structures. Building in silico models relied largely on manual manipulation of structure templates in a computational environment. The first interactive tool targeting RNA tertiary structure modeling was published in 1998 (7). Several years later, systems that could fully or semi-automatically proceed from RNA sequence to a 3D model began to appear, using ab initio folding such as FARFAR (8), iFoldRNA (9), NAST (10), SimRNA (11) and Vfold (12); homology modeling such as RNABuilder (13) and ModeRNA (14); or a fragment-based assembly approach used in MC-Fold/MC-Sym (15), Assemble (16), RNAComposer (17) and 3dRNA (18). In the past two years, deep learning (DL)-based predictive models have begun to emerge. The paper by Townshend et al. (19) presented a DL model that predicted the quality (RMSD) of a new computer-generated 3D RNA structure. Meanwhile, other works (20-22) described methods that used deep learning for the end-to-end 3D prediction of the RNA structure.
With the increasing availability of computer-based methods for predicting 3D RNA structures, the question of the reliability and quality of the generated models became more important. In response, RNA-Puzzles, a collective blind experiment to critically evaluate the prediction of 3D RNA structures, was started in 2010 (23). During the past 12 years, RNA-Puzzles organized 38 competitive challenges (24) and two dedicated projects: modeling structures from unknown Rfam families and the untranslated region of SARS-CoV-2 (25). Within each, participants predicted the tertiary structure of a single RNA target. The predictions were evaluated mainly by comparing them with a reference structure once the latter was published in the Protein Data Bank; the assessments for 34 challenges are currently known (data as of February 2023). Several similarity and distance measures were used for evaluation, some of which were specifically developed for RNA (26-30). For example, Interaction Network Fidelity (INF), a similarity measure, scores the prediction of base pairs, Watson-Crick (INF-WC), non-Watson-Crick (INF-NWC) and stacking (INF-stacking). As shown in Figure 2, during the 12 years of challenges in RNA-Puzzles, INF-WC generally ranged between 0.75 and 1.0, demonstrating that most models had accurately predicted double helical stem motifs (INF = 1 means ideal prediction and 0 is failure). However, INF-NWC scored close to 0 for most predictions, which is of concern since non-Watson-Crick base pairs play a crucial role in determining the overall fold of the RNA, influencing stem packing and junction topologies. RMSD indicates how the predicted 3D coordinates diverge from those of the reference structure and shows only a few models with RMSD < 5 Å. For most RNA-Puzzles, the distribution of RMSD values is multimodal and spreads over a wide range. Therefore, despite significant advances in modeling approaches, predicting RNA coordinates with native-like features remains challenging and requires improvements in both accuracy and quality (31).
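The two kinds of measures used in these assessments, a coordinate distance (RMSD) and an interaction-based similarity (Interaction Network Fidelity, INF), can be illustrated with a short sketch. It assumes that the two coordinate sets are already superimposed and listed in the same atom order, and it uses the commonly cited form of INF, the geometric mean of precision and sensitivity over sets of annotated interactions. The function names and the toy data are hypothetical and do not reproduce the RNA-Puzzles evaluation pipeline.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) tuples.

    Assumes the structures are already optimally superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

def interaction_network_fidelity(reference, predicted):
    """INF as sqrt(precision * sensitivity) over two sets of interactions."""
    reference, predicted = set(reference), set(predicted)
    tp = len(reference & predicted)   # interactions present in both
    fp = len(predicted - reference)   # predicted but absent from the reference
    fn = len(reference - predicted)   # in the reference but not predicted
    if tp == 0:
        return 0.0
    return math.sqrt((tp / (tp + fp)) * (tp / (tp + fn)))

# Toy example: base pairs given as (i, j) residue-index tuples.
ref_pairs = {(1, 10), (2, 9), (3, 8)}
model_pairs = {(1, 10), (2, 9), (4, 7)}
print(round(interaction_network_fidelity(ref_pairs, model_pairs), 3))
```

Restricting the two input sets to Watson-Crick pairs, non-Watson-Crick pairs or stacking contacts would give the INF-WC, INF-NWC and INF-stacking variants, respectively.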
The RNA-Puzzles initiative has adopted many mechanisms that were developed in CASP, the biennial experiment for the critical assessment of protein structure prediction. The first CASP competition was launched in 1994 (32), a quarter of a century after pioneering research into 3D computer modeling of protein structure began (33). Twenty-seven participating groups were challenged to predict the atomic coordinates of 33 amino acid sequences. In subsequent editions of CASP, the number of targets and participants increased (Figure 3), and new competition categories emerged. This included a fully automatic prediction by web servers, a category that started in 2000 (CASP4). Eighteen years later, AlphaFold (34) entered the game in CASP13 (35) and went on to make a breakthrough in protein structure prediction in 2020 (CASP14) (36). RNA-Puzzles opened its own web server category in 2015. In 2022, this competition saw the first teams using deep learning models to predict 3D RNA structures. In the same year, CASP-RNA was launched, a contest co-organized by CASP and RNA-Puzzles (37). It coincided with an explosion of interest in the prediction of the 3D RNA structure (38) resulting, among other things, from the success of AlphaFold and the Covid-19 pandemic caused by an RNA virus. Forty-two groups participating in CASP-RNA tried their hand at modeling three-dimensional structures for 12 RNA sequences. Eighteen contributing teams used deep learning models (including DeepFoldRNA, RhoFold, trRosettaRNA and OpenComplex-RNA) at various stages of prediction (20-22). The final CASP-RNA ranking gave the top 4 places to teams that combined expert modeling with non-machine learning algorithms.

THE CHALLENGES
AlphaFold and other highly accurate methods (34, 39-48) applied deep learning to predict the protein structure based on the sequence. Training these tools required huge amounts of data. For example, AlphaFold implemented a bootstrap technique in which its final version used both experimentally determined and predicted structures of high accuracy. A fundamental question is whether we have enough RNA structure data for training and whether they are of sufficiently high quality and diversity.

RNA content in the Protein Data Bank
Since the first tRNA structures were solved in the mid-1970s (49) and published about ten years later (50, 51), it has been known that RNA molecules can adopt complex 3D architectures. However, it was not until the late 1990s that structures of functionally new types of RNA emerged: first several types of ribozymes (52-54), and then impressive ribosome particles (55-57). These revealed the structural richness of the RNA architectures, which was later confirmed by more structures solved mostly by X-ray crystallography and recently by cryo-electron microscopy (cryo-EM). Despite all the discoveries about RNA structures, the sheer volume of experimental structural data available for RNA and proteins remains strongly in favor of the latter (Table 1). There are about 25 times more protein depositions than RNA. The ratio is slightly more favorable for DNA, but even so, both nucleic acids account for < 10% of the PDB archive, and this ratio has remained fairly stable over time. The situation is even more dramatic when restricted to high-resolution data: among X-ray and cryo-EM structures with a resolution better than 2.0 Å, proteins are about 100 times more abundant than RNA (Table 1). Considering all structures with resolution < 3.0 Å, RNA nucleotides constitute only 2% of all residues (nucleotides and amino acids) (58, 59). Unfortunately, these proportions cannot be expected to change quickly. Newly solved crystal and cryo-EM structures tend to have a limited resolution. The reason is the inherent flexibility of RNA molecules that can be estimated, for instance, by factors B and R in the crystal phase; they are higher for RNAs than for proteins with comparable resolution. A limited number of high-resolution RNA structures is a severe constraint, as these structures are the source of the most reliable, and some believe the only reliable, experimental information about the 3D structures.

RNA architectures crucial for the global fold
The main architectural element of RNA is an antiparallel double helix of form A that constitutes approximately 60% of RNA in ribosome particles. The structure of this element is the easiest to identify and predict. The overall three-dimensional arrangement of a molecule results from the assembly of these helical regions. It is orchestrated by various types of 3D motifs such as sharp turns, loops, n-way junctions, coaxial stacking of duplexes and triple and quadruple helical regions (56, 60). A junction consists of at least three helical regions arranged in a way that significantly influences the overall fold. There are three families of three-way junctions, which differ by the coaxial stacking pattern (60). For junctions with higher multiplicity, it becomes more complicated (61). The correct prediction of the junction topology and the resulting stem orientation is of utmost importance, but poses a significant challenge, as there are usually only single or no homologous junctions in experimental structures of RNA (62). All of the aforementioned regions often form between sequentially distant parts of the RNA molecule and are stabilized by non-Watson-Crick base pairs (NWC). Reliable information on structurally critical NWCs is necessary for the correct 2D/3D structural predictions. However, the collection of NWCs in high-resolution PDB structures is not sufficient to infer their sequence and structural features (63). There are ∼34 thousand RNA nucleotides in high-resolution (≤2.0 Å) crystal and cryo-EM structures, compared to ∼42 million amino acids; that is, < 0.1% of all PDB-deposited residues (Table 1). 3D modules are another group of crucial yet hard to predict motifs (64) (Figure 4). They are primarily defined by NWCs that form an intricate network of interactions. These networks remain coherent even in RNAs from different phylogenetic groups. 3D modules serve as loops, turns and foundations for protein-RNA or RNA-RNA interactions. Their accurate modeling is essential to catch the global RNA fold, but it is hardly possible due to the low amount of data available.

Table 1. Numbers of all PDB-released structures (*) and residues in X-ray and cryo-EM structures (**) with high resolution (≤2.0 Å) over decades. In the first column, amino acids are abbreviated as AAs, and nucleotides as nts.
RNA architectures are also stabilized by interactions such as base-ribose hydrogen bonding, intramolecular interactions with charged phosphates, and coordination with metal ions. The roles of these interactions are even less understood than those of non-Watson-Crick base pairs.

Quality of experimental RNA data
The shortage of high-resolution structures is not the only factor that complicates the accurate annotation of RNAs. There are problems with the quality of deposited RNA (and DNA) data that arise from the lack of community-accepted quality standards. They are related to base pairing, valence geometry and backbone geometry; their combination can lead to a flood of imprecisely and unreliably refined structures.
A formal description of base pairing is essential to build reliable 3D models. However, base pairing in public archives is not described reliably; it is often incomplete or incorrect. The programs used to assign base pair topology to 3D structures, such as MC-Annotate (66), RNAview (67), FR3D (68), ClaRNA (69), CompAnnotate (69), RNApdbee (70), bpRNA (71), baRNAba (72), BP-NET (73) and DSSR (74), often provide incomplete or conflicting information (manuscript in preparation). Therefore, comprehensive benchmarking must be performed along with a consistent update of public archives with topology data from the consensus algorithm(s).
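The text does not specify how a consensus annotation would be derived. One simple possibility, shown here purely as an illustration, is majority voting across annotation programs: a base pair is kept only if enough programs report it with the same classification. The tool names, the tuple layout `(i, j, pair_class)` and the class labels in the example are hypothetical placeholders, and the sketch assumes that the outputs of the programs have already been converted to a common residue numbering and nomenclature.

```python
from collections import Counter

def consensus_pairs(annotations, min_votes=None):
    """Keep base pairs reported by at least `min_votes` annotation tools.

    annotations: dict mapping a tool name to an iterable of
                 (i, j, pair_class) tuples in a shared convention.
    min_votes:   defaults to a strict majority of the tools.
    """
    if min_votes is None:
        min_votes = len(annotations) // 2 + 1
    votes = Counter()
    for pairs in annotations.values():
        votes.update(set(pairs))  # each tool votes at most once per pair
    return {pair for pair, n in votes.items() if n >= min_votes}

# Hypothetical output of three annotation tools for the same structure.
example = {
    "tool_a": [(1, 20, "cWW"), (2, 19, "cWW"), (5, 15, "tHS")],
    "tool_b": [(1, 20, "cWW"), (2, 19, "cWW")],
    "tool_c": [(1, 20, "cWW"), (5, 15, "tSH")],
}
print(sorted(consensus_pairs(example)))
```

In this toy case the pair between residues 5 and 15 is dropped because the tools disagree on its classification, which is the kind of conflict that a benchmark would need to resolve rather than silently discard.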
Perhaps of lesser, but still real, importance for the prediction of large RNA structures is the inconsistency of targets used in the refinement of bond distances and angles. These valence geometry targets differ in various refinement programs, validation packages and the PDB, leading to confusion in the community. Therefore, an ELIXIR-led effort was undertaken by the Nucleic Acid Valence Geometry Working Group (75) to formulate community-agreed validation targets (76-78).
A significant source of errors in the structural description of RNA (and DNA) is the misconception about the geometry of the nucleic acid backbone. The structural complexity of the backbone was understood early on (79), but the topic attracted much less attention until the end of the 1990s. At that time, large RNA ribozyme and ribosome structures started to emerge and it became possible to analyze their structural variability based on experimental data. The smallest unit that makes sense to categorize structurally is a dinucleotide, which includes two riboses and captures the complexity of the phosphodiester linkage C3'-O3'-P-O5'-C5'. However, even this relatively small fragment has nine torsional degrees of freedom. The first conformer definitions of dinucleotide fragments were published at the beginning of the 2000s, first for RNA (80-82), later for DNA (83) and recently for both RNA and DNA as a structural alphabet CANA built from dinucleotide conformational classes NtC (84). Perhaps the relative novelty of the concept of conformational classes and the technical difficulties with their implementation into routine refinement and validation protocols are the reasons why the classes are not widely used. We see this fact as one of the reasons why the quality of newly determined structures does not improve.

Sequences and sequence alignments
The efficiency of 3D RNA structure prediction is likely to be improved using information from multiple sequence alignments (MSA). MSA has already been incorporated into several expert-based modeling methods in the human categories of RNA-Puzzles and CASP-RNA (24). Such a strategy is also applied in AlphaFold and other recent protein prediction methods. In these methods, correlated mutations are used to detect residues that are in close contact in 3D space, despite the distance in sequence. This principle has been understood for a long time in RNA (63). Unfortunately, creating high-quality RNA alignments is difficult and often requires the manual work of an expert. This difficulty has led to there being far fewer RNA than protein alignments.
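The idea of correlated mutations can be sketched with the mutual information between two alignment columns, one of the simplest covariation scores. This is only an illustration under stated assumptions: sequences with a gap in either column are skipped, no correction for phylogeny or small sample size is applied, and the function name and toy alignment are hypothetical. It is not the statistic used by AlphaFold or by any specific RNA tool.

```python
import math
from collections import Counter

def mutual_information(alignment, i, j):
    """Mutual information (bits) between columns i and j of an alignment.

    alignment: list of equal-length aligned sequences; '-' marks a gap.
    """
    pairs = [(s[i], s[j]) for s in alignment if s[i] != "-" and s[j] != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    freq_i = Counter(a for a, _ in pairs)
    freq_j = Counter(b for _, b in pairs)
    freq_ij = Counter(pairs)
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return mi

# Toy alignment: columns 0 and 4 covary as complementary pairs
# (G-C, A-U, C-G, U-A); column 2 is invariant.
toy = ["GAAAC", "AAAAU", "CAAAG", "UAAAA"]
print(mutual_information(toy, 0, 4))  # covarying columns
print(mutual_information(toy, 0, 2))  # an invariant column carries no signal
```

The second call shows why highly conserved alignments are a problem for this kind of analysis: a column that never varies cannot covary with anything, so it contributes no contact information.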
To illustrate the difference in quantity, we can compare two resources, Pfam and Rfam. Pfam and Rfam are collections of protein and RNA alignments, respectively, together with models used to annotate them in genomes. Rfam is the oldest and largest source of alignments for ncRNAs. Although there are other resources that collect similar data for RNA, for example miRBase (85), they are smaller and focus on one particular type of molecule. Pfam was founded in 1997 (87), and Rfam in 2003 (88). Each member of Rfam/Pfam is made up of a curated seed alignment, which is used to build the model that allows finding more examples of the family and produces what is known as a full alignment. The models in Pfam are based on hidden Markov models, while in Rfam they are covariance models and also include a consensus secondary structure. Here, we will discuss some of the issues facing machine learning practitioners who want to use RNA alignments by comparing these resources.
First, while Rfam is similar to Pfam in spirit and goals, it contains far less data than Pfam. At the time of writing this paper, the current version of Rfam, 14.9, contains 4108 alignments, while the current release of Pfam, 35.0, contains 19 632. The difference in resource size is due to historical bias towards RNA gene discovery, the difficulty in identifying homology between related RNAs, and the difficulty in building new alignments for Rfam. Constructing Rfam alignments requires using covariance models, which are much more computationally expensive compared to the hidden Markov models applied to build Pfam alignments.
Second, RNA alignments are on average smaller than protein alignments. This holds for the number of sequences, with seed alignments containing an average of 5 sequences in Rfam versus 23 in Pfam (Figure 5A), as well as for the number of columns, 95 columns in Rfam versus 163 in Pfam (Figure 5B). There is also a significant difference in the degree of conservation, with the Rfam alignments 83% conserved versus 26% in Pfam (Figure 5C). Together, this means that there are few RNA alignments compared to proteins, and the existing alignments are smaller and lack variation. Therefore, it is likely that there is not enough RNA data yet to effectively train machine learning methods. This is also supported by the fact that the currently best-performing RNA-dedicated methods in CASP are not machine learning based.
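The degree of conservation compared here is an average pairwise percent identity per family. A minimal sketch of such a calculation follows. Definitions of percent identity differ between tools; this version assumes that only columns where both sequences have a residue are compared, which is an assumption of the sketch and not necessarily the definition used for Figure 5. The function names and the toy alignment are hypothetical.

```python
from itertools import combinations

def pairwise_identity(seq_a, seq_b):
    """Percent identity over columns where neither sequence has a gap."""
    compared = matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

def average_pairwise_identity(alignment):
    """Mean percent identity over all sequence pairs in an alignment."""
    scores = [pairwise_identity(a, b) for a, b in combinations(alignment, 2)]
    return sum(scores) / len(scores) if scores else 0.0

# Toy seed alignment of three sequences and four columns.
toy_seed = ["GGCA", "GGCA", "GAUA"]
print(round(average_pairwise_identity(toy_seed), 1))
```

A family whose seed alignment scores very high on this measure offers few substitutions from which covariation, and hence contacts, could be inferred.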
Third, Rfam alignments have several global biases that make working with them difficult. One is that the most common alignments are for simple molecules. Taking into account the type of RNA, most alignments concern miRNA precursors (35%) followed by snoRNA (19%) (Figure 6). miRNA precursors are simple molecules, essentially a helix with a few small loops and mismatches; in proteins, this is most similar to a single alpha helix. Such simple structures do not represent the complexity of RNA folds; for example, they do not contain any junctions, while, as discussed above, the junction topology is essential to determine the overall structure of more complex RNAs.
Another global bias is observed in the number of seed or full sequences. Rfam has the most data for bacterial small RNA (sRNA) sequences. However, there are few structures of these molecules, with < 50 in PDB at the time of writing. In terms of full alignments, tRNAs constitute the largest group (45%), and rRNA subunits are the third largest, accounting for another 8% (Figure 6). These families are the most commonly solved structures, representing 26% and 61% of all known 3D structures of RNA, respectively (Figure 6). Although a large collection of these sequences and structures is valuable, we recommend caution. Creating ML models that generalize to other structures is unlikely if their training is based only on ribosomes. Several prediction methods that are trained on currently existing datasets have not yet produced high-quality models.
In addition to the global bias in the RNA data, there are specific issues with Rfam alignments that must be considered in machine learning. For example, not all non-Watson-Crick base pairs are aligned in Rfam, and the aligned ones have not been handled in a consistent manner. Moreover, Rfam consensus secondary structures can represent parts of the structure as unfolded. However, the 3D structure of such a region, when available, often shows a clear secondary structure. These regions either are known to have species-specific structure, or their unstructured form results from Rfam limitations. Rfam families are intended to cover a wide phylogenetic range. For example, the eukaryotic large subunit ribosomal RNA family (RF02543) represents all large rRNA subunits in all eukaryotes.
However, rRNA is well known to vary considerably within the kingdom, or even within a species, with important functional consequences (89). Since the 2D structures in Rfam must represent what is common to all members of the family, they are often underfolded in many regions. This should be dealt with when building a useful ML training set. Finally, pseudoknots, a key factor in 3D RNA structures, have been shown to help organize the global structure, but are not consistently annotated in Rfam alignments. Unfortunately, current 2D and 3D prediction methods struggle to predict them. Rfam is working to annotate more observed pseudoknots but many families lack them.
In summary, there are several issues with the RNA alignment dataset that will pose problems for deep learning. The dataset is small compared to proteins, is highly biased in several ways, and the existing alignments have some shortcomings. While work is ongoing to fix all these issues, it will be challenging to use these data to successfully predict 3D structures. One key issue will be creating a test/train dataset that represents the observed complexity, while not being overly biased.
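One common way to limit leakage between training and test data, given here only as a hedged sketch, is to split at the level of families rather than individual sequences, so that no family contributes to both sets. The record layout `(sequence_id, family_id)`, the function name and the identifiers in the example are hypothetical. A realistic split would additionally need to balance RNA types and structural complexity, which this sketch does not attempt.

```python
import random
from collections import defaultdict

def split_by_family(records, test_fraction=0.2, seed=0):
    """Split sequence IDs into train/test so that families never overlap.

    records: iterable of (sequence_id, family_id) tuples.
    Returns (train_ids, test_ids).
    """
    by_family = defaultdict(list)
    for seq_id, family_id in records:
        by_family[family_id].append(seq_id)

    families = sorted(by_family)            # deterministic starting order
    random.Random(seed).shuffle(families)   # reproducible shuffle
    n_test = max(1, round(len(families) * test_fraction))

    test_ids = [s for f in families[:n_test] for s in by_family[f]]
    train_ids = [s for f in families[n_test:] for s in by_family[f]]
    return train_ids, test_ids

# Hypothetical records; the family labels are placeholders, not Rfam accessions.
records = [("s1", "FAM_A"), ("s2", "FAM_A"), ("s3", "FAM_B"),
           ("s4", "FAM_C"), ("s5", "FAM_C"), ("s6", "FAM_D"), ("s7", "FAM_E")]
train, test = split_by_family(records, test_fraction=0.2, seed=1)
print(len(train), len(test))
```

Because whole families are held out, the test set measures generalization to unseen families, which is the setting that matters if most training data come from a few RNA types.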

CONCLUSIONS
Given the history of protein fold prediction, can we anticipate when the RNA realm will see similar results? AlphaFold's success came 50 years after the first work on computer-based protein structure prediction. This period of time was necessary to accumulate a sufficient volume of high-quality, reliable data on protein sequences and structures. At the same time, information and computer technology were developed, enabling efficient applications of artificial intelligence models to solve problems that traditional computational methods could not deal with. Artificial neural networks as an idea are already 80 years old (90), but it was only in the second decade of the 21st century that they came into widespread use. In 2012, the power of deep learning was demonstrated (91, 92). It has triggered a flood of projects that have applied DL models to various areas of life. Among other things, this wave has brought about new predictive methods dedicated to molecular structures. All of them are data-hungry; AlphaFold has been trained on structures of more than 170,000 proteins combined with very large sequence alignments. We expect to have similar requirements to successfully use neural networks for RNA 3D structure prediction.
A simple way to estimate when AlphaFold for RNA will be created is to consider when the number of RNA structures or sequence alignments becomes comparable to the currently available protein data. As mentioned above, Pfam contains 19 632 protein sequence alignments. Historically, the growth of Rfam has been linear due to the requirement for manual work to build each alignment, and we observe that on average Rfam adds approximately 205 alignments per year. Thus, we estimate Rfam will contain 19 000 alignments in approximately 70 years. This is undoubtedly a vast overestimate of the waiting time, as we expect the RNA 3D structure prediction problem to be solved before then. One technique which may help is automatic family building. While this is still unsolved for RNA, there has been recent work on this issue which may be promising (93). Automatically built families were used in training AlphaFold and may prove useful for RNA as well (34).
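The linear extrapolation behind this estimate can be reproduced in a few lines. The numbers are those quoted in the text (4108 alignments in Rfam 14.9, about 205 new alignments per year, a target of 19 000); the function name is hypothetical and the assumption of constant linear growth is exactly the simplification the text itself makes.

```python
def years_to_reach(target, current, per_year):
    """Years needed to grow linearly from `current` to `target`."""
    if per_year <= 0:
        raise ValueError("growth per year must be positive")
    return max(0.0, (target - current) / per_year)

rfam_now = 4108        # alignments in Rfam 14.9
rfam_growth = 205      # average number of alignments added per year
target = 19_000        # roughly the size of Pfam 35.0

years = years_to_reach(target, rfam_now, rfam_growth)
print(round(years, 1))  # about 72.6, i.e. roughly 70 years
```

Changing `rfam_growth` shows how sensitive the estimate is to the curation rate: any technique that multiplies the yearly growth, such as automatic family building, shortens the horizon proportionally.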
We believe that there are several viable approaches to enable the prediction of the 3D RNA structure in the near future. First, the RNA community can improve knowledge of RNA structure through more data; second, we can diversify the data used in prediction; and finally, we can improve the machine learning methods used.
What data is missing that would improve predictions? We do not seem to know enough about RNA motifs to predict their global structures. We may provide an educated guess, at least for the small structural motifs, of which the most important are base-pair topologies. Concerning the latter, it is very likely that they exist in known structures of reasonably high resolution and can provide reliable geometries. There are also strong reasons to believe that the CANA alphabet describes more than 90% of the existing dinucleotide conformers; only a few of them may be missing (84). In our opinion, more research is needed on intramolecular interactions other than base pairs, namely hydrogen bonding bridges of the O2' group to bases, ribose and phosphates, and interactions between phosphate oxygens (mostly charged) and other RNA constituents. Benchmarking the quality of 3D structures, as well as streamlined and consistent principles of their validation, is required to ensure reliability in data repositories.
Another approach is to improve the size and scope of multiple sequence alignments of RNA. Alignments of four-letter RNA sequences are more challenging than those of 20-letter protein sequences. Some classes of RNA, such as ribosomal RNAs, have a large number of sequences and we know how to align them. However, more well-aligned sequences of underrepresented RNA classes are needed. Perhaps the Tree-of-Life projects (94, 95) will provide a sufficiently large number of sequences. Currently, RNA gene prediction is inconsistent across known genomes, so we encourage the community to annotate ncRNA genes in newly sequenced genomes. Annotated ncRNAs from Tree-of-Life projects can show low sequence diversity, and we recommend that ncRNA gene annotation in metagenomes be used as a solution. We note that AlphaFold required metagenomic sequences in order to reach its maximum performance, and we suspect that RNA will show a similar trend. Solving these challenges involves finding all the ncRNA genes and making the data reusable.
Consistently annotating RNA families across all genomes will be useful and may increase the diversity of RNA sequences available; however, it seems that a prediction method would benefit from a wider range of RNA families. As discussed above, many Rfam families are structurally similar. We believe that providing a more diverse training set would be useful. While Rfam is the global repository of RNA families, not all known families can be found there. Correcting this and working to create new families that are different from existing ones should be a focus of the RNA community. Additionally, creating high-quality alignments remains a challenge (96).
If the current amount and growth rate of currently available sequence and structure data are not sufficient, can they be supplemented with other sources of data?We think so.
In particular, RNA biochemistry has a rich history and has developed many methods to rapidly probe 3D structures (97, 98). A subset of these data, SHAPE probing, has proven useful to classical prediction methods, and we expect it to be helpful to DL-based approaches. Although many labs probe the structure of RNA, these data are not readily available to ML practitioners. Working as a community to standardize, collect and distribute such data seems valuable for predictions. Additionally, there are other low-resolution methods, such as SAXS and AFM, which may prove useful in modelling structures (97).
Finally, the rapid and hard-to-predict development of ML methods may potentially change our pessimistic predictions about the ability to accurately predict 3D RNA structures. Development of methods that are less data hungry, e.g. transfer learning, may allow successful prediction sooner. We believe that RNA structure prediction is an excellent test case for researchers interested in machine learning in the face of limited data. At the moment, we do not believe that reliable 3D RNA prediction will be available in the 2020s, but we challenge the community to prove us wrong.

DATA AVAILABILITY
The data underlying this article are available in Zenodo, at https://doi.org/10.5281/zenodo.8167407 .

Figure 1. Examples of interactions in an RNA molecule. Some of the most important interactions are highlighted in dashed lines: base pairing hydrogen bonds in dark red, sugar-base stacking in dark violet, phosphate-base hydrogen bond in yellow, water-formed hydrogen bonds in cyan (waters are depicted as cyan balls). The bottom pair is canonical Watson-Crick, the pair above is a G-U pair 'locked' by interaction with a bridging water molecule. G2147 is in syn orientation and dinucleotide C2146-G2147 is in the left-handed Z-form conformation (note the inverted direction of the ribose of C2146 further stabilized by stacking its O4' to the guanine aromatic ring). Displayed is a six-nucleotide loop from an 80-nucleotide-long fragment of 23S RNA from Thermus thermophilus complexed with ribosomal protein L1 (PDB ID: 4qvi) (5).

Figure 2. Distribution of values of selected evaluation measures for the predictions submitted to RNA-Puzzles from inception to 2022.Numbers in parentheses next to each puzzle indicate the total number of nucleotides for all structures in each puzzle.

Figure 3. Numbers of RNA and protein structure predictions made in RNA-Puzzles and CASP competitions. The solid lines represent the numbers of groups competing in CASP and RNA-Puzzles; the dashed lines are for the number of protein/RNA targets. From 2010 to 2021, RNAs were predicted only in RNA-Puzzles; in 2022, CASP also included RNA targets, which is responsible for the recent spike in targets and groups involved in 3D RNA structure prediction.

Figure 4. Comparison of predicted and experimentally determined structures. Displayed is hammerhead ribozyme RNA: the structure determined experimentally by X-ray diffraction at 2.9 Å resolution (PDB ID 5di4) (65) is shown in light blue, the model PZ15 Adamiak 15 is in red. Cartoon representation of the residues A9-U33 in panel (A) suggests that the prediction follows the overall topology of the ribozyme correctly but with local deviations. Panel (B) shows segments between residues G11 and G18. The overall backbone direction is predicted correctly but local deviations are large. They include differences in base orientations and subsequently in base pairing; the distances between the corresponding phosphorus atoms are also quite large, and one such distance between Ps of adenosines 15 of the target and model is highlighted by the green rod. Segments in panel B on the left and right show the same atoms; the view is rotated by ∼90°.

Figure 5. Rfam versus Pfam alignments compared based on (A) the number of sequences, (B) the number of columns and (C) the average pairwise percent identity for each family. The points on the plots indicate the mean, and the vertical bars indicate the standard deviation.

Figure 6. Counts of Rfam families, seed sequences, full sequences and structures for all Rfam families organized by Rfam RNA type.