DoubleHelix: nucleic acid sequence identification, assignment and validation tool for cryo-EM and crystal structure models

Abstract Sequence assignment is a key step of the model building process in both cryogenic electron microscopy (cryo-EM) and macromolecular crystallography (MX). If the assignment fails, it can result in difficult to identify errors affecting the interpretation of a model. There are many model validation strategies that help experimentalists in this step of protein model building, but they are virtually non-existent for nucleic acids. Here, I present doubleHelix—a comprehensive method for assignment, identification, and validation of nucleic acid sequences in structures determined using cryo-EM and MX. The method combines a neural network classifier of nucleobase identities and a sequence-independent secondary structure assignment approach. I show that the presented method can successfully assist sequence-assignment step in nucleic-acid model building at lower resolutions, where visual map interpretation is very difficult. Moreover, I present examples of sequence assignment errors detected using doubleHelix in cryo-EM and MX structures of ribosomes deposited in the Protein Data Bank, which escaped the scrutiny of available model-validation approaches. The doubleHelix program source code is available under BSD-3 license at https://gitlab.com/gchojnowski/doublehelix.


INTRODUCTION
Nucleic acids are key players in many cellular processes ranging from gene expression to the catalysis of chemical reactions. For many nucleic acid molecules tertiary structure determines function, much like for proteins. Ne v ertheless, our understanding of the structure-function relationship in nucleic acids clearly lags behind proteins, which is reflected by the disproportion of structures deposited in the Protein Data Bank (PDB) ( 1 ). As of January 2023, out of 200 708 available structures only 15 374 (7%) contained a nucleic acid component. The resolution revolution in cryogenic electr on micr oscopy (cryo-EM) seems to be slowly changing this picture as more and more challenging nucleicacid complex es ar e being determined using this technique. In 2022, out of 1454 structures with nucleic-acid components deposited in the PDB as many as 804 (55%) were determined using cryo-EM. Many of these cryo-EM structures would be very difficult to solve using other techniques owing to their size and structural heterogeneity ( 2 ).
The release of the Artificial Intelligence (AI) based structur e pr ediction programs AlphaFold2 ( 3 ) and RoseTTAFold ( 4 ) provided means for accurate and widely accessible structur e pr ediction of protein structur es. Although they did not solve the problem of protein structure determina tion the accura te predictions proved useful for solving the phase problem in macromolecular X-ray crystallo gra phy and the interpretation of cryo-EM maps ( 5 ). There is no AlphaFold2 or RoseTTAFold equivalent for nucleic acids and most lik ely w on't be soon as building AI 3D structur e pr ediction tools r equir es huge and di v erse training sets of reliable structural models that are currently not available for nucleic acids. The PDB-deposited models are strongly biased towards ribosomal RNA that is usually highly conserved across all kingdoms of li v e. Moreov er, as will be shown later in this wor k, availab le e xperimental RNA structures contain difficult to identify errors that may reduce generalization properties of structure prediction methods. Although attempts to build such tools are already taken (e.g. RoseTTAFoldNA ( 6 ) and ARES ( 7 )), experimental techniques will continue to be the method of choice for detailed structural studies of nucleic acid complexes, with all their limitations and bottlenecks.
MX and cryo-EM remain the most frequently used experimental approaches for the structure determination of large biomolecules. The main result of both these methods is an atomic model traced into a map -an interpretation of experimental observations given a priori knowledge of biomolecular structure. Although, the main effort of the method de v eloper comm unity is clearl y focused on proteins, se v eral techniques facilitating experimental determination of nucleic acid structures in cryo-EM and MX have been de v eloped, e.g. NAUTILUS ( 8 ), ARP / wARP ( 9 ), PHENIX ( 10 ), RCrane ( 11 ), COOT ( 12 ), ISOLDE ( 13 ), DeepTracer ( 14 ) and ModelAngelo ( 15 ). As with proteins, nucleic acid model building in these methods usually starts with tracing into a density map a ribose-phosphate backbone that makes up two-thirds of a polynucleotide mass. The backbone model is subsequently assigned to a target sequence and complemented with base moieties.
Sequence assignment is a crucial step in macromolecular model building. It is r equir ed for the identification and completion of missing fragments in initial models. It is also a fundamental pr er equisite of a model interpretation. Failure may lead to register-shift errors, where residues are systematically assigned an identity of a residue a few positions before or ahead in sequence. Although register-shifts may bias model interpretation, they remain one of the most difficult problems to identify and correct in macromolecular models ( 16 ). In protein models, register-shifts often result in backbone-geometry outliers when se v eral sidechains are forced into too small density volumes, which can be detected using geometry validation approaches like CaBLAM ( 17 ). Moreover, backbone tracing issues (e.g. deletion or insertion) that caused register shift can be occasionally detected as a geometry outlier ( 18 ). Ne v ertheless, it has been shown that regardless of the effort made to validate protein models, r egister-shift errors ar e r elati v ely common in PDB ( 19 , 20 ). Particularly affected are very large structures, for example ribosomes, where detailed inspection of a model using interacti v e tools (e.g. COOT or ISOLDE) is rarely feasible.
Sequence assignment errors in nucleic acids are e v en more difficult to identify than in proteins. They rarely result in se v ere geometry issues during model refinement as the ribose-phosphate backbone dominates scattering and its geometry is weakly affected by the presence of misassigned nucleobases. Mor eover, small differ ences between different types of purines and pyrimidines makes visual validation of a sequence assignment very challenging unless high-resolution maps are available.
The most prominent issue related to a sequence assignment error in nucleic acid structures are steric clashes arising from the presence of base-paired nucleobases that don't fit their secondary structure context -non-isostericity ( 21 ). For example, a Watson-Crick pair in cis orientation that erroneously involves two guanines is too large to fit into a double-helical region ( 22 ). At lower resolutions, howe v er, this will be promptly masked by a refinement software and the bases that are weakly restrained by a map shifted to a non-clashing conformation. Ne v ertheless, these relati v ely rare issues can be in principle be detected using standard model-validation software, e.g. Molprobity ( 17 ).
Another, rar ely r ecognised, issue r elated to the sequence assignment in model building are unknown target sequences. Until recently, structural studies of macromolecules of unknown identity, e.g. extracted from natural sour ces, wer e attempted pr edominantly using MX ( 23 ). Recent de v elopments in cryo-EM, howe v er, forged a completely new path to the studies of uncharacterised macromolecules. It has been shown that cryo-EM reconstructions of protein nucleic-acid complex es, at r esolutions high enough for de-novo model building, can be determined directly in a cell using subtomogram averaging ( 24 ). Highresolution structural information can be also retrie v ed from a systematic cryo-EM analysis of cell lysate fractions ( 25 , 26 ). In a recent study Skalidis and co-w ork ers ( 27 ) presented a complete workflow for identifying biomolecules directly from nati v e cell e xtracts combining cryo-EM with AI structur e pr ediction methods. Although ther e ar e in principle no technical limitations to the identification of nucleicacid sequences directly from cryo-EM reconstruction, to the best of my knowledge there is no computational tool available that could be used for this purpose.
In this work, I present doubleHelix ; a computer program for comprehensi v e nucleic-acid sequence identification, assignment, and validation in MX and cryo-EM models. Similarly to a pre viously de v eloped program findMySequence ( 28 ) for protein-sequence identifica tion, doub leHelix uses neural network classifiers for estimating residue-type probabilities gi v en a backbone model and a density map. W ha t makes the doubleHelix program unique is the way it addresses the inherent nucleobase-type ambiguity that makes it impossible to distinguish adenine from guanine and cytosine from uracil or thymine unless a very high-resolution experimental data is available. The program estimates only the probabilities of purines and pyrimidines in a model. This information is complemented with base-pairing restraints obtained using a new approach that relies on a backbone conformation ignoring nucleobase identities that are not known before the sequence assignment. The basepair identification approach is based on alignment of recurrent nucleic-acid structural motifs of known secondary structure (e.g. A-or B-form double helices) to the target model. I show that despite its simplicity this approach is both highly specific and accurate. Moreover, the secondary structure information it provides readily improves sequence assignment and identification performance at lower resolutions where base-type classification reliability is reduced. I also show an example of a previously unidentified RNAsequence assignment errors in mammalian and bacterial ribosome structures deposited in the PDB that could have been avoided if doubleHelix had been used for model building and validation.

AN O VER VIEW OF THE DOUBLEHELIX METHOD
The doubleHelix program r equir es on input a model in PDB or mmCIF format. For the sequence identification and assignment, it also r equir es a corr esponding ma p, w hich can be provided in CCP4 / MRC format for cryo-EM models or as a MTZ file with structure factor amplitudes and phases for crystal structures. The doubleHelix program provides four basic functionalities: • Secondary structure restr aints gener ator for nucleic-acid models Gi v en RNA or DNA model on input the program generates base-pair and -stacking restraints in formats accepted by COOT and popular refinement programs REFMAC5 ( 29 ), PHENIX ( 30 ) and ISOLDE ( 13 ). Additionally, it generates a PYMOL ( 31 ) script that can be used for visualising the restraints (Figure 1 C). The restraints are also generated for interacting model fragments modelled as separate chains (e.g. DNA duplexes).
• Identification of unknown sequences of nucleic-acid models For a nucleic acid model and a corresponding map (CCP4 / MRC or MTZ f ormats f or cry o-EM and MX, respecti v el y), the pro gram identifies the most plausible sequence from a sequence-database in FASTA format gi v en estimated nucleobase-type probabilities and input model secondary structure (Figure 1 D). By default, the program identifies a sequence that best matches all nucleic-acid chains fragments in the input model.
• Assignment of nucleic acid models to known target sequences For a nucleic acid model and a corresponding map (CCP4 / MRC or MTZ f ormats f or cry o-EM and MX, respecti v el y), the pro gram assigns continuous pol ynucleotide chain fragments to the target sequence and rebuilds the bases accordingl y. A part from the estimated nucleobasetype probabilities, base-pairs identified within the fragments are used as additional restraints (Figure 1 E).
• Sequence assignment validation in nucleic-acid models Gi v en a nucleic-acid model, a corresponding map (CCP4 / MRC or MTZ f ormats f or EM and MX, respecti v ely) and the set of all model sequences, the program evaluates the plausibility of the model's sequence assignment. This feature is implemented as an extension of the c hec k-MySequence program and uses an algorithm described pre-viously ( 19 ). Users interested in the validation of nucleic acid model sequence assignment should refer to instructions available on the c hec kMySequence project page.

Recurr ent structur al motifs in nucleic acids
The doubleHelix program identifies base pairs in RNA and DNA models from a local similarity of backbone coordinates with small 'search-fragments' of known secondary structure. Model, double helical A-RNA and B-DNA sear ch-fragments wer e generated using the X3DNA suite ( 32 ). Non-helical search fragments were selected using RNA Bricks ( 33 ) database. Selected sets of r ecurr ent RNA fragments classified as 'loops' with at least 500, 100, 25 and 10 occurrences in the database correspond to 83, 532, 1430 and 2664 sear ch-fragments r especti v ely (as of 28 March 2020).

Ribosome crystal structures for secondary structure assignment benchmarks
As a r efer ence for the secondary structur e assignment benchmarks, I used crystal structures of ribosomes available in PDB as of 28 March 2020. From all structures determined at a resolution better than 3.0 Å , I selected ones with crystallo gra phic R-free factor below 0.3. To reduce the set redundancy, from each group of similar structures (e.g. originating from the same publication) I selected models with the lowest R-work / R-free difference. The resulting set contained eight structures originating from Haloarcula marismortui , Thermus thermophilus , and Deinococcus radiodurans (PDB entries 1s72, 4yb b , 7rqa, 1hnx, 1fjg, 1k73, 4y4o, 6oxi). For each of the models, secondary structure was determined using the ClaRN A pro gram ( 34 ) and used as a groundtruth.

Reference set of ribosome cryo-EM structur es f or sequence identification benchmarks
From the PDB, I selected cryo-EM structures of ribosomes determined at a resolution better than 3.5 Å . Among 102 such structures available as of 4 February 2020, I selected 17 with half-maps available for download in the Electron Microscopy Data Bank (EMDB). For each of the half-map pairs local resolution maps were calculated using Resmap version 1.1.4 ( 35 ) with default parameters.
The selected models (PDB entries 3j79, 3j7a, 3j7q, 5iqr, 5mdv, 5mdw , 5mdy , 5ngm, 5umd, 5wdt, 5we4, 5wfs, 6okk, 6p5i, 6p5j, 6p5k, 6p5n) originated from fiv e different organisms: Plasmodium falciparum, Esc heric hia coli, Staphylococcus aur eus, Sus scr of a and Ory ctolagus cuniculus . For each of them, nucleotide sequences corresponding to RNA features annotated based on the genome sequence were downloaded from NCBI ( 36 ) and used as r efer ences for the sequence identification benchmarks. The r efer ence sets contain tRN A, rRN A and ncRN A sequences, except for those corresponding to eukaryotic organisms that additionally contain mRNA sequences. To ensure that exact matches are available in the reference sets I added target rRNA sequences to each of them.

Structures used for training neural network classifiers
As map features observed in cryo-EM and MX maps differ in fine detail, two separate neural networks were trained for each of these experimental methods.
For training the cryo-EM nucleobase-type classifier from the cryo-EM structures of ribosomes initially selected for sequence identification benchmarks, for w hich half-ma ps were not available in EMDB, I randomly selected 10 (PDB entries 5afi, 5mmi, 5u9f, 5wdt, 6eri, 6h4n, 6ogi, 6om0, 6q8y , 6sgc). Additionally , 142 PDB-deposited cryo-EM structures containing a DNA, but not an RNA component determined at resolution 3.5 Å or better with mapto-model correla tion coef ficient above 0.8 as estima ted for complete models (including non-NA components) using phenix.map model cc ( 37 ) were added to the training set.
For training a crystal structure nucleobase-type classifier, I selected eight structures and corresponding maps that were also used for benchmarking secondary structure assignment procedur e. Mor eover, 100 crystal structur es randomly selected from PDB that contain a DNA, but not an RNA component, determined at resolution 3.5 Å or better with R-free-R-work below 0.3 were added to the training set.

Ribosome structures for the base-type classifier benchmarks
For benchmarking the residue type neural network classifiers, I selected fiv e the highest resolution cryo-EM and MX structures available in PDB as of 24 April 2023 and released after training the base-type classifiers. The resulting set contained crystal structurers refined between 2.3 and 2.5 Å resolution (PDB entries 8cvl, 6xhv, 8cvj, 7rqe and 7r qa). The r esolution of selected cryo-EM structur es varied between 1.5 and 1.9 Å (PDB entries 8b0x, 8glp, 8a3d, 8aye and 7k00). All the models contain modified nucleic acids residues as modelled by their authors. Each modified residue in the set was classified as a purine or pyrimidine based on the presence of imidazole rings. Specifically, the classification was based on the presence of both N1 and N9 atoms within a base, which are located a pproximatel y 4.1 Å apart from each other.

Ribosome crystal structures for sequence identification and assignment benchmarks
For RNA sequence identification benchmarks, I arbitrarily selected two crystal structure models of Thermus thermophilus 30S ribosomal subunit determined at a resolution 2.8 Å (PDB entry 2uub) and 3.3 Å (PDB entry 6mpi). For both targets, r e-r efined structur es wer e downloaded from the PDB REDO server ( 38 ). Initiall y, a randoml y selected 90% of ribosomal RNA nucleobases were mutated in both models (purines to pyrimidines and vice versa) keeping canonical base-pairing interactions identified using doubleHelix (the procedure is implemented in doubleHelix program and can be enabled with an option "-randomize = 0.9 "). The model coordinates were subsequently randomised with 0.2 Å RMSD ignoring any geometry restraints and automatically refined using the PDB REDO w e b server. For both models the randomisation procedure clearly affected the R-work / R-free factor values that increased from 0.19 / 0.23 to 0.25 / 0.29 and from 0.20 / 0.25 to 0.23 / 0.27 for better and worse resolution structures respecti v el y. The automaticall y refined randomised models with corresponding maximum likelihood, combined 2mFo-DFc maps were used for sequence identification and assignment benchmarks.
For the sequence identification benchmarks, nucleotide sequences corresponding to RNA fea tures annota ted based on the Thermus thermophilus genome were downloaded from NCBI and used for making queries.

Neur al netw ork base-type classifier
To estimate the probability that a gi v en nucleotide fitted into a map corresponds to a purine or pyrimidine two independent neural-network classifiers wer e pr epar ed. The classifiers have identical ar chitectur e but ar e trained on distinct training sets deri v ed from crystal structures or cryo-EM models and their respecti v e maps.
Nucleotides are described with a vector of map values sampled around a putati v e base moiety (a residue descriptor). The map is sampled on a regular grid with 1.0 Å spacing. The grid is centred at the N1 or N9 atom for purine and pyrimidine respecti v ely and spanned by orthonormal vectors defined by glycosidic bond (e x ), the normal vector of the ribose best-fitting plane (e y ), and their cross product (e z = e x × e y ). For a gi v en nucleotide the input to the classifier contains a cloud of 403 grid points that are within 1.0 Å distance from any atom of a nucleobase mutated to Guanine in any r otation ar ound the glycosidic bond. In practice a precomputed cloud is aligned to each nucleotide using C2', C1' and O4' ribose or deoxyribose atoms.
The neural-network model input is a vector of length 403 (the residue descriptor described above). The model contains two, fully connected hidden layers. The first layer has a R eLU (R ectified Linear Unit) activation function, which sets all negati v e neuron inputs to zero, and 403 output features. The second layer has 2 output features and uses the log-softmax normalisation function enabling estimation of output classification probabilities. To avoid overfitting, an additional dropout layer was inserted between the two hidden layers. The dropout layer at each training step disables neuron connections selected at random with probability P . The models were trained for 1000 epochs with P = 0.5, a batch size of 20 residue descriptors in each parameters update cycle, and a 10% validation set. The models were trained using the ADAM optimization algorithm ( 39 ) with a learning rate of 1e-5 that resulted in the best test-set accuracies.
For training the crystal structure classifier I used 84 887 and 9431 nucleobase descriptors for training and test-set respecti v ely. The accuracies of a resulting model were 0.98 and 0.96 for the training and test sets, respecti v el y. Similarl y, for training the EM classifier I used 85 092 and 9454 residue descriptors in training and test-sets, respecti v ely. The resulting model estimated accuracies were 0.95 and 0.92 for training and test set, respecti v ely.

Secondary structure assignment procedure
Unlike DN A, w hich occurs in nature predominantly in a double-helical form, RNA molecules are often singlestranded and fold into complex structures stabilised by stacking and base-pairing interactions ( 40 ). Folded RNA molecules have a modular ar chitectur e in which the doublehelical r egions ar e intertwined with differ ent types of loops that define the topology of the structure and stabilise it through long-range interactions. Many of these loops are r ecurr ent and can be found in a similar structural context in many, possibly evolutionary unrelated, RNA molecules ( 33 ). Most importantly, it is the overall module geometry and base-pairing pattern rather than the nucleotide sequence that is conserved across different occurrences of the same module ( 41 ). This feature of RNA molecules is used in the doubleHelix program for the inference of base-pairing interactions from the local geometry of sugar-phosphate backbone. This approach ignores both identities and mutual orientation of bases, which is particularly important in the analysis of preliminary, not fully refined models, where base identities are not yet known and their coordinates, unlike relati v ely heavy backbone, may be inaccurate.
The program superposes small RNA or DNA searchfragments of known secondary structure onto the input model using an algorithm described previously as a part of a model-building program Brickworx ( 2 ). First, all possible triplets of phosphate group P-atoms in a search-fragment are structurally aligned with similar P-atom triplets from the input structure. Resulting rigid body transformation is applied to the complete fragment to identify matching nucleotides in the search fragment and input structure. Finally, the match is refined using all sugar-phosphate backbone atoms. If the resulting root-mean-square deviation (RMSD) is below 1.0 Å (the threshold defined in the Results section), search-fragment base pairs are assigned to the corr esponding r esidues in the input model. If multiple, overlapping matches of search-fragments are identified, the one with the lowest RMSD is selected.
For the sake of computa tional ef ficiency, the input model processing is divided into two steps. Firstl y, onl y a doublehelical fragment is matched to identify Watson-Crick basepairs. In this step, A-RNA or B-DNA search fragments are used depending on the target nucleic acid type. Next, all nucleotides within stacked Watson-Crick base-paired regions (except flanking residues) are removed from the input model. In the second step, used only in case of RNA targets, a predefined set of r ecurr ent RNA motifs is matched to the remaining nucleotides in the input model. In case of basepair assignment conflicts, the ones detected in the second step are gi v en preference.

Sequence identification procedure
For the identification of the most plausible sequence in a database, gi v en input model residue-type probabilities and secondary structure, doubleHelix uses sequencecomparison tools from the INFERNAL suite ( 42 ) (Figure  1 D). Initially, pr edicted r esidue-type probabilities ar e converted into a multiple sequence alignment (MSA), where fractions of residue types in each column correspond to predicted probabilities. The MSA, combined with basepairing pattern is encoded in a Stockholm format (STO), which is an input to the cmbuild program. The resulting Covariance Model (CM) is further calibrated with cmalibrate and used to query sequence databases using the cmsearch program with default parameters. The Hidden Markov Model (profile-HMM) queries are enabled by adding the -hmmonly k eyw ord to cmsearch and skipping the calibration step. Sequence hits with the lowest E-values are returned to the user (3 by default).

Sequence assignment procedure
Analo gousl y to the sequence identification procedure, dou-bleHelix uses the estimated base-type probabilities to assign RNA or DNA models to a target sequence. For a contin uous polyn ucleotide fragment in the input model a neural network classifier is used to estimate base-type (purine or pyrimidine) probabilities for each residue. The resulting scoring matrix is aligned to the target sequence and the probability of each tentati v e alignment is approximated by a product of the probability estimates for each residue in the fragment, assuming their independence. If any residue pair in the fragment forms a Watson-Crick base-pair (detected using a procedure described above) an additional, low-probability correction factor (0.1) is used if for a tentati v e alignment the two bases are either both purines or pyrimidines, which is very unlikely for a Watson-Crick interaction. Otherwise, the correction term is 1.
Although the neural-network classifier has been calibrated and the pr edicted r esidue-type probabilities generally reflect expected frequencies, the accuracy of predictions may vary depending on local map resolution and quality of the models ( 9 ). Ther efor e, for each tentati v e assignment of a fragment to a target sequence the method estimates a p-value, or probability that a gi v en alignment has been observed by chance. A tentative alignment probability is compared to a background distribution of the fragment alignment probabilities for a long, random sequence. To additionally account for the varying target-sequence lengths an additional extreme-value distribution theorem correction is applied as described before ( 19 ).

Implementation
The doubleHelix program was implemented using Python3 with an e xtensi v e use of numpy ( 43 ), scipy ( 44 ), CCTBX ( 45 ) and CCP4 ( 46 ) libraries and utility programs. The neural network classifier used in this work was trained using Pytorch ( 47 ) and deployed to a C code using keras2c ( https: //github.com/f0uriest/keras2c ). For making the rRNA sequence database queries, the program uses INFERNAL suite version 1.1.4.

Secondary structure assignment
The base-pair assignment in the doubleHelix program relies on structural alignment of r ecurr ent motifs with known secondary structur e (sear ch-fragments) to the input model. The method uses fiv e different sets of search-fragments.
First, A-or B-form double helices, for RNA and DNA target models, respecti v el y, are tried. Additionall y, for RN A models, the method uses four sets of the most frequent recurrent RNA motifs from the RNA Bricks database (see Materials and Methods for details). Although the doublehelical search fragments can be used for the identification of the canonical Watson-Crick base-pairs only, the other search fragments allow for the identification of any basepairing interaction type.
The base-pair assignment procedure parameters providing maximum performance are 2-base-pair double-helical search fragments with 1.0 Å RMSD thr eshold (Figur e 2 A). This can be explained with a relati v ely high structural heterogeneity observed in double-helical structures ( 48 ), which cannot be r epr esented using longer, idealised search models. Interestingly, an additional search step with recurrent RNA motifs with at least 500 occurrences in the RNA Bricks database further improves precision of the Watson-Crick base-pairs assignment. The resulting Watson-Crick basepair assignment f 1-score 0.94 is comparable to a value of 0.95 reported for a recently described program CSSR ( 49 ), which also relies on backbone conformations only. With doubleHelix this translates to the recall of 0.91 and precision 0.98 (no corresponding results were reported for CSSR). Unlike doub leHelix, howe v er, CSSR focuses e xclusi v ely on pairs of nucleotides compatible with canonical base pairing (A / U, G / C or G / U) that makes the two methods not directly comparable. Another advantage of the doubleHelix approach over CSSR is its ability to identify stackings and non-canonical base-pairs with r ecall / pr ecision of 0.63 / 0.89 and 0.47 / 0.94, respecti v ely (Figure 2 B). This, howe v er, requires the use of a larger set of r ecurr ent RNA motifs with at least 100 instances in the RNA Bricks database that results in an increased computational cost. For example, for a tRNA model (76 nucleotides, PDB entry 1ehz) processing time on a standard laptop increases from 4 s in default configuration to 13 s. For a complete porcine 28S rRNA (3938 nucleotides, PDB entry 3j7q) these times increase from 1 to almost 8 h. This, howe v er, is needed only for the accurate assignment of non-canonical base-pairs in RNA models (e.g. as refinement restraints). By default, for the identification of Watson-Crick base-pairs the doubleHelix uses the smallest set of RNA r ecurr ent motifs providing maximum performance at a reasonable computational cost.

Base-type classifier benchmarks
The neural network residue-type classifiers used in this work were trained ignoring any base modifications in the structures. Howe v er, these modifications can effecti v ely alter the scattering properties of a base and introduce bias in the classification results. To examine the impact of the base modifications on the classification performance, I selected ten high-resolution models of ribosomes containing modified ribonucleotides, as described in Materials and Methods section.
The crystal structure models of ribosomes contained in total 45 784 standard ribonucleotides out of which 99% have been correctly classified. Among 358 modified bases the most frequent were pseudouridines (81) and all of them wer e corr ectly classified as p yrimidines. For other ribonu- cleotides with modified bases, 222 out of 234 (95%) were correctly classified as purines or pyrimidines. For the remaining modifica tions tha t do not af fect the base (e.g. O2'methyluridine), 39 out of 42 (93%) were classified correctly.
In the set of cryo-EM models, 98% out of 28 791 standard ribonucleotides, 297 out of 301 (99%) modifications not affecting bases, and all 294 pseudouridines were classified correctly. Among ribonucleotides with base modifications 89 out of 91 (98%) were classified correctly.
Overall, the performance of base-type classifiers for both cryo-EM and MX agrees with the estimates from the training procedure. This includes modified bases, e v en though gi v en high-resolution of the maps, the modifications are usually resolved in the density. It can be expected that at lower resolutions the effect of base-modifications will be also negligible as modifications are not resolved in the maps an ywa y.

Sequence identification in cryo-EM
For the identification of a nucleic acid model the doubleHelix program finds the most plausible sequence in a database gi v en nucleobase-type probabilities (purine or pyrimidine) estimated based on a backbone model and corresponding cryo-EM map. Secondary structur e r estraints, deri v ed directly from a backbone model, are used as an additional source of information. Both base-type probabilities and base-pairing information are used to query sequence databases using the INFERNAL suite as described in the Materials and Methods section. I observed that this approach allows for a sequence identification up to 4.5 Å local resolution for fragments of 50 amino acid residues ( Figure  3 A) when Covariance Models (CMs) and secondary structure information is used. The use of Hidden Markov Models (HMMs), which neglects base-pairing information, clearly reduces the method performance (Figure 3 A). The use of longer fragments of 100 r esidues, further incr eases the resolution limit of the method applicability up to 5.5 Å when the base-pairing information and CMs are used (Figure 3 A). Overall, the E -value provided by the INFERNAL suite is a r eliable measur e of the sequence identification r eliability. Ther e ar e, howe v er, a fe w model fragments for which the identified sequence has a relati v ely low sequence identity to the target, despite a reliable E -value score (Figure 3 B). This issue can be attributed to the reduction of the sequence identification problem to purines and pyrimidines only that may occasionally make different sequence fragments practically indistinguishable.

Sequence assignment in cryo-EM
It has been shown that a neural network-based assignment of protein model sequence can provide reliable results up to local map resolutions where de novo model building would be very challenging ( 50 ). Moreover, a carefully derived sequence assignment reliability score, the p-value, accurately separates correct fr om wr ong alignments independently of local map resolution. The assignment of polynucleotide sequences, howe v er, presents a particular challenge compared to proteins. First of all, nucleic acid models are often built de novo into lower resolution map regions, as double-helical fragments are excellent and universal models for map interpr etation. Mor eov er, e v en at moderate resolutions the target sequence is effecti v ely reduced to only two types of nucleobases (purines and pyrimidines) tha t grea tly increases sequence assignment ambiguity.
I observed that similarly to proteins the neural-network classifier implemented in doubleHelix provides a reliable means of assigning polynucleotide sequences to model fragments at local resolutions as low as 5 Å , e v en though the required fragment lengths are clearly longer (Figure 4 A). For a total number of 18 655 RNA test-fragments of 20 residues (see Materials and Methods for details), the assigned sequence matched the corresponding model in 83% (15 461) of cases. For longer fragments of 40 residues the number of correctly assigned sequences increases to 95% (17 383).  . Medians and 90% confidence intervals for sequence assignment p-value as a function of local resolution for RNA chain test-fragments of 20 and 40 nucleic acid r esidues. P anel ( A ) shows fragments with sequences matching r efer ence. Fragments for which assigned and r efer ence model sequences differ are presented in panel ( B ). The dashed line corresponds to a 99.5% one-sided confidence interval estimated for fragments of 20 nucleic acids with an assigned sequence different from the input-model sequence. Blue circles depict an outlier presented in the text (porcine 28S rRNA, PDB entry 3j7q). The plots' ordinate axes show -log(p-value) for the test-fragments; higher values correspond to lower p-values and more reliable sequence assignments.
The p-value, or the probability of observing a gi v en sequence assignment by chance, provides a reliable estimate of the alignment accuracy that doesn't depend on local map r esolution (Figur e 4 B). Indeed, a one-sided 99.5% confidence interval for fragments with sequence assignment that doesn't match the r efer ence (dashed line on Figur e 4 ) corr esponds to 62% and 17% of cases with matching sequences for fragments of 20 and 40 residues respecti v ely. Although sequence mismatches with a p-value outside this region are expected to be very rare, I observ ed se v er al such test-fr agments in the benchmark set (blue circles on Figure 4 B), all of them correspond to a single model of porcine 28S rRNA (PDB entry 3j7q). I will discuss this outlier in more detail in the next section.

Sequence assignment outlier: mammalian 28S rRNA
In the cryo-EM benchmark set, I identified se v eral clear outliers, w here RN A test-fragments were assigned reliable (low p-value) sequences different from the reference model (Figure 4 B). All the fragments originate from an expansion segment (ES7a) of a cryo-EM structure of porcine 28S ribosomal subunit determined at 3.4 Å resolution (PDB entry 3j7q). Closer inspection of the model re v ealed se v eral nucleobases poorly fitting the EM reconstruction, which is understandab le gi v en limited resolution, but no clearly visible sequence assignment issues. As there is no higher resolution structure for the porcine ribosome available, which could be used to validate the sequence register, I decided to use as a r efer ence the closest homologue from rabbit (98% of sequence identity), for which a structure determined at 3 Å Nucleic Acids Research, 2023, Vol. 51, No. 15 8263 resolution has been recently deposited (PDB entry 6r5q). For a detailed comparison I selected a fragment of the ES7a that has a strictly conserved sequence in both organisms according to an alignment generated using R-coffee ( 51 ). Structural alignment of the corresponding model fragments, howe v er, re v ealed multiple differences between corresponding nucleobase identities, se v eral of them resulting in base-pairing violations (Figure 5 C). I also observed that se v eral differences in sequence preserving secondary structur e ar e visible as clear density-fit outliers (Figure 5 A). The problem is easily solved by shifting the sequences of the two chain fragments by one residue, as suggested by the dou-bleHelix pro gram, w hich results in a perfect fit between the porcine and rabbit models (Figure 5 B).

RNA sequence identification and assignment in MX
A crystallo gra phic diffraction experiment provides only amplitudes of complex structure factors required for calculating electron-density maps. Missing phases need to be deri v ed from other sources, for example a tentati v e model of the unknown crystal structure from Molecular Replacement procedure. The use of model-deri v ed phases for calculating electron-density maps ine vitab ly results in so-called 'model bias' -the presence in a map of model features that are absent in a crystal structure. The same issue may be expected when a tentative crystal structure model polynucleotide sequence doesn't match an unknown crystal structure. Although the model bias is reduced in maximum likelihood ma ps ( 54 ), routinel y used for model building and refinement in MX, the problem is not completely eliminated.
To investigate how the model-bias issue affects the sequence identification and assignment procedures I benchmarked them using ribosome crystal structures with randomised rRNA sequences. I used two crystal structure models of Thermus thermophilus 30S ribosomal subunit determined at resolutions 2.8 Å (PDB entry 2uub) and 3.3 Å (PDB entry 6mpi). Additionally, to remove any effect related to the presence of protein chain models that were refined in the presence of the correct rRNA sequences all atomic coordinates were randomised with 0.2 Å RMSD. The resulting, randomized 30S models were refined using REFMAC5 and the PDB REDO w e bserv er. Interestingly, observ ed model bias in both randomised structures is moderate and clear sequence mismatches can be noticed for few (but not all) nucleobases (Figure 6 B and D ).
The sequence identification procedure was very effecti v e for the randomised 2.8 Å resolution model. Among 1000 continuous, randoml y selected rRN A fragments of 100 residues, doubleHelix identified a correct target sequence in 99% of cases. For shorter fragments of 50 residues this fraction reduces to 76%. At lower resolution (randomised 6mpi model at 3.3 Å resolution) the performance clearly reduces. The program provided a correct hit in 86% and 62% and of cases for test-fragments of 100 and 50 residues respecti v el y. Interestingl y, I also observed that the use of secondary structure information for the sequence identification clearly improves the method performance. Without base-pairing restraints the fraction of correctly identified sequences for fragments of 100 residues reduces to 90% and 72% for better and worse resolution structures. In all cases the incorrect hits can be easily filtered based on E-value returned by INFERNAL suite where values smaller than 0.1 guarantee a correct solution (Figure 7 A).
For sequence assignment the method correctly identified sequences of 46% and 65% of 1000 continuous rRNA fragments of 20 nucleic acid residues selected at random from randomised models at worse and better resolution respecti v ely. These numbers increase to 82% and 94% for longer fragments of 40 residues. The correct sequence assignments are also clear ly separ ated from incorrect ones by p-value with the 99.5% one-sided confidence interval for wrongly assigned fragments matching a value estimated for EM models (Figures 4 B and 7B ).

Case study: sequence register errors in crystal structure of the large ribosomal subunit of S. Aureus
The doubleHelix program has been integrated into a previously de v eloped tool called c hec kMySequence, w hich is used for validating sequence assignment in protein models . This integration enables a fully automated validation of sequence assignment for complete protein-nucleic acid comple xes. Benchmar ks of the updated c hec kMySequence method re v ealed a particularly inter esting structur e of the large ribosomal subunit from S. aureus . The program identified in this model plausible register shift errors in two ribosomal proteins (L18 and L4) and 5S rRNA. The structure was determined at 3.5 Å resolution (PDB entry 4wce, ( 55 )) and refined to R-work / R-free factor values of 0.202 / 0.246 and c lashscor e 11.
A detailed discussion of the register shift error correction in ribosomal proteins is out of scope of this work and will be presented only briefly. The protein chains were replaced with corresponding predictions from AlphaFold2 database (release 4 for UniProt entries Q2FW07 and Q2FW22 for L4 and L18 respecti v el y) subsequentl y refined using COO T in real space with self-restrains generated at 5 Å cut-off. Both pr edicted models wer e assigned a very high confidence score (pLDDT > 90), which was observed to usually correspond to minor differences in loop and side-chain conformations compared to reliab le e xperimental models ( 56 ). Comparison of the deposited and corrected protein chains of both proteins re v ealed multiple plausib le tracing issues and confirmed register-shifts suggested by c hec kMySequence .
The target sequence of the 5S rRNA chain consists of 114 residues that were all traced in the map (chain Y in the deposited model). The c hec kMySequence scan rev ealed an alternati v e sequence assignment with p-value of 0.01 (very reliable according to Figure 7 B) for a chain fragment following residue 81. The alternati v e sequence corresponds to a register shift by one base which suggests that a residue could have been omitted in the deposited model (deletion). Although closer inspection of the crystal structure re v ealed se v eral clear density outliers for bases (Figure 8 A), there were no signs of tracing issues. To confirm the chain sequence correctness, I performed sequence identification using doubleHelix with the deposited 5S chain coordinates, corresponding maximum likelihood 2mFo-DFc map, and a set of RNA sequences for S. aureus strain NCTC8325 downloaded from NCBI. At the time of writing (25.11.2022) there were two assemblies available (GCF 900475245 and GCF 900475245) each containing roughly 100 tRNA, rRNA, and ncRNA sequences.
In the RNA sequence sets doubleHelix identified two, different 5S genes (96% sequence identity), both very reliable hits with E -value below 1e-20 (see Figure 7 A). One of the sequences, howe v er, scored visib ly better that the other (7e-25 versus 4e-21 for NC 007795.1 rrna 7 and NC 007795.1 rrna 6 respecti v el y), w hich has been previously shown to be a good evidence for a better fit to the data for protein models ( 28 ). The deposited 5S model was assigned the sequence variant with worse E-value, which resulted in the map-model fit outliers mentioned above (Figure 8 A). The model, after re-assigning the 5S sequence variant identified with better E-value and subsequent restrained refinement with REFMAC5 and PDB REDO shows much better fit to the map (Figure 8 B). The new base identities are also more favourable for the formation of canonical base-pairs in the model (Figure 8 C). The fi- Figure 6. Corresponding fragments of Thermus thermophilus 28S ribosomal subunit crystal structures used for the sequence identification and assignment benchmarks. Two models with randomised rRNA sequence were generated based on crystal structures determined at a resolution of 2.8 Å (PDB entry 2uub, panels A and B ) and 3.3 Å (PDB entry 6mpi, panels C and D ). The panels depict residue range 1262-1273 with corresponding combined 2mFo-DFc (grey) and difference mFo-DFc (red / green) maximum likelihood maps contoured at 2 and 3 le v els respecti v ely; as deposited (A, C) and after randomising a tomic coordina tes and muta ting 90% of nucleobases (B, D, see Materials and Methods for details). Both deposited and randomised structures were automatically refined using the PDB REDO w e b server. nal model of complete ribosome, after correcting tracing issues in the two protein chains (L4 and L18) and the 5S chain sequence refines with clearly better scores; Rwork / R-fr ee r educes to 0.195 / 0.237 (from 0.202 / 0.247) and clashscore to 5 (from 11). The additional base-pairs, visually better map-to-model fit and reduced global modelquality scores together provide good evidence of improved agreement of the corrected model with the experimental data.

CONCLUSIONS
Sequence assignment is a key step of macromolecular model building that may lead to difficult to identify errors affecting structur e interpr etation. Ne v ertheless, it has been shown that protein models deposited in the PDB, despite e xpensi v e model validation efforts, contain many registershift errors ( 16 , 20 , 57 ). Validation and assignment of nucleic acid sequences presents a particular challenge compared to proteins; the models are usually poorly resolved in cryo-EM and MX maps and available sequence-information is in practice limited to two nucleobase-types. Moreover, validation of nucleic acid models addresses predominantly backbone conforma tions tha t ar e rar ely affected by nucleobasesequence assignment issues. In consequence, the reliability of availab le, e xperimentally determined nucleic acid models is very difficult to assess. This may result in unintended error propagation, where a newly deposited model contains an error inherited from an earlier one used for model building. Errors can also detrimentally affect efforts of bioinformaticians working on large-scale structural analyses or structure prediction methods. Performance of ( A ) sequence-identification and ( B ) sequence-assignment for fragments selected at random from two MX ribosomal RNA models determined at a resolution of 2.8 Å (PDB entry 2uub) and 3.3 Å (PDB entry 6mpi). The random fragments used for sequence identification and assignment had 50 and 20 nucleic acids, respecti v ely. The continuous curves are logistic regression estimates of a probability that an identified sequence will have at least 90% sequence identity to the target sequence.
Her e, I pr esented doubleHelix ; a new program for a comprehensi v e assignment, identification, and validation of nucleic acid sequences in cryo-EM and MX. I show that the a pproach, w hich relies on a neural-network base-type classifier, can successfully identify and assign sequences of cryo-EM model fragments at local resolution as low as 5 Å . I also show that base-pairing information, deri v ed from backbone-geometry, clearly improves the program's performance at lower resolutions but is not essential. This is particularly important for large nucleic acid structures with very low content of paired bases; for example the trypanosomal mitochondrial ribosome ( 58 ).
Nucleic acid model building in cryo-EM and MX usually begins with tracing a ribose-phosphate backbone, which is then assigned to a target sequence. Although the former step can be addressed by se v eral fully automated and interacti v e tools, the sequence assignment remains an open issue. Popular crystal structure building programs and modern, AI-based EM model building tools (e.g. DeepTracer or ModelAngelo) usually produce intermediate models that need to be completed and partially assigned to the target sequence using interacti v e software ( 14 , 15 ). The doubleHelix pr ogram should pr ove useful in this model building step. It can be used for assigning nucleic acid model fragments to the target sequence after each round of model rebuilding (and refinement in case of MX). It can help a user identify model connectivity and make decisions about consecutive modelling steps. Particularly important for this purpose is the sequence-assignment score provided by doubleHelix (pvalue) that can help assessing local correctness of intermediate models.
The doubleHelix software can be also used for generating base-pairing restraints ready-to-use with the most popular cryo-EM and MX model-building and refinement tools (REFMAC5, PHENIX, COOT, ISOLDE). I have shown that the approach can successfully identify 91% of canonical base-pairs with 98% precision without relying on base conformation or identity. This may be particularly useful in early stages of the model building process as available base-pairing restr aints gener ation a pproaches are strictl y dependent on base-identities and the detection of hydrogenbonding patterns that r equir es accurate relati v e positioning of paired-bases (LibG ( 59 ), Phenix suite ( 60 )). An exception here is a pipeline implemented in PDB REDO ( 61 ) relying on base-pair assignment by DSSR suite, which is by design less sensiti v e to the structure distortions ( 62 ).
The presented doubleHelix benchmarks re v ealed a plausible sequence-register error in an expansion segment ES7a of a mammalian ribosome model deposited in the PDB (PDB entry 3j7q / EMDB entry EMD-2650). Expansion segments (ES) ar e pr esent only in eukaryotic ribosomes and exhibit a surprising le v el of variability between different organisms. Ne v ertheless, the function of ES remains poorly understood, which makes them an important r esear ch target ( 63 ). Ribosomes are usuall y highl y conserved across all kingdoms of li v e and ribosome models already available in the PDB can be often used to greatly simplify model building and refinement process of newly determined structures. The high variability of the ES at a structural and sequence le v el makes them one of the few rRNA regions that require in-depth modelling. Not surprisingly, this results in sporadic errors, which may hinder efforts aimed at understanding ES function. The problem is even more important for newly determined nucleic-acid complex structures for which at least partial models that could be used for model building are not available in the PDB. This makes the doubleHelix progr am particular ly useful in structure determination using cryo-EM and MX as a reliable tool for nucleic-acid model sequence assignment and validation.
To facilitate the use of doubleHelix for model validation, it has been incorporated into a previously released, opensource sequence-assignment validation tool c hec kMySequence that can now process complete protein-nucleic acid complexes. The method enables a conceptually simple and fully automated detection of the most common sequence assignment issues in models of proteins, RNA, DNA, and their complexes that include register-shifts, sequence mismatches (single-residue differences between model and After re-assigning the model to a sequence of different 5S gene variant and subsequent r estrained r efinement in REFMAC5 and PDB REDO map-model fit clearly improves ( B ). The new sequence also clearly improves base-pairing pattern of the model ( C ). For clarity, only a sequence fragment including the stem loop depicted in panels (A, B) is shown (highlighted with a red box). Most sequence mismatches between the two 5S sequence variants can be corrected by introducing a gap to the alignment (shown as a grey box) that corresponds to a single base register-shift identified by c hec kMySequence . Maximum likelihood combined 2mFo-DFc and difference mFo-DFc maps on panels (A, B) are contoured at 2 and 3 le v el, respecti v ely. Secondary structur e pr esented on panel (C) was identified in the corrected model using ClaRNA. The figure was pr epar ed using Pymol ( 31 ) and R-chie w e bserver ( 53 ). target sequence), problems with residue numbering (e.g. continuous residue numbering ignoring parts of a model that were not traced). With an example of bacterial ribosome crystal structure, I show that the c hec kMySequence can successfully identify errors in both protein and nucleic acid components of complex structur es, r esulting from model tracing issues and errors in r efer ence sequences that would be otherwise very difficult to identify. Owing to its performance and full automation, c hec kMySequence is a pplicable to the analysis of very large models. For example, validation of a complete cryo-EM structures of 80S ribosome at 3.4 Å resolution discussed in this work, with 48 chains and over 11 000 protein and nucleic acid residues takes less than six minutes on a laptop.

DA T A A V AILABILITY
The doubleHelix program source code and installation instructions are available at https://gitlab.com/gchojnowski/ doub lehelix . The doub leHelix sequence assignment algorithm has been added to the c hec kMySequence pro gram to enab le comprehensi v e validation of sequence assignment in the models of protein-nucleic acid complexes. The updated method is available at https://gitlab.com/ gchojno wski/checkm ysequence . Corrected models of mammalian and bacterial ribosome structures presented in this work are available at https://doi.org/10.5281/zenodo. 7650444 .