The Nucleic Acid Database: new features and capabilities

The Nucleic Acid Database (NDB) (http://ndbserver.rutgers.edu) is a web portal providing access to information about 3D nucleic acid structures and their complexes. In addition to primary data, the NDB contains derived geometric data, classifications of structures and motifs, standards for describing nucleic acid features, as well as tools and software for the analysis of nucleic acids. A variety of search capabilities are available, as are many different types of reports. This article describes the recent redesign of the NDB Web site with special emphasis on new RNA-derived data and annotations and their implementation and integration into the search capabilities.


INTRODUCTION
The Nucleic Acid Database (NDB) was founded in 1991 to assemble and distribute structural information about nucleic acids (1). In addition to the primary structural data that are contained in the archival Protein Data Bank (PDB) (2), the NDB contains annotations specific to nucleic acid structure and function, as well as tools that enable users to search, download, analyze and learn more about nucleic acids. NDB is thus a value-added database providing services specifically for the nucleic acid community.
When the NDB was first established, the focus was on DNA structural biology. As more RNA structures have been determined (Figure 1), tools and annotations were developed to address the features of these molecules.
The NDB seeks to be a central source for nucleic acid structural information and annotations that evolves with the science. In this article we describe the recent redesign of the NDB Web site with special emphasis on new RNA-derived data and annotations and their implementation and integration into the search capabilities.

NDB ACCESS
All available NDB resources can be accessed through two persistent headers available on top of all the pages in the Web site. The first persistent header seen in gray in Figure 2 consists of six tabs: About NDB, Standards, Education, Tools, Software and Download.

About NDB
Information about the project including a site map.

Standards
Information about the standard reference frame for the description of nucleic acid base pair geometry (3); ideal geometries for bases and sugars (4,5); DNA/RNA topology and parameter files for refinement of structures (6); mmCIF resources (7); PDBML resources (8); and a link to the RNA ontology consortium (9,10).

Education
Introduction to nucleic acids; definitions of terms used in the Web site; nucleic acid-related features from PDB-101, an educational component of RCSB PDB (11); and links to other educational activities and sites.

Tools
Recently added features include the RNA 3D Motif Atlas (12); nonredundant (NR) lists of RNA-containing 3D structures (13); the RNA Base Triple Atlas (14); a tool to perform nucleotide-to-nucleotide alignment of two RNA 3D structures (R3D Align) (15,16); and a server for finding, aligning and analyzing recurrent RNA 3D motifs (WebFR3D) (17). These tools are also highlighted in the 'Featured Tools' section of the homepage. Other NDB tools include a secondary structure similarity search (QPROF) (18); an RNA 2D structure viewer (RNA View) (19); and an option for the analysis and visualization of nucleic acid structures (w3DNA) (20). Links to a number of other external resources for both RNA and DNA are provided.

Download
The coordinate and experimental files for all the structures in NDB are available under the 'download' tab. A mapping of PDB ID to NDB ID for released entries is also available.
The second persistent header shown in red in Figure 2 includes the search options (simple search, advanced search and ID search) that are described in a subsequent section.

NDB DATA CONTENT
The NDB contains primary structural information about nucleic acid containing structures obtained from the PDB as well as classifications and derived data. Manually annotated nucleic acid classifications as well as derived and calculated data regarding structural features of RNA are managed separately from the primary structure entries; these data are recorded and stored as external reference files (ERFs).

Primary structural information
The primary data obtained from the corresponding PDB structure entries include experimental files, identifiers, structural descriptions, citations, crystal data, coordinate information and details regarding crystallization, data collection and structure refinement (Table 1).

Nucleic acid classifications
Annotations specific to nucleic acids and the molecules to which they are bound are provided (Table 2). Nucleic acid annotations include nucleic acid type and conformation, structure description and secondary structure. Some functional information about the bound proteins as well as drug binding modes is also offered.

Derived data
Derived structural features such as bond distances, angles, torsions and base morphology (20,27) are calculated from the coordinate data and stored in the searchable database (Table 3).
We have recently added derived information on RNA structural features including pairwise nucleotide interactions for each RNA structure, equivalence classes and NR sets of RNA structure files and RNA 3D motifs extracted from structures.

RNA pairwise interactions
Pairwise interactions between RNA nucleotides are annotated using FR3D (21) for RNA base pairing and base stacking interactions and as described in (28) for base phosphate interactions. These annotations form the basis for the RNA 3D Motif Atlas and the RNA Base Triple Atlas. In addition, we provide statistics on pairwise interaction frequencies. These data may prove useful to modelers and other computational scientists interested in determining characteristics of structured RNAs.
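The interaction frequency statistics mentioned above can be illustrated with a minimal sketch. It assumes that the relative frequency of an interaction type is its count divided by the total number of annotated interactions of the same kind (here, base pairs) in one structure; the function name and the sample annotations are hypothetical and do not come from the NDB code base.

```python
from collections import Counter


def relative_frequencies(annotations):
    """Return {interaction_type: (count, fraction_of_total)} for one structure.

    `annotations` is a list of interaction labels such as "cWW" or "tHS".
    The fraction is taken over all labels in the list (an assumption).
    """
    counts = Counter(annotations)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: (n, n / total) for kind, n in counts.items()}


# Hypothetical base-pair annotations for a small structure
pairs = ["cWW", "cWW", "cWW", "tHS", "cWH", "cWW", "tHS", "cWW"]
freqs = relative_frequencies(pairs)

for kind, (count, fraction) in sorted(freqs.items()):
    print(f"{kind}: {count} ({fraction:.1%})")
```

The same function could be applied separately to base stacking or base phosphate labels to obtain per-category frequencies.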
Equivalence classes and NR sets
RNA-containing entries are grouped into 'equivalence classes' of structures that share the same, or nearly the same, sequence and geometry, as described in (13). These equivalence classes are computed every week so that new additions to the 3D structure database are quickly reflected. Generally, different structures of the same RNA from the same organism appear in the same equivalence class, while structures of homologous RNAs from different organisms appear in distinct classes. For example, NR_4.0_00834.10 is the accession number for the equivalence class of Escherichia coli large subunit (LSU) ribosome structures, as of 20 July 2013. This equivalence class has 74 members. When gathering statistics across many RNA structures, it is not appropriate to include all 74 E. coli LSU structures as if they provide independent data points. Therefore, one structure with the largest number of FR3D-annotated base pairs per nucleotide is chosen to represent this equivalence class. A NR set of RNA 3D structures results from using the representative structure from each equivalence class. The NDB home page provides links to lists of equivalence classes and the structures contained in the current NR set. NDB search functions allow the user to limit results to include only one structure from each equivalence class.
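The selection of a representative structure can be sketched as follows. This is an illustration of the stated criterion only (largest number of annotated base pairs per nucleotide within each equivalence class), not the NDB or BGSU implementation; the `Structure` record, the identifiers and the counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Structure:
    pdb_id: str             # hypothetical identifier
    equivalence_class: str  # hypothetical class label
    base_pairs: int         # number of annotated base pairs
    nucleotides: int        # number of nucleotides in the structure

    @property
    def pairs_per_nucleotide(self) -> float:
        return self.base_pairs / self.nucleotides if self.nucleotides else 0.0


def nonredundant_set(structures):
    """Map each equivalence class to its representative structure.

    The representative is the member with the highest base pairs per
    nucleotide; on a tie, the member seen first is kept (an assumption).
    """
    best = {}
    for s in structures:
        current = best.get(s.equivalence_class)
        if current is None or s.pairs_per_nucleotide > current.pairs_per_nucleotide:
            best[s.equivalence_class] = s
    return best


# Hypothetical entries: two members of one class, one member of another
entries = [
    Structure("AAA1", "class_A", 1200, 2900),
    Structure("AAA2", "class_A", 1350, 2900),
    Structure("BBB1", "class_B", 30, 76),
]
nr = nonredundant_set(entries)
```

Statistics gathered over `nr.values()` then count each equivalence class once, rather than once per deposited structure.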

RNA 3D Motif Atlas
The RNA 3D Motif Atlas, linked to from the NDB home page, is an organized collection of internal and hairpin loops extracted from the NR set of 3D structures (12). Individual motif instances are organized into motif groups, containing all instances that share the same pattern of base pairing interactions and overall geometry. The Atlas is updated automatically every 4 weeks.

SEARCH CAPABILITIES
The NDB data flow is depicted in Figure 3. To facilitate search and reporting functions, the NDB stores primary structural data, classification data and derived data in a relational database. Three search options are available from the second persistent header (Figure 2): ID search, search and advanced search. These search options can also be accessed in the 'Search Structures' section of the homepage.

ID Search
The ID search accepts either an NDB ID or PDB ID as input and the result is the individual Structure Summary for the entered structure.

Search
Search is available separately for DNA and RNA structures. Users can create a set of entries (results) by constraining particular predefined data attributes in several categories. Following a preliminary search, users can add selection constraints from additional categories to further narrow their searches.

RNA search
The polymer, protein function and experimental method selection categories present options similar to those provided for DNA searches. Additional options are provided for RNA to restrict search results to the representative structures belonging to the NR data set. The RNA-specific selection categories include the following:
• RNA type: Search various RNA functional types, such as tRNAs, rRNAs, riboswitches or ribozymes.
• NR list: Restrict the RNA search to the representative members of equivalence classes that constitute the NR RNA structure set. This filtering dramatically reduces the number of structures returned by the search without diminishing the range of molecules represented and provides lists of structures that are more suitable for statistical analyses. Putting only the nonredundancy constraint on the query will result in a 'nonredundant list' of the best modeled structures (in terms of the number of base pairs per nucleotide) at the specified resolution threshold.

Advanced search
The Advanced Search allows users to compose queries combining multiple selection constraints using logical operators. Selection constraints are organized into the following categories: structure content, experimental information, experimental details, citation, RNA 3D interactions, RNA 3D motifs, sequence, nucleic acid modifications, protein binding type and nucleic acid conformation type.
• Structure content: Restrict searches based on the presence or absence of a type of molecule (DNA, RNA, protein, hybrid molecules and drugs).
• Experimental information: Restrict searches by experimental method (X-ray/NMR) and by the availability/nonavailability of experimental files.
• Experimental details: Restrict searches by user-provided values for crystal cell dimensions and space group.
• Citation: Restrict searches to specific authors, publication years or PDB/NDB ID.
• RNA 3D interactions: Define searches according to the presence and relative frequencies of any of the base pair, base phosphate and base stacking interactions. The relative frequencies of interactions are calculated with respect to the total number of interactions of that type occurring in the structure.
• RNA 3D motifs: Restrict searches to structures that contain certain named RNA motifs.
• Sequence: Restrict searches to a specific nucleotide sequence pattern present in the structure and a range of overall sequence length.
• Nucleic acid modifications: Constrain searches based on the presence or absence of chemical modifications in bases, sugars or phosphate.
• Binding type: Restrict searches of protein complexes according to type of protein, protein function and type of nucleic acid to which it binds.
• Nucleic acid structural conformation type: Narrow searches according to presence of structural features such as bulges, three-way junction, non-Watson-Crick base pairing along with strand description and conformation type.
For each of the search criteria, the options to explicitly select, deselect and ignore are available as Y, N and ignore, respectively, with the default being ignore. When combining two or more search queries, logical operators AND (to restrict results) or OR (to combine results) are available with the default being AND. For example, to get all NMR structures that have base modifications, select NMR AND base modification by clicking 'yes' next to each of them and choosing the logical operator AND. This search returns only those structures that satisfy both criteria. The results of each search appear in a new window and include NDB ID, PDB ID, title, authors, initial deposition and release dates, and links for further information.
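The select/deselect/ignore options and the AND/OR operators can be modeled in a few lines. The sketch below is illustrative only and assumes each structure is described by a set of feature labels; the function name, the feature labels and the three sample entries are hypothetical and are not taken from the NDB query code.

```python
def matches(features, constraints, operator="AND"):
    """Test one structure against a set of tri-state constraints.

    features:    set of feature labels present in the structure
    constraints: dict mapping a feature label to "Y", "N" or "ignore"
    operator:    "AND" (all active constraints must hold) or "OR" (any)
    """
    if operator not in ("AND", "OR"):
        raise ValueError(f"unknown operator: {operator!r}")
    outcomes = []
    for feature, choice in constraints.items():
        if choice == "ignore":
            continue  # the default: this criterion plays no role
        if choice not in ("Y", "N"):
            raise ValueError(f"unknown choice: {choice!r}")
        present = feature in features
        outcomes.append(present if choice == "Y" else not present)
    if not outcomes:
        return True  # no active constraints, so nothing is excluded
    return all(outcomes) if operator == "AND" else any(outcomes)


# Hypothetical entries and their features
db = {
    "S1": {"NMR", "base modification"},
    "S2": {"NMR"},
    "S3": {"X-ray", "base modification"},
}
query = {"NMR": "Y", "base modification": "Y", "X-ray": "ignore"}

and_hits = sorted(k for k, f in db.items() if matches(f, query, "AND"))
or_hits = sorted(k for k, f in db.items() if matches(f, query, "OR"))
```

With AND, only the entry that is both an NMR structure and carries a base modification is returned; with OR, any entry satisfying either criterion is returned.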

REPORTING CAPABILITIES
The results of a structure selection search are presented as a structure selection report, an image gallery, or as one of a set of predefined feature reports. A detailed structure summary report is available for every structure.

Structure selection report
The structures selected by a search are displayed in a tabular report containing the essential features for the structure selection, including the title, authors, citation and release date of the structure, the type of experiment, type of structure, a link to the equivalence class to which it belongs and the representative structure of that class, as well as a structure image. The structure selection report is also available as a gallery of structure images with their accession codes. In both the gallery and summary reports, the structure accession code provides a link to a more detailed structure summary report for each selected structure.

Structure summary report
Each structure in NDB has its own individual summary page containing information relevant to that structure. Data are presented in the structure summary page in four sections: (a) primary structure information, (b) downloads, (c) derived structural data and (d) images (Figure 2).
The main (primary structure information) section (Figure 2 Section a) holds the entry title, sequence, citation, experimental details, refinement information and various structural descriptions. The atomic coordinates (asymmetric and biological unit files), structure factors and NMR restraint information are available in the 'Download Data' section. The 'Structural Features' panel in the upper right of the window (Figure 2 Section c) presents links to derived information including hydrogen bonding, torsions and base morphology, and step parameters. Below the Structural Features panel, the contents of the asymmetric unit or the biological assembly model are shown as a 3D image (Figure 2 Section d). For RNA entries, an RNA View image showing 2D base pairing is also provided (Figure 2 Section e). Additional images of biological assemblies, crystal packing and ensemble images are available under the 'more images' link.
For RNA-containing structures, many additional 'Structural Features' are now available. For those NDB structures that are the representative of their equivalence class, the RNA 3D Motifs page lists all internal and hairpin loops in the structure, and links to the corresponding entries in the RNA 3D Motif Atlas (12). The base pair signature of the motif is provided, and, when available, the common name of the motif (Figure 4).
The 'Structural Features' section also links to annotations of RNA base pairs, base stacking, and base phosphate interactions in a tabular form, as annotated by the FR3D program suite (21,28). Because some structures are large, the list of interactions can be filtered to view only interactions of a given type, for example, cis-Watson-Crick/Hoogsteen (cWH) or trans-Hoogsteen/Sugar Edge (tHS) base pairs. At the bottom of the pairwise interaction page is a summary of the counts and relative frequencies of the different types of interactions. In the pairwise interaction list, each RNA nucleotide is identified by a unique unit ID; this is a text string constructed from the PDB ID, the model number, the chain, the RNA base or amino acid and the residue number. For example, the annotation 1S72|1|9|U|99 1S72|1|9|G|83 cWW unambiguously refers to the GU cWW base pair made between two nucleotides in chain 9 of model 1 of PDB file 1S72. Unit IDs provide a way to uniquely and unambiguously refer to any unit in any structure, a need identified by the RNA Ontology Consortium (9,10). By clicking the 'Similar Structures' link in the 'Structural Features' panel, one can reach new pages listing structures belonging to the same equivalence class.
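A unit ID of the kind described above can be split into its fields with a short parser. The sketch assumes the five fields (PDB ID, model number, chain, residue name, residue number) are separated by a vertical bar, as in 1S72|1|9|U|99, and that an annotation line holds two unit IDs followed by an interaction label; the class and function names are hypothetical, and any further optional fields a full unit ID might carry are ignored here.

```python
from typing import NamedTuple


class UnitID(NamedTuple):
    pdb_id: str
    model: int
    chain: str
    residue: str   # RNA base or amino acid name
    number: int    # residue number


def parse_unit_id(text, sep="|"):
    """Split a unit ID string into its first five fields."""
    fields = text.split(sep)
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}: {text!r}")
    pdb_id, model, chain, residue, number = fields[:5]
    return UnitID(pdb_id, int(model), chain, residue, int(number))


def parse_annotation(line):
    """Parse '<unit ID> <unit ID> <interaction type>' into a 3-tuple."""
    first, second, kind = line.split()
    return parse_unit_id(first), parse_unit_id(second), kind


a, b, kind = parse_annotation("1S72|1|9|U|99 1S72|1|9|G|83 cWW")
print(a.chain, a.residue, a.number, b.residue, b.number, kind)
```

Note that the chain is kept as a string, since chain identifiers such as "9" are labels rather than numbers.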
The 'Structural Features' section also provides a link out to interactive visualizations of the base pairs in RNA structures, in the form of RNA circle diagrams (29). The interactions are displayed as clickable arcs colored by base pair type, with all nucleotides in the structure arranged around a circle. Moreover, for certain structures, a more conventional secondary structure diagram is available. In either case, the user can select base pairs to display by type, mouse over the interaction arcs to see the participating nucleotides, and select pairs or regions to visualize in 3D in an adjoining Jmol window. Finally, links are provided in the 'Structural Features' section to facilitate WebFR3D searches within the current structure.

Featured reports
Featured reports are available for the result set of any advanced search query. A predefined set of reports is provided: NDB status, cell dimensions, citation, refinement data, backbone torsion, base pair and base step parameters, descriptor, sequence and RNA motifs. The NDB status report, containing NDB and PDB IDs, structure title, authors, and deposition and release dates, is the default report for any advanced search query. The content of each of these reports is described in Table 4. All these reports can be exported as spreadsheets for further analysis.

DATA RETRIEVAL
The atomic coordinates for the asymmetric unit and for all biological assemblies are available in the 'downloads' section of the corresponding NDB structure summary report and are provided in PDB as well as mmCIF formats. The asymmetric unit coordinates are also available in PDBML (XML) format for the complete file or for the header and the coordinate sections separately. Structure factor data are provided in mmCIF format, and NMR restraint information is available in the deposited program format. All these files are updated weekly during the database update and are downloadable from the NDB ftp server at ftp://ndbserver.rutgers.edu/NDB/.

Web service framework
In the new NDB site, we have transitioned from a reliance on pregenerated HTML pages to dynamically generated page content. Each page has been partitioned to load content sections on demand using AJAX protocol web services. A framework supporting REST style web service queries has been created to support dynamic content AJAX functionality. For instance, within the structure summary pages, RNA Motif and interaction classification statistics are obtained from asynchronous database queries, as these are requested by users viewing the summary page. Search summary and structure browsing pages are similarly generated using this dynamic protocol. The new framework has been implemented using Python language middleware and Apache web server FastCGI protocol request handling. Web pages rendered in HTML take full advantage of CSS and JavaScript.
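The on-demand loading pattern described above can be illustrated with a small REST-style handler that returns one content section as JSON. This is a sketch under several assumptions: a plain WSGI callable stands in for the FastCGI request handling, an in-memory dictionary stands in for the database, and the URL layout, the structure ID and the statistics are all hypothetical rather than the actual NDB endpoints or data.

```python
import json

# Hypothetical stand-in for a database query result, keyed by structure ID
FAKE_DB = {"DEMO1": {"internal_loops": 3, "hairpin_loops": 2}}


def app(environ, start_response):
    """Serve GET /structure/<id>/motif_stats as JSON (hypothetical route)."""
    parts = environ.get("PATH_INFO", "").strip("/").split("/")
    record = None
    if len(parts) == 3 and parts[0] == "structure" and parts[2] == "motif_stats":
        record = FAKE_DB.get(parts[1].upper())
    if record is None:
        status, payload = "404 Not Found", {"error": "not found"}
    else:
        status, payload = "200 OK", record
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def call(path):
    """Invoke the handler directly, as a page's AJAX request would via the server."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app({"PATH_INFO": path, "REQUEST_METHOD": "GET"}, start_response))
    return captured["status"], json.loads(body)


status, stats = call("/structure/demo1/motif_stats")
```

A page built this way can request each section only when the user views it, so a large summary page does not have to be generated in full up front.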

Database server infrastructure
In the new release of the NDB Web site, the proprietary relational database engine IBM DB2 has been replaced with the open source MySQL database engine. A new Python language middleware has been developed to support NDB loading operations, query construction and report generation using the MySQL storage engine.
Moving to the MySQL server has dramatically improved the portability of the NDB site and simplified database maintenance and administration. The new database infrastructure facilitates site replication and synchronization, allowing us to support production, beta and development instances. This capability has enabled rapid implementation and testing of new project functionality.

Pipeline for RNA structure annotations
The pipeline to exchange primary structure data between PDB and NDB has been in place for many years. A new pipeline has been developed for the regular exchange of additional RNA 3D structural annotations created by the RNA group at Bowling Green State University (BGSU); moreover, this pipeline can be used for data exchange from other research groups as needed in the future. The annotations from BGSU include the assignment of NR representative RNA structure and associated equivalence classes, RNA 3D Motif assignments and RNA base pairing, base phosphate and base stacking interactions. These new RNA annotations are added to the NDB database as part of each weekly update.

Portability, maintenance and administration
The source code supporting the redesigned NDB web resource has been organized into a file system that allows all of the source components of the site to be managed by the revision control system, Subversion (http://subversion.tigris.org/). The use of Subversion has simplified the management of the code base, enabling rapid deployment, synchronization and simultaneous development by multiple programmers. This in turn has dramatically improved the portability and simplified the administration of NDB applications on multiple servers.

CONCLUSIONS
The recent redesign of the NDB highlights the improvements made in data content including annotations and derived data and their presentation. The entire Web site has been revamped to improve the query and reporting capabilities. Annotations of RNA-containing structures have been expanded significantly. The site is structured to facilitate the addition of more search options, annotations, visualizations and reports about nucleic acid containing structures in the future.