The Protein–RNA Interface Database (PRIDB) is a comprehensive database of protein–RNA interfaces extracted from complexes in the Protein Data Bank (PDB). It is designed to facilitate detailed analyses of individual protein–RNA complexes and their interfaces, in addition to automated generation of user-defined data sets of protein–RNA interfaces for statistical analyses and machine learning applications. For any chosen PDB complex or list of complexes, PRIDB rapidly displays interfacial amino acids and ribonucleotides within the primary sequences of the interacting protein and RNA chains. PRIDB also identifies ProSite motifs in protein chains and FR3D motifs in RNA chains and provides links to these external databases, as well as to structure files in the PDB. An integrated JMol applet is provided for visualization of interacting atoms and residues in the context of the 3D complex structures. The current version of PRIDB contains structural information regarding 926 protein–RNA complexes available in the PDB (as of 10 October 2010). Atomic- and residue-level contact information for the entire data set can be downloaded in a simple machine-readable format. Also, several non-redundant benchmark data sets of protein–RNA complexes are provided. The PRIDB database is freely available online at http://bindr.gdcb.iastate.edu/PRIDB .
Protein–RNA interactions play critical roles in myriad and diverse biological processes, including many recently discovered regulatory functions, in addition to well-studied roles in protein synthesis, DNA replication, regulation of gene expression and defense against pathogens ( 1–9 ). Despite their importance, structures of protein–RNA complexes have proven difficult to obtain using experimental structure determination methods; such structures constitute only ∼1% of structures in the Protein Data Bank (PDB) ( 10 ). For this reason, several computational methods for predicting the interfaces in protein–RNA complexes have been developed ( 11–21 ). Virtually all such methods require data in the form of information about structurally characterized protein–RNA complexes and their interfaces.
PRIDB is a repository of protein–RNA interface information derived from structures in the PDB. PRIDB is designed to facilitate detailed analyses of individual protein–RNA complexes of interest and rapid identification of interfacial atoms and residues in both the protein and RNA chains of a chosen complex or user-defined set of complexes. In addition, PRIDB can be used to generate data sets of protein–RNA interfaces for machine learning applications, such as the generation of classifiers for predicting interfaces in protein–RNA complexes for which high-resolution structures are not available.
To our knowledge, only one other up-to-date and comprehensive online repository of protein–RNA interfaces is currently available: Biological Interaction Database for Protein-Nucleic Acid (BIPA) ( 22 ). BIPA provides a list of protein–RNA (and protein–DNA) complexes from the PDB and displays RNA-binding residues within the linear primary sequence of a chosen protein, or within a multiple sequence alignment of related RNA-binding proteins. PRIDB complements BIPA by providing atomic- and residue-level interfacial information for both the RNA and protein chains of complexes, providing previously published reduced-redundancy data sets and allowing users to make advanced queries and compile custom data sets. Other collections of protein–RNA complexes and related resources include NDB ( http://ndbserver.rutgers.edu/ ) ( 23 ), PRID ( http://www-bioc.rice.edu/∼shamoo/prid.html ) ( 24 ), RsiteDB ( http://bioinfo3d.cs.tau.ac.il/RsiteDB/ ) ( 25 ), w3DNA ( http://w3dna.rutgers.edu/ ) ( 26 ), NPIDB ( http://monkey.belozersky.msu.ru/NPIDB ) ( 27 ), ProNIT ( http://gibk26.bse.kyutech.ac.jp/jouhou/pronit/pronit.html ) ( 28 ) and the RNP Databases ( http://rnp.uthct.edu/index.html/ ). Several excellent databases of protein–DNA interfaces are also available, including PDIdb ( http://melolab.org/pdidb/ ) ( 29 ) and hPDI ( http://bioinfo.wilmer.jhu.edu/PDI/ ).
Data extraction, interface definition and motif identification
Atomic coordinate information for all 926 protein–RNA complexes in the Protein Data Bank (PDB) on 10 October 2010 was extracted using the REST API advanced search interface. To generate this comprehensive data set (rRB926), no filters based on sequence redundancy, structure resolution or other criteria were applied (see ‘Non-redundant Benchmark data sets’ below). The complex structures in rRB926 were then scanned to identify interacting amino acids and ribonucleotides using two different definitions: (i) a simple distance-based definition in which a given amino acid residue (AA) in a protein chain is defined as interacting with a ribonucleotide (rNT) in an RNA chain if any atom in AA is within a 5-Å radius of any atom in rNT; and (ii) a rule-based definition based on that of Allers and Shamoo ( 30 ), in which interactions are classified as van der Waals, hydrogen-bonding, hydrophobic or electrostatic interactions, involving specific AAs and rNTs. All such interacting AAs and rNTs are defined as ‘interface’ residues.
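The distance-based definition (i) can be illustrated with a minimal sketch. The `Atom` record, its field names and the function `interface_residues` are hypothetical and are not taken from the PRIDB code base; the sketch also assumes that 'within' a 5-Å radius includes atoms at exactly 5 Å, and it uses a brute-force all-pairs comparison rather than any spatial indexing.

```python
import math
from collections import namedtuple

# Hypothetical minimal atom record; field names are illustrative only.
Atom = namedtuple("Atom", "chain resnum resname name x y z")

def interface_residues(protein_atoms, rna_atoms, cutoff=5.0):
    """Return (interface amino acids, interface ribonucleotides).

    A residue is in the interface if any of its atoms lies within
    `cutoff` angstroms of any atom of a residue in the other molecule.
    Residues are identified by (chain, residue number, residue name).
    """
    aa_hits, rnt_hits = set(), set()
    for p in protein_atoms:
        for r in rna_atoms:
            d = math.dist((p.x, p.y, p.z), (r.x, r.y, r.z))
            if d <= cutoff:  # assumption: the cutoff itself counts as 'within'
                aa_hits.add((p.chain, p.resnum, p.resname))
                rnt_hits.add((r.chain, r.resnum, r.resname))
    return aa_hits, rnt_hits

# Toy coordinates (made up for illustration, not from any PDB entry).
protein = [
    Atom("A", 10, "ARG", "NH1", 0.0, 0.0, 0.0),
    Atom("A", 11, "GLY", "CA", 20.0, 0.0, 0.0),
]
rna = [
    Atom("B", 5, "G", "OP1", 3.0, 0.0, 0.0),
    Atom("B", 6, "U", "N3", 0.0, 30.0, 0.0),
]

aa, rnt = interface_residues(protein, rna)
print(sorted(aa))
print(sorted(rnt))
```

For structures the size of a ribosome, the quadratic loop would be slow; a cell list or k-d tree would be a natural replacement, although the source does not state which approach PRIDB uses.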
ProSite patterns and profiles ( 31 ) appearing in any of the protein sequences in the database were retrieved using the ScanProsite REST service ( 32 ). RNA structural motifs were identified in RNA sequences using FR3D’s ( 33 ) pure symbolic search function; specific motif definitions used for these scans are available in the Tutorial and FAQs section of the PRIDB online server.
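As an illustration of what a ProSite pattern scan involves, the sketch below translates a pattern written in ProSite syntax into a regular expression and reports matches in a protein sequence. This is a local stand-in for explanation only; PRIDB itself retrieves matches from the ScanProsite service, and ProSite profiles (as opposed to patterns) cannot be handled this way. The function names are hypothetical, and anchors placed inside square brackets are not supported.

```python
import re

def prosite_to_regex(pattern):
    """Convert a ProSite-style pattern (e.g. 'N-{P}-[ST]-{P}.') to a regex string."""
    body = pattern.strip().rstrip(".")       # drop the terminating period
    body = body.replace("-", "")             # element separators
    body = body.replace("{", "[^").replace("}", "]")   # {P} = any residue but P
    body = body.replace("(", "{").replace(")", "}")    # x(2,4) = repetition
    body = body.replace("x", ".")            # x = any residue
    body = body.replace("<", "^").replace(">", "$")    # N-/C-terminal anchors
    return body

def find_motif(pattern, sequence):
    """Return (1-based start, matched text) for all matches, including overlaps."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]

# Toy sequence (made up); the pattern is the N-glycosylation site PS00001.
hits = find_motif("N-{P}-[ST]-{P}.", "AANGSAANPTA")
print(hits)
print(prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]"))
```

The lookahead wrapper is used so that overlapping occurrences are all reported, which a plain left-to-right regular expression search would skip.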
Non-redundant benchmark data sets
Because PRIDB is intended to be a comprehensive collection of protein–RNA complexes from the PDB, the rRB926 data set was not filtered on the basis of redundancy, structure determination method, resolution or protein/RNA chain length. While it is possible to filter with such criteria using PRIDB’s advanced search function, several pre-calculated benchmark data sets, which have been filtered to limit redundancy and to exclude low-resolution structures, are also provided for the user’s convenience. These include two previously published data sets, RB109 ( 17 , 34 ) and RB147 ( 35 ), as well as a larger, more recently extracted data set (RB199) (B. Lewis, submitted for publication). Complete lists of the PDB IDs for protein–RNA complexes in these data sets, in addition to the pre-calculated interface residue statistics, can be readily accessed from the ‘Datasets’ section of the PRIDB homepage.
Implementation and availability
PRIDB runs on the Apache 2.2 web server, using MySQL 14.14 as a database backend with AJAX and PHP 5 for user interface functions. Functions not requiring use of the database (e.g. calculating interface residues for a user-submitted complex) are implemented using standalone Perl 5 scripts and the BioPerl module ( 36 ). All PRIDB code is available on request under the Creative Commons Attribution Non-Commercial License. All data currently in PRIDB was obtained from databases or programs which impose no restrictions on academic use.
PRIDB summary statistics
As summarized in Table 1, the current version of PRIDB contains structural information for a total of 926 protein–RNA complexes available in the PDB as of 10 October 2010. These structures contain 9689 total protein chains, among which there are only 1174 unique sequences. While this would seem to indicate that most sequences in the database are repeated several times, this is not the case; 395 of the 1174 (34%) sequences appear only once, and 899 (77%) appear fewer than eight times (the ‘expected’ average redundancy). This disparity is due to the large proportion of ribosomal structures in the PDB (and, by extension, in PRIDB); 9 of the top 10 most abundant sequences, each present in more than 70 structures, are ribosomal proteins. The most abundant sequence, repeated more than 100 times, is that of the TRP-responsive attenuation protein, a protein for which numerous multimeric structures have been solved.
|Total Number in PRIDB a||Unique|
a Total number in PRIDB includes redundant complexes, RNA and protein chains (i.e. chains with identical sequences).
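The redundancy figures quoted for protein chains can be reproduced in kind with a simple tally. The sketch below is an illustration on made-up sequences, not the procedure used to build Table 1; `redundancy_summary` is a hypothetical helper. It shows how a few highly repeated sequences can raise the mean copy number even when many sequences occur only once.

```python
from collections import Counter

def redundancy_summary(chain_sequences):
    """Summarize how often each distinct sequence occurs in a list of chains."""
    counts = Counter(chain_sequences)
    total = len(chain_sequences)
    unique = len(counts)
    return {
        "total_chains": total,
        "unique_sequences": unique,
        "mean_copies": total / unique,   # the 'expected' average redundancy
        "singletons": sum(1 for n in counts.values() if n == 1),
        "most_common": counts.most_common(1)[0],
    }

# Toy data: one sequence repeated six times dominates the mean.
toy = ["MKV"] * 6 + ["GSH", "AAL", "AAL", "PQR"]
summary = redundancy_summary(toy)
print(summary)

# The same ratio for the totals reported in the text: 9689 chains, 1174 unique.
print(round(9689 / 1174, 2))
```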
As shown in Table 2, PRIDB currently contains 1 475 774 amino acid residues. Based on a 5-Å distance cutoff definition for interfacial residues, 414 026 of these residues interact with RNA; of 851 853 ribonucleotide residues in PRIDB, 326 441 interact with protein. On average, 28% of the amino acids in the RNA-binding proteins directly interact with RNA, and 38% of the ribonucleotides in the bound RNAs directly interact with protein. As before, these averages are skewed by the prevalence of ribosome structures; ribosomal proteins account for ∼90% of interacting amino acid residues and ∼60% of interacting nucleotides.
|Type||Total (Interface + Non-Interface)||Number in Interfaces (%)|
|Amino Acids||1 475 774||414 026 ( 28 )|
|Ribonucleotides||851 853||326 441 ( 38 )|
PRIDB provides a ‘Tutorial and FAQs’ section with detailed instructions on using PRIDB’s web interface; a list and brief descriptions of key capabilities of PRIDB are provided here. Using the ‘Basic Search’ function, users can retrieve information about protein–RNA complexes using their PDB ID or a keyword. Using the ‘Advanced Search’ function, users can filter results by specifying:
the experimental method used to determine the complex structure (e.g. X-ray diffraction, nuclear magnetic resonance);
a resolution range or threshold (for structures determined using X-ray diffraction, electron microscopy or fiber diffraction);
the minimum or maximum length of protein or RNA chains within the complex;
an amino acid or nucleotide subsequence found within the sequence of at least one of the protein or RNA chains in the complex; and
a motif (as defined by ProSite for protein chains or FR3D for RNA chains) found within at least one chain in the complex.
The ‘Advanced Search’ function also allows users to either specify a different distance cutoff for the distance-based interaction definition or choose the alternative rule-based definition.
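The filtering behaviour described above can be sketched as a sequence of optional predicates applied to each complex. Everything in this sketch is hypothetical: the `ComplexRecord` fields, the `advanced_search` signature and the toy entries are illustrative and do not reflect PRIDB's database schema or its PHP/MySQL implementation. The sketch assumes that a length or subsequence criterion is satisfied when at least one chain in the complex meets it.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ComplexRecord:
    pdb_id: str
    method: str
    resolution: Optional[float]          # in angstroms; None if not applicable
    protein_chains: List[str] = field(default_factory=list)
    rna_chains: List[str] = field(default_factory=list)

def advanced_search(records, method=None, max_resolution=None,
                    min_protein_len=None, rna_subsequence=None):
    """Keep records that satisfy every criterion that was actually supplied."""
    hits = []
    for rec in records:
        if method is not None and rec.method != method:
            continue
        if max_resolution is not None and (
                rec.resolution is None or rec.resolution > max_resolution):
            continue
        if min_protein_len is not None and not any(
                len(s) >= min_protein_len for s in rec.protein_chains):
            continue
        if rna_subsequence is not None and not any(
                rna_subsequence in s for s in rec.rna_chains):
            continue
        hits.append(rec)
    return hits

# Toy entries with placeholder identifiers (not real PDB IDs).
records = [
    ComplexRecord("toy1", "X-ray diffraction", 2.1, ["MKVLAAGIT"], ["GGGAUCC"]),
    ComplexRecord("toy2", "X-ray diffraction", 3.8, ["MK"], ["AUAUAU"]),
    ComplexRecord("toy3", "NMR", None, ["MKVLAAGITPQ"], ["GGGAUCC"]),
]

high_res = advanced_search(records, method="X-ray diffraction", max_resolution=3.0)
with_motif = advanced_search(records, rna_subsequence="GAUC")
print([r.pdb_id for r in high_res])
print([r.pdb_id for r in with_motif])
```

Treating an unspecified criterion as 'no constraint' keeps the criteria independent, so any combination of filters can be supplied in a single query.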
As shown in Figure 1, the search results page provides:
a summary of and basic information (name, resolution and structure determination method) about each complex, as well as a link to that complex’s PDB entry;
a linear display of the amino acid and nucleotide residues in each chain of each complex, with residues in the protein–RNA interface highlighted;
a display of residues (in red font) that are part of a protein or RNA motif, with information about that motif (and a link back to its source) provided on mouse-over;
a JMol applet for 3D visualization of each complex, with interacting amino acid and nucleotide residues colored ( Figure 2 A); and
a link to a dynamically generated file containing atomic-level interface information for each result in a machine-readable format ( Figure 2 B).
In addition to providing machine-readable results files for all searches, pre-computed results files for the non-redundant RB109, RB147 and RB199 data sets described above have been made available. These files, along with the complete PRIDB database (rRB926), can be downloaded from the ‘Datasets’ section of the website. Users can also generate a machine-readable list of interface residues for any arbitrary collection of complexes by inputting a list of PDB IDs. Results files contain a single line for each pair of interacting atoms listing the specific interacting atoms (by chain name, residue number and atom name) and the distance between them.
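Because each line of a results file describes one pair of interacting atoms, residue-level contacts can be recovered by grouping lines on their residue identifiers. The sketch below assumes a whitespace-delimited layout with seven columns (protein chain, residue number, atom name, RNA chain, residue number, atom name, distance in Å); this column order is an assumption made for illustration, and the actual layout should be checked against a downloaded file or the ‘Tutorial and FAQs’ section. The function name and sample lines are hypothetical.

```python
def residue_contacts(lines):
    """Collapse atom-pair lines into residue pairs with their shortest distance.

    Returns a dict mapping ((protein chain, residue), (RNA chain, residue))
    to the minimum atom-atom distance observed for that residue pair.
    Residue numbers are kept as strings so that insertion codes survive.
    """
    contacts = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Assumed column order; adjust to match the real file.
        p_chain, p_res, p_atom, r_chain, r_res, r_atom, dist = line.split()
        key = ((p_chain, p_res), (r_chain, r_res))
        d = float(dist)
        if key not in contacts or d < contacts[key]:
            contacts[key] = d
    return contacts

# Made-up sample lines in the assumed layout.
sample = [
    "A 10 NH1 B 5 OP1 2.9",
    "A 10 NE  B 5 OP2 3.4",
    "A 12 OG  B 6 O2' 4.7",
]

pairs = residue_contacts(sample)
for (aa, rnt), d in sorted(pairs.items()):
    print(aa, rnt, d)
```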
Users may also use PRIDB to calculate interface residues for protein–RNA complexes that are not in the PDB by submitting a structure file in PDB format. A results file containing the interface residues (calculated using PRIDB’s 5-Å cutoff) is returned via e-mail.
CONCLUSIONS AND FUTURE DIRECTIONS
PRIDB provides researchers with atomic- and residue-level information about structures of protein–RNA complexes and their interfaces, facilitating analyses of protein–RNA interactions by pre-computing commonly used information and by providing structural information both interactively onscreen and in a machine-readable format. It allows users to rapidly identify and visualize interfaces in protein–RNA complexes on a residue-by-residue basis and displays identified ProSite or FR3D motifs along with the amino acid or ribonucleotide sequences. PRIDB can be used to generate custom data sets of protein–RNA interfaces for statistical analyses and machine learning applications. The PRIDB server also provides pre-calculated benchmark data sets of protein–RNA complexes for evaluating the performance of interface prediction methods. PRIDB will be updated regularly as new structures are released through the PDB, and is intended to be a stable resource for researchers in the field of protein–RNA interactions.
Future versions of PRIDB will include additional protein and RNA motifs from other sources, such as PRINTS ( 37 ), PIRSF ( 38 ) and other InterPro ( 39 ) member databases. In addition, the current JMol 3D visualization capabilities will be extended to user-submitted structures, allowing for more facile manipulation and examination of interfaces in complexes not currently in the PDB.
National Institutes of Health (GM066387 to V.H. and D.D.); the National Science Foundation [IGERT0504304 (to D.D.); GK120947929 (to B.A.L.); NIBIB-NSF0608769 (to V.H., J.F. and C.Z.)]; Iowa State University’s Center for Integrated Animal Genomics (to B.A.L. and D.D.); Center for Computational Intelligence, Learning and Discovery (to V.H.). Funding for open access charge: Center for Computational Intelligence, Learning and Discovery.
Conflict of interest statement . None declared.
The authors thank members of our research groups for helpful discussions and especially Usha Muppirala for critical comments on the PRIDB server and manuscript.