ProNet DB: a proteome-wise database for protein surface property representations and RNA-binding profiles

Abstract The rapid growth in the number of experimental and predicted protein structures, together with their increasing complexity, poses a significant challenge for computational biology in leveraging structural information and accurately representing protein surface properties. Recently, AlphaFold2 predictions covering the proteomes of various species have been released, and protein surface property representation plays a crucial role in protein-molecule interaction predictions, including those involving proteins, nucleic acids and compounds. Here, we present ProNet DB, the first extensive database that integrates multiple protein surface representations and the RNA-binding landscape for 326,175 protein structures. This collection encompasses the 16 model organism proteomes from the AlphaFold Protein Structure Database and experimentally validated structures from the Protein Data Bank. For each protein, ProNet DB provides access to the original protein structure along with detailed surface property representations encompassing hydrophobicity, charge distribution and hydrogen bonding potential, as well as interactive features such as the interacting face and RNA-binding sites and preferences. To facilitate an intuitive interpretation of these properties and the RNA-binding landscape, ProNet DB incorporates visualization tools such as Mol* and an Online 3D Viewer, allowing direct observation and analysis of these representations on protein surfaces. The availability of pre-computed features gives users immediate access, supporting computational biology research in areas such as molecular mechanism elucidation, geometry-based drug discovery and the development of novel therapeutic approaches. Database URL: https://proj.cse.cuhk.edu.hk/aihlab/pronet/.


Background & Summary
Proteins perform vital functions in a variety of cellular activities, and protein-molecule interactions underlie complex biological processes such as gene expression regulation ?, signal transduction ? and drug therapy ?. However, the detailed mechanisms of most protein-molecule interactions have not been well characterized, which hinders mechanism exploration and drug discovery. When a protein interacts with other molecules, those molecules recognize properties of the protein surface, such as hydrophobicity, charge distribution, hydrogen/electron donors and steric hindrance at the binding site. Thus, a comprehensive and efficient representation of the protein surface is essential to elucidate the mechanisms of protein-molecule interaction. For example, Rudden et al. ? used a single volumetric descriptor of the protein surface, incorporating electrostatics and local dynamics, for protein docking and achieved an average success rate of 54%. Experimental assessments of protein surface properties, such as NMR-based measurement ? and hydrophobic interaction chromatography (HIC) ?, are time-consuming and costly. Moreover, with the advent of the AlphaFold Protein Structure Database ?, a large number of protein structures are now obtained by computational prediction, and traditional experimental approaches cannot scale to evaluate surface properties for all of them.
To overcome the limitations of experimental approaches, several in silico methods for characterizing protein surface properties have been proposed, such as MaSIF ?, FEATURE ? and AutoDock ?. For example, AutoDock ? calculates atom-wise biochemical properties, and FEATURE ? uses a series of concentric shells to describe the atoms of the protein within 7.5 Å of a grid point with 80 physicochemical properties. MaSIF ? encodes geometric features (shape index and distance-dependent curvature) and chemical features (hydropathy, continuum electrostatics, and free electrons/protons) on the surface within a geodesic radius of 9 Å or 12 Å. Although these tools are available for downstream applications, they are not ready to use: they require complex running environments and long running times. It is also inefficient for each user to run them locally on the same protein, which leads to repetitive work, since for a fixed protein structure a given tool should always produce the same surface representation. We therefore built a database by running MaSIF to encode protein surface physicochemical properties, including hydrophobicity, charge distribution, hydrogen bonds and the interacting face, for protein structures from the experimentally validated database (PDB) and the in silico database (AlphaFold DB), so that users can directly use these features in their downstream applications. The successful de novo design of proteins with learned surface fingerprints showed that surface properties play a crucial role in function-oriented protein design and laid a foundation for the development of synthetic biology ?.
Like the physicochemical properties, the RNA-binding landscape is an important part of the protein surface properties. Direct recognition of RNA motifs by RNA-binding proteins (RBPs) can provide information on protein-nucleic acid interactions ?. For example, the Pumilio/FBF (PUF) family can regulate translation through direct base-protein recognition of motifs such as UGUR on RNA transcripts ?. Thus, the RNA-binding profiles of RBPs are an important part of characterizing protein-molecule interactions. In this study, we employed the state-of-the-art deep-learning framework NucleicNet ? to predict the binding preferences for RNA constituents and the binding sites on the protein surface, providing the RNA-binding landscape for protein structures from the experimentally validated database (PDB) and the in silico database (AlphaFold DB). Although the dataset is based on prediction, we are the first to provide such a ready-to-use database for downstream applications, such as CRISPR-Cas system optimization ?, RBP-targeting therapeutics discovery ? and aptamer-guided drug delivery system development ?.
In summary, we present ProNet DB, a comprehensive database of protein surface features that contains protein surface physicochemical representations and the RNA-binding landscape for more than 326,175 protein structures, covering the 16 model organism proteomes from AlphaFold DB and PDB. For each protein, we provide the original protein structure, the surface property representations (hydrophobicity, charge distribution, hydrogen bonds and interacting face) and the RNA-binding landscape (RNA-binding sites and RNA-binding preferences). To make the surface property representations and the RNA-binding landscape intuitive to interpret, we also integrate Mol* and Online 3D Viewer to visualize them on the protein surface. The server can now be accessed at https://proj.cse.cuhk.edu.hk/aihlab/pronet/, and future releases will expand the species and property coverage.

Data Source
We first collected 23,391 protein structures for the Homo sapiens proteome and 6,042 protein structures for the Saccharomyces cerevisiae proteome from AlphaFold DB ?. Where corresponding experimentally validated protein structures exist in PDB, we collected the structure with the highest resolution from PDB (Homo sapiens: 6,030, Saccharomyces cerevisiae: 1,160) ?. To make the database more comprehensive, we then collected the other 14 model organism proteomes from AlphaFold DB: Arabidopsis thaliana, Caenorhabditis elegans, Candida albicans, Danio rerio, Dictyostelium discoideum, Drosophila melanogaster, Escherichia coli, Glycine max, Methanocaldococcus jannaschii, Mus musculus, Oryza sativa, Rattus norvegicus, Schizosaccharomyces pombe and Zea mays. Adding the proteomes of these model organisms substantially expanded the protein structure coverage of ProNet DB from 33,000 to 333,365 (Table 1) and led to a more comprehensive and user-friendly database.
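The selection rule described above (keep the PDB structure with the highest resolution for each protein) can be sketched as follows. This is a minimal illustration and not the pipeline used to build ProNet DB: the `PdbCandidate` record, its field names and the example identifiers are hypothetical. Note that the highest resolution corresponds to the smallest value in Å.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PdbCandidate:
    """Hypothetical record for one experimental structure of a protein."""
    pdb_id: str
    resolution: Optional[float]  # in angstroms; None if not reported (e.g. NMR)


def pick_highest_resolution(candidates: Iterable[PdbCandidate]) -> Optional[PdbCandidate]:
    """Return the candidate with the best (numerically smallest) resolution.

    Entries without a reported resolution are ignored; None is returned
    when no candidate has a usable resolution value.
    """
    resolved = [c for c in candidates if c.resolution is not None]
    if not resolved:
        return None
    return min(resolved, key=lambda c: c.resolution)


# Illustrative input only; these identifiers are made up.
example = [
    PdbCandidate("1ABC", 2.5),
    PdbCandidate("2XYZ", 1.8),
    PdbCandidate("3NMR", None),
]
best = pick_highest_resolution(example)
print(best.pdb_id if best else "no experimental structure")
```

How entries without a reported resolution should be handled is a design choice; here they are simply skipped, which is an assumption rather than something the text specifies.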

Protein Surface Physicochemical Property
MaSIF is a general framework for encoding protein surface fingerprints ?. For each protein, it generates a discretized molecular surface and assigns the calculated physicochemical features to every vertex of that surface, so that the properties of the protein surface are represented explicitly. As shown in Figure 1, the user can determine which parts of the surface are hydrophilic or hydrophobic and which parts are more likely to interact with other molecules (the interacting face). We computed the surface properties with the MaSIF tool for the proteins in our database so that users can efficiently obtain the physicochemical property profile of every protein. These computed features can benefit many downstream tasks, including binding site prediction ?, protein-protein interaction prediction ? and protein design ?. A recent study of protein design based on surface fingerprints confirmed that surface properties play a crucial role in function-oriented protein design and indicated that such geometric features of the protein surface can advance protein-centric research ?.
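As a rough illustration of how per-vertex features of this kind can be summarized, the sketch below computes the fraction of surface vertices that are hydrophobic, hydrophilic or part of the interacting face. It is not MaSIF's API: the input lists, the function name `surface_proportions` and both cutoff values are assumptions made for illustration only.

```python
from typing import Dict, Sequence


def surface_proportions(
    hydropathy: Sequence[float],
    interface_score: Sequence[float],
    hydropathy_cutoff: float = 0.0,   # assumed: above = hydrophobic, below = hydrophilic
    interface_cutoff: float = 0.5,    # assumed: at or above = interacting face
) -> Dict[str, float]:
    """Summarize per-vertex surface features as fractions of all vertices."""
    n = len(hydropathy)
    if n == 0 or n != len(interface_score):
        raise ValueError("feature lists must be non-empty and of equal length")

    hydrophobic = sum(1 for h in hydropathy if h > hydropathy_cutoff)
    hydrophilic = sum(1 for h in hydropathy if h < hydropathy_cutoff)
    interacting = sum(1 for s in interface_score if s >= interface_cutoff)

    return {
        "hydrophobic": hydrophobic / n,
        "hydrophilic": hydrophilic / n,
        "interacting_face": interacting / n,
    }


# Toy surface with four vertices (values are invented).
summary = surface_proportions(
    hydropathy=[1.0, -1.0, 2.0, -0.5],
    interface_score=[0.9, 0.1, 0.6, 0.2],
)
print(summary)
```

Vertices whose hydropathy equals the cutoff exactly are counted in neither class in this sketch, so the hydrophobic and hydrophilic fractions need not sum to one.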

Protein RNA-binding Profiles
Because RNA-protein interactions are involved in many cellular activities, characterizing the interactions between RNAs and RBPs is important for understanding those activities. ProNet DB provides systematic mappings of RNA-protein interaction for multiple RNA constituents. Following the deep-learning framework proposed by NucleicNet ?, we obtained the binding preferences and the binding sites for six RNA constituents, namely ribose (R), phosphate (P), adenine (A), guanine (G), cytosine (C) and uracil (U), for protein structures in AlphaFold DB and PDB. The protein RNA-binding profiles are further classified into multiple sub-classes for each species based on their diverse protein functions. In ProNet DB, users can directly access protein properties such as the RNA backbone composition and the binding preferences for different bases, which intuitively illustrate the protein RNA-binding landscape and partially reveal the protein surface properties.
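To make the idea of a per-protein binding profile concrete, the sketch below takes predicted scores for the six constituents at each binding site, assigns each site its preferred constituent and reports the proportion of sites per constituent. The data layout (one dictionary of scores per site) and the function names are hypothetical and do not reflect NucleicNet's actual output format.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping

# The six RNA constituents named in the text.
CONSTITUENTS = ("R", "P", "A", "G", "C", "U")


def preferred_constituent(scores: Mapping[str, float]) -> str:
    """Return the constituent with the highest score at one site.

    Ties are resolved in the order of CONSTITUENTS (an arbitrary choice).
    """
    known = {c: scores[c] for c in CONSTITUENTS if c in scores}
    if not known:
        raise ValueError("site has no score for any known constituent")
    return max(known, key=known.get)


def binding_profile(sites: Iterable[Mapping[str, float]]) -> Dict[str, float]:
    """Proportion of sites whose preferred constituent is each of R/P/A/G/C/U."""
    counts = Counter(preferred_constituent(s) for s in sites)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in CONSTITUENTS}


# Invented scores for four sites on one protein surface.
example_sites = [
    {"A": 0.9, "G": 0.1},
    {"P": 0.8},
    {"A": 0.7, "U": 0.2},
    {"R": 0.6},
]
profile = binding_profile(example_sites)
print(profile)
```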

Figure 1. An overview of ProNet DB and an illustration of its two main outputs. The right panel shows an example of the protein surface physicochemical properties and RNA-binding profiles.

Figure 2. User interface of ProNet DB. Top-left: the home page contains three subsections: servers, databases and visualization tools. Top-right: the NucleicNet DB page, where users can search, filter and view the search results. A toggle button in the top-right corner of the NucleicNet DB page switches between protein sources. Bottom: the protein information page and visualization details for each item.

Figure 3. ProNet DB statistics for the Human and Yeast results in AlphaFold DB and PDB. (A) The functional classification of protein structures in both AlphaFold DB and PDB. (B) The upper panel shows the distribution of protein surface physicochemical properties, including the proportions of hydrophilic, hydrophobic and interacting face regions on Human protein surfaces in AlphaFold DB. The lower panel shows the distribution of the proportions of positively/negatively charged regions and H-bond donor/acceptor regions on Yeast protein surfaces in AlphaFold DB. (C) Venn diagram showing the number of experimentally validated protein structures from PDB compared with computationally predicted structures from AlphaFold DB. (D) Detailed comparison of the proportions of binding profiles for each RNA constituent in PDB, i.e. four bases, adenine (A), guanine (G), cytosine (C) and uracil (U), and two backbone constituents, phosphate (P) and ribose (R). (E) The proportions of chain counts in PDB structures for Human and Yeast.

Table 1. The model organism proteomes in ProNet DB.

Table 2. An example entry in ProNet DB, showing how the data content is organized for one protein, 17-beta-hydroxysteroid dehydrogenase type 1. An entry has three profiles: the Basic Profile contains basic information such as protein names, protein types and gene names, as well as mapping IDs to other databases; the MaSIF Profile includes the physicochemical properties computed by MaSIF, describing the protein surface features; the NucleicNet Profile contains the RNA-binding preference information.

640 entries). Figure 3(C) shows that a large proportion of protein structures (66.9% in Homo sapiens, 75.2% in Saccharomyces cerevisiae) have not been validated by experimental approaches. In addition, Supplementary Figure S2 demonstrates considerable protein structure prediction performance, since 80.6% of validated proteins in Homo sapiens and 74.8% of validated proteins in Saccharomyces cerevisiae are accurately predicted (RMSD ≤ 2.0). In Figure 3(B), we summarize the proportions of hydrophobic and hydrophilic vertices over the total number of vertices for AlphaFold2 Human proteins and compare them with the interacting face proportion. A clear pattern is shown in Figure 3(B) that the H-bond acceptor region

Supplementary Table S1. Detailed data statistics for ProNet DB