PredictProtein - Predicting Protein Structure and Function for 29 Years

Abstract Since 1992 PredictProtein (https://predictprotein.org) has been a one-stop online resource for protein sequence analysis; its main site is hosted at the Luxembourg Centre for Systems Biomedicine (LCSB) and was queried monthly by over 3,000 users in 2020. PredictProtein was the first Internet server for protein predictions. It pioneered combining evolutionary information and machine learning. Given a protein sequence as input, the server outputs multiple sequence alignments, predictions of protein structure in 1D and 2D (secondary structure, solvent accessibility, transmembrane segments, disordered regions, protein flexibility, and disulfide bridges) and predictions of protein function (functional effects of sequence variation or point mutations, Gene Ontology (GO) terms, subcellular localization, and protein-, RNA-, and DNA-binding). PredictProtein's infrastructure has moved to the LCSB, increasing throughput; the use of MMseqs2 sequence search reduced runtime five-fold (apparently without lowering performance of prediction methods); user interface elements improved usability; and new prediction methods were added. PredictProtein recently included predictions from deep learning embeddings (GO and secondary structure) and a method for the prediction of proteins and residues binding DNA, RNA, or other proteins. PredictProtein.org aspires to provide reliable predictions to computational and experimental biologists alike. All scripts and methods are freely available for offline execution in high-throughput settings.


Introduction
Sequences are known for far more proteins (1) than have experimental annotations of function or structure (2,3). This sequence-annotation gap existed when PredictProtein (4,5) started in 1992, and has kept expanding ever since (6). Unannotated sequences contribute crucial evolutionary information to the neural networks predicting secondary structure (7,8) that seeded PredictProtein (PP) at the European Molecular Biology Laboratory (EMBL) in 1992 (9), the first fully automated, query-driven Internet server providing evolutionary information and structure prediction for any protein. Many other methods predicting aspects of protein function and structure have since joined under the PP roof (4,5,10), now hosted by the Luxembourg Centre for Systems Biomedicine (LCSB).
PP offers an array of structure and function predictions, most of which combine machine learning with evolutionary information, now enhanced by a faster alignment algorithm (11,12). A few prediction methods now also use embeddings (13,14) from protein Language Models (LMs) (13)(14)(15)(16)(17)(18). Embeddings are much faster to obtain than evolutionary information, yet for many tasks they perform almost as well, or even better (19,20). All PP methods join at PredictProtein.org with interactive visualizations; for some methods, more advanced visualizations are linked (21)(22)(23). The reliability of PredictProtein, its speed, the continuous integration of well-validated, top methods, and its intuitive interface have attracted thousands of researchers over 29 years of steady operation.
New: goPredSim embedding-based transfer of Gene Ontology (GO). goPredSim (19) predicts GO terms by transferring annotations from the most embedding-similar protein. Embeddings are obtained from SeqVec (13); similarity is established through the Euclidean distance between the embedding of a query and that of a protein with experimental GO annotations. Replicating the conditions of CAFA3 (41) in 2017, goPredSim achieved Fmax values of 37±2%, 52±2%, and 58±2% for BPO (biological process), MFO (molecular function), and CCO (cellular component), respectively (41,42). Using annotations from the Gene Ontology Annotation (GOA) database (43,44) in 2020 and testing on a set of 296 proteins with annotations added after February 2020, goPredSim appeared to reach even slightly higher values, which were confirmed through preliminary results for CAFA4 (45).
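The nearest-neighbor transfer described above can be illustrated with a minimal sketch. It is not the goPredSim implementation: the function name `transfer_go_terms`, the layout of the lookup table, and the toy three-dimensional vectors are assumptions made for illustration only. The sketch assumes that one fixed-length embedding per protein is already available.

```python
import math


def transfer_go_terms(query_embedding, lookup):
    """Return (protein_id, distance, go_terms) of the closest annotated protein.

    `lookup` maps a protein id to a pair (embedding, set_of_go_terms).
    This structure is hypothetical and chosen for illustration.
    """
    if not lookup:
        raise ValueError("lookup set is empty")
    best_id, best_dist = None, math.inf
    for protein_id, (embedding, _terms) in lookup.items():
        # Euclidean distance between query and annotated protein
        dist = math.dist(query_embedding, embedding)
        if dist < best_dist:
            best_id, best_dist = protein_id, dist
    return best_id, best_dist, lookup[best_id][1]


# Toy vectors; real per-protein embeddings have many more dimensions
lookup = {
    "protA": ([0.0, 0.0, 0.0], {"GO:0003723"}),
    "protB": ([1.0, 1.0, 1.0], {"GO:0042802"}),
}
hit_id, distance, terms = transfer_go_terms([0.1, 0.0, 0.2], lookup)
print(hit_id, round(distance, 3), sorted(terms))
```

A linear scan is enough for a sketch; for a large lookup set, a vectorized or indexed nearest-neighbor search would be the more practical choice.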
New: ProNA2020 predicts protein, RNA & DNA binding. ProNA2020 (3) predicts whether or not a protein interacts with other proteins, RNA, or DNA and, if so, which residues bind. Per-protein predictions rely on homology and machine learning models employing profile-kernel SVMs (49) and embeddings from an in-house implementation of ProtVec (50). Per-residue predictions are based on simple neural networks due to the lack of experimental high-resolution annotations (51)(52)(53). ProNA2020 correctly predicted 77±1% of proteins that bind DNA, RNA, or protein. In proteins known to bind other proteins, RNA, or DNA, ProNA2020 correctly predicted 69±1%, 81±1%, and 80±1% of binding residues, respectively.
New: MMseqs2 speedy evolutionary information. The most time-consuming step for PP was the search for related proteins in ever-growing databases. MMseqs2 (11) finds related sequences blazingly fast and seeds a PSI-BLAST search (25). The query sequence is sent to a dedicated MMseqs2 server that searches for hits against cluster representatives within UniClust30 (54) and the PDB (26) reduced to 70% pairwise percentage sequence identity (PIDE). All hits and their respective cluster members are turned into an MSA and filtered to the 3,000 most diverse sequences.
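To make the notion of a PIDE threshold concrete, the following sketch computes the percentage identity of two aligned sequences and applies a greedy redundancy filter. Both choices are assumptions for illustration: PIDE is computed here over positions where neither sequence has a gap (other denominators are in use), and the greedy filter is a simple stand-in, not the clustering that MMseqs2 or the PP pipeline performs. The names `percent_identity` and `reduce_redundancy` are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """PIDE over columns in which neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        aligned += 1
        if a == b:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


def reduce_redundancy(aligned_seqs, threshold=70.0):
    """Greedy filter: keep a sequence only if its PIDE to every
    already-kept sequence is below the threshold."""
    kept = []
    for seq in aligned_seqs:
        if all(percent_identity(seq, k) < threshold for k in kept):
            kept.append(seq)
    return kept


seqs = ["MKTAYIAK", "MKTAYIAR", "MQSAFLGK"]
print(percent_identity(seqs[0], seqs[1]))
print(reduce_redundancy(seqs))
```

The result of the greedy filter depends on input order, which is one reason dedicated clustering tools are preferred in production settings.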

Web Server
Frontend and User Interface (UI). Users query PredictProtein.org by submitting a protein sequence. Results are available in seconds for sequences that had been submitted recently (cf. PPcache next section), or within up to 30 minutes if predictions are recomputed. Per-residue predictions are displayed online via ProtVista (55), which allows zooming into any sequential protein region (Fig. S1), and are grouped by category (e.g., secondary structure), which can be expanded to display more detail (e.g., helix, strand, other). On the results page, links to visualize MSAs through AlignmentViewer (56) are available. More predictions can be accessed through a menu on the left, e.g., Gene Ontology Terms, Effect of Point Mutations and Subcellular Localization. Prediction views include references and details of outputs, as well as rich visualizations, e.g., GO trees for GO predictions and cell images with highlighted predicted locations for subcellular localization predictions (57).
PPcache, backend, and programmatic access. The PP backend lives at LCSB, allowing for up to 48 parallel queries. Results are stored on disk in the PPcache (5). Users submitting any of the 660,000 recently submitted sequences obtain results immediately. Using the bio-embeddings software (58), the PPcache is enriched by embeddings and embedding-based predictions on the fly. For all methods displayed on the frontend, JSON files compliant with ProtVista (55) are available via REST APIs (SOM_1), and are in use by external services such as the protein 3D structure visualization suite Aquaria (21,23) and by MolArt (22).
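The idea of returning stored results for previously submitted sequences can be sketched as a sequence-keyed disk cache. This is an illustration under stated assumptions, not the PPcache implementation: the SHA-256 key, the one-JSON-file-per-sequence layout, and the names `cache_key` and `get_or_compute` are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def normalize(sequence):
    """Strip whitespace and upper-case, so trivial variants share one key."""
    return "".join(sequence.split()).upper()


def cache_key(sequence):
    """Hypothetical cache key: SHA-256 of the normalized sequence."""
    return hashlib.sha256(normalize(sequence).encode("ascii")).hexdigest()


def get_or_compute(sequence, cache_dir, compute):
    """Return (result, was_cached); run `compute` only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(sequence)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8")), True
    result = compute(normalize(sequence))
    path.write_text(json.dumps(result), encoding="utf-8")
    return result, False


# Usage with a stand-in predictor that only reports the sequence length
with tempfile.TemporaryDirectory() as tmp:
    calls = []

    def fake_predict(seq):
        calls.append(seq)
        return {"length": len(seq)}

    first, first_hit = get_or_compute("mkt ayi", tmp, fake_predict)
    second, second_hit = get_or_compute("MKTAYI", tmp, fake_predict)

print(first_hit, second_hit, len(calls))
```

Keying on a hash of the normalized sequence keeps file names short and makes lookups independent of formatting differences in the submitted sequence.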
PredictProtein is available for local use. All results displayed on and downloadable from PP are available through Docker (59) and as source code for local and cloud execution (available at github.com/rostlab).

Use Case
We demonstrate PredictProtein.org tools through predictions for the nucleoprotein of the novel coronavirus SARS-CoV-2 (NCBI:txid2697049; UniProt identifier P0DTC9/NCAP_SARS2; Fig. 1). NCAP_SARS2 has 419 residues and interacts with itself (experimentally verified). Sequence similarity and automatic assignment via UniRule suggest that NCAP binds RNA (the viral genome), binds the membrane protein M (UniProt identifier P0DTC5/VME1_SARS2), and is fundamental for virion assembly. goPredSim (19) transferred GO terms from other proteins for MFO (RNA-binding; GO:0003723; ECO:0000213) and CCO (compartments in the host cell and viral nucleocapsid; GO:0019013; GO:0044172; GO:0044177; GO:0044220; GO:0030430; ECO:0000255) matching annotations found in UniProt (1). While it missed the experimentally verified MFO term identical protein binding (GO:0042802), goPredSim predicted protein folding (GO:0006457) and protein ubiquitination (GO:0016567), suggesting the nucleoprotein to be involved in biological processes requiring protein binding. ProNA2020 (3) predicts RNA-binding regions, the one with highest confidence between I84 (Isoleucine at position 84) and D98 (Aspartic Acid at position 98) (protein sequence available in SOM_1). While high-resolution experimental data on binding is not available, an NMR structure of the SARS-CoV-2 nucleocapsid phosphoprotein N-terminal domain in complex with 10mer ssRNA (PDB identifier 7ACT (60)) supports the predicted RNA-binding site (Fig. 2). Additionally, SNAP2 (38) predicts single amino acid variants (SAVs) in that region to likely affect function, reinforcing the hypothesis that this is a functionally relevant site. Although using different input information (evolutionary vs. embeddings), RePROF (5) and ProtBertSec (14) both predict an unusually high content, exceeding 70%, of non-regular (neither helix nor strand) secondary structure, suggesting that most of the nucleoprotein might not form regular structure.
This is supported by Meta-Disorder (30) predicting 53% overall disorder. Secondary structure predictions match high-resolution experimental structures of the nucleoprotein not in complex well (e.g., PDB 6VYO (61); 6WJI (62)). Both secondary structure prediction methods managed to zoom into the ordered regions of the protein and predicted, e.g., the five beta-strands that are formed within the sequence range I84 (Isoleucine) to A134 (Alanine), and the two helices formed within the sequence range from F346 (Phenylalanine) to T362 (Threonine).
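Summary figures such as the content of non-regular secondary structure can be derived from a per-residue prediction string. The sketch below assumes a three-state alphabet with H for helix, E for strand, and L for other; this encoding, the function names, and the toy string are assumptions for illustration and do not reproduce RePROF or ProtBertSec output for the nucleoprotein.

```python
from collections import Counter


def state_fractions(states):
    """Fraction of residues in each predicted state."""
    if not states:
        raise ValueError("empty prediction string")
    counts = Counter(states)
    return {state: n / len(states) for state, n in counts.items()}


def non_regular_fraction(states, regular=("H", "E")):
    """Fraction of residues predicted as neither helix nor strand."""
    if not states:
        raise ValueError("empty prediction string")
    return sum(1 for s in states if s not in regular) / len(states)


# Toy 20-residue prediction, not actual output for NCAP_SARS2
toy_prediction = "LLLLHHHHLLEEELLLLLLL"
print(state_fractions(toy_prediction))
print(f"{non_regular_fraction(toy_prediction):.0%} non-regular")
```

The same counting applies to a binary per-residue disorder prediction, where the fraction of residues flagged as disordered gives the overall disorder content.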

Conclusion
For almost three decades (preceding Google), PredictProtein (PP) has offered predictions for proteins. PP is the oldest Internet server in protein prediction, online for six times as long as most other servers (63)(64)(65). It pioneered combining machine learning with evolutionary information and making predictions freely accessible online. While the sequence-annotation gap continues to grow, the sequence-structure gap might be bridged in the near future (66). For the time being, servers such as PP, which provide a one-stop solution to predictions from many sustained, novel tools, are needed. PP is the first server to offer fast embedding-based predictions of structure (ProtBertSec) and function (goPredSim). By slashing runtime for PSSMs from 72 to 4 minutes through MMseqs2 and better LCSB hardware, PP also delivers evolutionary information-based predictions fast, and instantaneously if the query is in the precomputed PPcache. For heavy use, the offline Docker containers provide predictors out-of-the-box. At no cost to users, PredictProtein offers to quickly shed light on proteins about which little is known, using well-validated prediction methods.