Mirko Torrisi, Gianluca Pollastri, Brewery: deep learning and deeper profiles for the prediction of 1D protein structure annotations, Bioinformatics, Volume 36, Issue 12, June 2020, Pages 3897–3898, https://doi.org/10.1093/bioinformatics/btaa204
Abstract
Protein structural annotations (PSAs) are essential abstractions to deal with the prediction of protein structures. Many increasingly sophisticated PSAs have been devised in the last few decades. However, the need for annotations that are easy to compute, process and predict has not diminished. This is especially true for protein structures that are hardest to predict, such as novel folds.
We propose Brewery, a suite of ab initio predictors of 1D PSAs. Brewery uses multiple sources of evolutionary information to achieve state-of-the-art predictions of secondary structure, structural motifs, relative solvent accessibility and contact density.
The web server, standalone program, Docker image and training sets of Brewery are available at http://distilldeep.ucd.ie/brewery/.
1 Introduction
The complexity, wide variability and ultimately the sheer number of diverse protein structures present in nature make the characterization of proteins extremely expensive and complex. For this reason, considerable effort has gone into predicting protein structures by computational means (Dill and MacCallum, 2012), often in the form of abstractions that simplify the prediction while still retaining structural information (Torrisi and Pollastri, 2019). These abstractions, or protein structural annotations (PSAs), are 1D when they can be represented by a string or a sequence of numbers of the same length as the protein’s primary structure, i.e. the chain of amino acids (AA) composing the protein. Secondary structure (SS; Torrisi et al., 2019) and relative solvent accessibility (RSA; Kaleel et al., 2019) are among the most widely adopted structural annotations, with various definitions of torsion angles/structural motifs and contact density (CD) having also been predicted (Kurgan and Disfani, 2011; Torrisi and Pollastri, 2019).
Although there is increasing interest in predicting 2D structural annotations such as contact and distance maps (Senior et al., 2019), 1D PSAs remain a central component of many such predictors (Hou et al., 2019; Senior et al., 2019), especially in the hardest prediction cases (Wu et al., 2019). Because of this, many efforts are ongoing to harness all the protein data available, e.g. the Protein Data Bank (PDB; Berman et al., 2000) and UniProt (The UniProt Consortium, 2016), together with increasingly sophisticated algorithms for sequence similarity detection (Steinegger et al., 2019a), to predict PSAs more reliably (Fang et al., 2019; Hanson et al., 2018; Kaleel et al., 2019; Klausen et al., 2019; Torrisi et al., 2019). Here, we introduce Brewery, a suite of PSA predictors based on deep learning algorithms that are fast, highly accurate, compact and portable, and that are free to access via a Web interface and to download.
2 Materials and methods
Brewery predicts four PSAs: SS; RSA; CD (Baú et al., 2006); and structural motifs obtained from clustering of consecutive torsion angles (TA; Sims et al., 2005). Brewery’s training set is built starting from the PDB (Berman et al., 2000) as released in December 2014, imposing an internal redundancy threshold of 25% sequence identity. This resulted in 15 753 proteins, comprising 3 797 426 AA, one of the largest sets built to date, which we used for both training and validation purposes (in cross-validation). We also built an independent test set (Torrisi et al., 2019), starting from the PDB as released in June 2019, to compare Brewery against some of the most recent predictors, i.e. MUFold-SSW (Fang et al., 2019), NetSurfP-2.0 (Klausen et al., 2019) and SPOT-1D (Hanson et al., 2018). This set was redundancy-reduced at 25% sequence identity, both internally and against the training sets of Brewery, NetSurfP-2.0 and SPOT-1D, and contains 618 proteins (91 375 AA). We call the resulting set 2019_test and report the results observed on it for SS (in both three and eight classes), TA, RSA and CD in Table 1.
Table 1. Results on the 2019_test set
Predictor | SS Q3 (%) | SS Q8 (%) | TA Q14 (%) | RSA Q4 (%) | CD Q4 (%) |
---|---|---|---|---|---|
Brewery+ | 81.8 | 68.6 | 65.8 | 52.6 | 49.4 |
Brewery | 81.7 | 68.5 | 65.8 | 52.4 | 49.1 |
BreweryH | 81 | 67.5 | 64.5 | 52 | 48.6 |
MUFold-SSW | 81.1 | 68.2 | 64.5 | N/A | N/A |
NetSurfP-2.0 | 81.3 | 67.9 | 63 | 49 | N/A |
SPOT-1D | 82.1 | 69.7 | 65.4 | 49.9 | N/A |
Note: Brewery employs evolutionary information from UniProt20 (via HHblits) and UniRef90 (via PSI-BLAST); Brewery+ also includes profiles from an HHblits search on BFD. BreweryH uses only HHblits profiles from UniProt20.
We use both PSI-BLAST (Altschul et al., 1997) and HHblits (Steinegger et al., 2019a) to gather evolutionary information. PSI-BLAST is run on the May 2016 release of UniRef90 (The UniProt Consortium, 2016) and HHblits on both the February 2016 release of UniProt20 (Mirdita et al., 2017) and the March 2019 release of BFD (Steinegger et al., 2019b). We encode the evolutionary information using a weighting scheme and a novel clipping technique, and feed the resulting encodings to ensembles of Cascaded Bidirectional Recurrent and Convolutional Neural Networks (Kaleel et al., 2019; Torrisi et al., 2019).
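The weighting scheme and clipping technique are not detailed in this note, so the sketch below does not attempt to reproduce them. It only illustrates, under our own assumptions, the generic idea behind an evolutionary profile: for each column of a multiple sequence alignment, compute the (optionally weighted) frequency of each of the 20 standard amino acids. The function name `frequency_profile`, the `weights` parameter and the toy alignment are hypothetical and are not part of Brewery.

```python
from typing import List, Optional, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def frequency_profile(
    alignment: Sequence[str],
    weights: Optional[Sequence[float]] = None,
) -> List[List[float]]:
    """Return one 20-dimensional frequency vector per alignment column.

    Generic illustration only: this is NOT Brewery's weighting or clipping.
    Gaps and non-standard letters are ignored when counting.
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    if weights is None:
        weights = [1.0] * len(alignment)  # unweighted: every sequence counts once
    if len(weights) != len(alignment):
        raise ValueError("need exactly one weight per aligned sequence")

    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    profile: List[List[float]] = []
    for col in range(length):
        counts = [0.0] * len(AMINO_ACIDS)
        for row, weight in zip(alignment, weights):
            i = index.get(row[col].upper())
            if i is not None:  # skip gaps ('-') and non-standard residues
                counts[i] += weight
        total = sum(counts)
        profile.append([c / total if total > 0 else 0.0 for c in counts])
    return profile


# Toy alignment of three sequences; the third has a gap in the second column.
toy_alignment = ["MKV", "MRV", "M-V"]
toy_profile = frequency_profile(toy_alignment)
```

Each row of `toy_profile` corresponds to one residue of the query, which is what makes such an encoding a natural input for per-residue (1D) predictors.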
We define both RSA and SS using DSSP (Kabsch and Sander, 1983), classifying the former into four classes, roughly equally distributed, and the latter into both eight states (G, H, I, E, B, S, T and ‘.’), and three states, i.e. Helices (G, H and I), Sheets (E and B) and Coils (‘.’, S and T). We classify TA into 14 structural motifs derived from clusters described in Sims et al. (2005) and CD into four states, i.e. very low density, low density, high density and very high density (Baú et al., 2006).
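The reduction from eight to three SS states described above, and the per-residue accuracy reported as Q3 and Q8 in Table 1, can be sketched as follows. The mapping follows the definition given in the text; the single-letter labels H, E and C for the three classes, the function names and the example strings are our own assumptions for illustration, and Qk is taken here to be the percentage of residues assigned to the correct one of k classes.

```python
# DSSP eight-state to three-state reduction, as defined in the text:
# Helices (G, H, I), Sheets (E, B) and Coils ('.', S, T).
# The output letters H/E/C are an assumed labelling convention.
DSSP8_TO_3 = {
    "G": "H", "H": "H", "I": "H",
    "E": "E", "B": "E",
    ".": "C", "S": "C", "T": "C",
}


def reduce_to_three_states(ss8: str) -> str:
    """Map an eight-state DSSP string to its three-state equivalent."""
    return "".join(DSSP8_TO_3[state] for state in ss8)


def q_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted class equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Made-up example strings (16 residues), not taken from any real protein.
observed8 = "..HHHHGGT.EEEB.S"
predicted8 = "..HHHHHHT.EEEE.."

q8 = q_accuracy(predicted8, observed8)
q3 = q_accuracy(reduce_to_three_states(predicted8),
                reduce_to_three_states(observed8))
```

In this toy case the four eight-state errors (G predicted as H, B as E, S as '.') all fall within the same three-state class, so Q3 is higher than Q8, which is the usual relationship between the two columns in Table 1.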
3 Web server
The web server of Brewery allows the prediction of some or all of the 1D PSAs in one go. Up to roughly 200 protein sequences (or 64 KB) can be submitted at the same time. There is no limit on the total number of submissions, and the confirmation page shows the server load and the URL of the result page. If an email address is provided, all the information on the result page is also sent by email when available. The result page/email contains a recap of the query sequence along with the predictions, and respective confidence levels, of the chosen 1D PSAs, with a quick explanation of the output format. The server implements ‘Brewery’ in Table 1. At the time of writing, the Brewery web server had processed over 70 000 queries, with an average response time of 24 min.
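Before submitting a large batch, it may be convenient to check it locally against the limits quoted above. The following is a minimal sketch under our own assumptions: that the input is FASTA-formatted text (one header line starting with ‘>’ per sequence), that 64 KB means 64 × 1024 bytes and that the cutoff is exactly 200 sequences, whereas the text only says ‘roughly 200’. The function `check_submission` is hypothetical and is not part of the Brewery server or standalone.

```python
from typing import Tuple

MAX_SEQUENCES = 200     # "roughly 200" in the text; the exact cutoff is an assumption
MAX_BYTES = 64 * 1024   # 64 KB, assuming 1 KB = 1024 bytes


def check_submission(fasta_text: str) -> Tuple[bool, str]:
    """Return (ok, message) for a FASTA batch checked against the assumed limits."""
    n_bytes = len(fasta_text.encode("utf-8"))
    # Count sequences by counting FASTA header lines.
    n_seqs = sum(1 for line in fasta_text.splitlines() if line.startswith(">"))
    if n_seqs == 0:
        return False, "no FASTA header lines found"
    if n_seqs > MAX_SEQUENCES:
        return False, f"{n_seqs} sequences exceed the limit of {MAX_SEQUENCES}"
    if n_bytes > MAX_BYTES:
        return False, f"{n_bytes} bytes exceed the limit of {MAX_BYTES}"
    return True, f"{n_seqs} sequence(s), {n_bytes} bytes"


# Example with two short made-up sequences.
ok, message = check_submission(">seq1\nMKVLAAGIVG\n>seq2\nGSHMTEYKLV\n")
```

Batches that fail the check can be split into several submissions, since the server places no limit on their total number.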
4 Standalone
The standalone version of Brewery is available on both GitHub and Docker Hub. It allows more flexibility than the web server, including the use of a slower but slightly more accurate ensemble of UniProt20, UniRef90 and BFD models (‘Brewery+’ in Table 1) and of a significantly faster HHblits-only pipeline (‘BreweryH’ in Table 1). The standalone is freely released for non-commercial purposes and, given appropriate credit, may also be modified. It is very light, requiring only 30 MB of space.
Acknowledgements
The authors acknowledge the Research IT Service at University College Dublin for providing HPC resources that have contributed to the research results reported within this article. The authors are grateful to Michael Schantz Klausen and Prof. Bent Petersen for NetSurfP-2.0’s training set, and to Prof. Yaoqi Zhou for SPOT-1D’s training set.
Funding
The work of M.T. was supported by the Irish Research Council [GOIPG/2015/3717] and the UCD School of Computer Science Bursary.
Conflict of Interest: none declared.
References
The UniProt Consortium. (