Abstract

Motivation

Protein structural annotations (PSAs) are essential abstractions to deal with the prediction of protein structures. Many increasingly sophisticated PSAs have been devised in the last few decades. However, the need for annotations that are easy to compute, process and predict has not diminished. This is especially true for protein structures that are hardest to predict, such as novel folds.

Results

We propose Brewery, a suite of ab initio predictors of 1D PSAs. Brewery uses multiple sources of evolutionary information to achieve state-of-the-art predictions of secondary structure, structural motifs, relative solvent accessibility and contact density.

Availability and implementation

The web server, standalone program, Docker image and training sets of Brewery are available at http://distilldeep.ucd.ie/brewery/.

1 Introduction

The complexity, wide variability and ultimately the sheer number of diverse protein structures present in nature make the characterization of proteins extremely expensive and complex. For this reason, considerable effort has gone into predicting protein structures by computational means (Dill and MacCallum, 2012), often in the form of abstractions that simplify the prediction while still retaining structural information (Torrisi and Pollastri, 2019). These abstractions, or protein structural annotations (PSAs), are 1D when they can be represented by a string or a sequence of numbers of the same length as the protein’s primary structure, i.e. the chain of amino acids (AA) composing the protein. Secondary structure (SS; Torrisi et al., 2019) and relative solvent accessibility (RSA; Kaleel et al., 2019) are among the most widely adopted structural annotations, with various definitions of torsion angles/structural motifs and contact density (CD) having also been predicted (Kurgan and Disfani, 2011; Torrisi and Pollastri, 2019).

Although there is increasing interest in predicting 2D structural annotations such as contact and distance maps (Senior et al., 2019), 1D PSAs remain a central component of many such predictors (Hou et al., 2019; Senior et al., 2019), especially in the hardest prediction cases (Wu et al., 2019). Because of this, many efforts are ongoing to harness all the protein data available, e.g. the Protein Data Bank (PDB; Berman et al., 2000) and UniProt (The UniProt Consortium, 2016), together with increasingly sophisticated algorithms for sequence similarity detection (Steinegger et al., 2019a), to predict PSAs more reliably (Fang et al., 2019; Hanson et al., 2018; Kaleel et al., 2019; Klausen et al., 2019; Torrisi et al., 2019). Here, we introduce Brewery, a suite of PSA predictors based on deep learning algorithms that are fast, highly accurate, compact and portable, and that are free to access via a web interface and to download.

2 Materials and methods

Brewery predicts four PSAs: SS; RSA; CD (Baú et al., 2006); and structural motifs obtained from clustering of consecutive torsion angles (TA; Sims et al., 2005). Brewery’s training set is built starting from the PDB (Berman et al., 2000) as released in December 2014, imposing an internal redundancy threshold of 25% sequence identity. This resulted in 15 753 proteins, comprising 3 797 426 AA, one of the largest sets built to date, which we used for both training and validation purposes (in cross-validation). We also built an independent test set (Torrisi et al., 2019) to compare Brewery against some of the most recent predictors, i.e. MUFold-SSW (Fang et al., 2019), NetSurfP-2.0 (Klausen et al., 2019) and SPOT-1D (Hanson et al., 2018), starting from the PDB as released in June 2019. This set was redundancy-reduced at 25% sequence identity, both internally and against the training sets of Brewery, NetSurfP-2.0 and SPOT-1D, and contains 618 proteins (91 375 AA). We call the resulting set 2019_test and report the results observed on it for SS (in both three and eight classes), TA, RSA and CD in Table 1.

Table 1.

Most recent predictors assessed on 2019_test set of 618 proteins

Predictor      SS Q3 (%)   SS Q8 (%)   TA Q14 (%)   RSA Q4 (%)   CD Q4 (%)
Brewery+       81.8        68.6        65.8         52.6         49.4
Brewery        81.7        68.5        65.8         52.4         49.1
BreweryH       81          67.5        64.5         52           48.6
MUFold-SSW     81.1        68.2        64.5         N/A          N/A
NetSurfP-2.0   81.3        67.9        63           49           N/A
SPOT-1D        82.1        69.7        65.4         49.9         N/A

Note: Brewery employs evolutionary information from UniProt20 (via HHblits) and UniRef90 (via PSI-BLAST); Brewery+ additionally includes profiles from an HHblits search on BFD. BreweryH uses only HHblits profiles from UniProt20.
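The Q scores in Table 1 can be read as per-residue classification accuracies. The sketch below assumes the conventional definition (the percentage of residues whose predicted class equals the observed one); whether the published figures pool residues over the whole test set or average per protein is not stated here, so both function names and the pooling choice are illustrative assumptions rather than the authors' evaluation code.

```python
from typing import Iterable, Tuple


def q_accuracy(predicted: str, observed: str) -> float:
    """Percentage of positions where the predicted class equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


def pooled_q_accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Accuracy over all residues of all proteins pooled together (an assumption)."""
    matches = 0
    total = 0
    for predicted, observed in pairs:
        if len(predicted) != len(observed):
            raise ValueError("predicted and observed strings must have equal length")
        matches += sum(p == o for p, o in zip(predicted, observed))
        total += len(observed)
    if total == 0:
        raise ValueError("no residues to score")
    return 100.0 * matches / total
```

The same functions apply to any of the class alphabets in Table 1 (3, 4, 8 or 14 classes), since they only compare labels position by position. For example, q_accuracy("HHHECC", "HHEECC") scores 5 matching residues out of 6.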

We use both PSI-BLAST (Altschul et al., 1997) and HHblits (Steinegger et al., 2019a) to gather evolutionary information. PSI-BLAST is run on the May 2016 release of UniRef90 (The UniProt Consortium, 2016) and HHblits on both the February 2016 release of UniProt20 (Mirdita et al., 2017) and the March 2019 release of BFD (Steinegger et al., 2019b). We encode the evolutionary information using a weighting scheme and a novel clipping technique, and feed the encoded profiles to ensembles of Cascaded Bidirectional Recurrent and Convolutional Neural Networks (Kaleel et al., 2019; Torrisi et al., 2019).

We define both RSA and SS using DSSP (Kabsch and Sander, 1983), classifying the former into four classes, roughly equally distributed, and the latter into both eight states (G, H, I, E, B, S, T and ‘.’), and three states, i.e. Helices (G, H and I), Sheets (E and B) and Coils (‘.’, S and T). We classify TA into 14 structural motifs derived from clusters described in Sims et al. (2005) and CD into four states, i.e. very low density, low density, high density and very high density (Baú et al., 2006).
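The class definitions above can be sketched in a few lines. The eight-to-three mapping follows the grouping given in the text; the one-letter labels H, E and C for the three states are a conventional choice assumed here. The class boundaries for RSA and CD are not given in this section, so the second part only illustrates one generic way to obtain "roughly equally distributed" classes (quantile cut points computed from observed values); the function names are hypothetical and the resulting thresholds are not Brewery's.

```python
import bisect
import statistics
from typing import List, Sequence

# Grouping stated in the text: Helices (G, H, I), Sheets (E, B), Coils ('.', S, T).
# The output letters H/E/C are an assumed, conventional labelling.
DSSP8_TO_3 = {
    "G": "H", "H": "H", "I": "H",
    "E": "E", "B": "E",
    ".": "C", "S": "C", "T": "C",
}


def reduce_to_three_states(ss8: str) -> str:
    """Map a string of eight-state DSSP labels to the three-state alphabet."""
    unknown = set(ss8) - set(DSSP8_TO_3)
    if unknown:
        raise ValueError(f"unexpected DSSP states: {sorted(unknown)}")
    return "".join(DSSP8_TO_3[s] for s in ss8)


def equal_frequency_cuts(values: Sequence[float], n_classes: int = 4) -> List[float]:
    """Return n_classes - 1 cut points splitting values into equally populated bins.

    Generic quantile binning (Python 3.8+), shown only as an illustration.
    """
    return statistics.quantiles(values, n=n_classes)


def assign_class(value: float, cuts: Sequence[float]) -> int:
    """Index (0 .. len(cuts)) of the bin that a value falls into."""
    return bisect.bisect_right(cuts, value)
```

For instance, reduce_to_three_states("..GGGHHHHTTEEEEBS.") returns "CCHHHHHHHCCEEEEECC". With four classes, assign_class returns 0 for values below the first cut point and 3 for values above the last one.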

3 Web server

The web server of Brewery allows the prediction of some or all of the 1D PSAs in one go. Up to roughly 200 protein sequences (or 64 KB) can be submitted at the same time. There is no limit on the total number of submissions; the confirmation page shows the server load and the URL of the result page. If an email address is provided, all the information on the result page will be sent by email when available. The result page/email contains a recap of the query sequence along with the predictions, and respective confidence levels, of the 1D PSAs of choice, with a quick explanation of the output format. The server implements ‘Brewery’ in Table 1. At the time of writing, the Brewery web server had processed over 70 000 queries, with an average response time of 24 min.
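Larger sequence collections need to be split before submission. The sketch below groups FASTA records into batches that respect the two limits quoted above. It treats "roughly 200" sequences as a hard cap and assumes 64 KB means 65 536 bytes of FASTA text; how the server actually measures these limits is not specified, and parse_fasta and batch_records are hypothetical helper names, not part of Brewery.

```python
from typing import Iterable, Iterator, List, Tuple

MAX_SEQUENCES = 200       # "roughly 200" in the text; used as a hard cap (assumption)
MAX_BYTES = 64 * 1024     # 64 KB, assuming 1 KB = 1024 bytes


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def batch_records(
    records: Iterable[Tuple[str, str]],
    max_sequences: int = MAX_SEQUENCES,
    max_bytes: int = MAX_BYTES,
) -> Iterator[str]:
    """Yield FASTA-formatted batches that stay within both limits."""
    batch: List[str] = []
    size = 0
    for header, sequence in records:
        entry = f">{header}\n{sequence}\n"
        entry_size = len(entry.encode("utf-8"))
        if entry_size > max_bytes:
            raise ValueError(f"record {header!r} alone exceeds {max_bytes} bytes")
        # Start a new batch when adding this record would break either limit.
        if batch and (len(batch) >= max_sequences or size + entry_size > max_bytes):
            yield "".join(batch)
            batch, size = [], 0
        batch.append(entry)
        size += entry_size
    if batch:
        yield "".join(batch)
```

Each yielded string can then be submitted as one query; a set of 450 short sequences, for example, would be split into batches of 200, 200 and 50 records.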

4 Standalone

The standalone of Brewery is accessible on both GitHub and Docker Hub. The standalone allows more flexibility than the web server, including using a slower but slightly more accurate ensemble of UniProt20, UniRef90 and BFD models (‘Brewery+’ in Table 1) and a significantly faster HHblits-only pipeline (‘BreweryH’ in Table 1). The standalone is freely released for non-commercial purposes and, given appropriate credit, may also be modified. The very light standalone requires only 30 MB of space.

Acknowledgements

The authors acknowledge the Research IT Service at University College Dublin for providing HPC resources that have contributed to the research results reported within this article. The authors are grateful to Michael Schantz Klausen and Prof. Bent Petersen for NetSurfP-2.0’s training set, and Prof. Yaoqi Zhou for SPOT-1D’s training set.

Funding

The work of M.T. was supported by the Irish Research Council [GOIPG/2015/3717] and the UCD School of Computer Science Bursary.

Conflict of Interest: none declared.

References

Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.

Baú D. et al. (2006) Distill: a suite of web servers for the prediction of one-, two- and three-dimensional structural features of proteins. BMC Bioinformatics, 7, 402.

Berman H.M. et al. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235-242.

Dill K.A., MacCallum J.L. (2012) The protein-folding problem, 50 years on. Science, 338, 1042-1046.

Fang C. et al. (2019) MUFold-SSW: a new web server for predicting protein secondary structures, torsion angles and turns. Bioinformatics.

Hanson J. et al. (2018) Improving prediction of protein secondary structure, backbone angles, solvent accessibility and contact numbers by using predicted contact maps and an ensemble of recurrent and residual convolutional neural networks. Bioinformatics.

Hou J. et al. (2019) Protein tertiary structure modeling driven by deep learning and contact distance prediction in CASP13. Proteins, 87, 1165-1178.

Kabsch W., Sander C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577-2637.

Kaleel M. et al. (2019) PaleAle 5.0: prediction of protein relative solvent accessibility by deep learning. Amino Acids, 51, 1289-1296.

Klausen M.S. et al. (2019) NetSurfP-2.0: improved prediction of protein structural features by integrated deep learning. Proteins, 87, 520-527.

Kurgan L., Disfani F.M. (2011) Structural protein descriptors in 1-dimension and their sequence-based predictions. Curr. Protein Pept. Sci., 12, 470-489.

Mirdita M. et al. (2017) Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res., 45(Database issue), D170-D176.

Senior A.W. et al. (2019) Protein structure prediction using multiple deep neural networks in CASP13. Proteins, 87, 1141-1148.

Sims G.E. et al. (2005) Protein conformational space in higher order maps. Proc. Natl. Acad. Sci. USA, 102, 618-621.

Steinegger M. et al. (2019a) HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics, 20, 473.

Steinegger M. et al. (2019b) Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nat. Methods, 16, 603-606.

The UniProt Consortium (2016) UniProt: the universal protein knowledgebase. Nucleic Acids Res., 45, D158-D169.

Torrisi M., Pollastri G. (2019) Protein structure annotations. In: Shaik,N.A. et al. (eds) Essentials of Bioinformatics, Volume I: Understanding Bioinformatics: Genes to Proteins. Springer International Publishing, Cham, pp. 201-234.

Torrisi M. et al. (2019) Deeper profiles and cascaded recurrent and convolutional neural networks for state-of-the-art protein secondary structure prediction. Sci. Rep., 9, 1-12.

Wu T. et al. (2019) Analysis of several key factors influencing deep learning-based inter-residue contact prediction. Bioinformatics.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Arne Elofsson