Brewery: deep learning and deeper profiles for the prediction of 1D protein structure annotations

Abstract

Motivation

Protein structural annotations (PSAs) are essential abstractions to deal with the prediction of protein structures. Many increasingly sophisticated PSAs have been devised in the last few decades. However, the need for annotations that are easy to compute, process and predict has not diminished. This is especially true for protein structures that are hardest to predict, such as novel folds.

Results

We propose Brewery, a suite of ab initio predictors of 1D PSAs. Brewery uses multiple sources of evolutionary information to achieve state-of-the-art predictions of secondary structure, structural motifs, relative solvent accessibility and contact density.

Availability and implementation

The web server, standalone program, Docker image and training sets of Brewery are available at http://distilldeep.ucd.ie/brewery/.

Contact

gianluca.pollastri@ucd.ie

1 Introduction

The complexity, wide variability and ultimately the sheer number of diverse protein structures present in nature make the characterization of proteins extremely expensive and complex. For this reason, considerable effort has gone into predicting protein structures by computational means (Dill and MacCallum, 2012), often in the form of abstractions that simplify the prediction while still retaining structural information (Torrisi and Pollastri, 2019). These abstractions, or protein structural annotations (PSAs), are 1D when they can be represented by a string or a sequence of numbers of the same length as the protein’s primary structure, i.e. the chain of amino acids (AA) composing the protein. Secondary structure (SS; Torrisi et al., 2019) and relative solvent accessibility (RSA; Kaleel et al., 2019) are among the most widely adopted structural annotations, with various definitions of torsion angles/structural motifs and contact density (CD) having also been predicted (Kurgan and Disfani, 2011; Torrisi and Pollastri, 2019).

Although there is increasing interest in predicting 2D structural annotations such as contact and distance maps (Senior et al., 2019), 1D PSAs remain a central component of many such predictors (Hou et al., 2019; Senior et al., 2019), especially in the hardest prediction cases (Wu et al., 2019). Because of this, many efforts are ongoing to harness all the protein data available, e.g. the Protein Data Bank (PDB; Berman et al., 2000) and UniProt (The UniProt Consortium, 2016), together with increasingly sophisticated algorithms for sequence similarity detection (Steinegger et al., 2019a), to predict PSAs more reliably (Fang et al., 2019; Hanson et al., 2018; Kaleel et al., 2019; Klausen et al., 2019; Torrisi et al., 2019). Here, we introduce Brewery, a suite of PSA predictors based on deep learning algorithms that are fast, highly accurate, compact and portable, and that are free to access via a web interface and to download.

2 Materials and methods

Brewery predicts four PSAs: SS; RSA; CD (Baú et al., 2006); and structural motifs obtained from clustering of consecutive torsion angles (TA; Sims et al., 2005). Brewery’s training set is built starting from the PDB (Berman et al., 2000) released in December 2014, imposing an internal redundancy threshold of 25% sequence identity. This resulted in 15 753 proteins, comprising 3 797 426 AA, one of the largest sets built to date, which we used for both training and validation purposes (in cross-validation). We also built an independent test set (Torrisi et al., 2019) to compare Brewery against some of the most recent predictors, i.e. MUFold-SSW (Fang et al., 2019), NetSurfP-2.0 (Klausen et al., 2019) and SPOT-1D (Hanson et al., 2018), starting from the PDB released in June 2019. This set was redundancy-reduced at 25% sequence identity, both internally and against the training sets of Brewery, NetSurfP-2.0 and SPOT-1D, and contains 618 proteins (91 375 AA). We call the resulting set 2019_test and report results observed on this set for SS (in both three and eight classes), TA, RSA and CD in Table 1.

Table 1.

Most recent predictors assessed on 2019_test set of 618 proteins

Predictor	SS—Q3 (%)	SS—Q8 (%)	TA—Q14 (%)	RSA—Q4 (%)	CD—Q4 (%)
Brewery+	81.8	68.6	65.8	52.6	49.4
Brewery	81.7	68.5	65.8	52.4	49.1
BreweryH	81	67.5	64.5	52	48.6
MUFold-SSW	81.1	68.2	64.5	N/A	N/A
NetSurfP-2.0	81.3	67.9	63	49	N/A
SPOT-1D	82.1	69.7	65.4	49.9	N/A

Note: Brewery employs evolutionary information from UniProt20 (via HHblits) and UniRef90 (via PSI-BLAST); Brewery+ also includes profiles from an HHblits search on BFD. BreweryH uses only HHblits profiles from UniProt20.
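The Q scores in Table 1 are class-wise accuracies: Qk is conventionally the percentage of residues assigned to the correct one of k classes. A minimal sketch of that computation follows; the function name and the toy strings are illustrative assumptions, and whether the published figures are pooled over all residues or averaged per protein is not stated here.

```python
def q_accuracy(observed: str, predicted: str) -> float:
    """Percentage of positions where the predicted class matches the observed one."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


# Toy three-class example (H = helix, E = sheet, C = coil); not real data.
observed = "HHHHCCEEEE"
predicted = "HHHCCCEEEC"
print(f"Q3 = {q_accuracy(observed, predicted):.1f}%")  # 8 of 10 residues match
```

The same function applies unchanged to Q8, Q14 or Q4, as long as each class is encoded as a single character.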

We use both PSI-BLAST (Altschul et al., 1997) and HHblits (Steinegger et al., 2019a) to gather evolutionary information. PSI-BLAST is run on the May 2016 release of UniRef90 (The UniProt Consortium, 2016) and HHblits on both the February 2016 release of UniProt20 (Mirdita et al., 2017) and the March 2019 release of BFD (Steinegger et al., 2019b). We encode evolutionary information by applying a weighting scheme and a novel clipping technique, and feed the resulting encodings to ensembles of Cascaded Bidirectional Recurrent and Convolutional Neural Networks (Kaleel et al., 2019; Torrisi et al., 2019).
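To make "evolutionary information" concrete, the sketch below builds a plain per-column amino acid frequency profile from a toy alignment. It is a generic, unweighted baseline only: it does not reproduce the weighting scheme or the clipping technique mentioned above, and the function name and toy alignment are assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_profile(alignment: list[str]) -> list[dict[str, float]]:
    """Per-column amino acid frequencies; gaps and unknown symbols are ignored."""
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile


# Toy alignment of three sequences; '-' marks a gap.
profile = frequency_profile(["ACD", "ACE", "A-D"])
print(round(profile[2]["D"], 3), round(profile[2]["E"], 3))
```

Each position of the query then carries a 20-dimensional vector rather than a single residue identity, which is the kind of input a sequence-based network can consume.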

We define both RSA and SS using DSSP (Kabsch and Sander, 1983), classifying the former into four classes, roughly equally distributed, and the latter into both eight states (G, H, I, E, B, S, T and ‘.’), and three states, i.e. Helices (G, H and I), Sheets (E and B) and Coils (‘.’, S and T). We classify TA into 14 structural motifs derived from clusters described in Sims et al. (2005) and CD into four states, i.e. very low density, low density, high density and very high density (Baú et al., 2006).
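The reduction from eight DSSP states to three classes described above can be sketched as a simple lookup. The output letters H, E and C for helix, sheet and coil are a conventional choice assumed here, as are the function and variable names.

```python
# Eight DSSP states mapped to three classes, following the grouping in the text:
# Helices (G, H, I), Sheets (E, B) and Coils ('.', S, T).
DSSP8_TO_3 = {
    "G": "H", "H": "H", "I": "H",
    "E": "E", "B": "E",
    ".": "C", "S": "C", "T": "C",
}


def reduce_to_three_states(ss8: str) -> str:
    """Convert an eight-state secondary structure string to three states."""
    try:
        return "".join(DSSP8_TO_3[state] for state in ss8)
    except KeyError as err:
        raise ValueError(f"unknown DSSP state: {err.args[0]!r}") from None


# Toy eight-state string; not taken from a real protein.
print(reduce_to_three_states("..GGGHHHHTTEEEEBS."))  # CCHHHHHHHCCEEEEECC
```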

3 Web server

The web server of Brewery allows the prediction of some or all of the 1D PSAs in one go. Up to roughly 200 protein sequences (or 64 KB) can be submitted at the same time. There is no limit on the total number of submissions; the confirmation page shows the server load and the URL of the result page. If an email address is provided, all the information on the result page is also sent by email when available. The result page/email contains a recap of the query sequence along with the predictions, and respective confidence levels, of the chosen 1D PSAs, with a brief explanation of the output format. The server implements ‘Brewery’ in Table 1. At the time of writing, the Brewery web server had processed over 70 000 queries, with an average response time of 24 min.
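Before submitting a large batch, it may help to check it against the limits quoted above. The sketch below assumes FASTA-formatted input and treats 64 KB as 65 536 bytes; both are assumptions, as are the constant and function names, and the server's own checks may differ since the sequence limit is only described as "roughly 200".

```python
MAX_SEQUENCES = 200      # "roughly 200" in the text, so treat as approximate
MAX_BYTES = 64 * 1024    # 64 KB, assuming 1 KB = 1024 bytes


def count_fasta_records(text: str) -> int:
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


def fits_submission_limits(text: str) -> bool:
    """True if the batch stays within both the sequence and the size limit."""
    size = len(text.encode("utf-8"))
    return count_fasta_records(text) <= MAX_SEQUENCES and size <= MAX_BYTES


# Toy batch of two short sequences; not real proteins.
batch = ">seq1\nMKTAYIAKQR\n>seq2\nGSHMLEDPVA\n"
print(count_fasta_records(batch), fits_submission_limits(batch))
```

Batches that exceed either limit can be split into several submissions, since the total number of submissions is not capped.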

4 Standalone

The standalone version of Brewery is available on both GitHub and Docker Hub. It allows more flexibility than the web server, including the use of a slower but slightly more accurate ensemble of UniProt20, UniRef90 and BFD models (‘Brewery+’ in Table 1) and a significantly faster HHblits-only pipeline (‘BreweryH’ in Table 1). The standalone is freely released for non-commercial purposes and, given appropriate credit, may also be modified. It is very light, requiring only 30 MB of space.

Acknowledgements

The authors acknowledge the Research IT Service at University College Dublin for providing HPC resources that have contributed to the research results reported within this article. The authors are grateful to Michael Schantz Klausen and Prof. Bent Petersen for NetSurfP-2.0’s training set, and Prof. Yaoqi Zhou for SPOT-1D’s training set.

Funding

The work of M.T. was supported by the Irish Research Council [GOIPG/2015/3717] and the UCD School of Computer Science Bursary.

Conflict of Interest: none declared.

References

Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

Baú D. et al. (2006) Distill: a suite of web servers for the prediction of one-, two- and three-dimensional structural features of proteins. BMC Bioinformatics, 7, 402.

Berman H.M. et al. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.

Dill K.A., MacCallum J.L. (2012) The protein-folding problem, 50 years on. Science, 338, 1042–1046.

Fang C. et al. (2019) MUFold-SSW: a new web server for predicting protein secondary structures, torsion angles and turns. Bioinformatics.

Hanson J. et al. (2018) Improving prediction of protein secondary structure, backbone angles, solvent accessibility and contact numbers by using predicted contact maps and an ensemble of recurrent and residual convolutional neural networks. Bioinformatics.

Hou J. et al. (2019) Protein tertiary structure modeling driven by deep learning and contact distance prediction in CASP13. Proteins, 87, 1165–1178.

Kabsch W., Sander C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577–2637.

Kaleel M. et al. (2019) PaleAle 5.0: prediction of protein relative solvent accessibility by deep learning. Amino Acids, 51, 1289–1296.

Klausen M.S. et al. (2019) NetSurfP-2.0: improved prediction of protein structural features by integrated deep learning. Proteins, 87, 520–527.

Kurgan L., Disfani F.M. (2011) Structural protein descriptors in 1-dimension and their sequence-based predictions. Curr. Protein Pept. Sci., 12, 470–489.

Mirdita M. et al. (2017) Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res., 45(Database issue), D170–D176.

Senior A.W. et al. (2019) Protein structure prediction using multiple deep neural networks in CASP13. Proteins, 87, 1141–1148.

Sims G.E. et al. (2005) Protein conformational space in higher order maps. Proc. Natl. Acad. Sci. USA, 102, 618–621.

Steinegger M. et al. (2019a) HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics, 20, 473.

Steinegger M. et al. (2019b) Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nat. Methods, 16, 603–606.

The UniProt Consortium. (2016) UniProt: the universal protein knowledgebase. Nucleic Acids Res., 45, D158–D169.

Torrisi M., Pollastri G. (2019) Protein structure annotations. In: Shaik,N.A. et al. (eds) Essentials of Bioinformatics, Volume I: Understanding Bioinformatics: Genes to Proteins. Springer International Publishing, Cham, pp. 201–234.

Torrisi M. et al. (2019) Deeper profiles and cascaded recurrent and convolutional neural networks for state-of-the-art protein secondary structure prediction. Sci. Rep., 9, 1–12.

Wu T. et al. (2019) Analysis of several key factors influencing deep learning-based inter-residue contact prediction. Bioinformatics.

© The Author(s) 2020. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
