Increasing the accuracy of single sequence prediction methods using a deep semi-supervised learning framework

Aydin

Z.

et al. (

2006

)

Protein secondary structure prediction for a single-sequence using hidden semi-Markov models

.

BMC Bioinformatics

,

7

,

178

.

Berthelot

D.

et al. (

2019

) Mixmatch: a holistic approach to semi-supervised learning. Advances in Neural Information Processing Systems, pp.

5049

–

5059

.

Bidargaddi

N.P.

et al. (

2009

)

Combining segmental semi-Markov models with neural networks for protein secondary structure prediction

.

Neurocomputing

,

72

,

3943

–

3950

.

Blum

M.

et al. (

2021

)

The InterPro protein families and domains database: 20 years on

.

Nucleic Acids Res

.,

49

,

D344

–

D354

. [CrossRef] [10.1093/nar/gkaa977]

Buchan

D.W.

,

Jones

D.T.

(

2019

)

The psipred protein analysis workbench: 20 years on

.

Nucleic Acids Res

.,

47

,

W402

–

W407

.

Burley

S.K.

et al. (

2019

)

RCSB protein data bank: biological macromolecular structures enabling research and education in fundamental biology, biomedicine, biotechnology and energy

.

Nucleic Acids Res

.,

47

,

D464

–

D474

.

Carradec

Q.

et al. ; Tara Oceans Coordinators. (

2018

)

A global ocean atlas of eukaryotic genes

.

Nat. Commun

.,

9

,

1

–

13

.

Cole

C.

et al. (

2008

)

The JPRED 3 secondary structure prediction server

.

Nucleic Acids Res

.,

36

,

W197

–

W201

.

Cuff

J.A.

,

Barton

G.J.

(

1999

)

Evaluation and improvement of multiple sequence methods for protein secondary structure prediction

.

Proteins Struct. Funct. Bioinf

.,

34

,

508

–

519

.

Cuff

J.A.

et al. (

1998

)

JPRED: a consensus secondary structure prediction server

.

Bioinformatics (Oxford, England)

,

14

,

892

–

893

.

PubMed

OpenURL Placeholder Text

Dai

Z.

et al. (

2019

)

Transformer-xl: attentive language models beyond a fixed-length context

. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988.

Dana

J.M.

et al. (

2019

)

Sifts: updated structure integration with function, taxonomy and sequences resource allows 40-fold increase in coverage of structure-based annotations for proteins

.

Nucleic Acids Res

.,

47

,

D482

–

D489

.

Devlin

J.

et al. (

2019

) Bert: pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pp.

4171

–

4186

.

Eddy

S.R.

(

2011

)

Accelerated profile HMM searches

.

PLoS Comput. Biol

.,

7

,

e1002195

.

Edwards

R.A.

,

Rohwer

F.

(

2005

)

Viral metagenomics

.

Nat. Rev. Microbiol

.,

3

,

504

–

510

.

El-Gebali

S.

et al. (

2019

)

The pfam protein families database in 2019

.

Nucleic Acids Res

.,

47

,

D427

–

D432

.

Frishman

D.

,

Argos

P.

(

1996

)

Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence

.

Protein Eng. Des. Select

.,

9

,

133

–

142

.

Greener

J.G.

et al. (

2019

)

Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints

.

Nat. Commun

.,

10

,

1

–

13

.

Hanson

J.

et al. (

2019

)

Improving prediction of protein secondary structure, backbone angles, solvent accessibility and contact numbers by using predicted contact maps and an ensemble of recurrent and residual convolutional neural networks

.

Bioinformatics

,

35

,

2403

–

2410

.

Hanumanthappa

A.K.

et al. (

2021

)

Single-sequence and profile-based prediction of RNA solvent accessibility using dilated convolutional neural network

.

Bioinformatics

,

36

,

5169

–

5176

. [CrossRef][10.1093/bioinformatics/btaa652].

Heffernan

R.

et al. (

2018

)

Single-sequence-based prediction of protein secondary structures and solvent accessibility by deep whole-sequence learning

.

J. Comput. Chem

.,

39

,

2210

–

2216

.

Heinzinger

M.

et al. (

2019

)

Modeling aspects of the language of life through transfer-learning protein sequences

.

BMC Bioinformatics

,

20

,

723

.

Hinton

G.

et al. (

2015

)

Distilling the knowledge in a neural network

. NIPS 2014 Deep Learning Workshop.

Hochreiter

S.

,

Schmidhuber

J.

(

1997

)

Long short-term memory

.

Neural Comput

.,

9

,

1735

–

1780

.

Jones

D.T.

(

1999

)

Protein secondary structure prediction based on position-specific scoring matrices

.

J. Mol. Biol

.,

292

,

195

–

202

.

Jones

D.T.

(

2019

)

Setting the standards for machine learning in biology

.

Nat. Rev. Mol. Cell Biol

.,

20

,

659

–

660

.

Jones

D.T.

,

Swindells

M.B.

(

2002

)

Getting the most from psi–blast

.

Trends Biochem. Sci

.,

27

,

161

–

164

.

Kabsch

W.

,

Sander

C.

(

1983

)

DSSP: definition of secondary structure of proteins given a set of 3D coordinates

.

Biopolymers

,

22

,

2577

–

2637

.

Kandathil

S.M.

et al. (

2019a

)

Prediction of interresidue contacts with DeepMetaPSICOV in CASP13

.

Proteins Struct. Funct. Bioinf

.,

87

,

1092

–

1099

.

Kandathil

S.M.

et al. (

2019b

)

Recent developments in deep learning applied to protein structure prediction

.

Proteins Struct. Funct. Bioinf

.,

87

,

1179

–

1189

.

Koga

N.

et al. (

2012

)

Principles for designing ideal protein structures

.

Nature

,

491

,

222

–

227

.

Lee

D.-H.

(

2013

) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In: Workshop on Challenges in Representation Learning, ICML 3, 896.

Levitt

M.

(

2009

)

Nature of the protein universe

.

Proc. Natl. Acad. Sci. USA

,

106

,

11079

–

11084

.

Li

Z.

,

Yu

Y.

(

2016

) Protein secondary structure prediction using cascaded convolutional and recurrent neural networks. Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 2560–2567.

Marcos

E.

,

Silva

D.-A.

(

2018

)

Essentials of de novo protein design: methods and applications

.

Wiley Interdiscip. Rev. Comput. Mol. Sci

.,

8

,

e1374

.

Meiler

J.

,

Baker

D.

(

2003

)

Coupled prediction of protein secondary and tertiary structure

.

Proc. Natl. Acad. Sci

.,

100

,

12105

–

12110

.

Merity

S.

et al. (

2018

) Regularizing and optimizing LSTM language models. International Conference on Learning Representations, 2018.

Mirabello

C.

,

Pollastri

G.

(

2013

)

Porter, PaleAle 4.0: high-accuracy prediction of protein secondary structure and relative solvent accessibility

.

Bioinformatics

,

29

,

2056

–

2058

.

Mirdita

M.

et al. (

2017

)

Uniclust databases of clustered and deeply annotated protein sequences and alignments

.

Nucleic Acids Res

.,

45

,

D170

–

D176

.

Mitchell

A.L.

et al. (

2020

)

MGnify: the microbiome analysis resource in 2020

.

Nucleic Acids Res

.,

48

,

D570

–

D578

.

PubMed

OpenURL Placeholder Text

Mokili

J.L.

et al. (

2012

)

Metagenomics and future perspectives in virus discovery

.

Curr. Opin. Virol

.,

2

,

63

–

77

.

Orengo

C.A.

et al. (

1997

)

Cath—a hierarchic classification of protein domain structures

.

Structure

,

5

,

1093

–

1109

.

Ovchinnikov

S.

et al. (

2017

)

Protein structure determination using metagenome sequence data

.

Science

,

355

,

294

–

298

.

Perdigão

N.

et al. (

2015

)

Unexpected features of the dark proteome

.

Proc. Natl. Acad. Sci. USA

,

112

,

15898

–

15903

.

Peters

M.E.

et al. (

2018

) Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, pp.

2227

–

2237

.

Remmert

M.

et al. (

2011

)

HHblits: lightning-fast iterative protein sequence searching by HMM–HMM alignment

.

Nat. Methods

,

9

,

173

–

175

.

Riesselman

A.J.

et al. (

2018

)

Deep generative models of genetic variation capture the effects of mutations

.

Nat. Methods

,

15

,

816

–

822

.

Rost

B.

(

2001

)

Protein secondary structure prediction continues to rise

.

J. Struct. Biol

.,

134

,

204

–

218

.

Rost

B.

,

Sander

C.

(

1993

)

Prediction of protein secondary structure at better than 70% accuracy

.

J. Mol. Biol

.,

232

,

584

–

599

.

Schmidler

S.C.

et al. (

2000

)

Bayesian segmentation of protein secondary structure

.

J. Comput. Biol

.,

7

,

233

–

248

.

Senior

A.W.

et al. (

2020

)

Improved protein structure prediction using potentials from deep learning

.

Nature

,

577

,

706

–

710

.

Sillitoe

I.

et al. (

2019

)

Cath: expanding the horizons of structure-based functional annotations for genome sequences

.

Nucleic Acids Res

.,

47

,

D280

–

D284

.

Sohn

K.

et al. (

2020

) Fixmatch: simplifying semi-supervised learning with consistency and confidence. Advances in Neural Information Processing Systems, 33.

Steinegger

M.

,

Söding

J.

(

2017

)

MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets

.

Nat. Biotechnol

.,

35

,

1026

–

1028

.

Steinegger

M.

,

Söding

J.

(

2018

)

Clustering huge protein sequence sets in linear time

.

Nat. Commun

.,

9

,

1

–

8

.

Steinegger

M.

et al. (

2019

)

Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold

.

Nat. Methods

,

16

,

603

–

606

.

Suzek

B.E.

et al. ; The UniProt Consortium. (

2015

)

Uniref clusters: a comprehensive and scalable alternative for improving sequence similarity searches

.

Bioinformatics

,

31

,

926

–

932

.

Torrisi

M.

et al. (

2018

) Porter 5: fast, state-of-the-art ab initio prediction of protein secondary structure in 3 and 8 classes. bioRxiv, 289033.

Torrisi

M.

et al. (

2019

)

Deeper profiles and cascaded recurrent and convolutional neural networks for state-of-the-art protein secondary structure prediction

.

Sci. Rep

.,

9

,

1

–

12

.

UniProt-Consortium. (

2019

)

Uniprot: a worldwide hub of protein knowledge

.

Nucleic Acids Res

.,

47

,

D506

–

D515

.

PubMed

Wang

G.

,

Dunbrack

R.L.

(

2003

)

Pisces: a protein sequence culling server

.

Bioinformatics

,

19

,

1589

–

1591

.

Yang

K.K.

et al. (

2019

)

Machine-learning-guided directed evolution for protein engineering

.

Nat. Methods

,

16

,

687

–

694

.

Yang

Y.

et al. (

2018

)

Sixty-five years of the long march in protein secondary structure prediction: the final stretch?

Brief. Bioinf

.,

19

,

482

–

494

.

OpenURL Placeholder Text