Abstract

Motivation

Understanding the protein sequence-function relationship is essential for advancing protein biology and engineering. However, fewer than 1% of known protein sequences have human-verified functions. While deep learning methods have demonstrated promise for protein function prediction, current models are limited to predicting only those functions on which they were trained.

Results

Here, we introduce ProtNote, a multimodal deep learning model that leverages free-form text to enable both supervised and zero-shot protein function prediction. ProtNote not only maintains near state-of-the-art performance for annotations in its training set, but also generalizes to unseen and novel functions in zero-shot test settings. Compared to baseline models, ProtNote demonstrates superior performance in predicting novel Gene Ontology (GO) annotations and Enzyme Commission (EC) numbers by capturing nuanced sequence-function relationships that unlock a range of biological use cases inaccessible to prior models. We envision that ProtNote will enhance protein function discovery by enabling scientists to use free-text inputs without restriction to predefined labels, a necessary capability for navigating the dynamic landscape of protein biology.

Availability and Implementation

The code is available on GitHub: https://github.com/microsoft/protnote; model weights, datasets, and evaluation metrics are provided via Zenodo: https://zenodo.org/records/13897920.

Supplementary Information

Supplementary Information is available at Bioinformatics online.

Accepted manuscripts
Accepted manuscripts are PDF versions of the author’s final manuscript, as accepted for publication by the journal but prior to copyediting or typesetting. They can be cited using the author(s), article title, journal title, year of online publication, and DOI. They will be replaced by the final typeset articles, which may therefore contain changes. The DOI will remain the same throughout.

Author notes

Current affiliation for Nathaniel Corley: Institute for Protein Design, University of Washington, Seattle, WA 98195, USA

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Associate Editor: Pier Luigi Martelli