GPSFun: geometry-aware protein sequence function predictions with language models

Abstract Knowledge of protein function is essential for elucidating disease mechanisms and discovering new drug targets. However, there is a widening gap between the exponential growth of protein sequences and their limited function annotations. In our prior studies, we have developed a series of methods including GraphPPIS, GraphSite, LMetalSite and SPROF-GO for protein function annotations at residue or protein level. To further enhance their applicability and performance, we now present GPSFun, a versatile web server for Geometry-aware Protein Sequence Function annotations, which equips our previous tools with language models and geometric deep learning. Specifically, GPSFun employs large language models to efficiently predict 3D conformations of the input protein sequences and extract informative sequence embeddings. Subsequently, geometric graph neural networks are utilized to capture the sequence and structure patterns in the protein graphs, facilitating various downstream predictions including protein–ligand binding sites, gene ontologies, subcellular locations and protein solubility. Notably, GPSFun achieves superior performance to state-of-the-art methods across diverse tasks without requiring multiple sequence alignments or experimental protein structures. GPSFun is freely available to all users at https://bio-web1.nscc-gz.cn/app/GPSFun with user-friendly interfaces and rich visualizations.


Note S1. Details of the benchmark datasets
The benchmark datasets for assessing binding site predictions of DNA, RNA, peptide, ATP, and HEM are compiled from BioLiP as discussed in Material and Methods.BioLiP is a database of biologically relevant protein-ligand complexes primarily from PDB, in which a binding residue is defined if the smallest atomic distance between the target residue and the ligand is < 0.5 Å plus the sum of the Van der Waal's radius of the two nearest atoms.Besides, the benchmark dataset of protein-protein binding sites is directly from (1), which contains non-redundant (pairwise sequence identity < 25%) transient heterodimeric protein complexes dated up to May 2021 collected from PDB.A protein-protein binding residue is defined as a surface residue (relative solvent accessibility > 5%) that lost more than 1 Å 2 absolute solvent accessibility after protein-protein complex formation.The benchmark datasets of metal ion (Zn 2+ , Ca 2+ , Mg 2+ and Mn 2+ ) binding sites are directly from (2), which contain non-redundant (pairwise sequence identity < 25%) proteins dated up to 29 December 2021 from BioLiP.Proteins released before 1 January 2020 are used for training, while those released thereafter are used for testing.Combining all these ten datasets results in a total of 8441 training sequences and 1838 test sequences.Details of the statistics of these binding site benchmark datasets are given in Table S1.
To evaluate GO function predictions, we adopted the benchmark datasets proposed in (3), in which the training and test sets were collected following the standard protocol of the critical assessment of functional annotation (CAFA) competitions (4)(5)(6).Specifically, the GO term annotations were extracted and combined from Swiss-Prot (7), GOA (8) and GO (9) in January 2020.Only experimental annotations with the following evidence codes were kept: IDA, IPI, EXP, IGI, IMP, IEP, IC or TA.The annotations were further up-propagated based on the "is-a" relationship in the hierarchical structure of GO, and the root GO terms were omitted.Then, the training, validation and test sets were split according to the annotation time stamps.The training sets contain proteins annotated before January 2018, while the validation and test sets contain no-knowledge proteins annotated from January to December 2018 and from January 2019 to January 2020, respectively.In the training steps, only the GO terms with enough training samples (≥ 50 sequences) were considered, resulting in 790, 4766 and 667 classes for the molecular function (MF), biological process (BP), and cellular component (CC) sub-ontology.In the evaluation phase, we considered all terms to ensure fair comparisons with other methods.Table S2 shows the detailed statistics of the datasets for GO.
To evaluate subcellular localization prediction, we adopted the benchmark datasets from (10).The training set was originally extracted from UniProt (11) (release 2021_03), where the localization annotations were filtered using the following criteria: eukaryotic, not fragments, encoded in the nucleus, > 40 amino acids, and experimentally annotated (ECO:0000269) subcellular localizations.The proteins can be categorized into one or multiple of the ten locations: Cytoplasm, Nucleus, Cell membrane, Extracellular, Mitochondrion, Endoplasmic reticulum, Lysosome/Vacuole, Golgi apparatus, Plastid, and Peroxisome.The test set was derived from the Human Protein Atlas (HPA) project (12), in which only annotations with reliability labels of "Enhanced" and "Supported" were kept.
Redundant sequences sharing identity > 30% with the training set were removed.Tables S3 and S4 present the statistics of these two subcellular localization datasets.
To evaluate protein solubility prediction, we adopted the benchmark datasets in (13).The training set was derived from a subset of the TargetTrack database by the Protein Structure Initiative (PSI) (14), in which the percentage of soluble proteins is ~66%.The North East Structural Consortium expressed proteins in E. coli using a unified production pipeline and provided integer scores (0-5) for both expression (E) and solubility (S) (15).The highly expressed proteins (E = 4 or 5) with high solubility scores (S = 4 or 5) were defined as the soluble proteins for the test set while proteins with low solubility scores (S = 0) were defined as insoluble.Using this definition, soluble proteins are ~64% of the test set.The test set was ensured not to have any sequences sharing sequence identity > 25% with the training set.Table S3 presents the statistics of these two solubility datasets.

Note S2. The geometric featurizer
GPSFun represents the protein as a radius graph derived from the Cα coordinates of the residues, where the radius is equal to 15 Å.An end-to-end featurizer is utilized for geometric feature extraction similar to (16), except that we additionally encode the sidechain conformations of the residues.Specifically, a local coordinate system is first defined at each residue based on the relative position of the Cα atom to other backbone atoms.Then, several geometric node and edge features are derived to capture the arrangements of backbone and sidechain atoms in or between residues.
(i) Local coordinate system.We define a local coordinate system   = [  ,   ,   ×   ] for residue , where   is the negative bisector of the angle formed by the N, Cα, and C atoms, and   is a unit vector normal to this plane.
Formally, we have: Based on the local coordinate systems, we could construct geometric features that are invariant to rotation and translation for single or pair of residues.
(ii) Geometric node features.GPSFun constructs distance, direction and angle features for each residue.consider relative directions of all atoms in residue  to    , namely  ∈ �  ,    ,   ,   ,   �.To reflect the relative spatial rotation between the two reference frames of residues  and , the angle features q�     � are employed, where q(•) is the quaternion encoding function representing 3D rotation matrices as four-element vectors (17).

Note S3. Implementations of the baseline methods
Here we only discuss and compare GPSFun with the latest state-of-the-art methods in our benchmark experiments.

MaSIF-site:
The predictions of MaSIF-site for the test proteins were directly obtained from our previous work (1), which were originally generated by the standalone program with pre-trained model weights through a docker container from https://github.com/lpdi-epfl/masif.

GraphPPIS:
The results of GraphPPIS were directly obtained from our previous study (1).

IonCom:
The results of IonCom were directly obtained from our previous study (2), which were originally obtained using its standalone program with pre-trained weights from https://zhanggroup.org/IonCom/.
In addition, we also implemented a simple baseline using a transformer (34) model solely fed with the PSSM and HMM evolutionary profiles from MSA (1,24).This model was trained separately for different ligands (i.e., in a single-task fashion), and the results are shown in Table S6.
For GO predictions, we compared GPSFun with state-of-the-art sequence-based methods DeepGOPlus (35), GOLabeler (36) and SPROF-GO (37), as well as protein-protein interaction network-based methods DeepGraphGO (3) and NetGO (38).In addition, two simple baseline methods named BLAST-KNN and Foldseek-KNN were also implemented as follows: The idea of BLAST-KNN is that similar proteins may have similar functions.For a test protein   , BLAST (39) is run with an e-value cut-off of 0.001 to search   against all proteins in the training set to obtain its homologous protein set   .Then the probability score between   and a GO term   is computed as follows: where (  , ) is the bit-score between   and , and �,   � is a binary indicator which equals to 1 if protein  has the function   , or equals to 0 otherwise.The results of all these methods were directly obtained from previous works (3,37).Foldseek-KNN is similar to BLAST-KNN, except that Foldseek (40) is run to search the ESMFoldpredicted structure of   against all protein structures in the training set, and TM-score is adopted as similarity measurement instead of bit-score.
Note that the results of SoluProt, SWI and NetSolP that we obtained are consistent with those reported in (13).We also implemented two simple baselines, BLAST-KNN and Foldseek-KNN, as in the GO prediction tasks.

Note S4. Evaluation metrics
Following the previous studies, we employ recall (Rec), precision (Pre), accuracy (Acc), F1-score (F1), Matthews correlation coefficient (MCC), area under the receiver operating characteristic curve (AUC), and area under the precision-recall curve (AUPR) to evaluate the performance of protein-ligand binding site predictions: where true positives (TP) and true negatives (TN) denote the numbers of correctly predicted binding and non-binding residues, and false positives (FP) and false negatives (FN) denote the numbers of incorrectly predicted binding and non-binding residues, respectively.AUC and AUPR are independent of thresholds, thus reflecting the overall performance of a model.The other metrics are calculated using a threshold to convert the predicted binding probabilities to binary predictions.We go through 101 thresholds from 0 to 1 with an interval of 0.01, and select the best threshold that maximizes the MCC on the validation sets.
We assess the predictive performance for the three domains in GO (MF, BP, and CC) independently using maximum protein-centric F-measure (Fmax) and AUPR.Fmax is computed as follows: where  ���() and  � () denote the average precision and recall at threshold  respectively, which are defined as follows: where () denotes the number of proteins predicted with at least one GO term at threshold ;   denotes the total number of proteins;   denotes the true annotation set for protein ;   () denotes the predicted annotation set for protein  at threshold ; and |•| is the set cardinality operation.We use 101 thresholds from 0 to 1 with an interval of 0.01 to compute the above measurements.The AUPR score is calculated from the computed precision and recall scores using the trapezoidal rule.
To assess the performance of subcellular localization prediction, we use accuracy, Jaccard, micro/macro F1, micro/macro AUC, and micro/macro AUPR.Here, the calculation of accuracy requires the exact location(s) to be predicted.Jaccard measures the intersection of the actual and predicted labels over their union.Micro metrics are calculated by aggregating the labels from all classes and then computing the metrics globally, whereas macro metrics are calculated by computing the metrics for each class and then averaging them.
To evaluate the performance of solubility prediction, we use accuracy, MCC and AUC following the previous works.The threshold used to convert the predicted solubility probabilities to binary predictions was selected by maximizing the MCC on the validation sets.

Figure S1 .
Figure S1.Precision-recall curves of GPSFun and other methods on the binding site test sets of DNA, RNA, peptide, protein, ATP and HEM.

Figure S2 .
Figure S2.Precision-recall curves of GPSFun and other methods on the binding site test sets of Zn 2+ , Ca 2+ , Mg 2+ , and Mn 2+ ions.

Figure S3 .
Figure S3.Precision-recall curves of GPSFun and other methods on the subcellular localization test set.

Figure S4 .
Figure S4.Precision-recall curves of GPSFun and other methods on the protein solubility test set.
Given the coordinates of two atoms  and , the distance feature is computed via RBF(‖ − ‖), where RBF(•) is a radial basis function.For the intra-residue distance features of node , ,  ∈ �  ,    ,   ,   ,   � and  ≠ .Here,  denotes the centroid of the heavy sidechain atoms.The direction features encoding relative directions of other inner atoms to Cα in residue  are computed via , where  ∈ {  ,   ,   ,   }.We also incorporate the sine and cosine values of the bond angles (  ,   ,   ) and torsion angles (  ,   ,   ) to consider the backbone geometry.(iii)Geometric edge features.Similarly, we construct geometric features between neighboring residues including distance, direction and angle.The inter-residue distance features RBF(‖ − ‖) between nodes  and  are computed with atoms  ∈ �  ,    ,   ,   ,   � and  ∈ �  ,    ,   ,   ,   �.The edge direction features    −   �−   �

Table S2 . Numbers of proteins in the training, validation and test sets for the three domains in
GO, i.e., molecular function (MF), biological process (BP), and cellular component (CC)

Table S6 . The ablation studies on protein features and model designs in the ten binding site test sets
The numbers in this table are AUPR values.Bold fonts indicate the best results."Pep" and "Pro" denote peptide and protein, respectively."Avg" means the average AUPR values among the ten test sets."Baseline" means using a (1,24)ormer(34)model fed with the PSSM and HMM evolutionary profiles from MSA(1,24).This model was trained separately for different ligands (i.e., in a single-task fashion)."MSA profiles" means replacing the ProtTrans embeddings with PSSM and HMM."w/o structure" means using a transformer model fed with the ProtTrans sequence features."w/o geometry" means removing the geometric featurizer in GPSFun.

Table S7 . Performance comparison of GPSFun with state-of-the-art methods on the test sets of the three domains in GO
See Note S3 for the implementation details of BLAST-KNN and Foldseek-KNN.Bold and underlined fonts indicate the best and second-best results, respectively.

Table S8 . Performance comparison of GPSFun with state-of-the-art methods on difficult proteins within the test sets of the three domains in GO
The difficult proteins are defined by CAFA2 (5) as the test proteins with sequence identity <60% to the training set.The numbers of difficult proteins in the MF, BP and CC test sets are 303, 649 and 437, respectively.See Note S3 for the implementation details of BLAST-KNN and Foldseek-KNN.Bold and underlined fonts indicate the best and second-best results, respectively.

Table S9 . Performance comparison of GPSFun with state-of-the-art methods in each location on the subcellular localization test set
The locations with less than 10 test proteins are omitted.Bold and underlined fonts indicate the best and second-best results, respectively.

Table S10 . Performance comparison of GPSFun with three baseline methods on the subcellular localization test set
See Note S3 for the implementation details of BLAST-KNN and Foldseek-KNN."GPSFun w/o structure" means using a transformer model fed with the ProtTrans sequence features.Bold and underlined fonts indicate the best and second-best results, respectively.

Table S11 . Performance comparison of GPSFun with three baseline methods on the solubility test set
Note: See Note S3 for the implementation details of BLAST-KNN and Foldseek-KNN."GPSFun w/o structure" means using a transformer model fed with the ProtTrans sequence features.Bold and underlined fonts indicate the best and second-best results, respectively.

Table S12 . The performance of GPSFun with or without model ensemble on the ligand-binding site test sets
Note: "w/o ensemble" reports the mean and standard deviation of the performance metrics of the five trained models from cross-validation.

Table S13 . The performance of GPSFun with or without model ensemble on the GO test sets
Note: "w/o ensemble" reports the mean and standard deviation of the performance metrics of the five trained models from five different random seeds.

Table S14 . The performance of GPSFun with or without model ensemble on the subcellular localization test set
Note: "w/o ensemble" reports the mean and standard deviation of the performance metrics of the five trained models from cross-validation.

Table S15 . The performance of GPSFun with or without model ensemble on the solubility test set
Note: "w/o ensemble" reports the mean and standard deviation of the performance metrics of the five trained models from cross-validation.