M-Ionic: prediction of metal-ion-binding sites from sequence using residue embeddings

Abstract

Motivation: Understanding metal–protein interaction can provide structural and functional insights into cellular processes. As the number of protein sequences increases, developing fast yet precise computational approaches to predict and annotate metal-binding sites becomes imperative. Quick and resource-efficient pre-trained protein language model (pLM) embeddings have successfully predicted binding sites from protein sequences despite not using structural or evolutionary features (multiple sequence alignments). Using residue-level embeddings from the pLMs, we have developed a sequence-based method (M-Ionic) to identify metal-binding proteins and predict residues involved in metal binding.

Results: On independent validation of recent proteins, M-Ionic reports an area under the receiver operating characteristic curve (AUROC) of 0.83 (recall = 84.6%) in distinguishing metal-binding from non-binding proteins, compared to an AUROC of 0.74 (recall = 61.8%) for the next best method. In addition to performance comparable to the state-of-the-art method for identifying metal-binding residues (Ca2+, Mg2+, Mn2+, Zn2+), M-Ionic provides binding probabilities for six additional ions (i.e. Cu2+, PO43−, SO42−, Fe2+, Fe3+, Co2+). We show that the pLM embedding of a single residue contains sufficient information about its neighbours to predict its binding properties.

Availability and implementation: M-Ionic can be used on your protein of interest using a Google Colab Notebook (https://bit.ly/40FrRbK). The GitHub repository (https://github.com/TeamSundar/m-ionic) contains all code and data.

Performance on distinguishing metal binding and non-binding proteins using the independent test set generated in this study (TestFold6) and negative set.

Figure S2, extended description. Positive log odds signify that an amino acid is more likely to bind to that metal group, whereas a negative log odds ratio indicates non-preferential binding.

Figure S3, extended description. To examine the ability of the method to identify the effect of mutations, we systematically replaced all metal-binding residues with each of the 20 amino acids, one at a time, and saved the mutated sequences to a new FASTA file. In plot (a), all the binding residues are replaced with alanine (A); in (b), all the binding residues are replaced with cysteine (C); and so on. These mutated sequences are then used as input to M-Ionic, and the binding probabilities for each of the residues are obtained. If the original residue is annotated to bind to a particular ion, only the output probabilities of the mutated sequences for that truly binding ion are considered. For example, if the original sequence binds to Zn2+ according to the BioLip annotation, the M-Ionic binding probability to Zn2+ is considered, and the probabilities for other ions are ignored. In the plots, the output probabilities from M-Ionic for the mutated residue (in orange) and original residue (in blue) are shown. We have pooled the data for all ions into single plots per residue type.

Figure S1. Protein- and residue-level comparison on the homology-reduced independent test set (TestFold6). (a) Protein-level: comparison of ROC curves for the performance of each ion type for M-Ionic (this study) trained on ESM-2 embeddings, LMetalSite (Yuan et al., 2022) and mebi-pred (Aptekmann et al., 2022) on the homology-reduced independent test set (TestFold6) and the negative binding test set. (b) Protein-level: comparison of Precision-Recall curves for the performance of each ion type for each of the methods. (c) Residue-level: F1, MCC, Precision and Recall scores of M-Ionic (trained on ESM-2 and ESM-MSA-1b embeddings) and LMetalSite (Yuan et al., 2022) on the homology-reduced independent test set (TestFold6).
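The residue-level scores named in panel (c) follow from the counts of a binary confusion matrix. The block below is a minimal sketch of the standard definitions on toy labels; the function name `binary_metrics` and the example data are illustrative assumptions and are not taken from the M-Ionic code base.

```python
import math


def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 and MCC for binary labels (1 = binding residue)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# Toy example (hypothetical labels): 8 residues, 3 of them truly binding.
y_true = [1, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

MCC is informative here because binding residues are typically a small minority of all residues, so accuracy alone would be dominated by the non-binding class.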

Figure S2. The log odds ratio shows the binding propensity of amino acids for each ion group.
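The exact formula behind the log odds ratio is not given in this text, so the sketch below assumes one common definition: the odds of an amino acid among binding residues divided by its odds among non-binding residues, on a natural-log scale. The function name `log_odds_ratio`, the 0.5 pseudocount and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_odds_ratio(residues, is_binding, pseudocount=0.5):
    """Per-amino-acid log odds of appearing in binding vs non-binding positions.

    Assumed definition: log((a / b) / (c / d)) where
      a = binding residues of this type,      b = binding residues of other types,
      c = non-binding residues of this type,  d = non-binding residues of other types.
    The pseudocount (an assumption) avoids division by zero for rare residues.
    """
    bind = Counter(aa for aa, b in zip(residues, is_binding) if b)
    nonbind = Counter(aa for aa, b in zip(residues, is_binding) if not b)
    n_bind = sum(bind.values())
    n_nonbind = sum(nonbind.values())

    scores = {}
    for aa in sorted(set(residues)):
        a = bind[aa] + pseudocount
        b = n_bind - bind[aa] + pseudocount
        c = nonbind[aa] + pseudocount
        d = n_nonbind - nonbind[aa] + pseudocount
        scores[aa] = math.log((a / b) / (c / d))
    return scores


# Toy data (hypothetical): the first four residues are annotated as binding.
residues = "CHCHAAGLCAALGH"
is_binding = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
scores = log_odds_ratio(residues, is_binding)
```

Under this definition, a positive score means the amino acid is enriched among binding residues and a negative score means it is depleted. In the toy data, cysteine scores positive and alanine negative.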

Figure S3. M-Ionic output probability distributions (with one plot for each residue type) showing the effect of mutating metal-binding residues.
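The mutation scan behind this figure replaces every annotated binding residue with a single amino acid type and repeats this for each of the 20 types. The block below is a sketch of that sequence-generation step only, assuming 0-based binding positions. The function names, the FASTA header format and the toy sequence are hypothetical, and running M-Ionic on the output is not shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def mutate_binding_sites(sequence, binding_positions, new_residue):
    """Replace every annotated binding residue (0-based index) with new_residue."""
    chars = list(sequence)
    for pos in binding_positions:
        chars[pos] = new_residue
    return "".join(chars)


def mutation_scan_fasta(seq_id, sequence, binding_positions):
    """Build FASTA text with one mutated record per replacement amino acid."""
    records = []
    for aa in AMINO_ACIDS:
        mutated = mutate_binding_sites(sequence, binding_positions, aa)
        # Header format is an illustrative choice, not a required convention.
        records.append(f">{seq_id}|binding_sites_to_{aa}\n{mutated}")
    return "\n".join(records) + "\n"


# Toy example (hypothetical): both cysteines are annotated as metal binding.
sequence = "MKCPHCGKAF"
binding_positions = [2, 5]
fasta_text = mutation_scan_fasta("toy_protein", sequence, binding_positions)
```

Note that the record for cysteine reproduces the original sequence in this toy case, because the binding residues are already cysteines.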

Figure S4. Count plot for each ion type indicating the number of true and false positives of the M-Ionic predictions of mutated metal-binding residues. To examine the effect of mutating residues on prediction performance for each ion type individually, each of the above plots shows the number of true positives (TPs) and false positives (FPs) when the metal-binding residues are mutated to one of the 20 amino acid types. Plot (a) shows the TPs and FPs for calcium; (b) shows the TPs and FPs for cobalt; and so on. The amino acids on the x-axis represent the residue to which all the metal-binding residues were mutated. In this figure, the values are not pooled as in Figure 4 and Figure S3.
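Per-ion counting of this kind can be sketched as follows. How labels are assigned to mutated residues is not specified in this text, so the sketch assumes each record carries the original annotation as its label and that a probability of at least 0.5 counts as a positive prediction. The threshold, the record layout and the ion codes are assumptions.

```python
from collections import defaultdict


def count_tp_fp(records, threshold=0.5):
    """Count true and false positives per ion.

    Each record is (ion, label, probability), where label is 1 if the residue
    is annotated as binding that ion. The 0.5 threshold is an assumed cut-off.
    """
    counts = defaultdict(lambda: {"TP": 0, "FP": 0})
    for ion, label, prob in records:
        if prob >= threshold:  # predicted as binding
            counts[ion]["TP" if label == 1 else "FP"] += 1
    return dict(counts)


# Toy records (hypothetical values).
records = [
    ("CA", 1, 0.91),
    ("CA", 0, 0.62),
    ("CA", 1, 0.30),
    ("ZN", 1, 0.88),
    ("ZN", 0, 0.12),
]
counts = count_tp_fp(records)
```

Running the same count once per replacement amino acid would give the bars of one panel, with one panel per ion.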

Summary of metal-binding proteins in the BioLip dataset

Table S3. Impact of evolutionary features (MSA, PSSM) on metal-binding site prediction using the 'Recent BioLip' dataset (i.e. independent test set of recent PDB proteins)

Table S4. Benchmark on the MIonSite benchmark set

Table S5. Impact of structural features (DSSP) on metal-binding site prediction using the independent test set generated in this study (TestFold6)

Table S6. Validating that M-Ionic is trained on the residue level (on the embedding dimension, e.g. 1280 for ESM-2) and not on the protein level (on the length L of the protein) using the independent test set generated in this study (TestFold6)
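The distinction between residue-level and protein-level training can be illustrated with a toy per-residue classifier. The sketch below applies one logistic unit to each residue embedding independently, so the parameter count depends only on the embedding dimension and not on the protein length L. The single logistic layer, the embedding size of 8 (standing in for 1280) and the random inputs are assumptions for illustration and do not represent the M-Ionic architecture.

```python
import math
import random


def residue_probability(embedding, weights, bias):
    """Logistic score for one residue, computed from its embedding only."""
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def predict_protein(embeddings, weights, bias):
    """Apply the same per-residue scorer to every row of an (L, D) embedding."""
    return [residue_probability(e, weights, bias) for e in embeddings]


EMBED_DIM = 8  # small stand-in for the 1280-dimensional ESM-2 embedding
rng = random.Random(0)
weights = [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]  # shape (D,), independent of L
bias = 0.0


def random_protein(length):
    """Hypothetical embeddings for a protein with `length` residues."""
    return [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(length)]


short_protein = random_protein(5)
long_protein = random_protein(12)
probs_short = predict_protein(short_protein, weights, bias)
probs_long = predict_protein(long_protein, weights, bias)
```

The same weights score proteins of any length, and a given embedding always yields the same probability regardless of which protein it sits in. Any dependence on neighbouring residues has to come from the pLM embedding itself.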

Table S7. Analysis of M-Ionic performance for each ion for each taxon