Insights into the inner workings of transformer models for protein function prediction

Abstract

Motivation: We explored how explainable artificial intelligence (XAI) can help to shed light on the inner workings of neural networks for protein function prediction, by extending the widely used XAI method of integrated gradients such that latent representations inside of transformer models, which were finetuned to Gene Ontology term and Enzyme Commission number prediction, can be inspected too.

Results: The approach enabled us to identify amino acids in the sequences that the transformers pay particular attention to, and to show that these relevant sequence parts reflect expectations from biology and chemistry, both in the embedding layer and inside of the model, where we identified transformer heads with a statistically significant correspondence of attribution maps with ground truth sequence annotations (e.g. transmembrane regions, active sites) across many proteins.

Availability and implementation: Source code can be accessed at https://github.com/markuswenzel/xai-proteins.


1 Introduction

Protein function prediction
Proteins: constituents of life. Proteins are versatile molecular machines, performing various tasks in basically all cells of every organism, and are modularly constructed from chains of amino acids. Inferring the function of a given protein merely from its amino acid sequence is a particularly interesting problem in bioinformatics research.
Function prediction can help to rapidly provide valuable pointers in the face of so far unfamiliar proteins of understudied species, such as those of emerging pathogens. Moreover, it makes the analysis of large, unlabeled protein data sets possible, which becomes more and more relevant against the backdrop of the massive and ever-growing databases of unlabeled nucleic acid sequences, which in turn can be translated into amino acid sequences. Next-generation DNA sequencers can read the nucleic acid sequences present in a sample or specimen at decreasing costs (Mardis, 2017; Shendure et al., 2017), much faster than experimenters can determine the function of the genes and corresponding proteins. Therefore, databases with genes and corresponding amino acid sequences grow much more rapidly than those of respective experimental gene and protein labels or annotations. Besides, gaining knowledge about the mapping between amino acid sequence and protein function can help to engineer proteins for dedicated purposes too (Alley et al., 2019; Yang et al., 2019; Ferruz et al., 2022; Hie and Yang, 2022; Madani et al., 2023).

Protein language modeling and transfer learning
Relations to NLP. Amino acid sequences share some similarities with the sequences of letters and words occurring in written language, in particular with respect to the complex interrelationships between distant elements, which are arranged in one-dimensional chains. Thus, recent progress in research on natural language processing (NLP) employing language modeling in a transfer learning scheme (Howard and Ruder, 2018) has driven forward protein function prediction too (e.g. Strodthoff et al., 2020).
Self-supervised pretraining. Typically, a language model is first pretrained on large numbers of unlabeled sequences in an unsupervised fashion, e.g. by learning to predict masked tokens (cloze task) or the respective next token in the sequences (which is why this unsupervised approach is also dubbed self-supervised learning). In this way, the model learns useful representations of the sequence statistics (i.e. the "language"). These statistics possibly arise because the amino acid chains need to be stable under physiological conditions and are subject to evolutionary pressure. The learned representations can be transferred to separate downstream tasks, where the pretrained model can be further finetuned in a supervised fashion on labeled data, which are usually available in smaller amounts, considering that sequence labeling by experimenters is costly and lengthy.
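To make the cloze objective concrete, the following minimal sketch (not the authors' code; the 15% masking rate and the "[MASK]" token follow BERT-style conventions and are assumptions here) corrupts an amino acid sequence and records the residues the model would have to predict:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mask_sequence(seq, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of residues by a mask token (cloze task).

    Returns the corrupted token list and the prediction targets,
    i.e. the original residues at the masked positions.
    """
    rng = random.Random(seed)
    tokens = list(seq)
    targets = {}
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            targets[i] = tokens[i]   # remember the true residue
            tokens[i] = mask_token   # corrupt the input
    return tokens, targets

tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```

During pretraining, the model receives the corrupted token list and is trained to recover the residues stored in `targets`.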
Model architectures. Transformer models (Vaswani et al., 2017) making use of the attention mechanism (Niu et al., 2021), such as bidirectional encoder representations from transformers (BERT; Devlin et al., 2018), are currently the prevailing architectures in NLP. Transformers have recently been applied to the study of amino acid sequences too, pushing the state of the art in the field of proteomics as well (Rao et al., 2019, 2021; Nambiar et al., 2020; Bepler and Berger, 2021; Littmann et al., 2021; Rives et al., 2021; Brandes et al., 2022; Elnaggar et al., 2022; Fenoy et al., 2022; Unsal et al., 2022; Lin et al., 2023; Olenyi et al., 2023). Recurrent neural networks (RNNs) using long short-term memory (LSTM) cells are another model architecture that is particularly suited to process sequential data. RNNs have been successfully applied to protein (Strodthoff et al., 2020) and peptide (Vielhaben et al., 2020) property prediction as well, within the scheme of language modeling combined with transfer learning, as sketched out above.

Explainable machine learning
Need for explainability. Transformers and other modern deep learning models are notorious for having often millions and sometimes billions of trainable parameters, and it can be very difficult to interpret the decision-making logic or strategy of such complex models. The research field of explainable machine learning (Lundberg and Lee, 2017; Montavon et al., 2018; Arrieta et al., 2020; Tjoa and Guan, 2020; Covert et al., 2021; Samek et al., 2021) aims at developing methods that enable humans to better interpret, or to a limited degree understand, such "opaque", complex models. In certain cases, it was demonstrated that these methods can even help to uncover flaws and unintended biases of the models, such as being misled by spurious correlations in the data (Lapuschkin et al., 2019).

Contributions of the article
Goal of the study. Building upon this previous research on the interpretation of protein classification models, we aimed at exploring how explainability methods can further help to gain insights into the inner workings of the now often huge neural networks, and proceeded as follows.
Specific contributions. First, we finetuned pretrained transformers on selected prediction tasks and could push or reach the state of the art (see Supplementary Appendix E). Then, we quantified the relevance of each amino acid of a protein for the function prediction model. Subsequently, we investigated if these relevant sequence regions match expectations informed by knowledge from biology or chemistry, by correlating the relevance attributions with annotations from sequence databases (see Figure 1). For instance, we addressed the question if a classification model that is able to infer whether a protein is situated in the cell membrane does indeed focus systematically on transmembrane regions or not. We conducted this analysis on the embedding level and "inside" of the model with a novel adaptation of integrated gradients (IG). In this way, we identified transformer heads with a statistically significant correspondence of the attribution maps with ground truth annotations, across many proteins and thus going beyond anecdotes of a few selected cases.

Figure 1: Illustration of the experimental design. Top: From the amino acid sequence, the finetuned transformer model infers the applicable Gene Ontology (GO) terms (represented as a multi-label class membership vector). (The depicted exemplary "catalase-3" should be labeled with the GO terms "catalase activity" as "molecular function", "response to hydrogen peroxide" as "biological process", "cytoplasm" as "cellular component", etc.; about 5K of about 45K GO terms were considered.) Center: Relevance indicative for a selected GO term was attributed to the amino acids per protein and correlated with corresponding annotations per amino acid. This correlation between relevance attributions and annotations was then statistically assessed across the test data set proteins. The analysis was conducted for the embedding layer and "inside" of the model, for each head in each layer, and was repeated for different GO terms (see Section 2.1). Bottom: Specific amino acids of a protein are annotated in sequence databases like UniProt, because they serve as binding or active sites or are located in the cell membrane, etc. Active sites can, e.g., be found at the histidine ("H" at position 65) and asparagine ("N" at position 138) of "catalase-3" (protein structure prediction created by AlphaFold, "AlphaFold Data Copyright (2022) DeepMind Technologies Limited", under the CC-BY 4.0 licence (Jumper et al., 2021; Varadi et al., 2021)).
2 System and methods

Revealing insights into function prediction models
The prediction tasks of inferring GO terms and Enzyme Commission (EC) numbers, with which the proteins are labeled, from their amino acid sequence are detailed in Supplementary Appendix B. This Supplementary material also explains the finetuning of the transformers "ProtBert-BFD", "ProtT5-XL-UniRef50" (Elnaggar et al., 2022), and "ESM-2" (Lin et al., 2023) on the GO and EC tasks, and contains statements about data availability and composition.
Overall approach. We investigated whether specific positions or areas on the amino acid sequence that had been annotated in sequence databases are particularly relevant for the classification decision of the model (see Figure 1). Annotations included UniProtKB/Swiss-Prot "active" and "binding sites", "transmembrane regions", "short sequence motifs", and PROSITE patterns related to a GO term and its children terms in the ontology. Definitions of the aforementioned UniProt annotations (per amino acid) and matching GO terms (class labels of proteins) are compiled in Supplementary Table A.1 (tables/figures with prefix letters are shown in the Supplementary material). First, we attributed relevance indicative for a given class (either a selected GO term or EC number) to each amino acid of a protein. Then, we correlated the relevance heat map obtained for the amino acid chain of a protein with corresponding binary sequence annotations. To study the information representation within the model, the explainability analysis was conducted at the embedding layer and repeated "inside" of the model, separately for its different heads and layers, using a novel method building upon IG, described below in Section 3.
Experimental setup. For the experimental evaluation, we focus on the pretrained ProtBert model that was finetuned either to the multi-label GO classification on the GO "2016" data set, or to the multi-class EC number classification on the "EC50 level L1" data set. We consider the comparatively narrow EC task in addition to the much more comprehensive GO prediction, because the test split of the EC data set contains a larger number of samples that are both labeled per protein and annotated per amino acid, which is beneficial for the conducted explainability analysis. We observed that larger models tend to perform numerically better than smaller models (see Supplementary Appendix E). Given our focus on methodological matters of model interpretation, we deliberately studied ProtBert (420M parameters), because it is more manageable, due to its considerably smaller memory footprint, than the larger ProtT5 (1.2B).

Algorithm
Integrated gradients (IG; Sundararajan et al., 2017) represents a model-agnostic attribution method, which can be characterized as the unique attribution method satisfying a set of four axioms (Invariance, Sensitivity, Linearity, and Completeness). In this formalism, the attribution for feature i is defined via a line integral along a path, parameterized as γ(t) with t ∈ [0, 1], between some chosen baseline γ(0) = x′ and the sample to be explained γ(1) = x, where F is the function we aim to explain. Choosing γ as the straight line connecting x′ and x, i.e. γ(t) = x′ + t(x − x′), yields the familiar form

IG_i(x) = (x_i − x′_i) ∫₀¹ ∂F(x′ + t(x − x′))/∂x_i dt

and makes IG the unique method satisfying the four axioms from above and an additional symmetry axiom. This straight-line path is the typical choice in applications applied directly to the input layer for computer vision or to the embedding layer for NLP. The approach can be generalized to arbitrary layers if one replaces x and x′ by the hidden feature representation of the network up to this layer (referred to as "layer IG" in the popular "Captum" library (Kokhlikyan et al., 2020)).
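As a numerical illustration of the straight-line formulation (a minimal sketch, not the authors' implementation; the toy function F and the step count are arbitrary choices), IG can be approximated with a Riemann sum, and the Completeness axiom can then be verified by comparing the summed attributions to F(x) − F(x′):

```python
def integrated_gradients(F, grad_F, x, baseline, steps=200):
    """Approximate IG_i = (x_i - x'_i) * integral_0^1 dF/dx_i(x' + t(x - x')) dt
    with a midpoint Riemann sum along the straight-line path."""
    n = len(x)
    attributions = [0.0] * n
    for k in range(steps):
        t = (k + 0.5) / steps  # midpoint rule
        point = [baseline[i] + t * (x[i] - baseline[i]) for i in range(n)]
        g = grad_F(point)      # gradient of F at the interpolated point
        for i in range(n):
            attributions[i] += (x[i] - baseline[i]) * g[i] / steps
    return attributions

# Toy differentiable model F(x) = x0 * x1 + x2^2, with analytic gradient.
F = lambda x: x[0] * x[1] + x[2] ** 2
grad_F = lambda x: [x[1], x[0], 2 * x[2]]

attr = integrated_gradients(F, grad_F, x=[1.0, 2.0, 3.0],
                            baseline=[0.0, 0.0, 0.0])
# Completeness: sum(attr) equals F(x) - F(baseline) = 11 - 0.
```

In practice, the gradient would of course come from automatic differentiation through the transformer rather than from an analytic formula.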
Head-specific attribution maps. To obtain attributions for individual heads, we have to target the output of the multi-head self-attention (MHSA) block of a particular layer; see Figure 2 for a visualization of the transformer architecture. Properly separating the attributions of the individual heads from the attribution contribution of the skip connection requires targeting the output of the MHSA directly. However, one cannot simply choose an integration path that connects baseline and sample as encoded by the MHSA block, because the input for the skip connection has to be varied consistently.
To keep an identical path in all cases, we fix the integration path as a straight line in the embedding layer, which is then encoded into a generally curvilinear path at the input of any intermediate layer. Choosing a path that is not straight only leads to the violation of the symmetry axiom, which is not of paramount practical importance in this application; see Ward et al. (2020) and Kapishnikov et al. (2021) for other applications of IG with general paths. For every sample, this application of IG yields a relevance map of shape seq × n_model, where the first n_model/n_heads entries in the last dimension correspond to the first head, followed by the second head, etc. By summing over groups of n_model/n_heads entries in the last dimension, we can reduce the relevance map to a seq × n_heads attribution map, i.e. one relevance sequence per head.
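The regrouping described above can be sketched as follows (toy dimensions; in the ProtBert setting, n_model = 1024 and n_heads = 16, so 64 consecutive entries would be summed per head):

```python
def per_head_relevance(rel_map, n_heads):
    """Collapse a (seq_len x n_model) relevance map to (seq_len x n_heads)
    by summing the n_model/n_heads consecutive entries of each head."""
    d_model = len(rel_map[0])
    assert d_model % n_heads == 0, "model dim must divide evenly into heads"
    d_head = d_model // n_heads
    return [
        [sum(row[h * d_head:(h + 1) * d_head]) for h in range(n_heads)]
        for row in rel_map
    ]

# Toy example: 2 sequence positions, n_model = 8, 4 heads (2 dims per head).
rel = [[1, 1, 2, 2, 3, 3, 4, 4],
       [0, 1, 0, 1, 0, 1, 0, 1]]
heads = per_head_relevance(rel, n_heads=4)
# -> [[2, 4, 6, 8], [1, 1, 1, 1]]
```

Each row of the result is the per-head relevance at one sequence position, i.e. one relevance sequence per head after transposition.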
Correlation coefficients and statistical significance. Each sequence of relevance attributions can then be correlated with sequence annotations to find out if the model focuses on the annotated amino acids. Coefficients of point-biserial correlation (Kornbrot, 2005), which is equivalent to Pearson correlation, were calculated between the continuous relevance values and the corresponding binary annotations per amino acid. This correlation analysis was conducted separately for each head in each transformer layer. The resulting correlation coefficients were then assembled into an n_layers × n_heads matrix per protein, which entered the subsequent statistical analysis across proteins. Summary statistics over all proteins (which belong to the respective GO or EC class and are part of the respective test data set split) were obtained by computing Wilcoxon signed-rank tests across the correlation coefficients. The resulting p-values were corrected for the multiple tests per condition (16 heads times 30 layers equals 480 hypothesis tests) by controlling the false discovery rate (Benjamini and Hochberg, 1995).
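A pure-Python sketch of these two statistical ingredients (not the authors' implementation, which presumably relies on standard libraries) might look as follows: the point-biserial coefficient is computed as a Pearson correlation against a 0/1 annotation mask, and the Benjamini-Hochberg procedure is applied to a toy list of p-values:

```python
import math

def point_biserial(relevance, annotation):
    """Pearson correlation between continuous relevance values and a
    binary 0/1 annotation mask (equivalent to point-biserial correlation)."""
    n = len(relevance)
    mx = sum(relevance) / n
    my = sum(annotation) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(relevance, annotation))
    sx = math.sqrt(sum((x - mx) ** 2 for x in relevance))
    sy = math.sqrt(sum((y - my) ** 2 for y in annotation))
    return cov / (sx * sy)

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return one reject/accept flag per hypothesis, controlling the FDR."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            k_max = rank  # largest rank passing the BH threshold
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

# High relevance exactly on the annotated residues gives a correlation near 1.
r = point_biserial([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0])
flags = benjamini_hochberg([0.001, 0.04, 0.2], alpha=0.05)
```

On the toy p-values, only the smallest one survives the FDR correction at alpha = 0.05.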
Summed attribution maps. In parallel to the correlation analysis, we furthermore sum the aforementioned attribution map along the sequence dimension, and obtain n_heads entries that specify the relevance distribution over the different heads. We can carry out the same procedure for every transformer layer and combine all results into an n_layers × n_heads relevance map of summed attributions. This map makes it possible to identify heads with a positive relevance with respect to the selected class. One map was obtained per protein. Heads with a significantly positive relevance were singled out by calculating a summary statistic across proteins with the Wilcoxon signed-rank test. Finally, the two parallel analysis tracks were combined by identifying transformer heads that feature both (A) a significantly positive relevance-annotation correlation and (B) a significantly positive relevance (this overlay is displayed in the figures by masking A with B).
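The summation along the sequence and the final overlay of the two analysis tracks can be sketched like this (toy dimensions and toy significance masks; a sketch, not the authors' code):

```python
def summed_head_relevance(attr_map):
    """Sum a (seq_len x n_heads) attribution map along the sequence
    dimension, yielding one summed relevance value per head."""
    n_heads = len(attr_map[0])
    return [sum(row[h] for row in attr_map) for h in range(n_heads)]

def combine_masks(corr_significant, rel_significant):
    """Overlay the two analysis tracks: keep a head only if it shows both
    (A) a significant relevance-annotation correlation and
    (B) a significantly positive summed relevance."""
    return [[a and b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(corr_significant, rel_significant)]

# Toy attribution map for one protein: 3 sequence positions, 3 heads.
attr_map = [[0.2, -0.1, 0.0],
            [0.3,  0.1, -0.2],
            [0.5,  0.0,  0.1]]
per_head = summed_head_relevance(attr_map)

# Toy significance masks (n_layers x n_heads) obtained across proteins.
A = [[True, False], [True, True]]   # correlation test per layer/head
B = [[True, True], [False, True]]   # relevance test per layer/head
overlay = combine_masks(A, B)
```

Only the heads flagged in `overlay` would be highlighted in the figures.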

Implementation
Supplementary Appendix D shows implementation details.

Predictive performance
The performance results for the ProtT5, ProtBert, and ESM-2 transformers finetuned to the GO and EC protein function tasks are presented in Supplementary Tables E.1 to E.3 in Supplementary Appendix E. In summary, we show that finetuning pretrained large transformer models leads to competitive results, in particular in the most relevant comparison in the single-model category, often on par with MSA-approaches. Larger models lead the rankings, with ProtT5 competing with ESM-2. Finetuning the entire model including the encoder shows its particular strength in the "CAFA3" benchmark.

Explainability analysis: embedding layer
Research question. Starting with embedding layer attribution maps, as the most widely considered type of attribution, we investigate whether there are significant correlations between attribution maps and sequence annotations from external sources (see Section 2.1). We aim to answer this question in a statistical fashion, going beyond the anecdotal evidence based on single examples that can sometimes be encountered in the literature.
GO prediction: GO "membrane" attributions correlate in particular with UniProt "transmembrane regions". Figure 3 shows the results of the explainability analysis for the embedding layer of ProtBert finetuned to GO classification. The relevance of each amino acid indicative for selected GO terms was computed with IG, and then correlated with UniProt and PROSITE sequence annotations. Subsequently, it was tested whether the correlation coefficients across all annotated proteins from the test set were significantly positive (see Section 2.1). A significant correlation was observed when relevance attributions indicative for the GO label "membrane" were correlated with UniProt "transmembrane regions" (p≪0.05). No such correlation was observed in the GO "catalytic activity" and "binding" cases.
The pretrained model is expected to contain substantial information already prior to finetuning; e.g. Bernhofer and Rost (2022) had identified transmembrane regions using the pretrained ProtT5. Therefore, we inspected the GO "membrane" case in more detail. The pretrained but not finetuned ProtBert (combined with a classification head trained for the same number of epochs) also resulted in a significantly positive correlation of embedding-level attributions for the GO term "membrane" with transmembrane regions only. Thus, common patterns emerge between the pretrained and the finetuned ProtBert.
EC prediction: attributions correlate significantly with several types of sequence annotations. Figure 4 shows the results of the explainability analysis for the embedding layer of ProtBert finetuned to EC number classification ("EC50 level L1" data set; i.e., the differentiation between the six main enzyme classes). Relevance per amino acid for each of the six EC classes was correlated with the UniProt annotations as "active sites", "binding sites", "transmembrane regions", and "short sequence motifs". It can be observed that the relevance attributions correlated significantly (p<0.05) with "active site" and "binding site" annotations for five out of six EC classes, and with "transmembrane regions" and "short sequence motifs" for two, respectively, three EC classes. (Supplementary Figure E.2 shows that a positive relevance-annotation correlation was observed for all annotation types for "EC40" and "EC50" on both levels "L1" and "L2" for several enzyme (sub-)classes.)

Discussion. Attribution maps obtained for the embedding layer correlated with UniProt annotations on the amino acid level, in particular in the EC case, but also for the GO term "membrane". To summarize, across two tasks, we provide first quantitative evidence for the meaningfulness and specificity of attribution maps beyond anecdotal evidence. Note that the EC case has the benefit of often several hundred annotated samples contained in the test split (except for "transmembrane regions" and "motifs"; see right panel of Figure 4). In comparison, the GO case provides fewer samples in the test split of the data set that were also annotated on the amino acid level (see the numbers in brackets below the x-axis in Figure 3).

Figure 4: Attribution maps calculated for the embedding layer of ProtBert finetuned to "EC50 level L1" classification were correlated with UniProt sequence annotations. Left: Relevance attributions correlated significantly (p<0.05, i.e. above blue line) with "active sites" and "binding sites" for five out of six EC classes, and with "transmembrane regions" and "short sequence motifs" for two, respectively, three EC classes each. Right: Numbers of annotated samples in the test split per annotation type and EC class.

Explainability analysis: peeking inside the transformer
Research question. Given the encouraging results presented in Section 5.2, we aim to go one step further and ask the more specific question whether there are specialized heads inside of the model architecture for specific prediction tasks, using our IG variant that calculates relevance on the amino acid level per transformer head and layer (see Section 3).

GO prediction: membrane. Figure 5 shows the results of the explainability analysis inspecting the latent representations inside of the ProtBert model, focusing on the selected class of the GO term "membrane" (GO:0016020). Relevance attributions indicative for GO "membrane" per amino acid were correlated with the UniProt annotations as "transmembrane regions", separately for each transformer head and layer (matrix plot pixels in Figure 5). In parallel, ProtBert heads were singled out with a significantly positive relevance (sum along the sequence) indicative for "membrane" (see also Section 2.1 and Section 3). Both parallel analysis streams were combined by identifying ProtBert heads with both a significantly positive attribution-annotation correlation and relevance. Several ProtBert heads in different layers feature a significantly positive correlation of relevance attributions per amino acid with the UniProt annotations as "transmembrane regions", going along with a significantly positive relevance for the GO class "membrane". In contrast, a correlation of relevance attributions with UniProt "active" or "binding sites", "motifs", or PROSITE patterns accompanied by a positive relevance was not observed (hence these cases were not included in Figure 5).

Figure 5: Explainability analysis inside the ProtBert model for the GO term "membrane" (GO:0016020). Only results for UniProt "transmembrane regions" are shown, omitting the results for "active/binding sites", "motifs", and PROSITE patterns, which did not feature heads with both a significantly positive relevance and attribution-annotation correlation.
GO prediction: catalytic activity. Supplementary Figure E.3 (in Supplementary Appendix E) shows the results of the explainability analysis for the case where the GO term "catalytic activity" was selected (GO:0003824). Different ProtBert heads stand out, characterized by a positive relevance accompanied by a positive correlation of attributions with PROSITE patterns and with UniProt "active sites" and "transmembrane regions" (but neither with "binding sites" nor "motifs").
GO prediction: binding. Supplementary Figure E.4 (in Supplementary Appendix E) repeats the explainability analysis inside ProtBert for the GO term "binding" (GO:0005488). For several transformer heads and layers, a positive relevance went along with a correlation of relevance attributions with corresponding PROSITE patterns, and with UniProt "transmembrane regions" (but neither with UniProt "active" nor "binding sites" nor "motifs").
EC prediction. Subsequently, we conducted the explainability analysis for the case where ProtBert had been finetuned to EC number classification on EC50 level L1. Here, the model had learned to differentiate between the six main enzyme classes. Supplementary Figure E.5 (in Supplementary Appendix E) identifies ProtBert heads characterized both by a positive relevance (sum along the sequence) with respect to the EC class, and by a positive attribution-annotation correlation (on the amino acid level). The analysis was conducted separately for UniProt annotations as "active"/"binding sites", "transmembrane regions", and "motifs". (The absence of identified heads for EC4, EC5, and EC6 in the "transmembrane regions" rows and for EC1 and EC5 in the "motif" rows of Supplementary Figure E.5 goes along with the availability of relatively few "transmembrane" and "motif" annotations for these EC classes; see the histogram in Figure 4.)

Discussion. In summary, we propose a constructive method suited to identify heads inside of the transformer architecture that are specialized for specific protein function or property prediction tasks. The proposed method comprises a novel adaptation of the explainable artificial intelligence (XAI) method of IG combined with a subsequent statistical analysis. We first attributed relevance to the single amino acids per protein (per GO term or EC class), separately for each transformer head and layer. Then, we inspected the correlation between relevance attributions and annotations, in a statistical analysis across the annotated proteins from the test split of the respective data set. Apparently, different transformer heads are sensitive to different annotated, and thus biologically or chemically "meaningful", sites, regions, or patterns on the amino acid sequence.
We discuss the benefits of finetuning a pretrained model end-to-end, and evaluate the XAI method with a residue substitution experiment, in Supplementary Appendix E. There, we also discuss the relation of XAI to homology, to the hydrophobicity and charge of residues in transmembrane regions, and to probing and in-silico mutagenesis.

Uncovering collective dynamics
Finally, we studied collective dynamics potentially emerging among the transformer heads (ProtBert, EC50, level L1) by visualizing the originally high-dimensional, summed attribution maps in two dimensions, taking their similarities into account. For this purpose, the attribution maps that were summed along the amino acid sequence and represented as n_layers × n_heads matrices (see Section 3) were flattened, resulting in one vector per protein. The dimensionality of these vectors was then reduced with principal component analysis to 50 dimensions, and subsequently to two dimensions with t-distributed stochastic neighbor embedding (t-SNE; van der Maaten and Hinton, 2008), using the default t-SNE parameters. The resulting 2D points were visualized as a scatter plot and colored according to the corresponding six main enzyme classes (Figure 6).
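The reduction pipeline can be sketched with scikit-learn as follows (a sketch, not the authors' code; the synthetic random data merely stands in for the per-protein summed attribution maps, with dimensions following the ProtBert setting of 30 layers × 16 heads, i.e. 480-dimensional vectors after flattening):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in: one summed attribution map (n_layers x n_heads)
# per protein; here 60 "proteins", 30 layers, 16 heads.
maps = rng.normal(size=(60, 30, 16))

# Flatten each map to a single 480-dimensional vector per protein.
vectors = maps.reshape(len(maps), -1)

# Reduce to 50 dimensions with PCA, then to 2D with t-SNE.
reduced = PCA(n_components=50, random_state=0).fit_transform(vectors)
embedding = TSNE(n_components=2, random_state=0).fit_transform(reduced)
```

The resulting `embedding` would then be drawn as a scatter plot, with each point colored by the EC class of the corresponding protein.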
The points form distinctive clusters matching the EC labels. Apparently, a structure emerges in the attribution maps that seems to indicate class-specific collective dynamics among several ProtBert heads. It is important to stress that the attribution map underlying the clustering no longer contains any reference to specific positions in the sequence, but relies on the relevance distribution over the different heads through all layers of the model. The emergence of class-specific structures therefore indicates that there are specific combinations of heads that are relevant for a specific classification decision.

Conclusion
This work provides additional evidence for the effectiveness of the currently predominant paradigm in deep-learning-based protein analysis, the end-to-end finetuning of large protein language models (which brings additional benefits; see Supplementary Appendix E). For different protein function prediction tasks, this approach leads to best-performing models according to single-model performance. The performance level is in many cases on par with MSA-approaches. The proposed models can even be effectively combined with the latter through the formation of ensembles.
Considering the ever-increasing model complexity, XAI has started to gain traction in the field of protein analysis too (Upmeier zu Belzen et al., 2019; Taujale et al., 2021; Vig et al., 2021; Hou et al., 2023; Vu et al., 2023; Zhou et al., 2023), but quantitative evidence for its applicability beyond single examples was lacking up to now. We provide statistical evidence for the alignment of attribution maps with corresponding sequence annotations, both on the embedding level and for specific heads inside of the model architecture, which led to the identification of specialized heads for specific protein function prediction tasks. Emerging class-specific structures suggest that these specialized transformer heads act jointly, deciding together in specific combinations. A further detailed analysis of the identified heads could be an interesting next step in future research, potentially based on the query/key/value (QKV) matrices. Inside the multi-layered model, however, a direct correspondence between rows/columns of the QKV matrices and individual residues in the sequence is no longer given. This limitation makes it difficult, e.g., to infer relations between residues from the QKV matrices.
In summary, XAI promises to tap into the presumably substantial knowledge contained in large models pretrained on massive data sets of amino and/or nucleic acid sequences (Steinegger et al., 2019; Ji et al., 2021). Therefore, we expect that XAI will play an increasingly important role in the future of bioinformatics research. We see potential applications of XAI for model validation and for scientific discovery (e.g. of novel discriminative sequence patterns or motifs that have not been identified by experiments or MSA so far). Identifying specialized heads might also help to prune overly large models, making them smaller and more efficient.
A Supplementary material: Introduction

GO prediction: introduction. We first focus on the protein function prediction task of inferring Gene Ontology (GO) terms, with which the proteins are labeled, from their amino acid sequence. This particularly comprehensive task allows characterizing each protein with respect to several properties, taking its molecular function, location in the cell, and involvement in different biological processes into account. GO term prediction has been approached using a range of deep learning methods in the past years (e.g. Kulmanov et al., 2017; Kulmanov and Hoehndorf, 2019, 2022; You et al., 2018a,b, 2021; Strodthoff et al., 2020; Littmann et al., 2021). The GO consortium (Ashburner et al., 2000; Consortium, 2020a) aims at representing biological knowledge in the form of an ontology, i.e. as a graph comprising terms/classes as nodes and the relationships between them as edges. GO actually consists of three separate ontologies for molecular function (MFO), cellular component (CCO), and biological process (BPO).
The protein database UniProtKB/Swiss-Prot (Consortium, 2020b) lists the amino acid sequences of proteins, and labels these proteins with selected, corresponding GO terms. In addition, the database contains selected annotations on the amino acid level, such as amino acids that serve as active or binding sites (see Table A.1), which will later play a role in this work.

GO prediction: setup. Considering that each protein can typically be labeled with several of the numerous GO terms, the task was phrased here as a multi-label ("protein-centric") prediction problem, where the model was trained to produce a label vector of about 5000 dimensions (5220 or 5101; see below) for each protein. The mapping from GO terms to the intricate topology of the ontology had been simplified by Kulmanov and Hoehndorf (2019), who made the data available in their accompanying data repository, as follows. A "flat" label vector represents the original graph structure of the ontology. Labels were propagated from children terms to parents through the hierarchy of the ontology (in contrast to the original GO, where labels are assigned to ontology leaves as far as possible). Only a subset of about 5K of the about 45K GO terms in total was considered, because the GO terms had been filtered to those that occur at least 50 times in the respective data set. Like in Kulmanov and Hoehndorf (2019), predictions are propagated towards the root node of the respective ontology tree. Evaluation is always performed with respect to the set of all GO terms, whereas the model can only predict the most frequently occurring GO terms.
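The propagation of labels from children terms to all of their ancestors can be sketched as follows (a toy ontology fragment with illustrative term names, not actual GO identifiers; GO is a directed acyclic graph, so a term may have several parents):

```python
# Toy child -> parents relations standing in for a GO fragment.
PARENTS = {
    "catalase_activity": ["peroxidase_activity"],
    "peroxidase_activity": ["oxidoreductase_activity"],
    "oxidoreductase_activity": ["catalytic_activity"],
    "catalytic_activity": ["molecular_function"],
    "molecular_function": [],  # ontology root
}

def propagate(labels):
    """Propagate labels from children terms up to all ancestor terms."""
    closed = set(labels)
    stack = list(labels)
    while stack:
        term = stack.pop()
        for parent in PARENTS.get(term, []):
            if parent not in closed:
                closed.add(parent)
                stack.append(parent)
    return closed

# A protein annotated only with the leaf term receives all ancestors too.
full = propagate({"catalase_activity"})
```

The "flat" multi-label target vector is then the indicator of `full` over the retained term vocabulary.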
GO prediction: data sets. We used both the "CAFA3" and the "2016" data sets by Kulmanov and Hoehndorf (2019), with 66841 training and 3328 testing samples with 5220 classes ("CAFA3") and, respectively, 65028 training and 1788 testing samples with 5101 classes ("2016") (see Section 2.1 with Table 1 in Kulmanov and Hoehndorf, 2019). While the "CAFA3" data set is a widely used benchmark (Zhou et al., 2019), the time-based "2016" splits of the underlying UniProtKB/Swiss-Prot database serve for the comparison with the results by You et al. (2018a,b), Kulmanov and Hoehndorf (2019), and Strodthoff et al. (2020) ("2016" because entries from Jan.-Oct. 2016 serve for testing). The predictive performance of the protein function prediction models was assessed separately for the molecular function, cellular component and biological process ontologies of GO, with the Fmax and Smin metrics (Clark and Radivojac, 2013) used in "CAFA3" (Zhou et al., 2019), and with the area under the precision-recall curve (AUPR; comparable to You et al., 2018a,b; Kulmanov and Hoehndorf, 2019; Strodthoff et al., 2020).
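The protein-centric Fmax metric can be sketched as follows (a simplified illustration of the CAFA definition, in which precision is averaged only over proteins with at least one prediction above the threshold, and recall over all proteins; the function name is our own):

```python
def f_max(pred_scores, true_terms, thresholds):
    """Protein-centric Fmax (simplified sketch): at each threshold t,
    average precision over proteins with >= 1 predicted term and recall
    over all proteins; Fmax is the best F1 of these averages over t.
    pred_scores: one {term: score} dict per protein.
    true_terms: one set of ground-truth terms per protein."""
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for scores, truth in zip(pred_scores, true_terms):
            predicted = {term for term, s in scores.items() if s >= t}
            if predicted:
                precisions.append(len(predicted & truth) / len(predicted))
            recalls.append(len(predicted & truth) / len(truth) if truth else 0.0)
        if precisions:
            pr = sum(precisions) / len(precisions)
            rc = sum(recalls) / len(recalls)
            if pr + rc > 0:
                best = max(best, 2 * pr * rc / (pr + rc))
    return best
```

In the actual evaluation, predictions are additionally propagated towards the ontology root before this computation.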
EC prediction. During the course of the present study, we narrow down the prediction task to the inference of the enzymatic function of a protein from its amino acid sequence. Enzymes are proteins that catalyze chemical reactions, which can be grouped into seven main classes and further sub-classes, as categorized by the Enzyme Commission (EC) numbers, a nomenclature developed by the International Union of Biochemistry and Molecular Biology (Webb et al., 1992; McDonald et al., 2008). EC prediction has been approached with machine learning, e.g. by Dalkiran et al. (2018). The studied prediction task comprised the binary classification of whether a protein serves as enzyme or not (EC level "L0"), the classification into the main classes of enzymatic reactions (EC level "L1"), and the differentiation among different sub-classes of enzymatic reactions (EC level "L2"). We assessed the predictive performance on the "EC40" (58018 samples) and "EC50" (104940 samples) data sets by Strodthoff et al. (2020) (see their Supplementary Section S1), using accuracy as metric. The numbers in "EC40" and "EC50" refer to the similarity threshold between train and test splits of 40% similarity (more difficult task) and, respectively, 50% (easier task). The data sets comprise the traditional six main EC classes (oxidoreductases, transferases, hydrolases, lyases, isomerases, ligases), leaving apart the newer seventh EC class of translocases.
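The three task granularities can be illustrated by deriving the L0/L1/L2 targets from an EC number string (hypothetical helper, for illustration only):

```python
def ec_labels(ec_number):
    """Derive the targets of the three task granularities from a full EC
    number string: L0 (enzyme yes/no), L1 (main class), L2 (sub-class)."""
    if ec_number is None:
        return {"L0": 0, "L1": None, "L2": None}  # not an enzyme
    parts = ec_number.split(".")
    return {"L0": 1, "L1": parts[0], "L2": ".".join(parts[:2])}

print(ec_labels("2.7.11.1"))  # -> {'L0': 1, 'L1': '2', 'L2': '2.7'}
```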
Data availability and composition of the training set. GO data by Kulmanov and Hoehndorf (2019) from their data repository and EC data by Strodthoff et al. (2020) (see their Supplementary Section S1) were preprocessed as detailed in the code repository by Strodthoff et al. (2020), resulting in separate data sets for GO "2016" (a.k.a. "temporalsplit"), for GO "CAFA3", and for EC40 and EC50 on levels L0, L1, and L2. Dedicated validation (a.k.a. development) splits were available for the EC and GO "CAFA3" tasks. For the time-based GO "2016" split, a 10% subset was randomly sampled from the training data and used as validation split (comparable to Kulmanov and Hoehndorf, 2019). As described below, we use the validation set for model selection and report the test set score.

II Finetuning pretrained transformers
Considered transformer models. We finetuned the transformer model "ProtBert-BFD" (Elnaggar et al., 2022) on the GO term and EC number prediction tasks detailed earlier in Section I of Appendix B. "ProtBert-BFD" is a variant of BERT (Devlin et al., 2018) with 420 million parameters distributed among 16 heads in 30 layers, which had been pretrained in a self-supervised way with a masked token prediction task on the massive metagenomic protein sequence collection titled "Big Fantastic Database" (BFD), comprising more than two billion sequences (Jumper et al., 2021; Steinegger et al., 2019; Steinegger and Söding, 2018).
In addition, we also finetuned the more complex and memory-intensive "ProtT5-XL-UniRef50" model (Elnaggar et al., 2022) on the GO and EC tasks, using the last hidden state of the encoder. "ProtT5-XL-UniRef50" is based on the "Text-To-Text Transfer Transformer" (T5; Raffel et al., 2020), has about three billion parameters (1.2 billion for the encoder used here), and had been pretrained on 45 million sequences from UniRef50 (Suzek et al., 2014). We abbreviate "ProtBert-BFD" as "ProtBert" and "ProtT5-XL-UniRef50" as "ProtT5" in the following text.
Classification head. The final classification head comprised a hidden layer followed by a linear layer that maps to the dimensionality of the respective label vector. A hidden layer with n = 50 neurons was chosen for the simpler EC task. For the more demanding GO task, the size of the hidden layer was increased to n = 2 × 5220 (GO "CAFA3") and, respectively, to n = 2 × 5101 (GO "2016"), such that the hidden layer contained twice as many neurons as the dimensionality of the respective label vector. Rectified linear units served as activation function, and layer normalization and a dropout layer with a dropout rate of ten percent served for regularization.
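The described head can be sketched in PyTorch roughly as follows (our own naming and layer ordering; the authors' exact implementation may differ in details such as where the layer norm is placed):

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Sketch of the classification head: hidden layer with ReLU,
    layer normalization, 10% dropout, and a final linear map to the
    dimensionality of the label vector."""

    def __init__(self, in_dim, hidden_dim, num_labels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.LayerNorm(hidden_dim),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_labels),  # logits; sigmoid/softmax live in the loss
        )

    def forward(self, features):
        return self.net(features)

# EC task (six main classes at level L1): small hidden layer with n = 50.
ec_head = ClassificationHead(in_dim=1024, hidden_dim=50, num_labels=6)
# GO "CAFA3" would instead use hidden_dim = 2 * 5220 and num_labels = 5220.
```

Here `in_dim=1024` is a placeholder for the dimensionality of the pooled encoder features.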
Finetuning procedure. For finetuning, the input sequence was cropped to a maximum sequence length of 1000 amino acids per protein, if necessary. The encoder features created by the transformer were pooled by concatenating the classification token, the maximum and the mean of the features along the sequence, and the sum of the features divided by the square root of the sequence length (Reimers and Gurevych, 2019). Binary cross entropy combined with a sigmoid layer served as loss function for the multi-label GO task, and cross entropy loss combined with softmax for the multi-class EC task. Gradients were accumulated over 64 batches of size one each. Adam (Kingma and Ba, 2015) was selected as optimizer. The learning rate was set to 5 × 10−6 for the encoder, which was frozen for the first epoch, and to 3 × 10−5 for the classification head. For the multi-label GO task, the learning rate was additionally scheduled with a linear increase over 500 warm-up steps in the beginning, followed by a decrease to zero along a half-cosine function calculated for a total number of 20000 training steps.
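The described pooling can be sketched as follows (assuming the concatenation order given above; the actual implementation may arrange the parts differently):

```python
import math
import torch

def pool_features(hidden, seq_len):
    """Pool encoder output `hidden` of shape [seq_len, d] (classification
    token at position 0) into one fixed-size vector by concatenating the
    CLS token, per-dimension max and mean over the sequence, and the sum
    scaled by 1/sqrt(seq_len)."""
    cls_token = hidden[0]
    maximum = hidden.max(dim=0).values
    mean = hidden.mean(dim=0)
    scaled_sum = hidden.sum(dim=0) / math.sqrt(seq_len)
    return torch.cat([cls_token, maximum, mean, scaled_sum])  # shape: [4 * d]
```

The pooled vector (four times the encoder dimension) is what the classification head receives.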
Model selection. After each training epoch, the loss (GO) or accuracy (EC) on the validation split was monitored, and the model with the minimum loss (GO) or maximum accuracy (EC) was saved. Training was repeated until no further improvement was observed on the validation split. The best model was typically found after around 2-13 epochs in the case of ProtBert, after around 2-7 epochs in the case of ProtT5, after 6-20 epochs for the smaller, and after 2-7 epochs for the larger ESM-2 variant.

C Supplementary material: Algorithm
Positive attributions. For the summed attribution maps (but not for the correlation analysis), the positive attributions are of higher importance, since this analysis focuses on identifying the areas of a sequence that speak for a specific class, rather than finding the areas that speak against this sequence being classified as such. Most XAI use cases, such as the multiple classification attribution examples in Sundararajan et al. (2017), choose this approach. Some granularity is lost because the attributions of individual amino acids are aggregated. The case that a head identifies the relevant areas with a positive attribution, and those values are canceled out by also finding negatively relevant areas, is theoretically possible, even though a positive relevance sum is an indicator for the respective class, because the relevance sign is not arbitrary (Binder et al., 2023). Nevertheless, we account for a potentially too rigid filtering by showing the correlation analysis results and the summed attribution maps individually.

D Supplementary material: Implementation
The pretrained models ProtBert, ProtT5 (Elnaggar et al., 2022), and ESM-2 (Lin et al., 2023) were connected to an additional pooling layer and classification head (see Appendix B, II), and finetuned on GO data provided by Kulmanov and Hoehndorf (2019) and, respectively, EC data by Strodthoff et al. (2020) (see their Supplementary Section S1), using PyTorch (Paszke et al., 2019) with PyTorch Lightning as wrapper. The adaptation of the integrated gradients algorithm acting separately on each transformer head in every layer (see Section 3) was implemented in Python, building on the layer integrated gradients method of Captum (Kokhlikyan et al., 2020). Annotations were obtained from the databases UniProt (Consortium, 2020b) and PROSITE (Sigrist et al., 2012) in combination with GO (Ashburner et al., 2000; Consortium, 2020a). Statistical calculations were performed using SciPy (Virtanen et al., 2020) and statsmodels (Seabold and Perktold, 2010). Performance on the GO task was assessed based on an adaptation by Strodthoff et al. (2020) of the evaluation code by Kulmanov and Hoehndorf (2019). Source code for model training, model interpretation, statistics and visualization can be accessed at https://github.com/markuswenzel/xai-proteins.
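For intuition, the core integrated gradients computation that Captum's layer variant generalizes to intermediate representations can be sketched on a toy model (plain IG on the input of a small linear model, not the per-head layer variant used in this work):

```python
import torch

def integrated_gradients(model, x, baseline, target, steps=64):
    """Plain integrated gradients: average the gradient of the target output
    along the straight path from the baseline to x, scaled by (x - baseline)."""
    total = torch.zeros_like(x)
    for i in range(1, steps + 1):
        # Interpolation point between baseline and input
        point = (baseline + (i / steps) * (x - baseline)).requires_grad_(True)
        model(point)[target].backward()
        total += point.grad
    return (x - baseline) * total / steps

# Toy model: for a linear map, IG is exact and satisfies the completeness
# axiom, i.e. the attributions sum to F(x) - F(baseline) for the target.
model = torch.nn.Linear(4, 2)
x, baseline = torch.randn(4), torch.zeros(4)
attr = integrated_gradients(model, x, baseline, target=0)
delta = model(x)[0] - model(baseline)[0]
assert torch.isclose(attr.sum(), delta, atol=1e-4)
```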

E Supplementary material: Results and Discussion
Research question. Here, we aim to answer the question whether finetuning of large pretrained transformer models leads to a competitive performance in protein function prediction.
GO prediction on the CAFA3 data set. Table E.1 shows the performance results for GO term prediction obtained by finetuning and testing ProtBert, ProtT5 and ESM-2 on the "CAFA3" splits. Performance was assessed with the protein-centric Fmax metric, separately for the molecular function (MFO), biological process (BPO), and cellular component (CCO) ontologies that comprise the GO. For comparison with the state of the art, we added results reported in the literature (Olenyi et al., 2023; Bernhofer et al., 2021; Littmann et al., 2021; Kulmanov and Hoehndorf, 2019; You et al., 2018b) to Table E.1. For model evaluation, we distinguish between single models and ensemble models. In the former category, the finetuned ProtT5 turns out to be the model that outperforms prior deep-learning-based approaches built on transformers (goPredSim) and convolutional neural networks (DeepGOCNN), as well as the very strong MSA-based baseline DiamondScore, on all ontologies MFO, BPO and CCO. The finetuned larger ESM-2 variant is a comparably strong competitor to ProtT5 (and better than ProtT5 on BPO).
Turning to ensemble models, we find that the ensemble of a finetuned ProtBert (followed by ProtT5) and DiamondScore (MSA-based; Kulmanov and Hoehndorf, 2019) produces the best predictions on CCO (scores were combined with a weighted sum; comparable to Kulmanov and Hoehndorf, 2019, see their Section 2.3). The ensemble of the finetuned larger ESM-2 variant with DiamondScore leads the ranking on BPO. The ensemble method "GOLabeler" (You et al., 2018b) leads on MFO. GOLabeler had won the original CAFA3 challenge on all three ontologies BPO, CCO, and MFO (see Fig. 3 of Zhou et al., 2019, where the Fmax results are reported).
GO prediction on the 2016 data set. Table E.2 shows the GO term prediction results of the finetuned ProtT5, ProtBert, and ESM-2 models on the more recent "2016" data set (Kulmanov and Hoehndorf, 2019; You et al., 2018b), as well as the literature results (compiled by Kulmanov and Hoehndorf, 2019, and Strodthoff et al., 2020, who report AUPR and Smin results in addition to Fmax). Again, we distinguish between single and ensemble models in the following discussion.
Concerning single models, different methods obtain the best results, depending on the considered metric and ontology. ProtT5 is the best single model as measured by AUPR on all three ontologies MFO, BPO and CCO. ProtT5 also obtained the largest Fmax on BPO, while the larger ESM-2 variant is best on CCO. DiamondScore, an MSA-based method, is the best single model with respect to the Smin metric on the ontologies MFO and BPO, and achieves the best Fmax on MFO. DeepGOPlus has the lowest and thus best Smin for CCO.
Discussion. Finetuning large, pretrained protein language models on the tasks of GO term as well as EC number prediction results in a strong predictive performance that is competitive with, and in several cases better than, state-of-the-art methods from the literature (depending on the metric and ontology under consideration, in the GO case). This outcome highlights again the benefits of transferring pretrained universal language models to downstream tasks in the proteomics field (cf., e.g., Strodthoff et al., 2020; Rao et al., 2019; Elnaggar et al., 2022).
The largest inspected models lead the rankings on different metrics (and ontologies), with the bigger ESM-2 being a strong competitor to ProtT5. Again, ProtT5 consistently ranks better than the smaller ProtBert, both on "CAFA3" and on the more recent, harder "2016" benchmark (only the ProtBert-DiamondScore ensemble forms an exception, on "CAFA3" with respect to Fmax and on "2016" with respect to Smin in the BPO case).
DeepGOCNN and DiamondScore are competing single models with results available from the literature on both benchmarks. ProtT5, ProtBert, and ESM-2 outperform DeepGOCNN in both benchmarks on all metrics. ProtT5 and, in particular, the larger variant of ESM-2 outperform the strong MSA-based DiamondScore in both benchmarks in many cases.
The approach of finetuning the entire model including the encoder shows its strength in the "CAFA3" benchmark for BPO and CCO, where the finetuned ProtT5 outperforms the "goPredSim" model, which is based on a nearest-neighbor lookup using features extracted from ProtT5 embeddings (Olenyi et al., 2023). The observation that this nearest-neighbor lookup is competitive in the MFO ontology might be related to the relatively dense annotation of the MFO.
In summary, we showed that finetuning pretrained large transformer models leads to competitive results for protein function prediction tasks, in particular in the most relevant comparison within the single-model category.
On pretraining and finetuning from end to end. First of all, it is important to stress that our approach can only be applied to a model that includes a classification head, as it relies on the class-specific output prediction. We studied how finetuning the entire model (comprising the pretrained ProtBert and the classification head) from end to end compares to training only the classification head while keeping the pretrained ProtBert frozen, i.e. unchanged. We refer to these two scenarios as "finetuned" vs. "pretrained". We also considered an additional baseline where the ProtBert parameters were "shuffled" per layer, hence keeping realistic weight statistics across layers. Only the classification head was trained in the "shuffled" scenario, while the encoder parameters were kept frozen. For comparability, we trained the models in all scenarios for the same number of epochs, i.e. until early stopping was initiated in the "finetuned" scenario.
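The "shuffled" baseline can be sketched as an in-place permutation of each parameter tensor (our own sketch; the exact shuffling granularity in the experiments may differ):

```python
import torch

def shuffle_parameters(model, seed=0):
    """Permute the entries of every parameter tensor in place. This destroys
    the learned structure while keeping each tensor's weight statistics
    (mean, variance, value histogram) intact."""
    generator = torch.Generator().manual_seed(seed)
    with torch.no_grad():  # allow in-place edits on leaf parameters
        for param in model.parameters():
            flat = param.view(-1)
            perm = torch.randperm(flat.numel(), generator=generator)
            flat.copy_(flat[perm])
```

Because only the order of the values changes, per-layer statistics remain realistic, as described above.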
First, we inspected the performance results for these scenarios. For the GO term classification task "2016", the finetuned ProtBert (see Table E.2) achieved an Fmax of 0.503 on MFO, of 0.424 on BPO, and of 0.677 on CCO. When only the classification head was trained, while the pretrained encoder was kept frozen, performance dropped to an Fmax of 0.463 on MFO, of 0.407 on BPO, and of 0.674 on CCO. Parameter shuffling resulted in an Fmax of 0.305 on MFO, of 0.316 on BPO, and of 0.601 on CCO (close to the "naive" baseline from Table 3 of Strodthoff et al., 2020).
For the enzyme classification task "EC50 level L1", the accuracy of 0.990 of the finetuned ProtBert (see Table E.3) dropped to 0.829 when only the classification head was trained on top of the pretrained but frozen ProtBert. Performance dropped further to 0.369 in the "shuffled" scenario.
Thus, finetuning from end to end apparently provides advantages in comparison to only training the classification head on top of the pretrained encoder. Using pretrained parameters in both the "finetuned" and the "pretrained" scenarios shows the well-known benefits of transfer learning in comparison to the "shuffled" scenario.
The corresponding attribution analysis results (see Figure E.1) for the "finetuned" and the "pretrained" scenario have much in common, in contrast to the "shuffled" scenario, which shows a disparate picture. The overlap of "significant heads" between each scenario pair was quantified with the Jaccard similarity coefficient as 0.43 for the finetuned-pretrained pair, in contrast to the smaller 0.036 for the finetuned-shuffled pair, and 0.035 for the pretrained-shuffled pair. Both the "finetuned" and the "pretrained" scenario build upon the same pretrained foundation model. Therefore, a similar attribution analysis result can be expected. When the parameters were "shuffled", fewer heads appear in the plot. This outcome shows that the XAI method is sensitive to the model parameters, which is expected and required (Adebayo et al., 2018). The "shuffled" result varies across different training runs, whereas the finetuned results remain largely similar (as already suggested by the large similarity of full finetuning vs. pretrained). It is noteworthy to stress that, on general grounds, the plot may still show significant heads due to a coincidental partial alignment of certain randomly initialized heads with directions in feature space that happen to correlate spatially with sequence annotations. Besides, skip connections are not affected by the randomization (Binder et al., 2023). Model parameter randomization tests for XAI evaluation had been introduced by Adebayo et al. (2018), and their shortcomings were discussed by Binder et al. (2023).

Figure E.1: Attribution analysis results were similar in the comparable "finetuned" (left) and "pretrained" (centre) scenarios, but different in the disparate "shuffled" (right) scenario, as expected (inside ProtBert, GO membrane/transmembrane). Jaccard similarity of the overlap of "significant heads" between scenarios was 0.43 for the finetuned-pretrained, 0.036 for the finetuned-shuffled, and 0.035 for the pretrained-shuffled pair.
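The Jaccard overlap measure used here is simply the size of the intersection over the size of the union of the two sets (the head indices below are made up for illustration):

```python
def jaccard(a, b):
    """Jaccard similarity between two sets, e.g. of significant
    (layer, head) index pairs found in two scenarios."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Hypothetical sets of significant heads, for illustration only:
finetuned = {(29, 3), (28, 7), (15, 0)}
pretrained = {(29, 3), (28, 7), (12, 4)}
print(jaccard(finetuned, pretrained))  # two shared out of four distinct -> 0.5
```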
XAI evaluation with residue substitution. XAI methods can be evaluated, e.g., by replacing those input features that have been identified as most relevant for the classification decision of the model with default values or random noise, and by observing the resulting performance drop. In image recognition, this approach has been demonstrated in the form of "pixel flipping" for black and white images, or in the form of pixel substitution with a grey or other color value (Bach et al., 2015). In protein function prediction, the residues in each sequence could be sorted according to the attributed relevance. The top n relevant residues could then be substituted with alanine (cf. alanine scanning mutagenesis; Cunningham and Wells, 1989), while observing the model performance drop in comparison to the substitution of n random residues. We conducted such a residue substitution experiment for EC50 L1, where we substituted an increasing number of relevant or random residues with alanine. Random test protein generation was repeated ten times for establishing the random baseline. (For the GO classification case, it must be considered that relevance is computed only for one selected class out of the numerous GO terms, while the performance is calculated over all available classes of this multi-label problem.) Table E.4 shows the result of this residue substitution experiment. The model is highly accurate on the original test split. The performance of ProtBert decreases when more and more relevant as well as random residues are substituted with alanine. Performance decays faster when the most relevant residues are substituted, in comparison to a random replacement. Thus, the residues identified with IG are indeed relevant for the accurate classification decision of the model.

Relation between homology and XAI. Homology describes the relation between proteins that are deemed as originating from a common ancestor, typically because their sequences or structures are sufficiently similar (or because an intermediary sequence
exists that both sequences resemble) (Pearson, 2013). To study how homology relates to residue-level relevance attributions of an ML model, homology information must be made available per residue. For example, a consensus sequence can […]

Relation between sequence-structure-function and XAI. According to the "sequence-structure-function" paradigm (Koehler Leman et al., 2023), the amino acid sequence determines the secondary protein structure (alpha helices, beta sheets, etc.), which determines the tertiary protein structure, i.e. the 3D shape, which determines the protein function. An analogy can be drawn to deep ML models that start with the sequence, build up useful representations over multiple layers, and finally classify the protein function. Here, we aimed at inverting this information flow from sequence to function, and attributed relevance for the function prediction model back to the individual residues in the sequences. Interestingly, we observed a high correlation of attributions w.r.t. the GO term "membrane" with annotations of alpha-helical and beta-barrel transmembrane regions (see Figure 3, Figure 5, and Table A.1). Thus, structurally similar sequences apparently resulted in comparable attributions in this case.
Relation to probing and in-silico mutagenesis. The proposed method can provide a complementary view to probing approaches (Vig et al., 2021) as well as to in-silico mutagenesis ("ISM"; Bromberg and Rost, 2008; Raimondi et al., 2018). ISM and XAI methods have in common that both aim at identifying individual residues that are important for the protein function. ISM uses ML to mimic alanine scanning mutagenesis (Cunningham and Wells, 1989), where single residues are replaced with alanine or other amino acids. Then, the (potentially deleterious) effect of the substitution on the protein function is measured (e.g. as a binding energy change; see Bromberg and Rost, 2008), respectively predicted with ML. Thereby, residues can be identified that are essential for the protein function (or stability). On the other hand, local XAI methods aim at identifying features (here: residues) that are relevant for the decision of a (protein function prediction) model. XAI post-hoc analyses fall into few general categories. Some XAI methods, including IG, are based on gradients, while other methods are "simulating feature removal" (Covert et al., 2021), e.g. LIME (Ribeiro et al., 2016). In some ways, ISM might be comparable (using the alanine scanning analogy) to feature removal methods where input features are replaced with a given value (alanine), such as, in principle, PredDiff (Blücher et al., 2022). Whether ISM or XAI methods are better suited in the context of protein function prediction is hard to say. Both take a different perspective, with ISM potentially placing particular emphasis on the protein stability, and with XAI viewing the problem more indirectly through the "spectacles" of the function prediction model. Finally, a potential shared limitation of ISM and XAI is that both aim merely at identifying individual residues, potentially neglecting their interaction.
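The alanine substitution step from the XAI evaluation described earlier can be sketched as follows (hypothetical helper; the actual experiment additionally compares against random substitutions and repeats the random baseline ten times):

```python
def substitute_relevant(sequence, relevances, n, residue="A"):
    """Replace the n residues with the highest attributed relevance by
    alanine, mirroring the residue substitution experiment."""
    top = sorted(range(len(sequence)), key=lambda i: relevances[i], reverse=True)[:n]
    mutated = list(sequence)
    for i in top:
        mutated[i] = residue
    return "".join(mutated)

print(substitute_relevant("MKVLH", [0.1, 0.9, 0.2, 0.8, 0.0], n=2))  # -> MAVAH
```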
Transmembrane regions, hydrophobicity and charge. For a more fine-grained inspection of the case of transmembrane regions presented in Figure 3, we additionally tested whether the relevance attributions (to residues of proteins that had been labeled with the GO term "membrane" and annotated with transmembrane regions) correlated with hydrophobic and, respectively, positively charged residues. Hydrophobic amino acids of transmembrane proteins tend to be located in the hydrophobic core of the membrane, while positively charged residues tend to be attracted by the interface of the membrane with the cytoplasm (positive-inside rule; see von Heijne, 1989; Elazar et al., 2016; Baker et al., 2017). Interestingly, a significant correlation of attributions with hydrophobic residues was observed (in particular, if they are situated within a transmembrane region), but not with positively charged residues (see Figure E.6). Apparently, the attention of the model, when inferring whether the GO term "membrane" applies to the protein, is particularly drawn to hydrophobic residues buried in the membrane.
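A minimal sketch of such a correlation test, assuming (as a simplification) that hydrophobic residues are those with a positive Kyte-Doolittle hydropathy value, and computing the point-biserial coefficient as a Pearson correlation between the binary annotation and the continuous relevance (helper names are our own):

```python
# Residues with hydropathy > 0 on the Kyte-Doolittle scale
KYTE_DOOLITTLE_POSITIVE = set("AILMFVC")

def point_biserial(binary, values):
    """Point-biserial correlation: Pearson correlation between a binary
    variable (0/1 annotation) and a continuous variable (relevance)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    mean_b = sum(binary) / n
    var_b = mean_b * (1 - mean_b)
    cov = sum((b - mean_b) * (v - mean) for b, v in zip(binary, values)) / n
    return cov / (var ** 0.5 * var_b ** 0.5)

def hydrophobicity_correlation(sequence, relevances):
    binary = [1 if aa in KYTE_DOOLITTLE_POSITIVE else 0 for aa in sequence]
    return point_biserial(binary, relevances)
```

In the actual analysis, such per-protein coefficients are then aggregated across proteins with Wilcoxon signed-rank tests.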

Figure 2: Visualization of the explainability method based on IG that can attribute relevance to sequence tokens (here: amino acids) separately for each head and layer of the transformer (adapted from Vaswani et al., 2017).

Figure 3: Attribution maps for the embedding layer of ProtBert finetuned to GO term classification were correlated with sequence annotations. Relevance attributions indicative for the GO label "membrane" correlate significantly with UniProt annotations as "transmembrane regions" (p < 0.05, i.e. above blue line). An attribution-annotation correlation was not observed for GO "catalytic activity" and "binding". Numbers of test split samples both labeled with the GO term and annotated per amino acid are listed below the x-axis.

Figure 5: Inside ProtBert; GO "membrane" (GO:0016020). Left: The relevance attribution (along the sequences) indicative for the GO term "membrane" was correlated with UniProt annotations as "transmembrane regions", for each transformer head and layer. Biserial correlation coefficients (r), obtained for each attribution-annotation pair, were aggregated in population statistics with Wilcoxon signed-rank tests. The resulting p-values of the tests were adjusted with the Benjamini/Hochberg method for the multiple hypothesis tests conducted, in order to limit the false discovery rate. A significance threshold was applied (family-wise error rate of 0.05). The negative logarithm of the corrected and thresholded p-values is displayed. All colored pixels indicate statistically significant results. Center: ProtBert heads with a significantly positive relevance (sum along the sequence; indicative for the GO term "membrane") were singled out with the Wilcoxon signed-rank test. The matrix plots show the negative logarithm of the resulting p-values (adjusted with Benjamini/Hochberg and a threshold). Right: ProtBert heads with a significantly positive attribution-annotation correlation (p-values from Wilcoxon signed-rank tests plotted) that are also characterized by a significantly positive relevance (the latter overlaid as mask). Only results for UniProt "transmembrane regions" are shown, omitting the results for "active/binding sites", "motifs", and PROSITE patterns, which did not feature heads with both a significantly positive relevance and attribution-annotation correlation.
Additional XAI results for EC and GO, discussed earlier in Section 5.3, are shown in Supplementary Figures E.2 to E.5.

Figure E.3: Inside ProtBert; focus on GO "catalytic activity" (GO:0003824). Left panels: Results of the correlation analysis between relevance attributions indicative for the GO term "catalytic activity" and PROSITE patterns as well as UniProt annotations as "active sites" and "transmembrane regions", per head and layer (negative logarithm of corrected p-values of Wilcoxon signed-rank tests over correlation coefficients; significance thresholds were overlaid as masks). Center panels: (Negative logarithm of corrected p-values of) Wilcoxon tests inspecting whether the relevance (sum along the sequence) was significantly positive. Right panels: Heads with both a significantly positive attribution-annotation correlation and a significantly positive relevance (overlay of left and center panels). Only results for corresponding PROSITE patterns (top), UniProt "active sites" (center) and "transmembrane regions" (bottom) are shown. "Binding sites" and "motifs" are omitted, since a significantly positive attribution-annotation correlation combined with a positive relevance was not observed in these cases.

Figure E.4: Inside ProtBert; focus on GO "binding" (GO:0005488). Left: Correlation of relevance for "binding" with corresponding PROSITE patterns (top) or UniProt "transmembrane regions" (bottom), per head and layer. Center: Positive relevance. Right: Heads with both a significantly positive attribution-annotation correlation and a significantly positive relevance. Results for "active/binding sites" and "motifs" are not shown, given the absence of significantly positive attribution-annotation correlations in these cases.
Figure E.5: Inside ProtBert finetuned to EC number classification. Rows: Results of the correlation analysis are presented separately for the different types of annotations (UniProt "active sites", "binding sites", "transmembrane regions", "short sequence motifs"). Columns: Results for the six main EC classes (at level L1). Pixels in the matrix plots represent the 16 transformer heads and 30 layers. Each pixel shows the negative logarithm of the p-value resulting from a Wilcoxon signed-rank test across the coefficients of correlation between relevance attributions and annotations (p < 0.05; after correction for multiple comparisons and thresholding), overlaid with the mask of those heads/layers that featured a significantly positive relevance too.

Table E.1: GO term prediction by finetuning and testing ProtT5, ProtBert, and ESM-2 on the "CAFA3" splits, next to state-of-the-art results reported in the literature. Predictive performance was evaluated with the protein-centric Fmax metric. The molecular function, biological process and cellular component ontologies of GO are abbreviated as MFO, BPO and CCO. Best single-model results are marked in bold face. Best overall results are underlined. Arrows (→) mark results achieved in this work. ProtT5 provides the best overall single-model performance in the MFO and CCO categories, while the larger ESM-2 variant performs best on BPO. Ensembling with the MSA-based DiamondScore predictions is highly effective.

Table E.4: Accuracy of ProtBert on EC50 L1 for an increasing number of substituted residues. Either the n most relevant or n random residues of the proteins from the test split were replaced by alanine. Then, the accuracy of the model trained on the original training data was evaluated on these modified test proteins. Performance drops faster when relevant residues are substituted.
(The multi-class problem of EC classification on level L1 might lend itself better to this kind of analysis than the GO case, because there are only six mutually exclusive classes.)