LMCrot: an enhanced protein crotonylation site predictor by leveraging an interpretable window-level embedding from a transformer-based protein language model

Abstract

Motivation: Recent advancements in natural language processing have highlighted the effectiveness of global contextualized representations from protein language models (pLMs) in numerous downstream tasks. Nonetheless, strategies for encoding the site-of-interest with pLMs for per-residue prediction tasks, such as crotonylation (Kcr) prediction, remain largely uncharted.

Results: Herein, we adopt a range of approaches for utilizing pLMs by experimenting with different input sequence types (full-length protein sequence versus window sequence) and by assessing the implications of using the per-residue embedding of the site-of-interest as well as the embeddings of window residues centered around it. Building upon these insights, we developed a novel residual ConvBiLSTM network designed to process window-level embeddings of the site-of-interest generated by the ProtT5-XL-UniRef50 pLM from full-length sequences. This model, termed T5ResConvBiLSTM, surpasses existing state-of-the-art Kcr predictors across three diverse datasets. To validate our approach of utilizing full-sequence-based window-level embeddings, we also examined the interpretability of ProtT5-derived embedding tensors in two ways: first, by scrutinizing the attention weights obtained from the transformer's encoder block; and second, by computing SHAP values for these tensors, providing a model-agnostic interpretation of the prediction results. Additionally, we enhance the latent representation of ProtT5 by incorporating two additional local representations, one derived from amino acid properties and the other from a supervised embedding layer, through an intermediate-fusion stacked generalization approach using an n-mer window sequence (or peptide/fragment). The resultant stacked model, dubbed LMCrot, exhibits a more pronounced improvement in predictive performance across the tested datasets.

Availability and implementation: LMCrot is publicly available at https://github.com/KCLabMTU/LMCrot.


Section S1. Description of the handcrafted feature sets

Total number of features: 1343

More details on each of the feature sets are as follows:
(i) Features derived from the physical properties of amino acids (120 features), including the amino acid mass (molecular weight), the number of atoms in the residue, and the volume of the residue (shown in Table S2).
(ii) Features derived from the physicochemical properties of amino acids (341 features), computed from ten physicochemical property indices obtained from the AAindex database (presented in Table S3).
(iii) Features based on amino acid classes (four features), which categorize amino acids into four classes: nonpolar aliphatic, aromatic, polar uncharged, and positively charged.
Table S3. Physicochemical property indices of amino acids and their description.

TSAJ990101: Volumes including the crystallographic waters using the ProtOr
BIOV880101: Information value for accessibility; average fraction of 35%
CEDJ970104: Composition of amino acids in intracellular proteins

(iv) Composition, transition, and distribution (CTD) features (147 features), which characterize the distribution patterns of a typical structural or physicochemical property of amino acids along a peptide sequence.
(v) Zscale features (155 features), in which each amino acid is characterized by five physicochemical properties as listed by Sandberg et al. [3].
(vi) Sliding window amino acid composition (SWAAC) features (540 features), which calculate the AAC over a five-residue window sliding from the N- to the C-terminus of each protein sequence.
(vii) Features based on whether a residue is hydrophobic or polar (two features), where the hydrophobic amino acids are G, A, V, L, I, P, F, M, and W, while the polar amino acids are T, C, N, Q, and Y.
(viii) Features based on the overlapping properties (OP) of amino acids (four features). The OP is a multi-label classification of amino acids; each residue can be assigned one or more labels from ten classes: polar, positive, negative, charged, hydrophobic, aliphatic, aromatic, small, tiny, and proline (shown in Table S4).
However, only four properties are found to be important: positive, negative, tiny, and proline (four features).
(ix) Features based on the average cumulative hydrophobicity (ACH) (30 features), computed using the Jain (JA) and Miyazawa (MI) hydrophobicity indexes [4] shown in Table S5. The ACH is calculated by averaging the hydrophobicity of the residues over the possible sub-windows in which the targeted lysine (K) is always in the middle. For a window of 31 residues, there are 15 sub-windows, yielding 15 ACH features per index (30 features in total for the two indexes).
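As an illustration of item (ix), here is a minimal sketch of the ACH computation, assuming the 15 sub-windows are the nested odd-length windows (sizes 3, 5, ..., 31) centered on the target lysine; the function name and the hydrophobicity mapping are ours, not from the original implementation.

```python
# Hedged sketch of ACH (item ix): average a hydrophobicity index over the
# 15 nested sub-windows centered on the target K of a 31-residue window.
# `hydro` maps a one-letter residue code to its index value (JA or MI).
def ach_features(window: str, hydro: dict[str, float]) -> list[float]:
    center = len(window) // 2                 # index 15 in a 31-residue window
    feats = []
    for half in range(1, center + 1):         # half-widths 1..15 -> 15 sub-windows
        sub = window[center - half : center + half + 1]
        feats.append(sum(hydro.get(aa, 0.0) for aa in sub) / len(sub))
    return feats                              # 15 features per hydrophobicity index
```

Running this once with the JA index and once with the MI index gives the 30 ACH features.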

Section S2. Description of transformer-based pLMs used in this work
Let L be the length of the embedding vector per residue (also called the "last hidden state").
Note: *Sequences longer than ESM-2's 1024-residue limit were truncated. For sites within the first or last 1024 residues, only these segments were used. If a site was central, a 1024-residue window around it was selected.
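A minimal sketch of the truncation rule above, assuming 0-based site indices and a window centered on interior sites; the exact boundary handling in the original pipeline may differ.

```python
# Hedged sketch of the segment selection for sequences longer than
# ESM-2's 1024-residue limit, following the note above.
def esm2_segment(seq: str, site: int, limit: int = 1024) -> tuple[str, int]:
    """Return a <=limit-residue segment containing the site, and the
    site's index within that segment."""
    if len(seq) <= limit:
        return seq, site                      # no truncation needed
    if site < limit:                          # site within the first `limit` residues
        return seq[:limit], site
    if site >= len(seq) - limit:              # site within the last `limit` residues
        return seq[-limit:], site - (len(seq) - limit)
    start = site - limit // 2                 # central site: center the window on it
    return seq[start:start + limit], site - start
```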
Section S3. Details of the architectures of the base models and the meta-classifier

I) T5ResConvBiLSTM (a hedged Keras sketch follows this list):
• Input Layer: Accepts input with shape (window_size × 1024). The 'window_size' is the length of the window sequence (=31), and 1024 is the per-residue feature dimension.
• Reshape Layer: Reshapes the input to (window_size, 32, 32, 1), preparing it for the convolutional operations; each element in the sequence is treated as a 32×32 grid with one channel.
• First TimeDistributed Conv2D Layer: Applies a convolution with 4 filters of size 5×5 to each sequence element individually. TimeDistributed applies the same convolution operation across the time (sequence) dimension.
• Second TimeDistributed Conv2D Layer: Has a single filter of size 1×1. It reduces the 4 feature maps of the previous convolutional layer back to one channel so that its output can be added to the reshaped input via the residual connection. Technically, this layer performs no significant operation beyond this channel alignment, which is why we refer to the model in the manuscript as having two convolutional layers rather than three.
• First Add Layer: This creates a residual connection by adding the previous layer's output to the reshaped input, which can help train deeper networks by allowing gradients to flow through the network more effectively.
• First TimeDistributed Dropout Layer: Applies dropout with a rate of 0.5 to each sequence element.
• Third TimeDistributed Conv2D Layer: Applies a second convolution operation with 2 filters of size 3×3.
• Second Add Layer: Adds the output of the second convolutional operation to the output of the previous dropout layer, forming another residual connection.
• Second TimeDistributed Dropout Layer: Another dropout layer that applies the dropout regularization with a rate of 0.5 to the output of the previous residual block.
• TimeDistributed Flatten Layer: Flattens the output of the previous layers to a single vector per sequence element, making it suitable for input to the recurrent layers.

• Bidirectional LSTM Layer: Processes the per-residue feature vectors in both directions, capturing sequential context across the window.
• Flatten Layer: This layer flattens the outputs of the bidirectional LSTM layer into a 1D vector.
• Dense Layer: A fully connected layer that projects the flattened output down to a 16-dimensional space and introduces non-linearity with the ReLU activation function.
• Dropout Layer: Another dropout layer with a rate of 0.5 is added here.
• Output Dense Layer: A dense layer with a single unit and a sigmoid activation function.
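Putting the above layers together, here is a minimal Keras sketch of T5ResConvBiLSTM. The window size, reshape, filter counts/sizes, and dropout rates follow the list above; the convolution padding and activations and the number of LSTM units are not specified there and are assumptions for illustration.

```python
# Hedged Keras sketch of T5ResConvBiLSTM; padding='same' keeps spatial shapes
# compatible for the residual adds, and LSTM_UNITS is an assumed value.
from tensorflow.keras import layers, models

WINDOW_SIZE = 31
LSTM_UNITS = 16  # not stated in the text; assumption for illustration

inp = layers.Input(shape=(WINDOW_SIZE, 1024))
x = layers.Reshape((WINDOW_SIZE, 32, 32, 1))(inp)             # residue -> 32x32x1 grid
c1 = layers.TimeDistributed(layers.Conv2D(4, (5, 5), padding='same', activation='relu'))(x)
c2 = layers.TimeDistributed(layers.Conv2D(1, (1, 1), padding='same'))(c1)  # channel alignment
r1 = layers.Add()([c2, x])                                    # first residual connection
d1 = layers.TimeDistributed(layers.Dropout(0.5))(r1)
c3 = layers.TimeDistributed(layers.Conv2D(2, (3, 3), padding='same', activation='relu'))(d1)
r2 = layers.Add()([c3, d1])                                   # second residual (channels broadcast)
d2 = layers.TimeDistributed(layers.Dropout(0.5))(r2)
f = layers.TimeDistributed(layers.Flatten())(d2)              # one vector per residue
b = layers.Bidirectional(layers.LSTM(LSTM_UNITS, return_sequences=True))(f)
h = layers.Flatten()(b)
h = layers.Dense(16, activation='relu')(h)
h = layers.Dropout(0.5)(h)
out = layers.Dense(1, activation='sigmoid')(h)
model = models.Model(inp, out)
```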

II) EmbedCNN (a hedged Keras sketch follows this list):
• Embedding Layer: An embedding layer with dimension = 15, vocabulary size = 23, and input_length = 31.
• Lambda Layer: Applies 'K.expand_dims(x, 3)', which expands the dimensionality of the input by adding a single-dimensional axis at position 3.
• Conv2D Layer: Applies 64 filters with a kernel size of (16×4); the activation function is ReLU.
• Dropout Layer: A dropout layer with a rate of 0.3.

• MaxPooling2D Layer: The pooling operation is performed over a 4×4 window, further reducing the size of the feature maps.
• Flatten Layer: This layer flattens the input from a multi-dimensional tensor to a one-dimensional tensor to be used in the following dense layers.
• Dense Layer: The first dense layer has 32 units with ReLU activation.
• Dense Layer: The final dense layer has 1 unit with a sigmoid activation function.
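A minimal Keras sketch of EmbedCNN under the same caveats: the layer order and sizes follow the list above, while unstated details (padding, strides) are left at Keras defaults.

```python
# Hedged Keras sketch of EmbedCNN, following the layer list above.
from tensorflow.keras import layers, models, backend as K

inp = layers.Input(shape=(31,), dtype='int32')        # 31-residue window as integer codes
e = layers.Embedding(input_dim=23, output_dim=15)(inp)
x = layers.Lambda(lambda t: K.expand_dims(t, 3))(e)   # (31, 15) -> (31, 15, 1)
x = layers.Conv2D(64, (16, 4), activation='relu')(x)
x = layers.Dropout(0.3)(x)
x = layers.MaxPooling2D((4, 4))(x)
x = layers.Flatten()(x)
x = layers.Dense(32, activation='relu')(x)
out = layers.Dense(1, activation='sigmoid')(x)
model = models.Model(inp, out)
```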

III) PhysicoDNN (a hedged Keras sketch follows this list):
• Input Layer: Accepts input with shape (1343,), indicating the model takes a flat vector of 1343 features.
• Dense Layer: A fully connected layer with 64 units using ReLU activation.
• Dropout Layer: A dropout layer with a rate of 0.5.
• Dense Layer: Another fully connected layer following the dropout, this time with 8 units also using ReLU activation.
• Dropout Layer: A second dropout layer, now with a rate of 0.3.

• Dense Layer: The final layer is a fully connected layer with 1 unit and a sigmoid activation function.
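For completeness, a minimal Keras sketch of PhysicoDNN; all sizes here are stated in the list above.

```python
# Hedged Keras sketch of PhysicoDNN, following the layer list above.
from tensorflow.keras import layers, models

inp = layers.Input(shape=(1343,))           # flat vector of 1343 handcrafted features
x = layers.Dense(64, activation='relu')(inp)
x = layers.Dropout(0.5)(x)
x = layers.Dense(8, activation='relu')(x)
x = layers.Dropout(0.3)(x)
out = layers.Dense(1, activation='sigmoid')(x)
model = models.Model(inp, out)
```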

IV) Meta-classifier (a hedged Keras sketch follows this list):
• Input Layer: Accepts input with shape (56 × 1), obtained by the intermediate fusion of the three base models (the 16-, 32-, and 8-unit penultimate dense layers of T5ResConvBiLSTM, EmbedCNN, and PhysicoDNN: 16 + 32 + 8 = 56 features).
• PReLU Layer: A layer with a parametric rectified linear unit (PReLU) activation function.
• Dense Layer: A fully connected layer with 8 units and ReLU activation. This layer also includes L1 and L2 regularization, each with a regularization coefficient of 0.05.
• Dropout Layer: A dropout layer with a rate of 0.5.
• Dense Layer: Another dense layer with 4 units with linear activation.
• Dense Layer: The final dense layer with 1 unit and a sigmoid activation function.
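A minimal Keras sketch of the meta-classifier; treating the "56 × 1" input as a flat 56-dimensional vector is our assumption about how the fused features are fed in.

```python
# Hedged Keras sketch of the meta-classifier, following the layer list above.
from tensorflow.keras import layers, models, regularizers

inp = layers.Input(shape=(56,))             # fused features from the three base models
x = layers.PReLU()(inp)
x = layers.Dense(8, activation='relu',
                 kernel_regularizer=regularizers.l1_l2(l1=0.05, l2=0.05))(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(4, activation='linear')(x)
out = layers.Dense(1, activation='sigmoid')(x)
model = models.Model(inp, out)
```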
Table S7. Description of the number of layers and number of parameters of the proposed models.
Notes:
1. The calculation of the total number of layers in these models excludes the input, lambda, add, dropout, reshape, and flatten layers.
2. *The meta-classifier layer count includes PReLU; if PReLU is not counted, the layer count is 3 (as stated in the manuscript).

Section S4. Performance Measures
Let TP (true positives) be the count of correctly predicted Kcr sites, TN (true negatives) the count of correctly predicted non-Kcr sites, FP (false positives) the count of non-Kcr sites incorrectly predicted as Kcr sites, and FN (false negatives) the count of Kcr sites incorrectly predicted as non-Kcr sites. Based on these fundamental quantities, we define the following measures:
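The exact set of measures follows the manuscript; for reference, here is a sketch of the standard definitions of the metrics commonly reported for Kcr prediction (accuracy, sensitivity, specificity, and the Matthews correlation coefficient), assuming this is the set intended above.

```latex
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}\\[4pt]
\text{Sensitivity} &= \frac{TP}{TP + FN}\\[4pt]
\text{Specificity} &= \frac{TN}{TN + FP}\\[4pt]
\text{MCC} &= \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}
\end{aligned}
```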

Section S5. Statistical Significance Testing
We test the statistical significance of LMCrot by performing McNemar's test [5,6], a "within-subjects chi-squared test," chosen for its suitability in paired-sample analysis. This test is particularly appropriate for our study, as it is designed to compare the performance of two models on the same dataset, which is the case for LMCrot against the base and existing models in Tables 4, 5, 6, and 7 of the manuscript. The test accounts for the correlation between the paired samples, a crucial factor in our analysis since the same dataset tests both models. Furthermore, McNemar's test is ideal for binary outcomes, aligning well with our classification task. It provides a clear framework to test the null hypothesis (H0), which states that there is no performance difference between LMCrot and the other model (either a base model or an existing one). From the contingency matrices, we calculated the chi-squared (χ²) statistics and thereby the p-values listed in the tables below. From the data presented in Table S8, it is evident that the p-values for all the base models are significantly lower than the chosen significance threshold (α = 0.05). This allows us to confidently reject the null hypothesis. In Table S10, the p-value is likewise below the significance threshold, and thus we reject the null hypothesis.
Therefore, there is a significant difference in prediction performance between LMCrot and CapsNh-Kcr.
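As an illustration, here is a minimal sketch of this comparison using statsmodels; the helper name and the choice of the continuity-corrected chi-squared variant are ours, not necessarily the exact settings used in the study.

```python
# Hedged sketch of McNemar's test for comparing two models' predictions
# on the same test set via a 2x2 paired-correctness contingency matrix.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_compare(y_true, pred_1, pred_2):
    """Return the contingency matrix, chi-squared statistic, and p-value."""
    ok1 = np.asarray(pred_1) == np.asarray(y_true)
    ok2 = np.asarray(pred_2) == np.asarray(y_true)
    table = np.array([
        [np.sum(ok1 & ok2),  np.sum(ok1 & ~ok2)],   # model 1 correct
        [np.sum(~ok1 & ok2), np.sum(~ok1 & ~ok2)],  # model 1 wrong
    ])
    res = mcnemar(table, exact=False, correction=True)  # chi-squared version
    return table, res.statistic, res.pvalue
```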

Fig. S3. Contingency matrix obtained from comparing LMCrot against the CapsNh-Kcr model on the non-histone test set.

Table S4. The overlapping properties of amino acids.

Table S5. The amino acid indexes used for computing ACH.

Table S6. Description of transformer-based pLMs used in this study.

Table S8. Chi-squared (χ²) and p-values obtained by comparing LMCrot (Model 1) against the three base models (Model 2) on the HeLa test set.

Table S10. Chi-squared (χ²) and p-values obtained by comparing LMCrot (Model 1) against the CapsNh-Kcr model (Model 2) on the non-histone test set.