TemStaPro: protein thermostability prediction using sequence representations from protein language models

Abstract

Motivation: Reliable prediction of protein thermostability from its sequence is valuable for both academic and industrial research. This prediction problem can be tackled using machine learning and by taking advantage of the recent blossoming of deep learning methods for sequence analysis. These methods can facilitate training on more data and, possibly, enable the development of more versatile thermostability predictors for multiple ranges of temperatures.

Results: We applied the principle of transfer learning to predict protein thermostability using embeddings generated by protein language models (pLMs) from an input protein sequence. We used large pLMs that were pre-trained on hundreds of millions of known sequences. The embeddings from such models allowed us to efficiently train and validate a high-performing prediction method using over one million sequences that we collected from organisms with annotated growth temperatures. Our method, TemStaPro (Temperatures of Stability for Proteins), was used to predict thermostability of CRISPR-Cas Class II effector proteins (C2EPs). Predictions indicated sharp differences among groups of C2EPs in terms of thermostability and were largely in tune with previously published and our newly obtained experimental data.

Availability and implementation: TemStaPro software and the related data are freely available from https://github.com/ievapudz/TemStaPro and https://doi.org/10.5281/zenodo.7743637.

'protein_id': the header taken from the FASTA file of the input protein;
'sequence': the amino acid sequence of the protein;
'length': the length of the protein's amino acid sequence;
't??_binary': a binary prediction label for a given temperature threshold (one of the six thresholds is written in place of the question marks); the label is assigned by rounding the raw prediction (see the next point) at this temperature threshold;
't??_raw': a raw prediction value for a given temperature threshold (a real number from the interval [0, 1]);
'left_hand_label': a label of the highest temperature range at which the protein was predicted to still be thermostable (possible values: [...]);
'right_hand_label': a label that is interpreted in the same way as 'left_hand_label', yet it is assigned by reading the outputs starting from the right (the possible values of the label coincide with those of 'left_hand_label');
'clash': a Boolean identifier of whether a contradiction between the models' predictions was observed; the expected output is a non-increasing sequence of binary predictions when the outputs are read from left to right in increasing order of the temperature thresholds (the expected output is labelled '-' and other cases are assigned '*').
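
The relationship between the raw values, the binary labels, and the 'clash' flag can be illustrated with a short Python sketch. It is not taken from the TemStaPro code base: the function names, the threshold values used in the usage example (40 to 65 in steps of 5), and the handling of a raw value of exactly 0.5 are assumptions made for illustration. The 'left_hand_label' and 'right_hand_label' columns are left out because their possible values are not listed here.

```python
def binarize(raw_values):
    """Round raw predictions from [0, 1] to 0/1 labels.

    Assumption: a raw value of exactly 0.5 is rounded up to 1.
    """
    return [1 if value >= 0.5 else 0 for value in raw_values]


def clash_flag(binary_labels):
    """Return '-' for the expected pattern and '*' for a contradiction.

    The labels must be ordered by increasing temperature threshold.
    The expected pattern never goes from 0 back to 1.
    """
    for previous, current in zip(binary_labels, binary_labels[1:]):
        if current > previous:
            return "*"
    return "-"


def summarize(raw_by_threshold):
    """Build the 't??_binary', 't??_raw' and 'clash' fields for one protein.

    raw_by_threshold maps a temperature threshold (int) to a raw prediction.
    """
    thresholds = sorted(raw_by_threshold)
    raws = [raw_by_threshold[t] for t in thresholds]
    binaries = binarize(raws)
    row = {}
    for threshold, binary, raw in zip(thresholds, binaries, raws):
        row[f"t{threshold}_binary"] = binary
        row[f"t{threshold}_raw"] = raw
    row["clash"] = clash_flag(binaries)
    return row


# Hypothetical raw predictions (not real TemStaPro output).
consistent = summarize({40: 0.98, 45: 0.91, 50: 0.77, 55: 0.42, 60: 0.10, 65: 0.03})
contradictory = summarize({40: 0.90, 45: 0.30, 50: 0.80, 55: 0.20, 60: 0.10, 65: 0.05})
print(consistent["clash"], contradictory["clash"])
```

In the first hypothetical protein the binary labels fall from 1 to 0 once and stay there, so no contradiction is flagged; in the second, the label at the 50 threshold is 1 although the label at 45 is 0, which is the kind of case marked with '*'.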

Figure S1. Distribution of protein sequences with respect to the organism's growth temperature in the TemStaPro-Minor-30 cross-validation and testing data subsets.

Figure S2. Distribution of protein sequences with respect to the organism's growth temperature in the TemStaPro-Major-30 data subsets.

Figure S3. Distributions of protein sequences with respect to the organism's growth temperature in the TemStaPro-Major-30 data subsets, plotted as overlapping histograms for protein sequences predicted to belong to either the cytoplasm or the extracellular class. The number of sequences in each group is indicated in brackets.

Figure S4. Schemes of the architectures: a single-layer perceptron (upper), a feed-forward neural network model with one hidden layer (middle), and a feed-forward neural network model with two hidden layers (lower).

Figure S7. Comparison of the effect of weight decay on the MCC scores of the best architectures (after cross-validation).

Figure S11. An example tab-separated table that is the output of the per-residue prediction mode of the TemStaPro program.

Figure S12. An example plot for the output of the per-residue mode (left) and the per-segment mode with the default window size of 41 (right).

Figure S13. (a) Thermal stability of Cas proteins without guide RNA (apo) or loaded with sgRNA (RNP) for Ghy2Cas9 and SpyCas9. Protein unfolding was measured using nano differential scanning fluorimetry (nanoDSF) over a temperature range from 20 °C to 80 °C. Fluorescence was monitored as the temperature increased at a rate of 1 °C per second. The inflection point of the fluorescence curve is interpreted as the unfolding point of the protein (Tm). Data points collected from replicate experiments are plotted as circles, and the means are plotted as dashes. (b) The double-stranded DNA (dsDNA) cleavage activities of Ghy2Cas9 and SpyCas9 RNPs were measured using in vitro assays containing fluorophore-labeled dsDNA target substrates. Cleaved fragments were quantitated and are represented in a heatmap showing overall activity at temperatures ranging from 37 °C to 70 °C. The intensity of the blue colour indicates the fraction of substrate cleaved.

Table S1. Models that were tested with ESM-2 and ProtT5-XL embeddings as input.
An example tab-separated table that is the output of the (default) global prediction mode of the TemStaPro program. The main output of the method is a TSV table with 8 columns: