Embedding-based alignment: combining protein language models with dynamic programming alignment to detect structural similarities in the twilight-zone

Abstract

Motivation: Language models are routinely used for text classification and generative tasks. Recently, the same architectures were applied to protein sequences, unlocking powerful new approaches in the bioinformatics field. Protein language models (pLMs) generate high-dimensional embeddings on a per-residue level and encode a “semantic meaning” of each individual amino acid in the context of the full protein sequence. These representations have been used as a starting point for downstream learning tasks and, more recently, for identifying distant homologous relationships between proteins.

Results: In this work, we introduce a new method that generates embedding-based protein sequence alignments (EBA) and show how these capture structural similarities even in the twilight zone, outperforming both classical methods and other approaches based on pLMs. The method shows excellent accuracy despite the absence of training and parameter optimization. We demonstrate that the combination of pLMs with alignment methods is a valuable approach for the detection of relationships between proteins in the twilight zone.

Availability and implementation: The code to run EBA and reproduce the analysis described in this article is available at: https://git.scicore.unibas.ch/schwede/EBA and https://git.scicore.unibas.ch/schwede/eba_benchmark.


METHODS DETAILS

Needleman-Wunsch with BLOSUM matrix
We ran the Needleman-Wunsch global alignment using EMBOSS Needle [7] with default parameters: "score matrix": EBLOSUM62, "Gap_penalty": 10.0, "Extend_penalty": 0.5. Here we report the following Needle outputs: alignment score, sequence identity and sequence similarity. The alignment score was normalized in the same way as the EBA alignment score. The best performance was obtained using sequence similarity and is reported in Table 1 of the main manuscript.

HHalign
We generated alignments for the PISCES pairs using HHalign [10] with default parameters. The required profiles were generated with two iterations of HHblits [10] with default parameters on UniClust30 30_2018_08. As for EBA, we normalized the alignment score produced by HHalign by dividing it by the length of the longer or shorter sequence for the comparison to TM max or TM min, respectively.
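The length normalization described above can be sketched as follows. This is a minimal illustration and not the code used in the analysis; the function name `normalize_alignment_score` and its `by` argument are hypothetical.

```python
def normalize_alignment_score(raw_score, len_a, len_b, by="longer"):
    """Divide a raw alignment score by a sequence length.

    by="longer"  -> divide by the length of the longer sequence
    by="shorter" -> divide by the length of the shorter sequence
    (Hypothetical helper; argument names are illustrative only.)
    """
    if len_a <= 0 or len_b <= 0:
        raise ValueError("sequence lengths must be positive")
    if by == "longer":
        denominator = max(len_a, len_b)
    elif by == "shorter":
        denominator = min(len_a, len_b)
    else:
        raise ValueError("by must be 'longer' or 'shorter'")
    return raw_score / denominator


# Example: a raw score of 120 for sequences of length 100 and 150
score_longer = normalize_alignment_score(120.0, 100, 150, by="longer")
score_shorter = normalize_alignment_score(120.0, 100, 150, by="shorter")
```

Keeping both normalizations available makes it possible to compare the same raw score against either TM score variant.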

pLM-BLAST
We installed pLM-BLAST from https://github.com/labstructbioinf/pLM-BLAST on 27/12/2022. Since pLM-BLAST [5] generates local alignments, multiple alignments are associated with each pair of sequences. For each pair of sequences in the PISCES data, we selected the best alignment score as a proxy for sequence similarity.

ProtTucker
We downloaded the pre-computed embeddings for the PDB sequences from https://github.com/Rostlab/EAT. We then selected the embeddings associated with the sequences used in the analysis described in section 3.1 of the manuscript. Finally, we computed the Euclidean distance between these embeddings.
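As an illustration of the last step, the sketch below computes the Euclidean distance between two embeddings. It assumes that each protein is represented by a single fixed-size vector; the function name `embedding_distance` and the toy vectors are hypothetical and do not come from the EAT repository.

```python
import math


def embedding_distance(emb_a, emb_b):
    """Euclidean distance between two per-protein embedding vectors.

    Assumes both embeddings are flat sequences of floats with the same
    dimensionality (an assumption of this sketch).
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimensionality")
    return math.dist(emb_a, emb_b)


# Toy 2-dimensional "embeddings"; real embeddings have many more dimensions
distance = embedding_distance([0.0, 0.0], [3.0, 4.0])
```

A smaller distance is then read as a higher predicted similarity between the two proteins.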

EBA COMPUTATION TIME
We report here the EBA computation times for the sequence pairs of the analysis described in section 3.1 of the manuscript (Fig. S1). Each point represents a pair of sequences. The time needed to compute the EBA score grows linearly with the product of the lengths of the two sequences. These analyses were run on a single thread on an "AMD EPYC 7742 64-core" processor.
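The dependence on the product of the two lengths is what one expects from a dynamic programming alignment, which fills one table cell per residue pair. The sketch below is a generic global alignment over a precomputed similarity matrix with a linear gap penalty. It is not the EBA implementation, and the function name `global_alignment_score` and the gap value are assumptions made for illustration.

```python
def global_alignment_score(sim, gap=1.0):
    """Global (Needleman-Wunsch style) alignment score over a similarity matrix.

    sim[i][j] is the similarity between residue i of sequence A and
    residue j of sequence B. A linear gap penalty is assumed for brevity.
    Returns the optimal score and the number of table cells filled.
    """
    n, m = len(sim), len(sim[0])
    # dp[i][j]: best score aligning the first i residues of A
    # with the first j residues of B
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = -gap * i
    for j in range(1, m + 1):
        dp[0][j] = -gap * j
    cells = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + sim[i - 1][j - 1],  # align i with j
                dp[i - 1][j] - gap,                    # gap in B
                dp[i][j - 1] - gap,                    # gap in A
            )
            cells += 1
    return dp[n][m], cells


# Toy 2 x 3 similarity matrix: 6 cells are filled
score, cells = global_alignment_score([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], gap=0.5)
```

The inner loop runs once per residue pair, so the work is proportional to the product of the two sequence lengths.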

Fig. S1.
Computation times for the PISCES pairs as a function of the product of the sequence lengths. We compare EBA, EBA plain, AD and TM-align. We included only the sequence pairs for which TM-align was able to generate an alignment. Left: the x-axis shows the product of the lengths of the two sequences and the y-axis the computation time. Right: the same data on a logarithmic scale, which makes the difference with AD visible. The computation time includes only the generation of the similarity matrix and the computation of the alignment, not the generation of the per-residue embeddings.

Table S3.
Average computation times for the PISCES pairs. We compare EBA, EBA plain, AD and TM-align, including only the sequence pairs for which TM-align was able to generate an alignment.

FLEXIBLE DOMAIN IDENTIFICATION
Compared to the TM score, EBA offers several advantages. For example, the TM score relies on rigid superpositions, so structural flexibility can limit its ability to detect similarity. As illustrated in Figure S2, two structures that differ by a hinge movement may receive a low TM score, whereas EBA can still identify the similarity between the corresponding sequences.

Fig. S2.
Outlier of the EBA classification in the PISCES analysis. These proteins share the same domains but in a different relative orientation. EBA captured the structural similarity between these proteins, whereas the TM score did not.

Fig. S3.
Cumulative sensitivity distribution for the annotation transfer analysis on the SCOPe40 database at the family, superfamily and fold levels. The sensitivity is computed as the area under the ROC curve up to the first FP, with TPs being matches within the same group and FPs being matches between different folds. We report the performance of EBA, EBA plain and AD for the following protein language models: ProstT5 [3], ProtT5 [1] and ESM-1b [9].
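One common reading of "area under the ROC curve up to the first FP" is the fraction of all TPs that are ranked before the first FP for a given query. The sketch below follows that reading. It is an assumption and not the benchmark code; the function name and the "IGNORE" label, used here for hits that count as neither TP nor FP, are hypothetical.

```python
def sensitivity_up_to_first_fp(ranked_labels):
    """Fraction of TPs found before the first FP in a ranked hit list.

    ranked_labels: labels of the hits for one query, sorted from best to
    worst score. Each label is "TP", "FP" or "IGNORE" (hypothetical label
    for hits that are neither TP nor FP).
    """
    total_tp = sum(1 for label in ranked_labels if label == "TP")
    if total_tp == 0:
        return 0.0
    found = 0
    for label in ranked_labels:
        if label == "FP":
            break  # stop at the first false positive
        if label == "TP":
            found += 1
    return found / total_tp


# Two of the three TPs are ranked before the first FP
value = sensitivity_up_to_first_fp(["TP", "IGNORE", "TP", "FP", "TP"])
```

Computing this value for every query and plotting the sorted values gives a cumulative distribution of the kind shown in the figure.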

Fig. S4.
Posterior probability of belonging to the same group in the annotation transfer analysis on the SCOPe40 dataset. The posterior probabilities are computed using EBA min with ProtT5 and ESM-1b as underlying language models.

Fig. S5.
Scatter plots comparing the alignment quality of EBA-ProstT5, DALI [4] and Foldseek [6]. For each element of the HOMSTRAD benchmark set, we compare the F1 scores obtained with the different aligners. DALI failed on 46 of the alignments; these are not included in the scatter plots.

Fig. S6.
Pairwise distance matrix and enhanced similarity matrix (SM enh) of the two pairs of sequences shown in Figure 1 of the manuscript. The figure shows how the signal enhancement strengthens the signal for pair 2.

Table S1.
Spearman correlations between the similarity predictions obtained by running a Needleman-Wunsch global alignment. The best scores in this table are reported in Table 1 of the manuscript.

Table 1 of the manuscript. Since TM-vec predicts TM scores, no normalization is needed.

Table S2.
Spearman correlations between the similarity predictions obtained by running TM-vec. The best scores (cath model large) are reported in Table 1 of the manuscript.

Table S4.
Alignment quality: sensitivity and precision for the HOMSTRAD [8] benchmark. Sensitivity is defined as TP residues in the alignment / query length, and precision as TP residues / alignment length. These values refer to the plot in Figure 3, panel B of the manuscript.
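The two definitions above, together with the F1 score used for the alignment quality comparison, can be sketched as follows. The sketch assumes that a TP residue is an aligned residue pair that also appears in the reference alignment, that the alignment length is the number of aligned pairs in the predicted alignment, and that F1 is the usual harmonic mean of sensitivity and precision. The function name `alignment_quality` is hypothetical.

```python
def alignment_quality(predicted_pairs, reference_pairs, query_length):
    """Sensitivity, precision and F1 of a predicted alignment.

    predicted_pairs / reference_pairs: iterables of (query_index,
    target_index) aligned residue pairs. A pair counts as a TP when it
    appears in both alignments (an assumption of this sketch).
    """
    predicted = set(predicted_pairs)
    reference = set(reference_pairs)
    tp = len(predicted & reference)
    alignment_length = len(predicted)
    sensitivity = tp / query_length if query_length else 0.0
    precision = tp / alignment_length if alignment_length else 0.0
    total = sensitivity + precision
    f1 = 2 * sensitivity * precision / total if total else 0.0
    return sensitivity, precision, f1


# Toy example: 2 of 3 predicted pairs match the reference, query length 4
sens, prec, f1 = alignment_quality(
    [(0, 0), (1, 1), (2, 3)],
    [(0, 0), (1, 1), (2, 2), (3, 3)],
    query_length=4,
)
```

In this toy example the sensitivity is 2/4 and the precision is 2/3.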