Scoring alignments by embedding vector similarity

Abstract Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness, being contextually dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning in order to leverage the power of enormous amounts of unlabelled data to generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicate that the alignments produced using the new \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $E$\end{document}-score method, especially ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $E$\end{document}-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.


INTRODUCTION
Sequence similarity is fundamental in sequence analysis.Significant similarities do not appear by chance but are evolutionarily motivated, with corresponding implications on their associated functions.Given a new sequence, its functionality is much faster investigated by performing a database search to identify known similar sequences and inferring the function of the new sequence from the existing information, resulting often in accurate identification.This procedure is sometimes the only way to infer function.Sequence similarity search is the most widely used procedure in bioinformatics, the BLAST program [1,2] being one of the most cited contributions in the history of science.(The two BLAST papers [1,2] combined exceed 190 000 citations on Google Scholar as of August 2023.) The most important component of similarity search algorithms, such as BLAST, is the function that describes the similarity scores of various components, such as nucleotides for DNA and amino acids for proteins.We consider here the much more complicated case of proteins, where scoring of alignments is done using PAM [3] or BLOSUM [4] matrices, which give the substitution scores for pairs of amino acids.These matrices are fixed, which means that the substitution score for a pair of amino acids does not depend on their context, that is, the surrounding sequence of the protein containing them.We propose a new scoring function for residue pairs that is dependent on the context in which the residues appear.The new score makes use of protein embeddings.
Many of the best current techniques in Natural Language Processing (NLP) involve transfer learning based on pre-trained models using self-supervised learning on unlabelled data.Such data are freely available in enormous amounts, such as the Common Crawl corpus (commoncrawl.org).Words that share similar meanings or are related often appear in similar contexts.This enables scalable models to map words to numerical vectors (embeddings) in a high-dimensional space.Words with similar co-occurrence patterns are represented by vectors that are closer in this space.Such text embeddings are able to capture the semantic meaning of words.One of the best known earlier models is word2vec [5].Denoting by w v the vector associated with a word w, an example ( [5]) of meaning capture is shown by the fact that king v − man v + woman v produces a vector that is closest to queen v .
Our new scoring function uses any embedding method that associates a vector to each residue.For any such method, E, we define the E-score between two residues as the cosine similarity between their associated vectors.We investigate this new scoring function for contextual embeddings from the point of view of correct alignment generation.We have selected a wide variety of reference multiple sequence alignments (MSAs) from NCBI's CDD (Conserved Domain Database) [30], that are manually curated according to the function.For each, we randomly selected many pairs of protein sequences in order to check how close the pairwise alignment is to the original reference MSA.Several alignment distances [31], including two that we introduce here, were used to compare the best new contextual ProtT5-score, with the best old one, the BLOSUM45 matrix.ProtT5-score performs overwhelmingly better than BLOSUM45, producing an average alignment distance that is better in all but one of the 37 MSAs tested.
The new scoring method proposes completely changing the way alignments are computed, with far-reaching implications in all areas that use sequence similarity.

Cosine similarity
For two vectors, A = (A i ) i=1..n and B = (B i ) i=1..n , the cosine similarity between A and B is defined as the cosine of the angle, θ , between the two vectors: The cosine similarity takes values between −1 and 1, with −1 for vectors of opposite direction, 1 for the same direction and 0 for orthogonal vectors; Figure 1 illustrates these three cases.

E-score
For a protein sequence P, of length m, denote its ith residue by P [i].
Consider an embedding method, E, that produces embedding vectors of size n.Assume that E, when applied to protein P, generates an m × n matrix, E(P), that has as its rows all m embedding vectors of the residues of P; denote the embedding of the ith residue by E(P) i = E(P)(i, :).
For two proteins P and Q, and two positions, i within P and j within Q, the E-score between the residues P[i] and Q[j] is defined as the cosine similarity between the corresponding embedding vectors: Note that, while not ref lected by our notation, E−score(i, j) depends also on the protein sequences P and Q.Thus, when the embedding method is ProtT5, ProtBert, ESM2, etc., we obtain ProtT5-score, ProtBert-score, ESM2-score, and so on.Clearly, for any contextual embedding method E, the E-score changes with context as well.

Visualizing E-scores
To have an idea what these new E-scores look like, we constructed some scoring matrices using those, and compared them with the BLOSUM matrices.While a BLOSUM matrix is fixed, the Escores change with context, so we computed average scores for each pair of amino acids.We picked the longest MSA, NBD_sugar-kinase_HSP70_actin, from the selected six representative MSAs (from the first column of Table 3).Then, we distinguished between 'aligned' residues, which appear in the same column of the MSA, and 'unaligned' residues, which occur in different columns.Since the E-scores use the context, it is expected that aligned residues have significantly higher scores compared to unaligned ones.For each pair of amino acids, we randomly selected occurrences of their residues in the MSA and computed the average score; 100 000 pairs were selected in each of the categories, aligned and unaligned.We do this separately for the aligned and the unaligned case.This way, for a given embedding type and an MSA, we obtain two 20 × 20 matrices.We show the aligned matrices for three embeddings: ProtT5-score, ESM2-score and ProtAlbertscore, alongside the BLOSUM45 matrix, in Figure 2. It appears from Figure 2 that the matrix for the ProtT5-score is the most similar to the BLOSUM45 matrix.The heatmaps of all five BLOSUM matrices and all twelve E-score matrices (two matrices for each embedding method, for the aligned and the unaligned case) are shown in the Supplementary Table 8.It is interesting to notice the wide variety of ranges among the E-score matrices.The unaligned matrices have lower, and narrower, ranges.It appears that a wider range of values is an indication of better performance for alignments.

Correlations
It is of interest to see how well the new E-scores are correlated with the BLOSUM matrices.We consider here five BLO-SUM matrices: BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80 and BLOSUM90, and six embedding methods: ProtT5, ProtBert, ProtAlbert, ProtXLNet, ESM-1b and ESM2.For each embedding method, we use for computing the correlations the same matrices computed above for the NBD_sugar-kinase_HSP70_actin MSA. Figure 3 gives the Pearson correlations between the five BLOSUM matrices and the twelve E-score matrices.It can be seen that the ProtT5 matrices have the highest correlation with all BLOSUM matrices, followed, in order, by ESM2 and ProtBert, and then Pro-tAlbert, ESM-1b and, by far the last, ProtXLNet.For each method, the aligned matrix has higher correlation with BLOSUM matrices than the unaligned one.Interestingly, the unaligned ProtT5 matrix comes in second place, after the aligned ProtT5 and before all the other aligned matrices.With the exception of ProtXLNet, the two matrices, aligned and unaligned, for the same embedding method, are highly correlated.

Alignment distances
Our goal is to show that the use of the new E-scores provides better alignments compared to the BLOSUM scores.We do this by comparing their ability to produce true alignments, as given by the reference alignments from CDD; see Reference alignments subsection below.We selected a number of reference MSAs, discussed in the Results section, and investigated how close pairwise alignments produced using E-score and BLOSUM matrices, respectively, are from the alignments between the same sequences, as induced by the reference MSAs.To compare how close two alignments are, we need to measure distances between alignments.We use five distance metrics, three existing ones, d ssp , d seq and d pos from [31], and two that we introduce here, d cc and d d .
We give first a formal definition of an alignment.Consider an alphabet Σ.For a string P over Σ of length |P| = n, let Note that there is no column with two gaps.
Four distances have been introduced by [31], of which we use three that are suitable for our purpose: d ssp , symmetrized sum-ofpairs score, which ignores gaps; d seq , which treats all gaps in the same sequence equally; and d pos , which incorporates positional information about the gaps.The original alignment is modified to ref lect each gap treatment; in Example 1, alignment A 1 is converted into the following three alignments, to be used for each of the three distances d ssp , d seq , d pos , respectively: Similarly, A 2 is converted into the following three versions: The three distances, d ssp , d seq , d pos , are then defined using the three versions above, in order, using the homology set for each character, H X (A) i j , for alignment A, X ∈ {ssp, seq, pos}, sequence i, position j, that contain the characters from the other sequences in the same column as j.For example, H pos (A 1 ) 2 }. d ssp is defined using the idea of the Jaccard distance [32] on the homology sets H ssp : .
The other two distances use symmetric difference on the sets H X , for X ∈ {seq, pos}: .
We refer the reader to [31] for details and examples.

Relative displacement distance
The fourth distance, d d , is new and measures the relative displacement between the positions of the letters in the two strings.Put |P| = n and |Q| = m and let 1 ≤ for any 1 ≤ j ≤ m.For any i, 1 ≤ i ≤ n, and j, 1 ≤ j ≤ m, denote the distance between the positions of Given two alignments, A 1 and A 2 , between the same strings, we define the relative displacement distance between them as follows: The division by mn(m + n) is done to scale the value between 0 and 1.For any i, j with 1 Example 1: for alignment A 1 , we have n = m = 3, and

Closest context distance
The first three distances above, d ssp , d seq and d pos , consider only the column context of each residue, disregarding the position at which this occurs.On the other hand, the d d distance considers only the relative positions of the residues, disregarding their contexts.We propose next a distance that captures both position and context, by considering the distance to the closest position with the same context.We introduce more notation.For a string P over Σ, of length |P| = n, denote the range For a string w over Σ∪{−}, denote by L (w) the number of letters from Σ in w; e.g., L (DN − N) = 3.
We modify slightly some notation from above.If |P| = n and |Q| = m, and let p ℓ,i be the position in P Aℓ at which P[i] occurs; that is, P Aℓ [p ℓ,i ] = P[i], for any 1 ≤ i ≤ n and ℓ = 1, 2. Similarly, let q ℓ,j be the position in For a given i, 1 ≤ i ≤ n, we define d P,i,1 , the contribution of P[i] from alignment A 1 to the distance, as follows, depending on whether the letter opposite to is a letter, then d P,i,1 is the number of letters in Q A2 from position p 2,i to the closest occurrence of a residue that is the same as Q A1 [p 1,i ] ('closest' means fewest letters; gaps are not counted): In Example 1, for i = 1, we have p 1,1 = 2, p 2,1 = 1, P A1 [2] = P A2 [1] = P [1] = R, and Q A1 [2] = N.The closest N to the position 1 in Q A2 is at position 3, and there is one letter, D, in the range from 1 to 3. Therefore, 1 is the difference between the number of letters in the two strings, Q A1 and Q A2 , before the position of P[i] (note that this does not depend on whether Q A2 [p 2,i ] is a letter or a gap): In Example 1, for i = 2, we have p 1,2 = 3, p 2,2 = 2, P A1 [3] = P A2 [2] = P [2] = K, and Then, the closest context distance between A 1 and A 2 is The division by 4mn is done to scale the value between 0 and 1, because we always have d P,i,ℓ ≤ m and d Q,j,ℓ ≤ n, thus making the sum after the fraction at most 4mn.In our Example 1, we have

Reference alignments
There are multiple benchmark databases available to validate the accuracy of alignments.Most of these databases include comparatively similar sequences [33], albeit with the addition of features that are often difficult to align such as repeats, and many of the benchmarks were created using simulation techniques [34].Most alignment algorithms can properly align similar sequences and the purpose of the method reported here is to align very distantly related protein sequences.For distantly related proteins, we do not agree that simulation methods have completely captured the vagaries of protein sequence evolution.We therefore wanted to use very distantly related protein sequence alignments that had been done by hand and based on the function of the proteins.To this end, we choose to make use of the CDD, [30], from NCBI.This database has hand curated alignments and furthermore, it lists these alignments in a hierarchical fashion.Thus, alignments can be chosen at different degrees of divergence from the same types of proteins.
Several alignments were chosen from CDD to explore different degrees of divergence and to explore different lengths of alignments.Pairs of protein groups were chosen to be very divergent (fold or superfamily level) and then a superfamily or family from within that protein group.Here are several examples: Globinlike, cd01067 (consensus alignment length 161 sites with 119 aa) and Hb-alpha-like, cd08927 (consensus alignment length 142 sites with 140 aa); 7tm_GPCRs, cd14964, (consensus alignment length 420 sites with 267 aa) and 7tmA_photoreceptors_insect, cd15079 (consensus alignment length 301 sites with 292 aa); NBD_sugar-kinase_HSP70_actin, cd00012 (consensus alignment length 1154 sites with 185 aa) and FGGY_YpCarbK_like, cd07782 (consensus alignment length 660 sites with 509 aa).The difference between the alignment length and the consensus number of aa gives an indication of the divergence within the grouping.
We have chosen 49 multiple sequence alignments from CDD.They are presented in Table 1, each with the conserved domain name, source, number of protein sequences and alignment length.The list is sorted by the alignment length.The MSAs are sorted increasingly by length.

Case studies
Before presenting a comparison between E-scores and the BLO-SUM matrices, we show in this section two examples of the advantage presented by the alignments produced by the ProtT5-score compared with the BLOSUM45 matrix.The first example is for two sequences from the KAZAL_FS (cd00104) MSA.The sequences and the two alignments are shown below.The BLOSUM45 alignment is correct in the first half, whereas the second half is significantly different, with several gaps added, whereas the true alignment has none.
The second example involves two sequences from the Globin_sensor (cd01068) MSA.The sequences and the two alignments are shown below.In this example, the BLOSUM45 alignment diverges drastically from the true alignment; the algorithm is deceived by a number of identities between the two sequences.
It appears that the BLOSUM45 matrix often introduces false gaps to achieve a higher number of matches.That is, it blindly works to achieve a maximum score while the ProtT5-score avoids this trap, due to its contextual nature.For example, note in the first example that BLOSUM45 removes one match to instead introduce six additional ones.Remarkably, the ProtT5-score finds the true alignment, in spite of the fact that it contains in the middle 21-18 region only one match instead of six.The ProtT5score appears to be taking more information from the properties of the amino acids than are encoded in the matrix or capable of being encoded in the matrix without context.

Comparison procedure
We compare the E-scores against the BLOSUM matrices on the 49 MSAs introduced above as follows.For each MSA, we pick 1000 randomly selected protein pairs (if the total number of pairs is less than 1000, then we pick all pairs), compute pairwise alignment of each pair using either the E-score or BLOSUM matrix, according to the Needleman-Wunsch dynamic programming global alignment algorithm [35], but without any penalties for gaps at the ends, called semi-global alignment.This is the most suitable alignment type, as we expect these sequences to align globally; however, using end-gap penalties could artificially alter the alignments.
For BLOSUM matrices we used the NCBI BLAST default values: −11 for gap opening and −1 for gap extension.For E-scores, we used experimentally set gap penalties: −.25 for gap opening and −0.01 for gap extension.Experiments indicate that the algorithm is robust to penalty changes.
Each pairwise alignment is then evaluated against the reference alignment from CDD by computing the distance to the pairwise alignment induced by the reference MSA for the two sequences.The distance is computed using all five distances introduce above: d cc , d d , d pos , d seq and d ssp .
The comparison for all E-scores and all BLOSUM matrices on all data is very time consuming; therefore, we need to choose the best E-score and the best BLOSUM matrix.For that purpose, we have selected six representative MSAs of the above 49; see the first column of Supplementary Tables 3-6.We have first compared all six E-scores on these six MSAs; the results are shown in Supplementary Table 3.The two top performers are the ProtT5score and the ESM2-score.They are compared head-to-head in Supplementary Table 4, using the same six MSAs, but considering also the Wilcoxon signed-rank test P-values, with a threshold of 0.01, to check the validity of the findings.The result is that ProtT5 is the best E-score.
Similarly, to find the best BLOSUM matrix for our purposes, we performed the same procedure.All five candidates, BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80 and BLOSUM90, are compared in Supplementary Table 5, and the top two, BLOSUM45 and BLO-SUM62, in Supplementary Table 6.BLOSUM45 is the best scoring matrix for our comparison.This is not unexpected, given the fact that we consider distantly related sequences.The most commonly used matrix, BLOSUM62, comes in second place very closely behind BLOSUM45.

ProtT5-score versus BLOSUM45 matrix
In this section, we compare head-to-head the ProtT5-score against the BLOSUM45 matrix, the top performers, as decided by the previous section.We have compared the two, as described above, on all data; see Supplementary Table 1, where the Wilcoxon test P-values larger than.01 are indicated in red.We have removed all tests with P-values higher than .01 and present the comparison for the d cc , d d and d pos distances in Table 2.The d cc distance presents the most important evaluation, which, together with d d and d pos gives a comprehensive evaluation of the similarity between the pairwise alignments and true ones, as induced by MSAs.The results for the d ssp and d seq distances are similar to those for the d pos distance; see Supplementary Table 1.
ProtT5-score is better in all but one test on the shortest MSA.Regarding the overall number of better cases, ProtT5 has 74% of the cases, BLOSUM45 13%, while for the remaining 13%, they were equal; see Supplementary Table 2.
The tests with few samples are not very reliable, as indicated by the fact that almost all P-values above the threshold are for tests with 36 or less.Many of these tests are for short MSAs.The only case where BLOSUM performs better than ProtT5-score, Bbox2_TRIM37_C-VIII, is among these.In fact, one of its P-values is 4.87E-03, close to our threshold of 1.00E-02.
For the much more difficult case of long alignments, ProtT5score outperforms BLOSUM matrices in 94% of cases of alignments with length over 300, whereas BLOSUM45 is better in only 4% of those cases; see Supplementary Table 2.
For a better visualization of the comparison, we plotted in Figure 4(A), for each of the five distances, the ratio between the distance to the reference alignment from the BLOSUM45 alignment and from the ProtT5 alignment, respectively.That means higher than 1 is better in favour of ProtT5.It is very clear from the figure that ProtT5 has a much better similarity with the reference alignment, the advantage being most clear for the d d distance, as expected.
We also plotted the percentage of cases where ProtT5score and BLOSUM45, respectively, produce better results (see Supplementary Table 2), as a function of MSA length, in Figure 4(B).The difference in performance in favour of the ProtT5score is increasing with alignment length, and therefore with alignment complexity, to the point where ProtT5-score produces better results in practically all cases.
The comparison for the purely global alignment case (where end-gaps are considered) is presented in the Supplementary Table 7.The conclusions are similar to those presented here for the semi-global case.

ProtAlbert-score versus BLOSUM45 matrix
We have seen above that the best E-score, ProtT5, is much better than the best BLOSUM matrix, BLOSUM45.However, it turns out that the least powerful E-score, ProtAlbert (strictly speaking, ProtXLNet is the least powerful, but it gives errors sometimes, and cannot be tested properly) is still clearly outperforming the top BLOSUM45 matrix.We show only the comparison on the six selected alignments in Table 3.This indicates most clearly the superiority of the E-scores compared with BLOSUM matrices for scoring alignments.

DISCUSSION AND CONCLUSION
We have proposed a new method for scoring sequence alignments, E-score, using the cosine similarity between the vectors associated using any word embedding algorithm, E. When using contextual embeddings, the E-score has the advantage of incorporating into the score associated with each position contextual information from the neighbouring positions in the sequence.
We have thoroughly tested the new E-score in the context of protein global alignments, by comparing the top scoring function, ProtT5-score, against the BLOSUM45 matrix on a large set including a wide variety of reference multiple sequence alignments from NCBI's Conserved Domain Database.The pairwise alignments produced by using the ProtT5-score are much closer to the true alignments, as induced by the reference multiple sequence alignments, compared to BLOSUM45.Furthermore, even the least powerful ProtAlbert-score outperforms the top BLOSUM45 matrix by a wide margin.This means that all E-scores clearly outperform all BLOSUM matrices, indicating the superiority of the new method.
Many issues remain to be investigated regarding the performance of the newly introduced E-scores for biological sequences, such as local alignments, multiple alignments, database search, The best values are shown in boldface.We show only the distances dcc, d d and dpos; the comparisons using dssp and dseq are similar to the one for dpos genome alignment, phylogenetic tree reconstruction, etc.Also, the case of aligning nucleic acid sequences-DNA, RNA-though simpler than the case of proteins, is a whole area where E-scores can be applied as well.
While we have tested the new E-score in the context of biological sequences, particularly proteins, the new method can be applied to any sequences where computation of embedding vectors is possible, which include practically all existing textual data, the largest type of data available.All Large Language Models have the possibility to produce word embedding vectors for any input text.Considering the importance and implications of text similarity in general, our method presents the potential for applications into many areas.We hope that future investigations will explore how our method can be applied in various fields, revealing its effectiveness and discovering new uses.The language models are expected to improve at a rapid pace in the future, with expected corresponding improvements for those involving biological sequences.The ideas presented apply directly to any new model, meaning that the E-score will improve as well.
Specialized models, obtained usually by fine tuning the general ones, or adapting them to specialized tasks, are expected to provide improved performance.Regarding the case of biological sequences, where alignments play a particularly vital role, the possibility exists for training, or fine tuning, models for highly specialized tasks, such as specific types of proteins.

Key Points
• Sequence similarity is central in biology, as similar sequences share function and common ancestry.BLO-SUM matrices play a crucial role in identifying similarities, but have the drawback that they are independent of context.• We propose a new method, E-score, for amino acid similarity that uses the context of the protein sequences.It uses the cosine similarity of the embedding vectors associated with the residues.• E-scores greatly surpass the ability of BLOSUM matrices to produce true alignments.• The new method can be applied to the similarity of any textual data, with a limitless number of applications.

Figure 3 .
Figure 3. Pearson correlations between 5 BLOSUM matrices and 12 Escore matrices.For each embedding method we considered, the aligned and unaligned matrices for the NBD_sugar-kinase_HSP70_actin MSA have been included.

Example 1 .
denote the ith character of P; thus P = P[1]P[2] • • • P[n].For two strings P, Q over Σ, an alignment between P and Q is a pair of strings A = (P A , Q A ), of the same length |P A | = |Q A |, over Σ ∪ {−} (the '-' character stands for a gap), with P A (Q A ) being obtained by inserting gaps at any positions in P (Q, resp.), with the restriction that, for any position i, at least one of the characters P A [i] and Q A [i] is not a gap.Here are two examples of alignments between peptides P = RKD and Q = DNN:

Figure 4 .
Figure 4. (A) The ratio between the performance of BLOSUM45 and ProtT5-score in terms of distances to the reference from their alignments.The results are sorted by increasing MSA length.Higher than 1 (above the y = 1 line) indicates ProtT5-score is better.(B) The percentage of cases when ProtT5-score and BLOSUM45, respectively, are producing better results, sorted increasingly by MSA length.The results for the same distance are plotted in the same colour, with solid line for ProtT5-score and dashed line for BLOSUM45.

Table 1 :
The 49 MSAs from the CDD: the columns give, in order, the conserved domain, source ID, number of protein sequences and alignment length

Table 2 :
ProtT5-score against BLOSUM45 matrix: comparison using 37 MSAs, for which the Wilcoxon test P-values are smaller than .01

Table 3 :
ProtAlbert-score against BLOSUM45 matrix: comparison using 6 MSAs.The best values are shown in boldface.We show only the distances d cc , d d , and d pos ; the comparisons using d ssp and d seq are similar to the one for d pos