WAS IT A MATch I SAW? Approximate palindromes lead to overstated false match rates in benchmarks using reversed sequences

Abstract Background Software for labeling biological sequences typically produces a theory-based statistic for each match (the E-value) that indicates the likelihood of seeing that match’s score by chance. E-values accurately predict false match rate for comparisons of random (shuffled) sequences, and thus provide a reasoned mechanism for setting score thresholds that enable high sensitivity with low expected false match rate. This threshold-setting strategy is challenged by real biological sequences, which contain regions of local repetition and low sequence complexity that cause excess matches between non-homologous sequences. Knowing this, tool developers often develop benchmarks that use realistic-seeming decoy sequences to explore empirical tradeoffs between sensitivity and false match rate. A recent trend has been to employ reversed biological sequences as realistic decoys, because these preserve the distribution of letters and the existence of local repeats, while disrupting the original sequence’s functional properties. However, we and others have observed that sequences appear to produce high scoring alignments to their reversals with surprising frequency, leading to overstatement of false match risk that may negatively affect downstream analysis. Results We demonstrate that an alignment between a sequence S and its (possibly mutated) reversal tends to produce higher scores than alignment between truly unrelated sequences, even when S is a shuffled string with no notable repetitive or low-complexity regions. This phenomenon is due to the unintuitive fact that (even randomly shuffled) sequences contain palindromes that are on average longer than the longest common substrings (LCS) shared between permuted variants of the same sequence. Though the expected palindrome length is only slightly larger than the expected LCS, the distribution of alignment scores involving reversed sequences is strongly right-shifted, leading to greatly increased frequency of high-scoring alignments to reversed sequences. Impact Overestimates of false match risk can motivate unnecessarily high score thresholds, leading to potentially reduced true match sensitivity. Also, when tool sensitivity is only reported up to the score of the first matched decoy sequence, a large decoy set consisting of reversed sequences can obscure sensitivity differences between tools. As a result of these observations, we advise that reversed biological sequences be used as decoys only when care is taken to remove positive matches in the original (un-reversed) sequences, or when overstatement of false labeling is not a concern. Though the primary focus of the analysis is on sequence annotation, we also demonstrate that the prevalence of internal palindromes may lead to an overstatement of the rate of false labels in protein identification with mass spectrometry.


Introduction
The utility of tools that label biological sequences is dependent on the ability to reasonably predict the risk that a proposed label is due to some random or biased process, rather than true signal.An example is the case of sequence homology detection, in which an unlabeled sequence q is compared to a collection of known sequences, and similarity to one or more of those sequences serves as the basis of annotation of q.
Homology detection tools typically represent the risk of false annotation by providing the supporting alignment's E-value along with its score S. For search of a particular query sequence against a given target database T, the E-value gives the number of alignments expected to meet or exceed the score S if T consisted exclusively of unrelated (random) sequences.Theoretically-grounded calculation of the E-value has been established for gap-free alignments (Karlin andAltschul 1990, Altschul andGish 1996), and empirical studies demonstrate that these E-value predictions are accurate for search against randomly generated sequences of nucleotides or amino acids (Mott 1992, Eddy 2008).
Even so, genomics practitioners are keenly aware that these E-values are often overestimates of the confidence with which we should label one sequence as being related to another, so that it is common for analysis pipelines to require seeminglyextreme E-value cutoffs for search matches (Pearson 2013).This is because the models underlying the scoring schemes used in alignment (Henikoff andHenikoff 1992, Sj€ olander et al. 1996) assume that the database sequences are generated under a simple random distribution (sample with replacement), whereas true biological sequences are not always random in this way.
One notable deviation comes in the form of locally repetitive sequence.So-called tandem repeats are typically the result of some form of replication error (Pumpernik et al. 2008), and can present with a repeat period ranging from 1 to several hundred (Thakur et al. 2021).They produce substantial risk of spurious matches when both query and target sequences contain (possibly degenerate) repeats with similar repeated units.One strategy for compensating for tandem repeats is to employ a tandem repeat masking tool (Benson 1999, Frith 2011, Olson and Wheeler 2018) to hide regions of repetitive sequence away from homology detection.Such strategies are necessarily limited by the selection of some score threshold, and may thus either overmask (hiding true homologs) or under-mask (allowing decayed repetitive regions to be identified as true homologs (Wheeler et al. 2012)).
Continuing concerns about real-world accuracy of E-values, and of the compensatory score adjustment/masking procedures, have lead researchers to evaluate tools with benchmarks that empirically measure false positive annotations in addition to true positive recall, rather than rely solely on theoretical E-values.A benchmark will generally contain a set of query sequences and a distinct set of target sequences.In the target set, some sequences are related to a query (these are the positives that the tool will ideally annotate as such), while other sequences are not related to a query (these are negatives or decoys that the tool should not annotate).The benchmarked algorithm is used to compute alignments without knowledge of true labels, and a predicted classification is produced.Wherever the computed classification agrees with ground truth, the algorithm is said to have produced a true positive or true negative, and elsewhere it produces false negatives and false positives, respectively.
A variety of techniques have been developed to automate the production of positive and negative sequence pairs.In this paper, we are particularly interested in methods of creating negative sequence pairs, because these decoys inform efforts to establish score cutoffs that limit false discovery rate.One option for generating decoy sequences is to randomly sample real biological sequences from a large dataset.This strategy is likely to produce an overestimate of the false positive rate because each randomly chosen 'decoy' has some possibility of being homologous to the query.Alternative methods for producing decoys include (i) shuffling the letters of a real biological sequence, or (ii) generating a sequence by sampling letters under a random model.In both of these cases, failure to include repetition or other realistic features of biological sequences may lead to an underestimate of the rate of false positives that will be observed when the tool is applied to real biological sequence.
Another mechanism for producing non-homologous biorealistic sequence is to use reversed biological sequences, so that decoys are generated by selecting true biological sequences and reversing the order of their characters.Reversal has several immediate benefits over stochastic methods.Reversal preserves the distribution of letters in a sequence, and also preserves the existence of short (possibly degenerate) tandemly repetitive regions and regions of low compositional complexity that are often found in biological sequences.Meanwhile, the reversed sequence almost certainly does not fold or behave the same as the original.
Thus, a decoy set consisting of reversed sequences is expected to possess the low complexity and repetitiveness of real biological sequence, but with no true homology to an actual sequence.Such sequences demonstrate much higher rates of apparent "false" matching than do shuffled sequences, and it is tempting to view these as a better (more realistic) benchmark for false labeling, in the context of both nucleotide (Frith 2011) and protein (Steinegger and S€ oding 2017) annotation.We have employed reversed sequences in this way for the same reasons (Wheeler et al. 2012); however, during post-publication analysis of our benchmark results, we realized that "false" matches of transposable element queries to a reversed (not complemented) genome were surprisingly likely to be to "self" -a match of a query sequence to the reversed genome was highly likely to be to a reversal of an instance its own family.In an earlier paper evaluating software for protein search, Eddy (2011) suggested the source of this observation, while explaining why that manuscript's analysis did not include reversed decoy: ("reversed sequences are surprisingly significantly more likely to show a significant match to the original sequence [ … ] because of a counterintuitive statistical effect of the frequency of approximate palindromes").Here, and throughout this manuscript, the term "palindrome" follows the canonical definition and refers to a sequence that reads the same backwards and forwards (e.g.ATGGGGTA or LIWMMWIL).This is in contrast to an alternate usage of the term "palindrome" that sometimes appears in genomics literature: a sequence that consists of two spaced or adjacent inverted repeats, in which matching sequences occur as reverse complement (e.g.GTGAxxTCAC).
Here, we explore the phenomenon of surprisingly high palindrome lengths, and the extent to which they may cause the use of reversed decoys to overstate the risk of false discovery in sequence database search tools.Through empirical analysis, we demonstrate that this palindromic phenomenon explains much of the increase in false discovery relative to shuffled sequence.
1.1 Why are unexpectedly high scores for reversed sequences a problem?
It has recently become common to present the accuracy of homology search tools by reporting the percent of all true instances that are assigned an E-value better than the highestscoring decoy (which we refer to as recall-0 and is equivalent to the measure primarily reported in (Steinegger and S€ oding 2017) and (Buchfink et al. 2021)).
When the top-scoring decoy produces E-value D, recall-0 will ignore differences in sensitivity that occur for matches with E-value >D.If the method of generating decoys is prone to producing false matches with inappropriately high score (low E-value), then the recall-0-imposed cutoff for differentiating between tools will be extreme, such that sensitivity differences in interesting score ranges are ignored.For example, a decoy with D ¼ 1e-10 will ensure that recall-0 does not distinguish between one tool that reports 20 matches with E-value between 1e-5 and 1e-10, and a separate tool that only reports a single match in that range.Failure to distinguish tools producing different rates of recall in this 'marginal' score range may be concerning, particularly when there is a risk that the marginal cutoff might be unintentionally set higher than is appropriate.
As we demonstrate in the Results section, reversed decoy sequences are prone to producing surprisingly high scores (low E-values).This is true both between a query sequence q and its reversal and also between q and reversals containing many mutations.As a result, we suggest that care be taken in using reversed sequences as decoy sequences when evaluating the occurrence of false discovery (Steinegger and S€ oding 2017) or selecting score thresholds for reporting (Storer et al. 2021).If reversed sequences are to be used as decoys, the input sequences should first be filtered to remove matches to the query set.Alternatively it may be preferable to use simulated sequences in which care is taken to produce realistic sequence (for DNA, see Caballero et al. 2014).We hope that this article will induce exploration into alternative mechanisms for improving the generation of bio-realistic decoy sequences, particularly for proteins, for approximating false similarity rates.

Intuition and background for palindrome frequencies
We find it instructive to consider a simple calculation related to the expected frequency of a length-n palindrome in a string A, and of a length-n common substring in two strings A and B, both over a uniformly distributed alphabet of size k.The next two paragraphs are a sketch, intended to provide intuition about expected palindrome length; they do not represent a complete analysis.
Consider some position i in A. Ignoring edge cases, the chance that this is the center position of a length-n palindrome (where n is odd) is ð1=kÞ ðn − 1Þ=2 -a length-n palindrome is formed if the ðn − 1Þ=2 letters to the right if i are matched (in reverse) to the ðn − 1Þ=2 letters to the left of i, each with 1=k probability.This corresponds directly to the probability that position i in A and position ðm − i þ 1Þ in the reversal of A (which are the same letter) will be the center of an identical common substring of length n.
Contrast this with the chance of observing a length-n identical common substring in two length-m strings, A and B. For some position i in A with value a i , with reasonably large string length m and modest value k, there will almost certainly be some position j in B with value b j ¼a i ; this is a length-1 common substring between A and B. Ignoring edge effects, the chance that this is the center position of a length-n common substring is ð1=kÞ ðn − 1Þ , because each of the neighboring n − 1 characters in A has a 1=k chance of being matched by the corresponding position in B. For each position i, there are expected to be m=k candidate centers for common substrings, but this number is dwarfed by the much lower probability of producing a common substring of even modest length.
The above sketch ignores edge effects and cases of evenlength or approximate palindromes, yet suggests that with a randomly sampled sequence A, a length-n palindrome is less surprising than is a length-n substring shared by A and some other randomly sampled sequence B (or a shuffled variant of A).Generally, the expected length of the longest common sequence (LCS) of two strings of length-k is 2 log 1=λ 2 ðkÞ (Arratia and Waterman 1985), and the expected longest palindrome length in a length-k string is 2 log 1=λ 2 ðkÞ þ 1. (The probability amplitude λ 2 is determined from frequencies of each letter; for an alphabet of size s, on uniformly-distributed sequences, λ 2 ¼ 1=s).
Finally, we consider approximate palindromes.An α-gapped palindrome is a kind of approximate palindrome defined as w ¼ uvu where u and v can be any strings and u is the reversal of u.We say w is α-gapped whenever α≥ juj þ jvj juj .Duchon et al. (2017) give an expectation for the length of α-gapped palindromes in a length-k string (L k ) as By considering odd-length palindromes as gapped palindromes with jvj ¼ 1, it follows that β log 1=λ 2 ðkÞ � 1, and it can be demonstrated that the expected length of odd maximal To our knowledge, there is no further published analysis of approximate palindromes involving substitutions or a more general pattern of gaps distributed throughout the palindrome.
Though the expected (mean) palindrome length of a sequence is only slightly higher than the expected length of the longest common substring (LCS) of that sequence and its shuffled partner, we hasten to note that palindromes as described above (exact, or simple approximate) are not exclusively responsible for high scoring alignments.High-scoring self-reversal alignments usually involve a gapped alignment with some mismatches and a reasonably-good approximate palindrome at the center.The scoring dynamics of these mismatched and gapped approximate palindromes appear to be similar to those of pure palindromes, but analysis is sufficiently complicated to exceed the scope of this manuscript.We also highlight that the average match length does not reflect the expected frequency of outliers, which will have almost no effect on the mean due to their rarity; it is these rare high-scoring matches that drive most of the alignment score results seen below.

Paper outline
In the following sections, we explore the connection between approximate palindrome length and the increased scores produced when aligning sequences against their reversal, relative to aligning against shuffled sequences.We begin by describing the sequence data used in our analyses, then examine the extent to which sequences achieve higher than random score when aligned to their reversal.We follow this with a study of the frequency of long palindromes and common substrings, and an investigation of the roles of tandem repeats and mutations in the context of using reversed sequences as decoy sequences.Finally, we explore the impact of palindromes on assessing the false discovery rate in sequence annotation and in mass spectrometry analysis.

Real protein sequences and their paired relatives
To gain insight into the influence of decoy type on the distribution of high-scoring matches between sequences, we developed related sets of paired protein sequences.For each pair, we first identified one sequence q from Swiss-Prot (Bateman et al. 2023), then either (a) selected another Swiss-Prot sequence t, (b) shuffled q, or (c) reversed q.
It is possible that two sequences that are randomly selected from Swiss-Prot will in fact be homologous.To reduce this risk, we ensured that paired sequences did not share folds as defined by the SCOP database (Andreeva et al. 2014(Andreeva et al. , 2020)).Specifically, sequences from release 2022_04 of Swiss-Prot were length-filtered to identify 565,854 sequences with length between 50 and 2000; this set is called sprot all .These were aligned to all SCOP2 superfamily representatives using blastp from the BLAST suite (version 2.12.0 þ ) (Kent 2002).Swiss-Prot sequences were retained for this dataset only if they yielded an alignment that covered at least 95% of the length of both the Swiss-Prot and SCOP sequence-this ensured that we could reasonably annotate all folds on retained proteins.The result is a set of 106,740 sequences, called (sprot).Length distributions are shown in Fig. 1.
Sequences in (sprot) were assigned to clusters according to their SCOP superfamily.For each sequence q in sprot, its cluster v was identified, and a distinct cluster w 6 ¼ v was chosen via sampling weighted by cluster size.A sequence t was then sampled from w and assigned to random.This second set (random) contains presumably non-homologous partners for the set sprot, and can be used to produce an alignment score distribution by aligning the ith sequence of sprot to the ith sequence of random.
To explore the distribution of scores when aligning a biological sequence q to a shuffled copy of q, each sequence in sprot was shuffled to produce the set shuf; the ith sequence in shuf will share length and amino acid composition with the ith sequence in sprot, but no homology.Similarly, each sequence in sprot was reversed to produce the set rev (note: this is true reversal, not reverse complementation).All shuffling and reversals were computed using the esl_shuffle tool from the Easel package as released in HMMER 3.3.2(Finn et al. 2011).
It is plausible that the high rate of palindromes in biological sequences is due to repetition in those sequences.To demonstrate that the palindrome phenomenon is general to all sequences, and not just repetitive ones, we reversed the sequences of shuf to produce shufrev; because the shuffled sequences do not retain signal of repetition found in true proteins, an alignment of a sequence in shuf to its mate in shufrev will highlight pure features of reversal alignment.Analogous sets of unrestricted relatives of sprot all are called rev all , shuf all , and shufrev all respectively.Figure 2 provides a visual guide to the various alignment pairings used throughout the article.

Reversals in mutated sequences
In the reversal analyses described above, each sprot sequence is compared to a reversal of itself.In a carefully constructed benchmark, query sequences themselves may not be present among the sequences that are reversed to produce the decoy set, calling the applicability of this analysis into question.But the benchmark may contain their homologs, which will show some level of sequence similarity to the queries.To understand how the level of divergence impacts distribution of scores induced by approximate palindromes, we generated six mutated variants of shuf all , and reversed each to form a new diverged reversed benchmark.For each variant, a conditional distribution was used as the basis of mutation, so that s ¼ s 1 � � � s n in shuf all was transformed to a related sequence in which t i is an amino acid sampled from the distribution pðxjs i Þ.For the first transformation, substitution probabilities were derived (Yu and Altschul 2005) from the BLOSUM90 substitution matrix.This matrix has a mean self-emission frequency (pðajaÞ) of 45.6%.Additional transformations were produced based on modified distributions in which the likelihood of mutation was reduced for each amino acid a, adjusting the likelihoods of mutation to the other 19 amino acids pðbjaÞ proportionately.These distributions have average self-emission frequencies (pðajaÞ) of 50%, 60%, 70%, 80%, and 90% respectively.These enable consideration of the likelihood of finding matches between a query and reversals of homologs that show varying levels of percent identity to the query.These augmented distributions preserve the relative frequency of mutations derived from BLOSUM90 (pðbjaÞ for b 6 ¼ a) while increasing the expected percent identity between the original sequence s and its relative t.We refer to these distributions as shufrev all -46pct, shufrev all -50pct, shufrev all -60pct, shufrev all -70pct, shufrev all -80pct, and shufrev all -90pct, respectively.

Repeat masking
To further investigate the role of repetition in false matching for biological sequences, we hard-masked repetitive sequences in sprot all using tantan version 40 (Frith 2011) with default parameters, producing masked sprot all .

Simulated fragmentation with trypsin
In mass spectrometry-based proteomics, proteins are broken into fragments, often employing trypsin digestion.The identity of the resulting fragments is determined based in part on their mass.Typically, the fragments evaluated by mass spectrometry have length 5-100 amino acids because the signal induced by short sequences (<5 residues) is indistinguishable from noise, while long sequences (>100 residues) may exceed the physical limits of the machine.To simulate this process, we applied a simplified approximation of trypsin digestion as follows.Consider a protein P ¼ ðp 1 ; p 2 ; . . .; p n Þ and its set S ¼ ðs 1 ; s 2 ; . . .; s k Þ of potential trypsin digestion positions at which there is an arginine or lysine, ignoring any position  that is not followed by a proline.Taking a pair of positions ðs i ; s j Þ from S, P may be split into three fragments, ðp 1 :: For all protein sequences in sprot all , we enumerated all pairs ði; jÞ of potential digestion sites and produced all resulting three-way splits, then filtered to remove fragments with length below 5 or greater than 100.The remaining fragments were accumulated into the set sprot tr .Each member of sprot tr was shuffled to produce a shuffled control denoted shuf tr .This approach does not capture the true complexity of trypsin digestion patterns (Rodriguez et al. 2008), but is sufficient for our purposes of analyzing palindromes in relatively short fragments.

Annotation of chromosome 22
The preceding datasets only comprise protein and peptide sequences.To demonstrate that the elevated presence of approximate palindromes in sequence benchmarks is not merely a feature of amino acid sequences, we consider DNA sequences from the human genome.
The sequence of human chromosome 22 in hg38.p14along with the NCBI RefSeq annotation was downloaded from UCSC's Genome Browser (Kent et al. 2002).The commandline tool Gffread (Pertea and Pertea 2020) was used to extract and concatenate exon regions into sequences corresponding to the mRNA transcripts of chromosome 22, denoted mRNA.The intron and intergenic regions of chromosome 22 were also sampled for sets of sequences with similar lengths as mRNA, denoted intron and intergenic.
As above, shuffling and masking were performed, and the resulting variant datasets are denoted shuf mRNA , shuf intron , shuf intergenic , and masked mRNA, masked intron, and masked intergenic.

Software and algorithms
Exploration of score and count distributions was achieved using established tools and algorithms.For the large Swiss-Prot-based dataset, sequence database search scores were computed using blastp (Altschul et al. 1997).For our purposes, pairwise alignment was performed by creating a blast database for each singular target, and this target was "searched" with its paired query sequence.The parameter for (minimum) word length was set to 4. The default substitution matrix (BLOSUM62, half-bit) and gap penalties (gap open ¼ − 11, gap extend ¼ − 1) were left unchanged.One blastp alignment may contain several results, each with its own score-for our experiments, only the highest-scoring match was retained and reported.We report score, rather than the associated E-value, because E-value is dependent on the size of the query and target datasets, while score is stable for an aligned pair.
One problem with the BLAST-based analysis is that for many comparisons, the similarity between sequences is sufficiently poor that not even a low-scoring blastp result is produced.To gain insight to the distribution of lower-scoring results, we used the BioJulia implementation of Smith-Waterman (SW) algorithm BioAlignments.jl (Christensen et al. 2023) for finding the optimal local alignment between two sequences.The algorithm is parameterized by a substitution matrix (whose values at row i and column j determine the cost of substituting letter i with letter j), as well as gapopen and gap-extend penalties, which determine the cost of skipping one or more characters in either sequence.When gap penalties and off-diagonal values in a substitution matrix are very large negative numbers (e.g.− 1e12), Smith-Waterman will never produce gaps or mismatches and effectively performs exact alignment; we use this mode for computing LCS between sequences.In all other cases, the gap penalties and substitution matrix are set to BLAST's defaults.
Using either BLAST or Smith-Waterman (with typical parameters) for reversed alignment readily produces approximate palindromes, but rarely identifies true palindromes.To measure the frequency of exact palindromes in Swiss-Prot, we implemented Manacher's algorithm (Gusfield 1997) in Julia.
Code used to produce all results and figures may be found at https://github.com/TravisWheelerLab/WASITAMATchISAW.

Alignment scores for Swiss-Prot variations
To understand the influence of reversals on sequence alignment score distributions, we aligned sequences taken from Swiss-Prot (length between 50 and 2000) against a variety of decoy datasets.Each pairing involves either 106 740 (sprot and relatives) or 565 855 (sprot all and relatives) pairwise alignments, with the ith sequence of one set aligned to the ith sequence of the other set using blastp (see Section 2 for definitions of sequence sets).The first pairing (sprot$random) replicates the case in which a real protein sequence is aligned to some randomly chosen (presumed non-homologous) protein.
The second pairing (sprot$shuf) corresponds to a simple decoy strategy: the true protein is aligned to a sequence that is a shuffled copy of the original-it shares length and amino acid composition, but is not a true protein.As seen in Fig. 3A, the distribution of scores in these two pairings are similar.
The next two pairings involve reversed sequences.The sprot$rev pairing aligns each of the 106 740 true Swiss-Prot proteins to the sequence's reversal.Figure 3A shows that the distribution of these scores is strongly right-shifted relative to alignments involving random or shuf.According to common wisdom this is expected, as the argument for using reversed sequences as decoys is that real biological sequences contain patterns of repetition and regions of low complexity, and these patterns are responsible for a higher observed rate of matches against reversed decoys.This result is contrasted with the pairing shuf$shufrev.In this pairing, each query sequence is a shuffled sequence from Swiss-Prot, and the target for each shuffled query is its reversal.The shuffled sequences retain only random levels of repeats and local low complexity, so by conventional wisdom are expected to score similarly to sprot$shuf.However, the score distribution of shuf$shufrev is nearly the same as that of the biological sequence and reversal, sprot$rev.This indicates that the observed right-shift of sprot$rev scores is primarily due to the reversal itself, and not the biological or repetitive nature of the sequences.Figure 3B presents score distributions for the reversal alignments of the unfiltered set of proteins (sprot all ), and shows a very small right shift for sprot all $rev all relative to shuf$shufrev.This shift represents the impact of repetition and low complexity, and is much smaller than the shift between shuf$shufrev and sprot$shuf.
Figure 4 presents scores for these various pairings as a function of length.Specifically, every pair was binned according to sequence length, then the mean blastp score was computed for each bin.Sequences aligned to other sprot sequences (sprot$random) may differ in length, so were binned by the geometric mean length of the query and target sequences; they Reversed benchmarks overstate false match rate show a score distribution that is essentially identical to that of sequences aligned to their shuffled match (sprot$shuf).When sprot sequences are aligned to their reversals (sprot$rev), they produce consistently higher scores than they do when aligned to their shuffled partners (sprot$shuf).The majority of the difference is explained by the simple fact of reversal, as evidenced by the gap between sprot$shuf and shuf$shufrev and the overlap between sprot$rev and shuf$shufrev.

Role of tandem repeats in BLAST score gap
In Fig. 4A, Swiss-Prot sequences aligned to their reversals show a modest increase in BLAST score relative to shuffled sequences to their reversals.To explore the role of tandem repeats in this shift, we created the dataset masked sprot all by hard-masking repetitive regions using tantan (Frith 2011) (see Section 2). Figure 4B presents these analyses, focused on shuf all $shufrev all and sprot all $rev all distributions, while Fig. 4C replaces sprot all $rev all with another distribution showing the result of first masking with tantan (Frith 2011) (default settings), then searching against its reversal (masked sprot all $rev all ).These results demonstrate that the small gap between shuf all $shufrev all and sprot all $rev all is almost entirely due to repetitive sequence that can be effectively masked.This phenomenon, and its response to masking, reoccurs in the analyses replicated on DNA sequences in the corresponding section below.

Sequences match their own reversals much more than other reversals
The previous results show that alignment of a shuffled sequence to its reversal produces a score that is typically 10-20 bits higher than alignment of that shuffled sequence to the original protein.When evaluating false discovery risk, this average score is less important than the frequency of very high scoring outliers.To explore the right tail of this score distribution, we captured a slice of all sequences in sprot with length between 280 and 320, and computed the frequency with which scores are observed in (i) alignments of each sequence in this set with a shuffled copy of the sequence, (ii) alignments of shuffled copies of each sequence in this set with reversed copies of those shuffled copies, and (iii) alignments of shuffled copies of each sequence in this set with reversed copies that we mutated according to the 80pct substitution matrix described in Methods (and in the next section).supplemented with the results of masking with tantan prior to search (masked sprot all $rev all ); the resulting distribtuon shows that most of the modest difference between sprot all $rev all and shuf all $shufrev all is resolved by application of tandem repeat masking.
Figure 5 shows that alignment to reversals produces a highly right-skewed distribution, and that this right skew is evident even when reversals have been subjected to mutation to 80% identity.All three score distributions appear to be well modeled by a Gumbel distribution, so we fit Gumbel parameters (Chafaï 2018) to estimate rare event frequency.As a concrete example of how these distributions relate to rare high scores: a score of 165 is expected to occur by chance with a probability of �3.0e-12 in masked sequence-toshuffled analysis.Meanwhile the same score is expected to occur �150 000 more frequently (�4.5e-7) for shuffled sequence aligned to reversed masked sequences.
To further explore the overabundance of high scores in self-reversal alignments, we aligned a set of 10 000 sequences against reversals of that set, and found that sequences were much more likely to find a high-scoring match to their own reversals than to the reversals of any other sequences.Specifically, we selected 10 000 random sequences from UniProtKB/TrEMBL (Boeckmann et al. 2003) and masked tandemly repetitive regions in each sequence using tantan (Frith 2011) with default settings, resulting in a set that we call trembl.Each sequence in this set was reversed, resulting a set that we call revtrembl.Blast was used to search trembl against the revtrembl database, so that each query sequence in trembl was aligned to the 10 000 reversed sequences in revtrembl.Under conventional reasoning, the reversed sequences in revtrembl are unrelated to the original sequences in trembl, so each query should have a similar chance of matching any of the sequences in revtrembl; thus, a query sequence should produce a high-scoring alignment to the reversal of some other sequence roughly 10 000× as often as it will produce a high-scoring alignment to its own reversal.In conflict with this expectation, among all blastp matches with E-value <0.001, there were 12 matches between any query and its self-reversal, and only three matches between a query and any of the 9999 other reversals.Other E-value thresholds produce similar relative abundances.
These self-reversal matches are not due to repetitive sequence, since repeat sequences were masked before alignment.Because low complexity (biased use of the available protein alphabet) could cause elevated rate of self-reversal matches, we also assessed the potential role of sequence complexity using phmmer (Finn et al. 2011) (which reports a measure of sequence complexity): matches reported negligible composition bias.

The role of mutations (approximate reversals)
In practice, a database of reversed sequences may contain the reversal of one or more homologs to the query sequence.It seems reasonable to expect the effect of reversal on alignment score to be diminished, but perhaps not entirely removed.To simulate this situation, we produced a collection of reversal variants by sampling amino acid substitution mutations from conditional distributions with self-emission probabilities ranging from 46% to 90% (see Section 2).These transformations are of shufrev all , named shufrev all -46pct, shufrev all -50pct, shufrev all -60pct, shufrev all -70pct, shufrev all -80pct, and shufrev all -90pct.Pairwise alignments are performed for each transform to shuf all using the Smith-Waterman (SW) algorithm implemented in BioAlignments.jl (Christensen et al. 2023).SW was used instead of blastp in order to maximize the number of alignment score datapoints captured, because blast often produces no alignment (or score) for a comparison between a query and decoy.
The SW alignments between shuf all and shufrev all -Xpct for X 2 f46, 50, 60, 70, 80, 90g were computed using the BLOSUM62 scoring matrix and gap penalties set to − 11 for opening and − 1 for extension, to mimic the default configuration of blastp.Figure 6 displays the spread of these scores, as well as those of sprot all $shuf all and shuf all $shufrev all .The self-matching effect of reversal is lost completely in shufrev all -46pct and almost undetectable in shufrev all -50pct, which present score distributions similar to sprot all $shuf all .But when simulated homologs are created with higher average percent identity between query and partner, and the partner is then reversed (yielding shufrev all -Xpct for X 2 f60,70,80,90g), the associated pairs exhibit higher-thanrandom affinity between sequences and their (approximate) reversal.Thus, there is a risk of overstating false match rates  even when the reversed decoy set contains homologs sharing 60% or greater similarity to query sequences.
These results demonstrate that elevated false match risk is not limited to self-reversal.In a benchmark of practical size, most of the elevated risk is likely to result from the presence of homologs in the decoy set.Very large decoy sets (Steinegger andS€ oding 2017, Buchfink et al. 2021) include sufficiently numerous proteins that for some queries there may be hundreds or thousands of high-identity homologs, each of which has a greatly elevated chance of producing a high-scoring match upon reversal (Figs 6 and 5).Collectively, these analyses highlight that reversals can potentially overstate false discovery risk, thus skewing analyses in which a single high-scoring decoy serves as the basis for selecting a score threshold (as when reporting recall-0 metrics).

Exact palindromes in Swiss-Prot
To gain insight into the high frequency of high-scoring alignments between sequences and their reversals, we computed all maximal-length palindromes in each of the 565 855 sequences in sprot all as well as shuf all for a control, using our implementation of Manacher's algorithm (Gusfield 1997).An additional control was observed by computing the LCS between sequence pairs in sprot all and shuf all .
The palindromic substrings in sprot all , shuf all , and the LCS between them, were binned by the length of the sequence containing the substring; averages computed per each length are presented in Fig. 7.The length distribution of maximal palindromes is closely related to that of LCS; a result of Duchon et al. (2017) implies that the expected length of (ungapped) maximal palindromes is asymptotically equivalent to the expected length of LCS as derived by Arratia and Waterman (1985).In order to contextualize our analyses within this theory, we plotted the expected lengths of (i) LCS between two random sequences (2 log 1=λ 2 ðnÞ), and (ii) palindrome within a sequence (2 log 1=λ 2 ðnÞ þ 1).
Figure 7 shows that sprot all contains, on average, longer palindromic substrings than shuf all .While palindromes found in the shuffled dataset are densely concentrated around their expected value, the average palindrome length in sprot all diverges from the expectation, particularly as sequence length increases.This asymptotic behavior only yields a large divergence from expected score in the small fraction of sequences in sprot all that are longer than �500 amino acids.
To explore the role of repetitive sequence in this shift, we computed maximal palindromes for each masked sequence.The resulting distribution of palindrome lengths (Fig. 7B) mirrors that of the shuffled sequences in Fig. 4C, clustering densely around the expected value.Based on manual review, the divergence of biological palindromes from their stochastic counterparts appears to be almost entirely explained entirely by long tandem repeats.In particular, these palindromes are composed of smaller, palindrome-supporting subunits like "PNAN" or "SD," leading respectively to palindromes like PNANPNANPNANP and SDSDSDSDS.

Palindromes in the context of mass spectrometry
When using mass spectrometry to identify proteins in a sample, a library of reversed sequences is usually used to set thresholds based on risk of false labeling (Wright and Choudhary 2016).Here, we consider the potential that including reversals in mass spectrometry benchmarks may lead to overestimated false match rates and a resulting decrease in search sensitivity.
With all sequences in sprot all , we produced protein fragments similar to those used in mass spectrometry, splitting each sequence at each instance of a lysine or arginine, except where followed by a proline (see Section 2).After splitting, substrings with length between 5 and 100 were accumulated into sets named sprot tr .Fragments in sprot tr were shuffled to produce a related set shuf tr .
Figure 8 displays the average lengths of longest palindromic subsequence (LPS) in each set of strings, binned by the length of the string.To place these in context, the LCS was computed between paired sequences in sprot tr and shuf tr .
Trends established by the preceding results are still apparent in Fig. 8. LPS are longer than LCS, and because tryptic peptides are so short, the constant component separating LPS and LCS lengths is large relative to their shared logarithmic component.Duchon et al. (2017).Biological palindromes appear to be asymptotically longer than random palindromes.On the left, the lengths of longest common substrings (LCS) of pairs between sprot and shuf are plotted with the expected length from Arratia and Waterman (1985).On the right, maximal palindromes remaining after masking sprot with tantan are plotted; with repetitive regions removed, the lengths of biological palindromes become closely clustered around the random expectation.

8
Glidden-Handgis and Wheeler However, the relationship between LPS and LCS appears to invert for very short sequences.The data indicate that LCSðsprot tr ; shuf tr Þ>LPSðsprot tr Þ for sequences shorter than 10 amino acids, and LPSðsprot tr Þ � LPSðshuf tr Þ for sequences shorter than approximately 20 peptides.Composition bias in short tryptic peptides may contribute to quasi-invariance under shuffling.

Palindromes in DNA
The analyses presented above have been focused on protein sequences, but the surprising prevalence of palindromes is a general feature of strings, regardless of alphabet.As with the protein results shown in this manuscript, a random DNA sequence (i.e. one with no repetitive or low-complexity regions) will on average have a longer exact match to its reversal than it will to a shuffled version of itself; it will also typically produce a higher score when aligned (under a reasonable scoring scheme) to its reversal than when aligned to a shuffled version of itself.
To explore this, we analyzed DNA occurring in human chromosome 22.Chromosome 22 was downloaded and annotated using the UCSC Genome Browser.Gene sequences were extracted from the chromosome using GffRead; introns and intergenic regions were also sampled from chromosome 22 to match the length distribution of genes.These datasets are respectively denoted mRNA, intron, and intergenic.Each set was filtered to contain sequences less than 50 000 bp.As above, variant sets were created using shuffling and masking, respectively denoted shuf mRNA , shuf intron , shuf intergenic , and masked mRNA, masked intron, and masked intergenic.For more detail, see the preceding section in Methods.
The LPS of each DNA sequence was computed using Manacher's algorithm.LCS were constructed between each sequence and a shuffled counterpart using a substring automata.Figure 9 plots the lengths of these common substrings, binned by sequence length, and Figure 10 plots these lengths after masking short tandem repeats.
The analysis of DNA reproduces the results observed in peptides and proteins.On average, the longest palindromic substring of a sequence is longer than the LCS between that sequence and a random other.When biologially-induced repetitions are present, the difference between LPS and LCS grows superlogarithmically.In the absence of repetition (either by shuffling or tantan masks), this difference appears constant.

Summary of results
In summary, alignment to reversal produces a substantial increase in score compared to alignment to unrelated sequences, even when there is no inherent repetitiveness or low complexity in the initial sequence.This effect is diminished when the reversed sequence contains mutations relative to the query, but remains significant for related sequences exceeding 60% identity to the non-reversed sequence.As a result, when reversed biological sequences are used as benchmarks they may overstate the risk of false matches.

Discussion
Up to this point, we have discussed the prevalence of palindromes in terms of their impact on bioinformatic analysis (specifically, overstatement of false positive rates in sequence Figure 8.The average length of LPS in tryptic peptides between lengths 5 and 100, generated from an in-silico trypsin digest of sprot all , are plotted against those of tryptic substrings occurring shuf all .When tantan-masked regions are removed as in masked sprot all , the difference in distributions of LPS length collapses. Reversed benchmarks overstate false match rate search).It is also interesting to consider how this existence of palindromes in sequences may play a role in biological processes.The role of palindromes in proteins is currently unknown and research on their function is ongoing (Giel-Pietraszuk et al. 2003, Sheari et al. 2008, Sridhar et al. 2015, Kefala et al. 2021).Sheari et al. (2008) argue that the prevalence of palindromic regions in proteins is a side-effect of repetitive and low-complexity regions.Sridhar et al. (2015) found that 60% of palindromes in proteins are "not associated with any regular secondary structure", consistent with the notion that palindromes in proteins are not dependent on repetitiveness such as that observed in alpha helical structures.Furthermore, 80% of palindromes in proteins do not interact with active sites, catalytic residues, ligands, or metal ions.In a more direct study, Kefala et al. (2021) synthesized a 4-alpha-helical bundle and its reversal, and demonstrated that while the reversed sequence does fold into a helix-likely due to the aforementioned effects of repetition in helical sequences-the resulting structure was not a true alpha helix.In general, these results are consistent with our results above, in that the frequency of palindromes in true proteins appears to be primarily a function of the frequency of palindromes in random sequences, and repetitive aspects of some sequences may lead to a moderate increase in the observed rate of palindromes.

A better benchmark?
Can we compensate for the overabundance of palindromic regions in biosequence benchmarks?A repository containing all experiments found in this manuscript (https://github.com/TravisWheelerLab/WASITAMATchISAW) includes a script for generating improved reversed decoy benchmarks, given a query set Q and a target set T. The target set T is restricted to T 0 , containing sequences with less than 50% percent identity to any sequence in Q.The reversal of T 0 , denoted R, retains the original biological signal present in T and omits reversalinduced false matches.As above, it may be necessary to mask repetitive regions before performing the initial alignment.Reversed benchmarks overstate false match rate

Figure 1 .
Figure 1.Distribution of sequence lengths for sequences sampled from Swiss-Prot (sprot all ) and for those that match a SCOP family (sprot).Figure 2. The core experiment design compares five related sequence sets, random, sprot, rev, shuf, and shufrev, represented here as nodes, by means of pairwise alignment.Each colored edge represents a comparison of two sequence sets.For each connected pairing, the edge color corresponds to the color used for that pairing throughout the manuscript in both figures and text.

Figure 2 .
Figure 1.Distribution of sequence lengths for sequences sampled from Swiss-Prot (sprot all ) and for those that match a SCOP family (sprot).Figure 2. The core experiment design compares five related sequence sets, random, sprot, rev, shuf, and shufrev, represented here as nodes, by means of pairwise alignment.Each colored edge represents a comparison of two sequence sets.For each connected pairing, the edge color corresponds to the color used for that pairing throughout the manuscript in both figures and text.

Figure 3 .
Figure 3. Score distributions from BLAST alignments.(A) All four pairings sprot$random, sprot$shuf, sprot$rev, and shuf$shufrev are restricted to sequences with 95% identity to SCOP2 superfamily representatives.(B) BLAST alignment scores are presented for the unfiltered sets sprot all $rev all and shuf all $shufrev all .

Figure 4 .
Figure 4. BLAST alignment scores as a function of sequence length, for the set of 106 740 sequences with full-length matches to SCOP sequences.Sequences were binned according to sequence length, and mean blastp score was computed for each bin.(A) The set sprot$random includes pairs that may not have equal length, so scores are plotted by the geometric mean length of query and target sequence.For all remaining sets both sequences in each pair share the same length: sprot$shuf, sprot$rev, and shuf$shufrev are plotted.(B) This plot shows scores as a function of sequence length for all 565,854 Swiss-Prot sequences with length under 2000: sprot all $rev all , and shuf all $shufrev all .(C)The values from B are reproduced, and supplemented with the results of masking with tantan prior to search (masked sprot all $rev all ); the resulting distribtuon shows that most of the modest difference between sprot all $rev all and shuf all $shufrev all is resolved by application of tandem repeat masking.

Figure 5 .
Figure 5. Observed score frequencies for three types of aligned pairs, using only sequences of length 280-320.The three pairings are: (i) sequences aligned to shuffled copies (blue), (ii) those shuffled copies aligned to their reversals (orange), and (iii) the shuffled copies aligned to their reversals that have been subjected to mutation resulting in 20% amino acid change (green).

Figure 6 .
Figure 6.Box-and-whisker plot of SW alignment score distributions for alignment pairs sprot all $shuf all , shuf all $shufrev all , and shufrev all -Xpct for X 2 f46, 50, 60, 70, 80, 90g.Above each box plot, the symbol ��� indicates distributions that are significantly higher than sprot all $shuf all based on the Wilcoxon rank test (P ¼ 0, due to computational underflow).The values [Y] indicates the fraction of shuffled instances for which the associated box plot produces a higher score than sprot all $shuf all .

Figure 7 .
Figure7.Lengths of maximal palindromes occurring in biological sequences sprot and random sequences shuf with the expected lengths derivedDuchon et al. (2017).Biological palindromes appear to be asymptotically longer than random palindromes.On the left, the lengths of longest common substrings (LCS) of pairs between sprot and shuf are plotted with the expected length fromArratia and Waterman (1985).On the right, maximal palindromes remaining after masking sprot with tantan are plotted; with repetitive regions removed, the lengths of biological palindromes become closely clustered around the random expectation.

Figure 10 .
Figure10.Repetitive regions in human chromosome 22 are masked via tantan; the remaining palindromes in non-repetitive DNA, as well as those in corresponding shuffled sequences, are plotted.Both data are distributed closely around their expectation.As above, when repetitive subsequences are suppressed, the biological contribution to reverse affinity appears negligible.