Sequence similarity tools like Basic Local Alignment Search Tool (BLAST) are essential components of many functional genetic, genomic, phylogenetic and bioinformatic studies. Many modern analysis pipelines use significant sequence similarity scores ( p - or E -values) and the ranked order of BLAST matches to test a wide range of hypotheses concerning homology, orthology, the timing of de novo gene birth/death and gene family expansion/contraction. Despite significant contrary findings, many of these tests still implicitly assume that stronger or higher-ranked E -value scores imply closer phylogenetic relationships between sequences. Here, we demonstrate that even though a general relationship does exist between the phylogenetic distance of two sequences and their E -value, significant and misleading errors occur in both the completeness and the order of results under realistic evolutionary scenarios. These results provide additional details to past evidence showing that studies should avoid drawing direct inferences of evolutionary relatedness from measures of sequence similarity alone, and should instead, where possible, use more rigorous phylogeny-based methods.
Local sequence alignment tools are central to many molecular comparative analyses and informatics pipelines. The Basic Local Alignment Search Tool (BLAST) [ 1 ] revolutionized the speed with which sequences could be compared with large databases. As such, BLAST has become essential in many analyses ranging from assessment of gene homology, orthology and annotation to large-scale phylogenetics, phylogenomics and phylostratigraphy [ 2–13 ]. While BLAST is ubiquitously used to address questions in these areas, its specific uses and interpretations vary widely.
Unlike exact search approaches like Smith-Waterman that guarantee optimal local alignments [ 14 ], BLAST uses a heuristic method to quickly produce significant local alignments and provide several similarity scores. Alignments that have scores above a specified threshold are presented in ranked order by significance. The significance score often used by BLAST users is the E -value, which is interpreted as the expected number of random alignments with at least the same quality as the alignment calculated by BLAST between the query and subject sequence. The smaller the E -value, the fewer random alignments are expected for the given parameters.
While the BLAST algorithm was intended for simple sequence or motif similarity searches, modern usages make significantly broader assumptions. One common application of sequence similarity programs is the interpretation of reciprocal best hits (RBH) between species as homologous or orthologous [ 15 , 16 ]. However, homology and orthology are hypotheses concerning common descent (rather than mere similarity) and are therefore phylogenetic in nature [ 9 ]. Using sequence similarity alone has been considered insufficient evidence for identifying common ancestry (i.e. orthology or homology) [ 17 , 18 ] and imposes a potentially unsupported evolutionary interpretation on sequence similarity scores. Other applications that rely on the rank-order of results from BLAST make the same assumption. Many phylogenomic and phylogenetic analysis pipelines use either BLAST or other related similarity scores for homolog and ortholog identification at an early stage, including eggNOG, OrthoMCL, OMA, HaMStR and OrthoFinder [ 2 , 3 , 19–22 ]. While alternatives that incorporate phylogeny exist (e.g. [ 9 , 23 , 24 ]), rank-based BLAST analyses are still common.
Phylostratigraphy, another research approach that relies on sequence similarity searches, attempts to determine the age of a gene based on the phylogenetic completeness of sequence similarity hits [ 4 , 5 , 25 , 26 ]. Phylostratigraphy assumes sequence similarity searches will return unbiased and complete (or nearly complete) significant hits from a sequence database. Recent examinations have demonstrated that differences in molecular rates of evolution and gene length can bias phylostratigraphy results [ 12 ], but these analyses primarily focus on biases related to molecular properties of the sequences themselves rather than phylogenetic bias.
Several factors may cause BLAST results not to reflect phylogenetic relationships ( Figure 1 ). In general, sequence similarity measures presumably suffer from the same problems that complicate all distance-based phylogenetic measures and methods [ 27 ]. In addition, the specific challenges that affect phylogenetic reconstruction should also affect BLAST results ( Figure 1 ), including lineage-specific rate heterogeneity, saturation and non-stationarity of composition [ 8 , 27–30 ]. Even sophisticated molecular evolutionary models that accommodate for these processes can still have difficulty reconstructing phylogenies [ 8 , 31 ], and therefore these factors are expected to complicate BLAST results as well.
Understanding the biases and expectations for sequence similarity analyses and how these relate to phylogenetics is important for a number of reasons. As described above, many phylogenetic and phylogenomics analyses interpret the significance scores or the order of search hit completeness as a proxy for evolutionary relatedness (i.e. phylogeny). If results from sequence similarity analyses are to be used to make decisions confidently about further evolutionary analyses, then they should reflect phylogeny as accurately as possible. But if BLAST results do not accurately reflect relatedness, then BLAST should not be used instead of phylogenies where approximation of phylogenetic relationships is needed. The suggestion to use phylogenies instead of BLAST to increase the accuracy of inference in functional genomics is far from a recent, with the first findings on this subject appearing over 15 years ago [ 32 ] and related debates over distance-based phylogenetic methods stretching back even further (e.g. [ 33 ]). Despite these past findings and clear arguments against the use of BLAST to approximate phylogenetic relationships, this practice still persists, perhaps in part because this relationship has been under-examined statistically.
Here, we detail some of the potential problems for using BLAST (or other similarity-based measures) to address questions of sequence relatedness ( Figure 1 ) and examine these issues through simulations ( Figure 2 ; see also Supplementary Figure S1A ). Specifically, we explore lineage-specific rate heterogeneity, compositional bias and saturation in relation to both the ranked order and completeness of BLAST results. These realistic scenarios using BLAST represent only a few of the many possible parameters, problems and approaches that complicate similarity score results, but already demonstrate the danger of presuming phylogenetic relationships based on E -values in situations where even simple phylogenetic methods would be more appropriate.
Phylogenies and molecular sequence data were simulated under several scenarios. Pure birth trees were simulated with 100 and 1000 taxa with a standard molecular model (JC for 2000 nucleotide sites and WAG for 400 amino acid sites), including indels, rate heterogeneity (low = RH+, high = RH ++) and biased base composition (CB+). Tree heights (in substitutions per site) varied for each of these scenarios from 0.5, 1, 2, 5 and 10. For each scenario, 100 replicates were performed and summarized. More detailed description of the simulation scenarios can be found in the Supplementary Methods and Supplementary Results.
Pairwise alignment analyses were conducted using blastn for nucleotides and blastp for amino acids. BLAST and associated programs are heuristic and so the alignments are not guaranteed to be the best possible hits. Therefore, pairwise alignments were also conducted using the exact Smith-Waterman algorithm as implemented in SWIPE [ 34 ]. The parameters used for blastn and blastp included an E -value cutoff of and word sizes of four for blastn and three for blastp. The state space of amino acids and nucleotides differs such that the E -value may perform differently. However, here we are interested in the bias and so E -values were not corrected for this inherent nucleotide-amino acid difference. While BLAST does not guarantee that all best hits will be returned, SWIPE guarantees that all best hits will be reported given a particular E -value cutoff. SWIPE reports exact best hits and therefore served as more complete all-by-all results. The SWIPE E -value cutoff used was for both amino acids and nucleotides.
The resulting BLAST results were compared with the phylogenies used to generate the sequences in a number of different ways ( Figure 2 ). The – transformation of the E -values were used in all comparisons primarily because they are the most frequently used statistic in phylogenomics. Briefly, the E -value is the number of hits with the same or better score expected to randomly be hit given the particular parameters of the search. They are closely related to p -value as . The – transformation of the E -value is the commonly used statistic in homology/orthology assessment and phylogenetic studies because this transformation allows for higher E -values to be associated with lower p -values. To determine the rough correlation of E -values with phylogeny, we compared pairwise E -values with phylogenetic distance as measured by the sum of the branch lengths (i.e. substitutions per site) separating the subject and query sequence in the pairwise alignment. To examine the completeness of the returned hits, the results returned from BLAST were compared with those returned by SWIPE. Those results from SWIPE that were not recorded in BLAST were recorded, including the E -values (as recorded from SWIPE) and the phylogenetic distance of the sequences involved. To determine how complete results are from BLAST in relation to phylogeny, the proportion of BLAST hits that were recorded within a clade along with the number of hits recorded outside of each clade was recorded ( Figure 3C , see also Supplementary Figures S3 and S4 ). To facilitate the comparison of results across different simulations, a statistic that we name was calculated for each simulation.
When a BLAST search is conducted using a given query sequence ( x q ) on a given tree ( T ), each potential subject sequence ( x s ) in T that is hit by the BLAST search must also share a most recent common ancestor (MRCA) with x q . All other sequences in T that also descend from the MRCA of x s and x q can be defined as the set of sequences X qs . If x s is a BLAST hit of x q and similarity scores are correlated with phylogenetic relatedness, then all sequences in X qs are expected to also be BLAST hits of x q . We define the measure (for a given x q and x s ) as the proportion of sequences in X qs that do not have a BLAST hit for x q (i.e. are ‘missed hits’). If one or more hits are missed among these equally or more related sequences, then . Alternatively, if BLAST hits all sequences in X qs , then this would result in an optimum score of . The mean of was calculated across the set of sequences for a tree, and the mean of these tree-wide values was calculated across each simulated sequence set.
In addition to the phylogenetic pattern of missed hit, the order of hits can be important for certain phylogenetic and phylogenomic analyses. We also calculated the phylogenetic error in first BLAST hits. This is useful not only because BLAST first hits are used in some analyses, but also as a general proxy for errors in order.
Results and discussion
Sequence similarity correlates broadly but variably with phylogenetic distance
To demonstrate potential problems with using similarity scores to infer phylogenetic relatedness, we first need to characterize the relationship between sequence similarity and phylogenetic relatedness. We found BLAST E -values (expressed as – ) were correlated (Spearman’s ρ ) with phylogenetic distance for both nucleotide and amino acid sequences ( Figure 3A–B , Supplementary Figure S2 and Table S1 ). The relationship between the E -value and phylogenetic distance remained with simulations including indels ( Supplementary Figure S2 and Table S1 ) or rate heterogeneity ( Supplementary Figure S2 and Table S1 ). No significant difference was shown with rate heterogeneity runs including 1000 taxa. Simulations that included biases in base composition are better compared with the same data sets without biased composition (‘CB + ’ versus ‘CB0’) as these use a single tree on which to simulate data and both use the p4 simulation engine (instead of indelible, Supplementary Figure S2 and Table S1 ). Collectively, these results demonstrate that, in general, sequence similarity and phylogenetic distance are grossly correlated though composition bias and rate heterogeneity somewhat weaken this correlation.
Missing and misordered hits
One common way BLAST scores are used in a broad phylogenomic comparative and phylostratigraphy analyses is to examine phylogenetic patterns of the presence or absence of BLAST hits. When a given query sequence is used to conduct a BLAST search, the expectation might be that it will hit all most closely related sequences until reaching a most distantly related sequence that is still a BLAST hit. We can then define a new measure, called , as the proportion of sequences missed by BLAST that share the same MRCA as the query and most-distant-hit (i.e. are members of the clade defined by these sequences). For amino acids and relatively small tree height (i.e. low substitutions), BLAST performs well and generally hits all sequences within a clade before hitting outside of a clade ( , see Methods for an extended definition). However, for tree height = 2 and no other molecular processes present, on average 20% of the sequences within the MRCA of the hits for a sequence were missing ( Figure 3C ). For nucleotides and large tree height (i.e. complete saturation), BLAST also performed generally well. However, for tree height = 0.5, on average 18% of the hits within the MRCA were missing. In the presence of lineage-specific rate heterogeneity or biased base composition, the missing hit percentages rose sharply to >30% and >50%, respectively. While the composition bias examined here is extreme, it demonstrates the potential for errors (though probably at lower rates).
While completeness can be important for some analyses, the exact order can be important for many others (e.g. RBH analyses). To examine error in the order of hits, the first significant BLAST hit was examined. This serves not only to address procedures that specifically use the first BLAST hit, but also gives a simple measure to describe errors in the order of hits. The lowest error rates, 0.3% for nucleotides and 2% for amino acids, was with tree height equal to 10 as generally only closely related sequences would have successful hits ( Figure 3D ). As with the other measures, introducing indels, lineage-specific rate heterogeneity and biased composition increased the error. The highest error rates were found with lower root heights, but this could simply be owing to high sequence similarity at lower root heights. Data sets with extreme rate heterogeneity in amino acids produced error rates of 20–24%. The source of the error in the order of BLAST hits can be demonstrated on a small simulated data set ( Figures 1 and 2 ). These findings highlight the first hit often is not the most phylogenetically related sequence and should not be used to identify orthologs. Orthology is fundamentally a phylogenetic question, and therefore a phylogeny should be constructed to infer orthology (e.g. [ 9 ]).
Interpretation of E -values
The issues examined here may be somewhat alleviated by more focused and precise interpretations of BLAST results. For example, instead of using the resulting p - or E -values as a quantitative measure of homology, these scores could be interpreted as a Boolean (i.e. true or false). In this way, p - and E -values would be used—as intended—as frequentist significance test statistics. The null hypothesis in a BLAST analysis is that the proposed alignment between the query and subject sequences is random and follows a null distribution. A significantly small p -value, as used for BLAST, is evidence against the null hypothesis that the alignment could have been generated by random sequences, given the sequence database and search parameters. The E -value is interpreted as the number of random hits expected with an alignment score equal to or better than the score obtained between the query and subject. As with any frequentist statistic, an insufficiently small p -value does not necessarily mean that the alignment is random, but rather lacks sufficient information to distinguish it from a random alignment. Furthermore, sufficiently small p - and E -values do not necessarily mean that the alignment is the result of common descent. First, a small p -value offers evidence that the null hypothesis does not adequately explain the observation. However, the alternative to the null hypothesis for a BLAST analysis is a non-random alignment and not a homologous one. A smaller p -value more strongly refutes the null hypothesis of random sequences, but does not more strongly support homology. This more conservative usage of BLAST not only is a more accurate representation of the measures themselves, but also avoids explicitly addressing phylogenetic questions.
Sequence similarity, phylogenetic relatedness and suggestions
Many of the potential pitfalls discussed and examined here will not be surprising to phylogeneticists, and significant research has demonstrated the failings of particular methods for phylogenetic reconstruction methods, including those relying solely on sequence distance [ 27 , 35 ]. Fundamentally, sequence distance alone has limited ability to reflect phylogeny. As expected and shown here, a rough correspondence exists between E -value and phylogenetic distance. However, this does not imply that distance matrices alone can replace construction of a phylogeny. Also, lineage-specific rate heterogeneity, saturation and compositional bias exacerbate errors in these types of interpretations of BLAST results. Unfortunately for BLAST analyses, these problematic molecular patterns can occur in common scenarios. For example, aside from the lineage-specific rate heterogeneity that is common throughout the tree of life [ 36 , 37 ], few data sets (even small ones) conform to a strict molecular clock, which has spurred extensive research in relaxed molecular clock models [ 38–41 ]. Even though phylogenies may better model these processes, they can still pose problems for phylogeny reconstruction [ 8 , 28 , 31 ].
While in this study we have focused examining how aspects of similarity-based searches impact phylogenetic conclusions, our results are also relevant to ongoing community efforts to codify methods and assessment of ortholog prediction ([ 42 , 43 ], and reviewed in [ 44 ]). Some studies in this area have found mixed performance for some tree-based prediction of orthologs and assignment of gene function (e.g. [ 45 ]). In particular, these studies cite weaker performance of phylogenetic ortholog prediction methods compared with similarity-based ones, particularly more rigid tree topology-based reconciliation methods that do not incorporate gene tree discordance owing to biological processes (i.e. incomplete lineage sorting, introgression). While our study does not advocate a particular practical solution for ortholog prediction or gene functional annotation, we would agree with these studies that improvements to phylogenetic ortholog prediction methods are needed, specifically in resolving both phylogenetic discordance (produced by both error and biological forces) and heterogeneity of both molecular rates and compositions that we describe. However, we would also caution that similarity-based searches, as we have demonstrated here, suffer from major biases in relation to these common molecular evolutionary processes.
BLAST and other sequence similarity tools will continue to be essential for bioinformatics, phylogenetic, genomic and phylogenomic analyses. However, the lessons from decades of phylogenetic method development need to be integrated into the culture of homolog identification, phylostratigraphy and other analyses. As expected from the large phylogenetic literature on distance-based methods, significant biases exist in how BLAST similarity corresponds to phylogenetic relatedness, and indicate that applications of BLAST that presume to estimate phylogenetic relationships are misguided. Instead of a particular approach, we advocate caution when using and interpreting sequence similarity results, especially as they are more frequently applied to phylogenetic questions or act as inputs for more complex analysis methods. Finally, where a phylogenetic relationship is needed, a phylogeny will likely produce more accurate results than the order of BLAST results.
BLAST is often incorrectly used to infer evolutionary relatedness of sequences.
Reciprocal best hits from BLAST are often not the closest related phylogenetically under common scenarios.
Phylogenetic methods should be used to infer orthology instead of similarity-based methods.
We are grateful to Joseph Brown, Bryan Moyers, Caroline Parins-Fukuchi, Jeet Sukumaran, Joseph Walker and Ya Yang for helpful discussions.
This work was supported by the National Science Foundation [1354048 to S.A.S., 1338694 to J.B.P.].