The BLAST (Altschul et al., 1990) alignment tool has been the workhorse of genomics research for almost 30 years. While many other tools were developed during this period for performing database searches and sequence alignment, BLAST remains the tool of choice for many use cases, and continues to be actively used in many bioinformatics workflows. Despite the tremendous collective experience of the bioinformatics community with BLAST, the full functionality of this tool remains poorly understood. Here we report on a feature of BLAST that operates in a non-intuitive way and that is frequently misused in bioinformatics workflows, potentially leading to erroneous results impacting numerous scientific articles.

Compared to modern search tools, BLAST is highly sensitive across longer evolutionary distances between a query sequence and the database, but it is also relatively slow. By default, BLAST reports all the database sequences that match a query sequence to within a specified level of quality (usually defined through an E-value cutoff). Bioinformatics workflows usually need to reduce this set to just one or a handful of answers. For example, one may estimate the taxonomic origin of a DNA sequence from the taxonomic label associated with the best-scoring database hit(s). In such cases, the slowness of BLAST is compounded by the computational cost of sifting through the potentially many hits reported for each sequence. To process large data sets efficiently, researchers frequently rely on shortcuts that reduce the number of BLAST results that need to be processed. A common strategy involves using the ‘-max_target_seqs’ parameter of the NCBI BLAST+ suite. According to the BLAST documentation itself (NCBI, 2008), this parameter represents the ‘number of aligned sequences to keep’. This statement is commonly interpreted as meaning that BLAST will return the top N database hits for a query sequence if the value of max_target_seqs is set to N. For example, in a recent article (Wang et al., 2016) the authors explicitly state ‘Setting “max target seqs” as “1,” only the best match result was considered.’

To our surprise, we have recently discovered that this intuition is incorrect. Instead, BLAST returns the first N hits that pass the specified E-value threshold, which may or may not be the N highest-scoring hits. An invocation with the parameter ‘-max_target_seqs 1’ simply returns the first good hit found in the database, not the best hit as one would assume. Worse yet, the output depends on the order in which the sequences occur in the database. For the same query, BLAST will return different results when run against different versions of the database, even if all versions contain the same best hit for this query sequence. Even reordering the sequences within a database can cause BLAST to return a different ‘top hit’ when max_target_seqs is set to 1.
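The difference between the two interpretations can be illustrated with a toy model. The sketch below is not BLAST's search code; it only mimics the behavior described above under simplifying assumptions. The function names, subject identifiers, E-values and bit scores are all hypothetical and chosen for illustration.

```python
from typing import List, Tuple

# A hit is (subject id, E-value, bit score). All values here are made up.
Hit = Tuple[str, float, float]


def first_n_hits(db_hits: List[Hit], evalue_cutoff: float, n: int) -> List[Hit]:
    """Toy model of the behavior described in the text: keep the first n
    hits, in database order, whose E-value passes the cutoff."""
    passing = [hit for hit in db_hits if hit[1] <= evalue_cutoff]
    return passing[:n]


def best_n_hits(db_hits: List[Hit], evalue_cutoff: float, n: int) -> List[Hit]:
    """What most users expect: the n highest-scoring hits that pass the cutoff."""
    passing = [hit for hit in db_hits if hit[1] <= evalue_cutoff]
    return sorted(passing, key=lambda hit: hit[2], reverse=True)[:n]


# The same three hypothetical hits, stored in two different database orders.
db_order_a: List[Hit] = [
    ("B_cereus", 1e-40, 180.0),
    ("B_anthracis", 1e-45, 200.0),
    ("E_coli", 0.5, 30.0),
]
db_order_b: List[Hit] = [db_order_a[1], db_order_a[0], db_order_a[2]]

for label, hits in (("order A", db_order_a), ("order B", db_order_b)):
    first = first_n_hits(hits, evalue_cutoff=1e-5, n=1)[0][0]
    best = best_n_hits(hits, evalue_cutoff=1e-5, n=1)[0][0]
    print(f"{label}: first passing hit = {first}, best hit = {best}")
```

In this toy model the best hit is the same for both orderings, whereas the first passing hit changes when the database is reordered, which is the order dependence described above.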

This functionality was first reported as a bug to NCBI by Kumar (2015), and later documented in a blog post by Peter Cock (Cock, 2015). The functionality remains unchanged to this day, and the BLAST documentation (NCBI, 2008; last modified in 2016) fails to dispel the misconception that a reasonable user would form upon reading the manual. The confusion is further compounded by the fact that in the online BLAST portal, the max_target_seqs parameter behaves in the expected way: the best (rather than first) N hits are returned.

The impact of this misunderstanding about the meaning of the BLAST max_target_seqs parameter is likely significant. Hundreds of scientific papers (as determined through a Google Scholar search) explicitly use this parameter to restrict the number of results reported by BLAST, and many more likely rely on it without mentioning it in the main manuscript. Many database search tools justify their performance by comparing directly to BLAST. Using the max_target_seqs parameter in such comparisons would negatively impact the accuracy of BLAST and artificially inflate the performance of these tools. This is a serious concern, especially as in many cases just a few percentage points distinguish the performance of ‘superior’ tools from that of state-of-the-art approaches such as BLAST.

More importantly, the incorrect use of the max_target_seqs parameter can result in invalid analytic results. A biodefense screening tool might miss the presence of Bacillus anthracis simply because a Bacillus cereus sequence occurs before B. anthracis in the database. Similarly, the abundance of Salmonella in food samples may be severely underestimated because many sequences get assigned to a non-pathogenic genome reasonably similar in sequence to Salmonella. Such errors are difficult if not impossible to debug when analyzing complex samples with unknown composition, especially in a production setting.

In closing, we encourage the users of BLAST to carefully examine their use of this tool and to avoid the use of the parameter max_target_seqs unless the selected threshold is guaranteed to capture all database hits of interest. We also encourage the team developing BLAST at NCBI to revise the documentation and provide ample warning about unexpected behavior due to this and other parameters of the tool. While one may debate whether the current functionality of the max_target_seqs parameter constitutes a feature or a bug, it is incumbent on the BLAST development team to ensure this functionality is clearly documented, especially after concerns have been raised by users.
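One way to follow this recommendation is to run the search without a restrictive max_target_seqs value and select the top hit per query afterwards. The sketch below is a minimal illustration, assuming the results were written in the default 12-column tabular format (‘-outfmt 6’) and that bit score is the ranking criterion; the function name and the example rows are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hit_per_query(lines: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Return, for each query id, the row with the highest bit score.

    Ties keep the row seen first; a real workflow may prefer to keep all
    tied hits instead of picking one.
    """
    best: Dict[str, Dict[str, str]] = {}
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(COLUMNS, row))
        current = best.get(hit["qseqid"])
        if current is None or float(hit["bitscore"]) > float(current["bitscore"]):
            best[hit["qseqid"]] = hit
    return best


# Made-up rows standing in for a real results file.
example = "\n".join([
    "read1\tB_cereus\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-40\t180",
    "read1\tB_anthracis\t100.0\t100\t0\t0\t1\t100\t1\t100\t1e-45\t200",
    "read2\tE_coli\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-30\t150",
])

top = best_hit_per_query(io.StringIO(example))
for query, hit in top.items():
    print(query, hit["sseqid"], hit["bitscore"])
```

With a real results file, the same function can be called on an open file handle in place of the in-memory example.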

Funding

The authors were supported in part by the US National Science Foundation, Award IIS-1513615.

Conflict of Interest: none declared.

References

NCBI. (2008) BLAST® Command Line Applications User Manual [Internet]. National Center for Biotechnology Information (US), Bethesda (MD).

Kumar S. (2015) NCBI blastp bug - changing max_target_seqs returns incorrect top hits. https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5

Altschul S.F. et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

Cock P. (2015) What BLAST's max-target-sequences doesn't do. https://blastedbio.blogspot.com/2015/12/blast-max-target-sequences-bug.html

Wang X. et al. (2016) Identification of mild freezing shock response pathways in barley based on transcriptome profiling. Front. Plant Sci., 7, 106.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: John Hancock