Summary: Standard DNA alignment programs are inadequate to manage the data produced by new generation DNA sequencers. To answer this problem, we developed PASS with the objective of improving execution time and sensitivity when compared with other available programs. PASS performs fast gapped and ungapped alignments of short DNA sequences onto a reference DNA, typically a genomic sequence. It is designed to handle a huge amount of reads such as those generated by Solexa, SOLiD or 454 technologies. The algorithm is based on a data structure that holds in RAM the index of the genomic positions of ‘seed’ words (typically 11 and 12 bases) as well as an index of the precomputed scores of short words (typically seven and eight bases) aligned against each other. After building the genomic index, the program scans every query sequence performing three steps: (1) it finds matching seed words in the genome; (2) for every match checks the precomputed alignment of the short flanking regions; (3) if passes step 2, then it performs an exact dynamic alignment of a narrow region around the match. The performance of the program is very striking both for sensitivity and speed. For instance, gap alignment is achieved hundreds of times faster than BLAST and several times faster than SOAP, especially when gaps are allowed. Furthermore, PASS has a higher sensitivity when compared with the other available programs.
Availability and implementation: Source code and binaries are freely available for download at http://pass.cribi.unipd.it, implemented in C++and supported on Linux and Windows.
Over the past few years, there has been a considerable advance in DNA sequencing technologies and there is a general expectation that the trend of producing more data at a progressively decreasing cost will continue for some time (Church, 2005). Today, the ABI-SOLiD and the Illumina-Solexa are able to generate several GigaBases of mappable data per run, as millions of short reads of about 35 bases. A considerable bottleneck is their alignment (mapping) on a reference genome, allowing some gaps and mismatches.
Exact algorithms based on dynamic programming are far too slow to manage such a huge amount of data. We have recently implemented a version of the Smith–Waterman algorithm running on graphic processors, achieving much faster execution times (Manavski and Valle, 2008), but even so the time required to align hundreds of millions of reads would be too long.
Heuristic solutions, such as those implemented in FASTA (Pearson and Lipman, 1988) and BLAST (Altschul et al., 1990) were designed to search single sequences in large databases and are not suitable for millions of reads. More recent approaches, such as SSAHA (Ning et al., 2001) and BLAT (Kent, 2002), make an index of the words occurring in the genomic sequence, but are optimized for longer alignments, not for short reads.
The need for algorithms specifically optimized for massive alignment of short sequences has recently led to the development of new programs such as SX Oligo Search, SlimSearch and ELAND, which are all proprietary software, the latter coming as a part of the Solexa suite. Amongst open source softwares, there are also some noteworthy developments such as PatMaN (Prüfer et al., 2008), RMAP (Smith et al., 2008), MAQ (Li et al., 2008a), SeqMap (Jiang and Wong, 2008), ZOOM (Lin et al., 2008) and SOAP (Li et al., 2008b).
Here we propose PASS, a new algorithm to align short DNA sequences allowing gaps and mismatches, with high sensitivity and speed.
In general, we will refer to the reference sequence as genomic sequence and to the query sequences as reads. As a first step, PASS makes an index of seed words (default 12 bases) occurring in the genomic sequence. This step is fast, requiring O(L) time and O(L) space to be completed, where L is the length of the genomic sequence. After producing the genomic index, PASS scans every read executing three steps: first, it identifies seed words on the genomic index; second, it checks if it is possible to extend the alignment; third, it refines the alignment and/or score and prints it.
To verify the possibility of extending the seeds, PASS uses a very simple and yet effective approach that is accomplished extremely rapidly, with perfect specificity and sensitivity. It uses a precomputed table of all the possible short words aligned against each other. This table is loaded in RAM thus allowing an immediate analysis of the two flanking regions adjacent to seed words.
Several precomputed score tables (PST) have been calculated with the Needleman and Wunsch (1970) algorithm, using different values for matches, mismatches and gaps. They are supplied together with PASS, ready to be used. A PST suitable for most purposes is W7M1m0G0X0 (word length 7, match 1, mismatch 0, gap start 0, gap extension 0) that essentially gives a score of one for every aligned base; while W7M3m0G-1X-1 produces different scores depending whether gaps or mismatches are included.
The size of PST requires O(42w), where w is the length of the short words. For w=6 there are 4096 possible words that are aligned against each other producing more than 16 million possible alignments. For w=7 there are about 268 million, while for w=8 there are more than 4 billion. Since each score is easily contained in a byte, it should not be a problem to load onto the computer memory of any personal computer a seven bases PST, while eight bases PSTs require 4 Gb.
The PST step, more than a proper extension step, should be considered as a filter to discard seed words that will not produce a valid alignment. To complete the discrimination of useless positions, two more filters can be optionally applied: one for discarding low complexity regions and another to discard AT rich regions. The regions that pass the above selections are finally aligned by a dynamic algorithm.
A further optimization came after comparing the sensitivity obtained with gapped patterns rather than uninterrupted seed words. The best results were achieved including two gaps, producing three substrings of similar length. Therefore, the genomic indexes were built using gapped words rather than contiguous bases.
The test set used in this work consisted of 1 million human Solexa reads from a cDNA library (obtained from the NCBI Short Reads Trace Archive, SRA000299), trimmed to 32 bases by the program ELAND and aligned on the whole human genome (Build 36.3). The computer used for the analysis was a 2 CPUs Dual core AMD with 32 Gb RAM, running at 2.4 GHz.
The importance of the length of the seed words is easily understandable. Seed words of 12–13 bases guarantee the identification of any alignment longer than 26 bases with less than two mismatches (Baeza-Yates and Navarro, 1996), but more degenerated alignments may be missed. The risk of missing alignments increases if longer words are used for seeding. On the other hand, shorter words result in more seeds to extend, hence longer execution times.
The extension process is generally the most critical step of alignment algorithms based on hashing tables, both for execution time and sensitivity. PASS uses PST of every short word aligned against each other (see Section 2). To evaluate the selectiveness of the PST step, we considered all of the possible alignments of seven bases and we found that only in 0.7% of the cases there were less than two discrepancies (gaps or mismatches), while in 11.5% of the cases there were less than three. As a result, we calculated that the combined probability that the two sides of a seed word have a total of three or less discrepancies is ∼0.08%, indicating that >99.9% of the random alignments would not pass this selection. The same analysis performed on eight bases words showed that 99.997% of the random alignments would be blocked. We think that this level of selectiveness is very effective in reducing the execution time, without affecting at all the sensitivity. As a result, the program performs very well both for speed and sensitivity, as shown in Figure 1.
It can be easily appreciated from the figure that the sensitivity of both programs increases with the shortening of the seed words. However, even with words of 14, PASS has a better sensitivity than SOAP with words of 11, and it runs at least 10 times faster. Our recommendation is to use PASS with words of 12 or 13 bases, thus making the alignment process much faster without significantly compromising the sensitivity.
Funding: Italian Telethon Foundation (grant GSP04289/1C); MiPAF VIGNA Project and FIRB (grant RBLA0345SF).
Conflict of Interest: none declared.