## Abstract

**Motivation:** The only algorithm guaranteed to find the optimal local alignment is the Smith–Waterman. It is also one of the slowest due to the number of computations required for the search. To speed up the algorithm, Single-Instruction Multiple-Data (SIMD) instructions have been used to parallelize the algorithm at the instruction level.

**Results:** A faster implementation of the Smith–Waterman algorithm is presented. This algorithm achieved 2–8 times performance improvement over other SIMD based Smith–Waterman implementations. On a 2.0 GHz Xeon Core 2 Duo processor, speeds of >3.0 billion cell updates/s were achieved.

**Contact:**farrar.michael@gmail.com

## 1 INTRODUCTION

The Smith–Waterman (Smith and Waterman, 1981) algorithm is one of the slowest and most sensitive sequence search algorithms. As the size of the GenBank/EMBL/DDBJ double every 15 months (Benson *et al*., 2000), faster implementations of the Smith–Waterman algorithm are needed to keep pace. One recent optimization has been adopting the algorithm to Single-Instruction Multiple-Data (SIMD) microprocessors. A SIMD instruction is able to perform the same operation on multiple pieces of data in parallel.

One of the first Smith–Waterman SIMD implementations was Alpern *et al*. (1995). This approach divided the 64-bit Z-buffer registers of the Intel Paragon i860 processor into four parts. Each part of the register contained a value from four different database sequences. A 6-fold speedup was achieved over the original implementation.

Wozniak (1997) presented an implementation of the Smith–Waterman algorithm running on the Sun Ultra SPARC using its SIMD instructions, the Visual Instruction Set. The SIMD registers contained values parallel to the minor diagonal. An advantage to this implementation is that there are no conditional branches in the inner loop. Therefore, the execution time is dependent on the length of the query string and the database, not the scoring matrix or gap penalties. A major drawback of this implementation is the query profile must be computed for each database sequence. A speedup of over two times was reported over the traditional implementation.

Rognes and Seeberg (2000) presented an implementation of the Smith–Waterman algorithm running on the Intel Pentium processor using the MMX SIMD instructions. The SIMD registers contained values parallel to the query sequence. A major optimization was computing the query profile once for the entire database search. A disadvantage introduced by processing the values vertically is that conditional branches are placed in the inner loop to compute *F*. With conditional code the execution time is dependent on the length of the query string and the database, the scoring matrix and gap penalties. A speedup of over six times was reported over an optimized non-SIMD implementation.

This paper presents a new Smith–Waterman implementation where the SIMD registers are parallel to the query sequence, but are accessed in a striped pattern. Like the Rognes implementation, the query profile is calculated once for the database search, but the conditional *F* calculations are moved outside the inner loop. Calculations speeds of >3.0 GCUPS are achieved. This is a speedup of 2–8 times over the Wozniak and Rognes SIMD implementations.

## 2 METHODS

### 2.1 Smith–Waterman algorithm

The algorithm used to compute the optimal local alignment is the Smith–Waterman (Smith and Waterman, 1981) with the Gotoh (1982) improvements for handling multiple sized gap penalties. The two sequences to be compared, the query sequence and the database sequence, are defined as *Q* = *q*_{1}, *q*_{2} … *q _{m}* and

*D*=

*d*

_{1},

*d*

_{2}…

*d*. The length of the query sequence and database sequence are

_{n}*m*= ∣

*Q*∣ and

*n*= ∣

*D*∣, respectively. A scoring matrix

*W*(

*q*,

_{i}*d*) is defined for all residue pairs. Usually the weight

_{j}*W*(

*q*,

_{i}*d*) ≤0 when

_{j}*q*<>

_{i}*d*and

_{j}*W*(

*q*,

_{i}*d*) > 0 when

_{j}*q*=

_{i}*d*. The penalty for starting a gap and continuing a gap are defined as

_{j}*G*

_{init}and

*G*

_{ext}, respectively. The alignment scores ending with a gap along

*D*and

*Q*are

*E*Equation (1) and

*F*Equation (2), respectively.

*H*where 1 ≤

_{ij}*i*≤

*m*and 1 ≤

*j*≤

*n*is defined by Equation (3).

*H*,

_{ij}*E*and

_{ij}*F*are equal to 0 when

_{ij}*i*<

*1*or

*j*<

*1*.

### 2.2 Implementation

#### 2.2.1 Striped query profile

When calculating *H _{i, j}* the value from the scoring matrix

*W*(

*q*,

_{i}*d*) is added to

_{j}*H*

_{i−1, j−1}. To avoid the lookup of

*W*(

*q*) for each cell, Rognes and Seeberg (2000) calculated a query profile parallel to the query for each possible residue. The query profile is calculated just once for each database search. Then the calculation of

_{i}, d_{j}*H*requires just an addition of the pre-calculated score to the previous

_{i, j}*H*. The striped Smith–Waterman implementation takes a similar approach by pre-calculating the query profile, but with a different layout than Rognes (Fig. 1).

_{i, j}**Fig. 1**

**Fig. 1**

The layout used by the query profile is a striped access parallel to the query sequence. The query is divided into equal length segments, *S*. The number of segments, *p*, is equal to the number of elements being processed in the SIMD register. When processing byte integers (8 bit values) the *p* = 16 and when processing word integers (16 bit values) *p* = 8. The length of each segment, *t*, is (∣*Q*∣ + *p* − 1)/*p*. If the query is not long enough to fill all the segments, *t* > ∣*Q*∣, the segments are padded with null entries that have a weight of zero. The query segments are defined as S* _{n}* =

*q*

_{t(n−1)+1},

*q*

_{t(n−1)+2}, … ,

*q*

_{t(n−1)+t}where 1 ≤

*n*≤

*p*.

Each element of the SIMD registers maps to one segment. The first element in the vector maps to *S*_{1}, the second element in the vector maps to *S*_{2}, till the last element in the vector maps to *S _{p}*. The vectors move uniformly across the segments, so 〈

*H*〉 processes the

_{i, j}*i*-th element of all the segments. Equation (4) shows the segment layout when

*p*= 8 and the elements processed by the SIMD register when

*i*= 2.

The vectors making up the scoring profile *W*, like the *H* vectors, also moves uniformly across the segments. The layout of the scoring profile, when *p* = 8, is:

The query profile is stored in memory on 16-byte boundaries. By aligning the profile on a 16-byte boundary, the values are read with a single aligned load instruction, which is faster than reading unaligned data. Figure 2 has the pseudo code for generating the query profile.

**Fig. 2**

**Fig. 2**

Both the Wozniak (1997) and Rognes and Seeberg (2000) implementations have data dependencies between the previous *H* vector and the current *H* vector, Figure 3. This is also true when calculating *F*. Before *H* or *F* are calculated, the last element in the previous vector is moved to the first element in the current vector. By using the striped query access, these data dependencies are moved out of the inner loop and done just once in the outer loop when processing the next database residue.

**Fig. 3**

**Fig. 3**

#### 2.2.2 Smith–Waterman SIMD implementation

The striped Smith–Waterman implementation was written for Intel processors supporting SSE2 instructions. The pseudo code for the implementation is shown in Figure 5. The code was written in C using Intel's SSE2 intrinsics for portability.

To maximize the number of cells calculated per instruction, the SIMD SSE2 registers are divided into their smallest unit possible. The 128-bit wide registers are divided into 16 8-bit elements for processing. One instruction can therefore operate on 16 cells in parallel. Dividing the register into 8-bit elements limits the cell's range from 0 to 255. In most cases, the scores fit in the 8-bit range unless the sequences are long and similar. If a query's score exceeds the cells maximum, that query is recalculated using a higher precision.

For those queries that do require a higher precision, the register is divided into 8, 16-bit elements. This gives each cell a possible range from 0 to 65 535. The obvious drawback to using 16-bit elements is now each instruction is processing half as many cells per instruction compared with using 8-bit elements.

Due to limitations in the SSE2 instruction set, unsigned byte integers are used in the 8-bit calculations. The SSE2 instruction set supports only maximum on unsigned byte integers. Since the maximum instruction is needed to compute *E*, *F* and *H*, unsigned bytes are used in the low precision calculations. To use unsigned bytes, the query profile is biased to the smallest value in the scoring matrix. After *W* is added to *H*, the bias is subtracted from *H*. When the score is >255 − bias, the search is rerun with higher precision calculations. This approach was used in the Rognes and Seeberg (2000) implementation.

For the higher precision calculations signed short integers are used to speed up the inner loop. When using signed integers, the initial *E*, *F* and *H* values are biased by the minimum signed short integer value, −32 768 instead of 0. By biasing the initial value with its minimum possible value, the complete range of the element can be used. When the search is done the bias is added back to *H* returning a score between 0 and 65 535. Using signed arithmetic, the weight is not biased, therefore the instruction subtracting bias from *H* is not needed in the inner loop keeping calculation times down.

The *H _{i, j}* calculation is dependent on the previous value on the major diagonal,

*H*

_{i−1, j−1}. To simplify the code for handling this dependency, two buffers are allocated for storing the

*H*values. On the first pass, one buffer is used to read the previous

*H*values and the other buffer is used to store the new

*H*values. On the next pass, the buffers are swapped, so the input

*H*buffer is now the output

*H*buffer and vice-versa.

The computation of 〈*H _{i, j}*〉 where 1 ≤

*i*≤

*T*, is the addition of the weight 〈

*W*〉 to 〈

_{i, j}*H*

_{i−1, j−1}〉 to access the

*H*values on the major diagonal. If

*i*= 1 then 〈

*H*〉 is shifted to the left by one element and added to 〈

_{T, j}*W*〉. Figure 4 shows the data dependencies between the last

_{1, j}*H*vector and the first

*H*vector of the next column. The inner loop therefore no longer requires the extra operations to insert the

*H*value into the next SIMD register. The only shifting of

*H*is done once in the outer loop to get 〈

*H*〉 in the correct order.

_{T, j}**Fig. 4**

**Fig. 4**

The computation of 〈*E _{i, j}*〉 where 1 ≤

*i*≤

*T*, is the subtraction of the gap extension penalty,

*G*

_{ext}, from 〈

*E*

_{i, j−1}〉 to access the

*E*values to the left of the current cells. If

*i*= 1 then zeros are used for the value of

*E*.

The computation of 〈*F _{i, j}*〉 where 1 ≤

*i*≤

*T*, is the subtraction of the gap extension penalty,

*G*

_{ext}, from 〈

*F*

_{i−1, j}〉 to access the

*F*-values above the current cells. If

*i*= 1 then the initial calculation of 〈

*F*〉 is dependent on 〈

_{1, j}*F*〉 shifted to the left by one. Since the values of 〈

_{T, j}*F*〉 are unknown until the inner loop has completed, zeros are substituted and any errors introduced are corrected in a second pass.

_{1, j}#### 2.2.3 Lazy F evaluation

For most cells in the matrix, *F* remains at zero and does not contribute to the value of *H*. Only when *H* is greater than *G*_{init} + *G*_{ext} will *F* start to influence the value of *H*. In many instances the second pass at correcting errors introduced by *F* is not necessary. Figure 5 has the pseudo code for the lazy *F* loop. After the inner loop has completed 〈*F _{T, j}*〉 is checked against the values of 〈

*H*〉 to see if the second pass is necessary. The values in 〈

_{1, j}*F*〉 are shifted to the left by one and if any elements are greater than 〈

_{T, j}*H*〉 −

_{1, j}*G*

_{init}, then

*H*is recalculated because

*F*can change the value of

*H*.

**Fig. 5**

**Fig. 5**

The second pass loop is executed until all elements in *F* are less than *H* − *G*_{init}. If this loop processes all the segments without an early exit, an additional pass might be needed to recalculate *F*. Since each element in the vector represents a different segment of the query sequence, after processing the last vector 〈*F _{T, j}*〉, the values in 〈

*F*〉 are shifted to the left by one to move their values to the next segment. Figure 6 shows the data dependencies between the last

_{T, j}*F*vector and the first. If any elements in 〈

*F*〉 are still greater than 〈

_{T, j}*H*〉 −

_{1, j}*G*

_{init}, the loop is executed again. This loop is repeated until all elements in

*F*are below the threshold.

**Fig. 6**

**Fig. 6**

One advantage of this approach is all branches are moved out of the inner loop to the outer loop. Modern processors use branch prediction to limit the impact of branching on the run time. As execution pipelines get deeper to support higher clock rates, the penalty for a misprediction increases and therefore conditional branches should be eliminated if possible (Intel, Optimization Ref. Man.). The execution pipeline on the Pentium 4 is documented as being twice as long as for the Pentium III, which should indicate that the branch misprediction penalty is at least 20 cycles (Hinton *et al*., 2001).

## 3 RESULTS

To get meaningful comparisons of the different execution times, a test framework was developed to use both the Wozniak (1997) and Rognes and Seeberg (2000) Smith–Waterman implementations along with the striped algorithm presented in this paper. All three Smith–Waterman implementations were written in C using Intel SSE2 intrinsic functions. The programs were compiled using Microsoft Visual C++ 2005 with optimization set for maximum speed. By using SSE2 intrinsic functions instead of assembler, the compiler was responsible for optimizations, such as register usage, instruction selection and instruction scheduling.

The programs were tested on a 2.0 GHz Xeon Core 2 Duo processor with 2 GB of RAM running Windows XP SP2. The program was run as a single-threaded application, so the number of CPU cores had no affect on the execution time. All queries were run against Swiss-Prot release 49.1 comprising 75 841 138 amino acids in 208 005 sequence entries. To test the different Smith–Waterman implementations, 11 query sequences were used ranging is size from 143 to 567 amino acids. These sequences were used to test other algorithms including BLAST 2 (Altschul *et al*., 1997) and SWMMX (Rognes and Seeberg, 2000) Smith–Waterman implementation.

To test the different Smith–Waterman implementations, three different tests were run using different scoring matrices and different gap penalties. The two scoring matrices used in the testing were BLOSUM62 and BLOSUM50 (Henikoff *et al*., 1992). The two scoring matrices were used to test the runtime performance of the Rognes and striped implementations. One factor affecting the runtime performance of the Rognes and striped implementations is the scoring matrix. If the *H* values get above *G*_{init}, then the *F* values need to be calculated. The other factor affecting the number of *F* calculations is the gap penalties. One test uses different gap penalties to show the runtime characteristics of the different implementations. The higher the gap open and gap extension penalties, the fewer iterations are needed to calculate *F*. The Wozniak implementation is not affected by different scoring matrices or gap penalties, since there are no conditional calculations of *F*. An additional comparison was run comparing the execution times of the striped Smith–Waterman and different search programs using heuristic algorithms. All tests were run three times with the minimum scan time used as the final result. The MCUP rating (million cell updates per second) was calculated by ∣*Q*∣ * ∣*D*∣/ScanTime/10 ^{6}.

For the first test, the scoring matrix BLOSUM62 with a gap-open penalty of 10 and a gap-extension penalty of 1 were used. The same scoring matrix and gap penalties were used to evaluate BLAST2 and SWMMX. The search times for each of the 11 query sequences are shown in Figure 7. The Wozniak implementation completed the search in 821 s with an average of 352 MCUPS and a peak of 367 MCUPS. The Rognes implementation turns in a better search time of 354 s with an average of 816 MCUPS and a peak of 865 MCUPS. Finally, the striped implementation completed the search in 113 s with an average of 2553 MCUPS and a peak of 2998 MCUPS.

**Fig. 7**

**Fig. 7**

The next test used the same gap penalty, 10 − k, but utilized the BLOSUM50 scoring matrix. The results are shown in Figure 8. With the higher *H* scores, more time was spent calculating the value of *F*. The Wozniak implementation stayed the same taking a total of 821 seconds still averaging 351 MCUPS with a peak of 367 MCUPS. The Rognes implementation turned in a slightly better time of 771 s with the average MCUPS dropping to 374 with a peak of 419 MCUPS. The striped implementation was also affected by the higher *H* values taking 159 s to run the search averaging 1817 MCUPS with a peak of 2256 MCUPS.

**Fig. 8**

**Fig. 8**

The third test used the same 11 query sequences with the BLOSUM50 and BLOSUM62 scoring matrices, but with four different gap penalties, 10 − k, 10 − 2k, 14 − 2k and 40 − 2k. The total search times for all of the 11 query sequences are shown in Figure 9. With a large gap penalty of 40 − 2k, most of the *F* calculations were skipped for the Rognes and striped implementations, basically testing just the efficiency of the inner loop. The Wozniak implementation total search time was 13.68 min. The search times for the Rognes implementation using the gap penalty of 40 − 2k, took 2.31 min for both scoring matrices, a 60–80% improvement over the 10 − k times. The striped implementation using the gap penalty of 40 − 2k took 1.51 min, only a 20–40% improvement over the 10 − k times.

**Fig. 9**

**Fig. 9**

The final comparison is against heuristic programs FASTA 34t26b4 (Pearson and Lipman, 1988) and NCBI BLAST 2.2.14 (Altschul *et al*., 1997). The executables used in testing were downloaded from their respective web sites, the University of Virginia and the NCBI. For a more meaningful comparison, SSEARCH was modified to use the striped Smith–Waterman implementation. All the searches were run using the BLOSUM50 scoring matrix and gap penalties of 10 − 2k. The options for all programs were to display the top 500 scores with no alignments. The search times for the 11 query sequences are shown in Figure 10. FASTA was run with both ktup = 1 and 2. On the whole, the striped Smith–Waterman was faster than FASTA, more than 50% faster when ktup = 2 and four time faster when ktup = 1. SSEARCH averaged about 30% slower runtimes when compared to BLAST.

**Fig. 10**

**Fig. 10**

## 4 DISCUSSION

Due to the number of iterations in the Smith–Waterman (Smith and Waterman, 1981) calculations, reducing the number of instructions in the inner loop had a significant effect on the execution time. By using pre-calculated weights, removing the SIMD register data dependencies and moving all branches out of the inner loop, the striped Smith–Waterman has a very efficient inner loop. This paper presents an efficient SIMD implementation of the dynamic programming algorithm that might be adapted to other biological problems.

The current implementation uses block substitution matrices as the scoring matrix. The implementation could easily be adapted to use other types of scoring functions, such as position-specific scoring matrices (PSSM) (Gribskov *et al*., 1987) and possibly profile hidden Markov model (profile HMM) (Eddy, 1999). The profiles need to be re-arranged in the same stripped-pattern as the query profiles in order to work with this implementation.

Another possible use for this algorithm, in addition to database searches, would be for aligning two sequences. Other software packages use a SIMD implementation to find high scoring matches and then use a scalar Smith–Waterman to align the two sequences. This implementation could easily be modified to find the high score and the location of the scoring sequence. The location could then be used in the Hirschberg algorithm (Chao *et al*., 1994) to align the two sequences. This would allow faster alignments of larger sequences in linear space.

Dynamic programming is used for global alignment (Needleman and Wunsch, 1970) as well as local alignment. Other uses include assembling DNA sequence data from the fragments from automated sequencing machines (Anson and Myers, 1997), and to determine the intron/exon structure of eukaryotic genes (Gelfand and Royberg, 1993). It is also used to predict the secondary structure of functional RNA genes or regulatory elements (Zuker, 1989). All of these problems might benefit from an efficient SIMD implementation of the dynamic programming algorithm.

I thank William Pearson for providing the source code for SSEARCH, which was used in the initial prototyping of the striped Smith–Waterman implementation.

*Conflict of Interest*: none declared.

## Comments