## Abstract

Motivation: New high-throughput sequencing technologies have promoted the production of short reads with dramatically low unit cost. The explosive growth of short read datasets poses a challenge to the mapping of short reads to reference genomes, such as the human genome, in terms of alignment quality and execution speed.

Results: We present CUSHAW, a parallelized short read aligner based on the compute unified device architecture (CUDA) parallel programming model. We exploit CUDA-compatible graphics hardware as accelerators to achieve fast speed. Our algorithm uses a quality-aware bounded search approach based on the Burrows–Wheeler transform (BWT) and the Ferragina–Manzini index to reduce the search space and achieve high alignment quality. Performance evaluation, using simulated as well as real short read datasets, reveals that our algorithm running on one or two graphics processing units achieves significant speedups in terms of execution time, while yielding comparable or even better alignment quality for paired-end alignments compared with three popular BWT-based aligners: Bowtie, BWA and SOAP2. CUSHAW also delivers competitive performance in terms of single-nucleotide polymorphism calling for an Escherichia coli test dataset.

Availability:http://cushaw.sourceforge.net.

Supplementary information: Supplementary data are available at Bioinformatics online.

## 1 INTRODUCTION

The emergence of high-throughput sequencing technologies have led to an explosive growth of the size and availability of short read datasets. It is therefore important and challenging to efficiently and effectively map the sheer volume of short reads to genomes as large as the human genome for existing short read aligners.

Existing aligners are mainly based on two approaches: hash tables and suffix/prefix tries. The aligners based on hash tables either hash the short reads or the reference genome. By hashing short reads, aligners including RMAP (Smith et al., 2008), MAQ (Li H. et al., 2008) and SHRiMP (Rumble et al., 2009) usually have a flexible memory footprint but may have the overhead introduced by scanning the whole genome when few reads are aligned. By hashing the reference genome, aligners such as SOAP (Li R. et al., 2008), PASS (Campagna et al., 2009) and BFAST (Homer et al., 2009) can be efficiently parallelized using multiple threads but require large amounts of memory to construct an index for the reference genome.

The aligners based on suffix/prefix tries essentially perform inexact matching based on exact matching. Compared with hash tables, the advantage of the use of tries is that only one search pass is needed to find the alignments to multiple identical copies of a substring in the reference genome, whereas the hash tables need to perform one search pass for each copy. The search using suffix/prefix tries must rely on some algorithms and data structures, one such combination is the Burrows–Wheeler transform (BWT) (Burrows and Wheeler, 1994) and the Ferragina–Manzini (FM) index (Ferragina and Manzini, 2000, 2005). Several short read aligners have been developed based on the BWT and the FM index, including the popular Bowtie (Langmead et al., 2009), BWA (Li and Durbin, 2009) and SOAP2 (Li R. et al., 2009). These three aligners provide support for two types of alignments: seeded alignment, where the high-quality end of a read is used as the seed and end-to-end alignment. Moreover, paired-end mapping is usually also supported. Bowtie does not support gapped alignment, and SOAP2 and BWA support both ungapped and gapped alignments. The three aligners have been parallelized using multi-threading and are optimized for multi-core CPUs. Even though these aligners are fast, they are still time-consuming for large-scale re-sequencing applications.

The emerging many-core graphics processing units (GPUs) have evolved to be a powerful choice, in addition to multi-core CPUs and other accelerators such as Cell/BE, for the high-performance computing world. Their compute power has been demonstrated to reduce the execution time in a range of demanding bioinformatics applications including sequence alignment (Liu et al., 2009; Vouzis and Sahinidis, 2010), motif discovery (Kuttippurathu et al., 2011; Liu et al., 2010) and short read error correction (Liu et al., 2011). These successes have motivated us to use GPU computing to accelerate short read alignment.

As a first step, Blom et al. (2011) implemented a short read aligner based on the hash table approach using GPUs. However, this aligner can only work for small-scale microbial genomes and does not support paired-end mapping. In this article, we present CUSHAW, a parallelized algorithm to align short reads to large genomes, such as the human genome. It uses a quality-aware bounded search approach based on the BWT and the FM index and exploits compute unified device architecture (CUDA)-compatible GPUs to accelerate the alignment process. The performance of our algorithm is compared with three aligners: Bowtie, BWA and SOAP2, using both simulated and real short read datasets. Our experimental results show that our algorithm achieves significant speedups in terms of execution time compared with the three aligners, while giving comparable or even better alignment quality for paired-end alignments. Furthermore, we have evaluated the performance of single-nucleotide polymorphism (SNP) calling from short read alignments using the different aligners.

## 2 METHODS

### 2.1 Burrows–Wheeler transform

Given a text defined over an alphabet, a BWT is a reversible permutation of the text. For a genome sequence G defined over Σ={A, C, G, T}, the forward BWT of G can be constructed in three steps. First, a special character $, which is lexicographically smaller than any character in Σ, is appended to the end of G to form a new sequence G$. Second, a conceptual matrix M is constructed, whose rows are all cyclic rotations of G$(equivalent to all suffixes of G) sorted in lexicographical order and each column is a permutation of G$. Finally, the transformed text L (i.e. the forward BWT of G) is formed by taking the last column of M. A suffix array SA, where SA[i] stores the starting position of the ith smallest suffix of G, can be constructed from M using the one-to-one correspondence relationship between SA[i] and the ith row of M (Ferragina and Manzini, 2005; Grossi and Vitter, 2005).

The matrix M has a property called ‘last-to-first column mapping’, which means that the ith occurrence of a character in the last column corresponds to the ith occurrence of the same character in the first column. This property constitutes the basis of string searching using BWT. A reverse BWT is constructed from the reverse orientation (not the reverse complement) of G and has the same properties as the forward BWT. A reverse BWT has its own SA.

BWT construction has been extensively studied, including techniques to reduce the execution time and the peak memory size of the working space. We use the algorithm proposed by Hon et al. (2007), which only takes |G| bits of working space to construct the BWT (with a peak memory size of <1 GB for the human genome). As BWT construction is a pre-processing step which can be performed offline, we do not discuss it further. Readers can refer to Lam et al. (2008) for more information. In CUSHAW, we have used parts of the source code from the open-source BWT–SW (Lam et al., 2008) and BWA algorithms.

### 2.2 Searching for exact matches

Given a sequence S that is a substring of G, each occurrence of S can be found using a backward search procedure based on the FM index (Ferragina and Manzini, 2005). As all suffixes of G that have S as a prefix are sorted together, the search for all occurrences of S is actually equivalent to the search for an interval in SA. We define C(•) to denote an array of length |Σ|, where C(x) denotes the number of characters in G that are lexicographically smaller than x∈Σ and defines Occ(x, i) as the number of occurrences of x in L[0, i]. Thus, the SA interval for all occurrences of S in G can be recursively calculated, using the forward BWT from the rightmost to the leftmost of S, as

(1)
where Rs(i) and Re(i) are the starting and end indices of the SA interval for the suffix of S starting at position i, and Rs (|S|) and Re(|S|) are initialized as 0 and |G|, respectively. The calculation stops if it encounters Rs(i+1)>Re(i+1), and the condition Rs(i)≤Re(i) stands if and only if the suffix of S starting at position i is a substring of G. The total number of the occurrences of S in G is calculated as Re(0)−Rs(0)+1 if Rs(0) ≤Re(0) and 0, otherwise. We can also perform the backward search using the reverse BWT with the difference that it calculates the SA interval from the leftmost to the rightmost of S.

The calculation in equation (1) only relies on the occurrence array Occ, not requiring L. However, it requires up to 4|G|⌈log2|G|⌉ bits for storing Occ in memory, as the total number of elements is 4|G| and each element takes ⌈log2|G|⌉ bits. To reduce the memory footprint of Occ, we trade off the execution speed and memory space using a reduced occurrence array (ROcc), which only stores parts of elements in Occ and calculates the others with the help of L. In CUSHAW, ROcc stores the elements whose indices in Occ are multiples of K (K=128 by default) in memory. After using 2 bits to represent each character in L, the total memory size can be reduced to 4|G|⌈log2|G|⌉/K+2|G| bits, which is significantly smaller than the whole occurrence array. The memory reduction is very important for our GPU implementation, due to the limited device memory and small cache sizes. For clarity, we define bwt to denote the combination of ROcc and L from the forward BWT and define rbwt to denote the combination from the reverse BWT.

### 2.3 Searching for inexact matches

Inexact matches to a genome are indispensable in the consideration of sequencing errors and genuine differences between sequenced organisms and reference genomes. The inexact matches can be obtained by introducing substitutions, insertions and deletions in the query and the subject sequences.

In this article, our algorithm only supports ungapped alignment, i.e. CUSHAW does not allow insertions and deletions. Thus, the search for inexact matches can be transformed to the search for exact matches of all permutations of all possible bases at all positions of a short read. All the permutations can be represented by a complete four-ary tree (see Supplementary Fig. S1 in Supplementary Data), where each permutation corresponds to a path routing from the root to a leaf (the root node is Ø, meaning an empty string). Each node in the path corresponds to a base in a read that has the same position. Hence, all inexact matches can be found by traversing all paths using either depth-first search (DFS) or breadth-first search (BFS) approaches. BFS requires a large amount of memory to store the results of all valid nodes at the same depth. However, the approach is infeasible for GPU computing, as we need to launch hundreds of thousands of threads to leverage the compute power of GPUs and thus can only assign a small amount of device memory to each thread. Instead, our aligner uses the DFS approach, whose memory consumption is small and directly proportional to the depth of the tree (i.e. the full length of a read). Even though recursive functions are supported for the Fermi architecture (NVIDIA, 2009), we use a stack framework to implement the DFS approach.

### 2.4 Locating the occurrences

After getting the SA interval, the position of each occurrence in G can be determined by directly looking up the SA. However, it will take |G|⌈log2|G|⌉ bits if the entire SA is loaded into memory. This large memory consumption is prohibitive for large genomes (e.g. for the human genome, the SA takes about 12 GB RAM). Fortunately, we can reconstruct the entire SA from parts of it. Ferragina and Manzini (2005) have shown that an unknown value SA[i] can be re-established from a known SA[j] using equation (2).

(2)
where β(t)(i) means repeatedly applying the function β(i) t times. The function uses last-to-first column mapping for the ith row of M and is calculated as
(3)
In equation (2), t is actually the distance between the two starting positions (i.e. SA[i] and SA[j]) in G. Thus, for any unknown ith element of SA, we can calculate its value SA[i] in at most k iterations if we store the SA elements whose values are multiples of a constant number k (i.e. the starting positions in G are multiples of k). However, this will complicate the storage of SA by introducing additional data structures, making it difficult to map to GPU memory. Instead, we construct a reduced suffix array (RSA) by simply storing SA[i] whose index i is a multiple of k (k=32 by default). This approach reduces the total memory size of a suffix array to |G|⌈log2|G|⌉/k bits, but cannot guarantee to complete each computation within k iterations. This is because a maximal distance k between i and j does not mean a maximal distance k for starting positions SA[i] and SA[j] in G. The selection of k is a trade-off between lookup time and memory space. For a suffix array index i that is not a multiple of k, we repeat t iterations using equation (3) until j is a multiple of k, where SA[j] is equal to RSA[j/k], and then calculate SA[i] as SA[j]+t following equation (2).

### 2.5 Mapping using CUDA

To design an efficient short read alignment algorithm using CUDA, it is critical to fully understand the underlying hardware architecture and programming constraints. More than a computing architecture, CUDA is also a parallel programming language that extends the general programming languages, such as C/C++ and Fortran, with a minimalist set of abstractions to express parallelisms. A CUDA program is comprised of code for the host and kernels for the devices. A kernel is a program launched over a set of lightweight parallel threads on GPUs, where the threads are organized into a grid of thread blocks. All threads in a thread block are split into small groups of 32 parallel threads, called warps, for execution. These warps are scheduled in a single instruction, multiple thread fashion. Full efficiency and performance can be obtained when all threads in a warp execute the same code path. It is the programmers' responsibility to limit the amount of thread divergence.

A CUDA-enabled GPU is built around a fully configurable scalable processor array, organized into a set of streaming multiprocessors (SMs) using two different architectures: Tesla architecture (Lindholm et al., 2008) and Fermi architecture (NVIDIA, 2009). For the Tesla architecture, the number of SMs per GPU device varies from generation to generation, and each SM comprises eight scalar processors (SPs), sharing 8 or 16 kB of 32-bit registers depending on compute capabilities and a fixed 16 kB of on-chip shared memory. The local and global memory are not cached and the amount of local memory size per thread is 16 kB. For the newer Fermi architecture, each GPU device contains 16 SMs with each SM comprising 32 SPs, where each SM has a total number 32 kB of 32-bit registers and has a configurable shared memory size from the 64 kB on-chip memory. This on-chip memory can be configured, for each kernel at runtime, as 48 kB of shared memory with 16 kB L1 cache or as 16 kB of shared memory with 48 kB L1 cache. The Fermi architecture has a larger local memory size of 512 kB per thread than the Tesla architecture. Furthermore, a L1 cache of configurable size per SM and a unified L2 cache of size 768 kB per device are introduced for local and global memory caching. The L1/L2 cache hierarchy offers the potential to significantly improve the performance compared with the direct access to the external device memory. The global memory caching in the L1 cache can be disabled at compile time, whereas the local memory caching in L1 cannot be disabled. Thus, it is useful to find out the best combination for a given kernel: 16 or 48 kB per SM L1 cache (vice versa for shared memory) with or without global memory caching in L1 and with more or less usage of local memory. Kernels that use a large amount of local and global memory might be able to benefit from the 48 kB L1 cache.

As mentioned earlier, CUSHAW searches for inexact matches by traversing the tree in a DFS way. This traversing is implemented using a stack data structure. It is impractical to store the stack per thread in shared memory even though we configure a shared memory size of 48 kB. In this case, we implement the stack in the cached local memory for each thread. For the alignment of a read, both bwt and rbwt are used for the calculation. They are frequently accessed and do not show good data locality when performing alignments. Considering the large amount of device memory consumed by bwt and rbwt (e.g. the human genome requires about 2.2 GB memory), we store them in cached global memory, instead of cached texture memory. This organization is based on two considerations: (1) we do not need some other features, such as address calculations or texture filtering, that can benefit from texture fetches and (2) L1 cache has a higher bandwidth than the texture cache. Hence, for access to large memory, we can expect a higher performance gain through regular global memory loads cached in L1 compared with texture fetches. After tuning, CUSHAW achieves the best performance when the GPU is configured with a 48 kB per SM L1 cache with global memory cached in L1.

CUSHAW uses a quality-aware bounded search approach to reduce the search space and guarantee alignment. It supports two types of alignments: seeded alignment and end-to-end alignment. The end-to-end alignment is considered as a special case of the seeded alignment which considers the full length of a read as the seed size. The reverse complement of a read is also incorporated and for clarity, the following discussions only refer to the forward strand. The quality-aware property exploits the numerical quality score of each base in a read. A quality score Q is assumed to be calculated from the probability p that the corresponding base is incorrect, following the PHRED (Ewing and Green, 1998) definition Q=−log10p. Lower quality scores indicate higher probabilities of incorrect bases. The bounded search is able to reduce the search space by using several constraints on sums of quality scores and maximal allowable number of mismatches. The use of these constraints can prune branches of the complete tree ahead of time and thus significantly reduce the number of backtracks. The constraints have been used in other aligners, such as Bowtie, BWA and SOAP2, in part or whole and are detailed as Increasing MMS and MMR might enable the alignment of more reads but will result in a longer execution time. Decreasing QSS and QSR focuses the aligner more on mismatches with low-quality scores and can thus facilitate earlier pruning of some branches. However, smaller values also carry the risk of pruning real alignments. Figure 1 shows the workflow of CUSHAW.

• MMS: the maximal number of mismatches allowed in the seed (default = 2);

• MMR: the maximal number of mismatches allowed in the full length of a read, which is calculated as MMS+[err×|S|], where err is the uniform base error rate (default = 4%);

• QSS: the maximal sum of quality scores at all mismatched positions in the seed (default = 70);

• QSR: the maximal number of quality scores at all mismatched positions in the full length of a read (default =3× QSS);

• QSRB: the maximal QSR among the currently selected best alignments, which is updated as the aligning process goes on.

Fig. 1.

Program workflow of CUSHAW

Fig. 1.

Program workflow of CUSHAW

CUSHAW outputs the alignments with the smallest quality score sum over all mismatched positions in the full-length read alignment. All possible alignments are compared and enumerated by using the above constraints. Our alignment approach is different from the ones used in Bowtie and BWA. Bowtie allows any number of mismatches in the non-seed region and outputs the ‘best’ alignment after a specified maximum number of search attempts. BWA does not use base-quality scores when performing alignments but assigns different penalties on mismatches and gaps. It then reports the alignment with the best score that is calculated from the number of mismatches and gaps in the alignment.

When searching along a path from top to bottom using DFS, we check the quality score constraints: QSS, QSR and QSRB against the current sum of quality scores at all mismatched positions in the path from the root to the current node (including the current node). If the current quality score sum does not satisfy all the three constraints, we will stop searching the sub-trees of the current node. In addition, we must confine the number of mismatches within the allowable ranges. To facilitate the bounded search using the constraints with respect to the number of mismatches, we estimate the lower bound of the number of substitutions that are required to transform the substring, represented by the sequence of nodes in the sub-path down from the current node to the leaf, to exactly match to the genome. This estimation has been used in BWA and is calculated based on the observation that if a sequence S exactly matches to a genome, any substring of S must exactly match to the genome. Hence, the lower bound of the number of substitutions for a string can be estimated by counting the number of all constituent non-overlapping substrings that do not exactly match to the genome. We define a vector DIFFS(•) for the full length of S, where DIFFS(i) represents the lower bound of the number of substitutions in the substring S[i+1, |S| − 1] (0≤I <|S|−1). When using the seeded alignment with a seed size W, we calculate another vector SDIFFS(•) for the seed, where SDIFFS(i) represents the lower bound of the number of substitutions in the substring S[i+1, W−1] (0≤i<W − 1). The current node corresponds to a base (mutated or not) at position i of S and holds the number of mismatches in the sub-path routing to the root (UPMM), including the current node. We will not traverse the sub-trees of the current node if either UPMM + SDIFFS(i)> MMS when the current node is in the seed region or UPMM + DIFFS(i)> MMR for the full length. Figure 2 shows the pseudocode of the CUDA kernel for the search from the forward strand of S.

Fig. 2.

Pseudocode of the CUDA kernel for the search from the forward strand

Fig. 2.

Pseudocode of the CUDA kernel for the search from the forward strand

To calculate the mapping positions in G using RSA, an approach is to calculate the SA intervals of the best alignments on GPUs and the mapping positions using equation (2) on CPUs. This approach requires loading an RSA and its corresponding bwt (or rbwt) into memory to improve execution speed, which introduces more memory overhead for the host. Moreover, the calculation of j and t in equation (2) imposes more compute overhead, as well, and thus increases the execution time. Fortunately, the calculation of j and t is independent of the RSAs and can be directly computed using the corresponding bwt (or rbwt) after gaining the SA interval. Hence, we select the best alignments, calculate the j-and t-values on GPUs and then output the j and t instead of the SA interval. Thus, the mapping positions can be calculated very quickly on the host by only two operations: one table lookup and one addition, after loading the RSAs into memory.

### 2.6 Paired-end mapping

Fig. 3.

Workflow of the multi-threaded paired-end mapping

Fig. 3.

Workflow of the multi-threaded paired-end mapping

## 3 RESULTS

We have evaluated the performance of CUSHAW by comparing to three popular short read aligners: Bowtie (version 0.12.7), BWA (version 0.5.9rc1) and SOAP2 (version 2.21) using simulated and real short read datasets (Table 1). The first three datasets in Table 1 are paired-end datasets simulated from the human genome using the wgsim utility program in the SAMtools package (version 0.1.17) (Li H. et al., 2009), with 0.5% SNP mutation rate, 0.05% indel mutation rate and 1% uniform sequence base error rate. The insert size is drawn from a normal distribution N(200, 10) for the SIM36 and SIM72 datasets and from a normal distribution of N(500, 30) for the SIM120 dataset. The simulated datasets are available for download at http://sourceforge.net/projects/cushaw/files/data. The last five datasets are real datasets, named after their accession numbers in the NCBI Sequence Read Archive. All the tests are conducted on a workstation with an AMD Opteron 2378 2.4 GHz quad-core processor and 8 GB RAM running the Linux operating system. Two Fermi-based Tesla C2050 GPUs are installed in the workstation. A single GPU consists of 14 SMs (a total of 448 SPs) with a core frequency of 1.15 GHz and with 3 GB of user available device memory (after turning off error correcting code).

Table 1.

Datasets used for performance evaluation

SIM36 PE 36 2 000 012 200 bp
SIM72 PE 72 2 000 012 200 bp
SIM120 PE 120 2 000 012 500 bp
SRR002273 PE 36 8 552 428 200 bp
ERR000589 PE 51 24 277 796 200 bp
SRR033552 PE 75 20 483 110 200 bp
SRR034966 PE 100 41 010 976 500 bp
ERR024139 PE 100 53 542 962 311 bp
SIM36 PE 36 2 000 012 200 bp
SIM72 PE 72 2 000 012 200 bp
SIM120 PE 120 2 000 012 500 bp
SRR002273 PE 36 8 552 428 200 bp
ERR000589 PE 51 24 277 796 200 bp
SRR033552 PE 75 20 483 110 200 bp
SRR034966 PE 100 41 010 976 500 bp
ERR024139 PE 100 53 542 962 311 bp

### 3.1 Alignment quality

The alignment quality is conventionally evaluated by computing how many single-end reads are found to have matches to the reference genome and how many paired-end reads are paired together using simulated or real short read datasets. As we know the exact positions of each read in the simulated datasets, we introduce two more stringent measures to compare the aligners: The distance d for the MCP measure is considered because of two reasons: (1) a few contiguous low-quality bases starting from the 3′-end and a few contiguous leading or trailing unknown bases might be trimmed and (2) gaps might be introduced. In the following evaluations, the single-end alignments consider all aligned reads and the paired-end alignments take into account only the reads that have been paired to calculate the MCP and EMCP. Thus, the MCP and EMCP for paired-end alignments might be underestimated compared to when all aligned reads are considered. For all aligners, we have tuned the parameters (see Supplementary Data for details) with the intention to gain the highest performance.

• MCP: the number of reads that have a match to their correct positions with a maximal distance of d (d=5 in our evaluation) in the genome;

• EMCP: the number of reads that have an exact match to their correct positions in the genome.

We first evaluate all aligners using the three simulated datasets (Table 2). For both the single-end and paired-end alignments, BWA outperforms all other aligners, while Bowtie performs worst, in terms of all measures. SOAP2 performs slightly better than CUSHAW for the single-end alignment, but the latter outperforms the former for the paired-end alignment (with an exception of the SIM36 dataset). After using paired-end mapping, both CUSHAW and BWA were able to improve MCP and EMCP. However, both Bowtie and SOAP2 do not always follow this trend. We have also simulated three datasets with a higher uniform base error rate of 4%. The dataset information as well as the alignment results can be obtained from Supplementary Data. Using these three datasets, the alignment quality varies much for the aligners, where Bowtie outperforms the other aligners for the single-end alignment and CUSHAW performs best for the paired-end alignment.

Table 2.

Alignment results for simulated datasets

Type Aligner SIM36

SIM72

SIM120

Aligned (paired) MCP EMCP Aligned (paired) MCP EMCP Aligned (paired) MCP EMCP
SE CUSHAW 1 822 197 1 596 293 1 595 327 1 791 559 1 705 766 1 704 743 1 705 318 1 649 885 1 648 954
Bowtie 1 822 191 1 593 327 1 592 366 1 775 714 1 686 975 1 685 982 1 648 097 1 597 055 1 596 288
BWA 1 823 640 1 598 321 1 597 298 1 827 683 1 740 344 1 739 150 1 819 577 1 764 549 1 763 384
SOAP2 1 825 550 1 598 291 1 597 316 1 799 550 1 706 894 1 705 859 1 765 747 1 705 692 1 704 724
PE CUSHAW 1 841 802 1 691 464 1 689 415 1 844 816 1 769 581 1 767 417 1 838 152 1 787 936 1 785 860
Bowtie 1 793 318 1 676 232 1 674 846 1 705 892 1 648 008 1 647 021 1 469 346 1 436 112 1 435 416
BWA 1 832 606 1 757 384 1 755 500 1 843 374 1 796 106 1 794 208 1 844 454 1 809 769 1 807 890
SOAP2 1 793 426 1 700 966 1 699 779 1 740 120 1 685 439 1 684 416 1 578 630 1 541 413 1 540 692
Type Aligner SIM36

SIM72

SIM120

Aligned (paired) MCP EMCP Aligned (paired) MCP EMCP Aligned (paired) MCP EMCP
SE CUSHAW 1 822 197 1 596 293 1 595 327 1 791 559 1 705 766 1 704 743 1 705 318 1 649 885 1 648 954
Bowtie 1 822 191 1 593 327 1 592 366 1 775 714 1 686 975 1 685 982 1 648 097 1 597 055 1 596 288
BWA 1 823 640 1 598 321 1 597 298 1 827 683 1 740 344 1 739 150 1 819 577 1 764 549 1 763 384
SOAP2 1 825 550 1 598 291 1 597 316 1 799 550 1 706 894 1 705 859 1 765 747 1 705 692 1 704 724
PE CUSHAW 1 841 802 1 691 464 1 689 415 1 844 816 1 769 581 1 767 417 1 838 152 1 787 936 1 785 860
Bowtie 1 793 318 1 676 232 1 674 846 1 705 892 1 648 008 1 647 021 1 469 346 1 436 112 1 435 416
BWA 1 832 606 1 757 384 1 755 500 1 843 374 1 796 106 1 794 208 1 844 454 1 809 769 1 807 890
SOAP2 1 793 426 1 700 966 1 699 779 1 740 120 1 685 439 1 684 416 1 578 630 1 541 413 1 540 692

Finally, we have assessed the alignment quality (Table 3) using the five real datasets in Table 1. For the single-end alignment, CUSHAW is inferior to both Bowtie and SOAP2 for all datasets. CUSHAW aligned more reads than BWA for the SRR002273 and ERR000589 datasets, but aligned fewer for the other three datasets. This is due to the progressive constraint approach, which is likely to discard a relevant part of the single-end alignment search space. For the paired-end alignment, CUSHAW performs best and BWA outperforms SOAP2, for all datasets. BWA is superior to Bowtie for the SRR002273, ERR000589, SRR033552 and ERR024139 datasets and is slightly inferior to it for the SRR034966 dataset. Similar to the results using simulated datasets, the performance of both CUSHAW and BWA improves after using paired-end mapping, but the performance of both Bowtie and SOAP2 deteriorates. For a real dataset, it is difficult for us to prove the correctness of the alignments rescued in the paired-end mapping stage, but we can get some supportive evidences from the paired-end alignments of all simulated datasets to imply the effectiveness of our paired-end mapping policy.

Table 3.

Alignment results for real datasets (in %)

Datasets Type CUSHAW Bowtie BW SOAP2
SRR002273 SE 92.58 94.69 90.26 94.85
PE 95.77 87.56 90.85 87.89
ERR000589 SE 94.76 96.51 94.06 96.87
PE 97.72 92.42 94.60 91.09
SRR033552 SE 88.71 91.70 89.12 92.03
PE 94.46 86.86 92.35 82.17
SRR034966 SE 78.89 91.25 85.10 85.10
PE 90.56 90.55 90.51 73.11
ERR024139 SE 89.69 92.09 93.28 92.68
PE 95.11 87.58 94.46 86.25
Datasets Type CUSHAW Bowtie BW SOAP2
SRR002273 SE 92.58 94.69 90.26 94.85
PE 95.77 87.56 90.85 87.89
ERR000589 SE 94.76 96.51 94.06 96.87
PE 97.72 92.42 94.60 91.09
SRR033552 SE 88.71 91.70 89.12 92.03
PE 94.46 86.86 92.35 82.17
SRR034966 SE 78.89 91.25 85.10 85.10
PE 90.56 90.55 90.51 73.11
ERR024139 SE 89.69 92.09 93.28 92.68
PE 95.11 87.58 94.46 86.25

### 3.2 SNP calling

To evaluate the SNP calling performance of the respective aligners, we first aligned the 20.8 million 200-bp-insert-size paired-end reads (accession number SRR001665 in NCBI SRA, whose reference genome is Escherichia coli K12 MG1655 with accession number NC_000913 in GenBank) to a related reference genome E.coli 536 (accession number NC_008253) using the paired-end alignment for each aligner. Second, we called SNPs from the aligned reads using the mpileup utility program in SAMtools. Finally, we compared the identified SNPs with the ‘correct’ SNPs of the two genomes. As the correct SNPs are unknown, we have used a cross-match approach to predict the correct SNPs, where an SNP is considered to be correct if and only if it has been identified by both the dnadiff utility program in MUMmer (version 3.23) (Kurtz et al., 2004) and the combination of BWA–SW (Li and Durbin, 2010) and SAMtools, through whole genome alignments. The detailed procedure for identifying the ‘correct’ SNPs, as well as the SNP files, is given in Supplementary Data.

We evaluated the SNP calling performance of all aligners using the precision, recall and F-score measures (Table 4). Precision is defined as TP/(TP+FP), recall as TP/(TP+FN) and F-score as 2 × precision × recall/(precision + recall), where TP is a true positive, representing a match with a ‘correct’ SNP, FP is a false positive representing a mismatch and FN represents a ‘correct’ SNP that was not identified. From the single example presented in Table 4, CUSHAW has the best recall but gives the worst precision, whereas Bowtie yields the best precision but gives the worst recall. In terms of the F-score, BWA is the best, CUSHAW is second and Bowtie is the worst.

Table 4.

SNP calling accuracy comparison

CUSHAW Bowtie BW SOAP2
TP 96 980 84 286 96 690 92 343
FP 10 780 5889 8611 9314
FN 2472 15 166 2762 7109
Precision 90.00% 93.47% 91.82% 90.84%
Recall 97.51% 84.75% 97.22% 92.85%
F-score 0.936 0.889 0.944 0.918
CUSHAW Bowtie BW SOAP2
TP 96 980 84 286 96 690 92 343
FP 10 780 5889 8611 9314
FN 2472 15 166 2762 7109
Precision 90.00% 93.47% 91.82% 90.84%
Recall 97.51% 84.75% 97.22% 92.85%
F-score 0.936 0.889 0.944 0.918

### 3.3 Execution speed

Besides alignment quality, another major concern about short read alignment is the execution speed considering the sheer volume of short reads produced from the high-throughput sequencing technologies. We have compared the speed of CUSHAW on one and two GPUs with the three aligners using multiple threads. In this article, all the execution times (in seconds) are wall clock times that are taken to complete the whole computation for each aligner.

Table 5 shows the execution speed comparison between aligners for the single-end and paired-end alignments, respectively, using the five real datasets. The table shows the execution times of CUSHAW using a single GPU and two GPUs, and the execution times of the other three aligners using a single thread and four threads on a quad-core CPU. Each aligner uses the same parameters as in Table 3 with an additional parameter to specify the number of GPUs or threads.

Table 5.

Execution speed comparison between aligners using real datasets

Type Aligner SRR002273 (36 bp)

ERR000589 (51 bp)

SRR033552 (75 bp)

SRR034966 (100 bp)

ERR024139 (100 bp)

1 core (1 GPU) 4 cores (2 GPUs) 1 core (1 GPU) 4 cores (2 GPUs) 1 core (1 GPU) 4 cores (2 GPUs) 1 core (1 GPU) 4 cores (2 GPUs) 1 core (1 GPU) 4 cores (2 GPUs)
SE CUSHAW 226 129 1365 852 3096 1704 9532 5051 7077 3930
Bowtie 709 249 3072 944 3430 940 7735 2115 14 379 3949
BWA 2715 896 13 662 4057 19 055 5515 60 957 17 308 48 128 14 176
SOAP2 1788 625 7307 2252 10 470 3084 62 362 16 815 28 381 8417
PE CUSHAWa 292 203 1700 1067 3441 2004 10 716 6528 8179 5031
Bowtie 1271 398 20 017 5053 15 430 4023 32 126 8366 41 177 10 634
BWA 4246 2469 15 713 6118 20 123 6569 63 121 19 412 49 989 16 157
SOAP2 7098 2103 15 986 4862 11 830 3405 48 525 13 337 29 028 8528
Type Aligner SRR002273 (36 bp)

ERR000589 (51 bp)

SRR033552 (75 bp)

SRR034966 (100 bp)

ERR024139 (100 bp)

1 core (1 GPU) 4 cores (2 GPUs) 1 core (1 GPU) 4 cores (2 GPUs) 1 core (1 GPU) 4 cores (2 GPUs) 1 core (1 GPU) 4 cores (2 GPUs) 1 core (1 GPU) 4 cores (2 GPUs)
SE CUSHAW 226 129 1365 852 3096 1704 9532 5051 7077 3930
Bowtie 709 249 3072 944 3430 940 7735 2115 14 379 3949
BWA 2715 896 13 662 4057 19 055 5515 60 957 17 308 48 128 14 176
SOAP2 1788 625 7307 2252 10 470 3084 62 362 16 815 28 381 8417
PE CUSHAWa 292 203 1700 1067 3441 2004 10 716 6528 8179 5031
Bowtie 1271 398 20 017 5053 15 430 4023 32 126 8366 41 177 10 634
BWA 4246 2469 15 713 6118 20 123 6569 63 121 19 412 49 989 16 157
SOAP2 7098 2103 15 986 4862 11 830 3405 48 525 13 337 29 028 8528

aCUSHAW uses four threads to perform the paired-end mapping on the CPU.

For the paired-end alignment, CUSHAW achieves significant speedups over all other three aligners (with an exception that for SRR034966, CUSHAW on two GPUs executes only 1.3× faster than Bowtie on four CPU cores). On a single GPU (two GPUs), CUSHAW achieves an average speedup of 5.7 (2.4) with a highest of 11.8 (4.7) over Bowtie, an average speedup of 8.3 (5.5) with a highest of 14.5 (12.2) over BWA and an average speedup of 8.5 (4.1) with a highest of 24.3 (10.4) over SOAP2, where the three aligners run on a single CPU core (four CPU cores) for all five datasets. Similar to the single-end alignment, the speedups of CUSHAW over BWA and SOAP2 generally decrease with the increase of read lengths for the paired-end alignment, but they are also still considerable for the two 100-bp datasets. Moreover, CUSHAW on a single GPU (two GPUs) also achieves nearly stable speedups over BWA on a single CPU core (four CPU cores) for the three datasets of read lengths ≥75, with an average speedup of about 5.9±0.1 (3.2±0.1).

As mentioned above, with the increase of read lengths, the number of possible alignments that needs to be evaluated by CUSHAW grows significantly. With the progress in high-throughput sequencing technologies, longer reads (e.g. ≥150 bp) will become more frequent in the future. This will pose challenges to CUSHAW in terms of execution speed since our aligner attempts to evaluate all possible alignments to find the best alignments by using some constraints.

## 4 CONCLUSIONS

We have presented CUSHAW, a parallelized BWT-based short read aligner to the human genome. The algorithm is based on the CUDA C++ parallel programming model and uses CUDA-compatible graphics hardware as accelerators to achieve fast execution speed. It uses a quality-aware bounded search approach to reduce the search space and to guarantee alignment quality. Evaluation using simulated and real short reads reveals that our algorithm is able to achieve significant speedups in execution time over three popular BWT-based aligners: Bowtie, BWA and SOAP2, while producing comparable or even better alignment quality for paired-end alignments. The performance of SNP calling from short read alignments was also examined. Although the single example presented is insufficient to fully evaluate the performance of all the aligners, it still sheds some light on the impact of the different aligners in terms of their SNP calling performance.

## ACKNOWLEDGEMENTS

The authors thank Dr Liu Weiguo for providing the experimental environments and thank the editor and the anonymous reviewers for their helpful and constructive comments which helped to improve the manuscript. We acknowledge the BWT–SW authors (T.W. Lam, W.K. Sung, S.L. Tam, C.K. Wong and S.M. Yiu) and the BWA authors (Heng Li and Richard Durbin) for the use of parts of source code from their open-source BWT–SW and BWA algorithm, respectively.

Conflict of interest: none declared.

## REFERENCES

Blom
J.
, et al.  .
Exact and complete short read alignment to microbial genomes using GPU programming
Bioinformatics
,
2011
, vol.
27
(pg.
1351
-
1358
)
Burrows
M.
Wheeler
D.J.
A block sorting lossless data compression algorithm.
,
1994
Palo Alto, CA
Digital Equipment Corporation

Technical Report 124
Campagna
D.
, et al.  .
PASS: a program to align short sequences
Bioinformatics
,
2009
, vol.
25
(pg.
967
-
968
)
Ewing
B.
Green
P.
Base-calling of automated sequencer traces using phred. II. Error probabilities
Genome Res.
,
1998
, vol.
8
(pg.
186
-
194
)
Ferragina
P.
Manzini
G.
Opportunistic data structures with applications
Proceedings of the 41st Annual Symposium on Foundations of Computer Science
,
2000
IEEE Computer Society
pg.
390

Ferragina
P.
Manzini
G.
Indexing compressed text
J. ACM
,
2005
, vol.
52
pg.
4

Grossi
R.
Vitter
J.S.
Compressed suffix arrays and suffix trees with applications to text indexing and string matching
SIAM J. Comput.
,
2005
, vol.
35
pg.
2

Homer
N.
, et al.  .
BFAST: an alignment tool for large scale genome resequencing
PLoS One
,
2009
, vol.
4
pg.
e7767

Hon
W.K.
, et al.  .
A space and time efficient algorithm for constructing compressed suffix arrays
Algorithmica
,
2007
, vol.
48
(pg.
23
-
36
)
Kurtz
S.
, et al.  .
Versatile and open software for comparing large genomes
Genome Biol.
,
2004
, vol.
5
pg.
R12

Kuttippurathu
L.
, et al.  .
CompleteMOTIFs: DNA motif discovery platform for transcription factor binding experiments
Bioinformatics
,
2011
, vol.
27
(pg.
715
-
717
)
Lam
T.W.
, et al.  .
Compressed indexing and local alignment of DNA
Bioinformatics
,
2008
, vol.
24
(pg.
791
-
797
)
B.
, et al.  .
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome
Genome Biol.
,
2009
, vol.
10
pg.
R25

Li
H.
, et al.  .
Mapping short DNA sequencing reads and calling variants using mapping quality scores
Genome Res.
,
2008
, vol.
18
(pg.
1851
-
1858
)
Li
H.
, et al.  .
The sequence alignment/map format and SAMtools
Bioinformatics
,
2009
, vol.
25
(pg.
2078
-
2079
)
Li
H.
Durbin
R.
Fast and accurate short read alignment with Burrows–Wheeler transform
Bioinformatics
,
2009
, vol.
25
(pg.
1755
-
1760
)
Li
H.
Durbin
R.
Fast and accurate long-read alignment with Burrows–Wheeler transform
Bioinformatics
,
2010
, vol.
26
(pg.
589
-
595
)
Li
R.
, et al.  .
SOAP: short oligonucleotide alignment program
Bioinformatics
,
2008
, vol.
24
(pg.
713
-
714
)
Li
R.
, et al.  .
SOAP2: an improved ultrafast tool for short read alignment
Bioinformatics
,
2009
, vol.
25
(pg.
1966
-
1967
)
Lindholm
E.
, et al.  .
NVIDIA Tesla: a unified graphics and computing architecture
IEEE Micro
,
2008
, vol.
28
(pg.
39
-
55
)
Liu
Y.
, et al.  .
CUDASW++: optimizing Smith–Waterman sequence database searches for CUDA-enabled graphics processing units
BMC Res. Notes
,
2009
, vol.
2
pg.
73

Liu
Y.
, et al.  .
CUDA–MEME: accelerating motif discovery in biological sequences using CUDA-enabled graphics processing units
Pattern Recognit. Lett.
,
2010
, vol.
31
(pg.
2170
-
2177
)
Liu
Y.
, et al.  .
DecGPU: distributed error correction on massively parallel graphics processing units using CUDA and MPI
BMC Bioinformatics
,
2011
, vol.
12
pg.
85

NVIDIA
NVIDIA's next generation CUDA compute architecture: Fermi. NVIDIA Corporation Whitepaper.
,
2009

Rumble
S.M.
, et al.  .
SHRiMP: accurate mapping of short color-space reads
PLoS Comput. Biol.
,
2009
, vol.
5
pg.
e1000386

Smith
A.D.
, et al.  .
Using quality scores and longer reads improves accuracy of Solexa read mapping
BMC Bioinformatics
,
2008
, vol.
9
pg.
128

Smith
T.F.
Waterman
M.S.
Identification of common molecular subsequences
J. Mol. Biol.
,
1998
, vol.
147
(pg.
195
-
197
)
Vouzis
P.D.
Sahinidis
N.V.
GPU-BLAST: using graphics processors to accelerate protein sequence alignment
Bioinformatics
,
2010
, vol.
27
(pg.
182
-
188
)

## Author notes

Associate Editor: Alfonso Valencia