Long read alignment based on maximal exact match seeds

Motivation: The explosive growth of next-generation sequencing datasets poses a challenge to the mapping of reads to reference genomes in terms of alignment quality and execution speed. With the continuing progress of high-throughput sequencing technologies, read length is constantly increasing and many existing aligners are becoming inefficient as generated reads grow larger. Results: We present CUSHAW2, a parallelized, accurate, and memory-efficient long read aligner. Our aligner is based on the seed-and-extend approach and uses maximal exact matches as seeds to find gapped alignments. We have evaluated and compared CUSHAW2 to the three other long read aligners BWA-SW, Bowtie2 and GASSST, by aligning simulated and real datasets to the human genome. The performance evaluation shows that CUSHAW2 is consistently among the highest-ranked aligners in terms of alignment quality for both single-end and paired-end alignment, while demonstrating highly competitive speed. Furthermore, our aligner shows good parallel scalability with respect to the number of CPU threads. Availability: CUSHAW2, written in C++, and all simulated datasets are available at http://cushaw2.sourceforge.net Contact: liuy@uni-mainz.de; bertil.schmidt@uni-mainz.de Supplementary information: Supplementary data are available at Bioinformatics online.


INTRODUCTION
Many biological applications of next-generation sequencing (NGS) require the alignment of large quantities of produced reads to a given reference genome. Consequently, a wide variety of short read aligners have been developed in recent years. They can be classified into two categories according to their approaches to identify seeds: hash tables and prefix/suffix tries. MAQ , SOAP , SHRiMP (Rumble et al., 2009) and BFAST (Homer et al., 2009) are examples of the hash table approach. Bowtie (Langmead et al., 2009), BWA (Li and Durbin, 2009), SOAP2  and CUSHAW (Liu et al., 2012) implement the concept of prefix/suffix tries using the Burrows-Wheeler transform (BWT) (Burrows and Wheeler, 1994) and the FM-index (Ferragina and Manzini, 2005).
With the progress of NGS technologies, the length of produced reads continues to increase. Unfortunately, many existing short read aligners are becoming inefficient as generated reads grow to a few hundred bp in length because of two reasons. First, they typically perform only ungapped alignments or gapped alignments allowing a very limited number of gaps (typically one gap). Second, their speed degrades rapidly as the number of gaps increases. However, for long read alignment, more gaps must be allowed as indels will * To whom correspondence should be addressed. occur more frequently. These new features of long read alignment thus motivate the design of new aligners with fast speed and high quality.
In this article, we devise a new long read aligner based on the well-known seed-and-extend heuristic. This heuristic is based on the observation that significant alignments are likely to include homologous regions, containing exact or inexact short matches between the two sequences. It generally works in three steps. First, seeds, represented as short matches indicating highly similar regions, are generated between the query and the target sequences. Secondly, these seeds are extended and refined under certain constraints, such as minimal percentage identity and extension length, to filter out noisy seeds. Finally, more sophisticated algorithms, such as the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970) or the Smith-Waterman (SW) algorithm (Smith and Waterman, 1991), are employed to obtain the final alignments. Several types of seeds have been proposed, including fixed-length seeds, maximal exact matches (MEMs), maximal unique matches (MUMs), and adaptive seeds (Kielbasa et al., 2011). Fixed-length seeds (k-mers) are the most widely used seed type. The simplest fixed-length seed is the exact k-mer match. Some improvements have been suggested by allowing mismatches and gaps in the k-mers, including spaced seeds (Ma et al., 2002), and q-gram (a substring of q bases) filters (Rasmussen et al., 2006). MEMs are exact matches that cannot be extended in either direction without allowing a mismatch. MUMs are inherently MEMs but require uniqueness in addition. An adaptive seed has a variable seed length, and also has a limitation on the number of seed occurrences in the target.
Recently, several long read aligners have been developed based on the seed-and-extend approach, including BWA-SW (Li and Durbin, 2010), Bowtie2 (Langmead and Salzberg, 2012), and GASSST (Rizk and Lavenier, 2010). BWA-SW, inspired by BWT-SW (Lam et al., 2008), identifies long gapped seeds by employing a prefix directed acyclic word graph (implicitly represented by an FM-index) to perform dynamic programming (DP). Subsequently, it heuristically extends and refines the long gapped seeds to produce the final alignments. Bowtie2 extracts all mismatch-allowable fixed-length seeds from a read using the BWT and then employs DP to identity alignments. GASSST employs hash tables to find fixed-length seeds and employs multiple filters to remove noisy seeds, prior to the final DP-based alignment. This approach is effective to significantly reduce the number of noisy seeds, but also has the risk of discarding correct ones.
In this article, we present a new long read aligner using MEMs as seeds. MEMs have been used for whole genome alignment ( Bray et al., 2003;Choi et al., 2005;Delcher et al., 1999;Höhl et al., 2002). However, to the best of our knowledge, MEMs have not been used for NGS read alignment. Our aligner employs memoryefficient versions of the BWT and FM-index data structures to generate MEM seeds for each read. Each seed defines a potential mapping read region on the genome. We then compute the optimal local alignment score between the read and each potential mapping region and select the highest-scoring mapping region to produce the final alignment. In addition, our aligner provides support for paired-end (PE) alignment. For the PE alignment, a new seed-pairing approach is introduced with the intention to quickly determine the potential mapping regions of a PE read pair without performing alignments. Furthermore, we employ vectorization and multithreading to achieve fast execution speed on standard multi-core CPUs. The performance of our aligner is assessed and compared with BWA-SW, Bowtie2 and GASSST, by aligning simulated and real datasets to the human genome. The experimental results show that our aligner achieves favorable alignment quality, highly competitive speed and good parallel scalability with respect to the number of threads. This new aligner has been integrated into our software package CUSHAW. The first version of CUSHAW was designed for short read alignment ( 128-bp reads) using GPU computing. It uses mismatch-allowable fixed-length seeds and does not provide support for gapped alignments. We name the aligner presented in this article, CUSHAW2, to indicate the extended functionality.

METHODS
CUSHAW2 employs MEMs as seeds to compute a gapped alignment for a long read to a given reference genome. For the single-end (SE) alignment, CUSHAW2 works in three stages: (i) generate MEM seeds, (ii) select the best mapping regions on the genome and (iii) produce and report the final alignments. For the PE alignment, we introduce two additional stages before producing the final alignments: one is the seed pairing stage and the other is the read rescuing stage. Figure 1 illustrates the pipelines of our aligner for both the SE and the PE alignment.

Essentials for maximal exact match identification
2.1.1 Definitions and notations Given a sequence S, define |S| to denote the length of S, S[i] to denote the character at position i and S [i,j] to denote the substring of S starting at position i and ending at position j, for 0 i < |S| and 0 j < |S|. We represent an exact match between two sequences S 1 and S 2 as a triplet (p, q, k), where k is the length of the exact match and the substring S 1 [p, p+k− 1] is identical to the substring S 2 [q, q+k− 1]. An exact match is called right maximal if p+ 1 = |S 1 | or q+ 1 = |S 2 | or S 1 [p+k] = S 2 [q+k], and left maximal if p= 0 or q= 0 or S 1 [p− 1] = S 2 [q− 1]. An exact match is called a MEM if it is both left maximal and right maximal.
Given a sequence T , defined over the alphabet = {A, C, G, T}, the suffix array SA of T stores the starting positions of all suffixes of T in lexicographical order. In other words, SA[i] = j means that the ith lexicographically smallest suffix (among all suffixes of T ) starts at position j in T . The SA of T has an overall memory footprint of |T | log 2 |T | bits (∼12 GB for the human genome). Given a substring S of T , we can find all its occurrences within an SA interval. An SA interval is an index range [I a , I b ], where I a and I b represent the indices in SA of the lexicographically smallest and largest suffixes of T with S as a prefix.

BWT and FM-index
The BWT of T starts from the construction of a conceptual matrix M T , whose rows are all cyclic rotations of a new sequence T $ sorted in lexicographical order. T $ is formed by appending the special character $ to the end of T that is lexicographically smaller than any character in . After getting M T , the last column of the matrix is taken to form the transformed text B T , i.e. the BWT of T . B T is a permutation of T and thus occupies the same memory size of |T | log 2 | | bits as T . M T has a property called 'last-to-first column mapping', which means that the ith occurrence of a character in the last column corresponds to the ith occurrence of the same character in the first column.
The FM-index consists of a vector C(•) and an occurrence array Occ(•), constructed from B T , to accomplish substring search. C(•) contains | | elements with each element C(a) representing the number of characters in T that are lexicographically smaller than a ∈ . Occ(•) is an array of size |T |×| | with each element Occ(a, i) representing the number of occurrences of a ∈ in B T [0,i]. In terms of memory overhead, C(•) requires only| | log 2 | | bits but Occ(•) requires| ||T | log 2 |T | bits.
Given a substring S of T , the SA interval of all its occurrences can be computed in O(|S|) time using a backward search procedure. Based on C(•) and Occ(•), the SA interval [I a , I b ] can be recursively calculated, from the rightmost to the leftmost suffixes of S, as: where I a (i) and I b (i) represent the starting and end indices of the SA interval for the suffix of S starting at position i, and I a (|S|) and I b (|S|) are initialized as 0 and |T |, respectively. The calculation stops if it encounters I a (i+ 1)> I b (i+ 1). The condition I a (i) I b (i) holds if and only if the suffix of S starting at position i is a substring of T . The total number of the occurrences is calculated as I a (0) -I b (0) + 1 if I a (0) I b (0), and 0, otherwise. After getting the SA interval, the location of each occurrence can be determined by directly looking up SA with a constant time complexity. Hence, the time complexity for finding n occurrences of S isO(|S|+n). (1), the substring search using the FM-index does not require B T , meaning that the overall memory footprint is the memory sum of the FM-index and SA. This memory overhead (∼60 GB for the human genome) can be reduced by an order of magnitude by taking advantage of some features of the BWT at the cost of a slightly higher substring search time complexity as follows.

Reducing memory overhead From Equation
For the FM-index, Occ(•) dominates the overall memory overhead. An approach to trade-off speed and memory space is to use a reduced FMindex (detailed in the Supplementary Material) with a time complexity of O(u·|S|) for substring search, which is able to reduce the memory size to | | log 2 | | +|T |(| | log 2 |T | /u+ log 2 | | ) bits (u = 128 by default and thus ∼1.1 GB for the human genome). For SA, by employing the 'lastto-first column mapping' property of BWT, we can reduce the memory size to |T | log 2 |T | /v bits (v = 8 by default and thus ∼1.5 GB for the human genome) through the use of a reduced suffix array (detailed in the i319 Supplementary Material) with an approximate time complexity of O(n·v)for locating n occurrences of S. Now, we have arrived at a significantly smaller memory footprint of(| |+|T |) log 2 | | +|T |(| |/u+1/v) log 2 |T | bits (e.g. ∼2.6 GB for the human genome). Furthermore, the increased time complexityO(u·|S|+ v ·n) for finding n occurrences of S is still acceptable.

Estimation of the minimal seed size
We are only interested in the MEM seeds whose lengths are not less than a minimal seed size Q. Decreasing Q generally increases the sensitivity by finding more hits in homologous regions, at the cost of producing more noisy hits. Increasing Q generally decreases the number of hits at the cost of decreased sensitivity. Many seed-based aligners, therefore, require users to carefully tune Q. However, this tuning work is tedious. To address this issue, we propose an automatic estimation of Q according to a given read length.
Our estimation of Q is based on the q-gram lemma (Rasmussen et al., 2006) and a simplified error model. The q-gram lemma states that two aligned sequences S 1 and S 2 with an edit distance of e (the number of errors) share at least t q-grams where t = max(|S 1 |,|S 2 |)−q+1−q·e. This means that for overlapping q-grams, one error may cause up to q·eq-grams not to be shared by the two reads, and for non-overlapping q-grams, one error can destroy only one q-gram (Blom et al., 2011). Hence, given the edit distance e of S aligned to the genome, Q is estimated as: where Q L and Q H are the global lower-bound and upper-bound, respectively. The estimation is based on the pigeonhole principle for non-overlapping qgrams, meaning that at least one q-gram of length Q is shared by S and its aligned substring mate on the genome. By default, our aligner sets Q L = 13 and Q H = 49. Since the error model for gapped alignments is quite complicated, we employ a simplified error model for ungapped alignments to estimate e. Supposing that the number of substitutions w in the full-length alignment of S is a random variable and each base in S has the same error probability p (default = 2%), the probability of having z substitutions is calculated as: where w follows a binomial distribution. By specifying a missing probability m (default = 4%), e can be estimated as min{z|P(w > z) < m}. Our simplified error model results in the following values: Q = 16 for 100-bp reads, Q = 22 for 200-bp reads and Q = 35 for 500-bp reads. In addition, we also provide parameters to allow users to customize Q.

Generation of maximal exact matches
To identify MEMs between S and T , we advance the starting position p in S, from left to right, to find the longest exact matches (LEMs) using the BWT and the FM-index. According to the above definitions, we know that the identified LEMs are right maximal. We know that the LEMs starting at the beginning of S are both left maximal and right maximal. This means that when advancing the starting positions from the beginning to the end of S, the identified LEMs are also left maximal if it is not part of any previously identified MEM. In this way, only unidirectional substring search is required.
Since we are only concerned about MEMs of sufficient lengths, we discard the MEMs whose lengths are less than Q. For large genomes, it is possible to find a lot of occurrences of a MEM starting at a certain position of S. In this case, we only keep its first h (h = 1024 by default) occurrences and discard the others. However, it is also observed that we sometimes fail to find any MEM seeds for some reads using Q. To improve sensitivity, we therefore attempt to rescue them by re-conducting the MEM identification procedure using a new and smaller minimal seed size Q N = (Q +Q L )/2.

Determination and selection of mapping regions
For local alignment with affine gap penalty, the positive score for a match is usually smaller than the penalty charged for a substitution or for a gap. Using such type of scoring schemes, the length of the optimal local alignment of S to the genome cannot be >2|S| as a local alignment requires a positive alignment score. This conclusion forms the foundation of our genome mapping region determination approach for each identified MEM seed. In our aligner, we employ a commonly used scoring scheme [e.g. also used in BLAST (Altschul et al., 1990) and BWA-SW] with the score 1 for a match, a penalty of 3 for a substitution, a penalty of 5 for a gap opening and a penalty 2 for a gap extension.
For a read, a MEM indicates a mapping region on the genome, which includes the seed and potentially contains the correct alignment of the full read. We can determine the range of the mapping region by extending the MEM in both directions by a certain number of bases. Since the optimal local alignment length of S cannot be >2|S| in our aligner, it is safe to determine the mapping region range by extending the MEM by 2|S| bases in each direction. This extension does work, but will result in lower speed due to the introduced redundancy. Hence, we attempt to compute a smaller mapping region with as little loss of sensitivity as possible.
We define P mem to denote the starting position of a MEM in S, T mem to denote the mapping position of the MEM on the genome and L mem to denote the MEM length. Assuming that the MEM is included in the final alignment, our aligner estimates the mapping region range [T a , T b ] as: Our aligner computes the optimal local alignment scores in all determined mapping regions of S using the SW algorithm, and then builds a sorted list of all mapping regions in the descending order of score. Mapping regions whose scores are less than a minimal score threshold (default = 30) are removed from the sorted list. Subsequently, the sorted list of qualified mapping regions is used in the SE and the PE alignment (e.g. determining the final alignments and approximating the mapping quality scores).

Paired-end mapping
The alignment of a paired read pair generally has two constraints: alignment strand and mapping distance. For the alignment strand constraint, our aligner requires the two reads to be aligned to the genome from different strands. For the mapping distance constraint, our aligner requires that the mapping distance of the two reads cannot exceed a maximal mapping distance threshold defined by the insert-size information of a library. Assuming that the mean insert-size is X and the standard deviation of the insert-size is σ , we calculate the maximal mapping distance threshold as X +4σ . For the PE mapping, our aligner employs two stages: (i) pairing qualified mapping regions in order to find the correct alignments for both ends and (ii) rescuing un-aligned reads through their aligned read mates. For any aligned read pair, we can first compare their mapping distance on the genome (calculated from the positions of the best alignments of the two reads) to the insert-size constraint. If this comparison is within the mapping distance threshold, the corresponding alignment is output. Otherwise, we could calculate the mapping distance for each mapping position pair from all qualified mapping regions in the sorted list. However, the associated computational overhead cannot be tolerated since we need to obtain the alignment paths for all qualified mapping regions of a read pair. Hence, we introduce a seed-pairing approach to heuristically accelerate the read pairing.
The seed-pairing heuristic works by enumerating each seed pair of S 1 and S 2 in order to find all potential seed pairs. If the seed pair has different alignment strands and locates on the same genome fragment, it will be used to estimate the mapping distance of S 1 and S 2 , and otherwise will be discarded. In our aligner, the mapping position T s of S is estimated from one of its MEMs as:

Long read alignment based on maximal exact match seeds
where we assume that S is aligned to the genome without gaps. To compensate for the difference between the estimated mapping distance and the correct one, we employ a larger maximal insert-size threshold X +4σ +2e for the seed-pairing heuristic. If the estimated mapping distance does not exceed the maximal insert-size threshold, this seed pair is considered qualified and will be saved for future use. After finding all qualified seed pairs, we enumerate each qualified seed pair to compute the real mapping distance of S 1 and S 2 , which is compared with the maximal insert-size threshold X +4σ . If the insert-size constraint is met, S 1 and S 2 are reported as paired.
Otherwise, we will compute the best alignment for S 1 (or S 2 ) to rescue its mate by employing the insert-size information to determine the potential mapping region of its mate. This rescuing procedure is also applied when only one read of S 1 and S 2 is aligned.

Approximation of mapping quality scores
Since the introduction of mapping quality scores in MAQ  to indicate the probability of the correctness of alignments, the concept of mapping quality scores has been frequently used in many NGS read aligners. Generally, a higher mapping quality score indicates a higher confidence in the correctness of an alignment. As stated in BWA-SW, if an aligner guarantees to find all local alignments of a read, the mapping quality score M q is determined by these local alignments only. Although our aligner does not find all local alignments of the read, the sorted list of qualified mapping regions still provides sufficient information to approximate M q . In our aligner, we employ two equations to approximate M q for the SE and the PE alignment. For the SE alignment, M q is approximated as 250(b 1 −b 2 )/b 1 × r, similar to the mapping quality approximation in BWA-SW, where b 1 is the best local alignment score, b 2 is the second best local alignment score, and r is calculated by dividing the number of bases of the read in the final alignment by the read length. For the PE alignment, the calculation of M q depends on two conditions. If the two reads are correctly paired through the seed-pairing heuristic, the mapping quality score for each read is equal to its SE M q . Otherwise, if one read is rescued by its mate, the mapping quality score of the read is approximated as r ×M mq , where M mq is the SE M q of its mate.

Parallel design
In CUSHAW2, the most time-consuming part of SE alignment is the selection of the best mapping region using the SW algorithm. To accelerate its execution, we have adopted the (SSE2) Streaming SIMD Extensions 2-based parallel implementation of the SW algorithm in SWIPE (Rognes, 2011).
In addition, as multi-core CPUs have become commonplace, our aligner employs a multi-threaded design using Pthreads to parallelize the alignment process. We use a dynamic scheduling policy to assign reads to threads, which allows one thread to immediately start a new alignment without waiting for the completion of the other threads. For SE alignment, a thread aligns a single read at a time and then reads a new read from the input file immediately after finishing the current alignment. For PE alignment, we follow the same scheduling policy with the difference that one read pair is assigned at a time. Locks are appropriately used to ensure mutually exclusive accesses to both the input and output files.

RESULTS
The performance of CUSHAW2 is compared with three other long read aligners: BWA-SW (v0.6.1), Bowtie2 (v2.0.0-beta5) and GASSST (v1.28). BWA-SW employs the default settings and Bowtie2 also employs the default settings, except for the insert size related parameters for the PE alignment. GASSST uses a minimal percentage identity of 90% and default settings for other parameters. CUSHAW2 requires the final alignment to have a percentage identity of 90% (default setting) and to include 80% (default setting) bases of the read. All the tests are conducted on a workstation with two six-core Intel Xeon X5650 2.67GHz CPUs and 96 GB RAM, running the Linux operating system. The runtime of each aligner is measured in wall clock time for all tests, where the one-time construction time of the BWT and the FM-index is not counted in for CUSHAW2, BWA-SW and Bowtie2. We use the recall and precision measures to assess all aligners using simulated datasets, where recall (precision) is calculated by dividing the number of reads that are correctly aligned by the total number of reads (the number of aligned read). If not explicitly specified, a read is deemed to be correctly aligned if the mapping position has a distance of 5 to the true position. For real datasets, we use the sensitivity measure, which is calculated by dividing the number of aligned reads by the total number of reads.
GASSST does not evaluate every seed to determine the best alignment for a single read. Instead, it continues reporting identified alignments until reaching the maximal limit of the number of alignments. Thus, we consider the best of all reported alignments as the final alignment of the read and discard the others. In addition, GASSST does not provide the support for PE alignment and thus is only evaluated for SE alignment. BWA-SW might report more than one alignment for a single read (in rare cases for not very long reads), where one alignment corresponds to one fragment of the read. Since these fragment alignments are difficult to be distinguished and ranked, we take all of them into consideration.

Evaluation on simulated datasets
We have first evaluated all aligners using nine simulated 100-bp, 200-bp and 500-bp datasets with different uniform base error rates (i.e. 1%, 2% and 4%). These datasets are simulated from the human genome using the wgsim utility in SAMtools v0.1.17  with 10% errors being indel errors. Each dataset is comprised of 2 million PE reads, and the insert-sizes are drawn from normal distributions N (500, 50), N (1000, 50) and N (2000, 50) for the 100-bp, 200-bp and 500-bp datasets, respectively. Table 1 shows the alignment results of all aligners for the 200bp datasets, whereas the alignment results for the other datasets can be obtained from the Supplementary Material. CUSHAW2 yields the highest recall and precision for both the SE and the PE alignment for all datasets (with an exception that for the 100-bp dataset with 1% error rate, Bowtie2 has a slightly better precision than CUSHAW2 by ∼0.04% for the PE alignment). Furthermore, i321 To evaluate alignments with high mapping quality scores, we have taken into account the alignments whose mapping quality scores are 30 (Q30). Moreover, an aligned read is deemed to be correctly aligned only if the mapping position is identical to the true position of the read. As for the precision, both CUSHAW2 and BWA-SW can hold their precision for each dataset, whereas Bowtie2 has a minimal decrease for some datasets.
Finally, we have evaluated the impact of the amount of indel errors on each aligner. In this evaluation, we have re-simulated four 200-bp datasets from the human genome containing 2 million PE reads each. All the four datasets have the same uniform base error rate of 2%, but have different percentages of indel errors (i.e. 20%, 40%, 60% and 80%). Table 3 shows the alignment results. For the

Evaluation on real datasets
We have assessed all aligners using four datasets produced by 454 and Illumina sequencers, respectively. All the datasets are publicly available from NCBI SRA and named after their accession numbers (see Table 4). We have used two runs (SRR000026 and SRR000027) of the SRX000001 experiment, two runs (SRR006428 and SRR006433) of SRX001829, two runs (ERR024139 and ERR024140) of ERX009608 and the single run (SRR189815) of SRX028059. For all 454 datasets, we have removed all reads shorter than 100-bp and only conducted the SE alignment. For all Illumina datasets, we have performed both the SE and the PE alignment. Figures 2 and 3 show the alignment results for the 454 and Illumina datasets, respectively. In this evaluation, the alignment of a read is taken into account only if it has a percentage identity of 90% and contains 80% bases of the read. This constraint is also in accordance with our intention for long read alignments, i.e. attempting to align a read in the full length to the genome. We have excluded GASSST from the larger Illumina datasets due to its very slow speed.
i322 Long read alignment based on maximal exact match seeds  For each 454 dataset, CUSHAW2 has the highest sensitivity. BWA-SW is second and GASSST is worst. For the Illumina datasets, Bowtie2 aligned the most reads and CUSHAW2 the second most. For the SE (PE) alignment, the average sensitivity is ∼94.93% (95.28%) for CUSHAW2, 93.99% (94.95%) for BWA-SW and 95.45% (95.81%) for Bowtie2.

Speed and scalability evaluation
We have assessed the speed of each aligner using 12 threads on our workstation (see Table 5), where we organized all simulated datasets into three groups, namely D100, D200 and D500, as per the read lengths and averaged the runtimes of all datasets in each group.
For the SE alignment, GASSST is the slowest for each data group and is almost two orders of magnitude slower than the other three aligners for D100, D200 and SRX000001. For the datasets with smaller mean read lengths of 200-bp, Bowtie2 is the fastest and CUSHAW2 outruns BWA-SW. However, for the datasets with greater mean read lengths of around 500-bp, Bowtie2 becomes slower than both CUSHAW2 and BWA-SW and BWA-SW outruns CUSHAW2. This suggests that Bowtie2 might have been optimized for reads of lengths around 200-bp, but does not scale well towards longer reads. The PE alignment comparison between CUSHAW2, BWA-SW and Bowtie2 shows a similar trend.
From the runtimes of all simulated data groups, it is observed that the runtime for CUSHAW2, BWA-SW and Bowtie2 significantly increases as the read lengths grow from 100 to 500. This can Table 5. Runtime comparison (in seconds) of the tested aligners Fig. 4. Scalability comparison between all aligners for the SE and PE alignment with the use of SSE2 assembler instructions. In addition, multithreading is supported in order to benefit from the coarse-grained parallelism on multiple CPU cores.
We have assessed the performance of CUSHAW2 and the three other long read aligners: BWA-SW, Bowtie2 and GASSST using simulated as well as real datasets. For the simulated reads, we have computed the recall and precision measures since we know the true position of each read on the genome. For the real datasets, we have employed the sensitivity measure. Using the above measures, CUSHAW2 is shown to be among the highest-ranked aligners in terms of alignment quality for both the SE and the PE alignment for a variety of error rates and varying amount of indel errors. Our aligner achieves good parallel scalability with respect to the number of threads, while demonstrating highly competitive overall execution speed. Furthermore, through the use of memory efficient data structures, CUSHAW2 only requires a memory footprint of ∼4 GB (using 12 threads) for performing alignments to the human genome. This approach makes it possible to accurately align hundreds of millions of long reads to a mammalian-sized genome in only a few hours on a standard multi-core workstation with only a modest amount of RAM installed. Since throughput and read-length of NGS machines continues to grow, these results are of high importance to the bioinformatics community.
Funding: We acknowledge funding by the Center for Computational Science, Mainz.