## Abstract

Motivation: Sequence assembly is a difficult problem whose importance has grown again recently as the cost of sequencing has dramatically dropped. Most new sequence assembly software has started by building a de Bruijn graph, avoiding the previously used overlap-based methods because of their computational cost and complexity with very large numbers of short reads. Here, we show how to use the suffix array-based methods that underlie recent very fast sequence mapping algorithms to find overlaps and generate assembly string graphs asymptotically faster than previously described algorithms.

Results: Standard overlap assembly methods have time complexity O(N²), where N is the sum of the lengths of the reads. We use the Ferragina–Manzini index (FM-index) derived from the Burrows–Wheeler transform to find overlaps of length at least τ among a set of reads. As well as an approach that finds all overlaps and then applies transitive reduction to produce a string graph, we show how to output directly only the irreducible overlaps, significantly shrinking memory requirements and reducing compute time to O(N), independent of depth. Overlap-based assembly methods naturally handle mixed-length read sets, including capillary reads or the long reads promised by third-generation sequencing technologies. The algorithms we present here pave the way for overlap-based assembly approaches that scale to de novo assembly of whole vertebrate genomes.

Contact: js18@sanger.ac.uk

## 2 BACKGROUND

Restated, BX[i] is the symbol preceding the first symbol of the suffix starting at position SAX[i]. Ferragina and Manzini (2000) extended the BWT representation of a string by adding two additional data structures to create a structure known as the FM-index. Let CX(a) be the number of symbols in X that are lexicographically lower than the symbol a, and OccX(a, i) be the number of occurrences of the symbol a in BX[1, i]. We note that CX and OccX include counts for the sentinel symbol, $. Using these two arrays, Ferragina and Manzini provided an algorithm to search for a string Q in X (Ferragina and Manzini, 2000). Let S be a string whose suffix array interval is known to be [l, u]. The interval [l′, u′] for the string aS can be calculated from [l, u] using CX and OccX by the following:

l′ = CX(a) + OccX(a, l − 1) + 1 (1)

u′ = CX(a) + OccX(a, u) (2)

We encapsulate Equations (1) and (2) in the following algorithm, updateBackward. To search for a string Q, we first calculate the interval for the last symbol in Q, then use Equations (1) and (2) to iteratively calculate the interval for the remainder of Q. The interval for a single symbol is calculated directly from CX. The backwardsSearch algorithm presents the searching procedure in detail. If backwardsSearch returns an interval where l > u, Q is not contained in X; otherwise, SAX[i] is the position in X of an occurrence of Q for each i with l ≤ i ≤ u. The backwardsSearch algorithm requires updating the suffix array interval |Q| times. As each update is a constant-time operation, the complexity of backwardsSearch is O(|Q|). To save memory, OccX(a, i) is stored only for i divisible by d (typically d is around 128). The remaining values of OccX can be calculated as needed using the sampled values and BX.

### 2.5 The generalized suffix array

We can easily extend the definition of a suffix array to multiple strings. Let 𝒯 be an indexed set of strings and 𝒯i be element 𝒯[i]. We define SA𝒯[i] = (j, k) iff 𝒯j[k, |𝒯j|] is the i-th lexicographically lowest suffix in 𝒯.
In the generalized suffix array, unlike the suffix array of a single string, two suffixes can be lexicographically equal. We break ties in this case by comparing the indices of the strings. In other words, we treat each string in 𝒯 as if it were terminated by a unique sentinel character $i, where $i < $j when i < j. We extend the definition of the BWT to collections of strings as follows. Let SA𝒯[i] = (j, k); then

B𝒯[i] = 𝒯j[k − 1] if k > 1, and B𝒯[i] = $ otherwise.
Like the BWT of a single string, B𝒯 is a permutation of the symbols in 𝒯; therefore, the definitions of the auxiliary data structures for the FM-index, C𝒯(a) and Occ𝒯(a, i), do not change.
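To make the backward search concrete, here is a minimal Python sketch under simplifying assumptions: the BWT is built by naive rotation sorting, occ scans the BWT directly (standing in for the sampled OccX array), and intervals are 0-based and inclusive rather than the 1-based convention used in the text. The names bwt_of, make_index and backwards_search are illustrative, not from the paper.

```python
def bwt_of(text):
    """BWT of text + sentinel '$' via naive rotation sorting (illustrative)."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def make_index(bwt):
    """Return C (count of lexicographically lower symbols) and a naive occ.
    occ(a, i) counts a's among the first i BWT symbols, matching the text's
    1-based Occ(a, i)."""
    C, total = {}, 0
    for a in sorted(set(bwt)):
        C[a] = total
        total += bwt.count(a)
    def occ(a, i):
        return bwt[:i].count(a)
    return C, occ

def backwards_search(query, bwt, C, occ):
    """Suffix array interval [l, u] (0-based, inclusive) of query, or None."""
    if any(a not in C for a in query):
        return None
    a = query[-1]
    l, u = C[a], C[a] + occ(a, len(bwt)) - 1   # interval of the last symbol
    for a in reversed(query[:-1]):             # Equations (1) and (2)
        l = C[a] + occ(a, l)
        u = C[a] + occ(a, u + 1) - 1
        if l > u:
            return None                        # query does not occur
    return (l, u)
```

For "banana" the BWT is "annb$aa", and searching for "ana" yields the interval (2, 3), matching its two occurrences at text positions 1 and 3.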

## 3 METHODS

The construction of the string graph occurs in two stages. First, the complete set of overlaps of length at least τ is computed for all elements of ℛ. The initial overlap graph is then built as described in Section 2.3 and transformed into the string graph using the linear expected time transitive reduction algorithm of Myers (2005). The first step in this process is the computational bottleneck. The all-pairs maximal overlap problem can be optimally solved in O(N + k²) time using a generalized suffix tree, where N = ∑i |ℛi| and k = |ℛ| (Gusfield, 1997). It is straightforward to restrict this algorithm to only find overlaps of length at least τ at a lower computational cost; however, the amount of memory required for a suffix tree makes this algorithm impractical for large datasets. Myers proposed the use of a q-gram filter to find the complete set of overlaps; this requires O(N²/D) time, where D is a time-space tradeoff factor dependent on the amount of memory available. We will show that by using the FM-index of ℛ the set of overlaps can be computed in O(N + C) time for error-free reads, where C is the total number of overlaps found. We then provide an algorithm that detects only the overlaps for irreducible edges, removing the need for the transitive reduction algorithm and allowing the direct construction of the string graph.

### 3.1 Building an FM-index from a set of sequence reads

To build the FM-index of ℛ, we must first compute the generalized suffix array of ℛ. We could do this by creating a string that is the concatenation of all members of ℛ, S = ℛ1ℛ2…ℛm, and then using one of the well-known efficient suffix array construction algorithms to compute SAS (Puglisi et al., 2007). We have adopted a different strategy and have modified the induced-copying suffix array construction algorithm (Nong et al., 2009) to handle an indexed set of strings ℛ, where each suffix array entry is a pair (j, k) as described in Section 2.5. This suffix array construction algorithm is similar to the Ko–Aluru algorithm (Ko and Aluru, 2005): a set of substrings of the text (termed LMS substrings, for leftmost S-type; see Nong et al., 2009) is sorted, from which the ordering of all the suffixes in the text is induced. Our algorithm differs from the Nong–Zhang–Chan algorithm in that we directly sort the LMS substrings using multikey quicksort (Bentley and Sedgewick, 1997) instead of sorting them recursively. This method of construction is very fast in practice, as typically only 30–40% of the substrings must be directly sorted. Once SA has been constructed, the BWT of ℛ, and hence the FM-index, is easily computed as described above. We also compute the FM-index for the set of reversed reads, denoted ℛ′, which is necessary to compute overlaps between reverse complemented reads. We also output the lexicographic index of ℛ, which is a permutation of the indices {1, 2,…, |ℛ|} of ℛ sorted by the lexicographic order of the strings. This can be found directly from SA and is used to determine the identities of the reads in ℛ from suffix array interval positions once an overlap has been found.
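As a sanity check of the definitions in Section 2.5, the generalized suffix array and lexicographic index can be computed naively by sorting (j, k) suffix pairs. This sketch ignores efficiency entirely (the induced-copying construction above is the practical method), and it includes the empty suffix of each read to stand for the sentinel $i; the function names are illustrative.

```python
def generalized_sa(reads):
    """Entries (j, k): read index j, suffix start k (0-based), sorted by the
    suffix string with ties broken by read index -- the unique-sentinel rule
    of Section 2.5. The empty suffix stands for the sentinel $_j."""
    suffixes = []
    for j, r in enumerate(reads):
        for k in range(len(r) + 1):
            suffixes.append((r[k:], j, k))
    suffixes.sort(key=lambda t: (t[0], t[1]))
    return [(j, k) for _, j, k in suffixes]

def lexicographic_index(reads):
    """Permutation of read indices sorted by read string; the sort is stable,
    so equal reads keep index order, matching $_i < $_j for i < j."""
    return sorted(range(len(reads)), key=lambda j: reads[j])
```

For reads ["ab", "aa", "b"], the lexicographic index is [1, 0, 2]: "aa" is lowest, then "ab", then "b".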

### 3.2 Overlap detection using the FM-index

We now consider the problem of constructing the set of overlaps between reads in ℛ. Consider two reads X and Y. If a suffix of X matches a prefix of Y, an edge of type (E, B) will be created in the initial overlap graph. We will describe a procedure to detect overlaps of this type from the FM-index of ℛ. Let X be an arbitrary read in ℛ. If we perform the backwardsSearch procedure on the string X, after k steps we have calculated the interval [l, u] for the suffix of length k of X. The reads indicated by the suffix array entries in [l, u] therefore have a substring that matches a suffix of X. Our task is to determine which of these substrings are prefixes of the reads. Recall that if a given element in the suffix array, SA[i], is a prefix, then B[i] = $ by definition. Therefore, if we know the suffix array interval for a string P, the interval for the strings beginning with P can be determined by calculating the interval for the string $P using Equations (1) and (2). This interval, denoted [l$, u$], indicates that the reads with prefix P are the l$-th to u$-th lexicographically lowest strings in ℛ. We can therefore recover the indices in ℛ of the reads overlapping X using the lexicographic index of ℛ. The algorithm is presented below in findOverlaps.

The findOverlaps algorithm is similar to the backwards search procedure presented in Section 2.4. It begins by initializing [l, u] to the interval containing all suffixes that begin with the last symbol of X. The interval [l, u] is then iteratively updated for longer suffixes of X. When the length of the suffix is at least the minimum overlap size, τ, we determine the interval for the reads that have a prefix matching the suffix of X and output an overlap record for each entry (using the subroutine outputOverlaps). When the update loop terminates, [l, u] holds the interval corresponding to the full length of X. The outputContained procedure writes a containment record for X if X is contained by any read in [l, u], based on the rules described in Section 2.3. The overlaps detected by findOverlaps correspond to edges of type (E, B). We must also calculate the overlaps for edges of type (E, E) and (B, B), which arise from overlapping reads originating from opposite strands. To calculate edges of type (E, E), we use findOverlaps on the complement of X (not reversed) and the FM-index of ℛ′. Similarly, to calculate edges of type (B, B), we use findOverlaps on the reverse complement of X and the FM-index of ℛ.
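The semantics of findOverlaps for same-strand (E, B) edges can be pinned down with a naive reference implementation. This sketch is quadratic and purely illustrative: the FM-index version above replaces the inner scan with a single $-extension of the suffix array interval, and containments are handled separately. The function name find_overlaps and its (j, olen) record format are assumptions of this sketch.

```python
def find_overlaps(reads, xi, tau):
    """Return (j, olen) for every read whose prefix matches a suffix of
    reads[xi] with overlap length olen >= tau. Proper overlaps only: the
    other read must extend past the match, and olen < |X|, since otherwise
    one read would be contained in the other."""
    X = reads[xi]
    out = []
    for olen in range(tau, len(X)):
        suffix = X[len(X) - olen:]
        for j, Y in enumerate(reads):
            if j != xi and len(Y) > olen and Y.startswith(suffix):
                out.append((j, olen))
    return out
```

For reads = ["GATTACA", "TACAGG", "ACAGGT"] and τ = 3, read 0 overlaps read 2 by 3 bases ("ACA") and read 1 by 4 bases ("TACA").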

The overlap records created by outputOverlaps are constructed in constant time, as they only require a lookup in the lexicographic index of ℛ. Let ci be the number of overlaps for read ℛi. The findOverlaps algorithm makes at most |ℛi| calls to updateBackward and a total of ci iterations in outputOverlaps, for a total complexity of O(|ℛi| + ci). For the entire set ℛ, the complexity is O(N + C), where C = ∑i ci. Note that the majority of these edges are transitive and subsequently removed. We can therefore improve this algorithm by outputting only the set of irreducible edges, allowing the direct construction of the string graph. We address this in Section 3.3.

In rare cases, multiple valid overlaps may occur between a pair of reads. When this happens, the interval set detected during findOverlaps will contain intersecting or duplicated intervals. To handle this, we can modify findOverlaps to first collect the entire set of found intervals. This interval set can then be sorted, and duplicated or intersecting intervals that represent sub-maximal overlaps can be removed. The outputOverlaps procedure is then called on the reduced interval set to output the set of maximal overlaps.

### 3.3 Detecting irreducible overlaps

To directly construct the string graph, we must only output irreducible edges. Recall from Section 2.3 that the labels of the irreducible edges for a given read are prefixes of the labels of transitive edges. We use this fact to differentiate between irreducible and transitive edges during the overlap computation. Consider a read X and the set of reads that overlap a suffix of X, 𝒪. We could devise an algorithm to find the subset consisting only of irreducible edges by calculating the edge-labels of all members of 𝒪 and filtering out the members whose label is the extension of the label of some other read. This would require iterating over all members of 𝒪, which can be quite large for repetitive reads. We will now show that the labels of the irreducible edges can be constructed directly from the suffix array intervals using the FM-index.
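The filtering idea just described, rejecting any overlap whose edge label extends the label of another overlap, can be sketched at the string level. This is the naive O(|𝒪|²) approach that the interval-based algorithm below avoids; the function name and the (j, olen) record format are assumptions of this sketch.

```python
def irreducible_overlaps(reads, overlaps):
    """Filter (j, olen) overlap records for some read X down to irreducible
    edges. The edge label of the overlap with reads[j] is the unmatched tail
    reads[j][olen:]; an edge is transitive when another overlapping read's
    label is a proper prefix of its label (Section 2.3)."""
    labels = {j: reads[j][olen:] for j, olen in overlaps}
    out = []
    for j, olen in overlaps:
        transitive = any(
            k != j
            and len(labels[k]) < len(labels[j])
            and labels[j].startswith(labels[k])
            for k, _ in overlaps)
        if not transitive:
            out.append((j, olen))
    return out
```

With X = "GATTACA" overlapping "TACAGG" (label "GG") and "ACAGGT" (label "GGT"), the edge to "ACAGGT" is transitive because "GG" is a proper prefix of "GGT".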

Consider a substring S that occurs in ℛ and its suffix array interval [l, u]. Let a left extension of S be a string of length |S| + 1 of the form aS. We can use B[l, u] to determine the set of left extensions of S. Let ℬ be the set of symbols that appear in the substring B[l, u]. The left extensions of S are the strings aS such that a ∈ ℬ. Note that we do not have to iterate over the range B[l, u] to determine ℬ. Since Occ(a, i) is defined to be the number of times symbol a occurs in B[1, i], we can count the number of occurrences of a in B[l, u] (and hence of aS in ℛ) in constant time by taking the difference Occ(a, u) − Occ(a, l − 1). If the $ symbol occurs in B[l, u], we say that S is left terminal; in other words, one of the elements of ℛ has S as a prefix. We similarly define a right extension of S as a string of length |S| + 1 of the form Sa. While we cannot build the right extensions of S directly from the FM-index, the right extensions of S are equivalent to the left extensions of S′ (the reverse of S) in ℛ′. Let S be right terminal if $ occurs in Bℛ′[l′, u′]; in other words, S is a suffix of some string in ℛ.
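The counting argument can be illustrated directly: given the interval [l, u] of S, the multiset of left extensions falls out of Occ differences without scanning B[l, u]. The naive occ below stands in for the sampled Occ array (which is what makes each count constant time); the function name is illustrative, and intervals are 0-based and inclusive.

```python
def left_extension_counts(bwt, l, u):
    """For the suffix array interval [l, u] (0-based, inclusive) of a
    substring S, return {a: number of occurrences of aS}, computed as
    Occ(a, u) - Occ(a, l - 1). S is left terminal iff '$' is a key."""
    def occ(a, i):                 # occurrences of a among first i BWT symbols
        return bwt[:i].count(a)
    counts = {}
    for a in sorted(set(bwt)):
        n = occ(a, u + 1) - occ(a, l)
        if n > 0:
            counts[a] = n
    return counts
```

On the BWT "annb$aa" of "banana", the substring "ana" has interval (2, 3) and left extensions {'b': 1, 'n': 1}; the interval (4, 4) of "banana" itself gives {'$': 1}, i.e. it is left terminal.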

The procedure to find all the irreducible edges of a read X and construct their labels is to find all the intervals containing the prefixes of reads that overlap a suffix of X, then iteratively extend them rightwards until a right-terminal extension is found. The terminated read forms an irreducible edge with X and the label of the edge is the sequence of bases that were used during the right-extension. All non-terminated strings with the same sequence of extensions are transitive and, therefore, not considered further.

The algorithm requires searching the FM-index in two directions: first backwards, to determine the intervals of overlapping prefixes, and then forwards, to extend those prefixes and build the irreducible labels. Naively, this would require first determining the interval [l, u] for each matching prefix P, and then reversing the prefix and performing a backwards search on the FM-index of ℛ′ to find the interval [l′, u′] for P′. The intervals [l′, u′] would then be used in the extension stage to determine the labels of the irreducible edges. We can do better, however, by noting that the interval [l′, u′] can be calculated directly during the backwards search without using the FM-index of ℛ′. We define OccLT(a, i) to be the number of symbols in B[1, i] that are lexicographically lower than a. Let S = X[i, |X|] be a suffix of X and [li, ui] its suffix array interval, and suppose we know the interval [l′i, u′i] for S′ in ℛ′. Let a = X[i − 1], so that extending S leftwards gives aS, whose reverse is S′a. The interval [l′i−1, u′i−1] for S′a is therefore

l′i−1 = l′i + (OccLT(a, ui) − OccLT(a, li − 1)) (3)

u′i−1 = l′i−1 + (Occ(a, ui) − Occ(a, li − 1)) − 1 (4)

The interval for X′[1] is identical to that for X[|X|]: since B and Bℛ′ are both permutations of the symbols in ℛ, C = Cℛ′. We can therefore initialize the interval [l′, u′] to the same initial value as [l, u] and perform a forward search of X′ simultaneously with the backward search of X, using only the FM-index of ℛ. This does not require any additional storage, as the OccLT array can easily be computed from Occ by summing the values for symbols less than a. This procedure is similar to the 2way-BWT search recently proposed by Lam et al. (2009). The updateFwdBwd algorithm implements Equations (3) and (4) along with updateBackward to calculate the pair of intervals. The ℱ parameter to updateFwdBwd indicates which FM-index is used, that of ℛ or ℛ′.
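Equations (1)–(4) can be combined into one paired update in the spirit of updateFwdBwd. The sketch below uses naive occ/occ_lt scans in place of the sampled Occ and OccLT arrays, 0-based inclusive intervals, and a single BWT string, so it is a model of the update rather than the paper's implementation. Note that the incoming forward upper bound is not needed: the new upper bound is recovered from the new lower bound and the interval size.

```python
def update_fwd_bwd(bwt, C, a, l, u, lp, up):
    """Extend S to aS. Returns the new backward interval for aS in R
    (Equations 1-2) and the new forward interval for (aS)' = S'a in R'
    (Equations 3-4), computed from this one FM-index alone.
    Intervals are 0-based, inclusive; up is unused by the equations."""
    def occ(c, i):                     # Occ: c's among the first i symbols
        return bwt[:i].count(c)
    def occ_lt(c, i):                  # OccLT: symbols lexicographically < c
        return sum(1 for x in bwt[:i] if x < c)
    lp2 = lp + (occ_lt(a, u + 1) - occ_lt(a, l))       # Equation (3)
    up2 = lp2 + (occ(a, u + 1) - occ(a, l)) - 1        # Equation (4)
    l2 = C[a] + occ(a, l)                              # Equation (1)
    u2 = C[a] + occ(a, u + 1) - 1                      # Equation (2)
    return l2, u2, lp2, up2
```

On "banana" (BWT "annb$aa"), the suffixes starting with "a" occupy interval (1, 3), as do those starting with "a" in the reversed text "ananab"; extending to "na" yields backward interval (5, 6) and forward interval (2, 3), which matches the rows for "an" in the suffix array of "ananab".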

We now give the full algorithm for detecting the irreducible overlaps of a read X. The algorithm is performed in two stages: first, a backwards search on X is performed to collect the set of interval pairs, denoted ℐ, for prefixes that match a suffix of X. This stage is presented in findIntervals below and is conceptually similar to findOverlaps.

The interval set found by findIntervals is processed by extractIrreducible to find the intervals corresponding to the irreducible edges of X. This algorithm has two parts. First, the set of intervals is tested to see if some read in the interval set is right terminal. If so, the intervals corresponding to the right terminal reads form irreducible edges with X and are returned. If no interval has terminated, we create a subset of intervals for each right extension of ℐ and recursively call extractIrreducible on each subset.

The algorithm above assumes that there are no reads that are strict substrings of other reads (in other words, all the containments are between identical reads). If this is not the case, a slight modification must be made. If the set of reads overlapping X includes a read that is a proper substring of some other read it is possible that the first right terminal extension found is not that of an irreducible edge but of the contained read. It is straightforward to handle this case by observing that such a read will have an overlap which is strictly shorter than that of the irreducible edge. In other words, the only acceptable right terminal extension is to the reads in ℐ that have the longest overlap with X.

We can similarly modify extractIrreducible to handle overlaps for reads from opposite strands. To do this, we use findIntervals to determine the intervals for overlaps for the same strand as X and overlaps from the opposite strand of X (using the complement of X as in the previous section). When extending an interval that was found by the complement of X, we extend it by the complement of a. In other words, if we are extending same-strand intervals by A, we extend opposite strand intervals by T and so on.

We now offer a sketch of the complexity of the irreducible overlap algorithm. Let Li be the label of irreducible edge i. During the construction of Li at most ki intervals must be updated, corresponding to the number of reads that have an edge-label containing Li. The sum over all irreducible edges, E=∑i(|Li|ki), is the total number of interval updates performed by extractIrreducible. Note that each read in ℛ is represented by a path through the string graph. The total number of times edge i is used in the set of paths spelling all the reads in ℛ is ki and the amount of sequence in ℛ contributed by edge i is |Li|ki. This implies E can be no larger than N, the total amount of sequence in ℛ, and extractIrreducible is O(N). As findIntervals is also O(N), the entire irreducible overlap detection algorithm is O(N).

## 4 RESULTS

As a proof of concept, we implemented the above algorithms. The program is broken into three stages: index, overlap and assemble. The index stage constructs the suffix array and FM-index for a set of sequence reads, the overlap stage computes the set of overlaps between the reads, and the assemble stage builds the string graph, performs transitive reduction if necessary, then compacts unambiguous paths in the graph and writes out a set of contigs. We tested the performance of the algorithms with two sets of simulations. In both sets of simulations, we compared the exhaustive overlap algorithm (which constructs the set of all overlaps) and the direct construction algorithm (which only outputs overlaps for irreducible edges). First, we simulated Escherichia coli read data with mean sequence depth from 5× to 100× to investigate the computational complexity of the overlap algorithms as a function of depth. After constructing the index for each dataset, we ran the overlap step in exhaustive and direct mode with τ = 27. The running times of these simulations are shown in Figure 2. As expected, the direct overlap algorithm scales linearly with sequence depth. The exhaustive overlap algorithm exhibits the expected above-linear scaling, as the number of overlaps for a given read grows quadratically with sequence depth.

Fig. 2.

The running time of the direct and exhaustive overlap algorithms for simulated E. coli data with sequence depth from 5× to 100×. The direct overlap algorithm scales linearly with sequence depth. As the number of overlaps grows quadratically with sequence depth, the exhaustive overlap algorithm exhibits above-linear scaling.


To assess the quality of the resulting assembly, we assembled the data using the direct overlap algorithm and compared the contigs to the reference. For each level of coverage, we selected τ to maximize the assembly N50 value. The N50 values ranged from 1.7 kbp (5× data, τ = 17) to 80 kbp (100× data, τ = 85). We aligned the contigs to the reference genome with bwa-sw (Li and Durbin, 2010) and found that no contigs were misassembled.

We also simulated data from human chromosomes 22, 15, 7 and 2 to assess how the algorithms scale with the size of the genome. We pre-processed the chromosome sequences to remove sequence gaps then generated 100 bp error-free reads randomly at an average coverage of 20× for each chromosome. The results of the simulations are summarized in Table 1. The running time of the exhaustive and direct overlap algorithms are comparable. As the sequence depth is fixed at 20×, both overlap algorithms scale linearly with the size of the input data. The final stage of the algorithm, building the string graph and constructing contigs, is much shorter for the direct algorithm as the transitive reduction step does not need to be performed. In addition, this step requires considerably less memory as the initial graph constructed by the direct algorithm only contains irreducible edges.

Table 1.

Simulation results for human chromosomes 22, 15, 7 and 2

| | chr 22 | chr 15 | chr 7 | chr 2 | ratio |
| --- | --- | --- | --- | --- | --- |
| Chr. size (Mb) | 34.9 | 81.7 | 155.4 | 238.2 | 6.8 |
| Number of reads (M) | 7.0 | 16.3 | 31.1 | 47.6 | 6.8 |
| Contained reads (k) | 684 | 1668 | 3103 | 4709 | 6.9 |
| Contained (%) | 9.8 | 10.2 | 10.0 | 9.9 | – |
| Transitive edges (M) | 38.0 | 93.0 | 177.7 | 274.6 | 7.2 |
| Irreducible edges (M) | 6.3 | 14.9 | 28.7 | 44.4 | 7.0 |
| Assembly N50 (kbp) | 4.0 | 4.6 | 4.2 | 4.7 | – |
| Longest contig (kbp) | 31.9 | 47.7 | 53.1 | 48.6 | – |
| Index time (s) | 2606 | 9743 | 19 779 | 30 866 | 11.8 |
| Overlap -e time (s) | 2657 | 6572 | 12 970 | 18 060 | 6.8 |
| Overlap -d time (s) | 2885 | 6750 | 13 271 | 19 437 | 6.7 |
| Assemble -e time (s) | 1836 | 4043 | 8112 | 13 095 | 7.1 |
| Assemble -d time (s) | 423 | 1161 | 2044 | 3226 | 7.6 |
| Index memory (GB) | 8.0 | 18.6 | 35.4 | 54.5 | 6.8 |
| Overlap -e mem. (GB) | 2.4 | 5.5 | 10.5 | 16.1 | 6.7 |
| Overlap -d mem. (GB) | 2.4 | 5.5 | 10.4 | 16.1 | 6.7 |
| Assemble -e mem. (GB) | 5.9 | 14.2 | 27.2 | 41.9 | 7.1 |
| Assemble -d mem. (GB) | 2.7 | 6.3 | 12.1 | 18.6 | 6.9 |

For the overlap and assemble rows, -e and -d indicate the exhaustive and direct algorithms, respectively. The last column is the ratio between chromosome 2 and 22.

The bottleneck in terms of both computation time and memory usage is the indexing step, which builds the suffix array and FM-index for the entire read set. This required 8.5 h and ∼55 GB of memory for chromosome 2. Extrapolating to the size of the human genome indicates it would require ∼4.5 days and 700 GB of memory to index 20× sequence data. While the computational time is tractable, the amount of memory required is not practical for the routine assembly of human genomes. We address ways to reduce the computational requirements in Section 5.

## 5 DISCUSSION

We have described an efficient method of constructing a string graph from a set of sequence reads using the FM-index. This work is the first step in the construction of a new, general purpose sequence assembler that will be effective for both short reads of the current generation of sequence technology and the longer reads of the sequencing instruments on the horizon. Unlike the de Bruijn graph formulation, the string graph is particularly well-suited for the assembly of mixed length read data. While the primary algorithms are in place, a considerable amount of work remains. Most pressing is the issue of adapting the assembler to handle real sequence data that contains base-calling errors. This amounts to adapting the algorithms to handle inexact overlaps. The BWA, Bowtie and SOAP2 aligners implement a number of strategies and heuristics for dealing with base mismatches and small insertion/deletions (Langmead et al., 2009; Li and Durbin, 2009; Li et al., 2009). These strategies directly translate to finding overlaps. Let ϵ be the maximum allowed difference between two overlapping reads. When performing the backwards search to find overlaps, we can allow the search to proceed to mismatched bases or gaps while ensuring that the ϵ bound is respected. We can similarly modify the irreducible overlap detection algorithm by allowing the right-extension phase to extend to mismatch bases or gaps. Here, we would only consider an interval to be transitive with respect to one of the irreducible intervals if the inferred difference between the intervals is less than ϵ.

Our intention is to build an assembler that can handle genomes up to several gigabases in size, such as human or other vertebrate genomes, and our initial results indicate that our algorithms scale well. The introduction of sequencing errors will increase the complexity of the irreducible overlap identification step, but this step is straightforward to parallelize if necessary because it is carried out independently for each non-contained read. The construction of the suffix array is currently the computational bottleneck; however, there are a number of established ways to improve this. We can lower the amount of memory required by exploiting the redundancy present in a set of sequencing reads using a compressed index. The compressed suffix array is one such index, and a method was recently developed to merge two compressed suffix arrays, which may allow a distributed construction algorithm (Sirén, 2009). Additionally, efficient external memory (disk-based) BWT construction algorithms have been developed that allow the construction of the FM-index for very large datasets while using a limited amount of main memory (Dementiev et al., 2008; Ferragina et al., 2010).

It is worth investigating the equivalence of the de Bruijn graph and string graph formulations (Pop, 2009). This has been studied in terms of the computational complexity of reconstructing the complete sequence of the genome, and both formulations have been shown to be NP-hard (Medvedev et al., 2007). We would like to know the equivalence in terms of the information contained in the graph. Consider the case where all sequence reads are of length l and every l-mer in the genome has been sampled exactly once. In this case, the de Bruijn graph and string graph constructions (using parameters k = l − 1 and τ = l − 1, respectively) are equivalent. In the realistic case where the genome is unevenly sampled, the relationship is not clear. In the original paper on the EULER assembler, Pevzner presents an algorithm to recover the information lost during the deconstruction of reads into k-mers by finding consistent read-paths through the k-mer graph (Pevzner et al., 2001). It is conceivable that if this procedure were able to perfectly reconstruct the lost information, the resulting graph would be equivalent to Myers' string graph. This is not certain, however, and the equivalence of these formulations is a question we would like to address.

## ACKNOWLEDGEMENTS

We thank Veli Mäkinen and members of the Durbin group for discussions related to string matching and sequence assembly.

Funding: This work was supported by the Wellcome Trust (grant number WT077192). J.T.S. is supported by a Wellcome Trust Sanger Institute Research Studentship.

Conflict of interest: none declared.

## REFERENCES

Bentley JL, Sedgewick R (1997) Fast algorithms for sorting and searching strings. In SODA '97: Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 360–369. Society for Industrial and Applied Mathematics.
Burrows M, Wheeler DJ (1994) A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, CA.
Chaisson MJ, Pevzner PA (2008) Short read fragment assembly of bacterial genomes. Genome Res. 18:324–330.
Dementiev R, et al. (2008) Better external memory suffix array construction. J. Exp. Algorithmics 12:1–24.
Ferragina P, Manzini G (2000) Opportunistic data structures with applications. In Proceedings of the 41st Symposium on Foundations of Computer Science (FOCS 2000), pp. 390–398. IEEE Computer Society, Los Alamitos, CA.
Ferragina P, et al. (2010) Lightweight data indexing and compression in external memory. In Proceedings of the Latin American Theoretical Informatics Symposium, Lecture Notes in Computer Science. Springer, Heidelberg.
Gusfield D (1997) Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, UK.
Hernandez D, et al. (2008) De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res. 18:802–809.
Ko P, Aluru S (2005) Space efficient linear time construction of suffix arrays. J. Discrete Algorithms 3:143–156.
Lam TW, et al. (2009) High throughput short read alignment via bi-directional BWT. In 2009 IEEE International Conference on Bioinformatics and Biomedicine, pp. 31–36. IEEE, Los Alamitos, CA.
Langmead B, et al. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10:R25.
Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754–1760.
Li H, Durbin R (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595.
Li R, et al. (2009) SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25:1966–1967.
Manber U, Myers G (1990) Suffix arrays: a new method for on-line string searches. In SODA '90: Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 319–327. Society for Industrial and Applied Mathematics.
Medvedev P, et al. (2007) Computability of models for sequence assembly. In Algorithms in Bioinformatics, Lecture Notes in Computer Science, pp. 289–301. Springer, Heidelberg.
Myers EW (2005) The fragment assembly string graph. Bioinformatics 21(Suppl. 2):ii79–ii85.
Nong G, et al. (2009) Linear suffix array construction by almost pure induced-sorting. In DCC '09: Proceedings of the IEEE Conference on Data Compression, pp. 193–202. IEEE, Los Alamitos, CA.
Pevzner PA, et al. (2001) An Eulerian path approach to DNA fragment assembly. Proc. Natl Acad. Sci. USA 98:9748–9753.
Pop M (2009) Genome assembly reborn: recent computational challenges. Brief. Bioinform. 10:354–366.
Puglisi SJ, et al. (2007) A taxonomy of suffix array construction algorithms. ACM Comput. Surv. 39:4.
Simpson JT, et al. (2009) ABySS: a parallel assembler for short read sequence data. Genome Res. 19:1117–1123.
Sirén J (2009) Compressed suffix arrays for massive data. In String Processing and Information Retrieval (SPIRE 2009), pp. 63–74.
Zerbino DR, Birney E (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18:821–829.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.