## Abstract

**Motivation:** Sequence assembly is a difficult problem whose importance has grown again recently as the cost of sequencing has dramatically dropped. Most new sequence assembly software has started by building a de Bruijn graph, avoiding the overlap-based methods used previously because of the computational cost and complexity of these with very large numbers of short reads. Here, we show how to use suffix array-based methods that have formed the basis of recent very fast sequence mapping algorithms to find overlaps and generate assembly string graphs asymptotically faster than previously described algorithms.

**Results:** Standard overlap assembly methods have time complexity *O*(*N*^{2}), where *N* is the sum of the lengths of the reads. We use the Ferragina–Manzini index (FM-index) derived from the Burrows–Wheeler transform to find overlaps of length at least τ among a set of reads. As well as an approach that finds all overlaps then implements transitive reduction to produce a string graph, we show how to output directly only the irreducible overlaps, significantly shrinking memory requirements and reducing compute time to *O*(*N*), independent of depth. Overlap-based assembly methods naturally handle mixed length read sets, including capillary reads or long reads promised by the third generation sequencing technologies. The algorithms we present here pave the way for overlap-based assembly approaches to be developed that scale to whole vertebrate genome *de novo* assembly.

**Contact:**js18@sanger.ac.uk

## 1 INTRODUCTION

The sequence assembly problem is one of the most important and difficult problems in bioinformatics. Most genomes, particularly eukaryotic genomes, are highly repetitive that complicates their assembly by obscuring true relationships between reads with many false options. To help disambiguate the true relationships between the reads from those induced by different copies of repeats, it is useful to construct a graph where all the copies of a repeat are collapsed into a single segment. Such a graph is commonly referred to as a *repeat graph*. This structure is a natural consequence of the de Bruijn graph method of sequence assembly as the deconstruction of the sequence reads into *k*-mers (short subsequences of the reads of length *k*) collapses repeats that share the same *k*-mer into a single vertex (Pevzner *et al.*, 2001). An alternative formulation was proposed by Gene Myers and is called the *string graph* (Myers, 2005). The string graph is built by first constructing a graph of the pairwise overlaps between sequence reads and transforming it into a string graph by removing transitive edges. The string graph shares with the de Bruijn graph the property that repeats are collapsed to a single unit without the need to first deconstruct the reads into *k*-mers. Because it is based on maximal overlaps, which are typically longer than de Bruijn *k*-mers, it also disambiguates shorter repeats that de Bruijn methods would only resolve in later processing steps (if at all). The string graph is much more expensive to construct, however, as the set of all pairwise, inexact overlaps between sequence reads must be found. For this reason, the majority of assemblers of short read sequence data have been based on the de Bruijn approach (Chaisson and Pevzner, 2008; Simpson *et al.*, 2009; Zerbino and Birney, 2008). A notable exception is the Edena assembler (Hernandez *et al.*, 2008) that uses a suffix array to compute exact overlaps between reads that are then used to construct the string graph. We address the construction of a string graph with a related approach by indexing the set of sequence reads using the Burrows–Wheeler transform(BWT)/Ferragina—Manzini(FM)-index, which has recently been used for the short read alignment problem (Langmead *et al.*, 2009; Li and Durbin, 2009; Li *et al.*, 2009). We show how to efficiently compute the set of overlaps needed to construct the string graph from the FM-index. Furthermore, we show that the string graph can be constructed directly using the FM-index without the need for explicitly finding all overlaps and a subsequent transitive removal step, yielding a space and time efficient construction algorithm.

## 2 BACKGROUND

### 2.1 Definitions and notation

Let *X* be a string of symbols *a*_{1},…, *a*_{l} from an alphabet Σ. The length of *X* is denoted |*X*|. We consider all strings to be terminated by a sentinel symbol $ that is not in Σ and is lexographically lower than all the symbols in Σ. *X*[*i*]=*a*_{i} is the *i*-th symbol of *X* and *X*[*i*, *j*] is the substring *a*_{i},…, *a*_{j}. A substring *X*[*k*, |*X*|] is a *suffix* of *X* and a substring *X*[1, *k*] is a *prefix* of *X*. Let *X*′=*a*_{l}, *a*_{l−i},…, *a*_{1} denote the reverse of *X*.

### 2.2 Genomes and sequence reads

We define a *genome* to be a long string from the alphabet {*A*, *C*, *G*, *T*} representing the complete DNA sequence of an individual, for simplicity ignoring potential subdivisions into chromosomes. A sequence *read* is a short substring from a genome. DNA is a double stranded molecule and sequence reads can originate from either strand. We use the notation for the *reverse-complement* of a read *X*. In a shotgun sequencing experiment, a set of sequence reads, which we denote by the indexed set ℛ, is randomly sampled from a genome with an unknown sequence. The sequence assembly problem is to reconstruct the sequence of the genome given ℛ. We say that two reads *X* and *Y overlap* if a prefix of *X* is equal to a suffix of *Y* or vice versa. If *X* and *Y* originate from opposite strands, they overlap if the reverse complement of one of them overlaps the other. To help distinguish true overlaps from spurious overlaps, we set a threshold of τ on the minimum acceptable overlap length. We assume for the moment that sequence reads are perfect representations of the genome—there are no sequencing errors. We discuss how to relax this constraint in the discussion at the end of this article.

### 2.3 Overlap and string graphs

To help reconstruct the source genome from ℛ, we can build a graph of the relationships between sequence reads. One such graph is the *overlap graph*. In the overlap graph, each sequence read in ℛ is a vertex and two vertices are joined by an edge if their corresponding reads overlap. Myers' string graph is a refinement of such a graph. In the string graph, reads that are *contained* within some other read, that is they are a substring of (or perhaps identical to) another read, are considered to be redundant and are not vertices in the graph. Each edge in a string graph is bidirectional to model the double-stranded nature of DNA and labelled with the unmatched substrings of the sequence reads. More formally, let *X* and *Y* be two reads where *X*[*s*_{xy}, *e*_{xy}]=*Y*[*s*_{yx}, *e*_{yx}]. We call *X*[*s*_{xy}, *e*_{xy}] the *matched* portion of *X* and the remainder *unmatched*. If *s*_{xy}=1 and *e*_{xy}=|*X*| the entirety of *X* is matched by *Y* and *X* is said to be contained by *Y*. If *Y* is also contained by *X* (*s*_{yx}=1 and *e*_{yx}=|*Y*|), *X* and *Y* are identical. In this case, we break the tie by saying the read with the higher index in ℛ is contained within the read with the lower index. If neither *X* nor *Y* are contained and *X*[*s*_{xy}, *e*_{xy}] is a prefix of *X* (*s*_{xy}=1) and *Y*[*s*_{yx}, *e*_{yx}] is a suffix of *Y* (*e*_{yx}=|*Y*|), or vice versa, we say the overlap between *X* and *Y* is *proper*. If *X* and *Y* are reads from opposite strands of the genome they can still form an overlap. In this case, and both *X*[*s*_{xy}, *e*_{xy}] and *Y*[*s*_{yx}, *e*_{yx}] must be prefixes or both must be suffixes.

All non-contained reads are vertices in the string graph. For each proper overlap between two reads, we add a bidirected edge to the graph *X* ↔ *Y*. The bidirected edge describes the nature of the overlap between the reads and has two labels, one for each of the unmatched substring of the reads. We denote the tuple of data for each edge as (*type*_{xy}, *type*_{yx}, *label*_{xy}, *label*_{yx}). We define the *type*_{xy} property (respectively, *type*_{yx}) as:

In other words, *type*_{xy} is *B* if the matched portion of *X* is a prefix of *X*, otherwise the matched portion of *X* must be a suffix and *type*_{xy} is *E*. Note that since the graph does not have contained reads these cases are mutually exclusive. The *label*_{xy} property is

Restated, *label*_{xy} is the unmatched suffix of *Y* if the matched portion of *Y* is a prefix and vice versa. The concatenation of *X* and *label*_{xy} is an assembly of reads *X* and *Y* — the resulting string contains both the sequence of *X* and *Y*. If the overlap between *X* and *Y* is reverse complemented, i.e. then *label*_{xy} and *label*_{yx} are also reverse complemented. Note that in the case of an edge built from a reverse-complement overlap, *type*_{xy} is necessarily the same as *type*_{yx}. To perform a walk in the string graph, if one enters a vertex on an edge of type *B* then an edge of type *E* must be used to exit and vice versa. Figure 1 depicts a simple string graph built from three overlapping reads.

The initial graph built from the overlaps between reads is not a string graph yet. Consider a read *X* that overlaps reads *Y* and *Z*, which mutually overlap. The initial string graph will contain the edges *X* ↔ *Y*, *X* ↔ *Z* and *Y* ↔ *Z*. If *Y* and *Z* overlap the same end of *X*, i.e *type*_{xy} = *type*_{xz}, then *Y* and *Z* must share a common substring of *X* which is a prefix or suffix of one of *Y* or *Z*. This implies that there is a valid path that visits each of the three reads in succession. Let *X* → *Y* → *Z* be such a path. The string corresponding to this path is a valid assembly of the three reads which is identical to the string corresponding to the path *X* → *Z*. In this case, we say that the edge *X* ↔ *Z* is *transitive*. We will refer to non-transitive edges as *irreducible*. The transitive edges can be removed from the graph without losing any information—the transitive edges (and their corresponding overlaps) could be inferred from the irreducible edges. We can determine useful properties of transitive and irreducible edges. As the graph does not have contained reads, the length of the overlap between *X* ↔ *Y* is necessarily larger than the overlap between *X* ↔ *Z*. Equivalently, the length of *label*_{xy} is shorter than *label*_{xz}, and *label*_{xz} can be seen as the concatenation of *label*_{xy} and *label*_{yz}. In other words, *label*_{xy} of the irreducible edge is a prefix of *label*_{xz} of the transitive edge.

### 2.4 The suffix array, BWT and FM-index

The *suffix array* data structure was introduced by Manber and Myers Manber and Myers (1990) as a succinct representation of the lexographic ordering of the suffixes of a string. The suffix array of a string *X*, denoted SA_{X}, is a permutation of the integers {1, 2,…, |*X*|} such that SA_{X}[*i*]=*j* iff *X*[*j*, |*X*|] is the *i*-th lexographically lowest suffix of *X*. For example, if *X*=AAGTA$ then SA_{X}=[6, 5, 1, 2, 3, 4]. Since the suffix array is a sorted data structure, the start positions of all the instances of a pattern *Q* in *X* will occur in an interval in SA_{X}. We refer to such an interval as a *suffix array interval* and associate with it a pair of integers [*l*, *u*] denoting the first and last index in SA_{X} that correspond to a position in *X* of an instance of *Q*. Using SA_{X} and the original string *X*, *l* and *u* can be efficiently found with a binary search for *Q*. Ferragina and Manzini developed a related method of indexing text, called the FM-index, which requires considerably less memory than a suffix array and can compute *l* and *u* in *O*(|*Q*|) time, independent of the size of the text being searched. Central to the FM-index is the BWT. Originally developed for text compression (Burrows and Wheeler, 1994) the BWT of *X*, denoted **B**_{X}, is a permutation of the symbols of *X* such that

Restated, **B**_{X}[*i*] is the symbol preceding the first symbol of the suffix starting at position SA_{X}[*i*]. Ferragina and Manzini Ferragina and Manzini (2000) extended the BWT representation of a string by adding two additional data structures to create a structure known as the FM-index. Let **C**_{X}(*a*) be the number of symbols in *X* that are lexographically lower than the symbol *a* and **Occ**_{X}(*a*, *i*) be the number of occurrences of the symbol *a* in **B**_{X}[1, *i*]. We note that **C**_{X} and **Occ**_{X} include counts for the sentinel symbol, $. Using these two arrays, Ferragina and Manzini provided an algorithm to search for a string *Q* in *X* (Ferragina and Manzini, 2000). Let *S* be a string whose suffix array interval is known to be [*l*, *u*]. The interval for the string *aS* can be calculated from [*l*, *u*] using **C**_{X} and **Occ**_{X} by the following:

We encapsulate Equations (1) and (2) in the following algorithm, updateBackward.

To search for a string *Q*, we need to first calculate the interval for the last symbol in *Q* then use Equations (1) and (2) to iteratively calculate the interval for the remainder of *Q*. The interval for a single symbol is simply calculated from **C**_{X}. The backwardsSearch algorithm presents the searching procedure in detail. If backwardsSearch returns an interval where *l*>*u*, *Q* is not contained in *X* otherwise **SA**_{X}[*i*] is the position in *X* of each occurrence of *Q* for *l*≤*i*≤*u*.

The backwardsSearch algorithm requires updating the suffix array interval |*Q*| times. As each update is a constant-time operation, the complexity of backwardsSearch is *O*(|*Q*|). To save memory **Occ**_{X}(*a*, *i*) is stored only for *i* divisible by *d* (typically *d* is around 128). The remaining values of **Occ**_{X} can be calculated as needed using the sampled values and **B**_{X}.

### 2.5 The generalized suffix array

We can easily expand the definition of a suffix array to include multiple strings. Let 𝒯 be an indexed set of strings and 𝒯_{i} be element 𝒯[*i*]. We define **SA**_{𝒯}[*i*]=(*j*, *k*) iff 𝒯_{j}[*k*, |𝒯_{j}|] is the *i*-th lowest suffix in 𝒯. In the generalized suffix array, unlike the suffix array of a single string, two suffixes can be lexographically equal. We break ties in this case by comparing the indices of the strings. In other words, we treat each string in 𝒯 as if it was terminated by a unique sentinel character $_{i} where $_{i}<$_{j} when *i*<*j*. We extend the definition of the BWT to collections of strings as follows. Let **SA**_{𝒯}[*i*]=(*j*, *k*) then

Like the BWT of a single string, **B**_{𝒯} is a permutation of the symbols in 𝒯; therefore, the definitions of the auxiliary data structures for the FM-index, **C**_{𝒯}(*a*) and **Occ**_{𝒯}(*a*, *i*), do not change.

## 3 METHODS

The construction of the string graph occurs in two stages. First, the complete set of overlaps of length at least τ is computed for all elements of ℛ. The initial overlap graph is then built as described in Section 2.3 and transformed into the string graph using the linear expected time transitive reduction algorithm of Myers Myers (2005). The first step in this process is the computational bottleneck. The all-pairs maximal overlap problem can be optimally solved in *O*(*N*+*k*^{2}) time using a generalized suffix tree where *N*=∑_{i=1}^{|ℛ|}|ℛ_{i}| and *k*=|ℛ| (Gusfield, 1997). It is straightforward to restrict this algorithm to only find overlaps of length at least τ at a lower computational cost; however, the amount of memory required for a suffix tree makes this algorithm impractical for large datasets. Myers' proposed the use of a *q*-gram filter to find the complete set of overlaps. This requires *O*(*N*^{2}/*D*) time where *D* is a time-space tradeoff factor dependent on the amount of memory available. We will show that by using the FM-index of ℛ the set of overlaps can be computed in *O*(*N*+*C*) time for error-free reads where *C* is the total number of overlaps found. We then provide an algorithm that detects only the overlaps for irreducible edges—removing the need for the transitive reduction algorithm and allowing the direct construction of the string graph.

### 3.1 Building an FM-index from a set of sequence reads

To build the FM-index of ℛ, we must first compute the generalized suffix array of ℛ. We could do this by creating a string that is the concatenation of all members of ℛ, *S*=ℛ_{1}ℛ_{2}…ℛ_{m} and then use one of the well-known efficient suffix array construction algorithms to compute **SA**_{S} (Puglisi *et al.*, 2007). We have adopted a different strategy and have modified the induced-copying suffix array construction algorithm (Nong *et al.*, 2009) to handle an indexed set of strings ℛ where each suffix array entry is a pair (*j*, *k*) as described in Section 2.5. This suffix array construction algorithm is similar to the Ko–Aluru algorithm (Ko and Aluru, 2005). A set of substrings of the text (termed LMS substrings for leftmost S-type, see Nong *et al.*, 2009) is sorted from which the ordering of all the suffixes in the text is induced. Our algorithm differs from the Nong–Zhang–Chan algorithm as we directly sort the LMS substrings using multikey quicksort (Bentley and Sedgewick, 1997) instead of sorting them recursively. This method of construction is very fast in practice as typically only 30−40% of the substrings must be directly sorted. Once **SA**_{ℛ} has been constructed, the BWT of ℛ, and hence the FM-index is easily computed as described above. We also compute the FM-index for the set of *reversed* reads, denoted ℛ′, which is necessary to compute overlaps between reverse complemented reads. We also output the *lexographic index* of ℛ, which is a permutation of the indices {1, 2,…, |ℛ|} of ℛ sorted by the lexographic order of the strings. This can be found directly from **SA**_{ℛ} and is used to determine the identities of the reads in ℛ from the suffix array interval positions once an overlap has been found.

### 3.2 Overlap detection using the FM-index

We now consider the problem of constructing the set of overlaps between reads in ℛ. Consider two reads *X* and *Y*. If a suffix of *X* matches a prefix of *Y* an edge of type (*E*, *B*) will be created in the initial overlap graph. We will describe a procedure to detect overlaps of this type from the FM-index of ℛ. Let *X* be an arbitrary read in ℛ. If we perform the backwardsSearch procedure on the string *X*, after *k* steps we have calculated the interval [*l*, *u*] for the suffix of length *k* of *X*. The reads indicated by the suffix array entries in [*l*, *u*], therefore, have a substring that matches a suffix of *X*. Our task is to determine which of these substrings are prefixes of the reads. Recall that if a given element in the suffix array, SA_{ℛ}[*i*], is a prefix then **B**_{ℛ}[*i*]=$ by definition. Therefore, if we know the suffix array interval for a string *P*, the interval for the strings beginning with *P* can be determined by calculating the interval for the string $*P* using Equations (1) and (2). This interval, denoted [*l*_{$}, *u*_{$}], indicates that the reads with prefix *P* are the *l*_{$}-th to *u*_{$}-th lexographically lowest strings in ℛ. We can, therefore, recover the indices in ℛ of the reads overlapping *X* using lexographic index of ℛ. The algorithm is presented below in findOverlaps.

The findOverlaps algorithm is similar to the backwards search procedure presented in Section 2.4. It begins by initializing [*l*, *u*] to the interval containing all suffixes that begin with the last symbol of *X*. The interval [*l*, *u*] is then iteratively updated for longer suffixes of *X*. When the length of the suffix is at least the minimum overlap size, τ, we determine the interval for the reads that have a prefix matching the suffix of *X* and output an overlap record for each entry (using the subroutine outputOverlaps). When the update loop terminates, [*l*, *u*] holds the interval corresponding to the full length of *X*. The outputContained procedure writes a containment record for *X* if *X* is contained by any read in [*l*, *u*] based on the rules described in Section 2.3. The overlaps detected by findOverlaps correspond to edges of type (*E*, *B*). We must also calculate the overlaps for edges of type (*E*, *E*) and (*B*, *B*), which arise from overlapping reads originating from opposite strands. To calculate edges of type (*E*, *E*), we use findOverlaps on the complement of *X* (not reversed) and the FM-index of ℛ′. Similarly, to calculate edges of type (*B*, *B*), we use findOverlaps on (the reverse complement of *X*) and the FM-index of ℛ.

The overlap records created by outputOverlaps are constructed in constant time as they only require a lookup in the lexographic index of ℛ. Let *c*_{i} be the number of overlaps for read ℛ_{i}. The findOverlaps algorithm makes at most |ℛ_{i}| calls to updateBackwards and a total of *c*_{i} iterations in outputOverlaps for a total complexity of *O*(|ℛ_{i}|+*c*_{i}). For the entire set ℛ, the complexity is *O*(*N*+*C*) where *C*=∑_{i=1}^{|ℛ|}*c*_{i}. Note that the majority of these edges are transitive and subsequently removed. We can, therefore, improve this algorithm by only outputting the set of irreducible edges, allowing the direct construction of the string graph. We address this in Section 3.3.

In rare cases, multiple valid overlaps may occur between a pair of reads. In this case, the intervals detected during findOverlaps will contain intersecting or duplicated intervals. To handle this, we can modify findOverlaps to first collect the entire set of found intervals. This interval set could then be sorted and duplicated or intersecting intervals that represent sub-maximal overlaps can be removed. The outputOverlaps procedure can be called on the entire reduced interval set to output the set of maximal overlaps.

### 3.3 Detecting irreducible overlaps

To directly construct the string graph, we must only output irreducible edges. Recall from Section 2.3 that the labels of the irreducible edges for a given read are prefixes of the labels of transitive edges. We use this fact to differentiate between irreducible and transitive edges during the overlap computation. Consider a read *X* and the set of reads that overlap a suffix of *X*, 𝒪. We could devise an algorithm to find the subset consisting only of irreducible edges by calculating the edge-labels of all members of 𝒪 and filtering out the members whose label is the extension of the label of some other read. This would require iterating over all members of 𝒪, which can be quite large for repetitive reads. We will now show that the labels of the irreducible edges can be constructed directly from the suffix array intervals using the FM-index.

Consider a substring *S* that occurs in ℛ and its suffix array interval [*l*, *u*]. Let a *left extension* of *S* be a string of length |*S*| + 1 of the form *aS*. We can use **B**_{ℛ}[*l*, *u*] to determine the set of left extensions of *S*. Let ℬ be the set of symbols that appear in the substring **B**_{ℛ}[*l*, *u*]. The left extensions of *S* are the strings *aS* such that *a*∈ℬ. Note that we do not have to iterate over the range **B**_{ℛ}[*l*, *u*] to determine ℬ. Since **Occ**_{ℛ}(*a*, *i*) is defined to be the number of times symbol *a* occurs in **B**_{ℛ}[1, *i*] we can count the number of occurances of *a* in **B**_{ℛ}[*l*, *u*] (and hence *aS* in ℛ) in constant time by taking the difference **Occ**_{ℛ}(*a*, *u*) − **Occ**_{ℛ}(*a*, *l*−1). If the $ symbol occurs in **B**_{ℛ}[*l*, *u*] we say that *S* is *left terminal*, in other words one of the elements of ℛ has *S* as a prefix. We similarly define a *right extension* of *S* as a string of length |*S*| + 1 of the form *Sa*. While we cannot build the right extensions of *S* directly from the FM-index, the right extensions of *S* are equivalent to left extensions of *S*′ (the reverse of *S*) in ℛ′. Let *S* be *right terminal* if $ exists in **B**_{ℛ′}[*l*′, *u*′], in other words *S* is a suffix of some string in ℛ.

The procedure to find all the irreducible edges of a read *X* and construct their labels is to find all the intervals containing the prefixes of reads that overlap a suffix of *X*, then iteratively extend them rightwards until a right-terminal extension is found. The terminated read forms an irreducible edge with *X* and the label of the edge is the sequence of bases that were used during the right-extension. All non-terminated strings with the same sequence of extensions are transitive and, therefore, not considered further.

The algorithm requires searching the FM-index in two directions, first backwards to determine the intervals of overlapping prefixes and then forwards to extend those prefixes and build the irreducible labels. Naively this would require first determining the intervals [*l*, *u*] for each matching prefix, *P*, and then reversing the prefix and performing a backwards search on the FM-index of ℛ′ to find the interval [*l*′, *u*′] for *P*′. The intervals [*l*′, *u*′] would then be used in the extension stage to determine the labels of the irreducible edges. We can do better, however, by noting that the interval [*l*′, *u*′] can be calculated directly during the backwards search without using the FM-index of ℛ′. We define **OccLT**_{ℛ}(*a*, *i*) to be the number of symbols that are lexographically lower than *a* in **B**_{ℛ}[1, *i*]. Let *S*=*X*[*i*, |*X*|] be a suffix of *X* and [*l*_{i}, *u*_{i}] its suffix array interval. Suppose we know the interval [*l*_{i}′, *u*_{i}′] for *S*′ in ℛ′. Let *a*=*X*[*i*−1]. The interval for *S*′*a*=[*l*_{i−1}′, *u*_{i−1}′] is therefore

The interval for *X*′[1] is identical to that of *X*[|*X*|], since **B**_{ℛ} and **B**_{ℛ′} are both permutations of symbols in ℛ, therefore, **C**_{ℛ}=**C**_{ℛ′}. We can, therefore, initialize the interval [*l*′, *u*′] to the same initial value of [*l*, *u*] and perform a forward search of *X*′ simulatenously while performing a backward search of *X* using only the FM-index of ℛ. This does not require any additional storage as the **OccLT**_{ℛ} array can easily be computed from **Occ**_{ℛ} by summing the values for symbols less than *a*. This procedure is similar to the 2way-BWT search recently proposed by Lam *et al.* (2009). The updateFwdBwd algorithm implements Equations (3) and (4) along with updateBackward to calculate the pair of intervals. The ℱ parameter to updateFwdBwd indicates the FM-index used — that of ℛ or ℛ′.

We now give the full algorithm for detecting the irreducible overlaps for a read *X*. The algorithm is performed in two stages, first a backwards search on *X* is performed to collect the set of interval pairs, denoted ℐ, for prefixes that match a suffix of *X*. This algorithm is presented in findIntervals below and is conceptually similar to findOverlaps.

The interval set found by findIntervals is processed by extractIrreducible to find the intervals corresponding to the irreducible edges of *X*. This algorithm has two parts. First, the set of intervals is tested to see if some read in the interval set is right terminal. If so, the intervals corresponding to the right terminal reads form irreducible edges with *X* and are returned. If no interval has terminated, we create a subset of intervals for each right extension of ℐ and recursively call extractIrreducible on each subset.

The algorithm above assumes that there are no reads that are strict substrings of other reads (in other words, all the containments are between identical reads). If this is not the case, a slight modification must be made. If the set of reads overlapping *X* includes a read that is a proper substring of some other read it is possible that the first right terminal extension found is not that of an irreducible edge but of the contained read. It is straightforward to handle this case by observing that such a read will have an overlap which is strictly shorter than that of the irreducible edge. In other words, the only acceptable right terminal extension is to the reads in ℐ that have the longest overlap with *X*.

We can similarly modify extractIrreducible to handle overlaps for reads from opposite strands. To do this, we use findIntervals to determine the intervals for overlaps for the same strand as *X* and overlaps from the opposite strand of *X* (using the complement of *X* as in the previous section). When extending an interval that was found by the complement of *X*, we extend it by the complement of *a*. In other words, if we are extending same-strand intervals by A, we extend opposite strand intervals by T and so on.

We now offer a sketch of the complexity of the irreducible overlap algorithm. Let *L*_{i} be the label of irreducible edge *i*. During the construction of *L*_{i} at most *k*_{i} intervals must be updated, corresponding to the number of reads that have an edge-label containing *L*_{i}. The sum over all irreducible edges, *E*=∑_{i}(|*L*_{i}|*k*_{i}), is the total number of interval updates performed by extractIrreducible. Note that each read in ℛ is represented by a path through the string graph. The total number of times edge *i* is used in the set of paths spelling all the reads in ℛ is *k*_{i} and the amount of sequence in ℛ contributed by edge *i* is |*L*_{i}|*k*_{i}. This implies *E* can be no larger than *N*, the total amount of sequence in ℛ, and extractIrreducible is *O*(*N*). As findIntervals is also *O*(*N*), the entire irreducible overlap detection algorithm is *O*(*N*).

## 4 RESULTS

As a proof of concept, we implemented the above algorithms. The program is broken into three stages: index, overlap and assemble. The index stage constructs the suffix array and FM-index for a set of sequence reads, the overlap stage computes the set of overlaps between the reads and the assemble stage builds the string graph, performs transitive reduction if necessary, then compacts unambiguous paths in the graph and writes out a set of contigs. We tested the performance of the algorithms with two sets of simulations. In both sets of simulations, we compared the exhaustive overlap algorithm (which constructs the set of all overlaps) and the direct construction algorithm (which only outputs overlaps for irreducible edges). First, we simulated *Escherichia coli* read data with mean sequence depth from 5 × to 100 × to investigate the computational complexity of the overlap algorithms as a function of depth. After constructing the index for each dataset, we ran the overlap step in exhaustive and direct mode with τ = 27. The running times of these simulations are shown in Figure 2. As expected, the direct overlap algorithm scales linearly with sequence depth. The exhaustive overlap algorithm exhibits the expected above-linear scaling as the number of overlaps for a given read grows quadratically with sequence depth.

To assess the quality of the resulting assembly, we assembled the data using the direct overlap algorithm and compared the contigs to the reference. For each level of coverage, we selected τ to maximize the assembly N50 value. The N50 values ranged from 1.7 kbp (5 × data, τ = 17) to 80 kbp (100 × data, τ = 85). We aligned the contigs to the reference genome with bwa-sw (Li and Durbin, 2010) and found that no contigs were misassembled.

We also simulated data from human chromosomes 22, 15, 7 and 2 to assess how the algorithms scale with the size of the genome. We pre-processed the chromosome sequences to remove sequence gaps then generated 100 bp error-free reads randomly at an average coverage of 20× for each chromosome. The results of the simulations are summarized in Table 1. The running time of the exhaustive and direct overlap algorithms are comparable. As the sequence depth is fixed at 20×, both overlap algorithms scale linearly with the size of the input data. The final stage of the algorithm, building the string graph and constructing contigs, is much shorter for the direct algorithm as the transitive reduction step does not need to be performed. In addition, this step requires considerably less memory as the initial graph constructed by the direct algorithm only contains irreducible edges.

chr 22 | chr 15 | chr 7 | chr 2 | ratio | |
---|---|---|---|---|---|

Chr. size (Mb) | 34.9 | 81.7 | 155.4 | 238.2 | 6.8 |

Number of reads (M) | 7.0 | 16.3 | 31.1 | 47.6 | 6.8 |

Contained reads (k) | 684 | 1668 | 3103 | 4709 | 6.9 |

Contained (%) | 9.8 | 10.2 | 10.0 | 9.9 | – |

Transitive edges (M) | 38.0 | 93.0 | 177.7 | 274.6 | 7.2 |

Irreducible edges (M) | 6.3 | 14.9 | 28.7 | 44.4 | 7.0 |

Assembly N50 (kbp) | 4.0 | 4.6 | 4.2 | 4.7 | – |

Longest contig (kbp) | 31.9 | 47.7 | 53.1 | 48.6 | – |

Index time (s) | 2606 | 9743 | 19 779 | 30 866 | 11.8 |

Overlap -e time (s) | 2657 | 6572 | 12 970 | 18 060 | 6.8 |

Overlap -d time (s) | 2885 | 6750 | 13 271 | 19 437 | 6.7 |

Assemble -e time (s) | 1836 | 4043 | 8112 | 13 095 | 7.1 |

Assemble -d time (s) | 423 | 1161 | 2044 | 3226 | 7.6 |

Index memory (GB) | 8.0 | 18.6 | 35.4 | 54.5 | 6.8 |

Overlap -e mem. (GB) | 2.4 | 5.5 | 10.5 | 16.1 | 6.7 |

Overlap -d mem. (GB) | 2.4 | 5.5 | 10.4 | 16.1 | 6.7 |

Assemble -e mem. (GB) | 5.9 | 14.2 | 27.2 | 41.9 | 7.1 |

Assemble -d mem. (GB) | 2.7 | 6.3 | 12.1 | 18.6 | 6.9 |

chr 22 | chr 15 | chr 7 | chr 2 | ratio | |
---|---|---|---|---|---|

Chr. size (Mb) | 34.9 | 81.7 | 155.4 | 238.2 | 6.8 |

Number of reads (M) | 7.0 | 16.3 | 31.1 | 47.6 | 6.8 |

Contained reads (k) | 684 | 1668 | 3103 | 4709 | 6.9 |

Contained (%) | 9.8 | 10.2 | 10.0 | 9.9 | – |

Transitive edges (M) | 38.0 | 93.0 | 177.7 | 274.6 | 7.2 |

Irreducible edges (M) | 6.3 | 14.9 | 28.7 | 44.4 | 7.0 |

Assembly N50 (kbp) | 4.0 | 4.6 | 4.2 | 4.7 | – |

Longest contig (kbp) | 31.9 | 47.7 | 53.1 | 48.6 | – |

Index time (s) | 2606 | 9743 | 19 779 | 30 866 | 11.8 |

Overlap -e time (s) | 2657 | 6572 | 12 970 | 18 060 | 6.8 |

Overlap -d time (s) | 2885 | 6750 | 13 271 | 19 437 | 6.7 |

Assemble -e time (s) | 1836 | 4043 | 8112 | 13 095 | 7.1 |

Assemble -d time (s) | 423 | 1161 | 2044 | 3226 | 7.6 |

Index memory (GB) | 8.0 | 18.6 | 35.4 | 54.5 | 6.8 |

Overlap -e mem. (GB) | 2.4 | 5.5 | 10.5 | 16.1 | 6.7 |

Overlap -d mem. (GB) | 2.4 | 5.5 | 10.4 | 16.1 | 6.7 |

Assemble -e mem. (GB) | 5.9 | 14.2 | 27.2 | 41.9 | 7.1 |

Assemble -d mem. (GB) | 2.7 | 6.3 | 12.1 | 18.6 | 6.9 |

For the overlap and assemble rows, -e and -d indicate the exhaustive and direct algorithms, respectively. The last column is the ratio between chromosome 2 and 22.

The bottleneck in terms of both computation time and memory usage is the indexing step, which builds the suffix array and FM-index for the entire read set. This required 8.5 h and ∼55 GB of memory for chromosome 2. Extrapolating to the size of the human genome indicates it would require ∼4.5 days and 700 GB of memory to index 20× sequence data. While the computational time is tractable, the amount of memory required is not practical for the routine assembly of human genomes. We address ways to reduce the computational requirements in Section 5.

## 5 DISCUSSION

We have described an efficient method of constructing a string graph from a set of sequence reads using the FM-index. This work is the first step in the construction of a new, general purpose sequence assembler that will be effective for both short reads of the current generation of sequence technology and the longer reads of the sequencing instruments on the horizon. Unlike the de Bruijn graph formulation, the string graph is particularly well-suited for the assembly of mixed length read data. While the primary algorithms are in place, a considerable amount of work remains. Most pressing is the issue of adapting the assembler to handle real sequence data that contains base-calling errors. This amounts to adapting the algorithms to handle inexact overlaps. The BWA, Bowtie and SOAP2 aligners implement a number of strategies and heuristics for dealing with base mismatches and small insertion/deletions (Langmead *et al.*, 2009; Li and Durbin, 2009; Li *et al.*, 2009). These strategies directly translate to finding overlaps. Let ϵ be the maximum allowed difference between two overlapping reads. When performing the backwards search to find overlaps, we can allow the search to proceed to mismatched bases or gaps while ensuring that the ϵ bound is respected. We can similarly modify the irreducible overlap detection algorithm by allowing the right-extension phase to extend to mismatch bases or gaps. Here, we would only consider an interval to be transitive with respect to one of the irreducible intervals if the inferred difference between the intervals is less than ϵ.

Our intention is to build an assembler that can handle genomes up to several gigabases in size, such as for human or other vertebrate genomes and our initial results indicate that our algorithms scale well. The introduction of sequencing errors will increase the complexity of the irreducible overlap identification step but this step is straightforward to parallelize if necessary because it is carried out for each non-contained read. The construction of the suffix array is currently the computational bottleneck; however, there are a number of established ways to improve this. We can lower the amount of memory required by exploiting the redundancy present in a set of sequencing reads by using a compressed index. The compressed suffix array is one such index and a method was recently developed to merge two compressed suffix arrays that possibly allows a distributed construction algorithm (Sirén, 2009). Additionally, efficient external memory (disk-based) BWT construction algorithms have been developed that allow the construction of the FM-index for very large datasets while using a limited amount of main memory (Dementiev *et al.*, 2008; Ferragina *et al.*, 2010).

It is worth investigating the equivalency of the de Bruijn graph and string graph formulations (Pop, 2009). This has been studied in terms of the computational complexity of reconstructing the complete sequence of the genome and both formulations have been shown to be NP-hard (Medvedev *et al.*, 2007). We would like to know the equivalence in terms of the information contained in the graph. Consider the case where all sequence reads are of length *l* and every *l*-mer in the genome has been sampled once. In this case, the de Bruijn graph and string graph constructions (using parameters *k* = *l* − 1 and τ = *l* − 1 respectively) are equivalent. In the realistic case where the genome is unevenly sampled, the relationship is not clear. In the original paper on the EULER assembler Pevzner presents an algorithm to recover the information lost during the deconstruction of reads into *k*-mers by finding consistent read-paths through the *k*-mer graph (Pevzner *et al.*, 2001). It is conceivable that if this procedure is able to perfectly reconstruct the information lost the resulting graph would be equivalent to Myers' string graph. This is not clear, however, and the equivalency of these formulations is a question we would like to address.

## ACKNOWLEDGEMENTS

We thank Veli Mäkinen and members of the Durbin group for discussions related to string matching and sequence assembly.

*Funding*: This work was supported by the Wellcome Trust (grant number WT077192). J.T.S. is supported by a Wellcome Trust Sanger Institute Research Studentship.

*Conflict of interest*: none declared.

## REFERENCES

*Lecture Notes of Computer Science*

## Comments