A bioinformatician’s guide to the forefront of suffix array construction algorithms

The suffix array and its variants are text-indexing data structures that have become indispensable in the field of bioinformatics. With the uninitiated in mind, we provide an accessible exposition of the SA-IS algorithm, which is the state of the art in suffix array construction. We also describe DisLex, a technique that allows standard suffix array construction algorithms to create modified suffix arrays designed to enable a simple form of inexact matching needed to support ‘spaced seeds’ and ‘subset seeds’ used in many biological applications.


S1.1 The Proper prefix problem
To describe the proper prefix problem formally it helps us to first define a tiling of a text T , as a set of substrings (tiles) of T such that 1) all positions in T are covered by at least one tile, 2) each position is the start of at most one tile. For example, the set of all length-two substrings is a tiling, as is the set of all length-two substrings starting at an even position (for an even length text). Note that tiles may overlap, for example in SA-IS adjacent tiles ( √ √ substrings) overlap by one position.
Now we can define lexical naming, a technique at the heart of all of the linear time suffix array construction algorithms. Lexical naming transforms a text T to another text T ′ , by defining a tiling of T and treating those tiles as characters of the new text T ′ . The particular character used to represent a tile is chosen to reflect the rank of that tile in a lexically sorted list of the tiles. In this way every suffix T ′ i ′ ··· of T ′ corresponds to a suffix Ti··· of T . One key property that must be maintained by lexical naming is that for any two suffixes T ′ i ′ ··· and T ′ j ′ ··· of T ′ , If lexical naming is done on fixed-length substrings, it is rather straightforward that the above relationship is maintained by simply using the rank of each substring.
However with variable-length substrings, as is with the case of √ √ substrings, this is not immediate. For example, consider T = ctsmctsy$, and assume that ctsm and cts are two among other substrings for which we need to compute lexical names. Note that one is a proper prefix of the other. Since ctsm > cts, we might want the lexical naming to assign a larger character to the former. However in T , the corresponding suffixes T0... and T4... have the opposite relationship, and clearly such a naming strategy will not work.
In the SA-IS algorithm, this problem goes away because of the way the √ √ substrings and their ordering are defined. With some thought one can see that for any √ √ substring pair (p, r) in which p is a proper prefix of r, the final character of p is always ր and the corresponding character of r is always ց. With the extra rule of checking the types, p will always be given a lower lexical name than r, as it should be (since the suffix starting at the first position of p is lexically larger than the suffix starting at the first position of r).

S1.2 Correctness of instance reduction
Here we will show that sorting the suffixes of T ′ corresponds to sorting the √ suffixes of T . Let us first prove the following useful lemma.
Lemma S1. Between a pair of ց-type and ր-type suffixes, both starting with the same character, the ց suffix is lexically smaller than the ր one.
Proof. Let I and J be two suffixes starting with the same character c, but of types ց and ր, respectively. Consider the first position where they have different characters.
In both I and J, all the characters leading up to this position must also be c (otherwise it cannot be that I and J are of different types). At the position where they differ, given their types, I must have a character smaller than c, whereas J must have a character larger than c. Thus I < J.
Theorem S1. Sorting the suffixes of T ′ corresponds to sorting the √ suffixes of T .
Proof. We defined T ′ such that each of its character corresponds to a √ √ substring; and as a consequence, each √ suffix Ti··· of T corresponds to a suffix of T ′ , which we will refer to as T ′ i ′ ··· . We need to show that the lexical relation between any two suffixes T ′ i ′ ··· and T ′ j ′ ··· of T ′ is the same as that between their corresponding suffixes Ti··· and Tj··· of T , i.e.
Since a common prefix between T ′ i ′ ··· and T ′ j ′ ··· implies a common prefix between Ti··· and Tj···, let us consider the position at which T ′ i ′ ··· and T ′ j ′ ··· disagree first. Let I be the √ √ substring of T corresponding to this position in T ′ i ′ ··· , and let J be the √ √ substring corresponding to that in T ′ j ′ ··· . Since T ′ i ′ ··· < T ′ j ′ ··· , it must be that according to our definition of the ordering between two √ √ substrings, I < J. Recall that we had defined this ordering as follows: if the characters under comparison are different, then we apply the usual lexical rule; and if they are the same, then we look at their types, with ր-type character defined to be larger than ց-type. Therefore we distinguish two cases.
Case 1: The first position in which I and J differ have different characters. Then at that position, the character in I is smaller than the one in J. This implies that Ti··· < Tj···.
Case 2: The first position in which I and J differ have the same character in both I and J, but the character in I is ց type and the one in J is ր type. Thus we have from Lemma S1 that Ti··· < Tj···.
We can show that using the same flow of logic as above.

S1.3 Sorting the substrings
In this subsection, we will prove the following theorem.
Theorem S2. The √ √ substrings can be sorted in time linear to the length of T .

S1.3.1 The Method
Following Nong et al., let us first define a few new terms. For every suffix Ti··· of T , its √ -prefix is the string Ti..j, where j > i and Tj··· is the first √ suffix to start at a position to the right of Ti. Additionally, the sentinel is defined to be a √ -prefix.
A √ -prefix is also classified as ր or ց depending on the type Ti··· is. Note that by this definition, a √ √ substring is also a ր-type √ -prefix. Similar to the ordering of √ √ substrings, we will define the ordering of √ -prefixes based on both character and type (i.e. the order is first determined by the character, and if they are equal by the types, ր being smaller than ց. We will (ab)use the ≺ , ≻, ≈ symbols to denote the relationship between two √ -prefixes based on this ordering, so that they are distinguished from the regular <, >, = that we have reserved for the normal lexical ordering of full-length suffixes.
What makes SA-IS so interesting is that sorting √ √ substrings is almost exactly the same as the 3-step combine phase (main article Section 3.2.3). It might be beneficial to read that section before proceeding. To recall, the combine phase uses the ordering of √ suffixes to order the ց suffixes; and then uses the order of the ց suffixes to sort the ր suffixes. In contrast, to sort the √ √ substrings, first the ց-type √ -prefixes are sorted based on their end √ characters; and then the ր-type √ -prefixes are sorted based on the ordering of the ց-type √ -prefixes. Since √ √ substrings are also ր-type √ -prefixes, they end up being sorted too.
We will now describe the method in more detail, and then follow it with a proof of correctness.
Let A be an array of the size of T , which we will use to sort the √ -prefixes. As in the combine phase, A can be thought of as being partitioned into buckets depending on the first character of the √ -prefixes. Each bucket can then be divided into two sub-buckets: the left one for ց type and the right one for ր type. This is because Lemma S1-like relationship holds for √ -prefixes also. There is one distinction however: two √ -prefixes can be equal, and we allow them to occupy neighboring positions in A in any order.
Step 0: This step initializes A and places the √ -type suffixes in their appropriate buckets. Set all elements of A to the special value of -1. Compute the ր-type bucket boundaries of A and save those in an array of tail pointers. Place each √ -type suffix (precisely speaking, its index into T ) at the end of the ր-type bucket then decrement that tail pointer to point one position to the left.
Step 1: This step orders the ց-type √ -prefixes based on the order of the √ suffixes (or rather the √ characters) from the earlier step. Scan A from left to right, skipping place Ai − 1 at the current head of its respective bucket, and then increment that head pointer.
Step 2: This step uses the order of the ց-type √ -prefixes obtained from Step 1 to sort the ր-type √ -prefixes. Scan A from right to left. For each suffix index Ai encountered, if TA i −1 is ր-type, place Ai−1 into the current tail of its respective bucket and decrement that tail pointer.

S1.3.2 Proof of Correctness
Before we begin a formal proof of correctness, we make two observations regarding step 0. The first is that the order in which the √ -prefixes is places does not matter, as long as they are placed in their appropriate buckets. The second is that by doing this, the length-one √ -prefixes are correctly prefix sorted. Thus the situation is highly analogous to the combine phase of the main algorithm in which the √ -suffixes? are assumed to be correctly sorted via recursion.
The following lemma is not difficult to see.
Lemma S3. At the end of Step 1, the ց-type √ -prefixes are in their correct positions.
Proof. We will show this by proving the following proposition: when the sweep reaches the i-th position of the array A, every position between 0 to i that is inside a ց-type bucket already contains the correct ց-type √ -prefix in it. The proof is by induction on i.
For i = 0, this statement is vacuously true because the 0-th position is not in a ց bucket, rather it contains Tn−1..n−1 from Step 0.
Assuming that the proposition is true for some i > 0, we will show that it is also true for i + 1. Suppose the contrary. Since the ց-type √ -prefixes up to i have been correctly placed and we don't move them once placed, it must be that the (i + 1)-th position has an incorrect ց-type √ -prefix. Let T j..k be this incorrectly placed √prefix, and let T l..m be the correct one that should have been placed there. It must be .m , then it has been correctly placed; and if T j..k ≺ T l..m , it would have been already placed in a position somewhere left of i + 1 by the induction hypothesis. Also both T j..k and T l..m start with the same character because they were placed in the same bucket.
Let us have a look at T l+1..m and T j+1..k . Since T l is ց type, we have from Then it must be that T l+1..m is already placed in A in some position left of i + 1 because of one of the following: Case 2: if T l+1..m consists of a single √ -type character, it must be lexically smaller than T l ; and therefore in Step 0, it was placed in its the bucket, which must be left of i + 1.
As for T j+1..k , it can either be left of i + 1 or not. If the latter is true, then it cannot be the case that T j..k got inserted before T l..m because the left-to-right sweep would have reached T l+1..m first. Suppose the former is true. Since we established that T j..k and T l..m start with the same character and that T j..k ≻ T l..m , we have T j+1..k ≻ T l+1..m . Using this, we can show that T l+1..m is left of T j+1..k because of one of the following: Case 1: if both T j+1..k and T l+1..m are ց-type √ -prefixes, then by induction hypothesis.
Case 2: if both of them consist of single √ -type character, then they are in different buckets in Step 0. (Note that they cannot be in the same bucket, else T j..k ≈ Case 3: if one of them is ց type, and the other ր type, then the buckets they belong to already ensures this relative positioning.
Since T l+1..m is left of T j+1..k , the sweep would have reached T l+1..m , first and therefore this contradicts our assumption that T j..k was inserted before T l..m . This completes the proof that Step 1 correctly orders the ց-type √ -prefixes.
Lemma S4. At the end of Step 2, the ր-type √ -prefixes are in their correct positions.
Proof. Similar to the proof above, we will show this by proving the following proposition: when the sweep reaches the i-th position of the array A, every position between The proof is by induction on i.
The base case is that of i = n − 1. The √ -prefix in that position in A has to be an ց-type one because we cannot have an ր-type √ -prefix starting with the largest alphabet. From the previous lemma, it must be in its correct position.
Next we need to show that if the proposition is true for some i < n − 1, it is also true for i − 1. Since the proof is similar to that of the lemma above, we will skip it.
The time complexity of this method is O(n), as each step is a sweep across the length-n array A, with each operation along the sweep requiring O(1) time.
Thus we have proved Theorem S2.

S1.4 Computing the reduced instance T ′ from the ordering of the substrings
One minor thing that remains to be discussed is how to obtain the reduced instance T ′ from this ordering of the √ √ substrings. We will show how to construct T ′ in-place in A, when we discuss the memory usage of SA-IS in Section S3. Proof. We will show this by proving the following proposition: when the scan reaches the i-th position of the array A, every position between 0 to i that is inside a ց-type bucket already contains the correct suffix in it.
The proof is by induction on i.
For i = 0, this statement is vacuously true because the 0-th position is not in a ց bucket, rather it contains Tn−1··· from Step 0.
Assuming that the proposition is true for some i ≥ 0, we will show that it is also true for i + 1. Suppose the contrary. Since the ց-type suffixes up to i have been correctly placed and we don't move them once placed, it must be that the (i + 1)-th position has an incorrect suffix. Let this suffix be Tj··· for some j. Note that Tj··· must be ց as well, because we do not insert any ր-type suffixes in Step 1. Let T k··· (for some k = j) be the correct ց suffix that should have been placed at i + 1. It must be that Tj··· > T k··· because otherwise Tj··· would already be in its correct position by the induction hypothesis. Also since Tj··· was placed in the ց-type bucket in which T k··· should have been, they must have the same initial character. Let us have a look at the suffixes T k+1··· and Tj+1···. We have T k+1··· < T k··· since the latter is ց type. Then it must be that T k+1··· is already placed in A in some position left of i + 1 because of one of the following: Case 1: if T k+1··· is ց type, by the induction hypothesis.
Case 2: if T k+1··· is √ type, the starting character of T k+1··· is lexically smaller than the starting character of T k··· ; and therefore the bucket for this √ type is to the left of i + 1.
As for Tj+1···, it can either be to the left of i + 1 or not. If the latter is true, then it cannot be the case that Tj was inserted before T k because the scan proceeds from left to right and would have encountered T k+1··· first. Suppose the former is true. Since we already established that Tj··· and T k··· have the same initial character and that Tj··· > T k··· , we have Tj+1··· > T k+1··· . Using this, we can show that T k+1 is left of Tj+1 because: Case 1: if both Tj+1··· and T k+1··· are ց type, by the induction hypothesis.
Case 2: if both Tj+1··· and T k+1··· are √ type, they have been inserted in A in their sorted order in Step 0.
Case 3: if one of them is ց type and the other √ type, the buckets they belong to (ց and ր respectively) ensure this relative positioning.
Thus the scan would have encountered T k+1··· before Tj+1···, and therefore it cannot be the case that Tj was inserted before T k . Therefore our initial assumption that i + 1 does not have the correct suffix is wrong. This completes the proof by induction.

S2.2 Correctness of Step 2 of combine phase
Theorem S4. At the end of Step 2 of Combine Phase, all the suffixes are in their correct position.
Proof. The ց suffixes were placed in their correct position by Step 1, and we do not move them in Step 2. Therefore we only need to show that this step places all the ր suffixes in their correct positions. We will show this by proving the following proposition: when the right-to-left scan reaches the i-th position of the array A, every position between i and n− that is inside a ր-type bucket already contains the correct suffix in it. The proof is by induction on i. The base case is that of i = n − 1. The suffixes starting from the largest alphabet are always ց type. Because Step 1 places all ց-type suffixes in the correct position, the suffix at position n − 1 is ց. Therefore the base case is vacuously true.
The induction phase is very similar (in fact simpler) than the proof of Theorem S3, so we will skip it.

S3 Memory usage of SA-IS
We mentioned in Section 4.2 that the SA-IS algorithm requires about 2n bytes of working memory for a text T of length n. This is under the assumption that each character can be represented by 1 byte and that each position index of the suffix array can be represented by a 4-byte integer (this limits the text size to < 2 32 ). In this section, we will provide a thorough analysis of the space requirement of SA-IS. We begin by considering a rather straightforward implementation in order to understand where memory costs are incurred, and then show how to improve this implementation using simple but (perhaps) non-obvious techniques.

S3.1 Simple Implementation
Suppose the computation has reached some level i of recursion (level 0 being the topmost). The memory requirement at level i depends on whether we have arrived at this level from the shallower level i − 1 on the way down the recursion path (in which case the divide phase is executed), or from the deeper level i + 1 on the way up (in which case the combine phase is executed). Let us treat these two cases separately.

Memory usage of the divide phase at level i
In this section, we denote the text at level i of recursion as Ti and its length by ni, with n0 = n. As detailed in Section 3.2.1 of the main article and Section S1.3 above, SA-IS does the following computations in the divide phase: (1) classify the suffixes as ր or ց type, (2) induced-sort the √ √ substrings, and (3) generate the reduced instance T ′ i which serves as the input text Ti+1 for the next level of recursion. The major memory allocations required for these computations are as follows: • an array TYPEi of size ni bits indicating the type of each suffix, • an array Ai of 4ni bytes to induced-sort the √ √ substrings, • an array Bi of size equal to the number of distinct characters in Ti, to hold the bucket pointers required for induced-sorting of the √ √ substrings (recall that the bucket heads and tails are never needed in the same step so only one pointer per character is required).  Figure S1: Each horizontal bar represents the worst-case (T i+1 is half the size of T i and when B i is in the order of the size of T i ) peak memory usage for the divide phase at a particular recursion level. B 0 is not shown as it is of some constant size, assuming T 0 is a text over a constant alphabet.
• an array to hold the reduced text T ′ i that is constructed by lexical-naming the √ √ substrings.
Before the computation proceeds to the next level of recursion, we can discard TYPEi, Bi, and Ai from memory. The first two will be required again when we revisit level i on the way back from level i + 1, but we can generate them from Ti when that need arises.
Based on the observations above, we show in Figure S1 the worst-case memory requirement at different levels of recursion. The maximum is at level 1, with a footprint of 8n bytes plus n/2 bits.

Memory usage of combine phase at level i
We now turn our attention to the case where the computation returns to level i from level i + 1. Level i + 1 returns SA T i+1 , the suffix array of Ti+1. The suffixes in SA T i+1 correspond to the √ suffixes of Ti, and as detailed in Section 3.2.3 of the main article, the suffix array of Ti can be computed using this information by induced-sorting. The following memory allocation is required: • an array TYPEi of size ni bits indicating what type each suffix is, • an array Ai of size 4 × ni bytes to induced-sort suffixes of Ti and to ultimately hold its suffix array, • an array Bi of size equal to the number of distinct characters in Ti to hold the bucket pointers required during induced-sorting of the suffixes of Ti.
Note that Ti+1 loses its utility once its suffix array is computed, and therefore it need not be passed on to level i. Also the bucket and the type arrays can be discarded before proceeding to a shallower level.  Figure S2: Worst-case memory requirement for combine phase at different levels of recursion. B 0 is not shown as it is of some constant size, assuming T 0 is a text over a constant alphabet. Figure S2 shows the worst-case memory usage for the combine phase at different levels of recursion. Again, the maximum of 8n bytes plus n/2 bits is reached at level 1.
Thus the working memory of this simple implementation is roughly 3n bytes plus n/2 bits. We next show how to bring this figure down to 2n bytes by being a bit more frugal during both the combine and divide phases. While a reduction of n bytes might not sound big, mammalian genome sizes just happen to be on the same order of magnitude as modern memory sizes, so saving n bytes can in some cases have significant practical ramifications.

S3.2 Improving memory usage
The key idea in reducing the working memory to 2n is to reuse slices of the 4n-byte array A0 (the space allocated at level 0 to induced-sort the √ √ substrings) at all subsequent levels. Since A0 will also ultimately hold the suffix array of the input text, this 4n bytes does not count towards working memory. The only major working memory allocation will be for the bucket pointers. The following subsections carve out the details of this implementation idea. Again, we will split the discussion into the divide and combine phases.

Improving divide phase
In the simple implementation described in the previous section, a fresh memory allocation for arrays Ai and T ′ i was made at each level i. Instead, Ai can use half the space previously used by Ai−1 (for i = 0, 4n bytes are freshly allocated), and T ′ i can be constructed in-place in Ai. This memory allocation scheme is better explained by Figure S3, from which it is clear that the divide-phase worst-case working memory is now solely dominated by the (at most) n/2 × 4 bytes of the bucket pointers.
It remains to see how T ′ i can be constructed in-place in Ai. We describe the process in-place computation of T ′ 1 = T 2 Figure S3: Memory requirement with improved implementation of divide phase. We begin by allocating 4n bytes for A 0 . After A 0 is used for induced sorting, T ′ 0 is constructed in-place in A 0 such that at the end of its construction, it occupies the right half of A 0 . The left half of it is used by A 1 at Level 1. This pattern is repeated in subsequent recursion levels.
with a running example in Figure S4. We can think of Ai in terms of its two halves (let's call the left half A l i and the right A r i ). After the induced-sorting computation is over, Ai contains the indices of √ -prefixes in their sorted order. Since we are only interested in the indices of the √ -prefixes that constitute √ √ substrings, we can pack them into A l i . This is possible since there at most ⌊|Ai|/2⌋ suffixes of type √ in Ti.
We now have A r i free for constructing T ′ i . A l i helps us compute T ′ i in two ways; both to compute the lexical names of the √ √ substrings and where they should appear in T ′ i . To compute the lexical names first prepare a lexical name counter with an inital value of zero. We then scan A l i from left to right, using Ti to check if each √ √ substring is distinct (type considered) from the √ √ substring encountered immediately beforehand in the scan, and if so we increment the lexical name counter. We then place the current lexical name at position ⌊ j 2 ⌋ of A r i , where j is the value of A l i of the current position of the scan. Because neighboring suffixes in Ti cannot both be of type √ , dividing by two never leads to a collision.
Finally, we can remove the empty items between the lexical names to obtain T ′ i . It goes without saying that if all the lexical names are distinct, we can directly compute the suffix array of T ′ i without further recursion and place it in A l 0 .

Improving combine phase
Now let us look at the combine phase at level i. Consider the slice of A0 spanning its first 1/2 i positions, and let's call it Ai. From the scheme of partitioning A0 in the divide phases of the deeper levels, the right half of Ai (A r i ) contains Ti+1. Assuming that the first |Ti+1| positions of the left half of Ai (A l i ) hold the suffix array of Ti+1, we can execute the combine phase such that the suffix array of Ti is computed in-place in Ai. The profile of memory use at different levels of recursion is shown in Figure S5,  from which we can see that the peak working memory is 2n bytes plus n/2 bits.
We now describe how to implement the combine phase in-place, along with a running example in Figure S6. SA T i+1 which is inside A l i contains the suffixes of Ti+1, and therefore we first replace them by the corresponding suffixes of Ti. Recall that the j-th suffix of Ti+1 corresponds to the j-th √ suffix of Ti. We can easily work out the correspondence between the indices, by making use of A r i . We first clear Ti+1 off A r i as it is no longer needed, and use it instead to list the √ suffixes of Ti in their order of appearance in Ti. Then for every suffix j in A l i , we replace it with the j-th suffix in A r i . Now that A l i contains the √ suffixes of Ti in their sorted order, we prepare Ai (specifically the first |Ti| positions) for the induced-sorting phase. This means that the bucket boundaries need to be computed, and that the √ suffixes have to be placed inside their respective buckets in Ai. The latter can be done by simply scanning the suffixes in A l i from right to left, and for every suffix encountered, placing it inside its respective bucket. But A l i is part of Ai, and we have to make sure that we don't end up overwriting a position of A l i before having scanned it. In fact, this is guaranteed. Assume otherwise, and suppose that when the scan has only reached A l i [p], we attempt to insert a suffix S at Ai[q], where q < p. This would mean that there are q suffixes lexically smaller than S; but also at the same time, there are p suffixes of type √ lexically smaller than S, a clear contradiction. Once the √ suffixes are in place, we can proceed with induced-sorting (Steps 1 and 2 of the combine phase).
in-place computation of SA T 0 in-place computation of SA T 1 in-place computation of SA T 2 Figure S5: Worst-case peak memory requirement for improved combine phases at different levels. Compared with Figure S2, we can see that n bytes are saved.
We conclude this section by noting that since both the improved combine and divide phases require in the worst case 2n bytes plus n/2 bits, this figure summarizes the overall memory usage of SA-IS. Note that n/2 is an upper bound on the maximum number of bucket pointers needed. The precise maximum number of bucket pointers needed is a function of the input string and in general may be considerably less than n/2.