A survey of BWT variants for string collections

Abstract Motivation In recent years, the focus of bioinformatics research has moved from individual sequences to collections of sequences. Given the fundamental role of the Burrows–Wheeler transform (BWT) in string processing, a number of dedicated tools have been developed for computing the BWT of string collections. While the focus has been on improving efficiency, both in space and time, the exact definition of the BWT used has not been at the center of attention. As we show in this paper, the different tools in use often compute non-equivalent BWT variants: the resulting transforms can differ from each other significantly, including the number r of runs, a central parameter of the BWT. Moreover, with many tools, the transform depends on the input order of the collection. In other words, on the same dataset, the same tool may output different transforms if the dataset is given in a different order. Results We studied 18 dedicated tools for computing the BWT of string collections and were able to identify 6 different BWT variants computed by these tools. We review the differences between these BWT variants, both from a theoretical and from a practical point of view, comparing them on eight real-life biological datasets with different characteristics. We find that the differences can be extensive, depending on the datasets, and are largest on collections of many similar short sequences. The parameter r, the number of runs of the BWT, also shows notable variation between the different BWT variants; on our datasets, it varied by a multiplicative factor of up to 4.2. Availability and implementation Source code and scripts to replicate the results and download the data used in the article are available at https://github.com/davidecenzato/BWT-variants-for-string-collections.

the collection, typically in the thousands or even tens or hundreds of thousands.One way to avoid this is to use only conceptually different end-of-string-symbols, i.e. to have only one dollar-sign and apply string input order to break ties.This is the method employed by ropebwt2 [37], BCR [3], and many others. 1Another method to avoid increasing the alphabet size is to separate the input strings using the same end-of-string-symbol; in this case, a different end-of-string-symbol has to be added to the end of the concatenated string to ensure correctness, as e.g.done by BigBWT [8].An equivalent solution is to concatenate the input strings without removing the end-of-line or end-of-file characters, since these act as separators; or to concatenate them without separators and use a bitvector to mark the end of each string.Many studies nowadays use string collections in experiments without turning to dedicated tools for multi-string BWT (e.g.[56,2,34]); often the input strings are turned into one single sequence using one of the methods described above, and then the single-string BWT is computed; it is, however, not always stated explicitly which was the method used to obtain one sequence.Underlying this is the implicit assumption that all methods are equivalent.
In 2007, Mantaci et al. [45] introduced the extended Burrows-Wheeler Transform (eBWT), which generalizes the BWT to a multiset of strings.The eBWT, like the BWT, is reversible, and maintains other properties of the BWT such as fast pattern matching functionality.Since then, however, the term 'extended BWT' has come to be used as a generic term to denote the BWT of a collection of sequences.This is unfortunate, as the eBWT has several properties such as independence of the input order which the other methods do not share; and it is defined using a different order relation from the classical BWT (omega-order rather than lexicographic order, see Section 2 for exact definitions).
Two tools compute the eBWT according to the original definition, pfpebwt and cais (both [6]); all others append an end-of-string character to the input strings, explicitly or implicitly, and as a consequence, the resulting transforms differ from the one defined in [45].Moreover, the output in most cases depends on the input order of the sequences (except for those tools that compute what we term dolEBWT, colexBWT, or optBWT).The exact nature of this dependence differs from one transform to another (see Section 4).
The result is that the BWT variants computed by different tools on the same dataset, or by the same tool on the same dataset but given in a different order, may vary considerably.

Multidollar BWT
As we will see, the BWT-variant which we term mdolBWT is the most general one, in the sense that all others, except for the eBWT, can be simulated by mdolBWT (Proposition 9).This is the variant output by most tools, and it is dependent on the input order: both the transform itself and the number of runs varies depending on the order in which the strings are concatenated.Bentley, Gibney, and Thankachan recently gave a linear-time algorithm for computing mdolBWT with the minimum number of runs amongst all input string orders [4].In order to study the variation of the parameter r, we first implemented a variant of this algorithm counting the minimal number of runs, which later led to a new tool, optimalBWT, computing the optBWT [11].We use this tool in our experiments as a baseline for the number of runs of the other BWT-variants.On our real-life biological datasets, the parameter r varies by up to a multiplicative factor of 4.2 between the different variants.It was shown in [11] that an improvement by a multiplicative factor of up to 31 can be obtained between the input order and optBWT.

Our contributions
1. We identify six distinct BWT variants which are computed by 18 publicly available tools, specifically designed for string collections.We formally describe the differences between these, identifying specific intervals to which differences are restricted.2. We show the influence of the input order on the output, in dependence of the BWT variant.3. We describe the impact on the number r of runs of the BWT and give an upper bound on the amount by which the colexicographic order (sometimes referred to as 'reverse lexicographic order') can differ from the optimal order of Bentley et al. [4].4. We complement our theoretical analysis with extensive experiments, comparing the BWT variants on eight real-life datasets with different characteristics.5. We suggest a way of standardizing the parameter r, and thus to eliminate the ambiguity caused by the presence of different BWT-variants.
To the best of our knowledge, this is the first systematic treatment of the different BWT variants in use for collections of strings.

What is not covered
This paper deals with tools for string collections, so we did not include any tool that computes the BWT of a single string, such as libdivsufsort [48], sais-lite-lcp [22], libsais [28], or bwtdisk [19].Although in many cases, these are the tools used for collections of strings, the transform they compute depends on the method with which the string collection was turned into a single string, as explained above.Nor did we include other BWT variants for single strings such as the bijective BWT [25,32], since, again, these were not designed for string collections.
The Big-xBWT [23] is a tool for compressing and indexing read collections, using the xBWT of Ferragina et al. [20,21].In addition to the string collection, it requires a reference sequence as input, in contrast to the other tools.Moreover, the output is not comparable either, since its length can vary-as opposed to all other BWT variants we review, the xBWT is not a permutation of the input characters but can be shorter, due to the fact that it first maps the input to a tree and then applies the xBWT to it, a BWT-like index for labeled trees, rather than for strings.Likewise, the tool [52] for reference-free xBWT is not included in this review: even though it does not require a reference sequence, it, too, computes the xBWT and not the BWT.Finally, since the method for concatenating the input strings used in [10] (using the same separator symbol but without an additional end-of-string character) differs from all BWT variants that have been implemented by some tool, we excluded it from the current study.

Overview
We give the necessary definitions in Section 2. In Section 3, we present the BWT variants and analyse their differences, followed by an in-depth analysis of the relationship between the input and the output order of the strings in the collection, for the different BWT-variants, in Section 4. In Section 5, we discuss the effects on the repetitiveness measure r.A summary of our experimental results is presented in Section 6.We draw some conclusions from our study in Section 7. The full tables with detailed results on all eight datasets are included in the Appendix.
A preliminary version of this article appeared in [12].Source code and scripts to replicate the results and download the data used in the article are available at https: //github.com/davidecenzato/BWT-variants-for-string-collections.

Preliminaries
Let Σ be a finite ordered alphabet of size σ.We use the notation T = T [1.
.n] for a string T of length n over Σ, T [i] for the ith character, and T [i..j] for the substring where i ≤ j; the length of T is denoted |T |, and the empty string is denoted ε.For a string T over Σ and an integer m > 0, we write T m for the m-fold concatenation of T .A string T is called primitive if T = U m implies T = U and m = 1.Every string T can be written uniquely as T = U m , where U is primitive; in this case, we refer to U as root(T ) and to m as exp(T ).In other words, for every string T it holds that T = root(T ) exp(T ) .Often, an end-of-string character (usually denoted $) is appended to the end of T ; this character is not an element of Σ and is assumed to be smaller than all characters from Σ.Note that appending a $ makes any string primitive.
String S is a conjugate of string for some i ∈ {1, . . ., n} (also called the ith rotation of T ).It is easy to see that a string of length n has n distinct conjugates if and only if it is primitive.A run in string T is a maximal substring consisting of the same character; we denote by runs(T ) the number of runs of T .For example, runs(CAAGGGA) = 4.
For two strings S, T , the (unit-cost) edit distance, or Levenshtein distance, dist edit (S, T ) is defined as the minimum number of operations necessary to transform S into T , where an operation can be deletion or insertion of a character, or substitution of a character by another.The Hamming distance dist H (S, T ), defined only if The lexicographic order on Σ * is defined as follows: S < lex T if S is a proper prefix of T , or if there exists an index j s.t.S[j] < T [j] and for all i < j, S[i] = T [i].The colexicographic order, or colex-order (referred to as reverse lexicographic order or rlo in [37,14]), is defined as follows: S < colex T if S rev < lex T rev , where For a string T = T [1..n] over Σ, the Burrows-Wheeler Transform [9], BWT(T ), is a permutation of the characters of T , given by concatenating the last characters of the lexicographically sorted conjugates of T .In Fig. 2 we give three examples: BWT(CAGAGA) = GGCAAA, BWT(CACACA) = CCCAAA, and BWT(CAGAGA$) = AGGC$AA.It follows from the definition of the BWT that two strings S, T are conjugates if and only if BWT(S) = BWT(T ).Indeed, the BWT is reversible up to conjugates: if a string L is the BWT of some string T , then a string S can be computed in linear time such that L = BWT(S), and thus S is a conjugate of T .To make the BWT uniquely reversible, one can add an index to it, marking the lexicographic rank of the conjugate in input.For example, BWT(CAGAGA) = GGCAAA, and the index 4 specifies that the input was the 4th conjugate in lexicographic order.Alternatively, one adds a $ to the end of T , which makes the input unique: BWT(CAGAGA$) = AGGC$AA, and CAGAGA$ is the only string ending in $ with this BWT.Note that BWT with and without end-of-string symbol can be quite different, as the example shows.
An important parameter of the BWT of string T is the number of runs r(T ) = runs(BWT(T )).It is well-known that on repetitive inputs, the BWT tends to produce long runs of the same character, making it amenable to compression via runlength-encoding (RLE).In our example, r(CAGAGA) = 3, while the original string has 6 runs.This property, referred to as the clustering effect of the BWT, is taken advantage of by compressed data structures such as the RLFM-index [43] or the r-index [24].
Next we define the omega-order [45] on Σ * : S ≺ ω T if root(S) = root(T ) and exp(S) < exp(T ), or if S ω < lex T ω (implying root(S) ̸ = root(T )), where T ω denotes the infinite string obtained by concatenating T infinitely many times.The omega-order relation coincides with the lexicographic order if neither of the two strings is a proper prefix of the other.The two orders can differ otherwise, e.g.GT < lex GTC but GTC ≺ ω GT.
For a multiset of strings M = {T 1 , . . ., T k }, the extended Burrows-Wheeler Transform, eBWT(M) [45], is a permutation of the characters of the strings in M, given by concatenating the last characters of the conjugates of each T i , for i = 1, . . ., k, listed in omega-order.For example, the omega-sorted conjugates of M = {GTC, GT} are: CGT, GTC, GT, TCG, TG, hence, eBWT(M) = TCTGG, see Figure 3. Again, adding the indices of the input conjugates, in this case 2, 3, makes the eBWT uniquely reversible.For more on the eBWT, see [45].

BWT variants for string collections
We identified six distinct transforms, listed below, which were computed by the tools given in Table 1.Let M = {T 1 , . . ., T k } be a multiset of strings, with total length Since several of the data structures depend on the order in which the strings are listed, we implicitly regard M as a list [T 1 , . . ., T k ], and write ρ(M) for a specific input order ρ in which the strings are presented.
Because all BWT variants except the eBWT use additional end-of-string symbols as string separators, we refer to these by the collective term separator-based BWT variants.In Table 4 we show the six transforms on our running example of 5 DNA-strings, and give first properties of these transforms.For ease of exposition and comparison, we replaced all separator-symbols by the same dollar-sign $, even where, conceptually or concretely, different dollar-signs are assumed to terminate the individual strings.This is the case for mdolBWT and its special cases, colexBWT and optBWT.Moreover, the concBWT contains one additional character, the final end-of-string symbol, here denoted by #, which is smaller than all other characters; thus, the additional rotation starting with # is the smallest and results in an additional dollar in the first position of the transform.To facilitate the comparison with the other transforms, we remove this first symbol from concBWT and replace the # by $.
It is important to point out that the programs listed in Table 1 do not necessarily use the definitions given here; however, in each case, the resulting transform is the one claimed, up to renaming or removing separator characters, see Sections 3.1 and 3.2.

The effect of adding separator symbols
The first obvious difference between the eBWT and the separator-based variants is their length: eBWT(M) has length N M , while all other variants have length N M + k, since they GGAAACGG$$$TTACTGT$AAA$ lexicographic order of strings yes mdolBWT(M) GAGAAGCG$$$TTATCTG$AAA$ input order of strings no concBWT(M) AAGAGGGC$$$TTACTGT$AAA$ lexicographic order of subsequent strings in input no colexBWT(M) AAAGGCGG$$$TTACTGT$AAA$ colexicographic order of strings yes optBWT(M) AAAGGGGC$$$TTACTTG$AAA$ order given by Bentley et al.'s algorithm [4] yes Table 4 Overview of some properties of the six BWT variants considered in this paper.The colors in the example BWTs correspond to interesting intervals in separator-based variants, while the same characters are highlighted in the eBWT for showing their positions, see Section 3.2.
contain an additional character (the separator) for each input string.
In all separator-based transforms, the k-length prefix consists of a permutation of the last characters of the input strings.This is because the rotations starting with the dollars are the first k lexicographically.On the other hand, in the eBWT, these k characters occur interspersed with the rest of the transform; namely, in the positions corresponding to the omega-ranks of the input strings T i (see Table 4).
In gerenal, adding a $ to the end of the strings introduces a distinction, not present in the eBWT, between suffixes and other substrings: since the separators are smaller than all other characters, occurrences of a substring as suffix will be listed en bloc before all other occurrences of the same substring, while in the eBWT, these occurrences are listed interspersed with the other occurrences of the same substring.
▶ Example 1.Let M = {AACGAC, TCAC} and U = AC.U occurs both as a suffix and as an internal factor; the characters preceding it are A (internal substring) and C,G (suffix), and we have eBWT(M) = CGACATAACC, dolEBWT(M) = CC$GCAAATAC$.
Finally, it should be noted that adding end-of-string symbols to the input strings changes the definition of the order applied.As observed above, the omega-order coincides with the lexicographic order on all pairs of strings S, T where neither is a proper prefix of the other; but with end-of-strings characters, no input string can be a proper prefix of another.Thus, on rotations of the T i $'s, the omega-order equals the lexicographic order.As an example, consider the multiset M = {GTC$, GT$} from Section 2: we have the following omega-order among the rotations: $GT, $GTC, C$GT, GT$, GTC$, T$G, TC$G (see Table 3), which coincides with the lexicographic order.Similarly, adding different dollars $ 1 , $ 2 , . . ., $ k and applying the omega-order results again in the lexicographic order between the rotations, with different dollar symbols considered as distinct characters.This implies the following: , where lex(M) denotes the lexicographic order of the strings in M; 2. mdolBWT(M) = eBWT({T i $ i | i = 1, . . ., k}), up to renaming of dollars.
Proof.For 1., note that the order of the strings T i $ is the lexicographic order, since T i $ < ω T j $ if and only if T i $ < lex T j $ if and only if T i < lex T j .For 2., consider that no two rotations can be one prefix of the other, resulting in the lexicographic order between rotations, with the dollar-signs breaking ties.The only difference between the two transforms is now that for i > 1, $ i in the right transform is replaced by $ i−1 in the left transform, and $ 1 is replaced by $ k .◀ Regarding the differences among the separator-based BWT variants, we will show that all differences occur in certain well-defined intervals of the BWT, and that the differences themselves depend only on a specific permutation of {1, . . ., k}, given by the combination of the input order, the lexicographic order of the input strings, and the BWT variant applied.In Tables 5 and 6, we give the full BWT matrices for all BWT variants.

Interesting intervals
Let us call a string U a shared suffix w.r.t.multiset M if it is the suffix of at least two strings in M. Let b be the lexicographic rank of the smallest rotation beginning with U $ and e the lexicographic rank of the largest rotation beginning with U $, among all rotations of strings T $, where T ∈ M. (One can think of [b, e] as the suffix-array interval of U $.) We call [b, e] an interesting interval if there exist i ̸ = j s.t.U is a suffix of both T i and T j , and Table 6 From left to right we show the mdolBWT, the colexBWT, and the optimal BWT of the string collection M = {ATATG, TGA, ACG, ATCA, GGA}.Indices are given with reference to the numbering T1 = ATATG, T2 = TGA, T3 = ACG, T4 = ATCA, T5 = GGA.Note that we give the rotations according to Lemma 2. the preceding characters in T i and T j are different, i.e., the two occurrences of U as suffix of T i and T j constitute a left-maximal repeat.(Put in different terms, interesting intervals correspond to internal nodes in the suffix tree of the reverse string, within the subtree of $.) Clearly, [1, k] is an interesting interval unless all strings end with the same character.Note that interesting intervals differ both from the SAP-intervals of [14] and from the tuples of [4] (called maximal row ranges in [46]): the former are the intervals corresponding to all shared suffixes U , even if not left-maximal, while the latter include also suffixes U that are not shared.

▶ Lemma 3. Any two distinct interesting intervals are disjoint.
Proof.Follows from the fact that no two distinct substrings ending in $ can be one prefix of the other.◀ We can now narrow down the differences between any two separator-based BWTs of the same multiset.The next proposition states that these can only occur in interesting intervals (part 1).This implies that the dollar-symbols appear in the same positions in all separator-based variants except for one very specific case (part 2).Moreover, we get an upper bound on the Hamming distance between two separator-based BWTs (part 3).▶ Proposition 4. Let L 1 and L 2 be two separator-based BWTs of the same multiset M.
Since all separator-based BWT variants use the lexicographical order of the rotations, this means that there exists a substring U which is preceded by x in one string T j and by y in another T j ′ , the first occurrence has rank i in one BWT and the other has rank i in the other BWT variant.This implies that the two occurrences are followed by two dollars, and either the two dollars are different, or they are the same dollar, and the subsequent substrings are different.Therefore, U defines an interesting interval.Parts 2. and 3. follow from 1. ◀ Proposition 4 implies that the variation of the different transforms can be explained based solely on what rule is used to break ties for shared suffixes.We will see next how the different BWT variants determine this tie-breaking rule.

Permutations induced by separator-based BWT variants
Let us now restrict ourselves to M being a set, i.e., no string occurs more than once.(Later we will show how to deal with multisets.)As we showed in the previous section, the only differences between the separator-based BWT variants are given by the order in which shared suffixes are listed.It is also clear that the same order applies in each interesting interval, as well as to the k-length prefix of the transform.Therefore, it suffices to study the permutation π of the k dollars in this prefix.
Since the strings are all distinct, they each have a unique lexicographic rank within the set M. Thus the input order can be seen as a permutation ρ of the lexicographic ranks2 ; if the strings are input in lexicographic order, then ρ = id.For our toy example M = [ATATG, TGA, ACG, ATCA, GGA], we have ρ = 25134.
Let us now define as output permutation π the permutation of the last characters of the input strings, as found in the k-length prefix of the BWT variant in question.We will denote the output permutations of the dolEBWT, mdolBWT, concBWT, and colexBWT by π dolE , π mdol , π conc , and π colex , respectively.(As the permutation of optBWT is algorithmically defined, we do not treat it here.)Again, we give these permutations w.r.t. the lexicographic ranks of the strings.In our running example, we have π dolE = 12345, π mdol = 25134, π colex = 34512, and π conc = 45132.
It is easy to see that the output permutation π mdol is equal to ρ, since the dollar-symbols are ordered according to ρ.For the dolEBWT, the rank of $T i equals the lexicographic rank of T i among all input strings (Lemma 2), i.e., π dolE = id.Further, π colex = γ by definition, where γ denotes the colexicographic order of the input strings.The situation is more complex in the case of concBWT.Since the # is the smallest character, the last string of the input will be the first, while for the others, the lexicographic rank of the following string decides the order.In our running example, π conc = 45132.We next formalize this.
Let Φ ρ be the linking permutation [33] of ρ, defined by Φ ρ (i) = ρ(ρ −1 (i) + 1), for i ̸ = ρ(k), and Φ ρ (ρ(k)) = ρ(1), the permutation that maps each element to the element in the next position and the last element to the first.Let us also define, for j ∈ {1, . . ., k} and i ̸ = j, f j (i) by f j (i) = i if i < j and i − 1 otherwise, i.e. f j (i) gives the rank of element i in the set {1, . . ., k} \ {j}.The next lemma gives the precise relationship between ρ and π conc .
▶ Lemma 5. Let ρ be the permutation of the input order w.r.t. the lexicographic order, i.e. the ith input string has lexicographic rank ρ(i).Then π conc = π conc (ρ) is given by: Proof.Follows straightforwardly from the tie-breaking rule of concBWT.◀ Essentially, Lemma 5 says that π conc is the BWT of ρ. (We thank Massimiliano Rossi for this observation.)This can be seen as follows.Take the string collection M in order ρ and construct a new string T ρ concatenating the lexicographic ranks of the strings in M with a final dollar, in our example T ρ = 25134$; thus, T ρ is a string over the alphabet {1, 2, . . ., k} with an additional dollar at the end.It follows from Lemma 5 that the output permutation π conc is the BWT of T ρ , from which the $-sign was removed: BWT(25134$) = 45$132, therefore, π conc = 45132.As can be seen already for k = 3, not all permutations π are reached by this mapping.We will call a permutation π conc-feasible if there exists an input order ρ such that π conc (ρ) = π.For k = 4, there are 18 conc-feasible permutations (out of 24), for k = 5, 82 (out of 120).In Table 7, we give the percentage of conc-feasible permutations π, for k up to 11.The lexicographic order is always conc-feasible, namely with ρ = k, k − 1, . . ., 2, 1; the colex order is not always conc-feasible, as the following example shows.▶ Example 7. Let M = {ACA, TGA, GAA}, thus ρ = 132, γ = 213, but as we have seen, no permutation of the strings in M will yield this order for concBWT.In particular, the colexBWT(M) = AAAACGG$AT$$ has 7 runs, while all conc-feasible concatBWTs have at least 8: An important consequence is that, given an input permutation ρ, the output permutations induced by mdolBWT and concBWT are always different: π mdol ̸ = π conc holds always, since π conc (1) = ρ(k).This means that, in whatever order the strings are given, on most string sets the resulting transforms mdolBWT and concBWT will differ.

Permutations on multisets
Now let M be a multiset, so the same string can be contained more than once in M. Let us again map M to a string T ρ over the alphabet of the lexicographic ranks {1, 2, . . ., k ′ }, where k ′ ≤ k, and let us define the output order π as before, as the order in which the lexicographic ranks appear in the k-length prefix of the BWT variant. 3Then, π mdol = T ρ , π dolE is the sequence of non-decreasingly sorted ranks; and π colex is non-decreasing w.r.t. the colexicographical order.Finally, again π conc is the BWT of T ρ $ from which the dollar-sign has been removed.Let us denote by BWT * (T ) the string BWT(T $) with the dollar removed.It has been shown experimentally that more than half of binary and ternary strings of length between 10 and 20 do not lie in the image of the function BWT * , with the percentage of those not in the image increasing with increasing length [26].As already seen for permutations (Table 7), the function BWT * is not surjective; the results of [26] seem to indicate that, in fact, the majority of multi-permutations cannot be produced by concBWT.
On the other hand, clearly all multi-permutations can be produced with mdolBWT, as in that case, the output permutation is the same as the input permutation.Moreover, we have seen that all separator-based BWT-variants can be simulated by the mdolBWT transform, since it suffices to apply mdolBWT to the output permutation of the desired variant.We summarize: ▶ Proposition 9. Let M be given, and ρ the order of the lexicographic ranks in which the strings appear in M. Then 1. dolEBWT(M) = mdolBWT(λ(M)), with λ the lexicographic order; 2. colexBWT(M) = mdolBWT(γ(M)), with γ the colexicographic order; 3. concBWT(M) = mdolBWT(β(M)), where β = BWT * (T ρ ) and T ρ is the meta-string consisting of the lexicographic ranks of the input strings.

Effects on the parameter r
What is the effect of the different permutations π of the strings in M, induced by these BWT variants, on the number of runs of the BWT?As the following example shows, the number of runs can differ significantly between different variants.
The results of Section 3 give us a method to measure the degree to which the BWT variants can differ.Proof.Place the n a a-characters in a row, creating n a + 1 gaps, namely one between each adjacent a, and one each at the beginning and at the end.Now place all b-characters, each in a different gap; since n a is maximum, there are enough gaps.Then place all c's, first filling gaps that are still empty, if any, then into gaps without c, etc.We never have to place two identical characters in the same gap.If the total number of non-a-characters is at least than n a − 1, then we can fill every gap, thus separating all a's, and creating a run for every character of I.If we have fewer than n a − 1 characters, then we are still creating two runs with each non-a-character, but we cannot separate all a's.◀ We will use this lemma to measure the variability of a dataset: .
The colexBWT has been shown experimentally to yield a low number of runs of the BWT [37,14].Even though it does not always minimize r (one can easily create small examples where other permutations yield a lower number of runs), we can bound its distance from the optimum.The algorithm of Bentley et al. [4] for the optimal order for mdolBWT is based on the idea of starting from the colex-order and then adjusting, where possible, the order of the runs within interesting intervals in order to minimize character changes at the borders, i.e. such that the first and the last run of each interesting interval is identical to the run preceding and following that interesting interval.This is equivalent to sorting groups of sequences sharing the same left-maximal suffix.This sorting can be done on each interesting interval independently without affecting the other interesting intervals.In Table 6, we show the result on our toy example, where it reduces the number of runs by 2 w.r.t.colex order.In the next section, we compare the number of runs of the non-separator based BWT variants to the optimum.

Experimental results
We computed the five BWT variants eBWT, dolEBWT, mdolBWT, concBWT, and colexBWT, on eight different genomic datasets.We used the tool optimalBWT to compute the minimum number of runs (i.e., that of optBWT) and used this as a baseline for comparison with the r parameter of the other BWT-variants.For mdolBWT and concBWT, we used the default input order in which the dataset was downloaded.The eight datasets have different characteristics: Four of the datasets contain short reads: SARS-CoV-2 short [58], Simons Diversity reads [44], 16S rRNA short [64], Influenza A reads [63], and four contain long sequences: SARS-CoV-2 long [27], 16S rRNA long [17], Candida auris reads [65], one of which, SARS-CoV-2 genomes, whole viral genomes [6].The main features of the datasets, including the number of sequences, sequence length, and the mean runlength of the optimal BWT are reported in Table 8.We include the details of the experimental setup in the Appendix.
On each of the datasets, we computed the pairwise Hamming distance between separatorbased BWTs.To compare them to the eBWT, we computed the pairwise edit distance on a small subset of the sequences (for obvious computational reasons), computing also the Hamming distance on the small set, for comparison.We generated the following statistics on each of the data sets: the number of interesting intervals, the fraction of positions within interesting intervals (total length of interesting intervals divided by total length of the dataset), and the dataset's variability (Def.12).In Table 10 and 11, we include a compact version of these results for the two datasets with the highest and the lowest variation between the BWT variants, the SARS-CoV-2 short sequences and the SARS-CoV-2 genomes, respectively.The full experimental results for all eight datasets are contained in the Appendix.
In Table 9 we give a brief summary of the results, reporting, for each dataset, the fraction of positions in interesting intervals, the dataset's variability, the average pairwise Hamming distance between separator-based BWT variants, and the maximum and minimum value, among the different BWT variants, of the average runlength (n/r) of the BWT.
The experiments showed a high variation in the number of runs in particular on datasets of short sequences.The highest difference was between colexBWT and concBWT, by a multiplicative factor of over 4.2, on the SARS-CoV-2 short dataset.In Figure 1 we plot the average runlength n/r for the four short sequence datasets, and the percentage increase of the number of runs w.r.t.r OP T .The variation is less pronounced on the one dataset which is less repetitive, namely Simons Diversity reads.Recall that the mdolBWT and concBWT vary depending on the input permutation.On most long sequence datasets, on the other hand, the differences were quite small (see Appendix).Recall also that the mdolBWT and concBWT vary depending on the input permutation.To better understand how far the colexBWT is from the optimum w.r.t. the number of runs, we plot in Figure 2 the number of runs of colexBWT w.r.t. to r opt , on all eight datasets.The strongest increase is on short sequences, where the variation among all BWT variants is high, as well; on the long sequence datasets, with the exception of SARS-CoV-2 long sequences, the colexBWT is very close to the optimum; however, note that on those datasets, all BWTs are close to the optimum.
The average number of runs and the average pairwise Hamming distance strongly depend on the length of the sequences in the input collection.If the collection has a lot of short sequences which are very similar, then the differences between the BWTs both w.r.t. the number of runs, and as measured by the Hamming distance, can be large.This is because there are a lot of maximal shared suffixes and so, many positions are in interesting intervals.To better understand this relationship, we plotted, in Figure 3, the average Hamming distance against the two parameters variability and fraction of positions in interesting intervals.We see that the two datasets with highest average Hamming distance, SARS-CoV-2 short dataset and the Simons Diversity reads, have at least one of the two values very close to 1, while for those datasets where both values are very low, the BWT variants do not differ very much.
Note that the input order used by the mdolBWT and the concBWT is the order in which the input sequences appear when the dataset is downloaded.Our study shows that only a few input permutations can minimize the number of runs of the resulting BWT, namely those orders that group the characters inside the interesting intervals in at most σ runs, such as the order of Bentley et al. and the colexicographic order.However, since there are k! possible input permutations, selecting an arbitrary input order will likely result in a BWT whose number of runs is much larger than the optimal one, especially on datasets with high variability.9 Summary of the results on the eight datasets.From left to right we report dataset names followed by the ratio of positions in interesting intervals, the variability of the dataset (see Def. 12), the average normalized Hamming distance between any two separator-based BWT variants.In the last two columns we report the maximum and minimum average runlength (n/r) taken over all five BWT variants.Table 11 Results for the SARS-CoV-2 genomes dataset.Top left: absolute and normalized pairwise Hamming distance between separator-based BWT variants.Top right: summary of the dataset properties.Bottom left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection.Bottom right: number of runs and average runlength (n/r) taken over all BWT variants.

Conclusion
We presented the first study of the different variants of the Burrows-Wheeler Transform for string collections.We found that the transforms computed by different tools differ not insignificantly, as measured by the pairwise Hamming distance: up to 12% between different BWT variants on the same dataset in our experiments.We showed that most current tools implement BWT variants that are input order dependent, so that the same tool can produce different outputs if the input set is permuted.These differences extend also to the number of runs r, a parameter that is central in the analysis of BWT-based data structures, and which is increasingly being used as a measure of the repetitiveness of the dataset itself.
With string collections replacing individual sequences as the prime object of research and analysis, and thus becoming the standard input for text indexing algorithms, we believe that it is all the more important for users and researchers to be aware that not all methods are equivalent, and to understand the precise nature of the BWT variant produced by a particular tool.
We suggest further to standardize the definition of the parameter r for string collections, using either the colexicographic order (implemented by the tool ropebwt2 [37]) or the optimal order of Bentley et al. [4] (implemented by the tool optimalBWT [11]).In this paper, we found that the number of runs can vary by up to a factor of 4.2 on real-life biological datasets, while in [11], a factor of 31 was shown on other biological data.Not only does this heavily impact the space requirements of BWT-based data structures, but it also means that using the average runlength n/r as a repetitiveness measure of a dataset is ambiguous, unless the research community agrees on the BWT variant being used for the definition of this parameter.

A Experimental setup
All datasets are stored in FASTA format.We used three tools for computing the five BWT variants; pfpebwt, ropebwt2 and Big-BWT.In order to make the BWTs comparable we did some adaptations to both tools and inputs.We modified ropebwt2 to make it work with the same character order as the other tools, i.e. $ < A < C < G < N < T. Then we used ropebwt2 for computing both the mdolBWT and the colexBWT using the -R and -R -s flags respectively.We used for constructing both the eBWT and the dolEBWT variants.In order to compute the dolEBWT, we modified the input files, appending an end-of-string character at the end of each sequence.Finally, for computing the concBWT, we removed the headers from the FASTA files, arranging the sequences in newline separated files, and ran Big-BWT without additional flags on these newline separated files.

B Further information on the tools
We tested all 18 tools extensively, and determined which data structure they compute, using both our tests and the algorithm descriptions in the respective papers.In this section, we include further information about some of these tools.
pfpebwt is a tool computing the eBWT of string collections (https://github.com/davidecenzato/PFP-eBWT.git).It takes in input a fasta file and gives in output the eBWT in either plain ASCII text or RLE (run-length-encoded) format.We used (a) no flags for long sequences, and (b) the flags -w 10 -p 10 -n 3 --reads for short sequences.We included it in two different rows of Table 1 because by default pfpebwt computes the eBWT, but it can compute the dolEBWT if the sequences have explicit end-of-string characters (not in multi-thread mode).cais is a tool implementing the SAIS_for_eBWT algorithm [6], which computes both the eBWT and the dolEBWT (https://github.com/davidecenzato/cais.git) depending on the input flag.It takes in input a fasta file, a fastq file, or a plain text file and gives in output the final transform in plain ASCII text.The -c and -a flags enable to output the conjugate array along with the resulting BWT.
G2BWT is a tool computing the dolEBWT of short sequence collections (https://bitbucket.org/DiegoDiazDominguez/lms_grammar/src/bwt_imp2).It takes in input newline separated files.Even though it is not stated explicitly, this tool computes the dolEBWT because, when it constructs the grammar, it uses dollars for separating adjacent strings.Thus, also the string rotations will contain dollars.We tested it using the default settings.msbwt is a tool implementing the Holt and McMillan [30] merge-based BWT construction algorithm (https://github.com/holtjma/msbwt.git).It takes in input a list of one or several fastq files.Even if this tool uses the BCR approach [3] for computing the BWTs to merge, it actually computes the dolEBWT.This is because it features a preprocessing where it sorts the input strings lexicographically.Thus, the resulting mdolBWT corresponds to the dolEBWT.BEETL is a suite containing several tools, including a tool computing the mdolBWT of string collections using an implementation of the BCR and BCR-ext algorithms [3] (https://github.com/BEETL/BEETL.git).This tool requires that all input sequences have to have the same length.We tested this tool using -output-format ASCII and -concatenate-output flags.This tool also computes the a BWT variant similar to the colexBWT by using the -sap-ordering flag (BCR-ext mode only).
BCR_LCP_GSA is a tool computing the mdolBWT of sting collections in semi-external memory (https://github.com/giovannarosone/BCR_LCP_GSA).It implements an algorithm similar to BCR contained in the BEETL tool, but it can process a string collection containing sequences of different lengths.It takes in input a fasta file, a fastq file, or a gz-compressed fastq file.It computes the mdolBWT following the method of Bauer et al., described in [3].We set the 'dataTypeLengthSequences' variable in Parameters.h to 1.
ropebwt2 is a tool computing the FM-index and the mdolBWT of string collections (https://github.com/lh3/ropebwt2.git), using an approach similar to BCR.It takes in input a fasta file, a fastq file, or a gz compressed fastq file.We listed it in two different rows of Table 1 because it computes the mdolBWT or the colexBWT, depending on the flags.We used the -R and the -R -s flags, respectively, to obtain the two transforms.In addition, we modified main.c in order to change the order of the characters to $ < A < C < G < N < T. merge-BWT computes the mdolBWT of a string collection by merging the BWTs of subcollections of the input (https://github.com/jltsiren/bwt-merge.git).It takes in input a list of one or several mdolBWTs.The order of the dollars will depend on the order in which the input BWTs are listed.We tested it using -i plain_sorted and -o plain_sorted flags.We computed the BWTs of the subcollections using ropebwt2.nvSetBWT is a tool included in nvbio suite (https://github.com/NVlabs/nvbio.git).It takes in input either a fastq or a newline separated file.We tested it using the -R flag for skipping the reverse strand.However, even if the algorithmic descriptions in [55,39] seem to describe the mdolBWT, the output of the current version (version 1.1) does not correspond to a possible BWT because the Parikh vector is different from that of the input.
eGSA computes the generalized enhanced suffix array and the mdolBWT of a string collection (https://github.com/felipelouza/egsa.git).It takes in input a text file, a fasta file, or a fastq file.It uses the gSACA-K algorithm for computing the suffix array of subcollections of the input and then merges all suffix arrays.Thus it computes the mdolBWT.We tested it with the -b flag.
eGAP computes the mdolBWT, and optionally the LCP-array (longest common prefix array) and DA (document array) of a string collection (https://github.com/felipelouza/egap.git).It takes in input a newline separated file, a fasta file, or a fastq file.We tested it with default settings.
bwt-lcp-parallel computes the mdolBWT and the LCP-array of a collection of short sequences (https://github.com/AlgoLab/bwt-lcp-parallel.git).It takes in input fasta files and does not support the N character.We tested it using standard settings.
gsufsort computes the SA, LCP and mdolBWT of a string collection (https://github.com/felipelouza/gsufsort.git), using the gSACA-K algorithm of [40].It takes in input a newline separated file, a fasta file, or a fastq file.We tested it using --fasta and --bwt flags.
grlBWT is a tool computing the mdolBWT of string collections using an induced suffix sorting based algorithm that keeps the intermediate data structures in compressed form (https://github.com/ddiazdom/grlBWT).It takes in input a concatenated string collection and gives in output the mdolBWT in run-length compressed form.We tested it with the default parameters and used new-line separated files as input.
BigBWT computes the concBWT, and optionally the suffix array, of a highly repetitive text or string collection (https://github.com/alshai/Big-BWT.git) using the Prefix-free parsing (PFP) algorithm.It takes in input a newline separated file or a fasta file.This tool with the -f flag is used internally in the r-index implementation (https://github.com/alshai/r-index), producing the BWT of the strings concatenated without dollars, thus, the end-of-string symbols have to be added explicitly.On the other hand, the tool without the -f flag will compute the BWT of the fasta files without skipping the fasta headers.We used standard parameters and as input newline separated files, the output then is the concBWT.r-pfbwt is a tool which computes the run-length encoded concBWT by using a similar algorithm than BigBWT (https://github.com/marco-oliva/r-pfbwt).However, unlike BigBWT, r-pfbwt employs an improved version of the PFP algorithm, which allows the process of even larger datasets through a recursive pre-processing of the input.We tested it using the -bwt-only flag and computed the PFP data structures using the pfp++ software (https://github.com/marco-oliva/pfp.git).CMS-BWT is a tool computing the concBWT by using the matching statistics to speed up the BWT computation and reduce the memory footprint for large and repetitive datasets (https://github.com/fmasillo/CMS-BWT.git).Unlike the other software it requires two input files, one containing a string collection and another containing a reference sequence.It takes in input fasta files and outputs the resulting BWT in plain format or run-length encoding.We tested it using the default parameters.
optimalBWT is a tool computing the optimal BWT of Bentley et al., it features two different construction algorithms, a variant of SAIS of Nong et al. which works in internal memory and a variant of BCR working in semi-external memory (https://github.com/davidecenzato/optimalBWT.git).It takes in input either a fasta or fastq file and outputs the resulting BWT in plain ascii text.We tested it using both -a sais and -a bcr flags.Table 12 Results for the SARS-CoV-2 short dataset.First row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants.First row right: summary of the dataset properties.Second row: number of runs and average runlength (n/r) of all BWT variants.Third row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants on a subset of the input collection.Third row right: summary of the dataset properties of a subset of the input collection.Fourth row left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection.Fourth row right: number of runs and average runlength (n/r) of all BWT variants on a subset of the input collection.Table 13 Results for the Simons Diversity reads dataset.First row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants.First row right: summary of the dataset properties.Second row: number of runs and average runlength (n/r) of all BWT variants.Third row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants on a subset of the input collection.Third row right: summary of the dataset properties of a subset of the input collection.Fourth row left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection.Fourth row right: number of runs and average runlength (n/r) of all BWT variants on a subset of the input collection.

C Results on individual datasets
16S rRNA short (500,000 short sequences) Table 14 Results for the 16S rRNA short dataset.First row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants.First row right: summary of the dataset properties.Second row: number of runs and average runlength (n/r) of all BWT variants.Third row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants on a subset of the input collection.Third row right: summary of the dataset properties of a subset of the input collection.Fourth row left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection.Fourth row right: number of runs and average runlength (n/r) of all BWT variants on a subset of the input collection.Table 15 Results for the Influenza A reads dataset.First row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants.First row right: summary of the dataset properties.Second row: number of runs and average runlength (n/r) of all BWT variants.Third row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants on a subset of the input collection.Third row right: summary of the dataset properties of a subset of the input collection.Fourth row left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection.Fourth row right: number of runs and average runlength (n/r) of all BWT variants on a subset of the input collection.Table 17 Results for the 16S rRNA long dataset.First row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants.First row right: summary of the dataset properties.Second row: number of runs and average runlength (n/r) of all BWT variants.Third row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants on a subset of the input collection.Third row right: summary of the dataset properties of a subset of the input collection.Fourth row left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection.Fourth row right: number of runs and average runlength (n/r) of all BWT variants on a subset of the input collection.Table 18 Results for the Candida auris reads dataset.First row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants.First row right: summary of the dataset properties.Second row: number of runs and average runlength (n/r) of all BWT variants.Third row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants on a subset of the input collection.Third row right: summary of the dataset properties of a subset of the input collection.Fourth row left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection.Fourth row right: number of runs and average runlength (n/r) of all BWT variants on a subset of the input collection.Table 19 Results for the SARS-CoV-2 genomes dataset.First row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants.First row right: summary of the dataset properties.Second row: number of runs and average runlength (n/r) of all BWT variants.Third row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants on a subset of the input collection.Third row right: summary of the dataset properties of a subset of the input collection.Fourth row left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection.Fourth row right: number of runs and average runlength (n/r) of all BWT variants on a subset of the input collection.

▶ Lemma 11 .
Let [b, e] be an interesting interval, and (n 1 , . . ., n σ ) the Parikh vector of L[b..e], i.e. n i is the number of occurrences of the ith character.Let a be such that n a = max i n i , and N a = (e − b + 1) − n a , the sum of the other character multiplicities.Then the maximum number of runs in interval [b, e] is e − b + 1 if n a − 1 ≤ N a , and 2N a + 1 otherwise.

Definition 12 .
Let M be a multiset.For an interesting interval [b, e], let var([b, e]) be the upper bound on the number of runs in [b, e] from Lemma 11.Then the variability of M is var(M) = [b,e] interesting interval var([b, e]) [b,e] interesting interval (e − b + 1)

Proposition 13 .
Let L be the colexBWT of multiset M, and let r OPT denote the minimum number of runs of any separator-based BWT of M. Then runs(L) ≤ r OPT + 2 • c M , where c M is the number of interesting intervals.Proof.Let I = [b I , e I ] be an interesting interval containing d distinct characters, and let U be the shared suffix defining I. Since the strings are listed according to the colex order, all strings in which U is preceded by the same character will appear in one block, and therefore, L has exactly d runs in the interval I. Let L b I −1 = x and L e I +1 = y.If x occurs in I and it is not the first run of I (i.e., L b I ̸ = x), then listing first the strings where U is preceded by x would reduce the number of runs by 1; similarly, listing those where y precedes U as last of the group would reduce the number of runs by 1.By Prop.4, this is the only possibility for varying the number of runs.◀ dataset

Figure 1
Figure1Results regarding r on short sequence datasets, of all BWT variants.Left: average runlength (n/r).Right: number of runs (percentage increase with respect to optimal BWT).

Figure 2
Figure 2 Number of runs of the colexBWT with respect to optimal BWT (percentage increase) on all eight datasets.

Figure 3
Figure 3 Average normalized Hamming distance variations with respect to variability and fraction of positions in interesting intervals on all datasets.

Table 8
[4]mary of the most important parameters of the eight datasets.From left to right we report the dataset name, the number of sequences, the total length, the average, minimum and maximum sequence length, and the average runlength n/r of the optimum BWT according to Bentley et al.[4].

Table 10
Results for the SARS-CoV-2 short dataset.Top left: absolute and normalized pairwise Hamming distance between separator-based BWT variants.Top right: summary of the dataset properties.Bottom left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection.Bottom right: number of runs and average runlength (n/r) taken over all BWT variants.

Table 16
Results for the SARS-CoV-2 long dataset.First row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants.First row right: summary of the dataset properties.Second row: number of runs and average runlength (n/r) of all BWT variants.Third row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants on a subset of the input collection.Third row right: summary of the dataset properties of a subset of the input collection.Fourth row left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection.Fourth row right: number of runs and average runlength (n/r) of all BWT variants on a subset of the input collection.