Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT)

Motivation: Over the last few years, methods based on suffix arrays using the Burrows–Wheeler Transform have been widely used for DNA sequence read matching and assembly. These provide very fast search algorithms, linear in the search pattern size, on a highly compressible representation of the dataset being searched. Meanwhile, algorithmic development for genotype data has concentrated on statistical methods for phasing and imputation, based on probabilistic matching to hidden Markov model representations of the reference data, which while powerful are much less computationally efficient. Here a theory of haplotype matching using suffix array ideas is developed, which should scale too much larger datasets than those currently handled by genotype algorithms. Results: Given M sequences with N bi-allelic variable sites, an O(NM) algorithm to derive a representation of the data based on positional prefix arrays is given, which is termed the positional Burrows–Wheeler transform (PBWT). On large datasets this compresses with run-length encoding by more than a factor of a hundred smaller than using gzip on the raw data. Using this representation a method is given to find all maximal haplotype matches within the set in O(NM) time rather than O(NM2) as expected from naive pairwise comparison, and also a fast algorithm, empirically independent of M given sufficient memory for indexes, to find maximal matches between a new sequence and the set. The discussion includes some proposals about how these approaches could be used for imputation and phasing. Availability: http://github.com/richarddurbin/pbwt Contact: richard.durbin@sanger.ac.uk


INTRODUCTION
Given a large collection of aligned genetic sequences, or haplotypes, it is often of interest to find long matches between sequences within the collection, or between a new test sequence and sequences from the collection. For example, sufficiently long identical substrings are candidates to be regions that are identical by descent (IBD) from a common ancestor. (I will use the word 'substring' to denote contiguous subsequences, as is standard in the computer science text matching literature.) When using imputation approaches to infer missing values one wants to identify sequences that are as close as possible to the test sequence around the location being imputed, such as those that are IBD, or at least share long matches with the test sequence. Maximizing the number of such long matches could also form the basis of genotype phasing.
Naive substring match testing would take OðN 2 MÞ time for each test sequence, where there are N variable sites and M sequences, and hence OðN 2 M 2 Þ time for complete all-pairs comparison within a set of sequences. By keeping a running match score to find maximal matches as in BLAST, it is straightforwardly possible to reduce this to O(NM) per single test, and so OðNM 2 Þ across the whole collection, but this is still large for large M. Recently suffix-array-based methods have proved powerful in standard sequence matching, as exemplified by Bowtie (Langmead et al., 2009), BWA (Li and Durbin, 2009) and SOAP2 . Here an approach based on suffix arrays is described that can find best matches within a set of sequences in O(NM) time, following preprocessing of the dataset also in O(NM) time, and empirically best single haplotype matches in O(N) time.
The differences between the algorithms described here and standard suffix array based sequence matching are derived from the fact that there are many sequences that are all of the same length and already aligned. So on the one hand there is no need to consider offsets of the test sequence with respect to the sequences in the collection, but on the other hand the test sequence is long and we are looking for maximal matches of an arbitrary substring of the test sequence, not of the whole test sequence.

APPROACH
When looking at genetic data from humans or other diploid organisms, there are two underlying genome sequences per person, one from their father and one from their mother. These are known as 'haplotype' sequences. Here I consider the case where we are given these two sequences separately, rather than unphased diploid 'genotype' sequences, where the two haplotype sequences have been observed together.
Consider a set X of M haplotype sequences x i , i ¼ 0, . . ., ðM À 1Þ over N variable sites indexed by k, numbered from 0 to (N À 1). We can take all the sites to be bi-allelic with values 0 or 1, so a typical site x i ½k 2 f0, 1g: For any sequence s, let us write s½k 1 , k 2 Þ to represent the semi-open substring of s starting at k 1 and finishing at ðk 2 À 1Þ: We will say that there is a 'match' between s and t from k 1 to k 2 if s½k 1 , k 2 Þ ¼ t½k 1 , k 2 Þ, and this match is 'locally maximal' if there is no extension that is also a match, i.e. if k 1 ¼ 0 or s½k 1 À 1 6 ¼ t½k 1 À 1, and k 2 ¼ N or s½k 2 6 ¼ t½k 2 : When comparing s to the set of sequences X we say that s has a set-maximal match to x i from k 1 to k 2 if the match is locally maximal and there is no longer match from s to any other x j that includes the interval ½k 1 , k 2 Þ: For some applications we will be interested in the set-maximal matches within X, i.e. the set-maximal matches of each x i to Xnfx i g: Fundamental to our approach will be to consider a particular form of ordering on substrings of the set X of sequences. Here is an explanation for this ordering, with some motivation about why it is important. We will consider a separate ordering for each position k between 0 and N. For given k, let us order the sequences x in X so that their reversed prefixes x½0, kÞ are ordered, by which I mean that the reversed sequences of the prefixes running back from (k À 1) to 0 are ordered in the natural fashion, with them being ordered according to their index i in X if the prefixes are the same.
Let us consider a set-maximal match between two sequences in X from k 0 to k. If we sort in this reversed order at k, then the maximally matching sequences must be adjacent, because, if there were another prefix sorting in between them, then it would have to match both from k 0 to k because of sort order, and it would have to match one of the two at position ðk 0 À 1Þ, because the maximally matching sequences must take different values there and there are only two possible values. The new sequence would thus form a longer match, which contradicts the presumption that the match between the original pair was set-maximal. Strictly, this argument requires that the original match was set-maximal in both directions. We will see this is important below. However, this is just a motivating paragraph so there is no need to consider yet the case where it is setmaximal in only one direction.
Those with prior exposure to suffix array algorithms will notice that I talk here about prefixes and reversed ordering rather than suffixes and lexicographic ordering. In this text the direction of the standard theory is reversed so that algorithms process forwards naturally through the sequences from 0 to (N À 1), rather than backwards; this has no substantive effect on any of the algorithms or results, but will enable us to process very large datasets in the order in which they naturally come, and also makes some of the notation more natural.

Derivation of prefix array representation
The argument in the preceding paragraph suggests that having the sequences sorted in order of reversed prefixes at position k would help find maximal matches. It might seem that finding this sort order for all the prefixes at every position k would be computationally costly, but that is not the case. If we know the sort order at position k Algorithm 1 shows how to derive the sort order at position (k þ 1) by a simple process looking only at the k-th value of each sequence. We can therefore calculate the entire set of orderings for all k in a single pass through all the sequences, in time proportional to NM. Let a k ½i be the index m5M of the sequence x m from which the i-th prefix in the reversed ordering at k is derived. The array a k is a permutation of the numbers 0, . . ., ðM À 1Þ: Because we are often going to want to discuss the sequences sorted in the order of their prefixes, I define y k i to be the i-th sequence in this sorted ordering The key observation is that, conditional on the value of y k i ½k, the order of the elements of a kþ1 is the same as their order in a k . An illustration is given in Figure 1.
Algorithm 1 BuildPrefixArray-build the positional prefix array a kþ1 from a k u 0, v 0, create empty arrays a½, b½ the concatenation of a followed by b To identify where maximal matches start, we need to keep track of the start position of matches between neighboring prefixes. Formally, for i40 define d k ½i to be the smallest value j such that y i ½j, kÞ matches y iÀ1 ½j, kÞ (note that I have dropped the k suffix of the y's here and in the following for ease of notation, since its value is implicitly k for the time being). If y i ½k À 1 6 ¼ y iÀ1 ½k À 1 then set d k ½i ¼ k: It can then be shown that the start of any maximal match ending at k between any y i , y j ði5jÞ is given by max i5m j d k ½m: Using this we can efficiently extend Algorithm 1 to update d k in parallel with a k as we sweep through the data, as shown in Algorithm 2.
Algorithm 2 BuildPrefixAndDivergenceArrays-build the divergence array d kþ1 along with a kþ1 from d k and a k u 0, v 0, p k þ 1, q k þ 1 create empty arrays a½, b½, d½, e½ the concatenation of a followed by b d kþ1 the concatenation of d followed by e Because we are dealing with bi-allelic data, so long as d k ½i40, the values of y iÀ1 ½d k ½i À 1 and y i ½d k ½i À 1 must be 0 and 1, respectively, since they differ by definition and are in sorted order. This means as a corollary that it is not possible for d k ½i to be equal to d k ½i þ 1, as long as they are greater than 0, because otherwise y i ½d k ½i À 1 would need to be both 1 and 0, which is impossible.
I will call the collection of arrays a k for all k the 'positional prefix arrays' of X. These are related to standard suffix arrays, but apart from being prefix rather than suffix arrays because the sorting is in the reverse direction, they differ because they form a set of N arrays each of size M rather than a single array of size NM. The d k contain the equivalent of 'longest common prefix' values in standard suffix array algorithms.
2.2 Finding all matches within X longer than a minimum length L Now we can use the a k and d k arrays to efficiently find matches. In order to count matches only once, we will sweep through the sequences with increasing k and only report matches at each k that end at k, i.e. for which y i ½k 6 ¼ y j ½k: In order for the match to be longer than L, we must by definition have d k ½m k À L for all i5m j, so pairs of indices (i, j) with this property will occur in blocks in the sorted list, separated by positions at which d k ½i4k À L: We therefore proceed through the sorted list at position k, keeping track of the last time that d k [i] was greater than (k À L). Given this, our algorithm looks remarkably like that for generating the a k arrays.
Algorithm 3 ReportLongMatches-report matches within X ending at k longer than L u 0, v 0, create empty arrays a½, b½ This algorithm is OðmaxðNM, number of matchesÞÞ; reporting the pairs of subsets a[] and b[] of sizes u and v would be O(NM). Note that ReportLongMatches can be run in the same sweep through the data as used to calculate the a and d arrays, so if we are happy to discard previous values of a j and d j for j5k as we go, it can be carried out in O(M) space.
A variation of this method can deliver all matches that extend in both directions from a location by at least a minimum length L/2. In this case one considers blocks within which d k ði, jÞ ðk À ð2L þ 1ÞÞ to find sets of such matches centered on position k À ðL À 1Þ, and does not separate into subsets for which y k [i] ¼ 0 or 1. This approach may be relevant when looking for similar sequences at a position, perhaps for the purpose of imputation. Long matches will recur many times in this formulation, so it is best if possible to use the similar subsets as they are identified during the sweep, rather than to store them for future use.

Finding all set-maximal matches within X in linear time
Consider a sequence y i in the sorted list at k. Under what conditions will it have a set-maximal match ending at k? Clearly the match must be to one or more sequences directly preceding or following it in the sort order. First we find the candidate interval [m,n] such that for all j 6 ¼ i with m5j n, d k ½j minðd k ½i, d k ½i þ 1Þ: If y j ½k 6 ¼ y i ½k for all these j then y i has a set-maximal match to them all, but (unless K ¼ N) if any have y j ½k ¼ y i ½k then the match between y i and y j can be extended forwards, and there is no set-maximal match ending at k. Iterating over all k and i we get algorithm 4.
Algorithm 4 ReportSetMaximalMatches-report set maximal matches in X report match of a k ½i to a k ½j from d k ½i þ 1 to k Despite the inner loops this algorithm only has time complexity O(NM), because the requirement that y j ½k 6 ¼ y i ½k limits the search so that each position is compared at most once from each To be completely precise, because matches have to terminate at the start and end of the sequence, this last statement relies on there not being arbitrarily large groups of sequences identical from 0 to N. Under these conditions, this also proves that the total number of set-maximal matches within X is bounded by a fixed multiple of NM.
As with ReportLongMatches, ReportSetMaximalMatches can be run in the same sweep through the data as used to calculate the a and d arrays, so if we are happy to discard previous values of a j and d j as we go, it also can be carried out in O(M) space.
2.4 Finding all set-maximal matches from a new sequence z to X Next let us consider the case where we have a new sequence z, and want to find the set-maximal matches between it and set X.
We will again sweep forward through the sequence, and in this case keep track of the start e k of the longest match of z to some y i ending at position k, and the interval ½f k , g k Þ ½0, . . ., MÞ of indices in a k with that match. So for all i such that f k i5g k we have z½e k , kÞ ¼ y i ½e k , kÞ, but z½e k À 1 6 ¼ y i ½e k À 1: We allow g k to be M if y MÀ1 is included in the set of longest matches. We want an efficient procedure for updating e, f and g as we move from k to (k þ 1). First let us imagine that we have a procedure for updating f and g to f 0 and g 0 given a fixed starting position e k . If f 0 5g 0 then at least some of the original matches starting at e k and ending at k can be extended to (k þ 1), so e 0 kþ1 ¼ e k by definition and we are done. If on the other hand f 0 ¼ g 0 then none of the matches can be extended, and so the matches ending at k to sequences between f k and g k are set-maximal and can be reported. Then we need to find a new e kþ1 4e k and corresponding new f kþ1 5g kþ1 : To efficiently update f and g we need the values u and v from Algorithm 1. We did not store them at the time, but let us now assume that we did so, in arrays u k and v k , and also kept track of the total number of zero values at position k in the sequences as value c k , which is equivalently the length of array a[] in Algorithm 1. Now if we define w k ði, 0Þ ¼ u k ½i and w k ði, 1Þ ¼ c k þ v k ½i then Algorithm 1 tells us that a kþ1 ½w k ði, y i ½kÞ ¼ a k ½i: Furthermore, if k is the inverse of the permutation a k , then kþ1 ½i ¼ w k ð k ½i, x i ½kÞ: This last statement gives us a clue about how to update f and g. If we define f 0 ¼ wðf, z½kÞ then this will be the index in a kþ1 of the first sequence y j with j ! f for which y j ½k ¼ z½k, which is what want. Similarly g 0 ¼ wðg, z½kÞ is what we want for updating g. So we can now update f and g by simple lookup from stored values. Now, as we saw above, if f 0 5g 0 then we are done. On the other hand, if f 0 ¼ g 0 then there are no extensions of the match starting at e k . At position k, we know that z sorted either just before the block [f,g) in the natural prefix ordering, or just after it. So it either sorted just before f or just before g. From this we can infer that y kþ1 f 0 À1 5z5y kþ1 f 0 in the natural ordering of reversed prefixes. So e kþ1 d kþ1 ½f 0 : Let us consider z½d kþ1 ½f 0 À 1: If this is 0 then z matches y kþ1 f 0 À1 better than y kþ1 f 0 , and we will set g kþ1 ¼ f 0 ¼ g 0 and look for f kþ1 5f 0 : If it is 1 then z matches y kþ1 f 0 better than y kþ1 f 0 À1 , and we will set f kþ1 ¼ f 0 ¼ g 0 and look for g kþ1 4g 0 : In either case we need to search back from d kþ1 ½f 0 to find e kþ1 : We need to take a little care at the boundaries 0 and M.
Algorithm 5 UpdateZmatches-report any set-maximal matches of z to X ending at k and update to (k þ 1) f 0 wðf k , z½kÞ, g 0 wðg k , z½kÞ if f 0 5g 0 then e 0 e k else for i ¼ f; i5g; i i þ 1 do report match to a k ½i from e to k e 0 d kþ1 ½f 0 À 1 if z½e 0 ¼ 0 and f 0 40 then f 0 g 0 À 1 while z½e 0 À 1 ¼ y kþ1 f 0 ½e 0 À 1 do e 0 e 0 À 1 while d kþ1 ½f 0 e 0 do f 0 f 0 À 1 else g 0 f 0 þ 1 while z½e 0 À 1 ¼ y kþ1 f 0 ½e 0 À 1 do e 0 e 0 À 1 while g 0 5M and d kþ1 ½g 0 e 0 do g 0 It is not immediately obvious that the algorithm is O(N). The while loop in f 0 or g 0 is inevitable because it only takes as many iterations as there are matches to report the next time that f 0 ¼ g 0 : The total number of set-maximal matches is bounded by NA, as required. More complicated are the while loops that decrement e 0 . The sum of times these are used is bounded by a constant multiple of N.

Compact representation of X
The algorithms described above all use the a k and d k arrays to find matches. However, these are arrays of integers with a total number of elements equal to the number of binary values in the original dataset, so storing them would take more space than a bit representation of the starting data. Some algorithms can be applied on the fly, in the same sweep through the data as used to generate the values of a and d, but for other purposes, in particular for analyzing new sequences, we would like to store the relevant information more efficiently. Here is a description of how to do that.
First we notice that in the matching processes we do not actually use directly the a k ½i indexes, but rather the y i ½k ¼ x a k ½i ½k values. These are a permutation of the x i ½k values determined by the a k permutation indicating the sort order at k of prefixes up to position (k À 1), and are therefore a positional analogue of the Burrows-Wheeler Transform (BWT) of X (Burrows and Wheeler, 1994; see Li and Durbin (2009) for an explanation closer to that given here if this is not familiar). Let us call the set of ordered y sequences the Positional Burrows-Wheeler Transform (PBWT). As with the BWT, the PBWT is composed of bit values not integer values, so we can store it in the same space as the original data. Furthermore we can also expect the y arrays to be strongly run-length compressible. This is because population genetic structure means that there is local correlation in values due to linkage disequilibrium, which means that haplotypes with similar prefixes in the sort order will tend to have the same allele values at the next position, giving rise to long runs of identical values in the y array. So the PBWT can easily be stored in smaller space than the original data. This will be true even if the original data is run-length encoded, since the left-to-right orientation of the data in X will not reflect shared haplotype structure due to linkage disequilibrium.
To find matches to a new sequence, as in Algorithm 5, we need also the stored arrays u k and v k , in order to evaluate the extension function w k ðÞ: These arrays correspond to the information stored in the index described by Ferragina and Manzini (2000), commonly known as the FM-index. Given the PBWT we can store the information needed to generate them efficiently in an exactly analogous fashion to that for normal strings.
We do need values of a k ½ for reporting, but given the y values in the PBWT and the position FM-index, we can do this efficiently by only storing a k for a subset of values of k, for example every 32 or 64 positions. Reported matches will be longer than this, so extending to the next stored value of a by use of the extension function w k ðÞ is relatively inexpensive.
Finally, we need a compact representation of the d k arrays. For now, it is proposed to Huffman encode the differences between adjacent d k ½i, and perhaps only store them at a subset of k as for a k . There is probably scope for further improvement here.

RESULTS
Here I present initial results on simulated data. A dataset of 100 000 haplotype sequences covering a 20Mb section of genome sequence was simulated using the sequentially Markovian coalescent simulator MaCS Chen (2009) using essentially the command macs 100000 2e7 -t 0.001 -r 0.001 (in fact a larger simulation was undertaken, which crashed a little beyond 20Mb, and the remaining material was trimmed down to this set). There are 370 264 segregating sites in this dataset. The raw MaCS output contains essentially the haplotype sequences written in 0's and 1's, and so is approximately 37GB in size. This compresses with gzip to 1.02GB.
An initial implementation pbwt of the key algorithms was produced. This uses single byte run length encoding for the PBWT, with the top bit encoding the value, the next two bits selecting whether the length is in units of 1, 64 or 2048, and the remaining 5 bits giving the number of units. For runs 464 but 52048 this typically requires 2 bytes, and for runs 42048 but 564k this typically requires 3 bytes. All experiments were carried out on an Apple Mac Air laptop with a 2.13GHz Intel Core 2 Duo processor using a single core. Encoding the dataset of 100 000 sequences described above took 1070 s (user plus system), generating a PBWT representation that is 7.7MB in size, over 130 times smaller than the gzip compression of the raw data. Further results including application to subsets of the data are given in Table 1, showing that the relative gain increases with the number of sequences, indicating clearly the non-linear benefits of the algorithm. This can be clearly seen by the observation that for each increase of a factor of ten in the number of sequences, the average number of bytes used by the PBWT to store the haplotype values at a site only approximately doubles. As a test on real data, similar measures were applied to the chromosome 1 data from the 1000 Genomes Project phase 1 data release 1000 Genomes Project (2012), consisting of 2184 haplotypes at 3 007 196 sites. The gzip file of this data took 303MB, whereas the PBWT used 51.1MB, nearly a factor of six smaller, not far from the factor expected based on the simulated data.
Next Algorithm 4 was implemented to find all set-maximal matches within the simulated datasets. As expected the time taken was linear in the number of sequences (Table 2), taking only 20 min to find all maximal shared substrings within 100 000 sequences.
Finally the comparative performance of three different approaches to matching new sequences to a pre-indexed reference panel was evaluated, finding all set-maximal matches of each new sequence to the reference set. For this evaluation, I subsampled the simulated sequence dataset to approximate typical data from a genotype array experiment, by only retaining a fraction (10%) of sites with allele frequency 45%. This reduced the number of sites to 5940, approximately one pre 3.4 kb, corresponding to 850 k in the human genome, comparable to the content of a standard a genotyping array. Table 3 shows results for matching 1000 of the sequences from the simulation, comparing them to non-overlapping subsets between 1000 and 50 000 in size.
First a 'naive' algorithm was implemented, in which each sequence was compared to all sequences in the panel in a single pass, keeping the best matching segment covering each base in the test sequence. As expected, this approach takes time linear in the size of the panel, taking $0.05 s per sequence in the panel. Second Algorithm 5 was implemented, which is termed 'indexed' in Table 3. This takes constant user time of 0.9 user seconds to match 1000 sequences to reference panels up to size 10 000, but for 50 000 reference sequences the size of the stored u, v and d arrays (which were not compressed in this implementation) became larger than the available memory, resulting in an increase in system time from 0.2 to 15 s, and a smaller increase in user time to 1.7 s. I therefore conclude that the PBWT-based approach can be hundreds of times faster than a direct search approach and find matches in time independent of the reference panel size, as conjectured above, so long as the associated index arrays fit in memory.
For situations where the indexes do not fit in memory, we can still use the PBWT data structures to provide a third 'batch' matching process that is still much faster than the naive approach. This uses a modified version of the within-set Algorithm 4 that passes jointly through the panel and combined set of new sequences together, just considering matches between new and old sequences. As shown in Table 3, the time taken by this batch approach, whose memory requirements are low and independent of the number of sites, increases with reference panel size, but is still many times more efficient than a direct search as in the naive approach. Asymptotically the time taken by current implementation will depend linearly M, but it may be possible to reduce this by careful avoidance of PBWT compression where it is not needed.

DISCUSSION
I have presented here a series of algorithms to generate positional prefix array data structures from haplotype sequences and to use them for very strong compression of haplotype data, and time and space efficient haplotype matching. In particular, the matching algorithms remove a factor of M, the size of the set of haplotype sequences being matched to, from the search time taken by direct pair-wise comparison methods. This makes it possible to find all best matches within tens of thousands of sequences in minutes, and generates the potential for practical software that scales to millions of sequences. Although the algorithms are presented for binary data, they can be extended to multi-allelic data with a little care. These algorithms share aspects of their design with analogous algorithms based on suffix arrays for general string matching, but are structured by position along the string resulting in substantial differences. One consequence is that, unlike with suffix array methods where linear time sorting algorithms are non-trivial, building the sorted positional prefix arrays in linear time using Algorithm 1 is straightforward. The approach used here is reminiscent of that used by Bauer et al. (2011) to generate a string BWT from very large sets of short strings.
With respect to efficient representation, it is interesting to note that the original BWT was introduced by Burrows and Wheeler (1994) for string data compression, not search, and it in fact forms the basis of the bzip compression algorithm. Valimaki et al. (2007) have previously explored the use of BWT compressed self-index methods for efficient compression and search of genetic sequence data from many individuals, but this does not require a fixed alignment of variable sites as in the work presented here, and is substantially different.
All the algorithms described here require exact matching without errors or missing data. As for sequence matching, if a more sensitive search is required that permits errors, it is still possible to use the exact match algorithms to find seed matches, and then join or extend these by direct testing. This would typically be the approach taken by production software, but having powerful methods to identify seeds is key to performance.
An alternative to using suffix/prefix array methods in sequence matching is to build a hash table to identify exact seed matches. Analogous to the creation of position prefix arrays described here, it would be possible to build a set of positional hash tables for each position in the haplotype sequences. Hash based methods when well tuned can be faster than suffix array based methods, because the basic operations are simpler, but they typically require greater memory, particularly in cases where the suffix representation can be compressed as it can be here. A problem with genotype data not present in standard sequence matching is that the information content of positions varies widely, with a preponderance of rare sites with very little information, which would mean that the length of hash word would need to change depending on position in the sequence. An alternative would be to build hashes based on a subset of sites with allele frequency greater than some value such as 10%, or in some frequency range, but this would lose information leading to false seed matches.
Most research into algorithms for analyzing large sets of haplotype or genotype data has focused on statistical methods that are powerful for inference, but only scale up to a few thousand sites and sequences; e.g. see Marchini and Howie (2010). Recently accelerated methods have been developed that can handle data up to tens of thousands of sequences, e.g. Williams et al. (2012) or Delaneau et al. (2012). However, these methods provide approximations to the statistical matching approaches, and are still much heavier than the algorithms presented here. Over a million people have been genotyped, and although there are logistical issues in bringing together datasets on that scale, genotype data on sets of 4100 000 people are becoming available (Hoffman et al., 2011). One approach to more efficient phasing and imputation may be to use computationally efficient approaches such as the positional prefix array methods to seed matches for statistical genotype algorithms, or at other computational bottlenecks. For example, in their BEAGLE software Browning and Browning (2007) build a probabilistic hidden Markov Model from a variable length Markov model of the local haplotype sequences which is essentially derived from a dynamic truncation of the positional prefix array. Although their algorithms using this probabilistic model take them in a different direction, I suggest that the methods described here could be used to significantly speed up the model building phase of BEAGLE.
Alternatively, a more direct approach may also be possible. Most phasing and imputation algorithms build a model from the entire dataset, then thread each sequence in turn against it to provide a new phasing based effectively on a series of matches. Instead, the positional prefix algorithms progress jointly along all sequences. If we start at both ends of the data, then at some position k we have information about matches in both directions based on the current phasing, and can propose an assignment of alleles for all sequences at k in a single step, before incrementing k. Approaches based on this idea may be fast and complementary to current methods.