hapbin: An Efficient Program for Performing Haplotype-Based Scans for Positive Selection in Large Genomic Datasets

Understanding how the genome is shaped by selective processes forms an integral part of modern biology. However, as genomic datasets continue to grow larger it is becoming increasingly difficult to apply traditional statistics for detecting signatures of selection to these cohorts. There is therefore a pressing need for the development of the next generation of computational and analytical tools for detecting signatures of selection in large genomic datasets. Here, we present hapbin, an efficient multithreaded implementation of extended haplotype homzygosity-based statistics for detecting selection, which is up to 3,400 times faster than the current fastest implementations of these algorithms.


Supplementary Methods
At their core, all EHH based tests involve examining how haplotype identity decays across a group of individuals with increasing distance from an allele of interest. Following the phasing of haplotypes, one of the major rate limiting factors of implementations of these statistics is the inefficiency with which they store and access the genetic data to characterise this haplotype decay. Encoding the variants of a whole genome genetic dataset of ~1000 individuals as a space or tab separated text file requires greater than 160Gb, whereas encoding it as a binary file reduces this by up to 16 fold (the allele and space representation of 2x8 bits being reduced to 1 bit). Once stored as a binary representation it is then possible to improve the efficiency of the EHH algorithm through the use of bitwise operations to process many alleles for each CPU instruction. Determining EHH is based upon identifying the frequency of each haplotype combination of a given length, n. The smaller the number of observed haplotypes at length n, in general, the higher the EHH. Rather than explicitly traversing along each haplotype and storing each subhaplotype of length n along with a corresponding count, hapbin instead represents the divergence of haplotypes in the form of a binary haplotype tree. The frequencies of each haplotype of a given length can then be determined, as described in greater detail below, without having to explicitly store the haplotypes themselves. By storing the genetic data as a binary file and determining haplotype frequencies through the use of a bitwise algorithm, a reduced memory footprint and substantially faster runtime is achieved in comparison to other implementations.
In greater detail, in the IMPUTE hap file format (https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#-h) alleles at biallelic variants are encoded as either '0' or '1', with each variant separated by a newline character and each allele separated by a space. Each line in a hap formatted file consequently representing a variant and each column representing a phased chromosome haplotype. An immediate observation of the .hap data format is the binary nature of the data. The spaces between alleles are also unnecessary, causing the .hap files to require roughly twice as much memory as actually required. Each '0' and '1' character also takes up a full byte but can instead be stored as bits. By representing SNPs as a set of bits rather than characters, the storage requirement can be reduced by roughly 16x: 2 characters at 8 bits per character reduced to 1 bit.
There is, however, a significant challenge to storing SNPs as a set of bits: the algorithm to calculate EHH reads along both the rows and columns of the data. Furthermore, memory is addressed in byte sized chunks, not bits. In order to identify if a particular allele is 0 or 1, the relevant bit must be extracted. This can be done using a bitmask. A bitmask is set of bits used to select or toggle a certain bit or set of bits from another set of bits. A bitmask can be used to select desired bits using the bitwise AND operator.
The key to an efficient bitwise haplotype identification algorithm is using a bitset to store a mask of the locations of the haplotypes, not the haplotypes themselves.  (A)). Therefore B 0001 =1000 (green in (B)), indicating only the first chromosome carries the 0001 haplotype. The frequency of this haplotype is retrieved by determining the Hamming weight of this returned byte, which in this case is 1.

. Example SNP of interest around which EHH is to be calculated is located at position 0 with alleles in bold. (B) Bitwise haplotype tree where the "1" allele is present at the SNP of interest (corresponding to boxed alleles in (A)). To determine the frequency of the haplotypes of length n in the population downstream from the SNP of interest, bitmasks are used to isolate binary representation of the alleles present at SNP n. These are then combined with the previously calculated frequencies observed of haplotypes of length n-1 through the use of bitwise operations. For example B 000 =1100 (coloured in blue in (B)), indicating that of the four chromosomes, the first two carry the 000 haplotype between SNPs 1 and 3. The individuals that carry the 0001 haplotype will therefore be 1100 & 1010 = 1000 where & corresponds to the bitwise AND (1010 being the allelic representation of SNP 4, coloured in red in
A haplotype, h, of length i can become h + 0 or h + 1 at length i + 1. Using bitwise operations, these possible divergences can be processed for an entire bitset at once:

Equations 1,2
where B N,xi+1 is a bitset containing the data at marker x i+1 . For example, using the data from Supplementary Methods Figure 1:

Equations 3,4,5
An algorithm, popcount is then used to count the occurrences of a given haplotype. This operation counts the number of 1 bits set in an integer, also known as the Hamming weight. For example popcount([11010011] 8 ) = 5, i.e. 5 chromosomes carry this haplotype. There are algorithms which can perform the popcount operation in fewer steps than individually testing if each bit is set to 1. Furthermore, modern x86 CPUs have a POPCNT instruction which can perform this operation in a single clock cycle. If popcount (B N,h ) = 0, then the haplotype, h, is not observed in the data. All of the observed haplotypes at marker x i can be stored as a set of bitsets R i and all of the haplotypes at marker x i+1 , R i+1 , can be found by applying equations 1 and 2 to each element of R i . The set of bitsets at index i = 0 shall be defined as containing a bitset with all bits set: R 0 = {[11...1] N }.
Thus, a recursive bitwise haplotype identification algorithm can be defined as: This algorithm can be optimized by excluding bitsets with a population count of 0 since subsequent iterations will also always have a population count of 0. Since unique haplotypes can diverge no further, bitsets with a population count of 1 can be removed from the tree, counted, and factored in separately. It is also unnecessary to be able to reconstruct what the specific sequence of the haplotypes are. All that is necessary is knowing the locations of the different haplotypes. Therefore, previous levels of the tree can be discarded once the level's EHH and next level are calculated and there is no need to store relational information between nodes, saving memory.
From the tree of haplotype locations, it is straightforward to calculate the EHH for a particular length. Because the locations at which a haplotype occurs are represented by '1' bits at positions corresponding to the chromosomes, the number of times a haplotype is observed can be computed by finding the population count of the bitset. The EHH for core haplotype c can then be found as previously described (Sabeti et al. 2002): where h is each haplotype in the set H c (x i ),n h is the number of times h is observed, and n c is the number of times the core haplotype, c, is observed at the location of interest. The n h for haplotype h in combination with core haplotype c can therefore be found by n h = popcount (B N,h & B N,c ). The core haplotypes are added at the end in order to reduce the complexity of the tree. iHS and XP-EHH can then be calculated from these EHH values as previously described (Voight et al. 2006;Sabeti et al. 2007).