Reconstruction of Microbial Haplotypes by Integration of Statistical and Physical Linkage in Scaffolding

Abstract DNA sequencing technologies provide unprecedented opportunities to analyze within-host evolution of microorganism populations. Often, within-host populations are analyzed via pooled sequencing of the population, which contains multiple individuals or "haplotypes." However, current next-generation sequencing instruments, even in conjunction with single-molecule barcoded linked-reads, cannot distinguish long haplotypes directly. Computational reconstruction of haplotypes from pooled sequencing has been attempted in virology, bacterial genomics, metagenomics, and human genetics, using algorithms based on either cross-host genetic sharing or within-host genomic reads. Here, we describe PoolHapX, a flexible computational approach that integrates information from both genetic sharing and genomic sequencing. We demonstrate that PoolHapX outperforms state-of-the-art tools tailored to specific organismal systems, and is robust to within-host evolution. Importantly, together with barcoded linked-reads, PoolHapX can infer whole-chromosome-scale haplotypes from 50 pools each containing 12 different haplotypes. By analyzing real data, we uncovered dynamic variations in the evolutionary processes of within-patient HIV populations previously unobserved in single-position-based analyses.


SECTION I: OVERVIEW OF OTHER HAPLOTYPE RECONSTRUCTION METHODS
The reconstruction of maternal and paternal haplotypes from diploid genotype data has excellent precision in human genetics, due to the modelling of recombination and coalescence (Stephens and Donnelly 2003; Scheet and Stephens 2006; Howie, et al. 2012). Large reference panels in humans (Kowalski, et al. 2019) allow routine phasing and imputation of long-range haplotypes.
However, this procedure does not extend to the inference of many haplotypes in the same host(s), or to microbial species that are not extensively studied. To facilitate the analysis of data generated by genotyping artificially pooled samples (a cost-saving strategy used in Genome-Wide Association Studies in the 2000s), several methods were developed to infer cross-pool population haplotype frequencies from pooled genotyping data. The standard PHASE algorithm (modeling coalescence) (Pirinen, et al. 2008), the expectation maximization (EM) algorithm (modeling general sharing), Markov-chain Monte Carlo (MCMC) sampling (modeling both coalescence and recombination) (Kuk, et al. 2009; Pirinen 2009), and phylogenetic trees (modeling coalescence) (Efros and Halperin 2012) have been developed to reconstruct haplotypes in silico. These methods in general do not scale to large haplotypes with more than 25 genetic variants and cannot be applied to microorganisms.
To deconvolve viral and bacterial pooled-sequencing data, researchers have developed many tools using different computational techniques, including de novo assembly (utilizing sequencing reads) (Baaijens, et al. 2017), tensor decomposition (utilizing sequencing reads and allele frequencies) (Hashemi, et al. 2018), graph-based algorithms (utilizing sequencing reads) (Mazrouee and Wang 2014), and regularization (utilizing allele frequencies and template genomes) (Albanese and Donati 2017). As a representative technology for barcoded linked-reads, 10X Genomics provides a built-in tool, Long Ranger, for inferring paternal and maternal haplotypes and large structural variants, which achieves these goals by utilizing sequencing reads that span a large distance (Zheng, et al. 2016).
In summary, the first category of models assumes cross-host sharing via statistical models in the domain of "genetics," using statistical linkage disequilibrium, while the second focuses on within-host specifics observed in sequencing reads, i.e., "genomics," in the form of physical linkage disequilibrium.

Problem formulation
The haplotype reconstruction problem addressed in this paper is intuitively outlined below. The inputs to the algorithm are (1) in-pool allele frequencies for each pool, recorded in VCF files (one VCF file per sample); and (2) reads aligned to the reference genome (of the focal species) carrying multiple mutations, recorded in BAM files. The output is the identity of the global haplotypes and their in-pool as well as across-pool frequencies.
Step 1: Graph coloring The haplotyping problem is initially interpreted as a graph coloring model as follows. We form a graph <V, E> in which V denotes all nodes in the graph and E denotes the edges between nodes. Each sequencing read represents a node in the graph. An edge will be placed between two nodes if their corresponding reads have an overlapping region which contains at least one varying site between the two reads. Since the nodes linked by an edge represent reads which differ in at least one site, they must belong to two different haplotypes. Therefore, if we assign colors to each node so that every edge joins a pair of differently colored nodes, then the nodes with the same colors may belong to the same potential haplotypes. Solving this graph-coloring problem gives a preliminary haplotyping solution, although the solution is very approximate when there is insufficient read depth to distinguish between multiple valid graph-colorings.
We implemented two algorithms to solve the graph-coloring problem. We first attempted to convert this problem to a standard maximum clique problem (Bomze, et al. 1999) in graph theory in order to use miniSAT (Sorensson and Een 2005) to solve the problem. This is a parsimony algorithm that aims to search for the minimal number of haplotypes, but becomes very slow when the number of reads is large. However, the parsimony property is undesirable, since the graph-coloring step only provides the initial input for our downstream hierarchical AEM algorithm (which integrates statistical linkage disequilibrium, or LD). As a result, we have implemented a greedy algorithm as the default graph-coloring method in the current version of PoolHapX, which is described below.
During computation we maintain three data structures: the color of each node, the set of colors that have been assigned, and the set of neighbor colors of each node. Let C = {1, 2, ..., M} represent the M colors currently assigned to at least one node, and let color(x) denote the color of node x; if a node has not yet been assigned a color, color(x) = 0. We use nbCol(x) to denote the set of distinct colors among the neighbors of x and size(nbCol(x)) to denote the size of this set. Our algorithm proceeds as follows: we begin by assigning color 1 to a random node. In each subsequent step, we select an uncolored node x* with the largest number of distinctly colored neighbors, size(nbCol(x*)), and assign it the color color(x*) = min{C \ nbCol(x*)}. If no existing color can be assigned to x*, meaning size(nbCol(x*)) = M, we add a new color M+1 to C and assign it to the node. We then update every uncolored node y with an edge to x* so that nbCol(y) contains color(x*). This process is repeated until all nodes are colored (Supplementary Fig. 1).
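The greedy selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the PoolHapX implementation (which is written in Java); the function and variable names are hypothetical.

```python
def greedy_color(n_nodes, edges):
    """Greedy graph coloring: repeatedly pick the uncolored node whose
    neighbors carry the most distinct colors, and give it the smallest
    color absent from its neighborhood, creating a new color if needed."""
    adj = {v: set() for v in range(n_nodes)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    color = {v: 0 for v in range(n_nodes)}       # 0 means "uncolored"
    nb_col = {v: set() for v in range(n_nodes)}  # nbCol(x): colors among neighbors
    n_colors = 0                                 # M, the number of colors in use
    for _ in range(n_nodes):
        # uncolored node with the largest size(nbCol(x))
        x = max((v for v in range(n_nodes) if color[v] == 0),
                key=lambda v: len(nb_col[v]))
        free = [c for c in range(1, n_colors + 1) if c not in nb_col[x]]
        if free:                                 # min{C \ nbCol(x*)}
            color[x] = free[0]
        else:                                    # size(nbCol(x*)) == M
            n_colors += 1
            color[x] = n_colors
        for y in adj[x]:                         # propagate to uncolored neighbors
            if color[y] == 0:
                nb_col[y].add(color[x])
    return color
```

In the haplotyping setting, nodes are reads and edges connect reads that disagree at a shared variant site; nodes receiving the same color are candidate members of the same haplotype.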

Supplementary Figure 1:
An example of the Graph Coloring algorithm. Each node in the graph is a sequencing read (panel a). Two nodes are joined by an edge if their corresponding reads overlap in at least one position with differing alleles (panel b). A color is assigned to each node so that any two nodes linked by an edge are colored differently. As a result, nodes with the same color potentially belong to the same haplotype (panel c).
Step 2: Hierarchical AEM The basic approximate expectation maximization (AEM) algorithm as described in Kuk et al (Kuk, et al. 2009) utilizes the multivariate normal (MVN) distribution. The LD between all pairs of q genetic variants is modeled as the variance-covariance matrix of an MVN distribution. With q genetic variants, there are 2^q possible haplotypes. Initially, all 2^q possible haplotypes are assigned the same frequency 1/2^q. Then, using an iterative procedure, the ratio of the likelihoods of observing the MVN densities with or without each haplotype is estimated. These ratios indicate the relative importance of each haplotype, and are used to adjust the frequencies of all 2^q possible haplotypes. This adjustment is conducted iteratively until the haplotype frequencies converge. A detailed mathematical formulation is outlined below (please refer to Kuk et al (Kuk, et al. 2009) for the full description and justification).
Suppose we have q genetic variants and T hosts. We use A to denote the matrix of aggregated variant allele frequencies in all the pools; that is, A_i,j denotes the aggregated allele frequency of the j-th genetic variant in the i-th pool. Note that this is the only input for the original AEM algorithm; additional information from the sequencing reads is not integrated. Denoting r = 2^q as the total number of haplotypes, let H = (h_1, h_2, ..., h_r) be the sequence of all potential haplotypes. Using the current haplotype frequencies P = (π_1, π_2, ..., π_r), we calculate the mean vector μ and variance-covariance matrix Ω of an MVN distribution modeling the q variants using the formulas μ = Σ_{i=1}^{r} π_i h_i and Ω = Σ_{i=1}^{r} π_i h_i h_i′ − μμ′. The importance factor (IF) (Kuk, et al. 2009) of each haplotype h_i is then the ratio of the multivariate normal densities of the observed pool allele frequencies evaluated with and without the contribution of h_i.
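As an illustration, the mean and variance-covariance computation above can be sketched with NumPy; the haplotype matrix and frequencies below are arbitrary toy values, not taken from the paper.

```python
import numpy as np

# Rows of H are candidate haplotypes h_i over q = 2 variant sites
# (0 = reference allele, 1 = alternate allele); pi holds the current
# frequency estimates P = (pi_1, ..., pi_r).
H = np.array([[0, 0],
              [1, 0],
              [1, 1]], dtype=float)
pi = np.array([0.5, 0.3, 0.2])

mu = pi @ H                                # mu = sum_i pi_i * h_i
Omega = (H.T * pi) @ H - np.outer(mu, mu)  # Omega = sum_i pi_i h_i h_i' - mu mu'
```

The diagonal of Omega holds per-site allele-frequency variances, and the off-diagonal entries encode the pairwise LD that AEM exploits.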
This IF is then used to update the haplotype frequencies, which are renormalized so that they sum to one: π_i ← π_i · IF_i / Σ_{j=1}^{r} π_j · IF_j. In our variant of the AEM algorithm, we have made the following three modifications.
First, the initial haplotype configuration is no longer a uniform distribution over all 2^q possible haplotypes. Instead, using the potential haplotypes determined by the Graph Coloring step, we assign higher initial frequencies to haplotypes with higher sequence coverage; the many potential haplotypes not observed in the graph-coloring step are therefore removed. While the theoretical space and time complexity of our AEM remains O(q^2) and O(q^3 × 2^q) respectively, our method substantially reduces the number of iterations required in practice, since our initial frequencies start much closer to the true values. As our Graph Coloring algorithm is greedy and may produce too many haplotypes (especially when the sequencing coverage is not high), we further filter out haplotypes that are not supported by information from the sequencing reads. By default, PoolHapX requires each haplotype to have at least 2% coverage by sequencing reads at every pair of adjacent sites (Supplementary Fig. 2).

Supplementary Figure 2:
In this example there are five reads covering two adjacent sites with variants (T/G) and (G/C) respectively. Since there is only one read with variants "T" and "C" (light blue), the percentage of T-C haplotype supported by physical LD is 20%.
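The per-pair read-support calculation in this example can be sketched as follows; this is a toy illustration, and in PoolHapX the resulting fractions are compared against the default 2% cutoff.

```python
from collections import Counter

def pair_support(reads):
    """Fraction of reads supporting each allele combination at a pair
    of adjacent variant sites."""
    counts = Counter(reads)
    total = len(reads)
    return {pair: n / total for pair, n in counts.items()}

# The five reads of Supplementary Fig. 2, each covering the two
# adjacent variant sites (T/G) and (G/C)
reads = [("T", "G"), ("T", "G"), ("G", "C"), ("G", "C"), ("T", "C")]
support = pair_support(reads)
# The T-C combination is seen in 1 of 5 reads, i.e. 20% support,
# so a haplotype carrying T-C passes the default 2% threshold.
```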
Second, we apply AEM hierarchically, using a divide-and-conquer algorithm to scale the original AEM algorithm up to larger regions. The frequencies and identities of shorter haplotypes inferred by one iteration of AEM are combined into longer haplotypes for subsequent iterations (Supplementary Fig. 3). In each iteration, the local regions on which AEM operates are designed so that half of their genetic variants overlap with the next region (Supplementary Fig. 3). This allows the local regions to form tiling windows that are stitched together at the next hierarchical level. By default, PoolHapX undergoes four iterations of the divide-and-conquer procedure. The division plan for Level-1 of the AEM procedure is based on the initial blocks determined by the outcome of the Graph-Coloring step (Supplementary Fig. 4). The gaps between graph-coloring haplotypes indicate sites with very little physical LD (mostly due to sparse sequencing coverage), and are used to divide the sequence into regions. The midpoints of successive regions are used to determine the start and end points of additional, overlapping regions. The combination of these overlapping regions forms the tiling regions of Level-1. This design specifically leverages statistical LD to stitch the gaps left by physical LD.
PoolHapX provides options for the user to control the size of Level-1 regions so that it is not solely decided by the positions of gaps in the physical LD (Users Manual). PoolHapX then runs AEM on the regions independently. After resolving the Level-1 haplotypes, adjacent regions are combined into a larger region for the Level-2 partition. Only combinations of Level-1 haplotypes with non-zero frequencies are considered as candidate haplotypes for Level-2. Specifically, adjacent regions are stitched together by the following procedure: suppose we have three tiling regions, A, B, and C, as depicted in Supplementary Fig. 5, where a, b, and c are three haplotypes in region A; d and e are two haplotypes in region B; and f, g, and h are three haplotypes in region C. The smaller haplotypes form a longer Level-2 haplotype (such as ah) only if the number of mismatched alleles between the haplotype in A and the haplotype in B is smaller than a cutoff value, and likewise between B and C. The user can specify this cutoff (Users Manual) based on their understanding of the haplotype structure and level of polymorphism of their focal species. Using the same procedure, PoolHapX can run additional rounds of the AEM algorithm, depending on the available computing resources. In general, the size of region that is resolvable by AEM is limited by the computational cost of storing and factoring many q × q matrices. By default, PoolHapX can process regions with up to 112 genetic variants after four levels of iteration.
Supplementary Fig. 4: Regions are split at sites that have a higher number of gaps (shown as rectangles under the scissors) between graph-coloring haplotypes, since the gaps contain little information about the physical LD.
Supplementary Fig. 5: Three Level-1 tiling regions generate one larger Level-2 region according to the number of mismatched alleles.
Suppose that Level-1 regions A, B, and C contain haplotypes (a, b, c), (d, e), and (f, g, h), respectively. Regions A and C can be combined in nine possible ways (af, ag, ah, bf, bg, bh, cf, cg, ch). However, based on their overlap with region B's haplotypes, as indicated by the colors of haplotypes d and e, only two combinations (ah and cg) are retained in Level-2, as the number of mismatched alleles in the other combinations exceeds the cutoff. Third, the original AEM is not robust to numerical instability when the denominator of the likelihood ratio is close to zero. When this happens, the program converges rapidly to a sink point at which only one haplotype is dominant. This is not an important issue in the original AEM design, where the number of variants is small; however, it occurs more frequently when applying AEM to a larger matrix containing sparse non-zero LDs. We have resolved this problem by adjusting the calculation of the likelihood ratio so that if both the denominator and numerator are very close to zero, the importance factor (IF) is set to a moderate constant, and if only the denominator is close to zero, the IF is set to a large constant. These parameters are tunable by the end user.
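The mismatch-cutoff stitching of adjacent tiling regions can be sketched as below. Haplotypes are represented as 0/1 strings over variant sites; the values and names are illustrative, not from PoolHapX.

```python
def stitch(haps_left, haps_right, overlap, cutoff=0):
    """Combine haplotypes from two adjacent tiling regions that share
    `overlap` variant sites: the suffix of a left haplotype must match
    the prefix of a right haplotype up to `cutoff` mismatched alleles."""
    combined = []
    for a in haps_left:
        for b in haps_right:
            mismatches = sum(x != y for x, y in zip(a[-overlap:], b[:overlap]))
            if mismatches <= cutoff:
                combined.append(a + b[overlap:])
    return combined

# Two regions over six variant sites, overlapping at the middle two sites
left = ["0011", "1100"]
right = ["1101", "0010"]
# stitch(left, right, overlap=2) -> ["001101", "110010"]
```

Only combinations whose overlapping segments agree (up to the cutoff) survive, which is how candidate haplotypes are pruned between hierarchical levels.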
Step 3: Breadth-first search The top-level local haplotypes produced by the hierarchical AEM algorithm are linked into global haplotypes using a breadth-first search (BFS) algorithm. As haplotypes are inferred for each of the tiling windows, the haplotypes from adjacent windows overlap over about half of their sequence length. If the sequences of two overlapping haplotypes match, they are stitched together to form a longer haplotype. Since the overlapping regions may not be identical due to errors in the previous steps, two haplotypes may be stitched together as long as the number of mismatches in their overlapping region is less than a user-specified cutoff (Supplementary Fig. 6). By default, PoolHapX repeats this process to form local haplotypes with a maximum of 112 genetic variants.
Next, global haplotypes are determined by traversing a tree structure which contains the locally stitched haplotypes. The i-th level of the tree represents the i-th genomic region. At each level, local haplotypes are represented by a node in the tree. We draw an edge between two nodes in neighboring levels if and only if their corresponding haplotypes can be stitched together (Supplementary Fig. 6).
By collecting all the paths that traverse the tree, we find the set of global haplotypes that are consistent with the outcome of the hierarchical AEM algorithm. We have implemented a standard breadth-first search (BFS) algorithm (Cormen, et al. 2009) to traverse the tree and produce a set of candidate global haplotypes for the final step.
Supplementary Fig. 6: An example of the BFS algorithm used to determine global haplotypes. Haplotypes from adjacent windows are stitched together to form a longer haplotype if the overlapping sequences of two adjacent haplotypes match.
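The level-by-level traversal can be sketched as a breadth-first extension of partial paths. This is a simplified illustration with 0/1 strings; in PoolHapX the tree nodes are the AEM local haplotypes and the cutoff is user-specified.

```python
from collections import deque

def global_haplotypes(windows, overlap, cutoff=0):
    """Level i of the search tree holds the local haplotypes of window i;
    a partial path is extended by a haplotype from the next window if their
    overlapping segments disagree at no more than `cutoff` sites."""
    paths = deque(windows[0])
    for level in windows[1:]:
        extended = deque()
        for path in paths:
            for hap in level:
                tail, head = path[-overlap:], hap[:overlap]
                if sum(a != b for a, b in zip(tail, head)) <= cutoff:
                    extended.append(path + hap[overlap:])
        paths = extended
    return list(paths)

# Three tiling windows of three variants each, with adjacent windows
# overlapping at two sites
windows = [["001", "110"], ["010", "101"], ["100", "011"]]
# global_haplotypes(windows, overlap=2) -> ["00100", "11011"]
```

Every surviving path corresponds to one candidate global haplotype passed on to the regression step.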
Step 4: L0 and L1 regularized regression Using the candidate global haplotypes from the BFS algorithm, we use an innovative regression model to estimate the frequencies of within-host global haplotypes from two sources of information. First, the within-host aggregated minor allele frequency (MAF) at a given site should match the sum of the frequencies of haplotypes bearing the alternate allele at that site. This is the basis for many algorithms such as StrainEst and others (Pulido-Tamayo, et al. 2015; Albanese and Donati 2017). Second, the physical LD revealed by the sequencing reads is also integrated into the model. Specifically, consider the alternate allele (coded as 1 in our model) at a pair of variant sites t and k which are not necessarily adjacent. The sum of the frequencies of haplotypes carrying alternate alleles at both t and k should match the frequency of reads covering t and k which also carry both alternate alleles. This insight is generalizable to any sequencing technology, including standard Illumina paired-end reads, barcoded linked-reads, and third-generation long-reads.
Based on this rationale, we define the regression model y = Xβ, a standard linear model in which y is the vector of observed frequencies and X is the design matrix of predictors. The coefficient β_i is the frequency of the i-th global haplotype in the host (pool). Note that y and X combine the two different sources of data discussed above; that is, we use two types of samples to train this model. For individual MAFs, at each site j the sum of the frequencies of haplotypes carrying the minor allele at j should match the MAF observed in the reads at site j. This is the same information that has been utilized by several other tools (Pulido-Tamayo, et al. 2015; Albanese and Donati 2017). An innovation of our design is the second type of sample, which utilizes physical LD information: for each pair of sites j and k, the sum of the frequencies of haplotypes carrying the alternate alleles at both j and k should match the frequency of observed reads carrying the alternate alleles at both sites (Supplementary Fig. 7). The dimension of y is therefore n + n(n-1)/2, where n is the number of individual sites with MAFs and n(n-1)/2 is the number of site pairs with physical LD information. In order to adjust the relative importance of MAF versus physical LD information, the weighting of each type of information is tunable in the regression model (Users Manual). The set of global haplotype frequencies (i.e., the set of β_i) that best fits these two sets of constraints is our solution.
Supplementary Fig. 7: PoolHapX uses two types of frequency information in the L0 and L1 regularized regression step: the sum of minor allele frequencies at individual sites (panel A), and the sum of two-variant alternate allele frequencies at paired sites (panel B). Here three genetic variants provide six samples to train the model: three within-host aggregated MAFs represented by blue squares, and three pairwise physical LDs represented by blue diamonds. A regularized (L0 and L1) optimization is used to estimate haplotype frequencies. Panel C illustrates the output of the regression model, which contains three non-zero frequency haplotypes represented by the yellow, orange, and green bars. The remaining potential global haplotypes have zero frequencies, indicating that they do not exist in this pool.
Solving an L0 regularized regression is extremely challenging since the problem is not convex. PoolHapX takes advantage of L0Learn, a recent development in the field of optimization (Hazimeh and Mazumder 2018), to solve this regression problem.
By applying the above regression to all hosts, the cross-host (population) frequencies of each haplotype can be determined by combining the within-host frequencies.

Simulating Haplotypes for Each Pool Using SLiM
The overall procedure for the simulations is shown in Supplementary Fig. 8.
Supplementary Fig. 8: Flowchart of the testing pipeline. First, the forward genetic simulation framework SLiM (Haller and Messer 2019) is used to generate haplotypes for each pool. These haplotypes are collected by a Java program called PoolSimulator (https://github.com/theLongLab/PoolHapX/tree/master/Simulation_And_Analysis/PoolSimulator_SLiM.jar), which embeds the variant sites into the real reference genome and calls genome sequencing simulators (DWGSIM for short reads and LRSIM for 10X Genomics barcoded linked-reads) to simulate sequencing reads. BWA (Li and Durbin 2009), Samtools (Li, et al. 2009), and tools from the GATK suite are used to process read alignment, reference mapping, and variant calling on the FASTQ files produced by PoolSimulator. PoolHapX takes the processed BAM and VCF files as input. The reconstructed haplotypes from PoolHapX are evaluated by comparing them against the "gold standard" haplotypes generated by SLiM. Although the simulation procedures vary for different species, the evaluation criteria are the same for all species. We compute the Jensen-Shannon divergence (JSD) and the Matthews correlation coefficient (MCC) between the reconstructed haplotypes and the gold standard haplotypes in order to measure the accuracy of PoolHapX against existing haplotype reconstruction tools.
We use SLiM (version 3) (Haller and Messer 2019), a forward genetic simulation framework, to generate haplotypes for each pool. SLiM 3 provides an R-like scripting language called "Eidos" for users to customize their simulations. By default, SLiM simulates diploid individuals, whereas most viruses and bacteria, including the ones we used in simulations, are haploid. Therefore, we configure SLiM to use a non-Wright-Fisher model (see the SLiM manual) (Haller and Messer 2016) to simulate haploid individuals, as recommended by the authors of SLiM. To mimic cross-pool gene flow, we chose an island model with a constant migration rate of 5%. All simulations generated haplotypes for either 25 or 50 pools, and we set the carrying capacity of each pool to 70, which is lower than the actual number of individuals in a normal host. However, in practice we found that for simulations with carrying capacities of 100 across 50 pools with positive mutations, SLiM took over 5 days and ran out of memory on our computing cluster. To compensate for the smaller number of individuals per pool, we set a higher mutation rate so that the population mutation rate, which is the product of the effective population size and the per-base mutation rate, remains reasonable.
The mutation rates are chosen differently depending on the specific application, chromosome length, and the density of single nucleotide polymorphisms (SNPs). For neutral simulations without selection, we use only one type of mutation, with no change to fitness. We also simulate haplotypes under different evolutionary scenarios with selection, where we add another type of mutation with a negative or positive change to fitness. Following the default settings of SLiM, all mutations that become fixed (i.e., mutations whose frequency reaches 100% in the population) are removed from the population of genomes. After simulating a set number of generations, we save mutations with a minor allele frequency (MAF) above 0.05 into SLiM's standard output format, which contains information on each mutation present and the haplotypes in each pool. Although SLiM supports the generation of real genomic bases (i.e., A, C, G, and T), simulations using reference genomes are much slower. Therefore, we specify only the length of the genome and later embed the resulting polymorphic sites into the reference backbone to produce an equivalent output.

Metagenomic haplotype simulation
We compare the performance of PoolHapX against two metagenomic haplotype reconstruction tools: StrainEst (Albanese and Donati 2017) and Gretel (Nicholls, et al. 2019). StrainEst is a reference-based method that uses the single nucleotide variant (SNV) profiles of available genomes from selected species to determine the number and identity of coexisting strains. In reality, the true number of haplotypes is generally unknown. To ensure that StrainEst is able to run, we use the SNV profiles of E. coli genomes available in the StrainEst distribution (ftp://ftp.fmach.it/metagenomics/strainest/ref) as our reference genome.
Gretel does not require a priori knowledge of the input data (such as references for many known strains or the true number of haplotypes), and is therefore directly comparable to PoolHapX. However, Gretel fails to run if any pair of genetic variants is not covered by the same sequencing read. Therefore, using the reference genome provided by StrainEst, we simulate a genomic region with a very high variant density. In particular, we set the mutation rate to 8×10⁻⁷, 1×10⁻⁶, and 2×10⁻⁶ for simulating 50, 100, and 200 loci, respectively. The simulated region is 4,972 bp long, from position 68,525 (the start point in the StrainEst SNV profile) to 73,496. We run the simulation for 6,000 generations to generate the SLiM standard output.

Haplotype simulation under evolutionary scenarios
SLiM provides a mechanism for specifying multiple types of mutations in the same simulation. In the following evolutionary scenarios, we keep most mutations neutral but add some non-neutral mutations to allow selection to occur. We use HIV strain HXB2 (https://www.ncbi.nlm.nih.gov/nuccore/K03455.1?report=fasta) as the reference genome, and simulate haplotypes under three commonly studied evolutionary scenarios: positive selection, negative selection, and selective sweeps. The proportion of neutral mutations is set to 0.7 for positive and negative selection. For the selective sweep, we simulate only one strongly favorable mutation and keep the remaining mutations neutral.
Note that, in SLiM simulations, we cannot precisely control the number of loci in the final results, especially when multiple evolutionary constraints are present; we therefore adjusted the evolutionary parameters to bring the number of loci to roughly 50, 100, and 200 in the respective simulations.
For the positive selection model, we combine neutral mutations and mutations with a fixed positive fitness change of 0.05, as specified by a SLiM parameter for the strength of positive selection. The mutation rate is 3×10⁻⁶, 8×10⁻⁶, and 5×10⁻⁵ for simulating roughly 50, 100, and 200 loci, respectively. For the negative selection model, we combine neutral mutations and mutations with a fixed negative fitness change of −0.05. The mutation rate is 8×10⁻⁷, 1×10⁻⁶, and 3×10⁻⁶ for simulating 50, 100, and 200 loci, respectively. For the selective sweep model, we combine neutral mutations and a mutation with a large fixed positive fitness change of 1.0 introduced at the 100th generation. The mutation rate is 6×10⁻⁷, 1×10⁻⁶, and 2×10⁻⁶ for simulating roughly 50, 100, and 200 loci, respectively. We run the simulation for 10,000 generations to generate the SLiM standard output.

Haplotype simulation when provided with single-cell linked-reads
We use Chlamydia trachomatis 434/Bu (https://www.ncbi.nlm.nih.gov/genome/471) as the reference genome. The length of this genome is 1,038,841 bp. We set the mutation rate as 9×10⁻⁹ and run for 50,000 generations to generate 570 mutations and 31 global haplotypes.

Processing SLiM Output
The SLiM standard output contains information on all simulated mutations and individual haplotypes within each pool. We developed a Java program called PoolSimulator to process the SLiM output. This program collects haplotypes from each pool, embeds their variant sites into the real reference genome, and calls another tool (DWGSIM (Homer 2010) for short reads or LRSIM (Luo, et al. 2017) for linked-reads) to simulate sequencing reads. The source code and executable are available for download from our GitHub (https://github.com/theLongLab/PoolHapX).
PoolSimulator first parses the SLiM standard output to generate global haplotypes and their corresponding frequencies within each pool. Since SLiM encodes mutated alleles as a string of 0s and 1s, a reference genome is then used to generate full sequences from the polymorphic sites generated by SLiM. The true cross-pool and within-pool haplotype distributions are written to separate files to be used as the gold standard for evaluating PoolHapX's performance. Once we have a mixture of haplotypes for each pool, a genome sequencing simulator is used to generate sequencing reads. We use DWGSIM (Homer 2010) (https://github.com/nh13/DWGSIM) to generate paired-end reads, and LRSIM (Luo, et al. 2017), a 10x Genomics reads simulator (https://github.com/aquaskyline/LRSIM), to generate single-cell linked-reads. The average read length is 150 bp, with an outer distance of 400 bp and an error rate of 0.001 errors per base. Reads are simulated using different coverages (50, 100, 250, or 1000) depending on the simulation type. The simulated reads are saved as FASTQ files in the input directory for further processing.
We use property files to run multiple simulations more easily and to store simulation parameters for future reference. Each file records parameters including the locations of input files, intermediate files, gold standard files, and SLiM output. We run PoolSimulator with the following command:
> java -Xmx4G -jar PoolSimulator.jar PoolSimulator.properties
An example PoolSimulator.properties file for processing viral haplotypes simulated by SLiM from the HIV HXB2 reference genome, along with all other property files, is available in our GitHub repository.

Sequence Alignment and Variant Calling
We use BWA, Samtools, and tools from the GATK suite to process read alignment, reference mapping, and variant calling on the FASTQ files produced by PoolSimulator. The allele-annotated reads generated by this pipeline may contain false positive segregating sites and false negative (missed) sites due to errors in pre-processing, such as read or variant dropout caused by poor mapping or base quality. This mirrors the imperfect data processing encountered in practice and challenges PoolHapX's ability to robustly handle data with errors.
Our sequence alignment and variant calling workflow is as follows: for each pool, use gunzip to extract the fastq.gz archives into two paired-end FASTQ files. In the commands, the placeholder "{prefix}" stands for the name of the actual file to be processed.

Filtering of genetic variants
PoolHapX operates on BAM files, which record the read alignment status, and VCF files, which record the discovered genetic variants. Note that non-reference bases may be discovered in the read sequences of the BAM files, but such bases are not included in the haplotype reconstruction process if they are not present in the VCF files. Uncalled non-reference bases are likely to be sequencing or alignment errors. When comparing reconstructed haplotypes to the gold standard, only correctly called variant sites are used. As PoolHapX is not intended to analyze very rare global haplotypes, we remove all sites with a minor allele frequency (MAF) lower than 0.05 from the VCF files.
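The MAF filter can be sketched as below. This minimal example assumes biallelic records carrying an AF entry in the INFO column; the exact VCF fields used by our pipeline may differ:

```python
# Decide whether a VCF record passes a minor-allele-frequency cutoff.
def passes_maf(vcf_line, cutoff=0.05):
    info = vcf_line.rstrip("\n").split("\t")[7]  # INFO is the 8th VCF column
    fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    af = float(fields["AF"])          # alternate allele frequency
    maf = min(af, 1.0 - af)           # minor allele frequency
    return maf >= cutoff

record = "chr1\t1042\t.\tA\tG\t50\tPASS\tDP=900;AF=0.02"
print(passes_maf(record))  # low-frequency site is filtered out
```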
The parameters used to run PoolHapX are recorded in property files, which are available in our GitHub repository (https://github.com/theLongLab/PoolHapX/tree/master/Simulation_And_Analysis). We use default parameters in most cases, with the only exception being the analysis of P. vivax data for which we have used stronger regularization.

Comparison criteria
MCC+JSD: Although the simulation procedure varies across species, the assessment criteria are the same for all species. For each haplotype reconstruction tool, we compute the Jensen-Shannon divergence (JSD) (Fuglede and Topsoe 2004) and the Matthews correlation coefficient (MCC) (Powers 2011) between the simulated (gold standard) and reconstructed haplotypes to measure reconstruction accuracy. Each reconstructed haplotype is first mapped to its nearest gold standard haplotype using Hamming distance (Tangherloni, et al. 2019). Both haplotypes are represented as strings of 0s and 1s (representing reference and alternate alleles, respectively). The MCC is calculated between such pairs and then averaged across all haplotypes in the pools.
The Matthews correlation coefficient (MCC) is used to measure the similarity of the simulated and reconstructed haplotypes. The MCC is defined as

MCC = (TP x TN - FP x FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives. The MCC returns a value in the interval [-1, 1], where 1 represents an exact prediction, 0 represents a random prediction, and -1 indicates total disagreement between the predicted and actual haplotypes.
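For illustration, the pairwise MCC and the Hamming-distance mapping can be sketched as follows (the function names and toy haplotypes are ours, not part of the PoolHapX code):

```python
# MCC between a gold-standard and a reconstructed 0/1 haplotype string.
def mcc(truth, pred):
    tp = sum(t == "1" and p == "1" for t, p in zip(truth, pred))
    tn = sum(t == "0" and p == "0" for t, p in zip(truth, pred))
    fp = sum(t == "0" and p == "1" for t, p in zip(truth, pred))
    fn = sum(t == "1" and p == "0" for t, p in zip(truth, pred))
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Map a reconstructed haplotype to its closest gold standard by Hamming distance.
def nearest_gold(pred, gold_haps):
    return min(gold_haps, key=lambda g: sum(a != b for a, b in zip(g, pred)))

gold = ["0011", "1100"]
print(mcc(nearest_gold("0111", gold), "0111"))
```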
The Jensen-Shannon divergence (JSD) evaluates the divergence between the frequencies of the simulated and reconstructed haplotype distributions. The probability distributions A and P represent the haplotype frequency distributions in the pool, where A is the distribution of gold standard haplotypes and P is the reconstructed haplotype distribution. The JSD is defined as

JSD(A||P) = (1/2) D(A||M) + (1/2) D(P||M), where M = (1/2)(A + P)

and D(A||M) is the Kullback-Leibler divergence from A to M, defined as

D(A||M) = sum over x of A(x) log2(A(x) / M(x))

The JSD is symmetric in the compared distributions (i.e., A and P) and, for two probability distributions, has a value in the interval [0, 1] (Fuglede and Topsoe 2004), where 0 represents an exact prediction.
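A minimal sketch of the JSD computation between two haplotype frequency distributions (the toy frequencies are ours):

```python
from math import log2

# Kullback-Leibler divergence D(A||M) in bits; terms with A(x) = 0 contribute 0.
def kl(a, m):
    return sum(ai * log2(ai / mi) for ai, mi in zip(a, m) if ai > 0)

# Jensen-Shannon divergence between gold-standard (A) and reconstructed (P)
# haplotype frequency distributions; 0 means identical, 1 means disjoint.
def jsd(a, p):
    m = [(ai + pi) / 2 for ai, pi in zip(a, p)]
    return 0.5 * kl(a, m) + 0.5 * kl(p, m)

gold = [0.6, 0.3, 0.1, 0.0]
reconstructed = [0.5, 0.3, 0.1, 0.1]
print(round(jsd(gold, reconstructed), 4))
```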
F1-Score + CRR + FDR: Here, as in the MCC/JSD case, only reconstructed haplotypes with frequency higher than 0.01 are counted in the comparison. As the targeted application is to reconstruct many haplotypes across many pools, we do not require true positives to be exact matches. Instead, among the reconstructed haplotypes with frequency higher than 0.01, those that are close enough to a gold-standard haplotype (MCC > 0.8) are considered true positives, and the rest are considered false positives. If no reconstructed haplotype with frequency higher than 0.01 matches a gold-standard haplotype, that gold-standard haplotype is deemed a false negative.
With the above definitions, we can define the following metrics:

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
F1-Score = 2 * (precision * recall) / (precision + recall)

Additionally, the CRR (correctly reconstructed rate) is defined as the number of correctly reconstructed haplotypes divided by the total number of gold-standard haplotypes. The FDR (false discovery rate) is defined as the number of incorrectly reconstructed haplotypes divided by the total number of haplotypes detected by the corresponding tool.
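These metrics can be sketched as follows. The helper below is ours and assumes each true positive matches a distinct gold-standard haplotype, which simplifies the false-negative count:

```python
# Summarize F1, CRR, and FDR from per-haplotype match calls.
def summarize(matches, n_gold, n_detected):
    """matches: one boolean per reconstructed haplotype with frequency > 0.01,
    True if it matches a gold-standard haplotype with MCC > 0.8."""
    tp = sum(matches)
    fp = len(matches) - tp
    fn = n_gold - tp                 # unmatched gold-standard haplotypes
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    crr = tp / n_gold                # correctly reconstructed rate
    fdr = fp / n_detected            # false discovery rate
    return f1, crr, fdr

print(summarize([True, True, False], n_gold=4, n_detected=3))
```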

SECTION V: RUNNING OTHER HAPLOTYPE RECONSTRUCTION TOOLS
To evaluate the performance of PoolHapX, we compared its performance against state-of-the-art haplotype reconstruction tools in four fields: viral, bacterial, metagenomic, and human genetics. Every haplotype reconstruction tool is benchmarked with the same input data provided to PoolHapX. We describe our procedure for running these tools below.

CliqueSNV
CliqueSNV is a reference-based method for reconstructing viral variants from next generation sequencing (NGS) short reads. It constructs an allele graph based on linkage between single nucleotide variations and identifies true viral variants by merging cliques of that graph via combinatorial optimization techniques. The open source implementation of CliqueSNV is available at https://github.com/vyacheslav-tsivina/CliqueSNV. We run CliqueSNV using the default settings listed under the "Usage example" section, which is the following command:

# Run CliqueSNV software to reconstruct haplotypes
> java -Xmx20g -jar path/to/clique-snv.jar -m snv-illumina -in path/to/{prefix}.rg.bam

The parameters are interpreted below:
-Xmx. Controls the maximum memory used by the Java Virtual Machine (JVM).
path/to/clique-snv.jar. Indicates the full path to the CliqueSNV jar file.
-m snv-illumina. Sets CliqueSNV to run on Illumina data.
-in. Indicates the alignment file to use.

PredictHaplo
PredictHaplo implements a fully probabilistic approach to quasispecies reconstruction. Given a set of aligned reads, PredictHaplo uses a Bayesian mixture model with a Dirichlet process prior to adaptively estimate the unknown number of underlying haplotypes (Prabhakaran, et al. 2013). To run PredictHaplo, we downloaded their example configuration file from https://bmda.dmi.unibas.ch/software.html. Inside the folder, there is a file named "config_5V" to start from. However, the default parameters were not runnable in our hands, so we adjusted two parameters. First, we lowered '% min_readlength' from its default of 220 to 100, because our simulated read length is 150. Second, following a published tool comparison (Schirmer, et al. 2014), we set the parameter '% entropy_threshold' to 5e-3. With these changes the software ran successfully.

TenSQR
TenSQR utilizes a tensor factorization framework to analyze high-throughput sequencing data and reconstruct viral quasispecies characterized by highly uneven frequencies of their components (Ahn, et al. 2018). TenSQR performs clustering with successive data removal to infer strains in a quasispecies from most to least abundant. As each strain is inferred, the sequencing reads originating from that strain are removed from the dataset. The open source implementation of TenSQR is available at https://github.com/SoYeonA/TenSQR.
To run this tool with its default settings, we downloaded the default config provided on the tool website (https://github.com/SoYeonA/TenSQR/blob/master/config). We only edited the reconstruction start and stop positions, and kept all other parameters at their defaults.
We run TenSQR with the following commands: Using BWA-MEM (Li, et al. 2019), map the simulated reads to the reference sequence "{ref}", and generate an output "{input_sam}" in the SAM format.

EVORhA
EVORhA is a haplotype reconstruction method that combines phasing information in regions of overlapping reads with the estimated frequencies of inferred local haplotypes. EVORhA first reconstructs haplotypes at the local level, and then links local reconstructions into extended haplotypes using information from overlapping reads. The implementation of EVORhA can be downloaded from http://bioinformatics.intec.ugent.be/kmarchal/EVORhA/.
We run EVORhA with the following commands, which are their default settings:

BHap
BHap is a tool for reconstructing bacterial haplotypes from next generation sequencing (NGS) data, and is based on fuzzy flow networks.

StrainEst
StrainEst is a reference-based method that uses genomic data from selected species to reconstruct complex strain profiles from metagenomic sequencing. This tool can quantify the number and identity of coexisting strains and their relative abundances in mixed metagenomic samples. The source code is available at https://github.com/compmetagen/strainest.
We run StrainEst with the following commands, which are their default settings:

Gretel
Gretel is able to recover haplotypes from metagenomic data. Gretel parses an alignment of reads into a Hansel matrix and uses SNP pairs observed in contiguous reads to probabilistically reconstruct the most likely haplotypes. Gretel uses an L-th order Markov chain model to reconstruct likely sequences of variants that constitute real haplotypes in the metagenome. The source code is available at https://github.com/SamStudio8/gretel.
We run Gretel with the following commands, which are their default settings: Using BWA-MEM, map the simulated reads to the reference sequence "{ref}", and generate an output "{input_sam}" in the SAM format:

Human Haplotype Reconstruction Tools:
Hippo
Hippo (Haplotype estimation under incomplete prior information using pooled observations) uses a multinormal approximation of the likelihood of observing the genotype data and a reversible-jump Markov chain Monte Carlo (rjMCMC) algorithm to estimate the existing haplotypes in the population and their frequencies.
We run Hippo with the following commands, which are their default settings:

>/path/to/hippo parameters_file
Parameters:
/path/to/hippo. Full path to the hippo executable.
parameters_file. Each row of the parameters file contains a label and the corresponding parameter value, separated by whitespace. An example of the parameter file is below:
{prefix}. The name of the data file. The data file contains one row for each pool, for a total of n_pools rows. Each row contains 1 + n_loci integer entries separated by whitespace.

Approximate expectation maximization
The approximate expectation maximization (AEM) algorithm infers haplotypes from pools using an approximate EM algorithm (Kuk, et al. 2009).
We run AEM using the following commands, which are their default settings:

Processing Haplotype Reconstruction Output
We converted all output data into a custom tab-delimited output format. The first row of the output file lists the haplotype IDs, and the second row lists their corresponding frequencies.
The remaining rows record the position of each varying site and list the alleles of each haplotype at that site. We evaluated the performance of each haplotype reconstruction tool by comparing their reconstructed haplotypes with the gold standard simulated haplotypes using the following command:

> java -jar PoolHapX.jar evaluate

We measured the accuracy of each tool by calculating the MCC and JSD of their reconstructed haplotypes.
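The tab-delimited layout described above can be illustrated with a toy example; the haplotype IDs, positions, and frequencies below are invented for demonstration and may differ from the exact header fields of real PoolHapX output:

```python
# Toy table in the described layout: row 1 = haplotype IDs, row 2 = frequencies,
# remaining rows = site position followed by each haplotype's allele at that site.
lines = [
    "hap0\thap1\thap2",
    "0.50\t0.30\t0.20",
    "1042\t0\t1\t1",
    "1187\t1\t0\t1",
]
table = [line.split("\t") for line in lines]
ids, freqs = table[0], [float(f) for f in table[1]]
assert abs(sum(freqs) - 1.0) < 1e-9  # frequencies within one pool sum to 1
print(ids, freqs)
```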

>freebayes --bam [chr, list of bam] --min-alternate-count 5 --fasta-reference [fasta file of chromosome] --dont-left-align-indels --vcf [chr].freebayes.raw.vcf >vcftools --vcf [chr].freebayes.raw.vcf --minQ 30 --stdout --recode --recode-INFO-all > [chr].freebayes.filter.vcf
The VCF files produced by GATK and freebayes are merged, keeping the genotyping information produced by freebayes. The P. vivax data was thus obtained in the form of VCF files. The data was separated by chromosome, and the respective BAM files were checked for coverage using SAMtools. The pool or strain ID for each P. vivax sample was noted. The VCF files were then filtered to retain only SNPs (i.e., no indels), by sequencing coverage of the pools, and by SNP allele frequencies greater than 0.05. Once filtered, the VCFs were formatted and analyzed by PoolHapX using its default parameters. PoolHapX identified multiple haplotypes in different strains of P. vivax. The numbers of haplotypes are presented in Supplementary Table 3 and an example of the visualized frequencies is illustrated in Supplementary Figure 9.
Supplementary Fig. 9: The graph shows the distribution of predicted haplotypes in Chromosome 13 of Plasmodium vivax strain SRS693934. The chromosome was divided into separate regions to enable PoolHapX to process the data. Each stacked bar in the graph represents one of these regions and each stack represents a haplotype. The height of a stack depicts the haplotype's frequency.

Infant Gut Colonization Time Series Analysis
StrainEst focuses on species of interest by defining their population structure through a clustering of their genetic variant profiles. In each cluster of SNP profiles, StrainEst can only detect the representative strain central to that cluster. However, StrainEst cannot investigate the subtle differences between haplotypes and reconstruct them de novo. In addition, while each cluster contains strains that are similar across the whole genome, the strains within the cluster may differ greatly in specific regions. StrainEst used a popular time series of infant gut colonization data to demonstrate its ability to analyze real metagenomic data. In this dataset, the dominant strain was 504_SEPI at the start of the sampling period (days 15 to 16), after which the dominant strain switched to 236_SEPI (day 17 onward). This finding is consistent with the conclusion of the original paper (Sharon, et al. 2013). In our analysis, we apply PoolHapX to the same data and map our predicted haplotypes back to the three main strains, to examine whether the aggregated haplotype frequencies follow a similar trend to the original estimation (Albanese and Donati 2017) based on reference templates of known strains. We then use our pipeline to call SNPs for the 11 available samples from 11 different time points. From the output, variants with minor allele frequency less than 0.05 were removed, leaving 10,975 SNPs in total. As PoolHapX cannot process such a large number of SNPs, and we also expect the haplotype structure in different regions to vary, we divided the 10,975 SNPs into 110 regions so that each region contains around 100 SNPs and can be quickly processed by PoolHapX.
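The region-splitting step can be sketched as follows (the function name is ours):

```python
# Split an ordered list of SNP positions into consecutive regions of about
# target_size SNPs each, so each region can be processed independently.
def chunk_snps(snp_positions, target_size=100):
    return [snp_positions[i:i + target_size]
            for i in range(0, len(snp_positions), target_size)]

regions = chunk_snps(list(range(10_975)), 100)
print(len(regions), len(regions[0]), len(regions[-1]))  # → 110 100 75
```

Splitting 10,975 SNPs into chunks of 100 yields the 110 regions described above, with a shorter final region.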
The running parameters of PoolHapX can be found in the property file in our GitHub (https://github.com/theLongLab/PoolHapX/tree/master/Simulation_And_Analysis). Average numbers of haplotypes at each time point are presented in Supplementary Table 4. We then conduct a comparison between the template-based inference by StrainEst and PoolHapX's de novo inference. The abundance of the three S. epidermidis strains 236_SEPI, 504_SEPI, and 604_SEPI in the infant gut colonization time series was predicted by StrainEst using a template-based method. However, StrainEst only predicted the abundance of the three representative strains, which is not directly comparable with the fine-scale haplotypes predicted by PoolHapX. We therefore align all the haplotypes predicted by PoolHapX to their nearest representative S. epidermidis strains, in order to check whether the abundances of the three strains predicted by both tools follow a similar trend for the majority of regions. Of the 110 regions, in around 37, 40, and 17 regions of the three respective strains we observed significant differences between the strain abundance predicted by StrainEst and the aggregated frequencies of PoolHapX-predicted haplotypes (two-tailed paired t-test, p-value < 0.05). The remaining regions, which constitute the majority, show similar frequencies. Overall, the average of the 110 regions shows a fairly similar pattern over the longitudinal samples. This result is depicted in Figure 5 of the main text.
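The per-region comparison can be sketched as below. We show only the paired t statistic; the toy frequencies are ours, and in practice the two-tailed p-value would come from a t distribution with n - 1 degrees of freedom (e.g. scipy.stats.ttest_rel):

```python
from statistics import mean, stdev
from math import sqrt

# Paired t statistic for one region, comparing StrainEst strain abundances
# (freqs_a) with aggregated PoolHapX haplotype frequencies (freqs_b)
# across matched time points.
def paired_t(freqs_a, freqs_b):
    diffs = [a - b for a, b in zip(freqs_a, freqs_b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Toy frequencies for one strain across five time points (illustrative values):
t = paired_t([0.60, 0.55, 0.20, 0.10, 0.05], [0.58, 0.50, 0.25, 0.12, 0.04])
print(round(t, 3))
```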

HIV genomic data
We selected patient 1 from the dataset generated by Zanini et al (Zanini, et al. 2015), due to the number of time-points available and the relative completeness of genomic coverage at each time-point. For more information about the inclusion criteria of study patients, the clinical origins of the sample data, and sequencing procedures, we refer the reader to Zanini et al (Zanini, et al. 2015). Patient 1 is a female with HIV-1 subtype 01_AE. The first blood plasma sample was taken at an estimated 122 days post-infection, and the last sample at 8.2 years post-infection. By applying PoolHapX to the data, we reconstructed haplotypes at each time-point; the number and frequencies of haplotypes are listed in Supplementary Table 5. Within each time-point-specific set of haplotypes, we calculated the site-specific extended haplotype heterozygosity (EHHS) at every SNP position. To examine the rate and dynamics of linkage decay within each gene, we calculated at each time-point the size of regions around each SNP position in which EHHS is equal to or larger than 0.5.

EHH-based analyses of the within-patient population
Scripts for generating well-formatted files from PoolHapX output files and applying extended haplotype homozygosity-based statistics to the reconstructed haplotypes are available in our GitHub repository (https://github.com/theLongLab/PoolHapX/tree/master/Simulation_And_Analysis/HIV_analysis_code). The scripts include detailed descriptions of our parameters and settings.
Using the sequence alignment and variant calling procedure described in the previous sections, 208 SNP positions were identified from the sequencing data from all 12 time-points. The sequencing reads from each time-point were re-mapped to the consensus whole-HIV genome sequence from patient 1's first sample to generate consistent SNP positions. The reads from the SAM files of each time-point were annotated with either a 0 or a 1 at each SNP position, depending on whether they carried the reference or an alternate allele. PoolHapX was subsequently applied to reconstruct haplotypes jointly across the longitudinal samples of patient 1, with a set of haplotypes reconstructed for each time-point (default parameters are used).
For each time-point, a single PoolHapX output file was generated and converted to MS (Ewing and Hermisson 2010) output files. The within-population frequencies of each haplotype were scaled to the number of viral particles per milliliter of patient plasma (Zanini, et al. 2015). MS is generally used to generate samples under the assumption of neutral evolution, and its default output format encodes haplotypes as a string of 0s (reference allele) and 1s (alternate allele) at each SNP position (Hudson 2002).
The R package rehh (version 3.0) was applied to survey linkage patterns within a single timepoint, as well as changes to linkage patterns across the duration of infection monitoring (Gautier, et al. 2017). Several long-range haplotype-based evolutionary statistics related to extended haplotype homozygosity (Sabeti, et al. 2002) were used to quantify the type and magnitude of selection.
Within each set of haplotypes specific to a time-point, the site-specific extended haplotype heterozygosity (EHHS) was calculated at every SNP position (Sabeti, et al. 2007). EHHS quantifies the probability that any pair of randomly chosen haplotypes is homozygous over the region between the focal SNP position and another SNP position. Following (Gautier and Vitalis 2012; Gautier, et al. 2017), the EHHS of the region between focal SNP position s and another SNP position t is

EHHS(s,t) = [sum over k = 1, ..., K(s,t) of n_k(n_k - 1)] / [n_s(n_s - 1)]

where n_s is the number of haplotypes at SNP position s, which is equivalent to the number of haplotypes at that time-point since all PoolHapX-reconstructed haplotypes are complete across the span of SNP positions; K(s,t) is the number of distinct haplotype classes that are homozygous over the region between SNP positions s and t (inclusive); and n_k is the number of haplotypes in class k.
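This definition can be computed directly from a set of reconstructed 0/1 haplotype strings; the sketch below uses toy haplotypes of ours, not patient data:

```python
from collections import Counter

# EHHS between focal SNP s and SNP t (0-based, inclusive), following the
# definition above; haplotypes is a list of equal-length 0/1 strings.
def ehhs(haplotypes, s, t):
    lo, hi = min(s, t), max(s, t)
    n = len(haplotypes)
    counts = Counter(h[lo:hi + 1] for h in haplotypes)  # n_k per haplotype class
    return sum(nk * (nk - 1) for nk in counts.values()) / (n * (n - 1))

haps = ["0011", "0011", "0110", "0110", "1100"]
print(ehhs(haps, 0, 1))  # → 0.2
```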
To examine the rate and dynamics of linkage decay within each gene, we calculated at each time-point the size of regions around each SNP position where EHHS is equal to or larger than 0.5. The size of the region between the rightmost and leftmost positions such that EHHSs,t >= 0.5 (same notation as above) was plotted for each time-point and each gene.
To search for regions of positive selection within the reconstructed genome, we calculated the integrated haplotype score (iHS) for each time-point, as well as cross-population EHH scores (XP-EHH) between each time-point. The rehh function scan_hh() calculates the integrated haplotype homozygosity (iHH) for both the ancestral and derived alleles respectively, and the rehh function ihh2ihs() calculates the iHS from the ancestral and derived iHH. The iHH of an allele is the area under the EHH curve of that allele, which quantifies the decay of linkage around that allele. The minimal threshold of EHH calculation (limehh) is set to 0, and EHH integration continues to the end of the reconstructed region (for patient 1 haplotypes, this is from positions 1 to 6078).
The iHS of a SNP position measures the homozygosity around the derived allele relative to the ancestral allele. By standardizing against the distribution of iHS from other SNP positions with similar allele frequencies, p-values can be calculated for each SNP position to flag outliers where either the derived or ancestral allele at that position is significantly more homozygous than the other. Alleles are binned by frequency in increments of 0.1, and the p-value threshold for outlier status was set to 0.05. The equations are the same as the ones in Gautier et al (Gautier, et al. 2017).
uniHS(s) is the un-standardized log-ratio of the ancestral iHH to the derived iHH at SNP position s. mean(uniHS | p_s) and sd(uniHS | p_s) refer to the mean and standard deviation of uniHS over SNP positions belonging to the same allele frequency bin p_s, so that their frequencies are within 0.1 of each other. We standardize to obtain

iHS(s) = (uniHS(s) - mean(uniHS | p_s)) / sd(uniHS | p_s)
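The binned standardization can be sketched as follows. This is a simplified analogue of rehh's ihh2ihs(), and the binning details and toy values below are ours:

```python
from statistics import mean, stdev

# Standardize unstandardized iHS scores within derived-allele frequency bins.
def standardize_ihs(uni_ihs, freqs, bin_width=0.1):
    bins = {}
    for score, f in zip(uni_ihs, freqs):
        bins.setdefault(int(f / bin_width), []).append(score)
    out = []
    for score, f in zip(uni_ihs, freqs):
        group = bins[int(f / bin_width)]
        sd = stdev(group) if len(group) > 1 else 1.0  # singleton bins left unscaled
        out.append((score - mean(group)) / sd if sd else 0.0)
    return out

scores = standardize_ihs([0.4, -0.2, 0.1, 1.3], [0.12, 0.18, 0.15, 0.55])
print([round(x, 2) for x in scores])
```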
Selection tends to produce regions with multiple outliers (Tang, et al. 2007). To identify regions where homozygosity is maintained to a significant degree relative to the rest of the genome at a given time-point, the rehh function calc_candidate_regions() was applied to intervals of 250 base pairs (bp) containing two or more SNP positions with outlier iHS scores, in a sliding-window fashion with 125 bp of allowable overlap. If overlapping 250 bp intervals each contain two or more outliers, the intervals are merged and reported as a single region. We chose the interval size of 250 bp because it is slightly less than the average window size of EHHS >= 0.5 in the envelope gene, which is the smallest average window size of all genes analyzed (gag, pol, vif, vpu, tat, env).
To identify sites where haplotype homozygosity surrounding one allele or the other has undergone significant longitudinal change, the XP-EHH of each SNP position was calculated by comparing its integrated EHHS (analogous to iHH, but with EHHS instead of EHH) at time-point i + 1 relative to the previous time-point i. As with iHS, the integrated EHHS must also be standardized relative to the rest of the inter-time-point comparisons (Gautier, et al. 2017).
unXP-EHH(s) is the unstandardized log-ratio of the population 1 iES (at time-point i) to the population 2 iES (at time-point i + 1) at SNP position s. mean(unXP-EHH | p_s) and sd(unXP-EHH | p_s) refer to the mean and standard deviation of unXP-EHH over SNP positions within the same allele frequency bin (p_s) of width 0.1. We standardize XP-EHH(s) by subtracting the mean and dividing by the standard deviation of unXP-EHH.
Similar to the iHS regions obtained above, regions where homozygosity is significantly increased relative to the previous time-point are calculated by applying the rehh function calc_candidate_regions() to the XP-EHH scores of each time-point's SNP positions. The user-defined parameters are the same as described for the outlier iHS regions above.

Longitudinal Analyses of Patient HIV Populations Reveal General and Gene-specific Linkage Patterns
Typically, prior investigations estimated the recombination rate in genes targeted by antiretroviral drugs or immune selection either by clonal sequencing, or indirectly through longitudinal SNP comparisons (Messer 2013;Rouzine, et al. 2014). To demonstrate the utility of haplotype reconstruction for evolutionary analyses of complex populations, we explored longitudinal patterns of linkage across the HIV genome. For more sample details and a thorough analysis of SNP-based trends in genetic diversity and divergence, we refer the reader to Zanini et al (Zanini, et al. 2015).

The applicability of EHH-based selection statistics to within-patient HIV populations
Given its high mutation rate (Rambaut, et al. 2004), HIV does not necessarily follow the assumption of identity by descent, and the maintenance of regional homozygosity may not indicate population expansion due to the advantage of a single allele. The same mutation may have arisen multiple times within the span of the longitudinal samples. As a result, there may be multiple haplotypes that are identical by state rather than by descent, which dilutes the signal of positive selection for that haplotype. Additionally, most SNP positions are assumed to evolve neutrally. Though the genes surveyed in this analysis are not as immunogenic as the envelope gene (of which only a few base pairs are reconstructed), several proteins in Gag, such as p24, do accumulate cytotoxic T-lymphocyte (CTL) escape mutations at some SNP positions due to their physical location on the exterior of the HIV capsid (Prince, et al. 2012). Thus, it is likely, although not certain, that the rehh-defined p-value (against a null hypothesis of no selection when all positions follow a Gaussian distribution) holds.
However, the maintenance of any particular linkage block is difficult under a high rate of recombination, and therefore the strength of linkage in a block of alleles relative to the whole genome is still indicative of the relative fitness of that regional haplotype. Thus, EHH-based summary statistics can still be considered as general tests for assessing positive selection via unusually long-range LD.
The integrated haplotype score (iHS) and cross-population EHH score (XP-EHH) are both used because they are complementary. Whereas XP-EHH, which is similar to Tang et al.'s Rsb statistic (Tang, et al. 2007), has the power to detect selective sweeps that have reached fixation, iHS is better for comparing sweeps which are in progress, in which case there exist two alleles with intermediate frequencies.

Linkage decays extremely rapidly post-infection, but does not decrease monotonically over time
Within each time-point-specific set of haplotypes, we calculated the site-specific extended haplotype heterozygosity (EHHS) at every SNP position (Sabeti, et al. 2002). EHHS measures the decay of haplotype homozygosity as a weighted combination of the ancestral and the derived allele frequencies. An EHHS of 0.5 indicates that 50% of the haplotypes carrying an allele at the focal position also carry the same allele at the distant position.
Although the Gag-Pol polyprotein gene is in complete LD at the first time-point, over a year later at time-point 2 (562 days post infection (DPI)), the window of SNP positions where the EHHS is larger than 0.5 has shrunk to at most 1841 bp at positions 1412 to 1670, which overlap the p7, p1, and p6 genes in Gag. However, even at the first time-point, Gag-Pol SNP positions are not in complete linkage with the genes downstream of Pol which we reconstructed, including the genes Tat, Vif, Vpu, and the first 386 bp of Env (Supplementary Fig. 10).

Linkage is measured on the order of several hundred base-pairs consistently across two-thirds of the HIV genome
To examine the rate and dynamics of linkage decay within each gene, we calculated the average size of regions around each SNP position where EHHS is larger than 0.5. We plotted the longitudinal decay of homozygosity for each focal position (Figure 6C, D and Supplementary Fig. 10). Since EHHS decays monotonically (Figure 6A, B and Supplementary Fig. 10) with increasing genomic distance from each SNP position, the sizes of these regions represent the average size of a linkage block at that time-point.
Strikingly, for all genes the decrease in average window size appears to level out around day 1282, approximately 3.5 years post infection. After this point, the average window size across all positions in the remaining time-points is 217 bp, and the standard deviation is 170 bp. In intermediate progressors, this is the time-point in disease progression where CD4+ T-cell counts start to decline (Langford, et al. 2007). After this point, instead of decreasing monotonically towards an asymptotic value, the average window size fluctuates around a range constrained to several hundred bp, depending on the gene. These fluctuations can occur quickly, with increases and decreases in the hundreds of bp on scales as low as a month (Supplementary Fig. 10). Within Gag and Pol, there is substantial heterogeneity in average window size, with the downstream regions of Gag and Pol largely fluctuating between 0 and 250 bp (Figure 6C,D).

Although the first half of the Pol gene, which spans from 1551 to 3470 bp, fluctuates in window size between 0 and 500 bp, the latter half of the Pol gene behaves more similarly to the genes downstream of Gag-Pol, with the average size fluctuating between 0 and 250 bp. Even within Gag and Pol, where the rate of nonsynonymous evolution is slower than in accessory and envelope proteins (Zanini, et al. 2015), there is no SNP position with a consistent window of high EHHS.
Zanini et al. (Zanini, et al. 2015) previously characterized linkage using D', an approach that relies on the average allelic content of a genomic fragment. After reconstructing HIV haplotypes from the first 6000 bp, we find that the average linkage block size is at least twice as large as previously estimated, and the standard deviation is nearly as large as the average. Nevertheless, our estimate of the recombination rate, which is based on an average linkage block size of 217 bp and an average separation of 261 days between sampling times, is 1.76015 x 10^-5 crossovers/day, which agrees with both the study of Zanini et al. and several others (Shriner, et al. 2004; Neher and Leitner 2010; Batorsky, et al. 2011; Song, et al. 2018). Furthermore, the limited range of sizes observed in the average linked regions indicates that the first 6000 bp of the HIV genome appear to be governed by similar recombination rates after approximately three years post-infection.
In the absence of drug selection in immunogenic genes, selective sweeps still occur occasionally and recurrently in the same regions
We conducted a search for regions of positive selection within the reconstructed genome where selective sweeps could have taken place. To identify regions that are positively selected relative to other regions within the same time-point, we identified 500 bp windows containing 2 or more SNP positions that were outliers in terms of their integrated haplotype score (iHS). The iHS measures whether a derived allele displays significantly higher extended haplotype homozygosity (EHH) than the ancestral allele, or vice versa (Sabeti, et al. 2007) (Supplementary Table 6). To identify positively selected regions relative to the previous time-point, we identified 500-bp windows containing 2 or more SNP positions that were outliers in terms of their cross-population EHH (XP-EHH) (Sabeti, et al. 2007) (Supplementary Table 7). The region approximately between positions 5250 and 5625, which overlaps the Vpr and Vpu genes, is repeatedly identified as an outlier (Supplementary Table 1). The surrounding sequence space, which overlaps neighboring genes such as Vif and Env, is also included. Though the region between Pol and Env is not uniformly identified as an outlier at all time-points, its consistent identification as a site of selective sweeping (relative to the rest of the genome) flags it as a location for further investigation of baseline recombination rate estimates, especially given that none of these proteins are immunogenic. The exception in the set is the region between positions 2125 and 2375 identified at time-point 10 (day 2639), which is near the 5' end of the reverse transcriptase gene in Pol. It is unknown why this occurs.

In contrast, when comparing regions between time-points, there is both significant heterogeneity in the locations of the outlier regions and limited consistency in the affected gene between adjacent time-points (Supplementary Table 2). Regions that are recurrently swept include positions 5500-6000, which overlap Vpu and parts of Vpr and Env, and positions 3250-3625 in Pol, which overlap the p15 gene; both undergo sweeps at multiple time-points throughout the 10-year span of the study data. Interestingly, different regions of Gag appear as outliers at multiple time-points, accounting for 5 out of 16 entries and appearing at 4 out of 12 time-points.