Correlated Allele Frequency Changes Reveal Clonal Structure and Selection in Temporal Genetic Data

Abstract In evolving populations where the rate of beneficial mutations is large, subpopulations of individuals with competing beneficial mutations can be maintained over long times. Evolution with this kind of clonal structure is commonly observed in a wide range of microbial and viral populations. However, it can be difficult to completely resolve clonal dynamics in data. This is due to limited read lengths in high-throughput sequencing methods, which are often insufficient to directly measure linkage disequilibrium or determine clonal structure. Here, we develop a method to infer clonal structure using correlated allele frequency changes in time-series sequence data. Simulations show that our method recovers true, underlying clonal structures when they are known and accurately estimate linkage disequilibrium. This information can then be combined with other inference methods to improve estimates of the fitness effects of individual mutations. Applications to data suggest novel clonal structures in an E. coli long-term evolution experiment, and yield improved predictions of the effects of mutations on bacterial fitness and antibiotic resistance. Moreover, our method is computationally efficient, requiring orders of magnitude less run time for large data sets than existing methods. Overall, our method provides a powerful tool to infer clonal structures from data sets where only allele frequencies are available, which can also improve downstream analyses.


Evolutionary model
We assume a WF model consisting of N individuals evolving under mutation and selection.Each individual is represented by a sequence of length L. The loci are assumed to be bi-allelic where the value of each locus is either 0 (wildtype (WT)) or 1 (mutant), thus resulting in M = 2 L possible haplotypes.For clarity, we use i, j, ... to refer to locus indices and a, b, ... to refer to haplotype indices.The index is shown as a subscript when representing only one of the locus or haplotype indices.However, when both indices need to be shown simultaneously, the locus index is shown as a subscript while the haplotype index is shown as a superscript.Let n a (t) denote the number of individuals in the population that belong to haplotype a at generation t such that q M a=1 n a (t) = N .At generation t, denote z(t) = (z 1 (t),...,z M (t)) as the observed haplotype frequencies with z a (t) = n a (t) N .The observed allele frequencies are correspondingly x(t) = (x 1 (t),...,x L (t)) and are related to haplotype frequencies by x i = q a g a i x a where g a i represents the allele (either 0 or 1) at the ith locus of the ath gentype.
Here we assume that the fitness contribution from individual alleles is additive, such that the fitness f a of the ath haplotype can be written (3) Here the s i denote the time-invariant selection coefficients for mutant alleles, which quantify the selective advantage of mutant allele i relative to wild-type (WT).This model is consistent with a multiplicative fitness model where the effects of individual mutations are small, or an exponential fitness function for an additive trait, where each mutation is assumed to make a small contribution to the trait value.Under Wright-Fisher dynamics, the probability of observing haplotype frequencies z(t + 1) at generation t + 1, given haplotype frequencies of z(t) at generation t, is where p a (z(t)) is the probability of observing haplotype a at generation t.To derive this expression, we sum over contributions to generating haplotype a in the next generation, including the effects of mutation, recombination, and selection, (5) Here µ ba is the probability for haplotype b to mutate to haplotype a in a generation.
We assume that mutations at different loci are independent with the same mutation rate µ, so that L≠d(b,a) where d(b, a) is the number of different alleles (Hamming distance) between haplotype b and haplotype a.
Below, we will follow the assumption that the population size N is large, and that the selection coefficients s i , mutation rate µ, and recombination rate r are small (O(1/N )).For now, expanding the expression for p a (z(t)) to eliminate terms of order µ 2 and higher, we have The probability that the haplotype frequency vector follows a particular evolutionary path (z(t 1 ), z(t 2 ),..., z(t K )), conditioned on the initial state z(t 0 ), population size N , mutations rate µ and selection coefficients s is

Weighting observed products of allele frequency changes
For finitely sampled data, variations in allele frequencies near the extremes (0 and 1) are more susceptible to sampling noise.To reduce the influence of noise, we weight the entries of x x according to the smaller of the two allele frequency variances at each time.Specifically, the weight matrix is where v i (t k ) is the variance of interpolated frequency of allele i in the middle of time point t k and t k+1 , We denote the sum of the weighted x x matrix over an entire evolutionary history as D,which has entries where t k is the number of generations between sampling time points t k and t k+1 , i.e., t k = t k+1 ≠ t k .The factor of 1/ t k arises from linearly interpolating allele frequency trajectories, and summing x x entries from generation to generation.To prevent confusion, here we note that the D matrix defined above is different from the LD matrix 38 , which is also commonly written as D.

Forming initial clades
To determine which clade an allele i should belong to, we first determine the group that has the largest cooperation score with it, which we denote g max i , We quantify the degree to which allele i should be assigned to group g max i by a confidence score, fl conf (i), computed as its cooperation score with group g max i minus the sum of its cooperation scores with all other groups, With the above concepts, we form the initial clades by first identifying the pair of alleles with most competing behavior, i.e., the ones with the most negative entry in D, and assign them as two initial groups.We then select the next allele i out of the remaining ones that has the highest confidence score, fl conf (i).The selected allele i is processed in one of the following ways.
1. We assign it to group g max i if it cooperates with group g max i and competes with all other groups.That is, fl coop (i, g max i ) > 0, and fl coop (i, g) AE 0 for all other groups.
2. We mark it as a shared allele if it cooperates with more than one group, such that there are multiple groups with fl coop (i, g) > 0.
3. We assign it to a new group if it competes with all existing groups, with fl coop (i, g) AE 0.
We repeat this procedure of selecting and processing alleles iteratively until all alleles have been processed.
Denoting the number of groups as n, we put the group of shared alleles as the first group (group 1), and sort the other groups (groups 2 to n) by their sizes from largest to smallest.Considering possible redundant groups in the end with negligible signals, we select groups 2 to m so that their entries in the D matrix are just enough to account for more than a percentage (set to 99%) of all non-shared alleles in terms of the sum of absolute values, i.e., we select the smallest m such that q i" =joe{group 2,...,m} Each of groups 2 to m then constitutes an initial clade.

Iterative refinement of clade membership and frequencies
Below, we first introduce the concept of a "competition period," a time during which a specific set of clones compete, which is needed to account for cases with time-varying clonal structures.For each competition period, we obtain an initial estimate of clade frequencies from their constitutive allele frequencies.We then iteratively refine clonal identities and clade frequencies until convergence: We detect improbable classifications of alleles and re-classify them to appropriate clades, with clade frequencies re-estimated correspondingly, until no more improbable cases are detected.Finally, we merge our estimates for clade identities and frequencies from all competition periods.

Dividing evolution into competition periods
During the course of evolution, a population can exhibit different patterns of clonal interference at different times.For example, consider a population that consists of multiple, competing clades until one of the clades outcompetes the others and fixes.At a later time, new beneficial mutations can arise and compete with one another, initiating another period of clonal interference.To account for these possibilities, we define a "competition period" as a period of time bounded by the beginning and end of the coexistence of several clades.When an evolutionary history consists of multiple competition periods, each period can have a distinct clonal structure, i.e., number of clades, clade frequency trajectories, and assignment of alleles to different clades.It is thus important to detect and separate these competition periods to infer this information correctly.
We detect the existence of multiple competition periods and set approximate boundaries for them by identifying fixation events of mutations involved in clonal interference.For each clade (excluding the shared alleles) generated in the clustering step, we identify its member alleles that fix during the evolution.An allele i is identified as fixed at time point t k if its mean frequency after t k exceeds a threshold value, set here to 0.98.We take the fixation time of the allele with the largest confidence score, fl conf (i), as the fixation time of that clade.The fixation of a clade then marks the boundary between two successive competition periods.When there are n groups that fix during the evolution, we divide the whole evolutionary history into n + 1 competition periods at their fixation times, and infer the clonal structure for each competition period separately.
The initial clades determined in the previous section are formed by analyzing the matrix D integrated over the entire evolution.During a competition period, some of these clades may have become extinct or may be yet to emerge.The population can therefore exhibit different patterns of clonal interference in different competition periods.To infer the clonal structure for each competition period, we form initial clades again for each competition period separately.The steps to form initial clades for a competition period are the same as introduced in the previous section, except that the matrix D is integrated over that period alone.By limiting the integration of matrix D to each period, mutations that fixed before a period or have yet to appear will not interfere with the classification of mutations that are at intermediate frequencies within that period.This reduces noise in the integrated D matrix and could make clustering more accurate.

Estimating clade frequencies
To estimate the frequencies of these clades, we select alleles that can provide reliable information on clade frequencies.We filter out alleles that are shared by multiple groups in the clustering process, alleles that fix during the competition period, and alleles that get extinct during the evolution.An allele i is identified as extinct at time point t k if its mean frequency after t k is lower than a threshold value (set to 1%).These alleles are excluded from the next step because they provide less reliable information to distinguish between clades.The remaining alleles are then used to obtain an initial estimate of clade frequencies.
At time point t k , an allele in clade c can be in one of three states: fixed within its clade, at an intermediate frequency, or yet to emerge.With sampling noise, frequencies of alleles that are fixed within clade c are distributed around the true underlying clade frequency trajectory, f c (t k ).Frequencies of alleles that have not emerged are near zero.We assume that the typical time required for alleles to fix within a clade is shorter than the time for entire clades to fix in the population.Therefore, at most time points, most alleles have either already fixed within clades or have not yet emerged.Equivalently, if we view frequencies of all alleles in clade c as sampled from a distribution, then its probability density should have two peaks at zero and f c (t k ) at most time points.Based on this assumption, we estimate the frequency of clade c at time point t k by frequencies of all its member alleles that exceed a threshold (0.01), S = {x i (t k ) | allele i oe clade c, x i (t k ) > 0.01}, in three steps.
1. We consider only frequencies larger than a minimum threshold (0.01).When there are no such frequencies, the clade frequency is set as the mean of all its member frequencies.
2. We estimate the probability density of frequencies in set S using a kernel density estimation (KDE) method 39 .When set S contains too few frequencies to yield a reliable estimate (fewer than four frequencies), we resort to setting the clade frequency as the mean of all frequencies in set S.
3. We identify the frequency with the highest estimated probability density, select frequencies that fall within a window of width 0.2 around that frequency, and take their mean frequencies as an initial estimate of the clade frequency f c (t k ).

Refining clonal identities and clade frequencies
The initial estimates of clonal identities and clade frequencies are based on the clustering results, which are based, in turn, on the evaluation of co-varying behaviors quantified by fl coop .Here we refine these estimates by accounting for sampling probabilities.Without sampling noise, the frequency of an allele should not exceed its clade frequency f c .However, this becomes possible with noise from finite sampling.In cases where an allele i has a higher frequency than its clade at a time point t k , x i (t k ) > f c (t k ), we assume that its true frequency is the clade frequency f c (t k ), and quantify the sampling probability by a binomial distribution, i.e., where R i (t k ) is the read depth of the data at allele i at time point t k , and f c (t k ) is the estimated frequency of clade c at time point t k .Note that these probabilities are only considered for alleles that have a higher frequency than their clades at some time point.We refine estimates of clonal identities and clade frequencies in a recursive way.In each round of refinement, for every allele with a frequency higher than its current clade, we compute the logarithm of its mean sampling probabilities with respect to all clades over the competition period, i.e., log P (i, c) = 1 where T is the number of time points in the current competition period.The clade it most probably belong to based on sampling probabilities is then denoted as We then reclassify it to the clade c max i when the probability gain by doing this is larger than a threshold ÷, i.e, log P (i, After each round of refinement, we re-estimate time-series frequencies for all clades according to refined clonal identities following the steps outlined in the previous section.We then start another round of refinement, unless there is no reclassification performed in the last round.

Incorporating initially excluded alleles
During the previous steps of estimating clade frequencies and refining clades and frequency estimates, we exclude alleles that provide less reliable information to distinguish between clades.These are alleles that are shared by multiple groups in the clustering process, alleles that fix during the competition period, and alleles that never reach a minimum frequency threshold throughout the competition period.After we have refined clades and their frequency estimates, we reconsider these excluded alleles by evaluating their cooperation scores and sampling probabilities with all clades.The cooperation score between an allele i and clade c is defined as  Otherwise, we keep allele i as a non-clonal allele which is shared across clades.The first condition ensures that allele i exhibits co-varying behaviors to a greater extent than at least half of member alleles of clade c maxCoop i .The second condition prevents an assignment that is too improbable based on sampling probabilities.

Merging results from all competition periods
After recovering the clonal structure for each competition period, we merge results from all periods by aggregating clonal identities and connecting clade frequencies.For two successive competition periods p and p + 1, one of the clades in period p dominates the population at some time, becomes the "ancestor" clade in period p + 1, and develops into subclades.We denote this clade as c a , and its fixation time as t f .When the first sub-clade of c a emerges after t f , the clonal dynamics of periods p and p + 1 are well separated in time, and their clonal structures can be directly merged, e.g., as shown in Supplementary Fig. S1A.However, when the first sub-clade of c a emerges before t f , say at t e < t f , there is an overlap period [t e ,t f ] when c a has not fixed and a sub-clade has started to emerge.In such a scenario, the frequency trajectory of this sub-clade during this overlap is not estimated in period p + 1, since we divide these two competition periods at the fixation time t f .It may not be identified in period p, as the signal of it in the D matrix integrated in that period can be weak.In order to estimate this segment of clade frequency trajectory, we make use of an alternative way to define competition periods.Instead of dividing at fixation times of clades, we set boundaries for competition periods as time points when the clade that is going to fix (i.e., the ancestor clade of the next competition period) first breeds a sub-clade.Such boundaries can be determined because we have already inferred clonal structure for these competition periods.With re-defined boundaries, we recover again the clonal structure for each competition period.With results from both rounds, we can now connect clade frequencies seamlessly across competition periods.A schematic example of this workflow is shown in Supplementary Fig. S1B.

Marginal path likelihood (MPL) inference
Sohail et al. presented a diffusion approximation and path integral expression for the stochastic haplotype frequency dynamics under the WF model setting 40 .The path integral expression for the probability of observing a trajectory of hap-lotype frequencies (z(t 1 ), z(t 2 ),..., z(t K )) is given by where C(z(t k )) is the haplotype covariance matrix, i.e., and where d a (z(t k )) is the expected change in haplotype frequency z a at time t k .
The MAP estimate of the selection coefficients can be obtained by solving where P prior (s) is an assumed prior with mean zero and variance ‡ 2 > 0, and the likelihood function is given as (15) The right-hand side is approximated by Equation Eq. (11).Differentiating the expression with respect to s and equating to zero leads to the MAP estimator of selection coefficients s

Simulation
To benchmark the performance of our method, we generate artificial time series sequence data by simulating evolution as a Wright-Fisher process.As in the previous section, the loci are assumed to be bi-allelic.In the default setup, in each of the 40 simulations that we ran, a population of N = 1000 sequences started with all wild-type sequences and evolved for T = 1000 generations.At a generation t, the population first goes through a multinomial sampling process to determine the number of sequences z a (t) for each existing haplotype a in the current generation, where the probability p a for a haplotype a to be drawn is proportional to the product of its frequency in the last generation z a (t ≠ 1) and fitness f a , i.e., Each sequence, at each generation, has a fixed probability (2 ◊ 10 ≠4 ) to acquire a new mutation.We assume that at most one mutation occurs in a given sequence.We enforce an additional condition that promotes the occurrence and persistence of clonal interference: Each mutation occurs on a unique site, which also means that each mutation can only be acquired once.The selection coefficient of each mutation is drawn from a Gaussian distribution centered at 0.03 with a standard deviation of 0.01.At the end of the simulation, only alleles whose frequencies once exceed a threshold of 0.05 are preserved and considered in later analysis.We varied the default simulation setup to test our method more thoroughly.We considered simulations with smaller (N = 100) or larger (N = 10000) population sizes, simulations with rare recombination events, simulations starting from a mixture of random haplotypes, and simulations of diploid populations with random mating.In the diploid population, characterized by having two sets of chromosomes, we correspondingly adjusted the mutation rate to be twice as large (4 ◊ 10 ≠4 ).

Comparing runtime of methods on LTEE data
We applied the Lolipop and Evoracle methods on the LTEE data and compared their run times with our method in Supplementary Fig. S10.The LTEE data consists of 12 populations.Six populations are nonmutator populations with fewer than 1,000 alleles, while the other six are mutator populations with more than 1,000 alleles.When applied on the mutator populations, Lolipop and Evoracle did not finish running after 2 weeks (after which time running jobs were killed), while our method requires less than 1 day to complete.Supplementary Fig. S10 shows that the run time of our method is typically around an order of magnitude faster than Lolipop and around two orders of magnitude faster than Evoracle for these large data sets.

Clustering results for population p3 of LTEE data
Population p3 involves more than 6,000 mutant alleles and is one of the six mutator populations in the LTEE data.We found that the dynamics of this population is better explained when we split the whole evolution into two competition periods at generation 35,000, infer for each period, and combine the inferred dynamics of two periods.We plot the conjugate clustering results in Supplementary Fig. S8A.Inference on each period identifies strong competition between two clades, where the first period features competition between clade 1 and clade 2, and the second period features competition between clade 3 and clade 4. When looking at their relationships, we found that there are 21 mutant alleles that are shared between either clade in period 1 and either clade in period 2. Their frequency trajectories (plotted in yellow in Supplementary Fig. S8A) follow either clade 1 or clade 2 in the first period, and then follow either clade 3 or clade 4 in the second period.This suggests complex evolutionary dynamics in which the same mutations appear or arise on different backgrounds.
Similar to entries in the matrix D, which quantify correlated frequency changes for different alleles, the cooperation score fl(i, c) quantifies correlated frequency changes of allele i and clade c.We denote c maxCoop i as the clade with highest cooperation score with allele i, and c maxProb i as the clade with highest mean sampling probability with allele i, i.e., c maxProb i = argmax c log P (i, c).We assign allele i to clade c maxCoop i when the two following conditions are met.1. fl(i, c maxCoop i ) > f(c maxCoop i ), where f(c maxCoop i ) represents the median cooperation score with clade c maxCoop i for all its member alleles.
2. log P (i,c maxProb i ) ≠ log P (i, c maxCoop i ) <= ÷, where÷ is a preset threshold, and log P terms are averaging logarithm of sampling probabilities over time points at which frequency of allele i exceeds frequency of clade c.
Effects of different sampling temporal intervals.Performance of five methods averaged over 40 simulations with different sampling temporal intervals for inference of (A) integrated allele frequency covariance, (B) selection coefficients, and (C) haplotype fitness.The left column shows rank correlations with true values, and the right column shows the mean absolute error (MAE) of inferred values versus true values.The True method uses the true allele frequency covariance matrix, which is not available in short read data, and represents the ideal performance.
Clonal structure for population p3 of LTEE data inferred when split into two periods at generation 35,000 exhibits four-clade competition and complex dynamics.The clustering results from (A) our method (inferred as two periods) and (B) previous results41 on the population p3 identify different patterns of clonal interference.Clades 1 and 2 are identified by inference on the first period from generation 0 to 35,000.Clades 3 and 4 are identified by inference on the second period from generation 35,000 to the last generation.For better visibility, we place trajectories of mutant alleles that are shared across clades (marked as yellow; totaling 21 alleles) at the front layer.C, D matrix segmented into groups according to clustering processes of the two periods.S9.Clonal structure and fitness inference for Planktonic-2 replicate of tobramycin data.(A) Mutant allele frequency trajectories, and (B) clade frequency trajectories inferred by our method are plotted.C, Our method outperforms the Lolipop and LB methods, and has the same performance as Evoracle, in terms of rank correlation between inferred fitness and measured MIC values.D, Recovered integrated covariance matrix segmented into blocks according to clustering results of our method.The number of alleles in each clade is shown in brackets.Alleles in the same clades tend to show cooperating behaviors, as indicated by positive entries.Clades tend to show competing behaviors with each other, as indicated by negative entries.