Estimating the mutational fitness effects distribution during early HIV infection

Abstract The evolution of HIV during acute infection is often considered a neutral process. Recent analysis of sequencing data from this stage of infection, however, showed high levels of shared mutations between independent viral populations. This suggests that selection might play a role in the early stages of HIV infection. We adapted an existing model for random evolution during acute HIV infection to include selection. Simulations of this model were used to fit a global mutational fitness effects distribution to previously published sequencing data of the env gene of individuals with acute HIV infection. Measures of sharing between viral populations were used as summary statistics to compare the data to the simulations. We confirm that evolution during acute infection is significantly different from neutral. The distribution of mutational fitness effects is best fit by a distribution with a low, but significant, fraction of beneficial mutations and a high fraction of deleterious mutations. While most mutations are neutral or deleterious in this model, about 5% of mutations are beneficial. These beneficial mutations will, on average, result in a small but significant increase in fitness. When assuming no epistasis, this indicates that, at the moment of transmission, HIV is near, but not on, the fitness peak for early infection.

to the rapid expansion and absence of immune response, which is the main evolutionary pressure

Estimating the shape of the mutational effects distribution
In order to calculate the fitness of a mutated sequence, every possible mutation is assigned a fitness effect according to the mutational fitness effects distribution (MFED). Each of these effects is assumed to apply universally in all hosts; there are no host-specific effects. The fitness of a sequence is then the product of the fitness effects of all mutations in the sequence.

The effects in the MFED range from zero to infinity, with a fitness effect of one indicating a neutral mutation (see figure 2). Deleterious mutations have a fitness effect smaller than one, with an effect of zero indicating a lethal mutation. Sequences carrying a lethal mutation differ from sequences carrying a non-lethal deleterious mutation since they will never produce any offspring, while sequences carrying mutations with a very small, but nonzero, fitness effect can still produce offspring, albeit with a very low probability. A very beneficial mutation might also compensate for such a deleterious mutation, while this is impossible in the case of lethal mutations.
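The multiplicative fitness rule and the difference between lethal and merely deleterious mutations can be illustrated with a minimal sketch. The function name and the numeric effects are hypothetical values chosen for illustration; they are not taken from the fitted distribution.

```python
import math


def sequence_fitness(effects):
    """Fitness of a sequence: the product of the fitness effects of all
    mutations it carries. An unmutated sequence (no effects) has fitness 1."""
    return math.prod(effects)


# Illustrative effects only (not fitted values):
unmutated = sequence_fitness([])             # no mutations, fitness stays neutral
compensated = sequence_fitness([0.5, 2.0])   # a beneficial mutation offsets a deleterious one
lethal = sequence_fitness([0.0, 50.0])       # a lethal mutation cannot be compensated
```

Because the effects multiply, any single effect of zero fixes the product at zero regardless of how beneficial the other mutations are, which is the behaviour the text describes for lethal mutations.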

The fitness effects distribution will affect the number of shared mutations across viral populations. While the probability of a mutation occurring does not change, the probability of a mutation being maintained in the population and later sampled is affected by the fitness effect of the mutation. Once a beneficial mutation occurs, the sequence carrying this mutation will create more offspring than unmutated sequences, and will be overrepresented in the viral population after a few generations. This increases the chance of observing the mutation in two or more samples.
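To make the overrepresentation argument concrete, the following sketch computes the expected frequency of a mutant lineage after a number of generations. It is a deterministic simplification (discrete generations, no further mutation, no stochastic loss) and not the stochastic simulation used in this work; the function name and parameters are assumptions made for illustration.

```python
def mutant_frequency(p0, w, generations):
    """Expected frequency of a mutant with relative fitness w after the given
    number of generations, starting from frequency p0.

    ASSUMPTION: deterministic growth with discrete generations; the
    unmutated background has fitness 1.
    """
    grown = p0 * w ** generations
    return grown / (grown + (1.0 - p0))


# A neutral mutant (w = 1) keeps its starting frequency, a beneficial one
# (w > 1) rises, and a deleterious one (w < 1) declines.
neutral = mutant_frequency(0.01, 1.0, 20)
beneficial = mutant_frequency(0.01, 1.2, 20)
deleterious = mutant_frequency(0.01, 0.8, 20)
```

A higher frequency in the population translates directly into a higher probability that the mutation appears among the sampled sequences of a host.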

Lethal or deleterious mutations will cause the sequence carrying the mutation to have fewer offspring and are therefore unlikely to be observed. Since these mutations are so unlikely to be observed, the sites can be considered immutable, which results in an effective shortening of the genome available for mutation. This might indirectly increase the chance of sharing mutations by increasing the chance that other, less detrimental, mutations are observed.

We defined 6 different models for the mutational effects distribution (see figure 2), in addition to a neutral model where all fitness effects are one.

The first two models describe simplified distributions with a restricted effects range. The 'beneficials only' model consists only of neutral and beneficial fitness effects, and is defined by a fraction of beneficial mutational effects (f_b), which are exponentially distributed with mean λ.
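A minimal sketch of drawing fitness effects under the 'beneficials only' model is given here. The text does not specify how the exponential draw maps onto a fitness effect; the sketch assumes that a beneficial effect equals one plus an exponentially distributed gain with mean λ. The function name and the parameter values in the usage line are likewise hypothetical.

```python
import random


def sample_beneficials_only(n_mutations, f_b, lam, rng):
    """Draw one fitness effect per possible mutation.

    With probability f_b the mutation is beneficial, otherwise it is
    neutral (effect 1).
    ASSUMPTION: a beneficial effect is 1 + an exponential gain with mean lam;
    the source does not state this mapping explicitly.
    """
    effects = []
    for _ in range(n_mutations):
        if rng.random() < f_b:
            # expovariate takes a rate, so rate = 1 / mean
            effects.append(1.0 + rng.expovariate(1.0 / lam))
        else:
            effects.append(1.0)
    return effects


# Illustrative parameter values, not fitted estimates
effects = sample_beneficials_only(20000, f_b=0.05, lam=0.5, rng=random.Random(42))
```

Under this assumption no effect is ever below one, which matches the restricted effects range of the model.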
to zero in any of the models.  The log-normal model has slightly fewer beneficials (4.5% vs 5.2% in the 'lethals and beneficials' Interactions between mutations are therefore unlikely.

The fit resulted in two high-probability models (the 'lethals and beneficials' and the log-normal model), two low-probability models (the 'beneficials only' and '5 spikes' model), and three ex-   shows us that 4% of mutations will introduce a premature stop codon, which is typically lethal.

This sets a lower bound on the number of lethal mutations, which is just met by the best fit of The ABC-SMC framework (standing for Approximate Bayesian Computation using Sequential estimation of the MFED. For this, we implemented an SMC procedure in Python.

The fitting procedure starts with equal probabilities for all models. Simulations are then performed for all models with random parameters according to their prior distribution. The priors were uniform distributions from zero to one for all parameters except λ (0, 2) in the exponential beneficials models, µ (−1, 1) in the log-normal models and b_b (1, 2) in the 5 spikes model. In the first iteration, a set of parameters (a 'particle') is sampled and a simulation is run with these parameters. If the distance between the summary statistics of this simulation and the data is smaller than ε_1, the particle is retained. Once 1000 particles have been accepted, a weight is

In the next generations, a particle is sampled from the previous iteration using the assigned weights. The particle is then perturbed according to a Gaussian kernel and a simulation is run with these parameters. Again, the distance between the summary statistics of this simulation and the data is calculated. If this is smaller than ε_i (with i the iteration), the particle is retained and once 1000 particles have been accepted, the weights are calculated. We

In total, 14 summary statistics were defined to calculate the distance between simulations and data.
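The sampling, perturbation and acceptance steps described above, together with the distance between summary statistics, can be sketched for a single model with a single parameter. This is a simplified illustration and not the implementation used here: it omits model selection, the simulator is a toy stand-in, the distance is assumed to be Euclidean, and the weight formula is the standard ABC-SMC importance weight, which is an assumption since the weight calculation is not spelled out in this passage. All function names and numeric settings are hypothetical.

```python
import math
import random


def simulate(theta, rng, n=50):
    """Toy stand-in simulator (hypothetical): returns two summary statistics."""
    draws = [rng.gauss(theta, 1.0) for _ in range(n)]
    mean = sum(draws) / n
    var = sum((d - mean) ** 2 for d in draws) / n
    return (mean, var)


def distance(stats_a, stats_b):
    """ASSUMPTION: Euclidean distance between summary-statistic vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(stats_a, stats_b)))


def gauss_pdf(x, mu, sigma):
    """Density of the Gaussian perturbation kernel."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def abc_smc(observed, epsilons, n_particles, prior=(0.0, 1.0), sigma=0.1, seed=0):
    """Single-parameter ABC-SMC with a uniform prior and Gaussian kernel."""
    rng = random.Random(seed)
    lo, hi = prior
    particles, weights = [], []
    for t, eps in enumerate(epsilons):
        new_particles, new_weights = [], []
        while len(new_particles) < n_particles:
            if t == 0:
                theta = rng.uniform(lo, hi)  # first iteration: sample the prior
            else:
                # later iterations: resample by weight, then perturb
                parent = rng.choices(particles, weights=weights)[0]
                theta = rng.gauss(parent, sigma)
                if not lo <= theta <= hi:
                    continue  # outside the prior support
            if distance(simulate(theta, rng), observed) >= eps:
                continue  # rejected: simulation too far from the data
            if t == 0:
                w = 1.0
            else:
                # ASSUMPTION: standard ABC-SMC weight, prior density divided
                # by the kernel mixture over the previous population
                denom = sum(wj * gauss_pdf(theta, pj, sigma)
                            for pj, wj in zip(particles, weights))
                w = (1.0 / (hi - lo)) / denom
            new_particles.append(theta)
            new_weights.append(w)
        total = sum(new_weights)
        particles = new_particles
        weights = [w / total for w in new_weights]
    return particles, weights


# Usage with made-up "observed" statistics and a decreasing threshold schedule
particles, weights = abc_smc((0.3, 1.0), epsilons=[1.0, 0.5, 0.3], n_particles=200)
estimate = sum(p * w for p, w in zip(particles, weights))
```

The decreasing sequence of thresholds lets early iterations explore the prior cheaply while later iterations concentrate simulations in regions of parameter space that reproduce the data.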

The majority of them are based on shared mutations, which are defined as a mutation that occurs