Clonal competition in B-cell repertoires during chronic HIV-1 infection

During chronic infection, HIV-1 engages in a rapid coevolutionary arms race with the host's adaptive immune system. While it is clear that HIV exerts strong selection on the adaptive immune system, the modes of immune response are still unknown. Traditional population genetics methods fail to distinguish a chronic immune response from natural repertoire evolution in healthy individuals. Here, we infer the evolutionary modes of B-cell repertoire response and identify a complex dynamics where, instead of one winning clone, there is a constant production of new better mutants that compete with each other. A substantial fraction of mutations in pathogen-engaging CDRs of B-cell receptors are beneficial, in contrast to the many deleterious changes in structurally relevant framework regions. The picture is of a dynamic repertoire, where better clones may be outcompeted by new mutants before they fix, challenging current vaccine design and therapy ideas.

The HIV-1 virus evolves and proliferates quickly within the human body [1][2][3], often recombining its genetic material among different viral genomes and rapidly mutating. These factors make it very hard for the host immune system to control an infection, leading to long-term chronic infection. While it is clear that the virus exerts strong selective pressure on the host immune system, the adaptive immune response during chronic infections remains unknown.
The immune system has a diverse set of B and T-cells with specialized surface receptors that recognize foreign antigens, such as virus epitopes, and protect the organism. We focus on the chronic phase of HIV infection, where the immune response is dominated by antibodymediated mechanisms, following the strong response of the cytotoxic T-lymphocytes (i.e., CD8+ killers T-cells), around 50 days after infection [4]. During the chronic phase, the symptoms are minor and the viral load is relatively stable but its genetic composition undergoes rapid turnover. After an infection, B-cells undergo a rapid somatic hypermutation in lymph node germinal centers, with a rate that is approximately 4 − 5 orders of magnitude larger than an average germline mutation rate in humans [5]. Mutated B-cells compete for survival and proliferation signals from helper T-cells, based on the B-cell receptor's binding to antigens. This process of affinity maturation is Darwinian evolution within the host and can increase binding affinities of B-cell receptors up to 10-100 fold [6]. It generates memory and plasma B-cells with distinct receptors, forming lineages that trace the evolutionary selection pressures inflicted by the virus [7] (see schematic in Fig. 1A). A B-cell repertoire consists of many such lineages forming a forest of co-existing ge-nealogies.
Immune repertoire high-throughput sequencing has been instrumental in quantifying the diversity of B-cell repertoires [8,9]. Statistical methods have been developed to characterize the processes involved in the generation of diversity in repertoires and to infer the underlying heterogenous hypermutation preferences in B-cell receptors (BCRs) [9][10][11]. Deviation of the observed mutations in BCRs from the expected hypermutation patterns are used to infer selection effects of mutations from repertoire snapshots in order to identify functional changes that contribute to the response against pathogens [10,12].
Recently, longitudinal data, with repertoires sampled over multiple time points from the same individuals, has brought insight into the dynamics of affinity maturation in response to antigens [13][14][15][16]. The dynamics of affinity maturation and selection in response to HIV have also been characterized for chosen monoclonal broadly neutralising antibody lineages [3,17]. Yet, the effect of a chronic infection on the dynamics of the whole BCR repertoire remains unknown.
Here, we compare the structure and dynamics of BCR repertoires sampled over 2.5 years in HIV patients (data from ref. [15] collected through the SPARTAC study [18]) with the repertoire structure in healthy individuals (data from ref. [19]). We evolving virus.
Reconstructed lineage trees show a skewed and asymmetric structure, consistent with rapid evolution under positive selection (see Fig. S1A) [20]. To quantify these asymmetries, we estimated two indices of tree imbalance and terminal branch length anomaly. In both HIV patients and healthy individuals, we observe a significant branching imbalance at the root of the BCR lineage trees, indicated by the U-shaped distribution of the sub-lineage weight ratios (see SI), in contrast to the flat prediction of neutral evolution, calculated from Kingman's coalescent ( Fig. 2A). Moreover, we observe elongated terminal branches in BCR trees compared to their internal branches, with the strongest effect seen in trees from HIV patients, again in violation of neutrality (Fig. 2B,  Fig. S1). These asymmetric features of BCR trees are clear signs of intra-lineage positive selection. However, they only reflect the history of lineage replication and give limited insight into the mechanisms and dynamics of selection. For instance, tree asymmetry is also observed in unproductive BCR lineages, which lack any immunological function but are carried along with the productive version of the recombined gene expressed on the other chromosome ( Fig. 2A,B).
To characterize the selection effect of mutations in more detail, we evaluate the spectrum of mutation frequencies in a lineage, known as the site frequency spectrum (SFS). We evaluate the SFS separately for synony-mous and nonsynonymous mutations in different regions of BCRs (Fig. 2C, Fig. S2). We see a signifiant upturn of SFS polarized on non-synonymous mutations in pathogen-engaging CDR3 regions, consistent with rapid adaptive evolution [20], and in contrast to monotonically decaying SFS in neutrality (SI). This signal of positive selection is strongest in HIV patients with an order of magnitude increase in the high end of the spectrum, suggesting that the BCR population rapidly adapts in HIV patients.
To understand the dynamics and fate of these adaptive mutations, we use the longitudinal nature of the data to analyse the temporal structure of the lineages. We estimate the likelihood that a new mutation appearing in a certain region of the BCR reaches frequency x at some later time within the lineage (Fig. 3A), and evaluate a measure of selection g(x) as the ratio of this likelihood between non-synonymous and synonymous mutations [21] (SI). At frequency x = 1 (i.e., substitution), this ratio is equivalent to the McDonald-Kreitman test for selection [22]. Generalizing it to x < 1 makes it a more flexible measure applicable to the majority of mutations that only reach intermediate frequencies.
A major reason why many beneficial mutations never fix in a lineage is clonal interference, whereby BCR mutants within and across lineages compete with each other [7]. To quantify the prevalence of clonal interference, we also evaluate the nonsynonymous-to-synonymous ratio h(x) of the likelihood for a mutation to reach frequency x and later to go extinct (SI). In short, g(x) identifies "surges" and h(x) "bumps" in frequency trajectories of clones. These likelihood ratios have intuitive interpretations: g(x) > 1 indicates evolution under positive selection, with a fraction of at least α benef. = 1 − 1/g strongly beneficial amino acid changes in a given region [23]. On the other hand, the likelihood ratio g(x) smaller than 1 is indicative of negative selection, with a fraction of at least α del. = 1 − g strongly deleterious changes (see SI for a derivation of these bounds). Likewise, κ benef. = 1 − 1/h or κ del. = 1 − h define a lower bound on the fraction of either beneficial or deleterious mutations that go extinct. In this gene family, we detect positive selection (g > 1) in the CDR3 region, with around a two fold larger fraction of non-synonymous compared to synonymous changes that reach frequencies x > 0.6, indicating at least α benef. = 40% of CDR3 mutations to be strongly beneficial. On the other hand, the likelihood ratio in FWR signals strong negative selection (g < 1), where non-synonymous changes reaching frequencies x > 0.6 are two times fewer than the synony-mous changes, indicating at least α del. = 35% of these mutations to be strongly deleterious. Similarly, the interference likelihood ratio h(x) for a V-gene class IGHV5-10-1 in an HIV patient with interrupted treatment (patient 5) indicates that about κ benef. = 47% of CDR3 mutations in this gene family that go extinct due to clonal competition are strongly beneficial (Fig. 3B). In short, we observe a large fraction of adaptive mutations, and also a substantial amount of clonal interference which prevents some of the mutations from dominating within lineages.
To see how these observations generalize at the repertoire level, we quantify the region-specific fraction of beneficial and deleterious mutations within BCR lineages of distinct VJ-gene classes and also the fraction of selected mutations that are impeded by clonal interference (Fig. 3C and Table I). We infer a larger fraction of VJgene classes with positively selected amino acid changes in their CDR regions α benef. = 12% − 30% and negatively selected amino acid changes in FWRs α del. = 16% − 20%. Moreover, the positively selected beneficial mutations in CDR3 and the pooled CDR1/CDR2 regions are strongly impacted by clonal inference, in contrast to mutations in FWR (Fig. 3C, Table I, Fig. S3). These observations confirm the pervasiveness of clonal interference in the regions of the BCR with the most important functional role.
In patients with interrupted ART, we infer a twice larger fraction of beneficial mutations to rise with strong clonal interference in pathogen-engaging CDR3 regions following the interruption of treatment, compared to the ART-naive patients with a stable chronic infection-such a shift is not present for mutations in CDR1, CDR2 and FWR ( Fig. 3 and Table I). This pattern is consistent with the rate of HIV-1 evolution in patients with different states of therapy. Genome-wide analysis of HIV-1 has revealed that evolution of the virus within ART-naive patients slows down during chronic infections with limited clonal interference in viral populations [24]. The antibody response traces the evolution of the virus [1,7] and forms a quasi-equilibrium balance. On the other hand, rapid expansion and evolution of HIV following the interruption of ART drives a strong immune response and affinity maturation in HIV-responsive B-cell lineages. Evolution of HIV-1 population during viral expansion introduces a time-dependent target for the adaptive immune system and opens room for many beneficial changes in the HIV-engaging CDR3 regions, as indicated in Fig. 3.
Somatic evolution during affinity maturation is complex: there is no one winner of the race for the best antibody. We show that rapid and strong affinity maturation upon sudden pathogenic challenges, and a quasistationary response during chronic infections are a feature of the B-cell response to infections. Somatic evolution of BCRs is similar to rapid evolution in asexual populations where many beneficial mutations rise to intermediate frequencies leading to complex clonal compe- interference likelihood ratio:

FIG. 3:
Inference of selection and clonal interference in BCR lineages. (A) Schematic shows time-dependent frequencies of mutations that rise within a population (left). We denote the fraction of mutations that reach frequency x within a population (or lineage) by G(x) (blue and green) and the subset that later goes extinct due to clonal interference by H(x) (green). The likelihood ratios between non-synonymous and synonymous mutations (g(x), h(x)) quantify the strength of selection in each case. A schematic B-cell genealogy (right) is shown for a lineage sampled at two time points (colors). Non-synonymous and synonymous mutations are shown by empty and filled circles and their frequencies (xt 1 , xt 2 ), as observed in the sampled tree leaves, are indicated below each branch. (B) Selection likelihood ratio g(x) in the V-gene class IGHV2-70D in patient 4 (top) and the interference likelihood ratio h(x) for the V-gene class IGHV5-10-1 in patient 5 (bottom) are plotted against frequency x for mutations in different BCR regions (colors). The likelihood ratios indicate positive selection and strong clonal interference in the CDR3 region, negative selection on the FWR region and positive selection on mutations that rise to intermediate frequencies in the joint CDR1 / CDR2 regions. We do no observe interference in the FWR and the joint CDR1 / CDR2 region. (C) Each panel shows the probability density across distinct VJ-gene classes in HIV patients with interrupted treatment (left) and without treatment (right), for the fraction of beneficial and deleterious mutations ((α benef. / α del. ) on right x-axis and left inverted x-axis) that reach frequency x = 80% (top), and similarly, for beneficial / deleterious mutation fractions (κ benef. / κ del. ) that reach frequency x = 60% within a lineage and later go extinct (bottom). The dotted grey line indicates the null distribution from unproductive lineages of healthy individuals (Fig. S4). The Color code for distinct BCR regions in all panels is consistent with the legend. See Figs. S3, S4.
tition and genetic hitchhiking. Such evolutionary dynamics is prominent in microbial populations [25], in viruses including HIV within a patient [24,26] and global influenza [21,27,28]. In the immune system, clonal competition in BCR repertoires is also observed on short time scales (∼ weeks) in response to the influenza vaccine [16].
Clonal interference among beneficial mutations not only makes selection slower and less efficient, but it also makes the outcome of the evolutionary process less predictable [25]. This is of significant consequence for designing targeted immune-based therapies. Currently, the central challenge in HIV vaccine research is to devise a means to stimulate a lineage producing highly potent broadly neutralizing antibodies (BnAbs). A combination of successive immunization and ART has been suggested as an approach to elicit a stable and effective BnAb response; see e.g. ref. [29]. An optimal treatment strategy should account for clonal interference among BCRs during a rapid immune response to antigen stimulation, which could hamper the emergence of a desired BnAb within the repertoire.

HIV infected
HIV infected Healthy / productive Healthy / unproductive untreated interrupted treatment  The average fraction of beneficial α benef. and deleterious α del. mutations that reach frequency x = 80% (based on selection likelihood ratio g(0.8)) in different regions of BCRs among VJ-gene classes are reported for HIV patients (with interrupted and without treatment) and for healthy individuals (productive and unproductive lineages). Similarly, the fraction of beneficial κ benef. and deleterious κ del. mutations that reach frequency x = 60% followed by extinction (based on interference likelihood ratio h(0.6)) are reported for HIV patients with interrupted and without treatment; we cannot estimate the interference likelihood ratio in healthy individuals due to the lack of time-resolved data. The corresponding distributions are presented in Fig. 3 and Fig. S4.

Supplementary Information
1 B-cell repertoire data HIV patients. We analyze B-cell repertoire data from 6 HIV patients from ref. [1] with raw sequence reads accessible from the European Nucleotide Archive under study accession numbers, ERP009671 and ERP000572.
We study the repertoire data in two untreated HIV patients with sample accession numbers, ERS664994 -5001 (patient 1) and ERS139291-9298 (patient 2) and in four patients with ART interruption at week 48, ERS664966 -4974 (patient 3), ERS664975 -4983 (patient 4), ERS664984 -4992 (patient 5), ERS664976 -5002 (patient 6). The data covers ∼ 2.5 years of study with 6-8 sampled time points per patient; see Table S1 for details. The B-cell repertoire sequences consist of 150bp non-overlapping paired-end reads (Illumina MiSeq), with one read covering much of the V gene and the other read covering the area around the CDR3 region and the J gene. For the initial processing of the raw reads we use pRESTO [2] (version 0.5.2) with the following steps: We filter sequences for quality (> 32) and length (> 100). The paired end reads that overlap are assumed to be anomalous, and are discarded from the analysis. We assemble the paired reads by aligning against the IMGT reference database of V genes [3], such that an appropriate size gap is inserted between the non-overlapping paired reads. Duplicate sequences are collapsed into unique sequences. The sequences contained a large number of singletons, that is sequences with no duplicates. With an R script, we calculate the minimum hamming distance of each singleton to any non-singleton, H 0 . The distribution of H 0 is bimodal, and singletons with H 0 < 5 (the minimum between the modes) are discarded, since sequences with few changes are more likely to have appeared due to sequencing errors. Due to lack of barcoding for individual molecules, we only use the unique BCR sequences for analysis and do not incorporate the information on the multiplicity of each sequence.
Healthy individuals. We analyze memory B-cell repertoire data of 3 individuals published in ref. [4]: https://clients.adaptivebiotech.com/pub/robins-bcell-2016. The published data in healthy individuals is already preprocessed for quality control and corrected for sequencing error.
BCR annotation. In both datasets, we annotate the BCR repertoire sequences of each individual (pooled time points) by Partis [5]. Partis uses very large amounts of memory, so the initial (cache-parameters) stage is run on a subset of 200,000 random sequences, and the annotation stage is run on the full set of sequences. We process the output of Partis in R, which includes the estimated V gene/allele, J gene/allele, location of the CDR3 region, and an inferred naive sequence (germline before hyper-mutation). Sequences which have indels outside of the CDR3 are discarded. We partition the sequences into two groups: productive BCRs, which are in-frame and have no stop codons, and the unproductive BCRs. The sequences are further annotated by processing the inferred naive sequences with MiXCR [6,7], which gives the CDR1, CDR2 and framework regions.
Lineage reconstruction. To identify BCR lineages, we first group sequences by the assigned V gene, J gene and CDR3 length, and then used single linkage clustering with a threshold of 90% hamming distance. A similar threshold has been previously suggested by ref. [8] to identify BCR lineages. Clusters of small size (< 20) are discarded from our analysis. For each cluster, there may be multiple inferred naive sequences, as this is an uncertain estimate, and the most common naive sequence is chosen to be the outgroup for genealogy reconstruction. See Table S1 for detailed statistics of BCR lineages in each individual.
Unproductive BCRs. Due to a larger sequencing depth in healthy individuals, we are able to reconstruct relatively large unproductive BCR lineages. Unproductive sequences are BCRs that were generated but due to a frameshift or insertion of stop codons were never expressed. These BCRs reside with productive (functional) BCRs in a nucleus and undergo hypermutation during B-cell replication, and therefore, provide a suitable null expectation for somatic evolution during affinity maturation.

Inference of lineage phylogenies
Lineage genealogy reconstruction. For each lineage and its aligned sequences we reconstruct its underlying genealogical tree. We use FastTree [9] to construct the initial tree by maximum parsimony. We use this tree as seed for the maximum likelihood construction of the phylogeny with RAxML [10], using the GTRCAT substitution model. In the last step of tree topology reconstruction, we use the GTRGAMMA substitution model to optimize sequence divergence along the tree (i.e., branch lengths). We use a maximum likelihood approach to reconstruct nucleotide sequences of internal nodes on the tree [11]. We do not include the positions with gaps in the multisequence alignment in inference of tree topology and the nucleotide mutations along the tree.
We use the inferred naive sequence (germline) as the outgroup of the genealogy. The root of the tree may be some mutations away from the last common ancestor of the sampled sequences. This may be due to a number of initial rounds of hypermutation prior to secretion of the first selected B-cell, or alternatively, due to incorrect assignment of the germline allele during annotation; a fraction of V, D and J alleles circulating in the human population are missing from the existing reference datasets like IMGT [3]. In order to minimize the effect of such allele mis-assignments, we discard the mutations that separate the inferred germline sequence and the last common ancestor (root) of the tree from our analysis (i.e., mutations common to all sequences). Fig. 2B and Fig. S1B), we use a maximum-likelihood approach and a probabilistic model to annotate internal nodes of a tree with times of occurrence, given the topology of the tree and the mutations on the branches. Internal nodes represent replication events, which may carry new mutations assigned to branches by the ancestral sequence reconstruction procedure. The model can also correct observation times of external nodes on the tree, given sufficient evidence. The model assumes a tree with n nodes,

Inference of branching time along a phylogeny. To characterize the branch length statistics of lineages in units of divergence time (used in
where τ j = t j − t A(j) , the time difference between a node in the tree and its parent, is constrained to be positive. The model assumes Gaussian measurement error with standard deviation σ of the sampling time for the observed nodes, and the Poisson model for mutations on tree branches, with rate µ. To limit the search space of the optimization algorithm we constrain the times t to be discrete; the units of time used in our data are weeks. Here we set σ = 5. The optimization is solved by an iterative procedure in which times of nodes are changed by one unit in the direction of score increase, until convergence.

Inference of selection from lineage tree statistics
We compare genealogies of B-cell lineages in HIV patients with healthy individuals to characterize the evolutionary selection during affinity maturation in response to chronic infection. Structure of genealogies has been linked to evolutionary modes in a population [12]. Rapid evolution under positive selection leads to skewed tree topologies and elongated terminal branches [11,[13][14][15], compared to neutral evolution [12,16] (Fig. S1A).
Asymmetric tree branching. We characterize the asymmetry of trees by the branching imbalance of the last common ancestor at the root of the tree -the last common ancestor may be a number of mutations away from the germline progenitor. We define the weight of each node in a tree by the number of leaves (terminal nodes) within its clade; see Fig. S1A [15]. The weight of the last common ancestor w anc.. , i.e., the total number of leaves in a tree, and its daughters, w D 1 , w D 2 , are indicated in the simulated trees of Fig. S1A. In rapid evolution under selection, the first branching event produces highly imbalanced sub-clades, and hence, extreme values for the weight of the first daughter nodes [15]. In contrast, neutral evolution predicts a uniform distribution of tree weights of ancestral sub-clades. Fig. 2A shows a U-shaped distribution of relative weights of daughters to the ancestor w D /w anc. in lineages reconstructed from BCRs in HIV patients and in healthy individuals, indicating intra-lineage selection during affinity maturation.
Terminal branch statistics. In lineages under selection, the descendent sequences (leaves of a tree) are likely to coalesce to an ancestor with high fitness, resulting in long terminal branches in a tree (Fig. S1A) [14,15] -branch length statistics are estimated in units of divergence time, rather than sequence hamming distance; see Section 2 of SI for inference of branching times along a phylogeny. In Fig. 2B we compare the ratio of mean terminal branch length of a lineage to the averaged length of all the branches in a lineage. The distribution of this branch length ratio in HIV patients show an excess of lineages with relatively long terminal branches, compared to the expected distribution for simulated neutral lineages of the same size (Kingman's coalescence); see Section 5 of SI. A similar trend with a weaker signal is seen in healthy individuals (Fig. 2B). To make sure that the strong signal in HIV patients is not only due to sampling of the repertoire over multiple time points, we repeat the same analysis on the subset of lineages which are sampled in only one time point. Fig. S1B shows a comparable over-representation of long terminal branches in this subset.
Site frequency spectrum (SFS). The SFS is the probability density f (ν) of observing a derived mutation (allele) with a given frequency ν in a lineage. A mutation that occurs along the phylogeny of a lineage forms a clade and is present in all the descendent nodes (leaves) of its clade (see Fig. S1A). Therefore, SFS carries information about the shape of the phylogeny, including both the topology and the branch lengths. In neutrality, mutations rarely reach high frequency, and hence, the SFS decays monotonically with allele frequency as, f (ν) ∼ ν −1 [16]. In phylogenies with skewed branching, many mutations reside on the larger sub-clade following a branching event, and hence, are present in the majority of the descendent leaves on the tree. The SFS of such lineages is often non-monotonic with an upturn in the high frequency part of the spectrum and a steeper drop (∼ ν −β with β > 1) in the low frequency part of the spectrum [14]. To identify the targets of selection, we classify mutations based on the region they occur in. BCRs are made up of the three immunologically important complementarity-determining regions (CDRs) [17], CDR1, CDR2 and CDR3 and the remaining part of the V and J genes referred to as the framework region (FWR). Fig. 2C shows SFS in lineages of HIV patients and in healthy individuals. We see a signifiant upturn of SFS polarized on non-synonymous mutations in pathogen-engaging CDR3 regions; this signal of selection is strongest in HIV patients with an order of magnitude increase in the high end of the spectrum (Fig. 2C). SFS polarized on mutations in other regions show steeper drop in the low frequency side of the spectrum compared to neutral expectation. Fig. S2 shows SFS of the unproductive BCR lineages in healthy individuals, with a comparable steep drop in low frequencies. A similar pattern of SFS has recently been reported for lineages of B-cell repertoires following influenza vaccination [18].
For analysis of lineage tree statistics, i.e., the weight imbalance, terminal branch statistics and SFS, we only rely on the relatively large lineages with size (> 50) leaves.

Selection and clonal interference likelihood ratios
Selection likelihood ratio. Hypermutations during affinity maturation create new clades within a lineage. The frequency x of these clades change over time, as shown by the schematic in Fig. 3A. A mutation under positive (or negative) selection should reach a higher (lower) frequency than a neutral mutation. Many population genetics tests, such as the McDonnell-Kreitman test for positive selection [19], rely on a comparison between statistics of substitutions (i.e., mutations that fix within a population) and the circulating polymorphisms within species. Unlike phylogenies based on the species divergence, B-cell lineages form genealogies with many mutations that rise to intermediate frequencies as polymorphisms but often do not fix within a lineage. Here, instead of relying on the substitution statistics, we use the history of polymorphisms to quantify selection in B-cell lineages. In particular, we estimate the frequency propagator G(x) [20] as the likelihood that a new mutation (allele) appearing in a lineage reaches frequency x at some later time within a lineage (see schematic in Fig. 3A).
To estimate intra-lineage selection, we compare the likelihood of an amino acid changing non-synonymous mutation reaching a given frequency x at any point in its time trace, G(x) to that likelihood for a synonymous mutation in the same lineage, G 0 (x), and detemine the selection likelihood ratio (Fig. 3A) [20], Due to heterogeneity and context dependence of mutation rates in different regions of BCRs, we evaluate the likelihood ratio separately for each region, namely the CDR3, CDR1 & CDR2 (pooled together) and framework regions (FWR). In the Fig. S5 we show the robustness of the region-specific selection likelihood ratio with respect to such mutational biases.
Interference likelihood ratio (time-ordered selection). Clonal competition among beneficial mutations on different genetic backgrounds is a characteristic of evolution in asexual populations. In the absence of clonal interference, beneficial mutations can readily fix in a population after they rise to intermediate frequencies, beyond which stochastic effects would not impact their fate [21]. Clonal interference reduces the efficacy of selection, resulting in a quasi-neutral regime of evolution [22].
To examine the amount of clonal competition among BCRs of a lineage, we consider time ordered selection propagators (interference propagators) indicating the likelihood that a mutation reaches frequency x and later goes extinct, H(x) = G(x) × G(0|x); here G(0|x) is the conditional probability that a mutation trajectory decays to frequency 0 given that it starts from frequency x; see schematic in Fig. 3A. We estimate the interference likelihood ratio by comparing the probability of a non-synonymous mutation to reach a frequency x and later go extinct H(x) to the same scenario for synonymous mutations, H 0 (x) (Fig. 3A), Fraction of selected mutations based on the likelihood ratios. Following the well established tradition of population genetics, we assume that synonymous mutations that do not change the amino acid provide a neutral gauge for evolution. In the case of frequency x = 1 the propagator ratio g(x) becomes equal to the ratio of the fixation probability (d/n) (d 0 /n 0 ) where d and d 0 are respectively the number of fixed non-synonymous and synonymous polymorphisms and n and n 0 are total number of polymorphisms is each class. In other words, g(x = 1) is equivalent to the McDonald-Kreitman test for selection based on the observed polymorphisms [19]. Selection likelihood ratio g(x) = G(x)/G 0 (x) larger than 1 implies an over-representation of non-synonymous compared to synonymous changes that reach frequency x and is indicative of beneficial amino acid changes in a given region. Assuming a total of N non-synonymous mutations, we expect N G 0 (x) of these mutations to reach frequency x by neutral evolution, and at least a fraction α benef. (x) = (N (x) − N G 0 (x))/N (x) = (g(x) − 1)/g(x) of these mutations to be beneficial [23]. On the other hand, a selection likelihood ratio smaller than 1 indicates negatively selected amino acid changes in a given region. The deviation from the expected number of non-synonymous mutations in neutrality, N G 0 (x) − N (x), is an estimate for the number of mutations that were suppressed due deleterious fitness effects, indicating that at least a fraction α del. (x) = 1 − g(x) of non-synonymous mutations to be under negative selection [23]. Similarly, we can compute the fraction of beneficial and deleterious mutations that are impacted by clonal interference, κ benef.
Robustness of selection inference. It should be noted that the heterogenous and context dependent somatic hypermutation rates during affinity maturation [24][25][26][27][28] introduce BCR-specific biases that could influence inference of selection. In order to verify the robustness of our method, we have simulated the process of affinity maturation based on two distinct BCR-specific hyper-mutation models [24,26,28] along the inferred BCR lineage phylogenies; see Section 5 of SI for details. Fig. S5 shows that the region-specific likelihood ratios g(x), h(x) are insensitive to the heterogenous hyper-mutation statistics and such biases do not produce spurious evidence for selection and clonal interference. In addition, the likelihood ratio is insensitive to the initial frequency of an allele within a lineage [20] and provides a robust measure for inference of selection in evolving genealogies.
Inference of likelihood ratio statistics from data. The descendants of a given mutation α on a lineage tree define the clade C α . We evaluate the frequency of mutation α at time t, x α (t) as the fraction of the observed sequences (leaves of the tree) from time point t that reside within the clade C α . The fraction of non-synonymous and synonymous mutations that reach frequency x during their history define the selection propagators G(x) and G 0 (x), respectively. To infer statistically significant evidence for selection, we estimate propagators based on the mutations pooled from lineages of common gene classes, e.g. lineages with common V gene (Fig. 3B) or common V & J genes (Fig. 3C and Fig. S4). We evaluate the expected error of a propagator at frequency x, by assuming binomial sampling from the total of N non-synonymous and N 0 synonymous mutations (i.e., all the mutations observed in a given gene class). This results in the sampling errors σ 2 (x) = G(x)(1 − G(x))/N for non-synonymous and σ 2 0 (x) = G 0 (x)(1 − G 0 (x))/N 0 for synonymous mutations, and a corresponding propagated error for the ratio, g(x) in eq. 2. We use a similar approach to estimate the error for the interference likelihood ratio, h(x) in eq. 3.

Simulations
Simulated trees. In Fig. 2B, we compare the branch length characteristics of the BCR genealogies with the neutral expectation from 2000 simulated trees with Kingman's coalescence, generated by the beta coalescent algorithm with parameter α = 1 [29]. The sizes of the simulated trees in the neutral ensemble are drawn from the BCR lineage size distribution in HIV patients. The schematic trees in Fig. S1A are also generated similarly by the beta coalescent algorithm [29] with parameters α = 1 for neutral evolution and α = 2 for rapid adaptation.
Null model for context-dependent affinity maturation. We simulate mutations along BCR lineage trees according to two context-dependent models of hyper-mutation, (i) IGoR statistics [28] and (ii) S5F [24]. For a given branch on a lineage tree, we draw a number of mutations equal to the branch length from a multinomial distribution with position-specific weights determined by the hypermutation models. Due to the changes in the sequence, we update the position weights at each internal node of the tree to account for context-dependent hyper-mutation rates. This procedure reshuffles the identify of mutations along BCRs according to the neutral hyper-mutation models, while preserving the shape of tree. Fig. S5 shows that the propagator statistics do not recover evidence for region-specific selection in BCRs in the simulated lineages. Therefore, the original selection signal in Fig. 3 is not  Figure S1: Impact of selection on lineage tree statistics. (A) Simulated phylogenies for neutral evolution (left) and rapid adaptation with positive selection (right), generated by the coalescence package [29] and plotted with FastTree [9]. The weights of the ancestral node wanc. and its daughters wD 1 , wD 2 (i.e., their clone size) are indicated in each phylogeny.   Figure S3: Selection and interference likelihood ratios in all individulas. Panels show selection and interference likelihood ratios g(x), h(x) in HIV patients (untreated and with interrupted ART) and the selection likelihood ratio g(x) in productive and unproductive lineages of healthy individuals, estimated from all lineages in each individual. We consistently see strong evidence for negative selection in FWR regions of productive lineages and positive selection in both or either of CDR regions. We do not see such distinction in unproductive lineages. Note that the repertoire level averaged likelihood ratios are highly coarse grained statistics and miss the gene-specific evidence for selection and clonal interference, as shown in Fig. 3.  Figure S5: Robustness of selection inference to BCR hypermutation biases. The figure shows the statistics of selected mutations for simulated neutral hypermutation processes along the inferred B-cell lineages, as described in Section 5 of SI. We use two hypermutation models, IGoR statistics [28] (top row) and the S5F model [24] (bottom row). Similar to Fig. 3, each panel shows the probability density across distinct VJ-gene classes for (A) the minimum fraction of beneficial / deleterious mutations (α benef. / α del. ) that reach frequency x = 80%, and (B) for beneficial / deleterious mutation fractions (κ benef. / κ del. ) that reach frequency x = 60% and later go extinct. These statistics are estimated for mutations in different regions of BCRs (colors) in healthy individuals, HIV patients with interrupted ART and in untreated HIV patients. The region-specific pattern of selection seen in Fig. 3