Learning the heterogeneous hypermutation landscape of immunoglobulins from high-throughput repertoire data

Abstract Somatic hypermutations of immunoglobulin (Ig) genes occurring during affinity maturation drive B-cell receptors’ ability to evolve strong binding to their antigenic targets. The landscape of these mutations is highly heterogeneous, with certain regions of the Ig gene being preferentially targeted. However, a rigorous quantification of this bias has been difficult because of phylogenetic correlations between sequences and the interference of selective forces. Here, we present an approach that corrects for these issues, and use it to learn a model of hypermutation preferences from a recently published large IgH repertoire dataset. The obtained model predicts mutation profiles accurately and in a reproducible way, including in the previously uncharacterized Complementarity Determining Region 3, revealing that both the sequence context of the mutation and its absolute position along the gene are important. In addition, we show that hypermutations occurring concomittantly along B-cell lineages tend to co-localize, suggesting a possible mechanism for accelerating affinity maturation.


INTRODUCTION
B cells are a crucial player in the adaptive immune system. Swift eradication of pathogens is enabled by the production of immunoglobulins (Ig) that bind tightly to antigens, helping in their detection, neutralization, and removal. Achieving high accuracy and breadth relies on the extraordinary diversity of the B cells repertoire. The process of V(D)J recombination results in a highly diverse population of naive cells (1)(2)(3)(4)(5)(6)(7). In addition, B cells undergo affinity maturation, a Darwinian process (8) in which mutations are introduced to the immunoglobulin-coding gene and highest affinity mutants are selected (9). This process is driven by a very high rate of somatic hypermutations (SHM), ∼10 −3 per basepair per cell division (10), targeting the Ig genes. Some receptor genes can ultimately accumulate up to 30% amino acid substitutions, considerably altering the initial genotype. The broad diversity created by SHM ultimately ensures the emergence and selection of strong antigen binders. Understanding SHM and their statistics is key to designing better vaccination strategies (11,12).
Like the VDJ recombination process, SHM are characterized by heterogeneous preferences. Mutational pathways affect the Ig genes unevenly, with 'cold' and 'hot' spots along the receptor gene, even before somatic selection introduces further biases (12). SHM is initiated by activationinduced cytidine deaminase (AID) through the deamination of deoxycytidines triggering an array of error-prone repair pathways (13). AID and repair enzymes preferentially target certain regions of the gene. However, a quantitative picture of how these processes and their context dependencies result in the observed heterogeneous mutational landscape is lacking. High-throughput repertoire sequencing of the Ig gene (2,3,14,15) has facilitated the development of effective models from a detailed analysis of mutational profiles of Ig sequences before (5,16,17) or after selection (18)(19)(20)(21)(22). However, the spatial organization of mutations, their context preferences, and their interplay with selection during affinity maturation are still poorly understood, in part due to a number of confounding factors.
A fundamental issue is the bias of selection, which favors beneficial mutations over deleterious ones in the observed repertoire. This bias can be partially circumvented by analyzing synonymous substitutions (16), with the limitation that extrapolation is required to generalize to nonsynonymous ones. Another way around selection is to study passenger nonproductive sequences, which are unsuccessful products of VDJ recombination and thus unaffected by selection (5,17,22). These sequences make up a minority of DNA sequences, and are rarely found in mRNA sequences because of allelic exclusion, which limits their use to very large datasets.
Another confounding factor arises from phylogenetic biases due to the complex multi-lineage structure of the repertoire. While methods have been developed to infer substitution rates from lineages in a lineage-specific (21) or repertoire-wide way (23), they do not aim to correct for selection and do not address the question of hypermutation targeting.
Here, we propose a new framework for quantifying and predicting immunoglobulin mutability. The model is trained on the reconstructed phylogenies of nonproductive lineages from very large published B cell repertoires totalling around half a million nonproductive sequences (7), allowing us to overcome previous limitations of dataset sizes. The approach accounts for both phylogenetic and selection biases, and allows us to study in detail the spatial and context preferences of hypermutation targeting, and to reveal the colocalization of contemporary mutations.

Repertoire-wide framework to model intrinsic mutabilities from out-of-frame lineages
Out-of-frame Ig sequences are byproducts of the VDJ recombination process that are made non functional by a frameshift in the CDR3 region. Since each cell has two copies of the Ig genes, out-of-frame rearrangements may survive in the cell if recombination on the second chromosome is successful. The mechanism of allelic exclusion ensures that only the functional variant is expressed. Yet, outof-frame IgH sequences comprise ∼ 2% of rearrangements in Ig mRNA sequencing experiments, and ∼ 9% in genomic DNA (6). When a B-cell clone harboring both an out-offrame and a functional rearrangement undergoes affinity maturation, the out-of-frame sequence acts as a passenger and mutates alongside the functional sequence, with the selection pressure acting only on the latter. While the two sequences share the same phylogeny, mutations found in outof-frame lineages are not expected to be subject to selection.
To model the process of SHM, we reconstruct the evolutionary history giving rise to the observed mutation patterns in nonproductive rearrangements. We analysed data consisting of the IgG repertoires of nine individuals from (7), obtained by the targeted mRNA sequencing of the Ig heavy (IgH) chain locus. We pre-processed and aligned raw IgH sequences to keep only out-of-frame sequences. We then grouped sequences into clonal families that originate from the same ancestor using single linkage clustering ( Figure 1A). The size of clonal families typically follows a power-law distribution ( Figure 1C). As a result, many lineages are represented by one or a few sequences. We focused on sufficiently large lineages (comprised of at least six distinct sequences) and reconstructed their lineage structure, using maximum likelihood (24,25) to infer the topology of the underlying tree, and marginal reconstruction for the identity of ancestral states. This provides us with a list >200 000 mutation events occurring between the most recent common ancestor of lineages and their leaves.
Using lineage information is essential for multiple reasons. First, it allows for a better estimate of the sequence context in which a mutation appears. In this paper we define the context as the 5-mer sequence comprising the mutated basepair flanked by 2 bp on each side. In the absence of lineage information, the best guess for the 5-mer context would be given by the genomic sequence of the V, D or J segment where the mutation arose. But that context may itself be affected by other prior mutations. The tree structure allows us to identify the order of mutations and reconstruct the probable 5-mer context in which each mutation occurred. Second, for the same reason, the tree structure can help identify mutations in the hyper-variable CDR3 region, including in the junctions made of nontemplated insertions. This makes it possible to estimate the hypermutation rate in these regions. Together, these improvements mean that mutations can be identified within a broader range of 5mer contexts, and their corresponding mutabilities better estimated. Third, lineage structure helps reduce contamination from sequences that have been under some selection. In some rare events, during affinity maturation a somatic insertion or deletion may be introduced in the CDR3 of a productive sequence, which would lead us to classify it as out-of-frame, even though it has been subject to selection prior to the frame-shift event. Focusing on mutations happening downstream of the most recent common ancestor, which is already out of frame, help us discards those contaminating events.
Given a model P(s → s |t, ) of sequence evolution from s to s , where t is fraction of mutated positions between s and s , (called branch length, equal to the number of mutations divided by alignment length), and denotes model parameters, we can write the joint likelihood of mutation events in each lineage as where S is the set of sequences (observed and reconstructed) at each node of the tree, and T encodes the reconstructed phylogenetic tree through its branches (i, j). We assume every position x of the sequence s evolves independently inside each branch. Mutations occur according to a set of Poisson clocks with sequence-and positiondependent rates, s, x , expressed per unit time of branch length. During t some positions will mutate and others will remain unchanged, so that We assume that mutability depends independently on the local 5-mer sequence context centered around the mutation, w(s, x) = (s(x − 2), . . . , s(x + 2)), and on the absolute position x along the gene (measured as the distance from the 5 end of the gene), so that s, x = ␥ w(s, x)␤ x . In absence of context and position dependence, we would have = 1 by construction. Thus values of ␥ w or ␤ x above 1 imply higher mutabilities than average, and vice versa for values below 1. To lift the degeneracy in overall scale between ␤ x and ␥ w, Overall, the model has 4 5 = 1024 parameters for ␥ w corresponding to each 5-mer, and L = 400 parameters for ␤ x corresponding to each possible position. We infer these pa- rameters from repertoire-wide sequencing data by maximizing the total log-likelihood of mutations in all branches in all lineages, L(β, γ ) = (S,T) ln P(S|T, β, γ ), with respect to (␤, ␥ ), using an iterative procedure.

Data and preprocessing
We perform the analysis on recently published highthroughput RNA sequencing of Ig heavy genes at great depth (7). The sequences were barcoded with unique molecular identifiers (UMI) to correct for the PCR amplification bias. However, UMI cannot be used to correct sequencing errors, as most UMI were represented by a single sequence: the number of UMIs used is of the same order as the total number of cells in use. We aligned raw sequences using presto of the Immcantation pipeline (26) with setup allowing to correct for errors in UMIs and deal with insufficient UMI diversity. The V region primers were masked and the C region primers were used to distinguish the two isotypes of sampled B cells: the IgM and IgG classes. The study of mutation profiles in the two groups revealed a much lower mutational load in the IgM cohort and hence a higher relative level of sequencing errors, as well as shallower tree topologies. For further analysis we chose to focus exclusively on the IgG class. Reads were filtered for quality and paired using default presto parameters. Pre-processed data was then aligned to V, D and J templates from IMGT (27) database using IgBlast (28). In total there were 3.6 × 10 6 IgG sequences per person (average 3.6 × 10 6 , median 1.8 × 10 6 ), of which up to 2% were unproductive (average 5.7 × 10 4 , median 2.9 × 10 4 ).

Inference of evolutionary trajectories
Sequences with a frameshift in the CDR3 region were then selected and used to reconstruct clonal families as follows. In the first step, reads were aligned to the V and J templates and grouped into classes of sequences with the same V and J gene assigned, as well as equal CDR3 length. In the out-of-frame classes we inferred clonal lineages by single linkage clustering with a threshold of 90% on CDR3 region identity (29). We reconstruct maximum likelihood topologies, as well as the identity of ancestral states, under a simple K80 model of character evolution (30) for all lineages comprising at least 6 unique sequences. The model does not capture the complexity of the observed mutation profile, but avoids fitting multiple parameters independently in small lineages of relatively short alignment. The existing repertoire-wide method (23) is incompatible with out-offrame lineages since it operates on 61 productive codons. Ancestral states are found through marginal reconstruction. Germline V and J sequences were used as an outgroup to inform the phylogenetic inference and root the lineage.

Model inference
With the exception of the initial branch, which joins the germline sequence and the most recent common ancestor of the lineage, all branches shorter than 10 substitutions were used for model inference.
Nucleic Acids Research, 2020, Vol. 48, No. 19 10705 Our task is to find a set of parameters {␥ w}, {␤ x } that maximise the log-likelihood where S is the set of sequences (observed and reconstructed) at each node of the tree, and T encodes the reconstructed phylogenetic tree through its branches (i, j), with reconstructed ancestral states s i and s j . The rates s, x = ␤ x ␥ w(s, x) are defined so that the length of each branch t is expressed in terms of the expected number of substitutions per basepair (total number of substitutions divided by the total alignment length). Imposing ∂L/∂γ w = 0 yields an implicit expression for ␥ w as a function of {␤ x }, but independent of {γ w } w =w , which can be solved by one-dimensional root finding. Likewise, setting ∂L/∂β x = 0 gives an implicit expression for ␤ x as a function of {␥ w}. We can perform the following iteration: which converges to the maximum of L with respect to the To estimate the uncertainty of inferred parameters we sample with replacement from the set of all branches to create 400 bootstrap copies. We report 95% confidence intervals.

Substitution models
Not only the targeting rate, but also the identity of the substitution is known to depend on the identity of neighboring bases (16). In our formulation of the model, inference of the targeting rates does not require knowing the substitution type, however we can easily extend the framework to include this dependence. The probability of mutating from w to w over a period t can be expressed as where w ω w→w = 1 and ω w→w = 0 if w is a result of a substitution at the central position of w. This way we add 2 × 4 5 = 2048 parameters to the model. We can infer the maximum likelihood estimates of ω w→w using the same iterative scheme introduced in the previous section.

Synthetic datasets
We created synthetic datasets using the S5F model of mutability (downloaded from clip.med.yale.edu/shm) for ␥ w.
We used a flat profile, ␤ x = 1 as well as sinusoidal profiles ln β 1 x = 2 sin(x/δ) − 1 and ln β 2 x = 2 cos(x/δ) − 1 with ␦ = 50. For each branch (i, j), we compute the mutability μ s i ,x as a function of x, and then introduce mutations at n random positions picked without replacement according to μ x / x μ x , where n is the number of mutations on the branch (fixed by the lineage structure taken from the real data).

Validation on synthetic data
We first tested the ability of the inference framework to recover true mutability parameters using synthetic datasets. Synthetic data was designed to mimic as closely as possible the features of the real repertoire data to be analyzed. We used tree topologies inferred on out-of-frame lineages from nine individuals of (7). The sequence at the root of each tree was replaced by a random sequence drawn using IGoR, a model of stochastic VDJ recombination (22). Random mutations were then introduced along the tree structure, following the same number of mutations on each branch as in the original lineage, and according to the SHM model (equation 2). Context-dependent parameters ␥ w were set to the previously published S5F model (16) , and three variants of the position dependent ␤ x were tested: flat, and two sinusoidal profiles (see Methods). Finally we collected sequences at the leaves of the trees into a synthetic dataset.
Starting from this dataset, we performed alignment, clonal family inference, tree reconstruction and finally model inference using the exact same procedure as for real data. We compared parameters inferred this way to the true values of ␥ w and ␤ x (Figure 2). We were able to recover these rates with excellent accuracy (Pearson's r 2 = 97% for both ␥ and ␤).
The fact that the procedure recovers the correct positiondependent profile ␤ x , including a flat one ( Figure 2B), shows that the framework successfully corrects for the two following confounding factors. First, sequence conservation across the different V, D and J segments means that context and position are often intertwined, making the extraction of each dependence difficult. Second, high variability in the CDR3 may cause errors in the assignment of sequences into clonal families, and makes it harder to reliably call mutations than in the germline regions. This remains true in the presence of large variations of the mutability along the position, including in the CDR3, as demonstrated on the sinusoidal profiles ( Figure 2C). On the other hand, the possibility to use the CDR3 sequence for model inference gives access to a more diverse range of possible contexts, leading to better estimates for contexts that are underrepresented in the germline genes.
To assess the impact of errors in the reconstruction of clonal families and lineages on the inferred parameters, we repeated the procedure using the true tree topologies instead of the reconstructed ones. This only modestly improved accuracy (r 2 = 98%, see Supplementary Figure S1), suggesting that the procedure is robust to lineage misassignments.

Mutabilities depend on both sequence context and position
Confident that our procedure is able to infer rates reliably, we next applied it to real data, consisting of the out-offrame lineages from (7). The inferred dependencies of mutability with context and position are presented in Figure 3. We represent context dependence using a flat variant of the 'hedgehog' plots used in (16), for A-, T-, C-and G-centered motifs ( Figure 3A-D). Full parameter tables are available at https://github.com/statbiophys/shmoof.
Context dependent rates for A-centered motifs correspond well to the standard WA classification (31): 76% of A-centered 5-mers with ␥ w > 1 are of the WA type, and only 7 of 128 WA 5-mers have ␥ w 1. T-centered motifs are dominated by coldspots and their mutabilities do not align well with their corresponding reverse complement counterparts. This is in agreement with the known property of Polymerase to be prone to errors at A nucleotides on the top strand (32).
The C-and G-centered motifs have largely reversecomplement-symmetric rates (see Supplementary Figure  S2). As previously noted (16), this is in agreement with the strand-symmetric targeting of C/G-centered motifs by the AID enzyme.
The previously reported WRCY/RGYW motif (13,33) predicts high mutability reasonably well, while the SYC/GRW class of motifs (34) explains well a good fraction of coldspot motifs. Importantly, a large number of high or low mutability 5-mers do not belong to any of the previously reported motifs (see Supplementary Tables S1  and S2).
The rugged profile of position dependence ( Figure 3E) shows clear enrichment in mutations in the CDR1 and CDR2 regions, reflected in the up to 2-fold increase of the position-dependent rates. Framework regions are less mutated and we also observe a slight drop in the mutabilities of the positions beyond the Cysteine anchor of the CDR3 region. We also learned models where the position was defined from the 3' end of the sequence in the J segment (Supplementary Figure S3), yielding similar results but no clear improvement over 5'-based position. High mutability of CDR1 and CDR2 was already noted (35) and justified as an enrichment in highly mutable motifs (as quantified with the S5F model). Our findings suggest that there is a secondary mechanism of this enrichment, having to do either with accessibility of mutation-inducing enzymes or a superposition of context-dependent effects that evade the assumption of independent evolution at different sites and the limitation of 5-mer motifs.
Note that introducing the explicit position dependence does affect the learning of the context-dependent parameters: learning ␥ w with no position dependence (fixing ␤ x = 1) yields similar but markedly different parameters than when learning a free ␤ x (r 2 = 81%, Supplementary Figure  S4).

Model is consistent across individuals and explains data better than previous approaches
To check the model's generality, we estimated its variability across individuals by computing Pearson's correlation coefficient between the context (␥ w, Figure 4A) and position (␤ x , Figure 4B) mutability profiles of different donors. The precision with which we can estimate model parameters depends on the number of sequences used for inference, particularly for rare 5-mer contexts. Because two individuals had many more reads than the other 7, we pooled together these seven individuals to make comparisons with similar dataset sizes ( Figure 4C). We then compared the two individuals and one meta-individual with each other and with a model learned on data from all individuals. For the two individuals with the largest repertoire datasets, the results are highly reproducible with Pearson's r 2 = 78% for context and r 2 = 70% for position parameters ( Figure 4A), suggesting that the model captures universal biochemical properties of the hypermutation process.
To further validate the model's accuracy, we compared its prediction to data on the V-specific mutation profiles, which consist of the position-dependent mutation rate for each V segment. These rates result from the combined effect of position and context, but they are not fitted directly by the model. A typically good example of such a profile is shown in Figure 4D. The prediction is generally excellent (Pearson's r 2 ∼ 50 − 80%), and is poorest for V segments for which little data was available ( Figure 4E). Similarly, the model predicts well the mutability on Framework Region 4 (FWR4), which encompasses the J segment ( Figure 4F), as well as in the CDR3 ( Figure 4G and H), which is usually ignored in other approaches. Performance is best for the most frequent CDR3 length ( Figure 4H).
We compared the results of our inference to the S5F model (16), which was trained on independent data. The S5F model is defined by a mutability table ␥ w with no attempt to disentangle position dependence, so a direct comparison is subject to caution. Besides, S5F mutabilities are learned from synonymous mutations of productive sequences, requiring extrapolation methods to cover all 1024 contexts, all of which do not occur with synonymous mutations. Yet, the two sets of mutabilities ␥ w correlate fairly well (r 2 = 36%, Figure 4 I). Correlation rises to r 2 = 44% for contexts appearing in synonymous mutations, versus r 2 = 18% for the other contexts for which S5F recourses to extrapolation, emphasizing the limitations of that extrapolation.
A summary of the performances of the different modeling approaches on the mutabilities in the different regions of the IgH gene is shown in Figure 4J. We also checked for overfitting by dividing the dataset into a training and a test-  Figure S6). The full position and context dependent model ( s, x = ␥ w␤ x ) performs better than models with context or position alone ( s, x = ␥ w and s, x = ␤ x ). While the context explains the bulk of the mutation profile, adding positional effects substantially boosts performance. Our model clearly outperforms the S5F model, although it should be reminded that S5F was trained on a distinct dataset. Re-training S5F on the productive sequences from the present datasets using the procedure described in the original article (16) actually yielded worse performance (data not shown), for reasons that are unclear to us. Overall, accounting for phylogeny and disentangling the combined effects of context and position allows our model to accurately predict mutabilities including in the hyper-variable CDR3 region.

Co-localization of mutations cannot be explained by context and position bias
It was previously observed that hypermutations tend to cluster along genomic position in nonproductive sequences (22). However, the origin of this phenomenon and its dependence on confounding factors such as phylogeny and heterogeneous hot spot concentration were not fully characterized.
Clustering of mutations can be directly observed by plotting the fraction n(r) of pairs of mutations at distance r from each other as a function r (normalized by the total number of pairs at that distance, see Supplementary Figure S7), which is also called a spatial correlation function in physics. Focusing on lineages with at least 6 leaves, and iterating through all branches with fewer than 10 mutations, we evaluated this correlation function for pairs of mutations occurring in the same branch of the phylogeny versus distant branches, as schematized in Figure 5A). We then compared this correlation function to our model predictions (Supplementary Figure S8). The enrichment of closeby mutations can be quantified by the correlation function f(r) = n(r)/n m (r), where n(r), the fraction of pairs of mutations distant by r in the same tree branch is normalized by the model prediction n m (r) ( Figure 5B).
Pairs of mutations in distinct branches are well explained by the model, suggesting that they are independent of each other, in agreement with the biological picture that they occur at different rounds of affinity maturation. The enrichment of closeby mutations in distant branches can be entirely explained by the clustering of hotspot regions. Interestingly, both context and position dependencies of the mutability are needed to explain the data (Supplementary Figure S8). In contrast, pairs of mutations inside branches tend to occur closer to each other than predicted by the model. The enrichment of closeby mutations is up to fivefold, pointing to an additional mechanism causing hypermutation clustering. We observe that this enrichment persists in the presence of selection, as verified by computing the correlation function f(r) in productive lineages (Figure 5C).

Minimal model of co-localization
To explain the observed excess of co-localized mutations, we propose a simple phenomenological model. Targeted muta-tions, following the context and position dependent profiles described so far, cause additional nearby 'follow-up' mutations due to error-prone DNA repair. Given a substitution at x 0 drawn from the same distribution as before, each position x = x 0 can subsequently mutate with probability where is the correlation length and ⑀ is small. The total number of follow-up mutations is approximately Poisson distributed with mean x p(x|x 0 ) 2⑀/(1 − e −1/ ). To simulate this process, we followed the same procedure as described earlier for synthetic data, but allowing for follow-up, as well as targeted mutations, while keeping the total number of mutations in each branch constant. We then computed the correlation function f(r), and compared it to true profiles ( Figure 5B). We obtain a good agreement for = 10 and = 5% corresponding to an average of ∼1 followup mutation per targeting event. This result suggests that as many as 50% of observed mutations are follow-up mutations.
We asked whether this large number of non-targetted mutations may bias the inference of the targeting model, which assumes no follow-up mutations. To assess this effect, we reinferred the rates {␤ x } and {␥ w} from synthetic datasets simulated with = 10, = 5%, with data-inferred context profile ␥ w, and with data-inferred or flat position profiles ␤ x (Supplementary Figure S9). We find that the re-inferred mutabilities mostly agree with the true ones, with a slight shrinkage of values and enhanced mutabilities of cold spots, owing to the equalizing effect co-localization. Importantly, co-localization does not introduce additional features in the re-inferred position-dependent profile, indicating that our inference procedure is robust to co-localization effects.

DISCUSSION
The mutational landscape of antibody repertoires results from many entangled effects, which are often lumped together into effective models of hypermutations (12,16,36). First, hypermutations have intrinsic preferences for certain positions along the IgH gene, regardless of their impact on protein function. In addition, selection for antibody function, which includes protein stability and antigen affinity, favors beneficial mutations and suppresses deleterious ones (13). While intrinsic SHM preferences are believed to be universal, selective forces vary across lineages which are involved in distinct immune responses (21), and may also depend on the individual's immune status (37). Repertoire sequencing gives a snapshot of a rapidly adapting population subjected to these forces, making it hard to disentangle intrinsic SHM preferences from the combined effects of selection and genetic drift. By focusing on non-productive lineages and using a phylogeny-based approach, we overcome the biases arising from the dynamics of affinity maturation to obtain a comprehensive picture of SHM intrinsic preferences.
Each hypermutation occurs through a series of events of DNA damage and repair. The action of each enzyme, including AID to error-prone DNA repair enzymes, may each have their own sequence preferences, and the interplay of these different biases results in the observed pro- file. In our approach, these complex mutational pathways are subsumed into an effective model with a limited number of interpretable parameters in terms of effective context and position dependence. As a result, the context dependent weights ␥ w do not simply reflect the binding preference of AID, but also account for the biases of other biochemical steps. Our framework enables direct measurement of the mutability ␥ w of a wide range of 5-mer contexts, recovering the known classifications of hot and cold spots (16,33). We show that our model outperforms existing methods as well as purely context or position dependent models in terms of explaining the data.
The introduction of an explicit and universal position dependence, ␤ x , allows us to unveil an excess of mutations in the CDR1 and CDR2 regions. This enrichment of mutations cannot be simply explained by their harboring more hotspot contexts. We cannot exclude that this residual position dependence is due to more complex context effects missed by our model (based e.g. on 7-mers, which would be impractical to infer from the present dataset). Alternatively, SHM may preferentially target these regions independently of their sequence context, possibly through epigenetic mechanisms. Such preference is known to exist at the genome-wide level to mutate the Ig loci without affecting other genes (13), so it is plausible that the same mechanism targets some specific positions within Ig. The enrichment of mutations in the CDR1 and CDR2 regions is even more marked in productive sequences, meaning that these mutations are more likely to be selected during affinity maturation. This suggests that the intrinsically enhanced mutability of these regions may carry an evolutionary advantage, by focusing hypermutations on regions where they are the most beneficial (35). The stability of the immunoglobulin relies on the FWR regions and most of the substitutions are expected to be deleterious. The purifying nature of selection in FWR regions has been quantified in (38) and contrasted with positive selection in CDR regions.
By studying mutations along lineages, we were able to study mutations in the probable context in which they occurred, rather than relative to the germline sequence, allowing us to take into account the order of mutations and to sample a broader diversity of 5-mer contexts. This approach also allowed us to study and characterize hypermutations in the CDR3, which has been neglected in previous work (11) owing to the difficulty to separate these mutations from junctional diversity.
The phylogenetic methods employed in this study were not specifically designed to study B cell repertoires. In particular the assumptions allowing for fast likelihood computations do not account for the context dependence of the mutation rate beyond the codon frame (23). The positiondependent model introduced here could offer a compromise. While it does not account for the the full complexity of SHM biases, it captures the variation of the mutation rate observed in out-of-frame data well (Figure 4), and can operate under the assumption of independent site evolution. Our framework could also be easily extended to include position-dependent selection in the nucleotide or amino acid representation.
Our analysis confirmed a phenomenon of co-localization of mutations along the sequence. While this effect had been previously reported (22), here we showed that it could not be explained by phylogenetic bias or the existence of regions of higher and lower mutabilities. We proposed a minimal quantitative model of hypermutation targeting, followed by error-prone DNA repair that causes follow-up mutations, which explains the data well. While ideally we would like to infer the position and context mutability profiles taking these follow-up mutations into account, the task is impractical because it would require to identify the origin of each Nucleic Acids Research, 2020, Vol. 48, No. 19 10711 mutation. We expect that doing so would only renormalize the values of the context preferences. While the adaptive advantage of co-localized mutations is unclear, we find the correlation function in productive lineages follows the unproductive baseline with additional enrichment enhanced at multiples of the codon length, 3, suggesting signatures of selection ( Figure 5C). We speculate that nearby mutations occurring simultaneously could help cross barriers of positive sign epistasis, whereby two or more mutations are deleterious by themselves, but beneficial together. This phenomenon could accelerate affinity maturation by favoring compensatory or epistatic mutations at amino acids that interact strongly within the antibody protein (39,40).
The obtained mutability models make predictions about the likelihood and plausibility of particular trajectories of affinity maturation. They could be useful in designing vaccination strategy, by helping choose targets with a greater potential for accumulating beneficial mutations towards antibodies with desired properties such as neutralization power, or broadness in the case of fast evolving pathogens such as influenza or HIV (11,41).

DATA AVAILABILITY
All the data analyzed in this paper has been previously published and can be accessed from original publications. Code for producing the figures of this paper, as well as the inferred model parameters, are freely available at https://github.com/ statbiophys/shmoof.