Proteins across scales through graph partitioning: application to the major peanut allergen Ara h 1

The analysis of community structure in complex networks has been given much attention recently, as it is hoped that the communities at various scales can affect or explain the global behaviour of the system. A plethora of community detection algorithms have been proposed, insightful yet often restricted by certain inherent resolutions. Proteins are multi-scale biomolecular machines with coupled structural organization across scales, which is linked to their function. To reveal this organization, we applied a recently developed multi-resolution method, Markov Stability, which is based on atomistic graph partitioning, along with theoretical mutagenesis that further allows for hot spot identification using Gaussian process regression. The methodology finds partitions of a graph without imposing a particular scale a priori and analyses the network in a computationally efficient way. Here, we show an application on peanut allergenicity, which despite extensive experimental studies that focus on epitopes, groups of atoms associated with allergenic reactions, remains poorly understood. We compare our results against available experiment data, and we further predict distal regulatory sites that may significantly alter protein dynamics.


Introduction
Proteins are multi-scale biomolecular machines with coupled structural organizations across time and length scales. The global dynamics of a protein structure arises as a result of contributions from the smallest scale of atoms, through chemical groups, amino acids residues, secondary structures, to the largest scale of domain motions. Each level of this structural organization links to some functional behaviour and the related scales span across several orders of magnitude [1,2]. This coupling poses challenges to computational methods [3][4][5][6] not only due to their excessive computation cost but also due to sampling issues or appropriate level of description. Coarse-grained methods have been proposed as a means to reach the functionally relevant scales [7][8][9][10][11][12] but with a cost of loss of generality and of smearing the detailed physicochemical atomic interactions. PEANUT   different levels of organization in a protein structure across multiple time scales through partitioning while keeping atomistic details in a computationally efficient manner. First, it constructs an atomic-level graph for each protein from its structural data, where the graph nodes are the atoms, and the edge weights are based on interatomic interactions deduced from atomic positions and geometries [38,40]. It then uses a diffusion on the graph to find communities at each scale by evaluating the quality function defined by Markov Stability [37,45]. This avoids the resolution problem inherent in other community detection methods [46,47] that are embedded within a specific scale [39]. Finding communities without imposing a scale a priori can help the understanding of a system's global behaviour formed by components existing at various resolutions. The method is also capable of identifying dynamical differences between similar structures [48,49], so it is well suited for unravelling the dynamical differences of the major peanut allergen Ara h 1 and its negative controls. We first explore the functional components of Ara h 1 through 4 H. ZHANG ET AL.
its structural hierarchy across scales. We show that there is a correspondence between partitions at intermediate time scales and experiment-proposed epitopes (groups of residues thought to be responsible for triggering an allergic reaction) and also between partition boundaries and critical residues. We highlight the different partitioning processes of the allergen and two negative controls, OxcA and MnCA, as a means to shed light in their differing functions. In parallel, we use a theoretical mutagenesis method to identify hot spots, residues that impact the global dynamics of the peanut allergen.

Graph construction from protein crystal structures
We first derive an undirected, weighted atomic network based on the protein structure to be analysed. We briefly summarize the process here, but for details, please see references [40] and [38]. The protein structures are obtained from the RCSB Protein Data Bank (see Fig. 1 and Table 1). Missing residues are completed by the software MODELLER [50]. Hydrogen atoms are all stripped and then readded using REDUCE [51]. Each node of the network corresponds to an atom and edges correspond to pairwise interactions such as covalent bonds, hydrogen bonds, salt bridges, hydrophobic tethers, electrostatic interactions and π -stacking interactions. Edges are identified using FIRST [52] with a cut-off of 0.01 kcal/mol for hydrogen bonds and 8 Å for hydrophobic tethers. They are subsequently weighted by the strength of interactions as follows: the Mayo potential [53] for hydrogen bonds, the potential of mean force from reference [54] for hydrophobic tethers. In this study, we stripped off all ions and water molecules.

Markov Stability
To reveal the organization of the structure at different scales, we use Markov Stability [45,55] to analyse the generated protein graph. At each Markov time, we find an optimized partition where a random walk is likely to remain trapped. As we increase this timescale, significant partitions are obtained in increasingly coarser communities. In the case of the protein graph [38,39,56], this process allows us to scan across resolutions and find clusters corresponding to chemical groups at very short Markov times, through biochemical units, amino acids and secondary structures, at intermediate Markov times, to partitions corresponding to functional domains or subunits at longer Markov times. These groupings correspond to groups of atoms moving coherently. The duration of the partition, together with its robustness to perturbation, allows us to map out dynamical properties of the protein.
Let us define an undirected, weighted network G(V , E), denoted by the adjacency matrix A of rank n. Its vertex degrees are d i = n ij A ij , and the degree matrix D = diag(d). On this network, we consider a continuous-time Markov process which is governed by the combinatorial Laplacian, L = D − A, as most appropriate for protein dynamics: where d = (1 T D1)/N is the average degree and p = pD −1 . Now the stationary distribution of (2.1) is the uniform distribution over all the nodes: π c = 1 T /N. Markov Stability is then defined as where c is the number of communities in the partition; H is a N × c indicator matrix of P with entries H ij equal to one if node i belongs to community j and zero otherwise; π c denotes the stationary distribution defined above and c are the diagonal elements of π c . The time t here is the Markov time or a dimensionless resolution parameter.

Optimizing Markov Stability using the Louvain algorithm
As is the case in most clustering-related problems, the global optimization of Markov Stability is NP-hard [37]. A variety of heuristic strategies can be used to obtain good partitions that can then be ranked by Markov Stability to provide us with near-optimal partitions at different time scales.
Here, we use the Louvain algorithm, a greedy agglomerative method [57]. The algorithm has been observed to require little computational effort and to find partitions close to the optimal solution [57]. Initially, each node is assigned to its own community, then each node is transferred in turn into the neighbouring community, where the increase of Markov Stability is the largest, as long as it increases the Markov Stability value of the overall partition. This step is repeated until no further transfer can increase the Markov Stability. At this point, a new graph of communities is generated. The algorithm repeats these two steps until a coarse-grained graph is obtained, where no further grouping can improve Markov Stability.
The Louvain algorithm is deterministic, but the final solution found depends on the order in which the different nodes are scanned for the grouping step. This initial ordering, which we refer to as the Louvain initial condition, can be chosen at random every time the calculation is executed. Indeed, we will use the variability of the observed solution induced by our random choice of the Louvain initial condition to estimate the robustness of a partition, a measure of its relevance.

Defining community robustness by variation of information
A distinctive property of a significant community structure should be its robustness to small perturbations. Upon introducing slight variations in the graph itself, the new partition found should be highly similar to the one obtained originally [37,56,58,59]. One way to quantify the extent to which the partitions are affected by the perturbation [59,60] is using a measure of distance between the solutions found before and after the perturbation. An information-theoretic distance between two partitions can be measured by the variation of information (VI) [61][62][63], a metric based on the total information that is not shared by two partitions. The random initial conditions of the Louvain optimization algorithm provide us with an ideal perturbation with respect to which we can measure the robustness of the partitions. By optimizing Markov Stability for an ensemble of such initial conditions for each Markov time, we can calculate the VI between all pairs of solutions obtained and compute the average as a measure of the relevance of the solutions obtained at a particular scale. Other perturbations affecting edge weights or the quality function, for example, have been considered in the past and shown to yield similar results [37,49].

Identifying hot spots by mutational analysis
Another question of interest is the identification of residues or nodes that significantly impact the protein dynamics when mutated, referred to as hot spots. To mimic in silico the process of the experimental procedure, alanine scanning mutagenesis, which replaces each residue at a time with an alanine, we mutate each residue to an alanine in turn by removing all edges of weak interactions with respect to its side chain (see Fig. 2). The mutated graph is partitioned using the linearized version of Markov Stability. We then identify the mutations that affect significantly the robustness of partitions using Gaussian  process regression (GPR), a non-parametric fitting method [40,64]. We use all the VI vectors from the mutation of each residue along with the vector of the wild type to produce a statistically reliable VI vector with confidence bounds. Any mutation with VI values falling outside three standard deviations for more than one-fourth of the time period in the Markov time window is considered to have a significant impact on the graph robustness. The calculation is realized using the gpml MATLAB toolbox (http://www.gaussianprocess.org/gpml/code/).

Structural organization of the Ara h 1 network across scales
The Markov Stability results of the Ara h 1 monomer are shown in Fig. 3B. The zooming at different resolutions starts by finding chemical groups at high resolution, then onto amino acids and secondary structures, followed by segments and functional domains, and finally merging all parts of the structure. From Markov time 10 5 onwards, Ara h 1 exhibits well-defined communities mostly by long plateaux. At longer time scales, where typically proteins are functional, we observe the long-lived and robust communities in the two-barrel separation. This is in line with how a protein with such an architecture should behave and agrees with our work on other proteins. Interestingly, the intermediate time scales indicate less robustness. For example, the VI of the four-way partition is unusually variable. Additionally, the N-terminus segment between F14 and R17 is partitioned into three consecutive communities. Indeed, it is difficult to find a reasonable partitioning in this region. This lack of partitioning shows that the allergen protein structure is susceptible to local disturbances and may also imply its adaptability to multiple conformations for IgE binding or further aggregation.
Certain communities across those intermediate time scales correlate with some of the proposed epitopes. At each time step, each community was sequentially compared with each of the experimental epitopes by overlapping their atoms. As all Ara h 1 epitopes are linear, communities with breakages were not considered. When one community overlapped with a certain epitope with over 80% of their atoms, a correspondence was established. As the 14th epitope is over twice the size of others, its failing threshold was set to 30%. Figure 3D shows the mapping of epitopes onto the communities in Ara h 1. No community can be mapped with epitope at bigger scales, because smaller communities will merge into large functional domains, indicating global dynamics. Epitopes 8 and 14 last much longer than the 8 H. ZHANG ET AL. Fig. 4. The highest occurring residues that appear in boundaries of communities at the intermediate time scales are plotted and coloured red if they belong to an epitope and blue otherwise. Residues previously found to be critical are circled. We identify these residues as well but also find additional ones that may play an important role in the protein dynamics. others, whereas epitopes 2, 11 and 13 did not manifest themselves as single communities. Epitope 8 is an immunodominant epitope according to Burks et al. [19], so the persistence of a community is to some extent related to its allergenicity. Note that epitopes appear and reappear due to either merging of communities, for example, epitopes 1 and 3 or breaking of communities of an epitope, for example, epitope 9. As a comparison, there are no linear epitopes identified in OxdC or MnCA.

Community boundaries correspond to critical residues
We also mapped community boundary residues at intermediate time scales to epitopes, shown in Fig. 4. We identified four of the epitope residues that have been mutated and known to correspond to critical residues that is known to be crucial to IgE binding. However, we also identify further residues that may indicate new pockets for binding events or communication. In particular, residue Ala212 has never been tested experimentally. These residues may be important for functional motions rather than purely thermodynamic or surface exposure reasons. Figure 5 shows the Markov Stability analysis results of Ara h 1 and the two controls. As discussed in the Introduction section, the three proteins are structurally similar and share the two-barrelled configuration (see Fig. 1). The two-domain motions are the same when either protein opens and closes around the virtual dyad axis and in fact appear as the final two community partition at the end of the calculation (partition f). However, the evolution of communities is distinct between the allergen and the other structures, reflecting the differing functions they need to perform.

Different partitioning process of the peanut allergen and the controls
Ara h 1 and OxdC have significantly different number of partitionings even at shorter time scales, indicating different local movements. As time progresses, OxdC formed its C-terminus barrel community first, followed by the other barrel. Then, the C-terminus barrel was split into two, with a varying inner boundary for some time period. When the C-terminus barrel was complete again, it started to merge outside residues, before eventually the two barrels emerged as the two partitions. A similar process was observed in the MncA case. In contrast, the allergen followed a continuous merging of communities until the final two barrels formed (see Fig. 3C). In summary, two distinct community evolution processes appear: the merging-splitting-merging process is shared by OxdC and MncA, whereas the peanut allergen constantly merges more residues into bigger communities. These different processes reflect their distinct functions: the allergen, in a consistently co-operative trend, binds IgE at different timescales, while OxdC maintains itself and reorganizes its functional domains to catalyze and cleave carbon-carbon bonds. Most of the critical residues of Ara h 1 are distributed on the outside, so the binding process, generally happening on the flanking helices and loops, will not affect the barrel on the inside. In contrast, for OxdC, in addition to its manganese-binding sites positioned in the middle region of each barrel, Just et al. [65] argued that Glu162 is a new candidate for the crucial proton donor through substantial conformational change. Consequently, protein segments need to readjust, reflected by the splitting and reforming barrels, which may help explain the different community merging process. Figure 6 shows the VI between every mutant of the peanut allergen Ara h1 with the wild type according to the procedure described in the aforementioned Methodology section. Two residues were picked up as having a significant effect by our procedure, namely Glu222 and His211. Glu222 is located by epitope 5, whereas His211 is within epitope 14, beside the partitioning boundary residue Ala212 at medium 10 H. ZHANG ET AL. Fig. 6. The VI of the allergen Ara h 1 (purple) and its mutants. Most mutants do not have a significant effect over all timescales and fall within the 95% confidence region by GPR (shown in grey). We consider a residue as a hotspot if its deviation exceeds three standard deviations for over one-fourth of the time spectrum studied. Only mutants H211A (orange) and E222 (yellow) fulfil the criteria.

Computational site-directed mutagenesis
timescales, and not far from Glu222. As these two residues are directly related to epitopes, it is both their binding affinity and their conformational dynamics that seem to be altered by the mutation.

Conclusions and perspectives
Although Ara h 1 is well known for triggering IgE-mediated allergy to release various chemicals, the mechanism of this process is still not fully understood. At least 23 epitopes have been proposed as possible binding places, and the role of epitopes is being re-examined [19].
Here, we studied the major peanut allergen Ara h 1 as well as its structurally similar negative controls through atomistic-based graph partitioning based on Markov Stability as a viable approach for studying protein dynamics across timescales. By partitioning the graphs generated from protein structures, we were able to find meaningful communities at different resolutions related to timescales and functional activities. We identified an intermediate time scale where non-robust communities are related to epitopes, known regions important for allergenic response. We observed distinct coupling routes between levels of dynamics from atomic movements up to functional domains, which may influence the differing functions of IgE-binding activities. Finally, two distal residues had strong impact on the global dynamics when they were mutated by computational alanisation. The extent of the mutational effects and the pathways that may link epitopes are subject of future work.

Funding
The research leading to these results has received funding from the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no 607466 and from the Engineering and Physical Sciences Research Council (EPSRC) through award EP/N014529/1 funding the EPSRC Centre for Mathematics of Precision Healthcare.