SEPPA-mAb: spatial epitope prediction of protein antigens for mAbs

Abstract Identifying the exact epitope positions for a monoclonal antibody (mAb) is critically important to antibody design in biomedical research, yet remains highly challenging. Building on SEPPA 3.0, we present SEPPA-mAb for this purpose, with high accuracy and a low false positive rate (FPR), suitable for both experimental and modelled structures. In practice, SEPPA-mAb appends a fingerprint-based patch model to SEPPA 3.0; the patch model considers the structural and physicochemical complementarity between a possible epitope patch and the complementarity-determining region of the mAb, and was trained on 860 representative antigen-antibody complexes. On independent testing of 193 antigen-antibody pairs, SEPPA-mAb achieved an accuracy of 0.873 with an FPR of 0.097 in classifying epitope and non-epitope residues under the default threshold, while docking-based methods gave a best AUC of 0.691, and the top epitope prediction tool gave an AUC of 0.730 with a balanced accuracy of 0.635. A study on 36 independent HIV glycoproteins displayed a high accuracy of 0.918 and a low FPR of 0.058. Further testing illustrated outstanding robustness on new antigens and modelled antibodies. Being the first online tool predicting mAb-specific epitopes, SEPPA-mAb may help to discover new epitopes and design better mAbs for therapeutic and diagnostic purposes. SEPPA-mAb can be accessed at http://www.badd-cao.net/seppa-mab/.


INTRODUCTION
Monoclonal antibodies (mAbs) play important roles in adaptive immune protection, in vitro diagnostics, and clinical therapy owing to their capacity to specifically recognize and bind to the epitope residues of an antigen protein (1). The recent development of BCR-sequencing and library screening technologies has enabled rapid mAb harvesting from vaccinated animals (2), yet the further characterization of Ab-specific epitope positions remains highly challenging, mainly due to the intrinsic spatial nature of Ab-antigen binding (3). Though more and more online tools to predict spatial epitopes have emerged, with notable examples of Epitopia (4), CBTOPE (5), DiscoTope2.0 (6), BepiPred (7, 8) and SEPPA 3.0 (9), they usually focus on the antigen alone, missing the information of cognate antibodies. In other words, these methods forecast all antigenic residues on the antigen surface, which may be targeted by multiple antibody clusters, instead of those targeted by a specific mAb.
Meanwhile, docking-based strategies have also been employed, treating the antigen-antibody interaction as a general protein-protein interaction (10, 11). Typical approaches, including ZDOCK (12) and ClusPro (13), calculate the complementarity between biomolecules in terms of shape, electrostatics and statistical potential for scoring. Usually, docking-based methods predict multiple possible binding regions for one mAb without a cutoff. In addition, drastically decreased accuracy was observed on structures modelled from antibody sequences (14).
Currently, a new trend has emerged of attempting to develop antibody-specific epitope predictors. For instance, Martin et al. developed an antibody-specific B-cell epitope predictor based on antibody-antigen protein complexes. This method divided the surface of the antigen structures into patches and used a feed-forward neural network for model construction (15). Epipred proposed a global docking-based algorithm to identify the epitope region (10). PECAN designed an antigen-antibody interaction algorithm based on a graph convolution attention network (16). More recently, AbAdapt raised an adaptive approach to predict antibody-antigen complex structures on the sequence level (17). Xu et al. proposed a pipeline based on AlphaFold to integrate antibody and antigen structural modelling with rigid docking to predict antibody-specific epitopes (18). However, none of the above provided user-accessible tools or software. Here, we present SEPPA-mAb, the first online tool for predicting mAb-specific epitopes. SEPPA-mAb is composed of two models: SEPPA 3.0 (9), which calculates all potential antigenic sites based solely on the antigen structure, and a fingerprint-based patch model, which scores the potential complementarity between an epitope patch and the complementarity-determining region (CDR) patch. The final integration yields high accuracy and a low false-positive rate, which may be useful to biomedical users.

DATASET
Antigen-antibody structure complexes were extracted from the Protein Data Bank (PDB) (19) and curated to retain unique epitopes. Surface, epitope and paratope residues were defined in the same way as in the SEPPA series (9). Finally, 860 complexes deposited before July 2017 were selected as the internal training dataset (Supplementary Table S1), and 193 complexes deposited after that date were used as the independent testing dataset (Supplementary Table S2), including 36 HIV Env glycoproteins (Supplementary Table S3).

METHODS
The construction of SEPPA-mAb includes three steps: (i) calculating the antigenicity score of SEPPA 3.0, (ii) constructing the patch model and obtaining the patch model score at the residue level and (iii) consolidating the patch model score with the antigenicity score based on the threshold.
In step two, the design of the patch model includes four sub-steps: (i) deriving a group of surface patches for the antigen, (ii) generating a series of descriptors for each surface patch of the antigen and the CDR patch of the corresponding antibody, and then calculating the patch complementarity (PC) score with an XGBoost classifier, (iii) mapping the patch scores to each surface residue and then calculating the raw residue score by considering all the patches that contain the target residue, and (iv) obtaining the final patch model score at the residue level through the calibration process. Detailed information on each step is described in Supplementary Method 1.

Algorithm of SEPPA-mAb
For any query of an antigen-antibody structure pair, SEPPA-mAb predicts which residues on the antigen surface can bind to the CDR of the antibody through the following steps.
Step 1: Generate a spatial surface patch for each surface residue on the target antigen (see Design of patch model);
Step 2: Generate the structure fingerprints for the surface patch of the antigen and the CDR patch of the antibody, and then predict the PC score for each surface patch and CDR patch pair (see Design of patch model);
Step 3: Map the PC scores to each surface residue to obtain the raw residue score according to Equation (1);
Step 4: Calculate the patch model scores at the residue level by calibration and normalization of the raw residue scores according to Equations (2) and (3);
Step 5: Consolidate the patch model scores and the antigenicity scores predicted by SEPPA 3.0 to obtain the final list of mAb-specific epitope residues, i.e. those residues for which both scores are over the thresholds.
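The consolidation in the final step can be illustrated with a minimal Python sketch. It is not the server's implementation: the function name `consolidate`, the residue labels and, in particular, the cutoff values of 0.5 are hypothetical placeholders, since the default thresholds are not stated in this text. The output labels (1 for epitope, −1 for non-epitope, 0 for internal residues) follow the convention described later for the SEPPA-mAb tool.

```python
from typing import Optional

# Hypothetical cutoffs: each score has a default threshold in SEPPA-mAb,
# but the actual values are not given in the text.
PATCH_CUTOFF = 0.5
ANTIGENICITY_CUTOFF = 0.5


def consolidate(patch_score: Optional[float],
                antigenicity_score: Optional[float],
                is_surface: bool,
                patch_cutoff: float = PATCH_CUTOFF,
                antigenicity_cutoff: float = ANTIGENICITY_CUTOFF) -> int:
    """Return 1 (epitope), -1 (non-epitope) or 0 (internal residue, no call)."""
    # Internal residues carry no scores and receive no prediction.
    if not is_surface or patch_score is None or antigenicity_score is None:
        return 0
    # A residue is called an epitope only when BOTH scores exceed their cutoffs.
    if patch_score > patch_cutoff and antigenicity_score > antigenicity_cutoff:
        return 1
    return -1


# Toy residues: (label, patch model score, antigenicity score, on surface?)
residues = [
    ("A101", 0.82, 0.77, True),    # both scores high
    ("A102", 0.82, 0.31, True),    # complementary patch, low antigenicity
    ("A103", None, None, False),   # buried residue
]
labels = {name: consolidate(p, a, s) for name, p, a, s in residues}
print(labels)
```

Requiring both scores to pass acts as a logical AND, which is why the integration can only remove positives predicted by either model alone, trading some sensitivity for a lower false positive rate.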

Design of patch model
For the input antigen, SEPPA-mAb automatically generates a spatial patch for each surface residue on the antigen protein and pairs it with the CDR patch of the corresponding antibody. Then, the patch model generates fingerprints and calculates the patch model score at the residue level.
During fingerprint generation, the patch model introduces a cylinder model describing the structural layout and physicochemical properties of each patch based on the default pixel. Eight properties are considered to generate the 200-bit fingerprints for the antigen side and the antibody side, separately. After being trained on 860 antigen-antibody pairs through XGBoost, the patch model can predict the PC score for each antigen patch according to the CDR of the cognate antibody (see Supplementary Method 1).
To determine whether a residue is an epitope residue, the predicted PC scores are first mapped to individual residues. Considering all the patches that contain the same residue, the raw residue score for any residue $r$ is calculated according to Equation (1), where $\text{PC score}_i$ represents the predicted PC score of surface patch $i$, which contains residue $r$, $d$ is the distance of residue $r$ to the center of patch $i$, and $M$ is the total number of patches that contain residue $r$. Then, to obtain the final patch model score, calibration and normalization are applied to the raw residue scores. The calibration process adjusts the raw residue score of each residue based on the overall tendency of its neighboring residues. The adjusted residue score of residue $r$ is defined as the average raw residue score of all neighboring surface residues, as shown in Equation (2):
\[ \text{adjust residue score}_r = \frac{\sum_{j=1}^{N} \text{raw residue score}_j}{N} \tag{2} \]
where $\sum_{j=1}^{N} \text{raw residue score}_j$ represents the sum of the raw residue scores of all neighboring surface residues within 5 Å atom distance of target residue $r$, and $N$ is the total number of such residues. Finally, the normalization process is conducted to make the results comparable between different antigens. The adjusted residue score is normalized to a range of 0-1 to obtain the patch model score using Equation (3):
\[ \text{Patch Model score}_r = \frac{\text{adjust residue score}_r - \min(\text{adjust residue score})}{\max(\text{adjust residue score}) - \min(\text{adjust residue score})} \tag{3} \]
where $\min(\text{adjust residue score})$ is the minimum adjusted residue score among the residues of a given antigen, $\max(\text{adjust residue score})$ is the maximum adjusted residue score among the residues of a given antigen, and $\text{Patch Model score}_r$ is the patch model score for a specific residue $r$.
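The calibration and normalization described above (neighbor averaging within 5 Å, followed by min-max scaling per antigen) can be sketched in plain Python as follows. This is an illustrative sketch only: the function names, the dictionary-based data layout and the toy coordinates are hypothetical, and two details are assumptions not confirmed by the text, namely that a residue counts as its own neighbor and that "atom distance" means the minimum distance between any pair of atoms of two residues.

```python
import math


def min_atom_distance(atoms_a, atoms_b):
    """Smallest distance (in Å) between any atom of one residue and any atom of another."""
    return min(math.dist(p, q) for p in atoms_a for q in atoms_b)


def adjust_scores(raw_scores, atoms, cutoff=5.0):
    """Calibration: average the raw scores of surface residues within `cutoff` Å.

    Assumption: the target residue is included among its own neighbors.
    """
    adjusted = {}
    for r in raw_scores:
        neighbours = [j for j in raw_scores
                      if min_atom_distance(atoms[r], atoms[j]) <= cutoff]
        adjusted[r] = sum(raw_scores[j] for j in neighbours) / len(neighbours)
    return adjusted


def normalize(adjusted):
    """Normalization: min-max scale the adjusted scores of one antigen to 0-1."""
    lo, hi = min(adjusted.values()), max(adjusted.values())
    if hi == lo:  # degenerate case: all residues share the same score
        return {r: 0.0 for r in adjusted}
    return {r: (v - lo) / (hi - lo) for r, v in adjusted.items()}


# Toy antigen: three surface residues, one atom each (coordinates in Å).
atoms = {
    "A": [(0.0, 0.0, 0.0)],
    "B": [(3.0, 0.0, 0.0)],    # within 5 Å of A
    "C": [(20.0, 0.0, 0.0)],   # isolated
}
raw = {"A": 0.9, "B": 0.5, "C": 0.1}

adjusted = adjust_scores(raw, atoms)
patch_scores = normalize(adjusted)
print(adjusted)
print(patch_scores)
```

In this toy case residues A and B smooth each other toward their shared mean, while the isolated residue C keeps its own raw score, which illustrates how calibration suppresses isolated high or low scores before the per-antigen rescaling.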

Patch model construction and performance test
On top of the antigenic sites predicted by SEPPA 3.0, we developed a patch model to evaluate the physicochemical complementarity of possible contacting regions between the antigen surface and the CDR surface by reporting a patch model score between 0 and 1. The evaluation parameters of the area under the ROC curve (AUC) value and balanced accuracy (BA) are adopted as illustrated in SEPPA 3.0 (9). For internal validation, eight machine learning approaches are screened, including XGBoost (XGB), Sup
The performance of the patch model is tested on 193 antigens and compared with tools available online, including both well-known epitope prediction methods and docking-based methods, via AUC and BA values. As Figure 1 shows, seven tools are selected for comparison, including five traditional epitope prediction tools, Epitopia (4), CBTOPE (5), DiscoTope2.0 (6), BepiPred3.0 (8) and SEPPA 3.0 (9), and two docking-based methods, ZDOCK (12) and ClusPro (13). The test dataset containing 193 antigens does not overlap with the training dataset of any of the above state-of-the-art methods (Figure 1A). Results illustrate that the patch model gives the best results, with an AUC of 0.774 and a BA of 0.681 based on the default threshold. SEPPA 3.0 (9) achieves the second-best performance among all current state-of-the-art methods, with an AUC value of 0.730 and a BA of 0.635, followed by BepiPred3.0 with an AUC value of 0.685 and a BA of 0.628 (Figure 1B).
Further, two representative docking-based tools, ZDOCK (12) and ClusPro (13) Table S7), docking methods indeed display an obvious performance drop on modelled structures, agreeing with previous reports (14). As illustrated in Figure 1C, the overall AUC value of the top 1 prediction decreased from 0.691 to 0.589 for ZDOCK (12) and from 0.680 to 0.601 for ClusPro (13), respectively. A similar drop was observed in the top 5 and top 10 results. Among 193 predictions, the patch model gave a better AUC on 122 (63%) data points than the top 1 prediction of ZDOCK and on 120 (62%) data points than the top 1 prediction of ClusPro (Supplementary Table S4).
In terms of the patch model, it outperforms ZDOCK (12) and ClusPro (13) Table S5), demonstrating its unique ability to tolerate structure variation.
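For readers less familiar with the evaluation measures used in this section, the sketch below computes accuracy, FPR, balanced accuracy and AUC for per-residue predictions in plain Python. It assumes the standard definitions (BA as the mean of sensitivity and specificity, AUC as the probability that a randomly chosen epitope residue scores higher than a randomly chosen non-epitope residue); the exact definitions follow SEPPA 3.0 and are not restated here. The function names, the 0.5 threshold and the toy labels and scores are hypothetical.

```python
def confusion(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = epitope, 0 = non-epitope)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def false_positive_rate(tp, fp, tn, fn):
    return fp / (fp + tn)


def balanced_accuracy(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def auc(y_true, scores):
    """Rank-based AUC: fraction of (epitope, non-epitope) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: five surface residues, two of which are true epitope residues.
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2, 0.1]
y_pred = [1 if s > 0.5 else 0 for s in scores]  # hypothetical 0.5 threshold

counts = confusion(y_true, y_pred)
print(accuracy(*counts), false_positive_rate(*counts),
      balanced_accuracy(*counts), auc(y_true, scores))
```

Because epitope residues are usually a small minority of the surface, accuracy alone can look high for a predictor that labels almost everything as non-epitope, which is why BA and FPR are reported alongside it.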

Performance and case study of SEPPA-mAb tool
The patch model is designed to calculate the complementarity score of interacting surfaces between the antigen and its cognate antibody, while SEPPA 3.0 is designed to score the antigenic sites on the antigen surface. Both scores are normalized from 0 to 1. For each residue, SEPPA-mAb simply considers the two scores and gives a judgment of YES (1 for epitope) if both scores are above their default cutoffs, otherwise NO (−1 for non-epitope) or NOT AVAILABLE (0 for internal residue). In this way, the prediction accuracy
Nucleic Acids Research, 2023, Vol. 51, Web Server issue W531
To better illustrate the benefits of integrating the patch model with SEPPA 3.0, the reference epitope in the hemagglutinin (HA) antigen targeted by antibody C05 (PDB ID: 6D0U, Chain: G) is shown in Figure 2B. Results show that SEPPA 3.0 predicts many scattered antigenic sites (Figure 2C), while the patch model suggests two major regions that are relatively complementary to antibody C05 (Figure 2D). The final integration of SEPPA-mAb efficiently removes those false positive residues, presenting the best candidate positions as C05-targeted epitopes (Figure 2E).
The COVID-19 pandemic in 2020 provided an opportunity to test model performance on completely new antigens never seen by SEPPA-mAb. Here, 31 pairs of spike antigens and their cognate Abs were tested, with detailed PDB IDs listed in Supplementary Table S6. SEPPA-mAb achieved an average accuracy of 0.753 with an average FPR of 0.224. As the accuracy is calculated at the individual residue level, which is highly stringent, we examined whether SEPPA-mAb can predict the correct epitope area for each antibody (30% residue overlap between predicted and crystallized epitope positions). The result shows that our model can successfully suggest 23 out of 31 epitope areas, indicating its outstanding ability on completely new antigens.
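The area-level criterion used above can be expressed as a short check. The sketch below is one possible reading, not the authors' code: it assumes that the 30% overlap is measured as the fraction of reference (crystal-structure) epitope residues that are also predicted, which the text does not state explicitly. The function name `epitope_area_hit` and the residue identifiers are hypothetical.

```python
def epitope_area_hit(predicted, reference, min_overlap=0.30):
    """Return True if enough reference epitope residues were predicted.

    Assumption: overlap = |predicted ∩ reference| / |reference|.
    """
    predicted, reference = set(predicted), set(reference)
    if not reference:
        return False
    overlap = len(predicted & reference) / len(reference)
    return overlap >= min_overlap


# Toy example: a reference epitope of ten residues (hypothetical numbering).
reference = {f"R{i}" for i in range(1, 11)}
good_prediction = {"R1", "R2", "R3", "R4", "X50"}   # covers 4 of 10
poor_prediction = {"R1", "R2", "X50", "X51"}        # covers 2 of 10

print(epitope_area_hit(good_prediction, reference))
print(epitope_area_hit(poor_prediction, reference))
```

Under this reading, a prediction can count as locating the correct area even when many individual residue calls are wrong, which is why the area-level success rate (23 of 31) is a more lenient measure than residue-level accuracy.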

Input
SEPPA-mAb (http://www.badd-cao.net/seppa-mab) accepts an antigen-antibody pair as input in either of the following formats: (i) existing PDB IDs with chain names, and (ii) local files in PDB format. Similar to SEPPA 3.0 (9), users are recommended to select the subcellular localization of the protein antigen and the species of the immune host if available. Batch query submission is also encouraged. In a batch query, users can submit multiple entries, each specifying PDB IDs, subcellular localization, species of immune host, and chain names. After successful submission, each residue of the antigen protein will be processed by the patch model, using the information of the corresponding antibody, to calculate a score of possible Ab interaction, and by SEPPA 3.0 to obtain the antigenicity score.

Output
The output results of SEPPA-mAb will either be presented in .html format, accessible by tracking the progress of the calculation via the job ID, or sent back to users via email. The .html page provides a result summary from three aspects: (i) submission information, including model parameters, the sequence of the antigen, and the sequence of the antibody, as well as the predicted epitope information, including the residue positions of the input antigen. Predicted epitope residues are presented in red capital letters and non-surface amino acids are presented in lowercase letters (Figure 3A); (ii) epitope 3D visualization, helping users to observe the spatial distribution of epitopes. The 3D model of the antigen is created via Jmol, and amino acids are colored according to their SEPPA-mAb scores (Figure 3B); and (iii) downloadable results in .txt format, which include the query information, SEPPA 3.0 score, patch model score, and SEPPA-mAb score for each residue (Figure 3C). More information can be found on the HELP page of SEPPA-mAb.

DISCUSSION
Predicting the Ab-specific epitopes of protein antigens is highly desired, but no tool is available online. In this study, we developed SEPPA-mAb for this purpose, based on the previous antigenicity prediction server SEPPA 3.0, extended with a new patch model that calculates the physicochemical complementarity between the antigen and antibody interaction surfaces. Compared with current state-of-the-art methods, both traditional epitope prediction tools and docking-based algorithms, the patch model alone shows outstanding performance on Ab-specific interface prediction and robustness to structure variation. More importantly, the integrated tool SEPPA-mAb can efficiently increase the accuracy and reduce the FPR, with the best ability to tolerate the structure variation introduced by computer modelling.
Predicting B-cell epitopes for a cognate binding or neutralizing antibody has received increasing attention in recent years. As more and more antibodies are derived from one antigen, it is becoming apparent that essentially different surface regions of an antigen may be recognized and bound by antibodies (21, 22). Missing the information of a cognate antibody, what those traditional algorithms (4, 7, 9) calculate are in effect pan-antigenic sites. Several pioneers then reformulated the question from predicting pan-antigenic sites to suggesting mAb-specific epitopes (15, 22). For instance, Rapberger et al. proposed that the antigen epitope should geometrically and electrostatically match the antibody structure (23). More recently, the Ab-specific epitope predictor by Martin et al. divided the antigen surface residues into multiple patches to generate 471 features, including 237 for the antigen patch and 234 for the antibody paratope (15). Though no accessible tools were provided to biomedical users, the above work offered enlightening guidance for the development of SEPPA-mAb.
The performance of SEPPA-mAb mainly benefits from both the successful prediction of potential antibody-complementary regions by the patch model and the pan-antigenic sites predicted by SEPPA 3.0. Importantly, our patch model is designed with several novel aspects. The first is the patch-based structural and physicochemical fingerprints derived from the cylinder model. Via the cylinder (Supplementary Method 1), the layers of local microenvironmental variation can be fully considered for a surface residue under the neighborhood influence of both surface and internal residues. Secondly, different from the previous patch conception where all residues in the patch are treated equally, we give patch residues weighted scaling according to their distance to the patch center. In this way, the residue layout and the resulting physicochemical properties can be well captured from both sides of the interaction interface, describing the complex nature of antigen-antibody interactions. Thirdly, all calculations are made at the residue level, instead of on the detailed atom coordinates used in docking methods. This coarse-grained description enables rapid surface scanning and patch pairing, while also tolerating the structural variation caused by computer modelling. Finally, a calibration process is introduced to further reduce the FPR by considering the influence of neighboring residues. In summary, SEPPA-mAb consolidates the pan-antigenic sites predicted by SEPPA 3.0 and the mAb CDR-complementary surface predicted by the patch model, enhancing the prediction performance from baseline (accuracy of 0.776-0.790, FPR of 0.196-0.206) to a level with a high accuracy of 0.873 and a low FPR of 0.097.
Note that the current model aims to recommend the best epitope positions on the antigen surface recognized by its cognate antibody. Any input antibody is assumed to interact with the input antigen. Mechanistically, SEPPA-mAb employs more information from antigens than from antibodies, leading to its insensitivity to antibodies. In fact, it is more sensitive to antigen mutation and structural variation. As SEPPA-mAb conducts its calculation based on structure files, incomplete structures may reduce its performance. Further, SEPPA-mAb considers the influence of glycosylation through SEPPA 3.0. Other forms of post-translational modification, such as phosphorylation, ubiquitination and methylation, have not been considered in the current version. Also, single-chain antibodies are not supported for now. In the future, with the rapid accumulation of structures generated through experiments and AI technologies, as well as the development of deep learning algorithms, improved versions can be expected for antibody-specific epitope prediction, which may better assist antibody design for therapeutic and diagnostic purposes.

DATA AVAILABILITY
The data underlying this article are available in the article and in its online supplementary material. Accession number PDB ID: 6D0U.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.