AbDiver: a tool to explore the natural antibody landscape to aid therapeutic design

Abstract
Motivation: Rational design of therapeutic antibodies can be improved by harnessing the natural sequence diversity of these molecules. Our understanding of the diversity of antibodies has recently been greatly facilitated through the deposition of hundreds of millions of human antibody sequences in next-generation sequencing (NGS) repositories. Contrasting a query therapeutic antibody sequence to naturally observed diversity in similar antibody sequences from NGS can provide a mutational roadmap for antibody engineers designing biotherapeutics. Because of the sheer scale of the antibody NGS datasets, performing queries across them is computationally challenging.
Results: To facilitate harnessing antibody NGS data, we developed AbDiver (http://naturalantibody.com/abdiver), a free portal allowing users to compare their query sequences to those observed in the natural repertoires. AbDiver offers three antibody-specific use-cases: (i) compare a query antibody to positional variability statistics precomputed from multiple independent studies, (ii) retrieve close full variable sequence matches to a query antibody and (iii) retrieve CDR3 or clonotype matches to a query antibody. We applied our system to a set of 742 therapeutic antibodies, demonstrating that for each use-case our system can retrieve relevant results for most sequences. AbDiver facilitates the navigation of vast antibody mutation space for the purpose of rational therapeutic antibody design.
Availability and implementation: AbDiver is freely accessible at http://naturalantibody.com/abdiver.
Supplementary information: Supplementary data are available at Bioinformatics online.

Supplementary materials for AbDiver: a tool to explore the natural antibody landscape to aid therapeutic design.

Data
We employed the Observed Antibody Space (OAS) database as the source of IMGT-numbered sequences (Kovaltsuk et al., 2018). The data in OAS have the benefit of having been processed using a single assembly and antibody-specific annotation pipeline. The IMGT (Lefranc, 2011) numbering and gene call annotations in OAS are handled by ANARCI (Dunbar and Deane, 2015). Although employing different gene references is known to affect the results (Smakaj et al., 2020), we decided to use the OAS-included annotations as the baseline because, at the very least, they provide the same annotation for all sequences.
Gene calls could pose specific issues within the remit of the therapeutic applications of our platform. We recognize that one might seek a CDR3/clonotype that originated from a custom scaffold, humanized mouse, etc. that does not have a clear annotation in the dropdown on our website. In such cases, we suggest that users determine for themselves which V-gene(s) their scaffold is closest to in terms of sequence identity. Alternatively, one can submit the full-length sequence to our profiling service, which should identify the V-region; this can subsequently be used in the CDR3/clonotype search.
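As a rough illustration of the manual V-gene assignment suggested above, the following sketch ranks reference V-genes by percent identity to a scaffold. It assumes the query and the references have already been aligned to the same length (for example by IMGT numbering, with `-` marking gaps). The function names, the gene labels and the short toy sequences are hypothetical placeholders and are not real germline data or part of AbDiver.

```python
def percent_identity(a, b):
    """Percent identity over positions where neither aligned sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    matches = sum(x == y for x, y in compared)
    return 100.0 * matches / len(compared)


def closest_v_genes(query, references):
    """Return the name(s) of the closest reference V-gene(s) and their identity.

    `references` maps a gene name to its aligned sequence. Ties are all
    returned, mirroring the "V-gene(s)" wording in the text.
    """
    scores = {name: percent_identity(query, seq) for name, seq in references.items()}
    best = max(scores.values())
    names = sorted(name for name, score in scores.items() if score == best)
    return names, best


# Toy, made-up fragments used only to exercise the functions.
toy_references = {
    "GENE_A": "QVQLVQSG",
    "GENE_B": "EVQLVESG",
    "GENE_C": "DIQMTQSP",
}
names, best = closest_v_genes("EVQLVQSG", toy_references)
print(names, best)
```

In practice one would substitute a full set of aligned germline V-gene sequences for the toy dictionary; the gene(s) returned could then be selected in the CDR3/clonotype search.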
Within our search services, the user can either find the closest match among human or mouse genes or provide a list of organisms for a specific gene search. The choice of organisms reflects the bias towards human and mouse sequencing datasets and the lower availability of data from other organisms. We expect that as more species reference genes and BCR studies become available, the service will be extended with others. This should include not only canonical antibodies from multiple species, but also specific formats such as single-domain antibodies as found in camelids (Deszyński et al., 2022).

Variable Profile Calculation
We calculated immunoglobulin sequence profiles that capture the diversity of these molecules within either a gene or an allele context. For each study, we identified sequences that belong to a single gene or allele according to IMGT and, after redundancy removal at a 95% threshold (Steinegger and Söding, 2018), these served as the basis for profile creation. Within such gene or allele sets from a single study, we aligned the sequences using the IMGT scheme and calculated positional frequencies. We only calculated frequencies for IMGT positions that had more than 1,000 entries. Therefore, each position from a single gene or allele, from a single study, was associated with a 20-entry vector $v_{s,g,i} = (a_1, a_2, a_3, \ldots, a_{20})$, where $a_k$ is the frequency of a particular amino acid, $s$ is a particular study, $g$ is a particular germline annotation and $i$ is the specific IMGT position. Each vector is then associated with an entropy (the Shannon entropy calculated from the frequencies of the 20 amino acids) as well as an ordered list of frequency ranks, with 1 corresponding to the most frequent amino acid and ties being assigned the same rank.
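The per-position calculation described above can be sketched as follows. This is a minimal illustration and not the AbDiver implementation: the function name is hypothetical, the entropy is computed in bits (the logarithm base is not stated in the text), ties are handled with dense ranking (one of several ways to give tied frequencies the same rank), and only the 20 standard amino acids are counted as entries.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def positional_profile(column, min_entries=1000):
    """Frequencies, Shannon entropy and ranks for one IMGT position.

    `column` holds the residues observed at that position across the
    non-redundant sequences of one gene/allele set from one study.
    Returns None if the position does not have more than `min_entries`
    residues (assumption: gaps and non-standard letters are not counted).
    """
    residues = [r for r in column if r in AMINO_ACIDS]
    if len(residues) <= min_entries:
        return None
    counts = Counter(residues)
    total = len(residues)
    freqs = [counts[a] / total for a in AMINO_ACIDS]
    # Shannon entropy in bits; zero-frequency amino acids contribute nothing.
    entropy = -sum(f * math.log2(f) for f in freqs if f > 0)
    # Dense ranking: rank 1 is the highest frequency, equal frequencies share a rank.
    distinct = sorted(set(freqs), reverse=True)
    rank_of = {f: i + 1 for i, f in enumerate(distinct)}
    ranks = [rank_of[f] for f in freqs]
    return freqs, entropy, ranks


# Small toy column; the threshold is lowered so the example runs on 10 residues.
toy_column = ["A"] * 6 + ["G"] * 2 + ["S"] * 2
freqs, entropy, ranks = positional_profile(toy_column, min_entries=5)
print(round(entropy, 3), ranks[AMINO_ACIDS.index("A")], ranks[AMINO_ACIDS.index("G")])
```

With the default threshold, a position backed by 1,000 or fewer residues yields `None`, which corresponds to positions for which no vector is calculated.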
To avoid biases introduced by different biological conditions, sequencing depths, etc. across studies, we chose to calculate ranks and entropies of global positional frequencies across studies. The rank and entropy for a specific gene or allele IMGT position are given as means over all the studies for which we could calculate such vectors. We note that this metric might be affected by outliers, so frequency distributions are presented as box plots that allow the user to examine the kernel density first hand. An overview of this entire process is given in Figure 1 in the main manuscript.
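The averaging over studies can be illustrated with a short sketch. It assumes that each study contributes either an (entropy, ranks) pair for a given gene or allele position, or nothing when the position did not pass the entry threshold in that study. The function name and the toy values are hypothetical, and the rank lists are shortened to three entries for readability.

```python
from statistics import mean


def aggregate_across_studies(per_study):
    """Mean entropy and mean per-amino-acid rank over studies with a vector.

    `per_study` maps a study identifier to (entropy, ranks) or to None
    when no vector could be calculated for that study.
    """
    available = [value for value in per_study.values() if value is not None]
    if not available:
        return None
    mean_entropy = mean(entropy for entropy, _ in available)
    n_letters = len(available[0][1])
    mean_ranks = [mean(ranks[k] for _, ranks in available) for k in range(n_letters)]
    return mean_entropy, mean_ranks


# Toy values only; real rank vectors would have 20 entries.
toy_position = {
    "study_1": (1.0, [1, 2, 3]),
    "study_2": (2.0, [2, 2, 1]),
    "study_3": None,  # position below the entry threshold in this study
}
print(aggregate_across_studies(toy_position))
```

Because a plain mean gives every study equal weight regardless of sequencing depth, a single atypical study can shift the result, which is the outlier sensitivity noted above.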

Section 2. V-region profile benchmark.
We tested whether the profiling service could find suitable profiles for our 742 therapeutics. A successful profile was arbitrarily defined as requiring at least 10,000 sequences contributing to its calculation (Supplementary Figures 1, 2). Using allele-based profiles, there were 688/738 (93.22%) heavy chains and 486/707 (68.74%) light chains where profiles had more than 10,000 OAS sequences. Using the less stringent gene-based profiles, there were 699/738 (94.71%) heavy chains and 496/707 (70.15%) light chains where profiles had more than 10,000 sequences. The smaller number of light chains for which we can find suitable profiles results from a skew towards heavy chains in NGS depositions.
We also plotted the number of IMGT framework positions that do not match the top amino acids in the identified gene-based (Supplementary Figure 3) and allele-based (Supplementary Figure 4) profiles. The higher number of framework positions disagreeing with the NGS distribution in the gene-based profiles reflects multiple alleles contributing to these. Our profiles indicated that the majority of therapeutic antibodies contain framework mutations that are not commonly found in naturally occurring antibodies. In total, 191/738 (25.88%) heavy chains and 125/707 (17.68%) light chains from the therapeutic antibodies contained more than five framework mutations not commonly found in naturally occurring antibodies. Therefore, our service identifies profiles for the majority of therapeutic antibodies, highlighting non-trivial positional frequency information.