The Patent and Literature Antibody Database (PLAbDab): an evolving reference set of functionally diverse, literature-annotated antibody sequences and structures

Abstract Antibodies are key proteins of the adaptive immune system, and there exists a large body of academic literature and patents dedicated to their study and concomitant conversion into therapeutics, diagnostics, or reagents. These documents often contain extensive functional characterisations of the sets of antibodies they describe. However, leveraging these heterogeneous reports, for example to offer insights into the properties of query antibodies of interest, is currently challenging as there is no central repository through which this wide corpus can be mined by sequence or structure. Here, we present PLAbDab (the Patent and Literature Antibody Database), a self-updating repository containing over 150,000 paired antibody sequences and 3D structural models, of which over 65 000 are unique. We describe the methods used to extract, filter, pair, and model the antibodies in PLAbDab, and showcase how PLAbDab can be searched by sequence, structure, or keyword. PLAbDab uses include annotating query antibodies with potential antigen information from similar entries, analysing structural models of existing antibodies to identify modifications that could improve their properties, and facilitating the compilation of bespoke datasets of antibody sequences/structures that bind to a specific antigen. PLAbDab is freely available via Github (https://github.com/oxpig/PLAbDab) and as a searchable webserver (https://opig.stats.ox.ac.uk/webapps/plabdab/).


Introduction
Antibodies are by far the most successful type of biotherapeutic, with over 100 approved by the FDA and many more in the advanced stages of clinical development ( 1 ,2 ).Their high specificity and affinity also make them a valuable tool in many areas of medical and scientific research.For example, antibodies are used routinely in diagnostic assays ( 3 ), and to better understand the effects of vaccination on the immune system ( 4 ).
The variable region in antibodies (Fv) that is responsible for antigen binding, has a conserved global structure.It is formed by two immunoglobulin domains, the heavy (VH) and light (VL) chain variable domains.The binding site is divided between both chains, and is concentrated in six hypervariable loops, three on each chain, known as the complementaritydetermining regions (CDRs).Among the CDR loops, the third CDR of the heavy chain (CDR-H3) is the most diverse in sequence and structure and often the primary contributor to antigen binding ( 5 ).However, the other five CDR loops and the relative orientation of the heavy and light chain have also been shown to affect binding ( 6 ).
Next generation sequencing (NGS) has enabled researchers to take snapshots of the immune repertoire of an individual at a given point in time, leading to the generation of vast amounts of single chain antibody sequences.Efforts to compile this data has led to the creation of datasets such as Unpaired OAS ( 7 ) and iReceptor ( 8 ) which contain the sequence of the heavy or light variable domains for billions of antibodies.Paired VH-VL sequence data is more expensive to generate and currently just over a million paired antibody sequences can be found in Paired OAS ( 9 ).However, with the binding site in antibodies sitting across both chains, paired data gives a more complete picture of how and to what an antibody binds ( 10 ,11 ).
Although NGS data has proven invaluable to compare repertoires between individuals, it provides little information on the functions of individual sequences within a repertoire.However, there also exists a large number of smaller scale studies, each one dedicated to investigating a small number of antibodies.When combined, the antibodies from these studies amount to a large number of sequences with rich metadata.There are a number of databases that aim to compile subsets of this data, for example SAbDab ( 12 ,13 ) for antibodies with resolved crystal structures, Thera-SAbDab for antibody therapeutics ( 2 ), CoV-AbDab for COVID-19 binding antibodies ( 14 ) or PAD for unpaired antibody sequences from patents ( 15 ).Paired antibody sequences with information on their epitope can also be obtained from IEDB ( 16 ).
Here we present PLAbDab, a database containing 150 000 paired antibody sequences from over 10 000 small scale studies.PLAbDab is larger than any other non-NGS database of paired antibody sequences by at least an order of magnitude.We make the data freely available and provide methods to rapidly search it by either sequence identity using KA-search ( 17 ), structural similarity ( 18 ,19 ), or by keywords in the title of the study.Each of the sequences comes with a direct link to its source material, making it easy to obtain additional information about any antibody of interest.

Collecting unpaired antibody sequences
The majority of data in PLAbDab is extracted from the Protein database of NCBI ( 20 ).The BioPython Entrez module ( 21) is used to query the database for entries containing the words 'antibody' , 'antibodies', 'immunoglobulin', 'scfv' or 'bcr' in any of their fields.Due to the lack of an automated method to accurately differentiate humanised nanobodies from human antibody heavy chains, entries containing the words nanobody or nanobodies are removed at this stage.Entries with sequences longer than 1000 amino acids or shorter than 70 are also removed.Around 2.5 million entries were returned from this search.
Sequences for each of these 2.5 million entries were then searched for antibody variable domain sequences using AN-ARCI ( 22 ).Sequences missing residues in the CDR regions were removed at this stage.This resulted in around 530 000 potential antibody variable domain sequences from around 13 000 different sources.

Creating paired antibody sequences from the unpaired data
For a large number of entries the metadata provides enough information to pair the heavy and light variable domains.For entries from the same literature source, VH-VL pairing was attempted using the following heuristics: ( The described strategies for pairing VH-VL sequences have varying levels of accuracy.Twenty entries paired by each of the methods described above were randomly selected and manually checked to estimate the accuracy of each pairing method.Methods 1-4 achieved perfect accuracy on the twenty entry test set.We therefore refer to these methods as pairing with 'high confidence'.Method 5 incorrectly paired one entry and Method 6 incorrectly paired two entries.Each paired entry is given a flag to indicate how it was paired, making it easy for users to filter for less accurate pairing methods.To further increase the coverage of the dataset, sequences from both SAbDab [7.4% of entries] and Thera-SAbDab [0.6% of entries] were also added to PLAbDab.

Labelling antibodies with potential antigen information
The target of the majority of antibodies in PLAbDab can be inferred from a manual review of their source literature.Although possible, this process can be time-consuming.To reduce the number of entries to review, we provide a 'targets _ mentioned' column that lists common antigens mentioned in the source material.
For patent entries, potential antigens were identified by searching the title, abstract, and claims sections for mentions of commonly recognised antigen names.When a term in the source literature was prefixed with 'anti-', it was also included D 547 in the list of possible antigens.For non-patent entries, only the title was searched for potential targets.
The use of the 'targets _ mentioned' label exhibits high recall: when an antibody's antigen is known, it is often mentioned in the analysed text.However, its accuracy is moderate.Accurate labelling of each entry with antigen information requires manual inspection.Nonetheless, this approach offers a fast and straightforward method for generating an initial reference set.

Searching PLAbDab
To allow users to rapidly search the database for antibodies with a similar sequence, we implemented KA-Search ( 17 ).KAsearch is able to carry out rapid and exhaustive sequence identity searches, allowing users to find similar sequences over the whole variable domain, the CDR loops, the CDR-H3 or a user defined region.The entire database can be queried with KA-Search in under 5 s on 5 CPUs.
Structurally similar antibodies can have similar functions, even if they are distantly related in sequence ( 18 ,19 ).To enable CDR structure-based searching, we model all paired sequences in the database using ABodyBuilder2 ( 23 ).Entries missing more than eight residues at the start of the sequence were restored using AbLang ( 24 ) prior to modelling.Antibody sequences with non-standard residues, or for which ABody-Builder2 was not capable of generating a refined model, are labelled as 'FAILED' under the 'Model' column.Entries with the same sequence were only modelled once, in total 64,000 unique antibody models were generated.
To structurally search PLAbDab, query sequences are first modelled using ABodyBuilder2, without refinement.The refinement step was skipped as in order to boost speed with minimal compromise on backbone model quality.The framework of the predicted structure is then aligned to the structure of all other entries in PLAbDab with the same CDR loop lengths.Lastly, the carbon-alpha (C α ) root-mean squared deviation (RMSD) over all CDR residues is computed and used to rank entries.Searching the database in this way takes around 10 s on 5 CPUs.
Finally, as each sequence in PLAbDab is linked to a patent or publication, we provide users the ability to search fields, such as the title of the study, using regular expressions.

Database statistics
Currently the total number of entries in PLAbDab is around 150 000, over 90% of which are paired with high confidence using methods 1-4, or comes from therapeutic or crystal structure entries.As can be seen in Figure 1 A, the number of antibody sequences that could be collected by the PLAbDab methodology has been steadily growing since the early 2000s, with somewhere between 10 000 and 30 000 new antibody sequences being published each year for the last 5 years.
Figure 1 B shows the distribution of PLAbDab entries by source, around three quarters of entries come from antibody sequences described in patents, while less than 20% are derived from the scientific literature.This may be due to patents following a more standardised way of depositing antibody sequences.The rest of the entries in PLAbDab come either from solved structures in SAbDab ( 12 ), or from additional sequences in Thera-SAbDab ( 2 ).
Patent applicants are not obligated to upload species information about their sequences to NCBI.This means that the majority of entries in PLAbDab lack a species annotation.Unfortunately, computationally predicting the species origin for unannotated patent antibodies in a systematic way is likely to be ineffective due to the extent of non-natural engineering these sequences undergo.However, species information does exist for the majority of the entries sourced from the academic literature.Across the sequences with species information, most entries are labelled as human, with a smaller number of sequences annotated as mouse, macaque or rabbit.(Figure 1 C).
It has been observed that therapeutic antibodies tend to have shorter CDR-H3 loops than those seen in large repertoire studies, an observation that may be due to longer CDR loops leading to issues during therapeutic antibody development ( 25 ,26 ).The average CDR-H3 loop length from antibodies in PLAbDab (around ∼14.0) falls somewhere between the average CDR-H3 loop length of OAS (around ∼15.6) and Thera-SAbDab (around ∼12.9) (Figure 1 D).

Searching PLAbDab
As described in the methods, PLAbDab allows users to search the database using a variety of methods.To demonstrate the sequence and structure search options, we compared the results of searching the database in four ways: • For antibodies with a sequence identity of over 90% over the VH (VH identity).• For antibodies with a sequence identity of over 90% for both the VH and VL (VH+VL identity).• For antibodies with a C α RMSD over the CDR loops of under 1.25Å to an ABodyBuilder2 model of the query (CDR structure).• For antibodies with a C α RMSD over the CDR loops of under 1.25Å to an ABodyBuilder2 model of the query and sequence identity over the CDR loops of over 80% (CDR structure+identity).
We performed the above searches for four antibodies each of which target a different antigen.Antibody one binds programmed cell death protein 1 (PD-1), and was taken from the patent 'Anti-PD-1 antibodies and methods of use thereof.';antibody two binds Respiratory Syncytial Virus (RSV), sourced from the paper 'Rapid profiling of RSV antibody repertoires from the memory B cells of naturally infected adult donors' ( 27 ); antibody three is the therapeutic antibody Samalizumab, which binds the OX2 membrane glycoprotein (CD200); and antibody four is the therapeutic antibody Foralumab, that targets the CD3-epsilon peptide (CD3e).For each of the four queries we took the sequence and built an ABodyBuilder2 model for use in the structure searches.The result of using each search method described above to query PLAbDab with these antibodies is shown in Table 1 .
The PD-1 case study shows the benefits of having the paired heavy and light chain variable domain sequence.While a sequence identity search over the VH finds antibodies binding to the same antigen around 25% (61 out of 258) of the time, including the VL improves the accuracy to around 75% (16 out of 22).The fact that there are many highly identical heavy chain sequences that bind to different antigens suggests that, for this antibody, the light chain may contribute significantly to binding.To validate this, we analysed the crystal structure of an antibody bound to RSV which is returned by all four search methods (PDB ID: 5GGR).According to Arpeggio ( 28 ), the paratope is evenly split between the heavy and light chain, with eight heavy chain and seven light chain residues found to interact with the antigen.Searching PLAbDab for entries similar to our PD-1 target by CDR structure returns antibodies that target the same antigen around 40% of the time (19 out of 46), but many of them are from different studies than those found by sequence search.All the entries returned by a CDR structure plus sequence identity search target the same antigen as our query.
For the RSV binding antibody, there are only two entries in PLAbDab with > 90% sequence identity over the VH.Searching PLAbDab for entries with a similar CDR structure returns 23 unique antibodies which bind the same epitope.This suggests that this antibody may require a very specific CDR loop conformation to bind the RSV Fusion Glycoprotein ( 30 ) at that specific site.Two of the retrieved entries belong to antibodies with resolved crystal structures (PDB IDs: 7LUC and 7LUD), one of which is bound to the antigen.Figure 2 shows the similarity of these structures to the ABodyBuilder2 predicted structure of the RSV binding antibody query.
Only two unique antibody sequences in PLAbDab have a VH sequence identity of over 90% to Samalizumab.A structure search using Samalizumab as a query results in over a thousand antibodies that bind a very diverse set of epitopes.In Figure 3 , we show a selection of antibodies with very similar CDR backbone structures to Samalizumab binding to four very different epitopes on different antigens.This indicates that the backbone structure of the CDR loops is likely not the primary driver of binding specificity for this antibody.However, when adding an 80% sequence identity cutoff over the CDR loops to the structure search, we are left with five unique antibody sequences all of which bind the same antigen.
For the Foralumab case study, only two out of nine sequences with a VH identity of over 90% were found to also bind CD3e.Adding an additional 90% sequence identity filter over the VL reduces the number of non-CD3e binders returned to two.Searching PLAbDab for entries that share a similar CDR structure to Foralumab expands the number of unique sequences retrieved to 34, out of which 14 were found to bind CD3e.The CDR structure plus sequence identity search achieves perfect accuracy, but reduces the number of retrieved sequences to two.These four case studies highlight the utility of PLAbDab's data and various search methods to identify antibodies with similar function to a query.The case studies also highlight that any computational search should be further investigated via both the meta data in PLAbDab and, if possible, experimental means.

Using PLAbDab to generate new datasets
As PLAbDab contains a large number of antibody sequences targeting a diverse set of antigens, it can be used to facilitate the generation of antigen-specific antibody libraries.For example, searching PLAbDab for entries containing the keywords 'ebov' or 'ebola' in their title returns almost 1500 unique antibody sequences from 56 sources.Using the keywords 'hiv' or 'immunodeficiency' finds over 6200 entries with over 3800 unique antibody sequences from over 500 sources, while 'autoimmune' and 'autoantibody' returns over 400 unique antibody sequences from over 90 sources.Not every antibody sequence retrieved from a patent or paper will necessarily bind the searched target, but such prefiltered sets provide a very valuable starting point towards generating an antigen specific library to a target of interest.
To benchmark how effective searching the titles of the sources of PLAbDab entries by keywords is at retrieving relevant antibodies, we manually inspected the results of a search by expression 'hiv|immunodeficiency'.By manually inspecting the source literature from a random set of 100 antibodies from different studies, we found that 88 were true HIV binders.Of the non-HIV binders, four were anti-idiotype antibodies, three bound CD4, three bound other antigens and two were incorrectly paired.Using the search terms 'covid|corona|sars', we inspected another 100 randomly-selected antibodies from different studies.Of these, 98 targeted a one targeted ACE2, and the specificity of the last could not be verified.This highlights how PLAbDab can aid in the creation of antigen-specific antibody datasets, but also shows that manual inspection of the extracted sequences is still required to remove false positives.

Discussion
We present PLAbDab, a large database of paired antibody sequences from patents and papers.Each entry in the database contains a link to its original source material, which relates the paired sequences to useful functional information.PLAbDab can be automatically updated without any manual input, making it straightforward to keep up to date with the latest publications and patents.
To demonstrate the utility of PLAbDab, we explore different search methods for identifying antibodies with similar binding properties to a given antibody of interest.Our results demonstrate that for different antibodies, different types of search yield the best results.In all three cases investigated, CDR structure alongside CDR sequence identity gave the most accurate results, but in most cases, this also missed functionally cognate antibodies.The inclusion of the light chain in any search always improved accuracy, in agreement with recent studies that have highlighted the importance of the light chain residue motifs for antigen binding ( 11 ), and restricted heavy and light chain gene pairs observed among functional antibodies ( 10 ).In one case searching by structure provides a link that is not immediately apparent from sequence search alone.This finding is consistent with recent work which uses structural information for epitope binning ( 18 ,19 ).
One of the main challenges in developing PLAbDab, and the area with most potential for improvement, is the pairing of heavy and light chains based on metadata.Although the strategy described in this paper accurately pairs a large number of antibody sequences, there are still many for which pairing was not possible.The PLAbDab generation code also relies on authors publishing their antibodies to databases such as NCBI ( 20 ) or the PDB ( 31 ), and although many authors do submit their sequences to these databases, there remains a large body of antibody sequence data shared in the text, figures or tables of papers.Nevertheless, PLAbDab already contains over 60k unique annotated antibody sequences providing an invaluable resource to the antibody research community.PLAbDab can also be used to facilitate the generation of antigen specific libraries that could be used to train novel machine learning models or as a starting point for the development of novel therapeutics.

Figure 1 .
Figure 1.Summary statistics for PLAbDab.( A ) Number of new antibody sequences each year and total number of antibody sequences that could be collected by the PLAbDab methodology.( B ) The proportion of data in PLAbDab by type of source.( C ) The proportion of antibodies from each species in non-patent entries.( D ) Comparison of the distribution of CDR-H3 lengths in PLAbDab, Thera-SAbDab ( 2 ) and OAS ( 9 ).

Figure 2 .
Figure 2. Crystal str uct ures of anti-RSV antibodies aligned to the str uct ural model of the query.A cartoon representation of the ABodyBuilder2 model for the query antibody is shown in purple with the crystal str uct ure f or 7LUC sho wn in blue and the cry stal str uct ure for 7LUD shown in orange.The surface of the RSV antigen is coloured based on electrostatics using PyMol ( 29 ) (units in K b T / e c ).

Figure 3 .
Figure 3. Crystal str uct ure of four antibodies with very similar str uct ure that bind very different antigens.The crystal str uct ure of all four aligned antibodies is shown in each figure (4AL8 in purple, 7XY8 in green, 7LA4 in blue, and 7BBJ in orange).On the top left they are shown in complex with the antigen of 4AL8 (Dengue virus DII protein); top right 7BBJ (CD73); bottom left 7XY8 (Emmprin CD147) and bottom right 7LA4 (Integrin α IIb β 3 ).