ProCarbDB: a database of carbohydrate-binding proteins

ABSTRACT
Carbohydrate-binding proteins play crucial roles across all organisms and viruses. The complexity of carbohydrate structures, together with inconsistencies in how their 3D structures are reported, has led to difficulties in characterizing protein–carbohydrate interfaces. In order to better understand protein–carbohydrate interactions, we have developed an open-access database, ProCarbDB, which, unlike the Protein Data Bank (PDB), clearly distinguishes between the complete carbohydrate ligands and their monomeric units. ProCarbDB is a comprehensive database containing over 5200 3D X-ray crystal structures of protein–carbohydrate complexes. In ProCarbDB, the complete carbohydrate ligands are annotated and all their interactions are displayed. Users can also select any protein residue in the proximity of the ligand to inspect its interactions with the carbohydrate ligand and with other neighbouring protein residues. Where available, additional curated information on the binding affinity of the complex and the effects of mutations on the binding has also been provided in the database. We believe that ProCarbDB will be an invaluable resource for understanding protein–carbohydrate interfaces. The ProCarbDB web server is freely available at http://www.procarbdb.science/procarb.


INTRODUCTION
Carbohydrates are amongst the most versatile classes of ligands, being able to form complex, branched glycans from monosaccharide units. This generates a complex structural pattern, commonly referred to as the glycocode, which carbohydrate-binding proteins are able to decipher (1). These proteins are known to play important roles in many cellular processes, including embryogenesis (2), immune response (3), protein trafficking (4), bacterial-toxin uptake (5) and viral infection (6). However, protein-carbohydrate interfaces are not well characterized, which is partly a consequence of the absence of a standardized nomenclature for sugars. Moreover, identifying sugar moieties in the Protein Data Bank (PDB) (7) is challenging, as some of the carbohydrate entries are poorly annotated (8). This is in part due to the large number of naturally occurring monosaccharides, but also due to the multiple ways saccharide units may be linked and the complex branching capacity of polysaccharides.
In the present PDB format, the distinction between the carbohydrate ligand and its saccharide units is not trivial. Hence, interactions cannot be computed without using protein structure visualization software such as PyMOL (9) and Chimera (10). This has hindered efforts to characterize systematically, and to understand, the underlying molecular features of protein-carbohydrate interfaces. Another limitation of current online resources that attempt to decipher the 3D architecture of carbohydrate ligands, such as pdb-care (11), is that they do not differentiate between covalently bound carbohydrates (post-translational modifications), crystallographic errors (broken ligands) and true, complete ligands.
Owing to these limitations, it is non-trivial to incorporate relevant biological information on protein-carbohydrate complexes (such as biophysical measurements, interface interactions, the structure of the ligand and mutagenesis analysis) into databases. Protein-carbohydrate complexes are poorly represented in databases such as Platinum (12) (5.4%), PDBbind (13) (6%) and MOAD (14) (8%), which collect ligand-binding affinity data for proteins. This is due to experimental difficulties encountered while working with carbohydrates, including their low affinity values but high ligand specificity, and their being part of more complex biological molecules, such as gangliosides, which contain functional groups other than sugars (15)(16)(17). Furthermore, none of the above-mentioned repositories provides information on protein-carbohydrate interfaces. The scarcity of available protein-carbohydrate datasets, some of which do not distinguish between the whole ligand and its units, has limited the applicability and accuracy of methods developed to investigate protein-carbohydrate interactions (18)(19)(20).
Nucleic Acids Research, 2020, Vol. 48, Database issue D369
Recently, there have been efforts to create highly curated and specific structural repositories for glycan-binding proteins. UniLectin3D (21) hosts experimentally solved structures of lectins across all kingdoms (including viruses), generating both SNFG (Symbol Nomenclature for Glycans) (22) depictions and IUPAC (International Union of Pure and Applied Chemistry) (23) notations. Carbohydrate-active enzymes are extensively covered in the CAZy (Carbohydrate-Active enZYmes) database (24), which has recently mapped 3D structures from the PDB to its enzyme nomenclature, identifying over 100 types of carbohydrate-like molecules as biologically relevant ligands. Another useful online resource for glycan structures and motifs is GlyTouCan (25), which hosts over 100 000 structures and identifies 800 monosaccharides.
Resources combining structural information with prediction tools, mass spectrometry and NMR data have also been developed in recent years: ProGlycProt V2.0 (26), for prokaryotic glycoproteins and glycosyltransferases; the Carbohydrate Structure Database (27), for bacteria, archaea, fungi and plants; and Glyco3D (28), for a general overview of glycan-binding proteins ranging from glycosaminoglycan-binding proteins to antibodies.
Here we describe ProCarbDB, a freely accessible, user-friendly database that comprises 5242 true protein-carbohydrate complexes. For a given PDB entry, ProCarbDB correctly annotates and displays the complete carbohydrate ligand present, the ligand interactions and binding affinities (where available), and the effects of experimentally validated mutations on the binding affinity. We believe that ProCarbDB will be an invaluable resource for understanding the features of protein-carbohydrate interfaces and their recognition patterns. It will also facilitate the development of structure-based machine-learning algorithms that can be trained to predict the binding affinity between a putative carbohydrate-binding protein and its saccharide ligand.

Data acquisition and inclusion criteria
An exhaustive list of PDB ligands classified as carbohydrates was obtained using a stand-alone copy of pdb-care (11) and manually curating the results. We obtained a list of 900 carbohydrate PDB Ligand IDs. We retrieved around 13 000 X-ray crystal structures containing at least one saccharide moiety (for the complete pipeline flowchart see Supplementary Figure S1). In comparison, PDB annotates <600 molecules as saccharides.
Using a graph-based approach, we filtered out the possible true negatives: (ii) structures where no sugar ligand was in the proximity of a protein chain (at least one atom of the ligand has to be 4 Å or closer to any heavy atom of a protein residue); (iii) structures where no protein chain was longer than 30 amino acids (Supplementary Figure S1); (iv) structures that contained only crystallographic adjuvants (e.g. β-octylglucoside), identified using a semi-automatic text-mining algorithm based on cross-references between well-established databases such as UniProt (29), the PDB (7) and the ENZYME database (30).
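The two geometric criteria above (ligand proximity and minimum chain length) can be illustrated with a minimal sketch. It is not the pipeline used to build ProCarbDB: the data layout (plain dictionaries of coordinates), the function names and the way the two criteria are combined are assumptions made for illustration only; the 4 Å cutoff and the 30-residue threshold are taken from the text.

```python
import math

CONTACT_CUTOFF = 4.0   # Å, contact distance stated in the text
MIN_CHAIN_LENGTH = 30  # a chain must be longer than this many residues


def has_contact(ligand_atoms, protein_atoms, cutoff=CONTACT_CUTOFF):
    """True if any ligand atom is within `cutoff` Å of a protein heavy atom.

    ligand_atoms:  list of (x, y, z) tuples
    protein_atoms: list of (element, (x, y, z)) tuples
    """
    for lig_xyz in ligand_atoms:
        for element, prot_xyz in protein_atoms:
            if element == "H":  # only heavy atoms count as contacts
                continue
            if math.dist(lig_xyz, prot_xyz) <= cutoff:
                return True
    return False


def keep_structure(chains, ligand_atoms):
    """Apply the proximity and chain-length criteria to one structure.

    chains: {chain_id: {"length": int, "atoms": [(element, (x, y, z)), ...]}}
    (hypothetical layout chosen for this sketch)
    """
    has_long_chain = any(c["length"] > MIN_CHAIN_LENGTH for c in chains.values())
    in_contact = any(has_contact(ligand_atoms, c["atoms"]) for c in chains.values())
    return has_long_chain and in_contact
```

In this sketch a structure is kept only if it has at least one chain longer than 30 residues and at least one ligand atom in contact with a protein heavy atom; whether the contact must involve the long chain itself is not specified in the text and is left open here.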
As a result of this filtering approach, we obtained 5242 protein-carbohydrate complexes. It is important to note that several amphipathic molecules (BOG, DA8, DEG, KGM etc.), which are usually used as, or are very similar to, detergents, are actually true biological ligands in a number of entries, such as 1UWF and 2G3N.

Ligand sanitization
Using the above-mentioned graph-based approach and the CONECT records of the PDB file, we first checked the integrity of the ligands by determining the saccharide units that constitute the whole ligand. Next, we calculated distances from terminal atoms of the ligands (i.e. atoms that have only one covalent bond) to all other atoms. For some entries, the distance was within the range expected for a covalent bond, but the bond was not listed in the CONECT records. This resulted from either: (i) overlapping of residues due to the presence of stereoisomers in the crystallization solution (e.g. PDB ID: 5MTU) or (ii) broken ligands (e.g. PDB ID: 5TPC). To resolve the former issue, we used the occupancy values recorded in the PDB structure: if the occupancies of the two units sum to 1, the units are considered to be overlapping. To resolve the latter issue, before generating a new bond we ensured that no superposed atoms were present and that valence rules were maintained. By using these methods, we were able to identify not only pure carbohydrate ligands but also glycoconjugates, such as PDB ID: 2JDH.
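The checks described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code behind ProCarbDB: the bond-length window (1.2 to 1.7 Å), the superposition threshold (0.5 Å), the simplified valence table and all function names are hypothetical choices, and hydrogens are assumed to be absent, as is usual for X-ray structures.

```python
import math

BOND_MIN, BOND_MAX = 1.2, 1.7            # Å; assumed window for C-C / C-O single bonds
SUPERPOSED_MAX = 0.5                     # Å; assumed threshold for superposed atoms
MAX_VALENCE = {"C": 4, "N": 3, "O": 2}   # simplified heavy-atom valence limits


def degrees(atoms, bonds):
    """Number of recorded covalent bonds (bond-record entries) per atom serial."""
    deg = {serial: 0 for serial in atoms}
    for bond in bonds:
        for serial in bond:
            deg[serial] += 1
    return deg


def units_overlap(occupancy_a, occupancy_b, tol=0.01):
    """Two units whose occupancies sum to 1 are treated as overlapping."""
    return abs(occupancy_a + occupancy_b - 1.0) <= tol


def has_superposed_atoms(atoms):
    """True if any two atoms sit (almost) on top of each other."""
    serials = list(atoms)
    return any(
        math.dist(atoms[a][2], atoms[b][2]) < SUPERPOSED_MAX
        for i, a in enumerate(serials)
        for b in serials[i + 1:]
    )


def missing_bonds(atoms, bonds):
    """Propose bonds between terminal atoms and atoms of other units.

    atoms: {serial: (element, unit, (x, y, z))}
    bonds: set of frozenset({serial_a, serial_b}) taken from the bond records
    """
    if has_superposed_atoms(atoms):
        return []  # resolve overlapping units first
    deg = degrees(atoms, bonds)
    found = set()
    for t in (s for s, d in deg.items() if d == 1):  # terminal atoms
        for other in atoms:
            pair = frozenset((t, other))
            if other == t or pair in bonds or atoms[other][1] == atoms[t][1]:
                continue
            within = BOND_MIN <= math.dist(atoms[t][2], atoms[other][2]) <= BOND_MAX
            room = all(deg[s] < MAX_VALENCE.get(atoms[s][0], 4) for s in pair)
            if within and room:
                found.add(pair)
    return sorted(tuple(sorted(p)) for p in found)
```

For a broken disaccharide in which the O4 atom of one unit lies about 1.4 Å from the C1 atom of the next unit, `missing_bonds` proposes that single glycosidic bond, provided neither atom would exceed its valence limit.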
The ligands are presented in a table along with their 3D representation, in which PDB Ligand IDs are coloured according to the SNFG nomenclature (22) (Supplementary Figure S2). Furthermore, we also generate IUPAC or LINUCS (Linear Notation for Unique description of Carbohydrate Sequences) (31) notations where possible.

External resources
We mapped these crystal structures to biophysical measurements using two available databases: PDBbind (13) and MOAD (14). Using a series of text-mining and request functions, we were able to link 967 protein-carbohydrate complexes with an affinity value. Furthermore, using a combination of APIs from the PDB and UniProt, we are able to provide users with direct mappings to other well-established resources such as UniProt, Pfam (32) and Enzyme Commission numbers. In addition, curated mutagenesis information for the protein-carbohydrate complexes present in the database is being continuously added manually.

Database architecture and web interface
The database architecture (Supplementary Figure S3) was implemented using the SQLAlchemy module for Python (version 2.7.1). All data are stored in a PostgreSQL server. For web connectivity, the Flask Python module (version 1.0.2) was used.
The website is written in HTML5 using CSS, JavaScript and jQuery as well as the Bootstrap (version 4) framework. The Jinja2 templating language for Python was used to dynamically generate HTML templates. All 3D rendering is done using NGL (33).
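To make the idea of linking affinity values to individual ligands (rather than to whole structures) concrete, the following sketch builds a toy relational schema. It is only an illustration: it uses the standard-library sqlite3 module as a stand-in for the PostgreSQL/SQLAlchemy stack described above, and every table name, column name and inserted value is a hypothetical placeholder rather than the actual ProCarbDB schema (which is shown in Supplementary Figure S3).

```python
import sqlite3

# Hypothetical three-table layout: entry -> ligand -> affinity
SCHEMA = """
CREATE TABLE entry (
    pdb_id     TEXT PRIMARY KEY,
    organism   TEXT,
    uniprot_id TEXT
);
CREATE TABLE ligand (
    ligand_id INTEGER PRIMARY KEY,
    pdb_id    TEXT NOT NULL REFERENCES entry(pdb_id),
    monomers  TEXT NOT NULL
);
CREATE TABLE affinity (
    ligand_id INTEGER NOT NULL REFERENCES ligand(ligand_id),
    source    TEXT NOT NULL,
    kd_molar  REAL NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database with one placeholder entry."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    # 'XXXX', 'P00000' and the values below are placeholders, not real data
    con.execute("INSERT INTO entry VALUES ('XXXX', 'placeholder organism', 'P00000')")
    con.execute("INSERT INTO ligand VALUES (1, 'XXXX', 'GAL-GLC')")
    con.execute("INSERT INTO affinity VALUES (1, 'MOAD', 1e-4)")
    con.commit()
    return con


def ligands_with_affinity(con, pdb_id):
    """Return (monomers, source, kd_molar) rows for every ligand of an entry."""
    rows = con.execute(
        "SELECT l.monomers, a.source, a.kd_molar "
        "FROM ligand AS l JOIN affinity AS a ON a.ligand_id = l.ligand_id "
        "WHERE l.pdb_id = ? ORDER BY a.source",
        (pdb_id,),
    )
    return rows.fetchall()
```

Keeping affinities in a table keyed by ligand rather than by PDB entry is one simple way to represent entries that carry several ligands, each with its own measurement.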

Web interface
The access point for documentation, resources, data and visualization methods is http://www.procarbdb.science/procarb/. The documentation can be accessed using the 'Help' page from the navigation tab (Supplementary Figure S4). Links to specific sections of the 'Help' page are also provided based on the user's current location on the website.
In order to access the data, a query/search has to be performed. This can be done either by selecting the 'Query' page from the navigation bar or by clicking the 'Submit Query' button present on the 'Home' page. On the 'Query' page, the user has nine different options to search the database (Figure 1A and Supplementary Table S1). We provide on-page guidelines in the form of grey question-mark tooltips. Since most users might be unaware of specific IDs, and are more commonly interested in searching for relevant terms or keywords, we implemented a pattern-matching algorithm that allows users to enter full keywords (lectin) or partial keywords (lec*) in some of the query fields (UniProt, Pfam, Enzyme Commission, Organism and Monomer). For example, a keyword query for 'influenza' in the organism query field will retrieve 179 entries for several different strains in one simple query.
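The keyword search described above can be approximated with shell-style wildcard matching. The sketch below is an assumption about how such matching might behave, not the algorithm implemented on the server: it lowercases the query, splits the field value into words and matches each word with the standard-library fnmatch module. The function names and the example records are hypothetical.

```python
import fnmatch
import re


def matches(query, field_value):
    """Case-insensitive match of a full or wildcard keyword against any word."""
    pattern = query.lower()
    words = re.findall(r"[a-z0-9]+", field_value.lower())
    return any(fnmatch.fnmatchcase(word, pattern) for word in words)


def search(entries, field, query):
    """Return the entries whose `field` contains a word matching `query`."""
    return [entry for entry in entries if matches(query, entry.get(field, ""))]


# Hypothetical records, for illustration only
demo_entries = [
    {"id": "entry-1", "organism": "Influenza A virus"},
    {"id": "entry-2", "organism": "Influenza B virus"},
    {"id": "entry-3", "organism": "Homo sapiens"},
]
influenza_hits = search(demo_entries, "organism", "influenza")
```

With this word-level matching, `lec*` matches 'lectin' and 'lectins', whereas a pattern such as `*lec*` would be needed to also match 'galectin'.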

Results display
After a query has been submitted by pressing the appropriate 'Submit' button, the user is redirected to the 'Multiple Results' page if the submitted query returned more than one result (Figure 1B). On the 'Multiple Results' page, a summary of the available data is displayed for each entry obtained as a search result. This includes details such as the PDB ID, PubMed ID, UniProt ID, organism name, Pfam ID, Enzyme Classification, PDB Ligand ID(s), name of PDB Ligand ID and availability of affinity values. The query input will be displayed in red (if possible) on the 'Multiple Results' page to enable users to easily identify the matched term. Each column of the 'Multiple Results' table can be filtered by using the 'Search' fields under the headers. Furthermore, the user can download the summary table in .tsv (tab-delimited file) format by selecting the 'Get TSV' button (Figure 1B).
For each individual entry, a detailed description is provided in three tabs (namely, 'General Information', 'Ligand Information' and 'Mutant Information'), which are described below. Direct links to the 'Help' page and to the 3D interactive windows are available on each tab. The website generates intuitive and consistent URLs; hence, users can also bookmark the search pages for easy access.
General information tab. Users can click on the PDB ID of an entry obtained as a search result (Figure 1B) and will be directed to the 'General Information' page by default (Supplementary Figure S5). This page is further divided into three sections: (i) information about the crystal structure, (ii) mappings to Pfam domain annotations and UniProt IDs and (iii) an interactive window where the user can inspect different features of the protein-carbohydrate complex, including geometric quality, hydrophobicity and B-factors, using informative colour schemes. Users are also able to visually inspect the Pfam-annotated domains directly on top of the PDB structure by selecting the Pfam colouring scheme, which allows them to identify binding and interface domains.
Ligand information tab. The 'Ligand Information' tab (Supplementary Figure S6) can be accessed by selecting the appropriate field from the navigation tab. This page is divided into two sections: (i) ligand information with available biophysical measurements and a 3D representation for each ligand and (ii) an interactive window where the user can inspect the protein-ligand interface. The first section aims to map individual ligands, rather than whole structures, to affinity values from established databases. The ligand table is user-responsive and linked to the 3D representation window. By selecting the ligand of interest in the table, the 3D representation changes to the selected ligand. Furthermore, all monomer-colouring schemes are conserved and distinct for each monomer throughout the page.
We also provide dedicated 3D representations for all ligands available in a ProCarbDB entry. Here the user can inspect the spatial arrangement of a carbohydrate ligand and its glycosidic bond order without the added complexity of viewing the entire protein-carbohydrate complex.
Mutant information tab. The 'Mutant Information' tab (Supplementary Figure S7) presents mutagenesis information that has been manually curated. These data will be continually updated as part of ongoing curation efforts. The tab is divided into two sections: (i) a table of available mutations and (ii) an interactive window where the user can inspect the positions of the mutants in the 3D structure of the complex as well as the interactions between the ligand and the wild-type residues. We aim not only to map mutagenesis data from the literature but also to identify mutant structures present in ProCarbDB. For example, both 4BLN and 4BLK are PDB IDs present in ProCarbDB. The first structure is identified as wild-type while the second is a K176L mutant. By selecting the corresponding field in the 'Is mutant in ProCarbDB' column, users can directly inspect that structure. Supplementary Table S2 summarizes all the available data as well as the pages where they can be accessed.

3D interactive windows
3D rendering of macromolecules is imperative for understanding their biological function. Based on our curated data, we are able to calculate and display features of the entire structure such as hydrophobicity, secondary structure and Pfam domains. We are also able to map the interface formed by the protein and the complete ligand (Figure 2A). Furthermore, users can analyse the binding pocket in depth by selecting, from the 'For Mutagenesis' panel (Figure 2B), any residue of interest 4 Å or closer to the ligand. For ProCarbDB entries that are linked with mutation data, we provide a 3D spatial representation of those mutations. In order to maintain consistency and reproducibility, we aimed to keep colouring schemes and definitions as implemented in the PDB.

Binding affinities
We annotated the complexes present in ProCarbDB with experimentally determined binding data by using established databases, namely MOAD and PDBbind. We retrieved 756 affinity values from MOAD (14) and 626 from PDBbind (13), with an overlap of 415 entries, ultimately generating a collection of 967 complexes (756 + 626 − 415) with experimentally measured binding affinities. We also compared the values for complexes with affinities reported in both databases and found that ∼9% of the values do not match. As an example, PDB ID: 5TPC has a Kd value of 0.3 mM according to MOAD and a Kd value of 1 mM according to PDBbind. Furthermore, there are many inconsistencies in matching the correct ligand to its affinity value. For example, PDB ID: 4D4U has four different affinity values, two of which are for the same ligand in MOAD. This might be in part due to the fact that the authors of the structure could not fully identify the complete ligand (LewisY tetrasaccharide) in all the binding pockets.
[Figure 2 caption, partially recovered: (A) Surface opacity is set to 60%, and the surface is coloured by hydrophobicity (green for hydrophobic residues to red for hydrophilic residues). (B) In-depth analysis of the binding pocket. ARG144 is selected and displayed in dark orange; protein residues are depicted in light grey. At the top, the same ligand as in (A) is shown, with the same colouring scheme. Only interactions between ARG144 and any other molecule are displayed. Surface opacity is 0%.]
An example where the ligand is not properly identified is 4X0Z; PDBbind reports a ligand formed by four monosaccharides while the actual ligand is GM1 ganglioside, which contains five monosaccharides. These small inaccuracies in publicly available repositories are likely to have major downstream effects on algorithms that use their datasets as training sets. For this reason, we tried to resolve these inconsistencies, or at least to flag them and make them visible to the user in ProCarbDB.
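The merging and cross-checking of the two affinity sources can be sketched as a union of two look-up tables with a mismatch flag. This is a minimal illustration under assumptions, not the procedure used for ProCarbDB: the function name, the 5% relative tolerance and the placeholder identifiers 'AAAA' and 'BBBB' are hypothetical; only the 5TPC values and the reported counts come from the text.

```python
import math


def merge_affinities(moad, pdbbind, rel_tol=0.05):
    """Merge two {pdb_id: Kd in molar} tables and flag disagreeing entries."""
    merged, flagged = {}, []
    for pdb_id in sorted(set(moad) | set(pdbbind)):
        values = {
            source: table[pdb_id]
            for source, table in (("MOAD", moad), ("PDBbind", pdbbind))
            if pdb_id in table
        }
        merged[pdb_id] = values
        both = len(values) == 2
        if both and not math.isclose(values["MOAD"], values["PDBbind"], rel_tol=rel_tol):
            flagged.append(pdb_id)  # same complex, different reported Kd
    return merged, flagged


# 5TPC values are quoted in the text; 'AAAA' and 'BBBB' are placeholders
moad_demo = {"5TPC": 0.3e-3, "AAAA": 2.0e-6}
pdbbind_demo = {"5TPC": 1.0e-3, "BBBB": 5.0e-5}
merged_demo, flagged_demo = merge_affinities(moad_demo, pdbbind_demo)

# Size of the union of the two sources, from the counts reported above
total_with_affinity = 756 + 626 - 415
```

Here `flagged_demo` contains only 5TPC, and `total_with_affinity` reproduces the 967 complexes reported above by inclusion-exclusion.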

Data statistics
Based on protein partner. We mapped ProCarbDB entries to their kingdom (taxonomy) and identified bacteria (46.3%) as the most dominant, followed by eukaryota (43.2%), viruses (8.8%) and archaea (1.7%) (Figure 3A). Next, we divided the UniProt IDs based on kingdom and counted the number of entries each UniProt ID has in ProCarbDB (Figure 3B). Most UniProt IDs in ProCarbDB (82%) are present in three or fewer entries. This shows that the data in ProCarbDB are diverse with respect to the UniProt ID distribution. However, it is clear that UniProt IDs from bacteria and eukaryota are dominant in ProCarbDB. The most frequent UniProt ID present in ProCarbDB is 'P16442', corresponding to histo-blood group ABO system transferase (eukaryota), with 78 entries, followed by 'P00636', corresponding to fructose-1,6-bisphosphatase 1 (eukaryota), with 52 entries (Supplementary Table S3).
To further investigate the redundancy of sequences present in ProCarbDB, we used the CD-HIT (34) software, which clusters sequences based on identity, and found that, for a total of 5242 ProCarbDB sequences, CD-HIT identifies 2018 distinct clusters at 90% sequence identity and 1805 distinct clusters at 70% sequence identity.
The complete ligands, comprising one or more of the above-mentioned monomers, were separated into two classes: saccharide ligands (827, 58.5%) and glycoconjugate ligands (587, 41.5%). We observed that most protein-ligand complexes in ProCarbDB comprised only saccharide moieties (3911/5242), while the rest contain glycoconjugates (1426/5242). There is an overlap of 85 entries that are in both ligand classes due to entries having multiple ligands present in the PDB. In order to ensure that ligand data are also diverse, we counted the number of ProCarbDB entries for each monomer (Figure 3C). Most monomers (73%) in ProCarbDB are present in three or fewer entries. The most frequent monomers, based on RCSB PDB nomenclature, are GAL, denoting β-D-galactose, with 818 entries
[Figure 3 caption, partially recovered: If a UniProt ID is present twice in the same PDB structure, it is counted only once in order to normalize the data for homo-oligomers, hetero-oligomers and asymmetric unit protomer duplication. (C) Monomer frequency in ProCarbDB. If a monomer is present twice in the same PDB structure, it is counted only once.]
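The counting rule used for these statistics (an identifier that occurs several times in one PDB structure is counted once for that structure) can be sketched as follows. The function names and the toy identifiers are hypothetical and serve only to illustrate the rule; they are not taken from ProCarbDB.

```python
from collections import Counter


def entries_per_identifier(entries):
    """Count, for each identifier, the number of entries that contain it.

    entries: {pdb_id: iterable of identifiers (UniProt IDs or monomer IDs)}
    Repeated identifiers within one entry are counted once.
    """
    counts = Counter()
    for identifiers in entries.values():
        counts.update(set(identifiers))  # de-duplicate within a structure
    return counts


def fraction_at_most(counts, n=3):
    """Fraction of identifiers present in `n` or fewer entries."""
    return sum(1 for c in counts.values() if c <= n) / len(counts)


# Toy input: a homodimer (E1) lists the same placeholder ID twice
demo = {
    "E1": ["U1", "U1"],
    "E2": ["U1", "U2"],
    "E3": ["U1"],
    "E4": ["U1", "U3"],
}
demo_counts = entries_per_identifier(demo)
```

In the toy input, the identifier U1 is counted in four entries even though it appears five times in total, which is the normalization for oligomers described in the Figure 3 caption.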

DISCUSSION
While analysis of experimental structures can provide powerful insights into understanding protein function and mechanism of action, this has not been exploited to its full potential for protein-carbohydrate complexes. Carbohydrates are one of the most complex classes of biomolecules from both structural and functional points of view. Thus, the characterization of recognition patterns for carbohydrate-binding proteins is challenging. A repository of high-quality structural and functional data, including the full carbohydrate ligand structures, removing covalently bound structures (post-translational modifications) and displaying the crystal complex in an interactive way will facilitate advancement of the field.
To our knowledge, ProCarbDB is the first repository that is able to retrieve complete ligands via simple queries. We generate and display, in a user-friendly way, not only the interactions between the ligand and its environment, but also the non-allosteric interactions that might be responsible for the binding. The user is able to access 3D interactive windows in a standardized fashion, based on PDB architecture, in order to compare results.
Furthermore, we also attributed functional information, in the form of biophysical measurements. To date, we have linked 18.4% (967) of ProCarbDB entries with at least one experimentally measured binding affinity. We identified and corrected, to the best of our capability, several underdocumented issues with currently available databases such as incorrect affinity values and ligands wrongly identified as biologically active. To provide a complete panel of information, we mapped each entry to UniProt, Pfam and NCBI databases. Current efforts are directed towards gathering further mutagenesis information using manual curation, which could not be directly obtained from the external databases.
We believe that ProCarbDB will have a significant impact on the field. Firstly, experimental scientists studying protein-carbohydrate complexes will be able to query ProCarbDB to check whether the protein: (i) has been previously characterized biophysically; (ii) has identified homologs or (iii) has known ligands, in which case they can inspect in depth the protein-carbohydrate interfaces. Secondly, computational scientists will have a comprehensive and refined set of coordinates defining the structures of protein-carbohydrate interfaces as well as a benchmark dataset to train machine-learning algorithms.
ProCarbDB will be an invaluable resource for the understanding and modification of carbohydrate-binding sites and will facilitate the development of new computational tools to analyse these interactions and develop prediction algorithms.