SH2db, an information system for the SH2 domain

ABSTRACT
SH2 domains are key mediators of phosphotyrosine-based signalling, and therapeutic targets for diverse, mostly oncological, disease indications. They have a highly conserved structure with a central beta sheet that divides the binding surface of the protein into two main pockets, responsible for phosphotyrosine binding (pY pocket) and substrate specificity (pY+3 pocket). In recent years, structural databases have proven to be invaluable resources for the drug discovery community, as they contain highly relevant and up-to-date information on important protein classes. Here, we present SH2db, a comprehensive structural database and webserver for SH2 domain structures. To organize these protein structures efficiently, we introduce (i) a generic residue numbering scheme to enhance the comparability of different SH2 domains, and (ii) a structure-based multiple sequence alignment of all 120 human wild-type SH2 domain sequences and their PDB and AlphaFold structures. The aligned sequences and structures can be searched, browsed and downloaded from the online interface of SH2db (http://sh2db.ttk.hu), with functions to conveniently prepare multiple structures into a Pymol session, and to export simple charts on the contents of the database. Our hope is that SH2db can assist researchers in their day-to-day work by becoming a one-stop shop for SH2 domain related research.


INTRODUCTION
Nucleic Acids Research, 2023, Vol. 51, Web Server issue W543
The Src homology 2 (SH2) domain was one of the first protein-protein interaction (PPI) modules to be discovered (1). By recognizing specific phosphotyrosine (pTyr)-containing peptide motifs, this small (approx. 100 amino acids) protein module acts as the reader unit of pTyr-based signal transduction, an intracellular signaling system that emerged about 600 million years ago, just prior to multicellular organisms (2). In addition to SH2 domains, this signaling system employs protein tyrosine kinases (PTK) as writer units, and pTyr phosphatases (PTP) as eraser units (3). In accordance with the ubiquity of pTyr signaling in eukaryotic cells, it was realized early on that the disruption of this signaling system by protein mutations or pathogens contributes to a range of disease conditions (4,5), with the Src and Grb2 SH2 domains as early examples of therapeutic proteins being targeted by small-molecule and peptide inhibitors (6). From the very beginning, SH2 domains have been regarded as challenging drug targets, due to the shallow binding surface that is characteristic of protein-protein interactions in general (7). Nonetheless, the therapeutic interest in targeting SH2 domains has steadily grown with the abundant discoveries of disease-causing mutations in a wide range of SH2 domains (8-11). In the meantime, a large body of SH2-related computational and experimental know-how has been accumulated (12).
A total of 120 SH2 domains are present in 110 human proteins, ten of which contain dual SH2 domains. Over the years, several approaches have been employed to group these proteins into informative sub-classes (13,14), the most prominent of which collects the SH2 domains into 11 functional categories, as proposed by Liu and colleagues (15). Interestingly, this functional grouping does not correlate directly with the phylogenetic distances of the SH2 domains themselves, although we have recently found that sequence similarities within these functional categories are higher if we account for the general character (polar, aromatic, etc.) of the amino acid sidechains (16). For the past two decades, the main structural features of the SH2 domain have been well understood: its centerpiece is an antiparallel β-sheet (with three strands labelled βB-βD) sandwiched between two α-helices (αA and αB), also referred to as the αβββα motif (10). The central β-sheet divides the binding surface of the SH2 domain into two subpockets, called the phosphate-binding (pY) and specificity (pY+3) pockets. Upon phosphopeptide binding, the β-sheet is perpendicularly bridged by the interacting partner, exposing its phosphotyrosine group against an (almost) invariant arginine residue on the βB strand, while neighbouring sidechains in the C-terminal direction (labelled +1, +2, etc. from the phosphotyrosine residue) are recognized by the specificity pocket (17). The Sheinerman residues, a group of eight residues in the pY pocket (including the critical arginine), are primarily responsible for anchoring the phosphotyrosine group (18), and their mutations are usually detrimental to SH2 domain function (10). A short, conserved sequence of residues within the βB strand defines the so-called 'SH2 signature motif' (19), also known as the FLXRXS or FLVR motif, which includes the critical arginine residue.
Interestingly, there are a small number of proteins (RIN2, TYK2 and SH2D5 in humans) where this arginine is replaced by an aromatic residue: these SH2 domains recognize acidic residues other than pTyr (Glu or Asp) in non-typical binding modes (20). With a fairly robust understanding of the typical functions and structural features of SH2 domains, recent studies were directed to more specific questions, such as posttranslational modifications other than phosphorylation (21), the role of water molecules in phosphopeptide binding (22), development of SH2 superbinders (23), or simultaneous phosphotyrosine binding in a protein with dual SH2 domains (24).
Structural databases boost the productivity of computational medicinal chemists and modelers by offering highly specialized and relevant information on protein families of high interest and therapeutic relevance. A prominent example is GPCRdb, a database of G-protein coupled receptor (GPCR) structures, sequences and ligands, published in its modern form in 2014 (25), maintained and regularly updated by the Gloriam group (https://gpcrdb.org/). GPCRdb contains a range of useful features, including generic residue numbers for the convenient comparison of residue positions in different proteins (26), integration of mutagenesis data (27), annotation of different functional types of ligands (28), or as its latest addition, the incorporation of AlphaFold (29) models of GPCRs (30). Similarly, the Kinase-Ligand Interaction Fingerprints and Structure database (KLIFS) was introduced in 2014 for the convenient mining of the available structural information on kinase inhibitors and their interaction patterns (31), with its functionality expanded multiple times (32). For SH2 domains, such a convenient and up-to-date online resource is still missing: while an earlier database from the Nash and Pawson labs is still available online (https://sites.google.com/site/sh2domain/home), it mostly focused on providing links to major sequence and structure databases (Entrez, UniProt, PDB, etc.), and has not been updated since 2015. We should also point to a few more generic databases that are useful in the research of SH2 domains, including Phospho.ELM for referencing experimentally validated phosphorylation sites (33,34), and Scansite for searching for potential interacting partners of SH2 domains (35).
Here, we outline the development, architecture and main functionalities of SH2db, a database and webserver for SH2 domain sequences and structures. With SH2db, our aim is to provide a convenient starting point to bioinformaticians, computational and medicinal chemists, and practitioners of related fields for any studies where they utilize SH2 domain structures. In particular, we have revised the sequence alignment of human SH2 domains published by Liu et al. (15), introduced a generic residue numbering scheme for the comparability of residue positions in different SH2 domains, and launched a webserver to facilitate quick access to any arbitrary sets of pre-aligned SH2 domain sequences (fasta format) or structures (pdb format or Pymol session). Experimental and theoretical protein structures have been incorporated into SH2db from the PDB (36) and AlphaFold databases (29,37). The SH2db webserver is available at http://sh2db.ttk.hu/, while its source code is shared at https://github.com/keserulab/SH2db.

Data
Protein sequences were retrieved from UniProt (38), experimental structures were downloaded from the Protein Data Bank (PDB) (36,39) and AlphaFold models were gathered from the EMBL-EBI AlphaFold repository (29,40). The PDB files were parsed, renumbered to match the wild-type sequence and non-SH2 domain parts were removed. Structures containing two SH2 domains or the same domains in several chains were split into separate PDB files. In this first release of SH2db, we included only human sequences with their canonical isoform, but built the framework to allow easy incorporation of ortholog sequences and other isoforms in the future.
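The chain-splitting step can be illustrated with a minimal sketch that uses only the Python standard library. It is not the SH2db parsing code: the function name `split_by_chain` is hypothetical, and the sketch assumes standard fixed-column PDB records, where the chain identifier sits in column 22 of ATOM and HETATM lines. Renumbering to the wild-type sequence and trimming of non-SH2 parts are not covered here.

```python
from collections import defaultdict
from typing import Dict


def split_by_chain(pdb_text: str) -> Dict[str, str]:
    """Group ATOM/HETATM records by chain ID (column 22 of the PDB format)."""
    chains = defaultdict(list)
    for line in pdb_text.splitlines():
        # Only coordinate records are kept; index 21 holds the chain ID.
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chains[line[21]].append(line)
    # One PDB-formatted text block per chain, each terminated by END.
    return {
        chain: "\n".join(records) + "\nEND\n"
        for chain, records in chains.items()
    }
```

Each value of the returned dictionary can be written to its own file, so that every chain (and hence every copy of a domain) becomes a separate entry.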

Framework
Figure 1 (caption fragment): In the most common segment layout of SH2 domains, strand bD is followed by two shorter strands bE and bF, then the domain ends with the aB helix, with all of them connected by loops or turns. (B) In the STAT proteins, the bE and bF strands with their flanking loops/turns are missing and instead the aB' helix is present, which connects to bD and aB without any loops or turns.
SH2db uses the python-based Django framework with the PostgreSQL object-relational database system. The hierarchy of the database starts on two parallel top levels, Protein and Structure, which both link to lower levels of objects: Protein-Isoform-Protein domain; Structure-Chain-Structure domain. We store the wild-type protein-related data (species, sequence, protein family, IDs) in the Protein hierarchy and structure-related data in the Structure hierarchy (PDB data, publication, experimental method, resolution, IDs). The two hierarchies are connected on the top level and also on the Protein domain-Structure domain level. This latter connection is the main driver of the online tools as these objects store the SH2 domain units that are listed in several pages. Also, Residue objects are linked to the Protein domain objects, which powers the sequence alignments. Protein segment and Generic number objects are connected to Residue objects. AlphaFold models are linked to wild-type Protein objects.
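To make the object hierarchy more tangible, the following sketch mirrors it with plain Python dataclasses. It is deliberately not written as Django models and is not the SH2db schema: all class and field names are illustrative assumptions, the Structure-Chain levels are collapsed into a single `StructureDomain` record for brevity, and the accession and chain used in the example are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Residue:
    sequence_number: int                   # position in the wild-type sequence
    amino_acid: str                        # one-letter code
    segment: Optional[str] = None          # Protein segment, e.g. "bB"
    generic_number: Optional[str] = None   # Generic number, e.g. "bBx50"


@dataclass
class StructureDomain:
    pdb_id: str    # Structure level
    chain_id: str  # Chain level


@dataclass
class ProteinDomain:
    name: str  # e.g. "N" or "C" for proteins with dual SH2 domains
    residues: List[Residue] = field(default_factory=list)
    # Link between the Protein domain and Structure domain levels.
    structure_domains: List[StructureDomain] = field(default_factory=list)


@dataclass
class Isoform:
    label: str
    domains: List[ProteinDomain] = field(default_factory=list)


@dataclass
class Protein:
    accession: str
    species: str
    isoforms: List[Isoform] = field(default_factory=list)


def structures_for(protein: Protein) -> List[str]:
    """Collect 'PDB_chain' identifiers reachable from a Protein entry."""
    return [
        f"{sd.pdb_id}_{sd.chain_id}"
        for isoform in protein.isoforms
        for domain in isoform.domains
        for sd in domain.structure_domains
    ]


def example_protein() -> Protein:
    """Toy entry; the accession and chain ID are placeholders, not database content."""
    domain = ProteinDomain(
        name="N",
        residues=[Residue(62, "Y", segment="bE", generic_number="bEx48")],
        structure_domains=[StructureDomain("4H1O", "A")],
    )
    return Protein("PLACEHOLDER", "Homo sapiens", [Isoform("canonical", [domain])])
```

Walking from a Protein entry down to its structure domains, as `structures_for` does, corresponds to the Protein domain-Structure domain link that drives the listing pages.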

Generic residue numbering
As was done for GPCRs with the Ballesteros-Weinstein (41) or the GPCRdb generic numbering scheme (26), we aimed to develop generic residue numbers for the SH2 domain to easily perform structure and sequence based comparisons between members of the family. An initial structural superposition was performed on all structures in Schrödinger's Maestro (Schrödinger Release 2022-4: Maestro, Schrödinger, LLC, New York, NY, 2022). Starting out from the multiple sequence alignment of Liu et al. (15), we made local structure-based alignments and adjusted the sequence alignment accordingly. Mainly focusing on the segments with conserved secondary structural characteristics, we were able to determine the most conserved residue positions throughout the human sequences. In each segment with a conserved secondary structural characteristic, the most conserved position was labeled as 'x50', while residues in either direction that belong to this same segment were labeled sequentially. We identified and numbered three α-helices (aA, aB' and aB) and six β-strands (bA, bB, bC, bD, bE and bF). In addition, we assigned two generic numbers to two Sheinerman residues that are located in the bBbC turn. Loops in between the helices and strands were labelled based on the flanking segment labels (e.g. the loop between bA and aA is bAaA). Due to their disordered and flexible nature, we opted not to give generic numbers to the loops, as structure-based comparison is not possible for many of these segments because the corresponding residues do not occupy the same 3D space. Importantly, helix aB' is exclusively found in the SH2 domains of the STAT protein family, and has been referred to as the Evolutionary Active Region (13) in SH2 domains (Figure 1). Based on the sequence alignment from the numbered positions, we created a phylogenetic tree to showcase the evolutionary distances between SH2 domain containing proteins (Figure 2).
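The labelling rule for a single segment can be sketched in a few lines of Python. This is an illustration of the 'x50' convention only, not the SH2db implementation: the function name is hypothetical, and the segment boundaries in the usage note are assumed for the sake of the example.

```python
from typing import Dict, List


def assign_generic_numbers(
    segment: str, residue_numbers: List[int], anchor: int
) -> Dict[int, str]:
    """Label the residues of one segment relative to its most conserved position.

    The anchor residue receives '<segment>x50'; residues before and after it
    in the segment are numbered sequentially downwards and upwards.
    """
    if anchor not in residue_numbers:
        raise ValueError("anchor residue must belong to the segment")
    anchor_index = residue_numbers.index(anchor)
    return {
        resnum: f"{segment}x{50 + (i - anchor_index)}"
        for i, resnum in enumerate(residue_numbers)
    }
```

For instance, with an assumed bB strand spanning residues 616-620 of STAT5B and R618 as the most conserved position, `assign_generic_numbers("bB", [616, 617, 618, 619, 620], 618)` maps residue 618 to bBx50, its predecessor to bBx49 and its successor to bBx51.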

Superposition
After multiple iterations of structural alignment, we found that superposing the backbone atoms of residues from the core β-sheet (comprising β-strands bB, bC and bD) yielded the most reliable overlay for the whole set. All structures and models available on SH2db were superposed based on these residues using the structure of the FER kinase (PDB: 2KK6) as reference, running the 'align' function of Pymol (The PyMOL Molecular Graphics System, Version 1.9.0.0 Schrödinger, LLC). These superposed structures are available through all of the download functions, including an internal script for generating Pymol sessions on-the-fly for download. (On the website, a brief message informs the user about the licensing options of Pymol.)
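A superposition of this kind can be scripted by generating one Pymol 'align' command per structure, restricted to the backbone atoms of the core strands. The sketch below only builds the command strings, so it runs without Pymol installed; it is not the internal SH2db script. The function names are hypothetical, and the residue ranges in the usage note are placeholders rather than the actual bB, bC and bD boundaries of any structure.

```python
from typing import Dict, List, Tuple

Ranges = List[Tuple[int, int]]


def core_selection(obj: str, ranges: Ranges) -> str:
    """Pymol selection string for the backbone atoms of the given residue ranges."""
    resi = "+".join(f"{start}-{end}" for start, end in ranges)
    return f"{obj} and resi {resi} and name N+CA+C+O"


def superposition_commands(reference: str, core: Dict[str, Ranges]) -> List[str]:
    """One 'align mobile, target' command per non-reference object."""
    target = core_selection(reference, core[reference])
    return [
        f"align {core_selection(obj, ranges)}, {target}"
        for obj, ranges in core.items()
        if obj != reference
    ]
```

Called as `superposition_commands("2KK6", {"2KK6": [(10, 15), (20, 25)], "4H1O": [(30, 35)]})`, this yields a single command that fits the selected backbone atoms of 4H1O onto those of the reference. Because residue numbering differs between proteins, the core ranges are supplied per object rather than shared.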

RESULTS
We have engineered a webserver that currently stores 352 PDB and 120 AlphaFold structures of human SH2 domains in a preprocessed and pre-aligned fashion, and provides simple and intuitive interfaces for searching, filtering and downloading arbitrary sets of the underlying data in multiple formats. The webserver and the underlying database were built in the spirit of scalability, implementing a hierarchy of Django data models (and corresponding PostgreSQL database fields) that allow for significant extensions later on, e.g. the addition of SH2 domains from diverse species, providing additional links to external databases, incorporating isoforms, etc.
In its first published version, the SH2db webserver (available at http://sh2db.ttk.hu/) provides access to SH2 domain sequences, structures and models in two main ways (Figure 3). From the Browse page, the user can access a hierarchy of individual database entries, presented on informative summary pages. Protein entries link to their corresponding UniProt page, and feature a sequence viewer showing the canonical sequence and a table that summarizes, and links to, the corresponding structure and model entries, along with core information on experimental/modeling method, resolution, etc. Structure entries link to their respective PDB entry and publication, and feature an interactive sequence viewer with options for downloading, and an interactive NGLviewer panel for quick visualization.
Figure 3 (caption fragment): The Search page provides functionalities to filter the underlying database via an interactive sequence viewer and download arbitrary selections of sequences or structures. The toggle button (1) switches between including structure entries or restricting the table to canonical protein sequences extracted from UniProt. The Domain column (2) lists the PDB IDs or marks the UniProt sequences and AlphaFold models by their UniProt ID, followed by 'N' or 'C' for proteins with dual SH2 domains (or 'N' by default for single-SH2 proteins). The table can be filtered by any combination of fields, including individual amino acid positions (3). Selections can be downloaded as sequences (fasta), structures (pdb) or fed into a backend script to generate a Pymol session (4), which shows the selected structures superposed, and the selected residues highlighted as sticks, for a quick and easy structure comparison.
The Search page offers an alternative route: by starting from a large, interactive sequence viewer, the user can select an arbitrary set of sequences, structures and residues, to be exported into a fasta file, a set of pdb files or, using a backend script, a pre-formatted Pymol session. The Pymol session features the selected structures superposed, and the selected residues saved in named selections and highlighted in stick representation. The AlphaFold models are linked to the wild type sequence of the SH2 domain and labeled <UniProt accession>-AF-<domain type>, e.g. Q13191-AF-N for the AlphaFold model of the N-terminal SH2 domain of CBLB. The miscellaneous pages offer quick visual summaries of the current contents of SH2db (Charts) and explanatory texts on the main features of SH2 domains, and SH2db itself (About and Documentation).
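When processing downloaded files in bulk, the <UniProt accession>-AF-<domain type> labelling convention is straightforward to build and parse. The helpers below are a hypothetical convenience sketch and not part of SH2db; they assume that the domain type is always 'N' or 'C', with 'N' as the default for single-SH2 proteins as described for the Domain column.

```python
from typing import Tuple

DOMAIN_TYPES = ("N", "C")


def af_label(accession: str, domain_type: str = "N") -> str:
    """Build a label such as 'Q13191-AF-N' for an AlphaFold model entry."""
    if domain_type not in DOMAIN_TYPES:
        raise ValueError(f"domain type must be one of {DOMAIN_TYPES}")
    return f"{accession}-AF-{domain_type}"


def parse_af_label(label: str) -> Tuple[str, str]:
    """Split a label back into (accession, domain type)."""
    # Split from the right, so that hyphens inside the accession are preserved.
    accession, tag, domain_type = label.rsplit("-", 2)
    if tag != "AF" or domain_type not in DOMAIN_TYPES:
        raise ValueError(f"not an AlphaFold model label: {label!r}")
    return accession, domain_type
```

For the example given in the text, `parse_af_label("Q13191-AF-N")` returns the accession Q13191 and the domain type N.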
In the following subsections, we aim to demonstrate the key features of SH2db in two short case studies of structural comparisons. In both cases, it takes the user only a few minutes of browsing the database and a few clicks to produce a script-generated Pymol session that provides a convenient starting point for comparing SH2 domain structures. By downloading the pre-aligned pdb structures, the user can submit molecular dynamics simulations, binding site analysis and virtual screening or other modeling jobs for multiple SH2 structures that will be easy to compare upon completion.

Case study 1: effect of the N642H mutation on the peptide binding affinity of the STAT5B SH2 domain
Signal transducers and activators of transcription (STAT) are a family of seven multidomain transcription factors with key roles in intracellular signaling, primarily in the JAK/STAT signaling pathway (42). STAT proteins, especially STAT3 and STAT5B, have been identified as potential pharmaceutical targets in a range of oncological conditions, including various types of leukemias and solid cancers (43-45). STATs are multidomain proteins that can enter the nucleus and initiate gene transcription upon parallel (active-state) dimer formation via their SH2 domains, following phosphorylation at a conserved tyrosine residue (46). In this context, the SH2 domain thus acts as a mediator of dimer formation, by recognizing the tyrosine-phosphorylated, C-terminal tail segment of the opposing STAT monomer. In addition to its importance as a direct pharmaceutical target, the SH2 domain is a hotspot for a variety of oncogenic mutations in STAT3 and STAT5B, which are direct drivers of disease conditions, with their exact structural impact only partially understood as of yet (10).
Recently, the X-ray structures of the STAT5B SH2 domain, as well as its oncogenic N642H mutant, were solved (47). Interestingly, the authors simultaneously identified two distinct conformational states for the mutant SH2 domain: in one of them, the bD strand forms additional hydrogen bonds with the bC strand, as compared to the wild-type structure (Figure 4B; we will refer to this as the 'tight-bD' conformation from here on). The other conformation presents a dissociated bD strand ('loose-bD' from here on), and thereby a greater structural difference from the wild-type SH2 domain (Figure 4C). In addition to solving the crystal structures, the authors determined, via a fluorescence polarization assay (48), that the N642H mutation increases the binding affinity of the fluorescently labeled phosphopeptide GpYLVLDKW (derived from the EPO receptor) by about 7-fold. However, the question remains open whether this increase in phosphopeptide-binding affinity can be attributed to the tight-bD or loose-bD conformation (or both). Here, we have briefly investigated this question by docking the phosphopeptide GpYLVLDKW into the sites defined by the pY and pY+3 pockets of the wild-type and mutant (tight-bD and loose-bD) SH2 domains. SH2db provides easy access to the pre-aligned structures in pdb format, and the pre-assembled Pymol session presents a facile approach for visualizing the structures and binding poses in a unified style and viewpoint, while systematically highlighting the Sheinerman residues (Table 1) that are primarily responsible for phosphotyrosine binding (Figure 4). For docking, we have used the Peptide docking mode of single precision (SP) Glide (49,50), and accepted the best-scored docking pose that presented the characteristic salt bridge between the phosphotyrosine and the anchoring arginine R618 (bBx50).
The binding pose for the phosphopeptide against the tight-bD conformation is overall quite similar to the one against the wild-type SH2 domain, with part of the peptide reaching over the central β-sheet and into the pY+3 pocket. By contrast, in the loose-bD conformation, the bD strand forms a small subpocket with the neighbouring loops that can accommodate the N-terminal end of the phosphopeptide. This difference is also reflected in the superior docking score of this pose (-6.029 vs. -3.152 and -2.188 in the tight-bD and wild-type structures, respectively; lower scores are better). Based on this brief analysis, we can propose the loose-bD conformation to be primarily responsible for the increased phosphopeptide-binding affinity of the STAT5B N642H SH2 domain. The SH2db webserver greatly facilitated this investigation by providing a convenient starting point to the calculation and visualization within a few clicks.
The dataset also provides predictive power in facilitating functional extrapolations of newly/currently identified mutations (such as those identified from tumour-biopsied patient samples) without structural data. For example, the second most frequent mutation in STAT5B (Y665F) represents a drastic change in polarity, and leads to aggressive leukemias.
Viewed in the structural context of the SH2 domain family, this point mutation (and the accompanying loss of the hydroxyl group) reverts the residue to a hydrophobic residue similar to those found at the same position in other SH2 domains that have higher peptide affinity. Understanding the structural impacts of mutations can provide information not only on the phenotype, but also on whether a specific drug candidate could have potential in the relevant cancer model.
Figure 4 (caption fragment): ... (49), and the docking poses of the GpYLVLDKW peptide (green) against these domains (right). In the 'loose-bD' conformation, the dissociated bD strand contributes to the formation of a small subpocket that can accommodate the N-terminal end of the phosphopeptide within the pY pocket.
Figure 5 (caption fragment): Compared to the wild-type N-SH2 domain (red), the oncogenic D61G mutant (blue, PDB structure 4H1O, https://www.rcsb.org/structure/4H1O) lacks the crucial negatively charged sidechain and is thus not able to form the anchoring salt bridge, resulting in a loss of auto-regulation (permanently active state). The C-SH2 domain (green) has no affinity to the PTP domain either, due in part to the bulkier residues of the bDbE loop (E176, L177), and also to the positively charged sidechain (K178) in the bEx48 position, which should be repulsed by the proximal lysine/arginine cluster.

Case study 2: blocking loop of the N-SH2 domain of the SHP2 phosphatase
SH2 domain-containing protein tyrosine phosphatase 2 (SHP2) is a PTP encoded by the PTPN11 gene; it contains two SH2 domains and a protein tyrosine phosphatase (PTP) domain, and has been identified as a pharmaceutical target for a range of oncological indications (51). SHP2 is a prime example of the versatile and diverse utility of SH2 domains, as its phosphatase activity is regulated by an intricate mechanism, where its SH2 domains play central roles, both in conventional and unconventional modes (52,53). Briefly, SHP2 can assume an active and an inactive conformation: in the inactive conformation, the N-terminal SH2 domain closes upon the phosphatase domain, thereby blocking access to its active site. In this atypical regulatory role, the short loop that connects the bD and bE strands of the N-terminal SH2 domain ('blocking loop' or bDbE loop following our nomenclature) inserts into the active site of the phosphatase domain, making it inaccessible for substrates (Figure 5B). To release this autoregulatory lock, the N-terminal and C-terminal SH2 domains can simultaneously bind bis-phosphotyrosyl proteins or peptides like IRS-1 (i.e. the 'conventional' mode of phosphopeptide recognition by SH2 domains), which disrupts the SH2-PTP interaction, al- (54).
There are several oncogenic mutations that circumvent this autoregulatory lock by stabilizing the active (open) conformation, most notably by weakening/abolishing the SH2-PTP interaction (55). Current pharmaceutical strategies targeting SHP2 are aiming at the stabilization of the closed conformation by small molecular ligands that bind to one of the allosteric sites at the interface of the N-SH2, C-SH2 and PTP domains (labelled 'tunnel', 'latch' and 'groove') (56), or by a combination of such ligands (57).
Here, we demonstrate the utility of SH2db in understanding the structural requirements for the SH2-PTP interaction. From previous studies, it is known that the short blocking loop, or bDbE loop, and the tyrosine residue in the first position of the following bE strand (Y62, bEx48) are directly involved in binding to the active site of the PTP domain (56). With the interactive sequence viewer on the Search page, we can quickly observe that there is a PDB structure (4H1O) available for the SHP2 mutant D61G (https://www.rcsb.org/structure/4H1O), where the aspartate residue of the blocking loop is replaced by a glycine (Figure 5A): this is an oncogenic mutation that was identified in multiple disease conditions, including Noonan syndrome and leukemia (58,59). With a few clicks, we can download this structure into a Pymol session, with the crucial residues highlighted. As a basis for comparison, we have also included the N-terminal SH2 domain of the wild-type SHP2 from a recent structure (6BMW), as well as the C-terminal SH2 domain from the same structure (57). Our comparison clearly verifies some of the crucial recognition features of the N-SH2 blocking loop (Figure 5C): for example, removal of the acidic D61 sidechain in the oncogenic D61G mutant stabilizes the open (active) conformation of the enzyme by abolishing the ability of the bDbE loop to form a crucial salt bridge with the R465 sidechain upon the closure of the N-SH2 domain onto the PTP domain. In the meantime, the C-SH2 domain should have poor affinity to the PTP domain, due to a number of structural differences, including bulkier residues in its bDbE loop, as well as its different overall fold. In this scenario, it ultimately took very little effort to find the relevant structures and produce a useful visualization to understand the structural requirements of SHP2 autoregulation.

CONCLUSION AND OUTLOOK
We created an online webserver and database for the SH2 domain containing proteins with a focus on protein sequence and structural data. With the development of the SH2 generic numbering system, we obtained a structure-based alignment for the whole family, enabling easy overall and local comparisons between different protein family members. The webserver offers search and browse options for the stored protein and structure data and highlights mutations in the structures. As shown with our two case studies, using the alignment view and downloadable Pymol session, users can quickly identify and navigate to areas of interest within the SH2 domain.
This utility can be expanded to broader queries including mutational and structural predictions for functional analysis. This would empower drug discovery as well as drug candidate forecasting for uncharacterized SH2 mutations that can arise in different patient cancers/diseases. Moreover, protein engineering or upcoming proteomics approaches that leverage SH2 domains or superbinders for pTyr enrichment can benefit from wider SH2 domain analysis. Additionally, the portfolios of alignments and structural data will allow for deeper analysis in comparative genomics between different species and other biotechnological applications.
We aim to update SH2db every six months with newly published structures and new AlphaFold models. In the future, we plan to expand the database to incorporate species orthologs while simultaneously developing new data-derived tools for the website.

DATA AVAILABILITY
All structural and sequence data are made available via the website http://sh2db.ttk.hu. The source code of SH2db is shared via GitHub at https://github.com/keserulab/SH2db.