STITCH 5: augmenting protein–chemical interaction networks with tissue and affinity data

Interactions between proteins and small molecules are an integral part of biological processes in living organisms. Information on these interactions is dispersed over many databases, texts and prediction methods, which makes it difficult to get a comprehensive overview of the available evidence. To address this, we have developed STITCH (‘Search Tool for Interacting Chemicals’) that integrates these disparate data sources for 430 000 chemicals into a single, easy-to-use resource. In addition to the increased scope of the database, we have implemented a new network view that gives the user the ability to view binding affinities of chemicals in the interaction network. This enables the user to get a quick overview of the potential effects of the chemical on its interaction partners. For each organism, STITCH provides a global network; however, not all proteins have the same pattern of spatial expression. Therefore, only a certain subset of interactions can occur simultaneously. In the new, fifth release of STITCH, we have implemented functionality to filter out the proteins and chemicals not associated with a given tissue. The STITCH database can be downloaded in full, accessed programmatically via an extensive API, or searched via a redesigned web interface at http://stitch.embl.de.


INTRODUCTION
The role of small molecules in biological systems can be understood only in relation to the function of the targeted biomolecules, which, in turn, is largely defined by their interaction partners (1-3). The role of the interaction network is even more prominent in the area of drug development, since diseases are often a consequence of multiple changes in the same pathway or protein complex (4,5). Taking into account the neighborhood of the targeted proteins and the topology of the network itself can lead to a better understanding of a drug's cellular impact (6,7). Furthermore, as only a subset of all proteins are viable drug targets (8), most therapeutics target proteins in the network vicinity of more prospective, but undruggable, proteins (7). Several databases provide proteome-wide protein-chemical interactions (9-11) and several others (12-14) put protein-chemical interactions in the context of protein-protein interaction networks, which is essential for effective in silico drug discovery.
A drug's impact on the organism and its efficacy depend on its engagement with the targeted proteins and the extent to which it disrupts the protein-protein and protein-chemical interaction network (7,15). This is related to the concentration of the drug, the strength with which it modulates the activity of the target, and the distribution of target proteins among different tissues (16). To enable users to rationally select possible drug targets, we have added two new features to STITCH: a new mode that allows users to show known binding affinities between proteins and chemicals, and the ability to filter the network to show only proteins related to a selected tissue. STITCH, in its fifth release, shares protein space with STRING v10 (17) and now encompasses more than 9 600 000 proteins from 2031 eukaryotic and prokaryotic genomes. Also, its chemical space grew by a quarter compared to the previous version (18), from 340 000 to 430 000 compounds (not including different stereoisomers). STITCH is available through a new, redesigned web interface at http://stitch.embl.de and via an extensive API that allows programmatic access, including the ability to disambiguate queries, modify all network parameters and generate images. In order to enable large-scale analyses, which may not be feasible through the web interface or API, the precomputed network and the supplementary information are freely available for download.

SOURCES OF INTERACTIONS
Although there is a plethora of data available from which protein-chemical networks could be derived, their dispersed nature and their differences in precision, name-space and focus make it cumbersome to assemble a full picture of all available knowledge. The STITCH pipeline aggregates high-throughput experimental data, manually curated datasets and the results of several prediction methods into a single global network of protein-protein and protein-chemical interactions. This shields the user from the heterogeneity of the underlying data, yet, at the same time, keeps all the primary evidence for each interaction readily accessible.
A large part of the known interactions comes from manually curated datasets such as DrugBank (19), GPCR-ligand database (GLIDA) (20), Matador (21), the Therapeutic Targets Database (TTD) (22) and the Comparative Toxicogenomics Database (CTD) (23), and several pathway databases including the Kyoto Encyclopedia of Genes and Genomes (KEGG) (12), NCI/Nature Pathway Interaction Database (24), Reactome (25) and BioCyc (26). As there can be overlap between different manually curated datasets, we do not consider multiple reports of identical interactions as being independent from each other. Instead, we count redundant interactions only once and do not increase the confidence level. Other large sources of protein-chemical links are the datasets of experimentally validated interactions, which include ChEMBL (27), PDSP Ki Database (28), Protein Data Bank (PDB) (29) and two high-throughput kinase-ligand interaction studies (30,31). Also in this case, interactions may be reported in different databases and with different binding affinities. To compute the final confidence score, we only take the strongest reported affinity into account.
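The two merging rules described above can be illustrated with a minimal sketch. It is not the STITCH pipeline itself: the record layout, the function names `merge_curated` and `strongest_affinity`, and all affinity values are hypothetical and serve only to show the logic (duplicate curated reports count once; among several measurements, the lowest value, i.e. the strongest affinity, is kept).

```python
def merge_curated(records):
    """Collapse curated reports so each (chemical, protein) pair counts once.

    `records` is an iterable of (chemical, protein, source) tuples; the
    source is ignored, so overlapping databases do not add confidence.
    """
    return {(chem, prot) for chem, prot, _source in records}


def strongest_affinity(measurements):
    """Keep the lowest affinity value (in nM) per (chemical, protein) pair.

    A lower Ki means tighter binding, so the minimum is the strongest
    reported affinity.
    """
    best = {}
    for chem, prot, value_nm in measurements:
        key = (chem, prot)
        if key not in best or value_nm < best[key]:
            best[key] = value_nm
    return best


# Hypothetical example records (values are illustrative, not from STITCH).
curated = [
    ("aspirin", "PTGS1", "DrugBank"),
    ("aspirin", "PTGS1", "TTD"),        # same pair, second report
    ("diclofenac", "PTGS2", "DrugBank"),
]
measured = [
    ("diclofenac", "PTGS2", 50.0),
    ("diclofenac", "PTGS2", 12.0),      # stronger binding, this one is kept
]

unique_pairs = merge_curated(curated)
best_values = strongest_affinity(measured)
```

In this toy input, the two aspirin reports collapse into a single pair, and only the 12.0 nM measurement is retained for diclofenac and PTGS2.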
The sources of verified protein-chemical interactions are complemented by automated text mining and a structure-based prediction method (18). The text-mining pipeline includes co-occurrence text mining and natural language processing of all MEDLINE abstracts as well as available PubMed Central open-access full-text articles (32). The newest addition to the text-mining sources is the collection of NIH RePORTER grant abstracts (https://projectreporter.nih.gov/). Considering co-occurring terms, adding the RePORTER data increased the number of high-confidence interactions between human proteins and chemicals from 2740 to 4740. Extensive benchmarking of each data source allows us to provide a unified confidence score for every interaction while taking into account the sources' predicted precision.

DISPLAY OF BINDING AFFINITIES IN THE NETWORK VIEW
Small molecules that activate or inhibit proteins such as enzymes or receptors are among the most studied classes of exogenous small molecules. In order to assess the effect and confidence of protein-ligand binding, as well as variability in the affinity of known ligands, it is essential to know the binding affinity between the compound and its target. Usually, this binding affinity is quantified as the inhibition constant Ki. In some cases, Ki values are not available, but other values such as the IC50 or EC50 (the half-maximal inhibitory or effective concentration) can serve as an approximation. Ki values of drugs vary greatly, from nanomolar inhibition constants to relatively high values, such as 52 μM between aspirin and cyclooxygenase 2 (27). Therefore, for any given drug, it is not so much the absolute value of the Ki, but rather the relative binding affinities that determine the impact on the interaction network.
In previous versions of STITCH, Ki values from primary sources (27,28) were accessible to the user through the web interface. In the new release of STITCH, the user can now choose to switch the network view to show the binding affinities of all protein-chemical interactions for which this value is known (Figure 1). This new network view is similar to STITCH's confidence view: the thickness of the edge between nodes scales with the Ki value. If a Ki is not available, EC50 or IC50 will be used to determine the depicted strength of the interaction. If there are multiple measurements available, the lowest value (i.e. highest reported affinity) will be used to determine the thickness of the edge.
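The selection rule for the displayed affinity can be sketched as follows. The text does not specify how a value is converted into an edge thickness, so the logarithmic mapping in `edge_width`, its display range (1 nM to 1 mM) and the pixel widths are assumptions made purely for illustration; only the fallback order (Ki first, otherwise EC50 or IC50) and the use of the lowest value follow the description above.

```python
import math


def pick_affinity(ki=None, ec50=None, ic50=None):
    """Return the value (nM) used for an edge, or None if nothing is known.

    The lowest Ki is preferred; without any Ki, the lowest EC50/IC50 is used.
    """
    if ki:
        return min(ki)
    fallback = list(ec50 or []) + list(ic50 or [])
    return min(fallback) if fallback else None


def edge_width(affinity_nm, min_width=1.0, max_width=8.0):
    """Map an affinity in nM to a line width (assumed mapping, not STITCH's).

    Uses a pKi-like scale, 9 - log10(nM), clamped to [3, 9], so that
    1 nM gives the thickest edge and 1 mM (or weaker) the thinnest.
    """
    if affinity_nm is None:
        return min_width
    p_value = 9.0 - math.log10(affinity_nm)
    p_value = max(3.0, min(9.0, p_value))
    return min_width + (p_value - 3.0) / 6.0 * (max_width - min_width)


# Hypothetical measurements in nM.
chosen = pick_affinity(ki=[50.0, 12.0], ic50=[1.0])   # Ki wins over IC50
width = edge_width(chosen)
```

A logarithmic scale is used here because affinities span several orders of magnitude; a linear mapping would make all but the weakest binders indistinguishable.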

DATA AND FILTERING FOR TISSUE SPECIFICITY
The protein-chemical network in STITCH is global and as such considers interactions anywhere in an organism. However, in multicellular organisms such as humans, not all proteins are present in every tissue. STITCH 5 addresses this through a new feature that allows users to filter a human interaction network so that only the proteins believed to be present in a specified tissue are shown (Figure 2). To provide this feature, STITCH now integrates tissue-specific protein expression patterns from two data sources. The first is the TISSUES resource (33), which combines evidence from UniProt annotations, systematic large-scale transcriptomics and proteomics studies, and co-occurrence text mining. For use in STITCH, the text-mining evidence was recomputed based on the same texts used elsewhere in STITCH. Second, STITCH incorporates baseline expression patterns from tissues deposited in the Expression Atlas (34). Before augmenting the network with tissue data, users have to choose whether they want to use data from TISSUES or Expression Atlas. The TISSUES resource contains confidence levels ranging from one (lowest confidence) to five (highest confidence). Accordingly, on the STITCH website users can select a tissue and a minimum confidence level. In contrast, datasets from the Expression Atlas are transformed into percentiles. The confidence score for a protein-protein interaction in the given tissue is then multiplied by the geometric mean of the two proteins' expression percentiles. For protein-chemical interactions, the confidence score is multiplied by the protein's expression percentile. To access the tissue expression patterns, users can search for tissues either by typing parts of the tissue names or by selecting a tissue from a list. Then, users can submit the changed settings to STITCH. In return, an updated network will be shown. As non-expressed nodes are removed (using TISSUES) or confidence values get updated (using Expression Atlas), other interaction partners may become part of the network.
Figure 1. Display of binding affinities. The user interface of STITCH has been updated and the option to scale edge width of protein-chemical interactions according to binding affinity has been added. The shown network of multiple NSAIDs makes their different binding affinities clear: for example, aspirin has relatively low binding affinities, whereas rofecoxib is specifically binding PTGS2.
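The two tissue modes can be summarised in a short sketch. It assumes that expression percentiles are expressed as fractions between 0 and 1 and that TISSUES confidence is stored as an integer from one to five per protein; the function names and example values are hypothetical and do not reflect the STITCH code base.

```python
import math


def filter_by_tissue(proteins, tissue_confidence, min_confidence):
    """TISSUES mode: keep proteins whose confidence (1-5) meets the cutoff.

    Proteins without an entry for the tissue are treated as not expressed.
    """
    return [p for p in proteins if tissue_confidence.get(p, 0) >= min_confidence]


def scale_protein_protein(score, percentile_a, percentile_b):
    """Expression Atlas mode: multiply by the geometric mean of both percentiles."""
    return score * math.sqrt(percentile_a * percentile_b)


def scale_protein_chemical(score, percentile):
    """Expression Atlas mode: multiply by the protein's expression percentile."""
    return score * percentile


# Hypothetical values for one tissue.
confidence = {"PTGS1": 4, "PTGS2": 2}
kept = filter_by_tissue(["PTGS1", "PTGS2", "ALOX5"], confidence, min_confidence=3)
pp_score = scale_protein_protein(0.9, 0.25, 1.0)   # geometric mean is 0.5
pc_score = scale_protein_chemical(0.8, 0.5)
```

Because the rescaled scores change the ranking of interaction partners, a fixed-size network (for example, the top five partners) can contain different proteins after scaling, which matches the behaviour described above.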

USE CASES
STITCH has been widely used for a variety of different purposes. These fall into three broad classes: (i) small- to medium-scale analyses performed via the web interface, (ii) large-scale analyses that make use of the bulk download files and (iii) reuse of data from STITCH for development of new web-based resources.
Work by O'Reilly et al. on identifying potential drug targets for α1-antitrypsin deficiency exemplifies the web-based usage (35). Through a genome-wide RNAi screen in a Caenorhabditis elegans disease model, the authors identified 104 C. elegans genes of interest (having 85 human orthologs). To validate these as potential drug targets, the authors queried STITCH and MetaCore for each of the human proteins and thereby identified compounds for use in follow-up experiments. Conversely, STITCH can also be queried for a set of chemicals to identify possible targets, as exemplified by the screen by Kumar et al. of compounds capable of altering intracellular manganese levels (36). The ability to see binding affinities in the new web interface makes STITCH 5 even better suited for such use cases than previous versions. STITCH is also commonly used for large-scale analyses, which we facilitate by making the data available for bulk download. Ligeti et al. used these files to construct a network neighborhood of proteins around each drug and showed that the neighborhood overlap of two drugs can predict synergy of drug combinations (37). On a related note, Vogt et al. made use of both the drug thesaurus and the protein-chemical interactions from STITCH to predict drug contraindications (38).
Figure 2. (A) The interaction network around diclofenac and PTGS1/2 is shown without filtering for tissue expression patterns. In this and the following panels, the top five interaction partners with the highest scores are shown. (B) Using the TISSUES resource, only proteins believed to be expressed in blood platelets (with medium confidence, i.e. three stars in TISSUES) become part of the interaction network. For these settings, PTGS2 is not expressed and is therefore shown in a lighter color. (C) Expression patterns according to RNA-seq data from the Human Protein Atlas are used to focus on genes expressed in smooth muscle. Confidence scores of interactions are scaled by the geometric mean of the binding partners' expression percentiles. Due to the recomputed confidence scores, four interaction partners have been replaced by other proteins.
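The neighborhood-overlap idea used in the large-scale analysis by Ligeti et al. can be illustrated with a toy example. The overlap measure they used is not stated here, so the Jaccard index below is an assumption chosen for simplicity, and the network, drug names and protein names are hypothetical.

```python
def neighborhood(network, drug):
    """Return the set of proteins directly linked to a drug in the network."""
    return set(network.get(drug, ()))


def jaccard(set_a, set_b):
    """Jaccard index: shared members divided by all members (0.0 if both empty)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Hypothetical drug-to-protein adjacency list, e.g. parsed from a bulk download.
toy_network = {
    "drug_a": ["P1", "P2", "P3"],
    "drug_b": ["P2", "P3", "P4"],
    "drug_c": ["P5"],
}

overlap_ab = jaccard(neighborhood(toy_network, "drug_a"),
                     neighborhood(toy_network, "drug_b"))
overlap_ac = jaccard(neighborhood(toy_network, "drug_a"),
                     neighborhood(toy_network, "drug_c"))
```

Here drug_a and drug_b share two of four distinct neighbors, whereas drug_a and drug_c share none; in a real analysis, such overlap scores would be computed for all drug pairs and compared against measured synergy.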
Last, but not least, the integrated data provided by STITCH are useful to researchers who develop their own web resources and prediction methods. An example of this is the ChemDIS resource, which combines the protein-chemical interactions from STITCH with tools for gene enrichment analysis to link chemicals via proteins to GO terms, pathways and diseases (39). The experimental protein-chemical interactions from STITCH are also sometimes used as a benchmark set when developing prediction methods, as exemplified by Zhou et al. (40).