UbiHub: a data hub for the explorers of ubiquitination pathways

Abstract

Motivation: Protein ubiquitination plays a central role in important cellular machineries such as protein degradation or chromatin-mediated signaling. With the recent discovery of the first potent ubiquitin-specific protease inhibitors, and the maturation of proteolysis targeting chimeras as promising chemical tools to exploit the ubiquitin-proteasome system, protein target classes associated with ubiquitination pathways are becoming the focus of intense drug-discovery efforts.

Results: We have developed UbiHub, an online resource that can be used to visualize a diverse array of biological, structural and chemical data on phylogenetic trees of human protein families involved in ubiquitination signaling, including E3 ligases and deubiquitinases. This interface can inform target prioritization and drug design, and serves as a navigation tool for medicinal chemists, structural and cell biologists exploring ubiquitination pathways.

Availability and implementation: https://ubihub.thesgc.org.

Here, we present UbiHub, an online data hub where drug-discovery scientists focused on ubiquitination pathways can easily navigate data relevant to their work. The UbiHub graphical user interface represents protein families as phylogenetic trees, onto which heterogeneous data collected from diverse repositories and the literature can be projected and scrutinized.

Assembling protein families
Four protein families are included in UbiHub: 8 E1 ubiquitin activating enzymes, 41 E2 ubiquitin conjugating enzymes, 634 E3 ubiquitin ligases and 113 de-ubiquitinases (DUBs). The composition of each family was derived from searches of their respective signature domains in the PFAM (Finn et al., 2014) and SMART (Schultz et al., 2000) databases. Previously reported atypical enzymes were added to the E1 list (Schulman and Harper, 2009). The E3 ligase list was complemented with a previously reported genome-wide functional annotation of human E3s and a systematic inventory of DCAFs (Lee and Zhou, 2007; Li et al., 2008). To improve visibility of the very large E3 family, it was divided into 297 proteins relying on multi-subunit complexes (mostly E3s interacting with Cullins, adaptor proteins and E2-recruiting subunits) and 337 standalone E3 ligases. DUBs were divided into 57 USPs and 56 functionally related, but biochemically distinct, non-USP proteins. The composition and subfamily classification of DUBs were based on a previously reported inventory of deubiquitinating enzymes (Nijman et al., 2005) and on the latest developments in the field (Kwasna et al., 2018; Maurer and Wertz, 2016).

Ubiquitin-proteasome system association
Ubiquitination can serve as a signal for ubiquitin-proteasome system (UPS)-mediated degradation or for other, non-degradative signaling pathways. The association of each E3 ligase with the UPS was estimated automatically and assigned a confidence score from 0 (no indication of UPS association) to 3 (reliable UPS association) based on three criteria. First, we checked whether the string 'degrad' occurred in the Function section of the protein's UniProt entry (UniProt Consortium, 2018). Second, we searched for the string 'degrad' among the Reactome pathways (Fabregat et al., 2018) linked to the protein. Third, we compiled for each E3 ligase the list of Reactome pathways assigned to all of its protein interactors in the BioGrid database (Stark, 2006), and searched for the string 'degrad' in the pathways that were enriched among these interactors (pathways enriched at least three-fold relative to their prevalence in the human proteome, and found in at least three interactors). The UPS association score was set to 0, 1, 2 or 3 when none, one, two or all three of these conditions were met, respectively. In a literature review of over 30 randomly selected E3s, we found the score to be reasonable in over 90% of cases, and we adjusted it manually where it was inaccurate.
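The scoring rule above can be illustrated with a minimal Python sketch. It is not the UbiHub implementation: the function names, the input shapes (plain strings and sets of pathway names) and the example data are hypothetical, and the enrichment test assumes that "enriched at least three times" means a fold enrichment of at least 3 over the pathway's frequency in the human proteome.

```python
from collections import Counter

STEM = "degrad"  # string searched for in all three criteria


def mentions_degradation(texts):
    """Return True if any text contains the stem 'degrad' (case-insensitive)."""
    return any(STEM in text.lower() for text in texts)


def enriched_pathways(interactor_pathways, background_freq,
                      min_fold=3.0, min_interactors=3):
    """Pathways over-represented among the interactors of one E3 ligase.

    interactor_pathways: list of sets of pathway names, one set per interactor.
    background_freq: dict mapping pathway name to the fraction of human
        proteins annotated with that pathway (assumed precomputed).
    """
    n_interactors = len(interactor_pathways)
    if n_interactors == 0:
        return []
    # Count, for each pathway, how many interactors carry it.
    counts = Counter(p for pathways in interactor_pathways for p in set(pathways))
    enriched = []
    for pathway, count in counts.items():
        background = background_freq.get(pathway)
        if not background:
            continue  # no background frequency available: skip the pathway
        fold = (count / n_interactors) / background
        if count >= min_interactors and fold >= min_fold:
            enriched.append(pathway)
    return enriched


def ups_score(uniprot_function, reactome_pathways,
              interactor_pathways, background_freq):
    """Score from 0 to 3: the number of criteria that are met."""
    criteria = [
        mentions_degradation([uniprot_function]),
        mentions_degradation(reactome_pathways),
        mentions_degradation(
            enriched_pathways(interactor_pathways, background_freq)),
    ]
    return sum(criteria)


# Hypothetical example data, for illustration only.
background = {
    "Antigen processing: Ubiquitination & Proteasome degradation": 0.02,
    "Signal Transduction": 0.30,
}
interactors = [
    {"Antigen processing: Ubiquitination & Proteasome degradation",
     "Signal Transduction"},
    {"Antigen processing: Ubiquitination & Proteasome degradation"},
    {"Antigen processing: Ubiquitination & Proteasome degradation",
     "Signal Transduction"},
    {"Signal Transduction"},
]
score = ups_score(
    "E3 ligase that mediates ubiquitination and degradation of target proteins",
    ["Signal Transduction"],
    interactors,
    background,
)
print(score)  # prints 2
```

In this example the first and third criteria are met (the Function text and one enriched interactor pathway contain 'degrad'), while the protein's own pathways do not, giving a score of 2. "Signal Transduction" is found in three interactors but is only 2.5-fold enriched, so it does not count as enriched.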

Phylogenetic trees and data collection
Phylogenetic trees are generated, and biological, structural and chemical data are collected, as previously described for ChromoHub (Liu et al., 2012; Shah et al., 2014), and stored in a MySQL database. Additionally, gene essentiality in cancer is extracted from the Broad Institute's cancer dependency map, from which we use data from CRISPR-knockout studies, with essentiality scores corrected for copy-number effect, and data from RNAi knock-down studies using DEMETER2 normalization (McFarland et al., 2018; Meyers et al., 2017).

Results
The graphical user interface is based on zoomable phylogenetic trees that represent any pre-selected protein family. In the case of E3 ligases, users can choose to display only proteins that are associated with the UPS at a pre-defined confidence level. A checkbox menu allows users to simultaneously tag proteins on a tree with diverse icons related to biological, structural or chemical data. Clicking on any of these icons opens pop-up windows with figures providing further details and HTML links to the source of information (PubMed record or public repository such as PDB entry). The checkbox menu includes clickable '?' symbols next to each menu item that can be used to display information on the data source and the way the data were processed. Through this graphical interface, users can have a bird's-eye view of the disease association landscape of an entire protein family, medicinal chemists can rapidly retrieve compounds co-crystallized with their protein target, structural biologists can inspect the structural coverage of a protein or its phylogenetic neighbors, and cell biologists can find the KD or IC50 and selectivity profile of chemical inhibitors, produce the chemical coverage of E3 ligases involved in the UPS, or quickly visualize the cancer dependency map of USPs.