One major task in the post-genome era is to reconstruct proteomic and genomic interaction networks from high-throughput experimental data. Identifying essential nodes/hubs in these interactomes is one way to decipher the critical keys inside biochemical pathways or complex networks. These essential nodes/hubs may serve as potential drug targets for developing novel therapies for human diseases, such as cancer or infectious diseases caused by emerging pathogens. Hub Objects Analyzer ( Hubba ) is a web-based service that uses graph theory to explore important nodes in an interactome network generated by specific small- or large-scale experimental methods. Two characteristic analysis algorithms, Maximum Neighborhood Component (MNC) and Density of Maximum Neighborhood Component (DMNC), were developed for exploring and identifying hubs/essential nodes in interactome networks. Users can submit their own interaction data in PSI format (Proteomics Standards Initiative, version 2.5 and 1.0), tab format or tab format with weight values. Users receive an email notification when the calculation is complete, within minutes or hours depending on the size of the submitted dataset. The Hubba result includes a rank given by a composite index, a manifest graph of the network showing the relationships among these hubs, and links for retrieving output files. The proposed method (DMNC || MNC) can be applied to discover previously unrecognized hubs in existing datasets. For example, most of the Hubba high-ranked hubs (80% in the top 10 hub list, and >70% in the top 40 hub list) from the yeast protein interactome data (Y2H experiment) are reported as essential proteins. Since the analysis methods of Hubba are based on topology, they can also be used on other kinds of networks to explore essential nodes, such as networks in yeast, rat, mouse and human. The Hubba website is freely available at http://hub.iis.sinica.edu.tw/Hubba .
Proteins control and mediate many biological activities via interactions with other protein partners. Information on protein networks derived from protein interactions can serve as a good starting point for understanding the molecular machinery. In addition, elucidating protein interaction partnerships may help annotate unknown proteins and provide further insight into biological networks. Various experimental strategies are available for identifying protein interactions. High-throughput adaptations of the yeast two-hybrid system, applied to bacteria, yeast, worms, flies and, more recently, mice and humans ( 1–4 ), enable us to characterize physical protein–protein interactions on a genome-wide scale ( 5 , 6 ). Many interactomes derived from such approaches have been collected in different databases, for example, the Biomolecular Interaction Network Database (BIND) ( 7 ), the Database of Interacting Proteins (DIP) ( 8 ), IntAct ( 9 ), the Munich Information Center for Protein Sequences (MIPS) ( 10 ), STRING ( 11 ), REACTOME ( 12 ) and other databases with a similar purpose. Some interesting interactomes of host–pathogen systems ( 4 , 13 , 14 ) and carcinogenesis ( 2 ) have also been published recently.
A protein interaction network is naturally complicated and far from a random network. Using network characteristics such as the degree distribution, clustering, diameter and relative graphlet frequency distribution, information can be extracted from a protein–protein network ( 15 ). Identifying essential nodes/hubs in protein networks is one way to decipher the critical key controllers inside biochemical pathways or complex networks. Combining gene-expression data with a high-quality yeast protein–protein interaction dataset, Han et al . ( 16 ) examined the network dynamics of protein–protein interaction networks and revealed two types of hubs. One type is more likely to act as module organizers and the other as module connectors ( 17 ). These essential nodes/hubs may serve as candidate drug targets for developing novel therapies for human diseases, such as cancer or infectious diseases caused by emerging pathogens.
Several approaches try to identify motifs/functional modules, while few attempt to decipher hub/essential proteins directly. For example, CFinder is a tool for predicting the function of a single protein and for discovering novel protein modules ( 18 ). Similar tools such as mfinder ( 19 ), FANMOD ( 20 ) and MAVisto ( 21 ) are designed for network motif detection. Idowu et al . ( 22 ) used the Degree and BottleNeck methods to identify possibly essential proteins in the PPI network of Bacillus subtilis .
Here, we propose a framework that combines self-developed algorithms with an integrated platform, named Hub Objects Analyzer ( Hubba ), to decipher hub/essential proteins from user-defined protein interaction networks in graphic mode. Hubba is a web-based service that uses graph theory to explore important nodes in an interactome network generated by specific small- or large-scale experimental methods. On this website, we explore the essential nodes of a protein–protein interaction network with six characteristic analysis methods: Degree, BottleNeck (BN), Edge Percolation Component (EPC), Subgraph Centrality (SC) and two characteristic analysis algorithms developed by us, Maximum Neighborhood Component (MNC) and Density of Maximum Neighborhood Component (DMNC). A double screening scheme (DSS) for exploring and identifying hubs/essential nodes in interactome networks is also proposed. The Hubba result includes a rank given by a composite index in the DSS, a manifest graph of the network showing the relationships among these hubs via an SVG viewer ( http://www.adobe.com/svg/ ), and links to the results calculated by all the algorithms mentioned above. When the yeast protein interactome data (Y2H experiment) are analyzed against the list of essential proteins from the Saccharomyces Genome Database (SGD, http://www.yeastgenome.org/ ), most of the Hubba high-ranked hubs (80% in the top 10 hub list, and >70% in the top 40 hub list) are reported as essential proteins. Since the analysis methods of Hubba are based on topology, they can also be used on other kinds of networks to explore essential nodes, such as networks in yeast, rat, mouse and human. The clues revealed by network topological analysis should provide new insight to experimental biologists.
The Hubba system is built on an open-source stack: Linux (Mandriva 2007, operating system), Apache (web server), PHP (HTML-embedded scripting language), PostgreSQL (relational database), XMLMakerFlattener (data format translation), Graphviz (graph generator), and BGL, LAPACK and LAPACK++ (topology calculation). The framework of the whole system is depicted in Figure 1 . The interaction network among hub/essential proteins can be visualized in PNG format. Further annotations of the biological functions related to the identified hubs can be shown in the SVG viewer when the input file is in PSI-MI format.
Algorithms used in Hubba
Hubba explores the possibly essential proteins in the interaction network with six topology-based scoring methods and a DSS. Each scoring method captures one postulated topological characteristic of essential proteins. A DSS is therefore proposed, in which two scoring methods, A and B, are combined to capture a mixture of the characteristics of essential proteins. If the n most likely essential proteins are expected in the output, the 2 n top-ranked proteins by method A are selected first. These 2 n proteins are then re-ranked by method B, and the n top-ranked proteins are output. The number 2 n is an empirical value for this double screening method. The list of yeast essential proteins was compiled from the dataset of the functional characterization of the Saccharomyces cerevisiae genome by gene deletion ( 27 ) and updated information from SGD ( http://www.yeastgenome.org/ ).
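The two-stage selection described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Hubba implementation: the function name `double_screen`, the dictionary-based score tables and the alphabetical tie-breaking rule are assumptions introduced here.

```python
def double_screen(score_a, score_b, n):
    """Return n nodes: the top 2n by method A, re-ranked by method B.

    score_a and score_b map node identifiers to scores (higher is better).
    Ties are broken by node identifier, which is an assumption of this sketch.
    """
    # Stage 1: keep the 2n best-scoring nodes under method A.
    stage1 = sorted(score_a, key=lambda v: (-score_a[v], v))[:2 * n]
    # Stage 2: re-rank the survivors by method B and output the best n.
    return sorted(stage1, key=lambda v: (-score_b[v], v))[:n]


# Hypothetical scores for six nodes (method A = DMNC, method B = MNC).
dmnc = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.2, "f": 0.1}
mnc = {"a": 1, "b": 5, "c": 2, "d": 4, "e": 9, "f": 8}
print(double_screen(dmnc, mnc, 2))  # ['b', 'd']
```

In this toy example, nodes e and f have the highest method-B scores but never reach the second stage, because they fall outside the top 2 n under method A.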
Topology-based scoring methods
Degree ( 23 ): in this method, the score of a node v is assigned as the degree of v , D( v ), the number of links incident to this node.
BottleNeck (BN) ( 15 , 24 ): for each node v in an interaction network, a tree of shortest paths starting from v is constructed. Taking v as the root of the tree T_v, the weight of a node w in T_v is the number of descendants of w , that is, the number of shortest paths starting from v that pass through w . A node w is called a bottle-neck node in T_v if the weight of w is no less than n /4, where n is the number of nodes in T_v. The score of node w , BN( w ), is defined to be the number of nodes v such that w is a bottle-neck node in T_v.
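A minimal sketch of the BottleNeck score follows, under several assumptions that the text does not fix: the shortest-path tree is taken to be a breadth-first-search tree with neighbours visited in sorted order, the weight of w counts strict descendants only, and the root of each tree is not counted as a bottle-neck of its own tree. The function name `bottleneck_scores` and the adjacency-dictionary input are illustrative choices.

```python
from collections import deque


def bottleneck_scores(adj):
    """BN(w): number of roots v whose BFS tree T_v has w as a bottle-neck."""
    bn = {v: 0 for v in adj}
    for root in adj:
        # Build one BFS (shortest-path) tree rooted at `root`.
        parent = {root: None}
        order = [root]
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for w in sorted(adj[u]):
                if w not in parent:
                    parent[w] = u
                    order.append(w)
                    queue.append(w)
        # Weight of a node = number of its descendants in the tree.
        # Reversed BFS order visits children before their parents.
        weight = {v: 0 for v in order}
        for v in reversed(order):
            if parent[v] is not None:
                weight[parent[v]] += weight[v] + 1
        n = len(order)  # number of nodes in T_v
        for w in order:
            if w != root and weight[w] >= n / 4:
                bn[w] += 1
    return bn


# Path graph a-b-c-d-e: the middle node lies on most shortest paths.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
print(bottleneck_scores(path))
```

Because a shortest-path tree is generally not unique, a different tie-breaking rule during the search may give different scores on networks with many equal-length paths.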
Edge percolated component (EPC) ( 25 ): for an interaction network G , assign a removal probability p to every edge. Let G ′ be one realization of random edge removal from G . If nodes v and w are connected in G ′, set δ_vw to 1; otherwise set δ_vw to 0. The percolated connectivity of v and w , c_vw, is defined to be the average of δ_vw over realizations. The size of the percolated component containing node v , s_v, is defined to be the sum of c_vw over nodes w . The score of node v , EPC( v ), is defined to be s_v.
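The EPC score can be estimated by Monte Carlo sampling, as in the sketch below. It assumes that the sum over w includes v itself, so that the score equals the average size of the component containing v, and it uses a fixed number of realizations with a seeded random generator. The function name `epc_scores` and its parameters are hypothetical.

```python
import random


def epc_scores(nodes, edges, p, realizations=1000, seed=0):
    """Estimate EPC(v): average size of v's component after edge removal."""
    rng = random.Random(seed)
    total = {v: 0 for v in nodes}
    for _ in range(realizations):
        # Union-find structure for one realization G' of the network.
        parent = {v: v for v in nodes}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        for u, w in edges:
            # Each edge is removed with probability p, kept otherwise.
            if rng.random() >= p:
                parent[find(u)] = find(w)
        # Component sizes in this realization.
        size = {}
        for v in nodes:
            r = find(v)
            size[r] = size.get(r, 0) + 1
        for v in nodes:
            total[v] += size[find(v)]
    return {v: total[v] / realizations for v in nodes}


nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c")]
print(epc_scores(nodes, edges, p=0.5, realizations=200))
```

With p = 0 the score reduces to the size of the connected component of each node, and with p = 1 every node scores 1, which gives two simple sanity checks.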
Subgraph centrality (SC) ( 26 ): for a node v , the number of closed walks of length k that start and end at v is denoted μ_k( v ). The subgraph centrality of v , SC( v ), is defined to be $SC(v) = \sum_{k=0}^{\infty} \mu_k(v) / k!$.
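As an illustration, the sketch below evaluates the standard series form of subgraph centrality, in which the number of closed walks of length k at v is weighted by 1/k! and summed over k, using the fact that this number is the diagonal entry of the k-th power of the adjacency matrix. Truncating the series at a fixed number of terms, and the name `subgraph_centrality`, are assumptions of this sketch; a production implementation would more likely use an eigendecomposition.

```python
import math


def subgraph_centrality(adj, terms=30):
    """Approximate SC(v) = sum_k mu_k(v) / k! with a truncated series."""
    nodes = sorted(adj)
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    # Adjacency matrix A of the (undirected, unweighted) network.
    a = [[0] * n for _ in range(n)]
    for v in adj:
        for w in adj[v]:
            a[index[v]][index[w]] = 1
    # power holds A^k; its diagonal entry (i, i) is mu_k for node i.
    power = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    sc = [0.0] * n
    for k in range(terms):
        for i in range(n):
            sc[i] += power[i][i] / math.factorial(k)
        power = [
            [sum(power[i][m] * a[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return dict(zip(nodes, sc))


# Two nodes joined by one edge: closed walks exist only for even k,
# so the series sums to cosh(1).
pair = {"a": ["b"], "b": ["a"]}
print(subgraph_centrality(pair))
```

The dense matrix products make this suitable for small examples only.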
MNC: the neighborhood of a node v , nodes adjacent to v , induce a subnetwork N ( v ). The score of node v , MNC( v ), is defined to be the size of the maximum connected component of N ( v ). The neighborhood N(v) is the set of nodes adjacent to v and does not contain node v .
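The MNC score can be computed with a single traversal restricted to the neighbours of v, as in the following sketch. The function name `mnc` and the adjacency-dictionary representation are illustrative assumptions.

```python
def mnc(adj, v):
    """Size of the largest connected component of the subnetwork N(v)."""
    neighbours = set(adj[v])
    neighbours.discard(v)  # N(v) does not contain v itself
    best = 0
    seen = set()
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first search that never leaves the neighbourhood of v.
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            u = stack.pop()
            size += 1
            for w in adj[u]:
                if w in neighbours and w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best


# Hub h has neighbours a, b, c, d; a-b and b-c interact, d is a leaf.
net = {
    "h": ["a", "b", "c", "d"],
    "a": ["h", "b"],
    "b": ["h", "a", "c"],
    "c": ["h", "b"],
    "d": ["h"],
}
print(mnc(net, "h"))  # 3
```

Here h has degree 4 but MNC( h ) = 3, because neighbour d is not connected to the other neighbours once h is removed.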
DMNC: for a node v , let N be the number of nodes and E the number of edges of the maximum neighborhood component of v , respectively. The score of node v , DMNC( v ), is defined to be E / N^ε for some 1 ≤ ε ≤ 2. We may assume that the MNC has a strong community structure, such as a clique percolation in a random network. In our system, ε is set to 1.7, which is close to 1.67, the ε-value obtained when we assume the neighborhood sub-network has a four-community.
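A sketch of the DMNC score follows, with ε fixed at 1.7 as stated above. It assumes a symmetric adjacency dictionary, returns 0.0 for a node without neighbours, and, when several components share the maximum size, uses the first one found; none of these choices is specified in the text. The function name `dmnc` is hypothetical.

```python
def dmnc(adj, v, epsilon=1.7):
    """E / N**epsilon for the largest connected component of N(v)."""
    neighbours = set(adj[v])
    neighbours.discard(v)
    best = set()
    seen = set()
    for start in neighbours:
        if start in seen:
            continue
        # Collect one connected component of the neighbourhood subnetwork.
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            u = stack.pop()
            component.add(u)
            for w in adj[u]:
                if w in neighbours and w not in seen:
                    seen.add(w)
                    stack.append(w)
        if len(component) > len(best):
            best = component
    if not best:
        return 0.0  # assumption: a node without neighbours scores zero
    # Each undirected edge inside the component is seen from both ends.
    edge_count = sum(1 for u in best for w in set(adj[u]) if w in best) // 2
    return edge_count / len(best) ** epsilon


net = {
    "h": ["a", "b", "c", "d"],
    "a": ["h", "b"],
    "b": ["h", "a", "c"],
    "c": ["h", "b"],
    "d": ["h"],
}
print(dmnc(net, "h"))  # 2 edges over 3 nodes: 2 / 3**1.7
```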
The double screening scheme (DSS)
Job processing and result display
The Hubba system separates a job into two modes, ‘user mode’ and ‘system mode’ ( Figure 1 ). In ‘user mode’, a protein interaction dataset can be uploaded for network analysis. Three data formats are accepted: PSI format (Proteomics Standards Initiative, version 2.5 and 1.0), tab format and tab format with weight values. The dataset may be submitted by pasting the interaction data directly into the query form, or by uploading a file from the local computer. Users are advised to provide an email address for jobs that may be time-consuming; the Hubba daemon will then announce job completion by email. Once users verify all the parameters and submit their jobs, the process enters ‘system mode’.
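The exact column layout of the tab formats is not specified here. As a purely hypothetical illustration, the sketch below assumes one interaction per line, with two tab-separated node identifiers and an optional third column holding a numeric weight; the function name `parse_tab` and the sample identifiers are invented for this example.

```python
def parse_tab(text):
    """Parse tab-separated interactions into (node_a, node_b, weight) tuples.

    Assumed layout (not confirmed by the Hubba documentation):
    node_a <TAB> node_b [<TAB> weight]
    """
    edges = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"line {line_number}: expected at least two columns")
        weight = float(fields[2]) if len(fields) > 2 else None
        edges.append((fields[0], fields[1], weight))
    return edges


sample = "P1\tP2\nP2\tP3\t0.5\n"
print(parse_tab(sample))
```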
All input data in a query are parsed and stored in a temporary database for the subsequent analysis. Hubba applies the six topological methods and the double screening scheme to the submitted dataset and obtains a ranking score for each node in the submitted network. The ranking score in Hubba is a composite index calculated by the DSS (DMNC || MNC), as described in the algorithm sections. After all calculations are completed, the process is directed back to ‘user mode’ for outcome display.
There are three major options on the result page: ‘Hub Selector and Topology Moderator’, ‘Local Network Graph with Hub List’ and ‘Download Area’. In ‘Hub Selector and Topology Moderator’, users can select the top-ranked hubs or search for particular nodes to browse the relationships among these nodes in the submitted network. Users can also use two advanced options: ‘Check the first-stage nodes’ shows the neighbors of the top/particular nodes, and ‘Display the shortest path’ marks the shortest-path distance between nodes. In this way, the connectivity among hubs can be easily identified.
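For an unweighted network, a shortest path between two selected nodes can be found by breadth-first search. The sketch below illustrates the idea only; how Hubba computes and displays these paths internally is not described here, and the function name `shortest_path` is an assumption.

```python
from collections import deque


def shortest_path(adj, source, target):
    """Return one shortest path from source to target, or None if unreachable."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            # Walk back through the parents to recover the path.
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                queue.append(w)
    return None


graph = {
    "a": ["b", "d"],
    "b": ["a", "c"],
    "c": ["b", "e"],
    "d": ["a", "e"],
    "e": ["d", "c"],
    "z": [],
}
print(shortest_path(graph, "a", "e"))  # ['a', 'd', 'e']
```

The shortest-path distance between the two nodes is the number of edges on the returned path, that is, its length minus one.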
An output graph in PNG format is generated by Graphviz and shown directly on the result page under ‘Local Network Graph with Hub List’. For queries submitted in the standard PSI-MI format, the biological functions related to the identified hubs can be shown in the SVG viewer. All output results, including network images and the ranking scores from the DSS and the six scoring methods, can be retrieved from the ‘Download Area’. We also provide the output in GML and EPS formats, which can be opened in Cytoscape ( http://www.cytoscape.org/ ) and edited with standard Linux tools for further analysis.
Normally, an analysis job is completed within a few minutes and the result is returned to the same web browser window automatically. If a job takes longer than expected, the user can bookmark the link and revisit Hubba later, or follow the link provided in the notification email to retrieve the analysis results.
RESULTS AND CONCLUSION
The main ideas of the double screening scheme are to select methods that capture diverse characteristics and to include as many essential proteins as possible. First, the overlap of the top- n lists from different methods is studied. For all six methods applied to the protein–protein interaction dataset yeast20070107.lst ( http://dip.doe-mbi.ucla.edu/ ), the overlap in the top 100 ranked proteins of any two scoring methods is expressed as a percentage ( Supplementary Table S1 ). Among all methods, DMNC is found to share the fewest proteins with the others. Accordingly, the topological characteristics extracted by DMNC may differ from those extracted by the other methods. Second, we evaluate the performance of the six scoring methods by their coverage of yeast essential proteins. As shown in Table 1 , DMNC has the highest hit rate on the essential protein list. Therefore, we choose DMNC as the first method in the DSS. The second method of the double screening scheme is chosen by the same criteria. Among the remaining five methods, MNC is the best partner for DMNC. The scheme improves the hit rate ( Table 1 , last column).
Table 1. Columns: Degree (%), BN (%), EPC (%), SC (%), MNC (%), DMNC (%), DMNC || MNC (%)
For example, 8 of the top 10 proteins found by DMNC have been identified as yeast essential proteins [% = (8/10) × 100%].
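The two evaluation measures used in this section, the pairwise overlap of top- n lists and the hit rate against a list of essential proteins, are simple to express in code. The sketch below uses hypothetical function names and invented protein identifiers; it does not reproduce the values of Table 1 or Supplementary Table S1.

```python
def top_n(scores, n):
    """The n best-scoring nodes (ties broken by identifier, an assumption)."""
    return sorted(scores, key=lambda v: (-scores[v], v))[:n]


def overlap_percent(scores_a, scores_b, n):
    """Percentage of the top-n list of method A also in the top-n of method B."""
    shared = set(top_n(scores_a, n)) & set(top_n(scores_b, n))
    return 100.0 * len(shared) / n


def hit_rate_percent(ranked, essential, n):
    """Percentage of the first n ranked proteins found in the essential list."""
    hits = sum(1 for v in ranked[:n] if v in essential)
    return 100.0 * hits / n


# Invented example: 8 of the first 10 ranked proteins are essential.
ranked = [f"p{i}" for i in range(1, 11)]
essential = {"p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "q1"}
print(hit_rate_percent(ranked, essential, 10))  # 80.0
```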
Hubba provides a user-friendly interface for dataset uploading and result display. After the analysis process is completed, Hubba provides a community graph of the top n ranked ( n ≤ 100) hub/essential proteins, labeled with the identifiers provided in the input dataset ( Figure 2 , a graph of the top 10 list). We use a coloring scheme, from red to green, as a cue for the ranking score, and a line pattern to distinguish direct interactions (solid line) from indirect interactions (dotted line). Furthermore, advanced options are available for browsing the neighborhood of these hubs and the shortest-path distance between hub nodes. Hubba has been applied to discover hubs/essential proteins in the PPI datasets (downloaded from the IntAct website) of five model organisms. More precalculated results are available on our help page ( http://hub.iis.sinica.edu.tw/Hubba/help.htm ).
Identifying hubs or fragile motifs is very important in network biology. For example, an overview of the interactions between human proteins and proteins from 190 pathogen strains revealed that both viral and bacterial pathogens tend to interact with hubs and bottlenecks in the human PPI network ( 28 ). Chuang et al . ( 29 ) applied a protein-network-based approach to analyze the expression profiles of two cohorts of breast cancer patients. They found that several notorious cancer markers, such as P53, KRAS, HRAS, HER-2/neu and PIK3CA, are located at the interconnecting bottlenecks of many expression-responsive genes, although these markers could not serve as indicators of the disease state using gene-expression data alone. Feldman and co-workers ( 30 ) described some network properties of human inheritable diseases. They found that genes and proteins harboring variations that cause the same disease phenotype tend to form directly connected clusters. Hubba can serve a similar purpose of identifying disease-associated proteins: it accepts a query list of proteins of interest on a user-defined network and outputs the shortest paths among them. In this way, nodes on these paths may serve as candidates related to the disorder in which the query list is involved.
Topological analysis such as that performed by Hubba depends on the completeness and accuracy of the input interactome dataset. Nevertheless, this platform offers a chance to build a network tailored to the scenario from which a customized interaction dataset was derived. In this way, the secrets hidden inside networks with specific spatiotemporal scenarios may be deciphered and sketched. We hope this approach can lead to a new strategy for exploring the mechanisms of cancer formation and pathogen infection, and that it may lead to new therapies and novel insights into the basic mechanisms controlling normal cellular processes and disease pathologies.
The authors would like to thank the National Science Council (NSC)/National Research Program of Genomic Medicine (NRPGM), Taiwan, for financially supporting this research through NSC 96-3112-B-001-002 to C.-Y.L. and NSC 95-2221-E-008-055 to C.-W.H. Funding to pay the Open Access publication charges for this article was provided by NSC 96-3112-B-001-002 to C.-Y.L.
Conflict of interest statement . None declared.