Jüri Reimand, Laur Tooming, Hedi Peterson, Priit Adler, Jaak Vilo, GraphWeb: mining heterogeneous biological networks for gene modules with functional significance, Nucleic Acids Research, Volume 36, Issue suppl_2, 1 July 2008, Pages W452–W459, https://doi.org/10.1093/nar/gkn230
Abstract
Deciphering heterogeneous cellular networks with embedded modules is a great challenge of current systems biology. Experimental and computational studies construct complex networks of molecules that describe various aspects of the cell such as transcriptional regulation, protein interactions and metabolism. Groups of interacting genes and proteins reflect network modules that potentially share regulatory mechanisms and relate to common function. Here, we present GraphWeb, a public web server for biological network analysis and module discovery. GraphWeb provides methods to: (i) integrate heterogeneous and multispecies data for constructing directed and undirected, weighted and unweighted networks; (ii) discover network modules using a variety of algorithms and topological filters and (iii) interpret modules using functional knowledge of the Gene Ontology and pathways, as well as regulatory features such as binding motifs and microRNA targets. GraphWeb is designed to analyse individual or multiple merged networks, search for conserved features across multiple species, mine large biological networks for smaller modules, discover novel candidates and connections for known pathways and compare results of high-throughput datasets. GraphWeb is available at http://biit.cs.ut.ee/graphweb/ .
INTRODUCTION AND BACKGROUND
One of the greatest challenges of biomedical research is to understand the organization and function of living organisms at the molecular level. Experimental and computational data reveal complex networks that consist of genes and proteins as nodes and associations as edges ( 1–3 ). While describing different aspects of the cell, these networks appear to share universal structural properties like log-linear distribution of connections and small-world reachability ( 4 , 5 ). Within networks, modules of tightly interacting genes and proteins are believed to make up functional units responsible for processes in the cell ( 6 ). For instance, collections of protein–protein interactions (PPI) form networks of physically binding proteins, where modules reflect protein complexes or signalling pathways ( 7 , 8 ). Gene expression measures, transcription regulator binding data, cis -regulatory motif discovery and conservation information are combined to uncover transcription regulatory networks with modules of transcription factors (TFs) and target genes ( 9–12 ). From a slightly different angle, text-mining methods extract knowledge-based webs and co-occurring modules of genes and proteins from scientific literature ( 13 ).
Biological network analysis poses the following computational challenges. The strategies need to take into account the myriad of cellular interactions that may be directed (e.g. TF–gene interaction) or undirected (e.g. PPI), involve quantitative values (e.g. gene expression correlation) or appear in multiple datasets (e.g. co-expression and physical interaction) ( 14 ). Combining different cellular domains requires data integration to deal with various biomolecules and experimental measurements ( 15 ). Module detection involves algorithms that identify nodes with special topological features or search for densely connected areas ( 16 ). Biological interpretation of modules comprises functional analysis using resources such as the Gene Ontology (GO) ( 17 ) and detection of significantly enriched biological processes, functions and cellular locations ( 18 ).
The growing interest in networks and systems biology has increased the need for computational and visual methods for network analysis, and as a result, several useful tools have been published. Notable software libraries include AT&T Graphviz for visualization and C++ Boost for graph structures and algorithms, packaged into Bioconductor by Carey and colleagues ( 19 ). Cytoscape is a popular software package for visual analysis of biological networks ( 20 ). A number of plugins complement Cytoscape with analytical features such as microarray data integration, dense subgraph detection ( 21 ) and GO-term enrichment analysis ( 22 ). Osprey focuses on visualization ( 23 ), while VisANT also provides topological analysis and functional annotation of nodes ( 24 ). MATISSE is useful for mapping high-throughput datasets onto network topologies and detecting gene modules using a number of algorithms ( 25 ). BiologicalNetworks is a network retrieval, construction and visualization tool with an emphasis on microarray data ( 26 ). BioPIXIE provides a gene-based query engine and GO analysis for a precomputed heterogeneous network for Saccharomyces cerevisiae ( 27 ). NetworkBLAST allows the user to align and compare two networks of different species through user-provided sequence similarity measures to discover conserved protein complexes ( 28 ).
We have identified open questions in the field of biological network analysis. There is a lack of simple ‘point-and-click’ web servers that allow biological data integration and discovery of modules. Some of the available tools involve no biological background information and force the user to put great effort into integrating datasets, linking molecules and retrieving functional annotations, while others constrain the analysis to some pre-calculated network of a specific model organism. Module detection is frequently limited to neighbourhood search of gene lists or topological analysis such as node connectivity. Both Cytoscape and VisANT implement functionality for analysing high-throughput networks, detecting modules and enriched biological features. However, we believe that there is a need for web-based resources that analyse heterogeneous datasets with mixed collections of genes and proteins, detect various types of modules and provide a rich interface for functional annotation. Moreover, there is little support for the analysis and integration of multispecies data using automatic orthology mapping. With the development of the GraphWeb server, we wish to contribute to the network challenge and propose new solutions to the above questions.
THE GraphWeb SERVER
GraphWeb ( http://biit.cs.ut.ee/graphweb , Figure 1 ) is a public web server for graph-based analysis of cellular networks that:
analyses directed and undirected, weighted and unweighted heterogeneous networks of genes, proteins and microarray probesets for 35+ eukaryotic genomes;
integrates multiple diverse datasets into global networks;
incorporates multispecies data using gene orthology mapping;
filters nodes and edges based on dataset support, edge weight and node annotation;
detects gene modules from networks using a collection of algorithms;
interprets discovered modules using GO, pathways and cis -regulatory motifs.

Figure 1. GraphWeb user interface with data from the case study of human PPI and gene expression (see Results Section for a detailed description). The first module of 33 nodes is shown in Figure 2 . User interface legend: ( A ) data upload, ( B ) module detection algorithms, ( C ) options and filters, ( D ) user data storage, ( E ) network information and labels, ( F ) module information and gene search, ( G ) module export, ( H ) module zoom-in analysis, ( I ) module label distribution, ( J ) module annotation score, ( K ) best functional enrichments and link to g:Profiler, ( L ) links to module visualization and ( M ) export to SIF format.
Figure 2. The case study: a connected component ( A ) detected from the combined network for protein interactions and gene expression similarity. The discovered module describes a fragment of the human cell cycle and consists of several smaller modules. Two cyclin-dependent kinases (CDC2, CDK2) are hubs regulating different cyclins [e.g. CDC2 module ( B )]. MCM2-7 proteins form a helicase and five of these connect into a clique ( C ). The network neighbourhood module of ORC2L and ORC5L ( D ) contains origin recognition complex proteins.
Networks in GraphWeb
The primary input of GraphWeb is a combined biological network of a selected species, consisting of genes, proteins or microarray probesets as nodes and corresponding associations as edges. The user may upload the input data as a file or type it into the webform. Genes, proteins and microarray probesets of various databases and platforms are automatically mapped to gene IDs of the Ensembl database ( 29 ) using the g:Profiler software ( 30 ). Unrecognized and ambiguous IDs may be optionally removed, but remain unchanged by default in order to keep the input networks intact. Associations between nodes may be represented as directed or undirected edges, and weights may be assigned to edges to convey quantitative relations between corresponding nodes. A collection of pre-defined datasets is available for immediate analysis, including PPI from IntAct ( 31 ) and HPRD ( 32 ), and the S.cerevisiae transcription regulatory network by MacIsaac et al . ( 33 ).
Data integration
GraphWeb allows the user to insert and combine different data sources and align these into a global network. Besides its native plaintext format, GraphWeb supports the import of other network files such as SIF, GML, XGMML and BioPAX through the Cytoscape BiNoM plugin ( 34 ). Labels can be used to distinguish associations of different sources, and a network score may be assigned to each label to denote the predictive power of corresponding associations. For example, TF-binding networks from ChIP-chip experiments may be combined and aligned with motif discovery results, and scored with predictive values learned from gene expression data.
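The label-based merging described above can be illustrated with a minimal Python sketch. It is not GraphWeb's implementation: the function name `merge_networks`, the label names and the edge weights are assumptions made for illustration, and edges are treated as undirected.

```python
from collections import defaultdict

def merge_networks(datasets):
    """Merge labelled edge lists into one undirected network.

    `datasets` maps a label (hypothetical names such as "ppi") to an
    iterable of (node_a, node_b, weight) tuples. The result maps each
    undirected edge, stored as a frozenset, to {label: weight}.
    """
    merged = defaultdict(dict)
    for label, edges in datasets.items():
        for a, b, weight in edges:
            if a == b:
                continue  # self-loops are ignored in this sketch
            merged[frozenset((a, b))][label] = weight
    return dict(merged)

# Toy input: gene names are taken from the case study, weights are invented.
datasets = {
    "ppi": [("CDC2", "CCNB1", 1.0), ("CDK2", "CCNE1", 1.0)],
    "coexpr": [("CCNB1", "CDC2", 0.8), ("MCM2", "MCM3", 0.9)],
}
network = merge_networks(datasets)

# Dataset support of an edge = number of labels that report it.
support = {edge: len(labels) for edge, labels in network.items()}
```

Keeping one weight per label, rather than collapsing weights at merge time, leaves the choice of a combined score to a later filtering step.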

Multispecies networks
GraphWeb provides means to incorporate data from different organisms in order to improve network construction. When the user selects a target organism in the GraphWeb interface, the nodes and corresponding associations of the input are automatically mapped to orthologous genes in the target. The orthology mapping information is retrieved from Ensembl via the g:Profiler software. Resulting ortholog networks can be combined with other datasets of the target organism to highlight conserved associations. Similarly to single-species data integration, GraphWeb ignores ambiguous orthologs in network alignments to avoid noise and misleading results. Such a solution retains the cleanest possible network but undoubtedly results in a certain loss of information.
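A minimal sketch of this kind of ortholog projection follows, assuming that an ortholog is "ambiguous" when a source gene maps to more than one target gene. The function `map_to_target`, the ortholog table and the placeholder genes `GeneX` and `GeneY` are hypothetical; GraphWeb itself obtains the mapping from Ensembl via g:Profiler.

```python
def map_to_target(edges, orthologs):
    """Project edges onto a target species, dropping ambiguous orthologs.

    `orthologs` maps a source gene to a list of target genes. Only genes
    with exactly one target ortholog are kept (an assumption about what
    "ambiguous" means); edges with an unmapped endpoint are discarded.
    """
    unique = {gene: targets[0]
              for gene, targets in orthologs.items()
              if len(targets) == 1}
    mapped = []
    for a, b in edges:
        if a in unique and b in unique and unique[a] != unique[b]:
            mapped.append((unique[a], unique[b]))
    return mapped

# Toy mouse network and a hypothetical ortholog table.
mouse_edges = [("Cdk2", "Ccne1"), ("Cdk2", "GeneX"), ("Ccne1", "GeneY")]
orthologs = {
    "Cdk2": ["CDK2"],
    "Ccne1": ["CCNE1"],
    "GeneX": ["HUMAN_A", "HUMAN_B"],  # ambiguous: two candidate orthologs
    # "GeneY" has no ortholog entry at all
}
human_edges = map_to_target(mouse_edges, orthologs)
```

Only the `Cdk2`/`Ccne1` edge survives here, which mirrors the trade-off noted above: the projected network is clean, but two of the three input associations are lost.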
Graph filtering
GraphWeb filters help the user detect network areas with strong associations. Three types of filters may be used for selecting edges: minimum number of supporting datasets (i.e. labels), lower threshold on edge weights and selection of top-ranking edges. Node filtering excludes unrecognized or ambiguous genes and proteins, while module filtering limits the result to larger modules or those with significant functional enrichments. Filtering techniques are especially useful when incorporating edges from different datasets or species.
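The three edge filters can be sketched as follows. This is an illustration under stated assumptions rather than GraphWeb's code: the function `filter_edges` and its parameter names are hypothetical, and the combined edge weight is assumed to be the sum of per-label weights, which the text does not specify.

```python
def filter_edges(network, min_support=1, min_weight=None, top_n=None):
    """Apply the three edge filters to a labelled network.

    `network` maps an edge (node_a, node_b) to {label: weight}.
    The combined weight is taken as the sum over labels (an assumption).
    Returns (edge, weight) pairs sorted by decreasing weight.
    """
    kept = []
    for edge, labels in network.items():
        weight = sum(labels.values())
        if len(labels) < min_support:
            continue  # too few supporting datasets
        if min_weight is not None and weight < min_weight:
            continue  # below the lower weight threshold
        kept.append((edge, weight))
    kept.sort(key=lambda item: item[1], reverse=True)
    if top_n is not None:
        kept = kept[:top_n]  # keep only the top-ranking edges
    return kept

# Toy network with invented labels and weights.
toy_network = {
    ("A", "B"): {"ppi": 1.0, "coexpr": 0.9},
    ("B", "C"): {"ppi": 1.0},
    ("C", "D"): {"ppi": 1.0, "coexpr": 0.4},
    ("D", "E"): {"coexpr": 0.2},
}
by_support = [edge for edge, _ in filter_edges(toy_network, min_support=2)]
by_weight = [edge for edge, _ in filter_edges(toy_network, min_weight=1.0)]
top_edge = [edge for edge, _ in filter_edges(toy_network, top_n=1)]
```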
Gene module discovery
GraphWeb provides a number of methods and algorithms for detecting gene modules in directed and undirected networks. Resulting gene modules may easily be saved for later use or redirected to input for further analysis. GraphWeb identifies the following types of modules.
Connected components
A connected component ( Figure 2 A) is a group of genes in which every pair of genes $(g_i, g_j)$ is connected either directly ($g_i \smile g_j$) or indirectly via a path of length $n$ ($g_1 \smile g_2 \smile \dots \smile g_n \smile g_{n+1}$). GraphWeb also supports two extensions of the above: a strongly connected component relates to directed networks and requires connections in both directions, and a biconnected component requires at least two non-overlapping paths. Connected component detection is the first step in studying network structure.
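For the undirected case, connected components can be found with a breadth-first search. The sketch below is a generic textbook approach, not necessarily the one used by GraphWeb; the function name and the toy edges are illustrative.

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Return the connected components of an undirected graph, largest first."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            # Visit every neighbour that has not been reached yet.
            for neighbour in adjacency[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Toy edges using gene names from the case study.
toy_edges = [("CDC2", "CCNB1"), ("CDC2", "CDK2"), ("MCM2", "MCM3")]
components = connected_components(toy_edges)
```

Each node and edge is visited once, which is consistent with a running time linear in the size of the network.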
Neighbourhood modules
A neighbourhood module ( Figure 2 D) is based on a user-defined list of genes and proteins { G } and on a distance d . If d = 0, GraphWeb retrieves modules that consist of nodes G with internal associations inside the list. If d ≥ 1, modules consist of the initial list { G } and nodes connected to the latter via paths of maximum length d . Neighbourhood modules allow the user to study her focus list in a network context, and retrieve related nodes and associations to propose new hypotheses.
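A neighbourhood module can be sketched as a depth-limited expansion from the focus list. The function name `neighbourhood_module` and the toy edges are assumptions made for illustration; for d = 0 this sketch returns the whole focus list together with its internal edges.

```python
def neighbourhood_module(edges, seeds, d):
    """Return (members, module_edges) for focus list `seeds` and distance `d`."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    members = set(seeds)
    frontier = set(seeds)
    for _ in range(d):
        # Expand by one step: neighbours of the frontier not yet in the module.
        frontier = {n for node in frontier
                    for n in adjacency.get(node, ())} - members
        members |= frontier

    module_edges = [(a, b) for a, b in edges if a in members and b in members]
    return members, module_edges

# Toy chain of associations; the edges are illustrative only.
toy_edges = [("ORC2L", "ORC5L"), ("ORC2L", "CDC6"),
             ("CDC6", "CDT1"), ("CDT1", "GMNN")]
focus = ["ORC2L", "ORC5L"]
members_d0, edges_d0 = neighbourhood_module(toy_edges, focus, 0)
members_d1, edges_d1 = neighbourhood_module(toy_edges, focus, 1)
members_d2, _ = neighbourhood_module(toy_edges, focus, 2)
```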
Hub-based modules
A hub-based module ( Figure 2 B) consists of a central hub (a node with many connections) and related genes and proteins within distance d . GraphWeb extracts a list of hub-based modules ranked by the central hub degree (number of connections). Hubs in PPI networks have been described in the context of lethality ( 35 ), and proteins linking to the same hub often refer to similar function ( 36 ). Hub-based modules may also reflect systems of TFs and target genes.
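Hub-based module extraction can be sketched by ranking nodes by degree and collecting everything within distance d of each top-ranked hub. The function `hub_modules`, its `top` parameter and the toy edges are hypothetical; ties in degree are broken alphabetically here, which is an arbitrary choice.

```python
from collections import defaultdict

def hub_modules(edges, d=1, top=2):
    """Return (hub, degree, members) for the `top` highest-degree nodes."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    # Rank nodes by decreasing degree, then by name for reproducibility.
    ranked = sorted(adjacency, key=lambda n: (-len(adjacency[n]), n))

    modules = []
    for hub in ranked[:top]:
        members, frontier = {hub}, {hub}
        for _ in range(d):
            frontier = {n for node in frontier
                        for n in adjacency[node]} - members
            members |= frontier
        modules.append((hub, len(adjacency[hub]), members))
    return modules

# Toy kinase/cyclin network; the edges are illustrative only.
toy_edges = [("CDC2", "CCNB1"), ("CDC2", "CCNA2"), ("CDC2", "CDK2"),
             ("CDK2", "CCNE1"), ("CDK2", "CCNE2"), ("CDK2", "CCNA2")]
modules = hub_modules(toy_edges, d=1, top=2)
```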
Cliques
A clique ( Figure 2 C) is a fully connected module where every pair of nodes is directly connected. Cliques in PPI networks have often been related to protein complexes and common functions ( 36 ). Fully connected modules also reflect clusters of co-expressed genes.
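The text does not name the clique detection algorithm used by GraphWeb. As an illustration, the sketch below uses the classic Bron-Kerbosch recursion (without pivoting) to enumerate maximal cliques; the function name, the `min_size` parameter and the toy edges are assumptions.

```python
from itertools import combinations

def maximal_cliques(edges, min_size=3):
    """Enumerate maximal cliques of at least `min_size` nodes (Bron-Kerbosch)."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    cliques = []

    def expand(current, candidates, excluded):
        # `current` is a clique; `candidates` can still extend it;
        # `excluded` holds nodes already covered by earlier branches.
        if not candidates and not excluded:
            if len(current) >= min_size:
                cliques.append(current)
            return
        for node in list(candidates):
            expand(current | {node},
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return sorted(cliques, key=len, reverse=True)

# Toy network: four fully connected MCM proteins plus two extra edges.
toy_edges = list(combinations(["MCM2", "MCM3", "MCM4", "MCM5"], 2))
toy_edges += [("MCM5", "MCM7"), ("MCM7", "CDC7")]
cliques = maximal_cliques(toy_edges, min_size=3)
```

The worst-case running time of this enumeration grows exponentially with network size, so dense networks are expensive to process.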
Cluster modules
A cluster module corresponds to a tightly connected group of nodes. GraphWeb provides two network clustering algorithms: the Markov Cluster (MCL) algorithm ( 37 ) and Betweenness Centrality Clustering (BCC) ( 38 ). These algorithms break networks down into separate modules by removing certain edges, and have been successfully applied in a number of studies, such as protein family detection ( 39 ) and essentiality assessment ( 40 ). MCL constructs modules of edges that are frequently visited during random walks, while BCC removes paths that act as bridges between separate tightly connected modules. Graph clustering is successful in integrative network analysis since it prefers associations with evidence from multiple datasets, and allows the detection of hybrid modules that combine the characteristics of different module types.
Empirical comparisons show that the time complexity of the above algorithms is generally linear in the number of edges. Clique detection, an NP-complete problem, is the most computationally expensive method in GraphWeb and is especially sensitive to dense networks, where a network of 30 nodes and 300 edges requires a computation of nearly 10 min. MCL clustering, on the other hand, takes 10 min to handle a network of nearly 8000 nodes and 300 000 edges using GraphWeb default values. Hub-based modules and connected components are detected even faster.
Module interpretation and evaluation
Interpretation and evaluation is an integral process of module detection in GraphWeb. Once a module has been identified, GraphWeb automatically assesses its biological importance through the known properties of its members using the g:Profiler software. Functional profiling of the module involves statistically enriched annotations of biological processes (bp), cellular locations (cc) and molecular functions (mf) from the GO ( 17 ), and related pathways (pw) from the Kyoto Encyclopedia of Genes and Genomes (KEGG) ( 41 ) and Reactome ( 42 ). Besides functional annotations, the analysis takes into account cis -regulatory motif enrichments from TRANSFAC ( 43 ) and miRNA target site enrichments from miRBase ( 44 ).
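The text does not state which statistic underlies the enrichment P-values. A common choice for this kind of analysis is the one-sided hypergeometric test, sketched below under that assumption. The function name and the example numbers are illustrative, and no multiple-testing correction is applied.

```python
from math import comb

def enrichment_p_value(population, annotated, module, overlap):
    """One-sided hypergeometric P-value for a term in a module.

    population: number of genes in the background
    annotated:  background genes carrying the annotation
    module:     number of genes in the module
    overlap:    module genes carrying the annotation
    Returns P(X >= overlap) when `module` genes are drawn at random.
    """
    upper = min(annotated, module)
    total = comb(population, module)
    tail = sum(comb(annotated, k) * comb(population - annotated, module - k)
               for k in range(overlap, upper + 1))
    return tail / total

# Small worked case: 10 genes, 4 annotated, module of 3, 2 annotated in module.
# P = (C(4,2)*C(6,1) + C(4,3)*C(6,0)) / C(10,3) = 40 / 120 = 1/3
p_small = enrichment_p_value(10, 4, 3, 2)

# Hypothetical module of 33 genes with 20 annotated, in a background of
# 20000 genes of which 500 carry the term (invented numbers).
p_module = enrichment_p_value(20000, 500, 33, 20)
```

Exact integer binomials keep the arithmetic simple in this sketch; real tools would additionally correct the P-values for the number of terms tested.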


GraphWeb executes on-the-fly functional profiling and scoring of detected modules, displaying the names and P -values of the most important discovered features from all the covered functional domains (GO:bp, GO:cc, GO:mf, KEGG:pw, Reactome:pw, TRANSFAC, miRBase). Hyperlinks to g:Profiler allow the user to access related terms and pathways, ortholog mapping and expression similarity search for related genes. In addition, a hyperlink to g:Cocoa at the bottom of the GraphWeb interface sends all discovered modules to comparative functional enrichment analysis.
RESULTS: A CASE STUDY
We present an example case study that demonstrates a possible data integration and module detection pipeline. The analysis concentrates on human cellular networks and involves six high-throughput datasets comprising gene expression values and PPI from public databases. Human PPI data originate from the study by ( 46 ) and the databases HPRD ( 32 ) and IntAct ( 31 ), and are interpreted as three separate networks. Human expression data are presented as an expression similarity network, computed using Multi Experiment Matrix (MEM) (Adler et al ., manuscript in preparation) across nearly 3700 tumour-related samples of 89 public datasets, originating from GEO ( 47 ) and ArrayExpress ( 48 ). Besides human data, we use orthology mapping to incorporate two datasets for mouse: a MEM gene expression similarity network across 28 datasets and 1700 samples, and the PPI data from IntAct.
Unweighted PPI datasets and weighted expression similarity datasets are aligned into a global weighted network. Integration of the above datasets reveals frequently co-expressed protein complexes such as the ribosome and the proteasome. We applied a strong edge filter of minimum dataset support 4, and queried for connected components. The largest resulting component consists of 33 nodes and four notable submodules, is included in known pathways of Reactome and KEGG, and involves strong GO enrichments.
The module plays a significant role in the cell cycle and is well described by PPI as well as gene expression similarity. The two hubs denote cyclin-dependent kinases 1 (CDC2/CDK1) and 2 (CDK2); see Figure 2 B for the former module. These kinases control the cell cycle entry to S-phase, while CDK1 also controls the entry to mitosis ( 49 ). MCM2-7 proteins form a helicase and five of these connect into a clique ( Figure 2 C). The neighbourhood of ORC2L and ORC5L partly reveals the origin recognition complex (ORC) ( Figure 2 D), which temporarily interacts with CDT1 and CDC6 and binds to the helicase to initiate replication in S-phase. Other connected proteins include cell cycle checkpoint controllers (e.g. CHEK1 kinase), inhibitors (GMNN, BIRC5) and cyclins (CCNE1, CCNE2, CCNB1).
The thorough common-knowledge description of the detected module provides support for the techniques proposed in GraphWeb. The rather strong filters applied above naturally extracted a well-studied result out of a large collection of public data. The GraphWeb case study provides a simple example of the possibilities and potential results of analysing novel data or combining it with existing public repertoires.
DISCUSSION
The core data structures and algorithms in GraphWeb render the myriad of molecular entities and corresponding relations, physical connections and regulatory events into a uniform collection of network nodes and connecting edges. On the one hand, this simplification creates an intuitive view of the cellular networks. GraphWeb analysis methods allow the researcher to approach a number of interesting tasks, for example proposing novel members of known pathways by strong ‘guilt by association’ evidence, comparing the results of multiple high-throughput datasets, or finding associations and modules of genes that are conserved in diverse species. On the other hand, looking at topological features, weighted edges and tightly connected groups of nodes may admittedly fail to deliver crucial aspects of biological systems, such as quantitative dependencies and dynamics over time. The greatest advantage of GraphWeb analysis is its relative simplicity and speed in handling complex objects as networks. We therefore believe that GraphWeb also proves useful in detailed network studies, since it allows the user to reduce the complexity of the whole network to the complexity of modules. Such a reduction may then provide access to more elaborate methods of mathematical modelling that are inapplicable to systems larger than a handful of variables.
CONCLUSION
GraphWeb is a publicly available web server for analysing and interpreting complex cellular networks. The server provides methods for integrating heterogeneous datasets into networks of interactions, means to incorporate multispecies data using gene orthology information, algorithms and methods for discovering network modules and functional enrichment analysis for biological interpretation. With the creation of the GraphWeb server, we wish to contribute to the difficult task of deciphering and understanding complex biological networks, and provide a tool with an emphasis on ease of use.
IMPLEMENTATION
The GraphWeb web server is implemented in Perl as a CGI application. Graph structures and algorithms are written in C++ and Perl and are partly based on the Boost Graph Library ( http://www.boost.org/ ). GraphWeb applies the MCL algorithm implementation by van Dongen ( 37 ) ( http://micans.org/mcl/ ). Visualization is provided by the AT&T Graphviz graph drawing package ( http://www.graphviz.org/ ) and the SWOG graphical programming language ( http://biit.cs.ut.ee/SWOG/ ).
ACKNOWLEDGEMENTS
The authors wish to thank Dr Nicholas Luscombe and the anonymous reviewers for valuable remarks on the articles and software. This work has been supported by the EU FP6 grants ENFIN LSHG-CT-2005-518254 and COBRED LSHB-CT-2007-037730, and Estonian Science Foundation grant ETF7437. J.R. has received funding from the Marie Curie Biostar program and the Tiger University program of the Estonian Information Technology Foundation. Funding to pay the Open Access publication charges for this article was provided by the European Commission (COBRED) project.
Conflict of interest statement : None declared.
REFERENCES
Author notes
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors