The post-genomic era presents us with the challenge of linking the vast amount of raw data obtained with transcriptomic and proteomic techniques to relevant biological pathways. We present an update of PathExpress, a web-based tool to interpret gene-expression data and explore the metabolic network without being restricted to predefined pathways. We define the Enzyme Neighbourhood (EN) as a sub-network of linked enzymes with a limited path length to identify the most relevant sub-networks affected in gene-expression experiments. PathExpress is freely available at: http://bioinfoserver.rsbs.anu.edu.au/utils/PathExpress/.
With the development of transcriptomic and proteomic techniques, post-genomic data represents a new challenge for researchers attempting to interpret the vast amount of raw data in a biological context (1). The analysis of microarray data is usually performed in two steps: the identification of genes that are differentially expressed under two or more conditions, using different statistical methods (2), and a comparison of selected genes with a background to find overlaps between the observed changes in expression and biologically relevant partitionings of the measured genes. Many ontological tools are now available that support the functional interpretation of gene-expression data via the identification of significantly enriched Gene Ontology (GO) categories (3) within groupings of genes of interest (4).
Additionally, with the availability of pathway databases such as KEGG (5,6) and MetaCyc (7), numerous tools have been proposed that analyse microarray data and visually present associated metabolic or regulatory pathway information (8–16). However, the predefined metabolic pathways used in these methods represent an essentially arbitrary segmentation of the metabolism. In contrast, other methods integrate, a priori, the knowledge of gene networks in the analysis of gene-expression data. Ideker and co-workers presented a procedure for screening a molecular interaction network combined with a statistical measure to identify sub-networks that show significant changes in expression (17). This approach has been included in Cytoscape to identify functional modules, i.e. highly connected network regions with similar responses across multiple experimental conditions (18). Hanisch and co-workers proposed a co-clustering method based on a distance function that combines information from expression data and biological networks (19). A Potts spin algorithm was developed to cluster gene-expression data by using the nearest neighbour reactions of biochemical networks (20). Rapaport and co-workers extracted gene-expression patterns of neighbouring genes in the network, involving the attenuation of high-frequency signals with respect to the graph (21). Another approach identifies the smallest functional units based on the network topology using the Petri net theory (22). It has been shown by Schwartz and co-workers that elementary modes represent true functional units of metabolism and can be used to reveal transcriptional activity (23). However, the combinatorial explosion of computing elementary modes in large networks limits the practical use of these methods.
We previously presented a web-based tool called PathExpress (10) that allowed us to interpret gene-expression results from microarrays in the context of biological pathways. PathExpress has been developed to identify the most relevant pathways or sub-pathways associated with a subset of genes of interest (e.g. a set of differentially expressed genes). It is based on a directed graph modelling enzymatic reactions derived from the publicly available KEGG LIGAND database (24,25).
In the present article, we describe a new development in PathExpress: the enzyme neighbourhood (EN) method. We define the EN as a sub-network of linked enzymes with a limited path length. The EN method enables us to explore the metabolic network and identify the most relevant sub-networks affected in gene-expression experiments without being restricted to predefined pathways. While the interaction with the web server is essentially unchanged, PathExpress now incorporates the EN method, supports 28 Affymetrix 3′ Gene-expression Analysis Arrays representing 32 distinct organisms, and is easy to extend further. In a case study, the EN method was tested with gene-expression data of the model legume Medicago truncatula by comparing the transcriptomes of meristematic and non-meristematic root cells (26).
PathExpress is based on a directed graph modelling enzymatic reactions as used in the Petri net representation of biological networks (27). Two types of nodes are used to represent compounds and reactions. Specific reactions can encompass one or more enzymes. Directed edges, connecting these nodes, correspond to the consumption or the production of compounds by the reaction. We first built the global metabolic network consisting of 2276 enzymes and 3810 compounds involved in 3663 reactions as specified in the KEGG LIGAND database (24,25). In order to avoid annotation errors due to the misinterpretation of partial Enzyme Commission (EC) numbers (28), we only utilized enzymes defined by a full EC term. This database has the advantage of providing a manually curated representation of enzymatic reactions involved in metabolic pathways where most secondary metabolites (very common and highly connected compounds such as water, oxygen, major coenzymes and prosthetic groups) have been removed, thus avoiding invalid metabolic connections and unspecified pathways.
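The graph model described above can be illustrated with a minimal Python sketch. It is not the PathExpress implementation: the class name `MetabolicGraph`, its attributes, the helper `is_full_ec` and the reaction and compound identifiers are all hypothetical, and the only behaviour taken from the text is that reactions link consumed and produced compounds and that enzymes without a full EC number are discarded.

```python
import re
from collections import defaultdict

# A full EC number has four numeric fields; partial terms such as "1.1.1.-" are rejected.
FULL_EC = re.compile(r"^\d+\.\d+\.\d+\.\d+$")


def is_full_ec(ec):
    """Return True only for EC numbers with four numeric fields."""
    return bool(FULL_EC.match(ec))


class MetabolicGraph:
    """Directed graph with two node types: compounds and reactions (hypothetical layout)."""

    def __init__(self):
        self.consumes = defaultdict(set)      # reaction -> compounds it consumes
        self.produces = defaultdict(set)      # reaction -> compounds it produces
        self.catalysed_by = defaultdict(set)  # reaction -> full EC numbers

    def add_reaction(self, reaction_id, ec_numbers, substrates, products):
        # Edges compound -> reaction (consumption) and reaction -> compound (production).
        self.consumes[reaction_id].update(substrates)
        self.produces[reaction_id].update(products)
        # A reaction can encompass several enzymes; keep only fully specified ones.
        self.catalysed_by[reaction_id].update(
            ec for ec in ec_numbers if is_full_ec(ec)
        )


# Hypothetical identifiers, for illustration only.
graph = MetabolicGraph()
graph.add_reaction("R1", ["5.3.1.12", "1.1.1.-"], ["compound_a"], ["compound_b"])
```

Storing consumption and production separately keeps the direction of each edge available, even though the neighbourhood search described later ignores it.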
Many of the current methods for the functional interpretation of gene-expression data are constrained by their need to link expressed genes with predefined metabolic pathways and are therefore often hampered when the species to be analysed is not represented in the pathway database. To overcome this limitation, probe sets of the genome arrays supported in PathExpress are linked to the metabolic network using NetAffx annotations (29) or similarities with protein sequences of known EC numbers retrieved from the UniProt database (30). A complete metabolic graph representing all assignments is produced for each organism. This strategy can be applied to any set of sequences and makes it easy to extend PathExpress for use with novel species. In addition, EC numbers can be directly uploaded and compared to the reference network, which allows the analysis of custom data.
In the global network, two reactions are regarded as neighbours if a metabolite exists that is the product of one reaction and the substrate of the other. We define the EN of depth d for an enzyme e as the set of enzymes that can be reached in the graph from e by traversing a maximum of d compounds, regardless of the direction of the edges (Figure 1). The EN of depth 1 for a given enzyme thus corresponds to the set of enzymes directly connected to it via a compound (i.e. its immediate neighbours). The EN of depth 2 includes the enzymes in the EN of depth 1 plus the enzymes linked to these. As different paths can connect two enzymes, the shortest distance between two enzymes is used to define the EN. These ENs correspond to different sub-networks of the global metabolic network. By comparing a specific list of genes to the ENs, it is possible to identify those ENs that are significantly over-represented in the gene list.
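The definition above amounts to a breadth-first search of bounded depth. The following Python sketch shows one way to compute it under two assumptions that the text does not settle: the seed itself is counted as part of its EN (distance 0), and ignoring edge direction is read as "any two enzymes sharing a compound are neighbours". The function name and the toy network are hypothetical.

```python
from collections import defaultdict, deque


def enzyme_neighbourhood(enzyme_compounds, seed, depth):
    """Return {enzyme: shortest distance} for enzymes within `depth` compounds of `seed`.

    `enzyme_compounds` maps each enzyme to the compounds its reactions consume
    or produce; edge direction is deliberately ignored.
    """
    # Invert the mapping so that each compound lists the enzymes attached to it.
    compound_enzymes = defaultdict(set)
    for enzyme, compounds in enzyme_compounds.items():
        for compound in compounds:
            compound_enzymes[compound].add(enzyme)

    distance = {seed: 0}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        if distance[current] == depth:
            continue  # do not expand beyond the requested depth
        for compound in enzyme_compounds.get(current, ()):
            for neighbour in compound_enzymes[compound]:
                if neighbour not in distance:
                    # Breadth-first order guarantees this is the shortest distance.
                    distance[neighbour] = distance[current] + 1
                    queue.append(neighbour)
    return distance


# Toy linear network: E1 -c1- E2 -c2- E3 -c3- E4 (hypothetical identifiers).
toy_network = {
    "E1": {"c1"},
    "E2": {"c1", "c2"},
    "E3": {"c2", "c3"},
    "E4": {"c3"},
}
en_depth_1 = enzyme_neighbourhood(toy_network, "E1", 1)
en_depth_2 = enzyme_neighbourhood(toy_network, "E1", 2)
```

Because the search proceeds level by level, each enzyme is recorded at its shortest distance from the seed, which matches the use of the shortest distance in the definition.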
To identify the most relevant sub-network associated with a list of submitted enzymes, the EN of each seed (submitted EC number) is determined in the global network for a given depth, and the EC numbers contained in the resulting EN are compared to the submitted list. For each test, a P-value is calculated using the hypergeometric distribution (31); it represents the probability that the observed intersection between the submitted list and the enzymes belonging to the given EN occurs by chance in the population of enzymes involved in the entire network. Because multiple tests are performed, it is necessary to correct these P-values with adjustment methods such as the conservative Bonferroni correction (32) or the False Discovery Rate approach (33).
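The enrichment test and the two corrections can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used by the server: the P-value is taken as the upper tail of the hypergeometric distribution (an overlap at least as large as the one observed), and the Benjamini-Hochberg procedure is assumed for the False Discovery Rate approach, which the text does not name explicitly. All function names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(population, en_size, list_size, overlap):
    """P(X >= overlap) when `list_size` enzymes are drawn without replacement
    from `population` enzymes, `en_size` of which belong to the EN."""
    upper = min(en_size, list_size)
    tail = sum(
        comb(en_size, k) * comb(population - en_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(population, list_size)


def bonferroni(pvalues):
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy numbers, not taken from the case study.
p_example = hypergeom_pvalue(population=10, en_size=5, list_size=5, overlap=5)
raw = [0.01, 0.04, 0.03]
adjusted_bonferroni = bonferroni(raw)
adjusted_bh = benjamini_hochberg(raw)
```

Bonferroni controls the chance of any false positive and is therefore the stricter of the two; the False Discovery Rate approach tolerates a fraction of false positives among the reported sub-networks.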
The size of the EN depends on its depth d, which has to be specified as a parameter in the current implementation. To match this parameter to the size of the submitted list of genes, we have computed the average number of enzymes involved in each possible EN for a range of depths (Table 1). Based on these results, it is possible to adjust the depth parameter so that groups of enzymes are compared with sub-networks of similar size. For example, to compare a group of 10 enzymes, we recommend a depth parameter of 1 (i.e. direct neighbours), corresponding to an average size of 11.7 enzymes.
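The rule of thumb above, choosing the depth whose average EN size is closest to the size of the submitted group, can be written as a small helper. The sketch below is hypothetical: `recommend_depth` is not part of PathExpress, and the lookup table contains only the two averages stated in this article (11.7 enzymes at depth 1 and approximately 250 enzymes at depth 10); the values for intermediate depths would have to be taken from Table 1.

```python
def recommend_depth(list_size, average_en_size):
    """Return the depth whose average EN size is closest to `list_size`.

    `average_en_size` maps depth -> average number of enzymes in an EN.
    Ties are resolved in favour of the smaller depth.
    """
    return min(
        average_en_size,
        key=lambda depth: (abs(average_en_size[depth] - list_size), depth),
    )


# Only the two averages quoted in the text; the depth-10 value is approximate.
known_averages = {1: 11.7, 10: 250.0}
depth_for_ten_enzymes = recommend_depth(10, known_averages)
```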
Table 1. Columns: Depth; Average no. of neighbours.
THE PATHEXPRESS WEB SERVER
As input data, PathExpress receives a list of identifiers (Affymetrix probe set identifiers and/or GenBank accession numbers). Other parameters can be specified: the type of comparison (pathway, sub-pathway or EN), the P-value significance threshold and the adjustment method used to correct for multiple testing.
The PathExpress output contains the list of sub-networks (metabolic pathways, sub-pathways or ENs) that are associated with the enzymes in the submitted list of identifiers. The ones with significant association are highlighted. Each of these networks can be displayed, both via an automatically generated graphical representation and as an enumeration of enzymatic reactions.
As an example, we used PathExpress to analyse microarray data obtained from the model legume Medicago truncatula, comparing the gene expression of meristematic and non-meristematic root tissues (26). The data have been deposited in NCBI's Gene Expression Omnibus (34) and are accessible through GEO series accession number GSE8115. Following normalization, differentially expressed probe sets were identified by evaluating the log2 ratio between the two conditions. All probe sets that differed by more than 2-fold were considered to be differentially expressed. Of the 390 transcripts over-expressed in the non-meristematic tissue, 94 could be assigned to 50 distinct enzymatic functions, as defined by their EC number in the Affymetrix Medicago Genome Array. To contrast the whole pathway approach with the EN method, we used the ‘Entire Pathway’ option of PathExpress to identify over-representation of metabolic pathways in the non-meristematic root. The most significant pathway (P-value: 1.09e–03) was carbon fixation, which is defined by 22 enzymes, six of which are differentially expressed in the tissue. We also identified the most relevant sub-networks corresponding to the same group of over-expressed transcripts, using the EN option with a depth of 4. The resulting sub-networks were ranked by increasing P-values. The most significant EN (P-value: 4.06e–04) is given in Figure 1 and was seeded by the glucuronate isomerase (EC 5.3.1.12, black). Of the 13 enzymes present in the depicted sub-network, seven are involved in the pentose and glucuronate interconversion pathway as described in the KEGG database. The remaining six enzymes connected to this sub-network are part of different pathways involved in carbohydrate metabolism (galactose, inositol phosphate, ascorbate and aldarate) and would not have been considered by an approach restricted to the predefined metabolic pathways.
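The selection step of the case study, keeping probe sets whose expression differs by more than 2-fold, corresponds to an absolute log2 ratio greater than 1. A minimal Python sketch is given below; the function name, the probe set identifiers and the intensity values are hypothetical, and the normalization step is assumed to have been carried out beforehand.

```python
from math import log2


def differentially_expressed(intensities, fold=2.0):
    """Return {probe_set: log2 ratio} for probe sets changing by more than `fold`.

    `intensities` maps each probe set to a pair of normalized signals
    (condition A, condition B). A positive ratio means higher expression in A.
    """
    threshold = log2(fold)
    selected = {}
    for probe_set, (signal_a, signal_b) in intensities.items():
        ratio = log2(signal_a / signal_b)
        if abs(ratio) > threshold:  # strictly more than `fold`, in either direction
            selected[probe_set] = ratio
    return selected


# Hypothetical normalized signals (non-meristematic, meristematic).
example_signals = {
    "probe_a": (400.0, 100.0),  # 4-fold up
    "probe_b": (120.0, 100.0),  # below threshold
    "probe_c": (100.0, 450.0),  # 4.5-fold down
    "probe_d": (200.0, 100.0),  # exactly 2-fold, not selected
}
selected_probes = differentially_expressed(example_signals)
over_expressed = [probe for probe, ratio in selected_probes.items() if ratio > 0]
```

The over-expressed subset is the list that would then be submitted to PathExpress for comparison against pathways or ENs.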
Our web-based tool for the interpretation of genomics data, first described in 2007 (10), has been extended to implement the concept of ENs. The EN of a given enzyme is defined as a connected sub-network within the global metabolic network, built from the KEGG database. The identification of statistically significantly over-represented ENs is based on the same statistical approach used for the identification of gene enrichment in GO terms or metabolic pathways. However, the clustering method differs, as it includes knowledge about the network of gene products without being restricted to predefined pathways.
Recently, another tool called KEGG spider, which takes a similar approach to interpreting genomics data in the context of the global gene metabolic network, has been published (35). Although both methods identify statistically significant sub-networks in a submitted list of genes, there are some fundamental differences. KEGG spider infers the network that minimizes the distance between each connected gene pair according to pair-wise distances between genes. It estimates the significance of the inferred network by a Monte Carlo procedure. In contrast, PathExpress performs an enrichment analysis by comparing the EN of a given depth with the submitted genes, using the hypergeometric distribution and an adjustment method. While KEGG spider limits sub-networks by allowing a maximum of three consecutive missing enzymes, PathExpress can consider all sub-networks up to a depth of 10, corresponding to approximately 250 enzymes. KEGG spider uses the KEGG orthology database to map the genes to the metabolic network and is available only for nine reference organisms, whereas PathExpress uses pre-computed assignments of sequences to EC numbers, and can easily be extended from the currently supported 32 organisms to any organism or set of sequences (e.g. custom DNA microarray, proteome array), enabling the analysis of a wider range of gene-expression experiments. For example, it has recently been used to compare the proteomic data derived from seeds of plants within and beyond the legume family (36).
Since its initial development, PathExpress has been extended to explore the Enzyme Neighbourhood for the identification of relevant sub-networks affected in gene-expression experiments. Many genome arrays have been added, making PathExpress a useful resource for integrating transcriptomic and proteomic data with enzymatic or metabolic reaction datasets.
Australian Research Council Centre of Excellence Grant. Funding for open access charge: Australian Research Council Centre of Excellence Grant.
Conflict of interest statement. None declared.