MetaboSignal: a network-based approach for topological analysis of metabotype regulation via metabolic and signaling pathways

Abstract Summary MetaboSignal is an R package that allows merging metabolic and signaling pathways reported in the Kyoto Encyclopaedia of Genes and Genomes (KEGG). It is a network-based approach designed to navigate through topological relationships between genes (signaling- or metabolic-genes) and metabolites, representing a powerful tool to investigate the genetic landscape of metabolic phenotypes. Availability and Implementation MetaboSignal is available from Bioconductor: https://bioconductor.org/packages/MetaboSignal/ Supplementary information Supplementary data are available at Bioinformatics online.


Supplementary material
In this document, we provide the following information: • High-resolution version of the figure shown in the main paper.
• Glossary and reference information for MetaboSignal.
• Instructions for how to install and get started with MetaboSignal.
• Description of all the functionalities of MetaboSignal, with illustrative examples.
• Common workflow for how MetaboSignal can be used.
• Basic guidelines about how to import, visualize and customize MetaboSignal networks in cytoscape. Reactions are linked to enzymes, and then to genes using the KEGG API (http://www.genome.jp/kegg/rest/keggapi.html).

Main paper figure
• Optionally, the genes from the signaling network can be filtered based on tissue-expression using information from the Human Protein Atlas database (http://www.proteinatlas.org/).
• Finally, the metabolic and signaling networks are merged (i.e. "rbind()" of the 2 matrices) to build the MetaboSignal network.
Node types: MetaboSignal networks are composed by 4 different types of nodes: • Metabolic genes: genes encoding enzymes that catalyze metabolic reactions.
Node betweenness: measurement of node centrality defined as the number of shortest paths going through a given node of the network (http://igraph.org/c/doc/igraph-docs.pdf).
Shortest path: path connecting two nodes with the least number of edges. Shortest paths are calculated using the breadth-first algorithm (http://igraph.org/c/doc/igraph-docs.pdf).

Installation instructions
To use MetaboSignal, R (version >= 3.3) must be correctly installed. Here, we strongly recommend using RStudio.
Nodes represent four different molecular entities: metabolic-genes (i.e. genes encoding enzymes that catalyze metabolic reactions), signaling-genes (e.g. kinases), reactions or metabolites. It is possible to build a tissue-specific network-table that excludes signaling genes that are not expressed in a given tissue. Tissue expression data is obtained using the hpar package, which is based on the The Human Protein Atlas database. The genes "non detected" in the target tissue (reliability = supportive) are neglected.
The network-table generated with this function can be customized based on several criteria.
For instance, undesired nodes can be removed or replaced using the functions "MS_RemoveNode( )" or "MS_ReplaceNode( )" respectively. Also, the network can be filtered according to different topological parameters (e.g. node betweenness) using the function "MS_FilterNetwork( )".

Arguments
# metabo_paths: character vector containing the KEGG IDs of the metabolic pathways of interest (organism-specific). For example, the KEGG ID for the pathway "glycolysis/gluconeogenesis" in the rat is "rno00010". See the function "MS_FindKEGG( )".
# signaling_paths: character vector containing the KEGG IDs for the signaling pathways of interest (organism-specific). For example, the KEGG ID for the pathway "insulin signaling pathway" in the rat is "rno04910".
# organism_name: character vector containing the common name of the organism of interest (e.g. "rat", "mouse", "human", "zebrafish") or taxonomy id. For more details, check: http://docs.mygene.info/en/latest/doc/data.html#species. This argument is only required when filtering genes by tissue expression.
# tissue: character vector containing the name(s) of the target tissue(s). By default, tissue = "all" indicating that signaling gene nodes will not be filtered by tissue expression. Otherwise, possible tissues are those included in the dataset hpaNormalTissue (see levels(hpaNormalTissue$Tissue).

Value
A two-column matrix where each row represents an edge between two nodes.

Note
Reaction directionality reported in KEGG has been cross-validated with published literature.
For more details, check the dataset directionality_reactions.

Description
This function generates a distance matrix containing the length of all shortest paths from a set of genes (or reactions) to a set of metabolites. The shortest path length between two nodes is defined as the minimum number of edges between these two nodes.

Arguments
# network_table: two-column matrix where each row represents an edge between two nodes.
# organism_code: character vector containing the KEGG code for the organism of interest. For example the KEGG code for the rat is "rno". See the function "MS_FindKEGG( )".
# mode: character constant indicating whether a directed or an undirected network will be considered. "all" indicates that all the edges of the network will be considered as undirected.
"out" indicates that all the edges of the network will be considered as directed. "SP" indicates that all network will be considered as directed except the edges linked to target metabolite, which will be considered as undirected. The difference between the "out" and the "SP" options is that the latter aids reaching target metabolites that are substrates of irreversible reactions.
# source_genes: character vector containing the genes from which the shortest paths will be calculated. All input genes need to have the same ID format. Possible ID formats are: entrez IDs, official gene symbols, or gene nodes of the network (i.e. KEGG orthology IDs or KEGG gene IDs). The latter option allows reducing the time required to compute this function. Entrez IDs or gene symbols can be transformed into KEGG IDs using the function "MS_GetKEGG_GeneID( )". By default, genes = "all" indicating that all genes or reactions of the network will be used.
# target_metabolites: character vector containing the KEGG IDs of the metabolites to which the shortest paths will be calculated. Compound KEGG IDs can be obtained using the function "MS_FindKEGG( )". By default, metabolites = "all", indicating that all metabolites of the network will be used.
# names: logical scalar indicating whether the metabolite IDs or gene KEGG IDs will be transformed into common metabolite names or gene symbols. Reaction IDs remain unchanged. By default, names = FALSE.

Value
A matrix containing the shortest path length from the genes or reactions (in the rows) to the metabolites (in the columns). For unreacheable metabolites Inf is included.

##### MetaboSignal_NetworkCytoscape #### Description
This function allows calculating the shortest paths from a set of genes to a set of metabolites, and representing them as a network-table (i.e. two-column matrix). By default, the function exports a network file ("CytoscapeNetwork.txt") and two attribute files ("CytoscapeAttributesType.txt", "CytoscapeAttributesTarget.txt"), which can be imported into cytoscape to visualize the network. The first attribute file allows customizing the nodes of the network based on the molecular entity they represent: metabolic-genes, signaling-genes, or metabolites. The second attribute file allows discriminating the source_genes and the target_metabolites ("target") from any other node ("untarget") of the network.
The network-table generated with this function can be further customized based on different criteria. For instance, undesired nodes can be removed or replaced using the functions "MS_RemoveNode( )" or "MS_ReplaceNode( )" respectively. The final version of the networktable can be used to generate new cytoscape files using the function "MS_ToCytoscape( )".

Arguments
# network_table: two-column matrix where each row represents an edge between two nodes.
# organism_code: character vector containing the KEGG code for the organism of interest. For example the KEGG code for the rat is "rno". See the function "MS_FindKEGG( )".
# source_genes: character vector containing the genes from which the shortest paths will be calculated. All input genes need to have the same ID format. Possible ID formats are: entrez IDs, official gene symbols, or gene nodes of the network (i.e. KEGG orthology IDs or KEGG gene IDs). The latter option allows reducing the time required to compute this function. Entrez IDs or gene symbols can be transformed into KEGG IDs using the function "MS_GetKEGG_GeneID( )".
# target_metabolites: character vector containing the KEGG IDs of the metabolites to which the shortest paths will be calculated. Compound KEGG IDs can be obtained using the function "MS_FindKEGG( )".
# mode: character constant indicating whether a directed or an undirected network will be considered. "all" indicates that all the edges of the network will be considered as undirected.
"out" indicates that all the edges of the network will be considered as directed. "SP" indicates that all network will be considered as directed except the edges linked to target metabolite, which will be considered as undirected. The difference between the "out" and the "SP" options, is that the latter aids reaching target metabolites that are substrates of irreversible reactions.
# type: character constant indicating whether all shortest paths or a single shortest path will be considered when there are several shortest paths between a source_gene and a target_metabolite. If type = "all", all shortest paths will be considered. If type = "first" a single path will be considered. If type = "bw" the path with the highest betweenness score will be considered. The betweenness score is calculated as the average betweenness of the gene nodes of the path. Note that using type = "bw" increases the time required to compute this function. By default, type = "first".
# names: logical scalar indicating whether the metabolite IDs or gene KEGG IDs will be transformed into common metabolite names or gene symbols. Reaction IDs remain unchanged. By default, names = FALSE.
# export_cytoscape: logical scalar indicating whether network and attribute cytoscape files will be generated and exported. By default, export_cytoscape = TRUE.
# file_name: character vector that allows customizing the name of the exported files. By default, file_name = "Cytoscape".

Value
A two-column matrix where each row represents an edge between two nodes. By default, the function also generates a network file ("CytoscapeNetwork.txt") and two attribute files ("CytoscapeAttributesType.txt", "CytoscapeAttributesTarget.txt"), which can be imported into cytoscape to visualize the network.

Note
The network-table generated with this function can be also visualized in R using the igraph package. The network-table can be transformed into an igraph object using the function "graph.data.frame( )" from igraph.

MetaboSignal support functions ##### MS_FindKEGG #### Description
This function returns a list of entries corresponding to one of the following KEGG databases: "compound", "organism", "pathway". It can also find entries with matching query keywords in a given database.
# match: character vector containing one or more elements (i.e. key words) to be matched as compound names.
# organism_code: character vector containing the KEGG code for the organism of interest. For example the KEGG code for the rat is "rno". This argument is only required for KEGG_database = "pathway".

Value
By default, a matrix where each row contains the KEGG entries of the database of interest When using the option "match" a list is returned, each list element containing information of matched entries.

##### MS_FindMappedNodes #### Description
This function can be used to find out if a set of genes or metabolites of interest can be mapped onto the MetaboSignal network.

Arguments
# nodes: character vector containing the IDs of the genes or the metabolites to be mapped onto the network. All IDs need to correspond to the same molecular entity (i.e. gene or metabolite). For metabolites, KEGG IDs are required (see function "MS_FindKEGG( )"). For genes, entrez IDs or official symbols can be used, but note that all genes need to be in the same ID format (i.e. entrez or symbols). It is preferable to use entrez IDs rather than gene symbols, since some gene symbols are not unique.
# network_table: two-column matrix where each row represents and edge between two nodes.
# organism_code: character vector containing the KEGG code for the organism of interest. For example the KEGG code for the rat is "rno". This argument is only required for KEGG_database = "pathway".

Value
A list reporting which genes or metabolites can or cannot be mapped onto the network.

Arguments
# genes: character vector containing the entrez ID or official symbols of the genes of interest.
All genes need to be in the same ID format (i.e. entrez or symbols). It is preferable to use entrez IDs rather than gene symbols, since some gene symbols are not unique.
# organism_code: character vector containing the KEGG code for the organism of interest. For example the KEGG code for the rat is "rno". This argument is only required for KEGG_database = "pathway".
# organism_name: character vector containing the common name of the organism of interest (e.g. "rat", "mouse", "human", "zebrafish") or taxonomy id. For more details, check: http://docs.mygene.info/en/latest/doc/data.html#species. This argument is only required when gene symbols are used.
# output: character constant indicating whether the function will return a vector containing mapped and transformed KEGG IDs (output = "vector"), or a matrix containing both mapped entrez IDs or gene symbols and their corresponding KEGG IDs (output = "matrix"). By default, output = "vector".
# orthology: logical scalar indicating whether the genes IDs will be transformed into orthology IDs or into organism-specific KEGG gene IDs. By default, orthology = TRUE.

Value
A character vector containing mapped and transformed KEGG IDs or a matrix containing both mapped entrez IDs or gene symbols and their corresponding KEGG IDs.

##### MS_ChangeNames #### Description
This function allows transforming KEGG IDs of genes or metabolites into their corresponding common names (for metabolites) or symbols (for genes).
# organism_code: character vector containing the KEGG code for the organism of interest. For example the KEGG code for the rat is "rno". See the function "MS_FindKEGG( )". This argument is ignored when nodes are metabolites.

Value
A character string or a matrix containing the common metabolite names or gene symbols corresponding to the input KEGG IDs. Reaction IDs remain unchanged.

Arguments
# network_table: two-column matrix where each row represents and edge between two nodes.
# source_node: character vector containing the ID of the node from which the shortest paths will be calculated.
# target_node: character vector containing the ID of node to which the shortest path will be calculated.
# mode: character constant indicating whether a directed or an undirected network will be considered. "all" indicates that all the edges of the network will be considered as undirected.
"out" indicates that all the edges of the network will be considered as directed. "SP" indicates that all network will be considered as directed except the edges linked to target metabolite, which will be considered as undirected. The difference between the "out" and "SP" options, is that the latter aids reaching target metabolites that are substrates of irreversible reactions. By default, mode = "SP".
# type: character constant indicating whether all shortest paths or a single shortest path will be considered when there are several shortest paths between a source_gene and a target_metabolite. If type = "all", all shortest paths will be considered. If type = "first" a single path will be considered. If type = "bw" the path with the highest betweenness score will be considered. The betweenness score is calculated as the average betweenness of the gene nodes of the path. Note that using type = "bw" increases the time required to compute this function. By default, type = "first".

Value
A vector or a matrix where each row contains a shortest path from the source_node to the target_node. KEGG IDs can be transformed into common names using the function "MS_ChangeNames( )".
# mode: character constant indicating whether a directed ("out") or undirected ("all") network will be considered. By default, mode = "all".
# normalized: logical scalar indicating whether to normalize the betweenness scores. If TRUE, normalized betweenness scores will be returned. If FALSE, raw betweenness scores will be returned. By default, normalized = TRUE.

Value
A numeric vector containing the betweenness of each node of the network. The function also produces and histogram showing the distribution of node betweenness.

##### MS_FilterNetwork #### Description
This function allows reducing the dimensionality of a network, by removing nodes that do not meet the established distance and/or node betweenness criteria.

Arguments
# network_table: two-column matrix where each row represents and edge between two nodes.
# mode: character constant indicating whether a directed ("out") or undirected ("all") network will be considered. By default, mode = "all".
# type: character constant used to establish the criteria for filtering the network. "bw" indicates that edges (i.e. rows of the network_table) containing at least one node with betweenness below bw_th will be neglected. "distance" indicates edges containing at least one node with shortest path length to the target_node above distance_th will be neglected. "all" indicates that edges containing at least one node with either betweenness below bw_th or distance above distance_th, will be neglected.
# target_node: character vector containing the ID of the node to which the distances will be calculated.
# distance_th: numeric value corresponding to the distance threshold. Nodes with shortest path length to the target_node above this threshold will be removed from the network-table.
# bw_th: numeric value corresponding to the normalized-betweenness threshold. Nodes with betweenness below this threshold will be removed from the network-table. See also "MS_NodeBW( )".

Value
A two-column matrix where each row represents an edge between two nodes.

Arguments
# nodes: character vector containing the node IDs to be removed.
# network_table: two-column matrix where each row represents and edge between two nodes.

Value
A two-column matrix corresponding to the input network-table without the undesired nodes.

Arguments
# node1: character vector containing the node IDs to be replaced.
# node2: character vector containing the ID that will be used as a replacement.
# network_table: two-column matrix where each row represents and edge between two nodes.

Value
A two-column matrix corresponding to the input network-table with replaced nodes.

##### MS_ToCytoscape #### Description
The function exports a network file ("CytoscapeNetwork.txt") and two attribute files ("CytoscapeAttributesType.txt", "CytoscapeAttributesTarget.txt"), which can be imported into cytoscape to visualize the network. The first attribute file allows customizing the nodes of the network based on the molecular entity they represent: metabolites, metabolic-genes, or signaling-genes. The second attribute file allows discriminating a set of nodes of interest ("target") from any other node ("untarget") of the network.

Arguments
# network_table: two-column matrix where each row represents and edge between two nodes.
# organism_code: character vector containing the KEGG code for the organism of interest. For example the KEGG code for the rat is "rno". See the function "MS_FindKEGG( )".
# names: logical scalar indicating whether the metabolite or gene KEGG IDs will be transformed into common metabolite names or gene symbols. Reaction IDs remain unchanged. By default, names = TRUE.
# target_nodes: character vector containing the IDs of the target nodes to be discriminated from the other nodes of the network.This argument is optional.
# file_name: character vector that allows customizing the name of the exported files. By default, the file_name = "Cytoscape".

Value
A data frame where each row represents an edge between two nodes. The function also generates and exports a network file ("CytoscapeNetwork.txt") and two attribute files ("CytoscapeAttributesType.txt", "CytoscapeAttributesTarget.txt"), which can be imported into cytoscape to visualize the network.

MetaboSignal workflow
In order to illustrate the functionality of our package, we have used transcriptomic and metabonomic datasets from white adipose tissue of rat congenic strains derived from the diabetic Goto-Kakizaki (GK) and normoglycemic Brown-Norway (BN) rats.
We next describe how MetaboSignal was used to provide a mechanistic explanation of the gene-metabolite associations found in this study. As an example, we will use the associations between the genes: G6pc3 (303565), Ship2 (65038) or Ppp2r5b (309179) with D-glucose.

1) Define input data
We built a rat-specific MetaboSignal network by merging two metabolic pathways: "glycolysis" and "inositol phosphate metabolism", with two signaling pathways: "insulin signaling pathway" and "PI3K-Akt signaling pathway".
We used the function "MS_FindKEGG( )" to find the "organism_code" of the rat and the IDs of the pathways of interest.

2) Build MetaboSignal network-table
We used the selected metabo_paths and signaling_paths to build a MetaboSignal networktable. Since we focused on an adipose tissue dataset, we decided to exclude signaling genes not expressed in soft tissue.
In total, 24 signaling genes were excluded from the network for not being expressed in adipocytes. Examples of these genes are: Kit, Fgfr3, Elk1, Csf3r, Il4r, Nos3.

4) Build distance matrix
We used the function "MetaboSignal_distances( )" to calculate the shortest path lengths from our genes of interest: G6pc3 (303565), Ship2 (65038) or Ppp2r5b (309179) to D-glucose ("cpd:C00031"). We used the default mode (mode = "SP"), which indicates that all the network is considered as directed, except the edges linked to the target metabolite (in this case Dglucose) that are considered as undirected. This option is designed to reach metabolites acting as substrates of irreversible reactions, while keeping the directionality of the network.