CluePedia Cytoscape plugin: pathway insights using integrated experimental and in silico data

Summary: The CluePedia Cytoscape plugin is a search tool for new markers potentially associated with pathways. CluePedia calculates linear and non-linear statistical dependencies from experimental data. Genes, proteins and miRNAs can be connected based on in silico and/or experimental information and integrated into a ClueGO network of terms/pathways. Interrelations within each pathway can be investigated, and new potential associations may be revealed through gene/protein/miRNA enrichments. A pathway-like visualization can be created using the Cerebral plugin layout. Combining all these features is essential for data interpretation and the generation of new hypotheses. The CluePedia Cytoscape plugin is user-friendly and has an expressive and intuitive visualization. Availability: http://www.ici.upmc.fr/cluepedia/ and via the Cytoscape plugin manager. The user manual is available at the CluePedia website. Contact: bernhard.mlecnik@crc.jussieu.fr or jerome.galon@crc.jussieu.fr Supplementary information: Supplementary data are available at Bioinformatics online.


Summary
CluePedia provides insights into pathways by integrating experimental and in silico information.
CluePedia extends ClueGO [1] functionality down to genes and miRNAs. While ClueGO reveals interrelations of terms and functional groups in biological networks, CluePedia makes it possible to enrich those networks with known and experimental data.
CluePedia calculates statistical dependencies (correlation) for markers of interest from experimental data. Four tests investigating linear and nonlinear dependencies between variables are implemented: Pearson correlation, Spearman's rank, Distance correlation [2] and Maximal Information Coefficient (MIC) [3]. The resulting file is added to CluePedia as an additional resource for further analysis.
Experimental data can be normalized and visualized, between adjustable thresholds, as labels on the network nodes. Relevant signals with a certain expression level, standard deviation and without missing values can be selected. Another feature allows the expression values corresponding to selected markers, e.g. genes associated with a pathway, to be extracted from a dataset into a new file.
One major advantage is the possibility to investigate a pathway in detail by combining known [4,5] and new, experimentally derived information about the genes/proteins involved. Moreover, miRNAs [6,7] that could influence the expression of the genes can be visualized together with the expression data. The location of the genes/proteins [8,9] within cellular compartments can be displayed automatically through the Cerebral [10] layout implemented within CluePedia.
CluePedia provides an ID index file (updatable, extendable) that stores entrez gene ids and symbols and allows fast analysis.

Installation
System Requirements:
• Windows, Linux, Unix or MacOS operating system.
• 2048 MB RAM needed, 4096 MB RAM recommended.
• Hard disk with at least 1 GB free.
CluePedia is a Cytoscape plugin and works together with the ClueGO plugin, so both CluePediaPlugin.jar and ClueGOPlugin.jar must be copied into the plugins folder of Cytoscape. CluePedia and ClueGO source files will be stored in the .cluegoplugin folder. This folder is created in the user home folder at the first startup. If this folder is removed or its content damaged, it will be recreated automatically at the next startup, with the initial configuration.
At the first run, CluePedia takes several minutes to initialize.
For miRNA analysis, please download the initially available miRNA annotations (see Upload new available example and ontology files or new organisms).

Documentation
After starting Cytoscape, CluePedia and ClueGO can be found under the Plugins menu (Fig 1).

Figure 1: Start ClueGO+CluePedia Plugins
CluePedia (Fig 2) can be accessed by selecting the Gene analysis mode:
• The two analysis modes, Functional analysis (ClueGO) and Gene based analysis (CluePedia), can be used in combination.
• Single and compared cluster analysis types are available for both analysis modes.
• The ID type is automatically identified for supported organisms.
• The identifiers can be uploaded from File, Text Field or from an opened Network.
• Start. Genes are mapped against the ID index file. The extracted info is displayed in the Gene Info Panel. Genes are shown as nodes of the network.
• The Cytoscape Free Memory Bar estimates the remaining memory for Cytoscape.

CluePedia Panels
After mapping a set of genes, a Gene network, a Gene Info panel and a CluePedia panel are created (Fig 3).

Figure 3: CluePedia Panels
Gene Info Panel:
• Shows EntrezGeneID, symbol and gene aliases for found genes.
• Genes that were not found are listed as well.
• Gene information can be saved directly (copy/paste) or, using Cytoscape features, from the Node attribute table.
• Close Gene Network. Closes the network and associated tables.
CluePedia Panel:
• Includes files with interaction and prediction scores.
• CluePedia built-in statistical tests for calculating correlation weights from custom data.
• Scores and directed and undirected edges refinement options.
• Nodes and edges enrichment, adding, removing and update features.
If a gene is located in several compartments, those levels will be displayed together as a new category.

I. Gene/miRNA expression visualization/saving options
The cellular location is visualized using the Cerebral layout (Fig 6). The cellular compartments are now better defined, because the number of GO terms associated with each compartment was increased. As a result, more markers are mapped.
In previous CluePedia versions, unmapped markers were randomly associated with a compartment. They are now placed in an "annotation definition not found" group, which is positioned outside the other compartments. The user can check the unmapped markers and add any missing GO terms to the .properties file.

Expression data visualization
Expression data can be visualized as a node label (Show expression data).
After opening the custom file with expression data, the number of found genes and experiments/time points is shown. Several visualization options are available for the user. If two or more filters are set, only genes passing all filters will be visualized (AND relationship). The filtering and the normalization are done with the complete dataset.

• Number of experiments present >=
Specifies the minimum number of experiments with data for a gene to be selected. Genes with more missing values than the threshold allows will be discarded (Fig 9a). E.g. if the number of experiments present is set to 6, 3 out of 5 genes will be selected.

• Standard deviation >=
Selects the genes whose standard deviation is at least the set threshold. In this way, the genes that change most over the experiments can be visualized.
• Number of experiments to pass
Selects genes with a specified expression level. E.g. if the expression of a gene ranges from -5 to 5 in a set of 6 experiments and the threshold is set to 3, the number of experiments passing is 4 within the threshold (-3 < expression < 3) and 2 beyond the threshold (expression < -3 or expression > 3) (Fig 9b).
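The three filters above, combined with an AND relationship, can be sketched as follows. This is a minimal illustration and not CluePedia's own code: the function names, the use of `None` for a missing value and the choice of the sample standard deviation are all assumptions.

```python
from statistics import stdev


def count_within_beyond(values, level):
    """Count experiments within (-level, level) and beyond it (hypothetical helper)."""
    present = [v for v in values if v is not None]
    within = sum(1 for v in present if abs(v) < level)
    beyond = sum(1 for v in present if abs(v) > level)
    return within, beyond


def passes_filters(values, min_present=0, min_sd=None,
                   level=None, min_pass=0, beyond=True):
    """Return True only if the gene passes every active filter (AND relationship)."""
    present = [v for v in values if v is not None]  # None marks a missing value
    # Filter 1: number of experiments present >= min_present
    if len(present) < min_present:
        return False
    # Filter 2: standard deviation >= min_sd (sample SD, an assumption)
    if min_sd is not None:
        if len(present) < 2 or stdev(present) < min_sd:
            return False
    # Filter 3: number of experiments within or beyond the expression level
    if level is not None:
        n_within, n_beyond = count_within_beyond(present, level)
        if (n_beyond if beyond else n_within) < min_pass:
            return False
    return True


# The example from the text: 6 experiments ranging from -5 to 5, level 3.
gene = [-5, -2, -1, 1, 2, 5]
print(count_within_beyond(gene, 3))
print(passes_filters(gene, min_present=6, min_sd=3.0, level=3, min_pass=2))
```

With these values, 4 experiments fall within the threshold and 2 beyond it, matching the example given for Fig 9b.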

Figure 9: Expression filters example
The user can select and visualize one or several experimental conditions.
Data can be normalized. For normalization, the mean is subtracted from each value and the result is divided by the standard deviation.
Several customizable color schemes are available.
The user can set the max level of the visualized expression data. This absolute value will define the upper and lower threshold of the visualized expression interval.
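A minimal sketch of the normalization and of the max level setting described above. It assumes that normalization is applied per gene across experiments, that the sample standard deviation is used, and that values outside the interval are clipped to its bounds; the manual does not state these details, and the function names are hypothetical.

```python
from statistics import mean, stdev


def normalize(values):
    """Subtract the mean and divide by the standard deviation (z-score)."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation, an assumption
    if s == 0:
        return [0.0 for _ in values]  # constant profile: nothing to scale
    return [(v - m) / s for v in values]


def clip_to_max_level(values, max_level):
    """Restrict values to the interval [-max_level, +max_level] (assumed behaviour)."""
    return [max(-max_level, min(max_level, v)) for v in values]


z_scores = normalize([2.0, 4.0, 6.0])
clipped = clip_to_max_level([-3.5, 0.2, 4.0], 2.0)
print(z_scores)
print(clipped)
```

A single absolute value is enough to define the interval because the lower threshold is simply its negative.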
The filtering and the normalization can be applied to the entire dataset or to the selected genes only. For a general exploration, it is recommended to create an expression matrix from the entire dataset. This is done once, and the matrix can then be used to investigate e.g. a large network, as long as there is enough memory. The other option, the extraction and visualization of selected genes only, consumes less memory. In this case, the extraction and normalization of the data have to be repeated if other genes become of interest.
Example of normalized expression (Fig 10) using normal colon mucosa data. All selected experiments and all spots corresponding to a gene will be displayed as a node label.

Improvements of the expression data upload available starting with CluePedia v1.7
Files containing one or several columns with identifiers can be imported. The order of the identifier columns in the file is not important, but one of those columns must contain unique identifiers (no duplicate ids). The column names should be unique as well. All columns containing numerical data will be considered as experiments. When analyzing the entire dataset, it is important to exclude numerical identifiers e.g. entrezGeneIDs.
The user has to choose the column with unique identifiers (in the example, the column "NAME") that will be used to map and select markers. Further, the user can select, analyse and visualize all expression data or just a part of it. In this way, one or several experimental time points of interest can be investigated from a large dataset.
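The column rules above can be illustrated with a small sketch that classifies the columns of a tab-delimited table. The function names and the example table are hypothetical, and the tab delimiter is an assumption. The example shows why a numerical identifier column such as EntrezGeneID has to be excluded by hand: it is indistinguishable from an experiment column.

```python
import csv
import io


def is_number(text):
    """True if the cell can be parsed as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def classify_columns(table_text, delimiter="\t"):
    """Return (numeric columns, candidate unique-identifier columns)."""
    rows = list(csv.reader(io.StringIO(table_text), delimiter=delimiter))
    header, body = rows[0], rows[1:]
    if len(set(header)) != len(header):
        raise ValueError("column names must be unique")
    numeric, unique_ids = [], []
    for i, name in enumerate(header):
        column = [row[i] for row in body]
        filled = [cell for cell in column if cell != ""]  # empty cell = missing value
        if filled and all(is_number(cell) for cell in filled):
            numeric.append(name)      # would be treated as an experiment
        elif len(set(column)) == len(column):
            unique_ids.append(name)   # no duplicates: usable for mapping
    return numeric, unique_ids


table = "\n".join([
    "NAME\tSYMBOL\tEntrezGeneID\tT0\tT1",
    "p1\tTP53\t7157\t1.5\t2.0",
    "p2\tTP53\t7157\t0.5\t",
    "p3\tCD40\t958\t-1.0\t0.3",
])
numeric, unique_ids = classify_columns(table)
print(numeric)
print(unique_ids)
```

Here only NAME qualifies as the unique identifier column, since SYMBOL contains a duplicate.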
For example, a GEO expression matrix downloaded as a zip archive can be uploaded in CluePedia and directly used for correlation calculation or expression data visualization. CluePedia automatically selects the data matrix. Comment lines (marked with "!") are excluded.
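Reading such an archive can be sketched with the Python standard library. This is an illustration only: taking the first archive member as the data matrix, the tab delimiter and the file name used in the demo are assumptions, and CluePedia's actual selection logic is not described in this manual.

```python
import csv
import io
import zipfile


def read_geo_matrix(zip_bytes):
    """Return (header, rows) of the expression matrix stored in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        member = archive.namelist()[0]  # assumption: first member is the matrix
        text = archive.read(member).decode("utf-8")
    # Drop empty lines and comment lines marked with "!"
    data_lines = [line for line in text.splitlines()
                  if line and not line.startswith("!")]
    rows = list(csv.reader(data_lines, delimiter="\t"))
    return rows[0], rows[1:]


# Build a tiny in-memory archive to demonstrate the parsing.
content = (
    '!Series_title\t"demo"\n'
    "!series_matrix_table_begin\n"
    '"ID_REF"\t"GSM1"\t"GSM2"\n'
    '"p1"\t1.5\t2.0\n'
    "!series_matrix_table_end\n"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example_series_matrix.txt", content)

header, rows = read_geo_matrix(buffer.getvalue())
print(header)
print(rows)
```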

Close expression dataset
If not needed anymore, the expression dataset should be closed. Closing the file increases the free Cytoscape memory.

Expression data extraction
The expression corresponding to selected genes can be extracted from an input file (Fig 13).
The info will be saved in the output file. If the input file contains several spots for a gene, all will be extracted. The number of found and not found genes is displayed in a dialog.
If the enrichment is made for multiple genes, a data store containing all possible interactions of those genes is created. Interaction scores are extracted from all source files selected. The enrichment can be done for all or for each of the initial genes. In the first case, after sorting all interaction scores, the top ones are included in the network. Several criteria (All, Common, Specific) can be used to refine the selection (Fig 16). Similarly, interaction scores corresponding to each gene are sorted to select top related genes. The initial network will include different
If the enrichment is made in a network of terms and genes, the enriched genes that are known to be involved in one of the functions will be automatically linked to it and displayed in the same color. Highly connected markers (hubs) with selected genes can be added to the network/pathway using the enrichment function (Fig 16). Positive and negative interaction scores can be selected and visualized on the network (Fig 18).
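The two enrichment modes described above (for all initial genes together, or for each initial gene) can be sketched as a ranking over interaction scores. The data structure, the function names and the gene labels are hypothetical, and sorting by descending score is an assumption about how "top" is defined.

```python
def candidate_links(scores, initial):
    """Yield (initial_gene, new_gene, score) for links joining the list to a new gene."""
    for (a, b), score in scores.items():
        if a in initial and b not in initial:
            yield a, b, score
        elif b in initial and a not in initial:
            yield b, a, score


def enrich_all(scores, initial, top_n):
    """Sort all scores together and return the top_n best-scoring new genes."""
    ranked = sorted(candidate_links(scores, initial),
                    key=lambda link: link[2], reverse=True)
    added = []
    for _, new_gene, _ in ranked:
        if len(added) >= top_n:
            break
        if new_gene not in added:
            added.append(new_gene)
    return added


def enrich_each(scores, initial, top_n):
    """Sort the scores of each initial gene separately and keep its top_n partners."""
    result = {}
    for gene in sorted(initial):
        ranked = sorted((link for link in candidate_links(scores, initial)
                         if link[0] == gene),
                        key=lambda link: link[2], reverse=True)
        result[gene] = [new_gene for _, new_gene, _ in ranked[:top_n]]
    return result


initial = {"A", "B"}
scores = {("A", "X"): 0.9, ("A", "Y"): 0.5, ("B", "Y"): 0.8,
          ("B", "Z"): 0.4, ("X", "Z"): 0.95}
print(enrich_all(scores, initial, top_n=2))
print(enrich_each(scores, initial, top_n=1))
```

The link between X and Z is ignored in both modes because neither gene belongs to the initial list.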
Add (Include a new set of markers to the current selection)
Figure 19: Add
Allows genes/miRNAs of interest to be included in the network (Fig 19). The user can test whether the IDs are found in the ID index file, and then add them to the network. The color of nodes representing added genes can be set by the user.

Remove (Discard enriched/added markers)
Selected genes/miRNAs (markers from initial list as well as added and enriched markers) or enriched genes/miRNAs (enriched and added markers only) can be removed from the network (Fig 20). Applies on the network the latest settings of the user.

Interactions
Markers can be linked using in silico and/or experimentally derived information. CluePedia comes with known human and mouse interaction data. Users can create their own interaction files using experimental data (see Edge files, advanced options).

Known interactions
Files with known interactions are based on publicly available data from STRING, miRBase, miRRecords and other resources (see the Ontologies and in silico data sources section). The available precompiled files are listed (Fig 21a). Each file is loaded after selection (Fig 21b).
If not needed anymore, the file should be closed (for more Cytoscape free memory) (Fig 21c).
The file can be deleted using right mouse button. If a provided file is deleted (by mistake), it will be restored at the next run of the plugin.

Interactions from custom data
Experimental data (Affymetrix, tissue microarray, FACS etc.) can be used to visualize gene-gene/protein-protein interactions. For data format see Data format.
The interactions can be calculated using the "Create custom file" feature (Fig 22). After importing the data, the user can filter it, as explained in Expression data visualization.
Further, several statistical tests can be applied to calculate interrelations for the entire file or for selected genes only (Fig 23). The user has to choose the column with unique identifiers (in the example, the column "NAME") that will be used to map and select markers. Expression data from zip archives can be uploaded starting with CluePedia v1.7.1 (Fig 25).
For example, a GEO expression matrix downloaded as a zip archive can be uploaded in CluePedia and directly used for correlation calculation or expression data visualization. CluePedia automatically selects the data matrix. Comment lines (marked with "!") are excluded.

Implemented statistical tests
Four correlation methods are implemented in CluePedia:
• Pearson correlation
• Spearman's rank
• Distance correlation
• Maximal Information Coefficient (MIC)
Three levels of correlation analysis are possible:
• Whole input file. Correlation between all the genes from the custom file.
• Selected nodes vs all input file. Correlation between the selected node/nodes and the other genes from the dataset.
• Selected nodes. Correlation between the selected nodes only.
The correlation level selection depends on the analysis type. If the user wants to have an overview of the data, the first choice is the best. Because it is time and memory consuming, this option should be applied for a small dataset. In case of large datasets, the analysis can take very long, so the correlation of selected nodes vs all dataset is recommended. This second option is suitable for finding new genes that have a similar behavior/function with known genes. The last option shows the degree of correlation of selected genes e.g. known to be involved in a pathway.
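The three correlation levels differ only in which gene pairs are compared, which is also why the whole-file level is the most expensive: the number of pairs grows quadratically with the number of genes. The sketch below illustrates this with Pearson and Spearman correlation written in plain Python. It is not CluePedia's implementation; the function names, level labels and example profiles are assumptions, and distance correlation and MIC are omitted.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def gene_pairs(genes, selected, level):
    """Pairs to correlate for each of the three analysis levels."""
    genes, chosen = sorted(genes), sorted(selected)
    if level == "whole":
        return list(combinations(genes, 2))
    if level == "selected_vs_all":
        return [(s, g) for s in chosen for g in genes
                if g != s and (g not in selected or s < g)]
    if level == "selected":
        return list(combinations(chosen, 2))
    raise ValueError(level)


data = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8],
        "C": [4, 3, 2, 1], "D": [1, 4, 9, 16]}
for a, b in gene_pairs(data, {"A"}, "selected_vs_all"):
    print(a, b, round(pearson(data[a], data[b]), 3),
          round(spearman(data[a], data[b]), 3))
```

In the example, D follows A monotonically but not linearly, so its Spearman score is higher than its Pearson score.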

Figure 26: Custom data, positive and negative correlation values
The time needed for correlation depends on the method and level selected, the number of the genes and the computer power. A free memory bar helps to estimate the waiting time.
The resulting file is saved on the hard disk and included in CluePedia as an additional resource useful for further analysis.
Correlation scores are shown on the network as edges. Negative correlation values are represented as sinusoidal lines, for a better visibility (Fig 26). The user can visualize one or several correlation scores.

Interactions visualization (edges)
Edge scores
Once an interaction file is selected, the score type contained is displayed (Fig 21d). A threshold is automatically set to display the top interactions. This threshold as well as the color of the edge are customizable (Fig 21e).

Figure 27: Edge thickness
The edge thickness is scaled between max and min scores found among the set criteria. The smallest score will be displayed as the thinnest line (Fig 27).
• All. Displays all the edges found between genes under set criteria.
• Common. Displays interactions found in all selected files.
• Specific. Interactions found only in one file.
All types of action are selected by default (Fig 32a). The user can choose the type of directed edge to display, e.g. only activation (Fig 32b).
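The All, Common and Specific criteria listed above correspond to simple set operations over the edges found in each selected file. A minimal sketch, in which the file names, gene labels and function names are hypothetical and edges are treated as undirected:

```python
def edge(a, b):
    """Undirected edge between two genes."""
    return frozenset((a, b))


def select_edges(files, criterion):
    """Apply the All / Common / Specific criterion to per-file edge sets."""
    edge_sets = list(files.values())
    if not edge_sets:
        return set()
    union = set().union(*edge_sets)
    if criterion == "all":
        return union                          # found in at least one file
    if criterion == "common":
        return set.intersection(*edge_sets)   # found in every selected file
    if criterion == "specific":
        return {e for e in union
                if sum(e in s for s in edge_sets) == 1}  # found in exactly one file
    raise ValueError(criterion)


files = {
    "binding": {edge("A", "B"), edge("B", "C")},
    "expression": {edge("A", "B"), edge("C", "D")},
}
for criterion in ("all", "common", "specific"):
    print(criterion, sorted(sorted(e) for e in select_edges(files, criterion)))
```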

Download interactions for other organisms/Update interactions
The CluePedia plugin stores human and mouse interaction data. Precompiled files for other organisms can be downloaded by the user (Fig 33). Files corresponding to the selected organism will be proposed for download.
Figure 33: Interaction data, advanced settings
New interaction files will be ready to download for each organism following new releases of the databases used as source (Fig 34). The newly downloaded files will be added to the other CluePedia resources. If there are no new files on the server, the user is informed. A deleted CluePedia file can be uploaded again.

Expand pathways in nested networks
This feature provides insights into pathways. All the genes included in the pathway will be visualized in a nested network. Known and calculated interaction scores between those genes can be compared.
-Initial genes mapped share the color of the term. The rest of the genes included in the pathway are displayed in white.
-The interaction scores used in the initial network will be automatically visualized in the subnetwork. Genes not passing interaction score thresholds are not linked.
-The terms expanded in a nested network will display a small network drawn on their surface.
-The focus on the nested network can be set by selecting the term (Right mouse click, Go to CluePedia Nested Network).
-If the subnetwork is not needed anymore, it should be deleted.

Extract from pathways genes interacting with genes from initial list
After performing a ClueGO functional analysis, genes associated with terms can be visualized together on the network. Several options for gene visualization are available (Fig 36):
• Show all genes (Fig 37).
• Show all associated genes from terms (Fig 39).
• Show only links to associated genes (Fig 40).

Show only selected genes
Displays genes from the initial list found under the used ClueGO settings. The name of the gene is colored in red. Terms and their genes share the color. If a gene is found in two or more terms, it will have two or more colors. Terms are interconnected using the kappa score, while genes are interconnected based on activation scores.

Show all associated genes from terms
This option extracts from the pathway nodes, and visualizes, genes that have an interaction link (known/calculated score) to the initial genes. The example figure was created using the activation score. Interconnections between those new genes are visualized as well.

Show only links to associated genes
This feature simplifies the interrelation view with regard to the initial genes, compared to Show all associated genes from terms. It suppresses the links between genes newly extracted from pathway nodes. Only links between initial genes and the new genes are kept. E.g. ICAM1 is linked to both CD40LG and CD40. The link between CD40LG and CD40 is not shown.
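The suppression rule can be sketched as an edge filter that keeps a link only if at least one of its endpoints belongs to the initial list. The function name is hypothetical, and treating ICAM1 as the initial gene in the example is an assumption made for illustration.

```python
def links_to_initial(edges, initial):
    """Keep only edges that touch an initial gene; links between new genes are dropped."""
    return [(a, b) for a, b in edges if a in initial or b in initial]


initial = {"ICAM1"}  # assumed to be the gene from the initial list
edges = [("ICAM1", "CD40LG"), ("ICAM1", "CD40"), ("CD40LG", "CD40")]
print(links_to_initial(edges, initial))
```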

Visualize known interactions
• Set the binding score to 0.9.
• Update network => Only interactions with a binding score > 0.9 are visualized on the network.
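The effect of these two steps, together with the edge thickness scaling described under Edge scores, can be sketched as follows. The linear min-max formula, the width range and the example scores are assumptions; the manual only states that thickness is scaled between the max and min scores and that the smallest score gives the thinnest line.

```python
def visible_edges(edges, threshold):
    """Keep only the edges whose score exceeds the threshold."""
    return {pair: score for pair, score in edges.items() if score > threshold}


def edge_widths(edges, min_width=1.0, max_width=5.0):
    """Scale widths linearly between the smallest and largest visible score."""
    if not edges:
        return {}
    lo, hi = min(edges.values()), max(edges.values())
    if hi == lo:
        return {pair: max_width for pair in edges}  # all scores equal
    scale = (max_width - min_width) / (hi - lo)
    return {pair: min_width + (score - lo) * scale
            for pair, score in edges.items()}


binding = {("A", "B"): 0.95, ("A", "C"): 0.91, ("B", "C"): 0.99, ("C", "D"): 0.5}
shown = visible_edges(binding, 0.9)
widths = edge_widths(shown)
for pair in sorted(widths):
    print(pair, round(widths[pair], 2))
```

Because the scaling uses only the scores that pass the threshold, changing the threshold also changes the relative thickness of the remaining edges.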

Calculate interactions between markers using experimental data
• Select Analysis mode Gene/miRNAs
• Paste genes in Text field
• Click Start => top 20 predicted target genes corresponding to each miRNA.
• Select all miRNAs and genes (Ctrl-A)
• Click Functions (ClueGO+CluePedia panel)
• Click Network (Get genes from network)
• Click Load Attributes

Enrich a network showing two lists of genes/miRNAs
Enrichments can be done on a network comparing two lists of genes/miRNAs. This kind of network displays nodes colored in the selected gradient (e.g. red-white-green).
The new genes, shown as white nodes, will have a colored border corresponding to the enrichment color.

Switch the network and pathway-like view
Click the "Use Cerebral layout" selection box. The genes from the network (Fig 48a) will automatically be displayed in cellular compartments (Fig 48b). Note: the Cerebral layout allows a single edge type between the nodes.
The column names should be unique as well. All columns containing numerical data will be considered as experiments. When analyzing the entire dataset, it is important to exclude numerical identifiers e.g. entrezGeneIDs.
Publicly available expression data Gene Expression Omnibus (GEO, [12]) and ArrayExpress ( [13]

Organism and ID types
The CluePedia plugin stores human and mouse interaction data. The user interested in other organisms can download automatically precompiled files from our website. The download has to be done once, and the time to download differs from organism to organism, depending on the data size. Interaction data will be updated at each new release of the sources used. Data for 20 organisms are available, and upon request other organisms can be added. The ClueGO plugin v1.5 contains 23 organisms.
CluePedia automatically recognizes gene and protein identifiers, symbols and synonyms: AccessionIDs, Affymetrix IDs, EnsProtID, EnsGene/TranscriptID, EntrezGeneID, GI, IPI, RefSeqID Protein, UniProtKB AC, UniProtKB ID, SymbolID, SWIS ID. The automatic recognition of the ID types is made using the provided ID index file. This file stores entrez gene ids and symbols and allows fast analysis and synonym detection. The ID index file can be extended with custom IDs by the user. Using the ID index file, a user query list containing a mixture of identifier types will be automatically mapped.
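How an index of this kind lets a mixed query list be mapped in a single pass can be sketched with a dictionary keyed by every known identifier. The record layout, the function names and the case-insensitive matching are assumptions; the actual format of the ID index file is not described here.

```python
def build_index(records):
    """Map every known identifier (id, symbol, synonym) to its Entrez gene id."""
    index = {}
    for entrez_id, symbol, synonyms in records:
        for key in (str(entrez_id), symbol, *synonyms):
            index.setdefault(key.upper(), entrez_id)  # first entry wins
    return index


def map_query(query, index):
    """Split a mixed query list into found (item -> Entrez id) and not found."""
    found, not_found = {}, []
    for item in query:
        key = item.strip().upper()
        if key in index:
            found[item] = index[key]
        else:
            not_found.append(item)
    return found, not_found


# Illustrative records: (EntrezGeneID, symbol, other identifiers/synonyms)
records = [
    (7157, "TP53", ["P53", "P04637"]),
    (958, "CD40", ["TNFRSF5"]),
]
index = build_index(records)
found, not_found = map_query(["tp53", "P04637", "958", "NOT_A_GENE"], index)
print(found)
print(not_found)
```

Extending the index with custom IDs then amounts to adding further keys that point to an existing Entrez gene id.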

Update Ontologies and Interaction data
ClueGO and CluePedia allow an up to date analysis at any time.
The update functions provide an easy integration of the most recent ontology and information. The source files are automatically downloaded and based on this, new precompiled
An important new feature is the update of the ID Index file, which provides the latest gene information, as compiled from the most recent NCBI resources that are accessed during the update.

Ontologies and in silico data sources
CluePedia and ClueGO plugins contain or provide precompiled files based on the following sources:
• The Gene Ontology (GO) project [8] aims to capture the increasing knowledge on gene function in a controlled vocabulary applicable to all the organisms. GO describes gene products in terms of their associated biological processes, cellular components and molecular functions.
Link: http://www.geneontology.org/
• Kyoto Encyclopedia of Genes and Genomes (KEGG) [9] is a database of biological systems that integrates genomic, chemical and systemic functional information.