StemCellNet: an interactive platform for network-oriented investigations in stem cell biology

Stem cells are characterized by their potential for self-renewal and their capacity to differentiate into mature cells. These two key features emerge through the interplay of various factors within complex molecular networks. To provide researchers with a dedicated tool to investigate these networks, we have developed StemCellNet, a versatile web server for interactive network analysis and visualization. It rapidly generates focused networks based on a large collection of physical and regulatory interactions identified in human and murine stem cells. The StemCellNet web-interface has various easy-to-use tools for selection and prioritization of network components, as well as for integration of expression data provided by the user. As a unique feature, the networks generated can be screened against a compendium of stemness-associated genes. StemCellNet can also indicate novel candidate genes by evaluating their connectivity patterns. Finally, an optional dataset of generic interactions, which provides large coverage of the human and mouse proteome, extends the versatility of StemCellNet to other biomedical research areas in which stem cells play important roles, such as in degenerative diseases or cancer. The StemCellNet web server is freely accessible at http://stemcellnet.sysbiolab.eu.


Stemness signatures
We have included in StemCellNet gene signatures for stemness, i.e. gene sets linked to the capacity of stem cells to both self-renew and differentiate into mature cells. To collect the gene sets associated with stemness, we conducted a literature review and found gene lists related to stem cell signatures published in ten different studies performed in human (Boyer et al., 2005; Assou et al., 2007; Palmer et al., 2012) or mouse (Ivanova et al., 2012; Ramalho-Santos et al., 2002; Fortunel et al., 2003; Gaspar et al., 2012; Chia et al., 2010; Ding et al., 2009; Hu et al., 2009) samples.
Altogether, twenty distinct gene sets were derived from three types of study: ChIP-chip experiments detecting target genes activated by one of the core transcription factors Nanog, Pou5f1 (Oct4) and Sox2, or co-activated by all three transcription factors (Boyer et al., 2005); gene expression studies identifying genes up-regulated in stem cells compared to other cell types (Assou et al., 2007; Palmer et al., 2012; Ivanova et al., 2012; Ramalho-Santos et al., 2002; Fortunel et al., 2003; Gaspar et al., 2012); and large-scale functional RNAi screens detecting genes whose knock-down leads to loss of stem cell markers (Chia et al., 2010; Ding et al., 2009; Hu et al., 2009). See supplementary table S1 for an overview of the currently included stemness signatures.

Stem-cell specific physical protein interactions
Stem-cell specific physical protein interactions (PPI) were extracted from eight studies (Wang et al., 2006; Liang et al., 2008; Kim et al., 2010; Pardo et al., 2010; van den Berg et al., 2010; Ding et al., 2012; Gao et al., 2012; Nitzsche et al., 2011) that applied either affinity purification or mass spectrometry methods to the selected proteins. All protein interactions identified and described in the published papers were extracted from the published supplementary information and uploaded to StemCellNet.
Proteins were identified by the corresponding Entrez Gene ID. Supplementary table S2 presents the sources and further details for the collected stem-cell specific physical protein interactions.
In the paper by Chen and colleagues, the binding affinity between transcription factors and target genes was reported as quantitative values, and the authors did not specify a threshold. Therefore, as a stringent measure, we applied a separate cutoff for each TF so as to keep only the top-scoring 25% of genes for each transcription factor. The minimum binding affinities selected were 0.221, 0.365, 0.278, 0.291, 0.400, 0.968, 0.730, 0.976, 0.878, 0.900, 0.976, 0.964, 0.970 and 0.534 for Nanog, Pou5f1, Sox2, Smad1, Stat3, Klf4, Myc, Esrrb, Tcfcp2l1, Zfx, E2f1, Suz12 and Ctcf, respectively. For the data extracted from the remaining ten studies [Kim et al., 2010; Marson et al., 2008; Kim et al., 2008; Loh et al., 2006; Boyer et al., 2006; Tam et al., 2008; Han et al., 2010; Mathur et al., 2008], we selected the probes defined as bound to the transcription factor by the authors in the published papers.
For all the studies and for each TF individually, we excluded probes considered not to have binding affinity to the transcription factor and then filtered the probes according to unique gene IDs, in order to keep only one interaction for each TF-gene pair in each study. Genes with no known Entrez Gene ID and probes corresponding to cloning artifacts withdrawn by NCBI were excluded from the input to StemCellNet.
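The per-TF cutoff described above can be illustrated with a minimal sketch. It assumes the scores are available as (TF, gene, affinity) tuples; the function names, the placeholder TF labels and the toy values are hypothetical and are not taken from the original analysis.

```python
from collections import defaultdict

def top_quartile_cutoffs(scores):
    """Return {tf: minimum affinity kept} so that the top 25% of genes remain.

    scores: iterable of (tf, gene, affinity) tuples (assumed input format).
    """
    by_tf = defaultdict(list)
    for tf, _gene, affinity in scores:
        by_tf[tf].append(affinity)
    cutoffs = {}
    for tf, values in by_tf.items():
        ranked = sorted(values, reverse=True)
        n_keep = max(1, len(ranked) // 4)  # top 25%, but at least one gene
        cutoffs[tf] = ranked[n_keep - 1]   # lowest score still inside the top 25%
    return cutoffs

def filter_top_quartile(scores):
    """Keep only the entries at or above the cutoff of their own TF."""
    scores = list(scores)
    cutoffs = top_quartile_cutoffs(scores)
    return [(tf, gene, a) for tf, gene, a in scores if a >= cutoffs[tf]]

# Toy data with placeholder names (not real measurements).
toy_scores = [("TF_A", f"gene{i}", a)
              for i, a in enumerate([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])]
toy_scores += [("TF_B", f"gene{i}", a)
               for i, a in enumerate([0.2, 0.4, 0.6, 0.9])]

cutoffs = top_quartile_cutoffs(toy_scores)
kept = filter_top_quartile(toy_scores)
```

Because the comparison is inclusive, genes tied at the cutoff are all retained, so slightly more than 25% can survive when scores repeat.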
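As a rough illustration of this filtering step, the sketch below reduces probe-level records to one interaction per TF-gene pair and study. The record layout (keys `study`, `tf`, `entrez_id`, `bound`) and the toy values are assumptions made for the example, not the format of the original data files.

```python
def unique_tf_gene_pairs(records):
    """Collapse probe-level records to unique (study, tf, entrez_id) triples.

    Records flagged as not bound, or lacking an Entrez Gene ID, are dropped.
    """
    seen = set()
    unique = []
    for rec in records:
        if not rec["bound"] or rec["entrez_id"] is None:
            continue  # unbound probe or gene without a known Entrez Gene ID
        key = (rec["study"], rec["tf"], rec["entrez_id"])
        if key not in seen:  # keep only the first probe per TF-gene pair and study
            seen.add(key)
            unique.append(key)
    return unique

# Toy records with placeholder identifiers (not real data).
toy_records = [
    {"study": "S1", "tf": "TF_A", "entrez_id": 1001, "bound": True},
    {"study": "S1", "tf": "TF_A", "entrez_id": 1001, "bound": True},   # second probe, same gene
    {"study": "S1", "tf": "TF_A", "entrez_id": 1002, "bound": False},  # not bound
    {"study": "S1", "tf": "TF_A", "entrez_id": None, "bound": True},   # no Entrez Gene ID
    {"study": "S2", "tf": "TF_A", "entrez_id": 1001, "bound": True},   # same pair, other study
]

pairs = unique_tf_gene_pairs(toy_records)
```

The same pair is kept once per study, so an interaction reported by two studies still appears twice, once for each source.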

Generic interactions
To increase the coverage of StemCellNet, we imported human molecular interactions from the Unified Human Interactome (UniHI) database (http://www.unihi.org) and murine protein interactions from the BioGRID database (http://thebiogrid.org/). These were termed "generic interactions" to emphasize that they were not specifically detected in stem cells, but in other tissues, cell types or in vitro assays. Details about the data curation can be found on the web-pages of the UniHI and BioGRID databases. As both databases release new versions on an ongoing basis, we will regularly update the datasets imported into StemCellNet. Since UniHI is a metadatabase, which integrates various primary resources and databases for protein interactions, we also keep track of the original resources and refer to these in StemCellNet. Table S4 gives an overview of the primary sources integrated in the UniHI database.

Expression data sets
Expression datasets obtained either by microarray analysis or by high-resolution nano liquid chromatography-tandem mass spectrometry were derived from four different studies [Aiba et al., 2009; Uosaki et al., 2011; Gaspar et al., 2012; Hansson et al., 2012]. Genes with repeated Entrez Gene IDs were also excluded, resulting in 20881 unique Gene IDs.
In the data published by Uosaki and colleagues [Uosaki et al., 2011], human induced pluripotent stem cells (hiPSCs) were differentiated towards cardiomyocytes by sequential administration of activin, bone morphogenetic protein 4 (BMP4), fibroblast growth factor 4 (FGF4) and Dickkopf 1 homolog (DKK1). The expression profiling was performed with Affymetrix Human Gene 1.0 ST arrays at days 0, 2, 5, 7, 9 and 11 of differentiation. The CEL files were downloaded from GEO (accession number GSE28191) and were subjected to background correction, normalization and summarization of gene expression (RMA), using the R package affy [Gautier et al., 2004]. After elimination of repeated Entrez Gene IDs, a total of 19889 genes were covered by the expression data. [...] descriptions to obtain the Entrez Gene ID. From the 25164 genes present in the initial matrix series, we obtained a final list of 24287 genes, after Ref_IDs from the matrix series without a corresponding gene ID or symbol were excluded. Expression data of the replicates were averaged and all time series were adjusted so that the mean expression of each gene equals 0; the values are presented as log2 fold changes.
In the proteomic study by Hansson and colleagues, the authors applied in-depth quantitative proteomics to monitor proteome changes during the course of reprogramming of fibroblasts to iPSCs (Hansson et al., 2012). The data were extracted from the supplementary files available in Cell Reports (http://download.cell.com/cellreports/mmcs/journals/2211-1247/PIIS2211124712003695.mmc1.xlsx). After excluding proteins that did not match any Entrez Gene ID or did not have a gene name, we retained 7409 of the 7918 proteins initially present in the study.
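The replicate averaging and centering described above can be sketched as follows for a single gene. The sketch assumes that the input values are already on a log2 scale, so that differences from the gene's mean correspond to log2 fold changes; the function name and the toy numbers are hypothetical.

```python
from statistics import mean

def center_time_series(replicates_by_time):
    """Average replicates per time point, then subtract the gene's mean.

    replicates_by_time: list with one entry per time point, each a list of
    log2 expression values from the replicates (assumed input format).
    Returns one centered value per time point; the centered series has mean 0.
    """
    averaged = [mean(reps) for reps in replicates_by_time]
    gene_mean = mean(averaged)
    return [value - gene_mean for value in averaged]

# Toy example: three time points with two replicates each (not real data).
toy_gene = [[8.0, 9.0], [9.0, 10.0], [10.0, 11.0]]
centered = center_time_series(toy_gene)
```

Centering each gene separately makes time courses comparable across genes with very different absolute expression levels, at the cost of discarding the absolute level itself.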

Current and future curation strategy
The current focus lies on the curation of genomic and proteomic studies reporting molecular interactions in stem cells. For this purpose, we performed an extensive and systematic review of the publications returned by PubMed when queried with a defined set of terms. The returned publications were then examined to determine whether they included newly generated interaction data. For instance, to obtain a comprehensive set of ChIP-chip and ChIP-seq studies mapping transcription factor binding sites, we performed a PubMed search using as mandatory key-words "embryonic stem cell and genome" combined with non-mandatory key-words such as "chip-chip, chip-seq, chromatin immunoprecipitation, target, regulatory, transcription factor, expression and pathway".
The selection of key-words will be broadened in future versions, e.g. to include interaction data for types of stem cells other than embryonic. To enable users to assess the current state of StemCellNet, the web-server provides pages documenting the data included in the different versions.
For future versions of StemCellNet, we will start to curate small-scale studies. To obtain assistance from other researchers and experts for this task, we will set up dedicated webpages where external researchers can suggest studies or data sets for inclusion or even undertake the curation of data themselves. We hope that such features can eventually help to transform StemCellNet into a community-based project.

Automatic layout and filtering procedure for network visualization of large networks
For network visualization, we used Cytoscape Web (Lopes et al., 2010). This software performs well with small to medium-sized networks (up to several hundred nodes and edges) but becomes increasingly slow for larger networks. Thus, to avoid lengthy response times or even a stalling display, several automatic adjustments are made when visualizing networks that exceed certain thresholds.
If the number of nodes is greater than 400, the network is rendered using a radial layout, which is faster than the default force-directed layout and can support larger networks. If the number of edges in the network surpasses 1000, a filtering procedure is performed. In this case, StemCellNet will attempt to show only interactions that have more than one associated PubMed ID. This approach seeks to retain the interactions supported by more evidence. If there are still more than 1000 edges after this filtering step, the minimal number of required PubMed IDs is raised until the number of edges in the network is smaller than 1000. In all situations, the user is alerted when automatic filtering occurs.
As many transcription factors have a large number of target genes, which makes their visualization inefficient, we restricted the display of regulatory interactions to the incoming type. This means that a regulatory interaction is shown only if it acts upon a target gene included in the network composed of the central proteins and their physical interactors.
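The layout choice and the stepwise edge filtering can be sketched as below. This is an illustration under stated assumptions, not the code used by the web server: edges are assumed to be (source, target, PubMed IDs) tuples, the function names are hypothetical, and the handling of a network with exactly 1000 edges is a guess, since the text leaves that boundary open.

```python
def choose_layout(n_nodes, node_limit=400):
    """Radial layout above the node limit, force-directed layout otherwise."""
    return "radial" if n_nodes > node_limit else "force-directed"

def filter_edges(edges, edge_limit=1000):
    """Raise the required number of PubMed IDs until the edge limit is met.

    edges: list of (source, target, pubmed_ids) tuples (assumed format).
    Returns (kept_edges, required_ids, was_filtered); required_ids is None
    when no filtering was necessary.
    """
    if len(edges) <= edge_limit:
        return list(edges), None, False
    required = 2  # "more than one PubMed ID"
    kept = [e for e in edges if len(e[2]) >= required]
    while len(kept) > edge_limit:
        required += 1
        kept = [e for e in kept if len(e[2]) >= required]
    return kept, required, True  # the caller would alert the user when True

# Toy network with a lowered limit so that the effect is visible.
id_counts = [1, 1, 2, 2, 3, 3, 4]
toy_edges = [(f"n{i}", "hub", [f"id{k}" for k in range(count)])
             for i, count in enumerate(id_counts)]
kept_edges, required_ids, was_filtered = filter_edges(toy_edges, edge_limit=3)
```

In the toy run, requiring two IDs still leaves five edges, so the requirement rises to three, which leaves three edges and stops the loop. The loop always terminates, because an ever higher requirement eventually removes every edge.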
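A minimal sketch of this restriction is given below. It assumes that physical interactions are given as unordered pairs and regulatory interactions as (regulator, target) pairs; the function name and the placeholder node labels are hypothetical.

```python
def incoming_regulatory_edges(central, ppi_edges, regulatory_edges):
    """Keep regulatory edges whose target lies in the physical network.

    The physical network consists of the central proteins and every protein
    that physically interacts with one of them.
    """
    central = set(central)
    nodes = set(central)
    for a, b in ppi_edges:
        if a in central or b in central:
            nodes.update((a, b))
    # Only the target is tested, so regulators outside the network still appear.
    return [(tf, target) for tf, target in regulatory_edges if target in nodes]

# Toy input with placeholder names (not real interactions).
central_proteins = ["P1"]
toy_ppi = [("P1", "P2"), ("P3", "P4")]
toy_regulatory = [("TF_A", "P2"), ("P1", "P5"), ("TF_B", "P1")]

shown = incoming_regulatory_edges(central_proteins, toy_ppi, toy_regulatory)
```

The edge from P1 to P5 is dropped because P5 is not part of the physical network, which is what keeps a transcription factor with many targets from flooding the display.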
Finally, it is possible to download the full set of interactions from the StemCellNet central node selection page. This data can be used as input for alternative stand-alone software tools, such as Cytoscape and R/Bioconductor, which may be more advisable for efficient analysis of large interaction networks.

Deep linking into StemCellNet
Through deep links, StemCellNet allows users to recreate previously obtained search results or networks without having to repeat the search process. These links contain, in the form of link parameters, the data necessary to reconstruct either a search result or a network.
As an example, to deep link into a specific protein or gene identifier search, the user can use a link such as the following:

Database structure
The database underlying StemCellNet was developed using the MySQL relational database management system. Altogether, the database consists of 18 different tables (Figure S1) storing different types of data and information. There are 7 core tables including

Online resources for stem cell biology
Several online resources for stem cell biology have been established in recent years.
Many of them are gene-centric, i.e. their main functionality is to provide collected data and information for individual queried genes. This contrasts with resources such as StemCellNet, which enable interactive analysis of genes of interest within a network context. In the following, some of the currently available online resources and their applicability are briefly described. PluriNet, a molecular network model for pluripotency, was described. However, many features on the web-site do not seem to be functional anymore. Finally, ESCAPE (http://www.maayanlab.net/ESCAPE) is a database which integrates published data for human and mouse embryonic stem cells and enables the derivation of networks for query proteins, similarly to our StemCellNet web-server.

Figure S1: Scheme of database underlying StemCellNet