Summary: Large sets of data, such as expression profiles from many samples, require analytic tools to reduce their complexity. The Iterative Signature Algorithm (ISA) is a biclustering algorithm designed to decompose a large set of data into so-called ‘modules’. In the context of gene expression data, these modules consist of subsets of genes that exhibit a coherent expression profile only over a subset of microarray experiments. Genes and arrays may be attributed to multiple modules, and the level of required coherence can be varied, resulting in different ‘resolutions’ of the modular mapping. In this short note, we introduce two BioConductor software packages written in GNU R: The
The ISA can be applied to identify coherent substructures (i.e. modules) from any rectangular matrix of data. To be specific, we consider here the case of transcriptomics data corresponding to a set of gene expression profiles from a collection of samples. The method has been described in detail in Ihmels et al. (2004) and Bergmann et al. (2003). Here we only give a brief summary.
The ISA identifies modules by an iterative procedure. The algorithm starts from an input seed (corresponding to some set of genes or samples), which is refined at each iteration by adding and/or removing genes and/or samples until the process converges to a stable set, which is referred to as a transcription module.
The output of the ISA is a collection of potentially overlapping modules. Every module contains genes that are over- and/or underexpressed in the samples that belong to the module. In every module, each gene and each sample is attributed a score between −1 and 1, which reflects the strength of its association with the module. Moreover, if the scores of two genes of a module have the same sign, then the genes are correlated (across the samples of the module); opposite signs indicate anti-correlation. Similarly, if two sample scores have the same sign, then these samples are correlated (across the genes of the module); opposite signs indicate anti-correlation.
For other biclustering algorithms, see e.g. Cheng and Church (2000), Getz et al. (2000), Califano et al. (2000), Sharan et al. (2002), Tanay et al. (2004) and Barkow et al. (2006); for a review, see Ihmels and Bergmann (2004).
A typical modular analysis for gene expression data includes the following steps.
Batch correction: to study the global organization of a transcription program, including many aspects of transcriptional regulation, one often combines several microarray experiments into a single dataset. In such a case, additional data normalization is crucial to reduce the bias due to the constituent datasets. Several methods address this challenge; see e.g. Johnson et al. (2007) for an algorithm that has a GNU R implementation.
Gene filtering: genes that have very low expression levels in all samples carry little if any information and may reflect ineffective array probes, etc. Since these genes are likely to contribute mostly noise to the analysis (Hackstadt and Hess, 2009), we suggest removing them before running the module identification of the ISA.
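As a minimal illustration of such a filter, the Python sketch below drops every gene whose expression never reaches a cutoff in any sample. The function name `filter_low_expression`, the `min_level` cutoff and the toy values are assumptions made for this example; the text above does not prescribe a specific criterion, and the packages may filter differently.

```python
def filter_low_expression(expr, min_level):
    """Keep genes (rows) whose expression reaches min_level in at least one sample.

    expr is a genes-by-samples matrix given as a list of rows.
    Returns the indices of the retained genes and the reduced matrix.
    """
    kept = [i for i, row in enumerate(expr) if max(row) >= min_level]
    return kept, [expr[i] for i in kept]


# Toy matrix with made-up values: rows are genes, columns are samples.
expr = [
    [0.1, 0.2, 0.1],  # low in every sample, so it is removed
    [5.0, 0.3, 4.8],
    [0.2, 6.1, 0.1],
]
kept, filtered = filter_low_expression(expr, min_level=1.0)
```

Returning the retained indices alongside the reduced matrix makes it possible to map module members back to the original gene list later.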
ISA normalization (Step 1 in Fig. 1): in each iteration the ISA computes thresholded weighted sums of expression levels over either genes or samples. Since different genes typically show different levels of base expression and variance, it is important to standardize expression levels to Z-scores. The ISA uses two sets of Z-scores, one calculated for each gene across all samples and the other for each sample across all genes.
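The two sets of Z-scores can be sketched in a few lines of Python. This is an illustration only: the helper names are hypothetical, the population standard deviation is an assumed choice, and constant rows or columns are mapped to zeros to avoid division by zero.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize a list of numbers to mean 0 and unit standard deviation."""
    m, s = mean(values), pstdev(values)
    # A constant vector has no variance; return zeros rather than divide by zero.
    return [0.0 if s == 0 else (v - m) / s for v in values]


def two_way_zscores(expr):
    """Return (gene_z, sample_z) for a genes-by-samples matrix.

    gene_z:   each gene (row) standardized across all samples.
    sample_z: each sample (column) standardized across all genes.
    Both results keep the genes-by-samples orientation of the input.
    """
    gene_z = [zscores(row) for row in expr]
    columns = [zscores(list(col)) for col in zip(*expr)]
    sample_z = [list(row) for row in zip(*columns)]
    return gene_z, sample_z


# Toy matrix with made-up values: two genes, three samples.
gene_z, sample_z = two_way_zscores([[1, 2, 3], [2, 4, 6]])
```

In the toy matrix both genes follow the same trend, so their gene-wise Z-scores coincide, while the sample-wise Z-scores separate the weaker gene from the stronger one within each sample.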
Random and smart seeding, ISA iteration (Step 2): the iterative procedure of the module identification is typically applied to a large number of seeds. In the unsupervised approach, these seeds are chosen randomly to sample the immense search space uniformly. We also implemented a semi-supervised method, which we refer to as ‘smart seeding’, where the seeds are biased to start with certain sets of genes or samples based on prior knowledge. The ISA can be performed with random or smart seeds, depending on the application.
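The seeding and iteration can be sketched as follows in plain Python. This is a simplified illustration and not the packages' implementation. It assumes that sample scores are computed from the sample-wise Z-scores and gene scores from the gene-wise Z-scores, that thresholding is two-sided (scores further than t standard deviations from the mean are kept), and that scores are rescaled to lie between −1 and 1. All function names and the toy data are hypothetical.

```python
import random
from statistics import mean, pstdev


def zscores(values):
    """Standardize a list of numbers; constant lists map to zeros."""
    m, s = mean(values), pstdev(values)
    return [0.0 if s == 0 else (v - m) / s for v in values]


def threshold(scores, t):
    """Zero out scores within t standard deviations of the mean, then scale to [-1, 1]."""
    m, s = mean(scores), pstdev(scores)
    kept = [x if abs(x - m) > t * s else 0.0 for x in scores]
    top = max(abs(x) for x in kept)
    return [x / top for x in kept] if top > 0 else kept


def random_seed(n_genes, size, rng):
    """Unsupervised seed: `size` genes drawn uniformly at random."""
    chosen = set(rng.sample(range(n_genes), size))
    return [1.0 if i in chosen else 0.0 for i in range(n_genes)]


def smart_seed(n_genes, known_genes):
    """Semi-supervised seed: start from genes selected by prior knowledge."""
    chosen = set(known_genes)
    return [1.0 if i in chosen else 0.0 for i in range(n_genes)]


def isa_iterate(gene_z, sample_z, seed, t_gene, t_sample, max_iter=100):
    """Refine a gene seed until the gene scores stop changing (or max_iter is hit)."""
    n_samples = len(gene_z[0])
    genes = list(seed)
    samples = [0.0] * n_samples
    for _ in range(max_iter):
        # Sample scores: thresholded weighted sum over genes.
        raw = [sum(g * row[j] for g, row in zip(genes, sample_z))
               for j in range(n_samples)]
        samples = threshold(raw, t_sample)
        # Gene scores: thresholded weighted sum over samples.
        raw = [sum(c * x for c, x in zip(samples, row)) for row in gene_z]
        new_genes = threshold(raw, t_gene)
        converged = max(abs(a - b) for a, b in zip(new_genes, genes)) < 1e-8
        genes = new_genes
        if converged:
            break
    return genes, samples


# Toy data (made up): genes 0-1 are elevated in samples 0-1, all else is flat.
expr = [[6.0, 6.0, 1.0, 1.0, 1.0] for _ in range(2)] + [[1.0] * 5 for _ in range(4)]
gene_z = [zscores(row) for row in expr]
sample_z = [list(r) for r in zip(*(zscores(list(c)) for c in zip(*expr)))]

# Smart seed containing only gene 0; the iteration recruits gene 1 as well.
genes, samples = isa_iterate(gene_z, sample_z, smart_seed(6, [0]), 1.0, 1.0)

# Random seed of two genes, using a fixed generator for reproducibility.
rand_genes, rand_samples = isa_iterate(
    gene_z, sample_z, random_seed(6, 2, random.Random(0)), 1.0, 1.0)
```

In practice the iteration would be run from many seeds and over a range of thresholds; the single planted block here is only meant to make the fixed point easy to follow.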
Merging and filtering the modules (Step 3): several seeds may converge to the same or very similar biclusters. This step eliminates such duplicates. To assess the significance of a module, we designed a robustness measure that can be used to filter out spurious modules. This is done by applying the ISA to scrambled input data in order to obtain a reference (null) distribution for the significance scores.
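One simple way to eliminate near-duplicate modules, shown here only as a sketch, is to compare the gene score vectors of the modules and to keep a module only if it is not too strongly correlated with any module already kept. Comparing gene scores alone, the Pearson correlation and the 0.99 cutoff are assumptions chosen for illustration; the robustness filtering against scrambled data is not covered by this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long score vectors (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def merge_duplicates(modules, max_cor=0.99):
    """Keep the first module of every group whose gene scores correlate above max_cor."""
    unique = []
    for module in modules:
        if all(pearson(module, kept) <= max_cor for kept in unique):
            unique.append(module)
    return unique


# Made-up gene score vectors: the first two are nearly identical.
modules = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.98, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
unique = merge_duplicates(modules)
```

The greedy pass keeps whichever duplicate appears first, so the order of the input modules matters; a more careful merge might average the scores of each group instead.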
Module trees: the ISA works with two stringency threshold parameters, the gene threshold and the sample threshold. ISA modules can be organized into a directed graph, which we refer to as a ‘module tree’. An edge from module A to module B indicates that the ISA converges to module B from module A, with the same threshold parameters that were used to find module B. A module tree provides a hierarchical modular description of a dataset.
The ISA and accompanying visualization tools are implemented in two R packages. The
In the detailed workflow every step of the modular analysis is executed separately, possibly with non-default parameters. This allows the users to tailor the ISA according to their needs.
Additional information and a Matlab implementation of ISA are available on the ISA homepage.
Funding: Swiss Institute of Bioinformatics; the Swiss National Science Foundation (3100AO-116323/1); European Framework Project 6 (through the EuroDia and AnEuploidy projects).
Conflict of Interest: none declared.