Abstract

Summary: Large sets of data, such as expression profiles from many samples, require analytic tools to reduce their complexity. The Iterative Signature Algorithm (ISA) is a biclustering algorithm. It was designed to decompose a large set of data into so-called ‘modules’. In the context of gene expression data, these modules consist of subsets of genes that exhibit a coherent expression profile only over a subset of microarray experiments. Genes and arrays may be attributed to multiple modules, and the level of required coherence can be varied, resulting in different ‘resolutions’ of the modular mapping. In this short note, we introduce two BioConductor software packages written in GNU R: the isa2 package includes an optimized implementation of the ISA, and the eisa package provides a convenient interface to run the ISA, visualize its output and put the biclusters into biological context. Potential users of these packages are all R and BioConductor users dealing with tabular (e.g. gene expression) data.

Availability: http://www.unil.ch/cbg/ISA

Contact: sven.bergmann@unil.ch

1 INTRODUCTION

The ISA can be applied to identify coherent substructures (i.e. modules) from any rectangular matrix of data. To be specific, we consider here the case of transcriptomics data corresponding to a set of gene expression profiles from a collection of samples. The method has been described in detail in Ihmels et al. (2004) and Bergmann et al. (2003). Here we only give a brief summary.

The ISA identifies modules by an iterative procedure. The algorithm starts from an input seed (corresponding to some set of genes or samples), which is refined at each iteration by adding and/or removing genes and/or samples until the process converges to a stable set, which is referred to as a transcription module.

The output of the ISA is a collection of potentially overlapping modules. Every module contains genes that are over- and/or underexpressed in the samples that belong to the module. In every module, each gene and each sample is attributed a score between −1 and 1, which reflects the strength of its association with the module. Moreover, if the scores of two genes of a module have the same sign, then the genes are correlated (across the samples of the module); opposite signs mean anti-correlation. Similarly, if two sample scores have the same sign, then these samples are correlated (across the genes of the module); opposite signs indicate anti-correlation.

For other biclustering algorithms, see e.g. Cheng and Church (2000), Getz et al. (2000), Califano et al. (2000), Sharan et al. (2002), Tanay et al. (2004) and Barkow et al. (2006); for a review, see Ihmels and Bergmann (2004).

2 METHODS

A typical modular analysis for gene expression data includes the following steps.

Batch correction: to study the global organization of a transcription program, including many aspects of transcriptional regulation, one often combines several microarray experiments into a single dataset. In such a case, additional data normalization is crucial to reduce the bias introduced by the constituent datasets. Several methods address this challenge; see e.g. Johnson et al. (2007) for an algorithm that has a GNU R implementation.

Gene filtering: genes that have very low expression levels in all samples carry little if any information and may reflect ineffective array probes, etc. Since these genes are likely to contribute mostly noise to the analysis (Hackstadt and Hess, 2009), we suggest removing them before running the module identification of the ISA.
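
As an illustration of the filtering step, the following minimal Python sketch keeps only genes that reach a chosen expression level in at least one sample. It is not the code of the isa2 or eisa packages; the function name filter_low_expression, the cutoff min_level and the toy values are assumptions made for this example.

```python
def filter_low_expression(expr, min_level):
    """Keep genes whose expression reaches min_level in at least one sample.

    expr maps a gene name to its list of expression values (one per sample).
    min_level is a hypothetical cutoff; a real analysis would choose it
    from the distribution of the data.
    """
    return {gene: values for gene, values in expr.items()
            if max(values) >= min_level}


# Toy expression values (invented for illustration only).
expr = {
    "geneA": [7.9, 8.4, 9.1, 8.8],
    "geneB": [2.1, 2.3, 1.9, 2.0],  # low in every sample, so it is dropped
    "geneC": [2.2, 6.5, 2.4, 7.0],  # low in some samples only, so it is kept
}
filtered = filter_low_expression(expr, min_level=4.0)
```

Note that a gene is removed only if it is low in all samples; a gene that is expressed in just a few samples may be exactly the kind of gene a module captures.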

ISA normalization (Step 1 in Fig. 1): in each iteration the ISA computes thresholded weighted sums of expression levels over either genes or samples. Since different genes typically show different levels of base expression and variance, it is important to standardize expression levels to Z-scores. The ISA uses two sets of Z-scores, one calculated for each gene across all samples and the other for each sample across all genes.
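
The two sets of Z-scores can be sketched as follows. This is a hedged Python illustration rather than the packages' implementation: the helper names zscore and isa_normalize are hypothetical, and the use of the population standard deviation is an assumption of the sketch.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize a vector; a constant vector maps to all zeros."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values] if s > 0 else [0.0] * len(values)


def isa_normalize(matrix):
    """Return two genes x samples matrices of Z-scores.

    gene_z:   each gene (row) standardized across all samples.
    sample_z: each sample (column) standardized across all genes.
    """
    gene_z = [zscore(row) for row in matrix]
    column_z = [zscore(list(col)) for col in zip(*matrix)]
    sample_z = [list(row) for row in zip(*column_z)]  # back to genes x samples
    return gene_z, sample_z


# Toy matrix: 3 genes (rows) x 3 samples (columns), invented values.
matrix = [[1.0, 2.0, 3.0],
          [4.0, 4.0, 10.0],
          [5.0, 9.0, 2.0]]
gene_z, sample_z = isa_normalize(matrix)
```

After this step every row of gene_z and every column of sample_z has mean 0 and unit standard deviation, so genes with different baseline levels contribute on a comparable scale to the weighted sums.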

Fig. 1. (A) Work flow of a typical modular analysis with the eisa package. See text for details. (B and C) were generated using the acute lymphoblastic leukemia dataset (Chiaretti et al., 2004) and the ALL R package. (B) Heatmap for a single module, showing coherent expression of the genes across the samples. The red lines are the gene and sample scores. (C) Module tree. Each module is represented by a rectangle with its numeric id in the center. See the definition of the edges in the text. Modules are colored according to their Gene Ontology enrichment P-values; the codes of the enriched GO categories are shown in the top-left corner of the rectangles. The top-right corner shows the number of genes and conditions in the module. The gene thresholds used for finding the modules are shown on the horizontal axes.

Random and smart seeding, ISA iteration (Step 2): the iterative procedure of the module identification is typically applied to a large number of seeds. In the unsupervised approach, these seeds are chosen randomly to sample the immense search space uniformly. We also implemented a semi-supervised method, which we refer to as ‘smart seeding’, where the seeds are biased to start with certain sets of genes or samples based on prior knowledge. The ISA can be performed with random or smart seeds, depending on the application.
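
To make the iteration concrete, here is a simplified Python sketch of the procedure applied to random seeds. It is an illustration under stated assumptions, not the optimized isa2 implementation: the function names, the choice of which Z-score matrix feeds which step, the rescaling of scores to a maximum absolute value of 1, the convergence test and the toy data are all assumptions of this example.

```python
import random
from statistics import mean, pstdev


def zscore(values):
    """Standardize a vector; a constant vector maps to all zeros."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values] if s > 0 else [0.0] * len(values)


def threshold(scores, t):
    """Zero out scores within t standard deviations of the mean, then
    rescale so that the largest remaining |score| is 1 (signs are kept)."""
    kept = [z if abs(z) > t else 0.0 for z in zscore(scores)]
    top = max(abs(k) for k in kept)
    return [k / top for k in kept] if top > 0 else kept


def isa_iterate(gene_z, sample_z, seed, t_gene, t_sample, max_iter=100):
    """Refine a gene seed until the gene scores stop changing."""
    n_genes, n_samples = len(gene_z), len(gene_z[0])
    genes = list(seed)
    samples = [0.0] * n_samples
    for _ in range(max_iter):
        # Sample scores: thresholded weighted sum over genes.
        samples = threshold(
            [sum(genes[i] * gene_z[i][j] for i in range(n_genes))
             for j in range(n_samples)], t_sample)
        # Gene scores: thresholded weighted sum over samples.
        new_genes = threshold(
            [sum(samples[j] * sample_z[i][j] for j in range(n_samples))
             for i in range(n_genes)], t_gene)
        if all(abs(a - b) < 1e-9 for a, b in zip(genes, new_genes)):
            break
        genes = new_genes
    return genes, samples


# Toy data: genes 0-1 are elevated in samples 0-1, everything else is flat.
data = [[5.0 if i < 2 and j < 2 else 0.0 for j in range(8)] for i in range(8)]
gene_z = [zscore(row) for row in data]
sample_z = [list(r) for r in zip(*[zscore(list(c)) for c in zip(*data)])]

# Random seeds: each gene enters a seed with probability 0.5.
rng = random.Random(0)
seeds = [[float(rng.random() < 0.5) for _ in range(8)] for _ in range(20)]
seeds.append([1.0] + [0.0] * 7)  # one hand-picked seed containing gene 0

found = set()
for seed in seeds:
    genes, samples = isa_iterate(gene_z, sample_z, seed, t_gene=1.0, t_sample=1.0)
    module = (tuple(i for i, g in enumerate(genes) if g != 0.0),
              tuple(j for j, s in enumerate(samples) if s != 0.0))
    if module[0]:  # ignore seeds that converge to an empty module
        found.add(module)
```

On this toy matrix every seed either collapses to an empty module or converges to the planted block (genes 0 and 1 over samples 0 and 1). Many different seeds ending in the same module is the situation the merging step is meant to handle.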

Merging and filtering the modules (Step 3): it is possible that several seeds converge to the same, or very similar, biclusters. This step eliminates such duplicates. To assess the significance of a module, we designed a robustness measure that can be used to filter out spurious modules. This is done by applying the ISA to scrambled input data in order to obtain a reference (null) distribution for the significance scores.
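
The two supporting operations of this step can be sketched in Python as follows. The robustness measure itself is not defined in this note, so it is not reproduced here. Comparing modules by the Pearson correlation of their gene score vectors, the cutoff of 0.99, and scrambling by shuffling each gene's values across samples are assumptions of this sketch, as are all function names.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long score vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def merge_similar(modules, limit=0.99):
    """Keep a module only if it is not too similar to one already kept."""
    unique = []
    for scores in modules:
        if all(pearson(scores, kept) < limit for kept in unique):
            unique.append(scores)
    return unique


def scramble(matrix, rng):
    """Shuffle each gene's values across samples to build a null dataset."""
    scrambled = []
    for row in matrix:
        row = list(row)
        rng.shuffle(row)
        scrambled.append(row)
    return scrambled


# Gene score vectors of four converged seeds (invented values).
modules = [
    [1.0, 0.9, 0.0, 0.0],
    [1.0, 0.9, 0.0, 0.0],   # exact duplicate of the first
    [0.0, 0.0, 1.0, 0.8],   # a different module
    [0.98, 0.9, 0.0, 0.0],  # near duplicate of the first
]
unique = merge_similar(modules)

data = [[5.0, 5.0, 0.0, 0.0], [1.0, 2.0, 3.0, 4.0]]
null_data = scramble(data, random.Random(1))
```

Running the module search on null_data with the same settings would then give scores expected by chance, against which the scores of the real modules can be compared.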

Module trees: the ISA works with two stringency threshold parameters, the gene threshold and the sample threshold. ISA modules can be organized into a directed graph, which we refer to as a ‘module tree’. An edge from module A to module B indicates that the ISA converges to module B when started from module A, with the same threshold parameters that were used to find module B. A module tree provides a hierarchical modular description of a dataset.
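
The edge definition translates directly into a double loop over modules, as in the following Python sketch. The function build_module_tree and the module table are hypothetical, and toy_converge is only a stand-in for a real ISA run (it keeps the seed genes whose invented strength reaches the gene threshold), so that the example stays self-contained.

```python
def build_module_tree(modules, converge):
    """Return edges (A, B) such that starting from module A and iterating
    with the thresholds used to find module B ends in module B.

    modules maps an id to (gene_set, thresholds);
    converge(gene_set, thresholds) must return the converged gene set.
    """
    edges = []
    for a, (genes_a, _) in modules.items():
        for b, (genes_b, thresholds_b) in modules.items():
            if a != b and converge(genes_a, thresholds_b) == genes_b:
                edges.append((a, b))
    return edges


# Invented per-gene strengths used only by the toy stand-in below.
STRENGTH = {"a": 3.5, "b": 3.2, "c": 2.5, "d": 2.1, "e": 3.4}


def toy_converge(genes, thresholds):
    """Toy stand-in for the ISA iteration: keep seed genes whose strength
    reaches the gene threshold. A real run would iterate on the data."""
    gene_threshold, _sample_threshold = thresholds
    return frozenset(g for g in genes if STRENGTH[g] >= gene_threshold)


modules = {
    1: (frozenset("abcd"), (2.0, 1.0)),  # found at gene threshold 2.0
    2: (frozenset("ab"), (3.0, 1.0)),    # found at gene threshold 3.0
    3: (frozenset("e"), (3.0, 1.0)),     # found at gene threshold 3.0
}
edges = build_module_tree(modules, toy_converge)
```

In this toy setting the only edge runs from module 1 to module 2: the larger, less stringent module refines into the smaller one when the gene threshold is raised, which is the hierarchical relationship the module tree records.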

3 IMPLEMENTATION

The ISA and accompanying visualization tools are implemented in two R packages. The isa2 package contains the implementation of the basic ISA itself; this package can be used to analyze any tabular data. The eisa package builds on isa2. It adds support for standard BioConductor data structures and contains gene expression-specific visualization tools (see Fig. 1 for examples).

Both the isa2 and eisa packages support two workflows. The simple workflow involves a single R function call and runs all ISA steps (Steps 1–3 in Fig. 1) with their default parameters.

In the detailed workflow every step of the modular analysis is executed separately, possibly with non-default parameters. This allows the users to tailor the ISA according to their needs.

The eisa package implements a set of visualization techniques for modules (see Fig. 1 for examples).

The biclust package (Kaiser et al., 2009) implements a number of biclustering algorithms in a unified framework. The eisa package includes tools to convert between biclust and ISA biclusters, so that functions from the two packages can be used together.

Additional information and a Matlab implementation of ISA are available on the ISA homepage.

Funding: Swiss Institute of Bioinformatics; the Swiss National Science Foundation (3100AO-116323/1); European Framework Project 6 (through the EuroDia and AnEuploidy projects).

Conflict of Interest: none declared.

REFERENCES

Barkow S, et al. BicAT: a biclustering analysis toolbox. Bioinformatics, 2006, vol. 22 (pg. 1282-1283)

Bergmann S, et al. Iterative signature algorithm for the analysis of large-scale gene expression data. Phys. Rev. E, 2003, pg. 031902

Califano A, et al. Analysis of gene expression microarrays for phenotype classification. Proceedings of the International Conference on Computational Molecular Biology, 2000 (pg. 75-85)

Cheng Y, Church G. Biclustering of expression data. Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, 2000 (pg. 93-103)

Chiaretti S, et al. Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood, 2004, vol. 103

Gentleman RC, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol., 2004, vol. 5, pg. R80

Getz G, et al. Coupled two-way clustering analysis of gene microarray data. Proc. Natl Acad. Sci. USA, 2000, vol. 97 (pg. 12079-12084)

Hackstadt A, Hess A. Filtering for increased power for microarray data analysis. BMC Bioinformatics, 2009, vol. 10, pg. 11

Ihmels JH, Bergmann S. Challenges and prospects in the analysis of large-scale gene expression data. Brief. Bioinform., 2004, vol. 5 (pg. 313-327)

Ihmels J, et al. Defining transcription modules using large-scale gene expression data. Bioinformatics, 2004 (pg. 1993-2003)

Johnson W, et al. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 2007, vol. 8 (pg. 118-127)

Kaiser S, et al. biclust: BiCluster Algorithms. R package version 0.8.1., 2009

Sharan R, et al. EXPANDER: EXPression ANalyzer and displayER. 2002. Technical report, Software package, Tel-Aviv University

Tanay A, et al. Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proc. Natl Acad. Sci. USA, 2004, vol. 101 (pg. 2981-2986)

Author notes

Associate Editor: David Rocke
