Abstract

Systems biology is based on computational modelling and simulation of large networks of interacting components. Models may be intended to capture processes, mechanisms, components and interactions at different levels of fidelity. Input data are often large and geographically dispersed, and may require the computation to be moved to the data, not vice versa. In addition, complex system-level problems require collaboration across institutions and disciplines. Grid computing can offer robust, scalable solutions for distributed data, compute and expertise. We illustrate some of the range of computational and data requirements in systems biology with three case studies: one requiring large computation but small data (orthologue mapping in comparative genomics), a second involving complex terabyte data (the Visible Cell project) and a third that is both computationally and data-intensive (simulations at multiple temporal and spatial scales). Authentication, authorisation and audit systems currently scale poorly and may present bottlenecks for distributed collaboration, particularly where outcomes may be commercialised. Challenges remain in providing lightweight standards to facilitate the penetration of robust, scalable grid-type computing into diverse user communities to meet the evolving demands of systems biology.

INTRODUCTION

Understanding biological systems is one of the grand challenges in science. There is clear evidence that computation is increasingly informing, and in turn being informed by, the biological sciences [1–4]. In this article, we outline some of the issues in modelling biological systems, provide observations on distributed computing and software development and present three case studies illuminating facets of advanced computing for systems biology.

LARGE-SCALE COMPUTING ISSUES IN SYSTEMS BIOLOGY

What is systems biology? Definitions differ, but most focus on how biological ‘parts’ (e.g. molecules, cells, organisms) interact to produce coherent, dynamic systems (cells, tissues and organs, ecosystems). It is rarely obvious how systemwide behaviour emerges from these networks of interacting parts, and some definitions focus on properties that we associate with living systems, including resilience, adaptation and evolution. In any case, we need to know the component parts and at least the general nature of their interactions, and represent them symbolically. In this way, complex biological systems can be described in mathematical or statistical terms. It may additionally be possible to bring the perspectives and analytical methodologies of computer science, systems engineering, control theory and complexity theory to bear on these systems.

However systems biology is defined and delineated, it is based on computational modelling and simulation [1, 2, 5]. In systems biology applications, computational modelling and simulation require access to large, diverse, semantically rich and often geographically dispersed data sets. These data are integrated into computational models, which are then executed on computing hardware, often generating large volumes of output. This output must be managed, validated, mined and visualised, typically by researchers representing multiple disciplines (statistics, complexity theory, molecular or organismal biology). A range of issues must be addressed throughout this process:

  • Data, both as input and output.

  • Mapping the problem to a model.

  • Embedding the model in a scalable computational framework.

  • Implementing the model in real computer code as well-engineered software.

  • Executing the code on hardware of appropriate architecture.

Embedding the model in a scalable computational framework is where the abstract view of modelling meets the reality of modern computing architectures. It is safe to assume that almost all computing for systems biology in the next few years will be done using variations of the current digital silicon-based technology: quantum and DNA computing, while of interest, are still very much experimental for real-world problems [4]. The classical, physics-dominated supercomputing of the 1980s has given way to a more diverse array of architectures. This shift has been driven by the economics of the semiconductor industry, where microprocessor design is expensive, but unit fabrication costs are low; hence, only high-volume components are economically feasible (bespoke processors, such as the Cray X1, tend to be for niche markets with wealthy customers). Biology has also contributed to this change because the majority of biological computation is about capacity (many small tasks to solve the problem) rather than capability (few large tasks to solve the problem). An example of this is genome assembly from shotgun sequencing: many small pairwise overlap calculations, followed by a single assembly of the overlaps typically requiring very large memory. Because different parts of a computational problem may be best suited to quite different detailed architectures, it may be attractive to assemble a heterogeneous collection of computers which can be coupled to solve a problem; this is the premise of the TeraGrid [6]. Space does not permit a detailed discussion on the advantages and disadvantages of such a system, but the reader is encouraged to explore documents on the TeraGrid Web site [6].

Here we discuss the philosophy of modelling as it applies in particular to systems biology, then follow with three case studies related to our own work. Case study 1 has large computational requirements but small data, and is in part (but not entirely) almost trivially amenable to a computing grid system. Case study 2 presents large, complex data requirements requiring the integration of disparate datasets, and thus has affinity for Web services and data grid techniques. It also offers some software development lessons. Case study 3 brings together the above two: it is both computationally and data-intensive, and moreover requires the implementation of difficult algorithms.

MODELS, COMPLEXITY AND REQUIREMENTS

Having stated that systems biology is about creating models that are concerned with both entities and interactions, one needs to ask what we want to achieve by modelling the system—is it to understand, to predict or to control? This matters because it will crucially determine what fidelity the model requires, and this in turn has consequences for the data and computational requirements for executing that model on real computer systems. The fidelity of a model is the degree to which it captures the processes or mechanisms, the entities and their relationships that we believe occur in the real system. Fidelity is relative to both our level of understanding of the system, and the end use we intend to make of the model.

The lowest-fidelity modelling is essentially the prediction and use of patterns. In this approach, one simply takes an observed pattern and carries out some kind of statistical analysis to predict its future state. A classic example is trying to predict the future of the stock market using a time series analysis of historical data. In this, we completely ignore any external factors that have affected the historical record, and so we have, a priori, no reason to believe that the future state of the system is necessarily analogous or related to the historical state. This is not to suggest that pattern prediction has no value—if the future and historical states are analogous, then future states can be predicted quite accurately. Pattern prediction can also help us gain understanding and insight into the process that is producing the pattern.
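
As a minimal illustration of pattern prediction, the sketch below (Python, with illustrative data) fits a simple autoregressive model to an observed series and extrapolates it, deliberately ignoring any external drivers of the system.

```python
import numpy as np

def ar_forecast(series, order=3, steps=5):
    """Fit a simple autoregressive model by least squares and extrapolate it,
    ignoring all external factors -- the essence of pure pattern prediction."""
    x = np.asarray(series, dtype=float)
    # Lagged design matrix: each row is [x[t-order], ..., x[t-1], 1]
    rows = [np.append(x[t - order:t], 1.0) for t in range(order, len(x))]
    X, y = np.array(rows), x[order:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    history = list(x)
    for _ in range(steps):
        nxt = float(np.dot(np.append(history[-order:], 1.0), coef))
        history.append(nxt)
    return history[len(x):]

# Toy usage: predict the next points of a noisy oscillation.
t = np.arange(100)
obs = np.sin(0.3 * t) + 0.1 * np.random.randn(100)
print(ar_forecast(obs, order=5, steps=3))
```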

The next step up the fidelity ladder is what is usually called empirical modelling. With empirical models, the results or consequences of an underlying process are captured as a simple set of rules or fitted relationships that are then used in the modelling process. For example, consider a model of plant growth. An empirical model might relate growth directly to time or to a few environmental variables using curves fitted to observations, without representing the mechanisms involved. A ‘first principles’ model, by contrast, might consider factors such as physiology, water availability, solar radiation and photosynthesis, atmospheric phenomena and soil nutrients. Clearly, in the latter case a large amount of biophysical process and data must be captured.
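
A minimal sketch of such an empirical model, assuming hypothetical field observations and a logistic growth rule chosen purely for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, K, r, t0):
    """Empirical growth rule: carrying capacity K, rate r, midpoint t0."""
    return K / (1.0 + np.exp(-r * (t - t0)))

# Hypothetical field observations: days since planting vs biomass (g).
days = np.array([0, 10, 20, 30, 40, 50, 60], dtype=float)
biomass = np.array([2.1, 4.8, 11.0, 22.5, 35.0, 41.2, 44.0])

params, _ = curve_fit(logistic, days, biomass, p0=[50.0, 0.1, 30.0])
print("fitted K, r, t0:", params)
print("predicted biomass at day 70:", logistic(70.0, *params))
```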

The highest level of fidelity is achieved with process-based models. In these, processes that we believe occur in the real system are captured or modelled ‘one to one’ in our model of that system, at a level appropriate to the use intended for the model. As we move to finer spatial and temporal resolution, more processes may have to be included explicitly, and processes operating below the resolved scale must still be represented by empirical approximations. As a consequence, all modelling contains some degree of empiricism. There is nothing inherently wrong with this, but one must be aware of the assumptions and limitations on which the model is predicated. The output from any model is not delivered on tablets of stone, but must be interpreted or analysed by the modeller, relative to its assumptions and limitations, including the data required by the model. Visualisation of raw and modelled data is an important tool in this process.

The scale of modern ‘omic’ biological data makes providing a centralised data warehouse for all information relevant to a model and its analysis increasingly difficult. There are both technical and social reasons for this. Thus, advances in understanding biological systems will require us to learn how to work in distributed environments, and to collaborate with coworkers who may be geographically dispersed. Fortunately, there is evidence of technological progress towards resolving this: the Grid.

The Grid—in the sense of an easy extension of the user's desktop environment to utilise distributed data and computing services—has been publicised for over a decade [7], and yet to date has made few inroads into the daily life of working scientists. We believe this will change over the next few years, largely due to related developments in the business world, with concepts such as Web services, software as a service and virtual organisations. Whether or not it is still called ‘grid computing’ and is built on technologies such as Globus, there is no doubt that scientists will need to learn to work in a world where resources are distributed [4].

In particular, we believe that data grid approaches will significantly benefit science, and that moving the computation to the data rather than the other way round (as is typically done today) will be required to make sense of all the biological data being generated and incorporate it into systems models.

CASE STUDY 1: RECOGNISING ORTHOLOGOUS NETWORKS VIA PHYLOGENOMICS

Systems biology is a study of ‘parts’ and their interactions. Even now, a decade or more into the genomic age, systems-biological investigations must often begin by making an inventory of relevant parts and interactions. These may be incompletely known, or indeed completely obscure, and must be discovered from primary data. For this we can seek help from comparative genomics. Features—e.g. sequences of genomic regions or the composition of regulatory networks—are similar in two or more organisms due to the interplay between genealogy and function. Function constrains diversification, so homologous features that remain more similar than expected under a non-selective model are likely to be functionally important, i.e. to be ‘parts’ that engage in relevant interactions. At finer scale, conserved sequences or structures may point to how these parts interact. Discovering conserved features is complicated, however, not only by diversification through time but also by genome-level events such as duplication, translocation and deletion of genomic regions. Fine-scale feature mapping requires us to identify and compare only precise evolutionary counterparts (orthologues), and this turns out to be computationally intensive.

Orthology (common descent from a speciation event) can be unambiguously distinguished from paralogy (common descent from a gene duplication event) only on a phylogenetic tree [8]. The heuristic equating orthology with reciprocal best BLAST match [9] can be a useful first approximation, but is inadequate for fine-scale analysis especially with large gene families. Pathways and networks can be considered evolutionarily identical (and assumed, in the absence of evidence to the contrary, to be functionally identical) in different organisms if orthologues constitute their vertices. Finding maximal common subsets shared between networks, and the related problem of retrieving networks from databases, are known to be NP-complete. Here, we deal with the prior issue of establishing orthology among gene or protein sequences, based on the large-scale inference of trees [10].
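
For orientation, the sketch below shows the reciprocal-best-BLAST-hit heuristic mentioned above, assuming standard BLAST tabular output (-outfmt 6, with the bit score in the twelfth column) and hypothetical file names; as noted, this is only a first approximation to orthology.

```python
import csv
from collections import defaultdict

def best_hits(blast_tab):
    """Return {query: best subject} from BLAST tabular (-outfmt 6) output,
    keeping the hit with the highest bit score per query."""
    best = {}
    with open(blast_tab) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            query, subject, bitscore = row[0], row[1], float(row[11])
            if query != subject and (query not in best or bitscore > best[query][1]):
                best[query] = (subject, bitscore)
    return {q: s for q, (s, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a --
    the classic first approximation to orthology discussed above."""
    ab, ba = best_hits(a_vs_b), best_hits(b_vs_a)
    return [(a, b) for a, b in ab.items() if ba.get(b) == a]

# Hypothetical usage with two BLAST runs between genomes A and B:
# print(reciprocal_best_hits("A_vs_B.tsv", "B_vs_A.tsv"))
```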

Given the set of genes encoded by the genomes of interest (or alternatively, the proteins conceptually translated from these genes), the first step in large-scale orthologue mapping is to recognise putative gene (or protein) families. We begin with an all-versus-all BLAST, an embarrassingly parallel application that simply requires many processors; memory is generally not an issue (RAM requirement is a linear function of query length plus a linear function of maximum target length), and optimisation has been well explored. Sensitivity might be improved by use of more-rigorous statistical search operations, e.g. based on the Smith–Waterman algorithm with affine gap penalties, although at substantial CPU cost.
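
Because the all-versus-all search is embarrassingly parallel, it can be farmed out chunk by chunk. The sketch below distributes pre-split query files across local worker processes (a grid scheduler would play the same role), assuming the NCBI BLAST+ blastp binary is installed and using a hypothetical pre-built database name.

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def run_chunk(chunk_fasta):
    """Run one embarrassingly parallel piece of the all-versus-all search.
    Assumes the NCBI BLAST+ 'blastp' binary and a pre-built database
    'all_proteins' are available on the worker (hypothetical names)."""
    out = Path(chunk_fasta).with_suffix(".blast.tsv")
    subprocess.run(
        ["blastp", "-query", str(chunk_fasta), "-db", "all_proteins",
         "-evalue", "1e-5", "-outfmt", "6", "-out", str(out)],
        check=True,
    )
    return out

if __name__ == "__main__":
    chunks = sorted(Path("query_chunks").glob("*.fasta"))  # pre-split query files
    with ProcessPoolExecutor() as pool:                    # or a grid scheduler
        results = list(pool.map(run_chunk, chunks))
    print("finished", len(results), "chunks")
```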

N × N BLAST yields N² pairwise vectors (edges), among which we next identify clusters of related edges. Even after removing statistically insignificant edges, clustering can pose a large-memory problem for which current Grid architectures are inadequate. Markov clustering [11] is the method of choice because it avoids false-positive matches caused by statistically over-represented substrings (‘promiscuous domains’), but its RAM requirement scales as number of matrices held in memory × cell size × graph cardinality × a large integer that, in turn, depends on resource settings [12]. A popular implementation [11] allows resource requirements to be tuned (at the cost of deviation from ‘pure’ clustering), but RAM requirements in the tens of gigabytes are not uncommon. For some applications, it is possible to post-process these clusters to extract putatively orthologous families, but the polynomial-time approach [10] may be too conservative for fine-scale orthologue analysis among complex genomes (e.g. those of vertebrates), and larger (paralogous) clusters might better be taken forward into subsequent steps.
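
The sketch below is a toy, dense-matrix version of Markov clustering, intended only to show the alternation of expansion and inflation; genome-scale graphs require the sparse, resource-tuned implementation cited above [11].

```python
import numpy as np

def mcl(adj, inflation=2.0, max_iter=100, tol=1e-6):
    """Toy, dense Markov clustering: column-normalise, then alternate
    expansion (matrix squaring) and inflation (element-wise power)
    until the matrix stops changing."""
    M = np.asarray(adj, dtype=float) + np.eye(len(adj))  # add self-loops
    M = M / M.sum(axis=0)                                # column-stochastic
    for _ in range(max_iter):
        expanded = M @ M                                 # expansion
        inflated = expanded ** inflation                 # inflation
        inflated = inflated / inflated.sum(axis=0)
        if np.abs(inflated - M).max() < tol:
            M = inflated
            break
        M = inflated
    # Each attractor row with non-zero mass defines one cluster.
    clusters = {frozenset(np.flatnonzero(row > tol)) for row in M if row.max() > tol}
    return [sorted(c) for c in clusters]

# Two dense triangles joined by a single bridge edge.
adj = np.array([[0, 1, 1, 0, 0],
                [1, 0, 1, 0, 0],
                [1, 1, 0, 1, 0],
                [0, 0, 1, 0, 1],
                [0, 0, 0, 1, 0]], dtype=float)
print(mcl(adj))
```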

Having delineated sets of homologous DNA or protein sequences at the granularity appropriate for the biological question, we next align each set to display homology position-by-position. Even pairwise alignment is NP-hard, but diverse algorithms are available that, in practice, yield high-quality multiple sequence alignment (MSA). However, it is rarely possible to know a priori which approach or parameterisation best suits the data, and a reasonable strategy is to apply a spectrum of alternative approaches, then select from among the resulting alignments the one that scores most highly according to an external criterion [13]. This strategy is well suited to distributed computing, although approaches that involve word (motif) discovery have memory requirements that scale as alphabet size to the power of motif length, so RAM can become limiting especially with protein or codon alphabets.
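
To illustrate the select-the-best-alignment strategy, the sketch below scores two hypothetical candidate alignments of the same sequences and keeps the higher-scoring one; a simple sum-of-pairs score stands in here for the word-oriented criterion of [13].

```python
from itertools import combinations

def sum_of_pairs(alignment, match=1, mismatch=0, gap=-1):
    """Simple column-wise sum-of-pairs score; a stand-in external criterion
    (the approach cited in the text [13] uses a word-oriented measure)."""
    score = 0
    for col in zip(*alignment):            # iterate over alignment columns
        for a, b in combinations(col, 2):
            if a == "-" or b == "-":
                score += gap
            else:
                score += match if a == b else mismatch
    return score

# Hypothetical: the same three sequences aligned by two different programs.
candidates = {
    "aligner_A": ["MK-LV", "MKALV", "MK-LI"],
    "aligner_B": ["MKL-V", "MKALV", "MKL-I"],
}
best = max(candidates, key=lambda name: sum_of_pairs(candidates[name]))
print("selected alignment:", best)
```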

Given a set of aligned sequences, we next want to construct a hypothesis about their evolutionary history—i.e. infer a phylogenetic tree—on which orthologues can be identified. There is much to recommend a Bayesian approach [14, 15], in which the posterior probability of a tree is associated with its probability of being correct, given the prior probability, evolutionary model and data. Posterior probabilities usually cannot be computed analytically, but using Markov chain Monte Carlo (MCMC) methods, it is possible to examine equilibrium distributions of trees, and from these make probability statements about the true tree. MCMC, particularly with so-called heated chains (Metropolis-coupled MCMC), recovers from the posterior probability distribution a sample of topologies within which the empirical relative frequency of a given topology converges to its corresponding marginal posterior probability; the topology with highest relative frequency in this sample is selected, and posterior probabilities of its subtrees can be estimated by consensus from the topologies visited by the Markov chains. In current implementations [16], memory requirement scales as Markov chains × sequences × site patterns × character states × rate categories × 32, which for sets of 50–100 sequences translates to 0.5–1.0 GB. The Markov chains can be run in parallel, but as chain number is typically 4 and many data sets are often analysed simultaneously, the advantage of parallelisation may be modest except for very large data sets. Orthologous families should appear as coherent, statistically well-supported subtrees, and with appropriate indexing can be extracted in linear time. Comparing sets of orthologues can be more difficult, as many relevant operations are known or suspected to be NP-complete.
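
Plugging illustrative numbers into the memory formula above (treating the constant factor of 32 as bytes per cell) reproduces the quoted range:

```python
# Rough memory estimate for the Bayesian MCMC run described above:
# chains x sequences x site patterns x states x rate categories x 32 (bytes, as read here).
chains, sequences, site_patterns = 4, 75, 1000      # illustrative values only
states, rate_categories, bytes_per_cell = 20, 4, 32  # protein alphabet, 4 rate categories
ram_bytes = chains * sequences * site_patterns * states * rate_categories * bytes_per_cell
print(f"approx. {ram_bytes / 2**30:.1f} GiB")        # ~0.7 GiB, within the quoted 0.5-1.0 GB
```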

All the preceding steps, from delineation of gene or protein families through inference and analysis of trees, should ideally be refined by iteration. This is almost never done, both because the computational demands presented by real-life multi-genome problems are so substantial, and because workflow and data management can itself be complex. There is little reported experience in interfacing generic workflow and portal environments [17–20] with the distributed, multi-architecture high-performance computing resources required for multi-genome orthologue mapping as described here.

CASE STUDY 2: THE VISIBLE CELL

The Visible Cell project [21] is an ambitious undertaking of the Institute for Molecular Bioscience (IMB) and the ARC Centre in Bioinformatics. It is intended to provide a framework in which abstract mathematical and computational models can be embedded onto real three-dimensional cellular structural data obtained from high-resolution electron imaging. It is not intended to produce a general-purpose framework for systems biology modelling, but rather is predicated on the belief that, in order to unravel the complexities of the cell, biologists must be able to visualise the three-dimensional spatial arrangement of cellular components at high resolution, link gene expression and protein structures with cellular location and embed models of cell processes. Other cell representation and modelling environments [22–25] are similarly motivated, although without the same emphasis on real cell-image data.

This systems-oriented view, emphasising interactions as well as components, offers a good mapping to object-oriented analysis and design, which has emerged as a popular programming paradigm for designing complicated systems [26]. Although there has been increasing use of more agile software development methodologies and languages (e.g. eXtreme Programming, Perl/PHP web scripts, Ruby) that emphasise rapid code development and could be considered somewhat counter to object-oriented methods, we believe there is a productive analogy between object-oriented software (and ultimately, software components) and biology modelled as components and interactions.

One of the first milestones of the Visible Cell is three-dimensional visualisation of a mammalian pancreatic beta cell at approximately 5 nm resolution, using data from tomographic reconstruction of electron microscope images. This presents the first challenge: data volume. A complete reconstruction will require between 5 and 10 TB of volume data, and we wish to visualise these data at multiple scales. Data of this volume cannot be rendered on anything other than extremely expensive visualisation clusters, yet we want to make these tools accessible to as wide a range of researchers as possible. This introduces the second challenge: we want the framework to be as platform-agnostic as possible.

We have addressed these issues in two ways. First, we developed a routine to tile raw microscope images stored in an Oracle database and serve them at multiple levels of detail. Using Oracle Intermedia resulted in excessive image-loading times, so we wrote our own software in Java to run as a stored procedure in the Oracle database. Although Java can be slow, the relatively simple class structure required by our application imposes little need to traverse the class hierarchy with expensive function calls. Second, we implemented a Java viewer that queries the database for the tiles appropriate to the current view and level of detail. The Java implementation allows cross-platform viewing with no application recoding.
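
The tile-and-level-of-detail idea can be sketched as follows; this Python/Pillow toy merely illustrates the image-pyramid approach and is not the project's Java stored procedure.

```python
from pathlib import Path
from PIL import Image  # Pillow; stands in for the project's Java implementation

def build_tile_pyramid(image_path, tile=256, levels=4, out_dir="tiles"):
    """Cut one microscope section into fixed-size tiles at several levels
    of detail (each level half the resolution of the previous one), so a
    viewer can fetch only the tiles covering its current viewport."""
    img = Image.open(image_path)
    for level in range(levels):
        scale = 2 ** level
        scaled = img.resize((max(1, img.width // scale), max(1, img.height // scale)))
        for y in range(0, scaled.height, tile):
            for x in range(0, scaled.width, tile):
                box = (x, y, min(x + tile, scaled.width), min(y + tile, scaled.height))
                dest = Path(out_dir, f"L{level}_{x}_{y}.png")
                dest.parent.mkdir(parents=True, exist_ok=True)
                scaled.crop(box).save(dest)

# build_tile_pyramid("section_0001.tif")   # hypothetical tomogram slice
```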

As part of the fundamental data management requirements for this project we are developing an object-oriented framework, called .Cell, that exposes various interfaces that data objects must implement if they are to be interoperable with components in the project. This framework provides a consistent interface to the different microscope sources at IMB, all of which have (different) proprietary data formats. The .Cell framework remains in active development, and is being documented so researchers can add their own tools (e.g. sophisticated visualisation environments) simply by conforming to the framework interfaces. This implies that we should keep changes to interfaces to an absolute minimum as the project develops.
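
Purely to illustrate the style of interface involved, the sketch below defines a hypothetical volume-source contract and one adapter; the names are invented for this example and are not the actual .Cell API.

```python
from abc import ABC, abstractmethod
import numpy as np

class VolumeSource(ABC):
    """Hypothetical illustration of the kind of interface a framework such as
    .Cell might require data objects to implement; names invented for this sketch."""

    @abstractmethod
    def metadata(self) -> dict:
        """Instrument, voxel size, acquisition parameters, etc."""

    @abstractmethod
    def read_region(self, origin, shape) -> np.ndarray:
        """Return the voxel block starting at `origin` with size `shape`."""

class TomogramFile(VolumeSource):
    """One concrete adapter per proprietary microscope format."""
    def __init__(self, path):
        self.path = path

    def metadata(self):
        return {"source": self.path, "voxel_nm": 5.0}

    def read_region(self, origin, shape):
        # Placeholder: a real adapter would parse the vendor format here.
        return np.zeros(shape, dtype=np.uint8)

# src = TomogramFile("tomogram_0001.dat")               # hypothetical file
# block = src.read_region(origin=(0, 0, 0), shape=(64, 64, 64))
```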

In addition to cell-image data, there will be other data types and sources:

  • Segmented features extracted from the volumetric information, represented as three-dimensional geometries and spatially registered to the volumes so they can be displayed simultaneously.

  • Three-dimensional protein structures from structure databases.

  • Protein localisation information, allowing us to embed protein structures in the volumes and segments.

  • Models whose outputs are spatially registered caricatures of cell structures, and can thus be embedded in real spatial coordinates.

  • Hyperlinked annotation and other metadata.

These disparate data sources are best accessed through a layer of abstraction, so the implementation specifics of each data source can be hidden from the application. We are applying technologies such as web services, data grid services and Storage Resource Broker [27] and find them to be important in this context. Ultimately, the models themselves should be available as grid services, allowing us to apply distributed computing resources—an approach entirely harmonious with the current trend towards ‘software as a service’ in business processing.

It is worth emphasising that software engineering issues must not be overlooked, as software systems required to make progress in even individual systems biology problems are becoming increasingly complex, and development of a generic systems biology framework is well beyond the capabilities of a single person ‘programming in the small’ [5]. This is a major reason for not constructing a general systems biology framework in this project—we simply do not have the resources to make significant headway with this approach.

CASE STUDY 3: MULTI-SCALE SIMULATIONS IN CELL BIOLOGY

It is increasingly appreciated that many cellular processes, including gene regulation and signalling cascades, are governed by effects associated with small numbers of certain key molecules [28–31]. In these situations the standard chemical kinetics framework described by systems of ordinary differential equations breaks down. The stochastic simulation algorithm (SSA) of Gillespie [32] represents a discrete modelling approach that attempts to capture the uncertainty (intrinsic noise) in when a reaction takes place and which reaction it is. Essentially, the SSA evolves a discrete non-linear Markov process whose elements are the numbers of molecules of all the chemical species in the reacting system as a function of time.
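
A minimal direct-method SSA, applied to a toy birth-death system with illustrative rates and an arbitrary end time, looks like the following:

```python
import numpy as np

def gillespie_ssa(x0, stoich, rates, propensity, t_end, rng=None):
    """Direct-method SSA: draw the waiting time to the next reaction and
    which reaction fires from the current propensities, then update counts."""
    rng = rng or np.random.default_rng()
    t, x, path = 0.0, np.array(x0, dtype=int), [(0.0, list(x0))]
    while t < t_end:
        a = propensity(x, rates)
        a0 = a.sum()
        if a0 <= 0:                       # no reaction can fire
            break
        t += rng.exponential(1.0 / a0)    # time to next reaction
        j = rng.choice(len(a), p=a / a0)  # which reaction fires
        x = x + stoich[j]
        path.append((t, list(x)))
    return path

# Toy birth-death process: 0 -> X (rate k1), X -> 0 (rate k2 * X).
stoich = np.array([[+1], [-1]])
propensity = lambda x, k: np.array([k[0], k[1] * x[0]])
trajectory = gillespie_ssa([10], stoich, (5.0, 0.5), propensity, t_end=20.0)
print(trajectory[-1])
```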

The SSA has been applied to good effect in a variety of settings [28, 33, 34] but its main drawback is that the time step can be very small if some molecular species are present in large numbers or if rate constants are large. A number of algorithmic approaches have been considered to overcome these computational bottlenecks, e.g. by allowing more than one reaction per time step [35, 36], using quasi-steady-state assumptions to reduce the size of the problem [37–39], or partitioning the problem into small, moderate and large numbers of molecules [40].

The need for advanced computing methodologies and implementations arises in this setting, as often thousands of independent simulations must be computed in order to collect appropriate statistics about moment behaviour. Because they are independent, these simulations can be distributed to different processing elements. Burrage et al. [41] give a grid implementation of Gillespie's tau-leap method [32] in which MathML within a web browser is used to set up the chemical reaction system in an intuitive visual manner. The web browser then communicates with the networked computers running independent simulations. Numerical results, statistics and graphs are displayed back within the browser.
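
Because the replicates are independent, distributing them is straightforward. The sketch below runs many birth-death SSA replicates across local worker processes and collects moment estimates; it stands in, at small scale, for the browser-driven grid implementation described above.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def final_count(seed, k_birth=5.0, k_death=0.5, x0=10, t_end=20.0):
    """One independent birth-death SSA run; returns the molecule count at t_end.
    Each run needs only a seed, so the ensemble is trivially distributable."""
    rng = np.random.default_rng(seed)
    t, x = 0.0, x0
    while True:
        a = np.array([k_birth, k_death * x])
        a0 = a.sum()
        if a0 <= 0:
            break
        t += rng.exponential(1.0 / a0)
        if t >= t_end:
            break
        x += 1 if rng.random() < a[0] / a0 else -1
    return x

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:          # one task per processing element
        samples = np.array(list(pool.map(final_count, range(10_000))))
    print("mean", samples.mean(), "variance", samples.var())   # moment estimates
```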

The SSA is a trajectorial approach, but since it describes the evolution of a stochastic process, this stochastic process has a probability density function described by the so-called Master Equation (ME). The ME is a system of linear ordinary differential equations in which there is one equation for each configuration of the state space. Consequently, the solution of the ME takes the form p(t) = exp(At) p(0), where A is the state-space matrix. Even for relatively small systems, the dimension of A can be in the millions, so advanced computing techniques are necessary. Usually these techniques are based on domain decomposition ideas, but the art is to make sure that the computation-to-communication ratio is kept high. A proposed finite state projection algorithm [42] reduces the size of the matrix A, and Krylov subspace techniques [43, 44] can be used to efficiently compute the exponential of a matrix times a vector.
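
A toy version of this ME calculation, for a truncated birth-death system, is sketched below; scipy's expm_multiply computes the action of exp(At) on a vector without forming the full exponential, playing the role that Krylov-based codes such as Expokit [44] play at scale.

```python
import numpy as np
from scipy.sparse import lil_matrix, csc_matrix
from scipy.sparse.linalg import expm_multiply

# Truncated master equation for a birth-death process (finite state projection idea):
# states n = 0..N, birth rate k1, death rate k2 * n.
k1, k2, N = 5.0, 0.5, 60
A = lil_matrix((N + 1, N + 1))
for n in range(N + 1):
    if n < N:
        A[n + 1, n] += k1          # birth: n -> n+1
        A[n, n] -= k1
    if n > 0:
        A[n - 1, n] += k2 * n      # death: n -> n-1
        A[n, n] -= k2 * n
A = csc_matrix(A)

p0 = np.zeros(N + 1)
p0[10] = 1.0                        # start with exactly 10 molecules

# p(t) = exp(At) p(0), evaluated at t = 20 without forming exp(At) explicitly.
p_t = expm_multiply(A * 20.0, p0)
print("mean copy number at t=20:", float(np.arange(N + 1) @ p_t))
```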

The challenge is to apply these techniques in modelling important and complicated cellular processes. Particular examples include:

  • Quorum sensing among proliferating cells, where a small change in protein expression levels in one cell can change the state of all other cells in the culture [45]. Here the computational challenge is to model genetic regulation within a cell, coupled to signalling among thousands of proliferating cells. Since cell number grows, adaptive techniques are needed to ensure suitable load balancing on the processing elements.

  • Oscillations of calcium levels play a major role in the stimulation of cardiac cells, resulting in contraction of the heart [46]. Calcium-induced calcium release occurs around the dyadic cleft and must be modelled at a fine scale by discrete models, which are then coupled to larger-scale dynamics.

  • Epidermal growth factor (EGF) receptors play important roles in growth and differentiation in many mammalian cell types. Binding of EGF to the extracellular domain of the EGF receptor induces dimerisation, autophosphorylation and the activation of downstream proteins in signalling cascades [47]. Complex ordinary differential equation models have been developed involving 125 reactions and 94 species [48] but there is a need to model this from a discrete, stochastic perspective.

In addition, as soon as spatial effects are incorporated into the modelling process (there is a wide array of complicated ultrastructure within a cell), computational complexity increases dramatically. In these situations domain decomposition techniques are crucial, but again the art is in adaptively load-balancing the computational effort on the processing elements.

Other significant problems in systems biology have been identified not only at the cellular level, but also in the biology of tissues, organs and whole organisms. Progress will require linking experiments, observations, experimental data, models and simulation techniques across a wide range of temporal and spatial scales. An especially challenging research project is the Physiome Project [49], in which the function of organs such as the heart is being modelled across a range of spatial scales of 15 orders of magnitude and temporal scales of nine orders of magnitude. Such integrative challenges will require not only grid computing [50] but also the use of markup language environments such as CellML [51] and additional computing methodologies such as computational steering.

ACCESS AND INTEROPERABILITY

As science becomes more global in its practice, with advances being made at the intersection of traditional discipline boundaries, the need to provide scalable authentication, authorisation and audit (AAA) systems that give researchers access to resources becomes increasingly important. This is especially true in the life sciences, where there is an expectation of real commercial gain through, for example, improved therapeutic regimes. Systems such as traditional password files and LDAP will not scale, at least from a management perspective. We see AAA as a key problem to address, and note that work is in progress on solutions such as Globus certificates [52] and Shibboleth [53]. The (logical) separation of credential providers from credential consumers is a significant component in scaling these trust relationships to large-scale research communities.

We have touched on the value of published data standards and formats in aiding data integration and presenting common interfaces for software applications. This, however, is only part of the story, as it addresses only syntactic data integration. The real goal is semantic interoperability and integration of disparate data sources and, more generally, of services. Technologies built around ontologies that facilitate machine reasoning over data sets are the current solutions to this problem, a concept encapsulated in the vision of the Semantic Web [54] and in current instantiations such as the Web Ontology Language [55]. Other projects, such as BioMoby, have explored alternative solutions to this problem without requiring the construction of specific ontologies [56].

At the same time, our collective challenge will be to make sure that such implementations are not overburdened with expensive, cumbersome frameworks and over-managed standards [57]. Occam's principle (pluralitas non est ponenda sine necessitate: plurality should not be posited without necessity) should be a guiding light, with the caveat that these implementations must be sufficiently robust and portable to allow use by scientists with a wide variety of backgrounds and computing expertise.

Key Points

  • In systems biology applications, computational modelling and simulation require access to large, diverse, semantically rich and often geographically dispersed data sets.

  • These data are integrated into computational models, which are then executed on computing hardware, often generating large volumes of output.

  • This output must be managed, validated, mined and visualised, typically by researchers representing multiple disciplines.

  • With increasing data volumes, models and simulations are becoming more sophisticated and computationally intensive.

  • For these reasons, grid computing or its equivalent will continue to play an ever-increasing role in systems biology.

  • Implementations for the grid must not be overburdened with expensive, cumbersome frameworks, and standards must be sufficiently portable to allow use by scientists with a wide variety of backgrounds and expertise.

Acknowledgement

We thank Prof. Werner Dubitzky for the invitation to discuss these issues and the Australian Research Council for support (ARC Centre in Bioinformatics CE0348221 and funding of K. B. through the Federation Fellowship program).

References

1. Kitano H. Computational systems biology. Nature 2002;420:206-10.
2. Finkelstein A, Hetherington J, Li L, et al. Computational challenges of systems biology. Computer 2004;37:26-33.
3. Russel J. Bio-IT World Briefing on: System Biology. Framingham, MA: Bio-IT World, 2005 (20 April 2006, date last accessed).
4. Emmott S. Towards 2020 Science. Cambridge, UK: Microsoft Research, 2006 (20 April 2006, date last accessed).
5. Cassman M. Barriers to progress in systems biology. Nature 2005;438:1079.
6. TeraGrid project. (26 August 2006, date last accessed).
7. Foster I, Kesselman C. The Grid 2: Blueprint for a New Computing Infrastructure. San Francisco, CA: Morgan Kaufmann Publishers, 2003.
8. Fitch WM. Distinguishing homologous from analogous proteins. Syst Zool 1970;19:99-113.
9. Mushegian AR, Koonin EV. A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc Natl Acad Sci USA 1996;93:10268-73.
10. Beiko RG, Harlow TJ, Ragan MA. Highways of gene sharing in prokaryotes. Proc Natl Acad Sci USA 2005;102:14332-7.
11. Harlow TJ, Gogarten JP, Ragan MA. A hybrid clustering approach to recognition of protein families in 114 microbial genomes. BMC Bioinform 2004;5:45.
12. van Dongen S. The MCL FAQ. (20 April 2006, date last accessed).
13. Beiko RG, Chan CX, Ragan MA. A word-oriented approach to alignment validation. Bioinformatics 2005;21:2230-9.
14. Huelsenbeck JP, Ronquist F, Nielsen R, et al. Bayesian inference of phylogeny and its impact on evolutionary biology. Science 2001;294:2310-4.
15. Mar JC, Harlow TJ, Ragan MA. Bayesian and maximum likelihood phylogenetic analyses of protein sequence data under relative branch-length differences and model violation. BMC Bioinform 2005;5:8.
16. Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 2003;19:1572-4.
17. Taverna project. (21 April 2006, date last accessed).
18. Shah SP, He DYM, Sawkins JN, et al. Pegasys: software for executing and integrating analyses of biological sequences. BMC Bioinform 2004;5:40.
19. Garcia A, Thoraval S, Garcia LJ, et al. Workflows in bioinformatics: meta-analysis and prototype implementation of a workflow generator. BMC Bioinform 2005;6:87.
20. Carrere S, Gouzy J. REMORA: a pilot in the ocean of BioMoby web-services. Bioinformatics 2006;22:900-1.
21. Hunter J, Ragan MA, Little S. Position paper for Semantic Web Life Sciences Workshop (Visible Cell Project). (4 May 2006, date last accessed).
22. Tomita M. Whole-cell simulation: a grand challenge of the 21st century. Trends Biotechnol 2001;19:205-10.
23. Slepchenko BM, Schaff JC, Macara I, et al. Quantitative cell biology with the Virtual Cell. Trends Cell Biol 2003;13:570-6.
24. Sundararaj S, Guo A, Habibi-Nazhad B, et al. The CyberCell Database (CCDB): a comprehensive, self-updating, relational database to coordinate and facilitate in silico modelling of Escherichia coli. Nucl Acids Res 2004;32:D293-5.
25. Doi A, Nagasaki M, Fujita S, et al. Genomic Object Net: II. Modeling biopathways by hybrid functional Petri net with extension. Appl Bioinform 2004;2:185-8.
26. Booch G. Object Oriented Analysis and Design with Applications. 2nd edn. Redwood City, CA: Benjamin Cummings, 1993.
27. Moore R. Persistent Archives for Data Collections. SDSC TR-1999-2, October 1999 (20 April 2006, date last accessed).
28. Arkin A, Ross J, McAdams HH. Stochastic kinetic analysis of developmental pathway bifurcation in phage lambda-infected Escherichia coli cells. Genetics 1998;149:1633-48.
29. Hasty J, McMillen D, Isaacs F, et al. Computational studies of gene regulatory networks: in numero molecular biology. Nat Rev Genet 2001;2:268-79.
30. Kærn M, Elston TC, Blake WJ, et al. Stochasticity in gene expression: from theories to phenotypes. Nat Rev Genet 2005;6:451-64.
31. McAdams HH, Arkin A. Stochastic mechanisms in gene expression. Proc Natl Acad Sci USA 1997;94:814-9.
32. Gillespie DT. Exact stochastic simulation of coupled chemical reactions. J Phys Chem 1977;81:340-61.
33. Kierzek AM. STOCKS: stochastic kinetic simulations of biochemical systems with Gillespie algorithm. Bioinformatics 2002;18:470-81.
34. Shea MA, Ackers GK. The OR control system of bacteriophage lambda: a physical-chemical model for gene regulation. J Mol Biol 1985;181:211-30.
35. Gillespie DT. Approximate accelerated stochastic simulation of chemically reacting systems. J Chem Phys 2001;115:1716-33.
36. Tian T, Burrage K. Binomial leap methods for simulating stochastic chemical kinetics. J Chem Phys 2004;121:10356-64.
37. Haseltine EL, Rawlings JB. Approximate simulation of coupled fast and slow reactions for stochastic chemical kinetics. J Chem Phys 2002;117:6959-69.
38. Goutsias J. Quasiequilibrium approximation of fast reaction kinetics in stochastic biochemical systems. J Chem Phys 2005;122:1-15.
39. Rao C, Arkin A. Stochastic chemical kinetics and the quasi-steady-state assumption: application to the Gillespie algorithm. J Chem Phys 2003;118:4999-5010.
40. Burrage K, Tian T, Burrage P. A multi-scaled approach for simulating chemical reaction systems. Prog Biophys Mol Biol 2004;85:217-34.
41. Burrage K, Burrage P, Jeffrey S, et al. A grid implementation of chemical kinetic simulation methods in genetic regulation. In: Proceedings of APAC03 Conference on Advanced Computing, Grid Applications and e-Research, 2003, 1-3. ISBN 0-9579303.
42. Munsky B, Khammash M. The finite state projection algorithm for the solution of the chemical master equation. J Chem Phys 2006;124:044104.
43. Burrage K, Hegland M, MacNamara S, et al. A Krylov-based finite state projection algorithm for solving the chemical master equation arising in the discrete modelling of biological systems. In: Proceedings of the Markov 150th Anniversary Conference, in press.
44. Sidje RB. Expokit: software package for computing matrix exponentials. ACM Trans Math Software 1998;24:130-56.
45. Kobayashi H, Kærn M, Araki M, et al. Programmable cells: interfacing natural and engineered gene networks. Proc Natl Acad Sci USA 2004;101:8414-9.
46. Higgins E, Sneyd J. Modelling the cardiac calcium transient using grid refinement and an ADI algorithm. Technical Report. Auckland, New Zealand: Department of Mathematics, University of Auckland, 2006.
47. Orton RJ, Sturm OE, Vyshemirsky V, et al. Computational modeling of the receptor-tyrosine-kinase-activated MAPK pathway. Biochem J 2005;392:249-61.
48. Schoeberl B, Eichler-Jonsson C, Gilles ED, et al. Computational modeling of the dynamics of the MAP kinase cascade activated by surface and internalized EGF receptors. Nature Biotechnol 2002;20:370-5.
49. Hunter PJ, Robbins P, Noble D. The IUPS Physiome Project. Eur J Physiol 2002;445:1-9.
50. Gavaghan DJ, Simpson AC, Lloyd S, et al. Towards a Grid infrastructure to support integrative approaches to biological research. Phil Trans R Soc Lond A 2005;363:1829-41.
51. Lloyd CM, Halstead MDB, Nielsen PF, et al. CellML: its future, present and past. Prog Biophys Mol Biol 2004;85:433-50.
52. Globus Alliance. Components for Grid security. (20 April 2006, date last accessed).
53. Internet2. Shibboleth. (20 April 2006, date last accessed).
54. Berners-Lee T. Semantic Web roadmap. (2 May 2006, date last accessed).
55. W3C. OWL Web Ontology Language Overview. (2 May 2006, date last accessed).
56. Wilkinson M. moby. (2 May 2006, date last accessed).
57. Bradbury RH. Do standards liberate or enslave? In: AURISA 93 Data Standards Workshop, Adelaide, 22 November 1993.