The integration of bioinformatics resources worldwide is one of the major concerns of the biological community. We have established the BOD (Bioinformatics on demand) system, which uses Grid computing technology to set up a virtual workbench on a web-based platform and assists researchers in performing customized, comprehensive bioinformatics work. Users can submit entire search queries and computation requests, e.g. from DNA assembly to gene prediction and finally protein folding, from their own office using the BOD end-user web interface. The BOD web portal parses the user's job requests into steps, each of which may contain multiple tasks in parallel. The BOD task scheduler takes an entire task, or splits it into multiple subtasks, and dispatches the task or subtasks proportionally to the computation node(s) associated with the BOD portal server. A node may further split and distribute an assigned task to its sub-nodes using a similar strategy. In the end, the BOD portal server receives and collates all results and returns them to the user. BOD uses a pipeline model to describe the user's submitted data and stores the job requests/status/results in a relational database. In addition, an XML criterion is established to capture task computation program details.
Received March 10, 2004; Revised May 20, 2004; Accepted July 21, 2004
In the life sciences, the rapid proliferation of biological resources (including biological data and software) not only gives researchers powerful tools for studying complex biological problems, but also makes it difficult to use these resources efficiently. Researchers often have to use resources supplied by separate sources, with software developed on different platforms and data in various input and output formats (1–4). Users have to click, copy and paste repeatedly on various websites, integrate heterogeneous data, and manage a great deal of computer resources (Figure 1). Accomplishing these tasks requires extensive professional bioinformatics expertise, so most biology researchers can hardly proceed. Moreover, analyzing large amounts of data requires computational power that most researchers lack, while at the same time a tremendous amount of bioinformatics resources is left unused. These situations limit the ability of biologists to use and analyze the mass of biological information through computers.
In order to resolve these issues, some resource-integration projects utilizing Grid technology (5,6) have been presented in recent years. Discovery Net (http://ex.doc.ic.ac.uk/new/), developed by the Imperial College of Science, Technology and Medicine, provides a service-oriented computing model for knowledge discovery, allowing users to connect to and use data analysis software as well as data sources that are made available online by third parties. Discovery Net defines an architecture, standards and tools that allow scientists to plan, manage, share and execute complex knowledge discovery and data analysis procedures available as remote services. It also allows service providers to publish and make available data mining and data analysis software components as services to be used in knowledge discovery procedures, and allows data owners to provide interfaces and access to scientific databases, data stores, sensors and experimental results as services so that they can be integrated into knowledge discovery processes.
MyGrid [(7); http://mygrid.man.ac.uk/], the bioinformatics database and Grid computation system hosted at Manchester University, is a research project that extends the Grid framework of distributed computing, producing a virtual laboratory workbench that serves the life sciences community. The integration environment supports patterns of scientific investigation that include accumulating evidence, assimilating results, accessing community information sources and collaborating with disparately located researchers via electronic forums. With MyGrid, scientists have the tool to customize the work environment to reflect their preferences on resource selection, data management and process enactment. At the least, the environment is able to support activities relating to the analysis of functional genomic data and the annotation of pattern databases.
The BioMOBY project (http://biomoby.org/), developed by the Plant Biotechnology Institute of the National Research Council of Canada, is an international open source research project that aims to generate an architecture for the discovery and distribution of biological data through web services (8). The data and the services are decentralized, but the availability of these resources, and the instructions for interacting with them, are registered in a central location called MOBY Central. Users can interact with biological data from different sources through BioMOBY, and do not need to worry about the data format and schema. This system can also dynamically identify new relationships among the data from different sources.
Each of the projects introduced above has its own focus; however, they pay the most attention to data mining, knowledge discovery and heterogeneous data management, and do not take full advantage of Grid technology. In practical applications, users usually need to perform high-throughput computation, which cannot be carried out on their own laboratory equipment, and hope to run complicated bioinformatics queries and computations consisting of multiple steps as conveniently as possible. In this paper, we propose a new system for satisfying such requests, and describe our BOD (Bioinformatics on demand) project. It aims to be a customized bioinformatics service based on user demands, and to offer a convenient, one-click way to meet users' complicated requirements through the integration of bioinformatics tools. BOD aims to make full use of present Grid computation technology to improve the capacity for high-quality computation on biological data. Users can perform several bioinformatics tasks from their own offices through the Internet more conveniently and thoroughly, and can effectively employ the hardware, software and database bioinformatics resources on the Internet.
MATERIALS AND METHODS
The portal server of the BOD system was developed on the Red Hat Linux 9.0 platform. The central database is implemented in MySQL (http://www.mysql.org), and is used only on the portal server. We use Perl (http://www.perl.org) and Java (http://java.sun.com/) as the major programming languages. Perl is mainly used in the BOD web portal module and for parsers between software tools; Java is used in the BOD job scheduler and task scheduler, the communication manager and the modules on computation nodes; the BOD job status browser is also written in Perl. We have already tested the system on computation node(s) running the Linux, AIX and Solaris platforms.
A number of services are already available through the BOD portal server webpage (http://e-science.tsinghua.edu.cn/bod/). To validate the feasibility, reliability, stability and compatibility of our system schema, we set up a bioinformatics pipeline from sequence assembly to gene prediction, and then fold prediction. For each function, some well-known packages are provided for users, e.g. sequence assembly by TIGR-Assembler (9), gene prediction by Genscan (10), Geneid (11) and Glimmer (12), and fold prediction by Threader (13). We also used the software transeq from the EMBOSS package (14) to provide a DNA translation service. These services can run independently; some of them can work together to form a pipeline. Other services that can be supported by the BOD system are being developed and will be provided in the future.
Figure 2 outlines the major steps of how BOD works. First, the user inputs the job requests through the end-user web interface to the portal server. Depending on the nature of the job, the portal server parses the user's requests into several sequential steps; each of the steps may contain multiple parallel tasks. In the second step, the portal server takes each individual task and queries the registered available computation nodes for the availability of the tools/databases required by the task as well as the computation capability of the nodes. The computation node may be a single computation workstation, a Grid server or a cascaded Grid server. In the third step, the portal server, when appropriate, splits the task into multiple subtasks in proportion to the computation capability of each usable node, and dispatches the task or subtasks to the computation node(s). A computation node may further split a task and distribute it to its sub-nodes when necessary, using a dispatching strategy similar to that of the portal server. Each node communicates with its parent and child nodes through the Grid technology framework. In the fourth step, each employed computation workstation performs the computation of its assigned tasks, and after all the computation nodes complete their tasks, the portal server collects and combines the results. If the job contains multiple steps and/or multiple tasks, at the fifth step, the portal server repeats steps 2 to 4 for all tasks in the order of the steps in the job. Finally, in the sixth step, the portal server returns the final results to the user. The results are kept in the system for three months before being cleaned up. We describe below in more detail the major aspects of the BOD system.
BOD LOGICAL ARCHITECTURE
The BOD system uses a webpage as the end-user interface. Built on the Internet, the system aims to use computers worldwide, through collaboration and resource sharing, to meet computational demands. We classify the computers related to BOD into three types: client station, portal server and computation node.
Client station: The client station is the computer that users employ to connect to the BOD system via a web interface. It acts as a consumer in the BOD system. Users can customize jobs, submit jobs, query job status and retrieve job results, all from the web interface.
BOD portal server: The BOD portal server is the control center and manager of the BOD system, and plays three roles. First, it is a web portal server, which provides web services for obtaining and parsing requests from users. Second, it utilizes a BOD job scheduler and task scheduler to handle all computation workflows required by submitted jobs. The BOD portal server does not perform computing itself; instead, depending on the nature of a task, it splits the task into subtasks if appropriate, delivers the task or subtasks to registered computation node(s), and receives result files from them. In other words, the portal server is a Grid server. Third, the BOD portal server uses a central relational database to store the information on the jobs that supports the operations above. It also supplies a mechanism for the user to query the status of submitted jobs from the web interface.
Computation node: A computation node is a computer workstation participating in the BOD system to provide services. It communicates with the portal server, accepts tasks from it, performs these tasks, and sends the results back to the portal server. Computation nodes are the service providers in the BOD system. Computation nodes may be geographically distributed worldwide.
In the BOD system, a computation node could be an individual computation workstation that performs the actual computations or queries, or could be a Grid server that further distributes tasks to its sub-nodes. As can be easily seen, a sub-node of a Grid server could again be a Grid server, and so on; thus, the BOD system is able to connect an unlimited number of computers in theory.
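The cascading arrangement described above can be pictured as a tree whose leaves are workstations and whose inner nodes are Grid servers. The following is only a minimal sketch of that idea, not BOD's actual implementation: the `Node` class, the node names and the numeric `capability` values are hypothetical, and summing the sub-nodes' capabilities to obtain a Grid server's capability is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A computation node: a workstation (no children) or a Grid server."""
    name: str
    capability: float = 0.0  # hypothetical relative speed of a workstation
    children: List["Node"] = field(default_factory=list)

    def total_capability(self) -> float:
        # Assumption: a Grid server's capability is the sum over its sub-nodes;
        # a workstation simply reports its own value.
        if not self.children:
            return self.capability
        return sum(child.total_capability() for child in self.children)

    def workstations(self) -> List[str]:
        # Walk the cascade and list the workstations that actually compute.
        if not self.children:
            return [self.name]
        names: List[str] = []
        for child in self.children:
            names.extend(child.workstations())
        return names


# A portal server with one workstation and one cascaded Grid server.
portal = Node("portal", children=[
    Node("ws-a", capability=2.0),
    Node("grid-b", children=[
        Node("ws-b1", capability=1.0),
        Node("grid-b2", children=[Node("ws-b2x", capability=3.0)]),
    ]),
])

print(portal.workstations())
print(portal.total_capability())
```

Because each Grid server only needs to know its direct sub-nodes, the recursion can be nested to any depth, which is what allows the number of connected computers to grow without a fixed limit.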
Computation nodes must be registered with the BOD server in order to be visible to the portal server. Only the IP address of a node is kept at the portal server. In addition, a package is installed at each computation node to ensure proper communication between the node and the portal server. This package is available from BOD upon request.
Resources contributed by computation nodes are diverse in their types and characteristics. In order for each node to deal properly with received tasks, a unified description of the node is created and kept at the node. Two types of attributes are described. One is the system attribute, which includes the computer node type (workstation or Grid server), platform type (Linux, AIX, Solaris, etc.), number of CPUs, and so on. The other is the software attribute, which includes the software category, software standard name, version, etc. Practically all computation nodes can be represented by these two types of attributes no matter how heterogeneous and diversified they are; therefore, essentially all kinds of nodes can be integrated into the BOD system, potentially yielding a huge computation power and a large biological software pool.
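As an illustration of the two attribute types, the sketch below models a node description as plain data and uses it to find the nodes that offer a given tool. The class and field names, the node identifiers and the version strings are all hypothetical; the paper does not specify the actual format in which BOD stores these descriptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SystemAttribute:
    node_type: str   # "workstation" or "grid_server"
    platform: str    # e.g. "Linux", "AIX", "Solaris"
    cpu_count: int


@dataclass(frozen=True)
class SoftwareAttribute:
    category: str       # e.g. "gene_prediction"
    standard_name: str  # e.g. "Genscan"
    version: str        # placeholder value in this sketch


@dataclass
class NodeDescription:
    system: SystemAttribute
    software: List[SoftwareAttribute]

    def provides(self, standard_name: str) -> bool:
        # True if the node lists a tool with this standard name.
        return any(s.standard_name == standard_name for s in self.software)


# Hypothetical registry of two heterogeneous nodes.
REGISTERED: Dict[str, NodeDescription] = {
    "node-1": NodeDescription(
        SystemAttribute("workstation", "Linux", 2),
        [SoftwareAttribute("gene_prediction", "Genscan", "1.0")],
    ),
    "node-2": NodeDescription(
        SystemAttribute("grid_server", "AIX", 8),
        [
            SoftwareAttribute("fold_prediction", "Threader", "1.0"),
            SoftwareAttribute("gene_prediction", "Genscan", "1.0"),
        ],
    ),
}


def nodes_providing(nodes: Dict[str, NodeDescription], tool: str) -> List[str]:
    """Return the sorted names of nodes that offer the requested tool."""
    return sorted(name for name, desc in nodes.items() if desc.provides(tool))


print(nodes_providing(REGISTERED, "Genscan"))
print(nodes_providing(REGISTERED, "Threader"))
```

Describing every node with the same two records is what lets a scheduler treat a Linux workstation and an AIX Grid server uniformly when it looks for a tool.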
JOB PIPELINE MODEL
BOD utilizes a job pipeline model to manage computational requests from users. A job may be processed sequentially, in parallel or in a mixed manner, as described below in more detail.
In terms of the inner structure of a job, a user's request can be classified into three levels: job, subjob and task. A job is a user's computational request submitted at one time from the BOD web interface. A whole job may contain several steps, each step corresponding to a subjob. These subjobs must be processed in order to meet the job's requirement. Each subjob's input comes from the preceding subjob's output, and the final subjob's result is the final result of the job. A subjob may contain several tasks, each of which corresponds to a different method of completing the subjob. These tasks can run in parallel, and give a range of solutions to the subjob.
To make these concepts clearer, we have selected a scenario to illustrate them (Figure 3). The user begins with a set of shotgun DNA sequence fragment data, hoping to assemble them using two different methods, then predict the putative genes with two different algorithms based on the assembled sequence, and finally find the most similar folds of these genes among the currently available structures whenever possible. In the BOD system, we regard the above requirement as a whole job. It is made up of three steps. The intention of step 1 (subjob id 101) is to use assembly software (TIGR-Assembler and Phrap) to assemble the input shotgun DNA sequence fragment data into long sequence contigs. This step contains two tasks: calculating with TIGR-Assembler (task id 1001), and calculating with Phrap (task id 1002). The result of subjob 101 is the combination of the results of tasks 1001 and 1002, and will be used as the input file of step 2. Step 2 corresponds to subjob 102; it aims to carry out gene prediction using Genscan (task id 1003) and Geneid (task id 1004) separately; the results of these two tasks will be merged to form this step's result, and will be used as the input file of the next step. Step 3 is the final subjob, with id number 103. In this step, BOD uses Threader (task id 1005) to perform protein fold prediction. This step's result is simply the result of task 1005, and is also the final result of the job. In this result file, the user can find lists of ids of protein fold domains whose structures are most similar to the protein structure of each predicted putative gene in the assembled DNA sequence.
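The three-level hierarchy in this scenario can be written down as a small data model. The sketch below is illustrative only: the class names `Job`, `Subjob` and `Task` and the helper `execution_plan` are hypothetical, while the subjob ids, task ids and tool names are those given in the scenario above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Task:
    task_id: int
    tool: str


@dataclass
class Subjob:
    subjob_id: int
    tasks: List[Task]  # tasks of one subjob may run in parallel


@dataclass
class Job:
    subjobs: List[Subjob]  # subjobs are processed in list order


# The assembly -> gene prediction -> fold prediction scenario.
example_job = Job(subjobs=[
    Subjob(101, [Task(1001, "TIGR-Assembler"), Task(1002, "Phrap")]),
    Subjob(102, [Task(1003, "Genscan"), Task(1004, "Geneid")]),
    Subjob(103, [Task(1005, "Threader")]),
])


def execution_plan(job: Job) -> List[Tuple[int, List[int]]]:
    """List each step with the ids of the tasks that may run side by side."""
    return [(s.subjob_id, [t.task_id for t in s.tasks]) for s in job.subjobs]


for subjob_id, task_ids in execution_plan(example_job):
    print(subjob_id, task_ids)
```

The outer list carries the ordering constraint between steps, and each inner list groups the alternative methods whose results are combined before the next step begins.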
After the user's request has been parsed into individual tasks, the computation can be handled by the Grid computation technology. From this point of view, the relevant units are task, subtask and item. For a given task, BOD can handle multi-fasta file input. The multi-fasta file contains several items, each consisting of a description line followed by the detailed information (such as a DNA sequence or a peptide sequence). Computation is performed item by item at the computation workstation. For most tasks, BOD will use many computation nodes to perform the computing, so that a given computation node only needs to handle a subset of the items of the task. The BOD portal server splits the task into several subtasks based on the number of items in the input file and a ratio determined by the computational ability of each registered node or Grid server at that time; each subtask is then sent to a registered node or Grid server. A Grid server, like the portal server, splits the task further into subtasks and sends them to its sub-nodes. Computation workstations, or the sub-nodes of the Grid system, split the input file into items, and then process the input items one by one. The result of the task on a computation workstation is the combination of the results of each item; likewise, the result of the task on the portal server is the combination of the results of the subtasks. The items in the result file correspond to the items in the input file.
For some types of task (such as sequence assembly), the input multi-fasta file must be processed as a whole. This type of task cannot be divided further; it is regarded as a task containing only one subtask and need not be divided into items on the computation side.
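The item-based splitting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not BOD's code: the function names and the capability numbers are hypothetical, and the rule used to round fractional shares (largest remainder) is a choice made here, since the paper only says that the split is proportional to each node's computational ability.

```python
from typing import Dict, List


def split_fasta_items(text: str) -> List[str]:
    """Split multi-fasta text into items (description line plus data lines)."""
    items: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.startswith(">") and current:
            items.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        items.append("\n".join(current))
    return items


def proportional_counts(n_items: int, capability: Dict[str, float]) -> Dict[str, int]:
    """Share n_items among nodes in proportion to their capability."""
    total = sum(capability.values())
    if total <= 0:
        raise ValueError("total capability must be positive")
    exact = {node: n_items * c / total for node, c in capability.items()}
    counts = {node: int(share) for node, share in exact.items()}
    leftover = n_items - sum(counts.values())
    # Assumption: leftover items go to the nodes with the largest fractional share.
    by_remainder = sorted(exact, key=lambda node: exact[node] - counts[node], reverse=True)
    for node in by_remainder[:leftover]:
        counts[node] += 1
    return counts


def split_task(items: List[str], capability: Dict[str, float]) -> Dict[str, List[str]]:
    """Cut the item list into contiguous subtasks, one per node."""
    counts = proportional_counts(len(items), capability)
    subtasks: Dict[str, List[str]] = {}
    start = 0
    for node, count in counts.items():
        subtasks[node] = items[start:start + count]
        start += count
    return subtasks


fasta = ">s1\nACGT\n>s2\nGG\nCC\n>s3\nTT\n"
items = split_fasta_items(fasta)
subtasks = split_task(items, {"node-a": 2.0, "node-b": 1.0})
print({node: len(chunk) for node, chunk in subtasks.items()})
```

Because the subtasks are contiguous slices taken in node order, concatenating the per-node results in the same order keeps the items of the result file aligned with the items of the input file.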
JOB PROCESSING AND SCHEDULING
Job processing and scheduling form the central portion of the BOD system, and we describe the major components in this section (Figure 4). After a user submits a job from a client station, the BOD web portal parses the user's job request into subjobs and tasks, and stores this information in the central database. The BOD task scheduler reads in a job's tasks, splits each task into subtasks and delivers each subtask to an applicable registered node. The XML descriptor file of the task acts as a guide that directs the (local) task scheduler in performing the task/subtask/item computation, i.e. carrying out the computation for each item, processing the results and preparing input files for the next process in the pipeline. The BOD communication manager handles communication between the portal server and the computation nodes, monitors their status, collects the results from the nodes when all tasks have finished, and merges the results. During the entire computation process, the BOD job scheduler controls the computation flow of the pipeline. The tasks within one step can be computed concurrently, while the steps themselves must be processed in order. In addition, the BOD job status browser was developed to acquire and display the progress of the computation at any stage upon the user's request. The details of the components are provided in the Supplementary Material.
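The control flow enforced by the job scheduler, with steps in order and the tasks of one step running concurrently, can be sketched in a few lines. This is a simplified, single-machine illustration using Python threads rather than BOD's Java schedulers and remote nodes; the function `run_pipeline` and the toy tools are hypothetical stand-ins for real programs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Tool = Callable[[str], str]


def run_pipeline(steps: List[List[Tool]], job_input: str) -> str:
    """Run steps in order; run the tasks inside each step concurrently."""
    data = job_input
    with ThreadPoolExecutor() as pool:
        for tasks in steps:
            # All tasks of one step receive the same input and run side by side.
            futures = [pool.submit(task, data) for task in tasks]
            # Merge in submission order so the combined result is deterministic;
            # the merged text becomes the input of the next step.
            data = "\n".join(future.result() for future in futures)
    return data


# Toy stand-ins for two assemblers followed by one gene predictor.
def assembler_a(text: str) -> str:
    return f"A({text})"


def assembler_b(text: str) -> str:
    return f"B({text})"


def predictor(text: str) -> str:
    return f"P({text})"


final_result = run_pipeline([[assembler_a, assembler_b], [predictor]], "reads")
print(final_result)
```

Waiting on every future of a step before starting the next one is what guarantees that a later step never sees a partial result from an earlier one.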
BOD aims to bring together worldwide bioinformatics computation resources in order to maximize the available computation capability. It is a highly customizable, integrated system. BOD has an open framework that enables any resource to participate easily in the system. The web interface of BOD enables users to define their own pipelines, and thus the services provided are expected to meet users' requirements on demand.
The integration in the BOD system has two aspects. The first aspect is the integration of computer hardware resources. Using Grid computation technology, BOD could theoretically integrate all bioinformatics workstations worldwide. BOD as a whole behaves like a huge parallel computer, so the speed of computation can be significantly improved. From this point of view, the computational ability of BOD is, in principle, unlimited. At the same time, BOD enables users to use scarce resources that are not publicly available. Users need not access the resources directly; instead, BOD acts as an agent that helps the user to use the resources. BOD provides a convenient way to meet the user's demand, while preserving the provider's security and privacy.
The second aspect of integration is the software. In the BOD system, various bioinformatics software resources can be integrated in two ways. First, some software tools can be combined to form a pipeline. BOD provides parsers to eliminate the incompatibilities among these software tools and databases. This extends the function of any individual software tool, and can fulfill most bioinformatics computation demands. Second, some software tools have similar functions, but use different computational methods to reach the same goal. BOD integrates these resources as parallel tasks that are computed concurrently, thus providing the user with multiple solutions to the same question. Together, these two kinds of integration bring more comprehensive and effective bioinformatics services.
The BOD system systematically integrates bioinformatics with Grid computing, and enables scientists to perform large-scale, customizable, multi-step, multi-task bioinformatics studies. It does not reinvent BLAST, ClustalW, databases, etc.; instead, it only maintains local copies of some fundamental tools or relies fully on participants' resources. It does not require pulling programs or databases from participants, but pushes structured requests to them and retrieves only the results. It does not push a heavy-duty computation to a single node unless this is necessary or unavoidable, but splits a large task into smaller ones, if possible, and dispatches them proportionally to the available nodes.
The features of the BOD system are quite different from those of currently available, similar services. Most bioinformatics services available on the web can take only one task at a time, can only compute one input item from a single-sequence fasta file, or can use only their own resources. BOD is designed to work on multi-fasta files and to handle complex jobs. The pipeline model can handle most present-day bioinformatics problems. BOD is most suitable for high-throughput computation, and is designed specifically for ease of use from the point of view of the end user.
The BOD system enables end users to focus on their specific biological analysis and avoid having to deal with issues like data access and parsing, and job management. Other projects such as Biopipe [(15); http://www.biopipe.org], which was hosted by the Open Bioinformatics Foundation (OBF; http://www.open-bio.org), can also handle pipeline computation work, and have developed a unified exchange format that can plug together currently available bioinformatics modules. BOD has three major advantages compared with the Biopipe project: (i) BOD can support multiple tasks targeting the same computation goal in one pipeline protocol; (ii) Biopipe requires the end user to write the XML file describing the protocol, sources and modules on their own, while BOD only needs the administrator of the BOD system and the service provider to do this work, and does not require the end user to have prior knowledge of bioinformatics; (iii) BOD utilizes Grid technology to distribute a task to multiple computation nodes that perform the computing in parallel, dramatically enhancing the performance.
The BOD system is flexible, reproducible and robust. The central database schema used by BOD is designed specifically for this project, and can easily be extended to similar projects handling complex job management and Grid computing. The XML criterion designed for task computation can handle tasks of various styles, such as multiple input files, multiple parameters, multiple commands and multiple result files. The XML descriptor file can be configured and maintained with ease. The programs supported by the XML schema can be in any UNIX file format, including binary files, Perl scripts, shell scripts, etc. It is quite easy to add a service or program to the BOD system, by adding its corresponding description to the XML template, preparing appropriate parsers if applicable, and modifying the web interface and the corresponding program of the BOD web portal.
The programming languages and database software used in BOD (Perl, Java and MySQL) are open source. The program of the whole system is extensible and easy to migrate between platforms.
The BOD system can support various UNIX platforms. The platforms of the computation workstations do not need to be uniform. This cross-platform feature makes it easy for new participants to join. More workstations with various bioinformatics resources are expected to join BOD and expand BOD services.
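To give a feel for how a descriptor of this kind could drive a computation, the sketch below parses a small XML document and expands it into a command line. The actual BOD XML schema is not reproduced in this paper, so everything here is a hypothetical stand-in: the element and attribute names (`task`, `input`, `param`, `command`, `result`), the `{name}` placeholder syntax, the program `example_tool` and its options, and the function `build_commands`.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical descriptor: NOT the real BOD schema, and example_tool is not a real program.
DESCRIPTOR = """
<task name="translate">
  <input name="dna" file="input.fasta"/>
  <param name="frame" value="1"/>
  <command>example_tool --in {dna} --out {protein} --frame {frame}</command>
  <result name="protein" file="output.pep"/>
</task>
"""


def build_commands(xml_text: str) -> List[str]:
    """Fill the {name} placeholders of every <command> from the descriptor."""
    root = ET.fromstring(xml_text.strip())
    values: Dict[str, str] = {}
    # Input and result files are looked up by their "name" attribute.
    for tag in ("input", "result"):
        for element in root.findall(tag):
            values[element.get("name")] = element.get("file")
    # Parameters contribute plain values.
    for element in root.findall("param"):
        values[element.get("name")] = element.get("value")
    return [
        command.text.strip().format(**values)
        for command in root.findall("command")
    ]


commands = build_commands(DESCRIPTOR)
print(commands)
```

Since the lookups use `findall`, a descriptor in this sketch may repeat the `input`, `param`, `command` and `result` elements, which mirrors the multiple-file, multiple-parameter and multiple-command cases mentioned above.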
Supplementary Material is available at NAR Online.
The authors are very grateful to Dr Jinghui Zhang (NCI of NIH) for discussions of the BOD idea at an early stage, to Professor Zihe Rao (Tsinghua University) for generously sharing the Threader package, and to all the software authors for permission to use their tools in the BOD system. We thank several colleagues for helpful documentation and advice about this system: Yiqing Wu, Feng Tian and Haiqing Hua. The authors also thank the IBM SUR program for providing a p630 workstation for BOD development. This work is supported by China National High-Tech Grant (863) numbered 2002AA234011A.
1Department of Biological Science and Biotechnology, 2Institutes of Biomedicine, 3Department of Computer Science and Technology and 4Ministry of Education Key Laboratory of Bioinformatics, Tsinghua University, Beijing 100084, People's Republic of China