Motivation: In addition to existing bioinformatics software, many new tools are being developed worldwide to supply services for an ever-growing, widely dispersed and heterogeneous collection of biological data. The integration of these resources under a common platform is a challenging task. To this end, several groups are developing integration technologies in which services are usually registered in some sort of catalogue, so that novel discovery and access mechanisms can be implemented. However, each service demands a specific interface to accommodate its parameters, and linking the inputs and outputs of different services to solve a biological problem is a complicated task.
Results: In this work we address the design and implementation of a versatile web client for accessing BioMOBY-compatible services (BioMOBY being a system by which a client can interact with multiple sources of biological data regardless of the underlying format or schema) using the service descriptions stored in the BioMOBY catalogue. The automatic interface generator significantly reduces development time and produces uniform service access mechanisms. The design and proof of concept for such a client, including the generic interface generator, have been developed and implemented at the National Institute for Bioinformatics in Spain.
Bioinformatics research is heavily based on web resources distributed around the world. Unfortunately, most of the data repositories used in the biological community are heterogeneous in content, each one having its own data format and usually requiring specific services to explore and exploit the information. To take full advantage of the amount of available information, researchers need to be able to access, link, combine and query these biological datasets easily and efficiently, and the large number of tools that use these data sources needs to be integrated.
To address this problem, a growing effort is being focussed on developing common data interchange methods and common reference ontologies, and on establishing automated query access. This field, known as data and services integration, has become a particular area of interest in bioinformatics due to the potential returns on efficiency. Several groups have focussed on general solutions for such integration infrastructures.
TAMBIS (Stevens et al., 2002) makes use of an ontology that provides homogeneous views of heterogeneous databases, providing a query interface to create and refine queries. Model-based mediation (Ludäscher et al., 2003) is a paradigm for data integration in which data sources can be integrated, taking advantage of auxiliary expert knowledge. BioDataServer (Lange et al., 2002) is a mediator architecture whose wrappers export information about the source schema, data and query processing capabilities of each data source. BioBroker (Aldana et al., 2005) is a traditional XML mediator applied to biological data sources that has a unique feature: the capability of managing user software tools and algorithms. BioKleisli (Davidson et al., 1997) is an application of the Kleisli framework to data sources that are critical to the Human Genome Project.
BioMOBY (Wilkinson et al., 2002) is a project that proposes an architecture for the discovery and distribution of biological data using web services. In this architecture data and services are decentralized. However, the resources are registered in a central location called MOBY Central. BioMOBY objects are lightweight XML, and are wrapped in the query and response objects of a simple object access protocol (SOAP) transaction. The primary components of this architecture are MOBY Services (bioinformatics tools), MOBY Objects (input and output data of the services), MOBY Central (registry of all resources), and the Object and Service hierarchies. It introduces the use of web services for publishing and using biological data, but it is not an integration architecture that allows applications or users to solve complex queries across different resources.
The Taverna project (Oinn et al., 2004) has developed a stand-alone tool (requiring an installation procedure) for the execution of bioinformatics web services (including BioMOBY services), with a fine graphical interface for the creation and execution of workflows. However, it provides neither an authentication mechanism nor persistence of data for logged-in users. A scheduling system is also desirable for balancing the processing load and offering fault-tolerance support.
The distributed annotation system (DAS) is a client-server system in which a single client integrates information from multiple servers. It allows a single machine to gather up genome annotation information from multiple distant web sites. To meet the challenges of integrating and analyzing diverse scientific data from the variety of domains within life sciences, IBM has developed IBM DiscoveryLink (DeCarlo, 2002), with single query data access. This software allows researchers to work with distributed data sources and diverse data formats. PISE (Letondal, 2001) is a Web interface generator for molecular biology command-line driven programs. The generator uses XML as a high-level description language of the legacy software parameters. Its aim is to provide users with the equivalent of a basic Unix environment, with program combination, customization and basic scripting through macro registration.
EMBOSS (Rice et al., 2000), ‘The European Molecular Biology Open Software Suite’, is a free Open Source software analysis package specially developed for the needs of the molecular biology (e.g. EMBnet) user community. The software automatically copes with data in a variety of formats and even allows transparent retrieval of sequence data from the web. Jemboss (Carver et al., 2003) is a graphical user interface (GUI) for the EMBOSS suite.
However, when several services are to be integrated under a common platform, each one has its own web-interface style (e.g. see very popular sites like NCBI, EBI, ExPASy, etc.) and its own way of including parameters, which is not the most appropriate environment for integration. We have been working with the aim of producing a common user interface to access and present information from multiple online services and databases, as part of the integration of different Web Services. The INB project aims at the following:
To complement the BioMOBY standard in providing access to relevant resources: data, service and computational power.
To combine retrieved information from relevant resources based on common interfaces, so linking resources and tools that could not easily be linked in another way. This can bring new perspectives to the biological information analysis in Bioinformatics.
To make interfaces simple to use, self-contained and intuitive, eliminating the need for a high level of tacit knowledge.
To display this information in a consistent faceted way.
To offer to the research community an easy tool for building workflows linking several services for the solution of specific biological problems.
In this paper we focus on the description of the INB-Client interface, designed to allow unified access to services and, internally, to simplify and automate the incorporation of new services by means of automatic web interface builders, while at the same time supplying powerful and versatile Grid management for the efficient use of computational resources. In Section 2 we describe the general architecture that extends the BioMOBY repository and constantly updates the full functionality of services and resources. The description of the interface covers the several views of stored information: the browsing of the ontology, creation of objects, execution of services, visualization of object results and a help system for assisting in the use of the interface. Section 3 presents a biological example that illustrates the benefit of using this system. Finally we present several conclusions and an account of ongoing work.
2 SYSTEMS AND METHODS
2.1 General considerations
The National Institute for Bioinformatics (INB) in Spain has addressed the integration problem in bioinformatics through the design of a simple, dynamic and extensible platform to represent, recover, process, integrate and discover knowledge. To integrate geographically dispersed resources, a Grid-enabled system (see Fig. 1) has been built on top of the BioMOBY API, offering a view of the system databases as a single data source where services are readily available for enhancing data processing. The description of input/output objects is coordinated and standardized by means of an object ontology in such a way that services can operate with one another, wiring natural bioinformatics workflows. Automatic interface and help system builders have been incorporated into the architecture to standardize and facilitate user communication. Going beyond traditional bioinformatics platforms, the INB platform includes a data persistence system, user management and scheduling abilities, features that characterize a new generation of bioinformatics platforms.
The scheme in Fig. 1 depicts the INB system architecture, organized on three main levels, the user being minimally armed with a web browser and seeking services to process his collection of biological data: (1) a web interface at the top of the architecture that facilitates communication between the user and the platform, (2) the architecture core, which includes the interface to services through the BioMOBY API, and (3) at the bottom of the scheme, the service providers.
A web interface manages user sessions with an authentication mechanism. An automatic web interface engine dynamically builds interfaces for browsing data objects, services and namespaces (mostly associated with data containers). The list of available services is displayed in the form of a browseable tree from which the user can access these services. In the same way, automatic interfaces are built for service parameters, and a generic creation service allows new objects to be incorporated into the system. Upload and download procedures have also been incorporated in the platform. Note that the system produces an auto-generated service help that includes training examples for getting started with each service (made available during the service registration procedure).
At the internal level, once a service has been launched, the system provides notification on the progress status of services, including an historical record of executed tasks, together with the relationships between input data, applied service and output data. Error notification has been incorporated in the system by extending the BioMOBY protocol. Frequently output data become the input for new services. The GUI provides a specific list of suitable services that can be applied.
Different services and multiple instances of the same service can be installed at the same site or at different sites, offering computational power on different scales (ranging from simple PC servers to high-cost multiprocessor platforms). Since a large number of users is expected, a pool of tasks to be solved at any given moment will be the natural working scenario. In this scenario, scheduling arises as a natural need.
The task scheduler works over this pool of tasks and uses the map of services to choose the best server for each task. A task dispatcher is in charge of communicating with the service that solves the task. It is worth noting that new servers and services become available as soon as the registration procedure is completed. Since servers can be configured as a subgrid, the scheduler transfers the job to the subgrid's front-end and keeps a record of the pending jobs and the machines in charge of them. A buffering technique has been implemented to send work in advance, which avoids delays in job replenishment from the scheduler and benefits from I/O overlapping on machines with replicated services; this is important because bioinformatics applications are typically I/O bound.
Load distribution is performed in a dynamic and adaptive fashion, as soon as new tasks become available. The current configuration is used to establish the computational power available at any given time. The system evaluates the CPU cost of each task and adjusts its predictions when the services report completed tasks. Load size is computed as a function of the task's CPU cost and the quality of the service, estimated as the historic response time.
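The adaptive load distribution described above can be illustrated with a minimal Python sketch. It is not the INB scheduler: the `Server` class, the `dispatch` and `report` functions, and the particular formula (pending cost multiplied by the mean historic seconds per unit of cost) are assumptions chosen only to show how a task's CPU cost and a server's historic response time might be combined.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical server record kept by a scheduler (illustrative only)."""
    name: str
    history: list = field(default_factory=list)  # observed seconds per unit of CPU cost
    pending_cost: float = 0.0                    # cost of work already sent to this server

    def rate(self, default=1.0):
        """Mean seconds per unit of cost, taken from the historic record."""
        return sum(self.history) / len(self.history) if self.history else default

    def expected_finish(self, cost):
        """Predicted time until a new task of the given cost would complete."""
        return (self.pending_cost + cost) * self.rate()


def dispatch(cost, servers):
    """Assign a task to the server with the earliest predicted completion."""
    best = min(servers, key=lambda s: s.expected_finish(cost))
    best.pending_cost += cost
    return best


def report(server, cost, elapsed):
    """Adjust the prediction when a service reports a finished task."""
    server.pending_cost -= cost
    server.history.append(elapsed / cost)


# Two hypothetical servers: one fast, one slow.
fast = Server("cluster", history=[0.5])
slow = Server("pc", history=[1.8])
assigned = [dispatch(10.0, [fast, slow]).name for _ in range(5)]
print(assigned)
```

In this sketch the slower machine starts to receive work only once the queue on the faster one makes its predicted completion time worse, which is one simple way of reading "dynamic and adaptive".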
2.2 The INB portal
Traditionally, bioinformatics web servers have been open to anonymous users (the user connects, uses the service and logs out). It is also true that most bioinformatics tasks involve the use of several services in pipeline fashion over an initial set of data. To reduce communication overheads and facilitate data management, it is recommended that a data persistence system be offered. The INB platform extends traditional bioinformatics services by offering persistent storage of user data (Fig. 2) and by defining three types of users: identified, anonymous and administrator. Anonymous users can make use of the system freely and their data are stored for a short period of time. Identified users, on the other hand, have quota-based persistence time.
When a data persistence system is provided, it should be complemented with a set of editing capabilities such as browsing data files and copying, renaming and deleting objects. Moreover, users can download and upload standard objects from their own storage system (Section 2.4). To improve the use of stored objects, a discovery system is provided to identify which services are available for each type of object (and for the descendants of this type of object in a hierarchical organization of objects).
Since there is a gap between (traditional) biological data format and standardized objects, the system provides automatic and generic ‘creation’ services to allow user data to be incorporated into the system (Section 2.3). This ability does not exclude specific creation services provided by users.
In addition to the storage of data, the system offers processing capabilities that include selecting services for their execution (Section 2.5). Once a service has been chosen, the system will automatically ask for specific information (parameters) on the service. Full automatic interface builders make use of the information registered for the service. Notification about the progress status of launched (and finished) services is provided, together with the relationships between input data, applied service and output data. Frequently output data become the input for additional services. Client facilities provide a specific list of suitable services that can be applied.
Furthermore, automatic and standard help files should be available for helping users to understand how to use the system (Section 2.7).
2.3 Browsing the system
Stored information in the system can be classified in three groups: data types, services and namespaces (using MOBY nomenclature). Thus, the first view that we have developed is a tree that shows these groups (Fig. 3). The branch of data types shows all the terms defined in the MOBY ontology by means of a tree-form view. The second element in the tree allows users to see service names classified by means of their types. Finally, users can access Namespaces defined in our repository.
Furthermore, these views offer value-added services by means of links with other tools. Thus, each term of the ontology is related to a tool that searches services for creating this type of object (Section 2.4), and each service name is linked with a tool for executing BioMOBY services (Section 2.5).
2.4 Creating objects
Most of the BioMOBY services stored in our repository take as input complex MOBY objects rather than integers, strings, etc. (an example of hierarchy is: object → VirtualSequence → GenericSequence → AminoAcidSequence → CommentedAminoAcidSequence). It is therefore important to provide the web interface with the capability of transforming unformatted user data into ontology-based objects, bearing in mind that each object has unique characteristics that must be analysed in order to generate an interface for creating it, and that new objects become available in the system as new services are incorporated. Although specific creation services will be supplied for particular objects in the ontology, this solution has an important drawback: the need for new services as new objects become available in the ontology.
To cope with this problem, the client obtains the object specifications (stored in the BioMOBY catalogue) and generates a generic interface to introduce data. Upload capabilities are also available to allow users to import objects they have stored in their local repositories. Additionally the use of specific creation services is possible when the object's characteristics call for it.
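As a rough illustration of how a generic creation interface could be derived from object specifications, the following Python sketch flattens a type's inherited and own members into a list of form fields. The `SPECS` dictionary is a hand-written stand-in for what the BioMOBY catalogue would return; its layout, the member names (`Length`, `SequenceString`) and the function names are assumptions made for this example, not the INB-Client implementation.

```python
# Hypothetical stand-in for specifications held in the catalogue.
# "isa" names the parent type; "has" lists (member name, member type) pairs.
SPECS = {
    "Object": {"isa": None, "has": []},
    "VirtualSequence": {"isa": "Object", "has": [("Length", "Integer")]},
    "GenericSequence": {"isa": "VirtualSequence",
                        "has": [("SequenceString", "String")]},
    "AminoAcidSequence": {"isa": "GenericSequence", "has": []},
}

# Simple types map directly onto a form widget (assumed widget names).
PRIMITIVES = {"String": "text", "Integer": "number"}


def form_fields(type_name, prefix=""):
    """Return (field name, widget kind) pairs needed to fill in a type."""
    if type_name in PRIMITIVES:
        return [(prefix or type_name, PRIMITIVES[type_name])]
    spec = SPECS[type_name]
    fields = []
    if spec["isa"]:
        # Inherited members come first, so parents are expanded before children.
        fields += form_fields(spec["isa"], prefix)
    for member, member_type in spec["has"]:
        name = f"{prefix}.{member}" if prefix else member
        fields += form_fields(member_type, name)
    return fields


def creation_form(type_name):
    """Every object also carries a namespace and an id."""
    return [("namespace", "text"), ("id", "text")] + form_fields(type_name)


print(creation_form("AminoAcidSequence"))
```

Because the form is computed from the specification, a type added to the ontology later would get a creation form without any new code, which is the point of the generic approach.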
2.5 Using services
When users access a service (by invoking the service on the browsing tree) the system offers an automatically generated interface for introducing the required configuration of parameters and inputs (Fig. 4). This interface is also generated taking advantage of the information stored in BioMOBY databases, which define the parameters, inputs and outputs for each registered service.
Parameters always have simple data types (integer, string, etc.), and the web interface shows an input text box for setting them. Some service parameters define a set of allowed values; in that case a list of these values is shown so that the user cannot enter free text.
Services can require input parameters of simple data types or complex objects. For the former, the interface provides input textboxes; for the latter a pop-up list of compatible objects is shown. The list of compatible objects for a data type is obtained by searching related data types by means of ‘is_a’ descriptions (starting from the type required by the service). Then, user objects stored in the system that match the required data type are listed. However, a user might need to create an object before using the selected service. In this case the interface offers ‘Create’ and ‘Upload’ options, for creating the object (using a specific or generic creation service for data type in question) or uploading an object stored in the user's machine.
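The ‘is_a’ lookup can be sketched in a few lines of Python. The hierarchy below is the one quoted in Section 2.4; the stored object names and the function names are hypothetical. The text describes descending from the type required by the service, whereas this sketch walks upwards from each stored object's type, which selects the same objects.

```python
# child type -> parent type, following the hierarchy quoted in Section 2.4
IS_A = {
    "VirtualSequence": "Object",
    "GenericSequence": "VirtualSequence",
    "AminoAcidSequence": "GenericSequence",
    "CommentedAminoAcidSequence": "AminoAcidSequence",
}


def is_compatible(object_type, required_type):
    """True if object_type is required_type or one of its descendants."""
    current = object_type
    while current is not None:
        if current == required_type:
            return True
        current = IS_A.get(current)  # None once the root has been passed
    return False


def compatible_objects(required_type, user_objects):
    """Names of stored objects that can be offered in the pop-up list."""
    return [name for name, obj_type in user_objects
            if is_compatible(obj_type, required_type)]


# Hypothetical objects stored for one user: (name, data type).
stored = [
    ("smn_human", "AminoAcidSequence"),
    ("smn_notes", "CommentedAminoAcidSequence"),
    ("raw_seq", "GenericSequence"),
]
print(compatible_objects("AminoAcidSequence", stored))
```

A service that requires an AminoAcidSequence would therefore be offered the first two objects but not the more general GenericSequence.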
Since services return objects that must be stored in the system (data persistence), the interface has a textbox for describing the output object (including default values for the object name).
Finally, the architecture offers object persistence and scheduling policies, which involve the following tasks:
Storing parameter and input values in the database.
Storing the output objects in the database (object status is set to ‘creation’, to ensure that new services applied to that object must wait until the object becomes available).
Creating a task for executing the service. This task will be issued by the scheduler taking parameters and input values and performing a remote call to the service.
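The three steps above can be sketched with in-memory structures in Python. This is only an illustration under stated assumptions: the real platform stores these records in a database, and the dictionary layout, the function names and the ‘available’ status label are hypothetical (the text names only the ‘creation’ status).

```python
from collections import deque

objects = {}     # object name -> {"status": ..., "content": ...}
tasks = deque()  # pool of pending tasks, to be consumed by a scheduler


def submit(service, inputs, params, output_name):
    """Store the request, reserve the output object and queue a task."""
    # The output exists immediately, but in 'creation' status, so that
    # services applied to it must wait until it becomes available.
    objects[output_name] = {"status": "creation", "content": None}
    task = {"service": service, "inputs": list(inputs),
            "params": dict(params), "output": output_name}
    tasks.append(task)
    return task


def is_runnable(task):
    """A task can be issued only when all of its inputs are available."""
    return all(objects[name]["status"] == "available"
               for name in task["inputs"])


def complete(task, content):
    """Record the result returned by the remote service."""
    objects[task["output"]] = {"status": "available", "content": content}


# Hypothetical usage: a second service is requested on a pending output.
objects["smn_seq"] = {"status": "available", "content": "..."}
t1 = submit("run Blast from AminoAcid Sequence", ["smn_seq"], {}, "blast_report")
t2 = submit("run Clustalw from Blast", ["blast_report"], {}, "alignment")
print(is_runnable(t1), is_runnable(t2))
```

Reserving the output object before execution is what lets a user chain a further service onto a result that has not been computed yet.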
The execution of a service (which is finally run in the corresponding remote server) can be traced taking advantage of the elements on the right side of the interface (Fig. 5). Thus, once the execution ends the user can visualize the results (see Section 2.6).
2.6 Viewing results
Standardization is also sought in the display of results. At present the platform has two ways of displaying the content of the output object. The first is as an XML object following the BioMOBY standard. This is also the default format for downloading objects locally. The second is an HTML format, for which an automatic template has been built that covers general information about the object attributes (type, namespace and Id) as well as the object content. Additionally, the object can be parsed by a script that shows it in a user-friendly way. For this, we have used the BioPerl modules, widely used in bioinformatics to analyse and show biological data, and other bioinformatics tools (Fig. 6).
These parsed objects are always included in the general template to give a homogeneous image, so that the user sees no difference between the service viewers. It is nevertheless made clear that the services come from heterogeneous servers.
2.7 Help system
The BioMOBY catalogue includes a description field in the database for storing a brief description to help the user discover the service. In our view, a complex platform requires a more informative and detailed help system. There are several factors behind this initiative: platforms such as the INB are intended to integrate geographically widespread services, and at least some measure of success is anticipated. That is to say, we expect a continuous flow of services to be registered at the INB. It is therefore essential to design a useful, up-to-date help system that is as uniform as possible.
Of course, a simple alternative would be to include in the BioMOBY field a URL pointing to detailed web-based help provided by the service's authority. This would be easy and fast to implement and would present no compatibility problems. However, our experience suggests that this method would lead to a set of non-uniform, out-of-date web pages, each with its own syntax and often containing broken links. Moreover, responsibility and authority for the help system would remain distributed among the service providers, with a high risk of unsupported help.
We have tackled this problem on two main fronts: (1) enhancing the quality of the information gathered during the registration procedure, which allows us to build the help pages with an automatic engine, and (2) providing a generic help server (including mirrors) with an explicit URL. In this way the help system, under a single authority, reduces the risk of broken links, becomes fault-tolerant and offers a homogeneous view, independently of the status of particular servers (it is worth noting that the same service can be replicated in different servers, always referring to a common service help).
The main drawback of this proposal is the need for some simple preprocessing of the help system, and not only during the registration procedure. A short delay could arise while the help is being updated, since modifications need to be authorized. However, when the end user looks up the help for a service, it is generated automatically from the BioMOBY service database. These help fields will always be present because they are mandatory when a new service is registered. Additionally, the service provider can include an XML file with new fields not present in the BioMOBY standard.
The above platform presents a set of services, linked to each other, which carry out various analyses on biological data. These different tools can be automatically applied to a set of data to produce a complete analysis for the solution of complex biological problems using the same platform. As a demonstration, we present a practical example for the solution of a phylogenetic study using an amino acid sequence as the starting point (Fig. 7). A homology search is conducted (Blast service) to obtain similar sequences with a common evolutionary history. The output from this service contains a set of putative homologues of the query sequence. A new service is linked (Clustalw from Blast) to build a multiple alignment with the most similar reported sequences. Finally, a phylogenetic tree is produced using the CreateTreeFromClustalw service, highlighting the relations between all the sequences.
Moreover, this generic data flow can be customized by changing the service parameters for deeper study or by adding different services to obtain other accessory outputs.
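The way this data flow is wired, with each service's output type matching the next service's input type, can be sketched as follows in Python. The service names are those used in this example; the type labels `BlastReport`, `Alignment` and `Tree` and the `plan` and `next_services` functions are hypothetical, and no remote call is made.

```python
# (service name, input type, output type); the type labels are illustrative.
SERVICES = [
    ("run Blast from AminoAcid Sequence", "AminoAcidSequence", "BlastReport"),
    ("run Clustalw from Blast", "BlastReport", "Alignment"),
    ("run Create Tree from Clustalw", "Alignment", "Tree"),
]


def next_services(data_type):
    """Services that can be applied to an object of the given type."""
    return [name for name, input_type, _ in SERVICES if input_type == data_type]


def plan(start_type, goal_type):
    """Chain services by matching each output type to the next input type."""
    steps, current = [], start_type
    while current != goal_type:
        candidates = [s for s in SERVICES if s[1] == current]
        # Stop if nothing applies, or if the chain is longer than the
        # registry (which would indicate a cycle).
        if not candidates or len(steps) >= len(SERVICES):
            raise ValueError(f"no chain of services from {start_type} to {goal_type}")
        name, _, current = candidates[0]
        steps.append(name)
    return steps


print(plan("AminoAcidSequence", "Tree"))
```

Adding an accessory branch to the flow would amount to registering another service whose input type matches one of the intermediate outputs.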
This example of data flow (main branch) has been automatically tested with the human survival motor neuron protein (SMN; Swiss-Prot Accession Number: Q16637). The amino acid sequence for this protein is obtained with the Retrieval->‘get AminoAcidSequence from Uniprot’ service, after first creating its input object with Namespace (data source) equal to ‘Swiss-Prot’ and ID = ‘SMN_HUMAN’. This sequence object is then sent to the Analysis: ‘run Blast from AminoAcid Sequence’ service to obtain a list of putative homologues of SMN from different organisms (Fig. 7). This service is run against the Swiss-Prot database, with the remaining parameters left at their default values. To obtain relationships between the query and the homologous sequences, the Analysis: ‘run Clustalw from Blast’ service is used, taking the Blast report as input. A relaxed E-value (one) is used as a threshold for this service to select distantly related homologous hits and to carry out the multiple alignment. As a result several related sequences with common domains are identified, the ‘Tudor domains’ being the most significant.
These Tudor domains are thought to function as protein–protein interaction motifs during RNA metabolism and/or transport, and they were first precisely identified as repeats in Drosophila melanogaster (Ponting, 1997; Callebaut and Mornon, 1997). As a result of the service Analysis: ‘run Create Tree from Clustalw’, this large Drosophila protein (P25823) appears separated from the query sequence in the phylogenetic tree, while the latter (Q16637) is next to the other proteins from the same family (O02771, O18870, rat and mouse homologues O35876, and P97801, and one fish protein Q9W6S8). The survival of motor neuron-related splicing factors (O75940 and Q8BGT7) are grouped together, and other proteins containing the Tudor domain (Q91W18, and Q9H7E2; Tudor domain containing protein 3) are grouped in another branch. Surprisingly, the large Drosophila protein seems to be more closely related to the splicing factors than to the Tudor domain-containing proteins, whose function is currently unknown. In short, these results suggest that the Drosophila protein, which is required during oogenesis for the formation of primordial germ cells and for normal abdominal segmentation, could be a splicing factor assisting this process, although the Tudor domain-containing proteins could also have this function.
In summary, we report a client engine for the automatic and dynamic development of service interfaces, built on top of the BioMOBY standard. This user-friendly interface allows the flexible integration of services to provide fast and intuitive ‘wiring’ of services, thus creating virtual, complex, distributed and powerful bioinformatics machines. Based on semantic interconnection concepts, the platform is able to integrate into workflows diverse and widely dispersed data collections and various processing services developed by different users and groups. It does so through a web-based interface, expanding the functionality of current services and enabling the easy incorporation of new procedures to customize the system for specific concerns.
This work has been partially supported by grant ‘GNV5-Bioinformática Integrada’ from Genoma-España.
Conflict of Interest: none declared.