BioServices: a common Python package to access biological Web Services programmatically

Motivation: Web interfaces provide access to numerous biological databases. Many can be accessed programmatically thanks to Web Services. Building applications that combine several of them would benefit from a single framework.
Results: BioServices is a comprehensive Python framework that provides programmatic access to major bioinformatics Web Services (e.g. KEGG, UniProt, BioModels, ChEMBLdb). Wrapping additional Web Services based on either Representational State Transfer (REST) or Simple Object Access Protocol/Web Services Description Language (SOAP/WSDL) technologies is eased by the use of object-oriented programming.
Availability and implementation: BioServices releases and documentation are available at http://pypi.python.org/pypi/bioservices under a GPL-v3 license.
Contact: cokelaer@ebi.ac.uk or bioservices@googlegroups.com
Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction and Installation
BioServices has thorough, up-to-date documentation available on PyPI (the Python package repository). The online documentation provides a User Guide as well as a Reference Guide. All classes and functions are documented, and test coverage is around 80%.
The source code of BioServices is also available on PyPI. Provided pip is installed on your system, the following command should install BioServices and its dependencies automatically:
    sudo pip install bioservices
If pip is not installed, please see the external pip installation page. If pip fails to install bioservices, you may want to try the easy_install tool instead:
    sudo easy_install bioservices
In the first section of this supplementary data (Section 2), we reproduce part of the tutorial available in the online BioServices documentation. The tutorial demonstrates how to use BioServices classes to obtain information about a protein using different Web Services available through BioServices. Section 4 provides an example that combines BioServices with an external application called PyMOL. Similarly, the following two sections show how to use BioServices and BioPython together and how to write a plugin for Galaxy. The last section (Section 7) is for developers (or users) who want to implement a new class dedicated to a Web Service that is not yet available in BioServices (based on either the REST or WSDL protocol).

Tutorial 1: Protein test case study
Application: retrieving information about a given protein
This section uses BioServices to demonstrate the benefit of combining several services within a single framework, using Python as a glue language.
In this tutorial we use BioServices to obtain information about a specific protein. Let us focus on the protein known as ZAP70 (Homo sapiens).

Get a unique identifier and gene names from a name
Given the gene name of a protein, we first want to obtain its unique UniProt identifier. Using the UniProt class provided in BioServices, we can obtain the unique accession number of ZAP70, which may be useful later on. Let us first create an instance of the UniProt service and use the UniProt.search() method:
>>> from bioservices import *
>>> u = UniProt(verbose=False)
>>> res = u.search("ZAP70_HUMAN") # could be lower case
The returned answer is in tabulated format by default. Other formats, such as HTML or XML, can be selected with the format argument. Printing the result with print(res) shows the table returned by the search method. Let us simplify it even further. In BioServices, the tabulated output contains several columns, but we can select only a subset, such as the Entry (accession number) and the gene names, which are coded as "id" and "genes" in the UniProt database:
>>> res = u.search("ZAP70_HUMAN", format="tab", columns="id,genes")
>>> print(res)
Entry Gene names
P43403 ZAP70 SRK
So here we got the Entry P43403, which is the unique identifier we were looking for. In this case, it was easy because the input name is a gene name itself. In other cases, one may need to inspect the description or protein names rather than the gene names only.
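The accession number can also be extracted programmatically instead of being read off the printed table. The following is a minimal sketch, assuming the returned string is tab-separated with a header row, as the output above suggests. The helper name parse_tab and the sample string are illustrative and are not part of BioServices.

```python
import csv
import io


def parse_tab(text):
    """Parse a tab-separated table with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


# Sample text mimicking the search result shown above (assumed tab-separated).
res = "Entry\tGene names\nP43403\tZAP70 SRK\n"

rows = parse_tab(res)
accession = rows[0]["Entry"]
gene_names = rows[0]["Gene names"].split()
print(accession, gene_names)
```

Keeping each row as a dictionary keyed by column name makes the code robust to a different column order, which may help when more columns are requested.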

Getting the fasta sequence
It is then straightforward to obtain the FASTA sequence of ZAP70 using another method of the UniProt class called searchUniProtId():
>>> sequence = u.searchUniProtId("P43403", "fasta")
>>> print(sequence)
Note: There are many services that provide access to the FASTA sequence. We chose UniProt, but you could use another service such as the Entrez utilities (EUtils class in BioServices).

Using BLAST on the sequence
You can then analyse this sequence with your favourite tool. As an example, within BioServices you can use the NCBIblast class. First, let us extract the sequence itself (without the header) using some standard Python code:
>>> sequence = sequence.split("\n", 1)[1].strip("\n")
Then we create an NCBIblast instance and run the analysis, specifying the BLAST variant (here blastp):
>>> s = NCBIblast(verbose=False)
>>> jobid = s.run(program="blastp", sequence=sequence, stype="protein", \
...     database="uniprotkb", email="youremail@domain")
>>> print(s.getResult(jobid, "out"))
The last command waits for the job to finish before printing the results, which may take a few minutes depending on the NCBI server. Looking at the beginning of the reported results and selecting only HUMAN sequences, we could see that the best sequence found indeed corresponds to ZAP70_HUMAN (as expected!).
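The one-line header removal above keeps the line breaks inside the sequence. If a single-line sequence is preferred, the step can be sketched as follows, assuming a FASTA string that holds exactly one record. The helper name fasta_to_sequence is hypothetical, and the toy record is for illustration only (it is not the ZAP70 sequence).

```python
def fasta_to_sequence(fasta):
    """Return the residues of a single-record FASTA string as one line."""
    lines = fasta.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA record starting with '>'")
    # Skip the header line and join the remaining lines without line breaks.
    return "".join(line.strip() for line in lines[1:])


# Toy record for illustration only; not a real protein sequence.
toy_fasta = ">toy|header line\nACDEFGHIK\nLMNPQRSTVWY\n"
print(fasta_to_sequence(toy_fasta))
```

Raising an error on a missing header is a design choice of this sketch; it avoids silently submitting a malformed sequence to a remote job.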

Searching for relevant pathways
The KEGG service provides pathways, so let us try to find pathways that contain our target protein. First we need to know the KEGG ID that corresponds to ZAP70. We can use the find method from the KEGG service:
>>> from bioservices import Kegg
>>> k = Kegg(verbose=False)
>>> k.find("hsa", "zap70") # "hsa" stands for Homo sapiens
We can look at the first pathway in a browser (highlighting the ZAP70 node):
>>> k.show_pathway("hsa04064", keggid={"7535": "red"})
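The gene ID passed to show_pathway (7535) has to be read from the text returned by find. A minimal sketch of that step is given below, assuming each returned line has the form "organism:gene_id", a tab, then a free-text description. This layout is an assumption about the KEGG output, the helper name kegg_gene_ids is hypothetical, and the description in the sample line is a placeholder.

```python
def kegg_gene_ids(find_output, organism="hsa"):
    """Collect gene IDs from lines shaped like 'hsa:7535<TAB>description'."""
    prefix = organism + ":"
    ids = []
    for line in find_output.splitlines():
        # The entry identifier is the text before the first tab.
        entry = line.split("\t", 1)[0].strip()
        if entry.startswith(prefix):
            ids.append(entry[len(prefix):])
    return ids


# Illustrative line only; the description text is a placeholder.
sample = "hsa:7535\tZAP70 (placeholder description)\n"
print(kegg_gene_ids(sample))
```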

Searching for binary Interactions
Another interesting Web Service available within BioServices is PSICQUIC. This is actually a portal to 25 databases that provide protein interactions. As an example, we can search for interactions that involve the ZAP70 protein within the mint database. The code is as follows:
>>> from bioservices import PSICQUIC
>>> s = PSICQUIC(verbose=False)
>>> data = s.query("mint", "ZAP70 AND species:9606")
where 9606 is the taxonomy identifier for the species Homo sapiens. We can check that the number of interactions found is 34:
>>> len(data)
34
We could also figure out how many interactions could be found in each database for this particular query. We would see, for instance, that the mint database has 34 interactions (as already found earlier). Coming back to the data found in the mint database only, we can look at the first entry:
>>> for x in data
The first two elements are the entries for interactors A and B. The last element is the score. The 11th element is the type of interaction, and so on.
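Reading an entry as described above can be sketched as follows, assuming each entry is available as a list of string fields and that identifiers carry a database prefix such as "uniprotkb:". The helper name summarize_record is hypothetical, and the record below is a shortened placeholder, not actual output from the mint database.

```python
def summarize_record(fields):
    """Return (interactor A, interactor B, score) from one record's fields."""
    def strip_prefix(identifier):
        # "uniprotkb:P43403" -> "P43403"; unprefixed values are left unchanged.
        return identifier.split(":", 1)[-1]

    return strip_prefix(fields[0]), strip_prefix(fields[1]), fields[-1]


# Placeholder record: only the first two and the last fields matter here.
record = ["uniprotkb:P43403", "uniprotkb:PARTNER", "-", "-", "score-placeholder"]
print(summarize_record(record))
```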
It could be useful to convert these elements into UniProt IDs only. With the mint database this is unnecessary for this particular entry (it is already in UniProt ID format), but with other databases or entries (e.g., biogrid) it may be useful.
BioServices provides a function for this purpose, called convert(). If the following example does not work with biogrid, the service may be inactive, and you may try to replace biogrid with another service such as mint, string, ...
>>> data = s.query("biogrid", "ZAP70 AND species:9606")
>>> data2 = s.convert(data, "biogrid")
The convert method converts all entries from data into UniProt IDs. If this is not possible, the entry is removed. The query and convert methods work on a single database, but we could query all databases, or a subset of them, using the queryAll and convertAll functions:
>>> data = s.queryAll("ZAP70 AND species:9606", databases=["mint", "biogrid"])
>>> data2 = s.convertAll(data)
However, extra cleaning is required to remove entries that are not relevant (no match to a UniProt ID, redundant, not a protein, self interactions, ...). To ease this task, the psicquic.AppsPPI class is very useful.
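The kind of cleaning mentioned above can be sketched on plain pairs of identifiers. This is not the AppsPPI implementation; it is a minimal illustration that assumes an unmatched identifier is represented by None and that a pair and its mirror image describe the same interaction. The helper name clean_pairs and the partner identifiers are placeholders.

```python
def clean_pairs(pairs):
    """Drop unmapped entries, self interactions and duplicate pairs.

    A pair (A, B) and its mirror (B, A) count as the same interaction.
    """
    seen = set()
    cleaned = []
    for a, b in pairs:
        if a is None or b is None:   # no UniProt match
            continue
        if a == b:                   # self interaction
            continue
        key = frozenset((a, b))
        if key in seen:              # redundant (possibly mirrored) pair
            continue
        seen.add(key)
        cleaned.append((a, b))
    return cleaned


# Placeholder partner identifiers (X1, X2) for illustration.
pairs = [
    ("P43403", "X1"),
    ("X1", "P43403"),
    ("P43403", "P43403"),
    ("P43403", None),
    ("X2", "P43403"),
]
print(clean_pairs(pairs))
```

Filtering out non-protein entries is left out of this sketch because it depends on how each database labels its interactors.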

What's next?
Many other Web Services wrapped within BioServices could be useful to retrieve more information about the protein ZAP70. One example is WikiPathway (see Wikipathway), which can be used to retrieve even more pathways. Another example is the BioMart portal, which you could use to retrieve pathways from REACTOME (see BioMart). You can also retrieve targets from ChEMBL given the UniProt ID (get_target_by_uniprotId("P43403")), and so on.
The full documentation of BioServices (on the PyPI repository) provides more tutorials and examples covering other aspects of the Web Services accessible from BioServices.

Application: retrieving information about a compound
This section uses BioServices to demonstrate the benefit of combining several services within a single framework, using Python as a glue language.

Retrieve a compound identifier from KEGG, ChEBI and ChEMBL
Let us look at a compound called Geldanamycin that inhibits Hsp90. Let us search for information about that compound in several databases and manipulate the different identifiers.
First, let us retrieve information from the KEGG database:
>>> from bioservices import *
>>> k = Kegg(verbose=False)
KEGG compounds have links to other databases. These links are not systematic, but the ChEBI database is often referenced. So we will want to convert the KEGG identifier to a ChEBI identifier. Later, we can convert the ChEBI identifier to a ChEMBL identifier using another Web Service such as UniChem.

Mapping identifiers
There are quite a few functions from different Web Services that can help to map identifiers from one database to another. This tutorial presents some of them.
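As a generic illustration of chaining two such mappings (for example KEGG to ChEBI, then ChEBI to ChEMBL), the sketch below composes two lookup tables. The dictionaries stand in for responses from the Web Services, the helper name chain_mapping is hypothetical, and all identifiers are placeholders rather than real database entries.

```python
def chain_mapping(first, second):
    """Compose two identifier maps, skipping keys with no onward mapping."""
    return {src: second[mid] for src, mid in first.items() if mid in second}


# Placeholder identifiers; real values would come from the Web Services.
kegg_to_chebi = {"KEGG_A": "CHEBI_A", "KEGG_B": "CHEBI_B"}
chebi_to_chembl = {"CHEBI_A": "CHEMBL_A"}

print(chain_mapping(kegg_to_chebi, chebi_to_chembl))
```

Identifiers without an onward mapping are dropped silently here; collecting them in a separate list would be a reasonable alternative when missing links need to be reported.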