Abstract

Translational research, the effort to couple the results of basic research to clinical applications, depends on the ability to effectively answer questions using information that spans multiple disciplines. The Semantic Web, with its emphasis on combining information using standard representation languages, access to that information via standard web protocols, and technologies that leverage computation, such as inference and distributable query, offers a social and technological basis for assembling, integrating and making available biomedical knowledge at Web scale. In this article, we discuss the use of Semantic Web technology for assembling and querying biomedical knowledge from multiple sources and disciplines. We present the Neurocommons prototype knowledge base, a demonstration intended to show the feasibility and benefits of using these technologies. The prototype knowledge base can be used to experiment with and assess the scalability of current tools and methods for creating such a resource, and to elicit issues that will need to be addressed in order to expand its scope and use. We demonstrate the utility of the knowledge base by reviewing a few example queries that provide answers to precise questions relevant to the understanding of disease. All components of the knowledge base are freely available at http://neurocommons.org/, enabling readers to reconstruct the knowledge base and experiment with this new technology.

INTRODUCTION

Understanding complex biological systems is a crucial challenge for modern biomedical science and informatics. In order to answer questions that might accelerate translational medicine, knowledge from different disciplines, research methodologies and repositories must be collected and integrated. However, the data and knowledge that measure and describe biomedical phenomena are scattered across numerous information systems, each with its own terminologies, identifier schemes and data formats. One collation counts more than 1000 publicly accessible molecular biology databases [1], with little schema or ontology reuse among them. Beyond these lies a vast body of biomedical knowledge published in journals, monographs and textbooks. Making effective computational use of all this knowledge is an important contemporary challenge.

Given this situation, it is difficult for researchers to find all available information about a subject of interest, and to organize it so that it can be searched and understood. Scientists who attempt to form a comprehensive view of a biological phenomenon face tedious and error-prone computing tasks: converting data formats and information schemas, querying different databases and combining the results of those queries, wrestling with a variety of uncoordinated application interfaces, and reading articles to extract and integrate relevant facts. Most of such a scientist's resources are spent working through the complexities of information systems instead of understanding the complexities of biological reality—the actual goal of biomedical research [2].

Instead of ushering in a new era of biomedical insight, the growing abundance of data on the web has intensified the need to develop new approaches to manage and integrate it. If we fail to do so, knowledge will remain fractured—encoded in a myriad of representational dialects—and effectively inaccessible to the majority of researchers.

As a means to change this situation, we have become interested in helping establish a Semantic Web for science [3,4]. By our assessment, the Semantic Web adds to existing Web standards and practices by encouraging clearly specified names for things, classes and relationships, organized and documented in ontologies, with data expressed in standardized, well-specified knowledge representation languages. Such a combination could enable computationally assisted management of information, ease the integration of different sources into a coherent whole, and make knowledge more widely and easily accessible. As with the existing synergy between the Internet and intranets, these technologies enhance the ability to work with knowledge that spans public and organizational boundaries, an essential capability in an ecosystem of biomedical research that includes academia, pharmaceutical companies, medical clinics and government agencies.

A number of recent Semantic Web standards provide part of the technical basis for such a vision, building on existing Web practices such as the ubiquitous use of Uniform Resource Identifiers (URIs) as globally unique names and documentation locators. The Resource Description Framework (RDF) [5], RDF Schema (RDFS) and the Web Ontology Language (OWL) [6] are standards for knowledge representation. RDF(S) (we use RDF(S) to refer to both RDF and RDF Schema) provides a basic syntax, datatypes, and the ability to define classes and instances. OWL goes beyond RDF(S) in offering more expressive ways of specifying classes, relations between classes, properties and relationships between instances. OWL is also expressive enough to state inconsistent assertions, and therefore, unlike RDF(S), enables tools that can profitably check consistency in the service of improving data quality.

The query language SPARQL [7] is a first standard for posing queries against repositories of knowledge expressed in these languages. Reasoners such as Pellet [8] are able to compute implications of statements made in OWL, as well as perform consistency checking.
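
To give a sense of the flavor of such queries, the following minimal SPARQL sketch asks for named subclasses of the Gene Ontology class for signal transduction. The go: namespace and class URI follow the OBO-in-OWL conventions and are given here as assumptions about how GO terms are named in a particular store, not as a definitive recipe.

  # Minimal illustrative SPARQL query: list classes asserted to be
  # subclasses of GO:0007165 (signal transduction), with their labels.
  # The go: namespace is an assumption about how GO terms are named.
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  PREFIX go:   <http://purl.org/obo/owl/GO#>

  SELECT ?subclass ?label
  WHERE {
    ?subclass rdfs:subClassOf go:GO_0007165 .
    ?subclass rdfs:label ?label .
  }
  LIMIT 20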

The Neurocommons prototype is a knowledge base built as a first step towards Web scale integration of scientific knowledge. With it, we are already able to demonstrate how Semantic Web technologies can be applied in biomedical research, for instance by helping scientists more easily answer questions about background science and connections between different research disciplines. The prototype serves as one test bed for exploring the technical, social and legal processes that will be needed to achieve a future in which the results of research are placed seamlessly into the Web of science. It also demonstrates the productive use of existing ontologies and exposes the need for their augmentation and future development. Through our experience working with the SenseLab project [9], the OBO Foundry [10], and with members of the W3C Semantic Web for Health Care and Life Sciences Interest Group [11], we can report insights on methods of collaboration that can work in practice. The prototype is based on the Virtuoso open source triple store (http://virtuoso.openlinksw.com/) as an OWL and RDF repository, and comes with open access data. The knowledge base has been released with the express purpose of allowing others to replicate, experiment with and extend it.

We see this prototype as a step towards the Semantic Web for science. Below we present the construction of the prototype, review related efforts, assess gaps and propose next steps, and set forward what we see as some challenges for both the short and long term.

THE NEUROCOMMONS KNOWLEDGE BASE

Technical goals of the Neurocommons KB

The Neurocommons prototype explores what future life sciences data standards should be like in order to promote integration. We had a number of specific goals in building the prototype. First, we wanted to exercise the ability to ask precise questions and get precise answers. Second, we wanted to show that the emerging Semantic Web technologies could accommodate data at a scale appropriate to a knowledge base; a number of previous biomedical knowledge prototypes used relatively small amounts of information and were therefore unconvincing. Entrez Gene and PubMed together provide an essential basis for bioinformatics work, so we chose to include content from these resources as a baseline.

We wanted to use modern knowledge representation techniques in order to escape the tendency for representation to be too closely tied to storage technology, in particular the biases introduced by the limitations of the relational model (e.g. difficulty in working with hierarchical and nested structures), and in order to work towards representations that were not tied to a specific end. If knowledge is to be shared on a Semantic Web, and be available for new and unanticipated uses (i.e. not only the ones for which the data was created), we must attempt to represent knowledge in such a way that it is clearly expressed yet application neutral. We have not yet succeeded in all cases. In some cases, the magnitude of the work made it infeasible; in others, OWL in its current state is not expressive enough to handle the representation. However, by applying the principles of the OBO Foundry [12], we were able to succeed in some demonstrations of data integration.

Finally, culminating a long debate on what might be suitable identifiers for entities that are the subject matter of biomedicine, we wanted to prototype a mechanism and protocol for minting URIs that achieved univocity, persistence, manageability and conformance to Semantic Web protocols. (For further discussion of this point see http://neurocommons.org/page/Common_Naming_Project and discussions at http://lists.w3.org/Archives/Public/public-semweb-lifesci/.)

Data sources

The scientific focus of the Neurocommons project is to support research on neurological diseases. In an attempt to force the design to be general, we strove to provide background knowledge that would support our own focus as well as other specializations, and chose a number of sources based on an assessment of their value for query, ease of acquisition, the effort required to represent them in OWL, and type of data. The knowledge base includes basic information about genes taken from Entrez Gene; the full set of OBO ontologies, including the Gene Ontology [13]; the Gene Ontology Annotations (GOA) [14] that associate gene products with functions, processes and structures; the OWL version of GALEN [15]; and links to the literature in the form of gene-to-article links from Entrez and GO, the medical subject heading definitions and article associations from PubMed, and selected information associated with each article. Where we had a choice of species-specific information, we included that about human and commonly studied model organisms: mouse, rat, fly, nematode, dog, cow, yeast, zebrafish, chimpanzee, pig, chicken and frog. Homolog information relating genes in these species to each other is taken from Homologene (http://www.ncbi.nlm.nih.gov/sites/entrez?db=homologene). In order to gain some experience with queries that include reagent information, we incorporated the Addgene (http://www.addgene.com) plasmid catalog. Of these data sources, the OBO ontologies are provided in OWL, whereas most of the others needed to be translated.

There is a broad range of databases that relate to neuroscience—our selection was limited primarily by the not-insignificant effort required to represent their subject matter. The sources we included are: metadata associated with the Allen Brain Atlas [16] images; NeuronDB, a database of selected neuronal properties from the SenseLab project; the Swanson-1998 rat portion of the Brain Architecture Management System (BAMS) [17] database, which includes gross neural circuitry as well as some molecular expression information; and the PDSP Ki database [18] of compound affinities for neuronal receptors. Results of an early information extraction pilot (http://sw.neurocommons.org/2007/text-mining.html) run against a portion of neuroscience-related abstracts are also included.

NAMES AND THE NAMED

A central tenet of the Web is that entities (known as ‘resources’ in web parlance) are identified or named by URIs. When the Web was being developed, the primary entities that were manipulated by Web tools, and therefore needed names, were the Web pages themselves and their contents—images, other attached resources and other pages included as links. The URLs that served as the names of Web pages and their contents were network locations from which resources were fetched in order to aggregate them into a web page. The principles for naming in a Semantic Web go beyond the use of URLs for Web pages and stem from the fact that identity, rather than location, is emphasized. Identity is essential for communication and knowledge sharing: when scientists use the same names for the same things, they can more easily find and share knowledge about those things, so naming an entity is crucial for sharing knowledge about it. The use of globally shared identifiers therefore has the potential to substantially improve the ability to integrate data. It is much easier to integrate different sources of information when both use the same names for the same things, making the often problematic translation between different names unnecessary.

Bioinformaticians effectively act as communication brokers when performing data integration. Just as when designing a Web page, they must take inventory of what should be named, and decide how to assign, manage and publicize these names for re-use. On examination, we find that the things that need naming fall into two categories. On one hand there is the ‘stuff of life’—cells, molecules, organisms and processes, e.g. a gene. Giving names to these biological entities is the province of biomedical ontology efforts. On the other hand, we have descriptions, texts, database records and other things that contain biomedical information, e.g. information about a gene. We attempt to clearly distinguish these two sorts of entities and their roles. Working with existing database records is an essential part of bioinformatics, where they form the input for many computational analyses. An important heuristic for avoiding unnecessary ambiguity is to use different names when we know the things to be named are different. Records often come in a variety of encodings—the same NCBI record about a gene is served as HTML, ASN.1 and XML, and portions are served in FASTA format. We therefore identify each encoding with a different name. In order to refer to the record generally, we also give a name to the record without commitment to encoding. (As an example, the Entrez Gene record for TP53 without commitment to encoding is http://purl.org/commons/record/ncbi_gene/7157, the HTML page is http://purl.org/commons/html/ncbi_gene/7157, and the ASN.1 representation is http://purl.org/commons/asn/ncbi_gene/7157.)
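
As a small illustration of how these naming distinctions play out in practice, the following SPARQL sketch asks for every statement whose subject is the encoding-neutral TP53 record named above. Because the HTML and ASN.1 encodings have their own URIs, statements about those resources would not appear in the results; no predicates beyond what the store actually contains are assumed.

  # Illustrative sketch: list everything said about the encoding-neutral
  # Entrez Gene record for TP53. Statements about the HTML or ASN.1
  # encodings, which have distinct URIs, are not matched by this pattern.
  SELECT ?property ?value
  WHERE {
    <http://purl.org/commons/record/ncbi_gene/7157> ?property ?value .
  }
  LIMIT 50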

In the context of building Web pages, each URI served two purposes: it acted as an identifier for the resource in question, and as the location of a network endpoint from which the entity (say a JPEG encoding of an image) could be requested. We want to exploit this dual purpose as well, and make it durable, in order to document what our names mean. Wide-scale adoption of shared names has historically been a challenge. While a full discussion of efforts to accomplish that goal is outside the scope of this article, we believe that the emergence of a Semantic Web and of ontology development efforts such as the OBO Foundry offer a fresh chance to accomplish this.

While a number of efforts to normalize naming schemes are underway (see http://neurocommons.org/page/Common_Naming_Project for a description of an effort by Science Commons and several other groups, which also includes references to related projects), we chose to adopt HTTP URIs to name all entities. The URIs that we use are Persistent URLs (PURLs; see http://purl.org/, which provides the redirection service that we used). PURLs provide redirection capabilities that make the identifiers/locations more manageable in case the server hosting the documentation changes: the redirected-to location changes but the PURL itself need not, preserving its utility as an identifier. Because PURLs are based on the HTTP protocol, documentation is easily accessible on the Web without the special software installation necessary for URN schemes such as LSID [19].

We applied the above naming principles when converting data sources to RDF/OWL (Figure 1). When data in different information sources are about the same thing, a common identifier must be used so that these elements of the RDF graph connect. For instance, the representation of relationships between gene records and the journal articles referred to by GeneRIFs (http://www.ncbi.nlm.nih.gov/projects/GeneRIF/GeneRIFhelp.html) uses the same gene record identifiers as the representation of the relations between gene records and, for instance, molecular functions in the Gene Ontology. This allows us to easily link articles to molecular functions via gene records. When converting databases whose subject matter overlaps what is already present in the knowledge base, the identifiers already used for that subject matter are re-used.
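
The sketch below shows the shape of the article-to-function join just described; the sc: predicate names are hypothetical placeholders standing in for whatever properties the converted GeneRIF and GO annotation sources actually use, so this is an illustration of the pattern rather than a query that can be run verbatim against the knowledge base.

  # Hedged sketch of linking articles to molecular functions through a
  # shared gene record URI. sc:refers_to_gene and sc:has_function are
  # hypothetical placeholder predicates, not the actual converted terms.
  PREFIX sc: <http://example.org/sketch/>

  SELECT ?article ?function
  WHERE {
    ?article     sc:refers_to_gene ?gene_record .   # GeneRIF-derived link
    ?gene_record sc:has_function   ?function .      # GO annotation-derived link
  }
  LIMIT 25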

Figure 1:

Architecture and information flow of the Neurocommons knowledge base. Graph 1, Graph 2, etc. are ‘named graphs’, a kind of partition within the triple store.

Scientists who wish to employ the approach described above would begin by trying to find pre-existing identifiers for an entity. Regrettably, doing so is not always possible, and the process of creating new, accurate representations can be difficult [20] (see, for instance, http://www.co-ode.org/resources/tutorials/ or http://ontology.buffalo.edu/smith/Ontology_Course.html). However, there are signs of hope. Projects such as the Neurocommons project, the Ontology Lookup Service [21] and BioPortal (http://bioportal.bioontology.org/) are making it easier to find existing identifiers, and the OBO Foundry initiative [22] is building a community of researchers sharing approaches to, and experience with, representing biological systems.

Architecture

The choice of RDF representations instead of the more conventional relational database (RDB) was influenced by the desire to create open and linked data modules. When an RDF store (triple store) is employed, the disclosure of semantics is accomplished in two ways: (i) the stored data have associated URIs that can link them to data stored in other locations; and (ii) it is straightforward to directly examine datatypes and to exchange data models, because the schema is also written in RDF and exportable as XML. The URIs also promote transparency of SPARQL queries, because all data types can be investigated using information in the query. A practical benefit of triple stores is portability: all data, metadata and SPARQL queries are portable among triple stores, whereas data and metadata description languages and formats for RDBs depend on the implementation and the vendor. Relations between data within an RDB can be obscured by indirection (e.g. the use of primary and foreign keys, rather than directly naming the relation and object), and linking to external data is not built into the relational specification as it is in RDF, where it is supported by the common web browser.

Potential adopters should, however, note that RDF triple stores are a young technology and as such may currently lack features that more mature relational database implementations have: optimized performance, redundancy, standard update mechanisms and transaction control. However, some triple stores are built on top of relational technology (e.g. OpenLink's Virtuoso or Oracle's 11g RDF Database) and therefore inherit some of those features, and as more experience is gained we expect the performance differences to lessen.

In order to experiment with integrating data from the different data sources, we chose to use a triple store to collect all the sources into a single database. Although it was not strictly necessary to do so—SPARQL is expressive enough to query across data in different locations—the current state of implementation of such distributed queries was not considered mature enough. Moreover, the central ideas we wished to demonstrate, those of data integration, accessibility and SPARQL query, were adequately exercised by using a single store.

Figure 1 shows an overview of the architecture. Some elements of the architecture are similar to other data integration efforts—in particular, the translation of the various sources into a single representation language [23]. What differs is the choice of representation language (OWL), which allows one to more transparently represent and reason over hierarchical relations, and the low-impedance interface to the Web via SPARQL.

Each source, when converted to RDF, forms a bundle. The largest bundle, of 254 million triples, contains the relations between subject heading (MeSH) annotations and the articles listed in Medline. In contrast, there are a number of smaller bundles ranging from 10 thousand to 10 million triples. Altogether there are currently 350 million triples occupying 30 gigabytes when loaded into Virtuoso's triple store. Adding inferred triples increased the size of the database by about 10%.

Each converted data source is put in a distinct named graph. In the first version of the knowledge base, queries needed to specify which graph each portion of the query was to be matched against in order to achieve adequate performance. In the current version this is no longer the case, and as a result queries are simpler to write. Graphs can still be used to retrieve provenance information (see Figure 2), and can be individually loaded and unloaded, making incremental updates to the knowledge base practical.
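
A provenance query of the kind shown in Figure 2 can be sketched as follows; it reports the named graphs that say anything about a given GO class, with the go: namespace again assumed rather than prescribed.

  # Sketch of a provenance query: which named graphs contain statements
  # that mention GO_0004872 (receptor activity) as subject or object?
  PREFIX go: <http://purl.org/obo/owl/GO#>

  SELECT DISTINCT ?graph
  WHERE {
    GRAPH ?graph {
      { go:GO_0004872 ?p ?o . }
      UNION
      { ?s ?p go:GO_0004872 . }
    }
  }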

Figure 2:

Query illustrating how to determine which graphs have statements about a particular function defined in the GO, in this case GO_0004872: receptor_activity. Results (http://purl.org/science/query/provenance-example) include 17 graphs including those for GO itself, SenseLab, GO Annotations, several cross product (http://wiki.geneontology.org/index.php/Cross_Product_Guide) ontologies, and a number of mapping ontologies such as PFAM2GO (http://www.obofoundry.org/cgi-bin/detail.cgi?id=pfam2go).

Representation

A basic observation about life sciences data sources is that different sources record similar information in different ways and at different levels of detail. Existing practice is such that there is typically major effort involved in combining databases that deal with the same or related subject matter. Common variations include, for example, the association of functional information with genes in some databases but with proteins in others, or differences in the level of detail recorded in protein interaction databases. The challenge of data integration is to permit a single query to find all relevant information from all sources, in spite of this non-uniformity.

Any solution to this problem necessarily involves simple inferences, such as transitivity of subclass and part-of relations, so that relationships can be found between results that have been stated in different ways. This observation supports two central choices: first, the use of OWL, so that inferences take place through a well-defined logic; and second, the adoption of a semantic modeling approach, promoted by the OBO Foundry [10], that restricts the use of logical individuals to particular things in the world, designating all generalizations, such as molecular species, as classes of such things. Competing approaches, such as BioPAX [24] and BioCyc [25], which treat generalizations as logical individuals or ‘instances’, fail to take advantage of the class-level logic of OWL, and data integration ends up replicating class relationships (such as modified protein to protein, without regard to modifications) at the level of individuals [26].

The following example (Figure 3) illustrates how class-level reasoning may be applied in biology: The class of glutamate receptors can be defined as proteins that have a particular function—to be a receptor of glutamate molecules. Using OWL, we can capture this with the logical statement ‘EVERY glutamate receptor IS_A protein THAT has glutamate receptor activity’. In this way, the class of glutamate receptors can be specified in terms of the classes protein and glutamate receptor activity, something that is easily expressed in OWL. The knowledge base contains many such specifications of classes. Figures 4 and 5 show the transformation of the names in Figure 3 into specific OWL terms (Figure 4) and triples (Figure 5). (The specification of the class of functions glutamate receptor activity can similarly be decomposed into a more primitive definition in terms of glutamate binding and the consequences of that binding.)
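
The sketch below, in the spirit of Figure 5, shows how such a class definition is encountered when querying its RDF encoding directly: the definition appears as a subclass axiom plus an owl:Restriction node. The ex: terms are placeholders for the Protein Ontology, Relation Ontology and GO terms named in Figure 4, not the exact URIs used in the knowledge base.

  # Sketch: find classes defined as proteins that have some glutamate
  # receptor activity, by matching the RDF encoding of the OWL restriction.
  # The ex: terms are hypothetical placeholders.
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  PREFIX owl:  <http://www.w3.org/2002/07/owl#>
  PREFIX ex:   <http://example.org/sketch/>

  SELECT ?class
  WHERE {
    ?class rdfs:subClassOf ex:protein .
    ?class rdfs:subClassOf ?restriction .
    ?restriction a owl:Restriction ;
                 owl:onProperty ex:has_function ;
                 owl:someValuesFrom ex:glutamate_receptor_activity .
  }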

Figure 3:

A statement defining Glutamate receptor and the corresponding OWL in graph form. ‘_:1’ is a so-called blank node, a node without a URI.

Figure 4:

Most nodes and arcs in the graph are named by URIs, here abbreviated by using prefixes for each namespace instead of the full URI. To transform the graph into triples, each arc (labeled by a number in a circle) becomes one triple. Namespaces: pro: Protein ontology [27], ro: Relation ontology [28], go: Gene ontology, ex: A hypothetical namespace in which the example terms are defined.

Figure 5:

A subset of triples from the above graph, and a SPARQL query that matches it.

The RDF/OWL statements we created in this way were finally loaded into an RDF triple store with a public SPARQL endpoint [29], enabling query access to the whole knowledge base through the web.

Specific answers to precise questions

Querying the knowledge base allows one to retrieve more precise answers than is possible with an information retrieval-based approach. To illustrate, we consider a query that we wrote as an example of how one might prospect for Alzheimer's drug targets (Figure 6). The query was based on two observations: (i) CA1 pyramidal neurons (CA1PN) are known to be particularly damaged in Alzheimer's disease and play a key role in signal transduction [30]; and (ii) signal transduction processes are relatively rich in ‘druggable’ targets [31]. If we cast a wide net, can we find proteins known to be relevant to pyramidal neuron physiology that are involved in signal transduction? Simply querying Google or PubMed with a phrase such as ‘signal transduction in pyramidal neurons’ yields too many results to prioritize for further investigation.

Figure 6:

SPARQL query that retrieves gene records, and the names of signal transduction-related processes in which the gene products participate, that are related to pyramidal neurons. Legend: returned variables: bold italic; other variables: bold. The shaded sections (lines 12–16 and 20–23) are class queries that link two classes but take several lines to express due to the RDF encoding of OWL. See Figure 8 for a more concise reformulation of the query. Lines 1–6 show the declarations for the prefixes and are elided in subsequent examples.

The query we wrote (Figure 6) traverses five data sources within the knowledge base: MeSH (defining the term pyramidal neuron), PubMed (connecting MeSH terms to journal articles), Entrez Gene (connecting genes to journal articles), the Gene Ontology annotations (connecting gene products to processes), and the Gene Ontology (defining processes such as signal transduction). The query, written in SPARQL, returns gene names and associated processes (which may be subtypes or parts of signal transduction processes) that are related to pyramidal neurons by virtue of the genes being mentioned together in papers that are related to pyramidal neurons. Note that the query asks for all parts of signal transduction processes, as well as all of their subtypes. The subclass relation is a transitive relation directly between classes; inference, in this case, is accomplished by a transitive closure procedure that adds all implied subclass relations to the knowledge base as new triples. The part_of relation is not a direct class/class link, but is instead represented as a restriction in OWL. The inferred part_of relations were computed with the Pellet reasoner and then added to the knowledge base as direct part_of relations between process classes.
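
A condensed sketch of this chain of joins is given below. The sc: predicates and the literal MeSH label are illustrative placeholders rather than the exact terms used by the converted sources, and the full query in Figure 6 additionally walks the OWL restriction encoding of part_of; the plain rdfs:subClassOf pattern works here only because, as described above, the transitive closure of the subclass relation has been materialized in the store.

  # Condensed, hedged sketch of the Figure 6 query: from the MeSH term for
  # pyramidal cells, through articles and gene records, to signal
  # transduction related processes. sc: predicates are placeholders.
  PREFIX sc:   <http://example.org/sketch/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

  SELECT ?gene_record ?process_name
  WHERE {
    ?mesh_term    rdfs:label          "Pyramidal Cells" .    # MeSH
    ?article      sc:has_mesh_heading ?mesh_term .           # PubMed
    ?gene_record  sc:described_in     ?article .             # Entrez Gene
    ?gene_record  sc:participates_in  ?process .             # GO annotations
    ?signal_transduction rdfs:label   "signal transduction" .
    ?process      rdfs:subClassOf     ?signal_transduction . # GO, after inference
    ?process      rdfs:label          ?process_name .
  }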

The query returns about 40 results (available at http://purl.org/science/query/pyramidal-signal-transduction). We asked a researcher who studies Alzheimer's disease to review the list, and she verified that the returned genes and processes were valid. Some example results (process: genes) are adenylate cyclase activation: DRD1, ADRB2; and glutamate signaling pathway: GRIN1, GRIN2A, GRIN2B, GRIK1.

The next query relies on information from the SenseLab databases, which were converted to RDF in an independent effort, illustrating that independent, albeit coordinated, efforts can produce representations that allow cross-database query. The SenseLab resources include a number of specialized databases, of which three have been converted to RDF; details of the conversion and incorporation of the SenseLab databases into the knowledge base are described in [32]. The query uses information from NeuronDB, narrowing the results of the query in Figure 6 down to those proteins that are dopaminergic receptors.

The extended query in Figure 7 constrains the result set to gene records describing genes that encode some type of dopamine receptor, by connecting via a common element—the gene record. Because the gene record is named using the same URI in SenseLab, it provides a bridge that enables the incorporation into the query of knowledge unique to the SenseLab database, such as the neuroanatomical location and electrophysiological properties of neurons.

Figure 7:

An extended version of the query in Figure 6. In lines 18–25, the variable ?gene_record, bound in lines 5 and 14, is used as a key to additional information from the SenseLab ontology. Only those results are retrieved that are associated with some type of dopamine receptor (receptor_protein_name (gene_name) process_name): D1 receptor (DRD1) adenylate cyclase activation, and D2 receptor (DRD2) G-protein coupled receptor protein signaling pathway.

RELATED PROJECTS

There are a growing number of projects that attempt to take advantage of Semantic Web technology to serve biological information; here we mention two of note. Bio2RDF [33] incorporates portions of some of the data sets included in the Neurocommons knowledge base, as well as others that are not. Bio2RDF's current focus is on scripted translation of portions of public database records to RDF, with an emphasis on extracting and converting explicit links between records rather than on translating the facts those records represent. Bio2RDF has also been an advocate of a uniform system for naming database records and was an early participant in the Common Naming Project (http://neurocommons.org/page/Common_Naming_Project). The Bio2RDF team runs a server for this data as well as enabling local installations of their system.

The Semantic Systems Biology BioGateway [34] hosts a triple store that nucleated around providing easier access to the Cell Cycle Ontology [35]. It has since progressed to include a number of other resources, including a translation of a portion of UniProt and the OBO ontologies. Future development is envisioned to be aimed at more targeted support for developing models and simulations, for instance by generating SBML models as the results of queries.

While the Neurocommons project, Bio2RDF and the BioGateway currently use RDF-based triple store technology, IBM's Anatomy Lens (http://services.alphaworks.ibm.com/anatomylens/) is based on SHER [36], a novel large-scale reasoner for OWL. Anatomy Lens does not provide a SPARQL interface but instead focuses on returning journal articles when users enter anatomy terms, MeSH terms and biological processes as search keywords. The system makes use of reasoning in order to improve the relevance of the returned articles.

Development of reasoners that operate at the scale of the Neurocommons knowledge base is an ongoing effort, and collaborations with researchers working in this area will be essential to advance the field. Of note in this direction is the work on HermiT [37]. The OWL version of BAMS could not be reasoned over when it was initially created. However, it was supplied to the HermiT team and, after subsequent development of that reasoner, could finally be classified and checked for consistency (http://web.comlab.ox.ac.uk/people/Boris.Motik/HermiT/test-data.html).

DISCUSSION

Representing biological reality is the most solid basis for data integration, but it can be a formidable challenge. When representing biology, we are often faced with fundamental representation choices such as: ‘How do I relate an object to its parts?’, ‘How do I represent anatomy and localized physiological processes?’, ‘What identifiers should I use for biological entities?’. Coordinated effort is needed to ensure both that what ontologies represent is sound science and that they are usable. Database curators and publishers who wish to get maximum value for their effort by publishing on the Semantic Web and enabling integration need illustrative examples that can help them bootstrap efforts to use this new technology effectively. The Neurocommons knowledge base is offered as one among several efforts to set the stage for future work on the Semantic Web by exploring, implementing and sharing working solutions.

Efforts such as the Neurocommons project are becoming feasible now due to a confluence of circumstances and efforts. The Open Biomedical Ontologies and the OBO Foundry are establishing a baseline of foundational ontologies, such as BFO [38] and the Relation Ontology [28], as well as reference ontologies such as the GO and ChEBI, which can be adopted and reused in diverse projects. Experience with, and tools for, OWL are maturing, and triple store technology is reaching the point where stores and query processors can comfortably scale to sizes that are useful for aggregating and serving large bodies of biological knowledge. What makes these efforts even more exciting is the changing nature of their presence on the Web: the ability to pose precise queries against multiple resources using SPARQL and, over time, new query languages.

Still, there are a variety of immediate challenges. In order to construct powerful queries, one must understand the contents of the knowledge base, and although there are a variety of scattered tools for exploring portions of it, it takes skill and effort to learn enough to use it effectively.

Complete OWL reasoning (i.e. a reasoning process that computes all implications of a set of assertions; see e.g. [39]) on databases the size of the Neurocommons knowledge base is presently infeasible. In such cases we made a deliberate decision to use incomplete reasoning and to proceed despite the limitations (e.g. that some queries that could theoretically be answered will not be) in order to at least identify the challenges for Semantic Web researchers.

SPARQL queries against OWL are verbose and unintuitive. Already, we are seeing developments towards making such queries easier to construct and more concise. SPARQL-DL [40] has the potential to make queries more readable by, for example, making it unnecessary to represent restriction classes explicitly as triples (Figure 8).

Figure 8:

A reformulation of the query in Figure 6 to use SPARQL-DL. As an example, line 6 in this query uses a single line to express the same expression as lines 12–16 in Figure 6.

In order to make discovery of terminology and content in knowledge bases of this sort easier, text indexing and the incorporation of free-text queries into SPARQL would be beneficial; Bio2RDF is already exploring this possibility. Additionally, adapting useful ontology exploration services such as the OLS [21] so that they can be easily deployed to index content on the Semantic Web, and developing other support functions, would greatly ease query construction. However, structured query languages are not necessarily well suited to domain scientists, so research that adapts keyword search interfaces by identifying relevant queries (e.g. [41]) could also be profitably employed here.

There is a pressing need for flexible and intuitive web user interfaces that do not require the user to know either SPARQL or the organization of the knowledge base. Dedicated graphical interfaces are already available for subsets of the data contained in the knowledge base; for example, Entrez Neuron (http://ycmi.med.yale.edu/entrez_neuron.html) is a graphical interface for the SenseLab ontologies. Currently there is no graphical user interface for the knowledge base as a whole. Faceted browsers such as SIMILE Exhibit, and novel query/visualization tools such as Parallax (http://mqlx.com/~david/parallax/), could provide views of the knowledge base that would enable users without specific knowledge of the data to make use of it, effectively issuing queries without being exposed to a query language.

Key Points

  • Large-scale integration of heterogeneous biomedical data can be accomplished with Semantic Web technology.

  • A common naming strategy can simplify the process of data integration in a knowledge base and make it more extensible.

  • Semantic Web technologies are improving rapidly and will soon enable reasoning across large knowledge bases as well as improved query languages and optimization.

ACKNOWLEDGEMENTS

The early development of the knowledge base was done with participation of a number of members of the W3C Semantic Web Health Care and Life Sciences (HCLS) Interest Group. Details can be found in the associated W3C Interest Group note [42]. We appreciated the general camaraderie of that group. John Barkley, Huajun Chen, Kei-Hoi Cheung, June Kinoshita, Gwen Wong, Elizabeth Wu, Don Doherty, William Bug, Ray Hookway, Chris Mungall, Barry Smith, Eric Prud’hommeaux, Kingsley Idehen, Orri Erling, Ivan Mikhailov, Evren Sirin and Alan Bawden contributed to a range of aspects of the project.

Funding

CHDI Foundation (to Science Commons); Ewing Marion Kauffman Foundation; John D. and Catherine T. MacArthur Foundation; Virtual Laboratory for e-Science project (http://www.vl-e.nl) (to M.S.M.), which is supported by a BSIK grant from the Dutch Ministry of Education, Culture and Science (OC&W) and is part of the ICT innovation program of the Ministry of Economic Affairs (EZ).

*In memory of our friend and colleague William Bug, Ontological Engineer.

REFERENCES

1. Galperin MY. The Molecular Biology Database Collection: 2008 update. Nucleic Acids Res 2008;36:D2-4.

2. Ruttenberg A, Clark T, Bug W, et al. Advancing translational research with the semantic web. BMC Bioinformatics 2007;8(Suppl. 3):S2.

3. Berners-Lee T, Hendler J. Publishing on the Semantic Web. Nature 2001;410:1023-4.

4. Hendler J. Communication. Science and the Semantic Web. Science 2003;299:520-1.

5. Beckett D, McBride B. RDF/XML Syntax Specification (Revised). W3C Recommendation, 2004. http://www.w3.org/TR/rdf-syntax-grammar/ (1 September 2008, date last accessed).

6. McGuinness DL, van Harmelen F. OWL Web Ontology Language Overview. W3C Recommendation, 2004. http://www.w3.org/TR/owl-features/ (1 September 2008, date last accessed).

7. Prud’hommeaux E, Seaborne A. SPARQL Query Language for RDF. W3C Recommendation, 2006. http://www.w3.org/TR/rdf-sparql-query/ (1 September 2008, date last accessed).

8. Evren S, Bijan P, Bernardo Cuenca G, et al. Pellet: A Practical OWL-DL Reasoner. The Netherlands: Elsevier Science Publishers B.V., 2007, 51-3.

9. Skoufos E, Mirsky JS, Healy MS, et al. Acquisition, storing and retrieving diverse biomedical data using the World-Wide-Web: The Senselab Paradigm. AMIA Annual Symposium Proceedings, 1998.

10. Smith AK, Cheung KH, Yip KY, et al. LinkHub: a Semantic Web system that facilitates cross-database queries and information retrieval in proteomics. BMC Bioinformatics 2007;8(Suppl. 3):S5.

11. W3C. Semantic Web Health Care and Life Sciences Interest Group. (1 September 2008, date last accessed).

12. Smith B. Beyond concepts: ontology as reality representation. In: Formal Ontology in Information Systems: Proceedings of the Third International Conference (FOIS-2004), 2004, 73-84.

13. Harris MA, Clark J, Ireland A, et al. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004;32:D258-61.

14. Camon E, Magrane M, Barrell D, et al. The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Res 2003;13:662-72.

15. Rector AL, Nowlan WA. The GALEN project. Comput Methods Programs Biomed 1994;45:75-8.

16. Lein ES, Hawrylycz MJ, Ao N, et al. Genome-wide atlas of gene expression in the adult mouse brain. Nature 2007;445:168-76.

17. Bota M, Dong HW, Swanson LW. Brain architecture management system. Neuroinformatics 2005;3:15-48.

18. Jensen NH, Roth BL. Massively parallel screening of the receptorome. Comb Chem High Throughput Screen 2008;11:420-6.

19. Clark T, Martin S, Liefeld T. Globally distributed object identification for biological knowledgebases. Brief Bioinform 2004;5:59-70.

20. Post LJ, Roos M, Marshall MS, et al. A semantic web approach applied to integrative bioinformatics experimentation: a biological use case with genomics data. Bioinformatics 2007;23:3080-7.

21. Cote RG, Jones P, Martens L, et al. The Ontology Lookup Service: more data and better tools for controlled vocabulary queries. Nucleic Acids Res 2008;36:W372-6.

22. Smith B, Ashburner M, Rosse C, et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 2007;25:1251-5.

23. Cadag E, Louie B, Myler PJ, et al. Biomediator data integration and inference for functional annotation of anonymous sequences. Pacific Symposium on Biocomputing, 2007, 343-54.

24. Bader G, Cary M. BioPAX – Biological Pathways Exchange Language, Level 2, 2005. (1 December 2008, date last accessed).

25. Caspi R, Foerster H, Fulcher CA, et al. The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res 2008;36:D623-31.

26. Ruttenberg A, Zucker J, Rees J. What BioPAX communicates and how to extend OWL to help it. In: OWLED*06 Workshop on OWL: Experiences and Directions, Athens, Georgia, 2006.

27. Natale D, Arighi C, Barker W, et al. Framework for a Protein Ontology. BMC Bioinformatics 2007;8:S1.

28. Smith B, Ceusters W, Klagges B, et al. Relations in biomedical ontologies. Genome Biol 2005;6:R46.

29. Clark KG, Feigenbaum L, Torres E. SPARQL Protocol for RDF. W3C Recommendation, 2008. http://www.w3.org/TR/rdf-sparql-protocol/ (1 September 2008, date last accessed).

30. Schliebs R. Basal forebrain cholinergic dysfunction in Alzheimer's disease – interrelationship with beta-amyloid, inflammation and neurotrophin signaling. Neurochem Res 2005;30:895-908.

31. Persidis A. Signal transduction as a drug-discovery platform. Nat Biotechnol 1998;16:1082-3.

32. Samwald M, Cheung K. Experiences with the conversion of SenseLab databases to RDF/OWL (W3C Interest Group Note). (1 September 2008, date last accessed).

33. Belleau F, Nolin MA, Tourigny N, et al. Bio2RDF: Towards a mashup to build bioinformatics knowledge systems. J Biomed Informatics 2008;41:706-16.

34. Antezana E, Blondé W, Aranguren ME, et al. Semantic Systems Biology. (1 September 2008, date last accessed).

35. Aranguren ME, Antezana E, Kuiper M, et al. Ontology design patterns for bio-ontologies: a case study on the cell cycle ontology. BMC Bioinformatics 2008;9:S1.

36. Dolby J, Fokoue A, Kalyanpur A, et al. Scalable semantic retrieval through summarization and refinement. Proc Natl Conf Artif Intell 2007;22:299.

37. Motik B, Shearer R, Horrocks I. Optimized reasoning in description logics using hypertableaux. Lecture Notes Comput Sci 2007;4603:67.

38. Grenon P, Smith B, Goldberg L. Biodynamic ontology: applying BFO in the biomedical domain. Studies Health Technol Informatics 2004;102:20-38.

39. Smith M, Horrocks I, Krötzsch M. OWL 2 Conformance and Test Cases, 2008. (1 December 2008, date last accessed).

40. Golbreich C, Kalyanpur A, Parsia B. Proceedings of the OWLED 2007 Workshop on OWL: Experiences and Directions, Innsbruck, Austria, June 6–7, 2007. CEUR Workshop Proceedings 258, CEUR-WS.org, 2007.

41. Yu Y, Wang H, Wang C, et al. SPARK: Adapting Keyword Query to Semantic Search. In: Proceedings of the International Semantic Web Conference (ISWC), South Korea, 2007.

42. Marshall MS, Prud'hommeaux E. A Prototype Knowledge Base for the Life Sciences (W3C Interest Group Note). (1 September 2008, date last accessed).