Biological data management is a challenging undertaking. It is challenging for database designers, because biological concepts are complex and not always well defined, and therefore the data models that are used to represent them are constantly changing as new techniques are developed and new information becomes available. It is challenging for collaborating groups based in different geographical locations who wish to have unified access to their distributed data sources, because combining and presenting their data creates logistical difficulties. Finally, it is challenging for users of biological databases, because in order to correctly interpret the experimental data located in one database, additional information from other databases is frequently needed, requiring the user to learn multiple systems.
The BioMart project (www.biomart.org) was initiated to address these challenges. The BioMart software is based on two fundamental concepts: data-agnostic modelling and data federation. Data-agnostic modelling simplifies the difficult and time-consuming task of data modelling. In BioMart, this is achieved by using a predefined, query-optimized relational schema that can be used to represent any kind of data (1). Data federation makes it possible to organize multiple, disparate and distributed database systems into what appears to be a single integrated virtual database. It therefore allows users to access and cross-reference data from these sources through a single user interface, without the need for database administrators to physically collate the data in one location.
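The idea of data federation can be illustrated with a minimal sketch. It is not BioMart's actual schema or API: the table name `main`, the helper functions `make_source` and `federated_query`, and the sample rows are all hypothetical, and two in-memory SQLite databases stand in for independently maintained sources. The sketch only shows how records held in separate databases can be cross-referenced on a shared key without first being copied into one location.

```python
import sqlite3

def make_source(rows, columns):
    """Create an independent in-memory database holding one 'main' table (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f"{c} TEXT" for c in columns)
    conn.execute(f"CREATE TABLE main ({cols})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO main VALUES ({placeholders})", rows)
    return conn

# Two hypothetical sources, each maintained separately by its own provider.
genes = make_source([("G1", "BRCA2"), ("G2", "TP53")], ["gene_id", "symbol"])
pathways = make_source(
    [("G2", "Apoptosis"), ("G2", "Cell cycle")], ["gene_id", "pathway"]
)

def federated_query(left, right, key, left_attr, right_attr):
    """Link two sources on a shared key, querying each one where it lives."""
    results = []
    left_rows = left.execute(f"SELECT {key}, {left_attr} FROM main ORDER BY {key}")
    for key_value, left_value in left_rows:
        right_rows = right.execute(
            f"SELECT {right_attr} FROM main WHERE {key} = ? ORDER BY {right_attr}",
            (key_value,),
        )
        for (right_value,) in right_rows:
            results.append((left_value, right_value))
    return results

linked = federated_query(genes, pathways, "gene_id", "symbol", "pathway")
for symbol, pathway in linked:
    print(symbol, pathway)
```

To the caller, `federated_query` behaves like a query against one database, although neither source has been moved or merged. A real federation layer would also have to handle remote access, differing identifiers and query optimization, none of which is attempted here.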
Using these fundamental concepts, the BioMart project has driven a change in the biological data management paradigm, in which individual biological databases have traditionally been managed by different custom-built systems. To give more control to both users and data providers, a new solution was required. BioMart started by adapting data warehousing ideas to create one universal software system for biological data management and to empower biologists to create complex, customized datasets through a web interface without the need for bioinformatics support (1). It subsequently introduced a new way of creating large multi-database repositories that avoids the need to store all the data in a single location (2), and finally it demonstrated that large-scale projects involving next generation sequencing data can be managed efficiently in a distributed environment (3).
BioMart has successfully adapted data warehousing ideas such as data marts, dimensional modelling (4) and query optimization to the world of biological databases (5–13). Its ability to quickly deploy a website hosting any type of data, its user-friendly graphical user interface, its several programmatic interfaces and its support for third party tools have all contributed to its success and to its adoption as a data management platform by many different types of projects around the world (14). During the 10 years of its existence, BioMart has grown from humble beginnings as a ‘data mining extension’ for the Ensembl website (1) into an international collaboration involving a large number of organizations located on five continents: Asia, Australia, Europe, North America and South America (3,15). It has a large community of users and developers, and it has been used successfully in both academia and industry. The latest version of the BioMart software has been significantly enhanced with numerous graphical user interfaces that are tailored to different user groups. In addition, it now supports parallel query processing, it is extensible with different analysis tools, and its installation can be completed with just a few mouse clicks (16).
Building on the wealth of information that has become accessible through the BioMart interface, the BioMart Central Portal (15) has introduced an innovative alternative to the large data stores maintained by specialized organizations such as the European Bioinformatics Institute (EBI) or the National Center for Biotechnology Information (NCBI). BioMart Central Portal is a first-of-its-kind, community-driven effort to provide unified access to dozens of biological databases. All development and maintenance of individual databases is left to the individual data providers, making it a very cost-effective approach. The groups that maintain individual sources can do so at their own location without the need for any data exchange procedures. In addition, they can draw on the wealth of information available through the portal to expose their data in the context of third party annotations. The BioMart Central Portal approach is very democratic: any group can add or remove its data source at any time. BioMart Central Portal is effectively a ‘Virtual Bioinformatics Institute’ with no onsite personnel, minimal administration, and a very ‘green’ footprint.
More recently, the International Cancer Genome Consortium (ICGC) Data Portal has demonstrated how BioMart can scale to manage large collaborative projects involving next generation sequencing data (3). The ICGC is generating data on an unprecedented scale by sequencing 500 cancer genomes and matched normal control genomes for 50 different cancer types (17). The effort is distributed between multiple participating countries and sequencing centres. Given the scale of the effort, moving all of the data to a single location is impractical. Instead, the ICGC Data Portal relies on BioMart data federation. Because the same data model is replicated across the centres that produce the same type of data according to the same recipe, the scalability of the effort is greatly improved. Each centre is responsible only for managing its own data, while access to all of the consortium data is managed by the BioMart software. This approach is scalable not only in the traditional sense of parallelizing data processing and storage, but also in the more general sense of outsourcing external annotation expertise, by federating annotations from additional, independently maintained databases that are available in the BioMart Central Portal.
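The arrangement described above, in which every centre holds its own data in the same layout and one query is sent to all of them, can be sketched as follows. This is an illustration under stated assumptions, not the ICGC Data Portal implementation: the centre names, the record fields and the functions `query_centre` and `query_all` are hypothetical, and in-memory lists stand in for remote databases.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical centres: each holds only its own records, all in the same layout.
CENTRES = {
    "centre_a": [
        {"donor": "D1", "cancer_type": "pancreatic", "gene": "KRAS"},
    ],
    "centre_b": [
        {"donor": "D2", "cancer_type": "breast", "gene": "TP53"},
        {"donor": "D3", "cancer_type": "pancreatic", "gene": "TP53"},
    ],
}

def query_centre(name, gene):
    """Stand-in for a remote call: filter one centre's records by gene."""
    return [dict(record, centre=name) for record in CENTRES[name] if record["gene"] == gene]

def query_all(gene):
    """Send the same query to every centre in parallel and merge the answers."""
    names = sorted(CENTRES)
    with ThreadPoolExecutor() as pool:
        # map() yields results in the order of 'names', so the merge is deterministic.
        parts = pool.map(lambda name: query_centre(name, gene), names)
        return [row for part in parts for row in part]

hits = query_all("TP53")
for row in hits:
    print(row["centre"], row["donor"], row["cancer_type"])
```

Because every centre exposes the same fields, adding a centre in this sketch means adding one entry to `CENTRES`; the query logic is unchanged. That property is what a shared data model is meant to provide.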
Future developments for BioMart involve specialized, ‘pre-packaged’ and reusable data portals. One example already in development is the OncoPortal, aimed at researchers managing cancer data. It will include preconfigured access to sources of annotation that are useful for cancer research, such as Ensembl (5), Reactome (12), COSMIC (9), the Pancreatic Expression Database (10) and others. It will also include a set of tools that are specifically designed for cancer data analysis. There are plans to build other preconfigured portals for different research areas, such as a mouse portal and a model organism portal. The ambition of the BioMart community is for the BioMart project to remain at the forefront of innovative solutions for biological data management in the years to come. By creating these specialized solutions and further reducing the barriers to entry, the aim is to encourage more groups to share their data through BioMart, thereby benefiting the entire BioMart community.