Update of the Human MitBASE database

Human MitBASE is a database collecting human mtDNA variants. It is part of a larger mitochondrial genome database (MitBASE) funded within the EU Biotech Program. The present paper reports recent improvements in data structure, data quality and data quantity. The database structure is now fully designed and implemented. Starting from the previously described structure, some changes have been made to optimise both data input and data quality. Cross-references with other bio-databases (EMBL, OMIM, MEDLINE) have been implemented. Human MitBASE data can be queried with the MitBASE Simple Query System (http://www.ebi.ac.uk/htbin/Mitbase/mitbase.pl) and with SRS at the EBI.


INTRODUCTION
Since the observation of mitochondrial DNA (mtDNA) mutations in patients with chronic progressive external ophthalmoplegia or Leber's hereditary optic neuropathy in 1988 (1,2), a growing number of mtDNA mutations have been associated with a wide variety of diseases (3). In addition to patients with the 'classical' mitochondrial diseases, mtDNA mutations have been found in patients with diabetes and heart failure (4-6). Moreover, mtDNA polymorphisms have been associated with Parkinson's and Alzheimer's diseases (7,8), and the age-related accumulation of somatic mtDNA mutations in post-mitotic tissues has been related to the aging process (9-11). Finally, due to its relatively high evolutionary rate, mtDNA represents one of the major tools available for evolutionary studies of populations (12,13). Indeed, the D-loop containing region of human mtDNA has a very high nucleotide substitution rate in two peripheral domains, the hypervariable regions I and II (HV-I and HV-II), a characteristic used extensively to study the origin of modern man.
The quantity of molecular and clinical data available from research groups interested in mitochondrial diseases and the number of variant sequences collected from population studies are growing exponentially; without bioinformatic support, however, it would be very difficult to carry out statistical analyses on these data and to obtain trustworthy results.
The primary focus of the Human MitBASE database is to collect all the data available worldwide relevant to mitochondrial diseases and to mitochondrial DNA intraspecies diversity in a relational database and to set up an ad hoc query system to allow the retrieval of all this information.
In MitBASE the complete human dataset has been defined by distinguishing molecular data from clinical and pathological data. The molecular dataset is the responsibility of the Bari MitBASE group, while the clinical and pathological dataset is the responsibility of the London MitBASE group.
Data are managed locally through Microsoft Access and then stored centrally at the EBI in the ORACLE MitBASE database (15).

DATA SOURCES
Human data are retrieved from bibliographic (MEDLINE) and from primary (EMBL data library and GenBank) databases (16,17). After careful revision these data are stored in MitBASE. Congress proceedings and unpublished data kindly provided by the authors are also included. The human data are coded using as

*To whom correspondence should be addressed. Tel: +39 080 548 2130; Fax: +39 080 548 4467; Email: marcella@area.ba.cnr.it

HUMAN MitBASE DATA STRUCTURE
The general structure of the human dataset was described in the previous paper (19). Each entry in the Human MitBASE database is related to a variant, which is defined as an mtDNA fragment with a different pattern of variation events with respect to the reference sequence.
The Human MitBASE data structure (Fig. 1) is composed of two sub-structures related to molecular (Fig. 1a) and clinical/pathological data (Fig. 1b).
The data structure is organised into data tables (reported in Fig. 1) and control value tables 'CV_' (tables not reported). Data tables contain specific data connected to each variant, whereas control value tables contain lists of values generally applicable to any variant. Links between data tables and control value tables have been defined in order to ensure a comprehensive integration of the available information.
All the tables in the structure are linked through the Individual table, which allows a targeted query to be built. This table correlates the molecular sub-structure with the clinical/pathological sub-structure, supported by the tables relevant to bibliographic information (JournalCitation, JournalInd and CV_Journal).
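The role of the Individual table as the hub of a targeted query can be illustrated with a minimal sketch. Only the table names Individual, JournalCitation, JournalInd and CV_Journal come from the structure described above; all column names, the reading of JournalInd as a link table between citations and individuals, the sample rows and the use of SQLite in place of Microsoft Access or ORACLE are assumptions made for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical subset of the tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Control value table: list of values shared by all variants (assumed columns).
        CREATE TABLE CV_Journal (journal_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE JournalCitation (
            citation_id INTEGER PRIMARY KEY,
            journal_id  INTEGER REFERENCES CV_Journal(journal_id),
            year        INTEGER
        );
        CREATE TABLE Individual (
            individual_id INTEGER PRIMARY KEY,
            pedigree_code TEXT
        );
        -- Assumed link table between citations and individuals.
        CREATE TABLE JournalInd (
            citation_id   INTEGER REFERENCES JournalCitation(citation_id),
            individual_id INTEGER REFERENCES Individual(individual_id)
        );
        """
    )
    # Invented sample rows, not real MitBASE data.
    conn.execute("INSERT INTO CV_Journal VALUES (1, 'Example Journal')")
    conn.execute("INSERT INTO JournalCitation VALUES (10, 1, 1999)")
    conn.execute("INSERT INTO Individual VALUES (100, 'FAM1-II-3')")
    conn.execute("INSERT INTO JournalInd VALUES (10, 100)")
    return conn


def journals_for_individual(conn, pedigree_code):
    """Return (journal name, year) pairs for one individual via the Individual table."""
    return conn.execute(
        """
        SELECT j.name, c.year
        FROM Individual i
        JOIN JournalInd ji     ON ji.individual_id = i.individual_id
        JOIN JournalCitation c ON c.citation_id = ji.citation_id
        JOIN CV_Journal j      ON j.journal_id = c.journal_id
        WHERE i.pedigree_code = ?
        """,
        (pedigree_code,),
    ).fetchall()
```

In this sketch every join passes through Individual, so molecular or clinical tables added to the schema could be reached from the same starting point.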
Each entry in the database is related to a nucleotide sequence variant located in a specific region of the mtDNA extracted from a tissue of an individual. This entry is identified by an automatically generated sourceregion_id. In this way the same individual can be associated with more than one entry if different regions of his/her genome have been analysed or if the same region has been analysed in different tissues. Each individual is identified in MitBASE by a pedigree code and/or an individual paper code. The pedigree code is composed of a family code, which identifies the single individual or a family when a pedigree study is reported, followed by characters, Roman and Arabic numbers, to mark the individual in the generation. An internal cross-referencing implemented in the ORACLE database will allow all the entries relevant to the same individual to be collected.
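The relation between entries and individuals described above can be sketched as follows. Only the name sourceregion_id is taken from the text; the other field names, the counter used to generate identifiers and the pedigree code layout used in the usage example are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import count

# Stand-in for the automatic identifier generation (assumed: a simple counter).
_next_sourceregion_id = count(1)


@dataclass(frozen=True)
class Entry:
    sourceregion_id: int  # generated automatically, one per entry
    pedigree_code: str    # identifies the individual
    region: str           # analysed mtDNA region
    tissue: str           # tissue the mtDNA was extracted from


def new_entry(pedigree_code, region, tissue):
    """Create an entry with a freshly generated sourceregion_id."""
    return Entry(next(_next_sourceregion_id), pedigree_code, region, tissue)


def entries_by_individual(entries):
    """Collect all entries relevant to the same individual."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry.pedigree_code].append(entry)
    return dict(grouped)
```

For example, an individual coded 'FAM1-II-3' (an invented code) whose HV-I region was analysed in both blood and muscle would yield two entries with distinct sourceregion_id values, both returned under the same key by entries_by_individual.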
The information annotated in the molecular Human MitBASE sub-structure includes: the analysed mtDNA region, the experimental method used for the analysis, the tissue or cell lines used for the molecular studies, the sex, age and population data of the subject, and information about his/her geographical and linguistic origin. The type of variation that has occurred (substitution, deletion, insertion), the variation location and any restriction site gain or loss are also reported.
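As an illustration of how variation events of the three types could be derived, the sketch below compares a variant fragment with the reference, assuming the two sequences have already been aligned and that gaps are written as '-'. The function name, the tuple layout and the convention of reporting an insertion at the preceding reference position are assumptions, not the procedure used by the MitBASE curators.

```python
def variation_events(reference, variant, start=1):
    """List variation events between two pre-aligned sequences.

    reference, variant: aligned strings of equal length, '-' marks a gap.
    start: reference coordinate of the first aligned reference base.
    Returns tuples (type, reference position, detail).
    """
    if len(reference) != len(variant):
        raise ValueError("aligned sequences must have the same length")
    events = []
    pos = start - 1  # last reference position consumed so far
    for ref_base, var_base in zip(reference, variant):
        if ref_base != "-":
            pos += 1
        if ref_base == var_base:
            continue
        if ref_base == "-":
            # Base present only in the variant: inserted after position pos.
            events.append(("insertion", pos, var_base))
        elif var_base == "-":
            events.append(("deletion", pos, ref_base))
        else:
            events.append(("substitution", pos, f"{ref_base}>{var_base}"))
    return events
```

The list returned for one fragment corresponds to the 'pattern of variation events' that distinguishes a variant from the reference sequence.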
In the 'clinical sub-structure' the following sub-groups have been defined based on different types of analyses carried out on the patients: clinical, histopathological, analyte and biochemistry features.

FLATFILE FORMAT STRUCTURE
A flatfile (ff) format has been fully designed and implemented as the output of the Human MitBASE Simple Query System (http://www.ebi.ac.uk/htbin/Mitbase/mitbase.pl). This query system allows human variants to be searched by gene name.
The ff format follows the rules agreed with the other partners of the MitBASE project (15). It reports information common to all MitBASE nodes (entry date, taxonomy, bibliography and a cross-referencing line for the link with other biological databases) followed by the specific Human MitBASE data. A general scheme of the present flatfile format is shown in Figure 2.
Individual, clinical, histopathological, biochemical and pathological data are codified by adopting new two-letter codes based on the model of the EMBL data library 'FT' lines, each consisting of a feature key, feature qualifiers and a feature description. The major benefit of this ff structuring is that it can be implemented in SRS.
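The general idea of feature lines built from a two-letter code, a feature key, qualifiers and a description can be sketched with a small parser. The line layout, the codes 'CL' and 'HP', the feature keys and the '/name="value"' qualifier syntax (borrowed from EMBL-style qualifiers) are all hypothetical; the actual Human MitBASE codes are those shown in the flatfile scheme of Figure 2.

```python
def parse_feature_lines(lines):
    """Parse hypothetical feature lines into a list of dictionaries.

    Assumed layout: a two-letter line code, then either a feature key
    followed by a free-text description, or a '/name="value"' qualifier
    that belongs to the most recent feature.
    """
    features = []
    for line in lines:
        code, rest = line[:2], line[2:].strip()
        if not rest:
            continue  # line code with no content
        if rest.startswith("/"):
            if not features:
                raise ValueError("qualifier line before any feature key")
            name, _, value = rest[1:].partition("=")
            features[-1]["qualifiers"][name] = value.strip('"')
        else:
            key, _, description = rest.partition(" ")
            features.append(
                {
                    "code": code,
                    "key": key,
                    "description": description.strip(),
                    "qualifiers": {},
                }
            )
    return features
```

Keeping key, qualifiers and description as separate fields mirrors the reason given above for this structuring: each field can then be indexed independently by a retrieval system.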

SEARCHING WITH SRS
In collaboration with the EBI HmutDB project (20), the Human MitBASE database has been incorporated into the EBI SRS server under the 'Mutations' section (http://srs.ebi.ac.uk/). All the fields, excluding the sequence, are indexed, allowing detailed queries that can be combined with arbitrary complexity.
Desired fields can be selected and written out for further statistical analysis.

ACKNOWLEDGEMENT
This work has been funded under the EU Biotechnology programme, contract number: BIO4 CT950160.

Figure 1. The Microsoft Access Human MitBASE structure: (a) the molecular structure and (b) the clinical/pathological structure. Data tables and the links among them are shown. The fields reported in bold are the identifiers associated with the information in each table, which are automatically updated during data input. The names of the tables describe their content.