This paper describes the organisation of a database for human mitochondrial control-region sequences. The data are divided into three ASCII files that contain aligned sequences from hypervariable region I (HVRI), aligned sequences from hypervariable region II (HVRII), and the available information about the individuals from whom the sequences stem. The current collection comprises 4079 HVRI and 969 HVRII sequences. For 728 individuals, sequences of both HVRI and HVRII are available. For easy access, the collection is made available to the scientific community via the World Wide Web at http://www.zi.biologie.uni-muenchen.de/~meyers/mtdna.html
The history of human populations has been studied using a wealth of different genetic systems ( 1–4 ). Because the mitochondrial genome is maternally inherited and accumulates substitutions at a higher rate than the nuclear genome, it is well suited to analysing human population history with simple models. In particular, the hypervariable regions HVRI and HVRII ( 5 ) of the control region have been studied extensively (cf. 6 and references therein). Since 1981 the amount of available HVRI and HVRII data has increased exponentially ( Fig. 1 ). We have collected and aligned a large number of control-region sequences. This paper describes the organisation of the database.
Compilation of sequences
Sequences were collected from publications ( 7–40 ) or retrieved from GenBank ( 41 ) and stored as plain ASCII files. Sequences from GenBank were compared with the sequences in the corresponding publications. If discrepancies occurred, the sequences were stored as given in the paper. If only the positions deviating from the reference sequence ( 7 ) were published, these deviations were applied to the reference sequence and the resulting sequence was stored. When the publication did not clearly state the start and end of a sequence, the first and last variable sites, respectively, were taken as its boundaries. Unfortunately, when individuals from more than one population were studied, it was not always evident how often each lineage was found or to which population it belonged. If this could not be resolved, the data were not added to the collection.
Sequences were aligned manually. For HVRI we aligned positions 16001–16408 and for HVRII positions 1–408 of the reference sequence ( 7 ). Sequences longer than this alignment were truncated to the corresponding sites; sequences shorter than the alignment were padded with question marks to the required length. All undetermined nucleotides within a sequence are also represented by question marks. A dash (−) indicates an insertion or deletion of a nucleotide.
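The reconstruction of a full sequence from published deviations can be illustrated with a minimal Python sketch. It is not the procedure actually used for the collection: the function name `apply_deviations`, the toy reference fragment and the restriction to substitutions (no insertions or deletions) are assumptions made for illustration.

```python
def apply_deviations(reference, start, deviations):
    """Return a copy of `reference` with substitutions applied.

    reference  -- nucleotide string whose first character is position `start`
    deviations -- mapping {position: nucleotide} in reference numbering
    Only substitutions are handled; indels are outside this sketch.
    """
    seq = list(reference)
    for pos, base in deviations.items():
        idx = pos - start
        if not 0 <= idx < len(seq):
            raise ValueError(f"position {pos} lies outside the reference fragment")
        seq[idx] = base
    return "".join(seq)


# Toy fragment, NOT the real reference sequence; numbering starts at 16001.
toy_reference = "ATTCTAATTT"
print(apply_deviations(toy_reference, 16001, {16003: "C", 16010: "A"}))
# prints ATCCTAATTA
```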
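The truncation and padding rule can be sketched as follows. This is a simplified illustration, assuming an ungapped sequence whose first nucleotide lies at a known reference position; the function name `fit_to_window` and the treatment of `N` as an undetermined nucleotide are assumptions, not part of the database specification.

```python
def fit_to_window(seq, seq_start, win_start=16001, win_end=16408):
    """Truncate or pad `seq` so that it covers win_start..win_end exactly.

    seq_start is the reference position of the first nucleotide of `seq`.
    Sites outside the sequenced range are filled with '?'.
    """
    width = win_end - win_start + 1
    out = ["?"] * width
    for offset, base in enumerate(seq):
        idx = seq_start + offset - win_start
        if 0 <= idx < width:
            # Assumption: 'N' marks an undetermined nucleotide in the input.
            out[idx] = "?" if base in "Nn" else base
    return "".join(out)


# A short fragment starting at 16002, fitted to a toy window 16001..16004.
print(fit_to_window("AC", 16002, 16001, 16004))   # prints ?AC?
```

With the default arguments the result always has 408 characters, the ungapped length of the HVRI window 16001–16408.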
Organisation of collected data
The data are divided into an information file (info12.txt) and two sequence files (alld1.txt, alld2.txt). To reduce the amount of storage the sequence files contain only ‘(database)-lineages’, which differ in at least one position from the remaining entries of the collection. The file info12.txt contains available information about the individuals. Currently the following categories are defined:
I: <number>. This number specifies the HVRI lineage found in the individual. The corresponding sequence in alld1.txt has the same number. A zero indicates that HVRI was not sequenced for that individual.
II: <number>. This number refers to the corresponding HVRII lineage in alld2.txt.
Continent the individual stems from. The following abbreviations are used: AFRI, Africa; AMER, Americas; ASIA, Asia; A/OC, Australia & Oceania; EURO, Europe.
N: specifies the name of the sequence in the original publication or the GenBank accession number.
R: gives the original reference.
O: shows the country of origin.
P: gives the population the individual belongs to.
L: gives the language and the language phylum of the individual.
+9bpdel/−9bpdel indicates the presence or absence of the 9 bp deletion ( 42 ).
The file alld1.txt contains the alignment of HVRI lineages. Each lineage in the file is indexed by a number. If an individual from info12.txt has the same number, the corresponding sequence was found in that individual. The file alld2.txt is organised as alld1.txt. It comprises the alignment of the HVRII.
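The relation between individuals and numbered lineages can be sketched in Python. The sketch assumes that two sequences belong to the same lineage only if their aligned strings are identical, and it uses `None` as a hypothetical marker for a region that was not sequenced; neither the function name `build_lineages` nor these conventions are taken from the database files.

```python
def build_lineages(sequences):
    """Collapse identical aligned sequences into numbered lineages.

    Returns (lineages, assignment): `lineages` maps lineage number -> sequence,
    `assignment` gives the lineage number of each input sequence
    (0 where the entry is None, i.e. the region was not sequenced).
    """
    number_of = {}
    assignment = []
    for seq in sequences:
        if seq is None:
            assignment.append(0)
            continue
        if seq not in number_of:
            number_of[seq] = len(number_of) + 1   # numbers start at 1
        assignment.append(number_of[seq])
    lineages = {n: s for s, n in number_of.items()}
    return lineages, assignment


lineages, assignment = build_lineages(["ACGT", "ACGA", None, "ACGT"])
print(lineages)     # {1: 'ACGT', 2: 'ACGA'}
print(assignment)   # [1, 2, 0, 1]
```

Storing each distinct sequence once and referring to it by number is what allows 4079 HVRI sequences to be represented by a smaller set of lineages.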
A C program that should run on most computers allows the retrieval of all individual sequences that match a user-defined keyword in the information file. The search results are stored in four files: kw-info contains the information about the individuals that match the keyword, kw-I and kw-II contain the HVRI and HVRII sequences of these individuals, and kw-I–II contains the sequences of those individuals for whom both hypervariable regions have been sequenced.
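The logic of such a keyword search can be sketched in Python; this is not the distributed C program. The in-memory record layout (dictionaries with the keys `I`, `II` and `info`), the case-insensitive substring match and the function name `search` are assumptions made for illustration. The keys of the returned dictionary mirror the four output files.

```python
def search(individuals, hvr1, hvr2, keyword):
    """Select individuals whose annotation contains `keyword`.

    individuals -- list of dicts with keys "I" and "II" (lineage numbers,
                   0 = not sequenced) and "info" (free-text annotation);
                   this layout is hypothetical
    hvr1, hvr2  -- dicts mapping lineage number -> aligned sequence
    """
    hits = [p for p in individuals if keyword.lower() in p["info"].lower()]
    return {
        "kw-info": [p["info"] for p in hits],
        "kw-I": [hvr1[p["I"]] for p in hits if p["I"]],
        "kw-II": [hvr2[p["II"]] for p in hits if p["II"]],
        "kw-I-II": [(hvr1[p["I"]], hvr2[p["II"]])
                    for p in hits if p["I"] and p["II"]],
    }


# Toy data, not taken from the database.
people = [
    {"I": 1, "II": 0, "info": "EURO Finland"},
    {"I": 2, "II": 1, "info": "AFRI Kenya"},
    {"I": 1, "II": 1, "info": "EURO Germany"},
]
result = search(people, {1: "AAA", 2: "AAC"}, {1: "GGG"}, "euro")
print(result["kw-I-II"])   # [('AAA', 'GGG')]
```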
Description of the compilation
The current collection comprises 4079 HVRI sequences, 969 HVRII sequences, and 728 individuals for whom both HVRI and HVRII are known. This amounts to 2298 and 580 (database)-lineages for HVRI and HVRII, respectively. Among the individuals for whom both HVRI and HVRII have been determined, 539 lineages are found. These numbers also include some unpublished sequences [K.Bauer, H.Geisert, M.Krings, M.Laan, A.Salem, A.Sajantila and S.Pääbo (1997), manuscript in preparation], which will be made available as soon as they are in press.
Table 1 shows the number of sequences and lineages for each continent. An overview of the worldwide sampling is displayed in Figure 2 . Some regions of the world are well sampled, whereas sampling is still poor in others. Except for India and South Africa, where the numbers of HVRI and HVRII sequences are balanced, HVRI sequences strongly predominate. For some regions only HVRI sequences are available.
Table 2 shows the number of sequences according to language phyla. Sequences are available for 12 of the 18 language phyla, classified according to Ruhlen ( 43 ). Unfortunately, for 1657 individuals the publications did not specify the linguistic affiliation of the sequences.
The alignment of the HVRI sequences is 419 bp long and starts at position 16001 of the human reference sequence ( 7 ). Gaps of varying length were introduced at positions 16104.1, 16169.1, 16174.1, 16183.1–16183.4, 16227.1, 16259.1, 16366.1 and 16386.1. In particular, the region from position 16183 to 16193 shows a high degree of length variation ( 19 , 31 ). Of the 419 positions, 275 are variable: 188 sites carry two different nucleotides (164 with transitions and 24 with transversions), 66 carry three nucleotides and 21 show all four nucleotides.
The HVRII sequence alignment, which starts at position 1, comprises 418 bp with gaps at positions 56.1, 65.1, 190.1, 294.1, 302.1–302.4 and 310.1–310.2. Only 105 of the 418 positions are variable. Two nucleotides are found at 89 of these positions (77 with transitions and 12 with transversions), three different nucleotides at 15 positions, and all four nucleotides at one position.
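The relation between reference positions and alignment columns can be made explicit with a short Python sketch. It assumes the common reading of the notation, in which 16183.2 denotes the second inserted column after reference position 16183; the names `HVRI_GAPS` and `column` are hypothetical. Under this reading, the 408 reference positions plus the 11 gap columns listed above give the stated alignment length of 419.

```python
# Number of inserted alignment columns after each reference position (HVRI),
# read off the gap positions listed in the text (16183.1-16183.4 -> 4 columns).
HVRI_GAPS = {16104: 1, 16169: 1, 16174: 1, 16183: 4,
             16227: 1, 16259: 1, 16366: 1, 16386: 1}


def column(label, gaps=HVRI_GAPS, first=16001):
    """Map a label such as '16183' or '16183.2' to a 1-based alignment column."""
    pos_text, _, ins_text = label.partition(".")
    pos = int(pos_text)
    ins = int(ins_text) if ins_text else 0
    if ins > gaps.get(pos, 0):
        raise ValueError(f"no gap column {label} in this alignment")
    # Gap columns inserted after earlier positions shift this one to the right.
    shift = sum(n for p, n in gaps.items() if p < pos)
    return (pos - first + 1) + shift + ins


print(column("16001"))   # 1
print(column("16408"))   # 419
```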
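Counts of this kind can be derived from an alignment column by column. The following Python sketch shows one way to do so; it assumes that question marks and dashes are ignored when counting nucleotides at a site, which may differ from the convention used for the figures above, and the function name `classify_sites` is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_sites(alignment):
    """Count alignment columns by the number of distinct nucleotides seen."""
    counts = {"transition": 0, "transversion": 0, "three": 0, "four": 0}
    for col in zip(*alignment):
        bases = set(col) & {"A", "C", "G", "T"}   # ignore '?' and '-'
        if len(bases) == 2:
            # Two purines or two pyrimidines: transition; otherwise transversion.
            same_class = bases <= PURINES or bases <= PYRIMIDINES
            counts["transition" if same_class else "transversion"] += 1
        elif len(bases) == 3:
            counts["three"] += 1
        elif len(bases) == 4:
            counts["four"] += 1
    return counts


# Toy alignment of four sequences, not taken from the database.
toy = ["ACGTAA", "GCTCAC", "AATGAG", "A?-AAA"]
print(classify_sites(toy))
# {'transition': 1, 'transversion': 2, 'three': 1, 'four': 1}
```

The four categories partition the variable sites, which agrees with the totals given in the text: 188 + 66 + 21 = 275 for HVRI and 89 + 15 + 1 = 105 for HVRII.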
Quality and completeness of the data and future directions
Our data have largely been compiled from published sequences. Although we have taken great pains to minimise mistakes, our collection may still contain sequences with errors or with incorrect annotations. To ensure a high quality of the data, we would be grateful if errors or ambiguities were pointed out to us.
We invite everyone to submit new sequences, together with the relevant information, via electronic mail. We would also be grateful to receive already published sequences that are missing from our collection.
Besides regular updates of the collection of human control-region sequences, we are planning to add DNA sequences from the hypervariable region of the mitochondrial control region of chimpanzees. Currently, 377 such sequences have been published ( 44–46 ).
While we have collected only the control-region sequences from humans, there are other databases, such as MITOMAP ( 47 ), that collect information about the variability of the entire human mitochondrial genome.
The collection is available on request ( email@example.com or firstname.lastname@example.org ). It can also be retrieved free of charge over the internet from http://www.zi.biologie.uni-muenchen.de/~meyers/mtdna.html . We also distribute a simple program that allows retrieval of sequences according to specific keywords. The program is written in standard C and should run on most computers equipped with a C compiler. It can also be obtained from the internet address given above.
We are grateful to all colleagues who provided their sequence data as a computer file and gave additional information when needed. We want to express our special thanks to Matthias Krings, Martin Richards, Antti Sajantila, and Svante Pääbo. Financial support from the DFG is gratefully acknowledged.