Abstract

This paper describes the organisation of a database for human mitochondrial control-region sequences. The data are divided into three ASCII files that contain aligned sequences from hypervariable region I (HVRI), aligned sequences from hypervariable region II (HVRII), and the available information about the individuals from whom the sequences were obtained. The current collection comprises 4079 HVRI and 969 HVRII sequences. For 728 individuals, sequences of both HVRI and HVRII are available. For easy access, the collection is made available to the scientific community via the World Wide Web at http://www.zi.biologie.uni-muenchen.de/~meyers/mtdna.html

Introduction

The history of human populations has been studied using a wealth of different genetic systems ( 1–4 ). Because the mitochondrial genome is maternally inherited and accumulates substitutions at a higher rate than the nuclear genome, it is well suited for analysing human population history with simple models. In particular, the hypervariable regions HVRI and HVRII ( 5 ) of the control region have been studied extensively (cf. 6 and references therein). Since 1981 the amount of available HVRI and HVRII data has increased exponentially ( Fig. 1 ). We have collected and aligned a large number of control-region sequences. This paper describes the organisation of the database.

Compilation of sequences

Sequences were collected from publications ( 7–40 ) or were retrieved from GenBank ( 41 ) and stored as plain ASCII files. Sequences from GenBank were compared to the sequences in the corresponding publications. If discrepancies occurred, the sequences were stored as given in the paper. If only the sequence positions deviating from the reference sequence ( 7 ) were published, these deviations were applied to the reference sequence and the resulting sequence was stored. When the publication did not clearly state the start and end of a sequence, the first and the last variable sites were taken as its boundaries. Unfortunately, it was not always evident how often each lineage was found, or to which population it belonged when individuals of more than one population were studied. If this could not be resolved, the data were not added to the collection.
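The reconstruction of a full sequence from a published list of deviations can be sketched as follows. This is a minimal illustration, not the procedure actually used for the collection: it handles substitutions only (no insertions or deletions), the function name `apply_deviations` is hypothetical, and the reference fragment is a toy string rather than real mitochondrial sequence.

```python
def apply_deviations(reference, ref_start, deviations):
    """Return a copy of `reference` with the listed substitutions applied.

    reference  -- reference sequence for the region, as a string
    ref_start  -- reference position corresponding to reference[0]
    deviations -- mapping {position: nucleotide} of published differences
    """
    seq = list(reference)
    for pos, base in deviations.items():
        index = pos - ref_start
        if not 0 <= index < len(seq):
            raise ValueError(f"position {pos} lies outside the reference segment")
        seq[index] = base
    return "".join(seq)


# Toy reference fragment (assumption: NOT real mtDNA sequence).
toy_reference = "ACGTACGTAC"
print(apply_deviations(toy_reference, 16001, {16003: "A", 16010: "T"}))
```

Positions outside the stored segment raise an error instead of being silently ignored, so that a mismatch between the published numbering and the stored reference is noticed.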

Figure 1

Accumulation of HVRI and HVRII sequences during the last 15 years.

Sequences were aligned manually. For HVRI we aligned positions 16001–16408 and for HVRII positions 1–408, numbered according to the reference sequence ( 7 ). Sequences longer than this alignment were truncated to the corresponding sites; shorter sequences were padded with question marks to the length required by the alignment. All undetermined nucleotides within a sequence are also represented by question marks. A dash (−) indicates an insertion or deletion of a nucleotide.
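The truncation and padding rule can be illustrated with a short sketch. It assumes a sequence without insertions or deletions, so that each nucleotide maps to exactly one reference position, and it assumes that undetermined nucleotides arrive as `N`; both are simplifications, and the function name `fit_to_window` is hypothetical. The actual alignment was made by hand.

```python
def fit_to_window(seq, seq_start, win_start, win_end, missing="?"):
    """Truncate or pad `seq` so that it covers win_start..win_end inclusive.

    seq       -- nucleotide string without indels (simplifying assumption)
    seq_start -- reference position of seq[0]
    """
    width = win_end - win_start + 1
    out = [missing] * width          # positions not covered stay as '?'
    for offset, base in enumerate(seq):
        pos = seq_start + offset
        if win_start <= pos <= win_end:  # sites outside the window are dropped
            # Assumption: 'N' marks an undetermined nucleotide in the input.
            out[pos - win_start] = missing if base in "Nn" else base
    return "".join(out)


# A toy read starting one site before the window: the first base is cut off
# and the uncovered end of the window is filled with question marks.
print(fit_to_window("ACGT", 16000, 16001, 16006))
```

With the HVRI window 16001–16408 the result always has 408 characters, before any gap columns are introduced by the alignment.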

Organization of collected data

The data are divided into an information file (info12.txt) and two sequence files (alld1.txt, alld2.txt). To save storage space, the sequence files contain only ‘(database)-lineages’, i.e. sequences that differ in at least one position from every other entry in the collection, so that a sequence shared by several individuals is stored only once. The file info12.txt contains the available information about the individuals. Currently the following categories are defined:

  • I: <number>. This number specifies the HVRI lineage found in the individual. The corresponding sequence in alld1.txt has the same number. A zero indicates that HVRI was not sequenced for that individual.

  • II: <number>. This number refers to the corresponding HVRII lineage in alld2.txt.

  • Continent the individual stems from. The following abbreviations are used: AFRI, Africa; AMER, Americas; ASIA, Asia; A/OC, Australia & Oceania; EURO, Europe.

  • N: specifies the name of the sequence in the original publication or the GenBank accession number.

  • R: gives the original reference.

  • O: shows the country of origin.

  • P: gives the population the individual belongs to.

  • L: gives the language and the language phylum of the individual.

  • +9bpdel/−9bpdel indicates the presence or absence of the 9 bp deletion ( 42 ).

Figure 2

Worldwide distribution of mitochondrial hypervariable region I and II sequences. The size of the circles reflects the proportion of individuals from the corresponding region in the collection. The proportion of hypervariable region I sequences is shown as the solid portion of each pie chart. The approximate origins of the individuals are: 1, Hawaii; 2, Polynesia; 3, North America; 4, Panama; 5, South America; 6, Europe; 7, North Africa; 8, South Africa; 9, Turkey; 10, Jordan; 11, Russia; 12, India; 13, China; 14, Indonesia & Malaysia; 15, Taiwan; 16, Philippines; 17, Australia; 18, Japan; 19, Micronesia & Melanesia.

The file alld1.txt contains the alignment of HVRI lineages. Each lineage in the file is indexed by a number. If an individual in info12.txt carries the same number, the corresponding sequence was found in that individual. The file alld2.txt is organised in the same way as alld1.txt and contains the alignment of the HVRII lineages.
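The relation between individuals and numbered lineages can be sketched as follows. The function `collapse_to_lineages` is a hypothetical illustration, not part of the distributed software. It compares sequences as exact strings, so a sequence containing a question mark counts as different from one with a determined nucleotide at that site; how the collection treats such cases is not stated here and is an assumption of the sketch.

```python
def collapse_to_lineages(individual_seqs):
    """Assign one lineage number to each distinct aligned sequence.

    individual_seqs -- list with one aligned sequence per individual,
                       or None if the region was not sequenced
    Returns (lineages, assignment): `lineages` maps number -> sequence,
    and assignment[i] is the lineage number of individual i
    (0 = region not sequenced, as in the information file).
    """
    number_of = {}   # sequence -> lineage number
    lineages = {}    # lineage number -> sequence
    assignment = []
    for seq in individual_seqs:
        if seq is None:
            assignment.append(0)
            continue
        if seq not in number_of:
            number_of[seq] = len(number_of) + 1
            lineages[number_of[seq]] = seq
        assignment.append(number_of[seq])
    return lineages, assignment


# Four toy individuals: two share a sequence, one was not sequenced.
lineages, assignment = collapse_to_lineages(["ACG?", "ACGT", None, "ACG?"])
print(lineages)
print(assignment)
```

Storing each distinct sequence once and referring to it by number is what keeps the sequence files smaller than a file with one sequence per individual.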

Program

A C program, which should run on most computers, retrieves all individual sequences that match a user-defined keyword in the information file. The search results are stored in four files: kw-info contains the information about the individuals that match the keyword; kw-I and kw-II contain the HVRI and HVRII sequences of these individuals; and kw-I–II contains the sequences of those individuals for whom both hypervariable regions have been sequenced.
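The logic of such a keyword search can be sketched in a few lines. The sketch below is written in Python rather than C and is not the distributed program. It works on in-memory records instead of the actual files, whose exact layout is not described here, and it assumes case-insensitive substring matching and that a zero marks a missing HVRII lineage just as it does for HVRI. All record values are toy data.

```python
def keyword_search(records, hvr1, hvr2, keyword):
    """Return the four result sets described for the retrieval program.

    records -- list of dicts; keys 'I' and 'II' hold lineage numbers
               (0 = not sequenced), all other keys hold annotation text
    hvr1    -- dict mapping HVRI lineage number -> aligned sequence
    hvr2    -- dict mapping HVRII lineage number -> aligned sequence
    """
    kw = keyword.lower()
    # Assumption: a record matches if any annotation field contains the keyword.
    hits = [r for r in records
            if any(kw in str(value).lower()
                   for key, value in r.items() if key not in ("I", "II"))]
    return {
        "kw-info": hits,
        "kw-I": [hvr1[r["I"]] for r in hits if r["I"] != 0],
        "kw-II": [hvr2[r["II"]] for r in hits if r["II"] != 0],
        # ASCII hyphen used here in place of the dash in the file name kw-I–II.
        "kw-I-II": [(hvr1[r["I"]], hvr2[r["II"]])
                    for r in hits if r["I"] != 0 and r["II"] != 0],
    }


# Toy data (assumption: field names and values are placeholders).
records = [
    {"I": 1, "II": 0, "continent": "EURO", "O": "CountryA", "P": "PopA"},
    {"I": 2, "II": 1, "continent": "AFRI", "O": "CountryB", "P": "PopB"},
    {"I": 0, "II": 2, "continent": "EURO", "O": "CountryC", "P": "PopC"},
]
hvr1 = {1: "ACGT", 2: "ACGA"}
hvr2 = {1: "TTGA", 2: "TTGC"}
result = keyword_search(records, hvr1, hvr2, "euro")
print(len(result["kw-info"]), result["kw-I"], result["kw-II"], result["kw-I-II"])
```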

Description of the compilation

The current collection comprises 4079 HVRI sequences, 969 HVRII sequences, and 728 individuals for whom both HVRI and HVRII are known. This amounts to 2298 and 580 (database)-lineages for HVRI and HVRII, respectively. Among individuals for whom both HVRI and HVRII have been determined, 539 lineages are found. These numbers also include some unpublished sequences [K.Bauer, H.Geisert, M.Krings, M.Laan, A.Salem, A.Sajantila and S.Pääbo (1997), manuscript in preparation], which will be made available as soon as they are in press.

Table 1

Number of sequenced individuals and lineages for the five continents.

Table 2

Collection of sequences according to language phyla.

Geographical sampling

Table 1 shows the number of sequences and lineages for each continent. An overview of the worldwide sampling is displayed in Figure 2 . Some regions of the world are well sampled, whereas sampling is still poor in others. Except for India and South Africa, where the numbers of HVRI and HVRII sequences are balanced, HVRI sequences strongly predominate. For some regions only HVRI sequences are available.

Language sampling

Table 2 shows the number of sequences according to language phyla. Sequences are available for 12 of the 18 language phyla, classified according to Ruhlen ( 43 ). Unfortunately, for 1657 individuals the publications did not specify the linguistic affiliation.

Alignment

The alignment of the HVRI sequences is 419 bp long and starts at position 16001 of the human reference sequence ( 7 ). Gaps of varying length were introduced at positions 16104.1, 16169.1, 16174.1, 16183.1–16183.4, 16227.1, 16259.1, 16366.1 and 16386.1. In particular, the region from position 16183 to 16193 shows a high degree of length variation ( 19 , 31 ). Of the 419 positions, 275 are variable: 188 sites carry two different nucleotides (164 with transitions and 24 with transversions), 66 sites carry three nucleotides and 21 sites show all four nucleotides.

The HVRII sequence alignment, which starts at position 1, comprises 418 bp with gaps at positions 56.1, 65.1, 190.1, 294.1, 302.1–302.4 and 310.1–310.2. Only 105 of the 418 positions are variable. Two nucleotides are found at 89 of these positions (77 transitions and 12 transversions); of the remainder, 15 positions show three different nucleotides and one position shows all four.
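Site counts of this kind can be derived from an alignment with a simple column-wise tally. The sketch below is illustrative only: the function name `classify_sites` is hypothetical, the input is a toy alignment, and it assumes that question marks and dashes are ignored when a column is classified. How gap columns entered the published counts is not stated, so results for the real files may differ.

```python
# Unordered nucleotide pairs that differ by a transition (purine<->purine
# or pyrimidine<->pyrimidine); every other pair is a transversion.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def classify_sites(alignment):
    """Tally variable columns in a list of equal-length aligned sequences."""
    counts = {"two_transition": 0, "two_transversion": 0, "three": 0, "four": 0}
    for column in zip(*alignment):
        # Assumption: '?' and '-' carry no information about the nucleotide.
        bases = {b for b in column if b in "ACGT"}
        if len(bases) == 2:
            if frozenset(bases) in TRANSITIONS:
                counts["two_transition"] += 1
            else:
                counts["two_transversion"] += 1
        elif len(bases) == 3:
            counts["three"] += 1
        elif len(bases) == 4:
            counts["four"] += 1
    return counts


# Toy alignment with one column of each kind, one invariant column
# and one column holding only '?' and '-'.
toy_alignment = ["ACAA?A", "GCCC-C", "ACGT?A", "ACGG?A"]
print(classify_sites(toy_alignment))
```

The sum of the four tallies gives the number of variable sites; for the figures reported above this is 188 + 66 + 21 = 275 for HVRI and 89 + 15 + 1 = 105 for HVRII.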

Quality and completeness of the data and future directions

Our data have been compiled largely from published sequences. Although we have taken great pains to minimise mistakes, our collection may still contain sequences with errors or incorrect annotations. To ensure a high quality of the data, we would be grateful if errors or ambiguities were pointed out to us.

We invite everyone to submit new sequences by electronic mail, together with the relevant information. We would also be grateful to receive published sequences that are missing from our collection.

Besides regular updates of the collection of human control-region sequences, we plan to add DNA sequences from the hypervariable region of the chimpanzee mitochondrial control region. Currently, 377 such sequences have been published ( 44–46 ).

While we have collected only control-region sequences from humans, other databases such as MITOMAP ( 47 ) collect information about the variability of the entire human mitochondrial genome.

Availability

The collection is available on request ( meyers@zi.biologie.uni-muenchen.de or arndt@zi.biologie.uni-muenchen.de ). It can also be retrieved free of charge over the Internet from http://www.zi.biologie.uni-muenchen.de/~meyers/mtdna.html . We also distribute a simple program that retrieves sequences according to specific keywords. The program is written in standard C and should run on most computers equipped with a C compiler. It can also be obtained from the Internet address given above.

Acknowledgements

We are grateful to all colleagues who provided their sequence data as a computer file and gave additional information when needed. We want to express our special thanks to Matthias Krings, Martin Richards, Antti Sajantila, and Svante Pääbo. Financial support from the DFG is gratefully acknowledged.

References

1. Nei M., Roychoudhury A.K. Human Polymorphic Genes: World Distribution, 1988, New York, Oxford University Press
2. Bowcock A.M., Ruiz-Linares A., Tomfohrde J., Minch E., Kidd J.R., Cavalli-Sforza L.L. Nature, 1994, vol. 368 (pg. 455-457)
3. Cavalli-Sforza L.L., Menozzi P., Piazza A. The History and Geography of Human Genes, 1994, Princeton, NJ, Princeton University Press
4. Deka R., Jin L., Shriver M.D., Yu L.M., DeCroo S., Hundrieser J., Bunker C.H., Ferrell R.E., Chakraborty R. Am. J. Hum. Genet., 1995, vol. 56 (pg. 461-474)
5. Vigilant L., Stoneking M., Harpending H., Hawkes K., Wilson A.C. Science, 1991, vol. 253 (pg. 1503-1507)
6. von Haeseler A., Sajantila A., Pääbo S. Nature Genet., 1996, vol. 14 (pg. 135-140)
7. Anderson S., Bankier A.T., Barell B.G., deBruijn M.H.L., Coulson A.R., Drouin J., Eperon I.C., Nierlich D.P., Roe B.A., Sanger F., Schreier P.H., Smith A.J.H., Staden R., Young I.G. Nature, 1981, vol. 290 (pg. 457-465)
8. Batista O., Kolman C.J., Bermingham E. Hum. Mol. Genet., 1995, vol. 4 (pg. 921-929)
9. Bertranpetit J., Sala J., Calafell F., Underhill P.A., Moral P., Comas D. Ann. Hum. Genet., 1995, vol. 59 (pg. 63-81)
10. Betty D.J., Chin-Atkins A.N., Croft L., Scraml M., Easteal S. Am. J. Hum. Genet., 1995, vol. 58 (pg. 428-433)
11. Comas D., Calafell F., Mateu E., Bertranpetit J. Mol. Biol. Evol., 1996, vol. 13 (pg. 1076-1077)
12. Corte-Real H.B., Macaulay V.A., Richards M.B., Hariti G., Issad M.S., Cambon-Thomsen A., Papiha S., Bertranpetit J., Sykes B.C. Ann. Hum. Genet., 1996, vol. 60 (pg. 331-350)
13. DiRienzo A., Wilson A.C. Proc. Natl. Acad. Sci. USA, 1991, vol. 88 (pg. 1597-1601)
14. Easton R., Merriwether A., Crews E., Ferrell R. Am. J. Hum. Genet., 1996, vol. 59 (pg. 213-225)
15. Francalacci P., Bertranpetit J., Calafell F., Underhill P.A. Am. J. Phys. Anthrop., 1996, vol. 100 (pg. 443-460)
16. Ginther C., Corach D., Penacino G., Rey J.A., Hutz M.H., Carnese F.R., Anderson A., Just J., Salzano F.M., King M.C. EXS, 1993, vol. 67 (pg. 211-219)
17. Graven L., Passarino G., Semino O., Bourset P., Santachiara-Benerecetti S., Langaney A., Excoffier L. Mol. Biol. Evol., 1995, vol. 12 (pg. 334-345)
18. Handt O., Richards M., Trommsdorff M., Kilger C., Simanainen J., Georgiev O., Bauer K., Stone A., Hedges R., Schaffner W., Utermann G., Sykes B., Pääbo S. Science, 1994, vol. 264 (pg. 1775-1778)
19. Horai S., Hayasaka K. Am. J. Hum. Genet., 1990, vol. 46 (pg. 828-842)
20. Horai S., Kondo R., Nakagawa-Hattori Y., Hayashi S., Sonada S., Tajima K. Mol. Biol. Evol., 1993, vol. 10 (pg. 23-47)
21. Jorde L.B., Bamshad M.J., Watkins W.S., Zenger R., Fraley A.E., Krakowiak P.A., Carpenter K.D., Soodyall H., Jenkins T., Rogers A.R. Am. J. Hum. Genet., 1995, vol. 57 (pg. 523-538)
22. Kolman C.J., Bermingham E., Cooke R., Ward R.H., Arias T.D., Guionneau-Sinclair F. Genetics, 1995, vol. 140 (pg. 275-283)
23. Kolman C.J., Sambuughin N., Bermingham E. Genetics, 1996, vol. 112 (pg. 1321-1334)
24. Lum J.M., Rickards O., Ching C., Cann R.L. Hum. Biol., 1994, vol. 4 (pg. 567-590)
25. Mountain J.L., Hebert J.M., Bhattacharyya S., Underhill P.A., Ottolenghi C., Gadgil M., Cavalli-Sforza L.L. Am. J. Hum. Genet., 1995, vol. 56 (pg. 979-992)
26. Piercy R., Sullivan K.M., Benson N., Gill P. Int. J. Legal. Med., 1994, vol. 106 (pg. 85-90)
27. Pinto F., Gonzales A., Hernandez M., Larruga J., Cabrera V. Ann. Hum. Genet., 1996, vol. 60 (pg. 321-330)
28. Pult I., Sajantila A., Simanainen J., Georgiev O., Schaffner W., Pääbo S. Biol. Chem. Hoppe-Seyler, 1994, vol. 375 (pg. 837-840)
29. Redd A.J., Takezaki N., Sherry S.T., McGarvey S.T., Sofro A.S.M., Stoneking M. Mol. Biol. Evol., 1995, vol. 12 (pg. 604-615)
30. Sajantila A., Lahermo P., Anttinen T., Lukka M., Sistonen P., Savontaus M.-L., Aula P., Beckman L., Tranebjaerg L., Gedde-Dahl T., Issel-Tarver L., DiRienzo A., Pääbo S. Gen. Res., 1995, vol. 5 (pg. 42-52)
31. Santos M., Ward R.H., Barrantes R. Hum. Biol., 1994, vol. 6 (pg. 963-977)
32. Santos S., Ribeiro-Dos-Santos A., Meyer D., Zago M. Ann. Hum. Genet., 1996, vol. 60 (pg. 305-319)
33. Stenico M., Nigro L., Bertorelle G., Calafell F., Capitanio M., Corrain C., Barbujani G. Am. J. Hum. Genet., 1996, vol. 59 (pg. 1363-1375)
34. Sykes B.C., Leiboff A., Low-Beer J., Tetzner S., Richards M. Am. J. Hum. Genet., 1995, vol. 57 (pg. 1463-1475)
35. Torroni A., Schurr T.G., Cabell M.F., Brown M.D., Neel J.V., Larsen M., Smith D.G., Vullo C.M., Wallace D.C. Am. J. Hum. Genet., 1993, vol. 53 (pg. 563-590)
36. Torroni A., Sukernik R.I., Schurr T.G., Starikovskaya Y.B., Cabell M.F., Crawford M.H., Comuzzie A.G., Wallace D.C. Am. J. Hum. Genet., 1993, vol. 53 (pg. 591-608)
37. Vigilant L. Control region sequences from African populations and the evolution of human mitochondrial DNA, 1990, Berkeley, CA, University of California, PhD thesis
38. Ward R.H., Frazier B.L., Dew-Jager K., Pääbo S. Proc. Natl. Acad. Sci. USA, 1991, vol. 88 (pg. 8720-8724)
39. Ward R.H., Redd A., Valencia D., Frazier B., Pääbo S. Proc. Natl. Acad. Sci. USA, 1993, vol. 90 (pg. 10663-10667)
40. Watson E., Bauer K., Aman R., Weiss G., von Haeseler A., Pääbo S. Am. J. Hum. Genet., 1996, vol. 59 (pg. 437-444)
41. Benson D.A., Boguski M.S., Lipman D.L., Ostell J. Nucleic Acids Res., 1997, vol. 25, 6 [See also this issue, Nucleic Acids Res. (1998) 26, 1–7.]
42. Wrischnik L.A., Higuchi R.G., Stoneking M., Erlich H.A., Arnheim N., Wilson A.C. Nucleic Acids Res., 1987, vol. 15 (pg. 529-542)
43. Ruhlen M. A Guide to the World's Languages, Volume 1: Classification, 1991, London, Melbourne, Auckland, Edward Arnold, A Division of Hodder & Stoughton
44. Morin P.A., Moore J.J., Jin L., Chakraborty R., Goodall J., Woodruff D.S. Science, 1994, vol. 265 (pg. 1193-1201)
45. Wise C.A., Sraml M., Rubinsztein D.C., Easteal S. Mol. Biol. Evol., 1997, vol. 14 (pg. 707-716)
46. Goldberg T.L., Ruovolo M. Nucleic Acids Res., 1997, vol. 25 (pg. 1-6)
47. Kogelnik A.M., Lott M.T., Brown M.D., Navathe S.B., Wallace D.C. Nucleic Acids Res., 1997, vol. 25 (pg. 196-199) [See also this issue, Nucleic Acids Res. (1998) 26, 112–115.]

Author notes

+
Present address: Department of Cytogenetics and Molecular Genetics, Women's and Children's Hospital, 72 King William Road, North Adelaide SA 5006, Australia
