Abstract

AAindex is a database of numerical indices representing various physicochemical and biochemical properties of amino acids and pairs of amino acids. It consists of two sections: AAindex1 for the amino acid index of 20 numerical values and AAindex2 for the amino acid mutation matrix of 210 numerical values. Each entry of either AAindex1 or AAindex2 consists of the definition, the reference information, a list of related entries in terms of the correlation coefficient, and the actual data. The database may be accessed through the DBGET/LinkDB system at GenomeNet (http://www.genome.ad.jp/dbget/) or may be downloaded by anonymous FTP (ftp://ftp.genome.ad.jp/db/genomenet/aaindex/).

Introduction

The variety and specificity of protein three-dimensional structures and biological functions are due to the combination of the 20 different amino acids specified by the genetic code. The amino acids are the building blocks of proteins, each with distinct characteristics such as shape, volume, and chemical reactivity. A large body of experimental and theoretical research has been performed to characterize the physicochemical and biochemical properties of individual amino acids. A derived property is often represented by a set of 20 numerical values, which is called an amino acid index.

In addition to the properties of individual amino acids, the relations between amino acids are also represented by numerical values in the analysis of protein sequences and structures. In particular, the amino acid mutation matrix, also called the amino acid similarity matrix, is the basis for optimization in protein sequence alignments and similarity searches. The amino acid mutation matrix is generally a set of 20 × 20 numerical values, or a set of 210 numerical values since the matrix is usually symmetric. The AAindex database is a collection of published amino acid indices and mutation matrices.

Background

In 1988 Nakai et al. collected 222 amino acid indices from research papers and investigated the relationships among them by hierarchical cluster analysis (1). They identified four major classes: α-helix and turn propensities, β-strand propensity, hydrophobicity (which can be further divided into subclasses), and other physicochemical properties such as the bulkiness of amino acid residues. In 1996 Tomii and Kanehisa (2) enlarged the collection to 402 indices and repeated the clustering. The result was generally in good agreement with the previous work, but for the sake of convenience the collection was divided into six major classes: α and turn propensities, β propensity, amino acid composition, hydrophobicity, physicochemical properties, and other properties.

Tomii and Kanehisa (2) also collected 42 amino acid mutation matrices from the literature and conducted extensive analysis on the correlations among them and with the amino acid indices. The AAindex database was initiated by Nakai et al. (1), was expanded by Tomii and Kanehisa (2), and is continuously updated by the present authors.

The Current Database

The AAindex database is a flat file database that consists of two sections: AAindex1 for the amino acid indices and AAindex2 for the amino acid mutation matrices. The format of the two sections is as follows.

AAindex1

The AAindex1 section currently contains 434 amino acid indices. A sample entry of AAindex1 is shown in Figure 1. Each entry consists of an accession number, a short description of the index, the reference information, and the numerical values of the property for the 20 amino acids. In addition, it contains neighbor information, namely cross-links to other entries whose correlation coefficient with the entry has an absolute value of 0.8 or larger. With these links the user can identify a set of entries describing similar properties. In some instances the values are not reported for all 20 amino acids. When available, we adopt the estimates of Kidera et al. (4), who filled in missing values by statistical considerations. When the estimates were not available, the missing values were either replaced by the mean of the remaining values or simply filled with zeros.
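The neighbor criterion and the mean-value fallback described above can be illustrated with a short Python sketch. It is not the code used to build the database: the function names, the use of `None` to mark a missing value, and the choice of the Pearson coefficient as the correlation measure are all assumptions made for illustration.

```python
from math import sqrt


def fill_missing_with_mean(values):
    """Replace None entries by the mean of the reported values.

    Assumption: missing values are marked with None.
    """
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant index")
    return cov / (dev_x * dev_y)


def find_neighbors(query_id, indices, threshold=0.8):
    """Return (accession, r) pairs with |r| >= threshold.

    `indices` is a hypothetical dict mapping an accession string to its
    list of 20 values; results are sorted by decreasing |r|.
    """
    query = indices[query_id]
    hits = []
    for other_id, values in indices.items():
        if other_id == query_id:
            continue
        r = pearson(query, values)
        if abs(r) >= threshold:
            hits.append((other_id, r))
    return sorted(hits, key=lambda hit: -abs(hit[1]))
```

Because the absolute value is compared with the threshold, strongly anti-correlated indices are reported as neighbors as well, which matches the signed coefficients listed in the cross-link record.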

AAindex2

The AAindex2 section currently contains 66 amino acid mutation matrices: 47 symmetric matrices and 19 non-symmetric matrices. A sample entry of AAindex2 is shown in Figure 2. The format of the entry is almost the same as that of AAindex1 except that it contains 210 numerical values (20 diagonal and 20 × 19/2 off-diagonal elements) for a symmetric matrix and 400 or more numerical values for a non-symmetric matrix (some matrices include a gap or distinguish two states of cysteine).
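The count of 210 values follows from storing only one triangle of a symmetric 20 × 20 matrix, including the diagonal: 20 + 20 × 19/2 = 210. The Python sketch below shows one way to pack and address such a triangle. The row-by-row lower-triangular layout and the amino acid order in `AMINO_ACIDS` are assumptions made for illustration; the actual element order should be taken from the database documentation file.

```python
# Assumed residue order, for illustration only.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def packed_size(n):
    """Number of stored values for a symmetric n x n matrix."""
    return n * (n + 1) // 2


def packed_index(i, j):
    """Position of element (i, j) in a row-by-row lower triangle."""
    if i < j:
        i, j = j, i  # symmetry: (i, j) and (j, i) share one slot
    return i * (i + 1) // 2 + j


def pack_symmetric(matrix):
    """Flatten the lower triangle (diagonal included) row by row."""
    return [matrix[i][j] for i in range(len(matrix)) for j in range(i + 1)]


def lookup(packed, a, b, alphabet=AMINO_ACIDS):
    """Score for the residue pair (a, b) in a packed symmetric matrix."""
    return packed[packed_index(alphabet.index(a), alphabet.index(b))]
```

A non-symmetric matrix cannot be packed this way and needs all 400 values (or more, when extra symbols such as a gap are included).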

Figure 1

An example of the amino acid index entry in the AAindex database (AAindex1). Each record of an entry is identified by the one-letter codes: H, accession number; D, definition of the entry; R, LITDB (3) literature database identifier; A, author(s); T, title of the journal article; J, journal citation information; C, accession numbers of similar entries with the correlation coefficients of 0.8 (−0.8) or more (less); I, actual data in the specified order; and *, optional comments.
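A record layout keyed by one-letter codes is straightforward to read programmatically. The following Python sketch groups the lines of a single entry by record code. Several details are assumptions not stated in the caption: that the code is the first character of the line, that continuation lines start with whitespace, that an entry ends with a line beginning with `//`, and that the C record alternates accession numbers and coefficients. The function names are hypothetical.

```python
def parse_entry(text):
    """Group the lines of one flat-file entry by one-letter record code.

    Returns a dict mapping each code (e.g. "H", "D", "C", "I", "*") to a
    list of record strings, with continuation lines joined by a space.
    """
    records = {}
    code = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("//"):
            break  # assumed end-of-entry marker
        if line[0].isspace():
            # Assumed continuation of the most recent record.
            if code is not None:
                records[code][-1] += " " + line.strip()
        else:
            code = line[0]
            records.setdefault(code, []).append(line[1:].strip())
    return records


def parse_neighbors(c_record):
    """Turn 'ACC1 0.91 ACC2 -0.85' into {'ACC1': 0.91, 'ACC2': -0.85}."""
    tokens = c_record.split()
    return dict(zip(tokens[0::2], (float(t) for t in tokens[1::2])))
```

The sketch deliberately leaves the I record as text, since converting it to 20 numbers depends on the column order given in the entry itself.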

Availability

The AAindex database can be retrieved through the DBGET/LinkDB system (5) of the Japanese GenomeNet service (6) at http://www.genome.ad.jp/dbget/

The DBGET/LinkDB system integrates most of the major molecular biology databases and is especially suited for using hyperlinks to related entries within the AAindex database as well as to the other databases.

Alternatively, the entire database may be copied and used locally. The URL for anonymous FTP is: ftp://ftp.genome.ad.jp/db/genomenet/aaindex/ Users are requested to cite this article when making use of the AAindex database.

Figure 2

An example of the amino acid mutation matrix entry in the AAindex database (AAindex2). The data format is the same as described in Figure 1. The order of the matrix elements may be computed by the equation or examined in the database documentation file.

Acknowledgements

We thank Drs Kenta Nakai and Kentaro Tomii for the initial developments of the AAindex database. This work was supported in part by the Grant-in-Aid for Scientific Research on the Priority Area ‘Genome Science’ from the Ministry of Education, Science, Sports and Culture of Japan. The computation time was provided by the Supercomputer Laboratory, Institute for Chemical Research, Kyoto University.

References

1. Nakai K., Kidera A., Kanehisa M. Protein Engng. 1988, vol. 2 (pg. 93-100)
2. Tomii K., Kanehisa M. Protein Engng. 1996, vol. 9 (pg. 27-36)
3. Seto Y., Ihara S., Kohtsuki S., Ooi T., Sakakibara S. In Lesk A.M. (ed.), Computational Molecular Biology. 1988, New York: Oxford University Press (pg. 27-37)
4. Kidera A., Konishi Y., Oka M., Ooi T., Scheraga H.A. J. Protein Chem. 1985, vol. 4 (pg. 23-55)
5. Fujibuchi W., Goto S., Migimatsu H., Uchiyama I., Ogiwara A., Akiyama Y., Kanehisa M. Pacific Symp. Biocomput. 1998 (pg. 683-694)
6. Kanehisa M. Trends Biochem. Sci. 1997, vol. 22 (pg. 442-444)
