Glycation is a nonenzymatic process in which proteins react with reducing sugar molecules and thereby impair the function and change the characteristics of the proteins. Glycation is involved in diabetes and aging where the accumulation of glycation products causes side effects. In this study, we statistically investigate the glycation of ε amino groups of lysines and also train a sequence-based predictor. The statistical analysis suggests that acidic amino acids, mainly glutamate, and lysine residues catalyze the glycation of nearby lysines. The catalytic acidic amino acids are found mainly C-terminally from the glycation site, whereas the basic lysine residues are found mainly N-terminally. The predictor was made by combining 60 artificial neural networks in a balloting procedure. The cross-validated Matthews correlation coefficient for the predictor is 0.58, which is quite impressive given the relatively small amount of experimental data available. The method is made available at www.cbs.dtu.dk/services/NetGlycate-1.0.
Glycation is a nonenzymatic process in which proteins react with reducing sugar molecules. The reaction occurs between the α amino group of the N-terminal amino acid or the ε amino groups of lysines and the aldehyde or keto group of the reducing sugar (Bunn et al., 1979; Zhao et al., 1996). The first step of a glycation reaction is the formation of an unstable Schiff base, which rearranges to form a more stable Amadori product. The Amadori product can then react further to form irreversible cross-linked products, the so-called advanced glycation end products (AGEs). Besides lysine residues, AGEs have also been found for the side chains of arginines and histidines (Bidasee et al., 2004).
In human peripheral blood lymphocytes, AGE formation has been shown to target only a restricted set of proteins (Poggioli et al., 2002, 2004). AGEs accumulate and cause deleterious effects, and this accumulation is likely to be more important for proteins with a long biological half-life such as collagen and crystallins (Reiser et al., 1992; Abraham et al., 1994; Casey et al., 1995; Paul and Bailey, 1996; Zhao et al., 1996) and is, for example, associated with the pathogenesis of aging and complications of diabetes. Other human diseases where AGEs are implicated are atherosclerosis, amyloidosis, Alzheimer’s disease, Pick’s disease, Parkinson’s disease, Lewy body disease, and actinic elastosis (Ling et al., 1998). It is mostly AGEs and not the initial glycation products that cause the damaging effects (Paul and Bailey, 1996).
Glycation in vivo is not confined to a single cellular compartment. The presence of AGEs has thus been demonstrated in the nuclei, nuclear envelope, mitochondria, endoplasmic reticula, Golgi complexes, endocytic vesicles, lysosomal vacuoles or granules, secretory granules, cytosol, and cell membranes, as well as in the extracellular matrix of human cells (Ling et al., 1998). Additionally, the in vivo glycation of ε amino groups of lysines has been found to take place for both intracellular and extracellular proteins (Garlick and Mazer, 1983; Shilton and Walton, 1991).
The in vivo glycation process follows different routes intracellularly and extracellularly. In the extracellular process, the dominant route to AGE is the Schiff base reaction followed by the Amadori rearrangement to the Amadori product. The extracellular process has been found to depend on hyperglycemia (Verbeke et al., 2000), that is, raised blood glucose level. The intracellular process is more complex because glucose metabolites also react to form AGEs. Important glucose metabolites leading to AGEs are glyoxal, methylglyoxal, and 3-deoxyglucosone (Verbeke et al., 2000). The Schiff base reaction followed by the Amadori rearrangement is thus not the main pathway for the intracellular formation of AGEs under normal conditions (Verbeke et al., 2000). The involvement of glycation in the above-mentioned diseases sparks a great medical interest in preventing the glycation of proteins in humans. To prevent the glycation of proteins, it is important to gain knowledge about the mechanism of the reaction.
In general, amino groups with lower pKa values should be expected to be more reactive toward glycation because of their greater nucleophilicity (Bunn et al., 1979). However, other factors appear to be more important (Shapiro et al., 1980; Baynes et al., 1989), and the site specificity of glycation is suggested to be attributed to the Amadori rearrangement step (Shilton et al., 1993). For lysines, it has been suggested that the properties of nearby amino acids play a role in determining whether a given lysine is glycated or not. Positively charged amino acids situated close to a lysine in either the primary or the three-dimensional (3D) structure have been proposed to catalyze the glycation of that lysine (Iberg and Flückiger, 1986). For example, the presence of a histidine residue close to a lysine in either the primary or the 3D structure has been reported to promote the glycation of lysines (Baynes et al., 1989; Shilton and Walton, 1991; Shilton et al., 1993; Acosta et al., 2000). Furthermore, the catalytic power of histidines has been suggested to be mediated by threonines (Shilton et al., 1993). The proximity of a lysine to another lysine in the primary structure has also been reported to promote glycation (Baynes et al., 1989).
It has been proposed that the formation of hydrogen bonds between a lysine residue and another amino acid could partially shield the lysine from being glycated (Baynes et al., 1989). It has also been suggested that the proximity of acidic amino acids to lysines in either the primary or the 3D structure catalyzes the Amadori rearrangement. This makes the lysine more reactive toward glycation (Baynes et al., 1989). Model studies have indeed shown that aspartate residues situated proximal in the 3D structure to a lysine can function as catalysts for the glycation of that lysine. The same appears to be the case for arginines situated proximal in the 3D structure to lysines (Venkatraman et al., 2001). Some authors, though, claim that carboxyl groups at neutral pH or physiological pH are mainly dissociated and thus unlikely to function as acid–base catalysts for glycation (Iberg and Flückiger, 1986; Shilton and Walton, 1991).
The presence of biological buffers such as phosphate and bicarbonate can affect the rate and specificity of glycation (Baynes et al., 1989; Shilton et al., 1993). This should be taken into account when in vivo and in vitro data are compared.
In this study, we focus on the in vivo glycation of lysine ε amino groups. Amino acids flanking glycated lysines were investigated thoroughly, and a site-specific predictor was constructed using machine learning algorithms. Such algorithms have found practical use in computational molecular biology where they are used for sequence analysis and pattern recognition because of their ability to model sequence correlations (Baldi and Brunak, 2001). Success stories include the prediction of signal peptides and their cleavage sites (Nielsen et al., 1997; Bendtsen et al., 2004). This study is the first to perform an extensive investigation of lysine glycation, and it constructs a predictor that is made available via the internet.
Generation of data set
The site-specific glycation data were obtained by screening hundreds of papers manually, preceded by a text mining step. After inspecting 400 papers manually, we found in all 89 glycated lysines and 126 nonglycated lysines in 20 experimentally examined proteins.
The context of the glycated lysines was visualized by a sequence logo (Schneider and Stephens, 1990). The sequence logo displays which amino acids are characteristic in the vicinity of the glycated lysines in the sequence (Figure 1). At position –6, with respect to the glycated lysine, valine (V), leucine (L), glutamate (E), and lysine (K) are found to be overrepresented in the descending order of strength. Lysine is also found near the top at positions –20, –13, –10, –7, –1, and +24, where arginine (R) is found at the top of position –13. Lysines and arginines thus appear to be more abundant N-terminally than C-terminally from the glycation site. Glutamate is found at the top of positions –5, +4, and +5, and aspartate (D) is found at the top of position +24. This suggests that there is a small overrepresentation of charged amino acids close to the glycation site. However, no signal is found for histidine (H). Lysine and aspartate found at the top of position +24 seem to be too far away from the glycation site to be significant. The same is probably also true for lysine at position –20.
Amino acid composition around the glycation site
To better evaluate the proposed influence of various amino acid residues on the glycation reaction, we examined more closely the propensity of a few selected residues around the glycation site (Figure 2).
Our analysis clearly shows that histidines are not particularly overrepresented in the proximity (with respect to the primary structure) of the site of glycation for the sites examined in this work. On the contrary, they actually seem less abundant in the region 20 residues N-terminally to the glycation sites, an effect which could be caused by structural issues. The same could be said of lysine and arginine although the trend is less obvious for the latter. We can confirm that lysines and arginines are slightly favored just N-terminally of the glycation site, but arginine and especially lysine are strongly disfavored C-terminally to the site. Although basic residues are rare C-terminally, acidic residues (glutamate and aspartate) are, however, prevalent in this region. In fact, there are 48% more acidic residues just following the site of glycation than N-terminally (positions –5 to –4 compared with positions +4 to +5). The logo for the positive sites shows that there is some overrepresentation of glutamate at positions –6 and –5, but as the propensity analysis shows there are overall more acidic amino acids C-terminally from the glycation site (because the propensity analysis is averaging over positions, the overrepresentation of glutamate at positions –6 and –5 in the logo for the positive sites is not visible in Figure 2).
Position of lysines in the sequence
We investigated the relative position of lysines in the complete sequences used in this study, and a histogram of the length of the proteins in this study was made (Figure 3a). This histogram reveals that the length of most proteins falls within the range of 100–350 residues (as proteins in general). A division of the proteins into N-terminal, middle, and C-terminal parts made it possible to investigate the distribution of the relative position of the glycated lysines as compared with that of all lysines (Figure 3b). We found that the distribution of the relative position of the glycated lysines deviated from that of all lysines. Glycated lysines as compared with all lysines were found to be more abundant in the N-terminal and middle parts than in the C-terminal part.
Prediction by neural networks
We trained network algorithms using sequence input alone and with the additional input of the relative position of the lysine in the sequence due to the trends illustrated in (Figure 3b). The best results were indeed obtained for the input that included the relative position. All network combinations for window sizes of 3–51 amino acid residues and 0, 2, 3, ..., 20 hidden neurons were tried. The highest correlation coefficient obtained was 0.39 for a window of 23 and 11 hidden neurons. The overall correlation coefficient decreased sharply from a window of 3 to a window of 7 and then increased sharply from a window of 9 (data not shown). It was therefore hypothesized that noise in the middle of the window was confusing to the network during training. To remove that noise from the model, we decided to make holes in the middle of window beginning with a hole that ranged from positions –3 to +3. The gradual expansion of the hole in the window was then continued until the hole ranged from positions –5 to +5 where all network combinations for window sizes 13–51 were tried. A maximal correlation coefficient of 0.43 was found for a window of 23 with a hole at positions –3 to +3 and 5 hidden neurons (Figure 4).
The window size is the same as for the best network without a hole. Some dependency of the correlation coefficient on the number of hidden neurons was observed, indicating that the prediction of glycation sites is a nonlinear problem, where correlations between different amino acid positions are important. For comparison, the highest correlation coefficient for sequence input alone was 0.40 for a window of 17 with hole from positions –3 to +3.
On the basis of the above experiments, we decided to train networks with an architecture that had a window size of 23 with hole from positions –3 to +3 and 5 hidden neurons. Different networks were trained, each initialized with different random seeds to further explore the parameter space. The networks were then combined to produce a single predictor by a balloting procedure described in the Materials and Methods section. This strategy was pursued under the assumption that networks initialized with different seeds could learn to recognize slightly different types of sites. We observed that the highest correlation coefficient of 0.58 was obtained for combining 20 different predictions and that the correlation coefficient did not increase when 30 or 50 different seeds were used. Instead the correlation coefficient decreased slightly for 50 seeds. The predictor with 20 seeds was thus the best predictor in this study. The number of true-positive, false-positive, true-negative, and false-negative classifications for this predictor are 70, 25, 101, and 19, respectively. The receiver operating characteristic (ROC) curve for the predictor is shown in Figure 5. The ROC curve is seen to deviate from the diagonal meaning that the predictor has significant predictive capability. Similarly, the area under the ROC curve has a value of 0.77, which deviates significantly from 0.5, representing random guessing.
Prediction of glycation sites in intestinal alkaline phospha-tase
Our prediction method can be used by experimentalists to select relevant proteins for the investigation of glycation sites. The areas of interest could be, for example, glycation in the active sites of enzymes or other positions in a protein that might disturb protein function. An example of a protein to investigate for glycation sites is bovine intestinal alkaline phosphatase (ALP). ALP is a metalloenzyme that is bound to the cell membrane by a covalent linkage with a glycosyl-phosphatidyl-inositol (GPI) anchor, but the physiological function of ALP is unknown. ALP has been shown to be glycated in vitro, but the specific glycation sites are not known (McCarthy et al., 1998). Furthermore, the function of ALP has been shown to be inhibited by glycation. We predicted the following glycation sites: K-75, K-100, K-141, K-247, K-303, and K-400. ALP contains 21 lysines, of which 6 are thus predicted to be glycated, and the sequence with glycation sites is shown in Figure 6. Interestingly, the active site has been determined to depend on residue 111 (Culp et al., 1985; Weissig et al., 1993). K-100 is close to the active site in the primary structure, and our results therefore suggest that the glycation of K-100 could result in the inhibition of ALP. In this connection, it should be noted that there is a structure of the human placental form (1ew2.pdb) and that this structure shows that the residue equivalent to K-100 in intestinal alkaline phosphatase (K-103) is slightly over 20 Å from the residue equivalent to S-111 (S-114) and is pointing in the opposite direction. The glycation of other lysines could thus be responsible for the inhibition of the enzyme; however, we point out that 20 Å is not a very long distance when it is taken into account that the glycated lysine only has to block the access to the substrate to inhibit the enzyme. The fact that the glycation of other lysines also could result in the inhibition of the enzyme just emphasizes the need for a good predictor. Similar situations arise for glycosyltransferases and O-glycosylation where glycosylation of one residue affects the potential (and order) of glycosylation of nearby residues (Julenius et al., 2005).
Prediction of glycation sites in cysteineless recombinant human interferon-γ
We applied our prediction method to cysteineless recombinant human interferon-γ and used the sequence for IFNG_HUMAN P01579 (UniProt ID and accession number, respectively). Mironova and others (2003) suggested that K-109 and maybe K-131 become glycated in cysteineless recombinant human interferon-γ. Our predictor predicts both these lysines to be glycated.
Analysis of the human proteome
We also used the prediction method to predict the glycation potential for all proteins in the human proteome (Figure 7). The glycation potential was calculated as the number of lysines predicted to be glycated divided by the total number of lysines. The proteins of the human proteome were divided into different Gene Ontology categories (GOs) selected so that they covered most of the human proteome and contained the most relevant biological cellular component and molecular function GOs. The molecular function GOs were selected based on the hypothesis that if their proteins were glycated, it would disturb cellular function. The reason for choosing the selected biological cellular components was that the glycation potential of a protein might depend on its cellular location. Proteins located inside the cell are subject to different glycating agents, for example, glucose metabolites, than those located outside the cell where the main glycating agent is glucose (Verbeke et al., 2000). The most glycated GO category based on the overall glycation potential for entire proteins is GO:0005794 Golgi apparatus, which has a glycation potential of 0.392. The lowest overall glycation potential is found in the GO GO:0005783 endoplasmic reticulum, which has a glycation potential of 0.353.
We also listed all GOs for all proteins in the human proteome from the lowest ranked GO to the highest ranked GO. Then, the GOs were ranked based on the average glycation potential for all proteins in the given GO. The glycation potential for the 10 most and the 10 least glycated GOs that contain 100 or more proteins is shown in Figure 8. Many important biological processes and molecular functions are observed from Figure 8 to be among the most glycated GOs—for example, the most glycated GO GO:0045859 regulation of protein kinase activity could be affected in diseases where glycation is the disease-causing factor. Three GOs referring to the extracellular matrix are among the least glycated (however, the proteins in GO:0031012 extracellular matrix and GO:0005578 extracellular matrix [sensu Metazoa] are almost identical).
Our analysis shows that the prediction of glycation sites is indeed a complex problem, but we manage to obtain a reasonable correlation because we are able to produce a predictor with a cross-validated Matthews correlation coefficient of 0.58. The difficult nature of this particular post-translational modification (PTM) compared with that of other PTMs may be attributed to its nonenzymatic character—enzyme involvement presumably offers a more specific reaction and often has a more biased sequence motif.
When interpreting the logo for the positive sites, it does not contain strong position-specific signals at any position (Figure 1). A reason for the lack of a position-specific signal in the logo could be that the 3D space interactions between lysines and other amino acids situated far away in the primary structure also play a role in determining whether a given lysine is glycated or not. This type of correlation can to some extent be captured by a neural network.
In the literature, it is suggested that glycated lysines are situated close to positively charged amino acids in either the primary or the 3D structure (Iberg and Flückiger, 1986). Among the positively charged amino acids, histidine has been reported to promote the glycation of lysines (Baynes et al., 1989; Shilton and Walton, 1991; Shilton et al., 1993; Acosta et al., 2000). We did not find an overrepresentation of histidine at any position in the sequence logo for the positive sites (Figure 1), suggesting that histidine only plays a role in catalyzing the glycation of lysines through 3D space interactions. The propensity analysis confirmed this conclusion. The catalytic power of histidines has been suggested to be mediated by threonines (Shilton et al., 1993), but we did not find any overrepresentation of threonines at any of the positions containing signal in the sequence logo. This finding suggests that if threonines mediate the catalytic power of histidines, it must take place through long-range interactions (Figure 1).
The sequence logo and the propensity analysis both suggest that arginine residues do not play an important role in determining whether a given lysine is glycated or not. In agreement, however, with the literature (Baynes et al., 1989), our results show that proximity in the primary structure of a lysine to another lysine could promote the glycation of that lysine. The literature also suggests that the proximity of acidic amino acids to lysines in either the primary or the 3D structure promotes the glycation of that lysine (Baynes et al., 1989; Venkatraman et al., 2001), and our results support that notion.
In our study, both in vivo and in vitro data have been used, but the in vitro incubations of proteins with glycating sugars were performed in aqueous solution close to neutral pH (however, for some glycation studies, the pH was not given, but the experimental conditions seem to imply that pH was close to neutral [Abraham et al., 1994; Zhao et al., 1996; Swamy-Mruthinti and Schey, 1997; Acosta et al., 2000]). Our results therefore suggest that acidic amino acids promote the glycation of nearby lysines in the primary structure at neutral or physiological pH. However, some authors do not support this notion (Iberg and Flückiger, 1986; Shilton and Walton, 1991).
In summary, basic residues seem to be favored just N-terminally to the site of glycation and, at the same time, acidic residues are abundant just C-terminally to the site of glycation. This suggests that an electrochemical skew somehow promotes glycation. From the training profile of the neural network, we conclude that the most relevant information about the likelihood of glycation is contained within 11 residues to either side of the site.
A possible structural explanation for the fact that the information between positions –3 and +3 contain noise could be that these positions in, for example, an α-helix are not as close to position 0 as the amino acids at positions –4 and +4. This has actually been studied experimentally in relation to glycation and described in the study by Venkatraman and others (2001).
It should be recognized that the motif for glycation is based on the 3D structure. However, we also expect a primary structure motif to be present, which is, for example, also the case for the prediction of O-glycosylation sites (Julenius et al., 2005) and phosphorylation sites (Blom et al., 1999). Another point to make is that the 3D structures are not available for all proteins meaning that a proteome-wide predictor has to be based on the primary structure.
The use of the relative position of the investigated lysine as additional input improves the performance of the predictor. An explanation could be that the N-terminal part of globular proteins is often more surface exposed than the other parts of the proteins, thus making this part of the protein more available to glycation. Our analysis of the human proteome in fact shows that the N-terminal parts of the proteins have a higher predicted glycation potential than the other parts of the proteins.
The analysis of the glycation potential of the human proteome further revealed that the proteins associated with the extracellular matrix could be less glycated than proteins belonging to other biological cellular component GOs. It could be suggested that the reason for the low degree of glycation associated with the extracellular matrix is a result of an evolutionary pressure disfavoring highly glycated proteins. This hypothesis is further strengthened by the fact that extracellular matrix proteins, such as collagen, have a long biological half-life (Verzijl et al., 2000), thus exposing them to glycation for a longer time and increasing the pressure not to be glycated. In this connection, it should be noted that the glycation of collagen is known to be associated with the late complications of aging and diabetes (Paul and Bailey, 1996). The fact that the Golgi apparatus was found to have a higher glycation potential than the endoplasmic reticulum suggests that there is a higher evolutionary pressure in the endoplasmic reticulum than in the Golgi apparatus disfavoring highly glycated proteins.
The analysis of the glycation potential of the human proteome also revealed that many important biological processes and molecular function GOs have a high glycation potential. These could be associated with diseases. There does not appear to be an evolutionary pressure disfavoring highly glycated proteins in these GOs and a possible explanation could be the lack of available glucose or other glycating agents in the natural environment of the proteins in question.
The data set used in this study is relatively small. More glycation data in the form of either in vitro data obtained at physiological conditions or preferably in vivo data obtained from diabetic subjects could definitely make it possible to improve the predictor.
Materials and Methods
Generation of data set
Glycation data were obtained through a thorough literature scan. The resulting data set consists of 20 proteins with 89 glycated lysines and 126 nonglycated lysines. Only experimentally verified glycation sites were used. All sequences were extracted from the UniProt database (Bairoch et al., 2005). It was decided to mask out lysines in pro- and signal peptides because these parts of the proteins are cleaved off during the maturation of the proteins and are thus not available for glycation. A detailed description of the data set and how it is generated can be found at www.cbs.dtu.dk/databases/GlycateBase-1.0.
Both in vitro and in vivo data were used to make the data set as large as possible. It was, however, decided to only include in vitro data that were obtained at conditions that resemble physiological conditions. The glycated proteins used in this study are all mammalian.
A standard feed forward architecture was used for the neural networks (Baldi and Brunak, 2001). The networks thus comprised an input layer connected to a hidden layer that sends its output to the output neurons.
The Matthews correlation coefficient, C, was used to measure the predictive performance of the networks during learning and test (Matthews, 1975) where tp is the number of correctly predicted positive sites (true positives), tn the number of correctly predicted negative sites (true negatives), fp the number of sites falsely predicted to be positive (false positives), and fn the number of sites falsely predicted to be negative (false negatives).
The back-propagation algorithm (Rumelhart et al., 1986) was used for training the networks. Cross-validation was used when evaluating the networks. In the cross-validation procedure, the data set is divided into n parts. One part is then selected for testing of the performance of the network, whereas the other n – 1 parts are used for training the network. This procedure is repeated n times, thus using each of the n parts as a test set. The result is thus n networks that can be combined in different ways to make a predictor. The cross-validated Matthews correlation coefficient is then calculated by adding all true positives, false negatives, true negatives, and false positives separately for the n test sets and calculating the corresponding Matthews correlation coefficient for these summed values. Networks were trained with sequence input alone or together with the relative sequence position.
For some networks, the relative position of the lysine in the sequence was used as an additional input to sequence input. More specifically, two extra input neurons were added to the networks. One of these extra input neurons received the value X/L, where X is the position of the lysine in the sequence and L the length of the protein. The other extra input neuron received the value 1 – X/L.
Networks were combined using a balloting procedure. The networks in this study have two output neurons. By calculating the difference between the positive and the negative output neurons, a measure of how well the network is at separating the input data is obtained. When combining networks, the prediction made by the network that has the numerically largest difference between the positive and negative output neurons is used, thereby selecting the network that is best at separating the input data. This method is a simplified version of the balloting procedure described in the study by Petersen and others (2000).
Analysis of the human proteome
The human proteome was divided into some selected GOs that to some extent overlap, and an average glycation potential was calculated for all proteins in each GO. The glycation potential was calculated as the number of predicted glycation sites in each protein divided by the number of lysines. The proteins were also divided into three parts: the N-terminal, the middle, and the C-terminal parts. The average glycation potential was then calculated for each protein part. When the length of the protein was ≥200 residues, the N-terminal and C-terminal parts of the protein were assigned the first and last 70 residues, respectively, whereas the rest of the protein was assigned to the middle part. When the protein was <200 residues, the protein was simply divided into three parts of equal size. The transmembrane parts or propeptides of the proteins were not removed from the proteins before prediction; however, the signal peptides predicted by the SignalP 3.0 Server (Bendtsen et al., 2004) were removed before prediction.
The human proteome was further analyzed by considering all possible GOs. The 10 most and the 10 least predicted glycated GOs that contain ≥100 proteins were then listed. The reason for only selecting GOs with ≥100 proteins is that a significantly large number of proteins have to be present in the given GO to calculate a reliable average glycation potential.
Conflict of interest statement
We thank Kristoffer Rapacki, Peter Wad Sackett, Hans Henrik Stærfeldt, and Thomas Nordahl Petersen for competent computer assistance. This work is supported by a grant from Danish Center for Scientific Computing and the EU FP6 BioSapiens Network of Excellence.