## Abstract

A remarkable inherent feature of cellular metabolism is that the concentrations of a small but significant number of metabolites are strongly correlated when biological replicates are measured. This review summarizes recent efforts to elucidate the origin of these observed correlations and points out several aspects concerning their interpretation. It is argued that correlations between metabolites differ profoundly from their transcriptomic and proteomic counterparts, and that a straightforward interpretation in terms of the underlying biochemical pathways will unavoidably fail. It is demonstrated that a comparative analysis of correlations offers a way to exploit the observed correlations to obtain additional information about the physiological state of the system.

## INTRODUCTION

Metabolomic measurements provide a wealth of information about the biochemical status of cells, tissues or organisms and play an increasing role in elucidating the function of unknown and novel genes [1–5]. Interpretation of these data relies crucially on computational approaches to large-scale data analysis and data visualization, such as principal and independent component analysis [6], multidimensional scaling [7], a variety of clustering techniques [2] and discriminant function analysis, among many others [1, 8]. Common to most of these methods is that they build upon interdependencies between metabolites, i.e. relationships between the concentrations of metabolites, as expressed by a covariance or a correlation matrix.

Indeed, a remarkable feature of metabolomic data is that a small but significant number of metabolite levels are highly interrelated when repeated measurements are performed [9, 10]. Contrary to a naïve interpretation, these correlations do not necessarily occur between metabolites that are neighbours on the metabolic map; instead, other, more subtle mechanisms are involved [11–13]. This review discusses the underlying mechanisms that give rise to an observed pattern of correlations and points out some aspects of their interpretation. It is argued that correlations, or more generally ‘associations’, between observed metabolite levels differ profoundly from their transcriptomic and proteomic counterparts. In the latter case, the concentrations are mainly governed by a network of regulatory interactions, whereas metabolites are synthesized from other metabolites via a network of biochemical reactions. As emphasized previously [11], this results in a level of interdependence between their concentrations that does not exist for transcripts and proteins. While the difference in the underlying causality does not hamper the application of most algorithms as heuristic research tools, for example to discriminate genotypes based on their metabolite profiles, a genuine interpretation of the data has to take this inherent difference into account and, eventually, make use of it.

The review is organized as follows: in the first section, we summarize some quantitative measures of correlations between metabolite levels. Subsequently, a variety of possible scenarios that give rise to an observed pattern of correlations are discussed and specific causes for the correlations between measured metabolite concentrations are identified. The next section proceeds towards an interpretation of metabolite correlations as a characteristic fingerprint of the underlying system. Subsequently, we discuss the utilization of correlation networks in large-scale data analysis and point out several drawbacks and pitfalls in their interpretation. In the last section, alternative approaches to metabolomic data analysis are given and the key points are summarized.

## MEASURES OF ASSOCIATIONS BETWEEN METABOLITES

Several measures to quantify the association between metabolite levels have been suggested in the literature, each with its own merits and drawbacks. The most common choices are summarized in Figure 1. Though in the following we will mainly refer to the usual Pearson correlation coefficient, the term ‘correlation’ is understood here in a rather general sense, implying a statistically significant relationship between two metabolites.
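As a minimal sketch of how such measures differ, the following Python snippet computes the Pearson coefficient and a rank-based (Spearman) coefficient for two metabolites across replicates. Pearson is the measure used throughout this review; the rank-based variant is included here as an assumption about a common alternative. The function names and the toy concentrations are hypothetical, and ties are not handled.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank of each value (1 = smallest); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Rank correlation: Pearson coefficient of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical concentrations of two metabolites in six replicates.
met_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
met_b = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0]  # monotonic but non-linear in met_a

r_pearson = pearson(met_a, met_b)
r_spearman = spearman(met_a, met_b)
```

For a monotonic but non-linear relationship such as this one, the rank-based coefficient reaches 1 while the Pearson coefficient stays slightly below it, which illustrates why the choice of measure matters.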

**Figure 1:**

## THE ORIGIN OF CORRELATIONS

An interpretation of the observed correlations is, of course, intimately related to the experimental situation under which the metabolite profiles were obtained. Observing a concomitant change in metabolite concentrations at all presupposes a source of variation for (at least some) metabolites. We can broadly distinguish between three different scenarios. (i) Specific perturbations: in this case, the change in metabolite levels results from a specific and localized intervention within the underlying network of biochemical reactions, typically an over-expression or knockout of a gene coding for an enzyme. (ii) Global perturbations: this includes measurement of transient or diurnal time series, as well as the response to temperature (heat shock) or other environmental changes (stress). In the case of global perturbations, changes are induced at multiple sites within the metabolic network or are brought about by external factors that influence a large number of metabolites simultaneously. (iii) Intrinsic variability: an intriguing feature of metabolomic data is that some metabolites are strongly interrelated even when biological replicates under identical experimental conditions are measured [9, 10]. In this case, changes of metabolite levels do not result from deliberate experimental perturbations or changes of the physiological state, but are induced by the intrinsic variability of cellular metabolism [11–13].

Obviously, metabolite correlations that arise from global perturbations are the hardest to interpret in terms of the underlying biochemical network. For example, given a diurnal time series, all metabolic compounds that show a diurnal variation will inevitably correlate, conveying no, or only little, information about their causal dependency or mutual proximity within the metabolic map. Similarly, a time-dependent transient response to a perturbation may result in no detectable change, an increase or a decrease of a metabolic compound, inevitably producing a large number of ‘correlations’ between metabolites (see Figure 2 for examples). Moreover, metabolism itself is considered to be a rather fast process [15]. Thus, most current measurements of time-dependent metabolite concentrations do not capture the intrinsic timescales of biochemical processes, but focus on slower timescales that are induced by changes in protein expression or circadian regulation.
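The diurnal case can be illustrated with a toy calculation. The sketch below assumes two metabolites that share nothing but a 24 h rhythm (no common reaction, independent noise); the sinusoidal profile, the hourly sampling and all parameter values are hypothetical choices made for illustration only.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


random.seed(5)
hours = range(48)  # two days, sampled hourly (hypothetical design)


def diurnal_profile(mean, amplitude, noise_sd):
    """Metabolite level following a 24 h cycle plus independent noise."""
    return [
        mean
        + amplitude * math.sin(2 * math.pi * t / 24)
        + random.gauss(0.0, noise_sd)
        for t in hours
    ]


# Two metabolites with no biochemical link, only a shared diurnal rhythm.
met_1 = diurnal_profile(mean=1.0, amplitude=0.5, noise_sd=0.1)
met_2 = diurnal_profile(mean=5.0, amplitude=2.0, noise_sd=0.4)

r_diurnal = pearson(met_1, met_2)
```

The resulting coefficient is high although the two profiles were generated independently, so the correlation reflects the common external driver rather than proximity on the metabolic map.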

**Figure 2:**

For specific perturbations, the situation is slightly different. Over-expression of individual enzymes does indeed offer the possibility to infer properties of the metabolic system, as will be discussed in the last section. For the moment, though, we will focus on an interpretation of metabolite correlations in the absence of deliberate experimental perturbations. In this case, the observed correlations are induced by small fluctuations within the metabolic system itself, which then propagate through the system and give rise to a specific pattern of correlations, depending on the physiological state of the system [11–13, 16]. When measuring a population of biological replicates, intrinsic fluctuations may arise due to at least two different mechanisms. First, even under identical experimental conditions, organisms are never actually identical. Inevitable small differences in enzyme concentrations, reflecting differences in gene expression, affect metabolite concentrations and consequently result in interdependencies between metabolites [11]. Second, cellular metabolism is influenced by a number of environmental factors such as light intensity or nutrient supply. Again, small and rapidly changing differences, even in an approximately constant environment, result in changes in metabolite concentrations, which then propagate through the metabolic network and induce a specific pattern of correlations [12, 16].

Common to both cases is that the resulting correlations are a global property of the system, i.e. whether two metabolites correlate or not is a combined result of many, if not all, biochemical reactions, regulatory interactions and the inducing fluctuations that constitute the system. Consistent with experimental observations, both scenarios lead to a small number of characteristic correlations which do not necessarily occur only for neighbours in the underlying metabolic map. A schematic example is given in Figure 3.

**Figure 3:**

Camacho *et al*. [11] identified several distinct mechanisms that result in a high correlation between two metabolites in replicate experiments. These include: (i) chemical equilibrium: two metabolites near chemical equilibrium will show a high correlation, with their concentration ratio approximating the equilibrium constant. (ii) mass conservation: within a moiety-conserved cycle, at least one member should have a negative correlation with another member of the conserved group. (iii) asymmetric control: if one parameter dominates the concentration of two metabolites, intrinsic fluctuations of this parameter result in a high correlation between these two metabolites. (iv) unusually high variance in the expression of a single gene: this is similar to the previous situation, except that the resulting correlation is due not to a high sensitivity towards a particular parameter, but to an unusually high variance of this parameter. In particular, a single enzyme that carries a high variance will induce negative correlations between its substrate and product metabolites.
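Mechanisms (i) and (ii) can be reproduced in a toy calculation. The sketch below is not the model of Camacho *et al*.; it assumes a hypothetical three-step pathway with mass-action kinetics, log-normally fluctuating parameters and analytically solved steady states, plus a two-member conserved moiety. All names and parameter values are illustrative assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def noisy(median, sigma=0.1):
    """Log-normally distributed parameter (hypothetical noise model)."""
    return median * math.exp(random.gauss(0.0, sigma))


random.seed(0)
n_replicates = 200

# (i) Chemical equilibrium: influx v -> A <=> B -> efflux, with a fast
# reversible step v2 = k2 * (A - B / K) and an efflux v3 = k3 * B.
K = 2.0  # equilibrium constant of the reversible step
conc_a, conc_b = [], []
for _ in range(n_replicates):
    v, k2, k3 = noisy(1.0), noisy(100.0), noisy(1.0)
    b = v / k3          # steady state of B: v = k3 * B
    a = v / k2 + b / K  # steady state of A: v = k2 * (A - B / K)
    conc_a.append(a)
    conc_b.append(b)
r_equilibrium = pearson(conc_a, conc_b)

# (ii) Mass conservation: two members of a moiety with a fixed total,
# X + Y = total, whose partitioning fluctuates between replicates.
total = 1.0
conc_x, conc_y = [], []
for _ in range(n_replicates):
    fraction = min(max(random.gauss(0.5, 0.05), 0.0), 1.0)
    conc_x.append(total * fraction)
    conc_y.append(total * (1.0 - fraction))
r_conservation = pearson(conc_x, conc_y)
```

Because the reversible step is much faster than the flux through it, A stays close to B / K in every replicate and the two concentrations correlate strongly, whereas the two members of the conserved pair are perfectly anti-correlated by construction.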

However, it is emphasized that the resulting correlations are still a systemic property of the underlying metabolic network. To actually distinguish the specific mechanisms responsible for an observed correlation does require additional knowledge [11]. Nonetheless, as we will discuss in the subsequent section, correlations that arise from intrinsic fluctuations do provide additional information about the physiological state of the system and represent a promising starting point for data analysis.

## THE INTERPRETATION OF CORRELATIONS

Despite the difficulties in assessing the causal origin of a specific correlation, the primary interest in the analysis of metabolite correlations stems from the fact that the observed pattern indeed provides information about the physiological state of a metabolic system. As already indicated in Figure 2D, a transition to a different physiological state may involve changes not only in the average levels of the measured metabolites, but also in their pair-wise correlations. Likewise, a metabolite which shows no significant change in the average level between two different experimental conditions or genotypes may still show an alteration of its pair-wise correlations with other metabolites. This observation leads to the interpretation of the resulting pattern of correlations as a global fingerprint of the physiological state. In this way, the analysis of correlations exploits the intrinsic variability of a metabolic system to reveal additional features of the state of the system.

The situation is best described by an analogy from physics. Consider a particle at rest in a potential, as depicted in Figure 4. In an ideal noise-free world, repeated measurements (replicates) result in identical values for the position of the particle, impaired only by measurement noise. However, in the presence of internal fluctuations, repeated measurements yield a characteristic distribution. This distribution is determined by the shape of the potential, as well as the nature of the intrinsic fluctuations. Changes in the system thus result in concomitant changes of the observed distribution, conveying information that would not be accessible by observation of the average position alone. Along similar lines, the intrinsic fluctuations of a metabolic system induce a characteristic pattern of interdependencies between metabolites, depending on the genetic and experimental background.
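The analogy can be made concrete with a small simulation. The sketch below assumes a harmonic potential and Gaussian white noise (an overdamped Langevin equation integrated with the Euler-Maruyama scheme); the shape of the potential in Figure 4, the function name and all parameter values are assumptions made for illustration.

```python
import math
import random
import statistics


def sample_positions(stiffness, noise=1.0, n_replicates=1000, n_steps=500, dt=0.02):
    """Euler-Maruyama integration of dx = -stiffness * x * dt + noise * dW.

    Every replicate starts at the potential minimum x = 0; the returned
    list holds one final position per replicate.
    """
    positions = []
    for _ in range(n_replicates):
        x = 0.0
        for _ in range(n_steps):
            x += -stiffness * x * dt + noise * math.sqrt(dt) * random.gauss(0.0, 1.0)
        positions.append(x)
    return positions


random.seed(1)
shallow = sample_positions(stiffness=1.0)  # wide potential
steep = sample_positions(stiffness=4.0)    # narrow potential

mean_shallow = statistics.mean(shallow)
mean_steep = statistics.mean(steep)
var_shallow = statistics.pvariance(shallow)
var_steep = statistics.pvariance(steep)
```

Both ensembles have an average position close to zero, so the mean alone cannot distinguish the two potentials, whereas the spread of the replicates does. This is the sense in which fluctuations carry information about the underlying system.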

**Figure 4:**

Of course, the interpretation of metabolite correlations as a global snapshot of the physiological state is only valid for repeated measurements of biological replicates under identical experimental conditions. Such replicate measurements open the way to an analysis of differential correlations, i.e. a systematic comparison of correlations between different states and tissue types. An altered pattern of correlations, in addition to the changes in average metabolite levels, points to changes in the underlying state of the system. On the other hand, and perhaps more importantly, correlations that are preserved across multiple experimental conditions allow the identification of (at least candidates for) rapid equilibrium reactions between metabolites and possible mass conservation relationships. Indeed, in a recent preliminary study [16], comparing four different states and tissue types (*Arabidopsis thaliana* leaf, *Nicotiana tabacum* leaf, *Solanum tuberosum* leaf and tuber), a number of preserved correlations were detected. These include, in addition to the obvious high correlation between glucose-6-phosphate and fructose-6-phosphate observed in almost all studies so far, the metabolite pairs fumarate with malate as well as serine with threonine.

More striking, however, is the existence of reversed correlations, i.e. a situation in which the correlation between two metabolites changes its sign [10, 12, 16]. This points to a marked change in the underlying regulation of the system and possibly reflects the existence of multiple steady states. Indeed, the phenomenon of reversed correlations is also observed in the numerical models of cellular metabolism involving multistationarity and switching between different states [16]. However, other causes of reversed correlations are also conceivable and a more detailed evaluation is still awaited.
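A comparison of this kind can be sketched in a few lines. The snippet below is an illustrative assumption, not the procedure of [16]: it labels a pair as preserved or reversed when the absolute Pearson coefficient exceeds a hypothetical threshold of 0.8 in both states, and it omits any test of statistical significance. The data are synthetic, and `X` and `Y` are placeholder metabolites.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_table(data):
    """Pair-wise correlations for a dict {metabolite: [replicate values]}."""
    return {
        (a, b): pearson(data[a], data[b])
        for a, b in combinations(sorted(data), 2)
    }


def compare_correlations(state_1, state_2, threshold=0.8):
    """Classify pairs that are strongly correlated in both states."""
    r1, r2 = correlation_table(state_1), correlation_table(state_2)
    preserved, reversed_pairs = [], []
    for pair in r1:
        if pair not in r2:
            continue
        if abs(r1[pair]) >= threshold and abs(r2[pair]) >= threshold:
            if r1[pair] * r2[pair] > 0:
                preserved.append(pair)       # same sign in both states
            else:
                reversed_pairs.append(pair)  # sign change between states
    return preserved, reversed_pairs


def make_state(sign, n=30):
    """Synthetic replicates; `sign` sets the direction of the X-Y coupling."""
    g6p = [random.gauss(1.0, 0.2) for _ in range(n)]
    f6p = [0.3 * g + random.gauss(0.0, 0.01) for g in g6p]
    x = [random.gauss(1.0, 0.2) for _ in range(n)]
    y = [1.0 + sign * (v - 1.0) + random.gauss(0.0, 0.02) for v in x]
    return {"G6P": g6p, "F6P": f6p, "X": x, "Y": y}


random.seed(2)
preserved, reversed_pairs = compare_correlations(make_state(+1), make_state(-1))
```

In this synthetic example the G6P and F6P pair is reported as preserved and the X and Y pair as reversed. With real data, the threshold would have to be replaced by a proper significance criterion that accounts for the number of replicates.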

It should be emphasized that a systematic comparison of correlations across multiple experimental states also serves as a crucial test for the validity of the analysis. Assuming that the observed pattern of correlations represents a global fingerprint of the underlying physiological state, vastly different states or tissue types should manifest themselves as different patterns of observed correlations. On the other hand, closely related physiological states should give rise to rather similar patterns of correlations. That is, differences or ‘distance’ in terms of observed correlations should reflect and correspond to differences or ‘distance’ in physiological state, tissue type and experimental condition. The observed correlations should be robust with respect to minor changes in the underlying system, while at the same time they should be sensitive to marked changes in the underlying biochemical system. While the preliminary studies seem to support this view [10, 16], large-scale comparisons of metabolic correlations are still sparsely reported. Thus, a concluding evaluation of the validity and applicability of large-scale metabolomic correlation analysis requires further experimental verification.
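One simple way to quantify such a ‘distance’ is sketched below. The review does not prescribe a metric; the root-mean-square difference of pair-wise correlation coefficients is used here purely as an assumption, and the three correlation tables contain made-up values for hypothetical states and metabolite pairs.

```python
import math


def correlation_distance(r1, r2):
    """Root-mean-square difference over the pairs present in both tables."""
    shared = sorted(set(r1) & set(r2))
    if not shared:
        raise ValueError("the two tables share no metabolite pair")
    return math.sqrt(sum((r1[p] - r2[p]) ** 2 for p in shared) / len(shared))


# Made-up correlation tables for three hypothetical states.
state_1 = {("A", "B"): 0.95, ("C", "D"): 0.80, ("E", "F"): 0.70}
state_2 = {("A", "B"): 0.93, ("C", "D"): 0.85, ("E", "F"): 0.60}   # similar to state_1
state_3 = {("A", "B"): 0.90, ("C", "D"): 0.75, ("E", "F"): -0.65}  # one reversed pair

d_close = correlation_distance(state_1, state_2)
d_far = correlation_distance(state_1, state_3)
```

Under the fingerprint interpretation, closely related states should yield a small distance and a state with a reversed correlation a much larger one, as `d_close` and `d_far` do in this toy case.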

## METABOLOMIC NETWORK ANALYSIS

Metabolomics, by definition, usually involves a large number of measured metabolites, necessitating the use of simple and effective visualization procedures. Among the most basic methods to depict the observed correlations are metabolomic correlation networks, schematically shown in Figure 5. Metabolomic correlation networks represent a coarse-grained view of the observed correlations: all metabolites are arranged in a two-dimensional plane, such that their pair-wise distances approximately reflect their pair-wise correlation, a procedure known as multidimensional scaling that was already used early in the analysis of metabolic data [7]. To create the final network, two metabolites are connected with a ‘link’ if their pair-wise correlation exceeds a given threshold.
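The thresholding step can be sketched as follows. The snippet assumes that the absolute value of the correlation is compared with the threshold, so that strong negative correlations also produce a link; this choice, the function names and the correlation values are assumptions made for illustration. The two-dimensional layout by multidimensional scaling is omitted.

```python
def correlation_network(correlations, threshold=0.8):
    """Adjacency sets linking metabolite pairs with |r| >= threshold."""
    network = {}
    for (a, b), r in correlations.items():
        network.setdefault(a, set())
        network.setdefault(b, set())
        if abs(r) >= threshold:
            network[a].add(b)
            network[b].add(a)
    return network


def count_links(network):
    """Number of undirected links in the network."""
    return sum(len(neighbours) for neighbours in network.values()) // 2


# Made-up pair-wise correlations between four hypothetical metabolites.
correlations = {
    ("A", "B"): 0.95,
    ("A", "C"): 0.85,
    ("B", "C"): 0.40,
    ("C", "D"): -0.90,
}

network = correlation_network(correlations, threshold=0.8)
links_default = count_links(network)
links_none = count_links(correlation_network(correlations, threshold=1.1))
links_all = count_links(correlation_network(correlations, threshold=0.0))
```

Varying `threshold` from above 1 down to 0 takes the same data from a network without any link to one in which every listed pair is linked, which is worth keeping in mind when topological properties of such networks are interpreted.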

**Figure 5:**

Since correlation or association networks of this kind have received widespread interest in many different fields and applications [17–20], some problems concerning their interpretation should be pointed out. Obviously, and as is widely acknowledged, correlation networks should not be confused with the actual causal dependencies within the underlying system. Nonetheless, a large number of studies, including work on metabolomic data [21, 22], have focused on the topological properties of these networks to obtain information about the large-scale organization of the system, mostly revealing a ‘scale-free nature’ of the network. Apart from several minor problems, such as the fact that the topology depends predominantly on the choice of the correlation threshold, so that the resulting networks may range from completely unconnected to fully connected, a few more fundamental questions also arise. Most importantly, a correlation matrix exhibits a number of characteristic features that are inevitably reflected in the topology of the resulting correlation network. For example, strong correlations are to a large degree transitive, i.e. given that a node A shows a high correlation to a node B, as well as to another node C, it must be expected that B and C also correlate. For the usual Pearson correlation, this relationship can be made quantitative, resulting in an inequality for the pair-wise correlations which holds for any triplet of nodes. On the level of correlation networks, this results in a high clustering coefficient, i.e. nodes with a common neighbour also tend to be connected. Thus, rather unsurprisingly, a recent study reports an ‘unexpected assortative feature’, i.e. an over-representation of triangles, for biological correlation networks and concludes that the clustering coefficient is orders of magnitude larger than that of equivalent random networks [17]. A similar reasoning holds for other topological features, such as the observed hierarchies in the network or the analysis of motifs and cliques [22].
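For reference, one standard form of such an inequality is given below. It is a textbook consequence of the positive semidefiniteness of the 3 × 3 correlation matrix of A, B and C, and is stated here as an assumption about the relationship meant above, not necessarily the form used in the cited studies:

$$
r_{AB}\,r_{AC} - \sqrt{\left(1 - r_{AB}^{2}\right)\left(1 - r_{AC}^{2}\right)}
\;\le\; r_{BC} \;\le\;
r_{AB}\,r_{AC} + \sqrt{\left(1 - r_{AB}^{2}\right)\left(1 - r_{AC}^{2}\right)}
$$

For example, $r_{AB} = r_{AC} = 0.9$ gives $r_{BC} \ge 0.81 - 0.19 = 0.62$, so B and C are necessarily positively correlated. For weaker correlations the bound becomes uninformative: $r_{AB} = r_{AC} = 0.5$ only requires $r_{BC} \ge 0.25 - 0.75 = -0.5$.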

In almost all of these cases, the problem arises out of the notion of an ‘equivalent random network’. Comparing the observed properties of correlation networks against those found for randomized networks, as is required to assess their significance, is likely to result in highly ‘significant’ differences. However, a random network, even with preserved degree distribution, is not an appropriate null model for a correlation network. In order to distinguish specific properties of the underlying system from those features that are inherent to a correlation matrix, it is necessary that the randomized network itself is consistent with a correlation network, i.e. it represents a network that is generated from a correlation matrix but lacks the distinctive features of the original network. Otherwise, the statistical significance of any observed feature, such as the clustering coefficient, cannot be assessed in a meaningful way.
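One possible reading of this requirement is sketched below, as an assumption and not as an established procedure. The null network is itself derived from a correlation matrix, here that of independent random ‘metabolites’ with the same number of replicates, and is thresholded to a fixed number of links. A network with links placed uniformly at random is generated alongside for contrast. The sizes, the density-matching rule and the function names are hypothetical.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def network_from_data(data, n_links):
    """Link the n_links pairs of rows in `data` with the largest |r|."""
    pairs = list(combinations(range(len(data)), 2))
    pairs.sort(key=lambda p: abs(pearson(data[p[0]], data[p[1]])), reverse=True)
    return set(pairs[:n_links])


def transitivity(links, n_nodes):
    """Fraction of connected triplets that are closed into triangles."""
    neighbours = [set() for _ in range(n_nodes)]
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    closed = triplets = 0
    for node in range(n_nodes):
        for u, v in combinations(sorted(neighbours[node]), 2):
            triplets += 1
            if v in neighbours[u]:
                closed += 1
    return closed / triplets if triplets else 0.0


random.seed(3)
n_metabolites, n_replicates, n_links = 30, 10, 60

# Null model consistent with a correlation matrix: independent random data.
null_data = [
    [random.gauss(0.0, 1.0) for _ in range(n_replicates)]
    for _ in range(n_metabolites)
]
correlation_null = network_from_data(null_data, n_links)

# Conventional null model: the same number of links placed uniformly at random.
all_pairs = list(combinations(range(n_metabolites), 2))
uniform_null = set(random.sample(all_pairs, n_links))

c_correlation_null = transitivity(correlation_null, n_metabolites)
c_uniform_null = transitivity(uniform_null, n_metabolites)
```

Under this reading, the clustering of an observed correlation network would be judged against `c_correlation_null`, obtained from many such random data sets, and not against `c_uniform_null`.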

One possible way to relate observed properties of the correlation network to the underlying system is to construct artificial metabolic systems, as has previously been done for genetic networks [23], and then to compare the resulting correlation networks with the underlying models. Preliminary results indicate that even rather simplistic artificial metabolic systems are able to reproduce some key properties of observed correlation networks (unpublished data, R.S.). However, as yet, the interpretation of topological features of correlation networks in terms of organizing principles of the underlying system remains elusive.

## CONCLUSIONS AND ALTERNATIVE APPROACHES

According to the view put forward here, observed correlations between metabolite levels obtained from biological replicates represent a promising additional source of information about the state of a metabolic system. However, their interpretation in terms of the underlying biochemical pathways is not straightforward and largely defies an intuitive analysis. In particular, an evaluation based on a ‘guilt-by-association’ principle must unavoidably fail. This puts metabolomic data in marked contrast to other ‘omics’ data, such as gene expression measurements, where an analysis based on ‘guilt-by-association’, though also hampered by some fundamental difficulties, has already proven to be highly successful [24]. In our opinion, this difference is due to the distinct nature of the underlying system: while for transcriptomics, the notion that correlated genes are likely to be involved in similar regulatory processes has some intuitive justification, such reasoning does not hold for cellular metabolism.

Likewise, an analysis of the topological properties of the resulting correlation networks is, as yet, unlikely to reveal properties about the large-scale organization of the underlying system. Though widely popular across several disciplines, no study so far has succeeded in establishing a concise relationship between observed topological features of such a network and the underlying biochemical system. Furthermore, most efforts to investigate the topological properties of correlation networks make no, or only little, reference to the specific experimental conditions under which the measurements were obtained.

As a more straightforward approach, we thus proposed to focus on a systematic comparison of the observed correlations across different experimental conditions and genetic strains [16]. Since it is possible to pinpoint several key mechanisms that lead to a high correlation between two metabolites, preserved correlations across multiple conditions can be expected to identify the invariant features of cellular metabolism. Likewise, though still awaiting experimental verification, changes in correlations will point to the key points at which metabolic regulation has changed. In this way, the comparative analysis of correlations extends and supplements the more common approach of looking for macroscopic changes in metabolite concentrations in response to experimental interventions in the metabolic system. As replicate measurements are already a necessary prerequisite to assess the statistical significance of macroscopic changes in metabolite concentrations, and comparative analysis of correlations requires only a slightly larger number of replicates, the experimental burden of this approach seems acceptable.

A number of further developments are possible. In this review, we have focused solely on the pair-wise correlation between metabolites, while actual relationships, for example rapid equilibrium reactions, may involve more than two metabolites simultaneously. Drawing upon earlier work on the analysis of gene expression data, concepts like the partial correlation [25] may thus help to overcome some of the difficulties related to the simple pair-wise correlation. Again, however, we have to point out that a one-to-one application of concepts derived from transcriptomics, even when already successfully applied in that field, should be treated with caution. If indeed the variability among biological replicates is a consequence of slight differences in gene expression, the respective enzymes constitute a set of hidden variables that severely hampers a straightforward analysis of the partial correlations for metabolomic data. This hindrance also emphasizes the importance of considering cellular metabolism as part of an integrated cellular system, i.e. of complementing metabolomics with quantitative data from transcriptomics and proteomics. Only then can we expect to truly uncover the regulation and design principles that constitute cellular metabolism.
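The idea behind the partial correlation can be illustrated with three variables. The sketch below uses the standard first-order formula and a hypothetical linear chain x → y → z with Gaussian noise. It assumes that all relevant variables are observed, which is precisely the assumption that hidden enzyme levels would violate in real metabolomic data.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(r_xz, r_xy, r_yz):
    """First-order partial correlation of x and z, controlling for y."""
    return (r_xz - r_xy * r_yz) / math.sqrt((1 - r_xy ** 2) * (1 - r_yz ** 2))


random.seed(4)
n = 500

# Hypothetical chain: z depends on x only through the intermediate y.
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [a + random.gauss(0.0, 0.5) for a in x]
z = [b + random.gauss(0.0, 0.5) for b in y]

r_xy, r_yz, r_xz = pearson(x, y), pearson(y, z), pearson(x, z)
r_xz_given_y = partial_correlation(r_xz, r_xy, r_yz)
```

Here x and z show a high pair-wise correlation, but their partial correlation given y is close to zero, which correctly indicates that their association is mediated by y. If y were unobserved, no such correction would be possible.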

Along similar lines, several recent methods that aim to reconstruct cellular systems directly from measurements of specific perturbations [26, 27] could prove at least equally powerful. Here, metabolomics can draw upon the vast body of theory developed in the past decades in the realm of metabolic control analysis [15, 28]. While currently the application of such concepts is restricted to rather small subsystems, further improvements in experimental methodology could allow a systematic assessment of metabolic regulation on a larger scale. In this sense, the analysis of metabolomic data has only just begun.

## KEY POINTS

The concentrations of a small but significant number of metabolites are strongly correlated when repeated measurements are performed.

The origin and interpretation of correlations between observed metabolite levels differ profoundly from their transcriptomic and proteomic counterparts.

It is possible to pinpoint specific mechanisms that give rise to observed correlations between metabolite levels.

A comparison of metabolite correlations across multiple experimental conditions can be expected to identify invariant features of cellular metabolism.

The analysis of observed correlations in terms of correlation networks involves a number of fundamental difficulties with respect to their interpretation.

## References
