## Abstract

**Motivation:** Data analysis for metabolomics suffers from uncertainty because of the noisy measurement technology and the small sample size of experiments. Noise and the small sample size lead to a high probability of false findings. Further, individual compounds have natural variation between samples, which in many cases renders them unreliable as biomarkers. However, the levels of similar compounds are typically highly correlated, which is a phenomenon that we model in this work.

**Results:** We propose a hierarchical Bayesian model for inferring differences between groups of samples more accurately in metabolomic studies, where the observed compounds are collinear. We discover that the method decreases the error of weak and non-existent covariate effects, and thereby reduces false-positive findings. To achieve this, the method makes use of the mass spectral peak data by clustering similar peaks into latent compounds, and by further clustering latent compounds into groups that respond in a coherent way to the experimental covariates. We demonstrate the method with three simulated studies and validate it with a metabolomic benchmark dataset.

**Availability and implementation:** An implementation in R is available at http://research.ics.aalto.fi/mi/software/peakANOVA/.

**Contact:**samuel.kaski@aalto.fi.

## 1 INTRODUCTION

Changes in metabolite concentrations provide insights into disturbances in biological processes that take place in organisms. Changes in the metabolome are informative, especially about nutrition and metabolism (Orešič, 2009), and about the immune system (Kau *et al.*, 2011). Chromatography-coupled mass spectrometry (Plumb *et al.*, 2004) is the standard measurement technology for the untargeted quantification of metabolites and other small molecules.

The spectral data from the measurement device are known to be noisy with various sources of uncertainty (Katajamaa and Orešič, 2007), starting from sample preparation and compound ionization, and ending at peak identification, annotation and summarization. However, the spectra also have structure (Steuer, 2006; Rogers *et al.*, 2009) that is useful for the inference of differences between groups of samples.

Because of the high level of noise, excessive false discovery has been highlighted among the main risks in the analysis of metabolomic data (Broadhurst and Kell, 2006). On the other hand, weak changes are likely to go undetected from observations of individual compounds (Saccenti *et al.*, 2014).

Singular value decomposition (SVD)-based dimensionality reduction techniques, such as analysis of variance (ANOVA)-simultaneous component analysis (ASCA; Smilde *et al.*, 2005), have been proposed to identifying interpretable associations between experimental covariates and multivariate changes in the metabolome. However, as the decomposition in ASCA operates on the covariate effects of the standard ANOVA model, the method does not improve the quantification of the covariate effects compared with the standard model. Further, SVD has been applied to interpreting changes in the variance of the samples in association to the covariates (Jansen *et al.*, 2012), again, building on the standard ANOVA model.

Outside metabolomics, structured ANOVA-type models have been proposed to improve the inference of covariate effects: a Gaussian process-based ANOVA model for spatial data (Kaufman and Sain, 2010) enables the inference of smooth covariate effects for nearby data points, and a dependent Dirichlet process mixture of ANOVA models (De Iorio *et al.*, 2004) can identify substructure in a designed experiment with low-dimensional observations of the outcome.

For metabolomics, a Bayesian clustering model (Suvitaival *et al.*, 2014) has recently been proposed for improving the inference of covariate effects through the integration of multiple same-source spectral peaks. Individual spectral peaks have been argued to be unreliable for the statistical analysis because of their high level of noise. Although the mass spectrometer produces multiple peaks that arise from one compound, there so far are only few methods to integrate these additional observations: Kuhl *et al.* (2012), Rogers *et al.* (2009) and Tikunov *et al.* (2012) used multiple peaks to enhance peak annotation, addressing a major source of error in the analysis of metabolomic data. The recently proposed multipeak model for the inference of covariate effects (Suvitaival *et al.*, 2014) is, to our knowledge, the first systematic approach for using additional peaks in the statistical analysis of intensity data.

In this work, we aim at reducing the risk of false associations between experimental covariates and the observed metabolome. We propose a structured ANOVA-type model that benefits both from the multiple spectral peaks produced by the mass spectrometer and from the collinear structure (Huopaniemi *et al.*, 2009; Steuer, 2006) of metabolomic data. Because of the collinearity that arises from the compounds’ concurrent involvement in biological processes, it is reasonable to model individual compounds as members of coherent groups of compounds. We achieve this by introducing another level of hierarchy to the peak-clustering model (Suvitaival *et al.*, 2014). We show that by not only clustering spectral peaks into latent compounds but in addition by clustering these compounds into coherently responding latent groups, we can detect weak covariate effects in the data more accurately.

## 2 METHODS

The method introduced in this work extends the peak-clustering model (Suvitaival *et al.*, 2014) to reduce the risk of false discoveries in highly collinear metabolomic data. We address the problem of small sample size and low signal-to-noise ratio by introducing stronger structure to the model. This enables us to detect weak covariate effects that are present in multiple compounds.

In this section, we describe the proposed hierarchical Bayesian two-level model. In the equations that follow, we use the indices *i*, *j*, *m*, *k* and *l* to denote the samples, the variables, the first-level clusters, the second-level clusters and the covariate level of the sample, respectively, as

*N*,

*P*,

*M*,

*K*and

*L*are their respective total numbers.

_{a}The observed data are the spectral peaks, indexed by *j*, following their identification in the samples, indexed by *i*. However, the association between the peaks and the compounds, indexed by *m*, is unknown as is the total number, *M*, and the identity of the compounds.

The data on the peaks are organized into two arrays: first, the peak height information is arranged into a $P-by-N$ matrix $X\u2208RP\xd7N$, which after the $log\u2061$-transformation and centering based on the control group, *l* = 1, is real valued with missing values where a peak was not detected. Secondly, the peaks’ pairwise similarity information is arranged into the array $Q\u2208[0,1]N\xd7P\xd7P$. We choose to measure the similarity between two spectral peaks by computing the Pearson correlation between the peaks over a retention time window. This leads to a measure, where peaks *j* and $j\u2032$, if co-occurring within the retention time window in the sample *i*, have a positive similarity value $qi,j,j\u2032$. Similarity values for pairs with a missing peak or a negative correlation coefficient are set to zero and thus are effectively considered as missing values in the model. The peak similarity data enable the model to cluster together adduct and isotope peaks, which have a different mass-to-charge ratio but which appear at a coinciding retention time.

### 2.1 Peak clustering based on chromatographic similarity

We follow the approach detailed by Suvitaival *et al.* (2014) in the peak-clustering stage, presented briefly here for completeness.

In the following equations for inferring the $P-by-M$ clustering matrix **V**, we use the variable $\epsilon j,j\u2032\u2261vj,\xb7vj\u2032,\xb7T\u2208{0,1}$ to indicate whether the peaks *j* and $j\u2032$ are in the same or different clusters (1 or 0, respectively). In the notation, the subset operator ‘·’ indicates that the entire dimension of the array is included—here, all the *M* clusters (columns) of the clustering matrix **V**.

#### 2.1.1 Peak similarity

A pair of peaks from one compound can only occur close by in retention time, whereas a pair of peaks from two different compounds does not have such restriction set by the measurement device. This means that the observed similarity between same-compound peaks is expected to be higher than between different-compound peaks. Thus, the similarity value $qi,j,j\u2032$ between the peaks *j* and $j\u2032$ in a sample *i* is assumed to have been generated by one of the two components, ‘in’ or ‘out’, both of which are a spike-and-slab mixture (Mitchell and Beauchamp, 1988) with a beta distribution as the ‘slab’. These two components are parametrized as

*a*and

*b*, which are set in the way that the same-cluster and different-cluster components favor large and small values of similarity $qi,j,j\u2032$, respectively.

In Equations (1) and (2), missing values are modeled through the ‘spike’ *δ*, which is a Dirac delta function that introduces a point mass at value zero of its argument. Many missing values are expected, as peaks from two different compounds rarely appear at the same time. The prior probability of a missing value is determined by *p*_{0}, which receives a higher value in the different-compound component than in the same-compound component. Different retention time, thus, is strong evidence for assigning the peaks into different clusters.

**Q**, is then computed through a product over all the samples, all the pairs of peaks, and the same-compound and different-compound terms.

#### 2.1.2 Unknown compounds

To accommodate the unknown set of compounds in the data, we set a Dirichlet process prior (Escobar, 1994) for the peak clusters. In this way, we not only can infer the assignments of the peaks into clusters representing the compounds but we can also infer the number of compounds, *M*, in the data, leading to a $P-by-M$ clustering matrix **V**.

In the Dirichlet process for the cluster assignments of the peaks, the probability of the assignment of a peak *j* into an existing cluster *m*,

*j*, which is left out both from the likelihood term and from the count. This is expressed in the equation by the subset operator ‘−

*j*’ that excludes the row

*j*from the clustering matrix

**V**. As an alternative to the existing

*M*clusters, a new cluster is created with the probability

We infer the posterior distribution of the model via Gibbs sampling. Following the sampling, we acquire a point estimate of the distribution of the clustering as the least-squares clustering (Dahl, 2006) relative to the posterior mean clustering. The inferred clustering, **V**, is then used as a preprocessor for the inference of covariate effects, which is discussed next.

### 2.2 Compound clustering for modeling compound correlations

Following the preprocessing, described in Section 2.1, the matrix of observed peak intensities, $X\u2208RP\xd7N$, for the *P* peaks and the *N* samples has been decomposed into *M* clusters through the *P*-by-*M* clustering matrix **V**. Then, each sample *i*,

*j*of the observations matrix

**X**or the single cluster

*m*of the latent variable $Xlat$. Following from Equation (6), the peak intensity data are assumed to have arisen from the clusters representing the

*M*unknown compounds through a generative process with additive noise

**Λ**. Further, we assume that the noise is independent by constraining the $P-by-P$ noise matrix

**Λ**into a diagonal form with entries

*n*

_{0}prior observations.

In this work, we add another level of hierarchy to the model by assuming that the compounds form groups that respond to the experimental covariates in a coherent manner (Fig. 1). We deviate from the formulation of the earlier work and do not let the covariate effects, ** α**, operate directly on the latent variable, $Xlat$, which represents the coherent variation within a cluster of peaks. Instead, we introduce a higher-level latent variable, $Z\u2208RK\xd7N$, to represent the coherent variation within a group of compounds. The compound-specific latent variable,

*K*higher-level clusters, represented by the higher-level latent variable,

**Z**, through the $M-by-K$ clustering matrix

**W**. The residual variation in the levels of a compound, which is not explained by its group, is controlled by the higher-level variance parameter, $\psi 2\u2208R+$, which scales the $M-by-M$ identity matrix

**I**. In this way, the model can refine the information in the noisy observations first at the compound level and then at the compound group level. To infer the compound clustering, we set a uniform multinomial prior for the

*K*higher-level clusters.

With the second level of hierarchy introduced to the model, the covariate effects, ** α**, no longer operate directly on the compound-specific latent variable, $Xlat$. Instead, we specify the effects to contribute to the higher-level latent variable,

**Z**. In a one-way experimental design, this means that the higher-level latent variable for the sample

*i*,

*K*covariate effects, $\alpha \xb7,ai$, that correspond to the covariate level of the sample

*i*, selected by the categorical indicator,

*a*. This formulation encourages coherently responding compounds to be clustered together in the model.

_{i}All the covariate effects, $\alpha \u2208RK\xd7A$, for the *K* higher-level clusters and *A* levels of the covariate, $a\u2208{1,\u2026,A}N$, are independent and identically distributed,

*l*= 1, which represents the control group and for which the effect is by definition fixed to zero.

In a two-way experimental design, there is a second covariate, $b\u2208{1,\u2026,B}N$, with *B* distinct levels and the corresponding effects, $\beta \u2208RK\xd7B$, analogous to **a**, *A* and ** α**, respectively. Additionally, in a two-way design, there is an interaction effect, $(\alpha \beta )\u2208RK\xd7A\xd7B$, between the two covariates. Together, these three covariate effects influence the higher-level latent variable,

**a**and

**b**, respectively, except for the baseline levels,

*c*= 1 or

*d*= 1, where the covariate effects are by definition fixed to zero.

### 2.3 Model selection

The lower-level clustering of peaks into compounds follows the Dirichlet process (Escobar, 1994), leading to a non-parametric determination of the complexity for the lower-level latent variable.

At the higher clustering level, the model is subject to complexity selection with respect to the number of higher-level clusters, *K*, and the higher-level variance parameter, $\psi 2$. For these parameters, we make the selection jointly based on cross-validation.

The variance parameters, $\psi 2$ and $\sigma 2$, control the flow of information from the observed data, **X**, up the hierarchy of the model toward the inferred covariate effects, ** α**. Small values of the variance parameters allow the information to more readily propagate toward the covariate effects, enabling the detection of weaker covariate effects, while large values protect from excessive false-positive effects.

For data with a simple experimental design, the number of higher-level clusters, *K*, can remain low while still capturing the responses to the covariates. In a more complex experimental design with multiple covariates and their levels, the number of higher-level clusters may need to be larger to capture the richness of the association between the observed data and the experimental covariates. However, the number of higher-level clusters, *K*, is most crucially restricted by the availability of replicates: the required number of samples increases exponentially with the increasing complexity of the experimental design.

## 3 RESULTS

Next, we show that we can benefit from the inherent structure of the mass spectral data in two ways: first, we can make the inference of compound-specific covariate effects more accurate by using multiple peaks from the compound. Second, in the regime of low signal-to-noise ratio, we can further improve the accuracy by imposing stronger structure on the model, and infer the covariate effects on groups of coherently responding compounds.

We present experimental results on simulated data and metabolomic benchmark data (Franceschi *et al.*, 2012) with known changes in the concentrations of chemical compounds.

In the experiments, we compare three approaches for the inference of covariate effects:

0.

*Single peak*. The standard ANOVA model, where the covariate effects are computed as the average difference of the intensity of a single peak between the sample groups.1.

*1-level*. A Bayesian approach for inferring the covariate effects using data from multiple spectral peaks (Suvitaival*et al.*, 2014).2.

*2-level*. The Bayesian approach proposed in this work for inferring the covariate effects through the two-level clustering of peaks and coherently responding compounds.

We evaluate each approach by its accuracy at inferring the covariate effects. The accuracy is measured in terms of the mean squared error (MSE).

### 3.1 Simulated data

We demonstrate the new approach through simulated experiments in regimes, which imitate real metabolomic experiments by their sample size, number of peaks associated with a compound and the general level of noise. We present three experiments where we studied three different aspects of the inference task: (i) the presence of multiple unchanged compounds, (ii) the strength of the change, and (iii) the number of coherently changing compounds. These three experiments are detailed in their respective subsections that follow next. The experiments are summarized in Table 1.

Experiment | 1 | 2 | 3 |
---|---|---|---|

Number of samples (‘case’ + ‘control’), N | 10 + 10 | 10 + 10 | 10 + 10 |

Number of observed variables per a lower-level cluster (peaks), P | 7 | 2 | 2 |

Number of lower-level clusters per a higher-level cluster (compounds), M | 7 | 7 | 1, 3, … , 19 |

Number of higher-level clusters (groups of similarly responding compounds), K | 7 | 1 | 1 |

Covariate effects of the higher-level clusters, $\alpha \xb7,2$ | $[+2,\u22121,+0.5,0,0,0,0]$ | $0,0.2,\u2026,2.0$ | 0.2 |

Validation range of the number higher-level clusters, K | $1,\u2026,7$ | - | - |

Validation range of the higher-level variance parameter, $\psi 2$ | $0.05,0.1,\u2026,0.5$ | $0.1,0.2,\u2026,0.5$ | $0.1,0.2,\u2026,0.5$ |

Experiment | 1 | 2 | 3 |
---|---|---|---|

Number of samples (‘case’ + ‘control’), N | 10 + 10 | 10 + 10 | 10 + 10 |

Number of observed variables per a lower-level cluster (peaks), P | 7 | 2 | 2 |

Number of lower-level clusters per a higher-level cluster (compounds), M | 7 | 7 | 1, 3, … , 19 |

Number of higher-level clusters (groups of similarly responding compounds), K | 7 | 1 | 1 |

Covariate effects of the higher-level clusters, $\alpha \xb7,2$ | $[+2,\u22121,+0.5,0,0,0,0]$ | $0,0.2,\u2026,2.0$ | 0.2 |

Validation range of the number higher-level clusters, K | $1,\u2026,7$ | - | - |

Validation range of the higher-level variance parameter, $\psi 2$ | $0.05,0.1,\u2026,0.5$ | $0.1,0.2,\u2026,0.5$ | $0.1,0.2,\u2026,0.5$ |

#### 3.1.1 Presence of multiple unchanged compounds

In the first simulated experiment, we studied the simultaneous inference of multiple zero and non-zero covariate effects. We generated data with seven similarly responding clusters of compounds, each compound producing seven peaks.

When comparing the inferred covariate effects with the true effects, the two peak-clustering models always had a lower error compared with the single-peak approach (Table 2). Most importantly, the added structure of the two-level model prevented the model from overfitting to the noisy data and improved the accuracy at small and diminishing covariate effects, leading to a decrease in false discoveries.

True covariate effect | RMSE | Corrected P-value of difference of 2-level | |||
---|---|---|---|---|---|

Single | 1-level | 2-level | |||

to Single | to 1-level | ||||

0 | 1.16 | 0.53 | 0.51 | $1.1\xd710\u2212190**$ | $2.9\xd710\u22122*$ |

+0.5 | 1.42 | 0.68 | 0.63 | $3.9\xd710\u221244**$ | $4.0\xd710\u22123**$ |

−1.0 | 1.03 | 0.56 | 0.58 | 1.7 × 10^{−27}** | 4.4 × 10^{−1} |

+2.0 | 1.22 | 0.90 | 1.13 | $5.0\xd710\u22123**$ | $2.6\xd710\u221224**$ |

True covariate effect | RMSE | Corrected P-value of difference of 2-level | |||
---|---|---|---|---|---|

Single | 1-level | 2-level | |||

to Single | to 1-level | ||||

0 | 1.16 | 0.53 | 0.51 | $1.1\xd710\u2212190**$ | $2.9\xd710\u22122*$ |

+0.5 | 1.42 | 0.68 | 0.63 | $3.9\xd710\u221244**$ | $4.0\xd710\u22123**$ |

−1.0 | 1.03 | 0.56 | 0.58 | 1.7 × 10^{−27}** | 4.4 × 10^{−1} |

+2.0 | 1.22 | 0.90 | 1.13 | $5.0\xd710\u22123**$ | $2.6\xd710\u221224**$ |

*Note*: The two-level and one-level models, and the single-peak approach (‘2-level’, ‘1-level’ and ‘Single’, respectively), were compared by their MSE between the inferred and the true covariate effect. The smallest MSE for each true effect is highlighted in bold. The significance of the difference between the two-level model and the two comparison approaches was tested with the two-sided paired *t*-test with the Benjamini–Hochberg control (Benjamini and Hochberg, 1995) for the false discovery rate. The result is from the first simulated experiment (Section 3.1.1).

*/** Significant difference at confidence level 95/99%.

The number of the higher-level clusters, *K*, and the variance parameter, $\psi 2$, were selected jointly via stratified nested 5-fold cross-validation. The procedure was repeated with five independent datasets.

#### 3.1.2 Strength of the change

In the second simulated experiment, we studied how the signal-to-noise ratio of the true covariate effect influences the accuracy of inference. We generated independent datasets with seven compounds and progressively increased the generated covariate effect.

The covariate effect was inferred the most accurately by the two-level model throughout the experiment (Fig. 2a). The accuracy of the model-based approaches decreased slightly when the strength of the true effect increased. This followed from the prior assumption that prevents the model from overfitting to unexpectedly strong covariate effects. The prior for the covariate effects places most of the probability mass around zero and thus effectively sets a bias for the inferred effects toward zero (Fig. 2b). The data-based single-peak approach does not have this bias, but its confidence intervals were considerably wider, which lead to a larger error in the inference task. The effects in any real metabolomic dataset most probably are in the weak regime, where the bias is overshadowed by the noise in the data.

#### 3.1.3 Number of coherently changing compounds

In the third simulated experiment, we studied how the number of coherently responding compounds influences the accuracy of the two-level model. We generated data with a weak covariate effect and gradually increased the number of compounds.

We discovered that the accuracy of the two-level model increased as the number of coherently responding compounds increased (Fig. 3). The experiment empirically confirmed the expected connection between the two Bayesian models: when there was only one responding compound, the two-level model effectively reduced to the one-level model in terms of the error. As expected, the performance of the single-peak approach and the one-level model remained constant throughout the experiment.

### 3.2 Benchmark data with known changes in concentrations

Next, we applied the method on real ultra performance liquid chromatography-mass spectrometry data (Franceschi *et al.*, 2012). The recently published benchmark dataset of apple samples includes a set of annotated spike-in compounds with a known increase in the concentration. The samples have been measured in both the positive and negative ion modes.

We started with the raw spectral data [The raw spike-in dataset by Franceschi *et al.* (2012) is available online at http://cri.fmach.eu/Research/Computational-Biology/Biostatistics-and-Data-Management/download/data/Spiked-Apple-Data (June 11, 2013, date last accessed)] to acquire the shapes of the peaks in addition to their heights. The data were preprocessed using MZmine 2 (Pluskal *et al.*, 2010) with default settings.

We evaluated the approaches by the MSE between the inferred and the true covariate effects. If a cluster contained multiple annotated peaks, the error was computed for each of the annotated peaks. Clusters with no annotated peaks were assumed to have a 0% true effect. The effect of the single-peak approach was computed as the average change of the strongest peak of the cluster.

The number of higher-level clusters and the variance parameter were selected jointly through a stratified nested 5-fold cross-validation from the sets $K\u2208{1,2,\u2026,10}$ and $\psi 2\u2208{0.25,0.5,\u2026,1.5}$, respectively. Model selection and validation was done independently for the positive and negative ion mode datasets.

The analyses for the data from both the ion modes lead to an outcome, where the Bayesian models were more accurate at inferring the covariate effects than the single-peak approach by a significant margin (Table 3). The two-level model improved the accuracy compared with the one-level model at weak effect sizes. It is worth noting that the results were strongly in line with the simulated experiments (Section 3.1.1).

True covariate effect (%) | RMSE | Corrected P-value of difference of 2-level | |||
---|---|---|---|---|---|

Single | 1-level | 2-level | |||

to Single | to 1-level | ||||

(a) Positive ion mode | |||||

+0 | 0.42 | 0.31 | 0.09 | $<\epsilon **$ | $<\epsilon **$ |

+20 | 0.41 | 0.22 | 0.19 | $3.4\xd710\u221234**$ | $2.1\xd710\u221213**$ |

+40 | 0.44 | 0.28 | 0.33 | $3.8\xd710\u221211**$ | $2.7\xd710\u22124**$ |

+100 | 1.06 | 0.91 | 0.92 | $1.3\xd710\u22122*$ | $1.3\xd710\u22125**$ |

(b) Negative ion mode | |||||

+0 | 0.42 | 0.31 | 0.11 | $<\epsilon **$ | $<\epsilon **$ |

+20 | 0.54 | 0.26 | 0.20 | $2.5\xd710\u221247**$ | $4.4\xd710\u221243**$ |

+40 | 0.45 | 0.34 | 0.35 | $6.1\xd710\u221230**$ | $1.1\xd710\u22128**$ |

+100 | 0.82 | 0.74 | 0.88 | $2.5\xd710\u22121$ | $4.2\xd710\u221283**$ |

True covariate effect (%) | RMSE | Corrected P-value of difference of 2-level | |||
---|---|---|---|---|---|

Single | 1-level | 2-level | |||

to Single | to 1-level | ||||

(a) Positive ion mode | |||||

+0 | 0.42 | 0.31 | 0.09 | $<\epsilon **$ | $<\epsilon **$ |

+20 | 0.41 | 0.22 | 0.19 | $3.4\xd710\u221234**$ | $2.1\xd710\u221213**$ |

+40 | 0.44 | 0.28 | 0.33 | $3.8\xd710\u221211**$ | $2.7\xd710\u22124**$ |

+100 | 1.06 | 0.91 | 0.92 | $1.3\xd710\u22122*$ | $1.3\xd710\u22125**$ |

(b) Negative ion mode | |||||

+0 | 0.42 | 0.31 | 0.11 | $<\epsilon **$ | $<\epsilon **$ |

+20 | 0.54 | 0.26 | 0.20 | $2.5\xd710\u221247**$ | $4.4\xd710\u221243**$ |

+40 | 0.45 | 0.34 | 0.35 | $6.1\xd710\u221230**$ | $1.1\xd710\u22128**$ |

+100 | 0.82 | 0.74 | 0.88 | $2.5\xd710\u22121$ | $4.2\xd710\u221283**$ |

*Note*: The two-level and one-level models, and the single-peak approach (‘2-level’, ‘1-level’ and ‘Single’, respectively), are compared by their MSE between the inferred and the true covariate effect. The smallest MSE for each true effect is highlighted in bold. The significance of the difference between the two-level model and the two comparison approaches is tested with the two-sided paired *t*-test with the Benjamini–Hochberg correction (Benjamini and Hochberg, 1995) for the *P*-values. A near-zero value below the machine accuracy is denoted by ‘*ε*’.

*/** Significant difference at confidence level 95/99%.

## 4 CONCLUSION

Additional spectral peaks produced by the mass spectrometer as a result of the ionization process have been shown to be useful for the inference of covariate effects when multiple peaks can be confidently associated with one compound. However, even with multiple peaks supporting the inference, small covariate effects may be hidden under the between-sample variation. We addressed this problem by introducing stronger structure to the model of the covariate effects, thereby regularising the covariate effects and making them less dependent on the variation of individual compounds.

We achieved an improvement in the accuracy of the inferred covariate effects by assuming a structure of coherently responding compounds in the data. We proposed a structured model for inferring covariate effects for groups of compounds through two layers of probabilistic clustering. Metabolomic data are known to have collinear structure for similar compounds. This phenomenon is argued to arise from the biological processes that the compounds are involved in. However, the method proposed in this work does not restrict the groups of compounds by their chemical or biological similarity but infers the groups only based on their responses to the covariates.

In the experiments, we showed that the two-level model proposed in this work decreases the error of inferred covariate effects in a typical setting, where the true effects are small or diminishing. Through three simulated experiments, testing the approaches with multiple zero-effect clusters, varying effect size and varying number of similarly responding compounds, we demonstrated that the two-level model is more accurate at inferring weak covariate effects from noisy multipeak data when compared with the two comparison approaches. The outcome was repeated on a metabolomic benchmark dataset with known changes in the compound concentrations. Following the reduction in the error for the weak covariate effects, the two-level model is argued to reduce false findings.

To further improve the consistency of the inferred covariate effects, we suggest the following avenues of research: (i) prior knowledge about the similarity of the compounds can be incorporated into the prior of the higher-level clustering, either in terms of the biological processes in which the compounds are involved or in terms of the chemical similarity of the compounds. (ii) The lower-level clustering can be improved to detect even the weakest peaks by incorporating prior knowledge about the relative positions of the peaks associated with one compound. This is possible thanks to the fact that the expected positions of many adduct and isotope peaks can be calculated based on the ionization process and the chemical formula of the compound, respectively. (iii) The covariate effects in the isotope peaks are argued to be highly preserved because the isotope peaks do not arise from variation in the ionization process. Additionally, the expected relative heights of these peaks can be calculated if the identity of the compound is known. Incorporating these properties into the model may be even more useful for the inference of covariate effects than the two aforementioned points.

*Funding*: Academy of Finland (Finnish Centre of Excellence in Computational Inference Research COIN, 251170; Computational Modeling of the Biological Effects of Chemicals, 140057). Computational resources were provided by the Aalto University School of Science ‘Science-IT’ project.

*Conflicts of Interest*: none declared.

## Comments