MetAssign: probabilistic annotation of metabolites from LC–MS data using a Bayesian clustering approach

Motivation: The use of liquid chromatography coupled to mass spectrometry has enabled the high-throughput profiling of the metabolite composition of biological samples. However, the large amount of data obtained can be difficult to analyse and often requires computational processing to understand which metabolites are present in a sample. This article looks at the dual problem of annotating peaks in a sample with a metabolite, together with putatively annotating whether a metabolite is present in the sample. The starting point of the approach is a Bayesian clustering of peaks into groups, each corresponding to putative adducts and isotopes of a single metabolite.

Results: The Bayesian modelling introduced here combines information from the mass-to-charge ratio, retention time and intensity of each peak, together with a model of the inter-peak dependency structure, to increase the accuracy of peak annotation. The results inherently contain a quantitative estimate of confidence in the peak annotations and allow an accurate trade-off between precision and recall. Extensive validation experiments using authentic chemical standards show that this system is able to produce more accurate putative identifications than other state-of-the-art systems, while at the same time giving a probabilistic measure of confidence in the annotations.

Availability and implementation: The software has been implemented as part of the mzMatch metabolomics analysis pipeline, which is available for download at http://mzmatch.sourceforge.net/.

Contact: Ronan.Daly@glasgow.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

The clustering model takes the form of a mixture model with a Dirichlet Process (DP) prior (see e.g. Rasmussen (2000)) to avoid specifying the number of clusters (metabolites) a priori. The conditional distributions required by the Gibbs sampler to assign peak n to an existing cluster k or a new cluster k* are (for brevity we omit conditioning on the hyper-parameters):

P(z_{nk} = 1 \mid \ldots) \propto c_k \, p(x_n, w_n, r_n \mid z_{nk} = 1, \ldots) \quad (1)

P(z_{nk^*} = 1 \mid \ldots) \propto \alpha \, p(x_n, w_n, r_n \mid \ldots) \quad (2)

where c_k is the number of peaks currently assigned to cluster k, \alpha is the DP concentration parameter and p(x_n, w_n, r_n \mid z_{nk} = 1, \ldots) is obtained by marginalising over all possible low-level assignments for the metabolite to which this cluster is linked:

p(x_n, w_n, r_n \mid z_{nk} = 1, \ldots) = \frac{1}{A_m I_m} \sum_{a=1}^{A_m} \sum_{i=1}^{I_m} p(x_n, w_n, r_n \mid v_{nkai} = 1, \phi_k, \ldots) \quad (3)

where \phi_k = m if cluster k is linked to formula m and we assume uniform priors over the A_m \times I_m possible adduct and isotope assignments for formula m. For new clusters, the marginalisation is also performed over the assignment of the new cluster to a formula (\phi^*):

p(x_n, w_n, r_n \mid \ldots) = \sum_{m=1}^{M} \pi_m \frac{1}{A_m I_m} \sum_{a=1}^{A_m} \sum_{i=1}^{I_m} p(x_n, w_n, r_n \mid v_{nk^*ai} = 1, \phi^* = m, \ldots) \quad (4)

In this work, we assume that \pi_m = P(\phi^* = m) = \frac{1}{M}.
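The Gibbs conditionals (1) and (2) can be made concrete as a single sampling step: existing clusters are weighted by c_k times their likelihood, and a new cluster by \alpha times the prior-predictive density. The sketch below is illustrative only (all function and variable names are ours, not MetAssign's implementation):

```python
import numpy as np

def sample_cluster_assignment(likelihoods, counts, new_likelihood, alpha, rng):
    """Sample a cluster index for one peak from the DP Gibbs conditionals.

    likelihoods    -- p(x_n, w_n, r_n | z_nk = 1, ...) for each existing cluster k
    counts         -- c_k, peaks currently in each cluster (current peak removed)
    new_likelihood -- prior-predictive density for a brand-new cluster
    alpha          -- DP concentration parameter
    Returns an index in 0..K; index K means "open a new cluster".
    """
    weights = np.append(np.asarray(counts, dtype=float) *
                        np.asarray(likelihoods, dtype=float),
                        alpha * new_likelihood)
    weights /= weights.sum()          # normalise Eqs (1)-(2)
    return int(rng.choice(len(weights), p=weights))

rng = np.random.default_rng(0)
k = sample_cluster_assignment([0.2, 0.05], [3, 1], 0.1, alpha=1.0, rng=rng)
```

Note that because the counts c_k appear in the weights, well-populated clusters attract further peaks, which is the "rich get richer" behaviour of the DP prior.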
Our model assumes that p(x_n, w_n, r_n \mid v_{nkai} = 1, \phi_k, \ldots) factorises across the three data-types. For the mass term, we assume a Gaussian density on the log of the mass (i.e. mass noise is proportional to x_n):

p(x_n \mid v_{nkai} = 1, \ldots) = \mathcal{N}(\log x_n \mid \log y^{\phi_k}_{ai}, \zeta^{-1}) \quad (5)

where y^{\phi_k}_{ai} is the theoretical mass of the ith isotope peak of the ath adduct for the formula assigned to cluster k and \zeta is the mass precision.
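As an illustration of the mass term, the sketch below evaluates \mathcal{N}(\log x_n \mid \log y, \zeta^{-1}). The helper `zeta_from_ppm` shows one plausible way to tie \zeta to an instrument's ppm accuracy so that roughly 95% of the probability mass falls within \pm ppm of the theoretical mass; this 1.96\sigma convention is our assumption for illustration, not necessarily MetAssign's exact calibration:

```python
import math

def zeta_from_ppm(ppm):
    """Precision zeta such that ~95% of the log-mass error lies within
    +/- ppm * 1e-6 (assumption: 1.96 sigma = ppm * 1e-6)."""
    return (1.96 / (ppm * 1e-6)) ** 2

def log_mass_density(x, y, zeta):
    """Gaussian density on the log scale: N(log x | log y, zeta^{-1})."""
    var = 1.0 / zeta
    d = math.log(x) - math.log(y)
    return math.exp(-0.5 * d * d / var) / math.sqrt(2.0 * math.pi * var)
```

Working on the log scale means a fixed \zeta corresponds to a fixed relative (ppm-style) mass tolerance, rather than an absolute one in Daltons.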
Algorithm 1. The sampler used to compute the posterior probability of peak-to-metabolite assignments. Note that the data and hyper-parameters are implicitly defined.
The intensity term is also Gaussian, but the density depends on the intensities of the other peaks currently assigned to the cluster. In particular, we assume that the intensity of adduct a in cluster k, \lambda_{ka}, is drawn from a Gaussian prior \mathcal{N}(\lambda_0, \kappa_0^{-1}). We set \lambda_0 to the mean of the observed intensities and \kappa_0 to 10^{-14}, resulting in a fairly flat prior over the region of interest. Individual peak intensities are then assumed to be drawn from a Gaussian conditioned on their adduct-isotope assignment:

p(w_n \mid v_{nkai} = 1, \lambda_{ka}, \ldots) = \mathcal{N}(w_n \mid \beta^{\phi_k}_{ai} \lambda_{ka}, \kappa^{-1})

where \beta^{\phi_k}_{ai} is the theoretical proportion of the total intensity that would be observed as isotope peak i and \kappa = 10^{-8} is the observation precision. Based on the peaks currently assigned to cluster k, we can compute the posterior density over \lambda_{ka}. By standard Gaussian conjugacy this is another Gaussian,

p(\lambda_{ka} \mid \ldots) = \mathcal{N}(\lambda_{ka} \mid \mu_{ka}, \kappa_{ka}^{-1}), \quad \kappa_{ka} = \kappa_0 + \kappa \sum_j (\beta^{\phi_k}_{a i_j})^2, \quad \mu_{ka} = \frac{\kappa_0 \lambda_0 + \kappa \sum_j \beta^{\phi_k}_{a i_j} w_j}{\kappa_{ka}}

where the summations run over the peaks currently assigned to adduct a of cluster k and do not include the peak whose assignment we are currently sampling. We can then marginalise over \lambda_{ka} to obtain the conditional density used by the sampler:

p(w_n \mid v_{nkai} = 1, \ldots) = \mathcal{N}(w_n \mid \beta^{\phi_k}_{ai} \mu_{ka}, (\beta^{\phi_k}_{ai})^2 \kappa_{ka}^{-1} + \kappa^{-1}) \quad (6)

For the retention time term, we assume the following generative model: the cluster retention time l_k is drawn from \mathcal{N}(\mu_0, \delta_0^{-1}), where \mu_0 is the mean of the retention times in the data and \delta_0 = 10^{-5}. Each peak retention time is then l_k with additive noise: r_n \sim \mathcal{N}(l_k, \gamma^{-1}), where \gamma = 2.5 \times 10^{-1}. We can analytically compute the posterior density for l_k,

p(l_k \mid \ldots) = \mathcal{N}(l_k \mid \mu_k, \delta_k^{-1}), \quad \delta_k = \delta_0 + \gamma c_k, \quad \mu_k = \frac{\delta_0 \mu_0 + \gamma \sum_j r_j}{\delta_k}

where the sum again excludes the peak currently being sampled, and, as for intensity, marginalise over l_k to get

p(r_n \mid z_{nk} = 1, \ldots) = \mathcal{N}(r_n \mid \mu_k, \delta_k^{-1} + \gamma^{-1}) \quad (7)

p(x_n, w_n, r_n \mid z_{nk} = 1, \ldots) is then given by the product of Equations 5, 6 and 7. The quantity required for a new cluster is computed in a similar manner, but with the posterior parameters replaced by their prior counterparts (for r_n and w_n).
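The Normal-Normal conjugacy used for the retention-time term (and, with the \beta weights, for the intensity term) can be sketched directly from the stated generative model l_k \sim \mathcal{N}(\mu_0, \delta_0^{-1}), r_n \sim \mathcal{N}(l_k, \gamma^{-1}). This is an illustrative reimplementation of the textbook update, not MetAssign's code:

```python
import math

def rt_posterior(r_obs, mu0, delta0, gamma):
    """Posterior over the cluster retention time l_k given the retention
    times r_obs of peaks already assigned to the cluster (current peak
    excluded). Returns (posterior mean, posterior precision)."""
    delta_post = delta0 + gamma * len(r_obs)
    mu_post = (delta0 * mu0 + gamma * sum(r_obs)) / delta_post
    return mu_post, delta_post

def rt_predictive(r_n, r_obs, mu0, delta0, gamma):
    """Marginal density p(r_n | other peaks) after integrating out l_k:
    a Gaussian whose variance adds posterior and observation noise."""
    mu_post, delta_post = rt_posterior(r_obs, mu0, delta0, gamma)
    var = 1.0 / delta_post + 1.0 / gamma
    d = r_n - mu_post
    return math.exp(-0.5 * d * d / var) / math.sqrt(2.0 * math.pi * var)
```

With no assigned peaks the posterior collapses to the prior, which is exactly the "prior counterparts" case used for new clusters.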
If a peak is assigned to an existing cluster, it must then be assigned to a particular adduct-isotope pair within that cluster. The probability of isotope i and adduct a is

P(v_{nkai} = 1 \mid z_{nk} = 1, x_n, w_n, r_n, \ldots) \propto p(x_n, r_n, w_n \mid v_{nkai} = 1, \ldots) \quad (8)

which can be decomposed as above. For a new cluster, we must first assign the cluster to a formula. This is done with

P(\phi^* = m \mid \ldots) \propto \pi_m \frac{1}{A_m I_m} \sum_{a=1}^{A_m} \sum_{i=1}^{I_m} p(x_n, w_n, r_n \mid v_{nk^*ai} = 1, \phi^* = m, \ldots) \quad (9)

and the assignment to adduct and isotope follows as in the previous case.

DATASETS
The data used for the experiments came from three mixtures of standard metabolites, normally used for identification purposes.
These standards (referred to as Standard 1, Standard 2 and Standard 3) contain 104, 96 and 40 metabolites, respectively. Each standard was run in triplicate on the system, with interleaving negative and positive ionisation modes, and the data from each ionisation mode were gathered to provide the LC-MS profile for that run. The three replicates were aligned using mzMatch's Combine algorithm to produce the final data. The standards were run using ZIC-HILIC chromatography (Merck Sequant, Darmstadt, DE) on an UltiMate 3000 RSLC system (Thermo, Hemel Hempstead, UK), coupled to an Orbitrap Exactive mass spectrometer (Thermo, Hemel Hempstead, UK) in positive and negative ionisation modes. The output from each of these runs was transformed to an mzXML file and then to a PeakML file (Scheltema et al., 2011), which was used as input to the algorithms.
The composition of the standards is given in the supplementary information of Creek et al. (2011, DOI: 10.1021/…).

ROBUSTNESS AND CONVERGENCE
Since the MetAssign algorithm is based on a Bayesian framework and is implemented using Monte-Carlo sampling, it is important to check the robustness of the inferences to the prior parameters. To test prior sensitivity, parameter sweeps were performed, varying the parameters α, γ, γ_m, p_0 and p_1; no significant changes in the posterior values were found.
In order to examine the convergence of the Markov chain to its stationary distribution, the MetAssign program was run on each of the standard datasets, in positive and negative mode, thirty times, each time starting from a random clustering. Each run had a burn-in of 600 samples and used 1000 samples for calculating the posterior. The posterior probability of each peak being assigned to each metabolite-adduct-isotope combination was calculated, along with the standard deviation of these quantities over the thirty runs. The proportion of these standard deviations under 0.1 is shown in Table 1, and a histogram of the standard deviations for Standard 1 in positive mode is shown in Figure 2. These results show that the vast majority of peak assignments are sampled from the stationary distribution of the Markov chain defined by the sampler.
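The diagnostic described above amounts to computing, across independent restarts, the standard deviation of each posterior assignment probability and reporting the fraction below 0.1. A minimal sketch, with fabricated illustrative data:

```python
import numpy as np

def stable_fraction(prob_runs, threshold=0.1):
    """Fraction of assignment probabilities whose standard deviation
    across independent sampler restarts is below `threshold`.

    prob_runs -- array of shape (n_runs, n_assignments): posterior
                 probability of each peak/metabolite-adduct-isotope
                 combination, one row per restart.
    """
    sds = np.std(np.asarray(prob_runs, dtype=float), axis=0)
    return float(np.mean(sds < threshold))

runs = np.array([[0.90, 0.10, 0.50],
                 [0.92, 0.11, 0.10],
                 [0.89, 0.09, 0.90]])
# Columns 0 and 1 are stable across restarts; column 2 varies wildly,
# so two of the three assignments count as converged.
```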

COMPUTATIONAL COMPLEXITY
The runtime of MetAssign depends on a number of factors, the most important of which are the number of peaks N, the size of the database M and the number of adduct types specified A. Strictly speaking, the complexity is O(NMA), but because most peaks match only a very small number of entries in M × A, the constant factor is very small. Figure 3 shows the linear relationship between N × M × A and runtime. An indicative sample is N = 10334 peaks, M = 1079 compounds and A = 14 adduct types, for a runtime of 1259 seconds. Figure 4 shows the relationship between N × M × A and memory usage. Whilst the trend is not as linear as the runtime trend, the relationship is still fairly well described by this model. The same indicative example gives a memory usage of 954 MB. Note that there is a fairly large constant in these figures, accounting for the overheads of the execution environment.
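As a rough illustration of the linear O(NMA) model, the per-unit cost implied by the indicative sample can be computed directly (numbers taken from the text; the per-unit figure is only indicative, since the effective constant depends on how many database entries each peak actually matches):

```python
# Rough per-unit cost implied by the indicative sample in the text.
N, M, A = 10334, 1079, 14   # peaks, database compounds, adduct types
runtime_s = 1259.0          # observed runtime in seconds
cost_per_unit = runtime_s / (N * M * A)   # seconds per N*M*A unit
```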
The execution environment of all the examples given in this paper consisted of an Intel Xeon E5606 running at 2.13 GHz, with 2 GB of available memory.

RUNNING THE METASSIGN PROGRAM
MetAssign is available as part of the mzMatch suite of tools that operates on data derived from LC-MS experiments. The tools are available at http://mzmatch.sourceforge.net/, with a MetAssign tutorial at http://mzmatch.sourceforge.net/MetAssign.php. There are many parameters that can modify the behaviour of the program. These are explained here, though in many cases the default values will work well.

Running from the command line
Since the mzMatch system is Java-based, a Java Runtime Environment (such as the one provided by Oracle) is needed. Assuming the runtime environment has been set up as explained on the mzMatch website, the following command will run the MetAssign algorithm:

    java mzmatch.ipeak.sort.MetAssign <parameters>

The following is a list of the more important parameters used in MetAssign, with a brief explanation and some guidance on their use. For a full explanation of all parameters, please refer to the mzMatch documentation.

-i <filename>: The PeakML input file
-o <filename>: The PeakML output file containing peak annotations
-sampleOut <filename>: A tab-separated output file where compound annotations are stored
-ppm <number>: The PPM accuracy of the measuring mass spectrometer
-filterPPM <number>: Only peak m/z values within filterPPM of theoretical peaks are used
-numDraws <number>: The number of posterior Monte-Carlo draws to collect
-burnIn <number>: The number of posterior Monte-Carlo draws to discard before collection occurs
-databases <comma-separated filenames>: A list of databases in mzMatch XML format used for annotation
-adducts <comma-separated adduct-formats>: A list of possible adducts that could be formed in the mass spectrometer
-retentionTimeSD <number>: The standard deviation of peak retention times from their 'true' value
-identificationPeaks <number>: The program will output probabilistic annotations of metabolites supported by at least this many peaks
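Putting the parameters above together, a hypothetical invocation might look as follows. All file names and parameter values here are illustrative only, and the exact adduct-format strings should be taken from the mzMatch documentation:

```shell
# Hypothetical invocation; file names and values are illustrative only.
java mzmatch.ipeak.sort.MetAssign \
    -i standard1_combined.peakml \
    -o standard1_annotated.peakml \
    -sampleOut standard1_compounds.tsv \
    -ppm 3 \
    -filterPPM 4 \
    -numDraws 200 \
    -burnIn 200 \
    -databases standard1_database.xml \
    -adducts M+H,M+Na,M+K \
    -retentionTimeSD 5 \
    -identificationPeaks 2
```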

Parameter Usage
When running the MetAssign program, there are a few parameters that should be explicitly set by the user, as these directly influence the results obtained:

ppm: This should be set to the parts-per-million accuracy of the MS equipment. The probability distributions over the theoretical peaks are defined so that 95% of the probability mass is covered by this value in each direction.

filterPPM: An optimisation measure that speeds up processing by ignoring peaks that are not within this value of a theoretical peak. A value between 1.1×ppm and 1.5×ppm is generally appropriate.

numDraws: How many posterior Monte-Carlo draws to take. For the analysis performed in this paper, 200 samples were taken, which should be sufficient for most uses. For data with more peaks than the test datasets presented here, or for much larger databases, more samples might be needed, perhaps up to 500.

burnIn: How many initial Monte-Carlo draws to discard before saving posterior samples. For larger datasets, it is recommended to set this to 200.

databases: A comma-separated list of databases to be matched against. The database format is XML and is described in the mzMatch documentation.

adducts: A comma-separated list of adduct types used in the generation of theoretical peaks. Whilst an exhaustive list can be provided, it is better to stick to those adducts that are known to be generated, as spurious adducts can produce more false positives.

retentionTimeSD: Describes the spread of the retention times of the LC-MS peaks (e.g. isotopic and adduct peaks) generated by a single chromatographic peak. It should be set so that the deviation of the retention times of most of these LC-MS peaks from the retention time of the chromatographic peak is less than twice this value. This value can vary widely because of difficulties in detecting accurate retention times from noisy peaks.

identificationPeaks: Output metabolite annotations that are supported by at least this many peaks assigned to the metabolite.

Program Output
The output of the MetAssign program consists of two elements: annotations, on each input peak, of the probability that the peak came from a certain compound, and annotations of the probability that a certain compound is present in the sample.

Peak Annotation Output
The annotated output file (corresponding to the -o option) consists of the input file with extra PeakML annotations on each of the peaks. For a description of the PeakML annotation format, please see the mzMatch website. The annotations produced by the MetAssign program are STRING annotations on the top-level peaksets that give the probability that a peak came from a certain compound. The three annotation labels are:

priorIdentification: the prior probability
probabilityIdentification: the basic posterior probability
junkProbabilityIdentification: the posterior probability with 'noisy' peaks filtered out

Compound Identification Output
The output for the compound annotation (corresponding to the -sampleOut option) consists of a tab-separated file, the rows of which correspond to database compounds, with the following column headers:

compoundId: the identifier of the compound from the database
compoundName: the human-readable name of the compound
p.<number> (multiple headers, with <number> ≥ 1): the posterior probability of that compound at support level <number>
p.combined: the mean of the p.<number> columns
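Because the -sampleOut file is plain tab-separated text, it is straightforward to post-process. A minimal sketch of a reader is shown below; the sample row is fabricated example data, not real MetAssign output:

```python
import csv
import io

def read_compound_table(fh):
    """Parse the -sampleOut tab-separated file into a list of dicts,
    converting the p.* probability columns to floats."""
    rows = []
    for row in csv.DictReader(fh, delimiter="\t"):
        for key, value in row.items():
            if key.startswith("p."):
                row[key] = float(value)
        rows.append(row)
    return rows

# Fabricated example row for illustration only.
sample = ("compoundId\tcompoundName\tp.1\tp.2\tp.combined\n"
          "HMDB00122\tD-Glucose\t0.95\t0.80\t0.875\n")
table = read_compound_table(io.StringIO(sample))
```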

INTENSITY FILTER PEAKSETS
The intensity threshold against which peaks were filtered was varied over 0, 5000, 10000, 15000 and 20000. This resulted in different peak sets, whose sizes are shown in Table 2.

ADDUCT-TYPES USED IN ANALYSIS
The set of possible adducts used is given in Table 3.

PEAK INTENSITY FILTERING
In order to examine the results of pre-filtering the data with a peak-intensity thresholding algorithm, each of the datasets was filtered at a set threshold and the effect on the output was examined. For this analysis, the database with 1000 decoy compounds was used. Table 4 shows the results for the F1 score and Table 5 shows the results for the TPR at 5% FPR. As can be seen, the output can be sensitive to the pre-filtering, as expected. However, independent of the filtering applied, MetAssign outperforms the alternative methods.
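The pre-filtering step is a simple intensity threshold. A minimal sketch of such a filter, assuming peaks are represented as (m/z, retention time, intensity) tuples (our layout for illustration, not mzMatch's internal representation):

```python
def filter_by_intensity(peaks, threshold):
    """Keep only peaks whose intensity meets the threshold.

    peaks -- iterable of (mz, rt, intensity) tuples (illustrative layout).
    """
    return [p for p in peaks if p[2] >= threshold]

# Fabricated peaks: with a 5000-count threshold only the second survives.
peaks = [(180.063, 312.4, 2500.0), (181.067, 312.5, 18000.0)]
```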

DATASET PR AND ROC CURVES
This section contains results similar to those in Section 4 of the article, but across all datasets and ionisation modes. Figure 5 gives precision-recall curves, whilst Figure 6 gives ROC curves.