- Split View
-
Views
-
Cite
Cite
Eric Bach, Simon Rogers, John Williamson, Juho Rousu, Probabilistic framework for integration of mass spectrum and retention time information in small molecule identification, Bioinformatics, Volume 37, Issue 12, June 2021, Pages 1724–1731, https://doi.org/10.1093/bioinformatics/btaa998
- Share Icon Share
Abstract
Identification of small molecules in a biological sample remains a major bottleneck in molecular biology, despite a decade of rapid development of computational approaches for predicting molecular structures using mass spectrometry (MS) data. Recently, there has been increasing interest in utilizing other information sources, such as liquid chromatography (LC) retention time (RT), to improve identifications solely based on MS information, such as precursor mass-per-charge and tandem mass spectrometry (MS2).
We put forward a probabilistic modelling framework to integrate MS and RT data of multiple features in an LC-MS experiment. We model the MS measurements and all pairwise retention order information as a Markov random field and use efficient approximate inference for scoring and ranking potential molecular structures. Our experiments show improved identification accuracy by combining MS2 data and retention orders using our approach, thereby outperforming state-of-the-art methods. Furthermore, we demonstrate the benefit of our model when only a subset of LC-MS features has MS2 measurements available besides MS1.
Software and data are freely available at https://github.com/aalto-ics-kepaco/msms_rt_score_integration.
Supplementary data are available at Bioinformatics online.
1 Introduction
The identification of small molecules, such as metabolites or drugs, in biological samples is a challenging task posing a bottleneck in various research fields, such as biomedicine, biotechnology, environmental chemistry and drug discovery. In untargeted metabolomics studies, the samples typically contain thousands of different molecules, the vast majority of which remain unidentified (Aksenov et al., 2017; da Silva et al., 2015). Liquid chromatography (LC) coupled with tandem mass spectrometry (MS2) is arguably the most important measurement platform in metabolomics (Blaženović et al., 2018), due to its suitability to high-throughput screening, its high sensitivity and applicability to a wide range of molecules. Briefly explained, LC separates molecules by their differential physicochemical interaction between the stationary and mobile phase, which results in retention time (RT) differences and MS separates molecular ions by their mass per charge (MS1). Subsequently, MS2 can be used to fragment molecules in a narrow mass window and to record the fragment intensities (MS2-spectrum). In an untargeted metabolomics experiment, large sets of MS features (MS1 and RT, plus optionally MS2), are observed, corresponding to the different molecules in the sample. Metabolite identification concerns then the structural annotation of the observed MS features.
In recent years, numerous powerful approaches (Nguyen et al., 2018a; Schymanski et al., 2017) to predict molecular structure annotations for MS2 spectra have been developed (Allen et al., 2014; Brouard et al., 2016; Dührkop et al., 2015, 2019; Nguyen et al., 2018b, 2019; Ruttkies et al., 2016, 2019). Typically, these methods output a ranked list of molecular structure candidates, that can be shown to human experts, or further post-processed, e.g. by using additional information available for the analysed sample. Sources of additional information include, e.g. RT (Bach et al., 2018; Ruttkies et al., 2016; Samaraweera et al., 2018), collision cross-section (Plante et al., 2019) or prior knowledge on the data generating process, such as the source organism’s metabolic characteristics (Rutz et al., 2019).
RT, i.e. the time that a molecule takes to elute from the LC column, is readily available in all LC-MS pipelines, and is frequently used in aiding annotation (Stanstrup et al., 2015). A basic technique is to use the difference between the observed and predicted RT (Domingo-Almenara et al., 2019; Samaraweera et al., 2018) to prune the list of candidate molecular structures. A major challenge for utilizing RT information, however, is that the RT of the same molecule can vary significantly across different LC systems and configurations, necessitating system specific candidate RT reference databases and RT predictors. Different approaches have been proposed to tackle this challenge, such as using physicochemical properties (e.g. partition coefficient, LogP) as RT proxies (Hu et al., 2018; Ruttkies et al., 2016), RT mapping across LC systems (Stanstrup et al., 2015) or predicting retention orders, which are largely preserved within a family of LC systems (e.g. reversed phase) (Bach et al., 2018; Liu et al., 2019). Using LogP as an RT proxy is simple to implement, but only models the hydrophobic separation effects of the LC system. RT mapping, on the other hand, is limited to pairs of LC systems in which the same molecules have been measured. Retention order prediction can overcome those drawbacks, by learning the LC system’s separation directly from RT data of multiple systems (Bach et al., 2018).
This study proposes a probabilistic framework to integrate MS1 or MS2-based annotations with predicted retention order for improved small molecule identification given a set of MS features measured within one LC-MS run by building on the work by Bach et al. (2018) and Del Carratore et al. (2019). The latter proposed a probabilistic approach for integrating different types of additional information to MS1 data, including RT information. We too define a probabilistic approach, but differ in how RT is handled. Where Del Carratore et al. (2019) use absolute RT information, we follow Bach et al. (2018) and use pairwise retention order predictions for molecules eluting within the same LC-MS run. In contrast to the work done by Bach et al. (2018), our model makes use of pairwise retention order information between all MS features rather than only the ones adjacent in terms of their RTs, resulting in more accurate annotations. Furthermore, our model allows to rank all candidate lists, instead of just returning the most likely candidate assignment for each MS feature, as done by Bach et al. (2018).
Our framework models the score integration as an inference problem on a graphical model, where the edges correspond to retention order predictions, the nodes correspond to MS features and the node labels correspond to candidate molecular structures, scored by a MS2 based predictor, such as CSI: FingerID (Dührkop et al., 2015), MetFrag (Ruttkies et al., 2016) or IOKR (Brouard et al., 2016), or in the absence of MS2 information, MS1 precursor mass deviation. This graph is fully connected, which makes exact inference an NP-hard problem. To solve this challenge, we resort to efficient approximate inference, in particular spanning tree approximations (Marchand et al., 2014; Pletscher et al., 2009; Su and Rousu, 2015; Wainwright et al., 2005).
2 Materials and methods
2.1 Overall workflow
We assume data arising from a typical LC-MS-based experimental workflow (including chromatographic peak picking, and alignment): MS features consisting MS1 measurement and the associated RT. A subset of these will include an MS2 spectrum. In the following, we present our score-integration model in the most general form in which it is provided with MS features and a set of possible candidate molecular structures. The candidate list can be generated, e.g. by querying molecular structures from a structure database, such as ChemSpider (Pence and Williams, 2010), that have the same mass as the observed MS feature. In addition, we assume that to each candidate structure a score is assigned by either an MS2-based predictor, or, if no MS2-spectrum is available, a score based on the mass deviation of the candidates from the MS mass. For all molecular candidate pairs, associated with the different MS features, the retention order is predicted. Here, we use the Ranking Support Vector Machine (RankSVM)-based predictor by Bach et al. (2018). The candidate structure scores and predicted retention orders are integrated through a probabilistic graphical model (described in the following). This allows us to rank the molecular candidate structures by their inferred marginal probabilities, given both the MS and RT information.
More formally, the output of an LC-MS experiment is given as a tuple set , with being the spectrum of feature i (either an MS2 or a spectrum containing only a single peak at the mass of the precursor ion if no MS2 information is available), being its RT, and being the associated molecular candidates. Here, represents a molecular candidate structure and ni is the number of molecular candidates for the ith MS feature. Figure 1 shows an overview of our workflow.
2.2 Probabilistic model
Let be an undirected graph, in which each node, represents one observed MS feature, and with an edge for all MS feature pairs . The edge-set E does not contain any parallel edges. The number of MS features is denoted with N, i.e. . We associate each node i in the vertex set with a discrete random variable zi that takes values from the space . Intuitively, zi defines which candidate has been assigned to the ith MS feature. The full vector corresponds to the molecular structure assignment to each MS feature in the LC-MS experiment, and it takes values from the set . In this work, we consider to be fixed and finite for a given a set of MS features, due to our definition of the molecular candidates sets, which assumes that we can restrict the putative annotation for a given MS feature.
2.2.1 Markov random field
2.2.2 Node potential function ψi
2.2.3 Edge potential function ψij
The different functions can be interpreted as follows. The sigmoid makes full use of the information from the RankSVM margin, i.e. the score of each candidate pair depends on the preference score difference. In this work, we consider k as a hyper-parameter of our method that needs to be estimated from data (Section 3.5). The step-function, on the other hand, only differentiates between aligned and not aligned pairs.
2.2.4 Weighting of information sources
2.3 Ranking candidates through approximated marginals
In practice, the calculation of (2) is intractable due to the size of the domain of z, which grows exponentially with the number of MS features, thus we will resort to approximate inference methods.
2.3.1 Tree approximation of G
As the graphical model associated with (3) is a tree, we can exactly infer its marginals through the sum-product algorithm (MacKay, 2005), The sum-product algorithm is a message-passing algorithm using dynamic programming that has linear time complexity in the number of MS features. See, e.g. MacKay (2005) for further details on the algorithm.
2.3.2 Random spanning trees sampling
We compare two approaches to retrieve spanning trees from G. The first approach is to randomly sample spanning trees from G (c.f. Pletscher et al., 2009; Su and Rousu, 2015; Wainwright et al., 2005). We sample the trees by applying the minimum weighted spanning tree algorithm to a random adjacency matrix. If for an MS feature pair (i, j) both RTs are equal, i.e. ti=tj, than their corresponding edge is not sampled. This is justified by the observation, that MS features with a RT difference equal zero, do not impose constraints on the retention order of their corresponding candidates. We will refer to a sampled spanning tree as Tt. The second approach was implicitly used by Bach et al. (2018) and corresponds to a linear Markov chain where edges connect adjacent MS features ordered by increasing RT, which can be seen as a degenerate spanning tree. In the remaining text, we refer to this tree as .
2.3.3 Averaged marginal over a random spanning tree ensemble
Using tree-like graphical models for the inference is motivated by the exact and fast inference it enables us to do. However, a single tree, such as or a sampled Tt, will most likely be only a rough approximation of the original probability distribution (1). Therefore, the use of random spanning tree ensembles has been proposed. In particular, Wainwright et al. (2005) show that an expectation over trees can be used to obtain an upper bound on the maximum a posteriori (MAP) estimate of the original graph, and showed that this approximation can be tight if the underlying trees agree about the MAP configuration. More recently, Marchand et al. (2014) demonstrated generalization bounds for joint learning and inference using tree ensembles. More applied work in using tree-based approximation can be found in Pletscher et al. (2009), who use majority voting and Su and Rousu (2015), who empirically study several aggregation schemes in multilabel classification.
2.3.4 Max-marginals
By plugging Equation (5) into (4) instead of the sum-marginal, we obtain the averaged max-marginal .
2.3.5 Run-time complexity
Calculating the marginals for an individual tree and all MS features i has run-time complexity . Here, N is the total number of features and the maximum number of molecular candidates for a feature.
3 Material and experiments
3.1 Evaluation datasets
To evaluate our score-integration approach, we use two publicly available datasets. These are described in this section and summarized in Table 1.
Dataset . | Ionization . | Mass spectra info. . | . | Molecular candidatesa . | . | Chromatography . | . |
---|---|---|---|---|---|---|---|
. | . | MS1 info. . | #MS2 . | Tot. #Cand. . | Median #Cand. . | Column . | Eluent . |
CASMI 2016 | Negative | Precursor m/z | 81 | 74 589 | 420 | Phenomenex Kinetex EVO C18 | H2O → MeOH (both 0.1% formic acid) |
CASMI 2016 | Positive | Precursor m/z | 127 | 183 633 | 919 | Phenomenex Kinetex EVO C18 | H2O → MeOH (both 0.1% formic acid) |
EA (Massbank) | Negative | Precursor m/z | 154 | 75 107 | 119.5 | Waters XBridge C18 | H2O → MeOH (both 0.1% formic acid) |
EA (Massbank) | Positive | Precursor m/z | 319 | 215 893 | 246 | Waters XBridge C18 | H2O → MeOH (both 0.1% formic acid) |
Dataset . | Ionization . | Mass spectra info. . | . | Molecular candidatesa . | . | Chromatography . | . |
---|---|---|---|---|---|---|---|
. | . | MS1 info. . | #MS2 . | Tot. #Cand. . | Median #Cand. . | Column . | Eluent . |
CASMI 2016 | Negative | Precursor m/z | 81 | 74 589 | 420 | Phenomenex Kinetex EVO C18 | H2O → MeOH (both 0.1% formic acid) |
CASMI 2016 | Positive | Precursor m/z | 127 | 183 633 | 919 | Phenomenex Kinetex EVO C18 | H2O → MeOH (both 0.1% formic acid) |
EA (Massbank) | Negative | Precursor m/z | 154 | 75 107 | 119.5 | Waters XBridge C18 | H2O → MeOH (both 0.1% formic acid) |
EA (Massbank) | Positive | Precursor m/z | 319 | 215 893 | 246 | Waters XBridge C18 | H2O → MeOH (both 0.1% formic acid) |
Extracted from ChemSpider. CASMI: ±5 ppm window around monoisotopic exact mass of correct candidate. EA: MF of correct candidate.
Dataset . | Ionization . | Mass spectra info. . | . | Molecular candidatesa . | . | Chromatography . | . |
---|---|---|---|---|---|---|---|
. | . | MS1 info. . | #MS2 . | Tot. #Cand. . | Median #Cand. . | Column . | Eluent . |
CASMI 2016 | Negative | Precursor m/z | 81 | 74 589 | 420 | Phenomenex Kinetex EVO C18 | H2O → MeOH (both 0.1% formic acid) |
CASMI 2016 | Positive | Precursor m/z | 127 | 183 633 | 919 | Phenomenex Kinetex EVO C18 | H2O → MeOH (both 0.1% formic acid) |
EA (Massbank) | Negative | Precursor m/z | 154 | 75 107 | 119.5 | Waters XBridge C18 | H2O → MeOH (both 0.1% formic acid) |
EA (Massbank) | Positive | Precursor m/z | 319 | 215 893 | 246 | Waters XBridge C18 | H2O → MeOH (both 0.1% formic acid) |
Dataset . | Ionization . | Mass spectra info. . | . | Molecular candidatesa . | . | Chromatography . | . |
---|---|---|---|---|---|---|---|
. | . | MS1 info. . | #MS2 . | Tot. #Cand. . | Median #Cand. . | Column . | Eluent . |
CASMI 2016 | Negative | Precursor m/z | 81 | 74 589 | 420 | Phenomenex Kinetex EVO C18 | H2O → MeOH (both 0.1% formic acid) |
CASMI 2016 | Positive | Precursor m/z | 127 | 183 633 | 919 | Phenomenex Kinetex EVO C18 | H2O → MeOH (both 0.1% formic acid) |
EA (Massbank) | Negative | Precursor m/z | 154 | 75 107 | 119.5 | Waters XBridge C18 | H2O → MeOH (both 0.1% formic acid) |
EA (Massbank) | Positive | Precursor m/z | 319 | 215 893 | 246 | Waters XBridge C18 | H2O → MeOH (both 0.1% formic acid) |
Extracted from ChemSpider. CASMI: ±5 ppm window around monoisotopic exact mass of correct candidate. EA: MF of correct candidate.
3.1.1 CASMI 2016
The Critical Assessment of Small Molecule Identification (CASMI) challenge is a contest organized for the computational spectrometry community (Schymanski et al., 2017). For its implementation in 2016, a dataset of 208 (MS2, retention-time)-tuples was released. The dataset contains 81 negative and 127 positive ionization mode MS2 spectra. The molecular candidate structure sets were extracted from ChemSpider, using a ± 5 ppm window around the monoisotopic exact mass of the correct candidate, by the challenge organizers.
3.1.2 EA (Massbank)
Massbank (Horai et al., 2010) is a publicly available repository for MS data. For the development of MetFrag 2.2, Ruttkies et al. (2016) extracted 473 (MS2, retention-time)-tuples of 359 unique molecular structures from Massbank (EA dataset). The dataset is split into 154 negative and 319 positive ionization mode MS2 spectra. We used the molecular candidates provided by Ruttkies et al. (2016) extracted from ChemSpider using the molecular formula (MF) of the correct candidate.
For each dataset and ionization mode, we repeatedly subsample training and test (MS2, RT)-tuple sets: CASMI (negative) 50-times ; CASMI (positive) 50-times ; EA (negative) 50-times ; and EA (positive) 100-times . No molecular structure, determined by its InChI representation, appears simultaneously in test and training. The training set is used for the hyper-parameter selection (Section 3.5) and the test sets are used to assess the average identification performance of our score-integration framework (Section 3.4).
3.2 Training setup for the retention order predictor
To calculate the edge potentials of our MRF model (1), we use the RankSVM retention order prediction approach by Bach et al. (2018). The RankSVM model is trained using seven publicly available RT datasets. Six where published by Stanstrup et al. (2015) along with their RT mapping tool PredRet: UFZ_Phenomenex, FEM_long, FEM_orbitrap_plasma, FEM_orbitrap_urine, FEM_short and Eawag_XBridgeC18. The seventh dataset contains examples for which RTs were published as part of the training dataset for the CASMI 2016 challenge (Schymanski et al., 2017). The joint dataset covered four different chromatographic columns all using H2O → MeOH (with 0.1% formic acid as additive) as eluent. In total, the dataset contained 1248 (molecule, RT)-tuples of 890 unique molecular structures, after the same pre-processing as in Bach et al. (2018) was applied. We represent the molecular structures using Substructure counting fingerprints calculated with rcdk and CDK 2.2 (Willighagen et al., 2017). We use the MinMax-kernel (Ralaivola et al., 2005) to calculate the similarity between the fingerprints. For our experiments, we build an individual RankSVM model for each (MS, RT)-tuple subsample (Section 3.1), ensuring no molecular structure in the subsample is used for the RankSVM training.
3.3 MS2-based match scores from MetFrag and IOKR
We apply MetFrag (Ruttkies et al., 2016) and IOKR (Brouard et al., 2016) as representative methods to obtain MS2 matching scores for the molecular structures in the candidate list of each MS2 spectrum.
3.3.1 MetFrag
We use the latest MetFrag version 2.4.5 (http://msbi.ipb-halle.de/ cruttkie/metfrag/MetFrag2.4.5-CL.jar) and utilize it as described in Ruttkies et al. (2016). The MS2 matching scores are calculated using the FragmenterScore feature of MetFrag.
3.3.2 IOKR
Two IOKR models are trained, for negative and positive mode MS2 spectra, respectively. The training (MS2, molecular structure)-tuples are extracted from GNPS (Wang et al., 2016), Massbank and the CASMI 2016 training data. We remove training molecular structures that appear in our evaluation datasets (Section 3.1). This results in 3255 negative and 6773 positive mode training examples. We use a uniform combination of 16 MS2 spectra and fragmentation tree (FT) kernels as input kernel (Supplementary Section S4). On the output side, we use the same molecular fingerprint definitions as Dührkop et al. (2019) as feature representation and a Gaussian kernel those distances are derived from the Tanimoto kernel (Brouard et al., 2019) as output kernel. For all MS2 spectra in our evaluation datasets, we calculate the FTs using SIRIUS 4.0.1 (Dührkop et al., 2019) and keep the highest scoring tree for each spectra to calculate the MS2 and FT kernels used by the IOKR.
3.4 Performance evaluation
In our experiments, we use the top-k accuracy to determine the metabolite identification performance, i.e. the percentage of correctly ranked molecular candidates at rank k or less. Different approaches can be used to determine the rank of the correct structure. We follow the protocol used by Schymanski et al. (2017). If multiple stereo-isomers were present in the candidate list, only the one with the highest MS2-score was retained. The correct molecular structure was found by comparing the InChIs containing no stereo information. The top-k accuracies are calculated the test sets.
3.5 Hyper-parameter estimation
The training set of each individual subsample is used to determine optimal weighting D between MS and retention order information. For that, we run the score-integration framework for a different D values, and calculate the area under-the-ranking curve up to rank 20: , where is the number of correct structures up to rank i and N is the number of MS features. Subsequently, we select the retention order weight with the highest (Supplementary Section S2). The optimal sigmoid parameter k is estimated using Platt’s method (Lin et al., 2007; Platt, 2000) calibrated using RankSVM’s training data (Section 3.2).
3.6 Experiments
3.6.1 Full MS2 information available
We compare our approach for combining MS2 and RT information for metabolite identification against the baseline, which only uses MS2 information for the candidate ranking. This allowed us to quantify the performance gain by using RTs. Furthermore, we applied two recently published methods for the integration of MS2 and RT scores and compared them to our approach. The first one is MetFrag 2.2 (Ruttkies et al., 2016), which exploits the RT information by establishing a linear relationship between the candidates’ predicted LogP values with the observed RTs. Each molecular candidate receives an additional score by comparing its predicted RT against that observed for the corresponding MS2 spectra. We use the CDK (Willighagen et al., 2017) XLogP predictions, which are automatically calculated by the MetFrag software. The weight of the RT feature RetentionTimeScore is determined as described in Section 3.5. Our second comparison is the approach by Bach et al. (2018), which uses predicted retention order and dynamic programming over a chain-graph connecting adjacent MS features. Bach et al. (2018) focussed in extracting the most likely assignment z [Equation (3)] given the chain-graph using dynamic programming. Here, we use our generalized framework to also compute the marginals of all candidates given the chain-graph Tchain. The approach by Bach et al. (2018) implicitly used a hinge-sigmoid to compute edge potentials: . Its parameter k is determined as described in the Supplementary Section S2. We refer to this method as Chain-graph.
3.6.2 Missing MS2
4 Results
4.1 Parameters of our framework
This section investigates the influence of different settings for framework, such as number of random spanning trees or the marginal type.
4.1.1 Number of random spanning-trees and marginal type
Figure 2 shows the top-k accuracy as a function of the number of random spanning trees L averaged across the datasets and ionizations. The identification performance increases for larger L, whereby the improvement per tree decreases. For the top-1 performance remains similar for trees. However, for top-20, we observe improvements till L = 128. Figure 2 also shows that the max-marginal approach performs slightly better than the sum-margin. An explanation could be that max-marginal is more robust against candidate configurations z with very low probability. The sum-marginal averages over such cases, whereas the max-marginal only includes the one with maximum probability.
4.1.2 Comparison of the edge potential functions
The average metabolite identification performance does not differ much between two edge potential functions (see Supplementary Table S1). This is interesting specifically for the Step-function, which uses the predicted retention orders in a binary fashion only. However, the Sigmoid function still can significantly outperform the Step-function for top-1 and top-5 accuracy.
4.2 Performance of our score integration framework
Here, we compare our score-integration framework with other methods and evaluate it under different data setups. We use L = 128 with max-marginals and the Sigmoid as edge potential function for the experiments.
4.2.1 Comparison to other approaches
In Table 2, we compare the performance of our score-integration framework with other approaches from the literature that utilize RT information for metabolite identification. It can be seen that our framework performs well across all datasets and ionization modes and we reach significant improvements over the baseline (Only MS2). Especially for the positive mode spectra, our method seems to have an advantage, as both competing approaches, cannot consistently improve the identification by including RT information. The least performance improvement of our approach can be observed for the negative CASMI dataset, which might be due to the small training set. The other approaches, MetFrag 2.2 and Chain-graph, can consistently (top-1, 5, 10 and 20) improve the results only on particular (dataset, ionization mode) combinations. However, they almost always increase top-1 performance. The results in Table 3 show that our framework significantly outperforms MetFrag 2.2 and Chain-graph in terms of identification performance.
. | . | Negative . | . | . | . | Positive . | . | . | . |
---|---|---|---|---|---|---|---|---|---|
Dataset . | Method . | Top-1 . | Top-5 . | Top-10 . | Top-20 . | Top-1 . | Top-5 . | Top-10 . | Top-20 . |
CASMI 2016 | MS2 + RT (our) | 15.2 (***) | 47.2 (***) | 57.0 (**) | 70.1 (***) | 14.0 (***) | 40.7 (***) | 52.2 (***) | 62.8 (***) |
MS2 + RT (Chain-graph) | 13.2 (***) | 49.4 (***) | 61.0 (***) | 69.4 (***) | 11.9 | 36.5 | 50.2 (***) | 60.7 (***) | |
MS2 + RT (MetFrag 2.2) | 14.0 (***) | 42.0 | 55.5 | 71.2 (***) | 13.7 (***) | 36.2 | 46.2 | 57.5 | |
Only MS2 | 11.1 | 44.2 | 55.3 | 68.0 | 11.8 | 37.3 | 47.0 | 58.3 | |
EA Massbank | MS2 + RT (our) | 28.7 (***) | 61.9 (***) | 73.8 (***) | 83.6 (***) | 27.3 (***) | 61.6 (***) | 72.9 (***) | 80.7 (***) |
MS2 + RT (Chain-graph) | 27.2 (***) | 59.5 (***) | 72.4 (***) | 81.8 (***) | 23.9 (***) | 59.2 | 70.1 | 79.1 (***) | |
MS2 + RT (MetFrag 2.2) | 30.2 (***) | 59.2 (***) | 73.6 (***) | 84.4 (***) | 24.0 (***) | 59.0 | 69.5 | 77.1 | |
Only MS2 | 22.8 | 57.6 | 69.5 | 78.5 | 21.2 | 59.0 | 69.7 | 77.6 |
. | . | Negative . | . | . | . | Positive . | . | . | . |
---|---|---|---|---|---|---|---|---|---|
Dataset . | Method . | Top-1 . | Top-5 . | Top-10 . | Top-20 . | Top-1 . | Top-5 . | Top-10 . | Top-20 . |
CASMI 2016 | MS2 + RT (our) | 15.2 (***) | 47.2 (***) | 57.0 (**) | 70.1 (***) | 14.0 (***) | 40.7 (***) | 52.2 (***) | 62.8 (***) |
MS2 + RT (Chain-graph) | 13.2 (***) | 49.4 (***) | 61.0 (***) | 69.4 (***) | 11.9 | 36.5 | 50.2 (***) | 60.7 (***) | |
MS2 + RT (MetFrag 2.2) | 14.0 (***) | 42.0 | 55.5 | 71.2 (***) | 13.7 (***) | 36.2 | 46.2 | 57.5 | |
Only MS2 | 11.1 | 44.2 | 55.3 | 68.0 | 11.8 | 37.3 | 47.0 | 58.3 | |
EA Massbank | MS2 + RT (our) | 28.7 (***) | 61.9 (***) | 73.8 (***) | 83.6 (***) | 27.3 (***) | 61.6 (***) | 72.9 (***) | 80.7 (***) |
MS2 + RT (Chain-graph) | 27.2 (***) | 59.5 (***) | 72.4 (***) | 81.8 (***) | 23.9 (***) | 59.2 | 70.1 | 79.1 (***) | |
MS2 + RT (MetFrag 2.2) | 30.2 (***) | 59.2 (***) | 73.6 (***) | 84.4 (***) | 24.0 (***) | 59.0 | 69.5 | 77.1 | |
Only MS2 | 22.8 | 57.6 | 69.5 | 78.5 | 21.2 | 59.0 | 69.7 | 77.6 |
Note: Compares our score-integration framework (MS2 + RT (our)), against the baseline (Only MS2), MetFrag 2.2 with predicted RT and the Chain-graph model. The best performance for each dataset and ionization is indicated by bold-font. The stars (*) represent the significant improvement over the baseline calculated using a one-sided Wilcoxon signed-rank test on the sample top-k accuracies (P < 0.05 (*), P < 0.01 (**) and P < 0.001 (***)).
. | . | Negative . | . | . | . | Positive . | . | . | . |
---|---|---|---|---|---|---|---|---|---|
Dataset . | Method . | Top-1 . | Top-5 . | Top-10 . | Top-20 . | Top-1 . | Top-5 . | Top-10 . | Top-20 . |
CASMI 2016 | MS2 + RT (our) | 15.2 (***) | 47.2 (***) | 57.0 (**) | 70.1 (***) | 14.0 (***) | 40.7 (***) | 52.2 (***) | 62.8 (***) |
MS2 + RT (Chain-graph) | 13.2 (***) | 49.4 (***) | 61.0 (***) | 69.4 (***) | 11.9 | 36.5 | 50.2 (***) | 60.7 (***) | |
MS2 + RT (MetFrag 2.2) | 14.0 (***) | 42.0 | 55.5 | 71.2 (***) | 13.7 (***) | 36.2 | 46.2 | 57.5 | |
Only MS2 | 11.1 | 44.2 | 55.3 | 68.0 | 11.8 | 37.3 | 47.0 | 58.3 | |
EA Massbank | MS2 + RT (our) | 28.7 (***) | 61.9 (***) | 73.8 (***) | 83.6 (***) | 27.3 (***) | 61.6 (***) | 72.9 (***) | 80.7 (***) |
MS2 + RT (Chain-graph) | 27.2 (***) | 59.5 (***) | 72.4 (***) | 81.8 (***) | 23.9 (***) | 59.2 | 70.1 | 79.1 (***) | |
MS2 + RT (MetFrag 2.2) | 30.2 (***) | 59.2 (***) | 73.6 (***) | 84.4 (***) | 24.0 (***) | 59.0 | 69.5 | 77.1 | |
Only MS2 | 22.8 | 57.6 | 69.5 | 78.5 | 21.2 | 59.0 | 69.7 | 77.6 |
. | . | Negative . | . | . | . | Positive . | . | . | . |
---|---|---|---|---|---|---|---|---|---|
Dataset . | Method . | Top-1 . | Top-5 . | Top-10 . | Top-20 . | Top-1 . | Top-5 . | Top-10 . | Top-20 . |
CASMI 2016 | MS2 + RT (our) | 15.2 (***) | 47.2 (***) | 57.0 (**) | 70.1 (***) | 14.0 (***) | 40.7 (***) | 52.2 (***) | 62.8 (***) |
MS2 + RT (Chain-graph) | 13.2 (***) | 49.4 (***) | 61.0 (***) | 69.4 (***) | 11.9 | 36.5 | 50.2 (***) | 60.7 (***) | |
MS2 + RT (MetFrag 2.2) | 14.0 (***) | 42.0 | 55.5 | 71.2 (***) | 13.7 (***) | 36.2 | 46.2 | 57.5 | |
Only MS2 | 11.1 | 44.2 | 55.3 | 68.0 | 11.8 | 37.3 | 47.0 | 58.3 | |
EA Massbank | MS2 + RT (our) | 28.7 (***) | 61.9 (***) | 73.8 (***) | 83.6 (***) | 27.3 (***) | 61.6 (***) | 72.9 (***) | 80.7 (***) |
MS2 + RT (Chain-graph) | 27.2 (***) | 59.5 (***) | 72.4 (***) | 81.8 (***) | 23.9 (***) | 59.2 | 70.1 | 79.1 (***) | |
MS2 + RT (MetFrag 2.2) | 30.2 (***) | 59.2 (***) | 73.6 (***) | 84.4 (***) | 24.0 (***) | 59.0 | 69.5 | 77.1 | |
Only MS2 | 22.8 | 57.6 | 69.5 | 78.5 | 21.2 | 59.0 | 69.7 | 77.6 |
Note: Compares our score-integration framework (MS2 + RT (our)), against the baseline (Only MS2), MetFrag 2.2 with predicted RT and the Chain-graph model. The best performance for each dataset and ionization is indicated by bold-font. The stars (*) represent the significant improvement over the baseline calculated using a one-sided Wilcoxon signed-rank test on the sample top-k accuracies (P < 0.05 (*), P < 0.01 (**) and P < 0.001 (***)).
Method . | Top-1 . | . | Top-20 . | . |
---|---|---|---|---|
(MS2 + RT) . | Chain-graph . | MetFrag 2.2 . | Chain-graph . | MetFrag 2.2 . |
Our | ||||
Chain-graph | — | n.s. | — | |
MetFrag 2.2 | — | n.s. | — |
Method . | Top-1 . | . | Top-20 . | . |
---|---|---|---|---|
(MS2 + RT) . | Chain-graph . | MetFrag 2.2 . | Chain-graph . | MetFrag 2.2 . |
Our | ||||
Chain-graph | — | n.s. | — | |
MetFrag 2.2 | — | n.s. | — |
Note: We show the P-values for testing the improvement of the row over the column method using a one-sided Wilcoxon signed-rank test. The test is performed over all top-k accuracy samples (datasets and ionization). MetFrag 2.2 and Chain-graph could not significantly outperform our framework. P-values are marked with ‘n.s.’.
Method . | Top-1 . | . | Top-20 . | . |
---|---|---|---|---|
(MS2 + RT) . | Chain-graph . | MetFrag 2.2 . | Chain-graph . | MetFrag 2.2 . |
Our | ||||
Chain-graph | — | n.s. | — | |
MetFrag 2.2 | — | n.s. | — |
Method . | Top-1 . | . | Top-20 . | . |
---|---|---|---|---|
(MS2 + RT) . | Chain-graph . | MetFrag 2.2 . | Chain-graph . | MetFrag 2.2 . |
Our | ||||
Chain-graph | — | n.s. | — | |
MetFrag 2.2 | — | n.s. | — |
Note: We show the P-values for testing the improvement of the row over the column method using a one-sided Wilcoxon signed-rank test. The test is performed over all top-k accuracy samples (datasets and ionization). MetFrag 2.2 and Chain-graph could not significantly outperform our framework. P-values are marked with ‘n.s.’.
4.2.2 Influence of MS2 scoring method
Table 4 shows the performance using of our score-integration framework for two difference MS2 scoring methods, MetFrag and IOKR (Section 3.3). Retention order information (MS2 + RT) can improve the identification performance in both cases; however the improvement with IOKR scores is lower.
MS2-scorers . | Method . | Top-1 . | Top-5 . | Top-10 . | Top-20 . |
---|---|---|---|---|---|
MetFrag | MS2 + RT | 21.3 | 52.9 | 64.0 | 74.3 |
Only MS2 | 16.7 | 49.5 | 60.4 | 70.6 | |
IOKR | MS2 + RT | 26.7 | 52.1 | 62.5 | 70.3 |
Only MS2 | 25.1 | 49.5 | 60.3 | 67.6 |
MS2-scorers . | Method . | Top-1 . | Top-5 . | Top-10 . | Top-20 . |
---|---|---|---|---|---|
MetFrag | MS2 + RT | 21.3 | 52.9 | 64.0 | 74.3 |
Only MS2 | 16.7 | 49.5 | 60.4 | 70.6 | |
IOKR | MS2 + RT | 26.7 | 52.1 | 62.5 | 70.3 |
Only MS2 | 25.1 | 49.5 | 60.3 | 67.6 |
MS2-scorers . | Method . | Top-1 . | Top-5 . | Top-10 . | Top-20 . |
---|---|---|---|---|---|
MetFrag | MS2 + RT | 21.3 | 52.9 | 64.0 | 74.3 |
Only MS2 | 16.7 | 49.5 | 60.4 | 70.6 | |
IOKR | MS2 + RT | 26.7 | 52.1 | 62.5 | 70.3 |
Only MS2 | 25.1 | 49.5 | 60.3 | 67.6 |
MS2-scorers . | Method . | Top-1 . | Top-5 . | Top-10 . | Top-20 . |
---|---|---|---|---|---|
MetFrag | MS2 + RT | 21.3 | 52.9 | 64.0 | 74.3 |
Only MS2 | 16.7 | 49.5 | 60.4 | 70.6 | |
IOKR | MS2 + RT | 26.7 | 52.1 | 62.5 | 70.3 |
Only MS2 | 25.1 | 49.5 | 60.3 | 67.6 |
4.2.3 Influence of the candidate set
Here, we study the effect of two commonly used ways of defining the candidate lists of molecular structures: first approach (‘All’) includes all molecules with a matching exact mass to the list, and the second approach (‘Correct MF’) only includes molecular structures matching the pre-determined MF [e.g. SIRIUS (Dührkop et al., 2019) uses this approach]. To determine the effect of the candidate set definition on our framework, we modify the CASMI dataset, such that for a spectrum i only candidates are used that have the same MFs as the correct structures. This leads to significantly reduced candidate sets: For the positive mode spectra, the median number of candidates decreases from 919 to 238 and for the negative ones from 420 to 58. For the Massbank data, we cannot do this modification, as the candidates are already restricted to structures with the correct MF. Table 5 shows that the baseline performance using MetFrag MS2 scores (Only MS2) improves after filtering of the candidates. Further improvement can be reached by using retention order information (MS2 + RT) even though the absolute improvement is slightly lower than without candidate filtering. For IOKR, we notice that RT information significantly improves the top-k accuracies when all matching exact mass candidates are used, whereas when the candidate sets only contain molecules with the correct, i.e. ground truth, MF, using RT information can only improve top-10 and top-20 accuracies.
. | . | MetFrag . | . | . | . | IOKR . | . | . | . |
---|---|---|---|---|---|---|---|---|---|
Candidate Set . | Method . | Top-1 . | Top-5 . | Top-10 . | Top-20 . | Top-1 . | Top-5 . | Top-10 . | Top-20 . |
All | MS2 + RT | 14.6 (***) | 44.0 (***) | 54.6 (***) | 66.5 (***) | 26.0 (***) | 48.0 (***) | 60.0 (***) | 69.1 (***) |
Only MS2 | 11.4 | 40.7 | 51.2 | 63.2 | 24.4 | 46.0 | 58.4 | 65.5 | |
Correct MF | MS2 + RT | 17.7 (***) | 48.4 (***) | 59.8 (***) | 71.0 (***) | 30.6 | 52.3 | 66.2 (***) | 75.1 (*) |
Only MS2 | 13.1 | 46.0 | 56.9 | 68.7 | 30.6 | 53.9 | 65.3 | 74.8 |
. | . | MetFrag . | . | . | . | IOKR . | . | . | . |
---|---|---|---|---|---|---|---|---|---|
Candidate Set . | Method . | Top-1 . | Top-5 . | Top-10 . | Top-20 . | Top-1 . | Top-5 . | Top-10 . | Top-20 . |
All | MS2 + RT | 14.6 (***) | 44.0 (***) | 54.6 (***) | 66.5 (***) | 26.0 (***) | 48.0 (***) | 60.0 (***) | 69.1 (***) |
Only MS2 | 11.4 | 40.7 | 51.2 | 63.2 | 24.4 | 46.0 | 58.4 | 65.5 | |
Correct MF | MS2 + RT | 17.7 (***) | 48.4 (***) | 59.8 (***) | 71.0 (***) | 30.6 | 52.3 | 66.2 (***) | 75.1 (*) |
Only MS2 | 13.1 | 46.0 | 56.9 | 68.7 | 30.6 | 53.9 | 65.3 | 74.8 |
Note: The stars (*) represent the significant improvement over the Only MS2 (see Table 2 for details on the significance test).
. | . | MetFrag . | . | . | . | IOKR . | . | . | . |
---|---|---|---|---|---|---|---|---|---|
Candidate Set . | Method . | Top-1 . | Top-5 . | Top-10 . | Top-20 . | Top-1 . | Top-5 . | Top-10 . | Top-20 . |
All | MS2 + RT | 14.6 (***) | 44.0 (***) | 54.6 (***) | 66.5 (***) | 26.0 (***) | 48.0 (***) | 60.0 (***) | 69.1 (***) |
Only MS2 | 11.4 | 40.7 | 51.2 | 63.2 | 24.4 | 46.0 | 58.4 | 65.5 | |
Correct MF | MS2 + RT | 17.7 (***) | 48.4 (***) | 59.8 (***) | 71.0 (***) | 30.6 | 52.3 | 66.2 (***) | 75.1 (*) |
Only MS2 | 13.1 | 46.0 | 56.9 | 68.7 | 30.6 | 53.9 | 65.3 | 74.8 |
. | . | MetFrag . | . | . | . | IOKR . | . | . | . |
---|---|---|---|---|---|---|---|---|---|
Candidate Set . | Method . | Top-1 . | Top-5 . | Top-10 . | Top-20 . | Top-1 . | Top-5 . | Top-10 . | Top-20 . |
All | MS2 + RT | 14.6 (***) | 44.0 (***) | 54.6 (***) | 66.5 (***) | 26.0 (***) | 48.0 (***) | 60.0 (***) | 69.1 (***) |
Only MS2 | 11.4 | 40.7 | 51.2 | 63.2 | 24.4 | 46.0 | 58.4 | 65.5 | |
Correct MF | MS2 + RT | 17.7 (***) | 48.4 (***) | 59.8 (***) | 71.0 (***) | 30.6 | 52.3 | 66.2 (***) | 75.1 (*) |
Only MS2 | 13.1 | 46.0 | 56.9 | 68.7 | 30.6 | 53.9 | 65.3 | 74.8 |
Note: The stars (*) represent the significant improvement over the Only MS2 (see Table 2 for details on the significance test).
4.3 Missing MS2
In Figure 3, we show the identification accuracy using our score-integration framework compared to the baseline (Only MS) when only some percentage of the MS features has an MS2 spectrum. The features without spectra only use the precursor mass as MS information (Section 3.6). We vary the percentage from 0% to 100% with 25%-point steps. The retention order weight D was optimized using the 100% setting. At 0%, the score-integration framework only uses the mass of the candidates and their predicted retention order for the ranking. In the absence of MS2 information, we observe a high performance gain for top-20. The more MS2 information we add, the smaller the gain in top-20 accuracy using the retention orders. The fact that RT is a weaker information than MS2 could explain this observation. The more MS2 are available, the less additional information RT can add. For the top-1, there is constant improvement for all MS2%’s.
5 Discussion
In this article, we have put forward a rigorous probabilistic framework for the integration of MS-based candidate structure and retention order predictions. Our framework allows the use of any of the popular models, such as CSI: FingerID, IOKR or Metfrag for scoring candidate structures on MS data.
Our method takes into account the retention orders of all candidate structure pairs in distinct candidate lists through an approximated fully connected MRF model. It generally achieves higher quality structural annotations of observed MS features than using a single Markov chain as implied in the Bach et al. (2018) model. It also improves on the method of Ruttkies et al. (2016), which uses predicted RTs, in three out of four experiments. For the latter approach, we believe using the RankSVM scores instead of the predicted LogP values could improve the performance. Both measures are proxies for retention behaviour and our results show that the RankSVM predicts the retention order more accurately than the LogP values (see Supplementary Table S2). We also demonstrate that our framework improves the identifications, if only a subset of the MS features come with an MS2 spectrum. The framework is computationally efficient, e.g. ranking the candidates for a set of N = 75 MS features takes <4 min (see Section S.5), and can be trained using modest-sized datasets.
The amount of improvement using RT information was shown to depend on the dataset and MS2 scorer (here MetFrag or IOKR). This indicates that RT information rather fine tunes the ranking given by the MS2 scorer, e.g. by better tie-breaking. The underlying factors could be ambiguities in the candidate sets that can be only be resolved by RT or molecular features that cannot be predicted by MS. Stereochemistry is an obvious factor, but annotations of stereochemistry are not always provided for the RT databases limiting the use of this information for training better retention order prediction models. Thus improved modelling of stereochemistry features is an important open problem (Witting and Böcker, 2020). A further research direction could be to include the LC system’s configuration, e.g. column or eluent, into the retention order prediction. As LC systems can be configured to separate certain molecular classes, this could provide additional information to certain molecular candidates. Also, using the LC peak shape to train a model directly predicting the retention order probabilities could be more accurate, e.g. by incorporating RT variance. However, such data are currently not part of the public RT databases.
Acknowledgements
E.B. likes to thank Emma Schymanski for answering numerous questions on mass spectrometry and Sandor Szedmak for his helpful comments on the computational framework. All authors thank Justin J.J. van der Hooft for critical reading and corrections.
Funding
This work was supported by the Academy of Finland [grant 310107 (MACOME)]; and the Aalto Science-IT infrastructure. S.R. and J.H.W. were supported by the Engineering and Physical Sciences Research Council [EP/R018634/1]. This work started during a visit by J.R. to S.R., funded through the Scottish Informatics and Computing Science Alliance (SICSA) distinguished visiting fellow scheme.
Conflict of Interest: none declared.