Probabilistic framework for integration of mass spectrum and retention time information in small molecule identification

Abstract

Motivation: Identification of small molecules in a biological sample remains a major bottleneck in molecular biology, despite a decade of rapid development of computational approaches for predicting molecular structures using mass spectrometry (MS) data. Recently, there has been increasing interest in utilizing other information sources, such as liquid chromatography (LC) retention time (RT), to improve identifications based solely on MS information, such as precursor mass-to-charge ratio and tandem mass spectrometry (MS2).

Results: We put forward a probabilistic modelling framework to integrate MS and RT data of multiple features in an LC-MS experiment. We model the MS measurements and all pairwise retention order information as a Markov random field and use efficient approximate inference for scoring and ranking potential molecular structures. Our experiments show improved identification accuracy by combining MS2 data and retention orders using our approach, thereby outperforming state-of-the-art methods. Furthermore, we demonstrate the benefit of our model when only a subset of LC-MS features has MS2 measurements available besides MS1.

Availability and implementation: Software and data are freely available at https://github.com/aalto-ics-kepaco/msms_rt_score_integration.

Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
The identification of small molecules, such as metabolites or drugs, in biological samples is a challenging task posing a bottleneck in various research fields, such as biomedicine, biotechnology, environmental chemistry and drug discovery. In untargeted metabolomics studies, the samples typically contain thousands of different molecules, the vast majority of which remain unidentified (Aksenov et al., 2017; da Silva et al., 2015). Liquid chromatography (LC) coupled with tandem mass spectrometry (MS2) is arguably the most important measurement platform in metabolomics (Blaženović et al., 2018), due to its suitability for high-throughput screening, its high sensitivity and its applicability to a wide range of molecules. Briefly explained, LC separates molecules by their differential physicochemical interaction between the stationary and mobile phases, which results in retention time (RT) differences, and MS separates molecular ions by their mass-to-charge ratio (MS1). Subsequently, MS2 can be used to fragment molecules in a narrow mass window and to record the fragment intensities (MS2 spectrum). In an untargeted metabolomics experiment, large sets of MS features (MS1 and RT, plus optionally MS2) are observed, corresponding to the different molecules in the sample. Metabolite identification then concerns the structural annotation of the observed MS features.
In recent years, numerous powerful approaches (Nguyen et al., 2018a; Schymanski et al., 2017) to predict molecular structure annotations for MS2 spectra have been developed (Allen et al., 2014; Brouard et al., 2016; Dührkop et al., 2015, 2019; Nguyen et al., 2018b, 2019; Ruttkies et al., 2016, 2019). Typically, these methods output a ranked list of molecular structure candidates that can be shown to human experts or further post-processed, e.g. by using additional information available for the analysed sample. Sources of additional information include, e.g. RT (Bach et al., 2018; Ruttkies et al., 2016; Samaraweera et al., 2018), collision cross-section (Plante et al., 2019) or prior knowledge on the data-generating process, such as the source organism's metabolic characteristics (Rutz et al., 2019). RT, i.e. the time that a molecule takes to elute from the LC column, is readily available in all LC-MS pipelines and is frequently used to aid annotation (Stanstrup et al., 2015). A basic technique is to use the difference between the observed and predicted RT (Domingo-Almenara et al., 2019; Samaraweera et al., 2018) to prune the list of candidate molecular structures. A major challenge for utilizing RT information, however, is that the RT of the same molecule can vary significantly across different LC systems and configurations, necessitating system-specific RT reference databases and RT predictors. Different approaches have been proposed to tackle this challenge, such as using physicochemical properties (e.g. the partition coefficient, LogP) as RT proxies (Hu et al., 2018; Ruttkies et al., 2016), RT mapping across LC systems (Stanstrup et al., 2015) or predicting retention orders, which are largely preserved within a family of LC systems, e.g. reversed phase (Bach et al., 2018; Liu et al., 2019). Using LogP as an RT proxy is simple to implement, but only models the hydrophobic separation effects of the LC system. RT mapping, on the other hand, is limited to pairs of LC systems in which the same molecules have been measured. Retention order prediction can overcome these drawbacks by learning the LC system's separation directly from RT data of multiple systems (Bach et al., 2018).
This study proposes a probabilistic framework to integrate MS1- or MS2-based annotations with predicted retention orders for improved small molecule identification, given a set of MS features measured within one LC-MS run, building on the work by Bach et al. (2018) and Del Carratore et al. (2019). The latter proposed a probabilistic approach for integrating different types of additional information with MS1 data, including RT information. We too define a probabilistic approach, but differ in how RT is handled. Where Del Carratore et al. (2019) use absolute RT information, we follow Bach et al. (2018) and use pairwise retention order predictions for molecules eluting within the same LC-MS run. In contrast to the work by Bach et al. (2018), our model makes use of the pairwise retention order information between all MS features rather than only those adjacent in terms of their RTs, resulting in more accurate annotations. Furthermore, our model allows ranking the full candidate lists, instead of just returning the most likely candidate assignment for each MS feature, as done by Bach et al. (2018).
Our framework models the score integration as an inference problem on a graphical model, where the edges correspond to retention order predictions, the nodes correspond to MS features, and the node labels correspond to candidate molecular structures, scored by an MS2-based predictor, such as CSI:FingerID (Dührkop et al., 2015), MetFrag (Ruttkies et al., 2016) or IOKR (Brouard et al., 2016), or, in the absence of MS2 information, by the MS1 precursor mass deviation. This graph is fully connected, which makes exact inference an NP-hard problem. To address this challenge, we resort to efficient approximate inference, in particular spanning tree approximations (Marchand et al., 2014; Pletscher et al., 2009; Su and Rousu, 2015; Wainwright et al., 2005).

Overall workflow
We assume data arising from a typical LC-MS-based experimental workflow (including chromatographic peak picking and alignment): MS features consisting of an MS1 measurement and the associated RT, with a subset of the features additionally including an MS2 spectrum. In the following, we present our score-integration model in its most general form, in which it is provided with MS features and a set of possible candidate molecular structures. The candidate list can be generated, e.g. by querying a structure database such as ChemSpider (Pence and Williams, 2010) for molecular structures with the same mass as the observed MS feature. In addition, we assume that a score is assigned to each candidate structure, either by an MS2-based predictor or, if no MS2 spectrum is available, based on the mass deviation of the candidate from the measured MS mass. For all molecular candidate pairs associated with different MS features, the retention order is predicted; here, we use the Ranking Support Vector Machine (RankSVM)-based predictor by Bach et al. (2018). The candidate structure scores and predicted retention orders are integrated through a probabilistic graphical model (described in the following), which allows us to rank the molecular candidate structures by their inferred marginal probabilities, given both the MS and RT information.
More formally, the output of an LC-MS experiment is given as a tuple set D = {(x_i, t_i, C_i)}, with x_i ∈ X being the spectrum of feature i (either an MS2 spectrum, or a spectrum containing only a single peak at the mass of the precursor ion if no MS2 information is available), t_i ∈ ℝ≥0 being its RT, and C_i = {m_i1, …, m_in_i} ⊆ M being the associated molecular candidates. Here, m_ir ∈ M represents a molecular candidate structure and n_i is the number of molecular candidates for the i-th MS feature. Figure 1 shows an overview of our workflow.
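As an illustration, the tuple set D can be represented in Python as follows. The feature values and candidate identifiers below are hypothetical, and the container layout is only a sketch of the notation above:

```python
from dataclasses import dataclass, field

@dataclass
class MSFeature:
    # Spectrum x_i: list of (m/z, intensity) peaks; a single peak at the
    # precursor mass if only MS1 information is available.
    spectrum: list
    rt: float                                        # retention time t_i >= 0
    candidates: list = field(default_factory=list)   # candidate set C_i

# Hypothetical two-feature experiment D = {(x_i, t_i, C_i)}:
features = [
    MSFeature(spectrum=[(180.063, 1.0)], rt=120.5,
              candidates=["candidate-A", "candidate-B"]),
    MSFeature(spectrum=[(146.046, 1.0)], rt=98.2,
              candidates=["candidate-C"]),
]
```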

Probabilistic model
Let G = (V, E) be an undirected graph in which each node i ∈ V represents one observed MS feature, with an edge for every MS feature pair: E = {(i, j) | i, j ∈ V, i ≠ j}. The edge set E does not contain any parallel edges. The number of MS features is denoted by N, i.e. |V| = N. We associate each node i in the vertex set with a discrete random variable z_i that takes values from the space Z_i = {1, …, n_i}. Intuitively, z_i defines which candidate has been assigned to the i-th MS feature. The full vector z = {z_i | i ∈ V} corresponds to the molecular structure assignment of each MS feature in the LC-MS experiment, and it takes values from the set Z = Z_1 × ⋯ × Z_N. In this work, we consider Z to be fixed and finite for a given set of MS features, due to our definition of the molecular candidate sets, which assumes that we can restrict the putative annotations for a given MS feature.
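A minimal Python sketch of the fully connected graph G and the label spaces Z_i; the candidate-set sizes are hypothetical:

```python
from itertools import combinations

# Fully connected graph over N MS features (hypothetical small example):
N = 4
V = list(range(N))              # one node per MS feature
E = list(combinations(V, 2))    # one edge per unordered feature pair
assert len(E) == N * (N - 1) // 2   # no parallel edges, no self-loops

# Label space Z_i for each node: the indices of its candidate set
n = [3, 5, 2, 4]                            # hypothetical candidate counts n_i
Z = [list(range(n_i)) for n_i in n]         # Z_i = {0, ..., n_i - 1}
```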

Markov random field
The probability distribution of z is given as a pairwise Markov random field (MRF) (MacKay, 2005):

p(z) = (1/Z) ∏_{i∈V} ψ_i(z_i) ∏_{(i,j)∈E} ψ_ij(z_i, z_j),    (1)

which is composed of node (ψ_i) and edge (ψ_ij) potential functions and omits higher-order cliques (hence the term pairwise). Above, ψ_i : Z_i → ℝ>0 is the potential function of node i, measuring how well candidate z_i matches the measured MS information, ψ_ij : Z_i × Z_j → ℝ>0 encodes the consistency between the observed retention order of MS features i and j and the predicted retention order of their candidates z_i and z_j, and Z = Σ_{z'∈Z} ∏_{i∈V} ψ_i(z'_i) ∏_{(i,j)∈E} ψ_ij(z'_i, z'_j) is the partition function.

Node potential function ψ_i

The node potentials build on MS2-based matching scores, e.g. from MetFrag (Ruttkies et al., 2016) or IOKR (Brouard et al., 2016). We use these two in our experiments as representative MS2 scoring methods (Section 3.3). MetFrag performs an in silico fragmentation of m_ir, compares the resulting fragment peaks with the observed ones in x_i and outputs a matching score. IOKR, on the other hand, can be used to directly predict a matching score f(x, m) for any (MS2 spectrum, molecular structure) tuple. All matching scores h_ir are normalized to the range [0, 1]. Finally, we express the potential of a molecular candidate m_ir given the spectrum x_i as:

ψ_i(z_i = r) = h_ir + c,

where c > 0 is a constant used to avoid zero potentials. In our experiments, we select c such that it is 10 times smaller than the minimum of all non-zero scores across all candidate sets.
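The node potential can be sketched as follows, assuming the additive form h_ir + c described above; the score values in the usage example are hypothetical:

```python
import numpy as np

def node_potentials(scores, factor=10.0):
    """psi_i(r) = h_ir + c for the candidates of one MS feature.

    `scores` are raw matching scores, normalized here to [0, 1]; c is
    chosen `factor` times smaller than the smallest non-zero normalized
    score (for illustration, within this one candidate set, whereas the
    text takes the minimum across all candidate sets).
    """
    h = np.asarray(scores, dtype=float)
    h = (h - h.min()) / (h.max() - h.min())  # normalize scores to [0, 1]
    c = h[h > 0].min() / factor              # small constant avoiding zero potentials
    return h + c
```

For example, `node_potentials([0.2, 0.8, 0.5])` yields strictly positive potentials that preserve the candidate ranking.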

Edge potential function ψ_ij
For each candidate pair (r, s) ∈ Z_i × Z_j associated with the MS feature pair (i, j), we compute how well the candidates' predicted retention order is aligned with the observed one defined by the RTs t_i and t_j. To this end, we apply the framework for retention order prediction developed by Bach et al. (2018). The edge potential ψ_ij(z_i = r, z_j = s) is defined as follows:

ψ_ij(z_i = r, z_j = s) = σ( sign(t_j − t_i) · (⟨w, φ_js⟩ − ⟨w, φ_ir⟩) ),

where w ∈ ℝ^|F_m| is the RankSVM's parameter vector, φ_ir, φ_js ∈ F_m are the feature vectors of the candidates' molecular structures, and σ : ℝ → (0, 1] is a monotonic function mapping the predicted preference-value difference to a value between zero and one. In our experiments, we consider two mapping functions:

Sigmoid: σ_sigmoid(x) = 1 / (1 + exp(−kx))

Step function: σ_step(x), which thresholds on the sign of x.

The two functions can be interpreted as follows. The sigmoid makes full use of the information in the RankSVM margin, i.e. the score of each candidate pair depends on the preference-score difference. In this work, we consider k a hyper-parameter of our method that needs to be estimated from data (Section 3.5). The step function, on the other hand, only differentiates between aligned and misaligned pairs.
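A sketch of the edge potential computation in Python; the sigmoid follows the formula above, while the concrete values of the step function (for ties and misaligned pairs) are an assumption here:

```python
import numpy as np

def sigma_sigmoid(x, k=1.0):
    # Maps the preference-value difference into (0, 1); k is a hyper-parameter.
    return 1.0 / (1.0 + np.exp(-k * x))

def sigma_step(x, eps=0.1):
    # Step function (assumed form): distinguishes only aligned (x > 0)
    # from misaligned (x < 0) candidate pairs; ties receive 0.5.
    return np.where(x > 0, 1.0, np.where(x < 0, eps, 0.5))

def edge_potential(pref_ir, pref_js, t_i, t_j, sigma=sigma_sigmoid):
    """psi_ij(z_i = r, z_j = s): agreement between the observed retention
    order (sign of t_j - t_i) and the predicted RankSVM preference values
    pref_ir = <w, phi_ir> and pref_js = <w, phi_js>."""
    return sigma(np.sign(t_j - t_i) * (pref_js - pref_ir))
```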

Weighting of information sources
To control the contribution of each information source, i.e. MS information and retention orders, we introduce a weighting of the potential functions, controlled by a parameter D ∈ [0, 1]. A value of D close to one, for example, results in a score based mainly on the observed retention orders. In our experiments, we explain how this hyper-parameter can be estimated in practice (Section 3.5).
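The exact form of the weighted potentials is not reproduced above; one plausible form, assumed here for illustration, raises the node and edge potentials to complementary powers of D:

```python
import numpy as np

def weighted_potentials(psi_node, psi_edge, D):
    """Blend MS (node) and retention order (edge) information, D in [0, 1].

    Power weighting is an assumption here: D = 1 flattens the node
    potentials (score driven purely by retention orders), while D = 0
    flattens the edge potentials (score driven purely by MS information).
    """
    return psi_node ** (1.0 - D), psi_edge ** D
```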

Ranking candidates through approximated marginals
We rank the molecular candidates using the marginals of the MRF (1). The marginal for candidate r of MS feature i is given as:

p(z_i = r) = Σ_{z'∈Z : z'_i = r} p(z').    (2)

In practice, the calculation of (2) is intractable due to the size of the domain Z of z, which grows exponentially with the number of MS features; we therefore resort to approximate inference methods.

Tree approximation of G
To enable feasible inference of (2), we approximate the MRF (1) using spanning trees of the original graphical model G (Marchand et al., 2014; Pletscher et al., 2009; Su and Rousu, 2015; Wainwright et al., 2005). In the following, let T be a spanning tree of G with the same nodes but an edge set E(T) ⊆ E with |E(T)| = N − 1, which ensures that T is a cycle-free, singly connected component. The probability distribution of the graphical model induced by T is given as:

p(z | T) = (1/Z_T) ∏_{i∈V} ψ_i(z_i) ∏_{(i,j)∈E(T)} ψ_ij(z_i, z_j).    (3)

As the graphical model associated with (3) is a tree, we can infer its marginals exactly using the sum-product algorithm (MacKay, 2005), a message-passing algorithm based on dynamic programming with linear time complexity in the number of MS features. See, e.g. MacKay (2005) for further details on the algorithm.
The outputs of the sum-product algorithm are the unnormalized marginals μ(z_i = r | T_t) for all i ∈ V and r ∈ Z_i. We calculate the normalized marginals as follows (MacKay, 2005):

p(z_i = r | T_t) = μ(z_i = r | T_t) / Σ_{r'∈Z_i} μ(z_i = r' | T_t).
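For the special case of a chain-structured tree (such as T_chain), the sum-product recursions and the normalization step can be sketched in Python as follows:

```python
import numpy as np

def chain_marginals(node_pot, edge_pot):
    """Exact normalized marginals p(z_i = r | T) on a chain-structured tree
    via the sum-product algorithm (two message-passing sweeps).

    node_pot: list of N 1-D arrays, node_pot[i][r] = psi_i(r)
    edge_pot: list of N-1 2-D arrays, edge_pot[i][r, s] = psi_{i,i+1}(r, s)
    """
    N = len(node_pot)
    # Forward messages: fwd[i][r] sums over all assignments of nodes 0..i-1
    fwd = [np.ones(len(node_pot[0]))]
    for i in range(1, N):
        fwd.append(edge_pot[i - 1].T @ (node_pot[i - 1] * fwd[i - 1]))
    # Backward messages: bwd[i][r] sums over all assignments of nodes i+1..N-1
    bwd = [None] * N
    bwd[N - 1] = np.ones(len(node_pot[N - 1]))
    for i in range(N - 2, -1, -1):
        bwd[i] = edge_pot[i] @ (node_pot[i + 1] * bwd[i + 1])
    # Combine messages and normalize per node
    out = []
    for i in range(N):
        mu = node_pot[i] * fwd[i] * bwd[i]   # unnormalized marginal
        out.append(mu / mu.sum())
    return out
```

On a general spanning tree, the same two-sweep scheme is rooted at an arbitrary node; the chain case is shown here for brevity.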

Random spanning trees sampling
We compare two approaches to retrieve spanning trees from G. The first approach is to randomly sample spanning trees from G (cf. Pletscher et al., 2009; Su and Rousu, 2015; Wainwright et al., 2005). We sample the trees by applying a minimum weighted spanning tree algorithm to a random adjacency matrix. If the RTs of an MS feature pair (i, j) are equal, i.e. t_i = t_j, then the corresponding edge is not sampled. This is justified by the observation that MS features with an RT difference of zero impose no constraints on the retention order of their corresponding candidates. We refer to a sampled spanning tree as T_t. The second approach was implicitly used by Bach et al. (2018) and corresponds to a linear Markov chain in which edges connect MS features adjacent in terms of increasing RT; this can be seen as a degenerate spanning tree. In the remaining text, we refer to this tree as T_chain.
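A sketch of the sampling procedure: a minimum spanning tree algorithm (here Kruskal's) applied to uniformly random edge weights, with equal-RT edges excluded as described above. Function and variable names are illustrative:

```python
import random

def random_spanning_tree(rts, rng=None):
    """Sample a spanning tree of the complete graph over len(rts) features
    by running Kruskal's MST algorithm on uniformly random edge weights.
    Edges between features with equal RTs are excluded, since a zero RT
    difference constrains no retention order."""
    rng = rng or random.Random()
    N = len(rts)
    edges = [(rng.random(), i, j)
             for i in range(N) for j in range(i + 1, N)
             if rts[i] != rts[j]]
    parent = list(range(N))          # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    tree = []
    for _, i, j in sorted(edges):    # cheapest (i.e. random) edges first
        ri, rj = find(i), find(j)
        if ri != rj:                 # accept edge if it joins two components
            parent[ri] = rj
            tree.append((i, j))
            if len(tree) == N - 1:
                break
    return tree
```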

Averaged marginal over a random spanning tree ensemble
Using tree-like graphical models for the inference is motivated by the exact and fast inference they enable. However, a single tree, such as T_chain or a sampled T_t, will most likely be only a rough approximation of the original probability distribution (1). Therefore, the use of random spanning tree ensembles T = {T_t}_{t=1}^{L} has been proposed. In particular, Wainwright et al. (2005) show that an expectation over trees can be used to obtain an upper bound on the maximum a posteriori (MAP) estimate of the original graph, and that this approximation can be tight if the underlying trees agree on the MAP configuration. More recently, Marchand et al. (2014) demonstrated generalization bounds for joint learning and inference using tree ensembles. More applied work on tree-based approximations can be found in Pletscher et al. (2009), who use majority voting, and Su and Rousu (2015), who empirically study several aggregation schemes in multilabel classification. Motivated by this literature, we opted to average the marginals of a random spanning tree ensemble T, where for each tree T_t we independently retrieve the marginals using the sum-product algorithm:

p(z_i = r) ≈ (1/L) Σ_{t=1}^{L} p(z_i = r | T_t).    (4)
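Averaging the per-tree marginals of Equation (4) is then a simple mean over the ensemble; the marginal values in the test are hypothetical:

```python
import numpy as np

def averaged_marginals(per_tree_marginals):
    """Average normalized marginals over an ensemble of L spanning trees:
    p(z_i = r) ~ (1/L) * sum_t p(z_i = r | T_t).

    per_tree_marginals: list over trees; each entry is a list over nodes
    of 1-D marginal arrays (as returned by the sum-product on one tree).
    """
    # zip(*...) regroups the per-tree lists by node
    return [np.mean(ms, axis=0) for ms in zip(*per_tree_marginals)]
```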

Max-marginals
The exact inference on trees allows us to use the max-marginal as an alternative to the sum-marginal shown in Equation (2). The max-marginal is closely related to the MAP estimate. For a single tree T, it is given as:

p_max(z_i = r | T) = max_{z'∈Z : z'_i = r} p(z' | T).

The interpretations of the two marginals (sum and max) differ slightly: whereas the sum-marginal expresses the sum of the probabilities of all configurations z' with z'_i = r, the max-marginal is the maximum probability that a configuration constrained to z'_i = r can reach. In our experiments, we compare the performance of both marginal types (Section 4.1). The unnormalized max-marginals μ_max(z_i = r | T) are calculated with the max-product algorithm, a modification of the sum-product algorithm in which summations are replaced by maximizations. The normalized max-marginal can be calculated as (MacKay, 2005):

p_max(z_i = r | T) = μ_max(z_i = r | T) / Σ_{r'∈Z_i} μ_max(z_i = r' | T).    (5)

By plugging Equation (5) into (4) instead of the sum-marginal, we obtain the averaged max-marginal p_max.
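The max-product variant replaces each sum with a maximization; for a chain-structured tree it can be sketched as follows, with per-node normalization by the sum of the unnormalized max-marginals:

```python
import numpy as np

def chain_max_marginals(node_pot, edge_pot):
    """Max-product on a chain: the sum-product recursions with sums
    replaced by maximizations. Returns normalized max-marginals.

    node_pot: list of N 1-D arrays, node_pot[i][r] = psi_i(r)
    edge_pot: list of N-1 2-D arrays, edge_pot[i][r, s] = psi_{i,i+1}(r, s)
    """
    N = len(node_pot)
    # Forward: fwd[i][r] maximizes over assignments of nodes 0..i-1
    fwd = [np.ones(len(node_pot[0]))]
    for i in range(1, N):
        fwd.append(np.max(edge_pot[i - 1] *
                          (node_pot[i - 1] * fwd[i - 1])[:, None], axis=0))
    # Backward: bwd[i][r] maximizes over assignments of nodes i+1..N-1
    bwd = [None] * N
    bwd[N - 1] = np.ones(len(node_pot[N - 1]))
    for i in range(N - 2, -1, -1):
        bwd[i] = np.max(edge_pot[i] *
                        (node_pot[i + 1] * bwd[i + 1])[None, :], axis=1)
    out = []
    for i in range(N):
        mu = node_pot[i] * fwd[i] * bwd[i]   # unnormalized max-marginal
        out.append(mu / mu.sum())
    return out
```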

Run-time complexity
Calculating the marginals for an individual tree and all MS features has run-time complexity O(N · n_max²). Here, N is the total number of features and n_max = max_{i∈V} |Z_i| is the maximum number of molecular candidates per feature.

Evaluation datasets
To evaluate our score-integration approach, we use two publicly available datasets. These are described in this section and summarized in Table 1.

CASMI 2016
The Critical Assessment of Small Molecule Identification (CASMI) challenge is a contest organized for the computational mass spectrometry community (Schymanski et al., 2017). For its 2016 edition, a dataset of 208 (MS2, RT) tuples was released, containing 81 negative and 127 positive ionization mode MS2 spectra. The molecular candidate structure sets were extracted from ChemSpider by the challenge organizers, using a ±5 ppm window around the monoisotopic exact mass of the correct candidate. For each dataset and ionization mode, we repeatedly subsample training and test (MS2, RT) tuple sets: CASMI (negative) 50 times, N_train = 31, N_test = 50; CASMI (positive) 50 times, N_train = 52, N_test = 75; EA (negative) 50 times, N_train = 45, N_test = 65; and EA (positive) 100 times, N_train = 50, N_test = 100. No molecular structure, as determined by its InChI representation, appears simultaneously in training and test sets. The training sets are used for hyper-parameter selection (Section 3.5) and the test sets are used to assess the average identification performance of our score-integration framework (Section 3.4).

Training setup for the retention order predictor
To calculate the edge potentials of our MRF model (1), we use the RankSVM retention order prediction approach by Bach et al. (2018). The RankSVM model is trained using seven publicly available RT datasets. Six were published by Stanstrup et al. (2015) along with their RT mapping tool PredRet: UFZ_Phenomenex, FEM_long, FEM_orbitrap_plasma, FEM_orbitrap_urine, FEM_short and Eawag_XBridgeC18. The seventh dataset contains examples whose RTs were published as part of the training data for the CASMI 2016 challenge (Schymanski et al., 2017). The joint dataset covers four different chromatographic columns, all using an H2O → MeOH gradient (with 0.1% formic acid as additive) as eluent. In total, the dataset contains 1248 (molecule, RT) tuples covering 890 unique molecular structures, after applying the same pre-processing as in Bach et al. (2018). We represent the molecular structures using substructure-counting fingerprints calculated with rcdk and CDK 2.2 (Willighagen et al., 2017), and use the MinMax kernel (Ralaivola et al., 2005) to calculate the similarity between the fingerprints. For our experiments, we build an individual RankSVM model for each (MS, RT) tuple subsample (Section 3.1), ensuring that no molecular structure in the subsample is used for RankSVM training.

MS2-based match scores from MetFrag and IOKR
We apply MetFrag (Ruttkies et al., 2016) and IOKR (Brouard et al., 2016) as representative methods to obtain MS2 matching scores for the molecular structures in the candidate list of each MS2 spectrum.

IOKR
Two IOKR models are trained, for negative and positive mode MS2 spectra, respectively. The training (MS2, molecular structure) tuples are extracted from GNPS (Wang et al., 2016), MassBank and the CASMI 2016 training data. We remove training molecular structures that appear in our evaluation datasets (Section 3.1).

Performance evaluation
In our experiments, we use the top-k accuracy to determine the metabolite identification performance, i.e. the percentage of correctly ranked molecular candidates at rank k or less. Different approaches can be used to determine the rank of the correct structure; we follow the protocol used by Schymanski et al. (2017). If multiple stereoisomers are present in the candidate list, only the one with the highest MS2 score is retained. The correct molecular structure is found by comparing the InChIs containing no stereo information. The top-k accuracies are calculated on the test sets.

Hyper-parameter estimation
The training set of each individual subsample is used to determine the optimal weighting D between MS and retention order information. For that, we run the score-integration framework for different values of D and calculate the area under the ranking curve up to rank 20:

top20AUC = (1/20) Σ_{i=1}^{20} top(i)/N,

where top(i) is the number of correct structures at rank i or better and N is the number of MS features. Subsequently, we select the retention order weight with the highest top20AUC (Supplementary Section S2). The optimal sigmoid parameter k is estimated using Platt's method (Lin et al., 2007; Platt, 2000), calibrated using the RankSVM's training data (Section 3.2).
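The top20AUC criterion can be computed directly from the ranks of the correct structures; a minimal sketch:

```python
def top20_auc(ranks, n_features, k_max=20):
    """top20AUC = (1/20) * sum_{i=1}^{20} top(i) / N, where top(i) is the
    number of correct structures at rank <= i and N is the number of MS
    features. `ranks` holds the rank of the correct candidate per feature."""
    total = sum(sum(1 for r in ranks if r <= i) for i in range(1, k_max + 1))
    return total / (k_max * n_features)
```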

Full MS2 information available
We compare our approach for combining MS2 and RT information for metabolite identification against the baseline, which uses only MS2 information for the candidate ranking. This allows us to quantify the performance gain obtained by using RTs. Furthermore, we apply two recently published methods for the integration of MS2 and RT scores and compare them to our approach. The first one is MetFrag 2.2 (Ruttkies et al., 2016), which exploits the RT information by establishing a linear relationship between the candidates' predicted LogP values and the observed RTs. Each molecular candidate receives an additional score by comparing its predicted RT against the one observed for the corresponding MS2 spectrum. We use the CDK (Willighagen et al., 2017) XLogP predictions, which are automatically calculated by the MetFrag software. The weight of the RT feature, RetentionTimeScore, is determined as described in Section 3.5. Our second comparison is the approach by Bach et al. (2018), which uses predicted retention orders and dynamic programming over a chain graph connecting adjacent MS features. Bach et al. (2018) focused on extracting the most likely assignment z [Equation (3)] given the chain graph using dynamic programming. Here, we use our generalized framework to also compute the marginals of all candidates given the chain graph T_chain. The approach by Bach et al. (2018) implicitly used a hinge sigmoid to compute the edge potentials: σ_hinge(x) = min(2 / (1 + exp(−kx)), 1.0). Its parameter k is determined as described in Supplementary Section S2. We refer to this method as Chain-graph.

Missing MS2
In a second experiment, we simulate the common application scenario in which, during an LC-MS experiment, a set of MS features (MS and RT) has been measured, but MS2 spectra have only been acquired for a subset of the features. There can be multiple reasons for this, such as limited measuring time when using, e.g. data-dependent acquisition protocols (Xiao et al., 2012), bad fragmentation quality, or the inability to deconvolute all spectra when using data-independent acquisition. In this case, besides the RT, only MS1-related information is available for some features, which includes the mass of the ion (precursor m/z) and its isotope pattern. We use our proposed score-integration framework to perform structural identification while the proportion of MS features with an MS2 spectrum varies. We vary the percentage of available MS2 spectra from 0% to 100% and investigate how the joint use of MS1, MS2 and RT information can improve over the baseline relying solely on MS information. For the candidates r of an MS feature i simulated to be without an MS2 spectrum, we assign the following candidate score (Del Carratore et al., 2019):

h_ir = exp( −(m_i − m_ir)² / (2σ²) ),

where m_i is the neutral exact mass of the measured ion, calculated from the precursor m/z using the ground-truth adduct, here either [M+H]+ (positive) or [M−H]− (negative), m_ir is the exact mass of candidate r associated with MS feature i, and σ = ppm · m_i is the variance of the Gaussian noise model. ppm expresses the MS-device accuracy, which we set to ppm = 5.
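The MS1-only candidate score can be sketched as a Gaussian of the mass deviation; the scaling of sigma by 10⁻⁶ (converting the ppm accuracy into a relative mass tolerance) is an assumption here:

```python
import math

def ms1_candidate_score(m_obs, m_cand, ppm=5.0):
    """Gaussian mass-deviation score for a candidate without an MS2 spectrum
    (after Del Carratore et al., 2019). m_obs is the neutral exact mass of
    the measured ion, m_cand the candidate's exact mass."""
    sigma = ppm * m_obs * 1e-6   # ppm-based mass tolerance (assumed scaling)
    return math.exp(-((m_obs - m_cand) ** 2) / (2.0 * sigma ** 2))
```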

Parameters of our framework
This section investigates the influence of different settings of our framework, such as the number of random spanning trees or the marginal type. Figure 2 shows the top-k accuracy as a function of the number of random spanning trees L, averaged across the datasets and ionization modes. The identification performance increases for larger L, while the improvement per additional tree decreases. The top-1 performance remains similar for L ≥ 16 trees; however, for top-20, we observe improvements up to L = 128. Figure 2 also shows that the max-marginal approach performs slightly better than the sum-marginal. An explanation could be that the max-marginal is more robust against candidate configurations z with very low probability: the sum-marginal accumulates such cases, whereas the max-marginal only considers the configuration with maximum probability.

Comparison of the edge potential functions
The average metabolite identification performance does not differ much between the two edge potential functions (see Supplementary Table S1). This is interesting specifically for the step function, which uses the predicted retention orders only in a binary fashion. However, the sigmoid function can still significantly outperform the step function in terms of top-1 and top-5 accuracy.

Performance of our score integration framework
Here, we compare our score-integration framework with other methods and evaluate it under different data setups. We use L = 128 with max-marginals and the sigmoid edge potential function for the experiments.

Comparison to other approaches
In Table 2, we compare the performance of our score-integration framework with other approaches from the literature that utilize RT information for metabolite identification. Our framework performs well across all datasets and ionization modes, and it reaches significant improvements over the baseline (Only MS2). Especially for the positive mode spectra, our method seems to have an advantage, as neither of the competing approaches can consistently improve the identification by including RT information. The smallest performance improvement of our approach is observed for the negative CASMI dataset, which might be due to the small training set. The other approaches, MetFrag 2.2 and Chain-graph, improve the results consistently (top-1, 5, 10 and 20) only on particular (dataset, ionization mode) combinations; however, they almost always increase the top-1 performance. The results in Table 3 show that our framework significantly outperforms MetFrag 2.2 and Chain-graph in terms of identification performance. Table 4 shows the performance of our score-integration framework for two different MS2 scoring methods, MetFrag and IOKR (Section 3.3). Retention order information (MS2 + RT) can improve the candidate ranking for both scorers.

Note (Table 2): We compare our score-integration framework (MS2 + RT (our)) against the baseline (Only MS2), MetFrag 2.2 with predicted RT and the Chain-graph model. The best performance for each dataset and ionization mode is indicated in bold font. The stars represent significant improvement over the baseline, calculated using a one-sided Wilcoxon signed-rank test on the sample top-k accuracies (P < 0.05 (*), P < 0.01 (**) and P < 0.001 (***)).

Table 5 shows that the baseline performance using MetFrag MS2 scores (Only MS2) improves after filtering of the candidates.
Further improvement can be reached by using retention order information (MS2 + RT), even though the absolute improvement is slightly lower than without candidate filtering. For IOKR, we notice that RT information significantly improves the top-k accuracies when all candidates matching the exact mass are used, whereas when the candidate sets only contain molecules with the correct, i.e. ground-truth, molecular formula (MF), RT information can only improve the top-10 and top-20 accuracies.

Missing MS2
In Figure 3, we show the identification accuracy of our score-integration framework compared to the baseline (Only MS) when only a percentage of the MS features has an MS2 spectrum. The features without spectra use only the precursor mass as MS information (Section 3.6). We vary the percentage from 0% to 100% in steps of 25 percentage points. The retention order weight D was optimized using the 100% setting. At 0%, the score-integration framework uses only the mass of the candidates and their predicted retention orders for the ranking. In the absence of MS2 information, we observe a large performance gain for top-20. The more MS2 information we add, the smaller the gain in top-20 accuracy obtained from the retention orders. The fact that RT carries weaker information than MS2 could explain this observation: the more MS2 spectra are available, the less additional information RT can add. For top-1, there is a consistent improvement at all MS2 percentages.

Discussion
In this article, we have put forward a rigorous probabilistic framework for the integration of MS-based candidate structure scores and retention order predictions. Our framework allows the use of any of the popular models, such as CSI:FingerID, IOKR or MetFrag, for scoring candidate structures based on MS data. Our method takes into account the retention orders of all candidate structure pairs across distinct candidate lists through an approximated fully connected MRF model. It generally achieves higher-quality structural annotations of the observed MS features than using a single Markov chain, as implied by the Bach et al. (2018) model. It also improves on the method of Ruttkies et al. (2016), which uses predicted RTs, in three out of four experiments. For the latter approach, we believe using the RankSVM scores instead of the predicted LogP values could improve the performance: both measures are proxies for retention behaviour, and our results show that the RankSVM predicts the retention order more accurately than the LogP values (see Supplementary Table S2). We also demonstrate that our framework improves the identifications if only a subset of the MS features comes with an MS2 spectrum. The framework is computationally efficient, e.g. ranking the candidates for a set of N = 75 MS features takes <4 min (see Supplementary Section S5), and can be trained using modest-sized datasets.
The amount of improvement from RT information was shown to depend on the dataset and the MS2 scorer (here MetFrag or IOKR). This indicates that RT information rather fine-tunes the ranking given by the MS2 scorer, e.g. by better tie-breaking. The underlying factors could be ambiguities in the candidate sets that can only be resolved by RT, or molecular features that cannot be predicted from MS. Stereochemistry is an obvious factor, but stereochemistry annotations are not always provided in RT databases, limiting the use of this information for training better retention order prediction models. Improved modelling of stereochemistry features is thus an important open problem (Witting and Böcker, 2020). A further research direction could be to include the LC system's configuration, e.g. column or eluent, in the retention order prediction. As LC systems can be configured to separate certain molecular classes, this could provide additional information for certain molecular candidates. Also, using the LC peak shape to train a model that directly predicts retention order probabilities could be more accurate, e.g. by incorporating RT variance. However, such data are currently not part of the public RT databases.

Acknowledgements

E.B. would like to thank Emma Schymanski for answering numerous questions on mass spectrometry and Sandor Szedmak for his helpful comments on the

Table 5. Top-k accuracies averaged on the CASMI data (pos. & neg.) using either MetFrag or IOKR as MS2 scorer for two different candidate sets: 'All' molecules queried using a mass window; only those with the 'correct molecular formula'.