UC2 search: using unique connectivity of uncharged compounds for metabolite annotation by database searching in mass spectrometry-based metabolomics

Abstract

Summary: In metabolite annotation for metabolomics, variations in the registered states of compounds (charged molecules and multiple components, such as salts) and the redundancy of compounds among databases can cause misannotations and hamper immediate recognition of the uniqueness of metabolites when searching by mass values measured using mass spectrometry. We developed a search system named UC2 (Unique Connectivity of Uncharged Compounds), in which compounds are tentatively neutralized into uncharged states and stored on the basis of their unique connectivity of atoms after removing their stereochemical information, using the first block in the hash of the IUPAC International Chemical Identifier. As a result, false-positive hits are remarkably reduced, both charged and uncharged compounds are properly searched in a single query, and records having a unique connectivity are compiled in a single search result.

Availability and implementation: The UC2 search tool is available free of charge as a REST web service (http://webs2.kazusa.or.jp/mfsearcher) and a Java-based GUI tool.

Supplementary information: Supplementary data are available at Bioinformatics online.

ions are anthocyanins as pigment compounds in plants and quaternary ammonium compounds such as phosphatidylcholines. These compounds are usually registered as charged molecules in compound databases. As we show in Supplementary Table S1 When the neutralized mass value of 579.1503 for [M+H]+ was searched, four candidates were found.
Next, we checked the appropriateness of these candidates by considering their charges. To be valid, the 29 candidates found when the search assumed [M]+ must also be registered as [M]+ in the databases. We checked the structures of these candidates on the websites of the original compound databases and found that all 29 were registered as neutral molecules. Therefore, these candidates were all false positives. Similarly, the four candidates found assuming [M+H]+ were checked and were also found to be false positives, because they were registered as [M]+ (Supplementary Fig. S2).
Using UC2, the false positives in the conventional search are eliminated. With UC2, charged compounds in the original databases are tentatively neutralized by adding or removing hydrogens to or from the formulae, and the tentatively neutralized mass values are registered. Hydrogen is selected for adjustment of the neutralized mass, because in most cases, [M+H]  In the next example, appropriate candidates are found. When a mass value of 859.21374 was applied in a conventional search, 12 candidates for [M]+ and one candidate for [M+H]+ were found.
All of these were found to have appropriate charges when checked on the database websites. When the mass value was applied in a single UC2 search with [M+H]+, the same 13 candidate compounds were found in five results consisting of the constitutional isomers (Supplementary Fig. S4). The elimination of apparent false positives with mismatching charge is a practical advantage of UC2, because in a conventional search researchers have to verify the appropriateness of the candidates by checking the original databases one by one, as shown in these examples. Small numbers of candidates (30 and 13) are found here, but in practical data, more than 100 candidates are often hit per query, as shown in Supplementary Figs S12-14. Many isomers are known in compound groups such as lipids and flavonoids (Supplementary Table S1, note the low proportion of unique formulae), and this increases the number of candidates. Therefore, checking all candidates found in a conventional search is not practical when thousands of peaks are detected in each LC-MS run.
Using UC2 solves this issue. Among the databases used here, a search function that takes into account a match of charge is only provided by the LIPID MAPS website. KEGG and FlavonoidViewer do not even provide a web user interface to search compounds by mass values. Therefore, the unique cross-database search function using UC2 implemented in the MFSearcher web service and MFSearcher GUI tool should contribute to a better annotation of metabolites detected in untargeted metabolome analyses using LC-MS.

False positives caused by compounds registered as multiple components
When the m/z value of 939.2384, detected in the negative mode (Iijima et al., 2008) One advantage of the UC2 search is that both salt and non-salt entries are obtained in a compiled result. When another m/z value, 346.0558, detected in the negative mode (Iijima et al., 2008), was applied in a UC2 search with [M-H]-, six results with the formula C10H14N5O7P1 were found (Supplementary Fig. S5). One of the results, 'Adenosine monophosphate', included two KEGG entries (C00020 and C18344); the latter is a disodium salt of the former. A further advantage of the UC2 search is that compounds registered only as salts in the databases can also be found (Supplementary Fig. S6).

Obtaining candidates for different dissociation forms by the UC2 search
When a mass value of 496.3398 was applied in a conventional search, two formulae (C24H51NO7P, one entry each in KEGG and HMDB, and C28H48O7, three entries in KEGG) were obtained as [M]+ and a single formula (C24H50NO7P, five entries in LIPID MAPS, one entry in HMDB) was found as [M+H]+. When searched as [M+H]+ using UC2, four constitutional isomers (C24H50NO7P) were obtained, and the abovementioned candidate C28H48O7 was identified as a false positive with mismatching charge. Among the four true positive candidates with C24H50NO7P, one result including the KEGG entry C04102 was a kind of glycerophosphocholine, a quaternary ammonium compound.
In KEGG, this compound is registered in its [M]+ form, whereas it is registered in the neutral form in LIPID MAPS and HMDB (Supplementary Fig. S8). As shown in this example, UC2 can provide candidates registered in different dissociation forms in a single result.  Fig. S9). This result shows that UC2 is applicable to searches for compounds that can hardly be neutralized; the adjustment of the charge by tentatively adding or removing hydrogen atoms to or from the formula is a practical procedure.
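The charge adjustment in this example can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, the mass constants are standard monoisotopic values, and the rule "remove one hydrogen per positive charge, add one per negative charge" is our reading of the tentative neutralization described above. It neutralizes the KEGG cation formula C24H51NO7P and compares the result with the query mass 496.3398 interpreted as [M+H]+.

```python
import re

# Monoisotopic masses (u) of the elements needed for this example.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048,
                "O": 15.99491461956, "P": 30.97376163}
PROTON = 1.00727646688
ELECTRON = 0.00054857991


def parse_formula(formula):
    """Turn a formula string such as 'C24H51NO7P' into element counts."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def tentatively_neutralize(counts, charge):
    """Assumed rule: remove `charge` hydrogens (add them if charge < 0)."""
    neutral = dict(counts)
    neutral["H"] = neutral.get("H", 0) - charge
    if neutral["H"] < 0:
        raise ValueError("not enough hydrogens to neutralize the formula")
    return neutral


def monoisotopic_mass(counts):
    """Sum of monoisotopic element masses (electron mass not corrected)."""
    return sum(MONOISOTOPIC[element] * n for element, n in counts.items())


def neutral_mass_from_mz(mz, adduct):
    """Convert an observed m/z into a neutral mass for a singly charged ion."""
    shifts = {"[M+H]+": -PROTON, "[M-H]-": PROTON, "[M]+": ELECTRON}
    return mz + shifts[adduct]


cation = parse_formula("C24H51NO7P")               # registered as [M]+ in KEGG
neutral = tentatively_neutralize(cation, charge=1)  # one hydrogen removed
query = neutral_mass_from_mz(496.3398, "[M+H]+")
error_ppm = (monoisotopic_mass(neutral) - query) / query * 1e6
print(neutral["H"], round(error_ppm, 2))
```

Under these assumptions the neutralized cation and the neutral entries share one registered mass, so a single [M+H]+ query reaches both, which is the behaviour described for C04102.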

Example of the records of the UC2 database
In the UC2 database, the IDs of the compounds in each compound database are stored based on InChIKey skeletons (the first block in the hash of the IUPAC International Chemical Identifier). service. The head of the data in the table for HMDB is shown.
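To illustrate how records can be keyed by InChIKey skeleton, the sketch below groups database entries by the first 14-letter block of their InChIKey. It is an assumption-laden illustration rather than the actual UC2 table layout: the function names, the database labels, the entry IDs and the InChIKeys are all hypothetical placeholders.

```python
from collections import defaultdict


def inchikey_skeleton(inchikey):
    """Return the first block of an InChIKey (the connectivity hash)."""
    first_block, _, _ = inchikey.partition("-")
    if len(first_block) != 14 or not (first_block.isalpha() and first_block.isupper()):
        raise ValueError(f"unexpected InChIKey format: {inchikey!r}")
    return first_block


def compile_by_skeleton(records):
    """Group (database, entry_id, inchikey) records by InChIKey skeleton."""
    grouped = defaultdict(list)
    for database, entry_id, inchikey in records:
        grouped[inchikey_skeleton(inchikey)].append((database, entry_id))
    return dict(grouped)


# Hypothetical records: the first two differ only in the second block
# (stereochemistry layer), so they collapse into one search result.
records = [
    ("DB_A", "A001", "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),
    ("DB_B", "B042", "AAAAAAAAAAAAAA-CCCCCCCCSA-N"),
    ("DB_A", "A002", "DDDDDDDDDDDDDD-UHFFFAOYSA-N"),
]
compiled = compile_by_skeleton(records)
for skeleton, entries in compiled.items():
    print(skeleton, entries)
```

Grouping on the first block alone is what allows stereoisomers and entries from different databases to appear as a single compiled result.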

Features of the entries in the databases
The numbers of charged entries and unique connectivities differ greatly between the compound databases; therefore, a query of multiple databases with the proper charge is needed to obtain the maximum number of appropriate candidate compounds. Supplementary Table S1 shows a summary of the entries in the compound databases. A substantial number of compounds were stored as charged molecules in each database, especially in FlavonoidViewer (7.2%). As for the charged compounds in FlavonoidViewer, LIPID MAPS and KNApSAcK, less than 2% were those whose uncharged counterparts with the same connectivity were also registered in the same database  1.4). Relatively low ratios of unique InChIKey skeletons were found in the databases with a huge number of entries (UNPD and PubChem, 73% and 78%, respectively), implying that many constitutional isomers and stereoisomers are registered in them (Supplementary Table S1). Low ratios of unique formulae in FlavonoidViewer (15.2%), LIPID MAPS (18.3%), UNPD (14.2%) and PubChem (2.7%) suggest that many isomeric and/or fragmented compounds are included there (Supplementary Table S1). Supplementary Table S3 shows the extent of the shared unique connectivity (InChIKey skeletons) between the databases. shared unique InChIKey skeletons is less than 34%, suggesting that unique compounds are stored in these databases. Especially in HMDB, 37% are shared and hence 63% are unique even when compared with UNPD and PubChem. These results suggest that researchers need to query multiple databases to cover the maximum number of compounds and also need to remove the redundancy of the same compounds between databases.
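The two kinds of ratio discussed here can be expressed with simple set operations. The sketch below is only one plausible way to compute them; the function names and the toy skeleton labels are hypothetical, and the exact definitions used for Supplementary Tables S1 and S3 may differ.

```python
def unique_ratio(skeletons):
    """Fraction of entries that represent distinct InChIKey skeletons."""
    return len(set(skeletons)) / len(skeletons)


def shared_ratio(target, others):
    """Fraction of the target's unique skeletons found in any other database."""
    target_set = set(target)
    pooled = set().union(*others)
    return len(target_set & pooled) / len(target_set)


# Toy skeleton lists (one label per database entry), for illustration only.
db_a = ["S1", "S1", "S2", "S3"]
db_b = ["S2", "S4"]
db_c = ["S3", "S3"]

print(unique_ratio(db_a))
print(shared_ratio(db_a, [db_b, db_c]))
```

A low `unique_ratio` indicates many stereoisomers or duplicated connectivities within one database, whereas a low `shared_ratio` indicates that the database contributes compounds not found elsewhere.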

Details of the data acquired from the compound databases
The structural data of compounds were obtained from the compound database of the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2016)

Construction of the UC2 database
The

Preparation of the metabolite peak lists

Metabolites list in tomato fruits
As an example of manually curated metabolite peaks including secondary metabolites biosynthesized in plants, we used a list of 869 metabolite peaks detected and annotated in tomato fruits using LC-Fourier transform ion cyclotron resonance (FT-ICR) MS (Iijima et al., 2008). The list provided as Supplementary Table S2  metabolites were detected only in the positive or the negative mode, respectively.

Metabolites list in human urine
As an example of a computationally calculated and not curated peak list, we chose data obtained from human urine. Lists of metabolite peaks in human urine were prepared from the raw data published by van der Hooft et al. (2016) as follows: The raw data analysed in ESI positive mode (Pooled_Urine_15_POS.raw) and data analysed in ESI negative mode (Pooled_Urine_14_NEG.raw) using LC-Orbitrap MS in the study MTBLS307 in MetaboLights (Salek et al., 2013) were downloaded. The raw data were converted to mzXML files using ProteoWizard software (version 3.0.6447) (Kessner et al., 2008). Detection of the metabolite peaks and estimation of the adducts were performed using an in-house version of the PowerGet software (Sakurai et al., 2014) that was slightly modified for batch processing. Sets of 1,264 and 1,475 metabolite peaks were detected in the positive and negative modes, respectively.

Random mass list
To examine the effect of biological selection of the metabolites on the database search results, a random mass list in the m/z range of 100-1,500 was computationally generated using JDK 1.7. In total 6,491 and 6,379 mass values were generated to obtain exactly 1,000 mass values that showed results in database searches of the positive and negative modes, respectively.
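The sampling procedure can be sketched as a loop that keeps drawing random masses until the required number of values with database hits is reached. This is an illustration in Python rather than the Java code actually used; `random_mass_list`, `toy_has_hit`, the uniform distribution and the seed are all assumptions, and `toy_has_hit` is merely a stand-in for a real database query.

```python
import random


def random_mass_list(has_hit, n_hits=1000, low=100.0, high=1500.0, seed=0):
    """Draw uniform random masses until n_hits of them return a search hit.

    Returns the masses that gave hits and the total number of draws.
    """
    rng = random.Random(seed)
    hits = []
    generated = 0
    while len(hits) < n_hits:
        mass = rng.uniform(low, high)
        generated += 1
        if has_hit(mass):  # in practice: a query against the compound databases
            hits.append(mass)
    return hits, generated


def toy_has_hit(mass):
    """Stand-in predicate: pretend masses with a small fractional part hit."""
    return (mass % 1.0) < 0.3


hits, generated = random_mass_list(toy_has_hit)
print(len(hits), generated)
```

Because most random masses correspond to no registered compound, the number of draws exceeds the number of retained values, which parallels the 6,491 and 6,379 values generated to obtain 1,000 hits in each mode.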

Comparison of search results from the UC2 database and other compound databases
We compared the results from a search using UC2 (UC2 search) and a search in the normal way The causes of unusual results in queries, namely those queries whose results were only found in either the UC2 search or the conventional search, were manually investigated.  f Chemical structure with repeat units (e.g., C02072 in KEGG database) whose mol file is written as unrepeated structure.