From chromatogram to analyte to metabolite. How to pick horses for courses from the massive web resources for mass spectral plant metabolomics

Abstract The grand challenge currently facing metabolomics is the expansion of the coverage of the metabolome from a minor percentage of the metabolic complement of the cell toward the level of coverage afforded by other post-genomic technologies such as transcriptomics and proteomics. In plants, this problem is exacerbated by the sheer diversity of chemicals that constitute the metabolome, with the number of metabolites in the plant kingdom generally considered to be in excess of 200 000. In this review, we focus on web resources that can be exploited in order to improve analyte and ultimately metabolite identification and quantification. There is a wide range of available software that aids not only in this but also in the related area of peak alignment; however, for the uninitiated, choosing which program to use is a daunting task. For this reason, we provide an overview of the pros and cons of the software as well as comments regarding the level of programming skills required to effectively exploit their basic functions. In addition, the torrent of available genome and transcriptome sequences that followed the advent of next-generation sequencing has opened up further valuable resources for metabolite identification. All things considered, we posit that only via a continued communal sharing of information, such as that deposited in the databases described within the article, are we likely to be able to make significant headway toward improving our coverage of the plant metabolome.

[...] tools are already out of date, have not been updated for a long time, and are no longer used for metabolomics research. However, I really see value in this paper, especially for educational purposes. Therefore, I would very much like the authors to add the date of last update for each tool cited in this manuscript (or for as many as possible). As you know, the evaluation of GO analysis tools is now performed in this way: http://www.nature.com/nmeth/journal/v13/n9/full/nmeth.3963.html?WT.ec_id=NMETH-201609&spMailingID=52180959&spUserID=MzcwMzk3NDY5OTES1&spJobID=985584826&spReportId=OTg1NTg0ODI2S0
Reply: As suggested by both reviewers, the problem of outdated tools available online is a major one. To highlight this in the paper we included a sentence in the background pointing out the importance of evaluating the current state of each resource and referred to the "last updated" dates included in supplementary table 1. Regarding the extension of the manuscript, we briefly described even outdated tools so that the reader can have an idea of the previous developments leading to the current state of the art in each respective step of the metabolomics pipeline.
I know your review is not intended as an evaluation, but you should add information such as 'recommended', 'activity', 'special interest' or 'outstanding interest', as many reviews do. See, for example, COCB reviews: http://www.sciencedirect.com/science/journal/13675931/36/supp/C.
Reply: Included in supplementary table.

Manuscript
Background
Metabolomics emerged in the late 1990s, with the term coined in a review by Steven Oliver [1]. However, it was in the 2000 paper by Fiehn and co-workers that gas chromatography (GC) coupled to mass spectrometry (MS) was used to define the chemical composition of a morphological and metabolic mutant of the model plant Arabidopsis thaliana [2]; in doing so they were able to describe changes in the level of 326 analytes. This work thus greatly extended the early metabolite profiling study of Sauter et al. [3], which presented the technology as a means of putative classification of the mode of action of pesticides. Thus the advent of metabolomics in plants arguably preceded that in microbes and mammals, although the approach was rapidly adopted in these communities also [2,4-6]. During the next two decades metabolomics had one considerable advantage over profiling technologies such as transcriptomics and proteomics in that it is not directly reliant on the genome sequence. During this time the species scope of metabolomics rapidly expanded, such that it was no longer merely a tool for identifying biomarkers of cellular circumstance but additionally one of the cornerstones of systems biology and an approach that could provide mechanistic insight into metabolic regulation [7-11]. This advantage has subsequently disappeared following the widespread adoption of next-generation sequencing, and the lack of a linear relationship between the genome and the metabolome now represents part of the problem in the identification of unknown analytes [12].
This is nicely exemplified by the fact that computation of the size of the metabolome from genome information, as attempted by Nobeli and co-workers in 2003 for the E. coli metabolome [13], rendered values far smaller than the number of metabolites actually measured to date [14]. Whilst the size of the metabolome for prokaryotes has been estimated at a couple of thousand, that of the plant kingdom dwarfs these numbers, with estimates ranging between 200 000 and 1 million metabolites [15]. Within the last two decades metabolomics has been employed to address a wide range of important questions in plant biology, including pathway structure [15], the influence of metabolism on growth [8,16], plant ecology [17], various aspects of plant genetics including evolution and the domestication syndrome [18-20], as well as detailed characterizations of the metabolic response to biotic and abiotic stressors [21,22].

In this review, we discuss two topics. [...] that despite current hurdles regarding comparability of data there is great potential for cross-study comparisons of metabolite responses in determining common responses between either genetic or environmental perturbations of metabolism. Finally, we will provide an outlook as to how the grand challenge of comprehensiveness will best be met and how the power of archived plant metabolic responses will be best exploited in the future.

It is not within the scope of this review to discuss the theoretical details of every procedure or to document the subtle differences between the many similar tools referred to here.
We rather aim to provide a general idea of the importance and challenges of each [...]

Sample preparation and data acquisition
The metabolomics workflow (Figure 1) starts with sample preparation, including extraction and often coupled to pre-treatment and chemical derivatization, followed by data acquisition, which will depend on the chromatographic system, ionization source and analyzer [...]

[...] were recently reviewed by [29,30], and some of the most popular algorithms for feature detection and peak alignment were compared in different works [31,32]. Most software packages integrate both steps in the same pipeline to generate, from raw data, a report of signal intensities across samples, and many of them also include some resource for data analysis and peak annotation, which will be discussed later in more detail. In the following section we will detail the available tools for this step, adopting a similar approach in all subsequent sections (the details of the programs are all given in additional file 1 [...]

[...] modular suite, was developed to provide immediate graphical feedback on every step of the processing pipeline. Its benchmark paper compared the complexity of different algorithms, highlighting the importance of low complexity when dealing with large data files and demonstrating it to be more efficient than MZmine 2 (see below for discussion of this software) and comparable to XCMS, two of the most popular current data processing tools.
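To make the idea of a "report of signal intensities over samples" concrete, the following is a minimal Python sketch of peak alignment. It is not the algorithm used by XCMS, MZmine 2 or any other tool discussed here: it assumes a simple greedy first-match rule, and the names `align_features` and `to_matrix`, the tolerances `mz_tol` and `rt_tol`, and the example peak lists are all hypothetical.

```python
def align_features(samples, mz_tol=0.01, rt_tol=5.0):
    """Greedily align peaks across samples.

    samples maps a sample name to a list of (mz, rt, intensity) tuples.
    A peak joins the first existing row whose m/z and retention time lie
    within the tolerances and which has no peak from that sample yet;
    otherwise it starts a new row. (Illustrative rule, an assumption.)
    """
    aligned = []
    for name, peaks in samples.items():
        for mz, rt, intensity in peaks:
            for row in aligned:
                if (abs(row["mz"] - mz) <= mz_tol
                        and abs(row["rt"] - rt) <= rt_tol
                        and name not in row["intensities"]):
                    row["intensities"][name] = intensity
                    break
            else:
                # No matching row was found: open a new aligned feature.
                aligned.append({"mz": mz, "rt": rt,
                                "intensities": {name: intensity}})
    return aligned


def to_matrix(aligned, sample_names):
    """Build a features x samples intensity table; 0.0 marks a missing peak."""
    return [[row["intensities"].get(s, 0.0) for s in sample_names]
            for row in aligned]


# Hypothetical peak lists: (m/z, retention time in s, intensity).
samples = {
    "wild_type": [(181.071, 120.0, 1000.0), (256.100, 310.0, 500.0)],
    "mutant": [(181.073, 121.5, 1800.0), (300.200, 400.0, 250.0)],
}
aligned = align_features(samples)
matrix = to_matrix(aligned, list(samples))
for row, values in zip(aligned, matrix):
    print(row["mz"], row["rt"], values)
```

In practice, dedicated tools also correct retention-time drift before matching and handle missing values more carefully than the zero fill used here.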
[...] clustering of fragment ions. The updated version is incorporated into the MZmine 2 platform and addresses issues from the first version, such as fragment ions that are produced by more than one co-eluting component, while improving sensitivity and robustness. Finally, MetPP [62] is a processing tool that includes normalization and statistical analysis but is directed towards data emanating from GC×GC-TOF MS systems.

Extracting compound mass spectra is another important step of data processing. It reduces data complexity by many orders of magnitude by identifying m/z signals that belong to the same compound, and it provides essential information for further metabolite annotation through the reconstruction of mass spectra. While this process is usually integrated in GC-MS tools for feature detection, alignment and annotation, as mentioned above, there are many approaches to deal with LC-MS data, such as the one employed by CAMERA [63], a package developed in R to extract compound spectra, annotate isotopes and adducts, and propose compound mass. As an extension to XCMS, it is easy to use in combination with this software and provides a significant reduction in data complexity. AStream [64] is another R package, very similar to CAMERA but using a simpler algorithm for grouping the peaks.
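The grouping of m/z signals that belong to the same compound can be illustrated with a short Python sketch. This is a simplified illustration under stated assumptions, not the actual algorithm of CAMERA or AStream: features are grouped purely by retention-time proximity, isotope peaks are assumed to be singly charged, and the names `Feature`, `group_by_retention_time`, `annotate_isotopes`, the tolerances and the example values are all hypothetical.

```python
from dataclasses import dataclass

C13_C12_DELTA = 1.003355  # mass difference between 13C and 12C, in Da


@dataclass
class Feature:
    mz: float         # mass-to-charge ratio
    rt: float         # retention time in seconds
    intensity: float  # peak intensity (arbitrary units)


def group_by_retention_time(features, rt_window=2.0):
    """Put co-eluting features into the same pseudo-spectrum.

    Features are sorted by retention time; a feature joins the current
    group if it elutes within rt_window of the group's first feature.
    """
    groups = []
    for feat in sorted(features, key=lambda f: f.rt):
        if groups and feat.rt - groups[-1][0].rt <= rt_window:
            groups[-1].append(feat)
        else:
            groups.append([feat])
    return groups


def annotate_isotopes(group, mz_tol=0.005):
    """Return (monoisotopic, isotope) pairs within one pseudo-spectrum.

    A pair is reported when the m/z difference matches one 13C-12C
    spacing (charge 1 assumed) and the heavier peak is less intense.
    """
    pairs = []
    for mono in group:
        for iso in group:
            delta = iso.mz - mono.mz
            if abs(delta - C13_C12_DELTA) <= mz_tol and iso.intensity < mono.intensity:
                pairs.append((mono, iso))
    return pairs


# Hypothetical feature list: three co-eluting signals and one unrelated peak.
features = [
    Feature(181.0707, 120.0, 1000.0),
    Feature(182.0741, 120.3, 66.0),
    Feature(203.0526, 120.5, 400.0),
    Feature(256.1000, 310.0, 500.0),
]
groups = group_by_retention_time(features)
pairs = annotate_isotopes(groups[0])
print(len(groups), [(m.mz, i.mz) for m, i in pairs])
```

Collapsing each group to a single putative compound is what produces the reduction in data complexity described above; real tools typically also compare peak shapes and annotate adducts before proposing a compound mass.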

ALLocator [65] is a web-based workflow that applies centWave from XCMS for feature detection, followed by spectra deconvolution either by CAMERA or by the ALLocatorSD algorithm, which is optimized for dealing with the particularities of 13C-labeled data by grouping mirrored isotopes (lighter isotopologues from feeding experiment [...]

Interpretation of omics data is usually complicated by the amount and complexity of data.