CluMSID: an R package for similarity-based clustering of tandem mass spectra to aid feature annotation in metabolomics

Depke, Tobias; Franke, Raimo; Brönstrup, Mark

doi:10.1093/bioinformatics/btz005

Abstract

Summary

Compound identification is one of the most pressing challenges in the untargeted analysis of complex mixtures of small molecules by mass spectrometry. Similarity of tandem mass spectra can provide valuable information on putative structural similarities between known and unknown analytes and hence aids feature identification in the bioanalytical sciences. We have developed CluMSID (Clustering of MS² spectra for metabolite identification), an R package that enables researchers to use similarities of tandem mass spectra and neutral loss patterns as part of their metabolite annotation workflow. CluMSID offers functions for all analysis steps, from import of raw data to data mining by unsupervised multivariate methods, along with respective (interactive) visualizations. A detailed tutorial with example data is provided as supplementary information.

Availability and implementation

CluMSID is available as an R package from https://github.com/tdepke/CluMSID/ and from https://bioconductor.org/packages/CluMSID/.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

The untargeted analysis of complex mixtures of small molecules using liquid chromatography coupled to (tandem) mass spectrometry (LC-MS(/MS)) has developed into an important technology to study biological systems, from the various applications of metabolomics to natural product research, drug discovery, environmental and forensic sciences. Still, the technique faces various challenges, the most important being metabolite identification (Johnson et al., 2016). To address this issue, multiple tools have been developed that aid identification by different approaches, mostly relying on computational mass spectrometry (Tsugawa, 2018).

In proteomics and metabolomics, similarity of tandem mass spectra is routinely used to gauge the match score of experimental and library spectra (Tabb et al., 2003). It is now also established that similarities in tandem mass spectrometry (MS²) fragmentation can hint towards structural relations between analytes (Garg et al., 2015). In this paper, we present a customizable open access tool for similarity-based clustering of LC-MS/MS data from data-dependent acquisitions. It enables reproducible analyses and can be fully integrated into R pipelines that use, e.g., the popular packages from the ‘xcms’ family. The tool is also applicable to flow injection or GC-EI-MS data. In the following, however, we refer only to the most frequent experimental type, i.e. LC-MS/MS. The tool can handle mass spectra acquired in positive or negative ionization mode on low- and high-resolution mass analyzers. With these assets, CluMSID has the potential to become a valuable extension of the metabolomics data analyst’s toolbox.

2 Implementation and main functions

CluMSID is a highly flexible open source tool written in R that combines MS² spectral similarity comparisons with several unsupervised data mining methods. The data can be accessed at all stages and custom data analysis steps can easily be integrated. CluMSID is available as an R package that can be installed from Bioconductor or downloaded from GitHub and used on all platforms.

2.1 Data import, pre-processing and similarity calculation

For its main functions (Fig. 1), CluMSID requires LC-MS/MS data in a standard non-proprietary format that can be parsed by ‘mzR’ (http://github.com/sneumann/mzR), e.g. mzXML. MS² spectra are extracted from the raw file, and redundant MS² spectra are merged into consensus spectra. This process can either operate stand-alone by grouping spectra from the same precursor ion within a retention time interval defined by the user, or spectra can be assigned to peaks picked by a different method or software by means of an external peak table, e.g. generated by ‘xcms’ (Smith et al., 2006).
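The stand-alone grouping step described above could look roughly like the following Python sketch. It is not CluMSID's implementation: the function names, the dictionary layout of a spectrum, the tolerance values and the simple intensity-averaging rule used for the consensus spectrum are all assumptions made for illustration.

```python
def group_spectra(spectra, mz_tol=0.01, rt_tol=30.0):
    """Group spectra with similar precursor m/z and retention time.

    spectra: list of dicts with keys 'precursor', 'rt' (seconds) and
    'peaks' (list of (mz, intensity) tuples). Hypothetical layout.
    """
    groups = []
    for spec in sorted(spectra, key=lambda s: (s["precursor"], s["rt"])):
        for group in groups:
            last = group[-1]
            if (abs(spec["precursor"] - last["precursor"]) <= mz_tol
                    and abs(spec["rt"] - last["rt"]) <= rt_tol):
                group.append(spec)
                break
        else:
            # no existing group matched, so start a new one
            groups.append([spec])
    return groups


def consensus(group, decimals=2):
    """Merge a group into one spectrum by averaging binned intensities.

    Peaks are binned by rounding m/z; a peak missing from a spectrum
    counts as zero intensity in the average (an assumed rule).
    """
    sums = {}
    for spec in group:
        for mz, intensity in spec["peaks"]:
            key = round(mz, decimals)
            sums[key] = sums.get(key, 0.0) + intensity
    return sorted((mz, total / len(group)) for mz, total in sums.items())


raw = [
    {"precursor": 271.080, "rt": 100.0, "peaks": [(153.018, 100.0), (119.049, 50.0)]},
    {"precursor": 271.081, "rt": 110.0, "peaks": [(153.019, 80.0)]},
    {"precursor": 271.080, "rt": 500.0, "peaks": [(153.018, 60.0)]},
]
groups = group_spectra(raw)
merged = [consensus(g) for g in groups]
```

In this toy input, the first two spectra share a precursor and elute within 30 s of each other, so they form one feature, while the third spectrum elutes much later and is kept separate.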

Fig. 1. Schematic of a CluMSID workflow. The main functions of CluMSID and their logical order are illustrated. Grey boxes signify data (pre-)processing steps that serve as the basis for multidimensional scaling, clustering, correlation networks or other additional analysis tools. White boxes on the right symbolize optional in- or output; the black box indicates required input.

During the merging process, neutral loss patterns are generated for all features by subtraction of each fragment m/z from the precursor m/z. Neutral losses represent uncharged fragments that also convey structural information comparable to charged fragments. The neutral loss patterns can be processed like MS² spectra. Annotations can be added to features that have been identified by the user, e.g. by using online or in-house spectral libraries.
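The subtraction that yields a neutral loss pattern can be illustrated with a few lines of Python. This is a minimal sketch, not CluMSID code: the function name, the rounding to four decimals and the example m/z values are assumptions chosen for illustration.

```python
def neutral_losses(precursor_mz, fragments):
    """Return sorted (loss, intensity) pairs for a fragment list.

    fragments: list of (mz, intensity) tuples. Each neutral loss is the
    precursor m/z minus the fragment m/z and inherits the fragment's
    intensity.
    """
    losses = []
    for mz, intensity in fragments:
        loss = precursor_mz - mz
        if loss > 0:  # skip the precursor peak itself and anything above it
            losses.append((round(loss, 4), intensity))
    return sorted(losses)


# Hypothetical spectrum of a feature with precursor m/z 163.0601
spectrum = [(85.0284, 40.0), (127.0390, 100.0), (145.0495, 25.0)]
nl = neutral_losses(163.0601, spectrum)
```

The smallest loss in this example, about 18.0106, corresponds to the loss of water; the resulting list has the same shape as a fragment spectrum and can therefore be compared with the same similarity measure.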

A distance matrix is produced from the list of (annotated) consensus spectra by calculating spectral similarities for every pairwise combination. The similarity measure used by CluMSID is the spectral contrast angle, also known as cosine similarity or cosine score (Wan et al., 2002). This distance matrix is the basis for the clustering functions and can be used with non-CluMSID functions to customize the analysis workflow.
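As an illustration of this step, the sketch below computes a cosine score for two peak lists and fills a symmetric distance matrix. Several details are assumptions rather than statements about CluMSID: peaks are matched greedily within an absolute m/z tolerance, the distance is taken as 1 minus the similarity, and all names and default values are hypothetical.

```python
import math


def cosine_similarity(spec_a, spec_b, mz_tol=0.01):
    """Cosine score of two spectra given as lists of (mz, intensity)."""
    dot = 0.0
    used = set()
    for mz_a, int_a in spec_a:
        # match to the closest not-yet-used peak of spec_b within tolerance
        best, best_diff = None, mz_tol
        for j, (mz_b, _) in enumerate(spec_b):
            diff = abs(mz_a - mz_b)
            if j not in used and diff <= best_diff:
                best, best_diff = j, diff
        if best is not None:
            used.add(best)
            dot += int_a * spec_b[best][1]
    norm_a = math.sqrt(sum(i * i for _, i in spec_a))
    norm_b = math.sqrt(sum(i * i for _, i in spec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def distance_matrix(spectra, mz_tol=0.01):
    """Pairwise distances (1 - cosine score), computed once per pair."""
    n = len(spectra)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - cosine_similarity(spectra[i], spectra[j], mz_tol)
            dist[i][j] = dist[j][i] = d
    return dist


a = [(100.0, 1.0), (200.0, 2.0)]
b = [(100.0, 2.0), (200.0, 4.0)]  # same pattern as a, twice the intensity
c = [(300.0, 1.0)]                # no peak in common with a or b
dist = distance_matrix([a, b, c])
```

Because the cosine score depends only on relative intensities, spectra a and b are treated as identical (distance close to 0), whereas c shares no peak with either and receives the maximum distance of 1.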

2.2 Mining and visualization of similarity data

Four data mining methods along with respective (interactive) visualizations are included in CluMSID: multidimensional scaling (MDS), density-based clustering with the OPTICS (Ordering points to identify the clustering structure) algorithm (https://cran.r-project.org/web/packages/dbscan/), hierarchical clustering and the generation of correlation networks using the ‘network’ package (Butts, 2008). In the first study that employed a CluMSID prototype, hierarchical clustering proved the most suitable method to putatively annotate P. aeruginosa secondary metabolites (Depke et al., 2017).
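To make the idea of hierarchical clustering on a spectral distance matrix concrete, the following Python sketch performs a plain agglomerative clustering. The choice of average linkage, the distance cut-off and the toy matrix are assumptions for illustration only; the sketch does not reproduce CluMSID's clustering functions or their defaults.

```python
def average_linkage(dist, cut):
    """Agglomerative clustering of a symmetric distance matrix.

    Repeatedly merges the two clusters with the smallest average
    pairwise distance and stops once that distance exceeds `cut`.
    Returns clusters as sorted lists of row indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > cut:
            break
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Toy distance matrix for four features: 0/1 and 2/3 are similar pairs
toy_dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
feature_clusters = average_linkage(toy_dist, cut=0.5)
```

With a cut-off of 0.5 the two similar pairs end up in separate clusters; if one member of a cluster is an annotated metabolite, the other members become candidates for structurally related compounds.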

CluMSID’s correlation network functions are particularly useful for researchers who have worked with ‘Molecular Networking’ before but would like to use a more interactive and flexible tool. The MDS plot and the correlation network visualization can also be generated in an interactive and zoomable version based on the ‘plotly’ package (https://cran.r-project.org/web/packages/plotly/).
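The principle behind such a network can be sketched as follows: features become nodes, and an edge is drawn wherever the spectral similarity reaches a threshold. The threshold value, the conversion of distance to similarity as 1 minus distance and the function names are assumptions for this Python illustration, not a description of the ‘network’-based implementation.

```python
def similarity_edges(dist, min_similarity=0.5):
    """List edges (i, j, similarity) for all pairs above the threshold."""
    edges = []
    for i in range(len(dist)):
        for j in range(i + 1, len(dist)):
            sim = 1.0 - dist[i][j]
            if sim >= min_similarity:
                edges.append((i, j, sim))
    return edges


def components(n, edges):
    """Connected components of the network, found by depth-first search."""
    neighbours = {i: set() for i in range(n)}
    for i, j, _ in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    seen, result = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            comp.append(node)
            for nb in neighbours[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        result.append(sorted(comp))
    return result


# Toy distance matrix: features 0/1 and 2/3 have similar spectra
net_dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
edges = similarity_edges(net_dist)
subnetworks = components(len(net_dist), edges)
```

Raising the threshold makes the network sparser and splits it into smaller groups of closely related spectra, while lowering it connects more distant features.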

2.3 Additional functionalities

CluMSID harbours a set of accessory functions, e.g. to find spectra that contain a specific fragment or neutral loss or to match single spectra against a set of (library) spectra.
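Searching for a specific fragment or neutral loss amounts to a tolerance-based lookup across all spectra, as the following Python sketch shows. The function names, the dictionary layout, the tolerance and the example values are hypothetical and are not taken from the CluMSID API.

```python
def find_fragment(spectra, target_mz, mz_tol=0.005):
    """Return ids of spectra containing a peak at target_mz +/- mz_tol."""
    hits = []
    for spec in spectra:
        if any(abs(mz - target_mz) <= mz_tol for mz, _ in spec["peaks"]):
            hits.append(spec["id"])
    return hits


def find_neutral_loss(spectra, target_loss, mz_tol=0.005):
    """Return ids of spectra showing the given neutral loss.

    A neutral loss is the precursor m/z minus a fragment m/z.
    """
    hits = []
    for spec in spectra:
        losses = (spec["precursor"] - mz for mz, _ in spec["peaks"])
        if any(abs(loss - target_loss) <= mz_tol for loss in losses):
            hits.append(spec["id"])
    return hits


# Hypothetical consensus spectra of two features
features = [
    {"id": "F1", "precursor": 260.165, "peaks": [(159.068, 100.0), (242.154, 30.0)]},
    {"id": "F2", "precursor": 244.170, "peaks": [(159.068, 80.0), (172.076, 60.0)]},
]
with_fragment = find_fragment(features, 159.068)
with_water_loss = find_neutral_loss(features, 18.0106)
```

Here both features share the fragment at m/z 159.068, but only F1 shows a loss of about 18.011, which is consistent with the loss of water.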

Furthermore, MS1 pseudospectra, i.e. groups of MS1 peaks derived from the same analyte like adducts, fragments etc., as produced by the ‘CAMERA’ package (Kuhl et al., 2012), can be extracted from raw data and analyzed like MS² spectra.

2.4 Comparison to existing tools

Similarity-based clustering of LC-MS/MS data from data-independent acquisition can also be performed using the ‘MetFamily’ tool (Treutler et al., 2016) or the more specialized MS/MS data mining tool ‘MetCirc’ (Naake and Gaquerel, 2017). The best-known method to study MS² spectra from data-dependent acquisition (auto-MS/MS) is ‘Molecular Networking’, a widely used web tool that, however, offers relatively few possibilities for customization (Wang et al., 2016). Whereas ‘Molecular Networking’ and ‘MetFamily’ are web applications that require data to be uploaded to a web server, we provide a package that does not rely on the public upload of data and offers greater transparency and customizability by enabling access to the data at every step of the analysis. We expect the incorporation of CluMSID into an existing R pipeline that uses, for instance, xcms for peak picking to be a very common use case and useful for the community working with untargeted metabolomics data. ‘MetCirc’ is an interactive visualization tool to compare MS² experimental data across organisms or tissues and thus has a different focus than CluMSID. While the distance matrix is calculated in a similar manner, CluMSID features more options for data preprocessing, analysis and visualization, such as hierarchical and density-based clustering. The analyses of pseudospectra and of neutral loss patterns are additional, valuable features.

3 Conclusion

CluMSID is a tool that aids the identification of features in untargeted LC-MS/MS analysis by the use of MS² spectral similarity and unsupervised learning methods. It offers functions for a complete and customizable workflow from raw data to visualizations in the form of a freely accessible R package. We are convinced that CluMSID will benefit both the wider metabolomics community and scientists from other bioanalytical fields applying untargeted LC-MS/MS analysis by enabling researchers to integrate similarity data from MS² spectra and neutral loss patterns into their feature annotation workflow.

Funding

This work was supported by the President’s Initiative and Networking Funds of the Helmholtz Association of German Research Centres [VH-GS-202]; the EU-funded European Marine Biological Research Infrastructure Cluster [654008]; and by a PhD scholarship of the Studienstiftung des deutschen Volkes to T.D.

Conflict of Interest: none declared.

References

Butts C.T. (2008) Network: a package for managing relational data in R. J. Stat. Softw., 24, 1–36.

Depke T. et al. (2017) MS2 spectra using unsupervised methods to aid the identification of secondary metabolites from Pseudomonas aeruginosa. J. Chromatogr. B, 1071, 19–28.

Garg N. et al. (2015) Mass spectral similarity for untargeted metabolomics data analysis of complex mixtures. Int. J. Mass Spectrom., 377, 719–727.

Johnson C.H. et al. (2016) Metabolomics: beyond biomarkers and towards mechanisms. Nat. Rev. Mol. Cell Biol., 17, 451–459.

Kuhl C. et al. (2012) CAMERA: an integrated strategy for compound spectra extraction and annotation of liquid chromatography/mass spectrometry data sets. Anal. Chem., 84, 283–289.

Naake T., Gaquerel E. (2017) MetCirc: navigating mass spectral similarity in high-resolution MS/MS metabolomics data. Bioinformatics, 33, 2419–2420.

Smith C.A. et al. (2006) XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal. Chem., 78, 779–787.

Tabb D.L. et al. (2003) Similarity among tandem mass spectra from proteomic experiments: detection, significance, and utility. Anal. Chem., 75, 2470–2477.

Treutler H. et al. (2016) Discovering regulated metabolite families in untargeted metabolomics studies. Anal. Chem., 88, 8082–8090.

Tsugawa H. (2018) Advances in computational metabolomics and databases deepen the understanding of metabolisms. Curr. Opin. Biotechnol., 54, 10–17.

Wan K.X. et al. (2002) Comparing similar spectra: from similarity index to spectral contrast angle. J. Am. Soc. Mass Spectrom., 13, 85–88.

Wang M. et al. (2016) Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nat. Biotechnol., 34, 828–837.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
