- Split View
-
Views
-
Cite
Cite
Tobias Greisager Rehfeldt, Konrad Krawczyk, Mathias Bøgebjerg, Veit Schwämmle, Richard Röttger, MS2AI: automated repurposing of public peptide LC-MS data for machine learning applications, Bioinformatics, Volume 38, Issue 3, February 2022, Pages 875–877, https://doi.org/10.1093/bioinformatics/btab701
- Share Icon Share
Abstract
Liquid-chromatography mass-spectrometry (LC-MS) is the established standard for analyzing the proteome in biological samples by identification and quantification of thousands of proteins. Machine learning (ML) promises to considerably improve the analysis of the resulting data, however, there is yet to be any tool that mediates the path from raw data to modern ML applications. More specifically, ML applications are currently hampered by three major limitations: (i) absence of balanced training data with large sample size; (ii) unclear definition of sufficiently information-rich data representations for e.g. peptide identification; (iii) lack of benchmarking of ML methods on specific LC-MS problems.
We created the MS2AI pipeline that automates the process of gathering vast quantities of MS data for large-scale ML applications. The software retrieves raw data from either in-house sources or from the proteomics identifications database, PRIDE. Subsequently, the raw data are stored in a standardized format amenable for ML, encompassing MS1/MS2 spectra and peptide identifications. This tool bridges the gap between MS and AI, and to this effect we also present an ML application in the form of a convolutional neural network for the identification of oxidized peptides.
An open-source implementation of the software can be found at https://gitlab.com/roettgerlab/ms2ai
Supplementary data are available at Bioinformatics online.
1 Introduction
Liquid-chromatography mass-spectrometry (LC-MS) is the leading technique to measure the cellular proteome. LC-MS is routinely performed as tandem-MS, as this procedure allows in-depth identification of the proteome, and delivers information such as sequence and posttranslational modifications based on a database search of the secondary mass spectra (MS2; Behrmann et al., 2018). When analyzing tandem-MS data, the full information of primary peptide mass spectra (MS1) and MS2 is mostly simplified, and no current pipeline converts the full wealth of data into a format amenable for machine learning (ML). In addition, the plethora of publicly available MS data is a mostly untapped resource for advanced ML techniques. It is thus desirable to establish a connection between available LC-MS data and ML applications in order to unravel any latent but biologically or technically relevant information. Advanced ML methods like deep learning models have shown great potential to predict peptide features of different charge states, as well as estimate their intensity (Cox and Mann, 2008), tumor classification using imaging MS (Hulstaert et al., 2020) and peptide MS2 spectra prediction (Sinitcyn et al., 2018).
To address the issue of data connectivity, we created MS2AI, a pipeline ensuring that advanced ML techniques are applicable to large-scale MS data. This is done by standardizing heterogeneous data resources for reliable repurposing, and further enrichment with experimental and peptide-centric metadata. MS2AI solves the fundamental challenges of ML in MS, suffering from the lack of convenient acquisition of large-scale training and test data. MS2AI automatically extracts compatible entries (analyzed with MaxQuant) from the largest public repository of LC-MS data, PRIDE (Vizcaíno et al., 2016), and stores these in a homogeneous and ML ready standardized format.
2 Materials and methods
2.1 Automated data retrieval
MS2AI fetches metadata from all ∼12 000 PRIDE projects using the PRIDE API and stores it in a MongoDB collection. These projects can be filtered according to specific metadata entries (see Supplementary Table S1), as well as file specific information found in the MaxQuant XML files, such as fixed peptide modifications. For convenience, MS2AI contains a complete database collection of PRIDE project metadata (as of July 2021) that is easily updatable to include newer projects.
Currently, MS2AI can only process projects that come with peptide identifications from the MaxQuant suite, which is the predominantly used software for bottom-up proteomics. This information yields details about peptide identity and quantification, such as the peptide location, sequence, intensity and modifications. By restricting the database to MaxQuant identifications, MS2AI ensures a more homogeneous identification and extraction of peptides across all projects and their raw data files.
Based on the PRIDE metadata file, which is automatically filtered with regard to MaxQuant analysis and user-specified requirements, MS2AI retrieves the MaxQuant output files, along with the corresponding raw files, from the PRIDE database. These raw files are in a proprietary file format used by the majority of MS instruments in PRIDE and contain information from the data acquisition process, most notably the objects in the m/z–retention time–intensity space (Zeng et al., 2019).
The following data transformations are executed after raw file download; (i) conversion of raw files into the community standard mzML data format using the ThermoRawFileParser (Zohora et al., 2019) and (ii) further extraction from the mzML file, including all MS2 spectra, and values in the m/z–retention time–intensity space from the MS1 spectra. The output of the processed raw files is stored in the local data directory, while the raw files themselves are discarded (see Supplementary Material for file storage structure).
Along with the automatic PRIDE data retrieval, MS2AI also allows for the extraction of local in-house data. This requires the MaxQuant and corresponding raw file(s) to be available to the software. The key steps of the extraction are identical to the steps taken during the PRIDE method. The user has the possibility to provide additional metadata to their local data comparable to the metadata (instrument, modifications, etc.) available on PRIDE, to seamlessly integrate local and PRIDE received data.
2.2 Peptide representation (PR)
From the extracted information described in Section 2.1, we are able to create a run representation (RR) of the entire LC-MS run (see Fig. 1). However, due to the sheer amount of m/z and retention time entries in the LC-MS spectra, data reduction is paramount, as LC-MS base resolution would make ML methods insurmountably computationally challenging (see Supplementary Fig. S6). The user can freely define the range of m/z and retention time to be summarized into a single data point of the RR; the smaller the range in both m/z and retention time, the less data-loss will occur, at the expense of increased data size. This increase in size influences both storage space and the runtime of any future ML applications (see Supplementary Section S7). To account for the inevitable data-loss on the RR, we have constructed a 4-channel data representation of the MS1 spectra consisting of: (i) mean value of all summarized data points, (ii) minimum value of all summarized data points, (iii) maximum value of all summarized data points and (iv) absolute number of summarized data points. To avoid unwanted bias in the intensities, all intensities in a single raw file are normalized based on the highest value in the individual raw file. The PR consists of the peptide and its neighborhood of configurable size, and which is drawn directly from RR, along with the MS2 spectra for the corresponding peptide.
The MS2 information is either extracted exactly as represented in the mzML file (with differing length of m/z–intensity space), or in a binned fashion, to ensure equal length and homogeneity between spectra. The m/z bin-size of MS2 can be calculated in two methods: fixed length (e.g. 500 bins ranging from 0 to 2500 m/z) or in a variable length (e.g. 500 bins ranging from 0 to the precursor m/z). Furthermore, intensity information can be retained as the mean of the normalized values in a bin, or binarized; indicating whether a peak exists in the binned area or not. All of these options give users full flexibility to adjust the PR to their specifications.
Each PR is accompanied by relevant information from the MaxQuant output file, which is appended to a MongoDB collection for ML purposes. MS2AI also gives the user additional data dependent information that allows adding pre- or postprocessing steps by interacting with the MongoDB database. This information includes entries such as file specific RT step size, and normalized m/z and RT values. This collection can then be filtered and sorted to help transition into any ML application, by querying and removing entries based on needs.
2.3 ML applications
The MS2AI PR is well suited for a multitude of ML applications, as it delivers homogeneous data representations of heterogeneous MS experiments. The generation of PRs, along with the metadata accompanying each of the peptide data, trivialize the task of getting from unsuitable raw data to ML ready peptide information files.
MS2AI also includes a functional convolutional neural network that uses a tailored data generator for TensorFlow, allowing easy integration of the 4-channel images along with m/z and rt information from MS1 and the MS2 spectra. In order to demonstrate the utility of MS2AI, we trained and tested a simple neural network (see Supplementary Fig. S6) on ∼200 000 PRs from 307 different projects (filtered from ∼69 000 000 total PRs with 98th percentile score filter) separated into peptides with an oxidation on methionine, and peptides without an oxidation on methionine. We then trained and tested whether the network could, using MS1 and MS2 information, distinguish the two classes. Doing this, we obtained 95% training accuracy and 93% validation accuracy along with an 85% test accuracy on a different set of PRIDE projects consisting of 25 000 PRs; entire projects from which none of the data were not used for training or validation. This separation of training and test data causes the highest possible heterogeneity between data points and increases robustness in the neural network capabilities (see Supplementary Section S6).
3 Conclusion
MS2AI is the first automated pipeline for strategic and large-scale extraction and processing of LC-MS experiments that allows hassle-free and powerful ML applications in the realm of computational proteomics.
MS2AI offers unique and a concise peptide data representation that combines the measurements in the multidimensional area around a peak in the m/z–retention time–intensity space with the raw or binned MS2 spectra. This customizable representation enables researchers to summarize peptide information in accordance with specific needs, along with a comprehensive database structure of known information on the peptides gathered from MaxQuant. In general, MS2AI will facilitate the advancement of ML techniques performed in the field of MS, potentially opening the door for more in-depth analysis of the proteome. We illustrated this by using the simple in-built neural network to identify oxidized peptides, achieving 85% testing accuracy.
Acknowledgements
The authors want to thank Alex Gherega for his valuable feedback and help to structure the software and streamline the development process.
Funding
This work was supported by the Velux Foundation [00028116].
Conflict of Interest: none declared.