Abstract

Summary

This R package implements a robust approach to handling mass spectrometry (MS) data. It aims to alleviate reproducibility issues and the pernicious effects of deviating signals on both data pre-processing and downstream data analysis. Based on robust statistical methods, it facilitates the identification and filtering of low-quality mass spectra and atypical peak profiles, as well as monitoring and data handling throughout pre-processing, extending existing computational tools for high-throughput data.

Availability and implementation

MALDIrppa is implemented as a package for the R environment for data analysis and it is freely available to download from the CRAN repository at https://CRAN.R-project.org/package=MALDIrppa.

1 Introduction

Matrix-assisted laser desorption/ionization (MALDI) mass spectrometry (MS) is widely used as a rapid, reliable and cost-effective tool for high-throughput proteomic profiling. Reproducibility is, however, a significant challenge in MALDI MS protein profiling. Raw MS data contain not only signals corresponding to actual peptides or proteins, but also signals derived from different forms of noise and other artifacts. Experimental factors such as sample preparation or matrix solution, together with instrumental variation, affect the quality of the signals (Albrethsen, 2007). In addition, human or technical errors during data acquisition and preparation are unfortunately common. Data pre-processing is hence a critical stage in extracting a reliable list of protein-induced signals for the identification and typing of organisms. Pre-processing methods are, however, not immune to the presence of faulty or low-quality mass spectra. Moreover, atypical peak profiles exhibiting patterns beyond the bounds of expected random biological variability can reflect process disturbances, sub-structures in the data or chemical contamination. Ignoring these issues often distorts the results of pre-processing and data analysis and, hence, the biological conclusions.

MALDIrppa implements procedures for quality control and robust pre-processing of MS data, along with functionality to simplify the monitoring and handling of data and metadata. It uses object classes and methods from the MALDIquant package (Gibb and Strimmer, 2012) and is designed to work in conjunction with the latter.

2 Methods

2.1 Screening for low-quality mass spectra

A signal quality scoring procedure is implemented that facilitates the rapid identification, assessment and filtering of low-quality mass spectra through numerical and graphical outputs. It is intended to be run as an initial pre-processing step and produces atypicality scores (A scores) based on a weighted combination of robust scale estimators (Rousseeuw and Croux, 1993) of the derivative mass spectra and median intensities. Mass spectra falling beyond the fitted tolerance limits are marked as potentially problematic. Several methods are available to compute the tolerance limits, so that the choice can be matched to the characteristics of the data. Figure 1 (top panel) illustrates one of the graphical outputs available in MALDIrppa, showing A scores and tolerance limits from in-house E. coli whole-cell MALDI MS proteomic profiles. Some of the identified mass spectra are shown at the bottom of Fig. 1. Importantly, the associated metadata can be passed as input to the routines so that any relevant manipulation, for example discarding signals across replicates, is seamlessly applied to the metadata as well.

Fig. 1

Top: atypicality scores (A scores) and tolerance limits (dashed lines) from a collection of raw in-house MALDI mass spectra. Bottom: examples of low-quality mass spectra. From left to right: a characteristic poorly resolved peak profile, ion suppression and a discretized signal, all of which are reliably flagged by MALDIrppa routines.
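The scoring idea described in Section 2.1 can be illustrated with a minimal, language-neutral sketch in Python. This is not MALDIrppa's R implementation. Several choices below are assumptions made purely for illustration: the median absolute deviation (MAD) stands in for the scale estimators of Rousseeuw and Croux (1993), the particular way the derivative-based scale and the median intensity are combined is hypothetical, the tolerance limits use a Hampel-type rule (median plus or minus k MADs), and the toy spectra are synthetic.

```python
from statistics import median


def mad(values):
    """Median absolute deviation, scaled to be comparable with a normal SD."""
    values = list(values)
    m = median(values)
    return 1.4826 * median(abs(v - m) for v in values)


def a_score(intensities, weight=0.5):
    """Hypothetical atypicality score (an assumed formula, not the package's).

    A rough first derivative and a low median intensity both raise the score.
    """
    diffs = [b - a for a, b in zip(intensities, intensities[1:])]
    roughness = mad(diffs)       # robust scale of the derivative spectrum
    level = median(intensities)  # median intensity of the spectrum
    return roughness ** weight / (level + 1.0) ** (1.0 - weight)


def screen(spectra, k=3.0):
    """Score every spectrum and flag those outside Hampel-type limits."""
    scores = [a_score(s) for s in spectra]
    centre, spread = median(scores), mad(scores)
    limits = (centre - k * spread, centre + k * spread)
    flagged = [i for i, s in enumerate(scores)
               if s < limits[0] or s > limits[1]]
    return scores, limits, flagged


def toy_spectrum(jitter):
    """Synthetic single-peak profile with alternating noise of a given size."""
    base = [10, 12, 15, 30, 80, 30, 15, 12, 10, 9, 8, 8]
    return [v + jitter * (-1) ** i for i, v in enumerate(base)]


# Five reasonably clean spectra and one very noisy one (index 5).
spectra = [toy_spectrum(j) for j in (0.5, 1.0, 1.5, 0.75, 1.25, 8.0)]
scores, limits, flagged = screen(spectra)
print("A scores:", [round(s, 3) for s in scores])
print("tolerance limits:", tuple(round(v, 3) for v in limits))
print("flagged spectra:", flagged)
```

Because both the scores and the limits are built from medians, a single very poor spectrum has little influence on the limits against which it is judged, which is the main reason for preferring robust estimators at this stage.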

2.2 Detection of outlying peak profiles

Unlike peptide-wise outlier detection methods (see e.g. Erhard and Zimmer, 2012), MALDIrppa accounts for correlation patterns across the multi-peak structure of spectral data. It therefore searches for atypical patterns that need not contain peaks standing out individually, but whose peak intensity profiles depart from the dominant structure. The multivariate outlier detection method introduced in Filzmoser et al. (2005), based on statistically robust measures of scale and location, is adapted in MALDIrppa to the MS context and extended to work with different data formats, either peak intensities or peak presence/absence profiles. The procedure can easily be applied to investigate outliers at different levels of aggregation, e.g. within technical or biological replicates or by isolate.
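To illustrate why a profile can be atypical without any single peak standing out, the following Python sketch compares peak-wise robust z-scores with robust Mahalanobis-type distances on a toy two-peak data set. It is only a sketch under stated assumptions and not the method of Filzmoser et al. (2005) as adapted in MALDIrppa: the robust location and scatter come from a simple single-start concentration-step scheme (a crude stand-in for a proper robust covariance estimator), the cutoff is a fixed chi-square quantile rather than an adaptive one, and all names and data are hypothetical.

```python
from math import ceil, log
from statistics import mean, median


def robust_z(values):
    """Peak-wise robust z-scores based on the median and the MAD."""
    m = median(values)
    s = 1.4826 * median(abs(v - m) for v in values)
    return [(v - m) / s for v in values]


def mean_cov(points):
    """Sample mean and covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def sq_distance(point, centre, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - centre[0], point[1] - centre[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


def robust_sq_distances(points, keep=0.75, max_iter=20):
    """Concentration steps: refit on the `keep` fraction closest to the fit."""
    h = ceil(keep * len(points))
    subset = list(range(len(points)))
    d2 = []
    for _ in range(max_iter):
        centre, cov = mean_cov([points[i] for i in subset])
        d2 = [sq_distance(p, centre, cov) for p in points]
        closest = sorted(range(len(points)), key=d2.__getitem__)[:h]
        if sorted(closest) == subset:
            break
        subset = sorted(closest)
    return d2


# Toy intensities of two strongly correlated peaks; the last profile (index 12)
# has ordinary values for each peak but breaks the correlation pattern.
profiles = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.0), (6, 6.1),
            (7, 6.9), (8, 8.2), (9, 8.8), (10, 10.0), (11, 11.1), (12, 11.9),
            (4, 10.0)]

z_first = robust_z([p[0] for p in profiles])
z_second = robust_z([p[1] for p in profiles])
univariate_flags = [i for i in range(len(profiles))
                    if abs(z_first[i]) > 3 or abs(z_second[i]) > 3]

cutoff = -2.0 * log(1.0 - 0.975)  # 0.975 chi-square quantile, 2 d.f.
d2 = robust_sq_distances(profiles)
flagged = [i for i, d in enumerate(d2) if d > cutoff]
print("peak-wise flags:", univariate_flags)
print("multivariate flags:", flagged)
```

In this toy example the peak-wise screen flags nothing, whereas the distance-based screen singles out the last profile, because only the latter takes the joint structure of the two peaks into account.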

2.3 Other features

  • Undecimated discrete wavelet transform (UDWT) denoising: a robust method to decompose mass spectra into noise and genuine signal (Coombes et al., 2005). Because the transform is shift-invariant, it avoids the undesirable artifacts that can otherwise be introduced into the signal near either end of the spectrum, where small shifts in location cause large changes in the wavelet coefficients.

  • Simplified double-binning procedure which refines peak alignment in high-resolution mass spectra.

  • Ability to export pre-processed data and metadata into R, CSV, FASTA and NEXUS formats.

  • Extra functions for data management, interactive visualization of marked spectra and summaries.

  • Comprehensive tutorial included as an R vignette which illustrates a robust MS data pre-processing pipeline.
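The UDWT denoising listed among the features above can be sketched in a few lines of Python. This is a simplified illustration rather than the routine shipped with MALDIrppa: it uses a single decomposition level with the Haar wavelet, a periodic (wrap-around) boundary, and hard thresholding with a MAD-based noise estimate. The function name, the `thresh_scale` default and the toy signal are all assumptions made for the example.

```python
from statistics import median


def udwt_haar_denoise(signal, thresh_scale=2.5):
    """One-level undecimated Haar transform with hard thresholding.

    No downsampling is applied, so every sample keeps its own pair of
    coefficients and the result does not depend on where the signal starts.
    """
    n = len(signal)
    nxt = [signal[(i + 1) % n] for i in range(n)]         # periodic boundary
    approx = [(x + y) / 2 for x, y in zip(signal, nxt)]   # smooth part
    detail = [(x - y) / 2 for x, y in zip(signal, nxt)]   # local differences

    # Robust noise level from the detail coefficients (MAD-type estimate).
    sigma = median(abs(d) for d in detail) / 0.6745
    cutoff = thresh_scale * sigma
    kept = [d if abs(d) > cutoff else 0.0 for d in detail]

    # Each sample can be rebuilt from two coefficient pairs; average them.
    return [((approx[i] + kept[i]) + (approx[i - 1] - kept[i - 1])) / 2
            for i in range(n)]


# Toy spectrum: flat baseline with alternating noise and one sharp peak.
noisy = [10, 11, 10, 11, 10, 11, 50, 11, 10, 11, 10, 11]
denoised = udwt_haar_denoise(noisy)
print(denoised)
```

With `thresh_scale=0.0` no coefficient is removed and the input is reconstructed, which is a convenient sanity check. A practical implementation would use several decomposition levels and a more careful treatment of the spectrum ends than the wrap-around used here.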

3 Conclusions

Pre-processing algorithms play a crucial role in rendering MS data analysis and modelling more robust and accurate. MALDIrppa introduces robust statistical tools for MS data pre-processing that assist researchers in diagnosing and filtering atypical signals with reduced user intervention. It helps to alleviate intrinsic reproducibility issues with MS data at different levels, as well as the pernicious effects of deviating signals, thereby facilitating reliable molecule identification and quantification. MALDIrppa integrates with and extends existing R packages for high-throughput MS data. The results can be easily exported into formats widely used in bioinformatics for downstream analyses.

Funding

This work was supported by the Scottish Government’s Rural and Environment Science and Analytical Services Division (RESAS), Strategic Partnership for Animal Science Excellence (SPASE).

References

Albrethsen J. (2007) Reproducibility in protein profiling by MALDI-TOF mass spectrometry. Clin. Chem., 53, 852–858.

Coombes K.R. et al. (2005) Improved peak detection and quantification of mass spectrometry data acquired from surface-enhanced laser desorption and ionization by denoising spectra with the undecimated discrete wavelet transform. Proteomics, 5, 4107–4117.

Erhard F., Zimmer R. (2012) Detecting outlier peptides in quantitative high-throughput mass spectrometry data. J. Proteom., 75, 3230–3239.

Filzmoser P. et al. (2005) Multivariate outlier detection in exploration geochemistry. Comput. Geosci., 31, 579–587.

Gibb S., Strimmer K. (2012) MALDIquant: a versatile R package for the analysis of mass spectrometry data. Bioinformatics, 28, 2270–2271.

Rousseeuw P., Croux C. (1993) Alternatives to the median absolute deviation. J. Am. Stat. Assoc., 88, 1273–1283.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)
Associate Editor: John Hancock