Abstract

Summary

Proteome studies frequently encounter challenges in down-stream data analysis due to limited bioinformatics resources, rapid data generation, and variations in analytical methods. To address these issues, we developed SpectroPipeR, an R package designed to streamline data analysis tasks and provide a comprehensive, standardized pipeline for Spectronaut® DIA-MS data. The package automates various analytical processes, including XIC plots, ID rate summary, normalization, batch and covariate adjustment, relative protein quantification, multivariate analysis, and statistical analysis, and generates interactive HTML reports for use in, e.g., electronic laboratory notebook (ELN) systems.

Availability and implementation

The SpectroPipeR package (manual: https://stemicha.github.io/SpectroPipeR/) was written in R and is freely available on GitHub (https://github.com/stemicha/SpectroPipeR).

1 Introduction

Proteome studies are crucial for elucidating complex protein networks and functions within biological systems. Over the past decades, proteomics has evolved significantly, driven by advancements in mass spectrometry (MS) technologies (Peters-Clarke et al. 2024). Data-independent acquisition (DIA) has emerged as a powerful method for proteome analysis, offering advantages over data-dependent acquisition (DDA). While DDA uses a selective, intensity-driven approach to isolate and fragment the most intense precursor ions, DIA methodically fragments all ions within a predetermined mass-to-charge (m/z) range over the liquid chromatography gradient. This parallel, unbiased, and continuous measurement process generates a more comprehensive dataset, providing quantitative information that ensures high-precision, robust, and reliable quantitation (Vowinckel et al. 2013, Michalik et al. 2017, Krasny and Huang 2021).
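The difference between the two acquisition strategies can be illustrated with a minimal Python sketch. It assumes a simplified model in which DDA picks the `top_n` most intense precursors and DIA tiles the m/z range with fixed-width isolation windows; the function names, the window width, and the precursor values are hypothetical and chosen only for illustration.

```python
def dda_select(precursors, top_n):
    """DDA-style selection: keep only the top_n most intense precursors."""
    return sorted(precursors, key=lambda p: p[1], reverse=True)[:top_n]


def dia_windows(mz_start, mz_end, width):
    """Tile [mz_start, mz_end) with fixed-width isolation windows."""
    windows = []
    low = mz_start
    while low < mz_end:
        windows.append((low, min(low + width, mz_end)))
        low += width
    return windows


def dia_select(precursors, windows):
    """DIA-style selection: every precursor falls into the window covering its m/z."""
    return {w: [p for p in precursors if w[0] <= p[0] < w[1]] for w in windows}


# (m/z, intensity) pairs; the values are made up for illustration
precursors = [(402.2, 9e5), (455.7, 2e4), (512.3, 7e6), (618.9, 3e3), (733.4, 1e5)]

windows = dia_windows(400.0, 800.0, 100.0)
dda_picked = dda_select(precursors, top_n=2)
dia_picked = dia_select(precursors, windows)
dia_fragmented = [p for group in dia_picked.values() for p in group]
```

In this toy example, `dda_picked` contains only the two most intense precursors, whereas `dia_fragmented` contains all five, including the low-intensity ones.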

The widespread adoption of DIA-MS on different mass spectrometers over the past decade has led to an exponential increase in research papers referencing this technique (Peters-Clarke et al. 2024). However, researchers frequently encounter significant challenges during down-stream data analysis, primarily due to high bioinformatics demands, rapid raw data generation, and variations in MS analysis methods. Spectronaut® (Bruderer et al. 2015) and DIA-NN (Demichev et al. 2020) have emerged as key tools for DIA-MS raw data analysis and generation of quantitative ion data, compatible with a range of MS devices. They have gained extensive recognition and acceptance within the research community.

Following raw data analysis, the down-stream data analysis presents additional challenges, particularly in terms of computational intensity and the need for specialized bioinformatics expertise. To address these issues, various computational tools and pipelines have been developed, with R being a popular environment in the scientific community for proteomics data analysis. Although Spectronaut® includes basic down-stream analysis features such as plots and tables, these outputs often require substantial modifications to meet publication standards. Thus, a dedicated down-stream analysis solution, such as an R package, would be advantageous and more flexible. Numerous R packages and pipelines have been designed to streamline label-free quantification (LFQ) proteome data analysis, such as DEqMS (Zhu et al. 2020), prolfqua (Wolski et al. 2023), iq (Pham et al. 2020), MSstats (Choi et al. 2014), and FragPipeAnalystR (Hsiao et al. 2024). However, these tools often require significant pre-existing knowledge and informatics skills, which can pose a barrier to many researchers, especially those lacking advanced computational skills. To address these challenges, we developed SpectroPipeR, an R package that simplifies data analysis tasks, reduces scientists’ workload, and provides standardized outputs and reports for Spectronaut® down-stream data analysis in core facilities.

2 Approach and implementation

SpectroPipeR is a package developed in the programming language R and has been designed to address the challenges faced in proteome studies focusing on label-free quantitation based on DIA-MS data. The package is compatible with R version ≥4.0 and Spectronaut® version ≥18.7.24056. The comprehensive SpectroPipeR pipeline provides a fully automated and standardized data analysis, offering a solution for bottlenecks often encountered in proteomics research. The development of SpectroPipeR was driven by the need to simplify data analysis tasks, reduce the workload for scientists, and offer a user-friendly, scalable platform that produces uniform outputs manifested as graphs, tables, and reports in a convenient folder structure.

The functionality of SpectroPipeR encompasses a wide range of data analysis tasks. It covers XIC (extracted ion chromatogram) plotting, ID rate summary, ON/OFF analysis, data normalization, batch or covariate adjustment of peptide intensities, protein quantification (iBAQ, Hi3, and MaxLFQ), multivariate analysis, peptide-centric statistical analysis (ROPECA, modified t-test, t-test), and standalone interactive HTML report generation (Fig. 1). Optional parameters, such as the removal of oxidized methionine peptides and condition-wise filtering, allow analyses to be tailored to the diverse needs of proteomics researchers (Supplementary Figs S3–S9).
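Two of the listed protein quantification schemes can be sketched in a few lines of Python. The sketch uses the common textbook definitions (Hi3 as the mean of the three most intense peptides, iBAQ as the summed peptide intensity divided by the number of theoretically observable peptides). The function names and input values are hypothetical, and the code is not taken from SpectroPipeR, whose implementation may differ in detail.

```python
from statistics import mean


def hi3(peptide_intensities):
    """Hi3: mean of the three most intense peptides of a protein.

    If fewer than three peptides are available, all of them are averaged
    (an assumption of this sketch).
    """
    top3 = sorted(peptide_intensities, reverse=True)[:3]
    return mean(top3)


def ibaq(peptide_intensities, n_theoretical_peptides):
    """iBAQ: summed peptide intensity divided by the number of
    theoretically observable peptides of the protein."""
    if n_theoretical_peptides <= 0:
        raise ValueError("n_theoretical_peptides must be positive")
    return sum(peptide_intensities) / n_theoretical_peptides


# Hypothetical peptide intensities for one protein in one sample
intensities = [10.0, 40.0, 20.0, 30.0]
hi3_value = hi3(intensities)       # mean of 40, 30, and 20
ibaq_value = ibaq(intensities, 5)  # 100 divided by 5 theoretical peptides
```

Hi3 depends only on the strongest signals, whereas iBAQ scales the total signal by protein size, which is why the two estimates can rank proteins differently.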

Figure 1. Workflow and functionalities of SpectroPipeR, with a visual representation of the export folder structure, reports, and XIC plots, giving an overview of the package operations and selected output (reading report).

The default statistical framework of the pipeline uses the peptide-centric ROPECA method, which has been reported to yield a low rate of false positives (Suomi and Elo 2017).
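The protein-level step of peptide-centric testing can be sketched in Python under stated assumptions. The sketch assumes, following the general description of PECA-type methods, that peptide-level P-values are summarized by their median and that the significance of this median is read from the distribution of the corresponding order statistic of uniformly distributed P-values. It is restricted to an odd number of peptides, takes the peptide-level P-values as given (the reproducibility-optimized test statistic is not reproduced), and uses a hypothetical function name; it is not SpectroPipeR's implementation.

```python
from math import comb
from statistics import median


def median_pvalue_significance(peptide_pvalues):
    """Protein-level P-value from the median of peptide-level P-values.

    Under the null hypothesis the peptide P-values are uniform on (0, 1),
    so the probability that the median of n values is <= x equals the
    probability that at least k = (n + 1) / 2 of them are <= x.
    Only odd n is handled in this sketch.
    """
    n = len(peptide_pvalues)
    if n % 2 == 0:
        raise ValueError("this sketch handles an odd number of peptides only")
    k = (n + 1) // 2
    x = median(peptide_pvalues)
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(k, n + 1))


# Hypothetical peptide-level P-values for one protein
protein_p = median_pvalue_significance([0.01, 0.02, 0.5])
```

A single outlying peptide (here the P-value of 0.5) does not dominate the result, because the median is insensitive to it.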

Beyond the standard statistical outputs such as log2-ratios (colored), statistical scores, P-values, adjusted P-values, and effect sizes, the iBAQ intensity quantiles (Schwanhäusser et al. 2011) of both conditions of a comparison are colored in the user-friendly Excel table outputs (example outputs can be found under: https://github.com/stemicha/SpectroPipeR_examples; tables are described in detail in the manual: https://stemicha.github.io/SpectroPipeR/). This is achieved by using the ten iBAQ quantiles per condition as a scale for protein abundance, which is then combined with a 2D color code. This approach provides a robust basis for assessing the reliability of the log2-ratios. For instance, proteins with low abundance tend to exhibit more divergent fold-changes than highly abundant proteins. This divergence can be attributed to the limited dynamic range of the mass detector and is particularly noticeable in species mix experiments (Navarro et al. 2016). The iBAQ intensities and quantiles therefore help the user to judge how much weight a given log2-ratio deserves, facilitating more accurate and reliable interpretations.
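The idea of a per-condition abundance scale combined into a 2D key can be sketched as follows. The sketch assumes that the ten quantiles correspond to deciles computed separately for each condition and that the 2D color code is indexed by the pair of decile bins; the function name, protein identifiers, and intensities are hypothetical, and the actual binning and color mapping used by SpectroPipeR may differ.

```python
from bisect import bisect_right
from statistics import quantiles


def decile_bins(ibaq_by_protein):
    """Map each protein to an abundance decile (1 = lowest, 10 = highest)."""
    cuts = quantiles(list(ibaq_by_protein.values()), n=10)  # nine cut points
    return {p: bisect_right(cuts, v) + 1 for p, v in ibaq_by_protein.items()}


# Hypothetical iBAQ intensities of 20 proteins in two conditions;
# condition B reverses the abundance order of condition A
proteins = [f"P{i:02d}" for i in range(1, 21)]
ibaq_a = {p: 1000.0 * i for i, p in enumerate(proteins, start=1)}
ibaq_b = {p: 1000.0 * (21 - i) for i, p in enumerate(proteins, start=1)}

bins_a = decile_bins(ibaq_a)
bins_b = decile_bins(ibaq_b)

# 2D key per protein: (decile in condition A, decile in condition B);
# a lookup table from this pair to a color would give the 2D color code
color_key = {p: (bins_a[p], bins_b[p]) for p in proteins}
```

A protein whose key contains a low decile in either condition lies near the lower end of the measured abundance range, so its log2-ratio should be read with more caution.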

The architecture of SpectroPipeR is built on a modular approach (Fig. 1, Supplementary Fig. S1), consisting of a global parameter setting and four analysis modules, along with a reporting module. These modules are executed sequentially, allowing for flexibility in the analysis process. Researchers can run specific analyses independently or as part of the complete pipeline, depending on their requirements. This modular structure enhances the tool’s adaptability to various research needs. An all-in-one function that executes all modules in a single call is also included, which saves time and lines of code and makes the analysis convenient for new R users.
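The modular design described above can be illustrated with a language-agnostic sketch, written here in Python. All function names are hypothetical and are not SpectroPipeR's API; the assignment of the four analysis steps is an assumption inferred from the functionality listed in this section. Each module takes the state produced by the previous one, so modules can be chained manually or through a single wrapper.

```python
def read_data(params):
    """First module: start the pipeline state from the global parameters."""
    return {"params": params, "completed": ["read_data"]}


def _mark(state, name):
    """Return a new state with one more completed step recorded."""
    return {**state, "completed": state["completed"] + [name]}


def normalize_and_quantify(state):
    return _mark(state, "normalize_and_quantify")


def multivariate_analysis(state):
    return _mark(state, "multivariate_analysis")


def statistical_analysis(state):
    return _mark(state, "statistical_analysis")


def build_report(state):
    return _mark(state, "build_report")


def run_all(params):
    """All-in-one wrapper: run every module in sequence."""
    state = read_data(params)
    for module in (normalize_and_quantify, multivariate_analysis,
                   statistical_analysis, build_report):
        state = module(state)
    return state


# Hypothetical global parameter setting
params = {"output_folder": "results"}

# Partial run: stop after normalization and quantification
partial = normalize_and_quantify(read_data(params))

# Complete run through the wrapper
full = run_all(params)
```

The partial run shows how a user could stop after any module and inspect the intermediate state, while the wrapper reduces the complete analysis to one call.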

3 Examples of SpectroPipeR usage

To illustrate the analytical capabilities and report generation of the SpectroPipeR package, we used species mix data from the study by Reder et al. (2024). Furthermore, to showcase a clinically relevant analysis, we drew on a case-control cancer cohort study previously published by Vitko et al. (2024). This study includes 20 samples each from lung cancer patients and control subjects. The plasma samples were enriched using the SEER technology (NP2). The reports and results of the analysis are available on GitHub (https://github.com/stemicha/SpectroPipeR_examples). Moreover, the Supplementary Material (Supplementary Section S7) contains code snippets for SpectroPipeR.

Looking ahead, future development of SpectroPipeR may include expanding the range of supported data formats, integrating additional statistical analysis methods for time series analysis, and developing further tutorials and documentation to assist new users. These planned improvements aim to make SpectroPipeR an even more powerful and user-friendly tool for proteome data analysis, ultimately advancing the field of proteomics.

4 Conclusion

SpectroPipeR represents a significant advancement in proteomics data analysis tools. By addressing the challenges of limited bioinformatics resources, rapid data generation, and variations in analysis methods, it streamlines the data analysis process and ensures high-quality, reproducible results. As proteomics continues to play a crucial role in understanding complex biological systems, tools like SpectroPipeR will be instrumental in accelerating research and discovery in this field. Its ability to simplify complex analyses, provide standardized outputs, and generate publication-ready results positions SpectroPipeR as a valuable asset for proteomics researchers.

Author contributions

Stephan Michalik (Conceptualization [lead], Methodology [lead], Software [lead], Validation [lead], Visualization [lead]), Elke Hammer (Conceptualization [supporting], Methodology [supporting], Validation [supporting]), Leif Steil (Conceptualization [supporting], Methodology [supporting], Validation [supporting]), Manuela Gesell Salazar (Conceptualization [supporting], Methodology [supporting], Validation [supporting]), Christian Hentschker (Conceptualization [supporting], Methodology [supporting], Validation [supporting]), Kristin Surmann (Conceptualization [supporting], Methodology [supporting], Validation [supporting]), Larissa M. Busch (Methodology [supporting], Validation [supporting]), Thomas Sura (Methodology [supporting], Validation [supporting]), and Uwe Völker (Conceptualization [supporting], Funding acquisition [lead], Methodology [supporting], Project administration [lead])

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest: None declared.

Funding

This work was supported by grant 031L0310B-PosyMed within the Computational Life Sciences funding initiative and grant 01ZX2208B-Sys_CARE within the e:MED initiative to U.V.

Data availability

The data and resources used in this study are accessible through the following links: SpectroPipeR manual available at: https://stemicha.github.io/SpectroPipeR/. SpectroPipeR examples can be found at: https://github.com/stemicha/SpectroPipeR_examples. For SpectroPipeR example datasets and scripts, refer to: https://doi.org/10.5281/zenodo.14849402. Additional supplemental material for SpectroPipeR is available at Bioinformatics online.

References

Bruderer R, Bernhardt OM, Gandhi T et al. Extending the limits of quantitative proteome profiling with data-independent acquisition and application to acetaminophen treated 3D liver microtissues. Mol Cell Proteomics 2015;14:1400–10.

Choi M, Chang C-Y, Clough T et al. MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments. Bioinformatics 2014;30:2524–6.

Demichev V, Messner CB, Vernardis SI et al. DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nat Methods 2020;17:41–4.

Hsiao Y, Zhang H, Li GX et al. Analysis and visualization of quantitative proteomics data using FragPipe-Analyst. J Proteome Res 2024;23:4303–15.

Krasny L, Huang PH. Data-independent acquisition mass spectrometry (DIA-MS) for proteomic applications in oncology. Mol Omics 2021;17:29–42.

Michalik S, Depke M, Murr A et al. A global Staphylococcus aureus proteome resource applied to the in vivo characterization of host-pathogen interactions. Sci Rep 2017;7:9718.

Navarro P, Kuharev J, Gillet LC et al. A multicenter study benchmarks software tools for label-free proteome quantification. Nat Biotechnol 2016;34:1130–6.

Peters-Clarke TM, Coon JJ, Riley NM et al. Instrumentation at the leading edge of proteomics. Anal Chem 2024;96:7976–8010.

Pham TV, Henneman AA, Jimenez CR et al. iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics. Bioinformatics 2020;36:2611–3.

Reder A, Hentschker C, Steil L et al. MassSpecPreppy—an end-to-end solution for automated protein concentration determination and flexible sample digestion for proteomics applications. Proteomics 2024;24:e2300294.

Schwanhäusser B, Busse D, Li N et al. Global quantification of mammalian gene expression control. Nature 2011;473:337–42.

Suomi T, Elo LL. Enhanced differential expression statistics for data-independent acquisition proteomics. Sci Rep 2017;7:5869.

Vitko D, Chou W-F, Nouri Golmaei S et al. timsTOF HT improves protein identification and quantitative reproducibility for deep unbiased plasma protein biomarker discovery. J Proteome Res 2024;23:929–38.

Vowinckel J, Capuano F, Campbell K et al. The beauty of being (label)-free: sample preparation methods for SWATH-MS and next-generation targeted proteomics. F1000Res 2013;2:272.

Wolski WE, Nanni P, Grossmann J et al. Prolfqua: a comprehensive R-Package for proteomics differential expression analysis. J Proteome Res 2023;22:1092–104.

Zhu Y, Orre LM, Zhou Tran Y et al. DEqMS: a method for accurate variance estimation in differential protein expression analysis. Mol Cell Proteomics 2020;19:1047–57.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Associate Editor: Macha Nikolski