MSPypeline: a python package for streamlined data analysis of mass spectrometry-based proteomics

Abstract

Summary
Mass spectrometry-based proteomics is increasingly employed in biology and medicine. To generate reliable information from large datasets and ensure comparability of results, it is crucial to implement and standardize the quality control of the raw data, the data processing steps and the statistical analyses. MSPypeline provides a platform for importing MaxQuant output tables, generating quality control reports, data preprocessing including normalization and performing exploratory analyses by statistical inference plots. These standardized steps assess data quality, provide customizable figures and enable the identification of differentially expressed proteins to reach biologically relevant conclusions.

Availability and implementation
The source code is available under the MIT license at https://github.com/siheming/mspypeline with documentation at https://mspypeline.readthedocs.io. Benchmark mass spectrometry data are available on ProteomeXchange (PXD025792).

Supplementary information
Supplementary data are available at Bioinformatics Advances online.


Introduction
Mass spectrometry (MS)-based proteomics is, to date, the most comprehensive approach for quantitative profiling of proteins in a great variety of biological and clinical samples. However, regardless of sample complexity, unbiased investigation of proteomic alterations in organisms is intrinsically challenging, requiring the standardization of operational procedures in different yet interconnected areas like biochemistry, MS and bioinformatics. Bioinformatics is a particular bottleneck: a workflow involves many sequential steps and a multitude of parameters, which makes it challenging to document and, as a consequence, limits reproducibility. Even minor changes to an analysis workflow can significantly affect the final results.
Ready-to-use tools, such as the MaxQuant-associated Perseus (Tyanova et al., 2016), can be applied to analyze a wide variety of proteomic data. However, since the specific software settings are not stored, it is very difficult to reproduce previously obtained results. Furthermore, Perseus does not support the automation of data quality assessment or the reproducible production of high-quality figures. Several open-source packages are distributed by the Bioconductor repository (www.bioconductor.org), aiming to align and standardize the first steps of proteome data analysis and to provide statistical functionalities for relative label-free quantification of proteins. Amongst the most common packages, MSstats (Choi et al., 2014), MSnbase (Gatto et al., 2021), DEqMS (Zhu et al., 2020) and obaDIA (Yan et al., 2021) provide statistical models to derive differential protein abundances and feature graphical interfaces. However, these available applications do not support the creation of all-in-one reproducible workflows to analyze quantitative proteome data. For example, they lack opportunities for the generation of quality control (QC) reports, for the functional annotation of proteins, or the visualization of pathways/groups of proteins of interest. For the stand-alone generation of QC reports, several packages are available at Bioconductor, such as proteoQC and qcmetrics. Likewise, individual packages can be found that support the functional annotation of proteins and differential analysis, e.g. topGO and clusterProfiler. Therefore, a unified pipeline that enables the standardized and comprehensive analysis of label-free proteomics data is missing.
To address these issues, here we introduce MSPypeline. This user-friendly, all-in-one Python-based proteomics pipeline integrates a set of tools, allowing the seamless and standardized preprocessing and downstream analysis of label-free data acquired in data-dependent acquisition (DDA) mode. It supports the automatic creation of QC reports, different normalization strategies, the functional annotation of proteins, and the visualization of proteins of interest, providing an exciting tool to analyze complex datasets. Moreover, MSPypeline offers the user the advantage of saving the exact software versions and parameters, guaranteeing reproducibility of results regardless of the computing environment.

The MSPypeline package
MSPypeline is a programming package written in Python 3 (available for 3.7 or 3.8) and uses multiple standard packages for scientific computing (pandas, numpy, sklearn and matplotlib). The recommended installation is via Conda. An intuitive and concise graphical user interface offers researchers unfamiliar with programming or data analysis the opportunity to explore and visualize their data independently and in a time-effective manner. For advanced users, MSPypeline has two additional entry points, the Python module and the command line. Currently, the MSPypeline package supports the analysis of label-free shotgun proteomics data processed by the MaxQuant software, i.e. aggregated protein intensities after feature detection and quantification of raw MS spectra; however, the internal BaseReader class can be subclassed to allow other data inputs, which makes the package extensible. MSPypeline builds a tree-structured analysis design (Supplementary Material) to investigate the data at distinct levels, such as cell lines, treatments or patients, based on the sample names.
Several analysis methods require the determination of whether a protein can be compared between two groups. In MS data, proteins are frequently not detected at random in some samples. Yet, to ensure appropriate data analyses, the protein has to be detected (intensity >0) in a sufficient number of samples per group. MSPypeline defines the required number of samples in which the respective protein has to be detected by a sigmoidal threshold function starting at 100% for up to three samples and relaxing to 50% for 12 or more samples. Based on this threshold, there are four possible categories for a protein: it can be compared between groups A and B if it is detected above the threshold in both A and B; it is unique to A if it is above the threshold in A and completely absent in B; it is unique to B in the converse case; and it is not considered if it is below the threshold in both A and B.
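The thresholding rule can be illustrated with a minimal sketch. The exact sigmoid used by MSPypeline is not given here, so the logistic curve, its midpoint and its steepness are assumptions, as are the function names `required_fraction`, `required_count` and `categorize`. The text does not specify how a protein is handled when it is above the threshold in one group and detected, but below the threshold, in the other; the sketch assumes it is not considered.

```python
import math


def required_fraction(n_samples, low=3, high=12, steepness=1.0):
    """Fraction of samples in which a protein must be detected.

    Returns 1.0 for up to `low` samples and 0.5 for `high` or more samples.
    The logistic shape in between is an assumption, not the package's formula.
    """
    if n_samples <= low:
        return 1.0
    if n_samples >= high:
        return 0.5
    midpoint = (low + high) / 2
    return 0.5 + 0.5 / (1.0 + math.exp(steepness * (n_samples - midpoint)))


def required_count(n_samples):
    """Minimum number of samples with a detected intensity (> 0)."""
    return math.ceil(required_fraction(n_samples) * n_samples)


def categorize(intensities_a, intensities_b):
    """Assign one protein to one of the four categories described in the text."""
    detected_a = sum(1 for x in intensities_a if x > 0)
    detected_b = sum(1 for x in intensities_b if x > 0)
    above_a = len(intensities_a) > 0 and detected_a >= required_count(len(intensities_a))
    above_b = len(intensities_b) > 0 and detected_b >= required_count(len(intensities_b))

    if above_a and above_b:
        return "comparable"
    if above_a and detected_b == 0:
        return "unique in A"
    if above_b and detected_a == 0:
        return "unique in B"
    # Assumption: partially detected proteins that match no case above are dropped.
    return "not considered"
```

For example, `categorize([1.2e6, 3.4e6, 2.0e6], [0, 0, 0])` returns `"unique in A"`, because with three samples per group the protein must be detected in all three samples of a group to pass the threshold.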
By automating the calculation and generation of versatile figures, MSPypeline performs comprehensive and conclusive data analyses within minutes. At the same time, the advanced user may interact more closely with MSPypeline to perform advanced analyses using the many customization options, which are recorded to ensure reproducibility. Although there is a logical flow linking the four steps of analysis (Fig. 1), each step can be performed separately, making the personalization of different analyses possible.
It is worth noting that thresholding is important for the Venn group diagrams, the relative standard deviation graph, the group comparison scatter plot and the volcano plot.
The workflow for MSPypeline consists of the following steps:
1. Data import: data are loaded, converted to the required format and filtered.
2. QC: a comprehensive QC report is generated to investigate technical and biological parameters at a glance for all samples included in a given experiment.
3. Data preprocessing: tools to check normalization schemes produce plots that help to decide among five default normalization strategies (Table 1) applicable to raw, LFQ (Cox et al., 2014) or iBAQ (Schwanhäusser et al., 2011) intensities.
4. Exploratory analysis: descriptive and/or comparative analyses are performed on the preprocessed data, allowing biologically relevant conclusions through differential expression analysis and hypothesis testing (Table 2).

Visualization tools make the exploration of the results possible and include visualization by bar plots, Venn diagrams, volcano plots showing differentially regulated and unique proteins, rank plots and principal component analysis plots. All resulting plots are saved as PDF files, alongside CSV files containing the plotted data.

Fig. 1. MSPypeline features a precisely structured workflow that starts with a QC report of the data, followed by the assessment and choice of data preprocessing operations to finally allow optimal exploratory analyses.

Table 1. Normalization strategies
Median normalization (median_norm): For each sample, the median protein intensity is calculated. The mean of all sample-wise medians is calculated and subtracted from each sample median. This correction factor is then subtracted from each protein intensity.
Quantile normalization with missing value handling (quantile_norm_missing_handled): Quantile normalization: for each sample, proteins are ranked by their intensity value. The mean protein intensity per quantile across all samples is calculated and assigned to every protein of each sample. The data are rearranged to the original order of the intensity values for each sample. Missing value handling: during normalization, missing values (protein intensity = 0) are interpolated by sampling from the same distribution as the input distribution. After normalization, missing values are restored.
Tail robust quantile normalization (trqn): An offsetting factor is calculated by taking the sample-wise mean and is subtracted from each protein of the respective sample. Quantile normalization (see above) is applied, and the respective offset value is added back to each protein of the sample (Brombacher et al., 2020).
Tail robust quantile normalization with missing value handling (trqn_missing_handled): Tail robust quantile normalization (see above) is applied with missing value handling (see above).
Tail robust median normalization (trmn): The sample-wise mean protein intensity is calculated and used as an offset to be subtracted from each protein of the respective sample. Median normalization (see above) is applied, and the respective offset value is added back to each protein of the sample.
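The two basic normalization operations listed in Table 1 can be sketched with NumPy as follows. This is an illustrative sketch, not the package's implementation: it assumes a complete matrix of log-transformed intensities with proteins in rows and samples in columns, it omits missing value handling and the tail robust variants, and the function names `median_norm` and `quantile_norm` merely mirror the option names in Table 1.

```python
import numpy as np


def median_norm(data):
    """Shift each sample (column) so that all sample medians coincide.

    Assumes log-transformed intensities, so the correction is subtractive.
    """
    sample_medians = np.median(data, axis=0)
    # Correction factor per sample: its median minus the mean of all medians.
    correction = sample_medians - sample_medians.mean()
    return data - correction


def quantile_norm(data):
    """Give every sample (column) the same intensity distribution.

    Assumes no missing values; tied values are ranked in arbitrary order.
    """
    order = np.argsort(data, axis=0)
    ranks = np.argsort(order, axis=0)      # rank of each protein within its sample
    sorted_data = np.sort(data, axis=0)
    rank_means = sorted_data.mean(axis=1)  # mean intensity per rank across samples
    # Assign the rank mean back to each protein at its original position.
    return rank_means[ranks]
```

After `median_norm`, every sample has the same median (the mean of the original sample medians). After `quantile_norm`, every sample contains the same set of values, while the ordering of proteins within each sample is preserved.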

Results
To validate and visualize the functionalities of MSPypeline, a label-free DDA experiment was performed to generate a benchmark dataset, which is used in the documentation for a demonstrative analysis. It serves as the built-in dataset of the software (Supplementary Material). The original MS raw data files and the MaxQuant search result files are available from the ProteomeXchange Consortium via the PRIDE repository (Deutsch et al., 2020; dataset identifier PXD025792). All input and output files from the benchmark dataset are bundled with the MSPypeline release.
By providing automation and standardization of the downstream steps in the analysis of label-free proteome data, MSPypeline minimizes time-consuming and error-prone manual tasks. Moreover, new users can get started faster in analyzing proteomics datasets through the available graphical user interface because they do not need to familiarize themselves with a complex analysis environment. A further advantage of the package is that it can be extended step-wise: additional building blocks can be linked to its core in the future while the basic functionalities are retained. Thus, MSPypeline can be easily adapted to the output of other search tools, such as Proteome Discoverer (Thermo Fisher Scientific) and OpenMS (Pfeuffer et al., 2017). Similarly, MSPypeline can be adapted to analyze label-based data (e.g. stable isotope labeling by amino acids in cell culture or tandem mass tag) or data-independent acquisition datasets.

Conclusions
The modular structure of MSPypeline allows it to be readily extended to meet the needs of future developments of technology. Standardization and reproducibility are ensured by automatically logging all analysis settings and saving them to a separate configuration file. Thus, MSPypeline provides a platform that supports users with their proteomics data analysis by providing insight into data quality, offering parameter adaptation when needed and generating custom figures with guaranteed reproducibility. The reliability of differential expression analysis can be improved, and the testing of biologically relevant hypotheses is fostered.

Table 2 (excerpt). Volcano plot
Description: Volcano plot displaying -log10(P-value) versus log2 fold change comparing protein intensities of two groups. Intensities of the unique proteins are shown on each side of the plot.
Input data: Protein intensities, including thresholding.
Compared: Groups of the selected level.
Statistics: The P-value (focus on affected pathways and processes) and adjusted P-value (Benjamini-Hochberg, focus on regulated proteins) are determined using the R limma package. Calculations are corrected for the intensity-variance relationship. Either the 10 most significant proteins or the proteins of the selected pathways are annotated.
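The quantities plotted in the volcano plot can be sketched in Python. This sketch does not reproduce limma's moderated statistics or its correction for the intensity-variance relationship: the P-values are taken as given input, and only the Benjamini-Hochberg adjustment and the plot coordinates are computed. The function names and the direction of the fold change (group A minus group B) are assumptions for illustration.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the i-th smallest P-value by n / i.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest P-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted


def volcano_coordinates(log2_a, log2_b, p_values):
    """Compute x (log2 fold change) and y (-log10 P-value) for each protein.

    log2_a, log2_b: arrays of log2 intensities, proteins in rows, samples in columns.
    """
    log2_fc = np.mean(log2_a, axis=1) - np.mean(log2_b, axis=1)
    neg_log10_p = -np.log10(np.asarray(p_values, dtype=float))
    return log2_fc, neg_log10_p
```

For the P-values `[0.01, 0.04, 0.03, 0.005]`, `benjamini_hochberg` returns `[0.02, 0.04, 0.04, 0.02]`.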