promor: a comprehensive R package for label-free proteomics data analysis and predictive modeling

Abstract
Summary: We present promor, a comprehensive, user-friendly R package that streamlines label-free quantification proteomics data analysis and the building of machine learning-based predictive models with top protein candidates.
Availability and implementation: promor is freely available as an open-source R package on the Comprehensive R Archive Network (CRAN) (https://CRAN.R-project.org/package=promor) and is distributed under the Lesser General Public License (version 2.1 or later). The development version of promor is maintained on GitHub (https://github.com/caranathunge/promor), and additional documentation and tutorials are provided on the package website (https://caranathunge.github.io/promor/).
Supplementary information: Supplementary data are available at Bioinformatics Advances online.


Introduction
Label-free quantification (LFQ) approaches are commonly used in mass spectrometry-based proteomics. One of the most widely used software tools for protein identification and quantification is MaxQuant (Tyanova et al., 2016a). The downstream analysis of MaxQuant output files can be complex and is often challenging for those inexperienced in proteomics data analysis. Some tools available for this purpose are implemented as graphical user interface (GUI) applications [e.g. LFQ-Analyst (Shah et al., 2019), ProVision (Gallant et al., 2020), ProteoSign (Efstathiou et al., 2017)], among which one of the most popular is the MaxQuant-associated tool, Perseus (Tyanova et al., 2016b). Perseus is an extensive software suite that offers a range of features to analyze several different types of proteomics data. While Perseus is fairly easy to use, its user interface, with its wide range of options, can at times be overwhelming to new users. Furthermore, the inability to save previously used analytical settings in GUI applications such as Perseus may present challenges to researchers looking to standardize data analysis. Other tools, such as MSstats (Choi et al., 2014), protti (Quast et al., 2022), pmartR (Stratton et al., 2019) and DEP (Zhang et al., 2018), are primarily implemented as R packages and bring greater analytical flexibility and reproducibility to proteomics data analysis workflows. While all of these tools can perform the steps of a typical proteomics data analysis workflow, users may need additional software to perform tasks specific to their research domain (e.g. clinical applications, biomarker discovery).
In recent years, machine learning (ML) has made its presence felt in the field of proteomics. Particularly in biomarker research, ML is becoming a popular tool to derive candidate biomarker panels from proteomics data (Bader et al., 2020; Virreira Winter et al., 2021). ML algorithms are now widely employed to build proteomics-based predictive models of disease prognosis and diagnosis (Desaire et al., 2022; Mann et al., 2021). When building a proteomics-based predictive model, choosing a robust panel of protein candidates can greatly improve the accuracy of the model. In this regard, ML-based predictive models could benefit from narrowing down protein features to those that show significant differences in abundance between groups of interest. In the current landscape of proteomics data analysis tools, the capability to transition seamlessly from differential expression analysis to predictive modeling is limited. To address this need, we developed promor, a comprehensive, user-friendly R package that streamlines differential expression analysis and predictive modeling of label-free proteomics data. promor provides an all-in-one reproducible workflow that integrates tools to perform quality control, visualization and differential expression analysis of label-free proteomics data. Furthermore, promor integrates tools to build ML-based predictive models using top protein candidates identified through differential expression analysis, assess model performance, determine feature importance and estimate the predictive power of the models.
© The Author(s) 2023. Published by Oxford University Press.

Implementation
promor is implemented in R (≥3.5.0) and relies on packages such as imputeLCMD (Lazar et al., 2016), limma (Ritchie et al., 2015) and caret (Kuhn, 2008) for back-end pre-processing, differential expression analysis and ML-based modeling, respectively. As input, promor requires a user-generated tab-delimited text file containing the experimental design and either a MaxQuant-produced 'proteinGroups.txt' file or a standard quantitative table of protein intensities, which could be produced by any proteomics data analysis software. For visualization, promor employs the popular ggplot2 (Wickham et al., 2016) architecture and produces ggplot objects, which allow for further customization (Fig. 1).

Proteomics data analysis
promor can be used to analyze any bottom-up label-free proteomics data (e.g. raw, LFQ or iBAQ). Multiple functions are provided for quality control, visualization, missing data imputation, normalization and differential expression analysis (Table 1 and Fig. 1A).
To demonstrate the utility of promor for analyzing label-free proteomics data that do not contain technical replicates, we analyzed a previously published proteome benchmark data set by Cox et al. (2014) (PRIDE ID: PXD000279). The data set consists of LFQ protein intensity data for 6694 proteins quantified from HeLa (H) and Escherichia coli (L) lysates that were mixed at defined ratios. There were six samples in total, with three biological replicates in each of the two groups. The results from the analysis were visualized at multiple stages (Supplementary Figs S1-S5). First, we pre-processed the data using the create_df function with default settings. The create_df function removed contaminant proteins, proteins identified 'only-by-site', reverse sequence proteins and proteins identified by two or fewer unique peptides. To remove proteins with a high proportion of missing values, we used the filterbygroup_na function, setting the highest allowed missing data percentage in either group at 40%. Next, we imputed the missing data in the data frame using the impute_na function with the default 'minProb' method, assuming that the missing values are left-censored. Since the data had already been normalized with the MaxLFQ algorithm (Cox et al., 2014) in MaxQuant, we did not further normalize the data in promor. The output of imputation (imp_df object) was used in the differential expression analysis, performed using the default settings in the find_dep function. We identified 1294 significantly differentially expressed proteins between the 'H' and 'L' groups in the data (Supplementary Table S1 and Supplementary Figs S4 and S5).
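The group-wise missing-value filter and the left-censored imputation described above can be illustrated with a minimal, language-neutral sketch. It is written in plain Python rather than R, it does not reproduce promor's or imputeLCMD's code, and every name in it (passes_group_filter, impute_left_censored, the toy intensities, the q and sd_scale parameters) is hypothetical. It assumes that a protein is kept only when no group exceeds the 40% limit, and it stands in for 'minProb' with random draws centred on a low quantile of the observed intensities in each sample.

```python
import random
import statistics

# Hypothetical log2 LFQ intensities; None marks a missing value.
groups = ["H", "H", "H", "L", "L", "L"]
data = {
    "P1": [25.1, 24.9, 25.3, 27.0, 26.8, 27.2],  # complete
    "P2": [22.4, None, 22.9, 23.1, 23.0, None],  # 1 of 3 missing per group (33%)
    "P3": [None, None, 21.5, 24.2, 24.0, 24.4],  # 2 of 3 missing in H (67%)
    "P4": [19.8, 20.1, 19.9, 20.3, 20.0, 20.2],  # complete, low abundance
}

def passes_group_filter(values, groups, max_missing=0.40):
    """Keep a protein only if every group is at or below the missing-value limit."""
    for group in sorted(set(groups)):
        in_group = [v for v, g in zip(values, groups) if g == group]
        fraction_missing = sum(v is None for v in in_group) / len(in_group)
        if fraction_missing > max_missing:
            return False
    return True

def impute_left_censored(data, n_samples, rng, q=0.01, sd_scale=1.0):
    """Fill missing values with draws centred on a low quantile of each sample."""
    # Spread of the draws: median of the per-protein standard deviations.
    sds = [statistics.stdev([v for v in vals if v is not None])
           for vals in data.values()
           if sum(v is not None for v in vals) >= 2]
    sd = statistics.median(sds) * sd_scale
    imputed = {name: list(vals) for name, vals in data.items()}
    for j in range(n_samples):
        observed = sorted(vals[j] for vals in data.values() if vals[j] is not None)
        # Nearest-rank low quantile of the observed intensities in sample j.
        centre = observed[int(q * (len(observed) - 1))]
        for name in imputed:
            if imputed[name][j] is None:
                imputed[name][j] = rng.gauss(centre, sd)
    return imputed

filtered = {name: vals for name, vals in data.items()
            if passes_group_filter(vals, groups)}
imputed = impute_left_censored(filtered, len(groups), random.Random(1))
print(sorted(filtered))  # P3 is dropped: 67% missing in group H
```

With three replicates per group, a 40% limit tolerates at most one missing value per group, which is why the toy protein P2 is retained and P3 is not. Observed values are left untouched; only the gaps are filled.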
Furthermore, to test the utility of promor for analyzing label-free proteomics data that contain technical replicates, we analyzed previously published data by Ramond et al. (2015) (PRIDE ID: PXD001584). This data set consists of LFQ protein intensity data obtained from two strains (WT, wild type; D8, ΔargP mutant) of Francisella tularensis, a pathogenic bacterium responsible for the zoonotic disease tularemia. The proteinGroups.txt file contained LFQ data for 1265 proteins across 18 samples representing the two conditions (WT and D8), with three biological replicates in each condition and three technical replicates for each biological replicate. A step-by-step tutorial describing the workflow and the implementation choices in detail is provided at https://caranathunge.github.io/promor/articles/promor_with_techreps.html

Building predictive models
In promor, multiple functions are provided to build predictive models with differentially expressed proteins and to assess model performance (Table 1 and Fig. 1B). Over 200 ML algorithms are made accessible through the caret package (Kuhn, 2008) for building predictive models. For users inexperienced in complex ML algorithms, promor provides a default list of five widely used classification-based algorithms, chosen to represent a variety of ML model types (e.g. random forest, support vector machines, generalized linear models, naive bayes and gradient boosting). However, while many different algorithms can be applied to proteomics data, it is important to note that not all of them are well suited to the problem at hand. The choice of ML algorithms should be made carefully according to the prediction task, data type, sample size and the number of features (proteins) in the data set. We tested the use of promor for building predictive models by analyzing a previously published data set by Suvarna et al. (2021) (PRIDE ID: PXD022296). In the original study, the authors built proteomics-based classification models to predict COVID-19 severity in patients. To avoid class imbalance in the data, only a subset of the samples from the original proteinGroups.txt file was used. The steps leading up to differential expression analysis are described in detail at https://caranathunge.github.io/promor/articles/promor_for_modeling.html. The results from differential expression analysis (fit_df object) and the normalized data frame (norm_df object) were used in the modeling workflow. The fit_df and norm_df objects were pre-processed with the pre_process function to convert the data into a model_df object. Next, we split the data into training and test data sets using the split_data function. The training data set contained 70% of the data (29 samples), while the test data set contained the remaining 30% (6 samples).
The train_models function was run on the training data set in the split_df object with four selected ML algorithms: random forest (rf), support vector machine with linear kernel (svmLinear), naive bayes (naive_bayes) and K-nearest neighbor (knn). The four algorithms were chosen based on their suitability for building models using few features (8 proteins) and samples (35 samples). Furthermore, repeated k-fold cross-validation (k = 10, repeats = 3) was employed to evaluate model performance. The output was used to test the models on the test data set included in the split_df object. The results from the analysis were visualized at multiple levels during the modeling workflow. The model built with the 'naive_bayes' algorithm performed best in terms of accuracy (85.5%) and Area Under the Curve (AUC = 88.9%) (Supplementary Fig. S9).
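To make the resampling scheme concrete, the following minimal sketch generates the index partitions for repeated k-fold cross-validation with k = 10 and three repeats on the 29 training samples. It is a plain Python illustration, not the caret or promor implementation; the function name repeated_kfold and its seed argument are hypothetical, and, as a simplifying assumption, the folds are not stratified by class.

```python
import random

def repeated_kfold(n_samples, k=10, repeats=3, seed=0):
    """Yield (repeat, train_idx, holdout_idx) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for repeat in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for every repeat
        folds = [order[i::k] for i in range(k)]  # k folds of near-equal size
        for holdout in folds:
            held = set(holdout)
            train = [i for i in order if i not in held]
            yield repeat, sorted(train), sorted(holdout)

splits = list(repeated_kfold(n_samples=29, k=10, repeats=3))
print(len(splits))  # 30 resamples: 10 folds x 3 repeats
```

Each model is fitted 30 times, once per resample, and performance is averaged over the held-out folds. With only 29 training samples, every held-out fold contains two or three samples, so the per-fold estimates are coarse and the averaging across repeats is what stabilizes them.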

Benchmarking
We compared the performance of promor against Perseus using the previously mentioned Cox et al. (2014) (PRIDE ID: PXD000279) data set. A workflow and parameters identical to those described in Section 2.2 were used in Perseus. In Perseus, we used the imputeLCMD plugin to implement the 'minProb' imputation method and the limma plugin to implement the moderated t-test. We observed a substantial overlap in the differentially expressed proteins identified by the two programs (98.85%) (Supplementary Tables S1 and S2 and Fig. 2A). The proteins identified by only one of the two programs could be attributed to the random sampling during missing value imputation. Furthermore, the calculated log-fold changes and P-values were strongly correlated between the two programs (Fig. 2B and C). R code for the benchmarking analysis is provided on GitHub at https://github.com/caranathunge/promor_bioRxiv_preprint
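The two comparison metrics can be sketched as follows. This is a plain Python illustration with made-up protein IDs and fold changes, not the authors' benchmarking code (which is in R at the repository above). The text does not state how the overlap percentage was defined, so the sketch assumes one common choice, the shared hits as a percentage of all hits reported by either tool; the helper names percent_overlap and pearson are hypothetical.

```python
import math

def percent_overlap(hits_a, hits_b):
    """Shared hits as a percentage of all hits reported by either tool."""
    a, b = set(hits_a), set(hits_b)
    return 100.0 * len(a & b) / len(a | b)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical results: protein ID -> log2 fold change of significant proteins.
tool_a = {"P1": 1.9, "P2": -2.1, "P3": 0.8, "P4": 2.4}
tool_b = {"P1": 2.0, "P2": -2.0, "P3": 1.0, "P5": 1.1}

shared = sorted(set(tool_a) & set(tool_b))
overlap = percent_overlap(tool_a, tool_b)  # 3 shared out of 5 distinct hits
r = pearson([tool_a[p] for p in shared], [tool_b[p] for p in shared])
print(overlap)  # 60.0
```

Because imputation draws random values, repeating such a comparison with a different random seed can move a few borderline proteins in or out of either hit list, which is consistent with the small non-overlapping set reported above.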

Conclusions
We present promor, a user-friendly, comprehensive R package that facilitates a seamless transition from differential expression analysis of label-free proteomics data to building predictive models with top protein candidates, a feature that could be particularly useful in clinical and biomarker research.
Fig. 2. Scatterplots of the resulting protein log2 fold changes (B) and log10 P-values (C) of differentially expressed proteins as calculated by promor and Perseus.