Updated MS²PIP web server supports cutting-edge proteomics applications

Abstract Interest in the use of machine learning for peptide fragmentation spectrum prediction has been strongly on the rise over the past years, especially for applications in challenging proteomics identification workflows such as immunopeptidomics and the full-proteome identification of data independent acquisition spectra. Since its inception, the MS²PIP peptide spectrum predictor has been widely used for various downstream applications, mostly thanks to its accuracy, ease-of-use, and broad applicability. We here present a thoroughly updated version of the MS²PIP web server, which includes new and more performant prediction models for both tryptic- and non-tryptic peptides, for immunopeptides, and for CID-fragmented TMT-labeled peptides. Additionally, we have also added new functionality to greatly facilitate the generation of proteome-wide predicted spectral libraries, requiring only a FASTA protein file as input. These libraries also include retention time predictions from DeepLC. Moreover, we now provide pre-built and ready-to-download spectral libraries for various model organisms in multiple DIA-compatible spectral library formats. Besides upgrading the back-end models, the user experience on the MS²PIP web server is thus also greatly enhanced, extending its applicability to new domains, including immunopeptidomics and MS3-based TMT quantification experiments. MS²PIP is freely available at https://iomics.ugent.be/ms2pip/.


INTRODUCTION
Over the past decade, the ever-broadening scope of diverse proteomics workflows has engendered greatly increased interest in the field. However, these new applications each come with their specific challenges. For example, immunopeptidomics must address the non-tryptic nature of immunopeptides, whereas isobaric labelling for quantification can result in reduced identification efficiency (1,2). These specialized approaches, which build on sample preparation and proteomics acquisition innovations, therefore also require novel developments in data analysis to maximally exploit the value of the corresponding data.
One data analysis innovation that has impacted nearly all of the new proteomics workflows is the machine learning-based prediction of accurate peptide fragmentation spectra, as pioneered by MS²PIP and others (3,4). Indeed, we have previously showcased the wide applicability of MS²PIP predictions (5-7) (https://iomics.ugent.be/ms2pip), and how these can be leveraged to boost the yields from various proteomics identification strategies (8). Interesting use cases of these predictions include the rescoring of peptide-spectrum matches (PSMs) (8,9), the creation of proteome-wide spectral libraries for data-independent acquisition (DIA) (10,11) and streamlining the design of targeted proteomics experiments (12,13). While MS²PIP already supported a wide variety of fragmentation methods, instruments, and labelling techniques, the development of various novel impactful proteomics workflows resulted in a clear demand for additional, specialized MS²PIP models.
Nucleic Acids Research, 2023, Vol. 51, Web Server issue W339
We have therefore further expanded MS²PIP with the requisite new prediction models, which now include support for tryptic and non-tryptic peptides, for immunopeptides, and for collision-induced dissociation (CID) spectra of peptides treated with tandem-mass-tag (TMT) quantification labels. These new models allow MS²PIP to be applied in alternative digestion experiments, in immunopeptidomics experiments, and in MS3-TMT-based quantification studies. We have updated the MS²PIP web server to include these new prediction models, alongside several new features, such as the integration of our state-of-the-art retention time predictor DeepLC (14), the option to generate proteome-wide spectral libraries starting from only a FASTA file, and the availability of prebuilt, ready-to-download spectral libraries for ten common model organisms in multiple DIA-compatible file formats. These updates will further streamline downstream use of MS²PIP, allowing even wider adoption and utility.

Updated MS²PIP core library with increased availability
Since the previous MS²PIP web server publication, we have drastically improved the availability and usability of MS²PIP's core library. It is now available as a standalone Python package that can be easily installed on all major OS platforms with PyPI, with Bioconda, or as a BioContainer. In addition to the command line interface (CLI), a new Python interface now allows MS²PIP to be easily integrated into other tools and workflows. To compute correlations between observed and predicted spectra, MS²PIP now supports both MGF and mzML spectrum file formats. MS²PIP now also seamlessly integrates the state-of-the-art retention time predictor DeepLC. Furthermore, we have implemented two new operating modes for MS²PIP: (i) the fasta2speclib command allows users to generate proteome-wide predicted spectral libraries, starting from only a FASTA proteome file and (ii) the single-prediction command allows users to quickly predict a single spectrum directly from the CLI. The MS²PIP core package is open-source under the permissive Apache 2.0 license, and is freely available at https://github.com/compomics/ms2pip/.

Extended and improved MS²PIP web server
For an optimal, user-friendly experience, MS²PIP is made available as an online web server. Since its previous publication, we have significantly extended the MS²PIP web server functionality. First, the web server contains all new features of the MS²PIP core library, most notably including the new prediction models (see below). Second, without any additional configuration, users can opt to include accurate retention time predictions from our retention time predictor DeepLC in the predicted libraries. Third, the web server now also accepts, next to the existing peptide list input, a protein FASTA file with 'search space' settings that define which peptides will be included in the library. Configurable settings include the cleavage rules for in silico digestion, the number of allowed missed cleavages, the precursor m/z range, and common residue modifications. Fourth, we now provide ready-to-download spectral libraries for ten common model organism UniProt reference proteomes, including Homo sapiens, Escherichia coli and Arabidopsis thaliana. Each library is available in the MSP and SSL/MS2 file formats, ensuring compatibility with major DIA search engines, such as DIA-NN (15) and Skyline (16).
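To illustrate how such 'search space' settings narrow a FASTA file down to a list of library entries, the following is a minimal Python sketch. It is not the MS²PIP or fasta2speclib implementation: the function names (`read_fasta`, `digest`, `precursor_mz`, `search_space`), the default values, and the trypsin-like cleavage rule are all assumptions chosen for illustration, and residue modifications are left out entirely.

```python
import re

# Monoisotopic residue masses in Da for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O added to the residue sum
PROTON = 1.00728   # mass of one charge-carrying proton


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def digest(sequence, cleavage_regex=r"(?<=[KR])(?!P)", missed_cleavages=2):
    """Yield peptides from an in silico digest.

    The default (hypothetical) rule cleaves after K or R unless followed by P.
    """
    cuts = {m.start() for m in re.finditer(cleavage_regex, sequence)}
    sites = sorted(cuts | {0, len(sequence)})
    last = len(sites) - 1
    for i in range(last):
        # Joining up to `missed_cleavages` extra fragments models missed sites.
        for j in range(i + 1, min(i + missed_cleavages + 1, last) + 1):
            yield sequence[sites[i]:sites[j]]


def precursor_mz(peptide, charge):
    """Monoisotopic m/z of an unmodified peptide at the given charge."""
    mass = sum(MONO[aa] for aa in peptide) + WATER
    return (mass + charge * PROTON) / charge


def search_space(fasta_text, min_length=8, max_length=30, charges=(2, 3),
                 mz_range=(300.0, 1500.0), missed_cleavages=2):
    """Yield unique (peptide, charge, m/z) entries passing all filters."""
    seen = set()
    for _header, sequence in read_fasta(fasta_text):
        for peptide in digest(sequence, missed_cleavages=missed_cleavages):
            if not min_length <= len(peptide) <= max_length:
                continue
            if any(aa not in MONO for aa in peptide):
                continue  # skip ambiguous residues such as X or B
            for charge in charges:
                mz = precursor_mz(peptide, charge)
                if mz_range[0] <= mz <= mz_range[1] and (peptide, charge) not in seen:
                    seen.add((peptide, charge))
                    yield peptide, charge, mz
```

Each resulting peptide and charge pair would then be passed to a spectrum predictor and, optionally, a retention time predictor. Expressing the cleavage rule as a zero-width regular expression keeps the enzyme configurable without changing the digestion loop.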

New prediction models for (non-)tryptic peptides, immunopeptides and MS3 quantification experiments
We have updated MS²PIP with three new prediction models. The 2019 model for HCD fragmentation was originally only trained on tryptic peptides. Non-tryptic peptides, however, lack the basic lysine or arginine on their C-terminus, which heavily influences fragmentation patterns (17). As a result, the existing MS²PIP models performed suboptimally for non-tryptic peptides. To allow MS²PIP to be applied to proteomics workflows that yield non-tryptic peptides, such as alternative-digestion and biopeptidomics experiments, we have trained a new and improved HCD model capable of both tryptic and non-tryptic peptide predictions. This model was validated on external evaluation data sets containing peptides from both trypsin and chymotrypsin digestion. Importantly, this new, much more generic model outperforms the previous model on both tryptic and non-tryptic peptides. Additionally, we have trained a specialized model for immunopeptides to be used in immunopeptidomics experiments. This model was validated on both HLA class I and HLA class II peptides.
In quantitative mass spectrometry, MS3 acquisition of TMT-labeled spectra has been gaining popularity over traditional MS2 acquisition (18). However, the combination of CID fragmentation, ion trap acquisition of MS2 spectra, and TMT labelling substantially alters fragmentation patterns, which is detrimental for the performance of both the existing CID and HCD-TMT MS²PIP models. Therefore, we have trained and validated a new CID-TMT model to allow for applications of MS²PIP in MS3-TMT-based quantification studies.
Training, test, and evaluation data were downloaded from PRIDE (19,20) and converted to MS²PIP input files (Supplementary Table S1), except for the CID-TMT training data, which were generated in-house and are available from PRIDE with identifier PXD041002 (see supplementary methods). While not explicitly considered for intensity prediction, the training and test data also included common modifications such as oxidation of M, carbamidomethylation of C and acetylation of the amino termini. To guarantee fully external, unseen evaluation data sets, peptidoforms shared between the training and test sets were removed from the test set. Similar to the 2019 MS²PIP models (7), all new models were trained with a gradient boosting machine learning algorithm (see Supplementary Table S2) as imple-

Performance of the new MS²PIP models
To evaluate the newly added MS²PIP models, we selected several unseen, external evaluation data sets to compare the predictions with observed spectra and calculate Pearson correlation coefficients (PCC) per spectrum. The selected Orbitrap HCD data sets consist of trypsin-digested, chymotrypsin-digested, HLA class I and HLA class II peptides, respectively. We compared the performance of the new MS²PIP HCD and immunopeptide models with the 2019 MS²PIP model on each of these evaluation data sets. Both new models showed substantial increases in performance on their respective target data sets, with a median PCC of 0.93 and 0.88 for the 2021 HCD model on trypsin and chymotrypsin and a median PCC of 0.94 and 0.91 for the immunopeptide HCD model on HLA-I and HLA-II data (Figure 1). Notably, even when evaluated on a trypsin-digested peptide data set, the new, more generic HCD model still shows an increase in performance, suggesting that combining tryptic and non-tryptic data for training leads to an overall better generalized model. Furthermore, the lower performance of the specialized immunopeptide model on chymotrypsin-digested peptide data indicates that these two types of non-tryptic peptides are likely very different. Indeed, separating predictive performance by peptide length shows a significant drop in accuracy for peptides longer than 17 amino acids for the immunopeptide model, while the new general HCD model shows a consistently high performance across peptide lengths (Supplementary Figure S1). When examining the prediction accuracies for HLA type I and type II in a similar manner, we observe an improved performance across all peptide lengths and for both HLA types, compared to the 2019 HCD model (Supplementary Figure S2).
Previously we have shown that acquisition modes and isobaric labelling techniques can heavily alter peak intensity patterns (7). This is especially the case for ion trap-based CID acquisition of TMT-labelled spectra. Evaluation on a CID-TMT data set shows that neither the existing CID nor the existing HCD-TMT MS²PIP models generalized well for this type of peptide spectra. Interestingly, the HCD-TMT model still outperforms the CID model, suggesting that the labelling method has a larger influence on peak intensity patterns than the fragmentation method (Supplementary Figure S3). This can be confirmed by correlating observed spectra directly for each type. Indeed, observed HCD-TMT spectra correlate slightly better with CID-TMT spectra than with unlabeled CID peptide spectra. Nevertheless, as both correlations are low, there was a need for a specialized CID-TMT prediction model. The newly trained CID-TMT model vastly outperforms current models with a median PCC of 0.84 (Figure 1).
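The per-spectrum PCC and its median over a data set, as reported throughout this section, can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `pearson` and `median_pcc` are hypothetical, the inputs are assumed to be already-matched vectors of observed and predicted fragment ion intensities, and any intensity normalization or transformation applied beforehand is not specified here.

```python
import math
from statistics import median


def pearson(observed, predicted):
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_o * var_p)


def median_pcc(spectra):
    """Median PCC over (observed, predicted) intensity pairs.

    Spectra with an undefined correlation are excluded (an assumption
    of this sketch, not a documented MS²PIP behaviour).
    """
    pccs = [pearson(obs, pred) for obs, pred in spectra]
    pccs = [r for r in pccs if not math.isnan(r)]
    return median(pccs)
```

Reporting the median rather than the mean keeps a handful of poorly predicted spectra from dominating the summary, which is presumably why the median PCC is used here to compare models across data sets.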

CONCLUSION AND FUTURE PERSPECTIVES
The use of machine learning-based predictive models for analyte behavior has become an indispensable part of proteomics, as is reflected by the number and popularity of machine learning tools, including MS²PIP, that have been published in the past years (3,4). Among these tools, the prediction of fragment intensities and peptide retention times has proven highly valuable for improving the confidence in peptide identification (9). While recently many deep learning-based spectrum predictors have been developed, the use of the gradient tree boosting (XGBoost) machine learning algorithm allows us to easily build accurate prediction models for specialized use cases where less training data might be available. Additionally, MS²PIP does not require graphics processing units and can be run on virtually any computer system. Nevertheless, with the updated MS²PIP web server we aim to make both MS²PIP and DeepLC even more easily accessible to the entire proteomics community. The updated MS²PIP web server is the first to allow users to generate proteome-wide spectral libraries on the fly directly from a FASTA file and additionally provides pre-built spectral libraries for ten model organisms. Furthermore, thanks to the addition of three new, highly performant peptide spectrum prediction models, MS²PIP continues to support and push forward innovations in proteomics and its various established and emerging subfields.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.