Ionmob: a Python package for prediction of peptide collisional cross-section values

Abstract Motivation: Including ion mobility separation (IMS) in mass spectrometry proteomics experiments is useful to improve coverage and throughput. Many IMS devices enable linking the experimentally derived mobility of an ion to its collisional cross-section (CCS), a highly reproducible physicochemical property dependent on the ion’s mass, charge and conformation in the gas phase. Thus, known peptide ion mobilities can be used to tailor acquisition methods or to refine database search results. The large space of potential peptide sequences, driven also by posttranslational modifications of amino acids, motivates an in silico predictor for peptide CCS. Recent studies explored the general performance of varying machine-learning techniques; however, the workflow engineering part was of secondary importance. For the sake of applicability, such a tool should be generic, data driven, and offer the possibility to be easily adapted to individual workflows for experimental design and data processing. Results: We created ionmob, a Python-based framework for data preparation, training, and prediction of collisional cross-section values of peptides. It is easily customizable and includes a set of pretrained, ready-to-use models and preprocessing routines for training and inference. Using a set of ≈21 000 unique phosphorylated peptides and ≈17 000 MHC ligand sequence and charge state pairs, we expand upon the space of peptides that can be integrated into CCS prediction. Lastly, we investigate the applicability of in silico predicted CCS to increase confidence in identified peptides by applying methods of re-scoring and demonstrate that predicted CCS values complement existing predictors for that task. Availability and implementation: The Python package is available on GitHub: https://github.com/theGreatHerrLebert/ionmob.


Biological samples
Cell and yeast culture. The human cervix carcinoma cell line HeLa was obtained from the German Resource Centre for Biological Material (DSMZ). Cells were cultured in Iscove's Modified Dulbecco's Medium (IMDM; PAN Biotech, Aidenbach, Germany) supplemented with 10% (v/v) fetal calf serum (FCS; Thermo Fisher Scientific (Invitrogen), Waltham, MA), 1% (v/v) L-glutamine (Carl Roth), and 1% (v/v) sodium pyruvate (Serva, Heidelberg, Germany) at 37°C in a 5% CO2 environment and harvested at 70% confluence. Cells were washed once with phosphate-buffered saline (PBS; Carl Roth) and detached from the culture flasks with 0.05% Trypsin-EDTA solution (Sigma-Aldrich). The human B lymphoblastoid cell line JY was purchased from ATCC. JY cells were grown in suspension in RPMI 1640 medium supplemented with 10% FCS (Gibco), 2 mM glutamine, 1 mM sodium pyruvate, 100 units/ml penicillin, and 100 µg/ml streptomycin. Saccharomyces cerevisiae bayanus, strain Lalvin EC-1118, was obtained from the Institut Oenologique de Champagne and grown in YPD medium. Harvested HeLa, JY, and yeast cells were transferred into centrifuge tubes and washed three times with PBS before being frozen and stored at -80°C until further processing.
conducted in accordance with national laws and approved by the local authorities.
Commercial human plasma and E. coli samples. Human blood plasma was purchased from XXY, aliquoted, and stored at -80°C until further processing. A tryptic protein digest of E. coli proteins (MassPREP standard) was purchased from Waters.

Sample preparation
Protein extraction. HeLa cells were lysed using a urea-based lysis buffer (7 M urea, 2 M thiourea, 5 mM dithiothreitol (DTT), 2% (w/v) CHAPS). JY cell pellets were thawed and lysed in 1% CHAPS in PBS. Tissue was ground in liquid nitrogen using a mortar and pestle. Proteins were extracted from the tissue powder by adding a urea-based lysis buffer (8 M urea, 2 M thiourea in 100 mM NH4HCO3, pH 7.4). Lysis was further promoted by sonication at 4°C for 15 min (30 s on/30 s off) using a Bioruptor device (Diagenode, Liège, Belgium). After cell lysis, protein concentration was determined using the Pierce 660 nm protein assay (for HeLa) or the Pierce BCA protein assay (for JY, because of the CHAPS content) according to the manufacturer's protocols (Thermo Fisher Scientific).
Protein digestion for whole-proteome samples. HeLa, JY, yeast, and plasma samples were processed using filter-aided sample preparation (FASP) as detailed before (Wisniewski et al. (2009), Sielaff et al. (2017)). In brief, lysates were loaded onto spin filter columns (Nanosep centrifugal devices with Omega membrane, 30 kDa MWCO; Pall, Port Washington, NY) and washed three times with buffer containing 8 M urea. Afterward, proteins were reduced and alkylated using DTT and iodoacetamide (IAA), respectively. After alkylation, excess IAA was quenched by the addition of DTT. Then, the buffer was exchanged by washing the membrane three times with 50 mM NH4HCO3. The proteins were digested overnight at 37°C using trypsin (Trypsin Gold, Promega, Madison, WI) at an enzyme-to-protein ratio of 1:50 (w/w). After proteolytic digestion, peptides were recovered by centrifugation and two additional washes with 50 mM NH4HCO3. After combining peptide flow-throughs, samples were acidified with trifluoroacetic acid (TFA) to a final concentration of 1% (v/v) and lyophilized. Lyophilized peptides were reconstituted in 0.1% (v/v) formic acid (FA) for LC-MS analysis.

Liquid-chromatography mass spectrometry (LC-MS)
Liquid-chromatography mass spectrometry (LC-MS). Reconstituted peptides were directly injected and separated on a nanoElute LC system (Bruker Corporation, Billerica, MA, USA) at 400 nL/min using a reversed-phase C18 column (Aurora, 25 cm × 75 µm, 1.6 µm, IonOpticks) attached to an MS. Eluting peptides were ionized in a CaptiveSpray source (Bruker Corporation). nanoLC separation. The column was heated to 50°C. Mobile phase A was 0.1% FA (v/v) in water, and mobile phase B was 0.1% FA (v/v) in ACN. Peptides were loaded onto the column in direct injection mode at 600 bar and were separated by running a linear gradient from 2% to 37% mobile phase B over 38 min. Afterward, the column was rinsed at 95% B, resulting in a total method time of 47 min. For the analysis on the timsTOF SCP, phosphopeptide samples were analyzed using a C18 Aurora UHPLC emitter column (15 cm × 75 µm, 1.6 µm, IonOpticks), which was heated to 50°C. Peptides were loaded onto the column in direct injection mode at 600 bar and separated by running a linear gradient from 2% to 37% mobile phase B over 39 min at a flow rate of 400 nL/min. Then, the column was rinsed for 5 min at 95% B.
Analysis on the timsTOF Pro 2 or timsTOF SCP. In the timsTOF Pro 2 (Bruker), the dual TIMS was operated at a fixed duty cycle close to 100% using equal accumulation and ramp times of 100 ms, each spanning a mobility range from 1/K0 = 0.6 Vs cm⁻² to 1.6 Vs cm⁻². The DDA-PASEF mode comprised ten PASEF scans per topN acquisition cycle (Meier et al. (2018)). Singly charged precursors were excluded from fragmentation by their position in the m/z-ion mobility plane. The collision energy was ramped linearly as a function of the mobility from 59 eV at 1/K0 = 1.3 Vs cm⁻² to 20 eV at 1/K0 = 0.85 Vs cm⁻². In the timsTOF SCP, the high sensitivity detection mode was activated. or PEAKS XPro version 10.6 (BSI, Canada). For identification, FDR was set to 1%. For rescoring, FDR was set to 100%, and decoy peptides were included in the exported reports.
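The linear collision energy ramp described above can be written as a two-point interpolation. The following is a minimal sketch using only the two anchor points stated in the text; the function name is hypothetical, and the behavior outside the anchor points (plain linear extrapolation, no clamping) is an assumption, not a documented instrument setting.

```python
def collision_energy_ev(inv_k0, low=(0.85, 20.0), high=(1.3, 59.0)):
    """Collision energy (eV) as a linear function of 1/K0 (Vs cm^-2).

    `low` and `high` are the (1/K0, energy) anchor points from the text.
    Values outside the anchors are extrapolated linearly (assumption).
    """
    (x0, e0), (x1, e1) = low, high
    slope = (e1 - e0) / (x1 - x0)  # eV per unit of 1/K0
    return e0 + slope * (inv_k0 - x0)


if __name__ == "__main__":
    for inv_k0 in (0.85, 1.0, 1.3):
        print(inv_k0, round(collision_energy_ev(inv_k0), 2))
```

With these anchors the slope is 39 eV over 0.45 Vs cm⁻², so precursors with higher 1/K0 (typically larger or more extended ions) receive more collision energy.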

Peptide and protein identification from LC-MS raw files
Protein databases. Phosphopeptide samples were searched using a custom-compiled database containing UniProtKB/Swiss-Prot entries of either the mouse reference proteome (UniProtKB release 2022_04, 17,107 entries), Homo sapiens (UniProtKB release 2022_02), or a merged list of human, yeast and E. coli (UniProtKB) proteomes. All databases were supplemented with a list of common contaminants. Default decoy database generation was used in each software for FDR calculation.
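For readers unfamiliar with decoy-based FDR control, the idea can be sketched generically as follows. This is an illustration of the common target-decoy approach, not the procedure implemented in PEAKS or any other specific software; the function name and the `(score, is_decoy)` input layout are hypothetical.

```python
def filter_targets_at_fdr(psms, threshold=0.01):
    """Return target PSMs whose q-value is at or below `threshold`.

    psms: iterable of (score, is_decoy) tuples, higher score = better match.
    FDR at each rank is estimated as (#decoys) / (#targets) above that rank.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    decoys = targets = 0
    fdrs = []
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the smallest FDR at which a PSM would still be accepted,
    # i.e. the running minimum of the FDR from the worst score upwards.
    q_values = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [p for p, q in zip(ranked, q_values) if q <= threshold and not p[1]]


if __name__ == "__main__":
    example = [(10.0, False), (9.0, False), (8.0, True), (7.0, False)]
    print(filter_targets_at_fdr(example, threshold=0.01))
```

Setting `threshold=1.0` keeps every target, which mirrors the idea of exporting unfiltered results for rescoring.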
Phosphopeptide data processing in PEAKS. Trypsin was set as the digestion enzyme, allowing up to two missed cleavages. Carbamidomethylation at cysteines was set as a fixed modification. Methionine oxidation, N-terminal acetylation, and phosphorylation on serine, threonine and tyrosine were set as variable modifications, allowing a maximum of five variable modifications per peptide. Peptides were identified with resolution thresholds of 15 ppm for MS1 and 0.03 Da for MS2. In addition to the 1% FDR threshold, PTMs were filtered during data post-processing to retain only identifications with a PTM AScore above 20, corresponding to a theoretical confidence of 99% for the PTM location.
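Two of the thresholds above lend themselves to a short worked example: the 15 ppm MS1 tolerance and the AScore cutoff of 20. The sketch below assumes the AScore is defined as -10·log10(p), which is consistent with the stated 99% confidence at a score of 20; the function names are hypothetical and not part of PEAKS.

```python
def within_ppm(observed_mz, theoretical_mz, tol_ppm=15.0):
    """True if the relative m/z deviation is within `tol_ppm` parts per million."""
    deviation_ppm = abs(observed_mz - theoretical_mz) / theoretical_mz * 1e6
    return deviation_ppm <= tol_ppm


def ascore_to_error_probability(ascore):
    """Localization error probability, assuming AScore = -10 * log10(p)."""
    return 10 ** (-ascore / 10)


if __name__ == "__main__":
    # At m/z 500, a 15 ppm window corresponds to +/- 0.0075 m/z units.
    print(within_ppm(500.005, 500.0), within_ppm(500.010, 500.0))
    print(ascore_to_error_probability(20))
```

Under that assumption, an AScore of 20 maps to an error probability of 0.01, i.e. 99% confidence in the reported site.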
MHC ligand data processing using PEAKS. Protein in silico digestion was configured to unspecific cleavage and no enzyme. Methionine oxidation, cysteine cysteinylation, and protein N-terminal acetylation were all configured as variable modifications with a maximum of two modifications per peptide. Peptides were identified with resolution thresholds of 15 ppm for MS1 and 0.03 Da for MS2.
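Unspecific cleavage means that every contiguous subsequence of a protein is a candidate peptide, which is why MHC ligand searches have a much larger search space than tryptic searches. A minimal sketch of such an enumeration is given below; the length window of 8 to 12 residues is an assumption chosen for illustration and is not a setting reported here.

```python
def unspecific_peptides(protein, min_len=8, max_len=12):
    """Enumerate all distinct contiguous subsequences within a length window.

    min_len / max_len are illustrative defaults (assumption), not the
    search settings used in the study.
    """
    peptides = set()
    for start in range(len(protein)):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(protein):
                break  # longer windows from this start would also overrun
            peptides.add(protein[start:end])
    return peptides


if __name__ == "__main__":
    toy_protein = "ACDEFGHIKLMNPQRSTVWY"  # toy sequence, 20 residues
    print(len(unspecific_peptides(toy_protein)))
```

For a sequence of length n without repeats, each allowed length L contributes n - L + 1 candidates, so the candidate count grows roughly linearly with protein length times the number of allowed lengths.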

Fig. 1. Impact of phosphorylation on observed and predicted CCS values. A) Scatter plot showing pairwise experimentally observed differences between relative CCS of peptides with and without phosphorylation. The x-axis represents the relative CCS (CCS / m/z) of unphosphorylated peptides; the y-axis represents the relative CCS of unphosphorylated peptides minus the relative CCS of phosphorylated peptides in percent. Charge states are color coded. B), C), D) Distribution of relative pairwise differences of predicted CCS values between peptides with and without phosphorylation for all modeled charge states. Phosphorylations were added to a set of sequences in silico at random, and CCS values were predicted for both versions (modified and unmodified) of a given sequence and charge. Since phosphorylation also increases peptide m/z, predicted CCS values are normalized by peptide m/z. For phosphorylated peptides, relative CCS values are decreased compared to the unmodified peptides of the same charge state. The difference is calculated by dividing CCS values by m/z and then subtracting the resulting values of phosphorylated peptides from those of unphosphorylated peptides.
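The normalization and difference described in this caption can be sketched as follows. The caption does not state the denominator used for the percentage; this sketch assumes the relative CCS of the unphosphorylated peptide. The function names are hypothetical, and the CCS values in the usage example are arbitrary placeholders, not measured data.

```python
PROTON_MASS = 1.007276466   # Da
PHOSPHO_SHIFT = 79.966331   # Da, monoisotopic mass added by one phosphorylation (HPO3)


def mz_from_mass(mono_mass, charge):
    """m/z of a protonated peptide ion with the given monoisotopic mass."""
    return (mono_mass + charge * PROTON_MASS) / charge


def relative_ccs_difference_percent(ccs_unmod, mz_unmod, ccs_phos, mz_phos):
    """Relative CCS (CCS / m/z) of the unmodified minus the phosphorylated
    peptide, in percent of the unmodified value (denominator is an assumption)."""
    rel_unmod = ccs_unmod / mz_unmod
    rel_phos = ccs_phos / mz_phos
    return 100.0 * (rel_unmod - rel_phos) / rel_unmod


if __name__ == "__main__":
    mass, charge = 1000.0, 2
    mz_u = mz_from_mass(mass, charge)
    mz_p = mz_from_mass(mass + PHOSPHO_SHIFT, charge)
    # Placeholder CCS values for illustration only.
    print(relative_ccs_difference_percent(400.0, mz_u, 405.0, mz_p))
```

A positive value means the phosphorylated peptide has the smaller relative CCS, i.e. its CCS grows less than its m/z does.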

Fig. 2. Peptide spectrum matches resulting from rescoring with and without CCS features. A) Boxplots showing the percentage CCS error by charge for PSMs identified exclusively with or exclusively without CCS errors as features during rescoring. B) Barplot showing the number of PSMs by charge.
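To make the notion of CCS error features concrete, the sketch below derives signed, absolute, and percentage errors from an observed and a predicted CCS value. The feature names and the choice of the observed value as the percentage denominator are assumptions for illustration; they are not the feature definitions used by MS²Rescore or ionmob.

```python
def ccs_error_features(observed_ccs, predicted_ccs):
    """Per-PSM CCS error features (hypothetical names) for a rescoring model."""
    error = observed_ccs - predicted_ccs
    abs_error = abs(error)
    return {
        "ccs_error": error,                                   # signed, in Å²
        "abs_ccs_error": abs_error,                           # unsigned, in Å²
        "perc_ccs_error": 100.0 * abs_error / observed_ccs,   # percent of observed
    }


if __name__ == "__main__":
    print(ccs_error_features(observed_ccs=400.0, predicted_ccs=410.0))
```

Features of this kind let the rescoring model penalize matches whose measured mobility is implausible for the proposed sequence and charge.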

Fig. 3. Distribution of absolute normalized Percolator feature weights for the different features used in MS²Rescore with CCS features (left) and without CCS features (right) for the rescoring of the singly charged MHC ligands, with standard deviations for the different cross validations.

Fig. 4. Linear correlation between relative deep contribution of sequences and additive sequence property descriptors. The largest impact on relative CCS was observed for hydropathy and local flexibility. All Pearson correlations were calculated for the relative increase or decrease in CCS with respect to the initial fit baseline.

Fig. 5. Predicted vs observed CCS values of the ionmob GRU predictor compared to data generated from both a TWIMS and a DTIMS device. A), B) Observed CCS values vs relative prediction error for TWIMS (left) and DTIMS (right); charge states are color coded. C) Boxplots showing the relative prediction error by charge state for both devices. D) The m/z vs CCS plane showing all data points, measured with TWIMS and DTIMS and predicted in silico. Data was extracted from Bush et al. (2012).

Fig. 7. Experimentally determined m/z vs inverse reduced mobility, 1/K0, and translation of 1/K0 to CCS using the Mason-Schamp equation. A), C) Experimentally measured 1/K0 values for MHC peptides published by Feola et al. (2022) and a newly generated in-house MHC peptide dataset. The 1/K0 values determined in A) are too unreliable to accurately calculate CCS values, see B). However, the altered device settings in C) allow for an informative translation to CCS in D).
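The translation from 1/K0 to CCS mentioned in this caption follows the Mason-Schamp equation, CCS = (3ze / 16N0) · sqrt(2π / (μ·kB·T)) · (1/K0). The sketch below evaluates it from physical constants. It assumes nitrogen as drift gas, a temperature of 305 K, and an ion mass approximated as m/z times charge; these defaults and the function name are assumptions for illustration and may differ from the values used by ionmob or the instrument software.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19   # C
BOLTZMANN = 1.380649e-23              # J/K
ATOMIC_MASS_UNIT = 1.66053906660e-27  # kg
LOSCHMIDT = 2.686780111e25            # 1/m^3 (273.15 K, 101.325 kPa)


def one_over_k0_to_ccs(one_over_k0, mz, charge, gas_mass=28.013, temp_k=305.0):
    """Convert inverse reduced mobility (Vs cm^-2) to CCS (Å²) via Mason-Schamp.

    gas_mass (Da, N2) and temp_k are assumed defaults, not reported settings.
    """
    ion_mass = mz * charge  # Da; neglects the mass of the charge carriers
    reduced_mass = ion_mass * gas_mass / (ion_mass + gas_mass) * ATOMIC_MASS_UNIT
    k0 = 1e-4 / one_over_k0  # cm^2/(V s) converted to m^2/(V s)
    ccs_m2 = (
        3.0 * charge * ELEMENTARY_CHARGE / (16.0 * LOSCHMIDT)
        * math.sqrt(2.0 * math.pi / (reduced_mass * BOLTZMANN * temp_k))
        / k0
    )
    return ccs_m2 * 1e20  # m² to Å²


if __name__ == "__main__":
    print(one_over_k0_to_ccs(one_over_k0=1.0, mz=500.0, charge=2))
```

Because CCS scales linearly with 1/K0 and with charge, any systematic offset in the measured 1/K0 propagates directly into the calculated CCS, which is why well-calibrated mobility values are needed for an informative conversion.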