MsQuality: an interoperable open-source package for the calculation of standardized quality metrics of mass spectrometry data

Abstract
Motivation: Multiple factors can impact accuracy and reproducibility of mass spectrometry data. There is a need to integrate quality assessment and control into data analytic workflows.
Results: The MsQuality package calculates 43 low-level quality metrics based on the controlled mzQC vocabulary defined by the HUPO-PSI on a single mass spectrometry-based measurement of a sample. It helps to identify low-quality measurements and track data quality. Its use of community-standard quality metrics facilitates comparability of quality assessment and control (QA/QC) criteria across datasets.
Availability and implementation: The R package MsQuality is available through Bioconductor at https://bioconductor.org/packages/MsQuality.

The MsQuality package calculates low-level quality metrics that only require minimal information about the mass spectrometry data: retention time, m/z values, and associated intensities. The list of quality metrics provided by the mzQC framework (hupo-psi.github.io/mzQC) is extensive, also including metrics that depend on higher level information which might not be readily accessible from .raw or .mzML files, such as pump pressure mean, or that rely on alignment results, like retention time mean shift, signal-to-noise ratio, and precursor errors (ppm). Such metrics are currently not implemented in MsQuality.
The MsQuality package relies on the Spectra package for data import and representation. Quality metrics are calculated from the information in a Spectra object. The dataOrigin variable is used to distinguish between the MS data from different measurements/files. Section 1 loads these and other packages into the environment of the R session in order to run all analyses.
In subsequent sections of this document the quality of two data sets will be analyzed:
• Section 4: the Cherkaoui et al. [2022] data set is a mass spectrometry (MS) metabolomics data set of 180 cancer cell lines obtained via flow injection analysis (TOF, negative ionization mode). The data set comprises a total of 1397 measurements.
We note that these quality metrics are indicative, but by themselves might not be sufficient for data quality control decision-making, such as removing low-quality measurements, which might require additional consideration of more advanced analytics, such as those provided by MatrixQCvis [Naake and Huber, 2022]. As stated previously [Bittremieux et al., 2017], the utility of QC metrics depends on the type of the sample, e.g., on whether a single peptide or a complex lysate of proteins is analyzed [Bereman, 2015; Köcher et al., 2011; Paulovich et al., 2010].
In this document, we will
- create Spectra objects from the raw data of the two data sets,
- calculate the quality metrics on these data sets,
- visualize some of the metrics, and
- assess performance and scalability of the implemented algorithms using the microbenchmark package.
Due to the journal's publication format, this document presents static plots. Note that the MsQuality package also includes a shiny application to navigate quality metrics interactively, with plots based on the plotly framework. For reproducibility, we provide the source .Rmd file in the accompanying GitHub repository.
A list of the attached packages can be found in Section 6. We will indicate which parts of this document are reproducible.

Preparation of the environment
This analysis uses functions from multiple R packages, including Spectra for representing mass spectrometry spectral data and MsQuality for calculating quality metrics. Other packages are required for data visualization (ggplot2, ggbeeswarm, ggpubr), data wrangling (dplyr, readxl, stringr, tibble, tidyr), and performance and scalability analysis (microbenchmark). Before starting the analysis, make sure that these packages are loaded.
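As an illustration, the packages named above could be attached as follows. This is a minimal sketch, not the original setup chunk: the check for installed packages and the variable names `pkgs` and `installed` are our own additions.

```r
# Packages named in the text above.
pkgs <- c("Spectra", "MsQuality",
          "ggplot2", "ggbeeswarm", "ggpubr",
          "dplyr", "readxl", "stringr", "tibble", "tidyr",
          "microbenchmark")

# Check which of the packages are installed (assumption: missing packages
# are only reported here, not installed automatically).
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
if (any(!installed)) {
    message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
}

# Attach the packages that are available.
for (pkg in pkgs[installed]) {
    suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}
```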

List of available metrics
The following list gives a brief explanation of the available metrics. Further information may be found at the HUPO-PSI mzQC project page or in the respective help file for the quality metric (accessible by, e.g., entering ?chromatographyDuration in the R console). We also explain how each metric is calculated in MsQuality. Currently, all quality metrics can be calculated for both Spectra and MsExperiment objects.
• chromatographyDuration, chromatography duration (MS:4000053), "The retention time duration of the chromatography in seconds." [PSI:MS]; Longer duration may indicate a better chromatographic separation of compounds which depends, however, also on the sampling/scan rate of the MS instrument.
The metric is calculated as follows: (1) the retention time associated to the Spectra object is obtained, (2) the maximum and the minimum of the retention time are obtained, (3) the difference between the maximum and the minimum is calculated and returned.
• ticQuartersRtFraction, TIC quarters RT fraction (MS:4000054), "The interval when the respective quarter of the TIC accumulates divided by retention time duration." [PSI:MS]; The metric informs about the dynamic range of the acquisition along the chromatographic separation. The metric provides information on the sample (compound) flow along the chromatographic run, potentially revealing poor chromatographic performance, such as the absence of a signal for a significant portion of the run.
The metric is calculated as follows: (1) the Spectra object is ordered according to the retention time, (2) the cumulative sum of the ion count is calculated (TIC), (3) the quantiles are calculated according to the probs argument, e.g. when probs is set to c(0, 0.25, 0.5, 0.75, 1) the 0%, 25%, 50%, 75%, and 100% quantiles are calculated, (4) the retention time/relative retention time (retention time divided by the total run time taking into account the minimum retention time) is calculated, (5) the (relative) duration of the LC run after which the cumulative TIC exceeds (for the first time) the respective quantile of the cumulative TIC is calculated and returned.
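The steps for the TIC quarters RT fraction can be sketched in base R on plain vectors. This is an illustration under stated assumptions and not the MsQuality implementation: the function name `tic_quarters_rt_fraction` and its arguments `rt` and `tic` are hypothetical, and "exceeds for the first time" is interpreted here as the first scan at which the cumulative TIC is greater than or equal to the quantile.

```r
# Sketch of the TIC quarters RT fraction on plain vectors.
# rt:  retention times per scan (seconds)
# tic: total ion count per scan
tic_quarters_rt_fraction <- function(rt, tic,
                                     probs = c(0, 0.25, 0.5, 0.75, 1)) {
    # (1) order scans by retention time
    ord <- order(rt)
    rt <- rt[ord]
    tic <- tic[ord]
    # (2) cumulative sum of the ion count
    cum_tic <- cumsum(tic)
    # (3) quantiles of the cumulative TIC
    q <- quantile(cum_tic, probs = probs, names = FALSE)
    # (4) relative retention time, taking the minimum into account
    rel_rt <- (rt - min(rt)) / (max(rt) - min(rt))
    # (5) first scan at which the cumulative TIC reaches each quantile
    idx <- vapply(q, function(x) which(cum_tic >= x)[1], integer(1))
    setNames(rel_rt[idx], paste0(probs * 100, "%"))
}

# Toy example with unordered retention times.
rt <- c(30, 10, 20, 40, 50)
tic <- c(300, 100, 200, 400, 500)
res <- tic_quarters_rt_fraction(rt, tic)
print(res)
```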
• rtOverMsQuarters, MS1 quarter RT fraction (MS:4000055), "The interval used for acquisition of the first, second, third, and fourth quarter of all MS1 events divided by retention time duration." [PSI:MS], msLevel = 1L; The metric informs about the dynamic range of the acquisition along the chromatographic separation. For MS1 scans, the values are expected to be in a similar range across samples of the same type.
The metric is calculated as follows: (1) the retention time duration of the whole Spectra object is determined (taking into account all the MS levels), (2) the Spectra object is filtered according to the MS level and subsequently ordered according to the retention time, (3) the MS events are split into four (approximately) equal parts, (4) the relative retention time is calculated (using the retention time duration from (1) and taking into account the minimum retention time), (5) the relative retention time values associated to the MS event parts are returned.
• rtOverMsQuarters, MS2 quarter RT fraction (MS:4000056), "The interval used for acquisition of the first, second, third, and fourth quarter of all MS2 events divided by retention time duration." [PSI:MS], msLevel = 2L; The metric informs about the dynamic range of the acquisition along the chromatographic separation. For MS2 scans, the comparability of the values depends on the acquisition mode and settings to select ions for fragmentation.
The metric is calculated as follows: (1) the retention time duration of the whole Spectra object is determined (taking into account all the MS levels), (2) the Spectra object is filtered according to the MS level and subsequently ordered according to the retention time, (3) the MS events are split into four (approximately) equal parts, (4) the relative retention time is calculated (using the retention time duration from (1) and taking into account the minimum retention time), (5) the relative retention time values associated to the MS event parts are returned.
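The MS1 and MS2 quarter RT fractions follow the same steps and differ only in the MS level. A minimal base R sketch, assuming that the minimum retention time of the whole run (all MS levels) is used as the reference and that each quarter is represented by its last event; the function name `rt_over_ms_quarters` and its arguments are hypothetical, and this is not the package code.

```r
# Sketch of the MS quarter RT fraction on plain vectors.
# rt:       retention times of all scans (all MS levels)
# ms_level: MS level of each scan
# level:    MS level to evaluate
rt_over_ms_quarters <- function(rt, ms_level, level = 1L) {
    # (1) retention time duration of the whole run
    rt_min <- min(rt)
    duration <- max(rt) - rt_min
    # (2) filter by MS level and order by retention time
    rt_level <- sort(rt[ms_level == level])
    # (3) index of the last event in each (approximately equal) quarter
    n <- length(rt_level)
    idx <- ceiling(n * c(0.25, 0.5, 0.75, 1))
    # (4), (5) relative retention time at the end of each quarter
    (rt_level[idx] - rt_min) / duration
}

# Toy run with alternating MS1 and MS2 scans.
rt <- c(0, 5, 10, 15, 20, 25, 30, 35, 40)
ms_level <- rep(c(1L, 2L), length.out = 9)
ms1 <- rt_over_ms_quarters(rt, ms_level, level = 1L)
ms2 <- rt_over_ms_quarters(rt, ms_level, level = 2L)
print(ms1)
print(ms2)
```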
• ticQuartileToQuartileLogRatio, MS1 TIC-change quartile ratios (MS:4000057), "The log ratios of successive TIC-change quartiles. The TIC changes are the list of MS1 total ion current (TIC) value changes from one to the next scan, produced when each MS1 TIC is subtracted from the preceding MS1 TIC. The metric's value triplet represents the log ratio of the TIC-change Q2 to Q1, Q3 to Q2, TIC-change-max to Q3" [PSI:MS], mode = "TIC_change", relativeTo = "previous", msLevel = 1L; The metric informs about the dynamic range of the acquisition along the chromatographic separation. This metric evaluates the stability (similarity) of MS1 TIC values from scan to scan along the LC run. High log ratios representing very large intensity differences between pairs of scans might be due to electrospray instability or presence of a chemical contaminant.
The metric is calculated as follows: (1) the TIC (ionCount) of the Spectra object is calculated per scan event (with spectra ordered by retention time), (2) the differences between TIC values are calculated between subsequent scan events, (3) the ratios between the 25%, 50%, 75%, and 100% quantile to the 25% quantile of the values of (2) are calculated, (4) the log values of the ratios are returned.
• ticQuartileToQuartileLogRatio, MS1 TIC quartile ratios (MS:4000058), "The log ratios of successive TIC quartiles. The metric's value triplet represents the log ratios of TIC-Q2 to TIC-Q1, TIC-Q3 to TIC-Q2, TIC-max to TIC-Q3." [PSI:MS], mode = "TIC", relativeTo = "previous", msLevel = 1L; The metric informs about the dynamic range of the acquisition along the chromatographic separation. The ratios provide information on the distribution of the TIC values for one LC-MS run. Within an experiment, with the same LC setup, values should be comparable between samples.
The metric is calculated as follows: (1) the TIC (ionCount) of the Spectra object is calculated per scan event (with spectra ordered by retention time), (2) the TIC values between subsequent scan events are taken as they are, (3) the ratios between the 25%, 50%, 75%, and 100% quantile to the 25% quantile of the values of (2) are calculated, (4) the log values of the ratios are returned.
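Both TIC quartile metrics can be illustrated with a short base R sketch. It follows the quoted mzQC definition of successive ratios (Q2 to Q1, Q3 to Q2, max to Q3), which corresponds to relativeTo = "previous". The function name `tic_quartile_log_ratio` is hypothetical, and taking absolute TIC changes so that the logarithm stays defined is our assumption, not something the text states.

```r
# Sketch of the TIC (change) quartile log ratios on a plain vector.
# tic: total ion count per MS1 scan, ordered by retention time
tic_quartile_log_ratio <- function(tic, mode = c("TIC", "TIC_change")) {
    mode <- match.arg(mode)
    # (2) either the TIC values as they are or the scan-to-scan changes
    #     (assumption: absolute changes are used)
    values <- if (mode == "TIC_change") abs(diff(tic)) else tic
    # (3) 25%, 50%, 75%, and 100% quantiles
    q <- quantile(values, probs = c(0.25, 0.5, 0.75, 1),
                  names = FALSE, na.rm = TRUE)
    # ratios of each quantile to the preceding one
    ratios <- q[-1] / q[-length(q)]
    # (4) log values of the ratios
    setNames(log(ratios), c("Q2/Q1", "Q3/Q2", "max/Q3"))
}

tic <- c(1, 2, 4, 8, 16)
res_tic <- tic_quartile_log_ratio(tic, mode = "TIC")
res_change <- tic_quartile_log_ratio(tic, mode = "TIC_change")
print(res_tic)
print(res_change)
```

In this toy example each TIC quartile is twice the preceding one, so all three log ratios in "TIC" mode equal log(2).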
• numberSpectra, number of MS1 spectra (MS:4000059), "The number of MS1 events in the run." [PSI:MS], msLevel = 1L.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the number of spectra is obtained (length of Spectra) and returned.
• numberSpectra, number of MS2 spectra (MS:4000060), "The number of MS2 events in the run." [PSI:MS], msLevel = 2L; An unusually low number may indicate incomplete sampling/scan rate of the MS instrument, low sample volume and/or failed injection of a sample.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the number of spectra is obtained (length of Spectra) and returned.
• mzAcquisitionRange, m/z acquisition range (MS:4000069), "Upper and lower limit of m/z precursor values at which MSn spectra are recorded." [PSI:MS]; The metric informs about the dynamic range of the acquisition. Based on the used MS instrument configuration, the values should be similar. Variations between measurements may arise when employing acquisition in DDA mode.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the m/z values of the peaks within the Spectra object are obtained, (3) the minimum and maximum m/z values are obtained and returned.
• rtAcquisitionRange, retention time acquisition range (MS:4000070), "Upper and lower limit of retention time at which spectra are recorded." [PSI:MS]; An unusually low range may indicate incomplete sampling and/or a premature or failed LC run.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the retention time values of the features within the Spectra object are obtained, (3) the minimum and maximum retention time values are obtained and returned.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the intensities of the precursor ions within the Spectra object are obtained, (3) the intensity values of the features are obtained via the ion count, (4) the signal jumps/declines between each pair of subsequent intensity values are calculated, (5) the signal jumps by a factor of ten or more are counted and returned.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the intensities of the precursor ions within the Spectra object are obtained, (3) the intensity values of the features are obtained via the ion count, (4) the signal jumps/declines between each pair of subsequent intensity values are calculated, (5) the signal declines by a factor of ten or more are counted and returned.
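Counting signal jumps and declines by a factor of ten or more can be sketched in base R as follows. This is a minimal illustration, assuming the intensities are already ordered by retention time; the function name `count_signal_changes` and the `fold` argument are hypothetical and not part of MsQuality.

```r
# Sketch: count tenfold jumps and declines between subsequent scans.
# intensity: ion count per scan, ordered by retention time
count_signal_changes <- function(intensity, fold = 10) {
    # ratio of each intensity to the preceding one
    ratio <- intensity[-1] / intensity[-length(intensity)]
    c(jumps = sum(ratio >= fold, na.rm = TRUE),
      declines = sum(ratio <= 1 / fold, na.rm = TRUE))
}

intensity <- c(100, 1500, 1400, 100, 120, 5)
res <- count_signal_changes(intensity)
print(res)
# one jump (100 to 1500) and two declines (1400 to 100, 120 to 5)
```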
• numberEmptyScans, number of empty MS1 scans (MS:4000099), "Number of MS1 scans where the scans' peaks intensity sums to 0 (i.e. no peaks or only 0-intensity peaks)." [PSI:MS], msLevel = 1L; An unusually high number may indicate incomplete sampling/scan rate of the MS instrument, low sample volume and/or failed injection of a sample.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the intensities per entry are obtained, (3) the number of intensity entries that are NULL, NA, or that have a sum of 0 are obtained and returned.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the intensities of the precursor ions within the Spectra object are obtained, (3) the 25%, 50%, and 75% quantiles of the precursor intensity values are obtained (NA values are removed) and returned.
• precursorIntensityMean, MS2 precursor intensity distribution mean (MS:4000117), "From the distribution of MS2 precursor intensities, the mean." [PSI:MS], identificationLevel = "all"; The intensity distribution of the precursors informs about the dynamic range of the acquisition.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the intensities of the precursor ions within the Spectra object are obtained, (3) the mean of the precursor intensity values is obtained (NA values are removed) and returned.
• precursorIntensitySd, MS2 precursor intensity distribution sigma (MS:4000118), "From the distribution of MS2 precursor intensities, the sigma value." [PSI:MS], identificationLevel = "all"; The intensity distribution of the precursors informs about the dynamic range of the acquisition.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the intensities of the precursor ions within the Spectra object are obtained, (3) the standard deviation of precursor intensity values is obtained (NA values are removed) and returned.
• medianPrecursorMz, MS2 precursor median m/z of identified quantification data points (MS:4000152), "Median m/z value for MS2 precursors of all quantification data points after user-defined acceptance criteria are applied. These data points may be for example XIC profiles, isotopic pattern areas, or reporter ions (see MS:1001805). The used type should be noted in the metadata or analysis methods section of the recording file for the respective run. In case of multiple acceptance criteria (FDR) available in proteomics, PSM-level FDR should be used for better comparability." [PSI:MS], identificationLevel = "identified", msLevel = 1L; The m/z distribution informs about the dynamic range of the acquisition.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the precursor m/z values are obtained, (3) the median value is returned (NAs are removed).
• rtIqr, interquartile RT period for identified quantification data points (MS:4000153), "The interquartile retention time period, in seconds, for all quantification data points after user-defined acceptance criteria are applied over the complete run. These data points may be for example XIC profiles, isotopic pattern areas, or reporter ions (see MS:1001805). The used type should be noted in the metadata or analysis methods section of the recording file for the respective run. In case of multiple acceptance criteria (FDR) available in proteomics, PSM-level FDR should be used for better comparability." [PSI:MS], identificationLevel = "identified"; Longer duration may indicate a better chromatographic separation of compounds which depends, however, also on the sampling/scan rate of the MS instrument.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the retention time values are obtained, (3) the interquartile range is obtained from the values and returned (NA values are removed).
• rtIqrRate, rate of the interquartile RT period for identified quantification data points (MS:4000154), "The rate of identified quantification data points for the interquartile retention time period, in identified quantification data points per second. These data points may be for example XIC profiles, isotopic pattern areas, or reporter ions (see MS:1001805). The used type should be noted in the metadata or analysis methods section of the recording file for the respective run. In case of multiple acceptance criteria (FDR) available in proteomics, PSM-level FDR should be used for better comparability." [PSI:MS], identificationLevel = "identified"; Higher rates may indicate a more efficient sampling and identification.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the retention time values are obtained, (3) the 25% and 75% quantiles are obtained from the retention time values (NA values are removed), (4) the number of eluted features between this 25% and 75% quantile is calculated, (5) the number of features is divided by the interquartile range of the retention time and returned.
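The interquartile RT period and its rate can be sketched together in base R. This is an illustration on a plain vector of retention times, assuming that features lying exactly on the 25% or 75% quantile are counted as inside the interval; the function name `rt_iqr_rate` is hypothetical and this is not the package implementation.

```r
# Sketch of the rate of the interquartile RT period.
# rt: retention times (seconds) of the features at the chosen MS level
rt_iqr_rate <- function(rt) {
    rt <- rt[!is.na(rt)]
    # (3) 25% and 75% quantiles of the retention time
    q <- quantile(rt, probs = c(0.25, 0.75), names = FALSE)
    # (4) number of features eluting between the two quantiles
    #     (assumption: boundaries are inclusive)
    n_features <- sum(rt >= q[1] & rt <= q[2])
    # (5) features per second within the interquartile range
    n_features / (q[2] - q[1])
}

rt <- seq(0, 100, by = 10)
iqr_value <- IQR(rt)          # interquartile RT period
rate <- rt_iqr_rate(rt)       # features per second within that period
print(iqr_value)
print(rate)
```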
• areaUnderTic, area under TIC (MS:4000155), "The area under the total ion chromatogram." [PSI:MS]; The metric informs about the dynamic range of the acquisition. Differences between samples of an experiment may indicate differences in the dynamic range and/or in the sample content.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the sum of the ion counts is obtained and returned.
• areaUnderTicRtQuantiles, area under TIC RT quantiles (MS:4000156), "The area under the total ion chromatogram of the retention time quantiles. Number of quantiles are given by the n-tuple." [PSI:MS]; The metric informs about the dynamic range of the acquisition along the chromatographic separation. Differences between samples of an experiment may indicate differences in chromatographic performance, differences in the dynamic range and/or in the sample content.
• extentIdentifiedPrecursorIntensity, extent of identified MS2 precursor intensity (MS:4000157), "Ratio of 95th over 5th percentile of MS2 precursor intensity for all quantification data points after user-defined acceptance criteria are applied. The used type of identification should be noted in the metadata or analysis methods section of the recording file for the respective run. In case of multiple acceptance criteria (FDR) available in proteomics, PSM-level FDR should be used for better comparability." [PSI:MS], identificationLevel = "identified"; The metric informs about the dynamic range of the acquisition.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the intensities of the precursor ions are obtained, (3) the 5% and 95% quantiles of these intensities are obtained (NA values are removed), (4) the ratio between the 95% and the 5% intensity quantile is calculated and returned.
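The ratio of the 95% to the 5% quantile can be written in a few lines of base R. A minimal sketch on a plain numeric vector; the function name `precursor_intensity_extent` is hypothetical and the default quantile type of `quantile()` is assumed.

```r
# Sketch of the extent of the precursor intensity distribution.
# precursor_intensity: precursor intensities of the MS2 spectra
precursor_intensity_extent <- function(precursor_intensity) {
    # (3) 5% and 95% quantiles, NA values removed
    q <- quantile(precursor_intensity, probs = c(0.05, 0.95),
                  na.rm = TRUE, names = FALSE)
    # (4) ratio of the 95% to the 5% quantile
    q[2] / q[1]
}

# Toy intensities 1, 2, ..., 101 with one missing value.
precursor_intensity <- c(1:101, NA)
extent <- precursor_intensity_extent(precursor_intensity)
print(extent)
```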
• medianTicRtIqr, median of TIC values in the RT range in which the middle half of quantification data points are identified (MS:4000158), "Median of TIC values in the RT range in which half of quantification data points are identified (RT values of Q1 to Q3 of identifications). These data points may be for example XIC profiles, isotopic pattern areas, or reporter ions (see MS:1001805). The used type should be noted in the metadata or analysis methods section of the recording file for the respective run. In case of multiple acceptance criteria (FDR) available in proteomics, PSM-level FDR should be used for better comparability." [PSI:MS], identificationLevel = "identified"; The metric informs about the dynamic range of the acquisition along the chromatographic separation.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the Spectra object is ordered according to the retention time, (3) the features between the 1st and 3rd quartile are obtained (half of the features that are present in the Spectra object), (4) the ion count of the features within the 1st and 3rd quartile is obtained, (5) the median value of the ion count is calculated (NA values are removed) and the median value is returned.
• medianTicOfRtRange, median of TIC values in the shortest RT range in which half of the quantification data points are identified (MS:4000159), "Median of TIC values in the shortest RT range in which half of the quantification data points are identified. These data points may be for example XIC profiles, isotopic pattern areas, or reporter ions (see MS:1001805). The used type should be noted in the metadata or analysis methods section of the recording file for the respective run. In case of multiple acceptance criteria (FDR) available in proteomics, PSM-level FDR should be used for better comparability." [PSI:MS], identificationLevel = "identified"; The metric informs about the dynamic range of the acquisition along the chromatographic separation.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the Spectra object is ordered according to the retention time, (3) the number of features in the Spectra object is obtained and the number for half of the features is calculated, (4) iterate through the features (always by taking the neighbouring half of features) and calculate the retention time range of the set of features, (5) retrieve the set of features with the minimum retention time range, (6) calculate from the set of (5) the median TIC (NA values are removed) and return it.
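The search for the shortest RT range containing half of the features is a sliding-window problem, sketched below in base R. This is an illustration rather than the MsQuality code: the function name `median_tic_of_rt_range` is hypothetical, and rounding the window size up for an odd number of features is our assumption.

```r
# Sketch: median TIC in the shortest RT range holding half of the features.
# rt:  retention times per feature
# tic: ion count per feature
median_tic_of_rt_range <- function(rt, tic) {
    # (2) order by retention time
    ord <- order(rt)
    rt <- rt[ord]
    tic <- tic[ord]
    # (3) window size: half of the features (assumption: rounded up)
    n <- length(rt)
    half <- ceiling(n / 2)
    # (4) RT range of every window of neighbouring features
    starts <- seq_len(n - half + 1)
    ranges <- rt[starts + half - 1] - rt[starts]
    # (5) window with the minimum RT range
    best <- starts[which.min(ranges)]
    # (6) median TIC within that window
    median(tic[best:(best + half - 1)], na.rm = TRUE)
}

rt <- c(0, 10, 11, 12, 13, 30, 50, 80)
tic <- c(5, 100, 200, 300, 400, 50, 20, 10)
res <- median_tic_of_rt_range(rt, tic)
print(res)
# the densest window spans RT 10 to 13, giving a median TIC of 250
```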
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the intensities of the precursor ions within the Spectra object are obtained, (3) the minimum and maximum precursor intensity values are obtained and returned.
• precursorIntensityQuartiles, identified MS2 precursor intensity distribution Q1, Q2, Q3 (MS:4000161), "From the distribution of identified MS2 precursor intensities, the quartiles Q1, Q2, Q3. The used type of identification should be noted in the metadata or analysis methods section of the recording file for the respective run. In case of multiple acceptance criteria (FDR) available in proteomics, PSM-level FDR should be used for better comparability." [PSI:MS], identificationLevel = "identified"; The metric informs about the dynamic range of the acquisition in relation to identifiability.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the intensities of the precursor ions within the Spectra object are obtained, (3) the 25%, 50%, and 75% quantiles of the precursor intensity values are obtained (NA values are removed) and returned.
• precursorIntensityQuartiles, unidentified MS2 precursor intensity distribution Q1, Q2, Q3 (MS:4000162), "From the distribution of unidentified MS2 precursor intensities, the quartiles Q1, Q2, Q3. The used type of identification should be noted in the metadata or analysis methods section of the recording file for the respective run. In case of multiple acceptance criteria (FDR) available in proteomics, PSM-level FDR should be used for better comparability." [PSI:MS], identificationLevel = "unidentified"; The metric informs about the dynamic range of the acquisition in relation to identifiability.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the intensities of the precursor ions within the Spectra object are obtained, (3) the 25%, 50%, and 75% quantiles of the precursor intensity values are obtained (NA values are removed) and returned.
• precursorIntensityMean, identified MS2 precursor intensity distribution mean (MS:4000163), "From the distribution of identified MS2 precursor intensities, the mean. The intensity distribution of the identified precursors informs about the dynamic range of the acquisition in relation to identifiability. The used type of identification should be noted in the metadata or analysis methods section of the recording file for the respective run. In case of multiple acceptance criteria (FDR) available in proteomics, PSM-level FDR should be used for better comparability." [PSI:MS], identificationLevel = "identified"; The metric informs about the dynamic range of the acquisition in relation to identifiability.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the intensities of the precursor ions within the Spectra object are obtained, (3) the mean of the precursor intensity values is obtained (NA values are removed) and returned.
The used type of identification should be noted in the metadata or analysis methods section of the recording file for the respective run. In case of multiple acceptance criteria (FDR) available in proteomics, PSM-level FDR should be used for better comparability." [PSI:MS], identificationLevel = "unidentified"; The metric informs about the dynamic range of the acquisition in relation to identifiability.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the intensities of the precursor ions within the Spectra object are obtained, (3) the mean of the precursor intensity values is obtained (NA values are removed) and returned.
• precursorIntensitySd, identified MS2 precursor intensity distribution sigma (MS:4000165), "From the distribution of identified MS2 precursor intensities, the sigma value. The used type of identification should be noted in the metadata or analysis methods section of the recording file for the respective run. In case of multiple acceptance criteria (FDR) available in proteomics, PSM-level FDR should be used for better comparability." [PSI:MS], identificationLevel = "identified"; The metric informs about the dynamic range of the acquisition in relation to identifiability.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the intensities of the precursor ions within the Spectra object are obtained, (3) the standard deviation of precursor intensity values is obtained (NA values are removed) and returned.
• precursorIntensitySd, unidentified MS2 precursor intensity distribution sigma (MS:4000166), "From the distribution of unidentified MS2 precursor intensities, the sigma value. The used type of identification should be noted in the metadata or analysis methods section of the recording file for the respective run. In case of multiple acceptance criteria (FDR) available in proteomics, PSM-level FDR should be used for better comparability." [PSI:MS], identificationLevel = "unidentified"; The metric informs about the dynamic range of the acquisition in relation to identifiability.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the intensities of the precursor ions within the Spectra object are obtained, (3) the standard deviation of precursor intensity values is obtained (NA values are removed) and returned.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the precursor charge is obtained, (3) the number of precursors with charge 1+ is divided by the number of precursors with charge 2+ and the ratio is returned.
• ratioCharge1over2, ratio of 1+ over 2+ of identified MS2 known precursor charges (MS:4000168), "The ratio of 1+ over 2+ MS2 precursor charge count of identified spectra. The used type of identification should be noted in the metadata or analysis methods section of the recording file for the respective run. In case of multiple acceptance criteria (FDR) available in proteomics, PSM-level FDR should be used for better comparability." [PSI:MS], identificationLevel = "identified"; High ratios of 1+/2+ MS2 precursor charge count may indicate inefficient ionization in relation to identifiability.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the precursor charge is obtained, (3) the number of precursors with charge 1+ is divided by the number of precursors with charge 2+ and the ratio is returned.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the precursor charge is obtained, (3) the number of precursors with charge 3+ is divided by the number of precursors with charge 2+ and the ratio is returned.
• ratioCharge3over2, ratio of 3+ over 2+ of identified MS2 known precursor charges (MS:4000170), "The ratio of 3+ over 2+ MS2 precursor charge count of identified spectra. The used type of identification should be noted in the metadata or analysis methods section of the recording file for the respective run. In case of multiple acceptance criteria (FDR) available in proteomics, PSM-level FDR should be used for better comparability." [PSI:MS], identificationLevel = "identified"; Higher ratios of 3+/2+ MS2 precursor charge count may indicate, e.g., a preference for longer peptides in relation to identifiability.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the precursor charge is obtained, (3) the number of precursors with charge 3+ is divided by the number of precursors with charge 2+ and the ratio is returned.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the precursor charge is obtained, (3) the number of precursors with charge 4+ is divided by the number of precursors with charge 2+ and the ratio is returned.
• ratioCharge4over2, ratio of 4+ over 2+ of identified MS2 known precursor charges (MS:4000172), "The ratio of 4+ over 2+ MS2 precursor charge count of identified spectra. The used type of identification should be noted in the metadata or analysis methods section of the recording file for the respective run. In case of multiple acceptance criteria (FDR) available in proteomics, PSM-level FDR should be used for better comparability." [PSI:MS], identificationLevel = "identified"; Higher ratios of 4+/2+ MS2 precursor charge count may indicate, e.g., a preference for longer peptides in relation to identifiability.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the precursor charge is obtained, (3) the number of precursors with charge 4+ is divided by the number of precursors with charge 2+ and the ratio is returned.
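The three charge ratio metrics (1+, 3+, and 4+ over 2+) share one calculation, sketched here in base R on a plain vector of precursor charges. The function name `ratio_charge_over_2` is hypothetical and this is not the package code; note that the ratio is not finite when no precursor with charge 2+ is present.

```r
# Sketch of the charge ratio metrics on a plain vector.
# precursor_charge: precursor charge per MS2 spectrum (NA if unknown)
# charge:           numerator charge state (1, 3, or 4)
ratio_charge_over_2 <- function(precursor_charge, charge) {
    # (3) count of precursors with the given charge over count with 2+
    sum(precursor_charge == charge, na.rm = TRUE) /
        sum(precursor_charge == 2L, na.rm = TRUE)
}

precursor_charge <- c(1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, NA)
r1 <- ratio_charge_over_2(precursor_charge, 1L)  # 1 / 4
r3 <- ratio_charge_over_2(precursor_charge, 3L)  # 2 / 4
r4 <- ratio_charge_over_2(precursor_charge, 4L)  # 1 / 4
print(c(r1, r3, r4))
```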
• meanCharge, mean MS2 precursor charge in identified spectra (MS:4000174), "Mean MS2 precursor charge in identified spectra. The used type of identification should be noted in the metadata or analysis methods section of the recording file for the respective run. In case of multiple acceptance criteria (FDR) available in proteomics, PSM-level FDR should be used for better comparability." [PSI:MS], identificationLevel = "identified"; Higher charges may indicate inefficient ionization or e.g. preference for longer peptides in relation to identifiability.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the precursor charge is obtained, (3) the mean of the precursor charge values is calculated and returned.
• medianCharge, median MS2 precursor charge in identified spectra (MS:4000176), "Median MS2 precursor charge in identified spectra. The used type of identification should be noted in the metadata or analysis methods section of the recording file for the respective run. In case of multiple acceptance criteria (FDR) available in proteomics, PSM-level FDR should be used for better comparability." [PSI:MS], identificationLevel = "identified"; Higher charges may indicate inefficient ionization and/or e.g. preference for longer peptides in relation to identifiability.
The metric is calculated as follows: (1) the Spectra object is filtered according to the MS level, (2) the precursor charge is obtained, (3) the median of the precursor charge values is calculated and returned.
An up-to-date list can be found in the vignette of the package via Bioconductor.
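The calculation steps listed above for the precursor charge metrics can be sketched in base R. This is a minimal illustration on hypothetical toy vectors, not the code used inside MsQuality: the vectors ms_level and precursor_charge and the helper ratio_charge are assumptions introduced here, standing in for the values that would be extracted from a Spectra object.

```r
## Toy data for one run (hypothetical values); NA marks spectra
## without a known precursor charge, e.g. MS1 spectra.
ms_level <- c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L)
precursor_charge <- c(NA, 2L, 2L, 3L, 2L, 4L, 3L, NA, 2L)

## (1) filter according to the MS level, (2) obtain the precursor charges
charge <- precursor_charge[ms_level == 2L]
charge <- charge[!is.na(charge)]

## (3) summarise the charges; ratio_charge is a hypothetical helper
ratio_charge <- function(charge, numerator, denominator = 2L) {
    sum(charge == numerator) / sum(charge == denominator)
}

ratio_3_over_2 <- ratio_charge(charge, numerator = 3L)
ratio_4_over_2 <- ratio_charge(charge, numerator = 4L)
mean_charge <- mean(charge)
median_charge <- median(charge)
```

Note that the ratio is undefined (division by zero) when a run contains no precursors with charge 2+; how such runs should be reported is left open in this sketch.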

Quick start to the package
For demonstration purposes, we apply the MsQuality package to the .mzML files shipped with the msdata package. One of the files is an LC-MS/MS DIA (SWATH) file and the other an LC-MS/MS DDA file.
fls <- dir(system.file("TripleTOF-SWATH", package = "msdata"), full.names = TRUE)
In a next step, we create a Spectra object from the two .mzML files using the Spectra function.
The metrics are defined by the metrics vector. We further specify that the metrics should be calculated on MS1 spectra (msLevel = 1).
metrics <- c("chromatographyDuration", "numberSpectra", "areaUnderTic")
## calculate the metrics
metrics_sps <- calculateMetricsFromSpectra(spectra = sps, metrics = metrics, msLevel = 1)
By default, the output of the function is a data.frame object that has the metrics as columns and the samples as rows.
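To make the shape of this output concrete, the following base R sketch builds a comparable table from a hypothetical per-spectrum table. It does not call MsQuality: the table spectra_tbl, its values, and the object metrics_toy are assumptions for illustration, and the area under the TIC is approximated here as the summed TIC per file.

```r
## Toy per-spectrum table (hypothetical values): one row per MS1 spectrum
spectra_tbl <- data.frame(
    dataOrigin = c("fileA.mzML", "fileA.mzML", "fileA.mzML",
                   "fileB.mzML", "fileB.mzML"),
    rtime = c(1.0, 2.5, 4.0, 0.5, 3.5),  # retention time
    tic = c(100, 250, 150, 80, 120)      # total ion current per spectrum
)

## Split by file (dataOrigin) and compute one row of metrics per file
per_file <- split(spectra_tbl, spectra_tbl$dataOrigin)
metrics_toy <- do.call(rbind, lapply(per_file, function(x) {
    data.frame(
        chromatographyDuration = max(x$rtime) - min(x$rtime),
        numberSpectra = nrow(x),
        areaUnderTic = sum(x$tic)  # assumption: summed TIC as the area
    )
}))

## Metrics as columns, samples (files) as rows
metrics_toy
```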
All parts in the section "Cherkaoui et al. [2022]: A functional analysis of 180 cancer cell lines reveals conserved intrinsic metabolic programs" are reproducible except the parallelization steps in the subsection Performance under parallelization. For these steps, precalculated objects are loaded into the environment.
We use the BiocFileCache package from Bioconductor to download and cache the .mzML files locally. To this end, we first determine the full file names of all .mzML files of this data set.

library(BiocFileCache)
## every additional result should be saved in there
cherkaoui <- BiocFileCache("../Cherkaoui2022", ask = FALSE)
path <- bfcrpath(cherkaoui, paste0(url, curl_escape(ftp_files)))
These downloaded .mzML files can, however, not be loaded directly because they are not fully compliant with the open .mzML standard file format (internal references to instrumentation configuration are missing). We thus need to process all files to remove these incompatible lines from each .mzML file. This needs to be done (once) using the below unix shell commands that should be executed in the folder containing the downloaded files.

Instantiation of the Spectra object
In the subsequent analysis, a Spectra object is instantiated. The operations were executed within a (high-performance) computing environment (31 cores, 64 GB RAM pool for all cores).

Calculate the metrics via MsQuality
MsQuality uses the Spectra class for storing the spectral data. In this particular case, where the spectral data was obtained via flow injection analysis, metrics that incorporate retention time information are not relevant and the analysis focuses on only three metrics. The metrics are calculated using the function calculateMetricsFromSpectra, which takes as input the Spectra object, sps, and the above-defined metrics. Optional parameters can also be passed to this function for further control of the calculation, such as msLevel for cases where multiple MS levels are present in the Spectra object. It is unnecessary to specify msLevel in the current context since only MS1 level spectra are stored in the Spectra object.

Visualization
We next visualize the three quality metrics using the ggplot2 package. We also include information from the original study, Cherkaoui et al. [2022], in particular which of the files were included in the final analysis. The results of the study are available from this resource: https://doi.org/10.3929/ethz-b-000511784 . We first download and cache the PrimaryAnalysis.zip archive that contains all results, unzip it to a temporary folder and import the metabolomics_180CCL.xlsx file.
We then create a figure to compare the quality metrics between the analyzed and excluded measurements (Figure S1). Figure S1 demonstrates that the excluded measurements show a bimodal distribution of the total ion current (TIC). Specifically, some of the excluded measurements have lower TIC values, which was already noted in the original publication and was the reason for their exclusion from subsequent analysis steps. Figure S1 (a) serves as a visual confirmation of this statement and aids in understanding the data quality of the measurements. The metrics mzAcquisitionRange.max and mzAcquisitionRange.min, on the other hand (Figure S1 (b) and (c)), are not informative for deciding whether to exclude or include measurements in further analysis steps.

Performance under parallelization
An important aspect, especially when dealing with large amounts of data, is the scalability and performance of the quality metric computation.
By monitoring the computation under parallelization, it is possible to determine how well it scales and to ensure that the performance of the analysis remains acceptable as the data size increases.
We measure below the time it takes to calculate the quality metrics when the tasks are parallelized across 1, 2, 4, 8, and 16 workers, using the microbenchmark package. This package allows for precise measurement of the execution time of R expressions by repeating the evaluation multiple times and providing detailed summary statistics of the execution times. Parallelizing the calculation of the quality metrics across multiple workers can significantly reduce the execution time (Figure S2), which helps in the management of larger data sets and saves time in data analysis.
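The timing procedure can be illustrated with a small, self-contained sketch. It deliberately swaps in base R tools (the parallel package and system.time) for the microbenchmark-based setup described above, and the function toy_metric is a hypothetical stand-in for the per-file metric calculation, so the timings it produces say nothing about MsQuality itself.

```r
library(parallel)

## Hypothetical stand-in for an expensive per-file metric calculation
toy_metric <- function(seed) {
    set.seed(seed)
    sum(sort(runif(2e4)))
}
tasks <- 1:8

## Run the same workload on a cluster with n_workers workers and
## record the elapsed wall-clock time together with the results
time_with_workers <- function(n_workers) {
    cl <- makeCluster(n_workers)
    on.exit(stopCluster(cl))
    elapsed <- system.time(
        res <- parLapply(cl, tasks, toy_metric)
    )[["elapsed"]]
    list(workers = n_workers, elapsed = elapsed, result = unlist(res))
}

timings <- lapply(c(1, 2), time_with_workers)

## Elapsed seconds per number of workers
vapply(timings, function(x) x$elapsed, numeric(1))
```

Because each task sets its own seed, the results are the same regardless of the number of workers, which offers a simple check that parallelization changes only the run time and not the computed values.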

Instantiation of the Spectra object
In the subsequent analysis, a Spectra object is instantiated. The operations were executed within a (high-performance) computing environment (3 cores, 128 GB RAM pool for all cores), where the .mzML files were stored in the directory Amidan2014.

Calculate the metrics via MsQuality
MsQuality utilizes Spectra objects that store the spectral data. Here, retention time information was available from the .mzML files, so a higher number of metrics could be calculated.

Visualization
In the analysis of the Amidan et al. [2014] study, the quality metrics were visualized using the ggplot2 package. The XLS files pr401143e_si_002.xls and pr401143e_si_003.xls (provided as Supplemental Material of the original publication) were used to extract information on the measurement quality. This information was added to the metrics_sps_msLevel1_filtered, metrics_sps_msLevel2_filtered, metrics_sps_msLevel1 and metrics_sps_msLevel2 objects.
Figures S3, S4, S5, S6, S7, and S8 were created as examples to compare the low- and high-quality measurements for several of the supported quality metrics. "filtered" refers to the metrics where filterEmptySpectra was set to TRUE, whereas "unfiltered" refers to the metrics where filterEmptySpectra was set to FALSE.
While the metric chromatographyDuration (retention time) is a continuous variable, for visualization purposes we bin the variable to discrete values and use the measurements over 60 min and 100 min for visualization.
Area under TIC RT quantiles (areaUnderTicRtQuantiles). The MsQuality metrics are calculated from filtered and unfiltered MS2 spectra. One data point is obtained per MS2 measurement run and the data points are displayed as beeswarm plots stratified for high-quality and low-quality measurements as classified in Amidan et al. [2014]. (a) 25% quantile for filtered MS2 spectra. (b) 25% quantile for unfiltered MS2 spectra. (c) 50% quantile for filtered MS2 spectra. (d) 50% quantile for unfiltered MS2 spectra. (e) 75% quantile for filtered MS2 spectra. (f) 75% quantile for unfiltered MS2 spectra. (g) 100% quantile for filtered MS2 spectra. (h) 100% quantile for unfiltered MS2 spectra.

Comparison to QuaMeter metrics
In the following, we compare the QuaMeter metrics to the MsQuality metrics to check whether MsQuality shows results concordant with QuaMeter. The QuaMeter metrics were calculated via the command line tool bumbershoot with -MetricsType set to idfree. The metric IS-1A was taken from the Supplemental Files of Amidan et al. [2014].
QuaMeter removes the entries of .mzML files with defaultArrayLength=0 at any MS level. Thus, the metrics that were calculated from the .mzML files where the zero-length and zero-intensity entries were removed showed higher correlation with the QuaMeter metrics than those calculated from the unfiltered files. We provide the flexibility to remove zero-length and zero-intensity entries by setting the argument filterEmptySpectra to TRUE or FALSE depending on the intended behavior.
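The effect of removing zero-length and zero-intensity entries can be sketched in base R. This is an illustration under assumptions rather than the internal implementation behind filterEmptySpectra: the list intensities and its values are hypothetical, and a spectrum is treated as empty here when it has no peaks or only zero intensities.

```r
## Toy intensity vectors, one per spectrum (hypothetical values)
intensities <- list(
    s1 = c(120, 0, 35),
    s2 = numeric(0),  # zero-length entry (defaultArrayLength = 0)
    s3 = c(0, 0),     # only zero intensities
    s4 = c(10, 500)
)

## Flag spectra with no peaks or with zero intensities only
is_empty <- vapply(
    intensities,
    function(x) length(x) == 0 || all(x == 0),
    logical(1)
)
filtered <- intensities[!is_empty]

## Count-based metrics such as numberSpectra differ between the two sets
c(unfiltered = length(intensities), filtered = length(filtered))
```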

Performance under parallelization
As in the analysis of the flow injection data above, an important aspect, especially when dealing with large amounts of data, is the scalability and performance of the quality metric computation.
We measure the time it takes to calculate the quality metrics when the tasks are parallelized across 1, 2, 4, 8, and 16 workers, using the microbenchmark package. For computational reasons, we limit the calculation to the first 500 .mzML files. The operations were executed within a (high-performance) computing environment (31 cores, 128 GB RAM pool for all cores).
• msSignal10xChange, MS1 signal jump (10x) count (MS:4000097), "The number of times where MS1 TIC increased more than 10-fold between adjacent MS1 scans. An unusual high count of signal jumps or falls can indicate ESI stability issues." [PSI:MS], change = "jump", msLevel = 1L; An unusually high count of signal jumps or falls may indicate ESI stability issues.

• msSignal10xChange, MS1 signal fall (10x) count (MS:4000098), "The number of times where MS1 TIC decreased more than 10-fold between adjacent MS1 scans. An unusual high count of signal jumps or falls can indicate ESI stability issues." [PSI:MS], change = "fall", msLevel = 1L; An unusually high count of signal jumps or falls may indicate ESI stability issues.
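The two msSignal10xChange definitions can be illustrated with a short base R sketch that counts 10-fold changes between adjacent MS1 scans. The vector tic holds hypothetical values, and the sketch follows the quoted definitions directly; it is not taken from the MsQuality source code.

```r
## Toy MS1 TIC values in acquisition order (hypothetical values)
tic <- c(1e5, 1.2e5, 2e6, 1.8e6, 1e5, 9e4, 1.1e6)

## Fold change between each scan and its predecessor
fold_change <- tic[-1] / tic[-length(tic)]

n_jumps <- sum(fold_change > 10)      # corresponds to change = "jump"
n_falls <- sum(fold_change < 1 / 10)  # corresponds to change = "fall"
```

In this toy series the TIC rises more than 10-fold twice and drops more than 10-fold once, so n_jumps is 2 and n_falls is 1.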

Figure S1: Quality metrics for the data set of Cherkaoui et al. [2022], stratified by whether the measurement was analyzed or excluded. (a) Area under the TIC (areaUnderTic). (b) Minimum values of the m/z acquisition range (mzAcquisitionRange.min). (c) Maximum values of the m/z acquisition range (mzAcquisitionRange.max). A.U.: arbitrary units.
All parts in the section "Amidan et al. [2014]: Signatures for mass spectrometry data quality" are reproducible. Due to long computation times or the requirement for an environment that enables parallelization, the creation of the Spectra object in the subsection Instantiation of the Spectra object, the calculation of the quality metrics in the subsection Calculate the metrics via MsQuality, and the parallelization steps in the subsection Performance under parallelization are precomputed.

Figure S3: Quality metrics by MsQuality: Number of MS1 and MS2 spectra (numberSpectra). The MsQuality metrics are calculated from filtered and unfiltered MS1 and MS2 spectra. One data point is obtained per MS1 and MS2 measurement run and the data points are displayed as beeswarm plots stratified for high-quality and low-quality measurements as classified in Amidan et al. [2014]. (a) Number of filtered MS1 spectra. (b) Number of unfiltered MS1 spectra. (c) Number of filtered MS2 spectra. (d) Number of unfiltered MS2 spectra.

Figure S7: Quality metrics by MsQuality: TIC quartile to quartile log ratio (ticQuartileToQuartileLogRatio). The MsQuality metrics are calculated from filtered and unfiltered MS1 spectra. One data point is obtained per MS1 measurement run and the data points are displayed as beeswarm plots stratified for high-quality and low-quality measurements as classified in Amidan et al. [2014]. (a) Log ratio of quartile 2 to quartile 1 for filtered MS1 spectra. (b) Log ratio of quartile 2 to quartile 1 for unfiltered MS1 spectra. (c) Log ratio of quartile 3 to quartile 2 for filtered MS1 spectra. (d) Log ratio of quartile 3 to quartile 2 for unfiltered MS1 spectra. (e) Log ratio of quartile 4 to quartile 3 for filtered MS1 spectra. (f) Log ratio of quartile 4 to quartile 3 for unfiltered MS1 spectra.