pymzML v2.0: introducing a highly compressed and seekable gzip format

Kösters, M; Leufken, J; Schulze, S; Sugimoto, K; Klein, J; Zahedi, R P; Hippler, M; Leidel, S A; Fufezan, C

doi:10.1093/bioinformatics/bty046

Abstract

Motivation

In the new release of pymzML (v2.0), we have optimized the speed of this established tool for mass spectrometry data analysis to adapt to increasing amounts of data in mass spectrometry. Thus, we integrated faster libraries for numerical calculations, improved data retrieving algorithms and have optimized the source code. Importantly, to adapt to rapidly growing file sizes, we developed a generalizable compression scheme for very fast random access and applied this concept to mzML files to retrieve spectral data.

Results

pymzML performs at par with established C programs when it comes to processing times. However, it offers the versatility of a scripting language, while adding unprecedented fast random access to compressed files. Additionally, we designed our compression scheme in such a general way that it can be applied to any field where fast random access to large data blocks in compressed files is desired.

Availability and implementation

pymzML is freely available on https://github.com/pymzML/pymzML under GPL license. pymzML requires Python3.4+ and optionally numpy. Documentation available on http://pymzml.readthedocs.io.

1 Introduction

The rise of high-throughput methods in modern mass spectrometry, has triggered the demand for tools to analyze data with new and complex experimental setups. The two key elements at the forefront of computational mass spectrometry method development are rapid development cycles, by using e.g. scripting languages, such as Python and most importantly standardized file formats.

With pymzML we have pioneered the scripting language access to the mzML file format (Bald et al., 2012), which has been proven to be very powerful in different projects (Kukuczka et al., 2014; Röst et al., 2015). Subsequently, other scripting implementations, such as pyteomics (Goloborodko et al., 2013) and pyOpenMS (Röst et al., 2014), highlighted the need for rapid development cycles. Python is well suited for such a task, due to its high flexibility, extendability and platform independence. Additionally, the Python community is very large, especially in the scientific computing area as manifested by the large number of highly optimized, stable and open source packages. Thus, pymzML v2.0 could be build using the latest optimized libraries for numerical calculations, that e.g. accelerate vectorized calculations (TIC calculation or normalizations). Since pymzML is widely used in computational mass spectrometry method development, we further extended its API towards user written file interface plugins, to allow tailored pipelines for non-standard data analysis to be developed that maintain downstream compatibility. Finally, we integrated the interactive plotting library, plotly, to create a toolbox for visualizing customizable annotated mass spectra and ion chromatograms.

The standard file format for storing and sharing mass spectrometry data is mzML. It was defined by a joint effort of the Seattle Proteome Center/Institute for Systems Biology and the HUPO PSI (Proteomics Standards Initiative) (Deutsch, 2008). mzML is a human-readable XML format, which comes at cost of a large file size due to the high repetition of tags, names and white-spaces. This issue has been addressed by employing compression algorithms, e.g. golomb-rice-coding (Golomb, 1966; Rice and Plaunt, 1971) [aka Numpress (Teleman et al., 2014)] or zlib. These implementations compress the data arrays which constitute most of the data within a mzML file. However, they fail to reduce the size to RAW levels. Currently, the only strategy to achieve such file size reductions is to apply gzip compression in combination with zlib or golomb-rice-coding. Unfortunately, this sacrifices random access the mzML files, which is a major drawback for many computational mass spectrometry approaches.

In this update we enhance the performance of pymzML including highly optimized numerical libraries and we have overcome the limitation in compression approaches by the development of the indexed gzip format (igzip). Our generalized API for igzip allows any data to be stored into a highly compressed gzip file and retrieved by user defined tags, e.g. spectrum IDs in the case of mzML.

2 Materials and methods

We determined general mzML parsing speed in order to evaluate firstly, how the improved pymzML compares to the established C implementation (Fig. 1a) and secondly, how the igzip format alters parsing efficiency in general (Fig. 1b). As a benchmark, we used ten different unindexed mzML files obtained from 4 different mass spectrometers with sizes ranging from 1.3 to 3.8 GB as dataset. Figure 1a shows that C-implementations are still faster [ProteoWizard (Kessner et al., 2008) 2.3×, OpenMS (Röst et al., 2016) 2×] while calculating the total ion current (TIC), i.e. calculating the sum of all intensities in every spectrum, which implies data decoding.

Fig. 1.

Open in new tab Download slide

(a) Calculating total ion current (TIC) benchmark comparing OpenMS, Proteowizard and pymzML against each other. (b) Processing times in seconds required for accessing and writing a peak list file from a random spectrum using 10 files varying in size. X-axis, combination of file type and program; Y-axis, processing time in seconds; note log scale

When using gzip compressed files, openMS and Proteowizard perform about 2× faster than pymzML whereas ProteoWizard is slightly faster than openMS. It is important to note, that by tightly integrating numpy into pymzML v2.0, any numerical and vectorial operations on arrays, such as TIC calculations are about 50% faster than the previous pymzML v0.7 (Bald et al., 2012) (data not shown).

On average, uncompressed mzML is 4.9 times larger than RAW and only the combination of zlib or golomb-rice-coding with gzip can reach RAW level sizes (see online documentation). However, files compressed using gzip do not offer random access. While index gzip solutions exist, none are optimized for data blocks that largely exceed the 64KB block size, like in mass spectrometry data. Therefore, we have developed the index gzip format (igzip), which, in combination with golomb-rice-coding or zlib allows files to be compressed to sizes similar or smaller than RAW files while maintaining random access capability (see online documentation for detailed file sizes). Figure 1b shows the random access benchmark, where 10 randomly defined spectra ID were accessed and decoded from each of the benchmark files. In general random access in uncompressed (yet unindexed) mzML is approximately two orders of magnitude faster when using pymzML compared to ProteoWizard’s msaccess command line tool. The fast random access of pymzML v2.0 in uncompressed mzML files is due to our implementation of binary jumps within the file, making the the index information obsolete. Random access in igzip files is faster than in uncompressed files, since a) less data has to be read from disk and b) no binary jumps are used, only one seek operation is required to find the spectrum. A comparison to openMS is not straightforward and was therefore not performed.

3 Conclusion

The tight integration with high performance numerical libraries makes pymzML a very attractive module for computational mass spectrometry.

Additionally, we introduced igzip to enable fast random access in compressed files, allowing mzML files to be compressed back to RAW levels while maintaining random access. Importantly, the igzip strategy can be generally applied to data blocks that largely exceed the 64KB block size, therefore igzip can be used in any other computational context.

Acknowledgements

We thank all contributors over the years for their valuable work with and on pymzML, especially B. Akdeniz, H. Roest and M. Wolfson.

Conflict of Interest: none declared.

References

Bald

T.

et al. (

2012

)

PymzML-Python module for high-throughput bioinformatics on mass spectrometry data

.

Bioinformatics (Oxford, England)

,

28

,

1052

–

1053

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

Deutsch

E.

(

2008

)

mzML: a single, unifying data format for mass spectrometer output

.

Proteomics

,

8

,

2776

–

2777

.

Goloborodko

A.A.

et al. (

2013

)

Pyteomics – a python framework for exploratory data analysis and rapid software prototyping in proteomics

.

J. Am. Soc. Mass Spectrom

.,

24

,

301

–

304

.

Golomb

S.W.

(

1966

)

Run-length encodings

.

IEEE Transact. Inf. Theory

,

12

,

399

–

401

.

Google Scholar

Crossref

WorldCat

Kessner

D.

et al. (

2008

)

ProteoWizard: open source software for rapid proteomics tools development

.

Bioinformatics (Oxford, England)

,

24

,

2534

–

2536

.

Kukuczka

B.

et al. (

2014

)

Proton gradient regulation5-like1-mediated cyclic electron flow is crucial for acclimation to anoxia and complementary to nonphotochemical quenching in stress adaptation

.

Plant Physiol

.,

165

,

1604.

Rice

R.

,

Plaunt

J.

(

1971

)

Adaptive variable-length coding for efficient compression of spacecraft television data

.

IEEE Trans. Commun. Technol

.,

19

,

889

–

897

.

Google Scholar

Crossref

WorldCat

Röst

H.L.

et al. (

2014

)

pyOpenMS: a Python-based interface to the OpenMS mass-spectrometry algorithm library

.

Proteomics

,

14

,

74

–

77

.

Röst

H.L.

et al. (

2015

)

Efficient visualization of high-throughput targeted proteomics experiments: tAPIR: fig. 1

.

Bioinformatics

,

31

,

2415

–

2417

.

Röst

H.L.

et al. (

2016

)

OpenMS: a flexible open-source software platform for mass spectrometry data analysis

.

Nat. Methods

,

13

,

741

–

748

.

Teleman

J.

et al. (

2014

)

Numerical compression schemes for proteomics mass spectrometry data

.

Mol. Cell. Proteomics MCP

,

13

,

1537

–

1542

.

Google Scholar

Crossref

WorldCat

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Associate Editor:

Download all slides

Month:	Total Views:
January 2018	30
February 2018	113
March 2018	23
April 2018	18
May 2018	23
June 2018	17
July 2018	67
August 2018	33
September 2018	22
October 2018	10
November 2018	20
December 2018	1
January 2019	17
February 2019	19
March 2019	21
April 2019	26
May 2019	22
June 2019	15
July 2019	23
August 2019	50
September 2019	39
October 2019	40
November 2019	57
December 2019	36
January 2020	45
February 2020	49
March 2020	57
April 2020	47
May 2020	15
June 2020	62
July 2020	143
August 2020	45
September 2020	47
October 2020	75
November 2020	52
December 2020	33
January 2021	73
February 2021	25
March 2021	57
April 2021	33
May 2021	27
June 2021	60
July 2021	54
August 2021	58
September 2021	45
October 2021	69
November 2021	29
December 2021	28
January 2022	52
February 2022	40
March 2022	54
April 2022	42
May 2022	44
June 2022	48
July 2022	36
August 2022	47
September 2022	60
October 2022	54
November 2022	42
December 2022	28
January 2023	45
February 2023	48
March 2023	47
April 2023	47
May 2023	43
June 2023	38
July 2023	31
August 2023	56
September 2023	26
October 2023	60
November 2023	64
December 2023	203
January 2024	102
February 2024	51
March 2024	61
April 2024	35

Article Contents

pymzML v2.0: introducing a highly compressed and seekable gzip format

Abstract

1 Introduction

2 Materials and methods

3 Conclusion

Acknowledgements

References

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

Article Contents

pymzML v2.0: introducing a highly compressed and seekable gzip format

Abstract

1 Introduction

2 Materials and methods

3 Conclusion

Acknowledgements

References

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

This Feature Is Available To Subscribers Only