pyM2aia: Python interface for mass spectrometry imaging with focus on deep learning

Abstract
Summary: Python is the most commonly used language for deep learning (DL). Existing Python packages for mass spectrometry imaging (MSI) data are not optimized for DL tasks. We therefore introduce pyM²aia, a Python package for MSI data analysis with a focus on memory-efficient handling and processing and on convenient data access for DL applications. pyM²aia provides interfaces to its parent application M²aia, which offers interactive capabilities for exploring and annotating MSI data in imzML format. pyM²aia utilizes the image input and output routines, data formats, and processing functions of M²aia, ensures data interchangeability, and enables the writing of readable and easy-to-maintain DL pipelines by providing batch generators for typical MSI data-access strategies. We showcase the package in several examples, including imzML metadata parsing, signal processing, ion-image generation, and, in particular, DL model training and inference for spectrum-wise approaches, ion-image-based approaches, and approaches that use spectral and spatial information simultaneously.
Availability and implementation: The Python package, code, and examples are available at https://m2aia.github.io/m2aia.


AVAILABILITY OF SUPPORTING SOURCE CODE, DATA, AND REQUIREMENTS
Data are available for download from the MetaboLights repository (Haug et al., 2019) under accession number MTBLS2639 (https://www.ebi.ac.uk/metabolights/MTBLS2639).

EXAMPLES
The examples provided demonstrate the practical application of pyM²aia's API and focus on the generation of MSI data samples for deep neural networks, based on the strategies described in the main paper.

Example I
pyM²aia enables access to image metadata, including pixel spacing (spot size), image dimensions, spectrum depth, and spectrum type (continuous/processed, profile/centroid spectrum types as defined in the imzML standard (Schramm et al., 2012)). In addition, image context metadata, including tags and values defined in imzML, are accessible, providing detailed information about the MSI datasets.
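Because imzML headers are mzML-based XML, the metadata described above can also be inspected with the Python standard library alone. The following sketch collects all controlled-vocabulary parameters ("IMS:..." and "MS:..." accessions) from a minimal, hand-written header fragment; the snippet and the accession numbers in it are illustrative, not taken from a real dataset, and this is a generic stand-in rather than pyM²aia's own metadata interface.

```python
import xml.etree.ElementTree as ET

# Minimal, hypothetical imzML header fragment; real files follow the mzML
# schema, but the cvParam pattern (accession / name / value) is the same.
IMZML_SNIPPET = """
<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <scanSettingsList count="1">
    <scanSettings id="scan1">
      <cvParam accession="IMS:1000042" name="max count of pixels x" value="128"/>
      <cvParam accession="IMS:1000043" name="max count of pixels y" value="96"/>
      <cvParam accession="IMS:1000046" name="pixel size x" value="50"/>
    </scanSettings>
  </scanSettingsList>
</mzML>
"""

def collect_cv_params(xml_text):
    """Map cvParam accession -> (name, value) for all IMS:/MS: tags."""
    root = ET.fromstring(xml_text)
    params = {}
    # iterate with the mzML default namespace baked into the tag name
    for cv in root.iter("{http://psi.hupo.org/ms/mzml}cvParam"):
        acc = cv.get("accession", "")
        if acc.startswith(("IMS:", "MS:")):
            params[acc] = (cv.get("name"), cv.get("value"))
    return params

params = collect_cv_params(IMZML_SNIPPET)
print(params["IMS:1000042"])  # ('max count of pixels x', '128')
```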

Example II
pyM²aia wraps M²aia's optimized signal-processing utilities. Example II outlines how to configure the signal-processing pipeline. Results of several signal-processing configurations are shown in Figure S1. Currently, different types of baseline correction, signal smoothing, normalization, pooling, and intensity transformations are supported (Cordes et al., 2021).
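To illustrate what such a pipeline computes, the sketch below chains three of the mentioned steps (smoothing with a half window size, TIC normalization, square-root intensity transformation) in plain numpy on a synthetic profile spectrum. These are deliberately simple stand-ins for illustration, not M²aia's optimized C++ implementations, and the function names are ours.

```python
import numpy as np

def smooth(y, hws=2):
    """Moving-average smoothing with half window size `hws`."""
    kernel = np.ones(2 * hws + 1) / (2 * hws + 1)
    return np.convolve(y, kernel, mode="same")

def tic_normalize(y):
    """Scale the spectrum by its Total Ion Current (sum of intensities)."""
    tic = y.sum()
    return y / tic if tic > 0 else y

def sqrt_transform(y):
    """Square-root intensity transformation (variance stabilization)."""
    return np.sqrt(np.clip(y, 0, None))

# Synthetic profile spectrum: two Gaussian peaks plus noise.
rng = np.random.default_rng(0)
mz = np.linspace(200, 270, 1000)
y = np.exp(-((mz - 220) ** 2) / 2) + 0.5 * np.exp(-((mz - 250) ** 2) / 2)
y += 0.01 * rng.random(mz.size)

processed = sqrt_transform(tic_normalize(smooth(y, hws=3)))
```

The order of the steps matters in practice (e.g., normalizing before or after smoothing changes the result), which is why Example II exposes the pipeline as a configuration.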

Example III
Example III demonstrates ion-image generation and how to combine multiple ion-images into a colored image (Fig. S2). Generated images can be written in common image formats (e.g., the Nearly Raw Raster Data [*.nrrd] image file format) that are compatible with M²aia, enabling interactive exploration of image artifacts generated with pyM²aia in the desktop application M²aia.
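Conceptually, an ion image sums each pixel's intensities within an m/z window, and a colored overlay stacks three such images into RGB channels. The sketch below shows this on a small dense synthetic cube; it is a simplification for illustration (pyM²aia accesses imzML data lazily, spectrum by spectrum, rather than through a dense array), and the helper names are ours.

```python
import numpy as np

def ion_image(cube, mz_axis, center, tol):
    """Sum intensities within [center - tol, center + tol] per pixel.

    `cube` is a dense (H, W, D) intensity array -- a simplification of
    the lazy per-spectrum access that pyM2aia provides for imzML data.
    """
    sel = (mz_axis >= center - tol) & (mz_axis <= center + tol)
    return cube[:, :, sel].sum(axis=2)

def to_rgb(images):
    """Stack three ion images into an RGB image, each scaled to [0, 1]."""
    channels = []
    for img in images:
        span = img.max() - img.min()
        channels.append((img - img.min()) / span if span > 0 else img * 0.0)
    return np.stack(channels, axis=-1)

# Synthetic continuous MSI cube: 4 x 5 pixels, 300 m/z bins.
mz_axis = np.linspace(200, 500, 300)
rng = np.random.default_rng(1)
cube = rng.random((4, 5, 300))

rgb = to_rgb([ion_image(cube, mz_axis, c, tol=1.0) for c in (250.0, 300.0, 350.0)])
print(rgb.shape)  # (4, 5, 3)
```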

Spectral Strategy: Example IV -autoencoder for peak learning
All changes to the original peak-learning code-base (Abdelmoula et al., 2021) are available in a GitHub fork [3]. Results of the peak-learning process are shown in Figure S6 and Figure S7. Different training strategies for individual and combined models are shown in Figure S3.

Spatial Strategy: Example V -ion-image-based co-localization
All changes to the original self-supervised ion-image clustering code-base (Hu et al., 2022) are available in a GitHub fork [4]. Results are illustrated in Figure S4.

Spatio-spectral strategy: Example VI -variational autoencoder
In this example, the 3x3 spatial neighborhood of each randomly selected spectrum is used to train a variational autoencoder (see Figure S5). Results of the encoded spectra (latent variable z) are shown in Figure S8.
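The sampling step behind this strategy can be sketched in plain numpy: for each selected pixel, cut out the (2*hws+1) x (2*hws+1) spatial neighborhood of full spectra and rearrange it into the channels-first layout that the training examples use. This is a generic illustration on a dense synthetic cube, not pyM²aia's generator implementation, and border handling is omitted for brevity.

```python
import numpy as np

def neighborhood_patches(cube, coords, hws=1):
    """Extract (2*hws+1)^2 spatial neighborhoods of full spectra.

    Returns an array of shape (N, C, H, W) with C = spectrum depth and
    H = W = 2*hws + 1. Pixels are assumed to lie at least `hws` away
    from the image border.
    """
    patches = []
    for (y, x) in coords:
        patch = cube[y - hws:y + hws + 1, x - hws:x + hws + 1, :]  # (H, W, C)
        patches.append(np.transpose(patch, (2, 0, 1)))             # (C, H, W)
    return np.stack(patches)

rng = np.random.default_rng(2)
cube = rng.random((16, 16, 50))      # 16 x 16 pixels, 50 m/z bins
coords = [(5, 5), (8, 3), (10, 12)]  # randomly selected pixels
X = neighborhood_patches(cube, coords, hws=1)
print(X.shape)  # (3, 50, 3, 3)
```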

Spatio-spectral strategy: Example VII -pixel-wise classification
This example demonstrates the training of a pixel-wise classification model using spatial annotations on one MSI dataset, the application of the trained model to unseen data from all four MSI datasets, and the storage of the results including metadata for spatially correct display or further processing. Manual annotations were interactively created for a single sample (slice 3) and exported as a labeled image in NRRD format. Additionally, centroid lists were generated for each sample (slices 1-4) and combined into a single centroid list, exported in text format as comma-separated values (.csv). M²aia was utilized for the interactive creation of the labeled images and centroid lists. In the Python notebook of Example VII, we load all four imzML MSI datasets, the labeled image (using SimpleITK), and the combined centroid list (using numpy). The list of centroids and the labeled image are then passed to the spectrum generator, which generates batches of the form [X=[B,C,H,W], Y=[B]] (see Figure S5). Here, X represents the spectral data and Y denotes the labels for each sample in the batch. We then build a convolutional neural network for classification using categorical cross-entropy, a 9x9 spatial neighborhood, and randomly selected spectra from the provided annotated regions. Further details can be found in the example notebook.
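The batch shape [X=[B,C,H,W], Y=[B]] described above can be sketched as a small generator that draws random annotated pixels from a label image and cuts out their 9x9 spectral neighborhoods. This is a minimal numpy stand-in for illustration, not pyM²aia's spectrum generator; names and border handling are ours.

```python
import numpy as np

def label_batch_generator(cube, label_img, batch_size=4, hws=4, rng=None):
    """Yield batches (X, Y) with X of shape [B, C, H, W] and Y of shape [B].

    Spectra are drawn at random from annotated pixels (label > 0); the
    spatial context is a (2*hws+1) x (2*hws+1) neighborhood (9 x 9 for
    hws=4). Pixels whose neighborhood crosses the image border are dropped.
    """
    rng = rng or np.random.default_rng()
    ys, xs = np.nonzero(label_img)
    keep = ((ys >= hws) & (ys < cube.shape[0] - hws) &
            (xs >= hws) & (xs < cube.shape[1] - hws))
    ys, xs = ys[keep], xs[keep]
    while True:
        idx = rng.integers(0, ys.size, batch_size)
        X = np.stack([
            np.transpose(cube[y - hws:y + hws + 1, x - hws:x + hws + 1, :],
                         (2, 0, 1))
            for y, x in zip(ys[idx], xs[idx])
        ])
        Y = label_img[ys[idx], xs[idx]] - 1   # zero-based class indices
        yield X, Y

rng = np.random.default_rng(3)
cube = rng.random((32, 32, 20))     # 20 centroids per spectrum
label_img = np.zeros((32, 32), dtype=int)
label_img[10:20, 10:20] = 1         # annotated region, class 0
label_img[22:28, 5:12] = 2          # annotated region, class 1

X, Y = next(label_batch_generator(cube, label_img, batch_size=4, rng=rng))
print(X.shape, Y.shape)  # (4, 20, 9, 9) (4,)
```

Batches in this layout can be fed directly to a channels-first convolutional classifier.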

COMPARISON OF PYM²AIA AND PYIMZML
Table S1 provides a comparison between pyM²aia and pyimzML, as far as such a comparison is possible: pyimzML supports only the loading of imzML datasets, whereas pyM²aia additionally supports preprocessing (different methods for baseline correction, normalization, and smoothing) and the data-access functions for the three different strategies, which therefore have no direct counterpart in pyimzML.

Table S1. Comparison of pyM²aia and pyimzML.

Feature: Lazy imzML metadata queries
  pyM²aia: all XML elements with "IMS:..." and "MS:..." tags
  pyimzML: tags required to correctly represent the image (max count of pixels x/y, max dimension x/y, pixel size x/y/z)

Notes: * Not in scope of the package. ** By default, M²aia performs a full parse of the imzML XML tags and creates an index image, normalization images (for all implemented normalization methods), and overview spectra (max/mean). For the comparison with pyimzML, we implemented these functionalities directly in Python with numpy. All four datasets were loaded sequentially. Runtime and maximum memory usage were averaged over 50 repetitions.
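The notes above mention reimplementing M²aia's cached artifacts (overview spectra, normalization images) directly with numpy for the benchmark. On a dense synthetic cube, those computations reduce to a few array reductions; the sketch below illustrates the idea and is not the benchmark code itself.

```python
import numpy as np

def overview_spectra(cube):
    """Mean and max overview spectra of a dense (H, W, C) MSI cube."""
    flat = cube.reshape(-1, cube.shape[-1])
    return flat.mean(axis=0), flat.max(axis=0)

def tic_image(cube):
    """Per-pixel Total Ion Current image (a TIC normalization image)."""
    return cube.sum(axis=-1)

rng = np.random.default_rng(4)
cube = rng.random((8, 8, 100))
mean_spec, max_spec = overview_spectra(cube)
tic = tic_image(cube)
print(mean_spec.shape, max_spec.shape, tic.shape)  # (100,) (100,) (8, 8)
```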

Fig. S1. Example II: comparison of mean overview spectra from the same dataset using different signal-processing methods. Mean overview spectra for Section 1 within the range of m/z 200 to m/z 270 are shown. TIC: Total Ion Current; SR: Square Root Transformation; hws: half window size.

Fig. S3. Spectral strategy: Example IV - pyM²aia implementations of a spectral strategy for peak learning by Abdelmoula et al. (2021). The target objective is to learn to reconstruct individual spectra of a set of MSI datasets I_n (four in the example) using an autoencoder model M. The result of the peak-learning procedure is a list of centroids. Two different variants are illustrated. The upper blue path shows how to train four independent models M_n using four independent (individual) spectrum batch generators G_n of pyM²aia. The lower orange path uses a single instance G_all of a pyM²aia spectrum batch generator to process multiple images at the same time (combined).

Fig. S4. Spatial strategy: Example V - pyM²aia implementation of a spatial strategy for self-supervised clustering of ion-images by Hu et al. (2022). pyM²aia's ion-image batch generator of MSI dataset I_3 is utilized to feed (1.) a pre-trained EfficientNet model M_pre (Tan and Le, 2020) to generate a lower-dimensional (1024) embedding A_M of all ion-images generated with respect to a user-defined list of centroids C. For (2.) fine-tuning of the model, unsupervised SimCLR (Chen et al., 2020) training is applied, resulting in (3.) a refined embedding Â_M. Subsequently (fourth column in the figure), UMAP (McInnes et al., 2020) is applied to embed A_M and Â_M in two dimensions. Each embedded point refers to one ion-image. Spectral clustering (Damle et al., 2019) was applied to Â_M. The resulting clusters are color-coded in the figure. To visually demonstrate that the clustering in the space Â_M was successful, the ion-images corresponding to one cluster are marked with red crosses in the UMAP visualizations and shown in the last column; the images are visually similar. Without fine-tuning, these images would not form a cluster, as can be seen from the wide distribution of the marked cluster instances in the UMAP(A_M) visualization.

Fig. S5. Spatio-spectral strategy: Examples VI/VII - variational autoencoder/pixel-wise classification. Generators can be initialized using label images (L) and/or centroid lists (C). For the description of the two examples, see the text. pyM²aia enables individual as well as combined spatio-spectral processing of MSI datasets (as demonstrated for the spectral strategy in Example IV, see Fig. S3).

Fig. S6. Spectral strategy: Results of Example IV - peak learning. Each row represents a single MSI dataset. Absolute errors between reconstructed mean profile spectra (individual models: blue lines; combined model: orange lines) and the original mean profile spectrum (gray lines) are shown for each slice. All values are normalized to 5% of the maximum of the respective original mean spectrum. Mass range m/z 220 - m/z 240. Learned peaks are shown for the individual models (blue markers) and the combined model (orange markers).

Fig. S7. Spectral strategy: Results of Example IV - peak learning. Recovered structures of the original high-dimensional MSI dataset visualized with the values of the encoded spectra (latent variable z) of variational autoencoders. Each row represents a single slice. From left to right, each column represents the respective component of the latent variable z_0, z_1, z_2, z_3, z_4. In A) the individual models and in B) the combined model are used to encode each spectrum of each MSI dataset. Displayed values represent data between the 1st and 99th percentiles.
Fig. S8. Spatio-spectral strategy: Results of Example VI - variational autoencoder. Latent variable z of the variational autoencoder using the spatio-spectral strategy. From left to right, each column represents the respective component of the latent variable z_0, z_1, z_2, z_3, z_4. Displayed values represent data between the 1st and 99th percentiles.

• Spatio-spectral strategies are still rare, with only a few applications integrating spatial and spectral information in some way, e.g., Palmer et al. (2017); Abu Sammour et al. (2021).
It is important to note that these examples do not explain or evaluate the methods themselves. Detailed discussions of the methods can be found in the original publications. The first three examples show how imzML data is handled with pyM²aia. Examples IV to VII demonstrate how to utilize pyM²aia for state-of-the-art DL applications. All examples can be found in the GitHub repository [2].