LSMMD-MA: scaling multimodal data integration for single-cell genomics data analysis

Abstract

Motivation: Modality matching in single-cell omics data analysis—i.e. matching cells across datasets collected using different types of genomic assays—has become an important problem, because unifying perspectives across different technologies holds the promise of yielding biological and clinical discoveries. However, single-cell datasets can now reach hundreds of thousands to millions of cells, a scale that remains out of reach for most multimodal computational methods.

Results: We propose LSMMD-MA, a large-scale Python implementation of the MMD-MA method for multimodal data integration. In LSMMD-MA, we reformulate the MMD-MA optimization problem using linear algebra and solve it with KeOps, a CUDA framework for symbolic matrix computation in Python. We show that LSMMD-MA scales to a million cells in each modality, two orders of magnitude greater than existing implementations.

Availability and implementation: LSMMD-MA is freely available at https://github.com/google-research/large_scale_mmdma and archived at https://doi.org/10.5281/zenodo.8076311.


B Primal (LSMMD-MA) and dual (MMD-MA) formulations side-by-side
In the dual, the mappings between the original and latent spaces are parameterised by the dual variables $\alpha_x \in \mathbb{R}^{n_x \times d}$ and $\alpha_y \in \mathbb{R}^{n_y \times d}$, such that the embedding of the first (resp. second) modality is $XX^\top \alpha_x$ (resp. $YY^\top \alpha_y$). In the primal, we instead parameterise the mappings by the primal variables $W_x \in \mathbb{R}^{p_x \times d}$ and $W_y \in \mathbb{R}^{p_y \times d}$, such that the embedding of the first (resp. second) modality is $XW_x$ (resp. $YW_y$). The two formulations are related by $W_x = X^\top \alpha_x$ and $W_y = Y^\top \alpha_y$. Given this relationship, we can rewrite all the terms of MMD-MA in the primal as functions of $(X, Y, W_x, W_y)$ instead of $(K_x, K_y, \alpha_x, \alpha_y)$:

$$L(W_x, W_y) = \mathrm{MMD}^2(XW_x, YW_y) + \lambda_1 \big(\mathrm{pen}(W_x) + \mathrm{pen}(W_y)\big) + \lambda_2 \big(\mathrm{dis}(W_x) + \mathrm{dis}(W_y)\big),$$

where $\mathrm{pen}(W_x) = \|W_x^\top W_x - I_d\|_F^2$ and $\mathrm{dis}(W_x) = \|XX^\top - XW_xW_x^\top X^\top\|_F^2$, and similarly for the second modality. Indeed, with $K_x = XX^\top$ and $W_x = X^\top \alpha_x$, we have $K_x \alpha_x = XW_x$ and $\alpha_x^\top K_x \alpha_x = W_x^\top W_x$, so the primal and dual objectives coincide.

We observe that LSMMD-MA (primal) and MMD-MA (dual) obtain similar FOSCTTM scores on 12 synthetic datasets. We considered only datasets with at most 10,000 samples because MMD-MA runs out of memory beyond 14,000 samples (see Table A1). For this comparison, we chose the multi-start experimental setting presented in [1], where the algorithm is run for several seeds and the best seed is chosen according to the smallest loss on the training set. We simulated our own synthetic dataset with a branch-like latent space using the generate_data function in lsmmdma/data/data_pipeline.py. We fixed the number of seeds to 15. For both algorithms, we stopped training at 50,000 epochs, at which point the losses no longer decrease significantly. The scaling parameter of the Gaussian RBF kernel in the MMD term was set to 30, and the regularisation parameters were fixed to $(\lambda_1, \lambda_2) = \left(\tfrac{0.01}{\sqrt{p}}, \tfrac{0.0001}{n\sqrt{p}}\right)$, where $n$ and $p$ are the numbers of samples and features in each modality. We set the learning rate to $5.0 \times 10^{-4}$ for LSMMD-MA and to $1.0 \times 10^{-6}$ for MMD-MA. Parameters were initialised uniformly in both cases, from 0 to 1 for LSMMD-MA and from 0 to 0.1 for MMD-MA, as in [2] and [1]. We added amsgrad=True to the optimiser of MMD-MA, as suggested in [2]. We found that in practice this set of hyperparameters, selected according to [1], works well for a variety of datasets; moreover, most results are robust to changes of one to two orders of magnitude in the hyperparameters.

C Runtime and memory requirements of LSMMD-MA and MMD-MA

Table A2: Runtime and memory requirements for MMD-MA in the primal and dual forms, without using KeOps. $n$ is the number of cells, $p$ the number of features and $d$ the dimension of the embeddings. We drop the subscripts $x$ and $y$ for readability, but the big-O expressions in the table hold for each modality. Typically, $n \gg p \gg d$ for each modality.
We compute the distortion term in two different ways, depending on the relative numbers of samples and features. If the number of samples is smaller than the number of features ($n < p$), we compute the distortion as

$$\mathrm{dis} = \|XX^\top - XWW^\top X^\top\|_F^2.$$

If the number of samples is larger than the number of features ($n > p$), we want to avoid computing $XX^\top$, which scales quadratically in $n$ in terms of both memory and runtime. We therefore take advantage of the relationship between the trace of a matrix and the Frobenius norm and compute instead

$$\mathrm{dis} = \|X^\top X\|_F^2 - 2\,\|X^\top X W\|_F^2 + \|W^\top X^\top X W\|_F^2,$$

which involves only the $p \times p$ matrix $X^\top X$.
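To make the two regimes concrete, here is a minimal NumPy sketch (the sizes and function names are ours for illustration, not part of the package) that computes the distortion term both ways and checks that they agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 2000, 50, 5          # here n > p, the regime where the trace trick pays off
X = rng.normal(size=(n, p))
W = rng.normal(size=(p, d))

def distortion_naive(X, W):
    """Direct form: instantiates the n x n Gram matrix XX^T, O(n^2) memory."""
    G = X @ X.T                # (n, n) Gram matrix
    E = X @ W                  # (n, d) embedding
    return np.linalg.norm(G - E @ E.T, "fro") ** 2

def distortion_primal(X, W):
    """Trace/Frobenius form: only the p x p matrix X^T X is ever built."""
    C = X.T @ X                # (p, p)
    M = W.T @ C @ W            # (d, d)
    return (np.linalg.norm(C, "fro") ** 2
            - 2 * np.linalg.norm(C @ W, "fro") ** 2
            + np.linalg.norm(M, "fro") ** 2)

print(np.allclose(distortion_naive(X, W), distortion_primal(X, W)))  # True
```

The second form never materialises an $n \times n$ matrix, which is what makes the $n > p$ regime tractable.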

D Brief introduction to KeOps and its Map-Reduce computations
KeOps [3] is a package for efficiently computing, on GPUs, operations of the form (a) evaluation of a vector-valued function on input vectors (in our case, the Gaussian kernel) and (b) a reduction over a given axis (in our case, the average of the Gaussian RBF kernel). The idea is based on a block-by-block map-reduce scheme: the data is loaded into GPU memory only block by block. Once a block is loaded, the vector-valued function is evaluated on the GPU, and the result of the reduction is accumulated in a running buffer. As a consequence, fast GPU computation is possible without ever needing to load the entire matrix at once.
In our case, the bottleneck is the computation of the average of the Gaussian RBF kernel in the MMD term. For example, with one million samples, the kernel matrix would be one million by one million, which does not fit in GPU memory. The block-by-block scheme of KeOps makes it possible to compute the average of the Gaussian kernel without ever instantiating the entire matrix, as in the sketch below. Detailed explanations and tutorials are available at: https://www.kernel-operations.io/keops/index.html
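As an illustration, the following sketch uses the pykeops LazyTensor API to average a Gaussian RBF kernel without ever instantiating the kernel matrix (the sizes, kernel parameterisation and variable names here are placeholders of ours; the actual LSMMD-MA code differs):

```python
import torch
from pykeops.torch import LazyTensor

device = "cuda" if torch.cuda.is_available() else "cpu"
n, m, p, sigma = 10**5, 10**5, 50, 30.0   # scales to 10**6 samples on a GPU
x = torch.randn(n, p, device=device)
y = torch.randn(m, p, device=device)

# Symbolic (lazy) views of the data: the (n, m) kernel matrix is never allocated.
x_i = LazyTensor(x[:, None, :])           # shape (n, 1, p)
y_j = LazyTensor(y[None, :, :])           # shape (1, m, p)

D_ij = ((x_i - y_j) ** 2).sum(-1)         # symbolic squared Euclidean distances
K_ij = (-D_ij / (2 * sigma**2)).exp()     # symbolic Gaussian RBF kernel

# The reduction triggers KeOps' block-by-block map-reduce on the device:
kernel_mean = K_ij.sum(dim=1).sum() / (n * m)   # average of the kernel matrix
```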

E FOSCTTM obtained with LSMMD-MA
The Fraction Of Samples Closer than The True Match (FOSCTTM) is a common metric for measuring accuracy in modality matching: the lower the metric, the better the matching (a sketch of its computation is given at the end of this section). We tested LSMMD-MA on several synthetic datasets, as described in Appendix G. The experimental settings follow those described in the simulations of [1] and [2]. In both cases, the algorithm is run several times (i.e. for several seeds) and the best seed is chosen according to either the smallest loss on the training set (see [1]) or the smallest FOSCTTM on the validation set (see [2]). We stopped training at 20,000 epochs. The learning rate was fixed to 0.0005, the scaling parameter of the Gaussian RBF kernel in the MMD term was fixed to $4\sigma$, where $\sigma$ is the average of the distance matrix, and the regularisation parameters $(\lambda_1, \lambda_2)$ were either (0.…
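For reference, here is a minimal NumPy implementation of FOSCTTM, assuming two embeddings whose rows are in true-match correspondence (conventions such as symmetrisation over the two directions vary across papers; this is one common choice):

```python
import numpy as np
from scipy.spatial.distance import cdist

def foscttm(x, y):
    """Fraction Of Samples Closer than The True Match (FOSCTTM).

    x and y are (n, d) arrays of embeddings for the two modalities,
    where row i of x and row i of y form a true match. Returns the
    score averaged over both directions: 0 means perfect matching,
    0.5 is the expected value for a random embedding.
    """
    d = cdist(x, y)                 # (n, n) cross-modality distance matrix
    true = np.diag(d)               # distance of each sample to its true match
    # For each x_i: fraction of y_j closer to x_i than the true match y_i.
    frac_x = (d < true[:, None]).mean(axis=1)
    # For each y_j: fraction of x_i closer to y_j than the true match x_j.
    frac_y = (d < true[None, :]).mean(axis=0)
    return float((frac_x.mean() + frac_y.mean()) / 2)
```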

G Datasets
The synthetic datasets used for the runtime experiments have a number of samples varying from $10^3$ to $10^6$ and a number of features varying from 10 to $10^4$. To generate them, we used the generate_data function in lsmmdma/data/data_pipeline.py, with the random seed argument fixed to 4 and the simulation argument set to 'branch'. The simulation process works as follows. A manifold (shaped like a branch) is generated in two dimensions, and the resulting set of points is standardised. The points are then mapped to $p_{\text{feature}}$ dimensions using a $(2 \times p_{\text{feature}})$ mapping matrix sampled from a standard Gaussian distribution, resulting in an $(n_{\text{sample}} \times p_{\text{feature}})$ matrix. Gaussian noise is then added to each element of the matrix. This simulation was first described in [1]; the branch-shaped latent space aims to mimic, for example, a developmental process (see the sketch at the end of this section).

The real-world dataset on which the last experiment is run comes from [4], where datasets were made publicly available for the NeurIPS competition on Multimodal Single-Cell Data Integration. The dataset was already preprocessed to remove low-quality cells and doublets, and the counts were log-transformed. We selected genes in the gene expression modality and ADT protein measurements based on a common gene ID, which resulted in 36 features in each modality. We used the same hyperparameters as in the simulations (see Appendix E) and ran LSMMD-MA for 100,000 epochs.
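The following is an illustrative re-implementation of the simulation described above, not the package code; the exact branch construction, standardisation and noise level are our own simplifications:

```python
import numpy as np

def simulate_branch(n_sample, p_feature, noise=0.1, seed=4):
    """Sketch of the 'branch' simulation: a 2-D branching manifold is
    generated, standardised, lifted to p_feature dimensions with a random
    linear map, and perturbed with element-wise Gaussian noise."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0, 1, size=n_sample)                 # position along the branch
    branch = rng.integers(0, 2, size=n_sample)           # which of two branches
    z = np.stack([t, np.where(branch == 1, t, -t)], 1)   # (n_sample, 2) branching shape
    z = (z - z.mean(0)) / z.std(0)                       # standardise the manifold
    A = rng.normal(size=(2, p_feature))                  # random (2 x p_feature) lifting map
    return z @ A + noise * rng.normal(size=(n_sample, p_feature))

# One modality; in the package, the second modality reuses the same latent
# points with its own random map, so the two datasets share a latent space.
X = simulate_branch(1000, 100)
```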

H Comparison to baselines
In this section, we compare LSMMD-MA to several widely used single-cell data integration methods.
Among potential comparison partners, we note that:

• SCOT [5] and UnionCom [6] do not scale to large numbers of cells ($n > 50{,}000$ cells). Indeed, SCOT's runtime scales as $O(n^3)$, which is prohibitive on a CPU, and its memory usage scales as $O(n^2)$, which is prohibitive on a GPU. UnionCom is a deep learning-based method whose memory usage scales quadratically with the number of samples, which is prohibitive on a GPU.
• SCIM [7] is a deep learning-based approach that requires weak cell-label supervision, which is also not the setting we are tackling.
We therefore compared LSMMD-MA to three baselines that scale to a similar number of cells and are appropriate for the present setting. For our evaluation, we used the CITE-seq dataset and the simulated dataset described in Appendix B, with 10,000 samples and 1,000 features. The three methods are as follows:

• LIGER [12], in particular its Python version PyLiger [13], which is NMF-based. We used the code provided on GitHub and adapted it to work with a more recent Python version. The simulated data were made positive by subtracting the (negative) minimum value of each modality from the entire matrix. The real data were preprocessed according to the package guidelines. We ran PyLiger varying the following parameters: the embedding dimension ([10, 20, 30]) and the lambda value ([1, 5 (default), 8, 10]), which balances the importance of shared features against modality-specific ones.
• Harmonic alignment [14]. We note that the GitHub code with the default parameters does not scale to large matrices, as it requires computing all the eigenvalues of an $n \times n$ matrix. We therefore reduced the number of eigenvectors used when decomposing the matrix (see the first sketch after this list). The real and simulated data were preprocessed as for LSMMD-MA. We ran the algorithm varying the following parameters: the number of eigenvectors ([10, 20, 40, 100]) and the number of filters ([4 (default), 10]).
• Procrustes [15]. We used Procrustes superimposition (scipy.spatial.procrustes) on the aligned modalities as a positive control, and on the unaligned modalities to obtain a setting comparable to LSMMD-MA (see the second sketch after this list). The real and simulated data were preprocessed as for LSMMD-MA.
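The generic trick behind our harmonic alignment adaptation is to request only the $k$ leading eigenpairs instead of a full eigendecomposition. A SciPy sketch follows (the operator and sizes are placeholders of ours; the harmonic alignment code uses its own graph operator):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.default_rng(0)
n, k = 20000, 100
B = rng.normal(size=(n, 200))

# n x n operator A = B B^T / n, applied matrix-free so A is never instantiated.
A = LinearOperator((n, n), matvec=lambda v: B @ (B.T @ v) / n, dtype=np.float64)

# A full eigendecomposition costs O(n^3) time and O(n^2) memory; asking for
# only the k leading eigenpairs keeps the computation tractable.
vals, vecs = eigsh(A, k=k, which="LM")
```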
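For the Procrustes baseline, here is a minimal usage sketch on synthetic data (the rotation and noise model are ours, purely for illustration):

```python
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 36))                 # first modality
R, _ = np.linalg.qr(rng.normal(size=(36, 36)))  # random orthogonal map
Y = X @ R + 0.05 * rng.normal(size=(1000, 36))  # second modality: noisy rotated copy

# procrustes standardises both matrices, finds the best orthogonal map from
# Y to X, and returns the transformed matrices plus a disparity score.
X_std, Y_aligned, disparity = procrustes(X, Y)
print(disparity)   # close to 0 when the modalities are well aligned
```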
Evaluating all methods with the FOSCTTM metric, we observe that LSMMD-MA's performance remains comparable to the baselines.