Robust joint clustering of multi-omics single-cell data via multi-modal high-order neighborhood Laplacian matrix optimization

Abstract Motivation Simultaneous profiling of multi-omics single-cell data represents exciting technological advancements for understanding cellular states and heterogeneity. Cellular indexing of transcriptomes and epitopes by sequencing allowed for parallel quantification of cell-surface protein expression and transcriptome profiling in the same cells; methylome and transcriptome sequencing from single cells allows for analysis of transcriptomic and epigenomic profiling in the same individual cells. However, effective integration method for mining the heterogeneity of cells over the noisy, sparse, and complex multi-modal data is in growing need. Results In this article, we propose a multi-modal high-order neighborhood Laplacian matrix optimization framework for integrating the multi-omics single-cell data: scHoML. Hierarchical clustering method was presented for analyzing the optimal embedding representation and identifying cell clusters in a robust manner. This novel method by integrating high-order and multi-modal Laplacian matrices would robustly represent the complex data structures and allow for systematic analysis at the multi-omics single-cell level, thus promoting further biological discoveries. Availability and implementation Matlab code is available at https://github.com/jianghruc/scHoML.


Introduction
With the first release of single-cell transcriptome analysis technology in 2009 (Tang et al. 2009), an explosion of research has been conducted in obtaining high-resolution views of single-cell RNA-seq data, such as Smart-seq (Ramskö ld et al. 2012) and Smart-seq2 (Picelli et al. 2014); in vitro transcription-based Cel-seq (Hashimshony et al. 2012) and Cel-seq2 (Hashimshony et al. 2016); and designed primerbased MALBAC (Zong et al. 2012), etc. Advances in scRNAseq technologies have enabled the exploration of cellular heterogeneity where traditional bulk sequencing cannot reveal. In Zhang et al. (2018), T cell heterogeneity was investigated in colorectal cancer. scRNA-seq data were introduced for analyzing genetic tumor heterogeneity in Fan et al. (2018). Intra-tumoral heterogeneity of pancreatic ductal adenocarcinoma was highlighted in Peng (2019). In Sorek et al. (2021), transcriptional heterogeneity was discovered in disease-state neurons. Ren et al. (2021) applied scRNA-seq to obtain comprehensive immune landscape for better understanding of  Apart from biological research in heterogeneity analysis using scRNA-seq techniques, extensive research has been carried out in developing effective and efficient computational methods for exploring cellular heterogeneity. For instance, cell-pair differentiability correlation was proposed in evaluating cellular relationships (Jiang et al. 2018), and further incorporated for cellular heterogeneity analysis. Semi-supervised clustering method (Chen et al. 2021) was developed for analyzing scRNA-seq data. When data are in large scale, efficient hierarchical clustering algorithm was developed (Zou et al. 2021). Also, method focused on similarity learning was proposed in Mei et al. (2021) for identifying cell types using scRNA-seq data. Recent progress has shed light on graph attention auto-encoder for scRNA-seq data representation and clustering (Cheng and Ma 2022).
Cell state, as usually evaluated by RNA expression, may not fully capture the complex structure embedded. It is a complex representation determined by the interplay between transcriptome, proteome, epigenome, etc. Multi-omics single-cell sequencing technologies, by profiling multiple types of "omics" expression in the same individual cells, enable the exploration of cellular heterogeneity in an integrative way. Cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq) can simultaneously measure RNA expression and surface protein abundance via antibody-derived tags, and robust protein profiling contribute for better understanding of cell states (Stoeckius et al. 2017). Single-cell methylation and transcriptome sequencing (scM&T-seq) technique allows for quantification of transcriptomic and epigenomic expression in the same individual cells (Angermueller 2016). Different "omics" layers will together present more accurate and complete view of "single cell state" and enable dissecting regulatory heterogeneity from complex cell populations.
With the flourishing development of single-cell multi-omics technologies, a growing number of methods have been proposed for integrating multi-omics data. These methods can be categorized into two major types. A major type of methods was designed for multi-omics data measured in different cells. Representative methods include Seurat (Stuart et al. 2019) for integrating scRNA-seq and scATAC-seq data. LIGER (Welch et al. 2019) was developed to align scRNA-seq data and single-cell epigenomic data in a low-dimensional space. MATCHER (Welch et al. 2017) utilized a Gaussian process latent variable model to evaluate the correlations between scRNA-seq data and single-cell epigenomic measurements from different cells. In Zhana et al. (2018), a coupled nonnegative matrix factorization method was proposed for integrating scRNA-seq and scATAC-seq data. Another type of methods was proposed for integration of multi-omics profiles measured in the same set of cells. scAI (Jin 2020) proposed a regularized matrix factorization framework to iteratively learn a low-dimensional representation for the multi-modal single-cell data. In BREM-SC (Wang et al. 2020), a Bayesian random effects mixture model was developed for joint clustering single-cell transcriptomic and proteomic data generated by CITE-seq, where multinomial distribution was proposed to model scRNA-seq data and Dirichlet multinomial distribution was proposed to model surface protein (ADT) data. Hao et al. (2021) designed a weighted nearest-neighbor framework for integrating multi-modal single-cell data. An elegant design of modal weights in generating the weighted combination of modality affinities is proposed to measure the final weighted similarity metric integrating multiple modalities of single-cell data. It would be an interesting topic on adaptively and automatically determining a weighted combination of modalspecific "similarities" or "affinities." Considering the inherent relationship between chromatin accessibility and gene expression, Duren et al. (2022) proposed a new concept: cisregulatory potential to formulate a matrix-factorization framework to integrate scRNA-seq data and scATAC-seq data. scAB (Zhang et al. 2022) integrated scRNA-seq data or scATAC-seq data with annotated bulk sequencing data incorporating knowledge and guided graph information. The phenotype-associated cell states and signatures were elucidated through matrix factorization framework on the Pearson correlation matrix linking single-cell data and bulk RNA sequencing data. There are also attempts in using deep learning frameworks for single-cell multi-omics data integration such as GLUE (Cao and Gao 2022), a knowledge-based guidance graph-linked unified embedding method using variational autoencoders. The incorporation of knowledge either from gene-gene interaction network or bulk sequencing data would positively contribute to a better understanding of the cell states and characteristics described by single-cell omics data. However, the generalization ability in other multi-omics integration may be constrained. The development of integration techniques for single-cell multi-omics data without knowledge information is in growing need as well.
In this article, we focus on integrative analysis of parallel multi-omics single-cell data. Most of the current methods mainly aim to model a common low-dimensional embedding or unified relationship between cells. Taking into consideration of the sparsity and nonlinearity nature of single-cell data, also inspired by the above findings, we propose a multimodal high-order neighborhood Laplacian matrix optimization framework (scHoML) for integrating multi-omics singlecell data. The method is very flexible and robust, which can be applied for efficiently integrating multi-omics data both in simulation and real-world datasets generated by scM &T-seq, CITE-seq technologies, etc. The article is structured as follows. In Section 2, we present preliminary information on the framework for multi-omics data integration. Section 3 presents the method scHoML for integrating multi-omics data. Experimental results are presented in Section 4. Section 5 discusses the application capability of scHoML. Finally, Section 6 concludes the article.

Preliminaries
In this section, we provide preliminary information on the framework for multi-omics data integration.
• High-order networks Assume single modal dataset X ¼ ½x 1 ; x 2 ; . . . ; x n T 2 R nÂd , where n is the number of samples, d is the dimensionality of the attributes. We first present the high-order networks for the dataset X modeling the connectivity of the data through definitions of adjacent matrices in different orders In the construction of the first-order adjacent matrix W 1 , w 1;jk ¼ n A jk ; if a pair of vertices ðj; kÞ is connected 0; otherwise (1) where A jk ¼ exp À jjxjÀx k jj 2 2r 2 ; 1 j; k n: Here, a pair of vertices (j, k) is connected if and only if vertex j is in the nearest neighbors of the vertex k. In the construction of high-order adjacent matrix W i , we follow the similar concept in obtaining W iÀ1 . If w iÀ1;j is similar to w iÀ1;k , vertex j is also similar to vertex k in a i À 1th-order connectivity. Let w i;j ¼ ðw i;j1 ; w i;j2 ; . . . ; w i;jn Þ represent the jth row of W i , high-order adjacent matrix W i ; i ! 2 can be derived in the following formulation.
where w i;jj ¼ 0; j ¼ 1; 2; . . . ; n, making W i satisfy the form of adjacent matrix. • High-order Laplacian matrix The corresponding normalized Laplacian matrices in different orders can be derived based on W i ; i ¼ 1; 2; 3; . . .. Let D i 2 R nÂn represent its ith-order degree matrix, a diagonal matrix and D i;jj ¼ P n k¼1 w i;jk The ith-order normalized Laplacian matrix can be defined as L ðiÞ ¼ I n À ðD i Þ À 1 2 W i ðD i Þ À 1 2 ; i ¼ 1; 2; . . .
• Multi-modal Laplacian matrix Suppose we have paralleled profiled single-cell multi-omics data in V modals X ð1Þ ; X ð2Þ ; . . . ; X ðVÞ , where X ðiÞ 2 R nÂd ðiÞ ; i ¼ 1; 2; . . . ; V is the dataset of ith modal and d ðiÞ represents the attribute dimensionality of ith modal. According to the steps in construction of high-order networks and high-order Laplacian matrix, we define L ðiÞ p ; i ¼ 1; 2; . . . ; U to represent the ith-order normalized Laplacian matrix for the pth modal, i.e. L ðiÞ p ¼ I n À ðD ðiÞ p Þ À 1 2 W ðiÞ p ðD ðiÞ p Þ À 1 2 ; p ¼ 1; 2; . . . ; V, where W ðiÞ p 2 R nÂn represent ith-order adjacent matrix for pth modal singlecell data, D ðiÞ p 2 R nÂn represent ith-order degree matrix for the pth modal single-cell data. Hence, in single-cell data composed of V modals, we have in total U Â V Laplacian matrices for different modals and different orders. • Multi-modal multi-order Laplacian matrix fusion In multi-modal single-cell data integration, how to integrate the modal-specific Laplacian matrices to formulate a fused, appropriate Laplacian matrix is a central and critical problem. Taking into consideration on the modal-specific Laplacian matrices, we first integrate Laplacian matrices for specific order i, i ¼ 1; 2; . . . ; U, where to integrate modal-specific proximity information from multi-modal data. Second, incorporating high-order connectivity information embedded in the high-order Laplacian matrix, we propose the linear combination of L ðiÞ l ; i ¼ 1; 2; . . . ; U to formulate the fused Laplacian matrix for multi-omics single-cell data Intuitively, we aim to approximate the fused Laplacian matrix L Ã as the linear combination of different order Laplacian matrix L ðiÞ l ; i ¼ 1; 2; . . . ; U, respectively, to incorporate comprehensive structure information and seek better representation capability of the relationship described by multi-modal data.

Materials and methods
In this section, we present the method for integrating multiomics data, to optimize the fused Laplacian matrix, as well as obtaining the low-dimensional representation for the multiomics data.
For single modal data, spectral clustering can be realized through solving optimization problem with given Laplacian matrix L. In terms of multi-omics single-cell data, the proposed Laplacian matrix L Ã approximated by linear combination of different order modal-specific Laplacian matrices in the form P U i¼1 k i L ðiÞ l has several parameters yet to be determined.
How to automatically determine the parameters embedded in Laplacian matrix L Ã and seek better representation capability of the common embedding for multi-modal data thus constitutes a critical challenge. Motivated by the framework of spectral clustering, we propose the optimization objective as minimization of trðH T L Ã HÞ, while simultaneously seek optimized H as the low-dimensional embedding for the multi-modal data, and the optimization problem can be expressed as follows:

Multi-modal Laplacian matrix optimization
High-order Laplacian matrix can model the hidden high-order connection information among data, but the value of order needs properly selected as too high order may distort the original relationship embedded in the dataset. Hence, we focus on integration of Laplacian matrix in first order and second order, to preserve global data structure in a better manner, as well as improving learning performance. However, the positive-semi-definite property of L Ã added in the constraints of the optimization problem makes the optimization problem hard and inefficient to solve. Taking into consideration on the original definition of Laplacian matrix I n À D À1=2 W 1 D À1=2 , and the symmetric property of W 1 that can be decomposed into eigen-matrix form W 1 ¼ŨKŨ T , the optimization term L Ã can be reformulated with I n À WKW T , hence we alternatively propose the final optimization problem in the following: min k;W;K;H;l trðH T ðI n À WKW T ÞHÞ þkI n À WKW T À ðkL ð1Þ l þ ð1 À kÞL ð2Þ l Þk 2 F s:t: L ðiÞ l ¼ Here K is a diagonal matrix, and 0 K kk 1 makes sure the optimization stable.

Optimization framework
Taking into consideration on the nonconvexity of the above problem, we propose alternative optimization framework to solve the problem by updating each variable iteratively. For the convenience of optimization, we rewrite jjI n À WKW T À ðkL ð1Þ l þ ð1 À kÞL ð2Þ l Þjj 2 F into the following form: Robust joint clustering 3 The optimization process consists of the following five steps: • Updating k: Fixing W; K; H; l, the update of k can be realized through solving the optimization problem: • Updating W: Given fixed k; K; H; l, the update of W can be generated through the optimization problem as follows: where B ¼ kL ð1Þ l þ ð1 À kÞL ð2Þ l À 1 2 HH T . The solution W of Equation (7) can be calculated as the first c eigenvectors of B (Zhou et al. 2020). • Updating K: Given fixed k; W; H; l, we optimize the following problem to update K: We can get: • Updating H: Fixing k; W; K; l, the optimization problem with respect to H can be reduced into the following formula: min H T H¼Ic trðH T ðI n À WKW T ÞHÞ Then we can obtain the solution H of Equation (9) by calculating the first c eigenvectors of I n À WKW T .
• Updating l: Given fixed k; W; K; H, then we can optimize the problem in the following form: The optimization problem can be rewritten as a standard quadratic programming formulation, which can be effectively solved with MATLAB quadprog. Algorithm 1 presents the process of optimization for better understanding of scHoML.

Convergence and complexity
• Convergence analysis Since Laplacian matrix is a positive-semi-definite matrix, we can conclude that the objective function of scHoML takes zero as lower bound. Obtaining its global optimal solution is difficult, because the objective function is nonconvex. If alternative optimization framework is applied, the objective function value decreases while updating variables. Therefore, the algorithm will eventually converge to a local solution.

• Complexity
The computational complexity of scHoML is mainly caused by SVD decomposition when updating W and H, and its corresponding complexity is Oðn 3 Þ. Meanwhile, the complexity of updating k and K is O(1) and O(n), respectively. Furthermore, to update l, we need to solve a standard quadratic programming problem. Let e be the precision of the result and V be the number of modals, the complexity of solving the quadratic programming problem is Oðe À1 VÞ. If the algorithm has been run for t iterations, the total complexity of our method is Oðtðn 3 þ n þ e À1 VÞÞ. If e À1 ( n 2 , the complexity of scHoML can be considered as Oðtn 3 Þ.

Clustering with inferred low-dimensional representation
In the optimization of high-order neighborhood Laplace matrix, we simultaneously obtain a common low-dimensional Algorithm 1 High-order Laplacian matrix optimization for single-cell multi-omics data: scHoML Input: Datasets:fX 1 ; X 2 ; . . . ; X V g, dimensionality of common embedding c, number of nearest neighbors k.
embedding H 2 R nÂc for the single-cell multi-modal data. The cell subpopulations can be identified from the matrix H through appropriate evaluation on the cellular relationships between cells.
Assume H ¼ ½h 1 ; h 2 ; . . . ; h n T 2 R nÂc , we model the distance between cell s and cell t (s; t ¼ 1; 2; . . . ; n) as . . . ; n: and Agglomerative hierarchical clustering was performed on the constructed distance matrix to entangle the heterogeneity embedded in the cells.
An appropriate evaluation of the cluster number is critical. We here provide a grain to coarse design of the optimal cluster number. The cluster number is determined through solving the following optimization problems.
If the involved number of samples is small, we strive to evaluate the sample-specific Silhouette coefficient to measure the clustering matching degree and optimize the mean Silhouette coefficient of all samples to determine the best cluster number c ? no : where a k ðiÞ ¼ 1 jC I jÀ1 P j2CI;i6 ¼j dði; jÞ is the average distance from the ith point to the other points in the same cluster I as i, and b k ðiÞ ¼ min J6 ¼I 1 jC J j P j2CJ dði; jÞ is the minimum average distance from the ith point to points in a different cluster J, minimized over clusters. For different cluster number k 2 K, we run agglomerative hierarchical clustering to generate different clustering results. If most samples have a high Silhouette value, the clustering solution is believed appropriate.
If the involved number of samples is relatively large, we propose statistical measures in terms of variance for evaluation of cluster number appropriateness.
where C q is the set of all data in class q, c q is the central point of class q, c E is the central point of all data involved, and n q is the total number of data points in class q. It is reasonable that we evaluate inter-class variance and intra-class variance to determine the optimal cluster number c ? no when tr P k q¼1 nqðcqÀcEÞðcqÀcEÞ T

Datasets
We introduce a number of multi-modal single-cell data to test the performance of our proposed method.
• Simulation dataset 1. The dataset is obtained from Jin (2020), which consists of two simulated modals (paired scRNA-seq and scATAC-seq) at different noise levels. The ground truth data matrices were Hði; jÞ ¼ Dropouts were generated by a Bernoulli distribution on X 1 and X 2 with the probabilities p 1i ; p 2j , which were defined as p 1i ¼ e Àk1x 2 i ; p 2j ¼ e Àk2y 2 j , where x i is the mean expression level of the ith cluster of X 1 , y j is the mean expression level of the jth cluster of X 2 . Next, Gaussian noises were added to X 1 and X 2 as X 1 ¼ X 1 þ q 1 E; X 2 ¼ X 2 þ q 2 E, where noise parameter q 1 in modal 1 varied from 3 to 5 with an increment 0.5, and noise parameter q 2 varied from 0.2 to 1 with an increment 0.2. In total, there are 200 simulated cells, the number of attributes in modal 1 and 2 are 5000 and 2000, respectively. The celltype number is 3. • Simulation dataset 2.
The dataset is obtained from Jin (2020), which consists of two simulated modals (paired scRNA-seq and scATAC-seq) where some clusters that were defined from epigenetic profile do not reflect transcriptomic distinctions. The ground truth data matrices were W 2 ði; jÞ ¼ 1; 1 þ x j ð500Þ i 500 þ x j ð500Þ; 0; otherwise; The rank of W 1 was set to be 3 and the rank of W 2 (denoted as K 2 ) varied from 3 to 7.
Hði; jÞ ¼ 1; 1 þ x j ðcÞ i x j ðcÞ; j K 2 À 1 or 1 þ x j ðcÞ i n; j ¼ K 2 ; 0; otherwise; Similar to the above-mentioned procedures, dropouts on both X 1 and X 2 were generated with k 1 ¼ 0:05; k 2 ¼ Robust joint clustering 0:025 and added Gaussian noise with q 1 ¼ 2; q 2 ¼ 1. In addition, if the values in X 2 with dropouts were greater than 0.7, we set the values to be 1, otherwise 0. In total, there are 500 simulated cells with four different cell types, the number of attributes in modal 1 and 2 are 5000 and 2000, respectively. • Mouse embryonic stem cells (mESCs) data.
It is a CITE-seq dataset extracted from a healthy donor under IRB approval from the University of Pittsburgh (Wang et al. 2020). We follow the instructions of cell type identification using well-defined markers, and removed those cells with uncertain cell types. In the dataset, there are 1242 cells in total, containing five different cell types. • pbmc_10X data.
The multi-modal single-cell data (CITE-seq dataset) was downloaded from 10Â Genomics website. A total of 7865 human peripheral blood mononuclear cells (PBMCs) with 14 surface protein markers are included in the dataset in addition to matched scRNA-seq data. Cells with uncertain cell types were removed. In total, there are 6661 cells involved, containing seven different cell types: B cells, CD14þ monocytes, CD16þ monocytes, CD4þ T cells, CD8þ T cells, dendritic cells, and natural killer (NK) cells.

Methods for comparison
We compare our proposed method scMoHL with state-ofthe-art method scAI (Jin 2020), for deconvoluting cellular heterogeneity from parallel transcriptomic and epigenomic profiles. Apart from that, we introduce a number of methods for dealing with multi-modal data for comparison. The methods for comparison are listed as follows: • SCbest: a single-view spectral clustering applied for all the views with the best clustering for output.

Computational results
For performance evaluation, we introduced two popular measures: Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) for comparing the clustering accuracy. Table 1 reports the ARI measures for the compared methods applying in the considered datasets. Table 2 reports the NMI measures for the compared methods applying in the considered datasets. In the construction of scMoHL, there are two parameters involved: the scale parameter r and the parameter k in KNN graph. We set the scale parameter r as the sample standard deviation of dataset, and the parameter k required by the KNN matrix is set to 10 for small datasets and 100 when sample size is relatively large. Modal-specific Laplacian matrices are computed on the preprocessed datasets through standard normalization and dimension reduction with principal component analysis. As for the multi-modal integration methods, SCbest, MSE, CoregSC, AASC, RMSC, AMGL, and k-means clustering methods are applied on the integrated embedding matrix obtained by these methods, hence we report the averaged ARI and NMI values running 20 times for performance comparison. And the averaged ARI and NMI values with standard deviations are reported. AWP can ensure a stable clustering result, hence the standard deviation is 0. In scAI method, a low-dimensional representation matrix H for the multi-omics single-cell data was obtained through regularized matrix factorization framework. The heterogeneity of cells is then identified by clustering through the low-dimensional representation matrix H using the Leiden community detection method, with default resolution parameter setting of 1. The resolution parameter has a great effect on the cluster number evaluation for the dataset. When we set the default resolution parameter 1 in experiments, most clustering results turned to be single cluster instead, distorting the original heterogeneity in datasets. Hence we use resolution parameter 0.1 in Leiden algorithm (scAI-Leiden) on the best H chosen (scAI is applied to each dataset 10 times with different seeds) in the comparisons. Besides, we also introduce consensus hierarchical clustering method on the matrix H obtained by scAI (scAI-hc). Regarding the cluster number, if the cluster number is a parameter as input including scAI-hc, we use the true cluster number for comparison. In simula-tion_data1, the true cluster number is 3, and the estimated cluster number by Leiden algorithm and scHoML are both 3. For simulation_data2, where the true cluster number is 4, the estimated cluster number by scHoML is 4. But for Leiden algorithm, the estimated cluster number is 5. mESC dataset has two different cell states, but in Leiden algorithm, the estimated cluster number is equal to the number of cells in the dataset, namely, 77. scHoML can accurately estimate the cluster number. For pbmc_inhouse and pbmc_10X data, where the true cluster numbers are 5 and 7, respectively, scAI-Leiden estimated the cluster number to be 13 and 77 instead. Overall, scAI-Leiden tends to overestimate the heterogeneous groups inside datasets.

Overall performance comparison
Aside from scMoHL, all the compared algorithms demonstrate specific patterns, with no dominant methods. scAI demonstrates obvious superiority compared to SCbest, MSE, CoregSC, AASC, RMSC, AMGL, AWP, and OPMC for simulated datasets. But the performance of scAI is not satisfactory when applied in mESC and pbmc datasets. SCbest as a single-view clustering method performs in an unstable way, where the ARI values obtained for the two simulated datasets and mESC dataset are unsatisfactory, revealing the complex structure of multi-modals in the respected datasets. But it is interesting to see that on pbmc data, SCbest outperforms most of the compared methods including scAI, indicating that in this dataset, there exists some modal showing a clear relationship among data. MSE as a multi-modal integration method performs similarly as SCbest. In particular for simulation data2 when the single-view clustering result showing 0.2323 in ARI value on average, MSE cannot learn a better integration of modals, getting only 0.0111 in averaged ARI value. For AASC, the clustering result on mESC data is the best excluding scHoML, showing 0.7022 in averaged ARI value. The performance of RMSC and AMGL resembled with each other on simulated datasets. When applied on real-world datasets, RMSC and AGML cannot compete with scAI in ARI values on mESC data. RMSC and AGML show better performance than scAI in ARI values and NMI values on pbmc data. For AWP algorithm, the best clustering performance is achieved on pbmc_10X dataset, 0.7230 in ARI value. OPMC algorithm shows the best clustering performance in pbmc_inhouse data, while has poor discrimination power in other datasets.
When we compare single-view clustering-based algorithm (SCbest) with the other multi-modal clustering methods, some conclusions can be made as follows. First, different modal may reveal the data in different perspectives, there are cases when some particular modal shows a clear relationship among data. Second, integration methods may not fully integrate the proper information embedded in modals, showing unsatisfactory result compared to single-view based clustering method. Taking CoregSC for example, the results in mESC data, pmbc_inhouse and pbmc_10X data are inferior to that of SCbest. Third, appropriate evaluation of the data relationship is of critical significance for entangling the heterogeneity described by multi-modals. Among all the compared methods, scHoML as a graph-based embedding method provides a better description on the relationship between cells, showing that the incorporation of high-order correlation contributes in a positive manner for relationship description.

Comparison with scAI: aggregation and integration
method for parallel single-cell multi-omics data scAI demonstrates explicit superiority compared to other traditional multi-view data clustering methods for simulation datasets. For simulation datasets, scAI ranks the second best, slightly inferior to scHoML. In real-world datasets such as mESC data, scAI with consensus hierarchical clustering method ranks the third in clustering accuracy in terms of ARI and NMI values. It is interesting to see that in pbmc-inhouse data and pbmc-10X data, scAI is not satisfactory. The computational efficiency is also restricted in scAI when the datasets contain large population of cells. When we compare different clustering methods in conjunction with scAI, we have the following findings. scAI with consensus hierarchical clustering (scAI-hc) outperforms scAI with Leiden clustering (scAI-Leiden) in cellular population heterogeneity analysis in an overall manner. For simulation datasets, scAI-Leiden perform in a similar way with scAI-hc. While for real-world datasets, scAI-Leiden perform in a unsatisfactory way. One    Robust joint clustering possible explanation may be that the resolution parameter play an important role in the performance of scAI-Leiden method, especially in tuning the number of clusters to be detected. In the experiments with resolution parameter 0.1, the tuned number of clusters is more reasonable compared to default resolution parameter 1 for many datasets. Hence we used resolution parameter 0.1 for performance illustration. Second, we further checked the influence of resolution parameter on the clustering performance for datasets as shown in Supplementary Table S3. We found that default resolution parameter 1 would be appropriate in traditional cases, but may also fail in many new cases. In our experiments with the considered datasets, they all perform unsatisfactory results. Besides, different datasets have different optimal resolution parameters. The estimation and determination of optimal resolution parameter would become an interesting problem.
In the comparison, scHoML shows the best performance for integration in both simulation datasets and real-world single-cell multi-omics data when the performance is evaluated in ARI and NMI measures. As a nonlinear relationship modeling framework, scHoML used Laplacian matrix to model the relationship in multi-modal single-cell data. In particular, the incorporation of high-order neighborhood Laplacian matrix in optimization contributes to a better description of the geometric structure of the complex multimodal data. Besides, scHoML can robustly represent the noisy, sparse multi-omics data in a unified low-dimensional embedding space. The cluster number determination strategy with sample-specific Silhouette coefficient for small sample problems as well as variance-based statistical measure offers a flexible way for accurately estimating the intrinsic clusters in the data. However, the computational complexity would become an unavoidable issue if the involved number of cells is large, because the time complexity is proportional to the number of cells n, which is Oðn 3 Þ.

Common embedding performance evaluation
All the compared methods attempt to find a common lowdimensional embedding for the single-cell multi-omics data, hence we compared 2D visualization of aggregated lowdimensional embeddings by different methods to evaluate the embedding capabilities of the considered methods. All the figures are attached in Supplementary Files. Figures 1-5 show the visualization of aggregated lowdimensional embeddings by different methods using tSNE. Different color represents different true clusters in the dataset. Figures 1 and 2 refer to the tSNE plots for simulation datasets with considered methods. Traditional multi-view clustering methods, such as SCbest, MSE, CoregSC, AASC, RMSC, AMGL, AWP, and OPMC failed to decipher the heterogeneity in the cells, where the cell subpopulations were indistinguishable in the recovered low-dimensional space by those methods. However, scAI can obtain a proper lowdimensional embedding matrix H showing appropriate relationship in the cells, where the cells are almost distinguishable. scHoML demonstrates clear superiority in getting common embedding information for the two simulation datasets, and the cell subpopulations were clearly distinguishable in the low-dimensional space when using the aggregated data. Figure 3 shows tSNE plot of different methods in obtaining low-dimensional embeddings for mESC data. AASC and AMGL perform similarly, and the cells tend to show distinguishable properties. scHoML clearly help recover a satisfactory low-dimensional embedding for mESC data generated by parallel scM&T-seq technique. Other methods including scAI cannot guarantee a proper embedding for mESC data, where different types of cells tend to mix with each other.
As shown in Fig. 4, for pbmc_inhouse data, AMGL, SCbest, and MSE perform quite similarly and the tSNE plots for the three methods share similar pattern formations, where most of the cells are distinguishable. scHoML undoubtedly demonstrates superiority compared to the remaining nine methods in aggregated low-dimensional representation for the 1242 pbmc cells profiled by CITE-seq. Figure 5 corresponds to the aggregated low-dimensional embedding for PBMC_10X data, where the multi-modal single-cell data (CITE-seq dataset) was downloaded from 10Â Genomics website containing 6661 human PBMCs cells. When the number of cells increase, the data become more complicated and the cells are more heterogeneous. Methods include OPMC and RMSC, AWP tend to mix the cells. Apart from scHoML, CoregSC shows the best aggregation performance where the same type of cells are more compactly scattered though for some particular types, the cells are diversely scattered.
Take a further look at the intrinsic complexity of the multiomics data, we analyze the tSNE plots for original data in all considered modals. Due to the inherent sparsity and noise in the data, the cells were not well separated in the scRNA-seq data and the scATAC-seq data using t-SNE, for simulation datasets as shown in subfigs (a) and (b) for simulation data1 and (c) and (d) for simulation data2 in Fig. 6. Also, for mESC data, the cell populations are mixed, as shown in subfigs (e) and (f). However, the aggregated low-dimensional data generated by scHoML help capture heterogeneity between different types of cells, as shown in subfig (j) in Figs 1-3. For pbmc_inhouse data, the modal described by ADT features can clearly differentiate cell types, while the modal described by RNA is quite noisy and makes the data analysis complicated. It is interesting to see that for scAI and many of the traditional multi-view data integration methods, such as MSE, AASC, RMSC, and OPMC, the aggregated matrix H obtained by those methods failed to give play to the descriptive advantages of ADT features, but quite influenced by the noisy RNA data. scHoML tends to obtain the satisfactory common embedding for pbmc_inhouse data. Similar results can be discovered by pbmc_10X data.

Discussions
From the computational results, we can confirm the robustness and effectiveness of scHoML in dealing with single-cell multi-omics data under different signal-to-noise ratio scenarios. In the following, we discuss the feasibility and effectiveness of multi-modal and high-order Laplacian matrix optimization in scHoML. While only considering one mode of single-cell data, we have the following observations. Different datasets demonstrate different characteristics. For simulation data1 (subfigs a and b), simulation data2 (subfigs c and d), and mESC data (subfigs e and f) as shown in Fig. 6, both modals are noisy. scHoML, however, can overcome the influence of noise effect, robustly integrate the noisy modals to generate a clear common embedding in low dimensions as shown in Figs 1-3 subfig (j). Slightly inferior to scHoML, scAI can similarly recover the geometric distribution of cells after data integration, shown in Figs 1-3 subfig (i). The multi-view clustering-based algorithms seem to be considerably influenced by the noisy data, and the performance of integration sometimes cannot compete with single-mode clustering method SCbest. For pbmc_inhouse data or pbmc_10X data, the ADT modal is relatively clear for celltype differentiation on tSNE shown in Fig. 6 subfigs (g) and (i). It is shown in Figs 4 and 5 that multi-view clustering-based algorithms, such as AWP and CoregSC can learn the clear relationship revealed by the ADT modal, demonstrating a relatively acceptable performance. However, scAI seems to be influenced by the clustering algorithm, in particular for Leiden algorithm. When the resolution parameter differs, the performance fluctuates with large variance. scHoML among all the compared partners shows the stable and robust performance. We conclude that when modals are noisy, scHoML  (a) AASC can dig inside the intrinsic geometric relationship and learn a clear common embedding; when multi-modal data contains clear modal information, scHoML has the ability of not being influenced by noisy modals. When only considering the first-order Laplacian matrix optimization, we can check the performance of MSE whose optimization objective is argmin H;a P V p¼1 a r p trðH T L p HÞ. From the tSNE plots in Figs 1-5 as well as Tables 1 and 2, we came to know that when all the involved modals are noisy, MSE did not have the ability to extract the intrinsic relationship of cells accurately. However, when incorporating high-order Laplacian information, we model the hidden high-order connection information among data in scHoML, and the learned low-dimensional data provide a better representation of the original noisy modals.
Besides, we compared scHoML to a robust single-cell mutiomics integration method UnionCom (Cao et al. 2020) in terms of embedding capability. In UnionCom, a reference modal within the dataset should be given in advance for  (a) simulation data1-scRNA-seq further processing. We therefore have included all the cases when each modal is regarded as reference modal in the comparison. It can be shown in Figs 7 and 8 that the projected low-dimensional modals cannot compete with scHoML in tSNE showing. When all the modals are noisy, the common low-dimensional space learned by UnionCom is also noisy. However, if some modals show high data quality in the multiomics data, UnionCom can guarantee a relative clear representation of the multi-omics data.
• Heterogeneity analysis We further hope to dissect the heterogeneity analysis results provided by scHoML. It is interesting to see that, for pbmc_10X data, the original data contains 1112 CD8þ T cells as a single cluster. However, scHoML divides the 1112 cells into two major clusters, with one cluster containing 766 cells, the other cluster containing 322 cells. Hence, we conduct statistical analysis to compare the differences between the two subclusters. In the ADT-based data, we did one-sided two-sample Kolmogorov-Smirnov goodness-of-fit hypothesis test on the two populations generated by scHoML, and select the representative markers which rejects the null hypothesis that F1ðxÞ ¼ F2ðxÞ as the corresponding true (but unknown) population CDFs at the 5% significance level.
The representative markers in subcluster 1 (766 cells) include "TIGIT," "CD3þ," "CD4þ," and "CD8þ," which are reported marker genes for annotating exhausted CD8þ T cells (Deng et al. 2021). A further analysis on the representative genes for cluster one through one-sided two-sample Kolmogorov-Smirnov goodness-of-fit hypothesis test leads to a filtration of top-ranked genes "CCL5," "HLA-DRB1," "GZMH," "HLA-DPA1," and "NKG7." It is well-established (Ren et al. 2021) that for CD8þ T cells, the major pTRTs were exhausted T cells and exhibited high heterogeneity. And in our dataset, we successfully identify the subcluster that is consistent with cluster C7 harboring a low frequency of terminal Tex cells and high frequency of "CD8þZNF683þCXCR6þ" Trm cells, dominated by naive T cells. It demonstrates the capability of scHoML in identifying the heterogeneity pattern embedded in the noisy single-cell multi-omics data. • Cellular state identification We conduct statistical analysis to compare the differences between specific cluster and the remaining clusters by scHoML, to investigate the potential of scHoML in cellular state identification. We performed t-test of the hypothesis that the two independent samples generated by the specific cluster and the remaining clusters come from distributions with equal means, and returns the result of the data1-scRNA-seq scATAC-seq-reference data2-scRNA-seq   Robust joint clustering 11

Conclusions
In this study, we propose a multi-modal high-order neighborhood Laplacian matrix optimization framework for integrating the multi-omics single-cell data: scMoHL. scHoML can robustly model the complex data structures and represent the noisy, sparse multi-omics data in a unified low-dimensional embedding space. Experiments on simulated datasets as well as real single-cell multi-omics data reveal that scHoML faithfully aligned heterogeneous modalities. The embedded data can further be utilized for heterogeneity analysis as well as cellular state identification, expecting to shed light on intriguing studies for revealing significant mechanisms among cells.