Abstract

Motivation

Combining omics and images can lead to a more comprehensive clustering of individuals than classic single-view approaches. Among the various approaches for multi-view clustering, nonnegative matrix tri-factorization (NMTF) and nonnegative Tucker decomposition (NTD) are advantageous in learning low-rank embeddings with promising interpretability. In addition, unwanted drivers of clustering (i.e. confounders) need to be handled.

Results

In this work, we introduce a novel multi-view clustering method based on NMTF and NTD, named INMTD, which integrates omics and 3D imaging data to derive unconfounded subgroups of individuals. According to the adjusted Rand index, INMTD outperformed other clustering methods on a synthetic dataset with known clusters. In the application to real-life facial-genomic data, INMTD generated biologically relevant embeddings for individuals, genetics, and facial morphology. By removing confounded embedding vectors, we derived an unconfounded clustering with better internal and external quality; the genetic and facial annotations of each derived subgroup highlighted distinctive characteristics. In conclusion, INMTD can effectively integrate omics data and 3D images for unconfounded clustering with biologically meaningful interpretation.

Availability and implementation

INMTD is freely available at https://github.com/ZuqiLi/INMTD.

1 Introduction

Clustering is a crucial technique in data analysis, enabling the identification of intrinsic structures within complex datasets by grouping similar data points. In the field of medicine, clustering has been widely used for uncovering disease subtypes, tailoring personalized treatments, and improving early diagnosis (Ghosal et al. 2020). As data complexity grows, clustering methods based on a single view or single data source are often insufficient for finding meaningful clusters, necessitating the development of more sophisticated approaches. Multi-view clustering has emerged as a powerful solution, leveraging multiple data perspectives to enhance clustering quality and reveal richer patterns than single-view methods (Rappoport and Shamir 2018, Chauvel et al. 2020). Omics and imaging data, two data views commonly used in biomedical science to describe an individual, have shown essential advantages in understanding various biological phenomena (Antonelli et al. 2019). For instance, Chen et al. (2021) obtained better predictions of lung adenocarcinoma subtypes by integrating features extracted from histopathological images with omics data than by using a single view. However, little research has been published on clustering individuals based on omics and imaging data. Moreover, processing images normally requires tensor methods due to their 3D format (Hériché et al. 2019): e.g. a color image consists of pixels represented by height, width, and color channel, and a 3D mesh consists of X, Y, Z coordinates for height, width, and depth.

Various multi-view clustering methods have been developed, which can be generally classified into three categories based on the relationship between data integration and clustering (Rappoport and Shamir 2018): (i) early integration combines all datasets into a single one before building the clustering model, (ii) intermediate integration clusters a joint embedding learnt from all views, and (iii) late integration computes a clustering from each dataset and then merges all clusterings. Of the three, intermediate integration approaches have shown superior performance in many applications, possibly because they rely on a model specifically designed for multi-view clustering tasks (Khan and Maji 2020, Wang et al. 2020, Yun et al. 2021). By clustering subjects with multi-view data from a joint embedding, integrative nonnegative matrix factorization (intNMF) proposed by Chalise and Fridley (2017) recovered cancer subtypes similar to those identified in previous studies. This embedding represents the subjects with patterns that are naturally additive and hence easily interpretable (Lee and Seung 1999). A well-established extension of the NMF model for better interpretation and clustering is nonnegative matrix tri-factorization (NMTF), which decomposes the input dataset into three smaller matrices (Ding et al. 2006). However, NMTF models only work with 2D matrices and cannot deal with data views of higher dimensions, e.g. a 3D tensor, which is the common data format for imaging or spatial data. A generalization of NMTF to tensors is the nonnegative Tucker decomposition (NTD) (Kim and Choi 2007). It decomposes the original input tensor into a core tensor with the same number of dimensions and one embedding matrix corresponding to each of its dimensions. There have been attempts to integrate multiple cross-linked 2D matrices into a tensor, but these do not work on natively 3D data (Luo et al. 2022). Another work by Broadbent et al. (2024) combined a tensor with a similarity matrix, but treated the matrix as a graph regularization of the Tucker decomposition rather than as a separate data view.

Clustering real-world data is often complicated by the presence of confounders, i.e. factors that influence the observed data in an unwanted way (Liu et al. 2015). Confounders can obscure the true clustering structure, leading to spurious results (Schwarz et al. 2024). Addressing confounders is essential for accurate clustering, as their effects can mask the genuine patterns within the data. Effective clustering methods must account for these confounding variables to uncover the true underlying structure. A widely adopted approach is to regress out the confounding effects from every feature during pre-processing, but it comes with a potential loss of useful signal prior to modeling (Pourhoseingholi et al. 2012). Another strategy is kernel conditional clustering, which computes the final clustering conditioned on confounders (He et al. 2020). However, conditional clustering is computationally expensive and does not work efficiently on high-dimensional data.

In this work, we propose a novel multi-view clustering method, Integrative Non-negative Matrix and Tensor Decomposition (INMTD), which obtains unconfounded clusterings by jointly decomposing 2D and 3D datasets with NMTF and NTD. It learns an embedding matrix for each data dimension and subgroups the individuals from their embedding after removing vectors in the embedding space that are linked with confounders. Because the true cluster structure of real-life patient datasets is often unknown, we evaluated INMTD first on a synthetic dataset with known clusters, and then on a US cohort of healthy individuals (White et al. 2021), whose heterogeneity mainly stems from population structure, with confounders including age, sex, etc. Combining 2D genotypes and 3D facial morphology, our model computed biologically meaningful embeddings and linked the facial and genetic embeddings well. Furthermore, INMTD derived an unconfounded clustering of individuals with better intrinsic quality and clearer association with population structure than the original clustering. We also characterized each population subgroup with its enriched genetic pathways and highlighted facial areas.

2 Materials and methods

2.1 INMTD: integrative nonnegative matrix and tensor decomposition with correction for confounders

INMTD unifies NMTF and NTD to cluster subjects with multi-view data of 2D and 3D structure. We assume p1 subjects described by two data views, a 2D matrix X12 ∈ ℝ+^(p1×p2) with p2 features and a 3D tensor X134 ∈ ℝ+^(p1×p3×p4) with p3 features in the second dimension and p4 features in the third dimension, both nonnegative. The aim of our method is to jointly compute the embedding matrices for each dimension and cluster the p1 subjects based on their own embedding (Fig. 1).

Figure 1. Overview of INMTD model for integrating 2D and 3D data. The 2D matrix X12 is decomposed into two embedding matrices G1 and G2 and a core matrix S12. The 3D tensor X134 is decomposed into three embedding matrices G1, G3, and G4 and a core tensor S134. INMTD integrates X12 and X134 by jointly optimizing G1, which is shared by the two views. The orthogonality constraint on G1 ensures disentanglement of embedding vectors.

For the 2D matrix X12, NMTF factorizes it into three nonnegative submatrices G1 ∈ ℝ+^(p1×r1), G2 ∈ ℝ+^(p2×r2), and S12 ∈ ℝ+^(r1×r2), so that X12 ≈ G1 S12 G2^T. The objective of NMTF is to find the optimal G1, G2, and S12 that minimize the reconstruction error:

(1) $\min_{G_1 \ge 0,\, G_2 \ge 0,\, S_{12} \ge 0} \; \lVert X_{12} - G_1 S_{12} G_2^T \rVert_F^2$

where $\lVert \cdot \rVert_F^2$ indicates the squared Frobenius norm and ≥ 0 for a matrix means all values in that matrix should be nonnegative. The multiplicative update rules to solve this objective function have been proposed by Ding et al. (2006). G1 and G2 are low-rank embeddings for the p1 subjects and p2 features, respectively, where r1 and r2 are their ranks and normally r1 ≪ p1 and r2 ≪ p2. S12 is the core matrix that links G1 with G2 and can be considered as the compressed representation of X12.
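For intuition, the low-rank approximation and reconstruction error in Equation (1) can be written in a few lines of NumPy; the dimensions, random data, and factor values below are purely illustrative and do not implement the INMTD update rules.

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2, r1, r2 = 100, 40, 5, 4

X12 = rng.random((p1, p2))   # nonnegative 2D data view (subjects x features)
G1 = rng.random((p1, r1))    # subject embedding
G2 = rng.random((p2, r2))    # feature embedding
S12 = rng.random((r1, r2))   # core matrix linking G1 and G2

# NMTF approximation X12 ~ G1 S12 G2^T and its squared Frobenius error, as in Equation (1)
X12_hat = G1 @ S12 @ G2.T
error = np.linalg.norm(X12 - X12_hat, "fro") ** 2
print(f"reconstruction error: {error:.3f}")
```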
Similar to NMTF, the NTD model decomposes the 3D tensor X134 into three embeddings Gi ∈ ℝ+^(pi×ri) with i ∈ {1, 3, 4}, named the mode matrices, and a core tensor S134 ∈ ℝ+^(r1×r3×r4) using the mode product of tensors:

(2) $X_{134} \approx S_{134} \times_1 G_1 \times_3 G_3 \times_4 G_4$

where $S_{134} \times_n G_i$ is the mode-n product between tensor S134 and matrix Gi, resulting in a new tensor with its n-th dimension changed. The objective of NTD is to minimize the reconstruction error:

(3) $\min_{G_1, G_3, G_4, S_{134} \ge 0} \; \lVert X_{134} - S_{134} \times_1 G_1 \times_3 G_3 \times_4 G_4 \rVert_F^2$
To jointly decompose X12 and X134, we derive an integrative objective from the two views via NMTF and NTD:

(4) $\min_{G_i, S_{12}, S_{134} \ge 0} \; \lVert X_{12} - G_1 S_{12} G_2^T \rVert_F^2 + \lVert X_{134} - S_{134} \times_1 G_1 \times_3 G_3 \times_4 G_4 \rVert_F^2$

where Gi ∈ ℝ+^(pi×ri) are the low-rank embeddings corresponding to the p1 subjects, the p2 features of view 1, the p3 features of view 2, and the p4 channels of view 2. The rank parameters ri are determined via the rule of thumb ri = √(pi/2) (Kodinariya and Makwana 2013). G1 is shared by both terms in Equation (4) and jointly learnt from both views. We further adopt an orthogonality constraint on G1 for a more rigorous clustering interpretation, inspired by Ding et al. (2006):

(5) $\min_{G_i, S_{12}, S_{134} \ge 0} \; \lVert X_{12} - G_1 S_{12} G_2^T \rVert_F^2 + \lVert X_{134} - S_{134} \times_1 G_1 \times_3 G_3 \times_4 G_4 \rVert_F^2 \quad \text{s.t. } G_1^T G_1 = I$

This optimization problem with nonnegativity and orthogonality constraints can be solved via Lagrangian multipliers and the Karush–Kuhn–Tucker conditions; Ding et al. (2006) provide a detailed mathematical derivation for handling these constraints in their orthogonal nonnegative matrix t-factorization model. Because Equation (5) has no analytic solution for Gi, S12, and S134, we iteratively compute their values via multiplicative update rules (Dissez et al. 2019) (see Section 2.2). Furthermore, in each iteration, we normalize every column of G1 after updating to further guarantee unit vectors and eliminate the scale indeterminacy, as suggested by Li et al. (2015).

We apply k-means clustering (with 10 random initializations) on the joint embedding, G1, and select the best number of clusters based on Silhouette score. Silhouette score is a classic internal metric (no ground truth required) that measures how well-separated and cohesive clusters are in a dataset. It evaluates both intra-cluster cohesion, namely how close data points within a cluster are to each other, and inter-cluster separation, namely how far each cluster is from other clusters. A higher value (close to 1) suggests a more valid clustering.
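As an illustration of this step, the sketch below runs scikit-learn's k-means with 10 random initializations over a range of candidate cluster numbers and keeps the solution with the highest Silhouette score; the candidate range, random seed, and toy embedding are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_kmeans(embedding, candidate_ks):
    """Return (labels, score, k) for the candidate k with the highest Silhouette score."""
    best = (None, -1.0, None)
    for k in candidate_ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)
        score = silhouette_score(embedding, labels)
        if score > best[1]:
            best = (labels, score, k)
    return best

G1 = np.random.default_rng(0).random((300, 20))   # toy stand-in for the joint embedding G1
labels, score, k = best_kmeans(G1, range(2, 11))
print(f"best k = {k}, Silhouette = {score:.3f}")
```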

To assess how much G1 is confounded by a set of known confounders, C, we conduct a linear model F-test between every confounder and every embedding vector (i.e. matrix column) of G1, with Benjamini–Hochberg (BH) correction for multiple testing. The unconfounded clustering is then recomputed from the columns of G1 that have no significant association with any confounder, again using k-means with the Silhouette score to select the optimal number of clusters. The removal of confounded embedding components rests on the additive nature of NMF-based methods, in which the data are represented by the sum of all embedding aspects (Lee and Seung 1999). Deconfounding at the embedding level is computationally less expensive and better preserves meaningful information than the widely used approach of regressing out confounders from each feature at the input level.
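A simplified sketch of this deconfounding step is given below. It treats every confounder as continuous and uses a univariate linear-model F-test (scikit-learn's f_regression) per embedding column with BH correction; the actual pipeline distinguishes continuous and categorical confounders, so the function names and thresholds here are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import f_regression
from statsmodels.stats.multitest import multipletests

def unconfounded_columns(G1, confounders, alpha=0.05):
    """Indices of G1 columns with no significant association to any confounder.

    Each confounder is tested against every embedding column with a univariate
    linear-model F-test; all P-values are corrected together with Benjamini-Hochberg.
    """
    pvals = np.vstack([f_regression(G1, c)[1] for c in confounders.T])
    reject = multipletests(pvals.ravel(), alpha=alpha, method="fdr_bh")[0]
    reject = reject.reshape(pvals.shape)        # (n_confounders, n_columns)
    return np.where(~reject.any(axis=0))[0]     # keep columns not linked to any confounder

rng = np.random.default_rng(0)
G1 = rng.random((500, 48))                      # toy embedding
confounders = rng.random((500, 3))              # e.g. age, BMI, face size (toy values)
print(unconfounded_columns(G1, confounders))
```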

2.2 Training procedure of INMTD

To solve INMTD, we use a fixed-point method that starts from an initial solution and iteratively applies multiplicative update rules to converge toward a locally optimal solution. During the optimization process, all the embedding matrices and the core matrix/tensor of INMTD are iteratively updated to minimize the objective function [Equation (5)]. Following the procedure used to derive the multiplicative update rules for orthogonal NMTF and NTD, we obtain the update rules for INMTD:
(6) $G_1 \leftarrow G_1 \odot \dfrac{X_{12} G_2 S_{12}^T + X_{134(1)} (G_4 \otimes G_3) S_{134(1)}^T}{G_1 G_1^T \left( X_{12} G_2 S_{12}^T + X_{134(1)} (G_4 \otimes G_3) S_{134(1)}^T \right)}$

(7) $G_2 \leftarrow G_2 \odot \dfrac{X_{12}^T G_1 S_{12}}{G_2 S_{12}^T G_1^T G_1 S_{12}}$

(8) $G_3 \leftarrow G_3 \odot \dfrac{X_{134(3)} (G_4 \otimes G_1) S_{134(3)}^T}{G_3 S_{134(3)} (G_4^T G_4 \otimes G_1^T G_1) S_{134(3)}^T}$

(9) $G_4 \leftarrow G_4 \odot \dfrac{X_{134(4)} (G_3 \otimes G_1) S_{134(4)}^T}{G_4 S_{134(4)} (G_3^T G_3 \otimes G_1^T G_1) S_{134(4)}^T}$

(10) $S_{12} \leftarrow S_{12} \odot \dfrac{G_1^T X_{12} G_2}{G_1^T G_1 S_{12} G_2^T G_2}$

(11) $S_{134} \leftarrow S_{134} \odot \dfrac{X_{134} \times_1 G_1^T \times_3 G_3^T \times_4 G_4^T}{S_{134} \times_1 G_1^T G_1 \times_3 G_3^T G_3 \times_4 G_4^T G_4}$

where $\odot$ denotes the element-wise product, the fraction denotes element-wise division, $\otimes$ denotes the Kronecker product, and X134(n) denotes the mode-n matricization of X134, which reshapes X134 into a 2D matrix along its n-th dimension (S134(n) is defined analogously for the core tensor).

We initialize Gi via singular value decomposition (SVD), which has shown better results than random initialization (Malod-Dognin et al. 2019). For X12, the original matrix is decomposed by SVD and G1 and G2 are derived from the left and right singular matrices, respectively. Because X134 is 3D, we run SVD on every slice along the p4 channels and average all the right singular matrices to compute the initial G3, and similarly run SVD on every slice along the p3 features to initialize G4. As the update rules are multiplicative, entries in Gi that are exactly 0 would remain 0; we therefore add a small constant (1e−5) so that these entries can still be updated.
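A minimal sketch of this initialization is shown below. It assumes the singular vectors are made nonnegative by taking absolute values before the small constant is added; that handling, the helper name, and the slice-averaging details are assumptions of the example rather than the exact INMTD implementation.

```python
import numpy as np

def svd_init(X12, X134, r1, r2, r3, r4, eps=1e-5):
    """SVD-based initialization of G1, G2, G3, G4 (illustrative sketch).

    G1/G2 come from the left/right singular vectors of X12. G3 averages the
    right singular vectors of the slices of X134 along the p4 channels; G4 does
    the same for slices along the p3 features. Absolute values keep the factors
    nonnegative and eps prevents entries from being stuck at zero under
    multiplicative updates.
    """
    U, _, Vt = np.linalg.svd(X12, full_matrices=False)
    G1 = np.abs(U[:, :r1]) + eps
    G2 = np.abs(Vt[:r2].T) + eps

    p1, p3, p4 = X134.shape
    G3 = np.zeros((p3, r3))
    for k in range(p4):                               # slices along the p4 channels
        _, _, Vt_k = np.linalg.svd(X134[:, :, k], full_matrices=False)
        G3 += np.abs(Vt_k[:r3].T)
    G3 = G3 / p4 + eps

    G4 = np.zeros((p4, r4))
    for j in range(p3):                               # slices along the p3 features
        _, _, Vt_j = np.linalg.svd(X134[:, j, :], full_matrices=False)
        G4 += np.abs(Vt_j[:r4].T)
    G4 = G4 / p3 + eps
    return G1, G2, G3, G4
```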

S12 and S134 could also be initialized by SVD through the eigenvalues if the original data were symmetric. We do not assume symmetry in our framework, therefore we apply the following rules:

(12) $S_{12} = G_1^T X_{12} G_2$

(13) $S_{134} = X_{134} \times_1 G_1^T \times_3 G_3^T \times_4 G_4^T$

The initialized S12 and S134 are automatically nonnegative because all the multipliers are nonnegative.

The total computational complexity of running INMTD is O(t·p1²·(r1 + p2 + p3p4)), where t is the number of iterations. We refer to the Supplementary Materials for more details about the computational complexity of the entire procedure.

To assess the goodness of fit and convergence of INMTD, we track the total relative error along the optimization:

(14) $RE_{total} = \dfrac{\lVert X_{12} - G_1 S_{12} G_2^T \rVert_F}{\lVert X_{12} \rVert_F} + \dfrac{\lVert X_{134} - S_{134} \times_1 G_1 \times_3 G_3 \times_4 G_4 \rVert_F}{\lVert X_{134} \rVert_F}$

The total relative error sums the reconstruction errors of the two datasets X12 and X134, each taken relative to its L2 norm. It is a nonnegative value, and a lower total relative error indicates a better reconstruction from the decomposed elements.
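The sketch below computes this quantity with NumPy, assuming the form of Equation (14) above (each view's Frobenius reconstruction error divided by that view's norm, then summed); the tensor reconstruction is done with an einsum over the mode products.

```python
import numpy as np

def total_relative_error(X12, X134, G1, G2, G3, G4, S12, S134):
    """Total relative error of the two reconstructions (sum of per-view relative errors)."""
    X12_hat = G1 @ S12 @ G2.T
    # S134 x1 G1 x3 G3 x4 G4: contract each core mode with its embedding matrix
    X134_hat = np.einsum("abc,ia,jb,kc->ijk", S134, G1, G3, G4)
    err12 = np.linalg.norm(X12 - X12_hat) / np.linalg.norm(X12)
    err134 = np.linalg.norm(X134 - X134_hat) / np.linalg.norm(X134)
    return err12 + err134
```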

2.3 Association between embeddings

Linking two embeddings from different views can be achieved by mapping them to the space of G1, because G1 is the embedding shared by both data types. More specifically, we project G2 to the space of G1 via the core matrix S12, so that G2′ = G2 S12^T. The new matrix G2′ has the same embedding size as G1 and lies in the same embedding space as G1. Similarly, G3 is projected to the space of G1 via the core tensor S134, resulting in G3′ = S134 ×3 G3. In the special case when r4 = 1, and hence S134 is of shape r1×r3×1, this is equivalent to G3′ = G3 S13^T, where S13 reshapes S134 to r1×r3. Subsequently, the relationship between feature (row) i in G2′ and feature (row) j in G3′ can be assessed by cosine similarity:

(15) $\cos(G'_{2i}, G'_{3j}) = \dfrac{G'_{2i} \cdot G'_{3j}}{\lVert G'_{2i} \rVert \, \lVert G'_{3j} \rVert}$

where G2′i indicates the i-th row of G2′ and G3′j the j-th row of G3′. Cosine similarity is the dot product between two vectors normalized by their magnitudes, providing good normalization when comparing between G2′ and G3′. It ranges from −1 to 1, and the higher the value, the more similar the two vectors.
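A sketch of this projection and similarity computation (for the special case r4 = 1 described above) is given below; the function name and shapes are illustrative assumptions.

```python
import numpy as np

def project_and_match(G2, G3, S12, S134):
    """Project G2 and G3 into the space of G1 and compute cosine similarities
    between every row of G2' (SNPs) and every row of G3' (facial landmarks).

    Assumes the special case r4 = 1, so the core tensor reduces to a matrix
    S13 of shape r1 x r3.
    """
    G2p = G2 @ S12.T                     # (p2 x r2)(r2 x r1) -> p2 x r1
    S13 = S134[:, :, 0]                  # r1 x r3
    G3p = G3 @ S13.T                     # (p3 x r3)(r3 x r1) -> p3 x r1
    # Row-normalize, then a matrix product gives all pairwise cosine similarities
    G2n = G2p / np.linalg.norm(G2p, axis=1, keepdims=True)
    G3n = G3p / np.linalg.norm(G3p, axis=1, keepdims=True)
    return G2n @ G3n.T                   # p2 x p3 matrix of cosine similarities
```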

3 Experiments

3.1 Evaluation with synthetic data

To evaluate the clustering performance of INMTD under controlled conditions and compare it with other well-established methods, we simulate a multi-view dataset of 1000 samples with known clusters. All samples are described by two data types of different dimensions, resulting in a 2D matrix X12 ∈ ℝ^(1000×200) of 1000 samples and 200 features and a 3D tensor X134 ∈ ℝ^(1000×50×10) of 1000 samples, 50 first-dimensional features, and 10 second-dimensional features. To mimic real-world scenarios, e.g. gene expression data where a certain set of genes may be more expressed in one group of people than another (Liu et al. 2014, Chen et al. 2019), features of each data view also belong to distinct subgroups, i.e. X12 and X134 have a co-cluster or block structure. Specifically, five clusters are simulated for the 1000 samples (as the ground-truth structure), 10 clusters for the 200 features of X12, four clusters for the 50 first-dimensional features of X134, and two clusters for the 10 second-dimensional features of X134. We perturb X12 and X134 by adding random noise to their features to further imitate real-life data. More details about the simulation settings are available in the Supplementary Materials.

We employ INMTD with SVD initialization and a maximum of 200 iterations on the synthetic data and measure its clustering capability with the adjusted Rand index (ARI). ARI computes the similarity between two data clusterings and, hence, the accuracy of the derived clustering with respect to the ground-truth clustering. An ARI of around 0 indicates random labeling, and the closer it is to 1, the better the match. We also employ a few other widely used clustering approaches to compare their ARIs with INMTD's, including (i) k-means, (ii) NMTF, (iii) NTD, (iv) simultaneous NMF (Badea 2008) (siNMF), (v) joint NMF (Zhang et al. 2012) (jNMF), and (vi) similarity network fusion (SNF) (Wang et al. 2014). Note that the first three methods are single-view while the last three are multi-view methods. k-means is one of the most popular clustering algorithms and is closely related to matrix factorization. We choose NMTF and NTD for the comparison because they are the building blocks of INMTD. siNMF and jNMF are both multi-view clustering methods based on NMF. SNF, on the other hand, is a network-based multi-view clustering approach, widely used in cancer studies (Chiu et al. 2018) and precision medicine (Wang et al. 2023).
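For reference, ARI is available in scikit-learn; the toy label vectors below are purely illustrative.

```python
from sklearn.metrics import adjusted_rand_score

# ARI between ground-truth simulation labels and a derived clustering;
# identical partitions up to relabeling give an ARI of 1.
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 0, 0, 2, 2]
print(adjusted_rand_score(true_labels, pred_labels))   # 1.0
```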

For the sake of robustness, we simulate the synthetic data for six runs with different random seeds and compute the ARI of INMTD and the other approaches in each run. More details about the method implementation, including choices of hyperparameters, are available in Supplementary Materials.

3.2 Evaluation with real-life data

3.2.1 Evaluation dataset

We apply INMTD to a multi-view dataset of 4680 healthy people of European ancestry, characterized by a 2D matrix and a 3D tensor (White et al. 2021). We select this cohort dataset because very few factors are expected to affect its structure, and it therefore suits well to evaluate the clustering and deconfounding performance of INMTD. These people were recruited from three independent studies in the USA: the 3D Facial Norms cohort (PITT), Pennsylvania State University (PSU), and Indiana University-Purdue University Indianapolis (IUPUI). Every individual was described by 7 141 882 single-nucleotide polymorphisms (SNPs, encoded by 0, 1, 2 for the number of minor alleles) and a 3D mesh image containing the X, Y, Z coordinates of 7160 landmarks, namely X12 ∈ ℝ+^(4680×7141882) and X134 ∈ ℝ+^(4680×7160×3), respectively. Due to the enormous number of SNPs, the SVD initialization on X12 had to adopt randomized SVD for feasibility; the initialization on X134 still used full SVD. To standardize genomic and facial data, we subtracted the mean from each view and divided every entry by the maximum of the view. We then took the absolute values to ensure nonnegativity and focus on the deviation of each individual from the population mean. The rank r1 of embedding G1 was determined via the rule of thumb √(p1/2) ≈ 48, where p1 is the number of individuals, thus G1 ∈ ℝ+^(4680×48). r2 was chosen in a similar way but based on the number of protein-coding genes, namely 19 430, instead of the number of SNPs to reduce the computational burden, thus G2 ∈ ℝ+^(7141882×99). We had G3 ∈ ℝ+^(7160×60) and G4 ∈ ℝ+^(3×1) because r3 = √(p3/2) ≈ 60 and r4 = √(p4/2) ≈ 1, where p3 = 7160 and p4 = 3. For every individual, we also collected a few covariates, including age, sex, height, weight, face size, and camera system. Body mass index (BMI), derived as BMI = weight (kg) / height (m)², was used as an additional covariate. Here, we consider these covariates as confounders of population structure because they might hinder us in finding the population subgroups based on genetic heterogeneity. To measure the population structure of the cohort, White et al. (2021) computed four ancestry axes by projecting the individuals onto the principal component (PC) space of the SNPs from the 1000 Genomes Project.
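The preprocessing and rank heuristic described above can be sketched as follows; the per-feature centering axis and the use of the absolute maximum for scaling are assumptions of the example.

```python
import numpy as np

def preprocess_view(X):
    """Center on the per-feature mean, scale by the view's maximum absolute value,
    and take absolute values, so entries reflect nonnegative deviations from the
    population mean."""
    X = X - X.mean(axis=0)
    X = X / np.abs(X).max()
    return np.abs(X)

def rank_rule_of_thumb(p):
    """Rank heuristic r = sqrt(p / 2), rounded to the nearest integer."""
    return int(round(np.sqrt(p / 2)))

print(rank_rule_of_thumb(4680))    # 48 -> r1 (individuals)
print(rank_rule_of_thumb(19430))   # 99 -> r2 (protein-coding genes)
print(rank_rule_of_thumb(7160))    # 60 -> r3 (facial landmarks)
print(rank_rule_of_thumb(3))       # 1  -> r4 (X, Y, Z channels)
```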

3.2.2 Evaluation on the embeddings

INMTD learns an embedding for SNPs, faces, and individuals. Here, we outline how we assess the biological validity of those embeddings separately and jointly.

To biologically validate our individual embedding, we assess whether G1 captures the heterogeneity of the cohort, including both population structure (ancestry axes) and confounding effects. Due to the orthogonality constraint, the embedding vectors of G1 can be considered independent and characterize different aspects of the individuals. We therefore test the statistical association between every embedding vector and every ancestry axis or confounder. In particular, the Kruskal–Wallis analysis of variance (a nonparametric ANOVA) is used for ancestry axes and continuous confounders (age, height, weight, BMI, and face size) and the chi-squared test is used for categorical confounders (sex and camera system). We apply BH correction for multiple testing.

To assess our SNP embedding, we cluster SNPs based on their embedding and use enrichment analysis to see if their space is functionally organized. As G2 is not orthogonal, we use k-means (with 10 random initializations) on G2 to subgroup SNPs into r2 = 99 clusters. We first check how well those SNP clusters coincide with the 19 430 protein-coding genes, given that genes are a natural summary of SNPs and biological processes are usually interpreted at the gene level. A gene is defined as enriched in an SNP cluster if SNPs located within 2 kb around this gene are present in this cluster significantly more often than in the background. A hypergeometric test is applied for the enrichment analysis and the BH procedure is used for multiple testing correction. To further check whether these clusters have biological relevance, we run another enrichment analysis for Gene Ontology (GO) terms. Because GO terms are gene annotations, we annotate every SNP with the GO terms of its mapped genes and then test the overrepresentation of GO terms in every SNP cluster. Only GO terms under the biological process category are used, and the BH procedure is applied for multiple testing correction.
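As an illustration of the over-representation test used here, the snippet below computes a hypergeometric P-value with SciPy; the counts are toy numbers.

```python
from scipy.stats import hypergeom

def gene_enrichment_pvalue(n_cluster_snps, n_cluster_gene_snps,
                           n_background_snps, n_gene_snps):
    """Hypergeometric (over-representation) test: probability of drawing at least
    `n_cluster_gene_snps` SNPs of a gene when sampling `n_cluster_snps` SNPs from a
    background of `n_background_snps`, of which `n_gene_snps` belong to the gene."""
    return hypergeom.sf(n_cluster_gene_snps - 1, n_background_snps,
                        n_gene_snps, n_cluster_snps)

# Toy numbers: 40 of a gene's 200 SNPs fall in a cluster of 5000 SNPs
# drawn from a background of 1,000,000 mappable SNPs.
print(gene_enrichment_pvalue(5000, 40, 1_000_000, 200))
```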

To assess our facial embedding, facial landmarks are subgrouped by Ward's hierarchical clustering on G3 into r3 = 60 clusters, which segments the facial shape. Ward's method has been shown to outperform other common linkage methods (Ferreira and Hitchcock 2009, Vijaya et al. 2019). We first compute the Ward distance between every pair of landmarks in G3, based on which we then construct a hierarchical tree and cut it at a height yielding 60 clusters. We adopt hierarchical clustering in order to compare with the hierarchical segmentation performed on the same dataset by White et al. (2021), who segmented the facial shape from global to local into five levels with 63 segments. To compare how faithfully the hierarchical tree of G3 and the hierarchical segmentation by White et al. capture the pairwise distances between landmarks, we compute their cophenetic correlation coefficients:

(16) $c = \dfrac{\sum_{i<j} (D_{ij} - \bar{D}) (Z_{ij} - \bar{Z})}{\sqrt{\sum_{i<j} (D_{ij} - \bar{D})^2 \sum_{i<j} (Z_{ij} - \bar{Z})^2}}$

where i and j are facial landmarks, D is the Euclidean distance matrix between landmarks, and Z is the cophenetic distance matrix between landmarks, whose entries are the heights at which two points are first merged in the dendrogram. D̄ and Z̄ are the means of D and Z, respectively. A cophenetic correlation close to 1 indicates a high-quality hierarchical clustering.
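A minimal sketch of this evaluation with SciPy is shown below; the toy embedding stands in for G3, and reusing the same pairwise distances for both the clustering and the cophenetic correlation is a simplifying assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
G3 = rng.random((200, 10))                        # toy landmark embedding standing in for G3

d = pdist(G3, metric="euclidean")                 # pairwise distances between landmarks
Z = linkage(d, method="ward")                     # Ward's hierarchical clustering
coph_corr, _ = cophenet(Z, d)                     # cophenetic correlation, Equation (16)
labels = fcluster(Z, t=60, criterion="maxclust")  # cut the tree into 60 segments
print(f"cophenetic correlation = {coph_corr:.3f}, clusters = {labels.max()}")
```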

To assess the association between the SNP and facial embeddings, we map them to the space of G1 and compute the cosine similarity between each SNP and each facial landmark (see Section 2.3). For the SNPs closest to facial landmarks in the joint space, we apply GREAT analysis (v.4.0.4), which infers the biological meaning of a set of SNPs via the annotations of nearby genes. Technically speaking, GREAT performs a binomial test on a set of SNPs to check whether the overlap between their associated genes and genes with a certain annotation is greater than expected by chance. To associate SNPs with genes, we apply the default and recommended settings (McLean et al. 2010), namely the "basal plus extension" rule with 5 kb upstream and 1 kb downstream plus a 1000 kb extension. Note that one SNP can be associated with multiple genes, and SNPs not associated with any gene are not included in this analysis.

3.2.3 Characterization of unconfounded population subgroups

The unconfounded population subgroups are characterized based on the projection of the genetic and facial embeddings to the space of the sample embedding, which enables computing the similarity between population subgroup centroids and both SNPs and facial landmarks. For each subgroup, we first select the top 0.1% of SNPs with the highest cosine similarity to its centroid in the joint space, to which GREAT analysis is applied to reveal the most relevant phenotypes [human phenotype ontology (HPO)]. The threshold of 0.1% is determined by balancing the number of genes selected per subgroup against the genomic coverage of genes selected across all subgroups (Supplementary Figs S1 and S2). We then visualize the cosine similarities of all facial landmarks to a subgroup centroid on the averaged face, in order to demonstrate how different areas of the face are associated with the corresponding subgroup.

4 Results

4.1 INMTD outperforms other methods on the synthetic data

Our evaluation on the synthetic data showed that INMTD, followed by SNF, achieved superior clustering performance over the other methods (Fig. 2). This result indicates that integrating multiple data types can help derive a more accurate clustering than using a single data type. However, siNMF and jNMF, two well-established methods for jointly decomposing multiple data matrices, did not perform well on the simulated data, presumably because they are not as specialized in clustering and tensor decomposition as INMTD. Moreover, the comparison between different implementations of k-means (only X12, only X134, or both X12 and X134) implies that simply concatenating features from multiple data views cannot effectively fuse the information and improve the clustering.

Figure 2. Clustering performance on the synthetic data. This plot shows the mean and standard deviation of ARIs of INMTD and the other methods over six runs with different random seeds. The methods are divided into three column panels: k-means and NMTF using only X12, k-means and NTD using only X134, and k-means, siNMF, jNMF, SNF, and INMTD using both X12 and X134.

4.2 INMTD generates biologically meaningful embeddings from a real-life multi-view dataset

We applied INMTD to a real-world facial-genomic cohort collected in the USA for unconfounded population subgrouping (White et al. 2021). This dataset consists of two data types, X12 ∈ ℝ+^(4680×7141882) for 7 141 882 SNPs in 4680 people and X134 ∈ ℝ+^(4680×7160×3) for the 3D coordinates of 7160 facial landmarks in the same 4680 individuals (White et al. 2021). A few confounders were collected as well: age, sex, height, weight, camera system, and face size. We also derived BMI from height and weight as a potential confounder. To assess the population structure of this cohort, White et al. computed four ancestry axes by projecting the genotypes onto a principal component space built from the 1000 Genomes Project data, in the manner of EIGENSTRAT (Price et al. 2006). We ran our INMTD model on this multi-view dataset for 1000 iterations with SVD initialization, and it converged in terms of the total relative error (Supplementary Fig. S3). Due to the heuristic solver for Equation (5), G1 is only approximately orthonormal. Therefore, we further checked the independence between the embedding vectors of G1 based on its covariance matrix, namely G1^T G1 (Supplementary Fig. S4).

Because the heterogeneity of a population can be largely described by population structure, age, sex, etc., we validated the information captured by G1 based on the statistical association between every column vector of G1 and every ancestry axis or confounder (Fig. 3). The results indicate that most vectors captured information on population structure while also being confounded. This is expected as, for instance, height has been reported to be strongly related to different European ancestries (Cavelaars et al. 2000). Furthermore, the 48 embedding vectors have different association patterns with the ancestry axes and confounders. For instance, the first vector is significantly associated with most ancestry axes as well as confounders, while no association with the sixth vector is observed. The observation that height, weight, and face size are significantly associated with similar embedding vectors is in line with the close relationships among these factors. The results also imply that sex is not a main driver of the heterogeneity within the data (only two embedding vectors have a significant association with sex), which may be due to the exclusion of sex chromosomes in this study.

Figure 3. Heatmap of P-values for the linear F-test between every G1 embedding vector and every confounder and ancestry axis. The X-axis shows the explanatory variables, including confounders and ancestry axes, and the Y-axis shows the 48 column vectors of the embedding matrix G1. Non-significant P-values (larger than 0.05 after BH correction) were removed from the plot.

To assess the biological relevance of the SNP embedding from G2, we applied k-means on G2 to derive 99 clusters of SNPs. We first mapped every SNP to genes if it falls within a gene or within 2 kb of it, and ignored SNPs that do not map to any gene. Ninety-seven out of 99 SNP clusters have at least one overrepresented gene (P-value < 0.05 in a hypergeometric test with BH correction for multiple testing) with respect to the background of all SNPs that can be mapped to a gene. 98.6% of genes were enriched in at least one SNP cluster, showing the coverage of the SNP clusters, while most genes were enriched in only a few clusters, showing the specificity of the clusters (Fig. 4). We then applied GO enrichment analysis for each cluster after assigning the GO annotations of a gene to all of its SNPs. 96.0% of the G2 clusters significantly overrepresented at least one GO term, and 99.6% of GO terms were enriched in at least one cluster (Supplementary Fig. S5). This result validated that the SNP clusters obtained from G2 have both genomic specificity and biological relevance.

Figure 4. Histogram showing the number of genes (y-axis) that are enriched in a given number of G2 clusters (x-axis). An enrichment analysis is done between every gene and every G2 (SNP) cluster. We then count how many genes (y-axis) are enriched in a given number of SNP clusters (x-axis).

To assess the quality of the facial embedding from G3, we aimed to obtain a hierarchical segmentation on G3 and compare it with the work of White et al. (2021). Ward's hierarchical clustering yielded 60 facial segments (Supplementary Fig. S6), with a cophenetic correlation coefficient of 0.638, which is higher than that of the facial segmentation by White et al. on the same images (0.414). This suggests that the hierarchical segmentation from embedding G3 better groups together facial landmarks that are close in 3D space than the one by White et al., defining more spatially coherent "patches." The visualization of these 60 clusters on an averaged face also showed that the landmarks within each cluster are spatially close to each other and that the clustering is morphologically meaningful (Fig. 5).

Figure 5. The 60 clusters derived from G3, illustrated on the mean face shape of all individuals. Not every cluster has a unique color due to the large number of clusters, but neighboring clusters are distinguished by different colors. Every face shape is symmetric along the X-axis (left/right), and so is the facial segmentation.

To further investigate the relationships between genetics and facial morphology, we mapped both G2 and G3 to the space of G1 (Supplementary Fig. S7). For each facial landmark in the joint space, we selected the closest SNP in terms of cosine similarity, resulting in 905 unique SNPs in total. To assess which biological traits are associated with the chosen SNPs, we conducted a GREAT analysis on their neighboring genes (McLean et al. 2010), which found 17 significantly enriched HPO terms (Gargano et al. 2024) based on the adjusted binomial P-values (Fig. 6). Most of the enriched terms are highly linked to facial morphology (especially the eyes), limb or spine morphology, and embryonic development, indicating a close biological relationship between the embeddings for SNPs and facial landmarks, namely G2 and G3. Other enriched terms suggested a strong relatedness between facial morphology and myotonia. The results are in line with a previously published genome-wide association study (GWAS) on the same facial-genomic data (White et al. 2021) and indicate that INMTD allows for uncovering biologically relevant associations between SNPs and facial landmarks.

Figure 6. Dotplot of 17 significantly enriched HPO terms from 905 SNPs that are closest to facial landmarks in the joint space. GREAT analysis first mapped the 905 SNPs to 1156 genes based on the default and recommended settings, and then ran a binomial test for the enrichment analysis with BH correction. X-axis shows the ratio between the number of observed genes and the number of genes annotated for each HPO term. The dot color indicates the significance of the adjusted binomial P-value in the form of −log10.

4.3 INMTD finds unconfounded population subgroups characterized by their genetic and facial information

To derive the optimal population subgroups, we applied k-means on G1 with different numbers of clusters and found that the clustering with 48 clusters achieved the highest Silhouette score (0.169) (Supplementary Fig. S8). As mentioned before, many vectors of G1 are significantly confounded, potentially disturbing the correct interpretation and characterization of the derived subgroups. To deal with the confounding effect, we removed the 28 columns of G1 that are significantly associated with any confounder and clustered individuals based on the remaining 20 embedding vectors (Supplementary Fig. S9). We then applied k-means on the unconfounded G1 with different numbers of clusters, and the clustering with 20 clusters achieved the highest Silhouette score (0.329) (Supplementary Fig. S10). This score is statistically higher (empirical P-value = 0.02 from 1000 repetitions, Supplementary Fig. S11) than the scores of clusterings with the same number of clusters built from 20 randomly sampled G1 vectors, indicating better intrinsic validity of the unconfounded clustering.

To validate that our reduced space is unconfounded and leads to a better capture of the population structure, we assessed the statistical association between the derived clusterings and the ancestry axes and confounders. The Kruskal–Wallis test (a nonparametric ANOVA) showed that both the original and the unconfounded clusterings are significantly associated with all ancestry axes (P-value < 0.05), suggesting their relationship with population structure (Table 1). Yet the original clustering also has significant associations with most confounders, especially age and camera system, while the unconfounded clustering has no significant association with any confounder, validating our deconfounding strategy. The effective reduction of the influence of camera system, which resembles a batch effect, also indicates the strength of the confounder correction.

Table 1. Adjusted P-values of the statistical tests between the clustering of G1 and every ancestry axis and confounder.

Variable        | Statistical test | Adj. P-value (original) | Adj. P-value (unconfounded)
Ancestry axis 1 | Kruskal–Wallis   | 7.24e−40                | 2.99e−9
Ancestry axis 2 | Kruskal–Wallis   | 1.04e−11                | 2.29e−7
Ancestry axis 3 | Kruskal–Wallis   | 6.07e−11                | 2.69e−3
Ancestry axis 4 | Kruskal–Wallis   | 8.11e−22                | 8.16e−11
Age             | Kruskal–Wallis   | 9.36e−8                 | 0.888
Sex             | Chi-squared      | 0.118                   | 0.899
Height          | Kruskal–Wallis   | 8.38e−2                 | 0.888
Weight          | Kruskal–Wallis   | 5.51e−2                 | 0.951
BMI             | Kruskal–Wallis   | 2.19e−3                 | 0.899
Camera system   | Chi-squared      | 3.01e−78                | 0.899
Face size       | Kruskal–Wallis   | 3.72e−2                 | 0.888

The Kruskal–Wallis test was used for continuous variables and the chi-squared test for categorical variables. All P-values have been corrected for multiple testing via the Benjamini–Hochberg (BH) procedure. Column 3 contains adjusted P-values from the original clustering based on all vectors in G1, and Column 4 from the unconfounded clustering. Adjusted P-values lower than 0.05 (threshold for significance) are in bold.

To investigate whether the subgroups derived from the unconfounded G1 clustering capture the population structure well, we adopted 3519 European ancestry informative markers (EuroAIMs) identified by Tian et al. (2009), which are SNPs capable of distinguishing European subpopulations. We first mapped the SNP embedding G2 to the space of G1 and then, in the joint space, selected the 3519 SNPs with the highest cosine similarities to the centroids of the derived subgroups, which drive the clustering of individuals. We selected the same number of SNPs as there are EuroAIMs for better comparison. Because only 13 EuroAIMs were originally included in our dataset, we looked at the gene level and found that 856 of the 3519 selected SNPs are located in the same genes as the EuroAIMs. A hypergeometric test showed that this fraction is significantly (P-value = 2.15e−63) higher than expected from a random selection of all SNPs in our dataset, indicating that the genetic basis of the derived subgrouping is significantly associated with the European population structure. We also checked the subgroups derived from the original G1 clustering and found only 40 of the 3519 selected SNPs located in EuroAIM genes. The P-value of 1 from the hypergeometric test implies that this fraction is not higher than expected from a random selection. This result further shows that our deconfounding strategy has successfully led to population subgroups that better highlight the population structure.

After showing the validity of our unconfounded population subgrouping, we focused on the characterization of the two subgroups most associated with the four ancestry axes (Supplementary Table S1), namely subgroups 2 and 7. The top 0.1% of SNPs selected for population subgroup 2 were significantly enriched in over 500 HPO terms, with the top 20 terms clearly related to skeletal morphology or bone formation (Fig. 7A). This is in line with the highlighted areas on the face (Fig. 7B), e.g. the nasal bone, the zygomatic bone, the maxilla, and most of the mandible. This result characterizes population subgroup 2 by the facial skeleton. Note that the frontal bone does not show up as much as the facial bones, implying the different morphology of the cranial and facial skeleton (Anderson et al. 2024). Some other enriched HPO terms involve anemia, telangiectasia, and neutrophils, which are related to blood. In contrast, the genetic representation of population subgroup 7 yielded 72 significantly associated HPO terms in total, and many of the top 20 terms are strongly involved in kidney function (Fig. 7C). Meanwhile, the eye area was remarkably highlighted on the mean face (Fig. 7D), which supports the common embryogenic stage of eyes and kidneys and the reported relationship between eye and kidney diseases (Bodaghi et al. 2014). Therefore, population subgroup 7 is likely characterized by eye morphology related to kidney function. Other enriched terms indicate the involvement of kidney function in diabetes, hair development, tooth development, etc.

Figure 7. Genetic and facial representation for population subgroup 2 (top, A and B) and 7 (bottom, C and D). The genetic representation was obtained via GREAT analysis (left) for the top 20 HPO terms enriched in the top 0.1% SNPs highlighted for population subgroup 2 (A) and 7 (C). We used the default and recommended settings of GREAT analysis with binomial test and BH correction. For the facial representation, the cosine similarity between each facial landmark and the centroid of subgroup 2 (B) and 7 (D) separately was plotted on the mean face.

5 Discussion and conclusion

In this study, we proposed INMTD, a framework that integrates both 2D matrices and 3D tensors for unconfounded clustering. The simulation showed the superior clustering performance of INMTD over other competing methods on synthetic data with known clusters. We then applied it to a real-life facial-genomic dataset to evaluate its performance and find an unconfounded subgrouping for European population structure. INMTD discovered 20 unconfounded population subgroups with their representative genetic and facial characteristics, providing the potential for more precise healthcare toward each subpopulation.

This work applies INMTD to facial and genomic data, which reflect a large fraction of the population structure. A comprehensive subgrouping of population structure has the potential to facilitate precision medicine where individuals in each population subgroup may receive tailored medical decisions based on their intrinsic characteristics (Saria and Goldenberg 2015, Loh et al. 2019). For instance, the characterization of subgroup 2 indicates that variations in facial skeletal traits may serve as proxies for underlying genetic diversity within populations, and special medical treatment may be given to population subgroups with distinct skeletal development. On the other hand, the kidney-related phenotypes enriched for subgroup 7 may bring new insights into the embryological connection between eye and kidney functions. Understanding such connections could help explain the genetic diversity of eye morphology in different European subpopulations, improving both population health and tailored medical interventions.

While classic matrix or tensor decomposition models only focus on a single dataset and current integrative matrix factorization models cannot deal with higher-dimensional data structures, INMTD is able to jointly decompose both matrix and tensor data. Another key feature of INMTD is its orthogonality constraint on G1 for better clustering and interpretation. An orthogonal matrix has all its vectors mutually orthogonal, and therefore every vector can be investigated specifically for its characteristics. We further normalized the vectors of G1 to make them orthonormal, naturally resembling a cluster indicator matrix. Optionally, orthogonality can be imposed on any other embedding if needed.

As an extension of the joint NMF model, INMTD has the potential to predict facial images from genotypes, or vice versa, for new samples, in a similar fashion to Akata et al. (2011). In the former case, given the genotypes x12 of an individual that has not been seen by the model, we first solve min_{g1 ≥ 0} J = ‖x12 − g1 S12 G2^T‖_F^2 for g1 and then predict its facial image as x̂134 = S134 ×1 g1 ×3 G3 ×4 G4. Alternatively, we could find the image in the cohort closest to x̂134 according to cosine similarity, or the image whose corresponding embedding vector in G1 space is closest to g1. In addition, a new sample can be classified into one of the derived subgroups by assigning g1 to its closest cluster centroid.
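A sketch of this prediction idea is given below; it uses nonnegative least squares (NNLS) as a stand-in for the constrained solve of g1 and assumes the embedding and core shapes defined in Section 2, so it illustrates the idea rather than the exact INMTD implementation.

```python
import numpy as np
from scipy.optimize import nnls

def predict_face_from_genotype(x12, S12, G2, S134, G3, G4):
    """Solve a nonnegative least-squares problem for the new individual's
    embedding g1, then reconstruct the facial landmarks via mode products
    with the core tensor. NNLS stands in for the constrained solver; the
    actual INMTD code may solve for g1 differently."""
    A = (S12 @ G2.T).T                   # p2 x r1 design matrix
    g1, _ = nnls(A, x12)                 # nonnegative g1 of length r1
    # x134_hat = S134 x1 g1 x3 G3 x4 G4
    x134_hat = np.einsum("abc,a,jb,kc->jk", S134, g1, G3, G4)
    return g1, x134_hat                  # predicted p3 x p4 landmark coordinates
```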

Even though INMTD was illustrated on a facial-genomic dataset for population subgrouping, it is not restricted to facial images and genomic data. INMTD can be applied to any 2D and 3D datasets for joint clustering, as long as they are nonnegative or can be converted to nonnegative values, e.g. transcriptomics or epigenomics as 2D matrices and CT scans or time series data as 3D tensors. Also, the rapidly emerging field of spatial transcriptomics inherently involves multi-view data of different dimensions, such as gene expression, spatial coordinates of cells, and histological images; we expect integrative studies in this field to benefit from INMTD. Note that it is also possible to further extend INMTD to handle more than two data types or data of higher dimensions, e.g. a moving 3D image (4D), by adding extra embeddings to the model and objective function. In the future, we plan to apply INMTD to patient datasets with more data views, e.g. TCGA, for improved disease subtyping, and to reveal novel patient subgroups with meaningful characteristics for more precise healthcare decisions.

There are two main limitations of INMTD to address in future work. The first is that it determines the rank of each embedding with a rule of thumb because of the extremely large scale of the genomic dataset. Nevertheless, in most cases the data dimensionality will be feasible for the INMTD model on a modern computer or computing server, and the user can choose the optimal ranks via cross-validation or other non-heuristic methods. The other potential deficiency comes from the post hoc confounder correction, which removes confounded vectors in a fairly strict manner, with the risk of over-correcting. A less drastic strategy could be to add a regularization term to the objective function that iteratively minimizes the confounding effect in G1, which we leave for future work.

In conclusion, with the surge of biological data in diverse formats and the growing demand for personalized medicine with Big Data, INMTD is envisaged to become an essential tool for integrating multi-view datasets of varying dimensions, enabling meaningful and unconfounded clustering. We are confident that INMTD has the potential for widespread adoption due to its strong performance and ease of interpretation.

Acknowledgements

We want to acknowledge the European Research Council (ERC) Consolidator grant No. 770827, the Spanish State Research Agency and the Ministry of Science and Innovation MCIN grant PID2022-141920NB-I00/AEI/10.13039/501100011033/FEDER, UE, and the Department of Research and Universities of the Generalitat de Catalunya code 2021 SGR 01536.

Author contributions

Zuqi Li (Conceptualization [equal], Formal analysis [lead], Investigation [lead], Methodology [lead], Writing—original draft [lead]), Sam F.L. Windels (Conceptualization [equal], Methodology [supporting], Writing—review & editing [equal]), Noël Malod-Dognin (Methodology [supporting], Writing—review & editing [equal]), Seth M. Weinberg (Resources [equal], Writing—review & editing [equal]), Mary L. Marazita (Resources [equal], Writing—review & editing [equal]), Susan Walsh (Resources [equal], Writing—review & editing [equal]), Mark D. Shriver (Resources [equal], Writing—review & editing [equal]), David W. Fardo (Funding acquisition [supporting], Supervision [supporting], Writing—review & editing [equal]), Peter Claes (Resources [supporting], Supervision [equal], Writing—review & editing [equal]), Nataša Pržulj (Conceptualization [equal], Project administration [supporting], Supervision [equal], Writing—review & editing [equal]), and Kristel Van Steen (Conceptualization [equal], Funding acquisition [lead], Methodology [supporting], Project administration [lead], Supervision [equal], Writing—review & editing [equal])

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest: None declared.

Funding

This work was supported by the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreements No. 813533 (MLFPM) and No. 860895 (TranSYS).

Data availability

For the 3D Facial Norms dataset, genotypic markers are available to the research community through the dbGaP controlled-access repository (http://www.ncbi.nlm.nih.gov/gap) at accession #phs000929.v1.p1. The raw source data for the phenotypes—the 3D facial surface models in .obj format—are available through the FaceBase Consortium (https://www.facebase.org) at accession #FB00000491.01. Access to these 3D facial surface models requires proper institutional ethics approval and approval from the FaceBase data access committee. The PSU and IUPUI datasets were not collected with broad data sharing consent. Code for INMTD is publicly available on GitHub: https://github.com/ZuqiLi/INMTD.

References

Akata Z, Thurau C, Bauckhage C. Non-negative matrix factorization in multimodality data for segmentation and label prediction. In: 16th Computer Vision Winter Workshop, Mitterberg, Austria, 2011.

Anderson BW, Kortz MW, Black AC et al. Anatomy, head and neck, skull. In: Kharazi KA (ed.), StatPearls. Treasure Island (FL): StatPearls Publishing, 2024.

Antonelli L, Guarracino MR, Maddalena L et al. Integrating imaging and omics data: a review. Biomed Signal Process Control 2019;52:264–80.

Badea L. Extracting gene expression profiles common to colon and pancreatic adenocarcinoma using simultaneous nonnegative matrix factorization. Pac Symp Biocomput 2008;2008:267–78.

Bodaghi B, Massamba N, Izzedine H. The eye: a window on kidney diseases. Clin Kidney J 2014;7:337–8.

Broadbent C, Song T, Kuang R. Deciphering high-order structures in spatial transcriptomes with graph-guided Tucker decomposition. Bioinformatics 2024;40:i529–38.

Cavelaars AE, Kunst AE, Geurts JJ et al. Persistent variations in average height between countries and between socio-economic groups: an overview of 10 European countries. Ann Hum Biol 2000;27:407–21.

Chalise P, Fridley BL. Integrative clustering of multi-level 'omic data based on non-negative matrix factorization algorithm. Peddada SD (ed.). PLoS One 2017;12:e0176278.

Chauvel C, Novoloaca A, Veyre P et al. Evaluation of integrative clustering methods for the analysis of multi-omics data. Brief Bioinform 2020;21:541–52.

Chen L, Zeng H, Xiang Y et al. Histopathological images and multi-omics integration predict molecular characteristics and survival in lung adenocarcinoma. Front Cell Dev Biol 2021;9:720110.

Chen X, Huang JZ, Wu Q et al. Subspace weighting co-clustering of gene expression data. IEEE/ACM Trans Comput Biol Bioinform 2019;16:352–64.

Chiu AM, Mitra M, Boymoushakian L et al. Integrative analysis of the inter-tumoral heterogeneity of triple-negative breast cancer. Sci Rep 2018;8:11807.

Ding C, Li T, Peng W et al. Orthogonal nonnegative matrix t-factorizations for clustering. In: Ungar L (ed.), Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Philadelphia, PA, USA: ACM, 2006, 126–35.

Dissez G, Ceddia G, Pinoli P et al. Drug repositioning predictions by non-negative matrix tri-factorization of integrated association data. In: Shi X, Buck M (eds), Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Niagara Falls, NY, USA: ACM, 2019, 25–33.

Ferreira L, Hitchcock DB. A comparison of hierarchical methods for clustering functional data. Commun Stat Simul Comput 2009;38:1925–49.

Gargano MA, Matentzoglu N, Coleman B et al. The human phenotype ontology in 2024: phenotypes around the world. Nucleic Acids Res 2024;52:D1333–46.

Ghosal A, Nandy A, Das AK et al. A short review on different clustering techniques and their applications. In: Mandal JK, Bhattacharya D (eds), Emerging Technology in Modelling and Graphics. Singapore: Springer Singapore, 2020, 69–83.

He X, Gumbsch T, Roqueiro D et al. Kernel conditional clustering and kernel conditional semi-supervised learning. Knowl Inf Syst 2020;62:899–925.

Hériché J-K, Alexander S, Ellenberg J. Integrating imaging and omics: computational methods and challenges. Annu Rev Biomed Data Sci 2019;2:175–97.

Khan A, Maji P. Low-rank joint subspace construction for cancer subtype discovery. IEEE/ACM Trans Comput Biol Bioinform 2020;17:1290–302.

Kim Y-D, Choi S. Nonnegative Tucker decomposition. In: Kanade T, Medioni G (eds), 2007 IEEE Conference on Computer Vision and Pattern Recognition. Minneapolis, MN, USA: IEEE, 2007, 1–8.

Kodinariya TM, Makwana PR. Review on determining number of cluster in K-means clustering. Int J Adv Res Comput Sci Manag Stud 2013;1:90–5.

Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature 1999;401:788–91.

Li B, Zhou G, Cichocki A. Two efficient algorithms for approximately orthogonal nonnegative matrix factorization. IEEE Signal Process Lett 2015;22:843–6.

Liu J, Brodley CE, Healy BC et al. Removing confounding factors via constraint-based clustering: an application to finding homogeneous groups of multiple sclerosis patients. Artif Intell Med 2015;65:79–88.

Liu Y, Gu Q, Hou JP et al. A network-assisted co-clustering algorithm to discover cancer subtypes based on gene expression. BMC Bioinformatics 2014;15:37.

Loh W-Y, Cao L, Zhou P. Subgroup identification for precision medicine: a comparative review of 13 methods. WIREs Data Min Knowl Discov 2019;9:e1326.

Luo J, Liu Y, Liu P et al. Data integration using tensor decomposition for the prediction of miRNA-disease associations. IEEE J Biomed Health Inform 2022;26:2370–8.

Malod-Dognin N, Petschnigg J, Windels SFL et al. Towards a data-integrated cell. Nat Commun 2019;10:805.

McLean CY, Bristor D, Hiller M et al. GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol 2010;28:495–501.

Pourhoseingholi MA, Baghestani AR, Vahedi M. How to control confounding effects by statistical analysis. Gastroenterol Hepatol Bed Bench 2012;5:79–83.

Price AL, Patterson NJ, Plenge RM et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 2006;38:904–9.

Rappoport N, Shamir R. Multi-omic and multi-view clustering algorithms: review and cancer benchmark. Nucleic Acids Res 2018;46:10546–62.

Saria S, Goldenberg A. Subtyping: what it is and its role in precision medicine. IEEE Intell Syst 2015;30:70–5.

Schwarz M, Geryk J, Havlovicová M et al. Body mass index is an overlooked confounding factor in existing clustering studies of 3D facial scans of children with autism spectrum disorder. Sci Rep 2024;14:9873.

Tian C, Kosoy R, Nassir R et al. European population genetic substructure: further definition of ancestry informative markers for distinguishing among diverse European ethnic groups. Mol Med 2009;15:371–83.

Vijaya, Sharma S, Batra N. Comparative study of single linkage, complete linkage, and ward method of agglomerative clustering. In: Yadav D (ed.), 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon). Faridabad, India: IEEE, 2019, 568–73.

Wang B, Mezlini AM, Demir F et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods 2014;11:333–7.

Wang R-S, Maron BA, Loscalzo J. Multiomics network medicine approaches to precision medicine and therapeutics in cardiovascular diseases. Arterioscler Thromb Vasc Biol 2023;43:493–503.

Wang X, Sun Z, Zhang Y et al. BREM-SC: a Bayesian random effects mixture model for joint clustering single cell multi-omics data. Nucleic Acids Res 2020;48:5814–24.

White JD, Indencleef K, Naqvi S et al. Insights into the genetic architecture of the human face. Nat Genet 2021;53:45–53.

Yun Y, Xia W, Zhang Y et al. Self-representation and class-specificity distribution based multi-view clustering. Neurocomputing 2021;437:9–20.

Zhang S, Liu C-C, Li W et al. Discovery of multi-dimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Res 2012;40:9379–91.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Associate Editor: Jonathan Wren