-
PDF
- Split View
-
Views
-
Cite
Cite
Zuqi Li, Sam F L Windels, Noël Malod-Dognin, Seth M Weinberg, Mary L Marazita, Susan Walsh, Mark D Shriver, David W Fardo, Peter Claes, Nataša Pržulj, Kristel Van Steen, Clustering individuals using INMTD: a novel versatile multi-view embedding framework integrating omics and imaging data, Bioinformatics, Volume 41, Issue 4, April 2025, btaf122, https://doi.org/10.1093/bioinformatics/btaf122
- Share Icon Share
Abstract
Combining omics and images can lead to a more comprehensive clustering of individuals than classic single-view approaches. Among the various approaches for multi-view clustering, nonnegative matrix tri-factorization (NMTF) and nonnegative Tucker decomposition (NTD) are advantageous in learning low-rank embeddings with promising interpretability. Besides, there is a need to handle unwanted drivers of clusterings (i.e. confounders).
In this work, we introduce a novel multi-view clustering method based on NMTF and NTD, named INMTD, which integrates omics and 3D imaging data to derive unconfounded subgroups of individuals. According to the adjusted Rand index, INMTD outperformed other clustering methods on a synthetic dataset with known clusters. In the application to real-life facial-genomic data, INMTD generated biologically relevant embeddings for individuals, genetics, and facial morphology. By removing confounded embedding vectors, we derived an unconfounded clustering with better internal and external quality; the genetic and facial annotations of each derived subgroup highlighted distinctive characteristics. In conclusion, INMTD can effectively integrate omics data and 3D images for unconfounded clustering with biologically meaningful interpretation.
INMTD is freely available at https://github.com/ZuqiLi/INMTD.
1 Introduction
Clustering is a crucial technique in data analysis, enabling the identification of intrinsic structures within complex datasets by grouping similar data points. In the field of medicine, clustering has been widely used for uncovering disease subtypes, tailoring personalized treatments, and improving early diagnosis (Ghosal et al. 2020). As data complexity grows, clustering methods based on a single view or single data source are often insufficient in finding meaningful clusters, necessitating the development of more sophisticated approaches. Multi-view clustering has emerged as a powerful solution, leveraging multiple data perspectives to enhance clustering quality and reveal richer patterns than single-view methods (Rappoport and Shamir 2018, Chauvel et al. 2020). As the data views commonly used in biomedical science to describe an individual, omics and imaging data have shown essential advantages in understanding various biological phenomena (Antonelli et al. 2019). For instance, Chen et al. (2021) obtained better prediction for subtypes of lung adenocarcinoma by integrating extracted features from histopathological images and omics data than using a single view. However, there is a scarcity of research published on clustering individuals based on omics and imaging data. Moreover, processing images normally requires tensor methods due to their 3D format (Hériché et al. 2019), e.g. a color image consists of pixels represented by height, width, and color channel, and a 3D mesh consists of X, Y, Z coordinates for height, weight, and depth.
Various multi-view clustering methods have been developed, which can be generally classified into three categories, based on the relationship between data integration and clustering (Rappoport and Shamir 2018): (i) early integration combines all datasets into a single one before building the model for clustering, (ii) intermediate integration clusters a joint embedding learnt from all views, and (iii) late integration computes a clustering from each dataset and then merges all clusterings together. Out of the three, intermediate integration approaches have shown superior performance in many applications possibly because they require a model specifically designed for multi-view clustering tasks (Khan and Maji 2020, Wang et al. 2020, Yun et al. 2021). By clustering subjects with multi-view data from a joint embedding, integrative nonnegative matrix factorization (intNMF) proposed by Chalise and Fridley (2017) has found similar cancer subtypes identified by previous studies. This embedding represents the subjects with patterns that are naturally additive and hence easily interpretable (Lee and Seung 1999). A well-established extension of NMF model for better interpretation and clustering is nonnegative matrix tri-factorization (NMTF) that decomposes the input dataset into three smaller matrices (Ding et al. 2006). However, NMTF models only work with 2D matrices and cannot deal with data views of higher dimensions, e.g. a 3D tensor, which is the common data format for imaging or spatial data. A generalization of NMTF to tensors is the nonnegative Tucker decomposition (NTD) (Kim and Choi 2007). It decomposes the original input tensor into a core tensor with the same number of dimensions and one embedding matrix corresponding to each of its dimensions. There have been attempts to integrate multiple cross-linked 2D matrices into a tensor, which, however, does not work on originally 3D data (Luo et al. 2022). Another work by Broadbent et al. (2024) combined a tensor with a similarity matrix, while they treated the matrix as a graph regularization to the Tucker decomposition instead of a separate data view.
Clustering real-world data is often complicated by the presence of confounders—factors that influence the observed data in an unwanted way (Liu et al. 2015). Confounders can obscure the true clustering structure, leading to spurious results (Schwarz et al. 2024). Addressing confounders is essential for accurate clustering, as their effects can mask the genuine patterns within the data. Effective clustering methods must account for these confounding variables to uncover the true underlying structure. A widely adopted approach is to regress out the confounding effects from every feature during pre-processing, but it comes with the potential loss of useful signals prior to the modeling (Pourhoseingholi et al. 2012). Some other strategies include kernel conditional clustering which computes the final clustering conditioned on confounders (He et al. 2020). However, the conditional clustering is computationally expensive and cannot efficiently work on high-dimensional data.
In this work, we propose a novel multi-view clustering method, Integrative Non-negative Matrix and Tensor Decomposition (INMTD), which obtains unconfounded clustering by jointly decomposing 2D and 3D datasets with NMTF and NTD. It learns an embedding matrix for each data dimension and subgroups the individuals from their embedding after removing vectors in the embedding space that are linked with confounders. Because the true cluster structure of real-life patient dataset is often unknown, we evaluated INMTD first on a synthetic dataset with known clusters, and then on a US cohort from healthy individuals (White et al. 2021), whose heterogeneity mainly comes from the population structure with confounders including age, sex, etc. Combining 2D genotypes and 3D facial morphology, our model computed biologically meaningful embeddings and connected well the facial and genetic embeddings. Furthermore, INMTD derived an unconfounded clustering of individuals with better intrinsic quality and clearer association with population structure than the original clustering. We also characterized each population subgroup with their enriched genetic pathways and highlighted facial areas.
2 Materials and methods
2.1 INMTD: integrative nonnegative matrix and tensor decomposition with correction for confounders
INMTD unifies NMTF and NTD to cluster subjects with multi-view data of 2D and 3D structure. We assume subjects described by two data views, a 2D matrix of features and a 3D tensor of features in the second dimension and features in the third dimension, both nonnegative. The aim of our method is to jointly compute the embedding matrices for each dimension and cluster the subjects based on its own embedding (Fig. 1).

Overview of INMTD model for integrating 2D and 3D data. The 2D matrix is decomposed into two embedding matrices and and a core matrix . The 3D tensor is decomposed into three embedding matrices , , and and a core tensor . INMTD integrates and by jointly optimizing , which is shared by the two views. The orthogonality constraint on ensures disentanglement of embedding vectors.
This optimization problem with nonnegativity and orthogonality constraints can be solved via Lagrangian multipliers and Karush–Kuhn–Tucker conditions (Ding et al. (2006) have shown detailed mathematical proof for solving these constraints in their orthogonal nonnegative matrix t-factorization model). Because Equation (5) has no analytic solution for , , and , we iteratively compute their values via the multiplicative update rules (Dissez et al. 2019) (see Section 2.2). Furthermore, in each iteration, we normalize every column of after updating to further guarantee unit vectors and eliminate the scale indeterminacy as suggested by Li et al. (2015).
We apply -means clustering (with 10 random initializations) on the joint embedding, , and select the best number of clusters based on Silhouette score. Silhouette score is a classic internal metric (no ground truth required) that measures how well-separated and cohesive clusters are in a dataset. It evaluates both intra-cluster cohesion, namely how close data points within a cluster are to each other, and inter-cluster separation, namely how far each cluster is from other clusters. A higher value (close to 1) suggests a more valid clustering.
To assess how much is confounded by a set of known confounders, , we conduct a linear model F-test between every confounder and every embedding vector (i.e. matrix column) of [with Benjamini–Hochberg (BH) correction for multiple testing]. The unconfounded clustering is then recomputed from the columns of that have no significant association with any confounders. This is done also by -means and Silhouette score for the optimal number of clusters. The removal of confounded embedding components is based on the additive nature of NMF-based methods that the data is represented by the sum of all its embedding aspects (Lee and Seung 1999). Deconfounding at the embedding level is computationally less expensive and can better preserve meaningful information than the widely used approach, which is to regress out confounders from each feature at the input level.
2.2 Training procedure of INMTD
We initialize via singular value decomposition (SVD), which has shown better results than random initialization (Malod-Dognin et al. 2019). For , the original matrix is decomposed by SVD and and are derived from the left and right matrices of SVD, respectively. Because is 3D, we run SVD for every slice along the channels and average all the right matrices to compute the initial and similarly run SVD for every slice along the features to initialize . As the update rules are multiplicative, to avoid entries in to remain 0, we add an infinitesimal number (1e−5) so that these entries can be updated.
The initialized and are automatically nonnegative because all the multipliers are nonnegative.
The total computational complexity of running INMTD is , where means the number of iterations. We refer to the Supplementary Materials for more details about the computational complexity of the entire procedure.
The total relative error computes the fraction of the reconstruction errors of the two datasets and in their L2 norms. It is a nonnegative value and a lower total relative error indicates better reconstruction from the decomposed elements.
2.3 Association between embeddings
3 Experiments
3.1 Evaluation with synthetic data
To evaluate the clustering performance of INMTD under controlled conditions and compare it with other well-established methods, we simulate a multi-view dataset of 1000 samples with known clusters. All samples are described by two data types of different dimensions, resulting in a 2D matrix of 1000 samples and 200 features and a 3D tensor of 1000 samples, 50 first-dimensional features and 10 second-dimensional features. To mimic real-world scenarios, e.g. gene expression data where certain set of genes may be more expressed in one group of people than another (Liu et al. 2014, Chen et al. 2019), features of each data view also belong to distinct subgroups, i.e. and have co-cluster or block structure. To be specific, five clusters are simulated for the 1000 samples (as ground-truth structure), 10 clusters for the 200 features of , four clusters for the 50 first-dimensional features of and two clusters for the 10 second-dimensional features of . We perturb and by adding random noise onto their features to further imitate real-life data. More details about the simulation settings are available in Supplementary Materials.
We employ INMTD with SVD initialization and 200 max iterations on the synthetic data and measure its clustering capability based on the adjusted Rand index (ARI). ARI computes the similarity between two data clusterings, and hence, the accuracy of the derived clustering to the ground-truth clustering. An ARI of around 0 indicates random labeling and the closer it is to 1, the better match. We also employ a few other widely used clustering approaches to compare their ARIs with INMTD’s, including (i) -means, (ii) NMTF, (iii) NTD, (iv) simultaneous NMF (Badea 2008) (siNMF), (v) joint NMF (Zhang et al. 2012) (jNMF), and (vi) similarity network fusion (SNF) (Wang et al. 2014). Note that the first three methods are single-view while the last three are multi-view methods. -means is one of the most popular clustering algorithms and has close relationship with matrix factorization. We choose NMTF and NTD for the comparison because they are the building blocks of INMTD. siNMF and jNMF are both multi-view clustering methods based on NMF. SNF, on the other hand, is a network-based multi-view clustering approach, widely used in cancer studies (Chiu et al. 2018) and precision medicine (Wang et al. 2023).
For the sake of robustness, we simulate the synthetic data for six runs with different random seeds and compute the ARI of INMTD and the other approaches in each run. More details about the method implementation, including choices of hyperparameters, are available in Supplementary Materials.
3.2 Evaluation with real-life data
3.2.1 Evaluation dataset
We apply INMTD to a multi-view dataset of 4680 healthy people with European ancestry, characterized by a 2D matrix and a 3D tensor (White et al. 2021). We select this cohort dataset because very few factors are expected to affect its structure, and therefore, it suits well to evaluate the clustering and deconfounding performance of INMTD. These people were recruited from three independent studies in the USA, 3D Facial Norms cohort (PITT), Pennsylvania State University (PSU), and Indiana University-Purdue University Indianapolis (IUPUI). Every individual was described by 7 141 882 single-nucleotide polymorphisms (SNPs, encoded by 0,1,2 for the number of minor alleles) and a 3D mesh image which contains the X, Y, Z coordinates of 7160 landmarks, namely and , respectively. Due to the enormous number of SNPs the SVD initialization on had to adopt randomized SVD for feasibility. The initialization on still used full SVD. To standardize genomic and facial data, we subtracted the mean from each view and divided every entry by the maximum of each view. We then took the absolute values to ensure nonnegativity and focus on the deviation of each individual from the population mean. The rank of embedding was determined via the rule of thumb: where is the number of individuals, thus . was chosen in a similar way but based on the number of protein-coding genes, namely 19 430, instead of SNPs to reduce the computational burden, and thus . We had and because and , where and . For every individual, we also collected a few covariates, including age, sex, height, weight, face size, and camera system. Body mass index (BMI) was derived via: as an additional covariate. Here, we consider these covariates as confounders to population structure because they might hinder us in finding the population subgroups based on genetic heterogeneity. To measure the population structure of the cohort, White et al. (2021) have computed four ancestry axes by projecting the individuals onto the principal component (PC) space of the SNPs from the 1000G Project.
3.2.2 Evaluation on the embeddings
INMTD learns an embedding for SNPs, faces, and individuals. Here, we outline how we assess the biological validity for those embeddings separately and jointly.
To biologically validate our individual embedding, we assess if captures any heterogeneity of the cohort, including both population structure (ancestry axes) and confounding effects. Due to the orthogonality constraint, embedding vectors of can be considered as independent and characterize different aspects of the individuals. We, therefore, test the statistical association between every embedding vector and every ancestry axis or confounder. In particular, Kruskal–Wallis analysis of variance (ANOVA) is used for ancestry axes and continuous confounders (age, height, weight, BMI, and face size) and chi-squared test is used for categorical confounders (sex and camera system). We apply BH correction for the multiple testing.
To assess our SNP embedding, we cluster SNPs based on their embedding and use enrichment analysis to see if their space is functionally organized. As is not orthogonal, we use -means (with 10 random initializations) on to subgroup SNPs into clusters. We first check how well those SNP clusters coincide with the 19 430 protein-coding genes given the fact that gene is a natural summary of SNPs and biological processes are usually interpreted on a gene level. A gene is defined to be enriched in an SNP cluster if SNPs located 2K base pairs around this gene are present in this cluster significantly more than in the background. Hypergeometric test is applied for the enrichment analysis and the BH procedure is used for multiple testing correction. To further check if these clusters have biological relevance, we run another enrichment analysis for Gene Ontology (GO) terms for biological process. We annotate every SNP by GO terms if it is mapped to a gene that is annotated by a GO term and then test the overrepresentation of GO terms in every SNP cluster. Because GO terms are gene annotations, we annotate SNPs with GO terms that annotate their mapped genes. Only GO terms under biological process category are used and the BH procedure is applied for multiple testing correction.
To assess the association between the SNP and facial embeddings, we map them to the space of and compute cosine similarity between each SNP and each facial landmark (see Section 2.3). For the SNPs closest to facial landmarks in the joint space, we apply the GREAT analysis (v.4.0.4), which finds biological meaning of the set of SNPs via the annotations of nearby genes. Technically speaking, GREAT analysis performs a binomial test for a set of SNPs to check whether the overlap between their associated genes and genes with a certain annotation is greater than random chance. To associate SNPs with genes, we apply the default and recommended settings (McLean et al. 2010), namely the “basal plus extension” rule with 5 kb upstream and 1 kb downstream plus 1000 kb extension. Note that one SNP can be associated with multiple genes and SNPs not associated with any genes are not included in this analysis.
3.2.3 Characterization of unconfounded population subgroups
The unconfounded population subgroups are characterized based on the projection of genetic and facial embeddings to the space of the sample embedding, enabling computing the similarity between population subgroup centroids and SNPs and facial landmarks. For each subgroup, we first select the top 0.1% SNPs with highest cosine similarity to its centroid in the joint space, to which the GREAT analysis is applied to reveal the most relevant phenotypes [human phenotype ontology (HPO)]. The threshold of 0.1% is determined by balancing the number of genes selected per subgroup and the genomic coverage of genes selected for all subgroups (Supplementary Figs S1 and S2). We then visualize the cosine similarities of all facial landmarks to a subgroup centroid on the averaged face, in order to demonstrate how different areas of the face are associated with the corresponding subgroup.
4 Results
4.1 INMTD outperforms other methods on the synthetic data
Our evaluation on the synthetic data has shown that INMTD, followed by SNF, achieved superior clustering performance over other methods (Fig. 2). This result indicates that integrating multiple data types can help derive a more accurate clustering than using a single data type. However, siNMF and jNMF, as two well-established methods for jointly decomposing multiple data matrices, did not perform well on the simulated data, presumably because they are not as specialized in clustering and tensor decomposition as INMTD. Besides, the comparison between different implementations of -means (only , only , or both and ) implies that simply concatenating features from multiple data views cannot effectively fuse the information and improve the clustering.

Clustering performance on the synthetic data. This plot shows the mean and standard deviation of ARIs of INMTD and the other methods over six runs with different random seeds. The methods are divided into three column panels: -means and NMTF using only , -means and NTD using only , and -means, siNMF, jNMF, SNF, and INMTD using both and .
4.2 INMTD generates biologically meaningful embeddings from a real-life multi-view dataset
We applied INTMD to a real-world facial-genomic cohort collected from the USA for unconfounded population subgrouping (White et al. 2021). This dataset consists of two data types, for 7 141 882 SNPs in 4680 people and for the 3D coordinates of 7160 facial landmarks in the same 4680 individuals (White et al. 2021). A few confounders were collected as well, which are age, sex, height, weight, camera system, and face size. We also derived BMI from height and weight as a potential confounder. To assess the population structure of this cohort, White et al. computed four ancestry axes by projecting the genotypes to a principal component space built from the 1000 Genomes Project data, in the manner of EIGENSTRAT (Price et al. 2006). We ran our INMTD model on this multi-view dataset for 1000 iterations with SVD initialization and it converged in terms of the total relative error (Supplementary Fig. S3). Due to the heuristic solver for Equation (5), is only approximately orthonormal. Therefore, we further checked the independencies between the embedding vectors of based on its covariance matrix, namely (Supplementary Fig. S4).
Because the heterogeneity of a population can be largely described by population structure, age, sex, etc., we validated the information captured by based on the statistical association between every column vector of and every ancestry axis or confounder (Fig. 3). The results indicate that most vectors captured the information of population structure while being confounded. This is expected as, for instance, height has been reported to highly relate with different European ancestries (Cavelaars et al. 2000). Furthermore, the 48 embedding vectors have different association patterns with the ancestry axes and confounders. For instance, the first vector is significantly associated with most ancestry axes as well as confounders, while no association with the sixth vector is observed. The observation that height, weight, and face size are significantly associated to similar embedding vectors is in line with the close relationships among these factors. The results also imply that sex is not a main driver of the heterogeneity within the data (only two embedding vectors have significant association with sex), which may be due to the exclusion of sex chromosomes in this study.

Heatmap of P-values for the linear F test between every embedding vector and every confounder and ancestry axis. X-axis is the exploratory variables including confounders and ancestry axes, and Y-axis is the 48 column vectors of the embedding matrix . Non-significant P-values (larger than 0.05 after BH correction) were removed from the plot.
To assess the biological relevance of the SNP embedding from , we applied -means on to derive 99 clusters of SNPs. We first mapped every SNP to genes if it falls within or 2k base pairs around a gene and ignored SNPs that do not map to any genes. Ninety-seven out of 99 SNP clusters have at least one overrepresented gene (P-value < 0.05 in a hypergeometric test with BH correction for multiple testing) with respect to the background of all the SNPs that can map to a gene. 98.6% of genes have been enriched in at least one SNP cluster, showing the coverage of the SNP clusters, while most genes were enriched in only a few clusters, showing the specificity of the clusters (Fig. 4). We then applied GO enrichment analysis for each cluster after assigning GO annotations of a gene to all its belonging SNPs. 96.0% of clusters statistically significantly overrepresented at least one GO term and 99.6% of GO terms have been enriched in at least one cluster (Supplementary Fig. S5). This result validated that the SNP clusters obtained from have both genomic specificity and biological relevance.

Histogram showing the number of genes (y-axis) that are enriched in a given number of clusters (x-axis). An enrichment analysis is done between every gene and every (SNP) cluster. We then compute how many genes (y-axis) are found to be enriched in different number of clusters of SNPs (x-axis).
To assess the quality of the facial embedding from , we aimed to obtain a hierarchical segmentation on and compare it with the work of White et al. (2021). The chosen Ward’s hierarchical clustering yielded 60 facial segments (Supplementary Fig. S6), with a cophenetic correlation coefficient of 0.638, which is higher than that of the facial segmentation by White et al. on the same images (0.414). It suggested that the hierarchical segmentation from embedding better groups together facial landmarks that are close in 3D space than the one by White et al., defining more spatially coherent “patches.” The visualization of these 60 clusters on an averaged face also showed that the landmarks within each cluster are spatially close to each other and the clustering is morphologically meaningful (Fig. 5).

The 60 clusters derived from , illustrated on the mean face shape of all individuals. Not every cluster has a unique color due to the large number of clusters, but neighboring clusters are distinguished by different colors. Every face shape is symmetric along X-axis (left/right), so is the facial segmentation.
To further investigate the relationships between genetics and facial morphology, we mapped both and to the space of (Supplementary Fig. S7). For each facial landmark in the joint space, we selected the closest SNP in terms of cosine similarity, resulting in 905 unique SNPs in total. To assess what biological traits are associated with the chosen SNPs, we conducted a GREAT analysis on their neighboring genes (McLean et al. 2010), which found 17 significantly enriched HPO terms (Gargano et al. 2024) based on the adjusted binomial P-values (Fig. 6). Most of the enriched terms are highly linked to facial morphology (especially eyes), limb or spine morphology and embryonic development, depicting close biological relationship between the embeddings for SNPs and facial landmarks, namely and . Other enriched terms suggested high relatedness between facial morphology and myotonia. The results are in line with a previously published genome-wide association study (GWAS) work on the same facial-genomic data (White et al. 2021) and indicate that INMTD allows for uncovering biologically relevant associations between SNPs and facial landmarks.

Dotplot of 17 significantly enriched HPO terms from 905 SNPs that are closest to facial landmarks in the joint space. GREAT analysis first mapped the 905 SNPs to 1156 genes based on the default and recommended settings, and then ran a binomial test for the enrichment analysis with BH correction. X-axis shows the ratio between the number of observed genes and the number of genes annotated for each HPO term. The dot color indicates the significance of the adjusted binomial P-value in the form of −log10.
4.3 INMTD finds unconfounded population subgroups characterized by their genetic and facial information
To derive the optimal population subgroups, we applied -means on with different number of clusters and found that the clustering with 48 clusters achieved the highest Silhouette score (0.169) (Supplementary Fig. S8). As mentioned before, many vectors of are significantly confounded, potentially disturbing correct interpretation and characterization of derived subgroups. In order to deal with the confounding effect, we removed the 28 columns in that are significantly associated with any confounders, and clustered individuals based on the remaining 20 embedding vectors (Supplementary Fig. S9). We then applied -means on the unconfounded with different number of clusters and the clustering with 20 clusters achieved the highest Silhouette score (0.329) (Supplementary Fig. S10). This score is statistically higher (empirical P-value = 0.02 from 1000 repetitions, Supplementary Fig. S11) than clusterings with the same number of clusters from 20 randomly sampled vectors, indicating better intrinsic validity of the unconfounded clustering.
To validate our reduced space is unconfounded and leads to a better capturing of the population structure, we assessed the statistical association between the derived clustering and the ancestry axes and confounders. The Kruskal–Wallis test (nonparametric ANOVA) showed both the original and the unconfounded clusterings are significantly associated with all ancestry axes (P-value ), suggesting their relationships with population structure (Table 1). Yet, the original clustering also has significant associations with most confounders, especially age and camera system, while the unconfounded clustering has no significant associations with any confounders, validating our unconfounding strategy. The effective reduction of the influence by camera system, which resembles the batch effect, also indicates the strength of the confounder correction.
Adjusted P-values of the statistical tests between the clustering of and every ancestry axis and confounder.
Variable . | Statistical test . | Adj. P-value (original) . | Adj. P-value (unconfounded) . |
---|---|---|---|
Ancestry axis 1 | Kruskal–Wallis | 7.24e−40 | 2.99e−9 |
Ancestry axis 2 | Kruskal–Wallis | 1.04e−11 | 2.29e−7 |
Ancestry axis 3 | Kruskal–Wallis | 6.07e−11 | 2.69e−3 |
Ancestry axis 4 | Kruskal–Wallis | 8.11e−22 | 8.16e−11 |
Age | Kruskal–Wallis | 9.36e−8 | 0.888 |
Sex | Chi-squared | 0.118 | 0.899 |
Height | Kruskal–Wallis | 8.38e−2 | 0.888 |
Weight | Kruskal–Wallis | 5.51e−2 | 0.951 |
BMI | Kruskal–Wallis | 2.19e−3 | 0.899 |
Camera system | Chi-squared | 3.01e−78 | 0.899 |
Face size | Kruskal–Wallis | 3.72e−2 | 0.888 |
Variable . | Statistical test . | Adj. P-value (original) . | Adj. P-value (unconfounded) . |
---|---|---|---|
Ancestry axis 1 | Kruskal–Wallis | 7.24e−40 | 2.99e−9 |
Ancestry axis 2 | Kruskal–Wallis | 1.04e−11 | 2.29e−7 |
Ancestry axis 3 | Kruskal–Wallis | 6.07e−11 | 2.69e−3 |
Ancestry axis 4 | Kruskal–Wallis | 8.11e−22 | 8.16e−11 |
Age | Kruskal–Wallis | 9.36e−8 | 0.888 |
Sex | Chi-squared | 0.118 | 0.899 |
Height | Kruskal–Wallis | 8.38e−2 | 0.888 |
Weight | Kruskal–Wallis | 5.51e−2 | 0.951 |
BMI | Kruskal–Wallis | 2.19e−3 | 0.899 |
Camera system | Chi-squared | 3.01e−78 | 0.899 |
Face size | Kruskal–Wallis | 3.72e−2 | 0.888 |
Kruskal–Wallis test was used for continuous variables and chi-squared test for categorical variables. All P-values have been corrected for multiple testing via the Benjamini-Hochberg (BH) procedure. Column 3 are adjusted P-values from the original clustering based on all vectors in G1, while Column 4 from the unconfounded clustering. Adjusted P-values lower than 0.05 (threshold for significance) are in bold.
Adjusted P-values of the statistical tests between the clustering of and every ancestry axis and confounder.
Variable . | Statistical test . | Adj. P-value (original) . | Adj. P-value (unconfounded) . |
---|---|---|---|
Ancestry axis 1 | Kruskal–Wallis | 7.24e−40 | 2.99e−9 |
Ancestry axis 2 | Kruskal–Wallis | 1.04e−11 | 2.29e−7 |
Ancestry axis 3 | Kruskal–Wallis | 6.07e−11 | 2.69e−3 |
Ancestry axis 4 | Kruskal–Wallis | 8.11e−22 | 8.16e−11 |
Age | Kruskal–Wallis | 9.36e−8 | 0.888 |
Sex | Chi-squared | 0.118 | 0.899 |
Height | Kruskal–Wallis | 8.38e−2 | 0.888 |
Weight | Kruskal–Wallis | 5.51e−2 | 0.951 |
BMI | Kruskal–Wallis | 2.19e−3 | 0.899 |
Camera system | Chi-squared | 3.01e−78 | 0.899 |
Face size | Kruskal–Wallis | 3.72e−2 | 0.888 |
Variable . | Statistical test . | Adj. P-value (original) . | Adj. P-value (unconfounded) . |
---|---|---|---|
Ancestry axis 1 | Kruskal–Wallis | 7.24e−40 | 2.99e−9 |
Ancestry axis 2 | Kruskal–Wallis | 1.04e−11 | 2.29e−7 |
Ancestry axis 3 | Kruskal–Wallis | 6.07e−11 | 2.69e−3 |
Ancestry axis 4 | Kruskal–Wallis | 8.11e−22 | 8.16e−11 |
Age | Kruskal–Wallis | 9.36e−8 | 0.888 |
Sex | Chi-squared | 0.118 | 0.899 |
Height | Kruskal–Wallis | 8.38e−2 | 0.888 |
Weight | Kruskal–Wallis | 5.51e−2 | 0.951 |
BMI | Kruskal–Wallis | 2.19e−3 | 0.899 |
Camera system | Chi-squared | 3.01e−78 | 0.899 |
Face size | Kruskal–Wallis | 3.72e−2 | 0.888 |
Kruskal–Wallis test was used for continuous variables and chi-squared test for categorical variables. All P-values have been corrected for multiple testing via the Benjamini-Hochberg (BH) procedure. Column 3 are adjusted P-values from the original clustering based on all vectors in G1, while Column 4 from the unconfounded clustering. Adjusted P-values lower than 0.05 (threshold for significance) are in bold.
To investigate if the derived subgroups from the unconfounded clustering capture well the population structure, we adopted 3519 European ancestry informative markers (EuroAIMs) found by Tian et al. (2009), which are SNPs capable of distinguishing European subpopulations. We first mapped the SNP embedding to the space of and then, in the joint space, selected 3519 SNPs with highest cosine similarities to the centroids of derived subgroups, which drive the clustering of individuals. We selected the same number of SNPs as the EuroAIMs for better comparison. Because only 13 EuroAIMs were originally included in our dataset, we looked at the gene level and found that 856 out of the 3519 selected SNPs are located in the same genes as the EuroAIMs. A hypergeometric test showed that this fraction is significantly (P-value = 2.15e−63) higher than a random selection from all SNPs in our dataset, indicating that the genetic basis of the derived subgrouping is statistically significantly associated with the European population structure. We also checked for the subgroups derived from the original clustering and found only 40 out of 3,519 selected SNPs that are located in the EuroAIM genes. The P-value of 1 from the hypergeometric test also implied this fraction is not statistically higher than a random selection. This result further proved that our deconfounding strategy has successfully led to population subgroups that could better highlight the population structure.
After showing the validity of our unconfounded population subgrouping, we focused on the characterization of two subgroups that are most associated with the four ancestry axes (Supplementary Table S1), namely subgroup 2 and 7. The top 0.1% SNPs selected for population subgroup 2 were significantly enriched in over 500 HPO terms, with the top 20 terms clearly related to skeletal morphology or bone formation (Fig. 7A). It is in line with the highlighted areas on the face (Fig. 7B), e.g. the nasal bone, the zygomatic bone, maxilla, and most part of the mandible. This result characterized population subgroup 2 with facial skeleton. Note that the frontal bone does not show up as much as the facial bones, implying the different morphology of cranial and facial skeleton (Anderson et al. 2024). Some other enriched HPO terms involve anemia, telangiectasia, and neutrophil, which are related to the blood. Whereas the genetic representation of population subgroup 7 found 72 significantly associated HPO terms in total, and many of the top 20 terms are strongly involved in kidney function (Fig. 7C). Meanwhile, the eye area was remarkably underlined on the mean face (Fig. 7D), which supports the common embryogenic stage of eyes and kidneys and the reported relationship between eye and kidney diseases (Bodaghi et al. 2014). Therefore, population subgroup 7 is likely characterized by eye morphology related to kidney function. Other enriched terms indicate the involvement of kidney function in diabetes, hair development, tooth development, etc.

Genetic and facial representation for population subgroup 2 (top, A and B) and 7 (bottom, C and D). The genetic representation was obtained via GREAT analysis (left) for the top 20 HPO terms enriched in the top 0.1% SNPs highlighted for population subgroup 2 (A) and 7 (C). We used the default and recommended settings of GREAT analysis with binomial test and BH correction. For the facial representation, the cosine similarity between each facial landmark and the centroid of subgroup 2 (B) and 7 (D) separately was plotted on the mean face.
5 Discussion and conclusion
In this study, we proposed INMTD, a framework that integrates both 2D matrices and 3D tensors for unconfounded clustering. The simulation showed the superior clustering performance of INMTD over other competing methods on synthetic data with known clusters. We then applied it to a real-life facial-genomic dataset to evaluate its performance and find an unconfounded subgrouping for European population structure. INMTD discovered 20 unconfounded population subgroups with their representative genetic and facial characteristics, providing the potential for more precise healthcare toward each subpopulation.
This work applies INMTD to facial and genomic data, which reflect a large fraction of the population structure. A comprehensive subgrouping of population structure has the potential to facilitate precision medicine where individuals in each population subgroup may receive tailored medical decisions based on their intrinsic characteristics (Saria and Goldenberg 2015, Loh et al. 2019). For instance, the characterization of subgroup 2 indicates that variations in facial skeletal traits may serve as proxies for underlying genetic diversity within populations, and special medical treatment may be given to population subgroups with distinct skeletal development. On the other hand, the kidney-related phenotypes enriched for subgroup 7 may bring new insights into the embryological connection between eye and kidney functions. Understanding such connections could help explain the genetic diversity of eye morphology in different European subpopulations, improving both population health and tailored medical interventions.
While classic matrix or tensor decomposition models only focus on a single dataset and current integrative matrix factorization models cannot deal with higher-dimensional data structures, INMTD is able to jointly decompose both matrix and tensor data. Another key feature of INMTD is its orthogonality constraint on for better clustering and interpretation. An orthogonal matrix has all its vectors independent to each other, and therefore, every vector can be investigated specifically for its characteristics. We further normalized the vectors of to make them orthonormal, resembling naturally a cluster indicator matrix. Optionally, we can impose orthogonality on any embedding if needed.
As an extension of joint NMF model, INMTD has the potential to predict facial images from genotypes or vice versa of new samples, in a similar fashion as Akata et al. (2011). In the former case, given the genotypes of an individual that has not been seen by the model, we first solve for and then predict its facial image . Alternatively, we could find the closest image from the cohort to according to cosine similarity or the image whose corresponding embedding vector in space is closest to . In addition, a new sample can be classified into one of the derived subgroups by assigning to its closest cluster centroid.
Even though INMTD was illustrated in a facial-genomic dataset for population subgrouping, it is not restricted to facial images and genomic data. INMTD can be applied to any 2D and 3D datasets for joint clustering, as long as they are nonnegative or can be converted to nonnegative values, e.g. transcriptomics or epigenomics as 2D matrices and CT scans or time series data as 3D tensors. Also, the rapidly emerging field of spatial transcriptomics highly involves multi-view data with different dimensions, such as gene expression, spatial coordinates of cells, and histological images; we expect integrative studies in this field to benefit from INMTD. Note that it is also possible to further extend INMTD to handle more than two data types or data of higher dimensions, e.g. a moving 3D image (4D), by adding extra embeddings to the model and objective function. In the future, we plan to apply INMTD to patient datasets with more data views, e.g. TCGA, for improved disease subtyping and reveal novel patient subgroups with meaningful characteristics for more precise healthcare decisions.
There are two main limitations of INMTD for future improvements. The first one is that it determines the ranks of each embedding based on a rule of thumb because of the extremely large scale of genomic dataset. Nevertheless, in most cases, the data dimensionality would be feasible for INMTD model on a modern computer or computing server and the user could choose the optimal ranks via cross-validation or other non-heuristic methods. The other potential deficiency comes from the post-hoc confounder correction, which removes confounded vectors but in a fairly strict manner with the risk of “over-correcting.” A more compromising strategy could be adding a regularization term in the objective function to iteratively minimize the confounding effect in , which indicates our future work.
In conclusion, with the surge in biological data in diverse formats and the growing demand for personalized medicine with Big Data, INMTD is envisaged to become an essential tool for integrating multi-view datasets of varying dimensions, enabling meaningful and unconfounded clustering. We are confident that INMTD has the potential of widespread adoption in the future due to its exceptional performance and ease-of-interpretation.
Acknowledgements
We want to acknowledge the European Research Council (ERC) Consolidator grant No. 770827, the Spanish State Research Agency and the Ministry of Science and Innovation MCIN grant PID2022-141920NB-I00/AEI/10.13039/501100011033/FEDER, UE, and the Department of Research and Universities of the Generalitat de Catalunya code 2021 SGR 01536.
Author contributions
Zuqi Li (Conceptualization [equal], Formal analysis [lead], Investigation [lead], Methodology [lead], Writing—original draft [lead]), Sam F.L. Windels (Conceptualization [equal], Methodology [supporting], Writing—review & editing [equal]), Noël Malod-Dognin (Methodology [supporting], Writing—review & editing [equal]), Seth M. Weinberg (Resources [equal], Writing—review & editing [equal]), Mary L. Marazita (Resources [equal], Writing—review & editing [equal]), Susan Walsh (Resources [equal], Writing—review & editing [equal]), Mark D. Shriver (Resources [equal], Writing—review & editing [equal]), David W. Fardo (Funding acquisition [supporting], Supervision [supporting], Writing—review & editing [equal]), Peter Claes (Resources [supporting], Supervision [equal], Writing—review & editing [equal]), Nataša Pržulj (Conceptualization [equal], Project administration [supporting], Supervision [equal], Writing—review & editing [equal]), and Kristel Van Steen (Conceptualization [equal], Funding acquisition [lead], Methodology [supporting], Project administration [lead], Supervision [equal], Writing—review & editing [equal])
Supplementary data
Supplementary data are available at Bioinformatics online.
Conflict of interest: None declared.
Funding
This work was supported by the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreements No. 813533 (MLFPM) and No. 860895 (TranSYS).
Data availability
For the 3D Facial Norms dataset, genotypic markers are available to the research community through the dbGaP controlled-access repository (http://www.ncbi.nlm.nih.gov/gap) at accession #phs000929.v1.p1. The raw source data for the phenotypes—the 3D facial surface models in .obj format—are available through the FaceBase Consortium (https://www.facebase.org) at accession #FB00000491.01. Access to these 3D facial surface models requires proper institutional ethics approval and approval from the FaceBase data access committee. The PSU and IUPUI datasets were not collected with broad data sharing consent. Code for INMTD is publicly available on GitHub: https://github.com/ZuqiLi/INMTD.