A functional analysis of omic network embedding spaces reveals key altered functions in cancer

Abstract Motivation Advances in omics technologies have revolutionized cancer research by producing massive datasets. Common approaches to deciphering these complex data are by embedding algorithms of molecular interaction networks. These algorithms find a low-dimensional space in which similarities between the network nodes are best preserved. Currently available embedding approaches mine the gene embeddings directly to uncover new cancer-related knowledge. However, these gene-centric approaches produce incomplete knowledge, since they do not account for the functional implications of genomic alterations. We propose a new, function-centric perspective and approach, to complement the knowledge obtained from omic data. Results We introduce our Functional Mapping Matrix (FMM) to explore the functional organization of different tissue-specific and species-specific embedding spaces generated by a Non-negative Matrix Tri-Factorization algorithm. Also, we use our FMM to define the optimal dimensionality of these molecular interaction network embedding spaces. For this optimal dimensionality, we compare the FMMs of the most prevalent cancers in human to FMMs of their corresponding control tissues. We find that cancer alters the positions in the embedding space of cancer-related functions, while it keeps the positions of the noncancer-related ones. We exploit this spacial ‘movement’ to predict novel cancer-related functions. Finally, we predict novel cancer-related genes that the currently available methods for gene-centric analyses cannot identify; we validate these predictions by literature curation and retrospective analyses of patient survival data. Availability and implementation Data and source code can be accessed at https://github.com/gaiac/FMM.


Contents
The analysis of a large set of functional annotations is not straightforward. These large sets typically contain redundant and dependent annotations that can be summarized to improve the biological interpretation of the sets. To summarize them, different tools, e.g., REVIGO (Supek et al., 2011) or DAVID (Dennis et al., 2003), have been proposed. These methods group similar functional annotations by computing their Lin's semantic similarity. Thus, large groups of annotations are summarized into functional clusters according to different semantic similarity metrics, e.g., the simRel score (Schlicker et al., 2006) or the Lin's semantic similarity . In this paper, we propose to use the REVIGO tool to summarize a large list of annotations. As a first step, we use REVIGO (with the default parameters) to cluster the functional annotations based on their Lin's semantic similarity. Then, we consider as the most representative function of each cluster, the annotation terms having the highest average Lin's semantic similarity with the other terms in the cluster. In this manuscript, we use this method to summarize the functions altered by cancer.

Multiplicative update rules
As presented in section 2.3 of the main paper, the Non-negative Matrix Tri-Factorization, NMTF, can be formulated as the following minimization problem: min P,S,G≥0 f (P, S, G) = min P,S,G≥0 ∥X − P SG T ∥ 2 F , G T G = I, where F denotes the Frobenius norm, X is the PPMI matrix representation of a molecular network (whose nodes are genes), rows in matrix P · S are the embedding vectors of the genes and columns in G T are the axis of the basis describing the space in which the genes are embedded.
Following the semi-NMTF simplification Ding et al. (2008) for a more computationally tractable solution, we remove the non-negativity constraint on S ≥ 0. To solve the optimization problem, we derive the Karush-Kuhn-Tucker (KKT) conditions for our NMTF as follows: ∂f ∂G = −X T P S + GS T P T P S − η 1 = 0, ∂f ∂S = −P T XG + P T P SG T G = 0, ∂f ∂P = −XGS T + P SG T GS T − η 2 , η 1 , G ≥ 0, where ⊙ is the Hadamard (element wise) product and matrices η 1 and η 2 are the dual variables for the primal constraint G, P ≥ 0. For S, we have the following closed formula: As explained in (Pržulj, 2019), we derive the following multiplicative update rule to solve the KKT conditions above: We start from initial solutions, G init , P init , S init , and iteratively use Equations (1) and (2) to compute new matrix factors G, P and S until convergence. To generate initial G init , P init and S init , we use the Singular Value Decomposition based strategy (Qiao, 2015). However, SVD matrix factors can contain negative entries; thus, we use only their positive entries and replace the negative entries with 0, to account for the non-negativity constraint of the NMTF. This strategy makes the solver deterministic and also reduces the number of iterations that are needed to achieve convergence (Qiao, 2015).
We measure the quality of the factorization by sum of the relative square errors (RSE) between the decomposed matrices and the corresponding decompositions: In our implementation, the iterative solver stops after 1000 iterations, the value for which the RSE of the decomposition is not decreasing anymore.

Impact of the PPI network matrix representation to the functional organization of the embedding space
In this section, we compare the ability of the adjacency and PPMI matrix representations of the tissues-specific PPI networks (detailed in sections 2.1 of the main text) to produce functionally coherent network embedding spaces. To this aim, we embed each tissue-specific PPI network by applying our NMTF-based methodology (see section 2.3 of the main text) on either its adjacency matrix representation or on its PPMI matrix representation. We generate these embedding spaces with 200 dimensions since this dimensionality corresponds to the optimal dimensionality of such spaces (as detailed in section 2.5 of the main text).
In a first step, as standardly done in the literature, we compare the ability of the adjacency and PPMI matrix representations to produce functionally coherent embedding spaces from the genecentric point of view. For each embedding space, we cluster together genes that are embedded close in space by applying the k-medoid algorithm (Park and Jun, 2009) on the genes' embedding vectors.
For the number of clusters, we use the heuristic rule of thumb (k = n 2 , where n is the number of nodes in the tissue-specific network) (Kodinariya and Makwana, 2013). We end up with 65,45,44,44,42,38,47, and 47 clusters for breast cancer, breast glandular cells, prostate cancer, prostate glandular cells, lung cancer, lung pneumocytes, colorectal cancer, and colorectal glandular cells, respectively. After clustering, we measure the enrichment of those clusters in GO BP annotations by using the sampling without replacement strategy (hypergeometric test) and we consider a GO BP term to be significantly enriched in a gene cluster if the corresponding enrichment p-value, after Benjamini Hochberg correction for multiple hypothesis testing (Benjamini and Hochberg, 1995), is smaller than or equal to 5%. For each embedding space, we report the percentage of enriched clusters (clusters with at least one enriched GO BP term), the percentage of enriched genes (genes that are annotated with at least one GO BP term that is enriched in their clusters), and the percentage of enriched GO BP terms. As detailed in Supplementary In a second step, we compare the ability of the adjacency and PPMI matrix representations to produce functionally coherent network embedding spaces from our new function-centric point of view. To this aim, we use our FMM-based method to embed and capture the relative positions of the GO BP terms in the eight tissues-specific PPI network embedding spaces described above (detailed in section 2.4 of the main text). We evaluate the functional organization of these embedding spaces by assessing if functionally similar GO BP terms (with high Lin's semantic similarity) are located close in the embedding space, and thus have low values in the corresponding FMM. To this aim, we first compute the pairwise Lin's semantic similarity  between any two GO BP terms. Then, we cluster GO BP terms based on their proximity in the embedding space (detailed in section 2.6 of the main text) and report both the average semantic similarity of the pairs of GO BP terms that are in the same cluster ("intra-SS") and the average semantic similarity of the pairs of GO BP terms that are not clustered together ("inter-SS"). Intuitively, the higher the intra-SS and the lower the inter-SS, the better functionally organized the embedding space is. As detailed in Table 1 and Supplementary Table 4, we find that the embedding spaces obtained from the PPMI matrix representations are more functionally coherent, with an intra-SS of 0.21 and an inter-SS of 0.161 (on average over the eight tissues-specific PPI networks), compared to the embedding spaces obtained from the adjacency matrix representations (with an intra-SS of 0.18 and an inter-SS of 0.165, on average).
Furthermore, for each tissues-specific PPI network, the pairs of GO BP terms that are clustered together in the PPMI-based network embedding spaces have statistically significantly higher Lin's semantic similarity than the pairs of GO BP terms that are clustered together in the adjacencybased network embedding spaces (with all one-sided Mann-Whitney U test p-values being smaller than or equal to 4 ×10 −3 , as detailed in Supplementary Table 4).
To conclude, both the gene-centric and our FMM approach show that the embedding spaces obtained from the PPMI matrix representations of our tissues-specific PPI networks better capture the cell's functional organization than the embedding spaces obtained from the adjacency matrix representations of these networks. These results further demonstrate that the PPMI matrix is not only a richer representation compared to the adjacency matrix (Xenos et al., 2021), but also that the extra information that it contains is useful for producing a more functionally organized embedding space.
1.2.2 Our FMM-based methodology captures more biological information from the embedding space compared to the actual gene-centric approaches In this section, we compare the ability of our FMM-based method to uncover functional interactions between GO BP terms from the PPI network embedding spaces to that of the standard gene-centric approach. To this aim, we consider the eight cancer and control tissues-specific PPI networks described in section 2.1 of the main text, which we embed by applying our NMTF-based methodology on their PPMI matrix representations (see section 2.3 of the main text). We generate these embedding spaces with 200 dimensions since this dimensionality corresponds to the optimal dimensionality of such spaces (as detailed in section 2.5 of the main text).
For a given tissues-specific PPI network embedding space, our FMM directly quantifies all the functional interactions between any two GO BP terms that annotate genes in the PPI network by measuring the cosine distance between the GO BP terms' embedding vectors (see section 2.4 of the main text). On the other hand, the gene-centric approach does not directly uncover such functional interactions between GO BP terms. Instead, we indirectly uncover them by performing the following gene clustering and enrichment analysis. For each embedding space, we cluster together genes that are embedded close in space by applying the k-medoid algorithm (Park and Jun, 2009) on the genes' embedding vectors. For the number of clusters, we use the heuristic rule of thumb (k = n 2 , where n is the number of nodes in the tissue-specific network) (Kodinariya and Makwana, 2013). We end up with 65,45,44,44,42,38,47, and 47 clusters for breast cancer, breast glandular cells, prostate cancer, prostate glandular cells, lung cancer, lung pneumocytes, colorectal cancer and colorectal glandular cells, respectively. Then, we measure the enrichment of the resulting gene clusters in GO BP terms by using the sampling without replacement strategy (hypergeometric test) and we consider a GO BP term to be significantly enriched in a gene cluster if the corresponding enrichment p-value, after Benjamini and Hochberg correction for multiple hypothesis testing (Benjamini and Hochberg, 1995), is smaller than or equal to 5%. Then, we consider that two GO BP terms functionally interact if they are both significantly enriched in the same gene cluster. Finally, for the GO BP terms that are significantly enriched in at least one gene cluster, we measure the agreement between the functional interactions uncovered by the gene-centric approach and the functional interactions that are captured by our FMM methodology by using the following receiver operating characteristic (ROC) curve analysis. In particular, for each GO BP pair, we consider the result of the gene-centric approach as the ground truth, i.e., a pair of GO BP terms is considered as "true" if the two terms are enriched in the same cluster, or as "false" otherwise. Also, for each GO BP pair, we consider as the prediction score their cosine similarity in the embedding space (1 minus their associated value in the FMM). Then, we compute the area under the ROC curve (AUROC) (Bradley, 1997) between the ground truth and the prediction score over all the considered GO BP pairs. Note that an AUROC score of 0.5 corresponds to a random classification and a score of 1 to a perfect one. Hence, the closer to one the AUROC score is, the higher the agreement between our FMM-based method and the gene-centric approach.
On average over our eight tissues-specific PPI networks, we find that only 51.1% of the GO BP terms that annotate genes in a network are found to be significantly enriched in at least one gene cluster, leaving about one-half of the functional space unexplored (see Supplementary Table 3). For the significantly enriched GO BP terms, the functional interactions uncovered by the gene-centric and the FMM approaches are in significant agreement, with an average AUROC of 88% and all p-values ≤ 1 × 10 −323 (see Supplementary Figures 13 and 14). These results confirm that the GO BP terms that are enriched in the same gene cluster tend to be located close in the embedding space and thus, tend to have small association values in the FMM.
In conclusion, our FMM-based method is not only able to uncover the functional organization of biological functions that are identified by the gene-centric approach, but it goes beyond and characterizes the functional organization of all available GO BP terms.

The FMMs reveal the higher-order functional organizations of the GO BP terms in the network embedding spaces
In the previous section, we showed that our FMM better capture the pairwise functional interactions between GO BP terms than the traditional gene-centric approach. Here, we ask if the FMM can uncover the higher-order functional organization of the GO BP terms in a network embedding space.
To this aim, we embed all tissue-specific PPI networks by applying our NMTF-based methodology on the PPMI matrix representations of the networks (detailed in sections 2.1 and 2.3 of the main manuscript). We generate these embedding spaces with 200 dimensions since this dimensionality corresponds to the optimal dimensionality of such spaces (as detailed in section 2.5 of the main text). Then, we apply our FMM-based method to embed and capture the relative positions of the GO BP terms in the resulting network embedding spaces (detailed in section 2.4 of the main text). To reveal the higher-order functional organization of the GO BP terms in the network embedding spaces, we apply the hierarchical clustering method Pvclust (Suzuki and Shimodaira, 2006) to the rows and columns (representing GO BP terms) of the FMMs. Pvclust evaluates the statistical significance of each cluster in the hierarchy by computing its Approximately Unbiased p-value (AU) (Suzuki and Shimodaira, 2006). Clusters with an AU value greater than or equal to 95% are considered to be strongly supported by the data, i.e., they are not expected by random.
On average over our eight tissues-specific PPI network embedding spaces, we find that about 53.62% of the clusters in the hierarchies are statistically significant with AUs greater than or equal to 95%. In detail, we find that 54%, 54%, 55%, 52%, 53%, 54%, 53%, and 54% of the clusters in the hierarchy are statistically significant with AUs greater than or equal to 95% for breast cancer, breast glandular cells, prostate cancer, prostate glandular cells, lung cancer, lung pneumocytes, colorectal cancer and colorectal glandular cells tissue-specific PPI embedding space, respectively. Importantly, these significant clusters cover all the GO BP terms that annotate the tissues-specific In conclusion, these results demonstrate that our FMM methodology captures the higher-order organization of the GO BP terms in the network embedding space. While these results motivate us to compare FMMs across different conditions to uncover condition-related changes in the functional organization of GO BP terms in the network embedding spaces, the extraction of novel knowledge from the hierarchical organization of the GO BP terms is a subject of future study.

FMM discriminates between functionally and not functionally organized embedding spaces
In section 3.1 of the main manuscript, we use our novel FMM-based method to confirm that the embedding spaces of both, cancer and control, are functionally organized. Here, we compare these results against a randomized experiment, i.e., when rewiring the previous PPI networks. In particular, for each tissue-specific PPI network, we randomly rewire the corresponding adjacency matrix and compute its corresponding PPMI matrix (detailed in section 2.1, of the paper). We follow the same protocol as used for the real tissue-specific networks to generate the corresponding "random" embedding space (detailed in section 2.3, of the paper). Next, we apply our FMMmethodology to obtain the embedding vectors of each of the GO BP annotations and the mutual positions of these vectors, which we call "distances", in the "random" embedding spaces (detailed in section 2.4, of the paper). We evaluate the functional organization of these "random" embedding spaces by using the same clustering method as we use with the real PPI networks (detailed in section 2.6, of the paper). For each tissue-specific PPI network, we repeat this procedure 100 times. In each repetition, we statistically test if those annotations whose embedding vectors cluster together based on their mutual positions in the space have a statistically significant higher Lin's semantic similarity than those annotations whose embedding vectors do not cluster. For this test, we use the Mann-Whitney U test (keeping the corresponding p-value in each repetition). After all the repetitions are finished, we correct the p-values for multiple tests by using the Bonferroni correction (Brown, 2008). As expected, we do not find a statistically significant difference in the Lin's semantic similarity between the annotation whose embedding vectors cluster and the annotations whose embedding vectors do not cluster in the space. Hence, we conclude that the "random" embedding spaces are not functionally organized (see Supplementary Table 6). These results demonstrate that our methodology correctly discriminates between functionally and not functionally organized embedding spaces.

FMMs identify novel cancer-related functions
In section 3.2, of the main paper, we use our novel FMM-based methodology to predict new cancerrelated functions and we verify the importance of one of our cancer-related predictions (the first annotation in our top 10 annotations predicted to be cancer-related, that we could not validate in the currently available literature). In this section, we extend this discussion for the remaining top 10 predicted cancer-related annotations. Starting with breast cancer, first we discuss the viral translational termination reinitiation. This function could be connected with the alternative transcriptional regulation pathways described in cancer (Vaklavas et al., 2017). In the same cancer, we also find as predicted to be cancer-related the RNA phosphodiester bond hydrolysis, endonucleolytic. This function could be connected with the regulatory roles of RNA modifications reported in this cancer type (Kumari et al., 2021). Following with prostate cancer, we find the positive regulation of endoplasmic reticulum unfolded protein response. The accumulation of unfolded protein in the ER induces this unfolded protein response as our predicted cancer-related function. It has been shown that the upregulation of this response could provide a growth advantage to tumor cells (So et al., 2009). Regarding lung cancer, we find the viral translational termination reinitiation as predicted cancer-related function. As discussed for breast cancer (see section 3.2 of the main paper), this process could also be connected with the alternative transcriptional regulation pathways described in cancer (Vaklavas et al., 2017). In lung cancer, we also find the positive regulation of transcription regulatory region DNA binding as predicted cancer-related function. This processes could be connected with the well-known deregulation of the gene expression observed in different cancers (Malik and Brown, 2000). In conclusion, we demonstrate that our as predicted cancer-related function are indeed cancer-related. Thus, our novel FMM-based methodology can be used to identify new cancer-related functions.

Towards pan-cancer functions
In section 3.2 of the main paper, we use the total change of the distances of the annotation embedding vectors ("movement") between cancer and control embedding spaces, to identify the set of shifted annotations. We demonstrate that these shifted annotations are cancer-related and that they can be used to predict new cancer-related functions for each cancer type. In this section, we explore if there are common biological functions that are shifted in all four studied cancers. We hypothesize that common shifted annotations may represent those functions that are commonly altered in all cancers. To this end, we analyze the overlap between the shifted annotations of the four cancer types. We find a statistically significant intersection of eight annotations between the shifted functions in each cancer type (permutation test with p-value < 0.05). In particular, we randomly sample, 100 times, the equivalent number of the top-shifted functions for each cancer type, and we compute the times, n, the overlap in the randomized experiments is equal or higher than the observed (eight). Then, we derive the p-value by dividing the n times by the number of permutations (100).
To explore the meaning of these eight common annotations, we summarize them into functional domains (Supplementary section 1.1.1). We find five functional clusters: cellular response to chemokine (GO:1990869 and GO:0008543), histone phosphorylation (GO:0016572), positive regulation of the RNA export from nucleus (GO:0046833), response to radiation (GO:0009314 and GO:0006970), and stress-activated MAPK cascade (GO:0051403 and GO:0007254). To confirm the link between these five clusters and cancer, we explore the literature that studies these domains.
As expected, we find that their link to cancer is coherent. For instance, dysregulation of the MAPK signaling cascades is known to be involved in the progression of various human cancers (Rezatabar et al., 2019). Regarding the response to radiation, healthy cells are known to react to radiation in three ways: arrest cell cycle progression, repair DNA lesions, or apoptosis (Li et al., 2001). In cancer cells, these three reactions are known to be deregulated (Sharma et al., 2019). In the case of the histone phosphorylation and RNA export from the nucleus, it could be related to the epigenetic alterations, and the dysregulation of nuclear trafficking observed in cancer (Rezatabar et al., 2019;Borden, 2020). Finally, the response to chemokines has been identified to play an important role in the tumor microenvironment (Vilgelm and Richmond, 2019). Altogether, these results suggest that the functions that are shifted in all cancers may define generic cancer-related functions that are normally deregulated in all tumors.
To further evaluate if these eight annotations represent general mechanisms of cancer, we investigate their link with the cancer hallmarks (Hanahan and Weinberg, 2011). First, we calculate the Lin's semantic similarity  of our eight common shifted annotations with the set of GO BP terms that are defined as the hallmarks of cancer by Chen et al. (2021). We consider an annotation to be related to one of the cancer hallmarks if it has at least a Lin's semantic similarity of 0.7 with one of the annotations that are included in the identified set of GO BP terms related to the cancer hallmarks by Chen et al., 2021. We choose a semantic similarity of 0.7 since it is the threshold that other tools apply to cluster functional annotations based on their semantic similarity, e.g., REVIGO (Supek et al., 2011). Interestingly, we find that our eight common shifted annotations are semantically similar to the following hallmarks: inducing angiogenesis, deregulation of cellular energetic, and sustaining proliferative signal (see Supplementary Figure 12). In particular, we find that the stress-activated MAPK cascade is involved in sustaining the proliferative signal.
Interestingly, this classification is coherent since these signaling cascades are known to participate in cell growth and cancer proliferation (Rezatabar et al., 2019;Feitelson et al., 2015). Also we see that response to chemokines is semantically similar to the inducing angiogenesis hallmark. Again, this makes sense, since chemokines are known to play an important role in the tumor microenvironment, in which they can contribute to tumor progression by inducing angiogenesis (Fousek et al., 2021).
Finally, we observe that response to radiation and histone phosphorylation are both semantically similar to the deregulation of cellular energetic hallmark. Alteration of the cellular epigenetic patterns has been connected with the cellular metabolism, while the response to radiation has recently been linked to the carbon metabolism of the cell (Korimerla and Wahl, 2022).
In conclusion, these results suggest that our analysis could be extended to more cancers to uncover pan-cancer functions.
Supplementary Figure 1: Lin's semantic similarity between our set of cancer-related GO BP terms (104 annotations) and the set of GO BP terms classified as the set of GO BP cancer hallmark defined by Chen et al. (2021) (135 annotations). For each GO BP term in our set, we show its maximum Lin's semantic similarity to one annotation in the cancer hallmarks set.
Supplementary Figure 3: Change in the pairwise distances (cosine distances), that we call "movement", of the functional annotation embedding vectors between breast, cancer, and control embedding spaces. For a pair of annotation embedding vectors, its "movement" is the difference between the cosine distance between the two embedding vectors in one embedding space (control) and the corresponding cosine distance in the other space (cancer) (defined in section 2.7, of the paper). Thus, positive "movement" means that the two annotation embedding vectors got closer in the cancer embedding space, and negative "movement" means that the two annotation embedding vectors got further apart in the cancer embedding space. The red lines represent the 95 th and 5 th percentiles of the distributions. We use these thresholds to define when two annotation embedding vectors are "moving significantly apart" in the embedding space of cancer (95 th percentile) or are "moving significantly closer" in the embedding space of cancer (5 th percentile). The panels are for breast, lung, colon, and prostate cancers versus controls.
Supplementary Figure 4: "Total movement distribution" of the functional annotation embedding vectors. For each annotation embedding vector, we compute its "total movement" (defined in section 2.7, of the paper). Thus, those annotation embedding vectors that change their mutual positions, "movement", the most between control embedding space and cancer embedding space have higher "total movement" than those annotation embedding vectors that do not change their "movement". The red lines represent two standard deviations above and below the mean of the distribution. We use these thresholds to define as shifted biological functions those functional annotations whose embedding vectors' "total movement" is two standard deviations above the mean of the "total movement distribution." In contrast, we define as stable biological functions those functional annotations whose embedding vectors' "total movement" is two standard deviations below the mean of the "total movement" distribution. The distributions are for breast, lung, colon, and prostate cancers.
Supplementary Figure 5: The t-Distributed Stochastic Neighbor Embedding (t-SNE) of embedding vectors of the functional annotations in four cancer tissue-specific PPI embedding spaces (breast, lung, colon, and prostate) and their corresponding control tissue-specific PPI embedding spaces. For each tissue-specific PPI embedding space, we generate the embedding vectors of the functional annotations in the corresponding embedding space (detailed in section 2.4, of the paper). We use the t-SNE technique (Van der Maaten and Hinton, 2008) to visualize these embedding vectors in the tissue-specific PPI embedding space. Each dot in the plot corresponds to the embedding vector of a specific GO BP annotation. The colors of the dots correspond to the clustering of the embedding vectors of the functional annotations based on their cosine distances (detailed in section 2.6, of the paper).
Supplementary Figure 6: The Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) of embedding vectors of the functional annotations in four cancer tissue-specific PPI embedding spaces (breast, lung, colon, and prostate) and their corresponding control tissuespecific PPI embedding spaces. For each tissue-specific PPI embedding space, we generate the embedding vectors of the functional annotations in the corresponding embedding space (detailed in section 2.4, of the paper). We use the UMAP technique (McInnes et al., 2018) to visualize these embedding vectors in the tissue-specific PPI embedding space. Each dot in the plot corresponds to the embedding vector of a specific GO BP annotation. The colors of the dots correspond to the clustering of the embedding vectors of the functional annotations based on their cosine distances (detailed in section 2.6, of the paper).
Supplementary Figure 7: The Multidimensional Scaling (MDS) of embedding vectors of the functional annotations in four cancer tissue-specific PPI embedding spaces (breast, lung, colon, and prostate) and their corresponding control tissue-specific PPI embedding spaces. For each tissuespecific PPI embedding space, we generate the embedding vectors of the functional annotations in the corresponding embedding space (detailed in section 2.4, of the paper). We use the MDS technique (Carroll and Arabie, 1998) to visualize these embedding vectors in the tissue-specific PPI embedding space. Each dot in the plot corresponds to the embedding vector of a specific GO BP annotation. The colors of the dots correspond to the clustering of the embedding vectors of the functional annotations based on their cosine distances (detailed in section 2.6, of the paper).
Supplementary Figure 8: The embedding vectors of GO BP terms change their mutual positions in the cancer embedding space with respect to the control embedding space, for each cancer type (breast cancer, prostate cancer, lung cancer, and colorectal cancer) and its corresponding control. Heatmaps in the first and second columns show the cosine distances (mutual positions) between the embedding vectors of the GO BP annotations in control embedding space (FMM Control ) and cancer embedding space (FMM Cancer ), respectively. Heatmaps in the third column show changes in the mutual positions of the embedding vectors of the functional annotations between cancer embedding space with respect to the control embedding space (computed as: FMM Control -FMM Cancer ).
Supplementary Figure 9: Gene maximum "movement" distribution. For each gene, we have a vector with n positions, where n corresponds to the number of the "shifted" GO terms. Each entry of this n-dimensional vector corresponds to the "movement" (change of mutual positional) of the gene and the GO term. This "movement" can either be positive (a gene is going closer to the GO term in the cancer space), or negative (a gene is going further from the GO term in the cancer space). Since this "movement" is bi-directional (getting closer or further), we use the absolute value of the "movement" at each coordinate of this vector, to keep only the magnitude of this movement independently of the direction of the "movement". Then, since all the values in the n-dimensional vector are now positive, for each gene we assign as its cancer-related score the maximum value (maximum magnitude of movement) in its corresponding vector. The red lines represent the 95 th and 5 th percentiles of the distributions. Based on these thresholds, we consider cancer-related gene predictions whose genes that are above the 95 th percentile of the maximum "movement" distribution. The distributions are for breast, lung, colon, and prostate cancers.
Supplementary Figure 10: For the human embedding spaces. The plot illustrates the Lin's semantic similarity between the top 500 closest functional annotation embedding vectors in the tissuesspecific embedding spaces and the Lin's semantic similarity between the top 500 farthest functional annotation embedding vectors in the tissues-specific embedding spaces. The plot shows this measure for human embedding spaces generated by applying the NMTF algorithm on the corresponding tissue-specific PPI network with a different number of dimensions (48,96,144,192,240,288,300,400,500,600,700,800,900,1000). In all the cases, we find that the Lin's semantic similarity of the 500 closest pairs of annotation embedding vectors in the embedding space is statistically higher than the average Lin's semantic similarity of the 500 farthest pairs (one-sided Mann Whitney U test p-value < 0.05).
Supplementary Figure 11: For the human species-specific embedding space. Each panel shows the Relative Square Error (RSE) of FMMs corresponding to the cancer and control tissues-specific embedding spaces of increasing dimensions (from 48 to 288 with a step of 48 dimensions and from 400 to 1000 with a step of 100 dimensions).
Supplementary Figure 12: We summarize the meaning of the 8 Pan-cancer annotations into functional domains (see Supplementary section 1.1.1). The panel represents the functional domains that cluster these annotations (the inner circle), and the outer represents the classification of the domains into the hallmarks of cancer.
Supplementary Figure 13: Our FMM-based method uncovers the functional interactions between GO BP terms that are identified by the standard gene-centric approach (based on clustering and functional enrichment analyses) in four cancer tissue-specific PPI embedding spaces (breast, lung, colon, and prostate). For each cancer tissue-specific PPI embedding space, we take the subset of GO BP terms that are statistically enriched based on the gene-centric approach (detailed in Supplementary section 1.2.2). Then, for a pair of GO BP terms, we set the ground truth as one if they are enriched in the same cluster (zero otherwise). For the same pair, we set the prediction score as the value of their embedding vectors' cosine distance in the embedding space, as captured by the FMM. Finally, we compute the area under the receiver operating characteristic curve (AUROC) (Bradley, 1997) between the ground truth and the prediction score. Each panel shows the corresponding ROC curves with its AUROC.
Supplementary Figure 14: Our FMM-based method uncovers the functional interactions between GO BP terms that are identified by the standard gene-centric approach (based on clustering and functional enrichment analyses) in the control tissue-specific PPI embedding spaces of four cancer types (breast, lung, colon, and prostate). For each control tissue-specific PPI embedding space, we take the subset of GO BP terms that are statistically enriched based on the gene-centric approach (detailed in Supplementary section 1.2.2). Then, for a pair of GO BP terms, we set the ground truth as one if they are enriched in the same cluster (zero otherwise). For the same pair, we set the prediction score as the value of their embedding vectors' cosine distance in the embedding space, as captured by the FMM. Finally, we compute the area under the receiver operating characteristic curve (AUROC) (Bradley, 1997) between the ground truth and the prediction score. Each panel shows the corresponding ROC curves with its AUROC.  The statistics for the tissue-specific PPI networks in this study. Column one, "Cancer," specifies the type of cancer that we analyzed; column two, "TCGA Project," gives the name of the project from TCGA that produced the data that we used; column three, "# of patient samples" specifies the number of patient samples in the project from column two; column four, "Disease Type," specifies the numbers of patient samples from the corresponding project with a specific cancer type. The statistics for the tissue-specific PPI networks in this study. Column "Network" presents the tissue-specific PPI network that we analyzed column; column, "# Nodes," presents the number of nodes in the PPI network; column, "# Edges," presents the number of edges between the nodes; column, "#Density," presents the edge density of the corresponding PPI network. . For each tissue-specific PPI embedding space, we cluster genes whose embedding vectors are close in the space based on their cosine distance, and then we measure the enrichment of those clusters in GO BP annotations. The first column, "Matrix," indicates the matrix representation of the tissue-specific PPI network. The second column, "Data set," specifies the tissue-specific PPI network. The third column, "%Clusters," shows the percentage of clusters with at least one GO BP term enriched. The fourth column, "%Genes," presents the percentage of enriched genes in the clusters (out of the total number of genes in the corresponding tissue-specific PPI network). The sixth column, "%GO," shows the percentage of GO BP terms enriched in the clusters (out of the total GO BP terms that annotate the genes of the corresponding tissue-specific PPI network).  The adjacency embedding spaces of the most prevalent cancers (breast, prostate, lung, and colorectal cancer) and their control tissues (breast glandular cells, prostate glandular cells, lung pneumocytes, and colorectal glandular cells) are functionally organized. The first column, "Embedding," lists the tissues. The second column, "Intra-SS," shows the average Lin's semantic similarity of those annotations whose embedding vectors cluster together based on their cosine distances in the embedding space. The third column, "Inter-SS," shows the average Lin's semantic similarity of those annotations whose embedding vectors do not cluster together based on their cosine distances in the embedding space. The fourth column, "Fold," displays how many times the average Lin's semantic similarity of those annotations whose embedding vectors cluster together based on their cosine distances in the embedding space is higher than of those annotations whose embedding vectors do not cluster together. The fifth column, "p-value Fold," shows the p-value from a one-sided Mann-Whitney U test comparing Lin's semantic similarity between annotations whose embedding vectors cluster together and those with non-clustered embedding vectors. The sixth column, "p-value (PPMI)," shows the p-value from a one-sided Mann-Whitney U test comparing Lin's semantic similarity between annotations that cluster together based on their proximity in the PPMI embedding space and those annotations that cluster together based on their proximity in the corresponding adjacency embedding space.  Table 5: Optimal number of dimensions for each tissue-specific and species-specific PPI embedding spaces. Column "Network," specifies the tissue, or species-specific PPI network. Column, "# Optimal Dimensions," contains the optimal number of dimensions that we found experimentally as explained in section 2.5 of the paper, that we then used for generating the corresponding embedding space by our NMTF-based procedure explained in the paper.  ) and generate the embedding space by using the NMTF algorithm. Then, we use our new FMM-based method to evaluate the functional organization of these random PPI embedding spaces (detailed in section 2.6 of the main manuscript). The first column, "Embedding," lists the randomized tissue-specific PPI embedding spaces. The second column, "Intra," shows the average Lin's semantic similarity of those annotations whose embedding vectors cluster together based on their cosine distances in the randomized tissue-specific embedding space. The third column, "Inter," shows the average Lin's semantic similarity of those annotations whose embedding vectors do not cluster together based on their cosine distances in the randomized tissue-specific embedding space. The fourth column, "Fold," displays how many times the average "Intra" semantic similarity is higher than the "Inter" semantic similarity. The fifth column, "p-value," shows the one-sided Mann-Whitney U test p-value between the distributions of "Intra" and "Inter".

Shifted Stable
Breast 58    . Column one, "Annotations," presents the shifted annotations in breast cancer; column two, "# Norm," presents the "total movement" of the annotations (detailed in section 2.7, of the paper); column three, "# Cancer related," presents whether the annotations is part of our cancer-related set (True) or not (False); column four, "# Bibliography," presents the number of publications in Pubmed that relate the function to breast cancer.  . Column one, "Annotations," presents the shifted annotations in prostate cancer; column two, "# Norm," presents the "total movement" of the annotations (detailed in section 2.7, of the paper); column three, "# Cancer related," presents whether the annotations is part of our cancer-related set (True) or not (False); column four, "# Bibliography," presents the number of publications in Pubmed that relate the function to prostate cancer.  . Column one, "Annotations," presents the shifted annotations in lung cancer; column two, "# Norm," presents the "total movement" of the annotations (detailed in section 2.7, of the paper); column three, "# Cancer related," presents whether the annotations is part of our cancer-related set (True) or not (False); column four, "# Bibliography," presents the number of publications in Pubmed that relate the function to lung cancer.  . Column one, "Annotations," presents the shifted annotations in colorectal cancer; column two, "# Norm," presents the "total movement" of the annotations (detailed in section 2.7, of the paper); column three, "# Cancer related," presents whether the annotations is part of our cancer-related set (True) or not (False); column four, "# Bibliography," presents the number of publications in Pubmed that relate the function to colorectal cancer. The embedding vectors of the biological functions (GO BP terms) are significantly closer in space to the embedding vectors in the same space of the genes that they annotate than to the embedding vectors of other genes. Column, " Sample," presents the tissuesspecific PPI networks. Column, "Avg Distance Annotate," presents the average cosine distance in the embedding space between the embedding vectors of genes and embedding vectors of those functional annotations that annotate them; column, "Avg Distance Not-Annotate," presents the average cosine distance in the embedding space between the embedding vectors of genes and embedding vectors of those embedded functional annotations that do not annotate them. In all samples, the difference between these distances is statistically significant (p-value of the Mann-Whitney U test < 0.05). Supplementary Table 13: Top 10 shifted genes (the most shifted ones) in breast cancer. The first column, "Gene name," presents the gene names of the top 10 shifted genes. The second column, "PubMed Counts," presents the number of publications in Pubmed that relate the gene to breast cancer. The third column, "Prognostic Marker," indicates if the gene is a prognostic marker ("yes" if it is a marker, "-" otherwise) in breast cancer (based on survival curves collected from the Human Protein Atlas (Pontén et al., 2008)); the fourth column, "Pan-Cancer Marker," presents whether the gene is a prognostic marker for other cancer types.  Top 10 shifted genes in lung cancer. The first column, "Gene name," presents the gene names of the top 10 shifted genes. The second column, "PubMed Counts," presents the number of publications in Pubmed that relate the gene to lung cancer. The third column, "Prognostic Marker," presents if the gene is a prognostic marker ("yes" if it is a marker, "-" otherwise) in lung cancer (based on survival curves collected from the Human Protein Atlas (Pontén et al., 2008)). The fourth column, "Pan-Cancer Marker," presents whether the gene is a prognostic marker for other cancer types.  shifted genes (the most shifted ones) in colorectal cancer. The first column, "Gene name," presents the gene names of the top 10 shifted genes. The second column, "PubMed Counts," presents the number of publications in Pubmed that relate the gene to colorectal cancer. The third column, "Prognostic Marker," presents if the gene is a prognostic marker ("yes" if it is a marker, "-" otherwise) in colorectal cancer (based on survival curves collected from the Human Protein Atlas (Pontén et al., 2008)). The fourth column, "Pan-Cancer Marker," presents whether the gene is a prognostic marker for other cancer types.  et al., 2019). We model these species-specific PPI data as PPI networks in which nodes represent genes (or equivalently in this study, their protein products), and edges connect nodes whose corresponding proteins physically bind. The network statistics of these species-specific PPI networks are presented in Supplementary Table 18. (2014) with its default settings, which corresponds to 10 iterations, to compute the PPMI matrix.
This formula can be interpreted as a diffusion process that captures high order proximities between the nodes in the network; hence, PPMI is a richer representation than the adjacency matrix (Xenos et al., 2021).
Biological Annotations. We use the Gene Ontology Biological Process (GO BP) annotations to represent the biological functions in a cell. As described in the main manuscript (section 2.1), we collect the experimentally validated genes to GO BP annotations from NCBI's web-server (collected on 28 September 2021). However, to have a more generic perspective of the functional organization of the species-specific embedding spaces, here we extend the previous set by also considering the ancestors of the annotations in the GO ontology directed acyclic graph, using

FMM captures the functional organization of different species-specific embedding spaces
In this section, we evaluate the capability of the FMM to capture biologically relevant interactions between the annotation embedding vectors, in six different species-specific embedding spaces. In particular, we take the species-specific protein-protein interaction (PPI) networks of the following species: human, baker's yeast, fission yeast, fruit fly, rat, and mouse (detailed in Supplementary section 2.1.1). We produce the embedding space of each network by using the NMTF algorithm (detailed in section 2.3, of the paper). For these embedding spaces, we find the optimal numbers of dimensions: 240, 80, 80, 100, 100, 100, and 100 for human, baker's yeast, fission yeast, fruit fly, rat, and mouse, respectively (detailed in Supplementary section 2.2.2). Next, we apply our FMM-methodology to obtain the embedding vectors of each of the GO BP annotations and the mutual positions of these vectors, which we call "distances", in the species-specific embedding spaces (detailed in section 2.4, of the paper). Having the corresponding FMMs, we explore if all six species-specific embedding spaces are functionally organized, i.e., the annotations whose embedding vectors are close in a space are more semantically similar (functionally related) than those annotations whose embedding vectors are far in the space (detailed in section 2.6, of the paper). To this end, first we calculate the Pearson's correlation coefficient between the Lin's semantic similarities of the functional annotations and the mutual positions of their embedding vectors in the space (as detailed in section 2.6, of the paper). We find negative correlations of fly, rat, and mouse species-specific embedding spaces, respectively (see Supplementary Tables 16   and 17). We assess the significance of these correlations coefficients by calculating the probability of having the same results if the correlations coefficients were zero (null hypothesis).
Also, we evaluate if those annotations whose embedding vectors cluster together based on their distances in the species-specific embedding spaces, as measured by the cosine distance, are more functionally related than those annotations whose embedding vectors do not cluster in space. To this end, we cluster the functional annotation embedding vectors based on their mutual positions (cosine distances) in the species-specific embedding spaces by applying the k-medoid algorithm to each FMM (detailed in section 2.6, of the paper). For the number of clusters, we use the heuristic rule of thumb (k = n 2 , where n is the number of annotations) (Kodinariya and Makwana, 2013). We end up with 59, 39, 31, 38, 43 and 56 clusters for human, baker's yeast, fission yeast, fruit fly, rat, and mouse, respectively. We observe that the annotations whose embedding vectors cluster together based on their distances in the species-specific embedding spaces, have an average Lin's semantic similarity 1.35, 1.60, 1.60, 1.41, and 1.40 times higher than those annotations whose embedding vectors do not cluster in human, baker's yeast, fission yeast, fruit fly, rat, and mouse species-specific embedding space, respectively (see Supplementary Figure 17 and 18).
To confirm the previous results, we also analyze the functional organization of random PPI networks, i.e., when rewiring the previous species-specific networks randomly. In particular, for each species-specific PPI network, we randomly rewire the corresponding adjacency matrix. Then, we follow the same protocol as we used with the real species-specific PPI networks to generate the gene embedding spaces. We obtain the FMMs of these spaces and evaluate their functional organization following the same clustering approaches explained in the previous paragraph. For each species, we repeat this procedure 100 times, calculating the average intra and inter cluster Lin's semantic similarity of the annotations and keeping the p-value of the corresponding Mann-Whitney U test. Once the 100 repetitions are finished, we correct the p-values for multiple tests by using the Bonferroni correction (Brown, 2008). As expected, we do not find a statistically significant difference in the semantic similarity between the annotations whose embedding vectors are clustered together based on their distances in the space and those annotations whose embedding vectors do not cluster in the space.
Finally, we illustrate the previous property by focusing on the annotation embedding vectors of the 500 closest and 500 farthest pairs of annotations in the species-specific embedding spaces.
Although the observation remains the same (annotations whose embedding vectors are close in the embedding space have higher semantic similarity than those whose embedding vectors are distant in the space), we find a bigger difference in the Lin's semantic similarity between these sets of annotations (see Supplementary Figure 19 and 10). Indeed, the mean average Lin's semantic similarity of the annotations corresponding to the 500 closest pairs of vectors in the embedding space is close to 0.9 in all six species-specific embedding spaces. In contrast, this average is close to 0.1 in the annotations corresponding to the 500 farthest pairs of vectors in the space. Altogether, these results demonstrate that our FMM-based method captures the functional organization of different gene embedding spaces from a functional perspective for all six species.
To conclude, we also evaluate the ability of the FMMs to capture the similarities in the functional organization of different species-specific embedding spaces (see section 2.5 of the paper).
As expected, we find that the functional organization of the embedding space of evolutionary related species is more similar (lower RSE between their FMMs) than the functional organization of embedding spaces of evolutionary distant species. For instance, the RSE between human and mouse FMMs is 0.15, while it is 0.20 between human and baker's yeast (see Supplementary Tables 21 and 20). Thus, as captured by our new FMM-based method, similarities between the functional organization of different species-specific embedding spaces correctly identify the evolutionary closeness between the species. Although this observation is promising, a further exploration of its capabilities is left for future research, and we devote the rest of the paper to cancer-related applications.

2.2.2
The similarity between the FMMs of different dimensional spaces reveal the optimal dimensionality of the embedding space In this section, we apply our FMM-based method to find the optimal dimensionality of six speciesspecific embedding spaces. First we produce the embedding space of each species-specific PPI network by using the NMTF algorithm (following the same methodology explained in the first paragraph of the Supplementary section 2.2.1). To generate these embeddings, we use different sets of dimensions. We use the heuristic rule of thumb (k = n 2 , where n is the number of nodes in each species-specific PPI network) (Kodinariya and Makwana, 2013) to define the previous set of dimensions. This heuristic rule gives 95.6, 38.3, 28.5, 47.2, 44.8, and 26.6 dimensions for human, baker's yeast, fission yeast, fruit fly, rat, and mouse, respectively. For human, we round the dimensions to 96, and we use half of this dimensionality (48) to define the increment in the number of dimensions from 48 to 288 dimensions. For the other model organisms, we also round the dimensions to 40, 30, 50, 45, and 30 in baker's yeast, fission yeast, fruit fly, rat, and mouse, respectively. In this case we decide to choose the number of dimensions that we obtain from baker's yeast since it has the most complete PPI. Hence, similar to human, we use half of its dimensionality (20) to define the increment in the number of dimensions from 20 to 100 dimensions. As detailed in section 2.4 of the paper, we obtain the FMMs by first generating the embedding vectors of each of the GO BP annotations in each embedding space and then calculating their mutual positions.
By tracking the Relative Square Errors (RSEs) of the FMMs across previous sets of dimensions (detailed in section 2.5, of the paper), we find that the mutual positions of the embedding vectors of the functional annotations converge to a stable i.e., non-changing functional organization, after 240, 80, 80, 100, 100, and 100 dimensions in human, baker's yeast, fission yeast, fruit fly, rat, and mouse embedding spaces, respectively (RSE between their FMMs plateaus, i.e., stops decreasing, see Supplementary Figures 11 and 20). We farther validate this observation by extending the sets of dimensions to 600, 700, 800, 900, 1000 dimensions in human, and 150, 200, 250, 300 in baker's yeast, fission yeast, fruit fly, rat, and mouse, respectively. As expected, we do not find an increment in the RSE after the optimal dimensionality. Thus, we conclude that 240, 80, 80, 100, 100, and 100 dimensions are optimal dimensionality for human, baker's yeast, fission yeast, fruit fly, rat, and mouse embedding spaces, respectively. We hypothesize the number of dimensions may reflect the increasing evolutionary complexity of the organisms.
Supplementary Figure 17: Lin's semantic similarity (Semantic similarity) between the annotations whose embedding vectors are clustered together (Intra) based on mutual position in human embedding space and those that are not (Inter). The plot shows this measure for human embedding spaces generated by applying the NMTF algorithm on the corresponding tissue-specific PPI network with different number of dimensions (48,96,144,192,240,288,300,400,500,600,700,800,900,1000). In all the cases, the intra cluster Lin's semantic similarity is statistically higher than the inter cluster one (one-sided Mann Whitney U test p-value < 0.05).
Supplementary Figure 18: For five species-specific embedding spaces: Saccharomyces cerevisiae (denoted by baker's yeast), Schizosaccharomyces pombe (denoted by fission yeast), Rattus norvegicus (denoted by rat), Drosophila melanogaster (denoted by fruit fly), and Mus musculus (denoted by mouse). The plot shows the Lin's semantic similarity (Semantic similarity) between the annotations whose embedding vectors are clustered together (Intra) based on mutual position in the corresponding species-specific embedding spaces and those that are not (Inter). The plot illustrate this measure for the species-specific embedding spaces generated by applying the NMTF algorithm on the corresponding tissue-specific PPI network with number of dimensions (20, 40, 60, 80, 100, 150, 200, 250 and 300). In all the cases, the intra cluster Lin's semantic similarity is statistically higher than the inter cluster one (one-sided Mann Whitney U test p-value < 0.05).  Supplementary Table 16: The first row, labeled "Species," presents each of the 13 dimensions that we tested for the human PPI network embedding spaces: 48, 96, 144, 192, 240, 288, 300, 400, 500, 600, 700, 800, 900, and 1000. Each column shows the Pearson's correlation coefficient (Benesty et al., 2009) between the pairwise cosines distance of the annotations' embedding vectors in the space and the semantic similarities of the annotations (measured by the Lin's semantic similarity ). We assess the significance of these correlations coefficients by calculating the probability of having the same results if the correlations coefficients were zero (null hypothesis). We find that all coefficients are statistically significant (p-value < 0.05). Regarding the p-value, their values are close to 0, but due to the fact that p-values in Python are float64 objects, i.e., 16 decimals are reported), they are rendered to 0.  Homo sapiens sapiens (denoted by "Human"), Saccharomyces cerevisiae (denoted by "Baker's yeast"), Schizosaccharomyces pombe (denoted by "Fission yeast"), Drosophila melanogaster (denoted by "Fruit fly"), Mus musculus (denoted by "Mouse") and Rattus norvegicus (denoted by "Rat"). Column, "# Nodes", specifies the number of nodes in the species-specific PPI network; column, "# Edges," contains the number of edges between the nodes; column, "# Density," specifies the edge density of the corresponding species-specific PPI network.  For the six species: Homo sapiens sapiens (denoted by "Human"), Saccharomyces cerevisiae (denoted by "Baker's yeast"), Schizosaccharomyces pombe (denoted by "Fission yeast"), Drosophila melanogaster (denoted by "Fruit fly"), Rattus norvegicus (denoted by "Rat") and Mus musculus (denoted by "Mouse"). Column, "# GO BP terms," presents the number of GO BP terms that annotates at least one gene in the corresponding species-specific PPI network.  For the five species: Homo sapiens sapiens (denoted by "Human"), Saccharomyces cerevisiae (denoted by "Baker's yeast"), Schizosaccharomyces pombe (denoted by "Fission yeast"), Drosophila melanogaster (denoted by "Fruit fly") and Mus musculus (denoted by "Mouse"). The table shows the million years from the common ancestor between the species.