Joint embedding of biological networks for cross-species functional alignment

Abstract

Motivation: Model organisms are widely used to better understand the molecular causes of human disease. While sequence similarity greatly aids this cross-species transfer, sequence similarity does not imply functional similarity, and thus several current approaches incorporate protein–protein interactions to help map findings between species. Existing transfer methods either formulate the alignment problem as a matching problem that pits network features against known orthology, or, more recently, as a joint embedding problem.

Results: We propose a novel state-of-the-art joint embedding solution: Embeddings to Network Alignment (ETNA). ETNA generates individual network embeddings based on network topological structure and then uses a Natural Language Processing-inspired cross-training approach to align the two embeddings using sequence-based orthologs. The final embedding preserves both within- and between-species gene functional relationships, and we demonstrate that it captures both pairwise and group functional relevance. In addition, ETNA's embeddings can be used to transfer genetic interactions across species and identify phenotypic alignments, laying the groundwork for drug repurposing and translational studies.

Availability and implementation: https://github.com/ylaboratory/ETNA
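To make the alignment idea concrete, the sketch below illustrates anchor-based alignment of two independently learned gene embeddings using ortholog pairs, in the spirit of bilingual word-embedding mapping from NLP. This is a minimal illustration, not ETNA's actual cross-training procedure; all matrices, dimensions, and anchor indices are random placeholders.

```python
# Minimal sketch (not ETNA's training loop): align two independently learned
# gene embeddings with ortholog anchor pairs via an orthogonal mapping.
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
emb_hsa = rng.normal(size=(5000, 128))   # human gene embedding (genes x dims), placeholder
emb_mmu = rng.normal(size=(4000, 128))   # mouse gene embedding (genes x dims), placeholder

# Ortholog anchors: index pairs (human_gene_idx, mouse_gene_idx), placeholder
anchors = rng.integers(0, [5000, 4000], size=(800, 2))

# Learn a rotation mapping mouse anchor embeddings onto their human orthologs
R, _ = orthogonal_procrustes(emb_mmu[anchors[:, 1]], emb_hsa[anchors[:, 0]])
emb_mmu_aligned = emb_mmu @ R

# Cross-species similarity of any human/mouse gene pair in the shared space
similarity = emb_hsa @ emb_mmu_aligned.T   # (5000 x 4000) score matrix
```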

Table S1. Selected hyperparameters for ETNA models between H. sapiens (Hsa) and four model organisms: M. musculus (Mmu), S. cerevisiae (Sce), D. melanogaster (Dme), and C. elegans (Cel). α, γ, ω, and λ are described in Equation (1) and ϕ is described in Equation (5). These parameters were searched on a logarithmic scale (base 10). Each individual species embedding has its own α, γ, ω, and λ parameters. For each joint embedding listed in the table, the first column corresponds to the hyperparameters for the H. sapiens embedding, and the second corresponds to those of the model organism.
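A base-10 logarithmic search over these hyperparameters can be set up as in the sketch below. The exponent ranges and the number of candidate sets are hypothetical placeholders, not the ranges used for the models in Table S1.

```python
# Sketch of a base-10 log-scale random hyperparameter search over
# alpha, gamma, omega, lambda, and phi (ranges are illustrative only).
import numpy as np

rng = np.random.default_rng(42)

def sample_log10(low_exp, high_exp):
    """Draw one value uniformly in log10 space between 10**low_exp and 10**high_exp."""
    return 10.0 ** rng.uniform(low_exp, high_exp)

def sample_hyperparameters():
    return {
        "alpha":  sample_log10(-3, 1),
        "gamma":  sample_log10(-3, 1),
        "omega":  sample_log10(-3, 1),
        "lambda": sample_log10(-3, 1),
        "phi":    sample_log10(-3, 1),
    }

candidates = [sample_hyperparameters() for _ in range(10)]  # e.g. 10 random sets
```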

Figure S1. ETNA's performance is robust to different neural network architectures and choices of anchors. In each subfigure, the change in prediction performance using the alternative neural network architecture is shown as the log2 fold change over the original AUPRC. (A) Performance change with respect to different embedding dimensions (64, 128, 256, 512, 1024), where ETNA's default embedding dimension is 128. (B) Performance change with respect to different numbers of hidden layers (1, 2, 3, 4), where ETNA's original architecture had 1 hidden layer. (C) Performance change with respect to different activation functions (ReLU, ELU, Sigmoid, Linear), where ETNA's default is LeakyReLU with a negative slope of 0.1. (D) Performance change with respect to different choices of cross-species anchors, based on BLAST bit score cutoffs (100, 200, 400), instead of orthologs as is the default in ETNA.
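The quantity plotted in Figure S1 is a log2 fold change relative to the default configuration's AUPRC; the values in the snippet below are illustrative, not results from the paper.

```python
# Log2 fold change of an alternative configuration's AUPRC over the default AUPRC.
import numpy as np

auprc_original = 0.30     # AUPRC of the default ETNA configuration (placeholder)
auprc_alternative = 0.33  # AUPRC with an alternative architecture choice (placeholder)

log2_fold_change = np.log2(auprc_alternative / auprc_original)
```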

Figure S2. ETNA's performance is robust to the choice of hyperparameters. The red solid line shows the mean AUPRC over random (log2(AUPRC / prior)) based on 10 random sets of hyperparameters, and ribbons denote the 95% confidence interval. While hyperparameters selected via cross-validation (ETNA (CV)) performed the best, models with random hyperparameters (ETNA (random)) can also achieve strong performance and, within a few epochs of training, consistently outperform MUNK (the best-performing previously existing method). Using ETNA with suggested default parameters (ETNA (default)) typically results in comparable or better performance than with random hyperparameters, and the performance gap with cross-validated hyperparameters is not large.
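AUPRC over random is log2(AUPRC / prior), where the prior is the positive-class frequency (the expected AUPRC of a random ranking). A minimal sketch with toy scores and labels:

```python
# AUPRC over random = log2(AUPRC / prior); scores and labels here are toy data.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)   # 1 = cross-species pair shares a GO term (toy)
scores = rng.random(size=1000)           # model similarity scores (toy)

auprc = average_precision_score(labels, scores)
prior = labels.mean()                    # expected AUPRC of a random ranking
auprc_over_random = np.log2(auprc / prior)
```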

Table S2. Summary statistics of PPI networks and GO annotation information. For each species, we report the network's genome coverage ((# vertices) / (# estimated protein-coding genes)), the # of orthologs with H. sapiens, and the % of genes in the PPI network with at least 1 GO annotation.
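These summary statistics can be computed directly from a PPI gene set and a gene-to-GO mapping, as in the sketch below; the gene names, gene count, and annotation dictionary are hypothetical placeholders.

```python
# Sketch of the Table S2 statistics from a PPI network and GO annotations (toy inputs).
network_genes = {"GENE1", "GENE2", "GENE3"}             # vertices of the PPI network (toy)
estimated_protein_coding_genes = 20000                   # per-species estimate (toy)
go_annotations = {"GENE1": {"GO:0008150"}, "GENE3": set()}  # gene -> GO terms (toy)

coverage = len(network_genes) / estimated_protein_coding_genes
annotated = [g for g in network_genes if go_annotations.get(g)]
pct_annotated = 100 * len(annotated) / len(network_genes)
```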

Table S3. AUROC of ETNA, MUNK, IsoRank, and HubAlign for predicting cross-species gene pairs that share GO annotations, based on 5-fold cross-validation. Because MUNK's predictions require choosing a source organism and a target organism, we present its performance for both directions (the arrow points from source to target).
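The sketch below shows a simplified version of this kind of fold-wise AUROC evaluation: hold out one fold of cross-species pairs at a time and average the AUROC over folds. It omits per-fold retraining, and the scores and labels are random placeholders rather than real predictions.

```python
# Simplified sketch of fold-wise AUROC evaluation over cross-species gene pairs.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
scores = rng.random(10000)                 # similarity score per cross-species pair (toy)
labels = rng.integers(0, 2, size=10000)    # 1 = pair shares at least one GO annotation (toy)

aurocs = []
for _, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(scores):
    aurocs.append(roc_auc_score(labels[test_idx], scores[test_idx]))
mean_auroc = np.mean(aurocs)
```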

Table S4. Mean validation error and mean test error of ETNA on 4 species pairs, based on 5-fold cross-validation. AUPRC over random is calculated as in Table 2.

Table S5. Prediction of genetic interactions from S. cerevisiae (Sce) to S. pombe (Spo) and H. sapiens (Hsa), where the Sce training set is subsampled to the size of the target organism's training set. For each SL prediction task, A → B indicates that SL pairs in A were used for training to predict SL pairs in B. Here, instead of using all 13,920 original training examples in Sce, the training set was subsampled to match the training set size for Spo and Hsa (1,078 and 1,883 examples, respectively).
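Subsampling the Sce training set to the target size can be done with a simple draw without replacement, as in the sketch below; the index-based selection is illustrative rather than the exact pipeline used in the paper.

```python
# Sketch of subsampling the Sce SL training pairs to the target organism's training set size.
import numpy as np

rng = np.random.default_rng(0)
n_sce_training_pairs = 13920
target_size = 1078        # e.g. the Spo training set size (1,883 for Hsa)

subsample_idx = rng.choice(n_sce_training_pairs, size=target_size, replace=False)
```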