Functional protein representations from biological networks enable diverse cross-species inference

Abstract
Transferring knowledge between species is key for many biological applications, but is complicated by divergent and convergent evolution. Many current approaches to this problem leverage sequence and interaction network data to transfer knowledge across species, as exemplified by network alignment methods. While these techniques perform well, they are limited in scope, creating metrics to address one specific problem or task. We take a different approach by creating an environment where multiple knowledge transfer tasks can be performed using the same protein representations. Specifically, our kernel-based method, MUNK, integrates sequence and network structure to create functional protein representations, embedding proteins from different species in the same vector space. First, we show that proteins in different species that are close in MUNK-space are functionally similar. Next, we use these representations to share knowledge of synthetic lethal interactions between species. Importantly, we find that the results using MUNK-representations are at least as accurate as existing algorithms for these tasks. Finally, we generalize the notion of a phenolog ('orthologous phenotype') to use functionally similar proteins (i.e., those with similar representations). We demonstrate the utility of this broadened notion by using it to identify known phenologs and novel non-obvious ones supported by current research.


S1.1 Parameter choice
For all regularized Laplacians, we used a value of λ = 0.05. We found that the resulting relationship between Resnik similarity and MUNK similarity score did not vary significantly for λ values between 0.005 and 0.1.
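To make the role of λ concrete, the following minimal sketch computes a regularized Laplacian for a toy graph. It assumes the standard definition K = (I + λL)^(-1) with graph Laplacian L = D − A; that definition, the function name `regularized_laplacian`, and the three-node path graph are assumptions made for illustration and are not taken from this supplement.

```python
def regularized_laplacian(adj, lam=0.05):
    """Return K = (I + lam * L)^-1, where L = D - A is the graph Laplacian.

    `adj` is a symmetric adjacency matrix given as a list of lists.
    The inverse is computed by Gauss-Jordan elimination (toy sizes only).
    """
    n = len(adj)
    # Build M = I + lam * (D - A), augmented on the right with the identity.
    aug = []
    for i in range(n):
        degree = sum(adj[i])
        row = [-lam * adj[i][j] for j in range(n)]
        row[i] += 1.0 + lam * degree
        aug.append(row + [1.0 if j == i else 0.0 for j in range(n)])
    # Reduce the left half to the identity; the right half becomes M^-1.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Toy path graph 0 - 1 - 2 (hypothetical data).
path_adj = [[0, 1, 0],
            [1, 0, 1],
            [0, 1, 0]]
K = regularized_laplacian(path_adj, lam=0.05)
```

Under this definition, larger values of λ spread similarity further along the network, so K[0][2] grows relative to K[0][1] as λ increases.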

S1.2 Function Prediction Methods
We assess prediction accuracy using leave-one-out cross-validation. Let $K^s = \{k^s_{ij}\}$ denote the regularized Laplacian for species $s$. For GO term $g$, let $G^s_g$ be the set of proteins in species $s$ that are annotated with $g$. The same-species annotation score $c_s(p,g)$ for a given protein $p$ and GO term $g$ is defined as a sum over the proteins in $G^s_g$, in which $p$ is excluded from the sum (i.e., if it is contained in $G^s_g$). We also construct a cross-species annotation score $c_{s_0,s_1}(p,g)$ for each protein, in which MUNK scores $d_{pi}$ with respect to proteins in the other organism are used, where $d_{pi} = D_{12}(p,i)$ is the MUNK score for protein $p$ in species $s_0$ and protein $i$ in species $s_1$. The prediction score is then

$$h(p,g) = \alpha\, c_{s_0}(p,g) + (1-\alpha)\, c_{s_0,s_1}(p,g).$$

To use multiple cross-species annotations, say $n$, we generalize $h(p,g)$ to a convex combination of the same- and cross-species annotation scores:

$$h(p,g) = \alpha_0\, c_{s_0}(p,g) + \sum_{i=1}^{n} \alpha_i\, c_{s_0,s_i}(p,g), \quad \text{such that } \sum_{i=0}^{n} \alpha_i = 1.$$

We evaluate predictions using the area under the receiver operating characteristic curve (AUC) and the maximal F-score (over all detection thresholds). Since we are concerned with predicting rare GO terms, we find that maximal F-score is generally a more discriminative metric. We set the convex coefficients $\{\alpha_i\}$ via cross-validation.
* To whom correspondence should be addressed: mdml@cs.umd.edu

S1.3 Phenolog Discovery
Our method matches that used in (1), using protein pairs with high MUNK similarity scores rather than homologs obtained from HomoloGene. (Note that none of the landmarks, which are a subset of the homologs, are used to discover new phenologs.) Specifically, let $P_1$ be the genes associated with the phenotype in species 1 and $P_2$ be the genes associated with the phenotype in species 2. Our contingency table consists of the counts of the number of 'MUNK-homologs' involving $P_1 \cap P_2$, $P_1 \setminus P_2$, $P_2 \setminus P_1$, and $(\Omega \setminus P_1) \setminus P_2$, with $\Omega$ denoting the set of all close pairs.
We used a Fisher exact test to measure significance at a level of 0.05, corrected for multiple testing using a Bonferroni correction: there were 1,278,312 possible phenotype matches, so we considered a match significant if its uncorrected P-value was less than $0.05 / 1{,}278{,}312 \approx 3.9 \times 10^{-8}$.
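The significance test for a candidate phenolog can be sketched as follows. This is a minimal illustration that assumes a one-sided (enrichment) Fisher exact test; the supplement does not state which alternative was used. The function names and the example counts are hypothetical; only the number of phenotype matches and the 0.05 level come from the text.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact P-value for the 2x2 table [[a, b], [c, d]].

    Returns the hypergeometric probability of seeing at least `a` counts in
    the top-left cell when all row and column totals are held fixed.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(total - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(total, col1)


def bonferroni_threshold(alpha, n_tests):
    """Per-test P-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


N_PHENOTYPE_MATCHES = 1_278_312          # from the text
threshold = bonferroni_threshold(0.05, N_PHENOTYPE_MATCHES)

# Hypothetical counts of MUNK-homolog pairs falling in
# (P1 and P2, P1 only, P2 only, neither).
table = (12, 30, 25, 5000)
p_value = fisher_exact_greater(*table)
is_significant = p_value < threshold
```

Exact integer binomial coefficients are used so that very small P-values, which are the only ones that can pass a cutoff of this size, are not lost to floating-point underflow in intermediate terms.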

S2.1 Protein-protein interaction networks
We constructed protein-protein interaction (PPI) networks in S.c., S.p., mouse, and human. The S.c. and S.p. networks were obtained from the Biological General Repository for Interaction Datasets (BioGRID) (2), version 3.4.157. Mouse and human PPIs were obtained from the STRING database, version 9.1 (3). We processed the PPI networks by mapping protein names to a common namespace. Genes that could not be mapped via the UniProt database were removed from the PPI networks entirely. We provide further details of the network processing below. Table S1 shows summary statistics for the PPI networks before and after processing.
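The caption of Table S1 notes that each graph was restricted to the two-core of its largest connected component. A minimal sketch of that restriction is given below, assuming an undirected network stored as a dictionary of neighbor sets; the function names and the toy network are hypothetical and this is not the authors' processing code.

```python
from collections import deque


def largest_component(adj):
    """Return the node set of the largest connected component of `adj`."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:  # breadth-first search from `start`
            u = queue.popleft()
            for v in adj[u]:
                if v not in component:
                    component.add(v)
                    queue.append(v)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def two_core(adj, nodes):
    """Repeatedly remove nodes with fewer than two neighbors within `nodes`."""
    sub = {u: adj[u] & nodes for u in nodes}  # induced subgraph (copies sets)
    stack = [u for u, nbrs in sub.items() if len(nbrs) < 2]
    while stack:
        u = stack.pop()
        if u not in sub:
            continue  # already removed
        for v in sub.pop(u):
            sub[v].discard(u)
            if len(sub[v]) < 2:
                stack.append(v)
    return sub


# Toy network: triangle a-b-c, pendant node d, and a separate edge e-f.
toy = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c"}, "e": {"f"}, "f": {"e"},
}
core = two_core(toy, largest_component(toy))
```

In the toy network, the edge e-f is dropped because it lies outside the largest component, and d is dropped because it has only one neighbor.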

S2.2 Synthetic lethal interactions
We constructed datasets of synthetic lethal interactions (SLI) in S.c. and S.p. from published epistatic miniarray profiles (E-MAPs). E-MAPs include genetic interaction scores for pairs of genes, where the magnitude of the score reflects the strength of the genetic interaction. We downloaded E-MAPs for S.c. from the supplementary information of Collins, et al. (4), and for S.p. from the supplementary information of Roguev, et al. (5). We classified each pair of genes in each E-MAP as SLI, non-SLI, or uncertain. We used the thresholds from the Collins, et al. (4) supplementary information to classify pairs in S.c. Given a pair with E-MAP score ε, we classified it as SLI if ε < −3, uncertain if −3 ≤ ε < −1, and non-SLI otherwise. Similarly, we used the threshold for synthetic lethality from the Roguev, et al. (5) supplementary information and used the same threshold for uncertainty, classifying S.p. pairs as SLI if ε < −2.5, uncertain if −2.5 ≤ ε < −1, and non-SLI otherwise. We also removed pairs of genes in which either gene is not found in the corresponding PPI networks described in the main text. The resulting datasets included 7,165 SLI and 123,507 non-SLI in S.c., and 5,599 SLI and 97,541 non-SLI in S.p.
We then standardized the datasets by mapping gene names to UniProt Accession IDs (6). Genes that could not be mapped via UniProt were excluded from this study, as were those that were not found in the processed PPI networks. For the BioGRID SLI dataset, we followed Jacunski, et al. (7) by sampling an equivalent number of non-SLI pairs from genes in the PPI networks that do not partake in SLI in the BioGRID dataset. Table S2 shows summary statistics of the SLI datasets before and after processing.
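The thresholding rule above can be written as a short function. This is only a sketch: the function and constant names are hypothetical, while the threshold values (−3 for S.c., −2.5 for S.p., and −1 for the uncertain band) are those stated in the text.

```python
SLI_THRESHOLD = {"S.c.": -3.0, "S.p.": -2.5}  # species-specific SLI cutoffs
UNCERTAIN_THRESHOLD = -1.0                    # shared by both species


def classify_emap_score(score, species):
    """Label an E-MAP score as 'SLI', 'uncertain', or 'non-SLI'."""
    if score < SLI_THRESHOLD[species]:
        return "SLI"
    if score < UNCERTAIN_THRESHOLD:
        return "uncertain"
    return "non-SLI"


# The same score can receive different labels in the two species.
label_sc = classify_emap_score(-2.7, "S.c.")  # between -3 and -1
label_sp = classify_emap_score(-2.7, "S.p.")  # below -2.5
```

Because the inequalities are strict at the SLI cutoff, a score exactly equal to the cutoff falls in the uncertain class.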

S3 Results
S3.1 Associations of MUNK scores with functional similarity for other pairs of species
We examined the association between MUNK similarity scores and functional similarity for pairs of proteins in additional pairs of species, using the methodology described in the main text. Figures S1 and S2 show the results for embedding human into yeast, and mouse into yeast.
S3.2 Evaluating the generalization of synthetic lethal interaction classifiers to held-out genes
Predicting synthetic lethal interactions between gene pairs using features constructed for individual genes is an example of a pair-input classification problem. A challenge with evaluating classifiers trained on pair inputs with held-out data is that, for a given pair (u,v), it is possible that the features for only u, only v, both u and v, or neither u nor v can be found in the training data (8). Thus, information concerning genes found in the held-out data may be 'leaked' to the classifier during training. To evaluate the effect of this issue, Park & Marcotte (8) suggest separately evaluating classifications for gene pairs that contain one, two, or no genes in the training data. This is analogous to holding out individual genes instead of gene pairs at training time. We therefore evaluate the effect of pair inputs by repeating the experiments above while holding out genes instead of gene pairs for evaluation. We report the results in Table S7. We find that the classifiers are able to predict SLI for genes not found in the training data, but with a significant drop in performance compared to genes found in the training data. On the BioGRID dataset, the classifiers achieve an AUROC of 0.872 in S.c. (0.823 in S.p.), an AUPRC of 0.875 (0.814), and a maximum F1 of 0.797 (0.772). On the chromosome biology dataset, the classifiers achieve an AUROC of 0.701 in S.c. (0.691 in S.p.), an AUPRC of 0.160 (0.207), and a maximum F1 of 0.202 (0.285). We hypothesize that the larger drop in performance on the chromosome biology data is due to the matched nature of the S.c. and S.p. datasets. We also find similar drops in performance for SINATRA when holding out genes instead of pairs (also in Table S7).

Table S1. Summary statistics of PPI networks. We processed the graphs to restrict to the two-core of the largest connected component.

Table S6. Results training linear support vector machines to classify synthetic lethal interactions on S.c. and S.p. data simultaneously. We compute performance separately for each species (indicated by 'Test species'). For each statistic, we report the average on held-out data from 4-fold cross-validation over gene pairs, and bold the highest (best) score.
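For the pair-input issue discussed in S3.2, the grouping of test pairs suggested by Park & Marcotte (8) can be sketched as below: each held-out pair is binned by how many of its genes also occur in the training pairs. The function name and the toy gene pairs are hypothetical, and this is an illustration of the idea rather than the evaluation code used for Table S7.

```python
def split_by_seen_genes(train_pairs, test_pairs):
    """Group test pairs by the number of their genes (0, 1, or 2)
    that also appear in at least one training pair."""
    train_genes = {gene for pair in train_pairs for gene in pair}
    groups = {0: [], 1: [], 2: []}
    for u, v in test_pairs:
        n_seen = (u in train_genes) + (v in train_genes)
        groups[n_seen].append((u, v))
    return groups


# Toy example with hypothetical gene names.
train = [("g1", "g2"), ("g2", "g3")]
test = [("g1", "g3"), ("g1", "g4"), ("g5", "g6")]
groups = split_by_seen_genes(train, test)
```

Performance on `groups[0]` corresponds to the strictest setting, in which neither gene of a test pair was available to the classifier during training.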