Interpretable network propagation with application to expanding the repertoire of human proteins that interact with SARS-CoV-2

Abstract Background Network propagation has been widely used for nearly 20 years to predict gene functions and phenotypes. Despite the popularity of this approach, little attention has been paid to the question of provenance tracing in this context, e.g., determining how much any experimental observation in the input contributes to the score of every prediction. Results We design a network propagation framework with 2 novel components and apply it to predict human proteins that directly or indirectly interact with SARS-CoV-2 proteins. First, we trace the provenance of each prediction to its experimentally validated sources, which in our case are human proteins experimentally determined to interact with viral proteins. Second, we design a technique that helps to reduce the manual adjustment of parameters by users. We find that for every top-ranking prediction, the highest contribution to its score arises from a direct neighbor in a human protein-protein interaction network. We further analyze these results to develop functional insights on SARS-CoV-2 that expand on known biology such as the connection between endoplasmic reticulum stress, HSPA5, and anti-clotting agents. Conclusions We examine how our provenance-tracing method can be generalized to a broad class of network-based algorithms. We provide a useful resource for the SARS-CoV-2 community that implicates many previously undocumented proteins with putative functional relationships to viral infection. This resource includes potential drugs that can be opportunistically repositioned to target these proteins. We also discuss how our overall framework can be extended to other, newly emerging viruses.


Background
Network propagation algorithms have been widely used for nearly years for function and phenotype prediction in systems biology [ , , , , , , ].More recently, applications of these techniques have included determination of genes associated with cancers and complex diseases [ ] and denoising single-cell gene expression data [ ]. Nowadays, network-based algorithms facilitate large-scale and automated data analysis of such complexity that it can be difficult for humans to understand the rationale that underlies a prediction, leading to decreased transparency and interpretability.
In this work, we consider the fundamental problem of tracing the provenance of a prediction back to the experimental sources [ ].Given a protein interaction network and a set of "sources", e.g., the human proteins that physically interact with SARS-CoV-[ ], suppose we apply a network-based algorithm to score and prioritize additional proteins that may directly or indirectly interact with the virus.Can we determine which source proteins make the highest contribution to the score computed for each prediction?Surprisingly, this question has been insufficiently studied in the field of network biology [ ].This aspect takes particular importance in the context of COVID-or other clinically or scientifically critical applications, where it may be important to understand the rationale behind the computational prediction of a new drug target before committing to expensive experimental validation.
We present a simple and direct method to solve this problem for a large class of network propagation algorithms.Specifically, for each protein u in the network, we compute the precise contribution of each source to the score of u.This calculation enables us to sort the sources by their relative contributions to u and to quantify the relative roles of sources at different distances from u.
To evaluate the effectiveness of this strategy, we apply it to prioritize host proteins that may "functionally" (directly or indirectly) interact with SARS-CoV-proteins and host cellular processes that may be hijacked by the virus (Figure ).To this end, we take advantage of a recently published dataset of human proteins that physically interact with SARS-CoV-[ ].Although these SARS-CoV-interactors are entry points to host cellular processes that may be hijacked by viral infection, the proteomics pipeline used to discover them [ ] may not capture in vivo conditions and tissue-specific interactions, leading to false negatives.Therefore, we apply network propagation algorithms to these known human protein interactors of SARS-CoVproteins (sources) and a whole-genome human protein interaction network from the STRING database [ ].We identify statisticallyenriched host biological processes and pathways that include highlyranking proteins computed by our methods.We illustrate how our provenance analysis can simplify visualizations of these processes and assist in understanding how they may be impacted by SARS-CoV-.

Data Description
Here, we detail the different viral-human and human protein and functional interaction networks that we used in our study.

SARS-CoV--Human Protein-Protein Interactions (PPIs).
We obtained human proteins that interact with SARS-CoV-[ ] and treated them as positive examples for our analysis.We added the ACE receptor to this set.
Functional and protein interaction networks.We used the human functional interaction network in the STRING database (version ) [ ], comprising of , nodes and , edges after applying a "medium" score cutoff of and mapping to UniProt IDs.We used the interaction reliabilities provided by STRING as edge weights; we divided each value in STRING by , to scale them between and .An edge in this network may be derived from experimental data or computational analysis.Thus, an edge may represent either direct physical binding or indirect functional interaction.Of the viral interactors, were present in this network; REEP (Q HR ), PPIL (Q H H ), RAB (Q NP ), and FKBP (Q Y ) were missing.
We also computed results for PPI networks from two other sources: the BioGRID database [ ], and the high-quality "HIunion" network published by Luck et al. [ ].For BioGRID, we considered two versions: (i) all PPIs (including protein complex membership), and (ii) only direct PPIs from yeast two-hybrid (Y H) screens.For each of these networks, we did not use edge weights and restricted the nodes and edges to those in the largest connected component.See , respectively.

SARS-CoV--human A AP-MS interactome.
We obtained human proteins determined to interact with SARS-CoV-proteins by affinity purification followed by mass spectrometry analysis (AP-MS) [ ].This dataset was generated in A lung carcinoma cells transduced with lentivirus vectors expressing HA-tagged SARS-CoVproteins.The authors used affinity purification with anti-HA antibodies to isolate stable complexes of human proteins bound to SARS-CoV-proteins.Subsequently, they identified and quantified the purified proteins by mass spectrometry.

SARS-CoV--human HEK
AP-MS interactome.We obtained a set of human proteins determined to interact with SARS-CoV-by AP-MS [ ].This dataset was generated by analyzing HEK embryonic kidney cells transfected with plasmid vectors expressing FLAG-tagged SARS-CoV-proteins.Affinity purification with anti-FLAG antibodies was used to isolate stable complexes of human proteins bound to SARS-CoV-proteins, and the purified proteins were identified and quantified by mass spectrometry.

SARS-CoV--human BioID interactome.
We obtained a set of , human proteins determined to interact transiently or weakly with SARS-CoV-proteins by using proximity-dependent biotinylation (BioID) [ ].This dataset was generated by analyzing A lung carcinoma cells transduced with lentivirus vectors expressing SARS-CoV-proteins fused with a bacterial biotin ligase.The addition of biotin resulted in the biotinylation of host proteins in the proximity of SARS-CoV-proteins.Biotinylated proteins were purified and then identified and quantified by mass spectrometry.Compared to interactomes identified by AP-MS, BioID is more capable of identifying weaker interactions in poorly soluble intracellular locations such as membranes and organelles.

Network Propagation
Provenance Tracing   .Network statistics.For STRING, the weight cutoff applied is in parentheses.The column titled "# SARS-CoV-inter.( /)" shows the number of sources that were in the network.The "# Nbrs. of sources" column shows the number of neighbors of the human proteins that interact with SARS-CoVproteins (i.e., sources) in the given network.

Algorithms and Software Datasets
Differential protein abundance in SARS-CoV--infected iAT cells.We obtained a set of , human proteins determined to have differential abundance in response to SARS-CoV-infection [ ].This dataset was generated by infecting induced pluripotent stem cellderived alveolar epithelial type cells (iAT ) with SARS-CoV-and measuring protein abundance by quantitative mass spectrometry at , , , and hours post-infection.The authors compared protein abundance in infected iAT cells with that of the uninfected iAT controls to obtain differentially-expressed proteins.In our analysis, we used the set of proteins with differential expression (FDR p-value < .) at any of the , , , and hours post-infection.

Differential gene expression in upper airway samples in SARS-
CoV--infected patients.We obtained three sets of human proteins determined to have differential gene expression in cells infected with respiratory viruses [ ].To generate this dataset, the authors used metagenomic RNA-seq to identify and quantify both human and viral RNA expression in upper airway samples collected from patients with acute respiratory illness.They compared the gene expression values between samples that contained SARS-CoVto uninfected samples in order to obtain differentially-expressed genes.They also identified additional viral infections including SARS-CoV, HRV, Influenza, HMPV, RSV, PIV in patient samples.
Comparing SARS-CoV-infections with other viral infections and other viral infections with uninfected samples yielded two additional sets of differentially-expressed genes.In our analysis, we used the genes with differential expression (FDR p-value < .) in these three sets obtaining (i) , genes from SARS-CoV--infected cells compared with uninfected samples, (ii) , genes from SARS-CoV--infected cells compared with other viral infections, and (iii) , genes from other viral infections compared with uninfected samples.
From each of these interactome and differential expression datasets, we removed human proteins used as positive examples in our analysis and the proteins that were not present in the STRING network.This step resulted in , , , proteins, respectively, from the interactome datasets and , , , , , , and , proteins, respectively, from the differential expression datasets.We used Fisher's exact test to estimate the statistical significance of the overlap between the remaining proteins and our top-ranking proteins.

Analyses
Various network propagation methods have been successfully used in diverse applications in systems biology [ ].In particular, we model network propagation using the Regularized Laplacian (RL) [ ].As we describe below ("Methods"), RL has the benefit of two mutuallyreinforcing interpretations.On one hand, it can be understood as an optimal labeling of network nodes, when some node labels are known a priori.On the other hand, it can be seen as the result of diffusion, i.e., a continuous-time random walk, on the network.Under this second interpretation, we derived a novel mathematical formula for the expected length of the path traversed in the network by the random walker, which we then used to characterize our top-ranking proteins.

Prioritization of Potential SARS-CoV-Interactors
Our underlying hypothesis was that network propagation via methods such as the RL yields a reasonable mechanism for predicting SARS-CoV-interactors. Therefore, we applied RL to the set of posi- tive examples to rank the remaining proteins in the STRING network.We also ranked these proteins using multiple other network propagation methods and off-the-shelf classifiers [ , , , ].We used a stratified sampling approach to estimate the statistical significance of the resulting node scores (see "Statistical Significance of Node Scores" in the supplementary methods).The sampling accounted for the possibility that if many sources have high degree, then scores may tend to be large overall in the network.Henceforth, for every method, we only considered proteins in the network that had a p-value less than . .
To decide which methods to select for subsequent analyses, we compared them using -fold cross validation ("Comparison of Cross-Validation Results" in the supplementary results and Figure S ).RL, random walk with restarts (RWR)[ ], and deepNF [ ] had the highest values of area under the precision-recall curve followed by SVM and logistic regression.RL achieved marginally worse values of area under the precision-recall curve than RWR and deepNF.We selected one network propagation method (RL) and one supervised classifier (SVM) for the following reasons.We preferred RL over deepNF because the provenance tracing method we developed for RL enabled its results to be more easily interpreted than those for deepNF.Since RL and RWR produced highly similar predictions with a very high Spearman's correlation for the ranking of all proteins ("Overlap among algorithms" in the supplementary results and Figure S ), we selected RL as representative of the two methods.We chose SVM among the two off-the-shelf classifiers since it also had very good performance in cross-validation.We considered the top predictions of RL and SVM that were statistically significant at p < .("List of RL and SVM predictions, p-values, and top-two contributors" [ ]), which we refer to as "top-ranking proteins" below.
Three recent publications or preprints have independently discovered physical interactions between SARS-CoV-and human proteins [ , , ].These datasets differed in the type of host cell in which the viral proteins were expressed and the experimental methods used to determine if two proteins interacted.("Datasets").The top-ranking proteins for both RL and SVM had significant overlaps with each of the three new datasets, while the results for Local were not statistically significant (p-value > .) (Figure (a)).We observed an especially striking overlap with the "proximity interactome" [ ]. Approximately one-third of the and topranking proteins computed by RL were present in this dataset of , interactions (p-value .× -and p-value .× -respectively).
The corresponding publication used BioID with the fast-acting miniTurbo enzyme [ ], a technique that is useful for discovering viral-host protein interactions that take place at intracellular membranes and poorly soluble organelles, which are difficult to profile using classical biochemical purification approaches used in the other publications [ , , ].Thus, our top-ranking proteins may be members of biological processes that occur in such locations in the cell.These three independent datasets provide strong support of our predictions.Our top-ranking proteins that do not overlap with these resources may interact with viral proteins indirectly and thus would not be captured by assays that test for direct protein-protein interactions.
We additionally tested the overlap between our top predictions and independent experimental datasets identifying differential expression of proteins in response to SARS-CoV-infection [ ].As in the previous analysis, we observed that the results for Local were not statistically significant (p-value > .), while both RL and SVM had significant overlaps with differential protein abundance in SARS-CoV-infected cells compared with uninfected cells [ ](Figure (a)).Approximately half of the and topranking proteins computed by RL were present in this dataset of , differentially expressed proteins (p-value .× -and pvalue × -respectively).This high overlap may indicate that these proteins are involved with changes in host protein expression occurring in SARS-CoV-infected cells, via either direct or indirect virus-host protein interactions.
In contrast, when we analyzed gene expression measurements in response to SARS-CoV-infection [ ], we did not observe a significant overlap between our top-ranking proteins and differentiallyexpressed genes (Figure S ).This result may be attributed to a difference in cell types used for measuring gene expression data, including cells not directly infected by the virus.Moreover, the lack of edges connecting transcription factors to target genes in the PPI network we used may limit the size of the overlap between interactors predicted by RL and SVM with differentially-expressed genes.
We tested for enrichment of Gene Ontology (GO) biological processes (Benjamini-Hochberg corrected p-value ≤ .) among the top-ranking proteins from RL and from SVM, as well as in the interactors of SARS-CoV-("Functional Enrichment" in the supplementary methods).Our top-ranking proteins were enriched in five broad categories of GO biological processes: organelle organization, transcription and translation, respiration, ER stress, and posttranslational modifications (Figure (b), Figure S , and "Enrichment results for RL, SVM and viral interactors" [ ]).We examine the relevance of these processes to the viral life cycle in more detail in "Discussion" and in "Enriched Biological Processes" in the supplementary results.

Tracing the Provenance of Top-Ranking Proteins
We can interpret the RL in terms of a continuous-time random walk over the network, which is governed by the internal parameter α.We are interested in the node reached by the walker after a random time that depends on α.The expected number of transitions made by the walker increases with the parameter α ("Analytical Perspective on the RL and Expected Path Length" in the supplementary methods).Hence for larger values of α, the "influence" of the sources is diffused more broadly across the network.To test how this spreading of "influence" affects our results, we varied α over four orders of magnitude from . to and performed two analyses.First and most importantly, for each top-ranking protein computed by the RL, we developed a systematic procedure to determine the provenance of its score, i.e., which SARS-CoV-interactors made the greatest contributions to this score.For our second analysis, we developed a new methodology to select a value of α.We were motivated to do so since we could not use the common practice of choosing the parameter's value based on maximization of cross-validation performance: the AUROC, AUPRC, and precision at .recall of the RL varied very little with α (Figure S ).
For provenance tracing, we took advantage of the fact that the score computed by the RL for each protein in the network is a linear combination of contributions from source proteins ("Methods").Therefore, for each protein u in the network, we sorted the source proteins by their relative contributions to the score of u ("Provenance tracing matrix" [ ]). Figure (a)-(d) provide illustrative examples of the practical usefulness of provenance tracing.We used a value of α = .to obtain these results.We present our method for selecting α at the end of this section.
In Figure (a), we display the top ranking proteins computed by the RL that are annotated to the enriched GO term "protein folding in endoplasmic reticulum".For each such protein, we also show all the sources that interact with it as well as the viral proteins that in turn interact with the sources.This network is complex and difficult to understand.In contrast, in Figure (b), we connect each topranking protein only to the two source proteins that contribute the most to its score.This simplified network considerably facilitates the interpretation and rationalization of the RL's predictions.Figure (c,d) are similar in nature and correspond to the enriched term "cilium assembly".We return to the biological insights present in these networks in "Discussion".
Next, we considered the effect of α on the amount of diffusion in the network.When α was very small, e.g., ., we expected the highest contributing sources to be direct neighbors of top-ranking proteins.As α increased, and the random walker traversed longer paths in the network, we expected more of the highest contributors to not be directly connected by an edge to top-ranking proteins.Contrary to our expectations, we found that for every value of α and for every top-ranking protein u (till a rank of , ), the source protein with the highest contribution to u's score was always a neighbor of u.Even when we considered the second and third highest contributors, we found that they were more than one edge away for as few as % of the top-ranking proteins for α = . .This number increased only to % for α = .The STRING network includes both direct, physical and indirect, functional PPIs.Therefore, we sought to see if this trend in the provenance analysis held for networks with only physical interactions corresponding to direct binding and indirect protein complex membership.We repeated the analyses up to this point on three other PPI networks: BioGRID, BioGRID-Y H, and HI-union ("Methods").For BioGRID, the results were comparable to those for STRING.The highest contributor was always a neighbor, except for α ≥ where up to % of nodes received most of their score from a source more than one edge away.The second and third highest contributor was more than one step away for as few as % of top-ranking nodes for α = ., and up to % for α = .For BioGRID-Y H and HI-union, which are smaller, sparser networks with only direct PPIs, only nodes had scores that were statistically significant at the .level.The highest contributing source was more than one step away for as many as -% of the top-ranking nodes, even for α = .For the second highest contributor, this percentage jumped to more than % for α = .itself.
To further characterize the contribution of non-neighboring sources, we defined the effective diffusion to a protein u as the fraction of its score s(u) that arose from the non-direct neighbors of u that were also SARS-CoV-interactors.As expected, the effective diffusion to the top-ranking proteins increased with α with values close to zero for α = .and a median of .
for α = (Figure (e)).We concluded that the neighbours of the sources received non-trivial contributions to their RL scores from indirectly-connected sources only for values of α = and higher.
We repeated these experiments for BioGRID, BioGRID-Y H, and ).BioGRID maintained fairly similar results to STRING.On the other hand, for the other two networks, their effective diffusion values were quite a bit smaller (difference from STRING about .on average).Taken together, these results suggest that in the sparser networks (BioGRID-Y H and HIunion), a top-ranking protein has fewer sources as direct neighbours than in the denser networks (STRING and BioGRID) but a larger proportion of its score arises from these adjacent sources.
These results motivated us to test a different method for selecting an appropriate value of α for downstream analysis.As mentioned earlier, we mathematically derived a new expression for the expected value of the path length of the random walker ("Analytical Perspective on the RL and Expected Path Length" in the supplementary methods).To our knowledge, no such formula is known for the interpretation of the RL as a continuous-time Markov chain.This value depended on α, the topology of the network, and which proteins interacted with SARS-CoV-.We computed the expected path length for different values of α (Table S ).Independently, we computed the distribution of path lengths in the network from SARS-CoV-interactors to every other protein (Figure S ).The median number of edges in these paths was three.Therefore, we set the value of α = .for which the expected path length of the random walker was .(Table S ).The median effective diffusion for this value of α was around . .We used this value of α to generate the results presented in this work.

Discussion
The COVID-pandemic and its medical and economic impact have created an urgent challenge for biomedical researchers to understand infection mechanisms used by SARS-CoV-and to develop therapeutics against the disease [ ].A manifestation of this community response is the first protein-protein interactome associated with the SARS-CoV--human interface [ ].This set of human proteins reported to interact with SARS-CoV-is likely to have both false positives and false negatives due to the properties of the proteomic screening pipeline used.
In this work, we sought to further extend the results of this study to significantly expand the resources available to the COVID-community by producing an extended set of putative SARS-CoV-interactors. Comparison of our results with independently-generated SARS-CoV--human protein interaction networks [ , , ] provides substantial experimental support for our predictions.We note that complementary efforts are based on protein structures [ ], observational studies of treatments being administered to patients [ ], shortest paths in protein networks [ ], propagation in protein networks with predicted SARS-CoV-interactors [ ], and exploratory analyses of virus-host-drug networks [ ].
A notable new feature of our methodology is tracing the provenance of each of our predictions back to the most informative experimental sources [ ].In principle, the RL computes scores by integrating over all paths in the network.We were surprised to see that the top-contributing sources were invariably direct neighbours of the top-ranking predictions in the STRING network.A partial explanation for this trend may be the fact that as many as , proteins in the STRING network were direct neighbors of at least one source protein, even when we considered only interactions with weight at least .(the STRING database deems edges with such weights to be of "very high quality").Thus, the structure of the STRING network and central location of sources within it may cause the RL both to give high ranks only to direct neighbors of sources and to channel propagation primarily along these direct connections.We stress that using only the interactions between sources and their neighbors in the network does not result in high-quality predictions, as evidenced by the relatively poor cross-validation performance of the Local algorithm.Thus, the integration of multiple paths by the RL plays a key role in prioritizing which neighbors of the sources are more likely to be potential interactors of SARS-CoV-proteins than others.
COVID-research has focused disproportionately on a small set of human proteins [ ].Our research has the potential to expand the repertoire of host proteins that are studied in the context of COVIDand thereby open new directions of study of the disease.The cellular processes in which our top-ranking proteins participate suggest how the virus may infect human cells.We discuss two illustrative examples of the type of insights provided by our approach, highlighting several proteins targeted by drugs that are already in clinical trials for COVID-.We remind the reader that we computed functions enriched in the top-ranking proteins, performed the provenance analysis independently, and then integrated the results in the protein networks we visualized.

The Role of Endoplasmic Reticulum Stress, HSPA , and Anti-Clotting Drugs
Our analysis points to a connection among interactors of SARS-CoV-, proteins involved in endoplasmic reticulum (ER) stress, and anticlotting drugs ((Figure (a,b))).The GO biological process "protein folding in endoplasmic reticulum" was enriched in the top-ranking proteins (p-value .× -for RL and .for interactors of SARS-CoV-).HSPA , also referred to as glucose regulated protein (GRP ) or immunoglobulin binding protein (BiP) in the literature, is evolutionarily conserved from prokaryotes to humans [ ].It has a repertoire of functions associated with ER stress response.HSPA is usually localized in the ER.When the ER is stressed, HSPA can translocate to the cell surface, the nucleus and mitochondria [ , ].On the cell surface, HSPA plays a multi-functional role in cell proliferation, cell viability, apoptosis, and regulation of innate and adaptive immunity [ , ]. Blood hypercoagulability is reported to be common among COVID-patients [ ]. Top-ranking proteins HSPA and CANX act as chaperones for pro-coagulant proteins such as Factor V and Factor VIII.Once Factor VIII is secreted, it binds to another procoagulant protein von Willebrand factor (vWF) to prevent degradation of clots [ ].Although Factor V, Factor VIII, and vWF are not among the top-ranking proteins and thus do not appear in Figure (a,b), this network is suggestive of mechanisms that SARS-CoVmay use to cause abnormal blood coagulation.
Anti-coagulant drugs that interact with HSPA or CANX include Tenecteplase, a third generation plasminogen activating enzyme and the investigational drug Lanoteplase, which is a serine protease that binds to fibrin leading to the formation of plasmin [ ], an enzyme that breaks clots.Lanoteplase is a second-generation derivative of Alteplase, and a third generation derivative of recombinant plasminogen.It is notable that there are clinical trials for Tenecteplase (Clin-icalTrials.gov,NCT , NCT ) and Alteplase (Clini-calTrials.gov,NCT , NCT ) to test their effectiveness in treating COVID-.Aspirin, also present in (Figure (a,b)), binds to and inhibits the ATPase activity of HSPA [ ]. Aspirin is currently involved in clinical trials (ClinicalTrials.gov),with one testing the effects of aspirin at various levels of COVIDseverity (NCT ), and another testing whether early treatment of COVID-patients with aspirin and vitamin D can inhibit the production of blood clots and decrease rates of hospitalization (NCT ).

Cilium Assembly and Tubulin-Modulating Drugs
GO biological processes related to cilia were significantly enriched in the top-ranking RL and SVM predictions.An example is "cilium assembly" (p-value . × for RL vs. .in the human interactors of SARS-CoV-.Many proteins annotated to this term belong to the tubulin family, which are components of microtubules.The SARS-CoV-M protein binds to two γ-tubulins (TUBGCP and TUBGCP ), which interact with several α-and β-tubulins among the top predictions (Figure (c,d)).Microtubules are polymers that provide shape and structure to eukaryotic cells and are necessary in cell transport and cell division, among other functions [ ]. αand β-tubulins compose microtubule filaments, while γ-tubulins connect them to the microtubule organizing center.
Viruses commonly utilize microtubules for cellular entry, intracellular trafficking, and exit from cells [ ].For instance, the S protein of human α-coronavirus interacts with tubulin α and β chains [ ], suggesting that tubulin may be involved in the transport and localization of the S protein and its assembly into virions [ ]. Relevant to SARS-CoV-, microtubules are the primary structural component of cilia, which line epithelial cells in the respiratory tract and are responsible for the transport of mucus out of cells [ ].The ACE receptor that SARS-CoV-uses to enter cells appears to be expressed primarily on the cilia of respiratory tract epithelial cells [ , ], further implicating microtubules in viral infection.The combination of high expression levels of ACE and the presence of cilia may also explain the detection of the virus in multiple organs [ ] and the deleterious effect of COVID-on the renal, gastroinstestinal, and olfactory systems [ ].The drugs that target Tubulin proteins (Figure (c,d)) are mostly anti-mitotic agents, which are being investigated as anti-cancer therapeutics.It is notable that ongoing clinical trials (ClinicalTrials.gov)are testing the effectiveness of Colchicine against COVID-.
Our work also sets the stage for follow-up analyses on SARS-CoV-.Integrating new datasets of SARS-CoV--human protein interactions [ , , ] and human proteins whose deletion inhibits viral replication [ , ] with other omics data using our methods and with orthogonal analysis techniques promises to predict more biologically meaningful networks and processes impacted by the virus.In particular, single-cell RNA-seq data offer many opportunities to examine cellular heterogeneity and context-specific interactions.

Potential Implications
The approach we advocate here is inspired by the general framework of producing explanations for machine learning methods [ ].This area of "explanations" of predictions is receiving strong interest because of deep learning.While the idea has previously been studied in graphical models [ ], most machine learning methods are not fully interpretable by the fairly strict definition of Kasif and Roberts [ ]: tracing each prediction to the experimental evidence that supports it.This notion of explanation is a special but particularly important case for computational genomics and systems biology.
Causal perturbations [ ] provide a general approach for producing explanations of this type for virtually any predictive model.Consider a model with experimental evidence that a gene g performs a function f.We perturb the variable associated with the gene, e.g.we change the probability Pr(g performs f) = to Pr(g performs f) = .
We then compute the change in probability of every other variable in the model due to this perturbation in order to assess the importance of this particular gene-function pair.
For network propagation, this idea yields the special case discussed in this work that is amenable to very efficient computation.Our strategy for tracing provenance extends to any algorithm that makes predictions using a linear combination of evidence such as logistic regression and GeneMania [ ].In particular, it is applicable to the large number of random-walk-based methods that have been developed for predicting disease genes or annotations to GO terms [ , , , ].
An important future line of research will be to develop provenance tracing techniques for other classes of network-based methods such as Markov random fields (MRFs) [ , ] and min-cut based methods [ , ].For MRFs, we can apply the general perturbation-based method described above.For mincut-based methods, it is possible to recalculate the cut for any single change in experimental data using dynamic data structures [ ].Thus, the provenance tracing approach that we advocate here has many natural follow-ups that we expect to be studied by the community in the future.
It remains to be seen whether the trends we observed on the contributions from direct neighbors generalize to these methods and to annotations of terms in the Gene Ontology or the Human Phenotype Ontology terms.In general, it is quite likely that sources that are not direct neighbors may make substantial contributions to scores.In these cases, new algorithmic developments may be required to trace the paths by which the sources spread their influence to a given node.
Our work provides significant new data and software resources to the COVID-community.Three properties of our results facilitate their use by experimentalists who are seeking to obtain new insights into the pathogenesis of this disease.First, the prioritized list of predicted interactors of SARS-CoV-("List of RL and SVM predictions, p-values, and top-two contributors" [ ]) contains druggable targets that may be promising to study further.Second, our provenance analysis provides the rationale underlying each prediction by directly linking to the relevant experimental input.Third, the viralhuman protein interaction networks corresponding to enriched GO terms (Figure and Figure S ) are available for visualization and download on GraphSpace (http://graphspace.org/graphs/?query=tags:2021-sarscov2-network-analysis).Examination of these networks provides further context for the predictions.
We conclude by noting that our methodology is general purpose and easy to generalise to a new virus.The software requires a dataset of host proteins that interact with the virus and an interaction network among the host proteins themselves.The virus-host network may be determined experimentally [ ].If such a dataset is not available, a user can predict the network computationally from the se-quence of the viral genes and interaction networks for phylogenetically similar viruses [ ]. Subsequently, a user can apply network propagation to predict additional human proteins and biological processes that may be targeted by the virus.

Algorithms
To facilitate the complete reproducibility of our results, we now describe the RL algorithm that we use for label propagation and prediction.We present the other methods that we use (GeneMANIA, SinkSource, RWR, Local, deepNF, the Support Vector Machine, and Logistic Regression) and implementation details in "Other Algorithms" in the supplementary methods.We are given a weighted, undirected network G = (V, E, w), where each node in V is a human protein, each edge (u, v) represents an interaction between proteins u and v, and w : E → ( , ] is a function specifying the weight of each edge in E. Informally, the weight of an edge indicates our confidence in the experimental data supporting the corresponding protein-protein interaction.We are also given a set P ∈ V of positive examples consisting of the human proteins that interact with SARS-CoV-proteins [ ].Each node in G is a human protein and each edge represents a physical or functional interaction between two proteins.We seek to compute a score vector s ∈ R n , where n is the number of nodes in G.For every node v, the score s(v) in this vector indicates our confidence that node v either physically interacts with or is functionally linked to a SARS-CoV-protein.
Regularized Laplacian [ ].Given a parameter α > , we compute s using the following steps: i. Define a label vector y over the nodes in G where y(u) = if node u is in P and y(u) = , otherwise.ii.Define W ∈ R n×n as the adjacency matrix of G with edge weights, i.e., the entry in row u and column v of W equals w uv if (u, v) is an edge in G and , otherwise.iii.Define D as a diagonal matrix with D uu = v w uv , for every node u in G. iv.Compute the R n×n matrix W = D -/ WD -/ , which denotes the normalized network.v. Compute the Laplacian of G as L = D -W, where we define D to be a diagonal matrix with Duu = v wuv .vi. Compute the vector s = (I + α L) -y.
The RL was introduced by Zhou and Schölkopf.Since then, several variations of this method have been published.The version we use is identical to the strategy used by Fouss et al. [ ].We provide the intuition behind the resulting RL matrix (i.e., (I + α L) -) and discuss its properties in "Analytical Perspective on the RL and Expected Path Length" in the supplementary methods.In particular, we derive an expression for the expected path length of the continuous-time Markov chain corresponding to the RL.As far as we know, this mathematical analysis has not previously been published.

Tracing the Provenance of Prediction Scores
Let K denote the RL matrix (I + α L) -.We remind the reader that the RL algorithm ranks proteins based on diffusion scores that associate a node u in the network with a diffusion score s(u), where s(u) = v∈P K uv , where v ranges over the set P of all SARS-CoVinteractors.For every protein u, we sorted the proteins in P in de-creasing order of the values of K uv , where v ranged over P. In the manner, we ranked the experimentally determined interactors that in decreasing order of their contributions to each node's diffusion score.This analysis is important for tracing the provenance of computational predictions to their experimental sources [ ].
Machine; vWF: von Willebrand factor; USDA-NIFA: United States Department of Agriculture National Institute of Food and Agriculture

Figure .
Figure .Overview of methodology.Algorithms and software for network propagation and provenance analysis take as input experimentally determined host-pathogen protein interactions and a human protein interaction network.Evaluation includes cross-validation, functional enrichment, and literature-based examination of promising protein targets and drugs.

Figure .
Figure .Network propagation results.(a) Heatmap showing the FDR adjusted p-value from the hypergeometric test for the overlap between the top-ranking predictions of RL, SVM, and Local and three new experimental datasets of SARS-CoV--human protein interactions [ , , ] and one dataset of differentially expressed (DE) proteins after SARS-CoV-infection [ ].Each cell displays the FDR-adjusted p-value and the number of overlapping proteins in parentheses.A gray cell indicates a p-value larger than . .(b) Heatmap summarizing GO biological process terms enriched in top ranking proteins from RL and SVM and human interactors of SARS-CoV-proteins (indicated as 'Kr').We manually grouped the terms into broader categories shown in bold text.A gray cell indicates a p-value larger than . .We examine the relevance of these biological processes to SARS-CoV-and COVID-in "Enriched Biological Processes" in the supplementary results and in "Discussion".

Figure .
Figure .Provenance tracing results and illustrative examples of networks.(a) Network of the top ranking proteins for RL (green nodes) that are annotated to the enriched term "protein folding in ER".For each top-ranking protein, we display its connections with all neighboring SARS-CoV-interactors. (b) The same network as in (a) except that we display only the top-two contributing SARS-CoV-interactors for each top-ranking protein.(c) Network of the top ranking proteins for RL (green nodes) that are annotated to the enriched term "cilium assembly."(d) The same network as in (c) except that we display only the top-two contributing SARS-CoV-interactors for each top-ranking protein.In all four network visualizations, the number below the name of a green protein is its rank as computed by the RL.Proteins discussed in the text are highlighted with a red border.In (a,c), we removed STRING edges with weight < to simplify the visualization.We retained this restriction in (b,d) as well to maintain consistency between the visualized networks.In (c,d), we removed drugs that promote clotting.(e) Distribution of effective diffusion for the top ranking proteins for different values of α. (f) The same distribution as (e) except comparing different networks with α = . .
HSPA has been proposed as a universal target for human diseases [ ].It has increasingly well-documented essential interactions and activities during viral infections.In particular, the role of HSPA in viral entry and pathogenesis has been widely investigated.SARS-CoV infection has been shown to lead to ER stress and the upregulation of HSPA [ , ].The S protein of SARS-CoV can induce transcriptional activation of HSPA [ ].This protein can serve as a point of attachment for both MERS-CoV and bat coronavirus (bCoV HKU ) [ ].Both Zika virus and Japanese encephalitis virus use HSPA to prevent apoptosis and to help in viral replication [ ].A recent molecular docking study has predicted HSPA as a potential receptor for the SARS-CoV S protein [ ].The observed expression in vitro of HSPA in airway epithelial cells suggests that it may serve as an additional receptor for SARS-CoV-in these cells [ ]. Based on our network-based analysis and support in the literature, we hypothesize that HSPA may serve as a co-receptor, a point of viral attachment, or aid in viral entry of SARS-CoV-.
Table for statistics of the network size and density.
Evaluation includes cross-validation, functional enrichment, and literature-based examination of promising protein targets and drugs.