PCGAN: a generative approach for protein complex identification from protein interaction networks

Abstract
Motivation: Protein complexes are groups of polypeptide chains linked by non-covalent protein–protein interactions, which play important roles in biological systems and perform numerous functions, including DNA transcription, mRNA translation, and signal transduction. In the past decade, a number of computational methods have been developed to identify protein complexes from protein interaction networks by mining dense subnetworks or subgraphs.
Results: In this article, different from the existing works, we propose a novel approach for this task based on generative adversarial networks, which is called PCGAN, meaning identifying Protein Complexes by GAN. With the help of some real complexes as training samples, our method can learn a model to generate new complexes from a protein interaction network. To effectively support model training and testing, we construct two more comprehensive and reliable protein interaction networks and a larger gold standard complex set by merging existing ones of the same organism (including human and yeast). Extensive comparison studies indicate that our method is superior to existing protein complex identification methods in terms of various performance metrics. Furthermore, functional enrichment analysis shows that the identified complexes are of high biological significance, which indicates that these generated protein complexes are very possibly real complexes.
Availability and implementation: https://github.com/yul-pan/PCGAN.


Introduction
Protein complexes (PCs) are composed of interacting proteins, which play important roles in various cell functions, including signal transduction, protein degradation, and mRNA translation (Alberts 1998, Hartwell et al. 1999). With the development of high-throughput techniques, such as yeast 2-hybrid (Y2H) assays (Fields and Song 1989) and affinity purification-mass spectrometry (AP-MS) (Morris et al. 2014), it is cheap and fast to acquire large amounts of protein-protein interactions (PPIs) of different organisms. This provides an opportunity to identify PCs from protein interaction networks (PINs) by computational methods, which are cheaper and faster than biological experiments (e.g. tandem affinity purification with mass spectrometry) (Xu and Guan 2014, Shi et al. 2018, Yao et al. 2020b). A PIN can be represented as an undirected graph $G = (V, E)$, where $V$ is the set of proteins (or nodes) and $E$ is the set of interactions (or edges). A comprehensive PIN containing more proteins and interactions helps to identify complexes accurately and completely.
In the past decade, a number of computational methods have been proposed to identify PCs from PINs (Jiang and Singh 2010, Kenley and Cho 2011, Nepusz et al. 2012, Wu et al. 2021b, Pan et al. 2023). Most of them employ clustering algorithms by treating PCs as dense subnetworks or subgraphs in PINs (Kenley and Cho 2011, Nepusz et al. 2012, Wu et al. 2021b). Thus, the quality of PINs is crucial to the performance of complex identification. Though more and more PPI data are available, which benefits the construction of larger PINs, the overlap between different PINs of the same organism produced by different labs is quite low, especially for interactions or edges. So, it is reasonable to merge multiple PINs to construct larger and more comprehensive PINs for boosting PC identification. In this work, we propose a novel approach, PCGAN, to identify PCs from PINs. It is a generative approach, which is quite different from the existing works that are mainly based on clustering analysis over PINs. Here, PCGAN means identifying Protein Complexes by Generative Adversarial Networks (GANs). With the help of some known complexes as training samples, PCGAN first learns the characteristics of PCs, and then generates new complexes. Specifically, PCGAN trains two models: a generator for generating PCs, and a discriminator for distinguishing the generated PCs from real ones. The competitive learning between the generator and the discriminator drives the two models to improve their capabilities until the generated complexes are indistinguishable from the real ones. Furthermore, to improve the performance of complex identification, we construct two more comprehensive PINs and a larger gold standard complex (GC) dataset by merging the existing datasets.
We conduct extensive experiments to evaluate the proposed method. Our experimental results indicate that PCGAN is superior to the existing methods. We also perform functional enrichment analysis on the generated complexes, which shows that our generated PCs are of high biological significance, i.e. they are very possibly real complexes.
To the best of our knowledge, this is the first generative work of complex identification. Our method pioneers a new direction of PC identification.

Datasets
The datasets used in this article consist of two parts: PINs and gold standard PC sets, of two organisms (human and yeast). The detailed description of the datasets is as follows:

Protein interaction networks
For human, we used two recently released PIN datasets, HuRI (Luck et al. 2020) and BioPlex (Huttlin et al. 2021), whose PINs cover a large number of human proteins and PPIs. HuRI was derived from Y2H and comprises 63 132 PPIs across 8975 proteins, and BioPlex was generated by AP-MS and consists of 167 932 PPIs across 14 484 proteins. We merged the two PINs and removed redundant interactions to get a more comprehensive human PIN called CPIN-H, which contains 225 642 PPIs over 16 600 human proteins. As for yeast, we used five widely used PINs, namely Collins (Collins et al. 2007), Gavin (Gavin et al. 2006), Krogan (Krogan et al. 2006), WIPHI (Kiemer et al. 2007), and DIP (Xenarios et al. 2002), which contain highly reliable PPIs. As with the human PINs, we merged the five PINs to get a more comprehensive yeast PIN, named CPIN-Y. The information of PINs of human and yeast is summarized in Table 1. In this article, we used only CPIN-H and CPIN-Y for complex identification.
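The merging step described above can be sketched as follows. This is a minimal illustration, assuming each PIN is available as a list of protein pairs; the function names `merge_pins` and `proteins_of` and the toy identifiers are hypothetical and are not taken from the PCGAN code. Because interactions are undirected, each edge is stored as a `frozenset`, so that (A, B) and (B, A) count as the same interaction.

```python
def merge_pins(*edge_lists):
    """Merge several undirected PINs and drop redundant interactions.

    Each PIN is an iterable of (protein_u, protein_v) pairs. An edge is
    stored as a frozenset so that the order of the two proteins is ignored.
    """
    merged = set()
    for edges in edge_lists:
        for u, v in edges:
            merged.add(frozenset((u, v)))
    return merged


def proteins_of(edges):
    """Return the set of proteins covered by a set of frozenset edges."""
    return set().union(*edges)


# Toy example (hypothetical identifiers, not real data).
pin_a = [("A", "B"), ("B", "C")]
pin_b = [("B", "A"), ("C", "D")]  # ("B", "A") duplicates ("A", "B")
cpin = merge_pins(pin_a, pin_b)
print(len(cpin), sorted(proteins_of(cpin)))
```

Storing edges as unordered pairs is the design choice that makes "removing redundant interactions" a plain set union.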

Gold standard PC sets
We downloaded the latest version CORUM 3.0 (Giurgiu et al. 2019) as the GC set of human, which contains 2485 PCs of size ≥2. For yeast, three PC datasets have been used as the gold standard sets, including CYC2008 (Pu et al. 2009), the Munich Information Centre for Protein Sequences (MIPS) dataset (Mewes et al. 2006), and the Saccharomyces Genome Database (SGD) (Hong et al. 2008). Here, we merged the three datasets and removed the redundant complexes (if two complexes exactly match each other, they are redundant to each other) to obtain a larger gold standard set of yeast (named CGold-Y). Table 2 provides the statistics of these gold standard PC sets. In experiments, we used only CORUM and CGold-Y.
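A minimal sketch of this exact-match deduplication is given below, assuming each complex is a collection of protein identifiers. The function name `merge_complex_sets` is hypothetical, and the `min_size=2` filter is an assumption that mirrors the size ≥2 condition stated for CORUM; whether it was also applied to the yeast sets is not stated here.

```python
def merge_complex_sets(*complex_sets, min_size=2):
    """Merge gold standard sets, keeping one copy of each distinct complex.

    Two complexes are redundant only if they contain exactly the same
    proteins. min_size is an assumed filter, not a documented setting.
    """
    seen = set()
    merged = []
    for complexes in complex_sets:
        for members in complexes:
            key = frozenset(members)
            if len(key) >= min_size and key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


# Toy example (hypothetical identifiers).
set_1 = [["A", "B", "C"], ["D", "E"]]
set_2 = [["C", "B", "A"], ["D", "E", "F"]]  # first entry repeats a complex
set_3 = [["G"]]                              # below the assumed size filter
gold = merge_complex_sets(set_1, set_2, set_3)
print(len(gold))
```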

Methodology
Here, we first introduce the framework of PCGAN as shown in Fig. 1, then present the techniques of the discriminator and generator in PCGAN. Finally, we describe the algorithm. Figure 1a shows the architecture of PCGAN based on GANs, where the generator generates PCs, each of which starts from a seed node and is expanded iteratively based on the policy gradient strategy. The discriminator distinguishes generated PCs from real ones. The competitive learning between the generator and the discriminator drives them to improve their capabilities until the generated complexes are indistinguishable from the true complexes.
In detail, the input data of PCGAN include a PIN (CPIN-H or CPIN-Y) and a GC dataset (CORUM or CGold-Y). N complexes are selected from the GC dataset to form the training set, and the rest are used as the test set. Of the training set, 30% is used for model verification, i.e. evaluating the quality of the model during training. To generate a complex, a node is randomly selected from the PIN as the seed node and input to the generator. At each step of complex generation, the generator either chooses an optimal node from the neighbors of the current (intermediate) complex to join it or terminates the expansion. As each node in the PIN will be selected for complex generation, the order will not impact the final result. Next, the true complexes and the generated complexes are input into the discriminator, which returns the gradient update to the generator so that it generates higher-quality complexes. Finally, the proteins of complexes in the test set are input into the trained generator to generate candidate PCs, and the final PC set is obtained by removing redundant complexes according to the complex match rate $Rate_{match}$ [Equation (10)]. As an example, Fig. 1b illustrates the generation process of a real complex. In this article, we use the notation in Table 3.
In addition, lowercase letters (e.g. x) represent column vectors, and uppercase letters (e.g. X) denote sets.
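The seed-expansion loop described above can be sketched in a few lines. This is only an illustration of the control flow (add a neighbor or stop), not the PCGAN generator: the learned policy is replaced by a hand-written stand-in, `cohesion_policy`, which picks the neighbor with the most links into the current complex and stops when that neighbor connects to fewer than half of the members. All names, the stopping rule, and `max_size` are assumptions made for this example.

```python
def expand_from_seed(adj, seed, choose_action, max_size=20):
    """Grow a complex from a seed by repeatedly adding one neighbor or stopping.

    adj maps each node to the set of its neighbors. choose_action receives the
    current members and their external neighbors and returns a node to add,
    or None for the STOP action.
    """
    order = [seed]
    members = {seed}
    while len(members) < max_size:
        frontier = set()
        for u in members:
            frontier |= adj.get(u, set())
        frontier -= members
        if not frontier:
            break
        action = choose_action(members, frontier)
        if action is None:  # STOP action
            break
        members.add(action)
        order.append(action)
    return order


def cohesion_policy(adj):
    """Stand-in policy (assumption): prefer the best-connected neighbor."""
    def choose(members, frontier):
        best = max(sorted(frontier), key=lambda v: len(adj[v] & members))
        if 2 * len(adj[best] & members) < len(members):
            return None
        return best
    return choose


# Toy PIN: a triangle A-B-C with a pendant node D attached to C.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
print(expand_from_seed(adj, "A", cohesion_policy(adj)))  # ['A', 'B', 'C']
```

In PCGAN the choice of action comes from the trained policy $G(a_t | PC_{t-1})$; the stand-in here only makes the loop runnable.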

Discriminator
A PIN can be represented by an undirected graph $G = (V, E)$, where $V$ is the set of vertices (i.e. proteins) and $E$ is the set of edges (i.e. interactions). The discriminator $D$ aims to judge whether a PC is real or generated. Here, we use a graph isomorphism network (GIN) (Welling and Kipf 2016, Gasteiger et al. 2018, Xu et al. 2018) to construct the discriminator of PCGAN because it is effective in representing graphs and shows good performance in graph classification. To characterize the features of complexes in the PIN, we consider both internal and external connections of a PC, i.e. $C = PC + NPC$, where $PC$ represents the internal node set of the PC, and $NPC$ represents its neighbor node set. The discriminator is a three-layer GIN. The initial feature $z^0(v)$ of a node $v$ in the PIN is defined with parameters $w_0, w_1, w_2 \in \mathbb{R}^d$ and the condition function $\mathbb{1}[\cdot]$, which returns 1 when the condition is true and 0 otherwise. At layer $l$ ($l \le 3$), the representation of node $v$ is evaluated as follows:
$$z^l(v) = w_l \left( z^{l-1}(v) + m^l(v) \right), \tag{2}$$
where $m^l(v)$ and $z^l(v)$ are intermediate representation vectors of node $v$, $\sigma$ is the activation function, $N_C(v)$ represents the neighbor node set of $v$ in $C$, and $w_l$ is the weight vector at the $l$-th layer. The final representation of $v$ is obtained by concatenating the representations of each layer: $z(v) = [z^0(v); \ldots; z^3(v)]$. Then, we aggregate the representations of nodes in $C$ by the readout function to generate the representation of $C$: $z(C) = \sum_{v \in C} z(v)$. Finally, we use the sigmoid function to obtain the probability that the PC is a real complex: $D(C; W) = [1 + \exp(w_3^T z(C))]^{-1}$, where $W$ is the parameter set of $D$, and $w_3 \in W$.
Figure 1. (a) The input data of PCGAN include a PIN and a gold standard dataset. The generator generates complexes as similar as possible to true complexes. The discriminator tries to distinguish between true and generated complexes, and returns gradient updates to make the generator generate high-quality complexes. The iterative learning process improves the generator and the discriminator until generated complexes are indistinguishable from true complexes. (b) Illustration of the complex generation process. A complex is generated by starting from a seed node. The policy $G(a_t | PC_{t-1})$ models the complex generation by iteratively expanding the current intermediate complex. Some examples of failed and successful complexes generated by PCGAN are presented in the Supplementary Material (see Supplementary Fig. S1).
Table 3. The notation of this article.

Symbol              Description
$NPC$               The set of neighbor nodes of a protein complex
$C = PC + NPC$      A protein complex with its neighbor nodes
$D$, $G$            Discriminator and generator

Generator
Given a seed node $v_0$, the generator $G$ tries to generate an optimal PC (one with high cohesion and low coupling) containing node $v_0$, $G(v_0) = PC = \{v_0, v_1, \ldots, v_T\}$, in an iterative way. Specifically, given the current intermediate complex $PC_{t-1} = \{v_0, v_1, \ldots, v_{t-1}\}$, at the $t$-th step ($t \le T$) we either select a node $v_t$ from $NPC_{t-1}$ to expand $PC_{t-1}$ or stop the expansion of the complex. $T$ represents the number of proteins or nodes contained in the final PC. Here, we use the policy $G(a_t | PC_{t-1})$ to model the expansion of the current complex, where $a_t$ represents the next action, i.e. adding a node to the current complex or stopping the expansion process. As an example, the complex expansion process is illustrated in Fig. 1b.
1) The design of policy $G(a_t | PC_{t-1})$.
At the $t$-th step, the augmented initial feature of node $v$ is defined with parameters $q_0, q_1, q_2 \in \mathbb{R}^d$, so that the feature of node $v$ contains the information of the current complex and the seed node. Then, the node feature is processed by a graph neural network (GNN) model. Here, we use the Graph Pointer Network with incremental updates (iGPN) model (Zhang et al. 2020), which consists of three layers of GNN and a multi-layer perceptron (Longstaff and Cross 1987), to process node features and obtain $\tilde{H}_t(C_{t-1})$ and $H'_t(C_{t-1})$, the stacked node representations of $C_{t-1}$. Here, $C_{t-1}$ denotes the set of internal and neighboring nodes of complex $PC_{t-1}$, and $Q$ represents the set of parameters in iGPN. iGPN is specially designed for sequential decision problems on graphs and has the advantage of less training time and memory (Zhang et al. 2020).
Then, we design the action representation $\tilde{h}_t(a_t)$, from which the probability of each action is computed; this probability determines whether to add a node to the current complex $PC_{t-1}$ or to end the expansion process. The STOP action has its own representation, defined with parameters $q_3$ and $q_4$. The iGPN parameter set is $Q = \{q_0, q_1, q_2, q_3, q_4\}$.
2) Optimizing the generator with policy gradient.
Here, the generator is trained by the policy gradient strategy (Sutton et al. 1999). We define the reward for the intermediate PC as $r(PC) = -\log(1 - D(PC))$. The generator tries to maximize the expected reward for a given seed $v_0$. Its policy gradient with respect to $Q$ is expressed through the state-action function $S(PC_{t-1}, v_t) = \mathbb{E}_{v_{t+1}, \ldots, v_T | PC_t \sim G}[r(PC)]$. We use Monte-Carlo estimation to approximate the policy gradient as in previous studies (Yu et al. 2017).
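For orientation, policy gradients of this kind are commonly written in the REINFORCE form used in SeqGAN-style training. The expression below is a sketch under that assumption rather than a reproduction of the article's own equation. The objective symbol $J(Q)$, the rollout count $K$, and the rollout samples $PC^{(k)}$ are introduced here for illustration only, and $PC_t = PC_{t-1} \cup \{v_t\}$.

```latex
\nabla_Q J(Q)
  = \mathbb{E}_{PC \sim G(\cdot \mid v_0)}
    \left[ \sum_{t=1}^{T} \nabla_Q \log G(v_t \mid PC_{t-1}) \, S(PC_{t-1}, v_t) \right],
\qquad
S(PC_{t-1}, v_t) \approx \frac{1}{K} \sum_{k=1}^{K} r\!\left(PC^{(k)}\right),
\quad PC^{(k)} \sim G(\cdot \mid PC_t).
```

Under this reading, the Monte-Carlo estimate completes each intermediate complex $K$ times with the current generator and averages the discriminator-based rewards of the completed complexes.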
3) Pre-training and teacher forcing.
Pre-training provides a good initial model, and teacher forcing, which adds a supervised learning step to each reinforcement learning step, effectively prevents the model from deteriorating and getting stuck in some training batches. Here, we use maximum log-likelihood estimation (MLE) for model pre-training and apply teacher forcing as described in Vinyals et al. (2015) and Zhang et al. (2020).
4) Algorithm.
The overall PCGAN algorithm is outlined in Algorithm 1, which consists of four parts: (i) pre-training the generator $G$ on the training set $Ta$ using MLE (Line 1); (ii) iteratively training the discriminator $D$ (Lines 3-8) and the generator $G$ (Lines 9-14) on the training set $Ta$; (iii) generating complexes one by one with the trained generator on the testing set $Te$ (Lines

Performance metrics
To fairly compare the performance of various methods, we use four commonly used metrics, including "Recall," "Precision," "F-measure," and "maximum matching ratio" (MMR). Before describing these metrics, we first introduce $Rate_{match}$ [Equation (10)], which is used to measure the similarity between a predicted complex (PC) and a GC.
Here, $|PC \cap GC|$ represents the number of common proteins between the predicted complex and the GC, and $|PC|$ and $|GC|$ represent the number of proteins in the predicted complex and the GC, respectively. Following previous works (Wang et al. 2019, Wu et al. 2020), if $Rate_{match} \ge 0.2$, we consider that PC and GC match successfully. The four performance metrics are defined using the following quantities: $P_c$ and $G_c$ represent the number of complexes in the predicted set and the gold standard set, respectively; $P_{gc}$ is the number of predicted complexes matching some GC, and $G_{pc}$ is the number of GCs matching some predicted complex. "F-measure" is the harmonic mean of "Recall" and "Precision." MMR (Nepusz et al. 2012) is calculated based on the maximum matching in the bipartite graph whose two node sets are the predicted complex set and the GC set, with each edge weighted by $Rate_{match}$. MMR is the sum of the edge weights in the maximum matching divided by the number of GCs.
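The matching-based metrics can be sketched as follows. Two points are assumptions of this sketch rather than statements from the text: $Rate_{match}$ is taken to be the widely used neighborhood-affinity score $|PC \cap GC|^2 / (|PC| \cdot |GC|)$, and Precision and Recall are taken to be $P_{gc}/P_c$ and $G_{pc}/G_c$, as the definitions of these counts suggest. The function names are hypothetical. MMR is omitted because it requires a maximum-weight bipartite matching routine.

```python
def rate_match(pc, gc):
    """Assumed form of Rate_match: |PC ∩ GC|^2 / (|PC| * |GC|)."""
    pc, gc = set(pc), set(gc)
    if not pc or not gc:
        return 0.0
    common = len(pc & gc)
    return common * common / (len(pc) * len(gc))


def evaluate(predicted, gold, threshold=0.2):
    """Precision, Recall and F-measure from matched complexes."""
    # P_gc: predicted complexes that match at least one gold complex.
    p_gc = sum(any(rate_match(p, g) >= threshold for g in gold) for p in predicted)
    # G_pc: gold complexes that match at least one predicted complex.
    g_pc = sum(any(rate_match(p, g) >= threshold for p in predicted) for g in gold)
    precision = p_gc / len(predicted) if predicted else 0.0
    recall = g_pc / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


# Toy example (hypothetical identifiers).
predicted = [{"A", "B", "C"}, {"X", "Y"}]
gold = [{"A", "B", "C", "D"}, {"M", "N"}]
scores = evaluate(predicted, gold)
print(scores)
```

In the toy example only the first predicted complex matches a gold complex (score 9/12 = 0.75), so Precision, Recall, and F-measure are all 0.5.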

Complex function enrichment
Currently, though the scales of PINs are stably growing, the GC set has not been updated correspondingly. This means that the known PCs are very limited, which makes it hard to validate the generated complexes, while experimental validation is too expensive and time-consuming. Here, we try another way to evaluate the effectiveness of PC prediction. In general, the higher the biological significance of a complex, the more likely it is a real complex (Wang et al. 2019, Omranian et al. 2021). Thus, we assess the biological significance of predicted PCs by functional enrichment analysis.
Here, g:Profiler (Raudvere et al. 2019), a popular method of functional enrichment analysis, is used to evaluate the P-value of each generated PC to measure its biological significance. g:Profiler uses multiple test corrections to obtain P-values from GO and pathway enrichment analysis. Given an input query size, g:Profiler analyzes the approximate threshold t, which corresponds to the 5% upper quantile of randomly generated queries of that size. All actual P-values resulting from the query are calibrated by multiplying these values with the ratio of the approximate threshold t over the initial experiment-wide threshold. In this study, we use the default P-value threshold of 0.05, i.e. if the P-value of a complex is smaller than 0.05, it is biologically significant. All the proteins in the PIN (i.e. CPIN-H of human and CPIN-Y of yeast) are used as the background set and we ignore the annotations.

Data integration
Here, we discuss and analyse the integration results of PINs and gold standard PCs.

PIN integration
The quality of PINs is essential for computationally identifying PCs from PINs. Therefore, we examined the overlaps of proteins and PPIs between PINs of human and yeast in Table 1. For human, we found that the protein overlap rate is 47% (BioPlex overlapping HuRI) and 76% (HuRI overlapping BioPlex), but the overlap of PPIs is very low, 1% (BioPlex overlapping HuRI) and 2.6% (HuRI overlapping BioPlex). This indicates that PPIs are incomplete for any single PIN, which is harmful to complex identification (some complexes cannot be identified). Here, the rate of $A$ overlapping $B$ is evaluated by $|A \cap B| / |A|$, where $A$ and $B$ represent two PINs of the same organism and $|\cdot|$ is the number of proteins or PPIs in a certain PIN. There are three reasons for the limited PPI overlap between different PINs: (i) different high-throughput experiments tend to find different types of PPIs (Drew et al. 2021); (ii) a PIN is a dynamic network, in which PPIs change over time in the cell (Shi et al. 2018); (iii) current PPI experimental techniques may produce false-positive interactions (Yao et al. 2020a). Therefore, we merge the two PINs of human (i.e. HuRI and BioPlex) and remove the redundant interactions to construct a more comprehensive PIN, called CPIN-H, which covers more proteins and PPIs.
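The overlap rate defined above is asymmetric, which the following minimal sketch makes explicit. The function name `overlap_rate` and the toy identifiers are hypothetical. The second part re-derives the reported protein overlap rates from the protein counts given in the Datasets section, under the assumption that the 16 600 proteins of CPIN-H are exactly the union of the HuRI and BioPlex protein sets.

```python
def overlap_rate(a, b):
    """Rate of A overlapping B: |A ∩ B| / |A|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0


# Toy example: the two directions give different rates.
proteins_a = {"P1", "P2", "P3", "P4"}
proteins_b = {"P3", "P4", "P5"}
rate_ab = overlap_rate(proteins_a, proteins_b)  # 2 / 4
rate_ba = overlap_rate(proteins_b, proteins_a)  # 2 / 3

# Consistency check with the reported protein counts (assumption: the merged
# PIN is the plain union, so shared = |HuRI| + |BioPlex| - |CPIN-H|).
huri, bioplex, cpin_h = 8975, 14484, 16600
shared = huri + bioplex - cpin_h
bioplex_over_huri = shared / bioplex
huri_over_bioplex = shared / huri
print(shared, round(bioplex_over_huri, 2), round(huri_over_bioplex, 2))
```

The derived values round to 0.47 and 0.76, in line with the protein overlap rates quoted in the text.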
In addition, we have conducted complex identification on HuRI, BioPlex, and CPIN-H. The performance comparisons of PCGAN on the three PINs are presented in the Supplementary Material (see Supplementary Table S1). We can see that the prediction performance on the CPIN-H is better than that on HuRI and BioPlex. This shows that the combination of multiple different PINs is beneficial to PC identification.
Tables 4 and 5 present the overlap rates of proteins and PPIs in different PINs of yeast, respectively. Similarly, we can see that the overlap of proteins between different yeast PINs is very high, but the overlap of interactions is very low. Therefore, we combine the five different yeast PINs and remove the redundant interactions to generate a comprehensive yeast PIN, called CPIN-Y. Subsequent complex identification

Integrating PC sets
The GC sets are used to evaluate the performance of computational methods in identifying PCs. Therefore, a larger gold standard set will make the evaluation more accurate. A complex is usually regarded as a complete graph, i.e. any two proteins in the complex interact. For each complex in a gold standard set, we checked whether it is a complete graph in a certain PIN. The results are presented in Table 6, from which we can see that only a small number of complexes in a gold standard set are fully connected in a PIN. These fully connected complexes are easy to identify by computational methods. For example, CYC2008 has 349 complexes, but only 138 complexes are fully connected in Collins. Moreover, each gold standard set contains a limited number of PCs. Therefore, we combined these three GC sets to generate a more comprehensive GC set of yeast, called CGold-Y, which contains more complexes, and the number of fully connected protein complexes (FPCs) substantially increases in each yeast PIN.
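The fully-connected check can be sketched as a clique test: a complex is fully connected in a PIN if every pair of its proteins is an edge of that PIN. The sketch assumes the PIN is stored as a set of `frozenset` edges; the function names and toy data are hypothetical.

```python
from itertools import combinations


def is_fully_connected(complex_nodes, edges):
    """True if every pair of proteins in the complex interacts in the PIN.

    edges is a set of frozenset({u, v}) interactions. A complex with fewer
    than two proteins has no pairs and is trivially reported as connected.
    """
    nodes = sorted(set(complex_nodes))
    return all(frozenset(pair) in edges for pair in combinations(nodes, 2))


def count_fully_connected(complexes, edges):
    """Number of complexes that form a complete graph in the PIN."""
    return sum(is_fully_connected(c, edges) for c in complexes)


# Toy PIN: triangle A-B-C plus a single edge C-D.
edges = {frozenset(p) for p in [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]}
complexes = [["A", "B", "C"], ["B", "C", "D"], ["C", "D"]]
print(count_fully_connected(complexes, edges))
```

Here the second complex fails the test because B and D do not interact, so two of the three toy complexes are fully connected.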

Performance comparison with existing methods
Since PCGAN needs some real complexes to guide model learning, we use the remaining complexes (those not used as training samples) in the merged gold standard set as testing samples. Specifically, for human, we used 500 complexes in CORUM as the training set, 30% of the training set as the verification set, and the rest as the test set. For yeast, we used 300 complexes in CGold-Y as the training set, 30% of the training set as the verification set, and the rest as the test set. All methods identify PCs from CPIN-H (for human) and CPIN-Y (for yeast), and the predicted complexes are compared with the gold standard test set. Table 7 shows the performance of different methods on CPIN-H and CPIN-Y. We compared PCGAN with various major existing methods, including Markov Clustering (MCL) (Enright et al. 2002), GraphEntropy (Kenley and Cho 2011), ClusterONE (Nepusz et al. 2012), SPICi (Jiang and Singh 2010), MCODE (Bader and Hogue 2003), Core (Leung et al. 2009), and others. For human, PCGAN shows advantageous results over the other methods in most performance metrics, and is only slightly lower than ProRank+ (Hanna and Zaki 2014) in Precision. Although ProRank+ has a higher precision, its recall is much lower, indicating that it correctly predicts far fewer complexes, i.e. most of the complexes in its predicted set match only a small number of complexes in the gold standard set. More importantly, PCGAN performs much better than the other methods in terms of the comprehensive metrics F-measure and MMR, which indicates that our method is more effective in identifying PCs than the other methods. For yeast, PCGAN is superior to the other methods in terms of all metrics. In summary, the results above show that PCGAN is effective in identifying PCs and can generate high-quality complexes from both yeast and human PINs.

Functional enrichment analysis
Because the existing gold standard sets are incomplete, they cannot validate all generated complexes. Thus, in this article, functional enrichment analysis is employed to verify the effectiveness of the proposed method. Concretely, we used g:Profiler (Raudvere et al. 2019) for functional enrichment analysis of generated (for our method) or predicted (for the other methods) complexes. The functional enrichment degree of a PC is measured by its P-value. The smaller the P-value of a complex is, the more significant its biological function is. The P-value threshold of significance is set to 0.05 by default. Table 8 shows the percentage of biologically significant complexes identified by different methods. Compared with the other methods, PCGAN identifies the largest percentage (54.88%) of biologically significant complexes on the yeast PIN. At the threshold of 1e-3, the proportion of significant complexes identified by PCGAN (30.98%) is only slightly lower than that of MCODE (32.08%). However, PCGAN identifies more PCs than MCODE. For human, the percentage of biologically significant complexes identified by PCGAN (75.18%) is less than that of ClusterONE (Nepusz et al. 2012), MCODE (Bader and Hogue 2003), and ProRank+ (Hanna and Zaki 2014). On the one hand, the size of the complexes identified by these methods is >2, which leads to low P-values. On the other hand, these three methods identify far fewer complexes than PCGAN. In summary, PCGAN can effectively generate PCs of high biological significance.

Case study
The value of a PC identification method lies in its capability of identifying unknown real PCs. However, even if a method does identify some new and real complexes, it is difficult to validate that they are real complexes without the help of biological experiments. But biological experiment validation is expensive and time-consuming.
Here, we provide two PCs generated by our method as examples, which are very possibly real complexes according to our functional enrichment analysis and function check. These two PCs are not in the identified results of the other methods. We obtained the 3D structures of the two PCs by AlphaFold-Multimer (Evans et al. 2021) and rendered them using PyMOL (DeLano 2002). The first, shown in Fig. 2a, was generated from the human PIN; we denote it by NHC1 (meaning novel human complex No. 1). NHC1 is composed of eight human proteins and its P-value of functional enrichment is 1.16E-17, indicating that it has high biological significance. By further function check, we found that the proteins in NHC1 mainly exist in the mitochondria of cells and are involved in the biological processes of aerobic respiration and oxidative phosphorylation. This implies that it may be a PC related to cell respiratory function. The other is a yeast complex shown in Fig. 2b, which is named NYC1 (meaning novel yeast complex No. 1). NYC1 consists of nine proteins, and its P-value of functional enrichment is 4.00E-20. We found that the significant functional enrichment of NYC1 is mainly contributed by protein ubiquitination and cell mitosis. This suggests that NYC1 may be a nuclear ubiquitin ligase complex or a complex related to cell mitosis.

Conclusion
Accurate identification of PCs from PINs is an important research topic in computational biology (Wu et al. 2021a). In this article, we proposed a novel approach, PCGAN, for identifying PCs by GANs. To the best of our knowledge, this is the first generative method for PC identification. Existing methods are mainly based on unsupervised learning, i.e. clustering over PINs under the assumption that complexes are dense subnetworks of PINs. Our work provides a new framework for future research, which may inspire a new wave of studies on complex identification. Furthermore, considering the limitations of small and noisy PINs and small GC sets in previous works, we merged existing PINs of human and yeast to get larger and more comprehensive PINs, and combined existing GC sets to construct a large GC set. This not only boosts the performance of our method, but also provides new resources for future research on complex identification. Our experimental results show that PCGAN outperforms the existing methods, and the generated complexes have high biological significance. As for future work, we will focus on two directions: (i) extending the proposed method to identify PCs from PINs with biological features; and (ii) designing more advanced deep generative models (e.g. diffusion models) for PC generation.