## Abstract

**Motivation:** Protein–protein interaction, mediated by protein interaction sites, is intrinsic to many functional processes in the cell. In this paper, we propose a novel method to discover patterns in protein interaction sites. We observed from protein interaction networks that there exist a kind of significant substructures called interacting protein group pairs, which exhibit an all-versus-all interaction between the two protein-sets in such a pair. The full-interaction between the pair indicates a common interaction mechanism shared by the proteins in the pair, which can be referred as an interaction type. Motif pairs at the interaction sites of the protein group pairs can be used to represent such interaction type, with each motif derived from the sequences of a protein group by standard motif discovery algorithms. The systematic discovery of all pairs of interacting protein groups from large protein interaction networks is a computationally challenging problem. By a careful and sophisticated problem transformation, the problem is solved using efficient algorithms for mining frequent patterns, a problem extensively studied in data mining.

**Results:** We found 5349 pairs of interacting protein groups from a yeast interaction dataset. The expected value of sequence identity within the groups is only 7.48%, indicating non-homology within these protein groups. We derived 5343 motif pairs from these group pairs, represented in the form of blocks. Comparing our motifs with domains in the BLOCKS and PRINTS databases, we found that our blocks could be mapped to an average of 3.08 correlated blocks in these two databases. The mapped blocks occur 4221 out of total 6794 domains (protein groups) in these two databases. Comparing our motif pairs with iPfam consisting of 3045 interacting domain pairs derived from PDB, we found 47 matches occurring in 105 distinct PDB complexes. Comparing with another putative domain interaction database InterDom, we found 203 matches.

**Contact:**jinyan@i2r.a-star.edu.sg

## 1 INTRODUCTION

Protein–protein interactions carry out many biological processes in the cells such as gene expressions, signal transduction and inter-cellular communication. Protein interactions are usually mediated by short sequences of residues, which form the contact interfaces between two interacting proteins, referred to as interaction sites (Sheu *et al*., 2005). These interaction sites are often geometrically complementary and electric-statically compatible (Jones and Thornton, 1996). They are also highly conserved (Keskin *et al*., 2004, 2005) and co-evolved (Pazos *et al*., 1997) and only limited interaction templates exist, which are termed as interaction types (Aloy and Russell, 2004). Unraveling these interaction sites is helpful for understanding the mechanism of protein recognition and protein function, and is beneficial to the design of drug-aimed protein–protein interactions (Loregian and Palu, 2005).

Protein interaction sites can be determined by various experimental methods, including X-Ray crystallographic screening (Garman *et al*., 2000), NMR-based methods (Swanson *et al*., 1995; Takahashi *et al*., 2000), site-directed mutagenesis (Clemmons, 2001) and phage display (DeLano *et al*., 2000). Experimental methods are generally laborious and expensive. Consequently, only a small number of interaction types have been determined so far. It was estimated that it would take >20 years to accomplish all interaction types using current experimental techniques (Aloy and Russell, 2004).

On the other hand, computational methods play an important role in the determination of interaction sites owing to their low cost. Recently, protein–protein docking, which predicts the structures of protein complexes based on solved or modeled structures of the component proteins (Terwilliger, 2004), has made significant progress since the proposal of CAPRI assessment in 2001 (Mendez *et al*., 2005). Protein interaction sites can be pinpointed during the course of docking. However, ∼40% of proteins cannot be modeled for putative structures (Aloy *et al*., 2005). This leaves a critical gap in this docking approach.

Another approach is based on the conservation characteristics of interaction sites among homologous sequences, also referred to as binding motif discovery algorithms such as PROTOMAT (Henikoff and Heinikoff, 1991) and MEME (Bailey and Elkan, 1995). Correlations between the binding motifs can be measured by an expectation maximization (EM) model as shown by Wang *et al*., (2005). The intrinsic deficiency of this approach lies in the difficulty to distinguish the folding and binding motifs as binding and folding are both interrelated (Kumar *et al*., 2000). The third category of computational methods are machine learning oriented approaches. The features utilized in the learning are some known characteristics about interaction sites such as hydrophobicity (Gallet *et al*., 2000), the sequence segments (Ofran and Rost, 2003) or the spatial patches (Jones and Thornton, 1997). SVMs (Yan *et al*., 2004) and neural networks (Zhou and Shan, 2001) are two commonly used machine learning methods. Drawbacks in this approach include the difficulty in finding discriminating features and the unattractive performance in accuracy. Overall, current computational methods for interaction site prediction are far from perfect.

In this paper, we propose a novel approach to the discovery of interaction sites on a proteome-wide scale. This approach uses only protein interaction data and the associated sequence data. As mentioned earlier, interaction sites are highly conserved (Keskin *et al*., 2005). Conserved interaction sites are favorable interfacial scaffolds that have been repeatedly used in the evolution process by proteins with different sequence, structure and function (Keskin and Nussinov, 2005). An example can be seen from *cipa* (PDB code 1aoh) and *Dsred* (PDB code 1g7k), two complexes which have similar interfaces between their component chains A and B (Keskin *et al*., 2004), but which have dissimilar global structures and functions. (See Supplementary information.) A set of conserved interaction sites corresponds to an interaction type (Aloy and Russell, 2004) as they share some common binding mechanism. Whenever the interaction type occurs in a novel protein pair regardless of their homology, the two proteins are likely to interact—This principle has been used by Tong *et al*. (2002) and Aytuna *et al*. (2005) to predict protein interactions with an acceptable performance. Such an interaction type implies a most-versus-most and even an all-versus-all interaction subnetwork between two groups of proteins in a protein network, with each protein group corresponding to one side of the interaction type. Figure 1 shows an example of such a subnetwork (Tong *et al*., 2002).

**Fig. 1**

**Fig. 1**

Interestingly, if a large enough subnetwork with all-versus-all interactions between two protein groups is found in a protein network, an interaction type with conserved interaction sites can be predicted. That is because most proteins only contain a small number of interaction sites [usually, 2–6 for typical proteins (Liang *et al*., 1998)]. Owing to the constraints of all-versus-all interactions between these two groups, it is expected that there exists two groups of interaction sites from these two protein groups which interact with each other for at least some occurrences. The interaction sites within the same group should hold similar structures and possibly have a sequence motif as they have similar interaction partners. These two groups of interaction sites and their corresponding motifs can be easily identified using standard motif discovery methods from the sequence data of the corresponding protein group. Then, an interacting motif pair (Li *et al*., 2004) is formed, with which to represent the corresponding interaction sites of the interaction type.

We term the above two protein groups that exhibit an all-versus-all interaction as a pair of interacting protein groups. It is a challenging problem to discover all pairs of interacting protein groups from a proteome-wide protein interaction network by a naive way as the number of combinations of proteins is exponential. However, we found that this problem of mining interacting protein groups can be transformed into the classical problem of mining frequent patterns (Agrawal and Srikant, 1994). As frequent pattern mining has been extensively studied in the data mining field, many existing algorithms can be directly used to efficiently find all pairs of frequent interacting protein groups from large datasets of protein interactions.

To assess the performance of our proposed method for mining motif pairs from a large yeast interaction dataset, we propose a systematic validation experiment on comprehensive domain databases and domain–domain interaction databases. We compare our single motifs with the domains in specific domain databases to study the relationship between our motifs and domains. Even more importantly, we study the relationship between motif pairs at interaction sites and interacting domain pairs, by mapping our motif pairs into domain–domain interacting pairs and analyzing the amount of overlaps between our mapped domain pairs and those in domain–domain interaction databases.

## 2 INTERACTING PROTEIN GROUPS

We fix

*m*proteins: {

*P*,

_{i}*i*= 1, …,

*m*}, and

*n*interacting protein pairs of the proteins in

*PP*= {

_{i}*P*,

_{i}*Q*},

_{i}*i*= 1, … ,

*n*,

*P*∈

_{i}*Q*∈

_{i}*P*and

_{i}*Q*have interactions}.

_{i}*[Neighborhood of a protein] The neighborhood β*(*P*) *of a protein P* ∈

*is defined as the set of proteins in*

*that interact with P. That is, β*(

*P*) = {

*Q*|

*Q*∈

*Q interacts with P*}

This neighborhood notion can be generalized by replacing one protein with a protein set. Then the definition can capture a partial all-versus-all relation between two protein sets.

*[Neighborhood of a protein set] The neighborhood β*(**S**) *of a protein set***S** ⊆

*is the intersection of the neighborhoods of all proteins in*

**S**.

*In other words, it is the set of proteins that interact with all proteins in*

**S**.

*That is, β*(

**S**) = ∩

_{P∈S}β(

*P*).

*In particular, we define β*(∅) =

If a protein interacts with all proteins in **S**, it must be in the neighborhood set β(**S**). However, if a protein interacts with all proteins in β(**S**), it may not be in **S**. Our next definition gives a maximal all-versus-all neighborhood-relation between two protein-sets.

*[A pair of interacting protein groups] Let***A, B** ⊆

*be two protein sets. If*β(

**A**) =

**B**

*and*β(

**B**) =

**A**,

*then we call*

**A**

*and*

**B**

*a pair of*interacting protein groups.

*If*|

**A**| ≥ τ and |

**B**| ≥ τ,

*we call*

**A**

*and*

**B**

*a pair of*frequent

*interacting protein groups, where τ is a positive user-defined threshold*.

Definition 3 also says that if two proteins are a pair of interacting protein groups, then every protein in one set (**A** or **B**) interacts with all proteins in the other set, and vice versa. Note that not every protein set is an interacting protein group because the partner interacting protein group may not exist.

Our definition of interacting protein group pairs is closely related to that of maximal complete bipartite subgraphs in graph theory (Eppstein, 1994). For details about their theoretical issues, please refer to our previous paper (Li *et al*., 2005).

We require a protein group to be large enough (≥τ) in our definition. This is because it is rather hard to determine whether a motif is significant if there are only a few of proteins in a group.

*Problem statement*. Let a set

**A**and

**B**, such that |

**A**| ≥ τ and |

**B**| ≥ τ, and then to identify ‘good’ motif pairs from the pairs of frequent interacting protein groups.

## 3 METHODS

Our algorithm consists of two steps: The first step is to find all pairs of interacting protein groups from

### 3.1 Mining interacting protein groups

The classic problem of mining frequent patterns in the data mining field is: Given a set *I* of items and a set TDB of transactions *T _{i}*,

*i*= 1, … ,

*x*, where a transaction

*T*is a subset of

_{i}*I*(i.e.

*T*⊆

_{i}*I*), the problem is to find all frequent patterns of TDB—all itemsets

*I*′ ⊆

*I*such that the number of transactions that contain

*I*′ is no less than a threshold τ, where τ is a user-specified threshold. Here, the number of the transactions that contain

*I*′ is called the support of

*I*′ in

*TDB*. The set of the identities (ids) of the transactions that contain

*I*′ is called the occurrence set of

*I*′, denoted by

*g*(

*I*′) = {id(

*T*) |

_{i}*T*∈ TDB,

_{i}*I*′ ⊆

*T*}, where id(

_{i}*T*) is the identity of

_{i}*T*.

_{i}To transform the problem of mining frequent interacting protein groups to the problem of mining frequent patterns, a crucial step is to determine what is an item and what is a transaction. We map a protein to an item—therefore,

*I*. Thus, a protein set is a transaction. The next crucial step is to determine which and how many protein sets (transactions) are in TDB. We define a transaction in TDB as the neighborhood of a protein in

^{PrtAll}. So, this special TDB contains

*m*transactions, namely

*T*= β(

_{i}*P*),

_{i}*i*= 1, . . . ,

*m*, where

*m*is the total number of proteins in

*T*(

_{i}*i*= 1, . . . ,

*m*) is

*P*, namely id(

_{i}*T*) =

_{i}*P*.

_{i}Let **X** be a frequent pattern of TDB^{}

**X**be

*k*, and its occurrence set be

*g*(

**X**) = {

*P*

_{1},

*P*

_{2}, . . . ,

*P*}. Then, the meaning of

_{k}**X**is (of course) a protein set in which every protein interacts with all proteins in the occurrence set

*g*(

**X**). This is because

**X**is a subset of β(

*P*) for all

_{i}*i*= 1, . . . ,

*k*. In other words, all proteins in

**X**interact with every protein in

*g*(

**X**). Therefore, all proteins in

*g*(

**X**) must be in the neighborhood of

**X**[i.e. β(

**X**)].

Furthermore, there is no protein other than *P _{i}* (

*i*= 1, …,

*k*) that interacts with all proteins in

**X**. This is due to the definition of support of a pattern. We explain it using a contradiction. Suppose there is a protein

*Q*∈

*P*(

_{i}*i*= 1, …,

*k*) that interacts with all proteins in

**X**, then β(

*Q*)—a transaction in TDB

^{PrtAll}—contains

**X**. So, the support of

**X**would be

*k*+ 1. But, the support of

**X**is only

*k*. Here is a contradiction. Therefore, there are exactly only

*P*,

_{i}*i*= 1, …,

*k*, that interact with all proteins in

**X**. This can be re-written as β(

**X**) =

*g*(

**X**). That is, the neighborhood of a frequent pattern

**X**of TDB

^{PrtAll}is the occurrence set of

**X**. All these ideas and discussions can be in the important theorem below.

*Let X be a frequent pattern of TDB*

^{}

*Thenβ*(

**X**) =

*g*(

**X**),

*and g*(

**X**)

*is a frequent protein set. Let f*(

_{β}**X**) = β(β(

**X**)).

*If*|

*f*(

_{β}**X**)| ≥ τ,

*then β*(

**X**) and

*f*(

_{β}**X**)

*is a pair of frequent interacting protein groups*.

Proof: Denote β(**X**) = **A** and *f _{β}*(

**X**) =

**B**. So,

**B**is an itemset of TDB

^{}

*g*(

**X**)|. On the other hand, it is a superset of

**X**as

**X**is contained in every β(

*P*) for

*P*∈

*g*(

**X**), so, its support is at most |

*g*(

**X**)|. Therefore,

**B**and

**X**have the same level of support in TDB

^{PrtAll}, and also have the same occurrence set, namely the

*g*(

**X**). This means β(

**B**) =

*g*(

**X**) =

**A**.

Obviously, β(

**A**) =*f*(_{β}**X**) =**B**.As

**B**is the neighborhood of*g*(**X**), then$B=\u2229P\u2208g(X)\beta (P).$

Combining (I) and (II), we get β(**X**) and *f _{β}*(

**X**) is a pair of frequent interacting protein groups if |

*f*(

_{β}**X**)| ≥ τ.

This theorem indicates that every frequent pattern of *TDB*^{}

Is there any other patterns of *TDB*^{}

**X**of

*TDB*

^{PrtAll}, its occurrence set β(

**X**) is infrequent, i.e. |β(

**X**)| < τ. So, no matter what is the size of

*f*

_{β}(

**X**), β(

**X**) and

*f*

_{β}(

**X**) is not a pair of frequent interacting protein pairs.

We also conjecture that some frequent patterns of *TDB*^{}

**X**in an equivalence class (Nicolas

*et al*., 1999) share the same pair of interacting protein groups (Li

*et al*., 2005). The resulted

*f*

_{β}(

**X**) defined in above theorem one-to-one matches an equivalence class, often termed as a closed pattern (Nicolas

*et al*., 1999). So, it is unnecessary for us to identify all frequent patterns from TDB

^{PrtAll}for a given threshold τ, instead, one can just discover all frequent closed patterns from TDB

^{PrtAll}using an efficient algorithm such as FPClose* (Grahne and Zhu, 2003).

### 3.2 Generating a motif pair from a pair of interacting protein groups

Given a protein group and its sequence data, we can get a motif (possibly containing flexible gaps) using standard motif discovery algorithms such as PROTOMAT (Henikoff and Heinikoff, 1991) and MEME (Bailey and Elkan, 1995). So, we can easily obtain a motif pair from a pair of interacting protein groups by executing the motif discovery algorithm twice. In this paper, we choose PROTOMAT (Henikoff and Heinikoff, 1991) as the motif discovery algorithm because it is believed to be a good method to find local conserved regions from a group of related proteins. PROTOMAT is also a key method to construct BLOCKS database (Pietrokovski *et al*., 1996)—a comprehensive database of highly conserved regions for homologous protein groups (domains).

## 4 RESULTS

To assess the performance of our proposed method for mining motif pairs, we performed several experiments on a PC with a CPU clock rate of 3.2 GHz and 2 GB of main memory. The protein interaction set PairDB used in the experiments was downloaded from DIP (database of interacting proteins) on October 23, 2005, consisting of 17 511 experimentally determined interactions in *Saccharomyces cerevisiae* (yeast) among 4959 proteins. We select 10 640 physical interactions by excluding 6871 interactions determined only by complex level experiments such as Tandem Affinity Purification (TAP) and immunoprecipitation (the full list of excluded experiments can be found in the Supplementary information). To discover frequent closed patterns by FPClose* (Grahne and Zhu, 2003), we set the threshold τ = 5, an average number of interactions per protein in the yeast genome (Grigoriev, 2003). Default parameters are used for PROTOMAT (Henikoff and Heinikoff, 1991). To facilitate our analysis, we further term the motifs induced from closed patterns as left motifs (left blocks), while the ones induced from the occurrence sets of the closed patterns as right motifs (right blocks).

The FPClose* algorithm outputs a total of 5349 non-redundant pairs of interacting protein groups, by taking 4.35 s on our machine (including the transformation). The mining based on the transformation idea is very efficient compared with a naive search method which needs ∼33 min (455-fold more than the efficient approach) to find all the protein groups. The implementation of the naive search contains some optimization techniques.

The homology property within a group is an interesting issue. It can be estimated simply by the sequence identity within the group—A value <15% is often considered as a good indicator for non-homology (Doolittle, 1981). We calculate all pairwise sequence identities within a same protein group using CLUSTAL W package with default parameters (Thompson *et al*., 1994). Then we use the average value of these pairwise sequence identities as the sequence identity within the group. The distribution of the sequence identities within the 10 698 groups is shown in Figure 2, with more details in the Supplementary information. The expected value of the sequence identities within the groups is 7.48%, with a SD 1.33%. This is a good value indicating the non-homology within these groups. Therefore, these groups and their underlying sequence motifs are unlikely to detect by standard methods based on sequence homology (Sauder *et al*., 2000).

**Fig. 2**

**Fig. 2**

The PROTOMAT method outputs 5343 motif pairs from these 5349 pairs of interacting protein groups by taking 3 h. Of the protein groups 85% generate two or three blocks. (Note that a group in BLOCKS contains 6.91 blocks on average.) Only four left groups and two right groups failed to produce any valid motif, with a failure rate <0.2%. Totally, there are 11 948 left blocks and 13 004 right blocks. The average length of these blocks is 11.05, with a SD 5.06. Compared with BLOCKS where the average length of blocks is 25.337 and the SD is 12.897, our blocks are more specific and match better with current knowledge about interaction sites, i.e. 10–20 residues in length (Sheu *et al*., 2005).

We treat the whole set of blocks generated by PROTOMAT from a protein group rather than each individual block as a motif to reflect the cooperation among these blocks. We expect that some interactions happen among the blocks from different sides of the motif pair, but do not study the detailed interactions among these blocks in this paper. In our results, the average number of blocks per motif is 2.33, with a SD 0.73. The average number of proteins per motif is 7.01, with a SD 2.59. More details are in the Supplementary information.

### 4.1 Validations

Currently, comprehensive databases for motif–motif interactions (motif pairs) are hard to find but there are a handful of databases for domain–domain interactions such as iPfam (Finn *et al*., 2005), 3did (Stein *et al*., 2005) and InterDom (Ng *et al*., 2003). Since domains are known to involve in protein interactions and are closely related to motifs, we compare our motif pairs with these domain pairs. The following two steps are employed to illustrate the effectiveness of our algorithm.

Compare all single motifs in our discovered motif pairs with all domains in specific domain databases to obtain overall matches, i.e. to determine the number of motifs that can be mapped to these domains and the overall correlation in the portions that are mapped.

Map our motif pairs into domain–domain interacting pairs to determine the number of overlaps between our mapped domain pairs and those in the domain–domain interaction database.

#### 4.1.1 Validations for single motifs

As our motifs are in the form of blocks, we need domain databases also in the form of blocks for comparison. Currently, there are two major domain databases in the form of blocks: BLOCKS (Pietrokovski *et al*., 1996) and PRINTS (Attwood and Beck, 1994). Some information of these two domain databases are shown in the first two columns of Table 1, where an entry corresponds to a block.

**Table 1**

BLOCKS | PRINTS | Pfam | iPfam | |
---|---|---|---|---|

Version | 14.0 | 37.0 | 16.0 | 18.0 |

Number of domains | 4944 | 1850 | 7677 | 2145 |

Number of entries | 24294 | 11170 | 7677 | 3045 |

BLOCKS | PRINTS | Pfam | iPfam | |
---|---|---|---|---|

Version | 14.0 | 37.0 | 16.0 | 18.0 |

Number of domains | 4944 | 1850 | 7677 | 2145 |

Number of entries | 24294 | 11170 | 7677 | 3045 |

The comparison is conducted by a program called Local Alignment of Multiple Alignments (LAMA) (Pietrokovski, 1996). It utilizes Smith–Waterman algorithm (Smith and Waterman, 1981) to determine the optimal local alignments for pairs of position-specific scoring matrices (PSSMs) (Gribskov *et al*., 1987) of the corresponding blocks. To estimate the alignment scores with different lengths and to filter out the coincidental matches, LAMA uses the *Z*-score as a significance measurement, where a *Z*-score between a pair of PSSMs is defined as the number of standard deviations away from the mean score generated by millions of shuffled blocks in the BLOCKS database.

In our study, we used the default threshold 5.6 for *Z*-score in LAMA to compare our blocks with those in BLOCKS and PRINTS. If 95% of the positions of a block are in the optimal alignment between this block and another block and the *Z*-score is no less than the threshold, we say there is a mapping from the former block to the latter one. If there is a mapping from any block of a motif to any block of a domain, we say the motif can be mapped to the domain. We have following results from this experiment:

On average, each of our blocks maps to ∼3.08 blocks in the BLOCKS or PRINTS databases. See more detailed report in the columns 2 and 3 of Table 2.

The average correlation between the columns of our blocks and the columns from the database in the optimal alignments is as high as 53.88%. See column 4 of Table 2 for details.

Our motifs can be mapped to 4221 domains out of a total of 6794 domains in these two databases, having a coverage of 62%. See Table 3. This result is interesting as our blocks can only be mapped to 8582 blocks out of the total 35 464 blocks in these two databases, having a coverage <24%. The interpretation from a biological perspective is that most domains have ∼40% of blocks as their interaction sites, while others may be related to folding.

Although only 59% (14 620 out of 24 952) of our blocks can be mapped to blocks in BLOCKS and PRINTS, as high as 86% (9153 out of 10 686) of motifs can be mapped to domains in these two databases. See Table 4 for details.

**Table 2**

# of our blocks | # of mappings to BLOCKS blocks | # of mappings to PRINTS blocks | Average correlation | |
---|---|---|---|---|

Left blocks | 11948 | 29357 | 8632 | 54.31 |

Right blocks | 13004 | 30220 | 8738 | 53.42 |

# of our blocks | # of mappings to BLOCKS blocks | # of mappings to PRINTS blocks | Average correlation | |
---|---|---|---|---|

Left blocks | 11948 | 29357 | 8632 | 54.31 |

Right blocks | 13004 | 30220 | 8738 | 53.42 |

**Table 3**

Mapped/total | Mapped/total | Mapped/total | |
---|---|---|---|

# in BLOCKS | # in PRINTS | # in ANY | |

Blocks | 6408/24294 | 2174/11170 | 8582/35464 |

Domains | 3128/4944 | 1093/1850 | 4221/6794 |

Mapped/total | Mapped/total | Mapped/total | |
---|---|---|---|

# in BLOCKS | # in PRINTS | # in ANY | |

Blocks | 6408/24294 | 2174/11170 | 8582/35464 |

Domains | 3128/4944 | 1093/1850 | 4221/6794 |

**Table 4**

Total # | # mapped | # mapped | # mapped | |
---|---|---|---|---|

to BLOCKS | to PRINTS | to ANY | ||

Blocks | 24952 | 13859 | 8010 | 14620 |

Motifs | 10686 | 8879 | 6464 | 9153 |

Total # | # mapped | # mapped | # mapped | |
---|---|---|---|---|

to BLOCKS | to PRINTS | to ANY | ||

Blocks | 24952 | 13859 | 8010 | 14620 |

Motifs | 10686 | 8879 | 6464 | 9153 |

Note that our groups and groups in BLOCKS and PRINTS are constructed in quite different ways and their homology properties are also different. However, our comparison results reveal high correlation between their resulted blocks. This correlation may originate from the common involvement of interactions for both our motifs and their domains. This confirms the effectiveness of our method in some way.

#### 4.1.2 Validations for motif pairs

To assess whether our discovered motif pairs are indeed interaction sites, we compare them with domain–domain interacting pairs. If our motif pairs represent interaction sites, they should be mapped to some domain–domain interacting pairs in some databases. We choose iPfam (Finn *et al*., 2005) for this purpose. It consists of 3045 interacting pairs among 2145 Pfam domains (Sonnhammer *et al*., 1997) derived from protein complexes in PDB.

The cross-links between our motif pairs and the domain–domain pairs in iPfam is complicated. A reason is that the domain–domain pairs are represented by Pfam entries. To find the cross-links, (1) we first map our motifs to domains (protein groups) in the BLOCKS or PRINTS database, as shown in Section 4.1.1; (2) we then map a protein group of BLOCKS to a protein group of InterPro (Apweiler *et al*., 2001) as there exists a one-to-one mapping between an entry of BLOCKS and an entry of InterPro; (3) then we use existing cross-links between protein groups of InterPro and domains of Pfam to determine the cross-link between our motifs and Pfam domains. By this roadmap, we can map our motif pairs into domain–domain pairs with Pfam domain entries. Note that the association between PRINTS and Pfam is clear. Also note that the cross-linking mapping between motif pairs and domain–domain pairs is not a one-to-one mapping.

Using the above cross-link mapping, we compared our 5343 motif pairs with the 3045 domain–domain pairs in the iPfam database, 47 motif pairs can be mapped to 18 distinct domain pairs among 22 domains occurring in PDB complexes for 172 times (totally 105 distinct protein complexes).

Though the overlapping proportion seems modest, we assert that the result is significant because of the following:

We read only interacting protein sequence pairs, while some predictions about interaction sites can be confirmed by domain–domain interactions in PDB complexes.

iPfam is a rather incomplete database, containing merely 3045 pairs among 2145 domains. Moreover, only 997 out of 4221 of our mapped domains are studied in iPfam, as shown in Table 5.

The motif pairs we discovered are taken only from the yeast genome while iPfam covers a variety of species.

Comparing with Interdom with 30037 putative interacting domain pairs (Ng

*et al*., 2003), our motif pairs can be mapped to 203 domain pairs, including 94 high-confidence ones.

**Table 5**

BLOCKS | PRINTS | Combined | |
---|---|---|---|

BLOCKS/PRINTS domains | 3128 | 1093 | 4221 |

Pfam domains | 2305 | 144 | 2338 |

iPfam domains | 975 | 87 | 997 |

BLOCKS | PRINTS | Combined | |
---|---|---|---|

BLOCKS/PRINTS domains | 3128 | 1093 | 4221 |

Pfam domains | 2305 | 144 | 2338 |

iPfam domains | 975 | 87 | 997 |

### 4.2 A case study

The 5343 motif pairs that we discovered can be ranked according to their correlation score in the mapping. Most of top-ranked motif pairs can be confirmed by protein complexes. Here we report details of one such pair. Our purpose is to check whether some block pairs in the motif pair can be aligned with a segment pair in a complex containing the mapped domain pair, and then check whether the segment pair has some contacts among their residues.

This motif pair is generated from the first pair of interacting protein groups. This protein group pair generates three blocks on the left and one block on the right. The first left block 1xxxxxxA contains 24 positions, while the right block 1xright contains 36 positions, as shown in Table 6.

**Table 6**

AC 1xxxxxxA; distance from previous block=(18,243) |

BL LLE motif=[4,0,17] motomat=[1,80,−10] width=24 seqs=4 |

DIP:1330N (58) LRDGRMLFGVLRTFD QY A NLI LQD |

DIP:2570N (206)TLE GRE I MIRNLSTE LL D ENLLRE |

DIP:848N (19) LKNGE I I QGILT NVDNWMNLTLSN |

DIP:883N (244)LQS GRR SKRDLS PEE QR R LQI RHA |

pdb 1mgq_A(30) LKg dRE f r GV Lk SFD Lh M NLvLn D |

AC 1xright; distance from previous block=(6,52) |

BL GNL motif=[3,0,17] motomat=[1,80,−10] width=36 seqs=5 |

DIP:1417N (12) IDK TI N QKVLI V LQS NRE FEG TLV GFD DFV NVI LED |

DIP:1418N (53) LSDI I G KTV NVKLAS GLL YSGRLE S I DGFMNVALSS |

DIP:1419N (22) LAKYKDSK I RVKLMGGKL VI G VLKGYDQLMNLVLDD |

DIP:794N (7) FKT LVD QEV VVELKNDI E I KGTLQSVD QFL NLKLDN |

DIP:903N (24) LKDYLNKRV V I I KVDGEC LI A SLN GFD KNT NLF I TN |

pdb1mgq_A (18) Lg n s LN S p V i I K LKGDRE Fr G VLKSFD l h M NLVLn D |

AC 1xxxxxxA; distance from previous block=(18,243) |

BL LLE motif=[4,0,17] motomat=[1,80,−10] width=24 seqs=4 |

DIP:1330N (58) LRDGRMLFGVLRTFD QY A NLI LQD |

DIP:2570N (206)TLE GRE I MIRNLSTE LL D ENLLRE |

DIP:848N (19) LKNGE I I QGILT NVDNWMNLTLSN |

DIP:883N (244)LQS GRR SKRDLS PEE QR R LQI RHA |

pdb 1mgq_A(30) LKg dRE f r GV Lk SFD Lh M NLvLn D |

AC 1xright; distance from previous block=(6,52) |

BL GNL motif=[3,0,17] motomat=[1,80,−10] width=36 seqs=5 |

DIP:1417N (12) IDK TI N QKVLI V LQS NRE FEG TLV GFD DFV NVI LED |

DIP:1418N (53) LSDI I G KTV NVKLAS GLL YSGRLE S I DGFMNVALSS |

DIP:1419N (22) LAKYKDSK I RVKLMGGKL VI G VLKGYDQLMNLVLDD |

DIP:794N (7) FKT LVD QEV VVELKNDI E I KGTLQSVD QFL NLKLDN |

DIP:903N (24) LKDYLNKRV V I I KVDGEC LI A SLN GFD KNT NLF I TN |

pdb1mgq_A (18) Lg n s LN S p V i I K LKGDRE Fr G VLKSFD l h M NLVLn D |

Through the approach depicted in Section 4.1.2, we map the block pair (1xxxxxxA, 1xright) into domain pair (PF01423,PF01423) in iPfam. Pfam database indicates that PF01423 is a LSM domain, and iPfam shows that one LSM domain interacts with another LSM domain densely in 20 complexes such as pdb1mgq, pdb1h64. We take the complex pdb1mgq as an example to explain what we found. It has seven chains each containing a LSM domain. The 3D structure of these seven chains and their interactions can be found in our Supplementary information and also in the reference (). We observed the following details:

Our left block 1xxxxxxA can be well aligned at positions 30–53 within the LSM domain of the chain A at the complex pdf1gmgq, and our right block 1xright can be well-aligned at positions 18–53 of the chain B also within the LSM domain at the same complex. See Table 6 for alignment details.

The residue 47M (residue M at position 47) of the chain A interacts with residue 48N of the chain B in pdb1mgq; another pair between residue 46H of the chain A and residue 48N of the chain B is also spatially close. See Figure 3 for details about the interactions between this segment pair ().

The interaction pair (47M,48N) is well conserved in the complex pdb1mgq—it occurs in seven chain interactions out of a total of nine chain interactions. The seven interactions are between chain A and chain B, between chain B and chain C, … , and between chain G to chain A. Interestingly, this residue interaction located in the middle of the domain is also highly conserved in other complexes containing LSM domains, e.g. in the complex pdb1h64.

**Fig. 3**

**Fig. 3**

## 5 DISCUSSION AND CONCLUSION

Our motif pairs are conceptually similar to correlated sequence-signatures proposed by Sprinzak and Margalit (2001). But their correlated sequence-signatures are modeled as over-represented domain pairs, which are essentially longer than our motif pairs and cannot derive novel binding motifs since their domains are predefined. On the other hand, our interacting protein group pairs are structurally similar to interacting domain profile pairs proposed by Wojcik and Schachter (2001). But each of their domain profiles is the summarization of a domain cluster, which is a set of domains sharing significant sequence similarity and interacting with the same region of a certain protein. This approach replies on protein–protein interactions with domain interaction annotations, which are not widely available.

In our model, we require that pairs of interacting protein groups should always have an all-versus-all relationship. This is a bit strict as it is vulnerable to handle incomplete dataset. As a future direction, we will consider most-versus-most relationship.

Other future work include new evaluation methods. For example, the predicted interaction sites in the blocks of motif pairs can be compared with known interaction sites in some protein-protein interaction databases (Rain *et al*., 2001) or compared with interaction sites in interface databases (Keskin *et al*., 2004). Also our motif pairs can be compared with those learned from non-interacting protein pairs or from random protein pairs, to study their statistical significance, as done in our previous study (Li and Li, 2005).

Finally, we summarize the main results achieved in this work. We have used the concept of motif pairs to model protein interaction sites and studied the mining problem based on the sequence data of interacting protein pairs. We have proposed the new concept of interacting protein groups for the discovery, where a protein group may share a common interaction motif and a pair of protein groups may share a motif pair at their interaction sites. We transformed the mining of interacting protein groups into the mining of frequent closed patterns. We used standard motif discovery algorithms onto these discovered interacting protein groups to generate motif pairs in form of blocks. The high efficiency of this two-step approach is because of (1) In the discovery of interacting protein groups, we examine only interacting protein pairs without checking their sequences, thereby dramatically reduce the complexity of the problem; (2) By producing protein groups first, the discovery of interaction motifs is greatly accelerated as we need not execute the NP-hard motif discovery algorithm on insignificant candidates of protein sets.

The systematic validation results of the discovered motif pairs indicate that our discovered motifs have high correlation with domains in the existing domains databases. Our discovered motif pairs can also be mapped into the domain–domain interacting pairs in an experimentally validated domain–domain database with good matches.

We are grateful to Hao Han and Soon Heng Tan for their comments on the validation methods, biological concepts and nomenclature in the paper. We are also thankful to Benjamin Schuster-Böckler for providing the iPfam database. We appreciate Donny Soh and Ling Li for polishing the initial version of the manuscript.

*Conflict of Interest:* none declared.

## REFERENCES

*Nature*,

**409**, 553; (2001)

*Nature*,

**409**, 743.]

## Comments