Abstract

Several algorithms have been recently designed to identify motifs in biological networks, particularly in protein–protein interaction networks. Motifs correspond to repeated modules in the network that may be of biological interest. The approaches proposed in the literature often differ in the definition of a motif, the way the occurrences of a motif are counted and the way their statistical significance is assessed. This has strong implications on the computational complexity of the discovery process and on the type of results that can be expected. This review presents in a systematic way the different computational settings outlining their main features and limitations.

INTRODUCTION

Much data has become recently available about protein–protein interactions (PPI) in selected organisms [1–3]. The interactions have been determined by a number of experimental [4–6] and computational [7–9] techniques designed to systematically identify both the candidate and actual protein interactions. A PPI network is modelled by a graph whose nodes correspond to proteins and edges to interactions between two proteins. Much work has been devoted to the characterization and modelling of biological networks in terms of global properties of the underlying graphs, such as diameter, degree distribution and clustering coefficient. Recently, there has been a shift of attention of researchers from global properties to local properties that provide a concise description of a network in terms of appearances of a given number of small subgraphs. It has been shown that PPI networks exhibit interesting features in terms of repeated occurrences of certain modules and that these modules have biological meaning, since they may represent evolutionary conserved topological units of cellular networks. Thus, much attention has been paid to the identification of small subgraphs, particularly those occurring significantly often within the biological networks. These subgraphs, referred to as network motifs or simply motifs, can be considered as basic building blocks of complex networks [10]. The PPI data available are certainly incomplete, thus many false negatives affect the studies on these networks. Nevertheless, the current research effort on the characterization of biological networks may ultimately aid in understanding the cell organization.

This review focuses on the computational aspects of the motif discovery process in biological networks. Although we will consider PPI networks in this review, most of the algorithms that will be reviewed have been applied also to other types of biological networks, such as transcription–regulation [11–14], and to networks of different nature, as social [15, 16] and WWW [17]. A discovery algorithm typically consists of three steps: (i) find the occurrences of subgraphs of a given size in a graph; (ii) determine which subgraphs are isomorphic and group them into classes, sometimes referred to as graphlets; and (iii) determine the statistical significance of the graphlets by comparing their frequencies to those of an ensemble of random networks. In ‘Graphs: definitions’ section, we give basic definitions of graphs and graph isomorphism. In ‘Network motifs: definitions and properties’ section, we present various definitions of motifs, concentrating on different ways of counting subgraphs, and show how they affect the computational complexity of the discovery process. Models of biological networks are reviewed in ‘Model of random graphs’ section, since they provide a framework for the evaluation of the statistical significance of the obtained subgraphs. Finally, in ‘Motif discovery algorithms’ section, we give a short outline of some of the algorithms proposed in the literature to address this problem. We conclude with a short survey of the literature on the controversial biological interpretation of motifs.

GRAPHS: DEFINITIONS

In this section we will briefly review basic definitions of graphs. A graph is a pair G = (V, E) where V = {v} is the set of nodes or vertices and E = {(u, v)} is the set of edges each connecting a pair of nodes.

A subgraph GS = (VS, ES) of G = (V, E) consists of a subset of nodes VSV and of a subset of edges ESE connecting the nodes of VS in the original graph. The graph induced by VS is the subgraph GS that includes all the edges of G which connect the vertices of VS.

In most of the algorithms we will review, the graphs are represented by adjacency matrices. Assume that the nodes of a graph are numbered in some arbitrary order from 1 to n, that is V = {v1, …,vn}. The adjacency matrix M = {mij}, i, j = 1, …,n, is then defined as a two-dimensional square matrix whose element mij is 1 if there is an edge connecting nodes vi and vj and 0 otherwise. Two graphs G1 and G2 are isomorphic if there is a one-to-one mapping f between vertices of G1 and those of G2 with the property that two vertices u and v of G1 are adjacent if and only if f(u) and f(v) are adjacent in G2 [18]. In other words, two graphs are isomorphic if it is possible to permute the vertices so that G1 coincides with G2. Determining graph isomorphism is not known to be either in P or in NP-complete. On the other hand, given two graphs G and H determining whether G contains a subgraph isomorphic to H is an NP-complete problem, thus no polynomial time algorithm is known to solve it. In practice, several efficient approaches exist for small subgraphs. Canonical labelling of a graph has been successfully used in heuristic approaches to graph isomorphism to reduce the computation time.

As already said, graph isomorphism is a required step in most motif-discovery algorithms for counting the occurrences of subgraphs and distinguishing non-isomorphic subgraphs. The canonical label is a unique label associated to a graph that is independent on the permutation of the nodes, and therefore invariant to isomorphism. A graph with n nodes has up to n! different adjacency matrices, obtained from the set of permutations of the n nodes. A code is associated to each adjacency matrix, M, defined as the string of values obtained by the concatenation of the rows of the lower triangular sub-matrix of M, including the diagonal, i.e. code(M) =m11m12m22m1nmnn. Note that M is symmetric because G is assumed to be an indirect graph. The canonical labelling of the graph G denoted by cl(G) is the labelling that generates the code that is maximal over all adjacency matrices of G according to a given order among strings, e.g. lexicographic order. A property of this labelling is that, given two subgraphs G1 and G2, cl(G1) = cl(G2) iff G1 is isomorphic to G2. Thus, determining whether two graphs have the same canonical label is as difficult as the graph isomorphism problem is.

For specific instances of graphs, efficient heuristics can be set up to reduce the search space of canonical labelling. One such heuristics consists of grouping the nodes of each graph based on vertex invariant properties such as degree and/or labels [19] and then restricting the permutations to nodes within the same group.

A generalization of standard-canonical labelling applied to PPI networks is defined in [20] where labels are attached to all nodes of a graph. The label of node vi depends on its degree mi and the distances dij to all connected vertices j. Specifically:  

formula
where d is the graph diameter, i.e. the max length of a graph path. The product of all these labels provides a code for the entire graph. Although such a code does not uniquely characterize a graph, i.e. non-isomorphic graphs may have the same code, in practice it allows to distinguish between graphs of size up to 8 quite well.

NETWORK MOTIFS: DEFINITIONS AND PROPERTIES

Two distinct definitions of a motif have been used in the literature based on frequency and statistical significance.

Definition 1

A motif is a subgraph that appears more than a threshold number of times.

Definition 2

A motif is a subgraph that appears more often than expected by chance.

A motif of size k, i.e. containing k nodes, is called k-motif. A motif according to definition 2 is often said to be over-represented. It is possible that an over-represented motif does not occur more than the threshold number of times required by definition 1 and vice versa. In the following sections, we will specify how to count the frequency of a graphlet, i.e. the number of its occurrences, and how to evaluate its significance, i.e. the degree of surprise of such frequency.

We stress that another important distinction among the different definitions of a motif depends on whether or not the subgraph has to be induced, i.e. it must contain all edges that connect the nodes of the subgraph in the original graph.

The number of possible k-motifs grows very fast with k. This is one of the reasons why only small size k-motifs have been studied in PPI networks. To overcome in part this problem two different types of motifs are defined in [21]: (i) subgraphs that are generated by a random walk of a given length (8 in [21]) and 2. subgraphs with a given number of edges. These choices allow to extend the exact search to larger subgraphs (k ≤ 14). An alternative definition of motif is also provided in [22]: here a subgraph alignment is performed across a set of small graphs in order to extract an average structure called consensus pattern. This definition seems to be less affected by the incompleteness of the data, on the other hand an alignment among many different subgraphs may lose interesting informations.

Frequency

The frequency of a subgraph is strictly related to the kind of node and edge overlap that is allowed in counting the occurrences of the subgraph [23, 24], and this affects in a significant way both the results and the complexity of the counting. Indeed, a distinction can be made between non-identical subgraphs, i.e. subgraphs that differ in at least one node or one edge and edge-disjoint subgraphs, i.e. subgraphs that do not share any edges. This distinction is crucial in the counting. For example, for the network G of Figure 1, the counting of all edge-disjoint subgraphs leads to only one occurrence of both subgraphs a and b. However, if we count all non-identical subgraphs, subgraph b turns out to appear six times while subgraph a still only once. This naïve example well explains the most important difference between the two ways of counting: the edge-disjoint counting leads to a downward closed frequency, i.e. the frequency of a subgraph monotonically decreases with its size. The downward closure property does not hold for the case of non-identical counting. In other words, if arbitrary overlaps are allowed, a subgraph of size k can be, and in general is, more frequent than the subgraphs of size <k it contains. The downward closure is a highly desirable property from a computational point of view: first, it reduces considerably the number of subgraphs to process; second, as we will see later, it allows to use the A-priori algorithm to further prune the number of subgraphs to be examined.

Figure 1:

Given the graph G let a and b be two subgraphs, respectively, of size 3 and 4. Their frequencies vary according to whether or not overlap is allowed in counting.

Figure 1:

Given the graph G let a and b be two subgraphs, respectively, of size 3 and 4. Their frequencies vary according to whether or not overlap is allowed in counting.

A further distinction is made in [24] where the case of only overlaps of nodes is considered. In [24] indeed three frequency concepts, denoted by forumla1, forumla2 and forumla3, are identified:

forumla1: arbitrary overlaps of nodes and edges (non-identical case),

forumla2: only overlaps of nodes (edge-disjoint case),

forumla3: no overlaps (edge and vertex-disjoint case).

It has to be noticed that the downward closure property of the frequency holds for both forumla2 and forumla3, while it does not for forumla1.

Hence the definition of the frequency of a subgraph is a key point when designing algorithms for motif discovery. Several approaches proposed by the computational biology community have considered arbitrary overlaps of motifs [10, 22, 25–32]. In PPI networks proteins often participate into different modules where they perform different functions; thus it is conceivable that subgraphs with overlapping vertices or edges are simultaneously active at any time. In contrast, in the data mining literature, most algorithms consider appearances of subgraphs that do not share edges nor vertices [23, 24, 33–35]. In such context, the search for recurring subgraphs is performed not in a single graph, as we have discussed so far, but in a dataset consisting of many graphs, also called transactions. The frequency of a subgraph is the number of graphs of the dataset that contains it. If this number is above a given threshold, the subgraph is denoted as a motif. Thus one can distinguish between a single-graph setting, as we have considered so far for biological networks, and a graph-transaction setting [23]. An attempt at combining these two settings is given by the algorithm NeMoFINDER for motif discovery in PPI networks [25], where a different definition of motif is used. A motif is a subgraph (not necessarily an induced subgraph) that is frequent and unique. A subgraph is frequent if it occurs more than tf times in the real PPI network. A frequent subgraph is unique if its frequency in the real network is greater than that of at least tu random networks among an ensemble of n random networks, where tf, tu and n are given parameters. The requirement that the frequency in the real network has to be greater than tf excludes the possibility that a relatively unfrequent motif may turn out to be over-represented.

An important characterization of a motif is that of maximality [28, 36]. A motif GS is a maximal motif if it is not contained in any subgraph of G that is a motif, i.e. none of the super graphs of GS are frequent. Detecting only maximal motifs has the advantage of a reduced computation time since a smaller number of subgraphs are produced and stored.

The concentration is another interesting measure associated to a subgraph that has been used mostly in conjunction with sampling methods for motifs detection. For a given subgraph type Gk of size t the concentration is given by C(Gk) = N(Gk)/∑iN(Gi), where N(Gk) is the number of occurrences of Gk and ∑iN(Gi) is the total number of occurrences of all the subgraphs Gi of the same size t as Gk.

Statistical significance

We now turn to the definition of statistical significance of the frequency of a subgraph. According to definition 2, motifs have to be over-represented, i.e. appear significantly more often than what would be expected by chance. The over-representation of a subgraph is established on the basis of its frequency compared to the average frequency of the same subgraph in a set of random networks. The obtained values of the frequencies for the observed and random networks are compared through either the Z-score or the abundance (Δ) as defined in [10, 27]:  

formula
where freal(Gk) is the frequency of the subgraph Gk into the real network, while
f
rand(Gk) and std(frand(Gk)) are the average and the standard deviation of the frequency of Gk in the set of random networks, respectively. The value ε is a small constant that prevents Δ from growing too much when the number of occurrences of the subgraph is very small in both the real and random networks. The Z-score is high if the subgraph is over-represented, negative if it is under-represented and close to zero otherwise. The abundance varies between −1 (under-represented) and +1 (over-represented). Even though rarely used in the biological literature, the measure of abundance could be more practical than the Z-score when evaluating and comparing the appearances of small subgraphs in different networks, such as biological, social and internet networks, since it is less affected by the different network sizes.

An interesting characterization of networks was introduced in [27] based on network motifs: the significance profile (SP). Basically the SP of a network is a vector of normalized Z-scores that gives a histogram of the subgraph over-representation values over a network; the Z-scores are normalized to the unit length such that negative values, especially those close to −1, are associated to under-represented subgraphs, while positive ones, especially those close to 1, allow to recognize the motifs.

The use of Z-score is criticized when indiscriminately applied to scientific or real world problems irrespective of whether or not they have a Gaussian behaviour. A popular and entertaining book on randomness [37] lists several real life examples with non-Gaussian properties that are mistakenly interpreted in terms of Z-score. In the biological literature on networks a similar concern has been raised. In [38] it is observed that many network properties are not Gaussian and their distribution should be instead estimated from data. Thus a method from machine learning, namely kernel density estimation and cross validation [39], is proposed to evaluate the statistical significance of motifs. Once the distribution is learnt from sample data, the deviation is measured as the likelihood that an observation was drawn from random networks.

MODELS OF RANDOM GRAPHS

An important issue at this point is how to model random networks to be used for comparison and evaluation of motif occurrences. There is a large body of literature on network models [15, 16, 40–49] in different fields from physics to sociology to internet and biology. An extensive mathematical-oriented survey of network models can be found in [50] where the Poisson random graphs, scale-free, exponential and Markov graphs are reviewed and their relations to various global and local network properties are discussed in depth. A much cited work by Barabasi and Oltvai [51] provides a short introduction to network models and their properties with focus on the principles that govern the growth and evolution of biological networks, especially scale-free networks.

We will briefly review the random graph models that are relevant to motif discovery. In the pioneering work of [10, 27, 52] random networks preserve the same degree distribution of biological networks. Degree distribution is perhaps the most important global property of a biological network. The degree distribution P(k) gives the probability that a node has degree k, i.e. has k adjacent nodes. Much work have been devoted to the characterization of biological networks in terms of degree distribution. In [51] biological networks and PPI networks are described as having a power-law degree distribution, i.e. P(k) ∼ k−γ, with 2 < γ < 3. These networks are known as scale-free networks. The power-law distribution differs from the Poisson degree distribution typical of Erdos–Renyi (ER) random networks [43, 44] where the probability P of two nodes being connected is uniform across the network. In ER random networks, most nodes have a degree close to the average degree of the network and the probability P(k) decreases exponentially with k. In contrast, in a network with power–law degree distribution there is a long tail in the distribution implying a higher probability of nodes with high degrees with respect to ER models.

In [29], an extension of the idea of a random graph model preserving degree sequence is proposed for the search of n-node motifs. The random graph preserves the distribution of the (n − 1)-node subgraphs of the original graph. In their implementation n = 4. The same random graph model is suggested in a recent paper [53] to evaluate motifs of size up to 8. However, the generated random graphs preserve the distribution of the 3-node subgraphs only, since it is found computationally infeasible to preserve the subgraph distribution for larger values of n.

Recently, a new model of PPI networks has been proposed [54, 55] based on the notion of geometric random networks [56] and Poisson distribution of the degrees. A random geometric graph G(n, r) with radius r has n nodes that are independently and uniformly randomly distributed in a metric space. An edge connects two nodes u, v of G if their distance is less than the radius r. The results in [54] show that, despite the fact that the degree distribution of currently available PPI networks tends to be power–law, as in the scale-free model, other global properties such as diameter and clustering coefficient seem to better fit the geometric model. The prediction in [54] is that as the number of new experimentally and computationally determined protein interactions will increase, the degree distribution of PPI networks will be better approximated by a Poisson distribution.

A promising new research direction tries to improve the fit by incorporating node clustering into the model to account for different patterns of connectivity in different network modules, as in assortative mixing models [50]. One such scheme, called ERMG for Erdos-Renyi Mixtures for Graphs, is proposed in [57, 58] based on mixture distributions.

As observed in [21], for most models a perfect fit with a real network can be obtained in terms of a single selected property, either local or global, by accurately tuning certain model parameters. However, none of the proposed models can reproduce all the statistical properties of a naturally occurring network. Before concluding this section, it is important to stress that a major problem in modelling arises from the incompleteness and noise of the PPI data. The impact of false positives and negatives has been recently studied in [59, 60] where models of random link removal that reproduce incomplete networks have been analysed. Their findings indicate no significant dependence of degree distribution on the parameters of the models. On the other hand, motif frequencies seem to be affected by link removal strength. Further investigation is certainly needed in this area.

MOTIF DISCOVERY ALGORITHMS

In this section, we will review some of the existing algorithms for motif discovery. This is a complex task from a computational point of view, since it involves graph and subgraph isomorphism. We first review exact approaches to identify motifs that are typically limited to 3- and 4-motifs. Then we survey approximate methods that extend the analysis to motifs of size up to 12–14; they are either based on sampling or resort to heuristics that may help reduce the amount of computation. We also briefly review graph data mining approaches, based on the A-priori algorithm. Even though they have been introduced for a different problem, namely the search for a subgraph in a dataset of graphs, they can provide good insights into the motif discovery problem.

Note that with the exception of the A-priori, the algorithms presented here adopt the forumla1 frequency, consistently with the biological meaning discussed in ‘Frequency’ section.

Exact algorithms

Exhaustive search

Pioneering work on motif discovery was presented in [10, 29] where an exhaustive search for motifs is performed across networks of different nature: transcription–regulation, neuron-synaptic connection, food webs, WWW. The algorithm has been applied also to compute the SP of the Saccharomyces cerevisiae network [31] and of various other different networks, including WWW and social networks [27]. This exhaustive recursive search ERS finds all connected induced subgraphs of 3 and 4 nodes. Edge and node overlaps of subgraphs are allowed. The input network is represented by an adjacency matrix M. The n-graphlets are extracted by enumerating all the n × n submatrices induced by n connected nodes. ERS does not scale to motifs of size bigger than 4.

A faster exact search algorithm is ESU [61] that incrementally builds subgraphs of a certain size k starting with individual nodes and adding one node at a time until the required size k is reached. During construction, for each partial subgraph the algorithm keeps an auxiliary list of nodes that are candidates for future additions. Such list is dynamically updated: for each new node w added to the subgraph, a node is added to the candidate list if it has a label higher than the first node selected for the subgraph construction and if it is adjacent to w but not to any of the nodes already in the subgraph. As we will see later, ESU, when combined to sampling, allows the detection of subgraphs of size up to 14.

Compact topological motifs

The motif discovery problem is solved exactly in [28] where a motif is considered to be a subgraph that occurs at least t times. The algorithm extends the notion of maximality typical of string analysis [62] to graphs. Edge-maximality (vertex-maximality) ensures that no more edges (vertices) can be added to a motif without altering the occurrences of the motif in the graph. The combinatorial explosion of the motifs, due to arbitrary overlaps, is handled by the introduction of a compact graph representation, obtained by grouping together maximal sets of nodes that are ‘indistinguishable’. The notion of maximal indistinguishability is here presented by means of an example. Formal definitions are in the article. In Figure 2 the nodes of the set U1 are indistinguishable with respect to the set U2. The nodes within each set have the same label or colour; furthermore, there is an edge connecting each pair of nodes of U1 × U2. The nodes in U1 may produce three distinct occurrences of a motif that includes also U2. The enumeration of all such occurrences can be done more efficiently if a compact representation is used for the nodes of U1. The resulting algorithm achieves a drastic reduction of the size of the output motifs.

Figure 2:

The graph on the left shows the sets U1 and U2 mutually maximally indistinguishable, on the right the graph after the substitution of U1 and U2 as compact nodes and U1U2 as compact edge.

Figure 2:

The graph on the left shows the sets U1 and U2 mutually maximally indistinguishable, on the right the graph after the substitution of U1 and U2 as compact nodes and U1U2 as compact edge.

Approximate algorithms

Search algorithms based on sampling

A non-exact search was proposed by [26], where an algorithm based on sampling, called MFINDER, was developed. A sample subgraph of size k is extracted by picking at random edges of the input graph until a set of k nodes is obtained. The sampling procedure iteratively expands a subgraph by one by randomly picking a new edge in the set of edges adjacent to nodes already in the sample. When the sample reaches the required size, all the edges that connect its nodes in the original graph are added to it, thus obtaining an induced subgraph. Since some subgraphs have higher probability of being picked, an important part of the process is to assign weights to the samples to correct the non-uniform sampling. Then MFINDER computes the score of each type of subgraph of size k as the sum of the weights of all samples of the same type. Finally, the value of concentration defined in the previous section is used to estimate the significance of a motif.

Interesting features of this work are the ability to extract rare motifs with high probability and the low number of samples needed to obtain satisfactory results (even with 5000 samples MFINDER obtains concentrations very similar to ERS's). This algorithm scales well with large networks since the runtime is independent of the network size. However, as noticed in [25], it does not scale well with large motifs (it takes about 2 h to extract the 8-motifs in the transcriptional network of Escherichia coli, with | V | = 423 and | E | = 519).

One drawback of this approach is the extra time needed to compute the weights of all samples. This issue is addressed in [61, 63] where a new sampling method, Rand-ESU, is proposed to overcome this problem. Rand-ESU is based on the exact search algorithm ESU reviewed in the previous section. The recursive procedure ESU builds a tree whose leaves correspond to subgraphs of size k while internal nodes correspond to subgraphs of size 1 up to k − 1, depending on the tree level. Rather than traversing the entire tree, the sampling procedure visits parts of it and guarantees that all leaves are visited with uniform probability. This is achieved by assigning to each level in the tree a probability that the nodes are further explored. Although this approach is certainly faster and more intuitive, there is still the problem of how to empirically choose such probabilities.

A simpler strategy that produces uniform sampling consists of randomly picking subsets of k-nodes [20] instead of edges. Obviously, there is a possibility that the picked nodes do not belong to a connected component and do not identify a subgraph, therefore.

NeMoFINDER

A more recent paper [25] proposes a combination of approaches developed within the data mining and computational biology communities. The algorithm, NeMoFINDER, applies to unlabelled undirected graphs. A subgraph needs not be an induced subgraph. Furthermore, a motif is frequent and unique, as described in ‘Frequency’ section.

NeMoFINDER adopts the idea in SPIN [36] to search for repeated trees and then extend them to subgraphs. In short, the procedure first builds the set Tk of frequent trees of size up to k that appear in G more than a given threshold number; then, based on such trees, the graph G is divided into a set GD of transactions or graphs, such that each tree in Tk can be embedded in at least one of these transactions. The frequency of a motif g is computed as the number of transactions of GD into which g can be embedded. To compute such frequency, NeMoFINDER first determines the trees of Tk that are embedded in at least tf transactions of GD, where tf is the given threshold. Then each such tree is expanded by one edge and the frequency of the obtained graphlet is computed. The definition of motifs adopted in the article leads to a reduction of the number of graphlets to be examined and, consequently, of the computation time thus enabling the discovery of larger motifs, but at the cost of missing some potentially interesting subgraphs. An extension of this work is in [64].

Subgraph counting by scalar computation

An alternative interesting approach to characterize a biological network is by a set of measures (or an embedding space for a graph) based on scalars and functionals of the adjacency matrix A associated to the network [38]. The functionals are matrix multiplications of A and its transpose AT, corresponding to moving forward or backward in the graph, respectively. The scalars are the results of an operation, such as a sum, on either the diagonal terms or the off-diagonal terms of the matrix obtained by the application of the functionals. As a simple example, the sum of the diagonal terms of the product ATA2 gives the number of subgraphs of 3 nodes i, j and k with edges (i, j), (j, k) and (i, k). However, in most cases there is not an obvious meaningful correspondence of a scalar measure with a subgraph. In fact, the mapping between scalars and subgraphs is many-to-many. One of the major advantages of this approach, besides its mathematical elegance, is its computational efficiency when compared with other approaches, especially on dense scale-free networks. However, in the original paper [38] the time performance is only compared to that of the ERS and not to the latest versions of the exhaustive search, which are significantly more efficient.

A-priori-based motif detection

When no overlap of subgraphs in the original network is considered, the complexity of the discovery process can be handled more efficiently by a levelwise search, as in the A-priori algorithm for graph-transactions. In fact, in that case the downward closure property holds for the frequency. In other words, if a subgraph is frequent so are all its subgraphs. A typical levelwise search builds candidate motifs of size k by joining motifs of size k − 1 and then evaluating their frequency. Two subgraphs of size k − 1 are joined if they share a core of size k − 2. The core can consist of a set of vertices [34] or a set of edges [23]. Then the process of growing subgraphs may proceed by either adding a node at a time or an edge at a time. The issues of how to join separate motifs to obtain larger candidate motifs and how to count their occurrence are far from trivial, since they involve subgraph isomorphism. However, this setting allows the development of efficient pruning schemes, based on the canonical labelling of the subgraphs [20, 35] reviewed in ‘Graphs: definitions’ section. An overall significant speed-up over previous methods is achieved by this approach because fewer subgraphs need to be explored and stored.

CONCLUSIONS

This survey has focused on the algorithmic aspects of motif discovery and on the graph models that best describe biological networks. We have not considered so far some very important biological questions that motivate the search for motifs and explain the attention paid to this problem.

Perhaps the most important question is: Are motifs the result of evolutionary selection for function? This question relates to the other important question: Which mechanisms regulate PPI network's growth and evolution? We cannot treat these controversial issues in depth here and only mention a few interesting points.

The selection of network motifs for their function is a theory proposed originally in [52] where the over abundance of certain motifs is interpreted as manifestation of evolutionary design principles. A criticism to this interpretation has been raised in [65] where it is asserted that the choice of the random network model of [52] is not adequate, and the null hypothesis is not well posed. The obtained motifs may be a consequence of some topological characteristics (such as spatial aggregation) present only in real networks and not in random networks, rather than a result of evolutionary design principles. While this comment poses an interesting issue related to the randomization process, it has been shown [29, 66] that random networks built under this additional topological constraint reveal a profile that is different from that of a real network. Another study in support of the theory of evolutionary conservation of motifs is in [67]. It shows how ortholog proteins from different species are often found in the same network motifs, indicating that motifs are evolutionary conserved topological units. This study has been conducted for proteins in S. cerevisiae and their orthologs in networks of five higher eukaryotes (Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Mus musculus and Homo sapiens).

The over-representation of certain motifs in PPI networks, namely triangles and squares, has been also explained by the duplication-mutation model of network growth [68–70]. In this probabilistic model, the network evolves by the addition of a new node that duplicates a randomly picked node of the original network and inherits all its links. Then, during evolution some of the new links may disappear and a new link between the copied and the original node may be added.

Seven different evolutionary mechanisms are investigated in [21] to determine which model best reproduces the structure of the PPI network of D. melanogaster in terms of motifs. The result of the study suggests that the duplication-mutation-complementation mechanism shaped Drosophila melanogaster. Another interesting outcome of this work is that the ER random model shows a better fit score than other popular growing graph mechanisms, such as preferential attachment [50, 51], when the analysis is restricted to the so-called core network, containing edges that have a high probability of occurrence in vivo [4].

In conclusion, caution should be taken when it comes to a biological interpretation because of the observational incompleteness and noise. Hopefully, future research on the discovery of larger composite motif combined with the availability of more reliable PPI data will provide answers to the very important biological questions.

Key Points

  • A comprehensive review of the ‘Network Motif’ concept is provided on the basis of the different definitions of frequency and significance in computational biology and data mining.

  • The importance of network motifs in biological networks, such as protein-protein interaction (PPI) networks, is related to their ability of being a manifestation of evolutionary design principles.

  • The network motif discovery and classification problem is difficult since it involves the subgraph isomorphism problem. We analyze and compare several exact and approximate algorithmic approaches.

REFERENCES

Han
K
Park
B
Kim
H
, et al.  . 
HPID: The Human Protein Interaction Database
Bioinformatics
 , 
2004
, vol. 
20
 
15
(pg. 
2466
-
70
)
Mewes
HW
Frishman
D
Gruber
C
, et al.  . 
MIPS: a database for genomes and protein sequences
Nucleic Acids Res
 , 
2000
, vol. 
28
 (pg. 
37
-
40
)
Xenarios
I
Rice
DW
Salwinski
L
, et al.  . 
DIP: the database of interacting proteins
Nucleic Acid Res
 , 
2000
, vol. 
28
 
1
(pg. 
289
-
91
)
Giot
L
Bader
JS
Brouwer
C
, et al.  . 
A Protein interaction map of Drosophila melanogaster
Science
 , 
2004
, vol. 
302
 (pg. 
1727
-
36
)
Ito
T
Chiba
T
Ozawa
R
, et al.  . 
A comprehensive two-hybrid analysis to explore the yeast protein interactome
Proc Natl Acad Sci
 , 
2001
, vol. 
98
 (pg. 
4569
-
74
)
Uetz
P
Giot
L
Cagney
G
, et al.  . 
A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae
Nature
 , 
2002
, vol. 
403
 (pg. 
623
-
7
)
Aloy
P
Russell
RB
Interrogating protein interaction networks through structural biology
Proc Natl Acad Sci USA
 , 
2002
, vol. 
99
 (pg. 
5896
-
901
)
Jansen
R
Yu
H
Greenbaum
D
, et al.  . 
A Bayesian networks approach for predicting protein–protein interactions from genomic data
Science
 , 
2003
, vol. 
302
 
5644
(pg. 
449
-
53
)
Marcotte
EM
Pellegrini
M
Ng
HL
, et al.  . 
Detecting protein function and protein–protein interactions from genome sequences
Science
 , 
1999
, vol. 
285
 (pg. 
751
-
3
)
Milo
R
Shen-Orr
S
Itzkovitz
S
, et al.  . 
Network motifs: simple building blocks of complex networks
Science
 , 
2002
, vol. 
298
 (pg. 
824
-
7
)
Alon
U
Network motifs: theory and experimental approaches
Nature
 , 
2007
, vol. 
8
 (pg. 
450
-
61
)
Dobrin
R
Beg
QK
Barabasi
A-L
, et al.  . 
Aggregation of topological motifs in the Escherichia coli transcriptional regulatory network
BMC Bioinformatics
 , 
2004
, vol. 
5
 pg. 
10
 
Lee
TI
Rinaldi
NJ
Robert
F
, et al.  . 
Transcriptional regulatory networks in Saccharomyces cerevisiae
Science
 , 
2002
, vol. 
298
 (pg. 
799
-
804
)
Mazurie
A
Bottani
S
Vergassola
M
An evolutionary and functional assessment of regulatory network motifs
Genome Biol
 , 
2005
, vol. 
6
 
4
pg. 
R35
 
Holme
P
Edling
CR
Liljeros
F
Structure and time-evolution of n dating Internet community
Soc Networks
 , 
2004
, vol. 
26
 (pg. 
155
-
74
)
Scott
J
Social Network Analysis: A Handbook
 , 
2000
2nd
London:
Sage Publications
Kautz
H
Selman
B
Shah
M
Referral web: combining social networks and collaborative filtering
Commun ACM
 , 
1997
, vol. 
40
 (pg. 
63
-
5
)
Ullman
JR
An algorithm for subgraph isomorphism
J Assoc Comp Mach
 , 
1976
, vol. 
23
 
1
(pg. 
31
-
42
)
Read
RC
Corneil
DG
The graph isomorph desease
J Graph Theory
 , 
1977
, vol. 
1
 (pg. 
339
-
63
)
Baskerville
K
Paczuski
M
Subgraph ensembles and motif discovery using a new heuristic for graph isomorphism
Phys Rev
 , 
2006
, vol. 
E 74
 pg. 
051903
 
Middendorf
M
Ziv
E
Wiggins
CH
Inferring network mechanisms: the Drosophila melanogaster protein interaction network
PNAS
 , 
2005
, vol. 
102
 
9
(pg. 
3192
-
7
)
Berg
J
Lassig
M
Local graph alignment and motif search in biological networks
PNAS
 , 
2004
, vol. 
101
 
41
(pg. 
14689
-
94
)
Kuramochi
M
Karypis
G
Finding frequent patterns in a large sparse graph
SIAM International Conference on Data Mining (SDM'04) 2004
SIAM, Lake Buena Vista, Florida, USA
(pg. 
243
-
71
)
Schreiber
F
Schwöbbermeyer
H
Frequency concepts and pattern detection for the analysis of motifs in networks
Trans Comput Syst Biol III
 , 
2005
, vol. 
LNBI 3737
 (pg. 
89
-
104
)
Chen
J
W
Hsu
ML
Lee
, et al.  . 
NeMoFinder: dissecting genome-wide protein–protein interactions with meso-scale network motifs
KDD 2006
 
Philadelphia, USA
ACM
(pg. 
106
-
115
)
Kashtan
N
Itzkovitz
S
Milo
R
, et al.  . 
Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs
Bioinformatics
 , 
2004
, vol. 
20
 
11
(pg. 
1746
-
58
)
Milo
R
Itzkovitz
S
Kashtan
N
, et al.  . 
Superfamilies of evolved and designed networks
Science
 , 
2004
, vol. 
303
 (pg. 
1538
-
42
)
Parida
L
Discovering Topological motifs using a compact notation
J Comput Biol
 , 
2007
, vol. 
4
 
3
(pg. 
300
-
23
)
Shen-Orr
SS
Milo
R
Mangan
S
, et al.  . 
Network motifs in the transcriptional regulation network of Escherichia coli
Nat Genet
 , 
2002
, vol. 
31
 (pg. 
64
-
8
)
Schreiber
F
Schwobbermeyer
H
MAVisto: a tool for the exploration of network motifs
Bioinform Appl Note
 , 
2005
, vol. 
21
 
17
(pg. 
3572
-
4
)
Yeger-Lotem
E
Sattath
S
Kashtan
N
, et al.  . 
Network motifs in integrated cellular networks of transcription-regulation and protein–protein interaction
Proc Natl Acad Sci
 , 
2004
, vol. 
101
 (pg. 
5934
-
9
)
Zhang
LV
King
OD
Wong
SL
, et al.  . 
Motifs, themes and thematic maps of an integrated Saccharomyces cerevisiae interaction network
J Biol
 , 
2005
, vol. 
4
 pg. 
6
 
Huan
J
Wang
W
Prins
J
, et al.  . 
Efficient mining of frequent subgraphs in the presence of isomorphism
Third IEEE International Conference on Data Mining (ICDM'03) 2003
IEEE Computer Society, Melbourne, Florida, USA
(pg. 
549
-
60
)
Inokuchi
A
Washio
T
Motoda
H
An apriori-based algorithm for mining frequent substructures from graph data
PKDD 2000
 , 
1910
Lyon, France LNAI
Springer
(pg. 
13
-
23
)
Kuramochi
M
Karypis
G
Frequent subgraph discovery
2001
IEEE International Conference on Data Mining (ICDM)
IEEE Computer Society, San Jose, California, USA
(pg. 
313
-
20
)
Huan
J
Wang
W
Prins
J
, et al.  . 
Spin: mining maximal frequent subgraphs from graph databases
KDD 2004
 
Seattle, Washington, USA
ACM
(pg. 
581
-
6
)
Taleb
NN
The Black Swan: The Impact of the Highly Improbable
2007
New York
Random House Publishing Group
Ziv
E
Koytcheff
E
Middendorf
M
, et al.  . 
Systematic identification of statistically significant network
Physc Rev
 , 
2005
, vol. 
E 71
 pg. 
016110
 
Hastie
T
Tibshirani
R
Friedman
J
The Elements of Statistical Learning. New York: Springer
 , 
2001
Alm
E
Arkin
AP
Biological networks
Curr Opin Struct Biol
 , 
2002
, vol. 
13
 (pg. 
193
-
202
)
Barabasi
AL
Linked: The New Science of Networks
2002
Cambridge: Perseus
Dorogovtsev
SN
Mendes
JFF
Evolution of Networks: From Biological Nets to the Internet and WWW
 , 
2003
Oxford: Oxford University Press
(pg. -)
Erdos
P
Renyi
A
On random graphs
Publicationes Mathematicae
 , 
1959
, vol. 
6
 (pg. 
290
-
7
)
Erdos
P
Renyi
A
On the evolution of random graphs
Publ Math Inst Hungarian Acad Sci
 , 
1960
, vol. 
5
 (pg. 
17
-
61
)
Newman
MEJ
Barabasi
A-L
Watts
DJ
The Structure and Dynamics of Networks
 , 
2003
Princeton University Press
 
Princeton
Strogatz
SH
Exploring complex networks
Nature
 , 
2003
, vol. 
410
 (pg. 
268
-
76
)
Vazquez
A
Flammini
A
Maritan
A
, et al.  . 
Modeling of protein interaction networks
Complexus
 , 
2003
, vol. 
1
 (pg. 
38
-
44
)
Vazquez
A
Growing networks with local rules: preferential attachment, clustering hierarchy and degree correlations
Phys Rev
 , 
2003
, vol. 
E 67
 
5
pg. 
056104
 
Vespignani
A
Evolution thinks modular
Nat Genet
 , 
2003
, vol. 
35
 (pg. 
118
-
19
)
Newman
MEJ
The structure and function of complex networks
SIAM Rev
 , 
2003
, vol. 
45
 
2
(pg. 
167
-
256
)
Barabasi
A-L
Oltvai
ZN
Network biology: understanding the cell's functional organization
Nat Rev
 , 
2004
, vol. 
5
 (pg. 
101
-
13
)
Milo
R
Kashtan
N
Itzkovitz
S
, et al.  . 
On the uniform generation of random graphs with prescribed degree sequences
2003
 
Grochow
JA
Kellis
M
Network motif discovery using subgraph enumeration and symmetry-breaking
RECOMB 2007, Lecture Notes in Computer Science 4453
 
Oakland, CA: Springer. pp. 92–106
Przulj
N
Corneil
DG
Jurisica
I
Modeling interactome: scale-free or geometric?
Bioinformatics
 , 
2004
, vol. 
20
 
18
(pg. 
3508
-
15
)
Przulj
N
Corneil
DG
Jurisica
I
Efficient estimation of graphlet frequency distributions in protein–protein interaction networks
Bioinformatics
 , 
2006
, vol. 
22
 
8
(pg. 
974
-
80
)
Penrose
M
Geometric Random Graphs
 , 
2003
Oxford Univeristy Press
Oxford:
Lacroix
V
Fernandes
CG
Sagot
MF
Motif search in graphs: application to metabolic networks
IEEE/ACM Trans Comput Biol Bioinform (TCBB)
 , 
2006
, vol. 
3
 
4
(pg. 
360
-
8
)
Daudin
JJ
Lacroix
V
Picard
F
, et al.  . 
Uncovering structure in biological networks
JOBIM
 , 
2006
Bordeaux, France
Kuhnt
M
Glauchec
I
Greinerb
M
Impact of observational incompleteness on the structural properties of protein interaction networks
Physica A
 , 
2007
, vol. 
373
 (pg. 
759
-
69
)
Scholz
J
Dejori
M
Stetter
M
Greiner
M
Noisy scale-free networks
Physica
 , 
2005
, vol. 
A 350
 (pg. 
622
-
42
)
Wernicke
S
Efficient detection of network motifs
IEEE/ACM Trans Comput Biol Bioinform
 , 
2006
, vol. 
3
 
4
(pg. 
347
-
59
)
Apostolico
A
Comin
M
Parida
L
Bridging lossy and lossless compression by motif pattern discovery. general theory of information transfer and combinatorics
Lect. Notes Comput Sci
 , 
2006
, vol. 
4123
 (pg. 
793
-
813
)
Wernicke
S
Rasche
F
FANMOD: a tool for fast network motif detection
Bioinformatics
 , 
2006
, vol. 
22
 (pg. 
1152
-
3
)
Chen
J
Hsu
W
Lee
ML
, et al.  . 
Labeling network motifs in protein interactomes for protein function prediction
IEEE 23rd International Conference on Data Engineering (ICDE'07) 2007
IEEE Computer Society, Istanbul, Turkey
(pg. 
546
-
55
)
Atzy-Randrup
Y
Fleishman
SJ
Ben-Tal
N
, et al.  . 
Comment on “Network Motifs: Simple Building Blocks of Complex Networks” and “Superfamilies of Evolved and Designed Networks”
Science
 , 
2004
, vol. 
305
 pg. 
1107c
 
Milo
R
Itzkovitz
S
Kashtan
N
, et al.  . 
Response to comment on “Network Motifs: Simple Building Blocks of Complex Networks” and “Superfamilies of Evolved and Designed Networks”
Science
 , 
2004
, vol. 
305
 pg. 
1107d
 
Wuchty
S
Oltvai
ZN
Barabasi
A-L
Evolutionary conservation of motif constituents in the yeast protein interaction network
Nat Genet
 , 
2003
, vol. 
35
 (pg. 
176
-
9
)
Ispolatov
I
Krapivsky
PL
Yuryev
A
Duplication-divergence model of protein interaction network
Phys Rev
 , 
2005
, vol. 
E 71
 pg. 
061911
 
Ispolatov
I
Krapivsky
PL
Mazo
I
, et al.  . 
Cliques and duplication-divergence network growth
New J Phys
 , 
2005
, vol. 
7
 (pg. 
145
-
59
)
Rice
JJ
Kershenbaum
A
Stolovitzky
G
Lasting impressions: motifs in protein–protein maps may provide footprints of evolutionary events
Proc Natl Acad Sci
 , 
2005
, vol. 
102
 
9
(pg. 
3173
-
4
)