## Abstract

Several algorithms have been recently designed to identify motifs in biological networks, particularly in protein–protein interaction networks. Motifs correspond to repeated modules in the network that may be of biological interest. The approaches proposed in the literature often differ in the definition of a motif, the way the occurrences of a motif are counted and the way their statistical significance is assessed. This has strong implications on the computational complexity of the discovery process and on the type of results that can be expected. This review presents in a systematic way the different computational settings outlining their main features and limitations.

## INTRODUCTION

Much data has become recently available about protein–protein interactions (PPI) in selected organisms [1–3]. The interactions have been determined by a number of experimental [4–6] and computational [7–9] techniques designed to systematically identify both the candidate and actual protein interactions. A PPI network is modelled by a graph whose nodes correspond to proteins and edges to interactions between two proteins. Much work has been devoted to the characterization and modelling of biological networks in terms of global properties of the underlying graphs, such as diameter, degree distribution and clustering coefficient. Recently, there has been a shift of attention of researchers from global properties to local properties that provide a concise description of a network in terms of appearances of a given number of small subgraphs. It has been shown that PPI networks exhibit interesting features in terms of repeated occurrences of certain modules and that these modules have biological meaning, since they may represent evolutionary conserved topological units of cellular networks. Thus, much attention has been paid to the identification of small subgraphs, particularly those occurring significantly often within the biological networks. These subgraphs, referred to as *network motifs* or simply *motifs*, can be considered as basic *building blocks of complex networks* [10]. The PPI data available are certainly incomplete, thus many false negatives affect the studies on these networks. Nevertheless, the current research effort on the characterization of biological networks may ultimately aid in understanding the cell organization.

This review focuses on the computational aspects of the motif discovery process in biological networks. Although we will consider PPI networks in this review, most of the algorithms that will be reviewed have been applied also to other types of biological networks, such as transcription–regulation [11–14], and to networks of different nature, as social [15, 16] and WWW [17]. A discovery algorithm typically consists of three steps: (i) find the occurrences of subgraphs of a given size in a graph; (ii) determine which subgraphs are isomorphic and group them into classes, sometimes referred to as graphlets; and (iii) determine the statistical significance of the graphlets by comparing their frequencies to those of an ensemble of random networks. In ‘Graphs: definitions’ section, we give basic definitions of graphs and graph isomorphism. In ‘Network motifs: definitions and properties’ section, we present various definitions of motifs, concentrating on different ways of counting subgraphs, and show how they affect the computational complexity of the discovery process. Models of biological networks are reviewed in ‘Model of random graphs’ section, since they provide a framework for the evaluation of the statistical significance of the obtained subgraphs. Finally, in ‘Motif discovery algorithms’ section, we give a short outline of some of the algorithms proposed in the literature to address this problem. We conclude with a short survey of the literature on the controversial biological interpretation of motifs.

## GRAPHS: DEFINITIONS

In this section we will briefly review basic definitions of graphs. A graph is a pair *G* = (*V*, *E*) where *V* = {*v*} is the set of nodes or vertices and *E* = {(*u*, *v*)} is the set of edges each connecting a pair of nodes.

A subgraph *G*_{S} = (*V*_{S}, *E*_{S}) of *G* = (*V*, *E*) consists of a subset of nodes *V*_{S} ⊆ *V* and of a subset of edges *E*_{S} ⊆ *E* connecting the nodes of *V*_{S} in the original graph. The graph induced by *V*_{S} is the subgraph *G*_{S} that includes all the edges of *G* which connect the vertices of *V*_{S}.

In most of the algorithms we will review, the graphs are represented by adjacency matrices. Assume that the nodes of a graph are numbered in some arbitrary order from 1 to *n*, that is *V* = {*v*_{1}, …,*v*_{n}}. The adjacency matrix *M* = {*m*_{ij}}, *i*, *j* = 1, …,*n*, is then defined as a two-dimensional square matrix whose element *m*_{ij} is 1 if there is an edge connecting nodes *v*_{i} and *v*_{j} and 0 otherwise. Two graphs *G*_{1} and *G*_{2} are *isomorphic* if there is a one-to-one mapping *f* between vertices of *G*_{1} and those of *G*_{2} with the property that two vertices *u* and *v* of *G*_{1} are adjacent if and only if *f*(*u*) and *f*(*v*) are adjacent in *G*_{2} [18]. In other words, two graphs are isomorphic if it is possible to permute the vertices so that *G*_{1} coincides with *G*_{2}. Determining graph isomorphism is not known to be either in P or in NP-complete. On the other hand, given two graphs *G* and *H* determining whether *G* contains a subgraph isomorphic to *H* is an NP-complete problem, thus no polynomial time algorithm is known to solve it. In practice, several efficient approaches exist for small subgraphs. *Canonical labelling* of a graph has been successfully used in heuristic approaches to graph isomorphism to reduce the computation time.

As already said, graph isomorphism is a required step in most motif-discovery algorithms for counting the occurrences of subgraphs and distinguishing non-isomorphic subgraphs. The canonical label is a unique label associated to a graph that is independent on the permutation of the nodes, and therefore invariant to isomorphism. A graph with *n* nodes has up to *n*! different adjacency matrices, obtained from the set of permutations of the *n* nodes. A *code* is associated to each adjacency matrix, *M*, defined as the string of values obtained by the concatenation of the rows of the lower triangular sub-matrix of *M*, including the diagonal, i.e. *code*(*M*) =*m*_{11}*m*_{12}*m*_{22}…*m*_{1n}…*m*_{nn}. Note that *M* is symmetric because *G* is assumed to be an indirect graph. The canonical labelling of the graph *G* denoted by *cl*(*G*) is the labelling that generates the code that is maximal over all adjacency matrices of *G* according to a given order among strings, e.g. lexicographic order. A property of this labelling is that, given two subgraphs *G*_{1} and *G*_{2}, *cl*(*G*_{1}) = *cl*(*G*_{2}) iff *G*_{1} is isomorphic to *G*_{2}. Thus, determining whether two graphs have the same canonical label is as difficult as the graph isomorphism problem is.

For specific instances of graphs, efficient heuristics can be set up to reduce the search space of canonical labelling. One such heuristics consists of grouping the nodes of each graph based on vertex invariant properties such as degree and/or labels [19] and then restricting the permutations to nodes within the same group.

A generalization of standard-canonical labelling applied to PPI networks is defined in [20] where labels are attached to all nodes of a graph. The label of node *v*_{i} depends on its degree *m*_{i} and the distances *d*_{ij} to all connected vertices *j*. Specifically:

*d*is the graph diameter, i.e. the max length of a graph path. The product of all these labels provides a code for the entire graph. Although such a code does not uniquely characterize a graph, i.e. non-isomorphic graphs may have the same code, in practice it allows to distinguish between graphs of size up to 8 quite well.

## NETWORK MOTIFS: DEFINITIONS AND PROPERTIES

Two distinct definitions of a motif have been used in the literature based on *frequency* and *statistical significance*.

*Definition 1*

A motif is a subgraph that appears more than a threshold number of times.

*Definition 2*

A motif is a subgraph that appears more often than expected by chance.

A motif of size *k*, i.e. containing *k* nodes, is called *k*-motif. A motif according to definition 2 is often said to be *over*-*represented*. It is possible that an over-represented motif does not occur more than the threshold number of times required by definition 1 and vice versa. In the following sections, we will specify how to count the frequency of a graphlet, i.e. the number of its occurrences, and how to evaluate its *significance*, i.e. the degree of surprise of such frequency.

We stress that another important distinction among the different definitions of a motif depends on whether or not the subgraph has to be induced, i.e. it must contain all edges that connect the nodes of the subgraph in the original graph.

The number of possible *k*-motifs grows very fast with *k*. This is one of the reasons why only small size *k*-motifs have been studied in PPI networks. To overcome in part this problem two different types of motifs are defined in [21]: (i) subgraphs that are generated by a random walk of a given length (8 in [21]) and 2. subgraphs with a given number of edges. These choices allow to extend the exact search to larger subgraphs (*k* ≤ 14). An alternative definition of motif is also provided in [22]: here a subgraph alignment is performed across a set of small graphs in order to extract an average structure called *consensus pattern*. This definition seems to be less affected by the incompleteness of the data, on the other hand an alignment among many different subgraphs may lose interesting informations.

### Frequency

The frequency of a subgraph is strictly related to the kind of node and edge overlap that is allowed in counting the occurrences of the subgraph [23, 24], and this affects in a significant way both the results and the complexity of the counting. Indeed, a distinction can be made between *non-identical* subgraphs, i.e. subgraphs that differ in at least one node or one edge and *edge-disjoint* subgraphs, i.e. subgraphs that do not share any edges. This distinction is crucial in the counting. For example, for the network *G* of Figure 1, the counting of all edge-disjoint subgraphs leads to only one occurrence of both subgraphs *a* and *b*. However, if we count all non-identical subgraphs, subgraph *b* turns out to appear six times while subgraph *a* still only once. This naïve example well explains the most important difference between the two ways of counting: the edge-disjoint counting leads to a downward closed frequency, i.e. the frequency of a subgraph monotonically decreases with its size. The downward closure property does not hold for the case of non-identical counting. In other words, if arbitrary overlaps are allowed, a subgraph of size *k* can be, and in general is, more frequent than the subgraphs of size <*k* it contains. The downward closure is a highly desirable property from a computational point of view: first, it reduces considerably the number of subgraphs to process; second, as we will see later, it allows to use the *A-priori* algorithm to further prune the number of subgraphs to be examined.

**Figure 1:**

**Figure 1:**

A further distinction is made in [24] where the case of *only overlaps of nodes* is considered. In [24] indeed three frequency concepts, denoted by _{1}, _{2} and _{3}, are identified:

_{1}: arbitrary overlaps of nodes and edges (non-identical case),

_{2}: only overlaps of nodes (edge-disjoint case),

_{3}: no overlaps (edge and vertex-disjoint case).

It has to be noticed that the downward closure property of the frequency holds for both _{2} and _{3}, while it does not for _{1}.

Hence the definition of the *frequency* of a subgraph is a key point when designing algorithms for motif discovery. Several approaches proposed by the computational biology community have considered arbitrary overlaps of motifs [10, 22, 25–32]. In PPI networks proteins often participate into different modules where they perform different functions; thus it is conceivable that subgraphs with overlapping vertices or edges are simultaneously active at any time. In contrast, in the data mining literature, most algorithms consider appearances of subgraphs that do not share edges nor vertices [23, 24, 33–35]. In such context, the search for recurring subgraphs is performed not in a single graph, as we have discussed so far, but in a dataset consisting of many graphs, also called *transactions*. The frequency of a subgraph is the number of graphs of the dataset that contains it. If this number is above a given threshold, the subgraph is denoted as a motif. Thus one can distinguish between a *single-graph* setting, as we have considered so far for biological networks, and a *graph-transaction* setting [23]. An attempt at combining these two settings is given by the algorithm NeMoFINDER for motif discovery in PPI networks [25], where a different definition of motif is used. A motif is a subgraph (not necessarily an induced subgraph) that is *frequent* and *unique*. A subgraph is frequent if it occurs more than *t*_{f} times in the real PPI network. A frequent subgraph is unique if its frequency in the real network is greater than that of at least *t*_{u} random networks among an ensemble of *n* random networks, where *t*_{f}, *t*_{u} and *n* are given parameters. The requirement that the frequency in the real network has to be greater than *t*_{f} excludes the possibility that a relatively unfrequent motif may turn out to be over-represented.

An important characterization of a motif is that of maximality [28, 36]. A motif *G*_{S} is a *maximal motif* if it is not contained in any subgraph of *G* that is a motif, i.e. none of the super graphs of *G*_{S} are frequent. Detecting only maximal motifs has the advantage of a reduced computation time since a smaller number of subgraphs are produced and stored.

The *concentration* is another interesting measure associated to a subgraph that has been used mostly in conjunction with sampling methods for motifs detection. For a given subgraph type *G*_{k} of size *t* the concentration is given by *C*(*G*_{k}) = *N*(*G*_{k})/∑_{i}*N*(*G*_{i}), where *N*(*G*_{k}) is the number of occurrences of *G*_{k} and ∑_{i}*N*(*G*_{i}) is the total number of occurrences of all the subgraphs *G*_{i} of the same size *t* as *G*_{k}.

### Statistical significance

We now turn to the definition of statistical significance of the frequency of a subgraph. According to definition 2, motifs have to be *over-represented*, i.e. appear significantly more often than what would be expected by chance. The over-representation of a subgraph is established on the basis of its frequency compared to the average frequency of the same subgraph in a set of random networks. The obtained values of the frequencies for the observed and random networks are compared through either the *Z-score* or the *abundance* (Δ) as defined in [10, 27]:

*f*

_{real}(

*G*

_{k}) is the frequency of the subgraph

*G*

_{k}into the real network, while

_{rand}(

*G*) and std(

_{k}*f*

_{rand}(

*G*

_{k})) are the average and the standard deviation of the frequency of

*G*

_{k}in the set of random networks, respectively. The value ε is a small constant that prevents Δ from growing too much when the number of occurrences of the subgraph is very small in both the real and random networks. The

*Z*-score is high if the subgraph is over-represented, negative if it is under-represented and close to zero otherwise. The abundance varies between −1 (under-represented) and +1 (over-represented). Even though rarely used in the biological literature, the measure of abundance could be more practical than the

*Z*-score when evaluating and comparing the appearances of small subgraphs in different networks, such as biological, social and internet networks, since it is less affected by the different network sizes.

An interesting characterization of networks was introduced in [27] based on network motifs: the *significance profile* (SP). Basically the SP of a network is a vector of normalized *Z*-scores that gives a histogram of the subgraph over-representation values over a network; the *Z*-scores are normalized to the unit length such that negative values, especially those close to −1, are associated to under-represented subgraphs, while positive ones, especially those close to 1, allow to recognize the motifs.

The use of *Z-score* is criticized when indiscriminately applied to scientific or real world problems irrespective of whether or not they have a Gaussian behaviour. A popular and entertaining book on randomness [37] lists several real life examples with non-Gaussian properties that are mistakenly interpreted in terms of *Z-score*. In the biological literature on networks a similar concern has been raised. In [38] it is observed that many network properties are not Gaussian and their distribution should be instead estimated from data. Thus a method from machine learning, namely *kernel density estimation* and *cross validation* [39], is proposed to evaluate the statistical significance of motifs. Once the distribution is learnt from sample data, the deviation is measured as the likelihood that an observation was drawn from random networks.

## MODELS OF RANDOM GRAPHS

An important issue at this point is how to model random networks to be used for comparison and evaluation of motif occurrences. There is a large body of literature on network models [15, 16, 40–49] in different fields from physics to sociology to internet and biology. An extensive mathematical-oriented survey of network models can be found in [50] where the Poisson random graphs, scale-free, exponential and Markov graphs are reviewed and their relations to various global and local network properties are discussed in depth. A much cited work by Barabasi and Oltvai [51] provides a short introduction to network models and their properties with focus on the principles that govern the growth and evolution of biological networks, especially scale-free networks.

We will briefly review the random graph models that are relevant to motif discovery. In the pioneering work of [10, 27, 52] random networks preserve the same degree distribution of biological networks. Degree distribution is perhaps the most important global property of a biological network. The degree distribution *P*(*k*) gives the probability that a node has degree *k*, i.e. has *k* adjacent nodes. Much work have been devoted to the characterization of biological networks in terms of degree distribution. In [51] biological networks and PPI networks are described as having a power-law degree distribution, i.e. *P*(*k*) ∼ *k*^{−γ}, with 2 < γ < 3. These networks are known as scale-free networks. The power-law distribution differs from the Poisson degree distribution typical of Erdos–Renyi (ER) random networks [43, 44] where the probability *P* of two nodes being connected is uniform across the network. In ER random networks, most nodes have a degree close to the average degree of the network and the probability *P*(*k*) decreases exponentially with *k*. In contrast, in a network with power–law degree distribution there is a long tail in the distribution implying a higher probability of nodes with high degrees with respect to ER models.

In [29], an extension of the idea of a random graph model preserving degree sequence is proposed for the search of *n*-node motifs. The random graph preserves the distribution of the (*n* − 1)-node subgraphs of the original graph. In their implementation *n* = 4. The same random graph model is suggested in a recent paper [53] to evaluate motifs of size up to 8. However, the generated random graphs preserve the distribution of the 3-node subgraphs only, since it is found computationally infeasible to preserve the subgraph distribution for larger values of *n*.

Recently, a new model of PPI networks has been proposed [54, 55] based on the notion of geometric random networks [56] and Poisson distribution of the degrees. A random geometric graph *G*(*n*, *r*) with radius *r* has *n* nodes that are independently and uniformly randomly distributed in a metric space. An edge connects two nodes *u*, *v* of *G* if their distance is less than the radius *r*. The results in [54] show that, despite the fact that the degree distribution of currently available PPI networks tends to be power–law, as in the scale-free model, other global properties such as diameter and clustering coefficient seem to better fit the geometric model. The prediction in [54] is that as the number of new experimentally and computationally determined protein interactions will increase, the degree distribution of PPI networks will be better approximated by a Poisson distribution.

A promising new research direction tries to improve the fit by incorporating node clustering into the model to account for different patterns of connectivity in different network modules, as in assortative mixing models [50]. One such scheme, called ERMG for Erdos-Renyi Mixtures for Graphs, is proposed in [57, 58] based on mixture distributions.

As observed in [21], for most models a perfect fit with a real network can be obtained in terms of a single selected property, either local or global, by accurately tuning certain model parameters. However, none of the proposed models can reproduce all the statistical properties of a naturally occurring network. Before concluding this section, it is important to stress that a major problem in modelling arises from the incompleteness and noise of the PPI data. The impact of false positives and negatives has been recently studied in [59, 60] where models of random link removal that reproduce incomplete networks have been analysed. Their findings indicate no significant dependence of degree distribution on the parameters of the models. On the other hand, motif frequencies seem to be affected by link removal strength. Further investigation is certainly needed in this area.

## MOTIF DISCOVERY ALGORITHMS

In this section, we will review some of the existing algorithms for motif discovery. This is a complex task from a computational point of view, since it involves graph and subgraph isomorphism. We first review exact approaches to identify motifs that are typically limited to 3- and 4-motifs. Then we survey approximate methods that extend the analysis to motifs of size up to 12–14; they are either based on sampling or resort to heuristics that may help reduce the amount of computation. We also briefly review graph data mining approaches, based on the A-priori algorithm. Even though they have been introduced for a different problem, namely the search for a subgraph in a dataset of graphs, they can provide good insights into the motif discovery problem.

Note that with the exception of the A-priori, the algorithms presented here adopt the _{1} frequency, consistently with the biological meaning discussed in ‘Frequency’ section.

### Exact algorithms

#### Exhaustive search

Pioneering work on motif discovery was presented in [10, 29] where an exhaustive search for motifs is performed across networks of different nature: transcription–regulation, neuron-synaptic connection, food webs, WWW. The algorithm has been applied also to compute the SP of the *Saccharomyces cerevisiae* network [31] and of various other different networks, including WWW and social networks [27]. This exhaustive recursive search ERS finds all connected induced subgraphs of 3 and 4 nodes. Edge and node overlaps of subgraphs are allowed. The input network is represented by an adjacency matrix *M*. The *n*-graphlets are extracted by enumerating all the *n* × *n* submatrices induced by *n* connected nodes. ERS does not scale to motifs of size bigger than 4.

A faster exact search algorithm is ESU [61] that incrementally builds subgraphs of a certain size *k* starting with individual nodes and adding one node at a time until the required size *k* is reached. During construction, for each partial subgraph the algorithm keeps an auxiliary list of nodes that are candidates for future additions. Such list is dynamically updated: for each new node *w* added to the subgraph, a node is added to the candidate list if it has a label higher than the first node selected for the subgraph construction and if it is adjacent to *w* but not to any of the nodes already in the subgraph. As we will see later, ESU, when combined to sampling, allows the detection of subgraphs of size up to 14.

#### Compact topological motifs

The motif discovery problem is solved exactly in [28] where a motif is considered to be a subgraph that occurs at least *t* times. The algorithm extends the notion of maximality typical of string analysis [62] to graphs. Edge-maximality (vertex-maximality) ensures that no more edges (vertices) can be added to a motif without altering the occurrences of the motif in the graph. The combinatorial explosion of the motifs, due to arbitrary overlaps, is handled by the introduction of a compact graph representation, obtained by grouping together maximal sets of nodes that are ‘indistinguishable’. The notion of maximal indistinguishability is here presented by means of an example. Formal definitions are in the article. In Figure 2 the nodes of the set *U*_{1} are indistinguishable with respect to the set *U*_{2}. The nodes within each set have the same label or colour; furthermore, there is an edge connecting each pair of nodes of *U*_{1} × *U*_{2}. The nodes in *U*_{1} may produce three distinct occurrences of a motif that includes also *U*_{2}. The enumeration of all such occurrences can be done more efficiently if a compact representation is used for the nodes of *U*_{1}. The resulting algorithm achieves a drastic reduction of the size of the output motifs.

**Figure 2:**

**Figure 2:**

### Approximate algorithms

#### Search algorithms based on sampling

A non-exact search was proposed by [26], where an algorithm based on sampling, called MFINDER, was developed. A sample subgraph of size *k* is extracted by picking at random edges of the input graph until a set of *k* nodes is obtained. The sampling procedure iteratively expands a subgraph by one by randomly picking a new edge in the set of edges adjacent to nodes already in the sample. When the sample reaches the required size, all the edges that connect its nodes in the original graph are added to it, thus obtaining an induced subgraph. Since some subgraphs have higher probability of being picked, an important part of the process is to assign weights to the samples to correct the non-uniform sampling. Then MFINDER computes the score of each type of subgraph of size *k* as the sum of the weights of all samples of the same type. Finally, the value of concentration defined in the previous section is used to estimate the significance of a motif.

Interesting features of this work are the ability to extract rare motifs with high probability and the low number of samples needed to obtain satisfactory results (even with 5000 samples MFINDER obtains concentrations very similar to ERS's). This algorithm scales well with large networks since the runtime is independent of the network size. However, as noticed in [25], it does not scale well with large motifs (it takes about 2 h to extract the 8-motifs in the transcriptional network of *Escherichia coli*, with | *V* | = 423 and | *E* | = 519).

One drawback of this approach is the extra time needed to compute the weights of all samples. This issue is addressed in [61, 63] where a new sampling method, Rand-ESU, is proposed to overcome this problem. Rand-ESU is based on the exact search algorithm ESU reviewed in the previous section. The recursive procedure ESU builds a tree whose leaves correspond to subgraphs of size *k* while internal nodes correspond to subgraphs of size 1 up to *k* − 1, depending on the tree level. Rather than traversing the entire tree, the sampling procedure visits parts of it and guarantees that all leaves are visited with uniform probability. This is achieved by assigning to each level in the tree a probability that the nodes are further explored. Although this approach is certainly faster and more intuitive, there is still the problem of how to empirically choose such probabilities.

A simpler strategy that produces uniform sampling consists of randomly picking subsets of *k*-nodes [20] instead of edges. Obviously, there is a possibility that the picked nodes do not belong to a connected component and do not identify a subgraph, therefore.

#### NeMoFINDER

A more recent paper [25] proposes a combination of approaches developed within the data mining and computational biology communities. The algorithm, NeMoFINDER, applies to unlabelled undirected graphs. A subgraph needs not be an induced subgraph. Furthermore, a motif is *frequent* and *unique*, as described in ‘Frequency’ section.

NeMoFINDER adopts the idea in SPIN [36] to search for repeated trees and then extend them to subgraphs. In short, the procedure first builds the set *T*_{k} of frequent trees of size up to *k* that appear in *G* more than a given threshold number; then, based on such trees, the graph *G* is divided into a set *GD* of transactions or graphs, such that each tree in *T*_{k} can be embedded in at least one of these transactions. The frequency of a motif *g* is computed as the number of transactions of *GD* into which *g* can be embedded. To compute such frequency, NeMoFINDER first determines the trees of *T*_{k} that are embedded in at least *t*_{f} transactions of *GD*, where *t*_{f} is the given threshold. Then each such tree is expanded by one edge and the frequency of the obtained graphlet is computed. The definition of motifs adopted in the article leads to a reduction of the number of graphlets to be examined and, consequently, of the computation time thus enabling the discovery of larger motifs, but at the cost of missing some potentially interesting subgraphs. An extension of this work is in [64].

#### Subgraph counting by scalar computation

An alternative interesting approach to characterize a biological network is by a set of measures (or an embedding space for a graph) based on scalars and functionals of the adjacency matrix *A* associated to the network [38]. The functionals are matrix multiplications of *A* and its transpose *A*^{T}, corresponding to moving forward or backward in the graph, respectively. The scalars are the results of an operation, such as a sum, on either the diagonal terms or the off-diagonal terms of the matrix obtained by the application of the functionals. As a simple example, the sum of the diagonal terms of the product *A*^{T}*A*^{2} gives the number of subgraphs of 3 nodes *i*, *j* and *k* with edges (*i*, *j*), (*j*, *k*) and (*i*, *k*). However, in most cases there is not an obvious meaningful correspondence of a scalar measure with a subgraph. In fact, the mapping between scalars and subgraphs is many-to-many. One of the major advantages of this approach, besides its mathematical elegance, is its computational efficiency when compared with other approaches, especially on dense scale-free networks. However, in the original paper [38] the time performance is only compared to that of the ERS and not to the latest versions of the exhaustive search, which are significantly more efficient.

#### A-priori-based motif detection

When no overlap of subgraphs in the original network is considered, the complexity of the discovery process can be handled more efficiently by a levelwise search, as in the A-priori algorithm for graph-transactions. In fact, in that case the downward closure property holds for the frequency. In other words, if a subgraph is frequent so are all its subgraphs. A typical levelwise search builds candidate motifs of size *k* by joining motifs of size *k* − 1 and then evaluating their frequency. Two subgraphs of size *k* − 1 are joined if they share a *core* of size *k* − 2. The core can consist of a set of vertices [34] or a set of edges [23]. Then the process of growing subgraphs may proceed by either adding a node at a time or an edge at a time. The issues of how to join separate motifs to obtain larger candidate motifs and how to count their occurrence are far from trivial, since they involve subgraph isomorphism. However, this setting allows the development of efficient pruning schemes, based on the canonical labelling of the subgraphs [20, 35] reviewed in ‘Graphs: definitions’ section. An overall significant speed-up over previous methods is achieved by this approach because fewer subgraphs need to be explored and stored.

## CONCLUSIONS

This survey has focused on the algorithmic aspects of motif discovery and on the graph models that best describe biological networks. We have not considered so far some very important biological questions that motivate the search for motifs and explain the attention paid to this problem.

Perhaps the most important question is: Are motifs the result of evolutionary selection for function? This question relates to the other important question: Which mechanisms regulate PPI network's growth and evolution? We cannot treat these controversial issues in depth here and only mention a few interesting points.

The selection of network motifs for their function is a theory proposed originally in [52] where the over abundance of certain motifs is interpreted as *manifestation of evolutionary design principles*. A criticism to this interpretation has been raised in [65] where it is asserted that the choice of the random network model of [52] is not adequate, and the null hypothesis is not well posed. The obtained motifs may be a consequence of some topological characteristics (such as spatial aggregation) present only in real networks and not in random networks, rather than a result of evolutionary design principles. While this comment poses an interesting issue related to the randomization process, it has been shown [29, 66] that random networks built under this additional topological constraint reveal a profile that is different from that of a real network. Another study in support of the theory of evolutionary conservation of motifs is in [67]. It shows how ortholog proteins from different species are often found in the same network motifs, indicating that motifs are evolutionary conserved topological units. This study has been conducted for proteins in *S. cerevisiae* and their orthologs in networks of five higher eukaryotes (*Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Mus musculus* and *Homo sapiens*).

The over-representation of certain motifs in PPI networks, namely triangles and squares, has been also explained by the duplication-mutation model of network growth [68–70]. In this probabilistic model, the network evolves by the addition of a new node that duplicates a randomly picked node of the original network and inherits all its links. Then, during evolution some of the new links may disappear and a new link between the copied and the original node may be added.

Seven different evolutionary mechanisms are investigated in [21] to determine which model best reproduces the structure of the PPI network of *D. melanogaster* in terms of motifs. The result of the study suggests that the duplication-mutation-complementation mechanism shaped Drosophila melanogaster. Another interesting outcome of this work is that the ER random model shows a better fit score than other popular growing graph mechanisms, such as preferential attachment [50, 51], when the analysis is restricted to the so-called core network, containing edges that have a high probability of occurrence *in vivo* [4].

In conclusion, caution should be taken when it comes to a biological interpretation because of the observational incompleteness and noise. Hopefully, future research on the discovery of larger composite motif combined with the availability of more reliable PPI data will provide answers to the very important biological questions.

A comprehensive review of the ‘Network Motif’ concept is provided on the basis of the different definitions of frequency and significance in computational biology and data mining.

The importance of network motifs in biological networks, such as protein-protein interaction (PPI) networks, is related to their ability of being a manifestation of evolutionary design principles.

The network motif discovery and classification problem is difficult since it involves the subgraph isomorphism problem. We analyze and compare several exact and approximate algorithmic approaches.

## REFERENCES

*Drosophila melanogaster*

*Saccharomyces cerevisiae*

*Saccharomyces cerevisiae*

*Drosophila melanogaster*protein interaction network

*Escherichia coli*

*Linked: The New Science of Networks*