## Abstract

We present a computational method for the prediction of functional modules encoded in microbial genomes. In this work, we have also developed a formal measure to quantify the degree of consistency between the predicted and the known modules, and have carried out statistical significance analysis of consistency measures. We first evaluate the functional relationship between two genes from three different perspectives—phylogenetic profile analysis, gene neighborhood analysis and Gene Ontology assignments. We then combine the three different sources of information in the framework of Bayesian inference, and we use the combined information to measure the strength of gene functional relationship. Finally, we apply a threshold-based method to predict functional modules. By applying this method to *Escherichia coli* K12, we have predicted 185 functional modules. Our predictions are highly consistent with the previously known functional modules in *E.coli* . The application results have demonstrated that our approach is highly promising for the prediction of functional modules encoded in a microbial genome.

## INTRODUCTION

The worldwide sequencing efforts of microbial genomes ( http://www.ncbi.nlm.nih.gov/genomes/MICROBES/Complete.html , http://www.sanger.ac.uk/Projects/Microbes , http://www.tigr.org/tdb/mdb/mdbcomplete.html and http://microbialgenome.org ) have led to the completion of over 200 microbial genomes, and this number will continue to increase very rapidly. We expect to see over a thousand sequenced genomes within the next few years. This wealth of genomic data provides unprecedented opportunities for computational biologists to unveil the enormous amount of information encoded in the genomes about the biological machinery of these micro-organisms.

The complex biological processes in a living microbial cell, including metabolism, regulations and signal transduction, are carried out by a large set of functional modules at various levels. These functional modules are made up of interacting biomolecules and serve as the basic building blocks of the complex biological machinery in a microbial cell. Some of the functional modules are organized in a hierarchical manner while others could serve in multiple levels forming a complex organizational network. At the very basic level in the functional hierarchy is the set of operons ( 1 , 2 ) (we also consider single-gene operons here), each of which are arranged in tandem in the genome and share a common promoter and a common terminator. A regulon is a group of operons that are regulated by a common transcriptional regulator ( 1 , 2 ), and a modulon is a group of regulons that are controlled by more global regulators and respond to more general physiological states ( 1 , 2 ). At the top of this functional hierarchy is a set of stimulons, each of which consists of a collection of operons, regulons and/or modulons that respond to a common environmental stimulus ( 1 , 2 ). In general, functional modules at different levels are made up of combinations of operons, regulons, modulons and possibly stimulons, many of which might have ‘conserved’ components and structures across multiple (related) microbial organisms. We expect that the interactions of these functional modules play the essential roles in the entire functionality of a microbial organism ( 3 , 4 ).

In this paper, we present a new computational framework for the prediction of functional modules in a microbial organism through comparative genome analysis and application of Gene Ontology (GO) ( 5 ) information. Our focus will be on the identification of genes involved in a functional module rather than the detailed interaction relationships among these genes. This study provides a basis for further prediction of detailed gene functions and prediction of biological (metabolic, signaling and regulatory) networks.

Comparative genome analysis is one of the most powerful tools to unravel the information encoded in a genome ( 6 , 7 ). By identifying conserved elements across multiple genomes, researchers have been able to uncover biological functions and structures at different levels of the biological machinery in a microbial cell. Successful applications of using such a strategy include predictions of gene functions ( 8 – 10 ), *cis* -regulatory elements ( 11 , 12 ), operons ( 13 , 14 ) and regulons ( 11 , 15 ).

Our approach to the identification of functional modules is based on three classes of information: (i) co-evolutionary information of genes that are encoded in phylogenetic profiles ( 8 ); (ii) conserved gene neighborhood information, which reflects if genes have ‘conserved’ adjacency relationship across multiple microbial genomes and, hence, possibly suggest their functional relatedness; and (iii) functional relatedness information of genes that are encoded in the GO classification ( 5 , 16 ). GO is a dynamically controlled vocabulary that can be applied to all organisms and has been used to measure protein or gene relationships ( 17 – 20 ). In this paper, we define a similarity measure among GO terms to evaluate the functional relationship of genes.

Each of the three measures provides a different perspective about functional relationships among genes. Information derived through each of them is then combined using a Bayesian inference framework. Using this combined score, we predict whether two genes belong to the same functional module. We use a graph representation to describe such a functional relatedness relationship. That is, if two genes are predicted to belong to the same functional module, they will have an edge linking their representative nodes in this graph representation. We believe that a functional module, in general, should be represented by a group of genes that are highly connected in this graph representation. In addition, each ‘highly connected’ subgraph may be part of a larger and also highly connected subgraph representing for functional modules at different levels in the biological machinery of a microbial cell. We have applied this computational procedure to the genome of *Escherichia coli* K12. On the 2579 genes of *E.coli* that have been assigned biological process GO terms, we have predicted a large interaction network involving these genes. Then using a particular segmentation strategy on this network, we have obtained 185 highly connected modules covering 654 genes. By comparing these predicted modules to the known pathways, regulons and operons in the Eco Cyc ( 21 ) and KEGG ( 22 ) databases, we have observed that the highest matching degrees (see definition later) achieved by these predicted modules are significantly higher than those achieved by randomly generated modules. We believe that this large interaction network contains a great amount of information about functional modules and relationships among them in *E.coli* . The predicted modules presented in this paper represent probably only a small subset of the functional modules in *E.coli* . In our future work, we intend to fully investigate the functional modules and their organizational structures by further refining the large network prediction and exploring more sophisticated strategies to identify many more such functional modules.

There have been numerous efforts in the past few years, devoted to the discovery of molecular modules through computational methods, as exemplified by the previous works ( 19 , 20 , 23 – 25 ). Lee *et al* . ( 19 ) compared different classes of data (including the physical and genetic interaction datasets, mRNA co-expression data, functional links extracted through literature search, prediction of gene fusion events and phylogenetic profile analysis) and integrate them by using a Bayesian framework, and Lee *et al* . have applied this capability for the prediction of functional relatedness of genes in *Saccharomyces cerevisiae* . von Mering *et al* . ( 20 ) first developed quantitative ways to measure functional relationship among genes from three different sources of information (including gene fusion, chromosomal proximity and phylogenetic profiles) and predicted functional modules by using a clustering algorithm. Spirin and Mirny ( 23 ) developed algorithms to analyze the structural properties of a predicted interaction network to identify the subsets of genes that are densely connected among themselves but sparsely connected with others. Yamada *et al* . ( 24 ) extracted gene modules from metabolic pathways by identifying genes that share similar phylogenetic profiles. Yanai and Delisi ( 25 ) predict gene links by using the same sources of information as described previously ( 20 ) and use the union operation to combine the three types of links to predict gene modules. The gene modules predicted by all these studies have shown some level of consistency with the well-established biological concepts as described in COGG ( 26 ), Eco Cyc ( 21 ) and KEGG ( 22 ).

Our approach differs significantly from the previous methods, as summarized below: (i) we utilize both the phylogenetic and the neighborhood profiles obtained from the comparative genome analysis; (ii) we explicitly incorporate the GO information into our evaluation of functional relationships of genes; (iii) we combine different sources of information in the framework of the Bayesian inference; and (iv) we develop a formal measure to quantify the degree of consistency between the predicted and the known modules, and we provide analysis of statistical significance for such comparisons.

## MATERIALS AND METHODS

We first evaluate the functional relationships among genes from three different perspectives: one based on GO assignments and the other two based on comparative genome analysis. We, then, combine these different measures by using a Bayesian inference to predict functional modules.

### Gene Ontology

The GO Consortium ( 5 ) has developed three separate ontologies—molecular function, biological process and cellular component—to describe the attributes of gene products, where molecular function defines what a gene product does at the biochemical level without specifying where or when the event actually occurs or its broader context; biological process describes the contribution of a gene product to a biological objective; and cellular component refers to where in the cell a gene product functions. Each GO is structured as a directed acyclic graph, wherein each term is a child of one or multiple parents, and child terms are instances or components of parent terms. For example, in Figure 1 , the term carbohydrate biosynthesis (GO: 0016051) is an instance of the term carbohydrate metabolism (GO: 0005975) as well as an instance of the term macromolecule biosynthesis (GO: 0009059).

From another point of view, from each GO term *V* , we can induce a directed acyclic graph that has the following properties: Figure 1 shows the graph induced from the term UDP-N-acetylgalactosamine biosynthesis (GO: 0019277). Note that the graph induced from a GO term can also be represented by a collection of paths with each path corresponding to a complete trace from the bottommost level (i.e. the GO term of interest itself) to the topmost level (i.e. the term Gene_Ontology). For example, one possible path in the graph of Figure 1 is as follows:

The bottommost level of the graph is

*V*itself, and at upper levels are its ancestor GO terms. Particularly, at the topmost level is the term Gene_Ontology (GO: 0003673).Given two GO terms

*V*_{1}and*V*_{2}, if*V*_{1}is one of the ancestors of*V*_{2}; then, the graph induced from*V*_{1}is completely included as a subgraph in the graph induced from*V*_{2}.

UDP-N-acetylgalactosamine biosynthesis (GO: 0019277) → amino sugar biosynthesis (GO: 0046349) → carbohydrate biosynthesis (GO: 0016051) → macromolecule biosynthesis (GO: 0009059) → biosynthesis (GO: 0009058) → metabolism (GO: 0008152) → physiological process (GO: 0007582) → biological_process (GO: 0008150) → Gene_Ontology (GO: 0003673).

The number of terms along the longest path in a graph is called the depth of the graph. For example, the depth of the graph for Figure 1 is 9. The depth of a graph reflects how specific the GO term is in describing the attributes of gene products. Hence, a GO term is always more specific than its ancestor GO terms.

In the rest of this paper, we do not differentiate between a GO term and the graph induced from it. When quantifying the similarity between two GO terms, it is desired that both their commonality and individual specificities (in describing the attributes of gene products) can be captured simultaneously. Let * V _{s}* and

*V*be the graphs induced from two GO terms, respectively. We define their similarity

_{t}*s*(

*V*,

_{s}*V*) as follows:

_{t}*L*and

_{s}*L*are the paths of

_{t}*V*and

_{s}*V*, respectively. If

_{t}*s*(

*V*,

_{s}*V*) is large, it means that the two GO terms are both highly specific and share much commonality in describing the attributes of gene products; if

_{t}*S*

_{GO}(

*V*,

_{s}*V*) is small, it means that either the two GO terms are not highly specific or they do not share much commonality; and, if

_{t}*s*(

*V*,

_{s}*V*) >

_{t}*s*(

*V*,

_{u}*V*), it means that the most recent common ancestor of

_{v}*V*and

_{s}*V*is more specific than the most recent common ancestor of

_{t}*V*and

_{u}*V*.

_{v} The above defined similarity measure is then used to assess the functional relationship among genes (through their products) based on their biological process GO assignments. Because a gene product may be involved in more than one biological process, it may be assigned with multiple GO terms, and, therefore, multiple GO graphs may be induced. Let **V** ( *g* ) denote all the GO terms assigned to a gene *g* . We define the GO similarity *S*_{GO} for a pair of genes * g _{i}* and

*g*as the maximum similarity of all possible combinations of

_{j}**V**(

*g*) and

_{i}**V**(

*g*), i.e.

_{j}*V*and

_{i}*V*are the GO terms assigned to

_{j}*g*and

_{i}*g*, respectively. If

_{j}*S*

_{GO}(

*g*,

_{i}*g*) is large, then at a very specific level the two genes are involved in at least one common biological process (e.g. both of them are involved in the UDP-N-acetylgalactosamine biosynthesis); and if

_{j}*S*

_{GO}(

*g*,

_{i}*g*) is small, then only at a very general level can the two genes be considered to be involved in the same biological process (e.g. both of them are involved in the physiological process).

_{t} Our similarity measure for GO annotations is very similar in concept to the information-content based semantic similarity defined by Lord *et al* . ( 27 ), although the two definitions treat in different ways how specific the most recent common ancestor of two GO terms is. In our definition, the specificity of a GO term is reflected by the number of GO terms along the longest path (distance) to the topmost level GO term, whereas in ( 27 ) the specificity is reflected by the number of genes that are assigned with it or its descendant GO terms. When assessing the functional relationship among genes using the similarity measure of GO terms, we take a different approach from that of Lord *et al* . ( 27 ). Given all GO terms assigned to two genes, we use the maximum similarity of all term pairs, whereas Lord *et al* . use the average similarity of all term pairs. We believe that the difference between the implementations of ours and Lord *et al* .'s is minor compared with their commonality at the conceptual level.

As we focus on the prediction of functional modules, which are basic building blocks of the biological machinery of a microbial cell for carrying out complex biological processes, we are most interested in whether a specific gene is involved in related biological processes and have consequently only used the biological process GO annotations of genes. Out of the 4311 genes in *E.coli* K12 (release of December 2003), 2579 genes have been assigned biological process GO terms by the GO Annotation project ( 16 ) (release of September 2004). In this paper we focus on these 2579 genes. These genes form 3 324 331 gene pairs, among which 46 009 pairs whose two genes belong to the same functional module (i.e. an operon, regulon or pathway) according to Eco Cyc ( 21 ) are considered to form the positive set, and the remaining 3 278 322 pairs are considered to form the background (or called random) set. Note that a random set is not necessarily an equivalent to a negative set, where the former consists of the pairs whose two genes have not been confirmed by experiments to be involved in the same functional module, and the latter consists of the pairs whose two genes have been confirmed by experiments not to be involved in the same functional module. Since our knowledge about the role of a protein/gene in biological processes is still accumulating and evolving, it might be difficult to identify a true negative set. Hence, we use a random set rather than a negative set.

The means and standard deviations of *S*_{GO} ( * g _{i}* ,

*g*) for both the positive and the random sets are summarized in Table 1 . We have performed a χ

_{j}^{2}-test ( 28 ) to check if the distribution of

*S*

_{GO}(

*g*,

_{i}*g*) is different for the positive and the random sets. The χ

_{j}^{2}-statistics (four bins have been used for the χ

^{2}-test, so that there are ∼32, 41, 15 and 27% of positive pairs, and 37, 37, 16 and 10% random pairs falling into the four bins, respectively) is corresponding to a

*P*-value less than 10

^{−4}, which reveals that the distribution of

*S*

_{GO}(

*g*,

_{i}*g*) for the positive set is significantly different from the random set. Figure 2 shows the distribution of

_{j}*S*

_{GO}(

*g*,

_{i}*g*) for the positive and the random sets. From this figure, we can see that a pair of genes of the same functional module (operon, regulon or pathway) are more likely to have a high degree of GO similarity than a random pair.

_{j}### Comparative genome analysis

Among the 145 complete genome sequences of bacteria and archaea that are available (release of December 2003), we have used 135 genomes for our comparative genome analysis, including *E.coli* K12 as the target genome and the other 134 as the reference genomes. Let *G*_{0} , *N*_{0} and * g _{i}* denote the genome, the number of genes and the

*i*-th (

*i*= 1, …,

*N*

_{0}) gene of

*E.coli*K12, and

*G*and

_{k}*N*denote the genome and the number of genes of the

_{k}*k*-th (

*k*= 1, …,

*K*,

*K*= 134 in our study) reference genome, respectively.

The first step in our comparative genome analysis is to predict orthologous genes for each gene of *E.coli* K12 in the reference genomes. We have used the PSI-BLAST ( 29 ) with an *E* -value of 10 ^{−6} to search for the bi-directional best hits (BDBH) and have obtained the following two profiles for each gene of *E.coli* K12, * g _{i}* (

*i*= 1, …,

*N*

_{0}): These two profiles are then used to evaluate the functional relationships of genes.

The phylogenetic profile of

*g*is a_{i}*K*-dimensional binary vector indicating the presence or absence of the orthologous genes of*g*in the reference genomes._{i}The neighborhood profile is a

*K*-dimensional vector with each element indicating the absence or the order of the orthologous genes of*g*along the reference genomes._{i}

#### Dissimilarity of phylogenetic profiles

Let **x*** _{i}* ≡ [

*x*

_{i}_{1},

*x*

_{i}_{2}, …,

*x*]

_{ik}^{T}denote the phylogenetic profile of gene

*g*, with

_{i}*x*= 1 representing the presence and

_{ik}*x*= 0 for the absence of

_{ik}*g*in

_{i}*G*. Given the phylogenetic profiles

_{k}**x**

*and*

_{i}**x**

*for a pair of genes*

_{j}*g*and

_{i}*g*, we define their dissimilarity

_{j}*d*(

*g*,

_{i}*g*) as follows:

_{j}*d*

_{Hamming}(

**x**

*,*

_{i}**x**

*) represents the Hamming distance between*

_{j}**x**

*and*

_{i}**x**

*, γ is a non-negative constant and has been set as 2 in our study, and*

_{j}*E*

_{ntropy}(

**x**

*,*

_{i}**x**

*) is the entropy of the common part of*

_{j}**x**

*and*

_{i}**x**

*and is computed as follows:*

_{j}*p*being the frequency of 1's in the common part. The dissimilarity between

**x**

*and*

_{i}**x**

*ranges from 0 to*

_{j}*K*, with 0 and

*K*corresponding to the identical and complementary phylogenetic profiles, respectively, and is smaller when

**x**

*and*

_{i}**x**

*have a smaller Hamming distance and/or a more diverse common part. We omit further details about how to choose the value of the constant γ to balance between the Hamming distance and the entropy of*

_{j}**x**

*and*

_{i}**x**

*.*

_{j} To check if the distribution of *d* ( * g _{i}* ,

*g*) is different for the positive and the random sets, we have performed both Kolmogorov–Smirnov test [by treating

_{j}*d*(

*g*,

_{i}*g*) as an observation of a continuous random variable] and χ

_{j}^{2}-tests ( 28 ). The Kolmogorov–Smirnov test rejects the hypothesis that the distribution of

*d*(

*g*,

_{i}*g*) is the same for the positive and the random sets with a significance level less than 10

_{j}^{−4}, while from the χ

^{2}-test (nine bins have been used for the χ

^{2}-test, so that there are at least 6% of positive pairs and 5.58% of random pairs in each bin) we have obtained the χ

^{2}-statistics value of 1341 corresponding to a

*P*-value less than 10

^{−4}. Therefore, both tests demonstrate that the distribution of

*d*(

*g*,

_{i}*g*) for the positive set is significantly different from the random set. Figure 3 shows the distributions of

_{j}*d*(

*g*,

_{i}*g*) for the positive and the random sets. Note that a pair of genes from the same functional module are less likely to have a large measure of dissimilarity than a random pair.

_{j}#### Likelihood of neighborhood profiles

Let **y*** _{i}* ≡ [

*y*

_{i}_{1},

*y*

_{i}_{2}, …,

*y*]

_{ik}^{T}denote the neighborhood profile of a gene

*g*. When the

_{i}*k*-th element of

**y**

*,*

_{i}*y*, is 0, it stands for that the orthologous gene of

_{ik}*g*is absent from

_{i}*G*; when

_{k}*y*is

_{ik}*n*≠ 0, it stands for that the orthologous gene of

*g*is ordered as the

_{i}*n*-th gene along

*G*. We make the following assumptions about the statistical model for the neighborhood profiles: This means that a gene's behavior (i.e. the presence/absence and order) is independent among different reference genomes.

_{k}For each gene

*g*and each reference genome_{i}*G*,_{k}*g*is present in_{i}*G*with probability_{k}*p*; and when present, the order of_{ik}*g*along_{i}*G*is uniformly distributed over {1, 2, …,_{k}*N*}, i.e._{k}\[ p\left({y}_{ik}=n\right)=\{\begin{array}{cc}1-{p}_{ik},& \hbox{ when }n=0\\ {p}_{ik}/{N}_{k},& \hbox{ when }n=1,2,\dots ,{N}_{k}.\end{array} \]For each gene

*g*, the elements of_{i}**y**are independent, i.e._{i}\[ P\left({y}_{ik},{y}_{il}\right)=P\left({y}_{ik}\right)P\left({y}_{il}\right),\hbox{ \hspace{1em} }k,l\in \left\{1,2,\mathrm{\dots },K\right\}\hbox{ and }k\ne l. \]

Given a pair of genes * g _{i}* and

*g*, under the hypothesis that

_{j}*g*and

_{i}*g*do not functionally relate to each other, their neighborhood profiles

_{j}**y**

*and*

_{i}**y**

*can be treated as independent random vectors; hence, the log-likelihood of*

_{j}**y**

*and*

_{i}**y**

*,*

_{j}*L*(

*g*,

_{i}*g*), is computed as follows:

_{j}*L*(

*g*,

_{i}*g*,

_{j}*G*) standing for the log-likelihood of

_{k}*y*and

_{ik}*y*and are being computed as follows:

_{jk}*I*(·,·) is 1 if and only if both criteria within the parentheses are satisfied and is 0 otherwise,

*P*

_{00}stands for the probability that neither

*g*nor

_{i}*g*is present,

_{j}*P*

_{01}stands for the probability that only

*g*is present,

_{j}*P*

_{10}stands for the probability that only

*g*is present, and

_{i}*P*

_{11}stands for the probability that both

*g*and

_{i}*g*are present and have a distance not more than

_{j}*d*(

_{k}*i*,

*j*) with

*d*(

_{k}*i*,

*j*) ≡ |

*y*−

_{ik}*y*| being the observed distance between

_{jk}*g*and

_{i}*g*(in terms of the number of genes in between) along

_{j}*G*. Since

_{k}*y*and

_{ik}*y*can be treated as independent random variables under the hypothesis that

_{jk}*g*and

_{i}*g*are functionally unrelated, these four probabilities are computed as follows:

_{j} Note that *P*_{11} is very small when * d _{k}* (

*i*,

*j*),

*p*and/or

_{ik}*p*are small. This is consistent with our intuition—it is very unlikely that two functionally unrelated genes are simultaneously present at a genome with a small distance, especially when these two genes are not highly conserved at this genome.

_{jk} The likelihood *L* ( * g _{i}* ,

*g*) is the evidence supporting the hypothesis that the two genes do not functionally relate to each other. The larger

_{j}*L*(

*g*,

_{i}*g*) is, the more supportive the neighborhood profiles

_{j}**y**

*and*

_{i}**y**

*are for this hypothesis; and the smaller*

_{j}*L*(

*g*,

_{i}*g*) is, the more

_{j}**y**

*and*

_{i}**y**

*are against this hypothesis (i.e. the more supportive*

_{j}**y**

*and*

_{i}**y**

*are for the alternative hypothesis that the two genes functionally relate to each other in some way).*

_{j}*g*and

_{i}*g*in terms of their neighborhood profiles. In this way, a larger

_{j}*S*

_{N}(

*g*,

_{i}*g*) implies a stronger functional relationship between

_{j}*g*and

_{i}*g*.

_{j} In practice, * p _{ik}* is unknown and must be estimated from phylogenetic profiles. We first group all the 134 reference genomes into 14 groups so that each group corresponds to a phylum (as shown in Table 2 ), and then assume that

*p*is identical within the same group of genomes for each gene

_{ik}*g*. The maximum-likelihood estimation of

_{i}*p*is computed as the frequency of

_{ik}*g*in the group that

_{i}*G*belongs to, i.e.

_{k} We have performed both the Kolmogorov–Smirnov test and the χ ^{2} -test to check if the distribution of *S*_{N} ( * g _{i}* ,

*g*) is different for the positive and the random sets. The Kolmogorov–Smirnov test rejects the hypothesis that the distribution of

_{j}*S*

_{N}(

*g*,

_{i}*g*) is the same for the positive and the random sets with significance level less than 10

_{j}^{−4}; and from the χ

^{2}-test (seven bins have been used for the χ

^{2}-test, so that there are at least 8% of positive gene pairs and 5.7% of random gene pairs in each bin) we have obtained the χ

^{2}-statistics value of 7583 corresponding to a

*P*-value less than 10

^{−4}. Therefore, both tests demonstrate that the distribution of

*S*

_{N}(

*g*,

_{i}*g*) for the positive set is significantly different from the random set. Figure 4 shows the distribution of

_{j}*S*

_{N}(

*g*,

_{i}*g*) for both the positive and the random sets. As shown in the figure, a pair of genes of the same functional module are more likely to have a large neighborhood score than a random pair.

_{j}### Bayesian inference for information fusion

As we have shown, the GO similarity measure *S*_{GO} ( * g _{i}* ,

*g*), the dissimilarity measure of phylogenetic profiles

_{j}*d*(

*g*,

_{i}*g*), and the neighborhood score

_{j}*S*

_{N}(

*g*,

_{i}*g*) reflect the possible functional relationship between two genes from different perspectives. To fully utilize the information from all the three sources, we combine them by using a Bayesian inference approach.

_{j} Given two genes * g _{i}* and

*g*, we assume that their GO assignments are conditionally independent of their comparative genome analysis. Therefore, based on the three measures,

_{j}*S*

_{GO}(

*g*,

_{i}*g*),

_{j}*d*(

*g*,

_{i}*g*) and

_{j}*S*

_{N}(

*g*,

_{i}*g*), the odds of

_{j}*g*and

_{i}*g*belonging to the same functional module can be computed as follows:

_{j}*g*and

_{i}*g*. The higher

_{j}*C*

_{ombined}(

*g*,

_{i}*g*) is, the stronger the functional relationship between

_{j}*g*and

_{i}*g*we consider them to have.

_{j} Note that the ratios and conditional distributions in the last three rows of ( 9 ) must be known or estimated a priori in order to calculate *C*_{ombined} ( * g _{i}* ,

*g*) for any pair of genes. To estimate them, we consider a pair of genes in the positive set to be in the same module, and a pair in the random set not to be in the same module.

_{j} On estimating the joint conditional distributions of { *d* ( * g _{i}* ,

*g*),

_{j}*S*

_{N}(

*g*,

_{i}*g*)} of (9), we have taken two different approaches. One approach, called the naive Bayesian inference ( 30 ), assumes the conditional independence between

_{j}*d*(

*g*,

_{i}*g*) and

_{j}*S*

_{N}(

*g*,

_{i}*g*), and estimates the conditional distributions of

_{j}*d*(

*g*,

_{i}*g*) and

_{j}*S*

_{N}(

*g*,

_{i}*g*) separately. The second approach, called the Bayesian inference, directly estimates the joint conditional distribution of {

_{j}*d*(

*g*,

_{i}*g*),

_{j}*S*

_{N}(

*g*,

_{i}*g*)}. Each approach has its own strengths and limitations. For example, the naive Bayesian inference approach heavily relies on the assumption of conditional independence; hence, the resulting estimated joint distribution may be far away from the true joint distribution when the assumption is not valid. Whereas, for the Bayesian inference approach, when estimating the distribution of random variables, we need much more observations for a multi-dimensional random vector than for a one-dimensional random variable (the number of needed observations grows exponentially with the dimensionality of the random vector) to achieve the same level of resolution ( 31 ); hence, there may exist a resolution problem with the estimated joint distribution. Especially for the estimation of

_{j} We have computed *C*_{ombined} ( * g _{i}* ,

*g*) by using both the naive Bayesian and Bayesian inferences. We have performed the Kolmogorov–Smirnov test and the χ

_{j}^{2}-test to check if the distribution of

*C*

_{ombined}(

*g*,

_{i}*g*) is different for the positive and the negative tests. The Kolmogorov–Smirnov tests reject the hypothesis that the distribution of

_{j}*C*

_{ombined}(

*g*,

_{i}*g*) is the same for the positive and the random sets with a significance level less than 10

_{j}^{−4}for both approaches; and, from the χ

^{2}-tests we have obtained the values of χ

^{2}-statistics as 11 591 and 4866, both corresponding to the

*P*-values less than 10

^{−4}, for the naive Bayesian and Bayesian approaches, respectively. [The χ

^{2}-tests have been performed on the normalized

*C*

_{ombined}(

*g*,

_{i}*g*) for both approaches. Although seven bins have been used, bins have been located differently for the naive Bayesian and Bayesian approaches. The bins for the naive Bayesian approach are located at 0.35, 0.36, …, 0.41, so that there are at least 8% of positive gene pairs and 8.92% of random gene pairs in each bin. The bins for the Bayesian approach are located at 0.25, 0.26, …, 0.31, so that there are at least 5% positive gene pairs and 5.8% random gene pairs in each bin.] All these tests demonstrate that the distribution of

_{j}*C*

_{ombined}(

*g*,

_{i}*g*) is different for the positive and the random sets, for both the naive Bayesian and the Bayesian approaches. To compare these two inference approaches, we have normalized

_{j}*C*

_{ombined}(

*g*,

_{i}*g*) so that the normalized

_{j}*C*

_{ombined}(

*g*,

_{i}*g*) for each approach ranges from 0 to 1. Figure 5 shows the distributions of the normalized

_{j}*C*

_{ombined}(

*g*,

_{i}*g*) for both the positive and the random sets, and for both the naive Bayesian and the Bayesian inference approaches. As shown in the figure that (i) for both approaches, a pair of genes of the same functional module are more likely to have a large

_{j}*C*

_{ombined}(

*g*,

_{i}*g*) than a random pair; and (ii) the naive Bayesian approach discriminates the positive pairs out of the random pairs more accurately than the Bayesian approach. So, in the rest of this paper, we focus on the naive Bayesian inference approach.

_{j}### Threshold-based module prediction and evaluation

Every two genes will have a score *C*_{ombined} ( * g _{i}* ,

*g*) measuring their functional relationship. The higher the score, the stronger their functional relationship is. Note that a negative value of

_{j}*C*

_{ombined}(

*g*,

_{i}*g*) does not necessarily mean that the

_{j}*g*and the

_{i}*g*are less likely to belong to the same functional module, because the positive and the random sets we have used to estimate the ratios and conditional distributions in (9) are not complete. More specifically, the positive set we have used is only part of the true positive set, and the random set we have used contains some true positives as well as true negatives, where by true positive we mean a set consisting of all pairs whose two genes belong to the same module, and by true negative we mean a set consisting of all pairs whose two genes do not belong to the same module.

_{j}The genes and their functional connections can be interpreted at different levels. At the lowest resolution level, all genes are functionally connected to form one large network that is responsible for all activities of a cell; at a higher resolution level, genes with stronger functional relationship stand out and form smaller and densely interacted modules that are responsible for some specific activities of a cell. At the highest resolution level, each gene forms a functional module by itself.

To predict biologically meaningful functional modules of smaller sizes, we apply a simple thresholding method described as follows. We first compute for each gene * g _{i}* the mean (

*m*) and standard deviation (σ

_{i}*) of its functional connection scores*

_{i}*C*

_{ombined}(

*g*,

_{i}*g*) with all other genes

_{j}*g*(

_{j}*j*≠

*i*), keep the connection between

*g*and

_{i}*g*if and only if

_{j}We choose the value of α to make our predicted modules consistent as much as possible with the known functional modules in Eco Cyc, where the consistency is measured by using the matching degrees defined below.

#### Matching degree between a pair of known and predicted gene modules

Let * K _{m}* be the set of genes in the

*m*-th known functional module, and

*C*be the set of genes in the

_{n}*n*-th predicted module, the matching degree between

*K*and

_{m}*C*,

_{n}*t*, is defined as follows:

_{mn}^{.}| represents for the cardinality of a set, and ∩ and ∪ represent the intersection and union operations between two sets.

As we demonstrate below, the matching degree * t _{mn}* defined in (11) is actually a combination of the measures of sensitivity and specificity of

*C*, where sensitivity = |

_{n}*K*∩

_{m}*C*|/|

_{n}*K*| and specificity = |

_{m}*K*∩

_{m}*C*|/|

_{n}*C*|.

_{n}*t*in (11) around sensitivity = 1 and specificity = 1. Note that one can add coefficients before sensitivity or specificity in (12) to put different weights on these two measures.

_{mn}#### Highest matching degree for a known module

Let **C** be the collection consisting of all the *N* predicted modules *C*_{1} , …, * C _{N}* , i.e.

**C**≡ {

*C*

_{1},

*C*

_{2}, …,

*C*}. Since

_{N}*t*is monotonically increasing as the sensitivity and/or specificity of

_{mn}*C*increases, by maximizing

_{n}*t*with respect to all the predicted modules in

_{mn}**C**, we pick one particular predicted module for

*K*that best balances the measures of sensitivity and specificity. We define the highest matching degree (HMD) for

_{m}*K*,

_{m}*t*which measures the matching capability of all the predicted modules of

_{m}**C**to

*K*, as follows:

_{m}**K**be the collection consisting of all the

*M*known functional modules, i.e.

**K**≡ {

*K*

_{1},

*K*

_{2}, …,

*K*}. We choose the value of α so that the average HMD (AHMD) over

_{M}**K**[defined in the following equation] is maximized.

If the genes in

**C**does not cover any gene of*K*, then the HMD achieved by_{m}**C**for*K*is 0._{m}If there exist several predicted modules in

**C**each of which can be perfectly matched with a fraction of*K*, then the HMD achieved by_{m}**C**for*K*is the ratio of the maximum size of these predicted modules to the size of_{m}*K*._{m}If there exists one predicted module

*C*one of whose fractions is perfectly matched with the entire_{n}*K*, then the HMD achieved by_{m}**C**for*K*is the ratio of the size of_{m}*K*to the size of_{m}*C*._{n}If there exists one predicted module

*C*that is perfectly matched with the entire_{n}*K*, then the HMD achieved by_{m}**C**for*K*is 1._{m}

#### Statistical analysis on the highest matching degree

Let *t*_{1} , *t*_{2} , …, * t _{m}* be the HMDs for

**K**≡ {

*K*

_{1},

*K*

_{2}, …,

*K*} achieved by our predicted modules

_{M}**C**≡ {

*C*

_{1},

*C*

_{2}, …,

*C*}. To evaluate the statistical significance of {

_{N}*t*

_{1},

*t*

_{2}, …,

*t*}, we first estimate the probability distribution of the HMDs for the same

_{M}**K**achieved by a collection of randomly predicted modules

*C*, and then estimate the

_{n}*Z*-score of {

*t*

_{1},

*t*

_{2}, …,

*t*}. If the

_{M}*Z*-score is high, then the HMDs achieved by

**C**are statistically significant. Here by randomly predicted modules we mean a module that is predicted by randomly picking out the genes with equal probability and without replacement from the pool of the

*N*

_{0}genes of

*E.coli*K12.

Given a known functional module * K _{m}* , because of the nature of randomness of the functional modules in

**C**′, the matching degree achieved by each

**C**′ and the HMD achieved by

**C**′ are all random variables; hence, in the following analysis, we use

*K*achieved by

_{m}**C**′, respectively.

We first focus on the statistical model of

*K*|, is small compared with

_{m}*N*

_{0}, i.e.

*p*≡ |

_{m}*K*|/

_{m}*N*

_{0}and

We then turn our attention to the statistical model of

*z*being given as in (16).

_{mn} Let μ * _{m}* and

*s*be the mean and standard deviation of

_{m}*K*, then

_{m}*K*. By using the Central Limit Theorem ( 32 ), the sum of the standardized HMDs over all

_{m}**K**≡ {

*K*

_{1},

*K*

_{2}, …,

*K*} asymptotically complies to a normal distribution, i.e.

_{M}*Z*-score of {

*t*

_{1},

*t*

_{2}, …,

*t*}, which are the HMDs for

_{M}**K**≡ {

*K*

_{1},

*K*

_{2}, …,

*K*} achieved by our predicted modules

_{M}**C**≡ {

*C*

_{1},

*C*

_{2}, …,

*C*}, is then computed as follows:

_{N}*Z*-score means that {

*t*

_{1},

*t*

_{2}, …,

*t*} and, consequently, our predicted modules in

_{M}**C**as well as our prediction method, are statistically significant.

## EXPERIMENTS AND RESULTS

### Performance by using the combined score *C*_{ombined} ( * g *_{i} , * g *_{j} )

_{i}

_{j}

We have performed the following procedure to predict functional modules based on the combined score *C*_{ombined} ( * g _{i}* ,

*g*): Owing to the nature of randomness in choosing genes to form the training set in the first step, the outputs of the following steps, including

_{j}*C*

_{ombined}(

*g*,

_{i}*g*), the maximum AHMD and the associated α-value, are all random in essence. Therefore, we have repeated the above procedure 10 times. Table 3 summarizes the maximum AHMDs and their associated α-values for these 10 experiments.

_{j}Randomly choose 1250 genes (≈50%) out of the gene pool (consisting of 2579

*E.coli*K12 genes) to form the training set, and use the pairs of these training genes to estimate the ratios and conditional distributions in (9).Compute

*C*_{ombined}(*g*,_{i}*g*) for all pairs (including the pairs of the training genes)._{j}Segment the large interaction network with various α values to predict functional modules that maximize the AHMD [defined in (14)].

As shown in Table 3 , all experiments have achieved similar AHMD for the known pathways, regulons and operons; and have shown that our predicted functional modules are matched with the known pathways better than with the known regulons or operons. Because all the experiments have shown similar performance in terms of AHMD, without loss of generality, we focus on the first experiment for the rest of the analysis.

### Comparisons among different sources of information

To see whether the combined score can better describe functional relationships among genes than individual scores, we have also performed experiments on each individual source of information, i.e. we have used *S*_{GO} ( * g _{i}* ,

*g*),

_{j}*d*(

*g*,

_{i}*g*) or

_{j}*S*

_{N}(

*g*,

_{i}*g*) alone as a measure of functional relationship between genes to predict functional modules. Table 4 summarizes the maximum AHMD values, the associated values of α, the number (

_{j}*N*) of predicted modules, the total number (|

**C**|) of genes in all the predicted modules and the associated

*Z*-score for the experiments based on

*C*

_{ombined}(

*g*,

_{i}*g*),

_{j}*S*

_{N}(

*g*,

_{i}*g*),

_{j}*d*(

*g*,

_{i}*g*) and

_{j}*S*

_{GO}(

*g*,

_{i}*g*), respectively.

_{j}From Table 4 , we can see that for the individual information sources, and for the combined information, The observation that phylogenetic profiles do not seem to contribute much to the identification of functional modules, while surprising, is consistent with the observation made by other authors ( 19 ). To exclude the possibility that this might be an artifact of the specific prediction method for orthologous genes using BDBH, we have also used the reciprocal smallest distance algorithm ( 33 ) for the prediction of orthologous genes, and have then performed the same experiment of comparing different information sources. The results are summarized in Table 5 , for which we have made the same observations as above.

The AHMD values and the associated

*Z*-scores achieved by using the phylogenetic profile dissimilarity measure*d*(*g*,_{i}*g*) are much smaller than those achieved by using the other information sources, which demonstrates that_{j}*d*(*g*,_{i}*g*) alone cannot provide sufficient information to predict functional modules that are reasonably consistent with the known functional modules._{j}Although the AHMD values for both the GO assignments and the neighborhood profiles are moderate, their associated

*Z*-scores are very high, which demonstrates that either*S*_{GO}(*g*,_{i}*g*) or_{j}*S*_{N}(*g*,_{i}*g*) alone already provides sufficient information to achieve sound consistency with the known functional modules. This observation regarding the neighborhood profiles makes it very promising as a measure for the prediction of functional modules for those microbial genomes that have been sequenced but do not have much other information (e.g. GO assignments) available._{j}For the GO assignments, the AHMD value and the associated

*Z*-score for the known pathways are larger than those for either the known regulons or known operons, which demonstrate that the functional modules predicted by using our approach are more consistent with the known pathways than with the known regulons or known operons. We have made a similar observation for the neighborhood profiles.

For the known pathways, the AHMD value achieved by using the combined information

*C*_{ombined}(*g*,_{i}*g*) is larger than that of each individual information source, although its_{j}*Z*-score is smaller than that of the GO assignments. Because the*Z*-scores for both*C*_{ombined}(*g*,_{i}*g*) and_{j}*S*_{GO}(*g*,_{i}*g*) are already very high, and their values of_{j}*N*and |**C**| (consequently the sizes of all the predicted modules) are different, the fact that the former is smaller than the latter does not necessarily mean that*C*_{ombined}(*g*,_{i}*g*) is worse than_{j}*S*_{GO}(*g*,_{i}*g*); rather, the obvious difference among their AHMD values demonstrates that information fusion can achieve a higher degree of consistency with the known pathways than individual information sources._{j}For the known operons, though the neighborhood profiles alone already provide sufficient information to achieve sound consistency, the GO assignments do not. The incapabilities of

*S*_{GO}(*g*,_{i}*g*) greatly undermine the capabilities of_{j}*S*_{N}(*g*,_{i}*g*) during the information fusion; hence, either the AHMD value or_{j}*Z*-score for the combined information*C*_{ombined}(*g*,_{i}*g*) cannot even be as high as that of_{j}*S*_{N}(*g*,_{i}*g*) alone. We have made similar observations about known regulons._{j}

### Modules predicted using *C*_{ombined} ( * g *_{i} , * g *_{j} ) for a particular choice of α

_{i}

_{j}

For the combined information *C*_{ombined} , Figure 6 shows the total number of edges and the AHMD values for the known pathways, regulons and operons as functions of α. Observe from the figure that (i) the number of edges decreases rapidly as the value of α increases; (ii) the tendencies of the three AHMD values all first increase and then decrease as the value of α increases; and (iii) these three AHMD values achieve their maximum around α ∈ [6, 7].

When using α = 6.75, we have obtained 185 predicted functional modules covering 654 genes. Figure 7 shows those predicted modules consisting of at least three genes, where genes are identified by using their PIDs. For most cases, genes within the same predicted modules belong to the same functional modules according to Eco Cyc, or have similar GO assignments. This means that we can predict functional units based on our approach. For example, all genes in the module of Figure 8 are involved in the flagellar metabolism pathway according to KEGG ( 22 ), even though only the colored genes and edges are confirmed by Eco Cyc to belong to the same operons or regulons. We have also made the following interesting observations:

Genes within the same predicted module belong to several different pathways, as shown in Figure 9 . For the predicted module (a) involving the histidine and the tryptophan biosyntheses, the two pathways are connected through the gene

*hisA*(16129965), which is assigned to be involved in both pathways by GO Annotation ( 16 ). For the predicted module (b) involving the methionine and the cysteine biosyntheses, the two pathways are connected through the gene*metB*(16131777), which is assigned to be involved in both pathways by KEGG ( 22 ). For the predicted module (c) involving the glycerolipid metabolism, glycerol-3-P ABC transporter and branched amino acid ABC transporter, the first two parts are connected because they are both related to glycerol, and the last two parts are connected because they have similar GO assignments (i.e. transport). So far, we have not been able to find experimental evidence to support our predicted module in (d) which involves the pantothenate and folate biosyntheses, but this prediction may be due to that both pathways are related to the precursors for co-enzyme biosynthesis ( 34 ).Genes are connected in a predicted module mainly because they are conserved neighboring genes on the same strand of the genome, as

*nusB*(16128401) and*ribH*(16128400) in the module of Figure 10 and other five modules each consisting of one pair of genes. These genes are highly likely to be functionally related, and deserve further experimental investigations.Genes are connected via their paralogous genes. For example, we have predicted two modules each consisting of two genes that do not have any obvious commonality. For the pair

*amtB*(16128436) and*glnB*(16130478), one of the paralogous genes of*glnB*,*glnK*(16128435), belongs to the same operon as*amtB*according to Eco Cyc. For the other pair*purU*(16129193) and*add*(16129581), one of the paralogous genes of*purU*,*purN*(16130425), is involved the same pathway (purine metabolism) as*add*.So far, we have not been able to find evidences in Eco Cyc, KEGG or GO to support our predicted modules in Figure 11 and the other five modules each consisting of two genes. They deserve further experimental investigations.

### Modules predicted using *C*_{ombined} ( * g *_{i} , * g *_{j} ) for a particular gene

_{i}

_{j}

We have also focused on one particular gene *hemL* (16128147) to see how its involved module changes as α is changed. Figure 12 shows three predicted modules that *hemL* is involved for α = 6.75, α = 5.5 and α = 4, respectively. When α = 6.75, the predicted module consists of only four genes, all of which are involved in the porphyrin and chlorophyll metabolism pathway. When α decreases to 5.5, the predicted module consists of 79 genes, which are involved in valine, leucine and isoleucine biosynthesis, lysine biosynthesis, arginine metabolism, glutamate metabolism, proline biosynthesis and degradation, methionine metabolism, cysteine biosynthesis, biosynthesis of proto- and siroheme, biosynthesis of steroids, and sulfate transport pathways, respectively. When α decreases further to 4, the predicted module consists of 1116 out of all 2579 genes.

As we mentioned earlier, the functional relationships among genes can be viewed at different levels. At a very high resolution level (as shown in Figure 12a ), only a small number of genes are grouped together so that the group is responsible for one specific activity; at a lower resolution level (as shown in Figure 12b ), a relatively larger number of genes are grouped together so that the group is responsible for more general activities; and, at the lowest resolution level (as shown in Figure 12c ) most of the genes are connected directly or indirectly so that the group is responsible for most activities of a cell. Consequently, by varying the threshold values, we can predict the hierarchical structure of the functional modules.

## CONCLUSIONS

We have presented a new computational method to predict functional modules by combining the information from the comparative genome analysis and the GO in the framework of the Bayesian inference. In this work, we have also developed a formal measure to quantify the degree of consistency between the predicted and the known modules, and provided analysis of the statistical significance for such consistency degrees. We have applied our method to the genome of *E.coli* K12, and have observed that (i) the predicted modules are more consistent with the known pathways than to the known regulons or operons; (ii) neighborhood profiles or GO assignments alone can provide sufficient information for predicting modules that are fairly consistent with the known functional modules, but phylogenetic profiles cannot; (iii) by fusing the information from the GO, phylogenetic and neighborhood profiles using the naive Bayesian inference and using the combined information for module prediction, even higher degrees of consistency can be achieved for the known functional modules; (iv) most of the predicted modules can be verified by Eco Cyc, KEGG or GO, and the unverified predicted modules reveal interesting gene functional relationships that deserve further experimental investigations; and (v) different threshold values can be used to predict functional modules at different resolution levels.

In our future study, we will apply the method presented in this paper to other microbial genomes. Particularly, since we have observed that the neighborhood profiles alone can provide sufficient information for the prediction, we will use the neighborhood profiles to evaluate the gene functional relationships for those microbial genomes that have already been completely sequenced but do not have much other information (e.g. GO) available. We will also generalize the current method of information fusion based on the Bayesian inference to incorporate more sources of information, e.g. microarray data. And finally, we will apply more sophisticated methods to gene clustering so that even higher degrees of consistency can be achieved for the known functional modules.

**Figure 1**

**Figure 1**

**Figure 2**

**Figure 2**

**Table 1**

S_{GO} ( g , _{i} g ) _{j} | d ( g , _{i} g ) _{j} | S_{N} ( g , _{i} g ) _{j} | C_{ombined} ( g , _{i} g ) _{j} | |||||
---|---|---|---|---|---|---|---|---|

Mean | SD | Mean | SD | Mean | SD | Mean | SD | |

Positive set | 3.652 | 1.871 | 23.273 | 11.365 | 0.864 | 0.436 | 0.286 | 1.192 |

Random set | 3.111 | 1.244 | 26.882 | 16.077 | 0.720 | 0.266 | −0.262 | 0.813 |

S_{GO} ( g , _{i} g ) _{j} | d ( g , _{i} g ) _{j} | S_{N} ( g , _{i} g ) _{j} | C_{ombined} ( g , _{i} g ) _{j} | |||||
---|---|---|---|---|---|---|---|---|

Mean | SD | Mean | SD | Mean | SD | Mean | SD | |

Positive set | 3.652 | 1.871 | 23.273 | 11.365 | 0.864 | 0.436 | 0.286 | 1.192 |

Random set | 3.111 | 1.244 | 26.882 | 16.077 | 0.720 | 0.266 | −0.262 | 0.813 |

**Figure 3**

**Figure 3**

**Table 2**

Phylum | Genomes |
---|---|

Crenarchaeota | Aeropyrum pernix , Pyrobaculum aerophilum , Sulfolobus solfataricus , Sulfolobus tokodaii |

Aquificae | Aquifex aeolicus |

Euryarchaeota | Archaeoglobus fulgidus DSM 4304, Halobacterium sp. NRC-1, Methanococcus jannaschii , Methanopyrus kandleri AV19, Methanosarcina acetivorans str. C2A, Methanosarcina mazei Goe1, Methanothermobacter thermautotrophicus , Pyrococcus abyssi , Pyrococcus horikoshii , Pyrococcus furiosus DSM 3638, Thermoplasma acidophilum , Thermoplasma volcanium |

Firmicutes | Bacillus anthracis A2012, Bacillus anthracis str. Ames , Bacillus cereus ATCC 14579, Bacillus halodurans , Bacillus subtilis , Clostridium acetobutylicum , Clostridium perfringens , Clostridium tetani E88, Enterococcus faecalis V583, Lactobacillus plantarum WCFS1, Lactococcus lactis subsp. lactis , Listeria innocua , Listeria monocytogenes EGD-e, Mycoplasma gallisepticum R, Mycoplasma genitalium , Mycoplasma penetrans , Mycoplasma pneumoniae , Mycoplasma pulmonis , Oceanobacillus iheyensis HTE831, Staphylococcus aureus subsp. aureus MW2, Staphylococcus aureus subsp. aureus Mu50, Staphylococcus aureus subsp. aureus N315, Staphylococcus epidermidis ATCC 12228, Streptococcus agalactiae 2603V/R, Streptococcus agalactiae NEM316, Streptococcus mutans UA159, Streptococcus pneumoniae R6, Streptococcus pneumoniae TIGR4, Streptococcus pyogenes , Streptococcus pyogenes MGAS315, Streptococcus pyogenes MGAS8232, Streptococcus pyogenes SSI-1, Thermoanaerobacter tengcongensis , Ureaplasma urealyticum |

Bacteroidetes | Bacteroides thetaiotaomicron VPI-5482, Chlorobium tepidum TLS, Porphyromonas gingivalis W83 |

Actinobacteria | Bifidobacterium longum NCC2705, Corynebacterium diphtheriae , Corynebacterium efficiens YS-314, Corynebacterium glutamicum ATCC 13032, Mycobacterium bovis subsp. bovis AF2122/97, Mycobacterium leprae , Mycobacterium tuberculosis CDC1551, Mycobacterium tuberculosis H37Rv, Streptomyces avermitilis MA-4680, Streptomyces coelicolor A3(2), Tropheryma whipplei TW08/27, Tropheryma whipplei str. Twist |

Spirochaetes | Borrelia burgdorferi , Treponema pallidum |

Chlamydiae | Chlamydia muridarum , Chlamydia trachomatis , Chlamydophila caviae GPIC, Chlamydophila pneumoniae AR39, Chlamydophila pneumoniae CWL029, Chlamydophila pneumoniae J138, Chlamydophila pneumoniae TW-183 |

Fusobacteria | Fusobacterium nucleatum subsp. nucleatum ATCC 25586 |

Cyanobacteria | Gloeobacter violaceus , Nostoc sp. PCC 7120, Prochlorococcus marinus str. MIT 9313, Prochlorococcus marinus subsp. marinus str. CCMP1375, Prochlorococcus marinus subsp. pastoris str. CCMP1378, Synechococcus sp. WH 8102, Synechocystis sp. PCC 6803, Thermosynechococcus elongatus BP-1 |

Nanoarchaeota | Nanoarchaeum equitans Kin4-M |

Planctomycetes | Pirellula sp. |

Thermotogae | Thermotoga maritima |

Proteobacteria | Bordetella bronchiseptica , Bordetella parapertussis , Bordetella pertussis , Bradyrhizobium japonicum , Buchnera aphidicola ( Baizongia pistaciae ), Buchnera aphidicola str. APS ( Acyrthosiphon pisum ), Buchnera aphidicola str. Sg ( Schizaphis graminum ), Campylobacter jejuni , Candidatus Blochmannia floridanus , Caulobacter crescentus CB15, Chromobacterium violaceum ATCC 12472, Coxiella burnetii RSA 493, Escherichia coli CFT073, Escherichia coli O157:H7, Escherichia coli O157:H7 EDL933, Haemophilus ducreyi 35000HP, Haemophilus influenzae Rd, Helicobacter hepaticus ATCC 51449, Helicobacter pylori 26695, Helicobacter pylori J99, Mesorhizobium loti , Neisseria meningitidis MC58, Neisseria meningitidis Z2491, Nitrosomonas europaea ATCC 19718, Pasteurella multocida , Photorhabdus luminescens subsp. laumondii TTO1, Pseudomonas aeruginosa PA01, Pseudomonas putida KT2440, Pseudomonas syringae pv. tomato str. DC3000, Ralstonia solanacearum , Rickettsia conorii , Rickettsia prowazekii , Salmonella enterica subsp. enterica serovar Typhi , Salmonella enterica subsp. enterica serovar Typhi Ty2, Salmonella typhimurium LT2, Shewanella oneidensis MR-1, Shigella flexneri 2a str. 2457T, Shigella flexneri 2a str. 301, Sinorhizobium meliloti , Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis , Wolinella succinogenes , Xanthomonas axonopodis pv. citri str. 306, Xanthomonas campestris pv. campestris str. ATCC 33913, Xylella fastidiosa 9a5c, Xylella fastidiosa Temecula1, Yersinia pestis , Yersinia pestis KIM |

Phylum | Genomes |
---|---|

Crenarchaeota | Aeropyrum pernix , Pyrobaculum aerophilum , Sulfolobus solfataricus , Sulfolobus tokodaii |

Aquificae | Aquifex aeolicus |

Euryarchaeota | Archaeoglobus fulgidus DSM 4304, Halobacterium sp. NRC-1, Methanococcus jannaschii , Methanopyrus kandleri AV19, Methanosarcina acetivorans str. C2A, Methanosarcina mazei Goe1, Methanothermobacter thermautotrophicus , Pyrococcus abyssi , Pyrococcus horikoshii , Pyrococcus furiosus DSM 3638, Thermoplasma acidophilum , Thermoplasma volcanium |

Firmicutes | Bacillus anthracis A2012, Bacillus anthracis str. Ames , Bacillus cereus ATCC 14579, Bacillus halodurans , Bacillus subtilis , Clostridium acetobutylicum , Clostridium perfringens , Clostridium tetani E88, Enterococcus faecalis V583, Lactobacillus plantarum WCFS1, Lactococcus lactis subsp. lactis , Listeria innocua , Listeria monocytogenes EGD-e, Mycoplasma gallisepticum R, Mycoplasma genitalium , Mycoplasma penetrans , Mycoplasma pneumoniae , Mycoplasma pulmonis , Oceanobacillus iheyensis HTE831, Staphylococcus aureus subsp. aureus MW2, Staphylococcus aureus subsp. aureus Mu50, Staphylococcus aureus subsp. aureus N315, Staphylococcus epidermidis ATCC 12228, Streptococcus agalactiae 2603V/R, Streptococcus agalactiae NEM316, Streptococcus mutans UA159, Streptococcus pneumoniae R6, Streptococcus pneumoniae TIGR4, Streptococcus pyogenes , Streptococcus pyogenes MGAS315, Streptococcus pyogenes MGAS8232, Streptococcus pyogenes SSI-1, Thermoanaerobacter tengcongensis , Ureaplasma urealyticum |

Bacteroidetes | Bacteroides thetaiotaomicron VPI-5482, Chlorobium tepidum TLS, Porphyromonas gingivalis W83 |

Actinobacteria | Bifidobacterium longum NCC2705, Corynebacterium diphtheriae , Corynebacterium efficiens YS-314, Corynebacterium glutamicum ATCC 13032, Mycobacterium bovis subsp. bovis AF2122/97, Mycobacterium leprae , Mycobacterium tuberculosis CDC1551, Mycobacterium tuberculosis H37Rv, Streptomyces avermitilis MA-4680, Streptomyces coelicolor A3(2), Tropheryma whipplei TW08/27, Tropheryma whipplei str. Twist |

Spirochaetes | Borrelia burgdorferi , Treponema pallidum |

Chlamydiae | Chlamydia muridarum , Chlamydia trachomatis , Chlamydophila caviae GPIC, Chlamydophila pneumoniae AR39, Chlamydophila pneumoniae CWL029, Chlamydophila pneumoniae J138, Chlamydophila pneumoniae TW-183 |

Fusobacteria | Fusobacterium nucleatum subsp. nucleatum ATCC 25586 |

Cyanobacteria | Gloeobacter violaceus , Nostoc sp. PCC 7120, Prochlorococcus marinus str. MIT 9313, Prochlorococcus marinus subsp. marinus str. CCMP1375, Prochlorococcus marinus subsp. pastoris str. CCMP1378, Synechococcus sp. WH 8102, Synechocystis sp. PCC 6803, Thermosynechococcus elongatus BP-1 |

Nanoarchaeota | Nanoarchaeum equitans Kin4-M |

Planctomycetes | Pirellula sp. |

Thermotogae | Thermotoga maritima |

Proteobacteria | Bordetella bronchiseptica , Bordetella parapertussis , Bordetella pertussis , Bradyrhizobium japonicum , Buchnera aphidicola ( Baizongia pistaciae ), Buchnera aphidicola str. APS ( Acyrthosiphon pisum ), Buchnera aphidicola str. Sg ( Schizaphis graminum ), Campylobacter jejuni , Candidatus Blochmannia floridanus , Caulobacter crescentus CB15, Chromobacterium violaceum ATCC 12472, Coxiella burnetii RSA 493, Escherichia coli CFT073, Escherichia coli O157:H7, Escherichia coli O157:H7 EDL933, Haemophilus ducreyi 35000HP, Haemophilus influenzae Rd, Helicobacter hepaticus ATCC 51449, Helicobacter pylori 26695, Helicobacter pylori J99, Mesorhizobium loti , Neisseria meningitidis MC58, Neisseria meningitidis Z2491, Nitrosomonas europaea ATCC 19718, Pasteurella multocida , Photorhabdus luminescens subsp. laumondii TTO1, Pseudomonas aeruginosa PA01, Pseudomonas putida KT2440, Pseudomonas syringae pv. tomato str. DC3000, Ralstonia solanacearum , Rickettsia conorii , Rickettsia prowazekii , Salmonella enterica subsp. enterica serovar Typhi , Salmonella enterica subsp. enterica serovar Typhi Ty2, Salmonella typhimurium LT2, Shewanella oneidensis MR-1, Shigella flexneri 2a str. 2457T, Shigella flexneri 2a str. 301, Sinorhizobium meliloti , Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis , Wolinella succinogenes , Xanthomonas axonopodis pv. citri str. 306, Xanthomonas campestris pv. campestris str. ATCC 33913, Xylella fastidiosa 9a5c, Xylella fastidiosa Temecula1, Yersinia pestis , Yersinia pestis KIM |

**Figure 4**

**Figure 4**

**Figure 5**

**Figure 5**

**Table 3**

Experiment | Pathway | Regulon | Operon | |||
---|---|---|---|---|---|---|

α | AHMD | α | AHMD | α | AHMD | |

1 | 6.75 | 0.265 | 6.5 | 0.168 | 6.5 | 0.164 |

2 | 6.25 | 0.257 | 5.25 | 0.171 | 5.25 | 0.165 |

3 | 7 | 0.259 | 5.75 | 0.192 | 5.5 | 0.172 |

4 | 6.25 | 0.260 | 5.25 | 0.184 | 5 | 0.157 |

5 | 6 | 0.269 | 5.5 | 0.194 | 6 | 0.181 |

6 | 6.25 | 0.268 | 5.75 | 0.183 | 6 | 0.171 |

7 | 5.25 | 0.249 | 5.25 | 0.190 | 4.75 | 0.171 |

8 | 6.25 | 0.288 | 6 | 0.217 | 6 | 0.188 |

9 | 5.75 | 0.261 | 4.75 | 0.191 | 4.75 | 0.165 |

10 | 6 | 0.267 | 5.25 | 0.200 | 5.25 | 0.176 |

Experiment | Pathway | Regulon | Operon | |||
---|---|---|---|---|---|---|

α | AHMD | α | AHMD | α | AHMD | |

1 | 6.75 | 0.265 | 6.5 | 0.168 | 6.5 | 0.164 |

2 | 6.25 | 0.257 | 5.25 | 0.171 | 5.25 | 0.165 |

3 | 7 | 0.259 | 5.75 | 0.192 | 5.5 | 0.172 |

4 | 6.25 | 0.260 | 5.25 | 0.184 | 5 | 0.157 |

5 | 6 | 0.269 | 5.5 | 0.194 | 6 | 0.181 |

6 | 6.25 | 0.268 | 5.75 | 0.183 | 6 | 0.171 |

7 | 5.25 | 0.249 | 5.25 | 0.190 | 4.75 | 0.171 |

8 | 6.25 | 0.288 | 6 | 0.217 | 6 | 0.188 |

9 | 5.75 | 0.261 | 4.75 | 0.191 | 4.75 | 0.165 |

10 | 6 | 0.267 | 5.25 | 0.200 | 5.25 | 0.176 |

**Table 4**

AHMD | α | N | | C | | Z -score | |
---|---|---|---|---|---|

Pathways ( M = 207) | |||||

C_{ombined} ( g , _{i} g ) _{j} | 0.265 | 6.75 | 185 | 654 | 62.293 |

S_{N} ( g , _{i} g ) _{j} | 0.236 | 3.75 | 189 | 998 | 58.474 |

d ( g , _{i} g ) _{j} | 0.0364 | 3 | 28 | 221 | 4.753 |

S_{GO} ( g , _{i} g ) _{j} | 0.224 | 4 | 106 | 796 | 70.103 |

Regulons ( M = 132) | |||||

C_{ombined} ( g , _{i} g ) _{j} | 0.168 | 6.5 | 194 | 717 | 37.908 |

S_{N} ( g , _{i} g ) _{j} | 0.182 | 3.5 | 189 | 1099 | 40.576 |

d ( g , _{i} g ) _{j} | 0.0200 | 2.75 | 26 | 431 | 0.769 |

S_{GO} ( g , _{i} g ) _{j} | 0.117 | 3.75 | 115 | 959 | 31.591 |

Operons ( M = 745) | |||||

C_{ombined} ( g , _{i} g ) _{j} | 0.164 | 6.5 | 194 | 717 | 32.572 |

S_{N} ( g , _{i} g ) _{j} | 0.176 | 5 | 188 | 702 | 39.406 |

d ( g , _{i} g ) _{j} | 0.0147 | 3 | 28 | 221 | 0.502 |

S_{GO} ( g , _{i} g ) _{j} | 0.0708 | 3.75 | 115 | 959 | 13.868 |

AHMD | α | N | | C | | Z -score | |
---|---|---|---|---|---|

Pathways ( M = 207) | |||||

C_{ombined} ( g , _{i} g ) _{j} | 0.265 | 6.75 | 185 | 654 | 62.293 |

S_{N} ( g , _{i} g ) _{j} | 0.236 | 3.75 | 189 | 998 | 58.474 |

d ( g , _{i} g ) _{j} | 0.0364 | 3 | 28 | 221 | 4.753 |

S_{GO} ( g , _{i} g ) _{j} | 0.224 | 4 | 106 | 796 | 70.103 |

Regulons ( M = 132) | |||||

C_{ombined} ( g , _{i} g ) _{j} | 0.168 | 6.5 | 194 | 717 | 37.908 |

S_{N} ( g , _{i} g ) _{j} | 0.182 | 3.5 | 189 | 1099 | 40.576 |

d ( g , _{i} g ) _{j} | 0.0200 | 2.75 | 26 | 431 | 0.769 |

S_{GO} ( g , _{i} g ) _{j} | 0.117 | 3.75 | 115 | 959 | 31.591 |

Operons ( M = 745) | |||||

C_{ombined} ( g , _{i} g ) _{j} | 0.164 | 6.5 | 194 | 717 | 32.572 |

S_{N} ( g , _{i} g ) _{j} | 0.176 | 5 | 188 | 702 | 39.406 |

d ( g , _{i} g ) _{j} | 0.0147 | 3 | 28 | 221 | 0.502 |

S_{GO} ( g , _{i} g ) _{j} | 0.0708 | 3.75 | 115 | 959 | 13.868 |

The phylogenetic and neighborhood profiles are obtained by using the BDBH method.

**Table 5**

AHMD | α | N | | C | | Z -score | |
---|---|---|---|---|---|

Pathways ( M = 207) | |||||

C_{ombined} ( g , _{i} g ) _{j} | 0.248 | 7.25 | 191 | 700 | 66.694 |

S_{N} ( g , _{i} g ) _{j} | 0.212 | 3.5 | 173 | 920 | 52.832 |

d ( g , _{i} g ) _{j} | 0.0416 | 3.25 | 61 | 416 | 2.647 |

Regulons ( M = 132) | |||||

C_{ombined} ( g , _{i} g ) _{j} | 0.176 | 6 | 171 | 1006 | 45.653 |

S_{N} ( g , _{i} g ) _{j} | 0.170 | 3.25 | 165 | 1008 | 37.033 |

d ( g , _{i} g ) _{j} | 0.0317 | 3.25 | 61 | 416 | −1.406 |

Operons ( M = 745) | |||||

C_{ombined} ( g , _{i} g ) _{j} | 0.164 | 7.25 | 191 | 700 | 32.959 |

S_{N} ( g , _{i} g ) _{j} | 0.157 | 4.5 | 173 | 669 | 36.274 |

d ( g , _{i} g ) _{j} | 0.0246 | 3.25 | 61 | 416 | −0.988 |

AHMD | α | N | | C | | Z -score | |
---|---|---|---|---|---|

Pathways ( M = 207) | |||||

C_{ombined} ( g , _{i} g ) _{j} | 0.248 | 7.25 | 191 | 700 | 66.694 |

S_{N} ( g , _{i} g ) _{j} | 0.212 | 3.5 | 173 | 920 | 52.832 |

d ( g , _{i} g ) _{j} | 0.0416 | 3.25 | 61 | 416 | 2.647 |

Regulons ( M = 132) | |||||

C_{ombined} ( g , _{i} g ) _{j} | 0.176 | 6 | 171 | 1006 | 45.653 |

S_{N} ( g , _{i} g ) _{j} | 0.170 | 3.25 | 165 | 1008 | 37.033 |

d ( g , _{i} g ) _{j} | 0.0317 | 3.25 | 61 | 416 | −1.406 |

Operons ( M = 745) | |||||

C_{ombined} ( g , _{i} g ) _{j} | 0.164 | 7.25 | 191 | 700 | 32.959 |

S_{N} ( g , _{i} g ) _{j} | 0.157 | 4.5 | 173 | 669 | 36.274 |

d ( g , _{i} g ) _{j} | 0.0246 | 3.25 | 61 | 416 | −0.988 |

The neighborhood and phylogenetic profiles are obtained by using the reciprocal smallest distance algorithm ( 33 ).

**Figure 6**

**Figure 6**

**Figure 7**

**Figure 7**

**Figure 8**

**Figure 8**

**Figure 9**

**Figure 9**

**Figure 10**

**Figure 10**

**Figure 11**

**Figure 11**

**Figure 12**

**Figure 12**

This research was supported in part by National Science Foundation (#NSF/DBI-0354771, #NSF/ITR-IIS-0407204) and by the US Department of Energy's Genomes to Life program ( http://doegenomestolife.org/ ) under project ‘Carbon Sequestration in *Synechococcus* sp.: From Molecular Machines to Hierarchical Modeling’ ( www.genomes2life.org ). Funding to pay the Open Access publication charges for this article was provided by the University of Georgia.

*Conflict of interest statement* . None declared.

## REFERENCES

*cis*-regulatory motifs by comparative genomics

*cis*regulatory elements

*Synechococcus*sp. WH8102 genome

*Escherichia coli*genes and metabolism

## Comments