## Abstract

**Motivation:** Cluster analysis of gene expression profiles has been widely applied to clustering genes for gene function discovery. Many approaches have been proposed. The rationale is that the genes with the same biological function or involved in the same biological process are more likely to co-express, hence they are more likely to form a cluster with similar gene expression patterns. However, most existing methods, including model-based clustering, ignore known gene functions in clustering.

**Results:** To take advantage of accumulating gene functional annotations, we propose incorporating known gene functions as prior probabilities in model-based clustering. In contrast to a global mixture model applicable to all the genes in standard model-based clustering, we use a stratified mixture model: one stratum corresponds to the genes of unknown function, while each of the other strata corresponds to the genes sharing the same biological function or pathway; the genes from the same stratum are assumed to have the same prior probability of coming from a cluster, while those from different strata are allowed to have different prior probabilities of coming from the same cluster. We derive a simple EM algorithm that can be used to fit the stratified model. A simulation study and an application to gene function prediction demonstrate the advantage of our proposal over the standard method.

**Contact:** weip@biostat.umn.edu

## 1 INTRODUCTION

This article concerns clustering genes for gene function discovery using microarray gene expression data. It has been widely observed that genes with a similar function or involved in the same biological process are likely to co-express; hence, clustering genes’ expression profiles provides a means for gene function discovery; see, e.g. Eisen *et al*. (1998), Brown *et al*. (2000), Wu *et al*. (2002), Xiao and Pan (2005) and references therein. However, most existing approaches ignore the known functions of some genes in the process of clustering; the few exceptions, in the context of non-model-based clustering, include Hanisch *et al*. (2002), Cheng *et al*. (2004), Fang *et al*. (2006) and Huang and Pan (2006). For example, in model-based clustering, all the genes are treated equally a priori; in particular, all the genes are assumed to have an equal prior probability of being in a given cluster (e.g. Li and Hong, 2001; Ghosh and Chinnaiyan, 2002; Pan *et al*., 2002). As mentioned, if some genes are known to share the same function, they are more likely to belong to the same cluster. Hence, it seems more plausible to model the genes sharing the same biological function as having an equal prior probability while allowing the genes with different functions to have varying prior probabilities; this provides a more efficient way to account for the association between gene function and co-expression. In this paper, we propose such an approach that uses gene functional annotations as priors for model-based clustering. Specifically, first, the genome is partitioned into several groups, with one group containing the genes of unknown function and each of the other groups containing the genes sharing the same function. Gene functional annotations are readily available from many existing databases, such as the Gene Ontology (GO) (Ashburner *et al*., 2000) and MIPS (Mewes *et al*., 2004).
Second, each group is treated as a stratum and a stratified mixture model is used: the genes from the same group are assumed to have the same prior probability of coming from the same cluster, while the prior probabilities for different groups are allowed to be unequal. Because of possible heterogeneity within each gene functional group, we do not assume that the genes from the same functional group come from the same cluster; in fact, the genes in the group of unknown function may come from any cluster. Given the relatively high noise levels of genomic data, incorporating biological knowledge into statistical analysis is increasingly recognized as a reliable way to improve statistical efficiency and enhance the interpretability of the analysis results.

This article is organized as follows. In Section 2, we first briefly review the standard method of model-based clustering with a global mixture model, then propose our stratified mixture model and associated stratified clustering. We derive a simple EM algorithm to fit our stratified model. In Section 3, we demonstrate the advantage of our proposal using simulated data, and then using real data for gene function prediction. We end the paper with a short discussion.

## 2 METHODS

### 2.1 Standard model-based clustering

In model-based clustering, it is assumed that each observation *x*, a *p*-dimensional vector, is drawn from a finite mixture distribution

$$f(x; \Theta) = \sum_{i=1}^{g} \pi_i f_i(x; \theta_i), \tag{1}$$

where cluster *i* has prior probability π_{i}, component-specific distribution *f*_{i} and parameters θ_{i}. We use Θ = {(π_{i}, θ_{i}) : *i* = 1, … , *g*} to denote all unknown parameters, with the restrictions that 0 ≤ π_{i} ≤ 1 for any *i* and that $\sum_{i=1}^{g}\pi_i=1$. Each component of the mixture distribution corresponds to a cluster. The number of clusters, *g*, has to be determined in practice; see Section 2.3.

The finite normal mixture model is most widely used: each component *f*_{i} is a normal distribution; we use this model throughout this article. Because we are particularly interested in applications with ‘large *p*, small *n*’ (i.e. high-dimensional data with small sample sizes) encountered in genomic studies, we adopt a working independence model for the components of *x* as in the naive Bayes classifier, though other more sophisticated methods may be preferred (e.g. McLachlan *et al*., 2003). Specifically, we have

$$f_i(x; \theta_i) = \prod_{q=1}^{p} \frac{1}{\sqrt{2\pi\sigma_q^2}} \exp\left\{-\frac{(x_q-\mu_{iq})^2}{2\sigma_q^2}\right\},$$

where θ_{i} consists of the component means μ_{iq} and the variances $\sigma_q^2$, which are shared across clusters.

Given a dataset *x*_{j} for *j* = 1, … , *n*, the EM algorithm (Dempster *et al*., 1977) is used to estimate the parameters Θ iteratively in standard model-based clustering (McLachlan and Peel, 2002; Fraley and Raftery, 2002). Using the generic notation Θ^{(m)} for the parameter estimates at iteration *m*, the EM works by iterating the following updates:

$$\mu_{iq}^{(m+1)} = \frac{\sum_{j=1}^{n}\tau_{ij}^{(m)} x_{jq}}{\sum_{j=1}^{n}\tau_{ij}^{(m)}}, \qquad \sigma_q^{2,(m+1)} = \frac{1}{n}\sum_{i=1}^{g}\sum_{j=1}^{n}\tau_{ij}^{(m)}\left(x_{jq}-\mu_{iq}^{(m+1)}\right)^2, \tag{2}$$

$$\pi_i^{(m+1)} = \frac{1}{n}\sum_{j=1}^{n}\tau_{ij}^{(m)}, \tag{3}$$

where

$$\tau_{ij}^{(m)} = \frac{\pi_i^{(m)} f_i(x_j; \theta_i^{(m)})}{\sum_{l=1}^{g}\pi_l^{(m)} f_l(x_j; \theta_l^{(m)})} \tag{4}$$

is the posterior probability of *x*_{j}'s coming from component *i*.

The above iteration is repeated until convergence, resulting in the maximum likelihood estimate (MLE) $\hat\Theta$. Then we use (4) to calculate the posterior probability of any observation *x*_{j}'s belonging to any cluster *i*, and assign the observation to the cluster with the largest such probability. Because of the possible existence of multiple local maxima, the algorithm needs to be started multiple times with various starting values; in this paper, we use the result from *K*-means as starting values for the EM. We fit a series of models with various values of *g*, then use a model selection criterion to choose an appropriate value, as discussed in Section 2.3.
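To make the iteration concrete, here is a minimal pure-Python sketch of the EM updates for a univariate (*p* = 1) two-component normal mixture with a shared variance. The function names and the crude split-at-the-median initialization are ours for illustration, standing in for the *K*-means starting values used in the paper.

```python
import math
import random

def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_mixture(xs, n_iter=100):
    """EM for a univariate two-component normal mixture with a shared
    variance sigma^2 (the working-independence model with p = 1)."""
    # crude initialization: split the sorted data in half
    xs_sorted = sorted(xs)
    half = len(xs) // 2
    mu = [sum(xs_sorted[:half]) / half,
          sum(xs_sorted[half:]) / (len(xs) - half)]
    mean_all = sum(xs) / len(xs)
    var = sum((x - mean_all) ** 2 for x in xs) / len(xs)
    pi = [0.5, 0.5]
    n = len(xs)
    for _ in range(n_iter):
        # E-step: posterior probabilities tau_ij, as in Equation (4)
        tau = []
        for x in xs:
            w = [pi[i] * normal_pdf(x, mu[i], var) for i in range(2)]
            s = sum(w)
            tau.append([wi / s for wi in w])
        # M-step: update pi, the means, and the shared variance
        for i in range(2):
            t_i = sum(t[i] for t in tau)
            pi[i] = t_i / n
            mu[i] = sum(t[i] * x for t, x in zip(tau, xs)) / t_i
        var = sum(t[i] * (x - mu[i]) ** 2
                  for t, x in zip(tau, xs) for i in range(2)) / n
    return pi, mu, var

random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(200)] + \
       [random.gauss(5.0, 1.0) for _ in range(200)]
pi, mu, var = em_mixture(data)
```

With two well-separated clusters, the recovered means should be near 0 and 5 and the shared variance near 1.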

### 2.2 Stratified model-based clustering

In Section 2.1, all the genes are treated equally a priori. To take advantage of known gene functions, we propose first partitioning the genome into several groups, say *G*_{1}, … , *G*_{K}; *G*_{K} contains the genes of unknown function, while each of the other groups contains the genes sharing the same biological function. Second, rather than the global model (1), we propose using a stratified model: for any gene *j* in functional group *G*_{k},

$$f^{(k)}(x_j; \Theta) = \sum_{i=1}^{g} \pi_{(k),i} f_i(x_j; \theta_i), \tag{5}$$

for *k* = 1, … , *K*. Note that the *K* stratified models differ in using stratum-specific prior probabilities while sharing the same set of component distributions. Hence, we assume that genes from the same group *G*_{k} share the same prior probability π_{(k),i} of coming from cluster *i* while allowing them to come from different clusters. It is easy to see that the above stratified model reduces to the standard model when *K* = 1.
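As a sketch (with illustrative names, not code from the paper), the stratified density simply looks up stratum-specific mixing proportions while evaluating the same shared components:

```python
import math

def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def stratified_density(x, k, pi_strata, mus, var):
    """f^(k)(x) = sum_i pi_(k),i * f_i(x): stratum k supplies the prior
    probabilities, while the component parameters (mus, var) are shared."""
    return sum(p * normal_pdf(x, m, var)
               for p, m in zip(pi_strata[k], mus))

# two strata, two shared univariate components N(0, 1) and N(3, 1)
pi_strata = [[0.8, 0.2], [0.2, 0.8]]
f0 = stratified_density(0.0, 0, pi_strata, [0.0, 3.0], 1.0)
f1 = stratified_density(0.0, 1, pi_strata, [0.0, 3.0], 1.0)
```

At x = 0, stratum 0 (which weights the component centered at 0 more heavily) yields the larger density, as expected.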

In practice, *G*_{k} can be determined based on the GO (Ashburner *et al*., 2000), MIPS (Mewes *et al*., 2004) or other sources of biological knowledge; the choice of the database depends on the application at hand. For example, if the goal is to predict gene function while both the data quality and terminology of GO are preferred, GO can be used. Because some genes may have multiple functions, we allow a gene to be in multiple groups. A simple method considered here is that, for example, if a gene has two functions corresponding to groups *G*_{1} and *G*_{2}, we duplicate the expression profile for the gene and treat the two observations as two genes, one in *G*_{1} and one in *G*_{2}, in the process of clustering; see the final section for more discussion of this issue.
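A minimal sketch of this duplication scheme, assuming the annotations are given as a gene-to-groups mapping (the data structures and names here are hypothetical):

```python
def build_strata(annotations, genes):
    """Expand genes with multiple functions into one observation per
    function; genes with no annotation go to the unknown-function stratum.

    annotations: dict mapping gene -> list of functional-group labels.
    Returns (gene, stratum) pairs; in a real analysis the gene's whole
    expression profile would be duplicated alongside.
    """
    expanded = []
    for gene in genes:
        groups = annotations.get(gene, [])
        if not groups:
            expanded.append((gene, "unknown"))
        else:
            for grp in groups:  # duplicate the profile per function
                expanded.append((gene, grp))
    return expanded

obs = build_strata({"g1": ["G1"], "g2": ["G1", "G2"]}, ["g1", "g2", "g3"])
```

Here `g2`, annotated with two functions, contributes two observations, and the unannotated `g3` falls into the unknown-function stratum.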

Next we derive the EM algorithm for the above model-based clustering. Given data *X* = {*x*_{j} : *j* = 1, … , *n*}, the log-likelihood is

$$\log L(\Theta) = \sum_{j=1}^{n} \log\left\{\sum_{i=1}^{g} \pi_{(k),i} f_i(x_j; \theta_i)\right\},$$

where the stratum index *k* = *k*(*j*) depends on gene *j*; we use this convention throughout. Maximization of the above log-likelihood is difficult, and it is common to use the EM algorithm (Dempster *et al*., 1977) by casting the problem in the framework of missing data. Define *z*_{ij} as the indicator of whether *x*_{j} is from component *i*; i.e. *z*_{ij} = 1 if *x*_{j} is indeed from component *i* and *z*_{ij} = 0 otherwise. If we could observe the missing data *z*_{ij}'s, then the complete-data log-likelihood would be

$$\log L_c(\Theta) = \sum_{j=1}^{n} \sum_{i=1}^{g} z_{ij}\left\{\log \pi_{(k),i} + \log f_i(x_j; \theta_i)\right\}.$$

The E-step of the EM computes $Q(\Theta; \Theta^{(m)}) = E\left[\log L_c(\Theta) \mid X, \Theta^{(m)}\right]$, which amounts to replacing each *z*_{ij} by, for *j* ∈ *G*_{k},

$$\tau_{ij}^{(m)} = \frac{\pi_{(k),i}^{(m)} f_i(x_j; \theta_i^{(m)})}{\sum_{l=1}^{g} \pi_{(k),l}^{(m)} f_l(x_j; \theta_l^{(m)})}, \tag{6}$$

the posterior probability of *x*_{j}'s coming from component *i*.

The M-step of the EM maximizes the above *Q* to update the parameter estimates. For a stratified model, the updates for the mean and variance parameters are the same as in (2), but the update for the prior probability is different:

$$\pi_{(k),i}^{(m+1)} = \frac{1}{n_k} \sum_{j \in G_k} \tau_{ij}^{(m)},$$

where *n*_{k} is the number of genes in *G*_{k}. As before, the above updates are iterated until convergence, with resulting MLE $\hat\Theta$. We fit a series of models with various values of *g*, then use a model selection criterion to choose an appropriate value.

We use (6) to calculate the posterior probability of any observation *x*_{j}'s belonging to each cluster, and assign the observation to the cluster with the largest such probability.
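In code, the only change from the standard M-step is that the prior update averages the τ's within each stratum. A minimal sketch, assuming τ is stored as a genes-by-clusters list of posterior probabilities (variable names ours):

```python
def update_stratified_priors(tau, strata, K, g):
    """Stratified prior update: pi_(k),i = (1/n_k) * sum_{j in G_k} tau_ij.

    tau[j][i]  : posterior probability of gene j for cluster i.
    strata[j]  : functional group of gene j, an integer in {0, ..., K-1}.
    Returns a K x g list of stratum-specific prior probabilities.
    """
    pi = [[0.0] * g for _ in range(K)]
    n_k = [0] * K
    for t_j, k in zip(tau, strata):
        n_k[k] += 1
        for i in range(g):
            pi[k][i] += t_j[i]
    return [[v / n_k[k] for v in row] for k, row in enumerate(pi)]

tau = [[0.9, 0.1], [0.7, 0.3],   # stratum 0
       [0.2, 0.8], [0.4, 0.6]]   # stratum 1
pi = update_stratified_priors(tau, [0, 0, 1, 1], K=2, g=2)
```

Each row of the result sums to 1, since the τ's for each gene do.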

### 2.3 Model selection

In practice, we have to determine the number of components, *g*. This is realized by first fitting a series of models with various numbers of components, and then using a model selection criterion to choose the best one. In model-based clustering, it is common to use the Bayesian information criterion (BIC) (Schwarz, 1978), defined as

$$\mathrm{BIC} = -2\log L(\hat\Theta) + d\log(n),$$

where *d* is the total number of (effective) parameters (Fraley and Raftery, 1998). In the standard model, because we have three sets of parameters, the π_{i}'s, σ_{q}'s and μ_{iq}'s, and the constraint $\sum_{i=1}^{g}\pi_i=1$, we have *d* = *p* + *gp* + *g* − 1; in the stratified model with *K* strata, instead of one set of π_{i}'s we have *K* sets of π_{(k),i}'s, thus *d* = *p* + *gp* + *K*(*g* − 1).
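The parameter counts above translate directly into code; a small sketch, assuming the shared-variance working-independence model so that *d* = *p* + *gp* + *K*(*g* − 1) (function names ours):

```python
import math

def n_params(p, g, K=1):
    """Effective number of parameters d: p shared variances, g*p means,
    and K sets of g-1 free mixing proportions (K=1 is the standard model)."""
    return p + g * p + K * (g - 1)

def bic(loglik, p, g, n, K=1):
    """BIC = -2 log L + d log(n); smaller is better under this convention."""
    return -2.0 * loglik + n_params(p, g, K) * math.log(n)
```

For example, with *p* = 300 and *g* = 8, the standard model has *d* = 2707, while a stratified model with *K* = 3 strata has *d* = 2721; the 14 extra parameters are the additional prior probabilities.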

## 3 RESULTS

### 3.1 Simulated data

We carried out a simulation study to demonstrate the improved performance of our proposal. There were two univariate clusters with distributions *N*(μ_{1}, σ_{1}) and *N*(μ_{2}, σ_{2}), respectively. We had two gene functional groups with *n*_{1} and *n*_{2} genes, respectively. For a gene in functional group *k*, *k* = 1 or 2, its prior probability of being in cluster 1 was π_{(k),1}. Four simulation set-ups with various parameter values were considered: three sets of π_{(k),1}'s were used; *n*_{1} = *n*_{2} = 50 in the first three set-ups, while *n*_{1} = 25 and *n*_{2} = 75 in set-up 4; set-ups 2 and 4 were the same except for the different *n*_{1} and *n*_{2}.
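For concreteness, one stratum of a simulated dataset can be generated as follows (a sketch; we read the listed σ's as standard deviations, which the text leaves ambiguous, and the function name is ours):

```python
import random

def simulate_stratum(n_genes, pi_k1, means=(1.0, 2.0), sds=(1.0, 2.0),
                     rng=random):
    """Draw one functional group: each gene joins cluster 1 with
    probability pi_k1 (otherwise cluster 2), then its expression value
    is drawn from that cluster's normal distribution."""
    data = []
    for _ in range(n_genes):
        c = 0 if rng.random() < pi_k1 else 1
        data.append((c, rng.gauss(means[c], sds[c])))
    return data

random.seed(7)
# set-up 2: two strata of 50 genes, pi_(1),1 = 0.8 and pi_(2),1 = 0.2
stratum1 = simulate_stratum(50, 0.8)
stratum2 = simulate_stratum(50, 0.2)
```

The two strata share the same cluster parameters and differ only in the prior probabilities, mirroring the stratified model.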

The results were based on 1000 simulations for each set-up. We considered both the mean and variance of a parameter estimate over 1000 simulations; we also gave the average numbers of observations in a simulated dataset that were incorrectly assigned to clusters different from their true ones.

In the first simulation set-up, we had π_{(1),1} = π_{(2),1}; hence the standard mixture model was correct, and the standard method was expected to perform better. Although the stratified mixture model was still correct, it unnecessarily required estimating two separate parameters π_{(1),1} and π_{(2),1}. It was confirmed that the standard method worked better; however, more interestingly, its performance was quite close to that of our proposed method: the estimates of the mean and variance parameters from the stratified method were almost the same as those from the standard method, while, as expected, the estimate of the prior probability from the former had slightly larger variability.

In set-ups 2–4, the stratified mixture model was correct with π_{(1),1} ≠ π_{(2),1}. Hence, with knowledge of the gene functional groups, it was better to use the stratified model. On the other hand, ignoring the gene functional groups, the global mixture model still held, with a corresponding π_{1} = 0.5 or π_{1} = 0.35. Unsurprisingly, the standard method gave estimates with larger estimation errors, e.g. in terms of mean squared error, which is the sum of the squared bias and the variance; this was especially evident for the mean and variance parameters of the second cluster, both with much larger variances than those from the stratified model. Consequently, the standard method resulted in much larger numbers of misclassified genes. Comparing set-ups 2 and 3, the performance difference between the two methods increased with $|\pi_{(1),1}-\pi_{(2),1}|$. With different numbers of genes in the two functional groups (set-up 4), the stratified method still maintained better performance (Table 1).

**Table 1**

| Set-up | Method | μ_{1} | μ_{2} | σ_{1} | σ_{2} | π_{1} | π_{(1),1} | π_{(2),1} | #Err1 | #Err2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | True | 1 | 2 | 1 | 2 | 0.80 | 0.80 | 0.80 | 0 | 0 |
| | Standard | 0.91 (0.48) | 2.86 (1.52) | 1.01 (0.30) | 1.39 (0.69) | 0.764 (0.213) | — | — | 11.3 (8.6) | 11.4 (8.5) |
| | New | 0.91 (0.48) | 2.80 (1.54) | 1.01 (0.30) | 1.39 (0.69) | — | 0.748 (0.243) | 0.749 (0.240) | 12.1 (9.5) | 12.1 (9.5) |
| 2 | True | 1 | 2 | 1 | 2 | 0.50 | 0.80 | 0.20 | 0 | 0 |
| | Standard | 1.01 (0.30) | 2.66 (1.32) | 0.99 (0.29) | 1.74 (0.55) | 0.588 (0.224) | — | — | 13.6 (8.4) | 23.0 (8.2) |
| | New | 1.00 (0.25) | 2.23 (0.78) | 0.98 (0.21) | 1.92 (0.37) | — | 0.822 (0.152) | 0.246 (0.223) | 9.0 (2.8) | 14.3 (6.8) |
| 3 | True | 1 | 2 | 1 | 2 | 0.50 | 1 | 0 | 0 | 0 |
| | Standard | 0.95 (0.52) | 2.68 (1.31) | 1.01 (0.36) | 1.72 (0.54) | 0.575 (0.227) | — | — | 11.6 (14.0) | 25.4 (12.7) |
| | New | 0.98 (0.15) | 2.17 (0.47) | 0.97 (0.11) | 1.99 (0.27) | — | 0.986 (0.036) | 0.101 (0.139) | 0.3 (0.9) | 2.6 (7.7) |
| 4 | True | 1 | 2 | 1 | 2 | 0.35 | 0.80 | 0.20 | 0 | 0 |
| | Standard | 0.99 (0.53) | 2.43 (1.01) | 1.00 (0.43) | 1.81 (0.46) | 0.445 (0.228) | — | — | 10.1 (5.8) | 29.8 (12.9) |
| | New | 1.00 (0.30) | 2.18 (0.66) | 0.98 (0.30) | 1.92 (0.34) | — | 0.829 (0.191) | 0.253 (0.219) | 5.0 (2.6) | 21.4 (10.7) |


The means and standard deviations (in parentheses) of parameter estimates, and #Err1 and #Err2 for the numbers of misclassifications for clusters 1 and 2 respectively, are presented.

### 3.2 Gene function prediction using gene expression profiles

With the completion of the human genome and other sequencing projects, it has become compelling to learn functions of many newly discovered genes. An important approach is to cluster gene expression profiles drawn from microarray experiments under various conditions; see, e.g. Eisen *et al*. (1998), Brown *et al*. (2000), Wu *et al*. (2001), Zhou *et al*. (2002), Xiao and Pan (2005) and references therein. The rationale is that genes sharing the same biological function are likely to co-express.

We considered clustering gene expression data to predict gene functions for the yeast *Saccharomyces cerevisiae*. We used a large dataset containing 300 microarray experiments with gene deletions and drug treatments (Hughes *et al*., 2000). The data were centered and scaled so that the mean and variance of the expression profile for each gene were 0 and 1, respectively. Gene functions were downloaded from the MIPS database (Mewes *et al*., 2004). For illustration, we only considered two gene functions, mitotic cell cycle and cell cycle control (MIPS code 030301) and mitochondrion (MIPS code 4016), shortened as classes 1 and 2, respectively. There were three strata: the first two strata each contained 200 genes randomly selected from one of the two classes, while the third stratum was a mixture of 119 and 112 genes from the two classes, respectively. Hence, the genes in the first two strata were treated as those with known functions, whereas the third stratum consisted of the genes whose functions were to be predicted.
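The per-gene standardization described above (each gene's profile rescaled to mean 0 and variance 1 across the experiments) is a one-liner per gene; a minimal sketch:

```python
import math

def standardize(profile):
    """Center and scale one gene's expression profile so that its
    mean is 0 and its variance is 1."""
    n = len(profile)
    mean = sum(profile) / n
    var = sum((v - mean) ** 2 for v in profile) / n
    sd = math.sqrt(var)
    return [(v - mean) / sd for v in profile]

z = standardize([1.0, 2.0, 3.0, 4.0])
```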

We did not use the class labels of the genes in the process of clustering. Based on the clustering results, we predicted the functions of the genes in the third stratum. There were two ways to accomplish prediction, called hard classification and soft classification. In hard classification, each gene was assigned to a cluster, and a cluster was classified to the class to which the majority of the genes from the first two strata assigned to that cluster belonged; if there was an equal number of genes from each of the first two strata, we randomly assigned a class label to the cluster. Any gene from stratum 3 that was assigned to a cluster was predicted to be in the same class as that cluster. When there were an equal or almost equal number of genes from the first two strata, there was a certain degree of randomness in hard classification. Furthermore, it did not take advantage of the soft-clustering feature of model-based clustering: even if two genes were both assigned to the same cluster, they might have quite different posterior probabilities of being in that cluster. Hence, as an alternative, we used soft classification: first, for each cluster *i*, we calculated the proportion of the genes in class *c*, $P_i^c$, for *c* = 1 and 2; second, for any gene *j* in stratum 3, the expectation of its being in class *c* is $\sum_{i=1}^{g} P_i^c \tau_{ij}$. Summing over all the genes from class *c* (or the other class), we obtained an expected number of the genes predicted to be in class *c*. To illustrate the effect of various prior probabilities, we also included results using the prior probabilities of the first two strata to calculate τ_{ij} for each *j* ∈ *G*_{3}; in the tables (Tables 3, 4, 8 and 9), these results are denoted as ‘New: $\hat\pi^{(k)}$’ for *k* = 1, 2, 3.
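The soft-classification computation can be written out in a few lines; `P` and `tau` below are illustrative toy values, not the paper's estimates:

```python
def expected_class_counts(P, tau):
    """Soft classification: P[i][c] is the proportion of class-c genes
    among the known-function genes assigned to cluster i; tau[j][i] is
    the posterior probability of gene j belonging to cluster i.
    Returns the expected number of genes predicted in each class."""
    n_classes = len(P[0])
    totals = [0.0] * n_classes
    for t_j in tau:
        for c in range(n_classes):
            totals[c] += sum(P[i][c] * t_j[i] for i in range(len(P)))
    return totals

# two clusters: cluster 0 is mostly class 1, cluster 1 mostly class 2
P = [[0.9, 0.1], [0.2, 0.8]]
tau = [[0.8, 0.2], [0.3, 0.7]]  # posteriors for two stratum-3 genes
counts = expected_class_counts(P, tau)
```

Because each row of `P` sums to 1, the expected counts over all classes add up to the number of genes.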

**Table 2**

| g | Standard BIC | New BIC |
|---|---|---|
| 1 | 162479 | 162479 |
| 2 | 150773 | 150786 |
| 3 | 144384 | 144388 |
| 4 | 139320 | 139230 |
| 5 | 136185 | 136103 |
| 6 | 134219 | 134137 |
| 7 | 133502 | 133426 |
| 8 | *132840* | *132785* |
| 9 | 132892 | 132826 |
| 10 | 133112 | 133092 |


Italics give the minima

**Table 3**

| Truth | Standard: Pred = 1 | Standard: Pred = 2 | New $\hat\pi^{(1)}$: Pred = 1 | New $\hat\pi^{(1)}$: Pred = 2 | New $\hat\pi^{(2)}$: Pred = 1 | New $\hat\pi^{(2)}$: Pred = 2 | New $\hat\pi^{(3)}$: Pred = 1 | New $\hat\pi^{(3)}$: Pred = 2 |
|---|---|---|---|---|---|---|---|---|
| 1 | 88 | 31 | 88 | 31 | 88 | 31 | 88 | 31 |
| 2 | 30 | 82 | 31 | 81 | 30 | 82 | 30 | 82 |
| Accuracy | 0.736 | | 0.732 | | 0.736 | | 0.736 | |


As a comparison, we also treated the problem as supervised learning: the 200 genes in each of the first two strata and their class labels were training data, while the genes in the third stratum were test data. We applied classic linear discriminant analysis (LDA) and three newer classifiers regarded as among the best: the nearest shrunken centroids (NSC) (Tibshirani *et al*., 2003), random forests (RF) (Breiman, 2001) and support vector machines (SVM) (Vapnik, 1998). Our purpose here was not to directly compare clustering analyses with these classifiers; rather, these classifiers provided some estimates of the upper bound of the predictive performance of clustering for this particular application. We used the default settings of the corresponding R functions.

#### 3.2.1 Using 300 microarray experiments

First, we considered using the full dataset with all 300 microarray experiments included. Table 2 gives the BIC values for the two methods with various numbers of clusters, based on which *g* = 8, minimizing BIC, was selected for both methods. The predictive results based on hard classification for the genes in stratum 3 are given in Table 3, showing that the two methods gave almost the same result. As mentioned earlier, the result of ‘New: $\hat\pi^{(k)}$’ was based on using the prior probability estimate $\hat\pi^{(k)}$ to calculate the posterior probabilities and thus make predictions for the genes in stratum 3; in practice, we would simply use ‘New: $\hat\pi^{(3)}$’ for stratum 3. Furthermore, based on soft classification (Table 4), the results from the two methods were still almost the same. Tables 5 and 6 give the estimated prior probabilities and the predictive results of the five classifiers, respectively. For NSC, cross-validation selected 85 microarray experiments for the constructed classifier; as a comparison, we also included the result using all 300 microarray experiments. All the supervised methods except LDA worked well; the inferior performance of LDA was likely due to the problematic estimation of a large (i.e. 300 × 300) covariance matrix with only a few hundred observations.

**Table 4**

| Truth | Standard: Pred = 1 | Standard: Pred = 2 | New $\hat\pi^{(1)}$: Pred = 1 | New $\hat\pi^{(1)}$: Pred = 2 | New $\hat\pi^{(2)}$: Pred = 1 | New $\hat\pi^{(2)}$: Pred = 2 | New $\hat\pi^{(3)}$: Pred = 1 | New $\hat\pi^{(3)}$: Pred = 2 |
|---|---|---|---|---|---|---|---|---|
| 1 | 74.5 | 44.5 | 74.5 | 44.5 | 74.5 | 44.5 | 74.5 | 44.5 |
| 2 | 36.2 | 75.8 | 36.7 | 75.3 | 36.0 | 76.0 | 36.1 | 75.9 |
| Accuracy | 0.651 | | 0.649 | | 0.652 | | 0.651 | |


**Table 5**

| | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 | Cluster 7 | Cluster 8 |
|---|---|---|---|---|---|---|---|---|
| $\hat\pi$ | 0.135 | 0.154 | 0.114 | 0.105 | 0.084 | 0.215 | 0.117 | 0.077 |
| $\hat\pi^{(1)}$ | 0.110 | 0.015 | 0.200 | 0.090 | 0.130 | 0.280 | 0.165 | 0.010 |
| $\hat\pi^{(2)}$ | 0.149 | 0.245 | 0.045 | 0.130 | 0.040 | 0.176 | 0.050 | 0.165 |
| $\hat\pi^{(3)}$ | 0.143 | 0.195 | 0.100 | 0.091 | 0.082 | 0.199 | 0.130 | 0.061 |


**Table 6**

| Truth | NSC (p = 85): Pred = 1 | NSC (p = 85): Pred = 2 | NSC (p = 300): Pred = 1 | NSC (p = 300): Pred = 2 | LDA: Pred = 1 | LDA: Pred = 2 | RF: Pred = 1 | RF: Pred = 2 | SVM: Pred = 1 | SVM: Pred = 2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 100 | 19 | 97 | 22 | 70 | 49 | 100 | 19 | 99 | 20 |
| 2 | 21 | 91 | 23 | 89 | 30 | 82 | 21 | 91 | 19 | 93 |
| Accuracy | 0.827 | | 0.805 | | 0.658 | | 0.827 | | 0.831 | |


**Table 7**

| g | Standard BIC | New BIC |
|---|---|---|
| 1 | 5416 | 5416 |
| 2 | 5172 | 5115 |
| 3 | 4989 | 4913 |
| 4 | 4977 | 4912 |
| 5 | *4888* | *4830* |
| 6 | 4892 | 4860 |
| 7 | 4901 | 4871 |


Italics give the minima

In summary, using the gene expression data with all 300 microarray experiments, the standard clustering and our proposed method had similar performance. The close performance of the two methods was presumably because there was enough information in the data for predicting the two functional classes; with fewer microarray experiments or more functional classes, differences might appear. On the other hand, as in the null case (i.e. set-up 1) of the simulation, this showed that, although our proposed method did not improve on the standard method here, it did not incur any significant loss of efficiency either.

#### 3.2.2 Using 10 microarray experiments

We suspected that, because of information redundancy in the gene expression data for predicting the two gene functional classes, the influence of the prior probabilities was negligible; e.g. the cluster centers were all far away from each other, dominating the resulting posterior probability calculations and thus the final clustering results. Hence, to increase the difficulty, or the noise level, of the problem, we next used only the first 10 microarray experiments.

Based on BIC, we chose *g* = 5 clusters for both methods (Table 7). Using hard classification (Table 8), the overall accuracy rates from the two approaches were close, though the new method had a slight edge over the standard method. It seemed that using the estimated prior probabilities for stratum 1 or 2 improved the predictive accuracy rate for the corresponding class at the expense of a lower accuracy rate for the other class. This trend was more evident with soft classification (Table 9); more importantly, when using the estimated prior probability from stratum 3 as prescribed, our proposed method gave a higher overall accuracy rate than that of the standard method.

**Table 8**

| Truth | Standard: Pred = 1 | Standard: Pred = 2 | New $\hat\pi^{(1)}$: Pred = 1 | New $\hat\pi^{(1)}$: Pred = 2 | New $\hat\pi^{(2)}$: Pred = 1 | New $\hat\pi^{(2)}$: Pred = 2 | New $\hat\pi^{(3)}$: Pred = 1 | New $\hat\pi^{(3)}$: Pred = 2 |
|---|---|---|---|---|---|---|---|---|
| 1 | 101 | 18 | 97 | 22 | 81 | 38 | 92 | 27 |
| 2 | 53 | 59 | 47 | 65 | 26 | 86 | 35 | 77 |
| Accuracy | 0.693 | | 0.701 | | 0.723 | | 0.732 | |


**Table 9**

| Truth | Standard: Pred = 1 | Standard: Pred = 2 | New $\hat\pi^{(1)}$: Pred = 1 | New $\hat\pi^{(1)}$: Pred = 2 | New $\hat\pi^{(2)}$: Pred = 1 | New $\hat\pi^{(2)}$: Pred = 2 | New $\hat\pi^{(3)}$: Pred = 1 | New $\hat\pi^{(3)}$: Pred = 2 |
|---|---|---|---|---|---|---|---|---|
| 1 | 66.8 | 51.2 | 78.5 | 39.5 | 67.6 | 50.4 | 74.8 | 43.2 |
| 2 | 43.0 | 68.0 | 47.5 | 63.5 | 32.9 | 78.1 | 38.5 | 72.5 |
| Accuracy | 0.584 | | 0.615 | | 0.631 | | 0.638 | |


Table 10 gives the prior probability estimates for the two methods. Most of these estimates were close to each other; in particular, $\hat\pi$ from the standard method was in general close to $\hat\pi^{(3)}$ from the new method, perhaps because stratum 3 had proportions of the genes from the two classes similar to those of the total dataset. However, some estimates were different; e.g. for cluster 2, the four prior estimates were 0.461, 0.535, 0.134 and 0.344, respectively. The overall closeness, along with some individual differences, among the estimated prior probabilities offered an explanation of why the clustering results from the two methods were close, yet nevertheless different.

**Table 10**

| | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 |
|---|---|---|---|---|---|
| $\hat\pi$ | 0.219 | 0.461 | 0.206 | 0.111 | 0.003 |
| $\hat\pi^{(1)}$ | 0.272 | 0.535 | 0.023 | 0.169 | 0.000 |
| $\hat\pi^{(2)}$ | 0.140 | 0.134 | 0.329 | 0.397 | 0.000 |
| $\hat\pi^{(3)}$ | 0.203 | 0.344 | 0.244 | 0.200 | 0.009 |


The predictive performances of the four classifiers are summarized in Table 11. Comparing Tables 11 and 8, we found that the two clustering methods, especially our proposed new one, worked well, with results quite close to those of the supervised learning methods, highlighting the promise of using clustering analysis for gene function prediction. Note that clustering analysis has several other advantages over supervised learning. First, clustering methods allow the possibility of discovering new classes while predicting for existing classes (Golub *et al*., 1999). Second, rather than using a hierarchical ontology to describe gene functions, gene associations or linkages have been argued to be an alternative (Fraser and Marcotte, 2004), for which clustering methods are more naturally applicable than supervised learning.

**Table 11**

| Truth | NSC (p = 9): Pred = 1 | NSC (p = 9): Pred = 2 | NSC (p = 10): Pred = 1 | NSC (p = 10): Pred = 2 | LDA: Pred = 1 | LDA: Pred = 2 | RF: Pred = 1 | RF: Pred = 2 | SVM: Pred = 1 | SVM: Pred = 2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 91 | 28 | 91 | 28 | 83 | 36 | 89 | 30 | 85 | 34 |
| 2 | 31 | 81 | 32 | 80 | 33 | 79 | 28 | 84 | 30 | 82 |
| Accuracy | 0.745 | | 0.740 | | 0.701 | | 0.749 | | 0.723 | |


## 4 DISCUSSION

Given the relatively high noise levels in genomic data, the importance of incorporating biological knowledge into statistical analysis has been increasingly recognized. However, current practice in this direction is mainly restricted to using biological knowledge as an evaluation criterion to validate the analysis results after the analysis is done. For example, many systems have been built to assess the statistical enrichment of a list of user-supplied genes in GO categories, as evidenced by two recent reviews by Handl *et al*. (2005) and Khatri and Draghici (2005); see references therein. There are a few exceptions: e.g. Mootha *et al*. (2003), Al-Shahrour *et al*. (2005) and Pan (2005) proposed different stratified/subgroup analysis approaches to detect differential gene expression, all based on the idea of using biological information to group genes into strata; Lottaz and Spang (2005) considered incorporating GO categories into sample or tumor classification, though their main motivation was to enhance the interpretability of results, which is also an important consideration in genomic analyses. Here we extend the idea to clustering genes and demonstrate its improved performance using both simulated and real data.

In practice, one has to determine which gene functional groups to use. As in any serious statistical modeling, careful thought and some trade-offs are needed; in particular, two factors must be balanced. It is desirable to have each group *G _{k}* as homogeneous as possible, but such a group may contain only a few genes; moreover, using smaller groups leads to a larger number of groups, hence more parameters (i.e. the π_{(k)}'s) to be estimated from smaller sample sizes. Here we suggest three approaches. The first is both simple and practical: to avoid functional groups that are either too big or too small, we specify an acceptable range for the number of genes a functional group may contain, plus possibly some other constraints, and thus determine the eligible functional groups. A specific example is the use of ‘informative’ GO categories (Zhou *et al*., 2002): an informative GO category is one satisfying two conditions, (1) it contains at least γ genes and (2) it has no child category containing at least γ genes, where γ is a threshold typically in the range 20–40. For example, with γ = 30, 73 informative MIPS categories were found (Xiao and Pan, 2005), which can be used as priors in our method. Second, among several candidate sets of functional groups, model-selection criteria, such as BIC, can be used to choose the one estimated to give the best predictive performance; this is an advantage of model-based clustering. Third, following the line of Huang *et al*. (2006), a weighted method can be applied to combine the results of using two sets of functional groups, or, more generally, to take account of the hierarchical structure of a gene annotation system. More studies on genomic scales are needed.
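The selection rule for informative categories can be sketched in a few lines (a hypothetical illustration; `counts` and `children` stand for the gene counts and parent-to-child links of an annotation hierarchy such as GO or MIPS, with toy category names):

```python
# Hypothetical sketch of the 'informative' category rule of Zhou et al. (2002).
# counts: category -> number of genes annotated to it (including descendants);
# children: category -> its direct child categories in the ontology.
def informative_categories(counts, children, gamma=30):
    """A category is informative if it has >= gamma genes and no child does."""
    return sorted(
        cat for cat, n in counts.items()
        if n >= gamma
        and all(counts.get(child, 0) < gamma for child in children.get(cat, ()))
    )
```

Applied to a full annotation hierarchy with γ = 30, such a rule would yield a set of non-nested, moderately sized categories usable as strata in our method.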

In our example, to deal with genes with multiple functions, we simply replicated their expression profiles, once for each of their functional groups. This parallels the usual treatment when the expression profile of a gene with multiple functions is used as a test observation. Nevertheless, a downside was that, owing to the correlations thus introduced among the replicated expression profiles of the same genes, the resulting calculation of BIC was not exact. On the other hand, with only a small number of such genes, as in our example, we suspect that the influence was limited. Furthermore, to avoid any potential problem, a simple alternative is to assign a gene randomly to only one of its known functional groups in the training data; this only affects the prior specification, by ignoring the multiple functions of some genes, but not the validity of the final results.
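The replication just described amounts to duplicating a multi-function gene's profile once per annotated group; a minimal sketch with hypothetical gene names and groups:

```python
# Hypothetical expression profiles and functional annotations.
profiles = {"geneA": [0.2, -1.1, 0.5], "geneB": [1.3, 0.4, -0.2]}
groups = {"geneA": ["transport", "signaling"], "geneB": ["transport"]}

# A gene annotated to m functional groups contributes m copies of its profile,
# one per (gene, group) pair, to the training data.
training = [
    (gene, group, profiles[gene])
    for gene, anns in groups.items()
    for group in anns
]
```

The duplicated copies of the same gene are perfectly correlated, which is exactly why the BIC computed on such data is only approximate.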

In this article, we have only considered clustering genes, but the idea can be equally applied to model-based clustering of samples for tumor classification (e.g. Yeung *et al*., 2001; Ghosh and Chinnaiyan, 2002; McLachlan *et al*., 2002) if the samples are known a priori to belong to different groups; these groups can be formed based on some diagnostic or prognostic factors. Our proposed method also bears some similarity to clustering partially classified data (McLachlan and Peel, 2000, Section 2.19); see Qu and Xu (2004) and Alexandridis *et al*. (2004) for two applications of the latter to gene expression data. However, there is a key difference: in partially classified data, all the genes known to have the same class label are assumed to come from the same cluster, whereas we allow genes sharing the same class label to come from different clusters; we only assume that they share the same prior probability of coming from any given cluster. In other words, in partially classified data, a class label is assumed to coincide with a cluster membership, whereas we do not impose such a restriction. Considering the heterogeneity of various gene functional categories, our modeling assumption is not only weaker but also more reasonable in the current context.
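To make the contrast concrete, here is a minimal numpy sketch of the stratified model's E-step and the stratum-wise prior update (our own illustration, assuming spherical Gaussian clusters with a common known variance; the function and variable names are ours, not the paper's): genes in stratum k share the prior vector π^{(k)}, yet each gene receives its own posterior cluster probabilities, so genes with the same function label can still land in different clusters.

```python
import numpy as np

def e_step(X, strata, pis, mus, sigma2):
    # X: (n, p) expression profiles; strata: (n,) stratum index of each gene;
    # pis: (K, g) stratum-specific cluster priors; mus: (g, p) cluster means.
    # Returns tau: (n, g) posterior cluster probabilities per gene.
    n, p = X.shape
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)    # squared distances
    logphi = -0.5 * d2 / sigma2 - 0.5 * p * np.log(2 * np.pi * sigma2)
    logw = np.log(pis[strata]) + logphi                      # gene-specific prior via its stratum
    logw -= logw.max(axis=1, keepdims=True)                  # numerical stabilization
    w = np.exp(logw)
    return w / w.sum(axis=1, keepdims=True)

def update_priors(tau, strata, K):
    # M-step for pi^{(k)}: average the posteriors within each stratum.
    return np.vstack([tau[strata == k].mean(axis=0) for k in range(K)])
```

In clustering partially classified data, by contrast, `tau` for a labeled gene would be fixed at 0/1; here the labels enter only through `pis[strata]`.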

Many existing clustering methods are not model-based (e.g. Tamayo *et al*., 1999; Tseng and Wong, 2005). An advantage of model-based clustering is its explicit statistical modeling, which, for instance, facilitates incorporating subject-matter knowledge as priors into the analysis, as we have shown here. In contrast, in non-model-based clustering, biological knowledge has been incorporated directly into a distance metric (Hanisch *et al*., 2002; Cheng *et al*., 2004; Huang and Pan, 2006). The advantage of the former is its robustness: when the prior specification is incorrect, the final results may still be valid as long as there is enough information in the data (e.g. Carlin and Louis, 2000), while the latter may give completely wrong results. This also partially explains the results in our example: with the data containing 300 microarray experiments, the new method gave results similar to those of the standard method, presumably because there was enough information in the data and incorporating biological knowledge as a prior made little difference; in contrast, with the data consisting of only 10 microarray experiments, the new method was more efficient than the standard method by incorporating biological knowledge, and thus gave better results. We are currently further evaluating the difference between the two approaches to incorporating biological knowledge in clustering. In addition, our proposed method is an empirical Bayes approach; conceptually it seems possible to generalize our idea to fully Bayesian model-based clustering (Richardson and Green, 1997; Broet *et al*., 2002; Medvedovic and Sivaganesan, 2002; Fraley and Raftery, 2005), though computational challenges remain. The methodology may also be extended to other, more elaborate modeling contexts, such as the analysis of time-course microarray data (e.g. Ramoni *et al*., 2002; Luan and Li, 2003). These are interesting topics to be studied in the future.

The author thanks Guanghua Xiao and Peng Wei for helpful discussions and assistance with the gene expression data. The author is grateful to the reviewers for their constructive comments. This work was supported by NIH grant HL65462 and a UM AHC Development grant.

*Conflict of Interest:* none declared.

