## Abstract

Motivation: Cluster analysis of gene expression profiles has been widely applied to clustering genes for gene function discovery. Many approaches have been proposed. The rationale is that the genes with the same biological function or involved in the same biological process are more likely to co-express, hence they are more likely to form a cluster with similar gene expression patterns. However, most existing methods, including model-based clustering, ignore known gene functions in clustering.

Results: To take advantage of accumulating gene functional annotations, we propose incorporating known gene functions as prior probabilities in model-based clustering. In contrast to the global mixture model applicable to all the genes in standard model-based clustering, we use a stratified mixture model: one stratum corresponds to the genes of unknown function, while each of the others corresponds to the genes sharing the same biological function or pathway; the genes from the same stratum are assumed to have the same prior probability of coming from a cluster, while those from different strata are allowed to have different prior probabilities of coming from the same cluster. We derive a simple EM algorithm that can be used to fit the stratified model. A simulation study and an application to gene function prediction demonstrate the advantage of our proposal over the standard method.

Contact: weip@biostat.umn.edu

## 1 INTRODUCTION

This article concerns clustering genes for gene function discovery using microarray gene expression data. It has been widely observed that genes with a similar function or involved in the same biological process are likely to co-express; hence clustering genes’ expression profiles provides a means for gene function discovery; see, e.g. Eisen et al. (1998), Brown et al. (2000), Wu et al. (2002), Xiao and Pan (2005) and references therein. However, most existing approaches ignore the known functions of some genes in the process of clustering; the few exceptions, in the context of non-model-based clustering, include Hanisch et al. (2002), Cheng et al. (2004), Fang et al. (2006) and Huang and Pan (2006). For example, in model-based clustering, all the genes are treated equally a priori; in particular, all the genes are assumed to have an equal prior probability of being in a given cluster (e.g. Li and Hong, 2001; Ghosh and Chinnaiyan, 2002; Pan et al., 2002). As mentioned, if some genes are known to share the same function, they are more likely to belong to the same cluster. Hence, it seems more plausible to model the genes sharing the same biological function as having an equal prior probability while allowing the genes with different functions to have varying prior probabilities. This provides a more efficient way to account for the association between gene function and co-expression. In this article, we propose such an approach, which uses gene functional annotations as priors for model-based clustering. Specifically, first, the genome is partitioned into several groups, with one group containing the genes of unknown function and each of the other groups containing the genes sharing the same function. Gene functional annotations are readily available from many existing databases, such as the Gene Ontology (GO) (Ashburner et al., 2000) and MIPS (Mewes et al., 2004).
Second, each group is treated as a stratum and a stratified mixture model is used: the genes from the same group are assumed to have the same prior probability of coming from the same cluster, while the prior probabilities for different groups are allowed to be unequal. Because of possible heterogeneity within each gene functional group, we do not assume that the genes from the same functional group come from the same cluster. In fact, the genes in the group of unknown function may come from any cluster. With the relatively high noise levels of genomic data, it is increasingly recognized that incorporating biological knowledge into statistical analysis is a reliable way to improve statistical efficiency and enhance the interpretability of the analysis results.

This article is organized as follows. In Section 2, we first briefly review the standard method of model-based clustering with a global mixture model, then propose our stratified mixture model and associated stratified clustering. We derive a simple EM algorithm to fit our stratified model. In Section 3, we demonstrate the advantage of our proposal using simulated data, and then using real data for gene function prediction. We end the paper with a short discussion.

## 2 METHODS

### 2.1 Standard model-based clustering

In model-based clustering, it is assumed that each observation x, a p-dimensional vector, is drawn from a finite mixture distribution

(1)
$f(x;\Theta)=\sum_{i=1}^{g}\pi_i f_i(x;\theta_i),$
with prior probabilities $\pi_i$, component-specific distributions $f_i$ and their parameters $\theta_i$. We use $\Theta = \{(\pi_i, \theta_i) : i = 1, \ldots, g\}$ to denote all unknown parameters, with the restrictions that $0 \le \pi_i \le 1$ for any $i$ and that $\sum_{i=1}^{g}\pi_i=1$. Each component of the mixture distribution corresponds to a cluster. The number of clusters, $g$, has to be determined in practice; see Section 2.3.

The finite normal mixture model is most widely used: each component $f_i$ is a normal distribution; we use this model throughout this article. Because we are particularly interested in applications with ‘large p, small n’ (i.e. high-dimensional data with small sample sizes) encountered in genomic studies, we adopt a working independence model for the components of $x$, as in naive Bayes, though other more sophisticated methods may be preferred (e.g. McLachlan et al., 2003). Specifically, we have

$f_i(x;\theta_i)=\frac{1}{(2\pi)^{p/2}\,|V|^{1/2}}\exp\Big(-\frac{1}{2}(x-\mu_i)'V^{-1}(x-\mu_i)\Big),$
where $V=\mathrm{diag}(\sigma_1^2,\sigma_2^2,\ldots,\sigma_p^2)$ and $|V|=\prod_{q=1}^{p}\sigma_q^2$.

Given data $x_j$ for $j = 1, \ldots, n$, the EM algorithm (Dempster et al., 1977) is used to estimate the parameters $\Theta$ iteratively in standard model-based clustering (McLachlan and Peel, 2002; Fraley and Raftery, 2002). Using the generic notation $\Theta^{(m)}$ for the parameter estimates at iteration $m$, the EM works by iterating the following updates:

(2)
$\hat{\mu}_i^{(m+1)}=\sum_{j=1}^{n}\tau_{ij}^{(m)}x_j\Big/\sum_{j=1}^{n}\tau_{ij}^{(m)}, \qquad \hat{\sigma}_q^{2,(m+1)}=\sum_{i=1}^{g}\sum_{j=1}^{n}\tau_{ij}^{(m)}(x_{jq}-\mu_{iq}^{(m)})^2\Big/n,$

(3)
$\hat{\pi}_i^{(m+1)}=\sum_{j=1}^{n}\tau_{ij}^{(m)}\Big/n,$
where
(4)
$\tau_{ij}^{(m)}=\frac{\pi_i^{(m)}f_i(x_j;\theta_i^{(m)})}{\sum_{l=1}^{g}\pi_l^{(m)}f_l(x_j;\theta_l^{(m)})}$
is the estimated posterior probability of $x_j$ coming from component $i$.

The above iteration is repeated until convergence, resulting in the maximum likelihood estimate (MLE) $\hat\Theta$. Then we use (4) to calculate the posterior probability of any observation $x_j$ belonging to each cluster $i$, and assign the observation to the cluster with the largest such probability. Because of the possible existence of multiple local maxima, we need to start the algorithm multiple times with various starting values; in this article, we use the result from K-means as the starting values for the EM. We fit a series of models with various values of $g$, then use a model selection criterion to choose an appropriate value, as discussed in Section 2.3.
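The E- and M-steps above can be sketched in NumPy (a minimal illustration with our own crude initialization and a fixed number of iterations, rather than the K-means start and convergence check used in the text):

```python
import numpy as np

def em_standard(X, g, n_iter=200, seed=0):
    """EM sketch for a g-component normal mixture with a shared
    diagonal covariance V, following updates (2)-(4). X is n x p."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    mu = X[rng.choice(n, g, replace=False)].copy()   # crude initialization
    var = X.var(axis=0) + 1e-6                       # shared sigma_q^2
    pi = np.full(g, 1.0 / g)
    for _ in range(n_iter):
        # E-step, Equation (4): posteriors tau_ij, computed stably in logs
        diff2 = (X[:, None, :] - mu[None, :, :]) ** 2          # n x g x p
        log_f = -0.5 * (np.sum(np.log(2 * np.pi * var)) + (diff2 / var).sum(-1))
        log_w = np.log(pi) + log_f
        log_w -= log_w.max(axis=1, keepdims=True)
        tau = np.exp(log_w)
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step, Equations (2) and (3)
        Nk = tau.sum(axis=0)
        pi = Nk / n
        mu = (tau.T @ X) / Nk[:, None]
        var = np.einsum('ng,ngp->p', tau, (X[:, None, :] - mu[None, :, :]) ** 2) / n
    return pi, mu, var, tau
```

In practice one would monitor the log-likelihood and stop on convergence; the fixed iteration count keeps the sketch short.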

### 2.2 Stratified model-based clustering

In Section 2.1, all the genes are treated equally a priori. To take advantage of the known gene functions, we propose first partitioning the genome into several groups, say, $G_1, \ldots, G_K$: $G_K$ contains the genes of unknown function while each of the other groups contains the genes sharing the same biological function. Second, rather than the global model (1), we propose using a stratified model: for any gene $j$ in functional group $G_k$,

(5)
$f_{(k)}(x_j;\Theta_{(k)})=\sum_{i=1}^{g}\pi_{(k),i}\,f_i(x_j;\theta_i)$
for $k = 1, \ldots, K$. Note that the $K$ stratified models differ in using stratum-specific prior probabilities while sharing the same set of component distributions. Hence, we assume that the genes from the same group $G_k$ share the same prior probability $\pi_{(k),i}$ of coming from cluster $i$ while still allowing them to come from different clusters. It is easy to see that the above stratified model reduces to the standard model when $K = 1$.

In practice, the $G_k$ can be determined based on the GO (Ashburner et al., 2000), MIPS (Mewes et al., 2004) or other sources of biological knowledge; the choice of the database depends on the application at hand. For example, if the goal is to predict gene function and both the data quality and the terminology of GO are preferred, GO can be used. Because some genes may have multiple functions, we allow a gene to be in multiple groups. A simple method, considered here, is that if a gene has two functions corresponding to groups $G_1$ and $G_2$, we duplicate its expression profile and treat the two observations as two genes, one in $G_1$ and one in $G_2$, in the process of clustering; see the final section for more discussion of this issue.
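The grouping and duplication scheme can be sketched as follows (the annotation-dictionary input format is a hypothetical simplification of what a GO/MIPS download would provide):

```python
def build_strata(annotations, genes):
    """Expand genes into (gene, stratum) rows, duplicating a gene once
    per functional group; unannotated genes fall into an 'unknown'
    stratum, matching the grouping described above."""
    rows = []
    for gene in genes:
        for grp in annotations.get(gene) or ['unknown']:
            rows.append((gene, grp))
    return rows
```

Each returned row indexes one (possibly duplicated) expression profile together with its stratum label for the stratified EM.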

Next we derive the EM algorithm for the above model-based clustering. Given data X = {xj : j = 1, … , n}, the log-likelihood is

$\log L(\Theta)=\sum_{j=1}^{n}\log\Big[\sum_{i=1}^{g}\pi_{(k),i}\,f_i(x_j;\theta_i)\Big],$
where for simplicity we suppress the dependence of group index k = k(j) on gene j and we use this convention throughout. Maximization of the above log-likelihood is difficult, and it is common to use the EM algorithm (Dempster et al., 1977) by casting the problem in the framework of missing data. Define zij as the indicator of whether xj is from component i; i.e. zij = 1 if xj is indeed from component i and zij = 0 otherwise. If we could observe the missing data zijs, then the complete data log-likelihood is
$\log L_c(\Theta)=\sum_{i}\sum_{j}z_{ij}\big[\log\pi_{(k),i}+\log f_i(x_j;\theta_i)\big].$
It is easy to verify that the E-step of the EM yields
$Q(\Theta;\Theta^{(m)})=E_{\Theta^{(m)}}\big(\log L_c \mid X\big)=\sum_{i}\sum_{j}\tau_{ij}^{(m)}\big[\log\pi_{(k),i}+\log f_i(x_j;\theta_i)\big],$
where, for gene $j \in G_k$,
(6)
$\tau_{ij}^{(m)}=\frac{\pi_{(k),i}^{(m)}f_i(x_j;\theta_i^{(m)})}{\sum_{l=1}^{g}\pi_{(k),l}^{(m)}f_l(x_j;\theta_l^{(m)})}$
is the estimated posterior probability of $x_j$ coming from component $i$.

The M-step of the EM maximizes the above Q to update the parameter estimates. For a stratified model, updates for the mean and variance parameters are the same as in (2), but the update for the prior probability is different:

(7)
$\hat{\pi}_{(k),i}^{(m+1)}=\frac{\sum_{j\in G_k}\tau_{ij}^{(m)}}{n_k},$
where nk is the number of the genes in Gk.

As before, the above updates are iterated until convergence with resulting MLE $Θ^$. We fit a series of models with various values of g, then use a model selection criterion to choose an appropriate value.

We use (6) to calculate the posterior probability of each observation $x_j$ belonging to each cluster, and assign the observation to the cluster with the largest such probability.
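The stratum-specific M-step (7) amounts to averaging the posterior probabilities within each stratum; a sketch (array names are ours):

```python
import numpy as np

def update_stratified_priors(tau, strata):
    """M-step update (7): pi_(k),i = sum_{j in G_k} tau_ij / n_k,
    i.e. the average posterior within each stratum. tau is the n x g
    posterior matrix; strata is a length-n array of stratum labels."""
    strata = np.asarray(strata)
    return {k: tau[strata == k].mean(axis=0) for k in np.unique(strata)}
```

With a single stratum this reduces to the standard update (3), mirroring how the stratified model reduces to the global one when K = 1.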

### 2.3 Model selection

In practice, we have to determine the number of components, g. This is done by first fitting a series of models with various numbers of components, and then using a model selection criterion to choose the best one. In model-based clustering, it is common to use the Bayesian information criterion (BIC) (Schwarz, 1978), defined as

$\mathrm{BIC}=-2\log L(\hat\Theta)+\log(n)\,d,$
where $d$ is the total number of (effective) parameters (Fraley and Raftery, 1998). In the standard model, because we have three sets of parameters, the $\pi_i$s, $\sigma_q^2$s and $\mu_{iq}$s, with the constraint $\sum_{i=1}^{g}\pi_i=1$, we have $d = p + gp + g - 1$; in the stratified model with $K$ strata, instead of one set of $\pi_i$s we have $K$ sets of $\pi_{(k),i}$s, thus $d = p + gp + K(g - 1)$.
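The two parameter counts and the BIC formula translate directly into code (function names are our own):

```python
import math

def n_params(p, g, K=1):
    """Effective number of parameters d: p shared variances, g*p means,
    and K sets of mixing proportions, each with a sum-to-one constraint.
    K = 1 recovers the standard (unstratified) model."""
    return p + g * p + K * (g - 1)

def bic(loglik, n, p, g, K=1):
    """BIC = -2 log L + log(n) * d; smaller is better."""
    return -2.0 * loglik + math.log(n) * n_params(p, g, K)
```

For the 300-experiment data with g = 8, the stratified model with K = 3 costs only 2(g − 1) = 14 extra parameters relative to the standard model, which is why its BIC penalty is barely larger.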

## 3 RESULTS

### 3.1 Simulated data

We did a simulation study to demonstrate the improved performance of our proposal. There were two univariate clusters with distributions $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$, respectively. We had two gene functional groups with $n_1$ and $n_2$ genes, respectively. For a gene in functional group $k$, for $k = 1$ or 2, its prior probability of being in cluster 1 was $\pi_{(k),1}$. Four simulation set-ups were used with various parameter values: three sets of $\pi_{(k),1}$s were used; $n_1 = n_2 = 50$ in the first three set-ups, while $n_1 = 25$ and $n_2 = 75$ in set-up 4; set-ups 2 and 4 were the same except for different $n_1$ and $n_2$.
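A single simulated dataset can be generated as below; the group-specific priors and sizes mirror the set-ups described above, while the default cluster means and standard deviations are our own illustrative placeholders, not values taken from the paper:

```python
import numpy as np

def simulate_dataset(pi1=(0.8, 0.2), sizes=(50, 50),
                     mu=(1.0, 2.0), sigma=(1.0, 2.0), seed=0):
    """Two univariate clusters N(mu_1, sigma_1^2) and N(mu_2, sigma_2^2);
    functional group k has sizes[k] genes, each with prior pi1[k] of
    being in cluster 1. Returns expression values, group labels (1, 2)
    and true cluster labels (1, 2). mu/sigma defaults are illustrative."""
    rng = np.random.default_rng(seed)
    xs, groups, clusters = [], [], []
    for k, (p1, nk) in enumerate(zip(pi1, sizes), start=1):
        comp = (rng.random(nk) >= p1).astype(int)   # 0 -> cluster 1
        xs.append(rng.normal(np.take(mu, comp), np.take(sigma, comp)))
        groups.append(np.full(nk, k))
        clusters.append(comp + 1)
    return (np.concatenate(xs), np.concatenate(groups),
            np.concatenate(clusters))
```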

The results were based on 1000 simulations for each set-up. We considered both the mean and variance of a parameter estimate over 1000 simulations; we also gave the average numbers of observations in a simulated dataset that were incorrectly assigned to clusters different from their true ones.

In the first simulation set-up, we had $\pi_{(1),1} = \pi_{(2),1}$, hence the standard mixture model was correct, and the standard method was expected to perform better. Although the stratified mixture model was still correct, it unnecessarily required estimating two separate parameters $\pi_{(1),1}$ and $\pi_{(2),1}$. It was confirmed that the standard method worked better; however, more interestingly, its performance was quite close to that of our proposed method: the estimates of the mean and variance parameters from the stratified method were almost the same as those from the standard method, but, as expected, the estimate of the prior probability from the former had slightly larger variability.

In set-ups 2–4, the stratified mixture model was correct with $\pi_{(1),1} \ne \pi_{(2),1}$. Hence, with knowledge of the gene functional groups, it was better to use the stratified model. On the other hand, ignoring the gene functional groups, the global mixture model still held, with a corresponding $\pi_1 = 0.5$ or $\pi_1 = 0.35$. Unsurprisingly, the standard method gave estimates with larger estimation errors, e.g. in terms of mean squared error, which is the sum of the squared bias and the variance; this was especially evident for the mean and variance parameters of the second cluster, both with much larger variances than those from the stratified model. In consequence, the standard method resulted in much larger numbers of misclassified genes. Comparing set-ups 2 and 3, it was noted that the performance difference between the two methods increased with $|\pi_{(1),1}-\pi_{(2),1}|$. With different numbers of genes in the two functional groups (set-up 4), the stratified method still maintained its better performance (Table 1).

Table 1

Simulation results for the standard method and the new method

| Set-up | Method | $\mu_1$ | $\mu_2$ | $\sigma_1$ | $\sigma_2$ | $\pi_1$ | $\pi_{(1),1}$ | $\pi_{(2),1}$ | #Err1 | #Err2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | True | | | | | 0.80 | 0.80 | 0.80 | | |
| | Standard | 0.91 (0.48) | 2.86 (1.52) | 1.01 (0.30) | 1.39 (0.69) | 0.764 (0.213) | — | — | 11.3 (8.6) | 11.4 (8.5) |
| | New | 0.91 (0.48) | 2.80 (1.54) | 1.01 (0.30) | 1.39 (0.69) | — | 0.748 (0.243) | 0.749 (0.240) | 12.1 (9.5) | 12.1 (9.5) |
| 2 | True | | | | | 0.50 | 0.80 | 0.20 | | |
| | Standard | 1.01 (0.30) | 2.66 (1.32) | 0.99 (0.29) | 1.74 (0.55) | 0.588 (0.224) | — | — | 13.6 (8.4) | 23.0 (8.2) |
| | New | 1.00 (0.25) | 2.23 (0.78) | 0.98 (0.21) | 1.92 (0.37) | — | 0.822 (0.152) | 0.246 (0.223) | 9.0 (2.8) | 14.3 (6.8) |
| 3 | True | | | | | 0.50 | | | | |
| | Standard | 0.95 (0.52) | 2.68 (1.31) | 1.01 (0.36) | 1.72 (0.54) | 0.575 (0.227) | — | — | 11.6 (14.0) | 25.4 (12.7) |
| | New | 0.98 (0.15) | 2.17 (0.47) | 0.97 (0.11) | 1.99 (0.27) | — | 0.986 (0.036) | 0.101 (0.139) | 0.3 (0.9) | 2.6 (7.7) |
| 4 | True | | | | | 0.35 | 0.80 | 0.20 | | |
| | Standard | 0.99 (0.53) | 2.43 (1.01) | 1.00 (0.43) | 1.81 (0.46) | 0.445 (0.228) | — | — | 10.1 (5.8) | 29.8 (12.9) |
| | New | 1.00 (0.30) | 2.18 (0.66) | 0.98 (0.30) | 1.92 (0.34) | — | 0.829 (0.191) | 0.253 (0.219) | 5.0 (2.6) | 21.4 (10.7) |

The means and standard deviations (in parentheses) of parameter estimates, and #Err1 and #Err2 for the numbers of misclassifications for clusters 1 and 2 respectively, are presented.

### 3.2 Gene function prediction using gene expression profiles

With the completion of the human genome and other sequencing projects, it has become compelling to learn functions of many newly discovered genes. An important approach is to cluster gene expression profiles drawn from microarray experiments under various conditions; see, e.g. Eisen et al. (1998), Brown et al. (2000), Wu et al. (2001), Zhou et al. (2002), Xiao and Pan (2005) and references therein. The rationale is that genes sharing the same biological function are likely to co-express.

We considered clustering gene expression data to predict gene functions for yeast Saccharomyces cerevisiae. We used a large dataset containing 300 microarray experiments with gene deletions and drug treatments (Hughes et al., 2000). The data were centered and scaled so that the mean and variance of the expression profile for each gene were 0 and 1, respectively. Gene functions were downloaded from the MIPS database (Mewes et al., 2004). For illustration, we only considered two gene functions, mitotic cell cycle and cell cycle control (with MIPS code 030301) and mitochondrion (with MIPS 4016), shortened as classes 1 and 2, respectively. There were three strata: the first two strata each contained 200 genes randomly selected from one of the two classes, while the third stratum was a mixture of 119 and 112 genes from the two classes respectively. Hence, the genes in the first two strata were treated as those with known functions whereas the third stratum consisted of the genes whose functions were to be predicted.

We did not use the class labels of the genes in the process of clustering. Based on the clustering results, we predicted the functions of the genes in the third stratum. There were two ways to accomplish prediction, called hard classification and soft classification. In hard classification, each gene was assigned to a cluster, and a cluster was classified to the class to which the majority of the genes from the first two strata assigned to that cluster belonged; if there were equal numbers of the genes from each of the first two strata, we randomly assigned a class label to the cluster. Any gene from stratum 3 that was assigned to a cluster was predicted to be in the same class as the cluster. When there were equal or almost equal numbers of the genes from the first two strata, there was a certain degree of randomness in hard classification. Furthermore, it did not take advantage of the soft-clustering feature of model-based clustering: even if two genes were both assigned to the same cluster, they might have quite different posterior probabilities of being in the cluster. Hence, as an alternative, we used soft classification: first, for each cluster $i$, we calculated the proportion of its genes in class $c$, $P_i^c$, for $c = 1$ and 2; second, for any gene $j$ in stratum 3, the probability of its being in class $c$ is $\sum_{i=1}^{g}P_i^c\tau_{ij}$. Summing over all the genes from class $c$ (or the other class), we obtained an expected number of the genes predicted to be in class $c$. To illustrate the effect of various prior probabilities, we also included results using the prior probabilities of the first two strata to calculate $\tau_{ij}$ for each $j \in G_3$; in the tables, these are denoted as ‘New: $\hat\pi_{(k)}$’ for $k = 1, 2, 3$.
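The soft-classification step can be sketched as a small matrix product (the use of hard cluster assignments to estimate $P_i^c$ and the 0/1 class coding are our reading of the text):

```python
import numpy as np

def soft_class_probs(tau, labeled, cls):
    """tau: n x g posterior matrix over clusters; labeled: boolean mask
    of genes with known class (strata 1-2); cls: 0/1 class labels of the
    labeled genes, in order. P[i, c] is the proportion of labeled genes
    assigned to cluster i that carry class c; a gene's class-c
    probability is then sum_i P[i, c] * tau[j, i]."""
    g = tau.shape[1]
    assign = tau.argmax(axis=1)
    P = np.full((g, 2), 0.5)            # uniform for clusters with no labels
    lab_assign = assign[labeled]
    for i in range(g):
        members = cls[lab_assign == i]
        if members.size:
            P[i, 1] = (members == 1).mean()
            P[i, 0] = 1.0 - P[i, 1]
    return tau @ P                      # n x 2 expected class memberships
```

Summing the rows of the returned matrix over the stratum-3 genes of a given true class reproduces the expected counts reported in the soft-classification tables.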

Table 2

BIC with various numbers (g) of clusters in the standard and stratified clustering for Hughes’ gene expression data with 300 microarray experiments

| g | Standard BIC | New BIC |
|---|---|---|
| 1 | 162479 | 162479 |
| 2 | 150773 | 150786 |
| 3 | 144384 | 144388 |
| 4 | 139320 | 139230 |
| 5 | 136185 | 136103 |
| 6 | 134219 | 134137 |
| 7 | 133502 | 133426 |
| 8 | *132840* | *132785* |
| 9 | 132892 | 132826 |
| 10 | 133112 | 133092 |

Italics give the minima

Table 3

Predictions for stratum 3 based on hard classification using standard clustering and new clustering with g = 8 for Hughes’ gene expression data with 300 microarray experiments

| Truth | Standard (Pred = 1 / 2) | New: $\hat\pi_{(1)}$ (Pred = 1 / 2) | New: $\hat\pi_{(2)}$ (Pred = 1 / 2) | New: $\hat\pi_{(3)}$ (Pred = 1 / 2) |
|---|---|---|---|---|
| 1 | 88 / 31 | 88 / 31 | 88 / 31 | 88 / 31 |
| 2 | 30 / 82 | 31 / 81 | 30 / 82 | 30 / 82 |
| Accuracy | 0.736 | 0.732 | 0.736 | 0.736 |

As a comparison, we also treated the problem as supervised learning: the genes in the first two strata and their class labels were the training data, while the genes in the third stratum were the test data. We applied classic linear discriminant analysis (LDA) and three newer classifiers regarded as among the best: the nearest shrunken centroids (NSC) (Tibshirani et al., 2003), random forests (RF) (Breiman, 2001) and support vector machines (SVM) (Vapnik, 1998). Our purpose here was not to directly compare clustering analyses with these classifiers; rather, the classifiers provided some estimate of the upper bound of the predictive performance of clustering for this particular application. We used the default settings of the R functions `lda()`, `pamr()`, `randomForest()` and `svm()` implementing these methods, except that 5-fold cross-validation was used to select tuning parameters for NSC and SVM.

#### 3.2.1 Using 300 microarray experiments

First, we considered the use of the full dataset with all 300 microarray experiments included. Table 2 gives the BIC values for the two methods with various numbers of clusters, based on which g = 8, which minimized BIC, was selected for both methods. The predictive results based on hard classification for the genes in stratum 3 are given in Table 3, showing that the two methods gave almost the same result. As mentioned earlier, the result of ‘New: $\hat\pi_{(k)}$’ was based on using the prior probability estimate $\hat\pi_{(k)}$ to calculate the posterior probabilities and thus to make predictions for the genes in stratum 3; in practice, we would simply use ‘New: $\hat\pi_{(3)}$’ for stratum 3. Furthermore, based on soft classification (Table 4), the results from the two methods were still almost the same. Tables 5 and 6 give the estimated prior probabilities and the predictive results of the five classifiers, respectively. For NSC, cross-validation selected 85 microarray experiments for the constructed classifier; as a comparison, we also included the result using all 300 microarray experiments. All the supervised methods except LDA worked well; the inferior performance of LDA was likely because of the problematic estimation of a large (i.e. 300 × 300) covariance matrix with only a few hundred observations.

Table 4

Predictions for stratum 3 based on soft classification using standard clustering and new clustering with g = 8 for Hughes’ gene expression data with 300 microarray experiments

| Truth | Standard (Pred = 1 / 2) | New: $\hat\pi_{(1)}$ (Pred = 1 / 2) | New: $\hat\pi_{(2)}$ (Pred = 1 / 2) | New: $\hat\pi_{(3)}$ (Pred = 1 / 2) |
|---|---|---|---|---|
| 1 | 74.5 / 44.5 | 74.5 / 44.5 | 74.5 / 44.5 | 74.5 / 44.5 |
| 2 | 36.2 / 75.8 | 36.7 / 75.3 | 36.0 / 76.0 | 36.1 / 75.9 |
| Accuracy | 0.651 | 0.649 | 0.652 | 0.651 |
Table 5

Estimated prior probabilities with g = 8 for Hughes’ gene expression data with 300 microarray experiments

| Prior | Cluster 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| $\hat\pi$ | 0.135 | 0.154 | 0.114 | 0.105 | 0.084 | 0.215 | 0.117 | 0.077 |
| $\hat\pi_{(1)}$ | 0.11 | 0.015 | 0.2 | 0.09 | 0.13 | 0.28 | 0.165 | 0.01 |
| $\hat\pi_{(2)}$ | 0.149 | 0.245 | 0.045 | 0.13 | 0.04 | 0.176 | 0.05 | 0.165 |
| $\hat\pi_{(3)}$ | 0.143 | 0.195 | 0.1 | 0.091 | 0.082 | 0.199 | 0.13 | 0.061 |
Table 6

Predictions for a separate test dataset (i.e. stratum 3) using the NSC, LDA, RF and SVM for Hughes’ gene expression data with 300 microarray experiments

| Truth | NSC (p = 85) (Pred = 1 / 2) | NSC (p = 300) (Pred = 1 / 2) | LDA (Pred = 1 / 2) | RF (Pred = 1 / 2) | SVM (Pred = 1 / 2) |
|---|---|---|---|---|---|
| 1 | 100 / 19 | 97 / 22 | 70 / 49 | 100 / 19 | 99 / 20 |
| 2 | 21 / 91 | 23 / 89 | 30 / 82 | 21 / 91 | 19 / 93 |
| Accuracy | 0.827 | 0.805 | 0.658 | 0.827 | 0.831 |
Table 7

BIC with various numbers (g) of clusters in the standard and stratified clustering for Hughes’ gene expression data with only 10 microarray experiments

| g | Standard BIC | New BIC |
|---|---|---|
| 1 | 5416 | 5416 |
| 2 | 5172 | 5115 |
| 3 | 4989 | 4913 |
| 4 | 4977 | 4912 |
| 5 | *4888* | *4830* |
| 6 | 4892 | 4860 |
| 7 | 4901 | 4871 |

Italics give the minima

In summary, using the gene expression data with all 300 microarray experiments, the standard clustering and our proposed method had similar performance. The close performance of the two methods was presumably because there was enough information in the data for predicting the two functional classes; with fewer microarray experiments or more functional classes, differences might appear. On the other hand, as in the null case (i.e. set-up 1) of the simulation, this showed that, although our proposed method did not improve over the standard method, it did not result in any significant loss of efficiency either.

#### 3.2.2 Using 10 microarray experiments

We suspected that because of information redundancy in the gene expression data for predicting the two gene functional classes, the influence of the prior probability was negligible; e.g. the cluster centers were all far away from each other, dominating the resulting posterior probability calculations and thus final clustering results. Hence, to increase the difficulty or the noise level of the problem, next we only used the first 10 microarray experiments.

Based on BIC, we chose g = 5 clusters for both methods (Table 7). Using hard classification (Table 8), the overall accuracy rates from the two approaches were close, though the new method had a slight edge over the standard method. It seemed that using the estimated prior probabilities for strata 1 and 2 improved the predictive accuracy rates for classes 1 and 2, respectively, at the expense of a lower accuracy rate for the other class. This trend was more evident with soft classification (Table 9); more importantly, when using the estimated prior probability from stratum 3 as prescribed, our proposed method gave a higher overall accuracy rate than the standard method.

Table 8

Predictions for stratum 3 based on hard classification using standard clustering and new clustering with g = 5 for Hughes’ gene expression data with 10 microarray experiments

| Truth | Standard (Pred = 1 / 2) | New: $\hat\pi_{(1)}$ (Pred = 1 / 2) | New: $\hat\pi_{(2)}$ (Pred = 1 / 2) | New: $\hat\pi_{(3)}$ (Pred = 1 / 2) |
|---|---|---|---|---|
| 1 | 101 / 18 | 97 / 22 | 81 / 38 | 92 / 27 |
| 2 | 53 / 59 | 47 / 65 | 26 / 86 | 35 / 77 |
| Accuracy | 0.693 | 0.701 | 0.723 | 0.732 |
Table 9

Predictions for stratum 3 based on soft classification using standard clustering and new clustering with g = 5 for Hughes’ gene expression data with 10 microarray experiments

| Truth | Standard (Pred = 1 / 2) | New: $\hat\pi_{(1)}$ (Pred = 1 / 2) | New: $\hat\pi_{(2)}$ (Pred = 1 / 2) | New: $\hat\pi_{(3)}$ (Pred = 1 / 2) |
|---|---|---|---|---|
| 1 | 66.8 / 51.2 | 78.5 / 39.5 | 67.6 / 50.4 | 74.8 / 43.2 |
| 2 | 43.0 / 68.0 | 47.5 / 63.5 | 32.9 / 78.1 | 38.5 / 72.5 |
| Accuracy | 0.584 | 0.615 | 0.631 | 0.638 |

Table 10 gives the prior probability estimates for the two methods. Most of these estimates were close to each other; in particular, $\hat\pi$ from the standard method was in general close to $\hat\pi_{(3)}$ from the new method, perhaps because stratum 3 had similar proportions of the genes from the two classes as the total dataset. However, some estimates were different; e.g. for cluster 2, the four prior estimates were 0.461, 0.535, 0.134 and 0.344, respectively. The overall closeness, as well as some individual differences, among the estimated prior probabilities explained why the clustering results were close, but nevertheless different.

Table 10

Estimated prior probabilities with g = 5 for Hughes’ gene expression data with 10 microarray experiments

| Prior | Cluster 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| $\hat\pi$ | 0.219 | 0.461 | 0.206 | 0.111 | 0.003 |
| $\hat\pi_{(1)}$ | 0.272 | 0.535 | 0.023 | 0.169 | 0 |
| $\hat\pi_{(2)}$ | 0.14 | 0.134 | 0.329 | 0.397 | 0 |
| $\hat\pi_{(3)}$ | 0.203 | 0.344 | 0.244 | 0.2 | 0.009 |

The predictive performances of the classifiers are summarized in Table 11. Comparing Tables 11 and 8, we found that the two clustering methods, especially our proposed new one, worked well, with results quite close to those of the supervised learning methods, highlighting the promise of using clustering analysis for gene function prediction. Note that there are several other advantages of clustering analysis over supervised learning. First, clustering methods allow the possibility of discovering new classes while predicting for existing classes (Golub et al., 1999). Second, rather than using a hierarchical ontology to describe gene functions, gene associations or linkages have been argued to be an alternative (Fraser and Marcotte, 2004), for which clustering methods are more naturally applicable than supervised learning.

Table 11

Predictions for a separate test dataset based on the NSC, LDA, RF and SVM for Hughes’ gene expression data with 10 microarray experiments

| Truth | NSC (p = 9) (Pred = 1 / 2) | NSC (p = 10) (Pred = 1 / 2) | LDA (Pred = 1 / 2) | RF (Pred = 1 / 2) | SVM (Pred = 1 / 2) |
|---|---|---|---|---|---|
| 1 | 91 / 28 | 91 / 28 | 83 / 36 | 89 / 30 | 85 / 34 |
| 2 | 31 / 81 | 32 / 80 | 33 / 79 | 28 / 84 | 30 / 82 |
| Accuracy | 0.745 | 0.740 | 0.701 | 0.749 | 0.723 |

## 4 DISCUSSION

With the relatively high noise levels in genomic data, the importance of incorporating biological knowledge into statistical analysis has been increasingly recognized. However, the current practice in this direction is mainly restricted to using biological knowledge as an evaluation criterion to validate the analysis results after the analysis is done. For example, many systems have been built to assess the statistical enrichment of a list of user-supplied genes in any GO categories, as evidenced by two recent reviews by Handl et al. (2005) and Khatri and Draghici (2005); see references therein. There are a few exceptions: e.g. Mootha et al. (2003), Al-Shahrour et al. (2005) and Pan (2005) proposed different stratified/subgroup analysis approaches to detect differential gene expression, all based on the idea of using biological information to group genes to form strata; Lottaz and Spang (2005) considered incorporating GO categories into sample or tumor classifications, though their main motivation was to enhance the interpretability of results, which is also an important factor in genomic analyses. Here we extend the idea to clustering genes and demonstrate its improved performance using both simulated and real data.

In practice, one has to determine which gene functional groups to use. As in any serious statistical modeling, careful thought and some trade-offs are needed. There are two factors to balance: it is desirable to have any group $G_k$ as homogeneous as possible, which, however, may mean it contains only a few genes; furthermore, using smaller groups leads to a larger number of groups, hence more parameters (i.e. the $\pi_{(k),i}$s) to be estimated based on smaller sample sizes. Here we suggest three approaches. The first is both simple and practical: to avoid using functional groups either too big or too small, we specify an acceptable range for the number of genes a functional group can contain, plus possibly some other constraints, and thus determine the eligible functional groups. A specific example is the use of ‘informative’ GO categories (Zhou et al., 2002); an informative GO category is one satisfying two conditions: (1) it contains at least γ genes and (2) it does not have any child category containing at least γ genes, where γ is a threshold in the range of 20–40. For example, with γ = 30, 73 informative MIPS categories were found (Xiao and Pan, 2005), which can be used as priors in our method. Second, among several candidate sets of functional groups, model selection criteria such as BIC can be used to choose the one estimated to give the best predictive performance; this is an advantage of model-based clustering. Third, following the line of Huang et al. (2006), a weighted method can be applied to combine the results of using two sets of functional groups, or, more generally, to take account of the hierarchical structure of a gene annotation system. More studies on genomic scales are needed.
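The ‘informative category’ rule of Zhou et al. (2002) can be expressed as a short filter (the input dictionaries are hypothetical stand-ins for a real GO/MIPS hierarchy):

```python
def informative_categories(genes_in, children, gamma=30):
    """Keep a category with at least gamma annotated genes, none of
    whose child categories also reaches gamma.
    genes_in: category -> set of annotated genes;
    children: category -> iterable of child category ids."""
    return [cat for cat, genes in genes_in.items()
            if len(genes) >= gamma
            and not any(len(genes_in.get(ch, ())) >= gamma
                        for ch in children.get(cat, ()))]
```

The rule pushes stratum definitions as deep into the ontology as the γ threshold allows, trading homogeneity against per-stratum sample size as discussed above.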

In our example, to deal with genes with multiple functions, we simply replicated their expression profiles, once for each of their functional groups. This parallels the usual treatment when the expression profile of a gene with multiple functions is used as a test observation. A downside, however, is that, owing to the correlations introduced among the replicated expression profiles of the same gene, the resulting calculation of BIC is not exact. On the other hand, with only a small number of such genes, as in our example, we suspect that the influence was limited. Furthermore, to avoid any potential problem, a simple alternative is to assign each gene randomly to only one of its known functional groups in the training data; this influences only the prior specification, by ignoring the multiple functions of some genes, not the validity of the final results.
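The two treatments of multi-function genes described above can be sketched as simple preprocessing steps. The helper names and toy data below are hypothetical, not part of the original implementation.

```python
# Sketch (hypothetical helper names): two ways to handle genes annotated
# with multiple functional groups before fitting the stratified model.
import random

def replicate_rows(profiles, annotations):
    """Duplicate a gene's expression profile once per functional group.
    Note: this introduces correlated rows, so BIC is only approximate."""
    return [(gene, grp, profiles[gene])
            for gene, groups in annotations.items()
            for grp in sorted(groups)]

def assign_one_group(profiles, annotations, seed=0):
    """Randomly keep a single functional group per gene; only the prior
    specification is affected, not the validity of the final results."""
    rng = random.Random(seed)
    return [(gene, rng.choice(sorted(groups)), profiles[gene])
            for gene, groups in annotations.items()]

# Toy data: one gene with two annotated functions, one with a single function
profiles = {"YAL001C": [0.2, -1.1], "YBR002W": [1.3, 0.4]}
annotations = {"YAL001C": {"transport", "signaling"}, "YBR002W": {"transport"}}
print(len(replicate_rows(profiles, annotations)))  # 3 rows: one per (gene, group) pair
```

The first helper matches the treatment used in our example; the second is the simple alternative that keeps the BIC calculation exact.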

In this article, we have only considered clustering genes, but the idea can be equally applied to model-based clustering of samples for tumor classification (e.g. Yeung et al., 2001; Ghosh and Chinnaiyan, 2002; McLachlan et al., 2002) if the samples are known a priori to belong to different groups; such groups can be formed on the basis of diagnostic or prognostic factors. Our proposed method also bears some similarity to clustering partially classified data (McLachlan and Peel, 2000, Section 2.19); see Qu and Xu (2004) and Alexandridis et al. (2004) for two applications of the latter to gene expression data. However, there is a key difference: in partially classified data, all the genes known to have the same class label are assumed to come from the same cluster, whereas we allow genes sharing the same class label to come from different clusters; we only assume that they share the same prior probability of coming from any given cluster. In other words, in partially classified data a class label is assumed to coincide with a cluster membership, whereas we impose no such restriction. Considering the heterogeneity of gene functional categories, our modeling assumption is not only weaker but also more reasonable in the current context.
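The distinction can be made concrete: a stratum-specific prior π(k) shifts, but does not fix, each gene's cluster assignment. The sketch below assumes, purely for illustration, univariate Gaussian clusters with common variance; it is not the paper's multivariate model.

```python
# Sketch of the key modeling point: genes in stratum k share the prior
# mixing proportions pi_k over clusters, yet each gene's posterior cluster
# membership is still driven by its own expression data.
# Assumes (hypothetically) univariate Gaussian clusters with common sd.
import math

def posterior(x, pi_k, means, sd=1.0):
    """Posterior cluster probabilities for one gene under stratum prior pi_k."""
    w = [p * math.exp(-0.5 * ((x - m) / sd) ** 2) for p, m in zip(pi_k, means)]
    s = sum(w)
    return [v / s for v in w]

pi_k = [0.7, 0.3]   # stratum prior: this functional group favors cluster 1
means = [0.0, 3.0]  # cluster means
print(posterior(0.1, pi_k, means))  # data near cluster 1: cluster 1 dominates
print(posterior(3.2, pi_k, means))  # data near cluster 2: the data override the prior
```

This is precisely what distinguishes the approach from partially classified data, where a shared class label would force all such genes into one cluster regardless of their expression profiles.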

Many existing clustering methods are not model-based (e.g. Tamayo et al., 1999; Tseng and Wong, 2005). An advantage of model-based clustering is its explicit statistical modeling, which, for instance, facilitates incorporating subject-matter knowledge as priors into the analysis, as we have shown here. In contrast, in non-model-based clustering, biological knowledge has been incorporated directly into a distance metric (Hanisch et al., 2002; Cheng et al., 2004; Huang and Pan, 2006). The advantage of the former is its robustness: when the prior specification is incorrect, the final results may still be valid as long as there is enough information in the data (e.g. Carlin and Louis, 2000), while the latter may give completely wrong results. This also partially explains the results in our example: with the data containing 300 microarray experiments, the new method gave results similar to those of the standard method, presumably because there was enough information in the data and incorporating biological knowledge as a prior made little difference; in contrast, with the data consisting of only 10 microarray experiments, by incorporating biological knowledge, the new method was more efficient than the standard method and thus gave better results. We are currently further evaluating the difference between the two approaches to incorporating biological knowledge in clustering. In addition, our proposed method is an empirical Bayes approach; conceptually it seems possible to generalize the idea to fully Bayesian model-based clustering (Richardson and Green, 1997; Broet et al., 2002; Medvedovic and Sivaganesan, 2002; Fraley and Raftery, 2005), though computational challenges remain. The methodology may also be extended to other, more elaborate modeling contexts, such as the analysis of time-course microarray data (e.g. Ramoni et al., 2002; Luan and Li, 2003). These are interesting topics for future study.

## ACKNOWLEDGEMENTS

The author thanks Guanghua Xiao and Peng Wei for helpful discussions and assistance with the gene expression data. The author is grateful to the reviewers for their constructive comments. This work was supported by NIH grant HL65462 and a UM AHC Development grant.

Conflict of Interest: none declared.

## REFERENCES

Alexandridis, R., et al. (2004) Class discovery and classification of tumor samples using mixture modeling of gene expression data. Bioinformatics, 20, 2545–2552.

Al-Shahrour, F., et al. (2005) Discovering molecular functions significantly related to phenotypes by combining gene expression data and biological information. Bioinformatics, 21, 2988–2993.

Ashburner, M., et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25, 25–29.

Breiman, L. (2001) Random forests. Mach. Learn., 45, 5–32.

Broet, P., et al. (2002) Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. J. Comput. Biol., 9, 671–683.

Brown, M.P., et al. (2000) Knowledge-based analysis of microarray gene expression data using support vector machines. Proc. Natl Acad. Sci. USA, 97, 262–267.

Carlin, B.P. and Louis, T.A. (2000) Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall/CRC Press.

Cheng, J., et al. (2004) A knowledge-based clustering algorithm driven by Gene Ontology. J. Biopharm. Stat., 14, 687–700.

Dempster, A.P., et al. (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Statist. Soc. B, 39, 1–38.

Eisen, M., et al. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863–14868.

Fang, Z., et al. (2006) Knowledge guided analysis of microarray data. J. Biomed. Inform. (in press).

Fraley, C. and Raftery, A.E. (1998) How many clusters? Which clustering methods? Answers via model-based cluster analysis. Comput. J., 41, 578–588.

Fraley, C. and Raftery, A.E. (2002) Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc., 97, 611–631.

Fraley, C. and Raftery, A.E. (2005) Bayesian regularization for normal mixture estimation and model-based clustering. Technical report 486, Department of Statistics, University of Washington.

Fraser, A.G. and Marcotte, E.M. (2004) A probabilistic view of gene function. Nat. Genet., 36, 559–564.

Ghosh, D. and Chinnaiyan, A.M. (2002) Mixture modeling of gene expression data from microarray experiments. Bioinformatics, 18, 275–286.

Golub, T.R., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 285, 531–537.

Handl, J., et al. (2005) Computational cluster validation in post-genomic data analysis. Bioinformatics, 21, 3201–3212.

Hanisch, D., et al. (2002) Co-clustering of biological networks and gene expression data. Bioinformatics, 18, 145–154.

Huang, D. and Pan, W. (2006) Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data. Research report 2006-007, Division of Biostatistics, University of Minnesota. Available at

Huang, D., Wei, P. and Pan, W. (2006) Combining gene annotations and gene expression data in model-based clustering: a weighted method. OMICS (in press).

Hughes, T.R., et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109–126.

Khatri, P. and Draghici, S. (2005) Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics, 21, 3587–3595.

Li, H. and Hong, F. (2001) Cluster-Rasch models for microarray gene expression data. Genome Biol., 2, RESEARCH0031.

Lottaz, C. and Spang, R. (2005) Molecular decomposition of complex clinical phenotypes using biologically structured analysis of microarray data. Bioinformatics, 21, 1971–1978.

Luan, Y. and Li, H. (2003) Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics, 19, 474–482.

McLachlan, G.J. and Peel, D. (2000) Finite Mixture Models. John Wiley & Sons, New York.

McLachlan, G.J., et al. (2002) A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18, 413–422.

McLachlan, G.J., et al. (2003) Modeling high-dimensional data by mixtures of factor analyzers. Comput. Stat. Data Anal., 41, 379–388.

Medvedovic, M. and Sivaganesan, S. (2002) Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics, 18, 1194–1206.

Mewes, H.W., et al. (2004) MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res., 32, D41–D44.

Mootha, V.K., et al. (2003) PGC-1 alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet., 34, 267–273.

Pan, W. (2005) Incorporating biological information as a prior in an empirical Bayes approach to analyzing microarray data. Stat. Appl. Genet. Mol. Biol., 4, Article 12.

Pan, W., et al. (2002) Model-based cluster analysis of microarray gene-expression data. Genome Biol., 3, RESEARCH0009.

Qu, Y. and Xu, S. (2004) Supervised cluster analysis for microarray data based on multivariate Gaussian mixture. Bioinformatics, 20, 1905–1913.

Ramoni, M., et al. (2002) Cluster analysis of gene expression dynamics. Proc. Natl Acad. Sci. USA, 99, 9121–9126.

Richardson, S. and Green, P.J. (1997) On Bayesian analysis of mixtures with an unknown number of components. J. R. Statist. Soc. B, 59, 731–758.

Schwarz, G. (1978) Estimating the dimension of a model. Ann. Stat., 6, 461–464.

Tamayo, P., et al. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA, 96, 2907–2912.

Tibshirani, R., et al. (2003) Class prediction by nearest shrunken centroids, with application to DNA microarrays. Stat. Sci., 18, 104–117.

Tseng, G.C. and Wong, W.H. (2005) Tight clustering: a resampling-based approach for identifying stable and tight patterns in data. Biometrics, 61, 10–16.

Vapnik, V. (1998) Statistical Learning Theory. Wiley, NY.

Wu, L.F., et al. (2002) Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nat. Genet., 31, 255–265.

Xiao, G. and Pan, W. (2005) Gene function prediction by a combined analysis of gene expression data and protein–protein interaction data. J. Bioinform. Comput. Biol., 3, 1371–1389.

Yeung, K.Y., et al. (2001) Model-based clustering and data transformations for gene expression data. Bioinformatics, 17, 977–987.

Zhou, X., et al. (2002) Transitive functional annotation by shortest-path analysis of gene expression data. Proc. Natl Acad. Sci. USA, 99, 12783–12788.

## Author notes

Associate Editor: John Quackenbush