Abstract

Motivation: In recent years, work has been carried out on clustering gene expression microarray data. Some approaches are developed from an algorithmic viewpoint whereas others are developed via the application of mixture models. In this article, a family of eight mixture models that utilize the factor analysis covariance structure is extended to 12 models by introducing a modified factor analysis covariance structure, and the resulting family, which includes parsimonious models, is applied to gene expression microarray data. This family of models allows the correlation between gene expression levels to be modelled even when the number of samples is small. Parameter estimation is carried out using a variant of the expectation–maximization algorithm and model selection is achieved using the Bayesian information criterion. This expanded family of Gaussian mixture models, known as the expanded parsimonious Gaussian mixture model (EPGMM) family, is then applied to two well-known gene expression data sets.

Results: The performance of the EPGMM family of models is quantified using the adjusted Rand index. This family of models gives very good performance, relative to existing popular clustering techniques, when applied to real gene expression microarray data.

Availability: The reduced, preprocessed data that were analysed are available at www.paulmcnicholas.info

Contact: pmcnicho@uoguelph.ca

1 INTRODUCTION

1.1 Model-based clustering

Cluster analysis methods are used to find subgroups in a population. Clustering is of particular interest when analysing gene expression data because it can be used to find subgroups that are well distinguished by their expression profiles. A number of clustering techniques are commonly used including agglomerative hierarchical, divisive hierarchical, k-means, k-medoids and model-based clustering. Model-based clustering is a technique for estimating group membership based on parametric finite mixture models. The density of a parametric finite mixture model can be written  
f(x | ϑ) = ∑g=1G πg r(x | θg),

where πg ∈ [0, 1], such that ∑g=1G πg = 1, is the probability of membership of subpopulation g, and r(x | θg) is the density of a multivariate random variable X with parameters θg. Overviews of finite mixture models are given by McLachlan and Peel (2000a) and Frühwirth-Schnatter (2006).
In the model-based clustering literature, the finite Gaussian mixture model is most commonly used (examples include Fraley and Raftery, 2002; McLachlan et al., 2002; McNicholas and Murphy, 2008, 2010). The density of a finite Gaussian mixture model is given by,  
f(x | ϑ) = ∑g=1G πg ϕ(x | μg, Σg),    (1)

where ϕ(x | μg, Σg) is the density of a multivariate Gaussian random variable X with mean μg and covariance matrix Σg, and ϑ = (π1,…, πG, μ1,…, μG, Σ1,…, ΣG). Note that the Gaussian mixture model has been used within the bioinformatics literature for purposes other than clustering: for example, McLachlan et al. (2006) apply a two-component mixture model to detect differential gene expression.
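As a minimal illustration of the mixture density in Equation (1) and of the posterior membership probabilities on which model-based clustering rests (this is a sketch, not the authors' implementation; the two-component univariate mixture and all numbers are invented):

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density phi(x | mu, sigma^2)."""
    return math.exp(-0.5 * (x - mu) ** 2 / sigma2) / math.sqrt(2 * math.pi * sigma2)

def mixture_density(x, pi, mu, sigma2):
    """f(x | theta) = sum_g pi_g * phi(x | mu_g, sigma2_g)."""
    return sum(p * gaussian_pdf(x, m, s) for p, m, s in zip(pi, mu, sigma2))

# Hypothetical two-component mixture and a single observation
pi, mu, sigma2 = [0.6, 0.4], [0.0, 3.0], [1.0, 1.0]
x = 2.0
f = mixture_density(x, pi, mu, sigma2)

# Posterior probability of membership of each component
z = [p * gaussian_pdf(x, m, s) / f for p, m, s in zip(pi, mu, sigma2)]
```

The vector z sums to one, and its largest element gives the maximum a posteriori classification used later in the article.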

Gaussian mixture models offer an advantage over other commonly used approaches because the covariance structure can potentially account for correlation between expression levels within an expression profile. Consequently, these models are more flexible than k-means or hierarchical clustering which commonly use Euclidean distance. However, due to the high-dimensional nature of expression data, additional structure needs to be assumed for the covariance matrices, so that the model can be fitted in high-dimensional settings. The MCLUST (Fraley and Raftery, 2002) approach to model-based clustering, which utilizes eigen-decomposed covariance matrices, can only be applied to clustering expression profiles if a diagonal covariance structure is assumed; Yeung et al. (2001) were able to cluster genes using MCLUST but not expression profiles. By assuming a highly parsimonious but non-diagonal covariance structure, it is possible to cluster expression profiles while allowing for correlation between gene expressions.

In general, a structure like that given in Equation (1) can be used to model such data. Then the parameters, and hence group memberships, can be estimated using some variant of the expectation–maximization (EM) algorithm (Dempster et al., 1977). The covariance matrices Σg can be decomposed to allow the construction of more parsimonious models.

1.2 Parsimonious Gaussian mixture models

The factor analysis model (Spearman, 1904) assumes that a p-dimensional random vector Xi can be modelled using a q-dimensional vector of latent factors Ui, where q < p. The model can be written Xi = μ + ΛUi + εi, where Λ is a p × q matrix of factor weights, the latent variables Ui ∼ 𝒩(0, Iq) and εi ∼ 𝒩(0, Ψ), where Ψ is a p × p diagonal matrix. Therefore, the marginal distribution of Xi is 𝒩(μ, ΛΛ′ + Ψ).

To illustrate the implications of the covariance matrix attached to this marginal distribution, Σ = ΛΛ′ + Ψ, suppose that Xij and Xik are expression levels from a sample Xi. Then, Cov(Xij, Xik) = σjk = ∑s=1q λjs λks for j ≠ k, and Var(Xij) = σjj = ∑s=1q λjs² + ψj, where ψj is the j-th diagonal element of Ψ. Hence, the matrix Λ models the covariance between expression levels, and a combination of the Λ and Ψ matrices models the variance of expression levels. The factor analysis model therefore allows a high-dimensional, non-diagonal covariance matrix to be modelled with a small number of parameters.
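To make this decomposition concrete, a short numpy sketch (with invented dimensions and random loadings) confirms that the off-diagonal entries of Σ = ΛΛ′ + Ψ depend on Λ alone, while each diagonal entry picks up the extra ψj term:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 10, 2                          # hypothetical dimensions, q < p
Lam = rng.normal(size=(p, q))         # factor weights Lambda
psi = rng.uniform(0.5, 1.5, size=p)   # diagonal of Psi

# Marginal covariance of X under the factor analysis model
Sigma = Lam @ Lam.T + np.diag(psi)

j, k = 0, 1
cov_jk = sum(Lam[j, s] * Lam[k, s] for s in range(q))    # sum_s lambda_js * lambda_ks
var_jj = sum(Lam[j, s] ** 2 for s in range(q)) + psi[j]  # plus psi_j on the diagonal
assert np.isclose(Sigma[j, k], cov_jk)
assert np.isclose(Sigma[j, j], var_jj)

# Parsimony: p*q + p free parameters rather than p*(p + 1)/2 for a full covariance
print(p * q + p, p * (p + 1) // 2)
```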

Ghahramani and Hinton (1997) proposed a mixture of factor analysers model given by the finite Gaussian mixture model in Equation (1), with Σg = ΛgΛg′+Ψ. McLachlan and Peel (2000b) used the more general covariance structure Σg = ΛgΛg′+Ψg. Tipping and Bishop (1999) proposed the mixtures of probabilistic principal component analysers model, for which the component covariance matrix is Σg = ΛgΛg′+ψgIp.

McNicholas and Murphy (2008) further generalized the factor analysis covariance structure by including the possibility of imposing the constraints: Λg = Λ, Ψg = Ψ and Ψg = ψgIp. The result of imposing, or not, each of these three constraints is the family of eight parsimonious Gaussian mixture models (PGMMs) that are described in Table 1. Each member of this family of models has a number of covariance parameters that is linear in data dimensionality. This is one of the reasons that this family of models is particularly well suited to the analysis of high-dimensional data. The constraints allow for assuming common structure in the component covariance matrix Σg, if appropriate. By assuming common covariance structure, a more parsimonious model can be used and this can be estimated in a more stable manner.

Table 1.

The covariance structure of each parsimonious Gaussian mixture model—note that the UCU, UUC and UUU models previously existed under different names, as described in Section 1.2

Λg = Λ   Ψg = Ψ   Isotropic   Covariance structure
C        C        C           Σg = ΛΛ′ + ψIp
C        C        U           Σg = ΛΛ′ + Ψ
C        U        C           Σg = ΛΛ′ + ψgIp
C        U        U           Σg = ΛΛ′ + Ψg
U        C        C           Σg = ΛgΛg′ + ψIp
U        C        U           Σg = ΛgΛg′ + Ψ
U        U        C           Σg = ΛgΛg′ + ψgIp
U        U        U           Σg = ΛgΛg′ + Ψg

C, constrained; U, unconstrained.


The PGMM family has another significant advantage that is particularly important in applications involving high-dimensional data. When running the alternating expectation-conditional maximization (AECM) algorithm (Meng and van Dyk, 1997) for these models, it is advantageous to make use of the Woodbury identity (Woodbury, 1950) to avoid inverting any non-diagonal p × p matrices. Given an n × n matrix A, an n × k matrix H, a k × k matrix C and a k × n matrix V, the Woodbury identity states that

(A + HCV)−1 = A−1 − A−1H(C−1 + VA−1H)−1VA−1.    (2)

Setting H = Λ, V = Λ′, A = Ψ and C = Iq in Equation (2) gives

(ΛΛ′ + Ψ)−1 = Ψ−1 − Ψ−1Λ(Iq + Λ′Ψ−1Λ)−1Λ′Ψ−1.    (3)

Now, the left-hand side of Equation (3) involves inversion of a p × p matrix, but the right-hand side leaves only diagonal and q × q matrices to be inverted. This is a major computational advantage when modelling expression data, since q ≪ p. A related identity for the determinant of the covariance matrix is given by

|ΛΛ′ + Ψ| = |Ψ| / |Iq − Λ′(ΛΛ′ + Ψ)−1Λ|.    (4)

Equations (3) and (4) are used by McLachlan and Peel (2000b) for the mixtures of factor analysers model and by McNicholas and Murphy (2008) and McNicholas et al. (2010) for the PGMM family.
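A numerical check of these two identities in an expression-like setting (dimensions and matrices invented) shows that only a diagonal inverse and a q × q inverse are ever needed:

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 50, 3                               # q much smaller than p, as with expression profiles
Lam = rng.normal(size=(p, q))
psi = rng.uniform(0.5, 1.5, size=p)        # diagonal of Psi

Sigma = Lam @ Lam.T + np.diag(psi)

# Right-hand side of Equation (3): only a diagonal and a q x q inversion
Psi_inv = np.diag(1.0 / psi)
small = np.linalg.inv(np.eye(q) + Lam.T @ Psi_inv @ Lam)
Sigma_inv = Psi_inv - Psi_inv @ Lam @ small @ Lam.T @ Psi_inv
assert np.allclose(Sigma_inv, np.linalg.inv(Sigma))

# Equation (4): log|Sigma| from |Psi| and a q x q determinant
logdet_rhs = np.log(psi).sum() - np.linalg.slogdet(
    np.eye(q) - Lam.T @ Sigma_inv @ Lam)[1]
assert np.isclose(logdet_rhs, np.linalg.slogdet(Sigma)[1])
```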

2 METHODOLOGY

2.1 Modified factor analysis covariance structure

The factor analysis covariance structure (cf. McLachlan and Peel, 2000b) can be further parameterized by writing Ψg = ωgΔg, where ωg ∈ ℝ+ and Δg = diag{δ1, δ2,…, δp} such that |Δg| = 1, for g = 1, 2,…, G. The resulting covariance structure Σg = ΛgΛg′ + ωgΔg shall be known as the modified factor analysis covariance structure. Now, this covariance structure can be used within the model-based clustering framework, opening up the possibility of models that are more parsimonious than their PGMM counterparts. Specifically, constraints can be imposed on the parameters Λg, ωg and Δg, leading to the 12 Gaussian mixture models illustrated in Table 2. The family of models in Table 2 will be referred to as the expanded PGMM (EPGMM) family hereafter. Table 2 contains four new parsimonious models when compared with Table 1. Notably, all 12 members of the EPGMM family have a number of covariance parameters that is linear in the dimensionality of the data. Furthermore, the identities given in Equations (3) and (4) can be used for all 12 models.

Table 2.

The covariance structure, number of covariance parameters and nomenclature for each member of the EPGMM family, along with the name of the equivalent member of the PGMM family where applicable

EPGMM nomenclature          PGMM        Covariance structure   Number of covariance parameters
(Λg = Λ, Δg = Δ, ωg = ω,    equivalent
Δg = Ip)
CCCC                        CCC         Σg = ΛΛ′ + ωIp         [pq − q(q−1)/2] + 1
CCUC                        CUC         Σg = ΛΛ′ + ωgIp        [pq − q(q−1)/2] + G
UCCC                        UCC         Σg = ΛgΛg′ + ωIp       G[pq − q(q−1)/2] + 1
UCUC                        UUC         Σg = ΛgΛg′ + ωgIp      G[pq − q(q−1)/2] + G
CCCU                        CCU         Σg = ΛΛ′ + ωΔ          [pq − q(q−1)/2] + p
CCUU                        –           Σg = ΛΛ′ + ωgΔ         [pq − q(q−1)/2] + [G + (p−1)]
UCCU                        UCU         Σg = ΛgΛg′ + ωΔ        G[pq − q(q−1)/2] + p
UCUU                        –           Σg = ΛgΛg′ + ωgΔ       G[pq − q(q−1)/2] + [G + (p−1)]
CUCU                        –           Σg = ΛΛ′ + ωΔg         [pq − q(q−1)/2] + [1 + G(p−1)]
CUUU                        CUU         Σg = ΛΛ′ + ωgΔg        [pq − q(q−1)/2] + Gp
UUCU                        –           Σg = ΛgΛg′ + ωΔg       G[pq − q(q−1)/2] + [1 + G(p−1)]
UUUU                        UUU         Σg = ΛgΛg′ + ωgΔg      G[pq − q(q−1)/2] + Gp

C, constrained; U, unconstrained.
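The parameter counts in Table 2 can be generated mechanically from the four constraint letters; the helper below is a sketch (the function name and the concrete p, q and G are ours, introduced purely for checking):

```python
def n_cov_params(name, p, q, G):
    """Covariance parameter count for an EPGMM model named by its four
    constraint letters (Lambda_g = Lambda, Delta_g = Delta, omega_g = omega,
    Delta_g = I_p); 'C' = constrained, 'U' = unconstrained."""
    lam_c, delta_c, omega_c, iso_c = (c == "C" for c in name)
    lam = (p * q - q * (q - 1) // 2) * (1 if lam_c else G)
    omega = 1 if omega_c else G
    # Delta_g = I_p contributes nothing; otherwise the |Delta| = 1 constraint
    # leaves p - 1 free diagonal entries per distinct Delta
    delta = 0 if iso_c else (p - 1) * (1 if delta_c else G)
    return lam + omega + delta

# Check a few rows of Table 2 with concrete (invented) p, q, G
p, q, G = 100, 3, 2
assert n_cov_params("CCCC", p, q, G) == (p * q - q * (q - 1) // 2) + 1
assert n_cov_params("CCUU", p, q, G) == (p * q - q * (q - 1) // 2) + G + (p - 1)
assert n_cov_params("UUUU", p, q, G) == G * (p * q - q * (q - 1) // 2) + G * p
```

Every count grows linearly in p, which is what makes the family usable on high-dimensional expression profiles.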


2.2 Parameter estimation for the EPGMM family

2.2.1 Introduction

Estimation of the model parameters, via the AECM algorithm, is analogous to the PGMM parameter estimation procedure described by McNicholas and Murphy (2008). For the eight pre-existing models, the estimates are obtained from the PGMM estimates by setting ωg = |Ψg|1/p and Δg = Ψg/|Ψg|1/p, so that Ψg = ωgΔg. However, the derivation of the maximum likelihood estimates of the model parameters for the new models requires the method of Lagrange multipliers (Lagrange, 1788). Parameter estimates for the CCUU model are derived in Section 2.2.2 and derivations for the other three new models are given at the end of that section.

2.2.2 AECM algorithm

The EM algorithm is an iterative technique for finding maximum likelihood estimates when data are incomplete, or are treated as incomplete. In the expectation step (E-step), the expected value of the complete-data log-likelihood (Q, say) is computed, where the complete-data is the missing data plus the observed data. Then in the maximization step (M-step), Q is maximized with respect to the model parameters.

In the expectation-conditional maximization (ECM) algorithm (Meng and Rubin, 1993), the M-step is replaced by a number of conditional maximization (CM) steps. The AECM algorithm (Meng and van Dyk, 1997) is an extension of the ECM algorithm that permits different specification of the complete-data at each stage. Extensive details on the EM algorithm and variants thereof are given by McLachlan and Krishnan (2008).

Since there are two sources of missing data for the EPGMM family, the group memberships and the latent factors, the AECM algorithm is used for parameter estimation. We shall use zig to denote the group membership of sample i, so that zig = 1 if sample i is in group g and zig = 0 otherwise. At the first stage of the algorithm, the complete-data are (xi, zig) and in the E-step the zig are replaced by their expected values

E[Zig | xi] = πg ϕ(xi | μg, Σg) / ∑h=1G πh ϕ(xi | μh, Σh),

to give the expected value of the complete-data log-likelihood, 𝒬1 say. In the interest of brevity, the expected value of Zig will be denoted ẑig herein. The function 𝒬1 is then maximized in the CM-step to give π̂g = ng/n and μ̂g = (1/ng) ∑i=1n ẑig xi, where ng = ∑i=1n ẑig and n = ∑g=1G ng.

At the second stage, the complete-data are (xi, zig, uig) and in the E-step the zig are replaced by ẑig and the sufficient statistics for the factors Uig are replaced by

E[Uig | xi, zig = 1] = βg(xi − μg)  and  E[UigUig′ | xi, zig = 1] = Iq − βgΛg + βg(xi − μg)(xi − μg)′βg′,

respectively, where βg = Λg′(ΛgΛg′ + ωgΔg)−1, to give 𝒬2. The CM-step at this second stage will depend on the model. Consider the CCUU model, so that Λg = Λ and Δg = Δ. In this case, the expected complete-data log-likelihood 𝒬2(Λ, ωg, Δ) can be written

𝒬2(Λ, ωg, Δ) = C − (1/2) ∑g=1G ng [ p log ωg + log|Δ| + (1/ωg) tr{Δ−1Sg} − (2/ωg) tr{Δ−1ΛβgSg} + (1/ωg) tr{Δ−1ΛΘgΛ′} ],

where C is constant with respect to Λ, ωg and Δ, Sg = (1/ng) ∑i=1n ẑig (xi − μ̂g)(xi − μ̂g)′ and Θg = Iq − βgΛ + βgSgβg′.
To maximize 𝒬2 with respect to Λ, ωg and Δ, subject to the constraint |Δ| = 1, it is necessary to use the method of Lagrange multipliers. First, form the Lagrange function L(Λ, ωg, Δ, κ) = 𝒬2(Λ, ωg, Δ) − κ(|Δ| − 1). Note that we use κ to denote the Lagrange multiplier to avoid confusion with the elements of the matrix Λ. Differentiating L with respect to Λ, ωg−1, Δ−1 and κ, respectively, gives the score functions S1(Λ, ωg, Δ), S2(Λ, ωg, Δ), S3(Λ, ωg, Δ, κ) and S4(Δ, κ).

Note that S4 is included for completeness only and solving S4(Δ, κ) = 0 just returns the constraint |Δ| = 1. Now, solving S1(Λ, ωg, Δ) = 0 gives

Λ̂ = [∑g=1G (ng/ωg) Sgβg′][∑g=1G (ng/ωg) Θg]−1,

and solving S2(Λ, ωg, Δ) = 0 gives

ω̂g = (1/p) tr{Δ−1(Sg − 2Λ̂βgSg + Λ̂ΘgΛ̂′)}.

Solving S3(Λ, ωg, Δ, κ) = 0 leads to

(n + 2κ)Δ̂ = diag{∑g=1G (ng/ωg)(Sg − 2Λ̂βgSg + Λ̂ΘgΛ̂′)}.

But Δ̂ is a diagonal matrix with |Δ̂| = 1, therefore

δ̂j = ξj / (n + 2κ),

where ξj is the j-th element along the diagonal of the matrix

∑g=1G (ng/ωg)(Sg − 2Λ̂βgSg + Λ̂ΘgΛ̂′).

Therefore, it follows that

κ = (1/2)[(∏j=1p ξj)1/p − n].    (5)
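Whatever its value, the Lagrange multiplier only rescales the candidate diagonal elements ξj so that the constraint is met; numerically, dividing the ξj by their geometric mean always produces a unit-determinant Δ̂. A quick check of this normalization, with invented ξj values:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 6
xi = rng.uniform(0.5, 2.0, size=p)       # hypothetical diagonal elements xi_j

# Rescale by the geometric mean so that the diagonal matrix has determinant one
delta = xi / xi.prod() ** (1.0 / p)

assert np.isclose(delta.prod(), 1.0)             # |Delta-hat| = 1
assert np.allclose(delta / delta[0], xi / xi[0]) # ratios of diagonal entries preserved
```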
The derivations for the other three new models are similar. The estimates in the UCUU case are

Λ̂g = Sgβg′Θg−1,  ω̂g = (1/p) tr{Δ−1(Sg − Λ̂gβgSg)}  and  δ̂j = ξj / (n + 2κ),

where κ is as defined in Equation (5) but, in this case, ξj is the j-th element along the diagonal of the matrix

∑g=1G (ng/ωg)(Sg − Λ̂gβgSg).
In the CUCU case, the estimate for Λ is derived in a row-by-row fashion as

λ̂i = ri [∑g=1G (ng/δgi) Θg]−1,

for i = 1,…, p, where ri is the i-th row of the matrix ∑g=1G (ng/δgi) Sgβg′ and δgi is the i-th element along the diagonal of the matrix Δg. The other estimates are

ω̂ = (1/np) ∑g=1G ng tr{Δg−1(Sg − 2Λ̂βgSg + Λ̂ΘgΛ̂′)}  and  δ̂gj = ξgj / (ng + 2κg),

where κg = (1/2)[(∏j=1p ξgj)1/p − ng] and ξgj is the j-th element along the diagonal of the matrix (ng/ω)(Sg − 2Λ̂βgSg + Λ̂ΘgΛ̂′).
In the UUCU case, the parameter estimates are given by

Λ̂g = Sgβg′Θg−1,  ω̂ = (1/np) ∑g=1G ng tr{Δg−1(Sg − Λ̂gβgSg)}  and  δ̂gj = ξgj / (ng + 2κg),

and κg is as in the CUCU case but with ξgj given by the j-th element along the diagonal of the matrix (ng/ω)(Sg − Λ̂gβgSg).

Note that the predicted clustering for each member of the EPGMM family is given by the maximum a posteriori (MAP) classification. That is, the posterior predicted component membership of tissue i is the value of g for which ẑig is greatest.

2.3 Convergence and model selection

2.3.1 Convergence criterion

Aitken's acceleration (Aitken, 1926) is used in the analyses herein to estimate the asymptotic maximum of the log-likelihood at each iteration. This allows a decision to be made about whether or not a given AECM algorithm has converged. Aitken's acceleration at iteration t is given by

a(t) = (l(t+1) − l(t)) / (l(t) − l(t−1)),

where l(t+1), l(t) and l(t−1) are the log-likelihood values from iterations t + 1, t and t − 1, respectively. The asymptotic estimate of the log-likelihood at iteration t + 1 is given by

l∞(t+1) = l(t) + (l(t+1) − l(t)) / (1 − a(t))

(Böhning et al., 1994). Herein, the stopping criterion proposed by McNicholas et al. (2010) is used, so that the algorithm is stopped when l∞(t+1) − l(t) < ϵ. More specifically, ϵ = 0.1 is used. Note that this criterion is very similar to that proposed by Lindsay (1995), who suggested stopping when l∞(t+1) − l(t+1) < ϵ.
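A sketch of this stopping rule (the function name and the illustrative log-likelihood sequences are ours):

```python
def aitken_converged(lls, eps=0.1):
    """Stopping rule based on Aitken's acceleration: given the last three
    log-likelihoods l(t-1), l(t), l(t+1), stop when the asymptotic estimate
    of the log-likelihood is within eps of l(t)."""
    l_prev, l_curr, l_next = lls[-3], lls[-2], lls[-1]
    a = (l_next - l_curr) / (l_curr - l_prev)      # Aitken's acceleration a(t)
    l_inf = l_curr + (l_next - l_curr) / (1 - a)   # asymptotic log-likelihood estimate
    return l_inf - l_curr < eps

# Geometrically slowing sequences (invented numbers)
print(aitken_converged([-110.0, -105.0, -102.5]))        # still far from the limit
print(aitken_converged([-100.004, -100.002, -100.001]))  # effectively converged
```

Unlike a simple lack-of-progress rule, this criterion stops only when the projected remaining gain in log-likelihood is small.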

2.3.2 Model selection

The Bayesian information criterion (BIC; Schwarz, 1978) is used to select the best member of the EPGMM family, in terms of both model and number of factors. Note that the BIC can also be used to select the number of mixture components (cf. Fraley and Raftery, 1999; McNicholas and Murphy, 2008) but this is not necessary for the analyses herein since we fix G = 2. For a model with parameters ϑ, the BIC is given by BIC = 2l(x, ϑ̂) − m log n, where l(x, ϑ̂) is the maximized log-likelihood, ϑ̂ is the maximum likelihood estimate of ϑ, m is the number of free parameters in the model and n is the number of observations. The effectiveness of the BIC for choosing the number of factors in a factor analysis model has been established by Lopes and West (2004), while McNicholas et al. (2010) provide practical evidence that the BIC performs well in choosing the number of factors for the PGMM family of models.
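A sketch of the criterion under the convention used in this article, where larger BIC is better (the log-likelihoods and parameter counts below are invented):

```python
from math import log

def bic(loglik, m, n):
    """BIC = 2 * maximized log-likelihood - m * log(n)."""
    return 2 * loglik - m * log(n)

# Two hypothetical fits on n = 72 tissues: the second fits slightly better
# but pays a penalty for its 30 extra free parameters, so the first wins
n = 72
assert bic(-1000.0, m=50, n=n) > bic(-995.0, m=80, n=n)
```

This penalty on m is why the parsimonious members of the EPGMM family can be preferred over their fully unconstrained counterparts.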

A number of other model selection criteria could be used including the Akaike information criterion (AIC; Akaike, 1974), the integrated completed likelihood (ICL; Biernacki et al., 2000) and clustering stability (cf. von Luxburg, 2009). However, we found that the BIC gave a quick solution and generally good clustering results.

3 ANALYSES

3.1 Dimensionality reduction

McLachlan et al. (2002) analysed two microarray gene expression data sets—one on leukaemia data and another on colon tissue samples—using the EMMIX-GENE approach. The first stage of this approach focuses on data reduction where, initially, one- and two-component mixtures of t-distributions are fitted to the data. Then a gene is retained only if two conditions are satisfied.

One of these conditions is that the minimum cluster size exceeds some prespecified threshold a1. The other condition concerns the result of a likelihood ratio test, or tests. First, the hypothesis H0 : G = 1 is tested against H1 : G = 2 and the gene is retained if

−2 log λ > a2,    (6)

where λ is the likelihood ratio statistic. However, if the condition in Equation (6) is not met, then the hypothesis H0 : G = 2 is tested against H1 : G = 3 and the gene is retained if the same condition is satisfied, with the same a2, for this test statistic λ and at least two of the three components contain at least a1 tissues.

When fitting the two- and three-component mixture models for this purpose, starting values for the component memberships are defined randomly or by using starting values based on k-means clustering results. This whole process represents the first stage of the EMMIX-GENE approach and can be carried out using the select-genes software that accompanies McLachlan et al. (2004). For the analyses herein, the select-genes software is used with thresholds a1 = a2 = 8, as in McLachlan et al. (2002), and 50 random and 50 k-means starts.

3.2 Leukaemia data

3.2.1 The data

Golub et al. (1999) presented data on two forms of acute leukaemia: acute lymphoblastic leukaemia (ALL) and acute myeloid leukaemia (AML). Affymetrix arrays were used to collect measurements for 7129 genes on 72 tissues. There were a total of 47 ALL tissues and 25 AML tissues. The data were sourced from the web site accompanying McLachlan et al. (2004, www.maths.uq.edu.au/∼gjm/emmix-gene/) and so they had been preprocessed (Dudoit et al., 2002; McLachlan et al., 2002) as follows.

  1. Genes with expression less than 100 or greater than 16 000 were removed.

  2. Genes with expressions satisfying max/min ≤ 5 and max − min ≤ 500 were removed.

  3. The natural logarithm was taken.

Following this preprocessing, a total of 3731 genes remained. This number was further reduced to 2030 following application of the select-genes software (cf. Section 3.1).
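A rough numpy sketch of the three preprocessing steps, applied to a purely hypothetical expression matrix (the thresholds are those listed above; everything else is invented):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical expression matrix: rows = genes, columns = tissues
X = rng.uniform(80, 5000, size=(1000, 72))

# Step 1: drop genes whose expression leaves the [100, 16000] range
keep = (X.min(axis=1) >= 100) & (X.max(axis=1) <= 16000)

# Step 2: drop genes showing little variation across tissues
mx, mn = X.max(axis=1), X.min(axis=1)
keep &= ~((mx / mn <= 5) & (mx - mn <= 500))

# Step 3: natural logarithm of the retained genes
X_reduced = np.log(X[keep])
```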

3.2.2 The EPGMM approach

Treating this as a clustering problem where the form of leukaemia is unknown, all 12 members of the EPGMM family (Table 2) were fitted to these data for G = 2, q = 1,…, 6 and 10 different random starting values for the ẑig. The BIC for the best q for each of the 12 members of the EPGMM family is given in Table 3.

Table 3.

The BIC for the best q for each of the 12 members of the EPGMM family for the leukaemia data

Model    BIC           Model    BIC
CCCC     −411 646.50   CCUC     −411 566.29
UCCC     −416 954.56   UCUC     −416 803.57
CCCU     −414 615.22   CCUUa    −413 207.29
UCCU     −423 354.79   UCUUa    −422 089.38
CUCUa    −413 966.90   CUUU     −413 978.04
UUCUa    −423 933.46   UUUU     −423 532.04

aOne of the four new models.


The best of these models, in terms of BIC, was a CCUC model with q = 3 latent factors. The chosen model has a non-diagonal covariance structure where the covariance between pairs of genes is equal across different clusters but the variance of each gene is unequal across different clusters (Section 1.2). The MAP classifications arising from the parameter estimates associated with this model are given in Table 4; only five tissue samples were misclassified.

Table 4.

Estimated group membership for the best EPGMM model for the leukaemia data

        1    2
ALL    42    5
AML     0   25

3.2.3 Hierarchical clustering, k-means, k-medoids and MCLUST

In addition to the EPGMM technique, several other techniques were applied to these data using the R software (R Development Core Team, 2010). Agglomerative hierarchical clustering was used, with Euclidean distance and three different linkage methods: complete, average and single. The k-means (cf. Hartigan and Wong, 1979) and k-medoids techniques were also used. In the latter case, the partitioning around medoids (PAM; cf. Kaufman and Rousseeuw, 1990, Chapter 2) algorithm was used. Finally, in order to compare our model-based clustering approach to the well-established MCLUST approach, we used the mclust package (Fraley and Raftery, 1999) for the R software.

The results, which are summarized in Table 5, give the Rand and adjusted Rand indices as measures of class agreement. The Rand index (Rand, 1971) is based on pairwise agreements and disagreements, and the adjusted Rand index (Hubert and Arabie, 1985) is effectively the Rand index corrected for random chance. These indices reveal that the best of the non-model-based approaches was k-means clustering, with an adjusted Rand index of 0.187. In fact, k-means clustering narrowly outperformed mclust on these data, but the EPGMM model with the greatest BIC (CCUC, q = 3) was the best model overall.
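Both indices can be computed directly from the cross-tabulation of true classes against predicted clusters. The helper below is ours, applied to the cross-tabulation implied by Table 4 (42 of the 47 ALL tissues in cluster 1, all 25 AML tissues in cluster 2, five misclassifications in total); it reproduces the EPGMM row of Table 5:

```python
from math import comb

def rand_indices(table):
    """Rand and adjusted Rand indices from a classes-by-clusters table."""
    n = sum(map(sum, table))
    together = sum(comb(nij, 2) for row in table for nij in row)
    a = sum(comb(sum(row), 2) for row in table)        # same-class pair counts
    b = sum(comb(sum(col), 2) for col in zip(*table))  # same-cluster pair counts
    expected = a * b / comb(n, 2)
    rand = (comb(n, 2) + 2 * together - a - b) / comb(n, 2)
    ari = (together - expected) / ((a + b) / 2 - expected)
    return rand, ari

rand, ari = rand_indices([[42, 5], [0, 25]])
print(round(rand, 3), round(ari, 3))  # → 0.869 0.738, matching Table 5
```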

Table 5.

Summary results for all of the clustering techniques that were applied to the leukaemia data

                          BIC          Rand index   Adjusted Rand index
Hierarchical (complete)   –            0.532        0.058
Hierarchical (average)    –            0.525        −0.024
Hierarchical (single)     –            0.532        −0.013
k-means                   –            0.593        0.187
PAM                       –            0.518        0.023
MCLUST (VII)              −416 293.2   0.593        0.186
EPGMM (CCUC, q = 3)       −411 566.3   0.869        0.738

3.2.4 The EMMIX-GENE approach

McLachlan et al. (2002) analysed the same data using the EMMIX-GENE approach with four random and four k-means starts in the first stage, which reduced the number of genes to 2015. In the second stage, a mixture of 40 normal distributions with isotropic covariance structure was fitted to the 2015 genes. Two of these groups (Groups 1 and 3) provided clusterings that were most similar to the type of leukaemia—of course, in a real clustering scenario this could not be established. A two-component mixture of factor analysers, with q = 6 factors, was fitted to the data using the genes from Groups 1 and 3, respectively. Using the genes from Group 1 led to the misclassification of 13 tissues and using those from Group 3 led to the misclassification of six tissues. Note that McLachlan et al. (2002) did not specify how many different random starts were used but, based on other analyses, it seems likely that 50 random and 50 k-means starts were used.

3.2.5 Two other approaches

In addition to the EMMIX-Gene approach, McLachlan et al. (2002) used two other approaches to cluster the leukaemia tissues. In both cases, the first stage was identical to that described in Section 3.2.4. The first alternative approach was to cluster the tissues based on the 40 fitted group means and the top 50 of the 2015 genes. Fitting a two-component mixture of factor analysers with q = 8 factors to these data, using 50 random and 50 k-means starts, led to the misclassification of just one tissue. The second alternative approach was to base the analysis on the top 50 genes. Fitting a two-component mixture of factor analysers, with q = 8 factors to these data, using 50 random and 50 k-means starts, led to the misclassification of 10 tissues.

3.2.6 Comments

The EPGMM approach gave very good clustering performance when applied to the leukaemia data. This approach used 10 random starts and led to the misclassification of just five tissues. This performance far exceeds that of agglomerative hierarchical clustering, k-means clustering, PAM and MCLUST. In fact, the best of all of these techniques had an adjusted Rand index of 0.187, while the best EPGMM model had an adjusted Rand index of 0.738. Although, in one instance, one of the approaches of McLachlan et al. (2002) returned a better predicted classification, it is difficult to make a direct comparison to the EPGMM approach. This difficulty arises because the EPGMM approach is a genuine clustering approach, while the methods described in Sections 3.2.4 and 3.2.5 assumed, to some extent, knowledge of the truth. This knowledge was clearly used in the analyses described in Section 3.2.4 but was used in a less obvious fashion in the analyses given in Section 3.2.5. In this latter case, the choice of the number of clusters (40) was validated in some sense by the fact that two of the groups give classifications that were similar to the true leukaemia type. In fact, as mentioned by McLachlan et al. (2002), an objective technique for choosing this number is not possible since genes cannot be assumed to be independently distributed within a tissue sample. Furthermore, it is quite likely that the number of factors q was selected, in each case, to give the best classification. This could be done objectively, as in Section 3.2.2, using the BIC. Finally, any comparison between the EPGMM approach and the approaches of McLachlan et al. (2002) would have to be taken in context with the fact that different subsets of the 3731 genes are used in each case.

3.3 Colon data

3.3.1 The data

Alon et al. (1999) presented gene expression data on 62 colon tissue samples, of which 40 were tumours and the remaining 22 were normal. Affymetrix arrays were used to collect measurements for 6500 gene expressions on all 62 tissues. Following Alon et al. (1999) and McLachlan et al. (2002), only the 2000 genes with the highest minimal intensity are focused upon. The data were again sourced from the web site mentioned in Section 3.2.1 and, this time, the only preprocessing was the taking of natural logarithms, followed by normalization. Application of the select-genes software, with the settings specified in Section 3.1, led to the reduction of the number of genes from 2000 to just 461.

3.3.2 The EPGMM approach

Treating this as a clustering problem where the type of tissue is unknown, all 12 members of the EPGMM family (Table 2) were fitted to these data for G = 2, q = 1,…, 10 and 10 different random starting values for the ẑig. The BIC for the best q for each of the 12 members of the EPGMM family is given in Table 6.

Table 6.

The BIC for the best q for each of the 12 members of the EPGMM family for the colon data

Model    BIC          Model    BIC
CCCC     −79 085.73   CCUC     −70 937.72
UCCC     −77 267.65   UCUC     −77 268.16
CCCU     −71 064.11   CCUUa    −71 063.71
UCCU     −77 310.19   UCUUa    −77 532.97
CUCUa    −71 609.10   CUUU     −71 631.35
UUCUa    −78 458.83   UUUU     −78 306.33

aOne of the four new models.

Table 6.

The BIC for the best q for each of the 12 members of the EPGMM family for the colon data

ModelqBICModelqBIC
CCCC −79 085.73 CCUC −70 937.72 
UCCC −77 267.65 UCUC −77 268.16 
CCCU −71 064.11 CCUUa −71 063.71 
UCCU −77 310.19 UCUUa −77 532.97 
CUCUa −71 609.10 CUUU −71 631.35 
UUCUa −78 458.83 UUUU −78 306.33 
ModelqBICModelqBIC
CCCC −79 085.73 CCUC −70 937.72 
UCCC −77 267.65 UCUC −77 268.16 
CCCU −71 064.11 CCUUa −71 063.71 
UCCU −77 310.19 UCUUa −77 532.97 
CUCUa −71 609.10 CUUU −71 631.35 
UUCUa −78 458.83 UUUU −78 306.33 

aOne of the four new models.
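The model search described above, fitting each covariance structure over a range of q with several random starts and retaining the fit with the greatest BIC, can be sketched as follows. The log-likelihoods and parameter counts below are hypothetical placeholders; in practice each entry would come from an actual AECM fit. The convention BIC = 2 log L − m log n (greater is better) is assumed, consistent with the greatest-BIC selection used here.

```python
import math

def bic(log_likelihood, n_free_params, n_obs):
    """BIC in the form 2 log L - m log n, where larger is better."""
    return 2.0 * log_likelihood - n_free_params * math.log(n_obs)

# Hypothetical fit results: (model name, q, maximized log-likelihood,
# number of free parameters). These numbers are illustrative only.
fits = [
    ("CCUC", 6, -35400.0, 2900),
    ("CCUU", 7, -35350.0, 3400),
    ("UUUU", 4, -38000.0, 5200),
]
n_obs = 62  # colon data: 62 tissue samples

# Retain the fitted model with the greatest BIC
best = max(fits, key=lambda f: bic(f[2], f[3], n_obs))
print(best[0], best[1])
```

The parsimony trade-off is visible in the penalty term: a model with a slightly better log-likelihood but many more free parameters can still lose on BIC.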

The best of these models, again in terms of BIC, was a CCUC model with q = 6 latent factors; the covariance structure in this model is the same as that chosen for the leukaemia data (Section 3.2.2). The MAP classifications given by the parameter estimates associated with this model are given in Table 7; only five tissue samples were misclassified.

Table 7.

Estimated group membership for the best EPGMM model for the colon data

            1    2
Tumour     37    3
Normal      2   20

3.3.3 Hierarchical clustering, k-means, k-medoids and MCLUST

In addition to the EPGMM technique, the methods used in Section 3.2.3 were run on these colon data using the R software. The results, which are summarized in Table 8, show that the best of the non-model-based approaches was PAM, with an adjusted Rand index of 0.218. This time, MCLUST outperformed k-means clustering but was, in turn, outperformed by PAM. The EPGMM model with the greatest BIC (CCUC, q = 6) performed best overall, misclassifying just five tissues based on 10 random starts.

Table 8.

Summary results for all of the clustering techniques that were applied to the colon data

                          BIC          Rand index   Adjusted Rand index
Hierarchical (complete)   –            0.497        −0.018
Hierarchical (average)    –            0.526        −0.005
Hierarchical (single)     –            0.526        −0.014
k-means                   –            0.494        −0.016
PAM                       –            0.611        0.218
MCLUST (VII)              −81 124.36   0.500        −0.006
EPGMM (CCUC, q = 6)       −70 937.72   0.849        0.697

3.3.4 Correspondence with McLachlan et al. (2002)

Using various techniques, McLachlan et al. (2002) found five different clusterings of these data; however, none of these clusterings corresponded to tissue type. While, once again, the EPGMM results are not directly comparable with those of McLachlan et al. (2002), it is interesting to look at the second best of the EPGMM models. The second best of the EPGMM models, in terms of BIC, was a CCUU model with q = 7 latent factors. Note that this is one of the four new models that were introduced herein and, again, this model has equal covariance between pairs of genes; however, the variance structure is more complex than for the CCUC model. The MAP classification given by the parameter estimates associated with this CCUU model does not separate tumour from normal tissue. However, it is similar to what McLachlan et al. (2002) call C1, in that it seems sensible when one considers that there was a change of protocol during the experiment (Getz et al., 2000; McLachlan et al., 2002). Specifically, tissues 1–11 and 41–51 were all extracted from the first 11 patients using a poly detector, while the remaining samples were taken from the other patients using total extraction of RNA. Looking at the tissues by extraction method, rather than by tissue type, leads to the estimated classifications given in Table 9; only eight of the tissues were misclassified by this CCUU model when the data are considered by extraction method.

Table 9.

Estimated group membership for the second best EPGMM model for the colon data

                            1    2
Poly detector              19    3
Total extraction of RNA     5   35
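The idea of scoring a single two-group clustering against two competing "truths" (tissue type versus extraction protocol) can be illustrated with a toy sketch; the labels below are hypothetical and are not the real colon annotations:

```python
def misclassified(truth, pred):
    """Minimum misclassifications over the two ways of matching the
    predicted cluster labels {0, 1} to the two true classes."""
    direct = sum(t != p for t, p in zip(truth, pred))
    return min(direct, len(truth) - direct)

# Toy labels: 12 samples, two candidate ground truths, one clustering
tissue_type = [0] * 6 + [1] * 6
extraction  = [0, 0, 0, 1, 1, 1] * 2
clustering  = [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1]

print(misclassified(tissue_type, clustering))  # 5
print(misclassified(extraction, clustering))   # 1
```

Here the same clustering agrees far better with the extraction labelling than with tissue type, mirroring the way the CCUU classification in Table 9 lines up with extraction method.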

The results from applying the other methods to the colon data (cf. Table 8) can also be viewed in terms of extraction method, rather than tissue type. These results are given, along with our best CCUU model, in Table 10. From this table, it is clear that the CCUU model gives the best clustering performance of all of the approaches. Furthermore, the hierarchical (complete and average linkage), k-means and MCLUST clustering results are all better when viewed in terms of extraction method.

Table 10.

Summary results, by extraction method, for all of the clustering techniques that were applied to the colon data

                          BIC          Rand index   Adjusted Rand index
Hierarchical (complete)   –            0.518        0.024
Hierarchical (average)    –            0.545        0.035
Hierarchical (single)     –            0.526        −0.014
k-means                   –            0.526        0.048
PAM                       –            0.581        0.158
MCLUST (VII)              −81 124.36   0.526        0.045
EPGMM (CCUU, q = 7)       −71 063.71   0.772        0.542

3.3.5 Comments

The EPGMM approach gave very good clustering performance when applied to the colon data. Our approach led to the misclassification of just five tissues when these data were viewed by tissue type. This performance far exceeded that of all of the other techniques that were used; in fact, the performance of these other approaches was surprisingly poor, with only PAM giving better than random classifications (cf. Table 8). This phenomenon is partly explained when one looks at the classifications by extraction method, rather than by tissue type (cf. Table 10). In this case, only one method performed worse than random, which might suggest that techniques such as k-means clustering and MCLUST were picking up extraction method more so than tissue type. That said, the performance of these methods was only slightly better than random, which suggests that the restrictive cluster shapes imposed by k-means clustering and MCLUST were not at all suited to these data. On the other hand, the best of the new EPGMM models gave the best clustering performance by extraction method, misclassifying just eight samples.

4 DISCUSSION

The EPGMM family of models has been shown to give good clustering performance when applied to gene expression microarray data. The applications, to leukaemia and colon tissue data, were conducted as genuine clustering examples. That is, no information on the true tissue classification was used for parameter estimation or model selection; this information was only used to assess the performance of the selected model. In this context, the clustering performance of the EPGMM family can be looked upon favourably. Moreover, the performance of the EPGMM family on both data sets far exceeded that of a number of popular clustering techniques, including agglomerative hierarchical clustering and k-means clustering.

However, like the techniques of McLachlan et al. (2002), the EPGMM family relies on multiple random starts. In addition to the obvious drawback that the results are sensitive to the starting values, there is the computation time that is required. Furthermore, there is no guarantee that increasing the number of random starts will lead to better clustering results, mainly because models with greater BIC do not necessarily give better clustering performance. This phenomenon has been observed previously and work on finding better model selection techniques is ongoing. That said, the EPGMM family did perform well in the analyses in Section 3, based on random starting values and using the BIC.

5 CONCLUSION

The EPGMM family of mixture models has been introduced and used for the model-based clustering of gene expression microarray data. This family of models is an extension of the PGMM family which, in turn, is an extension of the mixtures of factor analysers model. The EPGMM family is very well suited to the analysis of high-dimensional data, for three reasons. First, each member of the EPGMM family has a number of covariance parameters that is linear in the data dimensionality. Second, as shown herein, the Woodbury identity can be used to avoid the inversion of any non-diagonal p × p matrices, leading to efficient computation. Third, as shown by McNicholas et al. (2010) in the context of the PGMM family, these models are ‘trivially parallelizable’, opening up the possibility of even more efficient parameter estimation using parallel computing. The EPGMM family was applied to two well-known gene expression microarray data sets. In both cases, the EPGMM family performed well and gave much better clusterings than several popular clustering techniques. Herein, we took G = 2 for all of the analyses but future work will focus on the selection of G.
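The computational benefit of the Woodbury identity can be verified numerically. For a factor-analysis covariance D + Lambda Lambda', the identity gives (D + Lambda Lambda')^{-1} = D^{-1} - D^{-1} Lambda (I_q + Lambda' D^{-1} Lambda)^{-1} Lambda' D^{-1}; with diagonal D and q = 1, the only explicit inverses are elementwise and scalar. A small sketch with made-up values (p = 3):

```python
# Woodbury identity for a factor-analysis covariance with q = 1:
# no non-diagonal p x p matrix is ever inverted directly.
d = [2.0, 3.0, 4.0]      # diagonal of D (p = 3); values are illustrative
lam = [1.0, 0.5, -1.5]   # single factor loading column (q = 1)
p = len(d)

d_inv = [1.0 / di for di in d]                                  # D^{-1}, elementwise
scalar = 1.0 + sum(lam[i] ** 2 * d_inv[i] for i in range(p))    # I_q + Lambda' D^{-1} Lambda
inv = [[(d_inv[i] if i == j else 0.0)
        - d_inv[i] * lam[i] * lam[j] * d_inv[j] / scalar
        for j in range(p)] for i in range(p)]

# Verify: (D + Lambda Lambda') times the computed inverse is the identity
full = [[(d[i] if i == j else 0.0) + lam[i] * lam[j] for j in range(p)]
        for i in range(p)]
check = [[sum(full[i][k] * inv[k][j] for k in range(p)) for j in range(p)]
         for i in range(p)]
print(all(abs(check[i][j] - (1.0 if i == j else 0.0)) < 1e-9
          for i in range(p) for j in range(p)))
```

For general q, the inner inverse is q × q rather than scalar; since q is much smaller than p for microarray data, the saving over a direct p × p inversion is substantial.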

ACKNOWLEDGEMENTS

The authors gratefully acknowledge the insightful and helpful comments of three anonymous reviewers.

Funding: This work was supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada, and by a Basic Research Grant (04/BR/M0057) and a Research Frontiers Grant (2007/RFP/MATF281) from Science Foundation Ireland. The high-performance computing equipment that was used was provided by Silicon Graphics Inc. through funding from the Canada Foundation for Innovation–Leaders Opportunity Fund and from the Ontario Research Fund–Research Infrastructure Program.

Conflict of Interest: none declared.

REFERENCES

Aitken,A.C. (1926) On Bernoulli's numerical solution of algebraic equations. Proc. R. Soc. Edinb., 46, 289–305.

Akaike,H. (1974) A new look at the statistical model identification. IEEE Trans. Automat. Contr., 19, 716–723.

Alon,U. et al. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl Acad. Sci. USA, 96, 6745–6750.

Biernacki,C. et al. (2000) Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell., 22, 719–725.

Böhning,D. et al. (1994) The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family. Ann. Inst. Stat. Math., 46, 373–388.

Dempster,A.P. et al. (1977) Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B, 39, 1–38.

Dudoit,S. et al. (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc., 97, 77–87.

Fraley,C. and Raftery,A.E. (1999) MCLUST: software for model-based cluster analysis. J. Classif., 16, 297–306.

Fraley,C. and Raftery,A.E. (2002) Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc., 97, 611–631.

Frühwirth-Schnatter,S. (2006) Finite Mixture and Markov Switching Models. Springer, New York.

Getz,G. et al. (2000) Coupled two-way clustering analysis of gene microarray data. Proc. Natl Acad. Sci. USA, 97, 12079–12084.

Ghahramani,Z. and Hinton,G.E. (1997) The EM algorithm for factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, Toronto.

Golub,T.R. et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.

Hartigan,J.A. and Wong,M.A. (1979) A k-means clustering algorithm. Appl. Stat., 28, 100–108.

Hubert,L. and Arabie,P. (1985) Comparing partitions. J. Classif., 2, 193–218.

Kaufman,L. and Rousseeuw,P.J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.

Lagrange,J.L. (1788) Méchanique Analitique. Chez le Veuve Desaint, Paris.

Lindsay,B.G. (1995) Mixture models: theory, geometry and applications. In NSF-CBMS Regional Conference Series in Probability and Statistics, Vol. 5. Institute of Mathematical Statistics, Hayward, CA.

Lopes,H.F. and West,M. (2004) Bayesian model assessment in factor analysis. Stat. Sin., 14, 41–67.

McLachlan,G.J. and Krishnan,T. (2008) The EM Algorithm and Extensions, 2nd edn. Wiley, New York.

McLachlan,G.J. and Peel,D. (2000) Finite Mixture Models. John Wiley & Sons, New York.

McLachlan,G.J. and Peel,D. (2000) Mixtures of factor analyzers. In Langley,P. (ed.), Seventh International Conference on Machine Learning. Morgan Kaufmann, San Francisco, pp. 599–606.

McLachlan,G.J. et al. (2002) A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18, 412–422.

McLachlan,G.J. et al. (2004) Analyzing Microarray Gene Expression Data. Wiley, Hoboken, New Jersey.

McLachlan,G.J. et al. (2006) A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics, 22, 1608–1615.

McNicholas,P.D. and Murphy,T.B. (2008) Parsimonious Gaussian mixture models. Stat. Comput., 18, 285–296.

McNicholas,P.D. and Murphy,T.B. (2010) Model-based clustering of longitudinal data. Can. J. Stat., 38, 153–168.

McNicholas,P.D. et al. (2010) Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models. Comput. Stat. Data Anal., 54, 711–723.

Meng,X.L. and Rubin,D.B. (1993) Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika, 80, 267–278.

Meng,X.L. and van Dyk,D.A. (1997) The EM algorithm—an old folk song sung to a fast new tune (with discussion). J. R. Stat. Soc. Ser. B, 59, 511–567.

Rand,W.M. (1971) Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc., 66, 846–850.

R Development Core Team (2010) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Schwarz,G. (1978) Estimating the dimension of a model. Ann. Stat., 6, 31–38.

Spearman,C. (1904) The proof and measurement of association between two things. Am. J. Psychol., 15, 72–101.

Tipping,T.E. and Bishop,C.M. (1999) Mixtures of probabilistic principal component analysers. Neural Comput., 11, 443–482.

von Luxburg,U. (2009) Clustering stability: an overview. Found. Trends Mach. Learn., 2, 235–274.

Woodbury,M.A. (1950) Inverting Modified Matrices. Statistical Research Group Memorandum Report no. 42, Princeton University, Princeton, New Jersey.

Yeung,K.Y. et al. (2001) Model-based clustering and data transformations for gene expression data. Bioinformatics, 17, 977–987.

Author notes

Associate Editor: Martin Bishop