-
PDF
- Split View
-
Views
-
Cite
Cite
Paul D. McNicholas, Thomas Brendan Murphy, Model-based clustering of microarray expression data via latent Gaussian mixture models, Bioinformatics, Volume 26, Issue 21, 1 November 2010, Pages 2705–2712, https://doi.org/10.1093/bioinformatics/btq498
Close -
Share
Abstract
Motivation: In recent years, work has been carried out on clustering gene expression microarray data. Some approaches are developed from an algorithmic viewpoint whereas others are developed via the application of mixture models. In this article, a family of eight mixture models which utilizes the factor analysis covariance structure is extended to 12 models and applied to gene expression microarray data. This modelling approach builds on previous work by introducing a modified factor analysis covariance structure, leading to a family of 12 mixture models, including parsimonious models. This family of models allows for the modelling of the correlation between gene expression levels even when the number of samples is small. Parameter estimation is carried out using a variant of the expectation–maximization algorithm and model selection is achieved using the Bayesian information criterion. This expanded family of Gaussian mixture models, known as the expanded parsimonious Gaussian mixture model (EPGMM) family, is then applied to two well-known gene expression data sets.
Results: The performance of the EPGMM family of models is quantified using the adjusted Rand index. This family of models gives very good performance, relative to existing popular clustering techniques, when applied to real gene expression microarray data.
Availability: The reduced, preprocessed data that were analysed are available at www.paulmcnicholas.info
Contact:pmcnicho@uoguelph.ca
1 INTRODUCTION
1.1 Model-based clustering
Gaussian mixture models offer an advantage over other commonly used approaches because the covariance structure can potentially account for correlation between expression levels within an expression profile. Consequently, these models are more flexible than k-means or hierarchical clustering which commonly use Euclidean distance. However, due to the high-dimensional nature of expression data, additional structure needs to be assumed for the covariance matrices, so that the model can be fitted in high-dimensional settings. The MCLUST (Fraley and Raftery, 2002) approach to model-based clustering, which utilizes eigen-decomposed covariance matrices, can only be applied to clustering expression profiles if a diagonal covariance structure is assumed; Yeung et al. (2001) were able to cluster genes using MCLUST but not expression profiles. By assuming a highly parsimonious but non-diagonal covariance structure, it is possible to cluster expression profiles while allowing for correlation between gene expressions.
In general, a structure like that given in Equation (1) can be used to model such data. Then the parameters, and hence group memberships, can be estimated using some variant of the expectation–maximization (EM) algorithm (Dempster et al., 1977). The covariance matrices Σg can be decomposed to allow the construction of more parsimonious models.
1.2 Parsimonious Gaussian mixture models
The factor analysis model (Spearman, 1904) assumes that a p-dimensional random vector Xi can be modelled using a q-dimensional vector of latent factors Ui, where q ≪ p. The model can be written Xi = μ + ΛUi + εi, where Λ is a p × q matrix of factor weights, the latent variables Ui ∼ 𝒩(0, Iq) and εi ∼ 𝒩(0, Ψ), where Ψ is a p × p diagonal matrix. Therefore, the marginal distribution of Xi is 𝒩(μ, ΛΛ′+Ψ).
To illustrate the implications of the covariance matrix attached to this marginal distribution, Σ = ΛΛ′+Ψ, suppose that Xij and Xik are expression levels from a sample Xi. Then, Cov(Xij, Xik) = σjk = ∑s=1qλjsλks for j ≠ k, and Var(Xij) = σjj = ∑s=1qλjs2+ψqq. Hence, the matrix Λ models the covariance between expression levels, and a combination of the Λ and Ψ matrices models the variance of expression levels. The factor analysis model allows for the modelling of a high-dimensional non-diagonal covariance matrix with a low number of parameters.
Ghahramani and Hinton (1997) proposed a mixture of factor analysers model given by the finite Gaussian mixture model in Equation (1), with Σg = ΛgΛg′+Ψ. McLachlan and Peel (2000b) used the more general covariance structure Σg = ΛgΛg′+Ψg. Tipping and Bishop (1999) proposed the mixtures of probabilistic principal component analysers model, for which the component covariance matrix is Σg = ΛgΛg′+ψgIp.
McNicholas and Murphy (2008) further generalized the factor analysis covariance structure by including the possibility of imposing the constraints: Λg = Λ, Ψg = Ψ and Ψg = ψgIp. The result of imposing, or not, each of these three constraints is the family of eight parsimonious Gaussian mixture models (PGMMs) that are described in Table 1. Each member of this family of models has a number of covariance parameters that is linear in data dimensionality. This is one of the reasons that this family of models is particularly well suited to the analysis of high-dimensional data. The constraints allow for assuming common structure in the component covariance matrix Σg, if appropriate. By assuming common covariance structure, a more parsimonious model can be used and this can be estimated in a more stable manner.
The covariance structure of each parsimonious Gaussian mixture model—note that the UCU, UUC and UUU models previously existed under different names, as described in Section 1.2
| Λg = Λ . | Ψg = Ψ . | Isotropic . | Covariance structure . |
|---|---|---|---|
| C | C | C | Σg = ΛΛ′+ψIp |
| C | C | U | Σg = ΛΛ′+Ψ |
| C | U | C | Σg = ΛΛ′+ψgIp |
| C | U | U | Σg = ΛΛ′+Ψg |
| U | C | C | Σg = ΛgΛg′+ψIp |
| U | C | U | Σg = ΛgΛg′+Ψ |
| U | U | C | Σg = ΛgΛg′+ψgIp |
| U | U | U | Σg = ΛgΛg′+Ψg |
| Λg = Λ . | Ψg = Ψ . | Isotropic . | Covariance structure . |
|---|---|---|---|
| C | C | C | Σg = ΛΛ′+ψIp |
| C | C | U | Σg = ΛΛ′+Ψ |
| C | U | C | Σg = ΛΛ′+ψgIp |
| C | U | U | Σg = ΛΛ′+Ψg |
| U | C | C | Σg = ΛgΛg′+ψIp |
| U | C | U | Σg = ΛgΛg′+Ψ |
| U | U | C | Σg = ΛgΛg′+ψgIp |
| U | U | U | Σg = ΛgΛg′+Ψg |
C, constrained; U, unconstrained.
The covariance structure of each parsimonious Gaussian mixture model—note that the UCU, UUC and UUU models previously existed under different names, as described in Section 1.2
| Λg = Λ . | Ψg = Ψ . | Isotropic . | Covariance structure . |
|---|---|---|---|
| C | C | C | Σg = ΛΛ′+ψIp |
| C | C | U | Σg = ΛΛ′+Ψ |
| C | U | C | Σg = ΛΛ′+ψgIp |
| C | U | U | Σg = ΛΛ′+Ψg |
| U | C | C | Σg = ΛgΛg′+ψIp |
| U | C | U | Σg = ΛgΛg′+Ψ |
| U | U | C | Σg = ΛgΛg′+ψgIp |
| U | U | U | Σg = ΛgΛg′+Ψg |
| Λg = Λ . | Ψg = Ψ . | Isotropic . | Covariance structure . |
|---|---|---|---|
| C | C | C | Σg = ΛΛ′+ψIp |
| C | C | U | Σg = ΛΛ′+Ψ |
| C | U | C | Σg = ΛΛ′+ψgIp |
| C | U | U | Σg = ΛΛ′+Ψg |
| U | C | C | Σg = ΛgΛg′+ψIp |
| U | C | U | Σg = ΛgΛg′+Ψ |
| U | U | C | Σg = ΛgΛg′+ψgIp |
| U | U | U | Σg = ΛgΛg′+Ψg |
C, constrained; U, unconstrained.
2 METHODOLOGY
2.1 Modified factor analysis covariance structure
The factor analysis covariance structure (cf. McLachlan and Peel, 2000b) can be further parameterized by writing Ψg = ωgΔg, where ωg ∈ ℝ+ and Δg = diag{δ1, δ2,…, δp} such that |Δg| = 1, for g = 1, 2,…, G. The resulting covariance structure Σg = ΛgΛg′+ωgΔg shall be known as the modified factor analysis covariance structure. Now, this covariance structure can be used within the model-based clustering framework, opening up the possibility of models that are more parsimonious than their PGMM counterparts. Specifically, constraints can be imposed on the parameters Λg, ωg and Δg leading to the 12 Gaussian mixture models illustrated in Table 2. The family of models in Table 2 will be referred to as the expanded PGMM (EPGMM) family hereafter. Table 2 contains a total of four new, parsimonious, models when compared with Table 1. Notably, all 12 members of the EPGMM family have a number of covariance parameters that is linear in the dimensionality of the data. Furthermore, the identities given in Equations (3 and 4) can be used for all 12 models.
The covariance structure, number of covariance parameters and nomenclature for each member of the EPGMM family, along with the name of the equivalent member of the PGMM family where applicable
| EPGMM nomenclature . | PGMM equivalent . | Covariance structure . | Number of covariance parameters . | |||
|---|---|---|---|---|---|---|
| Λg = Λ . | Δg = Δ . | ωg = ω . | Δg = Ip . | . | . | . |
| C | C | C | C | CCC | Σg = ΛΛ′+ωIp | [pq−q(q−1)/2]+1 |
| C | C | U | C | CUC | Σg = ΛΛ′+ωgIp | [pq−q(q−1)/2]+G |
| U | C | C | C | UCC | Σg = ΛgΛg′+ωIp | G[pq−q(q−1)/2]+1 |
| U | C | U | C | UUC | Σg = ΛgΛg′+ωgIp | G[pq−q(q−1)/2]+G |
| C | C | C | U | CCU | Σg = ΛΛ′+ωΔ | [pq−q(q−1)/2]+p |
| C | C | U | U | – | Σg = ΛΛ′+ωgΔ | [pq−q(q−1)/2]+[G+(p−1)] |
| U | C | C | U | UCU | Σg = ΛgΛg′+ωΔ | G[pq−q(q−1)/2]+p |
| U | C | U | U | – | Σg = ΛgΛg′+ωgΔ | G[pq−q(q−1)/2]+[G+(p−1)] |
| C | U | C | U | – | Σg = ΛΛ′+ωΔg | [pq−q(q−1)/2]+[1+G(p−1)] |
| C | U | U | U | CUU | Σg = ΛΛ′+ωgΔg | [pq−q(q−1)/2]+Gp |
| U | U | C | U | – | Σg = ΛgΛg′+ωΔg | G[pq−q(q−1)/2]+[1+G(p−1)] |
| U | U | U | U | UUU | Σg = ΛgΛg′+ωgΔg | G[pq−q(q−1)/2]+Gp |
| EPGMM nomenclature . | PGMM equivalent . | Covariance structure . | Number of covariance parameters . | |||
|---|---|---|---|---|---|---|
| Λg = Λ . | Δg = Δ . | ωg = ω . | Δg = Ip . | . | . | . |
| C | C | C | C | CCC | Σg = ΛΛ′+ωIp | [pq−q(q−1)/2]+1 |
| C | C | U | C | CUC | Σg = ΛΛ′+ωgIp | [pq−q(q−1)/2]+G |
| U | C | C | C | UCC | Σg = ΛgΛg′+ωIp | G[pq−q(q−1)/2]+1 |
| U | C | U | C | UUC | Σg = ΛgΛg′+ωgIp | G[pq−q(q−1)/2]+G |
| C | C | C | U | CCU | Σg = ΛΛ′+ωΔ | [pq−q(q−1)/2]+p |
| C | C | U | U | – | Σg = ΛΛ′+ωgΔ | [pq−q(q−1)/2]+[G+(p−1)] |
| U | C | C | U | UCU | Σg = ΛgΛg′+ωΔ | G[pq−q(q−1)/2]+p |
| U | C | U | U | – | Σg = ΛgΛg′+ωgΔ | G[pq−q(q−1)/2]+[G+(p−1)] |
| C | U | C | U | – | Σg = ΛΛ′+ωΔg | [pq−q(q−1)/2]+[1+G(p−1)] |
| C | U | U | U | CUU | Σg = ΛΛ′+ωgΔg | [pq−q(q−1)/2]+Gp |
| U | U | C | U | – | Σg = ΛgΛg′+ωΔg | G[pq−q(q−1)/2]+[1+G(p−1)] |
| U | U | U | U | UUU | Σg = ΛgΛg′+ωgΔg | G[pq−q(q−1)/2]+Gp |
C, constrained, U, unconstrained.
The covariance structure, number of covariance parameters and nomenclature for each member of the EPGMM family, along with the name of the equivalent member of the PGMM family where applicable
| EPGMM nomenclature . | PGMM equivalent . | Covariance structure . | Number of covariance parameters . | |||
|---|---|---|---|---|---|---|
| Λg = Λ . | Δg = Δ . | ωg = ω . | Δg = Ip . | . | . | . |
| C | C | C | C | CCC | Σg = ΛΛ′+ωIp | [pq−q(q−1)/2]+1 |
| C | C | U | C | CUC | Σg = ΛΛ′+ωgIp | [pq−q(q−1)/2]+G |
| U | C | C | C | UCC | Σg = ΛgΛg′+ωIp | G[pq−q(q−1)/2]+1 |
| U | C | U | C | UUC | Σg = ΛgΛg′+ωgIp | G[pq−q(q−1)/2]+G |
| C | C | C | U | CCU | Σg = ΛΛ′+ωΔ | [pq−q(q−1)/2]+p |
| C | C | U | U | – | Σg = ΛΛ′+ωgΔ | [pq−q(q−1)/2]+[G+(p−1)] |
| U | C | C | U | UCU | Σg = ΛgΛg′+ωΔ | G[pq−q(q−1)/2]+p |
| U | C | U | U | – | Σg = ΛgΛg′+ωgΔ | G[pq−q(q−1)/2]+[G+(p−1)] |
| C | U | C | U | – | Σg = ΛΛ′+ωΔg | [pq−q(q−1)/2]+[1+G(p−1)] |
| C | U | U | U | CUU | Σg = ΛΛ′+ωgΔg | [pq−q(q−1)/2]+Gp |
| U | U | C | U | – | Σg = ΛgΛg′+ωΔg | G[pq−q(q−1)/2]+[1+G(p−1)] |
| U | U | U | U | UUU | Σg = ΛgΛg′+ωgΔg | G[pq−q(q−1)/2]+Gp |
| EPGMM nomenclature . | PGMM equivalent . | Covariance structure . | Number of covariance parameters . | |||
|---|---|---|---|---|---|---|
| Λg = Λ . | Δg = Δ . | ωg = ω . | Δg = Ip . | . | . | . |
| C | C | C | C | CCC | Σg = ΛΛ′+ωIp | [pq−q(q−1)/2]+1 |
| C | C | U | C | CUC | Σg = ΛΛ′+ωgIp | [pq−q(q−1)/2]+G |
| U | C | C | C | UCC | Σg = ΛgΛg′+ωIp | G[pq−q(q−1)/2]+1 |
| U | C | U | C | UUC | Σg = ΛgΛg′+ωgIp | G[pq−q(q−1)/2]+G |
| C | C | C | U | CCU | Σg = ΛΛ′+ωΔ | [pq−q(q−1)/2]+p |
| C | C | U | U | – | Σg = ΛΛ′+ωgΔ | [pq−q(q−1)/2]+[G+(p−1)] |
| U | C | C | U | UCU | Σg = ΛgΛg′+ωΔ | G[pq−q(q−1)/2]+p |
| U | C | U | U | – | Σg = ΛgΛg′+ωgΔ | G[pq−q(q−1)/2]+[G+(p−1)] |
| C | U | C | U | – | Σg = ΛΛ′+ωΔg | [pq−q(q−1)/2]+[1+G(p−1)] |
| C | U | U | U | CUU | Σg = ΛΛ′+ωgΔg | [pq−q(q−1)/2]+Gp |
| U | U | C | U | – | Σg = ΛgΛg′+ωΔg | G[pq−q(q−1)/2]+[1+G(p−1)] |
| U | U | U | U | UUU | Σg = ΛgΛg′+ωgΔg | G[pq−q(q−1)/2]+Gp |
C, constrained, U, unconstrained.
2.2 Parameter estimation for the EPGMM family
2.2.1 Introduction
Estimation of the model parameters, via the AECM algorithm, is analogous to that of the PGMM parameter estimation procedure described by McNicholas and Murphy (2008). The estimates for the eight pre-existing models are obtained from the PGMM estimates by writing Ψg = |Ψg|1/pΨg/|Ψg|1/p, and then setting ωg = |Ψg|1/p and Δg = Ψg/|Ψg|1/p. However, the derivation of the maximum likelihood estimates of the model parameters for the new models, requires the method of Lagrange multipliers (Lagrange, 1788). Parameter estimates for the CCUU model are derived in Section 2.2.2 and derivations for the other three new models are given at the end of said section.
2.2.2 AECM algorithm
The EM algorithm is an iterative technique for finding maximum likelihood estimates when data are incomplete, or are treated as incomplete. In the expectation step (E-step), the expected value of the complete-data log-likelihood (Q, say) is computed, where the complete-data is the missing data plus the observed data. Then in the maximization step (M-step), Q is maximized with respect to the model parameters.
In the expectation-conditional maximization (ECM) algorithm (Meng and Rubin, 1993), the M-step is replaced by a number of conditional maximization (CM) steps. The AECM algorithm (Meng and van Dyk, 1997) is an extension of the ECM algorithm that permits different specification of the complete-data at each stage. Extensive details on the EM algorithm and variants thereof are given by McLachlan and Krishnan (2008).
herein. The function 𝒬1 is then maximized in the CM-step to give
and
where
and n = ∑g=1Gng.
and the sufficient statistics for the factors Uig are replaced by respectively, where βg = Λg′(ΛgΛg′+ωgΔg)−1, to give 𝒬2. The CM-step at this second stage will depend on the model. Consider the CCUU model, so that Λg = Λ and Δg = Δ. In this case, the expected complete-data log-likelihood 𝒬2(Λ, ωg, Δ) can be written where C is constant with respect to Λ, ωg and Δ, and
.
gives and solving
gives Solving
leads to But
is a diagonal matrix with
, therefore where ξj is the j-th element along the diagonal of the matrix Therefore, it follows that
and
is the i-th element along the diagonal of the matrix
. The other estimates are where ξgj is the j-th element along the diagonal of the matrix 
.Note that the predicted clustering for each member of the EPGMM family is given by the maximum a posteriori (MAP) classification. That is, the posterior predicted component membership of tissue i is the value of g for which
is the greatest.
2.3 Convergence and model selection
2.3.1 Convergence criterion
2.3.2 Model selection
The Bayesian information criterion (BIC Schwarz, 1978) is used to select the best member of the EPGMM family, in terms of both model and number of factors. Note that the BIC can also be used to select the number of mixture components (cf. Fraley and Raftery, 1999; McNicholas and Murphy, 2008) but this is not necessary for the analyses herein since we fix G = 2. For a model with parameters θ, the BIC is given by
where
is the maximized log-likelihood,
is the maximum likelihood estimate of θ, m is the number of free parameters in the model and n is the number of observations. The effectiveness of the BIC for choosing the number of factors in a factor analysis model has been established by Lopes and West (2004), while McNicholas et al. (2010) provide practical evidence that the BIC performs well in choosing the number of factors for the PGMM family of models.
A number of other model selection criteria could be used including the Akaike information criterion (AIC; Akaike, 1974), the integrated completed likelihood (ICL; Biernacki et al., 2000) and clustering stability (cf. von Luxburg, 2009). However, we found that the BIC gave a quick solution and generally good clustering results.
3 ANALYSES
3.1 Dimensionality reduction
McLachlan et al. (2002) analysed two microarray gene expression data sets—one on leukaemia data and another on colon tissue samples—using the EMMIX-GENE approach. The first stage of this approach focuses on data reduction where, initially, one- and two-component mixtures of t-distributions are fitted to the data. Then a gene is retained only if two conditions are satisfied.
When fitting the two- and three-component mixture models for this purpose, starting values for the component memberships are defined randomly or by using starting values based on k-means clustering results. This whole process represents the first stage of the EMMIX-GENE approach and can be carried out using the select-genes software that accompanies McLachlan et al. (2004). For the analyses herein, the select-genes software is used with thresholds a1 = a2 = 8, as in McLachlan et al. (2002), and 50 random and 50 k-means starts.
3.2 Leukaemia data
3.2.1 The data
Golub et al. (1999) presented data on two forms of acute leukaemia: acute lymphoblastic leukaemia (ALL) and acute myeloid leukaemia (AML). Affymetrix arrays were used to collect measurements for 7129 genes on 72 tissues. There were a total of 47 ALL tissues and 25 with AML. The data were sourced from the web site accompanying McLachlan et al. (2004, www.maths.uq.edu.au/∼gjm/emmix-gene/) and so they had been preprocessed (Dudoit et al., 2002; McLachlan et al., 2002) as follows. Following this preprocessing, a total of 3731 genes remained. This number was further reduced to 2030 following application of the select-genes software (cf. Section 3.1).
Genes with expression less than 100 or greater than 16 000 were removed.
Genes with expressions satisfying max/min≤5 and max−min≤500 were removed.
The natural logarithm was taken.
3.2.2 The EPGMM approach
Treating this as a clustering problem where the form of leukaemia is unknown, all 12 members of the EPGMM family (Table 2) were fitted to these data for G = 2, q=1,…, 6 and 10 different random starting values for the
. The BIC for the best q for each of the 12 members of the EPGMM family is given in Table 3.
The BIC for the best q for each of the 12 members of the EPGMM family for the leukaemia data
| Model . | q . | BIC . | Model . | q . | BIC . |
|---|---|---|---|---|---|
| CCCC | 3 | −411 646.50 | CCUC | 3 | −411 566.29 |
| UCCC | 1 | −416 954.56 | UCUC | 1 | −416 803.57 |
| CCCU | 4 | −414 615.22 | CCUUa | 5 | −413 207.29 |
| UCCU | 1 | −423 354.79 | UCUUa | 1 | −422 089.38 |
| CUCUa | 4 | −413 966.90 | CUUU | 5 | −413 978.04 |
| UUCUa | 1 | −423 933.46 | UUUU | 1 | −423 532.04 |
| Model . | q . | BIC . | Model . | q . | BIC . |
|---|---|---|---|---|---|
| CCCC | 3 | −411 646.50 | CCUC | 3 | −411 566.29 |
| UCCC | 1 | −416 954.56 | UCUC | 1 | −416 803.57 |
| CCCU | 4 | −414 615.22 | CCUUa | 5 | −413 207.29 |
| UCCU | 1 | −423 354.79 | UCUUa | 1 | −422 089.38 |
| CUCUa | 4 | −413 966.90 | CUUU | 5 | −413 978.04 |
| UUCUa | 1 | −423 933.46 | UUUU | 1 | −423 532.04 |
aOne of the four new models.
The BIC for the best q for each of the 12 members of the EPGMM family for the leukaemia data
| Model . | q . | BIC . | Model . | q . | BIC . |
|---|---|---|---|---|---|
| CCCC | 3 | −411 646.50 | CCUC | 3 | −411 566.29 |
| UCCC | 1 | −416 954.56 | UCUC | 1 | −416 803.57 |
| CCCU | 4 | −414 615.22 | CCUUa | 5 | −413 207.29 |
| UCCU | 1 | −423 354.79 | UCUUa | 1 | −422 089.38 |
| CUCUa | 4 | −413 966.90 | CUUU | 5 | −413 978.04 |
| UUCUa | 1 | −423 933.46 | UUUU | 1 | −423 532.04 |
| Model . | q . | BIC . | Model . | q . | BIC . |
|---|---|---|---|---|---|
| CCCC | 3 | −411 646.50 | CCUC | 3 | −411 566.29 |
| UCCC | 1 | −416 954.56 | UCUC | 1 | −416 803.57 |
| CCCU | 4 | −414 615.22 | CCUUa | 5 | −413 207.29 |
| UCCU | 1 | −423 354.79 | UCUUa | 1 | −422 089.38 |
| CUCUa | 4 | −413 966.90 | CUUU | 5 | −413 978.04 |
| UUCUa | 1 | −423 933.46 | UUUU | 1 | −423 532.04 |
aOne of the four new models.
The best of these models, in terms of BIC, was a CCUC model with q = 3 latent factors. The chosen model has a non-diagonal covariance structure where the covariance between pairs of genes is equal across different clusters but the variance of each gene is unequal across different clusters (Section 1.2). The MAP classifications arising from the parameter estimates associated with this model are given in Table 4; only five tissue samples were misclassified.
Estimated group membership for the best EPGMM model for the leukaemia data
| . | 1 . | 2 . |
|---|---|---|
| ALL | 42 | 0 |
| AML | 5 | 25 |
| . | 1 . | 2 . |
|---|---|---|
| ALL | 42 | 0 |
| AML | 5 | 25 |
Estimated group membership for the best EPGMM model for the leukaemia data
| . | 1 . | 2 . |
|---|---|---|
| ALL | 42 | 0 |
| AML | 5 | 25 |
| . | 1 . | 2 . |
|---|---|---|
| ALL | 42 | 0 |
| AML | 5 | 25 |
3.2.3 Hierarchical clustering, k -means, k -medoids and MCLUST
In addition to the EPGMM technique, several other techniques were applied to these data using the R software (R Development Core Team, 2010). Agglomerative hierarchical clustering was used, with Euclidean distance and three different linkage methods: complete, average and single. The k-means (cf. Hartigan and Wong, 1979) and k-medoids techniques were also used. In the latter case, the partitioning around medoids (PAM; cf. Kaufman and Rousseeuw, 1990, Chapter 2) algorithm was used. Finally, in order to compare our model-based clustering approach to the well-established MCLUST approach, we used the mclust package (Fraley and Raftery, 1999) for the R software.
The results, which are summarized in Table 5, give the Rand and adjusted Rand indices as measures of class agreement. The Rand index (Rand, 1971) is based on pairwise agreements and disagreements, and the adjusted Rand index (Hubert and Arabie, 1985) is effectively the Rand index corrected for random chance. These indices reveal that the best of the non-model-based approaches was k-means clustering, with an adjusted Rand index of 0.187. In fact, k-means clustering narrowly outperformed mclust on these data, but the EPGMM model with the greatest BIC (CCUC, q = 3) was the best model overall.
Summary results for all of the clustering techniques that were applied to the leukaemia data
| . | BIC . | Rand index . | Adjusted Rand index . |
|---|---|---|---|
| Hierarchical (complete) | – | 0.532 | 0.058 |
| Hierarchical (average) | – | 0.525 | −0.024 |
| Hierarchical (single) | – | 0.532 | −0.013 |
| k-means | – | 0.593 | 0.187 |
| PAM | – | 0.518 | 0.023 |
| MCLUST (VII) | −416 293.2 | 0.593 | 0.186 |
| EPGMM (CCUC, q = 3) | −411 566.3 | 0.869 | 0.738 |
| . | BIC . | Rand index . | Adjusted Rand index . |
|---|---|---|---|
| Hierarchical (complete) | – | 0.532 | 0.058 |
| Hierarchical (average) | – | 0.525 | −0.024 |
| Hierarchical (single) | – | 0.532 | −0.013 |
| k-means | – | 0.593 | 0.187 |
| PAM | – | 0.518 | 0.023 |
| MCLUST (VII) | −416 293.2 | 0.593 | 0.186 |
| EPGMM (CCUC, q = 3) | −411 566.3 | 0.869 | 0.738 |
Summary results for all of the clustering techniques that were applied to the leukaemia data
| . | BIC . | Rand index . | Adjusted Rand index . |
|---|---|---|---|
| Hierarchical (complete) | – | 0.532 | 0.058 |
| Hierarchical (average) | – | 0.525 | −0.024 |
| Hierarchical (single) | – | 0.532 | −0.013 |
| k-means | – | 0.593 | 0.187 |
| PAM | – | 0.518 | 0.023 |
| MCLUST (VII) | −416 293.2 | 0.593 | 0.186 |
| EPGMM (CCUC, q = 3) | −411 566.3 | 0.869 | 0.738 |
| . | BIC . | Rand index . | Adjusted Rand index . |
|---|---|---|---|
| Hierarchical (complete) | – | 0.532 | 0.058 |
| Hierarchical (average) | – | 0.525 | −0.024 |
| Hierarchical (single) | – | 0.532 | −0.013 |
| k-means | – | 0.593 | 0.187 |
| PAM | – | 0.518 | 0.023 |
| MCLUST (VII) | −416 293.2 | 0.593 | 0.186 |
| EPGMM (CCUC, q = 3) | −411 566.3 | 0.869 | 0.738 |
3.2.4 The EMMIX-GENE approach
McLachlan et al. (2002) analysed the same data using the EMMIX-GENE approach with four random and four k-means starts in the first stage, which reduced the number of genes to 2015. In the second stage, a mixture of 40 normal distributions with isotropic covariance structure was fitted to the 2015 genes. Two of these groups (Groups 1 and 3) provided clusterings that were most similar to the type of leukaemia—of course, in a real clustering scenario this could not be established. A two-component mixture of factor analysers, with q = 6 factors, was fitted to the data using the genes from Groups 1 and 3, respectively. Using the genes from Group 1 led to the misclassification of 13 tissues and using those from Group 3 led to the misclassification of six tissues. Note that McLachlan et al. (2002) did not specify how many different random starts were used but, based on other analyses, it seems likely that 50 random and 50 k-means starts were used.
3.2.5 Two other approaches
In addition to the EMMIX-Gene approach, McLachlan et al. (2002) used two other approaches to cluster the leukaemia tissues. In both cases, the first stage was identical to that described in Section 3.2.4. The first alternative approach was to cluster the tissues based on the 40 fitted group means and the top 50 of the 2015 genes. Fitting a two-component mixture of factor analysers with q = 8 factors to these data, using 50 random and 50 k-means starts, led to the misclassification of just one tissue. The second alternative approach was to base the analysis on the top 50 genes. Fitting a two-component mixture of factor analysers, with q = 8 factors to these data, using 50 random and 50 k-means starts, led to the misclassification of 10 tissues.
3.2.6 Comments
The EPGMM approach gave very good clustering performance when applied to the leukaemia data. This approach used 10 random starts and led to the misclassification of just five tissues. This performance far exceeds that of agglomerative hierarchical clustering, k-means clustering, PAM and MCLUST. In fact, the best of all of these techniques had an adjusted Rand index of 0.187, while the best EPGMM model had an adjusted Rand index of 0.738. Although, in one instance, one of the approaches of McLachlan et al. (2002) returned a better predicted classification, it is difficult to make a direct comparison to the EPGMM approach. This difficulty arises because the EPGMM approach is a genuine clustering approach, while the methods described in Sections 3.2.4 and 3.2.5 assumed, to some extent, knowledge of the truth. This knowledge was clearly used in the analyses described in Section 3.2.4 but was used in a less obvious fashion in the analyses given in Section 3.2.5. In this latter case, the choice of the number of clusters (40) was validated in some sense by the fact that two of the groups give classifications that were similar to the true leukaemia type. In fact, as mentioned by McLachlan et al. (2002), an objective technique for choosing this number is not possible since genes cannot be assumed to be independently distributed within a tissue sample. Furthermore, it is quite likely that the number of factors q was selected, in each case, to give the best classification. This could be done objectively, as in Section 3.2.2, using the BIC. Finally, any comparison between the EPGMM approach and the approaches of McLachlan et al. (2002) would have to be taken in context with the fact that different subsets of the 3731 genes are used in each case.
3.3 Colon data
3.3.1 The data
Alon et al. (1999) presented gene expression data on 62 colon tissue samples, of which 40 were tumours and the remaining 22 were normal. Affymetrix arrays were used to collect measurements for 6500 gene expressions on all 62 tissues. Following Alon et al. (1999) and McLachlan et al. (2002), only the 2000 genes with the highest minimal intensity are focused upon. The data were again sourced from the web site mentioned in Section 3.2.1 and, this time, the only preprocessing was the taking of natural logarithms, followed by normalization. Application of the select-genes software, with the settings specified in Section 3.1, led to the reduction of the number of genes from 2000 to just 461.
3.3.2 The EPGMM approach
Treating this as a clustering problem where the type of tissue is unknown, all 12 members of the EPGMM family (Table 2) were fitted to these data for G = 2, q = 1,…, 10 and 10 different random starting values for the
. The BIC for the best q for each of the 12 members of the EPGMM family is given in Table 6.
The BIC for the best q for each of the 12 members of the EPGMM family for the colon data
| Model . | q . | BIC . | Model . | q . | BIC . |
|---|---|---|---|---|---|
| CCCC | 4 | −79 085.73 | CCUC | 6 | −70 937.72 |
| UCCC | 3 | −77 267.65 | UCUC | 3 | −77 268.16 |
| CCCU | 8 | −71 064.11 | CCUUa | 7 | −71 063.71 |
| UCCU | 3 | −77 310.19 | UCUUa | 4 | −77 532.97 |
| CUCUa | 8 | −71 609.10 | CUUU | 8 | −71 631.35 |
| UUCUa | 2 | −78 458.83 | UUUU | 2 | −78 306.33 |
| Model . | q . | BIC . | Model . | q . | BIC . |
|---|---|---|---|---|---|
| CCCC | 4 | −79 085.73 | CCUC | 6 | −70 937.72 |
| UCCC | 3 | −77 267.65 | UCUC | 3 | −77 268.16 |
| CCCU | 8 | −71 064.11 | CCUUa | 7 | −71 063.71 |
| UCCU | 3 | −77 310.19 | UCUUa | 4 | −77 532.97 |
| CUCUa | 8 | −71 609.10 | CUUU | 8 | −71 631.35 |
| UUCUa | 2 | −78 458.83 | UUUU | 2 | −78 306.33 |
aOne of the four new models.
The BIC for the best q for each of the 12 members of the EPGMM family for the colon data
| Model . | q . | BIC . | Model . | q . | BIC . |
|---|---|---|---|---|---|
| CCCC | 4 | −79 085.73 | CCUC | 6 | −70 937.72 |
| UCCC | 3 | −77 267.65 | UCUC | 3 | −77 268.16 |
| CCCU | 8 | −71 064.11 | CCUUa | 7 | −71 063.71 |
| UCCU | 3 | −77 310.19 | UCUUa | 4 | −77 532.97 |
| CUCUa | 8 | −71 609.10 | CUUU | 8 | −71 631.35 |
| UUCUa | 2 | −78 458.83 | UUUU | 2 | −78 306.33 |
| Model . | q . | BIC . | Model . | q . | BIC . |
|---|---|---|---|---|---|
| CCCC | 4 | −79 085.73 | CCUC | 6 | −70 937.72 |
| UCCC | 3 | −77 267.65 | UCUC | 3 | −77 268.16 |
| CCCU | 8 | −71 064.11 | CCUUa | 7 | −71 063.71 |
| UCCU | 3 | −77 310.19 | UCUUa | 4 | −77 532.97 |
| CUCUa | 8 | −71 609.10 | CUUU | 8 | −71 631.35 |
| UUCUa | 2 | −78 458.83 | UUUU | 2 | −78 306.33 |
aOne of the four new models.
The best of these models, again in terms of BIC, was a CCUC model with q = 6 latent factors; the covariance structure in this model is the same as that chosen for the leukaemia data (Section 3.2.2). The MAP classifications given by the parameter estimates associated with this model are given in Table 7; only five tissue samples were misclassified.
Estimated group membership for the best EPGMM model for the colon data
| . | 1 . | 2 . |
|---|---|---|
| Tumour | 37 | 3 |
| Normal | 2 | 20 |
| . | 1 . | 2 . |
|---|---|---|
| Tumour | 37 | 3 |
| Normal | 2 | 20 |
Estimated group membership for the best EPGMM model for the colon data
| . | 1 . | 2 . |
|---|---|---|
| Tumour | 37 | 3 |
| Normal | 2 | 20 |
| . | 1 . | 2 . |
|---|---|---|
| Tumour | 37 | 3 |
| Normal | 2 | 20 |
3.3.3 Hierarchical clustering, k -means, k -medoids and MCLUST
In addition to the EPGMM technique, the methods used in Section 3.2.3 were run on these colon data using the R software. The results, which are summarized in Table 8, suggest that the best of the non-model-based approaches was PAM, with an adjusted Rand index of 0.218. This time, mclust outperformed k-means clustering but PAM outperformed mclust. The EPGMM model with the greatest BIC (CCUC, q = 6) was the best model, misclassifying just five tissues based on 10 random starts.
Summary results for all of the clustering techniques that were applied to the colon data
| . | BIC . | Rand index . | Adjusted Rand index . |
|---|---|---|---|
| Hierarchical (complete) | – | 0.497 | −0.018 |
| Hierarchical (average) | – | 0.526 | −0.005 |
| Hierarchical (single) | – | 0.526 | −0.014 |
| k-means | – | 0.494 | −0.016 |
| PAM | – | 0.611 | 0.218 |
| MCLUST (VII) | −81 124.36 | 0.500 | −0.006 |
| EPGMM (CCUC, q = 6) | −70 937.72 | 0.849 | 0.697 |
| . | BIC . | Rand index . | Adjusted Rand index . |
|---|---|---|---|
| Hierarchical (complete) | – | 0.497 | −0.018 |
| Hierarchical (average) | – | 0.526 | −0.005 |
| Hierarchical (single) | – | 0.526 | −0.014 |
| k-means | – | 0.494 | −0.016 |
| PAM | – | 0.611 | 0.218 |
| MCLUST (VII) | −81 124.36 | 0.500 | −0.006 |
| EPGMM (CCUC, q = 6) | −70 937.72 | 0.849 | 0.697 |
Summary results for all of the clustering techniques that were applied to the colon data
| . | BIC . | Rand index . | Adjusted Rand index . |
|---|---|---|---|
| Hierarchical (complete) | – | 0.497 | −0.018 |
| Hierarchical (average) | – | 0.526 | −0.005 |
| Hierarchical (single) | – | 0.526 | −0.014 |
| k-means | – | 0.494 | −0.016 |
| PAM | – | 0.611 | 0.218 |
| MCLUST (VII) | −81 124.36 | 0.500 | −0.006 |
| EPGMM (CCUC, q = 6) | −70 937.72 | 0.849 | 0.697 |
| . | BIC . | Rand index . | Adjusted Rand index . |
|---|---|---|---|
| Hierarchical (complete) | – | 0.497 | −0.018 |
| Hierarchical (average) | – | 0.526 | −0.005 |
| Hierarchical (single) | – | 0.526 | −0.014 |
| k-means | – | 0.494 | −0.016 |
| PAM | – | 0.611 | 0.218 |
| MCLUST (VII) | −81 124.36 | 0.500 | −0.006 |
| EPGMM (CCUC, q = 6) | −70 937.72 | 0.849 | 0.697 |
3.3.4 Correspondence with McLachlan et al. (2002 )
Using various techniques, McLachlan et al. (2002) found five different clusterings of these data. However, none of these clusterings corresponded to the tissue type. While, once again, the EPGMM results are not directly comparable with those of McLachlan et al. (2002), it is interesting to look at the second best of the EPGMM models. The second best of the EPGMM models, in terms of BIC, was a CCUU model with q = 7 latent factors. Note that this is one of the four new models that were introduced herein and, again, this model has equal covariance between pairs of genes; however, the variance structure is more complex than for the CCUC model. The MAP classification given by the parameter estimates associated with this CCUU model do not separate tumour from normal tissue. However, they are similar to what McLachlan et al. (2002) call C1, in that they seem sensible when one considers that there was a change of protocol during the experiment (Getz et al., 2000; McLachlan et al., 2002). Specifically, tissues 1–11 and 41–51 were all extracted from the first 11 patients using a poly detector, while the remaining samples were taken from the other patients using total extraction of RNA. Looking at the tissues by extraction method, rather than by tissue type, leads to the estimated classifications given in Table 9; only eight of the tissues were misclassified by this CCUU model when the data are considered by extraction method.
Estimated group membership for the second best EPGMM model for the colon data
| . | 1 . | 2 . |
|---|---|---|
| Poly detector | 19 | 3 |
| Total extraction of RNA | 5 | 35 |
| . | 1 . | 2 . |
|---|---|---|
| Poly detector | 19 | 3 |
| Total extraction of RNA | 5 | 35 |
Estimated group membership for the second best EPGMM model for the colon data
| . | 1 . | 2 . |
|---|---|---|
| Poly detector | 19 | 3 |
| Total extraction of RNA | 5 | 35 |
| . | 1 . | 2 . |
|---|---|---|
| Poly detector | 19 | 3 |
| Total extraction of RNA | 5 | 35 |
The results from applying the other methods to the colon data (cf. Table 8), can also be viewed in terms of extraction method, rather than tissue type. These results are given, along with our best CCUU model, in Table 10. From this table, it is clear that our CCUU model gives the best clustering performance of all of the approaches. Furthermore, the hierarchical (complete and average linkage), k-means, and mclust clustering results are all better when viewed in terms of extraction method.
Summary results, by extraction method, for all of the clustering techniques that were applied to the colon data
| . | BIC . | Rand index . | Adjusted Rand index . |
|---|---|---|---|
| Hierarchical (complete) | – | 0.518 | 0.024 |
| Hierarchical (average) | – | 0.545 | 0.035 |
| Hierarchical (single) | – | 0.526 | −0.014 |
| k-means | – | 0.526 | 0.048 |
| PAM | – | 0.581 | 0.158 |
| MCLUST (VII) | −81 124.36 | 0.526 | 0.045 |
| EPGMM (CCUU, q = 7) | −71 063.71 | 0.772 | 0.542 |
| . | BIC . | Rand index . | Adjusted Rand index . |
|---|---|---|---|
| Hierarchical (complete) | – | 0.518 | 0.024 |
| Hierarchical (average) | – | 0.545 | 0.035 |
| Hierarchical (single) | – | 0.526 | −0.014 |
| k-means | – | 0.526 | 0.048 |
| PAM | – | 0.581 | 0.158 |
| MCLUST (VII) | −81 124.36 | 0.526 | 0.045 |
| EPGMM (CCUU, q = 7) | −71 063.71 | 0.772 | 0.542 |
Summary results, by extraction method, for all of the clustering techniques that were applied to the colon data
| . | BIC . | Rand index . | Adjusted Rand index . |
|---|---|---|---|
| Hierarchical (complete) | – | 0.518 | 0.024 |
| Hierarchical (average) | – | 0.545 | 0.035 |
| Hierarchical (single) | – | 0.526 | −0.014 |
| k-means | – | 0.526 | 0.048 |
| PAM | – | 0.581 | 0.158 |
| MCLUST (VII) | −81 124.36 | 0.526 | 0.045 |
| EPGMM (CCUU, q = 7) | −71 063.71 | 0.772 | 0.542 |
| . | BIC . | Rand index . | Adjusted Rand index . |
|---|---|---|---|
| Hierarchical (complete) | – | 0.518 | 0.024 |
| Hierarchical (average) | – | 0.545 | 0.035 |
| Hierarchical (single) | – | 0.526 | −0.014 |
| k-means | – | 0.526 | 0.048 |
| PAM | – | 0.581 | 0.158 |
| MCLUST (VII) | −81 124.36 | 0.526 | 0.045 |
| EPGMM (CCUU, q = 7) | −71 063.71 | 0.772 | 0.542 |
3.3.5 Comments
The EPGMM approach gave very good clustering performance when applied to the colon data. Our approach led to the misclassification of just five tissues, when these data were viewed by tissue type. This performance far exceeded that of all of the other techniques that were used—in fact, the performance of these other approaches was surprisingly poor, with only PAM giving better than random classifications (cf. Table 8). This phenomenon is partly explained when one looks at the classifications by extraction method, rather than by tissue type (cf. Table 10). In this case, only one method performed worse than random, which might suggest that techniques such as k-means clustering and MCLUST were picking up extraction method more-so than tissue type. That said, the performance of these methods was only slightly better than random which suggests that the restrictive cluster shapes imposed by k-means clustering and MCLUST were not at all suited to the data. On the other hand, the best of the new EPGMM models gave the based clustering performance, misclassifying just eight samples.
4 DISCUSSION
The EPGMM family of models has been shown to give good clustering performance when applied to gene expression microarray data. These applications, concerning leukaemia and colon tissue data, respectively, were conducted as genuine clustering examples. That is, no information on the true tissue classification was used for parameter estimation or model selection. In fact, this information was only used to assess the performance of the selected model. In this context, the clustering performance of the EPGMM family can be looked upon favourably. Moreover, the performance of the EPGMM family on both data sets far exceeded that of a number of popular clustering techniques, including agglomerative hierarchical clustering and k-means clustering.
However, like the techniques of McLachlan et al. (2002), the EPGMM family relies on multiple random starts. In addition to the obvious drawback of the sensitivity of results to the starting values, there is the computation time that is required. Furthermore, there is no guarantee that increasing the number of random starts will lead to better clustering results. This is due, in the main, to the fact that models with greater BIC do not necessarily give better clustering performance. This phenomenon has been observed previously and work into finding better model selection techniques is ongoing. That said, the EPGMM family did perform well in the analyses in Section 3, based on random starting values and using the BIC.
5 CONCLUSION
The EPGMM family of mixture models has been introduced and used for the model-based clustering of gene expression microarray data. This family of models is an extension of the PGMM family of models which, in turn, is an extension of the mixtures of factor analysers model. The EPGMM family of models are very well suited to the analysis of high-dimensional data. The reason for this suitability is 3-fold. First, each member of the EPGMM family has a number of covariance parameters that is linear in the data dimensionality. Second, as shown herein, the Woodbury identity can be used to avoid the inversion of any non-diagonal p × p matrices, leading to efficient computation. Thirdly, as shown by McNicholas et al. (2010) in the context of the PGMM family, these models are ‘trivially parallelizable’, opening up the possibility of even more efficient parameter estimation using parallel computing. The EPGMM family was applied to two well-known gene expression microarray data sets. In both cases, the EPGMM family performed well and gave much better clusterings than several popular clustering techniques. Herein, we took G = 2 for all of the analysis but future work will focus on the selection of G.
ACKNOWLEDGEMENTS
The authors gratefully acknowledge the insightful and helpful comments of three anonymous reviewers.
Funding: This work was supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada, and by a Basic Research Grant (04/BR/M0057) and a Research Frontiers Grant (2007/RFP/MATF281) from Science Foundation Ireland. The high-performance computing equipment that was used was provided by Silicon Graphics Inc. through funding from the Canada Foundation for Innovation–Leaders Opportunity Fund and from the Ontario Research Fund–Research Infrastructure Program.
Conflict of Interest: none declared.
REFERENCES
Author notes
Associate Editor: Martin Bishop























