Discriminative variable subsets in Bayesian classification with mixture models, with application in flow cytometry studies

We discuss the evaluation of subsets of variables for the discriminative evidence they provide in multivariate mixture modeling for classification. The novel development of Bayesian classification analysis presented is partly motivated by problems of design and selection of variables in biomolecular studies, particularly involving widely used assays of large-scale single-cell data generated using flow cytometry technology. For such studies and for mixture modeling generally, we define discriminative analysis that overlays fitted mixture models using a natural measure of concordance between mixture component densities, and define an effective and computationally feasible method for assessing and prioritizing subsets of variables according to their roles in discrimination of one or more mixture components. We relate the new discriminative information measures to Bayesian classification probabilities and error rates, and exemplify their use in Bayesian analysis of Dirichlet process mixture models fitted via Markov chain Monte Carlo methods as well as using a novel Bayesian expectation–maximization algorithm. We present a series of theoretical and simulated data examples to fix concepts and exhibit the utility of the approach, and compare with prior approaches. We demonstrate application in the context of automatic classification and discriminative variable selection in high-throughput systems biology using large flow cytometry datasets.


Mixture of Gaussians under truncated Dirichlet process priors
We use the truncated Dirichlet process Gaussian mixture model of Ishwaran and James (2001), in which p-vector observations x follow the mixture of equation (1) with prior hierarchically defined as follows: π_1 = V_1 and π_j = (1 − V_1) · · · (1 − V_{j−1})V_j for 1 < j < J, with V_j | a ∼ B(1, a) for j = 1, ..., J − 1 and a ∼ G(e, f), for specified hyperparameters (e, f, m, t, k, K) and some fixed (large) upper bound J on the number of effective components. Based on the observed random sample x_{1:n} = {x_1, ..., x_n} we simulate the full posterior for Θ = {μ_{1:J}, Σ_{1:J}, V_{1:J−1}} and latent variables (a, z_{1:n}), where z_{1:n} = {z_1, ..., z_n} is the set of latent configuration indicators, viz. z_i = j if, and only if, x_i comes from normal component j. The standard blocked Gibbs sampler (Ishwaran and James, 2001; Ji et al., 2009) for this model is effective, widely used and implemented in efficient serial and parallel code (Wang et al., 2010).
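For illustration, the truncated stick-breaking construction of the weights π_1, ..., π_J can be sketched as follows (a minimal NumPy sketch; the function name and interface are ours, not from the paper's code):

```python
import numpy as np

def stick_breaking_weights(a, J, rng):
    """Draw mixture weights pi_1..pi_J under a truncated stick-breaking prior.

    V_j ~ Beta(1, a) for j < J; setting V_J = 1 enforces the truncation so
    that the J weights sum exactly to one.
    """
    V = np.append(rng.beta(1.0, a, size=J - 1), 1.0)
    # remaining[j] = prod_{r<j} (1 - V_r), the stick length left at step j
    remaining = np.cumprod(np.append(1.0, 1.0 - V[:-1]))
    return V * remaining

rng = np.random.default_rng(0)
pi = stick_breaking_weights(a=1.0, J=20, rng=rng)
```

With moderate a, most of the prior weight concentrates on the first few components, which is the source of the model's automatic parsimony.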

Component relabelling in MCMC analysis of mixtures
To resolve the well-known component label switching problem (e.g. West, 1997; Stephens, 2000; Yao and Lindsay, 2009), our Gibbs sampler imposes a per-iterate relabelling based on the efficient and effective method of Cron and West (2011). The code used for the analysis incorporates this as a default, and the essential idea is summarized here. At each Gibbs iterate, let M be the n × J binary classification matrix with elements M_ij = 1 when j = argmax_{r=1:J} π_r(x_i), based on current values of the component posterior classification probabilities π_r(x_i) = π_r f_r(x_i)/g(x_i) under equation (1). Let M_0 denote a reference classification matrix obtained this way but using parameters Θ taken as highest posterior modes from a Bayesian expectation-maximization based search. At the current simulation iterate, relabel components as follows.
Beginning with column 1 of M_0, find the column r_1 of M that best matches it; delete column 1 from M_0 and column r_1 from M. Repeat to match column 2 of the original M_0 with a column r_2 of the original M. Continue in this way to define the complete assignment of columns [r_1, ..., r_J], and use this to reorder π_{1:J}, μ_{1:J}, Σ_{1:J} and to reassign the z_i accordingly.
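A minimal sketch of this greedy column matching follows, taking "best match" to mean maximal column agreement (a simple stand-in; the matching criterion of Cron and West (2011) may differ in detail):

```python
import numpy as np

def greedy_relabel(M0, M):
    """Greedy column matching between reference M0 and current M.

    Both are n x J binary classification matrices; returns the column
    assignment [r_1, ..., r_J] aligning columns of M to those of M0,
    scoring candidate pairs by the number of agreeing entries.
    """
    J = M0.shape[1]
    available = list(range(J))  # columns of M not yet assigned
    perm = []
    for j in range(J):  # walk columns of M0 in order
        agree = [(M0[:, j] == M[:, r]).sum() for r in available]
        perm.append(available.pop(int(np.argmax(agree))))
    return perm
```

Applying the returned assignment to reorder the weights, means, variances and configuration indicators completes the per-iterate relabelling.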

Bayesian EM algorithm in Dirichlet process mixtures
Numerical search to identify modes of the posterior p(Θ|x_{1:n}) uses a new expectation-maximization procedure for truncated Dirichlet process mixture models, as follows. This extends the standard method of treating the latent variables (a, z_{1:n}) as missing data, iterating over t = 0, 1, ... from starting parameter values Θ^(0). At iterate t + 1: E-step: Define Q(Θ|Θ^(t)) = E[log{p(Θ, z_{1:n}, a|x_{1:n})} | Θ^(t), x_{1:n}]. For given parameters Θ, denote the posterior classification probabilities by π_ij = π_j(x_i); evaluating the expectation at Θ^(t) expresses Q(Θ|Θ^(t)) in terms of these π_ij, with index j running from j = 1, ..., J except as noted for the V_j terms. M-step: Maximize Q(Θ|Θ^(t)) over Θ to obtain Θ^(t+1). A key practical point to note is that an identified posterior mode will typically identify fewer than the maximum specified number of components, so providing an automatic indicator of the effective number of components from a mode search. This arises when the M-step optimization over the V_j yields V_j = 1 for j ≥ J′, for some J′ < J.
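The E- and M-steps above have familiar closed forms in the plain maximum-likelihood case; the sketch below shows one such EM iterate for a Gaussian mixture, omitting the prior contributions that the Bayesian EM adds to the M-step (an illustrative simplification, not the paper's algorithm):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma):
    """One EM iterate for a J-component Gaussian mixture.

    Maximum-likelihood version for illustration: the paper's Bayesian EM
    augments the M-step objective with log-prior terms for (pi, mu, Sigma).
    """
    n, _ = X.shape
    J = len(pi)
    # E-step: posterior classification probabilities pi_ij
    dens = np.column_stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                            for j in range(J)])
    P = dens / dens.sum(axis=1, keepdims=True)
    # M-step: closed-form weighted updates
    Nj = P.sum(axis=0)
    pi_new = Nj / n
    mu_new = (P.T @ X) / Nj[:, None]
    Sigma_new = []
    for j in range(J):
        D = X - mu_new[j]
        Sigma_new.append((P[:, j, None] * D).T @ D / Nj[j])
    return pi_new, mu_new, Sigma_new
```

Iterating `em_step` from several starting values and keeping the highest-scoring fixed point mimics the multi-start mode search used in the paper.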

Non-Gaussian component mixtures via aggregating normals
Given a set of parameters for the Gaussian mixture of equation (1), whether posterior mode estimates or a sample from the posterior, we follow previous work (Chan et al., 2008; Finak et al., 2009) in defining subpopulations by aggregating proximate normal components. That is, we identify C ≤ J subpopulations with index sets I_c containing the component indices j for each subtype c = 1:C. Then α_c = Σ_{j∈I_c} π_j and f_c(x) = α_c^{-1} Σ_{j∈I_c} π_j N(x|μ_j, Σ_j). Grouping components into clusters can be done by associating each of the normal components with the closest mode of g(x). By running an efficient modal search beginning at each of the μ_j we can swiftly identify the set of modes in g(x), together with the indicators of which mode each normal component is attracted to. The number of modes so identified is C, taken as the realized number of subpopulations in the mixture.
Efficient numerical optimization uses the mode trace function for Gaussian mixtures. Define precision matrices Ω_j = Σ_j^{-1}. The mode search starts with iteration index i = 0 and a point x_0 in data space and then, for i = 1, 2, ..., iteratively computes

x_{i+1} = { Σ_{j=1:J} q_j(x_i) Ω_j }^{-1} Σ_{j=1:J} q_j(x_i) Ω_j μ_j,  where q_j(x) = π_j N(x|μ_j, Σ_j).

This is a convergent local mode search that is broadly useful to quickly identify modes, antimodes and ridge lines between them in the contours of Gaussian mixtures, and typically takes just a few iterates. The second derivative of g(x) evaluated at any identified stationary point then identifies it as a mode or antimode. Rather than being interested in all modes of g(x), we are here only interested in those that define basins of attraction for the mixture components, in order to find the sets I_c of component indicators related to different modes. Hence we run this numerical search J times, initializing at x_0 = μ_j, j = 1:J in turn, and record the unique modes so identified as well as the sets I_c of Gaussian components attracted to each in this search.

Approximations to α_{c+} and α_{c−}

As a simple but illuminating example, take p = 1:10 in the two-component setting of Figure 1; the figure plots the resulting values for c = 1. We also show Monte Carlo estimates of the expected posterior classification probabilities α_{c+}, α_{c−}, computed by importance sampling using 10,000 draws from a Cauchy importance sampling distribution.

Figure 1. Example with p ranging from 1 to 10, C = 2, α_c = 0.5 for each c = 1, 2, and where f_1(x) = N(x|0_p, I) and f_2(x) = N(x|m1_p, I). In this special case, ∆_c = ∆_{−c} = exp(−p m²/4) and τ_{c−} = 1 − τ_{c+} with τ_{c+} = 1/(1 + exp(−p m²/4)) for each c = 1, 2.
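The mode trace iteration for Gaussian mixtures can be sketched as a simple fixed-point solver (an illustrative implementation; function name, tolerance and iteration cap are ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mode_search(x0, pi, mu, Omega, tol=1e-8, max_iter=200):
    """Fixed-point local mode search for a Gaussian mixture.

    Omega_j are the component precision matrices; each step solves
    { sum_j q_j(x) Omega_j } x_new = sum_j q_j(x) Omega_j mu_j,
    with q_j(x) = pi_j N(x | mu_j, Sigma_j).
    """
    Sigma = [np.linalg.inv(O) for O in Omega]
    x = np.asarray(x0, dtype=float)
    J = len(pi)
    for _ in range(max_iter):
        q = np.array([pi[j] * multivariate_normal.pdf(x, mu[j], Sigma[j])
                      for j in range(J)])
        A = sum(q[j] * Omega[j] for j in range(J))
        b = sum(q[j] * Omega[j] @ mu[j] for j in range(J))
        x_new = np.linalg.solve(A, b)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```

Running the search from each μ_j and grouping together the components whose searches converge to the same fixed point yields the index sets I_c and the count C of subpopulations.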

Estimation of discriminative measures
If all mixture components of interest are normal, the effective number of components is C and each of the component densities f c (x) is normal. Then each DIME measure ∆ * is easily computable based on given mean vectors and variance matrices for components, and hence we can easily compute discriminative probabilities given the α c parameters. Posterior analysis using MCMC methods (see supplementary material and references) generates posterior samples of all model parameters, including C, and so we can directly compute the corresponding posterior samples for all discriminative measures of interest. Summary estimates of the ∆ * , τ * and A * quantities are based on approximate posterior means from these MCMC outputs in our examples below.
It is also of interest to consider plug-in estimates of the parameters based on posterior modes for the mixture model parameters. The supplementary material also details a new Bayesian expectation-maximization algorithm for finding posterior modes in truncated Dirichlet process mixtures of normals. This is an efficient numerical search strategy that can be run from multiple starting values to determine posterior modes. We use this to compare the resulting plug-in estimates of discriminative quantities with their MCMC-based posteriors and approximate posterior means; the Bayesian EM algorithm is also useful for quickly generating initial parameter values as starting values for the standard MCMC. Both MCMC and the Bayesian EM algorithm are particularly effective for problems with larger dimensions, numbers of components and sample sizes when exploiting the parallelization opportunities of GPU implementations (Wang et al., 2010).
In many practical contexts, relevant component densities f_c may have quite non-Gaussian forms. In the flow cytometry study, biologically relevant components are assumed to represent distributions of cell surface markers on specific cellular subtypes, and these can exhibit markedly non-normal forms. Here we use the emerging standard strategy of assuming each f_c(x) is itself a mixture of multivariate normals, i.e., g(x) is a mixture of mixtures (Cao and West, 1996; Chan et al., 2008; Frelinger et al., 2010; Finak et al., 2009). This is operationally defined by setting g(x) to be a mixture of multivariate normals with a large number of components, again utilizing the inherent parsimony of the Bayesian truncated Dirichlet process mixture to automatically cut back to the smaller number of components deemed relevant by the data. Then, given any set of components and their parameters, we use the modal aggregation strategy (e.g. Chan et al., 2008; Finak et al., 2009) to identify C subsets of the normal densities and take each f_c(x) as the implied conditional mixture of normals of one of these subsets; see Supplementary section 4. Whether using plug-in estimates from posterior modes or repeat evaluations using MCMC outputs, the ∆_* quantities are then easily evaluated since the underlying concordance measures δ_{a,b} between two mixtures of multivariate normals f_a(x), f_b(x) with given parameters are analytically available.
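For concreteness, a sketch of such an analytic evaluation is given below, assuming the concordance takes the normalized L2 inner-product form δ_{a,b} = ⟨f_a, f_b⟩/√(⟨f_a, f_a⟩⟨f_b, f_b⟩), which is consistent with the closed-form value exp(−p m²/4) in the Figure 1 special case; the paper's exact definition may differ in detail:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gauss_overlap(m1, S1, m2, S2):
    """Analytic integral of N(x|m1,S1) N(x|m2,S2) dx = N(m1 | m2, S1 + S2)."""
    return multivariate_normal.pdf(m1, mean=m2, cov=S1 + S2)

def concordance(w_a, mu_a, S_a, w_b, mu_b, S_b):
    """Normalized concordance between two Gaussian mixtures f_a and f_b.

    Each inner product <f, h> expands as a double sum of pairwise Gaussian
    overlap integrals, so the measure is available in closed form.
    """
    def inner(wx, mx, Sx, wy, my, Sy):
        return sum(wi * wj * gauss_overlap(mi, Si, mj, Sj)
                   for wi, mi, Si in zip(wx, mx, Sx)
                   for wj, mj, Sj in zip(wy, my, Sy))
    ab = inner(w_a, mu_a, S_a, w_b, mu_b, S_b)
    aa = inner(w_a, mu_a, S_a, w_a, mu_a, S_a)
    bb = inner(w_b, mu_b, S_b, w_b, mu_b, S_b)
    return ab / np.sqrt(aa * bb)
```

In the single-component special case with unit variances and means 0_p and m1_p, this reduces to exp(−p m²/4), matching the expression quoted for Figure 1.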

Forward Search Algorithm
A simple forward search over subsets operates as follows. This applies separately, and in parallel, to each chosen component c of interest. Starting at k = 0 with an initial empty variable subset h ≡ h_{c0} = ∅, move over a series of iterates, at each stage updating the variable subset. Suppose that at iterate k ≥ 0 we have a current subset of variables h_{ck} ⊆ {1:p}. Then at step k + 1:

1. For each j ∉ h_{ck}, compute A_c(j, h_{ck});
2. Identify j*_{ck} = argmax_{j ∉ h_{ck}} A_c(j, h_{ck});
3. Update k to k + 1 and the current variable subset to (j*_{ck}, h_{ck});
4. Continue to the next iterate, or stop.
We might simply continue this process until all variables are selected, or stop at step 4 if the increase in A_c(*) is below a chosen threshold and/or if A_c(*) exceeds a specified high probability. We can also address potential masking issues by modifications that branch into multiple subsets of selected variables, considering two or more alternative additional variables at step 3.
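The forward search with a threshold stopping rule can be sketched as follows, with `accuracy` an abstract callable standing in for the evaluation of A_c(h) (a hypothetical interface, for illustration only):

```python
def forward_search(p, accuracy, threshold=0.01):
    """Greedy forward selection of a discriminative variable subset.

    `accuracy(h)` returns A_c(h) for a variable subset h (tuple of indices
    in 0..p-1).  Stops when the best single-variable addition improves
    A_c by less than `threshold`.
    """
    h = ()
    current = accuracy(h)
    while len(h) < p:
        candidates = [j for j in range(p) if j not in h]
        scores = {j: accuracy(h + (j,)) for j in candidates}
        j_star = max(scores, key=scores.get)
        if scores[j_star] - current < threshold:
            break  # stopping rule at step 4
        h, current = h + (j_star,), scores[j_star]
    return h, current
```

A branching variant would recurse on the top two or more candidates at the argmax step rather than committing to a single j*.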

Two synthetic examples

A simple example
This simple example involves a sample of size n = 5,000 drawn from a p = 3-dimensional mixture of C = 2 normal distributions. Only the first 2 variables carry primary discriminative information. The model and analysis allow up to 9 components using default, relatively vague priors in the truncated Dirichlet process mixture. The Bayesian expectation-maximization algorithm (Supplementary section 3) was run repeatedly from many random starting points. The posterior mode identified the correct number of components and parameters perfectly consistent with the known, true values underlying the synthetic data generation. MCMC analysis, as in Supplementary sections 1 and 2, was initialized at the posterior mode identified by the Bayesian EM analysis, and run to generate posterior simulations of size 10,000 following additional burn-in iterates.
Supplementary Example 1. In this example, variables 1 and 2 together discriminate the 2 normal components while variable 3 is redundant. A data scatter plot appears in Supplementary Figure 2.
Supplementary Table 1 displays MCMC-based posterior means of discriminative measures. This clearly shows that τ_{c+}(h) and τ_{c−}(h) correctly identify the first 2 variables as highly discriminative and the 3rd variable as redundant. Note that τ_{c+} is close to 1 for h = (1, 2) and less than 0.9 for other subsets of just 1 or 2 variables; similarly, τ_{c−} is close to 0 for h = (1, 2) and greater than 0.1 for other subsets of just 1 or 2 variables; adding variable 3 to (1, 2) makes no practical change. In addition, the data were generated such that the difference between the two normal mean vectors (μ_1 and μ_2 for the two normal components) for variable 1, |μ_1(1) − μ_2(1)|, is slightly larger than the difference for variable 2, |μ_1(2) − μ_2(2)|, the variances being the same for variables 1 and 2 within each normal component; hence Supplementary Table 1 reflects variable 1 as marginally more discriminative than variable 2. In all cases, plug-in values based on the posterior modal parameters identified via the Bayesian EM search are very close to the MCMC-based posterior means for the DIME, τ_{c+}(h) and τ_{c−}(h) measures. In this example, all differences are well below 0.01 for the discriminative probabilities, and most are much smaller. This is found in other examples too, but as the mixture model dimension and complexity increase, we have found much more divergence between MCMC-based posterior means and plug-in values based on identified posterior modes. Hence, as a routine, we use MCMC, utilizing the EM search for initial exploratory analysis and for initializing the MCMC.

A more challenging mixture and comparison with related approaches
This more challenging mixture example compares the analysis with the ridgeline-based separability measure (RSM) of Lee and Li (2012). Logistically, we follow the forward selection strategy and recommendations of Lee and Li (2012), evaluating variables to add to a current discriminatory subset if the increase in RSM exceeds 0.01 at each step, and stopping otherwise. MCMC analyses were initialized at the Bayesian EM-based posterior modes, and we generated posterior simulations of size 10,000 following additional burn-in iterates. For most direct comparison, RSM measures were evaluated using mixture model parameters estimated by MCMC-based posterior means.

Supplementary Example 2.
This proof-of-principle example, where the DIME and RSM approaches agree in identifying a single subset of discriminative variables for each of the components, follows Lee and Li (2012) in simulating 6,000 observations from the following 8-dimensional distribution: the first two dimensions are generated according to (1/3)N((3, 9), I) + (1/3)N((5, 6), I) + (1/3)Unif([0, 8] × [4, 12]), where the uniform distribution serves to weaken the separation between the two primary normal components. The other six dimensions are non-informative, the variables being independent standard normals. Supplementary Figure 4 displays pairwise scatter plots of standardized data.
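The simulation design just described can be reproduced with a short script (a sketch; the seed and function name are ours):

```python
import numpy as np

def simulate_example2(n=6000, rng=None):
    """Simulate Supplementary Example 2: two informative dimensions from a
    2-component normal mixture plus uniform noise, and six pure-noise
    standard normal dimensions."""
    if rng is None:
        rng = np.random.default_rng(0)
    u = rng.random(n)
    x12 = np.empty((n, 2))
    m1, m2, m3 = u < 1/3, (u >= 1/3) & (u < 2/3), u >= 2/3
    x12[m1] = rng.normal((3.0, 9.0), 1.0, size=(m1.sum(), 2))
    x12[m2] = rng.normal((5.0, 6.0), 1.0, size=(m2.sum(), 2))
    x12[m3] = np.column_stack([rng.uniform(0.0, 8.0, m3.sum()),
                               rng.uniform(4.0, 12.0, m3.sum())])
    noise = rng.normal(size=(n, 6))  # six non-informative dimensions
    return np.hstack([x12, noise])
```

Only the first two columns carry discriminative structure; the remaining six should, and in the analysis do, contribute nothing but noise.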
The model allowed up to 16 components and analysis used default, relatively vague priors. MCMC-based posterior outputs identify two main modes of concentration following aggregation of normal components (Supplementary section 4), with additional structure representing noise. More specifically, given the fitted 16-component mixture model, we identify modes by clustering the normal components into groups; this assigns each of the 16 components of the mixture to the closest mode of the fitted mixture model. Investigation of MCMC outputs (not shown) clearly shows that the two components concentrate in regions consistent with those of the underlying distribution that generated the synthetic data. Discriminative analysis summaries are shown in Supplementary Table 2. Applying our stopping rule requiring a change of at least 0.01 on the classification probability scale for the accuracy measure A_c(h) yields discriminative variable subsets h = (1, 2) for each of the two components, correctly identifying the structure underlying the data. This agrees with the RSM result that identifies (1, 2) for both components simultaneously. This example highlights the ability of DIME-based analysis to perform as well as the existing method, while providing additional information: the table shows an average probability classification rate of about 92% for each component using variables (1, 2), that variable 2 alone (or, in fact, variable 1 alone) would yield around 80-82% accuracy, and quite evidently the other variables only add noise to the discrimination.

Supplementary Figure 4. Pairwise scatter plots of a randomly selected subset of the n = 6,000 observations in Supplementary Example 2.

Supplementary Table 2. Accuracy A_c(h), (c = 1, 2), in Supplementary Example 2. Variables are ordered according to the forward search based on accuracy A_c(*) for the DIME-based analysis, to compare with the order and variables identified using the RSM approach.
For each component, † indicates the last variable whose addition increases the accuracy probability by at least 0.01. RSM variables and values are defined as in Lee and Li (2012), computed at the posterior means of model parameters; underlining indicates the index of the last variable entered into the discriminative set using RSM.